Building an AI-powered juggling trainer in one afternoon

Ever since I picked up my first set of bean bags as a kid, juggling has been a hobby that has stayed with me over the years. In my later teens and during my time at university, one of my part-time jobs was teaching juggling. I worked at a local youth club, at events and fairs, and had the chance to teach juggling to many people, from kids of just 4 or 5 years old to seniors in their late 70s.

A photo of a person juggling in the middle of a tea plantation

Juggling? Works everywhere. This is me, at a tea plantation on the Azores in 2018. How long till AI can do the same?

Fast forward to today. For the last few years, I have been working in the field of AI, building computer vision systems with my team that understand human motion and assist people in learning how to move correctly (e.g. with fitness exercises in our latest product).

Doesn’t this sound like something I should combine with my long-time hobby? While every person learns differently and at their own pace, I think juggling is a great skill to learn on your own with the assistance of an AI. In my experience, most people learning to juggle struggle with the same common obstacles as they progress, which makes it a perfect example to put into an application.

The idea

Here’s the idea: You pick up a set of juggling balls and position yourself in front of your webcam. Step by step, you progress through basic juggling moves as software analyzes the live video and provides feedback: Is your juggling pattern stable? Should you throw higher or lower? Are your hands positioned correctly? Is your rhythm steady?

With this in mind, I sat down one weekend this winter to build an AI-powered juggling teacher. In this post, I’ll show you how I did it.

Understanding what’s happening inside a video

To analyze the video of a person learning how to juggle, we’ll train a neural network (also “neural net” or “model”). If you are not familiar with neural networks, don’t be intimidated: It’s a concept that sounds fancy and comes from the field of Artificial Intelligence, but ultimately you can imagine it as a function, or a simple black box: we feed in a video clip and it returns some information about that video as its output.

A simple animation that explains the idea: A video stream is encoded by its pixel values and passed through a neural network. The network digests the visual information to produce a classification decision: What action is happening in the video?

We’ll set up our neural network to classify a given video clip: Given a video, which visual class does it belong to? A class in our case is the name of an action that is happening in the video, like “throwing 1 ball and dropping it”. In our application, we’ll use that visual class to give appropriate feedback to the user.
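To make the black-box idea concrete, here is a toy sketch of the input/output contract. The class names and the random “model” below are placeholders for illustration, not the real network:

```python
import numpy as np

# A small, made-up subset of the class catalog, just for illustration
CLASSES = [
    "throwing_1_ball_and_dropping_it",
    "3_balls_good_pattern",
    "throwing_too_high",
]

def classify_clip(clip: np.ndarray) -> str:
    """Toy stand-in for the neural network.

    Takes a clip as an array of shape (num_frames, height, width, 3) and
    returns a class name. A real model would compute class probabilities
    from the pixel values; here we only illustrate the input/output contract.
    """
    assert clip.ndim == 4 and clip.shape[-1] == 3
    scores = np.random.rand(len(CLASSES))  # stand-in for the model's output
    return CLASSES[int(np.argmax(scores))]

# A dummy 3-second clip at 16 fps with 96x96 RGB frames
clip = np.zeros((48, 96, 96, 3), dtype=np.uint8)
print(classify_clip(clip))
```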

How to train the neural network

But how does the neural network know what to do? How does it know the difference between correctly tossing a ball versus dropping a ball? Well, it has to learn it first, which means that we need to train it.

Training a neural network means presenting it with example video clips of all the visual classes it should be able to recognize. Initially, the neural net doesn’t know much. It simply guesses what’s inside the video. If a guess is incorrect, we adjust the internal parameters of the function (i.e. of the neural network) so that the network improves based on the error it just made. We do this over and over again with all the videos we’ve prepared for training until the network doesn’t get any better. At that point, we stop training and move on to build the application around it. But first, we need to prepare some video data for the training process.
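If you’re curious what “adjusting the parameters based on the error” looks like in code, here is a minimal sketch of one training pass, assuming a generic PyTorch-style setup. This is not the code I actually used; the SDK described further down handles training for you:

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer):
    """One pass over the training clips: guess, measure the error, adjust."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for clips, labels in loader:          # clips: (batch, channels, frames, height, width)
        optimizer.zero_grad()
        logits = model(clips)             # the network's "guess" for each clip
        loss = criterion(logits, labels)  # how wrong the guesses are
        loss.backward()                   # compute how each parameter should change
        optimizer.step()                  # nudge the parameters to reduce the error
```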

Data collection

To train the neural net, we need a training dataset — that is a collection of video clips, each belonging to one distinct visual class we want the net to be able to recognize later. For the juggling use-case, I wanted the network to recognize the following:

2 balls: A single repetition
3 balls: A good pattern, continuously
Throwing too high
Throwing 2 balls at the same time
Continuous juggling, but not at a steady rhythm
Entering the webcam view
Pretending to juggle, no objects used

All in all, the class catalog contains 27 different classes. I recorded 545 video clips, each 3 seconds long, which took me around 1 hour. 70 videos went into a hold-out validation set, so I ended up using 475 videos to train the network. Is this enough data? We’ll discuss this in a bit. First, let’s have a look at the actual neural network.
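To illustrate, a dataset like this can be organized as one folder per class and split into training and validation clips with a few lines of Python. The layout and the split script below are just one possible convention, not something prescribed by the tooling:

```python
import random
from pathlib import Path

# Assumed layout, one folder per class (my own convention for this sketch):
#   dataset/
#     3_balls_good_pattern/clip_001.mp4
#     throwing_too_high/clip_001.mp4
#     ...
DATASET_DIR = Path("dataset")
VALIDATION_FRACTION = 0.13  # roughly 70 of the 545 clips

random.seed(0)
train, valid = [], []
for class_dir in sorted(DATASET_DIR.iterdir()):
    if not class_dir.is_dir():
        continue
    clips = sorted(class_dir.glob("*.mp4"))
    random.shuffle(clips)
    n_valid = max(1, round(len(clips) * VALIDATION_FRACTION))
    valid += [(clip, class_dir.name) for clip in clips[:n_valid]]
    train += [(clip, class_dir.name) for clip in clips[n_valid:]]

print(f"{len(train)} training clips, {len(valid)} validation clips")
```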

The neural network

Neural networks come in all kinds of flavors. For the juggling project, we want a network that can process a video stream, digest its visual characteristics to produce a classification output, and be compact enough to run in real-time.

I got all of this out-of-the-box by using the SDK we are developing and currently open-sourcing at Twenty Billion Neurons: SenseKit, an open-source project (work in progress) that makes it easy to train a video classifier without needing millions of videos.

The architecture is a MobileNet-style neural network. Models of this family are popular for computer vision applications because they are designed for visual data while remaining compact enough to run in real time on many devices, even smartphones. Using 3D convolutions instead of 2D convolutions lets the network extract features not only from individual frames but also from the motion between them.
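To make that difference concrete, here is a small sketch of a MobileNet-flavored building block lifted to 3D, written in PyTorch. It illustrates the idea of a compact 3D-convolutional feature extractor; it is not the actual architecture shipped with SenseKit:

```python
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """A MobileNet-flavored block in 3D: a depthwise 3D convolution that mixes
    information across time and space, followed by a cheap pointwise
    convolution that mixes channels."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv3d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                # x: (batch, channels, frames, height, width)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

block = SeparableConv3d(3, 32)
video = torch.randn(1, 3, 16, 96, 96)    # 16 frames of 96x96 RGB
print(block(video).shape)                # torch.Size([1, 32, 16, 96, 96])
```

Stacking a dozen or so of these blocks (with strides to shrink the resolution) gives a video network small enough for real-time use.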

These “deep” neural networks (i.e. many layers of feature extractors) require a lot of data to learn useful features. One trick to get away with less data is called transfer learning: We don’t train the network from scratch. Instead, we take an already trained version and only slightly re-train it for our specific juggling task. In fact, the SenseKit version of the network comes with a pre-trained model. This means that my relatively small set of juggling videos is enough to teach the network about juggling and the different kinds of juggling mistakes we want the application to react to.
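In code, the transfer-learning recipe usually boils down to keeping the pre-trained feature extractor and training a fresh classification head for the new classes. Here is a generic PyTorch-style sketch; the attribute name `classifier` is an assumption, and the real SenseKit model may be structured differently:

```python
import torch.nn as nn

def prepare_for_fine_tuning(pretrained_model, num_classes=27):
    """Generic transfer-learning recipe: freeze the pre-trained feature
    extractor and train only a new classification head for the new classes.

    Assumes the model exposes a final linear layer called `classifier`;
    the actual model may look different.
    """
    for param in pretrained_model.parameters():
        param.requires_grad = False                                   # freeze everything
    in_features = pretrained_model.classifier.in_features
    pretrained_model.classifier = nn.Linear(in_features, num_classes)  # new, trainable head
    return pretrained_model
```

Whether you freeze everything or gently fine-tune the whole network is a tuning choice; freezing only the backbone is the cheapest variant.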

Typically, training a video classification network requires thousands, if not millions of videos. With that in mind, it’s quite impressive that I could teach the network a completely new set of activities with just a few hundred videos. In addition, not training from scratch gives us a huge speedup: Training the juggling net took less than 10 minutes on a GPU machine (NVIDIA GeForce GTX 1080 Ti). For comparison, such big networks can often take days to train from start to finish.

The juggling trainer in action

Having trained the network, I built a small juggling trainer application in Python that takes care of the following:

Based on the juggling information I can extract from the recognized class name, the interface displays the following information:

This is what it looks like in action:

A short video of what the juggling teacher (more precisely: the neural network) looks like in action
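If you’re wondering how such a live feedback loop can be wired up, here is a rough sketch using OpenCV: keep a sliding buffer of the most recent webcam frames, classify it, and overlay a message on the video. The `classify_clip` helper, the class names, and the feedback strings are placeholders, not the actual application code:

```python
from collections import deque

import cv2  # OpenCV for webcam capture and display

# Placeholder mapping from recognized class names to on-screen feedback
FEEDBACK = {
    "throwing_too_high": "Throw a little lower",
    "3_balls_good_pattern": "Nice and steady - keep going!",
}

def run_trainer(classify_clip, clip_length=48):
    """Capture webcam frames, keep the most recent clip in a buffer,
    classify it, and overlay feedback on the live image."""
    frames = deque(maxlen=clip_length)
    capture = cv2.VideoCapture(0)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
        message = ""
        if len(frames) == clip_length:             # enough frames for one clip
            label = classify_clip(list(frames))    # the neural net's prediction
            message = FEEDBACK.get(label, label)
        cv2.putText(frame, message, (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("Juggling trainer", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):      # press q to quit
            break
    capture.release()
    cv2.destroyAllWindows()
```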

Limitations

A glance at the past

The idea of combining juggling and computer vision isn’t new, of course. Not to the world (check YouTube), and not to me either. Back at university (think 2014), two friends and I used the Kinect depth sensor to look at juggling patterns. It took us a few weeks and some failed attempts to produce a demo, held together by some carefully tuned thresholds. It was fun and we were able to produce some entertaining visualizations, but the demo was prone to misclassifications. Actually reacting to a person’s juggling pattern wasn’t feasible with our solution back then.

An alternative approach from 2014: Using the Kinect depth sensor to localize hands and balls and use the coordinates for some fun visualizations. Limitations: No understanding of good or bad juggling patterns, no recognition of juggling mistakes.

Conclusion: A lot is possible in one afternoon

Throwing together a few videos and fine-tuning a neural network: it’s amazing to see and experience how much is possible with the tooling that’s available in 2021. Yes, so far I’ve only built a prototype of a demo, but the goal of building a real juggling trainer powered by computer vision isn’t out of reach. Looking back at my early attempts with the Kinect six years ago and comparing them to this recent one, it’s almost unreal that the same can now be achieved in just one afternoon of work. I don’t know if I’ll push the project further than this, but it sure was a lot of fun.

If you have an idea for a similar computer vision project, I recommend you follow the progress of SenseKit. It comes with some built-in demos and provides everything you need to train your own video classification network similar to my juggling project.