Welcome to casualcoding.com, the new location of my blog.
If you’re one of the few who still use RSS feeds, make sure to update your feed URL to https://casualcoding.com/feed. The old URLs will redirect automatically, but better safe than sorry.
For a few years now, the team from fast.ai has been providing free education about deep learning on their website. Their video course promises a hands-on approach that aims to de-mystify the technologies of modern deep learning. With the book “Deep Learning for Coders with fastai and PyTorch”, they bring these education principles to the written format, either as a printed book from O’Reilly or on GitHub (for free).
Before I talk about the book, some context: fast.ai is the name of a website with a video course of the same name. The course is taught using a Python library (called fastai, no dot) which is built on top of PyTorch, the popular deep learning framework. Nomenclature can be confusing. I’ll try to be specific and reference “the fast.ai team” or “the fastai library” in this review of the book.
The authors are very vocal about their teaching principles: The goal is to “teach the whole game” while skipping the often demotivating mathematical principles at the beginning.
Instead, the first example gives you all instructions needed to train a state-of-the-art image classification model from scratch.
Then, the book progresses deeper into the technical and mathematical foundations, which they use to build up a (simple) version of their fastai library from scratch.
I’m torn: While this structure lowers the barrier of entry, it also makes for a repetitive experience: You encounter the same example many times, just at different levels of abstraction.
The book covers a wide range of deep learning topics at different levels of depth.
You see practical examples across the main application areas of deep learning: Computer vision, natural language processing, tabular modeling, and collaborative filtering.
All examples are presented with full code listings and everything in the book invites you to go and try things out yourself.
The book presents a wide collection of deep learning techniques that help to get training runs working properly in practice.
Popular deep learning architectures are explained, including ResNet, LSTMs, and U-Nets. With a mix of code, visualization, and (some) maths, the authors do a good job of conveying the core ideas of important architectures.
The authors don’t stop at the technical explanations but stress that it’s important to think further. Deep learning is a powerful tool and one that should be used responsibly. Yes, the technical implementor has in fact a responsibility to consider fairness criteria and ask the question “should we even do this at all?”
The book is packed with code examples. I personally learn best when implementing something by hand and seeing how an abstract idea translates to actual source code, so this really matched my learning style.
The language of the text is also very easy to digest. You can tell that the fast.ai team wants to teach a little differently and is genuinely excited about the topic. The text is mixed with personal anecdotes and examples of Twitter conversations to create a sense of community around the otherwise technical topic.
The collection of the latest deep learning techniques and condensed experience is immensely valuable: You learn what a proper training process looks like, which techniques you can use to improve the training, and how to investigate when the training is not behaving nicely.
My biggest gripe with the fast.ai material is their Python coding style: Everything has to be an abbreviation, apparently. I don’t know why you call a parameter ni when it could just as well be called num_inputs. If the goal is to “reduce jargon”, using explicit naming in the code would be part of that, if you ask me.
Secondly, the teaching principle of “top-down, then bottom-up” has its quirks: You repeat the same example over and over again, just on different levels of abstraction. When I want to look up “the chapter on convolutional neural networks”, it’s not one chapter I have to browse, but 4 or 5. This may make for a good didactic progression but feels quite repetitive at times.
The name and subtitle of the book capture it quite well: The code-centric approach of learning (and trying out) deep learning lends itself to people who self-identify as “coders”, and not so much to academic scholars who want to have the theory laid out first.
Still, it shows that this book originated in a course. The material will stick if you really follow along and try things for yourself. If you don’t, and you’re completely new to deep learning, it will be hard to map out where in the level of abstractions each chapter is situated.
I actually found the book very helpful for myself, because it helped me understand how to use the latest deep learning techniques such as the learning rate finder, 1-cycle training, label smoothing, and mixup augmentation. Having worked with deep learning for a while, I still learned quite a few new methods and was able to gain a deeper understanding of concepts I had known before.
Overall, I really liked the book. The authors did a great job of covering a wide range of deep learning applications while showing both easy-to-use black box examples and the deepest insides of that black box. This helps to de-mystify the AI hype and teaches helpful hands-on skills.
They share a lot of expert advice on how to set up training procedures properly and I actually agree with their claim: Those who really complete this material have a great starting point working in the field of deep learning.
The didactic style may not be for everyone and I personally hope the fastai coding style doesn’t stick, but I am grateful for the fast.ai team’s contribution: Making deep learning accessible for anyone who is interested.
“Meow” — I’m sorry? “Meow!” — Oh, right! Here you go.
What if I could understand exactly what my cat is trying to tell me? We live in 2021, which is basically the future. How hard can it be?
A group of dedicated researchers from northern Italy has recently released a public dataset of cat vocalizations (let’s call them “meows”). 21 cats from two different breeds were exposed to three different situations while a microphone was listening:
In total, the dataset comprises 440 audio files.
The dataset is not evenly split between those three situations.
Neither is it evenly split between cat breeds or the sex of the cat.
In fact, some cats occur way more often in the recordings than others. I don’t know why. Maybe “CAN01” is just very talkative whereas “NIG01” prefers to keep to himself?
Looking at these distributions is important. When we train a neural network to classify a given voice recording, we want to make sure it performs better than simply guessing the most frequent label.
For example, always guessing “female” when asked for the cat’s sex would be correct in 78% of cases because there are 345 recordings of female cats and only 95 recordings of male cats.
Any classifier that is supposed to be useful has to surpass this baseline of “informed” guessing.
| Feature | Most frequent label | Absolute count | Relative count = baseline accuracy |
|---|---|---|---|
| Situation | isolation | 221 of 440 recordings | 50.2 % |
| Sex | female | 345 of 440 recordings | 78.4 % |
| Breed | european_shorthair | 225 of 440 recordings | 51.1 % |
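The baseline in the table is simply the share of the most frequent label. A minimal sketch of that calculation, using the sex counts quoted above (345 female, 95 male); the function name `majority_baseline` is made up for illustration and is not taken from the project notebook:

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always guessing it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Counts taken from the text: 345 female and 95 male recordings.
sex_labels = ["female"] * 345 + ["male"] * 95
label, accuracy = majority_baseline(sex_labels)
print(f"Always guessing '{label}' is correct for {accuracy:.1%} of recordings")
```

The same function applied to the situation or breed labels would give the other two rows of the table.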
Now we have an idea of what our data distributions look like. In total, there are three interesting tasks we can have a model learn from the data: (1) What situation was the cat in, (2) what is the sex of the cat, and (3) what is the breed of the cat. It will be interesting to see if these tasks can be learned from the data at all. Let’s start preparing our data to train a model.
There are many ways to encode an audio signal before passing it into a neural network. For my project, I am choosing a visual approach: We plot the spectrogram of the audio recordings as an image.
This allows us to use well-established neural networks from the field of computer vision. Also, spectrograms look nice.
Spectrograms are a plot where the location in the image represents a given frequency at a given point in time in the audio file. The brightness of a pixel represents the intensity of the audio signal.
The following example shows one of the recordings as a spectrogram. The time axis goes from top left (zero) to bottom left. The x-axis denotes the frequencies.
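To make the idea concrete, here is a minimal NumPy sketch of how a spectrogram can be computed with a short-time Fourier transform. It is not the code used for the plots in this post; the frame length, the hop size, and the synthetic 1 kHz test tone are assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-power spectrogram with shape (time frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame: each row is a point in time, each column a frequency.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10 * np.log10(power + 1e-10)  # decibel scale; epsilon avoids log(0)

# Synthetic stand-in for a recording: one second of a 1 kHz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
print(spec.shape)  # (time frames, frequency bins)
```

Rendered as a grayscale image, such an array has time running down the rows and frequency along the columns, with brighter pixels where the signal carries more energy.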
Having turned our audio classification task into an image classification task, we can start with our model training. We are going to train three models for three different tasks:
As in most deep learning frameworks, it is easy to re-use popular computer vision architectures in fastai. With one(-ish) line of Python, you have a capable neural network for image classification at your fingertips. It comes pre-trained so that you need fewer images for your task at hand.
create_cnn_model(models.resnet18, n_classes, pretrained=True)
ResNets are a popular neural network architecture from 2015 that introduced residual connections – a mechanism that improves training behavior and allows the training of (very) deep networks.
The catmeows dataset is quite small, so I was satisfied with the smallest ResNet flavor (called ResNet-18). It has “only” 18 layers and it is still oversized for my 440 images.
The ResNet implementation wants to have square images as its input, so I took random square crops from the spectrograms during training. The crops were 81 x 81 pixels in size and could come from different points in time of the recording, but always contained the full frequency range.
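A random crop along the time axis can be sketched as follows. This assumes the spectrogram is a NumPy array with time along the first axis and exactly 81 frequency bins along the second, which would match the 81 x 81 crop size; the function name is hypothetical and the project notebook may do this differently.

```python
import numpy as np

def random_time_crop(spec, size=81, rng=None):
    """Cut `size` consecutive time steps at a random offset, keeping all frequency bins."""
    if rng is None:
        rng = np.random.default_rng()
    n_time = spec.shape[0]
    if n_time < size:
        raise ValueError("recording is shorter than the crop size")
    start = rng.integers(0, n_time - size + 1)  # upper bound is exclusive
    return spec[start : start + size, :]

# Dummy spectrogram: 300 time steps, 81 frequency bins.
spec = np.zeros((300, 81))
crop = random_time_crop(spec, rng=np.random.default_rng(0))
print(crop.shape)  # (81, 81)
```

Because a new offset is drawn every time, the network sees a slightly different slice of each recording in every epoch, which acts as a simple form of data augmentation.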
When training a classifier it is important not to show all of your data to the model during training. You want to hold out some samples for validating the classifier during the training process. That way you get an idea if the model learns the training data by heart or if it actually learns something useful.
Sometimes it is fine to take a random percentage of the dataset as the validation set. In this case, I wanted to separate the cats across train and validation split so that the model can’t cheat by memorizing the characteristics of an individual cat.
I took 4 individual cats out of the training data. Their recordings combined made up 66 samples of the dataset, which means 15% of the data was reserved for validation and only the remaining 85% was used for training.
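Splitting by individual rather than by sample can be sketched like this. The records, the file names, and the choice of held-out cat below are made up for illustration (only the IDs “CAN01” and “NIG01” appear in this post); the actual project code may look different.

```python
def split_by_cat(samples, validation_cats):
    """Put every recording of a held-out cat into the validation set."""
    train = [s for s in samples if s["cat_id"] not in validation_cats]
    valid = [s for s in samples if s["cat_id"] in validation_cats]
    return train, valid

# Toy records; the real dataset has 440 recordings from 21 cats.
samples = [
    {"cat_id": "CAN01", "file": "rec_001.wav"},  # hypothetical file names
    {"cat_id": "CAN01", "file": "rec_002.wav"},
    {"cat_id": "NIG01", "file": "rec_003.wav"},
    {"cat_id": "XYZ01", "file": "rec_004.wav"},  # hypothetical cat ID
]
train, valid = split_by_cat(samples, validation_cats={"NIG01"})

# No cat may appear on both sides of the split.
assert not {s["cat_id"] for s in train} & {s["cat_id"] for s in valid}
print(len(train), len(valid))  # 3 1
```

A purely random split would almost certainly put recordings of the same cat into both sets, which is exactly the kind of leakage this split is meant to avoid.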
For the three different tasks, the 3 models I trained achieved the following accuracy scores.
| Task | Classification accuracy | Guessing baseline (see above) |
|---|---|---|
| Situation | 63.6 % | 50.2 % |
| Sex | 90.9 % | 78.4 % |
| Breed | 93.9 % | 51.1 % |
Across all three tasks, the models performed well above the guessing baseline we determined earlier.
Let’s also take a look at the confusion matrix for each task. A confusion matrix counts, for each true label in the validation set, how often the model predicted each possible label. This shows how many samples were classified correctly and which errors were made.
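For readers who have not seen one before, a confusion matrix is just a table of counts over (actual, predicted) pairs. A minimal sketch in plain Python; the labels and predictions below are invented and are not the results of the models in this post:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["female", "male"]
actual    = ["female", "female", "female", "male", "male"]
predicted = ["female", "female", "male",   "male", "female"]

for name, row in zip(classes, confusion_matrix(actual, predicted, classes)):
    print(f"{name:>6}: {row}")
```

The diagonal holds the correct predictions; everything off the diagonal is a mistake, and its position tells you which class was confused with which.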
First of all, these are quick results. We haven’t built a super AI that understands every single cat in the world. (Yet.)
What these results mostly show are interesting aspects of the dataset: Most of all, I was surprised how well the sex and breed can be told apart by the model. As I made sure to separate individual cats across train and validation data, I do have some confidence that the model didn’t cheat. There may still be some information leakage that I’m not aware of, of course.
This is a small dataset. ResNet-18 is a big network. This mix can cause problems.
In my case, I am using a pre-trained version of ResNet, so the convolutional features don’t have to be learned from scratch. Still, I found myself re-running the training multiple times with varying success. I think with so little data it is still easy for the model to run into a local optimum and overfit on the training data.
Ideas for improvement:
Try freezing different layers and sets of layers of the network. It’s a tiny amount of data, and we wouldn’t want to destroy the pre-trained features by accident. At the same time, spectrograms are not natural images, so fine-tuning probably makes sense.
Some additional data augmentation would surely help to enrich the training data. As these are not natural images but visualizations of an audio signal, I think some augmentation operations make sense (cropping at different points in time, jitter contrast, and brightness to simulate volume fluctuations). Some others are more questionable (perspective transformations, cropping different frequency bands). I haven’t tried them so far, but they could very well improve the results.
To learn more about the data, it would be interesting to extract quantitative audio characteristics and train a logistic regression or random forest on the data. These models are easier to interpret and could help to understand if the models look at something meaningful in the data or if there is some data leakage that allows the models to cheat.
Playing with public datasets is fun! You should try it.
I may continue with this pet project (pet! get it?) or start something fresh with the next dataset that looks interesting.
If you’ve found an issue in my data or training setup, please let me know.
You can find the complete project code in a messy Jupyter notebook on GitHub.
If you are anything like me, “machine learning” to you means working with algorithms that adapt their function from data. And while this is true, it’s not the complete story when actually working in the field of machine learning.
Yes, picking the right algorithm and creating the appropriate model is important. But data cleaning, optimizing for production, and setting up scalable infrastructure are just as much a part of the day-to-day work on the job.
Along comes Machine Learning Design Patterns, a book that looks at common challenges in practical machine learning. It leaves out model architectures on purpose and instead promises to collect best practices to move machine learning to production.
The book’s title plays on the famous 1994 book by the Gang of Four that popularised the concept of design patterns in software engineering. It comes as no surprise that in Machine Learning Design Patterns, the authors attempt something very similar and identify recurring problems from their field.
Each chapter is structured the same way:
Written by three engineers from the Google Cloud AI team, the book covers a breadth of topics:
The early chapters cover topics frequently occurring in the “data science” section of machine learning: What is an embedding, how to work with imbalanced datasets, how to create proper checkpoints during training.
In the later parts, the authors focus more and more on inference and challenges of automation, repeatability and scalability. I now have an idea what a feature store is and how to bridge data schemas when mixing old and new data sources.
The book is full of examples and they use different technologies: TensorFlow examples in Python, a BigQuery listing, a Google Cloud SDK API call.
From the preference of technologies, you notice that the book has been written by Googlers. This doesn’t matter in my mind, because the concepts are always explained clearly, so that porting to other platforms or products should be straightforward.
The book covers a wide range of topics and helped extend my knowledge of machine learning to areas I am not an expert in: Resilient Serving, Reproducibility and MLOps in general.
The structure of design patterns lends itself to keeping this book on the shelf for future reference. Chapters have a clear motivation and are written to the point, so that I can see myself looking up a design pattern in the future.
Aside from technical topics, the authors also include three chapters about responsible AI and a (brilliant) section about the ML Life Cycle and the AI Readiness of organisations.
I have a few complaints, though.
In places it becomes clear that the idea of extracting design patterns from machine learning approaches works well for some topics, but becomes a bit of a stretch for others. I personally didn’t mind this too much, but it’s not as elegant as the title of the book suggests.
What I did mind was the fact that this first edition is quite riddled with errors: From figures containing incorrect numbers that don’t align with the text (just annoying) to an explanation of convolution layers that confused convolution with pooling, I think (potentially misleading).
And a final nitpick: The printed copy I ordered was a monochrome version with low contrast. Many figures were completely indecipherable. A bit disappointing for an O’Reilly book upwards of 40€.
In conclusion, Machine Learning Design Patterns gives a great overview of common problems you encounter when designing, building and deploying machine learning algorithms.
It will offer valuable content for many in the industry: Data scientists who have never deployed a cloud pipeline, ops experts who are curious about “MLOps” and the product person who wants to understand the constraints and possibilities of modern machine learning development.
My favourite chapter was actually a non-technical one: How to move a team and a whole company from running first ML experiments to becoming an ML-first organisation. This idea ties a lot of the technical and human topics together and it is a topic that excites me personally.
I’ve enjoyed working through this book (together with my data science study group) and it will find a valued place in my bookshelf – to be referenced whenever I encounter one of the problems in the wild again and need a foundational perspective.
With a second cat moving in this week, I am even more curious than before about what happens at home when no human is around.
There is a range of products that offer pet monitoring. Given I have an unused Raspberry Pi and a spare webcam lying around, I decided to DIY the solution.
As it turns out, this is easy: Take a Raspberry Pi, connect a webcam and install motion – an open source tool that exposes the webcam stream over the local network.
To enable remote access, I used tailscale which creates a private network for all my devices, no matter where they are located physically.
This guide has great step by step instructions which take less than 30 minutes to complete: https://tailscale.com/kb/1076/dogcam/
One thing to note: Pick the right Raspberry Pi. I started off with the model B+ (from 2014) which is a little underpowered and even has issues running the current Raspberry Pi OS smoothly. Luckily, I also had a “Pi 3 Model B” sitting in a drawer which did the job just fine.
The motion project comes with some handy features: Whenever motion is detected, it will save a snapshot frame and even short videos. These are stored on the Pi (under /var/lib/motion by default) and include the time of the event.
Even when I’m not watching the webcam stream, these recordings give me a summary of what the furballs were up to while I was gone.
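To get such a summary without clicking through individual files, a few lines of Python are enough. This is only a sketch: it groups the saved files by the hour of their modification time, assumes the default directory /var/lib/motion, and deliberately does not rely on motion’s file naming, which is configurable. The function name is made up for illustration.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path

def events_per_hour(directory):
    """Count saved snapshot and video files per hour, based on file modification time."""
    counts = Counter()
    for path in Path(directory).iterdir():
        if path.is_file():
            stamp = datetime.fromtimestamp(path.stat().st_mtime)
            counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(counts.items()))

if __name__ == "__main__":
    motion_dir = Path("/var/lib/motion")  # default location, adjust if configured differently
    if motion_dir.is_dir():
        for hour, n_files in events_per_hour(motion_dir).items():
            print(hour, n_files)
```

Run on the Pi, this prints one line per hour in which something moved in front of the camera, together with the number of files saved in that hour.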
I hadn’t used it before, but tailscale really was perfect for this case: The Raspberry Pi is one device in my virtual network. The other two I’ve added are my laptop and my phone.
Anytime I want to check the webcam stream I simply open the browser and access the (virtual?) IP of the Pi.
On the iPhone, I added that URL to the home screen so the ominous toilet stream is always in reach.
The initial setup was easy. I now have one webcam that I can place anywhere (anywhere the ethernet cable reaches, that is). A set of these would be cool so that I could monitor all movement in the flat.
My ambitions to build the next NSA for cats are limited though, so I’ll probably stick to a single cam and point it at one key location. Right now it’s looking at the litter box and I’m working on a spreadsheet to plot the bowel movements of the little one. Uhm, yeah.
Another idea for the next step: Collect the visual data over time and run some vision algorithms. How often do they eat? Do they really sleep 16 hours a day? Who spends more time in each room? All of them cool ideas (which I’ll never implement, let’s be honest).
This was a quick project. I can now check in on my cats when I’m out and about. A Saturday afternoon well spent.
Ever since I picked up my first set of bean bags as a kid, juggling has been a hobby that has stayed with me over the years. In my later teens and during my time at university, one of my part-time jobs was being a juggling teacher. I worked at a local youth club, at events and fairs, and had the chance to teach juggling to many people, from just 4 or 5 years old to seniors in their late 70s.
Fast forward to today. In the last few years, I have been working in the field of AI, working with my team to build computer vision systems that understand human motion and assist people in learning how to move correctly (e.g. with fitness exercises in our latest product).
Doesn’t this sound like something I should combine with my long-time hobby? While every person learns differently and at their own pace, I think juggling is a great skill to learn yourself while being assisted by an AI. When it comes to juggling, I’ve observed most people struggle in a similar manner to overcome common obstacles as they progress — a perfect example to put into an application.
Here’s the idea: You pick up a set of juggling balls and position yourself in front of your webcam. Step by step, you progress through basic juggling moves as software analyses the live video and provides feedback: Is your juggling pattern stable? Should you throw higher or lower? Are your hands positioned correctly? Is your rhythm fine?
With this in mind, I sat down one weekend this winter to build an AI-powered juggling teacher. In this post, I’ll show you how I did it.
To analyze the video of a person learning how to juggle, we’ll train a neural network (also “neural net” or “model”). If you are not familiar with neural networks, don’t be intimidated: It’s a concept that sounds fancy and comes from the field of Artificial Intelligence, but ultimately you can imagine it as a function, or a simple black box: We input a video clip and it returns as the output some information about that video.
We’ll set up our neural network to be able to classify a given video clip: Given a video, what visual class does the video belong to? A class in our case is the name of an action that is happening in the video – like “throwing 1 ball and dropping it”. In our application, we’ll use that visual class in order to give appropriate feedback to the user.
But how does the neural network know what to do? How does it know the difference between correctly tossing a ball versus dropping a ball? Well, it has to learn it first, which means that we need to train it.
Training a neural network means presenting it with example video clips of all the visual classes it should be able to recognize. Initially, the neural net doesn’t know much. It simply guesses what’s inside the video. If a guess is incorrect, we can adjust the internal parameters of the function (= of the neural network) so that the network is improved based on the error it just made. We’ll do this over and over again with all videos we’ve prepared for training until the network doesn’t get any better. At that point, we stop training and move on to build the application around it. But first, we need to prepare some video data for the training process.
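The loop described above (guess, measure the error, adjust the parameters, repeat until nothing improves) can be illustrated with a deliberately tiny example. This is not the juggling network: the “model” below has a single parameter `w`, the data are made-up points on the line y = 3x, and the learning rate is an arbitrary choice.

```python
# Toy data: inputs and the outputs we want the "network" to produce (y = 3x).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 6.0, 9.0, 12.0]

def mean_squared_error(w):
    return sum((w * x - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)

w = 0.0               # initially, the model "doesn't know much"
learning_rate = 0.01
best_error = mean_squared_error(w)

for step in range(1000):
    # Gradient of the error with respect to w: tells us how to change w to reduce the error.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
    w -= learning_rate * gradient       # adjust the internal parameter
    error = mean_squared_error(w)
    if best_error - error < 1e-12:      # no meaningful improvement: stop training
        break
    best_error = error

print(f"learned w = {w:.3f} after {step + 1} steps")
```

A real network has millions of parameters instead of one and uses a held-out validation set to decide when to stop, but the principle is the same.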
To train the neural net, we need a training dataset — that is a collection of video clips, each belonging to one distinct visual class we want the net to be able to recognize later. For the juggling use-case, I wanted the network to recognize the following:
All in all, this class catalog contains 27 different classes. I’ve recorded 545 video clips, each 3 seconds long. This took me around 1 hour. 70 videos went into a hold-out validation set so that I ended up using 475 videos to train the network. Is this enough data? We’ll discuss this in a bit. First, let’s have a look at the actual neural network.
Neural networks come in all kinds of flavors. For the juggling project, we want a network that can process a video stream, digest its visual characteristics to produce a classification output, and be compact enough to run in real-time.
I got all of this out-of-the-box by using the SDK we are developing and currently open-sourcing at Twenty Billion Neurons: SenseKit, an open-source project (work in progress) that makes it easy to train a video classifier without needing millions of videos.
The neural network architecture is a MobileNet-style neural network. Models of this architecture are popular for computer vision applications because they are designed for visual data while being compact enough to run in real-time on many devices, even smartphones. 3D convolutions instead of 2D convolutions allow powerful feature extractors on videos that include motion.
These “deep” neural networks (= many layers of feature extractors) require a lot of data to be able to learn useful features. One trick to get away with less data is called transfer learning: We don’t train the network from scratch. Instead, let’s take an already trained version and only slightly re-train it for our specific juggling task. In fact, the SenseKit version of the network comes with a pre-trained model. This means that my handful of juggling videos are enough to teach the network about juggling and the different kinds of juggling mistakes we want the application to react to.
Typically, training a video classification network requires thousands, if not millions of videos. With that in mind, it’s quite impressive that I could teach the network a completely new set of activities with just a few hundred videos. In addition, not training from scratch gives us a huge speedup. Training the juggling net took less than 10 minutes on a GPU machine (NVIDIA GeForce GTX 1080 Ti). As a comparison, these big networks can often take days to train from start to finish.
Having trained the network, I built a small juggling trainer application in Python that takes care of the following:
For example, a class name starting with 2b_... will be interpreted as “2 balls” being present in the video.
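Extracting the number of balls from a class name is a small parsing step. In the sketch below, only the `2b_` prefix comes from the project; the rest of the class names and the function name are hypothetical.

```python
import re

def count_balls(class_name):
    """Return the number of balls encoded in a prefix like '2b_', or None if absent."""
    match = re.match(r"(\d+)b_", class_name)
    return int(match.group(1)) if match else None

print(count_balls("2b_some_pattern"))  # 2
print(count_balls("background"))       # None
```

Encoding such information in the class name keeps the application logic simple: the network only has to output a label, and everything the interface needs can be derived from that string.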
Based on the juggling information I can extract from the recognized class name, the interface displays the following information:
This is what it looks like in action:
The idea to combine juggling and computer vision isn’t new, of course. Not to the world (check YouTube), but also not to me. Back at university (think 2014), two friends and I used the Kinect depth sensor to look at juggling patterns. It took us a few weeks and some failed attempts to produce a demo, held together by some carefully tuned thresholds. It was fun and we were able to produce some entertaining visualizations, but the demo was prone to misclassifications. To actually react to a person’s juggling pattern wasn’t feasible with our solution back then.
Throwing together a few videos and fine-tuning a neural network: It’s amazing to see and experience how much is possible with the tooling that’s available in 2021. Yes, I’ve only built a prototype of a demo so far — but the goal of building a real juggling trainer powered by computer vision isn’t out of reach. Looking back at my early attempts with the Kinect six years ago and comparing it to my recent attempt, it’s almost unreal to see that the same can be achieved in just one afternoon of work. I don’t know if I’ll push the project further than this, but it sure was a lot of fun.
If you have an idea for a similar computer vision project, I recommend you follow the progress of SenseKit. It comes with some built-in demos and provides everything you need to train your own video classification network similar to my juggling project.