Learnings from a Machine Learning Engineer — Part 4: The Model

In this latest part of my series, I will share what I have learned about selecting a model for image classification and how to fine-tune that model. I will also show how you can leverage the model to accelerate your labelling process, and finally how to justify your efforts by generating usage and performance statistics.
In Part 1, I discussed the process of labelling the image data you use in your image classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over the various data sets beyond the usual train-validation-test split, including benchmark sets, plus how to handle synthetic data and duplicate images. In Part 3, I explained how to apply different evaluation criteria to a trained model versus a deployed model, and how to use benchmarks to determine when to deploy a model.
Model selection
So far I have spent a lot of time on labelling and curating the set of images, and on evaluating model performance, which may seem like putting the cart before the horse. I’m not trying to minimize what it takes to design a massive neural network — it is a very important part of the application you are building. In my case, I spent a few weeks experimenting with different available models before settling on one that fit the bill.
Once you pick a model structure, you usually don’t make any major changes to it. For me, six years into deployment, I’m still using the same one. Specifically, I chose Inception V4 because it has a large input image size and an adequate number of layers to pick up on subtle image features. It also performs inference fast enough on CPU, so I don’t need to run expensive hardware to serve the model.
Your mileage may vary. But again, the main takeaway is that focusing on your data will pay dividends versus searching for the best model.
Fine tuning
I will share a process that I found to work extremely well. Once I decided on the model to use, I randomly initialized the weights and let the model train for about 120 epochs before improvements plateaued at a fairly modest accuracy of around 93%. At that point, I performed the evaluation of the trained model (see Part 3) to clean up the data set. I also incorporated new images as part of the data pipeline (see Part 1) and prepared the data sets for the next training run.
Before starting the next training run, I simply take the last trained model, pop the output layer, and add it back with random weights. Since the number of output classes is constantly increasing in my case, I have to pop that layer anyway to account for the new class count. Importantly, I leave the rest of the trained weights as they are and allow them to continue updating for the new classes.
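If you happen to use Keras, a minimal sketch of this head reset might look like the following (the model path, class count, and optimizer settings are placeholders, not my production values):

```python
import tensorflow as tf

# Load the model from the previous training run (placeholder path).
old_model = tf.keras.models.load_model("models/previous_run.keras")

# Keep every layer except the old classification head
# (this assumes the final layer is the softmax Dense layer).
backbone_output = old_model.layers[-2].output

# Attach a fresh softmax head sized for the new class count; its weights
# start random while the backbone keeps everything it has already learned.
num_classes = 250  # placeholder for the current number of classes
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone_output)
new_model = tf.keras.Model(inputs=old_model.input, outputs=outputs)

# Leave the backbone trainable so its weights continue to update.
new_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```

The key point is that only the new output layer starts from random weights; everything else picks up where the last run left off.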
Resetting only the output layer allows the model to train much faster before improvements stall. After repeating this process dozens of times, training now reaches a plateau after about 20 epochs, and the test accuracy can reach 99%! The model builds upon the low-level features it established during previous runs while re-learning the output weights, which helps prevent overfitting.
It took me a while to trust this process, and for a few years I would train from scratch every time. But after I tried it and saw the training time (not to mention the cost of cloud GPUs) go down while the accuracy continued to go up, I started to embrace the process. More importantly, I continue to see the evaluation metrics of the deployed model hold up with solid performance.
Augmentation
During training, you can apply transformations to your images (called “augmentation”) to get more diversity out of your data set. With our zoo animals, it is fairly safe to apply a left-right flip, slight rotations clockwise and counterclockwise, and a slight resize that zooms in and out.
When applying these transformations, make sure your images can still act as good training examples. In other words, an image where the subject is already small will become even smaller after zooming out, so you probably want to discard the original. Also, some of your original pictures may need to be re-oriented by 90 degrees so they start out upright, since a further rotation would make them look unusual.
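If your training pipeline is in Keras, these conservative augmentations can be expressed as preprocessing layers. The factors below are only a starting point, and the directory layout and image size are assumptions:

```python
import tensorflow as tf

# Mild, label-preserving augmentations: left-right flip, slight rotation, slight zoom.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.03),  # roughly +/- 10 degrees
    tf.keras.layers.RandomZoom(0.1),       # zoom in or out by up to 10%
])

# Apply the augmentations on the fly to the training set only.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(299, 299), batch_size=32)
train_ds = train_ds.map(lambda images, labels: (augment(images, training=True), labels))
```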
Bulk identification
As I mentioned in Part 1, you can use the trained model to assist you in labelling images one at a time. But you can take this even further by having your newly trained model identify hundreds of images at a time while building a list of results that you can then filter.
Typically, we have large collections of unlabelled images that have come in either through regular usage of the application or some other means. Recall from Part 1 that we assign an “unknown” label to interesting pictures when we have no idea what they are. Bulk identification lets us sift through these collections quickly and target the labelling once we know what they are.
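Here is a bare-bones sketch of what bulk identification might look like, assuming a Keras model, class names saved at training time, and unlabelled images sitting in a single folder (all of which are assumptions about your setup):

```python
from pathlib import Path

import numpy as np
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model("models/current.keras")          # placeholder path
class_names = np.load("models/class_names.npy", allow_pickle=True)  # saved during training

# Load every unlabelled image at the model's input size.
# Apply whatever preprocessing you used at training time here as well.
paths = sorted(Path("data/unlabelled").glob("*.jpg"))
images = np.stack([
    tf.keras.utils.img_to_array(tf.keras.utils.load_img(str(p), target_size=(299, 299)))
    for p in paths
])

# Run inference in batches and collect the top prediction per image.
probs = model.predict(images, batch_size=32)
results = pd.DataFrame({
    "path": [str(p) for p in paths],
    "predicted_class": class_names[probs.argmax(axis=1)],
    "confidence": probs.max(axis=1),
})
```

Each row gives the file path, the model’s best guess, and a confidence score, which is all you need to sort and filter.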
By combining your current image counts with the bulk identification results, you can target classes that need expanded coverage. Here are a few ways you can leverage bulk identification:
- Increase low image counts — Some of your classes may have just barely made the cutoff to be included in the training set, which means you need more examples to improve coverage. Filter the results for classes with low image counts.
- Replace staged or synthetic images — Some classes may be built entirely using non-real-world images. These pictures may be good enough to get started with, but may cause performance issues down the road because they look different than what typically comes through. Filter for classes that depend on staged images.
- Find look-alike classes — A class in your data set may look like another one. For example, let’s say your model can identify an antelope, and that looks like a gazelle which your model cannot identify yet. Setting a filter for antelope and a lower confidence score may reveal gazelle images that you can label.
- Unknown labels — You may not have known how to identify the dozens of cute wallaby pictures, so you saved them under “Unknown” because they were good images. Now that you know what they are, you can filter for their look-alike, the kangaroo, and quickly add a new class.
- Mass removal of low scores — To clean out the large collection of unlabelled images that have nothing worth labelling, set a filter for the lowest scores (a couple of example filters follow this list).
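Continuing the sketch above, a couple of these filters might look like this (the thresholds are illustrative):

```python
# Look-alike classes: low-confidence "antelope" predictions may actually be gazelles.
gazelle_candidates = results[(results.predicted_class == "antelope")
                             & (results.confidence < 0.6)]

# Mass removal: very low scores across all classes are candidates for cleanup.
removal_candidates = results[results.confidence < 0.2]
```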
Throw-away training run
Recall the decision from Part 2 to set image cutoffs, which ensures we have an adequate number of example images for a class before we train and serve a model to the public. The problem is that you may have a number of classes that are just below your cutoff (in my case, 40) and don’t make it into the model.
The way I approach this is with a “throw-away” training run that I do not intend to move to production. I will decrease the lower cutoff from 40 to perhaps 35, build my train-validation-test sets, then train and evaluate like I normally do. The most important part of this is the bulk identification at the end!
There is a chance that somewhere in the large collection of unlabelled images I will find the few that I need. Doing the bulk identification with this throw-away model helps find them.
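The cutoff change itself is trivial. Here is a sketch of how you might see which borderline classes get pulled into the throw-away run, assuming one sub-folder of labelled images per class:

```python
from pathlib import Path

# Count labelled images per class, assuming one sub-folder per class.
counts = {d.name: len(list(d.glob("*.jpg")))
          for d in Path("data/labelled").iterdir() if d.is_dir()}

PRODUCTION_CUTOFF = 40
THROWAWAY_CUTOFF = 35  # relaxed for this run only; the model never ships

included = sorted(c for c, n in counts.items() if n >= THROWAWAY_CUTOFF)
borderline = sorted(c for c, n in counts.items()
                    if THROWAWAY_CUTOFF <= n < PRODUCTION_CUTOFF)
print(f"{len(borderline)} borderline classes included that production would exclude")
```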
Performance Reporting
One very important aspect of any machine learning application is being able to show usage and performance reports. Your manager will likely want to see how many times the application is being used to justify the expense, and you as the ML engineer will want to see how the latest model is performing compared to the previous one.
You should build logging into your model serving to record every transaction going through the system. Also, the manual evaluations from Part 3 should be recorded so you can report on performance for such things as accuracy over time, by model version, by confidence scores, by class, etc. You will be able to detect trends and make adjustments to improve the overall solution.
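As a rough illustration (the schema, storage choice, and field names are entirely up to you), every prediction could be appended to a small database and later rolled up by model version:

```python
import sqlite3
from datetime import datetime, timezone

import pandas as pd

conn = sqlite3.connect("inference_log.db")
conn.execute("""CREATE TABLE IF NOT EXISTS predictions (
    ts TEXT, model_version TEXT, image_id TEXT,
    predicted_class TEXT, confidence REAL, verified_class TEXT)""")

def log_prediction(model_version, image_id, predicted_class, confidence,
                   verified_class=None):
    """Record one transaction; verified_class is filled in later by manual evaluation."""
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(), model_version, image_id,
                  predicted_class, confidence, verified_class))
    conn.commit()

# Accuracy by model version, using only the rows that have been manually verified.
df = pd.read_sql("SELECT * FROM predictions WHERE verified_class IS NOT NULL", conn)
accuracy_by_version = (df.predicted_class == df.verified_class).groupby(df.model_version).mean()
print(accuracy_by_version)
```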
There are a lot of reporting tools, so I won’t recommend one over the other. Just make sure you are collecting as much information as you can to build these dashboards. This will justify the time, effort, and cost associated with maintaining the application.
Conclusion
We covered a lot of ground across this four-part series on building an image classification project and deploying it in the real world. It all starts with the data, and by investing the time and effort into maintaining the highest quality image library, you can reach impressive levels of model performance that will gain the trust and confidence of your business partners.
As a Machine Learning Engineer, you are primarily responsible for building and deploying your model. But it doesn’t stop there — dive into the data. The more familiar you are with the data, the better you will understand the strengths and weaknesses of your model. Take a close look at the evaluations and use them as an opportunity to adjust the data set.
I hope these articles have helped you find new ways to improve your own machine learning project. And by the way, don’t let the machine do all the learning — as humans, our job is to continue our own learning, so don’t ever stop!
Thank you for taking this deep dive with me into a data-driven approach to model optimization. I look forward to your feedback and how you can apply this to your own application.