In late 2020, I began working with a company that manufactures high-resolution photo scanners. Their aim was to detect superficial damage on cars, such as scratches and bumps.

First approach

After an initial meeting where they described their problem, they stated that their goal was to use something called “Deflectometry”. Deflectometry consists of observing the reflection of a known pattern on a reflective surface and deducing the shape of the surface from the way the pattern is deformed. This way, it’s possible to detect bumps and deformations that a simple camera wouldn’t be able to detect.

However, to differentiate themselves from their competitors, they couldn’t use a huge robot arm in a controlled, somewhat dark environment like in the above video. They wanted to do it using a fixed pattern printed on a piece of plastic with a backlight, in broad daylight. Furthermore, the cars weren’t always the same model, since this was aimed at car rental companies, which own many different models.

The science behind deflectometry is very well understood and, under ideal conditions, really easy to implement: you have a known input pattern and a captured output pattern, and from the two you can deduce whether there is a bump or scratch on the surface (by computing the divergence of the resulting vector field at different points). However, as you can imagine, we were far, far away from this dreamland of perfect observations and repeatable processes.
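To make the divergence idea a bit more concrete, here is a minimal sketch in plain Python. It assumes that a pattern-matching step (not shown, and not described in this post) has already produced a displacement field on a regular grid: `u[y][x]` and `v[y][x]` say how far each pattern point moved horizontally and vertically between the reference pattern and its captured reflection. The function names and the threshold are hypothetical, and this is an illustration of the principle rather than a working deflectometry pipeline.

```python
def divergence(u, v, spacing=1.0):
    """Approximate du/dx + dv/dy on a regular grid with finite differences.

    u[y][x] and v[y][x] are the two components of a displacement field
    (a hypothetical input; the matching step that produces it is not shown).
    """
    rows, cols = len(u), len(u[0])
    if rows < 2 or cols < 2:
        raise ValueError("the grid needs at least 2 rows and 2 columns")

    def diff(value_at, i, n):
        # Central difference inside the grid, one-sided at the borders.
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        return (value_at(hi) - value_at(lo)) / ((hi - lo) * spacing)

    div = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            du_dx = diff(lambda j: u[y][j], x, cols)
            dv_dy = diff(lambda i: v[i][x], y, rows)
            div[y][x] = du_dx + dv_dy
    return div


def suspicious_points(div, threshold):
    """Return (x, y) grid positions whose |divergence| exceeds the threshold."""
    return [
        (x, y)
        for y, row in enumerate(div)
        for x, value in enumerate(row)
        if abs(value) > threshold
    ]
```

On a perfectly uniform displacement field the divergence is zero everywhere, so `suspicious_points` returns nothing. Local stretching or compression of the pattern shows up as non-zero values. In broad daylight, on varying car models, obtaining a clean displacement field in the first place is the hard part.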

Suggesting a new approach

In most, if not all, projects, the customer knows what they want but often has questionable ideas about how to achieve it. My first goal is to listen to them and understand exactly what their needs are. Then, from experience, I can slightly shift their goal to make it way, way simpler and cheaper to attain, while still reaping the same benefits.

What the customer wants

The above example was one of these projects. In this case, I was competing with two other research organisations: Fraunhofer-Gesellschaft and another institute from Greece, about which I wasn’t given any more details. Both institutes deployed “a team” consisting of many business-related people and only one technical person per team. After a few months of researching the problem, they hadn’t come up with anything significant, which isn’t surprising considering the amount of time spent on it.

In the meantime, I started by going through the scientific literature on the topic and found this lovely article published in the International Journal of Applied Engineering Research in 2018, which further convinced me that a deep learning approach was best and that “deflectometry” wasn’t going to be of any use here. Shortly after that, I held a meeting with the customer detailing my findings and began cleaning their data.

The Data

The customer already had hand-annotated data, which was definitely a big plus. However, it wasn’t organised in any way that was helpful for my job. There were duplicates everywhere, deeply nested subfolders, and corrupted data files. To top it off, the whole dataset was on a WebDAV server, which meant the transfer speeds were abysmal. In total, it took me 7 continuous days simply to copy over the entire dataset. Once that was done, I used bash commands to remove duplicates and corrupted files and to flatten the whole structure. I also computed a hash of each file to be able to delete future duplicates faster and store the data more easily (in retrospect, I basically did what DVC does…).
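The cleanup above was done with bash commands that aren’t reproduced here. As a rough Python sketch of the same idea, the snippet below walks a nested folder, skips invalid files, and copies each unique file into a flat folder under a name derived from its SHA-256 hash. The function names are hypothetical, and the default validity check (a non-empty file) is only a placeholder: detecting a truly corrupted image would require trying to decode it.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_non_empty(path):
    """Placeholder validity check; a real one would try to decode the image."""
    return path.stat().st_size > 0


def flatten_and_deduplicate(source_dir, target_dir, is_valid=is_non_empty):
    """Copy every unique, valid file from a nested tree into one flat folder.

    Files are renamed to <hash><suffix>, so identical content always maps to
    the same name. target_dir is assumed to live outside source_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    kept = {}
    for path in sorted(Path(source_dir).rglob("*")):
        if not path.is_file() or not is_valid(path):
            continue
        digest = file_digest(path)
        if digest in kept:
            continue  # identical content was already copied
        destination = target / f"{digest}{path.suffix.lower()}"
        shutil.copy2(path, destination)
        kept[digest] = destination
    return kept
```

Because the file name is the content hash, checking whether a newly delivered file is a duplicate becomes a simple lookup instead of a byte-by-byte comparison against the whole dataset.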

My first attempt

Then, the really gruesome work began. I split the 4096 x 3000 images into smaller 300 x 300 chunks and gave each chunk a binary label: positive if an annotated scratch polygon was visible in it, negative otherwise. That turned the task into a binary classification problem.
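The chunking and labelling step can be sketched roughly as follows. This is an illustration under two assumptions that the text above doesn’t confirm: the annotated polygons have already been rasterised into a binary mask of the same size as the image (`mask[y][x]` is truthy on scratch pixels), and partial chunks at the image border are simply dropped. The function names are hypothetical.

```python
def tile_boxes(width, height, tile=300):
    """Yield (left, top, right, bottom) for every full tile in the image.

    Partial tiles at the right and bottom borders are dropped (an assumption).
    """
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            yield left, top, left + tile, top + tile


def label_tiles(mask, tile=300):
    """Map each tile's (left, top) corner to 1 if it contains scratch pixels."""
    height, width = len(mask), len(mask[0])
    labels = {}
    for left, top, right, bottom in tile_boxes(width, height, tile):
        has_scratch = any(
            mask[y][x] for y in range(top, bottom) for x in range(left, right)
        )
        labels[(left, top)] = int(has_scratch)
    return labels
```

With these assumptions, a 4096 x 3000 image yields 13 x 10 = 130 chunks, and the last 196 pixel columns are discarded. The pure-Python pixel loop is slow on full-size images; with a NumPy mask, the inner check could be written as `mask[top:bottom, left:right].any()` instead.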

Training records

In retrospect, it’s possible to do much better than this, but with the time, budget and data available, this was the best I could do.

Assessing the final product

Once I had my binary classification model working flawlessly, it was time for the customer to assess it. During development, they kept insisting on having an “accuracy over 80%!”. However, they failed to consider that the dataset wasn’t balanced (far more non-scratch images than actual scratches), and they never specified which metric to use. Of course, it’s easy to reach 99% accuracy: simply predict that every input is not a scratch. Since the dataset is heavily imbalanced, most images aren’t scratched (so we’re correct there) and only a few are (so we’re incorrect there). Given the sheer proportion of one to the other, we succeed far more often than we fail, which leads to a high accuracy despite the model being ridiculous.
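A few lines of Python make the point. The class counts below are made up purely for illustration (990 clean chunks and 10 scratched ones); they are not the proportions of the customer’s dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative numbers only: 990 clean chunks (0) and 10 scratched ones (1).
y_true = [0] * 990 + [1] * 10

# A "model" that never detects anything.
always_clean = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, always_clean):.2%}")  # accuracy = 99.00%
```

The do-nothing model comfortably beats the 80% target while missing every single scratch.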

As the saying goes, you can’t improve what you can’t measure. That’s why I explained the difference between Precision and Recall during my presentation:

Precision vs Recall

This is something the other institutes apparently didn’t think was important to mention. The downside of trying to explain it is that it’s a rather subtle concept. Precision and recall are far more informative than “accuracy”, but they’re less straightforward to understand and thus harder to communicate up the chain of command than a nice, big, high percentage number.
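For reference, precision answers “of everything flagged as a scratch, how much really was one?”, while recall answers “of all the real scratches, how many were flagged?”. The sketch below computes both from scratch. The predictions are invented for illustration (a detector that finds 8 of 10 scratches and raises 4 false alarms) and are not results from the actual model.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention used here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: 990 clean chunks followed by 10 scratched ones.
y_true = [0] * 990 + [1] * 10

# Hypothetical detector: 4 false alarms, 8 scratches found, 2 missed.
y_pred = [1] * 4 + [0] * 986 + [1] * 8 + [0] * 2

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A model that labels everything as “not a scratch” gets a recall of 0 on the same data, which exposes exactly the failure that a raw accuracy figure hides.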


I ended up deploying the finalised model on TensorFlow Serving using Docker. This stack is officially recommended by Google, supports huge production loads, and adds very little overhead. I also made a web interface using Plotly’s excellent Dash framework, which communicated with the inference server over gRPC (so no REST API there, to make the data transfer faster).

It was a rather crude interface, yet still fast and functional. It highlights in yellow the areas where it detects damage, and in blue the ones where it doesn’t. Of course, it’s possible to improve the interface substantially, but the inference server could have served any other front end effortlessly.