Detecting missing signatures from documents: lessons from my summer internship
Published on May 21, 2025
⚠️ DISCLAIMER: All images and screenshots in this post are recreations and do not include any real signatures or documents. No private or confidential information is shared.
I’ve been wanting to write about the things I did last summer for a while. It was a super fun and productive experience that I’d like to document. This will be a reflection on what I learned, what I got right/wrong, the things I’d do differently, and how I managed to actually get things done!
I’d like to start by saying that the project was initially meant to be a few weeks long. After that, I’d get the chance to work on other projects in the same area. Spoiler alert: this did not happen! In fact, I worked on this project the entire time. It turned out to be a lot more complicated than we anticipated.
I hope this post is useful for anyone trying to implement a similar project, because I couldn't find anything similar when I was doing my initial research.
The project
Companies work with a large volume of documents. Many of these are receipts that need to be checked and approved by higher-ups, and it is not unusual for a single document to require physical signatures from more than three people.
Signatures are the legal backup for any future audits or claims, so it’s important for compliance to identify any documents that are missing them. That way, they can be signed and updated in the database.
The main goal of the project was to implement a model using Google Document AI that would look at a document and determine whether it was missing any signatures. As you’ll soon read, the solution ended up being more of a pipeline than a single magic model.
The issues begin
When the internship first started, I truly believed that I could just implement the work that I was given easily, and move on to the next project. I’d later come to find out that while in university things work more or less like that, in a company with real processes and problems, things aren't always so smooth.
One of the first complications was the lack of standardization of the documents. Most of them were not created by the company, but sent or uploaded by clients to a platform.
Because of this, it is impossible to actually standardize the documents, which means the signatures could be on any page, in any place within that page, and in many different layouts. The problem of identifying whether a required signature is missing also became a problem of finding where in the document the signatures are expected to be.
It's important to note that all of this assumes the document is actually some kind of valid receipt. Sometimes, workers uploaded screenshots of approval emails or WhatsApp conversations as backup and never replaced them with the actual signed document. This is easier to tackle, because it only requires filtering out those documents, which is a simple classification task. That classifier reached 99% accuracy on its first training run, so I’m not going to talk much about it.
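For anyone curious, calling a custom classifier processor from Python looks roughly like this. The project, processor and label names below are placeholders for illustration, not the real ones from the project:

```python
# Rough sketch: calling a Document AI custom classifier processor from Python.
# All IDs and label names are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

PROJECT_ID = "my-project"      # placeholder
LOCATION = "us"                # "us" or "eu"
PROCESSOR_ID = "abc123"        # placeholder; the real ID is assigned by Document AI

def is_probably_a_receipt(pdf_bytes: bytes) -> bool:
    """Return True if the classifier thinks the file is a real receipt."""
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
    )
    request = documentai.ProcessRequest(
        name=client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID),
        raw_document=documentai.RawDocument(content=pdf_bytes, mime_type="application/pdf"),
    )
    document = client.process_document(request=request).document
    # A custom classifier returns one entity per label with a confidence score;
    # keep the most confident one.
    best = max(document.entities, key=lambda e: e.confidence, default=None)
    return best is not None and best.type_ == "receipt"  # "receipt" is a placeholder label
```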
Labeling data
Document AI works with OCR. It handles digital and scanned documents well, but can struggle a bit with poor-quality scans or photos. The data labeling process is simple: for an extractor model, you load the documents, create as many labels as you want in the dataset, and then enclose each target in the document with a bounding box.
It looks like this:

At first, I chose two labels: “present_signature” and “expected_signature”. “Present” meant there was a signature and “expected” referred to an empty signature field. At this stage, I did the labeling as shown in the screenshot: a big bounding box over the whole signature field.
A training failure?
My first dataset consisted of around 300 documents labeled as shown above. My expectation was that, somehow, the model would learn to identify the fields that contained a signature and those that didn’t. If any “expected_signature” was found in a document, I would know that at least one signature was missing and could flag the document for manual review.
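The flagging logic itself is tiny once you have the processed document back from the extractor. Here is a sketch, with `document` being the `documentai.Document` returned by `process_document` in the earlier snippet:

```python
# Sketch of the flagging logic on top of the extractor's output.
def signature_summary(document) -> dict:
    """Count how many signature fields were detected as signed vs. empty."""
    counts = {"present_signature": 0, "expected_signature": 0}
    for entity in document.entities:
        if entity.type_ in counts:
            counts[entity.type_] += 1
    return counts

def needs_manual_review(document) -> bool:
    """Flag the document if the extractor found at least one empty signature field."""
    return signature_summary(document)["expected_signature"] > 0
```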
I was expecting that the accuracy wouldn’t be high in the first training round, but I never thought I would get 0% accuracy! This felt like a huge failure, until I went and analyzed the responses and why they were wrong.
I got two important takeaways from my failed model:
- The model correctly identified where the signatures were supposed to be
- The model accuracy was being measured with respect to the total area of the label and the expected text within that area
These two key findings gave me a better understanding of Document AI and the types of results I could get from it. My model was inaccurate according to the metrics, but it did more or less what I wanted it to do!
Turning failure into success
With the insight I got from the failed model, I trained a second version with more data and only the “expected_signature” label. This time, the bounding boxes were smaller and primarily enclosed the text indicators of the signature field, rather than the entire signature block.

This worked a lot better: I quickly got an 86% accuracy rate. Most inaccuracies were text-related, generally character mismatches in poor-quality documents. But positionally speaking, the model performed excellently, consistently finding the coordinates of the signatures even in the poor-quality documents.
Now I had a way to crop the document and keep only the relevant signature area! I did have a few issues with PDF processing, so I made my own API, but I’ll spare you the details as it’s just a side note. In simple terms, the API received the PDF document and the coordinates of the signatures present, and returned the cropped signature area.
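The cropping itself can be done with something like PyMuPDF. This is a simplified sketch rather than the actual API code, and it assumes normalized (0–1) page coordinates like the ones Document AI returns:

```python
# Sketch of the cropping step with PyMuPDF (just one way to do it, not the API's code).
import fitz  # PyMuPDF

def crop_signature_area(pdf_bytes: bytes, page_number: int, box, dpi: int = 200) -> bytes:
    """Render box = (x0, y0, x1, y1), in normalized page coordinates, to a PNG."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[page_number]
    x0, y0, x1, y1 = box
    clip = fitz.Rect(
        x0 * page.rect.width,
        y0 * page.rect.height,
        x1 * page.rect.width,
        y1 * page.rect.height,
    )
    pixmap = page.get_pixmap(clip=clip, dpi=dpi)  # rasterize only the signature area
    return pixmap.tobytes("png")
```

The normalized box would come straight from the extractor’s `page_anchor` bounding polygon, taking the min/max of its `normalized_vertices`.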
With this model and the PDF processing done, I had solved the “finding where in the document the signatures are expected to be” problem. Now I only had to focus on determining whether all the signatures in that area were present, or at least one was missing.
Turns out, OCR is not that great for signature recognition
Before the project began, I was told to use Google Document AI to solve the problem. My internship supervisor wanted to see what this tool was capable of, and how they could integrate it in solutions for different areas of the company. This tool worked great for the first two problems encountered: filtering out incorrect documents and identifying where the signatures should be in a document.
After implementing the signature area extractor, I had a collection of small signature areas rather than full documents. I made a third model, again a classifier, and labeled each image in the set as either “missing_signature” or “fully_signed”.
I started training the model with 300 images and incrementally added more data; the last version trained on ~800 images. The accuracy was not increasing, even after more than doubling the data. I managed to get it close to 90% accuracy before deciding this was not the correct approach.
Next steps
On my last day, I presented my work and findings. I suggested replacing this last model with a computer vision approach: a traditional CNN would do a much better job at classifying whether a signature is missing or not.
Signatures aren’t really characters, and it honestly surprises me that the OCR classifier was even able to get somewhat consistent results. In hindsight, I believe it may have just been overfitting.
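To be clear, I never got to build this, but the kind of classifier I had in mind is nothing exotic. A minimal sketch in PyTorch could look like this (the architecture and input size are arbitrary placeholders):

```python
# Minimal sketch of a CNN for the cropped signature areas (sizes are placeholders).
import torch
import torch.nn as nn

class SignatureCNN(nn.Module):
    """Binary classifier: a single logit, where > 0 means "missing_signature"."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 1),  # single logit for "missing_signature"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Expects grayscale crops resized to 128x128: (batch, 1, 128, 128).
model = SignatureCNN()
logits = model(torch.randn(4, 1, 128, 128))  # -> shape (4, 1)
```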
Final thoughts
I think I managed to build a solid pipeline, because you really don’t want to throw 9-page documents at a CNN; it would be extremely inefficient and a waste of resources. The value of Document AI in this case is huge, because it lets you crop the document down so the computer vision model can focus on the important area.
If I had had the time, I would have liked to implement a loss function that penalizes missing an incomplete document a lot more than wrongly flagging a complete one. It is way less problematic to flag a fully signed document as incomplete than it is to miss a document that actually needs a signature, so the model should err on the side of flagging.
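A straightforward way to encode that preference, at least in a PyTorch-style setup like the sketch above, is a class-weighted loss that makes mistakes on incomplete documents several times more expensive. The exact weight is something you would tune, not a magic number:

```python
# Weighting the loss so that missing an incomplete document costs more than
# wrongly flagging a complete one (same hypothetical PyTorch setup as the CNN sketch).
import torch
import torch.nn as nn

# Treat "missing_signature" as the positive class. pos_weight > 1 makes errors on
# those documents count more, pushing the model to flag when in doubt.
# The value 5.0 is a placeholder to be tuned on a validation set.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]))

# logits: (batch, 1) from the CNN; targets: 1.0 = missing_signature, 0.0 = fully_signed
logits = torch.randn(4, 1)
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
loss = criterion(logits, targets)
```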
This was an incredible experience and I’m very grateful that I got to work on such a cool project, and with a cool supervisor that let me experiment with things and research on my own. It was not easy, but I learned way more than I would have if I had just been following orders.