Over the last decade, more and more data has originated in digital form. However, there are still many cases where physical documents need to be used or preserved. The healthcare and financial industries in particular scan or fax large volumes of physical documents into TIFF or PDF formats. Unstructured content analysis is already a challenge, and these image-based formats are an even harder nut to crack.
What is Optical Character Recognition?
Optical character recognition (OCR) extracts text content from images of physical documents so that the text is in an editable format. This can apply to pages of a book, scanned PDF files, and even handwritten content (though support for handwriting is more limited). Compliance Guardian’s OCR implementation is made possible in large part by Google’s Tesseract library.
Considerations
Aside from the exciting new capability to see text in images, there are also some considerations to keep in mind.
Performance
OCR is computationally intensive and places a significant load on CPUs. As a result, documents will be scanned considerably more slowly with OCR enabled than without it. For instance, it may take 5 seconds to process a single 300-dpi scanned page, depending on CPU power.
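To get a feel for what a per-page cost like this means for a large batch, here is a rough back-of-the-envelope sketch. The function name, the worker count, and the assumption that pages split evenly across workers are all hypothetical; only the 5-seconds-per-page figure comes from the example above, and real throughput will vary with hardware and page content.

```python
import math


def estimate_ocr_hours(pages: int, seconds_per_page: float = 5.0, workers: int = 1) -> float:
    """Estimate wall-clock hours to OCR `pages` pages.

    Assumes every page costs the same and that pages are split evenly
    across `workers` independent CPU workers (an idealized assumption).
    """
    if pages < 0 or seconds_per_page <= 0 or workers < 1:
        raise ValueError("pages must be >= 0, seconds_per_page > 0, workers >= 1")
    # The busiest worker handles ceil(pages / workers) pages.
    pages_on_busiest_worker = math.ceil(pages / workers)
    return pages_on_busiest_worker * seconds_per_page / 3600.0


print(f"{estimate_ocr_hours(10_000):.1f} hours on one worker")
print(f"{estimate_ocr_hours(10_000, workers=4):.1f} hours on four workers")
```

At 5 seconds per page, a 10,000-page backlog works out to roughly 14 hours on a single worker, which is why it is worth planning scan windows around OCR-heavy content.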
Accuracy
Although OCR technology has advanced a lot in recent years, it’s still far from perfect. It’s rare for OCR to yield 100% accurate results. The clearer the original image, the more accurate the result will be.
The following section expands on some common factors that affect accuracy. Pre-processing is very important for improving accuracy. Common approaches include converting an image to grayscale, increasing contrast, and reducing noise. In special cases, more complex pre-processing may be needed (e.g. computer vision, contour detection, rotation, cropping, or anchors).
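As an illustration of these basic pre-processing steps, here is a minimal pure-Python sketch that works on an image held as nested lists. It is not Compliance Guardian's implementation: the function names are made up for this example, the grayscale weights are the standard ITU-R BT.601 luma weights, and a 3x3 median filter stands in for "noise reduction" in general. A production pipeline would normally use an imaging library instead.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Gray = List[List[int]]


def to_grayscale(pixels: List[List[RGB]]) -> Gray:
    """Convert RGB pixels to 8-bit luminance (BT.601 weights)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in pixels]


def stretch_contrast(gray: Gray) -> Gray:
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255.

    Assumes a non-empty rectangular image.
    """
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray]
    scale = 255.0 / (hi - lo)
    return [[int(round((v - lo) * scale)) for v in row] for row in gray]


def median_filter_3x3(gray: Gray) -> Gray:
    """Replace each interior pixel with the median of its 3x3 neighborhood.

    Border pixels are copied unchanged to keep the sketch short.
    """
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(gray[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]  # middle of 9 sorted values
    return out


# A washed-out 3x3 patch with one dark speck of noise in the middle.
patch = [[(150, 150, 150)] * 3,
         [(150, 150, 150), (100, 100, 100), (150, 150, 150)],
         [(150, 150, 150)] * 3]
cleaned = median_filter_3x3(to_grayscale(patch))
print(cleaned[1][1])
```

The order of the steps is a design choice. Here the speck is removed before any contrast stretch so that a single outlier pixel does not dictate the stretch range.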
Common Factors That Impact OCR Accuracy
Typically OCR works better for documents that are:
Scanned with flatbed scanners
Scanned with good resolution and lighting conditions
Scanned with high contrast
Text centric
Using common fonts
Well aligned
Documents produced by a dedicated scanner or fax machine can meet most of these conditions, but not all documents can.
The sections below describe how each of these factors can affect OCR results.
Nature of the Image
Scanned documents yield much better accuracy than photos because photos typically have less contrast, more noise, blurriness (e.g. out-of-focus edges or camera shake), distortion (the page is not flat), poor alignment, and so on.
The same principle applies to images within scanned documents. Text-centric content will produce much better results than scanned pictures of driver’s licenses and ID cards, for instance.
Resolution
From testing, we found that images with a resolution of 300 dpi typically produce better results. If the image resolution is too low (less than 100 dpi), pre-processing that enlarges the image can improve accuracy a bit, but the results will still not be as good as those from higher-resolution images. On the other hand, images with very high resolution take longer to process.
Font
Font size interacts with resolution. A larger font size can be fine at low dpi, but a smaller font size requires higher resolution to be recognized. For example, a 10.5pt font could work fine in 300 dpi images, but in 200 dpi images a font smaller than 12pt may not work well.
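The link between font size and resolution comes down to how many pixels each character gets. Since one point is 1/72 of an inch, a font's nominal height in pixels is its point size divided by 72, multiplied by the dpi. The short sketch below applies that arithmetic to the figures above. The helper names are hypothetical, and the pixel threshold passed to `dpi_needed` is only inferred from the 12pt-at-200-dpi example, not a documented limit of any OCR engine.

```python
def font_height_px(point_size: float, dpi: float) -> float:
    """Nominal (em) height of a font in pixels: 1 pt = 1/72 inch."""
    return point_size / 72.0 * dpi


def dpi_needed(point_size: float, min_height_px: float) -> float:
    """Scan resolution at which `point_size` text reaches `min_height_px`."""
    return min_height_px * 72.0 / point_size


print(font_height_px(10.5, 300))            # 43.75
print(round(font_height_px(12, 200), 2))    # 33.33
print(round(font_height_px(10.5, 200), 2))  # 29.17

# Assumed threshold: the height of 12pt text at 200 dpi (about 33 px).
threshold = font_height_px(12, 200)
print(round(dpi_needed(10.5, threshold)))   # 229
```

Note that the point size is the nominal height of the font, so lowercase letters occupy noticeably fewer pixels than these numbers suggest.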
Font type is another factor. Google’s Tesseract library is pre-trained with the most common font types. If the font used in the document is not common, the accuracy will be lower.
Handwriting results are typically poor for the same reason. It’s also important to note that handwriting input and handwriting OCR are a bit different: handwriting input tracks movement as the text is written, while handwriting OCR has only the final image available.
Contrast
The OCR engine works best on high contrast images. Most well-scanned, text-centric documents can satisfy this. For less than ideal situations, pre-processing may be used to increase the contrast.
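One common way to raise contrast as far as it can go is to binarize the image, turning every pixel either black or white. The sketch below uses Otsu's method, a standard technique that picks the threshold that best separates dark and light pixels. It is offered as an example of contrast pre-processing in general, not as a description of what Compliance Guardian does internally, and the function names are made up for this example.

```python
from typing import List


def otsu_threshold(values: List[int]) -> int:
    """Return the 8-bit threshold t that maximizes between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as light.
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        if dark_count == 0:
            continue  # no dark pixels yet
        light_count = total - dark_count
        if light_count == 0:
            break  # every pixel is already in the dark class
        mean_dark = dark_sum / dark_count
        mean_light = (sum_all - dark_sum) / light_count
        # Proportional to the between-class variance.
        score = dark_count * light_count * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(values: List[int], threshold: int) -> List[int]:
    """Map pixels at or below the threshold to 0 (black), the rest to 255 (white)."""
    return [0 if v <= threshold else 255 for v in values]


# Gray text (50) on an off-white background (200), as a flat list of pixels.
pixels = [50] * 10 + [200] * 10
t = otsu_threshold(pixels)
print(t, binarize(pixels, t)[:3], binarize(pixels, t)[-3:])
```

A single global threshold like this works well for evenly lit scans. Photos with shadows or uneven lighting usually need a locally adaptive threshold instead.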
Alignment
Google’s Tesseract library has minimal tolerance for misaligned scans. Based on our testing, accuracy drops if a scan is rotated more than 5 degrees out of alignment. Again, several complex pre-processing techniques could help overcome this.
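As a simple illustration of the 5-degree guideline, the sketch below measures the skew of a single text baseline and flags pages that would need to be rotated before OCR. It assumes some earlier step has already located two points on a line of text. That detection step, the function names, and the default tolerance parameter are all hypothetical here; only the 5-degree figure comes from the testing described above.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def skew_degrees(left: Point, right: Point) -> float:
    """Angle of the line through two baseline points, in degrees.

    Uses image coordinates (y grows downward), so a positive angle means
    the text line drops toward the right.
    """
    (x1, y1), (x2, y2) = left, right
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def needs_deskew(angle_degrees: float, tolerance_degrees: float = 5.0) -> bool:
    """True if the measured skew exceeds the tolerance in either direction."""
    return abs(angle_degrees) > tolerance_degrees


# A baseline that drops 10 px over 100 px is about 5.7 degrees off.
angle = skew_degrees((0, 0), (100, 10))
print(round(angle, 1), needs_deskew(angle))

# A baseline that drops 5 px over 100 px is about 2.9 degrees off.
angle = skew_degrees((0, 0), (100, 5))
print(round(angle, 1), needs_deskew(angle))
```

In practice the correction would be to rotate the image by the negative of the measured angle before passing it to the OCR engine.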
Several other factors can also degrade OCR accuracy, such as blurriness, pages that are not flat when scanned, and blemishes on the image.
How AvePoint Can Help
AvePoint’s Compliance Guardian product already has an extensive framework of technologies to help customers with deep content analysis. In our newest update to version 4.4, optical character recognition (OCR) for scanned documents will further expand our technology stack in a major way.
With the help of OCR, Compliance Guardian will allow users to analyze physical documents much more efficiently via several image enhancement techniques that will significantly improve OCR results.
Compliance Guardian’s out-of-the-box OCR functionality targets the more common scanned text document scenarios. We’re excited that users will finally be able to get text content from images and scan for compliance violations directly on the platform.
That said, there are still challenges in optimizing OCR to work seamlessly for all use cases. As of now, accuracy is still being evaluated on a case-by-case basis. We’ll continue to work hard on accuracy and optimization improvements, so please stay tuned!
George Wang brings more than 20 years of experience in software architecture and design – focusing on data protection, disaster recovery, archiving, database, storage, and large-scale distributed enterprise application systems – to his role as Chief Architect at AvePoint. George designed and created AvePoint’s award-winning platform recovery, replication, and storage management products as well as NetApp SnapManager for SharePoint, which features deep integration with AvePoint’s DocAve Software Platform. George holds a Master’s Degree in Electrical Engineering from Tsinghua University and currently resides in New Jersey.