OCR on PDF documents
-
In the past I have used Optical Character Recognition (OCR) software to scan in a document, and then convert the document into editable text. At the moment I have some PDF documents on the computer which Acrobat does not recognise as containing text. Is it possible to run an OCR program over these PDF documents to convert the content into editable text?
I have also frequently encountered PDF documents which Acrobat does not recognise as containing text, even though there is obviously text on the page. Normally a PDF is created directly from the original electronic document, and in that case the text is stored in the PDF as actual text. When Acrobat does not recognise the text, it generally means each page in the PDF is a scanned image of the page from the original document. In that case Acrobat has no idea there is text in the document; it sees only an image. This is especially common with older documents, where the original electronic file may no longer be accessible or is in a format which cannot be converted into PDF. Having each page stored as an image has two major implications. Firstly, it generally makes the PDF file quite large, since images take up far more space than the equivalent text. Secondly, you cannot search the PDF using the Acrobat Find/Search function, as Acrobat cannot see any text in the document.
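For readers comfortable with a little scripting, the difference between a text PDF and a scanned one can be roughly sketched in Python. This is only a heuristic under stated assumptions: it looks for the `/Font` and `/Subtype /Image` markers in the raw bytes of the file, and the function name `classify_pdf_bytes` and its return labels are made up for illustration. It can be fooled, for example by PDFs that keep their objects in compressed streams, or by scans that already carry a hidden OCR text layer.

```python
import re

# Heuristic markers, searched for in the raw (uncompressed) bytes of the file.
FONT_MARKER = re.compile(rb"/Font\b")               # font resource, implies real text
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")  # embedded raster image


def classify_pdf_bytes(data: bytes) -> str:
    """Return a rough guess about whether a PDF holds text or only images.

    Hypothetical helper for illustration; the labels are not a standard.
    """
    # A PDF header should appear near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return "not-pdf"
    if FONT_MARKER.search(data):
        return "has-text-resources"
    if IMAGE_MARKER.search(data):
        return "probably-image-only"
    # Nothing visible: objects may be compressed, so we cannot tell.
    return "unknown"
```

To try it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to `classify_pdf_bytes`. A result of "probably-image-only" suggests the document is a candidate for OCR.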
So, back to your question regarding OCR software. This should be possible, as you are simply feeding the OCR software an existing file on your computer rather than scanning a paper document first. The only trick will be finding an OCR application which can load PDF files. I found some free software called SimpleOCR (www.simpleocr.com) which does almost exactly what you are looking for, except that it only accepts TIFF, BMP or JPEG files, not PDFs. However, if any other readers are looking for software to convert graphic files into text, it is definitely worth a look and the price is right!
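If you have a folder of mixed files and want to know which ones an image-only OCR tool would accept, the file type can be checked from the first few bytes of each file rather than trusting the extension. The following is a minimal sketch, assuming the accepted formats are the TIFF, BMP and JPEG mentioned above; the names `sniff_format` and `needs_conversion` are hypothetical and not part of any OCR product.

```python
# Well-known leading byte signatures for the formats discussed here.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"BM", "bmp"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]

# Assumption: formats an image-only OCR tool accepts, per the text above.
ACCEPTED_BY_IMAGE_OCR = {"tiff", "bmp", "jpeg"}


def sniff_format(header: bytes) -> str:
    """Guess the file format from its first few bytes."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def needs_conversion(header: bytes) -> bool:
    """True if the file must be converted to an image before OCR."""
    return sniff_format(header) not in ACCEPTED_BY_IMAGE_OCR
```

In use, you would read the first eight bytes of each file in binary mode and pass them to `needs_conversion`. A PDF reports True, meaning its pages would first have to be exported as images before a tool limited to TIFF, BMP or JPEG could read them.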
Searching Google for the term ‘pdf ocr’ brought up quite a few results relating to converting image-only PDF files to editable text. Unfortunately, some of these products are prohibitively expensive (US$1000). However, one I found was Able2Extract Professional (www.investintech.com/prod_a2e_pro.htm), which claims it can convert image-only PDFs into editable text. It is more reasonably priced (US$119) and a free trial is available for download. I have not tried any of these products, so I can’t offer any recommendations. I suggest you have a look around and search Google yourself. Before purchasing any software, I strongly recommend downloading a trial version to check that it meets your requirements. If no free trial is available, I would not even consider purchasing the product, since it would be a gamble whether it does what you need.