Extract Text From PDFs (Including Scanned Copies) – the Ubuntu Way

To extract all text from a PDF (including text inside images or scanned pages), we can combine Ghostscript with a command-line OCR tool called tesseract-ocr. First we need to convert the PDF to a TIFF image that the OCR tool can read. We need Ghostscript for that. It’s probably already installed on your system, but just to be sure you can run:
sudo apt-get install ghostscript

Once Ghostscript is installed, we can convert the PDF using the gs utility:

gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=Output_File_Name.tif Name_of_PDF.pdf

You will need to adjust the “Name_of_PDF.pdf” and “Output_File_Name.tif” from the above command for your purposes.
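
If you would rather not edit two file names by hand each time, the output name can be derived from the PDF name. Here is a minimal sketch, assuming the PDF name ends in a lowercase “.pdf”; the function names tiff_name_for and pdf_to_tiff are made up for this example and are not part of Ghostscript:

```shell
#!/bin/sh
# Hypothetical helper: turn "Name_of_PDF.pdf" into "Name_of_PDF.tif".
# Assumes a lowercase ".pdf" suffix; any other name just gets ".tif" appended.
tiff_name_for() {
    printf '%s.tif\n' "${1%.pdf}"
}

# Hypothetical wrapper around the same gs command as above,
# with the output file name filled in automatically.
pdf_to_tiff() {
    pdf=$1
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$(tiff_name_for "$pdf")" "$pdf"
}
```

After pasting these two functions into your shell, running pdf_to_tiff Name_of_PDF.pdf should write Name_of_PDF.tif next to the PDF.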

This leaves us with one large multi-page TIFF file (it may be about the same size as the PDF or considerably larger, depending on the quality of the scanned copy), which we will now run through OCR (Optical Character Recognition). We’re going to use “tesseract-ocr” for that, but we need to install it first:

sudo apt-get install tesseract-ocr tesseract-ocr-eng

The package “tesseract-ocr-eng” provides English language recognition and is REQUIRED for tesseract-ocr to work, no matter what locale your system uses. Support for other languages is available in packages named after the language code (not the country code), such as “tesseract-ocr-deu” for German.
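
To see which language packs are actually installed, tesseract has a --list-langs option. The sketch below wraps it in a small check; the function names lang_in_list and tesseract_has_lang are made up for this example, and the 2>&1 is there on the assumption that some tesseract versions print the list to stderr rather than stdout:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the language code in $1 appears
# as a complete line in the text read from stdin.
lang_in_list() {
    grep -qxF -- "$1"
}

# Hypothetical wrapper: ask tesseract for its language list and
# look for the requested code in it.
tesseract_has_lang() {
    tesseract --list-langs 2>&1 | lang_in_list "$1"
}
```

For example, tesseract_has_lang deu || echo "install tesseract-ocr-deu first" would remind you to install the German pack before using “-l deu”.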

Finally, let’s convert our big TIFF file to a TXT file containing all the text, including the text from images in the original PDF:

tesseract Output_File_Name.tif Name_of_TXT -l eng

Again, adjust “Output_File_Name.tif” to the filename you originally chose and substitute your desired TXT file name for “Name_of_TXT”, leaving out the .txt extension because tesseract appends it automatically. If your PDF isn’t in English, change “-l eng” to the language code of the language package you installed earlier, for example “-l deu” for German.
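
Both steps can also be chained so that one call takes a PDF all the way to a TXT file. This is only a sketch under a few assumptions: the function name ocr_pdf is made up, the PDF name is assumed to end in a lowercase “.pdf”, and the language defaults to eng when no second argument is given:

```shell
#!/bin/sh
# Hypothetical wrapper: run the gs and tesseract steps back to back.
# Usage: ocr_pdf FILE.pdf [LANGUAGE_CODE]
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}        # fall back to English if no language is given
    base=${pdf%.pdf}      # file name without the .pdf suffix

    # Step 1: PDF -> TIFF, same options as the gs command above.
    # Step 2 only runs if gs finished successfully.
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$base.tif" "$pdf" &&
    # Step 2: TIFF -> TXT (tesseract adds the .txt extension itself).
    tesseract "$base.tif" "$base" -l "$lang"
}
```

With this in place, ocr_pdf Name_of_PDF.pdf deu should leave Name_of_PDF.tif and Name_of_PDF.txt in the same directory as the PDF.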

That’s it. Check out the resultant TXT file.

Please note: the quality of extracted text from the images inside the PDF is highly dependent on the quality of the original PDF’s images.
