sudo apt-get install ghostscript
Once we have ghostscript installed, we can convert the PDF to a TIFF image using the gs utility:
gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=Output_File_Name.tif Name_of_PDF.pdf
You will need to adjust the “Name_of_PDF.pdf” and “Output_File_Name.tif” from the above command for your purposes.
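If you have several PDFs, adjusting the names by hand gets tedious. The following is a minimal sketch, not a definitive script: the function names tiff_name_for and convert_all_pdfs are made up for illustration, and it assumes every PDF should get the same gs options as above, with the TIFF written next to the PDF.

```shell
# Hypothetical helper: map "scan.pdf" to "scan.tif".
tiff_name_for() {
    printf '%s\n' "${1%.pdf}.tif"
}

# Hypothetical helper: convert every *.pdf in the directory given as $1
# (default: the current directory), using the same gs options as above.
convert_all_pdfs() {
    for pdf in "${1:-.}"/*.pdf; do
        [ -e "$pdf" ] || continue   # the glob matched nothing
        gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
            -sOutputFile="$(tiff_name_for "$pdf")" "$pdf"
    done
}

# Usage: convert_all_pdfs ~/scans
```

The output name is derived with plain parameter expansion, so the sketch should work in any POSIX shell, not just bash.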
This will leave us with one TIFF file (about the same size as the PDF or considerably larger, depending on the quality of the scanned copy), which we will now run through OCR (Optical Character Recognition). We’re going to use “tesseract-ocr” for that, but we need to install it first:
sudo apt-get install tesseract-ocr tesseract-ocr-eng
The package “tesseract-ocr-eng” provides English language recognition and is REQUIRED for tesseract-ocr to work, no matter what locale your system uses. Support for other languages is available in packages named after the three-letter language code, such as “tesseract-ocr-deu” for German.
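To check which language packs tesseract can actually see, you can use its --list-langs option. The sketch below wraps that in a small test; has_lang is a hypothetical helper name, and it assumes each installed language code is printed on a line of its own.

```shell
# Hypothetical helper: succeeds if the language code given as $1
# appears as a whole line on standard input.
has_lang() {
    grep -qxF -- "$1"
}

# Usage (older tesseract versions print the list on stderr, hence 2>&1):
#   tesseract --list-langs 2>&1 | has_lang deu && echo "German is installed"
```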
Let’s finally convert our big TIFF file to a TXT file including all the text, even that from images in the original PDF:
tesseract Output_File_Name.tif Name_of_TXT -l eng
Again, adjust “Output_File_Name.tif” to the filename you chose earlier and substitute your desired TXT file name for “Name_of_TXT”. Leave out the .txt extension, because tesseract appends it automatically. If your PDF isn’t in English, change “-l eng” to the language code of the language package you installed earlier, for example “-l deu” for German.
That’s it. Check out the resultant TXT file.
Please note: the quality of extracted text from the images inside the PDF is highly dependent on the quality of the original PDF’s images.
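The steps above can be sketched as a single shell function. This is only an illustration under a few assumptions: run and pdf_to_txt are made-up names, the TIFF and TXT files are named after the PDF, and setting the hypothetical DRY_RUN variable to 1 prints the commands instead of running them.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical pipeline: PDF -> TIFF -> TXT.
# $1 is the PDF, $2 the tesseract language code (defaults to eng).
pdf_to_txt() {
    base=${1%.pdf}
    lang=${2:-eng}
    run gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
        -sOutputFile="$base.tif" "$1" &&
    run tesseract "$base.tif" "$base" -l "$lang"
}

# Usage: pdf_to_txt Name_of_PDF.pdf        (English)
#        pdf_to_txt Name_of_PDF.pdf deu    (German)
```

Trying it once with DRY_RUN=1 lets you check the generated file names before the slow conversion starts.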