

- #Ocr tool to make pdf searchable how to
- #Ocr tool to make pdf searchable install
- #Ocr tool to make pdf searchable software
The process to convert it into a search PDF file is simpler. If the scanned file is in an image format, such as, tif, png, jpg. # export the searchable PDF to searchable.pdf Pdf = PyPDF2.PdfFileReader(io.BytesIO(page)) Page = pytesseract.image_to_pdf_or_hocr(image, extension='pdf') PdfFileReader: it creates a pdf reader objectĪddPage: it adds a page to pdf writer object images = convert_from_path('Receipt.pdf', poppler_path=poppler_path)

PdfFileWriter: it creates a pdf writer object for the new PDF file. Image_to_pdf_or_hocr: it converts an image into a searchable pdf page We can use the following functions to process all the pages with a for-loop.Ĭonvert_from_path: it converts all the scanned PDF pages into images. Usually, we have multiple scanned PDF pages in a single file. _cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'Ĭonvert scanned PDF page(s) to single search PDF Alternatively, you can run the following commands to directly include their paths in the Python program.
#Ocr tool to make pdf searchable install
For pytesseract, we will need to download and install Tesseract-OCR Engine.Īfter you download and install the software, you can add their executable paths into Environment Variables on your computer.
#Ocr tool to make pdf searchable software
This software functions similarly to pdftoppm and pdftocairo in a Linux system. For pdf2image, we will have to download the poppler for windows users.We would need additional software to use the libraries. io: It allows us to manage the file-related input and output.PyPDF2: It is a Python PDF toolkit, which is capable of splitting, cropping, merging PDF pages and more.pytesseract has the advantages of extracting text from PDF (such as preserving whitespaces between words) over other Python packages. It uses an OCR engine (namely, Google’s Tesseract-OCR Engine) to extract text from the image(s) instead of relying on underlying text and structure from PDF. pytesseract: Python-Tesseract is an optical character recognition (OCR) tool developed for Python.The produced output would be a list of image objects. pdf2image: It is a Python module that wraps pdftoppm and pdftocairo to convert a PDF to an image object.
#Ocr tool to make pdf searchable how to
In this article, I’m going to talk about how to turn scanned file(s) into searchable PDF programmatically using Python and Pytesseract.

More importantly, the underlying text would be useful for Natural Language Processing (NLP). Oftentimes, you would like to make scanned files or image files searchable in PDF, because it is much quicker and more convenient to search keywords with a searchable PDF than in native image format. LinkedIn logo for sharing a link Twitter logo for sharing a link Reddit logo for sharing a link
