How to Read a PDF File, or Specific Pages of It, Using Python
This document shows how to extract data from a PDF file using Python code.
Before extracting data from a PDF file, let's first understand what Python is.
Python is an open-source programming language that is widely used for data analysis and machine learning. The software industry has many languages to choose from, and Python is a popular choice among developers.
It supports object-oriented programming and is both powerful and easy to learn.
To read, write, or modify a PDF file with Python, Python must first be installed.
To get the latest version of Python, see https://www.python.org/downloads/
We will use the third-party module PyPDF2, a Python library built as a PDF toolkit. It can perform tasks such as:
- Getting document information.
- Merging documents page by page.
- Splitting documents page by page.
- Cropping pages.
- Merging several pages into one.
- Encrypting and decrypting PDF files.
Once Python is installed, add the PyPDF2 package with the command below.
$ python3 -m pip install PyPDF2
To verify the installed version, use:
$ python3 -m pip show PyPDF2
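The same check can be made from inside Python. This is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library); the helper name installed_version is ours, chosen for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if PyPDF2 is installed, otherwise None.
print(installed_version("PyPDF2"))
```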
Implementation:
A PDF file has the file extension .pdf; PDF stands for Portable Document Format.
Python often has several libraries for the same task. Here we use PyPDF2 for working with PDF files, and the text-extraction examples that follow use the extract_text function from pdfminer (installed as the pdfminer.six package).
We will write code to read a sample single-page PDF file, XYZ.pdf.
File Reading
# Command (the extract_text function used in Step 1 comes from the pdfminer.six package)
$ python3 -m pip install pdfminer.six
Step 1. Write the Python code as below:
# Import the text-extraction function (provided by the pdfminer.six package)
from pdfminer.high_level import extract_text
# Name of the PDF file to read
file_name = "XYZ.pdf"
# Extract the text of every page and print it
text = extract_text(file_name)
print(text)
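PyPDF2 can also extract text by itself. The following is a rough sketch, assuming PyPDF2 3.x, where the reader class is PdfReader and each page object offers extract_text(); the function name read_page_text and its 1-based page_number parameter are our own, for illustration only.

```python
def read_page_text(pdf_path, page_number=1):
    """Return the text of one page (counted from 1) using PyPDF2."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")

    # Imported here so the argument check above runs even without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    if page_number > len(reader.pages):
        raise ValueError(f"the file has only {len(reader.pages)} page(s)")
    # reader.pages is zero-indexed, so page 1 is index 0.
    return reader.pages[page_number - 1].extract_text()
```

Calling read_page_text("XYZ.pdf") would return the text of the first page of the sample file. The extracted text may differ slightly from what pdfminer returns, since the two libraries handle layout differently.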
Reading Multiple Pages
There are several scenarios to consider: a single page, a list of pages, or a range of pages.
Step 2. Write the Python code for specific pages (page_numbers is zero-indexed, so the first page is 0):
# Continues from Step 1, where extract_text was imported
# Get text from the fourth page
text_4th_page = extract_text("XYZ.pdf", page_numbers=[3])
print(text_4th_page)
# Get text from the 3rd and 5th pages
text_nth_pages = extract_text("XYZ.pdf", page_numbers=[2, 4])
print(text_nth_pages)
# Get text from a range of pages (range(3) selects the first three pages)
text_range_pages = extract_text("XYZ.pdf", page_numbers=range(3))
print(text_range_pages)
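Because page_numbers counts from zero, it is easy to be off by one. The sketch below wraps the conversion in a small helper; to_zero_based and extract_pages are hypothetical names introduced here, not part of pdfminer.

```python
def to_zero_based(pages):
    """Convert human page numbers (1, 2, 3, ...) to zero-based indexes."""
    indexes = []
    for page in pages:
        if page < 1:
            raise ValueError(f"page numbers start at 1, got {page}")
        indexes.append(page - 1)
    return indexes


def extract_pages(pdf_path, pages):
    """Return the text of the given 1-based pages of a PDF file."""
    # Imported here so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=to_zero_based(pages))
```

With this helper, extract_pages("XYZ.pdf", [3, 5]) asks for the third and fifth pages, which corresponds to page_numbers=[2, 4] in the direct call.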
The output is the text of the PDF file, returned as a string.
In addition, we can perform other actions on PDF files, such as creating, concatenating/merging, and rotating them.
To create:
>>> from PyPDF2 import PdfWriter  # PdfFileWriter was removed in PyPDF2 3.0; PdfWriter replaces it
>>> pdf_writer = PdfWriter()
To merge or concatenate PDFs, use the PdfMerger class:
>>> from PyPDF2 import PdfMerger  # PdfFileMerger was removed in PyPDF2 3.0; PdfMerger replaces it
>>> pdf_merger = PdfMerger()
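Creating the merger object is only the first step. A minimal sketch of a complete concatenation follows, assuming PyPDF2 3.x (where the class is named PdfMerger); the function name merge_pdfs and the file names in the usage note are hypothetical.

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate PDF files, in order, into one output file."""
    input_paths = list(input_paths)
    if not input_paths:
        raise ValueError("at least one input file is required")

    # Imported here so the check above runs even without PyPDF2 installed.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    try:
        for path in input_paths:
            merger.append(path)  # add every page of this file at the end
        merger.write(output_path)
    finally:
        merger.close()
    return output_path
```

For example, merge_pdfs(["part1.pdf", "part2.pdf"], "combined.pdf") would write the pages of part1.pdf followed by those of part2.pdf into combined.pdf.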
Conclusion:
After completing this topic, we can see how simple the code for reading a PDF file in various ways is. Other file formats can be opened, read, and written in a broadly similar manner.