PDF to TEXT Conversion using Python
- November 9, 2022
- Posted by: Code and Debug
- Category: Mini Python Project

PDF, or Portable Document Format, is one of the most common document-sharing formats. A single file can contain text, images, tables, and forms, which makes extracting data from it tedious.
In this post we will use the PyPDF2 library to extract the text from a PDF file and save it as a .txt file.
Step 1 - Creating functions to pick a PDF file and build the output path
We will use the tkinter module to show a file picker dialog, where the user picks a PDF file and its path is returned to us.
Let us start by importing the pieces of tkinter we need, along with the os module that the path helpers below rely on.
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename
Let us write some configuration that restricts the dialog to PDF files only.
FILEOPENOPTIONS = dict(defaultextension=".pdf",
                       filetypes=[('pdf file', '*.pdf')])
Next, we create a function pickFilePath, which opens the dialog, lets the user choose a PDF file, and returns its path.
def pickFilePath():
    # Hide the empty root window so that only the file dialog is shown
    Tk().withdraw()
    filename = askopenfilename(**FILEOPENOPTIONS)
    return filename
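One case worth guarding against: if the user closes the dialog without choosing a file, askopenfilename returns an empty value instead of a path. Below is a minimal sketch of a check that could be applied to the returned value. The helper name ensurePdfSelected and its error messages are hypothetical and not part of the original script.

```python
import os


def ensurePdfSelected(path):
    """Return path unchanged if it looks like a chosen PDF, else raise ValueError.

    Hypothetical helper: the name and messages are illustrative only.
    """
    # A cancelled dialog yields an empty value ('' or an empty tuple).
    if not path:
        raise ValueError("No file was selected.")
    # Compare the extension case-insensitively so 'REPORT.PDF' is accepted.
    if os.path.splitext(path)[1].lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {path}")
    return path
```

It could be used as pdfPath = ensurePdfSelected(pickFilePath()), so that a cancelled dialog stops the script with a clear message rather than failing later when the file is opened.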
Next, we create a function getBASEDir, which gives us the absolute path of the folder in which we are going to store our result.
def getBASEDir():
    return os.path.realpath("conversionResult")
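Next, we create a function getTXTPath, which builds the output path by joining BASEDIR with the PDF file name, with its extension changed to .txt.
def getTXTPath(pdfpath):
    # splitext drops only the final extension, so names such as "my.pdf.notes.pdf" stay intact
    name = os.path.splitext(os.path.basename(os.path.normpath(pdfpath)))[0]
    return os.path.join(BASEDIR, name + ".txt")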
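For comparison, the same output path can be sketched with pathlib, whose stem attribute drops only the final extension regardless of its case. The function name txtPathFor and the explicit baseDir parameter (instead of the global BASEDIR) are assumptions made for illustration.

```python
from pathlib import Path


def txtPathFor(pdfpath, baseDir):
    # Path(...).stem is the file name without its last suffix:
    # "report.v2.PDF" -> "report.v2".
    return str(Path(baseDir) / (Path(pdfpath).stem + ".txt"))
```

Passing the base folder as an argument means the function does not depend on a global being set before it is called.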
Step 2 - Executing above functions
We now call these functions in order to start our flow.
pdfPath = pickFilePath()
# If the conversionResult folder does not exist yet, create it
if not os.path.isdir("conversionResult"):
    os.mkdir("conversionResult")
BASEDIR = getBASEDir()
txtpath = getTXTPath(pdfPath)
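The check-then-create step above can also be written as a single call, since os.makedirs accepts an exist_ok flag. This is a sketch only, and the function name prepareOutputDir is hypothetical.

```python
import os


def prepareOutputDir(name="conversionResult"):
    # exist_ok=True makes the call a no-op when the folder is already there,
    # so no separate isdir() check is needed.
    os.makedirs(name, exist_ok=True)
    # Return the absolute path, as getBASEDir does.
    return os.path.realpath(name)
```

With this helper, BASEDIR = prepareOutputDir() would both create the folder if needed and return its path.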
Step 3 - Opening and extracting text using PyPDF2
Before going any further, a reminder that we first need to install PyPDF2 using pip. To do so, open a command prompt and run the following command.
pip install PyPDF2
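We now create a file object by opening the PDF file in rb (read binary) mode and pass it to PdfReader. Older PyPDF2 releases called this class PdfFileReader; that name, along with numPages, getPage and extractText, was removed in PyPDF2 3.0, so the code below uses the current names.
We then calculate the number of pages in the PDF selected by the user.
import PyPDF2
# Creating file object with pdfobj
pdfobj = open(pdfPath, 'rb')
# Passing file object to PdfReader
pdfread = PyPDF2.PdfReader(pdfobj)
# Getting number of pages
x = len(pdfread.pages)
Finally, we extract the text page by page and write it into the output file.
# Open the output file once; 'w' avoids appending to the output of an earlier run
with open(txtpath, 'w', encoding='utf-8') as f:
    for i in range(x):
        pageObj = pdfread.pages[i]
        f.write(pageObj.extract_text())
# Close the PDF once all pages have been read
pdfobj.close()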
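Text extraction can come back empty for pages that hold only scanned images, and the page boundaries are lost once everything is written into one file. The sketch below covers just the writing step and is kept independent of PyPDF2, so it accepts any sequence of page strings. The name writePages and the form-feed separator are assumptions for illustration, not part of the original script.

```python
def writePages(pageTexts, txtpath, separator="\f"):
    """Write page strings to txtpath and return how many pages had text."""
    nonEmpty = 0
    # "w" starts from a clean file; utf-8 avoids platform-default encodings.
    with open(txtpath, "w", encoding="utf-8") as f:
        for index, text in enumerate(pageTexts):
            # Put the separator between pages, not before the first one.
            if index > 0:
                f.write(separator)
            # Treat a missing result (None) as an empty page.
            text = text or ""
            if text.strip():
                nonEmpty += 1
            f.write(text)
    return nonEmpty
```

With the current PyPDF2 API it could be fed a generator such as (page.extract_text() for page in pdfread.pages). A return value of 0 would suggest the PDF contains no extractable text, which is typical of scanned documents.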
Full source code
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename

import PyPDF2

FILEOPENOPTIONS = dict(defaultextension=".pdf", filetypes=[("pdf file", "*.pdf")])


def pickFilePath():
    # Hide the empty root window so that only the file dialog is shown
    Tk().withdraw()
    filename = askopenfilename(**FILEOPENOPTIONS)
    return filename


def getBASEDir():
    return os.path.realpath("conversionResult")


def getTXTPath(pdfpath):
    # Swap the PDF file name's extension for .txt and place the result in BASEDIR
    name = os.path.splitext(os.path.basename(os.path.normpath(pdfpath)))[0]
    return os.path.join(BASEDIR, name + ".txt")


pdfPath = pickFilePath()
# If the conversionResult folder does not exist yet, create it
if not os.path.isdir("conversionResult"):
    os.mkdir("conversionResult")
BASEDIR = getBASEDir()
txtpath = getTXTPath(pdfPath)
# Creating file object with pdfobj
pdfobj = open(pdfPath, "rb")
# Passing file object to PdfReader (PdfFileReader in older PyPDF2 releases)
pdfread = PyPDF2.PdfReader(pdfobj)
# Getting number of pages
x = len(pdfread.pages)
# Open the output file once and write the text of each page in turn
with open(txtpath, "w", encoding="utf-8") as f:
    for i in range(x):
        pageObj = pdfread.pages[i]
        f.write(pageObj.extract_text())
pdfobj.close()
print(f"File saved in {txtpath}")
Sample PDF
Download sample PDF from here.