  • Mon - Sat 11.00 am - 8:00 pm
  • 1st-29, Atlanta Business Hub, VIP Road, Surat
  • +91 97129 28220

PDF to TEXT Conversion using Python

  • November 9, 2022
  • Posted by: Code and Debug
  • Category: Mini Python Project Project

PDF (Portable Document Format) is one of the most common document-sharing formats. A PDF file can contain different elements such as text, images, tables, and forms. Because so much can be packed into a single file, extracting data from a PDF can be tedious.

In this post, we will use the PyPDF2 library to extract the text from a PDF file and save it to a text file.

Step 1 - Creating functions to pick PDF file and extract path

We will use the tkinter module to show a file picker dialog to the user. The user picks a PDF file, and its path is returned to us.

Let us start by importing the pieces of tkinter that the dialog needs, along with os, which the path functions below rely on.

import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename
				
			

Let us write some configuration that restricts the user to picking only PDF files.

				
FILEOPENOPTIONS = dict(defaultextension=".pdf",
                       filetypes=[("pdf file", "*.pdf")])
				
			

Next, we create a function pickFilePath, which opens the dialog, lets the user choose a PDF file, and returns its path.

				
def pickFilePath():
    # Hide the empty root window so that only the dialog is shown
    Tk().withdraw()
    filename = askopenfilename(**FILEOPENOPTIONS)
    return filename
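
One detail the dialog leaves open: when the user closes it without choosing a file, askopenfilename returns an empty value (an empty string, or an empty tuple on some platforms) rather than raising an error. Below is a minimal sketch of a check that could run on the returned value before any conversion starts. The helper name checkPickedPath is hypothetical and not part of the original script.

```python
import os


def checkPickedPath(path):
    """Return path unchanged if it looks like a usable PDF path.

    Hypothetical helper: raises ValueError when the dialog was cancelled
    (empty result) or the name does not end in .pdf.
    """
    if not path:
        raise ValueError("No file was selected")
    # splitext splits off the final extension; lower() makes ".PDF" match too
    if os.path.splitext(path)[1].lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {path}")
    return path
```

It could wrap the dialog call, for example pdfPath = checkPickedPath(pickFilePath()).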
				
			

Next, we create a function getBASEDir, which returns the absolute path of the folder in which we are going to store our result.

				
def getBASEDir():
    return os.path.realpath("conversionResult")
				
			

Next, we create a function getTXTPath, which builds the output path by joining BASEDIR with the PDF's file name, with the .pdf extension swapped for .txt.

				
def getTXTPath(pdfpath):
    # Take the file name only, then swap its final extension for .txt
    name = os.path.basename(os.path.normpath(pdfpath))
    return os.path.join(BASEDIR, os.path.splitext(name)[0] + ".txt")
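
The extension is worth handling with some care. The sketch below compares two ways of deriving the .txt name: os.path.splitext strips only the final extension, whereas str.replace removes every occurrence of ".pdf" in the name and misses an upper-case ".PDF". The function name txtNameFor is hypothetical and used only for illustration.

```python
import os


def txtNameFor(pdfpath):
    """Return the PDF's file name with its final extension swapped for .txt."""
    name = os.path.basename(os.path.normpath(pdfpath))
    stem, _ext = os.path.splitext(name)
    return stem + ".txt"


print(txtNameFor("reports/q3.pdf"))            # q3.txt
print(txtNameFor("reports/notes.pdf.v2.pdf"))  # notes.pdf.v2.txt
print(txtNameFor("reports/SCAN.PDF"))          # SCAN.txt

# For comparison, replace() strips every ".pdf" and ignores ".PDF"
print("notes.pdf.v2.pdf".replace(".pdf", "") + ".txt")  # notes.v2.txt
print("SCAN.PDF".replace(".pdf", "") + ".txt")          # SCAN.PDF.txt
```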
				
			

Step 2 - Executing above functions

We will now call each function in turn to start our flow.

				
pdfPath = pickFilePath()

# Stop early if the dialog was closed without picking a file
if not pdfPath:
    raise SystemExit("No PDF file selected")

# Create the conversionResult folder if it does not exist yet
os.makedirs("conversionResult", exist_ok=True)

BASEDIR = getBASEDir()

txtpath = getTXTPath(pdfPath)
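
The folder step is meant to be safe to repeat on every run. The sketch below exercises that in a temporary directory using os.makedirs with exist_ok=True (available since Python 3.2), which creates the folder when it is missing and does nothing otherwise. The helper name ensureResultDir is hypothetical.

```python
import os
import tempfile


def ensureResultDir(parent, name="conversionResult"):
    """Create parent/name if needed and return its absolute path."""
    target = os.path.join(parent, name)
    os.makedirs(target, exist_ok=True)  # no error if it already exists
    return os.path.realpath(target)


with tempfile.TemporaryDirectory() as tmp:
    first = ensureResultDir(tmp)
    second = ensureResultDir(tmp)  # second call finds the folder in place
    print(first == second, os.path.isdir(first))  # True True
```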
				
			

Step 3 - Opening and extracting text using PyPDF2

Before going any further, remember that PyPDF2 must be installed first using pip. To do so, open a command prompt and run the following command.

				
					pip install PyPDF2
				
			

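We will now create a file object by opening the PDF file in rb (read binary) mode and pass it to PdfReader. Older PyPDF2 releases called this class PdfFileReader and used numPages, getPage and extractText; those names were removed in PyPDF2 3.0, so we use the current ones here.

We then count the number of pages in the PDF selected by the user.

import PyPDF2

# Create a file object by opening the PDF in binary mode
pdfobj = open(pdfPath, "rb")

# Pass the file object to PdfReader
pdfread = PyPDF2.PdfReader(pdfobj)

# Get the number of pages
x = len(pdfread.pages)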
				
			

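Finally, we extract the text page by page and write it to the output file. The file is opened once in write mode, so running the script again replaces the earlier result instead of appending to it.

with open(txtpath, "w", encoding="utf-8") as f:
    for i in range(x):
        pageObj = pdfread.pages[i]
        f.write(pageObj.extract_text())

# Close the PDF once every page has been read
pdfobj.close()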
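
The page loop can also be tried without a real PDF. The sketch below assumes only that each page object offers an extract_text() method, as page objects in current PyPDF2 releases do. writePages and FakePage are hypothetical names used for illustration. It opens the output file once and writes each page's text in order, adding a newline after each page as a design choice so that pages do not run together.

```python
import os
import tempfile


class FakePage:
    """Stand-in for a PDF page; only mimics extract_text()."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def writePages(pages, txtpath):
    """Write the text of every page to txtpath and return the page count."""
    count = 0
    # Open once in write mode so a rerun replaces the file instead of appending
    with open(txtpath, "w", encoding="utf-8") as f:
        for page in pages:
            f.write(page.extract_text())
            f.write("\n")  # keep pages from running together
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "demo.txt")
    n = writePages([FakePage("first page"), FakePage("second page")], out)
    with open(out, encoding="utf-8") as f:
        print(n, f.read().splitlines())  # 2 ['first page', 'second page']
```

With a real reader object, the call would look like writePages(pdfread.pages, txtpath).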
				
			

Full source code

				
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename

import PyPDF2

FILEOPENOPTIONS = dict(defaultextension=".pdf", filetypes=[("pdf file", "*.pdf")])


def pickFilePath():
    # Hide the empty root window so that only the dialog is shown
    Tk().withdraw()
    filename = askopenfilename(**FILEOPENOPTIONS)
    return filename


def getBASEDir():
    return os.path.realpath("conversionResult")


def getTXTPath(pdfpath):
    # Take the file name only, then swap its final extension for .txt
    name = os.path.basename(os.path.normpath(pdfpath))
    return os.path.join(BASEDIR, os.path.splitext(name)[0] + ".txt")


pdfPath = pickFilePath()

# Stop early if the dialog was closed without picking a file
if not pdfPath:
    raise SystemExit("No PDF file selected")

# Create the conversionResult folder if it does not exist yet
os.makedirs("conversionResult", exist_ok=True)

BASEDIR = getBASEDir()

txtpath = getTXTPath(pdfPath)

# Create a file object by opening the PDF in binary mode
pdfobj = open(pdfPath, "rb")

# Pass the file object to PdfReader
pdfread = PyPDF2.PdfReader(pdfobj)

# Get the number of pages
x = len(pdfread.pages)

# Open the output once in write mode so a rerun replaces the old result
with open(txtpath, "w", encoding="utf-8") as f:
    for i in range(x):
        pageObj = pdfread.pages[i]
        f.write(pageObj.extract_text())

# Close the PDF once every page has been read
pdfobj.close()

print(f"File saved in {txtpath}")

				
			

Sample PDF

Download sample PDF from here.


About US

At Code & Debug, our mission is to continuously innovate the best ways to train the next generation of developers and to transform the way tech education is delivered.

Code & Debug was founded in 2020 to bridge the knowledge gap between colleges and industry. Founded by Anirudh Khurana, Code & Debug has a professional teaching faculty and a state-of-the-art learning platform for coding education.

Contact Us

  • info@codeanddebug.in