PyPDF2 is a pure-Python package that you can use for many different types of PDF operations.

pyPdf, PyPDF2, and PyPDF4

The original pyPdf
package was released way back in 2005. The last official release of pyPdf was in 2010. After a lapse of around a year, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backwards compatible with the original and worked quite well for several years, with its last release being in 2016.

Next came a package called PyPDF3, and then the project was renamed to PyPDF4. All of these projects do pretty much the same thing, but the biggest difference between pyPdf and PyPDF2+ is that the latter added Python 3 support. There is a separate Python 3 fork of the original pyPdf, but that one has not been maintained for many years.

Although PyPDF2
was recently abandoned, the new PyPDF4 does not have full backwards compatibility with PyPDF2. Most of the examples in this article will work perfectly fine with PyPDF4, but there are some that will not, which is why PyPDF4 is not featured more heavily in this article. Feel free to swap out the imports for PyPDF2 with PyPDF4 and see how it works for you.

pdfrw: An Alternative

There is also a package called pdfrw
that can do many of the same things that PyPDF2 does. You can use pdfrw for all of the same sorts of tasks that you will learn how to do in this article for PyPDF2, with the notable exception of encryption. The biggest difference with pdfrw is that it integrates with the ReportLab package, so you can take a preexisting PDF and build a new one with ReportLab using some or all of the preexisting PDF.

Installing PyPDF2
can be done with pip, or with conda if you happen to be using Anaconda instead of regular Python.

You install PyPDF2 with pip in the usual way. The install is quick because PyPDF2 does not have any dependencies. You will likely spend as much time downloading the package as you will installing it.

You can use PyPDF2
to extract metadata and some text from a PDF. This can be useful when you’re doing certain types of automation on your preexisting PDF files.

The sample file for this example is reportlab-sample.pdf.

You import PdfFileReader from the PyPDF2 package. PdfFileReader is a class with several methods for interacting with PDF files. In this example, you call .getDocumentInfo(), which returns an instance of DocumentInformation. This contains most of the information that you’re interested in. You also call .getNumPages() on the reader object, which returns the number of pages in the document.

The information variable has several instance attributes that you can use to get the rest of the metadata you want from the document. You print out that information and also return it for potential future use.

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.
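The metadata steps described above can be sketched roughly as follows. This is a minimal sketch that assumes the legacy PyPDF2 API (releases before 3.0, where PdfFileReader and the camelCase methods exist). The function names extract_information() and summarize(), and the dictionary keys, are illustrative choices rather than anything taken from the article.

```python
# Attribute names exposed by PyPDF2's DocumentInformation object.
METADATA_FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize(information, number_of_pages):
    """Collect the metadata attributes into a plain dictionary."""
    # getattr() with a default keeps this safe when a PDF has no metadata
    # (information is None) or when a field is missing.
    summary = {name: getattr(information, name, None) for name in METADATA_FIELDS}
    summary["number_of_pages"] = number_of_pages
    return summary


def extract_information(pdf_path):
    """Print and return a metadata summary for the PDF at pdf_path."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader  # legacy API, assumed PyPDF2 < 3.0

    with open(pdf_path, "rb") as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        # Read the attributes while the file is still open.
        summary = summarize(information, number_of_pages)

    print(f"Information about {pdf_path}:")
    for key, value in summary.items():
        print(f"  {key}: {value}")
    return summary
```

Calling extract_information("reportlab-sample.pdf") would print one line per field. Returning a plain dictionary instead of the DocumentInformation object is a design choice made here so the result stays usable after the file is closed.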
To rotate pages with PyPDF2, you import PdfFileWriter in addition to PdfFileReader because you will need to write out a new PDF. rotate_pages() takes in the path to the PDF that you want to modify. Within that function, you will need to create a writer object that you can name pdf_writer and a reader object called pdf_reader.

Next, you use .getPage() to get the desired page. Here you grab page zero, which is the first page. Then you call the page object’s .rotateClockwise() method and pass in 90 degrees. Then for the second page, you call .rotateCounterClockwise() and pass it 90 degrees as well.

The PyPDF2 package only allows you to rotate a page in increments of 90 degrees. You will receive an AssertionError otherwise.

After each rotation, you call .addPage(). This will add the rotated version of the page to the writer object. The last page that you add to the writer object is the third page, without any rotation done to it.

Finally, you write out the new PDF using .write(). It takes a file-like object as its parameter. This new PDF will contain three pages. The first two will be rotated in opposite directions of each other and be in landscape, while the third page is a normal page.
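The rotation walkthrough above might look like the sketch below. It assumes the legacy PyPDF2 API (before 3.0) and an input PDF with at least three pages. The output_path parameter, its default value, and the check_angle() helper are hypothetical additions for illustration.

```python
def check_angle(angle):
    """Reject angles that are not multiples of 90 before calling PyPDF2."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle


def rotate_pages(pdf_path, output_path="rotate_pages.pdf"):
    """Write a three-page PDF with the first two pages rotated."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumed PyPDF2 < 3.0

    pdf_writer = PdfFileWriter()
    with open(pdf_path, "rb") as source:
        pdf_reader = PdfFileReader(source)

        # First page (index 0): rotate 90 degrees clockwise.
        page_1 = pdf_reader.getPage(0).rotateClockwise(check_angle(90))
        pdf_writer.addPage(page_1)

        # Second page (index 1): rotate 90 degrees counterclockwise.
        page_2 = pdf_reader.getPage(1).rotateCounterClockwise(check_angle(90))
        pdf_writer.addPage(page_2)

        # Third page (index 2): added unchanged.
        pdf_writer.addPage(pdf_reader.getPage(2))

        # Write while the source file is still open, because page
        # content is read lazily from it.
        with open(output_path, "wb") as fh:
            pdf_writer.write(fh)
```

The check_angle() guard is optional. It raises a ValueError with a readable message instead of relying on the library's AssertionError.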
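You can use merge_pdfs() when you have a list of PDFs that you want to merge together. You will also need to know where to save the result, so this function takes a list of input paths and an output path.

For each input, you create a reader, loop over its pages, and have the writer use .addPage() to add each of those pages to itself.

A command line interface for this function could be built with the argparse module.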
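A merge function along the lines described above, together with a small argparse front end, could be sketched like this. It assumes the legacy PyPDF2 API (before 3.0). The parameter names paths and output, the build_parser() helper, and the merged.pdf default are illustrative assumptions.

```python
import argparse
from contextlib import ExitStack


def merge_pdfs(paths, output):
    """Concatenate the PDFs listed in paths into a single file at output."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumed PyPDF2 < 3.0

    pdf_writer = PdfFileWriter()
    # ExitStack keeps every input open until the merged file is written,
    # then closes them all.
    with ExitStack() as stack:
        for path in paths:
            source = stack.enter_context(open(path, "rb"))
            pdf_reader = PdfFileReader(source)
            for page in range(pdf_reader.getNumPages()):
                pdf_writer.addPage(pdf_reader.getPage(page))

        with open(output, "wb") as out:
            pdf_writer.write(out)


def build_parser():
    """Build a command line parser for merge_pdfs()."""
    parser = argparse.ArgumentParser(description="Merge PDF files.")
    parser.add_argument("inputs", nargs="+", help="PDF files to merge, in order")
    parser.add_argument("-o", "--output", default="merged.pdf",
                        help="path of the merged PDF")
    return parser
```

A script could then call args = build_parser().parse_args() followed by merge_pdfs(args.inputs, args.output).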
You can use PyPDF2 to split your PDF into multiple files.
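One plausible way to do the split is to write each page to its own file. The sketch below assumes that one-file-per-page strategy and the legacy PyPDF2 API (before 3.0). The names split(), name_of_split, and split_output_name() are illustrative, not taken from the text.

```python
def split_output_name(name_of_split, page_index):
    """Build the output file name for one page, e.g. 'part0.pdf'."""
    return f"{name_of_split}{page_index}.pdf"


def split(path, name_of_split):
    """Write every page of the PDF at path to its own single-page file."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumed PyPDF2 < 3.0

    with open(path, "rb") as source:
        pdf = PdfFileReader(source)
        for page in range(pdf.getNumPages()):
            # A fresh writer per page keeps each output to a single page.
            pdf_writer = PdfFileWriter()
            pdf_writer.addPage(pdf.getPage(page))

            with open(split_output_name(name_of_split, page), "wb") as out:
                pdf_writer.write(out)
```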
You can also use PyPDF2 to watermark your documents. You need to have a PDF that only contains your watermark image or text.

create_watermark() accepts three arguments:

- input_pdf: the PDF file path to be watermarked
- output: the path you want to save the watermarked version of the PDF
- watermark: a PDF that contains your watermark image or text

In the code, you open the watermark PDF and take its first page as watermark_page. Then you create a reader object for input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

The next step is to iterate over the pages in input_pdf. This is where the magic happens. You will need to call .mergePage() and pass it the watermark_page. When you do that, it will overlay the watermark_page on top of the current page. Then you add that newly merged page to your pdf_writer object.
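The watermarking steps above can be sketched as follows, again assuming the legacy PyPDF2 API (before 3.0). Taking the watermark from the first page of the watermark PDF is an assumption of this sketch.

```python
def create_watermark(input_pdf, output, watermark):
    """Overlay the first page of watermark on every page of input_pdf."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumed PyPDF2 < 3.0

    with open(watermark, "rb") as wm_file, open(input_pdf, "rb") as in_file:
        # Assumption: the watermark lives on the first page of its PDF.
        watermark_page = PdfFileReader(wm_file).getPage(0)

        pdf_reader = PdfFileReader(in_file)
        pdf_writer = PdfFileWriter()

        for page_number in range(pdf_reader.getNumPages()):
            page = pdf_reader.getPage(page_number)
            # mergePage() draws the watermark on top of the current page.
            page.mergePage(watermark_page)
            pdf_writer.addPage(page)

        # Write while both source files are still open.
        with open(output, "wb") as out:
            pdf_writer.write(out)
```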
Here is how PyPDF2 handles encryption.

PyPDF2 currently only supports adding a user password and an owner password to a preexisting PDF. In PDF land, an owner password will basically give you administrator privileges over the PDF and allow you to set permissions on the document. On the other hand, the user password just allows you to open the document.

PyPDF2 doesn’t actually allow you to set any permissions on the document even though it does allow you to set the owner password.

add_encryption() takes in the input and output PDF paths as well as the password that you want to add to the PDF. It then opens a PDF writer and a reader object, as before. Since you will want to encrypt the entire input PDF, you will need to loop over all of its pages and add them to the writer.

The final step is to call .encrypt(), which takes the user password, the owner password, and whether or not 128-bit encryption should be added. The default is for 128-bit encryption to be turned on. If you set it to False, then 40-bit encryption will be applied instead.
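A sketch of add_encryption() along the lines described above is shown here. It assumes the legacy PyPDF2 API (before 3.0), whose .encrypt() accepts user_pwd, owner_pwd, and use_128bit. Leaving owner_pwd as None is a choice made for this sketch.

```python
def add_encryption(input_pdf, output_pdf, password):
    """Copy input_pdf to output_pdf, protected with a user password."""
    # Imported here so that loading this sketch does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumed PyPDF2 < 3.0

    pdf_writer = PdfFileWriter()
    with open(input_pdf, "rb") as source:
        pdf_reader = PdfFileReader(source)

        # Encrypting the whole document means copying every page.
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

        # use_128bit=True is the default; False selects 40-bit encryption.
        pdf_writer.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)

        with open(output_pdf, "wb") as out:
            pdf_writer.write(out)
```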
The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and leverage its capabilities to help you do your job better!

You may also want to keep an eye on the PyPDF4 package, as it will likely replace PyPDF2 soon. You might also want to check out pdfrw, which can do many of the same things that PyPDF2 can do.

pip
is the preferred installer program. Starting with Python 3.4, it is included by default with the Python binary installers.

venv is the standard tool for creating virtual environments, and has been part of Python since Python 3.3. Starting with Python 3.4, it defaults to installing pip into all created virtual environments.

virtualenv is a third party alternative (and predecessor) to venv. It allows virtual environments to be used on versions of Python prior to 3.4, which either don’t provide venv at all, or aren’t able to automatically install pip into created environments.

distutils is the original build and distribution system first added to the Python standard library in 1998. While direct use of distutils is being phased out, it still laid the foundation for the current packaging and distribution infrastructure, and it not only remains part of the standard library, but its name lives on in other ways (such as the name of the mailing list used to coordinate Python packaging standards development).

The use of venv
is now recommended for creating virtual environments.

When a version specifier uses comparator operators such as >, < or some other special character that gets interpreted by the shell, the package name and the version should be enclosed within double quotes.
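The quoting rule matters because an unquoted > is treated by the shell as output redirection. As a small illustration, the standard library's shlex.quote() adds quotes only when a requirement contains shell-special characters. Note that it emits single quotes, which POSIX shells treat like the double quotes mentioned above for this purpose. SomePackage is a placeholder name and pip_install_line() is a hypothetical helper.

```python
import shlex


def pip_install_line(requirement):
    """Return a shell-safe 'python -m pip install' command line."""
    # shlex.quote() leaves safe strings untouched and wraps anything
    # containing shell-special characters in single quotes.
    return "python -m pip install " + shlex.quote(requirement)


# '>' would redirect output if left unquoted, so it gets quoted.
print(pip_install_line("SomePackage>=1.0.4"))
# '==' is not special to the shell, so no quoting is added.
print(pip_install_line("SomePackage==1.0.4"))
```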
More information and resources regarding pip and its capabilities can be found in the Python Packaging User Guide.

Creation of virtual environments is done through the venv module. Installing packages into an active virtual environment uses the commands shown above.

How do I install pip in versions of Python prior to Python 3.4?

Python only started bundling pip with Python 3.4. For earlier versions, pip needs to be “bootstrapped” as described in the Python Packaging User Guide.

Passing the --user
option to python -m pip install will install a package just for the current user, rather than for all users of the system.
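When pip is driven from a Python script, the same per-user install can be expressed as an argument list. The sketch below only builds the command. user_install_command() is a hypothetical helper and SomePackage is a placeholder.

```python
import sys


def user_install_command(package):
    """Build the argument list for a per-user pip install."""
    # sys.executable ensures pip runs under the same interpreter
    # that is running this script.
    return [sys.executable, "-m", "pip", "install", "--user", package]


command = user_install_command("SomePackage")
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) would perform the installation. It is left out here so that the sketch has no side effects.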
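Some packages have complex binary dependencies and aren’t currently easy to install using pip directly. At this point in time, it will often be easier for users to install these packages by other means rather than attempting to install them with pip.

When several versions of Python are installed, use the versioned Python commands in combination with the -m switch to run the appropriate copy of pip.

Appropriately versioned pip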
commands may also be available.

On Windows, use the py Python launcher in combination with the -m switch.
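To make the two invocation styles concrete, the hypothetical helper below builds the command for a given Python version: a versioned interpreter name such as python3 on POSIX systems, or the py launcher with a version flag on Windows. The function name, the windows flag, and SomePackage are illustrative assumptions.

```python
def pip_command(python_version, *pip_args, windows=False):
    """Build a 'run pip with this Python version' argument list."""
    if windows:
        # The py launcher selects the interpreter with a -X.Y style flag.
        interpreter = ["py", f"-{python_version}"]
    else:
        # POSIX systems usually provide versioned names such as python3.
        interpreter = [f"python{python_version}"]
    return [*interpreter, "-m", "pip", *pip_args]


print(pip_command("3", "install", "SomePackage"))
print(pip_command("3.4", "install", "SomePackage", windows=True))
```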
It is possible that pip does not get installed by default. One potential fix is to bootstrap it with the standard library ensurepip module.

With the introduction of support for the binary wheel format, and the ability to publish wheels for at least Windows and Mac OS X through the Python Package Index, this problem is expected to diminish over time, as users are more regularly able to install pre-built extensions rather than needing to build them themselves.

Some of the solutions for installing software that is not yet available as pre-built wheel files may also help with obtaining other binary extensions without needing to build them locally.