分享

Convert PDF to docx and docx to PDF using Python

 黄爸爸好 2023-10-19 发布于上海

Convert PDF to docx and docx to PDF using Python

In this article we will explore how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Table of contents


Introduction

In one of our tutorials explaining how to work with PDF files in Python, and specifically how to extract tables from PDF files, we focused on PDF files with tables. However, in a lot of instances, we don’t want to only export certain parts of the PDF, rather than convert the whole PDF file to docx to allow for editing. In this article we will see how to easily and efficiently perform this conversion.

To continue following this tutorial we will need the following Python libraries: pdf2docx and docx2pdf.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:

pip install pdf2docx
pip install docx2pdf

PowerShell

Copy


Sample files

In order to follow the examples shown below, you will need to have a PDF file to make the conversion to docx format.

The file I will be using for this article is here. And you can also download it from below:

sample.pdfDownload

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:


For the docx to PDF conversion, you will need a sample docx file which you can download below:

inputDownload

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:


How to convert PDF files to docx format using Python

Using pdf2docx library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.


Convert all pages from PDF file to docx format using Python

Method 1:

First step is to import the required dependencies:

from pdf2docx import Converter

Python

Copy

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to convert PDF file to .docx:

cv = Converter(pdf_file)cv.convert(docx_file)cv.close()

Python

Copy

And you should see the sample.docx created in the same directory.


Method 2:

First step is to import the required dependencies:

from pdf2docx import parse

Python

Copy

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to convert PDF file to .docx:

parse(pdf_file, docx_file)

Python

Copy

And you should see the sample.docx created in the same directory.


Convert a single page from PDF file to docx format using Python

First step is to import the required dependencies:

from pdf2docx import Converter

Python

Copy

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to define a list with the numbers of pages you would like to be converted. In our case, the PDF file has 2 pages and let’s say we would like to print the first one. We would define a pages_list and assign the index number of the page (first page is index 0).

pages_list = [0]

Python

Copy

Fourth step is to convert PDF file specific pages to .docx:

cv = Converter(pdf_file)cv.convert(docx_file, pages=pages_list)cv.close()

Python

Copy


How to convert docx files to PDF format using Python

Using docx2pdf library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

First step is to import the required dependencies:

from docx2pdf import convert

Python

Copy

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

docx_file = 'input.docx'pdf_file = 'output.pdf'

Python

Copy

Third step is to convert the .docx file to PDF:

convert(docx_file, pdf_file)

Python

Copy

And you should see the output.pdf created in the same directory.


Conclusion

In this article we explored how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python for PDF tutorials.

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多