Convert PDF to docx and docx to PDF using Python

黄爸爸好 2023-10-19 发布于上海

展开全文

Convert PDF to docx and docx to PDF using Python

In this article we will explore how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Table of contents

Introduction

In one of our tutorials explaining how to work with PDF files in Python, and specifically how to extract tables from PDF files, we focused on PDF files with tables. However, in a lot of instances, we don’t want to only export certain parts of the PDF, rather than convert the whole PDF file to docx to allow for editing. In this article we will see how to easily and efficiently perform this conversion.

To continue following this tutorial we will need the following Python libraries: pdf2docx and docx2pdf.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:

pip install pdf2docx
pip install docx2pdf

PowerShell

Copy

Sample files

In order to follow the examples shown below, you will need to have a PDF file to make the conversion to docx format.

The file I will be using for this article is here. And you can also download it from below:

sample.pdf Download

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:

For the docx to PDF conversion, you will need a sample docx file which you can download below:

input Download

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:

How to convert PDF files to docx format using Python

Using pdf2docx library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

Convert all pages from PDF file to docx format using Python

Method 1:

First step is to import the required dependencies:

from pdf2docx import Converter

Python

Copy

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to convert PDF file to .docx:

cv = Converter(pdf_file)cv.convert(docx_file)cv.close()

Python

Copy

And you should see the sample.docx created in the same directory.

Method 2:

First step is to import the required dependencies:

from pdf2docx import parse

Python

Copy

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to convert PDF file to .docx:

parse(pdf_file, docx_file)

Python

Copy

And you should see the sample.docx created in the same directory.

Convert a single page from PDF file to docx format using Python

First step is to import the required dependencies:

from pdf2docx import Converter

Python

Copy

pdf_file = 'sample.pdf'docx_file = 'sample.docx'

Python

Copy

Third step is to define a list with the numbers of pages you would like to be converted. In our case, the PDF file has 2 pages and let’s say we would like to print the first one. We would define a pages_list and assign the index number of the page (first page is index 0).

pages_list = [0]

Python

Copy

Fourth step is to convert PDF file specific pages to .docx:

cv = Converter(pdf_file)cv.convert(docx_file, pages=pages_list)cv.close()

Python

Copy

How to convert docx files to PDF format using Python

Using docx2pdf library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

First step is to import the required dependencies:

from docx2pdf import convert

Python

Copy

docx_file = 'input.docx'pdf_file = 'output.pdf'

Python

Copy

Third step is to convert the .docx file to PDF:

convert(docx_file, pdf_file)

Python

Copy

And you should see the output.pdf created in the same directory.

Conclusion

In this article we explored how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python for PDF tutorials.

本站是提供个人知识管理的网络存储空间，所有内容均由用户发布，不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息，谨防诈骗。如发现有害或侵权内容，请点击一键举报。

转藏分享

QQ空间 QQ好友新浪微博微信

献花（0） +1

来自：黄爸爸好 > 《OCR相关》

举报/认领

0条评论

发表

请遵守用户评论公约

类似文章 更多

黄爸爸好

关注对话

TA的最新馆藏

Al Agent--大模型时代重要落地方向
Llama 3技术剖析、微调、部署以及多模态训练
万用之道：上下文为王
麦肯锡工作术三部曲：《麦肯锡思考武器写作武器谈判武器》
今日arXiv最热NLP大模型论文：CMU最新综述：工具使用，大模型的神兵利器
北大发现了一种特殊类型的注意力头！

喜欢该文的人也喜欢更多

热门阅读换一换