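Today I came across this topic on Zhihu.
In recent years the search term 【pdf转word】 (PDF to Word) has been trending upward in the Baidu Index. Because of how the PDF format works, converting a PDF back to Word with 100% fidelity is basically impossible. I searched GitHub and found a library, pdf2docx, that supports batch conversion.
Features
pdf2docx supports the following when converting PDF to docx: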
- [x] Font style, e.g. font name, size, weight, italic, and color
Because it also parses table content and formatting/style, it can double as a tool for extracting table contents.
Limitations
Installation
pip3 install pdf2docx
Data
The project folder contains a data folder. Let's look at the sample files inside it:
import os
os.listdir('data')
['demo-table.pdf', 'demo-image.pdf', 'demo-table-lattice.pdf', 'demo-table-close-underline.pdf', 'demo-path-transformation.pdf', 'demo-text-scaling.pdf', 'demo-text-alignment.pdf', 'demo-text-unnamed-fonts.pdf', 'demo-table-shading-highlight.pdf', 'demo-table-lattice-one-cell.pdf', 'demo.pdf', 'demo-image-cmyk.pdf', 'demo-text.pdf', 'demo-table-weird.pdf', 'demo-image-vector-graphic.pdf', 'demo-image-transparent.pdf', 'demo-table-nested.pdf', 'demo-blank.pdf', 'demo-table-shading.pdf', 'demo-table-border-style.pdf', 'demo-table-bottom.pdf', 'demo-table-align-borders.pdf', 'demo-table-stream.pdf']
Converting to docx
Convert a PDF to docx and save the result in the project's output folder.
parse(pdf_file, docx_file, start, end, pages)
- pages: list of page numbers to convert; optional, defaults to None
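The example call in this post, parse(pdf_file, docx_file, start=0, end=1), converts only the first page, which suggests zero-based page indexes with an exclusive end. Below is a minimal sketch of that selection logic. It additionally assumes (this is not stated in the post) that pages, when given, takes precedence over start and end. `select_page_indexes` is a hypothetical helper written for illustration and is not part of pdf2docx.

```python
def select_page_indexes(page_count, start=0, end=None, pages=None):
    """Hypothetical model of which zero-based pages get converted.

    Assumption: an explicit `pages` list wins over `start`/`end`;
    otherwise the half-open range [start, end) is used.
    """
    if pages:
        return [int(p) for p in pages]
    stop = page_count if end is None else end
    # Slicing a range clips out-of-bounds values instead of raising
    return list(range(page_count)[start:stop])


print(select_page_indexes(5, start=0, end=1))   # first page only
print(select_page_indexes(5, pages=[0, 2]))     # explicit page list
```

If the library behaves differently for your version, treat the sketch only as a way to reason about the three parameters and check the pdf2docx documentation.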
Test
demo-text.pdf looks like this when opened, as shown in the figure below:
from pdf2docx import parse
pdf_file = 'data/demo-text.pdf'
docx_file = 'output/demo-text.docx'
# Convert the first page of the PDF to docx
parse(pdf_file, docx_file, start=0, end=1)
Processing Pages: 1/1...
--------------------------------------------------
Terminated in 0.11046225800009779s.
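Since the library is introduced here for batch conversion, the single-file call above can be wrapped in a loop over the data folder. The following is a minimal sketch, not the author's code: `batch_convert` and `convert_with_pdf2docx` are hypothetical names, and the only pdf2docx call used is parse(pdf_file, docx_file) as shown in this post.

```python
from pathlib import Path


def batch_convert(src_dir, dst_dir, convert):
    """Call convert(pdf_path, docx_path) for every *.pdf in src_dir.

    `convert` is any callable taking two string paths. Returns the list
    of docx paths that were passed to it, in sorted file-name order.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        docx = dst / (pdf.stem + ".docx")
        convert(str(pdf), str(docx))
        written.append(str(docx))
    return written


def convert_with_pdf2docx(pdf_file, docx_file):
    # Imported lazily so batch_convert itself does not require pdf2docx
    from pdf2docx import parse
    parse(pdf_file, docx_file)


# Usage (needs pdf2docx installed and the data folder from this post):
# batch_convert("data", "output", convert_with_pdf2docx)
```

Passing the converter in as a callable keeps the folder-walking logic separate from the library call, so a single failing file can later be handled by wrapping `convert` in a try/except without touching the loop.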
Extracting tables
extract_tables(pdf_file, start, end, pages)
- pages: list of page numbers to convert; optional, defaults to None
Test
demo-table.pdf is shown in the figure below:
from pdf2docx import extract_tables
pdf_file = 'data/demo-table.pdf'
tables = extract_tables(pdf_file, start=0, end=1)
for table in tables:
    print(table)
Run
Processing Pages: 1/1...
[[' ', 'Method / Attribute ', 'Description '], ['1 ', 'Document.pageCount ', 'the number of pages (int) \nthe metadata (dict) '], ['2 ', 'Document.metadata ', None], ['3 ', 'Document.getTo\nC() \nDocument.loadP\nage() \nread a Page read \na Page read a \nPage read a \nPage read a \nPage read a \nPage read a \nPage read a \nPage read a \nPage ', None], ['4 ', None, None]]
[['Input ', None, None, None, None, None], ['Description A ', 'mm ', '30.34 ', '35.30 ', '19.30 ', '80.21 '], ['Description B ', '1.00 ', '5.95 ', '6.16 ', '16.48 ', '48.81 '], ['Description C ', '1.00 ', '0.98 ', '0.94 ', '1.03 ', '0.32 '], ['Description D ', 'kg ', '0.84 ', '0.53 ', '0.52 ', '0.33 '], ['Description E ', '1.00 ', '0.15 ', None, None, None], ['Description F ', '1.00 ', '0.86 ', '0.37 ', '0.78 ', '0.01 ']]
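This could be useful for extracting variables from reports stored as PDFs, and there is probably a lot more you could do with it, for example:
- Combine it with pandas to save the tables extracted from a PDF to Excel for later analysis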
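The pandas idea could look roughly like the sketch below. It works on the list-of-rows structure printed in the output above, where cells carry trailing spaces and some are None. `clean_table` and `tables_to_excel` are hypothetical helpers, and writing .xlsx files assumes that pandas and an Excel engine such as openpyxl are installed.

```python
def clean_table(table):
    """Strip surrounding whitespace from text cells and turn None into ''."""
    return [
        ["" if cell is None else str(cell).strip() for cell in row]
        for row in table
    ]


def tables_to_excel(tables, xlsx_path):
    """Write each extracted table to its own sheet of one workbook.

    Assumes `tables` is non-empty and that pandas plus an Excel engine
    (e.g. openpyxl) are available.
    """
    import pandas as pd  # imported lazily; only needed when writing Excel

    with pd.ExcelWriter(xlsx_path) as writer:
        for i, table in enumerate(tables, start=1):
            df = pd.DataFrame(clean_table(table))
            # No header/index: the first table row is kept as ordinary data
            df.to_excel(writer, sheet_name=f"table_{i}", index=False, header=False)


# Usage with the result of extract_tables from this post:
# tables_to_excel(tables, "output/demo-table.xlsx")
```

The rows are written without promoting the first row to a header because, as the second table above shows, a header row can consist mostly of empty cells, which would produce duplicate column names.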
Code
Project files download link: https://pan.baidu.com/s/1Wkkz1z8VRHQmvTS8qv2sJQ Password: 4omx