Text Processing with the PyPDF2 Library

In the name of God, the Most Gracious, the Most Merciful

Published: July 11, 2020

pip install PyPDF2

import PyPDF2

Next, open the PDF file in binary mode

myfile = open('filename.pdf', mode='rb')

Create a PDF reader object

# PdfReader replaces the older PdfFileReader name, which PyPDF2 3.0 removed
pdf_reader = PyPDF2.PdfReader(myfile)

To find the number of pages in the current PDF file

len(pdf_reader.pages)

# read the first page (pages are indexed from 0) and print its text
page_one = pdf_reader.pages[0]
print(page_one.extract_text())
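The same call can be repeated for every page of the file. The following is only a minimal sketch and not part of the original walkthrough: extract_all_text and read_pdf_text are hypothetical helper names, and the code assumes a PyPDF2 version that provides PdfReader, the pages list and extract_text() (2.x or later).

```python
def extract_all_text(reader):
    """Return the text of every page of an open reader, in page order."""
    # extract_text() may give an empty result for image-only pages,
    # so fall back to an empty string and keep one entry per page.
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_text(path):
    """Open the PDF at `path` and return a list with one string per page."""
    import PyPDF2  # imported lazily so extract_all_text works with any reader-like object

    with open(path, mode="rb") as pdf_file:
        return extract_all_text(PyPDF2.PdfReader(pdf_file))
```

With these helpers, read_pdf_text('filename.pdf')[0] would hold the text of the first page.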

# create a page object and extract its text
pageObj = pdf_reader.pages[0]
page1 = pageObj.extract_text()
page1

# strip away the page header (here, the first 25 characters of the text)
page1 = page1[25:]

# insert commas to separate the variables, then remove the remaining newlines
page1 = page1.replace('\n \n', ', ').replace('\n', '')
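To see what the two cleaning steps do, here is a small sketch on a made-up string. The helper name clean_page_text, its header_length parameter and the sample text are assumptions for illustration only. The right header length depends on the PDF being processed.

```python
def clean_page_text(raw_text, header_length=25):
    """Drop a fixed-length header, then turn blank-line separators into commas."""
    body = raw_text[header_length:]
    # '\n \n' marks the gap between two values; any newline left over is removed
    return body.replace('\n \n', ', ').replace('\n', '')


# Made-up text: a 7-character header ("HEADER\n") followed by three values
sample = "HEADER\nName\n \nAge\n \nCity\n"
print(clean_page_text(sample, header_length=7))  # prints: Name, Age, City
```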

myfile.close()

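To write to a file

# Re-open the source PDF so its page data can still be read while the new file is written
with open('filename.pdf', mode='rb') as source, open('New updated file.pdf', mode='wb') as pdf_output:
    pdf_writer = PyPDF2.PdfWriter()
    pdf_writer.add_page(PyPDF2.PdfReader(source).pages[0])
    pdf_writer.write(pdf_output)
    page = pdf_writer.pages[0].extract_text()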

# check whether the PDF is password-protected
print(pdf_reader.is_encrypted)
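If the check reports an encrypted file, the reader has to be unlocked before its pages can be read. The sketch below is an assumption-laden illustration rather than part of the original post: unlock_if_needed is a hypothetical helper, and it relies on the reader's decrypt() method returning a falsy value (0) when the password does not match.

```python
def unlock_if_needed(reader, password):
    """Decrypt `reader` in place when it is encrypted, then return it."""
    if reader.is_encrypted:
        # decrypt() returns 0 (falsy) when the password is rejected
        if not reader.decrypt(password):
            raise ValueError("wrong password for encrypted PDF")
    return reader
```

A call such as unlock_if_needed(pdf_reader, 'secret') would leave an unencrypted reader untouched; 'secret' is a placeholder password.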

The tabula-py library

pip install tabula-py

import tabula
# url: the path or URL of the PDF file to read
df = tabula.io.read_pdf(url, pages='all')

read_pdf returns a list of tables, each one a pandas DataFrame. Access a table by its index, just as you would an element of a list. Example:

# first table found in the PDF
df[0]

More info: https://pypi.org/project/tabula-py/
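Since the result is a list, it can also be looped over to get a quick overview before picking a table. This is a minimal sketch; summarize_tables is a hypothetical helper, and it only assumes that each item has a shape attribute, as a pandas DataFrame does.

```python
def summarize_tables(tables):
    """Return (index, rows, columns) for each table in the list."""
    return [
        (index, table.shape[0], table.shape[1])
        for index, table in enumerate(tables)
    ]
```

For example, looping over summarize_tables(df) and printing each tuple shows which index holds the table you want.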

The Camelot library

!pip install "camelot-py[cv]"
!apt install python3-tk ghostscript

import camelot

tables = camelot.read_pdf('file.pdf', pages='1,2,4-5')

# To display the i-th table as a pandas DataFrame (i is a zero-based index)
tables[i].df
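Each Camelot table also carries a parsing_report dictionary with an accuracy score, which can help skip badly parsed tables. The sketch below is an illustration under assumptions: accurate_tables is a hypothetical helper and the 90.0 threshold is an arbitrary choice, not a recommended value.

```python
def accurate_tables(tables, min_accuracy=90.0):
    """Return the DataFrames of tables whose reported accuracy meets the threshold."""
    return [
        table.df
        for table in tables
        if table.parsing_report["accuracy"] >= min_accuracy
    ]
```

Calling accurate_tables(tables) on the result of camelot.read_pdf would then give a plain list of DataFrames.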

https://camelot-py.readthedocs.io/en/master/user/install-deps.html
