Tutorial¶
The philosophy of redstork is to map API to standard and well undersood Python objects, like
list and dict.
In this tutorial we will use the following sample document.
Version¶
There are two version values in redstork module: PDFium build version, and Python package version:
import redstork
redstork.__pdfium_version__
>> 'cromium/4097'
redstork.__version__
>> '0.0.1'
Document¶
Document is the top-level object, and the only object that can be instantiated directly:
from redstork import Document
doc = Document('sample.pdf')
len(doc)
>> 15
As you can see, Document resembles standard Python list, containing Page objects.
PDF file creators can attach arbitraty key-value strings to the document, that we call meta (official
PDf specs call it Document Information Dictionary).
Most commonly these values describe Author, Title, and the name of software that created this
document. Lets see the meta in our sample:
doc.meta['Title']
>> 'Red Stork'
You can change meta content and save the updated document:
doc.meta['Title'] = 'Awesome PDF parsing library'
doc.save('awesome.pdf')
Document has a lazily populated collection of fonts. Initially this collection is empty. As pages are being accessed and parsed, this collection is being populated:
list(doc[0]) # read all objects from page 1
len(doc.fonts)
>> 2
Page¶
Page represents PDF page. Get page by indexing a Document object, just like a normal list:
page = doc[0]
page.crop_box
>> (0.0, 0.0, 612.0, 792.0)
Page has Page.label, representing the page label (like xxi, or 128):
doc[2].label # this is the label of the third page
>> 'i'
A page of PDF document is a list-like object, containing concrete instances of PageObject:
page = ...
len(page) # how many objects on this page?
>> 17
PageObject¶
Abstract class PageObject describes an object on a PDF page. Concrete classes implementing PageObject are:
TextObject- a string of charactersPathObject- vector graphicsImageObject- a bitmap imageShadingObject- a shading object
Notable properties of all objects are:
PageObject.page()- links back to the parent pagePageObject.matrix()- transformation matrix of this objectPageObject.rect()- rectangle of this object on the page
TextObject¶
Text object represents a string of characters. Each character is a three-tuple of (charcode, x, y), where
charcode is a character code (this value is just an index in the font glyph table, not a
text corresponding to this character!). x and y are placement coordinates of this character (in the
coordinate system of this TextObject - first character typically has x,y == 0, 0.
Text object has font property. Here is how to use font to extract text of a TextObject:
def text_of(o):
assert o.type == PageObject.OBJ_TYPE_TEXT, o
text = []
for c, x, y in o:
text.append(o.font[c])
return '.join(text)
page = ...
for o in page:
if o.type == PageObject.OBJ_TYPE_TEXT:
text = text_of(o)
print(text)
PathObject¶
Path object represents a set of vector drawing instructions.
ImageObject¶
Image object represents an embedded bitmap image. You can ge the pixel width and height of the image, using the properties
ImageObject.pixel_width() and ImageObject.pixel_height().
Example:
page = ...
for o in page:
if o.type == PageObject.OBJ_TYPE_IMAGE:
print(o.pixel_width, o.pixel_height)
Font¶
Font object is a look-up table for character text, and also holds character glyphs (shape).
Font names in PDF file have a special prefix. To get a human-friendly one use Font.simple_name().
Document contains a lazy font collection Document.fonts(). It is lazy, because just after document is opened,
it is empty. As pages are accessed and parsed, this collection is populated.
Here is how to get Glyph object:
page = ...
for o in page:
if o.type == PageObject.OBJ_TYPE_TEXT:
for code,_,_ in o:
glyph = o.font.load_glyph(code)
print('Character with code %d has %d glyph instructions', code, len(glyph))