Tutorial

The philosophy of redstork is to map its API onto standard, well-understood Python objects such as list and dict.

In this tutorial we will use the following sample document.

Version

The redstork module exposes two version values: the PDFium build version and the Python package version:

import redstork

redstork.__pdfium_version__
>> 'chromium/4097'

redstork.__version__
>> '0.0.1'

Document

Document is the top-level object, and the only object that can be instantiated directly:

from redstork import Document

doc = Document('sample.pdf')

len(doc)
>> 15

As you can see, Document resembles a standard Python list containing Page objects.

PDF file creators can attach arbitrary key-value strings to the document, which we call meta (the official PDF specification calls it the Document Information Dictionary). Most commonly these values describe the Author, the Title, and the name of the software that created the document. Let's look at the meta in our sample:

doc.meta['Title']
>> 'Red Stork'

You can change meta content and save the updated document:

doc.meta['Title'] = 'Awesome PDF parsing library'
doc.save('awesome.pdf')
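
Since meta is presented as a dict-like object, ordinary dictionary idioms should carry over. The following is a minimal sketch, not part of redstork: summarize_meta is a hypothetical helper, and it assumes that meta supports the standard .get() method, which this tutorial does not confirm. A plain dict stands in for doc.meta so the example runs without a PDF file. The key names are standard entries of the Document Information Dictionary.

```python
# Hypothetical helper, not part of redstork.
# Assumption: `meta` supports dict-style .get(), as a plain dict does.
COMMON_KEYS = ('Title', 'Author', 'Creator', 'Producer')

def summarize_meta(meta, keys=COMMON_KEYS):
    """Return 'Key: value' lines for the requested keys that have a value."""
    lines = []
    for key in keys:
        value = meta.get(key)
        if value:  # skip missing or empty entries
            lines.append('%s: %s' % (key, value))
    return lines

# A plain dict stands in for doc.meta here.
sample_meta = {'Title': 'Red Stork', 'Producer': ''}
print(summarize_meta(sample_meta))  # prints ['Title: Red Stork']
```

With a real document the call would presumably be summarize_meta(doc.meta).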

Document has a lazily populated collection of fonts. Initially this collection is empty; it fills in as pages are accessed and parsed:

list(doc[0])  # read all objects from page 1
len(doc.fonts)
>> 2

Page

Page represents a PDF page. Get a page by indexing a Document object, just like a normal list:

page = doc[0]
page.crop_box
>> (0.0, 0.0, 612.0, 792.0)

Page has a Page.label attribute holding the page label (such as xxi or 128):

doc[2].label  # this is the label of the third page
>> 'i'
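
Labels and list indices are different things, so it can help to build a lookup from one to the other. The sketch below uses only len(), indexing, and .label, all of which appear above. pages_by_label is a hypothetical helper, and the stand-in document with its labels is invented so the example runs without a PDF file.

```python
from types import SimpleNamespace

def pages_by_label(doc):
    """Map each page label to the zero-based index of the first page that has it."""
    index = {}
    for i in range(len(doc)):
        # setdefault keeps the first index if a label repeats
        index.setdefault(doc[i].label, i)
    return index

# Stand-in for a Document: a list of objects exposing .label (labels are made up).
fake_doc = [SimpleNamespace(label=text) for text in ('Cover', 'Contents', 'i', 'ii', '1')]
lookup = pages_by_label(fake_doc)
print(lookup['i'])  # prints 2
```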

A page of a PDF document is a list-like object containing concrete instances of PageObject:

page = ...
len(page)  # how many objects on this page?
>> 17
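
Because both Document and Page support len(), the two can be combined to survey a whole file. This is a minimal sketch: objects_per_page is a hypothetical helper, and nested plain lists stand in for a real document.

```python
def objects_per_page(doc):
    """Return a list with the number of page objects on each page."""
    return [len(doc[i]) for i in range(len(doc))]

# Stand-in for a Document: a list of "pages", each a list of placeholder objects.
fake_doc = [['a', 'b', 'c'], [], ['d']]
counts = objects_per_page(fake_doc)
print(counts, sum(counts))  # prints [3, 0, 1] 4
```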

PageObject

Abstract class PageObject describes an object on a PDF page. The concrete classes covered in this tutorial are TextObject, PathObject, and ImageObject.

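A notable property of all objects is type, which identifies the kind of object (for example PageObject.OBJ_TYPE_TEXT or PageObject.OBJ_TYPE_IMAGE).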

TextObject

Text object represents a string of characters. Each character is a three-tuple of (charcode, x, y), where charcode is a character code (this value is just an index into the font glyph table, not the text corresponding to the character!). x and y are the placement coordinates of the character in the coordinate system of this TextObject; the first character typically has x, y == 0, 0.

Text object has a font property. Here is how to use the font to extract the text of a TextObject:

from redstork import PageObject  # assumed to be importable from the package, like Document

def text_of(o):
    assert o.type == PageObject.OBJ_TYPE_TEXT, o
    text = []
    for c, x, y in o:
        text.append(o.font[c])
    return ''.join(text)

page = ...
for o in page:
    if o.type == PageObject.OBJ_TYPE_TEXT:
        text = text_of(o)
        print(text)
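
The same lookup can keep the placement coordinates alongside the decoded text, which is useful when layout matters. The sketch below illustrates the idea with stand-ins: a dict plays the role of the font look-up table and a list of tuples plays the role of a TextObject. chars_with_positions is a hypothetical helper, and the character codes and coordinates are invented.

```python
def chars_with_positions(text_object, font):
    """Return a (text, x, y) tuple for each character of a text object."""
    return [(font[code], x, y) for code, x, y in text_object]

# Stand-ins: the codes are arbitrary table indices, not Unicode code points.
fake_font = {36: 'R', 72: 'e', 71: 'd'}
fake_text_object = [(36, 0.0, 0.0), (72, 7.2, 0.0), (71, 12.5, 0.0)]

placed = chars_with_positions(fake_text_object, fake_font)
print(''.join(text for text, _, _ in placed))  # prints Red
```

With a real object the call would presumably be chars_with_positions(o, o.font).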

PathObject

Path object represents a set of vector drawing instructions.

ImageObject

Image object represents an embedded bitmap image. You can get the pixel width and height of the image using the properties ImageObject.pixel_width and ImageObject.pixel_height.

Example:

page = ...
for o in page:
    if o.type == PageObject.OBJ_TYPE_IMAGE:
        print(o.pixel_width, o.pixel_height)
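
The pixel dimensions make it easy to pick out a particular image, for example the largest one on a page. This is a sketch under stated assumptions: largest_image is a hypothetical helper, the type constant is passed in as a parameter so that the example does not depend on redstork, and the stand-in objects and type codes are invented.

```python
from types import SimpleNamespace

def largest_image(page, image_type):
    """Return the image object with the most pixels, or None if there are no images."""
    images = [o for o in page if o.type == image_type]
    if not images:
        return None
    return max(images, key=lambda o: o.pixel_width * o.pixel_height)

# Stand-ins: TEXT and IMAGE are made-up type codes, not redstork constants.
TEXT, IMAGE = 1, 3
fake_page = [
    SimpleNamespace(type=TEXT),
    SimpleNamespace(type=IMAGE, pixel_width=100, pixel_height=50),
    SimpleNamespace(type=IMAGE, pixel_width=640, pixel_height=480),
]
big = largest_image(fake_page, IMAGE)
print(big.pixel_width, big.pixel_height)  # prints 640 480
```

With a real page the call would presumably be largest_image(page, PageObject.OBJ_TYPE_IMAGE).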

Font

Font object is a look-up table for character text, and also holds character glyphs (shapes).

Font names in a PDF file have a special prefix. To get a human-friendly name, use Font.simple_name().

Document contains a lazy font collection, Document.fonts. It is lazy because it is empty just after a document is opened. As pages are accessed and parsed, the collection is populated.

Here is how to get a Glyph object:

page = ...
for o in page:
    if o.type == PageObject.OBJ_TYPE_TEXT:
        for code, _, _ in o:
            glyph = o.font.load_glyph(code)
            print('Character with code %d has %d glyph instructions' % (code, len(glyph)))
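
A text object usually repeats the same character codes, so one possible refinement is to load each glyph only once per code. The sketch below is illustrative only: glyph_sizes is a hypothetical helper, FakeFont and FakeTextObject are invented stand-ins, and whether load_glyph already caches its results internally is not stated in this tutorial.

```python
def glyph_sizes(text_object):
    """Map each distinct character code to its number of glyph instructions."""
    sizes = {}
    for code, _, _ in text_object:
        if code not in sizes:  # load each glyph only once
            sizes[code] = len(text_object.font.load_glyph(code))
    return sizes

# Stand-ins so the example runs without redstork.
class FakeFont:
    def __init__(self):
        self.calls = 0  # counts load_glyph invocations

    def load_glyph(self, code):
        self.calls += 1
        return ['instruction'] * (code % 5 + 1)  # arbitrary fake glyph data

class FakeTextObject(list):
    def __init__(self, chars, font):
        super().__init__(chars)
        self.font = font

font = FakeFont()
obj = FakeTextObject([(7, 0, 0), (8, 5, 0), (7, 10, 0)], font)
print(glyph_sizes(obj), font.calls)  # prints {7: 3, 8: 4} 2
```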