Release Notes

July 2021

This release was prompted by a request from Bob Nemec (HTS) to improve error handling when appending PDFs. Two errors were seen: objects that are referenced but missing, and streams with one extra byte.

The use case of appending PDFs is the topic of this release. Some internal structures were redesigned and the bugs are handled. Also, the performance for appending large files was improved.

Since the HTS systems run on Gemstone, the Gemstone version of the library was updated.

Two structural errors were discovered which need to be handled. To describe these errors in more detail, a new page, Monsters, was created to collect some observations from the wild.

Handling missing object errors

A reference points to a non-existent object (see Missing object for details). A MissingObject is created with the list of expected types, which allows useful error messages and the creation of dummy objects.

On writing, the MissingObject is written as a string saying that the object is missing. This preserves the references and leads to a TypeMismatch error on the next reading, which can be handled easily.

Handling incorrect stream length errors

The /Length of a stream is different from the number of bytes in the content (see Incorrect stream length for details). In our case, the stream content was always exactly one byte longer than stated by the /Length attribute. That last byte was probably not needed for the stream to be correct considering the filters applied, like /FlateDecode. This was checked for a few instances.

Therefore, a very specific error, ExtraCharacterInStreamError, is raised in this case and the extra byte is ignored (giving the /Length attribute priority). This error can be resumed meaningfully. On writing, /Length bytes are written to the content, dropping the extra byte.
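A minimal sketch of resuming this error from user code. It only uses the standard exception protocol (on:do:, resume, description) together with names that appear on this page. The file names are hypothetical, and it is an assumption that the error reaches user code at this point, because the library may already resume it internally.

```smalltalk
| doc |
doc := Document new.
"Assumption: ExtraCharacterInStreamError is visible here and signalled while the file is read"
[doc appendAllPagesFrom: (File read: 'scanned.pdf' asFilename)]
	on: ExtraCharacterInStreamError
	do: [:ex |
		"Log the problem, then continue; the extra byte is ignored"
		Transcript show: 'Ignoring extra stream byte: ', ex description; cr.
		ex resume].
doc saveAs: 'cleaned.pdf'.
```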

Document>>appendAllPagesFrom:

A PDF (all pages) can be appended efficiently to a PDF Document.

Document>>appendAllPagesFrom: aPDFtalkFile

All objects of the PDF to be appended are read from the file by resolving all references reachable from the Catalog. This happens with a protection against Type- and FileErrors, which can safely be resumed.

To concatenate some PDFs do:

| doc |
doc := Document new.
doc appendAllPagesFrom: (File read: 'file1.pdf' asFilename).
doc appendAllPagesFrom: (File read: 'file2.pdf' asFilename).
doc appendAllPagesFrom: (File read: 'file3.pdf' asFilename).
doc saveAs: 'file123.pdf'.

Raw objects

There is also a variant

Document>>appendAllRawPagesFrom: aPDFtalkFile

which reads all objects without typing. The objects are raw - generic Dictionary and Array objects. Note: the only purpose is to write out the PDF immediately, because nothing useful can be done with the raw objects.
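As a sketch, a pure concatenation could use the raw variant in a loop. This assumes that appendAllRawPagesFrom: can be used in the same way as appendAllPagesFrom: and that saveAs: works on the result; the file names are hypothetical.

```smalltalk
| doc |
doc := Document new.
"Raw objects are only copied through, so the document is written out right away"
#('file1.pdf' 'file2.pdf' 'file3.pdf') do: [:name |
	doc appendAllRawPagesFrom: (File read: name asFilename)].
doc saveAs: 'file123.pdf'.
```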

In VisualWorks, the raw version performs slightly faster (~ 5%) than the standard version with typing.

On Gemstone, the difference is much bigger: the raw version takes about 75% less time, i.e. it is 4 times faster. My guess is that Pragmas, with which the type annotations are implemented, are not efficient in Gemstone.

The user of the library is not affected by these changes.

Improving performance for large files

When reading many objects at once, the library was slow with large files. In this investigation, a few issues came up which were never a problem when clicking through objects one by one.

  • Object streams were created and initialized for each access to an object inside. Now, the streams are kept alive in a cache.
  • References from traversing the PDF objects were collected in an OrderedCollection. The visited check was done with this collection. Each check searches the collection linearly, so the total time grows quadratically with the number of collected objects and large files can become very slow. Now, for the visited check, a Set is used. The OrderedCollection for the collected references is kept to ensure a reproducible order.
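The second point can be sketched with plain collections. This is not the library's code: the graph of integer ids below is a hypothetical stand-in for PDF references. A Set answers the visited check, while an OrderedCollection records the order in which objects were first seen.

```smalltalk
| graph visited ordered todo |
"Hypothetical reference graph: each id maps to the ids it refers to"
graph := Dictionary new.
graph at: 1 put: #(2 3).
graph at: 2 put: #(3 4).
graph at: 3 put: #(1).
graph at: 4 put: #().
visited := Set new.
ordered := OrderedCollection new.
todo := OrderedCollection with: 1.
[todo isEmpty] whileFalse: [
	| id |
	id := todo removeFirst.
	"The Set lookup is hashed; searching the OrderedCollection would be linear"
	(visited includes: id) ifFalse: [
		visited add: id.
		ordered add: id.
		todo addAll: (graph at: id)]].
ordered "an OrderedCollection with 1 2 3 4"
```

The cycle from 3 back to 1 is cut by the visited check, and the result order stays the same on every run.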

Redesigned references and tracing

Objects are picked (read) from a PDF file stream when they are needed. Originally, this was done using blocks stored in place of the value (referent) of a reference. When the value is requested, the block is evaluated and the resulting PDF object is stored as the referent. The block reads the raw object and converts it to the proper type. This can be nested and several types may apply.

Unfortunately, the design with blocks does not allow deferring the typing. This led to problems where a general type took precedence over a more specific, better matching type. So, I reified the blocks to FileReference, which can read an object from file and has a list of types to be applied to the raw object. The types list is maintained to reflect the subtype order.

While at it, the number and generation of references was extracted to an ObjectId.

Changed internal streams to bytes

The Writer (internal write stream) now writes bytes instead of characters to produce the PDF file. Previously, when writing the physical file, the string was converted to a byte array to write the binary data. This copy is not needed anymore.

This release updates the Gemstone code for the library and also the PDFtalk for Gemstone page. The biggest addition is the PostScript module used with CMaps introduced in version 2.3.

Encoded PostScript sources

PostScript source methods (mainly cmaps and examples) are reencoded with ASCII85 to allow fileIn to Gemstone. Topas from Gemstone as well as PostScript use the % character at the beginning of a line for directives and comments. Since cmaps are PostScript programs, their source cannot be embedded directly without disturbing Gemstone.

Interestingly, I believe that Gemstone and PostScript share some early history which can also be seen in the way the dictionary stack is used in both.

Optional CMaps

The CMaps module is used to decode strings to Unicode. The library uses this when a font supplies a /ToUnicode attribute. In case you want to use this for Asian languages such as Japanese, Chinese or Korean, the standard CMap files for these languages are needed. There are 182 standard CMaps defined which are all needed when dealing with arbitrary PDFs. These CMap source files, in PostScript, are stored in the image and parsed by the PostScript interpreter on demand.

Since they are very big, there are two Gemstone source files: PDFtalk.gs (3.8 MB) and PDFtalkWithCMaps.gs (12.1 MB). Unless you do serious things with Asian text, the smaller one is recommended.

In VisualWorks 9.1, icons were renamed and changed. In order to use the library's UI in all versions, some icons were copied from older releases.

March 2021

Embedded OpenType(PS) fonts can now be used for the screen on Windows without having them installed.

aPage pageNumber finds the page number of aPage. This is interesting, because pages do not contain their number for a good reason: modifying the order of pages should not lead to modified pages all over the place. Therefore, there is no explicit place where the page number is stored. To find it, a page needs to ask its ancestors for its position.
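A rough sketch of the idea, not the library's implementation: the nodes below are hypothetical plain Dictionaries that mimic the /Parent, /Kids and /Count entries of a PDF page tree. The page number is found by walking up the tree and adding the page counts of all siblings to the left at each level.

```smalltalk
| makeNode root left p1 p2 p3 pageNumberOf |
"Hypothetical stand-in for a page tree node: #Parent, #Kids and #Count"
makeNode := [:parent :count |
	| node |
	node := Dictionary new.
	node at: #Kids put: OrderedCollection new.
	node at: #Count put: count.
	parent isNil ifFalse: [
		node at: #Parent put: parent.
		(parent at: #Kids) add: node].
	node].
root := makeNode value: nil value: 3.
left := makeNode value: root value: 2.
p1 := makeNode value: left value: 1.
p2 := makeNode value: left value: 1.
p3 := makeNode value: root value: 1.
"Walk up to the root; at each level add the counts of the siblings on the left"
pageNumberOf := [:page |
	| number node parent |
	number := 1.
	node := page.
	[(parent := node at: #Parent ifAbsent: [nil]) notNil] whileTrue: [
		| siblings |
		siblings := parent at: #Kids.
		1 to: (siblings identityIndexOf: node) - 1 do: [:i |
			number := number + ((siblings at: i) at: #Count)].
		node := parent].
	number].
pageNumberOf value: p3 "3"
```

Reordering the kids of a node changes the computed numbers without touching any page.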

Added cache for tabular glyph variants to improve performance.

January 2021

I worked on extracting content from PDFs for the Unsere Gelder project. While doing this, I worked freely in the library and in the experimental code for content recognition. As always when looking at many different PDFs, some bugs or things to improve in the UI turned up.

This messed up the base a bit and test cases were starting to fail.

With this release, everything is clean again: the code {PDFtalk Project} loads without warnings or undeclareds. All tests pass (almost; see [1]). The same goes for the [Report4PDF] package: it loads clean and all tests pass.

One functional enhancement is that text is now properly decoded for the UI.

Happy hacking

[1] There is a strange problem where one or two tests fail in a fresh image, but only the first time. After the first run, they all pass.

February 2020

Added PostScript to the PDFtalk runtime.

The package [PostScript] implements some low level methods which are used by PDFtalk.

PostScript was implemented after PDFtalk and used some basic methods of it (Number reading and writing, ASCII85 encoding and PostScript character names). These dependencies have been reversed so that PostScript can be used stand-alone while PDFtalk now depends on it. This also reflects the correct historical relationship.

Added CMap to the {PDFtalk Fonts} bundle.

CMaps are PostScript programs defining complex code mappings. The mechanism is very general and allows for variable byte length encodings. Because of their generality, CMaps are used by some PDF writers even to encode simple mappings. Hence, it is necessary to fully implement CMaps in order to decode PDF text.

This is not intended to be used by the user of the library. Rather, it is part of the basic font infrastructure enabling decoding of PDF strings. This will be the base for Text extraction in the next step.

Standard CMaps

PDF defines 181 standard CMaps which are to be understood by a conforming reader. These CMaps are available at GitHub. All maps have been imported as methods containing the source of the CMaps. Since they are rather large (16 MB with sources), it might be important not to load them into a runtime image when you don't need them, i.e. when your application does no text extraction.

Therefore, I put them into a separate package (outside of the runtime, but part of the project bundle): [PostScript CMap instances]. The CMaps are constructed from the source methods lazily when needed. If the package is not loaded, the source of a requested CMap is downloaded from GitHub, which is slower.
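The lazy construction can be pictured as a cache that is filled on first request. This is only a sketch with standard collection protocol: the block, the counter and the placeholder string standing in for a parsed CMap are hypothetical, and the real parsing by the PostScript interpreter is not shown.

```smalltalk
| cache parseCount cmapNamed |
cache := Dictionary new.
parseCount := 0.
"Hypothetical lookup: build the entry on first request, reuse it afterwards"
cmapNamed := [:name |
	cache at: name ifAbsent: [
		parseCount := parseCount + 1.
		"Placeholder for parsing the CMap source"
		cache at: name put: 'parsed ', name]].
cmapNamed value: 'UniJIS-UTF16-H'.
cmapNamed value: 'UniJIS-UTF16-H'.
parseCount "1, the second request is served from the cache"
```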

Known problem

The PDF specification allows bfchar-mappings to have a string of UTF-16BE characters as destination. This is not yet implemented.

Allow narrower types to shadow broader types

Example:

DecodeParms
	<type: #ZipFilterParameter>
	<type: #Dictionary>

ZipFilterParameter is a subtype of Dictionary. Because it is declared first, it is tried first. Previously, both alternatives were equal and Dictionary might have matched, even when the dictionary was a valid ZipFilterParameter.
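The effect of the declaration order can be sketched with plain Smalltalk. The matching predicates below are hypothetical and are not how the library decides whether a type applies; the point is only that the first matching candidate in declaration order wins.

```smalltalk
| candidates raw match |
"Candidates in declaration order, each with a hypothetical matching predicate"
candidates := OrderedCollection new.
candidates add: #ZipFilterParameter -> [:dict | dict includesKey: #Predictor].
candidates add: #Dictionary -> [:dict | true].
raw := Dictionary new.
raw at: #Predictor put: 12.
"detect: answers the first candidate whose predicate accepts the raw object"
match := candidates detect: [:each | each value value: raw].
match key "#ZipFilterParameter"
```

With an empty raw dictionary, the first predicate fails and the broader #Dictionary matches instead.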

Generalized Textstring to String

Textstring does not need to be differentiated. We can rely on VisualWorks' handling of multi-byte strings.

August 2019

Renamed OrderedDictionary to Valuemap in the [Values] package. This version replaces all references of OrderedDictionary.

PDFtalk now depends on the [Values] package with version 3.x and up and is incompatible with earlier versions.

July 2019

Flate encoding now uses the zlib of VW 8.1. This solved problems with allocating buffers under heavy load.

October 2017

Name: The new name is PDFtalk. The first version was called PDF4Smalltalk. The namespace changed from PDF to PDFtalk and the domain “pdftalk.de” provides a home with a wiki dedicated to the library: wiki.pdftalk.de.

Typing: The heart of the “PDF engine” is the typing system which allows the assignment of Smalltalk classes to raw PDF objects. The new version has a redesigned type system where PDF types are properly modeled independent from the Smalltalk class hierarchy. This allows renaming classes freely (e.g. adding prefixes) without affecting PDF types. Also, boxing of some simple objects like “null” and booleans is not necessary anymore. Instead, existing classes are declared as PDF types.

PDFtalk for Gemstone: The new release was triggered by a contract to port the library to Gemstone (thanks to HTS and Bob Nemec). A talk about this was held at ESUG 2017: “PDFtalk for Gemstone” (slides are here).

Gemstone Fileout: A VisualWorks to Gemstone translation tool. This tool, with project specific code transformation declarations, creates a Gemstone filein. It is used to create the Gemstone PDFtalkLibrary from the Values package and PDFtalk bundle.

Both new projects are open source under the MIT licence.

Some changes, described here, are incompatible with the previous version.

It is not recommended to load the new version into an image with the old version of the library.

Namespace and bundle structure

The former namespace PDF is renamed to PDFtalk.

The former independent bundle Fonts is now integrated into PDFtalk. The packages are all renamed with the PDFtalk prefix and the order and contents have been revised and changed.

The demos are now in class PDF in its own package [PDFtalk Demonstrations] in the {PDFtalk Testing} bundle. Do

PDF runAllDemos

to see if they are running. You may need to edit the file path to the PDF specification and to your demo directory.

Referencing PDF classes

Smalltalk classes representing a PDF type should not be referenced directly anymore. Instead an expression like

PDF classAt: <PDF type symbol>

should be used.

Example

(PDF classAt: #Contents) "returns class PDFtype.Contents"

Often used classes can be accessed through a shortcut method:

PDF String.     "returns class PDFtalk.PDFString"
PDF Array.      "returns class PDFtalk.PDFArray"
PDF Dictionary. "returns class PDFtalk.PDFDictionary"
PDF Stream.     "returns class PDFtalk.PDFStream"
PDF Page.       "returns class PDFtalk.Page"

There are two reasons for this:

  1. PDF type and Smalltalk class names may not be the same anymore.
  2. The Smalltalk class name may differ in different ports of the library.

New shared Smalltalk.PDF

The shared variable PDF was added to the Smalltalk namespace. The variable holds the class PDFtalk.PDF which serves as general entry point for the library.

Unless you extend the library, there should be no need to add the PDFtalk namespace to the imports of your project. Instead most functionality should be accessed through PDF in the Smalltalk namespace.

Aligned types

The new typing system made it possible to remove the PDF classes Null and Boolean. Instead, they are now implemented by the system classes UndefinedObject and Boolean.

Boxing (with asPDF) and unboxing (with asSmalltalkValue) are not necessary anymore for nil, true and false. Instead, the Smalltalk objects are used directly.

The work is not finished yet. Number and maybe String and Date could be aligned as well.

The major change is the redesign of the PDF typing system. Initially, I represented the types of PDF objects by Smalltalk classes with the same name. This turned out to be insufficient.

Firstly, PDF types form a different hierarchy than the Smalltalk classes implementing them. Both hierarchies cannot be represented in a single inheritance class hierarchy at the same time.

Secondly, the PDF type names are tied to class names. A class cannot be renamed without changing the PDF types. This has been a problem for porting the library to other Smalltalk dialects.

Therefore, PDF types are now modeled independently.

Notes

Specialization only on assignment

When PDF objects were created, all classes were searched for possible specializations: for example, a Dictionary with a #Type entry was automatically converted to its corresponding class.

In the new version, objects are only typed and specialized when they are assigned to an attribute of a Dictionary or Array.

System classes as PDF classes

The first version had wrappers for all basic types of PDF, such as null, booleans, numbers etc. This is similar to boxing of primitive types in other programming languages.

The motivation was to accurately implement any semantic differences to the corresponding Smalltalk objects in a proper place. This led to the widespread use of #asPDF and #asSmalltalkValue for boxing and unboxing.

Now I know that the differences are minor and that they could be properly implemented in the system classes. Therefore, I aligned some system classes with the PDF classes by declaring the system classes as PDF type and removing the PDF class: Null and Boolean. Next to go is Number; later maybe String and Date and…

This should lead to simpler code and more space and time efficient processing.

On the other hand, there are now a lot more root PDF classes. Therefore, I moved some of the general PDF behavior to Object.


  • releasenotes.txt
  • Last modified: 2021/07/29 20:10
  • by christian