Differences
This shows you the differences between two versions of the page.
releasenotes [2020/02/22 12:40] christian [CMaps] |
releasenotes [2021/07/29 17:58] christian [Internal changes] |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Release Notes ====== | ====== Release Notes ====== | ||
+ | ===== PDFtalk 2.5.0 ===== | ||
+ | |||
+ | July 2021 | ||
+ | |||
+ | This release was triggered by Bob Nemec from HTS to improve error handling when appending PDFs. Two errors were seen: objects referenced but missing and streams with one extra byte. | ||
+ | |||
+ | The use case of appending PDFs is the topic of this release. Some internal structures were redesigned and the bugs are handled. Also, the performance for appending large files was improved. | ||
+ | |||
+ | Since the HTS systems run on Gemstone, the Gemstone version of the library was updated. | ||
+ | |||
+ | ==== Error handling ==== | ||
+ | |||
+ | Two structural errors were discovered which need to be handled. For describing these errors in more detail, a new page [[monsters|Monsters]] was created to collect some observations from the wild. | ||
+ | |||
+ | === Handling missing object errors === | ||
+ | |||
+ | A reference pointing to a non-existent object (see [[monsters# | ||
+ | |||
+ | On writing, the MissingObject is written as a string saying that the object is missing. This preserves the references and leads to a TypeMismatch error on the next read, which can be handled easily. | ||
+ | |||
+ | === Handling incorrect stream length errors === | ||
+ | |||
+ | The ''/ | ||
+ | |||
+ | Therefore, a very specific error '' | ||
+ | ==== New APIs ==== | ||
+ | |||
+ | === Document>> | ||
+ | |||
+ | A PDF (all pages) can be appended efficiently to a PDF Document. | ||
+ | <code smalltalk> | ||
+ | |||
+ | All objects of the PDF to be appended are read from the file by resolving all references reachable from the '' | ||
+ | |||
+ | To concatenate some PDFs do: | ||
+ | <code smalltalk> | ||
+ | | doc | | ||
+ | doc := Document new. | ||
+ | doc appendAllPagesFrom: | ||
+ | doc appendAllPagesFrom: | ||
+ | doc appendAllPagesFrom: | ||
+ | doc saveAs: ' | ||
+ | </ | ||
+ | |||
+ | === Raw objects === | ||
+ | |||
+ | There is also a variant | ||
+ | <code smalltalk> | ||
+ | which reads all objects without typing. The objects are raw - generic '' | ||
+ | |||
+ | In '' | ||
+ | |||
+ | On '' | ||
+ | |||
+ | ==== Internal changes ==== | ||
+ | |||
+ | The user of the library is not affected by these changes. | ||
+ | |||
+ | === Improving performance for large files === | ||
+ | |||
+ | When reading many objects at once, the library was slow with large files. In this investigation, | ||
+ | |||
+ | * Object streams were created and initialized for each access to an object inside. Now, the streams are kept alive in a cache. | ||
+ | * References from traversing the PDF objects were collected in an OrderedCollection, and the visited check was done with this collection. Because each check searches the collection linearly, the total time grows quadratically with the number of collected objects, so large files can become very slow. Now, a Set is used for the visited check. The OrderedCollection for the collected references is kept to ensure a reproducible order. | ||
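
The visited check described in the second item can be sketched in plain Smalltalk. This is only an illustration under stated assumptions, not the library's code: the ''graph'' dictionary and its symbols are hypothetical stand-ins for PDF objects and their references.

<code smalltalk>
"Hypothetical sketch: symbols stand in for PDF objects.
 The dictionary maps each object to the objects it references."
| graph collected visited pending |
graph := Dictionary new.
graph at: #catalog put: #(#pages #info).
graph at: #pages put: #(#page1 #page2).
graph at: #page1 put: #(#font #pages).
graph at: #page2 put: #(#font).
graph at: #font put: #().
graph at: #info put: #().

collected := OrderedCollection new.	"keeps a reproducible order"
visited := Set new.	"hashed lookup for the visited check"
pending := OrderedCollection with: #catalog.
[pending isEmpty] whileFalse: [
	| current |
	current := pending removeFirst.
	(visited includes: current) ifFalse: [
		visited add: current.
		collected add: current.
		pending addAll: (graph at: current)]].
collected	"catalog, pages, info, page1, page2, font"
</code>

The Set answers ''includes:'' by hashing instead of scanning, while the OrderedCollection alone determines the order of the result. The back reference from ''#page1'' to ''#pages'' shows why the visited check is needed at all.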
+ | |||
+ | === Redesigned references and tracing === | ||
+ | |||
+ | Objects are picked (read) from a PDF file stream when they are needed. Originally, this was done using blocks stored in place of the value (referent) of a reference. When the value is requested, the block is evaluated and the resulting PDF object is stored as the referent. The block reads the raw object and converts it to the proper type. This can be nested and several types may apply. | ||
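
The original block-based approach can be illustrated with a minimal sketch. All names here (''loader'', ''resolve'', ''readCount'') are hypothetical and do not come from the library; a string stands in for the typed PDF object.

<code smalltalk>
"Hypothetical sketch of a lazily resolved referent.
 The loader block stands in for reading and typing an object from the file."
| loader referent resolved resolve readCount |
readCount := 0.
loader := [
	readCount := readCount + 1.
	'object read from the file'].
resolved := false.
resolve := [
	resolved ifFalse: [
		referent := loader value.	"evaluate the block on first request"
		resolved := true].
	referent].
resolve value.
resolve value.
readCount	"1: the loader ran only once"
</code>

The second request returns the stored referent without touching the file again, which is the point of deferring the read.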
+ | |||
+ | Unfortunately, | ||
+ | |||
+ | While at it, the number and generation of references were extracted to an '' | ||
+ | |||
+ | === Changed internal streams to bytes === | ||
+ | |||
+ | The '' | ||
+ | |||
+ | ==== Gemstone ==== | ||
+ | |||
+ | This release updates the Gemstone code for the library. The biggest addition is the [[postscript|PostScript]] module used with [[cmap|CMaps]] introduced in [[releasenotes# | ||
+ | |||
+ | === Encoded PostScript sources === | ||
+ | |||
+ | PostScript source methods (mainly cmaps and examples) are re-encoded with ASCII85 to allow fileIn to Gemstone. Both Topas from Gemstone and PostScript use the % character at the beginning of a line for directives and comments. Since cmaps are PostScript programs, their source cannot be embedded directly without disturbing Gemstone. | ||
+ | |||
+ | Interestingly, | ||
+ | |||
+ | === Optional CMaps === | ||
+ | |||
+ | The [[cmap|CMaps module]] is used to decode strings to Unicode. The library uses this when a font supplies a ''/ | ||
+ | |||
+ | Since they are very big, there are two Gemstone source files: **'' | ||
+ | ==== Other changes ==== | ||
+ | |||
+ | In VisualWorks 9.1, icons were renamed and changed. In order to use the library' | ||
+ | |||
+ | |||
+ | ===== PDFtalk 2.4.0 ===== | ||
+ | |||
+ | March 2021 | ||
+ | |||
+ | Embedded OpenType(PS) fonts can now be used for the screen on Windows without having them installed. | ||
+ | |||
+ | '' | ||
+ | |||
+ | Added cache for tabular glyph variants to improve performance. | ||
+ | ===== PDFtalk 2.3.5 ===== | ||
+ | |||
+ | January 2021 | ||
+ | |||
+ | I worked on extracting content from PDFs for the [[https:// | ||
+ | |||
+ | This messed up the base a bit and test cases were starting to fail. | ||
+ | |||
+ | With this release, everything is clean again: the code **{PDFtalk Project}** loads without warnings or undeclareds. All tests pass (almost; see [1]). | ||
+ | The same goes for the **[Report4PDF]** package. Loads clean, all tests pass. | ||
+ | |||
+ | One functional enhancement is that text is now properly decoded for the UI. "add Picture" | ||
+ | |||
+ | Happy hacking | ||
+ | |||
+ | |||
+ | [1] There is a strange problem where one or two tests fail in a fresh image, but only the first time. After the first run, they all pass. | ||
===== PDFtalk 2.3 ===== | ===== PDFtalk 2.3 ===== | ||
Line 6: | Line 129: | ||
==== PostScript ==== | ==== PostScript ==== | ||
- | Added [[PostScript]] | + | Added **[[PostScript]]** to the PDFtalk |
The package **[PostScript]** implements some low level methods which are used by PDFtalk. | The package **[PostScript]** implements some low level methods which are used by PDFtalk. | ||
Line 14: | Line 137: | ||
==== CMaps ==== | ==== CMaps ==== | ||
- | Added [[CMap]] to the **{PDFtalk Fonts}** bundle. | + | Added **[[CMap]]** to the **{PDFtalk Fonts}** bundle. |
**CMaps** are PostScript programs defining complex code mappings. The mechanism is very general and allows for variable byte length encodings. Because of its generality, CMaps are used by some PDF writers to even encode simple mappings. Hence, it is necessary to fully implement CMaps in order to decode PDF text. | **CMaps** are PostScript programs defining complex code mappings. The mechanism is very general and allows for variable byte length encodings. Because of its generality, CMaps are used by some PDF writers to even encode simple mappings. Hence, it is necessary to fully implement CMaps in order to decode PDF text. | ||
Line 25: | Line 148: | ||
Therefore, I put them into a separate package (outside of the runtime, but part of the project bundle): **[PostScript CMap instances]**. The CMaps are constructed from the source methods lazily when needed. If the package is not loaded, the source of a requested CMap is downloaded from GitHub, which is slower. | Therefore, I put them into a separate package (outside of the runtime, but part of the project bundle): **[PostScript CMap instances]**. The CMaps are constructed from the source methods lazily when needed. If the package is not loaded, the source of a requested CMap is downloaded from GitHub, which is slower. | ||
+ | |||
+ | === Known problem === | ||
+ | |||
+ | The PDF specification allows bfchar-mappings to have a string of UTF-16BE characters as destination. This is not yet implemented. | ||
+ | |||
==== Typing ==== | ==== Typing ==== | ||
- | Changed typing to allow narrower types to shadow broader types | + | === Allow narrower types to shadow broader types === |
+ | |||
+ | Example: | ||
+ | <code smalltalk> | ||
+ | DecodeParms | ||
+ | <type: # | ||
+ | <type: # | ||
+ | </ | ||
+ | |||
+ | '' | ||
+ | |||
+ | === Generalized '' | ||
+ | Textstring does not need to be differentiated. We can rely on VisualWorks' handling of multi-byte strings. | ||
===== PDFtalk 2.2 ===== | ===== PDFtalk 2.2 ===== |