Differences
This shows you the differences between two versions of the page.
releasenotes [2017/10/10 13:28] christian [Changes for users of the library] |
releasenotes [2021/07/29 17:58] christian [Internal changes] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Release Notes for PDFtalk (PDF4Smalltalk 2.0) ====== | + | ====== Release Notes ====== |
- | This is the release of the second major version of the PDF library. | + | ===== PDFtalk 2.5.0 ===== |
- | ===== Changes for users of the library ===== | + | July 2021 |
- | Some changes are incompatible with the previous version, which are describe | + | This release was triggered by Bob Nemec from HTS to improve error handling when appending PDFs. Two errors were seen: objects referenced but missing and streams with one extra byte. |
+ | |||
+ | The use case of appending PDFs is the topic of this release. Some internal structures were redesigned and the bugs are handled. Also, the performance for appending large files was improved. | ||
+ | |||
+ | Since the HTS systems run on Gemstone, the Gemstone version of the library was updated. | ||
+ | |||
+ | ==== Error handling ==== | ||
+ | |||
+ | Two structural errors were discovered which need to be handled. To describe these errors in more detail, a new page [[monsters|Monsters]] was created to collect some observations from the wild. | ||
+ | |||
+ | === Handling missing object errors === | ||
+ | |||
+ | A reference pointing to a non-existing object (see [[monsters# | ||
+ | |||
+ | On writing, the MissingObject is written as a string saying that the object is missing. This preserves the references and leads to a TypeMismatch error on the next reading, which can be handled easily. | ||
+ | |||
+ | === Handling incorrect stream length errors === | ||
+ | |||
+ | The ''/ | ||
+ | |||
+ | Therefore, a very specific error '' | ||
+ | ==== New APIs ==== | ||
+ | |||
+ | === Document>> | ||
+ | |||
+ | A PDF (all pages) can be appended efficiently to a PDF Document. | ||
+ | <code smalltalk> | ||
+ | |||
+ | All objects of the PDF to be appended are read from the file by resolving all references reachable from the '' | ||
+ | |||
+ | To concatenate some PDFs do: | ||
+ | <code smalltalk> | ||
+ | | doc | | ||
+ | doc := Document new. | ||
+ | doc appendAllPagesFrom: | ||
+ | doc appendAllPagesFrom: | ||
+ | doc appendAllPagesFrom: | ||
+ | doc saveAs: ' | ||
+ | </code> | ||
+ | |||
+ | === Raw objects === | ||
+ | |||
+ | There is also a variant | ||
+ | <code smalltalk> | ||
+ | which reads all objects without typing. The objects are raw - generic '' | ||
+ | |||
+ | In '' | ||
+ | |||
+ | On '' | ||
+ | |||
+ | ==== Internal changes ==== | ||
+ | |||
+ | The user of the library is not affected by these changes. | ||
+ | |||
+ | === Improving performance for large files === | ||
+ | |||
+ | When reading many objects at once, the library was slow with large files. In this investigation, | ||
+ | |||
+ | * Object streams were created and initialized for each access to an object inside. Now, the streams are kept alive in a cache. | ||
+ | * References from traversing the PDF objects were collected in an OrderedCollection. The visited check was done with this collection. Since each check searches the collection linearly, the total time grows quadratically with the number of collected objects, so that large files can become very slow. Now, for the visited check, a Set is used. The OrderedCollection for the collected references is kept to ensure a reproducible order. | ||
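+ | |||
+ | A minimal sketch of the visited check described above, in plain Smalltalk. It is not the library's code: the ''references'' array is a hypothetical stand-in for the references found during a traversal. The Set answers the membership test, while the OrderedCollection keeps the order of first discovery: | ||
+ | <code smalltalk> | ||
+ | | references visited collected | | ||
+ | "Hypothetical stand-ins for traversed references; some occur repeatedly" | ||
+ | references := #(3 1 3 2 1). | ||
+ | visited := Set new. | ||
+ | collected := OrderedCollection new. | ||
+ | references do: [:ref | | ||
+ |     "The Set lookup is hashed, so it does not scan all collected references" | ||
+ |     (visited includes: ref) ifFalse: [ | ||
+ |         visited add: ref. | ||
+ |         collected add: ref]]. | ||
+ | collected "holds 3, 1 and 2 in the order of first discovery" | ||
+ | </code> | ||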
+ | |||
+ | === Redesigned references and tracing === | ||
+ | |||
+ | Objects are picked (read) from a PDF file stream when they are needed. Originally, this was done using blocks stored in place of the value (referent) of a reference. When the value is requested, the block is evaluated and the resulting PDF object is stored as the referent. The block reads the raw object and converts it to the proper type. This can be nested and several types may apply. | ||
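+ | |||
+ | The original block-based mechanism can be sketched as follows. This is only an illustration in plain Smalltalk, not the library's code: ''referent'', ''reads'' and ''resolve'' are hypothetical names, and a string stands in for the typed PDF object. | ||
+ | <code smalltalk> | ||
+ | | referent reads resolve | | ||
+ | reads := 0. | ||
+ | "Hypothetical stand-in for the block which reads and types the object" | ||
+ | referent := [reads := reads + 1. 'the PDF object']. | ||
+ | "On request, a block is evaluated once and replaced by its result" | ||
+ | resolve := [ | ||
+ |     referent isString ifFalse: [referent := referent value]. | ||
+ |     referent]. | ||
+ | resolve value. | ||
+ | resolve value. | ||
+ | reads "1: the object was read only on the first request" | ||
+ | </code> | ||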
+ | |||
+ | Unfortunately, | ||
+ | |||
+ | While at it, the number and generation of references was extracted to an '' | ||
+ | |||
+ | === Changed internal streams to bytes === | ||
+ | |||
+ | The '' | ||
+ | |||
+ | ==== Gemstone ==== | ||
+ | |||
+ | This release updates the Gemstone code for the library. The biggest addition is the [[postscript|PostScript]] module used with [[cmap|CMaps]] introduced in [[releasenotes# | ||
+ | |||
+ | === Encoded PostScript sources === | ||
+ | |||
+ | PostScript source methods (mainly cmaps and examples) are re-encoded with ASCII85 to allow fileIn to Gemstone. Topaz from Gemstone as well as PostScript use the % character at the beginning of a line for directives and comments. Since cmaps are PostScript programs, their source cannot be embedded directly without disturbing Gemstone. | ||
+ | |||
+ | Interestingly, | ||
+ | |||
+ | === Optional CMaps === | ||
+ | |||
+ | The [[cmap|CMaps module]] is used to decode strings to unicode. The library uses this when a font supplies a ''/ | ||
+ | |||
+ | Since they are very big, there are two Gemstone source files: **'' | ||
+ | ==== Other changes ==== | ||
+ | |||
+ | In VisualWorks 9.1, icons were renamed and changed. In order to use the library' | ||
+ | |||
+ | |||
+ | ===== PDFtalk 2.4.0 ===== | ||
+ | |||
+ | March 2021 | ||
+ | |||
+ | Embedded OpenType(PS) fonts can now be used for the screen on Windows without having them installed. | ||
+ | |||
+ | '' | ||
+ | |||
+ | Added cache for tabular glyph variants to improve performance. | ||
+ | ===== PDFtalk 2.3.5 ===== | ||
+ | |||
+ | January 2021 | ||
+ | |||
+ | I worked on extracting content from PDFs for the [[https:// | ||
+ | |||
+ | This messed up the base a bit and test cases were starting to fail. | ||
+ | |||
+ | With this release, everything is clean again: the code **{PDFtalk Project}** loads without warnings or undeclareds. All tests pass (almost; see [1]). | ||
+ | The same goes for the **[Report4PDF]** package: it loads cleanly and all tests pass. | ||
+ | |||
+ | One functional enhancement is that text is now properly decoded for the UI. "add Picture" | ||
+ | |||
+ | Happy hacking | ||
+ | |||
+ | |||
+ | [1] There is a strange problem where one or two tests fail in a fresh image, but only the first time. After the first run, they all pass. | ||
+ | ===== PDFtalk 2.3 ===== | ||
+ | |||
+ | February 2020 | ||
+ | ==== PostScript ==== | ||
+ | |||
+ | Added **[[PostScript]]** to the PDFtalk runtime. | ||
+ | |||
+ | The package **[PostScript]** implements some low level methods which are used by PDFtalk. | ||
+ | |||
+ | PostScript was implemented after PDFtalk and used some basic methods of it (Number reading and writing, ASCII85 encoding and PostScript character names). These dependencies have been reversed so that PostScript can be used stand-alone while PDFtalk now depends on it. This also reflects the correct historical relationship. | ||
+ | |||
+ | ==== CMaps ==== | ||
+ | |||
+ | Added **[[CMap]]** to the **{PDFtalk Fonts}** bundle. | ||
+ | |||
+ | **CMaps** are PostScript programs defining complex code mappings. The mechanism is very general and allows for variable byte length encodings. Because of this generality, some PDF writers use CMaps even to encode simple mappings. Hence, it is necessary to fully implement CMaps in order to decode PDF text. | ||
+ | |||
+ | This is not intended to be used by the user of the library. Rather, it is part of the basic font infrastructure enabling decoding of PDF strings. This will be the base for Text extraction in the next step. | ||
+ | |||
+ | === Standard CMaps === | ||
+ | |||
+ | PDF defines 181 standard CMaps which are to be understood by a conforming reader. These CMaps are available at GitHub(([[https:// | ||
+ | |||
+ | Therefore, I put them into a separate package (outside of the runtime, but part of the project bundle): **[PostScript CMap instances]**. The CMaps are constructed from the source methods lazily when needed. If the package is not loaded, the source of a requested CMap is downloaded from GitHub, which is slower. | ||
+ | |||
+ | === Known problem === | ||
+ | |||
+ | The PDF specification allows bfchar-mappings to have a string of UTF-16BE characters as destination. This is not yet implemented. | ||
+ | |||
+ | ==== Typing ==== | ||
+ | |||
+ | === Allow narrower types to shadow broader types === | ||
+ | |||
+ | Example: | ||
+ | <code smalltalk> | ||
+ | DecodeParms | ||
+ | <type: # | ||
+ | <type: # | ||
+ | </code> | ||
+ | |||
+ | '' | ||
+ | |||
+ | === Generalized '' | ||
+ | |||
+ | Textstring does not need to be differentiated. We can rely on VisualWorks' handling of multi-byte strings. | ||
+ | |||
+ | ===== PDFtalk 2.2 ===== | ||
+ | |||
+ | August 2019 | ||
+ | |||
+ | Renamed '' | ||
+ | This version replaces all references of '' | ||
+ | |||
+ | PDFtalk now depends on the **[Values]** package in version 3.x and up and is incompatible with earlier versions. | ||
+ | |||
+ | ===== PDFtalk 2.1 ===== | ||
+ | |||
+ | July 2019 | ||
+ | |||
+ | Flate encoding now uses the zlib of VW 8.1. | ||
+ | This solved problems with allocating buffers under heavy load. | ||
+ | ===== PDFtalk 2.0 ===== | ||
+ | |||
+ | October 2017 | ||
+ | |||
+ | ==== What's new ==== | ||
+ | |||
+ | **Name** The new name is // | ||
+ | |||
+ | **Typing** The heart of the “PDF engine” is the [[newtyping|typing system]] which allows the assignment of Smalltalk classes to raw PDF objects. The new version has a redesigned type system where PDF types are properly modeled independent from the Smalltalk class hierarchy. This allows classes to be renamed freely (e.g. by adding prefixes) without affecting PDF types. Also, boxing of some simple objects like " | ||
+ | |||
+ | **[[PDFtalk4Gemstone|PDFtalk for Gemstone]]** The new release was triggered by a contract to port the library to Gemstone (thanks to HTS and Bob Nemec). A talk about this was held at ESUG 2017: " | ||
+ | |||
+ | **[[GemstoneFileout|Gemstone Fileout]]** A VisualWorks to Gemstone translation tool. This tool, with project specific code transformation declarations, | ||
+ | |||
+ | Both new projects are open source under the MIT licence. | ||
+ | ==== Changes for users of the library ==== | ||
+ | |||
+ | Some changes are incompatible with the previous version, which are described | ||
It is not recommended to load the new version into an image with the old version of the library. | It is not recommended to load the new version into an image with the old version of the library. | ||
- | ==== Referencing PDF classes | + | === Namespace and bundle structure === |
+ | |||
+ | The former namespace //PDF// is renamed to **'' | ||
+ | |||
+ | The former independent bundle '' | ||
+ | |||
+ | The demos are now in class **'' | ||
+ | <code smalltalk> | ||
+ | PDF runAllDemos | ||
+ | </code> | ||
+ | to see if they are running. You may need to edit the file path to the PDF specification and to your demo directory. | ||
+ | |||
+ | === Referencing PDF classes === | ||
Smalltalk classes representing a PDF type should not be referenced directly anymore. Instead an expression like | Smalltalk classes representing a PDF type should not be referenced directly anymore. Instead an expression like | ||
Line 15: | Line 224: | ||
PDF classAt: <PDF type symbol> | PDF classAt: <PDF type symbol> | ||
</code> | </code> | ||
- | shhould | + | should |
Example | Example | ||
Line 32: | Line 241: | ||
There are 2 reasons for this | There are 2 reasons for this | ||
- | | + | - PDF type and Smalltalk class names may not be the same anymore |
- | * The Smalltalk class name may differ in different ports of the library. | + | |
- | ==== New shared Smalltalk.PDF | + | === New shared Smalltalk.PDF === |
The shared variable **'' | The shared variable **'' | ||
Line 41: | Line 250: | ||
Unless you extend the library, there should be no need to add the PDFtalk namespace to the imports of your project. Instead most functionality should be accessed through **'' | Unless you extend the library, there should be no need to add the PDFtalk namespace to the imports of your project. Instead most functionality should be accessed through **'' | ||
- | ==== Aligned types ==== | + | === Aligned types === |
The new typing system allowed to remove the PDF classes **'' | The new typing system allowed to remove the PDF classes **'' | ||
- | Boxing (with **'' | + | Boxing (with **'' |
The work is not finished yet. **'' | The work is not finished yet. **'' | ||
- | ===== Typing redesign | + | ==== Typing redesign ==== |
The major change is the redesign of the [[newtyping|PDF typing system]]. Initially, I represented the types of PDF objects by Smalltalk classes with the same name. This turned out to be not sufficient. | The major change is the redesign of the [[newtyping|PDF typing system]]. Initially, I represented the types of PDF objects by Smalltalk classes with the same name. This turned out to be not sufficient. | ||
Line 57: | Line 266: | ||
Therefore, PDF types are now modeled independently. | Therefore, PDF types are now modeled independently. | ||
- | ==== Notes ==== | + | === Notes === |
- | === Specialization only on assignment | + | == Specialization only on assignment == |
When PDF objects were created, all classes were searched for possible specializations: | When PDF objects were created, all classes were searched for possible specializations: | ||
Line 65: | Line 274: | ||
In the new version, objects are only typed and specialized when they are assigned to an attribute of a Dictionary or Array. | In the new version, objects are only typed and specialized when they are assigned to an attribute of a Dictionary or Array. | ||
- | === System classes as PDF classes=== | + | == System classes as PDF classes == |
The first version had wrappers for all basic types of PDF, such as null, booleans, numbers, etc. This is similar to boxing of primitive types in other programming languages. | The first version had wrappers for all basic types of PDF, such as null, booleans, numbers, etc. This is similar to boxing of primitive types in other programming languages. | ||
Line 71: | Line 280: | ||
The motivation was to accurately implement any semantic differences to the corresponding Smalltalk objects in a proper place. This led to the widespread use of #asPDF and # | The motivation was to accurately implement any semantic differences to the corresponding Smalltalk objects in a proper place. This led to the widespread use of #asPDF and # | ||
- | Now I know that the differences are minor and that they could be properly implemented | + | Now I know that the differences are minor and that they could be properly implemented |
This should lead to simpler code and more space and time efficient processing. | This should lead to simpler code and more space and time efficient processing. | ||
On the other hand, there are now a lot more root PDF classes. Therefore, I moved some of the general PDF behavior to Object. | On the other hand, there are now a lot more root PDF classes. Therefore, I moved some of the general PDF behavior to Object. |