Differences

This shows you the differences between two versions of the page.

releasenotes [2021/07/29 09:52]
christian [PDFtalk 2.5.0]
releasenotes [2021/07/29 20:10] (current)
christian [PDFtalk 2.5.0]
Line 7: Line 7:
 This release was triggered by Bob Nemec from HTS to improve error handling when appending PDFs. Two errors were seen: objects referenced but missing and streams with one extra byte.
  
-The use case of appending PDFs is the topic of this release. Some internal structures were redesigned and the bugs are handled. Also, the performance for appending large files was improved.+The use case of **appending PDFs** is the topic of this release. Some internal structures were redesigned and the bugs are handled. Also, the performance for appending large files was improved.
  
 Since the HTS systems run on Gemstone, the Gemstone version of the library was updated.
-==== Internal changes ==== 
  
-The user of the library is not affected by these changes.+==== Error handling ====
  
-=== Improving performance for large files ===+Two structural errors were discovered which need to be handled. For describing these errors in more detail, a new page [[monsters|Monsters]] was created to collect some observations from the wild.
  
-When reading many objects at once, the library was slow with certain files. In this investigation, a few issues came up which were never a problem when clicking through objects one by one.+=== Handling missing object errors ===
  
-  * Object streams were created and initialized for each access to an object inside. Now, the streams are kept alive in a cache. +A reference pointing to a non-existing object (see [[monsters#missing_object|Missing object]] for details). A ''MissingObject'' is created with the list of expected types, allowing useful error messages and the creation of dummy objects.
-  * References from traversing the PDF objects were collected in an OrderedCollection. The visited check was done with this collection. Unfortunately, the time grows quadratically with the number of collected objects so that large files can become very slow. Now, for the visited check, a Set is used. The OrderedCollection for the collected references is kept to ensure a reproducible order.+
  
-=== Redesigned references and tracing ===+On writing, the MissingObject is written as a string saying that the object is missing. This preserves the references and leads to a TypeMismatch error on next reading, which can be handled easily.
  
-Objects are picked (read) from a PDF file stream when they are needed. Originally, this was done using blocks stored in place of the value (referent) of a reference. When the value is requested, the block is evaluated and the resulting PDF object is stored as the referent. The block reads the raw object and converts it to the proper type. This can be nested and several types may apply.+=== Handling incorrect stream length errors ===
  
-Unfortunately, the design with blocks does not allow reasoning about the types to be applied. This led to problems where a general type overtook a more specific, better matching type. So, I reified the blocks to ''FileReference'' which can read an object from file and has a list of types to be applied to the raw object. The types list is maintained to reflect the subtype order.+The ''/Length'' of a stream is different from the number of bytes in the content (see [[monsters#incorrect_stream_length|Incorrect stream length]] for details). In our case, the stream content was always exactly one byte longer than stated by the ''/Length'' attribute. That last byte was probably not needed for the stream to be correct considering the filters applied, like ''/FlateDecode''. This was checked for a few instances.
  
-While at it, the number and generation of references were extracted to an ''ObjectId''.+Therefore, a very specific error ''ExtraCharacterInStreamError'' is raised in this case and the extra byte is ignored (giving the ''/Length'' attribute priority). This error can be resumed meaningfully. On writing, ''/Length'' bytes are written to the content, dropping the extra byte. 
 +==== New APIs ====
  
-=== Changed internal streams to bytes ===+=== Document>>appendAllPagesFrom: ===
  
-The ''Writer'' (internal write stream) writes bytes instead of characters to produce the PDF file. When writing the physical file, a copy to a byte array was needed to write the binary data. This copy is not needed anymore.+A PDF (all pages) can be appended efficiently to a PDF Document. 
 +<code smalltalk>Document>>appendAllPagesFrom: aPDFtalkFile</code>
  
-==== Error handling ====+All objects of the PDF to be appended are read from the file by resolving all references reachable from the ''Catalog''. This happens with a protection against ''Type''- and ''FileError''s, which can safely be resumed. 
  
-=== Added MissungReference error ===+To concatenate some PDFs do: 
 +<code smalltalk> 
 +| doc | 
 +doc := Document new. 
 +doc appendAllPagesFrom: (File read: 'file1.pdf' asFilename). 
 +doc appendAllPagesFrom: (File read: 'file2.pdf' asFilename). 
 +doc appendAllPagesFrom: (File read: 'file3.pdf' asFilename). 
 +doc saveAs: 'file123.pdf'
 +</code>
  
-I encountered an interesting error. The object, a reference was pointing to, was not there. The entry in the cross references was 'free'. For this case, I created the new MissungReference. It holds the objectId and a list of expected types allowing useful error messages and the creation of dummy objects.+=== Raw objects ===
  
-==== New APIs ====+There is also a variant  
 +<code smalltalk>Document>>appendAllRawPagesFrom: aPDFtalkFile</code> 
 +which reads all objects without typing. The objects are raw - generic ''Dictionary'' and ''Array'' objects. Note: the only purpose is to write out the PDF immediately, because nothing useful can be done with the raw objects.
  
-=== Document>>#appendAllPagesFrom: ===+In ''VisualWorks'', the raw version performs slightly faster (~ 5%) than the standard version with typing. 
 
-A PDF (all pages) can be appended efficiently to a PDF Document.+On ''Gemstone'', the difference is much bigger (~ 75%) - 4 times faster! My guess is that ''Pragmas'', with which the type annotations are implemented, are not efficient in ''Gemstone''.
  
-All objects of a PDF to be appended are fully read by resolving all references reachable from the ''Catalog''. This happens with a protection against ''Type''- and ''FileError''s, which are resumed. +==== Internal changes ====
  
-Other errors are collected in the #errors variable of the Parser.+The user of the library is not affected by these changes.
  
-To check for these errors, add the following to your code: +=== Improving performance for large files ===
-<code smalltalk> +
-aPDFFile parser errors notEmpty ifTrue: [ +
- aPDFFile parser errors inspect]. +
-</code>+
  
-=== Raw objects ===+When reading many objects at once, the library was slow with large files. In this investigation, a few issues came up which were never a problem when clicking through objects one by one.
  
-There is also a variant <code>Document>>#appendAllRawPagesFrom:</code> which reads all objects without typing. The objects are raw - generic ''Dictionary'' and ''Array'' objects. Note: the only purpose is to write out the PDF immediately, because nothing useful can be done with the raw objects.+  * Object streams were created and initialized for each access to an object inside. Now, the streams are kept alive in a cache. 
 +  * References from traversing the PDF objects were collected in an OrderedCollection. The visited check was done with this collection. The time grows quadratically with the number of collected objects, so that large files can become very slow. Now, for the visited check, a Set is used. The OrderedCollection for the collected references is kept to ensure a reproducible order.
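
The visited check described in the second bullet can be sketched in plain Smalltalk. This is only an illustration of the idea, not the library's code: the integers stand in for references, and the variable names ''references'', ''collected'' and ''visited'' are assumptions made for this example.

<code smalltalk>
| references collected visited |
"Stand-ins for the references met while traversing; duplicates are expected."
references := #(3 1 3 2 1 4).
"Keeps the references in first-seen order, for reproducible output."
collected := OrderedCollection new.
"Hashed membership test, instead of a linear search through 'collected'."
visited := Set new.
references do: [:ref |
	(visited includes: ref) ifFalse: [
		visited add: ref.
		collected add: ref]].
"collected now holds 3 1 2 4, in that order."
collected
</code>

Asking ''collected includes: ref'' instead would search the whole collection for every reference, which is what made large files slow. The Set answers the same question by hashing, while the OrderedCollection is only appended to.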
  
-The version is performing slightly faster (~ 5%) than the typed standard variant in ''VisualWorks''. On ''Gemstone'', the difference is much bigger (~ 75%)!+=== Redesigned references and tracing ===
  
-==== other changes ====+Objects are picked (read) from a PDF file stream when they are needed. Originally, this was done using blocks stored in place of the value (referent) of a reference. When the value is requested, the block is evaluated and the resulting PDF object is stored as the referent. The block reads the raw object and converts it to the proper type. This can be nested and several types may apply. 
 + 
 +Unfortunately, the design with blocks does not allow deferring the typing. This led to problems where a general type overtook a more specific, better matching type. So, I reified the blocks to ''FileReference'' which can read an object from file and has a list of types to be applied to the raw object. The types list is maintained to reflect the subtype order. 
 + 
 +While at it, the number and generation of references were extracted to an ''ObjectId''. 
 + 
 +=== Changed internal streams to bytes === 
 + 
 +The ''Writer'' (internal write stream) now writes bytes instead of characters to produce the PDF file. When writing the physical file, the string was converted to a byte array to write the binary data. This copy is not needed anymore. 
 + 
 +==== Gemstone ==== 
 + 
 +This release updates the Gemstone code for the library and also the [[pdftalk4gemstone|PDFtalk for Gemstone]] page. The biggest addition is the [[postscript|PostScript]] module used with [[cmap|CMaps]] introduced in [[releasenotes#pdftalk_23|version 2.3]].
  
 === Encoded PostScript sources ===
  
-Reencoded cmap source file methods with ASCII85 to allow fileIn to Gemstone. Topas from Gemstone as well as PostScript use the % character at the beginning of a line for directives and comments. Since cmaps are PostScript programs, their source cannot be embedded directly without disturbing Gemstone.+PostScript source methods (mainly cmaps and examples) are reencoded with ASCII85 to allow fileIn to Gemstone. Topas from Gemstone as well as PostScript use the % character at the beginning of a line for directives and comments. Since cmaps are PostScript programs, their source cannot be embedded directly without disturbing Gemstone.
  
-=== Compatibility changes ===+Interestingly, I believe that Gemstone and PostScript share some early history which can also be seen in the way the dictionary stack is used in both. 
 + 
 +=== Optional CMaps === 
 + 
 +The [[cmap|CMaps module]] is used to decode strings to Unicode. The library uses this when a font supplies a ''/ToUnicode'' attribute. In case you want to use this for Asian languages such as Japanese, Chinese or Korean, the standard CMap files for these languages are needed. There are 182 standard CMaps defined which are all needed when dealing with arbitrary PDFs. These CMap source files, in PostScript, are stored in the image and parsed by the PostScript interpreter on demand. 
 + 
 +Since they are very big, there are two Gemstone source files: **''PDFtalk.gs''** (3.8 MB) and **''PDFtalkWithCMaps.gs''** (12.1 MB). Unless you do serious things with Asian text, the smaller one is recommended. 
 +==== other changes ====
  
-Some icons were copied to be used in both, the new VW 9.1 and earlier versions.+In VisualWorks 9.1, icons were renamed and changed. In order to use the library's UI in all versions, some icons were copied from older releases.
  
  
  • releasenotes.1627545158.txt.gz
  • Last modified: 2021/07/29 09:52
  • by christian