Differences

This shows you the differences between two versions of the page.

cmap [2020/02/23 10:28]
christian [General Info]
cmap [2020/02/23 14:26]
christian [Char to string mapping]
Line 108: Line 108:
 endbfrange
 </code>
-===== Decoding ===== 
  
-The steps of decoding are: 
-  * take the first byte from the source and find a 1-byte codespace range which includes it 
-    * if found, find a 1-byte mapping for the byte 
-      * if found, return the destination code or character 
-      * if no mapping found, try to find a notdef mapping and return the code 
-        * if not found, see below 
-    * if not found, read the next byte and repeat with 2-byte mappings 
- 
-When no mapping was found, one has to find out how many of the unmappable bytes have to be read from the source. This is not well defined (or I have not understood it yet). 
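
The steps above can be sketched roughly as follows. This is a minimal illustration, not the implementation described on this page: the function names, the `(low, high)` tuples for codespace ranges, the plain dictionaries for mappings and notdef mappings, and the `max_len` limit are all assumptions made for the example. The fallback of skipping a single byte when nothing matches is likewise only an assumption, since that case is left open above.

```python
def in_codespace(code: bytes, low: bytes, high: bytes) -> bool:
    """True if code has the length of the range and each byte lies in its byte range."""
    return len(code) == len(low) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )


def decode_one(source: bytes, codespaces, mappings, notdefs, max_len: int = 4):
    """Return (destination, bytes_consumed) for the first code in source.

    codespaces: list of (low, high) byte strings (hypothetical layout)
    mappings, notdefs: dicts from source code bytes to a destination
    """
    for length in range(1, min(max_len, len(source)) + 1):
        code = source[:length]
        if any(in_codespace(code, lo, hi) for lo, hi in codespaces):
            if code in mappings:
                return mappings[code], length
            if code in notdefs:
                return notdefs[code], length
            # Codespace matched, but neither a mapping nor a notdef mapping exists.
            return None, length
    # No codespace range matched at any length. Assumption: skip one byte.
    return None, 1


codespaces = [(b"\x00", b"\x7f"), (b"\x80\x00", b"\xff\xff")]
mappings = {b"\x41": "A", b"\x81\x40": "X"}
notdefs = {b"\x01": 0}
one_byte = decode_one(b"\x41\x42", codespaces, mappings, notdefs)
two_byte = decode_one(b"\x81\x40", codespaces, mappings, notdefs)
```

Real notdef mappings are ranges rather than single codes; the dictionary is a simplification to keep the control flow visible.
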
 ===== Implementation notes =====
  
Line 134: Line 124:
   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
  
-==== Handling malformed CMaps ====
 +===== Monsters from the wild =====
 + 
 +CMaps are not well defined, so some interesting variations of them exist in the wild. Here is a small selection of the issues encountered.
 +==== Codespace problems ==== 
 + 
 +=== Wrong code length === 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +27 beginbfchar 
 +<20> <0020> 
 +<2E> <002E> 
 +<43> <0043> 
 +<44> <0044> 
 +<45> <0045> 
 +%... 
 +</code> 
 + 
 +Here, single-byte mappings appear in a double-byte codespace, which is not correct according to the documentation.
 + 
 +This can be seen often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object. 
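
A length check like the following is one way to separate such mappings from the valid ones. It is only a sketch: `split_by_code_length` and the two dictionaries are hypothetical names for this example, and the `unmapped` dictionary merely mirrors the idea of the ''#unmapped'' variable, not its actual structure.

```python
def split_by_code_length(codespaces, bfchars):
    """Separate mappings whose source length matches a codespace range from the rest.

    codespaces: list of (low_hex, high_hex) strings
    bfchars: list of (source_hex, destination_hex) strings
    """
    valid_lengths = {len(bytes.fromhex(low)) for low, _high in codespaces}
    mapped, unmapped = {}, {}
    for src, dst in bfchars:
        target = mapped if len(bytes.fromhex(src)) in valid_lengths else unmapped
        target[src] = dst
    return mapped, unmapped


# The entries shown in the listing above.
codespaces = [("0000", "FFFF")]
bfchars = [("20", "0020"), ("2E", "002E"), ("43", "0043"), ("44", "0044"), ("45", "0045")]
mapped, unmapped = split_by_code_length(codespaces, bfchars)
```

With the entries from the listing, every source code is one byte long while the only codespace range is two bytes long, so all five end up in `unmapped`.
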
 + 
 +=== Mappings outside the codespace === 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0001> <1004> 
 +endcodespacerange 
 +11 beginbfchar 
 +<0003> <00A0> 
 +<0005> <0022> 
 +<0008> <0025> 
 +<000F> <002C> 
 +<0010> <00AD> 
 +%... 
 +</code> 
 + 
 +Here, only the first mapping matches the codespace. All others fall outside of it, because the second byte has to be between <01> and <04>.
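
The reason these codes fall outside is that a codespace range is checked byte by byte, not as one number: <0005> is numerically between <0001> and <1004>, but its second byte is not within the second-byte range. A minimal sketch of that check, with `in_codespace` as a hypothetical helper name:

```python
def in_codespace(code_hex: str, low_hex: str, high_hex: str) -> bool:
    """True if the code has the length of the range and each byte lies in its byte range."""
    code, low, high = (bytes.fromhex(h) for h in (code_hex, low_hex, high_hex))
    return len(code) == len(low) == len(high) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )


# Source codes from the listing above, checked against <0001> <1004>.
sources = ["0003", "0005", "0008", "000F", "0010"]
inside = [s for s in sources if in_codespace(s, "0001", "1004")]
```

For the five codes shown, `inside` contains only `"0003"`.
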
 + 
 +==== Wrong PostScript ==== 
 + 
 +On one occasion, I saw a CMap whose PostScript used a nonexistent operator (''/find'' instead of ''/findresource''). See the [[postscript#exception_handling_example]] on the PostScript page.
 +==== Prevent copying ====
  
-Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
-===== Examples from the wild =====
 +<code postscript>
 +%...
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +100 beginbfchar 
 +<0000> <001A> 
 +<0100> <001A> 
 +<0200> <001A> 
 +<0300> <001A> 
 +<0400> <001A> 
 +%... 
 +<4900> <001A> 
 +<4A00> <001A> 
 +<0001> <001A> 
 +<0101> <001A> 
 +<0201> <001A> 
 +<0301> <001A> 
 +<0401> <001A> 
 +%... 
 +</code>
  
-single byte mappings in a double byte codespace
 +Here, all codes map to the same character (U+001A, the substitute character, Ctrl-Z) to prevent text extraction. Also interesting is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time.
  
-using /find instead of /findresource
 +Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]].
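
A CMap of this kind could be recognized by looking at how the destinations are distributed. The following is only a sketch of that idea; `dominant_destination` is a hypothetical helper and is not part of the implementation described here.

```python
from collections import Counter


def dominant_destination(bfchars):
    """Return (destination_hex, share) for the most common destination.

    bfchars: non-empty list of (source_hex, destination_hex) strings
    """
    counts = Counter(dst.upper() for _src, dst in bfchars)
    destination, hits = counts.most_common(1)[0]
    return destination, hits / len(bfchars)


# A few of the entries from the listing above.
bfchars = [
    ("0000", "001A"), ("0100", "001A"), ("0200", "001A"),
    ("0001", "001A"), ("0101", "001A"),
]
destination, share = dominant_destination(bfchars)
```

A share close to 1 for a control character such as <001A> suggests that the mapping carries no usable text information.
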
 +==== Char to string mapping ====
  
-preventing copying
 +<code postscript>
 +%... 
 +/CMapType 2 def 
 +1 begincodespacerange 
 +<00><FF> 
 +endcodespacerange 
 +1 beginbfchar 
 +<24><0009 000d 0020 00a0> 
 +endbfchar 
 +1 beginbfchar 
 +<50><002d 00ad 2010> 
 +endbfchar 
 +50 beginbfrange 
 +<21><21><0050> 
 +%... 
 +</code>
  
 +Two codes (<24> and <50>) are each mapped to a string of several 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.
  
 +Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
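
Since the destination of a ToUnicode mapping is UTF-16BE encoded, a string destination can be decoded in one step. This is a sketch of how that could look, not the (still missing) implementation; `decode_destination` is a hypothetical name.

```python
def decode_destination(hex_string: str) -> str:
    """Decode the content of a hex string such as <0009 000d 0020 00a0> as UTF-16BE.

    Spaces inside the hex string are ignored by bytes.fromhex.
    """
    return bytes.fromhex(hex_string).decode("utf-16-be")


# The two string destinations from the listing above.
whitespace = decode_destination("0009 000d 0020 00a0")
hyphens = decode_destination("002d 00ad 2010")
```

Here `whitespace` holds four characters (tab, carriage return, space, no-break space) and `hyphens` holds three (hyphen-minus, soft hyphen, hyphen).
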
  • cmap.txt
  • Last modified: 2020/02/23 14:33
  • by christian