Differences

cmap [2020/02/23 10:29]
christian [Decoding]
cmap [2020/02/23 14:33] (current)
christian [CMap]
Line 1:
 ====== CMap ======
  
-CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define unidirectional mapping from a code to another.
+CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define a unidirectional mapping from one code to another. (This should not be confused with the cmap table(([[https://docs.microsoft.com/en-us/typography/opentype/spec/cmap|cmap — Character to Glyph Index Mapping Table]])) of an OpenType font.)
  
 CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
Line 124:
   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
  
-==== Handling malformed CMaps ====
+===== Monster from the wild =====
  
-Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
+CMaps are not well defined. Therefore, there are some interesting variations of them in the wild. Here is a small selection of some issues.
-===== Examples from the wild =====
+==== Codespace problems ====
  
-single byte mappings in a double byte codespace
+=== Wrong code length ===
  
-using /find instead of /findresource
+<code postscript>
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +27 beginbfchar 
 +<20> <0020> 
 +<2E> <002E> 
 +<43> <0043> 
 +<44> <0044> 
 +<45> <0045> 
 +%... 
 +</code>
  
-preventing copying
+These are single-byte mappings in a double-byte codespace, which is not correct according to the documentation.
 + 
 +This can be seen often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object. 
 + 
 +=== Mappings outside the codespace === 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0001> <1004> 
 +endcodespacerange 
 +11 beginbfchar 
 +<0003> <00A0> 
 +<0005> <0022> 
 +<0008> <0025> 
 +<000F> <002C> 
 +<0010> <00AD> 
 +%... 
 +</code> 
 + 
 +Here, only the first mapping matches the codespace. All others fall outside of it, because the second byte has to be between <01> and <04>.
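
The codespace check behind both problems above can be sketched in Python. This is only an illustration, not the code of the Mappings object: the function names ''matches_codespace'' and ''partition_mappings'' are hypothetical, and the sketch assumes the usual reading of the specification, where a code matches a range if it has the same length and each of its bytes lies between the corresponding bytes of the lower and upper bound.

```python
def matches_codespace(code: bytes, low: bytes, high: bytes) -> bool:
    """Return True if `code` lies inside the codespace range low..high.

    Assumption: matching is done byte by byte, not on the numeric value.
    """
    if not (len(code) == len(low) == len(high)):
        return False  # wrong code length, e.g. <20> in a <0000>..<FFFF> range
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


def partition_mappings(mappings, codespace_ranges):
    """Split mappings into those covered by a codespace range and the rest.

    The second dictionary plays the role of the ''#unmapped'' collection
    (hypothetical structure, for illustration only).
    """
    mapped, unmapped = {}, {}
    for code, target in mappings.items():
        if any(matches_codespace(code, lo, hi) for lo, hi in codespace_ranges):
            mapped[code] = target
        else:
            unmapped[code] = target
    return mapped, unmapped


# The example above: codespace <0001> <1004> and the first five bfchar entries.
ranges = [(bytes.fromhex("0001"), bytes.fromhex("1004"))]
bfchars = {
    bytes.fromhex(src): dst
    for src, dst in [
        ("0003", "00A0"),
        ("0005", "0022"),
        ("0008", "0025"),
        ("000F", "002C"),
        ("0010", "00AD"),
    ]
}
mapped, unmapped = partition_mappings(bfchars, ranges)
```

With these inputs, only ''<0003>'' ends up in ''mapped''; the other four codes have a second byte above ''<04>'' and land in ''unmapped''.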
 + 
 +==== Wrong PostScript ==== 
 + 
 +On one occasion, I saw a CMap where the PostScript used a nonexistent operator (''/find'' instead of ''/findresource''). See the [[postscript#exception_handling_example]] on the PostScript page.
 +==== Prevent copying ==== 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +100 beginbfchar 
 +<0000> <001A> 
 +<0100> <001A> 
 +<0200> <001A> 
 +<0300> <001A> 
 +<0400> <001A> 
 +%... 
 +<4900> <001A> 
 +<4A00> <001A> 
 +<0001> <001A> 
 +<0101> <001A> 
 +<0201> <001A> 
 +<0301> <001A> 
 +<0401> <001A> 
 +%... 
 +</code> 
 + 
 +Here, all codes map to the same character (U+001A, the substitute character, Ctrl-Z) to prevent extracting the text. Also interesting is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time.
 + 
 +Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]]. 
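
A minimal Python sketch of how such a CMap could be recognized. It is not the redesigned object structure mentioned above, only an illustration under two assumptions: the mappings are kept in a hash table keyed by the code, so the order in which the entries arrive does not matter, and targets are UTF-16BE strings. The names ''load_bfchars'' and ''maps_everything_to_one_target'' are hypothetical.

```python
def load_bfchars(pairs):
    """Build a lookup table from (source hex, target hex) pairs.

    A dict gives constant-time insertion and lookup, independent of
    whether the entries are sorted by first or second byte.
    """
    table = {}
    for src_hex, dst_hex in pairs:
        code = bytes.fromhex(src_hex)
        table[code] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return table


def maps_everything_to_one_target(table) -> bool:
    """Return True if more than one code exists and all share one target."""
    return len(table) > 1 and len(set(table.values())) == 1


# A few entries from the example above, in their original order.
table = load_bfchars([
    ("0000", "001A"),
    ("0100", "001A"),
    ("0200", "001A"),
    ("0001", "001A"),
    ("0101", "001A"),
])
suspicious = maps_everything_to_one_target(table)
```

For the entries shown, ''suspicious'' is true, since every code decodes to U+001A.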
 +==== Char to string mapping ==== 
 + 
 +<code postscript> 
 +%... 
 +/CMapType 2 def 
 +1 begincodespacerange 
 +<00><FF> 
 +endcodespacerange 
 +1 beginbfchar 
 +<24><0009 000d 0020 00a0> 
 +endbfchar 
 +1 beginbfchar 
 +<50><002d 00ad 2010> 
 +endbfchar 
 +50 beginbfrange 
 +<21><21><0050> 
 +%... 
 +</code>
  
 +Two codes (<24> and <50>) are each mapped to a string of several 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.
  
 +Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
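
As a rough Python sketch of what an implementation would have to do: the target of such a mapping can be read as a UTF-16BE string, which is how section 9.10.3 describes ToUnicode targets. The helper name ''decode_bf_target'' is hypothetical and not part of the existing code.

```python
def decode_bf_target(hex_string: str) -> str:
    """Decode the hex digits of a bfchar/bfrange target as UTF-16BE text.

    bytes.fromhex skips the spaces between the 2-byte groups.
    """
    return bytes.fromhex(hex_string).decode("utf-16-be")


# Targets of the two bfchar entries from the example above.
whitespace = decode_bf_target("0009 000d 0020 00a0")
hyphens = decode_bf_target("002d 00ad 2010")
```

Here ''whitespace'' holds four characters (tab, carriage return, space, no-break space) and ''hyphens'' holds three (hyphen-minus, soft hyphen, hyphen).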
  • Last modified: 2020/02/23 10:29
  • by christian