Differences

This shows you the differences between two versions of the page.

cmap [2020/02/22 17:37]
christian [References]
cmap [2020/02/23 13:33]
christian [Mappings outside the codespace]
Line 11: Line 11:
 CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators
   - to the glyph to be displayed and   - to the glyph to be displayed and
-  - to unicode (in the ''ToUnicode''(()) attribute of a font) +  - to unicode (in the ''ToUnicode''(([[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf|5411.ToUnicode.pdf]] ToUnicode Mapping File Tutorial)) attribute of a font)
- +
-The official standard CMaps are now hosted at GitHub as open source project(([[https://github.com/adobe-type-tools/cmap-resources]] Standard CMaps from Adobe)). Also the mappings from the standard character collections to unicode are available(([[https://github.com/adobe-type-tools/mapping-resources-pdf]] Mapping the character collections to unicode)).+
  
 +The official standard CMaps are now hosted on GitHub as an open source project(([[https://github.com/adobe-type-tools/cmap-resources|cmap-resources]] Standard CMaps from Adobe at GitHub)). The mappings from the standard character collections to Unicode are also available(([[https://github.com/adobe-type-tools/mapping-resources-pdf|mapping-resources-pdf]] Mapping character collections to Unicode at GitHub)). An interesting blog post about how the CMap names were chosen can be found here(([[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html|CMap Resource Names Explained]] Adobe blog post)).
 ===== Example ===== ===== Example =====
  
 +The source of a typical CMap looks like:
 +{{:pdf:cmap_raw.png?nolink|CMap source}}
 +
 +The derived CMap is displayed like this:
 +{{:pdf:cmap.png?nolink|CMap object}}
 ===== Components ===== ===== Components =====
  
Line 29: Line 33:
   * **/CMapName** the name under which the CMap is stored in the CMap resources   * **/CMapName** the name under which the CMap is stored in the CMap resources
   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2
-  * **/CMapType** Not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps+  * **/CMapType** Not clearly defined. 1 for predefined CID maps, 2 for ToUnicode maps
   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical
   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me
Line 104: Line 108:
 endbfrange endbfrange
 </code> </code>
-===== Decoding ===== 
  
-The steps of decoding are: 
-  * take the first byte from the source and find a 1-byte codespace range which includes it 
-    * if found, find a 1-byte mapping for the byte 
-      * if found, return the destination code or character 
-      * if no mapping found, try to find a notdef mapping and return the code 
-        * if not found, see below 
-    * if not found, read the next byte and repeat with 2-byte mappings 
- 
-When no mapping was found, one has to find out how many of the unmappable bytes have to be read from the source. This is not well defined (or I have not understood it yet). 
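
The decoding steps above can be sketched roughly as follows. This is only a minimal sketch in Python, not the implementation described on this page: the names ''decode_one'' and ''in_codespace'' and the dictionary layout are hypothetical, notdef mappings are simplified to exact codes instead of ranges, and consuming a single byte when nothing matches is an assumption, since that case is the open question.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[bytes, bytes]  # (low, high) of one codespace range, same length


def in_codespace(code: bytes, ranges: List[Range]) -> bool:
    """True if each byte of code lies within the matching byte of some range."""
    for low, high in ranges:
        if len(code) == len(low) and all(
            lo <= b <= hi for b, lo, hi in zip(code, low, high)
        ):
            return True
    return False


def decode_one(
    data: bytes,
    pos: int,
    ranges: List[Range],
    mappings: Dict[bytes, str],
    notdef: Dict[bytes, str],
) -> Tuple[Optional[str], int]:
    """Decode one code starting at pos; return (result, bytes consumed)."""
    max_len = max(len(low) for low, _ in ranges)
    for length in range(1, max_len + 1):
        code = data[pos:pos + length]
        if len(code) < length:
            break  # source exhausted
        if in_codespace(code, ranges):
            if code in mappings:
                return mappings[code], length
            # No regular mapping: fall back to a notdef mapping, if any
            return notdef.get(code), length
    # Assumption: nothing matched, so skip exactly one byte
    return None, 1


ranges = [(b"\x00", b"\x80"), (b"\x81\x40", b"\xfe\xfe")]
mappings = {b"A": "A", b"\x81\x40": "X"}
first = decode_one(b"A\x81\x40", 0, ranges, mappings, {})
second = decode_one(b"A\x81\x40", 1, ranges, mappings, {})
```

The byte ''<81>'' is not covered by the 1-byte range here, so the second call reads one more byte and matches the 2-byte range.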
 ===== Implementation notes ===== ===== Implementation notes =====
  
Line 130: Line 124:
   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
  
-==== Handling malformed CMaps ====+===== Monsters from the wild =====
  
-Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object. 
-===== Examples from the wild ===== 
  
-single byte mappings in a double byte codespace+==== Mappings outside the codespace ====
  
-using /find instead of /findresource+<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +27 beginbfchar 
 +<20> <0020> 
 +<2E> <002E> 
 +<43> <0043> 
 +<44> <0044> 
 +<45> <0045> 
 +%... 
 +</code>
  
-preventing copying+Here, single-byte mappings appear in a double-byte codespace, which is not correct according to the documentation.
  
-===== References =====+This can be seen often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
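
One way to separate such mappings from the valid ones is sketched below. It is a minimal Python sketch under assumptions: ''split_mappings'' and ''in_codespace'' are hypothetical names, mappings are held in a plain dictionary, and the ''unmapped'' dictionary merely plays the role of the ''#unmapped'' variable mentioned above.

```python
from typing import Dict, List, Tuple

Range = Tuple[bytes, bytes]  # (low, high) of one codespace range, same length


def in_codespace(code: bytes, ranges: List[Range]) -> bool:
    """True if code has the length of a range and each byte lies within it."""
    for low, high in ranges:
        if len(code) == len(low) and all(
            lo <= b <= hi for b, lo, hi in zip(code, low, high)
        ):
            return True
    return False


def split_mappings(
    mappings: Dict[bytes, str], ranges: List[Range]
) -> Tuple[Dict[bytes, str], Dict[bytes, str]]:
    """Return (covered, unmapped): mappings inside and outside the codespace."""
    covered: Dict[bytes, str] = {}
    unmapped: Dict[bytes, str] = {}
    for code, dest in mappings.items():
        target = covered if in_codespace(code, ranges) else unmapped
        target[code] = dest
    return covered, unmapped


# The first entries of the example above: 1-byte codes, 2-byte codespace
ranges = [(bytes.fromhex("0000"), bytes.fromhex("FFFF"))]
mappings = {
    bytes.fromhex("20"): "\u0020",
    bytes.fromhex("2E"): "\u002e",
    bytes.fromhex("43"): "\u0043",
}
covered, unmapped = split_mappings(mappings, ranges)
```

With the data from the example, every mapping ends up in ''unmapped'' because no 1-byte codespace range is declared.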
  
-[x] [[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html]]+==== Wrong PostScript ====
  
 +using /find instead of /findresource 
 +
 +See [[postscript#exception_handling_example]]
 +==== Prevent copying ====
 +
 +<code postscript>
 +%...
 +1 begincodespacerange
 +<0000> <FFFF>
 +endcodespacerange
 +100 beginbfchar
 +<0000> <001A>
 +<0100> <001A>
 +<0200> <001A>
 +<0300> <001A>
 +<0400> <001A>
 +%...
 +<4900> <001A>
 +<4A00> <001A>
 +<0001> <001A>
 +<0101> <001A>
 +<0201> <001A>
 +<0301> <001A>
 +<0401> <001A>
 +%...
 +</code>
 +
 +Here, all codes map to the same character (U+001A, the substitute character, Ctrl-Z) to prevent the text from being extracted. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time.
 +
 +Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]].
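
A CMap of this kind could be detected with a simple check on the destinations. The sketch below is Python and purely illustrative: ''single_destination'' is a hypothetical helper, and the dictionary is not the object structure mentioned above. A dictionary keyed by code does have the property that lookups do not depend on the order in which the mappings were listed.

```python
from typing import Dict, Optional


def single_destination(mappings: Dict[bytes, str]) -> Optional[str]:
    """Return the destination if every code maps to the same one, else None."""
    destinations = set(mappings.values())
    if len(destinations) == 1:
        return next(iter(destinations))
    return None


# A few entries from the example above, in the order they appear
mappings = {
    bytes.fromhex("0000"): "\u001a",
    bytes.fromhex("0100"): "\u001a",
    bytes.fromhex("0200"): "\u001a",
    bytes.fromhex("0001"): "\u001a",
    bytes.fromhex("0101"): "\u001a",
}
blocked = single_destination(mappings)
```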
 +==== Char to string mapping ====
 +
 +<code postscript>
 +%...
 +/CMapType 2 def
 +1 begincodespacerange
 +<00><FF>
 +endcodespacerange
 +1 beginbfchar
 +<24><0009 000d 0020 00a0>
 +endbfchar
 +1 beginbfchar
 +<50><002d 00ad 2010>
 +endbfchar
 +50 beginbfrange
 +<21><21><0050>
 +%...
 +</code>
  
 +It looks as if two codes (<24> and <50>) are each mapped to a string of 2-byte characters. I have not found anything about this in the documentation. Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
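
If the destination is read as a sequence of UTF-16BE code units, which is how ToUnicode destinations are usually interpreted, such an entry decodes to several characters at once. The following Python sketch rests on that assumption; ''decode_destination'' is a hypothetical helper name.

```python
def decode_destination(hex_string: str) -> str:
    """Decode the hex digits between < and > as UTF-16BE text."""
    raw = bytes.fromhex(hex_string)  # spaces between hex pairs are skipped
    return raw.decode("utf-16-be")


# The two destinations from the example above
whitespace = decode_destination("0009 000d 0020 00a0")
hyphens = decode_destination("002d 00ad 2010")
```

Under this reading, ''<24>'' yields tab, carriage return, space and no-break space, and ''<50>'' yields hyphen-minus, soft hyphen and hyphen.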
  • cmap.txt
  • Last modified: 2020/02/23 14:33
  • by christian