Differences

This shows you the differences between two versions of the page.

cmap [2020/02/21 12:15]
christian [CMap]
cmap [2020/02/23 14:33]
christian [CMap]
Line 1: Line 1:
 ====== CMap ====== ====== CMap ======
  
-CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define unidirectional mapping from a code to another. +CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define a unidirectional mapping from one code to another. (This should not be confused with the cmap table(([[https://docs.microsoft.com/en-us/typography/opentype/spec/cmap|cmap — Character to Glyph Index Mapping Table]])) of an OpenType font.)
  
 CMaps provide a very general mechanism which can describe any mappings, including unicode which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters. CMaps provide a very general mechanism which can describe any mappings, including unicode which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
Line 11: Line 11:
 CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators
   - to the glyph to be displayed and   - to the glyph to be displayed and
-  - to unicode+  - to unicode (in the ''ToUnicode''(([[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf|5411.ToUnicode.pdf]] ToUnicode Mapping File Tutorial)) attribute of a font)
  
-The official standard CMaps are now hosted at GitHub as open source project(([[https://github.com/adobe-type-tools/cmap-resources]] All standard CMaps from Adobe)). Also the mappings from the standard character collections to unicode are available(([[https://github.com/adobe-type-tools/mapping-resources-pdf]] Mapping the character collections to unicode)).+The official standard CMaps are now hosted on GitHub as an open source project(([[https://github.com/adobe-type-tools/cmap-resources|cmap-resources]] Standard CMaps from Adobe at GitHub)). The mappings from the standard character collections to Unicode are also available(([[https://github.com/adobe-type-tools/mapping-resources-pdf|mapping-resources-pdf]] Mapping character collections to unicode at GitHub)). An interesting blog post about how the CMap names were chosen can be found here(([[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html|CMap Resource Names Explained]] Adobe blog post)). 
 +===== Example =====
  
 +The source of a typical CMap looks like:
 +{{:pdf:cmap_raw.png?nolink|CMap source}}
 +
 +The derived CMap is displayed like this:
 +{{:pdf:cmap.png?nolink|CMap object}}
 ===== Components ===== ===== Components =====
  
Line 27: Line 33:
   * **/CMapName** the name under which the CMap is stored in the CMap resources   * **/CMapName** the name under which the CMap is stored in the CMap resources
   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2
-  * **/CMapType** Not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps+  * **/CMapType** Not clearly defined. 1 for predefined CID maps, 2 for ToUnicode maps
   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical
   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me
Line 60: Line 66:
 </code> </code>
  
-The byte ranges are dimensions+The byte ranges are dimensions. The bytes at each position define the range of possible bytes in that position. If we take the second codespace range <C080>..<DFBF>, it should be read as two ranges: <C0>..<DF> for the first byte and <80>..<BF> for the second. The code <C785> is in that space while <C77F> is not.
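
The per-byte reading of a codespace range can be sketched in Python as follows. This is only an illustration of the rule described above, not the implementation this page documents; the function name ''in_codespace'' is made up for the example.

```python
def in_codespace(code: bytes, low: bytes, high: bytes) -> bool:
    """Return True if `code` lies in the codespace range low..high.

    Each byte position is checked on its own, so the range is treated as a
    box of per-byte intervals and not as one numeric interval.
    (Hypothetical helper, not part of any CMap library.)
    """
    if not (len(code) == len(low) == len(high)):
        return False  # a code of a different length never matches
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


low, high = bytes.fromhex("C080"), bytes.fromhex("DFBF")
print(in_codespace(bytes.fromhex("C785"), low, high))  # True
print(in_codespace(bytes.fromhex("C77F"), low, high))  # False
```

Note that <C77F> lies numerically between <C080> and <DFBF>; it is rejected only because its second byte <7F> is below <80>.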
  
-more defined than used, because of **/usecmap**+A CMap can be defined on the basis of another with the operator **/usecmap**. ''usecmap'' takes the codespace and all the mappings from the referenced CMap and may add more mappings. In this case, the CMap cannot have a codespace definition of its own. This means that codespaces cannot be enlarged or altered when reusing another CMap.
  
 ==== Mappings ==== ==== Mappings ====
Line 68: Line 74:
 The mapping information is provided by char and range mappings. The mapping information is provided by char and range mappings.
  
-There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used to index the glyph of a font.+There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used as index of glyphs in a font.
  
   * /bfchar /bfrange   * /bfchar /bfrange
Line 74: Line 80:
   * /notdefchar /notdefrange   * /notdefchar /notdefrange
  
-===== Decoding =====+Char mappings map one code to another and are written as a pair of byte strings. 
 +<code postscript> 
 +<A63F> <32> 
 +<37> 7346456 
 +</code>
  
-Steps of decoding a code+The **source** (the first element) should be a bytestring written in hex notation, while the **destination** (second element) can also be given as integer. 
 + 
 +Range mappings consist of 3 elements where the first 2 define a range and the third element is the first destination code.  
 +<code postscript> 
 +<A63A> <A63F> <32> 
 +<37> <3B> 7346456 
 +</code> 
 +The first mapping maps a range of 6 codes (<A63A>..<A63F>) to the destination range <32>..<37>.
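
How a range mapping unfolds into individual char mappings can be sketched in Python. This is a simplified illustration: the helper ''expand_range'' is hypothetical, it assumes that only the last byte of the source code varies within a range, and it treats the destination as a plain integer.

```python
def expand_range(first: bytes, last: bytes, dest_start: int):
    """Yield (source code, destination) pairs for one range mapping.

    Simplifying assumption: only the last byte of the source code varies.
    (Hypothetical helper for illustration.)
    """
    if len(first) != len(last) or first[:-1] != last[:-1]:
        raise ValueError("range must vary in the last byte only")
    for offset in range(last[-1] - first[-1] + 1):
        yield first[:-1] + bytes([first[-1] + offset]), dest_start + offset


pairs = list(expand_range(bytes.fromhex("A63A"), bytes.fromhex("A63F"), 0x32))
for source, dest in pairs:
    # prints six lines, from "A63A -> 32" to "A63F -> 37"
    print(source.hex().upper(), "->", format(dest, "02X"))
```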
 + 
 +For **bf** mappings (mapping to characters), the destination can also be a PostScript character name or an array of names for ranges. 
 +<code postscript> 
 +beginbfchar 
 +<A63F> <32> 
 +<37> 7346456 
 +<84> /epsilon 
 +endbfchar 
 +beginbfrange 
 +<A63A> <A63F> <32> 
 +<37> <3B> 7346456 
 +<84> <86> [//c /mu] 
 +endbfrange 
 +</code>
  
 ===== Implementation notes ===== ===== Implementation notes =====
  
-Canonical representation+==== Canonical representation ====
  
-Handling malformed CMaps+When constructing a CMap object, great care has been taken to derive a canonical form of the CMap. This means that no matter how the original CMap is written, it will always end up with the same minimal CMap.
  
-===== Examples from the wild =====+The following modifications are applied: 
 +  * a range mapping with only one code is converted to a char mapping <code postscript><3F> <3F> <54> 
 +==> <3F> <54></code> 
 +  * adjacent mappings are joined to range mappings <code postscript><3E> <53> 
 +<3F> <54> 
 +==> <3E> <3F> <53></code> 
 +  * no duplicate mappings. CMaps allow duplicate mappings, which is needed for ''usecmap''. The later mapping wins. This is resolved in the canonical mapping so that no duplicates remain. 
 +  * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
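
The modifications listed above can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the implementation described here: codes and destinations are plain integers of one fixed code length, byte boundaries are ignored when joining, and the names ''canonicalize'' and ''_emit'' are made up for the example.

```python
def canonicalize(mappings):
    """Reduce (source, destination) integer pairs to a canonical list.

    Returns ("char", source, dest) and ("range", first, last, dest) tuples.
    (Hypothetical helper; assumes integer codes of one fixed length.)
    """
    latest = {}
    for source, dest in mappings:
        latest[source] = dest  # a later mapping overrides an earlier one

    result = []
    run = None  # [first, last, first_dest] of the run currently being built
    for source in sorted(latest):  # ordered by source code
        dest = latest[source]
        if run and source == run[1] + 1 and dest == run[2] + (source - run[0]):
            run[1] = source  # adjacent source and destination: extend the run
            continue
        if run:
            result.append(_emit(run))
        run = [source, source, dest]
    if run:
        result.append(_emit(run))
    return result


def _emit(run):
    """Turn a finished run into a char mapping or a range mapping."""
    first, last, dest = run
    if first == last:
        return ("char", first, dest)  # a one-code range becomes a char mapping
    return ("range", first, last, dest)


print(canonicalize([(0x3F, 0x54), (0x3E, 0x53)]))  # [('range', 62, 63, 83)]
```

Because the result depends only on the final set of code/destination pairs, two differently written inputs with the same effective mappings produce the same list.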
  
-single byte mappings in a double byte codespace+===== Monsters from the wild =====
  
-using /find instead of /findresource+CMaps are not well defined. Therefore, there are some interesting variations of them in the wild. Here is a small selection of some issues. 
 +==== Codespace problems ====
  
-preventing copying+=== Wrong code length ===
  
-===== References =====+<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +27 beginbfchar 
 +<20> <0020> 
 +<2E> <002E> 
 +<43> <0043> 
 +<44> <0044> 
 +<45> <0045> 
 +%... 
 +</code> 
 + 
 +Here are single byte mappings in a double byte codespace which is not correct according to the documentation. 
 + 
 +This can be seen often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object. 
 + 
 +=== Mappings outside the codespace === 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0001> <1004> 
 +endcodespacerange 
 +11 beginbfchar 
 +<0003> <00A0> 
 +<0005> <0022> 
 +<0008> <0025> 
 +<000F> <002C> 
 +<0010> <00AD> 
 +%... 
 +</code> 
 + 
 +Here, only the first mapping matches the code space. All others fall outside of it, because the second byte has to be between <01> and <04>.
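
Sorting source codes into those inside and outside the codespace can be sketched in Python with the values from this example. The function ''split_by_codespace'' and its ''unmapped'' list are hypothetical names for illustration; they only mimic the idea of collecting illegal mappings separately.

```python
def split_by_codespace(codes, ranges):
    """Split source codes into (mapped, unmapped) against codespace ranges.

    A code fits a range if it has the same length and every byte lies in
    the interval given for its position. (Hypothetical helper.)
    """
    def fits(code, low, high):
        return len(code) == len(low) == len(high) and all(
            lo <= b <= hi for b, lo, hi in zip(code, low, high)
        )

    mapped, unmapped = [], []
    for code in codes:
        inside = any(fits(code, low, high) for low, high in ranges)
        (mapped if inside else unmapped).append(code)
    return mapped, unmapped


ranges = [(bytes.fromhex("0001"), bytes.fromhex("1004"))]
codes = [bytes.fromhex(h) for h in ("0003", "0005", "0008", "000F", "0010")]
mapped, unmapped = split_by_codespace(codes, ranges)
print([c.hex().upper() for c in mapped])    # ['0003']
print([c.hex().upper() for c in unmapped])  # ['0005', '0008', '000F', '0010']
```

The same check also rejects a single byte code in a double byte codespace, because the lengths differ.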
 + 
 +==== Wrong PostScript ==== 
 + 
 +On one occasion, I saw a CMap where the PostScript used a non-existing operator (''/find'' instead of ''/findresource''). See the [[postscript#exception_handling_example]] on the PostScript page. 
 +==== Prevent copying ==== 
 + 
 +<code postscript> 
 +%... 
 +1 begincodespacerange 
 +<0000> <FFFF> 
 +endcodespacerange 
 +100 beginbfchar 
 +<0000> <001A> 
 +<0100> <001A> 
 +<0200> <001A> 
 +<0300> <001A> 
 +<0400> <001A> 
 +%... 
 +<4900> <001A> 
 +<4A00> <001A> 
 +<0001> <001A> 
 +<0101> <001A> 
 +<0201> <001A> 
 +<0301> <001A> 
 +<0401> <001A> 
 +%... 
 +</code> 
 + 
 +Here, all codes map to the same character (Substitute character, Ctrl-Z) to prevent extracting the text. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time. 
 + 
 +Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]]. 
 +==== Char to string mapping ==== 
 + 
 +<code postscript> 
 +%... 
 +/CMapType 2 def 
 +1 begincodespacerange 
 +<00><FF> 
 +endcodespacerange 
 +1 beginbfchar 
 +<24><0009 000d 0020 00a0> 
 +endbfchar 
 +1 beginbfchar 
 +<50><002d 00ad 2010> 
 +endbfchar 
 +50 beginbfrange 
 +<21><21><0050> 
 +%... 
 +</code>
  
-[x] [[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html]]+Two codes (<24> and <50>) are mapped to a string of 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.
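
One possible way to read such a destination is to decode the whole hex string as UTF-16BE, the encoding the PDF specification prescribes for ToUnicode destinations. The following Python sketch only illustrates that idea; it is not how this page's implementation works (which does not support the case yet), and ''decode_bf_destination'' is a made-up name.

```python
def decode_bf_destination(hex_string: str) -> str:
    """Decode a bf destination such as '0009 000d 0020 00a0' as UTF-16BE.

    Spaces between the hex digits are ignored by bytes.fromhex.
    (Hypothetical helper for illustration.)
    """
    return bytes.fromhex(hex_string).decode("utf-16-be")


text = decode_bf_destination("0009 000d 0020 00a0")
print([f"U+{ord(ch):04X}" for ch in text])  # ['U+0009', 'U+000D', 'U+0020', 'U+00A0']
```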
  
-[x] [[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf]]+Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
  • cmap.txt
  • Last modified: 2020/02/23 14:33
  • by christian