cmap

Differences

This shows you the differences between two versions of the page.

cmap [2020/02/04 12:09]
christian [CMap]
cmap [2020/02/23 14:33] (current)
christian [CMap]
Line 1: Line 1:
 ====== CMap ======
  
-CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define unidirectional mapping from a code to another.
+CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define unidirectional mapping from a code to another. (This should not be confused with the cmap table(([[https://docs.microsoft.com/en-us/typography/opentype/spec/cmap|cmap - Character to Glyph Index Mapping Table]])) of an OpenType font.)
  
-CMaps provide a very general mechanism which can describe any mappings, including unicode which was devloped later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
+CMaps provide a very general mechanism which can describe any mappings, including unicode which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
  
 They are part of type-0 fonts, where they define the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But, since CMaps are so general, some PDF applications use them as the default encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps.
Line 11: Line 11:
 CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators
   - to the glyph to be displayed and
-  - to unicode
+  - to unicode (in the ''ToUnicode''(([[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf|5411.ToUnicode.pdf]] ToUnicode Mapping File Tutorial)) attribute of a font)
  
-The official standard CMaps are now hosted at GitHub as open source project(([[https://github.com/adobe-type-tools/cmap-resources]] All standard CMaps from Adobe)). Also the mappings from the standard character collections to unicode are available(([[https://github.com/adobe-type-tools/mapping-resources-pdf]] Mapping the character collections to unicode)).
+The official standard CMaps are now hosted at GitHub as open source project(([[https://github.com/adobe-type-tools/cmap-resources|cmap-resources]] Standard CMaps from Adobe at GitHub)). Also the mappings from the standard character collections to unicode are available(([[https://github.com/adobe-type-tools/mapping-resources-pdf|mapping-resources-pdf]] Mapping character collections to unicode at GitHub)). An interesting blog post about how the CMap names were chosen can be found here(([[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html|CMap Resource Names Explained]] Adobe blog post)).
 +===== Example =====
  
 +The source of a typical CMap looks like:
 +{{:​pdf:​cmap_raw.png?​nolink|CMap source}}
 +
 +The derived CMap is displayed like this:
 +{{:​pdf:​cmap.png?​nolink|CMap object}}
 ===== Components =====
  
Line 27: Line 33:
   * **/CMapName** the name under which the CMap is stored in the CMap resources
   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2
-  * **/CMapType** Not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps
+  * **/CMapType** Not clearly defined. 1 for predefined CID maps, 2 for ToUnicode maps
   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical
   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me
Line 60: Line 66:
 </​code>​ </​code>​
  
-The byte ranges are dimensions
+The byte ranges are dimensions. The bytes at each position define the range of possible bytes in that position. If we take the second codespace range <C080>..<DFBF>, it should be read as two ranges: <C0>..<DF> for the first byte and <80>..<BF> for the second. The code <C785> is in that space while <C77F> is not.
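
The per-byte reading of a codespace range can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the implementation used here; the function name ''in_codespace'' is made up for this example.

```python
def in_codespace(code, low, high):
    """Return True if `code` lies inside the codespace range low..high.

    Every byte position is checked on its own, so the range is a box with
    one dimension per byte and not a plain numeric interval.
    """
    if not (len(code) == len(low) == len(high)):
        return False  # a code of another length never matches this range
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


low, high = bytes.fromhex("C080"), bytes.fromhex("DFBF")
print(in_codespace(bytes.fromhex("C785"), low, high))  # True
print(in_codespace(bytes.fromhex("C77F"), low, high))  # False
```

Note that <C77F> lies numerically between <C080> and <DFBF>, which is why a simple integer comparison would give the wrong answer.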
  
-more defined than used, because of **/usecmap**
+A CMap can be defined on the basis of another with the operator **/usecmap**. ''usecmap'' takes the codespace and all the mappings from the referenced CMap and may add more mappings. In this case, the CMap cannot have a codespace definition of its own. This means that codespaces cannot be enlarged or altered when reusing another CMap.
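
As a rough model of what ''usecmap'' does according to the description above, the following Python sketch treats a CMap as a plain dictionary. The dictionary layout and the name ''use_cmap'' are assumptions made for this example only; the rule that an added mapping overrides an inherited one follows the "later mapping wins" behaviour noted in the implementation notes.

```python
def use_cmap(base, added_mappings):
    """Build a new CMap on top of `base` (hypothetical dict layout).

    The codespace is inherited unchanged; mappings defined on top of the
    base override inherited mappings for the same code.
    """
    merged = dict(base["mappings"])
    merged.update(added_mappings)  # the later mapping wins
    return {"codespaces": list(base["codespaces"]), "mappings": merged}


base = {
    "codespaces": [(b"\x00", b"\xff")],
    "mappings": {b"\x20": 0x20, b"\x41": 0x41},
}
derived = use_cmap(base, {b"\x41": 0x391, b"\x42": 0x392})
print(derived["mappings"][b"\x41"] == 0x391)  # True
```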
  
 ==== Mappings ====
Line 68: Line 74:
 The mapping information is provided by char and range mappings.
  
-There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used to index the glyph of a font.
+There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs), which are used as an index into the glyphs of a font.
  
   * /bfchar /bfrange
Line 74: Line 80:
   * /notdefchar /notdefrange
  
-===== Decoding =====
+A char mapping maps one code to another and is written as 2 byte strings.
 +<code postscript>​ 
 +<​A63F>​ <​32>​ 
 +<37> 7346456 
 +</​code>​
  
-Steps of decoding a code
+The **source** (the first element) should be a byte string written in hex notation, while the **destination** (second element) can also be given as an integer.
 + 
 +Range mappings consist of 3 elements, where the first 2 define a range and the third element is the first destination code.
 +<code postscript>​ 
 +<​A63A>​ <​A63F>​ <​32>​ 
 +<37> <3B> 7346456 
 +</​code>​ 
 +The first mapping maps a range of 6 codes (<​A63A>​..<​A63F>​) to the destination range <​32>​..<​37>​. 
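
To make the range notation concrete, here is a small Python sketch that expands a range mapping into its individual code pairs. It is a simplification written for this page: codes are treated as big-endian integers and the name ''expand_range'' is made up.

```python
def expand_range(first, last, dest):
    """Expand a range mapping into a dict of source code -> destination.

    `first` and `last` are byte strings of equal length, `dest` is the
    integer destination of the first code in the range.
    """
    if len(first) != len(last):
        raise ValueError("both ends of a range must have the same length")
    start = int.from_bytes(first, "big")
    stop = int.from_bytes(last, "big")
    width = len(first)
    return {
        (start + i).to_bytes(width, "big"): dest + i
        for i in range(stop - start + 1)
    }


mapping = expand_range(bytes.fromhex("A63A"), bytes.fromhex("A63F"), 0x32)
print(len(mapping))                            # 6
print(hex(mapping[bytes.fromhex("A63F")]))     # 0x37
```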
 + 
 +For **bf** mappings (mapping to characters),​ the destination can also be a PostScript character name or an array of names for ranges. 
 +<code postscript>​ 
 +beginbfchar 
 +<​A63F>​ <​32>​ 
 +<37> 7346456 
 +<84> /epsilon 
 +endbfchar 
 +beginbfrange 
 +<​A63A>​ <​A63F>​ <​32>​ 
 +<37> <3B> 7346456 
 +<84> <86> [//c /mu] 
 +endbfrange 
 +</code>
  
 ===== Implementation notes ===== ===== Implementation notes =====
  
-Canonical representation
+==== Canonical representation ====
  
-Handling malformed CMaps
+When constructing a CMap object, great care has been taken to derive a canonical form of the CMap. This means that no matter how the original CMap is written, it always ends up as the same minimal CMap.
  
-===== Examples from the wild =====
+The following modifications are applied:
 +  * a range mapping with only one code is converted to a char mapping <code postscript><​3F>​ <3F> <​54>​ 
 +==> <3F> <​54></​code>​ 
 +  * adjacent mappings are joined to range mappings <code postscript><3E> <53>
 +<3F> <​54>​ 
 +==> <3E> <3F> <​53></​code>​ 
 +  * no duplicate mappings. CMaps allow duplicate mappings, which is needed for ''usecmap''. The later mapping wins. The canonical form resolves this so that no duplicates remain.
 +  * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
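
The four rules above can be illustrated with a short Python sketch. It is not the implementation described on this page: it expands every range code by code, which is simple but slow for large ranges, and it assumes that all codes have the same length so that they can be handled as integers. The name ''canonicalize'' is made up.

```python
def canonicalize(entries):
    """Reduce mappings to a canonical, ordered, minimal list.

    `entries` is a sequence of (first, last, dest) integer triples in
    definition order; a char mapping is a triple with first == last.
    The result holds (code, dest) pairs and (first, last, dest) triples.
    """
    flat = {}
    for first, last, dest in entries:
        for offset in range(last - first + 1):
            flat[first + offset] = dest + offset  # later mapping wins

    runs = []
    for code in sorted(flat):  # ordered output
        dest = flat[code]
        if runs:
            first, last, first_dest = runs[-1]
            # Join when both the code and its destination continue the run.
            if code == last + 1 and dest == first_dest + (code - first):
                runs[-1] = (first, code, first_dest)
                continue
        runs.append((code, code, dest))

    # A run of length one becomes a char mapping.
    return [(f, d) if f == l else (f, l, d) for f, l, d in runs]


print(canonicalize([(0x3F, 0x3F, 0x54)]) == [(0x3F, 0x54)])  # True
print(canonicalize([(0x3F, 0x3F, 0x54), (0x3E, 0x3E, 0x53)]) == [(0x3E, 0x3F, 0x53)])  # True
```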
  
-single byte mappings in a double byte codespace
+===== Monster from the wild =====
  
-using /find instead of /findresource
+CMaps are not well defined. Therefore, there are some interesting variations of them in the wild. Here is a small selection of some issues.
 +==== Codespace problems ====
  
-preventing copying
+=== Wrong code length ===
 + 
 +<code postscript>​ 
 +%... 
 +1 begincodespacerange 
 +<​0000>​ <​FFFF>​ 
 +endcodespacerange 
 +27 beginbfchar 
 +<20> <​0020>​ 
 +<2E> <​002E>​ 
 +<43> <​0043>​ 
 +<44> <​0044>​ 
 +<45> <​0045>​ 
 +%... 
 +</​code>​ 
 + 
 +Here, single-byte codes are mapped in a double-byte codespace, which is not correct according to the documentation.
 +
 +This is seen often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
 + 
 +=== Mappings outside the codespace === 
 + 
 +<code postscript>​ 
 +%... 
 +1 begincodespacerange 
 +<​0001>​ <​1004>​ 
 +endcodespacerange 
 +11 beginbfchar 
 +<​0003>​ <​00A0>​ 
 +<​0005>​ <​0022>​ 
 +<​0008>​ <​0025>​ 
 +<​000F>​ <​002C>​ 
 +<​0010>​ <​00AD>​ 
 +%... 
 +</​code>​ 
 + 
 +Here, only the first mapping matches the codespace. All others fall outside of it, because the second byte has to be between <01> and <04>.
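
Both codespace problems can be detected with the same check. The Python sketch below splits mappings into those that fit a codespace range and those that do not, using the data from the example above. The function name and the tuple layout are assumptions for this illustration; only the idea of collecting the rejects in an "unmapped" list comes from this page.

```python
def split_by_codespace(mappings, codespaces):
    """Split (code, dest) pairs into (matched, unmapped) lists.

    A code matches when some codespace range has the same length and
    contains the code byte by byte.
    """
    def matches(code):
        return any(
            len(code) == len(low) == len(high)
            and all(lo <= b <= hi for b, lo, hi in zip(code, low, high))
            for low, high in codespaces
        )

    matched = [(code, dest) for code, dest in mappings if matches(code)]
    unmapped = [(code, dest) for code, dest in mappings if not matches(code)]
    return matched, unmapped


codespaces = [(bytes.fromhex("0001"), bytes.fromhex("1004"))]
mappings = [
    (bytes.fromhex("0003"), 0x00A0),
    (bytes.fromhex("0005"), 0x0022),
    (bytes.fromhex("0008"), 0x0025),
    (bytes.fromhex("000F"), 0x002C),
    (bytes.fromhex("0010"), 0x00AD),
]
matched, unmapped = split_by_codespace(mappings, codespaces)
print(len(matched), len(unmapped))  # 1 4
```

The same function also rejects the single-byte codes of the "Wrong code length" example, because <20> has a different length than the range <0000>..<FFFF>.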
 + 
 +==== Wrong PostScript ==== 
 + 
 +On one occasion, I saw a CMap where the PostScript used a nonexistent operator (''/find'' instead of ''/findresource''). See the [[postscript#exception_handling_example]] on the PostScript page.
 +==== Prevent ​copying ​==== 
 + 
 +<code postscript>​ 
 +%... 
 +1 begincodespacerange 
 +<​0000>​ <​FFFF>​ 
 +endcodespacerange 
 +100 beginbfchar 
 +<​0000>​ <​001A>​ 
 +<​0100>​ <​001A>​ 
 +<​0200>​ <​001A>​ 
 +<​0300>​ <​001A>​ 
 +<​0400>​ <​001A>​ 
 +%... 
 +<​4900>​ <​001A>​ 
 +<​4A00>​ <​001A>​ 
 +<​0001>​ <​001A>​ 
 +<​0101>​ <​001A>​ 
 +<​0201>​ <​001A>​ 
 +<​0301>​ <​001A>​ 
 +<​0401>​ <​001A>​ 
 +%... 
 +</​code>​ 
 + 
 +Here, all codes map to the same character (Substitute character, Ctrl-Z) to prevent extracting the text. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time.
 + 
 +Seen in [[https://​github.com/​adobe-type-tools/​Adobe-CNS1/​raw/​master/​Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]]. 
 +==== Char to string mapping ==== 
 + 
 +<code postscript>​ 
 +%... 
 +/CMapType 2 def 
 +1 begincodespacerange 
 +<​00><​FF>​ 
 +endcodespacerange 
 +1 beginbfchar 
 +<​24><​0009 000d 0020 00a0> 
 +endbfchar 
 +1 beginbfchar 
 +<​50><​002d 00ad 2010> 
 +endbfchar 
 +50 beginbfrange 
 +<​21><​21><​0050>​ 
 +%... 
 +</​code>​
  
-===== References =====
+Two codes (<24> and <50>) are each mapped to a string of several 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.
  
-[x] [[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html]]
+Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
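
Since this case is not implemented yet, the following Python lines are only a sketch of how such a destination could be decoded. In a ToUnicode CMap the destination string is UTF-16BE encoded, so the hex digits can be converted to bytes and decoded in one step. The function name is made up.

```python
def decode_bf_destination(hex_string):
    """Decode the hex content of a ToUnicode destination into a str.

    `hex_string` is the text between the angle brackets, for example
    "002d 00ad 2010"; bytes.fromhex skips the spaces between the groups.
    """
    return bytes.fromhex(hex_string).decode("utf-16-be")


text = decode_bf_destination("002d 00ad 2010")
print([hex(ord(ch)) for ch in text])  # ['0x2d', '0xad', '0x2010']
```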
cmap.1580814569.txt.gz · Last modified: 2020/02/04 12:09 by christian