Differences

This shows you the differences between two versions of the page.

--- cmap [2020/02/22 16:16]
christian [CMap]
+++ cmap [2020/02/23 11:17]
christian [Mappings outside the codespace]
@@ Line 11: / Line 11: @@
 CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators
   - to the glyph to be displayed and
-  - to unicode
+  - to Unicode (in the ''ToUnicode''(([[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf|5411.ToUnicode.pdf]] ToUnicode Mapping File Tutorial)) attribute of a font)
-The official standard CMaps are now hosted at GitHub as open source project(([[https://github.com/adobe-type-tools/cmap-resources]] Standard CMaps from Adobe)). Also the mappings from the standard character collections to unicode are available(([[https://github.com/adobe-type-tools/mapping-resources-pdf]] Mapping the character collections to unicode)).
+The official standard CMaps are now hosted on GitHub as an open source project(([[https://github.com/adobe-type-tools/cmap-resources|cmap-resources]] Standard CMaps from Adobe at GitHub)). The mappings from the standard character collections to Unicode are also available(([[https://github.com/adobe-type-tools/mapping-resources-pdf|mapping-resources-pdf]] Mapping character collections to Unicode at GitHub)). An interesting blog post about how the CMap names were chosen can be found here(([[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html|CMap Resource Names Explained]] Adobe blog post)).
 ===== Example =====
+The source of a typical CMap looks like:
+{{:pdf:cmap_raw.png?nolink|CMap source}}
+The derived CMap is displayed like this:
+{{:pdf:cmap.png?nolink|CMap object}}
 ===== Components =====
@@ Line 29: / Line 33: @@
   * **/CMapName** the name under which the CMap is stored in the CMap resources
   * **/CIDSystemInfo** the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2
-  * **/CMapType** Not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps
+  * **/CMapType** Not clearly defined. 1 for predefined CID maps, 2 for ToUnicode maps
   * **/WMode** Writing direction: 0 for horizontal, 1 for vertical
   * **/CMapVersion**, **/UIDOffset**, **/XUID** and others without relevance for me
@@ Line 70: / Line 74: @@
 The mapping information is provided by char and range mappings.
-There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used as index of glyphs in a font.
+There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used as index of glyphs in a font.
   * /bfchar /bfrange
@@ Line 104: / Line 108: @@
 endbfrange
 </code>
-===== Decoding =====
-The steps of decoding are:
-  * take the first byte from the source and find a 1-byte codespace range which includes it
-    * if found, find a 1-byte mapping for the byte
-      * if found, return the destination code or character
-      * if no mapping found, try to find a notdef mapping and return the code
-        * if not found, see below
-    * if not found, read the next byte and repeat with 2-byte mappings
-When no mapping was found, one has to find out how many of the unmappable bytes have to be read from the source. This is not well defined (or I have not understood it yet).
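The decoding steps listed above can be sketched roughly as follows. This is only an illustration in Python, not the implementation described on this page: the names ''decode_next'' and ''in_codespace'' and the dictionary-based data layout are hypothetical. Two points are assumptions: a code is matched against a codespace range byte by byte, and exactly one byte is consumed when no codespace range matches (an arbitrary choice, since the text above leaves this case open).

```python
from typing import Dict, List, Optional, Tuple

Codespace = Tuple[bytes, bytes]  # (low, high), both of the same length


def in_codespace(code: bytes, ranges: List[Codespace]) -> bool:
    """True if `code` lies in a range of its own length (assumption: checked byte by byte)."""
    return any(
        len(low) == len(code)
        and all(l <= c <= h for l, c, h in zip(low, code, high))
        for low, high in ranges
    )


def decode_next(
    data: bytes,
    pos: int,
    ranges: List[Codespace],
    mappings: Dict[bytes, object],
    notdef: Dict[bytes, object],
    max_len: int = 4,
) -> Tuple[Optional[object], int]:
    """Return (destination or None, number of bytes consumed) for the code at `pos`."""
    for length in range(1, max_len + 1):
        code = data[pos:pos + length]
        if len(code) < length:
            break  # source exhausted before a codespace range matched
        if in_codespace(code, ranges):
            if code in mappings:
                return mappings[code], length
            # No regular mapping: fall back to a notdef mapping (None if absent).
            return notdef.get(code), length
    # No codespace range matched. Consuming one byte is an assumption of this sketch.
    return None, 1
```

A caller would loop over the source, advancing ''pos'' by the returned length each time, and decide for itself how to report a ''None'' destination.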
 ===== Implementation notes =====
@@ Line 130: / Line 124: @@
   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
-==== Handling malformed CMaps ====
+===== Monster from the wild =====
-Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
-===== Examples from the wild =====
+==== Mappings outside the codespace ====
 single byte mappings in a double byte codespace
-using /find instead of /findresource
+Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
-preventing copying
+==== Wrong PostScript ====
-===== References =====
+using /find instead of /findresource
-[x] [[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html]]
+See [[postscript#exception_handling_example]]
+==== Prevent copying ====
+<code postscript>
+%...
+begincodespacerange
+<0000> <FFFF>
+endcodespacerange
+beginbfchar
+<0000> <001A>
+<0100> <001A>
+<0200> <001A>
+<0300> <001A>
+<0400> <001A>
+%...
+<4900> <001A>
+<4A00> <001A>
+<0001> <001A>
+<0101> <001A>
+<0201> <001A>
+<0301> <001A>
+<0401> <001A>
+%...
+</code>
+Here, all codes map to the same character (U+001A, the substitute character, Ctrl-Z) to prevent text extraction. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time.
+Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]].
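A CMap like this could be flagged before any text extraction is attempted. The following Python sketch is one possible heuristic, not something taken from the implementation described here: the function name ''looks_obfuscated'' and the threshold ''min_codes'' are made up for illustration, and the mappings are assumed to be available as a dictionary from code bytes to destination strings.

```python
from typing import Dict


def looks_obfuscated(bf_mappings: Dict[bytes, str], min_codes: int = 16) -> bool:
    """Heuristic: many codes that all map to one and the same destination.

    `min_codes` is an arbitrary threshold (an assumption of this sketch) so that
    tiny CMaps with a single mapping are not flagged.
    """
    destinations = set(bf_mappings.values())
    return len(bf_mappings) >= min_codes and len(destinations) == 1
```

For the excerpt above, every destination is ''<001A>'', so such a check would report the map as obfuscated once enough ''bfchar'' entries have been read.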
+==== Char to string mapping ====
+<code postscript>
+%...
+/CMapType 2 def
+begincodespacerange
+<00><FF>
+endcodespacerange
+beginbfchar
+<24><0009 000d 0020 00a0>
+endbfchar
+beginbfchar
+<50><002d 00ad 2010>
+endbfchar
+beginbfrange
+<21><21><0050>
+%...
+</code>
-[x] [[https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf]]
+It looks as if two codes (<24> and <50>) are each mapped to a string of 2-byte characters. I have not found anything about this in the documentation. Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
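To see what such a destination would contain, it can be read as a PostScript hex string holding UTF-16BE code units. The Python sketch below does that under two assumptions: whitespace inside the angle brackets is ignored (as PostScript does for hex strings), and the destination really is meant as UTF-16BE, which is the usual encoding in ToUnicode CMaps. The function name ''decode_bf_destination'' is made up for this example.

```python
def decode_bf_destination(hex_string: str) -> str:
    """Decode a bfchar/bfrange destination such as '<0009 000d 0020 00a0>'.

    Assumptions of this sketch: whitespace inside <...> is insignificant,
    and the bytes are UTF-16BE code units.
    """
    digits = "".join(hex_string.strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # PostScript treats a missing final hex digit as 0
    return bytes.fromhex(digits).decode("utf-16-be")
```

Read this way, ''<24>'' would map to the four characters TAB, CR, SPACE and NO-BREAK SPACE, and ''<50>'' to HYPHEN-MINUS, SOFT HYPHEN and HYPHEN, which looks like one glyph being mapped to all the characters it can stand for.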