Differences

This shows you the differences between two versions of the page.

--- cmap [2020/02/23 10:29]
christian [Decoding]
+++ cmap [2020/02/23 14:33] (current)
christian [CMap]
@@ Line 1: / Line 1: @@
 ====== CMap ======
-CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define unidirectional mapping from a code to another.
+CMaps(([[https://www-cdf.fnal.gov/offline/PostScript/5014.CIDFont_Spec.pdf|5014.CIDFont_Spec.pdf]] Adobe CMap and CIDFont Files Specification)) (Character Maps) define a unidirectional mapping from one code to another. (They should not be confused with the cmap table(([[https://docs.microsoft.com/en-us/typography/opentype/spec/cmap|cmap — Character to Glyph Index Mapping Table]])) of an OpenType font.)
 CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
@@ Line 124: / Line 124: @@
   * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications.
-==== Handling malformed CMaps ====
+===== Monster from the wild =====
-Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
+CMaps are not well defined, so there are some interesting variations of them in the wild. Here is a small selection of issues.
-===== Examples from the wild =====
+==== Codespace problems ====
-single byte mappings in a double byte codespace
+=== Wrong code length ===
-using /find instead of /findresource
+<code postscript>
+%...
+begincodespacerange
+<0000> <FFFF>
+endcodespacerange
+beginbfchar
+<20> <0020>
+<2E> <002E>
+<43> <0043>
+<44> <0044>
+<45> <0045>
+%...
+</code>
-preventing copying
+These are single-byte mappings in a double-byte codespace, which is incorrect according to the documentation.
+This occurs often. These illegal mappings are collected into the ''#unmapped'' variable of a Mappings object.
+=== Mappings outside the codespace ===
+<code postscript>
+%...
+begincodespacerange
+<0001> <1004>
+endcodespacerange
+beginbfchar
+<0003> <00A0>
+<0005> <0022>
+<0008> <0025>
+<000F> <002C>
+<0010> <00AD>
+%...
+</code>
+Here, only the first mapping matches the codespace. All others fall outside of it, because the first byte has to be between <00> and <10> and the second byte between <01> and <04>.
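+The two codespace problems above can be checked with the same per-byte test. The following is a minimal Python sketch, not the implementation behind the Mappings object; the names ''parse_hex'' and ''in_codespace'' are hypothetical, and it assumes a code belongs to a codespace range only if it has the same length as the range bounds and each of its bytes lies between the corresponding bytes of the bounds.

```python
def parse_hex(token):
    """Convert a CMap hex token such as '<0003>' to a bytes object."""
    return bytes.fromhex(token.strip("<>"))


def in_codespace(code, low, high):
    """Return True if code has the length of the range bounds and every
    byte lies between the corresponding bytes of low and high (assumed rule)."""
    if not (len(code) == len(low) == len(high)):
        return False
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


# Codespace range and source codes from the second example above.
low, high = parse_hex("<0001>"), parse_hex("<1004>")
codes = ["<0003>", "<0005>", "<0008>", "<000F>", "<0010>"]

mapped = [c for c in codes if in_codespace(parse_hex(c), low, high)]
unmapped = [c for c in codes if c not in mapped]

# A single-byte code never matches a double-byte codespace (first example).
single_byte_ok = in_codespace(parse_hex("<20>"), parse_hex("<0000>"), parse_hex("<FFFF>"))
```

+With these inputs, only ''<0003>'' ends up in ''mapped''; the other four codes and the single-byte code ''<20>'' are rejected and would be candidates for an unmapped collection.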
+==== Wrong PostScript ====
+On one occasion, I saw a CMap whose PostScript used a nonexistent operator (''/find'' instead of ''/findresource''). See the [[postscript#exception_handling_example]] on the PostScript page.
+==== Prevent copying ====
+<code postscript>
+%...
+begincodespacerange
+<0000> <FFFF>
+endcodespacerange
+beginbfchar
+<0000> <001A>
+<0100> <001A>
+<0200> <001A>
+<0300> <001A>
+<0400> <001A>
+%...
+<4900> <001A>
+<4A00> <001A>
+<0001> <001A>
+<0101> <001A>
+<0201> <001A>
+<0301> <001A>
+<0401> <001A>
+%...
+</code>
+Here, all codes map to the same character (the substitute character U+001A, Ctrl-Z) to prevent text extraction. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time.
+Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/Adobe-CNS1-7.pdf|The Adobe-CNS1-7 Character Collection]].
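+A CMap of this kind can be recognised with a simple heuristic. The sketch below is an assumption for illustration only: the function name ''maps_everything_to_one_target'' is hypothetical, and the dictionary holds just a few of the mappings listed above.

```python
def maps_everything_to_one_target(mappings):
    """Heuristic: True if there are several source codes but only one
    distinct destination, as in the copy-prevention example."""
    return len(mappings) > 1 and len(set(mappings.values())) == 1


# A few of the bfchar entries from the example, keyed by source code.
protected = {
    bytes.fromhex("0000"): "\u001a",
    bytes.fromhex("0100"): "\u001a",
    bytes.fromhex("0200"): "\u001a",
    bytes.fromhex("0001"): "\u001a",
}

# An ordinary mapping for comparison.
ordinary = {
    bytes.fromhex("0003"): "\u00a0",
    bytes.fromhex("0005"): "\u0022",
}

protected_flag = maps_everything_to_one_target(protected)
ordinary_flag = maps_everything_to_one_target(ordinary)
```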
+==== Char to string mapping ====
+<code postscript>
+%...
+/CMapType 2 def
+begincodespacerange
+<00><FF>
+endcodespacerange
+beginbfchar
+<24><0009 000d 0020 00a0>
+endbfchar
+beginbfchar
+<50><002d 00ad 2010>
+endbfchar
+beginbfrange
+<21><21><0050>
+%...
+</code>
+Two codes (<24> and <50>) are each mapped to a string of several 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 "ToUnicode CMaps". This has not been implemented yet.
+Seen in a PDF with the ''Producer'' "Mac OS X 10.7.1 Quartz PDFContext".
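+Decoding such a destination could look roughly like the following Python sketch. It is not the missing implementation; the function name ''decode_destination'' is hypothetical, and it assumes the destination is UTF-16BE, as the PDF specification prescribes for ToUnicode CMaps.

```python
def decode_destination(token):
    """Decode a bfchar destination such as '<0009 000d 0020 00a0>'
    into a Python string, assuming UTF-16BE code units."""
    # bytes.fromhex ignores the spaces between the hex groups.
    raw = bytes.fromhex(token.strip("<>"))
    return raw.decode("utf-16-be")


# The two destinations from the example above.
whitespace_chars = decode_destination("<0009 000d 0020 00a0>")
hyphen_chars = decode_destination("<002d 00ad 2010>")
```

+Each destination becomes one string of several characters, so a mapping table that stores a single character per code would have to be widened to store strings.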