Differences
This shows you the differences between two versions of the page.
cmap [2020/02/23 10:29] christian [Decoding] |
cmap [2020/02/23 14:33] (current) christian [CMap] |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== CMap ====== | ====== CMap ====== | ||
- | CMaps(([[https:// | + | CMaps(([[https:// |
CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters. | CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters. | ||
Line 124: | Line 124: | ||
* the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | ||
- | ==== Handling malformed CMaps ==== | + | ===== Monsters from the wild ===== |
- | Sometimes | + | CMaps are not well defined. Therefore, there are some interesting variations of them in the wild. Here is a small selection |
- | ===== Examples from the wild ===== | + | ==== Codespace problems |
- | single byte mappings in a double byte codespace | + | === Wrong code length === |
- | using /find instead of / | + | <code postscript> |
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 27 beginbfchar | ||
+ | <20> < | ||
+ | <2E> < | ||
+ | <43> < | ||
+ | <44> < | ||
+ | <45> < | ||
+ | %... | ||
+ | </code> | ||
- | preventing | + | Here, single-byte mappings appear in a double-byte codespace, which is not correct according to the documentation. |
+ | |||
+ | This can be seen often. These illegal mappings are collected into the ''# | ||
+ | |||
+ | === Mappings outside the codespace === | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 11 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </code> | ||
+ | |||
+ | Here, only the first mapping matches the codespace. All others fall outside it, because the second byte has to be between <00> and < | ||
+ | |||
+ | ==== Wrong PostScript ==== | ||
+ | |||
+ | On one occasion, I saw a CMap where the PostScript used a nonexistent operator (''/ | ||
+ | ==== Prevent | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 100 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </code> | ||
+ | |||
+ | Here, all codes map to the same character (the substitute character, Ctrl-Z) to prevent text extraction. The ordering by the second byte is also interesting: it forced me to redesign the object structure to avoid exponential processing time. | ||
+ | |||
+ | Seen in [[https:// | ||
+ | ==== Char to string mapping ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | /CMapType 2 def | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 50 beginbfrange | ||
+ | < | ||
+ | %... | ||
+ | </code> | ||
+ | Two codes (<24> and <50>) are mapped to a string of 2-byte characters. This is defined by the PDF spec(({{pdf: | ||
+ | Seen in a PDF with the '' |
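
The two codespace problems listed above ("Wrong code length" and "Mappings outside the codespace") can be detected with a per-byte range check. The following is only a minimal sketch, not the implementation used on this page: the function names `in_codespace` and `classify`, the result labels, and the sample range `<0000> <FF7F>` are assumptions chosen for illustration. It assumes that a code matches a codespace range when it has the same number of bytes and each byte lies between the corresponding bytes of the lower and upper bound.

```python
from typing import List, Tuple

# (low, high) bounds of one codespace range; both must have the same length.
CodespaceRange = Tuple[bytes, bytes]


def in_codespace(code: bytes, ranges: List[CodespaceRange]) -> bool:
    """Return True if `code` lies inside at least one codespace range.

    The comparison is done byte by byte, not on the numeric value of the
    whole code: every byte must lie between the matching bytes of the bounds.
    """
    for low, high in ranges:
        if len(code) != len(low):
            continue  # a different byte length can never match this range
        if all(lo <= b <= hi for lo, b, hi in zip(low, code, high)):
            return True
    return False


def classify(code: bytes, ranges: List[CodespaceRange]) -> str:
    """Label a source code as 'ok', 'wrong length' or 'outside codespace'."""
    if in_codespace(code, ranges):
        return "ok"
    if all(len(code) != len(low) for low, _ in ranges):
        return "wrong length"
    return "outside codespace"


# Hypothetical double-byte codespace <0000> <FF7F> (not taken from the examples above).
ranges = [(bytes.fromhex("0000"), bytes.fromhex("FF7F"))]

print(classify(bytes.fromhex("0041"), ranges))  # both bytes within bounds
print(classify(bytes.fromhex("20"), ranges))    # single byte in a double-byte codespace
print(classify(bytes.fromhex("0080"), ranges))  # second byte above 0x7F
```

A tolerant reader could keep the mappings labelled `wrong length` or `outside codespace` in a separate collection instead of rejecting the whole CMap.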