Differences
This shows you the differences between two versions of the page.
cmap [2020/02/21 12:15] christian [CMap] |
cmap [2020/02/23 14:33] christian [CMap] |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== CMap ====== | ====== CMap ====== | ||
- | CMaps(([[https:// | + | CMaps(([[https:// |
CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters. | CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters. | ||
Line 11: | Line 11: | ||
CMaps are used in two ways in PDF (and PostScript): | CMaps are used in two ways in PDF (and PostScript): | ||
- to the glyph to be displayed and | - to the glyph to be displayed and | ||
- | - to Unicode | + | - to Unicode |
- | The official standard CMaps are now hosted on GitHub as an open source project(([[https:// | + | The official standard CMaps are now hosted on GitHub as an open source project(([[https:// |
+ | ===== Example ===== | ||
+ | The source of a typical CMap looks like: | ||
+ | {{: | ||
+ | |||
+ | The derived CMap is displayed like this: | ||
+ | {{: | ||
===== Components ===== | ===== Components ===== | ||
Line 27: | Line 33: | ||
* **/ | * **/ | ||
* **/ | * **/ | ||
- | * **/ | + | * **/ |
* **/WMode** Writing direction: 0 for horizontal, 1 for vertical | * **/WMode** Writing direction: 0 for horizontal, 1 for vertical | ||
* **/ | * **/ | ||
Line 60: | Line 66: | ||
</ | </ | ||
- | The byte ranges are dimensions | + | The byte ranges are dimensions: the bytes at each position define the range of possible bytes at that position. If we take the second codespace range < |
- | more defined | + | A CMap can be defined |
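The "dimensions" idea can be illustrated with a minimal Python sketch. It assumes the usual rule that a code belongs to a codespace range only if it has the same length and every byte lies between the corresponding bytes of the range limits. The function name `in_codespace` and the range `<8140> <9FFC>` are illustrative assumptions, not taken from this page.

```python
def in_codespace(code: bytes, low: bytes, high: bytes) -> bool:
    """Return True if `code` lies inside the codespace range <low> <high>.

    Each byte position is checked on its own, so the range is a
    rectangle of byte values, not a numeric interval.
    """
    if not (len(code) == len(low) == len(high)):
        return False
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


# Hypothetical two-byte codespace range <8140> <9FFC>.
low, high = bytes.fromhex("8140"), bytes.fromhex("9FFC")

print(in_codespace(bytes.fromhex("8840"), low, high))  # both bytes in range
print(in_codespace(bytes.fromhex("8830"), low, high))  # second byte below 0x40
print(in_codespace(bytes.fromhex("88"), low, high))    # wrong code length
```

Note that `<8830>` is rejected although, read as a number, it lies between 0x8140 and 0x9FFC. This is what distinguishes per-byte dimensions from a plain numeric range.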
==== Mappings ==== | ==== Mappings ==== | ||
Line 68: | Line 74: | ||
The mapping information is provided by char and range mappings. | The mapping information is provided by char and range mappings. | ||
- | There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used to index the glyph of a font. | + | There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs), which are used to index the glyphs of a font. |
* /bfchar /bfrange | * /bfchar /bfrange | ||
Line 74: | Line 80: | ||
* /notdefchar / | * /notdefchar / | ||
- | ===== Decoding ===== | + | A char mapping maps one code to another and is written as two byte strings. |
+ | <code postscript> | ||
+ | < | ||
+ | <37> 7346456 | ||
+ | </ | ||
- | Steps of decoding | + | The **source** (the first element) should be a byte string written in hex notation, while the **destination** (the second element) can also be given as an integer. |
+ | |||
+ | Range mappings consist | ||
+ | <code postscript> | ||
+ | < | ||
+ | <37> <3B> 7346456 | ||
+ | </ | ||
+ | The first mapping maps a range of 6 codes (< | ||
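As a rough illustration of how a range mapping with an integer destination unfolds into individual char mappings, here is a minimal Python sketch. It assumes that the codes of a range differ only in their last byte and that the destination is incremented by one for each code. The function name `expand_range` is hypothetical.

```python
def expand_range(first: bytes, last: bytes, dest: int) -> dict:
    """Expand the range mapping <first> <last> dest into char mappings.

    Assumption: both codes have the same length and differ only in
    their last byte; the destination grows by one per code.
    """
    if len(first) != len(last) or first[:-1] != last[:-1] or first[-1] > last[-1]:
        raise ValueError("range must vary only in its last byte")
    prefix = first[:-1]
    return {
        prefix + bytes([b]): dest + (b - first[-1])
        for b in range(first[-1], last[-1] + 1)
    }


# The range mapping <37> <3B> 7346456 from the example above.
for code, value in expand_range(b"\x37", b"\x3b", 7346456).items():
    print(code.hex().upper(), value)
```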
+ | |||
+ | For **bf** mappings (mapping to characters), | ||
+ | <code postscript> | ||
+ | beginbfchar | ||
+ | < | ||
+ | <37> 7346456 | ||
+ | <84> /epsilon | ||
+ | endbfchar | ||
+ | beginbfrange | ||
+ | < | ||
+ | <37> <3B> 7346456 | ||
+ | <84> <86> [/a /c /mu] | ||
+ | endbfrange | ||
+ | </code> | ||
===== Implementation notes ===== | ===== Implementation notes ===== | ||
- | Canonical representation | + | ==== Canonical representation |
- | Handling malformed CMaps | + | When constructing a CMap object, great care is taken to derive a canonical form of the CMap. This means that no matter how the original CMap is written, it always ends up as the same minimal CMap. |
- | ===== Examples from the wild ===== | + | The following modifications are applied: |
+ | * a range mapping with only one code is converted to a char mapping <code postscript>< | ||
+ | ==> <3F> < | ||
+ | * adjacent mappings are joined to range mappings <code postscript>< |
+ | <3F> < | ||
+ | ==> <3E> <3F> < | ||
+ | * no duplicate mappings. CMaps allow duplicate mappings, which is needed for '' |
+ | * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | ||
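The modifications listed above can be sketched in Python as follows. This is not the code of the implementation described here, only an illustration under simplifying assumptions: codes and destinations are plain integers of a single code length, and a later mapping overrides an earlier one for the same code. The name `canonicalize` is hypothetical.

```python
def canonicalize(entries):
    """Reduce (first, last, dest) entries to a canonical form.

    A char mapping is passed as an entry with first == last.
    Returns (chars, ranges), both ordered by code.
    """
    # Flatten everything to single codes; this removes duplicates.
    # Assumption: a later entry overrides an earlier one.
    flat = {}
    for first, last, dest in entries:
        for offset in range(last - first + 1):
            flat[first + offset] = dest + offset

    chars, ranges = [], []

    def emit(first, last, dest):
        # A range holding a single code becomes a char mapping.
        if first == last:
            chars.append((first, dest))
        else:
            ranges.append((first, last, dest))

    run = None  # current run as (first, last, dest of first)
    for code in sorted(flat):
        dest = flat[code]
        if run is not None and code == run[1] + 1 and dest == run[2] + (code - run[0]):
            run = (run[0], code, run[2])  # adjacent: extend the run
        else:
            if run is not None:
                emit(*run)
            run = (code, code, dest)
    if run is not None:
        emit(*run)
    return chars, ranges


chars, ranges = canonicalize([
    (0x3F, 0x3F, 0x41),  # range with only one code
    (0x10, 0x10, 0x20),  # two adjacent char mappings
    (0x11, 0x11, 0x21),
])
print(chars, ranges)
```

Flattening to single codes keeps the sketch short, but it is only practical for small CMaps; a real implementation would work on ranges directly.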
- | single byte mappings in a double byte codespace | + | ===== Monsters from the wild ===== |
- | using /find instead | + | CMaps are not well defined. Therefore, there are some interesting variations |
+ | ==== Codespace problems ==== | ||
- | preventing copying | + | === Wrong code length === |
- | ===== References | + | <code postscript> |
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 27 beginbfchar | ||
+ | <20> < | ||
+ | <2E> < | ||
+ | <43> < | ||
+ | <44> < | ||
+ | <45> < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, single-byte mappings appear in a double-byte codespace, which is not correct according to the documentation. | ||
+ | |||
+ | This can be seen often. These illegal mappings are collected into the ''# | ||
+ | |||
+ | === Mappings outside the codespace | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 11 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, only the first mapping matches the code space. All others fall outside of it, because the second byte has to be between <00> and < | ||
+ | |||
+ | ==== Wrong PostScript ==== | ||
+ | |||
+ | On one occasion, I saw a CMap where the PostScript used a non-existent operator (''/ |
+ | ==== Prevent copying ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 100 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, all codes map to the same character (the Substitute character, Ctrl-Z) to prevent text extraction. Also interesting is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time. | ||
+ | |||
+ | Seen in [[https:// | ||
+ | ==== Char to string mapping ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | /CMapType 2 def | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 50 beginbfrange | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
- | [x] [[https://blogs.adobe.com/ | + | Two codes (<24> and <50>) are mapped to a string of 2-byte characters. This is defined by the PDF spec(({{pdf:pdf32000_2008.pdf|PDF specification (ISO standard PDF 32000-1:2008)}})) in section 9.10.3 " |
- | [x] [[https:// | + | Seen in a PDF with the '' |
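To illustrate the char to string mapping described above: in a ToUnicode CMap, the destination of a **bf** mapping is a string of UTF-16BE code units, so one code can yield several characters. The following Python sketch decodes such a destination. The function name `decode_bf_destination` and the sample value `00660069` are illustrative assumptions, not taken from the CMap shown above.

```python
def decode_bf_destination(hex_string: str) -> str:
    """Decode the hex destination of a bf mapping as UTF-16BE text."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


# Hypothetical destination <00660069>: two 2-byte characters, "f" and "i".
print(decode_bf_destination("00660069"))
```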