Differences
This shows you the differences between two versions of the page.
cmap [2020/02/04 12:09] christian [CMap] |
cmap [2020/02/23 14:33] (current) christian [CMap] |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== CMap ====== | ====== CMap ====== | ||
- | CMaps(([[https:// | + | CMaps(([[https:// |
- | CMaps provide a very general mechanism which can describe any mappings, including unicode which was devloped | + | CMaps provide a very general mechanism which can describe any mappings, including unicode which was developed |
They are part of type-0 fonts defining the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But, since CMaps are so general, some PDF applications use them as the default for encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps. | They are part of type-0 fonts defining the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But, since CMaps are so general, some PDF applications use them as the default for encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps. | ||
Line 11: | Line 11: | ||
CMaps are used in two ways in PDF (and PostScript): | CMaps are used in two ways in PDF (and PostScript): | ||
- to the glyph to be displayed and | - to the glyph to be displayed and | ||
- | - to unicode | + | - to unicode |
- | The official standard CMaps are now hosted at GitHub as an open source project(([[https:// | + | The official standard CMaps are now hosted at GitHub as an open source project(([[https:// |
+ | ===== Example ===== | ||
+ | The source of a typical CMap looks like: | ||
+ | {{: | ||
+ | |||
+ | The derived CMap is displayed like this: | ||
+ | {{: | ||
===== Components ===== | ===== Components ===== | ||
Line 27: | Line 33: | ||
* **/ | * **/ | ||
* **/ | * **/ | ||
- | * **/ | + | * **/ |
* **/WMode** Writing direction: 0 for horizontal, 1 for vertical | * **/WMode** Writing direction: 0 for horizontal, 1 for vertical | ||
* **/ | * **/ | ||
Line 60: | Line 66: | ||
</ | </ | ||
- | The byte ranges are dimensions | + | The byte ranges are dimensions. The bytes on each position define the range of possible bytes in that position. If we take the second codespace range < |
- | more defined | + | A CMap can be defined |
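The "dimensions" view of a codespace range can be sketched in a few lines of Python. This is only an illustration: the function name `in_codespace` and the two-byte range `<8140> <9FFC>` are assumptions chosen for the example and are not taken from the CMap above.

```python
def in_codespace(code: bytes, low: bytes, high: bytes) -> bool:
    """Return True if `code` lies inside the codespace range low..high.

    Each byte position is treated as its own dimension: the i-th byte of
    the code must lie between the i-th bytes of the two range bounds.
    """
    if not (len(code) == len(low) == len(high)):
        return False  # a code must have exactly the length of the range
    return all(lo <= b <= hi for b, lo, hi in zip(code, low, high))


# Hypothetical two-byte range, chosen only for illustration.
low, high = bytes.fromhex("8140"), bytes.fromhex("9FFC")

print(in_codespace(bytes.fromhex("8A50"), low, high))  # both bytes within bounds
print(in_codespace(bytes.fromhex("8A20"), low, high))  # second byte 0x20 is below 0x40
print(in_codespace(bytes.fromhex("8A"), low, high))    # wrong code length
```

Note that `<8A20>` lies numerically between the two bounds but is still rejected, because the check is done per byte position and not on the code as a whole number.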
==== Mappings ==== | ==== Mappings ==== | ||
Line 68: | Line 74: | ||
The mapping information is provided by char and range mappings. | The mapping information is provided by char and range mappings. | ||
- | There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used to index the glyph of a font. | + | There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs), which are used as an index to the glyphs in a font. |
* /bfchar /bfrange | * /bfchar /bfrange | ||
Line 74: | Line 80: | ||
* /notdefchar / | * /notdefchar / | ||
- | ===== Decoding ===== | + | Char mappings map one code to another and are written as a pair of byte strings. |
+ | <code postscript> | ||
+ | < | ||
+ | <37> 7346456 | ||
+ | </ | ||
- | Steps of decoding | + | The **source** (the first element) should be a byte string written in hex notation, while the **destination** (second element) can also be given as an integer. |
+ | |||
+ | Range mappings consist | ||
+ | <code postscript> | ||
+ | < | ||
+ | <37> <3B> 7346456 | ||
+ | </ | ||
+ | The first mapping maps a range of 6 codes (< | ||
+ | |||
+ | For **bf** mappings (mapping to characters), | ||
+ | <code postscript> | ||
+ | beginbfchar | ||
+ | < | ||
+ | <37> 7346456 | ||
+ | <84> /epsilon | ||
+ | endbfchar | ||
+ | beginbfrange | ||
+ | < | ||
+ | <37> <3B> 7346456 | ||
+ | <84> <86> [/a /c /mu] | ||
+ | endbfrange | ||
+ | </code> | ||
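How a range mapping with an integer destination unfolds into individual char mappings can be sketched as follows. This is a minimal sketch under two assumptions: consecutive codes map to consecutive destinations, and the code is treated as one big-endian number (real CMaps usually vary only the last byte within a range). The helper name `expand_range` is hypothetical.

```python
from typing import Dict


def expand_range(first: bytes, last: bytes, dest: int) -> Dict[bytes, int]:
    """Expand a range mapping <first> <last> dest into single char mappings.

    Assumption: the n-th code of the range maps to dest + n.
    """
    if len(first) != len(last):
        raise ValueError("range bounds must have the same length")
    start = int.from_bytes(first, "big")
    end = int.from_bytes(last, "big")
    if end < start:
        raise ValueError("range end lies before range start")
    width = len(first)
    return {
        (start + offset).to_bytes(width, "big"): dest + offset
        for offset in range(end - start + 1)
    }


# The range mapping `<37> <3B> 7346456` from the example above.
for code, target in expand_range(b"\x37", b"\x3B", 7346456).items():
    print(f"<{code.hex().upper()}> {target}")
```

The range `<37> <3B>` covers five codes, so the sketch yields five char mappings with destinations 7346456 through 7346460.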
===== Implementation notes ===== | ===== Implementation notes ===== | ||
- | Canonical representation | + | ==== Canonical representation |
- | Handling malformed CMaps | + | When constructing a CMap object, great care has been taken to derive a canonical form of the CMap. This means that no matter how the original CMap is written, it will always end up with the same minimal CMap. |
- | ===== Examples from the wild ===== | + | The following modifications are applied: |
+ | * a range mapping with only one code is converted to a char mapping <code postscript>< | ||
+ | ==> <3F> < | ||
+ | * adjacent mappings are joined to range mappings <code postscript>< |
+ | <3F> < | ||
+ | ==> <3E> <3F> < | ||
+ | * no duplicate mappings. CMaps allow duplicate mappings, which is needed for '' |
+ | * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | ||
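The modifications listed above can be sketched roughly as follows. This is not the actual implementation, only an illustration under stated assumptions: codes and destinations are plain integers of one fixed length, "adjacent" means consecutive codes with consecutive destinations, and a later mapping replaces an earlier one for the same code. The name `canonicalize` and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (first code, last code, destination of the first code)
Range = Tuple[int, int, int]
Char = Tuple[int, int]  # (code, destination)


def canonicalize(mappings: List[Range]) -> Tuple[List[Char], List[Range]]:
    """Return (char mappings, range mappings) in a canonical, ordered form."""
    # 1. Flatten to single code -> destination pairs. A later entry
    #    overwrites an earlier one (assumption), which removes duplicates.
    flat: Dict[int, int] = {}
    for first, last, dest in mappings:
        for offset in range(last - first + 1):
            flat[first + offset] = dest + offset

    chars: List[Char] = []
    ranges: List[Range] = []

    def emit(run: List[int]) -> None:
        first, last, dest = run
        if first == last:
            chars.append((first, dest))  # one code only: char mapping
        else:
            ranges.append((first, last, dest))

    # 2. Walk the codes in order and join adjacent mappings into runs.
    run: List[int] = []
    for code in sorted(flat):
        dest = flat[code]
        if run and code == run[1] + 1 and dest == run[2] + (code - run[0]):
            run[1] = code  # extend the current run
        else:
            if run:
                emit(run)
            run = [code, code, dest]
    if run:
        emit(run)
    return chars, ranges


# Two single-code mappings written out of order, plus one isolated code.
chars, ranges = canonicalize([(0x3F, 0x3F, 0x41), (0x3E, 0x3E, 0x40), (0x50, 0x50, 0x70)])
print(chars)
print(ranges)
```

Flattening first and rebuilding the runs afterwards keeps the sketch short; for large CMaps a real implementation would probably avoid expanding every range code by code.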
- | single byte mappings in a double byte codespace | + | ===== Monsters from the wild ===== |
- | using /find instead | + | CMaps are not well defined. Therefore, there are some interesting variations |
+ | ==== Codespace problems ==== | ||
- | preventing | + | === Wrong code length === |
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 27 beginbfchar | ||
+ | <20> < | ||
+ | <2E> < | ||
+ | <43> < | ||
+ | <44> < | ||
+ | <45> < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, single-byte mappings appear in a double-byte codespace, which is not correct according to the documentation. | ||
+ | |||
+ | This can be seen often. These illegal mappings are collected into the ''# | ||
+ | |||
+ | === Mappings outside the codespace === | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 11 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, only the first mapping matches the code space. All others fall outside of it, because the second byte has to be between <00> and < | ||
+ | |||
+ | ==== Wrong PostScript ==== | ||
+ | |||
+ | On one occasion, I saw a CMap where the PostScript used a non-existing operator (''/ | ||
+ | ==== Prevent | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 100 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, all codes map to the same character (Substitute character, Ctrl-Z) to prevent extracting the text. Also interesting is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time. | ||
+ | |||
+ | Seen in [[https:// | ||
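A text extractor could flag such a CMap before trusting its output. The following is a minimal sketch, assuming the mappings have already been decoded into a dictionary from code to Unicode string; the function name `looks_obfuscated` and the threshold `min_codes` are hypothetical and only serve the illustration.

```python
from typing import Dict

SUBSTITUTE = "\x1a"  # U+001A, the Substitute character (Ctrl-Z)


def looks_obfuscated(mapping: Dict[bytes, str], min_codes: int = 10) -> bool:
    """Heuristic: many codes that all map to one and the same character.

    `min_codes` is an arbitrary threshold (assumption) that keeps very
    small CMaps from being flagged.
    """
    if len(mapping) < min_codes:
        return False
    return len(set(mapping.values())) == 1


# 100 two-byte codes that all map to the Substitute character.
blocked = {bytes([0x00, i]): SUBSTITUTE for i in range(100)}
# The same codes mapped to distinct characters.
normal = {bytes([0x00, i]): chr(0x41 + i) for i in range(100)}

print(looks_obfuscated(blocked))
print(looks_obfuscated(normal))
```

The check does not recover any text; it only tells the caller that the to-unicode mapping is useless and that another source, such as the glyph names of the font, would be needed.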
+ | ==== Char to string mapping ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | /CMapType 2 def | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 50 beginbfrange | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
- | ===== References ===== | + | Two codes (<24> and <50>) are mapped to a string of 2-byte characters. This is defined by the PDF spec(({{pdf: |
- | [x] [[https:// | + | Seen in a PDF with the '' |