Differences
This shows you the differences between two versions of the page.
cmap [2020/02/22 16:16] christian [CMap] |
cmap [2020/02/23 11:17] christian [Mappings outside the codespace] |
||
---|---|---|---|
Line 11: | Line 11: | ||
CMaps are used in two ways in PDF (and PostScript): | CMaps are used in two ways in PDF (and PostScript): | ||
- to the glyph to be displayed and | - to the glyph to be displayed and | ||
- | - to unicode | + | - to unicode |
- | + | ||
- | The official standard CMaps are now hosted on GitHub as an open source project(([[https:// | + |
+ | The official standard CMaps are now hosted on GitHub as an open source project(([[https:// | ||
===== Example ===== | ===== Example ===== | ||
+ | The source of a typical CMap looks like: | ||
+ | {{: | ||
+ | |||
+ | The derived CMap is displayed like this: | ||
+ | {{: | ||
===== Components ===== | ===== Components ===== | ||
Line 29: | Line 33: | ||
* **/ | * **/ | ||
* **/ | * **/ | ||
- | * **/ | + | * **/ |
* **/WMode** Writing direction: 0 for horizontal, 1 for vertical | * **/WMode** Writing direction: 0 for horizontal, 1 for vertical | ||
* **/ | * **/ | ||
Line 70: | Line 74: | ||
The mapping information is provided by char and range mappings. | The mapping information is provided by char and range mappings. | ||
- | There are **bf**, **cid** and **notdef** mappings. **bf** (what does this stand for?) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used as index of glyphs in a font. | + | There are **bf**, **cid** and **notdef** mappings. **bf** (base font) maps codes to characters. **cid** and **notdef** map codes to CIDs (Character IDs) used as index of glyphs in a font. |
* /bfchar /bfrange | * /bfchar /bfrange | ||
Line 104: | Line 108: | ||
endbfrange | endbfrange | ||
</ | </ | ||
- | ===== Decoding ===== | ||
- | The steps of decoding are: | ||
- | * take the first byte from the source and find a 1-byte codespace range which includes it | ||
- | * if found, find a 1-byte mapping for the byte | ||
- | * if found, return the destination code or character | ||
- | * if no mapping found, try to find a notdef mapping and return the code | ||
- | * if not found, see below | ||
- | * if not found, read the next byte and repeat with 2-byte mappings | ||
- | |||
- | When no mapping was found, one has to find out how many of the unmappable bytes have to be read from the source. This is not well defined (or I have not understood it yet). | ||
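The decoding steps listed in the old revision can be sketched roughly as follows. This is only an illustration, not the implementation described on this page: the function names, the dictionary-based mapping tables and the sample data are all hypothetical, and codespace ranges are assumed to be matched byte by byte. For bytes that fit no codespace range the sketch simply consumes a single byte, which is an arbitrary choice made here and not something taken from the specifications.

```python
def in_codespace(code, codespace):
    """True if every byte of code lies within the matching byte of a range of equal length."""
    for low, high in codespace:
        if len(low) == len(code) and all(
            lo <= c <= hi for lo, c, hi in zip(low, code, high)
        ):
            return True
    return False


def match_code(data, pos, codespace):
    """Try 1-byte codes first, then 2-byte codes, and so on; return the code or None."""
    max_len = max(len(low) for low, _ in codespace)
    for n in range(1, max_len + 1):
        code = data[pos:pos + n]
        if len(code) == n and in_codespace(code, codespace):
            return code
    return None


def decode(data, codespace, mappings, notdef):
    """Return a list of (code, value) pairs; value is None when nothing maps."""
    result = []
    pos = 0
    while pos < len(data):
        code = match_code(data, pos, codespace)
        if code is None:
            # Assumption: skip exactly one unmappable byte.
            code = data[pos:pos + 1]
            value = None
        else:
            # Regular mapping first, then the notdef mapping as fallback.
            value = mappings.get(code, notdef.get(code))
        result.append((code, value))
        pos += len(code)
    return result


# Hypothetical sample data: one 1-byte and one 2-byte codespace range.
codespace = [(b"\x00", b"\x80"), (b"\x81\x40", b"\xfe\xfe")]
mappings = {b"\x41": "A", b"\x81\x40": "\u3000"}
notdef = {b"\x05": 1}
print(decode(b"\x41\x81\x40\x05\xff", codespace, mappings, notdef))
```

Keeping the codespace lookup separate from the mapping lookup makes it easy to swap in a different rule for unmappable bytes later.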
===== Implementation notes ===== | ===== Implementation notes ===== | ||
Line 130: | Line 124: | ||
* the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | * the mappings are ordered. This is not strictly prescribed, but recommended by the specifications. | ||
- | ==== Handling malformed CMaps ==== | + | ===== Monster from the wild ===== |
- | Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''# | + | |
- | ===== Examples from the wild ===== | + | ==== Mappings outside |
single byte mappings in a double byte codespace | single byte mappings in a double byte codespace | ||
- | using /find instead | + | Sometimes CMaps define mappings which are not covered by the codespace ranges. This can be seen very often in the wild. These illegal mappings are collected into the ''# |
- | preventing copying | + | ==== Wrong PostScript ==== |
- | ===== References ===== | + | using /find instead of / |
- | [x] [[https://blogs.adobe.com/CCJKType/2012/02/cmap-resource-names-explained.html]] | + | See [[postscript# |
+ | ==== Prevent copying ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 100 beginbfchar | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
+ | |||
+ | Here, all codes map to the same character (Substitute character, Ctrl-Z) to prevent text extraction. Also notable is the ordering by the second byte, which forced me to redesign the object structure to avoid exponential processing time. | ||
+ | |||
+ | Seen in [[https://github.com/adobe-type-tools/Adobe-CNS1/raw/master/ | ||
+ | ==== Char to string mapping ==== | ||
+ | |||
+ | <code postscript> | ||
+ | %... | ||
+ | /CMapType 2 def | ||
+ | 1 begincodespacerange | ||
+ | < | ||
+ | endcodespacerange | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 1 beginbfchar | ||
+ | < | ||
+ | endbfchar | ||
+ | 50 beginbfrange | ||
+ | < | ||
+ | %... | ||
+ | </ | ||
- | [x] [[https:// | + | It looks as if two codes (<24> and <50>) are mapped to a string of 2-byte characters. I have not found anything about this in the documentation. Seen in a PDF with the '' |
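One possible reading, which should be checked against the PDF specification, is that the destination of a bf mapping is a sequence of UTF-16BE code units, so that a single code can map to several characters (a ligature, for example). A minimal sketch under that assumption follows; the function name and the hex values are hypothetical and are not taken from the CMap above.

```python
def bf_destination_to_text(hex_digits):
    """Decode the hex digits of a bf destination as UTF-16BE text (assumption)."""
    raw = bytes.fromhex(hex_digits)
    return raw.decode("utf-16-be")


# Hypothetical destinations: two 2-byte units forming two characters,
# and a surrogate pair forming a single character outside the BMP.
print(bf_destination_to_text("00660069"))
print(len(bf_destination_to_text("D83DDE00")))
```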