CMap
CMaps1) (Character Maps) define a unidirectional mapping from one code to another.
CMaps provide a very general mechanism which can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.
They are part of Type 0 fonts, where they define the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But, since CMaps are so general, some PDF applications use them as the default for encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps.
A CMap is a PostScript program using operators from the /CIDInit ProcSet.
CMaps are used in two ways in PDF (and PostScript): they map the codes in text operators
- to the glyph to be displayed and
- to Unicode
The official standard CMaps are now hosted on GitHub as an open source project2). The mappings from the standard character collections to Unicode are also available3).
Components
A CMap PostScript program creates a dictionary with all information in the CMap resource category. It can be accessed by its name with
((aPostScript.Interpreter resources at: #CMap) at: aCMapNameSymbol)
General Info
The following keys can be defined:
- /CMapName the name under which the CMap is stored in the CMap resources
- /CIDSystemInfo the character collection (see below). Mandatory for CMapType 1 (I think), without meaning for CMapType 2
- /CMapType Not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps
- /WMode Writing direction: 0 for horizontal, 1 for vertical
- /CMapVersion, /UIDOffset, /XUID and others without relevance for me
/CIDSystemInfo
A character collection defined by a dictionary with 3 keys: /Registry, /Ordering and /Supplement.
Example:
/CIDSystemInfo <</Registry (Adobe) /Ordering (GB1) /Supplement 5>> def
/Registry is almost always (Adobe). In particular, the standard CMaps of PDF are all from that registry.
/Ordering is a specific ordering of characters. Besides (Identity), there are only 5 supported ones: (CNS1), (GB1), (Japan1), (Korea1) and (KR).
/Supplement is a version number. A higher number adds more characters to the collection at the end.
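The relation between supplements can be illustrated with a small sketch. The class `CIDSystemInfo` and its method `covers` are hypothetical names chosen for this illustration, not part of any library. The sketch assumes, as stated above, that a higher supplement only appends characters to the same registry and ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CIDSystemInfo:
    """Hypothetical holder for the three /CIDSystemInfo keys."""
    registry: str
    ordering: str
    supplement: int

    def covers(self, other: "CIDSystemInfo") -> bool:
        """True if this collection contains every character of `other`.

        Assumption: a higher supplement only adds characters at the end
        of the same registry and ordering.
        """
        return (
            self.registry == other.registry
            and self.ordering == other.ordering
            and self.supplement >= other.supplement
        )


gb1_5 = CIDSystemInfo("Adobe", "GB1", 5)
gb1_2 = CIDSystemInfo("Adobe", "GB1", 2)
japan1 = CIDSystemInfo("Adobe", "Japan1", 6)
```

With these values, `gb1_5.covers(gb1_2)` holds, while the reverse does not, and collections with different orderings never cover each other.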
Codespace
The codespace defines the range of possible mappings and the number of bytes used for the mapping.
The UTF-8 encoding codespace as example:
4 begincodespacerange
<00> <7F>
<C080> <DFBF>
<E08080> <EFBFBF>
<F0808080> <F7BFBFBF>
endcodespacerange
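The byte ranges are dimensions: a code belongs to a range only if each of its bytes lies between the bytes at the same position of the low and high codes.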
The codespace may define more codes than the CMap's own mappings use, because of /usecmap.
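How a codespace determines the length of a code can be sketched as follows. This is a minimal illustration, not a complete implementation: the function names `in_range`, `next_code` and `split_codes` are made up for this page, the ranges are assumed not to overlap, and the sketch simply stops at the first byte sequence that matches no range instead of applying any recovery rule.

```python
from typing import List, Optional, Tuple

Codespace = List[Tuple[bytes, bytes]]  # (low code, high code) pairs


def in_range(code: bytes, low: bytes, high: bytes) -> bool:
    """A code is in a range if it has the same length and every byte lies
    between the corresponding low and high bytes (one dimension per byte)."""
    return len(code) == len(low) and all(
        lo <= b <= hi for b, lo, hi in zip(code, low, high)
    )


def next_code(data: bytes, pos: int, codespace: Codespace) -> Optional[bytes]:
    """Return the code starting at `pos`, or None if no range matches.

    Tries the shortest code length first (assumes non-overlapping ranges).
    """
    for length in sorted({len(low) for low, _ in codespace}):
        candidate = data[pos:pos + length]
        if any(in_range(candidate, low, high) for low, high in codespace):
            return candidate
    return None


def split_codes(data: bytes, codespace: Codespace) -> List[bytes]:
    """Split a string into codes; stops at the first unmatched byte."""
    codes: List[bytes] = []
    pos = 0
    while pos < len(data):
        code = next_code(data, pos, codespace)
        if code is None:
            break
        codes.append(code)
        pos += len(code)
    return codes


# The UTF-8 codespace from the example on this page.
UTF8_CODESPACE: Codespace = [
    (bytes.fromhex("00"), bytes.fromhex("7F")),
    (bytes.fromhex("C080"), bytes.fromhex("DFBF")),
    (bytes.fromhex("E08080"), bytes.fromhex("EFBFBF")),
    (bytes.fromhex("F0808080"), bytes.fromhex("F7BFBFBF")),
]

# One 1-byte, one 2-byte and one 3-byte code in a single string.
codes = split_codes(b"A\xc3\xa9\xe4\xb8\xad", UTF8_CODESPACE)
```

Note that `<C0C0>` is not inside `<C080> <DFBF>`, although it lies between the two as a number: its second byte is outside `80`..`BF`. This is why the check compares byte by byte.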
Mappings
The mapping information is provided by char and range mappings.
There are bf, cid and notdef mappings. bf (base font) maps codes to characters. cid and notdef map codes to CIDs (Character IDs) used to index the glyphs of a font.
- /bfchar /bfrange
- /cidchar /cidrange
- /notdefchar /notdefrange
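The difference between char and range mappings can be sketched for the bf case. The data layout and the function `lookup_bf` are hypothetical and only serve as illustration. Two simplifying assumptions are made: the destination of a range is a single character, and a char mapping is consulted before the ranges. cid ranges would work the same way, with an integer CID as destination instead of a character.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory form of the bf mappings of a CMap.
BfChars = Dict[bytes, str]                 # code -> character
BfRanges = List[Tuple[bytes, bytes, str]]  # (low code, high code, first character)


def lookup_bf(code: bytes, chars: BfChars, ranges: BfRanges) -> Optional[str]:
    """Map a code to a character, or return None if it is unmapped."""
    # Assumption: individual char mappings take precedence over ranges.
    if code in chars:
        return chars[code]
    value = int.from_bytes(code, "big")
    for low, high, first in ranges:
        if len(code) != len(low):
            continue  # a range only applies to codes of its own length
        lo = int.from_bytes(low, "big")
        hi = int.from_bytes(high, "big")
        if lo <= value <= hi:
            # The offset within the range is added to the first character.
            return chr(ord(first) + (value - lo))
    return None


# Made-up sample data: one char mapping and one range <0024> <003D> -> "A".."Z".
chars: BfChars = {b"\x00\x03": " "}
ranges: BfRanges = [(b"\x00\x24", b"\x00\x3d", "A")]

space = lookup_bf(b"\x00\x03", chars, ranges)
letter = lookup_bf(b"\x00\x26", chars, ranges)
missing = lookup_bf(b"\x00\x01", chars, ranges)
```

Here `<0026>` is the third code of the range and therefore maps to the third character counted from "A".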
Decoding
Steps of decoding a code
Implementation notes
Canonical representation
Handling malformed CMaps
Examples from the wild
single byte mappings in a double byte codespace
using /find instead of /findresource
preventing copying