CMap

CMaps 1) (Character Maps) define a unidirectional mapping from one code to another.

CMaps provide a very general mechanism that can describe any mapping, including Unicode, which was developed later. Input codes of variable length (1, 2, 3 or more bytes) can be mapped to characters.

They are part of Type 0 fonts, where they define the mapping from input codes to glyphs in the font. This is used mainly for Asian fonts (Japanese, Chinese, Korean) with thousands of characters. But since CMaps are so general, some PDF applications use them as the default for encoding. Therefore, for PDF text extraction, it is necessary to understand and use CMaps.

A CMap is a PostScript program using operators from the /CIDInit ProcSet.

CMaps are used in two ways in PDF (and PostScript): mapping codes in text operators

  1. to the glyph to be displayed and
  2. to Unicode

The official standard CMaps are now hosted on GitHub as an open source project 2). The mappings from the standard character collections to Unicode are also available 3).

A CMap PostScript program creates a dictionary with all of its information and stores it in the CMap resource category. It can be accessed by its name with

((aPostScript.Interpreter resources at: #CMap) at: aCMapNameSymbol)

The following keys can be defined:

  • /CMapName: the name under which the CMap is stored in the CMap resources
  • /CIDSystemInfo: the character collection (see below). Mandatory for CMapType 1 (I think); it has no meaning for CMapType 2
  • /CMapType: not clearly defined. 1 for predefined CID maps(?), 2 for ToUnicode maps
  • /WMode: writing direction, 0 for horizontal, 1 for vertical
  • /CMapVersion, /UIDOffset, /XUID and others that are not relevant for me

/CIDSystemInfo

A character collection is defined by a dictionary with 3 keys: /Registry, /Ordering and /Supplement.

Example:

/CIDSystemInfo <</Registry (Adobe) /Ordering (GB1) /Supplement 5>> def

/Registry is almost always (Adobe). In particular, the standard CMaps of PDF all come from that registry.

/Ordering is a specific ordering of characters. Besides (Identity), there are only 5 supported ones: (CNS1), (GB1), (Japan1), (Korea1) and (KR).

/Supplement is a version number. A higher number adds more characters to the collection at the end.
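
Since a higher supplement only appends characters, a collection version can be compared with another one by its three keys. The following is a minimal Python sketch of that idea, not part of any CMap tooling; the class name `CIDSystemInfo` and the method `covers` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CIDSystemInfo:
    """The three keys of a /CIDSystemInfo dictionary (illustrative model)."""
    registry: str
    ordering: str
    supplement: int

    def same_collection(self, other):
        # Two entries belong to the same character collection when
        # registry and ordering agree; the supplement is only a version.
        return (self.registry, self.ordering) == (other.registry, other.ordering)

    def covers(self, other):
        # Assumption taken from the text above: a higher supplement only
        # appends characters, so it contains everything of a lower one.
        return self.same_collection(other) and self.supplement >= other.supplement


gb1_5 = CIDSystemInfo("Adobe", "GB1", 5)
gb1_2 = CIDSystemInfo("Adobe", "GB1", 2)
japan1 = CIDSystemInfo("Adobe", "Japan1", 6)
```

With these values, `gb1_5.covers(gb1_2)` holds, while `gb1_2.covers(gb1_5)` and `gb1_5.covers(japan1)` do not.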

The codespace defines the range of possible mappings and the number of bytes used for the mapping.

The codespace of the UTF-8 encoding as an example:

4 begincodespacerange
	<00> <7F>
	<C080> <DFBF>
	<E08080> <EFBFBF>
	<F0808080> <F7BFBFBF>
endcodespacerange

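The byte ranges are dimensions: each byte position of a code is checked against its own range. For <C080> <DFBF>, the first byte must be in C0..DF and the second byte in 80..BF.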
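
To illustrate how a codespace splits a byte string into codes, here is a minimal Python sketch using the UTF-8 codespace from above. It treats every byte position as its own dimension, so a code matches a range only if each byte lies between the corresponding bytes of the two range ends. The function names (`parse_range`, `in_range`, `split_codes`) are made up for this example, and raising an error on an unmatched byte is an assumption of the sketch, not a rule from the specification.

```python
def parse_range(lo_hex, hi_hex):
    """Turn two hex strings of a codespace range into a pair of byte strings."""
    lo = bytes.fromhex(lo_hex)
    hi = bytes.fromhex(hi_hex)
    if len(lo) != len(hi):
        raise ValueError("both ends of a range must have the same length")
    return lo, hi


UTF8_CODESPACE = [
    parse_range("00", "7F"),
    parse_range("C080", "DFBF"),
    parse_range("E08080", "EFBFBF"),
    parse_range("F0808080", "F7BFBFBF"),
]


def in_range(code, lo, hi):
    # Per-byte check: every byte must lie within its own dimension.
    return len(code) == len(lo) and all(
        l <= b <= h for b, l, h in zip(code, lo, hi)
    )


def split_codes(data, codespace):
    """Split a byte string into codes by trying each codespace range in order."""
    codes = []
    pos = 0
    while pos < len(data):
        for lo, hi in codespace:
            code = data[pos:pos + len(lo)]
            if in_range(code, lo, hi):
                codes.append(code)
                pos += len(code)
                break
        else:
            # Assumption of this sketch: give up on bytes outside the codespace.
            raise ValueError(f"no codespace range matches at offset {pos}")
    return codes
```

For example, `split_codes("aé€".encode("utf-8"), UTF8_CODESPACE)` yields one 1-byte, one 2-byte and one 3-byte code. The byte pair C3 41 is rejected by the second range although the number C341 lies between C080 and DFBF, because the second byte is outside 80..BF.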

more defined than used, because of /usecmap

The mapping information is provided by char and range mappings.

There are bf, cid and notdef mappings. bf (probably short for "base font") maps codes to characters. cid and notdef map codes to CIDs (Character IDs), which are used to index the glyphs of a font.

  • /bfchar /bfrange
  • /cidchar /cidrange
  • /notdefchar /notdefrange
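
The difference between char and range mappings can be sketched in Python as follows. A char mapping assigns one code to one target, while a range mapping assigns the first code of the range to the given target and counts up from there. The class `BfMap` and its methods are hypothetical names for illustration. The sketch assumes ToUnicode-style targets written as UTF-16BE hex strings, and for ranges it handles only a single 16-bit start value.

```python
class BfMap:
    """Illustrative lookup table for bfchar and bfrange mappings."""

    def __init__(self):
        self.chars = {}   # code (bytes) -> target string
        self.ranges = []  # (low code, high code, first target code point)

    def add_bfchar(self, src_hex, dst_hex):
        # One code maps to one target string (assumed UTF-16BE).
        self.chars[bytes.fromhex(src_hex)] = bytes.fromhex(dst_hex).decode("utf-16-be")

    def add_bfrange(self, lo_hex, hi_hex, dst_hex):
        # The low code maps to dst, the next code to dst + 1, and so on.
        # Simplification: dst is treated as a single code point.
        self.ranges.append(
            (bytes.fromhex(lo_hex), bytes.fromhex(hi_hex), int(dst_hex, 16))
        )

    def lookup(self, code):
        if code in self.chars:
            return self.chars[code]
        for lo, hi, dst in self.ranges:
            # Byte strings of equal length compare like big-endian numbers.
            if len(code) == len(lo) and lo <= code <= hi:
                offset = int.from_bytes(code, "big") - int.from_bytes(lo, "big")
                return chr(dst + offset)
        return None  # no mapping defined for this code


bf = BfMap()
bf.add_bfchar("0003", "0020")           # code 0003 -> space
bf.add_bfrange("0041", "0043", "0061")  # codes 0041..0043 -> a, b, c
```

Here `bf.lookup(b"\x00\x42")` gives "b" and a code outside all mappings gives `None`. cid ranges count up in the same way, but the result is an integer CID instead of a character.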

Steps of decoding a code

Canonical representation

Handling malformed CMaps

single byte mappings in a double byte codespace

using /find instead of /findresource

preventing copying


1) 5014.CIDFont_Spec.pdf Adobe CMap and CIDFont Files Specification
3) https://github.com/adobe-type-tools/mapping-resources-pdf Mapping the character collections to Unicode
  • Last modified: 2020/02/04 12:09
  • by christian