ABAP Character Set
Application Server ABAP supports only Unicode systems in the current release.
- A Unicode system is an AS ABAP that is based on Unicode character representation, uses a code page for Unicode, and runs on an appropriate operating system.
- A non-Unicode system is an AS ABAP with code pages for single-byte code and double-byte code. Non-Unicode systems are no longer supported in the current release.
Unicode (ISO/IEC 10646), with its character set UCS, is designed to cover all existing characters. Several character formats are possible for the Unicode character set, such as the UTF formats (in which a character can occupy between one and four bytes) or UCS-2 (in which a character always occupies two bytes).
- UTF-16 is the system code page of a Unicode system.
- The ABAP programming language supports the character representation UCS-2, which fundamentally matches the UTF-16 representation and covers its characters (except the characters in the surrogate area).
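The relationship between these formats can be checked with a short sketch. It is written in Python rather than ABAP, purely as a language-neutral illustration; the function names `encoded_lengths` and `fits_in_ucs2` are hypothetical helpers, not part of any SAP API. It shows how many bytes a character occupies in UTF-8 and UTF-16, and whether the character fits into a single 16-bit unit as UCS-2 requires.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in each format."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" is used so that no 2-byte byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-le")),
    }


def fits_in_ucs2(ch: str) -> bool:
    """True if the character is representable as one 16-bit unit.

    Characters above U+FFFF need a surrogate pair in UTF-16, and the
    surrogate code points themselves are not characters.
    """
    cp = ord(ch)
    return cp <= 0xFFFF and not (0xD800 <= cp <= 0xDFFF)


if __name__ == "__main__":
    # "A", e with acute accent, euro sign, and an emoji outside the 16-bit range
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch), fits_in_ucs2(ch))
```

For the first three sample characters, UTF-16 and UCS-2 agree on two bytes per character. Only the last one, which lies outside the 16-bit range, needs four bytes in UTF-16 and has no UCS-2 representation.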
The restriction to UCS-2 in ABAP means that a character is always assumed to have a length of two bytes. This generally causes problems only if a character string is truncated in the middle of a character representation from the UTF-16 surrogate area, or if individual characters from sets of characters are compared in character string processing.
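The truncation problem can be made concrete with a minimal sketch, again in Python as a neutral illustration rather than ABAP. The helpers `utf16_units` and `splits_surrogate_pair` are hypothetical names introduced here. The sketch views a string as a sequence of 16-bit units, which is how a UCS-2 based length or offset calculation sees it, and reports whether a cut after `n` units would separate the two halves of a surrogate pair.

```python
def utf16_units(text: str) -> list:
    """Split text into its 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def splits_surrogate_pair(text: str, n: int) -> bool:
    """True if cutting after the first n code units breaks a surrogate pair."""
    units = utf16_units(text)
    if n <= 0 or n >= len(units):
        return False
    last_kept = units[n - 1]
    first_dropped = units[n]
    return 0xD800 <= last_kept <= 0xDBFF and 0xDC00 <= first_dropped <= 0xDFFF


if __name__ == "__main__":
    text = "a\U0001F600b"  # the middle character lies in the surrogate area
    print(len(text), len(utf16_units(text)))
    for n in range(1, 4):
        print(n, splits_surrogate_pair(text, n))
```

The sample string contains three characters but four 16-bit units, so a length computed in units is larger than the number of characters, and a cut after two units leaves an incomplete character at the end of the result.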
In a Unicode system, an ABAP program must have the ABAP language version Standard ABAP (Unicode). Programs with the obsolete language version Non-Unicode ABAP can no longer be used in a Unicode system.
Notes
- Before Unicode, SAP used various codes for representing the characters of different scripts, such as the single-byte code pages ASCII and EBCDIC, or double-byte code pages:
- ASCII (American Standard Code for Information Interchange) encodes every character with one byte, which means that a maximum of 256 characters can be represented. Strictly speaking, standard ASCII encodes a character using only 7 bits and can therefore represent only 128 characters; the extension to 8 bits was introduced with ISO-8859. Examples of common code pages are ISO-8859-1 for Western European scripts and ISO-8859-5 for Cyrillic scripts.
- EBCDIC (Extended Binary Coded Decimal Interchange Code) also encodes each character using one byte, and can therefore also represent 256 characters. For example, EBCDIC 0697/0500 is an IBM format that has been used on the AS/400 platform (now known as IBM System i) for Western European scripts.
- Double-byte code pages require one or two bytes per character. This enables up to 65536 characters to be represented, of which only 10000 to 15000 characters are normally used. For example, the code page SJIS is used for Japanese and BIG5 for traditional Chinese scripts.
- In earlier non-Unicode systems, the system code pages were defined in the database table TCPDB. In non-Unicode single code page systems, there was only one system code page. In the obsolete MDMP systems, there were multiple system code pages.
- Before Unicode support was introduced, many ABAP programmers assumed that one character corresponded to one byte. Therefore, before a non-Unicode system is converted to Unicode, ABAP programs must be changed wherever an explicit or implicit assumption is made about the internal length of a character. This mainly affects the following:
- Access to structures. This is affected because flat structures in a program with the obsolete ABAP language version Non-Unicode ABAP are handled like character-like data objects, and some programming techniques exploit this fact. The structure fragment view can be used to handle structures.
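Why the assumption "one character corresponds to one byte" fails after a Unicode conversion can be illustrated by comparing byte lengths across the code pages mentioned in these notes. The following is a minimal sketch in Python, not ABAP. The codec names are Python's own, and treating Python's `cp500` codec as a stand-in for EBCDIC 0697/0500 is an assumption made for this illustration; `byte_length` and `SAMPLES` are hypothetical names.

```python
# (Python codec name, sample character) pairs. The codec names are Python's;
# "cp500" is assumed here to stand in for EBCDIC 0697/0500.
SAMPLES = [
    ("ascii", "A"),
    ("iso8859_1", "\u00e9"),  # Latin small letter e with acute
    ("iso8859_5", "\u0416"),  # Cyrillic capital letter zhe
    ("cp500", "A"),
    ("shift_jis", "A"),       # single-byte range of a double-byte code page
    ("shift_jis", "\u65e5"),  # Japanese character, double-byte range
    ("big5", "\u4e2d"),       # traditional Chinese character
]


def byte_length(codec: str, ch: str) -> int:
    """Number of bytes one character occupies in the given code page."""
    return len(ch.encode(codec))


if __name__ == "__main__":
    for codec, ch in SAMPLES:
        legacy = byte_length(codec, ch)
        unicode_len = byte_length("utf-16-le", ch)
        print(f"{codec:10} U+{ord(ch):04X} legacy={legacy} utf-16={unicode_len}")
```

Each sample character occupies one byte in a single-byte code page and one or two bytes in a double-byte code page, but always two bytes in UTF-16. Code that computes offsets or lengths in bytes therefore yields different results once the system code page is Unicode.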