UNICONV(1) LINUX COMMANDS UNICONV(1)
NAME
uniconv - convert text to native formats through unicode
SYNOPSIS
uniconv [ -out output-file ] [ -decode input-encoding ] [ -encode
output-encoding ] [ input-file ] [ -todos ] [ -fromdos ] [ -tomac ]
[ -frommac ]
DESCRIPTION
The uniconv program decodes scripts with a certain encoding and encodes
them with some other encoding. The script is a 16, 8 or 7 bit byte
stream. The converted text is sent to the standard output, even in case
of 16-bit encodings, unless an output file is specified with the -out
option.
The -decode and -encode options are optional; the default converter is
utf-8. The program reads the Unicode map helper files (*.my) from the
default directory /usr/share/yudit/data. Simple 1-to-1 encodings can be
added on the fly by adding a my-file, or by setting your yudit.datapath
property in ~/.yudit/yudit.properties or
/usr/share/yudit/config/yudit.properties. My-files can be created by a
program called makeumap.
The files can be converted between dos/unix/mac line-ending variants
with the -fromdos, -frommac, -todos and -tomac options. The default
(when none is specified) is Unix.
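The decode, then encode flow described above can be sketched in Python. This is only an illustration of the idea, not how uniconv is implemented: the function name convert and its eol parameter are hypothetical, Python's codec names stand in for uniconv's encoding names, and the sketch normalizes every line-ending variant on input rather than modelling each -from/-to option separately.

```python
def convert(data: bytes, decode: str = "utf-8", encode: str = "utf-8",
            eol: str = "\n") -> bytes:
    """Decode bytes to Unicode, normalize line endings, encode again."""
    text = data.decode(decode)
    # Normalize DOS (\r\n) and Mac (\r) line endings to Unix (\n) first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Then write the requested variant: "\n" Unix, "\r\n" DOS, "\r" Mac.
    if eol != "\n":
        text = text.replace("\n", eol)
    return text.encode(encode)


# A Latin-1 file with DOS line endings, converted to utf-8 with Unix ones.
latin1_dos = "caf\u00e9\r\n".encode("iso-8859-1")
print(convert(latin1_dos, decode="iso-8859-1"))  # b'caf\xc3\xa9\n'
```

Going through Unicode in the middle is what lets any supported input encoding be paired with any supported output encoding.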
ENCODING
If you received this program through the Yudit distribution, then as of
today you can convert between the encodings below.
utf-8 Yudit recommends this format for international information
exchange. ASCII text will get through intact, while other
unicode characters will get their 8th bit set and the length of
the code will depend on how far away they are in the Unicode
space. This is the only transformation format that can encode
both 16-bit (ucs-2) and 31-bit (ucs-4) unicode.
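As a small illustration of how the encoded length grows with the code point, the following Python sketch uses Python's built-in utf-8 codec (not uniconv itself). The sample characters are arbitrary choices.

```python
samples = ("A", "\u00e9", "\u20ac", "\U0001F600")

# Map each sample character to the number of bytes utf-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    encoded = ch.encode("utf-8")
    # ASCII stays a single byte; everything else has the 8th bit set.
    high_bit = all(b & 0x80 for b in encoded)
    print(f"U+{ord(ch):04X}: {lengths[ch]} byte(s), 8th bit set: {high_bit}")
```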
utf-8-s
Hacker's utf-8 format: it does not give an error message when a
surrogate pair is decoded and it can encode a surrogate pair 'as
is'. This is not a recommended encoding format, although it is
used to encode/decode clipboard data in order to preserve input.
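Python's surrogatepass error handler behaves in a comparable way and can serve as a rough analogy. This is an assumption made for illustration; it is not claimed to match utf-8-s byte for byte.

```python
lone = "\ud800"  # an unpaired high surrogate

# Strict utf-8 refuses to encode a surrogate code point.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The relaxed handler writes the surrogate 'as is' and reads it back.
relaxed = lone.encode("utf-8", "surrogatepass")
roundtrip = relaxed.decode("utf-8", "surrogatepass")

print(strict_ok, relaxed, roundtrip == lone)
```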
utf-16 Although 16 is bigger than 8, this is still a compromise required
by OSes like Windows that cannot handle ucs-4. This encoding
produces 16-bit unicode streams. In addition to the BMP it can
convert 16 planes using the Unicode Surrogate Area. This encoding
cannot convert anything above U+10FFFF (Plane 16). The input
byte order is recognized from the first two bytes, the BOM
(byte-order mark) U+FEFF. This format is used in Windows NT for
documents like notepad .txt files.
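The surrogate arithmetic and the byte-order-mark check can be sketched as follows. The function names to_surrogates and detect_byte_order are hypothetical helpers written for this illustration; they are not part of uniconv.

```python
import codecs


def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encoded range")
    v = cp - 0x10000
    # The high surrogate carries the top 10 bits, the low one the bottom 10.
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order from a leading U+FEFF byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return "unknown"


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```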
utf-16-be
Big endian utf-16 converter.
utf-16-le
Little endian utf-16 converter.
utf-7 This is the recommended format for international information
exchange when only 7-bit can be used. It can only handle 16-bit
(utf-16) unicode; for ucs-4 (above U+10FFFF) you should use the
utf-8 encoding.
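A quick way to see the 7-bit property is Python's built-in utf-7 codec, used here as a stand-in for uniconv's converter. The sample text is an arbitrary choice.

```python
text = "caf\u00e9"

encoded = text.encode("utf-7")
# Every output byte fits in 7 bits, so it survives 7-bit-only channels.
seven_bit = all(b < 0x80 for b in encoded)
decoded = encoded.decode("utf-7")

print(encoded, seven_bit, decoded == text)
```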
iso-8859-1
This is the ISO 8859-1 character encoding format. It is also
known as "Latin-1" encoding.
iso-8859-2
This is the ISO 8859-2 character encoding format. It is also
known as "Central European" encoding.
iso-8859-5
This is the ISO 8859-5 character encoding format. It is also
known as "Cyrillic" encoding.
iso-8859-7
This is the ISO 8859-7 character encoding format. It is also
known as "Greek" encoding.
iso-8859-9
This is the ISO 8859-9 character encoding format. It is also
known as "Turkish" encoding.
koi8-r This is the KOI8-R character encoding format. It is mainly used
in Russia.
cp-1251
This is the CP1251 Cyrillic character encoding format. It is
mainly used in Microsoft Windows and on some web sites.
iso-2022-jp
This is a Japanese character encoding format. It is a 7-bit
encoding format.
iso-2022-jp-3
This is a Japanese character encoding format. It is a 7-bit
encoding format. It is based upon the JIS X 0213 standard.
euc-jp This is a Japanese character encoding format. It is an 8-bit
encoding format. Mainly used in UNIX systems.
euc-jp-3
The official name is EUC-JISX0213 - I just could not read this.
This is a Japanese character encoding format. It is an 8-bit
encoding format. It is based upon the JIS X 0213 standard.
shift-jis
This is a Japanese character encoding format. It is an 8-bit
encoding format. Mainly used in MSDOS/Windows.
shift-jis-3
The official name is Shift_JISX0213 - I just could not read
this. This is a Japanese character encoding format. It is an
8-bit encoding format. Mainly used in MSDOS/Windows.
iso-2022-jp
This is a Japanese 7-bit character encoding format. Email
messages in iso-2022-jp can be decoded/encoded with this format.
iso-2022-x11
This is a Japanese character encoding format. It is also known
as "COMPOUND_TEXT" encoding for the X Window System. This is a
7-bit encoding format. It can be derived from the ISO 2022-JP
format with some differences.
ksc-5601-x11
This is a Korean character encoding format used by the X Window
System (COMPOUND_TEXT encoding) to encode Korean (KS X 1001)
and US-ASCII. This is a 7-bit encoding format compliant with the
ISO-2022 specification for encoding of multiple character sets.
Please note that this is DIFFERENT from ISO-2022-KR (defined in
IETF RFC 1557).
euc-kr This is an 8-bit multibyte encoding for Korean. It encodes
US-ASCII (7-bit) in the single-byte range and characters in KS X
1001 (formerly KS C 5601) in the double-byte range with the MSB
on (8-bit). It is used on Unix and the Internet. Korean versions
of MS-DOS, MacOS and MS-Windows use a compatible (in most cases
identical) variant of this encoding.
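The single-byte and double-byte ranges can be observed with Python's built-in euc-kr codec, used here as a stand-in for uniconv's converter. The sample characters are arbitrary.

```python
ascii_bytes = "A".encode("euc-kr")
hangul_bytes = "\ud55c".encode("euc-kr")  # U+D55C HANGUL SYLLABLE HAN

# US-ASCII stays one 7-bit byte; a KS X 1001 character takes two
# bytes, both with the most significant bit set.
print(len(ascii_bytes), ascii_bytes[0] < 0x80)
print(len(hangul_bytes), all(b & 0x80 for b in hangul_bytes))
```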
johab This is a Korean encoding specified in KS X 1001 (KS C
5601-1992), Annex 3 as a supplementary encoding. It was widely
used in Korean MS-DOS until the mid-1990s. It can encode all
Hangul syllables (11,172) of modern Korean as well as all the
special symbols and Hanja (Chinese ideograms used in Korea)
defined in KS X 1001.
uhc A variant of EUC-KR used in Korean MS-Windows 95/98 (a
proprietary encoding of Microsoft, CP949). Its character
repertoire includes all modern syllables of Hangul, the Korean
script, as well as all the special symbols and Hanja (Chinese
ideograms used in Korea) defined in KS X 1001.
gb-18030
This is a Chinese character encoding format based upon GB 18030.
It encodes the whole U+0000..U+10FFFF range, while being
compatible with gb-2312.
gb-2312-x11
This is a Chinese character encoding format based upon GB 2312.
It is a 7-bit encoding format.
gb-2312
This is a Chinese character encoding format based upon GB 2312.
It is an 8-bit encoding format.
big-5 This is a Chinese character encoding format based upon BIG5
encoding. It is an 8-bit encoding format.
hz This is a Chinese character encoding format based upon "Hanzi"
encoding. It is a 7-bit encoding format.
viscii This is a Vietnamese character encoding format.
ucs-2-be
This converts 16-bit unicode (ucs-2) streams. The format takes
care of the big-endian variant. Yudit does not recommend this
format.
ucs-2-le
This converts 16-bit unicode (ucs-2) streams. The format takes
care of the little-endian variant. Yudit does not recommend this
format.
ucs-2 This converts 16-bit unicode (ucs-2) streams. The input byte
order is recognized from the first two bytes, the BOM
(byte-order mark) U+FEFF. Yudit does not recommend this format.
java This converts \uxxxx character escapes. When encoding, all
characters above U+0080 will be escaped with a string like
'\u0080'. When decoding, the same format is decoded but, in
addition, utf-8 format is also recognized, so it can also be
used to recover data accidentally saved with the wrong encoding.
The U+10000..U+10FFFF area is converted to surrogates and vice
versa.
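The escape format and the surrogate handling can be sketched in Python. The helper names java_escape and java_unescape are hypothetical, the choice to escape every code unit from 0x80 upward is an assumption about the exact boundary, and the sketch does not model the utf-8 fallback on decoding.

```python
import re


def java_escape(text: str) -> str:
    """Escape non-ASCII text as \\uxxxx, one escape per UTF-16 code unit."""
    data = text.encode("utf-16-be", "surrogatepass")
    out = []
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        # Assumption: code units from 0x80 upward are escaped.
        out.append(chr(unit) if unit < 0x80 else "\\u%04x" % unit)
    return "".join(out)


def java_unescape(text: str) -> str:
    """Turn \\uxxxx escapes back into text, rejoining surrogate pairs."""
    units = re.sub(r"\\u([0-9a-fA-F]{4})",
                   lambda m: chr(int(m.group(1), 16)), text)
    # A round trip through UTF-16 merges each surrogate pair into one
    # code point in U+10000..U+10FFFF.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


escaped = java_escape("a\u00e9\U0001F600")
print(escaped)  # a\u00e9\ud83d\ude00
```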
java-s This converts \uxxxx character escapes. When encoding, all
characters above U+0080 will be escaped with a string like
'\u0080'. When decoding, the same format is decoded but, in
addition, utf-8 format is also recognized, so it can also be
used to recover data accidentally saved with the wrong encoding.
Surrogates are not treated specially during conversion, which is
why it is not a recommended conversion.
FILES
~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
can have a yudit.datapath property. This is where the map files
are kept. By default /usr/share/yudit/data is searched.
SEE ALSO
makeumap
AUTHOR
This program was written by gsinai@yudit.org (Gaspar Sinai), Tokyo, 2
January, 2001.
LINUX COMMANDS Nov 5 1997 UNICONV(1)