Locale Settings In Linux
Dealing with multiple languages
When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.
Most software now understands UTF-8 by default, but you may need to tell Linux to send UTF-8 to your terminal for it to display non-ASCII characters properly. Because some software still has problems with UTF-8 collations, this is currently not the system default. You can change to UTF-8 mode with the following command:
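A minimal sketch of such a command, assuming a bash-compatible shell and that a locale named en_US.UTF-8 is installed (the locale name is an assumption; any installed UTF-8 locale should behave the same way):

```shell
# Assumption: en_US.UTF-8 is installed; substitute any UTF-8 locale
# that your system actually provides.
export LANG=en_US.UTF-8

# Show the value that newly started programs will inherit
echo "LANG is now: $LANG"
```

If the locale is installed and no LC_ALL or LC_CTYPE setting overrides it, running locale charmap afterwards should report UTF-8.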
If your file is in UTF-8, and if the correct font is installed, you shouldn't have any trouble after doing this; it should work both in X and in a terminal. If you plan on doing a lot of UTF-8 work you may want to put the above command in your .bashrc, but note that this can cause problems for some applications; in particular, some of the scripts associated with SRILM give incorrect results in UTF-8 locales. The LANG environment variable is more thoroughly described in the section below on other character encodings.
If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference.
If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X
Other character sets
This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use.
Emacs in X
Emacs 22 defaults to UTF-8 encoding (although it will speak to your terminal in Latin-1 unless you set the LANG variable as noted above). If you need a different encoding, you must select it before you open the file. There are two ways to do this.
You can click the Options drop-down menu, then click Mule, then Set Language Environment, and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.
A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB".
Emacs in text terminals
Four conditions must be met for this to work:
- Your system must have a font for the language in question
- Your terminal must expect the proper encoding
- Linux must be told what encoding your terminal is using
- Emacs must be told what encoding to use
The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Preferences, clicking Settings at the top of the dialog box, and clicking the Advanced tab. The character set encoding drop-down box is near the bottom of the dialog. (In OS X 10.4 and earlier this is under Terminal, Preferences, Window Settings, Display.)
Linux determines what encoding your terminal supports based on the LANG environment variable. You can override it for a single command by setting it on the same command line, like so:
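As a sketch of that one-command form, assuming a bash-compatible shell and the zh_CN.gb2312 locale name used elsewhere on this page (the sh -c child is only a stand-in that shows what the command sees):

```shell
# The assignment applies only to the command on the same line
LANG=zh_CN.gb2312 sh -c 'echo "LANG inside the command: $LANG"'

# The shell's own setting is unchanged afterwards
echo "LANG in the shell: ${LANG:-unset}"

# The same prefix works for a terminal-mode Emacs, for example:
#   LANG=zh_CN.gb2312 emacs -nw /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm
```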
Or you can override it for an entire session by using the export command:
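A sketch of the session-wide form, again assuming a bash-compatible shell and the zh_CN.gb2312 locale name:

```shell
# Every program started from this shell from now on inherits the setting
export LANG=zh_CN.gb2312

# A child process sees the exported value
sh -c 'echo "LANG seen by child processes: $LANG"'
```

The setting lasts until the shell exits or LANG is exported again with a different value.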
Once Emacs is loaded, you can set the encoding with C-x RET l, as before.
Other text-based programs
The LANG variable also affects other commands. For example, if your terminal is configured for GB2312, you can do the following:
LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm
and get a correct display.
The command locale -a will list all the valid settings for LANG that the system knows about.
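Because the full list can be long, it may help to filter the output of locale -a. A small sketch, assuming GNU or POSIX grep is available; note that installed names are often spelled differently from what you type in LANG (utf8 rather than UTF-8, for instance), so the pattern accepts both:

```shell
# Show only the installed locales whose names mention UTF-8,
# or print a note if there are none
locale -a | grep -i -E 'utf-?8' || echo "no UTF-8 locales found"
```

Replacing the pattern with zh_CN would list the Simplified Chinese locales instead.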