
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
Locale Settings In Linux | ||||||||
| Changed: | ||||||||
| < < | Dealing with multiple languages | |||||||
| > > | Dealing with multiple character encodings | |||||||
| When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster. | ||||||||
| Added: | ||||||||
| > > | Telling Linux what your terminal expects | |||||||
| Changed: | ||||||||
| < < | UTF-8 | |||||||
| > > | The LANG environment variable controls what character set Linux sends to your terminal program. (It also controls some other localization settings, but we won't get into that here.) The default setting is "en_US", which specifies English with a default encoding of Latin1. Linux software that's locale-aware will try to translate foreign characters into a character set your terminal can understand. | |||||||
| Changed: | ||||||||
| < < | Most software now understands UTF-8 by default, but you may need to tell Linux to send UTF-8 to your terminal for it to display foreign characters properly. Because some software still has problems with UTF-8 collations, this is currently not the system default. You can change to UTF-8 mode with the following command: | |||||||
| > > | For working with foreign languages, you probably want to use UTF-8 encoding. To tell Linux to send UTF-8 to your terminal, issue the following command: | |||||||
export LANG="en_US.utf8" | ||||||||
| Changed: | ||||||||
| < < | If your file is in UTF-8, and if the correct font is installed, you shouldn't have any trouble after doing this; it should work both in X and in a terminal. If you plan on doing a lot of UTF-8 work you may want to put the above command in your .bashrc, but note that this can cause problems for some applications; in particular, some of the scripts associated with SRILM give incorrect results in UTF-8 locales. The LANG environment variable is more thoroughly described in the section below on other character encodings. | |||||||
| > > | This change will last until you log out. To make it permanent, add the line above to your .bash_profile. Note that this also affects the default collation, and some software (notably the SRILM support scripts) does not cope with this well. | |||||||
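Before editing .bash_profile, it can help to confirm what the current session is actually set to. The following is a minimal sketch, not part of the original page: `is_utf8_locale` is a hypothetical helper name, and the check is a simple pattern match on the locale name rather than anything authoritative.

```shell
#!/bin/sh
# Hypothetical helper (not from the original page): succeed if a locale
# name such as "en_US.utf8" or "en_US.UTF-8" declares a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.[uU][tT][fF]8*|*.[uU][tT][fF]-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the LANG value of the current session.
if is_utf8_locale "${LANG:-}"; then
    echo "LANG=${LANG} selects UTF-8"
else
    echo "LANG=${LANG:-<unset>} does not select UTF-8"
    echo 'To switch for this session: export LANG="en_US.utf8"'
fi

# `locale charmap` prints the codeset the C library actually resolved.
if command -v locale >/dev/null 2>&1; then
    echo "Active codeset: $(locale charmap 2>/dev/null)"
fi
```

Note that if LC_ALL or LC_CTYPE is set in your environment, it takes precedence over LANG, so the codeset reported by `locale charmap` is the more reliable of the two checks.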
| Changed: | ||||||||
| < < | If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference. | |||||||
| > > | You will also need to tell your terminal to expect UTF-8 encoding; otherwise foreign characters will appear as gibberish. Settings for some common terminal programs appear below:
| |||||||
| Changed: | ||||||||
| < < | If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X. | |||||||
| > > | Note that you also must have fonts that contain the characters you are trying to display. For most modern systems this is not a problem. | |||||||
| Changed: | ||||||||
| < < | Other character sets | |||||||
| > > | If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.
Dealing with files in non-UTF character sets | |||||||
| This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use. | ||||||||
| Added: | ||||||||
| > > | Emacs | |||||||
| Changed: | ||||||||
| < < | Emacs in X | |||||||
| > > | Emacs 22 defaults to UTF-8 encoding. It can translate other encodings into UTF-8 for display on your terminal, but you must select the proper encoding before you open the file. This must be done manually; there is no auto-detection in most cases. | |||||||
| Changed: | ||||||||
| < < | Emacs 22 defaults to UTF-8 encoding (although it will speak to your terminal in Latin1, unless you set the LANG variable as noted above.) If you need a different encoding, you must select this before you open the file. There are two ways to do this. | |||||||
| > > | To change the character encoding, press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works at this prompt, so if you aren't sure, enter a partial name and hit tab a couple of times to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f. | |||||||
| Changed: | ||||||||
| < < | You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu. | |||||||
| > > | If you are running Emacs in an X window, you can instead click the Options drop-down menu, then click Mule, then Set Language Environment, and choose the correct entry from the list. | |||||||
| A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB". | ||||||||
| Added: | ||||||||
| > > | Emacs troubleshooting:
vim | |||||||
| Changed: | ||||||||
| < < | Emacs in text terminals
Four conditions must be met for this to work:
| |||||||
| > > | (This section could use some expansion, but I'm not a vim user. If anyone who is would like to take a crack at it, feel free.) | |||||||
| Changed: | ||||||||
| < < | Linux determines what encoding your terminal supports based on the LANG environment variable. You can override it for a single command by setting it on the same command line, like so: | |||||||
| > > | The "fileencodings" global variable gives a list of encodings vim will try when opening a new file. It will then convert the file into your terminal's character set. The default is "utf-8, latin1", which means vim will try utf-8 first, then try latin1 if utf-8 fails. | |||||||
| Changed: | ||||||||
| < < | LANG=zh_CN.gb2312 emacs | |||||||
| > > | To edit our GB2312 file in vim, assuming we have our terminal correctly configured for UTF-8, we could do something like the following:
:setglobal fileencodings=gb2312
:e /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm | |||||||
| Changed: | ||||||||
| < < | Or you can override it for an entire session by using the export command: | |||||||
| > > | Note that, as in Emacs, the correct encoding must be set before loading the file.
Other text-based programs | |||||||
| Changed: | ||||||||
| < < | export LANG=zh_CN.gb2312
Once emacs is loaded, you can set the encoding with C-x RET l, as before.
Other text-based programs
The LANG variable also affects other commands. For example, if your terminal is configured for GB2312, you can do the following: | |||||||
| > > | Unfortunately, most other commands will not do translation the way Emacs will. This means that to read a file in an alternate encoding, both your terminal and the LANG variable must be set to that encoding. For example, if we wanted to view the example file above using "more", we would first need to set our terminal to GB2312, then do the following: | |||||||
LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm | ||||||||
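An alternative that the original page does not cover, offered here only as a sketch: instead of switching the terminal to GB2312, the file can be converted to UTF-8 on the fly with the standard `iconv` tool and then viewed in a UTF-8 terminal. The wrapper name `to_utf8` is hypothetical, and the example assumes your iconv installation knows the source encoding (`iconv -l` lists what it supports).

```shell
#!/bin/sh
# Hypothetical wrapper: convert FILE from ENCODING to UTF-8 on stdout.
# Usage: to_utf8 ENCODING FILE
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Example file from this page; skipped if it is not readable here.
sample=/corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm
if [ -r "$sample" ]; then
    to_utf8 GB2312 "$sample" | more
fi
```

This leaves LANG and the terminal set to UTF-8, so nothing needs to be switched back afterwards. The original file is not modified.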
| Deleted: | ||||||||
| < < | and get a correct display. | |||||||
locale -a will list all the valid settings for LANG that the system knows about. | ||||||||
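Since the output of locale -a can be long, a small filter makes it easier to check for one specific setting. This is a sketch under assumptions: `have_locale` is a hypothetical helper, and it does a case-insensitive exact match against the names exactly as locale -a prints them (so "en_US.utf8" and "en_US.UTF-8" are treated as different strings).

```shell
#!/bin/sh
# Hypothetical helper: succeed if locale -a lists the given name.
# -F treats the name as a fixed string, -x requires a whole-line match,
# -i ignores case, -q suppresses output.
have_locale() {
    locale -a 2>/dev/null | grep -qixF "$1"
}

# Check the two LANG values used on this page.
for candidate in en_US.utf8 zh_CN.gb2312; do
    if have_locale "$candidate"; then
        echo "$candidate: available"
    else
        echo "$candidate: not listed by locale -a"
    fi
done
```

If a locale is not listed, setting LANG to it will generally fall back to the default "C" behavior rather than produce an obvious error, so checking first can save some confusion.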