LocaleSettings (2013-06-20, brodbd)

Locale Settings In Linux

Dealing with multiple character encodings

When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.

Telling Linux what your terminal expects

The LANG environment variable tells locale-aware software which character set to send to your terminal program. (It also controls some other localization settings, but we won't get into that here.) The default setting is "en_US", which specifies English with a default encoding of Latin1 (ISO-8859-1). Locale-aware Linux software will try to translate foreign characters into a character set your terminal can understand.

For working with foreign languages, you probably want to use UTF-8 encoding. To tell Linux to send UTF-8 to your terminal, issue the following command:

export LANG="en_US.utf8"

This change will last until you log out. To make it permanent, add the line above to your .bash_profile. Note that this also affects the default collation, and some software (notably the SRILM support scripts) does not cope with this well.
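
If the collation change causes trouble, one common workaround is to leave LANG set to UTF-8 and override only the sort order with LC_COLLATE. The following is a minimal sketch, assuming nothing else (such as LC_ALL) overrides these variables. byte_sort is a hypothetical helper name, not an existing command, and I have not verified that this cures any particular script.

```shell
# Keep UTF-8 character handling, but sort in plain byte order.
export LANG="en_US.utf8"
export LC_COLLATE="C"

# Hypothetical helper: force byte-order collation for a single command,
# regardless of what LANG is set to.
byte_sort() {
    LC_ALL=C sort "$@"
}

# In the C locale, uppercase ASCII letters sort before lowercase ones.
printf 'b\nA\na\nB\n' | byte_sort
```

The two export lines can go in your .bash_profile together; the helper is only needed when you want byte order for one command without changing your login settings.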

You will also need to tell your terminal to expect UTF-8 encoding; otherwise foreign characters will appear as gibberish. Settings for some common terminal programs appear below:

  • Terminal, MacOS X 10.5 and later: Click the Terminal drop-down menu, choose Preferences, click Settings at the top of the dialog box, and click the Advanced tab. The character set encoding drop-down box is near the bottom of the dialog. The default is UTF-8.
  • Terminal, MacOS X 10.4 and earlier: Same as above, but click Terminal, Preferences, Window Settings, Display.
  • xterm, MacOS X: The version of xterm included with MacOS X does not support UTF-8. Use Terminal instead.
  • PuTTY: Click on the icon at the upper left of the window and choose Change Settings. Under Window, choose Translation. The character set encoding drop-down box is near the top of the dialog. The default is Latin1.
  • TeraTerm Pro: TeraTerm Pro does not support UTF-8. If you plan to work with foreign languages I recommend switching to PuTTY.
  • MobaXterm: Click on the settings icon. In the terminal tab, select your character set from the Charset drop-down. This will only apply to new sessions launched after changing the setting. Also note that MobaXterm's X fonts seem to lack Asian character support.

Note that you also must have fonts that contain the characters you are trying to display. For most modern systems this is not a problem.
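
A quick way to check that LANG, the terminal encoding, and your fonts all agree is to print a few known UTF-8 characters. This is a minimal sketch; utf8_sample is a hypothetical helper name, and the octal escapes are the UTF-8 bytes for é, € and 中.

```shell
# Hypothetical helper: print three characters as raw UTF-8 bytes.
# Octal escapes are used because they work in any POSIX printf.
utf8_sample() {
    printf 'U+00E9 \303\251\n'        # e with acute accent
    printf 'U+20AC \342\202\254\n'    # euro sign
    printf 'U+4E2D \344\270\255\n'    # Chinese character "zhong"
}

utf8_sample

# If the locale command is available, report the character set implied by
# the current locale settings. For a working UTF-8 locale this is UTF-8.
command -v locale >/dev/null 2>&1 && locale charmap || true
```

If each line ends in a single recognizable character, everything agrees. Two or three junk characters per line suggest the terminal is not in UTF-8 mode, and empty boxes suggest a font problem.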

If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.

Dealing with files in non-UTF character sets

This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use.
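
One rough way to narrow down a file's encoding is to ask iconv whether the file decodes cleanly under a candidate encoding. This is only a sketch: decodes_as and check_utf8 are hypothetical helper names, and a clean decode does not prove the guess is right, since many byte sequences are valid in more than one encoding.

```shell
# Hypothetical helper: succeed if FILE decodes without errors as ENCODING.
# Usage: decodes_as ENCODING FILE
decodes_as() {
    iconv -f "$1" -t UTF-8 "$2" >/dev/null 2>&1
}

# Hypothetical helper: report whether FILE is valid UTF-8.
# Usage: check_utf8 FILE
check_utf8() {
    if decodes_as UTF-8 "$1"; then
        echo "$1: valid UTF-8"
    else
        echo "$1: not UTF-8; try another encoding (see iconv -l for names)"
    fi
}
```

A failed UTF-8 check is the more useful result: it tells you for certain that you need to pick another encoding before opening the file.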

Emacs

Emacs 22 defaults to UTF-8 encoding. It can translate other encodings into UTF-8 for display on your terminal, but you must select the proper encoding before you open the file. This must be done manually; there is no auto-detection in most cases.

To tell Emacs which encoding to expect, press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) This sets the language environment, which determines the encodings Emacs prefers when reading files. Then type the name that corresponds to your character encoding. Tab-completion works at this prompt, so if you aren't sure, enter a partial name and hit tab a couple of times to get a list of matches. Once this is set, you can open your file with C-x C-f.

If you are running Emacs in an X window, you can click the Options drop-down menu, then click Mule, Set Language Environment and choose the correct entry from the list.

A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB".

Emacs troubleshooting:
  • Question marks instead of foreign language characters: This usually means your terminal is using an encoding that cannot display some of the characters in the file; this is common when trying to display Asian languages on a Latin1 terminal. Change your terminal encoding and set the LANG environment variable accordingly.
  • Gibberish instead of foreign language characters: Either Emacs is set to the wrong encoding for the file, or your terminal is set to an encoding that does not agree with the way you have LANG set. To find out which, press C-h h (Ctrl-h followed by the h key) to display some sample multilingual text. If this text looks OK, the file encoding is wrong; if this text is gibberish, your LANG setting disagrees with your terminal settings.
  • Empty boxes or diamonds with question marks instead of foreign language characters: The fonts installed on your local system do not support the characters you're trying to display.

vim

(This section could use some expansion, but I'm not a vim user. If anyone who is would like to take a crack at it, feel free.)

The "fileencodings" option gives a list of encodings vim will try, in order, when opening a file. It will then convert the file into your terminal's character set. The default is "utf-8,latin1", which means vim will try utf-8 first, then try latin1 if utf-8 fails.

To edit our GB2312 file in vim, assuming we have our terminal correctly configured for UTF-8, we could do something like the following:

:setglobal fileencodings=gb2312
:e /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm

Note that, as in Emacs, the correct encoding must be set before loading the file.

Other text-based programs

Unfortunately, most other commands will not do translation the way Emacs will. This means that to read a file in an alternate encoding, both your terminal and the LANG variable must be set to that encoding. For example, if we wanted to view the example file above using "more", we would first need to set our terminal to GB2312, then do the following:

LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm

The command locale -a lists all the valid settings for LANG that the system knows about.
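
An alternative to switching the terminal is to convert the file to UTF-8 on the fly with iconv and pipe the result into the pager. This is a minimal sketch, assuming the terminal and LANG are already set up for UTF-8 and that iconv on the system knows the source encoding. view_as_utf8 is a hypothetical helper name.

```shell
# Hypothetical helper: decode FILE from ENCODING and write UTF-8 to stdout.
# Usage: view_as_utf8 ENCODING FILE
view_as_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Example (not run here): page through the GB2312 sample file on a UTF-8 terminal.
#   view_as_utf8 GB2312 /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm | more
```

This leaves the original file untouched, and the same pipe works with grep, head and other tools that do no translation of their own.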

Topic revision: r5 - 2013-06-20 - 20:52:58 - brodbd
 
