Difference: LocaleSettings (1 vs. 5)

Revision 5 (2013-06-20) - brodbd

Line: 1 to 1
 

Locale Settings In Linux

Dealing with multiple character encodings

Line: 19 to 19
 
  • xterm, MacOS X: The version of xterm included with MacOS X does not support UTF-8. Use Terminal instead.
  • PuTTY: Click on the icon at the upper left of the window and choose Change Settings. Under Window, choose Translation. The character set encoding drop-down box is near the top of the dialog. The default is Latin1.
  • TeraTerm Pro: TeraTerm Pro does not support UTF-8. If you plan to work with foreign languages I recommend switching to PuTTY.
Added:
>
>
  • MobaXterm: Click on the settings icon. In the terminal tab, select your character set from the Charset drop-down. This will only apply to new sessions launched after changing the setting. Also note that MobaXterm's X fonts seem to lack Asian character support.
  Note that you also must have fonts that contain the characters you are trying to display. For most modern systems this is not a problem.

Revision 4 (2010-05-05) - brodbd

Line: 1 to 1
 

Locale Settings In Linux

Changed:
<
<

Dealing with multiple languages

>
>

Dealing with multiple character encodings

  When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.
Added:
>
>

Telling Linux what your terminal expects

 
Changed:
<
<

UTF-8

>
>
The LANG environment variable controls what character set Linux sends to your terminal program. (It also controls some other localization settings, but we won't get into that here.) The default setting is "en_US", which specifies English with a default encoding of Latin1. Linux software that's locale-aware will try to translate foreign characters into a character set your terminal can understand.
 
Changed:
<
<
Most software now understands UTF-8 by default, but you may need to tell Linux to send UTF-8 to your terminal for it to display foreign characters properly. Because some software still has problems with UTF-8 collations, this is currently not the system default. You can change to UTF-8 mode with the following command:
>
>
For working with foreign languages, you probably want to use UTF-8 encoding. To tell Linux to send UTF-8 to your terminal, issue the following command:
 
export LANG="en_US.utf8"
Changed:
<
<
If your file is in UTF-8, and if the correct font is installed, you shouldn't have any trouble after doing this; it should work both in X and in a terminal. If you plan on doing a lot of UTF-8 work you may want to put the above command in your .bashrc, but note that this can cause problems for some applications; in particular, some of the scripts associated with SRILM give incorrect results in UTF-8 locales. The LANG environment variable is more thoroughly described in the section below on other character encodings.
>
>
This change will last until you log out. To make it permanent, add the line above to your .bash_profile. Note that this also affects the default collation, and some software (notably the SRILM support scripts) does not cope with this well.
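Locale names are spelled differently from system to system (en_US.utf8 on some, en_US.UTF-8 on others), so a line added to .bash_profile can be guarded so that it only takes effect where the locale is actually installed. The following is a minimal sketch under that assumption, not something taken from the cluster's configuration; normalize_locale and have_locale are hypothetical helper names, and the only system command relied on is locale -a.

```shell
# Hypothetical helper: lower-case a locale name and drop hyphens, so that
# "en_US.UTF-8" and "en_US.utf8" compare as equal.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# Hypothetical helper: succeed if the named locale appears in `locale -a`.
have_locale() {
    want=$(normalize_locale "$1")
    for name in $(locale -a 2>/dev/null); do
        if [ "$(normalize_locale "$name")" = "$want" ]; then
            return 0
        fi
    done
    return 1
}

# Switch to UTF-8 only when the system knows about the locale.
if have_locale "en_US.utf8"; then
    export LANG="en_US.utf8"
fi
```

If the locale is missing, LANG is simply left at whatever the system default is.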
 
Changed:
<
<
If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference.
>
>
You will also need to tell your terminal to expect UTF-8 encoding; otherwise foreign characters will appear as gibberish. Settings for some common terminal programs appear below:
  • Terminal, MacOS X 10.5 and later: Click the Terminal drop-down menu, choose Preferences, click Settings at the top of the dialog box, and click the Advanced tab. The character set encoding drop-down box is near the bottom of the dialog. The default is UTF-8.
  • Terminal, MacOS X 10.4 and earlier: Same as above, but click Terminal, Preferences, Window Settings, Display.
  • xterm, MacOS X: The version of xterm included with MacOS X does not support UTF-8. Use Terminal instead.
  • PuTTY: Click on the icon at the upper left of the window and choose Change Settings. Under Window, choose Translation. The character set encoding drop-down box is near the top of the dialog. The default is Latin1.
  • TeraTerm Pro: TeraTerm Pro does not support UTF-8. If you plan to work with foreign languages I recommend switching to PuTTY.
 
Changed:
<
<
If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.
>
>
Note that you also must have fonts that contain the characters you are trying to display. For most modern systems this is not a problem.
 
Changed:
<
<

Other character sets

>
>
If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.

Dealing with files in non-UTF character sets

  This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use.
Added:
>
>

Emacs

 
Changed:
<
<

Emacs in X

>
>
Emacs 22 defaults to UTF-8 encoding. It can translate other encodings into UTF-8 for display on your terminal, but you must select the proper encoding before you open the file. This must be done manually; there is no auto-detection in most cases.
 
Changed:
<
<
Emacs 22 defaults to UTF-8 encoding (although it will speak to your terminal in Latin1, unless you set the LANG variable as noted above.) If you need a different encoding, you must select this before you open the file. There are two ways to do this.
>
>
To change the character encoding, press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works at this prompt, so if you aren't sure, enter a partial name and hit tab a couple of times to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f.
 
Changed:
<
<
You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.
>
>
If you are running Emacs in an X window, you can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding from the list.
  A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB".
Added:
>
>
Emacs troubleshooting:
  • Question marks instead of foreign language characters: This usually means your terminal is using an encoding that cannot display some of the characters in the file; this is common when trying to display Asian languages on a Latin1 terminal. Change your terminal encoding and set the LANG environment variable accordingly.
  • Gibberish instead of foreign language characters: Either Emacs is set to the wrong encoding for the file, or your terminal is set to an encoding that does not agree with the way you have LANG set. To find out which, press C-h h (Ctrl-h followed by the h key) to display some sample multilingual text. If this text looks OK, the file encoding is wrong; if this text is gibberish, your LANG setting disagrees with your terminal settings.
  • Empty boxes or diamonds with question marks instead of foreign language characters: The fonts installed on your local system do not support the characters you're trying to display.
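To check the Linux side of a LANG/terminal disagreement from the shell, the standard locale charmap command prints the encoding implied by the current locale settings (LC_ALL and LC_CTYPE take precedence over LANG if they are set). The sketch below compares that with the encoding you chose by hand in your terminal program; check_terminal_encoding is a hypothetical helper name, and the terminal's encoding has to be typed in because the shell cannot query it. Note that the names may not match the terminal's menus exactly; glibc, for example, reports ISO-8859-1 rather than Latin1.

```shell
# Hypothetical helper: compare the locale's encoding with the encoding the
# terminal program is configured for (passed in by hand as $1).
check_terminal_encoding() {
    expected=$1
    # `locale charmap` prints the encoding implied by LC_ALL, LC_CTYPE, or LANG.
    actual=$(locale charmap 2>/dev/null) || actual=""
    # Compare case-insensitively, since "utf-8" and "UTF-8" mean the same thing.
    a=$(printf '%s' "$actual" | tr '[:lower:]' '[:upper:]')
    e=$(printf '%s' "$expected" | tr '[:lower:]' '[:upper:]')
    if [ "$a" = "$e" ]; then
        echo "OK: locale encoding is ${actual:-unknown}"
    else
        echo "MISMATCH: locale says ${actual:-unknown}, terminal is set to $expected"
        return 1
    fi
}

# Example: the terminal program was set to UTF-8 by hand.
check_terminal_encoding "UTF-8" || true
```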

vim

 
Changed:
<
<

Emacs in text terminals

Four conditions must be met for this to work:

  • Your system must have a font for the language in question
  • Your terminal must expect the proper encoding
  • Linux must be told what encoding your terminal is using
  • Emacs must be told what encoding to use

The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Preferences, clicking Settings at the top of the dialog box, and clicking the Advanced tab. The character set encoding drop-down box is near the bottom of the dialog. (In OS X 10.4 and earlier this is under Terminal, Preferences, Window Settings, Display.)

>
>
(This section could use some expansion, but I'm not a vim user. If anyone who is would like to take a crack at it, feel free.)
 
Changed:
<
<
Linux determines what encoding your terminal supports based on the LANG environment variable. You can override it for a single command by setting it on the same command line, like so:
>
>
The "fileencodings" global option gives a list of encodings vim will try, in order, when opening a file. It will then convert the file into your terminal's character set. The default is "utf-8,latin1", which means vim will try utf-8 first, then fall back to latin1 if utf-8 fails.
 
Changed:
<
<
LANG=zh_CN.gb2312 emacs
>
>
To edit our GB2312 file in vim, assuming we have our terminal correctly configured for UTF-8, we could do something like the following:
:setglobal fileencodings=gb2312
:e /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm
 
Changed:
<
<
Or you can override it for an entire session by using the export command:
>
>
Note that, as in Emacs, the correct encoding must be set before loading the file.

Other text-based programs

 
Changed:
<
<
export LANG=zh_CN.gb2312

Once emacs is loaded, you can set the encoding with C-x RET l, as before.

Other text-based programs

The LANG variable also affects other commands. For example, if your terminal is configured for GB2312, you can do the following:

>
>
Unfortunately, most other commands will not do translation the way Emacs will. This means that to read a file in an alternate encoding, both your terminal and the LANG variable must be set to that encoding. For example, if we wanted to view the example file above using "more", we would first need to set our terminal to GB2312, then do the following:
  LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm
Deleted:
<
<
and get a correct display.
 locale -a will list all the valid settings for LANG that the system knows about.
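The one-off form used with more above (LANG=... in front of the command) changes the setting only for that single command, while export changes it for the rest of the session. The short sketch below illustrates the difference; it uses sh -c purely as a stand-in child process that reports the LANG value it was given, and the two locale names are just examples.

```shell
# Session-wide setting: stays in effect for every later command.
export LANG="C"

# One-off override: only the child process started on this line sees it.
one_off=$(LANG="zh_CN.gb2312" sh -c 'printf "%s\n" "$LANG"')

echo "child saw:     $one_off"
echo "session still: $LANG"
```

The child reports zh_CN.gb2312, while the session's own LANG is still C afterwards.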

Revision 3 (2010-05-05) - brodbd

Line: 1 to 1
 

Locale Settings In Linux

Dealing with multiple languages

When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.

UTF-8

Changed:
<
<
Most software now understands UTF-8 by default. If your file is in UTF-8, and if the correct font is installed, you probably won't have any trouble, either in X or in a terminal.
>
>
Most software now understands UTF-8 by default, but you may need to tell Linux to send UTF-8 to your terminal for it to display foreign characters properly. Because some software still has problems with UTF-8 collations, this is currently not the system default. You can change to UTF-8 mode with the following command:
export LANG="en_US.utf8"

If your file is in UTF-8, and if the correct font is installed, you shouldn't have any trouble after doing this; it should work both in X and in a terminal. If you plan on doing a lot of UTF-8 work you may want to put the above command in your .bashrc, but note that this can cause problems for some applications; in particular, some of the scripts associated with SRILM give incorrect results in UTF-8 locales. The LANG environment variable is more thoroughly described in the section below on other character encodings.

  If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference.
Line: 16 to 22
 

Emacs in X

Changed:
<
<
Emacs 22 defaults to UTF-8 encoding. If you need a different encoding, you must select this before you open the file. There are two ways to do this.
>
>
Emacs 22 defaults to UTF-8 encoding (although it will speak to your terminal in Latin1, unless you set the LANG variable as noted above.) If you need a different encoding, you must select this before you open the file. There are two ways to do this.
  You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.
Line: 51 to 57
 and get a correct display.

locale -a will list all the valid settings for LANG that the system knows about.

Deleted:
<
<
-- DavidBrodbeck - 17 Sep 2009

Revision 2 (2009-09-17) - brodbd

Line: 1 to 1
 

Locale Settings In Linux

Dealing with multiple languages

Line: 18 to 18
  Emacs 22 defaults to UTF-8 encoding. If you need a different encoding, you must select this before you open the file. There are two ways to do this.
Changed:
<
<
You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.
>
>
You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.
  A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB".
Line: 31 to 30
 
  • Linux must be told what encoding your terminal is using
  • Emacs must be told what encoding to use
Changed:
<
<
The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Window Settings, and selecting Display. The Character Set Encoding drop-down box will be at the bottom of the dialog.
>
>
The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Preferences, clicking Settings at the top of the dialog box, and clicking the Advanced tab. The character set encoding drop-down box is near the bottom of the dialog. (In OS X 10.4 and earlier this is under Terminal, Preferences, Window Settings, Display.)
  Linux determines what encoding your terminal supports based on the LANG environment variable. You can override it for a single command by setting it on the same command line, like so:
Line: 53 to 52
  locale -a will list all the valid settings for LANG that the system knows about.
Changed:
<
<
-- DavidBrodbeck - 28 Sep 2007
>
>
-- DavidBrodbeck - 17 Sep 2009

Revision 1 (2007-09-28) - DavidBrodbeck

Line: 1 to 1
Added:
>
>

Locale Settings In Linux

Dealing with multiple languages

When we start dealing with multiple character sets and languages, we get into a confusing area that's poorly understood by most American computer users. Here's a quick crash course on how to deal with these issues on our Linux cluster.

UTF-8

Most software now understands UTF-8 by default. If your file is in UTF-8, and if the correct font is installed, you probably won't have any trouble, either in X or in a terminal.

If you're running a text terminal program, it's the fonts on your local machine and your terminal program's understanding of the encoding that make the difference.

If you're running a forwarded X program, it will rely on the fonts on your local machine, unless you tell your machine's X server to connect to the server's font server so it can download additional fonts. How to do this depends on the system. Here are some instructions for MacOS X.

Other character sets

This is where it gets tricky, since there's often no standard way for the system to know what encoding your file is in. You'll need to tell it which to use.

Emacs in X

Emacs 22 defaults to UTF-8 encoding. If you need a different encoding, you must select this before you open the file. There are two ways to do this.

You can click the Options drop-down menu, then click Mule, Set Language Encoding and choose the correct encoding. Alternatively, you can press C-x RET l. (This is Emacs-speak for Ctrl-x, the RETURN (or ENTER) key, then the lowercase L key.) Then type the name of your character encoding. Tab-completion works, so you can type a partial name and then hit tab to get a list of matches. Once the character encoding is set, you can open your file with C-x C-f or by using the File drop-down menu.

A good example to test this with is the file /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm. This file is in GB2312 Simplified Chinese. Emacs's name for this encoding is "Chinese-GB".

Emacs in text terminals

Four conditions must be met for this to work:

  • Your system must have a font for the language in question
  • Your terminal must expect the proper encoding
  • Linux must be told what encoding your terminal is using
  • Emacs must be told what encoding to use

The first two items are system-dependent. In MacOS X Terminal, for example, you can set the encoding by clicking the Terminal drop-down menu, choosing Window Settings, and selecting Display. The Character Set Encoding drop-down box will be at the bottom of the dialog.

Linux determines what encoding your terminal supports based on the LANG environment variable. You can override it for a single command by setting it on the same command line, like so:

LANG=zh_CN.gb2312 emacs

Or you can override it for an entire session by using the export command:

export LANG=zh_CN.gb2312

Once emacs is loaded, you can set the encoding with C-x RET l, as before.

Other text-based programs

The LANG variable also affects other commands. For example, if your terminal is configured for GB2312, you can do the following:

LANG=zh_CN.gb2312 more /corpora/LDC/LDC05T06/RAW/data/source/AFC20020701.0014.sgm

and get a correct display.

locale -a will list all the valid settings for LANG that the system knows about.

-- DavidBrodbeck - 28 Sep 2007

 