Sophie

Sophie

distrib > Mandriva > 2008.1 > x86_64 > by-pkgid > ebb1914cf182a88528b4547490db1dd8 > files > 341

kdewebdev-quanta-doc-3.5.9-2mdv2008.1.x86_64.rpm

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>Chapter 10. Character Set Support</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"><link rel="start" href="index.html" title="MySQL 5.0 Reference Manual"><link rel="up" href="index.html" title="MySQL 5.0 Reference Manual"><link rel="prev" href="language-structure.html" title="Chapter 9. Language Structure"><link rel="next" href="column-types.html" title="Chapter 11. Column Types"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Chapter 10. Character Set Support</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="language-structure.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="column-types.html">Next</a></td></tr></table><hr></div><div class="chapter" lang="en"><div class="titlepage"><div><div><h2 class="title"><a name="charset"></a>Chapter 10. Character Set Support</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="charset.html#charset-general">10.1. Character Sets and Collations in General</a></span></dt><dt><span class="section"><a href="charset.html#charset-mysql">10.2. Character Sets and Collations in MySQL</a></span></dt><dt><span class="section"><a href="charset.html#charset-defaults">10.3. Determining the Default Character Set and Collation</a></span></dt><dd><dl><dt><span class="section"><a href="charset.html#charset-server">10.3.1. Server Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-database">10.3.2. Database Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-table">10.3.3. Table Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-column">10.3.4. Column Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-examples">10.3.5. Examples of Character Set and Collation Assignment</a></span></dt><dt><span class="section"><a href="charset.html#charset-connection">10.3.6. Connection Character Sets and Collations</a></span></dt><dt><span class="section"><a href="charset.html#charset-literal">10.3.7. Character String Literal Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate">10.3.8. Using <code class="literal">COLLATE</code> in SQL Statements</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate-precedence">10.3.9. <code class="literal">COLLATE</code> Clause Precedence</a></span></dt><dt><span class="section"><a href="charset.html#charset-binary-op">10.3.10. <code class="literal">BINARY</code> Operator</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate-tricky">10.3.11. Some Special Cases Where the Collation Determination Is Tricky</a></span></dt><dt><span class="section"><a href="charset.html#charset-collation-charset">10.3.12. Collations Must Be for the Right Character Set</a></span></dt><dt><span class="section"><a href="charset.html#charset-collation-effect">10.3.13. An Example of the Effect of Collation</a></span></dt></dl></dd><dt><span class="section"><a href="charset.html#charset-operations">10.4. Operations Affected by Character Set Support</a></span></dt><dd><dl><dt><span class="section"><a href="charset.html#charset-result">10.4.1. Result Strings</a></span></dt><dt><span class="section"><a href="charset.html#charset-convert">10.4.2. <code class="literal">CONVERT()</code></a></span></dt><dt><span class="section"><a href="charset.html#charset-cast">10.4.3. <code class="literal">CAST()</code></a></span></dt><dt><span class="section"><a href="charset.html#charset-show">10.4.4. <code class="literal">SHOW</code> Statements</a></span></dt></dl></dd><dt><span class="section"><a href="charset.html#charset-unicode">10.5. Unicode Support</a></span></dt><dt><span class="section"><a href="charset.html#charset-metadata">10.6. UTF8 for Metadata</a></span></dt><dt><span class="section"><a href="charset.html#charset-compatibility">10.7. Compatibility with Other DBMSs</a></span></dt><dt><span class="section"><a href="charset.html#charset-config-file">10.8. New Character Set Configuration File Format</a></span></dt><dt><span class="section"><a href="charset.html#charset-national">10.9. National Character Set</a></span></dt><dt><span class="section"><a href="charset.html#charset-charsets">10.10. Character Sets and Collations That MySQL Supports</a></span></dt><dd><dl><dt><span class="section"><a href="charset.html#charset-unicode-sets">10.10.1. Unicode Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-we-sets">10.10.2. West European Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-ce-sets">10.10.3. Central European Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-se-me-sets">10.10.4. South European and Middle East Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-baltic-sets">10.10.5. Baltic Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-cyrillic-sets">10.10.6. Cyrillic Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-asian-sets">10.10.7. Asian Character Sets</a></span></dt></dl></dd></dl></div><a class="indexterm" name="id2903999"></a><a class="indexterm" name="id2904006"></a><a class="indexterm" name="id2904012"></a><a class="indexterm" name="id2904019"></a><a class="indexterm" name="id2904026"></a><p>
    This chapter discusses the following topics:
  </p><div class="itemizedlist"><ul type="disc"><li><p>
        What are character sets and collations?
      </p></li><li><p>
        The multiple-level default system
      </p></li><li><p>
        Character set syntax
      </p></li><li><p>
        Affected functions and operations
      </p></li><li><p>
        Unicode support
      </p></li><li><p>
        The meaning of each individual character set and collation
      </p></li></ul></div><p>
    Character set support in MySQL 5.0 is included in the
    <code class="literal">MyISAM</code>, <code class="literal">MEMORY</code>, and
    <code class="literal">InnoDB</code> storage engines.
  </p><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-general"></a>10.1. Character Sets and Collations in General</h2></div></div></div><p>
      A <em class="firstterm">character set</em> is a set of symbols and
      encodings. A <em class="firstterm">collation</em> is a set of rules
      for comparing characters in a character set. Let's make the
      distinction clear with an example of an imaginary character set.
    </p><p>
      Suppose that we have an alphabet with four letters:
      ‘<code class="literal">A</code>’,
      ‘<code class="literal">B</code>’,
      ‘<code class="literal">a</code>’,
      ‘<code class="literal">b</code>’. We give each letter a number:
      ‘<code class="literal">A</code>’ = 0,
      ‘<code class="literal">B</code>’ = 1,
      ‘<code class="literal">a</code>’ = 2,
      ‘<code class="literal">b</code>’ = 3. The letter
      ‘<code class="literal">A</code>’ is a symbol, the number 0 is
      the <span class="bold"><strong>encoding</strong></span> for
      ‘<code class="literal">A</code>’, and the combination of all
      four letters and their encodings is a
      <span class="bold"><strong>character set</strong></span>.
    </p><p>
      Suppose that we want to compare two string values,
      ‘<code class="literal">A</code>’ and
      ‘<code class="literal">B</code>’. The simplest way to do this is
      to look at the encodings: 0 for ‘<code class="literal">A</code>’
      and 1 for ‘<code class="literal">B</code>’. Because 0 is less
      than 1, we say ‘<code class="literal">A</code>’ is less than
      ‘<code class="literal">B</code>’. What we've just done is apply
      a collation to our character set. The collation is a set of rules
      (only one rule in this case): “<span class="quote">compare the
      encodings.</span>” We call this simplest of all possible
      collations a <em class="firstterm">binary</em> collation.
    </p><p>
      But what if we want to say that the lowercase and uppercase
      letters are equivalent? Then we would have at least two rules: (1)
      treat the lowercase letters ‘<code class="literal">a</code>’ and
      ‘<code class="literal">b</code>’ as equivalent to
      ‘<code class="literal">A</code>’ and
      ‘<code class="literal">B</code>’; (2) then compare the
      encodings. We call this a <em class="firstterm">case-insensitive</em>
      collation. It's a little more complex than a binary collation.
    </p><p>
      In real life, most character sets have many characters: not just
      ‘<code class="literal">A</code>’ and
      ‘<code class="literal">B</code>’ but whole alphabets, sometimes
      multiple alphabets or eastern writing systems with thousands of
      characters, along with many special symbols and punctuation marks.
      Also in real life, most collations have many rules: not just case
      insensitivity but also accent insensitivity (an
      “<span class="quote">accent</span>” is a mark attached to a character as in
      German ‘<code class="literal">Ö</code>’) and multiple-character
      mappings (such as the rule that
      ‘<code class="literal">Ö</code>’ =
      ‘<code class="literal">OE</code>’ in one of the two German
      collations).
    </p><p>
      MySQL 5.0 can do these things for you:
    </p><div class="itemizedlist"><ul type="disc"><li><p>
          Store strings using a variety of character sets
        </p></li><li><p>
          Compare strings using a variety of collations
        </p></li><li><p>
          Mix strings with different character sets or collations in the
          same server, the same database, or even the same table
        </p></li><li><p>
          Allow specification of character set and collation at any
          level
        </p></li></ul></div><p>
      In these respects, not only is MySQL 5.0 far more
      flexible than versions of MySQL prior to 4.1, it also is far ahead
      of most other database management systems. However, to use these
      features effectively, you need to know what character sets and
      collations are available, how to change the defaults, and how they
      affect the behavior of string operators and functions.
    </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-mysql"></a>10.2. Character Sets and Collations in MySQL</h2></div></div></div><p>
      The MySQL server can support multiple character sets. To list the
      available character sets, use the <code class="literal">SHOW CHARACTER
      SET</code> statement:
    </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW CHARACTER SET;</code></strong>
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252     West European    | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...
</pre><p>
      (For a complete listing, see <a href="charset.html#charset-charsets" title="10.10. Character Sets and Collations That MySQL Supports">Section 10.10, “Character Sets and Collations That MySQL Supports”</a>.)
    </p><p>
      Any given character set always has at least one collation. It may
      have several collations.
    </p><p>
      To list the collations for a character set, use the <code class="literal">SHOW
      COLLATION</code> statement. For example, to see the collations
      for the <code class="literal">latin1</code> (“<span class="quote">ISO-8859-1 West
      European</span>”) character set, use this statement to find those
      collation names that begin with <code class="literal">latin1</code>:
    </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW COLLATION LIKE 'latin1%';</code></strong>
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+
</pre><p>
      The <code class="literal">latin1</code> collations have the following
      meanings:
    </p><div class="informaltable"><table border="1"><colgroup><col><col></colgroup><tbody><tr><td><span class="bold"><strong>Collation</strong></span></td><td><span class="bold"><strong>Meaning</strong></span></td></tr><tr><td><code class="literal">latin1_german1_ci</code></td><td>German DIN-1</td></tr><tr><td><code class="literal">latin1_swedish_ci</code></td><td>Swedish/Finnish</td></tr><tr><td><code class="literal">latin1_danish_ci</code></td><td>Danish/Norwegian</td></tr><tr><td><code class="literal">latin1_german2_ci</code></td><td>German DIN-2</td></tr><tr><td><code class="literal">latin1_bin</code></td><td>Binary according to <code class="literal">latin1</code> encoding</td></tr><tr><td><code class="literal">latin1_general_ci</code></td><td>Multilingual (Western European)</td></tr><tr><td><code class="literal">latin1_general_cs</code></td><td>Multilingual (ISO Western European), case sensitive</td></tr><tr><td><code class="literal">latin1_spanish_ci</code></td><td>Modern Spanish</td></tr></tbody></table></div><p>
      Collations have these general characteristics:
    </p><div class="itemizedlist"><ul type="disc"><li><p>
          Two different character sets cannot have the same collation.
        </p></li><li><p>
          Each character set has one collation that is the
          <span class="emphasis"><em>default collation</em></span>. For example, the
          default collation for <code class="literal">latin1</code> is
          <code class="literal">latin1_swedish_ci</code>.
        </p></li><li><p>
          There is a convention for collation names: They start with the
          name of the character set with which they are associated, they
          usually include a language name, and they end with
          <code class="literal">_ci</code> (case insensitive),
          <code class="literal">_cs</code> (case sensitive), or
          <code class="literal">_bin</code> (binary).
        </p></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-defaults"></a>10.3. Determining the Default Character Set and Collation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="charset.html#charset-server">10.3.1. Server Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-database">10.3.2. Database Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-table">10.3.3. Table Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-column">10.3.4. Column Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-examples">10.3.5. Examples of Character Set and Collation Assignment</a></span></dt><dt><span class="section"><a href="charset.html#charset-connection">10.3.6. Connection Character Sets and Collations</a></span></dt><dt><span class="section"><a href="charset.html#charset-literal">10.3.7. Character String Literal Character Set and Collation</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate">10.3.8. Using <code class="literal">COLLATE</code> in SQL Statements</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate-precedence">10.3.9. <code class="literal">COLLATE</code> Clause Precedence</a></span></dt><dt><span class="section"><a href="charset.html#charset-binary-op">10.3.10. <code class="literal">BINARY</code> Operator</a></span></dt><dt><span class="section"><a href="charset.html#charset-collate-tricky">10.3.11. Some Special Cases Where the Collation Determination Is Tricky</a></span></dt><dt><span class="section"><a href="charset.html#charset-collation-charset">10.3.12. Collations Must Be for the Right Character Set</a></span></dt><dt><span class="section"><a href="charset.html#charset-collation-effect">10.3.13. An Example of the Effect of Collation</a></span></dt></dl></div><p>
      There are default settings for character sets and collations at
      four levels: server, database, table, and connection. The
      following description may appear complex, but it has been found in
      practice that multiple-level defaulting leads to natural and
      obvious results.
    </p><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-server"></a>10.3.1. Server Character Set and Collation</h3></div></div></div><p>
        The MySQL Server has a server character set and a server
        collation, neither of which which may be null.
      </p><p>
        MySQL determines the server character set and server collation
        as follows:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            According to the option settings in effect when the server
            starts
          </p></li><li><p>
            According to the values set at runtime
          </p></li></ul></div><p>
        At the server level, the decision is simple. The server
        character set and collation depend initially on the options that
        you use when you start <span><strong class="command">mysqld</strong></span>. You can use
        <code class="option">--default-character-set</code> for the character set,
        and along with it you can add
        <code class="option">--default-collation</code> for the collation. If you
        don't specify a character set, that is the same as saying
        <code class="option">--default-character-set=latin1</code>. If you specify
        only a character set (for example, <code class="literal">latin1</code>)
        but not a collation, that is the same as saying
        <code class="option">--default-charset=latin1</code>
        <code class="option">--default-collation=latin1_swedish_ci</code> because
        <code class="literal">latin1_swedish_ci</code> is the default collation
        for <code class="literal">latin1</code>. Therefore, the following three
        commands all have the same effect:
      </p><pre class="programlisting">shell&gt; <strong class="userinput"><code>mysqld</code></strong>
shell&gt; <strong class="userinput"><code>mysqld --default-character-set=latin1</code></strong>
shell&gt; <strong class="userinput"><code>mysqld --default-character-set=latin1 \</code></strong>
           <strong class="userinput"><code>--default-collation=latin1_swedish_ci</code></strong>
</pre><p>
        One way to change the settings is by recompiling. If you want to
        change the default server character set and collation when
        building from sources, use: <code class="option">--with-charset</code> and
        <code class="option">--with-collation</code> as arguments for
        <span><strong class="command">configure</strong></span>. For example:
      </p><pre class="programlisting">shell&gt; <strong class="userinput"><code>./configure --with-charset=latin1</code></strong>
</pre><p>
        Or:
      </p><pre class="programlisting">shell&gt; <strong class="userinput"><code>./configure --with-charset=latin1 \</code></strong>
           <strong class="userinput"><code>--with-collation=latin1_german1_ci</code></strong>
</pre><p>
        Both <span><strong class="command">mysqld</strong></span> and <span><strong class="command">configure</strong></span>
        verify that the character set/collation combination is valid. If
        not, each program displays an error message and terminates.
      </p><p>
        The current server character set and collation are available as
        the values of the <code class="literal">character_set_server</code> and
        <code class="literal">collation_server</code> system variables. These
        variables can be changed at runtime.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-database"></a>10.3.2. Database Character Set and Collation</h3></div></div></div><p>
        Every database has a database character set and a database
        collation, which may not be null. The <code class="literal">CREATE
        DATABASE</code> and <code class="literal">ALTER DATABASE</code>
        statements have optional clauses for specifying the database
        character set and collation:
      </p><pre class="programlisting">CREATE DATABASE <em class="replaceable"><code>db_name</code></em>
    [[DEFAULT] CHARACTER SET <em class="replaceable"><code>charset_name</code></em>]
    [[DEFAULT] COLLATE <em class="replaceable"><code>collation_name</code></em>]

ALTER DATABASE <em class="replaceable"><code>db_name</code></em>
    [[DEFAULT] CHARACTER SET <em class="replaceable"><code>charset_name</code></em>]
    [[DEFAULT] COLLATE <em class="replaceable"><code>collation_name</code></em>]
</pre><p>
        Example:
      </p><pre class="programlisting">CREATE DATABASE <em class="replaceable"><code>db_name</code></em>
    DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;
</pre><p>
        MySQL chooses the database character set and database collation
        thus:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            If both <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> and <code class="literal">COLLATE
            <em class="replaceable"><code>Y</code></em></code> were specified, then
            character set <em class="replaceable"><code>X</code></em> and collation
            <em class="replaceable"><code>Y</code></em>.
          </p></li><li><p>
            If <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> was specified without
            <code class="literal">COLLATE</code>, then character set
            <em class="replaceable"><code>X</code></em> and its default collation.
          </p></li><li><p>
            Otherwise, the server character set and server collation.
          </p></li></ul></div><p>
        MySQL's <code class="literal">CREATE DATABASE ... DEFAULT CHARACTER SET
        ...</code> syntax is analogous to the standard SQL
        <code class="literal">CREATE SCHEMA ... CHARACTER SET ...</code> syntax.
        Because of this, it is possible to create databases with
        different character sets and collations on the same MySQL
        server.
      </p><p>
        The database character set and collation are used as default
        values if the table character set and collation are not
        specified in <code class="literal">CREATE TABLE</code> statements. They
        have no other purpose.
      </p><p>
        The character set and collation for the default database are
        available as the values of the
        <code class="literal">character_set_database</code> and
        <code class="literal">collation_database</code> system variables. The
        server sets these variables whenever the default database
        changes. If there is no default database, the variables have the
        same value as the corresponding server-level variables,
        <code class="literal">character_set_server</code> and
        <code class="literal">collation_server</code>.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-table"></a>10.3.3. Table Character Set and Collation</h3></div></div></div><p>
        Every table has a table character set and a table collation,
        which may not be null. The <code class="literal">CREATE TABLE</code> and
        <code class="literal">ALTER TABLE</code> statements have optional clauses
        for specifying the table character set and collation:
      </p><pre class="programlisting">CREATE TABLE <em class="replaceable"><code>tbl_name</code></em> (<em class="replaceable"><code>column_list</code></em>)
    [DEFAULT CHARACTER SET <em class="replaceable"><code>charset_name</code></em> [COLLATE <em class="replaceable"><code>collation_name</code></em>]]

ALTER TABLE <em class="replaceable"><code>tbl_name</code></em>
    [DEFAULT CHARACTER SET <em class="replaceable"><code>charset_name</code></em>] [COLLATE <em class="replaceable"><code>collation_name</code></em>]
</pre><p>
        Example:
      </p><pre class="programlisting">CREATE TABLE t1 ( ... )
    DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
</pre><p>
        MySQL chooses the table character set and collation in the
        following manner:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            If both <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> and <code class="literal">COLLATE
            <em class="replaceable"><code>Y</code></em></code> were specified, then
            character set <em class="replaceable"><code>X</code></em> and collation
            <em class="replaceable"><code>Y</code></em>.
          </p></li><li><p>
            If <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> was specified without
            <code class="literal">COLLATE</code>, then character set
            <em class="replaceable"><code>X</code></em> and its default collation.
          </p></li><li><p>
            Otherwise, the database character set and collation.
          </p></li></ul></div><p>
        The table character set and collation are used as default values
        if the column character set and collation are not specified in
        individual column definitions. The table character set and
        collation are MySQL extensions; there are no such things in
        standard SQL.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-column"></a>10.3.4. Column Character Set and Collation</h3></div></div></div><p>
        Every “<span class="quote">character</span>” column (that is, a column of type
        <code class="literal">CHAR</code>, <code class="literal">VARCHAR</code>, or
        <code class="literal">TEXT</code>) has a column character set and a column
        collation, which may not be null. Column definition syntax has
        optional clauses for specifying the column character set and
        collation:
      </p><pre class="programlisting"><em class="replaceable"><code>col_name</code></em> {CHAR | VARCHAR | TEXT} (<em class="replaceable"><code>col_length</code></em>)
    [CHARACTER SET <em class="replaceable"><code>charset_name</code></em> [COLLATE <em class="replaceable"><code>collation_name</code></em>]]
</pre><p>
        Example:
      </p><pre class="programlisting">CREATE TABLE Table1
(
    column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);
</pre><p>
        MySQL chooses the column character set and collation in the
        following manner:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            If both <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> and <code class="literal">COLLATE
            <em class="replaceable"><code>Y</code></em></code> were specified, then
            character set <em class="replaceable"><code>X</code></em> and collation
            <em class="replaceable"><code>Y</code></em> are used.
          </p></li><li><p>
            If <code class="literal">CHARACTER SET
            <em class="replaceable"><code>X</code></em></code> was specified without
            <code class="literal">COLLATE</code>, then character set
            <em class="replaceable"><code>X</code></em> and its default collation are
            used.
          </p></li><li><p>
            Otherwise, the table character set and collation are used.
          </p></li></ul></div><p>
        The <code class="literal">CHARACTER SET</code> and
        <code class="literal">COLLATE</code> clauses are standard SQL.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-examples"></a>10.3.5. Examples of Character Set and Collation Assignment</h3></div></div></div><p>
        The following examples show how MySQL determines default
        character set and collation values.
      </p><p>
        <span class="bold"><strong>Example 1: Table and Column
        Definition</strong></span>
      </p><pre class="programlisting">CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
</pre><p>
        Here we have a column with a <code class="literal">latin1</code> character
        set and a <code class="literal">latin1_german1_ci</code> collation. The
        definition is explicit, so that's straightforward. Notice that
        there is no problem with storing a <code class="literal">latin1</code>
        column in a <code class="literal">latin2</code> table.
      </p><p>
        <span class="bold"><strong>Example 2: Table and Column
        Definition</strong></span>
      </p><pre class="programlisting">CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
</pre><p>
        This time we have a column with a <code class="literal">latin1</code>
        character set and a default collation. Although it might seem
        natural, the default collation is not taken from the table
        level. Instead, because the default collation for
        <code class="literal">latin1</code> is always
        <code class="literal">latin1_swedish_ci</code>, column
        <code class="literal">c1</code> has a collation of
        <code class="literal">latin1_swedish_ci</code> (not
        <code class="literal">latin1_danish_ci</code>).
      </p><p>
        <span class="bold"><strong>Example 3: Table and Column
        Definition</strong></span>
      </p><pre class="programlisting">CREATE TABLE t1
(
    c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
</pre><p>
        We have a column with a default character set and a default
        collation. In this circumstance, MySQL looks up to the table
        level for inspiration in determining the column character set
        and collation. So, the character set for column
        <code class="literal">c1</code> is <code class="literal">latin1</code> and its
        collation is <code class="literal">latin1_danish_ci</code>.
      </p><p>
        <span class="bold"><strong>Example 4: Database, Table, and Column
        Definition</strong></span>
      </p><pre class="programlisting">CREATE DATABASE d1
    DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
    c1 CHAR(10)
);
</pre><p>
        We create a column without specifying its character set and
        collation. We're also not specifying a character set and a
        collation at the table level. In this circumstance, MySQL looks
        up to the database level for inspiration. (The database's
        settings become the table's settings, and thereafter become the
        column's setting.) So, the character set for column
        <code class="literal">c1</code> is <code class="literal">latin2</code> and its
        collation is <code class="literal">latin2_czech_ci</code>.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-connection"></a>10.3.6. Connection Character Sets and Collations</h3></div></div></div><p>
        Several character set and collation system variables relate to a
        client's interaction with the server. Some of these have been
        mentioned in earlier sections:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            The server character set and collation are available as the
            values of the <code class="literal">character_set_server</code> and
            <code class="literal">collation_server</code> variables.
          </p></li><li><p>
            The character set and collation of the default database are
            available as the values of the
            <code class="literal">character_set_database</code> and
            <code class="literal">collation_database</code> variables.
          </p></li></ul></div><p>
        Additional character set and collation variables are involved in
        handling traffic for the connection between a client and the
        server. Every client has connection-related character set and
        collation variables.
      </p><p>
        Consider what a “<span class="quote">connection</span>” is: It's what you make
        when you connect to the server. The client sends SQL statements,
        such as queries, over the connection to the server. The server
        sends responses, such as result sets, over the connection back
        to the client. This leads to several questions about character
        set and collation handling for client connections, each of which
        can be answered in terms of system variables:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            What character set is the query in when it leaves the
            client?
          </p><p>
            The server takes the <code class="literal">character_set_client</code>
            variable to be the character set in which queries are sent
            by the client.
          </p></li><li><p>
            What character set should the server translate a query to
            after receiving it?
          </p><p>
            For this, <code class="literal">character_set_connection</code> and
            <code class="literal">collation_connection</code> are used by the
            server. It converts queries sent by the client from
            <code class="literal">character_set_client</code> to
            <code class="literal">character_set_connection</code> (except for
            string literals that have an introducer such as
            <code class="literal">_latin1</code> or <code class="literal">_utf8</code>).
            <code class="literal">collation_connection</code> is important for
            comparisons of literal strings. For comparisons of strings
            with column values, it does not matter because columns have
            a higher collation precedence.
          </p></li><li><p>
            What character set should the server translate to before
            shipping result sets or error messages back to the client?
          </p><p>
            The <code class="literal">character_set_results</code> variable
            indicates the character set in which the server returns
            query results to the client. This includes result data such
            as column values, and result metadata such as column names.
          </p></li></ul></div><p>
        You can fine-tune the settings for these variables, or you can
        depend on the defaults (in which case, you can skip this
        section).
      </p><p>
        There are two statements that affect the connection character
        sets:
      </p><pre class="programlisting">SET NAMES '<em class="replaceable"><code>charset_name</code></em>'
SET CHARACTER SET <em class="replaceable"><code>charset_name</code></em>
</pre><p>
        <code class="literal">SET NAMES</code> indicates what is in the SQL
        statements that the client sends. Thus, <code class="literal">SET NAMES
        'cp1251'</code> tells the server “<span class="quote">future incoming
        messages from this client are in character set
        <code class="literal">cp1251</code>.</span>” It also specifies the
        character set for results that the server sends back to the
        client. (For example, it indicates what character set column
        values are if you use a <code class="literal">SELECT</code> statement.)
      </p><p>
        A <code class="literal">SET NAMES '<em class="replaceable"><code>x</code></em>'</code>
        statement is equivalent to these three statements:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SET character_set_client = <em class="replaceable"><code>x</code></em>;</code></strong>
mysql&gt; <strong class="userinput"><code>SET character_set_results = <em class="replaceable"><code>x</code></em>;</code></strong>
mysql&gt; <strong class="userinput"><code>SET character_set_connection = <em class="replaceable"><code>x</code></em>;</code></strong>
</pre><p>
        Setting <code class="literal">character_set_connection</code> to
        <code class="literal">x</code> also sets
        <code class="literal">collation_connection</code> to the default collation
        for <code class="literal">x</code>.
      </p><p>
        <code class="literal">SET CHARACTER SET</code> is similar but sets the
        connection character set and collation to be those of the
        default database. A <code class="literal">SET CHARACTER SET x</code>
        statement is equivalent to these three statements:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SET character_set_client = <em class="replaceable"><code>x</code></em>;</code></strong>
mysql&gt; <strong class="userinput"><code>SET character_set_results = <em class="replaceable"><code>x</code></em>;</code></strong>
mysql&gt; <strong class="userinput"><code>SET collation_connection = @@collation_database;</code></strong>
</pre><p>
        When a client connects, it sends to the server the name of the
        character set that it wants to use. The server sets the
        <code class="literal">character_set_client</code>,
        <code class="literal">character_set_results</code>, and
        <code class="literal">character_set_connection</code> variables to that
        character set. (In effect, the server performs a <code class="literal">SET
        NAMES</code> operation using the character set.)
      </p><p>
        With the <span><strong class="command">mysql</strong></span> client, it is not necessary to
        execute <code class="literal">SET NAMES</code> every time you start up if
        you want to use a character set different from the default. You
        can add the <code class="option">--default-character-set</code> option
        setting to your <span><strong class="command">mysql</strong></span> statement line, or in
        your option file. For example, the following option file setting
        changes the three character set variables set to
        <code class="literal">koi8r</code> each time you run
        <span><strong class="command">mysql</strong></span>:
      </p><pre class="programlisting">[mysql]
default-character-set=koi8r
</pre><p>
        Example: Suppose that <code class="literal">column1</code> is defined as
        <code class="literal">CHAR(5) CHARACTER SET latin2</code>. If you do not
        say <code class="literal">SET NAMES</code> or <code class="literal">SET CHARACTER
        SET</code>, then for <code class="literal">SELECT column1 FROM
        t</code>, the server sends back all the values for
        <code class="literal">column1</code> using the character set that the
        client specified when it connected. On the other hand, if you
        say <code class="literal">SET NAMES 'latin1'</code> or <code class="literal">SET
        CHARACTER SET latin1</code>, then just before sending results
        back, the server converts the <code class="literal">latin2</code> values
        to <code class="literal">latin1</code>. Conversion may be lossy if there
        are characters that are not in both character sets.
      </p><p>
        If you do not want the server to perform any conversion, set
        <code class="literal">character_set_results</code> to
        <code class="literal">NULL</code>:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SET character_set_results = NULL;</code></strong>
</pre></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-literal"></a>10.3.7. Character String Literal Character Set and Collation</h3></div></div></div><p>
        Every character string literal has a character set and a
        collation, which may not be null.
      </p><p>
        A character string literal may have an optional character set
        introducer and <code class="literal">COLLATE</code> clause:
      </p><a class="indexterm" name="id2905958"></a><a class="indexterm" name="id2905968"></a><pre class="programlisting">[_<em class="replaceable"><code>charset_name</code></em>]'<em class="replaceable"><code>string</code></em>' [COLLATE <em class="replaceable"><code>collation_name</code></em>]
</pre><p>
        Examples:
      </p><pre class="programlisting">SELECT '<em class="replaceable"><code>string</code></em>';
SELECT _latin1'<em class="replaceable"><code>string</code></em>';
SELECT _latin1'<em class="replaceable"><code>string</code></em>' COLLATE latin1_danish_ci;
</pre><p>
        For the simple statement <code class="literal">SELECT
        '<em class="replaceable"><code>string</code></em>'</code>, the string has
        the character set and collation defined by the
        <code class="literal">character_set_connection</code> and
        <code class="literal">collation_connection</code> system variables.
      </p><p>
        The <code class="literal">_<em class="replaceable"><code>charset_name</code></em></code>
        expression is formally called an
        <span class="emphasis"><em>introducer</em></span>. It tells the parser, “<span class="quote">the
        string that is about to follow uses character set
        <em class="replaceable"><code>X</code></em>.</span>” Because this has confused
        people in the past, we emphasize that an introducer does not
        cause any conversion; it is strictly a signal that does not
        change the string's value. An introducer is also legal before
        standard hex literal and numeric hex literal notation
        (<code class="literal">x'<em class="replaceable"><code>literal</code></em>'</code> and
        <code class="literal">0x<em class="replaceable"><code>nnnn</code></em></code>), and
        before <code class="literal">?</code> (parameter substitution when using
        prepared statements within a programming language interface).
      </p><p>
        Examples:
      </p><pre class="programlisting">SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
SELECT _latin1 ?;
</pre><p>
        MySQL determines a literal's character set and collation thus:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            If both <em class="replaceable"><code>_X</code></em> and <code class="literal">COLLATE
            <em class="replaceable"><code>Y</code></em></code> were specified, then
            character set <em class="replaceable"><code>X</code></em> and collation
            <em class="replaceable"><code>Y</code></em> are used.
          </p></li><li><p>
            If <em class="replaceable"><code>_X</code></em> is specified but
            <code class="literal">COLLATE</code> is not specified, then character
            set <em class="replaceable"><code>X</code></em> and its default collation
            are used.
          </p></li><li><p>
            Otherwise, the character set and collation given by the
            <code class="literal">character_set_connection</code> and
            <code class="literal">collation_connection</code> system variables are
            used.
          </p></li></ul></div><p>
        Examples:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            A string with <code class="literal">latin1</code> character set and
            <code class="literal">latin1_german1_ci</code> collation:
          </p><pre class="programlisting">SELECT _latin1'Müller' COLLATE latin1_german1_ci;
</pre></li><li><p>
            A string with <code class="literal">latin1</code> character set and
            its default collation (that is,
            <code class="literal">latin1_swedish_ci</code>):
          </p><pre class="programlisting">SELECT _latin1'Müller';
</pre></li><li><p>
            A string with the connection default character set and
            collation:
          </p><pre class="programlisting">SELECT 'Müller';
</pre></li></ul></div><p>
        Character set introducers and the <code class="literal">COLLATE</code>
        clause are implemented according to standard SQL specifications.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-collate"></a>10.3.8. Using <code class="literal">COLLATE</code> in SQL Statements</h3></div></div></div><p>
        With the <code class="literal">COLLATE</code> clause, you can override
        whatever the default collation is for a comparison.
        <code class="literal">COLLATE</code> may be used in various parts of SQL
        statements. Here are some examples:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            With <code class="literal">ORDER BY</code>:
          </p><pre class="programlisting">SELECT k
FROM t1
ORDER BY k COLLATE latin1_german2_ci;
</pre></li><li><p>
            With <code class="literal">AS</code>:
          </p><pre class="programlisting">SELECT k COLLATE latin1_german2_ci AS k1
FROM t1
ORDER BY k1;
</pre></li><li><p>
            With <code class="literal">GROUP BY</code>:
          </p><pre class="programlisting">SELECT k
FROM t1
GROUP BY k COLLATE latin1_german2_ci;
</pre></li><li><p>
            With aggregate functions:
          </p><pre class="programlisting">SELECT MAX(k COLLATE latin1_german2_ci)
FROM t1;
</pre></li><li><p>
            With <code class="literal">DISTINCT</code>:
          </p><pre class="programlisting">SELECT DISTINCT k COLLATE latin1_german2_ci
FROM t1;
</pre></li><li><p>
            With <code class="literal">WHERE</code>:
          </p><pre class="programlisting">     SELECT *
     FROM t1
     WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
</pre><pre class="programlisting">     SELECT *
     FROM t1
     WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
</pre></li><li><p>
            With <code class="literal">HAVING</code>:
          </p><pre class="programlisting">SELECT k
FROM t1
GROUP BY k
HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
</pre></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-collate-precedence"></a>10.3.9. <code class="literal">COLLATE</code> Clause Precedence</h3></div></div></div><p>
        The <code class="literal">COLLATE</code> clause has high precedence
        (higher than <code class="literal">||</code>), so the following two
        expressions are equivalent:
      </p><pre class="programlisting">x || y COLLATE z
x || (y COLLATE z)
</pre></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-binary-op"></a>10.3.10. <code class="literal">BINARY</code> Operator</h3></div></div></div><p>
        The <code class="literal">BINARY</code> operator is a shorthand for a
        <code class="literal">COLLATE</code> clause. <code class="literal">BINARY
        '<em class="replaceable"><code>x</code></em>'</code> is equivalent to
        <code class="literal">'<em class="replaceable"><code>x</code></em>' COLLATE
        <em class="replaceable"><code>y</code></em></code>, where
        <em class="replaceable"><code>y</code></em> is the name of the binary collation
        for the character set of '<em class="replaceable"><code>x</code></em>'. Every
        character set has a binary collation. For example, the binary
        collation for the <code class="literal">latin1</code> character set is
        <code class="literal">latin1_bin</code>, so if the column
        <code class="literal">a</code> is of character set
        <code class="literal">latin1</code>, the following two statements have the
        same effect:
      </p><pre class="programlisting">SELECT * FROM t1 ORDER BY BINARY a;
SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;
</pre></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-collate-tricky"></a>10.3.11. Some Special Cases Where the Collation Determination Is Tricky</h3></div></div></div><p>
        In the great majority of queries, it is obvious what collation
        MySQL uses to resolve a comparison operation. For example, in
        the following cases, it should be clear that the collation is
        “<span class="quote">the column collation of column
        <code class="literal">x</code></span>”:
      </p><pre class="programlisting">SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;
</pre><p>
        However, when multiple operands are involved, there can be
        ambiguity. For example:
      </p><pre class="programlisting">SELECT x FROM T WHERE x = 'Y';
</pre><p>
        Should this query use the collation of the column
        <code class="literal">x</code>, or of the string literal
        <code class="literal">'Y'</code>?
      </p><p>
        Standard SQL resolves such questions using what used to be
        called “<span class="quote">coercibility</span>” rules. Basically, this means:
        Since both <code class="literal">x</code> and <code class="literal">'Y'</code> have
        collations, whose collation takes precedence? This can be
        difficult to resolve, but the following rules take care of most
        situations:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            An explicit <code class="literal">COLLATE</code> clause has a
            coercibility of 0. (Not coercible at all.)
          </p></li><li><p>
            The concatenation of two strings with different collations
            has a coercibility of 1.
          </p></li><li><p>
            A column's collation has a coercibility of 2.
          </p></li><li><p>
            A “<span class="quote">system constant</span>” (the string returned by
            functions such as <code class="literal">USER()</code> or
            <code class="literal">VERSION()</code>) has a coercibility of 3.
          </p></li><li><p>
            A literal's collation has a coercibility of 4.
          </p></li><li><p>
            <code class="literal">NULL</code> or an expression that is derived
            from <code class="literal">NULL</code> has a coercibility of 5.
          </p></li></ul></div><p>
        The preceding coercibility values are current as of MySQL 5.0.3.
        See the note later in this section for additional
        version-related information.
      </p><p>
        Those rules resolve ambiguities thus:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            Use the collation with the lowest coercibility value.
          </p></li><li><p>
            If both sides have the same coercibility, then it is an
            error if the collations aren't the same.
          </p></li></ul></div><p>
        Examples:
      </p><div class="informaltable"><table border="1"><colgroup><col><col></colgroup><tbody><tr><td><code class="literal">column1 = 'A'</code></td><td>Use collation of <code class="literal">column1</code></td></tr><tr><td><code class="literal">column1 = 'A' COLLATE x</code></td><td>Use collation of <code class="literal">'A'</code></td></tr><tr><td><code class="literal">column1 COLLATE x = 'A' COLLATE y</code></td><td>Error</td></tr></tbody></table></div><p>
        The <code class="literal">COERCIBILITY()</code> function can be used to
        determine the coercibility of a string expression:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);</code></strong>
        -&gt; 0
mysql&gt; <strong class="userinput"><code>SELECT COERCIBILITY(VERSION());</code></strong>
        -&gt; 3
mysql&gt; <strong class="userinput"><code>SELECT COERCIBILITY('A');</code></strong>
        -&gt; 4
</pre><p>
        See <a href="functions.html#information-functions" title="12.9.3. Information Functions">Section 12.9.3, “Information Functions”</a>.
      </p><p>
        In MySQL 5.0 prior to release 5.0.3, there is no
        system constant or ignorable coercibility. Functions such as
        <code class="literal">USER()</code> have a coercibility of 2 rather than
        3, and literals have a coercibility of 3 rather than 4.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-collation-charset"></a>10.3.12. Collations Must Be for the Right Character Set</h3></div></div></div><p>
        Recall that each character set has one or more collations, and
        each collation is associated with one and only one character
        set. Therefore, the following statement causes an error message
        because the <code class="literal">latin2_bin</code> collation is not legal
        with the <code class="literal">latin1</code> character set:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SELECT _latin1 'x' COLLATE latin2_bin;</code></strong>
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
</pre></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-collation-effect"></a>10.3.13. An Example of the Effect of Collation</h3></div></div></div><p>
        Suppose that column <code class="literal">X</code> in table
        <code class="literal">T</code> has these <code class="literal">latin1</code> column
        values:
      </p><pre class="programlisting">Muffler
Müller
MX Systems
MySQL
</pre><p>
        And suppose that the column values are retrieved using the
        following statement:
      </p><pre class="programlisting">SELECT X FROM T ORDER BY X COLLATE <em class="replaceable"><code>collation_name</code></em>;
</pre><p>
        The resulting order of the values for different collations is
        shown in this table:
      </p><div class="informaltable"><table border="1"><colgroup><col><col><col></colgroup><tbody><tr><td><code class="literal">latin1_swedish_ci</code></td><td><code class="literal">latin1_german1_ci</code></td><td><code class="literal">latin1_german2_ci</code></td></tr><tr><td>Muffler</td><td>Muffler</td><td>Müller</td></tr><tr><td>MX Systems</td><td>Müller</td><td>Muffler</td></tr><tr><td>Müller</td><td>MX Systems</td><td>MX Systems</td></tr><tr><td>MySQL</td><td>MySQL</td><td>MySQL</td></tr></tbody></table></div><p>
        The table is an example that shows what the effect would be if
        we used different collations in an <code class="literal">ORDER BY</code>
        clause. The character that causes the different sort orders in
        this example is the U with two dots over it
        (<code class="literal">ü</code>), which the Germans call "U-umlaut".
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            The first column shows the result of the
            <code class="literal">SELECT</code> using the Swedish/Finnish
            collating rule, which says that U-umlaut sorts with Y.
          </p></li><li><p>
            The second column shows the result of the
            <code class="literal">SELECT</code> using the German DIN-1 rule, which
            says that U-umlaut sorts with U.
          </p></li><li><p>
            The third column shows the result of the
            <code class="literal">SELECT</code> using the German DIN-2 rule, which
            says that U-umlaut sorts with UE.
          </p></li></ul></div></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-operations"></a>10.4. Operations Affected by Character Set Support</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="charset.html#charset-result">10.4.1. Result Strings</a></span></dt><dt><span class="section"><a href="charset.html#charset-convert">10.4.2. <code class="literal">CONVERT()</code></a></span></dt><dt><span class="section"><a href="charset.html#charset-cast">10.4.3. <code class="literal">CAST()</code></a></span></dt><dt><span class="section"><a href="charset.html#charset-show">10.4.4. <code class="literal">SHOW</code> Statements</a></span></dt></dl></div><p>
      This section describes operations that take character set
      information into account in MySQL 5.0.
    </p><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-result"></a>10.4.1. Result Strings</h3></div></div></div><p>
        MySQL has many operators and functions that return a string.
        This section answers the question: What is the character set and
        collation of such a string?
      </p><p>
        For simple functions that take string input and return a string
        result as output, the output's character set and collation are
        the same as those of the principal input value. For example,
        <code class="literal">UPPER(<em class="replaceable"><code>X</code></em>)</code> returns a
        string whose character string and collation are the same as that
        of <em class="replaceable"><code>X</code></em>. The same applies for
        <code class="literal">INSTR()</code>, <code class="literal">LCASE()</code>,
        <code class="literal">LOWER()</code>, <code class="literal">LTRIM()</code>,
        <code class="literal">MID()</code>, <code class="literal">REPEAT()</code>,
        <code class="literal">REPLACE()</code>, <code class="literal">REVERSE()</code>,
        <code class="literal">RIGHT()</code>, <code class="literal">RPAD()</code>,
        <code class="literal">RTRIM()</code>, <code class="literal">SOUNDEX()</code>,
        <code class="literal">SUBSTRING()</code>, <code class="literal">TRIM()</code>,
        <code class="literal">UCASE()</code>, and <code class="literal">UPPER()</code>.
        (Also note: The <code class="literal">REPLACE()</code> function, unlike
        all other functions, always ignores the collation of the string
        input and performs a case-insensitive comparison.)
      </p><p>
        For operations that combine multiple string inputs and return a
        single string output, the “<span class="quote">aggregation rules</span>” of
        standard SQL apply:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            If an explicit <code class="literal">COLLATE
            <em class="replaceable"><code>X</code></em></code> occurs, then use
            <em class="replaceable"><code>X</code></em>.
          </p></li><li><p>
            If explicit <code class="literal">COLLATE
            <em class="replaceable"><code>X</code></em></code> and <code class="literal">COLLATE
            <em class="replaceable"><code>Y</code></em></code> occur, then raise an
            error.
          </p></li><li><p>
            Otherwise, if all collations are
            <em class="replaceable"><code>X</code></em>, then use
            <em class="replaceable"><code>X</code></em>.
          </p></li><li><p>
            Otherwise, the result has no collation.
          </p></li></ul></div><p>
        For example, with <code class="literal">CASE ... WHEN a THEN b WHEN b THEN c
        COLLATE <em class="replaceable"><code>X</code></em> END</code>, the
        resulting collation is <em class="replaceable"><code>X</code></em>. The same
        applies for <code class="literal">CASE</code>, <code class="literal">UNION</code>,
        <code class="literal">||</code>, <code class="literal">CONCAT()</code>,
        <code class="literal">ELT()</code>, <code class="literal">GREATEST()</code>,
        <code class="literal">IF()</code>, and <code class="literal">LEAST()</code>.
      </p><p>
        For operations that convert to character data, the character set
        and collation of the strings that result from the operations are
        defined by the <code class="literal">character_set_connection</code> and
        <code class="literal">collation_connection</code> system variables. This
        applies to <code class="literal">CAST()</code>, <code class="literal">CHAR()</code>,
        <code class="literal">CONV()</code>, <code class="literal">FORMAT()</code>,
        <code class="literal">HEX()</code>, and <code class="literal">SPACE()</code>.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-convert"></a>10.4.2. <code class="literal">CONVERT()</code></h3></div></div></div><p>
        <code class="literal">CONVERT()</code> provides a way to convert data
        between different character sets. The syntax is:
      </p><pre class="programlisting">CONVERT(<em class="replaceable"><code>expr</code></em> USING <em class="replaceable"><code>transcoding_name</code></em>)
</pre><p>
        In MySQL, transcoding names are the same as the corresponding
        character set names.
      </p><p>
        Examples:
      </p><pre class="programlisting">SELECT CONVERT(_latin1'Müller' USING utf8);
INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8) FROM latin1table;
</pre><p>
        <code class="literal">CONVERT(... USING ...)</code> is implemented
        according to the standard SQL specification.
      </p><p>
        In <code class="literal">TRADITIONAL</code> SQL mode, if you convert a
        “<span class="quote">zero</span>” date string to a date,
        <code class="literal">CONVERT()</code> returns <code class="literal">NULL</code>.
        MySQL 5.0.4 and above also produce a warning.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-cast"></a>10.4.3. <code class="literal">CAST()</code></h3></div></div></div><p>
        You may also use <code class="literal">CAST()</code> to convert a string
        to a different character set. The syntax is:
      </p><pre class="programlisting">CAST(<em class="replaceable"><code>character_string</code></em> AS <em class="replaceable"><code>character_data_type</code></em> CHARACTER SET <em class="replaceable"><code>charset_name</code></em>)
</pre><p>
        Example:
      </p><pre class="programlisting">SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
</pre><p>
        If you use <code class="literal">CAST()</code> without specifying
        <code class="literal">CHARACTER SET</code>, the resulting character set
        and collation are defined by the
        <code class="literal">character_set_connection</code> and
        <code class="literal">collation_connection</code> system variables. If you
        use <code class="literal">CAST()</code> with <code class="literal">CHARACTER SET
        X</code>, then the resulting character set and collation are
        <code class="literal">X</code> and the default collation of
        <code class="literal">X</code>.
      </p><p>
        You may not use a <code class="literal">COLLATE</code> clause inside a
        <code class="literal">CAST()</code>, but you may use it outside. That is,
        <code class="literal">CAST(... COLLATE ...)</code> is illegal, but
        <code class="literal">CAST(...) COLLATE ...</code> is legal.
      </p><p>
        Example:
      </p><pre class="programlisting">SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
</pre><p>
        In <code class="literal">TRADITIONAL</code> SQL mode, if you convert a
        “<span class="quote">zero</span>” date string to a date,
        <code class="literal">CAST()</code> returns <code class="literal">NULL</code>. MySQL
        5.0.4 and above also produce a warning.
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-show"></a>10.4.4. <code class="literal">SHOW</code> Statements</h3></div></div></div><p>
        Several <code class="literal">SHOW</code> statements provide additional
        character set information. These include <code class="literal">SHOW CHARACTER
        SET</code>, <code class="literal">SHOW COLLATION</code>, <code class="literal">SHOW
        CREATE DATABASE</code>, <code class="literal">SHOW CREATE TABLE</code>
        and <code class="literal">SHOW COLUMNS</code>.
      </p><p>
        The <code class="literal">SHOW CHARACTER SET</code> command shows all
        available character sets. It takes an optional
        <code class="literal">LIKE</code> clause that indicates which character
        set names to match. For example:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW CHARACTER SET LIKE 'latin%';</code></strong>
+---------+-----------------------------+-------------------+--------+
| Charset | Description                 | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
| latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
| latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
+---------+-----------------------------+-------------------+--------+
</pre><p>
        See <a href="sql-syntax.html#show-character-set" title="13.5.4.1. SHOW CHARACTER SET Syntax">Section 13.5.4.1, “<code class="literal">SHOW CHARACTER SET</code> Syntax”</a>.
      </p><p>
        The output from <code class="literal">SHOW COLLATION</code> includes all
        available character sets. It takes an optional
        <code class="literal">LIKE</code> clause that indicates which collation
        names to match. For example:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW COLLATION LIKE 'latin1%';</code></strong>
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       0 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci | latin1  | 94 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+
</pre><p>
        See <a href="sql-syntax.html#show-collation" title="13.5.4.2. SHOW COLLATION Syntax">Section 13.5.4.2, “<code class="literal">SHOW COLLATION</code> Syntax”</a>.
      </p><p>
        <code class="literal">SHOW CREATE DATABASE</code> displays the
        <code class="literal">CREATE DATABASE</code> statement that creates a
        given database. The result includes all database options.
        <code class="literal">DEFAULT CHARACTER SET</code> and
        <code class="literal">COLLATE</code> are supported. All database options
        are stored in a text file named <code class="filename">db.opt</code> that
        can be found in the database directory.
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW CREATE DATABASE test;</code></strong>
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| test     | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
</pre><p>
        See <a href="sql-syntax.html#show-create-database" title="13.5.4.4. SHOW CREATE DATABASE Syntax">Section 13.5.4.4, “<code class="literal">SHOW CREATE DATABASE</code> Syntax”</a>.
      </p><p>
        <code class="literal">SHOW CREATE TABLE</code> is similar, but displays
        the <code class="literal">CREATE TABLE</code> statement to create a given
        table. The column definitions indicate any character set
        specifications, and the table options include character set
        information.
      </p><p>
        See <a href="sql-syntax.html#show-create-table" title="13.5.4.5. SHOW CREATE TABLE Syntax">Section 13.5.4.5, “<code class="literal">SHOW CREATE TABLE</code> Syntax”</a>.
      </p><p>
        The <code class="literal">SHOW COLUMNS</code> statement displays the
        collations of a table's columns when invoked as <code class="literal">SHOW
        FULL COLUMNS</code>. Columns with <code class="literal">CHAR</code>,
        <code class="literal">VARCHAR</code>, or <code class="literal">TEXT</code> data
        types have non-<code class="literal">NULL</code> collations. Numeric and
        other non-character types have <code class="literal">NULL</code>
        collations. For example:
      </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW FULL COLUMNS FROM person\G</code></strong>
*************************** 1. row ***************************
     Field: id
      Type: smallint(5) unsigned
 Collation: NULL
      Null: NO
       Key: PRI
   Default: NULL
     Extra: auto_increment
Privileges: select,insert,update,references
   Comment:
*************************** 2. row ***************************
     Field: name
      Type: char(60)
 Collation: latin1_swedish_ci
      Null: NO
       Key:
   Default:
     Extra:
Privileges: select,insert,update,references
   Comment:

</pre><p>
        The character set is not part of the display. (The character set
        name is implied by the collation name.)
      </p><p>
        See <a href="sql-syntax.html#show-columns" title="13.5.4.3. SHOW COLUMNS Syntax">Section 13.5.4.3, “<code class="literal">SHOW COLUMNS</code> Syntax”</a>.
      </p></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-unicode"></a>10.5. Unicode Support</h2></div></div></div><p>
      MySQL 5.0 supports two character sets for storing
      Unicode data:
    </p><div class="itemizedlist"><ul type="disc"><li><p>
          <code class="literal">ucs2</code>, the UCS-2 Unicode character set.
        </p></li><li><p>
          <code class="literal">utf8</code>, the UTF8 encoding of the Unicode
          character set.
        </p></li></ul></div><p>
      In UCS-2 (binary Unicode representation), every character is
      represented by a two-byte Unicode code with the most significant
      byte first. For example: "LATIN CAPITAL LETTER A" has the code
      0x0041 and it's stored as a two-byte sequence: 0x00 0x41.
      "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as a
      two-byte sequence: 0x04 0x4B. For Unicode characters and their
      codes, please refer to the
      <a href="http://www.unicode.org/" target="_top">Unicode Home Page</a>.
    </p><p>
      Currently, UCS-2 cannot yet be used as a client character set,
      which means that <code class="literal">SET NAMES 'ucs2'</code> does not
      work.
    </p><p>
      The UTF8 character set (transform Unicode representation) is an
      alternative way to store Unicode data. It is implemented according
      to RFC 3629. The idea of the UTF8 character set is that various
      Unicode characters are encoded using byte sequences of different
      lengths:
    </p><div class="itemizedlist"><ul type="disc"><li><p>
          Basic Latin letters, digits, and punctuation signs use one
          byte.
        </p></li><li><p>
          Most European and Middle East script letters fit into a
          two-byte sequence: extended Latin letters (with tilde, macron,
          acute, grave and other accents), Cyrillic, Greek, Armenian,
          Hebrew, Arabic, Syriac, and others.
        </p></li><li><p>
          Korean, Chinese, and Japanese ideographs use three-byte
          sequences.
        </p></li></ul></div><p>
      RFC 3629 describes encoding sequences that take from one to four
      bytes. Currently, MySQL UTF8 support does not include four-byte
      sequences. (An older standard for UTF8 encoding is given by RFC
      2279, which describes UTF8 sequences that take from one to six
      bytes. RFC 3629 renders RFC 2279 obsolete; for this reason,
      sequences with five and six bytes are no longer used.)
    </p><p>
      <span class="bold"><strong>Tip</strong></span>: To save space with UTF8, use
      <code class="literal">VARCHAR</code> instead of <code class="literal">CHAR</code>.
      Otherwise, MySQL has to reserve 30 bytes for a <code class="literal">CHAR(10)
      CHARACTER SET utf8</code> column, because this is the maximum
      possible length.
    </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-metadata"></a>10.6. UTF8 for Metadata</h2></div></div></div><p>
      <em class="firstterm">Metadata</em> is “<span class="quote">the data about the
      data</span>”. Anything that <span class="emphasis"><em>describes</em></span> the
      database — as opposed to being the
      <span class="emphasis"><em>contents</em></span> of the database — is metadata.
      Thus column names, database names, usernames, version names, and
      most of the string results from <code class="literal">SHOW</code> are
      metadata. This is also true of the contents of tables in the
      <code class="literal">INFORMATION_SCHEMA</code> database, because those
      tables by definition contain information about database objects.
    </p><p>
      Representation of metadata must satisfy these requirements:
    </p><div class="itemizedlist"><ul type="disc"><li><p>
          All metadata must be in the same character set. Otherwise,
          neither the <code class="literal">SHOW</code> commands nor
          <code class="literal">SELECT</code> queries against tables in the
          <code class="literal">INFORMATION_SCHEMA</code> database would work
          properly because different rows in the same column of the
          results of these operations would be in different character
          sets.
        </p></li><li><p>
          Metadata must include all characters in all languages.
          Otherwise, users wouldn't be able to name columns and tables
          in their own languages.
        </p></li></ul></div><p>
      In order to satisfy both requirements, MySQL stores metadata in a
      Unicode character set, namely UTF8. This does not cause any
      disruption if you never use accented characters. But if you do,
      you should be aware that metadata is in UTF8.
    </p><p>
      This means that the return values of the
      <code class="literal">USER()</code>, <code class="literal">CURRENT_USER()</code>,
      <code class="literal">DATABASE()</code>, and <code class="literal">VERSION()</code>
      functions have the UTF8 character set by default, as do synonyms
      such as <code class="literal">SESSION_USER()</code> and
      <code class="literal">SYSTEM_USER()</code>.
    </p><p>
      The server sets the <code class="literal">character_set_system</code> system
      variable to the name of the metadata character set:
    </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW VARIABLES LIKE 'character_set_system';</code></strong>
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_system | utf8  |
+----------------------+-------+
</pre><p>
      Storage of metadata using Unicode does <span class="emphasis"><em>not</em></span>
      mean that the headers of columns and the results of
      <code class="literal">DESCRIBE</code> functions are in the
      <code class="literal">character_set_system</code> character set by default.
      When you use <code class="literal">SELECT column1 FROM t</code>, the name
      <code class="literal">column1</code> itself is returned from the server to
      the client in the character set as determined by the <code class="literal">SET
      NAMES</code> statement. More specifically, the character set
      used is determined by the value of the
      <code class="literal">character_set_results</code> system variable. If this
      variable is set to <code class="literal">NULL</code>, no conversion is
      performed and the server returns metadata using its original
      character set (the set indicated by
      <code class="literal">character_set_system</code>).
    </p><p>
      If you want the server to pass metadata results back in a non-UTF8
      character set, then use <code class="literal">SET NAMES</code> to force the
      server to perform character set conversion (see
      <a href="charset.html#charset-connection" title="10.3.6. Connection Character Sets and Collations">Section 10.3.6, “Connection Character Sets and Collations”</a>), or else cause the client to
      perform the conversion. It is more efficient to have the client
      perform the conversion, but this option is not always available
      for all clients.
    </p><p>
      If you are using (for example) the <code class="literal">USER()</code>
      function for comparison or assignment within a single statement,
      don't worry. MySQL performs some automatic conversion for you.
    </p><pre class="programlisting">SELECT * FROM Table1 WHERE USER() = latin1_column;
</pre><p>
      This works because the contents of
      <code class="literal">latin1_column</code> are automatically converted to
      UTF8 before the comparison.
    </p><pre class="programlisting">INSERT INTO Table1 (latin1_column) SELECT USER();
</pre><p>
      This works because the contents of <code class="literal">USER()</code> are
      automatically converted to <code class="literal">latin1</code> before the
      assignment. Automatic conversion is not fully implemented yet, but
      should work correctly in a later version.
    </p><p>
      Although automatic conversion is not in the SQL standard, the SQL
      standard document does say that every character set is (in terms
      of supported characters) a “<span class="quote">subset</span>” of Unicode. Since
      it is a well-known principle that “<span class="quote">what applies to a
      superset can apply to a subset</span>”, we believe that a
      collation for Unicode can apply for comparisons with non-Unicode
      strings.
    </p><p>
      <span class="bold"><strong>Note</strong></span>: In MySQL 5.0,
      the <code class="filename">errmsg.txt</code> files all use UTF8. Conversion
      to the client character set is automatic, as with metadata.
    </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-compatibility"></a>10.7. Compatibility with Other DBMSs</h2></div></div></div><p>
      For MaxDB compatibility these two statements are the same:
    </p><pre class="programlisting">CREATE TABLE t1 (f1 CHAR(<em class="replaceable"><code>n</code></em>) UNICODE);
CREATE TABLE t1 (f1 CHAR(<em class="replaceable"><code>n</code></em>) CHARACTER SET ucs2);
</pre></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-config-file"></a>10.8. New Character Set Configuration File Format</h2></div></div></div><p>
      Character set configuration is stored in XML files, one file per
      character set.
    </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-national"></a>10.9. National Character Set</h2></div></div></div><p>
      ANSI SQL defines <code class="literal">NCHAR</code> or <code class="literal">NATIONAL
      CHAR</code> as a way to indicate that a <code class="literal">CHAR</code>
      column should use some predefined character set. MySQL
      5.0 uses <code class="literal">utf8</code> as this predefined
      character set. For example, these column type declarations are
      equivalent:
    </p><pre class="programlisting">CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)
</pre><p>
      As are these:
    </p><pre class="programlisting">VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)
</pre><p>
      You can use
      <code class="literal">N'<em class="replaceable"><code>literal</code></em>'</code> to create
      a string in the national character set. These two statements are
      equivalent:
    </p><pre class="programlisting">SELECT N'some text';
SELECT _utf8'some text';
</pre><p>
      For information on upgrading character sets to MySQL
      5.0 from versions prior to 4.1, see the
      <em class="citetitle">MySQL 4.1 Reference Manual</em>.
    </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="charset-charsets"></a>10.10. Character Sets and Collations That MySQL Supports</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="charset.html#charset-unicode-sets">10.10.1. Unicode Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-we-sets">10.10.2. West European Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-ce-sets">10.10.3. Central European Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-se-me-sets">10.10.4. South European and Middle East Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-baltic-sets">10.10.5. Baltic Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-cyrillic-sets">10.10.6. Cyrillic Character Sets</a></span></dt><dt><span class="section"><a href="charset.html#charset-asian-sets">10.10.7. Asian Character Sets</a></span></dt></dl></div><p>
      MySQL supports 70+ collations for 30+ character sets. The
      character sets and their default collations are displayed by the
      <code class="literal">SHOW CHARACTER SET</code> statement:
    </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW CHARACTER SET;</code></strong>
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
</pre><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-unicode-sets"></a>10.10.1. Unicode Character Sets</h3></div></div></div><p>
        MySQL has two Unicode character sets. You can store text in
        about 650 languages using these character sets.
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">ucs2</code> (UCS-2 Unicode) collations:
          </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW COLLATION LIKE 'ucs2%';</code></strong>
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| ucs2_general_ci    | ucs2    |  35 | Yes     | Yes      |       1 |
| ucs2_bin           | ucs2    |  90 |         | Yes      |       1 |
| ucs2_unicode_ci    | ucs2    | 128 |         | Yes      |       8 |
| ucs2_icelandic_ci  | ucs2    | 129 |         | Yes      |       8 |
| ucs2_latvian_ci    | ucs2    | 130 |         | Yes      |       8 |
| ucs2_romanian_ci   | ucs2    | 131 |         | Yes      |       8 |
| ucs2_slovenian_ci  | ucs2    | 132 |         | Yes      |       8 |
| ucs2_polish_ci     | ucs2    | 133 |         | Yes      |       8 |
| ucs2_estonian_ci   | ucs2    | 134 |         | Yes      |       8 |
| ucs2_spanish_ci    | ucs2    | 135 |         | Yes      |       8 |
| ucs2_swedish_ci    | ucs2    | 136 |         | Yes      |       8 |
| ucs2_turkish_ci    | ucs2    | 137 |         | Yes      |       8 |
| ucs2_czech_ci      | ucs2    | 138 |         | Yes      |       8 |
| ucs2_danish_ci     | ucs2    | 139 |         | Yes      |       8 |
| ucs2_lithuanian_ci | ucs2    | 140 |         | Yes      |       8 |
| ucs2_slovak_ci     | ucs2    | 141 |         | Yes      |       8 |
| ucs2_spanish2_ci   | ucs2    | 142 |         | Yes      |       8 |
| ucs2_roman_ci      | ucs2    | 143 |         | Yes      |       8 |
| ucs2_persian_ci    | ucs2    | 144 |         | Yes      |       8 |
| ucs2_esperanto_ci  | ucs2    | 145 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
</pre></li><li><p>
            <code class="literal">utf8</code> (UTF-8 Unicode) collations:
          </p><pre class="programlisting">mysql&gt; <strong class="userinput"><code>SHOW COLLATION LIKE 'utf8%';</code></strong>
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin           | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |
| utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |
| utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |
| utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |
| utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |
| utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |
| utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |
| utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |
| utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |
| utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |
| utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |
| utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |
| utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |
| utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |
| utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |
| utf8_esperanto_ci  | utf8    | 209 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
</pre></li></ul></div><p>
        <span class="bold"><strong>Note</strong></span>: Support for the
        <code class="literal">ucs2_esperanto_ci</code> and
        <code class="literal">utf8_esperanto_ci</code> collations was added in
        MySQL 5.0.13.
      </p><a class="indexterm" name="id2908517"></a><p>
        The <code class="literal">utf8_unicode_ci</code> collation is implemented
        according to the Unicode Collation Algorithm (UCA) described at
        <a href="http://www.unicode.org/reports/tr10/" target="_top">http://www.unicode.org/reports/tr10/</a>. The
        collation uses the version-4.0.0 UCA weight keys:
        <a href="http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt" target="_top">http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt</a>.
        (The following discussion uses
        <code class="literal">utf8_unicode_ci</code>, but it is also true for
        <code class="literal">ucs2_unicode_ci</code>.)
      </p><p>
        Currently, the <code class="literal">utf8_unicode_ci</code> collation has
        only partial support for the Unicode Collation Algorithm. Some
        characters are not supported yet. Also, combining marks are not
        fully supported. This affects primarily Vietnamese and some
        minority languages in Russia such as Udmurt, Tatar, Bashkir, and
        Mari.
      </p><p>
        The most significant feature in
        <code class="literal">utf8_unicode_ci</code> is that it supports
        expansions, that is, when one character compares as equal to
        combinations of other characters. For example, in German and
        some other languages ‘<code class="literal">ß</code>’ is
        equal to ‘<code class="literal">ss</code>’.
      </p><p>
        <code class="literal">utf8_general_ci</code> is a legacy collation that
        does not support expansions. It can make only one-to-one
        comparisons between characters. This means that comparisons for
        the <code class="literal">utf8_general_ci</code> collation are faster, but
        slightly less correct, than comparisons for
        <code class="literal">utf8_unicode_ci</code>).
      </p><p>
        For example, the following equalities hold in both
        <code class="literal">utf8_general_ci</code> and
        <code class="literal">utf8_unicode_ci</code>:
      </p><pre class="programlisting">Ä = A
Ö = O
Ü = U
</pre><p>
        A difference between the collations is that this is true for
        <code class="literal">utf8_general_ci</code>:
      </p><pre class="programlisting">ß = s
</pre><p>
        Whereas this is true for <code class="literal">utf8_unicode_ci</code>:
      </p><pre class="programlisting">ß = ss
</pre><p>
        Language-specific collations for the <code class="literal">utf8</code>
        character set are implemented only if the ordering with
        <code class="literal">utf8_unicode_ci</code> does not work well for a
        language. For example, <code class="literal">utf8_unicode_ci</code> works
        fine for German and French, so there is no need to create
        special <code class="literal">utf8</code> collations for these two
        languages.
      </p><p>
        <code class="literal">utf8_general_ci</code> is also satisfactory for both
        German and French, except that
        ‘<code class="literal">ß</code>’ is equal to
        ‘<code class="literal">s</code>’, and not to
        ‘<code class="literal">ss</code>’. If this is acceptable for
        your application, then you should use
        <code class="literal">utf8_general_ci</code> because it is faster.
        Otherwise, use <code class="literal">utf8_unicode_ci</code> because it is
        more accurate.
      </p><p>
        <code class="literal">utf8_swedish_ci</code>, like other
        <code class="literal">utf8</code> language-specific collations, is derived
        from <code class="literal">utf8_unicode_ci</code> with additional language
        rules. For example, in Swedish, the following relationship
        holds, which is not something expected by a German or French
        speaker:
      </p><pre class="programlisting">Ü = Y &lt; Ö
</pre><p>
        The <code class="literal">utf8_spanish_ci</code> and
        <code class="literal">utf8_spanish2_ci</code> collations correspond to
        modern Spanish and traditional Spanish, respectively. In both
        collations, ‘<code class="literal">ñ</code>’ (n-tilde) is a
        separate letter between ‘<code class="literal">n</code>’ and
        ‘<code class="literal">o</code>’. In addition, for traditional
        Spanish, ‘<code class="literal">ch</code>’ is a separate
        letter between ‘<code class="literal">c</code>’ and
        <code class="literal">d</code>, and ‘<code class="literal">ll</code>’ is
        a separate letter between ‘<code class="literal">l</code>’ and
        ‘<code class="literal">m</code>’
      </p></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-we-sets"></a>10.10.2. West European Character Sets</h3></div></div></div><p>
        Western European character sets cover most West European
        languages, such as French, Spanish, Catalan, Basque, Portuguese,
        Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian,
        Finnish, Faroese, Icelandic, Irish, Scottish, and English.
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">ascii</code> (US ASCII) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">ascii_bin</code>
              </p></li><li><p>
                <code class="literal">ascii_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">cp850</code> (DOS West European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp850_bin</code>
              </p></li><li><p>
                <code class="literal">cp850_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">dec8</code> (DEC Western European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">dec8_bin</code>
              </p></li><li><p>
                <code class="literal">dec8_swedish_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">hp8</code> (HP Western European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">hp8_bin</code>
              </p></li><li><p>
                <code class="literal">hp8_english_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">latin1</code> (cp1252 West European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">latin1_bin</code>
              </p></li><li><p>
                <code class="literal">latin1_danish_ci</code>
              </p></li><li><p>
                <code class="literal">latin1_general_ci</code>
              </p></li><li><p>
                <code class="literal">latin1_general_cs</code>
              </p></li><li><p>
                <code class="literal">latin1_german1_ci</code>
              </p></li><li><p>
                <code class="literal">latin1_german2_ci</code>
              </p></li><li><p>
                <code class="literal">latin1_spanish_ci</code>
              </p></li><li><p>
                <code class="literal">latin1_swedish_ci</code> (default)
              </p></li></ul></div><p>
            <code class="literal">latin1</code> is the default character set. The
            <code class="literal">latin1_swedish_ci</code> collation is the
            default that probably is used by the majority of MySQL
            customers. While it is frequently said that it is based on
            the Swedish/Finnish collation rules, there are Swedes and
            Finns who disagree with this statement.
          </p><p>
            The <code class="literal">latin1_german1_ci</code> and
            <code class="literal">latin1_german2_ci</code> collations are based on
            the DIN-1 and DIN-2 standards, where DIN stands for
            <span class="foreignphrase"><em class="foreignphrase">Deutsches Institut für
            Normung</em></span> (the German equivalent of ANSI).
            DIN-1 is called the “<span class="quote">dictionary collation</span>” and
            DIN-2 is called the “<span class="quote">phone book collation</span>”.
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">latin1_german1_ci</code> (dictionary) rules:
              </p><pre class="programlisting">Ä = A
Ö = O
Ü = U
ß = s
</pre></li><li><p>
                <code class="literal">latin1_german2_ci</code> (phone-book) rules:
              </p><pre class="programlisting">Ä = AE
Ö = OE
Ü = UE
ß = ss
</pre></li></ul></div><p>
            In the <code class="literal">latin1_spanish_ci</code> collation,
            ‘<code class="literal">ñ</code>’ (n-tilde) is a separate
            letter between ‘<code class="literal">n</code>’ and
            ‘<code class="literal">o</code>’.
          </p></li><li><p>
            <code class="literal">macroman</code> (Mac West European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">macroman_bin</code>
              </p></li><li><p>
                <code class="literal">macroman_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">swe7</code> (7bit Swedish) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">swe7_bin</code>
              </p></li><li><p>
                <code class="literal">swe7_swedish_ci</code> (default)
              </p></li></ul></div></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-ce-sets"></a>10.10.3. Central European Character Sets</h3></div></div></div><p>
        We also provide some support for character sets used in the
        Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia,
        and Poland.
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">cp1250</code> (Windows Central European)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp1250_bin</code>
              </p></li><li><p>
                <code class="literal">cp1250_croatian_ci</code>
              </p></li><li><p>
                <code class="literal">cp1250_czech_cs</code>
              </p></li><li><p>
                <code class="literal">cp1250_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">cp852</code> (DOS Central European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp852_bin</code>
              </p></li><li><p>
                <code class="literal">cp852_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">keybcs2</code> (DOS Kamenicky Czech-Slovak)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">keybcs2_bin</code>
              </p></li><li><p>
                <code class="literal">keybcs2_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">latin2</code> (ISO 8859-2 Central European)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">latin2_bin</code>
              </p></li><li><p>
                <code class="literal">latin2_croatian_ci</code>
              </p></li><li><p>
                <code class="literal">latin2_czech_cs</code>
              </p></li><li><p>
                <code class="literal">latin2_general_ci</code> (default)
              </p></li><li><p>
                <code class="literal">latin2_hungarian_ci</code>
              </p></li></ul></div></li><li><p>
            <code class="literal">macce</code> (Mac Central European) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">macce_bin</code>
              </p></li><li><p>
                <code class="literal">macce_general_ci</code> (default)
              </p></li></ul></div></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-se-me-sets"></a>10.10.4. South European and Middle East Character Sets</h3></div></div></div><p>
        South European and Middle Eastern character sets supported by
        MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and
        Turkish:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">armscii8</code> (ARMSCII-8 Armenian) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">armscii8_bin</code>
              </p></li><li><p>
                <code class="literal">armscii8_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">cp1256</code> (Windows Arabic) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp1256_bin</code>
              </p></li><li><p>
                <code class="literal">cp1256_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">geostd8</code> (GEOSTD8 Georgian) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">geostd8_bin</code>
              </p></li><li><p>
                <code class="literal">geostd8_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">greek</code> (ISO 8859-7 Greek) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">greek_bin</code>
              </p></li><li><p>
                <code class="literal">greek_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">hebrew</code> (ISO 8859-8 Hebrew) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">hebrew_bin</code>
              </p></li><li><p>
                <code class="literal">hebrew_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">latin5</code> (ISO 8859-9 Turkish) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">latin5_bin</code>
              </p></li><li><p>
                <code class="literal">latin5_turkish_ci</code> (default)
              </p></li></ul></div></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-baltic-sets"></a>10.10.5. Baltic Character Sets</h3></div></div></div><p>
        The Baltic character sets cover Estonian, Latvian, and
        Lithuanian languages. There are two Baltic character sets
        currently supported:
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">cp1257</code> (Windows Baltic) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp1257_bin</code>
              </p></li><li><p>
                <code class="literal">cp1257_general_ci</code> (default)
              </p></li><li><p>
                <code class="literal">cp1257_lithuanian_ci</code>
              </p></li></ul></div></li><li><p>
            <code class="literal">latin7</code> (ISO 8859-13 Baltic) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">latin7_bin</code>
              </p></li><li><p>
                <code class="literal">latin7_estonian_cs</code>
              </p></li><li><p>
                <code class="literal">latin7_general_ci</code> (default)
              </p></li><li><p>
                <code class="literal">latin7_general_cs</code>
              </p></li></ul></div></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-cyrillic-sets"></a>10.10.6. Cyrillic Character Sets</h3></div></div></div><p>
        Here are the Cyrillic character sets and collations for use with
        Belarusian, Bulgarian, Russian, and Ukrainian languages.
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">cp1251</code> (Windows Cyrillic) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp1251_bin</code>
              </p></li><li><p>
                <code class="literal">cp1251_bulgarian_ci</code>
              </p></li><li><p>
                <code class="literal">cp1251_general_ci</code> (default)
              </p></li><li><p>
                <code class="literal">cp1251_general_cs</code>
              </p></li><li><p>
                <code class="literal">cp1251_ukrainian_ci</code>
              </p></li></ul></div></li><li><p>
            <code class="literal">cp866</code> (DOS Russian) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp866_bin</code>
              </p></li><li><p>
                <code class="literal">cp866_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">koi8r</code> (KOI8-R Relcom Russian) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">koi8r_bin</code>
              </p></li><li><p>
                <code class="literal">koi8r_general_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">koi8u</code> (KOI8-U Ukrainian) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">koi8u_bin</code>
              </p></li><li><p>
                <code class="literal">koi8u_general_ci</code> (default)
              </p></li></ul></div></li></ul></div></div><div class="section" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charset-asian-sets"></a>10.10.7. Asian Character Sets</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="charset.html#charset-cp932">10.10.7.1. The <code class="literal">cp932</code> Character Set</a></span></dt></dl></div><p>
        The Asian character sets that we support include Chinese,
        Japanese, Korean, and Thai. These can be complicated. For
        example, the Chinese sets must allow for thousands of different
        characters.
      </p><div class="itemizedlist"><ul type="disc"><li><p>
            <code class="literal">big5</code> (Big5 Traditional Chinese)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">big5_bin</code>
              </p></li><li><p>
                <code class="literal">big5_chinese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">cp932</code> (SJIS for Windows Japanese)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">cp932_bin</code>
              </p></li><li><p>
                <code class="literal">cp932_japanese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">eucjpms</code> (UJIS for Windows Japanese)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">eucjpms_bin</code>
              </p></li><li><p>
                <code class="literal">eucjpms_japanese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">euckr</code> (EUC-KR Korean) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">euckr_bin</code>
              </p></li><li><p>
                <code class="literal">euckr_korean_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">gb2312</code> (GB2312 Simplified Chinese)
            collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">gb2312_bin</code>
              </p></li><li><p>
                <code class="literal">gb2312_chinese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">gbk</code> (GBK Simplified Chinese) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">gbk_bin</code>
              </p></li><li><p>
                <code class="literal">gbk_chinese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">sjis</code> (Shift-JIS Japanese) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">sjis_bin</code>
              </p></li><li><p>
                <code class="literal">sjis_japanese_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">tis620</code> (TIS620 Thai) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">tis620_bin</code>
              </p></li><li><p>
                <code class="literal">tis620_thai_ci</code> (default)
              </p></li></ul></div></li><li><p>
            <code class="literal">ujis</code> (EUC-JP Japanese) collations:
          </p><div class="itemizedlist"><ul type="circle"><li><p>
                <code class="literal">ujis_bin</code>
              </p></li><li><p>
                <code class="literal">ujis_japanese_ci</code> (default)
              </p></li></ul></div></li></ul></div><div class="section" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="charset-cp932"></a>10.10.7.1. The <code class="literal">cp932</code> Character Set</h4></div></div></div><p>
          <span class="bold"><strong>Why is <code class="literal">cp932</code>
          needed?</strong></span>
        </p><p>
          In MySQL, the <code class="literal">sjis</code> character set
          corresponds to the <code class="literal">Shift_JIS</code> character set
          defined by IANA, which supports JIS X0201 and JIS X0208
          characters. (See
          <a href="http://www.iana.org/assignments/character-sets" target="_top">http://www.iana.org/assignments/character-sets</a>.)
        </p><p>
          However, the meaning of “<span class="quote">SHIFT JIS</span>” as a
          descriptive term has become very vague and it often includes
          the extensions to <code class="literal">Shift_JIS</code> that are
          defined by various venders.
        </p><p>
          For example, “<span class="quote">SHIFT JIS</span>” used in Japanese Windows
          environments is a Microsoft extension of
          <code class="literal">Shift_JIS</code> and its exact name is
          <code class="literal">Microsoft Windows Codepage : 932</code> or
          <code class="literal">cp932</code>. In addition to the characters
          supported by <code class="literal">Shift_JIS</code>,
          <code class="literal">cp932</code> supports extension characters such as
          NEC special characters, NEC selected — IBM extended
          characters, and IBM extended characters.
        </p><p>
          Many Japanese users have experienced problems using these
          extension characters. These problems stem from the following
          factors:
        </p><div class="itemizedlist"><ul type="disc"><li><p>
              MySQL automatically converts character sets.
            </p></li><li><p>
              Character sets are converted via Unicode
              (<code class="literal">ucs2</code>).
            </p></li><li><p>
              The <code class="literal">sjis</code> character set does not support
              the conversion of these extension characters.
            </p></li><li><p>
              There are several conversion rules from so-called
              “<span class="quote">SHIFT JIS</span>” to Unicode, and some characters
              are converted to Unicode differently depending on the
              conversion rule. MySQL supports only one of these rules
              (described later).
            </p></li></ul></div><p>
          The MySQL <code class="literal">cp932</code> character set is designed
          to solve these problems. It is available as of MySQL 5.0.3.
        </p><p>
          Because MySQL supports character set conversion, it is
          important to separate IANA <code class="literal">Shift_JIS</code> and
          <code class="literal">cp932</code> into two different character sets
          because they provide different conversion rules.
        </p><p>
          <span class="bold"><strong>How does <code class="literal">cp932</code> differ
          from <code class="literal">sjis</code>?</strong></span>
        </p><p>
          The <code class="literal">cp932</code> character set differs from
          <code class="literal">sjis</code> in the following ways:
        </p><div class="itemizedlist"><ul type="disc"><li><p>
              <code class="literal">cp932</code> supports NEC special characters,
              NEC selected — IBM extended characters, and IBM
              selected characters.
            </p></li><li><p>
              Some <code class="literal">cp932</code> characters have two
              different code points, both of which convert to the same
              Unicode code point. So, when converting from Unicode back
              to <code class="literal">cp932</code>, one of the code points must
              be selected. For this “<span class="quote">round trip
              conversion,</span>” the rule recommended by Microsoft is
              used. (See
              <a href="http://support.microsoft.com/kb/170559/EN-US/" target="_top">http://support.microsoft.com/kb/170559/EN-US/</a>.)
            </p><p>
              The conversion rule works like this:
            </p><div class="itemizedlist"><ul type="circle"><li><p>
                  If the character is in both JIS X 0208 and NEC special
                  characters, use the code point of JIS X 0208.
                </p></li><li><p>
                  If the character is in both NEC special characters and
                  IBM selected characters, use the code point of NEC
                  special characters.
                </p></li><li><p>
                  If the character is in both IBM selected characters
                  and NEC selected — IBM extended characters, use
                  the code point of IBM extended characters.
                </p></li></ul></div><p>
              Information about the Unicode values of
              <code class="literal">cp932</code> characters is given in the table
              shown at
              <a href="http://www.microsoft.com/globaldev/reference/dbcs/932.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932.htm</a>.
              For <code class="literal">cp932</code> table entries with characters
              under which a four-digit number appears, the number
              represents the corresponding Unicode
              (<code class="literal">ucs2</code>) encoding. For table entries with
              an underlined two-digit value appears, there is a range of
              <code class="literal">cp932</code> character values that begin with
              those two digits. Clicking such a table entry takes you to
              a page that displays the Unicode value for each of the
              <code class="literal">cp932</code> characters that begin with those
              digits.
            </p><p>
              The following links are of special interest. They
              correspond to the encodings for the following sets of
              characters:
            </p><div class="itemizedlist"><ul type="circle"><li><p>
                  NEC special characters:
                </p><pre class="programlisting"><a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm</a>
</pre></li><li><p>
                  NEC selected — IBM extended characters:
                </p><pre class="programlisting"><a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm</a>
<a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm</a>
</pre></li><li><p>
                  IBM selected characters:
                </p><pre class="programlisting"><a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm</a>
<a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm</a>
<a href="http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm" target="_top">http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm</a>
</pre></li></ul></div></li><li><p>
              Starting from version 5.0.3, <code class="literal">cp932</code>
              supports conversion of user-defined characters in
              combination with <code class="literal">eucjpms</code>, and solves
              the problems with
              <code class="literal">sjis</code>/<code class="literal">ujis</code>
              conversion. For details, please refer to
              <a href="http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html" target="_top">http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html</a>.
            </p></li><li><p>
              For some characters, conversion to and from
              <code class="literal">ucs2</code> is different for
              <code class="literal">sjis</code> and <code class="literal">cp932</code>. The
              following tables illustrate these differences.
            </p><p>
              Conversion to <code class="literal">ucs2</code>:
            </p><div class="informaltable"><table border="1"><colgroup><col><col><col></colgroup><tbody><tr><td><span class="bold"><strong><code class="literal">sjis</code>/<code class="literal">cp932</code>
                      Value</strong></span></td><td><span class="bold"><strong><code class="literal">sjis</code> →
                      <code class="literal">ucs2</code> Conversion</strong></span></td><td><span class="bold"><strong><code class="literal">cp932</code> →
                      <code class="literal">ucs2</code> Conversion</strong></span></td></tr><tr><td>5C</td><td>005C</td><td>005C</td></tr><tr><td>7E</td><td>007E</td><td>007E</td></tr><tr><td>815C</td><td>2015</td><td>2015</td></tr><tr><td>815F</td><td>005C</td><td>FF3C</td></tr><tr><td>8160</td><td>301C</td><td>FF5E</td></tr><tr><td>8161</td><td>2016</td><td>2225</td></tr><tr><td>817C</td><td>2212</td><td>FF0D</td></tr><tr><td>8191</td><td>00A2</td><td>FFE0</td></tr><tr><td>8192</td><td>00A3</td><td>FFE1</td></tr><tr><td>81CA</td><td>00AC</td><td>FFE2</td></tr></tbody></table></div><p>
              Conversion from <code class="literal">ucs2</code>:
            </p><div class="informaltable"><table border="1"><colgroup><col><col><col></colgroup><tbody><tr><td><span class="bold"><strong><code class="literal">ucs2</code> value</strong></span></td><td><span class="bold"><strong><code class="literal">ucs2</code> →
                      <code class="literal">sjis</code> Conversion</strong></span></td><td><span class="bold"><strong><code class="literal">ucs2</code> →
                      <code class="literal">cp932</code> Conversion</strong></span></td></tr><tr><td>005C</td><td>815F</td><td>5C</td></tr><tr><td>007E</td><td>7E</td><td>7E</td></tr><tr><td>00A2</td><td>8191</td><td>3F</td></tr><tr><td>00A3</td><td>8192</td><td>3F</td></tr><tr><td>00AC</td><td>81CA</td><td>3F</td></tr><tr><td>2015</td><td>815C</td><td>815C</td></tr><tr><td>2016</td><td>8161</td><td>3F</td></tr><tr><td>2212</td><td>817C</td><td>3F</td></tr><tr><td>2225</td><td>3F</td><td>8161</td></tr><tr><td>301C</td><td>8160</td><td>3F</td></tr><tr><td>FF0D</td><td>3F</td><td>817C</td></tr><tr><td>FF3C</td><td>3F</td><td>815F</td></tr><tr><td>FF5E</td><td>3F</td><td>8160</td></tr><tr><td>FFE0</td><td>3F</td><td>8191</td></tr><tr><td>FFE1</td><td>3F</td><td>8192</td></tr><tr><td>FFE2</td><td>3F</td><td>81CA</td></tr></tbody></table></div></li></ul></div></div></div></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="language-structure.html">Prev</a> </td><td width="20%" align="center"> </td><td width="40%" align="right"> <a accesskey="n" href="column-types.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 9. Language Structure </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> Chapter 11. Column Types</td></tr></table></div></body></html>