[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

8.3 Determining The Default Character Set And Collation

There are default settings for character sets and collations at four levels: server, database, table, connection. The following description may appear complex, but it's been found in practice that multi-level defaulting leads to natural and obvious results.


8.3.1 Server Character Set and Collation

The MySQL Server has a server character set and a server collation, which may not be null.

MySQL determines the server character set and server collation thus:

At this level, the decision is simple. The server character set and collation depend on the options that you use when you start mysqld. You can use --default-character-set=character_set_name for the character set, and along with it you can add --default-collation=collation_name for the collation. If you don't specify a character set, that is the same as saying --default-character-set=latin1. If you specify only a character set (for instance, latin1) but not a collation, that is the same as saying --default-charset=latin1 --collation=latin1_swedish_ci because latin1_swedish_ci is the default collation for latin1. Therefore the following three commands all have the same effect:

 
shell> mysqld
shell> mysqld --default-character-set=latin1
shell> mysqld --default-character-set=latin1
           --default-collation=latin1_swedish_ci

One way to change the settings is by recompiling. If you want to change the default server character set and collation when building from sources, use: --with-character-set and --with-collation as arguments for configure. For example:

 
shell> ./configure --with-character-set=latin1

or

 
shell> ./configure --with-character-set=latin1
           --with-collation=latin1_german1_ci

Both mysqld and configure check that the character set/collation combination is valid. Each program displays an error message and terminates if the combination is not valid.


8.3.2 Database Character Set and Collation

Every database has a database character set and a database collation, which may not be null. The CREATE DATABASE and ALTER DATABASE commands now have optional clauses for specifying the database character set and collation:

 
CREATE DATABASE db_name
   [CHARACTER SET character_set_name [COLLATE collation_name]]

ALTER DATABASE db_name
    [CHARACTER SET character_set_name [COLLATE collation_name]]

Example:

 
CREATE DATABASE db_name
   CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation thus:

MySQL's CREATE DATABASE ... CHARACTER SET ... syntax is analogous to the standard-SQL CREATE SCHEMA ... CHARACTER SET ... syntax. Because of this, it is possible to create databases with different character sets and collations, on the same MySQL server.

The database character set and collation are used as default values if the table character set and collation are not specified in CREATE TABLE statements. They have no other purpose.


8.3.3 Table Character Set and Collation

Every table has a table character set and a table collation, which may not be null. The CREATE TABLE and ALTER TABLE statements now have optional clauses for specifying the table character set and collation:

 
CREATE TABLE table_name ( column_list )
   [CHARACTER SET character_set_name [COLLATE collation_name]]

ALTER TABLE table_name
   [CHARACTER SET character_set_name] [COLLATE collation_name]

Example:

 
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation thus:

The table character set and collation are used as default values, if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.


8.3.4 Column Character Set and Collation

Every "character" column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation, which may not be null. Column definition syntax now has optional clauses for specifying the column character set and collation:

 
column_name {CHAR | VARCHAR | TEXT} (column_length)
    [CHARACTER SET character_set_name [COLLATE collation_name]]

Example:

 
CREATE TABLE Table1
(
   column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);

MySQL chooses the column character set and collation thus:

The CHARACTER SET and COLLATE clauses are standard SQL.


8.3.5 Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

Example 1: Table + Column Definition

 
CREATE TABLE t1
(
  c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) CHARACTER SET latin2 COLLATE latin2_bin;

Here you have a column with a latin1 character set and a latin1_german1_ci collation. The definition is explicit, so that's straightforward. Notice that there's no problem storing a latin1 column in a latin2 table.

Example 2: Table + Column Definition

 
CREATE TABLE t1
(
   c1 CHAR(10) CHARACTER SET latin1
) CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a latin1 character set and a default collation. Now, although it might seem natural, the default collation is not taken from the table level. Instead, because the default collation for latin1 is always latin1_swedish_ci, column c1 will have a collation of latin1_swedish_ci (not latin1_danish_ci).

Example 3: Table + Column Definition

 
CREATE TABLE t1
(
   c1 CHAR(10)
) CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance, MySQL looks up to the table level for inspiration in determining the column character set and collation. So the character set for column c1 is latin1 and its collation is latin1_danish_ci.

Example 4: Database + Table + Column Definition

 
CREATE DATABASE d1 CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
   c1 CHAR(10)
);

We create a column without specifying its character set and collation. We're also not specifying a character set and a collation at the table level. In this circumstance, MySQL looks up to the database level for inspiration. (The database's settings become the table's settings, and thereafter become the column's setting.) So the character set for column c1 is latin2 and its collation is latin2_czech_ci.


8.3.6 Connection Character Sets and Collations

Every connection has connection character sets and connection collations, which may not be null. There are actually two connection character sets, which we will call "connection/literals" and "connection/results" when it is necessary to distinguish them.

Consider what a "connection" is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions, such as: (a) what character set is the query in when it leaves the client? (b) what character set should the server translate a query to after receiving it? (c) what character set should the server translate to before shipping result sets or error messages back to the client? You can fine-tune the setting for these things, or you can depend on the defaults (in which case, you can skip this section).

There are two statements that affect the connection character sets:

 
SET NAMES character_set_name
SET CHARACTER SET character_set_name

SET NAMES indicates what is in the SQL statement that the client sends. Thus, SET NAMES cp1251 tells the server "future incoming messages from this client will be in character set cp1251" and the server is free to translate to its own character set, if appropriate.

SET CHARACTER SET indicates what is in the SQL statement that the client sends, and also what is in the result set that the server sends back to the client. Thus, SET CHARACTER SET includes SET NAMES, and also specifies what character set the column values will have if, for example, you use a SELECT statement.

EXAMPLE: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do not say SET CHARACTER SET, then for SELECT column1 FROM t the server will send back all the values for column1 using character set latin2. If on the other hand you say SET CHARACTER SET latin1 then the server will, just before sending back, convert the latin2 values to latin1. Such conversion is slow and may be lossy.

When you execute SET NAMES or SET CHARACTER SET, you are also changing the "connection collation". However, the connection collation exists for consistency only. Usually its value doesn't matter.

With the mysql client, it is not necessary to execute SET NAMES every time you start up. You can add the --default-character-set-name option setting to your mysql statement line, or in your option file. For example, the following option file setting will change the connection character set each time you run mysql:

 
[mysql]
default-character-set-name=character_set_name

8.3.7 Character String Literal Character Set and Collation

Every character string literal has a character set and a collation, which may not be null.

A character string literal may have an optional character set introducer and COLLATE clause:

 
[_character_set_name]'string' [COLLATE collation_name]

Examples:

 
SELECT 'string';
SELECT _latin1'string';
SELECT _latin1'string' COLLATE latin1_danish_ci;

The simple statement SELECT 'string' uses the connection/literal character set.

The _character_set_name expression is formally called an introducer. It tells the parser, "the string that is about to follow is in character set X." Because this has confused people in the past, we emphasize that an introducer does not cause any conversion, it is strictly a signal that does not change the string's value. An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), and before ? (parameter substitution when using prepared statements within a programming language interface).

Examples:

 
SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
SELECT _latin1 ?;

MySQL determines a literal's character set and collation thus:

Examples:

Character set introducers and the COLLATE clause are implemented according to standard-SQL specifications.


8.3.8 COLLATE Clause in Various Parts of an SQL Query

With the COLLATE clause you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL queries. Here are some examples:


8.3.9 COLLATE Clause Precedence

The COLLATE clause has high precedence (higher than ||), so the expression

 
x || y COLLATE z

is equivalent to:

 
x || (y COLLATE z)

8.3.10 BINARY Operator

The BINARY operator is a shorthand for a COLLATE clause. For example, BINARY 'x' is equivalent to 'x' COLLATE y, where y is the name of an appropriate binary collation. For example, assuming that column a is of character set latin1, these two queries have the same effect:

 
SELECT * FROM t1 ORDER BY BINARY a;
SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;

Note: Every character set has a binary collation.


8.3.11 Some Special Cases Where the Collation Determination is Tricky

In the great majority of queries, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases it should be clear that the collation will be "the column collation of column x":

 
SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

 
SELECT x FROM T WHERE x = 'Y';

Should this query use the collation of the column x, or of the string literal 'Y'?

Standard SQL resolves such questions using what used to be called "coercibility" rules. The essence is: Because x and 'Y' both have collations, whose collation takes precedence? It's complex, but these rules would take care of most situations:

Those rules resolve ambiguities thus:

Examples:

column1 = 'A'

Use collation of column1

column1 = 'A' COLLATE x

Use collation of 'A'

column1 COLLATE x = 'A' COLLATE y

Error


8.3.12 Collations Must Be for the Right Character Set

Recall that each character set has one or more collations, and each collation is associated with one and only one character set. Therefore, the following statement causes an error message because the latin2_bin collation is not legal with the latin1 character set:

 
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'

8.3.13 An example of the Effect of Collation

Suppose column X in table T has these latin1 column values:

 
Muffler
Müller
MX Systems
MySQL

And suppose that the column values are retrieved using the following statement:

 
SELECT X FROM T ORDER BY X COLLATE collation_name;

The resulting order of the values for different collations is shown in this table:

latin1_swedish_ci

latin1_german1_ci

latin1_german2_ci

Muffler

Muffler

Müller

MX Systems

Müller

Muffler

Müller

MX Systems

MX Systems

MySQL

MySQL

MySQL

The table is an example that shows what the effect would be if we used different collations in an ORDER BY clause. The character that's causing the trouble in this example is the U with two dots over it, which the Germans call U-umlaut, but we'll call it U-diaeresis.

The first column shows the result of the SELECT using the Swedish/Finnish collating rule, which says that U-diaeresis sorts with Y.

The second column shows the result of the SELECT using the German DIN-1 rule, which says that U-diaeresis sorts with U.

The third column shows the result of the SELECT using the German DIN-2 rule, which says that U-diaeresis sorts with UE.

Three different collations, three different results. That's what MySQL is here to handle. By using the appropriate collation, you can choose the sort order you want.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]


Hosting by: Hurra Communications Ltd.
Generated: 2007-01-26 17:58:46