Working With National Language Data in MySQL

January 14th, 2009

Working with localized data is one of the widespread problem our users encounter with. We try to cover the following questions in this article:

Note: In the article examples, cp1251 is set as Windows code page*.

How MySQL server works with different character sets

Server has the following options for working with data with different character sets: character_set_client, character_set_connection, and character_set_results. They are set at the server side when installing MySQL server. Default settings are latin1. The set names command changes all the three options for your session. We will call this option set the connection character set or server settings.

The character_set_client and character_set_connection variables determines the character set the client will send data in.

The character_set_results variable specifies the character set for returning query results.

In addition, the column character set specifies the encoding of stored data. Client usually sends and receives data in the operating system encoding, for example, cp1251 or UTF-8. It is important, that server settings (connection character set) and client character set were the same.

The data is written to the textual column in the following way: server receives the data and considers them to be in the character_set_client character set. If the column character set is the same as the connection character set, the server writes data to the column without modification. If the character sets are different, server tries to convert data from the character_set_client character set to the column character set. It is important that server does not recognize data encoding, it uses only the character set settings.

The data is read from the textual column in the following way: server compares the column character set and the character_set_results character set. If they are the same, it returns the unchanged data, otherwise, it tries to convert data.

The conversion success depends on both character sets and the data. For example, data of any one-byte mysql charset (latin1, cp1251) can be successfully converted to UTF-8. But conversion between one-byte character sets, for example latin1 and cp1251 is possible only if the data letters exist in both of the charsets. Thus, “abcde” can be presented in both latin1 and cp1251 character sets, but Русский текст” cannot be presented in the latin1 character set.

Correct work with character sets

Devart dbForge Studio for MySQL allows you to connect to the server using the operating system encoding or the UTF-8 (Unicode) encoding. Unicode connection provides correct character set conversion without user actions regardless of the MySQL server charset. When working with Unicode connection, connection character set is set to UTF-8. Sent and received data are in the UTF-8 character set.

If dbForge Studio for MySQL uses connection with operating system encoding (for example, cp1251) and server is using latin1 client character set settings, “set names cp1251command can be used to write cyrillic characters to cp1251 and UTF-8 columns. Reading data should be also performed after executing the set names cp1251command. Such data can be read without any additional actions if you use dbForge Studio for MySQL with Unicode connection.

Note: some applications split script in the editor to separate queries, so that “set names” command will be executed successfully but will not affect the following SELECT statement.

If dbForge Studio uses Unicode connection, you can write data to UTF-8 and cp1251 columns without any problems. You should execute “set names cp1251” command to read these data from the connection with operating system encoding.

If dbForge Studio uses Windows encoding connection and server is configured for using cp1251 character set, you can write data to cp1251 columns. It is the widespread case. The server is configured for cp1251 character set when installing, and when connecting from windows with default cp1251 encoding, there are no problems with both Windows encoding and Unicode connections.

Incorrect work with character sets

A widespread case: connection character set (server settings) – latin1, and column character set is latin1. Old server versions that does not support character set setting also work in that way. In this case there are no visible problems with reading and writing data. MySQL server does not convert data, because connection and column character sets are equal, and both latin1 and Windows encoding – cp1251 – are one-byte encodings. The column size is correct in this case. But if the client sets another connection character set with the “set names” command or Use Unicode option, the output of these data may be incorrect.

To avoid this you may clear the Use Unicode option and set the same operating system encoding as in the server settings. To convert the data to the correct form, you need to export them using the same connection character set, that was used when writing these data, and then change the column character set to the character set you need and import data through the Unicode connection.

When actual data encoding is not latin1, writing to the UTF-8 charset column through a latin1 connection, gives the same result as writing to the latin1 charset column through a latin1 connection.

When server settings are latin1, it is not possible to write cyrillic data to the cp1251 charset column.

Unfortunately, if the server settings are latin1, you cannot just execute script with “set names utf8”. If you do so, writing data to any of the cp1251, latin1, or UTF-8 columns will not be successful.

If dbForge Studio uses Unicode connection, you cannot write cyrillic data to latin1 columns. You may execute set names 1251 or latin1, but the data will be written in wrong character set (because of double recoding). To preserve the data, dbForge Studio for MySQL does not allow executing “set names” for any character set except UTF-8 if the Use Unicode option is set.

Usage Tips

  • When installing the new server, if it is known, that it will work with Cyrillic data, you can set default connection character set as cp1251. It gives good compatibility with applications, which work with MySQL. Server will work with data correctly in most cases, when client is run under Windows. This server will always work correctly when client application use UTF-8.
  • You also can create database with default cp1251 character set, to make all columns of all tables in the database to have cp1251 charset by default. Besides, you will have no problems with latin characters even if client Windows encoding is not cp1251
  • When installing the new server, or creating new database, or, most importantly, new table, set UTF-8 character set. However, setting server character set may cause problems with some applications. Some applications may work with UTF-8 data incorrectly. But UTF-8 connection and UTF-8 column character set will provide correct working with data in any encoding.

* Default encoding in Windows 2000 and later can be set on the Advanced tab of the Regional and Language Options. Russian versions of Windows 95/98/Me always use cp1251.

** In dbForge Studio for MySQL and dbForge Fusion for MySQL Unicode connection encoding is set on the Advanced tab of the Database Connection Properties Dialog Box (Use Unicode check box).

3 Responses to “Working With National Language Data in MySQL”

  1. Alain Says:

    The article is clear. But what I get on my Pc is not. I do not get the same result with dbforge and with my PhP application.
    Both dbForge and my app show the same result for the character variable:
    Array ( [Variable_name] => character_set_client [Value] => latin1 )
    Array ( [Variable_name] => character_set_connection [Value] => latin1 )
    Array ( [Variable_name] => character_set_results [Value] => latin1 )
    Array ( [Variable_name] => character_set_server [Value] => utf8 )
    Array ( [Variable_name] => character_set_system [Value] => utf8 )
    Array ( [Variable_name] => character_set_database [Value] => latin1 )
    The echo of a text query is correct on my app (my app file is utf8 encoded)
    but I cannot see it correctly on dbForge queryBuilder neither on its export file. I try the unicode attribute as well as setting the character_set variable without results.
    Please help

  2. .jp Says:

    Check that the connection string in your PHP application has utf8 encoding.

  3. Alain Says:

    Thank you jp. That fix this.

Leave a Comment