
Fix Character Encoding on Your Website

The restaurant website updates were looking good, including the little ñ in jalapeño and the é in sauté.   That is, everything looked good until it was deployed to production.

Question mark characters in web page

Doh!  How do we fix these strange characters showing up in our web page?

First, Some Basic Definitions

  • Code point: A numerical ID of a character in a character set.
  • Character set: More specifically, a coded character set: a set of character symbols and their associated code points.
  • Character encoding: The method of mapping the code points of a character set to the bytes used to store or transmit them. You can dig deeper into character encoding on Wikipedia.
  • Collation: A set of rules for comparing characters in a character set. For example, if we want A = a, then we use a case-insensitive collation, such as that defined in MySQL as utf8_general_ci (ci = case-insensitive).
  • UTF-8: Unicode Transformation Format, 8-bit. A character encoding that uses one to four 8-bit units (bytes) per character.  It can represent every Unicode code point, which covers nearly every writing system in use around the globe.
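To make the difference between a code point and its encoding concrete, here is a minimal Python sketch. The helper name describe is made up for illustration; ord and str.encode are standard Python. It prints each character's code point and how many bytes UTF-8 needs for it.

```python
def describe(char: str) -> tuple[int, bytes]:
    """Return the code point of a single character and its UTF-8 bytes."""
    return ord(char), char.encode("utf-8")


# One sample character for each UTF-8 length, from one byte to four.
for ch in ["a", "ñ", "€", "😀"]:
    code_point, encoded = describe(ch)
    print(f"U+{code_point:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The code point is the ID; the bytes are what actually ends up in a file, a database column, or an HTTP response.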

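Using UTF-8 will solve most character encoding issues you may come across.  The code points of the previously and widely used ISO-8859-1 (aka latin1) are the same as the first 256 Unicode code points, but only the ASCII range (0 to 127) is stored as the same bytes in UTF-8.  Accented characters such as ñ and é take a single byte in ISO-8859-1 and two bytes in UTF-8, so ISO-8859-1 bytes read as UTF-8 (or the reverse) show up as question marks or other stray characters.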
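One detail worth checking for yourself: only plain ASCII text is byte-for-byte identical in ISO-8859-1 and UTF-8. The sketch below (helper names are hypothetical) encodes the same text both ways and tests whether raw bytes are valid UTF-8.

```python
def encode_both(text: str) -> tuple[bytes, bytes]:
    """Encode the same text as ISO-8859-1 (latin1) and as UTF-8."""
    return text.encode("latin-1"), text.encode("utf-8")


def decodes_as_utf8(raw: bytes) -> bool:
    """Report whether a byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes, utf8_bytes = encode_both("jalapeño")
print(latin1_bytes.hex(" "))          # ñ is the single byte f1
print(utf8_bytes.hex(" "))            # ñ is the two bytes c3 b1
print(decodes_as_utf8(latin1_bytes))  # the latin1 bytes are not valid UTF-8
```

This mismatch is what a browser runs into when the bytes it receives are in one encoding and the page is declared as another.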

Browser Uses the Wrong Character Set

The easiest way to check whether your browser is using the wrong character set is to test with the W3C Validator.  Enter your website address, then scroll to the bottom of the results page to see your character encoding.  If it shows UTF-8, skip to the section Wrong Character Set Used in the Database.

w3c validator character set result

If the W3C Validator shows a character set other than UTF-8 or if you can’t use the validator because you’re developing locally or working on a password protected website, you’ll have to focus on two things.  First is the HTML code, second is the HTTP headers.

Fix the HTML

View the source code and look for one of two different attributes of a meta tag.  The attribute differs based on which version of HTML your page references.  Learn how to tell which version of HTML you’re using here.

For HTML 5, ensure this tag references UTF-8:

<meta charset="UTF-8">

If you’re maintaining a page in HTML 4, look at the meta tag that looks like this and ensure it says UTF-8.

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

If your page is missing the appropriate meta tag, add it.  Test to see if this worked.  If not, your webserver could be sending the wrong content type header.  We solve this issue next.
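If you have many pages to check, the meta tag lookup can be scripted. This is a rough sketch, not a full HTML parser: the function name declared_charset and the regular expression are illustrative assumptions, and the pattern simply looks for a charset value inside a meta tag, which covers both the HTML 5 and HTML 4 forms shown above.

```python
import re
from typing import Optional

# Matches charset=... inside a <meta> tag, with or without quotes.
_META_CHARSET = re.compile(
    r"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""",
    re.IGNORECASE,
)


def declared_charset(html: str) -> Optional[str]:
    """Return the charset declared in a meta tag (lowercased), or None."""
    match = _META_CHARSET.search(html)
    return match.group(1).lower() if match else None


html5 = '<head><meta charset="UTF-8"></head>'
html4 = '<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">'
print(declared_charset(html5))
print(declared_charset(html4))
```

A result of None means the page has no charset meta tag and you should add one.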

Fix the HTTP headers

First, let’s check the headers.  Below are two methods of checking HTTP headers.

You can type the following at the command line if you have curl installed:

curl -I http://www.my-website.com/my-page

You can check using the Chrome browser as shown in this animation.

Charset http header animation
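Once you have the Content-Type value from curl or the browser, you can pull the charset out of it programmatically. The sketch below uses Python's standard email.message.Message class to parse the header value; the function name charset_from_content_type is an assumption made for this example.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None when the header has no
    charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

None here corresponds to the "charset is missing" case, where the browser falls back to the meta tag in the HTML.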

 

If the Content-Type charset is missing, the above HTML tags are enough to fix the issue of the browser using the wrong character set to render the page.  If the charset is missing and your characters are still broken, scroll down to the section Wrong Character Set Used in the Database.

If the Content-Type charset is set to something other than UTF-8, you’ll need to change this.  Here are several ways to resolve it.

Option 1: Update your Webserver Configuration

Most shared hosts support .htaccess files, which are read by the Apache web server.  This file lives in your document root (e.g. public_html).  If it doesn’t exist, create it and add this line:

AddDefaultCharset UTF-8

If you happen to be using NGINX on a VPS, add this line to the server block in your nginx.conf or similar configuration file.

charset UTF-8;

Option 2: Update your PHP Configuration

If you happen to be using PHP for your website, you can change your configuration to send the proper character set HTTP header. PHP versions starting with 5.6 automatically send the UTF-8 charset HTTP header.

The directive to use is default_charset.  This can be set in a configuration file or your PHP code.

Add the following to the main php.ini file or a user-based php.ini file in your document root.

default_charset = "UTF-8"

Or add this to your PHP code.

<?php ini_set('default_charset', 'utf-8'); ?>

Wrong Character Set Used in the Database

If your content is being stored in the database, this is another area to check for compatibility.

I had my problem while using MySQL so I’ll be referencing MySQL specific features here.  These ideas translate to other DBMS.

Bad Storage Encoding

First, ensure the data isn’t corrupt by selecting the data in a quality SQL client.   If the characters look broken in your table, then there’s a chance your table or column isn’t set to store the character set that was inserted.

An easy way of checking the character set is with the following query:

show create table my_table;

The character set is listed at the bottom of the definition as DEFAULT CHARSET = xxx.

We can change the character set of the table and individual columns after the table is created and populated.  Changing the character set won’t fix characters that are already broken, but it will prevent new ones from breaking on future inserts and updates.
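Why can't already-broken characters be repaired? When a column's character set has no way to represent a character, the substitute that gets stored carries no trace of the original. The Python sketch below is only an analogy: the function store_lossy is hypothetical, and Python's errors="replace" handler stands in for a database that substitutes "?" instead of rejecting the value.

```python
def store_lossy(text: str, column_charset: str) -> str:
    """Simulate saving text in a column that uses column_charset.

    Characters the charset cannot represent are replaced with "?",
    and the original character cannot be recovered afterwards.
    """
    raw = text.encode(column_charset, errors="replace")
    return raw.decode(column_charset)


print(store_lossy("jalapeño", "ascii"))   # the ñ is lost
print(store_lossy("jalapeño", "utf-8"))   # stored intact
```

After the first call, nothing in the stored value says whether the "?" was once an ñ, an é, or a real question mark, so the content has to be re-entered or restored from a clean source.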

Bad Transfer Encoding

If you’ve reached this point, the problem is likely caused by incompatible encoding in transit from the server to the client.

The following queries pinpoint areas that could cause character encoding issues:

show GLOBAL VARIABLES LIKE 'character_set_client';
show GLOBAL VARIABLES LIKE 'character_set_connection';
show GLOBAL VARIABLES LIKE 'character_set_results';

Or more succinctly …

show GLOBAL VARIABLES LIKE 'character_set_%';

If any of these show anything other than your expected character set (e.g. utf8), you found your problem.  This is where I found my problem for a client hosted at InMotion Hosting.

Mysql character set latin1

 

My local machine is set to use utf8, which is why the characters looked fine locally but not in production.

Local MySQL charset utf8
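The local versus production difference can be reproduced without a database. In this sketch, the function simulate_transfer and its parameter names are made up for illustration: text is encoded with the connection's character set, then decoded with the character set the web page declares, which is roughly what happens between the database connection and the browser.

```python
def simulate_transfer(text: str, connection_charset: str, page_charset: str) -> str:
    """Encode text as the connection sends it, then decode it as the page declares."""
    raw = text.encode(connection_charset)
    # Undecodable bytes become U+FFFD, the replacement character that
    # browsers draw as a question mark in a diamond.
    return raw.decode(page_charset, errors="replace")


print(simulate_transfer("jalapeño", "utf-8", "utf-8"))    # matching charsets: intact
print(simulate_transfer("jalapeño", "latin-1", "utf-8"))  # latin1 connection, UTF-8 page
print(simulate_transfer("jalapeño", "utf-8", "latin-1"))  # the reverse mismatch
```

The second call produces a replacement character where the ñ should be, and the third produces the two-character garble Ã± that is typical of UTF-8 bytes read as latin1.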

 

You can set these variables independently but MySQL makes this easier for us by allowing the shortcut…

set names 'utf8';

This setting is only valid for the duration of the connection and so has to be sent with each new connection.

PHP allows us to set this when creating a new PDO object.

$db = new PDO(
    'mysql:host=myhost;dbname=mydb',
    'login',
    'password',
    [PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"]
);

Or, using mysqli:

mysqli_set_charset($link, 'utf8');

I hope you found this article helpful. If so, please share!  Have a suggestion to make this article better?  Let me know in a comment below.

 


Published in Web Development

7 Comments

  1. Hi Larry, thank you very much for this great article. This has helped me sort out our issue with the incorrect coding set in the database being latin! You have saved us hours or days of trouble!! Keep up the great work!

  2. DaveL

    Lovely clear article allowed me to confidently troubleshoot the problem I was having.
    Turned out the database charsets were wrong, and the problem was solved with the PHP PDO code you provided. Thanks! 🙂
