What Is Encoding? Understanding Windows-1252 vs. UTF-8

Differences between various encodings 

So, you might think that text is simply text. Well, think again. In this series of blog posts we shall descend to the byte level, study how text is actually represented by computers, and discuss how this impacts your integration solutions.

What is encoding?

Encoding is the way a computer stores text as raw binary data. In order to read text data properly, you have to know which encoding was used to store it, and then use that same encoding to interpret the binary data in order to retrieve the original text. Now you’re probably thinking, “that doesn’t sound so bad, surely there are just a couple of different encodings, and surely all text data contains information about which encoding is used, right?” Well, the answers to those questions are unfortunately not that simple, which is why encoding can be such a nightmare to deal with for developers.

What is text?

What text actually is depends on the context. When stored or in transit somewhere, text is simply a piece of binary data – the same as any other kind of data. At its most basic level, it’s a long row of zeroes and ones. When it’s being actively worked on by a computer it’s still binary data, but it’s interpreted by the system as individual characters, and in many cases converted into another binary representation while it’s being processed. This representation is called Unicode.

A brief introduction to Unicode 

Back in 1988 digital data processing was becoming more and more prevalent, but the market was still extremely fragmented, with every supplier using their own proprietary, non-standardized solutions for most things. As a result, interoperability between different computer systems was virtually non-existent, and sending data from one system to another was often very challenging. At this time, an attempt was made to stem the tide of emerging encoding problems by introducing a standardized common character set known as Unicode. This way, all the different encodings in use could at least be mapped to a common set of characters, so there wouldn’t be any doubt as to which character a given code was supposed to represent.


From the Wikipedia article for Unicode:

“Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The standard is maintained by the Unicode Consortium, and as of May 2019 the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.

In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor.”


The Unicode character set is not an encoding itself but is merely a standardized set of all characters that anyone is likely to encounter in a data file. The Unicode standard contains a number of actual encodings as well. Common to all of these, as opposed to most other forms of text encoding, is that they support the entire Unicode character set.

From: XKCD #1953 – The History of Unicode

While Unicode did fix some of the problems in having an overabundance of co-existing character encodings, it did not solve all of them. For one thing, the adoption of the accompanying encoding systems was slow, and is still far from universal. For another, even though having a common character set to map encodings to was certainly helpful, it did not change the unfortunate fact that many types of textual data do not contain any information about which encoding system was used to produce them.

So, how does encoding work?

Right, let’s get down into the nitty-gritty details. What’s actually stored when you save a text-based file? First, we’ll take a look at one of the oldest and simplest encodings, ASCII. Here is an excerpt from the Wikipedia article on ASCII:

“Originally based on the English alphabet, ASCII encodes 128 specified characters into seven-bit integers as shown by the ASCII chart above. Ninety-five of the encoded characters are printable: these include the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, and punctuation symbols. In addition, the original ASCII specification included 33 non-printing control codes which originated with Teletype machines; most of these are now obsolete, although a few are still commonly used, such as the carriage return, line feed and tab codes.”

As ASCII was developed in the US and based on the English alphabet, it only contains the standard English characters. This means that text containing non-English characters (such as accented letters, or special letters used in other languages) cannot be accurately encoded in ASCII without changing the special characters to standard English ones. ASCII was designed using 7-bit codes to represent the characters it encoded, but because all modern computers use bytes (8 bits) as their smallest memory unit, ASCII characters are now stored using 8 bits per character. The extra (most significant) bit is simply not used.

The entire ASCII encoding standard looks like this:

ASCII table

Now, let’s look at an example to see how a piece of text would be encoded in the ASCII standard. Instead of writing the binary representations of longer texts in full, we will use hexadecimal notation for the binary data.
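
To make this concrete, here is a minimal sketch in Python (chosen purely for illustration; the sample word is arbitrary) that encodes a short piece of text as ASCII and prints the resulting bytes in hexadecimal:

```python
# Encode a short text as ASCII and show the raw bytes in hexadecimal.
text = "Hello"
data = text.encode("ascii")     # b'Hello'
print(data.hex(" ").upper())    # 48 65 6C 6C 6F  -> one byte per character
print(data.decode("ascii"))     # Hello  -> decoding with the same table restores the text
```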

When you open an ASCII encoded text file in a text editor, the program reads each byte of the file and looks up the value in an ASCII table to determine which character to show for that byte.

ASCII is a very limited encoding, though. It only contains 95 printable characters and can therefore only be used to encode those characters. If you have textual data that contains more than those 95 printable characters, you will have to use another encoding.

If only the entire IT industry had agreed on a common encoding back in the day, things would be considerably easier to deal with now. However, this is not the case. Over the years, various companies have set out to solve the exact same problem: how to represent text as binary data for storage or transmission. As a result, an overwhelming number of encoding systems exist today. Frustratingly, many of them are almost identical, leading one to question the necessity for their existence even further.

Many modern encodings are based on the ASCII encoding, but extend it to include more characters. This means that text which only contains characters that are present in the ASCII standard can often also be decoded as ASCII, even though technically it’s encoded using a different standard. This is because many larger encodings still use the same codes as ASCII does for the first 128 characters, but beyond those characters these encoding schemes differ to a greater or lesser degree.
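
To illustrate this backwards compatibility, here is a small sketch (the sample string is just an arbitrary example) showing that text containing only ASCII characters produces exactly the same bytes whether it is encoded as ASCII, Windows-1252 or UTF-8:

```python
# Pure-ASCII text yields identical bytes in ASCII, Windows-1252 and UTF-8.
text = "Plain ASCII text"
ascii_bytes = text.encode("ascii")
cp1252_bytes = text.encode("cp1252")   # "cp1252" is Python's name for Windows-1252
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == cp1252_bytes == utf8_bytes)   # True
```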

Let’s compare two of the most common encodings used for western languages, Windows-1252 and UTF-8.

Windows-1252

This is the default encoding used by Windows systems in most western countries. This means that text data produced by software running on such systems by default will use the Windows-1252 encoding, unless explicitly set to use a different one. Some software lets the user choose which encoding to use, some is set to use a specific encoding rather than the default, and some leaves it up to the system itself.

Windows-1252 is a single-byte encoding, which means that each character is encoded as a single byte, the same as with ASCII. However, since Windows-1252 uses the full 8 bits of each byte for its code points (as opposed to ASCII’s 7-bit codes), it contains 256 code points compared to ASCII’s 128. The first half of the code points are identical to the ones defined in ASCII, while the second half encodes additional characters that do not exist in the ASCII character set.
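
As a quick sanity check, the following sketch (again using Python’s "cp1252" codec to stand in for Windows-1252, with an arbitrary sample string) shows that every character, including non-ASCII ones such as é and £, occupies exactly one byte:

```python
# In Windows-1252 every character is exactly one byte,
# including non-ASCII characters such as é (0xE9) and £ (0xA3).
text = "café £10"
data = text.encode("cp1252")
print(data.hex(" ").upper())   # 63 61 66 E9 20 A3 31 30
print(len(text), len(data))    # 8 8  -> one byte per character
```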

UTF-8

UTF-8 is an encoding from the Unicode standard. UTF stands for Unicode Transformation Format, and the 8 at the end signifies that it’s a variable-width encoding built on 8-bit units. This means that each character uses at least 8 bits for its code point, but some may use more. As with Windows-1252, the first 128 code points are identical to ASCII, but above that the two encodings differ considerably. While Windows-1252 only contains 256 code points altogether, UTF-8 has code points for the entire Unicode character set. The way this is handled is to define some of the byte values above 127 as prefixes for further byte values. For instance, the copyright symbol (©) is encoded as C2 A9, and the pound sign (£) is encoded as C2 A3. Because the C2 byte is designated as a prefix byte, it opens up an additional 64 two-byte code points with C2 as the first byte (one for each valid continuation byte).
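
The byte sequences mentioned above are easy to verify; this short sketch prints the UTF-8 bytes for an ASCII character and for the two example symbols:

```python
# ASCII characters stay single-byte in UTF-8, while © and £ become
# two-byte sequences starting with the prefix byte 0xC2.
for ch in ["A", "©", "£"]:
    print(ch, ch.encode("utf-8").hex(" ").upper())
# A 41
# © C2 A9
# £ C2 A3
```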

This design means that most of the common characters used in western languages only take up a single byte of space, while the multi-byte encodings are used less frequently. As a result, UTF-8 is able to encode any character while still keeping the data size relatively small. This is valuable for both permanent storage (small file sizes) and transmission (e.g. opening a web page). Because of this, UTF-8 is now by far the most dominant encoding in use on the World Wide Web and accounted for 94% of all web pages as of September 2019.

Encoding comparison 

Let’s look at a specific example of how these two encodings differ from one another. We’ll use the word “Naïveté”, which contains two non-ASCII characters (it has alternative spellings without those, but the example is a recognized legitimate spelling of the word in English).
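
Here is a small sketch comparing the raw bytes that the two encodings produce for that word:

```python
# Compare the raw bytes of "Naïveté" in Windows-1252 and UTF-8.
word = "Naïveté"
cp1252 = word.encode("cp1252")
utf8 = word.encode("utf-8")
print("Windows-1252:", cp1252.hex(" ").upper())   # 4E 61 EF 76 65 74 E9        (7 bytes)
print("UTF-8:       ", utf8.hex(" ").upper())     # 4E 61 C3 AF 76 65 74 C3 A9  (9 bytes)
```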

As we can see, the characters ï and é exist in both encodings but are encoded in two different ways. In Windows-1252, all characters are encoded using a single byte, and the encoding therefore only contains 256 characters altogether. In UTF-8, however, each of those two characters is encoded using 2 bytes. As a result, the word takes up two bytes more in the UTF-8 encoding than it does in the Windows-1252 encoding.

So, different encodings treat some characters differently from one another.

Error scenarios

Now we’ve come to the central issue with having multiple encodings and why it matters so much to integration solutions. We’re going to look at exactly what happens to the data when there is a mismatch in the encodings used to read and write it.

Scenario: We have an integration which receives text data as Windows-1252, converts it to UTF-8 and sends it on to a target system. We’ll use the same word as before, “Naïveté”, to illustrate.

First, let’s examine how it’s supposed to work and go over the process step by step.
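
Expressed as a minimal sketch of the happy path (the variable names are made up for illustration), the integration’s conversion looks something like this:

```python
# Happy path: the source really does send Windows-1252.
incoming = "Naïveté".encode("cp1252")   # bytes from the source: 4E 61 EF 76 65 74 E9
decoded = incoming.decode("cp1252")     # integration decodes with the expected encoding
outgoing = decoded.encode("utf-8")      # re-encode for the target system
print(outgoing.hex(" ").upper())        # 4E 61 C3 AF 76 65 74 C3 A9
print(outgoing.decode("utf-8"))         # Naïveté  -> the target reads it correctly
```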

But what happens, then, when the integration receives data in a different encoding than expected? Well, you can see for yourself in the example below.
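
The following sketch reproduces that failure mode: the source actually sends UTF-8, but the integration still decodes the bytes as Windows-1252.

```python
# Mismatch: UTF-8 bytes arrive, but the integration decodes them as Windows-1252.
incoming = "Naïveté".encode("utf-8")    # 4E 61 C3 AF 76 65 74 C3 A9
decoded = incoming.decode("cp1252")     # every byte is a valid cp1252 code, so no error is raised
outgoing = decoded.encode("utf-8")      # the garbled string is passed on to the target
print(decoded)                          # NaÃ¯vetÃ©
print(outgoing.hex(" ").upper())        # 4E 61 C3 83 C2 AF 76 65 74 C3 83 C2 A9
```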

As you can see, the target system ends up with the text “NaÃ¯vetÃ©” saved instead of “Naïveté”. The problem is that the individual byte values used in the multi-byte sequences of UTF-8 are all valid Windows-1252 character codes. So, when interpreted as Windows-1252, each UTF-8 2-byte character becomes two Unicode characters, matching the equivalent Windows-1252 single-byte codes. When the Unicode text string is then converted back into a UTF-8 representation, each of those characters gets encoded as the equivalent UTF-8 code point. Since all 4 of those characters (well, 3 characters, but one of them is used twice) are 2-byte characters in UTF-8, the binary representation of the string is now significantly longer, and wrong.

The opposite scenario also causes different problems.

Scenario: We have an integration which receives text data as UTF-8, converts it to Windows-1252 and sends it on to a target system. We’ll use the same word as before, “Naïveté”, to illustrate again.

Here’s how it’s supposed to work:
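
Again as a minimal sketch of the happy path:

```python
# Happy path: the source really does send UTF-8.
incoming = "Naïveté".encode("utf-8")    # 4E 61 C3 AF 76 65 74 C3 A9
decoded = incoming.decode("utf-8")      # integration decodes with the expected encoding
outgoing = decoded.encode("cp1252")     # re-encode for the target system
print(outgoing.hex(" ").upper())        # 4E 61 EF 76 65 74 E9
print(outgoing.decode("cp1252"))        # Naïveté  -> the target reads it correctly
```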

But what if the text the integration receives isn’t UTF-8 as it expects, but Windows-1252 instead? The answer is a bit more complicated than before. In this scenario several things can happen, depending on precisely how the integration system handles the situation. The problem here is that the codes used in Windows-1252 to represent the ï and é characters are not valid byte sequences in UTF-8. This means that they can’t be mapped directly to Unicode characters using the UTF-8 encoding. When trying to do so, one of five things might happen:

  • Reject: The system halts the processing of the data and throws an error.
  • Remove: The unrecognized characters are removed from the string.
  • Replace: The unrecognized characters are replaced by the Unicode replacement character (�), which when rendered as text on a screen is usually depicted as either a black diamond with a question mark in it, or the empty outline of a square, depending on the font used. 
  • Remember: The same as Remove or Replace, but the unrecognized character codes are remembered and are still saved along with the recognized ones.
  • Re-interpret: When faced with a code that’s not valid according to the encoding used, the system may attempt to interpret the code using another likely encoding. 

Which of these five behaviors you get depends entirely on how the specific integration is programmed. Now let’s take a look at the outcomes:
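
Three of these behaviors (Reject, Remove and Replace) roughly correspond to the error-handling modes built into most languages’ codec libraries; here is a sketch of what the integration ends up with in each case. Remember and Re-interpret require custom logic, so they are not shown.

```python
# The source actually sends Windows-1252, but the integration decodes as UTF-8.
incoming = "Naïveté".encode("cp1252")   # 4E 61 EF 76 65 74 E9

# Reject: strict decoding raises an error and processing halts.
try:
    incoming.decode("utf-8")            # errors="strict" is the default
except UnicodeDecodeError as err:
    print("Reject:", err)

# Remove: the invalid bytes are silently dropped.
print("Remove:", incoming.decode("utf-8", errors="ignore"))    # Navet

# Replace: each invalid byte becomes the Unicode replacement character.
print("Replace:", incoming.decode("utf-8", errors="replace"))  # Na�vet�
```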

As a matter of fact, there are more than five possible results. The “Replace” scenario is essentially the bigger scenario outlined above in reverse: the integration must now try to encode a string containing Unicode characters that have no representation in the target encoding. As above, there are different ways it can go about this, depending on exactly how it was programmed. Let’s look at the different variations of that scenario, just from the integration onwards.

There are no “Re-interpret” or “Remember” variations in this case, since those only make sense when decoding text, not when encoding it.
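
For completeness, here is what the encoding side of that sub-scenario can look like, using Python’s built-in error handlers as stand-ins for the integration’s options. The replacement character (U+FFFD) produced by the earlier “Replace” decoding step has no code point in Windows-1252:

```python
# The string that came out of the "Replace" decoding step above.
text = "Na\ufffdvet\ufffd"              # Na�vet�

# Reject: strict encoding raises an error.
try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("Reject:", err)

# Remove: the unmappable characters are dropped.
print("Remove:", text.encode("cp1252", errors="ignore"))    # b'Navet'

# Replace: Python substitutes a question mark for each unmappable character.
print("Replace:", text.encode("cp1252", errors="replace"))  # b'Na?vet?'
```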

Finally, to illustrate just how badly text data can get mangled when there’s disagreement about which encodings to use, let’s look at what happens if the source and target systems both use Windows-1252, but the integration uses UTF-8 both on the way in and on the way out. We’ll assume that the integration uses the “Replace” option for unknown character codes.
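
Traced through step by step (again letting Python’s “replace” error handler stand in for the integration’s Replace option):

```python
# Source and target both use Windows-1252; the integration wrongly
# assumes UTF-8 in both directions and replaces anything it can't decode.
source_bytes = "Naïveté".encode("cp1252")                        # 4E 61 EF 76 65 74 E9
in_integration = source_bytes.decode("utf-8", errors="replace")  # Na�vet�
to_target = in_integration.encode("utf-8")                       # 4E 61 EF BF BD 76 65 74 EF BF BD
seen_by_target = to_target.decode("cp1252")                      # target decodes as Windows-1252
print(seen_by_target)                                            # Naï¿½vetï¿½
```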

This can certainly cause problems later on, when that data is supposed to be used for something functional, such as the name of a customer or a delivery address, where a mangled value could prevent a paying customer from having their ordered products delivered.

Ref: https://bizbrains.com/blog/encoding-101-part-1-what-is-encoding/