by Мастер » Mon Oct 20, 2014 1:54 am
UTF stands for Universal Character Set Transformation Format. It is a family of methods, the most popular of which seems to be UTF-8, for representing the characters in the Unicode character set in digital form. The objective here is to be able to represent every character used in every language in the world. Originally, dead languages were excluded, but it was quickly realised that people working in history, linguistics, etc. wanted to use characters from even these languages in the articles they would write.
The original one-byte character encoding schemes (like ASCII) could represent 128 (and later, with 8-bit extensions, 256) distinct characters. That is good enough for English, and for most European languages, provided you only use one language in a document, rather than mixing them all together. If you want to create a web page which shows how to say "hello" in all the different western-European languages, then 256 is not enough, and forget about Greek, Russian, Hebrew, Arabic, and (gasp) Chinese.
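To make the limit concrete, here is a small Python sketch. The helper name can_encode is made up for this post, and Latin-1 is just one example of a 256-character scheme; other one-byte code pages would behave similarly for their own languages.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII (128 characters) handles plain English only.
english_in_ascii = can_encode("hello", "ascii")        # True
portuguese_in_ascii = can_encode("olá", "ascii")       # False: no accented letters

# Latin-1 (256 characters) adds western-European accents, but not Greek.
portuguese_in_latin1 = can_encode("olá", "latin-1")    # True
greek_in_latin1 = can_encode("γεια", "latin-1")        # False
```

So a one-byte scheme can cover one group of languages at a time, but not all of them at once.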
The traditional system was that different encoding schemes would be used for different languages. So a document written in Russian would use a scheme which assigned each upper-case and lower-case Cyrillic character to an 8-bit code, with the leftovers used for punctuation. If someone tried to open the document on a computer in Portugal, and the computer failed to recognise that the document was Russian, it would try to interpret each byte as representing a Portuguese character, rather than a Russian character. The result would be total gibberish. (Well, even if it worked correctly, it would still look like total gibberish if you don't understand Russian.)
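The Russia-to-Portugal mix-up can be reproduced in a few lines of Python. This is only a sketch: I am assuming Windows-1251 as the Russian one-byte scheme and Latin-1 as the western-European one, but any pair of mismatched 8-bit encodings shows the same effect.

```python
text = "Привет"                      # "hello" in Russian

# The Russian computer saves the text with one byte per Cyrillic letter.
raw = text.encode("cp1251")          # 6 bytes: cf f0 e8 e2 e5 f2

# The other computer assumes a western-European scheme: same bytes, wrong letters.
wrong = raw.decode("latin-1")        # 'Ïðèâåò'

# Decoding with the scheme that was actually used recovers the original.
right = raw.decode("cp1251")         # 'Привет'
```

Nothing in the bytes themselves says which scheme was used, which is why the receiving computer has to guess or be told.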
Unicode was an effort to represent all the characters in a single encoding system, so there would be no ambiguity - every computer in the world would interpret text in exactly the same way, and it would be possible to put any characters you like in any text document, including many different languages mixed together. UTF-8 is one of several competing Unicode encodings (I think it has pretty much won out). It uses 8 bits to represent English (ASCII) characters, 16 bits to represent accented Latin letters and the Greek, Cyrillic (Russian), Hebrew, and Arabic alphabets, 24 bits for most other scripts, such as Chinese, Japanese, Hindi, and traditional Mongolian, and 32 bits for the rarer characters beyond those.
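If you want to check the sizes yourself, Python can report how many bytes UTF-8 uses for a given character. The sample characters below are my own picks for illustration, one per script.

```python
samples = [
    ("A", "English letter"),
    ("é", "accented Latin letter"),
    ("Я", "Cyrillic letter"),
    ("ع", "Arabic letter"),
    ("ह", "Devanagari (Hindi) letter"),
    ("中", "Chinese character"),
    ("😀", "emoji"),
]

# Number of bits UTF-8 needs for each sample character.
utf8_bits = {ch: 8 * len(ch.encode("utf-8")) for ch, _ in samples}

for ch, label in samples:
    # Print the code point rather than the character, so the output is plain ASCII.
    print(f"U+{ord(ch):04X} {label}: {utf8_bits[ch]} bits")
```

The pattern is that the higher the character's Unicode code point, the more bytes UTF-8 spends on it.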
But, despite the availability of UTF-8, many text documents (including emails and web pages) still use other encodings. If there is some mistake in interpretation, and the computer displaying the information to someone uses a different encoding than the one intended, then the information will not be presented correctly, and will most likely look like total crap.
If you ever see a text document where some of the characters look like hollow boxes or question marks, this is frequently an encoding-related problem.
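One common source of the question marks can be sketched in Python. Here I assume a file saved as Latin-1 that gets read as UTF-8; a strict decoder refuses the bytes, while a lenient one substitutes U+FFFD, the "replacement character", which many programs draw as a question mark in a diamond.

```python
original = "café"
latin1_bytes = original.encode("latin-1")    # b'caf\xe9'

# A strict UTF-8 decoder rejects the lone 0xE9 byte.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder swaps the bad byte for U+FFFD instead of failing.
replaced = latin1_bytes.decode("utf-8", errors="replace")   # 'caf\ufffd'
```

Hollow boxes, on the other hand, can also just mean the font has no glyph for a character that was decoded correctly.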
They call me Mr Celsius!