The Complexity of "Plain Text" (video, 55 minutes)

Hello, NDC. How are you all doing out there? Welcome to a talk that is billed in the program as plain text. And I'm always interested in why people are like, oh, plain text, I want to listen to that for an hour. Because the question that I have for all of you is, is there such a thing as plain text? So a quick intro, this is me. I'm sure some of you know who I am because I have a feeling you're not here because you love plain text. But I run Ursatile, which is my online software training and consultancy company. I'm a Microsoft MVP. I run a .NET user group and I invented the Rockstar programming language as a joke because I think everybody in the world should be able to be a Rockstar programmer because then recruiters are going to have to stop talking about that like it actually means something. And, yes, the title of this talk is plain text. We're going to talk about text. Now, you know, you say to a developer, hey, you like code and computers and all that kind of stuff, what do you think about text? It's like going to someone who likes cars and saying, what do you think about metal? Or going to someone who's into diving and boats and saying, what's your opinion on water? It's ubiquitous. It's everywhere. Everything we do involves manipulating text files in various ways. And so we never really stop to think about all of these encoding systems and text formats that we ended up with: where did they come from? And why do you keep finding these weird little quirks when you have a file you think you should be able to read but you can't, and it's just plain text, so how can it possibly go wrong?
Now, the thing about technology is what makes it interesting and useful and exciting and wonderful is that a human being somewhere can have an idea or a thought or some kind of thing that they want to communicate and we can take that idea and we can put that into some technology and then the technology can do something with it. It can search it or index it or render it or train a machine learning network with it or put it in an email and send it to Australia, but sooner or later it comes back out the other end and somebody else can share that idea. And fundamentally, whenever we use computers to store data, manipulate data, communicate, that's what we're doing. We are taking weird analog thoughts and concepts and stuff from the world around us, the stuff that matters to us human beings, and we're turning it into technology and then we're getting it back out at the other end and somebody else is going, oh, yeah, that's cool. Now, we invented writing text. Well, actually, about 5,000 years ago, human beings invented writing on clay. We may have invented writing on paper earlier than that. We don't know because the paper would not have survived in the archaeological record. But it looks like about 5,000 years ago, human beings in Mesopotamia, in what is now Iraq, started using clay tablets to send messages and keep logs of who owed who how many goats and how much grain and all these kinds of things. And we kind of rattled along with that for about four, four and a half thousand years. But it didn't get interesting until we invented electricity and then someone went, hey, what if we put these things together? What if we do a mashup of writing and electricity? And they came up with this, the Cooke and Wheatstone telegraph system. This was used in England in the 1830s, and I think it is the first instance anywhere in history of text being encoded into some kind of electrical messaging system. Now, Cooke and Wheatstone was really interesting for a lot of reasons.
There's a lot of stuff about this project which will sound familiar to a lot of the people in this room. Cooke was a business person. Cooke was an entrepreneur. And Wheatstone was a scientist. He invented a thing called the Wheatstone bridge, if any of you have heard of that. And Cooke was like, this thing needs to be really, really expensive. And Wheatstone's like, no, we must give our technology away to the world for free. And Cooke is like, it has to be really, really simple so that people don't need training. And Wheatstone's like, we're literally about to revolutionize human communications. Surely people can be expected to read the manual first. And so they went backwards and forwards. But they eventually came up with a design that looked like this. Now, the Cooke and Wheatstone system used five cables. The first installation of this was in West London between Paddington Railway Station and a place called West Drayton, which is about 20 kilometers distance. So they had to run five wires, each wire is 20 kilometers long, using state-of-the-art technology in 1830, which is copper wrapped up in a thing called gutta-percha, which is made from the bark of a tree. We're not talking fiber optic here. And what you've got to remember is, you can't call ahead and say, hey, the cable's arriving tomorrow, because that doesn't exist yet. So if you want to get a message to someone and say, we're bringing a cable through, you need to send a horse. So they run these five cables, 20 kilometers long. And the way that the system worked, it has these five little magnetic needles across the middle of the dial. And if you wanted to send somebody a letter, you would push a key on the keyboard at one end, and two of these needles would deflect, because the key would close a circuit between one pair of cables. So you'd have a positive charge on one, negative coming back on the other. Two needles would move, and then you would read off the dial where they cross.
So it could encode 24 letters of the alphabet. It didn't do K, and it didn't do C, and it didn't do, I can't remember the other one, Q, I think. But it had 24 letters, which was enough to send messages. And this thing made headlines when they used it to catch a murderer. Somebody tried to escape the police by jumping on a train at Paddington, and they telegraphed ahead and had the police waiting when they got there. And it made news like amazing, telegraph, you know, can communicate faster than a train. And that was a big deal in 1834. Now, like I said, this is the earliest encoding system I was able to find for this talk. And it's a five-symbol trinary encoding system. So if you're ever scratching your head with, like, 8-bit binary ASCII, and you're thinking, ah, 8 bits is actually a little bit cumbersome to work with. Imagine if we had had a system where every character is encoded in five symbols, each of which can be positive, negative, or zero. Now, I think we, you know, as a society, we kind of dodged that bullet. The problem with the telegraph system is those five wires, eventually one of them broke. And so somebody's like, we need to run a new wire. And someone else, we've still got four wires. Can we come up with a telegraph system that works with only four? And they did. And then one of those broke. And someone said, well, could we come up with a three-wire system? And meanwhile, across the other side of the Atlantic in the United States, a guy called Samuel Morse was starting with one wire. He didn't even have one coming back the other way. He's like, well, the ground is the ground. So that's your live. Ground is the ground. And we are going to send signals on that. And he invented this, which I'm sure all of you recognize, even if you've never used it. Now, this is not actually the Morse code that was invented by Samuel Morse. This is a variant on that, which is called international Morse code. 
It was invented in Germany by an engineer called Friedrich Gerke about 10 years after Samuel Morse invented his system. And they had a conference in Paris in 1865 to agree some telecom standards for the emerging telegraph network that was being built across Europe. And they signed off on this, international Morse code. All the companies and countries that were building telegraph systems were like, yes, we will use this one. Now, if you've ever built a product or shipped any kind of API or service, you'll know that when nobody is using your stuff, you can change it as often as you like because nobody cares. But as soon as you have real users in the real world using your system, anytime you change anything, they get a bit upset. Now, international Morse code arrived at just the right time that within about a decade, every single telegraph system on the planet, one, had just been built. You know, this was the decade when telegraphy swept the world. And two, they were all using this system. Except for the United States of America, which has always done its own thing when it comes to international cooperation. But everyone else was using international Morse code. And so it survived. This thing proved almost impossible to replace with anything better. It wasn't the best encoding system, but everyone knew it. Telegraph operators were familiar with it. It worked on a single wire. You could send it over cables. You could also send it using sound. You could send it using lights and signaling lamps and all kinds of things. And so this wasn't unseated until almost exactly 100 years later. 1965, when digital computers arrived on the scene and people were like, we probably need a text encoding system that works well with binary computers. Now, the reason Morse code doesn't: you might look at that and think, well, dot dash, dot dash, that's binary. But Morse code is not just dot dash. Morse code also includes timing.
To decode a Morse code message, you need to be able to count the gaps between the messages or look for the spaces between them. So there was actually more to it than just a dot and a dash. It's not a binary system. Computers like binary, true, false, 1, 0. And so a bunch of people sat down in America, the American Standards Association, which later became ANSI, and they're like, we are going to invent an encoding system for putting text into computers. Now, the computers they were thinking about looked like this. They did not have screens. They had teletypes, teleprinters, and they didn't have disks. They had punch cards. Now, they did a kind of reasonable job. This was the project that gave the world ASCII, the American Standard Code for Information Interchange. And, you know, we've all worked with ASCII, right? But have you ever taken ASCII apart and tried to work out why it looks the way it does and why it works that way? ASCII is a 7-bit encoding system. Now, here is the bottom block of ASCII, the first 32 characters in the ASCII table. 0 is all the 0s, and that is null. Any C++ or C developers in the room? Because we still use that. Null terminated strings is one of those things that was the best idea we had at the time, and now it's just never going away. But then you get into a block of characters that were designed to control these 1965 teleprinters. Now, if you wanted to send an instruction to your teleprinter, there was a key on your keyboard called control, and you pushed control, and then if you wanted to send a 1 for start of heading, you sent a control A. If you wanted to send a start of text, you pressed control B. If you wanted to send end of text, as in stop printing stuff because you're not supposed to be printing because I've made a mistake in my program, you'd send control C, and that one has survived to this day as the interrupt character on almost all development platforms. Then we've got end of transmission, inquiry, acknowledgment. 
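As a small aside on those control keys: the conventional mapping is that Ctrl plus a letter sends the letter's ASCII code with the upper bits cleared, so A becomes 1, B becomes 2, C becomes 3. Here is a minimal Python sketch of that idea. The function name `control_code` and the little lookup table are made up for illustration, and real terminals add their own special cases on top of this.

```python
def control_code(letter: str) -> int:
    """Return the ASCII control code conventionally sent by Ctrl+<letter>.

    Hypothetical helper: it keeps only the low five bits of the letter's code.
    """
    return ord(letter.upper()) & 0x1F


# Names of the first few control characters, as described in the talk.
NAMES = {
    1: "SOH (start of heading)",
    2: "STX (start of text)",
    3: "ETX (end of text)",
}

for letter in "ABC":
    code = control_code(letter)
    print(f"Ctrl+{letter} sends {code}: {NAMES[code]}")
```

So the interrupt key that survives to this day, Ctrl+C, is just the letter C (67) with the top bits masked off, which leaves 3.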
ASCII code number 7, if you open a terminal on a Windows machine and you type echo space push control G and press enter, it will play ping.wav because that's how they interpret the instruction to ring the bell on the teleprinter on Windows 11, and that still works. And then we get into the tabs and the line feeds and these kinds of characters. And by the way, character 11 vertical tab is the correct answer to the tabs versus spaces argument, if anyone ever tries to have that with you. Now, this block here, we've got a carriage return and we've got a line feed. And mechanical teleprinters, if you look at the way a teleprinter works, it has a thing called a carriage, and when you print a line of text, the carriage is going to move across and print things. And then you get to the end, you're like, well, I need to return the carriage to the start of the line, and then I need to feed another line of paper through the teleprinter. Now, these things were very primitive. They used like a daisy wheel style typewriter head. They couldn't do bold, they couldn't do italics, but they could do a kind of pseudo bold because what you could do is you could print a line, you could do a carriage return, and then you could print the same line again and you'd get this kind of bold overprinted effect. So a carriage return without a new line was actually used a lot. A new line without a carriage return, very occasionally, but honestly it was kind of useless. And so we ended up with this scenario. If you look at the state of the art today, we have Linux and Unix machines and we have Mac OS and we have Android, and they say slash N is a new line. And we have Windows, 10, 11, 8, 7, Vista, XP, and they say no, a new line is slash R slash N. And the reason for this is that most Unix systems, or, you know, Linux, Android, those kind of systems, they evolved from Unix, and Unix evolved from a system called Multics, and Multics was the first operating system in history that had device drivers. 
So they had a little place where they could run a piece of code that said, hey, if you're talking to a teleprinter and you see a slash N, put in a slash R as well because obviously we want to return the carriage before we print the next line. Whereas Windows evolved out of, well, MS-DOS, and MS-DOS kind of ripped off a lot of ideas from CP/M, and CP/M was designed to run on really cheap microcomputers which didn't have any device driver capability. And so that's why, to this day, the major operating systems in the world cannot agree on what the end of a line looks like: they came from these two very different schools of abstraction and hardware engineering. Now, we take a look at the next chunk. One of the interesting decisions that the creators of ASCII had to make is they're like, we are allowed 127 characters and we've lost 32 of them to control codes. What are we going to include? And they did some things. Like, does anyone here work in typesetting or typography? You know, in ASCII we have a dash, and that dash, it's a minus sign, and it's a hyphen, and it's an en dash, and it's an em dash, all these different kinds of things. In publishing, those are different characters. You have an em dash and you have an en dash and they're different widths and a minus sign is not the same as a hyphen. The creators of ASCII were like, actually, we don't need that stuff. And so they just came up with one little horizontal line and you can use that for whatever you need. But within the last ten years or so, particularly as e-books and high-definition displays have become popular, we are starting to see a return to some of these typographic characters. Smart quotes is another good example.
ASCII didn't have space for an open quote and a close quote, so it just doubled them up, and now we're starting to see those come back in, which is why if you copy and paste code out of Microsoft Word into Visual Studio, it never works because it's broken all the quotes for you, right? Now, this block here starting at 48, these are the decimal digits. And one of the things a computer has to do a lot is it's got to turn numbers that are actually text, which are what human beings can read, into numbers that are actually numbers so it can do arithmetic with them. And the way ASCII does this is you chop off the top half, basically the biggest four bits, and what you're left with is the binary representation of that digit. So the code for the character 0, the bottom four bits of that are the binary value 0. 1 is 1, 2 is 2. And this is blisteringly fast even on a four-bit microprocessor like the Intel 4004. So this was a really nice optimization. Now, you get to the alphabetic characters. You ever wondered why uppercase A in ASCII is 65 and lowercase A is 97? Why did they pick those two numbers? And the reason is they are separated by one bit. So if you have an ASCII text file and you want to do a case-insensitive sort or string comparison or something, you just ignore the sixth bit and then they come out the same. That's why they're 32 characters apart in the index. We mop up a little bit more punctuation with a vertical bar and a tilde and stuff, and then finally we get to character number 127, which is ASCII delete, DEL, which looks like this. And the reason ASCII DEL looks like that is you can't fill in the holes on a punch card. The only way to erase information from a punch card is to punch out all the rest of the holes as well, and that's why ASCII delete is a block of ones, because you send this to a card punch and it punches out all the holes and then you can't read what was on the card anymore.
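Those three bit tricks are easy to try out. What follows is a rough Python sketch, not anything from the ASCII standard itself: the helper names `digit_value` and `same_letter_ignoring_case` are invented for this example, and both only make sense for plain ASCII digits and letters.

```python
def digit_value(ch: str) -> int:
    """Turn an ASCII digit character into its number by keeping the low four bits."""
    return ord(ch) & 0x0F


def same_letter_ignoring_case(a: str, b: str) -> bool:
    """Compare two ASCII letters while ignoring the bit worth 32 (0x20)."""
    return (ord(a) | 0x20) == (ord(b) | 0x20)


DEL = 0x7F  # character 127: all seven bits set, every hole punched

print(f"'7' is {ord('7'):07b}, low four bits give {digit_value('7')}")
print(f"'A' is {ord('A'):07b} ({ord('A')})")
print(f"'a' is {ord('a'):07b} ({ord('a')})")
print(f"They differ by {ord('a') ^ ord('A')}, a single bit")
print(f"DEL is {DEL:07b}")
```

Printing the values in binary makes the design visible: the digits carry their own value in the bottom four bits, upper and lower case differ in exactly one bit, and DEL is a row of seven ones.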
And so the United States slaps themselves on the back and they're like, hey, good job, everyone. 127 characters is all you'll ever need. And the rest of the world goes, what? We can't even work. I mean, some countries and cultures were like, well, we've missed the last four letters of our alphabet, but, okay, we'll work with this for a while. Other countries and cultures are like, this thing, we're not even, like, we can't even get out of bed with this thing. Like, literally, there is no way our culture can use your standard to communicate. Now, my personal journey with plain text is a story of three turning points, and the first of those came in about 1988, 1989. I grew up in Zimbabwe where, you know, I grew up speaking English and using dollars and cents. So the early computers that I played around with, I didn't have any encoding problems because I used all the characters that the Americans used. When I moved to the United Kingdom when I was about ten years old, I was trying to print some homework, and I tried to print this, and it didn't work. And it didn't work because the Americans did not give us a standard for the pound sign, the pound sterling currency symbol, because they didn't use it. And so my computer was using one set of abstractions to do pound signs, but my printer, which was like one meter away on the desk connected by a cable, that was using a completely different system. And that's when I started having to learn about something called code pages, because ASCII is a seven-bit encoding system, and a whole lot of people kind of looked at it, and they went, well, we've got this alphabet that we like, and the Americans have used the bottom half, but we've got eight bits in a byte. Maybe we could use this bit to send emails in Bulgarian Cyrillic or Hebrew or whatever it is that they wanted to use. And lots of people said, is anyone using this bit? And no one said, please don't. And so they all invented their own ways of doing it. 
Every single one of those is a code page. That's what a code page is. It's a set of rules that say, this is what the top half of eight-bit ASCII means right here, right now in this document on this computer. And often that's not the same as on this document on the printer that is right next to it. Now, one of the most common or sort of popular, I guess infamous is probably a good word, code pages in history, was code page 437. You're literally listening to a guy right now talk about his favorite code pages. How do you feel about that? Code page 437 shipped with the IBM PC in the 1980s. And it's interesting for two reasons. One, they extended ASCII to support most of what Western Europe needed. So they included all of the accent characters necessary for, you know, the Nordic languages, Eastern Europe, French. Then they included half of the Greek alphabet. Not all of it. They included the half that shows up in physics. So all the letters of the Greek alphabet that are used in mainstream kind of high school level physics are in code page 437, but you can't use it to write documents in Greek because it doesn't have enough letters. And then they went, well, the IBM PC doesn't actually have a teleprinter, so we probably don't need that bottom control thing. Let's override that and use it so we've got smiley faces and playing card symbols and all those kinds of things. And if you've ever seen an IBM PC crash really, really hard and you get a screen full of gibberish and smiley faces and playing cards, that's because it's gone completely, you know, the operating system kernel has crashed. It's sending control codes to try and stop the teleprinter, and it's going, oh, yeah, smiley face. Clearly that's a good way of communicating what happened. Now, there were a whole bunch of different code pages created for different alphabets and different languages, but they didn't always solve the problem terribly well. 
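To make the code page idea concrete, here is a short Python sketch using codecs that ship with Python's standard library. The choice of the pound sign and of byte 0x9C is just for illustration, and the specific code pages compared here are my pick, not a claim about what the computer and printer in the story were using.

```python
# The same character, the pound sign, gets a different byte value
# depending on which code page does the encoding.
pound = "£"
for codec in ("cp437", "latin-1", "cp1252"):
    print(f"{codec:8} encodes the pound sign as 0x{pound.encode(codec).hex()}")

# And the same byte means something different under each code page.
mystery = bytes([0x9C])
for codec in ("cp437", "cp1252", "cp1251"):
    print(f"{codec:8} decodes 0x9c as {mystery.decode(codec)!r}")
```

Under code page 437 the byte 0x9C is a pound sign; under Windows-1252 the same byte is a lowercase oe ligature, and under Windows-1251 it is a Cyrillic letter. Nothing in the byte itself says which reading is intended, which is exactly the computer-versus-printer problem.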
One that I'm particularly interested in is code pages used for the Cyrillic alphabet, which is used in Russia, Ukraine, Belarus, and Bulgaria, I think, and a bunch of other countries as well. Now, there was an international encoding system which was ratified by ANSI for doing Russian in 8-bit ASCII, where you would take a word like Privyet, which is hi in Russian, and you'd turn it into a bunch of 8-bit sequences. But a lot of systems at the time would chop off the first bit. They would either just not bother transmitting it, or they'd use it for parity checks, or they'd use it for things like has this word been spell checked, those kinds of things. Email systems, WordStar, quite a lot of bulletin board systems could not cope with the 8th bit, and so that would disappear. It would get chopped off, and what you'd get out the other side was a bunch of numbers that, when you turn them back into text, is gibberish. It doesn't mean anything at all. So the Russians went, this system is no good. We're going to create a better system. And this was a brilliant hack. It was the KOI8 encoding system. What they did, they said, well, take the Russian alphabet, and instead of encoding it in alphabetical order, we'll encode it using the English letters which kind of sound the same. And so what you do is you take Privyet, you run it through your 7-bit WordStar spell check, it chops off the top bits, and what you get out the other side is that. And they transposed the upper and lower case, so it's obvious that this thing has been mangled, but you can still kind of read it and make sense of what it says. And that's the kind of ingenious hack that I absolutely love. Now, you can go down all kinds of rabbit holes when it comes to Russian and ASCII. There's a story. I did this talk at a conference a couple of months ago, and someone emailed me afterwards and said, do you know about the Harry Potter envelope? And I was like, what's the Harry Potter envelope?
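Before the envelope story, a quick illustration of that chop-off-the-top-bit hack. This Python sketch assumes the encoding in question is KOI8-R, which Python ships as the `koi8_r` codec; the helper name `strip_high_bit` is invented here, and it simply simulates a 7-bit system throwing away the eighth bit of every byte.

```python
def strip_high_bit(data: bytes) -> str:
    """Simulate a 7-bit channel: clear the top bit of every byte, read as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")


word = "Привет"  # "Privyet", hi in Russian
encoded = word.encode("koi8_r")

print("KOI8-R bytes:   ", encoded.hex(" "))
print("after 7-bit hop:", strip_high_bit(encoded))  # pRIWET
```

The result, pRIWET, has its case flipped, which flags that something went wrong in transit, but it can still be sounded out as the original word.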
And they sent me this story, and I did a bit of research. What happened? This is in 2002, and there was an email pen friend correspondence going on between a woman in Russia called Svetlana and somebody in France whose name is lost to history, so I've called her Claudette. And they're emailing backwards and forwards, and Sveta sends an email to Claudette saying, hey, that would be amazing. Yes, send me the new Harry Potter book. We can't get it in Russia yet. Here's my address. It's in the Russian alphabet. Please copy it very carefully. And it includes a Moscow street address written in Russian, in Cyrillic. And this arrives on Claudette's French computer, which is using a Windows Western European encoding, and it looks like this. But the instructions say, here's my address in the Russian alphabet. Please copy it carefully. Now, the underlying code points there, the numeric values, are the same, but it's being rendered using a different interpretation. So Claudette follows the instructions, and she copies that very carefully onto an envelope and puts it in the mail, and it finds somebody in the Russian postal system who knows what has happened, who goes through and fills it all in. And the book ends up getting delivered. Now, then, how many of you in here have got a phone in your pocket which has Spotify or Apple Music or iTunes? If you do, get it out now, and I want you to go on there, and I want you to search for this word. You found anything? Billy Joel. Live in Leningrad, 1987. Because in 1987, Billy Joel played a concert in the city of Leningrad, which is now St. Petersburg. And it was one of the earliest Western recordings to be released inside the Soviet Union. And so they called the album Concert, but they wrote it using the Russian alphabet. And then this got sent over to some record company who were like, we need to put this in a database. K-O-H, is that a U? In Russian, this is concert. But they read it as Kohuept.
And of course, once something's in a database, it's never coming out. And so when Apple Music and Spotify set up all these licensing deals with the big music providers, they're like, do you have a database of all your records? And they're like, yeah, yeah, I have a copy. I have a big CSV file. And so Kohuept, live in Leningrad, has just become a thing now. It's an encoding problem that became a database problem that became an album that you can go on to Spotify and listen to right now. Now, code pages worked up to a point. If you were just using your computer to do your stuff in your language and you had your printer set up properly, that's kind of OK. They sucked for email. They sucked for a lot of things. You couldn't write a document which combined Russian and Hebrew on the same page. You had to do it as separate documents and then staple them together. And if you're not stapling stuff because you're sending it by email, there's just no way of doing this. And there are a lot of code pages. This is just one section of the Wikipedia list of code pages. Because Mac OS has one for Arabic, one for Armenian, one for Cyrillic, one for Chinese, Korean. And then DOS has all of its own. And then IBM has some for AIX. And IBM has ones that are built for other people. And PostScript has a set. And this was just chaos. Like, if we were going to wire the planet and connect all our computers together so we could send email backwards and forwards, we needed a unified encoding system. And so the Unicode Consortium was born. Set up in 1988. The consortium was legally founded in 1991. And their mission statement, which I think is a great example of a mission statement, just says this. It's to provide a single consistent way to represent each letter and symbol needed for all languages across all computers and devices. Now, a single consistent way to represent each letter and symbol. Actually, we start at the end. All computers and devices.
Well, the only way to do that is to come up with something really, really good. Like, literally easier than people inventing their own. Document it, give it away for free, make it completely free of any licensing restrictions, and then encourage people to use it. And they succeeded. You know, the Unicode standard is so prevalent now. It's why WhatsApp works. It's why Signal works. It's why email works. It's why the web works. They did a fantastic job. And when they said all human languages, they also wanted to include, you know, like Egyptian hieroglyphics and ancient Sumerian so that people who are doing historical research can use Unicode. And then we've got this first part. A single consistent way to represent letters and symbols for all human languages. Now, what is a letter? There's two strings here. They look the same. They're the same shape. They are not the same string. They are not even the same alphabet. One of them is the start of the Russian word horosho. The other one is the English word exoplanet. And so you get into, very quickly, these kind of complicated discussions about, if we have two letters that look the same and they represent the same sound, but they come from different alphabets, are they the same letter or not? Now, I'm British. I speak British English. Good afternoon. How is the queen? Would you like some tea? So when I travel and I see words like this, what my English brain does is go, oh, yeah, I'm just going to filter that out because I don't know what that means. And I'm going to say, good afternoon, could I have a half owner, please? And I'm like, what's a half owner? It's a half owner. Is that close enough? A half owner is a hairdryer in Norwegian. But my English brain, it just thinks that these are like extra decoration, you know, like the garnish on a cheeseburger. I push them to one side. But they're not. These are different letters of the alphabet in Norwegian and Swedish and Danish. 
And so when my English brain treats it as this, I'm actually throwing away semantically significant information. Let's meet some friends of mine. This is François Bordeaux, a French archaeologist. And this is the Los Angeles heavy metal band Mötley Crüe. And François, the archæologist, went to the Mötley Crüe concert. Now, this sentence is written in English. But it has some things in it which do not appear in the English alphabet. We have a little C with a cedilla here. And we have an A and an E kind of mashed together, which is still a thing some typesetters do. And then we've got an O with an umlaut and a U with an umlaut on it. And that is the heavy metal umlaut, which is used by American rock bands because they think it makes them look cool. And the first time Mötley Crüe went to Germany, they could not work out why the audience was chanting Mötley Crüe, Mötley Crüe, with the umlauts pronounced, until someone had to sit them down and go, you know your name is nonsense in this country. Now, making Mötley Crüe look stupid is not really a big deal. They kind of look stupid anyway, right? But this can have real consequences. Some of you probably know this guy, Magnus Mårtensson, a Swedish developer, Azure MVP. And he was traveling to a conference in the United States. And he has a Swedish passport. And if you look at the bottom of your passport, there's a little strip that has to be ASCII so that mainframes from the 1970s can read it. That's just the rules. And the rules of Swedish orthography, the Swedish government says that if you have the letter Å, which is the A with a circle, and you have to turn it into ASCII, it's two As, AA. And the people who printed Magnus' airline ticket, they're like, we just take the circle off because we don't know what that is. So he's trying to get into the U.S. with a passport that doesn't exactly match the name printed on his airline ticket. How many of you have met United States border security?
They're a famously understanding bunch, aren't they? Now, you know, Magnus is a kind of affluent white guy on his way to go and speak at a conference at Microsoft or something. So, he just had to answer a couple of questions and explain what had happened. But you can see how this kind of text encoding problem can have real consequences for somebody who the authorities decide they don't like the look of. And we are just getting warmed up. Let's take a look at a list of cities. Berlin, Aachen, Zurich, Aarhus, and Örebro. Now, we are going to take this list of European cities and we are going to put it into our SQL Server database. Insert into cities, name, we've got an nvarchar(128), so we've got Unicode support in the table. We're going to put all those cities in. We're going to select star just to make sure we can get our data back out again. There we go. And now what we're going to do is select star from cities where name equals orebro, typed without the umlaut, and we get nothing back. Well, that's not what I was expecting as a British English person. Maybe it's case sensitive. Orebro with a big O. Nope, still nothing. Well, are these things even the same? No, they're not. By default, they're not. Unless you tell your database, I would like you to compare strings using the Latin1_General case-insensitive, accent-insensitive collation, which is a set of rules for deciding whether two letters of the alphabet are greater, lesser, or the same. And if we use that collation, it says yes. These are now considered to be equivalent. Now, what we're going to do next is we're going to go here. We've got cities ordered by name and we're going to use that Latin1_General collation and we get Aachen, Aarhus, Berlin, Örebro, and Zurich. Now, that's the order that a British English speaker would expect. It kind of makes sense. But the letter Ö is not O. It is the 27th letter of the Swedish alphabet.
And so if we use the Finnish and Swedish collation, we get Aachen, Aarhus, Berlin, Zurich, and then Örebro comes out at the end because it's the 27th letter of the alphabet. Now, if we do this with the Danish and Norwegian collation, Aachen and Aarhus now appear at the end. Now, how many of you in the room are looking at that going, well, of course, that's how it works. That makes perfect sense. Yeah. Obviously. Now, if we put Athens in, you see Athens does still appear at the beginning. Now, there's a really interesting example. The best way that I've found of explaining this is the Danish spelling reform. So the city of Aarhus in Denmark, which is the second city after Copenhagen, was spelled Aarhus, with A-A, historically until 1948, when Denmark decided that it would quite like it if its alphabet looked a little bit less German and a little bit more kind of Swedish and Norwegian. And you can probably work out why. And so they took the Danish alphabet, they added three new letters to the end of it, and they made a rule that said anywhere that started with "Aar", spelled A-A, was now "År", spelled with one of these: Å. So this list was then in correct Danish alphabetical order, by the rules of Danish orthography. Then in 2011, the city of Aarhus voted to change its name back so it could find itself on TripAdvisor and search engines more easily, but not to change its position in the alphabet. So the list on the right is in alphabetical order according to Danish orthographic conventions. Because Aachen is not in Denmark, and so it goes at the beginning, but Aarhus is in Denmark, and so it goes at the end. So your database now needs to know, for every single piece of text, whether that is a place, person, or company in Scandinavia or not to be able to get the alphabetical order correct, which is technically impossible. You know, there's just no way that you can do this. 
So if you ever do end up in a scenario where you need to apply these things, where there are kind of cultural precedents around orthography which are not possible to deduce from the data you have available, the only thing you can do is plug in more data. You can put in a column that the computer uses to sort things and then a column that it uses to display what the human beings are going to read. And so if we put in an additional column here called a sort name, we can override Aachen and put in Achen with a single A, and that kind of short-circuits the "put AA at the end of the alphabet" thing, and we get an order that would be familiar to people who aren't necessarily familiar with all of these conventions. Now, what this kind of goes to show: there's a lot of people I've met in tech who are like, I don't really do politics, I just do computers, and I don't worry too much about the implications of what I'm doing. But the technology that we use every day, operating systems, databases, keyboards, these are all informed by the history and the culture and the politics that created the problems that technology has been asked to try and solve. If you look around for technology that is completely apolitical, there isn't any. There are all of these cultural and historical precedents baked into the way Norwegian Windows puts aardvark.txt at the end of the list instead of the beginning, and if you don't know about Danish and Swedish and Norwegian orthography and alphabets and stuff, you have no idea why it's doing that. Let's take a look at this letter. C with a cedilla. Unicode says this is a letter of the French alphabet, and it gives it a code point, but in English, this isn't a letter of the alphabet, it's the letter C, which exists in ASCII, with a little tail hung off the bottom of it. So Unicode says if you want to do it that way, that's fine, we'll give you the C and the tail and you can glue those together yourself. 
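The separate sort column described earlier in this passage can be sketched roughly as follows. This is only an illustration in Python, not the SQL from the talk: `toy_danish_key` is a hypothetical, heavily simplified stand-in for a real Danish/Norwegian collation (it only knows that "aa" sorts like "å" and that æ, ø, å come after z), and the `sort_name` field and the city records are assumptions made for the example.

```python
# Toy approximation of Danish/Norwegian ordering. Real collations
# (SQL Server, ICU) have many more rules than this.
DANISH_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"

def toy_danish_key(text):
    """Map text to a list of alphabet positions, treating 'aa' as 'å'."""
    folded = text.casefold().replace("aa", "å")
    return [
        DANISH_ORDER.index(ch) if ch in DANISH_ORDER else len(DANISH_ORDER)
        for ch in folded
    ]

# 'name' is what people read; 'sort_name' is an optional override
# that only the sorting code looks at.
cities = [
    {"name": "Berlin", "sort_name": None},
    {"name": "Aachen", "sort_name": "Achen"},
    {"name": "Zurich", "sort_name": None},
    {"name": "Aarhus", "sort_name": None},
    {"name": "Athens", "sort_name": None},
]

# Sorting on the display name alone sends both AA cities to the end.
by_name = sorted(cities, key=lambda c: toy_danish_key(c["name"]))

# Sorting on the override (when present) moves Aachen back to the start
# while Aarhus keeps its Danish position.
by_sort_name = sorted(
    cities, key=lambda c: toy_danish_key(c["sort_name"] or c["name"])
)

print([c["name"] for c in by_name])
print([c["name"] for c in by_sort_name])
```

The point of the design is that the data, not the collation, carries the knowledge that Aachen is not a Danish name.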
These combining characters, you can have a lot of fun with these, because you can put combining characters on combining characters, and you can put more combining characters on top of those ones, and you can just keep stacking them up until you end up with Zalgo text, and then you can go and paste this on Stack Overflow on questions about Tony the Pony and stuff. But it brings up more questions: if there are two different ways to write the same letter, how do we know if those are considered equivalent or not? We can write Mötley Crüe as M, then ö as a single letter, or we can do it as M, O, Unicode combining diaeresis, and then when this gets rendered, your operating system will go, oh, yeah, put the dots on top of the O, clearly that's what they were asking for here. Are these two strings equivalent? Because they're not even the same length, they're not binary equivalent, and again, what Unicode did is they said we're not going to tell you what's right and wrong, we're going to give you some options you can work with. If you're working in .NET, the way you tap into this is to use something called normalization forms. So Unicode defines four normalization forms. Composed and decomposed. Composition means take the string and squash it down into the smallest possible number of code points. If you've got an O plus a combining diaeresis, you replace that with the single letter which looks like that in the first place. Decomposition is the opposite. You stretch the thing out, all the accents get spun out into combining characters, so you make the string as big as you can. And then the K here stands for compatibility, because they'd already used C to stand for composition, and compatibility says forget what they look like, do they represent the same word? Now, this code is online, you can go and grab this later. 
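The four normalization forms can be tried out with a short sketch. The talk's own demo is in .NET; this is a Python approximation using the standard `unicodedata` module, and the helper name `equivalent` is made up for the example.

```python
import unicodedata

# The same band name written two ways.
composed = "M\u00f6tley Cr\u00fce"        # ö and ü as single code points
decomposed = "Mo\u0308tley Cru\u0308e"    # o and u followed by U+0308 COMBINING DIAERESIS

def equivalent(a, b, form):
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print("binary equal:", composed == decomposed)
print("code points:", len(composed), len(decomposed))
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(composed, decomposed, form))

# Circled letters are distinct code points that only the two K forms
# fold back to ordinary letters.
circled = "\u24df\u24db\u24d0\u24d8\u24dd"  # circled p, l, a, i, n
plain = "plain"
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(circled, plain, form))
```

In Python, `len` counts code points, so the decomposed string is two longer; other platforms count differently, which is part of why comparing normalized forms is safer than comparing lengths or raw bytes.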
If we create Mötley Crüe in .NET, terrible idea, but let's do it anyway, using inline strings, and then down here we're going to do it by putting in an O and a combining diaeresis, we get two strings, and then we are going to run these through our comparison function, and what we get is two strings out, s1 and s2. Now, you'll notice they're the same length. It's like .NET has said, oh, well, combining characters don't meet our definition of making the string longer. They are not equal: s1 is not equal to s2. They are binary different, but they are considered equivalent under all four forms of Unicode normalization. If we take the words plain text and we encode that as normal text and we encode it as text in little circles, so you can make your Twitter handle look pretty, what we get at the other side there is two strings that are not equal, and they are not equivalent under the non-compatibility comparisons, but if we apply the compatibility comparison, it's like, yes, they both say plain text in English, so they do represent the same string. Now, the second turning point in my personal journey with text encoding was when I went into work one day, probably about five, six years ago now, and somebody said to me, dude, I think we've been hacked. Now, when you work in IT, people come up to you all the time and say they think they've been hacked because they've right-clicked on Facebook and hit view source and now they think that the hackers are in the internet getting into their computer because they watch too many movies. And I was like, why do you think we've been hacked? Now, this was not somebody prone to Facebook movie paranoia. This was one of the best security engineers that I've ever worked with, and he said, there's Chinese in the Windows event logs. It's a British company in London using English-language Windows with no Chinese customers, no Chinese business deals, no Chinese employees. Why is it? 
What do you mean there's Chinese in the event logs? And sure enough, there is Chinese in the Windows event logs. And so I do what managers do in that situation. I say to one person, you check the firewall, see if there's any suspicious activity or anything. You go and look through all the database tables, see if you can find any data. Like, is this an injection attack? Is this like a weird endpoint on one of our websites? You go and tell customer services that weirdness is going on and we'll have a proper update for them in 15 minutes, which means I can sit back and shitpost about it on Twitter. And that actually solved the problem faster than any of that, because I put this little message up, and I was like, oh, it's going to be one of those days. And immediately I got a couple of replies back from people going, oh, that looks like a Unicode mapping error, and then the Fake Unicode Twitter account, which is a lot of fun, you should follow it, pops up and says, yeah, the low bits are null, so this is probably UTF-16 LE being mistaken for UTF-16 BE. And I'm going, right. I should probably figure out really fast what those things are. And so I dive headfirst into the Unicode rabbit hole, and a couple of hours later I sort of stumble out, bemused and blinking, and figure out what's going on. The word delete. Internally, Windows and Java and JavaScript use 16 bits for almost everything, because it is fast. With, you know, a native operating system running on the hardware in front of you, network speed is not a big deal compared to being able to manage stuff effectively in memory. And so they use 16 bits for everything. But if you are using 16 bits to store 8-bit characters, you kind of need to decide which way around you are going to read them, starting at the big end or starting at the little end. And so the word delete in UTF-16, if you take apart the memory in your Windows machine and look at the raw numbers, the word delete is in there encoded like this. 
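The byte layout of the word delete, and what happens when the same bytes are read with the opposite byte order, can be reproduced with a few lines of Python. This is a sketch using only the built-in codecs; the last step, which drops a single byte from the stream, is an assumed simplification of a faulty network link and not a reconstruction of any real incident data.

```python
word = "delete"

# UTF-16 little-endian: each ASCII letter becomes its byte followed by 0x00.
little_endian = word.encode("utf-16-le")
print(little_endian.hex())

# Read exactly the same bytes as if they were big-endian.
misread = little_endian.decode("utf-16-be")
print(misread, [f"U+{ord(ch):04X}" for ch in misread])

# Simulate one byte going missing: every later 16-bit pair is now
# misaligned, even though the reader still assumes little-endian.
shifted = little_endian[1:]
shifted = shifted[: len(shifted) // 2 * 2]  # keep whole 16-bit units only
print(shifted.decode("utf-16-le"))
```

Nothing in the bytes themselves says which order is intended, which is why a one-byte shift produces valid-looking text instead of an error.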
But if you flip those pairs around, what you get at the other end is Chinese. I ran this through Google Translate in case it was like a ransom demand or something. And the first one there is a character that means putting a pearl in the mouth of a deceased relative for burial. And the second one is preparing a duck for a traditional feast. So, like, if this was hackers, I don't know what they wanted. It's nonsense. But, yeah, this absolutely mapped out. We took a couple of SQL statements, flipped the bytes on them using a little .NET utility that we hacked together, and it's like, yes, this is exactly what's going on. Next question, why is our SQL Server using big-endian instead of little-endian? Why are the bytes getting flipped? And it turns out they weren't. This was in a data center in a private sort of cloud hosting facility, and they had a bad network switch. And about every three minutes, it would drop one byte from the stream between our Dynamics CRM box and the database server. And when that byte fell out, everything else just got shunted sideways, and the systems had no idea. They're just like, oh, well, clearly there's Chinese in the event logs now. All right. I'll put it over here in case it's important. And then they'd be like, but actually you can't run that. That's not a valid SQL statement. I'll log it here. So, yeah, you know, that was a kind of real interesting crash course on text encoding. And, you know, I learned this thing about Windows and Java and stuff. They use 16 bits internally for most things. And so I started thinking, well, okay, so there's ASCII, right? Figured that out. And then there's this UTF-16 thing. Now, when you get out onto the web, things get much more interesting. Because on the web, network speed matters. The size of your data becomes important because it has a cost. The bigger it is, the harder it is to send stuff around and the slower it is to get there. 
Now, let's take an HTML page. So this is the HTML for hello world, and this time it's in Ukrainian Cyrillic. And most HTML, you know, web pages, even if your web page is in Korean or Hebrew, all of the HTML tags are 7-bit ASCII. All of your JavaScript, I hope all of your JavaScript, is 7-bit ASCII. Your CSS should be 7-bit ASCII. Our programming languages are all based on ASCII. Because this is the whole thing. Developers work with plain text. And if you encode this document using UTF-16, the bits that actually need 16 bits are here. Because the only stuff that doesn't fit in regular ASCII is the two tiny snippets here that are in Ukrainian Cyrillic. And so for all the rest of it, half of every 16-bit unit is a null byte. But we need to send those nulls across the wire if we are using UTF-16. It is a really inefficient serialization format. In fact, 44% of this page is redundant information if we encode it with UTF-16. And so along comes what I think is one of the most beautiful and brilliant hacks in the history of technology anywhere. UTF-8 encoding. Now, UTF-8 says if you have a byte that starts with a 0, that is 7-bit ASCII, 1965 American flavor. That stuff has not changed. Every ASCII document is a valid UTF-8 document as long as it's only using 7-bit ASCII. If you have any byte that starts with a 1, the 1 means this is part of a multi-byte encoding sequence. And if it starts with 1-0, you are halfway through a letter right now. You are going to need to backtrack a little bit until you find 1-1. And if you find 1-1-0, that means this is the beginning of a 2-byte encoding sequence. If you find 1-1-1-0, this is the beginning of a 3-byte encoding sequence. If you find 1-1-1-1-0, that's a 4-byte encoding sequence. Now, UTF-8 stops there because once you get to 5, 6, 7, 8-byte encoding sequences, you can't translate those into anything else. A 4-byte encoding can go into UTF-32, which is like UTF-16, but it's twice as inefficient. But there is no UTF-64 or UTF-128. 
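The leading-bit rules just described translate almost directly into code. The sketch below is illustrative only: `classify_utf8_byte` is a made-up helper name, and the tiny HTML string is a stand-in for the page in the talk, so its sizes will not reproduce the 44% figure quoted above.

```python
def classify_utf8_byte(byte):
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "ascii"          # 0xxxxxxx: plain 7-bit ASCII
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"   # 10xxxxxx: middle of a multi-byte sequence
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"         # 110xxxxx: start of a 2-byte sequence
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"         # 1110xxxx: start of a 3-byte sequence
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"         # 11110xxx: start of a 4-byte sequence
    return "invalid"            # 11111xxx never appears in valid UTF-8

for ch in ["A", "\u0457", "\u20ac", "\U0001F44D"]:  # A, ї, €, thumbs up
    print(ch, [classify_utf8_byte(b) for b in ch.encode("utf-8")])

# Size comparison on a small, mostly-ASCII page with a little Cyrillic.
html = "<!DOCTYPE html><html><body><p>Привіт, світ!</p></body></html>"
utf8_size = len(html.encode("utf-8"))
utf16_size = len(html.encode("utf-16-le"))
print(utf8_size, utf16_size)
```

Because continuation bytes are always recognizable, a reader dropped into the middle of a stream can resynchronize by skipping forward or back to the next non-continuation byte.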
But this does give us, you know, this astonishing amount of headroom in terms of being able to make up new encodings for new alphabets and new languages. And, you know, it turns out that we are not finished inventing alphabets and languages yet. We've got all these amazing ones around the world, but every once in a while, someone goes, hmm, maybe we could come up with a new one. Now, there is a completely new alphabet that's been invented within my lifetime that I bet good money most of you have used already today. It is the language and alphabet called emoji. It came out of a project: a Japanese mobile phone company in the 1990s wanted to do, like, you know, weather forecasts and bus timetables, and they hired an artist to create them this set of emoji. The original platform was called i-mode, and it ran on a company called Docomo. And this set of emoji is in the Museum of Modern Art in New York now. And this took off in a big way in Japan. Like, within a year, you couldn't sell a phone in Japan if it didn't have emoji support. So when Apple in 2008 wanted to launch iPhone in Japan, it had to have emoji. And so people around the world were like, hey, if you switch on Japanese in your iPhone keyboard, you can start sending these little smiley faces and all sorts of things. And as this spread around the world, people started asking some quite delicate questions. You know, like, okay, emoji is cool. The pilot, the police officer, the chef, the firefighter, why are they all men? And why are they all white men? And someone came out with, well, actually, they're yellow men. And it's like, no, no, no, let's not even go there. You know, and lots of people are like, why is there sushi here but there are no tacos? Why are there the flags of these countries but the flag of Palestine is not included? What are you trying to tell me with this set of characters here? 
Now, in 2015, Unicode took a massive step towards kind of inclusivity with the way you can use emoji to communicate. And they did it in a way that I thought was really clever. They added the ability to change the skin tone that's used on faces and hands and those kinds of emoji. And they did it using combining characters. So the same way that we put the little tail on the C earlier, this U+1F44D code point is the thumbs up, that's been there for years. But they said, by the way, here is a combining character. And if your device handset can support it, it will combine these two characters and give you thumbs up with a modified skin tone. And now, every year, the Unicode Consortium announces the new emoji. And they have all these astonishing combinations of gender and identity and family configurations and occupations and all this kind of stuff. And the way they've done it all is to use the flexibility that's built into the Unicode encoding standard. If you want to send someone an emoji of a female astronaut, what you actually send them is a woman, and then a thing called a ZWJ, a zero-width joiner, and a rocket. And if your handset supports that version, then it'll combine the woman and the rocket and give you female astronaut. And you can actually kind of stack these things. You can say, we'll take a woman, plus the dark skin tone, plus the zero-width joiner, plus the rocket, and so you can get female astronauts with a variety of different skin tones. If any of you were at the meetup the other night, Scott Helme said something on the panel show I thought was really cool. He's like, Apple keep putting out updates which are like, get these new emoji. Oh, yeah, and also this update, that security update, that security update. And people will update their phones to get new emoji. People don't care about security, but the new emoji is a really good way of getting everyone to keep their operating system up to date because they want all the new shiny things. 
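The sequences described here can be assembled by hand to see what is actually being sent. This is a sketch under a few assumptions: the dictionary labels and the `describe` helper are invented for the example, the specific skin-tone code points chosen are just two of the five modifiers, and whether a terminal or font draws each sequence as a single glyph depends entirely on the platform.

```python
import unicodedata

ZWJ = "\u200d"              # ZERO WIDTH JOINER
thumbs_up = "\U0001F44D"
medium_skin = "\U0001F3FD"  # one of the five skin-tone modifiers
dark_skin = "\U0001F3FF"
woman = "\U0001F469"
rocket = "\U0001F680"

sequences = {
    "thumbs up, medium skin tone": thumbs_up + medium_skin,
    "woman astronaut": woman + ZWJ + rocket,
    "woman astronaut, dark skin tone": woman + dark_skin + ZWJ + rocket,
}

def describe(text):
    """List each code point in the text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

for label, sequence in sequences.items():
    print(f"{label}: {sequence} ({len(sequence)} code points)")
    for line in describe(sequence):
        print("   ", line)
```

A device that does not know a given sequence simply shows the parts side by side, which is how older handsets degrade gracefully.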
Now, there's an interesting detail about one particular set of emoji which are, you know, used an awful lot, which is flags. If you want to do the, I apologize for this. I've changed the coding, but I have accidentally left the flag of Belarus in here. But if you want to do the flag of Norway in emoji, there is a block of alphabet letters called the regional indicator symbols. And you do regional indicator symbol N plus O, and that gives you the flag of Norway. And if you forget to replace it, you get the flag of Belarus. If you want to do the flag of England, then England is not a country, according to the people who make these kinds of decisions. Great Britain is a country, the United Kingdom of Great Britain, and it gets complicated. For the flag of England, there's a block in Unicode called the alphabetical tags. And this was put in in 2010 because it might be a good idea. And then it was taken out because they're like, no one's using this. And then it was put back in in a big hurry when they realized that handset manufacturers were using this for things like the flag of England, which didn't fit into the country block. You send a black flag, and then you send the tag Latin small letters G, B, E, N, G. And this is how the flag of England is encoded. And the flags of Scotland, Wales, and Texas are also now part of the Unicode standard on that basis. And the pride flag is a white flag, a zero-width joiner, and a rainbow. Now, does anyone know what country this flag belongs to? Bonus points if you can tell me the actual official name of the country that this flag belongs to. Anybody? The Republic of China is the correct answer, not to be confused with the People's Republic of China, which is a much bigger country on the other side of the sea. The Republic of China is the country that most people know as Taiwan. 
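The flag encodings described earlier in this passage can be sketched in a few lines. The helper names `regional_flag` and `subdivision_flag` are invented for the example. Two details below go beyond what the talk says and come from the emoji specification as I understand it: tag sequences end with a CANCEL TAG character, and the white flag in the pride sequence is normally followed by variation selector U+FE0F.

```python
def regional_flag(country_code):
    """Build a flag from a two-letter country code using regional indicator symbols."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())

def subdivision_flag(tag):
    """Build a black flag followed by tag characters spelling the region code."""
    black_flag = "\U0001F3F4"
    cancel_tag = "\U000E007F"  # terminator; an addition beyond what the talk mentions
    tag_chars = "".join(chr(0xE0000 + ord(c)) for c in tag.lower())
    return black_flag + tag_chars + cancel_tag

norway = regional_flag("NO")
england = subdivision_flag("gbeng")
# White flag, variation selector, zero-width joiner, rainbow.
pride = "\U0001F3F3\uFE0F\u200D\U0001F308"

for label, flag in [("Norway", norway), ("England", england), ("Pride", pride)]:
    print(label, flag, [f"U+{ord(ch):04X}" for ch in flag])
```

Whether any of these render as a flag, as letter pairs, or as a plain black flag depends on the font and operating system in use.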
And in mainland China, this flag is seen as a symbol of support for Taiwanese independence, and the People's Republic of China does not like the idea of Taiwanese independence. Now, if you go onto your iPhone, and you go in, and you write a text message to somebody, and you type in Taiwan, it pops up. Hey, you mean that? And you click the flag, and there you go. But then you go into your regional settings, and you go, I'm going to change my phone's region to China mainland. And then I'm going to go back to my emoji, and look at that. The flag has disappeared, and it doesn't appear on the keyboard anymore. This is Apple's solution to how to sell iPhone to the 3 billion people in China who eventually are going to want iPhones. Now, this is a kind of weird thing. This is Apple's solution to the fact that flags can be very, very political and can cause all kinds of problems with international trade and stuff. If you go onto my Twitter profile, I have a couple of flags in there, the EU one and the fact that I'm in Norway right now. And if you want to go in and edit my profile, on macOS, it looks like the one on the left. On Windows, you get the thing on the right. Spot the difference? Windows has no flags. The only flags in Microsoft Windows are the pirate skull and crossbones and the gay pride flag. So Windows' position on flags is that gay pirates are okay and everyone else can get in the bin. Which, you know, what would you do? What would your solution be? And so if you actually, the Twitter web interface, what that does is it injects PNGs, so it replaces the operating system's native emoji with its own emoji character set. But yeah, there are no flag emojis on Microsoft Windows still. Now, there's one more little detail that I want to leave you with. Actually, two more little details I want to leave you with. One, if you remember when we were talking about Francois the archaeologist, we had that little weird, the A and the E kind of mashed together. 
Now, that's not a letter of an alphabet. The underlying text there is still just regular 7-bit ASCII. What that is, it is a rendering concept called a ligature. If you have a look at the logotype for my company, you'll see that the T and the I are kind of stuck together. Now, this is just text, and it's text rendered in a typeface called Lato, and Lato includes ligatures. So when you have a T and an I, if the operating system and the platform support it, there's a little thing that says, hey, use this character for that pair instead because it looks nice. And the awesome thing about ligatures is you can use them in programming fonts. So you can take a chunk of code and you can set it. Anyone recognize this language? Now, the language is F sharp. This is FizzBuzz in F sharp. The actual source code looks like that if you render it in an ordinary monospaced font. But this is being rendered using a typeface called Fira Code, which has rules in it that say, hey, the little minus sign followed by the bracket, that's an arrow, so let's draw that as an arrow. A little vertical pipe with a thing, that's a triangle, draw that as a triangle. And Fira Code includes a whole bunch of concepts for this, like the triple equality in JavaScript. In Fira Code, it's three bars, one above the other, which I absolutely love. One, I think it makes code look cool, and I like things that look cool. It also makes it more readable. It makes it easier to see if you've missed a double equals that should have been a triple equals and all these kinds of things. So go and check out Fira Code. This was the first one that did it. There's now a whole bunch of programming fonts out there that use ligatures to kind of do cool stuff with how they represent programming language operators and things. 
And finally, I want to share the third little turning point on my personal journey with text encoding with you, because a few years ago, I had the great pleasure of visiting Ukraine for the first time, and I had a blast. It's an awesome country. If you get a chance, you should totally go there. And I'm walking around, and I can't even read the alphabet. Like I have no idea what anything says. I can't read the menus. I can't read the road signs. I can't read the billboards. I can read the license plates on the cars. And I thought, yeah, that's really weird. Why would Ukraine use the English alphabet for their vehicle license plates, but they use the Cyrillic alphabet for everything else? And so I did a bit of digging, and I turned up a weird coincidence. 1965, the same year as ASCII, there was a thing called the Vienna Convention on Road Traffic, which was this big international treaty signed by basically every country in Europe about what you had to do if you wanted to drive your car across the border. And one of the things that all the countries signed up to, including the Soviet Union, was that any vehicle crossing a border can only use the Latin alphabet on its license plates. Okay, they signed that. Now, Soviet license plates in 1965 looked like this. And I'm guessing that there was a meeting somewhere where someone said, comrade, how have we signed the Vienna Convention when our license plates look like this? And someone else says, no one is driving their car out of the Soviet Union. That's just not a thing that's going to happen. If you did, if you could get the travel permits and everything, because the USSR was still a closed country, they would give you temporary license plates at the border. And I'm assuming if you didn't bring them back, you would be in a lot of trouble. 
Then when the USSR, you know, collapsed in the 1990s, and Russia and Ukraine and Belarus, all these countries, became independent, they took the opportunity to overhaul their vehicle registration systems. And so today, the license plates in those countries look like this. And the way they did it is they took the Cyrillic alphabet and the Latin alphabet, and they looked for the intersection. They looked for the set of, they're not letters, they are glyphs. They are things that are the same shape, even if they're not necessarily the same sound. And they said, this is the set of characters we will use on our vehicle registration plates. And if you take that set of characters in English and you shuffle them around a little bit, they spell Pike Matchbox. And so that's a little shibboleth for all of you. Next time somebody says to you, oh, I'm sending you this file, it'll be fine, it's plain text, you can turn around to them and go, do you know Pike Matchbox? And if they say, oh, yeah, yeah, then they've been to this talk. And they know about big-endian and little-endian and code pages and UTF-8 and teleprinters and the Cooke and Wheatstone telegraph system, and you can have a civilized conversation about what plain text actually means. But if they say, I'm sending you a plain text file, yeah, you'll be able to read it, and you say, do you know Pike Matchbox? And they say, what's Pike Matchbox? You should hang on to your hat because you have no idea what they are going to send you. Thank you very much.
