The first interstellar software update - the story of repairing Voyager 1 (video, 38 min)

Hello, it's Scott Manley here. I grew up in the 1980s and as a fan of everything space related that meant watching things like Star Wars, Doctor Who and of course Star Trek. Where every week we'd explore strange new worlds, but in real life every few years we'd explore a whole bunch of real strange new worlds thanks to the Voyager spacecraft, perhaps the most important space probes ever launched by humanity. They expanded our knowledge of the solar system more than any other spacecraft. Launched in 1977 they would encounter Jupiter in 1979, discovering things like volcanoes on Io. Saturn was visited in 1981 and they delivered amazing views of the intricate ring system. Now Voyager 1 would fly close to Titan and that was its last planetary encounter, but Voyager 2 would take a different route, getting slingshotted by the gravity of Saturn to carry it on to Uranus where it arrived in 1986 discovering two new rings and ten new moons. From there it went onwards to Neptune arriving in 1989 where it discovered cryovolcanism on the moon of Triton. In the 35 years since then they've kept going, continuing to take measurements of their environment, showing us how the hot solar wind from the sun runs into the cold interstellar medium. Voyager 1 is the most distant human made object and two years ago its communication was interrupted for months and it only returned after what would become the first interstellar software update. Voyager 1 would take its final images in 1990, creating the famous family portrait of the solar system where Earth is barely visible as a small blue dot in a diffraction spike, which is kind of like a lens flare but it doesn't need lenses. Besides Carl Sagan said it best when he described the Earth as a pale blue dot suspended in the sunbeam. 
Now since then Voyager has cruised out into the blackness of space, and as its power sources wane the spacecraft has slowly turned off systems to keep itself alive, but it's continued to transmit telemetry faithfully for 35 years. This meagre trickle of data isn't exciting like the images of new moons, but it's scientifically interesting, and since we don't expect to have replacement spacecraft out there any time soon, there is a small, devoted handful of engineers listening to this whisper from the darkness and sending back commands to keep it ticking over as long as possible. The spacecraft instruments that are still working are the magnetometer, the plasma wave subsystem and the low energy charged particle instrument. There is a cosmic ray detector, but that was actually just turned off a couple of months ago; it was still operating back in November of 2023 when the spacecraft encountered its problem. These instruments provided the scientific evidence showing the spacecraft passing through the termination shock, crossing the heliopause at the edge of the solar system and entering interstellar space. The hot solar wind from the sun runs into the cold interstellar gas. It slows down rapidly, forming a shock wave followed by turbulence, and then finally a drop in temperature and an increase in density, a change in the magnetic field and an increase in cosmic rays. Now this process was observed over decades. Real-time data would be sent back slowly at bit rates of about 160 bits per second. Voyager doesn't care that it's got a slow internet connection. It's playing the long game, trickling a steady stream of measurements back for decades on end, and then on November 12th, 2023, this suddenly stopped. The ground stations that were listening lost sync. They could still see a signal from the spacecraft coming down at the correct frequency. They could see modulation, the same used by the spacecraft, but the data was garbage.
There were no recognizable synchronization markers to signify the frames of data. They could even see periodic changes in the signal strength which were consistent with the spacecraft's attitude control keeping the spacecraft pointed within the dead bands. Voyager wasn't dead. It was talking to us, but we couldn't understand what it was saying, or even if it was saying anything at all. The Voyager missions are managed by the Jet Propulsion Laboratory in California. JPL has a small team of engineers keeping the Voyager spacecraft working. Most of these engineers had worked on the planetary exploration campaigns during the glory days of the mission, and they stuck around while other engineers retired or moved on to newer, possibly more exciting missions. While these engineers are the world experts on these two spacecraft, they're a long way from the inception and construction of the mission, and there's a lot of tribal knowledge that has been lost. A lot of the documentation is scattered through years of papers, reports and memoranda. They're thankfully mostly in digital form, but that doesn't mean they're easily searchable, and some of the tools that the team had used in the past no longer worked. Software that was designed in the 1970s to run on mainframes hadn't necessarily been ported to new platforms unless it was something the engineers were using regularly. Anyway, the engineers set to work on the problem, and the first step in diagnosing it was sending commands to the spacecraft and looking to see if they could get any response. Now there were some easy things that they could do, like just asking the spacecraft to reset components in the hope that turning it off and on again would fix the telemetry stream. These tests showed that not only was the spacecraft not dead, but it was actually responding to some commands. They couldn't fix the telemetry, but they could trigger changes in the carrier frequency or the sub-carrier separation.
Now this whole process of diagnosing this takes a long time. The spacecraft was about 22 and a half light hours from earth so the round trip of the signal was 45 hours. They would send a signal and not see a response for a couple of days. Frequently they would see nothing. They'd also typically get one opportunity to uplink commands per week so the troubleshooting would follow a weekly cadence. They'd get a few days to analyze and plan and build the command sequences. They would send them perhaps late in the week, wait a couple of days and then see what happens over the weekend and then by Monday morning they were ready to figure out what to do next. Now based on the fact that they could see some response to the commands and that the attitude control system was working they believed that this showed that two of the three computer systems were working fine and that the third computer was the leading suspect. Voyager's three computers were designed for specific tasks. Actually there were three types of computers but there were six computers in all because everything had backups. There were duplicate processors and then duplicate memories and because they had duplicate computers they could actually sometimes run the computers in parallel if they needed the performance but at this point in the mission power was so low that they were only running single string computers. Now the first computer was the computer command system. This was the same as the computer system that was used in the Viking mission orbiters. It's the primary controller for the spacecraft. It provides sequencing and control functions. The CCS contains fixed routines such as command decoding and fault detection and corrective routines, antenna pointing information and spacecraft sequencing information. The CCS was probably working fine because there was evidence that it was responding to the commands. The second computer system is the attitude and articulation control system AACS. 
This controls the spacecraft orientation, it maintains the pointing of the high gain antenna towards the earth, controls attitude maneuvers and it positions the scan platform. It's important to understand that the cameras and a bunch of other instruments are attached to the scan platform, which can point at different targets, so if the spacecraft needed to adjust its orientation, the scan platform needed to adjust its pointing at the same time. Now originally this computer was going to be a new design called HYPACE, the Hybrid Programmable Attitude Control Electronics, and this had a bunch of new features over the CCS, but for budgetary reasons they were forced to switch to a modified version of the CCS. The processor was clocked faster for the performance needed, and it added a memory interface that enabled a bunch of new addressing features and other stuff that was needed. The AACS was probably working because the spacecraft was correctly holding attitude and pointing the antenna at the earth. If it wasn't, we wouldn't be hearing the spacecraft. So now that left the flight data subsystem. The FDS is responsible for talking to the scientific instruments, the data recorder and everything else, and formatting all this data for downlink. If this system had failed, you could see that perhaps the data being sent to earth may not be what we wanted. So the FDS is the computer we know the least about. It was only ever used on Voyager. It was different from the other computers because it had to process data at a much higher speed than the other systems, and for this it included one huge innovation. While the other computers used old-school plated wire memory, the flight data subsystem was the first spacecraft computer to use dynamic RAM. No more slow magnetic-based memory. Instead this was fast silicon-based volatile memory storage. The FDS used 8192 words of 16-bit memory and this was implemented using 256-bit memory chips. The CR4061 chips to be specific.
That meant that they needed 512 chips to make this work, and then on top of that you need some extra silicon to implement the addressing interface. So this was packaged as four boards with 128 memory chips on each and about 16 other chips for the decoding hardware. So that's 144 chips per board. Four boards for the memory, and then there's actually two memories on each computer for redundancy. So yeah, this was cutting edge for the early 1970s. Now this transition to semiconductor memory was seen as very risky but necessary. One concern was that the memory did not retain its contents if the power was lost, and the flight data subsystem software existed only in this volatile memory. There wasn't any backup storage to reload programs from. If the memory was lost because a relay failed or somebody accidentally powered the memory down, they would be in serious trouble. So in an unusual step the FDS memory was wired directly into the spacecraft's radioisotope power generators. There was no way to disconnect them. The engineers reasoned that if the memory failed due to a lack of power, the rest of the spacecraft was probably already dead. Now the FDS memory system on Voyager 1 had already had a significant failure early in its flight. In 1981 one of the banks became unusable because every memory read on the affected bank would return one bit stuck high. One engineer remarked that it might have been fixed by turning the unit off and on again, because hey, that's what tech support does. But that process was impossible since it was hardwired to the power source. So since then they've been running on single string memory with no redundancy for the majority of the mission. So the engineers troubleshooting the communications problem knew that this kind of memory failure could be the root cause. But there could be many other problems in the FDS hardware, and they needed to know what they were dealing with before developing a fix.
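The chip counts quoted above follow directly from the memory size. As a quick check, here is a minimal Python sketch of that arithmetic. The constant names are illustrative, and the figure of 16 support chips per board is the approximate number given in the talk, not a documented value.

```python
# Figures as quoted in the talk; names are illustrative only.
WORDS = 8192                   # FDS memory size in words
BITS_PER_WORD = 16
BITS_PER_CHIP = 256            # capacity of one memory chip
BOARDS = 4
SUPPORT_CHIPS_PER_BOARD = 16   # "about 16" decoding chips per board (approximate)


def memory_chip_budget():
    """Return the chip counts implied by the quoted memory size."""
    total_bits = WORDS * BITS_PER_WORD
    memory_chips = total_bits // BITS_PER_CHIP
    memory_chips_per_board = memory_chips // BOARDS
    return {
        "total_bits": total_bits,
        "memory_chips": memory_chips,
        "memory_chips_per_board": memory_chips_per_board,
        "chips_per_board": memory_chips_per_board + SUPPORT_CHIPS_PER_BOARD,
    }


if __name__ == "__main__":
    for name, value in memory_chip_budget().items():
        print(f"{name}: {value}")
```

This covers one memory only; the redundant second memory on each computer would double the totals.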
They'd have to diagnose this because they didn't want to take a misstep and break the computer. Diagnosing the specific problem was going to be the biggest challenge because it wasn't talking to them. Deciding which commands to send and how to read the results was going to take them months. They hoped that they could issue the right command and just unstick the system, but they wanted to be careful. For example, if the FDS processor was the problem, then by switching over to the other FDS, which was still in backup, they might be able to fix everything. But that was considered risky even if they had healthy telemetry. They didn't know the state of the other system. It's the kind of thing where you want to be sure it's got a chance of working before you take that massive risk. So the engineers developed a fault tree and the commands that could be sent to step through the tree as carefully as possible, eliminating things that weren't at fault and narrowing it down to the things that were causing the problem. And this cautious approach is the reason why they spent three months poking the spacecraft and not getting the breakthrough they dearly needed. The FDS computer isn't like modern systems. There's no operating system that manages all the tasks required. The programs operate as a small number of slices that run sequentially. Every 2.5 milliseconds there's a hardware interrupt generated which resets the program counter to the start of the memory, and so it would never get stuck in an infinite loop somewhere. Every time it restarts, it looks at the mode that it's in and it executes the slice of code associated with that mode, and it runs sequentially through 24 different slices that are in a sequence. Now each of these small slices has to execute to completion within 2.5 milliseconds or it will get interrupted and possibly leave the system in a bad state.
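To make the time-slicing idea concrete, here is a minimal Python sketch of a fixed-interval slice schedule. It is only a toy model under stated assumptions: `mode_table` is a hypothetical mapping from a mode name to the simulated execution time of each of its 24 slices, and none of the names come from the real FDS software.

```python
SLICE_MS = 2.5          # interval between hardware interrupts, from the talk
SLICES_PER_LINE = 24    # number of slices in one pass, from the talk


def run_line(mode_table, mode):
    """Simulate one pass through the 24 slices of the given mode.

    mode_table is a hypothetical dict: mode name -> list of 24 simulated
    slice execution times in milliseconds.
    Returns (elapsed_ms, overruns), where overruns lists the indices of
    slices that would have been cut off by the next interrupt.
    """
    costs = mode_table[mode]
    if len(costs) != SLICES_PER_LINE:
        raise ValueError("each mode needs exactly 24 slice costs")
    # The interrupt fires on a fixed timer, so a pass always takes the
    # same wall-clock time whether or not a slice finished its work.
    overruns = [i for i, cost in enumerate(costs) if cost > SLICE_MS]
    elapsed_ms = SLICE_MS * SLICES_PER_LINE
    return elapsed_ms, overruns


if __name__ == "__main__":
    table = {
        "healthy": [1.0] * 24,
        "too_slow": [1.0] * 5 + [3.0] + [1.0] * 18,
    }
    print(run_line(table, "healthy"))
    print(run_line(table, "too_slow"))
```

The point of the sketch is the constraint rather than the mechanism: any change to the code in a slice has to keep that slice under the fixed time budget.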
So the 24 slices at 2.5 milliseconds is 60 milliseconds and in the documentation this is referred to as a line because apparently that's how long the TV camera would take to read out one line of an 800 pixel image data to the onboard tape recorder. 800 lines by the way make up a complete frame and that's also tracked that takes 48 seconds. Now I'm going into this particular detail because we actually have some code for this that was shared during a couple of presentations and it's code that starts every 2.5 milliseconds it reads the mode variables and then it uses those variables to jump to the correct routines. Now this data is all in assembly language and moreover it is in an assembly language for a machine that was built in the 1970s and all the documentation used to be in paper form but it's not really public but you know point is we don't know anything about this system but as somebody that has programmed a lot of assembly language in the past I can make out some of the stuff that's going on and based on the comments get an idea of what some of this does. So the first column of this is the memory address and the second is the 16-bit data that's at that address and then there's things like labels for jumps opcodes operand and of course those very helpful comments. So the first four bits of each word are typically the opcodes and the obvious example here is the jump instruction which if you look all of the jump instructions start with the zero which in nerdy computer way is actually really convenient because if you think about it if you write a memory address to a memory location with the four top bits being zero you can just jump to that memory that location and then it will then jump to whatever memory address you wrote in there that's a kind of elegant thing there. 
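The word layout described above can be sketched in a few lines of Python. This is a hedged illustration, not the real instruction set: it assumes the top four bits are always the opcode and the remaining twelve bits are the operand, and it takes opcode 0 to mean jump because the jump instructions in the listing start with zero. The function names are made up for this example.

```python
OPCODE_BITS = 4
OPERAND_BITS = 12            # assumption: the rest of the 16-bit word
OPERAND_MASK = (1 << OPERAND_BITS) - 1
JUMP_OPCODE = 0x0            # jumps in the listing start with 0


def decode(word):
    """Split a 16-bit word into (opcode, operand)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return word >> OPERAND_BITS, word & OPERAND_MASK


def is_jump(word):
    """True if the word would be read as a jump under the assumed layout."""
    return decode(word)[0] == JUMP_OPCODE


if __name__ == "__main__":
    # A plain address with the top four bits clear doubles as a jump to
    # that address, which is the convenience mentioned in the talk.
    stored_address = 0x0123
    print(decode(stored_address), is_jump(stored_address))
```

Under this layout, writing an address such as 0x0123 into a memory word produces something the processor can execute directly as "jump to 0x123".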
There's one little bit at the end by the way I want to mention the memory uses 16-bit words and as I said four bits are the opcodes and that means the last 12 are the address and the computer savvy people out there know that 12 address bits can access 4096 memory locations which if you remember is only half of the 8192 words of memory so if you look at the third line from the bottom labeled go up that instruction uses the out opcode which is usually how the processor sends signals out of the computer and into external devices and what that's doing there is it's setting an external flag on the memory addressing system so that when it jumps through that address it's going to switch to the upper half of the memory there's actually two bits to this there's one for reading the instructions and there's another one for like actually executing the memory reads but anyway like this is a small detail but it will actually be relevant later. Assuming the flight data system was still operating one plausible explanation for the communications outage was that the program was stuck in its execution possibly because it was reaching some corrupted memory and then the execution would fail and it was reasoned that perhaps they could switch the program into another mode and by switching into a different mode the execution flow would follow some different path through the memory and possibly avoid the corrupted location so they wanted to try switching it into another mode but not knowing which modes might work the plan was to just cycle through all of the different modes the spacecraft communications supported to give them like 10 minutes each and see if anything showed any restored telemetry so they built the command sequences to do this and they were hours away from sending it when one of the team realized this wouldn't actually work. 
You see, the FDS software wasn't supposed to change modes in the middle of a frame. All the command to change mode would do was write the requested mode into FDS memory, and then when the FDS finished the frame, the final thing it did was check this location and update its internal structures to trigger the mode for the next run through the frame. But if the current mode's software was getting stuck, it would never reach the end of that sequence, and so it would never switch to the requested mode.
The Voyager FDS initially supported something like 30 different modes. The modes had names like GS4, CR7, PB5, etc. Generally the lower the numbers, the higher the bit rates, and the letters specify the primary task. GS is general science: reading instruments and downlinking data immediately. IM is for real-time imaging: downlinking slow-scan imagery. Conversely, PB is for playing back from the onboard recording system. Then there are the CR modes, which is cruise, and these are important because they would send low-bit-rate data via the lower-gain antenna, which doesn't need to be pointed with the same degree of precision, so the attitude control dead bands could be relaxed, saving attitude control propellant. Finally, there are two engineering modes: EL40, engineering low at 40 bits per second, and EH12, engineering high at 1200 bits per second. These are the modes that they really wanted to be in for troubleshooting, but they didn't know if they would necessarily work.
So anyway, even though the mode switching code wasn't being run, they knew what it was supposed to do: it would have to write some magic numbers to a few places in memory so the code execution would change over to the new mode. And there was a command, which they figured out could be sent, which would set a word in memory at a specific location to a specific value, essentially poking a number directly in there. I use the term poke because that is a common computer command on my old 8-bit microcomputers. You would use this to modify a computer program in memory, just change one value, and generally what I would do is change a value such as the number of lives in a game so I could cheat and get past those parts in the game where I always die. In the same way, they were going to poke some stuff into memory to get around the mode switching code that always died. This was, however, considered risky. It was a feature which had never been used on either spacecraft in their entire history, but as they had run out of other options, the risk-reward equation turned in favor of new and untested ideas. So they rebuilt the command sequence to cycle through the modes for 10 minutes each, using the poke commands to directly set the parameters at the end of each 10-minute section.
This sequence was sent up on a Friday and the team went home for the weekend. Sunday rolled around, and sure enough they saw some changes to the signals being sent, but still the receivers on the ground were unable to find any data. They were unable to synchronize to these signal streams. Normally data from spacecraft is modulated and formatted with markers which signify where the data packets start, so the hardware can look for these packets and synchronize its decoding functions. However, the engineering team had the capability to look more deeply at the data that was coming in. The raw signal samples were being recorded and analyzed by an expert in RF and signals. This individual had already helped rule out various problems using this raw, or open-loop, data; for example, the signal strength was what showed that the attitude control system was working fine. So this person was looking at the raw data coming back during the mode switching test, and they saw that one period was something they needed to look into. To their trained eye the telemetry stream had changed. It wasn't sending properly formatted data, but now the bits being received followed a different pattern, a different type of nothingness. It looked like there might be something in there. The mode, by the way, was PB15, which if you remember is a playback mode, where it presumably was designed to read data from the tape and then send that directly out via the telemetry system.
So the plan was to repeat this manual mode setting and this time let it run for longer so they could get more of this random data, hoping that maybe there were some clues about the state of the spacecraft to be gleaned from it. So the week passes and they do a one-hour run in this state, and the open-loop signals experts start processing this data using custom tools and human intuition. In technical terms, the Voyager spacecraft uses binary phase shift keying on the subcarrier, which means they just flip the signal at the symbol rate through 180 degrees of phase. The data uses a convolutional code for error correction, and there are actually guides online for decoding this signal using freely available tools, if you just happen to have a 70-meter radio dish sitting around so you can receive it.
So the data comes down and the radio engineers start converting it into a stream of bits, and they share this around the team to see if they can get any clues from the random patterns of ones and zeros. They didn't know what they were looking for or what they might find, and it took a couple of days of mulling this over before someone realized that the patterns matched up to, well, memory from the flight data system. This is the kind of thing that you might recognize if you're, say, writing super low-level machine code as opposed to higher-level programming languages like C or Fortran. This data wasn't using the conventional memory dump routines, because that would include generating the correct headers and other structures. This was just stepping through the memory in a loop. Now the team working on this have admitted they don't know exactly how this happened, but I'm going to speculate and presume that the playback mode had enabled the modulation subsystem to start reading the memory directly and sending it down the pipe. You would imagine that normally this would be pointed to a specific spot in memory where you'd written some data, but the broken code wasn't executing correctly, and it was probably starting at zero and just counting up and then looping around, so you got the entire memory dumped sequentially.
This, of course, was a massive breakthrough. Suddenly they had a window into the state of the system, and maybe that might show them what was broken, enabling a fix to be developed finally. So with the shiny new memory dumps in hand, the first thing to do was to compare them against what they expected them to be. They had good dumps from prior to this anomaly, and when they aligned the dumps over each other they found a 256-word sequence of memory, starting at address 1400 and going all the way through to 14FF, where every word had the fifth bit set to one. So about half the memory locations were corrupted, since typically about half of them would have had that bit set anyway. These were all bits on a single memory chip, and it's possible that that single chip went bad, or simply that the data and address lines running to it had failed. Regardless of the hardware error, the result was that any time the software execution entered this section of memory, it would likely hit an affected instruction and the code would fail. And the collection of subroutines in this region were all called at some point by basically every single telemetry mode, meaning that everything would fall apart at one point.
So now this explained the problem, and with this knowledge came the possibility of developing a fix. They would need to rewrite the code in the system so this damaged portion of memory would never be used. But this meant developing new code for a computer which is practically one of a kind, a computer that they didn't have working hardware for on the ground. They didn't have a software emulator or simulator, or other tools to manipulate code. They had none of this stuff. What they did have was a few examples of code in the documentation that they had accumulated over the years.
Early in the troubleshooting process the team had been trying to develop a simple program that they could run on the FDS to help them diagnose the problem, and in an old memo they found a reference to something called minsimrot, the minimal command routine. That would run on the FDS and it would send back some basic telemetry. It's the equivalent of a hello world program. Now this wasn't a bit of code that they could just download from their archives and run, even if they had the simulator. They found the code in a memo, a document which had been scanned via OCR and then converted into a Microsoft Word document, and down at the bottom of it there was a comma-separated list of 267 16-bit hex values which they surmised must be the code. They then had to hand-disassemble it and verify that it was doing something sensible. I mean, they weren't even sure it was the right code to begin with, or if the scanning process had correctly transcribed all the characters. So it was a manual disassembly to figure out what it was doing, and then, having understood it, they needed to develop a version with a few more useful features, such as the ability to dump the full memory correctly. This software development process had started soon after the initial failure, long before these memory dumps had come up, but it was being developed in parallel with the spacecraft troubleshooting, which is good, because when the team identified the memory corruption problem after several months, the minimum command routine, or minsimrot, was ready to go.
So having demonstrated now that the low-level memory poke would work to set locations in memory, the plan was to use the same command hundreds of times to uplink this minsimrot software one word at a time. And that worked. The software went up, and two days later they got the equivalent of hello world back from deep space. Finally they had a foothold on the hardware, and Voyager was talking back to them in the way that they wanted.
Now that they had this foothold, they needed a permanent fix that could restore the proper telemetry routines, and the way to do this would be to move the code in the affected memory into a new area of memory that was unaffected, then find everywhere that was pointing to that code and update it to point to the new, relocated code. So they needed to find a chunk of memory that contained old code that was no longer being used, because they didn't exactly have free memory. After all, over the last few decades they had turned off a bunch of instruments, and there were probably some bits of this that weren't needed anymore. Furthermore, since the affected memory was at address 1400, that meant that it was in the upper memory bank, and since jumps between memory banks required extra flags to be set, that added complexity to the calling and changed other things: you could break the instruction timing if you had to jump to the wrong section of memory. So they could only look for free memory in the upper memory bank, but after lots of searching they couldn't find a contiguous chunk of memory that was unused.
So, well, they needed to make some free memory, and they decided to sacrifice the code that was being used, or could be used, by the EH12 mode, that is, the engineering high-data-rate mode. After all, the high data rates were getting harder to use as the spacecraft got further and further from the sun, and they had the engineering low EL40 mode that would let them work at 40 bits per second. That was all they would need if they were going to deal with a future anomaly. So that would get them the required memory, but there were still routines in the EH12 mode that were used by other modes, so those needed to be preserved, and that meant that while they had freed up enough memory, it wasn't a 256-word contiguous section. So the code relocation was going to have to split up that chunk of code into different areas to make it work. And remember, all this would have to be done with no tools, poor documentation and source code listings that may not have been scanned correctly. But on the bright side, they only had eight kilowords of memory to work with, so it was something that a human could reasonably manage using old-fashioned brain power.
There was one major piece of code that they got really lucky with. The copy subroutine was used to copy memory from one memory bank to another, but since memory bank B had been dead since 1981, this wasn't useful. It wasn't needed, so they didn't need to move this code at all. They could just replace everything that called it with a no-op. Well, not quite a no-op, because they couldn't be sure that no other code relied on the exact timing of the call being the same, so the calls were actually replaced by wait opcodes that simulated the call and all the execution that would have happened there.
Now the rest of the code you couldn't fake. You had to move it, and some of it had to be carefully split up into parts. Some subroutines could safely be transcribed into the free spaces in memory, but others were simply too large, and they had to be split up into parts, hopping from one sliver of free memory to the next until they completed their task. But you can't split this too often, because every time you jump you're modifying the timing, and remember, there's that hardware interrupt which stops and restarts everything every 2.5 milliseconds, and if the new code takes too long it might still break the program even if it would otherwise execute the correct stuff. And again, all this is being done by hand with minimal tools. The team members would be checking each other's work to make sure there were no possible bugs before sending the patches off into the void.
The first patch was to get the EL40 low-bit-rate engineering mode working. The code changes were in the form of memory addresses and the contents of those addresses, and those would get sent up in a sequence of commands setting one word at a time, hundreds of these commands. The first patch would place all of the routines into their new locations. Then, once that was done, the next patch would update all of the jumps to point to the new routines. And so on April 18th the final commands were sent. On the 19th, Voyager 1's flight data system, in deep space, broke out of its loop, and on the 20th there was much rejoicing on the ground as they received the first telemetry in five months: the first interstellar software update.
Now that wasn't the end of the process. They'd fixed engineering mode, but they then needed to bring the science modes back, and that required finding even more pieces of free memory and again moving the remaining sections of code. This involved hunting through lots of old documentation and more hand-tuning of machine code. A month later, by May 19th, they had the cruise science mode running, so they could get real-time measurements from the remaining instruments, and then it would take until July to bring back the general science and the playback modes. With that, the spacecraft was restored to its original functionality. For eight months, Voyager 1 had lost its mind, and thanks to heroic efforts by engineers it had found it again. In that time it had traveled 300 million kilometers further from the sun, and the round-trip time was now half an hour longer.
Earlier this year the spacecraft powered down the cosmic ray detector, preserving power for the remaining systems and letting the spacecraft operate just a little bit longer on its journey into deep space. It's been going for almost five decades, and while other spacecraft of that era have been retired, the Voyager spacecraft's unique position lets it survive, lets it live on. They can return data that isn't available in any other way. There's no interstellar probe currently planned, and even if there were, it might take decades to surpass Voyager 1. So I'm hoping that the Voyager spacecraft keep going as long as possible, putting off the inevitable decay of their power supplies. The 50-year mark is almost upon us, but that's just an arbitrary anniversary. 2030 seems within reach based on current power limits. One day the Voyagers will inevitably fall silent, but the engineers will keep them alive as long as they can, if given the opportunity. The spacecraft are going places, and we can go along with them on the ride for a little while longer. I'm Scott Manley. Fly safe.
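The dump comparison described in the transcript, lining up a known-good memory dump against the new one, can be sketched in Python as below. This is an illustration on synthetic data, not the team's actual tooling. It assumes that "the fifth bit" means mask 0x0010 (counting from the least significant bit), that one 256-bit chip holds one bit position for a block of 256 consecutive words, and that addresses at or above 4096 fall in the upper bank. All function and constant names are hypothetical.

```python
CHIP_WORDS = 256     # assumption: one chip covers 256 consecutive words
BANK_WORDS = 4096    # words reachable with a 12-bit address
STUCK_MASK = 0x0010  # assumption: "fifth bit", counted from the LSB


def summarize_corruption(good, bad):
    """Group differing words by 256-word block.

    Returns {block_index: (combined_xor_mask, changed_word_count)}.
    """
    blocks = {}
    for addr, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        if diff:
            block = addr // CHIP_WORDS
            mask, count = blocks.get(block, (0, 0))
            blocks[block] = (mask | diff, count + 1)
    return blocks


def make_demo_dumps():
    """Build synthetic dumps: each good word simply equals its address."""
    good = list(range(8192))
    bad = list(good)
    for addr in range(0x1400, 0x1500):
        bad[addr] |= STUCK_MASK   # simulate one bit stuck high
    return good, bad


if __name__ == "__main__":
    good, bad = make_demo_dumps()
    for block, (mask, count) in summarize_corruption(good, bad).items():
        start = block * CHIP_WORDS
        bank = "upper" if start >= BANK_WORDS else "lower"
        print(f"block {start:04X}-{start + CHIP_WORDS - 1:04X}: "
              f"mask {mask:04X}, {count} words changed, {bank} bank")
```

In the synthetic data only half of the 256 words in the block actually change, because the other half already had that bit set, which mirrors the "about half the memory locations were corrupted" observation.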
