Layerwise Lectures invites viewers to reflect on memory and its analogies in the computing world. The episode begins by comparing the RAM in laptops to the potential memory capacity of the human brain. It points out that computer memory is measured and addressed in terms of fixed physical locations of data, which contrasts with how memory works in the brain. Instead of definitions that refer to location, the author proposes an approach that focuses on memory content, known as associative memory. This perspective is crucial for understanding how our memory operates: instead of searching for a specific address, our brain can 'autocomplete' memories from incomplete information.

The episode also introduces the Hopfield network, a mathematical model of associative memory built from neuron-like units. It notes that unlike computers, where memory is tied to specific addresses, such networks rely on the dynamics and behavior of their units (neurons). The author explains that neurons form networks that communicate through electrical signals, and that each neuron's activity depends on the sum of signals from other neurons, each multiplied by the strength of the connection along which it arrives. This process is more complex, but it is key to understanding how memory functions in our brains.

Building further on the topic, Layerwise Lectures investigates how memories can be stored within such networks. An example of a 64-neuron network that 'memorizes' a pattern of 8x8 binary pixels illustrates how these networks gravitate toward stable memory states, and how unpredictable the outcomes can become once more than one pattern is stored. The episode then turns to Hebbian learning, in which the weight between two neurons is determined by the product of their states in the stored memory. This principle reflects the functioning of synapses in biological neurons.

Layerwise Lectures also analyzes the limitations of this memory model. While such networks do exhibit memory-like behavior, the author emphasizes that their memory capacity is limited, which can lead to problems such as memory fusion. Examples show that although the network tends to settle into its stored memory states, different memories stored at the same time can interfere and blend into one another. This raises questions about how our minds organize knowledge.

Finally, Layerwise Lectures sums up why comparisons between computer memory and the human mind are so tricky. It argues that neural networks operate in a completely different way: their memory is not something static, as in a computer, but dynamic, which opens up new ways of thinking about memory. The episode has garnered significant interest, with 785,685 views and 33,397 likes, showing that the topic of memory and its models continues to fascinate viewers.

Timeline summary

  • 00:00 Discussion of typical RAM sizes in laptops and how RAM interacts with the CPU.
  • 00:15 Asking how much memory the human brain has by comparison.
  • 00:50 Introducing the concept of associative memory versus traditional computer memory.
  • 01:47 Explaining how computer memory functions in binary terms.
  • 02:56 Contrasting computer memory retrieval with how memory works in the human brain.
  • 03:21 Introducing the Hopfield network as a mathematical model of associative memory.
  • 05:50 Describing a neural network's ability to evolve and return to certain states.
  • 06:39 Explaining how states in the network change over time.
  • 08:35 Discussing how connection weights between neurons shape their behavior.
  • 09:45 Exploring challenges in storing multiple memories within a neural network.
  • 13:21 Reframing memory capacity questions to focus on stable memory patterns.
  • 14:00 Acknowledging the limitations of the neural network model in practical applications.
  • 14:22 Encouraging viewers to rethink the nature of memory as dynamic rather than static.
  • 14:54 Concluding with the idea of networks as dynamic systems with stable states.

Transcription

The amount of random access memory, RAM, in a typical laptop is probably around 8 to 32 gigabytes. That's the part that directly interacts with the CPU. Aside from that, your hard disk might have, say, another terabyte or so of memory. But how much memory do you, that is to say, the human brain, have? And can we even measure it in bytes? But maybe that's not even the first question we should ask. However much memory the brain has, where is it? Because every piece of memory in a computer has a physical location. To access a piece of data in RAM, for example, you have to know the binary address associated with that location. In fact, for the CPU, this really comes down to turning on just the right wires to retrieve the bits at the desired location.

Now imagine a different kind of memory. Instead of specifying the where of a memory, its binary address, how about we could specify the what, its content? A memory system where, if we provide an incomplete version of the memory, it just sort of autocompletes. Of course, with the right software, your computer can already do this. But it's not how computer memory works at its most basic level. The point of this video is to convince you that autocompleting memories, also known as associative memory, is a kind of natural behavior of networks of neurons. With that, it'll become clear that it doesn't really make sense to measure memory capacity in networks of neurons in the same way we measure computer memory. The biggest difference might be this: computer memories have a place, a fixed location, but as we'll see, the memories in an associative network rather have a time.

Computer memory is measured in bits, binary switches of ones and zeros. A string of eight such bits can represent anything from letters to integers. For our purposes, let's visualize them as patterns of this kind, like these 64 bits representing this 8x8 image of binary pixels. I always find that there's a piece missing from the story of bits as memory, and that's the following. How do I get to a memory once it's saved, say, in RAM? Because on its own, it doesn't do much. It's only when we retrieve it that it becomes useful. So how do we? Well, broadly speaking, and I'm glossing over tons of technical detail here, every piece of data in RAM is matched to a binary address. And this binary address really eventually boils down to a set of wires, in this case eight, that are either turned on or off. Each piece of data is in a different physical location and can only be retrieved by knowing its address. How the reading and writing of memories is accomplished is really the meat of programming and is another story. What I want you to remember is the peculiar fact that memories are matched to addresses, and that's ultimately the only way to retrieve them.

Contrast this with what we believe about the brain. There isn't a central orchestrator, like a CPU, and there aren't any addresses. Rather, there is a constant buzz of activity of many independent units called neurons. In this video, we'll try to make some sense of that buzzing activity of networks of neurons by introducing a mathematical model called the Hopfield network, named after the author of this 1982 paper. And as much as this has to do with memory, more generally, this video aims to be a lesson in modeling itself, which I always think of as the art of the essential.
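To make the bit patterns above concrete, here is a minimal sketch in Python with NumPy of an 8x8 binary image flattened into a 64-element vector, the kind of object the rest of the video works with. The image contents are made up, and the mapping to -1/+1 anticipates the coding the video adopts a little later for neuron states.

```python
import numpy as np

# A hypothetical 8x8 binary image, standing in for the pattern shown
# in the video: 1 = pixel on, 0 = pixel off.
image = np.zeros((8, 8), dtype=int)
image[2:6, 2:6] = 1  # an arbitrary filled square, purely for illustration

# Flatten to 64 bits, then map {0, 1} -> {-1, +1}, the coding used
# later for neuron states.
state = image.flatten() * 2 - 1

print(state.shape)       # (64,)
print(np.unique(state))  # [-1  1]
```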
This is a neuron. I'm almost certain you've seen something like this before. But it sometimes pays to remember why we always come back to this when we want to understand things about the brain. The reason is, I think, that it has a rather simple behavior. It integrates electrical signals from other neurons to determine its own activity, and then it broadcasts that activity back to the network. Mathematically, the story goes something like this. There are electrical signals coming in from other neurons, which we will say are just some numbers. Then the synapses act as multipliers on these signals, another set of numbers. And then the activity of the neuron is based on the sum of the weighted inputs. And by based on, I mean that it's fine to apply some function after computing the sum. And that's it.

So it gets interesting once we turn this into a network, connecting the outputs of neurons to the inputs of other neurons. This is a special type of neural network. It's a recurrent network, meaning that there are back-and-forth connections between the neurons. I haven't drawn them, but remember that any such edge is actually two edges, so that the two neurons influence each other.

Okay, there are details here that we need to get into. But first, what does this have to do with memory? Well, it needs to be somewhere in here, doesn't it? Where? Remember the idea of an associative memory, which is the ability of a system to sort of pattern-autocomplete? Let's try a definition of memory that's slightly wider than maybe what we're used to. Let a memory system be a system that, after having been in a certain state, a configuration, has the ability to return to that state later on. Now, our computer memory from earlier actually has this property if we include the CPU in the memory system. Our network seems different, though. So let's get creative. There are other things in our everyday lives that fall under our definition of memory. And one might be, and hear me out on this, a simple plastic bottle. If it's crushed, in other words, its configuration changed, it can sometimes return to its earlier state, which in that sense could be said to have been memorized. And the metaphor is not arbitrary. I actually do think that networks of neurons are kind of like that. What I mean is, a neural network is a system with a pattern of activity that dynamically evolves. If somehow we could construct our network such that it would have some preferred state and would return to that state over time if it was perturbed, then that could reasonably qualify as a memory.

This is a network of 64 neurons that I cleverly constructed such that it memorized this pattern of 8x8 binary pixels. So what's going on? To describe what this model is actually doing, we need to take the following steps. Remember I said time was important? We need to describe how the activity of the network changes over time. And then there's the question of learning. How do we actually imprint memories into the network? And this will have to do with the connections between the neurons. Finally, we need to understand if and when the network converges to its memory states. The crucial ingredient of our network really is the fact that it is a dynamical system. Its activity changes over time. By activity, we mean that each of the now 16 neurons in the network is described by a number and that this number is a function of time. And let's just assume that time moves forward in discrete steps.
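The mathematical story just told, incoming signals, synaptic multipliers, a sum, and a function applied on top, fits in a few lines. This is only an illustrative sketch with made-up numbers; the choice of a sign-like threshold for the function applied after the sum anticipates what the video settles on a bit later.

```python
import numpy as np

def f(x):
    # The function applied after the weighted sum; assumed here to be a
    # sign-like threshold, so the neuron's activity is either -1 or +1.
    return 1 if x >= 0 else -1

# Electrical signals arriving from a handful of other neurons (just numbers).
inputs = np.array([+1, -1, -1, +1, +1])

# Synaptic strengths acting as multipliers on those signals (more numbers).
weights = np.array([0.5, -0.2, 0.1, 0.9, -0.4])

# The neuron's activity is based on the sum of the weighted inputs.
activity = f(np.dot(weights, inputs))
print(activity)  # 1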
Furthermore, since we are interested in binary memory states, we'll assume that activity means that the neurons can only be in one of two states. Inactive, say, minus 1, and active, say, plus 1. Anyway, this all leaves us with 16 minus or plus 1s at any given time, which we will call the state of the network.

What actually happens if we increase time by one step? Imagine folding out the time dimension in space. Now we select one of the neurons at random and update its state according to the input of all other neurons in the network. The rest of the neurons stay the same. And this we simply continue. But hold on, you might ask. How do we update the state exactly? And why update only one neuron at a time? Well, for the second question, we could, in fact, update all neurons at once. But the issue here is plausibility, because that would require a global updating signal, almost like a clock instructing all neurons to update simultaneously. It's slightly more realistic, although not too much to be honest, to let them update asynchronously. Okay, and for the other question, yes, what is the actual update equation? Well, it's remarkably simple. It's a weighted sum of the states of the other neurons. Weighted meaning that each state is multiplied by the strength of the connection between the neurons. And since the connections in that sense weigh the inputs, from now on I'm actually going to call them weights. But then, of course, to ensure that the result is plus one or minus one again, we'll make use of that function I mentioned earlier, f, and that's it. Those of you familiar with linear algebra will have recognized this as computing the dot product between the vector of neuron states and the vector of connection weights.

This is a network of 64 neurons, and I'm just going to tell you, without explaining how, that it has memorized this pattern. Starting it off in different initial states and then running the equations I just described, selecting one neuron at a time at random and updating its activity, and wait, let's just make this a little simpler, we can see that the network really has this intriguing property that it gravitates towards the memory pattern in all cases. Or, well, it ends up with the anti-memory in some cases. We'll ignore that. It has to do with a certain symmetry in the network that we will get to. Moreover, once it's settled into that state, it doesn't change anymore. The memory pattern is what is called a stable state of the network. So can we be sure that this always happens? Well, no. For example, things start to get complicated when there is more than one memory stored in the network. We'll get to that. But for the simple case of just a single memory, yeah. The network will converge to either the memory or its anti-memory, or as an edge case, to the completely inactive state.

So let's recap. Networks are just vectors that evolve in time, and we update their elements asynchronously. Given some configuration of weights, there are stable states in the network, which we call memory states, and we have seen, although not proved, that the network is kind of attracted to these states. That leaves the following question. How do we make the network memorize a certain pattern? In other words, how can we design the stable states of the network? This amounts to setting the weights of the network, which turns out to be a matrix with as many rows and columns as there are neurons. For example, the state of this neuron is determined by the weights in this row of the matrix.
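Pulling this part together, here is a rough sketch of the asynchronous dynamics just described: pick one neuron at random, recompute its state as the thresholded dot product between its row of weights and the current state vector, and repeat. The weight matrix W is taken as given here (with a zero diagonal, as the learning rule in the next part ensures); the function names and step count are my own choices, not from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Same thresholding function as before: keep states in {-1, +1}.
    return 1 if x >= 0 else -1

def step(state, W):
    """One asynchronous update: a single randomly chosen neuron recomputes
    its state from the weighted sum (dot product) of the other neurons'
    states; with a zero diagonal its own state does not contribute."""
    i = rng.integers(len(state))
    state[i] = f(np.dot(W[i], state))
    return state

def run(state, W, n_steps=2000):
    """Keep applying single-neuron updates; with suitable weights the
    network settles into a stable (memory) state and stops changing."""
    for _ in range(n_steps):
        state = step(state, W)
    return state
```

With the memorized pattern encoded as a ±1 vector and a suitable W, run(initial_state, W) should drift toward the stored pattern, or, as noted above, toward its anti-memory.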
At first, designing the stable states this way might seem impossible, especially if we wish to store more than one memory in the network. So I'm just going to tell you the magic rule, and then I'll motivate it. Given a desired memory state with, say, 8 elements, we are looking for an 8x8 matrix. We will just say, seemingly out of nowhere, that the weight between two neurons, i and j in general, is determined by the product of their states in the memory. For all of you keen on linear algebra, this means that the matrix is the outer product of the memory vector with itself, except for the diagonal, which we set to zero, since we don't want any self-reinforcement.

But why? Think of it this way. Our whole approach was in some sense to build something interesting from many simple parts. So there really should be a way for the two neurons to determine the weight between them, independent of the rest of the network. This is a principle called Hebbian learning, and it is exceedingly plausible, because remember that weights are supposed to be synapses, and synapses in actual neurons also would have no way of knowing what goes on in the broader network. And so why the multiplication? Well, here comes the reason I coded the binary states as minus one and plus one. If you map out all four combinations of states of the two neurons, you'll see that the weight will have a positive sign whenever the neurons agree in their memory state, and a negative sign if they disagree. And this should make sense, because a positive weight lets a neuron project its state onto other neurons, and a negative weight lets a neuron flip the state of other neurons. It's almost like the neurons behave like charged particles, maybe.

One last question before we can see it in action, and that's how we store multiple patterns at once in the same network. We do it by computing the outer products for all desired memory patterns. This gives us a matrix for each, and then we average those matrices. This gives finer and finer gradations of the weights. Now, however, things start to become a little bit more complicated. There's no way to guarantee that all memory patterns are stable states. The memories will start talking to each other, fusing into new memories, which I find frustrating and super interesting at the same time. Here's the network with four memories again. It sometimes converges to one of the memory states, yes, but in other cases, it converges to something in between, a merging of memories, which I find, well, almost a kind of human mistake.

And with this, we are finally making some progress on our question from the very beginning of the video. How much memory do you have? Originally, we might have answered this by giving some number of bytes, but now the question presents itself very differently. It's more like, how many memory patterns can we store in a recurrent network of a given size, such that they are stable states of the network? And the answer is, at least for this model, not very many. The original paper showed that this model has only linear memory capacity. That is to say, the number of stable states grows as a linear function of the size of the network. Plus, there's a hidden assumption in this graph, too, which is that all memory states are uncorrelated, which, for any set of pictures like this, is totally not the case. I tried my best picking out some not-so-correlated images to really show what this network can do, and the results are, well, see for yourself. Realistically, that means that this model is maybe too simple after all.
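To tie this part together, here is a sketch of the Hebbian rule as described: each memory contributes the outer product of its ±1 pattern with itself, the contributions are averaged, and the diagonal is zeroed so that no neuron reinforces itself. The two stored patterns and the recall test at the end are small made-up examples, not the images from the video.

```python
import numpy as np

def hebbian_weights(patterns):
    """Weight matrix from a list of ±1 memory patterns: average the outer
    product of each pattern with itself, then zero the diagonal so that no
    neuron reinforces itself."""
    patterns = np.asarray(patterns)
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)      # weight_ij = product of states i and j
    W /= len(patterns)           # average over all stored memories
    np.fill_diagonal(W, 0)       # no self-reinforcement
    return W

# Two small, made-up ±1 patterns to store (8 neurons, so an 8x8 matrix).
memories = [
    np.array([+1, +1, -1, -1, +1, -1, +1, -1]),
    np.array([-1, +1, +1, -1, -1, +1, +1, -1]),
]
W = hebbian_weights(memories)

# Corrupt the first memory and let the asynchronous dynamics from the
# earlier sketch try to "autocomplete" it.
rng = np.random.default_rng(1)
state = memories[0].copy()
state[:2] *= -1                              # flip two pixels
for _ in range(200):                         # asynchronous single-neuron updates
    i = rng.integers(len(state))
    state[i] = 1 if np.dot(W[i], state) >= 0 else -1
print(np.array_equal(state, memories[0]))    # expected: True for this toy case
```

With only two uncorrelated patterns in an 8-neuron network, recall like this works; crowding in more patterns, or correlated ones, is exactly where the fusion and capacity problems described above appear.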
That the model turns out to be too simple is not surprising, given all the violence we did to these networks with our many simplifications. But the goal of this video wasn't to convince you that this model can be used in any practical sense anyway. Although, in a follow-up video, I want to tell you what steps people took to make this model actually useful for deep neural networks. What I wanted to achieve with this video is, first, to warn you of false comparisons. Networks of neurons don't behave like USB sticks. And why should they? But secondly, it's to show you how modeling approaches, walking a very thin line between complexity and simplicity, can sometimes let us conceive of the world differently than we otherwise would have. You might have set out picturing memory as something static. But I'm hoping now you're willing to consider that it might be something dynamic. The invisible stable states of a network buzzing with activity.