Talking Machines podcast Ep 5 transcript: distributed representations vs symbolic representations
This is a transcript of an interesting segment from episode 5 of the Talking Machines podcast, which contains part 1 of a conversation between three leaders in machine learning: Geoffrey Hinton, Yoshua Bengio, and Yann LeCun.
The full episode is available here: http://www.thetalkingmachines.com/blog/2015/2/26/the-history-of-machine-learning-from-the-inside-out
It seems like thinking hard about distributed representations is some of the most exciting stuff that’s come out of this resurgence. It’s a very different way of... I would say it kind of challenges a long history of knowledge representation. It feels very biological, right? Geoff, can you talk a little bit more about distributed representations and maybe explain that to our audience?
Ok the idea is that you have a large number of neurons and they’re conspiring together to represent something and they each represent some tiny aspect of it, and between them they represent the whole thing and all its wonderful properties.
And it’s very different from a symbol. Where a symbol is just something that is either identical or not identical to another symbol. Whereas these big patterns, these distributed representations, have all sorts of intrinsic properties that make them relate in particular ways to other distributed representations. And so you don’t need explicit rules, you just need a whole bunch of connection strengths, and one distributed representation will cause another one in just the right way.
For example you could read an English sentence and get a distributed representation of what it means, and that could cause a distributed representation that creates a French sentence that means the same thing. And all of that can be done with no symbols.
So the power of that concept can be seen in the fact that all of us, in all of our labs, are essentially working on embedding the world — you can think of it this way. So how do we find vector representations for words, for text in various languages, for images, for video, for everything in the world. For people, actually, so you can match people’s interests with content, for example, which is something that Facebook is very interested in.
So finding embeddings is a very interesting thing. And there are a lot of methods for doing this. For text there’s the very famous method called word2vec, invented by Tomas Mikolov.
And following the neural language model that Yoshua had worked on before that, Geoff and I also had worked separately on different methods to do high-level embeddings rather than low-level embeddings. So things that could be applied to images for example. So I guess this could be called metric learning. So this is situations where you have a collection of objects and you know that two different objects are actually the same object with different views or the same category. So two images of the same person, or two views of the same object, or two different instances of the same category.
And so you have two copies of the same network, you show those two images and you tell the two networks ‘produce the same output. I don’t care what output you produce, but your output should be nearby.’ And then you show two objects that are known to be different, and then you can push the output of the two networks away from each other.
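As an illustration of the two-copies idea just described, here is a minimal sketch in Python. It is not from the podcast: the linear `embed` function, the margin value, and the toy inputs are all assumptions chosen for readability, and the loss shown is one common contrastive formulation rather than the specific method used in any of the systems mentioned.

```python
import math

def embed(weights, x):
    """Map an input to an embedding with a linear layer.
    Both inputs of a pair go through this same function with the
    same weights, which is what "two copies of the same network" means."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(weights, x1, x2, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs apart
    until they are at least `margin` away from each other."""
    d = distance(embed(weights, x1), embed(weights, x2))
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical weights: a 3-dimensional input is mapped to 2 dimensions,
# and the third input component is ignored.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]

view_a = [0.2, 0.1, 0.9]   # one view of an object
view_b = [0.2, 0.1, 0.1]   # another view, differing only in the ignored component
far_other = [2.0, 2.0, 0.5]   # a different object, already far away
near_other = [0.5, 0.1, 0.0]  # a different object that lands too close

loss_same = contrastive_loss(weights, view_a, view_b, same=True)
loss_far = contrastive_loss(weights, view_a, far_other, same=False)
loss_near = contrastive_loss(weights, view_a, near_other, same=False)
```

In training, the weights would be adjusted by gradient descent to reduce this loss over many pairs. The loss never says what the output should be, only how far apart the two outputs should end up.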
Geoff had a technique called NCA to do this, Neighbourhood Components Analysis. […] And then Jason Weston and Samy Bengio came up with a technique called WSABIE which they used to do image search on Google. Google used that as a method to build vector representations for images and text so you could match them in search. At Facebook we’re using techniques like this for face recognition. So we find embedding spaces for faces, which allows us to search very quickly through hundreds of millions of faces to find you in pictures, essentially.
So those are very powerful methods that I think we’re gonna use increasingly over the next few years.
Is there a point where you need to have discrete grammars on top, or can it be distributed the whole way down?
My belief — if you’d asked me a few years ago, I’d have said well maybe in the end we need something like a discrete grammar on top. Right now I don’t think we do. My belief is we can get a recurrent neural network — that is something with an internal state that has connections to itself so it sort of keeps going over time. We can get that kind of network to translate from one language to another — this has been done at Google, and it’s been done in Yoshua Bengio’s group — we can do that with nothing that looks like symbols with symbolic rules operating on them. It’s just vectors inside.
It works very well. It’s at about the state of the art now, both at Google and at Yoshua’s lab. And it’s developing very fast.
And I think the writing’s on the wall for people who think the way you get implications from one sentence to the next is by turning the sentence into some kind of mentalese that looks a bit like logic and then applying rules of inference. It seems that you can do a better job by using these big distributed vectors, and that’s much more likely to be what people are up to.
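To make "an internal state that has connections to itself" concrete, here is a minimal sketch of a recurrent update in Python. It is an illustration only, not the translation systems mentioned above: the `tanh` nonlinearity, the hand-picked weights, and the two-dimensional toy inputs are all assumptions.

```python
import math

def rnn_step(state, x, w_state, w_input):
    """One time step: the new state depends on the previous state
    (the self-connections) and on the current input vector."""
    new_state = []
    for i in range(len(state)):
        total = sum(w_state[i][j] * state[j] for j in range(len(state)))
        total += sum(w_input[i][k] * x[k] for k in range(len(x)))
        new_state.append(math.tanh(total))
    return new_state

def run_rnn(inputs, w_state, w_input, state_size):
    """Feed a sequence through the network, carrying the state forward."""
    state = [0.0] * state_size
    for x in inputs:
        state = rnn_step(state, x, w_state, w_input)
    return state

# Hypothetical weights, chosen by hand rather than learned.
w_state = [[0.5, -0.3],
           [0.2, 0.4]]
w_input = [[1.0, 0.0],
           [0.0, 1.0]]

token_a = [1.0, 0.0]
token_b = [0.0, 1.0]

state_ab = run_rnn([token_a, token_b], w_state, w_input, state_size=2)
state_ba = run_rnn([token_b, token_a], w_state, w_input, state_size=2)
```

The same two inputs in a different order leave the network in a different state, so the final vector summarizes the sequence and not just its contents. There are only vectors inside, with no symbols or rules.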
There’s a very interesting white paper or position paper by Léon Bottou, the title is “From Machine Learning to Machine Reasoning”, which basically advocates the idea that we can use those vector representations as the basic components of an algebra for reasoning, if you want. Some of those ideas have been tried out but not to the extent that we can exploit the full power of it.
And you start seeing work now, so for example my colleague [?] Fergus […] and someone from Google, worked on a system that uses distributed representations to identify mathematical identities. And it’s one of those problems that is very very sort of classical AI, like solving integrals and stuff like that, that involves reasoning and search and stuff like that. And we can do that with recurrent nets now to some extent.
Then there are people working on how do you augment recurrent networks with sort of a memory structure. So there’s been ideas going back to the early 2000’s or late 90’s, like LSTM, which is pretty widely used at Google and other places. So it’s a recurrent net that has a sort of a separate structure for memory. You can think of it as sort of a processor part and a memory part, where the processor can write and read from the memory.
So the Neural Turing Machine is one example, there’s another example. Jason Weston [and others] have proposed something called a Memory Network which is kind of a similar idea. It’s somewhat simpler than SDM in many ways. […]
And there’s a sense that we can use those types of methods for things like producing long chains of reasoning, maintaining a kind of state of the world if you want. So there’s a cute example in the memory network where you can tell a story to the network, like say Lord of the Rings, so “Bilbo takes the ring and goes to Mt Doom, and then drops the ring, and blah blah blah.” You tell all the events in the story, and at the end you can ask a question to the system, so, “Where’s the ring?” and it tells you, “Well, it’s in Mt Doom.” Because it maintains sort of an idea of the state of the world, and it can respond to questions about it.
So that’s pretty cool because that starts to get into the stuff that a lot of symbolic AI people said neural networks will never be able to do.
I’d like to add something about the question you asked regarding distributed representations and why they are so powerful and behind a lot of what we do.
So one way to think about these vectors of numbers is that what they really are is attributes that are learned by the machine, or by a brain if we think that’s how brains work. So a word or an image or any concept is going to be associated with these attributes that are learned.
Now associating attributes to concepts is not a new idea. Linguists will define things like the gender or plural or this is an animal or this is alive or not. And people trying to build semantic descriptions of the world do that all the time. But here the difference is that these attributes are learned. And the learning system discovers all of the attributes that it needs to do a good job of predicting the kind of data that we observe.
The important notion here is the notion of composition, something which is very central in computer science and also in many of the older ideas of AI. Cognitive scientists thought that neural nets cannot do composition.
Actually composition is at the heart of why deep learning works. In the case of the attributes and distributed representation I was talking about it’s because there are so many configurations of these attributes that can be composed in exponentially many ways that these representations are so powerful.
And when you consider multiple levels of representations, which is what deep learning is about, then you get an extra level of composition that comes in, and that allows you to represent even more abstract things.
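The "exponentially many ways" point can be checked with a small counting sketch in Python. This is an illustration under simplifying assumptions, not something from the podcast: the attributes are treated as binary, whereas learned attributes are real-valued, and the attribute names are hypothetical.

```python
from itertools import product

def count_configurations(num_attributes, values_per_attribute=2):
    """Number of distinct patterns a set of attributes can take on."""
    return values_per_attribute ** num_attributes

def enumerate_configurations(attribute_names):
    """List every on/off combination of the given attributes."""
    return [dict(zip(attribute_names, bits))
            for bits in product([0, 1], repeat=len(attribute_names))]

# Hypothetical attribute names, in the spirit of the linguists' examples.
attributes = ["animal", "alive", "plural"]
configurations = enumerate_configurations(attributes)

# With separate symbols, 20 units can name only 20 things.
# With 20 binary attributes, the same 20 units can distinguish many more.
symbols_with_20_units = 20
patterns_with_20_attributes = count_configurations(20)
```

Three attributes already give eight combinations, and each added attribute doubles the count, which is the sense in which a fixed number of units can cover exponentially many concepts.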
A nice example of distributed representations where you can see them at work in people, is if you just have symbols, you might have a symbol for a dog and a symbol for a cat, and a symbol for a man and a symbol for a woman. But that wouldn’t explain why you can ask anybody the following question, and young kids can do this. If you say “You’ve gotta choose: either dogs are male and cats are female, or dogs are female and cats are male.” People have no doubt whatsoever. It’s clear that dogs are male and cats are female.
And that doesn’t make any sense at all. And the reason it’s clear is because the vector for dogs is more like the vector for man, and the vector for cats is more like the vector for woman. And that’s just obvious to everybody. And if you believe in symbols and rules, it doesn’t make any sense.
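The contrast between symbols and vectors in this example can be sketched in a few lines of Python. The vectors below are made up by hand purely for illustration (real ones would be learned from data), and the three dimensions are hypothetical attributes, not anything stated in the podcast.

```python
import math

def cosine_similarity(a, b):
    """Similarity of direction between two vectors, from -1 to 1."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors: [is an animal, is a human, a shared third attribute].
# The values are assumptions picked so the example is easy to follow.
vectors = {
    "dog":   [1.0, 0.0,  0.4],
    "cat":   [1.0, 0.0, -0.4],
    "man":   [0.0, 1.0,  0.6],
    "woman": [0.0, 1.0, -0.6],
}

# Symbols can only be compared for identity, so this comparison
# gives no reason to prefer either pairing.
symbols_match = ("dog" == "man")

# Vectors support graded comparisons between different concepts.
dog_man = cosine_similarity(vectors["dog"], vectors["man"])
dog_woman = cosine_similarity(vectors["dog"], vectors["woman"])
cat_man = cosine_similarity(vectors["cat"], vectors["man"])
cat_woman = cosine_similarity(vectors["cat"], vectors["woman"])
```

With these toy values, "dog" comes out closer to "man" than to "woman", and "cat" closer to "woman" than to "man", even though no rule relating them was ever written down.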