Geoffrey Hinton: "Some Applications of Deep Learning"

So: speech recognition has been around for a long time, and it was one of the first areas for machine learning algorithms. The EM algorithm for fitting hidden Markov models had a huge effect on speech recognition; hidden Markov models became the method of choice in the 1970s, and they've been the method of choice ever since. The way those hidden Markov models interfaced with the acoustic data was through Gaussian mixture models: each state of a hidden Markov model would decide how well it fitted the data by asking how likely that data was under the Gaussian mixture model associated with that state. That's the standard technology.

What's happened in the last few years is that deep neural nets have been shown to do a better job of saying whether a state of a hidden Markov model fits the data. They work in the other direction: given the acoustic data, they predict which state of which hidden Markov model is the best fit, and having made those predictions, you then use the standard technology to figure out what the most likely sentence is. You model each phoneme by a little hidden Markov model that has a few states; sometimes you have many different hidden Markov models for a phoneme, depending on the neighbouring phonemes, but initially we just used roughly one Markov model per phoneme.

There's a standard way to pre-process a sound wave in speech recognition, which is to compute things called cepstral coefficients. They're not the best pre-processing for neural nets, because they're designed to throw away a lot of the data so that dumb machine learning methods can cope; a deep neural net would rather see more of the data and decide for itself what to do with it. But for now we'll use the coefficients that are standardly used. You train up your neural net by first taking your database of speech and using a standard speech recognizer to decide, at each point in the speech, which phoneme is being said. The standard speech recognizer knows what words are being uttered, but it has to decide where the phonemes are, and then you can use that as training data for a neural net that will then do a better job.

Some early research in that area was done by George Dahl and Abdel-rahman Mohamed, and they used a net like this. They started off with mel-cepstral coefficients, but after a while they switched to filter-bank coefficients, which are a less pre-processed form of speech and actually work better with deep neural nets, and they just tried very deep, very wide neural nets. This was done pretty much when GPUs first started being used, so they could train nets with something like four million weights in a layer; in fact, using seven hidden layers worked best, and that net has of the order of 50 million connections in it. They were training on not much data (there's only about three hours of data in the TIMIT database), and so pre-training was important here.

What they did was pre-train a restricted Boltzmann machine to learn 2,000 features of the mel-cepstral coefficients or the filter-bank coefficients, from a little window that's looking at 15 frames with 40 things per frame. They then pre-trained another Boltzmann machine to learn features of those features, and so on, for quite a few layers; they basically discovered that the more layers, the better. The last-layer weights are not pre-trained, so they start off at small random values or zero, and at the output they have the states of 61 different hidden Markov models, each with three possible states. So the net is trying to say which phone it is, and which part of which phone: which of the three states of the Markov model, where the three states say whether it's the first bit of the phone, the middle bit, or the last bit.

TIMIT is a very well-tried database, the sort of speech equivalent of MNIST. At the time they did this, the record for speaker-independent methods was 24.4% error, and that involved averaging many systems. They initially got down to 23%, and later on they got down to just over twenty percent.

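To make the shape of that hybrid model concrete, here is a rough numpy sketch. The sizes (a window of 15 frames of 40 filter-bank coefficients in, 61 x 3 HMM-state probabilities out, 2,000 logistic units per hidden layer, seven hidden layers) are the ones from the talk, but the weights are random stand-ins: the RBM pre-training and the backpropagation fine-tuning are omitted, so this shows only the forward pass that scores HMM states, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes from the talk.
N_IN = 15 * 40          # window: 15 frames, 40 coefficients per frame
N_HID = 2000            # features per hidden layer
N_LAYERS = 7            # "seven hidden layers worked best"
N_STATES = 61 * 3       # 61 phone HMMs with 3 states each

# Random weights stand in for the RBM-pretrained ones.
hidden_w = [rng.normal(0, 0.01, (N_IN if i == 0 else N_HID, N_HID))
            for i in range(N_LAYERS)]
out_w = np.zeros((N_HID, N_STATES))   # last layer starts at zero

def state_posteriors(window):
    """Map one acoustic window to P(HMM state | acoustics)."""
    h = window.reshape(-1)
    for w in hidden_w:
        h = 1.0 / (1.0 + np.exp(-(h @ w)))   # logistic hidden units
    logits = h @ out_w
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax over the 183 states

window = rng.normal(size=(15, 40))
p = state_posteriors(window)
print(p.shape, round(float(p.sum()), 6))      # (183,) 1.0
```

These per-frame state posteriors are what gets handed to the standard HMM decoder in place of the Gaussian mixture likelihoods.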
They subsequently discovered that you can do just about as well using a more complex pre-processing method, which Marc'Aurelio Ranzato will probably talk about in his tutorial later in the summer school, so I won't talk about that now.

Speech recognition people saw this, and it corresponds to something like ten years of progress in speech recognition happening sort of overnight, with GPUs and neural nets, and they began to get excited. So they tried it on more complicated problems. What happened, in fact, was that the students who did this went off to the big speech labs, took the code with them, and tried the same methods, almost identical code, on much more challenging speech recognition problems. That turned into the Microsoft speech recognizer and some other speech recognizers. The people who developed this took it to Microsoft, and Navdeep Jaitly, another student, took it to Google. Microsoft has already deployed a speech recognizer based on this, because it works better than their previous one; Google has been developing the technology and now has it working quite a lot better than the numbers Navdeep reported, but it's essentially the same method.

If you look at how well these methods perform, you can compare them with the standard method, which is to use hidden Markov models with Gaussian mixture models. We're not replacing the hidden Markov models (we're keeping those for now); we're just replacing the Gaussian mixture models with feed-forward neural networks that say how well a state of a hidden Markov model fits the acoustic data. This has been done by three of the biggest speech groups; there are about half a dozen big speech groups, and these are three of the major ones. This one was done by Microsoft on TIMIT: they were training on three hours of data then, and now they're training on 300 hours. It's complicated to know what to call these things. We used to call them deep belief nets, but that's just how you initialize them; you then train them with backpropagation as a neural net, so they're best called deep neural nets that are pre-trained as deep belief nets, though that's rather a mouthful.

The deep neural net gets 18.5% error instead of 27.4% error, which is a big improvement, and on a different test set for the same training set it makes another big improvement. This next one is probably the most impressive result, even though it doesn't look like it. It was done at IBM, which has probably the best speech recognizer for this task, a very well-studied task because it was used in DARPA competitions, and the IBM speech group managed to get from 18.8% error down to 17.5% error. Theirs was a very highly tuned system, with many years of tuning gone into it, and fairly quickly the deep neural nets beat it. At Microsoft they tried it on voice search, got a big improvement, and they're now deploying it. At Google they also tried it on voice search; Microsoft thinks 300 hours is a lot, but Google thinks a lot is 5,000 hours. They got down to 12.3% error, and I need another slide to show you where they started: with a lot more than 5,000 hours of training data for the Gaussian mixture models, they were getting 16% error. So that's a huge improvement; they've got rid of something like a quarter of their errors, and it's even better now. They also did it on YouTube, which is a very hard task because there are all sorts of different speakers and the sound quality is very poor, and although they didn't get such an impressive improvement, they did get a significant improvement on YouTube as well.

There's a paper coming out in the Signal Processing Magazine in the fall that has these results. Probably the most impressive thing about the paper is that it has authors from the University of Toronto, Microsoft, IBM and Google, and getting those groups to be authors on the same paper is quite an achievement. It's basically the consensus opinion that the deep neural nets are better than the Gaussian mixture models, and that's a big victory for deep neural nets.

Now, onto the next application. We would like to get similar improvements in object recognition. A lot of work has been done on object recognition using neural nets; about half of it has been done by Yann LeCun and his students, who over the years have developed better and better methods, and we're basically piggybacking on the methods they've developed. You use convolutional nets for the early features, you use rectified linear units, you use competitive interactions so that strong features suppress the weak ones, and you use pooling, even overlapping pooling.

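Two of the ingredients just listed, rectified linear units and overlapping pooling, are simple enough to sketch. In this toy the pooling window is larger than its stride, which is what makes the pooling "overlapping"; the feature map itself is made up.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: pass positives, zero out negatives."""
    return np.maximum(x, 0.0)

def max_pool(fmap, size=3, stride=2):
    """Max pooling; with stride < size the windows overlap."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

fmap = relu(np.array([[1., -2., 3., 0.],
                      [0., 5., -1., 2.],
                      [-3., 1., 4., -2.],
                      [2., 0., 1., 6.]]))
pooled = max_pool(fmap, size=3, stride=1)
print(pooled)
```

Each output cell reports the strongest feature in its window, so a strong response suppresses weaker neighbours, and the overlap means neighbouring cells share evidence.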
We have a critic called Jitendra Malik, who thinks the deep neural networks haven't really proved themselves for computer vision yet, because they haven't been tried on severe enough tests. So I checked with Jitendra Malik and got him to say that if we could do the ImageNet task with a thousand classes, then he would be impressed; that would be a good test of whether these deep neural networks really do work for object recognition. The ImageNet database has millions of images crawled from the net, and there was a competition in 2010 where they took 1.3 million of them, high-resolution images from a thousand different classes, and the task is to train on those images and then predict, for new test images, which of the thousand classes each one belongs to.

Now, there are many reasons why this is hard, not just that there are a thousand classes and the images are high-resolution; it's also a bit arbitrary what the label is. Each image has one label, and an image might contain a fridge and a microwave and a sink, but the label would be dishwasher, because in the corner of the image there's a bit of a dishwasher. So it's tough to get the label right; there are many cases where you wouldn't get it right because there's a microwave sitting right in the middle of the image, but the answer isn't microwave. The neural net really has to hedge its bets: in its higher layers it has to have guesses about all the various things the image might be, so that it can guess the right thing but also make other plausible guesses. To evaluate these nets you therefore ask not just how well they do with their first choice, but whether they get the right answer within their top five choices, because that's a more reasonable way to deal with the fact that there's severe label noise here; it's a bit arbitrary what the thing is called. The winner of the 2010 competition got 47 percent error for the top choice and 25 percent error for the top five choices. Since then there's been another competition where the test set isn't public; the people who won that competition went back to this one and showed that their new method gets 45 percent error for the first choice. So that's the state of the art for this data, these 1.3 million training images for a thousand classes, and we know that if we can beat that state of the art, Jitendra Malik will be impressed, which is important.

So Alex Krizhevsky developed his own version of Yann LeCun's kind of deep neural net, with a number of extra tweaks. All the layers use rectified linear units. He first downsamples the images to 256 by 256, by doing some cropping and a bit of downsampling, I think, and he then trains on quite big patches, but not always from the same location. That's a way of telling the neural net much more about translation invariance; because the objects in these images are quite big, a patch that contains most of the image probably has the object in it. When he trains, he actually shows the net a bunch of different patches, and at test time he shows it ten different patches: the four corners and the middle, plus the horizontal reflections of all of those. All of them get to classify what's in the image, and then he takes a consensus, which makes it work considerably better.

As a result of all this he gets about state of the art. He keeps improving it, and I'm not sure he's measured the performance with the magic trick yet, but without the magic trick he gets about 45 percent error. We now have a magic trick, which is a much better regularizer for the top layers of the net. The lower layers are convolutional (I have a faint suspicion convolutional nets may have been explained already), and the higher layers are globally connected, so there are many more parameters in those high layers. The good thing about convolutional layers is that there aren't many free parameters, so if the net is going to overfit, it's going to overfit because of the global connectivity in the highest layers. We use our magic new regularizer, which I'll explain in the next lecture, and that takes a net that's getting mid-40s error and gets it down to 39 percent error.

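The ten-view test-time consensus described above (four corner patches, the middle patch, and the horizontal reflections of all five) can be sketched like this. The 256 and 224 sizes are illustrative, and `classify` is just a placeholder for the trained net.

```python
import numpy as np

def ten_patches(img, crop):
    """Four corners, the centre, and horizontal flips of all five:
    the ten test-time views described in the talk."""
    h, w = img.shape[:2]
    c = crop
    corners = [(0, 0), (0, w - c), (h - c, 0), (h - c, w - c),
               ((h - c) // 2, (w - c) // 2)]        # last one is the centre
    views = [img[y:y + c, x:x + c] for y, x in corners]
    views += [v[:, ::-1] for v in views]            # horizontal reflections
    return views

def consensus(img, crop, classify):
    """Average the class probabilities the net assigns to each view."""
    return np.mean([classify(v) for v in ten_patches(img, crop)], axis=0)

img = np.arange(256 * 256, dtype=float).reshape(256, 256)
views = ten_patches(img, 224)
print(len(views), views[0].shape)   # 10 (224, 224)
```

Averaging over the ten views gives the net several shifted and mirrored looks at the object, which is where the extra translation invariance at test time comes from.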
That's 19 percent instead of 25 percent error for the top five choices, so this is significantly better than the previous state of the art. It's comparable with the improvement we were getting in speech, and it comes from using standard deep neural net technology, plus an extra trick, plus GPUs, plus Alex Krizhevsky, who's very good at really tuning things.

Here are some examples from a slightly earlier version of his system. That version was not doing quite as well, but I have the examples from it, and they show the sort of things it can do. This would be an ImageNet image, and the right answer is quail; you'll notice the beak is missing. These are the answers the net gave, with the probabilities associated with those answers. It got quail second, so it's in the top five, but it didn't get it first: its first choice is otter. Since I sort of believe in this net, I'll make apologies for it. Otter is not a bad answer: look at all this wet fur here, it's just like the wet fur of an otter, and anyway there's no beak. And notice that the other three answers are all kinds of bird, and I'll bet you don't know whether that's a quail or a ruffed grouse or a partridge.

This one I like. This is a snowplow (this is a Canadian example), and the net is very confident: snowplow. But look at its other bets: it says drilling platform, garbage truck and lifeboat, which are all quite plausible. In particular, for lifeboat, look at the image again: here's the flag at the front of the boat, here's the bridge of the boat, here's the flag at the back, and here's the surf. You can see details well enough to know it's not a lifeboat, but if you blur your eyes a bit, that's a pretty good lifeboat.

This is an example where it gets it completely wrong. The answer is scabbard; it thinks earthworm, and we'll come back to that. Guillotine is a pretty good guess: two big vertical things, that's a guillotine. Orangutan is contextual recognition: this is clearly jungle, and this is orange, so it's an orangutan. Broom is not bad either, as a big vertical thing. But why does it say earthworm first? Well, if you look carefully you'll see the earthworms: there's one earthworm, and there's the other, crawling across from the grass. There are many examples like that, where you can see why it gets it wrong; I'll show you another one in a minute.

Just to emphasize how hard ImageNet is, here are some other examples. I would have said microwave there; the right answer is dishwasher, and I think there's a bit of dishwasher here and a bit of sink there. Sorry, the right answer is electric range, and there's a little bit of electric range here. So that's an example where, obviously, the net has to see all the main things in the image and then guess the labelled one. Here's an example where the thing is fairly distributed: it's not a single local convex object, it's occupying a large fraction of the image, and the net gets it; and prison is not a bad bet for that kind of set-up. It gets pictures as well as real things: these are from catalogs, and quite a lot of the things in ImageNet are like that.

And here's my favorite example. This is some iPod headphones or something. It says corkscrew; well, that's not so good. And lipstick; that's not so good. And screw; that's not so good. Then it says ant, and you think: ant? Why on earth does it think this is an ant? And then look at it: this is a great big ant that's just about to eat you. Here are its eyes, and here are its antennae; if you're a greenfly, you don't want to see that. I'm pretty convinced that's why it thinks it's an ant; you can see the ant now.

Alex does have some cases where he's backpropagating from the object class to ask which pixels in the image are most responsible for giving that answer; in other words, you find the pixels where, when you change their intensities, the probability of the right answer changes a lot. You can show for some images, like an image of a mite on a leaf, that the net is saying it's a mite because of the legs of the mite, which occupy a tiny fraction of the image. So we know that in some cases it really is getting the right answer for the right reasons. Obviously we'd like something where we can look inside the net afterwards and say: yes, it said that was an ant because it thought that was an eye, and that was an eye, and that was an antenna. In the lecture I give tomorrow I'll talk about a system that has more chance of doing that; it doesn't do it yet, but there's a chance you can look inside it after it's learned and understand why it recognizes things. Here's another nice example: this is a barber chair, which it gets right, but cello is also one of its bets, and you can see there's a sort of big diagonal bit here with some stuff going on, and that does look sort of cello-like.

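That pixel-responsibility probe can be imitated crudely without backpropagation: wiggle each pixel and watch how the class probability moves, which is exactly the quantity backpropagating to the pixels computes efficiently. The classifier here is a made-up toy that only looks at one pixel, so the probe should single that pixel out; this is a sketch of the idea, not Alex's actual method.

```python
import numpy as np

def saliency(img, class_prob, eps=1e-3):
    """Finite-difference stand-in for backpropagating a class score to
    the pixels: how much does P(class) move when each pixel moves?"""
    base = class_prob(img)
    sal = np.zeros_like(img)
    for idx in np.ndindex(img.shape):
        bumped = img.copy()
        bumped[idx] += eps
        sal[idx] = abs(class_prob(bumped) - base) / eps
    return sal

def class_prob(img):
    """Toy 'classifier' that only depends on the centre pixel."""
    return 1.0 / (1.0 + np.exp(-img[2, 2]))

img = np.zeros((5, 5))
sal = saliency(img, class_prob)
print(tuple(int(i) for i in np.unravel_index(sal.argmax(), sal.shape)))
```

For a real net you would use one backward pass instead of one forward pass per pixel, but the map you get answers the same question: which pixels is the answer resting on.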
OK, that's it for object recognition. My prediction is that over the next few years, as GPUs get faster and we get better at tuning up these big neural nets, the vision people will get quieter and quieter about how neural nets are no good.

I'm now going to talk a bit about retrieval. I started off working with Russ Salakhutdinov on document retrieval, and we came up with a way of doing extremely fast retrieval: not very high quality, but really, really fast. What you do is take a document, represent it as a bag of words, and use a deep autoencoder to convert the document into a code of about 30 bits, a very small binary code. Then you treat that neural net, the one that converts bags of words into small binary codes, as a hash function. That is, you take a document, the neural net tells you where to put it in memory, and you have a pointer there back to the document; and from nearby memory locations you have pointers back to other documents, namely documents with very similar codes. So if you want to find a document similar to this one, you don't have to do any search at all. You have a billion documents; you go to the memory location the neural net associates with this document, and then you start flipping bits of the address. If you flip one bit of the address, you get a neighbouring location, and you look whether that points to another document. In effect, you can search a big database without really doing any search at all, just a few machine instructions; you can start enumerating similar documents at a few machine instructions per document. Now, they won't be very high quality matches, so really you'd just use this for getting a shortlist.

I call this supermarket search. When I first came to the States I wanted to find anchovies, so I asked where the sardines were, and I went to the sardines; in supermarket search, you then look around, and there are the anchovies, because a supermarket has this 2D shelf space and it puts similar things near one another. The problem is that in North America the anchovies are with the pizza toppings, not next to the sardines; but that's just bad luck. If they had a 32-dimensional supermarket, they'd be much better off: in a 32-dimensional supermarket you could have the anchovies next to the pizza toppings and next to the sardines. In a 32-dimensional supermarket you could have the kosher things next to the non-kosher things, and the expensive things next to the cheap things, and the things that were expensive but are now very cheap because they're slightly off next to those. With that many dimensions you can capture similarity structure that you just can't capture in the two dimensions of a supermarket, and that's what's happening here. The point is that normally with hashing, when you hash something to a memory address, you can't use that to find similar things. You can use it to find the identical thing, because that will have the same address, but most hash functions don't hash similar things to nearby addresses; a neural net can do that. So the idea is: spend a long time doing machine learning to get this really good hash function that maps similar things to nearby addresses, and then you can do searches really fast.

We've applied that to images, but let me give one more view of this first. At a slightly exaggerated level, all retrieval works like this: you find some lists and then you intersect them. Well, if you're going to do that on a computer, you'd better find lists it can intersect fast, and there's one kind of list intersection a computer can do in one machine instruction: take 32 lists, each of which has about half a billion items, and intersect those 32 lists to find the one item that's in all of them. The way the lists work is that each bit of the memory address is a list. If the bit is on, it's picking out one half of the memory space, which is half a billion addresses; if the bit is off, it's picking out the other half, so it's a list of all the addresses in that half of the memory space. What the address bus does is intersect those 32 lists to give you the right address, and it does it in one machine instruction; that's the nice thing. So another way to think about semantic hashing is: if you're going to intersect lists, do it the way the computer wants to do it, and then all you have to do is take your task and map it onto that kind of list intersection. That's what semantic hashing is doing.

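Here's a minimal sketch of that lookup scheme. The learned autoencoder is replaced by an arbitrary 30-bit hash (md5-based, so unlike the real system it does not put similar documents at nearby addresses); what it shows is the mechanical part, enumerating neighbouring memory locations by flipping one address bit at a time.

```python
import hashlib
from collections import defaultdict

CODE_BITS = 30   # "a code of about 30 bits"

def toy_code(doc):
    """Stand-in for the learned hash function. In the real system a deep
    autoencoder maps a bag of words to this 30-bit address, putting
    similar documents at nearby addresses; md5 here will not do that."""
    digest = hashlib.md5(doc.encode()).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << CODE_BITS) - 1)

memory = defaultdict(list)            # address -> documents stored there
for doc in ["apples", "oranges", "anchovies", "sardines"]:
    memory[toy_code(doc)].append(doc)

def near_duplicates(doc):
    """Visit the document's own address, then flip one bit at a time:
    each flip is one neighbouring memory location; no search at all."""
    addr = toy_code(doc)
    hits = list(memory[addr])
    for b in range(CODE_BITS):
        hits.extend(memory[addr ^ (1 << b)])
    return hits

hits = near_duplicates("sardines")
print(hits)
```

The cost is a handful of machine instructions per candidate, independent of how many documents are stored; with a learned code, the candidates that turn up would actually be semantically similar.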
OK, so for image retrieval: instead of using captions, what we'd really like, of course, is to extract objects from the images, which is rather hard, and then use the objects in an image the way you use the words in a document. A word in a document is quite good for retrieving similar documents, because it has a lot of meaning associated with it, whereas a pixel in an image has very little meaning associated with it. It might be red, and that tells you something, but not much. You'd like to get to objects, but if you can't do that, you'd at least like a neural net to extract things that are better than individual pixels. So we're going to try using semantic hashing for image retrieval: we take images and use a deep autoencoder to convert each image into a very short binary code, from which you can very approximately reconstruct the image.

We're actually going to use a two-stage method. We'll use a neural net that produces a very short code, and with that very short code you can very quickly get a shortlist of, say, a hundred thousand images, or maybe ten thousand, that are probably similar. Then, within that shortlist, you can do a more accurate match using, say, a 256-bit binary code. Even at 256 bits, that's only four words of memory, so you can afford to associate such a code with every image, and you can do a linear search through, say, 10,000 things, because comparing two 256-bit vectors to see how different they are can be done with machine instructions that operate on whole words and do XORs. So a linear search through tens of thousands of things is extremely fast with these binary codes. The question is whether our 256-bit binary codes will work well enough to produce reasonable answers.

So Alex designed an autoencoder like this. There are no principles in it whatsoever, except that on a GPU powers of two are a good idea: he wanted to end up with 256 bits, so he just kept doubling the layer sizes from there and got back up to the input. He inputs 32 by 32 color images, so these are small color images, but you can still see what's going on in an image like that, and what he's going to try to do is come up with codes for these images that let him find similar images. He's going to work on the CIFAR-10 database, which has labeled images, to test the thing, but for training there's a huge number of unlabeled images he can use.

So, you go down to 256 bits and then find other images that have similar bit vectors. Here's a test image that wasn't used for training, here's the closest other image, and this shows you the distances, in bits, of the other retrieved images. Even 60 bits away is remarkably similar, since a random image would be about 128 bits away, at least if the bits were on and off equally often. And you can see that most of the things retrieved have a kind of collar and tie, and they're all heads of people. This is what happens if you do Euclidean distance in pixel space: obviously, on a big database that will also get you reasonable things, but notice that they don't have so many collars and ties. Another thing to notice is that many of the things that are closest in Euclidean distance also show up here; this woman, for instance, shows up with the 256-bit binary codes and is also close in Euclidean distance. But this method is hundreds of times faster than doing Euclidean distance on the images. So it's a bit better than Euclidean distance, and another way of thinking about it is that it does roughly what Euclidean distance does, extremely fast.

That works for a number of different kinds of scene. Here are some outdoor scenes; again, Euclidean distance returns many of the same things as are found by the 256-bit binary codes, including things that are quite wrong. If you look carefully at this one, you'll see it's a water tank, nothing like the other things, but Euclidean distance gets it here and the 256-bit codes get it there. That shows this thing has a very limited understanding of what's going on in images; it's a fairly superficial similarity it's using. But again, think of it as a very fast way of getting things that are very close in Euclidean distance. And here's an example where it definitely does better than Euclidean distance. This is a little group of people, and if you look at the things returned by the 256-bit codes, there are one, two, three, four, five, six, maybe seven groups of people, whereas Euclidean distance alone gets maybe one group of people; it's not nearly as good.

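The 256-bit refinement stage described above can be sketched directly: pack each code into four 64-bit words, XOR whole words, and count the differing bits. The codes here are random, not learned, so only the mechanics are illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)

# 256-bit codes packed into four 64-bit words, as in the talk:
# "even at 256 bits, that's only four words of memory".
codes = rng.integers(0, 2**63, size=(1000, 4), dtype=np.uint64)

def hamming(a, b):
    """XOR the words, then count set bits: whole-word operations only."""
    x = np.bitwise_xor(a, b)
    return sum(bin(int(w)).count("1") for w in x)

def shortlist_rank(query, codes, k=5):
    """Linear scan of the shortlist, ranked by Hamming distance."""
    d = np.array([hamming(query, c) for c in codes])
    return d.argsort()[:k]

q = codes[42]
top = shortlist_rank(q, codes)
print(int(top[0]), hamming(q, codes[top[0]]))   # 42 0
```

A real CPU would use a popcount instruction rather than `bin(...).count`, but either way the comparison touches only four words per image, which is why a linear scan through tens of thousands of candidates is cheap.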
In fact, when you have a busy image like this one, Euclidean distance will tend to do well with a uniform image that has the average intensity and average color of the busy image, because that minimizes the squared difference. With another busy image, if it's exactly in phase you'll get a small Euclidean distance, but if it's at all out of phase you'll start getting very big pixel differences. So if you look at the Euclidean-distance retrievals, these very smooth images do pretty well; you never get those with the binary codes.

Now, there are some obvious extensions to this. You could make the autoencoder work a lot better if, instead of just feeding it images, you fed it images plus captions represented as bags of words, and one of our students has done that recently; it does make it work better. Restricted Boltzmann machines make better models of bags of words than latent Dirichlet allocation topic models, so you can use a restricted Boltzmann machine on the bag of words to get a vector describing it, do the same sort of thing for the pixels, and then combine the two in a deep autoencoder to get codes for the joint bag of words and pixels. It's a win because it makes the codes for the images more semantic, since you're getting information from the words, and it's a win because it makes what you get from the captions more closely related to what things look like; the interaction during training just makes everything easier rather than more difficult.

There's a less obvious extension that also works nicely: if you've got an incredibly fast retrieval technique, you can afford to do retrieval many times. So Alex tried, instead of using whole images for retrieval, putting patches of images in the database. You get a much bigger database, but that doesn't matter; then, given a test image, you take patches from it, try retrieving with those patches, and say this image is similar to one of the images in the database if they have a patch that matches really well. That copes with objects moving around and being in different positions in the two images, and it improves things significantly. There are many other transformations you could think of applying like that; you can only really afford to do this if you've got a technique so fast that you can run it many times.

More recently, Alex has realized you can make the whole thing work much better, and here's how. His original image retrieval system started with pixels and tried to get binary codes for pixels; there was no knowledge of object classes at all, it was just purely unsupervised codes for images. But he's now developed a net that's very good at distinguishing a thousand different classes. So what happens if you look at the last hidden layer of that net? The last hidden layer has 4096 units in his ImageNet net, and what you can do is take the activity pattern that an image produces in that last hidden layer and ask: find me the other images that give similar activity patterns, in Euclidean distance. Here's what that looks like. This image here, according to Alex's net, is extremely similar to this image here; but notice the elephants are facing in different directions, so the Euclidean distance in image space is huge, yet in terms of what's in the image these are very similar images, and you would accept them as pretty good matches. It's even better for the aircraft carrier here (I assume that's what it is): the net says this image is very similar to that image, and conceptually they are very similar, both an aircraft carrier on some water with something in the background, but in terms of pixels they're totally different. Or look at the Halloween pumpkins: this image doesn't have a lot in common with that image at the pixel level, but it's the fourth-closest image if you look at the last layer of the net.

So Alex's proposal now is to take the representation of an image in that last layer and apply an autoencoder, and all the semantic hashing, to that; he should be able to get a very nice retrieval technique out of it, but that hasn't been done yet.

The last thing I want to talk about in this lecture is another of the failures of backpropagation in the 1980s, when computers were too slow. It wasn't backpropagation that failed; it was that the computer hardware guys just hadn't made decent machines yet. Backpropagation through time looked like it should let you learn programs.

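Why that failed on the machines of the time shows up numerically: the backward pass of backpropagation through time multiplies the error signal by the same tied recurrent weight matrix at every step, so the signal's size is governed by the matrix's largest eigenvalue raised to the number of steps. Here is a linear toy of that (the matrix size, number of steps, and spectral radii are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(20, 20))

def bptt_gradient_norm(scale, steps=50):
    """Norm of an error signal after `steps` steps of backprop through a
    linear RNN whose recurrent matrix has spectral radius `scale`."""
    Ws = scale * W / np.abs(np.linalg.eigvals(W)).max()
    g = np.ones(20)
    for _ in range(steps):
        g = Ws.T @ g          # same tied matrix, applied at every step
    return float(np.linalg.norm(g))

for scale in (0.5, 1.0, 2.0):
    print(scale, bptt_gradient_norm(scale))
```

With the spectral radius below one the gradient dies away to nothing; above one it explodes; only very close to one does anything useful survive fifty steps, which is the point the lecture makes next.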
be able to take a big net run back propagation through time and have it learn lots of little programs that interact to predict for example what’s going to happen next and that would have been really nice so people may have covered this already but just in case back propagating through time is very similar to having a leg net and back propagating through the leg net except that the weights are tied so this is a little recurrent neural net this is the same net where I blown up it states in time so this is it states at times area this is the states of time one and so on and this wait that says that the middle unit affects that unit one time step later shows up here and here and here okay so it’s just a lead net with tied weights and we could think of it as you put inputs in here you take outputs out here and these are hidden units and then if you gave me an initial state and said some final state you’d like I could back propagate and learn the weights or if you gave me a sequence of inputs and told me the sequence of outputs you’d like but as long as you don’t what the output still to two time steps later then I could train up these weights so given a sequence of inputs oh give the right sequence of outputs and that’s what we’re going to try and do now the problem with those nets as people may have mentioned is that the gradients tend to blow up so if you put in small weights in your back propagate by the time you back property through many layers you’ve got very very small gradients if you would in big weights by the time you’ve gone through many of theirs you’ve got very big gradients and that made these nets hard to look one way of saying this which seems sort of trivial is if you take a number and you raise it to a large power you either get something much too big or much too small which is pretty obvious now of course you could try and keep the number be about 1 and that large eigenvalue is very big yep it’s the power method for extracting a single egg of either 
Yes, okay, so they didn't work. But actually, if you initialize the weights carefully, you can make them work much better, and one thing that's happened in the last few years is that the people developing things called echo state networks said: let's take networks like this and initialize the weights very carefully, so that they neither blow up nor die as you run them. In other words, you take the biggest eigenvalue of the weight matrix to be about 1.2 and then you put the activities through the logistic nonlinearity; 1.1 is not big enough and 1.3 is too big, but 1.2 is sort of ideal, and then the net will run for a very long time without either dying or blowing up. And then the people doing echo state networks said: okay, if we initialize like that, all we need to do is just learn the output weights. So you can think of them as freezing W1, W3 and W4 and just learning W2, so that from these inputs and hidden states they can predict the output, and that works surprisingly well. Of course it will work much better if you train those other weights too, and as soon as you've initialized the net sensibly so it doesn't blow up or die, you can train recurrent networks far better than before. Ilya Sutskever realized this. He worked with James Martens, who developed a very fancy optimization method called the Hessian-free method that was tailored for neural networks; people in the optimization community had developed Hessian-free methods, but they never really tailored them properly for neural networks. What Hessian-free methods do is make very good use of curvature information. Picture the weights as the horizontal dimensions and the badness of the net as the vertical direction: what they're very good at is finding directions which have very small gradients but even smaller curvatures, because if a direction's got a very small gradient you can still make a lot of progress by going a long way in that direction, provided it's got almost no curvature, if you keep going
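The echo-state initialization described here, rescaling the recurrent weights so the largest eigenvalue magnitude is about 1.2, is easy to sketch. This toy version (sizes are arbitrary choices, not from the lecture) rescales a random matrix and runs the frozen logistic dynamics to show that the state neither dies nor blows up:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 100
W = rng.standard_normal((n, n))
# Rescale so the spectral radius (largest |eigenvalue|) is about 1.2,
# the value the lecture says keeps the dynamics lively but bounded.
rho = np.max(np.abs(np.linalg.eigvals(W)))
W *= 1.2 / rho

h = rng.uniform(0, 1, n)        # initial hidden state
for _ in range(500):            # run the frozen recurrent dynamics
    h = logistic(W @ h)

# In a full echo state network only the output weights would be learned,
# by regressing the desired outputs onto these hidden states.
```

After 500 steps the state is still finite and still varies across units, rather than having collapsed to a constant or saturated everywhere.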
downhill. The problem with training neural nets, and everything else, is that you have some directions with big gradients but also big curvatures, so you can make very fast initial progress but you quickly start going up again; and you have other directions that may have very small curvatures, but they also have small gradients, and those are masked by the directions with big curvatures. You need a very good method to be able to reveal, for example, the direction with the hundredth-biggest eigenvalue, which happens to have a small gradient but an even smaller curvature. And James Martens

got this working, and Ilya worked with him to make it work better, and they then wanted to apply it to recurrent neural nets, and we decided to try the problem of modeling language. Now I have a theory of language. My theory of language is actually a tautology, so it's definitely true, and it's like this: you have a mental state, a word comes in, and you have a new mental state. How could that not be true? Therefore a word is something that operates on a mental state to produce a new mental state. And now we get to the slippery little trick, which is: what's more, your mental state is a vector, and the word is a matrix, and it operates on your mental state linearly. That bit isn't necessarily true, but we could try it. So the idea was: let's have a hundred thousand words, let's have each of them be a matrix, and let's have a mental state that's a great big vector, and each of these words operates on your mental state to produce a new mental state. The problem is that your mental state needs to be a big vector, so each word needs to be a big square matrix, and you need a hundred thousand of them, so you're going to need rather a lot of parameters. It would be much easier if, instead of modeling language at the word level, we modeled it at the character level, because there aren't many characters: we could take all the ASCII characters and strip them down to about 86 characters without losing much. Well, you have to be careful who you're talking to; I think we threw away all the French accents, but in Ontario you're not losing much. So then the problem is to predict the next character in a string, or to make a model that says: you've got a belief state, which is based on the characters you've seen so far, and a new character comes in and changes your belief state. So there are good reasons for using characters: you're going to need many fewer parameters. Now you're going to have to solve some difficult problems, like learning what words are, but compared with understanding natural language that's a fairly trivial problem. So if
you're hoping to understand natural language, learning what characters make a word ought to be trivial. Here are some other reasons for using characters rather than words. The web is composed of characters, so you can just take stuff off the web and model it. It's non-trivial to find out which strings of characters are words, but it's trivial compared with learning all sorts of other things about language, like what it means. Pre-processing text on the web to get words is actually tricky: it's not clear whether you want the words or the morphemes, and it's not even clear what the morphemes are. For example, in English most speakers don't realize that a word starting with SN is almost a morpheme. Not all words starting with SN, but quite a few of them, have a particular aspect to their meaning: SN words on the whole mean something to do with the upper lip or nose. There are just too many of them for that not to be true, like sneer and snarl and snot and sneeze and snog; there's just a huge number. Can anybody think of an exception? Snails? An exception, okay, but they're kind of yucky. Snow? Good, snow, thank you. Normally people shout out snow, so ask yourself why snow is such a good name for cocaine. Okay, it's not just that cocaine's white; it's to do with the upper lip and nose. Then there are things like New York, where it's not clear whether it's one word or two words, and there are lots of things like that in English: these little sub-regularities that you're not explicitly aware of, but that at some level you know about. And then there are languages like Finnish. So this is Finnish, and it's a single word, and it means "despite his lack of understanding". You can now start estimating how many words there are in Finnish, and it's going to be billions and billions of words; they just stack these morphemes together to make up whatever it is they want. It's like German, but worse. Okay, so there are advantages to using characters. Here's a way to
use characters without having nearly as many parameters as you might have thought. We're going to have a mental state that has 2,000 hidden units, and they're going to be logistic hidden units, and we're going to have 86 characters. So you'd have thought each character had to be a 2,000-by-2,000 matrix; that's four million parameters right there, and so

we're going to need about a third of a billion parameters, because we have 86 characters. But we can factorize things and do some sharing and use far fewer parameters, and the way we're going to do it is this. Instead of going straight from here to here via a matrix multiply, I'm going to have things called factors, and I'll have about 2,000 of those (it could be more than 2,000 or less than 2,000). The way a factor works is this: it applies a linear filter to this vector, so it multiplies the activities by some weights to get a scalar; it also applies a linear filter to this vector, but this is the one-of-N vector for characters, so in effect it just takes this one weight; then it multiplies these two scalars together and sends the product out here via these weights. And you have a whole bunch of factors that do that. So now that means characters can share things: for example, all the vowels could have strong weights to a subset of the factors that do sort of vowel-y things, like predicting that a consonant will come next. Another way of thinking about this is to take the outer product of this weight vector and this weight vector to get a square matrix that has four million entries but only four thousand degrees of freedom: it's a rank-1 matrix. Each of these factors is a rank-1 matrix, and what a character does is put weights on a bunch of rank-1 matrices to synthesize a matrix with much higher rank, by adding together all these rank-1 matrices. Obviously, if you've got a few thousand of them you can synthesize more or less any matrix you want, but this system can have far fewer parameters than having 86 arbitrary matrices and still capture the structure of what's going on. Okay, so the way the net works is this: it has some mental state, it sees a character, it uses its factors to get the inputs for the next mental state, it takes the total input each of these units gets, pushes it through the logistic function, and it gets some real-valued vector there;
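Here is a sketch of that factorization. The talk's numbers (2,000 hidden units, 86 characters, about 2,000 factors) are used only for the parameter count; the working demo below uses hypothetical small sizes. Each factor contributes a rank-1 matrix, and a character's transition matrix is a character-weighted sum of those rank-1 matrices, applied without ever forming it explicitly:

```python
import numpy as np

# Parameter count at roughly the sizes quoted in the talk:
H_talk, C_talk, F_talk = 2000, 86, 2000
full = C_talk * H_talk * H_talk                    # 86 arbitrary matrices
factored = 2 * F_talk * H_talk + C_talk * F_talk   # factors + per-character weights
print(full, factored)   # ~344 million vs ~8 million

# Tiny working demo of the same idea (sizes here are arbitrary):
H, C, F = 50, 86, 40
rng = np.random.default_rng(0)
U = rng.standard_normal((F, H))   # factor filters applied to the hidden state
V = rng.standard_normal((F, H))   # factor weights sending the product back out
Wc = rng.standard_normal((C, F))  # one weight per (character, factor) pair

def next_state_input(h, c):
    """Apply character c's synthesized matrix to hidden state h,
    without ever building an H-by-H matrix."""
    s = U @ h                 # each factor filters the hidden state: F scalars
    return V.T @ (Wc[c] * s)  # gate by the character's weights, project out

# Equivalent explicit matrix: a weighted sum of rank-1 outer products.
c = 3
M_c = sum(Wc[c, f] * np.outer(V[f], U[f]) for f in range(F))
h = rng.standard_normal(H)
assert np.allclose(M_c @ h, next_state_input(h, c))
```

The assertion confirms the factored computation matches the explicit weighted sum of rank-1 matrices, at a small fraction of the parameters.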
there's nothing stochastic here. Then from that real-valued vector it's going to make a prediction, via a softmax, about what the next character will be, and now you train it to maximize the log probability of getting the right answer. So you backpropagate the cross-entropy error through here to learn these weights, and through here, and through here, and down to the previous character, and you keep backpropagating; you backpropagate for about a hundred time steps, and so you can update all these weights. The question is: can you see effects that come from many time steps ago? For example, can it learn that after an opening bracket, if there hasn't been a closing bracket already, then 30 time steps later you're much more likely to have a closing bracket than the base rate for closing brackets, because you've got a bracket that's open? Indeed it can learn things like that, and the only method we'd been able to find that can learn things like that is this Hessian-free optimizer. So let me show you some of the things it can do. Ilya took this net, initially used five million strings of a hundred characters each, and started training after the eleventh character, so he gave it a kind of burn-in of eleven characters; then the net started predicting and he started backpropagating, and he ran it for a month on a GPU board. His best model was not quite as good as the state of the art, which uses lots of little neural nets and a manager that decides which one is making good predictions, and also updates its model based on words it has seen recently. Ilya's net is stationary in the sense that its predictor isn't learning at test time, and you have to learn at test time to do really well at this. But it can do things like balance quotes and brackets over long ranges, and Markov models can't possibly do this, because a Markov model can't just ignore stuff; it can't say, I've got an
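The training signal described here, maximizing the log probability of the correct next character through a softmax, has a famously simple gradient: the predicted probabilities minus the one-hot target. A small sketch (86 characters as in the talk; the logits are random stand-ins for the net's output at one time step):

```python
import numpy as np

def softmax(z):
    z = z - z.max()             # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_chars = 86
logits = rng.standard_normal(n_chars)  # the net's output for one time step
target = 7                             # index of the character that came next

p = softmax(logits)
loss = -np.log(p[target])              # cross-entropy = negative log probability
grad = p.copy()
grad[target] -= 1.0                    # d loss / d logits = p - one_hot(target)
# This `grad` is what gets backpropagated through the hidden state and
# back through time toward earlier characters.
```

The `p - one_hot` form is exact, which is easy to confirm against a finite-difference estimate of the loss.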
opening bracket, and there's all this stuff, but I'm waiting for a closing bracket, so please give me a closing bracket. If it wants to say that, what it has to do is take its whole state space and multiply it by two: one copy for having an opening bracket and one for not having an opening bracket. Every time it wants to remember one bit that it's going to carry along for a long time, it has to double the size of its state space, whereas these neural nets can just set aside a few hidden units to remember there's a bracket open. So potentially they're much more powerful, if you can train them.

And this is an example of them doing that. So after it's trained, it's fun to get it to generate text. One way to get it to generate text is to give it an initial string and ask what character it thinks is most likely to come next; then you produce that character, you tell the net, okay, that's what did come next, now what do you predict for the next character? Well, if you do that, it quickly starts saying "the united states of the united states of the united states of the united states of the united states". That's the sort of degenerate loop you get in English Wikipedia. It's much better, obviously, to take the distribution it predicts and sample from that distribution: so if it says there's a one-in-a-thousand chance of the next character being a Q, you produce a Q one time in a thousand. Now obviously, if you do that, you'll sometimes end up producing characters the net thought were very unlikely, and so it'll sometimes do weird things which it knew were weird; it thought they were unlikely. But what's amazing is that most of the time it does something very reasonable. So I'll show you some text generated from the net. We generated a lot of text from the net, and this is selected to be a particularly good passage, so there's cheating going on because we've selected a passage, but there's not much cheating, because this is a contiguous passage; if it were producing garbage, we couldn't have found a long passage that was good. So the net genuinely did this; there was some stuff that came before this. And this tells you a lot about what the net understands. For example, it really understands what a word is: there aren't any non-words here. There are initials and stuff (well, not in this one, but in a later one), and it understands what initials look like too. It either has a limited grasp of geography and intelligence agencies, or it has a great sense of humor. It doesn't stay on topic: it stays on topic within a sentence, but then it switches to something else. A good thing about this is
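The degenerate "united states of the united states" loop is a general property of always emitting the most likely character; sampling from the predicted distribution avoids it. A toy illustration of the two decoding rules, using character bigram counts from a tiny made-up corpus as a stand-in for the net's softmax:

```python
import numpy as np
from collections import defaultdict

# Bigram counts from a tiny corpus, standing in for the net's predictions.
text = "the united states of the united nations and the union of the states "
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def next_dist(c):
    """Distribution over the next character, given the last one."""
    d = counts[c]
    chars = list(d)
    p = np.array([d[x] for x in chars], dtype=float)
    return chars, p / p.sum()

def generate(start, n, greedy, seed=0):
    rng = np.random.default_rng(seed)
    out = start
    for _ in range(n):
        chars, p = next_dist(out[-1])
        # Greedy: always the most likely character. Sampling: draw from p.
        out += chars[int(np.argmax(p))] if greedy else str(rng.choice(chars, p=p))
    return out

print(generate("t", 60, greedy=True))    # falls into a repeating loop
print(generate("t", 60, greedy=False))   # varied (sometimes odd) output
```

Because greedy decoding is a deterministic map over a finite alphabet, it must eventually cycle, which is exactly the "united states of the united states" failure mode; sampling occasionally emits low-probability characters but stays varied.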
you can take this text and search Wikipedia, and if a string is not in Wikipedia it cannot possibly have been in the training set for this net. So you know, for example, that you can look for "escape during an alarm" and it probably isn't there; most of the four-word strings here are not in Wikipedia. It's really generating this stuff, and it generates sort-of technical terms like "from reticulum", but it stays on topic for a whole sentence, and it clearly has a very shallow kind of semantic understanding. So: "such that it is the blurring of appearing on any well paid type of box printer"; well, you've got blurring and appearing and box printer, which all have to do with the appearance of stuff and the production of text. So it gives the impression it understands quite a lot, and the problem with this thing is exactly the same as the problem with grading midterm tests: did they really understand what I said, or have they just managed to generate strings of words that look quite plausible? Here's some more text. You can see, look, you wouldn't say "he forgave Opus Paul at Rome", but you know these things are very highly associated, right? There's Opus Dei, and there's a Pope called Paul, and Rome, and all that stuff. It does produce non-words: this here is a non-word, but you're not quite sure it's a non-word, right? It's a pretty good non-word. In another passage it produced a word that's probably a non-word, but meaning sort of continually not knowing what's going on, it could be a word. It does very well with initials, and notice here: opening quote, a bunch of words, closing quote. Markov models can't do that. Right, and remember what I said at the beginning: of all the characters you've seen, there had to be some where it thought the chance was less than one in a thousand, because you see more than a thousand characters; roughly speaking, that's probably one of those places, where it thought the next character was very unlikely. And also,
because it's trained on limited-length strings, it can sort of get discontinuities. But it's doing pretty well; it's doing much better than we expected. [Audience question.] Yes, you might

well have two spaces. So I actually did some tests on it. If you give it "Sheila" and then a nonsense word, "thrunge", most English speakers will say that "thrunge" is probably a verb, and most English speakers will say that either "thrunged" or "thrunges" is the most likely completion here; and it thinks both of those are pretty likely, and "thrunges" is the one it likes most. So then I thought I'd fool it: I put a comma after "Sheila" and gave "Thrunge" a capital T, so it looked like a list of names, and I expected it to say "Sheila, Thrunge, Fred". It didn't say that. It knows a lot about proper names, and it made up the name of this guy, Thrungelini del Rey, and decided he is actually a maker of exotic movies who has a Spanish father and an Italian mother and is currently living in Switzerland. Okay. You can give it "the meaning of life is", and you often get nonsense, but occasionally you get interesting things, in fact quite often: in the first ten tries, this was the sixth try. If you got "42" it wouldn't be interesting, because I bet that's in Wikipedia; but this is not in Wikipedia, I went and checked. It says the meaning of life is "literary recognition". It would have been better if it had said the meaning of life is getting lots of citations, but this is pleasingly close. Ilya then trained it a lot more. Oh, before I get to that: here are some of the things it knows. It knows a lot about words; it knows what words are. It knows a lot about numbers, proper names and dates, so it knows that you often start paragraphs with things like "In 1973". It can count brackets: if you give it one opening bracket, it will produce a closing bracket typically about five to ten characters later; if you give it two opening brackets, it'll produce a closing bracket only a few characters later; it gets very anxious with two open brackets, saying, you know, I've got too many brackets open here. But it can't tell the difference between two and three: it counts none, one, or two; none, one,
or many, basically; but it definitely behaves differently after two opening brackets and after one opening bracket. It knows a bunch of math too: for example, if you say "f" and an opening bracket, it will almost certainly put "x" and a closing bracket. It knows an awful lot of syntax, but it's short-range syntax, and of course it's not in the form of proper syntactic rules like a linguist would like. It knows lots of semantics, but it's shallow semantics. I only ever saw it say "Wittgenstein" once, and it said it about ten words after "Plato", and I'm convinced it sort of knows that Wittgenstein and Plato have something to do with each other. It knows that cabbages and vegetables tend to occur in the same sentence, and things like that. That kind of semantic knowledge is like the knowledge I can make you exhibit if I force you to answer questions very quickly. For example, I need a cooperative audience here of people who haven't seen this before. I'm going to ask you a question and you have to shout out the answer that comes into your head (no, I haven't asked the question yet), and the first person to shout wins the prize, and the essential thing is speed; don't filter it to see if it makes any sense, just shout the answer as fast as you can. Okay: what do cows drink? Very good. Now, it's true some cows do drink milk, and most cows drink water, but cows and drink are associated with milk, right? And that's the kind of knowledge this net has; it has lots of that knowledge. So Ilya then trained it some more. Philosophically, I believe that if you get a long enough string of text, there's enough information in that string to understand the world. Now, you have to be very careful: you have to use it to build a model of the world, and that model is still up for interpretation, but the model ought to be able, for example, to answer questions. Whether that's what it all really means (this is like the Matrix movies) is another question, but with enough text
you'll be able to understand English well enough to answer questions. So wouldn't it be nice if you could take this neural net, train it up on a billion characters of Wikipedia, then ask it the meaning of life and get the right answer? This is the closest we've got so far, and this was one of its first ten attempts after we'd given it a lot more training. Because of the way PowerPoint works I cannot show you this in PowerPoint, but I can show it to you like this: what we're going to do now is get it to predict one

character at a time. It's more fun like this: you run till you hit a full stop. Its syntax isn't perfect, and honestly, the net did this, it wasn't me. Okay, it's got close. That's the end of this lecture.