# 35C3 – Introduction to Deep Learning

separable by a line. If we look again at some training samples — training samples are the data points we use for the machine learning process, that is, to find the parameters of our statistical model — if we look at the line again, then it will not be able to separate this training set well. We will have a line that makes some errors: some unicorns which will be classified as rabbits, some rabbits which will be classified as unicorns. This is what we call underfitting: our model is simply not able to express what we want it to learn.

There is the opposite case, too. The opposite case being: we just learn all the training samples by heart. This happens if we have a very complex model and only a few training samples to teach the model what it should learn. In this case we have a perfect separation of unicorns and rabbits, at least for the few data points we have. But if we draw another example from the real world, some other data points, they will most likely be classified wrongly. This is what we call overfitting.

The perfect scenario would be something like this: a classifier which is really close to the distribution we have in the real world. Machine learning is tasked with finding this perfect model and its parameters.

Let me show you a different kind of model, something you have probably all heard about: neural networks. Neural networks are inspired by the brain, or more precisely, by the neurons in our brain. Neurons are tiny cells in our brain that take some input and generate some output. Sounds familiar, right?
We have inputs, usually in the form of electrical signals, and if they are strong enough, the neuron will also send out an electrical signal. This is something we can model in a computer-engineering way. So what we do is: we take a neuron. The neuron is just a simple mapping from input to output. Here we have just three input nodes, which we denote by i1, i2 and i3, and one output, denoted by o.

Now you will actually see some mathematical equations. There are not many of these in this foundation talk, don't worry, and it's really simple. There is one more thing we need first, though, if we want to map input to output the way a neuron does: the weights. The weights are just some arbitrary numbers for now; let's call them w1, w2 and w3. We take those weights and multiply them with the input: input1 times weight1, input2 times weight2, and so on. The sum of these products will be our output: o = i1·w1 + i2·w2 + i3·w3.

Well, not quite; we make it a little bit more complicated. We also use something called an activation function. The activation function is just a mapping from one scalar value to another scalar value: from the sum we got as an output to something that more closely fits what we need. This could, for example, be something binary, where all the negative numbers are mapped to zero and all the positive numbers are mapped to one. Then this zero and one can encode something, for example: rabbit or unicorn.

So let me give you an example of how we can make the previous rabbits-and-unicorns example work with such a simple neuron. We use speed, size, and the arbitrarily chosen number 10 as our inputs, and the weights 1, 1, and -1. If we look at the equations, we get a 0 for all negative sums (speed plus size being less than 10) and a 1 for all positive sums (speed plus size greater than 10). This way we again have a separating line between unicorns and rabbits. But again, we have this really simplistic model. We want
to become more and more complicated in order to express more complex tasks. So what do we do? We take more neurons. We take our three input values and put them into one neuron, into a second neuron, and into a third neuron. Then we take the output of those three neurons as input for another neuron. We call this a multilayer perceptron, perceptron just being a different name for the kind of neuron we have here. The whole thing is also called a neural network. So now the question: how do we train this? How do we learn what this network should encode? Well, we want a mapping from input to output, and what we can change are the weights. First, we take a training sample, some input

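The single neuron and the small multilayer perceptron described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names and the choice of a binary step activation are my own, but the weights (1, 1, -1) and the inputs (speed, size, and the constant 10) are the ones from the rabbit/unicorn example.

```python
def step(x):
    """Binary activation: negative sums map to 0, positive sums to 1."""
    return 1 if x > 0 else 0

def neuron(inputs, weights, activation=step):
    """A single neuron: weighted sum of inputs, passed through an activation."""
    s = sum(i * w for i, w in zip(inputs, weights))
    return activation(s)

def classify(speed, size):
    """The rabbit/unicorn classifier from the talk:
    outputs 1 ("unicorn") when speed + size > 10, else 0 ("rabbit")."""
    return neuron([speed, size, 10], [1, 1, -1])

def mlp(inputs, hidden_weights, output_weights):
    """A tiny multilayer perceptron: three hidden neurons, whose outputs
    feed one output neuron."""
    hidden = [neuron(inputs, w) for w in hidden_weights]
    return neuron(hidden, output_weights)
```

Training then means adjusting the weight lists until `classify` (or `mlp`) produces the desired output for the training samples.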
It is again a hidden layer in a neural network, but it does something special. It is actually a very simple neuron again: just four input values and one output value. But the four input values look at a two-by-two patch of pixels and encode one output value. Then the same network is shifted to the right and encodes another pixel, and another pixel, and the next row of pixels, and in this way creates another 2D image. We have preserved information about the neighborhood, and we have a very low number of weights, not the huge number of parameters we saw earlier. We can apply this once, or twice, or several hundred times, and this is actually where we go deep. Deep means: we have several layers, and having layers that don't need thousands or millions of connections, but only a few, is what allows us to go really deep. In this fashion we can encode an entire image in just a few meaningful values. What these values look like, and what they encode, is learned through the learning process.

We can then, for example, use these few values as input for a classification network, the fully connected network we saw earlier. Or we can do something more clever: we can do the inverse operation and create an image again, for example the same image. This is then called an autoencoder. Autoencoders are tremendously useful, even though they don't appear that way. For example, imagine you want to check whether something has a defect or not: a picture of a fabric, or of something similar. You just train the network with normal pictures. Then, if you have a picture with a defect, the network is not able to reproduce the defect, and the difference between the reproduced picture and the real picture will show you where the errors are. If it works properly, I'll have to admit that.

But we can go even further. Let's say we want to encode something else entirely. Well, let's encode the information in the image, but in another representation. For example, let's say we have three classes again: the
background class in grey, a class called hat or headwear in blue, and person in green. We can also use this for applications other than pictures of humans. For example, we have a picture of a street and want to encode: where is the car, where is the pedestrian? Tremendously useful. Or we have an MRI scan of a brain: where in the brain is the tumor? Can we somehow learn this? Yes, we can, with methods like these, if they are trained properly. More about that later.

Well, we expect something like this to come out, but the truth looks rather like this, especially if the network is not properly trained: we don't get the real shape we want, but something distorted. So here is again where we need to do learning. First we take a picture, put it through the network, and get our output representation. We also have the information about how we want it to look. We again compute some kind of loss value, this time for example being the overlap between the shape we get out of the model and the shape we want to have. And we use this error, this loss function, to update the weights of our network. Even though it's more complicated here (we have more layers, and the layers look slightly different), it is the same process all over again as in the binary case.

And we need lots of training data. This is something that you'll hear often in connection with deep learning: you need lots of training data to make this work. Images are complex things, and in order to meaningfully extract knowledge from them, the network needs to see a multitude of different images.

Well, now I have already shown you some things we use in network architectures, some support networks: the fully convolutional encoder, which takes an image and produces a few meaningful values out of it; and its counterpart, the fully convolutional decoder. Fully convolutional means, by the way, that we only have these convolutional layers with a few parameters that somehow encode spatial information and keep it for
the next layers. The decoder takes a few meaningful numbers and reproduces an image: either the same image, or another representation of the information encoded in the image. We also already saw the fully connected network.

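The two building blocks just described, a two-by-two convolution that shrinks the image and an overlap-based loss between a predicted shape and the desired one, can be sketched as follows. This is a minimal sketch under my own assumptions (a single filter, stride 2, binary masks flattened to lists); the names do not come from the talk.

```python
def conv2x2(image, weights):
    """Slide one 2x2 filter (four shared weights) over the image in steps
    of two pixels, producing one output value per patch: a smaller 2D image."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h - 1, 2):
        row = []
        for c in range(0, w - 1, 2):
            patch = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            row.append(sum(p * wt for p, wt in zip(patch, weights)))
        out.append(row)
    return out

def overlap_loss(pred, truth):
    """Loss based on the overlap (intersection over union) between a
    predicted binary mask and the ground-truth mask:
    0.0 for perfect overlap, 1.0 for no overlap at all."""
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    return 1 - inter / union if union else 0.0
```

Note that the four weights are shared across all patches; that weight sharing is exactly why a convolutional layer needs so few parameters compared to a fully connected one.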
Fully connected meaning that every neuron is connected to every neuron in the next layer. This, of course, can be dangerous, because this is where we actually get most of our parameters. If we have a fully connected network, this is where the most parameters will be present, because connecting every node to every node is just a high number of connections.

We can also do other things, for example something called a pooling layer. A pooling layer is basically the same as one of those convolutional layers, except that there are no parameters we need to learn. It works without parameters because the neuron just chooses whichever input value is the highest and takes that value as its output. This is really great for reducing the size of your image and also for getting rid of information that might not be that important. We can also use some clever techniques like adding a dropout layer. A dropout layer is just a normal layer in a neural network where we remove some connections: in one training step these connections, in the next training step some other connections. This way we teach the remaining connections to become more resilient against errors.

I would like to start with something I call the “Model Show” now, and show you some models and how we train them. I will start with the fully convolutional decoder we saw earlier: the thing that takes a number and creates a picture. I would like to take this model, put in some number, and get out a picture, a picture of a horse for example. If I put in a different number, I also want to get a picture of a horse, but of a different horse. So what I want is a mapping from some numbers, some features that encode something about the horse picture, to a horse picture. You might already see why this is problematic. It is problematic because we don't have a mapping from features to horse, or from horse to features. So we don't have a truth value we can use to learn this mapping. Well, computer vision engineers – or
deep learning professionals – are smart and have clever ideas. Let's just assume we have such a network, and let's call it a generator. We take some numbers, put them into the generator, and get some horses. Well, it doesn't work yet; we still have to train it. So they're probably not only horses, but also some very special unicorns among the horses, which might be nice for other applications, but I want pictures of horses right now. So I can't train with this data directly.

But what I can do is create a second network. This network is called a discriminator, and I can give it the output generated by the generator as well as the real data I have: the real horse pictures. Then I can teach the discriminator to distinguish between the two: tell me whether it is a real horse or not. And there I know what the truth is, because I either take real horse pictures or fake horse pictures from the generator. So I have a truth value for the discriminator. But in doing this, I also have a truth value for the generator, because I want the generator to work against the discriminator. So I can use the information about how well the discriminator does to train the generator to become better at fooling it. This is called a generative adversarial network, and it can be used to generate pictures from an arbitrary distribution.

Let's do this with numbers, and I will actually show you the training process. Before I start the video, I'll tell you what I did. I took some handwritten digits. There is a database called “MNIST
of handwritten digits”, the numbers 0 to 9. I took those and used them as training data. I trained a generator in the way I showed you on the previous slide, and then I just took some random numbers, put those random numbers into the network, and stored the image of what came out. Here in the video you'll see how the network improves with ongoing training. You will see that we start with basically just noisy images, and then, after some epochs, that is, training iterations, the network is able to almost perfectly generate handwritten digits just from noise.
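The alternating training scheme described above can be sketched as a loop. This is only a structural sketch under my own assumptions: `generator` and `discriminator` are trivial stand-ins for real networks, the losses are merely computed rather than used for gradient updates, and none of these names come from the talk. The point is to show where each network's truth value comes from.

```python
import random

def generator(z, params):
    """Stand-in generator: maps a random number z to a 'sample'.
    A real generator would be a fully convolutional decoder."""
    return params["scale"] * z + params["shift"]

def discriminator(x, params):
    """Stand-in discriminator: scores how 'real' a sample looks, in (0, 1]."""
    return 1.0 / (1.0 + abs(x - params["center"]))

def gan_training_step(real_batch, g_params, d_params):
    """One alternating step of adversarial training.

    The discriminator has known truth labels: real samples should score 1,
    generated (fake) samples should score 0.  The generator's training signal
    is the discriminator's judgement of its fakes: it wants high scores."""
    fakes = [generator(random.random(), g_params) for _ in real_batch]
    # Discriminator loss: penalize low scores on real data, high scores on fakes.
    d_loss = sum(1 - discriminator(x, d_params) for x in real_batch) \
           + sum(discriminator(x, d_params) for x in fakes)
    # Generator loss: penalize fakes that the discriminator scores as unreal.
    g_loss = sum(1 - discriminator(x, d_params) for x in fakes)
    return d_loss / (2 * len(real_batch)), g_loss / len(real_batch)
```

A real implementation would use these two loss values to update the respective network's weights by gradient descent, alternating between the two networks every step.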