So, what is “training”? (Lesson 2) Training is how we

  • create a mathematical function
  • that takes a bunch of numbers – e.g., the numbers that represent the pixels of a picture of “8”
  • and spits out the probabilities for each possible class/answer – e.g., 92% that it’s an ‘8’

So, how do we create a function that does that?

  • Pictures are too complicated to do the math on,
  • so we’ll start with something way, way simpler: given temperature, predict how many ice creams are sold (Lesson 2’s example)


The simplest function we can start with: the equation for a line

y = mx + b
m: "gradient" of the line

The simplest version of what “gradient” means: it’s another word for “slope.” (In vector calculus, the gradient is the multi-variable generalization of the derivative.)

b: “intercept” of the line

Only what we’re actually going to use is y = a1x1 + a2x2, where x2 is always 1

Why? (Not sure yet. J says: in machine learning, we don’t have one equation, we’ve got lots.)

or: y[i] = a1x[i,1] + a2x[i,2]

a1 and a2 are the coefficients or parameters

This lets us deal with it using linear algebra

NOTE: to multiply matrices, you multiply the pieces pairwise and then add them up, aka a “dot product.” Do that for each row/column pair and you’ve got a matrix product
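A quick sketch of that in PyTorch (my own example, not from the lesson):

```python
import torch

# Dot product: multiply the pieces pairwise, then add them up
u = torch.tensor([1., 2., 3.])
v = torch.tensor([10., 20., 30.])
print((u * v).sum())   # tensor(140.)
print(u @ v)           # same thing via the matrix-multiply operator

# Matrix product: that same dot product, once per row/column pair
m = torch.tensor([[1., 2.],
                  [3., 4.]])
n = torch.tensor([[5., 6.],
                  [7., 8.]])
print(m @ n)           # tensor([[19., 22.], [43., 50.]])
```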

For our “x” data, create a “tensor” of 2 columns – col 1: a random number between -1 and 1; col 2: all 1s. The column of 1s is what a2 multiplies, so the intercept gets handled by the same matrix multiplication

For “a”, create a rank-1 tensor of (3., 2.). (“Rank” = how many axes/dimensions; in math world, rank 1 is a vector, rank 2 is a matrix.)

(3, 2) was the correct answer, i.e., m, aka a1, was 3 and b, aka a2, was 2.
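Roughly what that setup looks like in code – my reconstruction, so the names and the added noise may differ from the lesson’s notebook:

```python
import torch

n = 100
x = torch.ones(n, 2)               # col 2 stays all 1s - it carries the intercept
x[:, 0].uniform_(-1., 1.)          # col 1: random numbers between -1 and 1

a = torch.tensor([3., 2.])         # the "correct answer": a1 (slope) = 3, a2 (intercept) = 2
y = x @ a + torch.randn(n) * 0.2   # one matrix multiply computes every y[i]; noise scatters the points
```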

What if we didn’t know the correct answers were 3,2?
They’re what stats people call coefficients and PyTorch calls parameters

Loss function: a function we can run on the predicted vs. actual numbers to tell us how accurate our prediction, aka the line generated by our parameters, was.
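A standard choice for a problem like this (and, if I remember right, the one the lesson uses) is mean squared error:

```python
def mse(y_pred, y):
    # Mean squared error: average of the squared differences
    # between predicted and actual values
    return ((y_pred - y) ** 2).mean()
```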

The basic strategy: start with a wild-ass guess for the parameters

Then we calculate the difference between this guess and the right answer (from the training data), then make the guess a bit better. For a line, which has only 2 parameters, we ask: what if we made the intercept a bit higher or lower? What if we made the gradient/slope a bit more positive or negative?

2 parameters × 2 options = 4 possible outcomes. So we calculate the loss for each of the 4, and whichever did best, that’s the move we make.

Only we don’t have to literally move each parameter up and down – we can calculate the derivative, which tells us what moving it around would have done. J: the derivative “tells you how changing one thing changes the function.” The derivative is kinda-sorta, close enough, the gradient – how nudging a parameter up or down would change how close we were. (In PyTorch, you calculate the gradient using the “backward” method.)

Learning rate is the amount we multiply the gradient by – essentially how big of a jump we take each step
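A minimal sketch of that loop, reusing x, y, and mse from the sketches above (the starting guess and learning rate here are arbitrary):

```python
import torch

a = torch.tensor([-1., 1.], requires_grad=True)  # the wild-ass guess
lr = 1e-1                                        # learning rate

for step in range(100):
    loss = mse(x @ a, y)     # how wrong are the current parameters?
    loss.backward()          # PyTorch fills a.grad with the gradient
    with torch.no_grad():
        a -= lr * a.grad     # jump downhill, scaled by the learning rate
        a.grad.zero_()       # clear the gradient before the next step

print(a)  # should end up near tensor([3., 2.])
```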

He suggests running the animation of SGD with a big learning rate, a small learning rate, etc., to get a feel for it

Stochastic gradient descent: do all of that, but not on your whole dataset – on mini-batches of your data. If you’ve got 1 million images and you use all of them, you’re calculating the loss function on 1 million images every single step.
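A sketch of the mini-batch version, reusing a, lr, mse, x, and y from above (batch size 16 is an arbitrary choice):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dl = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

for epoch in range(10):
    for xb, yb in dl:              # one small chunk of the data at a time
        loss = mse(xb @ a, yb)     # loss on 16 rows, not the whole dataset
        loss.backward()
        with torch.no_grad():
            a -= lr * a.grad
            a.grad.zero_()
```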

The danger with epochs: if you try to fit your data too many times – if you look at each image too many times – it’ll start overfitting: matching those exact images too closely instead of learning something that generalizes

So when we created that teddy bear detector, what we actually did was create a mathematical function that took the numbers from the images of the teddy bears and converted those numbers into, in our case, three numbers: the probability that it’s a teddy, the probability that it’s a grizzly, and the probability that it’s a black bear.

In this case, there’s some hypothetical function that takes the pixels representing a handwritten digit and returns ten numbers: the probability for each possible outcome (i.e. the digits from zero to nine). So what you’ll often see in our code and other deep learning code is a bunch of probabilities like that with a function called max or argmax attached to it. What that function does is find the highest number (i.e. probability) and tell you its index. So np.argmax or torch.argmax of such an array would return the index 8.
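For example (a made-up probability array, with the biggest value at index 8):

```python
import torch

# Hypothetical output for a picture of an "8": ten probabilities,
# one per digit 0-9
probs = torch.tensor([0.00, 0.01, 0.00, 0.02, 0.01,
                      0.00, 0.01, 0.02, 0.92, 0.01])
print(torch.argmax(probs))   # tensor(8) -> the predicted digit
```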