Learning rate finder [1:22:58]

Let’s run the learning rate finder. We are learning about what that is next week, but for now, just know it is the thing that figures out the fastest rate I can train this neural network at without making it zip off the rails and get blown apart.

```python
learn.lr_find()
learn.recorder.plot()
```

This will plot the result of our LR finder, and what it basically shows you is this key parameter called a learning rate. The learning rate says how quickly we update the parameters in our model. The x-axis shows what happens as I increase the learning rate, and the y-axis shows what the loss is. You can see that once the learning rate gets past 10^-4, my loss gets worse. It so happens, in fact (I can check this if I press shift+tab here), that my learning rate defaults to 0.003. So you can see why our loss got worse: because we are trying to fine-tune things now, we can’t use such a high learning rate. Based on the learning rate finder, I tried to pick something well before it started getting worse, so I decided to pick 1e-6. But there’s no point training all the layers at that rate, because we know that the later layers worked just fine before when we were training much more quickly. So what we can actually do is pass a range of learning rates to learn.fit_one_cycle, and we do it like this:

```python
learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))
```

```
Total time: 00:41
epoch  train_loss  valid_loss  error_rate
1      0.226494    0.173675    0.057219   (00:20)
2      0.197376    0.170252    0.053227   (00:20)
```

You use the built-in Python function slice here, which can take a start value and a stop value. What this basically says is: train the very first layers at a learning rate of 1e-6, train the very last layers at 1e-4, and distribute the learning rates for all the other layers equally between those two values.
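To get a feel for what that distribution looks like, here is a rough sketch (an illustration of the idea, not fastai’s actual implementation), assuming three layer groups and log-even spacing between the two ends of the slice:

```python
import numpy as np

# Illustration only: spread slice(1e-6, 1e-4) across three layer groups
# with log-even (multiplicative) spacing from start to stop.
start, stop, n_groups = 1e-6, 1e-4, 3
group_lrs = np.geomspace(start, stop, n_groups)
print(group_lrs)  # [1.e-06 1.e-05 1.e-04]: earliest layers slowest, head fastest
```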

How to pick learning rates after unfreezing [1:25:23]

A good rule of thumb is: after you unfreeze (i.e. train the whole thing), pass max_lr a slice, and make the second part of that slice about 10 times smaller than your first stage’s learning rate. Our first stage defaulted to about 1e-3, so the second part should be about 1e-4. The first part of the slice should be a value from your learning rate finder which is well before things started getting worse. On the plot, things are starting to get worse at maybe about 1e-4.

So I picked something that’s at least 10 times smaller than that.
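Expressed as code, the rule of thumb looks something like this (the exact numbers are illustrative; the lower value comes from reading your own lr_find() plot):

```python
# Rule-of-thumb sketch; plug in values from your own run.
stage1_lr = 3e-3             # fit_one_cycle's default learning rate from stage 1
upper = stage1_lr / 10       # second part of the slice: ~10x smaller than stage 1
lower = 1e-6                 # first part: well before the loss starts rising

learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(lower, upper))
```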

If I do that, then the error rate gets a bit better. So I would perhaps say, for most people most of the time, these two stages are enough to get a pretty much world-class model. You’ll get something that’s about as good in practice as what the vast majority of practitioners can do.
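To recap, the whole two-stage recipe fits in a few lines. This is a sketch assuming fastai v1 and the data object and learner built earlier in the lesson:

```python
from fastai.vision import *   # fastai v1, as used in this lesson

# Sketch of the two-stage recipe; assumes `data` was built earlier.
learn = create_cnn(data, models.resnet34, metrics=error_rate)

# Stage 1: train just the head at the default learning rate (~3e-3).
learn.fit_one_cycle(4)

# Stage 2: unfreeze, check the LR finder, then fine-tune the whole
# network with discriminative learning rates via slice.
learn.lr_find()
learn.recorder.plot()
learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
```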

On the learning rate finder, what you are looking for is the strongest downward slope that’s sticking around for quite a while. It’s something you are going to have to practice with and get a feel for which bit works. So if you are not sure, try both learning rates and see which one works better.
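As an aside, recent fastai v1 releases can also mark a suggested point on the plot for you (the point of steepest descent); whether this parameter is available depends on your version, so treat it as an optional convenience:

```python
# Version-dependent fastai v1 feature: ask the recorder to mark the LR
# at the steepest downward slope of the loss curve.
learn.lr_find()
learn.recorder.plot(suggestion=True)
suggested_lr = learn.recorder.min_grad_lr   # LR at the steepest gradient
print(suggested_lr)                         # a starting point, not gospel
```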

For my top learning rate, I normally pick 1e-4 or 3e-4; it’s kind of like I don’t really think about it too much. That’s a rule of thumb that always works pretty well. One of the things you’ll realize is that most of these parameters don’t actually matter that much in detail. If you just copy the numbers that I use each time, the vast majority of the time, it’ll just work fine.