What does "learning rate warm-up" mean?
If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to reach the desired convergence, as the model un-trains those early superstitions.
Many frameworks offer this as a command-line option. The learning rate is increased linearly over the warm-up period: if the target learning rate is p and the warm-up period is n batch iterations, the first iteration uses a learning rate of 1*p/n, the second uses 2*p/n, and in general iteration i uses i*p/n, until the nominal rate p is reached at iteration n.
This means that the first iteration gets only 1/n of the primacy effect.
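To make the schedule concrete, here is a minimal Python sketch of the linear ramp described above. The function and parameter names (`warmup_lr`, `target_lr`, `warmup_iters`) are illustrative choices, not taken from any particular framework:

    def warmup_lr(iteration: int, target_lr: float, warmup_iters: int) -> float:
        """Return the learning rate for a given batch iteration (1-indexed)."""
        if iteration < warmup_iters:
            # During warm-up, ramp linearly: iteration i uses i * p / n.
            return iteration * target_lr / warmup_iters
        # After the warm-up period, use the nominal rate.
        return target_lr

    # Example: target rate p = 0.1 with a 5-iteration warm-up.
    for i in range(1, 8):
        print(i, warmup_lr(i, target_lr=0.1, warmup_iters=5))
    # -> 0.02, 0.04, 0.06, 0.08, 0.1, 0.1, 0.1

In practice you would call something like this once per batch and pass the result to your optimizer; most frameworks bundle the same logic into a built-in scheduler so you rarely write it by hand.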