Stochastic gradient descent (SGD) updates each parameter by subtracting the gradient of the loss with respect to that parameter, scaled by the learning rate η, a hyperparameter. If η is too large, SGD can diverge; if it is too small, convergence is slow. The update rule is simply

θ_{t+1} = θ_t − η ∇L(θ_t)
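As a minimal sketch, the update rule translates directly into code. The function name sgd_step, the grad_fn callback, and the toy quadratic loss below are illustrative assumptions, not part of the original post.

```python
import numpy as np

def sgd_step(theta, grad_fn, lr=0.01):
    """One SGD update: theta <- theta - lr * grad(L)(theta)."""
    return theta - lr * grad_fn(theta)

# Toy example: minimize L(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([3.0, -2.0])
grad_fn = lambda t: 2.0 * t

for step in range(100):
    theta = sgd_step(theta, grad_fn, lr=0.1)

print(theta)  # converges toward [0, 0]
```

With lr=0.1 the iterate shrinks by a factor of 0.8 per step and converges; setting lr above 1.0 for this loss would make each step overshoot and diverge, matching the caveat about η above.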