TensorFlow - the learning rate change for the momentum optimizer
When running an existing TensorFlow implementation, I found that the learning rate stays the same between different epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate set up.

My understanding is that the learning rate of the momentum optimizer should decrease along the epochs, so why does it stay the same throughout the training process? Is it possible that the learning rate depends on the performance, e.g., if the performance does not change quickly, the learning rate stays the same? I think I am not clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same along the epochs even though I would guess it should keep decreasing based on the given decay rate.
The optimizer is defined as follows:
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                     global_step=global_step,
                                                     decay_steps=training_iters,
                                                     decay_rate=decay_rate,
                                                     staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                       global_step=global_step)
It is a little bit hard to tell without looking at the code whether this answer will be helpful to you or not.
However, here are some insights on how the momentum optimizer works and how the learning rate should decay.
First, the vanilla GradientDescentOptimizer's update is the most basic one:

w^(n+1) = w^(n) - alpha * (gradient of cost wrt w)(w^n)

You are just following the opposite of the gradient.
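As a minimal sketch of that update rule in plain NumPy (the grad_cost function here is a hypothetical stand-in for the gradient of the cost, not something from the original code):

import numpy as np

def vanilla_gradient_descent(w, grad_cost, alpha=0.2, steps=100):
    # w^(n+1) = w^(n) - alpha * grad_cost(w^(n))
    for _ in range(steps):
        w = w - alpha * grad_cost(w)  # step in the direction opposite to the gradient
    return w

# Toy example: cost = 0.5 * ||w||^2, whose gradient is simply w
w_final = vanilla_gradient_descent(np.array([1.0, -2.0]), grad_cost=lambda w: w)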
The GradientDescentOptimizer with learning rate decay:

w^(n+1) = w^(n) - alpha(n) * (gradient of cost wrt w)(w^n)

The only thing that has changed is that the learning rate alpha is now dependent on the step. In TensorFlow the most commonly used form is exponential decay: after N steps the learning rate is divided by some constant, e.g. 10.

This change happens later in training, so you might need to let a few epochs pass before seeing the decay.
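As a rough sketch of what the staircase exponential decay computes (this mirrors the documented formula of tf.train.exponential_decay, not its actual source code):

def staircase_exponential_decay(starter_rate, global_step, decay_steps, decay_rate):
    # With staircase=True the exponent is an integer division, so the rate stays
    # constant for decay_steps iterations and then drops all at once.
    return starter_rate * decay_rate ** (global_step // decay_steps)

# With decay_steps equal to one epoch's worth of iterations, the rate only changes
# once per epoch, which is why it can look constant over a short training run.
for step in (0, 999, 1000, 5000):
    print(step, staircase_exponential_decay(0.2, step, decay_steps=1000, decay_rate=0.95))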
The MomentumOptimizer: you have to keep an additional variable, the update you have done just before, i.e. you have to store at each time step:

update^(n) = (w^(n) - w^(n-1))

Then the corrected update with momentum is:

update^(n+1) = m * update^(n) - alpha * (gradient of cost wrt w)(w^n)

So what you are doing is simple gradient descent, but you correct it by remembering the immediate past (there are smarter and more complicated ways of doing it, like Nesterov's momentum).
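A minimal sketch of that momentum update in plain NumPy (m=0.9 is just an illustrative value, and grad_cost is again a hypothetical stand-in for the gradient of the cost):

import numpy as np

def momentum_descent(w, grad_cost, alpha=0.2, m=0.9, steps=100):
    update = np.zeros_like(w)                       # update^(0) = 0
    for _ in range(steps):
        update = m * update - alpha * grad_cost(w)  # update^(n+1) = m*update^(n) - alpha*grad
        w = w + update                              # w^(n+1) = w^(n) + update^(n+1)
    return w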
The MomentumOptimizer with learning rate decay:

update^(n) = (w^(n) - w^(n-1))
update^(n+1) = m * update^(n) - alpha(n) * (gradient of cost wrt w)(w^n)

Now alpha is dependent on n too.

So at some point it will start slowing down, as in gradient descent with learning rate decay, but the decrease will be affected by the momentum.
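Combining the two, still as a sketch under the same assumptions, the only change is that alpha becomes a function of the step n:

import numpy as np

def momentum_with_decay(w, grad_cost, alpha0=0.2, decay_rate=0.95,
                        decay_steps=1000, m=0.9, steps=5000):
    update = np.zeros_like(w)
    for n in range(steps):
        alpha_n = alpha0 * decay_rate ** (n // decay_steps)  # alpha(n), staircase decay
        update = m * update - alpha_n * grad_cost(w)
        w = w + update
    return w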
For a complete review of these methods and more, there is an excellent website that explains it far better than I can, and Alec Radford's famous visualization is worth a thousand words.
The learning rate should not depend on the performance unless that is specified in the decay! See the code in the question!
EDIT 1: Here is a working example that I think answers both questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Pure SGD
batch_size = 1
# Batch gradient descent
# batch_size = 1000

starter_learning_rate = 0.001

xdata = np.linspace(0., 2 * np.pi, 1000)[:, np.newaxis]
ydata = np.sin(xdata) + np.random.normal(0.0, 0.05, size=1000)[:, np.newaxis]
plt.scatter(xdata, ydata)

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# We define a global_step variable initialized at 0
global_step = tf.Variable(0, trainable=False)

w1 = tf.Variable(0.05 * tf.random_normal((1, 100)), tf.float32)
w2 = tf.Variable(0.05 * tf.random_normal((100, 1)), tf.float32)
b1 = tf.Variable(np.zeros([100]).astype("float32"), tf.float32)
b2 = tf.Variable(np.zeros([1]).astype("float32"), tf.float32)

h1 = tf.nn.relu(tf.matmul(x, w1) + b1)
y_model = tf.matmul(h1, w2) + b2

l = tf.reduce_mean(tf.square(y_model - y))

# We want to decrease the learning rate after having seen all the data 5 times
num_epochs_per_decay = 5
learning_rate_decay_factor = 0.1
# Since the mechanism of the decay depends on the number of iterations and not epochs,
# we have to connect the number of epochs to the number of iterations.
# So if batch_size=1 we have to iterate 1000 times for 1 epoch, hence 5*1000=5000 iterations
# before decaying; if batch_size is 1000, 1 iteration = 1 epoch and we decay after 5 iterations.
num_batches_per_epoch = int(xdata.shape[0] / float(batch_size))
decay_steps = int(num_batches_per_epoch * num_epochs_per_decay)

decayed_learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                                   global_step,
                                                   decay_steps,
                                                   learning_rate_decay_factor,
                                                   staircase=True)
# So now we have an object that depends on global_step and that is divided by 10 every
# decay_steps iterations, i.e. when global_step = N * decay_steps with N a non-zero integer

# We create a train_step to which we pass the learning rate created above; each time this
# operation is run, global_step is incremented by 1. We are going to check that, so be
# careful to give the global_step argument
train_step = tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(l, global_step=global_step)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

global_s = []
lr_val = []
costs = []
for i in range(16000):
    # We do 16000 iterations: as there is a decay every 5000 iterations we will see
    # 3 decays (at 5000, 10000 and 15000)
    start_data = (i * batch_size) % 1000
    costs.append([sess.run(l, feed_dict={x: xdata, y: ydata})])
    global_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    # As you will see, train_step is implicitly executing sess.run(tf.add(global_step, 1))
    sess.run(train_step, feed_dict={x: xdata[start_data:start_data + batch_size],
                                    y: ydata[start_data:start_data + batch_size]})

plt.figure()
plt.subplot(211)
plt.plot(global_s, lr_val, "-b")
plt.title("Evolution of the learning rate")
plt.subplot(212)
plt.plot(global_s, costs, ".g")
plt.title("Evolution of the cost")

# Notice two things: global_step is being incremented and the learning rate is being decayed
(You can write MomentumOptimizer() instead of GradientDescentOptimizer(), obviously...)
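For example (the momentum value 0.9 here is just an arbitrary choice for illustration, not something fixed by the example above):

train_step = tf.train.MomentumOptimizer(decayed_learning_rate, momentum=0.9).minimize(
    l, global_step=global_step)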
Here are the two plots I get. To sum up, keep in mind that when you call train_step, TensorFlow runs tf.add(global_step, 1).