Now let's define the model that learns to play Catch with Q-Learning. We use Keras as a frontend to TensorFlow. Our baseline model is a simple three-layer dense network, which already performs well on the simple version of Catch. You can find the full implementation on GitHub.
You can also try more complex models and test whether they achieve better performance (a possible variant is sketched after the baseline code below).
```python
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import sgd

num_actions = 3    # [move_left, stay, move_right]
hidden_size = 100  # Size of the hidden layers
grid_size = 10     # Size of the playing field

def baseline_model(grid_size, num_actions, hidden_size):
    # setting up the model with keras
    model = Sequential()
    model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
    model.add(Dense(hidden_size, activation='relu'))
    model.add(Dense(num_actions))
    model.compile(sgd(lr=.1), "mse")
    return model
```
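If you want to experiment along the lines suggested above, a slightly deeper variant might look like the sketch below. The extra hidden layer and the Adam optimizer are my own choices for illustration, not part of the original implementation.

```python
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import Adam

def deeper_model(grid_size, num_actions, hidden_size):
    # Same flattened-grid input as the baseline, but with one more
    # hidden layer and the Adam optimizer instead of plain SGD.
    model = Sequential()
    model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
    model.add(Dense(hidden_size, activation='relu'))
    model.add(Dense(hidden_size, activation='relu'))
    model.add(Dense(num_actions))
    model.compile(Adam(lr=1e-3), "mse")
    return model
```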
Exploration
The final ingredient of Q-Learning is exploration. Everyday experience tells us that sometimes you have to do something a bit strange, or even random, to find out whether there is anything better than your usual routine.
The same is true for Q-Learning. Always picking the best-known action means you may miss paths you have never explored. To avoid this, the learner sometimes takes a random action instead of the best one (with probability epsilon, as in the code below). We can define the training method as follows:
```python
def train(model, epochs):
    # Train
    # Resetting the win counter
    win_cnt = 0
    # We want to keep track of the progress of the AI over time,
    # so we save its win count history
    win_hist = []
    # Epochs is the number of games we play
    for e in range(epochs):
        loss = 0.
        # Resetting the game
        env.reset()
        game_over = False
        # get initial input
        input_t = env.observe()
        while not game_over:
            # The learner is acting on the last observed game screen
            # input_t is a vector representing the game screen
            input_tm1 = input_t
            # Take a random action with probability epsilon
            if np.random.rand() <= epsilon:
                # Eat something random from the menu
                action = np.random.randint(0, num_actions, size=1)
            else:
                # Choose yourself
                # q contains the expected rewards for the actions
                q = model.predict(input_tm1)
                # We pick the action with the highest expected reward
                action = np.argmax(q[0])
            # apply action, get rewards and new state
            input_t, reward, game_over = env.act(action)
            # If we managed to catch the fruit we add 1 to our win counter
            if reward == 1:
                win_cnt += 1
            # Uncomment this to render the game here
            # display_screen(action, 3000, inputs[0])
            """
            The experiences < s, a, r, s' > we make during gameplay are our
            training data. Here we first save the last experience, and then
            load a batch of experiences to train our model.
            """
            # store experience
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)
            # Load batch of experiences
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
            # train model on experiences
            batch_loss = model.train_on_batch(inputs, targets)
            # sum up loss over all batches in an epoch
            loss += batch_loss
        win_hist.append(win_cnt)
    return win_hist
```
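The train function above depends on an exp_replay object that stores the < s, a, r, s' > transitions and turns them into training batches with Q-Learning targets. Its implementation is not shown in this section; the class below is a minimal sketch of what such a buffer could look like, where the memory size of 500 and the discount factor of 0.9 are assumptions rather than values taken from the article.

```python
import numpy as np

class ExperienceReplay(object):
    """Minimal sketch of an experience-replay buffer with Q-Learning targets."""

    def __init__(self, max_memory=500, discount=0.9):
        self.max_memory = max_memory   # how many transitions we keep
        self.discount = discount       # gamma in the Bellman equation
        self.memory = []

    def remember(self, experience, game_over):
        # experience is [s, a, r, s']; game_over marks terminal states
        self.memory.append([experience, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):
        len_memory = len(self.memory)
        num_actions = model.output_shape[-1]
        state_dim = self.memory[0][0][0].shape[1]
        batch = min(len_memory, batch_size)
        inputs = np.zeros((batch, state_dim))
        targets = np.zeros((batch, num_actions))
        for i, idx in enumerate(np.random.randint(0, len_memory, size=batch)):
            state, action, reward, next_state = self.memory[idx][0]
            game_over = self.memory[idx][1]
            inputs[i] = state
            # Start from the model's current predictions so that only the
            # chosen action's target is changed.
            targets[i] = model.predict(state)[0]
            if game_over:
                targets[i, action] = reward
            else:
                # Q-Learning target: r + gamma * max_a' Q(s', a')
                targets[i, action] = reward + self.discount * np.max(model.predict(next_state)[0])
        return inputs, targets
```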
I trained this game bot for 5,000 epochs, and it performed quite well!
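For context, a 5,000-epoch run like this could be started and inspected roughly as follows. This is only a sketch: epsilon, batch_size, env, and exp_replay are assumed to be set up elsewhere as in the article, and the 100-game moving-average window is my own choice for visualizing the win rate.

```python
import numpy as np
import matplotlib.pyplot as plt

# env, exp_replay, epsilon and batch_size are assumed to be defined already.
model = baseline_model(grid_size, num_actions, hidden_size)
win_hist = train(model, epochs=5000)

# win_hist holds the cumulative win count after each game, so its
# successive differences tell us which individual games were won.
wins_per_game = np.diff(win_hist)
moving_avg = np.convolve(wins_per_game, np.ones(100) / 100, mode='valid')

plt.plot(moving_avg)
plt.xlabel('game')
plt.ylabel('win rate over the last 100 games')
plt.show()
```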
[Figure: the Catch bot in action]