Q-Learning in MDPs — from the Modeling Commons
Model was written in NetLogo 5.0.5
;; alpha, gamma, reward, winning-state-value, and losing-state-value are
;; interface globals (defined as widgets in the model's Interface tab)

patches-own [ q-val-north q-val-south q-val-east q-val-west ]

to setup
  ca
  reset-ticks
  ;; set initial q-values and patches
  set-patch
  ;; create agent
  crt 1 [
    setxy 0 0
    set shape "car"
    set size 0.5
    set color yellow
    set heading 0
  ]
end

to go
  tick
  ask turtle 0 [
    ;; retrieve current x and y coordinates
    let c-xcor xcor
    let c-ycor ycor
    ;; 0.25 probability of choosing the N, S, E, or W direction
    set heading ((random 4) * 90)  ;; equal probability: 0, 90, 180, 270
    ;; probabilistic movement - 0.8 chance of moving in the intended direction
    let prob random-float 1
    ifelse (prob < 0.8) [
      ;; move in the intended direction - no change in heading required
    ] [
      ifelse (prob < 0.9) [
        ;; move to the left (-90) of the intended direction
        set heading (heading - 90)
      ] [
        ;; move to the right (+90) of the intended direction
        set heading (heading + 90)
      ]
    ]
    ;; after setting direction, move forward 1 step
    fd 1
    ;; disallow movement into the blue cell
    if (xcor = 1) and (ycor = 1) [
      bk 1
    ]
    ;; update the q-value for the transition just taken
    set-qval c-xcor c-ycor heading xcor ycor
    ;; reset the agent's position on reaching the winning or losing state
    if ([pcolor] of patch xcor ycor) != black [
      if ([pcolor] of patch xcor ycor) != blue [
        set xcor 0
        set ycor 0
      ]
    ]
  ]
  ;; refresh the q-values shown in the patch labels
  set-patch
end

to set-qval [cur-xcor cur-ycor cur-heading new-xcor new-ycor]
  ;; optimal future value - max over a' of Q(s',a')
  let opt-fut-val 0
  ;; compute optimal future value
  ask patch new-xcor new-ycor [
    set opt-fut-val (max (list q-val-north q-val-east q-val-south q-val-west))
  ]
  ;; write the updated q-value into Q(s,a)
  ask patch cur-xcor cur-ycor [
    if (cur-heading = 0) [ ;; north
      set q-val-north (precision (q-val-north + alpha * (reward + (gamma * opt-fut-val) - q-val-north)) 1)
    ]
    if (cur-heading = 90) [ ;; east
      set q-val-east (precision (q-val-east + alpha * (reward + (gamma * opt-fut-val) - q-val-east)) 1)
    ]
    if (cur-heading = 180) [ ;; south
      set q-val-south (precision (q-val-south + alpha * (reward + (gamma * opt-fut-val) - q-val-south)) 1)
    ]
    if (cur-heading = 270) [ ;; west
      set q-val-west (precision (q-val-west + alpha * (reward + (gamma * opt-fut-val) - q-val-west)) 1)
    ]
  ]
end

to set-patch
  ask patches [ set pcolor black ]
  ;; blue cell - wall, not enterable
  ask patch 1 1 [
    set pcolor blue
    set q-val-west 0
    set q-val-north 0
    set q-val-east 0
    set q-val-south 0
  ]
  ;; green cell - winning state
  ask patch 3 2 [
    set pcolor green
    set q-val-west winning-state-value
    set q-val-north winning-state-value
    set q-val-east winning-state-value
    set q-val-south winning-state-value
  ]
  ;; red cell - losing state
  ask patch 3 1 [
    set pcolor red
    set q-val-north losing-state-value
    set q-val-east losing-state-value
    set q-val-south losing-state-value
    set q-val-west losing-state-value
  ]
  ;; display each patch's q-values as [west north east south]
  ask patches [
    set plabel (list q-val-west q-val-north q-val-east q-val-south)
  ]
end
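The set-qval procedure is the standard tabular Q-learning update, Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max over a' of Q(s',a') - Q(s,a)), applied to the four directional q-values stored on each patch. Once the values settle, the greedy policy can be read off a patch by picking the heading with the largest q-value. A minimal sketch of such a reporter (greedy-heading is a hypothetical helper, not part of the original model):

;; hypothetical helper, not in the original model:
;; report the heading whose q-value is largest at patch p
to-report greedy-heading [p]
  let vals [ (list q-val-north q-val-east q-val-south q-val-west) ] of p
  let headings [0 90 180 270]  ;; same order as vals
  report item (position (max vals) vals) headings
end

For example, running show greedy-heading patch 0 0 from the observer after training reports the direction the learned values favor from the start cell; ties go to the first heading in the list.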
There is only one version of this model, created about 11 years ago by Larry Lin.