Q-Learning in MDPs — from the Modeling Commons
Model was written in NetLogo 5.0.5
;; alpha, gamma, reward, winning-state-value, and losing-state-value are
;; interface globals (defined as widgets in the model's Interface tab)

patches-own [ q-val-north q-val-south q-val-east q-val-west ]

to setup
  ca
  reset-ticks
  ;; set initial q-values and patches
  set-patch
  ;; create agent
  crt 1 [
    setxy 0 0
    set shape "car"
    set size 0.5
    set color yellow
    set heading 0
  ]
end

to go
  tick
  ask turtle 0 [
    ;; retrieve current x and y coordinates
    let c-xcor xcor
    let c-ycor ycor
    ;; 0.25 probability of choosing the N, S, E, or W direction
    set heading ((random 4) * 90)  ;; equal probability: 0, 90, 180, 270
    ;; probabilistic movement - 0.8 chance of moving in the intended direction
    let prob random-float 1
    ifelse (prob < 0.8) [
      ;; move in the intended direction - no change in heading required
    ] [
      ifelse (prob < 0.9) [
        ;; move to the left (-90) of the intended direction
        set heading (heading - 90)
      ] [
        ;; move to the right (+90) of the intended direction
        set heading (heading + 90)
      ]
    ]
    ;; after setting direction, move forward 1 step
    fd 1
    ;; disallow movement into the blue cell
    if (xcor = 1) and (ycor = 1) [
      bk 1
    ]
    ;; update the q-value for the transition just taken
    set-qval c-xcor c-ycor heading xcor ycor
    ;; reset the agent's position on reaching the winning or losing state
    if ([pcolor] of patch xcor ycor) != black [
      if ([pcolor] of patch xcor ycor) != blue [
        set xcor 0
        set ycor 0
      ]
    ]
  ]
  ;; refresh the q-values shown in the patch labels
  set-patch
end

to set-qval [cur-xcor cur-ycor cur-heading new-xcor new-ycor]
  ;; optimal future value - max over a' of Q(s',a')
  let opt-fut-val 0
  ;; compute optimal future value
  ask patch new-xcor new-ycor [
    set opt-fut-val (max (list q-val-north q-val-east q-val-south q-val-west))
  ]
  ;; write the updated q-value into Q(s,a)
  ask patch cur-xcor cur-ycor [
    if (cur-heading = 0) [ ;; north
      set q-val-north (precision (q-val-north + alpha * (reward + (gamma * opt-fut-val) - q-val-north)) 1)
    ]
    if (cur-heading = 90) [ ;; east
      set q-val-east (precision (q-val-east + alpha * (reward + (gamma * opt-fut-val) - q-val-east)) 1)
    ]
    if (cur-heading = 180) [ ;; south
      set q-val-south (precision (q-val-south + alpha * (reward + (gamma * opt-fut-val) - q-val-south)) 1)
    ]
    if (cur-heading = 270) [ ;; west
      set q-val-west (precision (q-val-west + alpha * (reward + (gamma * opt-fut-val) - q-val-west)) 1)
    ]
  ]
end

to set-patch
  ask patches [ set pcolor black ]
  ;; blue cell - wall, not enterable
  ask patch 1 1 [
    set pcolor blue
    set q-val-west 0
    set q-val-north 0
    set q-val-east 0
    set q-val-south 0
  ]
  ;; green cell - winning state
  ask patch 3 2 [
    set pcolor green
    set q-val-west winning-state-value
    set q-val-north winning-state-value
    set q-val-east winning-state-value
    set q-val-south winning-state-value
  ]
  ;; red cell - losing state
  ask patch 3 1 [
    set pcolor red
    set q-val-north losing-state-value
    set q-val-east losing-state-value
    set q-val-south losing-state-value
    set q-val-west losing-state-value
  ]
  ;; display each patch's q-values as [west north east south]
  ask patches [
    set plabel (list q-val-west q-val-north q-val-east q-val-south)
  ]
end
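The set-qval procedure is the standard tabular Q-learning update, Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max over a' of Q(s',a') - Q(s,a)), applied to the four directional q-values stored on each patch. Once the values settle, the greedy policy can be read off a patch by picking the heading with the largest q-value. A minimal sketch of such a reporter (greedy-heading is a hypothetical helper, not part of the original model):

;; hypothetical helper, not in the original model:
;; report the heading whose q-value is largest at patch p
to-report greedy-heading [p]
  let vals [ (list q-val-north q-val-east q-val-south q-val-west) ] of p
  let headings [0 90 180 270]  ;; same order as vals
  report item (position (max vals) vals) headings
end

For example, running show greedy-heading patch 0 0 from the observer after training reports the direction the learned values favor from the start cell; ties go to the first heading in the list.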
There is only one version of this model, created about 11 years ago by Larry Lin.