Q-Learning in MDPs

Author: Larry Lin
Written in NetLogo 5.0.5
Model code:

patches-own[
  q-val-north
  q-val-south
  q-val-east
  q-val-west
]
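
;; note: alpha, gamma, reward, winning-state-value and losing-state-value
;; are not declared in this listing; they are presumably defined as
;; sliders on the model's interface tab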

to setup
  ca
  reset-ticks
  
  ;; set initial q-values and patch colors
  set-patch
  
  ;; create agent
  crt 1[
    setxy 0 0
    set shape "car"
    set size 0.5
    set color yellow
    set heading 0
  ]
end 

to go
  tick
  
  ask turtle 0[
    ;; retrieve current x and y coordinates
    let c-xcor xcor
    let c-ycor ycor
    
    ;; choose the intended direction uniformly at random:
    ;; N, E, S, W each with probability 0.25
    set heading ((random 4) * 90) ;; headings 0, 90, 180, 270
    
    ;; stochastic transitions: 0.8 chance of moving in the intended
    ;; direction, 0.1 chance each of slipping to the left or right
    let prob random-float 1
    
    ifelse(prob < 0.8)[
      ;; move in the intended direction
      ;; no change in heading required
    ][
      ifelse(prob < 0.9)[
        ;; slip to the left (-90) of the intended direction
        set heading (heading - 90)
      ][
        ;; slip to the right (+90) of the intended direction
        set heading (heading + 90)
      ]
    ]
    
    ;; after setting direction, move forward 1 step
    fd 1
    
    ;; the blue patch at (1, 1) acts as a wall: if the move landed
    ;; there, step back to the previous cell
    if(xcor = 1) and (ycor = 1)[
      bk 1
    ]
        
    ;; update the q-value for the state-action pair just taken
    set-qval c-xcor c-ycor heading xcor ycor
    
    ;; reset the agent to the start if it has reached the winning (green)
    ;; or losing (red) terminal state; black patches are ordinary cells
    ;; and the blue wall is never entered
    if ([pcolor] of patch-here != black) and ([pcolor] of patch-here != blue)[
      setxy 0 0
    ]
    
  ]
    
  ;; redraw the patches and refresh their q-value labels
  set-patch
end 
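
;; Illustrative sketch (not part of the original model): the uniform random
;; action choice in GO could be replaced by epsilon-greedy selection that
;; exploits the learned q-values. `epsilon` here is a hypothetical slider
;; (e.g. 0.1); everything else uses only the patch variables defined above.
to choose-action ;; turtle procedure
  ifelse random-float 1 < epsilon [
    ;; explore: random direction
    set heading ((random 4) * 90)
  ][
    ;; exploit: pick the action with the highest q-value on this patch
    let vals [(list q-val-north q-val-east q-val-south q-val-west)] of patch-here
    set heading ((position (max vals) vals) * 90) ;; N = 0, E = 90, S = 180, W = 270
  ]
end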

to set-qval [cur-xcor cur-ycor cur-heading new-xcor new-ycor]
  
  ;; optimal future value: max over a' of Q(s', a')
  let opt-fut-val 0
  
  ;; compute optimal future value
  ask patch new-xcor new-ycor[
    set opt-fut-val (max (list q-val-north q-val-east q-val-south q-val-west))
  ]
  
  ;; apply the computed update to Q(s, a) for the action taken;
  ;; (precision ... 1) rounds to one decimal place so labels stay readable
  ask patch cur-xcor cur-ycor[
    if(cur-heading = 0)[
      ;; north
      set q-val-north (precision (q-val-north + alpha * (reward + (gamma * opt-fut-val) - q-val-north)) 1)
    ]
    if(cur-heading = 90)[
      ;; east
      set q-val-east (precision (q-val-east + alpha * (reward + (gamma * opt-fut-val) - q-val-east)) 1)
    ]
    if(cur-heading = 180)[
      ;; south
      set q-val-south (precision (q-val-south + alpha * (reward + (gamma * opt-fut-val) - q-val-south)) 1)
    ]
    if(cur-heading = 270)[
      ;; west
      set q-val-west (precision (q-val-west + alpha * (reward + (gamma * opt-fut-val) - q-val-west)) 1)
    ]
  ]
end 
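
;; the update applied above is the standard one-step Q-learning rule:
;;   Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))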

to set-patch
  
  ask patches[
    set pcolor black
  ]
  
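  ;; blue patch: a wall the agent can never occupy (see GO)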
  ask patch 1 1[
    set pcolor blue
    set q-val-west 0
    set q-val-north 0
    set q-val-east 0
    set q-val-south 0
  ]
    
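  ;; green patch: winning terminal state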
  ask patch 3 2[
    set pcolor green
    set q-val-west winning-state-value
    set q-val-north winning-state-value
    set q-val-east winning-state-value
    set q-val-south winning-state-value
  ]
  
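  ;; red patch: losing terminal state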
  ask patch 3 1[
    set pcolor red
    set q-val-north losing-state-value
    set q-val-east losing-state-value
    set q-val-south losing-state-value
    set q-val-west losing-state-value
  ]
  
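  ;; show each patch's q-values as its label, in the order (west north east south)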
  ask patches[
    set plabel (list q-val-west q-val-north q-val-east q-val-south)
  ]
end 
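
To train the model, run setup once and then go repeatedly; for example, from NetLogo's command center:

setup
repeat 10000 [ go ]

Once the q-values have stabilized, the greedy policy can be read off each patch. A minimal sketch of a reporter that does this, assuming only the q-val-* patch variables defined above (the reporter name is illustrative, not part of the original model):

to-report best-heading ;; patch procedure: heading of the highest-valued action
  let vals (list q-val-north q-val-east q-val-south q-val-west)
  report (position (max vals) vals) * 90
end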
