As an example of how ON- and OFF-policy TD algorithms can converge to different solutions for the same problem, I use here a task described in the new edition of Sutton and Barto (2017) (you can download the entire book for free here: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf, see chapter 6).
In this task the agent is required to navigate the environment from the starting state “S” to the goal “G”. Each step is punished with a small negative reward (R = -1), which makes the shortest path look like the optimal solution. However, this shortest path runs right along “The Cliff” (see picture below), a series of states associated with a strong negative reward (R = -100). Because the algorithm uses randomised ε-greedy exploration, the agent may occasionally fall off the cliff: with ε = 0.1 (a standard setting), the agent selects the action with the highest estimated value in 90% of the cases and a random action in the remaining 10%.
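To make the setup concrete, here is a minimal sketch of such a grid-world and of ε-greedy action selection in Python (the grid dimensions, the state encoding and the function names are my own illustrative assumptions, not taken from the downloadable code):

```python
import numpy as np

# Minimal cliff-walking grid (illustrative sketch only).
N_ROWS, N_COLS = 4, 12                         # bottom row holds start, cliff and goal
START, GOAL = (3, 0), (3, 11)
CLIFF = [(3, c) for c in range(1, 11)]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Apply an action: every move costs -1, stepping into the cliff
    costs -100 and sends the agent back to the start."""
    r = min(max(state[0] + action[0], 0), N_ROWS - 1)
    c = min(max(state[1] + action[1], 0), N_COLS - 1)
    next_state = (r, c)
    if next_state in CLIFF:
        return START, -100.0, False
    if next_state == GOAL:
        return next_state, -1.0, True
    return next_state, -1.0, False

def epsilon_greedy(Q, state, epsilon=0.1):
    """Greedy action 90% of the time (for epsilon = 0.1), random otherwise.
    Q is assumed to be an array of shape (N_ROWS, N_COLS, 4)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(Q[state]))
```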
The two algorithms record the consequences (in terms of rewards and punishments) of the actions performed in each state in significantly different ways. On the one hand, OFF-policy TD learning (Q-learning) updates the value of the action just completed using the maximum value among the four actions (up, down, left, right) available in the state that has been reached. On the other hand, ON-policy TD learning (SARSA) updates it using the value of the action actually selected in that state by the ε-greedy policy, which is occasionally an exploratory one. As a consequence, only the latter takes into account the risk of falling into the cliff while exploring.
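In code, the difference boils down to a single line in the TD update. The sketch below illustrates it, continuing the Python example above (the learning rate and discount values are assumptions, not the ones used in the downloadable code):

```python
# Sketch of the two TD updates; alpha and gamma are placeholder values.
ALPHA, GAMMA = 0.5, 1.0

def q_learning_update(Q, s, a, r, s_next):
    # OFF-policy: bootstrap from the best action in the next state,
    # regardless of which action the epsilon-greedy policy will actually take.
    target = r + GAMMA * np.max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # ON-policy: bootstrap from the action actually selected in the next state,
    # so occasional exploratory moves into the cliff lower the values nearby.
    target = r + GAMMA * Q[s_next][a_next]
    Q[s][a] += ALPHA * (target - Q[s][a])
```

Because SARSA's target sometimes contains the value of a random exploratory action, states adjacent to the cliff inherit part of the -100 penalty, while Q-learning's max operator ignores those exploratory outcomes entirely.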
This difference can be highlighted with a graphical representation of the values ascribed to each state of the environment, which clearly shows that the two algorithms settle on different optimal paths (simplified below as a heat map of the mean of the four action-values per state). In this representation, the thick black line represents the cliff, the thin black line marks the border of the grid-world, and R stands for the final reward, or goal. Finally, compared with the original task presented in Sutton and Barto (2017), I have increased the number of states the agent can reach, to test whether the optimal policy found by SARSA would navigate further away from the cliff than the original space allowed.
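For reference, a heat map of this kind could be produced along these lines (a sketch assuming the action-values are stored as a rows × columns × actions array; the plotting details of the original figures are my assumption):

```python
import matplotlib.pyplot as plt

def plot_mean_values(Q, title):
    """Plot the mean of the four action-values per state as a heat map."""
    mean_values = Q.mean(axis=2)   # average over the action dimension
    plt.imshow(mean_values, cmap="hot", interpolation="nearest")
    plt.colorbar(label="mean action-value")
    plt.title(title)
    plt.show()
```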
OFF policy:

ON policy (note that the path closest to the cliff is now suboptimal compared with the path leading the agent farther away from the “danger”):
Here you can download the code (run “main_cliff” for a test):