Model-based value iteration algorithm for deterministic cleaning robots. This simple implementation of the value iteration algorithm serves as a helpful starting point for beginners in reinforcement learning and dynamic programming. The deterministic cleaning robot MDP involves the robot collecting used cans and recharging its battery. The state represents the robot's position, and the action defines the movement direction, either left or right. The first (1) and last (6) states are terminal states. The goal is to find the optimal policy to maximize the reward from any initial state. This is an example of Q-iteration (model-based value iteration DP). Reference: Algorithm 2-1, from: @book{busoniu2010reinforcement, title={Reinforcement Learning and Dynamic Programming Using Function Approximation}, authors={Busoniu, Lucian and Babuska, Robert and De Schutter, Bart and Ernst, Damien}, year={2010}, publisher={CRC Press}}.