10-Armed Bandit Testbed: This script simulates the 10-armed bandit testbed with the greedy algorithm. The testbed consists of 2,000 randomly generated k-armed bandit problems with k = 10. For each problem, the true action values q*(a), a = 1, 2, ..., 10, are drawn from a normal distribution with mean 0 and variance 1.
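As a minimal sketch of this setup (assuming NumPy; the variable names here are illustrative, not necessarily those used in the script), the true action values for all 2,000 problems can be generated in one array:

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen for illustration
n_problems, k = 2000, 10

# One row per bandit problem: q*(a) ~ N(0, 1) for each of the k arms.
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_problems, k))
```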
At each time step t, the learning method selects an action At, and the reward Rt is drawn from a normal distribution with mean q*(At) and variance 1. Each bandit problem is simulated for 1,000 time steps, which constitutes one run; recording performance at every step shows how the learning method improves with experience. We conduct 2,000 independent runs, each on a different bandit problem.
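A sketch of this reward model (again assuming NumPy; sample_reward is a hypothetical helper, not necessarily part of the script):

```python
def sample_reward(q_star_problem, action, rng):
    """Draw R_t ~ N(q*(A_t), 1) for the chosen action on one bandit problem."""
    return rng.normal(loc=q_star_problem[action], scale=1.0)
```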
This simulation lets us measure the average behavior of the greedy algorithm, which estimates action values using sample averages. The reward at each time step is then averaged over the 2,000 runs to produce a learning curve. The code can also be modified to evaluate non-greedy methods (e.g., ε-greedy action selection).
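The sketch below ties the pieces together: a greedy agent with incremental sample-average estimates, run for 1,000 steps on each of 2,000 problems, returning the per-step reward averaged over runs. Function and variable names are illustrative assumptions, not the script's actual API.

```python
import numpy as np

def run_greedy_testbed(n_runs=2000, k=10, n_steps=1000, seed=0):
    """Greedy action selection with sample-average value estimates.

    Returns the reward at each time step, averaged over all runs."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)

    for _ in range(n_runs):
        q_star = rng.normal(0.0, 1.0, size=k)   # true action values for this problem
        q_est = np.zeros(k)                     # sample-average estimates Q_t(a)
        counts = np.zeros(k, dtype=int)         # times each action has been selected

        for t in range(n_steps):
            action = int(np.argmax(q_est))      # greedy: pick the highest current estimate
            reward = rng.normal(q_star[action], 1.0)

            # Incremental sample-average update: Q <- Q + (R - Q) / N
            counts[action] += 1
            q_est[action] += (reward - q_est[action]) / counts[action]

            avg_reward[t] += reward

    return avg_reward / n_runs

if __name__ == "__main__":
    rewards = run_greedy_testbed()
    print(f"Average reward at final step: {rewards[-1]:.3f}")
```

Replacing the argmax line with an ε-greedy choice (explore with probability ε, otherwise act greedily) is one way to adapt this sketch to non-greedy methods.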