Reinforcement Learning

In machine learning, supervised learning is sometimes contrasted with unsupervised learning. This is a useful distinction, but some problem domains share characteristics with each without fitting exactly into either category. When the algorithm does not receive explicit labels but does receive a form of feedback, we are dealing with a third, distinct paradigm of machine learning: reinforcement learning.

A programmatic and theoretical introduction to reinforcement learning: https://spinningup.openai.com/

There are different problem types and algorithms, but all reinforcement learning problems have the following aspects in common:

  • an agent - the algorithm or “AI” responsible for making decisions
  • an environment, consisting of different states in which the agent may find itself
  • a reward signal which is returned by the environment as a function of the current state
  • actions, each of which takes the agent from one state to another
  • a policy, i.e. a mapping from states to actions that defines the agent’s behavior

The goal of reinforcement learning is to learn the optimal policy, that is the policy that maximizes expected (discounted) cumulative reward.
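
Concretely, these pieces meet in a simple interaction loop. The sketch below is only an illustration: it assumes a hypothetical environment object with reset() and step(action) methods (in the spirit of the Gym API, returning the next state, a reward, and a done flag) and a policy callable that maps states to actions.

    def run_episode(env, policy, gamma=0.9):
        """Run one episode and return the discounted cumulative reward."""
        state = env.reset()                         # environment's initial state
        done = False
        total, discount = 0.0, 1.0
        while not done:
            action = policy(state)                  # the policy maps states to actions
            state, reward, done = env.step(action)  # environment returns the reward signal
            total += discount * reward              # accumulate (discounted) reward
            discount *= gamma
        return total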

Many RL algorithms include a value function or a Q-function. A value function gives the expected cumulative reward for each state under the current policy; a Q-function does the same for each state-action pair. In other words, the value function answers the question, “If I begin in state \(i\) and follow my policy, what will be my expected reward?”

In most algorithms, expected cumulative reward is discounted by some factor \(\gamma \in (0, 1)\); a typical value for \(\gamma\) is 0.9. In addition to more accurately modeling the behavior of humans and other animals, \(\gamma < 1\) helps to ensure that algorithms converge even when there is no terminal state or when the terminal state is never reached (because otherwise the expected cumulative reward could become infinite).
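
Formally, with discount factor \(\gamma\), the discounted return from time \(t\) and the value and Q-functions of a policy \(\pi\) are commonly written as

\[G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad V^{\pi}(s) = E\left[G_t \mid s_t = s\right], \qquad Q^{\pi}(s, a) = E\left[G_t \mid s_t = s, a_t = a\right]\]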

Note on Terminology

For mostly historical reasons, engineering and operations research use different words to talk about the same concepts. For example, the general field of reinforcement learning itself is sometimes referred to as optimal control, approximate dynamic programming, or neuro-dynamic programming.1

Exploration vs. Exploitation

One dilemma inherent to the RL problem setting is the tension between the desire to choose the best known option and the need to try something new in order to discover other options that may be even better. Choosing the best known action is known as exploitation, while choosing a different action is known as exploration.

Typically, this is solved by adding to the policy a small probability of exploration. For example, the policy could be to choose the best known action with probability 0.95 and to explore by choosing some other action at random with probability 0.05 (if exploration is uniform over the remaining actions, each is chosen with probability 0.05/(n-1), where n is the number of actions).

MDPs and Tabular methods

Many problems can be effectively modeled as Markov Decision Processes (MDPs), or more generally as Partially Observable Markov Decision Processes (POMDPs) <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>. That is, we have

  • a set of states \(S\)
  • a set of actions \(A\)
  • a set of conditional state transition probabilities \(T\)
  • a reward function \(R: S \times A \rightarrow \mathbb{R}\)
  • a set of observations \(\Omega\)
  • a set of conditional observation probabilities \(O\)
  • a discount factor \(\gamma \in [0, 1)\)

Given these things, the goal is to choose the action at each time step which will maximize \(E \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]\), the expected discounted reward.
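
For a small, fully observable MDP, these ingredients can be written down directly. The sketch below uses made-up states, actions, and numbers purely for illustration.

    # T[s][a] is a list of (probability, next_state, reward) triples.
    states = ["s1", "s2"]
    actions = ["a1", "a2"]
    T = {
        "s1": {"a1": [(0.9, "s1", 1.0), (0.1, "s2", 0.0)],
               "a2": [(0.5, "s1", 0.5), (0.5, "s2", 0.5)]},
        "s2": {"a1": [(0.2, "s1", 0.0), (0.8, "s2", -1.0)],
               "a2": [(0.4, "s1", 0.5), (0.6, "s2", 0.0)]},
    }
    gamma = 0.9                            # discount factor
    policy = {"s1": "a1", "s2": "a2"}      # one possible (deterministic) policy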

Monte Carlo methods

One possible approach is to run a large number of simulations to learn \(\pi^*\), the optimal policy. This works well when we know the environment and can run many simulations reasonably quickly. For example, it is fairly trivial to compute an optimal policy for the card game 21 (blackjack) <https://en.wikipedia.org/wiki/Twenty-One_(card_game)> by running many simulations, and the same is true for most simple games.
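
The heart of any Monte Carlo method is estimating values from sampled returns. The sketch below shows first-visit Monte Carlo policy evaluation for a fixed policy, assuming the same hypothetical reset()/step() environment interface as above; control variants additionally improve the policy between evaluations.

    from collections import defaultdict

    def mc_policy_evaluation(env, policy, episodes=10000, gamma=0.9):
        """Estimate V(s) for a fixed policy by averaging sampled returns (first-visit MC)."""
        returns_sum = defaultdict(float)
        returns_cnt = defaultdict(int)
        for _ in range(episodes):
            # Generate one episode under the policy.
            trajectory, state, done = [], env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                trajectory.append((state, reward))
                state = next_state
            # Walk backwards to accumulate discounted returns.
            G, first_visit = 0.0, {}
            for state, reward in reversed(trajectory):
                G = reward + gamma * G
                first_visit[state] = G   # overwritten until only the first visit remains
            for state, G in first_visit.items():
                returns_sum[state] += G
                returns_cnt[state] += 1
        return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}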

Temporal-Difference Learning

TODO

Planning

TODO

On-Policy vs. Off-Policy Learning

TODO

Model-Free vs. Model-Based Approaches

TODO

Imitation Learning

TODO

Q-Learning

Q-learning is a model-free RL algorithm that iteratively updates its Q-values toward the optimal ones. It is an off-policy method: its update target is based on the greedy action under the current Q estimate, rather than on the action the behavior policy actually takes.

The Q-learning algorithm is:

  1. Initialize \(t = 0\).
  2. Start at the initial state \(s_0\).
  3. The agent chooses the action \(a_t\) ɛ-greedily.
  4. For the chosen \(a_t\), the agent receives the reward \(r_{t+1}\) and observes the next state \(s_{t+1}\).
  5. Get (but do not perform) the greedy next action \(a_{t+1} = \operatorname{argmax}_{a \in A} Q(s_{t+1}, a)\).
  6. Compute the TD target \(y_t = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1})\), where \(\gamma\) is the discount factor.
  7. Compute the TD error \(\delta = y_t - Q(s_t, a_t)\).
  8. Update \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \cdot \delta\), where \(\alpha_t\) is the step size (learning rate) at time \(t\).
  9. Update \(t \leftarrow t + 1\) and repeat steps 3-9 until \(Q(s, a)\) converges.
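
The steps above map almost line for line onto code. The following is a minimal sketch, assuming a hypothetical environment with reset() and step(action) returning the next state, a reward, and a done flag, plus a discrete action space; the hyperparameter values are illustrative.

    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with an epsilon-greedy behavior policy."""
        Q = defaultdict(lambda: [0.0] * n_actions)          # Q[s][a], initialized to 0
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Step 3: epsilon-greedy action selection.
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[state][a])
                # Step 4: take the action, observe reward and next state.
                next_state, reward, done = env.step(action)
                # Steps 5-6: TD target uses the greedy next action (off-policy).
                best_next = max(Q[next_state]) if not done else 0.0
                target = reward + gamma * best_next
                # Steps 7-8: TD error and update.
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
        return Q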

Epsilon-Greedy Algorithm

\[a_{t} = \begin{cases} \operatorname{argmax}_{a \in A} Q(s_t, a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}\]

During training the agent either exploits, by taking the action that is optimal under the current Q-function, or explores, by taking a random action. A common schedule is to start with ɛ = 1 (purely random behavior) and gradually decrease ɛ over time, capping it at some small nonzero constant so that exploration never stops entirely.
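
A small helper capturing this behavior and schedule might look as follows; the decay rate and floor are illustrative choices rather than part of the algorithm.

    import random

    def epsilon_greedy(q_values, epsilon):
        """With probability epsilon take a random action, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))                    # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

    def decay_epsilon(epsilon, decay=0.995, floor=0.05):
        """Anneal epsilon toward a small nonzero floor, e.g. once per episode."""
        return max(floor, epsilon * decay)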

Q-Table / Q-Matrix

        a1           a2           ...   an
  s1    Q(s1, a1)    Q(s1, a2)    ...   Q(s1, an)
  s2    Q(s2, a1)    Q(s2, a2)    ...   Q(s2, an)
  ...
  sm    Q(sm, a1)    Q(sm, a2)    ...   Q(sm, an)

It is a lookup table storing the action-value function Q(s, a) for every state-action pair, where there are m states and n actions. Q(s, a) can be initialized arbitrarily, except for terminal states: by convention Q(terminal, ·) is set to 0, since no further reward can follow a terminal state.
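
When the numbers of states and actions are known in advance, the table is simply an m-by-n array; the NumPy layout below is one common choice (the sizes are arbitrary).

    import numpy as np

    m_states, n_actions = 6, 4            # example sizes
    Q = np.zeros((m_states, n_actions))   # arbitrary initialization; zeros are a common default
    # Rows corresponding to terminal states stay at 0, since no future reward follows them.
    best_action = Q[2].argmax()           # greedy action in state 2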

Reasons for using Q-learning:

  • It applies to environments with a discrete action space.
  • It does not require the true MDP model, i.e. the transition probabilities and rewards (it is model-free).
  • It can learn from incomplete episodes, because it is a TD method.

Drawbacks of Q-learning:

  • When the state and action spaces are continuous or extremely large, the curse of dimensionality makes it impractical to maintain a Q-table.
  • A Q-table cannot generalize: it cannot infer a good action for states it has never seen.

Deep Q-Learning

Deep Q-learning follows the same general method as Q-learning. Its innovation is to approximate the Q-function with a neural network, which makes it possible to learn very complex Q-functions. This makes it very powerful, especially because it brings the large body of well-developed theory and tooling for deep learning to bear on reinforcement learning problems.
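
As a minimal sketch (assuming PyTorch; the architecture and layer sizes are arbitrary choices, not part of any specific published method), the Q-table is replaced by a network that maps a state vector to one Q-value per action:

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one estimated Q-value per action."""
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return self.net(state)   # shape: (batch, n_actions)

    # Greedy action for a batch of states:
    # actions = q_net(states).argmax(dim=1)

In practice, deep Q-learning implementations typically also rely on stabilizing techniques such as experience replay and a separate target network.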