Reinforcement Learning

In machine learning, supervised learning is sometimes contrasted with unsupervised learning. This is a useful distinction, but some problem domains share characteristics with each without fitting exactly into either category. When the algorithm does not receive explicit labels but does receive a form of feedback, we are dealing with a third, distinct paradigm of machine learning: reinforcement learning.

A programmatic and theoretical introduction to reinforcement learning: https://spinningup.openai.com/

There are different problem types and algorithms, but all reinforcement learning problems have the following aspects in common:

  • an agent - the algorithm or “AI” responsible for making decisions
  • an environment, consisting of different states in which the agent may find itself
  • a reward signal which is returned by the environment as a function of the current state
  • actions, each of which takes the agent from one state to another
  • a policy, i.e. a mapping from states to actions that defines the agent’s behavior

The goal of reinforcement learning is to learn the optimal policy, that is the policy that maximizes expected (discounted) cumulative reward.
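
Concretely, these pieces meet in a simple interaction loop. The sketch below is only an illustration: it assumes a hypothetical environment object with reset() and step(action) methods (in the spirit of the Gym API, returning the next state, a reward, and a done flag) and a policy callable that maps states to actions.

    def run_episode(env, policy, gamma=0.9):
        """Run one episode and return the discounted cumulative reward."""
        state = env.reset()                         # environment's initial state
        done = False
        total, discount = 0.0, 1.0
        while not done:
            action = policy(state)                  # the policy maps states to actions
            state, reward, done = env.step(action)  # environment returns the reward signal
            total += discount * reward              # accumulate (discounted) reward
            discount *= gamma
        return total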

Many RL algorithms include a value function or a Q-function. A value function gives the expected cumulative reward for each state under the current policy; a Q-function does the same for each state-action pair. In other words, the value function answers the question, “If I begin in state \(i\) and follow my policy, what will be my expected reward?”

In most algorithms, expected cumulative reward is discounted by some factor \(\gamma \in (0, 1)\); a typical value for \(\gamma\) is 0.9. In addition to more accurately modeling the behavior of humans and other animals, \(\gamma < 1\) helps to ensure that algorithms converge even when there is no terminal state or when the terminal state is never reached (because otherwise the expected cumulative reward could become infinite).
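
Formally, with discount factor \(\gamma\), the discounted return from time \(t\) and the value and Q-functions of a policy \(\pi\) are commonly written as

\[G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad V^{\pi}(s) = E\left[G_t \mid s_t = s\right], \qquad Q^{\pi}(s, a) = E\left[G_t \mid s_t = s, a_t = a\right]\]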

Note on Terminology

For mostly historical reasons, engineering and operations research use different words to talk about the same concepts. For example, the general field of reinforcement learning itself is sometimes referred to as optimal control, approximate dynamic programming, or neuro-dynamic programming.1

Exploration vs. Exploitation

One dilemma inherent to the RL problem setting is the tension between the desire to choose the best known option and the need to try something new in order to discover other options that may be even better. Choosing the best known action is known as exploitation, while choosing a different action is known as exploration.

Typically, this is solved by adding to the policy a small probability of exploration. For example, the policy could be to choose the best known action with probability 0.95 and to explore by choosing some other action at random with probability 0.05 (if exploration is uniform over the remaining actions, each is chosen with probability 0.05/(n-1), where n is the number of actions).

MDPs and Tabular methods

Many problems can be effectively modeled as Markov Decision Processes (MDPs), or more generally as Partially Observable Markov Decision Processes (POMDPs) <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>. That is, we have

  • a set of states \(S\)
  • a set of actions \(A\)
  • a set of conditional state transition probabilities \(T\)
  • a reward function \(R: S \times A \rightarrow \mathbb{R}\)
  • a set of observations \(\Omega\)
  • a set of conditional observation probabilities \(O\)
  • a discount factor \(\gamma \in [0, 1)\)

Given these things, the goal is to choose the action at each time step which will maximize \(E \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]\), the expected discounted reward.
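
For a small, fully observable MDP, these ingredients can be written down directly. The sketch below uses made-up states, actions, and numbers purely for illustration.

    # T[s][a] is a list of (probability, next_state, reward) triples.
    states = ["s1", "s2"]
    actions = ["a1", "a2"]
    T = {
        "s1": {"a1": [(0.9, "s1", 1.0), (0.1, "s2", 0.0)],
               "a2": [(0.5, "s1", 0.5), (0.5, "s2", 0.5)]},
        "s2": {"a1": [(0.2, "s1", 0.0), (0.8, "s2", -1.0)],
               "a2": [(0.4, "s1", 0.5), (0.6, "s2", 0.0)]},
    }
    gamma = 0.9                            # discount factor
    policy = {"s1": "a1", "s2": "a2"}      # one possible (deterministic) policy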

Monte Carlo methods

One possible approach is to run a large number of simulations to learn \(\pi^*\), the optimal policy. This works well when we know the environment and can run many simulations reasonably quickly. For example, it is fairly trivial to compute an optimal policy for the card game 21 (blackjack) <https://en.wikipedia.org/wiki/Twenty-One_(card_game)> by running many simulations, and the same is true for most simple games.
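
The heart of any Monte Carlo method is estimating values from sampled returns. The sketch below shows first-visit Monte Carlo policy evaluation for a fixed policy, assuming the same hypothetical reset()/step() environment interface as above; control variants additionally improve the policy between evaluations.

    from collections import defaultdict

    def mc_policy_evaluation(env, policy, episodes=10000, gamma=0.9):
        """Estimate V(s) for a fixed policy by averaging sampled returns (first-visit MC)."""
        returns_sum = defaultdict(float)
        returns_cnt = defaultdict(int)
        for _ in range(episodes):
            # Generate one episode under the policy.
            trajectory, state, done = [], env.reset(), False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                trajectory.append((state, reward))
                state = next_state
            # Walk backwards to accumulate discounted returns.
            G, first_visit = 0.0, {}
            for state, reward in reversed(trajectory):
                G = reward + gamma * G
                first_visit[state] = G   # overwritten until only the first visit remains
            for state, G in first_visit.items():
                returns_sum[state] += G
                returns_cnt[state] += 1
        return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}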

Temporal-Difference Learning

TODO

Planning

TODO

On-Policy vs. Off-Policy Learning

TODO

Model-Free vs. Model-Based Approaches

TODO

Imitation Learning

TODO

Q-Learning

Q-learning is a model-free RL algorithm that iteratively updates its Q-values toward the optimal ones. It is an off-policy method: its update target is based on the greedy action under the current Q estimate, rather than on the action the behavior policy actually takes.

The Q-learning algorithm is:

  1. Initialize \(t = 0\).
  2. Start at the initial state \(s_0\).
  3. The agent chooses the action \(a_t\) ɛ-greedily.
  4. For the chosen \(a_t\), the agent receives the reward \(r_{t+1}\) and observes the next state \(s_{t+1}\).
  5. Get (but do not perform) the greedy next action \(a_{t+1} = \operatorname{argmax}_{a \in A} Q(s_{t+1}, a)\).
  6. Compute the TD target \(y_t = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1})\), where \(\gamma\) is the discount factor.
  7. Compute the TD error \(\delta = y_t - Q(s_t, a_t)\).
  8. Update \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \cdot \delta\), where \(\alpha_t\) is the step size (learning rate) at time \(t\).
  9. Update \(t \leftarrow t + 1\) and repeat steps 3-9 until \(Q(s, a)\) converges.
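
The steps above map almost line for line onto code. The following is a minimal sketch, assuming a hypothetical environment with reset() and step(action) returning the next state, a reward, and a done flag, plus a discrete action space; the hyperparameter values are illustrative.

    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with an epsilon-greedy behavior policy."""
        Q = defaultdict(lambda: [0.0] * n_actions)          # Q[s][a], initialized to 0
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Step 3: epsilon-greedy action selection.
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[state][a])
                # Step 4: take the action, observe reward and next state.
                next_state, reward, done = env.step(action)
                # Steps 5-6: TD target uses the greedy next action (off-policy).
                best_next = max(Q[next_state]) if not done else 0.0
                target = reward + gamma * best_next
                # Steps 7-8: TD error and update.
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
        return Q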

Epsilon-Greedy Algorithm

\[a_{t} = \begin{cases} \operatorname{argmax}_{a \in A} Q(s_t, a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}\]

During training the agent either exploits, by taking the action that is optimal under the current Q-function, or explores, by taking a random action. A common schedule is to start with ɛ = 1 (purely random behavior) and gradually decrease ɛ over time, capping it at some small nonzero constant so that exploration never stops entirely.
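
A small helper capturing this behavior and schedule might look as follows; the decay rate and floor are illustrative choices rather than part of the algorithm.

    import random

    def epsilon_greedy(q_values, epsilon):
        """With probability epsilon take a random action, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))                    # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

    def decay_epsilon(epsilon, decay=0.995, floor=0.05):
        """Anneal epsilon toward a small nonzero floor, e.g. once per episode."""
        return max(floor, epsilon * decay)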

Q-Table / Q-Matrix

        a1           a2           ...   an
  s1    Q(s1, a1)    Q(s1, a2)    ...   Q(s1, an)
  s2    Q(s2, a1)    Q(s2, a2)    ...   Q(s2, an)
  ...
  sm    Q(sm, a1)    Q(sm, a2)    ...   Q(sm, an)

It is a lookup table storing the action-value function Q(s, a) for every state-action pair, where there are m states and n actions. Q(s, a) can be initialized arbitrarily, except for terminal states: by convention Q(terminal, ·) is set to 0, since no further reward can follow a terminal state.
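
When the numbers of states and actions are known in advance, the table is simply an m-by-n array; the NumPy layout below is one common choice (the sizes are arbitrary).

    import numpy as np

    m_states, n_actions = 6, 4            # example sizes
    Q = np.zeros((m_states, n_actions))   # arbitrary initialization; zeros are a common default
    # Rows corresponding to terminal states stay at 0, since no future reward follows them.
    best_action = Q[2].argmax()           # greedy action in state 2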

Reasons for using Q-learning:

  • It applies to environments with a discrete action space.
  • It does not require the true MDP model, i.e. the transition probabilities and rewards (it is model-free).
  • It can learn from incomplete episodes, because it is a TD method.

Drawbacks of Q-learning:

  • When the state and action spaces are continuous or extremely large, the curse of dimensionality makes it impractical to maintain a Q-table.
  • A Q-table cannot generalize: it cannot infer a good action for states it has never seen.

Deep Q-Learning

Deep Q-learning follows the same general method as Q-learning. Its innovation is to approximate the Q-function with a neural network, which makes it possible to learn very complex Q-functions. This makes it very powerful, especially because it brings the large body of well-developed theory and tooling for deep learning to bear on reinforcement learning problems.
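
As a minimal sketch (assuming PyTorch; the architecture and layer sizes are arbitrary choices, not part of any specific published method), the Q-table is replaced by a network that maps a state vector to one Q-value per action:

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one estimated Q-value per action."""
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return self.net(state)   # shape: (batch, n_actions)

    # Greedy action for a batch of states:
    # actions = q_net(states).argmax(dim=1)

In practice, deep Q-learning implementations typically also rely on stabilizing techniques such as experience replay and a separate target network.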