RL — Introduction to Deep Reinforcement Learning


Jonathan Hui

Deep reinforcement learning is about choosing the best actions based on what is perceived from the environment. Unfortunately, reinforcement learning (RL) can be intimidating, with a steep learning curve in both concepts and terminology. This article explores deep RL by giving an overview of the broader landscape, without shying away from the equations and terms that form the foundation for deeper understanding. It will not pretend that an RL problem can be solved in 20 lines of code; instead, the focus is on making the journey as approachable and clear as possible.

Note: This article was written several years ago. While preparing to write two books on discriminative AI and generative AI, the material has been reviewed to improve clarity. Minor content updates may be made, but no major structural changes or significant additions will occur here. Instead, this content will serve as source material for the books, where it will be expanded, updated, and organized more effectively to fit the context. Follow me on LinkedIn to stay informed about their release dates.

In most areas of AI, mathematical frameworks are developed to formalize problems. In RL, that framework is the Markov Decision Process (MDP). It provides a simple yet powerful way to model complex decision-making problems. An agent, such as a human or a robot, observes its environment and takes actions. Rewards are provided as feedback, but they can be infrequent and delayed. When rewards are significantly delayed, it becomes difficult to trace back through the chain of events to identify which actions were responsible for the outcome, making the learning process especially challenging.

A Markov Decision Process (MDP) is composed of a set of states, a set of actions, a reward function, a transition (dynamics) function, and a discount factor.


A state can be represented as raw images.


AlphaGo & Atari Seaquest

Or, in robotic control, sensors can measure joint angles, velocities, and the pose of the end-effector, forming the state representation.


(Image by Author)

An action can take many forms, such as moving a chess piece, repositioning a robotic arm, or adjusting a joystick. In some tasks, rewards are sparse. For example, in the game of Go, the agent receives a reward of 1 for a win or −1 for a loss. Other tasks provide more frequent feedback, as in the Atari game Seaquest, where the agent scores points whenever it hits a shark.

The discount factor γ reduces the weight of future rewards, reflecting the principle that a delayed reward often holds less value than an immediate one. It also helps some algorithms achieve stable convergence.
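
As a tiny illustration, the discounted return of a reward sequence can be computed as follows (a minimal sketch; the reward lists and the value of γ are made-up examples, not taken from the article):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum the rewards, discounting each one by how far in the future it arrives."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward received three steps from now is worth slightly less than one received immediately.
print(discounted_return([0, 0, 0, 1]))   # 0.99**3 ≈ 0.970
print(discounted_return([1, 0, 0, 0]))   # 1.0
```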


The horizon, the number of time steps over which the agent acts, can be finite (limited to N time steps) or infinite, continuing indefinitely.

The transition function describes the environment’s dynamics by predicting the next state given the current state and action. This function, often referred to as the model, is central to model-based reinforcement learning.

Convention

Reinforcement learning concepts draw from many research areas, including control theory. As a result, different notations are often used in different contexts, which can be confusing when reading RL materials. It is best to clarify this early.

The state may be denoted as s or x, and the action as a or u. The term “action” is equivalent to “control” in many contexts. Objectives can be framed in two equivalent ways: maximizing rewards or minimizing costs, where costs are simply the negative of rewards. Notation can also appear in either uppercase or lowercase depending on the source.


Policy


In reinforcement learning, the goal is to find an optimal policy — a strategy that defines the best action to take from any given state.


(Image by Author)

Like the weights in deep learning models, a policy can be parameterized by θ, and the objective of training is to learn these parameters to make the most rewarding decisions.


In real life, nothing is absolute, so a policy can be either deterministic or stochastic. In a stochastic policy, the output is a probability distribution over possible actions rather than a single fixed choice.
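
As a small sketch of the difference, a deterministic policy returns a single action, while a stochastic policy returns a distribution that is then sampled. The three-action setup and the logits below are arbitrary placeholders for illustration:

```python
import numpy as np

def action_probabilities(logits):
    """Softmax: turn unnormalized action scores into a probability distribution."""
    z = np.exp(logits - np.max(logits))      # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])           # e.g. scores for [left, right, stay]
deterministic_action = int(np.argmax(logits))                    # always pick the top action
probs = action_probabilities(logits)
stochastic_action = int(np.random.choice(len(probs), p=probs))   # sample; any action can occur
```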


Finally, the objective in reinforcement learning is to find a sequence of actions that maximizes the expected cumulative reward or, equivalently, minimizes the expected cumulative cost.


There are several ways to approach the problem:

  1. Value-based methods — Estimate how beneficial it is to reach a particular state or take a specific action (value learning).
  2. Model-based methods — Use a model of the environment to plan actions that maximize rewards.
  3. Policy-based methods — Directly optimize a policy to choose actions that maximize rewards.

Each of these approaches will be explored in the following sections.

Notation

If you would like a quick refresher on reinforcement learning terms and notation before diving into the methods, the following table provides a helpful reference.


Modified from source

Model-based RL

Intuitively, if the rules of the game and the cost of each move are known, it is possible to select actions that minimize the total cost. The model p — also referred to as the system dynamics — predicts the next state given the current state and action. Mathematically, this is represented as a probability distribution. In this article, the model may be denoted as either p or f.


For example, in a cart-pole system, the model p predicts the angle of the pole after taking a given action.


(Image by Author)


Here is the probability distribution for θ at the next time step in the example above.


A model can represent the laws of physics, or simply encode the rules of a game like chess. The core idea of model-based reinforcement learning is to use the model together with a cost function to determine the optimal sequence of actions — more precisely, a trajectory of states and actions.


Consider Go: here, the “model” is simply the game’s rules. Using these rules, it is possible to simulate legal moves and search for actions that lead to a win. Because the search space is enormous, more efficient search strategies are required to explore it effectively.

AlphaGo

In model-based reinforcement learning, the model and cost function are used to determine an optimal trajectory of states and actions, a process also referred to as optimal control.

Sometimes the model is unknown, but it can still be learned. Deep learning can be used to capture complex dynamics from sample trajectories or to approximate them locally. The video below demonstrates a robot performing tasks using model-based reinforcement learning. Instead of being programmed directly, the robot is trained for about 20 minutes to learn each task largely on its own. Once trained, it can handle situations it has not encountered before. Objects can be moved or the hammer’s grasp can be altered, yet the robot is still able to complete the task successfully.

The tasks may sound simple, but they are far from easy to solve. To make them more tractable, approximations are often introduced. A common approach is to approximate the system dynamics as a linear function of the state and action, and the cost as a quadratic function.

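As an illustration of the linear part, a local model of the form sₜ₊₁ ≈ A·sₜ + B·aₜ + c can be fitted to sampled transitions with ordinary least squares. This is a minimal sketch; the array shapes and names are assumptions made for the example:

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Fit next_state ≈ A·state + B·action + c by least squares over sampled transitions.
    states: (N, ds), actions: (N, da), next_states: (N, ds)."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])   # one row per transition: [s, a, 1]
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)           # solve X @ W ≈ next_states
    ds, da = states.shape[1], actions.shape[1]
    A, B, c = W[:ds].T, W[ds:ds + da].T, W[-1]
    return A, B, c                                                # s' ≈ A @ s + B @ a + c
```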

The next step is to determine the actions that minimize the cost while satisfying the constraints of the model.


Well-established optimization methods such as Linear Quadratic Regulator (LQR) can be used to solve this type of objective. For nonlinear system dynamics, the iterative Linear Quadratic Regulator (iLQR) applies LQR repeatedly, refining the solution in a manner similar to Newton’s optimization. While effective, these methods are often complex and computationally demanding. For the current discussion, the key point is that given a cost function and a model, it is possible to determine the corresponding optimal actions.

Model-based reinforcement learning has a strong advantage over other RL approaches due to its high sample efficiency. Many models can be locally approximated with relatively few samples, and once a model is learned, trajectory planning can be performed without collecting additional data. This is especially valuable when physical simulations or real-world experiments are time-consuming, leading to substantial savings. As a result, model-based RL is widely adopted in robotic control, where similar training with other RL methods could take weeks.

The process can be illustrated using Model Predictive Control (MPC). First, the system runs either a random policy or an informed, pre-designed policy to explore the state–action space and collect data for fitting the dynamics model. Once the model is learned, step three involves using iLQR to plan the optimal sequence of controls. However, only the first action from this planned sequence is executed in the real environment. The new state is then observed, and the trajectory is replanned. This iterative process ensures that the agent can adapt its plan on the fly, taking corrective actions in response to unexpected changes or model inaccuracies.


Source
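
A rough sketch of that loop is shown below. The environment interface and the `fit_dynamics` / `plan_trajectory` helpers are hypothetical placeholders for the components described above, passed in as arguments rather than implemented here:

```python
def mpc_loop(env, fit_dynamics, plan_trajectory, initial_data, n_rounds=20, horizon=15):
    """Model Predictive Control: fit a model, plan a whole trajectory, but execute only
    the first planned action before observing the real outcome and replanning."""
    data = list(initial_data)                      # transitions from a random / seed policy
    for _ in range(n_rounds):
        model = fit_dynamics(data)                 # refit p(s' | s, a) on everything seen so far
        state, done = env.reset(), False
        while not done:
            actions = plan_trajectory(model, state, horizon)    # e.g. iLQR over the next H steps
            next_state, reward, done = env.step(actions[0])     # execute only the first action
            data.append((state, actions[0], reward, next_state))
            state = next_state                     # observe the real state, then replan from it
    return fit_dynamics(data)
```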

The figure below summarizes the process. The environment is first observed to extract the current state. These observations are then used to fit a model of the system’s dynamics. A trajectory optimization method is applied to this model to plan a sequence of actions at each time step, defining the optimal path toward the objective.


(Composed by Author)

Value learning

The next major RL method is Value Learning. In the game of Go, even with a complete understanding of the rules, planning the exact winning move is extremely difficult due to the exponential growth of possible positions. Instead, one can evaluate how good a move is and how valuable it is to reach a particular board position. Imagine having a cheat sheet that assigns a score to every state:


The next step would simply be to select the reachable state with the highest score and take the action that leads to it.

The value function V(s) measures the expected sum of discounted future rewards starting from state s and following a given policy π. Intuitively, it represents how much total reward can be expected from that state when actions are chosen according to the policy.


In the cart-pole example, the duration for which the pole remains upright can be used as the reward signal. Consider two states, s₁ and s₂: in s₁, the pole is in a position that makes it easier to keep balanced, while in s₂ it is harder to prevent it from falling. For most policies, s₁ would have a higher value function than s₂, as it is more likely to yield greater cumulative rewards.


(Image by Author)

One way to estimate V(s) is by using the Monte Carlo method. The idea is to run the policy and play out an entire episode until it terminates, then record the total rewards obtained. For example, in the cart-pole task, the total reward could be the duration the pole remains upright. By repeating this process many times — known as Monte Carlo rollouts — and averaging the total rewards from all runs starting at the same state, the value V(s) can be approximated.
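
A tabular sketch of this idea is shown below, assuming a toy environment whose `reset()`/`step()` methods return hashable states along with a reward and a done flag (an assumed interface for illustration, not a specific library's API):

```python
from collections import defaultdict

def monte_carlo_values(env, policy, n_episodes=1000, gamma=0.99):
    """Estimate V(s) as the average discounted return observed after visiting s."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        trajectory, state, done = [], env.reset(), False
        while not done:                            # play one full episode with the given policy
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state
        g = 0.0
        for state, reward in reversed(trajectory): # accumulate the return backwards in time
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```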

There are several ways to derive an optimal policy. In policy evaluation, the process can begin with a random policy to estimate how valuable each state is. After many iterations, the value function V(s) is used to choose the most promising next state. The model of the environment then determines the action that leads to that state. In games like Go, this approach is straightforward because the rules of the game are fully known.


(Image by Author)

Another approach is to alternate between policy evaluation and policy improvement. After each evaluation step, the policy is refined using the updated value function, guiding the agent toward better action choices. This cycle of evaluating and improving continues until the policy converges to the optimal one, a process known as policy iteration.

Modified from source
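
For a small, fully known MDP, this cycle can be written out directly. The sketch below assumes the transitions and rewards are available as arrays, with `P[s][a]` listing `(probability, next_state)` pairs and `R[s][a]` a scalar reward; this data layout is made up for the example:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular policy iteration: evaluate the current policy, then act greedily on the result."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep until V stops changing under the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the new value estimates.
        stable = True
        for s in range(n_states):
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```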

The challenge arises when the model is unknown: without knowing the environment’s transition dynamics, it is impossible to determine which action will lead to the desired next state.


(Image by Author)

Using the state-value function to pick actions is not a model-free method, as it requires a model of the environment to determine which action leads to which state when making decisions. However, even when the model is unknown, the value function remains valuable as a complementary tool in other reinforcement learning approaches, including those that do not require a model, by providing useful estimates of state quality.

Value iteration with Dynamic programming

In addition to the Monte Carlo method, dynamic programming can be used to compute V(s). This involves taking an action, observing the immediate reward, and combining it with the estimated value of the next state to update the current state’s value.


The exact formulation is the Bellman update V(s) ← E[ r(s, a) + γ·V(s′) ]: the value of a state is the expected immediate reward plus the discounted value of the state that follows.

If the model is unknown, the value function V can be estimated through sampling. In this approach, an action is executed, and the resulting reward and next state are observed directly from the environment.

Function fitting

However, maintaining the value function V for every state is not feasible for many problems. To address this, we use supervised learning to train a deep neural network that approximates V.

(Image by Author)

Here, y represents the target value, which can be estimated using the Monte Carlo method.


Otherwise, the concept of dynamic programming can be applied with a one-step lookahead to update value estimates. This approach is known as Temporal Difference (TD) learning.


TD learning updates the value estimate V(sₜ) immediately after taking a single action. Specifically, the agent observes the reward from that action and uses the current estimate of the value of the next state to update V(sₜ). This allows learning to occur online, at each step, and combines the advantages of Monte Carlo's experience-based approach with the bootstrapping of dynamic programming. For example, in the first row below, an action is taken that yields a reward of −2 and leads to state SJ, whose current value estimate is −7. The value of state SF is therefore updated to −2 + (−7) = −9.


(Image by Author)
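
The update itself is a one-liner. The sketch below reuses the SF/SJ example above, with γ = 1 and a learning rate of 1 so that the numbers match exactly:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Move V(state) a small step toward the one-step TD target r + γ·V(next_state)."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
    return V[state]

V = {"SF": 0.0, "SJ": -7.0}
td0_update(V, "SF", reward=-2, next_state="SJ", alpha=1.0, gamma=1.0)   # V["SF"] becomes -9.0
```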

The Monte Carlo method can provide accurate value estimates, but for a stochastic policy or a stochastic environment model, each run may yield different results, leading to high variance. Temporal Difference (TD) learning updates its value estimates using far fewer actions, which results in lower variance. However, especially during early training, TD can have high bias, producing systematically inaccurate estimates.

High bias leads to incorrect predictions, while high variance makes convergence difficult. In practice, these methods can be combined by using a k-step lookahead to form the target. This approach balances bias and variance, helping stabilize training.

Action-value function

The value learning concept can still be applied without a model by evaluating actions instead of states. The action-value function Q(s, a) measures the expected discounted rewards of taking a specific action in a given state. The trade-off is that more information must be stored: for each state, if k possible actions exist, there will be k corresponding Q-values to track.


(Image by Author)

For optimal results, the action with the highest Q-value is selected.


As shown, a model is not required to determine the optimal action, making action-value learning a model-free approach. Which action below has the higher Q-value? Intuitively, moving left in the depicted state should yield a higher value than moving right.


(Image by Author)

In deep learning, gradient descent often converges more efficiently when features are zero-centered. Similarly, in reinforcement learning, the absolute magnitude of rewards is less important than how much better an action performs compared to the average action in the same state. This idea is captured by the advantage function A(s, a). Many RL algorithms use A instead of Q to emphasize relative performance, helping reduce variance in policy gradient estimates and leading to more stable training.

A(s, a) = Q(s, a) − V(s)

where A(s, a) measures how much better taking action a in state s is compared to the average action in that state. To recap, here are all the definitions:

V(s): the expected sum of discounted rewards starting from state s and following policy π.

Q(s, a): the expected sum of discounted rewards after taking action a in state s and then following π.

A(s, a) = Q(s, a) − V(s): how much better action a is than the average action in state s.

Q-learning

So how can the Q-value be learned? One of the most popular approaches is Q-learning, which follows these steps:

  1. Sample an action based on the current state.
  2. Observe the reward and the next state.
  3. Select the action with the highest Q-value for the next step.


Then dynamic programming is applied again to iteratively compute the Q-value function:

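A tabular sketch of these steps is shown below; the environment interface (`reset()`, `step()`, and an `n_actions` attribute) is assumed for illustration, and the function-approximation version discussed next replaces the table with a neural network:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)                       # Q[(state, action)], missing entries default to 0
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # 1. Sample an action: usually the greedy one, occasionally a random one.
            if random.random() < epsilon:
                action = random.randrange(env.n_actions)
            else:
                action = max(range(env.n_actions), key=lambda a: Q[(state, a)])
            # 2. Observe the reward and the next state.
            next_state, reward, done = env.step(action)
            # 3. Bootstrap from the best Q-value available in the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(env.n_actions))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The epsilon-greedy choice in step 1 is the exploration policy discussed later in this section.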

Here is the Q-learning algorithm with function approximation. Step 2 reduces variance by applying Temporal Difference (TD) updates, which improves sample efficiency compared to the Monte Carlo approach that requires sampling until the episode ends.


Modified from source

Exploration is essential in reinforcement learning. Without it, the agent may never discover better strategies beyond its current knowledge. However, excessive exploration can waste time on unpromising actions.

In Q-learning, exploration is often implemented through an exploration policy such as epsilon-greedy. In this approach, the agent usually selects the action with the highest Q-value but occasionally chooses a random action with a small probability. This ensures a balance between exploiting known good actions and exploring new possibilities.

At the start of training, Q-values are typically initialized to zero, so no action appears better than the others. As training progresses and the Q-values are updated, more promising actions begin to stand out, and the behavior gradually shifts from exploration toward exploitation.

Deep Q-Network (DQN)

Q-learning tends to be unstable when combined with deep neural networks due to issues such as correlated data and non-stationary targets. In this section, we will bring together the key concepts introduced so far and present the Deep Q-Network (DQN). This approach achieved superhuman performance in several Atari games while relying solely on raw image frames as input.


Source

DQN is the flagship example of combining Q-learning with a deep neural network to approximate the Q-value function. The network is trained using supervised learning techniques, but reinforcement learning introduces challenges that make this process far less stable.

In supervised learning, input samples are typically randomized, giving each class a balanced and relatively stable representation across training batches. In reinforcement learning, however, the data distribution changes as the agent explores. The states and actions visited early in training differ from those encountered later, meaning the input space is constantly shifting.

Another complication is that the target values for Q are updated as the agent’s estimates improve. This means both the inputs and the outputs change frequently, which can destabilize training. These issues highlight why directly applying deep learning methods to Q-learning can be problematic and why DQN introduced additional mechanisms to address them.


This makes it difficult to train a stable Q-value approximator. DQN addresses this by introducing two techniques: experience replay and a target network, both of which slow down the rate of change so that Q can be learned more gradually.

Experience replay stores a large number of past state–action–reward transitions (often up to one million) in a replay buffer. During training, the network is updated using random batches sampled from this buffer. This randomization breaks the strong correlations between consecutive samples and makes the training data distribution more stable, bringing it closer to the conditions of supervised learning in deep learning.

In addition, DQN maintains two separate networks for storing Q-values. The primary network is updated continuously during training, while the second network, called the target network, is updated only occasionally by copying the weights from the primary network. Using the target network to compute target Q-values reduces volatility in the target estimates, making training more stable.

For those interested, the training objective can be expressed as follows, where D is the replay buffer and θ− represents the parameters of the target network:

L(θ) = E(s, a, r, s′) ∼ D [ ( r + γ · max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ) )² ]
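
The sketch below shows the two mechanisms in isolation, with the neural networks abstracted away as callables. This is a conceptual sketch, not the original DQN implementation; the network, optimizer, and training loop are omitted:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of past transitions; random sampling breaks the correlation
    between consecutive experiences."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def dqn_targets(batch, target_q, gamma=0.99):
    """Regression targets r + γ·max_a' Q(s', a'; θ⁻), computed with the *target* network."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(target_q(next_state)))
        targets.append(reward + bootstrap)
    return np.array(targets)

# Every C updates, the target network is refreshed by copying the online network's weights,
# so the targets above change only occasionally and training stays more stable.
```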

DQN enables value learning to be applied to reinforcement learning problems within a more stable training framework.

Policy-Gradient

So far, two major reinforcement learning approaches have been introduced: model-based methods and value-based methods. Model-based RL uses a model of the environment and a cost function to plan the optimal path. Value-based RL uses the value function V or the action-value function Q to derive the optimal policy.

The next section will examine the third major approach in reinforcement learning — policy gradient methods — which are among the most widely used and optimize the policy directly.


Many actions, especially in human motor control, are intuitive. People often observe and act immediately rather than engage in detailed planning or gather extensive samples to maximize returns.


(Image by Author)

In the cart-pole example, the underlying physics may be unknown, but observation-based experience can still guide actions — for instance, when the pole tilts to the left, moving the cart left is an intuitive response. In many cases, selecting actions directly from observations is far simpler than constructing and reasoning over a full model of the system.


This type of RL method is policy-based, modeling a policy parameterized directly by θ. The concept behind policy gradients is straightforward: actions that lead to higher rewards are made more likely, while those with lower rewards become less likely.

The policy gradient is computed as

∇θ J(θ) = E[ ∇θ log πθ(a | s) · A(s, a) ]

This gradient is used to update the policy through gradient ascent, adjusting it in the direction that yields the steepest increase in expected reward.
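
A minimal sketch of this update for a linear softmax policy over discrete actions is shown below; the feature-vector states, the use of the return as the advantage weight, and the parameter shape are all assumptions made for the example:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, episode, learning_rate=0.01, gamma=0.99):
    """One policy-gradient (REINFORCE) step for a linear softmax policy.
    theta has shape (n_actions, n_features); episode is a list of (state, action, reward)
    tuples where each state is a feature vector."""
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):            # discounted return following each step
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (state, action, _), g in zip(episode, returns):
        probs = softmax(theta @ state)                # π(a | s) for every action
        d_log_pi = -np.outer(probs, state)            # ∇θ log π: −π(a'|s)·s on every row a' ...
        d_log_pi[action] += state                     # ... plus s on the row of the taken action
        grad += g * d_log_pi                          # weight by the return (or an advantage)
    theta = theta + learning_rate * grad              # gradient *ascent* on expected reward
    return theta
```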


The log-likelihood term ∇θ log πθ(a | s) measures how likely an action is under the current policy. When multiplied by the advantage function, it adjusts the policy to favor actions with rewards greater than the average and, conversely, reduces the probability of actions with lower-than-average rewards. Such policy changes must be made cautiously, as the gradient is a first-order approximation and may be inaccurate where the reward function has steep curvature. If the policy update is too aggressive, the estimated improvement can deviate significantly from reality, potentially leading to poor or even disastrous decisions.


Image source

To address this issue, a trust region is imposed, and the optimal control is selected only within this region. By establishing an upper bound on the potential error, it becomes possible to determine how far the policy can be adjusted before excessive optimism leads to harmful outcomes. Within the trust region, there is a reasonable guarantee that the updated policy will improve performance; outside the region, this guarantee no longer holds, and forcing updates can lead to significantly worse states and disrupt training. Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) apply the trust region concept to enhance policy model convergence.

Actor-Critic Method

Vanilla policy gradient methods typically require a large number of samples to reach an optimal solution. Each time the policy is updated, new complete trajectories must be collected under the new policy. Convergence can be slow and is often a major concern. This raises two key questions: Can the policy gradient be computed with fewer samples? And can the variance of the advantage estimates be further reduced to make gradient updates more stable?


Reinforcement learning methods are often complementary rather than mutually exclusive, and combining them can enhance performance. The actor–critic approach blends policy gradient methods with value function estimation. In this framework, the actor models the policy, while the critic estimates the value function V.

Adding a critic (value function) improves sample efficiency by providing low-variance, bootstrapped estimates of returns using temporal-difference learning, rather than relying solely on high-variance Monte Carlo returns. This enables policy updates from partial trajectories without collecting the entire episode’s trajectory, allowing more frequent and stable gradient updates, smoother learning, and faster convergence compared to pure policy gradient methods like REINFORCE.

The actor–critic algorithm closely resembles the policy gradient method. In step 2 below, the V-value function (the critic) is fitted. In step 3, temporal-difference (TD) learning is used to calculate the advantage estimate. In step 5, the policy (the actor) is updated.


Source
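
Conceptually, each step looks like the sketch below, where the critic's one-step TD error is reused as the advantage estimate and the actor update is abstracted as a callable (hypothetical names, with a tabular critic for brevity):

```python
def actor_critic_step(V, update_actor, state, action, reward, next_state, done,
                      critic_lr=0.1, gamma=0.99):
    """One online actor-critic update from a single transition."""
    bootstrap = 0.0 if done else gamma * V[next_state]
    td_error = reward + bootstrap - V[state]          # doubles as the advantage estimate A(s, a)
    V[state] += critic_lr * td_error                  # critic: move V(s) toward the TD target
    update_actor(state, action, advantage=td_error)   # actor: policy-gradient step weighted by A
    return td_error
```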

Guided Policy Search

The actor–critic approach combines value learning with policy gradients. Similarly, model-based and policy-based methods can be integrated. In such a setup, model-based RL is used to improve a controller, which is then deployed on a robot to select actions based on the results of trajectory optimization. The resulting trajectories are recorded, and, in parallel, the generated trajectories are used to train a policy (as shown in the right figure below) through supervised learning.


Modified from source

Why train a policy when a controller is already available? Both predict the action from a given state, but the question becomes whether the model or the policy is simpler. Model-based learning can produce accurate trajectories but may yield inconsistent results in regions where the model is complex or insufficiently trained. Accumulated errors can also degrade performance. If the policy is simpler, it may be easier to learn and generalize. Supervised learning can be applied to remove noise from model-based trajectories and uncover the underlying rules behind them, thereby improving generalization.

Policy gradient methods, by contrast, operate in a manner similar to trial-and-error, albeit with more informed and structured exploration. Training often requires a long warm-up period before the agent begins producing actions that are meaningfully aligned with the task.


Source

Guided Policy Search combines the strengths of both approaches by using model-based RL to guide the search process more effectively. The resulting trajectories are then used to train a policy that can generalize better, particularly when a simpler policy is sufficient for the task. Training alternates between updating the controller and the policy. To prevent aggressive changes, a trust region is applied between the controller and the policy, ensuring that both are learned in closely aligned steps. This coordinated learning process helps the training converge more reliably.


Source

Deep Learning

In reinforcement learning, deep learning plays the role of the system’s eyes and ears. CNNs are used to extract visual features from images, while Transformers can process sequential data such as speech.


(Image by Author)

Beyond perception, deep networks are powerful function approximators. They can be trained to approximate any function needed in reinforcement learning, including value functions (V-value), action-value functions (Q-value), policies, and models of the environment’s dynamics. This flexibility allows them to support tasks ranging from decision-making to predictive modeling.


(Image by Author)

Partially Observable MDP

In many problems, objects in the environment can be temporarily hidden or occluded by others. Relying solely on the current image is therefore insufficient to capture the true state of the environment. In a Partially Observable Markov Decision Process (POMDP), the state can be constructed from a recent history of observations rather than a single frame. Traditionally, this has been achieved by applying a recurrent neural network (RNN), such as an LSTM or GRU, to a sequence of images, allowing the model to integrate temporal information and infer the underlying state. More recently, attention-based architectures such as transformers (e.g., GTrXL) have been used to model long-range dependencies across observations, and approaches like temporal convolutional networks and state-space models (e.g., deep Kalman filters) have further improved performance in environments with complex partial observability.


(Image by Author)
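
The simplest version of "a recent history of observations" is the frame stack used by DQN on Atari; a recurrent or attention-based encoder replaces it when longer memory is needed. A minimal sketch (the stack size and array shapes are arbitrary choices for the example):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Approximate the hidden state of a POMDP with the k most recent observations."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        for _ in range(self.k):                 # pad with copies of the first frame
            self.frames.append(first_obs)
        return self.state()

    def push(self, obs):
        self.frames.append(obs)
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=0)    # e.g. shape (k, H, W), fed to the policy/value net
```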

Which Methods to Use?

The discussion examined three primary categories of reinforcement learning approaches: Model-Based RL, Value-Based Methods, and Policy Gradient Methods. The table below summarizes their respective advantages and disadvantages, showing how each method is naturally suited to certain problem settings while facing characteristic limitations. Over time, advanced variants within each category have been developed to address specific weaknesses, improve stability, increase sample efficiency, and extend applicability to more complex environments.


(Composed by Author)

More

As one of the most complex areas in artificial intelligence, RL mirrors the intricacies of real-world decision-making. For readers interested in exploring further, additional articles are available at the following link: