Reinforcement Learning Algorithms Explorer

Akash Chandra Debnath


What is Reinforcement Learning?

Reinforcement Learning is a machine learning paradigm in which an agent learns to make decisions by interacting with an environment. It revolves around reward-based learning: the agent strives to maximize cumulative rewards through a series of actions. The core idea is to enable machines to learn and adapt autonomously by receiving feedback in the form of rewards or penalties. This paradigm finds applications in many real-world scenarios, ranging from game-playing strategies to robotics and optimization problems.

Key Components of Reinforcement Learning

Agent: The entity that makes decisions and takes actions in the environment. This could be a robot, a game-playing algorithm, or any system that can interact with its surroundings.

Environment: The external system with which the agent interacts. It provides feedback to the agent based on the actions taken, influencing the agent’s future decisions.

State: A representation of the current situation of the environment. The state is essential for the agent to make decisions about which action to take.

Action: The set of possible moves or decisions that the agent can make. The agent chooses an action based on its current state.

Reward: A numerical value that the environment provides to the agent as feedback for the action taken in a specific state. The goal of the agent is to maximize the cumulative reward over time.

Policy: The strategy or mapping from states to actions that the agent follows. It defines how the agent chooses actions in different situations.

Value Function: A function that estimates the expected cumulative future reward of being in a particular state. It helps the agent evaluate the desirability of different states.

Discount Factor (γ): A parameter that determines the importance of future rewards. It discounts the value of future rewards, making immediate rewards more valuable.

The agent learns by taking actions in the environment and receiving positive or negative rewards.
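To make these components concrete, here is a minimal sketch of the agent-environment loop in Python. The `SimpleGridEnv` class and the random "policy" are hypothetical stand-ins invented for illustration, not a real library API; the discount factor shows how future rewards are weighted when computing the return.

```python
# A minimal sketch of the agent-environment interaction loop described above.
import random

class SimpleGridEnv:
    """Toy 1-D environment: the agent starts at position 0 and tries to reach position 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1    # small penalty per step, bonus at the goal
        return self.state, reward, done

env = SimpleGridEnv()
gamma = 0.9                               # discount factor
state, done, ret, t = env.reset(), False, 0.0, 0

while not done:
    action = random.choice([-1, 1])       # the "policy": here, purely random
    state, reward, done = env.step(action)
    ret += (gamma ** t) * reward          # discounted cumulative reward
    t += 1

print(f"episode finished in {t} steps, discounted return = {ret:.3f}")
```

A learning agent would replace the random action choice with a policy that it improves from the rewards it observes; the algorithms below differ in how that improvement is done.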

Model-free Algorithms of Reinforcement Learning (RL)

Reinforcement Learning (RL) encompasses a variety of algorithms designed to enable agents to learn optimal strategies by interacting with an environment. Model-free algorithms in reinforcement learning fall broadly into three categories:

  1. Value-based RL Algorithm
  2. Policy-based RL Algorithm
  3. Actor-critic RL Algorithm

1. Value-based RL Algorithm

Value-based Reinforcement Learning algorithms focus on estimating the value of different actions or states in order to make decisions that maximize cumulative rewards.

I. Q-Learning: In this algorithm, the agent learns a policy that maximizes cumulative rewards over time. It updates Q-values based on the difference between the predicted value and the reward actually received (plus the estimated value of the next state).
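The update can be written in a few lines. Below is a minimal tabular sketch, assuming a small discrete problem; the state/action counts, learning rate, and example transition are illustrative values, not from any particular task.

```python
# A minimal sketch of the tabular Q-learning update rule.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                   # learning rate and discount factor

def q_update(state, action, reward, next_state, done):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# Example transition: in state 2, taking action 1 gave reward -0.1 and led to state 3.
q_update(state=2, action=1, reward=-0.1, next_state=3, done=False)
print(Q[2])
```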

II. Deep Q Networks (DQN): DQN extends Q-learning with a neural-network approximation of the Q-values. It introduces experience replay, in which an interaction history is collected, stored, and reused to train the parameters of the Q-function. Randomly sampling transitions from this memory, instead of using correlated transitions from recent episodes, smooths the distribution of training data and can stabilize the training process.
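The replay idea itself is independent of the neural network. Here is a minimal sketch of a replay buffer; the capacity, batch size, and dummy transitions are arbitrary illustrative values.

```python
# A minimal sketch of DQN-style experience replay: store transitions and train on
# random mini-batches instead of consecutive, correlated steps.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

buffer = ReplayBuffer()
for t in range(100):                           # fill with dummy transitions
    buffer.push(t, t % 2, 0.0, t + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=8)
```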

III. Soft Q-Learning: It is a variant of the traditional Q-learning algorithm that incorporates a probabilistic or “soft” approach to action selection. This algorithm is particularly useful in environments with continuous action spaces, where deterministic policies may not be well-suited. Soft Q-learning introduces a stochastic policy that allows for more flexibility and exploration.
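The "soft" part can be illustrated with the action-selection step alone: instead of always taking the argmax of the Q-values, actions are sampled from a softmax (Boltzmann) distribution over them. The Q-values and temperature below are made-up numbers for illustration.

```python
# A minimal sketch of soft (stochastic) action selection from Q-values.
import numpy as np

def soft_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = q_values / temperature
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.random.choice(len(q_values), p=probs)

q = np.array([1.0, 1.2, 0.3])
print(soft_action(q, temperature=0.5))          # usually action 1, but not always
```

Lower temperatures make the policy closer to greedy; higher temperatures make it explore more.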

2. Policy-based RL Algorithm

Policy-based reinforcement learning algorithms directly learn the optimal policy, which is a mapping from states to actions. Unlike value-based methods that aim to estimate the value function and then derive a policy from it, policy-based approaches directly parameterize the policy and adjust its parameters to maximize the expected cumulative reward.

I. Policy Gradient Methods/REINFORCE: REINFORCE algorithms directly optimize the parameters of the policy to maximize the expected cumulative reward. The objective is to find the policy parameters that lead to higher probabilities for actions that result in greater rewards.
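As a minimal sketch, consider REINFORCE on a 3-armed bandit with a softmax policy over per-action preferences `theta`. The reward values are invented for illustration; a real task would use full episode returns rather than a single-step reward.

```python
# A minimal sketch of the REINFORCE update: step the policy parameters along
# reward * grad log pi(action).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # policy parameters (one preference per arm)
true_rewards = np.array([0.1, 0.5, 0.9])   # hypothetical expected reward per arm
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                 # sample an action from the policy
    r = rng.normal(true_rewards[a], 0.1)       # observe a noisy reward
    grad_log_pi = -probs                       # gradient of log pi(a) w.r.t. theta
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi           # REINFORCE update

print(softmax(theta))   # probability mass should concentrate on the best arm (index 2)
```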

II. Trust Region Policy Optimization (TRPO): TRPO constrains each policy update to a trust region, within which policy improvement is theoretically guaranteed to be monotonic. The trust region is expressed as an expected KL divergence between the old policy and the new policy, and the sum over actions is replaced with importance sampling.
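The constrained quantity is easy to compute for a single state. Below is a minimal sketch of the KL divergence between the old and updated action distributions; the two distributions are made-up examples.

```python
# A minimal sketch of the quantity TRPO constrains: KL(pi_old(.|s) || pi_new(.|s)).
import numpy as np

def kl_divergence(p_old, p_new):
    return np.sum(p_old * np.log(p_old / p_new))

p_old = np.array([0.5, 0.3, 0.2])    # old policy's action probabilities in state s
p_new = np.array([0.45, 0.35, 0.2])  # proposed new policy's action probabilities
print(kl_divergence(p_old, p_new))   # TRPO keeps (the expectation of) this below a threshold
```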

III. Proximal Policy Optimization (PPO): PPO uses a clipped surrogate objective that forms a lower bound on the TRPO objective, relying only on first-order optimization to improve data efficiency and performance.
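Here is a minimal sketch of that clipped surrogate for a batch of actions. `ratio` stands for pi_new(a|s) / pi_old(a|s); the ratio and advantage values below are made up for illustration.

```python
# A minimal sketch of PPO's clipped surrogate objective.
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.8, 1.0, 1.5])        # hypothetical probability ratios
advantage = np.array([1.0, -0.5, 2.0])   # hypothetical advantage estimates
print(ppo_clip_objective(ratio, advantage))
```

The clipping removes the incentive to move the policy far from the old one in a single update, which is what TRPO's explicit KL constraint was enforcing.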

3. Actor-critic RL Algorithm

Actor-Critic is a class of reinforcement learning algorithms that combines aspects of both policy-based (Actor) and value-based (Critic) methods. The Actor is responsible for selecting actions based on the current policy, while the Critic evaluates the selected actions by estimating their values.

I. Actor-Critic: This algorithm combines value-based (Critic) and policy-based (Actor) methods for stability. The Actor proposes actions, and the Critic evaluates them, providing feedback for policy improvement. It has a similar form to the variance-reduced REINFORCE algorithm; the key difference is that the learned value function enters the value term, which can increase the bias of the estimate but accelerate learning by decreasing its variance.
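A minimal one-step sketch of this interaction is below, using a tabular value function as the critic and a softmax policy as the actor. The state/action sizes, learning rates, and the example transition are hypothetical.

```python
# A minimal sketch of a one-step actor-critic update on a single transition.
import numpy as np

n_states, n_actions = 5, 2
V = np.zeros(n_states)                     # critic: state-value estimates
theta = np.zeros((n_states, n_actions))    # actor: per-state action preferences
alpha_v, alpha_pi, gamma = 0.1, 0.05, 0.9

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done):
    # Critic: the TD error measures how much better or worse the outcome was than expected.
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha_v * td_error
    # Actor: push the policy toward actions the critic scored above expectation.
    probs = softmax(theta[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * td_error * grad_log_pi

actor_critic_step(s=1, a=0, r=-0.1, s_next=2, done=False)
print(V[1], theta[1])
```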

II. Asynchronous Advantage Actor-Critic (A3C): A3C parallelizes RL training for efficiency and faster learning. Multiple agents interact asynchronously with their own copies of the environment and periodically share their updates with a global model.

III. Soft Actor-Critic (SAC): SAC is a state-of-the-art actor-critic algorithm designed for training agents in continuous action spaces; it augments the reward with an entropy term so that the policy stays stochastic and keeps exploring.

IV. Deterministic Policy Gradient (DPG): DPG extends the idea of policy gradient methods to handle deterministic policies. While traditional policy gradient methods work well with stochastic policies, DPG is designed for tasks where the optimal policy is deterministic. This is especially useful in environments with continuous action spaces.

V. Deep Deterministic Policy Gradients (DDPG): DDPG extends deterministic policy gradients to deep neural networks and continuous action spaces. It uses an actor network to output continuous actions and a critic network to estimate the value function.
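The division of labour between the two networks can be sketched with the critic's learning target: the actor proposes a deterministic action for the next state, and the critic evaluates it. The `actor` and `critic` functions below are toy linear stand-ins for neural networks, and all numbers are illustrative.

```python
# A minimal sketch of how DDPG forms the critic's target: r + gamma * Q(s', actor(s')).
import numpy as np

rng = np.random.default_rng(1)
W_actor = rng.normal(size=(3,))          # maps a 3-dim state to a 1-dim continuous action
W_critic = rng.normal(size=(4,))         # maps (state, action) to a scalar value

def actor(state):
    return np.tanh(W_actor @ state)              # deterministic, bounded action

def critic(state, action):
    return W_critic @ np.append(state, action)   # Q(s, a) estimate

gamma = 0.99
state, reward, next_state, done = rng.normal(size=3), 0.5, rng.normal(size=3), False

target = reward + (0.0 if done else gamma * critic(next_state, actor(next_state)))
print(target)   # the critic is then regressed toward this target; the actor follows the critic's gradient
```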

Reference: Yuanjiang Cao, Quan Z. Sheng, Julian McAuley, and Lina Yao. 2023. Reinforcement Learning for Generative AI: A Survey. (August 2023).
