Reinforcement Learning: Comparison of Policy Gradient Approaches to Continuous Control Tasks

Abstract

This project investigated the performance of several policy gradient algorithms in environments with continuous action spaces. Specifically, Soft Actor-Critic (SAC), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Deep Deterministic Policy Gradient (DDPG) were evaluated and compared on three OpenAI Gym tasks: Pendulum, InvertedPendulum, and Hopper. TD3 and DDPG were tested using existing implementations [2], which provided a framework on which we implemented SAC. SAC converged in the fewest timesteps, showing a moderate advantage over TD3 and a significant advantage over DDPG.

Introduction

Many real-world control problems require continuous input. This is particularly true in the field of robotics. Accordingly, solving these problems using reinforcement learning (RL) requires algorithms that support continuous action spaces. This poses some challenges for purely value-based approaches; it is generally more effective for an agent to learn a policy directly [1]. Actor-critic algorithms learn both value and policy functions, and are commonly applied to continuous control tasks. The type of policy function learned depends on the specific algorithm. Some approaches, such as TD3 or DDPG, learn a deterministic policy where the action is fully defined by the state. Other approaches, such as SAC, learn a stochastic policy, where the action is drawn from a distribution over possible actions. Among other characteristics, this project aims to investigate the empirical differences between these types of policy.

In contrast with discrete environments, where linear function approximation is often sufficient, continuous environments often require a more complex mapping between states, values, and actions, which calls for multi-layer perceptron (MLP) function approximation. Some actor-critic algorithms use separate MLPs for the actor (policy function) and critic (value function), while others use a shared MLP with different output layers.

This project explores three distinct actor-critic approaches (SAC, DDPG, and TD3) on continuous control tasks of varying complexity: Pendulum-v1, InvertedPendulum-v4, and Hopper-v4. These tasks provide different challenges for the agents and are common benchmarks for RL algorithm evaluation.

Algorithms

Soft actor-critic (SAC)

Like all policy gradient algorithms, our implementation of SAC aims to learn the policy that maximises the expected future reward. To do so, it uses one neural network that represents the policy and two other networks that serve as Q value function approximators. Since this agent is designed to operate in continuous action spaces, the policy network outputs a mean and standard deviation that define a Normal distribution from which an action is sampled. The algorithm also makes use of a replay buffer and two target Q networks, which are used to update the Q networks themselves. Initially, the agent collects transition data according to a randomly initialized policy for a set number of timesteps. Once this period is over, the agent is trained at every timestep. During training, data is sampled from the replay buffer and the target Q value is computed using the target networks according to the following equation [4]:

$$
y(r,s',d) = r + \gamma (1-d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s') \right), \quad \tilde{a}' \sim \pi_{\theta}(\cdot|s')
$$

This resembles the temporal difference update, with two changes: we introduce an entropy regularization term, and we use only the smaller of the two target-network estimates to compute the target value. We chose to keep the coefficient of the entropy term, $\alpha$, constant in our implementation. The tilde above the action indicates that it is re-sampled from the current policy rather than taken from the replay buffer. The Q networks are then updated by applying the gradient of the mean squared error:

$$\nabla_{\phi_{i}} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \left( Q_{\phi_{i}}(s,a) - y(r,s',d) \right)^2$$
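The target and the critic loss above can be sketched in plain Python for scalar inputs. This is a minimal illustration and not our training code: the function names `sac_target` and `critic_loss` are hypothetical, and the target Q values and the log-probability are passed in as plain numbers instead of being produced by networks. The default values of `gamma` and `alpha` are taken from the hyperparameter table.

```python
def sac_target(r, d, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.001):
    """Entropy-regularised target y(r, s', d) for a single transition.

    q1_targ, q2_targ: target-network estimates of Q(s', a~')
    logp_next: log pi(a~'|s') for an action freshly sampled at s'
    (all three are assumed to be computed elsewhere).
    """
    # Smaller of the two target estimates, minus the entropy term.
    soft_value = min(q1_targ, q2_targ) - alpha * logp_next
    # (1 - d) removes the bootstrap term on terminal transitions.
    return r + gamma * (1.0 - d) * soft_value


def critic_loss(q_values, targets):
    """Mean squared error between Q(s, a) and the targets over a batch."""
    squared_errors = [(q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative numbers only, not measured values.
y = sac_target(r=1.0, d=0.0, q1_targ=2.0, q2_targ=3.0, logp_next=-1.0)
print(y, critic_loss([1.5, 2.5], [y, y]))
```

In a full implementation the same arithmetic would be applied to batched tensors, and the loss would be differentiated with respect to each critic's weights.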

Subsequently, the updated Q networks are used to update the policy itself:

$$\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_{i}}(s, \tilde{a}_{\theta}(s)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s)|s) \right)$$

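with $\theta$ being the weights of the policy. Finally, the target networks are updated using the soft update rule, which moves each target network's weights a small step (set by $\rho$) toward the weights of the corresponding Q network.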
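A common form of the soft update is $\phi_{\text{targ},i} \leftarrow (1-\rho)\,\phi_{\text{targ},i} + \rho\,\phi_{i}$. We assume here that $\rho$ weights the online parameters, which matches the small value of 0.005 in the hyperparameter table; some references, such as [4], define the coefficient the other way round. The sketch below applies this rule to flat lists of numbers, and the function name `soft_update` is hypothetical.

```python
def soft_update(target_params, online_params, rho=0.005):
    """Move each target parameter a fraction rho toward its online counterpart.

    Assumes rho weights the online parameters (rho close to 0 means the
    target changes slowly).
    """
    return [(1.0 - rho) * t + rho * p
            for t, p in zip(target_params, online_params)]


# Illustrative values: with the default rho the target barely moves.
print(soft_update([0.0, 1.0], [1.0, 1.0]))
```

With `rho=1.0` the target would simply copy the online network, which removes the stabilising effect of a slowly changing target.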

Deep deterministic policy gradient (DDPG)

The most notable difference between DDPG and SAC is that in DDPG the policy network outputs the action directly instead of a mean and standard deviation. Exploration is therefore achieved by adding normally distributed noise to the action. DDPG also uses only one Q network (and one target Q network) instead of two, and adds a target policy network. Since the policy is deterministic, the entropy term no longer appears in the Q-function target or in the policy objective. The policy is therefore updated using the gradient of the Q value:

$$\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} Q_{\phi}(s,a_{\theta}(s))$$

Like the target Q network, the target policy network is updated using the soft update rule.
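The exploration scheme described above can be sketched as follows. This is an illustration under assumptions: the function name `noisy_action` is hypothetical, the action bounds of -1 and 1 are placeholders, and clipping the noisy action back into those bounds is a common practice that we assume here and do not take from the description above. The default noise standard deviation of 0.1 comes from the hyperparameter table.

```python
import random


def noisy_action(policy_action, noise_std=0.1, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to a deterministic action, then clip.

    policy_action: list of floats output by the deterministic policy.
    low, high: assumed action bounds (placeholders for this sketch).
    """
    noisy = [a + rng.gauss(0.0, noise_std) for a in policy_action]
    # Keep the perturbed action inside the valid range.
    return [max(low, min(high, a)) for a in noisy]


# A fixed seed makes the sketch reproducible.
print(noisy_action([0.2, -0.9], rng=random.Random(0)))
```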

Twin delayed deep deterministic policy gradient (TD3)

The final algorithm included in our study is TD3. This algorithm is very similar to DDPG, with a few changes. First, it uses two Q networks and, like SAC, uses only the smaller Q value (from the two target networks) to compute the target Q value. In addition, the policy update is delayed relative to the Q networks, which are updated every timestep. Finally, the target networks are all updated using the soft update rule.
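The two changes described above can be sketched for scalar inputs. Both function names are hypothetical, the target Q values are assumed to be computed elsewhere, and the sketch covers only the minimum over the two target estimates and the delayed policy schedule. The default policy update frequency of 2 comes from the hyperparameter table.

```python
def td3_target(r, d, q1_targ, q2_targ, gamma=0.99):
    """Target Q value using the smaller of the two target-network estimates."""
    return r + gamma * (1.0 - d) * min(q1_targ, q2_targ)


def policy_update_steps(total_steps, policy_freq=2):
    """Timesteps (counted from 1) at which the delayed policy update runs.

    The critics are assumed to be updated at every timestep.
    """
    return [t for t in range(1, total_steps + 1) if t % policy_freq == 0]


# Illustrative numbers only.
print(td3_target(r=1.0, d=0.0, q1_targ=2.0, q2_targ=3.0))
print(policy_update_steps(6))
```

Compared with the SAC target, the only difference in this sketch is the missing entropy term, which follows from the policy being deterministic.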

Methodology

The aim of this project is to compare different policy gradient algorithms on continuous control tasks. To this end, several different algorithms were explored. In addition to those described in the previous section, implementations of both Vanilla Policy Gradient and Temporal Difference Actor-Critic were attempted. Despite consuming a significant amount of time and effort, these approaches were not successful in any of the tested environments. Regardless, these implementations are included in the project submission.

TD3, DDPG, and SAC were chosen due to their previous success in continuous control tasks [2,3]. The implementations of both TD3 and DDPG were retrieved from [2]. Using these implementations as a framework, SAC was implemented with inspiration from [3,4,5].

Hyperparameters

With regard to hyperparameter selection, the TD3 and DDPG implementations came with a tuned setup preloaded. Since available computing power was limited and these agents take a long time to train on certain environments, we decided not to tune the hyperparameters for our implementation of SAC and instead reused most of the hyperparameters from TD3 and DDPG. The hyperparameters used by the three agents are listed below:

Algorithm	Learning Rate	ρ (soft update)	γ (discount)	Batch Size	Noise (std. dev.)	Noise (clip)	π update frequency	α (entropy coefficient)
TD3	0.0003	0.005	0.99	256	0.1	0.5	2	N/A
DDPG	0.0003	0.005	0.99	256	0.1	0.5	N/A	N/A
SAC	0.0003	0.005	0.99	256	N/A	N/A	N/A	0.001

Note that the learning rates are the same for the actor and the critic(s) across all algorithms.

Test parameters

To reduce variance, training was performed using multiple different seeds for the environment and network initialization. Five seeds were used for Pendulum and InvertedPendulum, while Hopper used only two to minimize its training time.

For each environment, all algorithms were trained for a fixed number of timesteps: Pendulum used 30,000, InvertedPendulum used 75,000, and Hopper used 200,000. It should be noted that previous research using Hopper has trained for ~1,000,000 timesteps [3]. Training for a fixed number of timesteps results in varying numbers of episodes between algorithms and seeds, so we opted to compare the algorithms on a timestep basis.
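Comparing on a timestep basis means that the return curves from different seeds can be aligned and summarised point by point. A minimal sketch of such an aggregation is given below. The function name `aggregate_over_seeds` is hypothetical, the use of the population standard deviation is an assumption, and the numbers in the usage example are placeholders, not results from our experiments.

```python
from statistics import mean, pstdev


def aggregate_over_seeds(returns_by_seed):
    """Mean and standard deviation across seeds at each evaluation point.

    returns_by_seed: one list of returns per seed, all recorded at the
    same timesteps so that the lists have equal length.
    """
    per_timestep = list(zip(*returns_by_seed))
    means = [mean(values) for values in per_timestep]
    stds = [pstdev(values) for values in per_timestep]
    return means, stds


# Placeholder curves for two seeds (not experimental data).
means, stds = aggregate_over_seeds([[1.0, 2.0, 4.0], [3.0, 2.0, 6.0]])
print(means, stds)
```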

At the start of each training session, the agent is set to act randomly for 10,000 timesteps, during which the transition data is stored in the replay buffer. This serves as an initial exploration period for the agents.
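The switch between the random exploration period and the learned policy can be sketched as follows. The function name `select_action`, the `policy_fn` callable, and the action bounds are hypothetical, and drawing random actions uniformly from the action bounds is an assumption about what acting randomly means here.

```python
import random


def select_action(t, policy_fn, state, action_dim,
                  start_timesteps=10_000, low=-1.0, high=1.0, rng=random):
    """Return a random action during the initial period, else the policy's action.

    t: current timestep, counted from 0.
    policy_fn: callable mapping a state to a list of floats (assumed).
    """
    if t < start_timesteps:
        # Initial exploration: sample uniformly within the assumed bounds.
        return [rng.uniform(low, high) for _ in range(action_dim)]
    return policy_fn(state)


# Placeholder policy that always outputs zeros.
zero_policy = lambda state: [0.0, 0.0]
print(select_action(0, zero_policy, None, action_dim=2, rng=random.Random(1)))
print(select_action(10_000, zero_policy, None, action_dim=2))
```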

Results and discussion

The figures below show the performance of the algorithms on the Pendulum, InvertedPendulum, and Hopper environments, respectively.

Return vs timestep for Pendulum-v1

Return vs timestep for InvertedPendulum-v4

Return vs timestep for Hopper-v4

The relative ranking of the algorithms is consistent across all environments. In Pendulum, all algorithms ultimately converge to a similar near-optimal performance. In InvertedPendulum, DDPG fails to converge to a stable performance. In Hopper, none of the algorithms converged to a near-optimal policy.

SAC and TD3 consistently show similar performance, particularly in Pendulum, where SAC reaches a near-optimal policy slightly sooner. In InvertedPendulum, SAC initially learns quickly but seems to lose effectiveness at approximately 40,000 timesteps before converging to the optimal policy. TD3 follows a more predictable learning curve, but takes slightly longer to converge. Overall, the disparity in performance between TD3 and SAC is found to be less extreme than in previous studies [3]. Both approaches demonstrate similar variance overall. This is interesting, considering the policy differences between the two: TD3 uses a deterministic policy with fixed exploration noise, so one might expect more variance at later timesteps compared to SAC, where the standard deviation of the action distribution (which defines the exploration) is learned.

In all tasks, DDPG learned the slowest. This is consistent with the literature [2,3] and expected considering its design: it employs only one critic network and lacks target policy smoothing regularization, so it can be thought of as a stripped-down version of TD3.

The learning characteristics of each algorithm are hardest to evaluate for the Hopper environment. In all cases, 200,000 timesteps was not enough time to learn a near-optimal policy; perhaps the comparison would be different after training for longer. Here, SAC and TD3 again show similar performance. DDPG shows high variance during learning with a heavily fluctuating average return.

Despite its fast learning ability, SAC took the longest to train in wall-clock time. This is likely due to a number of factors. Like TD3, it has two critic networks, but unlike TD3, it updates its policy at every timestep rather than on a delay. Furthermore, SAC’s policy network has two output layers (to define its action distribution), thus providing more weights to train. Finally, our specific implementation of SAC may not be as efficient as it could be.

Conclusion

SAC, TD3, and DDPG were tested on three benchmark continuous control tasks using similar hyperparameters. SAC learned the fastest, with TD3 following closely behind. DDPG showed inferior performance on all tasks. SAC's advantage appears largest in the Hopper environment, but this result is inconclusive due to high variance and insufficient training time. Overall, these results are similar to those in past literature. This project provided valuable experience in applying RL to simple continuous control problems.

Bibliography

[1] Doina Precup. “COMP 579 Lecture Slides”. 2023. https://www.cs.mcgill.ca/~dprecup/courses/Winter2023/lectures.html

[2] Fujimoto, Scott, Herke van Hoof, and David Meger. “Addressing function approximation error in actor-critic methods.” International conference on machine learning. PMLR, 2018.

[3] Haarnoja, Tuomas, et al. “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.” International conference on machine learning. PMLR, 2018.

[4] OpenAI. “Spinning Up in Deep RL.” Spinning Up, OpenAI, 2018-2021, https://spinningup.openai.com/en/latest/index.html.

[5] Rho, Seungeun. “MinimalRL.” GitHub, 2018, https://github.com/seungeunrho/minimalRL