Application of Deep Reinforcement Learning algorithms to train autonomous driving agents in complex simulated environments using the MetaDrive simulator.
github.com/pompos02/MetaDriveRLagent

This project explores the use of Deep Reinforcement Learning (DRL) techniques to develop intelligent autonomous driving agents capable of navigating various road geometries and traffic conditions. The implementation compares different training approaches, from Stable Baselines3 to custom PyTorch implementations, and investigates both sensor-based (Lidar) and vision-based (RGB camera) observations.
MetaDrive is an advanced open-source simulation environment specifically designed for developing and evaluating autonomous driving models. Built on technologies like Panda3D and OpenAI Gym, it enables the creation of realistic and customizable driving environments.
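As context for the experiments below, a minimal interaction loop with MetaDrive might look like the following sketch (assuming a recent release with the Gymnasium-style step API; the config keys shown are illustrative):

```python
from metadrive import MetaDriveEnv

# Minimal sketch: create a MetaDrive environment and drive it with random actions.
# The config values are illustrative; "S" requests a single straight block and
# traffic_density=0.0 keeps the road empty.
env = MetaDriveEnv(dict(map="S", traffic_density=0.0, use_render=False))

obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # random [steering, throttle] pair
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```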
The agent's training was initially conducted in a simple geometry environment—specifically an empty straight road—to familiarize it with basic driving principles. A custom environment class was created that inherits from MetaDrive's default environment with specialized parameters.
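The exact parameters are not reproduced here, but a hypothetical version of such a subclass, assuming MetaDrive's default_config()/update pattern and its map-string notation ("S" for a straight block), could look like this:

```python
from metadrive import MetaDriveEnv

class StraightRoadEnv(MetaDriveEnv):
    """Hypothetical warm-up environment: an empty straight road."""

    @classmethod
    def default_config(cls):
        config = super().default_config()
        # All values here are assumptions for illustration: a single straight
        # block, no surrounding traffic, and a capped episode length.
        config.update(
            dict(
                map="S",
                traffic_density=0.0,
                horizon=1000,
            )
        )
        return config

env = StraightRoadEnv()
```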
Agent at 10,000 steps - Learning basic movement
Agent at 2,000,000 steps - Exhibiting "slalom" behavior
Initial results showed the agent learned to move toward the goal but exhibited non-smooth behavior, performing "slalom" type maneuvers.
To improve driving stability, the reward function was redesigned to include checking the vehicle's position relative to the reference lane and calculating the distance traveled between consecutive steps.
R_speed = SpeedReward × (vehicle.speed / vehicle.max_speed)
R_driving = d × StepDistance × positiveRoad
Where d is a reward weight coefficient and positiveRoad indicates whether the vehicle is moving on the correct side of the road.
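A sketch of how these two terms could be wired into MetaDrive's overridable per-step reward hook is shown below; the weight values, the agent lookup, and the lane-coordinate calls are assumptions that may need adapting to the installed MetaDrive version:

```python
from metadrive import MetaDriveEnv

class SmoothDrivingEnv(MetaDriveEnv):
    """Sketch of the redesigned reward. The speed/max_speed names follow the
    formula above; the other attribute names are assumptions to verify against
    the installed MetaDrive version."""

    SPEED_REWARD = 0.1     # weight of R_speed (assumed value)
    DRIVING_REWARD = 1.0   # weight d of R_driving (assumed value)

    def reward_function(self, vehicle_id: str):
        vehicle = self.agents[vehicle_id]  # older MetaDrive versions use self.vehicles

        # Distance travelled along the reference lane since the previous step.
        lane = vehicle.lane
        long_last, _ = lane.local_coordinates(vehicle.last_position)
        long_now, _ = lane.local_coordinates(vehicle.position)
        step_distance = long_now - long_last

        # positive_road = 1 when the vehicle drives on the correct side of the road.
        positive_road = 1.0 if lane in vehicle.navigation.current_ref_lanes else 0.0

        r_speed = self.SPEED_REWARD * (vehicle.speed / vehicle.max_speed)
        r_driving = self.DRIVING_REWARD * step_distance * positive_road

        reward = r_speed + r_driving
        return reward, dict(step_reward=reward)
```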
Updated agent after 3,000,000 steps - Smoother driving behavior achieved
The road geometry was changed by adding two turns. Training revealed an interesting challenge: during training, the average reward was approximately 700 (indicating route completion), but in deterministic evaluation the car repeatedly crashed at the first turn.
The standard deviation (stochasticity) of the policy was reduced during training to minimize the training-evaluation gap, so that the model behaves more similarly in both phases.
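With Stable Baselines3's PPO, one way to do this is through the policy's initial log standard deviation, and the gap can then be checked by evaluating with deterministic actions; the values below are illustrative:

```python
from stable_baselines3 import PPO

# A smaller initial log standard deviation shrinks the Gaussian policy's
# exploration noise, narrowing the gap between stochastic training and
# deterministic evaluation. env is the MetaDrive environment from above.
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(log_std_init=-1.0),  # illustrative; SB3's default is 0.0
    verbose=1,
)
model.learn(total_timesteps=3_000_000)

# Deterministic evaluation: the policy mean is used instead of sampling.
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```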
Agent successfully navigating turns after adjustments
Training metrics showing improvement over time
To gain a better understanding of how the PPO agent works "under the hood," the learning logic was rewritten exclusively in PyTorch. This gave finer control over the training process and a more low-level view of the algorithm.
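The core of such a rewrite is the clipped surrogate objective; a condensed sketch of the per-minibatch loss (hyperparameter values are illustrative) is:

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_prob, old_log_prob, advantages, values, returns, entropy,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped PPO objective for one minibatch; all arguments are 1-D tensors."""
    ratio = torch.exp(new_log_prob - old_log_prob)            # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()             # clipped surrogate
    value_loss = F.mse_loss(values, returns)                  # critic regression
    entropy_bonus = entropy.mean()                            # encourages exploration
    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```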
Results were considerably worse than with the Stable Baselines3 implementation, likely due to unstable training. This was evident from significant fluctuations in all training curves (even with smoothing), which was unexpected since the environment is relatively deterministic.
Custom PyTorch implementation after 3,000,000 steps - Less stable than Stable Baselines3
Despite extensive experimentation with different hyperparameters and architectures, the custom implementation could not match Stable Baselines3's stability and performance, underlining the importance of implementation details in DRL.
An agent was trained to navigate a complex multi-turn track, initially without traffic, then with traffic added for realism. Due to the complexity, a comprehensive multi-component reward function was designed.
Complex multi-turn track layout used for advanced training
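The individual components are not reproduced in this write-up; a hypothetical weighted-sum structure, consistent with the R₁ (speed control) and R₃ (lane-center deviation) terms mentioned below, might look like this, with all names and weights being assumptions:

```python
def multi_component_reward(speed, target_speed, lateral_offset, lane_width,
                           step_distance, crashed,
                           w1=0.5, w2=1.0, w3=0.3, w4=5.0):
    """Hypothetical multi-component reward; component names and weights are assumptions.

    R1: speed control  - penalizes deviation from the target speed
    R2: progress       - rewards distance travelled along the route this step
    R3: lane centering - penalizes lateral deviation from the lane center
    R4: crash penalty  - large negative term on collision
    """
    r1 = -abs(speed - target_speed) / max(target_speed, 1e-6)
    r2 = step_distance
    r3 = -abs(lateral_offset) / (lane_width / 2.0)
    r4 = -1.0 if crashed else 0.0
    return w1 * r1 + w2 * r2 + w3 * r3 + w4 * r4
```

Under this framing, the conservative and aggressive configurations described next correspond roughly to keeping all weights active versus zeroing w1 and w3.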
With all reward components active, the agent kept to the first lane and drove at 40 km/h.
Conservative approach - maintaining lane at 40 km/h
Training metrics for conservative approach
By removing R₁ (speed control) and R₃ (lane center deviation), the agent moved more freely and completed the track much faster.
Aggressive approach - faster but riskier driving
Lower success rate but faster completion times
Adding traffic to the simulation dramatically reduced training speed (from 1000 it/s to 60 it/s). The pre-trained model was retrained with traffic for 5,000,000 steps with v_target = 75 km/h, keeping all reward components except R₃ (lane-center deviation) so that overtaking would remain possible.
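With Stable Baselines3, resuming training of a saved model in the traffic-enabled environment can be sketched as follows (the checkpoint name, map string, and traffic density are placeholders):

```python
from metadrive import MetaDriveEnv
from stable_baselines3 import PPO

# Re-create the multi-turn track, now with surrounding traffic. The map string
# and traffic_density are illustrative placeholders for the actual configuration.
traffic_env = MetaDriveEnv(dict(map="CCCCC", traffic_density=0.1))

# Hypothetical checkpoint of the model trained without traffic.
model = PPO.load("ppo_multiturn_no_traffic", env=traffic_env)

# Continue training for 5M steps without resetting the internal timestep counter.
model.learn(total_timesteps=5_000_000, reset_num_timesteps=False)
model.save("ppo_multiturn_with_traffic")
```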
Training metrics showing upward trend in success rate with traffic
Note: The success rate showed an upward trend during training, suggesting that continuing training would yield better results. Unfortunately, due to lack of resources and time, experiments in this environment were not continued.
In this section, the observation was changed to an RGB camera (combined with the vehicle state vector), which returns consecutive image frames. The implementation used PyTorch with a custom CNN-based architecture.
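The exact architecture is not given here; a small PyTorch encoder of the kind typically used for stacked RGB frames combined with a low-dimensional state vector might be sketched as:

```python
import torch
import torch.nn as nn

class CameraStateEncoder(nn.Module):
    """Hypothetical encoder: a small CNN over stacked RGB frames, concatenated with
    the vehicle-state vector. Frame count, resolution, and layer sizes are assumptions."""

    def __init__(self, n_frames=4, img_size=84, state_dim=19, feature_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened CNN output size from a dummy forward pass.
        with torch.no_grad():
            cnn_out = self.cnn(torch.zeros(1, 3 * n_frames, img_size, img_size)).shape[1]
        self.head = nn.Sequential(nn.Linear(cnn_out + state_dim, feature_dim), nn.ReLU())

    def forward(self, frames, state):
        # frames: (B, 3*n_frames, H, W) stacked RGB images; state: (B, state_dim)
        return self.head(torch.cat([self.cnn(frames), state], dim=1))
```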
Due to time constraints, training was repeated on a simpler track with just 3 turns, showing the camera-based approach's potential but also its computational demands.
Final results showing RGB camera input and agent behavior on simplified 3-turn track
Stable Baselines3 provided significantly more stable training than the custom PyTorch implementation, with clear differences in convergence and overall performance.
Critical importance of reward function design - incorporating smoothness and progress dramatically improved behavior. Different reward configurations led to conservative vs. aggressive driving styles.
Stochastic policy during training led to different speeds compared to deterministic evaluation, requiring careful tuning of policy standard deviation.
Adding traffic and complex tracks dramatically increased computational requirements (1000 it/s → 60 it/s).
RGB camera input proved much more demanding than Lidar-based observations, both computationally and in terms of convergence speed.
The research revealed that the Stable Baselines3 implementation offers significantly more stable training than custom PyTorch implementations, with clear differences in convergence and overall performance. The design of the reward function played a critical role: incorporating terms for driving smoothness and route progress significantly improved vehicle behavior.
A significant difference was also observed between the training and evaluation phases, with the vehicle reaching higher speeds during evaluation due to the deterministic nature of its decisions. Adding traffic and track complexity dramatically increased computational requirements, while using an RGB camera as input proved significantly more demanding both computationally and in terms of convergence.
The work demonstrates the need for careful balance between environment complexity, reward function design, and computational resources for successful training of autonomous vehicles using deep reinforcement learning techniques.