End-to-end Vision-aware Vehicle Decision Making via Imitation Learning
Autonomous vehicles that understand road agents: Detection, tracking, and behavior prediction
Problem Statement
Given raw sensor data (camera, LiDAR, HD map) from the ego vehicle over the last 10 frames, predict the driver's behavior n frames into the future. The behavior at each future frame is one of:
Left turn
Right turn
Left lane change
Right lane change
Straight
[Figure: a sample sequence and its output labels with n = 5.]
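To make the task framing concrete, here is a minimal sketch of the label vocabulary and the expected input/output structure; the integer class indices and the example n = 5 target sequence are hypothetical assumptions, not values from the dataset:

```python
# Hypothetical sketch of the task framing; class indices and shapes are assumptions.
BEHAVIORS = ["left_turn", "right_turn", "left_lane_change", "right_lane_change", "straight"]
BEHAVIOR_TO_ID = {name: i for i, name in enumerate(BEHAVIORS)}

# Input:  raw sensor tensors (RGB, LiDAR BEV, HD map) for the last T = 10 frames.
# Output: one class index per future frame, e.g. a hypothetical n = 5 target:
example_target = [BEHAVIOR_TO_ID["straight"]] * 3 + [BEHAVIOR_TO_ID["left_lane_change"]] * 2
# -> [4, 4, 4, 2, 2]: the driver keeps straight, then starts a left lane change.
```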
Dataset
Real-world driving data collected by Argo AI's self-driving test vehicles in Miami and Pittsburgh.
The data covers different seasons, weather conditions, and times of day to provide a broad range of real-world driving scenarios.
Sensor data (a rate-alignment sketch follows this list):
RGB video frames (1920 x 1200 x 3) at 30 Hz
LiDAR point cloud at 10 Hz
High Definition Map (HD map) with drivable area and lane polygons
Vehicle position and pose information from GPS-based and sensor-based localization
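Because the cameras run at 30 Hz and the LiDAR at 10 Hz, each camera frame has to be paired with the nearest LiDAR sweep by timestamp. A minimal sketch of that matching, assuming sorted nanosecond timestamps; the helper name is hypothetical, not part of any dataset API:

```python
import numpy as np

def nearest_lidar_index(camera_ts_ns, lidar_ts_ns):
    """For each 30 Hz camera timestamp, find the index of the closest 10 Hz LiDAR sweep.

    Both arrays hold nanosecond timestamps; lidar_ts_ns must be sorted.
    """
    camera_ts_ns = np.asarray(camera_ts_ns)
    lidar_ts_ns = np.asarray(lidar_ts_ns)
    # searchsorted gives the insertion point; compare with the neighbor on the left
    # to pick whichever sweep is closer in time.
    idx = np.searchsorted(lidar_ts_ns, camera_ts_ns)
    idx = np.clip(idx, 1, len(lidar_ts_ns) - 1)
    left_closer = (camera_ts_ns - lidar_ts_ns[idx - 1]) < (lidar_ts_ns[idx] - camera_ts_ns)
    return np.where(left_closer, idx - 1, idx)
```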
Ground Truth Label Generation & Data Pre-processing
Ground Truth
High-quality ground truth labels are an essential but often overlooked factor in learning-based problems. To generate ground truth action labels, I tried the following two methods:
A heuristic algorithm that generates labels from the GPS rotation and position vectors
Manually labeling every frame
The labels generated by the heuristic algorithm turned out to be very noisy, so I decided to label every frame manually.
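For reference, a sketch of the kind of heuristic rule this refers to: classify each frame from the heading change (derived from the GPS rotation vector) and the lateral displacement (from the position vector) over a short look-ahead window. The window length, thresholds, and class indices are illustrative assumptions; rules like this are exactly what produced the noisy labels.

```python
import numpy as np

# Hypothetical heuristic labeler: thresholds, window size, and the class mapping
# are illustrative assumptions, not the final labeling rules.
LEFT_TURN, RIGHT_TURN, LEFT_LC, RIGHT_LC, STRAIGHT = range(5)

def heuristic_label(yaw, xy, t, window=30, turn_thresh=np.deg2rad(25), lc_thresh=1.5):
    """Classify the behavior at frame t from heading change and lateral offset.

    yaw: per-frame ego heading in radians; xy: per-frame ego position in meters.
    A large heading change over the window is treated as a turn; otherwise a
    large lateral displacement (in the ego frame at time t) is a lane change.
    """
    t_end = min(t + window, len(yaw) - 1)
    # Wrap the heading difference to [-pi, pi].
    dyaw = np.arctan2(np.sin(yaw[t_end] - yaw[t]), np.cos(yaw[t_end] - yaw[t]))
    if dyaw > turn_thresh:
        return LEFT_TURN            # counter-clockwise heading change
    if dyaw < -turn_thresh:
        return RIGHT_TURN
    # Lateral displacement of the future position, expressed in the ego frame at t.
    disp = xy[t_end] - xy[t]
    lateral = -np.sin(yaw[t]) * disp[0] + np.cos(yaw[t]) * disp[1]
    if lateral > lc_thresh:
        return LEFT_LC
    if lateral < -lc_thresh:
        return RIGHT_LC
    return STRAIGHT
```

GPS yaw jitter and curved roads both trip up fixed thresholds like these, which is why the heuristic labels came out noisy.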
Pre-processing
Resize RGB video frames
LiDAR point cloud coordinate transform
Generate LiDAR Bird's Eye View (BEV) images (a rasterization sketch follows this list)
Remove ground LiDAR points
Parse HD map as a three-channel image, following Uber's approach
Align map orientation with ego vehicle heading direction at each time step
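A minimal sketch of the BEV rasterization and ground-removal steps above, assuming an ego-frame point cloud; the grid extent, resolution, and the simple z-threshold ground filter are illustrative choices, not the project's exact parameters:

```python
import numpy as np

def lidar_to_bev(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                 resolution=0.2, ground_z=0.3):
    """Rasterize an ego-frame LiDAR point cloud into a single-channel BEV occupancy image.

    points: (N, 3) array of x, y, z in meters. Points below `ground_z` are
    dropped as a crude ground filter; all grid parameters are illustrative.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (z > ground_z) & \
           (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y = x[keep], y[keep]

    width = int((x_range[1] - x_range[0]) / resolution)
    height = int((y_range[1] - y_range[0]) / resolution)
    cols = ((x - x_range[0]) / resolution).astype(int)
    rows = ((y - y_range[0]) / resolution).astype(int)

    bev = np.zeros((height, width), dtype=np.float32)
    bev[rows, cols] = 1.0  # occupancy; could also encode max height or point density per cell
    return bev
```

In practice a multi-channel BEV (for example max height, intensity, and point density per cell) carries more information than a binary occupancy grid.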
Models
Baseline Model: FaF
Adapt the "late fusion" variant of Uber's Fast and Furious (FaF) model to our problem setup.
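As I understand the paper, "late fusion" aggregates the temporal dimension gradually with 3D convolutions instead of collapsing all past frames in the first layer. A rough sketch of that idea, adapted to a 10-frame input; channel counts, kernel sizes, and the final temporal pooling are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LateTemporalFusion(nn.Module):
    """Gradually collapses the temporal dimension of a stacked BEV input with 3D convs.

    Input:  (batch, channels, time, height, width), e.g. 10 past BEV frames.
    Output: (batch, channels', height, width) once the time axis is pooled away.
    """

    def __init__(self, in_ch=3, mid_ch=32):
        super().__init__()
        self.temporal = nn.Sequential(
            # No temporal padding, so each layer shrinks the time axis by 2.
            nn.Conv3d(in_ch, mid_ch, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        feats = self.temporal(x)               # time axis: T -> T - 4
        feats = feats.mean(dim=2)              # pool whatever temporal extent remains
        return feats                           # (B, mid_ch, H, W), fed to a 2D backbone

# Example: 10 stacked frames of a 3-channel raster at 200 x 200.
# x = torch.randn(2, 3, 10, 200, 200)
# print(LateTemporalFusion()(x).shape)  # torch.Size([2, 32, 200, 200])
```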
Our Model: Fusion Seq2seq
A Seq2seq model with attention that takes the raw sensor data from the last 10 time steps and predicts the vehicle action labels for the next n frames (n = 1, 5, 10, 20, 30). At each time step, the RGB video frame, LiDAR BEV image, and HD map are concatenated along the channel dimension.
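A minimal sketch of how such a model could be wired up: each time step's channel-concatenated tensor goes through a shared CNN, a GRU encodes the 10 per-frame features, and a GRU decoder with dot-product attention over the encoder states emits one action distribution per future frame. All layer sizes, the GRU/attention choices, and names are assumptions, not the project's exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 5  # left turn, right turn, left lane change, right lane change, straight

class FrameEncoder(nn.Module):
    """Encodes one fused frame (RGB + LiDAR BEV + HD map stacked on channels)."""
    def __init__(self, in_ch, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):                       # (B, C, H, W)
        return self.fc(self.cnn(x).flatten(1))  # (B, feat_dim)

class FusionSeq2seq(nn.Module):
    """GRU encoder-decoder with dot-product attention over the 10 encoder states."""
    def __init__(self, in_ch, feat_dim=256, hidden=256, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.frame_enc = FrameEncoder(in_ch, feat_dim)
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(NUM_CLASSES, hidden)
        self.out = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, frames):                  # (B, T, C, H, W), T = 10 past steps
        B, T = frames.shape[:2]
        feats = self.frame_enc(frames.flatten(0, 1)).view(B, T, -1)
        enc_out, h = self.encoder(feats)        # enc_out: (B, T, hidden)
        h = h.squeeze(0)                        # decoder state: (B, hidden)
        prev = torch.zeros(B, NUM_CLASSES, device=frames.device)
        logits = []
        for _ in range(self.horizon):
            h = self.decoder(prev, h)
            attn = F.softmax(torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(attn.unsqueeze(1), enc_out).squeeze(1)
            step_logits = self.out(torch.cat([h, context], dim=1))
            logits.append(step_logits)
            prev = F.softmax(step_logits, dim=1)  # feed predicted distribution back in
        return torch.stack(logits, dim=1)       # (B, horizon, NUM_CLASSES)
```

During training, teacher forcing (feeding the ground-truth previous label instead of the predicted distribution) is a common alternative for the decoder input.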
Model Variant: 3-branch Seq2seq
The 3-branch variant of Fusion Seq2seq uses one CNN branch for each type of raw sensor data.
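A sketch of the corresponding per-frame encoder, reusing the hypothetical FrameEncoder from the previous sketch: one CNN branch per modality, with the branch features concatenated before the recurrent encoder. Per-modality channel counts and feature sizes are assumptions:

```python
import torch
import torch.nn as nn

class ThreeBranchFrameEncoder(nn.Module):
    """One CNN branch per modality; features are concatenated per time step.

    Reuses the hypothetical FrameEncoder from the Fusion Seq2seq sketch above;
    channel counts per modality are illustrative.
    """
    def __init__(self, rgb_ch=3, bev_ch=1, map_ch=3, feat_dim=128):
        super().__init__()
        self.rgb_branch = FrameEncoder(rgb_ch, feat_dim)
        self.bev_branch = FrameEncoder(bev_ch, feat_dim)
        self.map_branch = FrameEncoder(map_ch, feat_dim)

    def forward(self, rgb, bev, hd_map):         # each: (B, C_modality, H, W)
        feats = [self.rgb_branch(rgb), self.bev_branch(bev), self.map_branch(hd_map)]
        return torch.cat(feats, dim=1)           # (B, 3 * feat_dim), fed to the same GRU seq2seq
```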