End-to-end Vision-aware Vehicle Decision Making via Imitation Learning
Autonomous vehicles that understand road agents: Detection, tracking, and behavior prediction
Problem Statement
Given raw sensor data (camera, LiDAR, HD map) from the ego vehicle over the last 10 frames, predict the driver's behavior n frames into the future. The behavior at each future frame is one of:
Left turn
Right turn
Left lane change
Right lane change
Straight
[Figure: a sample sequence and its output labels with n = 5.]
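To make the task framing concrete, here is a minimal sketch of the label vocabulary and the expected input/output structure; the integer class indices and the example n = 5 target sequence are hypothetical assumptions, not values from the dataset:

```python
# Hypothetical sketch of the task framing; class indices and shapes are assumptions.
BEHAVIORS = ["left_turn", "right_turn", "left_lane_change", "right_lane_change", "straight"]
BEHAVIOR_TO_ID = {name: i for i, name in enumerate(BEHAVIORS)}

# Input:  raw sensor tensors (RGB, LiDAR BEV, HD map) for the last T = 10 frames.
# Output: one class index per future frame, e.g. a hypothetical n = 5 target:
example_target = [BEHAVIOR_TO_ID["straight"]] * 3 + [BEHAVIOR_TO_ID["left_lane_change"]] * 2
# -> [4, 4, 4, 2, 2]: the driver keeps straight, then starts a left lane change.
```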
Dataset
Real-world driving data collected by Argo AI's self-driving test vehicles in Miami and Pittsburgh.
The data covers different seasons, weather conditions, and times of day to provide a broad range of real-world driving scenarios.
Sensor data (a rate-alignment sketch follows this list):
RGB video frames (1920 x 1200 x 3) at 30 Hz
LiDAR point cloud at 10 Hz
High Definition Map (HD map) with drivable area and lane polygons
Vehicle position and pose information from GPS-based and sensor-based localization
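Because the cameras run at 30 Hz and the LiDAR at 10 Hz, each camera frame has to be paired with the nearest LiDAR sweep by timestamp. A minimal sketch of that matching, assuming sorted nanosecond timestamps; the helper name is hypothetical, not part of any dataset API:

```python
import numpy as np

def nearest_lidar_index(camera_ts_ns, lidar_ts_ns):
    """For each 30 Hz camera timestamp, find the index of the closest 10 Hz LiDAR sweep.

    Both arrays hold nanosecond timestamps; lidar_ts_ns must be sorted.
    """
    camera_ts_ns = np.asarray(camera_ts_ns)
    lidar_ts_ns = np.asarray(lidar_ts_ns)
    # searchsorted gives the insertion point; compare with the neighbor on the left
    # to pick whichever sweep is closer in time.
    idx = np.searchsorted(lidar_ts_ns, camera_ts_ns)
    idx = np.clip(idx, 1, len(lidar_ts_ns) - 1)
    left_closer = (camera_ts_ns - lidar_ts_ns[idx - 1]) < (lidar_ts_ns[idx] - camera_ts_ns)
    return np.where(left_closer, idx - 1, idx)
```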
Ground Truth Label Generation & Data Pre-processing
Ground Truth
High-quality ground truth labels are an essential but often overlooked factor in learning-based problems. To generate ground truth action labels, I tried the following two methods:
A heuristic algorithm that generates labels from the GPS rotation and position vectors
Manually labeling every frame
The labels generated by the heuristic algorithm turned out to be very noisy, so I decided to label every frame manually.
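For reference, a sketch of the kind of heuristic rule this refers to: classify each frame from the heading change (derived from the GPS rotation vector) and the lateral displacement (from the position vector) over a short look-ahead window. The window length, thresholds, and class indices are illustrative assumptions; rules like this are exactly what produced the noisy labels.

```python
import numpy as np

# Hypothetical heuristic labeler: thresholds, window size, and the class mapping
# are illustrative assumptions, not the final labeling rules.
LEFT_TURN, RIGHT_TURN, LEFT_LC, RIGHT_LC, STRAIGHT = range(5)

def heuristic_label(yaw, xy, t, window=30, turn_thresh=np.deg2rad(25), lc_thresh=1.5):
    """Classify the behavior at frame t from heading change and lateral offset.

    yaw: per-frame ego heading in radians; xy: per-frame ego position in meters.
    A large heading change over the window is treated as a turn; otherwise a
    large lateral displacement (in the ego frame at time t) is a lane change.
    """
    t_end = min(t + window, len(yaw) - 1)
    # Wrap the heading difference to [-pi, pi].
    dyaw = np.arctan2(np.sin(yaw[t_end] - yaw[t]), np.cos(yaw[t_end] - yaw[t]))
    if dyaw > turn_thresh:
        return LEFT_TURN            # counter-clockwise heading change
    if dyaw < -turn_thresh:
        return RIGHT_TURN
    # Lateral displacement of the future position, expressed in the ego frame at t.
    disp = xy[t_end] - xy[t]
    lateral = -np.sin(yaw[t]) * disp[0] + np.cos(yaw[t]) * disp[1]
    if lateral > lc_thresh:
        return LEFT_LC
    if lateral < -lc_thresh:
        return RIGHT_LC
    return STRAIGHT
```

GPS yaw jitter and curved roads both trip up fixed thresholds like these, which is why the heuristic labels came out noisy.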
Pre-processing
Resize RGB video frames
LiDAR point cloud coordinate transform
Generate LiDAR Bird's Eye View (BEV) images (a rasterization sketch follows this list)
Remove ground LiDAR points
Parse HD map as a three-channel image, following Uber's approach
Align map orientation with ego vehicle heading direction at each time step
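A minimal sketch of the BEV rasterization and ground-removal steps above, assuming an ego-frame point cloud; the grid extent, resolution, and the simple z-threshold ground filter are illustrative choices, not the project's exact parameters:

```python
import numpy as np

def lidar_to_bev(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                 resolution=0.2, ground_z=0.3):
    """Rasterize an ego-frame LiDAR point cloud into a single-channel BEV occupancy image.

    points: (N, 3) array of x, y, z in meters. Points below `ground_z` are
    dropped as a crude ground filter; all grid parameters are illustrative.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (z > ground_z) & \
           (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y = x[keep], y[keep]

    width = int((x_range[1] - x_range[0]) / resolution)
    height = int((y_range[1] - y_range[0]) / resolution)
    cols = ((x - x_range[0]) / resolution).astype(int)
    rows = ((y - y_range[0]) / resolution).astype(int)

    bev = np.zeros((height, width), dtype=np.float32)
    bev[rows, cols] = 1.0  # occupancy; could also encode max height or point density per cell
    return bev
```

In practice a multi-channel BEV (for example max height, intensity, and point density per cell) carries more information than a binary occupancy grid.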
Models
Baseline Model: FaF
Adapt the "late fusion" variant of Uber's Fast and Furious (FaF) model to our problem setup.
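As I understand the paper, "late fusion" aggregates the temporal dimension gradually with 3D convolutions instead of collapsing all past frames in the first layer. A rough sketch of that idea, adapted to a 10-frame input; channel counts, kernel sizes, and the final temporal pooling are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LateTemporalFusion(nn.Module):
    """Gradually collapses the temporal dimension of a stacked BEV input with 3D convs.

    Input:  (batch, channels, time, height, width), e.g. 10 past BEV frames.
    Output: (batch, channels', height, width) once the time axis is pooled away.
    """

    def __init__(self, in_ch=3, mid_ch=32):
        super().__init__()
        self.temporal = nn.Sequential(
            # No temporal padding, so each layer shrinks the time axis by 2.
            nn.Conv3d(in_ch, mid_ch, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        feats = self.temporal(x)               # time axis: T -> T - 4
        feats = feats.mean(dim=2)              # pool whatever temporal extent remains
        return feats                           # (B, mid_ch, H, W), fed to a 2D backbone

# Example: 10 stacked frames of a 3-channel raster at 200 x 200.
# x = torch.randn(2, 3, 10, 200, 200)
# print(LateTemporalFusion()(x).shape)  # torch.Size([2, 32, 200, 200])
```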
Our Model: Fusion Seq2seq
A Seq2seq model with attention that takes the raw sensor data from the last 10 time steps and predicts the vehicle action labels for the next n frames (n = 1, 5, 10, 20, 30). At each time step, the RGB video frame, LiDAR BEV image, and HD map are concatenated along the channel dimension.
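A minimal sketch of how such a model could be wired up: each time step's channel-concatenated tensor goes through a shared CNN, a GRU encodes the 10 per-frame features, and a GRU decoder with dot-product attention over the encoder states emits one action distribution per future frame. All layer sizes, the GRU/attention choices, and names are assumptions, not the project's exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 5  # left turn, right turn, left lane change, right lane change, straight

class FrameEncoder(nn.Module):
    """Encodes one fused frame (RGB + LiDAR BEV + HD map stacked on channels)."""
    def __init__(self, in_ch, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):                       # (B, C, H, W)
        return self.fc(self.cnn(x).flatten(1))  # (B, feat_dim)

class FusionSeq2seq(nn.Module):
    """GRU encoder-decoder with dot-product attention over the 10 encoder states."""
    def __init__(self, in_ch, feat_dim=256, hidden=256, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.frame_enc = FrameEncoder(in_ch, feat_dim)
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(NUM_CLASSES, hidden)
        self.out = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, frames):                  # (B, T, C, H, W), T = 10 past steps
        B, T = frames.shape[:2]
        feats = self.frame_enc(frames.flatten(0, 1)).view(B, T, -1)
        enc_out, h = self.encoder(feats)        # enc_out: (B, T, hidden)
        h = h.squeeze(0)                        # decoder state: (B, hidden)
        prev = torch.zeros(B, NUM_CLASSES, device=frames.device)
        logits = []
        for _ in range(self.horizon):
            h = self.decoder(prev, h)
            attn = F.softmax(torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(attn.unsqueeze(1), enc_out).squeeze(1)
            step_logits = self.out(torch.cat([h, context], dim=1))
            logits.append(step_logits)
            prev = F.softmax(step_logits, dim=1)  # feed predicted distribution back in
        return torch.stack(logits, dim=1)       # (B, horizon, NUM_CLASSES)
```

During training, teacher forcing (feeding the ground-truth previous label instead of the predicted distribution) is a common alternative for the decoder input.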
Model Variant: 3-branch Seq2seq
The 3-branch variant of Fusion Seq2seq uses one CNN branch for each type of raw sensor data.
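A sketch of the corresponding per-frame encoder, reusing the hypothetical FrameEncoder from the previous sketch: one CNN branch per modality, with the branch features concatenated before the recurrent encoder. Per-modality channel counts and feature sizes are assumptions:

```python
import torch
import torch.nn as nn

class ThreeBranchFrameEncoder(nn.Module):
    """One CNN branch per modality; features are concatenated per time step.

    Reuses the hypothetical FrameEncoder from the Fusion Seq2seq sketch above;
    channel counts per modality are illustrative.
    """
    def __init__(self, rgb_ch=3, bev_ch=1, map_ch=3, feat_dim=128):
        super().__init__()
        self.rgb_branch = FrameEncoder(rgb_ch, feat_dim)
        self.bev_branch = FrameEncoder(bev_ch, feat_dim)
        self.map_branch = FrameEncoder(map_ch, feat_dim)

    def forward(self, rgb, bev, hd_map):         # each: (B, C_modality, H, W)
        feats = [self.rgb_branch(rgb), self.bev_branch(bev), self.map_branch(hd_map)]
        return torch.cat(feats, dim=1)           # (B, 3 * feat_dim), fed to the same GRU seq2seq
```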