Towards End-to-End Lane Detection: an Instance Segmentation Approach

Motivation

Traditional Lane Detection

Traditional lane detection methods rely on a combination of highly-specialized, hand-crafted features and heuristics, usually followed by post-processing techniques.

Computationally expensive
Prone to scalability due to road scene variations.

Deep Learning Models

More recent approaches leverage deep learning models, trained for pixel-wise lane segmentation, even when no markings are present in the image due to their big receptive field.

Limited to detecting a pre-defined, fixed number of lanes, e.g. ego-lanes, and can not cope with lane changes.
At a final stage, the generated binary lane segmentations still need to be disentangled into the different lane instances.

Proposed Approach

This paper propose to cast the lane detection problem as an instance segmentation problem — in which each lane forms its own instance — that can be trained end-to-end.
To parametrize the segmented lane instances before fitting the lane, they further propose to apply a learned perspective transformation, conditioned on the image, in contrast to a fixed “bird-eye-view” transformation. By doing so, a lane fitting is robust against road plane changes, unlike existing approaches that rely on a fixed, predefined transformation.

Screen Shot 2022-10-12 at 11.10.09 AM.png

LaneNet

Inspired by the success of dense prediction networks in semantic segmentation and instance segmentation tasks, they design a branched, multi-task network for lane instance segmentation, consisting of

a lane segmentation branch
and a lance embedding branch

that can be trained end-to-end.

Screen Shot 2022-10-12 at 11.13.13 AM.png

The lane segmentation branch has two output classes, background or lane, while the lane embedding branch further disentangles the segmented lane pixels into different lane instances.

H-Net

To this end, curve fitting algorithms have been widely used in the literature. Popular models are

cubic polynomials
splines
clothoids

To increase the quality of the fit while retaining computational efficiency, it is common to convert the image into a “bird-eye view” using a perspective transformation and perform the curve fitting there.

In particular, the neural network takes as input the image and is optimized with a loss function that is tailored to the lane fitting problem.

Method

LaneNet

The instance segmentation task consists of two parts, a segmentation and a clustering part. To increase performance, both in terms of speed and accuracy, these two parts are jointly trained in a multi-task network.

Binary Segmentation

To construct the ground-truth segmentation map, we connect all ground-truth lane points together, forming a connected line per lane.

Note that they draw these ground-truth lanes even through objects like occluding cars, or also in the absence of explicit visual lane segments, like dashed or faded lanes.
This way, the network will learn to predict lane location even when they are occluded or in adverse circumstances.
The segmentation network is trained with the standard cross-entropy loss function. Since the two classes (land/background) are highly unbalanced, they apply bounded inverse class weighting.

Instance Segmentation

Therefore they use a one-shot method based on distance metric learning, which can easily be integrated with standard feed-forward networks and which is specifically designed for real-time applications.

By using the clustering loss function, the instance embedding branch is trained to output an embedding for each lane pixel so that the distance between pixel embeddings belonging to the same lane is small, whereas the distance between pixel embeddings belonging to different lanes is maximized.

Clustering

The clustering is done by an iterative procedure.

By setting \(\delta_d > 6\delta_v\) in the above loss, one can take a random lane embedding and threshold around it with a radius of \(2 \delta_v\) to select all embeddings belong to the same lane.
This is repeated until all lane embeddings are assigned to a lane.
To avoid selecting an outlier to threshold around, we first use mean shift to shift closer to the cluster center and then do the thresholding.

Network Architecture

While the original ENet’s encoder consists of three stages, LaneNet only shares the first two stages between the two branches, leaving stage 3 of the ENet encoder and the full ENet decoder as the backbone of each separate branch.

The last layer of the segmentation branch outputs a one channel image, whereas the last layer of the embedding branch outputs a N-channel image, with N the embedding dimension.

Curve Fitting Using H-Net

A frequently used solution to this problem is to project the image into a “bird-eye view” representation, in which lanes are parallel to each other and as such, curved lanes can be fitted with a 2nd or 3rd order polynomial.

Loss Function

Since the lane fitting is done by using the closed-form solution of the least squares algorithm, the loss is differentiable. We use automatic differentiation to calculate the gradients.

Network Architecture

The network architecture of H-Net is kept intentionally small and is constructed out of consecutive blocks of 3x3 convolutions, BNs and ReLUs.

Screen Shot 2022-10-12 at 2.30.29 PM.png

Results

Dataset

TuSimple lane dataset