
Cityscapes class level boundary labeling with improved one-hot

· 10 min read
Ordinary Magician | Half stack developer


I'm working on semantic segmentation code and need to improve accuracy along object boundaries, so I tried using a boundary loss to assist model training. This article documents my attempt, along with the code.


Our purpose is clear. Cityscapes provides an indexed image, gtFine_labelIds, in which each pixel value encodes a class. We want to generate class-level boundaries from gtFine_labelIds so that we can optimize the boundary regions of each specific class.

Background information

First things first, the imports:

import functools
from typing import Union

import numpy as np
import torch
from PIL import Image
import matplotlib.pyplot as plt
import torch.nn.functional as F
import torchvision.transforms as T

Since Cityscapes provides a variety of labels, such as gtFine_color and gtFine_labelIds, we first need to convert these formats into indexed images with valid train IDs, dropping ignored labels along the way. You can find the mappings for those labels easily with Google, or you may copy them from cityscapesScripts. I've put a table here for easy reference:

| name | id | train id | category | category id | has instances | ignore in eval | color in gtFine_color |
|---|---|---|---|---|---|---|---|
| unlabeled | 0 | 255 | void | 0 | False | True | (0, 0, 0) |
| ego vehicle | 1 | 255 | void | 0 | False | True | (0, 0, 0) |
| rectification border | 2 | 255 | void | 0 | False | True | (0, 0, 0) |
| out of roi | 3 | 255 | void | 0 | False | True | (0, 0, 0) |
| static | 4 | 255 | void | 0 | False | True | (0, 0, 0) |
| dynamic | 5 | 255 | void | 0 | False | True | (111, 74, 0) |
| ground | 6 | 255 | void | 0 | False | True | (81, 0, 81) |
| road | 7 | 0 | flat | 1 | False | False | (128, 64, 128) |
| sidewalk | 8 | 1 | flat | 1 | False | False | (244, 35, 232) |
| parking | 9 | 255 | flat | 1 | False | True | (250, 170, 160) |
| rail track | 10 | 255 | flat | 1 | False | True | (230, 150, 140) |
| building | 11 | 2 | construction | 2 | False | False | (70, 70, 70) |
| wall | 12 | 3 | construction | 2 | False | False | (102, 102, 156) |
| fence | 13 | 4 | construction | 2 | False | False | (190, 153, 153) |
| guard rail | 14 | 255 | construction | 2 | False | True | (180, 165, 180) |
| bridge | 15 | 255 | construction | 2 | False | True | (150, 100, 100) |
| tunnel | 16 | 255 | construction | 2 | False | True | (150, 120, 90) |
| pole | 17 | 5 | object | 3 | False | False | (153, 153, 153) |
| polegroup | 18 | 255 | object | 3 | False | True | (153, 153, 153) |
| traffic light | 19 | 6 | object | 3 | False | False | (250, 170, 30) |
| traffic sign | 20 | 7 | object | 3 | False | False | (220, 220, 0) |
| vegetation | 21 | 8 | nature | 4 | False | False | (107, 142, 35) |
| terrain | 22 | 9 | nature | 4 | False | False | (152, 251, 152) |
| sky | 23 | 10 | sky | 5 | False | False | (70, 130, 180) |
| person | 24 | 11 | human | 6 | True | False | (220, 20, 60) |
| rider | 25 | 12 | human | 6 | True | False | (255, 0, 0) |
| car | 26 | 13 | vehicle | 7 | True | False | (0, 0, 142) |
| truck | 27 | 14 | vehicle | 7 | True | False | (0, 0, 70) |
| bus | 28 | 15 | vehicle | 7 | True | False | (0, 60, 100) |
| caravan | 29 | 255 | vehicle | 7 | True | True | (0, 0, 90) |
| trailer | 30 | 255 | vehicle | 7 | True | True | (0, 0, 110) |
| train | 31 | 16 | vehicle | 7 | True | False | (0, 80, 100) |
| motorcycle | 32 | 17 | vehicle | 7 | True | False | (0, 0, 230) |
| bicycle | 33 | 18 | vehicle | 7 | True | False | (119, 11, 32) |
| license plate | -1 | -1 | vehicle | 7 | False | True | (0, 0, 142) |

According to the table, every label marked ignore in eval = True (equivalently, train id = 255 or train id = -1) should be ignored during training. Note that gtFine_labelIds stores label IDs (the id column), not train IDs: you have to map each label ID to its train ID and mark the IDs flagged as ignore in eval with an "ignore label". If instead you read labels from the colored images (gtFine_color), you need to convert colors into IDs first. The code below maps label IDs to train IDs and drops ignored labels:

ignore_label = -1
label_mapping = {-1: ignore_label, 0: ignore_label,
                 1: ignore_label, 2: ignore_label,
                 3: ignore_label, 4: ignore_label,
                 5: ignore_label, 6: ignore_label,
                 7: 0, 8: 1, 9: ignore_label,
                 10: ignore_label, 11: 2, 12: 3,
                 13: 4, 14: ignore_label, 15: ignore_label,
                 16: ignore_label, 17: 5, 18: ignore_label,
                 19: 6, 20: 7, 21: 8, 22: 9, 23: 10, 24: 11,
                 25: 12, 26: 13, 27: 14, 28: 15,
                 29: ignore_label, 30: ignore_label,
                 31: 16, 32: 17, 33: 18, 255: ignore_label}

def convert_cityscapes_label(label, inverse=False):
    """Convert a label-ID image into an indexed image with train IDs.

    Args:
        label (np.ndarray): label image, e.g. read with OpenCV or PIL
        inverse (bool, optional): invert the conversion, mapping train IDs
            back to label IDs. Defaults to False.

    Returns:
        np.ndarray: the converted label image
    """
    temp = label.copy()
    if inverse:
        for v, k in label_mapping.items():
            label[temp == k] = v
    else:
        for k, v in label_mapping.items():
            label[temp == k] = v
    return label
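
By the way, if you start from gtFine_color instead of gtFine_labelIds, you need a color-to-ID lookup before anything else. Here is a minimal sketch of that conversion; color_to_train_id and convert_color_label are my own names, and the colors come from the table above:

color_to_train_id = {
    (128, 64, 128): 0,  # road
    (244, 35, 232): 1,  # sidewalk
    (70, 70, 70): 2,    # building
    # ... fill in the remaining colors from the table above
}

def convert_color_label(color_label, ignore_label=-1):
    """Map an RGB gtFine_color image of shape (H, W, 3) to train IDs (H, W)."""
    train_id = np.full(color_label.shape[:2], ignore_label, dtype=int)
    for color, tid in color_to_train_id.items():
        train_id[np.all(color_label == np.array(color), axis=-1)] = tid
    return train_id  # anything not in the lookup stays at ignore_label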

We use gtFine/train/aachen/aachen_000020_000019_gtFine_labelIds.png as an example:

label_img = "cityscapes/gtFine/train/aachen/aachen_000020_000019_gtFine_labelIds.png"
label_img =
transform = T.Resize((256,512),interpolation=T.InterpolationMode.NEAREST)
label_img = transform(label_img) # Resize image into (256,512) to reduce computation cost
label = np.array(label_img) # convert into numpy array
label = convert_cityscapes_label(label) # drop ignored label
label = np.array(label_img, dtype=int) # convert into numpy array
label = convert_cityscapes_label(label) # drop ignore labels by turning them into -1



The labels are resized to (256,512) in the code above to reduce the amount of computation. The original Cityscapes resolution is 2048x1024, but when training a semantic segmentation network, the model's output is usually not kept at the original size; it is downsampled to some extent. A common downsampling rate for semantic segmentation is 1/8, which corresponds to an output resolution of 256x128. Downsampling the label therefore does not affect the loss computation in practice and helps speed things up. Also note the NEAREST interpolation above: labels are categorical, so they must never be interpolated smoothly.
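
A quick sanity check on why nearest-neighbor interpolation is the required choice for label images (a toy example of my own, using F.interpolate directly on fake class IDs):

lbl = torch.tensor([0.0, 18.0, 18.0, 0.0]).reshape(1, 1, 1, 4)
# bilinear blends class IDs into meaningless in-between values at boundaries
print(F.interpolate(lbl, size=(1, 2), mode="bilinear"))  # tensor([[[[9., 9.]]]])
# nearest keeps only values that actually exist in the label
print(F.interpolate(lbl, size=(1, 2), mode="nearest"))   # tensor([[[[0., 18.]]]])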

Improved one-hot encoding

Intuitively, you only need one-hot encoding to expand the single-channel mask into a class-level mask with num_classes channels (19 in Cityscapes). However, the PyTorch API does not provide a one-hot encoding function that can ignore labels: torch.nn.functional.one_hot rejects negative values. In other words, you need to implement this function yourself. The simple idea is to create a tensor with one channel more than num_classes and eventually cut off the excess channel.

def one_hot_encode(
    mask: torch.Tensor,
    num_classes: int,
    ignored_label: Union[str, int] = "negative",
) -> torch.Tensor:
    """Convert the mask to a one-hot encoded representation by @visualDust

    Args:
        mask (torch.Tensor): indexed label image. Should be integer-typed.
        num_classes (int): number of classes
        ignored_label (Union[str, int], optional): a label value to ignore, or
            the pattern "negative" to ignore all negative values.
            Defaults to "negative".

    Returns:
        torch.Tensor: one-hot encoded tensor
    """
    original_shape = mask.shape
    mask = mask.clone()  # avoid modifying the caller's tensor in place
    for _ in range(4 - len(mask.shape)):
        mask = mask.unsqueeze(0)  # H W -> C H W -> B C H W, if applicable
    # start to handle ignored label:
    # convert it into a positive index equal to num_classes
    if type(ignored_label) is int:
        mask[mask == ignored_label] = num_classes
    elif ignored_label == "negative":
        mask[mask < 0] = num_classes

    # check if mask image is valid
    if torch.max(mask) > num_classes:
        raise RuntimeError("class values must be smaller than num_classes.")
    B, _, H, W = mask.shape
    one_hot = torch.zeros(B, num_classes + 1, H, W, device=mask.device)
    one_hot.scatter_(1, mask, 1)  # mark 1 on the channel (dim=1) indexed by mask
    one_hot = one_hot[:, :num_classes]  # remove the extra channel for ignored label(s)
    for _ in range(len(one_hot.shape) - len(original_shape)):
        one_hot.squeeze_(0)  # drop the batch dim added above, if applicable
    return one_hot

The code takes an indexed mask, the number of classes, and an optional parameter for ignored labels, and converts the mask into a one-hot encoded representation.

If ignored_label is an integer, occurrences of that label in the mask are replaced with num_classes. If ignored_label is set to "negative", all values below 0 in the mask are replaced with num_classes. This step effectively turns the ignored label into a valid positive index, one past the last real class, whose channel is sliced off at the end.
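
A quick check on a toy mask (my own example) shows the behavior: ignored pixels end up as all-zero vectors across the class channels:

toy = torch.tensor([[0, 1], [-1, 2]])
encoded = one_hot_encode(toy, num_classes=3)
print(encoded.shape)     # torch.Size([3, 2, 2])
print(encoded[:, 1, 0])  # tensor([0., 0., 0.]) - the ignored pixel belongs to no class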


gt_tensor = torch.tensor(label).type(torch.int64)  # convert label into a tensor
class_level_mask = one_hot_encode(gt_tensor, num_classes=19)  # get class level mask using the one-hot encoding above
sub_idx = 1  # for figure line wrapping
for m in class_level_mask:  # draw each class
    plt.subplot(6, 4, sub_idx)
    plt.imshow(m)
    sub_idx += 1
plt.show()  # show result



Don't worry if some of the classes visible in the color map are missing from the class-level masks. They are marked as ignore labels, so we dropped them during one-hot encoding. For example, ground is missing because it is ignored in training.

Obtain per-class boundaries

Since we want to identify boundaries between different classes or regions in the ground truth mask, the code below computes boundary targets from ground truth masks using a Laplacian kernel. The function takes a ground truth mask (gt_mask), convolves it with the kernel, and thresholds the response into a binary boundary map.

def get_boundary_of_gt(gt_mask, boundary_threshold=0.1):
    laplacian_kernel = torch.tensor(
        [-1, -1, -1, -1, 8, -1, -1, -1, -1], dtype=torch.float32
    ).reshape(1, 3, 3)
    C = 1
    original_shape = gt_mask.shape
    if len(gt_mask.shape) == 2:  # add channel dim
        gt_mask = gt_mask.unsqueeze(0)
    if len(gt_mask.shape) == 3:
        C, _, _ = gt_mask.shape  # get num_of_classes
    elif len(gt_mask.shape) == 4:
        _, C, _, _ = gt_mask.shape  # get num_of_classes
    laplacian_kernel = [laplacian_kernel] * C  # one kernel per channel
    laplacian_kernel = torch.stack(laplacian_kernel, dim=0)  # shape (C, 1, 3, 3)
    _conv = functools.partial(
        F.conv2d, weight=laplacian_kernel, padding=1, groups=C
    )  # conv on each channel individually
    boundary_targets = _conv(gt_mask)
    boundary_targets = boundary_targets.clamp(min=0)
    boundary_targets[boundary_targets > boundary_threshold] = 1
    boundary_targets[boundary_targets <= boundary_threshold] = 0
    for _ in range(len(boundary_targets.shape) - max(3, len(original_shape))):
        boundary_targets.squeeze_(0)  # reduce dim if applicable
    return boundary_targets

Note that the Laplacian kernel is shaped (1, 3, 3) and then stacked along the output-channel dimension into (C, 1, 3, 3). The convolution also passes groups=C to F.conv2d so that the Laplacian kernel is applied separately to each channel of the input. The PyTorch documentation says torch.nn.functional.conv2d takes these arguments:

  • input: input tensor of shape (minibatch, in_channels, iH, iW)
  • weight: filters of shape (out_channels, in_channels / groups, kH, kW)
  • groups: split input into groups; both in_channels and out_channels should be divisible by the number of groups. Default: 1

In order to apply Laplacian kernels separately on each channel, we use groups = C. Check the doc if you're interested: torch.nn.functional.conv2d — PyTorch 2.1 documentation
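
If the groups semantics feel abstract, here is a tiny self-contained check (my own toy example, not from the article) that with groups=C an edge in one channel produces no response in any other channel:

x = torch.zeros(1, 2, 4, 4)
x[:, 0, :, :2] = 1  # channel 0 has a vertical edge; channel 1 is empty
k = torch.tensor([-1.0, -1.0, -1.0, -1.0, 8.0, -1.0, -1.0, -1.0, -1.0]).reshape(1, 1, 3, 3)
w = k.repeat(2, 1, 1, 1)  # weight shape (out_channels=2, in_channels/groups=1, 3, 3)
y = F.conv2d(x, w, padding=1, groups=2)
print(y[0, 0].abs().sum() > 0)  # tensor(True): edge detected in channel 0
print(y[0, 1].abs().sum())      # tensor(0.): channels never mix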


gt_tensor = torch.tensor(label).type(torch.int64)  # convert label into a tensor
class_level_mask = one_hot_encode(gt_tensor, num_classes=19)  # get class level mask as before
gt_boundary_class_level = get_boundary_of_gt(class_level_mask)  # per-class boundary maps
sub_idx = 1
for b in gt_boundary_class_level:
    plt.subplot(6, 4, sub_idx)
    plt.imshow(b)
    sub_idx += 1
plt.show()
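
To close the loop with the motivation at the top: one possible way to use these boundary targets (a sketch of my own, not the article's training code; boundary_weighted_ce and its parameters are hypothetical) is to up-weight boundary pixels in a cross-entropy loss:

def boundary_weighted_ce(logits, target, boundary, boundary_weight=2.0):
    """logits: (B, 19, H, W) raw scores; target: (B, H, W) train IDs with -1 as ignore;
    boundary: (B, 19, H, W) per-class boundary maps from get_boundary_of_gt."""
    per_pixel = F.cross_entropy(logits, target, ignore_index=-1, reduction="none")
    weight = 1.0 + boundary_weight * boundary.amax(dim=1)  # boundary of any class
    return (per_pixel * weight).mean()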