API Reference

mmtrack.apis

mmtrack.apis.inference_mot(model, img, frame_id)[source]

Inference image(s) with the mot model.

Parameters
  • model (nn.Module) – The loaded mot model.

  • img (str | ndarray) – Either image name or loaded image.

  • frame_id (int) – Frame id.

Returns

The tracking results.

Return type

dict[str, ndarray]

mmtrack.apis.inference_sot(model, image, init_bbox, frame_id)[source]

Inference image with the single object tracker.

Parameters
  • model (nn.Module) – The loaded tracker.

  • image (ndarray) – Loaded images.

  • init_bbox (ndarray) – The target that needs to be tracked.

  • frame_id (int) – Frame id.

Returns

The tracking results.

Return type

dict[str, ndarray]

mmtrack.apis.inference_vid(model, image, frame_id, ref_img_sampler={'frame_stride': 10, 'num_left_ref_imgs': 10})[source]

Inference image with the video object detector.

Parameters
  • model (nn.Module) – The loaded detector.

  • image (ndarray) – Loaded images.

  • frame_id (int) – Frame id.

  • ref_img_sampler (dict) – The configuration for sampling reference images. Only used by FGFA-style video detectors. Defaults to dict(frame_stride=10, num_left_ref_imgs=10).

Returns

The detection results.

Return type

dict[str, ndarray]

mmtrack.apis.init_model(config, checkpoint=None, device='cuda:0', cfg_options=None)[source]

Initialize a model from config file.

Parameters
  • config (str or mmcv.Config) – Config file path or the config object.

  • checkpoint (str, optional) – Checkpoint path. Defaults to None.

  • cfg_options (dict, optional) – Options to override some settings in the used config. Defaults to None.

Returns

The constructed detector.

Return type

nn.Module

mmtrack.apis.multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=False)[source]

Test model with multiple gpus.

This method tests the model with multiple gpus and collects the results in one of two modes: gpu or cpu. With ‘gpu_collect=True’ it encodes the results to gpu tensors and uses gpu communication to collect them. In cpu mode it saves the results from the different gpus to ‘tmpdir’ and the rank 0 worker collects them. ‘gpu_collect=True’ is not supported for now.

Parameters
  • model (nn.Module) – Model to be tested.

  • data_loader (DataLoader) – PyTorch data loader.

  • tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode. Defaults to None.

  • gpu_collect (bool) – Option to use either gpu or cpu to collect results. Defaults to False.

Returns

The prediction results.

Return type

dict[str, list]

mmtrack.apis.single_gpu_test(model, data_loader, show=False, out_dir=None, show_score_thr=0.3)[source]

Test model with single gpu.

Parameters
  • model (nn.Module) – Model to be tested.

  • data_loader (DataLoader) – PyTorch data loader.

  • show (bool) – If True, visualize the prediction results (Not supported for now). Defaults to False.

  • out_dir (str) – Path of directory to save the visualization results (Not supported for now). Defaults to None.

  • show_score_thr (float) – The score threshold of visualization (Not supported for now). Defaults to 0.3.

Returns

The prediction results.

Return type

dict[str, list]

mmtrack.apis.train_model(model, dataset, cfg, distributed=False, validate=False, timestamp=None, meta=None)[source]

Train model entry function.

Parameters
  • model (nn.Module) – The model to be trained.

  • dataset (Dataset) – Train dataset.

  • cfg (dict) – The config dict for training.

  • distributed (bool) – Whether to use distributed training. Default: False.

  • validate (bool) – Whether to do evaluation. Default: False.

  • timestamp (str | None) – Local time for runner. Default: None.

  • meta (dict | None) – Meta dict to record some important information. Default: None

mmtrack.core

anchor

evaluation

class mmtrack.core.evaluation.DistEvalHook(*args: Any, **kwargs: Any)[source]

Please refer to mmdet.core.evaluation.eval_hooks.py:DistEvalHook for detailed docstring.

class mmtrack.core.evaluation.EvalHook(*args: Any, **kwargs: Any)[source]

Please refer to mmdet.core.evaluation.eval_hooks.py:EvalHook for detailed docstring.

mmtrack.core.evaluation.eval_mot(results, annotations, logger=None, classes=None, iou_thr=0.5, ignore_iof_thr=0.5, ignore_by_classes=False, nproc=4)[source]

Evaluation CLEAR MOT metrics.

Parameters
  • results (list[list[list[ndarray]]]) – The first list indicates videos, the second list indicates images, and the third list indicates categories. The ndarray holds the tracking results.

  • annotations (list[list[dict]]) –

    The first list indicates videos and the second list indicates images. Each dict holds the annotations of one image. Keys of annotations are

    • bboxes: numpy array of shape (n, 4)

    • labels: numpy array of shape (n, )

    • instance_ids: numpy array of shape (n, )

    • bboxes_ignore (optional): numpy array of shape (k, 4)

    • labels_ignore (optional): numpy array of shape (k, )

  • logger (logging.Logger | str | None, optional) – The way to print the evaluation results. Defaults to None.

  • classes (list, optional) – Classes in the dataset. Defaults to None.

  • iou_thr (float, optional) – IoU threshold for evaluation. Defaults to 0.5.

  • ignore_iof_thr (float, optional) – IoF threshold to ignore results. Defaults to 0.5.

  • ignore_by_classes (bool, optional) – Whether to ignore the results by classes. Defaults to False.

  • nproc (int, optional) – Number of the processes. Defaults to 4.

Returns

Evaluation results.

Return type

dict[str, float]

mmtrack.core.evaluation.eval_sot_ope(results, annotations)[source]

Evaluation in OPE protocol.

Parameters
  • results (list[list[ndarray]]) – The first list contains the tracking results of each video. The second list contains the tracking results of each frame in one video. The ndarray denotes the tracking box in [tl_x, tl_y, br_x, br_y] format.

  • annotations (list[list[dict]]) – The first list contains the annotations of each video. The second list contains the annotations of each frame in one video. The dict contains the annotation information of one frame.

Returns

OPE style evaluation metric (i.e. success, norm precision and precision).

Return type

dict[str, float]
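
To make the success metric more concrete, here is a minimal NumPy sketch that computes the per-frame IoU between tracked and ground-truth boxes in [tl_x, tl_y, br_x, br_y] format and then averages, over a grid of thresholds, the fraction of frames whose overlap exceeds each threshold. The names bbox_iou and success_score and the 21-point threshold grid are assumptions made for this illustration; this is not the code behind eval_sot_ope.

```python
import numpy as np


def bbox_iou(pred, gt):
    """IoU of paired boxes, each of shape (n, 4) in [tl_x, tl_y, br_x, br_y]."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Corners of the intersection rectangle.
    lt = np.maximum(pred[:, :2], gt[:, :2])
    rb = np.minimum(pred[:, 2:], gt[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_pred + area_gt - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)


def success_score(pred, gt, thresholds=None):
    """Mean fraction of frames whose IoU exceeds each threshold (hypothetical helper)."""
    if thresholds is None:
        thresholds = np.linspace(0, 1, 21)  # assumed grid, not from the library
    ious = bbox_iou(pred, gt)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```

Precision-style metrics would compare box centers instead of overlaps; they are left out to keep the sketch short.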

motion

mmtrack.core.motion.flow_warp_feats(x, flow)[source]

Use flow to warp feature map.

Parameters
  • x (Tensor) – of shape (N, C, H_x, W_x).

  • flow (Tensor) – of shape (N, C, H_f, W_f).

Returns

The warped feature map with shape (N, C, H_x, W_x).

Return type

Tensor

optimizer

class mmtrack.core.optimizer.SiameseRPNLrUpdaterHook(lr_configs=[{'type': 'step', 'start_lr_factor': 0.2, 'end_lr_factor': 1.0, 'end_epoch': 5}, {'type': 'log', 'start_lr_factor': 1.0, 'end_lr_factor': 0.1, 'end_epoch': 20}], **kwargs)[source]

Learning rate updater for siamese rpn.

Parameters

lr_configs (list[dict]) – List of dicts, where each dict denotes the configuration of a specific learning rate updater and must have a ‘type’ key.

get_lr(runner, base_lr)[source]

Get a specific learning rate for each epoch.

class mmtrack.core.optimizer.SiameseRPNOptimizerHook(backbone_start_train_epoch, backbone_train_layers, **kwargs)[source]

Optimizer hook for siamese rpn.

Parameters
  • backbone_start_train_epoch (int) – Start to train the backbone at backbone_start_train_epoch-th epoch. Note the epoch in this class counts from 0, while the epoch in the log file counts from 1.

  • backbone_train_layers (list(str)) – List of str denoting the backbone stages that need to be trained.

before_train_epoch(runner)[source]

If runner.epoch >= self.backbone_start_train_epoch, start to train the backbone.

track

mmtrack.core.track.depthwise_correlation(x, kernel)[source]

Depthwise cross correlation.

This function is proposed in SiamRPN++.

Parameters
  • x (Tensor) – of shape (N, C, H_x, W_x).

  • kernel (Tensor) – of shape (N, C, H_k, W_k).

Returns

Tensor of shape (N, C, H_o, W_o), where H_o = H_x - H_k + 1 and W_o = W_x - W_k + 1.

Return type

Tensor
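
The shape relation above can be checked with a small NumPy sketch of depthwise cross correlation: each channel of each sample in x is correlated only with the matching channel of the matching sample in kernel, with no padding and stride 1. The function name depthwise_correlation_np is hypothetical, and the loop-based form is chosen for readability; it is an illustration of the operation, not the library implementation.

```python
import numpy as np


def depthwise_correlation_np(x, kernel):
    """Per-sample, per-channel valid cross correlation of x with kernel."""
    n, c, h_x, w_x = x.shape
    _, _, h_k, w_k = kernel.shape
    h_o, w_o = h_x - h_k + 1, w_x - w_k + 1
    out = np.zeros((n, c, h_o, w_o), dtype=np.result_type(x, kernel))
    for i in range(h_o):
        for j in range(w_o):
            # Window of x under the kernel at offset (i, j).
            window = x[:, :, i:i + h_k, j:j + w_k]
            out[:, :, i, j] = (window * kernel).sum(axis=(2, 3))
    return out
```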

mmtrack.core.track.embed_similarity(key_embeds, ref_embeds, method='dot_product', temperature=-1, transpose=True)[source]

Calculate feature similarity from embeddings.

Parameters
  • key_embeds (Tensor) – Shape (N1, C).

  • ref_embeds (Tensor) – Shape (N2, C) or (C, N2).

  • method (str, optional) – Method to calculate the similarity, options are ‘dot_product’ and ‘cosine’. Defaults to ‘dot_product’.

  • temperature (int, optional) – Softmax temperature. Defaults to -1.

  • transpose (bool, optional) – Whether to transpose ref_embeds. Defaults to True.

Returns

Similarity matrix of shape (N1, N2).

Return type

Tensor
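
As an illustration of the two similarity options, the following NumPy sketch builds an (N1, N2) matrix from key embeddings of shape (N1, C) and reference embeddings of shape (N2, C). The name similarity_np is hypothetical, and the temperature and transpose arguments are deliberately left out because their exact handling is not described above.

```python
import numpy as np


def similarity_np(key_embeds, ref_embeds, method="dot_product"):
    """Pairwise similarity between rows of key_embeds and rows of ref_embeds."""
    key = np.asarray(key_embeds, dtype=float)
    ref = np.asarray(ref_embeds, dtype=float)
    if method == "cosine":
        # L2-normalize each embedding so the dot product becomes a cosine.
        key = key / np.maximum(np.linalg.norm(key, axis=1, keepdims=True), 1e-12)
        ref = ref / np.maximum(np.linalg.norm(ref, axis=1, keepdims=True), 1e-12)
    elif method != "dot_product":
        raise ValueError(f"unsupported method: {method}")
    return key @ ref.T
```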

mmtrack.core.track.imrenormalize(img, img_norm_cfg, new_img_norm_cfg)[source]

Re-normalize the image.

Parameters
  • img (Tensor | ndarray) – Input image. If the input is a Tensor, the shape is (1, C, H, W). If the input is a ndarray, the shape is (H, W, C).

  • img_norm_cfg (dict) – Original configuration for the normalization.

  • new_img_norm_cfg (dict) – New configuration for the normalization.

Returns

Output image with the same type and shape of the input.

Return type

Tensor | ndarray
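
Re-normalization amounts to undoing the original normalization and applying the new one. The sketch below does this for an (H, W, C) ndarray, assuming each configuration is a dict with per-channel 'mean' and 'std' entries; those key names, the function name renormalize_np, and the omission of any channel-order handling are assumptions of this example.

```python
import numpy as np


def renormalize_np(img, img_norm_cfg, new_img_norm_cfg):
    """Map an (H, W, C) image from one mean/std normalization to another."""
    old_mean = np.asarray(img_norm_cfg["mean"], dtype=np.float64)
    old_std = np.asarray(img_norm_cfg["std"], dtype=np.float64)
    new_mean = np.asarray(new_img_norm_cfg["mean"], dtype=np.float64)
    new_std = np.asarray(new_img_norm_cfg["std"], dtype=np.float64)
    # Undo the original normalization, then apply the new one.
    raw = np.asarray(img, dtype=np.float64) * old_std + old_mean
    return (raw - new_mean) / new_std
```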

mmtrack.core.track.restore_result(result, return_ids=False)[source]

Restore the results (list of results of each category) into the results of the model forward.

Parameters
  • result (list[ndarray]) – shape (n, 5) or (n, 6)

  • return_ids (bool, optional) – Whether the input contains tracking results. Defaults to False.

Returns

tracking results of each class.

Return type

tuple

mmtrack.core.track.track2result(bboxes, labels, ids, num_classes)[source]

Convert tracking results to a list of numpy arrays.

Parameters
  • bboxes (torch.Tensor | np.ndarray) – shape (n, 5)

  • labels (torch.Tensor | np.ndarray) – shape (n, )

  • ids (torch.Tensor | np.ndarray) – shape (n, )

  • num_classes (int) – class number, including background class

Returns

tracking results of each class.

Return type

list(ndarray)
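
A rough NumPy sketch of this kind of conversion is shown below: the instance id is prepended to each (n, 5) box row, and the rows are then grouped by class label into one array per class. The column order (id first, giving (k, 6) arrays) and the name group_tracks_by_class are assumptions for illustration and may differ from what track2result actually produces.

```python
import numpy as np


def group_tracks_by_class(bboxes, labels, ids, num_classes):
    """Return one (k, 6) array per class: [id, x1, y1, x2, y2, score] (assumed layout)."""
    bboxes = np.asarray(bboxes, dtype=float).reshape(-1, 5)
    labels = np.asarray(labels, dtype=int)
    ids = np.asarray(ids, dtype=float)
    # Prepend the instance id as the first column of every row.
    rows = np.concatenate([ids[:, None], bboxes], axis=1)
    return [rows[labels == c] for c in range(num_classes)]
```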

utils

mmtrack.core.utils.crop_image(image, crop_region, crop_size, padding=(0, 0, 0))[source]

Crop image based on crop_region and crop_size.

Parameters
  • image (ndarray) – of shape (H, W, 3).

  • crop_region (ndarray) – of shape (4, ) in [x1, y1, x2, y2] format.

  • crop_size (int) – Crop size.

  • padding (tuple | ndarray) – of shape (3, ) denoting the padding values.

Returns

Cropped image of shape (crop_size, crop_size, 3).

Return type

ndarray
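
The idea of cropping a region that may extend beyond the image and resizing it to a square can be sketched in plain NumPy as follows. Out-of-image pixels are filled with the padding values and the resize uses nearest-neighbour sampling. The name crop_and_resize_np, the rounding of crop_region to integer pixels, the exclusive treatment of x2 and y2, and the interpolation choice are all assumptions of this example, not a description of how crop_image is implemented.

```python
import numpy as np


def crop_and_resize_np(image, crop_region, crop_size, padding=(0, 0, 0)):
    """Crop [x1, y1, x2, y2) from an (H, W, 3) image, pad, and resize to a square."""
    x1, y1, x2, y2 = (int(round(float(v))) for v in crop_region)
    h, w = image.shape[:2]
    # Start from a patch filled with the padding values.
    patch = np.empty((y2 - y1, x2 - x1, 3), dtype=image.dtype)
    patch[:] = np.asarray(padding, dtype=image.dtype)
    # Copy the part of the region that lies inside the image.
    sx1, sy1 = max(x1, 0), max(y1, 0)
    sx2, sy2 = min(x2, w), min(y2, h)
    if sx2 > sx1 and sy2 > sy1:
        patch[sy1 - y1:sy2 - y1, sx1 - x1:sx2 - x1] = image[sy1:sy2, sx1:sx2]
    # Nearest-neighbour resize to (crop_size, crop_size).
    rows = (np.arange(crop_size) * patch.shape[0] / crop_size).astype(int)
    cols = (np.arange(crop_size) * patch.shape[1] / crop_size).astype(int)
    return patch[rows][:, cols]
```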

mmtrack.core.utils.imshow_tracks(*args, backend='cv2', **kwargs)[source]

Show the tracks on the input image.

mmtrack.datasets

datasets

class mmtrack.datasets.CocoVID(*args: Any, **kwargs: Any)[source]

Inherit official COCO class in order to parse the annotations of bbox-related video tasks.

Parameters
  • annotation_file (str) – location of annotation file. Defaults to None.

  • load_img_as_vid (bool) – If True, convert image data to video data, which means each image is converted to a video. Defaults to False.

convert_img_to_vid(dataset)[source]

Convert image data to video data.

createIndex()[source]

Create index.

get_img_ids_from_ins_id(insId)[source]

Get image ids from given instance id.

Parameters

insId (int) – The given instance id.

Returns

Image ids of given instance id.

Return type

list[int]

get_img_ids_from_vid(vidId)[source]

Get image ids from given video id.

Parameters

vidId (int) – The given video id.

Returns

Image ids of given video id.

Return type

list[int]

get_ins_ids_from_vid(vidId)[source]

Get instance ids from given video id.

Parameters

vidId (int) – The given video id.

Returns

Instance ids of given video id.

Return type

list[int]

get_vid_ids(vidIds=[])[source]

Get video ids that satisfy given filter conditions.

By default, all video ids are returned.

Parameters

vidIds (list[int]) – The given video ids. Defaults to [].

Returns

Video ids.

Return type

list[int]

load_vids(ids=[])[source]

Get video information of given video ids.

By default, the information of all videos is returned.

Parameters

ids (list[int]) – The given video ids. Defaults to [].

Returns

List of video information.

Return type

list[dict]
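
The lookup methods above rely on indexes that relate videos, images and instances. A minimal pure-Python sketch of such indexes is given below. It assumes a COCO-style dict in which each image entry has 'id', 'video_id' and 'frame_id', and each annotation has 'image_id' and 'instance_id'; these field names and the helper build_video_index are assumptions of this example rather than a statement about the CocoVID internals.

```python
from collections import defaultdict


def build_video_index(dataset):
    """Build video -> image ids, instance -> image ids, video -> instance ids."""
    vid_to_imgs = defaultdict(list)
    for img in dataset["images"]:
        vid_to_imgs[img["video_id"]].append(img)
    # Image ids of each video, ordered by frame id.
    vid_to_img_ids = {
        vid: [im["id"] for im in sorted(imgs, key=lambda im: im["frame_id"])]
        for vid, imgs in vid_to_imgs.items()
    }
    img_to_vid = {img["id"]: img["video_id"] for img in dataset["images"]}
    ins_to_img_ids = defaultdict(list)
    vid_to_ins_ids = defaultdict(list)
    for ann in dataset["annotations"]:
        ins_id, img_id = ann["instance_id"], ann["image_id"]
        if img_id not in ins_to_img_ids[ins_id]:
            ins_to_img_ids[ins_id].append(img_id)
        vid = img_to_vid[img_id]
        if ins_id not in vid_to_ins_ids[vid]:
            vid_to_ins_ids[vid].append(ins_id)
    return vid_to_img_ids, dict(ins_to_img_ids), dict(vid_to_ins_ids)
```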

mmtrack.datasets.build_dataloader(dataset, samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, **kwargs)[source]

Build PyTorch DataLoader.

In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.

Parameters
  • dataset (Dataset) – A PyTorch dataset.

  • samples_per_gpu (int) – Number of training samples on each GPU, i.e., batch size of each GPU.

  • workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.

  • num_gpus (int) – Number of GPUs. Only used in non-distributed training.

  • dist (bool) – Distributed training/test or not. Default: True.

  • shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.

  • kwargs – any keyword argument to be used to initialize DataLoader

Returns

A PyTorch dataloader.

Return type

DataLoader
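
The difference between the two modes can be summarized in a few lines. The sketch below assumes that, in non-distributed training, the single loader serves all GPUs and therefore scales the batch size and worker count by num_gpus, while in distributed training each process uses the per-GPU values directly. The helper name loader_sizes is hypothetical and the scaling rule is this example's reading of the description above, not a confirmed detail of build_dataloader.

```python
def loader_sizes(samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for one DataLoader under the stated assumption."""
    if dist:
        # One loader per GPU/process: use the per-GPU values as they are.
        return samples_per_gpu, workers_per_gpu
    # One loader for all GPUs: scale by the number of GPUs.
    return num_gpus * samples_per_gpu, num_gpus * workers_per_gpu
```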

parsers

class mmtrack.datasets.parsers.CocoVID(*args: Any, **kwargs: Any)[source]

Inherit official COCO class in order to parse the annotations of bbox-related video tasks.

Parameters
  • annotation_file (str) – location of annotation file. Defaults to None.

  • load_img_as_vid (bool) – If True, convert image data to video data, which means each image is converted to a video. Defaults to False.

convert_img_to_vid(dataset)[source]

Convert image data to video data.

createIndex()[source]

Create index.

get_img_ids_from_ins_id(insId)[source]

Get image ids from given instance id.

Parameters

insId (int) – The given instance id.

Returns

Image ids of given instance id.

Return type

list[int]

get_img_ids_from_vid(vidId)[source]

Get image ids from given video id.

Parameters

vidId (int) – The given video id.

Returns

Image ids of given video id.

Return type

list[int]

get_ins_ids_from_vid(vidId)[source]

Get instance ids from given video id.

Parameters

vidId (int) – The given video id.

Returns

Instance ids of given video id.

Return type

list[int]

get_vid_ids(vidIds=[])[source]

Get video ids that satisfy given filter conditions.

By default, all video ids are returned.

Parameters

vidIds (list[int]) – The given video ids. Defaults to [].

Returns

Video ids.

Return type

list[int]

load_vids(ids=[])[source]

Get video information of given video ids.

By default, the information of all videos is returned.

Parameters

ids (list[int]) – The given video ids. Defaults to [].

Returns

List of video information.

Return type

list[dict]

pipelines

samplers

class mmtrack.datasets.samplers.DistributedVideoSampler(dataset, num_replicas=None, rank=None, shuffle=False)[source]

Distribute videos across multiple gpus during testing.

Parameters
  • dataset (Dataset) – Test dataset that must have a data_infos attribute. Each data_info in data_infos records the information of one frame, and each video must have one data_info with data_info[‘frame_id’] == 0.

  • num_replicas (int) – The number of gpus. Defaults to None.

  • rank (int) – Gpu rank id. Defaults to None.

  • shuffle (bool) – If True, shuffle the dataset. Defaults to False.
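
The requirement that every video contains a frame with frame_id == 0 suggests how whole videos can be assigned to gpus: the first-frame positions mark video boundaries, and those boundaries are split into num_replicas groups. The sketch below illustrates this under the assumption that the frames of each video are stored contiguously and start with frame_id 0; the function name split_videos is hypothetical and the code is not the sampler's implementation.

```python
import numpy as np


def split_videos(frame_ids, num_replicas):
    """Return, for each replica, the frame indices of the whole videos assigned to it."""
    frame_ids = np.asarray(frame_ids)
    # Positions where a new video starts.
    starts = np.flatnonzero(frame_ids == 0)
    if len(starts) < num_replicas:
        raise ValueError("fewer videos than replicas")
    # Split the videos (not the frames) into num_replicas groups.
    chunks = np.array_split(starts, num_replicas)
    bounds = [int(chunk[0]) for chunk in chunks] + [len(frame_ids)]
    return [list(range(bounds[r], bounds[r + 1])) for r in range(num_replicas)]
```

Splitting at video boundaries keeps all frames of one video on the same gpu, which matters for trackers that carry state from frame to frame.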

mmtrack.models

mot

class mmtrack.models.mot.BaseMultiObjectTracker[source]

Base class for multiple object tracking.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

forward(img, img_metas, return_loss=True, **kwargs)[source]

Calls either forward_train() or forward_test() depending on whether return_loss is True.

Note this setting will change the expected inputs. When return_loss=True, img and img_metas are single-nested (i.e. Tensor and List[dict]), and when return_loss=False, img and img_metas should be double nested (i.e. List[Tensor], List[List[dict]]), with the outer list indicating test time augmentations.

forward_test(imgs, img_metas, **kwargs)[source]
Parameters
  • imgs (List[Tensor]) – the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch.

  • img_metas (List[List[dict]]) – the outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.

abstract forward_train(imgs, img_metas, **kwargs)[source]
Parameters
  • img (list[Tensor]) – List of tensors of shape (1, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.

  • kwargs (keyword arguments) – Specific to concrete implementation.

freeze_module(module)[source]

Freeze module during training.

init_module(module_name, pretrain=None)[source]

Initialize the weights of a sub-module.

Parameters
  • module (nn.Module) – A sub-module of the model.

  • pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

show_result(img, result, thickness=1, font_scale=0.5, show=False, out_file=None, wait_time=0, backend='cv2')[source]

Visualize tracking results.

Parameters
  • img (str | ndarray) – Filename or loaded image.

  • result (list[ndarray]) – Tracking results.

  • thickness (int, optional) – Thickness of lines. Defaults to 1.

  • font_scale (float, optional) – Font scales of texts. Defaults to 0.5.

  • show (bool, optional) – Whether to show the visualizations on the fly. Defaults to False.

  • out_file (str | None, optional) – Output filename. Defaults to None.

  • backend (str, optional) – Backend to draw the bounding boxes, options are cv2 and plt. Defaults to ‘cv2’.

Returns

Visualized image.

Return type

ndarray

abstract simple_test(img, img_metas, **kwargs)[source]

Test function with a single scale.

train_step(data, optimizer)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples.

  • loss is a tensor for back propagation, which can be a weighted sum of multiple losses.

  • log_vars contains all the variables to be sent to the logger.

  • num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data, optimizer)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

property with_detector

whether the framework has a detector.

Type

bool

property with_motion

whether the framework has a motion model.

Type

bool

property with_reid

whether the framework has a reid model.

Type

bool

property with_track_head

whether the framework has a track_head.

Type

bool

property with_tracker

whether the framework has a tracker.

Type

bool

class mmtrack.models.mot.DeepSORT(detector=None, reid=None, tracker=None, motion=None, pretrains=None)[source]

Simple online and realtime tracking with a deep association metric.

Details can be found in the DeepSORT paper (https://arxiv.org/abs/1703.07402).

forward_train(*args, **kwargs)[source]

Forward function during training.

init_weights(pretrain)[source]

Initialize the weights of the modules.

Parameters

pretrained (dict) – Path to pre-trained weights.

simple_test(img, img_metas, rescale=False, public_bboxes=None, **kwargs)[source]

Test without augmentations.

Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.

  • rescale (bool, optional) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.

  • public_bboxes (list[Tensor], optional) – Public bounding boxes from the benchmark. Defaults to None.

Returns

The tracking results.

Return type

dict[str, list(ndarray)]

class mmtrack.models.mot.Tracktor(detector=None, reid=None, tracker=None, motion=None, pretrains=None)[source]

Tracking without bells and whistles.

Details can be found in the Tracktor paper (https://arxiv.org/abs/1903.05625).

forward_train(*args, **kwargs)[source]

Forward function during training.

init_weights(pretrain)[source]

Initialize the weights of the modules.

Parameters

pretrained (dict) – Path to pre-trained weights.

simple_test(img, img_metas, rescale=False, public_bboxes=None, **kwargs)[source]

Test without augmentations.

Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.

  • rescale (bool, optional) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.

  • public_bboxes (list[Tensor], optional) – Public bounding boxes from the benchmark. Defaults to None.

Returns

The tracking results.

Return type

dict[str, list(ndarray)]

property with_cmc

whether the framework has a camera motion compensation model.

Type

bool

property with_linear_motion

whether the framework has a linear motion model.

Type

bool

sot

class mmtrack.models.sot.SiamRPN(pretrains=None, backbone=None, neck=None, head=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]

SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks.

This single object tracker is the implementation of SiamRPN++.

forward_search(x_img)[source]

Extract the features of search images.

Parameters

x_img (Tensor) – of shape (N, C, H, W) encoding input search images. Typically H and W equal to 255.

Returns

Multi level feature map of search images.

Return type

tuple(Tensor)

forward_template(z_img)[source]

Extract the features of exemplar images.

Parameters

z_img (Tensor) – of shape (N, C, H, W) encoding input exemplar images. Typically H and W equal to 127.

Returns

Multi level feature map of exemplar images.

Return type

tuple(Tensor)

forward_train(img, img_metas, gt_bboxes, search_img, search_img_metas, search_gt_bboxes, is_positive_pairs, **kwargs)[source]
Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input exemplar images. Typically H and W equal to 127.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • gt_bboxes (list[Tensor]) – Ground truth bboxes for each exemplar image with shape (1, 4) in [tl_x, tl_y, br_x, br_y] format.

  • search_img (Tensor) – of shape (N, 1, C, H, W) encoding input search images. 1 denotes there is only one search image for each exemplar image. Typically H and W equal to 255.

  • search_img_metas (list[list[dict]]) – The second list only has one element. The first list contains search image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • search_gt_bboxes (list[Tensor]) – Ground truth bboxes for each search image with shape (1, 5) in [0.0, tl_x, tl_y, br_x, br_y] format.

  • is_positive_pairs (list[bool]) – list of bool denoting whether each exemplar image and the corresponding search image form a positive pair.

Returns

a dictionary of loss components.

Return type

dict[str, Tensor]

get_cropped_img(img, center_xy, target_size, crop_size, avg_channel)[source]

Crop image.

Only used during testing.

This function mainly contains two steps:

  1. Crop img based on center center_xy and size crop_size. If the cropped image is out of boundary of img, use avg_channel to pad.

  2. Resize the cropped image to target_size.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding original input image.

  • center_xy (Tensor) – of shape (2, ) denoting the center point for cropping image.

  • target_size (int) – The output size of cropped image.

  • crop_size (Tensor) – The size for cropping image.

  • avg_channel (Tensor) – of shape (3, ) denoting the padding values.

Returns

of shape (1, C, target_size, target_size) encoding the resized cropped image.

Return type

Tensor

init(img, bbox)[source]

Initialize the single object tracker in the first frame.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding original input image.

  • bbox (Tensor) – The given instance bbox of the first frame that needs to be tracked in the following frames. The shape of the box is (4, ) with [cx, cy, w, h] format.

Returns

z_feat is a tuple[Tensor] that contains the multi level feature maps of exemplar image, avg_channel is Tensor with shape (3, ), and denotes the padding values.

Return type

tuple(z_feat, avg_channel)

init_weights(pretrain)[source]

Initialize the weights of modules in single object tracker.

Parameters

pretrained (dict) – Path to pre-trained weights.

simple_test(img, img_metas, gt_bboxes, **kwargs)[source]

Test without augmentation.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • gt_bboxes (list[Tensor]) – list of ground truth bboxes for each image with shape (1, 4) in [tl_x, tl_y, br_x, br_y] format.

Returns

The tracking results.

Return type

dict[str, ndarray]

track(img, bbox, z_feat, avg_channel)[source]

Track the box bbox of previous frame to current frame img.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding original input image.

  • bbox (Tensor) – The bbox in previous frame. The shape of the box is (4, ) in [cx, cy, w, h] format.

  • z_feat (tuple[Tensor]) – The multi level feature maps of exemplar image in the first frame.

  • avg_channel (Tensor) – of shape (3, ) denoting the padding values.

Returns

best_score is a Tensor denoting the score of best_bbox, best_bbox is a Tensor of shape (4, ) in [cx, cy, w, h] format, and denotes the best tracked bbox in current frame.

Return type

tuple(best_score, best_bbox)
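
The tracker methods take boxes in [cx, cy, w, h] format, while ground truth boxes elsewhere in this class use [tl_x, tl_y, br_x, br_y]. The following NumPy sketch converts between the two. The function names are hypothetical, and the conversion treats coordinates as continuous values without any one-pixel offsets, which is an assumption of this example.

```python
import numpy as np


def cxcywh_to_xyxy(bbox):
    """[cx, cy, w, h] -> [tl_x, tl_y, br_x, br_y]."""
    cx, cy, w, h = np.asarray(bbox, dtype=float)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])


def xyxy_to_cxcywh(bbox):
    """[tl_x, tl_y, br_x, br_y] -> [cx, cy, w, h]."""
    x1, y1, x2, y2 = np.asarray(bbox, dtype=float)
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])
```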

vid

class mmtrack.models.vid.BaseVideoDetector[source]

Base class for video object detector.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

forward(img, img_metas, return_loss=True, **kwargs)[source]

Calls either forward_train() or forward_test() depending on whether return_loss is True.

Note this setting will change the expected inputs. When return_loss=True, img and img_metas are single-nested (i.e. Tensor and List[dict]), and when return_loss=False, img and img_metas should be double nested (i.e. List[Tensor], List[List[dict]]), with the outer list indicating test time augmentations.

forward_test(imgs, img_metas, **kwargs)[source]
Parameters
  • imgs (List[Tensor]) – the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch.

  • img_metas (List[List[dict]]) – the outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.

abstract forward_train(imgs, img_metas, **kwargs)[source]
Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

freeze_module(module)[source]

Freeze module during training.

init_module(module, pretrain=None)[source]

Initialize the weights of modules in video detector.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

show_result(img, result, score_thr=0.3, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]

Draw result over img.

Parameters
  • img (str or Tensor) – The image to be displayed.

  • result (Tensor or tuple) – The results to draw over img: bbox_result or (bbox_result, segm_result).

  • score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.3.

  • bbox_color (str or tuple or Color) – Color of bbox lines.

  • text_color (str or tuple or Color) – Color of texts.

  • thickness (int) – Thickness of lines.

  • font_scale (float) – Font scales of texts.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The filename to write the image. Default: None.

Returns

The image with the results drawn on it. Only returned if neither show nor out_file is set.

Return type

Tensor

train_step(data, optimizer)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters
  • data (dict) – The output of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples.

  • loss is a tensor for back propagation, which can be a weighted sum of multiple losses.

  • log_vars contains all the variables to be sent to the logger.

  • num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data, optimizer)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook.

property with_aggregator

whether the framework has an aggregator

Type

bool

property with_detector

whether the framework has a detector

Type

bool

property with_motion

whether the framework has a motion model

Type

bool

class mmtrack.models.vid.DFF(detector, motion, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]

Deep Feature Flow for Video Recognition.

This video object detector is the implementation of DFF.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

extract_feats(img, img_metas)[source]

Extract features for img during testing.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

Returns

Multi level feature maps of img.

Return type

list[Tensor]

forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]
Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[Tensor]) – class indices corresponding to each box.

  • ref_img (Tensor) – of shape (N, 1, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 1 denotes there is only one reference image for each input image.

  • ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0 and denotes the id of the reference image for each key image.

  • ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].

  • gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.

  • gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.

  • gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.

  • proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.

  • ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox of the reference images.

  • ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.

  • ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.

  • ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.

Returns

a dictionary of loss components

Return type

dict[str, Tensor]

init_weights(pretrain)[source]

Initialize the weights of modules in video object detector.

Parameters

pretrain (dict) – Path to pre-trained weights.

simple_test(img, img_metas, proposals=None, rescale=False)[source]

Test without augmentation.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.

  • rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.

Returns

The detection results.

Return type

dict[str, list(ndarray)]

class mmtrack.models.vid.FGFA(detector, motion, aggregator, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]

Flow-Guided Feature Aggregation for Video Object Detection.

This video object detector is the implementation of FGFA.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

extract_feats(img, img_metas, ref_img, ref_img_metas)[source]

Extract features for img during testing.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_img (Tensor | None) – of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.

  • ref_img_metas (list[list[dict]] | None) – The first list only has one element. The second list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.

Returns

Multi level feature maps of img.

Return type

list[Tensor]

forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]
Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[Tensor]) – class indices corresponding to each box.

  • ref_img (Tensor) – of shape (N, 2, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 2 denotes that there are two reference images for each input image.

  • ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0 and denotes the id of the reference image for each key image.

  • ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].

  • gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.

  • gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.

  • gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.

  • proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.

  • ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox of the reference images.

  • ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.

  • ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.

  • ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.

Returns

a dictionary of loss components

Return type

dict[str, Tensor]
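
The (num_all_ref_gts, 5) layout of ref_gt_bboxes described above can be sketched with NumPy. `pack_ref_gt_bboxes` is a hypothetical helper for illustration only; the actual data pipeline may build this tensor differently.

```python
import numpy as np


def pack_ref_gt_bboxes(per_ref_bboxes):
    """Stack per-reference-image boxes into one (num_all_ref_gts, 5) array.

    Each output row is [ref_img_id, tl_x, tl_y, br_x, br_y], where
    ref_img_id starts from 0.
    """
    rows = []
    for ref_img_id, bboxes in enumerate(per_ref_bboxes):
        bboxes = np.asarray(bboxes, dtype=np.float32).reshape(-1, 4)
        ids = np.full((len(bboxes), 1), ref_img_id, dtype=np.float32)
        rows.append(np.hstack([ids, bboxes]))
    if not rows:
        return np.zeros((0, 5), dtype=np.float32)
    return np.concatenate(rows, axis=0)


# Two reference images: the first has two boxes, the second has one.
packed = pack_ref_gt_bboxes([
    [[0, 0, 10, 10], [5, 5, 20, 20]],
    [[1, 2, 3, 4]],
])
print(packed.shape)
```

The first column lets a single tensor carry boxes from several reference images while still recording which image each box belongs to.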

init_weights(pretrain)[source]

Initialize the weights of modules in video object detector.

Parameters

pretrain (dict) – Path to pre-trained weights.

simple_test(img, img_metas, ref_img=None, ref_img_metas=None, proposals=None, rescale=False)[source]

Test without augmentation.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_img (list[Tensor] | None) – The list only contains one Tensor of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.

  • ref_img_metas (list[list[list[dict]]] | None) – The first and second lists each have only one element. The third list contains image information dicts where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.

  • proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.

  • rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.

Returns

The detection results.

Return type

dict[str, list(ndarray)]

class mmtrack.models.vid.SELSA(detector, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]

Sequence Level Semantics Aggregation for Video Object Detection.

This video object detector is the implementation of SELSA.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

extract_feats(img, img_metas, ref_img, ref_img_metas)[source]

Extract features for img during testing.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_img (Tensor | None) – of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.

  • ref_img_metas (list[list[dict]] | None) – The first list only has one element. The second list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.

Returns

x is the multi-level feature maps of img, and ref_x is the multi-level feature maps of ref_img.

Return type

tuple(x, img_metas, ref_x, ref_img_metas)

forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]
Parameters
  • img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[Tensor]) – class indices corresponding to each box.

  • ref_img (Tensor) – of shape (N, 2, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 2 denotes that there are two reference images for each input image.

  • ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0 and denotes the id of the reference image for each key image.

  • ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].

  • gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.

  • gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.

  • gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.

  • proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.

  • ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox of the reference images.

  • ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.

  • ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.

  • ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.

Returns

a dictionary of loss components

Return type

dict[str, Tensor]

init_weights(pretrain)[source]

Initialize the weights of modules in video object detector.

Parameters

pretrain (dict) – Path to pre-trained weights.

simple_test(img, img_metas, ref_img=None, ref_img_metas=None, proposals=None, ref_proposals=None, rescale=False)[source]

Test without augmentation.

Parameters
  • img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

  • ref_img (list[Tensor] | None) – The list only contains one Tensor of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.

  • ref_img_metas (list[list[list[dict]]] | None) – The first and second lists each have only one element. The third list contains image information dicts where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.

  • proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.

  • rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.

Returns

The detection results.

Return type

dict[str, list(ndarray)]

aggregators

class mmtrack.models.aggregators.EmbedAggregator(num_convs=1, channels=256, kernel_size=3, norm_cfg=None, act_cfg={'type': 'ReLU'})[source]

Embedding convs to aggregate multi feature maps.

This module is proposed in “Flow-Guided Feature Aggregation for Video Object Detection” (FGFA).

Parameters
  • num_convs (int) – Number of embedding convs.

  • channels (int) – Channels of embedding convs. Defaults to 256.

  • kernel_size (int) – Kernel size of embedding convs. Defaults to 3.

  • norm_cfg (dict) – Configuration of normalization method after each conv. Defaults to None.

  • act_cfg (dict) – Configuration of activation method after each conv. Defaults to dict(type=’ReLU’).

forward(x, ref_x)[source]

Aggregate reference feature maps ref_x.

The aggregation mainly contains two steps: 1. Compute the cosine similarity between x and ref_x. 2. Use the normalized (i.e. softmax) cosine similarity to compute a weighted sum of ref_x.

Parameters
  • x (Tensor) – of shape [1, C, H, W]

  • ref_x (Tensor) – of shape [N, C, H, W]. N is the number of reference feature maps.

Returns

The aggregated feature map with shape [1, C, H, W].

Return type

Tensor
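
The two aggregation steps can be sketched in NumPy as follows. This is only an illustration of the similarity and weighting logic: it omits the learned embedding convs, and `embed_like_aggregate` and `eps` are hypothetical names, not part of the module.

```python
import numpy as np


def embed_like_aggregate(x, ref_x, eps=1e-12):
    """x: (1, C, H, W), ref_x: (N, C, H, W) -> aggregated (1, C, H, W)."""
    # Step 1: cosine similarity along the channel axis, per spatial location.
    x_unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    ref_unit = ref_x / (np.linalg.norm(ref_x, axis=1, keepdims=True) + eps)
    cos_sim = (ref_unit * x_unit).sum(axis=1, keepdims=True)  # (N, 1, H, W)

    # Step 2: softmax over the N reference maps, then a weighted sum of ref_x.
    weights = np.exp(cos_sim - cos_sim.max(axis=0, keepdims=True))
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * ref_x).sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
key_feat = rng.normal(size=(1, 4, 3, 3))
ref_feats = rng.normal(size=(5, 4, 3, 3))
aggregated = embed_like_aggregate(key_feat, ref_feats)
print(aggregated.shape)
```

Because the softmax runs over the reference axis independently at every location, each pixel of the output is a convex combination of the reference features at that pixel.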

class mmtrack.models.aggregators.SelsaAggregator(in_channels, num_attention_blocks=16)[source]

Selsa aggregator module.

This module is proposed in “Sequence Level Semantics Aggregation for Video Object Detection” (SELSA).

Parameters
  • in_channels (int) – The number of channels of the features of proposal.

  • num_attention_blocks (int) – The number of attention blocks used in selsa aggregator module. Defaults to 16.

forward(x, ref_x)[source]

Aggregate the features ref_x of reference proposals.

The aggregation mainly contains two steps: 1. Use multi-head attention to compute the weights between x and ref_x. 2. Use the normalized (i.e. softmax) weights to compute a weighted sum of ref_x.

Parameters
  • x (Tensor) – of shape [N, C]. N is the number of key frame proposals.

  • ref_x (Tensor) – of shape [M, C]. M is the number of reference frame proposals.

Returns

The aggregated features of key frame proposals with shape [N, C].

Return type

Tensor
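
A minimal NumPy sketch of the two steps is given below. It assumes scaled dot-product attention with identity projections, so the learned layers of the real module are left out; `selsa_like_aggregate` is a hypothetical name used only here.

```python
import numpy as np


def selsa_like_aggregate(x, ref_x, num_attention_blocks=16):
    """x: (N, C) key proposals, ref_x: (M, C) reference proposals -> (N, C)."""
    n, c = x.shape
    m = ref_x.shape[0]
    assert c % num_attention_blocks == 0, "C must be divisible by the block count"
    d = c // num_attention_blocks

    # Split the channels into attention blocks (heads).
    q = x.reshape(n, num_attention_blocks, d).transpose(1, 0, 2)      # (B, N, d)
    k = ref_x.reshape(m, num_attention_blocks, d).transpose(1, 0, 2)  # (B, M, d)

    # Step 1: attention weights between key and reference proposals.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)                    # (B, N, M)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 2: weighted sum of the reference features, then merge the blocks.
    out = weights @ k                                                 # (B, N, d)
    return out.transpose(1, 0, 2).reshape(n, c)


rng = np.random.default_rng(0)
key_props = rng.normal(size=(6, 32))
ref_props = rng.normal(size=(10, 32))
agg = selsa_like_aggregate(key_props, ref_props)
print(agg.shape)
```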

backbones

losses

motion

class mmtrack.models.motion.CameraMotionCompensation(warp_mode='cv2.MOTION_EUCLIDEAN', num_iters=50, stop_eps=0.001)[source]

Camera motion compensation.

Parameters
  • warp_mode (str) – Warp mode in opencv.

  • num_iters (int) – Number of the iterations.

  • stop_eps (float) – Termination threshold.

get_warp_matrix(img, ref_img)[source]

Calculate warping matrix between two images.

track(img, ref_img, tracks, num_samples, frame_id)[source]

Tracking forward.

warp_bboxes(bboxes, warp_matrix)[source]

Warp bounding boxes according to the warping matrix.
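
As an illustration of warping boxes with a warping matrix, the sketch below applies a 2x3 affine matrix to the top-left and bottom-right corners of each box. The (2, 3) matrix layout, the [tl_x, tl_y, br_x, br_y] box format, and the name `warp_bboxes_sketch` are assumptions for this example; the library method may differ, for instance in the direction of the warp.

```python
import numpy as np


def warp_bboxes_sketch(bboxes, warp_matrix):
    """Apply a (2, 3) affine matrix to the corners of (n, 4) boxes."""
    bboxes = np.asarray(bboxes, dtype=np.float64).reshape(-1, 4)
    warp_matrix = np.asarray(warp_matrix, dtype=np.float64)
    ones = np.ones((len(bboxes), 1))
    # Homogeneous coordinates [x, y, 1] for both corners.
    tl = np.hstack([bboxes[:, 0:2], ones])
    br = np.hstack([bboxes[:, 2:4], ones])
    return np.hstack([tl @ warp_matrix.T, br @ warp_matrix.T])


# A pure translation by (+5, -2) pixels.
translation = np.array([[1.0, 0.0, 5.0],
                        [0.0, 1.0, -2.0]])
warped = warp_bboxes_sketch([[0, 0, 10, 10]], translation)
print(warped)
```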

class mmtrack.models.motion.FlowNetSimple(img_scale_factor, out_indices=[2, 3, 4, 5, 6], flow_scale_factor=5.0, flow_img_norm_std=[255.0, 255.0, 255.0], flow_img_norm_mean=[0.411, 0.432, 0.45])[source]

The simple version of FlowNet.

This FlowNetSimple is the implementation of FlowNetSimple.

Parameters
  • img_scale_factor (float) – Used to upsample/downsample the image.

  • out_indices (list) – The indices of outputting feature maps after each group of conv layers. Defaults to [2, 3, 4, 5, 6].

  • flow_scale_factor (float) – Used to enlarge the values of flow. Defaults to 5.0.

  • flow_img_norm_std (list) – Used to scale the values of image. Defaults to [255.0, 255.0, 255.0].

  • flow_img_norm_mean (list) – Used to center the values of image. Defaults to [0.411, 0.432, 0.450].

crop_like(input, target)[source]

Crop input as the size of target.

forward(imgs, img_metas)[source]

Compute the flow of image pairs.

Parameters
  • imgs (Tensor) – of shape (N, 6, H, W) encoding input image pairs. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

Returns

of shape (N, 2, H, W) encoding the flow of image pairs.

Return type

Tensor

init_weights()[source]

Initialize the weights of FlowNetSimple.

prepare_imgs(imgs, img_metas)[source]

Preprocess image pairs for computing flow.

Parameters
  • imgs (Tensor) – of shape (N, 6, H, W) encoding input image pairs. Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.

Returns

of shape (N, 6, H, W) encoding the input image pairs for FlowNetSimple.

Return type

Tensor

class mmtrack.models.motion.KalmanFilter(center_only=False)[source]

A simple Kalman filter for tracking bounding boxes in image space.

The implementation refers to https://github.com/nwojke/deep_sort.

gating_distance(mean, covariance, measurements, only_position=False)[source]

Compute gating distance between state distribution and measurements.

A suitable distance threshold can be obtained from chi2inv95. If only_position is False, the chi-square distribution has 4 degrees of freedom, otherwise 2.

Parameters
  • mean (ndarray) – Mean vector over the state distribution (8 dimensional).

  • covariance (ndarray) – Covariance of the state distribution (8x8 dimensional).

  • measurements (ndarray) – An Nx4 dimensional matrix of N measurements, each in format (x, y, a, h) where (x, y) is the bounding box center position, a the aspect ratio, and h the height.

  • only_position (bool, optional) – If True, distance computation is done with respect to the bounding box center position only. Defaults to False.

Returns

Returns an array of length N, where the i-th element contains the squared Mahalanobis distance between (mean, covariance) and measurements[i].

Return type

ndarray
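
The gating test can be sketched as follows. The sketch assumes that `mean` and `covariance` have already been projected to the 4-dimensional measurement space, and `squared_mahalanobis` is a hypothetical helper. The thresholds are the 0.95 quantiles of the chi-square distribution with 2 and 4 degrees of freedom.

```python
import numpy as np

# 0.95 quantiles of the chi-square distribution, keyed by degrees of freedom.
CHI2INV95 = {2: 5.9915, 4: 9.4877}


def squared_mahalanobis(mean, covariance, measurements):
    """Squared Mahalanobis distance of each measurement row to (mean, covariance)."""
    diff = np.asarray(measurements, dtype=float) - mean
    # Solve with the Cholesky factor instead of inverting the covariance.
    chol = np.linalg.cholesky(covariance)
    z = np.linalg.solve(chol, diff.T)
    return np.sum(z * z, axis=0)


mean = np.zeros(4)
covariance = 4.0 * np.eye(4)
measurements = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0, 4.0],
])
distances = squared_mahalanobis(mean, covariance, measurements)
# Keep only measurements inside the 95% gate for 4 degrees of freedom.
keep = distances <= CHI2INV95[4]
print(distances, keep)
```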

initiate(measurement)[source]

Create track from unassociated measurement.

Parameters
  • measurement (ndarray) – Bounding box coordinates (x, y, a, h) with center position (x, y), aspect ratio a, and height h.

Returns

Returns the mean vector (8 dimensional) and covariance matrix (8x8 dimensional) of the new track. Unobserved velocities are initialized to 0 mean.

Return type

(ndarray, ndarray)

predict(mean, covariance)[source]

Run Kalman filter prediction step.

Parameters
  • mean (ndarray) – The 8 dimensional mean vector of the object state at the previous time step.

  • covariance (ndarray) – The 8x8 dimensional covariance matrix of the object state at the previous time step.

Returns

Returns the mean vector and covariance matrix of the predicted state. Unobserved velocities are initialized to 0 mean.

Return type

(ndarray, ndarray)
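
A prediction step for an 8-dimensional state can be sketched as below, assuming a constant-velocity model over the state (x, y, a, h, vx, vy, va, vh). The `process_noise` argument and the name `kalman_predict` are assumptions for this example; how the library chooses its noise terms is not shown here.

```python
import numpy as np


def kalman_predict(mean, covariance, process_noise, dt=1.0):
    """Constant-velocity prediction: mean' = F mean, cov' = F cov F^T + Q."""
    ndim = 4
    # Each position component is advanced by dt times its velocity.
    motion_mat = np.eye(2 * ndim)
    motion_mat[np.arange(ndim), ndim + np.arange(ndim)] = dt
    new_mean = motion_mat @ mean
    new_covariance = motion_mat @ covariance @ motion_mat.T + process_noise
    return new_mean, new_covariance


state = np.array([10.0, 20.0, 0.5, 40.0, 1.0, -2.0, 0.0, 0.5])
pred_mean, pred_cov = kalman_predict(state, np.eye(8), np.zeros((8, 8)))
print(pred_mean)
```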

project(mean, covariance)[source]

Project state distribution to measurement space.

Parameters
  • mean (ndarray) – The state’s mean vector (8 dimensional array).

  • covariance (ndarray) – The state’s covariance matrix (8x8 dimensional).

Returns

Returns the projected mean and covariance matrix of the given state estimate.

Return type

(ndarray, ndarray)

track(tracks, bboxes)[source]

Track forward.

Parameters
  • tracks (dict[int, dict]) – Track buffer.

  • bboxes (Tensor) – Detected bounding boxes.

Returns

Updated tracks and bboxes.

Return type

(dict[int, dict], Tensor)

update(mean, covariance, measurement)[source]

Run Kalman filter correction step.

Parameters
  • mean (ndarray) – The predicted state’s mean vector (8 dimensional).

  • covariance (ndarray) – The state’s covariance matrix (8x8 dimensional).

  • measurement (ndarray) – The 4 dimensional measurement vector (x, y, a, h), where (x, y) is the center position, a the aspect ratio, and h the height of the bounding box.

Returns

Returns the measurement-corrected state distribution.

Return type

(ndarray, ndarray)
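
The correction step follows the standard Kalman update equations. The sketch below assumes the measurement observes the first four state components directly and takes the measurement noise as an explicit argument; `kalman_update` and `measurement_noise` are hypothetical names for this example.

```python
import numpy as np


def kalman_update(mean, covariance, measurement, measurement_noise):
    """Standard Kalman correction for an 8-dim state and a 4-dim measurement."""
    ndim = measurement.shape[0]
    # Observation matrix: pick (x, y, a, h) out of the 8-dim state.
    obs_mat = np.hstack([np.eye(ndim), np.zeros((ndim, ndim))])

    # Project the state distribution to measurement space.
    projected_mean = obs_mat @ mean
    innovation_cov = obs_mat @ covariance @ obs_mat.T + measurement_noise

    # Kalman gain K = P H^T S^-1, computed with a linear solve.
    gain = np.linalg.solve(innovation_cov, obs_mat @ covariance).T

    new_mean = mean + gain @ (measurement - projected_mean)
    new_covariance = covariance - gain @ innovation_cov @ gain.T
    return new_mean, new_covariance


corrected_mean, corrected_cov = kalman_update(
    np.zeros(8), np.eye(8), np.array([2.0, 4.0, 6.0, 8.0]), np.eye(4))
print(corrected_mean)
```

With equal state and measurement uncertainty, as in this usage example, the corrected position lands halfway between the prediction and the measurement.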

class mmtrack.models.motion.LinearMotion(num_samples=2, center_motion=False)[source]

Linear motion while tracking.

Parameters
  • num_samples (int, optional) – Number of samples to calculate the velocity. Defaults to 2.

  • center_motion (bool, optional) – Whether to use the center location instead of the bounding box location to estimate the velocity. Defaults to False.

center(bbox)[source]

Get the center of the box.

get_velocity(bboxes, num_samples=None)[source]

Get velocities of the input objects.

step(bboxes, velocity=None)[source]

Step forward with the velocity.

track(tracks, frame_id)[source]

Tracking forward.
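
One plausible reading of the linear motion model is sketched below: the velocity is the mean frame-to-frame difference over the last num_samples boxes, and a step adds that velocity to the latest box. `get_velocity_sketch` and `step_sketch` are hypothetical names, and the averaging scheme is an assumption rather than the library's confirmed behavior.

```python
import numpy as np


def get_velocity_sketch(bboxes, num_samples=2):
    """Mean per-frame displacement over the last `num_samples` boxes."""
    recent = np.asarray(bboxes, dtype=float)[-num_samples:]
    if len(recent) < 2:
        # Not enough history to estimate motion.
        return np.zeros(recent.shape[-1])
    return np.diff(recent, axis=0).mean(axis=0)


def step_sketch(bbox, velocity):
    """Move a box forward by one frame."""
    return np.asarray(bbox, dtype=float) + velocity


history = [[0, 0, 10, 10], [2, 1, 12, 11], [4, 2, 14, 12]]
velocity = get_velocity_sketch(history, num_samples=3)
next_bbox = step_sketch(history[-1], velocity)
print(velocity, next_bbox)
```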

reid

class mmtrack.models.reid.BaseReID(*args: Any, **kwargs: Any)[source]

Base class for re-identification.

forward_train(*args, **kwargs)[source]

Training forward function.

simple_test(img)[source]

Test without augmentation.

class mmtrack.models.reid.FcModule(in_channels, out_channels, norm_cfg=None, act_cfg={'type': 'ReLU'}, inplace=True)[source]

Fully-connected layer module.

Parameters
  • in_channels (int) – Input channels.

  • out_channels (int) – Output channels.

  • norm_cfg (dict, optional) – Configuration of normalization method after fc. Defaults to None.

  • act_cfg (dict, optional) – Configuration of activation method after fc. Defaults to dict(type=’ReLU’).

  • inplace (bool, optional) – Whether to set the activation module to in-place mode. Defaults to True.

forward(x, activate=True, norm=True)[source]

Model forward.

init_weights()[source]

Initialize weights.

property norm

Normalization.

roi_heads

track_heads

utils