API Reference¶
mmtrack.apis¶
- mmtrack.apis.inference_mot(model, img, frame_id)[source]¶
Run inference on image(s) with the MOT model.
- Parameters
model (nn.Module) – The loaded MOT model.
img (str | ndarray) – Either an image filename or a loaded image.
frame_id (int) – Frame id.
- Returns
The tracking results.
- Return type
dict[str, ndarray]
- mmtrack.apis.inference_sot(model, image, init_bbox, frame_id)[source]¶
Run inference on an image with the single object tracker.
- Parameters
model (nn.Module) – The loaded tracker.
image (ndarray) – Loaded image.
init_bbox (ndarray) – The initial bounding box of the target to be tracked.
frame_id (int) – Frame id.
- Returns
The tracking results.
- Return type
dict[str, ndarray]
- mmtrack.apis.inference_vid(model, image, frame_id, ref_img_sampler={'frame_stride': 10, 'num_left_ref_imgs': 10})[source]¶
Run inference on an image with the video object detector.
- Parameters
model (nn.Module) – The loaded detector.
image (ndarray) – Loaded image.
frame_id (int) – Frame id.
ref_img_sampler (dict) – The configuration for sampling reference images. Only used by FGFA-style video detectors. Defaults to dict(frame_stride=10, num_left_ref_imgs=10).
- Returns
The detection results.
- Return type
dict[str, ndarray]
- mmtrack.apis.init_model(config, checkpoint=None, device='cuda:0', cfg_options=None)[source]¶
Initialize a model from config file.
- Parameters
config (str or mmcv.Config) – Config file path or the config object.
checkpoint (str, optional) – Checkpoint path. Defaults to None.
cfg_options (dict, optional) – Options to override some settings in the used config. Defaults to None.
- Returns
The constructed detector.
- Return type
nn.Module
- mmtrack.apis.multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=False)[source]¶
Test a model with multiple GPUs.
This method tests the model with multiple GPUs and collects the results in one of two modes: GPU or CPU. With ‘gpu_collect=True’, it encodes the results as GPU tensors and uses GPU communication to collect them. In CPU mode, each GPU saves its results to ‘tmpdir’ and the rank 0 worker collects them. ‘gpu_collect=True’ is not supported for now.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.
tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode. Defaults to None.
gpu_collect (bool) – Option to use either gpu or cpu to collect results. Defaults to False.
- Returns
The prediction results.
- Return type
dict[str, list]
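The CPU collection mode described above can be pictured with a single-process simulation. This is only a sketch under assumptions: the file naming (`part_{rank}.pkl`), the helper names `dump_part` and `collect_parts`, the `track_results` key, and the per-key concatenation order are all illustrative and are not taken from the mmtrack source.

```python
import os
import pickle
import tempfile


def dump_part(results, rank, tmpdir):
    """Each worker writes its partial results into the shared tmpdir."""
    with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump(results, f)


def collect_parts(world_size, tmpdir):
    """Rank 0 loads every part and concatenates the lists key by key."""
    merged = {}
    for rank in range(world_size):
        with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "rb") as f:
            part = pickle.load(f)
        for key, values in part.items():
            merged.setdefault(key, []).extend(values)
    return merged


with tempfile.TemporaryDirectory() as tmpdir:
    # Two simulated workers, each holding the results of different videos.
    dump_part({"track_results": ["video0_frame0", "video0_frame1"]}, 0, tmpdir)
    dump_part({"track_results": ["video1_frame0"]}, 1, tmpdir)
    collected = collect_parts(world_size=2, tmpdir=tmpdir)
```

In a real distributed run the workers would also need a barrier between writing and reading; that synchronization is left out here.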
- mmtrack.apis.single_gpu_test(model, data_loader, show=False, out_dir=None, show_score_thr=0.3)[source]¶
Test a model with a single GPU.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.
show (bool) – If True, visualize the prediction results (Not supported for now). Defaults to False.
out_dir (str) – Path of directory to save the visualization results (Not supported for now). Defaults to None.
show_score_thr (float) – The score threshold of visualization (Not supported for now). Defaults to 0.3.
- Returns
The prediction results.
- Return type
dict[str, list]
- mmtrack.apis.train_model(model, dataset, cfg, distributed=False, validate=False, timestamp=None, meta=None)[source]¶
Train model entry function.
- Parameters
model (nn.Module) – The model to be trained.
dataset (Dataset) – Train dataset.
cfg (dict) – The config dict for training.
distributed (bool) – Whether to use distributed training. Default: False.
validate (bool) – Whether to do evaluation. Default: False.
timestamp (str | None) – Local time for runner. Default: None.
meta (dict | None) – Meta dict to record some important information. Default: None.
mmtrack.core¶
anchor¶
evaluation¶
- class mmtrack.core.evaluation.DistEvalHook(*args: Any, **kwargs: Any)[source]¶
Please refer to mmdet.core.evaluation.eval_hooks.py:DistEvalHook for detailed docstring.
- class mmtrack.core.evaluation.EvalHook(*args: Any, **kwargs: Any)[source]¶
Please refer to mmdet.core.evaluation.eval_hooks.py:EvalHook for detailed docstring.
- mmtrack.core.evaluation.eval_mot(results, annotations, logger=None, classes=None, iou_thr=0.5, ignore_iof_thr=0.5, ignore_by_classes=False, nproc=4)[source]¶
Evaluate CLEAR MOT metrics.
- Parameters
results (list[list[list[ndarray]]]) – The first list indexes videos, the second list indexes images, and the third list indexes categories. Each ndarray holds the tracking results.
annotations (list[list[dict]]) –
The first list indexes videos and the second list indexes images. Each dict holds the annotations of one image. Keys of the annotations are:
bboxes: numpy array of shape (n, 4)
labels: numpy array of shape (n, )
instance_ids: numpy array of shape (n, )
bboxes_ignore (optional): numpy array of shape (k, 4)
labels_ignore (optional): numpy array of shape (k, )
logger (logging.Logger | str | None, optional) – The way to print the evaluation results. Defaults to None.
classes (list, optional) – Classes in the dataset. Defaults to None.
iou_thr (float, optional) – IoU threshold for evaluation. Defaults to 0.5.
ignore_iof_thr (float, optional) – IoF threshold to ignore results. Defaults to 0.5.
ignore_by_classes (bool, optional) – Whether to ignore the results by classes or not. Defaults to False.
nproc (int, optional) – Number of the processes. Defaults to 4.
- Returns
Evaluation results.
- Return type
dict[str, float]
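For orientation, the headline CLEAR MOT score, MOTA, is computed from accumulated error counts as 1 - (FN + FP + IDSW) / GT. The sketch below only illustrates that formula; the function name `mota` and its argument names are hypothetical, and it does not reproduce the box matching that `eval_mot` performs to obtain the counts.

```python
def mota(num_false_negatives, num_false_positives, num_id_switches, num_gt):
    """Multiple Object Tracking Accuracy from accumulated error counts."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = num_false_negatives + num_false_positives + num_id_switches
    return 1.0 - errors / num_gt


# 10 misses, 5 false positives and 5 identity switches over 100 ground-truth boxes.
score = mota(10, 5, 5, 100)
```

Note that MOTA becomes negative when the total number of errors exceeds the number of ground-truth boxes.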
- mmtrack.core.evaluation.eval_sot_ope(results, annotations)[source]¶
Evaluate under the OPE (one-pass evaluation) protocol.
- Parameters
results (list[list[ndarray]]) – The first list contains the tracking results of each video. The second list contains the tracking results of each frame in one video. The ndarray denotes the tracking box in [tl_x, tl_y, br_x, br_y] format.
annotations (list[list[dict]]) – The first list contains the annotations of each video. The second list contains the annotations of each frame in one video. The dict contains the annotation information of one frame.
- Returns
OPE-style evaluation metrics (i.e. success, normalized precision, and precision).
- Return type
dict[str, float]
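As a rough illustration of the success metric, the sketch below computes the IoU between predicted and ground-truth boxes in [tl_x, tl_y, br_x, br_y] format and the fraction of frames whose IoU exceeds one threshold. It is a simplification under assumptions: the helper names `iou` and `success_rate` are hypothetical, and a full success score is usually averaged over many thresholds rather than taken at a single one.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [tl_x, tl_y, br_x, br_y]."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_rate(pred_boxes, gt_boxes, iou_thr):
    """Fraction of frames whose predicted box overlaps the target by more than iou_thr."""
    if not pred_boxes:
        return 0.0
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o > iou_thr for o in overlaps) / len(overlaps)


preds = [[0, 0, 10, 10], [0, 0, 10, 10]]
gts = [[0, 0, 10, 10], [5, 0, 15, 10]]
rate = success_rate(preds, gts, iou_thr=0.5)
```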
motion¶
optimizer¶
- class mmtrack.core.optimizer.SiameseRPNLrUpdaterHook(lr_configs=[{'type': 'step', 'start_lr_factor': 0.2, 'end_lr_factor': 1.0, 'end_epoch': 5}, {'type': 'log', 'start_lr_factor': 1.0, 'end_lr_factor': 0.1, 'end_epoch': 20}], **kwargs)[source]¶
Learning rate updater for Siamese RPN.
- Parameters
lr_configs (list[dict]) – List of dicts where each dict denotes the configuration of a specific learning rate updater and must have ‘type’.
- class mmtrack.core.optimizer.SiameseRPNOptimizerHook(backbone_start_train_epoch, backbone_train_layers, **kwargs)[source]¶
Optimizer hook for Siamese RPN.
- Parameters
backbone_start_train_epoch (int) – Start to train the backbone at the backbone_start_train_epoch-th epoch. Note that epochs in this class count from 0, while epochs in the log file count from 1.
backbone_train_layers (list(str)) – List of str denoting the backbone stages that need to be trained.
track¶
- mmtrack.core.track.depthwise_correlation(x, kernel)[source]¶
Depthwise cross correlation.
This function is proposed in SiamRPN++.
- Parameters
x (Tensor) – of shape (N, C, H_x, W_x).
kernel (Tensor) – of shape (N, C, H_k, W_k).
- Returns
Tensor of shape (N, C, H_o, W_o), where H_o = H_x - H_k + 1 and W_o = W_x - W_k + 1.
- Return type
Tensor
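The output shape can be checked with a dependency-free sketch of depthwise cross correlation on nested lists: each channel of `x` is correlated only with the matching channel of `kernel`, without padding. The helper names are hypothetical, and the library function operates on Tensors rather than on Python lists.

```python
def correlate_valid(x, k):
    """Cross-correlate one 2-D map x with one 2-D kernel k, without padding."""
    h_k, w_k = len(k), len(k[0])
    h_o, w_o = len(x) - h_k + 1, len(x[0]) - w_k + 1
    return [
        [
            sum(x[i + di][j + dj] * k[di][dj] for di in range(h_k) for dj in range(w_k))
            for j in range(w_o)
        ]
        for i in range(h_o)
    ]


def depthwise_correlation_lists(x, kernel):
    """x: [N][C][H_x][W_x], kernel: [N][C][H_k][W_k] -> [N][C][H_o][W_o]."""
    return [
        [correlate_valid(x_c, k_c) for x_c, k_c in zip(x_n, k_n)]
        for x_n, k_n in zip(x, kernel)
    ]


x = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]  # N=1, C=1, H_x=W_x=3
kernel = [[[[1, 0], [0, 1]]]]              # N=1, C=1, H_k=W_k=2
out = depthwise_correlation_lists(x, kernel)  # H_o = W_o = 3 - 2 + 1 = 2
```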
- mmtrack.core.track.embed_similarity(key_embeds, ref_embeds, method='dot_product', temperature=-1, transpose=True)[source]¶
Calculate feature similarity from embeddings.
- Parameters
key_embeds (Tensor) – Shape (N1, C).
ref_embeds (Tensor) – Shape (N2, C) or (C, N2).
method (str, optional) – Method to calculate the similarity, options are ‘dot_product’ and ‘cosine’. Defaults to ‘dot_product’.
temperature (int, optional) – Softmax temperature. Defaults to -1.
transpose (bool, optional) – Whether to transpose ref_embeds. Defaults to True.
- Returns
Similarity matrix of shape (N1, N2).
- Return type
Tensor
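The two similarity methods differ only in whether the embeddings are L2-normalized before the matrix product. A minimal sketch on plain lists follows; the helper names are hypothetical, `ref_embeds` is taken in (N2, C) layout, and the `temperature` and `transpose` options are deliberately left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def similarity_matrix(key_embeds, ref_embeds, method="dot_product"):
    """Return an (N1, N2) similarity matrix for (N1, C) keys and (N2, C) references."""
    if method not in ("dot_product", "cosine"):
        raise ValueError(f"unsupported method: {method}")
    if method == "cosine":
        key_embeds = [l2_normalize(k) for k in key_embeds]
        ref_embeds = [l2_normalize(r) for r in ref_embeds]
    return [[sum(a * b for a, b in zip(k, r)) for r in ref_embeds] for k in key_embeds]


keys = [[3.0, 4.0]]
refs = [[3.0, 4.0], [4.0, -3.0]]
dot = similarity_matrix(keys, refs, method="dot_product")
cos = similarity_matrix(keys, refs, method="cosine")
```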
- mmtrack.core.track.imrenormalize(img, img_norm_cfg, new_img_norm_cfg)[source]¶
Re-normalize the image.
- Parameters
img (Tensor | ndarray) – Input image. If the input is a Tensor, the shape is (1, C, H, W). If the input is an ndarray, the shape is (H, W, C).
img_norm_cfg (dict) – Original configuration for the normalization.
new_img_norm_cfg (dict) – New configuration for the normalization.
- Returns
Output image with the same type and shape as the input.
- Return type
Tensor | ndarray
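Conceptually, re-normalization undoes the original normalization and then applies the new one. The per-pixel sketch below assumes that both configs carry per-channel `mean` and `std` entries; the helper name `renormalize_pixel` is hypothetical, and any channel reordering option in the config is ignored.

```python
def renormalize_pixel(pixel, img_norm_cfg, new_img_norm_cfg):
    """Map one normalized pixel from the old normalization to the new one."""
    out = []
    for value, mean, std, new_mean, new_std in zip(
        pixel,
        img_norm_cfg["mean"],
        img_norm_cfg["std"],
        new_img_norm_cfg["mean"],
        new_img_norm_cfg["std"],
    ):
        raw = value * std + mean                 # undo the original normalization
        out.append((raw - new_mean) / new_std)   # apply the new normalization
    return out


old_cfg = {"mean": [0.0, 0.0, 0.0], "std": [1.0, 1.0, 1.0]}
new_cfg = {"mean": [100.0, 100.0, 100.0], "std": [50.0, 50.0, 50.0]}
pixel = renormalize_pixel([150.0, 100.0, 50.0], old_cfg, new_cfg)
```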
- mmtrack.core.track.restore_result(result, return_ids=False)[source]¶
Restore per-category results (a list with one entry per category) into the format produced by the model's forward pass.
- Parameters
result (list[ndarray]) – shape (n, 5) or (n, 6)
return_ids (bool, optional) – Whether the input contains tracking results. Defaults to False.
- Returns
Tracking results of each class.
- Return type
tuple
- mmtrack.core.track.track2result(bboxes, labels, ids, num_classes)[source]¶
Convert tracking results to a list of numpy arrays.
- Parameters
bboxes (torch.Tensor | np.ndarray) – shape (n, 5)
labels (torch.Tensor | np.ndarray) – shape (n, )
ids (torch.Tensor | np.ndarray) – shape (n, )
num_classes (int) – Number of classes, including the background class.
- Returns
Tracking results of each class.
- Return type
list(ndarray)
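The relationship between `track2result` and `restore_result` can be sketched as grouping rows by class and flattening them back. This sketch works on plain lists and assumes that the track id is stored as the first column of each row, which yields the (n, 6) layout mentioned above; the helper names and the column order are assumptions, not confirmed library behavior.

```python
def track2result_lists(bboxes, labels, ids, num_classes):
    """Group [id, x1, y1, x2, y2, score] rows by class label."""
    results = [[] for _ in range(num_classes)]
    for bbox, label, track_id in zip(bboxes, labels, ids):
        results[label].append([track_id] + list(bbox))
    return results


def restore_result_lists(result):
    """Flatten per-class rows back into bboxes, labels and ids."""
    bboxes, labels, ids = [], [], []
    for label, rows in enumerate(result):
        for row in rows:
            ids.append(row[0])
            bboxes.append(row[1:])
            labels.append(label)
    return bboxes, labels, ids


bboxes = [[0, 0, 10, 10, 0.9], [5, 5, 20, 20, 0.8]]
per_class = track2result_lists(bboxes, labels=[1, 0], ids=[7, 3], num_classes=2)
restored = restore_result_lists(per_class)
```

After the round trip the rows come back ordered by class rather than in their original order.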
utils¶
- mmtrack.core.utils.crop_image(image, crop_region, crop_size, padding=(0, 0, 0))[source]¶
Crop image based on crop_region and crop_size.
- Parameters
image (ndarray) – of shape (H, W, 3).
crop_region (ndarray) – of shape (4, ) in [x1, y1, x2, y2] format.
crop_size (int) – Crop size.
padding (tuple | ndarray) – of shape (3, ) denoting the padding values.
- Returns
Cropped image of shape (crop_size, crop_size, 3).
- Return type
ndarray
mmtrack.datasets¶
datasets¶
- class mmtrack.datasets.CocoVID(*args: Any, **kwargs: Any)[source]¶
Inherits the official COCO class in order to parse the annotations of bbox-related video tasks.
- Parameters
annotation_file (str) – Location of the annotation file. Defaults to None.
load_img_as_vid (bool) – If True, convert image data to video data, which means each image is converted to a video. Defaults to False.
- get_img_ids_from_ins_id(insId)[source]¶
Get image ids from given instance id.
- Parameters
insId (int) – The given instance id.
- Returns
Image ids of given instance id.
- Return type
list[int]
- get_img_ids_from_vid(vidId)[source]¶
Get image ids from given video id.
- Parameters
vidId (int) – The given video id.
- Returns
Image ids of given video id.
- Return type
list[int]
- get_ins_ids_from_vid(vidId)[source]¶
Get instance ids from given video id.
- Parameters
vidId (int) – The given video id.
- Returns
Instance ids of given video id.
- Return type
list[int]
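The three lookup methods above amount to queries against indexes built once from the annotation file. The following sketch shows one way such indexes could be built; the helper name `build_video_index` and the field names `video_id`, `image_id`, and `instance_id` are assumptions made for illustration.

```python
from collections import defaultdict


def build_video_index(images, annotations):
    """Build video -> images, video -> instances and instance -> images maps."""
    vid_to_imgs = defaultdict(list)
    vid_to_ins = defaultdict(list)
    ins_to_imgs = defaultdict(list)
    img_to_vid = {img["id"]: img["video_id"] for img in images}
    for img in images:
        vid_to_imgs[img["video_id"]].append(img["id"])
    for ann in annotations:
        video_id = img_to_vid[ann["image_id"]]
        if ann["instance_id"] not in vid_to_ins[video_id]:
            vid_to_ins[video_id].append(ann["instance_id"])
        if ann["image_id"] not in ins_to_imgs[ann["instance_id"]]:
            ins_to_imgs[ann["instance_id"]].append(ann["image_id"])
    return vid_to_imgs, vid_to_ins, ins_to_imgs


images = [
    {"id": 1, "video_id": 10},
    {"id": 2, "video_id": 10},
    {"id": 3, "video_id": 11},
]
annotations = [
    {"image_id": 1, "instance_id": 100},
    {"image_id": 2, "instance_id": 100},
    {"image_id": 3, "instance_id": 200},
]
vid_to_imgs, vid_to_ins, ins_to_imgs = build_video_index(images, annotations)
```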
- mmtrack.datasets.build_dataloader(dataset, samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, **kwargs)[source]¶
Build PyTorch DataLoader.
In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.
- Parameters
dataset (Dataset) – A PyTorch dataset.
samples_per_gpu (int) – Number of training samples on each GPU, i.e., batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
kwargs – Any keyword arguments used to initialize the DataLoader.
- Returns
A PyTorch dataloader.
- Return type
DataLoader
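The distinction between distributed and non-distributed training mainly affects the effective batch size and worker count of the loader. The sketch below shows the usual arithmetic, assuming the convention that a single non-distributed loader serves all GPUs; the helper name `loader_sizes` is hypothetical.

```python
def loader_sizes(samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for the underlying DataLoader."""
    if dist:
        # One loader per process: each loader feeds exactly one GPU.
        return samples_per_gpu, workers_per_gpu
    # One loader for all GPUs: it must produce enough samples for every GPU.
    return num_gpus * samples_per_gpu, num_gpus * workers_per_gpu


dist_sizes = loader_sizes(2, 4, num_gpus=8, dist=True)
non_dist_sizes = loader_sizes(2, 4, num_gpus=8, dist=False)
```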
parsers¶
- class mmtrack.datasets.parsers.CocoVID(*args: Any, **kwargs: Any)[source]¶
Inherits the official COCO class in order to parse the annotations of bbox-related video tasks.
- Parameters
annotation_file (str) – Location of the annotation file. Defaults to None.
load_img_as_vid (bool) – If True, convert image data to video data, which means each image is converted to a video. Defaults to False.
- get_img_ids_from_ins_id(insId)[source]¶
Get image ids from given instance id.
- Parameters
insId (int) – The given instance id.
- Returns
Image ids of given instance id.
- Return type
list[int]
- get_img_ids_from_vid(vidId)[source]¶
Get image ids from given video id.
- Parameters
vidId (int) – The given video id.
- Returns
Image ids of given video id.
- Return type
list[int]
- get_ins_ids_from_vid(vidId)[source]¶
Get instance ids from given video id.
- Parameters
vidId (int) – The given video id.
- Returns
Instance ids of given video id.
- Return type
list[int]
pipelines¶
samplers¶
- class mmtrack.datasets.samplers.DistributedVideoSampler(dataset, num_replicas=None, rank=None, shuffle=False)[source]¶
Distribute videos across multiple GPUs during testing.
- Parameters
dataset (Dataset) – Test dataset that must have a data_infos attribute. Each data_info in data_infos records information of one frame, and each video must have one data_info with data_info[‘frame_id’] == 0.
num_replicas (int) – The number of GPUs. Defaults to None.
rank (int) – GPU rank id. Defaults to None.
shuffle (bool) – If True, shuffle the dataset. Defaults to False.
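The requirement that every video contains a frame with frame_id == 0 makes it possible to find video boundaries and hand whole videos to each GPU. The sketch below illustrates that idea with contiguous chunks of videos; the helper name `split_videos` and the exact chunking rule are assumptions, not the sampler's confirmed behavior.

```python
def split_videos(frame_ids, num_replicas):
    """Assign whole videos to ranks; returns one list of frame indices per rank."""
    # A frame with frame_id == 0 marks the start of a new video.
    starts = [i for i, frame_id in enumerate(frame_ids) if frame_id == 0]
    bounds = starts + [len(frame_ids)]
    videos = [list(range(bounds[i], bounds[i + 1])) for i in range(len(starts))]

    # Split the videos into contiguous chunks whose sizes differ by at most one.
    base, extra = divmod(len(videos), num_replicas)
    per_rank, cursor = [], 0
    for rank in range(num_replicas):
        count = base + (1 if rank < extra else 0)
        chunk = videos[cursor:cursor + count]
        per_rank.append([idx for video in chunk for idx in video])
        cursor += count
    return per_rank


# Three videos with 3, 2 and 4 frames, split across 2 GPUs.
frame_ids = [0, 1, 2, 0, 1, 0, 1, 2, 3]
indices_per_rank = split_videos(frame_ids, num_replicas=2)
```

Keeping each video on a single GPU matters for trackers, since the frames of one video have to be processed in order by the same model instance.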
mmtrack.models¶
mot¶
- class mmtrack.models.mot.BaseMultiObjectTracker[source]¶
Base class for multiple object tracking.
- forward(img, img_metas, return_loss=True, **kwargs)[source]¶
Calls either forward_train() or forward_test() depending on whether return_loss is True.
Note this setting will change the expected inputs. When return_loss=True, img and img_meta are single-nested (i.e. Tensor and List[dict]), and when return_loss=False, img and img_meta should be double nested (i.e. List[Tensor], List[List[dict]]), with the outer list indicating test time augmentations.
- forward_test(imgs, img_metas, **kwargs)[source]¶
- Parameters
imgs (List[Tensor]) – the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch.
img_metas (List[List[dict]]) – the outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.
- abstract forward_train(imgs, img_metas, **kwargs)[source]¶
- Parameters
imgs (list[Tensor]) – List of tensors of shape (1, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.
kwargs (keyword arguments) – Specific to concrete implementation.
- init_module(module_name, pretrain=None)[source]¶
Initialize the weights of a sub-module.
- Parameters
module (nn.Module) – A sub-module of the model.
pretrain (str, optional) – Path to pre-trained weights. Defaults to None.
- show_result(img, result, thickness=1, font_scale=0.5, show=False, out_file=None, wait_time=0, backend='cv2')[source]¶
Visualize tracking results.
- Parameters
img (str | ndarray) – Filename or loaded image.
result (list[ndarray]) – Tracking results.
thickness (int, optional) – Thickness of lines. Defaults to 1.
font_scale (float, optional) – Font scales of texts. Defaults to 0.5.
show (bool, optional) – Whether to show the visualizations on the fly. Defaults to False.
out_file (str | None, optional) – Output filename. Defaults to None.
backend (str, optional) – Backend to draw the bounding boxes, options are cv2 and plt. Defaults to ‘cv2’.
- Returns
Visualized image.
- Return type
ndarray
- train_step(data, optimizer)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples.
loss is a tensor for back propagation, which can be a weighted sum of multiple losses.
log_vars contains all the variables to be sent to the logger.
num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
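The shape of the returned dict can be illustrated with plain floats standing in for tensors. In this sketch the total loss is the sum of every entry whose name contains "loss"; that rule and the helper names `parse_losses` and `train_step_outputs` are assumptions for illustration only.

```python
def parse_losses(losses):
    """Reduce raw loss entries to a total loss and a dict of logged variables."""
    log_vars = {}
    for name, value in losses.items():
        # A loss component may be a list of per-level values; sum them up.
        log_vars[name] = sum(value) if isinstance(value, (list, tuple)) else value
    total = sum(value for name, value in log_vars.items() if "loss" in name)
    log_vars["loss"] = total
    return total, log_vars


def train_step_outputs(losses, batch_size):
    """Assemble the dict with the three required keys."""
    loss, log_vars = parse_losses(losses)
    return dict(loss=loss, log_vars=log_vars, num_samples=batch_size)


outputs = train_step_outputs(
    {"loss_cls": 0.5, "loss_bbox": [0.25, 0.25], "acc": 0.9}, batch_size=2
)
```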
- val_step(data, optimizer)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- property with_detector¶
whether the framework has a detector.
- Type
bool
- property with_motion¶
whether the framework has a motion model.
- Type
bool
- property with_reid¶
whether the framework has a reid model.
- Type
bool
- property with_track_head¶
whether the framework has a track_head.
- Type
bool
- property with_tracker¶
whether the framework has a tracker.
- Type
bool
- class mmtrack.models.mot.DeepSORT(detector=None, reid=None, tracker=None, motion=None, pretrains=None)[source]¶
Simple online and realtime tracking with a deep association metric.
Details can be found at DeepSORT (https://arxiv.org/abs/1703.07402).
- init_weights(pretrain)[source]¶
Initialize the weights of the modules.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, rescale=False, public_bboxes=None, **kwargs)[source]¶
Test without augmentations.
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.
rescale (bool, optional) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.
public_bboxes (list[Tensor], optional) – Public bounding boxes from the benchmark. Defaults to None.
- Returns
The tracking results.
- Return type
dict[str, list(ndarray)]
- class mmtrack.models.mot.Tracktor(detector=None, reid=None, tracker=None, motion=None, pretrains=None)[source]¶
Tracking without bells and whistles.
Details can be found at Tracktor (https://arxiv.org/abs/1903.05625).
- init_weights(pretrain)[source]¶
Initialize the weights of the modules.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, rescale=False, public_bboxes=None, **kwargs)[source]¶
Test without augmentations.
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.
rescale (bool, optional) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.
public_bboxes (list[Tensor], optional) – Public bounding boxes from the benchmark. Defaults to None.
- Returns
The tracking results.
- Return type
dict[str, list(ndarray)]
- property with_cmc¶
whether the framework has a camera motion compensation model.
- Type
bool
- property with_linear_motion¶
whether the framework has a linear motion model.
- Type
bool
sot¶
- class mmtrack.models.sot.SiamRPN(pretrains=None, backbone=None, neck=None, head=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]¶
SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks.
This single object tracker is the implementation of SiamRPN++.
- forward_search(x_img)[source]¶
Extract the features of search images.
- Parameters
x_img (Tensor) – of shape (N, C, H, W) encoding input search images. Typically H and W are equal to 255.
- Returns
Multi level feature map of search images.
- Return type
tuple(Tensor)
- forward_template(z_img)[source]¶
Extract the features of exemplar images.
- Parameters
z_img (Tensor) – of shape (N, C, H, W) encoding input exemplar images. Typically H and W are equal to 127.
- Returns
Multi level feature map of exemplar images.
- Return type
tuple(Tensor)
- forward_train(img, img_metas, gt_bboxes, search_img, search_img_metas, search_gt_bboxes, is_positive_pairs, **kwargs)[source]¶
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input exemplar images. Typically H and W are equal to 127.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
gt_bboxes (list[Tensor]) – Ground truth bboxes for each exemplar image with shape (1, 4) in [tl_x, tl_y, br_x, br_y] format.
search_img (Tensor) – of shape (N, 1, C, H, W) encoding input search images. 1 denotes there is only one search image for each exemplar image. Typically H and W are equal to 255.
search_img_metas (list[list[dict]]) – The second list only has one element. The first list contains search image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
search_gt_bboxes (list[Tensor]) – Ground truth bboxes for each search image with shape (1, 5) in [0.0, tl_x, tl_y, br_x, br_y] format.
is_positive_pairs (list[bool]) – list of bool denoting whether each exemplar image and its corresponding search image form a positive pair.
- Returns
a dictionary of loss components.
- Return type
dict[str, Tensor]
- get_cropped_img(img, center_xy, target_size, crop_size, avg_channel)[source]¶
Crop image.
Only used during testing.
This function mainly contains two steps:
1. Crop img based on center center_xy and size crop_size. If the cropped image is out of boundary of img, use avg_channel to pad.
2. Resize the cropped image to target_size.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding original input image.
center_xy (Tensor) – of shape (2, ) denoting the center point for cropping image.
target_size (int) – The output size of cropped image.
crop_size (Tensor) – The size for cropping image.
avg_channel (Tensor) – of shape (3, ) denoting the padding values.
- Returns
of shape (1, C, target_size, target_size) encoding the resized cropped image.
- Return type
Tensor
- init(img, bbox)[source]¶
Initialize the single object tracker in the first frame.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding original input image.
bbox (Tensor) – The given instance bbox of the first frame that needs to be tracked in the following frames. The shape of the box is (4, ) with [cx, cy, w, h] format.
- Returns
z_feat is a tuple[Tensor] that contains the multi level feature maps of exemplar image, avg_channel is Tensor with shape (3, ), and denotes the padding values.
- Return type
tuple(z_feat, avg_channel)
- init_weights(pretrain)[source]¶
Initialize the weights of modules in single object tracker.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, gt_bboxes, **kwargs)[source]¶
Test without augmentation.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
gt_bboxes (list[Tensor]) – list of ground truth bboxes for each image with shape (1, 4) in [tl_x, tl_y, br_x, br_y] format.
- Returns
The tracking results.
- Return type
dict[str, ndarray]
- track(img, bbox, z_feat, avg_channel)[source]¶
Track the box bbox of previous frame to current frame img.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding original input image.
bbox (Tensor) – The bbox in previous frame. The shape of the box is (4, ) in [cx, cy, w, h] format.
z_feat (tuple[Tensor]) – The multi level feature maps of exemplar image in the first frame.
avg_channel (Tensor) – of shape (3, ) denoting the padding values.
- Returns
best_score is a Tensor denoting the score of best_bbox, best_bbox is a Tensor of shape (4, ) in [cx, cy, w, h] format, and denotes the best tracked bbox in current frame.
- Return type
tuple(best_score, best_bbox)
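This class mixes two box conventions: init() and track() use [cx, cy, w, h], while the ground-truth boxes are given as [tl_x, tl_y, br_x, br_y]. A minimal conversion sketch follows; the helper names are hypothetical, and it assumes continuous coordinates where width is simply br_x - tl_x (some codebases add 1 for pixel-inclusive boxes).

```python
def cxcywh_to_xyxy(bbox):
    """Convert [cx, cy, w, h] to [tl_x, tl_y, br_x, br_y]."""
    cx, cy, w, h = bbox
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


def xyxy_to_cxcywh(bbox):
    """Convert [tl_x, tl_y, br_x, br_y] to [cx, cy, w, h]."""
    x1, y1, x2, y2 = bbox
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


corners = cxcywh_to_xyxy([50, 40, 20, 10])
center = xyxy_to_cxcywh(corners)
```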
vid¶
- class mmtrack.models.vid.BaseVideoDetector[source]¶
Base class for video object detector.
- forward(img, img_metas, return_loss=True, **kwargs)[source]¶
Calls either forward_train() or forward_test() depending on whether return_loss is True.
Note this setting will change the expected inputs. When return_loss=True, img and img_meta are single-nested (i.e. Tensor and List[dict]), and when return_loss=False, img and img_meta should be double nested (i.e. List[Tensor], List[List[dict]]), with the outer list indicating test time augmentations.
- forward_test(imgs, img_metas, **kwargs)[source]¶
- Parameters
imgs (List[Tensor]) – the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch.
img_metas (List[List[dict]]) – the outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.
- abstract forward_train(imgs, img_metas, **kwargs)[source]¶
- Parameters
imgs (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
- init_module(module, pretrain=None)[source]¶
Initialize the weights of modules in video detector.
- Parameters
pretrain (str, optional) – Path to pre-trained weights. Defaults to None.
- show_result(img, result, score_thr=0.3, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]¶
Draw result over img.
- Parameters
img (str or Tensor) – The image to be displayed.
result (Tensor or tuple) – The results to draw over img: bbox_result or (bbox_result, segm_result).
score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.3.
bbox_color (str or tuple or Color) – Color of bbox lines.
text_color (str or tuple or Color) – Color of texts.
thickness (int) – Thickness of lines.
font_scale (float) – Font scales of texts.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The filename to write the image. Default: None.
- Returns
The visualized image, returned only if neither show nor out_file is set.
- Return type
Tensor
- train_step(data, optimizer)[source]¶
The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples.
loss is a tensor for back propagation, which can be a weighted sum of multiple losses.
log_vars contains all the variables to be sent to the logger.
num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
- val_step(data, optimizer)[source]¶
The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
- property with_aggregator¶
whether the framework has an aggregator
- Type
bool
- property with_detector¶
whether the framework has a detector
- Type
bool
- property with_motion¶
whether the framework has a motion model
- Type
bool
- class mmtrack.models.vid.DFF(detector, motion, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]¶
Deep Feature Flow for Video Recognition.
This video object detector is the implementation of DFF.
- extract_feats(img, img_metas)[source]¶
Extract features for img during testing.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
- Returns
Multi level feature maps of img.
- Return type
list[Tensor]
- forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]¶
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
gt_labels (list[Tensor]) – class indices corresponding to each box.
ref_img (Tensor) – of shape (N, 1, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 1 denotes there is only one reference image for each input image.
ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0, and denotes the id of the reference image for each key image.
ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].
gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.
gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.
gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.
proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.
ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bboxes of reference images.
ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.
ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.
ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.
- Returns
a dictionary of loss components
- Return type
dict[str, Tensor]
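The [ref_img_id, tl_x, tl_y, br_x, br_y] layout described for ref_gt_bboxes can be illustrated with a minimal NumPy sketch. The helper name pack_ref_gt_bboxes is hypothetical and not part of mmtrack, and the model itself expects a Tensor, so the resulting array would still need converting.

```python
import numpy as np


def pack_ref_gt_bboxes(per_ref_bboxes):
    """Stack per-reference-image boxes into one (num_all_ref_gts, 5) array.

    per_ref_bboxes: list with one entry per reference image; each entry is a
    list of [tl_x, tl_y, br_x, br_y] boxes (hypothetical input layout).
    """
    rows = []
    for ref_img_id, bboxes in enumerate(per_ref_bboxes):
        bboxes = np.asarray(bboxes, dtype=np.float32).reshape(-1, 4)
        # Prepend the index of the reference image (starting from 0) to every box.
        ids = np.full((len(bboxes), 1), ref_img_id, dtype=np.float32)
        rows.append(np.hstack([ids, bboxes]))
    if not rows:
        return np.zeros((0, 5), dtype=np.float32)
    return np.concatenate(rows, axis=0)


# Two reference images: the first has two boxes, the second has one.
packed = pack_ref_gt_bboxes([
    [[0, 0, 10, 10], [5, 5, 20, 20]],
    [[1, 2, 3, 4]],
])
print(packed.shape)
```

The same idea would apply to ref_gt_labels, with a single class index in place of the four box coordinates.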
- init_weights(pretrain)[source]¶
Initialize the weights of modules in video object detector.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, proposals=None, rescale=False)[source]¶
Test without augmentation.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.
rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.
- Returns
The detection results.
- Return type
dict[str, list(ndarray)]
- class mmtrack.models.vid.FGFA(detector, motion, aggregator, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]¶
Flow-Guided Feature Aggregation for Video Object Detection.
This video object detector is the implementation of FGFA.
- extract_feats(img, img_metas, ref_img, ref_img_metas)[source]¶
Extract features for img during testing.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_img (Tensor | None) – of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.
ref_img_metas (list[list[dict]] | None) – The first list only has one element. The second list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.
- Returns
Multi level feature maps of img.
- Return type
list[Tensor]
- forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]¶
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
gt_labels (list[Tensor]) – class indices corresponding to each box.
ref_img (Tensor) – of shape (N, 2, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 2 denotes there are two reference images for each input image.
ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0 and denotes the id of the reference image for each key image.
ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].
gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.
gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.
gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.
proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.
ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox of the reference images.
ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.
ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.
ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.
- Returns
a dictionary of loss components
- Return type
dict[str, Tensor]
- init_weights(pretrain)[source]¶
Initialize the weights of modules in video object detector.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, ref_img=None, ref_img_metas=None, proposals=None, rescale=False)[source]¶
Test without augmentation.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_img (list[Tensor] | None) – The list only contains one Tensor of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.
ref_img_metas (list[list[list[dict]]] | None) – The first and second lists each have only one element. The third list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.
proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.
rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.
- Returns
The detection results.
- Return type
dict[str, list(ndarray)]
- class mmtrack.models.vid.SELSA(detector, pretrains=None, frozen_modules=None, train_cfg=None, test_cfg=None)[source]¶
Sequence Level Semantics Aggregation for Video Object Detection.
This video object detector is the implementation of SELSA.
- extract_feats(img, img_metas, ref_img, ref_img_metas)[source]¶
Extract features for img during testing.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_img (Tensor | None) – of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.
ref_img_metas (list[list[dict]] | None) – The first list only has one element. The second list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.
- Returns
x is the multi level feature maps of img, and ref_x is the multi level feature maps of ref_img.
- Return type
tuple(x, img_metas, ref_x, ref_img_metas)
- forward_train(img, img_metas, gt_bboxes, gt_labels, ref_img, ref_img_metas, ref_gt_bboxes, ref_gt_labels, gt_instance_ids=None, gt_bboxes_ignore=None, gt_masks=None, proposals=None, ref_gt_instance_ids=None, ref_gt_bboxes_ignore=None, ref_gt_masks=None, ref_proposals=None, **kwargs)[source]¶
- Parameters
img (Tensor) – of shape (N, C, H, W) encoding input images. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
gt_bboxes (list[Tensor]) – Ground truth bboxes for each image with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
gt_labels (list[Tensor]) – class indices corresponding to each box.
ref_img (Tensor) – of shape (N, 2, C, H, W) encoding input images. Typically these should be mean centered and std scaled. 2 denotes there are two reference images for each input image.
ref_img_metas (list[list[dict]]) – The first list only has one element. The second list contains reference image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_gt_bboxes (list[Tensor]) – The list only has one Tensor. The Tensor contains ground truth bboxes for each reference image with shape (num_all_ref_gts, 5) in [ref_img_id, tl_x, tl_y, br_x, br_y] format. The ref_img_id starts from 0 and denotes the id of the reference image for each key image.
ref_gt_labels (list[Tensor]) – The list only has one Tensor. The Tensor contains class indices corresponding to each reference box with shape (num_all_ref_gts, 2) in [ref_img_id, class_indice].
gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox.
gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes can be ignored when computing the loss.
gt_masks (None | Tensor) – true segmentation masks for each box used if the architecture supports a segmentation task.
proposals (None | Tensor) – override rpn proposals with custom proposals. Use when with_rpn is False.
ref_gt_instance_ids (None | list[Tensor]) – specify the instance id for each ground truth bbox of the reference images.
ref_gt_bboxes_ignore (None | list[Tensor]) – specify which bounding boxes of reference images can be ignored when computing the loss.
ref_gt_masks (None | Tensor) – True segmentation masks for each box of reference image used if the architecture supports a segmentation task.
ref_proposals (None | Tensor) – override rpn proposals with custom proposals of reference images. Use when with_rpn is False.
- Returns
a dictionary of loss components
- Return type
dict[str, Tensor]
- init_weights(pretrain)[source]¶
Initialize the weights of modules in video object detector.
- Parameters
pretrain (dict) – Path to pre-trained weights.
- simple_test(img, img_metas, ref_img=None, ref_img_metas=None, proposals=None, ref_proposals=None, rescale=False)[source]¶
Test without augmentation.
- Parameters
img (Tensor) – of shape (1, C, H, W) encoding input image. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
ref_img (list[Tensor] | None) – The list only contains one Tensor of shape (1, N, C, H, W) encoding input reference images. Typically these should be mean centered and std scaled. N denotes the number of reference images. There may be no reference images in some cases.
ref_img_metas (list[list[list[dict]]] | None) – The first and second lists each have only one element. The third list contains image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect. There may be no reference images in some cases.
proposals (None | Tensor) – Override rpn proposals with custom proposals. Use when with_rpn is False. Defaults to None.
rescale (bool) – If False, then returned bboxes and masks will fit the scale of img, otherwise, returned bboxes and masks will fit the scale of original image shape. Defaults to False.
- Returns
The detection results.
- Return type
dict[str, list(ndarray)]
aggregators¶
- class mmtrack.models.aggregators.EmbedAggregator(num_convs=1, channels=256, kernel_size=3, norm_cfg=None, act_cfg={'type': 'ReLU'})[source]¶
Embedding convs to aggregate multi feature maps.
This module is proposed in “Flow-Guided Feature Aggregation for Video Object Detection” (FGFA).
- Parameters
num_convs (int) – Number of embedding convs.
channels (int) – Channels of embedding convs. Defaults to 256.
kernel_size (int) – Kernel size of embedding convs. Defaults to 3.
norm_cfg (dict) – Configuration of normalization method after each conv. Defaults to None.
act_cfg (dict) – Configuration of activation method after each conv. Defaults to dict(type='ReLU').
- forward(x, ref_x)[source]¶
Aggregate reference feature maps ref_x.
The aggregation mainly contains two steps: 1. Compute the cosine similarity between x and ref_x. 2. Use the normalized (i.e. softmax) cosine similarity to compute a weighted sum of ref_x.
- Parameters
x (Tensor) – of shape [1, C, H, W]
ref_x (Tensor) – of shape [N, C, H, W]. N is the number of reference feature maps.
- Returns
The aggregated feature map with shape [1, C, H, W].
- Return type
Tensor
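The two steps above can be sketched in NumPy as follows. This is a simplified illustration, not the module's implementation: it skips the embedding convs entirely and applies the similarity directly to the raw feature maps. The function name and the eps argument are assumptions made for the example.

```python
import numpy as np


def aggregate_by_cosine_similarity(x, ref_x, eps=1e-12):
    """Weighted sum of ref_x using softmax-normalized cosine similarity to x.

    x:     array of shape [1, C, H, W]
    ref_x: array of shape [N, C, H, W]
    Returns an array of shape [1, C, H, W].
    """
    # Step 1: cosine similarity along the channel axis, per spatial location.
    x_unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    ref_unit = ref_x / (np.linalg.norm(ref_x, axis=1, keepdims=True) + eps)
    cos_sim = (ref_unit * x_unit).sum(axis=1, keepdims=True)  # [N, 1, H, W]

    # Step 2: softmax over the N reference maps, then a weighted sum of ref_x.
    cos_sim = cos_sim - cos_sim.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(cos_sim)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * ref_x).sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 2, 3))
ref_x = rng.normal(size=(5, 4, 2, 3))
print(aggregate_by_cosine_similarity(x, ref_x).shape)
```

Because the weights at each location sum to 1, aggregating N identical copies of a feature map returns that map unchanged.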
- class mmtrack.models.aggregators.SelsaAggregator(in_channels, num_attention_blocks=16)[source]¶
Selsa aggregator module.
This module is proposed in “Sequence Level Semantics Aggregation for Video Object Detection” (SELSA).
- Parameters
in_channels (int) – The number of channels of the features of proposal.
num_attention_blocks (int) – The number of attention blocks used in selsa aggregator module. Defaults to 16.
- forward(x, ref_x)[source]¶
Aggregate the features ref_x of reference proposals.
The aggregation mainly contains two steps: 1. Use multi-head attention to compute the weight between x and ref_x. 2. Use the normalized (i.e. softmax) weight to compute a weighted sum of ref_x.
- Parameters
x (Tensor) – of shape [N, C]. N is the number of key frame proposals.
ref_x (Tensor) – of shape [M, C]. M is the number of reference frame proposals.
- Returns
The aggregated features of key frame proposals with shape [N, C].
- Return type
Tensor
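A rough NumPy sketch of these two steps is given below. It assumes plain scaled dot-product attention with the channels split evenly across heads and no learned projection layers, which the real module may well include, so treat it as an illustration of the data flow only. The function name is hypothetical.

```python
import numpy as np


def multi_head_aggregate(x, ref_x, num_attention_blocks=16):
    """Aggregate ref_x into x with multi-head attention (no learned weights).

    x:     array of shape [N, C], key frame proposal features
    ref_x: array of shape [M, C], reference frame proposal features
    Returns an array of shape [N, C].
    """
    n, c = x.shape
    m = ref_x.shape[0]
    assert c % num_attention_blocks == 0, "C must be divisible by the head count"
    d = c // num_attention_blocks

    # Split channels into heads: [heads, N, d] and [heads, M, d].
    q = x.reshape(n, num_attention_blocks, d).transpose(1, 0, 2)
    k = ref_x.reshape(m, num_attention_blocks, d).transpose(1, 0, 2)

    # Step 1: attention weights between every key and reference proposal.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # [heads, N, M]
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over M

    # Step 2: weighted sum of the reference features, heads merged back to C.
    out = weights @ k  # [heads, N, d]
    return out.transpose(1, 0, 2).reshape(n, c)


rng = np.random.default_rng(0)
print(multi_head_aggregate(rng.normal(size=(3, 32)), rng.normal(size=(7, 32))).shape)
```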
backbones¶
losses¶
motion¶
- class mmtrack.models.motion.CameraMotionCompensation(warp_mode='cv2.MOTION_EUCLIDEAN', num_iters=50, stop_eps=0.001)[source]¶
Camera motion compensation.
- Parameters
warp_mode (str) – Warp mode in opencv. Defaults to 'cv2.MOTION_EUCLIDEAN'.
num_iters (int) – Number of iterations. Defaults to 50.
stop_eps (float) – Termination threshold. Defaults to 0.001.
- class mmtrack.models.motion.FlowNetSimple(img_scale_factor, out_indices=[2, 3, 4, 5, 6], flow_scale_factor=5.0, flow_img_norm_std=[255.0, 255.0, 255.0], flow_img_norm_mean=[0.411, 0.432, 0.45])[source]¶
The simple version of FlowNet.
This FlowNetSimple is the implementation of FlowNetSimple.
- Parameters
img_scale_factor (float) – Used to upsample/downsample the image.
out_indices (list) – The indices of outputting feature maps after each group of conv layers. Defaults to [2, 3, 4, 5, 6].
flow_scale_factor (float) – Used to enlarge the values of flow. Defaults to 5.0.
flow_img_norm_std (list) – Used to scale the values of image. Defaults to [255.0, 255.0, 255.0].
flow_img_norm_mean (list) – Used to center the values of image. Defaults to [0.411, 0.432, 0.450].
- forward(imgs, img_metas)[source]¶
Compute the flow of image pairs.
- Parameters
imgs (Tensor) – of shape (N, 6, H, W) encoding input image pairs. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
- Returns
of shape (N, 2, H, W) encoding flow of image pairs.
- Return type
Tensor
- prepare_imgs(imgs, img_metas)[source]¶
Preprocess image pairs for computing flow.
- Parameters
imgs (Tensor) – of shape (N, 6, H, W) encoding input image pairs. Typically these should be mean centered and std scaled.
img_metas (list[dict]) – list of image information dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmtrack/datasets/pipelines/formatting.py:VideoCollect.
- Returns
of shape (N, 6, H, W) encoding the input image pairs for FlowNetSimple.
- Return type
Tensor
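One plausible reading of the flow_img_norm_std and flow_img_norm_mean parameters is sketched below in NumPy: undo the dataset normalization, then scale and center the pixels with the flow-specific statistics. This is an assumption based on the parameter descriptions, not a copy of the library code. The function name is hypothetical, the 6 channels are assumed to be two RGB images concatenated along the channel axis, and the resizing by img_scale_factor is left out.

```python
import numpy as np


def renormalize_pairs(imgs, img_norm_mean, img_norm_std,
                      flow_img_norm_mean=(0.411, 0.432, 0.45),
                      flow_img_norm_std=(255.0, 255.0, 255.0)):
    """Re-normalize image pairs of shape (N, 6, H, W) for a flow network."""

    def per_channel(values):
        # Repeat the 3 RGB statistics for both images in the pair: (1, 6, 1, 1).
        return np.tile(np.asarray(values, dtype=np.float64), 2).reshape(1, 6, 1, 1)

    # Undo the dataset normalization to recover raw pixel values.
    raw = imgs * per_channel(img_norm_std) + per_channel(img_norm_mean)
    # Scale to roughly [0, 1] and center with the flow statistics.
    return raw / per_channel(flow_img_norm_std) - per_channel(flow_img_norm_mean)


# Example with commonly used ImageNet statistics (an assumption for this demo).
mean = (123.675, 116.28, 103.53)
std = (58.395, 57.12, 57.375)
pairs = np.zeros((1, 6, 4, 4))
print(renormalize_pairs(pairs, mean, std).shape)
```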
- class mmtrack.models.motion.KalmanFilter(center_only=False)[source]¶
A simple Kalman filter for tracking bounding boxes in image space.
The implementation refers to https://github.com/nwojke/deep_sort.
- gating_distance(mean, covariance, measurements, only_position=False)[source]¶
Compute gating distance between state distribution and measurements.
A suitable distance threshold can be obtained from chi2inv95. If only_position is False, the chi-square distribution has 4 degrees of freedom, otherwise 2.
- Parameters
mean (ndarray) – Mean vector over the state distribution (8 dimensional).
covariance (ndarray) – Covariance of the state distribution (8x8 dimensional).
measurements (ndarray) – An Nx4 dimensional matrix of N measurements, each in format (x, y, a, h) where (x, y) is the bounding box center position, a the aspect ratio, and h the height.
only_position (bool, optional) – If True, distance computation is done with respect to the bounding box center position only. Defaults to False.
- Returns
Returns an array of length N, where the i-th element contains the squared Mahalanobis distance between (mean, covariance) and measurements[i].
- Return type
ndarray
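As a minimal sketch of the gating computation, the snippet below evaluates the squared Mahalanobis distance in NumPy and compares it with the 0.95 chi-square quantile. It assumes mean and covariance have already been projected to the 4 dimensional measurement space, whereas the method above takes the 8 dimensional state. The function name and the CHI2INV95 table are written out here for illustration.

```python
import numpy as np

# 0.95 quantiles of the chi-square distribution for 2 and 4 degrees of freedom.
CHI2INV95 = {2: 5.9915, 4: 9.4877}


def squared_mahalanobis(mean, covariance, measurements):
    """Squared Mahalanobis distance of each measurement to (mean, covariance).

    mean:         array of shape (D,), already in measurement space
    covariance:   array of shape (D, D)
    measurements: array of shape (N, D)
    Returns an array of length N.
    """
    diff = np.asarray(measurements, dtype=np.float64) - mean
    # With covariance = L @ L.T, the distance is the squared norm of L^-1 @ diff.
    chol = np.linalg.cholesky(covariance)
    z = np.linalg.solve(chol, diff.T)
    return np.sum(z * z, axis=0)


mean = np.zeros(4)
covariance = np.eye(4)
measurements = np.array([[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 1.0]])
distances = squared_mahalanobis(mean, covariance, measurements)
# Keep only measurements inside the 95% gate for 4 degrees of freedom.
print(distances < CHI2INV95[4])
```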
- initiate(measurement)[source]¶
Create track from unassociated measurement.
- Parameters
measurement (ndarray) – Bounding box coordinates (x, y, a, h) with center position (x, y), aspect ratio a, and height h.
- Returns
Returns the mean vector (8 dimensional) and covariance matrix (8x8 dimensional) of the new track. Unobserved velocities are initialized to 0 mean.
- Return type
(ndarray, ndarray)
- predict(mean, covariance)[source]¶
Run Kalman filter prediction step.
- Parameters
mean (ndarray) – The 8 dimensional mean vector of the object state at the previous time step.
covariance (ndarray) – The 8x8 dimensional covariance matrix of the object state at the previous time step.
- Returns
Returns the mean vector and covariance matrix of the predicted state. Unobserved velocities are initialized to 0 mean.
- Return type
(ndarray, ndarray)
- project(mean, covariance)[source]¶
Project state distribution to measurement space.
- Parameters
mean (ndarray) – The state’s mean vector (8 dimensional array).
covariance (ndarray) – The state’s covariance matrix (8x8 dimensional).
- Returns
Returns the projected mean and covariance matrix of the given state estimate.
- Return type
(ndarray, ndarray)
- track(tracks, bboxes)[source]¶
Track forward.
- Parameters
tracks (dict[int, dict]) – Track buffer.
bboxes (Tensor) – Detected bounding boxes.
- Returns
Updated tracks and bboxes.
- Return type
(dict[int, dict], Tensor)
- update(mean, covariance, measurement)[source]¶
Run Kalman filter correction step.
- Parameters
mean (ndarray) – The predicted state’s mean vector (8 dimensional).
covariance (ndarray) – The state’s covariance matrix (8x8 dimensional).
measurement (ndarray) – The 4 dimensional measurement vector (x, y, a, h), where (x, y) is the center position, a the aspect ratio, and h the height of the bounding box.
- Returns
Returns the measurement-corrected state distribution.
- Return type
(ndarray, ndarray)
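The initiate, predict, project and update steps fit together as in the following NumPy sketch of a constant-velocity Kalman filter over the 8 dimensional state (x, y, a, h and their velocities). The noise values (pos_std, vel_std, process_var, meas_var) are placeholder assumptions chosen for readability and are not the values used by the library; the functions are standalone and do not reproduce the class above.

```python
import numpy as np

NDIM = 4
# Constant-velocity transition with a time step of 1: position += velocity.
F = np.eye(2 * NDIM)
F[:NDIM, NDIM:] = np.eye(NDIM)
# Observation matrix: only (x, y, a, h) are measured.
H = np.eye(NDIM, 2 * NDIM)


def initiate(measurement, pos_std=1.0, vel_std=10.0):
    """Create a state from one measurement; velocities start at 0 mean."""
    mean = np.concatenate([measurement, np.zeros(NDIM)])
    std = np.r_[np.full(NDIM, pos_std), np.full(NDIM, vel_std)]
    return mean, np.diag(std ** 2)


def predict(mean, covariance, process_var=1e-2):
    """Propagate the state one step forward."""
    Q = process_var * np.eye(2 * NDIM)  # placeholder process noise
    return F @ mean, F @ covariance @ F.T + Q


def project(mean, covariance, meas_var=1e-1):
    """Map the state distribution to measurement space."""
    R = meas_var * np.eye(NDIM)  # placeholder measurement noise
    return H @ mean, H @ covariance @ H.T + R


def update(mean, covariance, measurement):
    """Correct the predicted state with a new measurement."""
    proj_mean, proj_cov = project(mean, covariance)
    gain = covariance @ H.T @ np.linalg.inv(proj_cov)
    new_mean = mean + gain @ (measurement - proj_mean)
    new_cov = covariance - gain @ proj_cov @ gain.T
    return new_mean, new_cov


mean, cov = initiate(np.array([10.0, 20.0, 0.5, 40.0]))
mean, cov = predict(mean, cov)
mean, cov = update(mean, cov, np.array([12.0, 20.0, 0.5, 40.0]))
print(mean.shape, cov.shape)
```

A tracker would typically call predict once per frame for every track and update only for tracks that were matched to a detection.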
- class mmtrack.models.motion.LinearMotion(num_samples=2, center_motion=False)[source]¶
Linear motion while tracking.
- Parameters
num_samples (int, optional) – Number of samples to calculate the velocity. Defaults to 2.
center_motion (bool, optional) – Whether to use the center location or the bounding box location to estimate the velocity. Defaults to False.
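How these two parameters might interact can be sketched as follows. The snippet assumes the velocity is the mean frame-to-frame difference over the last num_samples boxes, taken either on the box corners or on the box centers. That is one reasonable interpretation of the descriptions above, and estimate_velocity is a hypothetical helper rather than a method of the class.

```python
import numpy as np


def estimate_velocity(bboxes, num_samples=2, center_motion=False):
    """Estimate per-frame velocity from a history of [tl_x, tl_y, br_x, br_y] boxes."""
    boxes = np.asarray(bboxes, dtype=np.float64)[-num_samples:]
    if len(boxes) < 2:
        # Not enough history to form a difference.
        return np.zeros(2 if center_motion else 4)
    if center_motion:
        # Track only the box centers: (cx, cy).
        points = np.stack(
            [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2],
            axis=1,
        )
    else:
        # Track all four box coordinates.
        points = boxes
    # Mean of consecutive differences over the sampled frames.
    return np.diff(points, axis=0).mean(axis=0)


history = [[0, 0, 10, 10], [2, 1, 12, 11]]
print(estimate_velocity(history))
print(estimate_velocity(history, center_motion=True))
```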
reid¶
- class mmtrack.models.reid.BaseReID(*args: Any, **kwargs: Any)[source]¶
Base class for re-identification.
- class mmtrack.models.reid.FcModule(in_channels, out_channels, norm_cfg=None, act_cfg={'type': 'ReLU'}, inplace=True)[source]¶
Fully-connected layer module.
- Parameters
in_channels (int) – Input channels.
out_channels (int) – Output channels.
norm_cfg (dict, optional) – Configuration of normalization method after fc. Defaults to None.
act_cfg (dict, optional) – Configuration of activation method after fc. Defaults to dict(type='ReLU').
inplace (bool, optional) – Whether to apply the activation module in place. Defaults to True.
- property norm¶
Normalization.