WO2020151167A1 - Target tracking method and device, computer device and readable storage medium - Google Patents


Info

Publication number: WO2020151167A1
Authority: WIPO (PCT)
Application number: PCT/CN2019/091160
Priority date: 2019-01-23 (priority claimed from Chinese application 201910064675.5)
Prior art keywords: target, frame, current image, prediction, target frame
Other languages: French (fr), Chinese (zh)
Inventor: 杨国青 (Yang Guoqing)
Original assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司; published as WO2020151167A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Definitions

  • This application relates to the field of image processing technology, and in particular to a target tracking method, device, computer device and non-volatile readable storage medium.
  • Target tracking refers to tracking moving objects in a video or image sequence (for example, cars and pedestrians in traffic videos) to obtain the position of the moving object in each frame.
  • Target tracking has a wide range of applications in the fields of video surveillance, autonomous driving and video entertainment.
  • current target tracking mainly adopts the track-by-detection architecture.
  • a detector detects the position information of each target on each frame of the video or image sequence, and the target position information of the current frame is then matched with the target position information of the previous frame.
  • however, existing schemes are not robust in target tracking; if the illumination changes, the tracking effect is poor.
  • the first aspect of the present application provides a target tracking method. The method includes: using a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image; obtaining a second target frame in the previous frame of the current image and using a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image; matching the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and updating the position of the target in the current image according to the matching result.
  • a second aspect of the present application provides a target tracking device, the device includes:
  • the detection module is configured to use a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image;
  • the prediction module is configured to obtain the second target frame in the previous frame of the current image, use a predictor to predict the position of the second target frame in the current image, and obtain the prediction frame of the second target frame in the current image;
  • a matching module configured to match a first target frame in the current image with the prediction frame to obtain a matching result between the first target frame and the prediction frame;
  • the update module is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
  • a third aspect of the present application provides a computer device that includes a processor, and the processor is configured to implement the target tracking method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium having computer readable instructions stored thereon, and when the computer readable instructions are executed by a processor, the target tracking method is implemented.
  • This application uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image; obtains the second target frame in the previous frame of the current image, and uses a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image; matches the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and updates the position of the target in the current image according to the matching result.
  • This application improves the robustness and scene adaptability of target tracking.
  • Fig. 1 is a flowchart of a target tracking method provided by an embodiment of the present application.
  • Figure 2 is a structural diagram of a target tracking device provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of the SiamFC model.
  • the target tracking method of the present application is applied to one or more computer devices.
  • the computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a target tracking method provided in Embodiment 1 of the present application.
  • the target tracking method is applied to a computer device.
  • the target tracking method of the present application tracks a specific type of moving object (such as a pedestrian) in a video or image sequence, and obtains the position of the moving object in each frame of the image.
  • the target tracking method can solve the shortcoming that the existing solutions cannot track high-speed moving targets, and improve the robustness of target tracking.
  • the target tracking method includes:
  • Step 101 Use a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image.
  • the predetermined type of target may include pedestrians, cars, airplanes, ships, and so on.
  • the predetermined type of target may be one type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
  • the target detector may be a neural network model with classification and regression functions.
  • the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
  • the Faster RCNN model includes the Region Proposal Network (RPN) and the Fast Region-based Convolution Neural Network (Fast RCNN).
  • the region proposal network and the Fast RCNN share convolutional layers, and the shared convolutional layers are used to extract a feature map of an image.
  • the region proposal network generates candidate frames of the image according to the feature map, and inputs the generated candidate frames into the Fast RCNN.
  • the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frames of the image. A sketch of this two-stage flow is given below.
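The following is an illustrative outline only, not the patent's implementation; the names backbone, rpn, and fast_rcnn_head are hypothetical placeholders, not a real library API.

```python
# Illustrative outline of the two-stage Faster RCNN flow described above.
# backbone, rpn and fast_rcnn_head are hypothetical callables, not a real API.

def detect(image, backbone, rpn, fast_rcnn_head):
    """Detect targets in one image with a two-stage detector."""
    feature_map = backbone(image)    # shared convolutional layers
    proposals = rpn(feature_map)     # candidate frames from the region proposal network
    # Fast RCNN screens and adjusts the candidates on the same shared features
    boxes, scores, classes = fast_rcnn_head(feature_map, proposals)
    return boxes, scores, classes
```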
  • before the target detector is used to detect the first target frame of the predetermined type of target in the current image, the target detector is trained using a first training sample set:
  • the shared convolutional layers extract a feature map of each sample image in the first training sample set;
  • the region proposal network obtains candidate frames in each sample image according to the feature map;
  • the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frame of each sample image.
  • the target frame may include target frames of different types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
  • the Faster RCNN model adopts the ZF framework, and the region proposal network and the Fast RCNN share 5 convolutional layers.
  • the ZF framework is a commonly used network structure proposed by Matthew D. Zeiler and Rob Fergus in the 2013 paper "Visualizing and Understanding Convolutional Networks"; it is a variant of the AlexNet network. ZF is fine-tuned from AlexNet, uses the ReLU activation function and a cross-entropy cost function, and uses smaller convolution kernels to retain more original pixel information.
  • the first training sample set can be used to train the Faster RCNN model according to the following steps:
  • the region proposal network generates many candidate boxes; the several candidate boxes with the highest target classification scores can be screened out and input to the Fast RCNN, to speed up training and detection.
  • the backpropagation algorithm can be used to train the region proposal network, and the network parameters of the region proposal network can be adjusted during training to minimize the loss function.
  • the loss function indicates the difference between the predicted confidence of the candidate frames output by the region proposal network and the true confidence.
  • the loss function can include two parts: a target classification loss and a regression loss.
  • the loss function can be defined as (the standard Faster RCNN form, consistent with the symbols defined below):

    $$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,R(t_i - t_i^*)$$

  • $i$ is the index of the candidate frame in a training batch (mini-batch).
  • $N_{cls}$ is the size of the training batch, such as 256.
  • $p_i$ is the predicted probability that the i-th candidate frame is a target.
  • $p_i^*$ is the ground-truth (GT) label: if the candidate box is positive (that is, the assigned label is a positive label, called a positive candidate box), $p_i^*$ is 1; if the candidate box is negative (that is, the assigned label is a negative label, called a negative candidate box), $p_i^*$ is 0.
  • $\lambda$ is the balance weight, which can be taken as 10.
  • $N_{reg}$ is the number of candidate frames.
  • $R$ is a robust loss function (smooth L1), defined as:

    $$R(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
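As a concrete reading of the formulas above, here is a minimal NumPy sketch. It assumes binary cross-entropy for L_cls and per-coordinate smooth L1 for the regression term, which matches the standard Faster RCNN formulation but is not spelled out in the patent text.

```python
import numpy as np

def smooth_l1(x):
    """R in the text: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, lam=10.0):
    """Classification + regression loss over one mini-batch.

    p:      (N,) predicted object probabilities
    p_star: (N,) ground-truth labels, 1 for positive boxes, 0 for negative
    t:      (N, 4) predicted box offsets
    t_star: (N, 4) ground-truth box offsets
    """
    eps = 1e-7
    # binary cross-entropy classification term (assumed form of L_cls)
    cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # regression term, counted only for positive candidate boxes
    reg = p_star * smooth_l1(t - t_star).sum(axis=1)
    n_reg = len(p)
    return cls.sum() / n_cls + lam * reg.sum() / n_reg
```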
  • the training method of the Fast RCNN can refer to the training method of the region proposal network, and will not be repeated here.
  • the method of hard negative mining (HNM) is added to the training of the Fast RCNN.
  • in other embodiments, the target detector may also be another neural network model, such as a Region-based Convolutional Neural Network (RCNN) model or a Fast RCNN model.
  • when the target detector is used to detect a predetermined type of target in an image, the image is input to the target detector; the target detector detects the predetermined type of target in the image and outputs the position of each first target frame of the predetermined type of target in the image. For example, the target detector outputs 6 first target frames in the image.
  • the first target frame is presented in the form of a rectangular frame.
  • the position of the first target frame may be represented by position coordinates, and the position coordinates may include upper left corner coordinates (x, y) and width and height (w, h).
  • the target detector may also output the type of each first target frame, for example, output 5 first target frames of pedestrian type and 1 first target frame of automobile type.
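A small helper matching this (x, y, w, h) convention might look as follows; the Box class and its field names are illustrative, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Box:
    # (x, y) is the upper-left corner, (w, h) the width and height
    x: float
    y: float
    w: float
    h: float
    label: str = ""   # e.g. "pedestrian" or "car"

    def corners(self):
        """Return (x1, y1, x2, y2) corner coordinates."""
        return self.x, self.y, self.x + self.w, self.y + self.h
```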
  • Step 102: Obtain a second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image.
  • the second target frame in the previous frame is a target frame obtained by using a target detector to detect a predetermined type of target in the previous frame.
  • predicting the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image means predicting the position of each second target frame in the current image and obtaining the prediction frame of each second target frame in the current image. For example, if four pedestrian target frames are detected in the previous frame of the current image, the positions of the four pedestrian target frames in the current image are predicted (that is, the positions in the current image of the four pedestrians corresponding to the four pedestrian target frames are predicted) to obtain the prediction frames of the four pedestrian target frames in the current image.
  • the predictor may be a deep neural network model.
  • the predictor Before using the predictor to predict the second target frame, the predictor is trained using the second training sample set.
  • the features learned by the predictor are deep features, in which color features (which are sensitive to illumination) account for a relatively small proportion. Therefore, the predictor can overcome the impact of illumination to a certain extent, improving the robustness and scene adaptability of target tracking.
  • the second training sample set may include a large number of sample images of objects with different illumination, deformation, and high-speed motion. Therefore, the predictor can further overcome the influence of illumination, and can overcome the influence of deformation and high-speed movement to a certain extent, so that this application realizes the tracking of high-speed moving targets and improves the robustness of target tracking.
  • a feature pyramid network (Feature Pyramid Network, FPN) can be constructed in the deep neural network model, and the deep neural network model with the feature pyramid network can be used to predict the position of the second target frame in the current image.
  • the feature pyramid network connects the high-level features of low-resolution, high-semantic information and the low-level features of high-resolution, low-semantic information from top to bottom, so that features at all scales have rich semantic information.
  • the connection method of the feature pyramid network is to upsample the higher-level features by a factor of 2 and then combine them with the corresponding lower-level features (which first pass through a 1*1 convolution kernel); the combination method is element-wise addition between pixels.
  • the feature maps used in each layer of prediction are fused with features of different resolutions and different semantic strengths, and the fused feature maps of different resolutions are used for object detection with corresponding resolutions. This ensures that each layer has appropriate resolution and strong semantic features. Constructing a feature pyramid network in the deep neural network model can improve the performance of predicting the second target frame, so that the deformed second target frame can still be better predicted.
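A minimal PyTorch sketch of one such top-down merge step, assuming the higher-level feature already has the output channel count; the class name and channel sizes are illustrative, not from the patent.

```python
import torch.nn.functional as F
from torch import nn

class FPNMerge(nn.Module):
    """One top-down FPN connection: 2x upsample + 1x1 lateral conv + add."""

    def __init__(self, low_channels, out_channels=256):
        super().__init__()
        # 1x1 convolution applied to the lower-level (lateral) feature
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)

    def forward(self, p_high, c_low):
        # upsample the higher-level feature by a factor of 2
        up = F.interpolate(p_high, scale_factor=2, mode="nearest")
        # element-wise (per-pixel) addition with the lateral feature
        return up + self.lateral(c_low)
```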
  • the predictor may be a SiamFC network (Fully-Convolutional Siamese Network) model, for example, a SiamFC network model constructed with a feature pyramid network.
  • Figure 4 is a schematic diagram of the SiamFC model.
  • z represents the template image, that is, the second target frame in the previous frame of image
  • x represents the search area, that is, the current image
  • φ represents a feature mapping operation that maps the original image into a feature space; the convolutional layers and pooling layers of a CNN can be used. 6*6*128 represents the feature obtained after z passes through φ: a 128-channel 6*6 feature.
  • 22*22*128 is the feature of x after φ. * represents the convolution operation: the 22*22*128 feature is convolved with the 6*6*128 feature as the kernel, and a 17*17 score map is obtained, which represents the similarity of each position in the search area to the template image.
  • the position in the search area with the highest similarity to the template image is the position of the prediction frame.
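The scoring step can be reproduced with a single cross-correlation, sketched here in PyTorch with random tensors standing in for φ(z) and φ(x); the shared CNN φ itself is omitted.

```python
import torch
import torch.nn.functional as F

# Template feature phi(z) used as a convolution kernel over the search
# feature phi(x): a 22x22 map convolved with a 6x6 kernel gives
# 22 - 6 + 1 = 17, i.e. a 17x17 score map.
z_feat = torch.randn(1, 128, 6, 6)     # stand-in for phi(z)
x_feat = torch.randn(1, 128, 22, 22)   # stand-in for phi(x)

score_map = F.conv2d(x_feat, z_feat)   # shape (1, 1, 17, 17)

# the location of the maximum similarity gives the predicted position
best = score_map.flatten().argmax().item()
row, col = divmod(best, score_map.shape[-1])
```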
  • Step 103 Match the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame.
  • the matching result of the first target frame and the prediction frame may include: a first target frame matches a prediction frame; a first target frame does not match any prediction frame; or a prediction frame does not match any first target frame.
  • the overlap area ratio (Intersection over Union, IOU) of the first target frame and the prediction frame may be calculated, and each pair of matching first target frame and prediction frame may be determined according to the overlap area ratio.
  • the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and a prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4.
  • the prediction frame P1 corresponds to the second target frame B1
  • the prediction frame P2 corresponds to the second target frame B2
  • the prediction frame P3 corresponds to the second target frame B3
  • the prediction frame P4 corresponds to the second target frame B4.
  • for the first target frame A1, calculate the overlap area ratios of A1 with the prediction frames P1, P2, P3, and P4; if the overlap area ratio of A1 and P1 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A1 matches the prediction frame P1.
  • for the first target frame A2, calculate the overlap area ratios of A2 with P1, P2, P3, and P4; if the overlap area ratio of A2 and P2 is the largest and is greater than or equal to the preset threshold (for example, 70%), it is determined that A2 matches P2. For the first target frame A3, calculate the overlap area ratios of A3 with P1, P2, P3, and P4; if the overlap area ratio of A3 and P3 is the largest and is greater than or equal to the preset threshold, it is determined that A3 matches P3. For the first target frame A4, calculate the overlap area ratios of A4 with P1, P2, P3, and P4; if the overlap area ratio of A4 and P4 is the largest and is greater than or equal to the preset threshold, it is determined that A4 matches P4. A sketch of this matching procedure is given below.
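A minimal sketch of the IOU-based matching; the patent does not spell out the exact assignment procedure, so this greedy per-detection version is one plausible reading. Boxes are (x, y, w, h) tuples as defined above.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def match_by_iou(detections, predictions, threshold=0.7):
    """For each detection, pick the prediction with the largest IOU,
    accepting the pair only if that IOU reaches the threshold."""
    matches = {}
    for i, det in enumerate(detections):
        overlaps = [iou(det, pred) for pred in predictions]
        j = int(np.argmax(overlaps))
        if overlaps[j] >= threshold:
            matches[i] = j   # detection index -> prediction index
    return matches
```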
  • the distance between the center point of the first target frame and the prediction frame may be calculated, and each pair of matched first target frame and the prediction frame may be determined according to the distance.
  • the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and the prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4.
  • for the first target frame A1, calculate the distances between the center point of A1 and the center points of the prediction frames P1, P2, P3, and P4. If the distance between the center points of A1 and P1 is the smallest and is less than or equal to a preset distance (for example, 10 pixels), it is determined that the first target frame A1 matches the prediction frame P1.
  • for the first target frame A2, calculate the distances between the center point of A2 and the center points of P1, P2, P3, and P4. If the distance between the center points of A2 and P2 is the smallest and is less than or equal to the preset distance, it is determined that A2 matches P2.
  • for the first target frame A3, calculate the distances between the center point of A3 and the center points of P1, P2, P3, and P4. If the distance between the center points of A3 and P3 is the smallest and is less than or equal to the preset distance, it is determined that A3 matches P3. For the first target frame A4, calculate the distances between the center point of A4 and the center points of P1, P2, P3, and P4; if the distance between the center points of A4 and P4 is the smallest and is less than or equal to the preset distance (for example, 10 pixels), it is determined that A4 matches P4. A corresponding sketch follows this list.
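The center-distance variant differs only in the pairwise score; a corresponding sketch, again with (x, y, w, h) boxes and a pixel threshold:

```python
def center_distance(box_a, box_b):
    """Distance between the centers of two (x, y, w, h) boxes, in pixels."""
    acx, acy = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bcx, bcy = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return ((acx - bcx) ** 2 + (acy - bcy) ** 2) ** 0.5

def match_by_distance(detections, predictions, max_dist=10.0):
    """Pair each detection with the nearest prediction if it is close enough."""
    matches = {}
    for i, det in enumerate(detections):
        dists = [center_distance(det, pred) for pred in predictions]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= max_dist:
            matches[i] = j   # detection index -> prediction index
    return matches
```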
  • Step 104 According to the matching result of the first target frame and the prediction frame, update the position of the target in the current image.
  • updating the position of the target in the current image may include:
  • if a first target frame matches a prediction frame, using the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
  • if a first target frame does not match any prediction frame, using the position of the first target frame in the current image as the position of a new target;
  • if a prediction frame does not match any first target frame, taking the target corresponding to the prediction frame as a lost target in the current image. These three rules are sketched in code below.
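Putting the three rules together, a hedged sketch of the update step; the track bookkeeping (plain indices standing in for targets) is illustrative rather than specified by the patent.

```python
def update_tracks(detections, predictions, matches):
    """Apply the three update rules above.

    matches maps detection index -> prediction index (from either
    matching scheme shown earlier).
    """
    updated, new_targets, lost = {}, [], []
    matched_preds = set(matches.values())
    for i, det in enumerate(detections):
        if i in matches:
            updated[matches[i]] = det   # matched: detection is the new position
        else:
            new_targets.append(det)     # unmatched detection: a new target
    for j in range(len(predictions)):
        if j not in matched_preds:
            lost.append(j)              # unmatched prediction: target lost
    return updated, new_targets, lost
```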
  • the target tracking method of the first embodiment uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image; obtains the second target frame in the previous frame of the current image and uses the predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image; matches the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and updates the position of the target in the current image according to the matching result.
  • the first embodiment improves the robustness and scene adaptability of target tracking.
  • Fig. 2 is a structural diagram of a target tracking device provided in the second embodiment of the present application.
  • the target tracking device 20 is applied to a computer device.
  • the target tracking device 20 tracks specific types of moving objects (such as pedestrians) in a video or image sequence, and obtains the position of the moving object in each frame of the image.
  • the target tracking device 20 can improve the robustness and scene adaptability of target tracking.
  • the target tracking device 20 may include a detection module 201, a prediction module 202, a matching module 203, and an update module 204.
  • the detection module 201 is configured to use a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image.
  • the predetermined type of target may include pedestrians, cars, airplanes, ships, and so on.
  • the predetermined type of target may be one type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
  • the target detector may be a neural network model with classification and regression functions.
  • the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
  • the Faster RCNN model includes the Region Proposal Network (RPN) and the Fast Region-based Convolution Neural Network (Fast RCNN).
  • the region proposal network and the Fast RCNN share convolutional layers, and the shared convolutional layers are used to extract a feature map of an image.
  • the region proposal network generates candidate frames of the image according to the feature map, and inputs the generated candidate frames into the Fast RCNN.
  • the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frames of the image.
  • before the target detector is used to detect the first target frame of the predetermined type of target in the current image, the target detector is trained using a first training sample set:
  • the shared convolutional layers extract a feature map of each sample image in the first training sample set;
  • the region proposal network obtains candidate frames in each sample image according to the feature map;
  • the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frame of each sample image.
  • the target frame may include target frames of different types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
  • the Faster RCNN model adopts the ZF framework, and the region proposal network and the Fast RCNN share 5 convolutional layers.
  • the first training sample set can be used to train the Faster RCNN model according to the following steps:
  • the region proposal network generates many candidate boxes; the several candidate boxes with the highest target classification scores can be screened out and input to the Fast RCNN, to speed up training and detection.
  • the backpropagation algorithm can be used to train the region proposal network, and the network parameters of the region proposal network can be adjusted during training to minimize the loss function.
  • the loss function indicates the difference between the predicted confidence of the candidate frames output by the region proposal network and the true confidence.
  • the loss function can include two parts: a target classification loss and a regression loss.
  • the loss function can be defined as (the standard Faster RCNN form, consistent with the symbols defined below):

    $$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,R(t_i - t_i^*)$$

  • $i$ is the index of the candidate frame in a training batch (mini-batch).
  • $N_{cls}$ is the size of the training batch, such as 256.
  • $p_i$ is the predicted probability that the i-th candidate frame is a target.
  • $p_i^*$ is the ground-truth (GT) label: if the candidate box is positive (that is, the assigned label is a positive label, called a positive candidate box), $p_i^*$ is 1; if the candidate box is negative (that is, the assigned label is a negative label, called a negative candidate box), $p_i^*$ is 0.
  • $\lambda$ is the balance weight, which can be taken as 10.
  • $N_{reg}$ is the number of candidate frames.
  • $R$ is a robust loss function (smooth L1), defined as:

    $$R(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
  • the training method of the Fast RCNN can refer to the training method of the region proposal network, and will not be repeated here.
  • the method of hard negative mining (HNM) is added to the training of the Fast RCNN.
  • in other embodiments, the target detector may also be another neural network model, such as a Region-based Convolutional Neural Network (RCNN) model or a Fast RCNN model.
  • when the target detector is used to detect a predetermined type of target in an image, the image is input to the target detector; the target detector detects the predetermined type of target in the image and outputs the position of each first target frame of the predetermined type of target in the image. For example, the target detector outputs 6 first target frames in the image.
  • the first target frame is presented in the form of a rectangular frame.
  • the position of the first target frame may be represented by position coordinates, and the position coordinates may include upper left corner coordinates (x, y) and width and height (w, h).
  • the target detector may also output the type of each first target frame, for example, output 5 first target frames of pedestrian type and 1 first target frame of automobile type.
  • the prediction module 202 is configured to obtain the second target frame in the previous frame of the current image, use a predictor to predict the position of the second target frame in the current image, and obtain the prediction frame of the second target frame in the current image.
  • the second target frame in the previous frame is a target frame obtained by using a target detector to detect a predetermined type of target in the previous frame.
  • predicting the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image means predicting the position of each second target frame in the current image and obtaining the prediction frame of each second target frame in the current image. For example, if four pedestrian target frames are detected in the previous frame of the current image, the positions of the four pedestrian target frames in the current image are predicted (that is, the positions in the current image of the four pedestrians corresponding to the four pedestrian target frames are predicted) to obtain the prediction frames of the four pedestrian target frames in the current image.
  • the predictor may be a deep neural network model.
  • the predictor Before using the predictor to predict the second target frame, the predictor is trained using the second training sample set.
  • the features learned by the predictor are deep features, in which color features (which are sensitive to illumination) account for a relatively small proportion. Therefore, the predictor can overcome the impact of illumination to a certain extent, improving the robustness and scene adaptability of target tracking.
  • the second training sample set may include a large number of sample images of objects with different illumination, deformation, and high-speed motion. Therefore, the predictor can further overcome the influence of illumination, and can overcome the influence of deformation and high-speed movement to a certain extent, so that this application realizes the tracking of high-speed moving targets and improves the robustness of target tracking.
  • a feature pyramid network (Feature Pyramid Network, FPN) can be constructed in the deep neural network model, and the deep neural network model with the feature pyramid network can be used to predict the position of the second target frame in the current image.
  • the feature pyramid network connects the high-level features of low-resolution, high-semantic information and the low-level features of high-resolution, low-semantic information from top to bottom, so that features at all scales have rich semantic information.
  • the connection method of the feature pyramid network is to upsample the higher-level features by a factor of 2 and then combine them with the corresponding lower-level features (which first pass through a 1*1 convolution kernel); the combination method is element-wise addition between pixels.
  • the feature maps used in each layer of prediction are fused with features of different resolutions and different semantic strengths, and the fused feature maps of different resolutions are used for object detection with corresponding resolutions. This ensures that each layer has appropriate resolution and strong semantic features. Constructing a feature pyramid network in the deep neural network model can improve the performance of predicting the second target frame, so that the deformed second target frame can still be better predicted.
  • the predictor may be a SiamFC network (Fully-Convolutional Siamese Network) model, for example, a SiamFC network model constructed with a feature pyramid network.
  • Figure 4 is a schematic diagram of the SiamFC model.
  • z represents the template image, that is, the second target frame in the previous frame of image
  • x represents the search area, that is, the current image
  • φ represents a feature mapping operation that maps the original image into a feature space; the convolutional layers and pooling layers of a CNN can be used.
  • 6*6*128 represents the feature obtained after z passes through φ, a 128-channel 6*6 feature; similarly, 22*22*128 is the feature of x after φ.
  • * represents the convolution operation: the 22*22*128 feature is convolved with the 6*6*128 feature as the kernel, and a 17*17 score map is obtained, which represents the similarity of each position in the search area to the template image.
  • the position in the search area with the highest similarity to the template image is the position of the prediction frame.
  • the matching module 203 is configured to match the first target frame in the current image with the prediction frame to obtain a matching result between the first target frame and the prediction frame.
  • the matching result of the first target frame and the prediction frame may include: a first target frame matches a prediction frame; a first target frame does not match any prediction frame; or a prediction frame does not match any first target frame.
  • the overlap area ratio (Intersection over Union, IOU) of the first target frame and the prediction frame may be calculated, and each pair of matching first target frame and prediction frame may be determined according to the overlap area ratio.
  • the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and a prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4.
  • the prediction frame P1 corresponds to the second target frame B1
  • the prediction frame P2 corresponds to the second target frame B2
  • the prediction frame P3 corresponds to the second target frame B3
  • the prediction frame P4 corresponds to the second target frame B4.
  • for the first target frame A1, calculate the overlap area ratios of A1 with the prediction frames P1, P2, P3, and P4; if the overlap area ratio of A1 and P1 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A1 matches the prediction frame P1.
  • for the first target frame A2, calculate the overlap area ratios of A2 with P1, P2, P3, and P4; if the overlap area ratio of A2 and P2 is the largest and is greater than or equal to the preset threshold (for example, 70%), it is determined that A2 matches P2. For the first target frame A3, calculate the overlap area ratios of A3 with P1, P2, P3, and P4; if the overlap area ratio of A3 and P3 is the largest and is greater than or equal to the preset threshold, it is determined that A3 matches P3. For the first target frame A4, calculate the overlap area ratios of A4 with P1, P2, P3, and P4; if the overlap area ratio of A4 and P4 is the largest and is greater than or equal to the preset threshold, it is determined that A4 matches P4.
  • the distance between the center point of the first target frame and the prediction frame may be calculated, and each pair of matched first target frame and the prediction frame may be determined according to the distance.
  • the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and the prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4.
  • for the first target frame A1, calculate the distances between the center point of A1 and the center points of the prediction frames P1, P2, P3, and P4. If the distance between the center points of A1 and P1 is the smallest and is less than or equal to a preset distance (for example, 10 pixels), it is determined that the first target frame A1 matches the prediction frame P1.
  • for the first target frame A2, calculate the distances between the center point of A2 and the center points of P1, P2, P3, and P4. If the distance between the center points of A2 and P2 is the smallest and is less than or equal to the preset distance, it is determined that A2 matches P2.
  • for the first target frame A3, calculate the distances between the center point of A3 and the center points of P1, P2, P3, and P4. If the distance between the center points of A3 and P3 is the smallest and is less than or equal to the preset distance, it is determined that A3 matches P3. For the first target frame A4, calculate the distances between the center point of A4 and the center points of P1, P2, P3, and P4; if the distance between the center points of A4 and P4 is the smallest and is less than or equal to the preset distance (for example, 10 pixels), it is determined that A4 matches P4.
  • the update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
  • updating the position of the target in the current image may include:
  • if a first target frame matches a prediction frame, using the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
  • if a first target frame does not match any prediction frame, using the position of the first target frame in the current image as the position of a new target;
  • if a prediction frame does not match any first target frame, taking the target corresponding to the prediction frame as a lost target in the current image.
  • This embodiment provides a target tracking device 20.
  • the target tracking is to track a specific type of moving object (such as a pedestrian) in a video or image sequence to obtain the position of the moving object in each frame of the image.
  • the target tracking device 20 uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image; obtains the second target frame in the previous frame of the current image, and uses a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image; matches the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and updates the position of the target in the current image according to the matching result.
  • This embodiment improves the robustness and scene adaptability of target tracking.
  • This embodiment provides a non-volatile readable storage medium having computer readable instructions stored on the non-volatile readable storage medium.
  • when the computer-readable instructions are executed by a processor, the steps of the target tracking method in the above embodiment are implemented, such as steps 101-104 shown in Figure 1:
  • Step 101 Use a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image;
  • Step 102 Obtain a second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
  • Step 103 Match the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame;
  • Step 104 According to the matching result of the first target frame and the prediction frame, update the position of the target in the current image.
  • alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the foregoing device embodiment are implemented, for example, modules 201-204 in Figure 2:
  • the detection module 201 uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image;
  • the prediction module 202 obtains the second target frame in the previous frame of the current image, uses a predictor to predict the position of the second target frame in the current image, and obtains the prediction frame of the second target frame in the current image;
  • the matching module 203 is configured to match the first target frame in the current image with the prediction frame to obtain a matching result between the first target frame and the prediction frame;
  • the update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application.
  • the computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and running on the processor 302, such as a target tracking program.
  • when the processor 302 executes the computer-readable instructions 303, the steps in the embodiment of the target tracking method are implemented, for example, steps 101-104 shown in FIG. 1:
  • Step 101 Use a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image;
  • Step 102 Obtain a second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
  • Step 103 Match the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame;
  • Step 104 According to the matching result of the first target frame and the prediction frame, update the position of the target in the current image.
  • alternatively, when the processor 302 executes the computer-readable instructions 303, the functions of each module in the foregoing device embodiment are implemented, for example, modules 201-204 in FIG. 2:
  • the detection module 201 uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image;
  • the prediction module 202 obtains the second target frame in the previous frame of the current image, uses a predictor to predict the position of the second target frame in the current image, and obtains the prediction frame of the second target frame in the current image;
  • the matching module 203 is configured to match the first target frame in the current image with the prediction frame to obtain a matching result between the first target frame and the prediction frame;
  • the update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
  • the computer-readable instruction 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method .
  • the computer-readable instruction 303 can be divided into the detection module 201, the prediction module 202, the matching module 203, and the update module 204 in FIG. 2.
  • the specific functions of each module refer to the second embodiment.
  • the computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the schematic diagram in FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30; it may include more or fewer components than shown, or combine certain components, or have different components.
  • the computer device 30 may also include input and output devices, network access devices, buses, etc.
  • the processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor.
  • the processor 302 is the control center of the computer device 30, and connects the various parts of the entire computer device 30 through various interfaces and lines.
  • the memory 301 may be used to store the computer-readable instructions 303; the processor 302 runs the computer-readable instructions or modules stored in the memory 301 and calls data stored in the memory 301 to implement the various functions of the computer device 30.
  • the memory 301 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function, an image playback function, etc.);
  • the data storage area may store data created according to the use of the computer device 30 (such as audio data, phone books, etc.).
  • the memory 301 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • if the integrated modules of the computer device 30 are implemented in the form of software function modules and sold or used as independent products, they can be stored in a non-volatile readable storage medium.
  • this application implements all or part of the processes in the above embodiments and methods, which can also be completed by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile readable storage medium, and when executed by the processor, can implement the steps of the foregoing method embodiments.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of hardware plus software functional modules.
  • the above-mentioned software function modules are stored in a non-volatile readable storage medium and include several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor execute part of the steps of the methods described in the embodiments of this application.


Abstract

The present application relates to the technical field of image processing, and provided therein are a target tracking method and device, a computer device, and a storage medium. The target tracking method comprises: using a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image; acquiring a second target frame in a previous frame image of the current image, and using a predictor to predict the position of the second target frame in the current image so as to obtain a prediction frame of the second target frame in the current image; matching the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and according to the matching result of the first target frame and the prediction frame, updating the position of the target in the current image. The present application improves the robustness and scene adaptability of target tracking.

Description

目标跟踪方法、装置、计算机装置及可读存储介质Target tracking method, device, computer device and readable storage medium
本申请要求于2019年01月23日提交中国专利局,申请号为201910064675.5,发明名称为“目标跟踪方法、装置、计算机装置及计算机存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on January 23, 2019, the application number is 201910064675.5, and the invention title is "target tracking method, device, computer device and computer storage medium", the entire content of which is incorporated by reference Incorporated in this application.
技术领域Technical field
本申请涉及图像处理技术领域,具体涉及一种目标跟踪方法、装置、计算机装置及非易失性可读存储介质。This application relates to the field of image processing technology, and in particular to a target tracking method, device, computer device and non-volatile readable storage medium.
背景技术Background technique
目标跟踪是指对视频或图像序列中的运动物体(例如交通视频中的汽车和行人)进行跟踪,得到运动物体在每一帧的位置。目标跟踪在视频监控、自动驾驶和视频娱乐等领域有广泛的应用。Target tracking refers to tracking moving objects in a video or image sequence (for example, cars and pedestrians in traffic videos) to obtain the position of the moving object in each frame. Target tracking has a wide range of applications in the fields of video surveillance, autonomous driving and video entertainment.
目前的目标跟踪主要采用了track by detection架构,在视频或图像序列的每帧图像上通过检测器检测出各个目标的位置信息,然后将当前帧的目标位置信息和前一帧的目标位置信息进行匹配。然而,现有方案目标跟踪的鲁棒性不高,如果光照发生变化,则跟踪效果不佳。The current target tracking mainly adopts the track by detection architecture. The position information of each target is detected by the detector on each frame of the video or image sequence, and then the target position information of the current frame and the target position information of the previous frame are match. However, the existing schemes are not robust in target tracking, and if the illumination changes, the tracking effect is not good.
Summary of the invention
In view of the above, it is necessary to provide a target tracking method and device, a computer device, and a non-volatile readable storage medium that can improve the robustness and scene adaptability of target tracking.
A first aspect of the present application provides a target tracking method, the method including:
using a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image;
acquiring a second target frame in the previous frame of the current image, and using a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
matching the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
A second aspect of the present application provides a target tracking device, the device including:
a detection module, configured to use a target detector to detect a predetermined type of target in a current image to obtain a first target frame in the current image;
a prediction module, configured to acquire a second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
a matching module, configured to match the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
an update module, configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
A third aspect of the present application provides a computer device, which includes a processor configured to implement the target tracking method when executing computer-readable instructions stored in a memory.
A fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the target tracking method is implemented.
In the present application, a target detector detects a predetermined type of target in a current image to obtain a first target frame in the current image; a second target frame in the previous frame of the current image is acquired, and a predictor predicts the position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image; the first target frame in the current image is matched with the prediction frame to obtain a matching result; and the position of the target is updated in the current image according to that matching result. The present application improves the robustness and scene adaptability of target tracking.
Description of the drawings
Fig. 1 is a flowchart of a target tracking method provided by an embodiment of the present application.
Fig. 2 is a structural diagram of a target tracking device provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
Fig. 4 is a schematic diagram of the SiamFC model.
Detailed description
To make the above objectives, features and advantages of this application clearer, the application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments can be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application; the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit the application.
Preferably, the target tracking method of the present application is applied in one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch panel or a voice control device.
Embodiment 1
Fig. 1 is a flowchart of the target tracking method provided in Embodiment 1 of the present application. The target tracking method is applied to a computer device.
The target tracking method of this application tracks moving objects of a specific type (for example, pedestrians) in a video or image sequence to obtain the position of each moving object in every frame. The method overcomes the shortcoming that existing solutions cannot track fast-moving targets, improving the robustness of target tracking.
As shown in Fig. 1, the target tracking method includes:
Step 101: use a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image.
The predetermined type of target may include pedestrians, cars, airplanes, ships, and so on. It may be a single type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
The target detector may be a neural network model with classification and regression functions. In this embodiment, the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
The Faster RCNN model includes a Region Proposal Network (RPN) and a Fast Region-based Convolutional Neural Network (Fast RCNN).
The region proposal network and the fast region convolutional neural network share convolutional layers, which are used to extract feature maps of an image. The region proposal network generates candidate boxes of the image according to the feature maps and inputs the generated candidate boxes into the fast region convolutional neural network, which screens and adjusts the candidate boxes according to the feature maps to obtain the target frames of the image.
Before the target detector is used to detect the first target frame of the predetermined type of target in the current image, the target detector is trained with a first training sample set. During training, the convolutional layers extract feature maps of each sample image in the first training sample set, the region proposal network obtains candidate boxes in each sample image according to the feature maps, and the fast region convolutional neural network screens and adjusts the candidate boxes according to the feature maps to obtain the target frames of each sample image. The target frames may include target frames of different types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
In a preferred embodiment, the Faster RCNN model adopts the ZF framework, and the region proposal network and the fast region convolutional neural network share 5 convolutional layers. The ZF framework is a commonly used network structure proposed by Matthew D Zeiler and Rob Fergus in the 2013 paper "Visualizing and Understanding Convolutional Networks"; it is a variant of the AlexNet network, fine-tuned from AlexNet with the ReLU activation function and a cross-entropy cost function, and uses smaller convolution kernels to retain more original pixel information.
In a specific embodiment, the Faster RCNN model can be trained with the first training sample set in the following steps:
(1) initialize the region proposal network with an ImageNet pre-trained model, and train the region proposal network with the first training sample set;
(2) use the region proposal network trained in (1) to generate candidate boxes for each sample image in the first training sample set, and train the fast region convolutional neural network with these candidate boxes; at this point, the region proposal network and the fast region convolutional neural network do not yet share convolutional layers;
(3) initialize the region proposal network with the fast region convolutional neural network trained in (2), and train the region proposal network with the first training sample set;
(4) initialize the fast region convolutional neural network with the region proposal network trained in (3), keep the shared convolutional layers fixed, and train the fast region convolutional neural network with the first training sample set; at this point, the two networks share the same convolutional layers and form a unified network model.
The region proposal network selects many candidate boxes; a number of candidate boxes with the highest target classification scores can be selected and input into the fast region convolutional neural network to speed up training and detection.
The region proposal network can be trained with the backpropagation algorithm, adjusting its network parameters during training to minimize a loss function. The loss function indicates the difference between the predicted confidence of the candidate boxes predicted by the region proposal network and the true confidence, and can include two parts: a target classification loss and a regression loss.
The loss function can be defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where $i$ is the index of a candidate box in a training mini-batch.
$\frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)$ is the target classification loss of the candidate boxes. $N_{cls}$ is the size of the training batch, for example 256. $p_i$ is the predicted probability that the i-th candidate box is a target. $p_i^*$ is the ground-truth (GT) label: if the candidate box is positive (i.e., the assigned label is a positive label, called a positive candidate box), $p_i^*$ is 1; if the candidate box is negative (i.e., the assigned label is a negative label, called a negative candidate box), $p_i^*$ is 0. $L_{cls}$ can be computed as the log loss:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right]$$
$\lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$ is the regression loss of the candidate boxes. $\lambda$ is a balance weight, which can be set to 10. $N_{reg}$ is the number of candidate boxes. $L_{reg}$ can be computed as $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$. $t_i$ is a coordinate vector, $t_i = (t_x, t_y, t_w, t_h)$, representing the 4 parameterized coordinates of the candidate box (for example, the coordinates of its upper-left corner together with its width and height). $t_i^*$ is the coordinate vector of the GT bounding box corresponding to a positive candidate box, $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ (for example, the coordinates of the upper-left corner of the real target box together with its width and height). $R$ is the robust loss function (smooth L1), defined as:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
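As an illustration, the loss defined above can be computed directly. The following is a minimal sketch in PyTorch (an assumed framework; the patent does not name one), in which the tensor shapes and the helper name rpn_loss are purely illustrative:

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star, lam=10.0):
    """Illustrative RPN loss: batch-averaged log loss plus a smooth-L1
    regression loss restricted to positive candidate boxes.
    p: (N,) predicted target probabilities in (0, 1);
    p_star: (N,) float GT labels in {0, 1};
    t, t_star: (N, 4) parameterized box coordinates."""
    n_cls = p.numel()   # training batch size N_cls (e.g. 256)
    n_reg = p.numel()   # number of candidate boxes N_reg
    # L_cls(p, p*) = -log[p* p + (1 - p*)(1 - p)]
    cls_loss = F.binary_cross_entropy(p, p_star, reduction="sum") / n_cls
    # L_reg(t, t*) = smooth_L1(t - t*), summed over the 4 coordinates
    # and weighted by p* so that only positive boxes contribute.
    reg = F.smooth_l1_loss(t, t_star, reduction="none").sum(dim=1)
    reg_loss = lam * (p_star * reg).sum() / n_reg
    return cls_loss + reg_loss

# e.g. loss = rpn_loss(torch.rand(256), torch.ones(256),
#                      torch.rand(256, 4), torch.rand(256, 4))
```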
The training method of the fast region convolutional neural network can refer to that of the region proposal network and is not repeated here.
In this embodiment, a hard negative mining (HNM) method is added to the training of the fast region convolutional neural network. For negative samples that the network wrongly classifies as positive (i.e., hard examples), their information is recorded; in the next training iteration, these negative samples are input into the first training sample set again, and the weight of their loss is increased to strengthen their influence on the classifier. This ensures that harder negative samples are continually classified, so that the features learned by the classifier progress from easy to hard and the covered sample distribution is more diverse.
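A minimal sketch of this mining-and-reweighting step follows; the 0.5 score threshold and the 2x loss boost are assumptions for illustration, since the patent does not specify concrete values:

```python
import torch

def mine_hard_negatives(scores, labels, threshold=0.5):
    """Record negatives (label 0) that the classifier wrongly scores as
    positive, i.e. above `threshold`; these are the hard examples."""
    hard = (labels == 0) & (scores > threshold)
    return hard.nonzero(as_tuple=True)[0]

def reweight_hard_negatives(losses, hard_idx, boost=2.0):
    """In the next training iteration, increase the loss weight of the
    recorded hard negatives to strengthen their influence on the classifier."""
    weights = torch.ones_like(losses)
    weights[hard_idx] = boost
    return (losses * weights).mean()
```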
In other embodiments, the target detector may also be another neural network model, for example a Region-based Convolutional Neural Network (RCNN) model or a Fast Region-based Convolutional Neural Network (Fast RCNN) model.
When the target detector is used to detect a predetermined type of target in an image, the image is input into the target detector, which detects the predetermined type of target and outputs the positions of the first target frames of that type in the image. For example, the target detector outputs 6 first target frames in the image. A first target frame is presented as a rectangular box, and its position can be represented by position coordinates, which may include the upper-left corner coordinates (x, y) and the width and height (w, h).
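For illustration, the detection step might look as follows with an off-the-shelf Faster RCNN from torchvision (a substitution for the ZF-backbone detector described here; the 0.5 score threshold is likewise an assumption):

```python
import torch
import torchvision

# Pre-trained Faster RCNN (torchvision >= 0.13); stands in for the
# trained detector described in the text.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)        # stand-in for the current frame
with torch.no_grad():
    output = detector([image])[0]      # dict with "boxes", "labels", "scores"

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score >= 0.5:                   # assumed confidence threshold
        x1, y1, x2, y2 = box.tolist()  # corner format (x1, y1, x2, y2)
        x, y, w, h = x1, y1, x2 - x1, y2 - y1   # (x, y, w, h) as in the text
        print(int(label), (x, y, w, h))
```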
The target detector may also output the type of each first target frame, for example 5 first target frames of the pedestrian type and 1 first target frame of the car type.
Step 102: acquire the second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image.
The second target frame in the previous frame is a target frame obtained by using the target detector to detect the predetermined type of target in the previous frame.
Predicting the position of the second target frame in the current image to obtain its prediction frame means predicting the position of each second target frame in the current image to obtain the prediction frame of each second target frame in the current image. For example, if 4 pedestrian target frames are detected in the previous frame of the current image, the positions of these 4 pedestrian target frames in the current image are predicted (that is, the positions in the current image of the 4 pedestrians corresponding to the 4 pedestrian target frames are predicted), giving the prediction frames of the 4 pedestrian target frames in the current image.
The predictor may be a deep neural network model.
Before the predictor is used to predict the second target frame, the predictor is trained with a second training sample set. The features learned by the predictor are deep features, in which color features account for a relatively small proportion, so the influence of illumination is limited. The predictor can therefore overcome the influence of illumination to a certain extent, improving the robustness and scene adaptability of target tracking. In this embodiment, the second training sample set may include a large number of sample images under different illumination, with deformation, and with fast-moving objects. The predictor can thus further overcome the influence of illumination and, to a certain extent, the influence of deformation and high-speed motion, so that this application achieves tracking of fast-moving targets and improves the robustness of target tracking.
In this embodiment, a Feature Pyramid Network (FPN) can be constructed in the deep neural network model, and the deep neural network model with the feature pyramid network is used to predict the position of the second target frame in the current image. The feature pyramid network makes top-down lateral connections between high-level features with low resolution and rich semantic information and low-level features with high resolution and weak semantic information, so that the features at all scales have rich semantic information. The connection method is to upsample the higher-level feature by a factor of 2 and combine it with the corresponding feature of the preceding layer (which first passes through a 1*1 convolution kernel) by pixel-wise addition. Through such connections, the feature map used for prediction at each level fuses features of different resolutions and different semantic strengths, and the fused feature maps of different resolutions are used for object detection at the corresponding resolutions. This ensures that each level has a suitable resolution as well as strong semantic features. Constructing a feature pyramid network in the deep neural network model can improve the prediction of the second target frame, so that a deformed second target frame is still predicted well.
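The lateral connection described above can be sketched as follows; the channel counts and feature sizes are arbitrary illustrative values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One top-down FPN connection: upsample the higher-level feature by
    2x, pass the lower-level feature through a 1*1 convolution, and
    combine the two by pixel-wise addition."""
    def __init__(self, low_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)

    def forward(self, high, low):
        up = F.interpolate(high, scale_factor=2, mode="nearest")  # 2x upsample
        return up + self.lateral(low)                             # pixel-wise add

# e.g. merge a 256-channel 7x7 top-level map with a 512-channel 14x14 map
merged = FPNMerge(512)(torch.rand(1, 256, 7, 7), torch.rand(1, 512, 14, 14))
```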
In a specific embodiment, the predictor may be a SiamFC (Fully-Convolutional Siamese Network) model, for example a SiamFC network model constructed with a feature pyramid network.
Fig. 4 is a schematic diagram of the SiamFC model. In Fig. 4, z represents the template image, i.e., the second target frame in the previous frame; x represents the search region, i.e., the current image; φ represents a feature mapping operation that maps the original image to a specific feature space, which may use the convolutional and pooling layers of a CNN; 6*6*128 represents the feature obtained after z passes through φ, a 128-channel feature of size 6*6; similarly, 22*22*128 is the feature of x after φ; * represents the convolution operation: the 22*22*128 feature is convolved with the 6*6*128 convolution kernel to obtain a 17*17 score map representing the similarity between each position in the search region and the template image. The position in the search region with the highest similarity to the template image is the position of the prediction frame.
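The scoring step just described reduces to a single cross-correlation; a minimal sketch, with random tensors standing in for the φ embeddings, is:

```python
import torch
import torch.nn.functional as F

z_feat = torch.rand(1, 128, 6, 6)     # template z after φ (128 channels, 6*6)
x_feat = torch.rand(1, 128, 22, 22)   # search region x after φ (128 channels, 22*22)

# Use the template feature as a convolution kernel over the search feature:
# (22 - 6 + 1) = 17, so the result is a 17*17 similarity score map.
score_map = F.conv2d(x_feat, z_feat)  # shape (1, 1, 17, 17)

# The prediction frame is centred at the highest-scoring position.
best = score_map.flatten().argmax().item()
row, col = divmod(best, score_map.shape[-1])
```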
Step 103: match the first target frame in the current image with the prediction frame to obtain a matching result of the first target frame and the prediction frame.
The matching result may be that a first target frame matches a prediction frame, that a first target frame matches none of the prediction frames, or that a prediction frame matches none of the first target frames.
In this embodiment, the overlap ratio (Intersection over Union, IOU) of a first target frame and a prediction frame can be calculated, and each matched pair of first target frame and prediction frame is determined according to the overlap ratio.
For example, the first target frames include first target frames A1, A2, A3 and A4, and the prediction frames include prediction frames P1, P2, P3 and P4, where P1 corresponds to second target frame B1, P2 to B2, P3 to B3, and P4 to B4. For first target frame A1, the overlap ratios of A1 with P1, A1 with P2, A1 with P3 and A1 with P4 are calculated; if the overlap ratio of A1 with P1 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that A1 matches P1. Similarly, if the overlap ratio of A2 with P2 is the largest among A2's overlap ratios with P1 to P4 and is greater than or equal to the preset threshold, A2 matches P2; if the overlap ratio of A3 with P3 is the largest among A3's overlap ratios with P1 to P4 and is greater than or equal to the preset threshold, A3 matches P3; and if the overlap ratio of A4 with P4 is the largest among A4's overlap ratios with P1 to P4 and is greater than or equal to the preset threshold, A4 matches P4.
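As an illustration of this criterion, a plain-Python sketch follows; the greedy per-detection best match and the helper names are assumptions, since the patent only requires the largest overlap ratio at or above the threshold:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_by_iou(targets, predictions, threshold=0.7):
    """For each first target frame, take the prediction frame with the
    largest IoU; the pair matches only if that IoU >= threshold."""
    matches = {}
    for i, t in enumerate(targets):
        if not predictions:
            break
        j = max(range(len(predictions)), key=lambda k: iou(t, predictions[k]))
        if iou(t, predictions[j]) >= threshold:
            matches[i] = j
    return matches
```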
Alternatively, the distance between the center points of a first target frame and a prediction frame can be calculated, and each matched pair of first target frame and prediction frame is determined according to this distance.
For example, with the same first target frames A1 to A4 and prediction frames P1 to P4, for first target frame A1 the center-point distances of A1 to P1, P2, P3 and P4 are calculated; if the distance of A1 to P1 is the smallest and is less than or equal to a preset distance (for example, 10 pixels), it is determined that A1 matches P1. Similarly, if the distance of A2 to P2 is the smallest among A2's center-point distances to P1 to P4 and is less than or equal to the preset distance, A2 matches P2; if the distance of A3 to P3 is the smallest among A3's center-point distances to P1 to P4 and is less than or equal to the preset distance, A3 matches P3; and if the distance of A4 to P4 is the smallest among A4's center-point distances to P1 to P4 and is less than or equal to the preset distance, A4 matches P4.
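The distance-based alternative admits the same kind of sketch, again with illustrative helper names and the 10-pixel example threshold:

```python
import math

def center_distance(a, b):
    """Distance between the centre points of two (x, y, w, h) boxes."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def match_by_distance(targets, predictions, max_dist=10.0):
    """Match each first target frame to the prediction frame with the
    nearest centre; accept the pair only if the distance <= max_dist."""
    matches = {}
    for i, t in enumerate(targets):
        if not predictions:
            break
        j = min(range(len(predictions)),
                key=lambda k: center_distance(t, predictions[k]))
        if center_distance(t, predictions[j]) <= max_dist:
            matches[i] = j
    return matches
```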
Step 104: update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
Updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame may include:
if a first target frame matches a prediction frame, taking the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
if a first target frame matches none of the prediction frames, taking the position of the first target frame in the current image as the position of a new target;
if a prediction frame matches none of the first target frames, treating the target corresponding to the prediction frame as a lost target in the current image.
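Taken together, the three rules amount to the following sketch, which assumes the `matches` mapping produced by a matcher such as the IoU example above (all names are illustrative):

```python
def update_positions(targets, predictions, matches):
    """Apply the three update rules: matched detections update existing
    targets, unmatched detections start new targets, and unmatched
    prediction frames mark their targets as lost."""
    updated, new_targets = {}, []
    for i, box in enumerate(targets):
        if i in matches:
            # Matched: the first target frame's position becomes the
            # updated position of the target behind the prediction frame.
            updated[matches[i]] = box
        else:
            # Unmatched detection: treat it as a newly appeared target.
            new_targets.append(box)
    # Prediction frames matched by no detection correspond to lost targets.
    lost = [j for j in range(len(predictions)) if j not in matches.values()]
    return updated, new_targets, lost
```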
The target tracking method of Embodiment 1 uses a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image; acquires the second target frame in the previous frame of the current image and uses a predictor to predict the position of the second target frame in the current image to obtain its prediction frame; matches the first target frame in the current image with the prediction frame to obtain a matching result; and updates the position of the target in the current image according to the matching result. Embodiment 1 improves the robustness and scene adaptability of target tracking.
Embodiment 2
Fig. 2 is a structural diagram of the target tracking device provided in Embodiment 2 of the present application. The target tracking device 20 is applied to a computer device. The device tracks moving objects of a specific type (for example, pedestrians) in a video or image sequence to obtain the position of each moving object in every frame, and can improve the robustness and scene adaptability of target tracking. As shown in Fig. 2, the target tracking device 20 may include a detection module 201, a prediction module 202, a matching module 203 and an update module 204.
The detection module 201 is configured to use a target detector to detect a predetermined type of target in the current image to obtain the first target frame in the current image.
The predetermined type of target may include pedestrians, cars, airplanes, ships, and so on. It may be a single type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
The target detector may be a neural network model with classification and regression functions. In this embodiment, the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
The Faster RCNN model includes a Region Proposal Network (RPN) and a Fast Region-based Convolutional Neural Network (Fast RCNN).
The region proposal network and the fast region convolutional neural network share convolutional layers, which are used to extract feature maps of an image. The region proposal network generates candidate boxes of the image according to the feature maps and inputs the generated candidate boxes into the fast region convolutional neural network, which screens and adjusts the candidate boxes according to the feature maps to obtain the target frames of the image.
Before the target detector is used to detect the first target frame of the predetermined type of target in the current image, the target detector is trained with a first training sample set. During training, the convolutional layers extract feature maps of each sample image in the first training sample set, the region proposal network obtains candidate boxes in each sample image according to the feature maps, and the fast region convolutional neural network screens and adjusts the candidate boxes according to the feature maps to obtain the target frames of each sample image. The target frames may include target frames of different types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
In a preferred embodiment, the Faster RCNN model adopts the ZF framework, and the region proposal network and the fast region convolutional neural network share 5 convolutional layers.
In a specific embodiment, the Faster RCNN model can be trained with the first training sample set in the following steps:
(1) initialize the region proposal network with an ImageNet pre-trained model, and train the region proposal network with the first training sample set;
(2) use the region proposal network trained in (1) to generate candidate boxes for each sample image in the first training sample set, and train the fast region convolutional neural network with these candidate boxes; at this point, the region proposal network and the fast region convolutional neural network do not yet share convolutional layers;
(3) initialize the region proposal network with the fast region convolutional neural network trained in (2), and train the region proposal network with the first training sample set;
(4) initialize the fast region convolutional neural network with the region proposal network trained in (3), keep the shared convolutional layers fixed, and train the fast region convolutional neural network with the first training sample set; at this point, the two networks share the same convolutional layers and form a unified network model.
The region proposal network selects many candidate boxes; a number of candidate boxes with the highest target classification scores can be selected and input into the fast region convolutional neural network to speed up training and detection.
The region proposal network can be trained with the backpropagation algorithm, adjusting its network parameters during training to minimize a loss function. The loss function indicates the difference between the predicted confidence of the candidate boxes predicted by the region proposal network and the true confidence, and can include two parts: a target classification loss and a regression loss.
The loss function can be defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
where $i$ is the index of a candidate box in a training mini-batch.
$\frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)$ is the target classification loss of the candidate boxes. $N_{cls}$ is the size of the training batch, for example 256. $p_i$ is the predicted probability that the i-th candidate box is a target. $p_i^*$ is the ground-truth (GT) label: if the candidate box is positive (i.e., the assigned label is a positive label, called a positive candidate box), $p_i^*$ is 1; if the candidate box is negative (i.e., the assigned label is a negative label, called a negative candidate box), $p_i^*$ is 0. $L_{cls}$ can be computed as the log loss:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right]$$
$\lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$ is the regression loss of the candidate boxes. $\lambda$ is a balance weight, which can be set to 10. $N_{reg}$ is the number of candidate boxes. $L_{reg}$ can be computed as $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$. $t_i$ is a coordinate vector, $t_i = (t_x, t_y, t_w, t_h)$, representing the 4 parameterized coordinates of the candidate box (for example, the coordinates of its upper-left corner together with its width and height). $t_i^*$ is the coordinate vector of the GT bounding box corresponding to a positive candidate box, $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ (for example, the coordinates of the upper-left corner of the real target box together with its width and height). $R$ is the robust loss function (smooth L1), defined as:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
快速区域卷积网络的训练方法可以参照区域建议网络的训练方法,此处不再赘述。The training method of the fast regional convolutional network can refer to the training method of the regional suggestion network, which will not be repeated here.
在本实施例中,在快速区域卷积网络的训练中加入负样本难例挖掘(Hard Negative Mining,HNM)方法。对于被快速区域卷积网络错误地分类为正样本的负样本(即难例),将这些负样本的信息记录下来,在下次迭代训练的过程中,将这些负样本再次输入到第一训练样本集中,并且加大其损失的权重,增强其对分类器的影响,这样能够保证不停的针对更难的负样本进行分类,使得分类器学到的特征由易到难,涵盖的样本分布也更具多样性。In this embodiment, the method of Hard Negative Mining (HNM) is added to the training of the fast area convolutional network. For negative samples (ie difficult cases) that are incorrectly classified as positive samples by the fast area convolutional network, record the information of these negative samples, and input these negative samples into the first training sample again during the next iteration of training Concentrate, and increase the weight of its loss, and enhance its impact on the classifier. This can ensure that the more difficult negative samples are continuously classified, so that the features learned by the classifier are from easy to difficult, and the distribution of samples covered is also More diversity.
在其他的实施例中,所述目标检测器还可以是其他的神经网络模型,例如区域卷积神经网络(RCNN)模型、加快卷积神经网络(Faster RCNN)模型。In other embodiments, the target detector may also be other neural network models, such as a regional convolutional neural network (RCNN) model, or a Faster Convolutional Neural Network (RCNN) model.
利用目标检测器检测图像中的预定类型目标时,将所述图像输入所述目标检测器,所述目标检测器对图像中的预定类型目标进行检测,输出所述图像中的预定类型目标的第一目标框的位置。例如,所述目标检测器输出所述图像中的6个第一目标框。第一目标框以矩形框的形式呈现。第一目标框的位置可以用位置坐标表示,所述位置坐标可以包括左上角坐标(x,y)和宽高(w,h)。When a target detector is used to detect a predetermined type of target in an image, the image is input to the target detector, and the target detector detects the predetermined type of target in the image, and outputs the second target of the predetermined type of target in the image. The position of a target frame. For example, the target detector outputs 6 first target frames in the image. The first target frame is presented in the form of a rectangular frame. The position of the first target frame may be represented by position coordinates, and the position coordinates may include upper left corner coordinates (x, y) and width and height (w, h).
所述目标检测器还可以输出每个第一目标框的类型,例如输出5个行人类型的第一目标框和1个汽车类型的第一目标框。The target detector may also output the type of each first target frame, for example, output 5 first target frames of pedestrian type and 1 first target frame of automobile type.
预测模块202,用于获取所述当前图像的前一帧图像中的第二目标框,利用预测器预测所述第二目标框在所述当前图像中的位置,得到所述第二目标框在所述当前图像中的预测框。The prediction module 202 is configured to obtain the second target frame in the previous frame of the current image, use a predictor to predict the position of the second target frame in the current image, and obtain the second target frame at The prediction box in the current image.
前一帧图像中的第二目标框是利用目标检测器检测前一帧图像中的预定类型目标得到的目标框。The second target frame in the previous frame is a target frame obtained by using a target detector to detect a predetermined type of target in the previous frame.
预测第二目标框在所述当前图像中的位置,得到所述第二目标框在所述当前图像中的预测框是预测每个第二目标框在所述当前图像中的位置,得到每个第二目标框在所述当前图像中的预测框。例如,当前图像的前一帧图像中检测到4个行人目标框,则预测所述4个行人目标框在所述当前图像中的位置(也就是预测4个行人目标框对应的4个行人在所述当前图像中的位置),得到所述4个行人目标框在所述当前图像中的预测框。Predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image is to predict the position of each second target frame in the current image, and obtain each The prediction frame of the second target frame in the current image. For example, if four pedestrian target frames are detected in the previous frame of the current image, the positions of the four pedestrian target frames in the current image are predicted (that is, the four pedestrian target frames corresponding to the four pedestrian target frames are predicted The position in the current image) to obtain the prediction frames of the four pedestrian target frames in the current image.
所述预测器可以是深度神经网络模型。The predictor may be a deep neural network model.
在利用预测器对第二目标框进行预测之前,使用第二训练样本集对所述预测器进行训练。所述预测器学习到的特征为深度特征,其中的颜色特征占比较小,受光照的影响有限。因此,所述预测器在一定程度上可以克服光照带来的影响,提高了目标跟踪的鲁棒性和场景适应性。在本实施例中,第二训练样本集可以包括大量的不同光照、形变和高速运动物体的样本图像。因 此,所述预测器可以进一步克服光照的影响,并且可以在一定程度克服形变、高速运动带来的影响,从而本申请实现了对高速运动的目标进行跟踪,提高了目标跟踪的鲁棒性。Before using the predictor to predict the second target frame, the predictor is trained using the second training sample set. The features learned by the predictor are depth features, in which color features account for a relatively small proportion and are limited by illumination. Therefore, the predictor can overcome the impact of illumination to a certain extent, and improve the robustness and scene adaptability of target tracking. In this embodiment, the second training sample set may include a large number of sample images of objects with different illumination, deformation, and high-speed motion. Therefore, the predictor can further overcome the influence of illumination, and can overcome the influence of deformation and high-speed movement to a certain extent, so that this application realizes the tracking of high-speed moving targets and improves the robustness of target tracking.
在本实施例中,可以在所述深度神经网络模型中构建特征金字塔网络(Feature Pyramid Network,FPN),利用构建有所述特征金字塔网络的所述深度神经网络模型预测所述第二目标框在所述当前图像中的位置。特征金字塔网络把低分辨率、高语义信息的高层特征和高分辨率、低语义信息的低层特征进行自上而下的侧边连接,使得所有尺度下的特征都有丰富的语义信息。特征金字塔网络的连接方法是把高层特征做2倍上采样,然后和对应的前一层特征结合(前一层要经过1*1的卷积核),结合方式就是做像素间的加法。通过这样的连接,每一层预测所用的特征图都融合了不同分辨率、不同语义强度的特征,融合的不同分辨率的特征图分别做对应分辨率大小的物体检测。这样保证了每一层都有合适的分辨率以及强语义特征。在所述深度神经网络模型中构建特征金字塔网络可以提高对第二目标框预测的性能,使得对发生形变的第二目标框仍然得到较好的预测。In this embodiment, a feature pyramid network (Feature Pyramid Network, FPN) can be constructed in the deep neural network model, and the deep neural network model with the feature pyramid network can be used to predict that the second target frame is The position in the current image. The feature pyramid network connects the high-level features of low-resolution, high-semantic information and the low-level features of high-resolution, low-semantic information from top to bottom, so that features at all scales have rich semantic information. The connection method of the feature pyramid network is to upsample the high-level features twice, and then combine them with the corresponding features of the previous layer (the previous layer has to go through a 1*1 convolution kernel), and the combination method is to add between pixels. Through this connection, the feature maps used in each layer of prediction are fused with features of different resolutions and different semantic strengths, and the fused feature maps of different resolutions are used for object detection with corresponding resolutions. This ensures that each layer has appropriate resolution and strong semantic features. Constructing a feature pyramid network in the deep neural network model can improve the performance of predicting the second target frame, so that the deformed second target frame can still be better predicted.
在一具体实施例中,所述预测器可以是SiamFC网络(Fully-Convolutional Siamese Network)模型,例如构建有特征金字塔网络的SiamFC网络模型。图4是SiamFC模型的示意图。In a specific embodiment, the predictor may be a SiamFC network (Fully-Convolutional Siamese Network) model, for example, a SiamFC network model constructed with a feature pyramid network. Figure 4 is a schematic diagram of the SiamFC model.
图4中,z代表的是模板图像,即前一帧图像中的第二目标框;x代表的是搜索区域,即当前图像;φ代表的是一种特征映射操作,将原始图像映射到特定的特征空间,可以采用CNN中的卷积层和池化层;6*6*128代表z经过φ后得到的特征,是一个128通道6*6大小的特征,同理,22*22*128是x经过φ后的特征;*代表卷积操作,22*22*128的特征被6*6*128的卷积核卷积,得到一个17*17的得分图,代表搜索区域中各个位置与模板图像的相似度。搜索区域中与模板图像的相似度相似度最高的位置就是预测框的位置。In Figure 4, z represents the template image, that is, the second target frame in the previous frame of image; x represents the search area, that is, the current image; φ represents a feature mapping operation, which maps the original image to a specific The feature space of, you can use the convolutional layer and pooling layer in CNN; 6*6*128 represents the feature obtained after z passes through φ, which is a 128-channel 6*6 feature, the same, 22*22*128 Is the feature of x after φ; * represents the convolution operation, the feature of 22*22*128 is convolved by the 6*6*128 convolution kernel, and a 17*17 score map is obtained, which represents each position in the search area and The similarity of the template image. The position in the search area with the highest similarity to the template image is the position of the prediction frame.
匹配模块203,用于将所述当前图像中的第一目标框与所述预测框进行匹配,得到所述第一目标框与所述预测框的匹配结果。The matching module 203 is configured to match the first target frame in the current image with the prediction frame to obtain a matching result between the first target frame and the prediction frame.
所述第一目标框与所述预测框的匹配结果可以包括所述第一目标框与所述预测框匹配、所述第一目标框与任意所述预测框不匹配、所述预测框与任意所述第一目标框不匹配。The matching result of the first target frame and the prediction frame may include a match between the first target frame and the prediction frame, a mismatch between the first target frame and any prediction frame, and the prediction frame and any prediction frame. The first target frame does not match.
在本实施例中,可以计算所述第一目标框与所述预测框的重叠面积比例(Intersection over Union,IOU),根据所述重叠面积比例确定每一对匹配的所述第一目标框与所述预测框。In this embodiment, the overlap area ratio (Intersection over Union, IOU) of the first target frame and the prediction frame may be calculated, and each pair of matching first target frame and the prediction frame may be determined according to the overlap area ratio. The prediction box.
例如,第一目标框包括第一目标框A1、第一目标框A2、第一目标框A3、第一目标框A4,预测框包括预测框P1、预测框P2、预测框P3、预测框P4。预测框P1对应第二目标框B1、预测框P2对应第二目标框B2、预测框P3对应第二目标框B3、预测框P4对应第二目标框B4。对于第一目标框A1,计算第一目标框A1与预测框P1、第一目标框A1与预测框P2、第一目标框A1与预测框P3、第一目标框A1与预测框P4的重叠面积比例,若第一目标框A1与预测框P1的重叠面积比例最大且大于或等于预设阈值(例如70%), 则确定第一目标框A1与预测框P1相匹配。类似地,对于第一目标框A2,计算第一目标框A2与预测框P1、第一目标框A2与预测框P2、第一目标框A2与预测框P3、第一目标框A2与预测框P4的重叠面积比例,若第一目标框A2与预测框P2的重叠面积比例最大且大于或等于预设阈值(例如70%),则确定第一目标框A2与预测框P2相匹配;对于第一目标框A3,计算第一目标框A3与预测框P1、第一目标框A3与预测框P2、第一目标框A3与预测框P3、第一目标框A3与预测框P4的重叠面积比例,若第一目标框A3与预测框P3的重叠面积比例最大且大于或等于预设阈值(例如70%),则确定第一目标框A3与预测框P3相匹配;对于第一目标框A4,计算第一目标框A4与预测框P1、第一目标框A4与预测框P2、第一目标框A4与预测框P3、第一目标框A4与预测框P4的重叠面积比例,若第一目标框A4与预测框P4的重叠面积比例最大且大于或等于预设阈值(例如70%),则确定第一目标框A4与预测框P4相匹配。For example, the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and a prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4. The prediction frame P1 corresponds to the second target frame B1, the prediction frame P2 corresponds to the second target frame B2, the prediction frame P3 corresponds to the second target frame B3, and the prediction frame P4 corresponds to the second target frame B4. For the first target frame A1, calculate the overlap area of the first target frame A1 and the prediction frame P1, the first target frame A1 and the prediction frame P2, the first target frame A1 and the prediction frame P3, and the first target frame A1 and the prediction frame P4 If the ratio of the overlapping area of the first target frame A1 and the prediction frame P1 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A1 matches the prediction frame P1. Similarly, for the first target frame A2, calculate the first target frame A2 and the prediction frame P1, the first target frame A2 and the prediction frame P2, the first target frame A2 and the prediction frame P3, the first target frame A2 and the prediction frame P4 If the overlap area ratio of the first target frame A2 and the prediction frame P2 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A2 matches the prediction frame P2; Target frame A3, calculate the overlap area ratio of the first target frame A3 and the prediction frame P1, the first target frame A3 and the prediction frame P2, the first target frame A3 and the prediction frame P3, the first target frame A3 and the prediction frame P4, if The ratio of the overlapping area of the first target frame A3 and the prediction frame P3 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A3 matches the prediction frame P3; for the first target frame A4, the first target frame A4 is calculated. A target frame A4 and the prediction frame P1, the first target frame A4 and the prediction frame P2, the first target frame A4 and the prediction frame P3, the first target frame A4 and the prediction frame P4 overlap area ratio, if the first target frame A4 and If the overlap area ratio of the prediction frame P4 is the largest and is greater than or equal to a preset threshold (for example, 70%), it is determined that the first target frame A4 matches the prediction frame P4.
或者,可以计算所述第一目标框与所述预测框的中心点的距离,根据所述距离确定每一对匹配的所述第一目标框与所述预测框。Alternatively, the distance between the center point of the first target frame and the prediction frame may be calculated, and each pair of matched first target frame and the prediction frame may be determined according to the distance.
例如,在第一目标框包括第一目标框A1、第一目标框A2、第一目标框A3、第一目标框A4,预测框包括预测框P1、预测框P2、预测框P3、预测框P4的例子中,对于第一目标框A1,计算第一目标框A1与预测框P1、第一目标框A1与预测框P2、第一目标框A1与预测框P3、第一目标框A1与预测框P4的中心点的距离,若第一目标框A1与预测框P1的中心点的距离最小且小于或等于预设距离(例如10个像素点),则确定第一目标框A1与预测框P1相匹配。类似地,类似地,对于第一目标框A2,计算第一目标框A2与预测框P1、第一目标框A2与预测框P2、第一目标框A2与预测框P3、第一目标框A2与预测框P4的中心点的距离,若第一目标框A2与预测框P2的中心点的距离最小且小于或等于预设距离(例如10个像素点),则确定第一目标框A2与预测框P2相匹配;对于第一目标框A3,计算第一目标框A3与预测框P1、第一目标框A3与预测框P2、第一目标框A3与预测框P3、第一目标框A3与预测框P4的中心点的距离,若第一目标框A3与预测框P3的中心点的距离最小且小于或等于预设距离(例如10个像素点),则确定第一目标框A3与预测框P3相匹配;对于第一目标框A4,计算第一目标框A4与预测框P1、第一目标框A4与预测框P2、第一目标框A4与预测框P3、第一目标框A4与预测框P4的中心点的距离,若第一目标框A4与预测框P4的中心点的距离最小且小于或等于预设距离(例如10个像素点),则确定第一目标框A4与预测框P4相匹配。For example, the first target frame includes a first target frame A1, a first target frame A2, a first target frame A3, and a first target frame A4, and the prediction frame includes a prediction frame P1, a prediction frame P2, a prediction frame P3, and a prediction frame P4. In the example, for the first target frame A1, calculate the first target frame A1 and the prediction frame P1, the first target frame A1 and the prediction frame P2, the first target frame A1 and the prediction frame P3, the first target frame A1 and the prediction frame The distance between the center point of P4. If the distance between the center point of the first target frame A1 and the prediction frame P1 is the smallest and is less than or equal to the preset distance (for example, 10 pixels), it is determined that the first target frame A1 and the prediction frame P1 are the same match. Similarly, similarly, for the first target frame A2, calculate the first target frame A2 and the prediction frame P1, the first target frame A2 and the prediction frame P2, the first target frame A2 and the prediction frame P3, and the first target frame A2 and the prediction frame P3. The distance between the center point of the prediction frame P4. If the distance between the first target frame A2 and the center point of the prediction frame P2 is the smallest and less than or equal to the preset distance (for example, 10 pixels), the first target frame A2 and the prediction frame are determined P2 matches; for the first target frame A3, calculate the first target frame A3 and the prediction frame P1, the first target frame A3 and the prediction frame P2, the first target frame A3 and the prediction frame P3, the first target frame A3 and the prediction frame The distance between the center point of P4. If the distance between the center point of the first target frame A3 and the prediction frame P3 is the smallest and is less than or equal to the preset distance (for example, 10 pixels), it is determined that the first target frame A3 and the prediction frame P3 are the same Match; for the first target frame A4, calculate the first target frame A4 and the prediction frame P1, the first target frame A4 and the prediction frame P2, the first target frame A4 and the prediction frame P3, the first target frame A4 and the prediction frame P4 The distance of the center point, if the distance between the center point of the first target frame A4 and the prediction frame P4 is the smallest and is less than or equal to the preset distance (for example, 10 pixels), it is determined that the first target frame A4 matches the prediction frame P4.
The update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
Updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame may include:
If the first target frame matches the prediction frame, the position of the first target frame in the current image is taken as the updated position of the target corresponding to the prediction frame;
If the first target frame matches none of the prediction frames, the position of the first target frame in the current image is taken as the position of a new target;
If the prediction frame matches none of the first target frames, the target corresponding to the prediction frame is treated as a lost target in the current image.
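Read together, these three rules amount to a small bookkeeping routine. The sketch below is one hedged interpretation, not code from the patent; the track representation (a dict with id, box, and lost fields) is an assumption made for illustration.

```python
def update_targets(tracks, detections, matches):
    # tracks: existing targets, one per prediction frame, in the same order.
    # detections: first target frames found by the detector in the current image.
    # matches: (detection_index, track_index) pairs from the matching step.
    matched_dets = {d for d, _ in matches}
    matched_tracks = {t for _, t in matches}
    n_existing = len(tracks)

    # Rule 1: a matched first target frame becomes the updated position
    # of the target corresponding to the prediction frame.
    for d, t in matches:
        tracks[t]["box"] = detections[d]
        tracks[t]["lost"] = False

    # Rule 3: a prediction frame matching no first target frame marks
    # its target as lost in the current image.
    for t in range(n_existing):
        if t not in matched_tracks:
            tracks[t]["lost"] = True

    # Rule 2: a first target frame matching no prediction frame starts
    # a new target at the detected position.
    next_id = max((trk["id"] for trk in tracks), default=-1) + 1
    for d, det in enumerate(detections):
        if d not in matched_dets:
            tracks.append({"id": next_id, "box": det, "lost": False})
            next_id += 1
    return tracks
```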
This embodiment provides a target tracking device 20. Target tracking here means tracking moving objects of a specific type (for example, pedestrians) in a video or image sequence to obtain the position of each moving object in every frame. The target tracking device 20 uses a target detector to detect targets of a predetermined type in the current image to obtain the first target frame in the current image; obtains the second target frame in the previous frame of the current image, and uses a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image; matches the first target frame in the current image against the prediction frame to obtain the matching result of the first target frame and the prediction frame; and updates the position of the target in the current image according to that matching result. This embodiment improves the robustness and scene adaptability of target tracking.
Embodiment 3
This embodiment provides a non-volatile readable storage medium storing computer-readable instructions. When the computer-readable instructions are executed by a processor, the steps of the target tracking method embodiment above are implemented, for example steps 101-104 shown in FIG. 1:
Step 101: use a target detector to detect targets of a predetermined type in the current image to obtain the first target frame in the current image;
Step 102: obtain the second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image;
Step 103: match the first target frame in the current image against the prediction frame to obtain the matching result of the first target frame and the prediction frame;
Step 104: update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
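Steps 101-104 chain into a track-by-detection loop over the sequence. The sketch below shows one way such a loop could be wired up; detect and predict are stand-in callables (the document's detector is a faster region-based convolutional neural network and its predictor an FPN-based deep network, neither of which is reproduced here), and it reuses the illustrative helpers sketched earlier.

```python
def track_sequence(frames, detect, predict, max_dist=10.0):
    # frames: iterable of images; detect(frame) -> list of target boxes;
    # predict(tracks, frame) -> one prediction box per existing track.
    tracks = []
    for frame in frames:
        detections = detect(frame)                            # step 101
        predictions = predict(tracks, frame)                  # step 102
        matches = match_by_center_distance(                   # step 103
            detections, predictions, max_dist=max_dist)
        tracks = update_targets(tracks, detections, matches)  # step 104
        yield frame, tracks
```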
Alternatively, when the computer-readable instructions are executed by a processor, the functions of the modules in the device embodiment above are implemented, for example modules 201-204 in FIG. 2:
The detection module 201 uses a target detector to detect targets of a predetermined type in the current image to obtain the first target frame in the current image;
The prediction module 202 obtains the second target frame in the previous frame of the current image, and uses a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image;
The matching module 203 is configured to match the first target frame in the current image against the prediction frame to obtain the matching result of the first target frame and the prediction frame;
The update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
Embodiment 4
FIG. 3 is a schematic diagram of the computer device provided in Embodiment 4 of this application. The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303, such as a target tracking program, stored in the memory 301 and executable on the processor 302. When the processor 302 executes the computer-readable instructions 303, the steps of the target tracking method embodiment above are implemented, for example steps 101-104 shown in FIG. 1:
Step 101: use a target detector to detect targets of a predetermined type in the current image to obtain the first target frame in the current image;
Step 102: obtain the second target frame in the previous frame of the current image, and use a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image;
Step 103: match the first target frame in the current image against the prediction frame to obtain the matching result of the first target frame and the prediction frame;
Step 104: update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
Alternatively, when the computer-readable instructions 303 are executed by the processor, the functions of the modules in the device embodiment above are implemented, for example modules 201-204 in FIG. 2:
The detection module 201 uses a target detector to detect targets of a predetermined type in the current image to obtain the first target frame in the current image;
The prediction module 202 obtains the second target frame in the previous frame of the current image, and uses a predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image;
The matching module 203 is configured to match the first target frame in the current image against the prediction frame to obtain the matching result of the first target frame and the prediction frame;
The update module 204 is configured to update the position of the target in the current image according to the matching result of the first target frame and the prediction frame.
Exemplarily, the computer-readable instructions 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. For example, the computer-readable instructions 303 may be divided into the detection module 201, the prediction module 202, the matching module 203, and the update module 204 in FIG. 2; for the specific functions of each module, refer to Embodiment 2.
The computer device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 3 is merely an example of the computer device 30 and does not limit it; the computer device 30 may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 30 may further include input and output devices, network access devices, buses, and the like.
The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer-readable instructions 303. The processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and by calling the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the computer device 30 (such as audio data or a phone book). In addition, the memory 301 may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
If the modules integrated in the computer device 30 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the method embodiments above may also be completed by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of each of the method embodiments above. The computer-readable medium may include any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately expanded or restricted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the modules is only a logical function division, and there may be other division methods in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The software functional modules above are stored in a non-volatile readable storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application.
For those skilled in the art, it is obvious that this application is not limited to the details of the exemplary embodiments above and can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, from every point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of this application is defined by the appended claims rather than by the description above, and all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced by this application. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprise" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices recited in the system claims may also be implemented by a single module or device through software or hardware. Words such as "first" and "second" denote names and do not denote any specific order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may still be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.

Claims (20)

  1. A target tracking method, wherein the method comprises:
    using a target detector to detect targets of a predetermined type in a current image to obtain a first target frame in the current image;
    obtaining a second target frame in a previous frame of the current image, and using a predictor to predict a position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
    matching the first target frame in the current image against the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
    updating a position of a target in the current image according to the matching result of the first target frame and the prediction frame.
  2. The method according to claim 1, wherein the target detector is a faster region-based convolutional neural network model comprising a region proposal network and a fast region-based convolutional neural network, and the faster region-based convolutional neural network model is trained according to the following steps before detecting targets of the predetermined type in the image:
    a first training step: initializing the region proposal network with an ImageNet model, and training the region proposal network with a first training sample set;
    a second training step: using the region proposal network trained in the first training step to generate candidate frames for each sample image in the first training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step: initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the first training sample set;
    a fourth training step: initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the first training sample set.
  3. The method according to claim 2, wherein the faster region-based convolutional neural network model adopts the ZF framework, and the region proposal network and the fast region-based convolutional neural network share five convolutional layers.
  4. The method according to claim 1, wherein the predictor is a deep neural network model built with a feature pyramid network.
  5. The method according to claim 1, wherein before using the predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image, the method further comprises:
    training the predictor with a second training sample set, the second training sample set comprising sample images of objects under different illumination, undergoing deformation, and moving at high speed.
  6. The method according to claim 1, wherein matching the first target frame in the current image against the prediction frame comprises:
    calculating an overlap-area ratio of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the overlap-area ratio; or
    calculating a distance between center points of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the distance.
  7. The method according to claim 1, wherein updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame comprises:
    if the first target frame matches the prediction frame, taking the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
    if the first target frame matches none of the prediction frames, taking the position of the first target frame in the current image as the position of a new target; and
    if the prediction frame matches none of the first target frames, treating the target corresponding to the prediction frame as a lost target in the current image.
  8. A target tracking device, wherein the device comprises:
    a detection module configured to use a target detector to detect targets of a predetermined type in a current image to obtain a first target frame in the current image;
    a prediction module configured to obtain a second target frame in a previous frame of the current image, and to use a predictor to predict a position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
    a matching module configured to match the first target frame in the current image against the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
    an update module configured to update a position of a target in the current image according to the matching result of the first target frame and the prediction frame.
  9. A computer device, wherein the computer device comprises a memory and a processor, the memory stores at least one computer-readable instruction, and the processor executes the at least one computer-readable instruction to implement the following steps:
    using a target detector to detect targets of a predetermined type in a current image to obtain a first target frame in the current image;
    obtaining a second target frame in a previous frame of the current image, and using a predictor to predict a position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
    matching the first target frame in the current image against the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
    updating a position of a target in the current image according to the matching result of the first target frame and the prediction frame.
  10. The computer device according to claim 9, wherein the target detector is a faster region-based convolutional neural network model comprising a region proposal network and a fast region-based convolutional neural network, and before using the target detector to detect targets of the predetermined type in the current image to obtain the first target frame in the current image, the processor further executes the at least one computer-readable instruction to implement the following steps:
    a first training step: initializing the region proposal network with an ImageNet model, and training the region proposal network with a first training sample set;
    a second training step: using the region proposal network trained in the first training step to generate candidate frames for each sample image in the first training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step: initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the first training sample set;
    a fourth training step: initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the first training sample set.
  11. The computer device according to claim 10, wherein the faster region-based convolutional neural network model adopts the ZF framework, and the region proposal network and the fast region-based convolutional neural network share five convolutional layers.
  12. The computer device according to claim 9, wherein the predictor is a deep neural network model built with a feature pyramid network.
  13. The computer device according to claim 9, wherein before using the predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image, the processor further executes the at least one computer-readable instruction to implement the following step:
    training the predictor with a second training sample set, the second training sample set comprising sample images of objects under different illumination, undergoing deformation, and moving at high speed.
  14. The computer device according to claim 9, wherein matching the first target frame in the current image against the prediction frame comprises:
    calculating an overlap-area ratio of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the overlap-area ratio; or
    calculating a distance between center points of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the distance.
  15. The computer device according to claim 9, wherein updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame comprises:
    if the first target frame matches the prediction frame, taking the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
    if the first target frame matches none of the prediction frames, taking the position of the first target frame in the current image as the position of a new target; and
    if the prediction frame matches none of the first target frames, treating the target corresponding to the prediction frame as a lost target in the current image.
  16. A non-volatile readable storage medium storing at least one computer-readable instruction, wherein the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    using a target detector to detect targets of a predetermined type in a current image to obtain a first target frame in the current image;
    obtaining a second target frame in a previous frame of the current image, and using a predictor to predict a position of the second target frame in the current image to obtain a prediction frame of the second target frame in the current image;
    matching the first target frame in the current image against the prediction frame to obtain a matching result of the first target frame and the prediction frame; and
    updating a position of a target in the current image according to the matching result of the first target frame and the prediction frame.
  17. The non-volatile readable storage medium according to claim 16, wherein the target detector is a faster region-based convolutional neural network model comprising a region proposal network and a fast region-based convolutional neural network, and before using the target detector to detect targets of the predetermined type in the current image to obtain the first target frame in the current image, the at least one computer-readable instruction, when executed by the processor, further implements the following steps:
    a first training step: initializing the region proposal network with an ImageNet model, and training the region proposal network with a first training sample set;
    a second training step: using the region proposal network trained in the first training step to generate candidate frames for each sample image in the first training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step: initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the first training sample set;
    a fourth training step: initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the first training sample set.
  18. The non-volatile readable storage medium according to claim 16, wherein before using the predictor to predict the position of the second target frame in the current image to obtain the prediction frame of the second target frame in the current image, the at least one computer-readable instruction, when executed by the processor, further implements the following step:
    training the predictor with a second training sample set, the second training sample set comprising sample images of objects under different illumination, undergoing deformation, and moving at high speed.
  19. The non-volatile readable storage medium according to claim 16, wherein matching the first target frame in the current image against the prediction frame comprises:
    calculating an overlap-area ratio of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the overlap-area ratio; or
    calculating a distance between center points of the first target frame and the prediction frame, and determining each matched pair of the first target frame and the prediction frame according to the distance.
  20. The non-volatile readable storage medium according to claim 16, wherein updating the position of the target in the current image according to the matching result of the first target frame and the prediction frame comprises:
    if the first target frame matches the prediction frame, taking the position of the first target frame in the current image as the updated position of the target corresponding to the prediction frame;
    if the first target frame matches none of the prediction frames, taking the position of the first target frame in the current image as the position of a new target; and
    if the prediction frame matches none of the first target frames, treating the target corresponding to the prediction frame as a lost target in the current image.
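Claims 2, 10, and 17 recite the same four-step alternating training scheme for the detector. The sketch below is a hedged outline of that scheme only; the make_rpn and make_fast_rcnn factories, their constructor arguments, and the train/propose/weights method names are hypothetical stand-ins introduced for illustration, since the claims do not fix any API.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]

def alternating_training(
    make_rpn: Callable,        # hypothetical: builds a region proposal network
    make_fast_rcnn: Callable,  # hypothetical: builds a fast region-based CNN
    imagenet_weights: object,  # ImageNet-pretrained initialization
    samples: List[object],     # the first training sample set
):
    # Step 1: initialize the region proposal network from the ImageNet model
    # and train it on the first training sample set.
    rpn = make_rpn(init=imagenet_weights)
    rpn.train(samples)

    # Step 2: generate candidate frames for each sample image with the trained
    # RPN, then train the fast region-based CNN on those candidate frames.
    candidates: Dict[int, List[Box]] = {i: rpn.propose(s) for i, s in enumerate(samples)}
    fast_rcnn = make_fast_rcnn(init=imagenet_weights)
    fast_rcnn.train(samples, candidates)

    # Step 3: re-initialize the RPN from the trained fast region-based CNN
    # and train it again on the same sample set.
    rpn = make_rpn(init=fast_rcnn.weights())
    rpn.train(samples)

    # Step 4: initialize the fast region-based CNN from the step-3 RPN, keep
    # the shared convolutional layers fixed, and train it once more. The
    # claims do not spell out regenerating candidate frames here; doing so
    # mirrors the usual alternating-training recipe.
    candidates = {i: rpn.propose(s) for i, s in enumerate(samples)}
    fast_rcnn = make_fast_rcnn(init=rpn.weights(), freeze_shared_conv=True)
    fast_rcnn.train(samples, candidates)
    return rpn, fast_rcnn
```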
PCT/CN2019/091160 (WO2020151167A1, en): Target tracking method and device, computer device and readable storage medium. Priority date: 2019-01-23; filing date: 2019-06-13.

Applications Claiming Priority (2)

CN201910064675.5: priority date 2019-01-23.
CN201910064675.5A (CN109903310A, en): priority date 2019-01-23; filing date 2019-01-23. Title: Method for tracking target, device, computer installation and computer storage medium.

Publications (1)

WO2020151167A1: publication date 2020-07-30.

Family

ID=66944120

Family Applications (1)

PCT/CN2019/091160 (WO2020151167A1, en): priority date 2019-01-23; filing date 2019-06-13. Title: Target tracking method and device, computer device and readable storage medium.

Country Status (2)

CN (1): CN109903310A (en)
WO (1): WO2020151167A1 (en)

Also Published As

CN109903310A (en): publication date 2019-06-18.

Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19911600; country of ref document: EP; kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122 (EP): PCT application non-entry in European phase. Ref document number: 19911600; country of ref document: EP; kind code of ref document: A1.