CN110889453A - Target detection and tracking method, device, system, medium and equipment - Google Patents

Target detection and tracking method, device, system, medium and equipment

Info

Publication number
CN110889453A
Authority
CN
China
Prior art keywords
target
convolution
size
layer
convolution kernels
Legal status
Pending
Application number
CN201911188614.6A
Other languages
Chinese (zh)
Inventor
屈盛官
王圣杰
吕继亮
赵馨雨
李小强
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201911188614.6A
Publication of CN110889453A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a target detection and tracking method, device, system, medium and equipment, comprising the following steps: building a multi-target detector model and a multi-target tracker model, and then training them to obtain a multi-target detector and a multi-target tracker; inputting the collected images into the multi-target detector to detect targets; and inputting the output of the multi-target detector into the multi-target tracker to track the targets. The convolutional neural network used to build the multi-target detector model comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence.

Description

Target detection and tracking method, device, system, medium and equipment
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection and tracking method, device, system, medium and equipment.
Background
With the progress of the times and the development of science and technology, intelligent mobile robots appear more and more in people's daily lives. Computer vision is one of the important components of modern robots: cameras are the key sensing elements with which a robot perceives the external environment and acquires information, and the detection and tracking of objects in the image information acquired by cameras is one of the most important and fundamental technologies of intelligent robots.
In the field of computer vision, conventional object detection algorithms typically comprise three stages: first, candidate regions of different scales are selected on the input image; then features are extracted from the candidate regions with methods such as Haar and HOG; finally, the results are classified with machine learning algorithms such as SVM. The performance of such detection methods depends to a great extent on the quality of the background model and of the manually selected features; they are time-consuming and labor-intensive, recognize poorly, generalize poorly, and are unsatisfactory for multi-target detection and tracking tasks under complex conditions.
In recent years, deep learning has become an increasingly popular field thanks to improvements in hardware performance and breakthroughs in convolutional neural network algorithms. The concept of deep learning was proposed by Hinton et al. in 2006: an unsupervised greedy layer-by-layer training algorithm based on the deep belief network, followed later by a multilayer auto-encoder deep structure intended to solve the optimization problems associated with deep architectures. Furthermore, the convolutional neural network proposed by LeCun et al. was the first true multi-layer learning algorithm; it uses spatial relationships to reduce the number of parameters and thereby improve training performance. The essence of deep learning is to learn more useful features by building machine learning models with many hidden layers and massive amounts of training data, ultimately improving the accuracy of classification or prediction.
With the introduction of deep learning, academia and industry have gradually explored its application in various fields, and a series of methods and frameworks have been proposed for target detection and tracking, such as the two-stage methods R-CNN, Fast R-CNN and Faster R-CNN, and the one-stage methods represented by YOLO and SSD. These methods optimize respectively for the accuracy and the speed of target detection, and each represented the state of the art at the time of its introduction. However, they share one big problem: the networks are too computationally intensive during learning and prediction. Such a model predicts well, but the platform carrying it has high energy demands, which places high requirements on energy storage when applied to a mobile robot; embedded devices carrying a GPU consume relatively little energy but have very limited computing power, cannot keep a large network running fast while maintaining high accuracy, and cannot achieve real-time target detection and tracking.
Disclosure of Invention
The first purpose of the present invention is to overcome the drawbacks and deficiencies of the prior art, and to provide a target detection and tracking method, which is suitable for automatic driving devices with low computing power, such as small unmanned vehicles and mobile robots, and can effectively improve the accuracy and speed of the automatic driving devices for completing multi-target detection and tracking tasks.
The second objective of the present invention is to provide a target detecting and tracking device.
A third objective of the present invention is to provide a target detecting and tracking system.
A fourth object of the present invention is to provide a storage medium.
It is a fifth object of the invention to provide a computing device.
The first purpose of the invention is realized by the following technical scheme: a target detection and tracking method is applied to automatic driving equipment and comprises the following steps:
building a multi-target detector model based on a convolutional neural network;
constructing a convolutional neural network according to a Markov decision process (MDP) to be used as a multi-target tracker model;
respectively training the multi-target detector model and the multi-target tracker model by using a training data set to obtain a trained multi-target detector and a trained multi-target tracker;
acquiring a target image to be detected, inputting the target image to be detected into a multi-target detector, and detecting a target in the target image to be detected;
inputting the output of the multi-target detector into a multi-target tracker, and tracking the target detected in the multi-target detector through the multi-target tracker;
the convolutional neural network structure for building the multi-target detector model is as follows: it comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence.
Preferably, in the convolutional neural network used to build the multi-target detector model:
the first convolutional layer comprises 32 convolution kernels of size 3x3; in the first convolutional layer, the input is convolved with these 32 3x3 kernels;
the first max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the second convolutional layer comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the second convolutional layer, the input is first convolved with the 64 3x3 kernels and then with the 32 1x1 kernels;
the second max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the third convolutional layer comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the third convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 64 3x3 kernels, then with 32 1x1 kernels; step 2, convolve the result of step 1 with 64 3x3 kernels, then with 32 1x1 kernels; step 3, convolve the result of step 2 with 64 3x3 kernels;
the third max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the fourth convolutional layer comprises 128 convolution kernels of size 3x3 and 64 convolution kernels of size 1x1; in the fourth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 128 3x3 kernels, then with 64 1x1 kernels; step 2, convolve the result of step 1 with 128 3x3 kernels, then with 64 1x1 kernels; step 3, convolve the result of step 2 with 128 3x3 kernels, then with 64 1x1 kernels; step 4, convolve the result of step 3 with 128 3x3 kernels, then with 64 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels;
the fourth max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the fifth convolutional layer comprises 256 convolution kernels of size 3x3 and 128 convolution kernels of size 1x1; in the fifth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 256 3x3 kernels, then with 128 1x1 kernels; step 2, convolve the result of step 1 with 256 3x3 kernels, then with 128 1x1 kernels; step 3, convolve the result of step 2 with 256 3x3 kernels, then with 128 1x1 kernels; step 4, convolve the result of step 3 with 256 3x3 kernels, then with 128 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels with stride 2;
the sixth convolutional layer comprises 512 convolution kernels of size 3x3 and 512 convolution kernels of size 1x1; in the sixth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 512 3x3 kernels, then with 256 1x1 kernels; step 2, convolve the result of step 1 with 512 3x3 kernels, then with 256 1x1 kernels; step 3, convolve the result of step 2 with 512 1x1 kernels.
Preferably, when training the multi-target detector and the multi-target tracker, the PyTorch deep learning framework is used; the COCO dataset serves as the training dataset for the multi-target detector model, yielding the trained multi-target detector, and the MOT Benchmark dataset serves as the training dataset for the multi-target tracker model, yielding the trained multi-target tracker.
Furthermore, when training the multi-target detector, the 3-channel pixel image of each training sample in the COCO dataset is normalized, and the normalized images are input into the multi-target detector in batches to train the multi-target detector model;
the acquired target image to be detected is a 3-channel pixel image;
when a target in a target image to be detected is detected through the multi-target detector, the 3-channel pixel image of the target image to be detected is subjected to normalization processing and then is input into the multi-target detector.
Preferably, the specific process of the multi-target tracker model for tracking the target is as follows:
step S1, dividing the target states input by the multi-target tracker into: activation, tracking, loss, and stop;
step S2, after the multi-target detector inputs the detected target information into the multi-target tracker model, the state of the target is marked as activated in the multi-target tracker model, and whether the target belongs to the target category owned by the training data set is detected:
if yes, the multi-target tracker starts to work, and the process goes to step S3;
if not, stopping tracking the target;
step S3, the target marked as the activated state immediately enters a tracking state, the target starts to be tracked, and at the moment, if the tracked target in each frame of input image is detected, the tracker normally works; if the target tracked at the previous moment disappears from the image at the current moment, the target enters a lost state, the lost state is kept for a certain period of time, if the target is detected again in the period of time, the target returns to the tracking state again, and if the target is not detected yet, the target enters a stop state.
Preferably, the output of the multi-target detector model is a vector of n elements, where 4 elements represent the center position coordinates (x, y) and the length w and width h of the target bounding box, 1 element represents the confidence c of the marked target, and the remaining n-5 elements respectively represent the target classes; n is a fixed value.
The second purpose of the invention is realized by the following technical scheme: a target detection and tracking device is applied to automatic driving equipment and comprises:
the multi-target detector model building module is used for building a multi-target detector model based on a convolutional neural network; the convolutional neural network structure for building the multi-target detector model is as follows: it comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence;
the multi-target tracker model building module is used for building a convolutional neural network according to a Markov decision process to serve as a multi-target tracker model;
the model construction module is used for respectively training the multi-target detector model and the multi-target tracker model by using the training data set to obtain a trained multi-target detector and a trained multi-target tracker;
the image acquisition module is used for acquiring a target image to be detected;
the target detection module is used for inputting a target image to be detected into the multi-target detector and detecting a target in the target image to be detected;
and the target tracking module is used for inputting the output of the multi-target detector into the multi-target tracker and tracking the target detected in the multi-target detector through the multi-target tracker.
The third purpose of the invention is realized by the following technical scheme: a target detection and tracking system is applied to automatic driving equipment and comprises an embedded system and image acquisition equipment;
the image acquisition device: the system is used for collecting a target image to be detected;
the embedded system:
the method is used for loading the trained multi-target detector and multi-target tracker in the target detection and tracking method of the first object of the invention;
the system comprises a multi-target detector, a target detection device and a target detection device, wherein the multi-target detector is used for acquiring a target image to be detected acquired by image acquisition equipment, inputting the target image to be detected into the multi-target detector and detecting a target in the target image to be detected;
the multi-target tracking device is used for inputting the output of the multi-target detector into the multi-target tracker and tracking the target detected in the multi-target detector through the multi-target tracker.
The fourth purpose of the invention is realized by the following technical scheme: a storage medium storing a program, wherein the program is executed by a processor to implement the object detection and tracking method according to the first object of the present invention.
The fifth purpose of the invention is realized by the following technical scheme: a computing device comprising a processor and a memory for storing processor-executable programs, the processor, when executing the programs stored in the memory, implementing the object detection and tracking method according to the first object of the present invention.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the target detection and tracking method of the invention, a multi-target detector model and a multi-target tracker model are first built and then trained to obtain a multi-target detector and a multi-target tracker; during target detection and tracking, the acquired images are input into the multi-target detector to detect targets, and the output of the multi-target detector is then input into the multi-target tracker to track the targets detected by the detector. In this method, the convolutional neural network for building the multi-target detector model comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence, while the convolutional neural network used as the multi-target tracker model is built according to a Markov decision process (MDP). Because the convolutional neural network for the multi-target detector model has few layers, the method has low computational cost and low power consumption, and is suitable for automatic driving equipment containing low-compute embedded devices (such as a GPU chip), e.g. small unmanned vehicles and mobile robots. In addition, by combining convolutional neural network deep learning with MDP reinforcement learning, the method can effectively improve the accuracy and speed of multi-target detection and tracking on automatic driving equipment.
(2) In the target detection and tracking method, the PyTorch deep learning framework is used; the COCO dataset serves as the training dataset for the multi-target detector model, yielding the trained multi-target detector, and the MOT Benchmark dataset serves as the training dataset for the multi-target tracker model, yielding the trained multi-target tracker. The COCO and MOT Benchmark datasets have known labels, which saves the large amount of time otherwise spent labeling by hand, so the multi-target detector and multi-target tracker models can be trained more quickly.
(3) In the target detection and tracking method, the multi-target tracker model realizes multi-target tracking based on an MDP: in the multi-target tracker model, a target output by the multi-target detector is marked as activated; a target marked as activated immediately enters the tracking state and begins to be tracked; if the tracked target is detected in each frame of the input image, the tracker works normally; if the target tracked at the previous moment disappears from the image at the current moment, the target enters a lost state, which is kept for a certain period of time; if the target is detected again within this period it returns to the tracking state, and if it is still not detected it enters a stop state. Because the multi-target tracker handles tracking and the lost state according to the MDP, it can track targets accurately and in real time.
Drawings
FIG. 1 is a flow chart of a target detection and tracking method according to the present invention.
FIG. 2 is a diagram of a convolutional neural network structure for constructing a multi-target detector model in the method of the present invention.
FIG. 3 is a schematic diagram of the operation of a convolutional neural network for constructing a multi-target detector model in the method of the present invention.
FIG. 4 is a schematic diagram of the operation of the multi-target tracker model to track a target in the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
The embodiment discloses a target detection and tracking method, which is applied to automatic driving equipment, such as some small unmanned vehicles and mobile robots, and as shown in fig. 1, the method comprises the following steps:
1) building a multi-target detector model based on the convolutional neural network; constructing a convolutional neural network according to a Markov decision process to be used as a multi-target tracker model;
in this embodiment, the above convolutional neural network structure for building a multi-target detector model is shown in fig. 2: the multilayer printed circuit board comprises an input layer, a first convolution layer conv1, a first maximum convergence layer Maxpool1, a second convolution layer conv2, a second maximum convergence layer Maxpool2, a third convolution layer conv3, a third maximum convergence layer Maxpool3, a fourth convolution layer conv4, a fourth maximum convergence layer Maxpool4, a fifth convolution layer conv5, a sixth convolution layer conv6, a full connection layer and an output layer which are connected in sequence. The method specifically comprises the following steps:
the first convolutional layer conv1 comprises 32 convolution kernels of size 3x3; in the first convolutional layer, the input is convolved with these 32 3x3 kernels;
the first max pooling layer Maxpool1 uses a pooling kernel of size 2x2 with stride 2;
the second convolutional layer conv2 comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the second convolutional layer, the input is first convolved with the 64 3x3 kernels and then with the 32 1x1 kernels;
the second max pooling layer Maxpool2 uses a pooling kernel of size 2x2 with stride 2;
the third convolutional layer conv3 comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the third convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 64 3x3 kernels, then with 32 1x1 kernels; step 2, convolve the result of step 1 with 64 3x3 kernels, then with 32 1x1 kernels; step 3, convolve the result of step 2 with 64 3x3 kernels;
the third max pooling layer Maxpool3 uses a pooling kernel of size 2x2 with stride 2;
the fourth convolutional layer conv4 comprises 128 convolution kernels of size 3x3 and 64 convolution kernels of size 1x1; in the fourth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 128 3x3 kernels, then with 64 1x1 kernels; step 2, convolve the result of step 1 with 128 3x3 kernels, then with 64 1x1 kernels; step 3, convolve the result of step 2 with 128 3x3 kernels, then with 64 1x1 kernels; step 4, convolve the result of step 3 with 128 3x3 kernels, then with 64 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels;
the fourth max pooling layer Maxpool4 uses a pooling kernel of size 2x2 with stride 2;
the fifth convolutional layer conv5 comprises 256 convolution kernels of size 3x3 and 128 convolution kernels of size 1x1; in the fifth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 256 3x3 kernels, then with 128 1x1 kernels; step 2, convolve the result of step 1 with 256 3x3 kernels, then with 128 1x1 kernels; step 3, convolve the result of step 2 with 256 3x3 kernels, then with 128 1x1 kernels; step 4, convolve the result of step 3 with 256 3x3 kernels, then with 128 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels with stride 2;
the sixth convolutional layer conv6 comprises 512 convolution kernels of size 3x3 and 512 convolution kernels of size 1x1; in the sixth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 512 3x3 kernels, then with 256 1x1 kernels; step 2, convolve the result of step 1 with 512 3x3 kernels, then with 256 1x1 kernels; step 3, convolve the result of step 2 with 512 1x1 kernels.
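The following is a minimal PyTorch sketch of this backbone under stated assumptions: the patent does not specify activations or padding, so LeakyReLU and same-padding for the 3x3 convolutions are assumed here, and the fully connected head producing a single n-element vector (n = 85, matching the output described below) is a simplification.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k):
    # 3x3 convolutions are padded to preserve spatial size; 1x1 need no padding
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
        nn.LeakyReLU(0.1),  # activation is an assumption; the patent does not name one
    )

class DetectorBackbone(nn.Module):
    def __init__(self, n_out=85):
        super().__init__()
        self.conv1 = conv(3, 32, 3)
        self.maxpool1 = nn.MaxPool2d(2, stride=2)
        self.conv2 = nn.Sequential(conv(32, 64, 3), conv(64, 32, 1))
        self.maxpool2 = nn.MaxPool2d(2, stride=2)
        self.conv3 = nn.Sequential(
            conv(32, 64, 3), conv(64, 32, 1),
            conv(32, 64, 3), conv(64, 32, 1),
            conv(32, 64, 3),
        )
        self.maxpool3 = nn.MaxPool2d(2, stride=2)
        self.conv4 = nn.Sequential(
            conv(64, 128, 3), conv(128, 64, 1),
            conv(64, 128, 3), conv(128, 64, 1),
            conv(64, 128, 3), conv(128, 64, 1),
            conv(64, 128, 3), conv(128, 64, 1),
            conv(64, 128, 3),
        )
        self.maxpool4 = nn.MaxPool2d(2, stride=2)
        # conv5 ends with 128 3x3 kernels at stride 2, halving the feature map
        self.conv5 = nn.Sequential(
            conv(128, 256, 3), conv(256, 128, 1),
            conv(128, 256, 3), conv(256, 128, 1),
            conv(128, 256, 3), conv(256, 128, 1),
            conv(128, 256, 3), conv(256, 128, 1),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        )
        self.conv6 = nn.Sequential(
            conv(128, 512, 3), conv(512, 256, 1),
            conv(256, 512, 3), conv(512, 256, 1),
            conv(256, 512, 1),
        )
        # fully connected head mapping the final feature map to the n-element output
        self.fc = nn.Linear(512 * 13 * 13, n_out)

    def forward(self, x):
        x = self.maxpool1(self.conv1(x))
        x = self.maxpool2(self.conv2(x))
        x = self.maxpool3(self.conv3(x))
        x = self.maxpool4(self.conv4(x))
        x = self.conv6(self.conv5(x))
        return self.fc(torch.flatten(x, 1))
```

Under these assumptions, a 416x416x3 input reaches the fully connected layer as a 512x13x13 feature map (four 2x pooling stages plus the stride-2 convolution in conv5).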
In this embodiment, when the image input to the convolutional neural network input layer has size 416x416x3, the inputs and outputs of the network layers are as shown in fig. 3. The relationship between the input and output of each layer is as follows:
N_o = (N_i + 2*P - F)/S + 1;
where N_o is the output image pixel size of the current layer of the neural network and N_i is the input image pixel size of the current layer. In general, the input to the m-th layer of the convolutional neural network is a W_i^(m) x H_i^(m) x D_i^(m) matrix and its output is a W_o^(m) x H_o^(m) x D_o^(m) matrix, where W is the width of the matrix, H is its height, and D is the number of convolution kernels used in each convolution, or the channel dimension of the input image. In this embodiment the normalized image width (W) and height (H) are equal, so both are denoted by N here.
P is the padding size; in FIG. 3, if the input N_i^(m) of the m-th layer equals its output N_o^(m), the layer is padded, otherwise P = 0.
F is the size of the convolution kernel (also called the filter). Its value can be set arbitrarily but is usually chosen according to the input size of the current layer; the method adopts two kernel sizes, 3x3 and 1x1.
S (Stride) is the step size of the convolution kernel, i.e. how many pixels the kernel moves at a time. Taking the annotation below the first input of the network in FIG. 3 as an example, the kernel size noted below the max pooling layer is 2x2-s-2, meaning that the max pooling layer computes with 2x2 pooling kernels at a stride of 2.
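To make the size relationship concrete, here is a small helper (illustrative, not part of the patent) that applies N_o = (N_i + 2*P - F)/S + 1 to the first two layers:

```python
def out_size(n_in, kernel, stride=1, pad=0):
    # N_o = (N_i + 2*P - F) / S + 1
    return (n_in + 2 * pad - kernel) // stride + 1

n = 416
n = out_size(n, kernel=3, stride=1, pad=1)  # conv1, padded 3x3: stays 416
n = out_size(n, kernel=2, stride=2)         # Maxpool1, 2x2-s-2: halves to 208
print(n)  # 208
```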
2) And training the multi-target detector model and the multi-target tracker model respectively by using the training data set to obtain the trained multi-target detector and multi-target tracker.
In this embodiment, the PyTorch deep learning framework is used; the COCO dataset serves as the training dataset for the multi-target detector model, yielding the trained multi-target detector, and the MOT Benchmark dataset serves as the training dataset for the multi-target tracker model, yielding the trained multi-target tracker.
The COCO dataset comprises 80 target classes, and each sample in it is a target with a known class. Each sample in the MOT Benchmark dataset is labeled with the known target class together with the target bounding box center position coordinates (x, y), length w, width h, target confidence c and target type.
In this embodiment, each sample in the COCO dataset is a 3-channel pixel image. When training the multi-target detector, the 3-channel pixel image of every training sample in the COCO dataset is normalized to 416x416 pixels, and the normalized images are input into the multi-target detector in batches to train the multi-target detector model, with 256 training images per batch.
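A minimal torchvision sketch of this input pipeline follows; the dataset paths and the mean/std normalization constants are assumptions (the patent only specifies the 416x416 size and the batch size of 256):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CocoDetection

# Resize every 3-channel image to 416x416 and scale pixels to [0, 1];
# the mean/std values are common ImageNet constants, assumed here.
preprocess = transforms.Compose([
    transforms.Resize((416, 416)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = CocoDetection(
    root="coco/train2017",                                    # hypothetical path
    annFile="coco/annotations/instances_train2017.json",      # hypothetical path
    transform=preprocess,
)
train_loader = DataLoader(
    train_set,
    batch_size=256,                          # 256 training images per batch
    shuffle=True,
    collate_fn=lambda b: tuple(zip(*b)),     # detection targets vary in length
)
```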
3) Acquiring the target image to be detected, inputting it into the multi-target detector, and detecting targets in it; in this embodiment, the image of the target to be detected may be captured by a camera mounted on the automatic driving equipment; the captured image is a 3-channel pixel image and is normalized to 416x416 pixels before being input into the multi-target detector.
In this embodiment, the output of the multi-target detector is a vector of n elements, where 4 elements represent the center position coordinates (x, y) and the length w and width h of the bounding box, 1 element represents the confidence c of the marked target, and the remaining n-5 elements represent the target classes; n is a fixed value, so the total number of target classes is n-5. In this embodiment n is 85, i.e. the total number of object classes is 80, and the class of an object in the image can be determined from the 80 class elements.
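A small decoding helper consistent with this layout is sketched below; the exact element ordering (box terms first, then confidence, then class scores) is an assumption, since the patent only states which quantities the elements represent:

```python
import torch

def decode_detection(v: torch.Tensor):
    # v is the n-element output vector, n = 85: 4 box terms + 1 confidence + 80 class scores
    x, y, w, h = v[0:4].tolist()    # bounding box center (x, y), length w, width h
    c = v[4].item()                 # confidence that a marked target is present
    cls = int(torch.argmax(v[5:]))  # index of the most likely of the 80 classes
    return (x, y, w, h), c, cls
```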
4) Inputting the output of the multi-target detector into a multi-target tracker, and tracking the target detected in the multi-target detector through the multi-target tracker; in this embodiment, as shown in fig. 4, a specific process of the multi-target tracker model to track the target is as follows:
step S1, dividing the target states input by the multi-target tracker into: activation, tracking, loss, and stop;
step S2, after the multi-target detector inputs the detected target information into the multi-target tracker model, the state of the target is marked as activated in the multi-target tracker model, and whether the target belongs to the target category owned by the training data set is detected:
if yes, the multi-target tracker starts to work, and the process goes to step S3;
if not, stopping tracking the target;
step S3, the target marked as active enters the tracking state (a)1) Starting to track the target, and if the tracked target in each frame of input image is detected, the tracker works normally (a)3) (ii) a If the target tracked at the previous moment disappears from the image at the current moment, the target enters a lost state (a)4) The lost state will remain for a certain period of time (a)5) The time period may be set to 10 seconds, during which the target, if detected again, returns to the tracking state (a)6) If not yet detected (a)7) Then enter a stop state (a)2)。
Example 2
The embodiment discloses a target detection and tracking apparatus, applied to automatic driving equipment, comprising a multi-target detector model building module, a multi-target tracker model building module, a model construction module, an image acquisition module, a target detection module and a target tracking module, wherein:
the multi-target detector model building module is used for building a multi-target detector model based on a convolutional neural network; the convolutional neural network structure for building the multi-target detector model is as follows: the multilayer structure comprises an input layer, a first convolution layer, a first maximum convergence layer, a second convolution layer, a second maximum convergence layer, a third convolution layer, a third maximum convergence layer, a fourth convolution layer, a fourth maximum convergence layer, a fifth convolution layer, a sixth convolution layer, a full-connection layer and an output layer which are connected in sequence. The specific structure of each layer in the convolutional neural network is the same as that described in embodiment 1, and is not described herein again.
The multi-target tracker model building module is used for building a convolutional neural network according to a Markov decision process to serve as a multi-target tracker model;
the model construction module is used for respectively training the multi-target detector model and the multi-target tracker model by using the training data set to obtain a trained multi-target detector and a trained multi-target tracker;
in this embodiment, the model building module trains the multi-target detector model by using a PyTorch deep learning framework and using a COCO dataset as a training dataset to obtain a trained multi-target detector, and trains the multi-target tracker model by using an MOT Benchmark dataset as a training dataset to obtain a trained multi-target tracker model.
The image acquisition module is used for acquiring a target image to be detected; the image is a 3-channel pixel image that is normalized to a size of 416x416 pixels before being input to the multi-target detector.
The target detection module is used for inputting a target image to be detected into the multi-target detector and detecting a target in the target image to be detected;
and the target tracking module is used for inputting the output of the multi-target detector into the multi-target tracker and tracking the target detected in the multi-target detector through the multi-target tracker.
The target detection and tracking apparatus of this embodiment corresponds to the target detection and tracking method of embodiment 1, so the specific implementation of each module can be found in embodiment 1 and is not detailed here. It should be noted that the apparatus of this embodiment is only illustrated by its division into functional modules; in practical applications, the functions may be distributed among different functional modules as needed, i.e. the internal structure may be divided into different functional modules to complete all or part of the functions described above. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both, and the components and steps of the examples have been described above in general functional terms to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present invention.
Example 3
The embodiment discloses a target detection and tracking system, which is applied to automatic driving equipment, wherein the automatic driving equipment can be small unmanned vehicles and mobile robots; these autopilot devices contain some low-computing and low-power embedded devices, such as GPUs;
the target detection and tracking system comprises an embedded system and an image acquisition device;
the image acquisition device: the system is used for collecting a target image to be detected; in this embodiment, the image capture device may be a camera mounted on the autopilot device.
The embedded system may specifically be an Ubuntu system running on a Jetson TX2, and is used as follows:
the method is used for loading the trained multi-target detector and multi-target tracker in the target detection and tracking method of embodiment 1;
the system comprises a multi-target detector, a target detection device and a target detection device, wherein the multi-target detector is used for acquiring a target image to be detected acquired by image acquisition equipment, inputting the target image to be detected into the multi-target detector and detecting a target in the target image to be detected; in this embodiment, the output result of the multi-target detector is a vector containing n elements, where 4 elements represent the coordinates (x, y) of the center position of the Bounding Box and the length w and width h of the Bounding Box, 1 element represents the confidence c of the marked target, n-5 elements represent the total number of target classes, and n is a fixed value.
The method for inputting the outputs of the multi-target detector into the multi-target tracker and tracking the targets detected by the multi-target detector by the multi-target tracker is as in steps S1 to S3 of embodiment 1.
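A hedged sketch of how the embedded system might run the loaded models on camera frames follows; the file names, the `tracker.step` API and the capture pipeline are assumptions for illustration, not part of the patent:

```python
import cv2
import torch

device = torch.device("cuda")  # e.g. the GPU on a Jetson TX2

# Hypothetical file names for the trained models from embodiment 1
detector = torch.load("detector.pt", map_location=device).eval()
tracker = torch.load("tracker.pt", map_location=device)  # MDP tracker, hypothetical API

cap = cv2.VideoCapture(0)  # camera mounted on the automatic driving equipment
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # normalize the 3-channel frame to 416x416 pixels, as in embodiment 1
    img = cv2.resize(frame, (416, 416)).astype("float32") / 255.0
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        detections = detector(x)  # n-element output vector(s)
    tracker.step(detections)      # hypothetical method implementing steps S1-S3
cap.release()
```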
Example 4
The present embodiment discloses a storage medium storing a program, which when executed by a processor, implements the target detection and tracking method of embodiment 1, as follows:
building a multi-target detector model based on a convolutional neural network;
constructing a convolutional neural network according to a Markov decision process to be used as a multi-target tracker model;
respectively training the multi-target detector model and the multi-target tracker model by using a training data set to obtain a trained multi-target detector and a trained multi-target tracker;
acquiring a target image to be detected, inputting the target image to be detected into a multi-target detector, and detecting a target in the target image to be detected;
and inputting the output of the multi-target detector into the multi-target tracker, and tracking the target detected in the multi-target detector through the multi-target tracker.
The convolutional neural network structure for building the multi-target detector model is as follows: it comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence.
The storage medium in this embodiment may be a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB disk, a removable hard disk, or other media.
Example 5
The embodiment discloses a computing device, which includes a processor and a memory for storing a processor executable program, and is characterized in that when the processor executes the program stored in the memory, the target detection and tracking method of embodiment 1 is implemented as follows:
building a multi-target detector model based on a convolutional neural network;
constructing a convolutional neural network according to a Markov decision process to be used as a multi-target tracker model;
respectively training the multi-target detector model and the multi-target tracker model by using a training data set to obtain a trained multi-target detector and a trained multi-target tracker;
acquiring a target image to be detected, inputting the target image to be detected into a multi-target detector, and detecting a target in the target image to be detected;
inputting the output of the multi-target detector into a multi-target tracker, and tracking the target detected in the multi-target detector through the multi-target tracker;
the convolutional neural network structure for building the multi-target detector model is as follows: it comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, or other terminal device with a processor function.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A target detection and tracking method is applied to automatic driving equipment, and is characterized by comprising the following steps:
building a multi-target detector model based on a convolutional neural network;
constructing a convolutional neural network according to a Markov decision process to be used as a multi-target tracker model;
respectively training the multi-target detector model and the multi-target tracker model by using a training data set to obtain a trained multi-target detector and a trained multi-target tracker;
acquiring a target image to be detected, inputting the target image to be detected into a multi-target detector, and detecting a target in the target image to be detected;
inputting the output of the multi-target detector into a multi-target tracker, and tracking the target detected in the multi-target detector through the multi-target tracker;
the convolutional neural network structure for building the multi-target detector model is as follows: it comprises an input layer, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a third convolutional layer, a third max pooling layer, a fourth convolutional layer, a fourth max pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer and an output layer, connected in sequence.
2. The target detection and tracking method according to claim 1, wherein, in the convolutional neural network used to build the multi-target detector model:
the first convolutional layer comprises 32 convolution kernels of size 3x3; in the first convolutional layer, the input is convolved with these 32 3x3 kernels;
the first max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the second convolutional layer comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the second convolutional layer, the input is first convolved with the 64 3x3 kernels and then with the 32 1x1 kernels;
the second max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the third convolutional layer comprises 64 convolution kernels of size 3x3 and 32 convolution kernels of size 1x1; in the third convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 64 3x3 kernels, then with 32 1x1 kernels; step 2, convolve the result of step 1 with 64 3x3 kernels, then with 32 1x1 kernels; step 3, convolve the result of step 2 with 64 3x3 kernels;
the third max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the fourth convolutional layer comprises 128 convolution kernels of size 3x3 and 64 convolution kernels of size 1x1; in the fourth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 128 3x3 kernels, then with 64 1x1 kernels; step 2, convolve the result of step 1 with 128 3x3 kernels, then with 64 1x1 kernels; step 3, convolve the result of step 2 with 128 3x3 kernels, then with 64 1x1 kernels; step 4, convolve the result of step 3 with 128 3x3 kernels, then with 64 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels;
the fourth max pooling layer uses a pooling kernel of size 2x2 with stride 2;
the fifth convolutional layer comprises 256 convolution kernels of size 3x3 and 128 convolution kernels of size 1x1; in the fifth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 256 3x3 kernels, then with 128 1x1 kernels; step 2, convolve the result of step 1 with 256 3x3 kernels, then with 128 1x1 kernels; step 3, convolve the result of step 2 with 256 3x3 kernels, then with 128 1x1 kernels; step 4, convolve the result of step 3 with 256 3x3 kernels, then with 128 1x1 kernels; step 5, convolve the result of step 4 with 128 3x3 kernels with stride 2;
the sixth convolutional layer comprises 512 convolution kernels of size 3x3 and 512 convolution kernels of size 1x1; in the sixth convolutional layer, the following convolutions are applied to the input in sequence: step 1, convolve with 512 3x3 kernels, then with 256 1x1 kernels; step 2, convolve the result of step 1 with 512 3x3 kernels, then with 256 1x1 kernels; step 3, convolve the result of step 2 with 512 1x1 kernels.
3. The method of claim 1, wherein when training the multi-target detector and the multi-target tracker, a PyTorch deep learning framework is used, a COCO dataset is used as a training dataset to train the multi-target detector model to obtain a trained multi-target detector, and a MOT Benchmark dataset is used as a training dataset to train the multi-target tracker model to obtain a trained multi-target tracker.
4. The method of claim 3, wherein in training the multi-target detector, 3-channel pixel images in the training samples are normalized for each training sample in the COCO dataset, and then input into the multi-target detector in a batch processing manner to train the multi-target detector model;
the acquired target image to be detected is a 3-channel pixel image;
when a target in a target image to be detected is detected through the multi-target detector, the 3-channel pixel image of the target image to be detected is subjected to normalization processing and then is input into the multi-target detector.
5. The method of claim 1, wherein the multi-target tracker model tracks the target by following the following steps:
step S1, dividing the target states input by the multi-target tracker into: activation, tracking, loss, and stop;
step S2, after the multi-target detector inputs the detected target information into the multi-target tracker model, the state of the target is marked as activated in the multi-target tracker model, and whether the target belongs to the target category owned by the training data set is detected:
if yes, the multi-target tracker starts to work, and the process goes to step S3;
if not, stopping tracking the target;
step S3, the target marked as the activated state immediately enters a tracking state, the target starts to be tracked, and at the moment, if the tracked target in each frame of input image is detected, the tracker normally works; if the target tracked at the previous moment disappears from the image at the current moment, the target enters a lost state, the lost state is kept for a certain period of time, if the target is detected again in the period of time, the target returns to the tracking state again, and if the target is not detected yet, the target enters a stop state.
6. The method of claim 1, wherein the output of the multi-target detector model is a vector of n elements, of which 4 elements represent the center coordinates (x, y) and the width w and height h of the target bounding box, 1 element represents the confidence c of the marked target, the remaining n-5 elements each represent a target category, and n is a constant.
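A sketch of unpacking such an output vector; the element ordering (x, y, w, h, c, then class scores) and the choice n = 85 (COCO's 80 categories plus 5) are assumptions consistent with, but not fixed by, the claim:

```python
import torch

def decode_detection(vec: torch.Tensor):
    """Split one n-element detector output into bbox, confidence, class.

    The claim fixes the element counts (4 + 1 + (n - 5)); the exact
    ordering assumed here is illustrative."""
    x, y, w, h = vec[:4].tolist()       # bounding-box center and size
    c = vec[4].item()                   # confidence of the marked target
    class_scores = vec[5:]              # one score per target category
    cls = int(torch.argmax(class_scores))
    return (x, y, w, h), c, cls

out = torch.rand(85)                    # example with n = 85
bbox, conf, cls = decode_detection(out)
```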
7. A target detection and tracking apparatus, applied to automatic driving equipment, characterized by comprising:
a multi-target detector model building module for building a multi-target detector model based on a convolutional neural network, the convolutional neural network for building the multi-target detector model being structured as an input layer, a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a third max-pooling layer, a fourth convolutional layer, a fourth max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fully connected layer, and an output layer, connected in sequence;
a multi-target tracker model building module for building a convolutional neural network as a multi-target tracker model according to a Markov decision process;
a model construction module for training the multi-target detector model and the multi-target tracker model on their respective training datasets to obtain a trained multi-target detector and a trained multi-target tracker;
an image acquisition module for acquiring a target image to be detected;
a target detection module for inputting the target image to be detected into the multi-target detector and detecting a target in the target image to be detected;
and a target tracking module for inputting the output of the multi-target detector into the multi-target tracker and tracking, through the multi-target tracker, the target detected by the multi-target detector.
8. A target detection and tracking system, applied to automatic driving equipment, characterized by comprising an embedded system and an image acquisition device;
the image acquisition device is used for acquiring a target image to be detected;
the embedded system is used for:
loading the trained multi-target detector and multi-target tracker obtained by the target detection and tracking method of any one of claims 1-6;
acquiring the target image to be detected captured by the image acquisition device, inputting it into the multi-target detector, and detecting the target in the target image to be detected;
and inputting the output of the multi-target detector into the multi-target tracker, and tracking, through the multi-target tracker, the target detected by the multi-target detector.
9. A storage medium storing a program, wherein the program, when executed by a processor, implements the target detection and tracking method of any one of claims 1-6.
10. A computing device comprising a processor and a memory for storing a processor-executable program, wherein the processor, when executing the program stored in the memory, implements the target detection and tracking method of any one of claims 1-6.
CN201911188614.6A 2019-11-28 2019-11-28 Target detection and tracking method, device, system, medium and equipment Pending CN110889453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911188614.6A CN110889453A (en) 2019-11-28 2019-11-28 Target detection and tracking method, device, system, medium and equipment

Publications (1)

Publication Number Publication Date
CN110889453A (en) 2020-03-17

Family

ID=69749140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911188614.6A Pending CN110889453A (en) 2019-11-28 2019-11-28 Target detection and tracking method, device, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN110889453A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921225A * 2018-07-10 2018-11-30 Shenzhen SenseTime Technology Co., Ltd. Image processing method and device, computer equipment and storage medium
CN109344712A * 2018-08-31 2019-02-15 University of Electronic Science and Technology of China Road vehicle tracking method
CN109635666A * 2018-11-16 2019-04-16 Nanjing University of Aeronautics and Astronautics Rapid image object detection method based on deep learning
CN110033475A * 2019-03-29 2019-07-19 Beihang University Moving object segmentation and removal method for aerial images based on high-resolution texture generation
CN110245587A * 2019-05-29 2019-09-17 Xi'an Jiaotong University Remote sensing image object detection method based on Bayesian transfer learning
CN110222769A * 2019-06-06 2019-09-10 Dalian University of Technology Improved target detection method based on YOLOV3-tiny
CN110427905A * 2019-08-08 2019-11-08 Beijing Baidu Netcom Science and Technology Co., Ltd. Pedestrian tracking method, device and terminal
CN110443210A * 2019-08-08 2019-11-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Pedestrian tracking method, device and terminal
AU2019101142A4 * 2019-09-30 2019-10-31 Dong, Qirui MR A pedestrian detection method with lightweight backbone based on yolov3 network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YU XIANG ET AL: "Learning to Track: Online Multi-object Tracking by Decision Making", 2015 IEEE International Conference on Computer Vision (ICCV) *
WANG PAN: "Research on Soft Computing Methods in Optimization and Control", 31 January 2017 *
WANG XIAOQING ET AL: "Real-time object detection method for embedded graphics processors", Acta Optica Sinica *
SU XIN: "Research on Network Traffic Analysis and Malicious Behavior Detection in Android Mobile Applications", 31 October 2016 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738063A (en) * 2020-05-08 2020-10-02 华南理工大学 Ship target tracking method, system, computer equipment and storage medium
CN111738063B (en) * 2020-05-08 2023-04-18 华南理工大学 Ship target tracking method, system, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
US20210023720A1 (en) Method for detecting grasping position of robot in grasping object
CN111160269A (en) Face key point detection method and device
Ma et al. Automatic detection and counting system for pavement cracks based on PCGAN and YOLO-MF
CN111144322A (en) Sorting method, device, equipment and storage medium
US11205276B2 (en) Object tracking method, object tracking device, electronic device and storage medium
US20110182469A1 (en) 3d convolutional neural networks for automatic human action recognition
CN111553950B (en) Steel coil centering judgment method, system, medium and electronic terminal
CN111931764B (en) Target detection method, target detection frame and related equipment
Rudiawan et al. The deep learning development for real-time ball and goal detection of barelang-FC
CN112381061B (en) Facial expression recognition method and system
CN114972421A (en) Workshop material identification tracking and positioning method and system
CN116309719A (en) Target tracking method, device, computer equipment and storage medium
Lemos et al. Convolutional neural network based object detection for additive manufacturing
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN111738403A (en) Neural network optimization method and related equipment
CN110889453A (en) Target detection and tracking method, device, system, medium and equipment
Devyatkin et al. Neural network traffic signs detection system development
Yan et al. Whale optimization algorithm based on lateral inhibition for image matching and vision-guided AUV docking
Tambunan et al. Performance comparison of YOLOv4 and YOLOv4-tiny algorithm for object detection on wheeled soccer robot
Fahn et al. A real-time pedestrian legs detection and tracking system used for autonomous mobile robots
Zhao et al. Research on Real-Time Diver Detection and Tracking Method Based on YOLOv5 and DeepSORT
CN113743487A (en) Enhanced remote sensing image target detection method and system
Zhou et al. Visual tracking using improved multiple instance learning with co-training framework for moving robot
Wang et al. Research on an Effective Human Action Recognition Model Based on 3D CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200317