CN107808122B - Target tracking method and device - Google Patents

Target tracking method and device

Info

Publication number
CN107808122B
CN107808122B
Authority
CN
China
Prior art keywords
neural network
target
bounding box
convolutional neural
different
Prior art date
Legal status
Active
Application number
CN201710920018.7A
Other languages
Chinese (zh)
Other versions
CN107808122A (en)
Inventor
杨依凡
王宇庆
杨航
Current Assignee
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Original Assignee
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority date
Filing date
Publication date
Application filed by Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority to CN201710920018.7A
Publication of CN107808122A
Application granted
Publication of CN107808122B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a target tracking method and a target tracking device, which combine two convolutional neural networks with a time recursive neural network model and solve the problem of a low detection rate for small targets. Moreover, information associated with the target in the background is extracted for target detection, so that the speed and the accuracy of the target tracking model in video target detection are improved.

Description

Target tracking method and device
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a target tracking method and apparatus.
Background
Target tracking has always been a hot topic in the fields of computer vision and pattern recognition, and has wide applications in video monitoring, man-machine interaction, vehicle navigation and the like. In the process of realizing the present application, the inventors found that current target tracking methods have a poor detection effect on small targets.
Therefore, how to improve the accuracy of the target detection result is an urgent problem to be solved.
Disclosure of Invention
The application aims to provide a target tracking method and a target tracking device so as to improve the accuracy of a target detection result.
In order to achieve the purpose, the application provides the following technical scheme:
a target tracking method for detecting the target of each frame of image in a video stream through a pre-trained target tracking model comprises the following steps:
a first convolution neural network in the target tracking model performs target detection on the image to obtain the position of the detected target in the image and the category of the detected target;
a second convolutional neural network in the target tracking model performs target detection based on the background on the image to obtain information associated with different types of targets in the background;
and the time recursive neural network in the target tracking model associates the detected target with different backgrounds at different moments based on the information associated with the targets of different types in the backgrounds to obtain a target detection result.
In the method, preferably, the process of performing target detection on the image by the first convolutional neural network includes:
dividing the image into n x n meshes;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grid whose trust value scores for their categories are less than a preset threshold, and performing non-maximum suppression separately on the retained bounding boxes of each different category, to obtain the position and category information of the target.
In the method, preferably, the process of performing target detection on the image by the first convolutional neural network includes:
dividing the image into m × m grids according to L different division granularities, wherein m takes L different values;
predicting a plurality of bounding boxes in each grid corresponding to each division granularity, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box in the grid, the trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grid whose trust value scores for their categories are less than a preset threshold, and performing non-maximum suppression separately on the bounding boxes of each different category retained under the different division granularities, to obtain the position and category information of the target.
Preferably, in the above method, the associating, by the temporal recurrent neural network, the detected target with different backgrounds at different times based on the information associated with the targets of different categories in the backgrounds to obtain the target detection result includes:
the time recursive neural network correlates the detected target with different backgrounds at different times through the pre-learned correlation relationship between the target of the same type and different backgrounds at different times to obtain a target detection result.
In the above method, preferably, the training process of the target tracking model includes:
assigning the weight of the parameter of the convolution layer in the YOLO convolution neural network to the first convolution neural network, and initializing the weights of other parameters of the first convolution neural network by adopting Gaussian random distribution; performing end-to-end training on the first convolutional neural network on a target detection and classification task to obtain a first convolutional neural network model;
assigning the weight of the parameter of the convolution layer in the first convolution neural network to the second convolution neural network, and initializing the weights of other parameters of the second convolution neural network by selecting Gaussian random distribution; performing end-to-end training on the second convolutional neural network on a background-based target type detection task to obtain a second convolutional neural network model;
assigning parameters of the weight of the convolutional layer of the second convolutional neural network model to the convolutional layer of the first convolutional neural network model, training again through the steps, and repeating the steps twice to obtain a final first convolutional neural network model and a final second convolutional neural network model;
training a time recurrent neural network on a task of associating the same type of target with different backgrounds at different moments through a preselected video training set to obtain a time recurrent neural network model; the video training set comprises a first type of video and a second type of video which are equal in quantity, the time lengths of the first type of video and the second type of video are the same, and the variation amplitude of a target in the first type of video is larger than that of a target in the second type of video;
constructing an initial target tracking model: connecting all convolutional layers of a first convolutional neural network model into the time recursive neural network model through a first fully-connected layer, connecting at least one part (for example, all convolutional layers or the first 12 layers) of the convolutional layers of the second convolutional neural network model into the time recursive neural network model through a second fully-connected layer, connecting the output end of the time recursive neural network model with the input ends of the first fully-connected layer and the second fully-connected layer, and the input end of a third fully-connected layer,
and training the initial target tracking model on a preset target detection task to obtain the target tracking model.
Preferably, in the method, the end-to-end training of the first convolutional neural network on the target detection and classification task includes: the first convolutional neural network performs target detection and classification by the following method:
dividing the image into n x n grids;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs;
deleting the bounding boxes in the grids whose trust value scores for their categories are smaller than a preset threshold, and performing non-maximum suppression separately on the bounding boxes of each different category retained over all the grids, to obtain a target detection result;
calculating the error degree of the target detection result of the first convolutional neural network through a preset loss function, wherein the loss function is as follows:

$$\mathrm{Loss}=\lambda_{1}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\Big[(x_{ij}-\hat{x}_{ij})^{2}+(y_{ij}-\hat{y}_{ij})^{2}+(w_{ij}-\hat{w}_{ij})^{2}+(h_{ij}-\hat{h}_{ij})^{2}\Big]+\lambda_{3}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\big(C_{ij}-\hat{C}_{ij}\big)^{2}+\lambda_{2}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{noobj}}\big(C_{ij}-\hat{C}_{ij}\big)^{2}+\lambda_{3}\sum_{i=1}^{S^{2}}\mathbb{1}_{i}^{\mathrm{obj}}\sum_{c}\big(p_{i}(c)-\hat{p}_{i}(c)\big)^{2}$$

wherein Loss is the error degree of the target detection result of the first convolutional neural network; \(\lambda_{1}\) is the loss weight of the coordinate prediction loss and may take the value 5; \(\lambda_{2}\) is the loss weight of the confidence value loss for bounding boxes that do not contain a target and may take the value 0.5; \(\lambda_{3}\) is the loss weight of the confidence value loss and the category loss for bounding boxes that contain a target and may take the value 1; i is used to distinguish different grids and j is used to distinguish different bounding boxes; \(x_{ij}\), \(y_{ij}\), \(w_{ij}\), \(h_{ij}\) and \(C_{ij}\) denote predicted values and \(\hat{x}_{ij}\), \(\hat{y}_{ij}\), \(\hat{w}_{ij}\), \(\hat{h}_{ij}\) and \(\hat{C}_{ij}\) denote the corresponding calibrated values; \(S^{2}\) denotes the number of divided grids, B denotes the number of bounding boxes in a grid, \(C_{ij}\) denotes the confidence score of the j-th bounding box in the i-th grid, and \(p_{i}(c)\) denotes the probability that a target of category c appears in the i-th grid; \(\mathbb{1}_{ij}^{\mathrm{obj}}\) takes the value 1 if the pre-calibrated bounding box has the same item type as that detected by the j-th bounding box in the i-th grid, and 0 otherwise; \(\mathbb{1}_{ij}^{\mathrm{noobj}}\) takes the value 0 in that case, and 1 otherwise; \(\mathbb{1}_{i}^{\mathrm{obj}}\) takes the value 1 if a target center falls within grid i, and 0 otherwise;
and if the error degree is greater than or equal to a preset threshold, updating the weights using the back-propagation algorithm and the Adam update method, and inputting unused data from the training library for the next round of training, until the difference between the loss degree and the minimum value of the loss function is less than a preset threshold.
An object detection device comprising:
the first detection module is used for carrying out target detection on each frame of image in the video stream through a first convolutional neural network to obtain the position of a detected target in the image and the category of the detected target;
the second detection module is used for carrying out target detection based on the background on the image through a second convolutional neural network to obtain information associated with different types of targets in the background;
and the association module is used for associating the detected targets with different backgrounds at different moments based on the information associated with the targets of different types in the backgrounds to obtain target detection results.
Preferably, the first detection module is specifically configured to divide the image into n × n grids through a first convolutional neural network; predict a plurality of bounding boxes in each grid, and record the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box; calculate, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs; and delete the bounding boxes in the grid whose trust value scores for their categories are less than a preset threshold, and perform non-maximum suppression separately on the retained bounding boxes of each different category, to obtain the position and category information of the target.
Preferably, in the apparatus, the first detection module is specifically configured to divide the image into m × m grids according to L different division granularities through a first convolutional neural network, wherein m takes L different values; predict a plurality of bounding boxes in each grid corresponding to each division granularity, and record the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box; calculate, based on the trust value and the category value corresponding to each bounding box in the grid, the trust value score of each bounding box for the category to which it belongs; and delete the bounding boxes in the grid whose trust value scores for their categories are less than a preset threshold, and perform non-maximum suppression separately on the bounding boxes of each different category retained under the different division granularities, to obtain the position and category information of the target.
The above apparatus, preferably, the association module is specifically configured to,
and associating the detected target with different backgrounds at different moments to obtain a target detection result through the association relationship between the target of the same type and different backgrounds at different moments learned in advance.
According to the above scheme, the target tracking method and the target tracking device provided by the present application combine two convolutional neural networks with a time recursive neural network model, and solve the problem of a low detection rate for small targets. Moreover, information associated with the target in the background is extracted for target detection, so that the speed and the accuracy of the target tracking model in video target detection are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is an exemplary diagram of a target tracking model provided by an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a target tracking method provided in an embodiment of the present application;
fig. 3 is a flowchart of an implementation of the object detection apparatus according to the embodiment of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, which is an exemplary diagram of a target tracking model provided in an embodiment of the present application, the target tracking model provided in the present application includes two Convolutional Neural Networks (CNNs) and a time recursive Neural network LSTM (Long Short-Term Memory). The convolutional network 1 is a convolutional layer of one convolutional neural network (hereinafter referred to as a first convolutional neural network for convenience of distinction), and the convolutional network 2 is a convolutional layer of another convolutional neural network (hereinafter referred to as a second convolutional neural network for convenience of distinction).
The following first explains the training process of the target tracking model.
In the embodiment of the application, the two convolutional neural networks and the time recursive neural network are trained independently, an initial target tracking model of the application is constructed based on the result obtained by training, and then the initial target tracking model is trained to obtain a final target tracking model.
In the embodiment of the application, the first convolutional neural network is mainly responsible for extracting the target and marking the type and the position of the target. The first convolutional neural network comprises 24 convolutional layers and 2 fully-connected layers. It can be obtained by training on the basis of a YOLO (You Only Look Once) convolutional neural network. Specifically, the weights of the convolutional-layer parameters in the YOLO convolutional neural network are assigned to the convolutional layers of the first convolutional neural network, and the weights of the fully-connected layers of the first convolutional neural network are initialized using a Gaussian random distribution (for example, a Gaussian random distribution with a mean of zero and a variance of 0.01); the first convolutional neural network is then trained end-to-end on a target detection and classification task to obtain a first convolutional neural network initial model.
in the training process, one way for the first convolutional neural network to perform the target detection and classification task may be as follows:
each frame of image in the training video is divided into n x n grids, wherein n is a positive integer. In an alternative embodiment, n may take a value of 7. The position and the class value of the target are marked in each frame of image in the training video.
Predicting a plurality of bounding boxes (usually rectangular boxes for marking detected targets) in each grid, and recording the position and the size of each predicted bounding box and a corresponding trust value and a corresponding category value of each bounding box; the class value represents the class of the target in the bounding box, the trust value represents two important pieces of information, namely the confidence degree of the target in the predicted bounding box and the prediction accuracy of the bounding box, and the calculation formula of the trust value is as follows:
$$\mathrm{Confidence}=\Pr(\mathrm{Object})\times \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$$

wherein the value of Pr(Object) is determined according to whether a target falls within the bounding box: Pr(Object) takes the value 1 when a target falls within the bounding box, and 0 otherwise; \(\mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}\) denotes the IOU (Intersection-over-Union ratio) between the predicted bounding box and the calibrated target bounding box. Whether a target falls within the bounding box can be judged from the calibration values, and a target falling within the bounding box includes both the case where the target falls entirely within the bounding box and the case where part of the target falls within the bounding box.
Generally, the location of the bounding box is the coordinates of the upper left corner of the bounding box, and the size of the bounding box is the length and width of the bounding box.
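By way of illustration, the following minimal sketch computes the trust value from these quantities under the box convention just described; the function names and the plain-Python formulation are assumptions made only for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x, y, w, h),
    where (x, y) is the top-left corner."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def trust_value(pred_box, truth_box, target_in_box):
    """Trust value = Pr(Object) * IOU(pred, truth); Pr(Object) is 1 when a target
    falls (entirely or partly) within the predicted box, and 0 otherwise."""
    pr_object = 1.0 if target_in_box else 0.0
    return pr_object * iou(pred_box, truth_box)
```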
And calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs.
And multiplying the trust value corresponding to each bounding box by the class value to obtain the specific class trust value score of each bounding box, namely the trust value score of the class to which each bounding box belongs.
And deleting the bounding boxes of which the scores of the trust values of the categories are smaller than a preset score threshold value in the grids, and performing non-maximum suppression on the bounding boxes belonging to the same category in the bounding boxes reserved in the grids to obtain a target detection result of each grid.
The processing mode of each grid is the same, and is not described in detail here.
In an alternative embodiment, the predetermined score threshold may be 0.6.
After the target detection result of each grid is obtained, the non-maximum value suppression is carried out on the bounding boxes belonging to the same category in the whole image, and the final target detection result is obtained.
The non-maximum suppression process for bounding boxes belonging to the same category in the bounding boxes reserved in the grid may be:
determining the bounding box with the highest trust value score among the bounding boxes of the same category (denoted as the first bounding box for convenience of description);
and calculating the coincidence rate between each of the other bounding boxes of the same category (for convenience of description, called second bounding boxes) and the first bounding box; if the coincidence rate is higher than a set value, the second bounding box is deleted, and otherwise the second bounding box is kept.
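The score filtering and per-category non-maximum suppression described above can be sketched as follows, reusing the iou helper from the earlier example; the record layout and the 0.5 overlap threshold are assumptions, and the 0.6 score threshold is the value mentioned for the alternative embodiment above.

```python
def nms_per_category(detections, score_threshold=0.6, overlap_threshold=0.5):
    """detections: list of dicts {'box': (x, y, w, h), 'category': str, 'score': float},
    where 'score' is the class-specific trust value score (trust value * category value).
    Keeps, per category, the highest-scoring boxes whose coincidence rate with an
    already-kept box (measured here by IOU) stays below overlap_threshold."""
    # Delete boxes whose class-specific trust value score is below the threshold.
    detections = [d for d in detections if d['score'] >= score_threshold]
    kept = []
    for cat in {d['category'] for d in detections}:
        boxes = sorted([d for d in detections if d['category'] == cat],
                       key=lambda d: d['score'], reverse=True)
        while boxes:
            first = boxes.pop(0)   # first bounding box: highest score in this category
            kept.append(first)
            # Second bounding boxes overlapping the first one too much are deleted.
            boxes = [d for d in boxes if iou(first['box'], d['box']) < overlap_threshold]
    return kept
```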
Calculating the error degree of the target detection result of the first convolutional neural network through a preset loss function, wherein the error degree represents the error between the predicted values (namely, the detection result) and the calibrated values, and the loss function is as follows:

$$\mathrm{Loss}=\lambda_{1}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\Big[(x_{ij}-\hat{x}_{ij})^{2}+(y_{ij}-\hat{y}_{ij})^{2}+(w_{ij}-\hat{w}_{ij})^{2}+(h_{ij}-\hat{h}_{ij})^{2}\Big]+\lambda_{3}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\big(C_{ij}-\hat{C}_{ij}\big)^{2}+\lambda_{2}\sum_{i=1}^{S^{2}}\sum_{j=1}^{B}\mathbb{1}_{ij}^{\mathrm{noobj}}\big(C_{ij}-\hat{C}_{ij}\big)^{2}+\lambda_{3}\sum_{i=1}^{S^{2}}\mathbb{1}_{i}^{\mathrm{obj}}\sum_{c}\big(p_{i}(c)-\hat{p}_{i}(c)\big)^{2}$$

wherein Loss is the error degree of the target detection result of the first convolutional neural network; \(\lambda_{1}\) is the loss weight of the coordinate prediction loss and may take the value 5; \(\lambda_{2}\) is the loss weight of the confidence value loss for bounding boxes that do not contain a target and may take the value 0.5; \(\lambda_{3}\) is the loss weight of the confidence value loss and the category loss for bounding boxes that contain a target and may take the value 1; i is used to distinguish different grids and j is used to distinguish different bounding boxes. \(x_{ij}\), \(y_{ij}\), \(w_{ij}\), \(h_{ij}\) and \(C_{ij}\) denote predicted values: \(x_{ij}\) and \(y_{ij}\) are the predicted coordinates of the j-th bounding box in the i-th grid, \(w_{ij}\) is its predicted width, \(h_{ij}\) is its predicted height, and \(C_{ij}\) is its predicted confidence score. \(\hat{x}_{ij}\), \(\hat{y}_{ij}\), \(\hat{w}_{ij}\), \(\hat{h}_{ij}\) and \(\hat{C}_{ij}\) denote the corresponding calibrated values: \(\hat{x}_{ij}\) and \(\hat{y}_{ij}\) are the calibrated coordinates of the j-th bounding box in the i-th grid, \(\hat{w}_{ij}\) is its calibrated width, \(\hat{h}_{ij}\) is its calibrated height, and \(\hat{C}_{ij}\) is the calibrated confidence score of the j-th bounding box in the i-th grid. \(S^{2}\) denotes the number of divided grids and B denotes the number of bounding boxes in a grid. \(p_{i}(c)\) denotes the predicted probability that bounding boxes of category c appear in the i-th grid, and \(\hat{p}_{i}(c)\) denotes the calibrated probability of bounding boxes of category c in the i-th grid; the probability of bounding boxes of category c appearing in the i-th grid is the quotient of the number of category-c bounding boxes in the i-th grid and the total number of bounding boxes in the i-th grid.

The value of \(\mathbb{1}_{ij}^{\mathrm{obj}}\) is determined according to whether the j-th bounding box in the i-th grid contains a set detection target: if the pre-calibrated bounding box has the same item type as that detected by the j-th bounding box in the i-th grid, \(\mathbb{1}_{ij}^{\mathrm{obj}}\) takes the value 1; otherwise it takes the value 0. The term weighted by \(\lambda_{3}\mathbb{1}_{ij}^{\mathrm{obj}}\) is the product of the confidence value prediction loss of bounding boxes containing a target and its loss weight, and the term weighted by \(\lambda_{2}\mathbb{1}_{ij}^{\mathrm{noobj}}\) is the product of the confidence value prediction loss of bounding boxes without a target and its loss weight; \(\mathbb{1}_{ij}^{\mathrm{noobj}}\) takes the value 0 where \(\mathbb{1}_{ij}^{\mathrm{obj}}\) takes the value 1, and 1 otherwise. The last term is the product of the category prediction loss and the loss weight indicating whether a target center falls within grid i: \(\mathbb{1}_{i}^{\mathrm{obj}}\) takes the value 1 if a target center falls within grid i and 0 otherwise, and c denotes a category.
In the embodiment of the present application, in order to detect both small targets and large targets and to make the individual losses in the loss function more balanced, the coordinate prediction loss is represented by the Euclidean distance, so that only the coordinates are finely adjusted while the first convolutional neural network is optimized, which alleviates the problems of false detection, missed detection and duplicate detection of targets.
And if the error degree is greater than or equal to a preset threshold, updating the weights using the back-propagation (BP) algorithm and the Adam update method, and inputting other data from the training database for the next round of training, until the error degree is less than the preset threshold.
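Under the term definitions above, the loss computation can be sketched as follows; the array layout (one row per grid cell i, one column per bounding box j) and the variable names are assumptions made only for this illustration.

```python
import numpy as np

def detection_loss(pred, truth, obj_mask, cell_mask,
                   lambda1=5.0, lambda2=0.5, lambda3=1.0):
    """pred / truth: dicts of arrays with keys 'x', 'y', 'w', 'h', 'C' of shape
    (S*S, B) and 'p' of shape (S*S, num_classes).
    obj_mask : (S*S, B) array, 1 where box j of grid i is responsible for a target.
    cell_mask: (S*S,)  array, 1 where a target center falls within grid i."""
    noobj_mask = 1.0 - obj_mask
    coord = np.sum(obj_mask * ((pred['x'] - truth['x']) ** 2 +
                               (pred['y'] - truth['y']) ** 2 +
                               (pred['w'] - truth['w']) ** 2 +
                               (pred['h'] - truth['h']) ** 2))
    conf_obj = np.sum(obj_mask * (pred['C'] - truth['C']) ** 2)
    conf_noobj = np.sum(noobj_mask * (pred['C'] - truth['C']) ** 2)
    cls = np.sum(cell_mask[:, None] * (pred['p'] - truth['p']) ** 2)
    return lambda1 * coord + lambda3 * conf_obj + lambda2 * conf_noobj + lambda3 * cls
```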
In the training process, another way for the first convolutional neural network to perform the target detection and classification task may be:
dividing the image into m × m grids according to L different division granularities, wherein m takes L different values; in an alternative embodiment, L may take the value 4, and the 4 values of m may be 7, 5, 3 and 1, respectively. Then, corresponding to each division granularity,
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each predicted bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grid whose trust value scores for their categories are smaller than a preset threshold, and performing non-maximum suppression separately on the bounding boxes of each different category retained in the grid, that is, performing non-maximum suppression on the bounding boxes belonging to the same category among the bounding boxes retained in the grid, to obtain a target detection result for each grid.
The processing mode of each grid is the same, and is not described in detail here.
After the target detection result of each grid is obtained, carrying out non-maximum suppression on bounding boxes of different types in the whole image respectively, namely carrying out non-maximum suppression on bounding boxes belonging to the same type in the whole image to obtain a final target detection result.
And calculating the error degree of the target detection result of the first convolution neural network through a preset loss function.
And if the error degree is greater than or equal to the preset threshold, updating the weights using the back-propagation (BP) algorithm and the Adam update method, and inputting other data from the training database for the next round of training, until the error degree is less than the preset threshold.
The above target detection and classification process in each granularity division can be referred to as the foregoing process, that is, when the image is divided into 7 × 7 grids, the above target detection process is performed once, when the image is divided into 5 × 5 grids, the above target detection process is performed once, and so on, until the above target detection is performed in each granularity division. The target detection process at each granularity is not described in detail here.
In each training process, the union of the detection results under all the granularities is the final target detection result in the training process.
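A sketch of the multi-granularity procedure is given below; detect_at_granularity is a hypothetical helper standing in for the per-grid prediction, score filtering and within-granularity suppression described above, and nms_per_category is the helper from the earlier sketch.

```python
def multi_granularity_detect(image, granularities=(7, 5, 3, 1)):
    """Runs the per-grid detection procedure once for each division granularity
    and takes the union of the retained bounding boxes as the final result."""
    all_detections = []
    for m in granularities:
        # Hypothetical helper: divide the image into m x m grids, predict boxes,
        # filter by class trust value score, and suppress within this granularity.
        all_detections.extend(detect_at_granularity(image, m))
    # Per-category non-maximum suppression over the whole image, across granularities.
    return nms_per_category(all_detections)
```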
In the embodiment of the application, the target detection and classification are carried out through various granularity divisions, so that the accuracy of the target detection is higher.
The second convolutional neural network is primarily responsible for extracting information associated with different classes of targets in the background. It has the same structure as the first convolutional neural network but performs a different task and produces a different output: its task is background-based detection of the target type, and its output is the information associated with the different types of targets in the background. The second convolutional neural network is optimized using the Softmax function as its loss function, and its parameter updating process is the same as that of the first convolutional network.
When a second convolutional neural network is trained, assigning the weight of the parameter of the convolutional layer in the trained first convolutional neural network to the second convolutional neural network, and initializing the weight of the parameter of the full-connection layer of the second convolutional neural network by selecting Gaussian random distribution; performing end-to-end training on the second convolutional neural network on a background-based target type detection task to obtain a second convolutional neural network model; background-based object type detection may use common detection methods.
Then, the weights of the convolutional-layer parameters of the second convolutional neural network model are assigned to the convolutional layers of the first convolutional neural network model, and the first convolutional neural network model and the second convolutional neural network model are trained again by the above method; this cycle is repeated twice (that is, three rounds of training are performed in total) to obtain the final first convolutional neural network model and the final second convolutional neural network model.
In the embodiment of the application, the first convolutional neural network and the second convolutional neural network are jointly trained, so that the calculation speed in the training process is increased.
From the training process of the two convolutional neural networks, the convolutional layer parameters of the first convolutional neural network and the second convolutional neural network are the same. In order to reduce the calculation time, the first convolutional neural network and the second convolutional neural network can share convolutional layer parameters, so that the occupied storage space can be reduced.
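The convolutional-layer weight assignment between the two networks can be sketched in PyTorch as follows; the shallow DetectionCNN stand-in and its layer sizes are illustrative assumptions rather than the 24-convolutional-layer architecture described above.

```python
import torch.nn as nn

class DetectionCNN(nn.Module):
    """Simplified stand-in for the two networks: a shared-structure convolutional
    stack followed by fully-connected layers specific to each task."""
    def __init__(self, num_outputs):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.1),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_outputs))

    def forward(self, x):
        return self.fc(self.conv(x))

def copy_conv_weights(src, dst):
    # Assign the convolutional-layer weights of src to dst; the fully-connected
    # layers of dst are left untouched.
    dst.conv.load_state_dict(src.conv.state_dict())

cnn1 = DetectionCNN(num_outputs=7 * 7 * 30)   # target detection and classification head
cnn2 = DetectionCNN(num_outputs=20)           # background-based target-type head
copy_conv_weights(cnn1, cnn2)                 # one assignment step of the alternating training
```

Since the convolutional parameters are identical after the final assignment, the two networks can simply reference one shared convolutional module at inference time, which gives the storage saving noted above.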
The time recursive neural network is mainly used for correlating the detected target with different backgrounds at different moments, so that the target detection accuracy in the video is improved.
In the embodiment of the application, a training set comprising two types of videos is selected to train the time recurrent neural network. The quantity of the first type of video and the quantity of the second type of video are equal, the time lengths of the first type of video and the second type of video are the same, and the change amplitude of the target in the first type of video is larger than that of the target in the second type of video; the large variation range of the target may mean that the target suddenly appears, disappears, or the posture and the like have large variation. The small change amplitude of the target may mean that the target changes slowly, does not appear or disappear suddenly, and has a small posture change.
And analyzing the incidence relation between the same target in each video and different backgrounds at different moments by the time recurrent neural network, and obtaining the incidence relation between the same type of target and different backgrounds at different moments by machine learning.
And in the training process, the weights are updated according to the back-propagation-through-time algorithm and the Adam update method.
The respective training processes of the convolutional neural network and the time recursive neural network have been described previously. The following describes a process of training a target tracking model formed by the above-described trained convolutional neural network and temporal recurrent neural network.
An initial target tracking model is constructed from the two trained convolutional neural network models and the time recursive neural network model: all convolutional layers of the first convolutional neural network model are connected into the time recursive neural network model through a first fully-connected layer, at least part of the convolutional layers of the second convolutional neural network model are connected into the time recursive neural network model through a second fully-connected layer, and the output end of the time recursive neural network model is connected with the input ends of the first and second fully-connected layers and with the input end of a third fully-connected layer.
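The wiring just described can be sketched in PyTorch roughly as follows, assuming an LSTM as the time recursive neural network; the feature dimensions and the use of lazily-sized fully-connected layers are illustrative choices, and the feedback connection into the first and second fully-connected layers is sketched separately further below.

```python
import torch
import torch.nn as nn

class TargetTrackingModel(nn.Module):
    """Two convolutional feature extractors feed an LSTM through a first and a
    second fully-connected layer; a third fully-connected layer gives the output."""
    def __init__(self, conv1, conv2, feat_dim=512, hidden=1024, out_dim=7 * 7 * 30):
        super().__init__()
        self.conv1, self.conv2 = conv1, conv2        # trained convolutional stacks
        self.fc1 = nn.LazyLinear(feat_dim)           # first fully-connected layer
        self.fc2 = nn.LazyLinear(feat_dim)           # second fully-connected layer
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.fc3 = nn.Linear(hidden, out_dim)        # third fully-connected layer

    def forward(self, frames):                       # frames: (batch, time, C, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)
        f1 = self.fc1(self.conv1(x).flatten(1))      # target position / category features
        f2 = self.fc2(self.conv2(x).flatten(1))      # background-association features
        seq = torch.cat([f1, f2], dim=1).view(b, t, -1)
        out, _ = self.lstm(seq)                      # association across time steps
        return self.fc3(out)                         # per-frame detection output
```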
And training the initial target tracking model on a preset target detection task to obtain the target tracking model.
The preset target detection task may be:
the method comprises the steps that a first convolution neural network carries out target detection on an image to obtain the position of a detected target in the image and the type of the detected target;
the second convolutional neural network carries out target detection based on the background on the image to obtain information associated with different types of targets in the background;
and the time recursive neural network correlates the detected target with different backgrounds at different moments based on the information correlated with the targets of different types in the backgrounds to obtain a target detection result, and the target detection result is output through a third full-connection layer.
In a preferred embodiment, after the time recursive neural network obtains a target detection result, it does not output the result directly but feeds the target detection result back to the convolutional neural networks, specifically to the fully-connected layers of the convolutional neural networks. The fully-connected layer of the preceding stage randomly selects between the data output by the convolutional network and the data fed back by the LSTM, the randomly selected values are processed by the time recursive neural network to obtain a final target detection result, and the final target detection result is output through the final fully-connected layer. In the embodiment of the present application, this feedback mechanism improves the target detection precision.
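A per-frame sketch of this feedback mechanism, using the TargetTrackingModel above, is given below; it assumes the LSTM hidden size equals the width of the concatenated fully-connected features so the fed-back output can be substituted directly, and the 0.5 selection probability is likewise an assumption since the document only states that the selection is random.

```python
import random
import torch

def forward_with_feedback(model, frames):
    """Process the frames one time step at a time; the input to the time recursive
    neural network is chosen at random between the current convolutional features
    and the LSTM output fed back from the previous time step."""
    state, feedback, outputs = None, None, []
    for x in frames.unbind(dim=1):                   # iterate over time steps
        f1 = model.fc1(model.conv1(x).flatten(1))    # features from the first CNN
        f2 = model.fc2(model.conv2(x).flatten(1))    # features from the second CNN
        feats = torch.cat([f1, f2], dim=1)
        if feedback is not None and random.random() < 0.5:
            feats = feedback                         # use the fed-back LSTM data instead
        out, state = model.lstm(feats.unsqueeze(1), state)
        feedback = out.squeeze(1)                    # fed back to the next time step
        outputs.append(model.fc3(feedback))          # output via the third FC layer
    return torch.stack(outputs, dim=1)
```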
In the target tracking model training process, the weights of the parameters in the convolutional neural networks are updated using the back-propagation (BP) algorithm and the Adam update method, and the weights of the parameters in the time recursive neural network are updated using the back-propagation-through-time algorithm and the Adam update method.
In an alternative embodiment, the process of the first convolutional neural network performing target detection on the image may include:
dividing the image into n x n meshes;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grid, of which the trust values to the categories are less than a preset threshold value, and performing non-maximum value suppression on the bounding boxes which are reserved in the grid and belong to the same category to obtain the position and category information of the target in the grid.
After the target detection result of each grid is obtained, the non-maximum value suppression is carried out on the bounding boxes belonging to the same category in the whole image, and the final target detection result is obtained.
In an alternative embodiment, the process of the first convolutional neural network performing target detection on the image may include:
dividing the image into m × m grids according to L different division granularities, wherein m takes L different values;
corresponding to each division granularity, predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for its category;
and deleting the bounding boxes in the grid, of which the trust values of the categories are less than a preset threshold value, and performing non-maximum suppression on the bounding boxes belonging to the same category in the bounding boxes reserved in the grid to obtain the position and category information of the target.
After the target detection result of each grid is obtained, the non-maximum value suppression is carried out on the bounding boxes belonging to the same category in the whole image, and the final target detection result is obtained.
Target detection was performed at each granularity by the method described above.
After the target tracking model is trained, the target tracking model can be used for target detection.
Referring to fig. 2, fig. 2 is a flowchart of an implementation of a target tracking method according to an embodiment of the present application, where the implementation of the target tracking method includes:
step S21: the first convolution neural network carries out target detection on the image to obtain the position of the detected target in the image and the category of the detected target;
step S22: the second convolutional neural network carries out target detection based on the background on the image to obtain information associated with different types of targets in the background;
step S23: and the time recursive neural network associates the detected targets with different backgrounds at different moments based on the information associated with the targets of different classes in the backgrounds, to obtain target detection results.
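Putting steps S21 to S23 together, a frame-level driver might look like the following sketch; decode_grid is a hypothetical helper that converts one frame's raw output grid into bounding-box records, and nms_per_category is the helper sketched earlier.

```python
import torch

def track_video(frames, model, score_threshold=0.6):
    """frames: list of H x W x 3 uint8 images from the video stream.
    Returns, per frame, the bounding boxes retained after association by the
    time recursive neural network and per-category non-maximum suppression."""
    batch = torch.stack([torch.from_numpy(f).permute(2, 0, 1).float() / 255.0
                         for f in frames]).unsqueeze(0)   # (1, T, C, H, W)
    with torch.no_grad():
        outputs = model(batch)                            # (1, T, out_dim)
    results = []
    for t in range(outputs.shape[1]):
        detections = decode_grid(outputs[0, t])           # hypothetical grid decoder
        results.append(nms_per_category(detections, score_threshold))
    return results
```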
The process of performing target detection on the image by the first convolutional neural network may include:
dividing the image into n x n meshes;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating the trust value score of each bounding box to the category information based on the trust value and the category value corresponding to each bounding box;
and deleting the bounding boxes in the grids, of which the trust value scores to the category information are smaller than a preset threshold value, and performing non-maximum suppression on the bounding boxes reserved in the grids, which belong to the same category, to obtain the position and the category information of the target in each grid.
After the target detection result of each grid is obtained, the non-maximum value suppression is carried out on the bounding boxes belonging to the same category in the whole image, and the final target detection result is obtained.
In another alternative embodiment, the process of performing target detection on the image by the first convolutional neural network may include:
dividing the image into m × m grids according to L different division granularities, wherein m takes L different values; in an alternative embodiment, L may take the value 4, and the 4 values of m may be 7, 5, 3 and 1, respectively. Then, corresponding to each division granularity,
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each predicted bounding box, and a trust value and a category value corresponding to each bounding box;
calculating the trust value score of each bounding box to the category information based on the trust value and the category value corresponding to each bounding box;
and deleting the bounding boxes in the grids, of which the trust value scores to the category information are smaller than a preset threshold value, and performing non-maximum suppression on the bounding boxes belonging to the same category in the bounding boxes reserved in the grids to obtain the position and the category information of the target in each grid.
After the target detection result of each grid is obtained, the non-maximum value suppression is carried out on the bounding boxes belonging to the same category in the whole image, and the final target detection result is obtained.
The process of target detection is the same for each granularity division, which is not described herein.
In an alternative embodiment, the associating, by the temporal recurrent neural network, the detected object with different contexts at different times based on the information associated with the different classes of objects in the contexts to obtain the object detection result may include:
the time recursive neural network correlates the detected target with different backgrounds at different times through the pre-learned correlation relationship between the target of the same type and different backgrounds at different times to obtain a target detection result.
Corresponding to the method embodiment, the present application further provides a target detection apparatus, and an implementation flowchart of the target detection apparatus provided in the embodiment of the present application is shown in fig. 3, and may include:
a first detection module 31, a second detection module 32 and an association module 33; wherein,
the first detection module 31 is configured to perform target detection on each frame of image in the video stream through a first convolutional neural network, so as to obtain a position of a detected target in the image and a category of the detected target;
the second detection module 32 is configured to perform background-based target detection on the image through a second convolutional neural network, so as to obtain information associated with different types of targets in the background;
the association module 33 is configured to associate the detected target with different backgrounds at different times based on the information associated with the targets of different categories in the backgrounds, so as to obtain a target detection result.
The target detection device provided by the present application combines two convolutional neural networks with a time recursive neural network model, solving the problem of a low detection rate for small targets. Moreover, information associated with the target in the background is extracted for target detection, so that the speed and the accuracy of the target tracking model in video target detection are improved.
In an optional embodiment, the first detection module 31 may be specifically configured to divide the image into n × n grids through a first convolutional neural network; predict a plurality of bounding boxes in each grid, and record the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box; calculate, based on the trust value and the category value corresponding to each bounding box, the trust value score of each bounding box for the category to which it belongs; and delete the bounding boxes in the grid whose trust value scores for their categories are less than a preset threshold, and perform non-maximum suppression separately on the retained bounding boxes of each different category, to obtain the position and category information of the target.
In another optional embodiment, the first detecting module 31 may be specifically configured to divide the image into m × m grids according to L different division granularities through a first convolutional neural network, where m has L different values; predicting a plurality of bounding boxes in each grid corresponding to each partition granularity, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box; calculating the trust value score of each bounding box to the category of each bounding box based on the trust value and the category value corresponding to each bounding box in the grid; and deleting the bounding boxes in the grid, of which the trust values of the categories are less than a preset threshold value, and respectively inhibiting the non-maximum values of the bounding boxes of different categories reserved under different partition granularities to obtain the position and category information of the target.
In an alternative embodiment, the association module 33 may be specifically adapted to,
and associating the detected target with different backgrounds at different moments to obtain a target detection result through the association relationship between the target of the same type and different backgrounds at different moments learned in advance.
In an optional embodiment, the target detection apparatus may further include:
the training module is used for training a target tracking model, and specifically used for assigning the weight of the parameter of the convolution layer in the YOLO convolutional neural network to the first convolutional neural network, and the weights of other parameters of the first convolutional neural network are initialized by adopting Gaussian random distribution; performing end-to-end training on the first convolutional neural network on a target detection and classification task to obtain a first convolutional neural network model;
assigning the weight of the parameter of the convolution layer in the first convolution neural network to the second convolution neural network, and initializing the weights of other parameters of the second convolution neural network by selecting Gaussian random distribution; performing end-to-end training on the second convolutional neural network on a background-based target type detection task to obtain a second convolutional neural network model;
assigning parameters of the weight of the convolutional layer of the second convolutional neural network model to the convolutional layer of the first convolutional neural network model, training again through the steps, and repeating the steps twice to obtain a final first convolutional neural network model and a final second convolutional neural network model;
training a time recurrent neural network on a task of associating the same type of target with different backgrounds at different moments through a preselected video training set to obtain a time recurrent neural network model; the video training set comprises a first type of video and a second type of video which are equal in quantity, the time lengths of the first type of video and the second type of video are the same, and the variation amplitude of a target in the first type of video is larger than that of a target in the second type of video;
constructing an initial target tracking model: connecting all convolutional layers of a first convolutional neural network model into the time recursive neural network model through a first fully-connected layer, connecting at least one part (for example, all convolutional layers or the first 12 layers) of the convolutional layers of the second convolutional neural network model into the time recursive neural network model through a second fully-connected layer, and connecting the output end of the time recursive neural network model with the input ends of the first fully-connected layer and the second fully-connected layer and the input end of a third fully-connected layer.
And training the initial target tracking model on a preset target detection task to obtain the target tracking model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the technical problems can be solved by combining the features of the embodiments and of the claims with one another.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A target tracking method is characterized in that target detection is carried out on each frame of image in a video stream through a pre-trained target tracking model, and the method comprises the following steps:
a first convolution neural network in the target tracking model performs target detection on the image to obtain the position of the detected target in the image and the category of the detected target;
a second convolutional neural network in the target tracking model performs target detection based on the background on the image to obtain information associated with different types of targets in the background;
the time recurrent neural network in the target tracking model associates the detected target with different backgrounds at different moments based on the information associated with the targets of different classes in the backgrounds to obtain a target detection result;
after obtaining a target detection result, the time recurrent neural network feeds back the target detection result to a first full connection layer of the first convolutional neural network and a second full connection layer of the second convolutional neural network; the first full connection layer and the second full connection layer randomly select between the data output by the convolutional neural networks and the data fed back by the time recurrent neural network, the randomly selected data are processed by the time recurrent neural network to obtain a final target detection result, and the final target detection result is output through a third full connection layer;
the training process of the target tracking model comprises the following steps:
assigning the weight parameters of the convolution layers in the YOLO convolutional neural network to the first convolutional neural network, and initializing the other weight parameters of the first convolutional neural network with a Gaussian random distribution; performing end-to-end training of the first convolutional neural network on a target detection and classification task to obtain a first convolutional neural network model;
assigning the weight parameters of the convolution layers in the first convolutional neural network to the second convolutional neural network, and initializing the other weight parameters of the second convolutional neural network with a Gaussian random distribution; performing end-to-end training of the second convolutional neural network on a background-based target type detection task to obtain a second convolutional neural network model;
assigning the weight parameters of the convolution layers of the second convolutional neural network model to the convolution layers of the first convolutional neural network model, training again through the above steps, and repeating these steps twice to obtain a final first convolutional neural network model and a final second convolutional neural network model;
training a time recurrent neural network on a task of associating the same type of target with different backgrounds at different moments through a preselected video training set to obtain a time recurrent neural network model; the video training set comprises a first type of video and a second type of video which are equal in quantity, the time lengths of the first type of video and the second type of video are the same, and the variation amplitude of a target in the first type of video is larger than that of a target in the second type of video;
constructing an initial target tracking model: connecting all convolutional layers of a first convolutional neural network model into the time recursive neural network model through a first fully-connected layer, connecting at least one part of convolutional layers of a second convolutional neural network model into the time recursive neural network model through a second fully-connected layer, and connecting the output end of the time recursive neural network model with the input ends of the first fully-connected layer and the second fully-connected layer and the input end of a third fully-connected layer;
and training the initial target tracking model on a preset target detection task to obtain the target tracking model.
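As a concrete illustration of the wiring recited in claim 1, the following is a minimal PyTorch sketch. The layer sizes, the use of an LSTM cell as the time recurrent neural network, and the simplification of the random selection step to a concatenation of the CNN outputs with the fed-back state are illustrative assumptions, not the claimed design.

# Minimal sketch of the two-branch CNN + temporal RNN architecture (assumed sizes).
import torch
import torch.nn as nn

class TrackerSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256, num_outputs=10):
        super().__init__()
        # First CNN branch: target detection (stand-in for a YOLO-style backbone).
        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        # Second CNN branch: background-based target type detection.
        self.cnn2 = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.fc1 = nn.Linear(32 * 16 + hidden_dim, feat_dim)   # first fully connected layer
        self.fc2 = nn.Linear(32 * 16 + hidden_dim, feat_dim)   # second fully connected layer
        self.rnn = nn.LSTMCell(2 * feat_dim, hidden_dim)       # time recurrent network stand-in
        self.fc3 = nn.Linear(hidden_dim, num_outputs)          # third fully connected layer (output)

    def forward(self, frames):
        # frames: (T, 3, H, W) video clip; the LSTM state carries the fed-back result.
        h = frames.new_zeros(1, self.rnn.hidden_size)
        c = frames.new_zeros(1, self.rnn.hidden_size)
        outputs = []
        for t in range(frames.shape[0]):
            x = frames[t:t + 1]
            f1 = torch.cat([self.cnn1(x), h], dim=1)   # feedback into the first FC layer
            f2 = torch.cat([self.cnn2(x), h], dim=1)   # feedback into the second FC layer
            z = torch.cat([torch.relu(self.fc1(f1)), torch.relu(self.fc2(f2))], dim=1)
            h, c = self.rnn(z, (h, c))
            outputs.append(self.fc3(h))                # final result through the third FC layer
        return torch.stack(outputs)

clip = torch.randn(4, 3, 64, 64)       # four frames of a toy video stream
print(TrackerSketch()(clip).shape)     # -> torch.Size([4, 1, 10])

In this sketch the hidden state h carries the fed-back detection result into both fully connected layers; a faithful implementation would use YOLO-scale backbones for the two branches and would implement the random selection between CNN output and fed-back data rather than a plain concatenation.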
2. The method of claim 1, wherein the first convolutional neural network performs a target detection process on the image, comprising:
dividing the image into n x n meshes;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, a trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grids whose category trust value scores are less than a preset threshold, and performing non-maximum suppression separately on the retained bounding boxes of each category to obtain the position and category information of the target.
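As a concrete illustration of the post-processing recited in claim 2, the following is a minimal NumPy sketch of per-box class scoring, thresholding, and per-class non-maximum suppression. The box layout (x1, y1, x2, y2), the score threshold, and the IoU cut-off are illustrative assumptions.

# Minimal sketch: class-specific scores, thresholding, and per-class NMS.
import numpy as np

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = np.maximum(a[0], b[0]), np.maximum(a[1], b[1])
    x2, y2 = np.minimum(a[2], b[2]), np.minimum(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def postprocess(boxes, confidences, class_probs, score_thr=0.2, iou_thr=0.5):
    """boxes: (N, 4); confidences: (N,); class_probs: (N, C)."""
    scores = confidences[:, None] * class_probs          # class-specific trust value scores
    classes = scores.argmax(axis=1)
    best = scores.max(axis=1)
    keep = best >= score_thr                              # delete low-scoring bounding boxes
    boxes, classes, best = boxes[keep], classes[keep], best[keep]
    kept = []
    for c in np.unique(classes):                           # NMS separately for each category
        idx = np.where(classes == c)[0]
        idx = idx[np.argsort(-best[idx])]
        while idx.size:
            i, idx = idx[0], idx[1:]
            kept.append((boxes[i], int(c), float(best[i])))
            idx = np.array([j for j in idx if iou(boxes[i], boxes[j]) < iou_thr])
    return kept

# toy example: two overlapping boxes of one class and one box of another class
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
conf = np.array([0.9, 0.8, 0.7])
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
print(postprocess(boxes, conf, probs))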
3. The method of claim 1, wherein the first convolutional neural network performs a target detection process on the image, comprising:
dividing the image into m × m grids according to L different division granularities, wherein m takes L different values;
predicting a plurality of bounding boxes in each grid corresponding to each partition granularity, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box in the grids, a trust value score of each bounding box for the category to which it belongs;
and deleting the bounding boxes in the grids whose category trust value scores are less than a preset threshold, and performing non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities to obtain the position and category information of the target.
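As a concrete illustration of the multi-granularity division recited in claim 3, the following minimal NumPy sketch divides a square image into m × m grids for L = 3 assumed granularities and pools one candidate per cell before joint per-class suppression would be applied; the granularity values, the image size, and the random stand-in confidences are illustrative assumptions.

# Minimal sketch: pooling grid-cell candidates over several division granularities.
import numpy as np

granularities = [7, 11, 14]                       # L = 3 assumed values of m

def cell_centers(m, img_size=448):
    """Centre coordinates of every cell in an m-by-m grid over a square image."""
    step = img_size / m
    xs = (np.arange(m) + 0.5) * step
    return np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # (m*m, 2)

all_candidates = []
for m in granularities:
    centers = cell_centers(m)
    conf = np.random.rand(len(centers))            # stand-in for per-cell trust values
    all_candidates.append(np.hstack([centers, conf[:, None]]))

candidates = np.vstack(all_candidates)             # pooled over all granularities
print(candidates.shape)                            # (7*7 + 11*11 + 14*14, 3) = (366, 3)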
4. The method of claim 1, wherein the time recurrent neural network associates the detected target with different backgrounds at different moments based on the information associated with targets of different classes in the backgrounds to obtain a target detection result, comprising:
the time recursive neural network correlates the detected target with different backgrounds at different times through the pre-learned correlation relationship between the target of the same type and different backgrounds at different times to obtain a target detection result.
5. The method of claim 1, wherein the end-to-end training of the first convolutional neural network on a target detection and classification task comprises: the first convolutional neural network performs target detection and classification by the following method:
dividing the image into n x n grids;
predicting a plurality of bounding boxes in each grid, and recording the position and the size of each bounding box, and a trust value and a category value corresponding to each bounding box;
calculating, based on the trust value and the category value corresponding to each bounding box, a trust value score of each bounding box for the category to which it belongs;
deleting the bounding boxes in the grids whose category trust value scores are smaller than a preset threshold, and performing non-maximum suppression separately on the bounding boxes of each category retained in all the grids to obtain a target detection result;
calculating the error degree of the target detection result of the first convolutional neural network through a preset loss function, wherein the loss function is as follows:
$$
\begin{aligned}
\mathrm{Loss} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_{ij}}-\sqrt{\hat{w}_{ij}}\right)^2 + \left(\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}}\right)^2 \right] \\
&+ \lambda_{\mathrm{obj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left(C_{ij}-\hat{C}_{ij}\right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left(C_{ij}-\hat{C}_{ij}\right)^2 \\
&+ \lambda_{\mathrm{obj}} \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$
wherein Loss is the error degree of the target detection result of the first convolutional neural network; λ_coord is the loss weight of the coordinate prediction loss, and its value may be 5; λ_noobj is the loss weight of the trust value loss for bounding boxes that do not contain a target, and its value may be 0.5; λ_obj is the loss weight of the trust value loss and the category loss for bounding boxes that contain a target, and its value may be 1; i is used to distinguish different grids and j is used to distinguish different bounding boxes; the hatted quantities denote predicted values: x̂_ij and ŷ_ij are the predicted coordinates of the jth bounding box in the ith grid, ŵ_ij is its predicted width, and ĥ_ij is its predicted height; the unhatted quantities denote calibrated values: x_ij and y_ij are the calibrated coordinates of the jth bounding box in the ith grid, w_ij is its calibrated width, and h_ij is its calibrated height; S² represents the number of divided grids and B represents the number of bounding boxes in a grid; Ĉ_ij represents the predicted trust value score of the jth bounding box in the ith grid and C_ij represents the calibrated trust value score of the jth bounding box in the ith grid; p̂_i(c) represents the probability that a target of category c occurs in the ith grid and p_i(c) represents the calibrated probability of a bounding box of category c in the ith grid; 1_ij^obj takes the value 1 if the item type detected by the jth bounding box in the ith grid is the same as that of the pre-calibrated bounding box, and 0 otherwise; 1_ij^noobj takes the value 0 if they are the same, and 1 otherwise;
and if the error degree is greater than or equal to a preset threshold, updating the weights by a back-propagation algorithm with the Adam update method, and inputting unused data from the training library for the next round of training, until the difference between the loss value and the minimum value of the loss function is less than the preset threshold.
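As a concrete illustration of a loss of the form given in claim 5, the following is a minimal NumPy sketch. The packed array layout, the precomputed indicator masks, and the application of the category term per bounding box rather than per grid are simplifying assumptions.

# Minimal sketch of the claim-5 loss over packed prediction/calibration arrays.
import numpy as np

def yolo_loss(pred, target, obj_mask, lam_coord=5.0, lam_obj=1.0, lam_noobj=0.5):
    """
    pred, target: (S*S, B, 5 + C) arrays laid out as [x, y, w, h, C, p(c1)...p(cC)]
    obj_mask:     (S*S, B) array, 1 where the calibrated box matches the detected class, else 0
    """
    noobj_mask = 1.0 - obj_mask
    xy_loss = ((pred[..., 0] - target[..., 0]) ** 2 +
               (pred[..., 1] - target[..., 1]) ** 2)
    wh_loss = ((np.sqrt(pred[..., 2]) - np.sqrt(target[..., 2])) ** 2 +
               (np.sqrt(pred[..., 3]) - np.sqrt(target[..., 3])) ** 2)
    conf_loss = (pred[..., 4] - target[..., 4]) ** 2
    class_loss = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(axis=-1)
    return (lam_coord * (obj_mask * (xy_loss + wh_loss)).sum()
            + lam_obj * (obj_mask * (conf_loss + class_loss)).sum()
            + lam_noobj * (noobj_mask * conf_loss).sum())

S, B, C = 7, 2, 3
pred = np.random.rand(S * S, B, 5 + C)
target = np.random.rand(S * S, B, 5 + C)
obj_mask = (np.random.rand(S * S, B) > 0.9).astype(float)
print(yolo_loss(pred, target, obj_mask))

With obj_mask built from the class-match rule in claim 5, the default weights lam_coord = 5, lam_noobj = 0.5 and lam_obj = 1 correspond to the values stated above.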
6. An object detection device, comprising:
the first detection module is used for carrying out target detection on each frame of image in the video stream through a first convolution neural network in a target tracking model to obtain the position of a detected target in the image and the category of the detected target;
the second detection module is used for carrying out background-based target detection on the image through a second convolutional neural network in the target tracking model to obtain information associated with different types of targets in the background;
the correlation module is used for correlating the detected target with different backgrounds at different moments through a time recurrent neural network in the target tracking model based on the information associated with the targets of different classes in the backgrounds to obtain a target detection result; after obtaining a target detection result, the time recurrent neural network feeds back the target detection result to a first full connection layer of the first convolutional neural network and a second full connection layer of the second convolutional neural network; the first full connection layer and the second full connection layer randomly select between the data output by the convolutional neural networks and the data fed back by the time recurrent neural network, the randomly selected data are processed by the time recurrent neural network to obtain a final target detection result, and the final target detection result is output through a third full connection layer;
the training module is used for training the target tracking model, the specific training process being: assigning the weight parameters of the convolution layers in the YOLO convolutional neural network to the first convolutional neural network, and initializing the other weight parameters of the first convolutional neural network with a Gaussian random distribution; performing end-to-end training of the first convolutional neural network on a target detection and classification task to obtain a first convolutional neural network model;
assigning the weight parameters of the convolution layers in the first convolutional neural network to the second convolutional neural network, and initializing the other weight parameters of the second convolutional neural network with a Gaussian random distribution; performing end-to-end training of the second convolutional neural network on a background-based target type detection task to obtain a second convolutional neural network model;
assigning the weight parameters of the convolution layers of the second convolutional neural network model to the convolution layers of the first convolutional neural network model, training again through the above steps, and repeating these steps twice to obtain a final first convolutional neural network model and a final second convolutional neural network model;
training a time recurrent neural network on a task of associating the same type of target with different backgrounds at different moments through a preselected video training set to obtain a time recurrent neural network model; the video training set comprises a first type of video and a second type of video which are equal in quantity, the time lengths of the first type of video and the second type of video are the same, and the variation amplitude of a target in the first type of video is larger than that of a target in the second type of video;
constructing an initial target tracking model: connecting all convolutional layers of a first convolutional neural network model into the time recursive neural network model through a first fully-connected layer, connecting at least one part of convolutional layers of a second convolutional neural network model into the time recursive neural network model through a second fully-connected layer, and connecting the output end of the time recursive neural network model with the input ends of the first fully-connected layer and the second fully-connected layer and the input end of a third fully-connected layer;
and training the initial target tracking model on a preset target detection task to obtain the target tracking model.
7. The apparatus of claim 6, wherein the first detection module is specifically configured to divide the image into n x n grids through the first convolutional neural network; predict a plurality of bounding boxes in each grid, and record the position and size of each bounding box and a trust value and a category value corresponding to each bounding box; calculate, based on the trust value and the category value corresponding to each bounding box, a trust value score of each bounding box for the category to which it belongs; and delete the bounding boxes in the grids whose category trust value scores are less than a preset threshold, and perform non-maximum suppression separately on the retained bounding boxes of each category to obtain the position and category information of the target.
8. The apparatus of claim 6, wherein the first detection module is specifically configured to divide the image into m × m grids according to L different division granularities through the first convolutional neural network, wherein m takes L different values; predict a plurality of bounding boxes in each grid corresponding to each division granularity, and record the position and size of each bounding box and a trust value and a category value corresponding to each bounding box; calculate, based on the trust value and the category value corresponding to each bounding box in the grids, a trust value score of each bounding box for the category to which it belongs; and delete the bounding boxes in the grids whose category trust value scores are less than a preset threshold, and perform non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities to obtain the position and category information of the target.
9. The apparatus according to claim 6, wherein the association module is specifically configured to associate the detected target with different backgrounds at different moments, through the pre-learned association relationship between targets of the same type and different backgrounds at different moments, to obtain a target detection result.
CN201710920018.7A 2017-09-30 2017-09-30 Target tracking method and device Active CN107808122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710920018.7A CN107808122B (en) 2017-09-30 2017-09-30 Target tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710920018.7A CN107808122B (en) 2017-09-30 2017-09-30 Target tracking method and device

Publications (2)

Publication Number Publication Date
CN107808122A CN107808122A (en) 2018-03-16
CN107808122B true CN107808122B (en) 2020-08-11

Family

ID=61584759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710920018.7A Active CN107808122B (en) 2017-09-30 2017-09-30 Target tracking method and device

Country Status (1)

Country Link
CN (1) CN107808122B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008792B (en) * 2018-01-05 2021-10-22 比亚迪股份有限公司 Image detection method, image detection device, computer equipment and storage medium
CN110619254B (en) * 2018-06-19 2023-04-18 海信集团有限公司 Target tracking method and device based on disparity map and terminal
CN108968811A (en) * 2018-06-20 2018-12-11 四川斐讯信息技术有限公司 A kind of object identification method and system of sweeping robot
CN108764215A (en) * 2018-06-21 2018-11-06 郑州云海信息技术有限公司 Target search method for tracing, system, service centre and terminal based on video
CN109145781B (en) * 2018-08-03 2021-05-04 北京字节跳动网络技术有限公司 Method and apparatus for processing image
CN110826572B (en) * 2018-08-09 2023-04-21 京东方科技集团股份有限公司 Non-maximum value inhibition method, device and equipment for multi-target detection
CN110826379B (en) * 2018-08-13 2022-03-22 中国科学院长春光学精密机械与物理研究所 Target detection method based on feature multiplexing and YOLOv3
CN111104831B (en) * 2018-10-29 2023-09-29 香港城市大学深圳研究院 Visual tracking method, device, computer equipment and medium
CN111178495B (en) * 2018-11-10 2023-06-30 杭州凝眸智能科技有限公司 Lightweight convolutional neural network for detecting very small objects in an image
CN109410251B (en) * 2018-11-19 2022-05-03 南京邮电大学 Target tracking method based on dense connection convolution network
CN109817009A (en) * 2018-12-31 2019-05-28 天合光能股份有限公司 A method of obtaining unmanned required dynamic information
CN109753931A (en) * 2019-01-04 2019-05-14 广州广电卓识智能科技有限公司 Convolutional neural networks training method, system and facial feature points detection method
CN110007366B (en) * 2019-03-04 2020-08-25 中国科学院深圳先进技术研究院 Life searching method and system based on multi-sensor fusion
CN110087041B (en) * 2019-04-30 2021-01-08 中国科学院计算技术研究所 Video data processing and transmitting method and system based on 5G base station
CN110443789B (en) * 2019-08-01 2021-11-26 四川大学华西医院 Method for establishing and using immune fixed electrophoretogram automatic identification model
CN110487211B (en) * 2019-09-29 2020-07-24 中国科学院长春光学精密机械与物理研究所 Aspheric element surface shape detection method, device and equipment and readable storage medium
CN112306104A (en) * 2020-11-17 2021-02-02 广西电网有限责任公司 Image target tracking holder control method based on grid weighting
CN112911171B (en) * 2021-02-04 2022-04-22 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN115482417B (en) * 2022-09-29 2023-08-08 珠海视熙科技有限公司 Multi-target detection model, training method, device, medium and equipment thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN106846364A (en) * 2016-12-30 2017-06-13 明见(厦门)技术有限公司 A kind of method for tracking target and device based on convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682697B (en) * 2016-12-29 2020-04-14 华中科技大学 End-to-end object detection method based on convolutional neural network
CN106911930A (en) * 2017-03-03 2017-06-30 深圳市唯特视科技有限公司 It is a kind of that the method for perceiving video reconstruction is compressed based on recursive convolution neutral net

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN106846364A (en) * 2016-12-30 2017-06-13 明见(厦门)技术有限公司 A kind of method for tracking target and device based on convolutional neural networks

Also Published As

Publication number Publication date
CN107808122A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808122B (en) Target tracking method and device
Li et al. Adaptively constrained dynamic time warping for time series classification and clustering
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN107169463B (en) Method for detecting human face, device, computer equipment and storage medium
CN107220618B (en) Face detection method and device, computer readable storage medium and equipment
CN108470354A (en) Video target tracking method, device and realization device
CN113240936B (en) Parking area recommendation method and device, electronic equipment and medium
CN110348437B (en) Target detection method based on weak supervised learning and occlusion perception
CN109272016A (en) Object detection method, device, terminal device and computer readable storage medium
CN109740416B (en) Target tracking method and related product
Xu et al. Stochastic Online Anomaly Analysis for Streaming Time Series.
CN110796141A (en) Target detection method and related equipment
CN111553488A (en) Risk recognition model training method and system for user behaviors
CN113239914B (en) Classroom student expression recognition and classroom state evaluation method and device
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN111008631A (en) Image association method and device, storage medium and electronic device
CN113065593A (en) Model training method and device, computer equipment and storage medium
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN115719294A (en) Indoor pedestrian flow evacuation control method and system, electronic device and medium
CN109919043B (en) Pedestrian tracking method, device and equipment
CN113296089A (en) LMB density fusion method and device for multi-early-warning-machine target tracking system
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
CN110147768B (en) Target tracking method and device
CN115346125B (en) Target detection method based on deep learning
CN108133234B (en) Sparse subset selection algorithm-based community detection method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant