CN107808122A - Target tracking method and device - Google Patents

Target tracking method and device

Info

Publication number
CN107808122A
Authority
CN
China
Prior art keywords
target
bounding box
neural networks
Prior art date
Legal status
Granted
Application number
CN201710920018.7A
Other languages
Chinese (zh)
Other versions
CN107808122B (en)
Inventor
杨依凡
王宇庆
杨航
Current Assignee
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Original Assignee
Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority date
Filing date
Publication date
Application filed by Changchun Institute of Optics, Fine Mechanics and Physics of CAS
Priority to CN201710920018.7A
Publication of CN107808122A
Application granted
Publication of CN107808122B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

Embodiments of the present application disclose a target tracking method and device in which two convolutional neural networks are combined with a temporal recurrent neural network model, solving the problem of a low detection rate for small targets. Furthermore, information in the background that is associated with the targets is extracted and used for target detection, improving both the speed and the accuracy of the target tracking model in video target detection.

Description

Target tracking method and device
Technical field
The present application relates to the field of target detection technology, and more particularly to a target tracking method and device.
Background art
Target tracking has long been a hot topic in computer vision and pattern recognition, with wide applications in video surveillance, human-computer interaction, vehicle navigation, and so on. In the course of developing the present application, the inventors found that current target tracking methods perform poorly on very small targets.
Therefore, how to improve the accuracy of target detection results has become an urgent problem to be solved.
Summary of the invention
The purpose of the present application is to provide a target tracking method and device so as to improve the accuracy of target detection results.
To achieve the above purpose, the present application provides the following technical solution:
A target tracking method, in which target detection is performed on each frame of a video stream by a pre-trained target tracking model, including:
a first convolutional neural network in the target tracking model performs target detection on the image, obtaining the position of each detected target in the image and the category of each detected target;
a second convolutional neural network in the target tracking model performs background-based target detection on the image, obtaining information in the background that is associated with targets of different categories;
a temporal recurrent neural network in the target tracking model, based on the information in the background associated with targets of different categories, associates the detected targets with different backgrounds at different moments, obtaining the target detection result.
In the above method, preferably, the process by which the first convolutional neural network performs target detection on an image includes:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold, and performing non-maximum suppression separately on the retained bounding boxes of each category, obtaining the positions and category information of the targets.
In the above method, preferably, the process by which the first convolutional neural network performs target detection on an image includes:
dividing the image into m*m grid cells according to L different division granularities, where m takes L different values;
for each division granularity, predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box in a grid cell, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold, and performing non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities, obtaining the positions and category information of the targets.
In the above method, preferably, the temporal recurrent neural network, based on the information in the background associated with targets of different categories, associating the detected targets with different backgrounds at different moments to obtain the target detection result, includes:
the temporal recurrent neural network, using association relations learned in advance between targets of the same type at different moments and different backgrounds, associates the detected targets with different backgrounds at different moments, obtaining the target detection result.
In the above method, preferably, the training process of the target tracking model includes:
assigning the weights of the convolutional-layer parameters of a YOLO convolutional neural network to the first convolutional neural network, and initializing the weights of the remaining parameters of the first convolutional neural network from a Gaussian random distribution; training the first convolutional neural network end to end on a target detection and classification task, obtaining a first convolutional neural network model;
assigning the weights of the convolutional-layer parameters of the first convolutional neural network to the second convolutional neural network, and initializing the weights of the remaining parameters of the second convolutional neural network from a Gaussian random distribution; training the second convolutional neural network end to end on a background-based target type detection task, obtaining a second convolutional neural network model;
assigning the weights of the convolutional layers of the second convolutional neural network model to the convolutional layers of the first convolutional neural network model, and training again according to the above steps; repeating this cycle twice to obtain the final first convolutional neural network model and second convolutional neural network model;
training the temporal recurrent neural network, on a video training set selected in advance, on the task of associating targets of the same type at different moments with different backgrounds, obtaining a temporal recurrent neural network model; the video training set includes equal numbers of first-class videos and second-class videos, the first-class videos and second-class videos have the same duration, and the variation amplitude of the targets in the first-class videos is greater than the variation amplitude of the targets in the second-class videos;
constructing an initial target tracking model: connecting all convolutional layers of the first convolutional neural network model to the temporal recurrent neural network model through a first fully connected layer; connecting at least a part of the convolutional layers of the second convolutional neural network model (for example, all of them, or the first 12 layers) to the temporal recurrent neural network model through a second fully connected layer; and connecting the output of the temporal recurrent neural network model to the inputs of the first and second fully connected layers and to the input of a third fully connected layer;
training the initial target tracking model on a preset target detection task, obtaining the target tracking model.
In the above method, preferably, training the first convolutional neural network end to end on the target detection and classification task includes: the first convolutional neural network performs target detection and classification in the following way:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold, and performing non-maximum suppression separately on the bounding boxes of each category retained in all grid cells, obtaining the target detection result;
computing the error degree of the target detection result of the first convolutional neural network through a preset loss function; the original formula image is not reproduced in this text, but the loss function consistent with the definitions below (the standard YOLO form) is:

$$
\begin{aligned}
Loss ={}& \lambda_1\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2\Big]\\
&+\lambda_1\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\Big[\big(\sqrt{w_{ij}}-\sqrt{\hat{w}_{ij}}\big)^2+\big(\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}}\big)^2\Big]\\
&+\lambda_3\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\big(C_{ij}-\hat{C}_{ij}\big)^2
+\lambda_2\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{noobj}_{ij}\big(C_{ij}-\hat{C}_{ij}\big)^2\\
&+\lambda_3\sum_{i=1}^{S^2}\mathbb{1}^{obj}_{i}\sum_{c}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
$$

where Loss is the error degree of the target detection result of the first convolutional neural network; $\lambda_1$ is the loss weight of the coordinate prediction loss and may take the value 5; $\lambda_2$ is the loss weight of the confidence loss of bounding boxes containing no target and may take the value 0.5; $\lambda_3$ is the loss weight of the confidence loss and classification loss of bounding boxes containing a target and may take the value 1; i distinguishes the grid cells and j distinguishes the bounding boxes; $x_{ij}$, $y_{ij}$, $w_{ij}$, $h_{ij}$, $C_{ij}$ denote predicted values, and $\hat{x}_{ij}$, $\hat{y}_{ij}$, $\hat{w}_{ij}$, $\hat{h}_{ij}$, $\hat{C}_{ij}$ denote the corresponding annotated values; $S^2$ is the number of grid cells into which the image is divided; B is the number of bounding boxes in a grid cell; $C_{ij}$ is the confidence score of the j-th bounding box in the i-th grid cell; $p_i(c)$ is the probability that a target of category c exists in the i-th grid cell. $\mathbb{1}^{obj}_{ij}$ takes 1 if the object category detected by the j-th bounding box in the i-th grid cell is the same as that of the pre-annotated bounding box, and 0 otherwise; $\mathbb{1}^{noobj}_{ij}$ takes 0 if the object category detected by the j-th bounding box in the i-th grid cell is the same as that of the pre-annotated bounding box, and 1 otherwise;
if the error degree is greater than or equal to a predetermined threshold, updating the weights using the backpropagation algorithm and the Adam update rule, and inputting unused data from the training set for the next round of training, until the difference between the loss and the minimum value of the loss function is less than a preset threshold.
A target detection device, including:
a first detection module, configured to perform target detection on each frame of a video stream through a first convolutional neural network, obtaining the position of each detected target in the image and the category of each detected target;
a second detection module, configured to perform background-based target detection on the image through a second convolutional neural network, obtaining information in the background that is associated with targets of different categories;
an association module, configured to associate the detected targets with different backgrounds at different moments, based on the information in the background associated with targets of different categories, obtaining the target detection result.
In the above device, preferably, the first detection module is specifically configured to divide the image into n*n grid cells through the first convolutional neural network; predict several bounding boxes in each grid cell and record the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box; compute, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for its category; delete the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold; and perform non-maximum suppression separately on the retained bounding boxes of each category, obtaining the positions and category information of the targets.
In the above device, preferably, the first detection module is specifically configured to divide the image into m*m grid cells according to L different division granularities through the first convolutional neural network, where m takes L different values; for each division granularity, predict several bounding boxes in each grid cell and record the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box; compute, based on the confidence value and class label corresponding to each bounding box in a grid cell, each bounding box's confidence score for its category; delete the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold; and perform non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities, obtaining the positions and category information of the targets.
In the above device, preferably, the association module is specifically configured to associate the detected targets with different backgrounds at different moments, using association relations learned in advance between targets of the same type at different moments and different backgrounds, obtaining the target detection result.
Through the above solution, the target tracking method and device provided by the present application combine two convolutional neural networks with a temporal recurrent neural network model, solving the problem of a low detection rate for small targets. Moreover, information in the background associated with the targets is extracted for target detection, improving the speed and accuracy of the target tracking model in video target detection.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required for the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is an exemplary diagram of the target tracking model provided by an embodiment of the present application;
Fig. 2 is a flowchart of an implementation of the target tracking method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the target detection device provided by an embodiment of the present application.
The terms "first", "second", "third", "fourth", etc. (if present) in the specification, the claims, and the above drawings are used to distinguish similar parts, not to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated herein.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings of those embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is an exemplary diagram of the target tracking model provided by an embodiment of the present application. The target tracking model provided by the present application includes two convolutional neural networks (CNNs) and one temporal recurrent neural network, an LSTM (Long Short-Term Memory) network. Convolutional network 1 comprises the convolutional layers of one of the convolutional neural networks (for ease of distinction, hereinafter the first convolutional neural network), and convolutional network 2 comprises the convolutional layers of the other convolutional neural network (for ease of distinction, hereinafter the second convolutional neural network).
The training process of the target tracking model is described first.
In the embodiment of the present application, the two convolutional neural networks and the temporal recurrent neural network are first trained independently; the results of the separate trainings are then used to construct the initial target tracking model of the present application, and the initial target tracking model is trained to obtain the final target tracking model.
In the embodiment of the present application, the first convolutional neural network is mainly responsible for extracting targets and labelling their categories and positions. The first convolutional neural network includes 24 convolutional layers and 2 fully connected layers. It can be trained on the basis of the YOLO (You Only Look Once) convolutional neural network. Specifically, the weights of the convolutional-layer parameters of the YOLO convolutional neural network are assigned to the convolutional layers of the first convolutional neural network, and the weights of the fully connected layers of the first convolutional neural network are initialized from a Gaussian random distribution (for example, a Gaussian distribution with mean zero and variance 0.01); the first convolutional neural network is then trained end to end on the target detection and classification task, obtaining the initial model of the first convolutional neural network.
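As a minimal sketch of this initialization step, assuming PyTorch modules whose convolutional layers line up in order (the patent does not specify a framework, and `first_cnn`, `yolo` and the shape agreement are assumptions for illustration):

```python
# Sketch: copy YOLO conv weights into the first CNN; Gaussian-init the rest.
import torch.nn as nn

def init_first_cnn(first_cnn: nn.Module, yolo: nn.Module) -> None:
    yolo_convs = [m for m in yolo.modules() if isinstance(m, nn.Conv2d)]
    own_convs = [m for m in first_cnn.modules() if isinstance(m, nn.Conv2d)]
    # Assign the YOLO convolutional-layer weights to the first CNN's conv layers.
    for src, dst in zip(yolo_convs, own_convs):
        dst.weight.data.copy_(src.weight.data)
        if src.bias is not None and dst.bias is not None:
            dst.bias.data.copy_(src.bias.data)
    # Initialize the fully connected layers from N(0, 0.01), i.e. std = 0.1.
    for m in first_cnn.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.1)
            nn.init.zeros_(m.bias)
```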
During training, one way for the first convolutional neural network to perform the target detection and classification task is as follows.
Each frame of the training video is divided into n*n grid cells, where n is a positive integer; in an optional embodiment, n may take the value 7. The position and class label of every target are annotated in each frame of the training video.
Several bounding boxes (generally rectangular boxes used to mark the detected targets) are predicted in each grid cell, and the position and size of each predicted bounding box are recorded together with the confidence value and class label corresponding to each bounding box. The class label characterizes the category of the target in the bounding box, and the confidence value expresses two important pieces of information about the predicted bounding box: the degree of confidence that it contains a target, and the accuracy of the box prediction. The original formula image is not reproduced in this text, but the confidence value consistent with the definitions below (the standard YOLO form) is computed as:

$$ \text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}} $$

In the formula, the value of Pr(Object) depends on whether a target falls inside the bounding box: when a target falls inside the bounding box, Pr(Object) takes the value 1, and otherwise it takes the value 0. IOU (Intersection-over-Union, the ratio of the intersection to the union) is computed between the predicted bounding box and the annotated target bounding box. Whether a target falls inside a bounding box can be judged from the annotated values, and a target falling inside a bounding box covers both cases: the target lies entirely inside the bounding box, or the target lies partly inside the bounding box.
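A small illustrative helper for the formula above, assuming boxes are given as (x, y, w, h) with (x, y) the upper-left corner (the representation is an assumption for illustration, not fixed by the patent):

```python
def iou(box_a, box_b):
    """Intersection-over-Union between two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def confidence(pred_box, annotated_box):
    overlap = iou(pred_box, annotated_box)
    pr_object = 1.0 if overlap > 0 else 0.0  # target falls (even partly) in the box
    return pr_object * overlap
```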
Generally, the position of a bounding box is the coordinate of its upper-left corner, and the size of a bounding box is its length and width.
Based on the confidence value and class label corresponding to each bounding box, the confidence score of each bounding box for the category it belongs to is computed:
the confidence value corresponding to each bounding box is multiplied by its class label, giving the class-specific confidence score of each bounding box, i.e., the confidence score of each bounding box for its category.
The bounding boxes in the grid cell whose confidence score for their category is below a preset score threshold are deleted, and non-maximum suppression is performed on the bounding boxes of the same category among those retained in the grid cell, giving the target detection result of each grid cell.
Each grid cell is processed in the same way, which is not repeated here cell by cell.
In an optional embodiment, the preset score threshold may be 0.6.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed on the bounding boxes of the same category across the whole image, giving the final target detection result.
The process of performing non-maximum suppression on the bounding boxes of the same category among those retained in a grid cell may be:
determine the bounding box with the highest confidence score among the bounding boxes of the same category (for ease of narration, denoted the first bounding box);
compute the overlap ratio between each other bounding box of the same category (for ease of narration, denoted a second bounding box) and the first bounding box; if the overlap ratio is higher than a set value, delete the second bounding box; otherwise, retain the second bounding box.
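A sketch of this per-category suppression, reusing the `iou` helper above; the 0.5 overlap threshold is an assumed example, not a value fixed by the text:

```python
def nms(boxes, scores, overlap_thresh=0.5):
    """boxes: same-category (x, y, w, h) boxes; scores: their confidence scores."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        first = order.pop(0)  # the "first bounding box": highest confidence score
        keep.append(first)
        # Retain only "second bounding boxes" whose overlap with it is low enough.
        order = [k for k in order if iou(boxes[first], boxes[k]) <= overlap_thresh]
    return keep
```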
The error degree of the target detection result of the first convolutional neural network is computed through a preset loss function; the error degree characterizes the error between the predicted values (i.e., the detection result) and the annotated values. The original formula image is not reproduced in this text, but the loss function consistent with the definitions below (the standard YOLO form) is:

$$
\begin{aligned}
Loss ={}& \lambda_1\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2\Big]\\
&+\lambda_1\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\Big[\big(\sqrt{w_{ij}}-\sqrt{\hat{w}_{ij}}\big)^2+\big(\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}}\big)^2\Big]\\
&+\lambda_3\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{obj}_{ij}\big(C_{ij}-\hat{C}_{ij}\big)^2
+\lambda_2\sum_{i=1}^{S^2}\sum_{j=1}^{B}\mathbb{1}^{noobj}_{ij}\big(C_{ij}-\hat{C}_{ij}\big)^2\\
&+\lambda_3\sum_{i=1}^{S^2}\mathbb{1}^{obj}_{i}\sum_{c}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
$$

where Loss is the error degree of the target detection result of the first convolutional neural network; $\lambda_1$ is the loss weight of the coordinate prediction loss and may take the value 5; $\lambda_2$ is the loss weight of the confidence loss of bounding boxes containing no target and may take the value 0.5; $\lambda_3$ is the loss weight of the confidence loss and classification loss of bounding boxes containing a target and may take the value 1; i distinguishes the grid cells and j distinguishes the bounding boxes. $x_{ij}$ and $y_{ij}$ are the predicted coordinates of the j-th bounding box in the i-th grid cell, $w_{ij}$ its predicted width and $h_{ij}$ its predicted height; $\hat{x}_{ij}$ and $\hat{y}_{ij}$ are the annotated coordinates of the j-th bounding box in the i-th grid cell, $\hat{w}_{ij}$ its annotated width and $\hat{h}_{ij}$ its annotated height. $S^2$ is the number of grid cells into which the image is divided, and B is the number of bounding boxes in a grid cell. $C_{ij}$ is the predicted confidence score of the j-th bounding box in the i-th grid cell, and $\hat{C}_{ij}$ is its annotated confidence score. $p_i(c)$ is the predicted probability of a bounding box of category c in the i-th grid cell, and $\hat{p}_i(c)$ is the annotated probability; the probability of a category-c bounding box appearing in the i-th grid cell is the number of category-c bounding boxes in that cell divided by the total number of bounding boxes in that cell.
The value of $\mathbb{1}^{obj}_{ij}$ depends on whether the j-th bounding box in the i-th grid cell contains the specified detection target: if the object category detected by the j-th bounding box in the i-th grid cell is the same as that of the pre-annotated bounding box, $\mathbb{1}^{obj}_{ij}$ takes 1; otherwise it takes 0.
The $\lambda_3$-weighted term over $\mathbb{1}^{obj}_{ij}$ represents the product of the confidence prediction loss of bounding boxes containing a target and its loss weight.
The $\lambda_2$-weighted term over $\mathbb{1}^{noobj}_{ij}$ represents the product of the confidence prediction loss of bounding boxes containing no target and its loss weight; the value of $\mathbb{1}^{noobj}_{ij}$ likewise depends on whether the j-th bounding box in the i-th grid cell contains the specified detection target: if the object category detected by the j-th bounding box in the i-th grid cell is the same as that of the pre-annotated bounding box, $\mathbb{1}^{noobj}_{ij}$ takes 0; otherwise it takes 1.
The last term represents the product of the class prediction loss and its loss weight, gated by whether a target center falls in grid cell i: if a target center falls in grid cell i, $\mathbb{1}^{obj}_{i}$ takes the value 1; otherwise it takes the value 0. c denotes the category.
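An illustrative NumPy transcription of the loss above, under assumed array shapes; it is a sketch of the standard YOLO form the definitions imply, not the patent's reference implementation:

```python
import numpy as np

def detection_loss(pred, truth, obj, noobj, cell_obj, lam1=5.0, lam2=0.5, lam3=1.0):
    """pred/truth: dicts of arrays; x, y, w, h, C have shape (S*S, B), p has (S*S, n_classes).
    obj/noobj: (S*S, B) indicator masks; cell_obj: (S*S,) mask for target centers."""
    coord = lam1 * np.sum(obj * ((pred["x"] - truth["x"]) ** 2
                                 + (pred["y"] - truth["y"]) ** 2))
    size = lam1 * np.sum(obj * ((np.sqrt(pred["w"]) - np.sqrt(truth["w"])) ** 2
                                + (np.sqrt(pred["h"]) - np.sqrt(truth["h"])) ** 2))
    conf_obj = lam3 * np.sum(obj * (pred["C"] - truth["C"]) ** 2)
    conf_noobj = lam2 * np.sum(noobj * (pred["C"] - truth["C"]) ** 2)
    cls = lam3 * np.sum(cell_obj[:, None] * (pred["p"] - truth["p"]) ** 2)
    return coord + size + conf_obj + conf_noobj + cls
```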
So that small targets are detected as well as large ones, and so that the individual losses in the loss function remain balanced, in the embodiment of the present application the coordinate prediction loss is characterized by the Euclidean distance; in this way, when the first convolutional neural network is optimized, only the coordinates are fine-tuned, alleviating the problems of false detections, missed detections, and multiple detections.
If the error degree is greater than or equal to the predetermined threshold, the weights are updated using the backpropagation (BP) algorithm and the Adam update rule, and other data from the database are input for the next round of training, until the error degree is less than the predetermined threshold.
During training, another way for the first convolutional neural network to perform the detection and classification task may be:
The image is divided into m*m grid cells according to L different division granularities, where m takes L different values; in an optional embodiment, L may take the value 4 and m the four values 7, 5, 3 and 1. For each division granularity,
several bounding boxes are predicted in each grid cell, and the position and size of each predicted bounding box are recorded together with the confidence value and class label corresponding to each bounding box;
based on the confidence value and class label corresponding to each bounding box, the confidence score of each bounding box for the category it belongs to is computed;
the bounding boxes in the grid cell whose confidence score for their category is below the predetermined threshold are deleted, and non-maximum suppression is performed separately on the bounding boxes of each category retained in the grid cell, i.e., on the bounding boxes of the same category among those retained, giving the target detection result of each grid cell.
Each grid cell is processed in the same way, which is not repeated here cell by cell.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed separately for each category of bounding boxes across the whole image, i.e., on the bounding boxes of the same category in the whole image, giving the final target detection result.
The error degree of the target detection result of the first convolutional neural network is computed through the preset loss function.
If the error degree is greater than or equal to the predetermined threshold, the weights are updated using the backpropagation (BP) algorithm and the Adam update rule, and other data from the database are input for the next round of training, until the error degree is less than the predetermined threshold.
Target detection and classification under each of the above division granularities follow the process described earlier. That is, when the image is divided into 7*7 grid cells the above target detection process is performed once; when the image is divided into 5*5 grid cells the process is performed once more; and so on, until target detection as above has been carried out under every division granularity. The target detection process under each granularity is not repeated here one by one.
In each round of training, the union of the detection results under all granularities is the final target detection result of that round, as sketched below.
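A sketch of this multi-granularity loop; `detect_at_granularity` stands in for the per-granularity prediction, thresholding and per-category suppression described above, and is an assumption for illustration:

```python
GRANULARITIES = (7, 5, 3, 1)  # the optional embodiment: L = 4, m in {7, 5, 3, 1}

def detect_multi_granularity(image, detect_at_granularity):
    detections = []
    for m in GRANULARITIES:
        # Each call divides the image into m*m grid cells and returns the
        # (box, score, category) triples surviving thresholding and NMS.
        detections.extend(detect_at_granularity(image, m))
    return detections  # the union of the results under all granularities
```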
In the embodiment of the present application, performing target detection and classification at multiple division granularities makes the target detection accuracy higher.
The second convolutional neural network is mainly responsible for extracting information in the background that is associated with targets of different categories. The second convolutional neural network has the same structure as the first convolutional neural network, but the task it performs and its output differ: the task performed by the second convolutional neural network is background-based target type detection, and its output is the information in the background associated with targets of different categories. The second convolutional neural network is optimized using the Softmax function as its loss function, and its parameter update process is the same as that of the first convolutional network.
When the second convolutional neural network is trained, the weights of the convolutional-layer parameters of the trained first convolutional neural network are assigned to the second convolutional neural network, and the weights of the parameters of the fully connected layers of the second convolutional neural network are initialized from a Gaussian random distribution; the second convolutional neural network is trained end to end on the background-based target type detection task, obtaining the second convolutional neural network model. The background-based target type detection can use conventional detection methods.
The weights of the convolutional layers of the second convolutional neural network model are then assigned to the convolutional-layer parameters of the first convolutional neural network model, and the first convolutional neural network model and the second convolutional neural network model are trained again by the foregoing method; this cycle is repeated twice (three rounds of training in total), obtaining the final first convolutional neural network model and second convolutional neural network model.
In the embodiment of the present application, the joint training of the first and second convolutional neural networks improves the computation speed during training.
It follows from the training processes of the two convolutional neural networks above that the convolutional-layer parameters of the first and second convolutional neural networks are identical. To reduce computation time, the first and second convolutional neural networks can therefore share their convolutional-layer parameters, which also reduces the storage space occupied; a sketch of such sharing follows.
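A sketch of the shared convolutional layers and of the alternating training rounds, assuming PyTorch; the module names and the two task-training routines are assumptions for illustration:

```python
import torch.nn as nn

class SharedBackboneDetectors(nn.Module):
    """The two CNNs share one set of convolutional layers, saving memory."""
    def __init__(self, backbone: nn.Module, head1: nn.Module, head2: nn.Module):
        super().__init__()
        self.backbone = backbone  # shared convolutional layers
        self.head1 = head1        # first CNN: target detection and classification
        self.head2 = head2        # second CNN: background-based target type detection

    def forward(self, x):
        feats = self.backbone(x)
        return self.head1(feats), self.head2(feats)

def alternate_train(model, train_detection, train_background, rounds=3):
    # Three rounds in total: train task 1, then task 2; the shared backbone
    # plays the role of the weight hand-off between the two networks.
    for _ in range(rounds):
        train_detection(model)
        train_background(model)
```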
The temporal recurrent neural network is mainly used to associate the detected targets with different backgrounds at different moments, improving the target detection accuracy in video.
In the embodiment of the present application, the temporal recurrent neural network is trained on a training set containing two classes of videos. The numbers of first-class and second-class videos are equal, the durations of the first-class and second-class videos are identical, and the variation amplitude of the targets in the first-class videos is greater than that of the targets in the second-class videos. A large variation amplitude of a target may mean that the target appears suddenly, disappears suddenly, or changes greatly in posture or other aspects of appearance; a small variation amplitude may mean that the target changes slowly, never appears or disappears suddenly, and changes little in posture.
The temporal recurrent neural network analyses the association relations of the same target with different backgrounds at different moments in each video, and obtains by machine learning the association relations between targets of the same type at different moments and different backgrounds.
During training, the weights are updated according to the backpropagation-through-time algorithm and the Adam update rule.
The respective training processes of the convolutional neural networks and the temporal recurrent neural network have been described above. The process of training the target tracking model composed of the trained convolutional neural networks and the temporal recurrent neural network is described next.
The initial target tracking model is constructed from the two trained convolutional neural network models and the temporal recurrent neural network model: all convolutional layers of the first convolutional neural network model are connected to the temporal recurrent neural network model through the first fully connected layer; at least part of the convolutional layers of the second convolutional neural network model are connected to the temporal recurrent neural network model through the second fully connected layer; and the output of the temporal recurrent neural network model is connected both to the inputs of the two fully connected layers above and to the input of a third fully connected layer.
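A PyTorch sketch of this assembly under assumed dimensions (the patent fixes the topology but not the sizes; `hidden`, `out_dim` and the use of `LSTMCell` are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TrackingModel(nn.Module):
    def __init__(self, conv1, conv2, feat1_dim, feat2_dim, hidden=512, out_dim=1470):
        super().__init__()
        self.conv1, self.conv2 = conv1, conv2            # conv layers of the two CNNs
        self.fc1 = nn.Linear(feat1_dim + hidden, hidden)  # first fully connected layer
        self.fc2 = nn.Linear(feat2_dim + hidden, hidden)  # second fully connected layer
        self.lstm = nn.LSTMCell(2 * hidden, hidden)       # temporal recurrent network
        self.fc3 = nn.Linear(hidden, out_dim)             # third fully connected layer

    def forward(self, frames):  # frames: (batch, time, channels, height, width)
        h = c = frames.new_zeros(frames.size(0), self.lstm.hidden_size)
        for t in range(frames.size(1)):
            x = frames[:, t]
            f1 = torch.flatten(self.conv1(x), 1)
            f2 = torch.flatten(self.conv2(x), 1)
            # The LSTM output is fed back into the inputs of FC1 and FC2.
            a = torch.relu(self.fc1(torch.cat([f1, h], dim=1)))
            b = torch.relu(self.fc2(torch.cat([f2, h], dim=1)))
            h, c = self.lstm(torch.cat([a, b], dim=1), (h, c))
        return self.fc3(h)  # output through the third fully connected layer
```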
The initial target tracking model is trained on the preset target detection task, obtaining the target tracking model.
The aforementioned preset target detection task may be:
the first convolutional neural network performs target detection on the image, obtaining the position of each detected target in the image and the category of each detected target;
the second convolutional neural network performs background-based target detection on the image, obtaining information in the background associated with targets of different categories;
the temporal recurrent neural network, based on the information in the background associated with targets of different categories, associates the detected targets with different backgrounds at different moments, obtaining the target detection result, and the target detection result is output through the third fully connected layer.
In a preferred embodiment, after the temporal recurrent neural network obtains the target detection result it does not output the result immediately, but feeds the target detection result back to the convolutional neural networks, specifically to the fully connected layers of the convolutional neural networks. The preceding fully connected layer randomly samples from the data output by the convolutional network and the data fed back by the LSTM; the randomly sampled values are then processed by the temporal recurrent neural network to obtain the final target detection result, which is output through the last fully connected layer. In the embodiment of the present application, this feedback mechanism improves the target detection precision.
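A sketch of the random selection in this feedback step; the element-wise Bernoulli mixing rule is an assumption, since the text only says the two sources are randomly sampled:

```python
import torch

def mix_feedback(conv_out: torch.Tensor, lstm_feedback: torch.Tensor) -> torch.Tensor:
    # Pick each element at random from either the convolutional-network output
    # or the detection result fed back by the LSTM (both of the same shape).
    mask = torch.rand_like(conv_out) < 0.5
    return torch.where(mask, conv_out, lstm_feedback)
```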
During the training of the target tracking model, the weights of the parameters in the convolutional neural networks are updated using the backpropagation (BP) algorithm and the Adam update rule, and the weights of the parameters in the temporal recurrent neural network are updated using the backpropagation-through-time algorithm and the Adam update rule.
In an optional embodiment, the process by which the first convolutional neural network performs target detection on an image may include:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below the predetermined threshold, and performing non-maximum suppression on the bounding boxes of the same category among those retained in the grid cell, obtaining the positions and category information of the targets in each grid cell.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed on the bounding boxes of the same category across the whole image, giving the final target detection result.
In an optional embodiment, the process by which the first convolutional neural network performs target detection on an image may include:
dividing the image into m*m grid cells according to L different division granularities, where m takes L different values;
for each division granularity, predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below the predetermined threshold, and performing non-maximum suppression on the bounding boxes of the same category among those retained in the grid cell, obtaining the positions and category information of the targets.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed on the bounding boxes of the same category across the whole image, giving the final target detection result.
Target detection is carried out by the above method under each division granularity.
Once the target tracking model has been trained, target detection can be carried out using the target tracking model.
Referring to Fig. 2, which is a flowchart of an implementation of the target tracking method provided by an embodiment of the present application, the method may include:
Step S21: the first convolutional neural network performs target detection on the image, obtaining the position of each detected target in the image and the category of each detected target;
Step S22: the second convolutional neural network performs background-based target detection on the image, obtaining information in the background associated with targets of different categories;
Step S23: the temporal recurrent neural network, based on the information in the background associated with targets of different categories, associates the detected targets with different backgrounds at different moments, obtaining the target detection result.
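A high-level sketch of steps S21 to S23 over a video stream; `cnn1`, `cnn2` and `lstm_associate` stand in for the three trained components, and the result structure is an assumption for illustration:

```python
def track(video_frames, cnn1, cnn2, lstm_associate):
    results = []
    state = None  # recurrent state carried across moments
    for frame in video_frames:
        targets = cnn1(frame)     # S21: positions and categories of targets
        background = cnn2(frame)  # S22: background info tied to target categories
        # S23: associate the targets with the backgrounds seen at different moments.
        result, state = lstm_associate(targets, background, state)
        results.append(result)
    return results
```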
Here, the process by which the first convolutional neural network performs target detection on an image may include:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below the predetermined threshold, and performing non-maximum suppression on the bounding boxes of the same category among those retained in the grid cell, obtaining the positions and category information of the targets in each grid cell.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed on the bounding boxes of the same category across the whole image, giving the final target detection result.
In another optional embodiment, the process by which the first convolutional neural network performs target detection on an image may include:
dividing the image into m*m grid cells according to L different division granularities, where m takes L different values; in an optional embodiment, L may take the value 4 and m the four values 7, 5, 3 and 1; for each division granularity,
predicting several bounding boxes in each grid cell, and recording the position and size of each predicted bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below the predetermined threshold, and performing non-maximum suppression on the bounding boxes of the same category among those retained in the grid cell, obtaining the positions and category information of the targets in each grid cell.
After the target detection results of the individual grid cells are obtained, non-maximum suppression is performed on the bounding boxes of the same category across the whole image, giving the final target detection result.
Under each division granularity the target detection process is the same, and is not repeated here one by one.
In an optional embodiment, the temporal recurrent neural network, based on the information in the background associated with targets of different categories, associating the detected targets with different backgrounds at different moments to obtain the target detection result, may include:
the temporal recurrent neural network, using association relations learned in advance between targets of the same type at different moments and different backgrounds, associates the detected targets with different backgrounds at different moments, obtaining the target detection result.
Corresponding to the method embodiments, the present application also provides a target detection device. Fig. 3 shows a schematic diagram of the target detection device provided by an embodiment of the present application, which may include:
a first detection module 31, a second detection module 32, and an association module 33, where:
the first detection module 31 is configured to perform target detection on each frame of the video stream through the first convolutional neural network, obtaining the position of each detected target in the image and the category of each detected target;
the second detection module 32 is configured to perform background-based target detection on the image through the second convolutional neural network, obtaining information in the background associated with targets of different categories;
the association module 33 is configured to associate the detected targets with different backgrounds at different moments, based on the information in the background associated with targets of different categories, obtaining the target detection result.
The target detection device provided by the present application combines two convolutional neural networks with a temporal recurrent neural network model, solving the problem of a low detection rate for small targets. Moreover, information in the background associated with the targets is extracted for target detection, improving the speed and accuracy of the target tracking model in video target detection.
In an optional embodiment, the first detection module 31 may specifically be configured to divide the image into n*n grid cells through the first convolutional neural network; predict several bounding boxes in each grid cell and record the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box; compute, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for its category; delete the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold; and perform non-maximum suppression separately on the retained bounding boxes of each category, obtaining the positions and category information of the targets.
In another optional embodiment, the first detection module 31 may specifically be configured to divide the image into m*m grid cells according to L different division granularities through the first convolutional neural network, where m takes L different values; for each division granularity, predict several bounding boxes in each grid cell and record the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box; compute, based on the confidence value and class label corresponding to each bounding box in a grid cell, each bounding box's confidence score for its category; delete the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold; and perform non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities, obtaining the positions and category information of the targets.
In an optional embodiment, the association module 33 may specifically be configured to associate the detected targets with different backgrounds at different moments, using association relations learned in advance between targets of the same type at different moments and different backgrounds, obtaining the target detection result.
In an optional embodiment, the target detection device may further include:
a training module for training the target tracking model, specifically configured to assign the weights of the convolutional-layer parameters of a YOLO convolutional neural network to the first convolutional neural network and initialize the weights of the remaining parameters of the first convolutional neural network from a Gaussian random distribution, and to train the first convolutional neural network end to end on the target detection and classification task, obtaining the first convolutional neural network model;
to assign the weights of the convolutional-layer parameters of the first convolutional neural network to the second convolutional neural network and initialize the weights of the remaining parameters of the second convolutional neural network from a Gaussian random distribution, and to train the second convolutional neural network end to end on the background-based target type detection task, obtaining the second convolutional neural network model;
to assign the weights of the convolutional layers of the second convolutional neural network model to the convolutional layers of the first convolutional neural network model and train again according to the above steps, repeating this cycle twice to obtain the final first convolutional neural network model and second convolutional neural network model;
to train the temporal recurrent neural network, on a video training set selected in advance, on the task of associating targets of the same type at different moments with different backgrounds, obtaining the temporal recurrent neural network model, where the video training set includes equal numbers of first-class videos and second-class videos, the first-class and second-class videos have the same duration, and the variation amplitude of the targets in the first-class videos is greater than that of the targets in the second-class videos;
to construct the initial target tracking model by connecting all convolutional layers of the first convolutional neural network model to the temporal recurrent neural network model through the first fully connected layer, connecting at least a part of the convolutional layers of the second convolutional neural network model (for example, all of them, or the first 12 layers) to the temporal recurrent neural network model through the second fully connected layer, and connecting the output of the temporal recurrent neural network model to the inputs of the first and second fully connected layers and to the input of the third fully connected layer; and
to train the initial target tracking model on the preset target detection task, obtaining the target tracking model.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For another thing, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
It should be understood that in the embodiments of the present application, the dependent claims, the embodiments, and the features may be combined with one another, and the aforementioned technical problems can thereby be solved.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A target tracking method, characterized in that target detection is performed on each frame of a video stream by a pre-trained target tracking model, including:
a first convolutional neural network in the target tracking model performs target detection on the image, obtaining the position of each detected target in the image and the category of each detected target;
a second convolutional neural network in the target tracking model performs background-based target detection on the image, obtaining information in the background associated with targets of different categories;
a temporal recurrent neural network in the target tracking model, based on the information in the background associated with targets of different categories, associates the detected targets with different backgrounds at different moments, obtaining the target detection result.
2. The method according to claim 1, characterized in that the process by which the first convolutional neural network performs target detection on an image includes:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold, and performing non-maximum suppression separately on the retained bounding boxes of each category, obtaining the positions and category information of the targets.
3. The method according to claim 1, characterized in that the process by which the first convolutional neural network performs target detection on an image includes:
dividing the image into m*m grid cells according to L different division granularities, where m takes L different values;
for each division granularity, predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box together with the confidence value and class label corresponding to each bounding box;
computing, based on the confidence value and class label corresponding to each bounding box in a grid cell, each bounding box's confidence score for the category it belongs to;
deleting the bounding boxes in the grid cell whose confidence score for their category is below a predetermined threshold, and performing non-maximum suppression separately on the bounding boxes of each category retained under the different division granularities, obtaining the positions and category information of the targets.
4. The method according to claim 1, characterized in that the temporal recurrent neural network associating the detected targets with different backgrounds at different moments, based on the information in the background associated with targets of different categories, to obtain target detection results, comprises:
associating, by the temporal recurrent neural network and through pre-learned association relationships between same-type targets at different moments and different backgrounds, the detected targets with different backgrounds at different moments, to obtain target detection results.
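A hedged sketch of the temporal association in claim 4, modeled here as an LSTM over per-moment target-plus-background features. The feature dimension, hidden size, class count, and output head are illustrative assumptions; the patent specifies only a temporal recurrent neural network.

```python
import torch
import torch.nn as nn

class TemporalAssociator(nn.Module):
    """Sketch of claim 4: a recurrent network that applies pre-learned
    associations between same-type targets and backgrounds over time."""
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes + 4)  # class scores + box

    def forward(self, feats, hidden=None):
        # feats: (batch, time, feat_dim) -- per-moment concatenation of
        # detected-target features and background-association features
        out, hidden = self.lstm(feats, hidden)
        return self.head(out), hidden  # per-moment detection results
```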
5. The method according to claim 1, characterized in that the training process of the target tracking model comprises:
assigning the weights of the convolutional-layer parameters of a YOLO convolutional neural network to the first convolutional neural network, and initializing the weights of the other parameters of the first convolutional neural network from a Gaussian random distribution; training the first convolutional neural network end to end on a target detection and classification task to obtain a first convolutional neural network model;
assigning the weights of the convolutional-layer parameters of the first convolutional neural network to the second convolutional neural network, and initializing the weights of the other parameters of the second convolutional neural network from a Gaussian random distribution; training the second convolutional neural network end to end on a background-based target type detection task to obtain a second convolutional neural network model;
assigning the weights of the convolutional layers of the second convolutional neural network model to the convolutional layers of the first convolutional neural network model, training again according to the above steps, and repeating this cycle twice, to obtain the final first convolutional neural network model and second convolutional neural network model;
training the temporal recurrent neural network, using a pre-selected video training set, on the task of associating same-type targets with different backgrounds at different moments, to obtain a temporal recurrent neural network model; the video training set containing equal numbers of first-class videos and second-class videos, the first-class videos and the second-class videos having the same duration, and the variation amplitude of the targets in the first-class videos being greater than that of the targets in the second-class videos;
constructing an initial target tracking model: connecting all convolutional layers of the first convolutional neural network model to the temporal recurrent neural network model through a first fully connected layer, connecting at least a portion of the convolutional layers of the second convolutional neural network model to the temporal recurrent neural network model through a second fully connected layer, and connecting the output of the temporal recurrent neural network model to the inputs of the first fully connected layer and the second fully connected layer, and to a third fully connected layer; and
training the initial target tracking model on a preset target detection task to obtain the target tracking model.
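As an informal illustration of the weight-transfer and initialization steps in this training procedure, the sketch below copies convolutional weights between two networks and draws the remaining (fully connected) weights from a Gaussian distribution. The standard deviation sigma=0.01 is an assumed value; the patent specifies only a Gaussian random distribution, and matching convolutional-layer shapes between the two networks are assumed.

```python
import torch.nn as nn

def transfer_conv_weights_and_init(src: nn.Module, dst: nn.Module, sigma=0.01):
    """Copy conv-layer weights from a trained source network, then
    Gaussian-initialize all remaining parameters of the destination."""
    src_convs = [m for m in src.modules() if isinstance(m, nn.Conv2d)]
    dst_convs = [m for m in dst.modules() if isinstance(m, nn.Conv2d)]
    for s, d in zip(src_convs, dst_convs):
        d.weight.data.copy_(s.weight.data)          # assign trained conv weights
        if s.bias is not None and d.bias is not None:
            d.bias.data.copy_(s.bias.data)
    for m in dst.modules():
        if isinstance(m, nn.Linear):                # remaining parameters:
            nn.init.normal_(m.weight, mean=0.0, std=sigma)  # Gaussian random init
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```

The alternating cycle in the claim would then call this function in both directions (first network to second, second back to first) between training rounds.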
6. The method according to claim 5, characterized in that training the first convolutional neural network end to end on the target detection and classification task comprises: the first convolutional neural network performing target detection and classification in the following manner:
dividing the image into n*n grid cells;
predicting several bounding boxes in each grid cell, and recording the position and size of each bounding box, as well as the confidence value and class label corresponding to each bounding box;
calculating, based on the confidence value and class label corresponding to each bounding box, a confidence score of each bounding box for its category;
deleting the bounding boxes in the grid cells whose confidence scores for their categories are below a preset threshold, and performing non-maximum suppression separately on the bounding boxes of different categories retained across all grid cells, to obtain target detection results; and
calculating an error degree of the target detection results of the first convolutional neural network by a preset loss function, the loss function being:
$$
\begin{aligned}
Loss ={} & \lambda_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{obj} \left[ \left( x_{ij} - \hat{x}_{ij} \right)^2 + \left( y_{ij} - \hat{y}_{ij} \right)^2 \right]^{1/2} + \lambda_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{obj} \left[ \left( \sqrt{w_{ij}} - \sqrt{\hat{w}_{ij}} \right)^2 + \left( \sqrt{h_{ij}} - \sqrt{\hat{h}_{ij}} \right)^2 \right] \\
& + \lambda_3 \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{obj} \left( C_{ij} - \hat{C}_{ij} \right)^2 + \lambda_2 \sum_{i=0}^{S^2} \sum_{j=0}^{B} l_{ij}^{noobj} \left( C_{ij} - \hat{C}_{ij} \right)^2 + \lambda_3 \sum_{i=0}^{S^2} l_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$
wherein Loss is the error degree of the target detection results of the first convolutional neural network; $\lambda_1$ is the loss weight of the coordinate prediction loss, and may take the value 5; $\lambda_2$ is the loss weight of the confidence loss of bounding boxes containing no target, and may take the value 0.5; $\lambda_3$ is the loss weight of the confidence loss and the classification loss of bounding boxes containing a target, and may take the value 1; i distinguishes different grid cells and j distinguishes different bounding boxes; $x_{ij}$, $y_{ij}$, $w_{ij}$, $h_{ij}$, $C_{ij}$ denote predicted values, and $\hat{x}_{ij}$, $\hat{y}_{ij}$, $\hat{w}_{ij}$, $\hat{h}_{ij}$, $\hat{C}_{ij}$ denote calibrated (ground-truth) values; $S^2$ denotes the number of grid cells into which the image is divided, and B denotes the number of bounding boxes in a grid cell; $C_{ij}$ denotes the confidence score of the j-th bounding box in the i-th grid cell, and $p_i(c)$ denotes the probability that a target of category c exists in the i-th grid cell; if the object category detected by the j-th bounding box in the i-th grid cell is the same as that of the pre-calibrated bounding box, $l_{ij}^{obj}$ takes the value 1, otherwise it takes 0; if they are the same, $l_{ij}^{noobj}$ takes the value 0, otherwise it takes 1; and
if the error degree is greater than or equal to a preset threshold, updating the weights using the back-propagation algorithm and the Adam update method, and inputting unused data from the training library for the next round of training, until the difference between the error degree and the minimum value of the loss function is less than a preset limit.
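A hedged PyTorch sketch of this loss under stated assumptions: predicted and calibrated values are packed as (x, y, w, h, C) tensors of shape (S², B, 5), class probabilities have shape (S², num_classes), and the three masks encode $l_{ij}^{obj}$, $l_{ij}^{noobj}$, and $l_i^{obj}$ from the formula above. Tensor layout and argument names are illustrative only.

```python
import torch

def detection_loss(pred, truth, p, p_hat, obj_mask, noobj_mask, cell_obj_mask,
                   lambda1=5.0, lambda2=0.5, lambda3=1.0):
    """pred, truth: (S*S, B, 5) packing (x, y, w, h, C); p, p_hat: (S*S, K);
    obj_mask, noobj_mask: (S*S, B); cell_obj_mask: (S*S, 1)."""
    x, y, w, h, C = pred.unbind(-1)
    xh, yh, wh, hh, Ch = truth.unbind(-1)
    # coordinate term: Euclidean distance between predicted and calibrated centers
    coord = lambda1 * (obj_mask * ((x - xh)**2 + (y - yh)**2).sqrt()).sum()
    # size term: square-rooted widths/heights, as in the claim-6 formula
    size = lambda1 * (obj_mask * ((w.sqrt() - wh.sqrt())**2
                                  + (h.sqrt() - hh.sqrt())**2)).sum()
    conf_obj = lambda3 * (obj_mask * (C - Ch)**2).sum()      # boxes containing a target
    conf_noobj = lambda2 * (noobj_mask * (C - Ch)**2).sum()  # boxes containing no target
    cls = lambda3 * (cell_obj_mask * (p - p_hat)**2).sum()   # per-cell classification term
    return coord + size + conf_obj + conf_noobj + cls
```

The returned scalar can be compared against the preset threshold and minimized with back-propagation and the Adam optimizer, as the claim describes.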
7. A target detection device, characterized by comprising:
a first detection module, configured to perform target detection on each frame of image in a video stream through a first convolutional neural network, to obtain positions of detected targets in the image and categories of the detected targets;
a second detection module, configured to perform background-based target detection on the image through a second convolutional neural network, to obtain information in the background associated with targets of different categories; and
an association module, configured to associate the detected targets with different backgrounds at different moments based on the information in the background associated with targets of different categories, to obtain target detection results.
8. The device according to claim 7, characterized in that the first detection module is specifically configured to: divide the image into n*n grid cells through the first convolutional neural network; predict several bounding boxes in each grid cell, and record the position and size of each bounding box, as well as the confidence value and class label corresponding to each bounding box; calculate, based on the confidence value and class label corresponding to each bounding box, a confidence score of each bounding box for its category; and delete the bounding boxes in the grid cells whose confidence scores for their categories are below a preset threshold, and perform non-maximum suppression separately on the retained bounding boxes of different categories, to obtain positions and category information of the targets.
9. The device according to claim 7, characterized in that the first detection module is specifically configured to: divide the image into m*m grid cells through the first convolutional neural network according to L different division granularities, m taking L different values; for each division granularity, predict several bounding boxes in each grid cell, and record the position and size of each bounding box, as well as the confidence value and class label corresponding to each bounding box; calculate, based on the confidence value and class label corresponding to each bounding box in the grid cells, a confidence score of each bounding box for its category; and delete the bounding boxes in the grid cells whose confidence scores for their categories are below a preset threshold, and perform non-maximum suppression separately on the bounding boxes of different categories retained under the different division granularities, to obtain positions and category information of the targets.
10. The device according to claim 7, characterized in that the association module is specifically configured to:
associate, through pre-learned association relationships between same-type targets at different moments and different backgrounds, the detected targets with different backgrounds at different moments, to obtain target detection results.
CN201710920018.7A 2017-09-30 2017-09-30 Target tracking method and device Active CN107808122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710920018.7A CN107808122B (en) 2017-09-30 2017-09-30 Target tracking method and device


Publications (2)

Publication Number Publication Date
CN107808122A true CN107808122A (en) 2018-03-16
CN107808122B CN107808122B (en) 2020-08-11

Family

ID=61584759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710920018.7A Active CN107808122B (en) 2017-09-30 2017-09-30 Target tracking method and device

Country Status (1)

Country Link
CN (1) CN107808122B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant