CN113642490B - Target detection and tracking integrated method and device - Google Patents

Target detection and tracking integrated method and device

Info

Publication number
CN113642490B
CN113642490B
Authority
CN
China
Prior art keywords
prediction frame
target
target prediction
correct
detection score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110956341.6A
Other languages
Chinese (zh)
Other versions
CN113642490A (en)
Inventor
徐志通
陈书楷
杨奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Entropy Technology Co ltd
Original Assignee
Xiamen Entropy Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Entropy Technology Co ltd filed Critical Xiamen Entropy Technology Co ltd
Priority to CN202110956341.6A
Publication of CN113642490A
Application granted
Publication of CN113642490B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an integrated target detection and tracking method and device. A target detection network first performs preliminary detection on an acquired image to be detected to obtain a target prediction frame and its detection score. Because the obtained target prediction frame is not necessarily correct at this point, the application further judges it: whether the target prediction frame is correct is determined based on the magnitude relation between its detection score and a preset detection score threshold, together with its similarity to prediction frames already determined to be correct. If the target prediction frame is judged to be correct, a target tracking network is called and the tracking information of the correct target prediction frame is output, together with the correct target prediction frame and its corresponding detection score, so that the accuracy of target detection is improved.

Description

Target detection and tracking integrated method and device
Technical Field
The application relates to the technical field of computer vision, in particular to an integrated target detection and tracking method and device.
Background
Target detection and tracking is a research direction that receives wide attention in the technical field of computer vision, and is the basis of applications such as video monitoring, human-computer interaction and robot vision navigation. For example, a channel-gate-based identity verification system typically detects and tracks pedestrians through video surveillance so as to verify their identities.
Therefore, whether head targets passing through the channel gate can be correctly detected and tracked is particularly important for follow-up services of the channel gate identity verification system, such as detecting channel tailgating, detecting reverse passage, and counting the number of pedestrians passing through the channel.
With the application of neural network algorithms in the technical field of computer vision, many deep-learning-based target detection algorithms have appeared. However, existing target detection algorithms commonly suffer from missed detections and false detections, both of which have a large influence on the follow-up services of the channel gate identity verification system.
Disclosure of Invention
In view of the above problems, the present application proposes an integrated target detection and tracking scheme to improve the target detection accuracy. The specific scheme is as follows:
an integrated target detection and tracking method, comprising:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame determined to be correct;
If the target prediction frame is correct, a target tracking network is called to obtain tracking information of the target prediction frame;
and outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
Optionally, based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct, determining whether the target prediction frame is correct includes:
comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result;
calculating the similarity of the target prediction frame and the prediction frame determined to be correct based on the position relation of the target prediction frame and the prediction frame determined to be correct;
comparing the similarity with a preset similarity threshold value to obtain a second comparison result;
and judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Optionally, calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the position relationship between the target prediction frame and the prediction frame determined to be correct includes:
Calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
calculating distance parameters of the target prediction frame and the prediction frame determined to be correct;
and calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Optionally, calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct includes:
calculating the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, so as to obtain the center point distance;
calculating the diagonal distance between the target prediction frame and the minimum circumscribed frame of the prediction frame determined to be correct;
dividing the center point distance by the diagonal distance of the minimum circumscribed frame to obtain the distance parameter.
Optionally, according to the intersection ratio and the distance parameter, calculating to obtain the similarity between the target prediction frame and the prediction frame determined to be correct, where the calculation formula includes:
l_DIOU = Area(B ∩ B̂) / Area(B ∪ B̂) − E²(o₁, o₂) / δ²(a, d₁)
wherein l_DIOU represents the similarity of the target prediction frame to the prediction frame determined to be correct, Area(B ∩ B̂) represents the intersection area of the target prediction frame and the prediction frame determined to be correct, Area(B ∪ B̂) represents the union area of the target prediction frame and the prediction frame determined to be correct, E²(o₁, o₂) represents the squared distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and δ²(a, d₁) represents the squared diagonal distance of the minimum circumscribed frame enclosing the target prediction frame and the prediction frame determined to be correct.
Optionally, the first comparison result includes a magnitude relation between the detection score of the target prediction frame and a first detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold;
judging whether the target prediction frame is correct according to the first comparison result and the second comparison result, including:
if the first comparison result is that the detection score of the target prediction frame is larger than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Optionally, the first comparison result includes a magnitude relation between the detection score of the target prediction frame and a first detection score threshold and a second detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold;
judging whether the target prediction frame is correct according to the first comparison result and the second comparison result, including:
if the detection score of the target prediction frame is smaller than a first detection score threshold, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold but smaller than a second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold, judging that the target prediction frame is wrong;
if the detection score of the target prediction frame is greater than or equal to the second detection score threshold, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold but less than the second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is greater than the preset similarity threshold, judging that the target prediction frame is correct;
Wherein the second detection score threshold is greater than the first detection score threshold.
Optionally, the calling the target tracking network to obtain tracking information of the target prediction frame includes:
among the prediction frames determined to be correct, determining a prediction frame with the maximum similarity with the target prediction frame;
and associating the tracking information corresponding to the prediction frame with the maximum similarity of the target prediction frame with the target prediction frame.
Optionally, the target detection network and the target tracking network form a joint model for unified training, and the training process includes:
inputting a training image into the joint model to obtain a target prediction frame, a detection score of the target prediction frame and target tracking information which are output by the model, wherein the training image is marked with a target position frame and tracking information identification;
determining a first loss value based on the target prediction frame and the target position frame, determining a second loss value based on a detection score of the target prediction frame, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating an overall loss value using the first loss value, the second loss value, and the third loss value;
Parameters are updated for the joint model based on the overall loss value.
An integrated target detection and tracking device, comprising:
the image acquisition unit is used for acquiring an image to be detected;
the target detection unit is used for inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
the judging unit is used for judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame determined to be correct;
the target tracking unit is used for calling a target tracking network to obtain tracking information of the target prediction frame;
and the result acquisition unit is used for outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
By means of the above technical scheme, the application performs preliminary detection on the acquired image to be detected through the target detection network to obtain a target prediction frame and its detection score. Because the obtained target prediction frame is not necessarily correct at this point, the application further judges it: whether the target prediction frame is correct is determined based on the magnitude relation between its detection score and the preset detection score threshold, together with its similarity to the prediction frames determined to be correct. If it is correct, the target tracking network is called and the tracking information of the correct target prediction frame is output, together with the correct target prediction frame and its corresponding detection score, so that the accuracy of target detection is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a schematic flow chart of an integrated target detection and tracking method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a calculation method of similarity of prediction frames according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an example of determining whether a target prediction frame is correct according to an embodiment of the present application;
FIG. 4 is a framework diagram of the joint small model provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an integrated target detection and tracking device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The application considers that whether head targets passing through the channel can be correctly detected and tracked is particularly important for the logic of follow-up services of the channel gate identity verification system, such as channel tailgating detection, reverse passage detection and head counting. The target detection and tracking integrated scheme of the application improves the accuracy of target detection through regression suppression of the target prediction frame and through calling the tracking network.
Next, as described in connection with fig. 1, fig. 1 is a schematic flow chart of an integrated target detection and tracking method provided in an embodiment of the present application, where the integrated target detection and tracking method may include the following steps:
step S100, an image to be detected is acquired.
Specifically, the acquired image to be detected may be a video frame image obtained by preprocessing video data collected by a monitoring camera. In the embodiment of the application, video frames containing a moving object can be obtained by performing a difference operation on adjacent frame images in the video image sequence: pixel difference values between adjacent frames are calculated for consecutive frame images in the video data, so that video frames without a moving object can be preliminarily removed. It can be understood that a video frame without a moving object may refer to a video frame image without a human body, non-moving objects being objects that remain fixed within the same shooting range.
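As a minimal illustration of the frame-differencing idea described above (not part of the patent disclosure; the function name and the two thresholds are assumptions), adjacent frames can be differenced and frames with negligible change discarded:

```python
import cv2
import numpy as np

def has_moving_object(prev_frame, cur_frame, pixel_thresh=25, ratio_thresh=0.002):
    """Return True if the pixel difference between two adjacent frames suggests motion.

    pixel_thresh and ratio_thresh are illustrative values, not taken from the application.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, cur_gray)
    moving_ratio = np.count_nonzero(diff > pixel_thresh) / diff.size
    return moving_ratio > ratio_thresh
```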
On the other hand, the resolution of the image to be detected affects subsequent inference, in particular the speed and accuracy of inference. In the embodiment of the application, the obtained original video frame image may therefore be compressed. Typically, the obtained video frame image has a resolution of 1920×1080; after proportional scaling, the short side is padded to obtain the image to be detected. Weighing inference accuracy against inference speed and based on the results of multiple experiments, in some embodiments of the application the original video frame image is proportionally compressed to a non-square image of 416×234, and the short side of the compressed video frame image is padded with black to form a 416×416 square image.
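A minimal sketch of this preprocessing is given below. It assumes an OpenCV/NumPy environment; where exactly the black padding is placed is an assumption, since the application only states that the short side is padded.

```python
import cv2
import numpy as np

def resize_and_pad(frame, target=416):
    """Proportionally scale a frame so its long side equals `target`, then pad the short
    side with black to a target x target square (e.g. 1920x1080 -> 416x234 -> 416x416)."""
    h, w = frame.shape[:2]
    scale = target / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(frame, (new_w, new_h))
    canvas = np.zeros((target, target, 3), dtype=frame.dtype)  # black square canvas
    canvas[:new_h, :new_w] = resized                            # remaining area stays black
    return canvas
```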
Step S110, inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof.
Specifically, the image to be detected obtained in step S100 is input into a preset target detection network to obtain a target prediction frame and its detection score. Taking head detection in an intelligent channel as an example, pedestrians passing through the intelligent channel are detected: the image to be detected is input into a head target detection network, and head prediction frames and their detection scores are obtained.
However, the head prediction frame obtained in this step may be an erroneous prediction frame, i.e. the target detection network may mistakenly detect a backpack or another object similar to a head as a head. If the detection result of the target detection network were output directly in this step, without further regression constraint on the detection result, the probability of false detection would be high. Therefore, in order to improve the accuracy of target detection, the applicant further constrains the target prediction frame output by the target detection network, so as to reduce the probability of false detection to a certain extent.
Step S120, judging whether the target prediction frame is correct, if so, executing the following step S130.
Specifically, based on the magnitude relation between the detection score of the target prediction frame and the preset detection score threshold value and the similarity between the target prediction frame and the prediction frame determined to be correct, whether the target prediction frame is correct is further judged.
It can be understood that the preset detection score threshold can be adjusted according to actual needs and test results. The preset detection score threshold can roughly determine whether a target prediction frame is a false head prediction frame and filter out falsely detected head prediction frames.
Meanwhile, the similarity between the target prediction frame and the prediction frames determined to be correct is combined to judge more accurately whether the target prediction frame is correct. It should be noted that, in some embodiments of the application, a prediction frame determined to be correct can be understood as a target prediction frame that has already been cached locally. During inference of the target detection network, target prediction frames judged to be correct are cached in a specific cache container for comparison with the target prediction frames of subsequent video frames. The similarity between a target prediction frame of a subsequent video frame and the prediction frames determined to be correct can then be used to decide whether that target prediction frame corresponds to a correct target that has been detected previously.
Step S130, calling a target tracking network.
Specifically, step S120 determines whether the target prediction frame output by the target detection network is correct. If it is correct, the target tracking network is called to track the correct target prediction frame and obtain its tracking information. The tracking information may be a code assigned to the same head target, or any other way of distinguishing whether two detections belong to the same head target.
It can be understood that different head targets may be encoded randomly, or encoded from 1 onwards according to the order in which the head targets appear in the video. These two encoding modes are given only to clearly illustrate how tracking information can be identified; any other identification number that uniquely identifies different head targets is also applicable to the scheme of the application, and the way tracking information is represented to distinguish different head targets is not strictly limited here.
And step S140, outputting a target prediction frame, a target prediction frame detection score and target tracking information.
Specifically, step S120 has already determined whether the target prediction frame of the current video frame is correct. When the target prediction frame is determined to be a correct prediction frame, the target prediction frame, its detection score and its tracking information may be output.
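The flow of steps S100 to S140 can be summarized by the following sketch. The `detector`, `judge_correct` and `tracker` callables stand in for the target detection network, the step S120 judgment and the target tracking network; their interfaces are assumptions rather than the patent's implementation.

```python
def detect_and_track(image, detector, judge_correct, tracker):
    """Sketch of steps S100-S140 for one preprocessed image to be detected."""
    boxes, scores = detector(image)                    # step S110: target prediction frames and scores
    outputs = []
    for box, score in zip(boxes, scores):
        if judge_correct(box, score):                  # step S120: judge whether the frame is correct
            track_info = tracker(box)                  # step S130: call the target tracking network
            outputs.append((box, score, track_info))   # step S140: output frame, score, tracking info
    return outputs
```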
By means of the above technical scheme, the application performs preliminary detection on the acquired image to be detected through the target detection network to obtain a target prediction frame and its detection score. Because the obtained target prediction frame is not necessarily correct at this point, the application further judges it: whether the target prediction frame is correct is determined based on the magnitude relation between its detection score and the preset detection score threshold, together with its similarity to the prediction frames determined to be correct. If it is correct, the target tracking network is called and the tracking information of the correct target prediction frame is output, together with the correct target prediction frame and its corresponding detection score, so that the accuracy of target detection is improved.
In some embodiments of the present application, the process of determining whether the target prediction frame is correct in step S120 above is described in detail. The determination process may include the following steps:
s1, comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result.
In this step, the detection score of the target prediction frame is compared with a preset detection score threshold to obtain a first comparison result. The first comparison result provides a preliminary judgment on whether the target prediction frame is correct, and a target prediction frame preliminarily judged to be correct is temporarily cached. It can be understood that a temporarily cached prediction frame that was preliminarily judged correct can be further verified; if further verification shows that the conditions are not satisfied, the temporarily cached target prediction frame can be removed.
S2, calculating the similarity of the target prediction frame and the prediction frame determined to be correct based on the position relation of the target prediction frame and the prediction frame determined to be correct.
In this step, the positional relationship between the target prediction frame and a prediction frame determined to be correct may, for example, be one of the following: the target prediction frame completely coincides with the prediction frame determined to be correct; the two frames have no intersecting portion; or the two frames overlap partially but not completely.
According to the above positional relationship, the similarity of the target prediction frame to the prediction frame determined to be correct can be calculated. The similarity can be used to distinguish whether the target prediction frame is a newly appearing correct target prediction frame, a correct target prediction frame that has already been detected previously, or a falsely detected target prediction frame.
S3, comparing the similarity with a preset similarity threshold value to obtain a second comparison result.
Specifically, for the similarity calculated in the step S2, the similarity may be compared with a preset similarity threshold, so as to obtain a second comparison result. It should be noted that the similarity threshold may be a similarity threshold set empirically, or may be adjusted according to actual needs.
And S4, judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Specifically, the first comparison result and the second comparison result are combined to judge whether the target prediction frame is correct. Combining the two comparison results enhances the reliability of the judgment and at the same time improves the detection accuracy.
In some embodiments of the present application, a process of calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the positional relationship between the target prediction frame and the prediction frame determined to be correct in S2 will be described in detail, and the process may include the following steps:
s21, calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct.
S22, calculating the distance parameter between the target prediction frame and the prediction frame determined to be correct.
S23, calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Specifically, the applicant considers that the similarity between the target prediction frame and the prediction frame determined to be correct can be determined not only from the overlapping region of the two prediction frames, but also by combining it with a distance parameter between the two frames. It should be noted that the distance parameter may be the distance between the center points of the target prediction frame and the prediction frame determined to be correct, or the ratio of this center point distance to the diagonal distance of the smallest circumscribed frame enclosing the two prediction frames.
In this regard, the step S22 may further include a process of calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct, including: the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct can be calculated, so that the center point distance of the two prediction frames is obtained; in addition, the diagonal distance between the target prediction frame and the minimum circumscribed frame of the prediction frame determined to be correct is calculated; and finally dividing the distance between the central points by the diagonal distance of the minimum circumscribed frame to obtain a ratio, and taking the ratio as a distance parameter.
In the embodiment of the present application, the process of calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter in step S23 will be described in detail. Referring to fig. 2, fig. 2 is a schematic diagram of a calculation method of similarity of prediction frames according to an embodiment of the present application. Specifically, according to the intersection ratio and the distance parameter, the similarity between the target prediction frame and the prediction frame determined to be correct is calculated, and the calculation formula of the process may include:
l_DIOU = Area(B ∩ B̂) / Area(B ∪ B̂) − E²(o₁, o₂) / δ²(a, d₁)
wherein l_DIOU represents the similarity of the target prediction frame to the prediction frame determined to be correct, Area(B ∩ B̂) represents the intersection area of the target prediction frame and the prediction frame determined to be correct, Area(B ∪ B̂) represents the union area of the target prediction frame and the prediction frame determined to be correct, E²(o₁, o₂) represents the squared distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and δ²(a, d₁) represents the squared diagonal distance of the minimum circumscribed frame enclosing the target prediction frame and the prediction frame determined to be correct.
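A minimal sketch of this similarity computation follows. The formula is taken from the description above; the (x1, y1, x2, y2) box format and the function name are assumptions.

```python
def box_similarity(box_a, box_b):
    """DIoU-style similarity between two prediction frames given as (x1, y1, x2, y2):
    intersection-over-union minus the squared center-point distance divided by the
    squared diagonal of the minimum circumscribed frame."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection area
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # union area
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # squared distance E^2(o1, o2) between the two center points
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
                   + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # squared diagonal of the minimum circumscribed frame enclosing both boxes
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    if diag_sq == 0:
        return iou
    return iou - center_dist_sq / diag_sq
```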
In some embodiments of the present application, the process of determining whether the target prediction frame is correct according to the first comparison result and the second comparison result in the step S4 will be described in detail.
The embodiments of the present application provide several different implementations, which are respectively described as follows:
first, when the first comparison result includes a magnitude relation between the detection score of the target prediction frame and a first detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold, step S4 may include:
s41, if the first comparison result is that the detection score of the target prediction frame is larger than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
And S42, if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Second, when the first comparison result includes a magnitude relation between the detection score of the target prediction frame and the first detection score threshold and the second detection score threshold, respectively, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold, please refer to fig. 3, fig. 3 is a schematic flow chart for determining whether the target prediction frame is correct according to an embodiment of the present application. Step S4 may include:
s41, if the detection score of the target prediction frame is smaller than a first detection score threshold, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold but smaller than a second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold, judging that the target prediction frame is wrong.
Specifically, case 1 may occur: if the detection score DScore of the target prediction frame is smaller than the first detection score threshold S1, it is judged that no head target is currently detected, and the target prediction frame is treated as an incorrect prediction frame.
Alternatively, case 2 may occur: if the detection score DScore of the current target prediction frame is greater than or equal to the first detection score threshold S1 but smaller than the second detection score threshold S2, and the similarity DIOU between the target prediction frame and the prediction frames determined to be correct is smaller than the preset similarity threshold S3, the target prediction frame can be determined to be an erroneous prediction frame, i.e. not a head prediction frame but possibly a prediction frame of a schoolbag or another object. In this case the prediction frame is treated as a falsely detected target prediction frame, the target tracking network is not called, and the prediction frame is not tracked.
S42, if the detection score of the target prediction frame is greater than or equal to the second detection score threshold, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold but less than the second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is greater than the preset similarity threshold, judging that the target prediction frame is correct.
Specifically, case 3 may occur: if the detection score DScore of the current target prediction frame is greater than or equal to the second detection score threshold S2, the current target prediction frame may be determined to be a correct prediction frame. Since the preset first detection score threshold S1 is smaller than the second detection score threshold S2, a detection score DScore greater than or equal to S2 necessarily also exceeds the first detection score threshold S1.
Alternatively, case 4 may occur: if the detection score DScore of the current prediction frame is greater than or equal to the first detection score threshold S1 but smaller than the second detection score threshold S2, and the similarity DIOU between the target prediction frame and the prediction frames determined to be correct is greater than or equal to the preset similarity threshold S3, the target prediction frame is judged to be correct.
And for the prediction frames judged to be correct, the target tracking network is called to track the correct prediction frames.
In the embodiment of the present application, the preset first detection score threshold S1 may take the value 0.2 and the second detection score threshold S2 may take the value 0.58; the second detection score threshold is thus greater than the first detection score threshold. In addition, the preset similarity threshold S3 may take the value 0.5. It should be noted that the preset thresholds in the embodiments of the present application are preferred values determined by the applicant through a large number of experiments, and those skilled in the art can adjust them according to actual needs.
In the embodiment of the present application, the preset first detection score threshold S1 = 0.2 is chosen to avoid missing targets as far as possible during detection: with such a small detection score threshold, most prediction frames are cached in cache container 1, although some false detections may exist among them. These erroneous prediction frames are then removed by further regression suppression using the preset second detection score threshold S2 = 0.58 and the similarity threshold S3. When a prediction frame is judged to be a false prediction frame, it may be removed from container 1; when a prediction frame is determined to be correct, it is cached in cache container 2, which may be set to a capacity of 40 frames. It can be understood that setting cache container 2 to 40 frames means that if a target leaves the picture for a short time and then re-enters, it can be regarded as the same target; if the target has not re-entered the picture after 40 frames, its tracking information is forgotten, and if it enters the picture again later it is detected and tracked as a new target.
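Cases 1 to 4 above can be sketched as follows, using the embodiment's thresholds. The list-of-boxes shape of `cached_correct_boxes` (playing the role of cache container 2) is an assumption, and `box_similarity()` is the sketch given earlier.

```python
S1, S2, S3 = 0.2, 0.58, 0.5   # first/second detection score thresholds and similarity threshold

def judge_target_box(score, box, cached_correct_boxes):
    """Return True if the target prediction frame is judged correct (cases 3 and 4),
    False otherwise (cases 1 and 2)."""
    if score < S1:                                   # case 1: no head target detected
        return False
    if score >= S2:                                  # case 3: confident detection, judged correct
        return True
    # S1 <= score < S2: decide by similarity to the prediction frames determined to be correct
    best_sim = max((box_similarity(box, cached) for cached in cached_correct_boxes),
                   default=0.0)
    return best_sim >= S3                            # case 4 if True, case 2 (false detection) otherwise
```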
In some embodiments of the present application, when the target prediction box is determined to be correct, step S130 described above, the process of calling the target tracking network to obtain the tracking information of the target prediction box may include the following steps:
s1, determining a prediction frame with the maximum similarity with the target prediction frame in each prediction frame determined to be correct.
Specifically, among the prediction frames determined to be correct, if the target in a prediction frame leaves the video picture and re-enters within a short time, it can be regarded as the same target. If the target stays out of the picture for a longer interval, set to 40 frames in the embodiment of the application, and has not re-entered after 40 video frames, the tracking information of that target prediction frame, such as its sequence number, is forgotten. If the target subsequently enters the picture again, it is detected and tracked as a new target.
And S2, associating the tracking information corresponding to the prediction frame with the maximum similarity of the target prediction frame with the target prediction frame.
Specifically, as described in sub-step S1 above, prediction frames determined to be correct are retained for up to 40 frames, so several prediction frames whose similarity to the target prediction frame exceeds the preset similarity threshold may exist. It is therefore necessary to determine the prediction frame with the greatest similarity to the target prediction frame, and to assign the tracking information corresponding to that prediction frame to the current target prediction frame.
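A minimal sketch of this association step follows. The (box, tracking_id) pair layout of the cache is an assumption, and `box_similarity()` is the sketch given earlier.

```python
def associate_tracking_info(box, cached_correct_boxes):
    """Assign to the current target prediction frame the tracking information of the cached
    correct prediction frame with the greatest similarity."""
    best_id, best_sim = None, float("-inf")
    for cached_box, tracking_id in cached_correct_boxes:
        sim = box_similarity(box, cached_box)
        if sim > best_sim:
            best_sim, best_id = sim, tracking_id
    return best_id
```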
In some embodiments of the present application, the target detection network and the target tracking network may form a joint model unified training. The training process may include:
s1, inputting a training image into the joint model to obtain a target prediction frame, a detection score of the target prediction frame and target tracking information which are output by the model, wherein the training image is marked with a target position frame and tracking information identification.
Specifically, the joint model may adopt the one-shot aggregation network ese_vovnet39b. A training image annotated with a target position frame and a tracking information identification is input into the joint model, and the target prediction frame, the detection score of the target prediction frame and the target tracking information output by the model are obtained.
S2, determining a first loss value based on the target prediction frame and the target position frame, determining a second loss value based on the detection score of the target prediction frame, and determining a third loss value based on the target tracking information and the tracking information identification.
Specifically, the loss between the target prediction frame bbox and the annotated target position frame box can be calculated with CIoU to obtain the first loss value loss_bbox. The calculation formula may include:
loss_bbox = CIoULoss(bbox, box)
wherein loss_bbox represents the first loss value, and CIoULoss(bbox, box) represents the difference between the target prediction frame bbox and the annotated target position frame box calculated using CIoU.
On the other hand, the third loss value is determined based on the target tracking information and the tracking information identification. During training, in order to make the error between the tracking information learned by the joint model and the tracking information identification actually annotated on the training image as small as possible, in the embodiment of the application the input training image is denoted x and the tracking information identification y. The adaptive cosine loss function AdaCos may be used to calculate the tracking result AdaCos(x, y) predicted by the target tracking head network, and the tracking embedding loss between the prediction result AdaCos(x, y) and the tracking information identification y is then calculated with the cross-entropy loss function CrossEntropy, which may include the following formula:
loss_track = CrossEntropy(AdaCos(x, y), y)
s3, calculating an overall loss value by using the first loss value, the second loss value and the third loss value;
Specifically, in the process of unified training of the joint model formed by the target detection network and the target tracking network, the regression loss of the joint model may be adjusted. In this embodiment, the overall loss value of the joint model is calculated by combining the first loss value, the second loss value and the third loss value; the overall loss may be calculated as a weighted sum of the three loss values, and the calculation formula may include:
loss = α*loss_cls + β*loss_bbox + γ*loss_track
wherein loss_cls represents the loss value of the detection score of the target prediction frame, loss_bbox represents the loss value of the target prediction frame, and loss_track represents the tracking information embedding loss value. α may take the value 0.3, β may take the value 0.5 and γ may take the value 0.2, the three loss weights summing to 1.
In the embodiment of the application, considering that the model needs to pay more attention to the regression effect of the target prediction frame bbox during training, the weight of the target prediction frame regression loss is set higher, to 0.5. In addition, the loss of the detection score of the target prediction frame needs to guide the tracking head network in deciding whether the currently regressed target prediction frame is a correct target prediction frame, so its weight is set to 0.3. Meanwhile, considering that correctly detected target prediction frames should be tracked without losing track, the tracking information embedding loss weight is set to 0.2. It should be noted that α = 0.3, β = 0.5 and γ = 0.2 are preferred parameter settings determined by the applicant after multiple experiments.
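A minimal sketch of the weighted overall loss follows; how loss_cls, loss_bbox and loss_track are produced by the training framework (CIoU loss, AdaCos plus cross-entropy) is not prescribed here.

```python
ALPHA, BETA, GAMMA = 0.3, 0.5, 0.2   # alpha, beta, gamma from the embodiment above

def overall_loss(loss_cls, loss_bbox, loss_track):
    """Weighted sum loss = alpha*loss_cls + beta*loss_bbox + gamma*loss_track, where the
    inputs are the detection-score loss, the bounding-box regression loss and the
    tracking-embedding loss."""
    return ALPHA * loss_cls + BETA * loss_bbox + GAMMA * loss_track
```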
And S4, updating parameters of the joint model based on the total loss value.
Specifically, updating the parameters of the joint model based on the overall loss value yields a joint model whose prediction results have higher accuracy. The applicant considers that although the trained joint model can output predictions with high accuracy, a joint model trained with the one-shot aggregation network ese_vovnet39b is large, which is unfavorable for deployment and use. The joint model can therefore be compressed by a distillation method to obtain a joint small model that is easier to deploy and put into use.
The backbone network of the joint small model may be the real-time detection and tracking network MNet25M. This network may comprise multi-scale convolution layers, stacked convolution layers and multi-scale feature maps, and may combine deconvolution with bilinear sampling. The network may take RGB video image frames of size 3×416×416 as input; by applying multi-scale convolution kernels to the input image, multi-scale feature maps of the same frame under different receptive fields can be obtained. Meanwhile, by stacking multiple convolutions, more detailed and distinguishable features of the feature maps can be extracted. Further, performing the same convolution operations with kernels of the same scale on consecutive video frames yields feature maps of the same scale for different video frames. In this way, feature maps of the same scale across consecutive video frames and feature maps of different scales within the same frame can both be extracted. The multi-scale image feature maps are then down-sampled and up-sampled to the same size, and the resulting features are finally passed through ReLU activation.
In the embodiment of the application, the combined small model can adopt MNet25M as a backbone network, and the training image is subjected to feature learning through the backbone network and is connected with a target detection network, and the target detection network extracts the features learned by the backbone network and returns a target prediction frame and a corresponding detection score.
Referring to fig. 4, fig. 4 is a framework diagram of the joint small model according to an embodiment of the present application. In the embodiment of the application, the backbone network MNet25M adds a target tracking network structure on top of the connected target detection network and matches detection results by tracking, so that if a falsely detected target appears during video image detection, the regression of its detection frame can be suppressed in subsequent frames, further improving the accuracy of target detection. In this way, each video frame is no longer detected as an isolated image, which alleviates the problem that, when a false detection occurs, it cannot be determined whether the currently detected target is a false detection, leading to a high false detection rate.
The following describes an object detection and tracking integration device provided in the embodiments of the present application, and the object detection and tracking integration device described below and the object detection and tracking integration method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an integrated target detection and tracking device according to an embodiment of the present application.
As shown in fig. 5, the apparatus may include:
an image acquisition unit 11 for acquiring an image to be detected;
The target detection unit 12 is configured to input the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
a judging unit 13, configured to judge whether the target prediction frame is correct based on a magnitude relation between a detection score of the target prediction frame and a preset detection score threshold value, and a similarity between the target prediction frame and the prediction frame determined to be correct;
a target tracking unit 14, configured to invoke a target tracking network to obtain tracking information of the target prediction frame;
and a result obtaining unit 15, configured to output the target prediction frame, the detection score of the target prediction frame, and tracking information of the target prediction frame.
Alternatively, the judging unit 13 may include:
the first comparison result acquisition unit is used for comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result;
a similarity calculation unit configured to calculate a similarity between the target prediction frame and the prediction frame determined to be correct based on a positional relationship between the target prediction frame and the prediction frame determined to be correct;
the second comparison result acquisition unit is used for comparing the similarity with a preset similarity threshold value to obtain a second comparison result;
And the first judging subunit is used for judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Alternatively, the similarity calculation unit may include:
an intersection ratio calculating unit for calculating an intersection ratio of the target prediction frame and the prediction frame determined to be correct;
a distance parameter calculation unit for calculating the distance parameter between the target prediction frame and the prediction frame determined to be correct;
and the similarity obtaining unit is used for calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Optionally, the distance parameter calculation unit may include:
the center point distance calculation unit is used for calculating the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, so as to obtain the center point distance;
a diagonal distance calculating unit for calculating a diagonal distance between the target prediction frame and the minimum circumscribed frame of the prediction frame determined to be correct;
and the ratio calculation unit is used for dividing the center point distance by the diagonal distance of the minimum circumscribed frame to obtain the distance parameter.
Optionally, the first comparison result includes a magnitude relation between the detection score of the target prediction frame and a first detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold, and the determining process of the first determining subunit may include:
if the first comparison result is that the detection score of the target prediction frame is larger than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Optionally, the first comparison result includes a magnitude relation between the detection score of the target prediction frame and a first detection score threshold and a second detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold, and the determining process of the first determining subunit may include:
if the detection score of the target prediction frame is smaller than a first detection score threshold, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold but smaller than a second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold, judging that the target prediction frame is wrong;
If the detection score of the target prediction frame is greater than or equal to the second detection score threshold, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold but less than the second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is greater than the preset similarity threshold, judging that the target prediction frame is correct;
wherein the second detection score threshold is greater than the first detection score threshold.
Alternatively, the target tracking unit 14 may include:
a similarity comparing unit configured to determine, among prediction frames determined to be correct, a prediction frame having a maximum similarity to the target prediction frame;
and the association unit is used for associating the tracking information corresponding to the prediction frame with the maximum similarity of the target prediction frame with the target prediction frame.
Optionally, the above-mentioned integrated target detection and tracking device may include a joint model training unit, and it may be understood that the joint model training unit is configured to jointly train the target detection network and the target tracking network, and the training process of the joint model may include:
inputting a training image into the joint model to obtain a target prediction frame, a detection score of the target prediction frame and target tracking information which are output by the model, wherein the training image is marked with a target position frame and tracking information identification;
determining a first loss value based on the target prediction frame and the target position frame, determining a second loss value based on the detection score of the target prediction frame, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating an overall loss value using the first loss value, the second loss value, and the third loss value;
and updating parameters of the joint model based on the overall loss value.
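As a hedged sketch of how the three loss values might be combined and used to update the joint model, assuming a PyTorch-style optimizer and purely illustrative weights:

```python
import torch

def joint_update(optimizer: torch.optim.Optimizer,
                 loss_box: torch.Tensor,
                 loss_score: torch.Tensor,
                 loss_track: torch.Tensor,
                 w_box: float = 1.0, w_score: float = 1.0, w_track: float = 1.0) -> float:
    """Combine the three loss values and update the parameters of the joint model.

    A PyTorch-style optimizer and loss tensors are assumed, and the weights w_*
    are illustrative; the description above does not fix how the losses are combined.
    """
    total = w_box * loss_box + w_score * loss_score + w_track * loss_track  # overall loss value
    optimizer.zero_grad()
    total.backward()    # gradients reach both the detection branch and the tracking branch
    optimizer.step()    # parameter update of the joint model
    return float(total.detach())
```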
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and for identical or similar parts reference may be made between the embodiments.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An integrated target detection and tracking method is characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame determined to be correct, wherein the judging step specifically comprises:
comparing the detection score of the target prediction frame with the preset detection score threshold value to obtain a first comparison result;
calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the positional relationship between the target prediction frame and the prediction frame determined to be correct;
comparing the similarity with a preset similarity threshold value to obtain a second comparison result;
and judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result;
wherein the calculating of the similarity between the target prediction frame and the prediction frame determined to be correct based on the positional relationship between the target prediction frame and the prediction frame determined to be correct comprises:
calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
calculating a distance parameter of the target prediction frame and the prediction frame determined to be correct;
and calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter;
if the target prediction frame is correct, calling a target tracking network to obtain tracking information of the target prediction frame;
and outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
2. The method of claim 1, wherein calculating the distance parameter of the target prediction frame from the prediction frame determined to be correct comprises:
calculating the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, so as to obtain the center point distance;
calculating the diagonal distance of the minimum circumscribed frame of the target prediction frame and the prediction frame determined to be correct;
dividing the center point distance by the diagonal distance of the minimum circumscribed frame to obtain the distance parameter.
3. The method according to claim 2, wherein the calculating a similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter includes:
S = A_I / A_U - d / c
wherein,
S represents the similarity between the target prediction frame and the prediction frame determined to be correct,
A_I represents the intersection area of the target prediction frame and the prediction frame determined to be correct,
A_U represents the area of the union of the target prediction frame and the prediction frame determined to be correct,
d represents the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and
c represents the diagonal distance of the minimum circumscribed frame of the target prediction frame and the prediction frame determined to be correct.
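By way of a non-limiting illustration only, the similarity of claim 3 could be evaluated as in the following Python sketch; the box representation (x1, y1, x2, y2), the function name and the zero-division guards are assumptions, not part of the claim.

```python
import math

def similarity(box_a, box_b):
    """Intersection ratio of the two frames minus the distance parameter of claim 2.

    Boxes are assumed to be (x1, y1, x2, y2); the function name, the box format
    and the zero-division guards are illustrative assumptions.
    """
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Center-point distance over the diagonal of the minimum circumscribed frame
    cax, cay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    cbx, cby = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    d = math.hypot(cax - cbx, cay - cby)
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c = math.hypot(ex2 - ex1, ey2 - ey1)

    iou = inter / union if union > 0 else 0.0
    return iou - (d / c if c > 0 else 0.0)
```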
4. The method of claim 1, wherein the first comparison result comprises a magnitude relation of a detection score of the target prediction box to a first detection score threshold, and the second comparison result comprises a magnitude relation of the similarity to the preset similarity threshold;
judging whether the target prediction frame is correct according to the first comparison result and the second comparison result, including:
if the first comparison result is that the detection score of the target prediction frame is larger than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
5. The method of claim 1, wherein the first comparison result includes a magnitude relation between a detection score of the target prediction frame and a first detection score threshold and a second detection score threshold, respectively, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold;
judging whether the target prediction frame is correct according to the first comparison result and the second comparison result, including:
if the detection score of the target prediction frame is smaller than a first detection score threshold, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold but smaller than a second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold, judging that the target prediction frame is wrong;
if the detection score of the target prediction frame is greater than or equal to the second detection score threshold, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold but less than the second detection score threshold, and the similarity between the target prediction frame and the prediction frame determined to be correct is greater than the preset similarity threshold, judging that the target prediction frame is correct;
wherein the second detection score threshold is greater than the first detection score threshold.
6. The method of claim 1, wherein the invoking the target tracking network to obtain tracking information for the target prediction box comprises:
among the prediction frames determined to be correct, determining a prediction frame with the maximum similarity with the target prediction frame;
and associating the tracking information corresponding to the prediction frame with the maximum similarity of the target prediction frame with the target prediction frame.
7. The method of claim 1, wherein the target detection network and the target tracking network form a joint model that is trained in a unified manner, the training process comprising:
inputting a training image into the joint model to obtain a target prediction frame, a detection score of the target prediction frame and target tracking information which are output by the model, wherein the training image is marked with a target position frame and tracking information identification;
determining a first loss value based on the target prediction frame and the target position frame, determining a second loss value based on a detection score of the target prediction frame, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating an overall loss value using the first loss value, the second loss value, and the third loss value;
and updating parameters of the joint model based on the overall loss value.
8. An integrated target detection and tracking device, comprising:
the image acquisition unit is used for acquiring an image to be detected;
the target detection unit is used for inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
the judging unit is used for judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame determined to be correct, wherein the judging unit is specifically used for:
comparing the detection score of the target prediction frame with the preset detection score threshold value to obtain a first comparison result;
calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the positional relationship between the target prediction frame and the prediction frame determined to be correct;
comparing the similarity with a preset similarity threshold value to obtain a second comparison result;
and judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result;
wherein the calculating of the similarity between the target prediction frame and the prediction frame determined to be correct based on the positional relationship between the target prediction frame and the prediction frame determined to be correct comprises:
calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
calculating a distance parameter of the target prediction frame and the prediction frame determined to be correct;
and calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter;
the target tracking unit is used for calling a target tracking network to obtain tracking information of the target prediction frame;
and the result acquisition unit is used for outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
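Purely as a non-limiting illustration of how the method of claim 1 might be exercised in code, the sketch below strings the claimed steps together; the network interfaces and every name are assumptions made for readability, not part of the claims.

```python
def detect_and_track(image, detection_net, tracking_net,
                     correct_boxes, similarity_fn, is_correct_fn):
    """Hedged end-to-end sketch of the detect-judge-track flow of claim 1.

    detection_net(image) is assumed to yield (box, score) pairs and
    tracking_net(box) to return tracking information for a frame judged correct;
    every name here is illustrative rather than part of the claims.
    """
    outputs = []
    for box, score in detection_net(image):
        # Judge the prediction frame from its score and its similarity to known-correct frames
        best_sim = max((similarity_fn(box, c) for c in correct_boxes), default=0.0)
        if is_correct_fn(score, best_sim):
            track_info = tracking_net(box)      # invoke the target tracking network
            correct_boxes.append(box)           # remember the frame as a correct prediction frame
            outputs.append((box, score, track_info))
    return outputs
```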
CN202110956341.6A 2021-08-19 2021-08-19 Target detection and tracking integrated method and device Active CN113642490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956341.6A CN113642490B (en) 2021-08-19 2021-08-19 Target detection and tracking integrated method and device

Publications (2)

Publication Number Publication Date
CN113642490A CN113642490A (en) 2021-11-12
CN113642490B true CN113642490B (en) 2023-07-11

Family

ID=78422920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956341.6A Active CN113642490B (en) 2021-08-19 2021-08-19 Target detection and tracking integrated method and device

Country Status (1)

Country Link
CN (1) CN113642490B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710944B (en) * 2024-02-05 2024-06-25 虹软科技股份有限公司 Model defect detection method, model training method, target detection method and target detection system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182693A (en) * 2017-12-12 2018-06-19 嘉兴慧康智能科技有限公司 A kind of multiple target tracking algorithm based on tracking segment confidence level and appearance study
CN111931686A (en) * 2020-08-26 2020-11-13 北京建筑大学 Video satellite target tracking method based on background knowledge enhancement
CN112651998A (en) * 2021-01-18 2021-04-13 沈阳航空航天大学 Human body tracking algorithm based on attention mechanism and double-current multi-domain convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6819230B2 (en) * 2002-08-08 2004-11-16 The United States Of America As Represented By The Secretary Of The Navy Target track crossing prediction/detection

Also Published As

Publication number Publication date
CN113642490A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
KR102596897B1 (en) Method of motion vector and feature vector based fake face detection and apparatus for the same
US20210065381A1 (en) Target tracking method, device, system and non-transitory computer readable medium
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
Davis et al. A two-stage template approach to person detection in thermal imagery
US9098760B2 (en) Face recognizing apparatus and face recognizing method
CN101281595B (en) Apparatus and method for face recognition
US20200082560A1 (en) Estimating two-dimensional object bounding box information based on bird's-eye view point cloud
US20090010501A1 (en) Image processing apparatus and image processing method
CN111325051B (en) Face recognition method and device based on face image ROI selection
EP2662827B1 (en) Video analysis
KR101818129B1 (en) Device and method for pedestrian recognition using convolutional neural network
JP4533836B2 (en) Fluctuating region detection apparatus and method
CN110827432B (en) Class attendance checking method and system based on face recognition
JP2020149642A (en) Object tracking device and object tracking method
CN113642490B (en) Target detection and tracking integrated method and device
JPWO2012046426A1 (en) Object detection apparatus, object detection method, and object detection program
KR20080049206A (en) Face recognition method
CN112766046B (en) Target detection method and related device
JP2007219603A (en) Person tracking device, person tracking method and person tracking program
KR101542206B1 (en) Method and system for tracking with extraction object using coarse to fine techniques
JP6384167B2 (en) MOBILE BODY TRACKING DEVICE, MOBILE BODY TRACKING METHOD, AND COMPUTER PROGRAM
CN114067186A (en) Pedestrian detection method and device, electronic equipment and storage medium
CN111429727A (en) License plate identification method and system in open type parking space
CN106778675B (en) A kind of recognition methods of target in video image object and device
JP5241687B2 (en) Object detection apparatus and object detection program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant