CN113642490A - Target detection and tracking integrated method and device - Google Patents

Target detection and tracking integrated method and device

Info

Publication number
CN113642490A
Authority
CN
China
Prior art keywords
prediction frame
target
target prediction
correct
detection score
Prior art date
Legal status
Granted
Application number
CN202110956341.6A
Other languages
Chinese (zh)
Other versions
CN113642490B (en)
Inventor
徐志通
陈书楷
杨奇
Current Assignee
Xiamen Entropy Technology Co Ltd
Original Assignee
Xiamen Entropy Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Entropy Technology Co Ltd
Priority to CN202110956341.6A
Publication of CN113642490A
Application granted
Publication of CN113642490B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an integrated target detection and tracking method and device. For an acquired image to be detected, a target detection network preliminarily detects a target prediction frame and its detection score. Because the target prediction frame obtained at this point is not necessarily a correct target prediction frame, the method and device judge it further: whether the target prediction frame is correct is determined based on the magnitude relationship between its detection score and a preset detection score threshold, and on the similarity between the target prediction frame and the prediction frames that have already been determined to be correct. If it is correct, the target tracking network is called to output the tracking information of the correct target prediction frame, and the correct target prediction frame and its corresponding detection score are output at the same time, thereby improving the accuracy of target detection.

Description

Target detection and tracking integrated method and device
Technical Field
The application relates to the technical field of computer vision, in particular to a target detection and tracking integrated method and device.
Background
Target detection and tracking is a research direction of wide interest in the field of computer vision and is the basis of applications such as video surveillance, human-computer interaction, and robot visual navigation. For example, an identity verification system based on a channel gate usually detects and tracks pedestrians through video monitoring and then verifies their identity.
Therefore, whether a human head target passing through the channel gate can be correctly detected and tracked is very important for subsequent services of the channel gate identity verification system, such as tailing detection, detection of reverse traffic, and counting of pedestrians passing through the channel.
With the application of neural network algorithms in the field of computer vision, many deep-learning-based target detection algorithms have appeared. However, existing target detection algorithms generally suffer from missed detections and false detections, both of which have a great influence on subsequent services of the channel gate identity verification system.
Disclosure of Invention
In view of the above problems, the present application provides an integrated target detection and tracking scheme to improve the target detection accuracy. The specific scheme is as follows:
an integrated target detection and tracking method comprises the following steps:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame which is determined to be correct;
if the target prediction frame is correct, calling a target tracking network to obtain tracking information of the target prediction frame;
and outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
Optionally, determining whether the target prediction frame is correct based on a size relationship between the detection score of the target prediction frame and a preset detection score threshold, and a similarity between the target prediction frame and the prediction frame determined to be correct includes:
comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result;
calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the position relation between the target prediction frame and the prediction frame determined to be correct;
comparing the similarity with a preset similarity threshold to obtain a second comparison result;
and judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Optionally, calculating a similarity between the target prediction frame and the prediction frame determined to be correct based on the position relationship between the target prediction frame and the prediction frame determined to be correct includes:
calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct;
and calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Optionally, calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct includes:
calculating the distance between the central point of the target prediction frame and the central point of the prediction frame determined to be correct to obtain the distance between the central points;
calculating the diagonal distance between the target prediction frame and the minimum external frame of the prediction frame determined to be correct;
and dividing the distance of the central point by the diagonal distance of the minimum external frame to obtain the distance parameter.
Optionally, according to the intersection ratio and the distance parameter, calculating a similarity between the target prediction frame and the prediction frame determined to be correct, where the calculation formula includes:
DIOU = area(A ∩ B) / area(A ∪ B) - E²(o1, o2) / δ²(a, d1)
wherein DIOU represents the similarity between the target prediction frame A and the prediction frame B that has been determined to be correct, area(A ∩ B) represents the intersection area of the two prediction frames, area(A ∪ B) represents the area of their union, E(o1, o2) represents the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and δ(a, d1) represents the diagonal distance of the minimum bounding box enclosing the target prediction frame and the prediction frame determined to be correct.
Optionally, the first comparison result includes a size relationship between a detection score of the target prediction box and a first detection score threshold, and the second comparison result includes a size relationship between the similarity and the preset similarity threshold;
then, according to the first comparison result and the second comparison result, determining whether the target prediction frame is correct includes:
if the first comparison result is that the detection score of the target prediction frame is greater than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value, and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Optionally, the first comparison result includes a magnitude relationship between a detection score of the target prediction box and a first detection score threshold and a second detection score threshold, respectively, and the second comparison result includes a magnitude relationship between the similarity and the preset similarity threshold;
then, according to the first comparison result and the second comparison result, determining whether the target prediction frame is correct includes:
if the detection score of the target prediction frame is smaller than a first detection score threshold value, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold value but smaller than a second detection score threshold value, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold value, determining that the target prediction frame is wrong;
if the detection score of the target prediction frame is greater than or equal to the second detection score threshold value, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold value but smaller than the second detection score threshold value, and the similarity between the target prediction frame and the prediction frame which is determined to be correct is greater than the preset similarity threshold value, judging that the target prediction frame is correct;
wherein the second detection score threshold is greater than the first detection score threshold.
Optionally, the invoking a target tracking network to obtain tracking information of the target prediction box includes:
determining a prediction frame with the maximum similarity with the target prediction frame in all the prediction frames which are determined to be correct;
and associating the tracking information corresponding to the prediction frame with the maximum similarity with the target prediction frame.
Optionally, the target detection network and the target tracking network form a joint model for unified training, and the training process includes:
inputting a training image into the combined model to obtain a target prediction frame output by the model, a detection score of the target prediction frame and target tracking information, wherein the training image is marked with a target position frame and a tracking information identifier;
determining a first loss value based on the target prediction box and the target location box, determining a second loss value based on a detection score of the target prediction box, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating an overall loss value using the first loss value, the second loss value, and the third loss value;
updating parameters of the joint model based on the total loss values.
An integrated target detection and tracking device, comprising:
the image acquisition unit is used for acquiring an image to be detected;
the target detection unit is used for inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
the judging unit is used for judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame which is determined to be correct;
the target tracking unit is used for calling a target tracking network to obtain tracking information of the target prediction frame;
and the result acquisition unit is used for outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
By means of the above technical solution, a target prediction frame and its detection score are preliminarily detected from the acquired image to be detected through the target detection network. Since the target prediction frame obtained at this point is not necessarily a correct target prediction frame, it is judged further: whether the target prediction frame is correct is determined based on the magnitude relationship between its detection score and a preset detection score threshold, and on the similarity between the target prediction frame and the prediction frames that have been determined to be correct. If the target prediction frame is correct, the target tracking network is called, the tracking information of the correct target prediction frame is output, and the correct target prediction frame and its corresponding detection score are output at the same time, thereby improving the accuracy of target detection.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of an integrated target detection and tracking method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a method for calculating similarity of prediction frames according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating an example of determining whether a target prediction box is correct according to an embodiment of the present disclosure;
  • FIG. 4 is a framework diagram of the joint small model provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an integrated target detection and tracking device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application considers that whether a human head target in the channel can be correctly detected and tracked is particularly important for the logic of subsequent services of the channel gate identity verification system, such as tailing detection, reverse-traffic detection, and people counting. The present application provides an integrated target detection and tracking scheme that improves the accuracy of target detection by applying regression suppression to the target prediction frame and invoking a tracking network.
Next, referring to fig. 1, fig. 1 is a schematic flowchart of a target detection and tracking integration method provided in an embodiment of the present application, where the target detection and tracking integration method of the present application may include the following steps:
and S100, acquiring an image to be detected.
Specifically, the acquired image to be detected may be a video frame image obtained by preprocessing video data collected by a monitoring camera. In the embodiment of the present application, the image to be detected may be obtained by performing a difference operation on adjacent frames of the video image sequence so as to keep only video frames that contain a moving object: the pixel difference between adjacent frames is calculated over consecutive frames of the video data, so that video frames without a moving object can be preliminarily removed. It can be understood that a video frame without a moving object may refer to a video frame image without a human body, where "moving" is relative to objects fixedly arranged within the same shooting range.
On the other hand, considering that the resolution of the image to be detected influences the subsequent inference, generally affecting both inference speed and inference precision, the obtained original video frame image may be compressed in the embodiment of the present application. The resolution of the obtained video frame image is generally 1920 × 1080 pixels; after scaling the image proportionally, the short edge may be padded to obtain the image to be detected. In some embodiments of the present application, balancing inference precision against inference speed according to the results of multiple experiments, the original video frame image is proportionally compressed to a non-square image of 416 × 234 pixels, and the short edge of the compressed video frame image is padded with black pixels to obtain a 416 × 416 square image.
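For illustration only, the preprocessing described above could be sketched roughly as follows in Python with OpenCV; the function names and the motion-detection thresholds are assumptions and are not part of the patent text.

    import cv2
    import numpy as np

    def has_motion(prev_frame, frame, pixel_thresh=25, ratio_thresh=0.002):
        # Frame differencing: keep a frame only when enough pixels differ from
        # the previous frame (both thresholds are assumed values).
        diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        moving = np.count_nonzero(diff > pixel_thresh)
        return moving / diff.size > ratio_thresh

    def preprocess(frame, target=416):
        # Scale 1920x1080 proportionally (long side -> 416, i.e. 416x234),
        # then pad the short side with black pixels to a 416x416 square.
        h, w = frame.shape[:2]
        scale = target / max(h, w)
        resized = cv2.resize(frame, (int(round(w * scale)), int(round(h * scale))))
        canvas = np.zeros((target, target, 3), dtype=resized.dtype)
        canvas[:resized.shape[0], :resized.shape[1]] = resized
        return canvas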
And step S110, inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof.
Specifically, the image to be detected obtained in step S100 is input to a preset target detection network, so that a target prediction frame and a detection score of the target prediction frame can be obtained. By taking intelligent channel human head detection as an example, pedestrians passing through an intelligent channel are detected, and a picture to be detected is input into a human head target detection network, so that a human head prediction frame and a detection score of the human head prediction frame can be obtained.
However, the human head prediction frame obtained in this step may be an erroneous prediction frame, meaning that the target detection network falsely detects a backpack or another object similar to a human head as a human head. If the detection result of the target detection network in this step were output directly, without further regression constraint, the probability of false detection would be very high. Therefore, in order to improve the accuracy of target detection, the present application further constrains the target prediction frame output by the target detection network, reducing the probability of false detection to some extent.
In step S120, it is determined whether the target prediction frame is correct, and if so, the following step S130 is performed.
Specifically, whether the target prediction frame is correct or not is further judged based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame which is determined to be correct.
It can be understood that the preset detection score threshold may be adjusted according to actual needs and according to test results, and the preset detection score threshold may roughly determine whether the target prediction frame is a false head prediction frame, and filter out false head prediction frames that are falsely detected.
Meanwhile, whether the target prediction frame is correct is judged more accurately by combining the similarity between the target prediction frame and the prediction frames that have been determined to be correct. It should be noted that, in some embodiments of the present application, a prediction frame that has been determined to be correct may be understood as a target prediction frame that has been cached locally. During inference of the target detection network, target prediction frames judged to be correct are cached in a specific cache container for comparison with the target prediction frames of subsequent video frames. By comparing the similarity between a target prediction frame of a subsequent video frame and the prediction frames determined to be correct, it can be judged whether that target prediction frame corresponds to a correct target that has already been detected previously.
Step S130, a target tracking network is called.
Specifically, step S120 determines whether the target prediction frame output by the target detection network is correct. If it is correct, the target tracking network may be invoked to track the correct target prediction frame and obtain its tracking information. The tracking information may be a code assigned to the same head target, or may take another form, as long as it distinguishes whether detections belong to the same head target.
It can be understood that different head targets may be encoded randomly, or encoded starting from 1 in the order in which the head targets appear in the video. These two encoding manners are given in the embodiment of the present application merely to clearly illustrate how tracking information can be identified; other id numbers that uniquely identify different head targets are also applicable to the scheme of the present application.
And step S140, outputting a target prediction frame, a target prediction frame detection score and target tracking information.
Specifically, step S120 determines whether the target prediction frame of the current video frame is correct, and in the case where the target prediction frame is determined to be a correct prediction frame, the target prediction frame, its detection score, and its tracking information may be output.
By means of the above technical solution, a target prediction frame and its detection score are preliminarily detected from the acquired image to be detected through the target detection network. Since the target prediction frame obtained at this point is not necessarily a correct target prediction frame, it is judged further: whether the target prediction frame is correct is determined based on the magnitude relationship between its detection score and a preset detection score threshold, and on the similarity between the target prediction frame and the prediction frames that have been determined to be correct. If the target prediction frame is correct, the target tracking network is called, the tracking information of the correct target prediction frame is output, and the correct target prediction frame and its corresponding detection score are output at the same time, thereby improving the accuracy of target detection.
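As an illustrative sketch of how steps S100 to S140 fit together, the following loop strings the stages together; all arguments are caller-supplied stand-ins for the networks and rules described above, and the names are assumptions rather than the patent's actual implementation.

    def process_frame(frame, preprocess, detector, judge, similarity, associate, correct_cache):
        """One iteration of the integrated detection-and-tracking loop (illustrative only)."""
        image = preprocess(frame)                       # step S100: image to be detected
        boxes, scores = detector(image)                 # step S110: prediction frames + detection scores
        outputs = []
        for box, score in zip(boxes, scores):
            sims = [similarity(box, cached) for cached in correct_cache]
            if judge(score, max(sims, default=0.0)):    # step S120: correctness judgment
                track_id = associate(box, sims, correct_cache)   # step S130: tracking network
                correct_cache.append(box)
                outputs.append((box, score, track_id))  # step S140: output results
        return outputs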
In some embodiments of the present application, the process of determining whether the target prediction frame is correct in step S120 is described in detail; the determining process may include the following steps:
and S1, comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result.
In this step, the detection score of the target prediction frame may be compared with a preset detection score threshold to obtain a first comparison result. The first comparison result allows a preliminary judgment of whether the target prediction frame is a correct prediction frame or an erroneous prediction frame, and the target prediction frame preliminarily judged to be correct is temporarily cached. It can be understood that a prediction frame preliminarily judged to be correct in the temporary cache may be further verified, and if the conditions are not satisfied upon further verification, the target prediction frame may be removed from the temporary cache.
And S2, calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the position relation between the target prediction frame and the prediction frame determined to be correct.
In this step, the positional relationship between the target prediction frame and a prediction frame determined to be correct may be, for example, that the two frames coincide completely, that they do not intersect at all, or that they overlap in part but do not coincide completely.
According to the above-mentioned positional relationship, the similarity between the target prediction frame and the prediction frame determined to be correct can be calculated, and the similarity can be used to distinguish whether the target prediction frame is a correct target prediction frame that newly appears, a correct target prediction frame that has been detected previously, or an incorrect target prediction frame that is detected incorrectly.
And S3, comparing the similarity with a preset similarity threshold to obtain a second comparison result.
Specifically, the similarity calculated in step S2 may be compared with a preset similarity threshold to obtain a second comparison result. The similarity threshold may be an empirically set similarity threshold, or may be adjusted according to actual needs.
And S4, judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Specifically, the first comparison result and the second comparison result are combined to determine whether the target prediction frame is correct, and the two comparison results are combined to determine whether the target prediction frame is correct, so that the reliability of the determination result can be enhanced, and the accuracy of detection can be improved.
In some embodiments of the present application, the detailed description of the process of calculating the similarity between the target prediction box and the prediction box determined to be correct based on the position relationship between the target prediction box and the prediction box determined to be correct at S2 may include the following steps:
and S21, calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct.
And S22, calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct.
And S23, calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Specifically, the applicant considers that the similarity between the target prediction frame and a prediction frame determined to be correct can be determined not only from the overlapping area of the two prediction frames, but also by combining it with a distance parameter between them. It should be noted that the distance parameter may be the center-point distance calculated from the center points of the target prediction frame and the prediction frame determined to be correct, or it may be a ratio obtained by combining the center-point distance of the two prediction frames with the diagonal distance of their minimum circumscribing frame.
In this regard, in step S22, the step of calculating the distance parameter between the target prediction frame and the prediction frame determined to be correct may include: the distance between the central point of the target prediction frame and the central point of the prediction frame determined to be correct can be calculated, so that the distance between the central points of the two prediction frames is obtained; in addition, the diagonal distance between the target prediction frame and the minimum external frame of the prediction frame determined to be correct is calculated; and finally, dividing the distance between the central points by the diagonal distance of the minimum external frame to obtain a ratio, and taking the ratio as a distance parameter.
In the embodiment of the present application, a detailed description will be given of a process of calculating, in step S23, a similarity between the target prediction box and the prediction box determined to be correct according to the intersection ratio and the distance parameter. Referring to fig. 2, fig. 2 is a schematic view illustrating a method for calculating similarity of prediction frames according to an embodiment of the present disclosure. Specifically, according to the intersection ratio and the distance parameter, the similarity between the target prediction frame and the prediction frame determined to be correct is calculated, and the calculation formula of the process may include:
DIOU = area(A ∩ B) / area(A ∪ B) - E²(o1, o2) / δ²(a, d1)
wherein DIOU represents the similarity between the target prediction frame A and the prediction frame B that has been determined to be correct, area(A ∩ B) represents the intersection area of the two prediction frames, area(A ∪ B) represents the area of their union, E(o1, o2) represents the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and δ(a, d1) represents the diagonal distance of the minimum bounding box enclosing the target prediction frame and the prediction frame determined to be correct.
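A minimal sketch of this similarity calculation, assuming prediction frames are given as (x1, y1, x2, y2) corner coordinates, could look as follows; it follows the DIoU-style formula above and is for illustration only.

    import math

    def diou_similarity(a, b):
        # Intersection over union of the two prediction frames.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        iou = inter / (area_a + area_b - inter + 1e-9)

        # Distance parameter: centre-point distance divided by the diagonal of
        # the minimum enclosing box of the two prediction frames.
        cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
        cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        center_dist = math.hypot(cax - cbx, cay - cby)
        ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
        ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
        diagonal = math.hypot(ex2 - ex1, ey2 - ey1)

        return iou - (center_dist / (diagonal + 1e-9)) ** 2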
In some embodiments of the present application, a detailed description will be given to the process of determining whether the target prediction box is correct according to the first comparison result and the second comparison result in step S4.
The embodiment of the present application provides several different implementation manners, which are respectively introduced as follows:
first, when the first comparison result includes a magnitude relation between the detection score of the target prediction box and a first detection score threshold, and the second comparison result includes a magnitude relation between the similarity and the preset similarity threshold, step S4 may include:
s41, if the first comparison result is that the detection score of the target prediction box is greater than or equal to a preset first detection score threshold value, judging that the target prediction box is correct;
and S42, if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value, and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Secondly, when the first comparison result includes a magnitude relationship between the detection score of the target prediction box and a first detection score threshold and a second detection score threshold, respectively, and the second comparison result includes a magnitude relationship between the similarity and the preset similarity threshold, please refer to fig. 3, where fig. 3 is a schematic flowchart illustrating an example of the embodiment of the present application for determining whether the target prediction box is correct. Step S4 may include:
s41, if the detection score of the target frame is smaller than a first detection score threshold, or the detection score of the target frame is greater than or equal to the first detection score threshold but smaller than a second detection score threshold, and the similarity between the target frame and the frame determined to be correct is smaller than the preset similarity threshold, determining that the target frame is incorrect.
Specifically, there may be case 1: if the detection score DScore of the target prediction frame is smaller than the first detection score threshold S1, it is judged that no head target is currently detected, and the target prediction frame is treated as an erroneous prediction frame.
Alternatively, there may be case 2: if the detection score DScore of the current target prediction frame is greater than or equal to the first detection score threshold S1 but smaller than the second detection score threshold S2, and the similarity DIOU between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold S3, the target prediction frame is judged to be an erroneous prediction frame; it is not a human head prediction frame and may instead be a prediction frame of a bag or another object.
S42, if the detection score of the target frame is greater than or equal to the second detection score threshold, or the detection score of the target frame is greater than or equal to the first detection score threshold but less than the second detection score threshold, and the similarity between the target frame and the frame determined to be correct is greater than the preset similarity threshold, determining that the target frame is correct.
Specifically, there may be case 3: if the detection score DScore of the current target prediction frame is greater than or equal to the second detection score threshold S2, the current target prediction frame can be determined to be a correct prediction frame. It should be noted that the preset first detection score threshold S1 is smaller than the second detection score threshold S2, so once the detection score DScore of the current target prediction frame is judged to be greater than or equal to the second detection score threshold S2, this already means that the detection score DScore of the current target prediction frame is greater than the first detection score threshold S1.
Alternatively, there may be case 4: if the detection score DScore of the current prediction frame is greater than or equal to the first detection score threshold S1 but smaller than the second detection score threshold S2, and the similarity DIOU between the target prediction frame and the prediction frame determined to be correct is greater than or equal to the preset similarity threshold S3, the target prediction frame is determined to be correct.
And calling a target tracking network for the prediction frame judged to be correct, and tracking the correct prediction frame.
In the embodiment of the present application, the preset first detection score threshold S1 may take the value 0.2 and the second detection score threshold S2 may take the value 0.58; obviously, the second detection score threshold is greater than the first detection score threshold. In addition, the preset similarity threshold S3 may take the value 0.5. It should be noted that these preset thresholds are preferred values determined by the applicant after a large number of experiments, and those skilled in the art can adjust them according to actual needs.
In the embodiment of the present application, the preset first detection score threshold S1 is set to 0.2 in order to avoid missing targets as much as possible during detection: when the detection score threshold is relatively small, most prediction frames are cached in cache container 1, but some false detections may exist among them. The second detection score threshold S2 of 0.58 and the similarity threshold S3 of 0.5 preset in the embodiment of the present application are then used to further suppress regression and remove these erroneous prediction frames. When a prediction frame is determined to be erroneous, it may be removed from container 1; when a prediction frame is determined to be correct, it is cached in cache container 2, and cache container 2 may be set to a capacity of 40 frames. It can be understood that setting cache container 2 to a capacity of 40 frames means that if a target leaves the picture for a short time and then enters again, it can be regarded as the same target; if the target still has not entered the picture 40 frames after leaving, its tracking information is forgotten, and if the target then enters the picture again, it is detected and tracked as a new target.
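The four cases above reduce to a short decision rule. The sketch below uses the example threshold values of this embodiment (S1 = 0.2, S2 = 0.58, S3 = 0.5); the function itself is only an illustration of the logic, not the patent's code.

    S1, S2, S3 = 0.2, 0.58, 0.5   # first/second detection score thresholds, similarity threshold

    def judge_prediction(dscore, diou):
        """Return True if the target prediction frame is judged correct."""
        if dscore < S1:                 # case 1: no head target currently detected
            return False
        if dscore >= S2:                # case 3: confident detection
            return True
        # S1 <= dscore < S2: decide by similarity to frames already determined correct
        return diou >= S3               # case 4 if True, case 2 if False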
In some embodiments of the present application, when the target prediction box is determined to be correct, the step S130 of calling the target tracking network to obtain the tracking information of the target prediction box may include the following steps:
s1, among the prediction frames determined to be correct, a prediction frame having the highest similarity to the target prediction frame is determined.
Specifically, among the prediction frames determined to be correct, for the prediction frame of each video frame image, if the target in the prediction frame leaves the picture for a short time and then enters again, it can be regarded as the same target. If the target leaves the picture for a longer time interval, set to 40 frames in the embodiment of the present application, and still has not entered the picture after 40 frames of the video image, the tracking information of that target prediction frame, such as its sequence number, is forgotten. If the target then enters the picture again, it is detected and tracked as a new target.
And S2, associating the tracking information corresponding to the prediction frame with the maximum similarity with the target prediction frame.
Specifically, as can be seen from sub-step S1 above, the prediction frames determined to be correct are retained over a time interval of 40 frames, so several cached prediction frames may have a similarity to the target prediction frame that is greater than the preset similarity threshold. It is therefore necessary to determine the prediction frame with the highest similarity to the target prediction frame and assign the tracking information corresponding to that prediction frame to the current target prediction frame.
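A hedged sketch of this association rule is given below; the cache layout, the 40-frame forgetting window and the helper names are assumptions used only for illustration.

    def associate_track(box, cache, frame_idx, next_id, similarity, sim_thresh=0.5, max_age=40):
        """Reuse the tracking ID of the most similar cached correct frame, else assign a new ID.

        `cache` maps track id -> (box, last_frame_idx); this structure is assumed
        purely to illustrate the association rule."""
        # Forget targets that have stayed out of the picture for more than 40 frames.
        for tid in [t for t, (_, last) in cache.items() if frame_idx - last > max_age]:
            del cache[tid]

        scores = {tid: similarity(box, b) for tid, (b, _) in cache.items()}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= sim_thresh:
            track_id = best            # same head target as a previously correct frame
        else:
            track_id = next_id         # treated as a newly appearing head target
        cache[track_id] = (box, frame_idx)
        return track_id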
In some embodiments of the present application, the target detection network and the target tracking network may form a joint model that is trained in a unified manner. The training process may include:
and S1, inputting the training image into the combined model to obtain a target prediction frame output by the model, the detection score of the target prediction frame and target tracking information, wherein the training image is marked with a target position frame and tracking information identification.
Specifically, the joint model may use the one-shot aggregation network ese_vovnet39b, and the training image labeled with the target position box and the tracking information identifier is input into the joint model to obtain the target prediction box output by the model, the detection score of the target prediction box, and the target tracking information.
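For illustration, a backbone named ese_vovnet39b is available in the public timm library, so a joint model of the kind described could be sketched as below; the head structures and output shapes are simplified assumptions, not the patent's architecture.

    import timm
    import torch
    import torch.nn as nn

    class JointDetTrackModel(nn.Module):
        # Illustrative joint model: a one-shot-aggregation backbone feeding a
        # detection head (box regression + score) and a tracking-embedding head.
        def __init__(self, num_ids, embed_dim=128):
            super().__init__()
            self.backbone = timm.create_model("ese_vovnet39b", pretrained=False,
                                              features_only=True)
            channels = self.backbone.feature_info.channels()[-1]
            self.det_head = nn.Conv2d(channels, 4 + 1, kernel_size=1)       # box + score maps
            self.track_head = nn.Conv2d(channels, embed_dim, kernel_size=1)  # tracking embedding
            self.id_classifier = nn.Linear(embed_dim, num_ids)

        def forward(self, x):
            feat = self.backbone(x)[-1]                     # deepest feature map
            det = self.det_head(feat)
            emb = self.track_head(feat).mean(dim=(2, 3))    # pooled tracking embedding
            return det[:, :4], det[:, 4], self.id_classifier(emb)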
S2, determining a first loss value based on the target prediction box and the target position box, determining a second loss value based on the detection score of the target prediction box, and determining a third loss value based on the target tracking information and the tracking information identification.
Specifically, the loss between the target prediction box bbox and the labeled target position box box can be calculated with CIOU, so as to obtain a first loss value lossbbox. The calculation formula may include:
lossbbox = CIoULoss(bbox, box);
wherein lossbbox represents the first loss value, and CIoULoss(bbox, box) represents the difference between the target prediction box bbox and the annotated target position box box calculated using CIOU.
On the other hand, a third loss value is determined based on the target tracking information and the tracking information identifier. During training, the error between the tracking information learned by the joint model and the tracking information identifier actually labeled on the training image should be as small as possible. In the embodiment of the present application, with the input training image denoted x and the tracking information identifier denoted y, the tracking result AdaCos(x, y) predicted by the target tracking head network may be computed using the adaptive cosine loss function AdaCos, and then, based on the predicted tracking result AdaCos(x, y), the tracking embedding loss between the prediction AdaCos(x, y) and the tracking information identifier y may be computed using the cross-entropy loss function CrossEntropy. The specific process may include the following formula:
losstrack=CrossEntropy(AdaCos(x,y),y)
s3, calculating an overall loss value using the first loss value, the second loss value and the third loss value;
specifically, in the process of unified training of a joint model formed by the target detection network and the target tracking network, the loss of a regression frame of the joint model may be adjusted, in this embodiment, the total loss of the joint model is calculated by combining the first loss value, the second loss value, and the third loss value, and the total loss may be calculated by performing weighted summation on the three loss values, and the calculation formula may include:
loss=α*losscls+β*lossbbox+γ*losstrack
therein, lossclsLoss value, loss, which may represent the detection score of the target prediction boxbboxMay represent the loss value, loss, of the target prediction boxtrackMay represent tracking information embedding loss values. Wherein alpha can be 0.3, beta can be 0.5, gamma can be 0.2, and the sum of the three loss weights is 1.
In the embodiment of the present application, considering that the model needs to pay more attention to the regression effect of the target prediction box bbox during training, the weight of the regression loss of the target prediction box is set to the higher value of 0.5; in addition, the loss of the detection score of the target prediction frame needs to guide the tracking head network in determining whether the currently regressed target prediction frame is a correct target prediction frame, so the weight of the detection score loss is set to 0.3; meanwhile, considering that only target prediction frames detected as correct need to be tracked, the tracking information embedding loss weight is set to 0.2. In the embodiment of the present application, α = 0.3, β = 0.5, and γ = 0.2 are preferred parameter settings determined by the applicant after many experiments.
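A hedged PyTorch sketch of this weighted joint loss is shown below; the CIoU term uses torchvision's complete_box_iou_loss (assuming a recent torchvision), and the tracking term replaces AdaCos with a fixed-scale cosine classifier for brevity, so this is a stand-in rather than the patent's exact losses.

    import math
    import torch
    import torch.nn.functional as F
    from torchvision.ops import complete_box_iou_loss  # assumes torchvision >= 0.15

    def joint_loss(pred_boxes, gt_boxes, pred_scores, gt_labels,
                   track_embeddings, id_weights, gt_ids,
                   alpha=0.3, beta=0.5, gamma=0.2):
        # lossbbox: CIoU regression loss between predicted and annotated boxes.
        loss_bbox = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

        # losscls: loss on the detection score of the target prediction frame.
        loss_cls = F.binary_cross_entropy_with_logits(pred_scores, gt_labels.float())

        # losstrack: cross entropy over cosine-similarity logits, a simplified
        # stand-in for AdaCos (the adaptive scale is fixed here for brevity).
        num_ids = id_weights.shape[0]
        scale = math.sqrt(2.0) * math.log(num_ids - 1)
        logits = scale * F.normalize(track_embeddings) @ F.normalize(id_weights).t()
        loss_track = F.cross_entropy(logits, gt_ids)

        # Weighted sum with alpha + beta + gamma = 1 (0.3, 0.5, 0.2 in this embodiment).
        return alpha * loss_cls + beta * loss_bbox + gamma * loss_track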
S4, updating parameters of the joint model based on the total loss value.
Specifically, by updating the parameters of the joint model based on the total loss value, a joint model that outputs highly accurate prediction results can be obtained. The applicant considers, however, that although the trained joint model can output highly accurate predictions, a joint model built on the one-shot aggregation network ese_vovnet39b is large and therefore inconvenient to deploy and use. A distillation method is therefore further considered for compressing the joint model, so that a joint small model that is easier to deploy and use can be obtained.
The backbone of the joint small model may be a real-time detection and tracking network, MNet25M, which may include multi-scale convolutional layers, stacked convolutional layers, and multi-scale feature maps, combined with deconvolution and bilinear sampling. The network may take RGB video frames of size 3 × 416 × 416 as input. By applying multi-scale convolution kernels to the input image, multi-scale feature maps of the same frame under different receptive fields can be obtained; meanwhile, by stacking multiple convolutions, more detailed, discriminative features can be extracted from the feature maps. Furthermore, applying the same convolution operation with kernels of the same scale to consecutive video frames yields feature maps of the same scale for different video frames. In this way, feature maps of the same scale across consecutive image frames of the video, and feature maps of different scales within the same image frame, can both be extracted. The image feature maps at the various scales are then down-sampled and up-sampled so that they have the same size, and the obtained features are finally activated with ReLU.
In this embodiment of the application, the joint small model may use MNet25M as a backbone network, perform feature learning on a training image through the backbone network, connect to a target detection network, extract features learned by the backbone network through the target detection network, and regress a target prediction frame and a corresponding detection score.
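MNet25M itself is not a public network, so the block below is only a generic illustration of the multi-scale convolution idea described above (parallel kernels with different receptive fields whose feature maps are resized to a common size by bilinear interpolation, fused, and ReLU-activated); it is not the actual MNet25M structure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleConvBlock(nn.Module):
        # Parallel convolutions over the same input with different receptive fields,
        # brought back to a common size (bilinear) and fused, then ReLU-activated.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2)
                for k, s in [(3, 1), (5, 2), (7, 4)]
            ])
            self.fuse = nn.Conv2d(out_ch * 3, out_ch, kernel_size=1)

        def forward(self, x):
            size = x.shape[-2:]
            feats = [F.interpolate(b(x), size=size, mode="bilinear", align_corners=False)
                     for b in self.branches]
            return F.relu(self.fuse(torch.cat(feats, dim=1)))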
Referring to fig. 4, fig. 4 is a framework diagram of the joint small model provided in the embodiment of the present application. In the embodiment of the present application, on the basis of connecting the backbone MNet25M with the target detection network, a target tracking network structure is added and tracking matching is performed on the detection results. In this way, if a falsely detected target appears during video image detection, regression of that detection frame in subsequent frames can be suppressed, further improving the accuracy of target detection. Each video frame is thus no longer detected as an isolated single image, which solves the problem that, when a target is falsely detected, it cannot be determined whether the currently detected target is a false detection, leading to a high false detection rate.
The following describes the target detection and tracking integrated device provided in the embodiment of the present application, and the target detection and tracking integrated device described below and the target detection and tracking integrated method described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an integrated target detection and tracking device disclosed in the embodiment of the present application.
As shown in fig. 5, the apparatus may include:
an image acquisition unit 11 for acquiring an image to be detected;
the target detection unit 12 is configured to input the image to be detected into a target detection network, so as to obtain a target prediction frame and a detection score thereof;
a judging unit 13, configured to judge whether the target prediction frame is correct based on a magnitude relationship between a detection score of the target prediction frame and a preset detection score threshold, and a similarity between the target prediction frame and a prediction frame that has been determined to be correct;
the target tracking unit 14 is configured to invoke a target tracking network to obtain tracking information of the target prediction frame;
a result obtaining unit 15, configured to output the target prediction frame, the detection score of the target prediction frame, and the tracking information of the target prediction frame.
Optionally, the judging unit 13 may include:
a first comparison result obtaining unit, configured to compare the detection score of the target prediction frame with a preset detection score threshold to obtain a first comparison result;
a similarity calculation unit configured to calculate a similarity between the target prediction frame and the prediction frame determined to be correct based on a positional relationship between the target prediction frame and the prediction frame determined to be correct;
the second comparison result acquisition unit is used for comparing the similarity with a preset similarity threshold value to obtain a second comparison result;
and the first judgment subunit is used for judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
Optionally, the similarity calculation unit may include:
an intersection ratio calculation unit for calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
a distance parameter calculation unit for calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct;
and the similarity obtaining unit is used for calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
Optionally, the distance parameter calculating unit may include:
a central point distance calculating unit, configured to calculate a distance between a central point of the target prediction frame and the central point of the prediction frame determined to be correct, and obtain a central point distance;
a diagonal distance calculation unit for calculating a diagonal distance between the target prediction box and the minimum bounding box of the prediction box determined to be correct;
and the ratio calculation unit is used for dividing the distance between the central points by the diagonal distance of the minimum external frame to obtain the distance parameter.
Optionally, the first comparison result includes a size relationship between the detection score of the target prediction box and a first detection score threshold, and the second comparison result includes a size relationship between the similarity and the preset similarity threshold, so that the determining process of the first determining subunit may include:
if the first comparison result is that the detection score of the target prediction frame is greater than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value, and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
Optionally, the first comparison result includes a magnitude relationship between the detection score of the target prediction box and a first detection score threshold and a second detection score threshold, and the second comparison result includes a magnitude relationship between the similarity and the preset similarity threshold, so that the determining process of the first determining subunit may include:
if the detection score of the target prediction frame is smaller than a first detection score threshold value, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold value but smaller than a second detection score threshold value, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold value, determining that the target prediction frame is wrong;
if the detection score of the target prediction frame is greater than or equal to the second detection score threshold value, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold value but smaller than the second detection score threshold value, and the similarity between the target prediction frame and the prediction frame which is determined to be correct is greater than the preset similarity threshold value, judging that the target prediction frame is correct;
wherein the second detection score threshold is greater than the first detection score threshold.
Optionally, the target tracking unit 14 may include:
a similarity comparison unit configured to determine, among the prediction frames that have been determined to be correct, a prediction frame having the greatest similarity to the target prediction frame;
and the association unit is used for associating the tracking information corresponding to the prediction frame with the maximum similarity with the target prediction frame.
Optionally, the target detection and tracking integrated apparatus may further include a joint model training unit configured to jointly train the target detection network and the target tracking network as a joint model, and the training process of the joint model may include:
inputting a training image into the joint model to obtain a target prediction frame output by the model, a detection score of the target prediction frame and target tracking information, wherein the training image is marked with a target position frame and a tracking information identifier;
determining a first loss value based on the target prediction box and the target location box, determining a second loss value based on a detection score of the target prediction box, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating a total loss value using the first loss value, the second loss value and the third loss value;
updating the parameters of the joint model based on the total loss value.
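A hedged PyTorch-style sketch of one such training step is given below; `joint_model` and the three loss functions are hypothetical placeholders, and the simple sum of the three losses is only one possible way of forming the total loss:

```python
import torch

def train_step(joint_model, optimizer, image, gt_boxes, gt_track_ids,
               box_loss_fn, score_loss_fn, track_loss_fn):
    """One joint-training step: three losses are combined into a total
    loss and the shared parameters are updated once."""
    pred_boxes, pred_scores, pred_track_info = joint_model(image)

    loss_box = box_loss_fn(pred_boxes, gt_boxes)                # first loss value
    loss_score = score_loss_fn(pred_scores, gt_boxes)           # second loss value
    loss_track = track_loss_fn(pred_track_info, gt_track_ids)   # third loss value

    total_loss = loss_box + loss_score + loss_track             # e.g. a plain sum

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```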
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and the same or similar parts of the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An integrated target detection and tracking method, characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame which is determined to be correct;
if the target prediction frame is correct, calling a target tracking network to obtain tracking information of the target prediction frame;
and outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
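By way of illustration only, and not as part of the claims, the overall flow of claim 1 can be sketched as follows; all function names (`detector`, `judge_correct`, `tracker`) are assumptions standing in for the target detection network, the judging step and the target tracking network:

```python
def detect_and_track(image, detector, tracker, judge_correct, correct_boxes):
    """End-to-end flow: detect, judge each prediction box, then track."""
    results = []
    for box, score in detector(image):                 # target prediction boxes and scores
        if judge_correct(box, score, correct_boxes):   # threshold + similarity judgment
            track_info = tracker(box, correct_boxes)   # tracking information
            correct_boxes.append(box)                  # remember boxes judged correct
            results.append((box, score, track_info))
    return results
```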
2. The method of claim 1, wherein determining whether the target prediction frame is correct based on a magnitude relationship between a detection score of the target prediction frame and a preset detection score threshold and a similarity between the target prediction frame and a prediction frame determined to be correct comprises:
comparing the detection score of the target prediction frame with a preset detection score threshold value to obtain a first comparison result;
calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the position relation between the target prediction frame and the prediction frame determined to be correct;
comparing the similarity with a preset similarity threshold to obtain a second comparison result;
and judging whether the target prediction frame is correct or not according to the first comparison result and the second comparison result.
3. The method according to claim 2, wherein calculating the similarity between the target prediction frame and the prediction frame determined to be correct based on the position relationship between the target prediction frame and the prediction frame determined to be correct comprises:
calculating the intersection ratio of the target prediction frame and the prediction frame determined to be correct;
calculating a distance parameter between the target prediction frame and the prediction frame determined to be correct;
and calculating the similarity between the target prediction frame and the prediction frame determined to be correct according to the intersection ratio and the distance parameter.
4. The method of claim 3, wherein calculating the distance parameter between the target prediction box and the prediction box determined to be correct comprises:
calculating the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct to obtain the center-point distance;
calculating the diagonal distance of the minimum bounding box enclosing the target prediction frame and the prediction frame determined to be correct;
and dividing the center-point distance by the diagonal distance of the minimum bounding box to obtain the distance parameter.
5. The method according to claim 4, wherein the similarity between the target prediction frame and the prediction frame determined to be correct is calculated according to the intersection ratio and the distance parameter, and the calculation formula includes:
l_DIOU = |A ∩ B| / |C| - E_2(o_1, o_2) / δ(a, d_1)
wherein A denotes the target prediction frame and B denotes the prediction frame determined to be correct; l_DIOU represents the similarity between the target prediction frame and the prediction frame determined to be correct, |A ∩ B| represents the intersection area of the target prediction frame and the prediction frame determined to be correct, |C| represents the area of the minimum bounding box of the union of the target prediction frame and the prediction frame determined to be correct, E_2(o_1, o_2) represents the distance between the center point of the target prediction frame and the center point of the prediction frame determined to be correct, and δ(a, d_1) represents the diagonal distance of the minimum bounding box enclosing the target prediction frame and the prediction frame determined to be correct.
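By way of an illustrative sketch only (not part of the claims; the function name and (x1, y1, x2, y2) box format are assumptions), the similarity above could be computed as:

```python
import math

def diou_similarity(box_a, box_b):
    """Similarity l_DIOU: intersection area over enclosing-box area,
    minus the center-distance / enclosing-diagonal penalty."""
    # Intersection area of the two boxes
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Minimum bounding box enclosing both boxes
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclosing = (ex2 - ex1) * (ey2 - ey1)
    diag = math.hypot(ex2 - ex1, ey2 - ey1)

    # Center-point distance between the two boxes
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    center_dist = math.hypot(cax - cbx, cay - cby)

    if enclosing <= 0 or diag <= 0:
        return 0.0
    return inter / enclosing - center_dist / diag
```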
6. The method according to claim 2, wherein the first comparison result comprises a magnitude relation between a detection score of the target prediction box and a first detection score threshold, and the second comparison result comprises a magnitude relation between the similarity and the preset similarity threshold;
then, according to the first comparison result and the second comparison result, determining whether the target prediction frame is correct includes:
if the first comparison result is that the detection score of the target prediction frame is greater than or equal to a preset first detection score threshold value, judging that the target prediction frame is correct;
and if the first comparison result is that the detection score of the target prediction frame is smaller than a preset first detection score threshold value, and the second comparison result is that the similarity between the target prediction frame and the prediction frame determined to be correct is larger than the preset similarity threshold value, judging that the target prediction frame is correct.
7. The method according to claim 2, wherein the first comparison result comprises magnitude relations between detection scores of the target prediction boxes and a first detection score threshold and a second detection score threshold respectively, and the second comparison result comprises magnitude relations between the similarity degrees and the preset similarity degree threshold;
then, according to the first comparison result and the second comparison result, determining whether the target prediction frame is correct includes:
if the detection score of the target prediction frame is smaller than a first detection score threshold value, or the detection score of the target prediction frame is larger than or equal to the first detection score threshold value but smaller than a second detection score threshold value, and the similarity between the target prediction frame and the prediction frame determined to be correct is smaller than the preset similarity threshold value, determining that the target prediction frame is wrong;
if the detection score of the target prediction frame is greater than or equal to the second detection score threshold value, or the detection score of the target prediction frame is greater than or equal to the first detection score threshold value but smaller than the second detection score threshold value, and the similarity between the target prediction frame and the prediction frame which is determined to be correct is greater than the preset similarity threshold value, judging that the target prediction frame is correct;
wherein the second detection score threshold is greater than the first detection score threshold.
8. The method of claim 1, wherein said invoking a target tracking network to obtain tracking information of the target prediction box comprises:
determining a prediction frame with the maximum similarity with the target prediction frame in all the prediction frames which are determined to be correct;
and associating the tracking information corresponding to the prediction frame with the maximum similarity with the target prediction frame.
9. The method of claim 1, wherein the target detection network and the target tracking network are jointly trained as a joint model, and the training process of the joint model comprises:
inputting a training image into the joint model to obtain a target prediction frame output by the model, a detection score of the target prediction frame and target tracking information, wherein the training image is marked with a target position frame and a tracking information identifier;
determining a first loss value based on the target prediction box and the target location box, determining a second loss value based on a detection score of the target prediction box, and determining a third loss value based on the target tracking information and the tracking information identification;
calculating a total loss value using the first loss value, the second loss value and the third loss value;
updating the parameters of the joint model based on the total loss value.
10. An integrated target detection and tracking device, comprising:
the image acquisition unit is used for acquiring an image to be detected;
the target detection unit is used for inputting the image to be detected into a target detection network to obtain a target prediction frame and a detection score thereof;
the judging unit is used for judging whether the target prediction frame is correct or not based on the magnitude relation between the detection score of the target prediction frame and a preset detection score threshold value and the similarity between the target prediction frame and the prediction frame which is determined to be correct;
the target tracking unit is used for calling a target tracking network to obtain tracking information of the target prediction frame;
and the result acquisition unit is used for outputting the target prediction frame, the detection score of the target prediction frame and the tracking information of the target prediction frame.
CN202110956341.6A 2021-08-19 2021-08-19 Target detection and tracking integrated method and device Active CN113642490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956341.6A CN113642490B (en) 2021-08-19 2021-08-19 Target detection and tracking integrated method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110956341.6A CN113642490B (en) 2021-08-19 2021-08-19 Target detection and tracking integrated method and device

Publications (2)

Publication Number Publication Date
CN113642490A true CN113642490A (en) 2021-11-12
CN113642490B CN113642490B (en) 2023-07-11

Family

ID=78422920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956341.6A Active CN113642490B (en) 2021-08-19 2021-08-19 Target detection and tracking integrated method and device

Country Status (1)

Country Link
CN (1) CN113642490B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040030500A1 (en) * 2002-08-08 2004-02-12 Struzinski William A. Target track crossing prediction/detection
CN108182693A (en) * 2017-12-12 2018-06-19 嘉兴慧康智能科技有限公司 A kind of multiple target tracking algorithm based on tracking segment confidence level and appearance study
CN111931686A (en) * 2020-08-26 2020-11-13 北京建筑大学 Video satellite target tracking method based on background knowledge enhancement
CN112651998A (en) * 2021-01-18 2021-04-13 沈阳航空航天大学 Human body tracking algorithm based on attention mechanism and double-current multi-domain convolutional neural network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710944A (en) * 2024-02-05 2024-03-15 虹软科技股份有限公司 Model defect detection method, model training method, target detection method and target detection system

Also Published As

Publication number Publication date
CN113642490B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN110287960B (en) Method for detecting and identifying curve characters in natural scene image
US11393103B2 (en) Target tracking method, device, system and non-transitory computer readable medium
US10970871B2 (en) Estimating two-dimensional object bounding box information based on bird's-eye view point cloud
CN103390164B (en) Method for checking object based on depth image and its realize device
KR101818129B1 (en) Device and method for pedestrian recognition using convolutional neural network
CN106934346A (en) A kind of method of target detection performance optimization
CN111914838B (en) License plate recognition method based on text line recognition
CN115240130A (en) Pedestrian multi-target tracking method and device and computer readable storage medium
CN104008370A (en) Video face identifying method
CN111639616A (en) Heavy identity recognition method based on deep learning
CN110674680B (en) Living body identification method, living body identification device and storage medium
US8953852B2 (en) Method for face recognition
CN114067186B (en) Pedestrian detection method and device, electronic equipment and storage medium
CN113743260B (en) Pedestrian tracking method under condition of dense pedestrian flow of subway platform
CN112784712B (en) Missing child early warning implementation method and device based on real-time monitoring
CN111126275A (en) Pedestrian re-identification method and device based on multi-granularity feature fusion
CN112446292B (en) 2D image salient object detection method and system
CN110147748A (en) A kind of mobile robot obstacle recognition method based on road-edge detection
CN113139549A (en) Parameter self-adaptive panorama segmentation method based on multitask learning
CN116310837A (en) SAR ship target rotation detection method and system
CN114332655A (en) Vehicle self-adaptive fusion detection method and system
CN113642490A (en) Target detection and tracking integrated method and device
JP7001150B2 (en) Identification system, model re-learning method and program
CN113269038B (en) Multi-scale-based pedestrian detection method
CN114220063A (en) Target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant