CN111161311A - Visual multi-target tracking method and device based on deep learning - Google Patents
- Publication number
- CN111161311A CN111161311A CN201911252433.5A CN201911252433A CN111161311A CN 111161311 A CN111161311 A CN 111161311A CN 201911252433 A CN201911252433 A CN 201911252433A CN 111161311 A CN111161311 A CN 111161311A
- Authority
- CN
- China
- Prior art keywords
- tracking
- target
- image
- cross
- template
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/251—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The embodiment of the invention provides a visual multi-target tracking method and device based on deep learning. The method comprises the following steps: sequentially acquiring candidate detection frames of the tracking targets in the current video frame through a target detection network model, recording their coordinate position information, and acquiring the corresponding template images; acquiring the image of each frame except the 1st frame in the video as the image of the region to be searched; and inputting each template image together with the image of the region to be searched into a target tracking network model constructed from a twin (Siamese) convolutional neural network to obtain the tracking result of each tracking target. Because the template images acquired by the target detection network model and the images of the region to be searched are fed into the twin-network tracker separately for each target, the computational cost is low, and real-time, accurate multi-target tracking is achieved.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual multi-target tracking method and device based on deep learning.
Background
Visual target tracking is a hot problem in the field of computer vision research. With the rapid development of computer technology, target tracking technology has also improved greatly, and with the rapid rise of artificial intelligence in recent years, research on target tracking is receiving more and more attention.
Deep learning technology has strong feature representation capability and achieves better results than traditional methods in applications such as image classification, object recognition, and natural language processing, so it has gradually become the mainstream technology of image and video research. Tracking methods based on deep learning form an important branch of target tracking: by exploiting the end-to-end training of deep convolutional networks, the model automatically learns the appearance and motion characteristics of the target, realizing high-quality, robust tracking.
In recent years, work on multi-target tracking has also been reported. However, the multi-target tracking methods disclosed in the prior art generally involve a large amount of computation and cannot achieve real-time tracking, so the tracking effect is poor.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a visual multi-target tracking method and apparatus based on deep learning.
In a first aspect, an embodiment of the present invention provides a visual multi-target tracking method based on deep learning, including: sequentially acquiring candidate detection frames of a tracking target in the current video frame through a target detection network model according to the frame sequence of the video, recording coordinate position information of the candidate detection frames, and acquiring template images corresponding to the candidate detection frames according to the coordinate position information, wherein there are one or more tracking targets; acquiring the image of each frame except the 1st frame in the video, and taking these images as the images of the region to be searched; respectively inputting each template image and the image of the region to be searched into a target tracking network model constructed from a twin convolutional neural network; and acquiring the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
Further, the target detection network model is a YOLOv3 network model.
Further, the obtaining of the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model includes: respectively extracting features of the template image and the image of the region to be searched to obtain a template feature image and a feature image of the region to be searched; performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map; obtaining the feature map row with the highest class probability from the cross-correlation result feature map, and performing channel-transformation convolution operations using the feature map row to obtain a classification branch response map and a regression branch response map respectively; and acquiring the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map.
Further, the performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map includes: sliding the template feature image over the feature image of the region to be searched and performing the cross-correlation operation channel by channel to obtain the cross-correlation result feature map.
Further, the cross-correlation result feature map comprises a first cross-correlation result feature map and a second cross-correlation result feature map. The performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map comprises: performing a convolution operation on the template feature image to obtain two classification branch feature maps, and performing a convolution operation on the feature image of the region to be searched to obtain two regression branch feature maps; and pairing each classification branch feature map with one regression branch feature map and performing the cross-correlation operation on each pair to obtain the first cross-correlation result feature map and the second cross-correlation result feature map. The obtaining of the feature map row with the highest class probability from the cross-correlation result feature map, and performing channel-transformation convolution operations using the feature map row to obtain the classification branch response map and the regression branch response map respectively, comprises: obtaining a first feature map row with the highest class probability from the first cross-correlation result feature map, and performing a channel-transformation convolution operation using the first feature map row to obtain the classification branch response map; and obtaining a second feature map row with the highest class probability from the second cross-correlation result feature map, and performing a channel-transformation convolution operation using the second feature map row to obtain the regression branch response map.
Further, the obtaining of the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map includes: sorting a plurality of target detection frames corresponding to the tracking target through the classification branch response map; and predicting the bounding box of each target detection frame through the regression branch response map, and obtaining the bounding box corresponding to the tracking result by using a preset screening algorithm.
Further, the sorting of the plurality of target detection frames corresponding to the tracking target through the classification branch response map includes: screening out a plurality of target detection frames corresponding to the tracking target through the classification branch response map, and sorting the target detection frames through a cosine window and a scale penalty. The preset screening algorithm is a non-maximum suppression algorithm.
In a second aspect, an embodiment of the present invention provides a visual multi-target tracking device based on deep learning, including: a template image acquisition module configured to: sequentially acquire candidate detection frames of a tracking target in the current video frame through a target detection network model according to the frame sequence of the video, record coordinate position information of the candidate detection frames, and acquire template images corresponding to the candidate detection frames according to the coordinate position information, wherein there are one or more tracking targets; a region-to-be-searched image acquisition module configured to: acquire the image of each frame except the 1st frame in the video, and take these images as the images of the region to be searched; and a tracking result acquisition module configured to: respectively input each template image and the image of the region to be searched into a target tracking network model constructed from a twin convolutional neural network, and acquire the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method provided in the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
According to the visual multi-target tracking method and device based on deep learning provided by the embodiment of the invention, the candidate detection frames of the tracking targets are obtained in real time by the target detection network model, and the corresponding template images are obtained from them. The template image corresponding to each tracking target and the image of the region to be searched are input into the target tracking network model constructed from the twin convolutional neural network, and the tracking result of the tracking target corresponding to each template image is acquired from the output of the target tracking network model. The computational cost is therefore low, and real-time, accurate multi-target tracking is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a deep learning-based visual multi-target tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic processing flow diagram of a target tracking network model in the deep learning-based visual multi-target tracking method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a deep learning-based visual multi-target tracking device according to an embodiment of the present invention;
fig. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a visual multi-target tracking method based on deep learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:
101, sequentially acquiring candidate detection frames of a tracking target in the current video frame through a target detection network model according to the frame sequence of the video, recording coordinate position information of the candidate detection frames, and acquiring template images corresponding to the candidate detection frames according to the coordinate position information;
102, acquiring the image of each frame except the 1st frame in the video, and taking these images as the images of the region to be searched;
103, respectively inputting each template image and the image of the region to be searched into a target tracking network model constructed from a twin convolutional neural network, and acquiring the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
The target detection network model performs target detection on the preset tracking targets for each frame of image in the video. As time goes by, the tracking targets in the video frames change: some tracking targets disappear, and new tracking targets appear. Therefore, performing target detection on every frame through the target detection network model enables real-time updating of the set of tracking targets.
Specifically, in the process of target detection, the visual multi-target tracking device based on deep learning sequentially obtains the candidate detection frames of the tracking targets in the current video frame through the target detection network model according to the frame sequence of the video, records the coordinate position information of the candidate detection frames, and obtains the template images corresponding to the candidate detection frames according to the coordinate position information. If the current video frame contains tracking targets, there is at least one, and there may be several; each candidate detection frame corresponds to one tracking target.
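The template-extraction step above can be sketched in a few lines. This is a minimal illustration rather than the patent's implementation: the (x1, y1, x2, y2) box format and the absence of any context margin around the target are assumptions — the patent only states that the template image is cut out using the recorded coordinate position information.

```python
import numpy as np

def crop_template(frame, box):
    """Crop a template image for one candidate detection box.

    `frame` is an H x W x 3 array; `box` is (x1, y1, x2, y2) in pixel
    coordinates, as recorded by the detection stage (format assumed).
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    # Clip the box to the frame so out-of-bounds detections stay valid.
    x1, x2 = max(0, int(x1)), min(w, int(x2))
    y1, y2 = max(0, int(y1)), min(h, int(y2))
    return frame[y1:y2, x1:x2]

# One template image per candidate detection frame in the current frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = [(10, 20, 110, 220), (600, 400, 700, 500)]  # second box exceeds frame
templates = [crop_template(frame, b) for b in boxes]
```

Each template would then be resized to the fixed template size (127 × 127 in Fig. 2) before being fed to the tracking network.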
The visual multi-target tracking device based on deep learning acquires the image of each frame except the 1st frame in the video and takes it as the image of the region to be searched; that is, the tracking targets are found and tracked in the image of the region to be searched.
After the visual multi-target tracking device based on deep learning obtains the template images and the images of the region to be searched, each template image and the image of the region to be searched are input into the target tracking network model constructed from the twin (Siamese) convolutional neural network. This model comprises two networks that share weights; the template image and the image of the region to be searched are input into the two networks respectively, and the tracking result is obtained through a correlation calculation.
The embodiment of the invention can remove target objects that disappear from the video image. For a target object that newly appears in the video, the target detection network detects it and stores its position-coordinate detection frame information, and the target tracking network model continuously acquires this detection frame information and automatically tracks the object, thereby ensuring the accuracy and real-time performance of multi-target tracking.
According to the embodiment of the invention, the candidate detection frames of the tracking targets are obtained in real time by the target detection network model, and the corresponding template images are obtained from them. The template image corresponding to each tracking target and the image of the region to be searched are input into the target tracking network model constructed from the twin convolutional neural network, and the tracking result of the tracking target corresponding to each template image is acquired from the output of the target tracking network model. The computational cost is therefore low, and real-time, accurate multi-target tracking is realized.
Further, based on the above embodiment, the target detection network model is a YOLOv3 network model.
The YOLOv3 algorithm achieves good accuracy and speed in object detection and recognition, so the embodiment of the invention adopts the YOLOv3 network model to detect target objects. YOLOv3 follows an end-to-end approach and is trained with the Darknet framework: the model takes the whole image as the network input and, using a regression method, directly regresses the positions and categories of bounding boxes at the output layer to recognize target objects, and the coordinate position information of the candidate frames of the target objects is stored.
On the basis of the above embodiment, the method and the device provided by the embodiment of the invention improve the accuracy of tracking target identification in multi-target tracking by adopting the YOLOv3 network model for target detection.
Fig. 2 is a schematic processing flow diagram of the target tracking network model in the deep learning-based visual multi-target tracking method according to an embodiment of the present invention. As shown in Fig. 2, the obtaining of the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model includes: respectively extracting features of the template image and the image of the region to be searched to obtain a template feature image and a feature image of the region to be searched; performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map; obtaining the feature map row with the highest class probability from the cross-correlation result feature map, and performing channel-transformation convolution operations using the feature map row to obtain a classification branch response map and a regression branch response map respectively; and acquiring the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map.
Specifically, the process of obtaining the tracking result of the tracked target using the target tracking network model is as follows. Features are extracted from the template image and from the image of the region to be searched, respectively, to obtain a template feature image and a feature image of the region to be searched. Since the image of the region to be searched is taken from the entire video frame while the template image is taken from a tracking target within the frame, the template image is generally smaller than the image of the region to be searched, and the template feature image is correspondingly smaller than the feature image of the region to be searched.
As shown in Fig. 2, the image of size 127 × 127 × 3 is the template image, and the image of size 255 × 255 × 3 is the image of the region to be searched. The numbers indicate the dimensions of the image: in 127 × 127 × 3, 127 × 127 is the height × width of the image and 3 is the number of channels (RGB). Features are then extracted by the target tracking network model to obtain the respective feature images: 15 × 15 × 256 denotes the template feature image obtained by feature extraction from the template image, and 31 × 31 × 256 denotes the feature image of the region to be searched obtained by feature extraction from the image of the region to be searched. Here g_θ denotes the feature extraction operation performed by the twin neural network.
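The sizes quoted from Fig. 2 (a 127 × 127 template mapping to 15 × 15 features, a 255 × 255 search region mapping to 31 × 31) are consistent with a padding-free backbone of total stride 8. The patent does not give the layer configuration, so the three-layer stack below is purely an assumed example chosen to reproduce that size arithmetic:

```python
def conv_out(n, k, s):
    """Spatial output size of an unpadded ("valid") convolution."""
    return (n - k) // s + 1

def backbone_size(n):
    # Three unpadded 3x3 convolutions with stride 2 (total stride 8);
    # this exact configuration is an assumption, chosen only so that
    # the output sizes match Fig. 2 of the patent.
    for _ in range(3):
        n = conv_out(n, 3, 2)
    return n

print(backbone_size(127))  # template image -> 15  (15 x 15 x 256 features)
print(backbone_size(255))  # search image   -> 31  (31 x 31 x 256 features)
```

Because both branches share the same weights g_θ, the same size rule applies to the template and to the region to be searched.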
A cross-correlation operation (denoted by ★ in Fig. 2) is then performed on the template feature image and the feature image of the region to be searched: the template feature image is slid over the feature image of the region to be searched to obtain the cross-correlation result feature map (17 × 17 × 256). During the cross-correlation calculation, the operation is performed channel by channel as the template feature image slides, so the number of channels remains unchanged.
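The channel-by-channel (depthwise) cross-correlation described above can be sketched directly in NumPy. This is an illustrative re-implementation using the shapes from Fig. 2, not the patent's code:

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Channel-by-channel cross-correlation of template features `z`
    (hz, wz, c) slid over search-region features `x` (hx, wx, c).

    Each channel of `z` is correlated only with the same channel of `x`,
    so the channel count is preserved, as described in the patent.
    """
    hz, wz, c = z.shape
    hx, wx, _ = x.shape
    ho, wo = hx - hz + 1, wx - wz + 1
    out = np.empty((ho, wo, c), dtype=np.float64)
    for i in range(ho):
        for j in range(wo):
            # Sum over the spatial window, separately for every channel.
            out[i, j] = (x[i:i + hz, j:j + wz] * z).sum(axis=(0, 1))
    return out

# Shapes from Fig. 2: a 15x15x256 template slid over 31x31x256 search
# features yields a 17x17x256 cross-correlation result feature map.
z = np.random.rand(15, 15, 256)
x = np.random.rand(31, 31, 256)
r = depthwise_xcorr(z, x)
print(r.shape)  # (17, 17, 256)
```

In a real tracker this would be done on the GPU (e.g. as a grouped convolution), but the arithmetic is the same.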
The feature map row with the highest class probability is then obtained from the cross-correlation result feature map; "highest class probability" means the highest fitting confidence over the whole cross-correlation result feature map. After the cross-correlation operation, a 17 × 17 × 256 feature map is obtained, and the feature map row is the feature cube with the highest class probability in this map (for example, a 1 × 1 × 256 feature map). The cross-correlation result feature map is connected to two branches; each branch passes through two layers of 1 × 1 channel-transformation convolution, which leaves the spatial size of the feature map unchanged, yielding a classification branch response map (17 × 17 × 2k in Fig. 2) and a regression branch response map (17 × 17 × 4k in Fig. 2) respectively. b_σ and S_φ denote convolution operations. k is the number of target detection frames, that is, the number of detection frames of different sizes corresponding to each position. The classification branch response map is used to screen target detection frames by score, and the regression branch response map lets the network learn to regress object positions so that more accurate bounding-box predictions are obtained. The tracking result of the tracking target corresponding to the template image is therefore acquired from the classification branch response map and the regression branch response map, completing the tracking of the tracked target.
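The channel-transformation convolutions above are 1 × 1 convolutions, i.e. per-position linear maps over the channel dimension that leave the 17 × 17 spatial size unchanged. A shape-level sketch follows; the value k = 5 and the random weights are assumptions, used only to show the 2k/4k channel counts:

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution = independent channel transformation per pixel.

    `feat` is (h, w, c_in); `weight` is (c_in, c_out).  The spatial
    size is unchanged, as for the channel-transformation convolutions
    that produce the 17x17x2k and 17x17x4k response maps.
    """
    return np.einsum('hwc,cd->hwd', feat, weight)

k = 5                               # detection frames per position (assumed)
feat = np.random.rand(17, 17, 256)  # cross-correlation result feature map
cls_map = conv1x1(feat, np.random.rand(256, 2 * k))  # classification branch
reg_map = conv1x1(feat, np.random.rand(256, 4 * k))  # regression branch
print(cls_map.shape, reg_map.shape)
```

The 2k channels hold a foreground/background score per detection frame, and the 4k channels hold the four box-offset values per detection frame.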
On the basis of the above embodiment, tracking of the tracked target is realized with the target tracking network model through feature extraction, the cross-correlation operation, and the acquisition of the classification branch response map and the regression branch response map, improving the accuracy of multi-target tracking.
Further, based on the above embodiment, the performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map includes: sliding the template feature image over the feature image of the region to be searched and performing the cross-correlation operation channel by channel to obtain the cross-correlation result feature map.
On the basis of the above embodiment, in the embodiment of the present invention, the template feature image is slid over the feature image of the region to be searched and the cross-correlation operation is performed channel by channel, so the number of channels remains unchanged.
Further, based on the above embodiment, the cross-correlation result feature map comprises a first cross-correlation result feature map and a second cross-correlation result feature map. The performing a cross-correlation operation on the template feature image and the feature image of the region to be searched to obtain a cross-correlation result feature map includes: performing a convolution operation on the template feature image to obtain two classification branch feature maps, and performing a convolution operation on the feature image of the region to be searched to obtain two regression branch feature maps; and pairing each classification branch feature map with one regression branch feature map and performing the cross-correlation operation on each pair to obtain the first cross-correlation result feature map and the second cross-correlation result feature map. The obtaining of the feature map row with the highest class probability from the cross-correlation result feature map, and performing channel-transformation convolution operations using the feature map row to obtain the classification branch response map and the regression branch response map respectively, includes: obtaining a first feature map row with the highest class probability from the first cross-correlation result feature map, and performing a channel-transformation convolution operation using the first feature map row to obtain the classification branch response map; and obtaining a second feature map row with the highest class probability from the second cross-correlation result feature map, and performing a channel-transformation convolution operation using the second feature map row to obtain the regression branch response map.
The cross-correlation result feature map comprises a first cross-correlation result feature map and a second cross-correlation result feature map. A convolution operation is performed on the template feature image to obtain two identical classification branch feature maps, and a convolution operation is performed on the feature image of the region to be searched to obtain two identical regression branch feature maps. Each classification branch feature map is paired with one regression branch feature map for the cross-correlation operation: one classification branch feature map is paired with one regression branch feature map, and the other classification branch feature map is paired with the other regression branch feature map, yielding the first cross-correlation result feature map and the second cross-correlation result feature map respectively.
obtaining a first feature map row with highest class probability according to the feature map of the first cross-correlation operation result, wherein the first feature map row is a feature cube (such as a feature map of 1 × 256) with highest class probability in the feature map of the first cross-correlation operation result; performing channel transformation convolution operation by using the first characteristic diagram row, and setting relevant labels of the classification branches to obtain a response diagram of the classification branches
A second feature map row with the highest class probability is obtained from the second cross-correlation result feature map; this row is the feature cube (e.g., a 1 × 256 feature map) with the highest class probability in the second cross-correlation result feature map. A channel transformation convolution operation is performed on the second feature map row, and the relevant labels of the regression branch are set, to obtain the regression branch response map.
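The branch pairing and the selection of the highest-probability "feature map row" described above can be sketched in numpy. All shapes, the channel-sum scoring rule, and the 1×1 "channel transformation" weight matrices `W_cls`/`W_reg` are illustrative assumptions, not the patent's exact configuration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def xcorr(kernel, feat):
    """Channel-preserving cross-correlation of a kernel (C,kh,kw) over features (C,fh,fw)."""
    C, kh, kw = kernel.shape
    win = sliding_window_view(feat, (kh, kw), axis=(1, 2))  # (C, oh, ow, kh, kw)
    return np.einsum('cijkl,ckl->cij', win, kernel)

rng = np.random.default_rng(0)
cls1, cls2 = rng.random((2, 256, 4, 4))     # two classification branch feature maps (template side)
reg1, reg2 = rng.random((2, 256, 20, 20))   # two regression branch feature maps (search side)

first = xcorr(cls1, reg1)                   # first cross-correlation result feature map
second = xcorr(cls2, reg2)                  # second cross-correlation result feature map

def best_row(result):
    """Pick the 1 x 256 'feature map row' at the highest-scoring spatial location
    (channel sum used here as a stand-in for the class probability)."""
    score = result.sum(axis=0)
    i, j = np.unravel_index(score.argmax(), score.shape)
    return result[:, i, j]

W_cls = rng.random((10, 256))               # assumed 1x1 channel-transformation weights
W_reg = rng.random((20, 256))               # assumed 1x1 channel-transformation weights
cls_response = W_cls @ best_row(first)      # classification branch response (channels assumed)
reg_response = W_reg @ best_row(second)     # regression branch response (channels assumed)
print(first.shape, cls_response.shape)      # (256, 17, 17) (10,)
```

The point of the sketch is the structure: two independent template/search pairs produce two independent result maps, one feeding each branch head.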
On the basis of the above embodiment, the embodiment of the invention performs convolution operations on the template feature image and on the feature image of the area to be searched to obtain two pairs of classification branch and regression branch feature maps, and performs the cross-correlation operation on each pair to obtain the cross-correlation result feature maps. This improves the accuracy of the cross-correlation result and, in turn, the accuracy of classification and tracking.
Further, based on the above embodiment, obtaining the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map includes: screening out a plurality of target detection boxes corresponding to the tracking target by sorting the classification branch response map; predicting the bounding box of each target detection box through the regression branch; and obtaining the bounding box corresponding to the tracking result using a preset screening algorithm.
When the tracking result of the tracking target corresponding to the template image is obtained according to the classification branch response map and the regression branch response map, a plurality of target detection boxes corresponding to the tracking target are first screened out through the classification branch response map, and these detection boxes are sorted through a cosine window and a scale penalty; the plurality of target detection boxes corresponding to the tracking target are thus screened out by sorting the classification branch response map. The bounding box of each target detection box is then predicted through the regression branch, and the bounding box corresponding to the tracking result is obtained using a preset screening algorithm (e.g., a non-maximum suppression algorithm).
During prediction, the top k candidate targets are screened out from the classification branch, re-ranked through a cosine window and a scale penalty, the bounding box of each candidate is obtained from the regression branch, and the final result is obtained with a non-maximum suppression algorithm.
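The prediction step above can be sketched as follows. The box format, the window influence weight, the penalty constant, and the IoU threshold are illustrative assumptions, and the scores and boxes are random stand-ins for the network outputs:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Plain non-maximum suppression; returns kept indices, best first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        mask = np.array([iou(boxes[i], boxes[j]) < thresh for j in order[1:]], dtype=bool)
        order = order[1:][mask]
    return keep

rng = np.random.default_rng(1)
score_map = rng.random((17, 17))                       # classification branch scores
hann = np.outer(np.hanning(17), np.hanning(17))        # cosine window
w = 0.3                                                # assumed window influence
scores = ((1 - w) * score_map + w * hann).ravel()      # cosine-window re-ranking

k = 5
top = np.argsort(scores)[::-1][:k]                     # top-k candidate locations
boxes = rng.random((17 * 17, 4)) * 50                  # hypothetical regressed boxes
boxes[:, 2:] += boxes[:, :2] + 10                      # guarantee x2 > x1, y2 > y1

sizes = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
prev = sizes.mean()                                    # stand-in for the previous frame's box size
change = np.maximum(sizes / prev, prev / sizes)        # relative size change
penalized = scores[top] * np.exp(-0.1 * (change - 1))  # scale penalty (constant 0.1 assumed)

keep = nms(boxes[top], penalized, thresh=0.5)          # final boxes after NMS
print(len(keep))
```

The cosine window biases candidates toward the centre of the search region, and the scale penalty discounts candidates whose size changed abruptly between frames; NMS then removes overlapping duplicates.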
On the basis of the above embodiment, the embodiment of the invention screens out a plurality of target detection boxes corresponding to the tracking target by sorting the classification branch response map, predicts the bounding box of each target detection box through the regression branch, and obtains the bounding box corresponding to the tracking result using a preset screening algorithm, thereby ensuring the reliability of multi-target tracking. By selecting suitable sorting and screening algorithms for the target detection boxes and for the bounding boxes, the accuracy of multi-target tracking is improved.
The embodiment of the invention provides a multi-target tracking method that combines target detection with a deep-learning-based target tracking algorithm. It can accurately identify and track target objects, and because training is performed offline, network inference is fast and real-time performance can be achieved.
Fig. 3 is a schematic structural diagram of a deep-learning-based visual multi-target tracking device according to an embodiment of the present invention. As shown in Fig. 3, the device includes a template image obtaining module 10, an image obtaining module 20 of the area to be searched, and a tracking result obtaining module 30.
The template image obtaining module 10 is configured to: sequentially acquire candidate detection boxes of a tracking target in the current video frame through a target detection network model according to the frame order of the video, record the coordinate position information of the candidate detection boxes, and acquire the template images corresponding to the candidate detection boxes according to the coordinate position information, wherein there are one or more tracking targets.
The image obtaining module 20 of the area to be searched is configured to: acquire the image of each frame in the video except the 1st frame and take it as the image of the area to be searched.
The tracking result obtaining module 30 is configured to: input each template image and the image of the area to be searched respectively into a target tracking network model constructed from a twin (siamese) convolutional neural network, and acquire the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
In the embodiment of the invention, the candidate detection boxes of the tracking targets are obtained in real time using the target detection network model, from which the corresponding template images are obtained. The template image corresponding to each tracking target and the image of the area to be searched are then input respectively into the target tracking network model constructed from the twin convolutional neural network, and the tracking result of the tracking target corresponding to each template image is obtained from the output of the target tracking network model. The computational cost is therefore low, and real-time, accurate tracking of multiple targets is achieved.
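The per-frame data flow described above can be sketched schematically. The detector and the siamese tracker are stood in by trivial stubs (`detect_targets`, `siamese_track` are hypothetical names, not APIs from the patent), and for simplicity the sketch detects once on the first frame only:

```python
import numpy as np

def detect_targets(frame):
    """Stub for the detection network (e.g. YOLOv3): returns candidate boxes (x, y, w, h)."""
    return [(10, 10, 20, 20), (40, 40, 20, 20)]

def crop_template(frame, box):
    """Crop a template image from the frame using the recorded box coordinates."""
    x, y, w, h = box
    return frame[y:y+h, x:x+w]

def siamese_track(template, search_image):
    """Stub for the twin-CNN tracker: returns the tracked box in the search image."""
    h, w = template.shape[:2]
    return (0, 0, w, h)  # placeholder result

video = [np.zeros((100, 100)) for _ in range(3)]   # dummy 3-frame video

first = video[0]
templates = [crop_template(first, b) for b in detect_targets(first)]  # one template per target

results = []
for frame in video[1:]:                 # every frame after the 1st is a search region
    results.append([siamese_track(t, frame) for t in templates])

print(len(results), len(results[0]))    # 2 search frames, 2 tracked targets each
```

Each template/search pair is an independent forward pass of the same tracking network, which is how the method scales from single-target to multi-target tracking.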
Further, based on the above embodiment, the target detection network model is a YOLOv3 network model.
On the basis of the above embodiment, the method and device provided by the embodiment of the invention adopt the YOLOv3 network model for target detection, which improves the accuracy of tracking-target identification in multi-target tracking.
Further, based on the above embodiment, when the tracking result obtaining module 30 is configured to obtain the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model, it is specifically configured to: extract features from the template image and from the image of the area to be searched, respectively, to obtain a template feature image and a feature image of the area to be searched; perform the cross-correlation operation on the template feature image and the feature image of the area to be searched to obtain a cross-correlation result feature map; obtain the feature map row with the highest class probability from the cross-correlation result feature map and perform a channel transformation convolution operation on it to obtain a classification branch response map and a regression branch response map, respectively; and obtain the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map.
On the basis of the above embodiment, the target tracking network model completes the tracking of the tracked object through feature extraction, the cross-correlation operation, and the acquisition of the classification branch response map and the regression branch response map, which improves the accuracy of multi-target tracking.
Further, based on the above embodiment, when the tracking result obtaining module 30 is configured to perform the cross-correlation operation on the template feature image and the feature image of the area to be searched to obtain a cross-correlation result feature map, it is specifically configured to: slide the template feature image over the feature image of the area to be searched and perform the cross-correlation operation channel by channel to obtain the cross-correlation result feature map.
On the basis of the above embodiment, sliding the template feature image over the feature image of the area to be searched and performing the cross-correlation operation channel by channel ensures that the number of channels remains unchanged.
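A minimal numpy sketch of this channel-by-channel ("depthwise") cross-correlation: the template feature map slides over the search-region feature map and each channel is correlated independently, so the channel count of the result equals that of the inputs. The 256-channel and 6×6/22×22 spatial sizes are illustrative assumptions:

```python
import numpy as np

def depthwise_xcorr(template, search):
    """template: (C, th, tw), search: (C, sh, sw) -> (C, sh-th+1, sw-tw+1)."""
    C, th, tw = template.shape
    _, sh, sw = search.shape
    oh, ow = sh - th + 1, sw - tw + 1
    out = np.empty((C, oh, ow))
    for c in range(C):                      # one channel at a time
        for i in range(oh):                 # slide the template over the search features
            for j in range(ow):
                out[c, i, j] = np.sum(template[c] * search[c, i:i+th, j:j+tw])
    return out

t = np.random.rand(256, 6, 6)    # hypothetical template features
s = np.random.rand(256, 22, 22)  # hypothetical search-region features
r = depthwise_xcorr(t, s)
print(r.shape)                   # (256, 17, 17) - channel count unchanged
```

In practice this loop would be a grouped convolution in a deep learning framework; the explicit loops are only to make the channel-wise sliding visible.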
Further, based on the above embodiment, the cross-correlation result feature map includes a first cross-correlation result feature map and a second cross-correlation result feature map. When the tracking result obtaining module 30 is configured to perform the cross-correlation operation on the template feature image and the feature image of the area to be searched to obtain a cross-correlation result feature map, it is specifically configured to: perform a convolution operation on the template feature image to obtain two classification branch feature maps, and perform a convolution operation on the feature image of the area to be searched to obtain two regression branch feature maps; and pair each classification branch feature map with one of the regression branch feature maps and perform the cross-correlation operation on each pair to obtain the first cross-correlation result feature map and the second cross-correlation result feature map.
When the tracking result obtaining module 30 is configured to obtain the feature map row with the highest class probability from the cross-correlation result feature map and perform a channel transformation convolution operation on it to obtain a classification branch response map and a regression branch response map respectively, it is specifically configured to: obtain a first feature map row with the highest class probability from the first cross-correlation result feature map and perform a channel transformation convolution operation on it to obtain the classification branch response map; and obtain a second feature map row with the highest class probability from the second cross-correlation result feature map and perform a channel transformation convolution operation on it to obtain the regression branch response map.
On the basis of the above embodiment, the embodiment of the invention performs convolution operations on the template feature image and on the feature image of the area to be searched to obtain two pairs of classification branch and regression branch feature maps, and performs the cross-correlation operation on each pair to obtain the cross-correlation result feature maps, thereby improving the accuracy of the cross-correlation result and, in turn, the accuracy of classification and tracking.
Further, based on the above embodiment, when the tracking result obtaining module 30 is configured to obtain the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map, it is specifically configured to: sort and screen out a plurality of target detection boxes corresponding to the tracking target through the classification branch response map; predict the bounding box of each target detection box through the regression branch; and obtain the bounding box corresponding to the tracking result using a preset screening algorithm.
On the basis of the above embodiment, the embodiment of the invention screens out a plurality of target detection boxes corresponding to the tracking target by sorting the classification branch response map, predicts the bounding box of each detection box through the regression branch, and obtains the bounding box corresponding to the tracking result using a preset screening algorithm, thereby ensuring the reliability of multi-target tracking.
Further, based on the above embodiment, when the tracking result obtaining module 30 is configured to screen out a plurality of target detection boxes corresponding to the tracking target by sorting the classification branch response map, it is specifically configured to: screen out a plurality of target detection boxes corresponding to the tracking target through the classification branch response map and sort the detection boxes through a cosine window and a scale penalty. The preset screening algorithm is a non-maximum suppression algorithm.
On the basis of the above embodiment, the embodiment of the invention improves the accuracy of multi-target tracking by selecting suitable sorting and screening algorithms for the target detection boxes and for the bounding boxes.
The apparatus provided in the embodiment of the present invention is used to perform the above method; for its specific functions, reference may be made to the method flow described above, which is not repeated here.
Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform the following method: sequentially acquiring candidate detection boxes of a tracking target in the current video frame through a target detection network model according to the frame order of the video, recording the coordinate position information of the candidate detection boxes, and acquiring template images corresponding to the candidate detection boxes according to the coordinate position information, wherein there are one or more tracking targets; acquiring the image of each frame in the video except the 1st frame and taking it as the image of the area to be searched; inputting each template image and the image of the area to be searched respectively into a target tracking network model constructed from a twin convolutional neural network; and acquiring the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented by a processor to perform the method provided by the foregoing embodiments, for example, including: sequentially acquiring candidate detection frames of a tracking target in a current video frame through a target detection network model according to the frame sequence of the video, recording coordinate position information of the candidate detection frames, and acquiring template images corresponding to the candidate detection frames according to the coordinate position information; wherein the tracking targets are one or more; acquiring images of each frame except the 1 st frame in the video, and taking the images as images of a region to be searched; respectively inputting each template image and the image of the area to be searched into a target tracking network model constructed by a twin convolutional neural network; and acquiring a tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A visual multi-target tracking method based on deep learning is characterized by comprising the following steps:
sequentially acquiring candidate detection frames of a tracking target in a current video frame through a target detection network model according to the frame sequence of the video, recording coordinate position information of the candidate detection frames, and acquiring template images corresponding to the candidate detection frames according to the coordinate position information; wherein the tracking targets are one or more;
acquiring images of each frame except the 1 st frame in the video, and taking the images as images of a region to be searched;
respectively inputting each template image and the image of the area to be searched into a target tracking network model constructed by a twin convolutional neural network; and acquiring a tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
2. The deep learning based visual multi-target tracking method according to claim 1, wherein the target detection network model is a YOLOv3 network model.
3. The deep learning-based visual multi-target tracking method according to claim 1, wherein the obtaining of the tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model comprises:
respectively extracting the characteristics of the template image and the image of the area to be searched to obtain a template characteristic image and a characteristic image of the area to be searched;
performing cross-correlation operation on the template characteristic image and the characteristic image of the area to be searched to obtain a cross-correlation operation result characteristic diagram;
obtaining a feature graph row with the highest class probability according to the feature graph of the cross-correlation operation result, and performing channel transformation convolution operation by using the feature graph row to respectively obtain a classification branch response graph and a regression branch response graph;
and acquiring the tracking result of the tracking target corresponding to the template image according to the classification branch response diagram and the regression branch response diagram.
4. The visual multi-target tracking method based on deep learning of claim 3, wherein the cross-correlation operation is performed on the template feature image and the feature image of the area to be searched to obtain a cross-correlation operation result feature map, and the method comprises the following steps:
and sliding the template characteristic image on the characteristic image of the area to be searched, and performing cross-correlation operation channel by channel to obtain a cross-correlation operation result characteristic image.
5. The deep learning-based visual multi-target tracking method according to claim 3, wherein the cross-correlation result feature map comprises a first cross-correlation result feature map and a second cross-correlation result feature map; the cross-correlation operation is performed on the template characteristic image and the characteristic image of the area to be searched to obtain a cross-correlation operation result characteristic diagram, and the method comprises the following steps:
performing convolution operation on the template characteristic image to obtain two classification branch characteristic graphs, and performing convolution operation on the characteristic image of the area to be searched to obtain two regression branch characteristic graphs; respectively combining the classification branch feature graph and the other regression branch feature graph pairwise to perform cross-correlation operation to obtain a first cross-correlation operation result feature graph and a second cross-correlation operation result feature graph;
the method for obtaining the feature map row with the highest class probability according to the feature map of the cross-correlation operation result, and performing channel transformation convolution operation by using the feature map row to respectively obtain a classification branch response map and a regression branch response map includes:
obtaining a first characteristic diagram row with the highest class probability according to the characteristic diagram of the first cross-correlation operation result, and performing channel transformation convolution operation by using the first characteristic diagram row to obtain the classification branch response diagram; and obtaining a second characteristic diagram row with the highest class probability according to the second cross-correlation operation result characteristic diagram, and performing channel transformation convolution operation by using the second characteristic diagram row to obtain the regression branch response diagram.
6. The deep learning-based visual multi-target tracking method according to claim 3, wherein the obtaining of the tracking result of the tracking target corresponding to the template image according to the classification branch response map and the regression branch response map comprises:
screening out a plurality of target detection frames corresponding to the tracking target through the sorting of the classification branch response graph;
and acquiring the boundary frame of each target detection frame through the regression branch response graph, and acquiring the boundary frame corresponding to the tracking result by using a preset screening algorithm.
7. The deep learning based visual multi-target tracking method according to claim 6, wherein the screening out of a plurality of target detection frames corresponding to the tracking target through the sorting of the classification branch response graph comprises:
screening out a plurality of target detection frames corresponding to the tracking target through the classification branch response graph, and sorting the target detection frames through a cosine window and a scale penalty; the preset screening algorithm is a non-maximum suppression algorithm.
8. A visual multi-target tracking device based on deep learning is characterized by comprising:
a template image acquisition module to: sequentially acquiring candidate detection frames of a tracking target in a current video frame through a target detection network model according to the frame sequence of the video, recording coordinate position information of the candidate detection frames, and acquiring template images corresponding to the candidate detection frames according to the coordinate position information; wherein the tracking targets are one or more;
the image acquisition module of the area to be searched is used for: acquiring images of each frame except the 1 st frame in the video, and taking the images as images of a region to be searched;
a tracking result obtaining module configured to: respectively inputting each template image and the image of the area to be searched into a target tracking network model constructed by a twin convolutional neural network; and acquiring a tracking result of the tracking target corresponding to the template image according to the output of the target tracking network model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the deep learning based visual multi-target tracking method according to any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the deep learning based visual multi-target tracking method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911252433.5A CN111161311A (en) | 2019-12-09 | 2019-12-09 | Visual multi-target tracking method and device based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911252433.5A CN111161311A (en) | 2019-12-09 | 2019-12-09 | Visual multi-target tracking method and device based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111161311A true CN111161311A (en) | 2020-05-15 |
Family
ID=70556616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911252433.5A Pending CN111161311A (en) | 2019-12-09 | 2019-12-09 | Visual multi-target tracking method and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161311A (en) |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111724409A (en) * | 2020-05-18 | 2020-09-29 | 浙江工业大学 | Target tracking method based on densely connected twin neural network |
CN111797716A (en) * | 2020-06-16 | 2020-10-20 | 电子科技大学 | Single target tracking method based on Siamese network |
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN111915644A (en) * | 2020-07-09 | 2020-11-10 | 苏州科技大学 | Real-time target tracking method of twin guiding anchor frame RPN network |
CN111932579A (en) * | 2020-08-12 | 2020-11-13 | 广东技术师范大学 | Method and device for adjusting equipment angle based on motion trail of tracked target |
CN112001252A (en) * | 2020-07-22 | 2020-11-27 | 北京交通大学 | Multi-target tracking method based on heteromorphic graph network |
CN112037254A (en) * | 2020-08-11 | 2020-12-04 | 浙江大华技术股份有限公司 | Target tracking method and related device |
CN112215080A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Target tracking method using time sequence information |
CN112257527A (en) * | 2020-10-10 | 2021-01-22 | 西南交通大学 | Mobile phone detection method based on multi-target fusion and space-time video sequence |
CN112464769A (en) * | 2020-11-18 | 2021-03-09 | 西北工业大学 | High-resolution remote sensing image target detection method based on consistent multi-stage detection |
CN112489081A (en) * | 2020-11-30 | 2021-03-12 | 北京航空航天大学 | Visual target tracking method and device |
CN112598739A (en) * | 2020-12-25 | 2021-04-02 | 哈尔滨工业大学(深圳) | Mobile robot infrared target tracking method and system based on space-time characteristic aggregation network and storage medium |
CN112614159A (en) * | 2020-12-22 | 2021-04-06 | 浙江大学 | Cross-camera multi-target tracking method for warehouse scene |
CN112633078A (en) * | 2020-12-02 | 2021-04-09 | 西安电子科技大学 | Target tracking self-correcting method, system, medium, equipment, terminal and application |
CN112651994A (en) * | 2020-12-18 | 2021-04-13 | 零八一电子集团有限公司 | Ground multi-target tracking method |
CN112816474A (en) * | 2021-01-07 | 2021-05-18 | 武汉大学 | Target perception-based depth twin network hyperspectral video target tracking method |
CN112950675A (en) * | 2021-03-18 | 2021-06-11 | 深圳市商汤科技有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN112967289A (en) * | 2021-02-08 | 2021-06-15 | 上海西井信息科技有限公司 | Security check package matching method, system, equipment and storage medium |
CN112967315A (en) * | 2021-03-02 | 2021-06-15 | 北京百度网讯科技有限公司 | Target tracking method and device and electronic equipment |
CN113112525A (en) * | 2021-04-27 | 2021-07-13 | 北京百度网讯科技有限公司 | Target tracking method, network model, and training method, device, and medium thereof |
CN113160272A (en) * | 2021-03-19 | 2021-07-23 | 苏州科达科技股份有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113344932A (en) * | 2021-06-01 | 2021-09-03 | 电子科技大学 | Semi-supervised single-target video segmentation method |
CN113705588A (en) * | 2021-10-28 | 2021-11-26 | 南昌工程学院 | Twin network target tracking method and system based on convolution self-attention module |
CN113763415A (en) * | 2020-06-04 | 2021-12-07 | 北京达佳互联信息技术有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN114170271A (en) * | 2021-11-18 | 2022-03-11 | 安徽清新互联信息科技有限公司 | Multi-target tracking method with self-tracking consciousness, equipment and storage medium |
WO2022116868A1 (en) * | 2020-12-03 | 2022-06-09 | Ping An Technology (Shenzhen) Co., Ltd. | Method, device, and computer program product for deep lesion tracker for monitoring lesions in four-dimensional longitudinal imaging |
CN115359240A (en) * | 2022-07-15 | 2022-11-18 | 北京中科思创云智能科技有限公司 | Small target detection method, device and equipment based on multi-frame image motion characteristics |
CN115661207A (en) * | 2022-11-14 | 2023-01-31 | 南昌工程学院 | Target tracking method and system based on space consistency matching and weight learning |
CN115984332A (en) * | 2023-02-14 | 2023-04-18 | 北京卓翼智能科技有限公司 | Unmanned aerial vehicle tracking method and device, electronic equipment and storage medium |
CN116977902A (en) * | 2023-08-14 | 2023-10-31 | 长春工业大学 | Target tracking method and system for on-board photoelectric stabilized platform of coastal defense |
WO2023207276A1 (en) * | 2022-04-29 | 2023-11-02 | 京东方科技集团股份有限公司 | Area location update method, security and protection system, and computer-readable storage medium |
WO2023216572A1 (en) * | 2022-05-07 | 2023-11-16 | 深圳先进技术研究院 | Cross-video target tracking method and system, and electronic device and storage medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104574445A (en) * | 2015-01-23 | 2015-04-29 | 北京航空航天大学 | Target tracking method and device |
US20170286774A1 (en) * | 2016-04-04 | 2017-10-05 | Xerox Corporation | Deep data association for online multi-class multi-object tracking |
CN107403175A (en) * | 2017-09-21 | 2017-11-28 | 昆明理工大学 | Visual tracking method and visual tracking system for moving backgrounds |
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | Target tracking method and system based on a fully convolutional twin network with multi-layer feature fusion |
CN109325967A (en) * | 2018-09-14 | 2019-02-12 | 腾讯科技(深圳)有限公司 | Target tracking method, apparatus, medium and device |
CN109376572A (en) * | 2018-08-09 | 2019-02-22 | 同济大学 | Deep-learning-based real-time vehicle detection and trajectory tracking method for traffic video |
CN109785385A (en) * | 2019-01-22 | 2019-05-21 | 中国科学院自动化研究所 | Visual target tracking method and system |
CN109948611A (en) * | 2019-03-14 | 2019-06-28 | 腾讯科技(深圳)有限公司 | Method for determining an information region, information display method, and device |
CN109978921A (en) * | 2019-04-01 | 2019-07-05 | 南京信息工程大学 | Real-time video target tracking algorithm based on a multi-layer attention mechanism |
CN110096960A (en) * | 2019-04-03 | 2019-08-06 | 罗克佳华科技集团股份有限公司 | Object detection method and device |
CN110097575A (en) * | 2019-04-28 | 2019-08-06 | 电子科技大学 | Target tracking method based on local features and scale pooling |
CN110111363A (en) * | 2019-04-28 | 2019-08-09 | 深兰科技(上海)有限公司 | Tracking method and device based on target detection |
CN110210551A (en) * | 2019-05-28 | 2019-09-06 | 北京工业大学 | Visual target tracking method based on adaptive subject sensitivity |
CN110298404A (en) * | 2019-07-02 | 2019-10-01 | 西南交通大学 | Target tracking method based on triplet twin hash network learning |
CN110335290A (en) * | 2019-06-04 | 2019-10-15 | 大连理工大学 | Target tracking method using an attention-based twin region proposal network |
- 2019-12-09: Application filed in China as CN201911252433.5A; published as CN111161311A, legal status Pending
Non-Patent Citations (1)
Title |
---|
Zhang Qinyi: "Research on Person and Vehicle Detection and Tracking Algorithms Based on Deep Convolutional Networks", pages 9 - 18 *
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111724409A (en) * | 2020-05-18 | 2020-09-29 | 浙江工业大学 | Target tracking method based on densely connected twin neural network |
CN113763415A (en) * | 2020-06-04 | 2021-12-07 | 北京达佳互联信息技术有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113763415B (en) * | 2020-06-04 | 2024-03-08 | 北京达佳互联信息技术有限公司 | Target tracking method, device, electronic equipment and storage medium |
CN111797716A (en) * | 2020-06-16 | 2020-10-20 | 电子科技大学 | Single target tracking method based on Siamese network |
CN111797716B (en) * | 2020-06-16 | 2022-05-03 | 电子科技大学 | Single target tracking method based on Siamese network |
CN111915644A (en) * | 2020-07-09 | 2020-11-10 | 苏州科技大学 | Real-time target tracking method of twin guiding anchor frame RPN network |
CN111915644B (en) * | 2020-07-09 | 2023-07-04 | 苏州科技大学 | Real-time target tracking method of twin guide anchor frame RPN network |
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN111882580B (en) * | 2020-07-17 | 2023-10-24 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN112001252A (en) * | 2020-07-22 | 2020-11-27 | 北京交通大学 | Multi-target tracking method based on heterogeneous graph network |
CN112001252B (en) * | 2020-07-22 | 2024-04-12 | 北京交通大学 | Multi-target tracking method based on heterogeneous graph network |
CN112037254A (en) * | 2020-08-11 | 2020-12-04 | 浙江大华技术股份有限公司 | Target tracking method and related device |
CN111932579A (en) * | 2020-08-12 | 2020-11-13 | 广东技术师范大学 | Method and device for adjusting equipment angle based on motion trail of tracked target |
CN112215080B (en) * | 2020-09-16 | 2022-05-03 | 电子科技大学 | Target tracking method using time sequence information |
CN112215080A (en) * | 2020-09-16 | 2021-01-12 | 电子科技大学 | Target tracking method using time sequence information |
CN112257527B (en) * | 2020-10-10 | 2022-09-02 | 西南交通大学 | Mobile phone detection method based on multi-target fusion and space-time video sequence |
CN112257527A (en) * | 2020-10-10 | 2021-01-22 | 西南交通大学 | Mobile phone detection method based on multi-target fusion and space-time video sequence |
CN112464769A (en) * | 2020-11-18 | 2021-03-09 | 西北工业大学 | High-resolution remote sensing image target detection method based on consistent multi-stage detection |
CN112489081A (en) * | 2020-11-30 | 2021-03-12 | 北京航空航天大学 | Visual target tracking method and device |
CN112633078A (en) * | 2020-12-02 | 2021-04-09 | 西安电子科技大学 | Target tracking self-correcting method, system, medium, equipment, terminal and application |
CN112633078B (en) * | 2020-12-02 | 2024-02-02 | 西安电子科技大学 | Target tracking self-correction method, system, medium, equipment, terminal and application |
WO2022116868A1 (en) * | 2020-12-03 | 2022-06-09 | Ping An Technology (Shenzhen) Co., Ltd. | Method, device, and computer program product for deep lesion tracker for monitoring lesions in four-dimensional longitudinal imaging |
CN112651994A (en) * | 2020-12-18 | 2021-04-13 | 零八一电子集团有限公司 | Ground multi-target tracking method |
CN112614159A (en) * | 2020-12-22 | 2021-04-06 | 浙江大学 | Cross-camera multi-target tracking method for warehouse scene |
CN112598739B (en) * | 2020-12-25 | 2023-09-01 | 哈尔滨工业大学(深圳) | Mobile robot infrared target tracking method, system and storage medium based on space-time characteristic aggregation network |
CN112598739A (en) * | 2020-12-25 | 2021-04-02 | 哈尔滨工业大学(深圳) | Mobile robot infrared target tracking method, system and storage medium based on space-time characteristic aggregation network |
CN112816474B (en) * | 2021-01-07 | 2022-02-01 | 武汉大学 | Target perception-based depth twin network hyperspectral video target tracking method |
CN112816474A (en) * | 2021-01-07 | 2021-05-18 | 武汉大学 | Target perception-based depth twin network hyperspectral video target tracking method |
CN112967289A (en) * | 2021-02-08 | 2021-06-15 | 上海西井信息科技有限公司 | Security check package matching method, system, equipment and storage medium |
CN112967315A (en) * | 2021-03-02 | 2021-06-15 | 北京百度网讯科技有限公司 | Target tracking method and device and electronic equipment |
CN112967315B (en) * | 2021-03-02 | 2022-08-02 | 北京百度网讯科技有限公司 | Target tracking method and device and electronic equipment |
CN112950675A (en) * | 2021-03-18 | 2021-06-11 | 深圳市商汤科技有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113160272B (en) * | 2021-03-19 | 2023-04-07 | 苏州科达科技股份有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113160272A (en) * | 2021-03-19 | 2021-07-23 | 苏州科达科技股份有限公司 | Target tracking method and device, electronic equipment and storage medium |
CN113112525B (en) * | 2021-04-27 | 2023-09-01 | 北京百度网讯科技有限公司 | Target tracking method, network model, and training method, device, and medium thereof |
CN113112525A (en) * | 2021-04-27 | 2021-07-13 | 北京百度网讯科技有限公司 | Target tracking method, network model, and training method, device, and medium thereof |
CN113344932A (en) * | 2021-06-01 | 2021-09-03 | 电子科技大学 | Semi-supervised single-target video segmentation method |
CN113705588B (en) * | 2021-10-28 | 2022-01-25 | 南昌工程学院 | Twin network target tracking method and system based on convolution self-attention module |
CN113705588A (en) * | 2021-10-28 | 2021-11-26 | 南昌工程学院 | Twin network target tracking method and system based on convolution self-attention module |
CN114170271A (en) * | 2021-11-18 | 2022-03-11 | 安徽清新互联信息科技有限公司 | Multi-target tracking method with self-tracking consciousness, equipment and storage medium |
CN114170271B (en) * | 2021-11-18 | 2024-04-12 | 安徽清新互联信息科技有限公司 | Multi-target tracking method with self-tracking consciousness, equipment and storage medium |
WO2023207276A1 (en) * | 2022-04-29 | 2023-11-02 | 京东方科技集团股份有限公司 | Area location update method, security and protection system, and computer-readable storage medium |
WO2023216572A1 (en) * | 2022-05-07 | 2023-11-16 | 深圳先进技术研究院 | Cross-video target tracking method and system, and electronic device and storage medium |
CN115359240B (en) * | 2022-07-15 | 2024-03-15 | 北京中科思创云智能科技有限公司 | Small target detection method, device and equipment based on multi-frame image motion characteristics |
CN115359240A (en) * | 2022-07-15 | 2022-11-18 | 北京中科思创云智能科技有限公司 | Small target detection method, device and equipment based on multi-frame image motion characteristics |
CN115661207A (en) * | 2022-11-14 | 2023-01-31 | 南昌工程学院 | Target tracking method and system based on space consistency matching and weight learning |
CN115984332A (en) * | 2023-02-14 | 2023-04-18 | 北京卓翼智能科技有限公司 | Unmanned aerial vehicle tracking method and device, electronic equipment and storage medium |
CN116977902B (en) * | 2023-08-14 | 2024-01-23 | 长春工业大学 | Target tracking method and system for on-board photoelectric stabilized platform of coastal defense |
CN116977902A (en) * | 2023-08-14 | 2023-10-31 | 长春工业大学 | Target tracking method and system for on-board photoelectric stabilized platform of coastal defense |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111161311A (en) | Visual multi-target tracking method and device based on deep learning | |
JP7236545B2 (en) | Video target tracking method and apparatus, computer apparatus, program | |
Wang et al. | Detect globally, refine locally: A novel approach to saliency detection | |
US11842487B2 (en) | Detection model training method and apparatus, computer device and storage medium | |
CN107895367B (en) | Bone age identification method and system and electronic equipment | |
CN112052787B (en) | Target detection method and device based on artificial intelligence and electronic equipment | |
CN109446889B (en) | Object tracking method and device based on twin matching network | |
KR101640998B1 (en) | Image processing apparatus and image processing method | |
CN105844283A (en) | Method for identifying category of image, image search method and image search device | |
CN111401293B (en) | Gesture recognition method based on Head lightweight Mask Scoring R-CNN |
CN112712546A (en) | Target tracking method based on twin neural network | |
CN105303163B (en) | Target detection method and detection device |
CN110827312A (en) | Learning method based on cooperative visual attention neural network | |
CN112102929A (en) | Medical image labeling method and device, storage medium and electronic equipment | |
WO2021103474A1 (en) | Image processing method and apparatus, storage medium and electronic apparatus | |
Meng et al. | Globally measuring the similarity of superpixels by binary edge maps for superpixel clustering | |
CN113780145A (en) | Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium | |
CN115862119B (en) | Attention mechanism-based face age estimation method and device | |
CN111539390A (en) | Small target image identification method, equipment and system based on Yolov3 | |
CN110956157A (en) | Deep learning remote sensing image target detection method and device based on candidate frame selection | |
CN116246161A (en) | Method and device for identifying target fine type of remote sensing image under guidance of domain knowledge | |
CN114743045B (en) | Small sample target detection method based on double-branch area suggestion network | |
CN110633630A (en) | Behavior identification method and device and terminal equipment | |
Nugroho et al. | Comparison of deep learning-based object classification methods for detecting tomato ripeness | |
CN115527050A (en) | Image feature matching method, computer device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||