WO2018157735A1 - Target tracking method and system, and electronic device - Google Patents

Target tracking method and system, and electronic device

Info

Publication number
WO2018157735A1
Authority
WO
WIPO (PCT)
Prior art keywords
target detection
video frame
frame
detection frame
information
Application number
PCT/CN2018/076381
Other languages
French (fr)
Chinese (zh)
Inventor
余锋伟 (Yu Fengwei)
闫俊杰 (Yan Junjie)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Application filed by Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Publication of WO2018157735A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Definitions

  • the present disclosure relates to video analysis technologies, and in particular, to a target tracking method, system, and electronic device.
  • Target tracking technology is one of the important technologies in video analysis. It can be simply described as the following process: a video consists of multiple consecutive video frames; from the first video frame to the last video frame, each video frame contains multiple targets; targets continuously appear in or disappear from the video frames, and targets continuously move within the video frames. The purpose of target tracking is to distinguish each target in a video frame from the other targets, so as to obtain the trajectory of the same target across different video frames.
  • the present disclosure provides a target tracking method, system, and electronic device technical solution.
  • a target tracking method including:
  • the feature information includes any one or more of the following: representation information, motion information, and shape information.
  • acquiring feature information of each target detection frame of the first video frame includes: detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and determining the motion information and the shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: determining the convolutional neural network according to the type of the target.
  • the type of the target includes any one or more of the following: a face, a pedestrian, a vehicle.
  • matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame includes: performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; weighting the similarity calculation results to obtain a similarity matrix; and performing optimal matching on the similarity matrix.
  • the similarity calculation includes: comparing the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: calculating the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • weighting the similarity calculation results to obtain a similarity matrix includes: performing a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  • determining a tracking trajectory of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a matching success, associating the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory.
  • determining a tracking trajectory of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a matching failure, using the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: detecting the first video frame by a detector to obtain each target detection frame of the first video frame.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • a target tracking system, including: an acquiring module, configured to acquire feature information of each target detection frame of a first video frame; a matching module, configured to match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and a determining module, configured to determine a tracking trajectory of each target detection frame of the first video frame according to the matching result;
  • the second video frame is a video frame before the first video frame.
  • the feature information includes any one or more of the following: representation information, motion information, and shape information.
  • the acquiring module includes: a first acquiring sub-module, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and a second acquiring sub-module, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • the target tracking system further includes: a convolutional neural network determining module, configured to determine the convolutional neural network according to the type of the target before the acquiring module acquires the feature information of each target detection frame of the first video frame.
  • the type of the target includes any one or more of the following: a face, a pedestrian, a vehicle.
  • the matching module includes: a similarity calculation sub-module, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting sub-module, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching sub-module, configured to perform optimal matching on the similarity matrix.
  • the similarity calculation sub-module is configured to compare the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity calculation sub-module is configured to calculate the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • the weighting sub-module is configured to perform a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain a similarity matrix.
  • the determining module is configured to: in response to the matching result being a matching success, associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory.
  • the determining module is configured to: in response to the matching result being a matching failure, use the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • the target tracking system further includes: a detecting module, configured to detect the first video frame by the detector to obtain each target detection frame of the first video frame, before the acquiring module acquires the feature information of each target detection frame of the first video frame.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • an electronic device, comprising: a processor, a memory, a communication component, and a communication bus, where the processor, the memory, and the communication component communicate with each other through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to any one of the target tracking methods described above.
  • an electronic device, comprising: a processor and the target tracking system described above; when the processor runs the target tracking system, the units in the target tracking system described in any one of the above are run.
  • a computer program, comprising computer readable code; when the computer readable code is run in a device, a processor in the device executes the executable instructions for implementing the steps in any one of the target tracking methods described above.
  • a computer readable medium for storing computer readable instructions, where, when the instructions are executed, the operations of the steps in any one of the target tracking methods described above are implemented.
  • a computer readable storage medium storing executable instructions, including: executable instructions for acquiring feature information of each target detection frame of a first video frame, the feature information including any one or more of the following: representation information, motion information, and shape information; executable instructions for matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and executable instructions for determining a tracking trajectory of each target detection frame of the first video frame based on the matching result, where the second video frame is a video frame preceding the first video frame.
  • the feature information of each target detection frame of the first video frame is acquired, where the feature information includes any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the second video frame is a video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as any combination of one or more of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • FIG. 1 is a flow chart of steps of a target tracking method provided by the present disclosure
  • FIG. 2 is a flow chart of steps of a target tracking method provided by the present disclosure
  • FIG. 3 is a schematic flowchart of an execution process of a target tracking method provided by the present disclosure
  • FIG. 4 is a schematic structural diagram of a target tracking system provided by the present disclosure.
  • FIG. 5 is a schematic structural diagram of a target tracking system provided by the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device provided in accordance with the present disclosure.
  • the present disclosure can be applied to computer systems/servers that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above.
  • the computer system/server can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, and data structures that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • the target tracking technical solutions provided by the present disclosure are described below with reference to the accompanying drawings; any of the target tracking technical solutions provided by the present disclosure may be implemented by software, hardware, or a combination of software and hardware.
  • the target tracking technical solution provided by the present disclosure may be implemented by an electronic device or by a processor, where the electronic device may include, but is not limited to, a terminal or a server, and the processor may include, but is not limited to, a CPU or a GPU; details are not repeated below.
  • Step S100: Acquire feature information of each target detection frame of the first video frame.
  • step S100 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by an acquisition module 400 executed by a processor.
  • the first video frame can be understood as the current video frame; the video frame in this embodiment can be any video frame of a video image obtained by the processor in real time, or any video frame of a collected complete video stream; in an actual application, the processor may perform frame-by-frame detection on the video image or the video stream to obtain video frames, or may perform detection on sampled frames of the video image or the video stream; this embodiment places no restrictions on the means and source of acquisition of video frames.
  • the acquired feature information is feature information of each target detection frame of the first video frame, that is, each target detection frame of the first video frame needs to be acquired before the feature information is acquired.
  • the process of acquiring each target detection frame of the first video frame will be described in detail in the subsequent embodiments.
  • the feature information includes but is not limited to any one or more of the following: representation information, motion information, and shape information.
  • the representation information is used to represent the feature vector of the target in the target detection frame; the motion information is used to represent the position of the target detection frame; and the shape information is used to represent the size of the target detection frame.
  • the representation information, the motion information and the shape information respectively represent the characteristics of the three different aspects of the target.
  • the representation information may be a high-dimensional feature extracted by a deep neural network trained for different types of targets (such as pedestrians and faces).
  • the feature information in this embodiment can more accurately represent the target detection frame of the first video frame, and provides a more accurate matching condition for the subsequent matching process.
  • Step S102: Match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame.
  • step S102 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a matching module 402 executed by the processor.
  • the second video frame can be understood as a video frame before the first video frame.
  • for example, the second video frame is video frame A, whose time interval is 00:08:20 - 00:08:40, and the first video frame is video frame B, whose time interval is 00:08:41 - 00:08:60; that is, during the video playing process, the first video frame is played next after the second video frame, there are no other video frames between the first video frame and the second video frame, and the first video frame and the second video frame are two consecutive or adjacent video frames.
  • the feature information of each target detection frame of the second video frame may be acquired and stored in advance; that is, before step S100, there is also a step of acquiring and storing the feature information of each target detection frame of the second video frame; similarly, the feature information of each target detection frame of the first video frame acquired in step S100 may also be stored.
  • the feature information includes any one or more of the following: representation information, motion information, and shape information
  • the representation information, the motion information, and the shape information may be matched separately when the feature information is matched.
  • Step S104: Determine a tracking trajectory of each target detection frame of the first video frame according to the matching result.
  • step S104 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a determining module 404 executed by the processor.
  • the matching result includes a matching success and a matching failure.
  • if the matching result is a matching success, it indicates that the feature information of the target detection frame of the first video frame related to the matching result (a target detection frame related to the matching result may be referred to as a matching target detection frame) and the feature information of the target detection frame of the second video frame related to the matching result are close or identical in similarity, that is, the target in the matching target detection frame of the first video frame is close to or identical to the target in the matching target detection frame of the second video frame; in this case, the matching target detection frame of the first video frame may be added to the tracking trajectory of the matching target detection frame of the second video frame.
  • if the matching result is a matching failure, it indicates that the feature information of the target detection frame of the first video frame related to the matching result and the feature information of the target detection frame of the second video frame related to the matching result are not close or not identical, that is, the target in the matching target detection frame of the first video frame is not close to or different from the target in the matching target detection frame of the second video frame; in this case, the matching target detection frame of the first video frame may serve as the starting point of a new tracking trajectory, or the feature information of the matching target detection frame of the first video frame may be matched with the feature information of the other target detection frames of the second video frame.
  • with the target tracking method provided by the present disclosure, the feature information of each target detection frame of the first video frame is acquired, where the feature information may include any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the second video frame is a video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as any combination of one or more of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • this embodiment builds on the above embodiments and focuses on its differences from them; for the same points, reference may be made to the descriptions in the foregoing embodiments.
  • Step S200: Detect the first video frame by the detector to obtain each target detection frame of the first video frame.
  • step S200 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a detection module 508 running by the processor.
  • the detector may be a high-performance detector based on a deep convolutional neural network, used for determining each target area (target detection frame) in a video frame, for example, a faster region-based convolutional neural network (Faster-Region with Convolutional Neural Network, Faster-RCNN); Faster-RCNN is an industry-leading detector with the advantages of high detection accuracy and fast speed.
  • the first video frame may be a current video frame, and each target detection frame of the first video frame may be obtained after the first video frame passes through the detector.
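  • as a minimal illustrative sketch (the disclosure does not prescribe a concrete detector implementation or API), the per-frame target detection frames could be obtained with the Faster-RCNN detector shipped in torchvision; the 0.5 score threshold is an assumed value:

```python
# Minimal sketch of the detection step, assuming torchvision's Faster-RCNN
# as the deep-CNN detector; the 0.5 score threshold is an assumed value.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_boxes(frame, score_threshold=0.5):
    """frame: CxHxW float tensor in [0, 1]; returns Nx4 boxes [x1, y1, x2, y2]."""
    with torch.no_grad():
        output = detector([frame])[0]
    keep = output["scores"] >= score_threshold
    return output["boxes"][keep]
```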
  • Step S202: Determine a convolutional neural network according to the type of the target in the target detection frame; detect the first video frame and each target detection frame of the first video frame by using the convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • step S202 may be performed by a processor calling a corresponding instruction stored in the memory, or may be performed by the acquisition module 400 executed by the processor.
  • the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, a vehicle; any object that can move and has a distinguishing identifier can be the target in this embodiment, and this embodiment does not limit the type of the target.
  • for example, if a face is to be tracked, a convolutional neural network for face recognition, such as FaceNet or DeepID, may be used; if a pedestrian is to be tracked, a convolutional neural network for pedestrian recognition may be used.
  • a convolutional neural network can be understood as a convolutional neural network for identifying a target during computer vision processing.
  • Each target detection frame of the first video frame and the first video frame is input to the convolutional neural network, and after being processed by the convolutional neural network, the feature information for identifying the target is output from the last fully connected layer of the convolutional neural network.
  • the feature information can be used to calculate the similarity between the targets.
  • the network parameters of the convolutional neural network can be fine-tuned using a specific training data set, or the convolutional neural network can be trained through a specific training method; since this embodiment is a general target tracking method, it places no restrictions on how a specific training data set is used for fine-tuning or on which network parameters of a specific layer of the convolutional neural network are fine-tuned.
  • the convolutional neural network can be appropriately compressed, for example, by using a convolutional neural network with fewer network parameters, or by using a small-scale convolutional neural network to imitate the mapping relationship between the input and output of a large-scale convolutional neural network.
  • the convolutional neural network in this embodiment may be a convolutional neural network trained on ImageNet (an image recognition database), or may be any convolutional neural network for identifying a target.
  • the convolutional neural network can be run on a graphics processor, and the computational efficiency of the convolutional neural network is improved by the powerful graphics computing capability of the graphics processor.
  • the present disclosure does not limit the convolutional neural network.
  • the first video frame and each target detection frame of the first video frame are input into the convolutional neural network; the convolutional neural network performs feature extraction on the first video frame and each of its target detection frames, and extracts the representation information of each target detection frame; the representation information includes a feature vector of the target detection frame and is a fixed-length feature vector.
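  • a minimal sketch of this extraction step is shown below, assuming a ResNet-18 backbone as a stand-in for the target-type-specific convolutional neural network and an assumed 224x224 crop size; the classification head is replaced so that the input features of the last fully connected layer serve as the fixed-length vector:

```python
# Minimal sketch: fixed-length representation vectors per detection frame.
# ResNet-18 stands in for the target-type-specific CNN (an assumption);
# its classification head is replaced so the penultimate features come out.
import torch
import torchvision
import torchvision.transforms.functional as TF

backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()
backbone.eval()

def representation_vectors(frame, boxes):
    """frame: CxHxW tensor in [0, 1]; boxes: Nx4 [x1, y1, x2, y2]."""
    crops = [TF.resize(frame[:, int(y1):int(y2), int(x1):int(x2)], [224, 224])
             for x1, y1, x2, y2 in boxes.tolist()]
    with torch.no_grad():
        feats = backbone(torch.stack(crops))  # N x 512 fixed-length vectors
    return torch.nn.functional.normalize(feats, dim=1)
```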
  • in addition to the representation information, the feature information of the target detection frame may include motion information and shape information.
  • the motion information includes location information of the target detection frame, and the shape information includes size information of the target detection frame.
  • the center point position information of the target detection frame may be determined as the motion information, and the length information and width information of the target detection frame may be determined as the shape information; after the target detection frame is obtained, the motion information and the shape information can be determined according to the center point position information, the length information, and the width information of the target detection frame; this embodiment places no restrictions on how the motion information and the shape information are determined according to the target detection frame.
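  • as a small illustrative sketch, both quantities can be read directly off an [x1, y1, x2, y2] detection frame:

```python
# Minimal sketch: motion information (center point position) and shape
# information (width and height) derived from an [x1, y1, x2, y2] frame.
def motion_and_shape(box):
    x1, y1, x2, y2 = box
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # motion information: position
    size = (x2 - x1, y2 - y1)                    # shape information: size
    return center, size
```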
  • Step S204: Match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame; if the matching result is a matching success, perform step S206; if the matching result is a matching failure, perform step S208.
  • step S204 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a matching module 402 executed by the processor.
  • S204 may be specifically divided into the following steps:
  • Step S2040 Perform similarity calculation on the feature information.
  • step S2040 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a similarity calculation sub-module 5020 executed by the processor.
  • to perform similarity calculation on the feature information, similarity calculation may be performed between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame, that is: calculating the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • alternatively, the feature information of each target detection frame of the first video frame may be compared pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity data ranges from 0.0 to 1.0, and larger similarity data indicates a higher degree of similarity between the feature information.
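  • a minimal sketch of the three per-pair similarity terms follows; the disclosure only names the quantities, so the normalizations that map each term into the 0.0-1.0 range (the clamping, the exp(...) decay, and the pixel scale) are assumptions:

```python
# Minimal sketch of the three similarity terms; the clamping and exp(...)
# normalizations mapping each term into [0, 1] are assumed, not prescribed.
import numpy as np

def cosine_similarity(f1, f2):
    # Cosine of the angle between two representation vectors, clamped to [0, 1].
    cos = np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8)
    return float(max(0.0, cos))

def center_similarity(c1, c2, scale=100.0):
    # Center distance between two detection frames; `scale` (pixels) is assumed.
    return float(np.exp(-np.linalg.norm(np.subtract(c1, c2)) / scale))

def shape_similarity(s1, s2):
    # Length-width difference between two detection frames, relative to size.
    (w1, h1), (w2, h2) = s1, s2
    return float(np.exp(-(abs(w1 - w2) / (w1 + w2) + abs(h1 - h2) / (h1 + h2))))
```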
  • Step S2042: Weight the similarity calculation results to obtain a similarity matrix.
  • step S2042 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a weighting sub-module 5022 executed by the processor.
  • a similarity matrix may be obtained according to the obtained similarity calculation results, for example, by performing a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference; this embodiment places no restrictions on the weighting of the similarity results.
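  • a minimal sketch of the weighted summation is shown below, reusing the three similarity term functions from the previous sketch; the weights are assumed values, since the disclosure leaves the weighting unrestricted:

```python
# Minimal sketch: weighted summation of the three terms into an N x M
# similarity matrix (N and M detection frames in the first and second
# video frame); the weights are assumed values. Reuses cosine_similarity,
# center_similarity, and shape_similarity from the previous sketch.
import numpy as np

def similarity_matrix(feats1, centers1, shapes1, feats2, centers2, shapes2,
                      w_repr=0.6, w_center=0.2, w_shape=0.2):
    S = np.zeros((len(feats1), len(feats2)))
    for i in range(len(feats1)):
        for j in range(len(feats2)):
            S[i, j] = (w_repr * cosine_similarity(feats1[i], feats2[j])
                       + w_center * center_similarity(centers1[i], centers2[j])
                       + w_shape * shape_similarity(shapes1[i], shapes2[j]))
    return S
```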
  • Step S2044 Perform an optimal matching on the similarity matrix.
  • step S2044 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by an optimal matching sub-module 5024 operated by the processor.
  • the optimal matching may use weighted bipartite graph matching, for example, the Hungarian algorithm; this embodiment places no restrictions on the matching process of the similarity matrix.
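  • as a minimal sketch, the Hungarian algorithm is available as scipy's linear_sum_assignment; the 0.5 acceptance threshold below is an assumed value:

```python
# Minimal sketch: optimal weighted bipartite matching on the similarity
# matrix via the Hungarian algorithm; the 0.5 threshold is an assumption.
from scipy.optimize import linear_sum_assignment

def optimal_matching(S, accept_threshold=0.5):
    rows, cols = linear_sum_assignment(-S)  # negate: maximize total similarity
    matches, unmatched = [], set(range(S.shape[0]))
    for i, j in zip(rows, cols):
        if S[i, j] >= accept_threshold:     # matching success
            matches.append((i, j))
            unmatched.discard(i)
    return matches, sorted(unmatched)       # unmatched frames start new tracks
```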
  • Step S206 Associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  • step S206 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the determining module 504 being executed by the processor.
  • the target detection frame of the first video frame related to the matching result may be added to the tracking trajectory of the target detection frame of the second video frame related to the matching result, that is, the target detection frame of the first video frame related to the matching result may be used as the next movement position in the tracking trajectory of the target detection frame of the second video frame related to the matching result.
  • for example, the target in the target detection frame of the first video frame related to the matching result is target A, and the target in the target detection frame of the second video frame related to the matching result is target A′; the target detection frame of the first video frame related to the matching result can then be used as the new tracking trajectory position of the existing tracking trajectory of the target A or the target A′.
  • the target detection frame of the first video frame related to the matching result is the target detection frame whose feature information is matched with that of a target detection frame of the second video frame to obtain the matching result.
  • for example, the first video frame includes a target detection frame m1 and a target detection frame m2; the target detection frame m1 is matched with a target detection frame m3 of the second video frame to obtain a matching result; then the target detection frame of the first video frame related to the matching result is the target detection frame m1, and the target detection frame of the second video frame related to the matching result is the target detection frame m3.
  • Step S208 The target detection frame of the first video frame related to the matching result is used as a starting target detection frame of the new tracking trajectory.
  • step S208 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the determining module 504 being executed by the processor.
  • the target detection frame of the first video frame related to the matching result may be used as a starting target detection frame of the new tracking trajectory.
  • for example, the target in the target detection frame of the first video frame related to the matching result is target B, and the target in the target detection frame of the second video frame related to the matching result is target B′; the target B and the target B′ are different targets, and the target detection frame of the first video frame related to the matching result can be used as the starting position of a new tracking trajectory of the target B.
  • one of the prerequisites of this step is that the target detection frame of the first video frame related to the matching result has been matched with the feature information of each target detection frame in the second video frame, and each matching result is a matching failure.
  • referring to FIG. 3, a schematic flowchart of the execution process of a target tracking method according to the present disclosure is shown.
  • the video frame to be detected is input to the detector to obtain a target detection frame of the video frame to be detected.
  • the video frame to be detected and the target detection frame are input into the convolutional neural network, and the feature information of the target detection frame is extracted.
  • the extracted feature information is matched with the feature information of the existing tracking tracks; if the matching is successful, the target detection frame of the video frame to be detected is added to the existing tracking track; if the matching fails, the target detection frame of the video frame to be detected is used as the starting point of a new tracking track.
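  • chaining the earlier sketches gives a minimal end-to-end version of this execution flow; the `tracks` bookkeeping structure below is hypothetical, not the disclosure's own:

```python
# Minimal end-to-end sketch of the flow in FIG. 3, reusing detect_boxes,
# representation_vectors, motion_and_shape, similarity_matrix, and
# optimal_matching from the previous sketches. `tracks` (a list of
# per-target box histories) is a hypothetical bookkeeping structure.
def track_video(frames):
    tracks = []  # tracks[k]: list of boxes belonging to target k
    prev = None  # (feats, centers, shapes, track_ids) of the previous frame
    for frame in frames:
        boxes = detect_boxes(frame)
        feats = representation_vectors(frame, boxes)
        info = [motion_and_shape(b) for b in boxes.tolist()]
        centers, shapes = [c for c, _ in info], [s for _, s in info]
        ids = [-1] * len(boxes)
        if prev is not None:
            S = similarity_matrix(feats, centers, shapes,
                                  prev[0], prev[1], prev[2])
            matches, unmatched = optimal_matching(S)
            for i, j in matches:            # matching success: extend the track
                tracks[prev[3][j]].append(boxes[i])
                ids[i] = prev[3][j]
        else:
            matches, unmatched = [], range(len(boxes))
        for i in unmatched:                 # matching failure: start a new track
            tracks.append([boxes[i]])
            ids[i] = len(tracks) - 1
        prev = (feats, centers, shapes, ids)
    return tracks
```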
  • with the target tracking method provided by this embodiment, the feature information of each target detection frame of the first video frame is acquired, where the feature information includes, but is not limited to, any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the first video frame may be the current video frame, the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with any combination of one or more of the feature information, such as the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • Target tracking technology can be generally divided into target online tracking technology and target offline tracking technology.
  • the target tracking method provided by the present disclosure can be applied to an online target tracking scene, for example: acquiring, online, the feature information of each target detection frame of the video frames in a video image played in real time, matching the feature information of each target detection frame of adjacent video frames in the real-time video image, and then determining the tracking trajectory of each target detection frame according to the matching result.
  • the target tracking method provided in this embodiment may be applied to an online video surveillance analysis solution; for example, in a face recognition application, online face detection is required for each video frame, and the face features detected online are input into a face feature database for query, thereby determining the person corresponding to the face in the video frame.
  • with the target tracking method, the face features of the faces in multiple video frames can be matched, and the successfully matched face features are searched in the face feature database, thereby improving the accuracy of face recognition.
  • the target tracking method provided by the present disclosure may also be applied to an offline target tracking scene, for example: acquiring the feature information of each target detection frame of the video frames in an offline video image, matching the feature information of each target detection frame of adjacent video frames in the offline video image, and then determining the tracking trajectory of each target detection frame according to the matching result.
  • any of the methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: a terminal device and a server.
  • any of the methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor performs any of the methods mentioned in the embodiments of the present disclosure by executing corresponding instructions stored in a memory. This will not be repeated below.
  • the foregoing program may be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • with the target tracking system provided by the present disclosure, the target detection frames in the first video frame are matched with the target detection frames in the second video frame according to the feature information, and the tracking trajectory of each target detection frame is determined according to the matching result; because the acquired feature information, such as the representation information, the motion information, or the shape information, is used for matching, not only can more accurate matching results be obtained, but the accuracy of target tracking can also be improved.
  • the system shown in FIG. 4 includes:
  • the obtaining module 400 is configured to acquire feature information of each target detection frame of the first video frame
  • the matching module 402 is configured to match feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame;
  • the determining module 404 is configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame; wherein the second video frame is a video frame before the first video frame.
  • the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, shape information.
  • in this embodiment, the acquiring module acquires the feature information of each target detection frame of the first video frame; the matching module then matches the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame; and the determining module then determines the tracking trajectory of each target detection frame of the first video frame according to the matching result.
  • the first video frame may be the current video frame, the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as one or a combination of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • the system shown in FIG. 5 includes:
  • the acquiring module 500 is configured to acquire feature information of each target detection frame of the first video frame, where the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, shape information;
  • the matching module 502 is configured to match feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame;
  • the determining module 504 is configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
  • the acquiring module 500 includes: a first acquiring sub-module 5000, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and a second acquiring sub-module 5002, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • the target tracking system provided by the present disclosure further includes: a convolutional neural network determining module 506, configured to determine the convolutional neural network according to the type of the target before the acquiring module 500 acquires the feature information of each target detection frame of the first video frame.
  • the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, a vehicle.
  • the matching module 502 includes: a similarity calculation sub-module 5020, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting sub-module 5022, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching sub-module 5024, configured to perform optimal matching on the similarity matrix.
  • optionally, the similarity calculation sub-module 5020 is configured to compare the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity calculation sub-module 5020 is configured to calculate the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • the weighting sub-module 5022 is configured to perform a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain a similarity matrix.
  • the determining module 504 is configured to associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory if the matching result is a matching success.
  • the determining module 504 is configured to: if the matching result is a matching failure, use a target detection frame of the first video frame related to the matching result as a starting target detection frame of the new tracking track.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • the target tracking system provided in this embodiment further includes: a detecting module 508, configured to detect the first video by using the detector before the acquiring module 500 acquires the feature information of each target detection frame of the first video frame. Frame, each target detection frame of the first video frame is obtained.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • the target tracking system of this embodiment can be used to implement the corresponding target tracking method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments; details are not described herein again.
  • an embodiment of the present application further provides an electronic device, including: a processor and a memory;
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the target tracking method described in any of the above embodiments of the present application.
  • the embodiment of the present application further provides another electronic device, including:
  • a processor and the target tracking system of any of the above embodiments of the present application; when the processor runs the target tracking system, the units in the target tracking system according to any of the above embodiments of the present application are run.
  • the embodiment of the present disclosure further provides an electronic device, which may include, but is not limited to, a mobile terminal, a personal computer (PC), a tablet, and a server.
  • an electronic device 600 can include one or more processors and communication elements; the one or more processors include, for example, one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613, and the processor may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage portion 608 into a random access memory (RAM) 603.
  • the communication component includes a communication component 612 and/or a communication interface 609.
  • the communication component 612 can include, but is not limited to, a network card, and the network card can include, but is not limited to, an IB (InfiniBand) network card.
  • the communication interface 609 includes a communication interface of a network interface card, such as a LAN card or a modem, and the communication interface 609 performs communication processing via a network such as the Internet.
  • the processor can communicate with the read-only memory 602 and/or the random access memory 603 to execute executable instructions, is connected to the communication component 612 via the communication bus 604, and communicates with other target devices via the communication component 612, to complete the operations corresponding to any of the target tracking methods provided by the embodiments of the present disclosure, for example: acquiring feature information of each target detection frame of a first video frame, where the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and determining the tracking trajectory of each target detection frame of the first video frame according to the matching result, where the second video frame is a video frame before the first video frame.
  • the CPU 601 or the GPU 613, the ROM 602, and the RAM 603 are connected to each other through a communication bus 604.
  • the executable instructions are written into the ROM 602, and the executable instructions cause the processor to perform the operations corresponding to the target tracking method.
  • An input/output (I/O) interface 605 is also coupled to communication bus 604.
  • the communication component 612 can be integrated, or can be configured to have multiple sub-modules (e.g., multiple IB network cards) and be linked on the communication bus.
  • the following components are connected to the I/O interface 605: an input portion 606 including, but not limited to, a keyboard and a mouse; an output portion 607 including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage portion 608 including, but not limited to, a hard disk; and a communication interface 609 including a network interface card such as a LAN card or a modem.
  • a drive 610 is also coupled to the I/O interface 605 as needed; a removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
  • the architecture shown in FIG. 6 is only an optional implementation manner; during practice, the number and types of components in FIG. 6 may be selected, deleted, added, or replaced according to actual needs; for different functional components, separate or integrated implementations can also be used, for example, the GPU and the CPU can be set separately or the GPU can be integrated on the CPU, and the communication component can be set separately, or can be integrated on the CPU or the GPU.
  • an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium; the computer program includes program code for executing the method illustrated in the flowchart, and the program code may include instructions corresponding to the method steps provided by any embodiment of the present disclosure, for example, instructions corresponding to the following steps: acquiring feature information of each target detection frame of a first video frame, where the feature information includes any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to the matching result, where the second video frame is a video frame before the first video frame.
  • the computer program can be downloaded and installed from a network via the communication component, and/or installed from the removable medium 611.
  • the above-described functions defined in the method of any of the embodiments of the present disclosure are performed when the computer program is executed by a processor.
  • the embodiment of the present application further provides a computer program, including computer readable code; when the computer readable code is run on a device, a processor in the device executes the instructions for implementing the steps of the target tracking method according to any embodiment of the present application.
  • the embodiment of the present application further provides a computer readable storage medium, configured to store computer readable instructions; when the instructions are executed, the operations of the steps of the target tracking method according to any embodiment of the present application are implemented.
  • the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another.
  • for the system embodiments, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiments.
The methods and apparatus of the present disclosure may be implemented in many ways, for example, in software, hardware, firmware, or any combination of software, hardware, and firmware.
The above-described sequence of the steps of the method is for illustrative purposes only; the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
In addition, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present disclosure; thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Abstract

A target tracking method and system, and an electronic device. The method comprises: acquiring feature information of each target detection frame of a first video frame (S100), the feature information comprising any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame (S102); and determining a tracking track of each target detection frame of the first video frame according to the matching result (S104), where the second video frame is a video frame before the first video frame. By means of the method, more precise matching results can be obtained, and the precision of target tracking can also be improved.

Description

Target tracking method, system and electronic device
The present disclosure claims priority to Chinese Patent Application No. 201710124025.6, filed with the Chinese Patent Office on March 3, 2017 and entitled "Target tracking method, system and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to video analysis technologies, and in particular, to a target tracking method, system, and electronic device.
Background
Target tracking technology is one of the important technologies in video analysis. It can be simply described as the following process: in a video consisting of multiple consecutive video frames, from the first video frame to the last video frame, each video frame contains multiple targets; targets continuously appear in or disappear from the video frames, and the targets move continuously across the video frames. The purpose of target tracking is to distinguish each target in the video frames from the other targets, so as to obtain the trajectory of the same target across different video frames.
Summary
The present disclosure provides technical solutions for a target tracking method, system, and electronic device.
According to an aspect of the present disclosure, a target tracking method is provided, including:
acquiring feature information of each target detection frame of a first video frame; matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame, where the second video frame is a video frame before the first video frame; and determining, according to the matching result, a tracking track of each target detection frame of the first video frame.
In an implementation of the present disclosure, the feature information includes any one or more of the following: representation information, motion information, and shape information.
In an implementation of the present disclosure, acquiring the feature information of each target detection frame of the first video frame includes: detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire representation information of each target detection frame of the first video frame; and determining motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
In an implementation of the present disclosure, before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: determining the convolutional neural network according to the type of the target.
In an implementation of the present disclosure, the type of the target includes any one or more of the following: a face, a pedestrian, and a vehicle.
In an implementation of the present disclosure, matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame includes: performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; weighting the similarity calculation results to obtain a similarity matrix; and performing optimal matching on the similarity matrix.
In an implementation of the present disclosure, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: comparing the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
In an implementation of the present disclosure, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: calculating a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating a center distance between the target detection frames according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating a length-width difference between the target detection frames according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an implementation of the present disclosure, weighting the similarity calculation results to obtain the similarity matrix includes: performing weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
In an implementation of the present disclosure, determining the tracking track of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a successful match, associating the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result into a continuous tracking track.
In an implementation of the present disclosure, determining the tracking track of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a failed match, using the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking track.
In an implementation of the present disclosure, the representation information includes a feature vector of the target detection frame, the motion information includes position information of the target detection frame, and the shape information includes size information of the target detection frame.
In an implementation of the present disclosure, before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: detecting the first video frame by using a detector, to obtain each target detection frame of the first video frame.
In an implementation of the present disclosure, the detector is a detector based on a deep convolutional neural network.
In an implementation of the present disclosure, the detector is a faster region-based convolutional neural network (Faster-RCNN).
According to another aspect of the present disclosure, a target tracking system is further provided, including: an acquiring module configured to acquire feature information of each target detection frame of a first video frame; a matching module configured to match the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame; and a determining module configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
In an implementation of the present disclosure, the feature information includes any one or more of the following: representation information, motion information, and shape information.
In an implementation of the present disclosure, the acquiring module includes: a first acquiring submodule configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire representation information of each target detection frame of the first video frame; and a second acquiring submodule configured to determine motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
In an implementation of the present disclosure, the target tracking system further includes: a convolutional neural network determining module configured to determine the convolutional neural network according to the type of the target before the acquiring module acquires the feature information of each target detection frame of the first video frame.
In an implementation of the present disclosure, the type of the target includes any one or more of the following: a face, a pedestrian, and a vehicle.
In an implementation of the present disclosure, the matching module includes: a similarity calculation submodule configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting submodule configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching submodule configured to perform optimal matching on the similarity matrix.
In an implementation of the present disclosure, the similarity calculation submodule is configured to compare the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
In an implementation of the present disclosure, the similarity calculation submodule is configured to: calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculate a center distance between the target detection frames according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculate a length-width difference between the target detection frames according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an implementation of the present disclosure, the weighting submodule is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
In an implementation of the present disclosure, the determining module is configured to, in response to the matching result being a successful match, associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result into a continuous tracking track.
In an implementation of the present disclosure, the determining module is configured to, in response to the matching result being a failed match, use the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking track.
In an implementation of the present disclosure, the representation information includes a feature vector of the target detection frame, the motion information includes position information of the target detection frame, and the shape information includes size information of the target detection frame.
In an implementation of the present disclosure, the target tracking system further includes: a detecting module configured to detect the first video frame by using a detector before the acquiring module acquires the feature information of each target detection frame of the first video frame, to obtain each target detection frame of the first video frame.
In an implementation of the present disclosure, the detector is a detector based on a deep convolutional neural network.
In an implementation of the present disclosure, the detector is a faster region-based convolutional neural network (Faster-RCNN).
According to another aspect of the present disclosure, an electronic device is further provided, including: a processor, a memory, a communication element, and a communication bus, where the processor, the memory, and the communication element communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to any one of the above target tracking methods.
According to another aspect of the present disclosure, an electronic device is further provided, including: a processor and the target tracking system described above; when the processor runs the target tracking system, the units in the target tracking system according to any one of the above are run.
According to another aspect of the present disclosure, a computer program is further provided, including computer-readable code; when the computer-readable code is run in a device, a processor in the device executes executable instructions for implementing the steps of the target tracking method according to any one of the above.
According to another aspect of the present disclosure, a computer-readable medium is further provided, configured to store computer-readable instructions, where when the instructions are executed, the operations of the steps in the target tracking method according to any one of the above are implemented.
According to another aspect of the present disclosure, a computer-readable storage medium is further provided, which stores: executable instructions for acquiring feature information of each target detection frame of a first video frame, the feature information including any one or more of the following: representation information, motion information, and shape information; executable instructions for matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame; and executable instructions for determining, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
According to the technical solutions provided by the present disclosure, feature information of each target detection frame of a first video frame is acquired, where the feature information includes any one or more of the following: representation information, motion information, and shape information. The feature information of each target detection frame of the first video frame is then matched with feature information of each target detection frame of a second video frame, and the tracking track of each target detection frame of the first video frame is determined according to the matching result. In the present disclosure, the second video frame is a video frame before the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, compared with simple image features or optical flow information, the acquired feature information, such as representation information, motion information, or shape information, can represent the target detection frames of a video frame more accurately and identify the target detection frames more effectively. Therefore, matching with feature information, such as any combination of one or more of the representation information, motion information, and shape information, not only yields more accurate matching results but also improves the precision of target tracking.
The technical solutions of the present disclosure are further described in detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of the steps of a target tracking method provided by the present disclosure;
FIG. 2 is a flowchart of the steps of a target tracking method provided by the present disclosure;
FIG. 3 is a schematic diagram of the execution flow of a target tracking method provided by the present disclosure;
FIG. 4 is a schematic structural diagram of a target tracking system provided by the present disclosure;
FIG. 5 is a schematic structural diagram of a target tracking system provided by the present disclosure; and
FIG. 6 is a schematic structural diagram of an electronic device provided according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Meanwhile, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.
Techniques, methods, and apparatus known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.
It should be noted that similar reference numerals and letters refer to similar items in the following accompanying drawings; therefore, once an item is defined in one drawing, it does not need to be further discussed in subsequent drawings.
The present disclosure can be applied to computer systems/servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above systems.
The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, and data structures, which perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
The target tracking technical solutions provided by the present disclosure are described below with reference to FIG. 1 to FIG. 6. Any target tracking technical solution provided by the present disclosure may be implemented by software, by hardware, or by a combination of software and hardware. For example, the target tracking technical solution provided by the present disclosure may be implemented by an electronic device or by a processor, which is not limited by the present disclosure; the electronic device may include, but is not limited to, a terminal or a server, and the processor may include, but is not limited to, a CPU or a GPU. Details are not described again below.
In FIG. 1, in step S100, feature information of each target detection frame of a first video frame is acquired.
In an optional implementation, step S100 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by an acquiring module 400 run by the processor.
In this embodiment, the first video frame can be understood as the current video frame. The video frame in this embodiment may be any video frame of a video image collected by the processor in real time, or any video frame of a collected complete video stream. Moreover, in practical applications, the processor may detect the video image or video stream frame by frame to obtain video frames, or may detect the video image or video stream by sampled frames to obtain video frames; this embodiment does not limit the means and source of acquiring video frames.
Optionally, in step S100, the acquired feature information is the feature information of each target detection frame of the first video frame; that is, before the feature information is acquired, each target detection frame of the first video frame needs to be acquired first. The process of acquiring each target detection frame of the first video frame will be described in detail in subsequent embodiments.
In an optional implementation, the feature information includes, but is not limited to, any one or more of the following: representation information, motion information, and shape information. The representation information is used to represent the feature vector of the target in the target detection frame, the motion information is used to represent the position of the target detection frame, and the shape information is used to represent the size of the target detection frame. The representation information, motion information, and shape information respectively characterize three different aspects of the target. To improve the accuracy of target tracking in this embodiment, it is preferable to jointly use the representation information, motion information, and shape information as the feature information; selecting only one of them, or a partial combination, as the feature information may affect the accuracy of target tracking. The representation information may be a high-dimensional feature extracted from a deep neural network trained for different types of targets (such as pedestrians or faces).
Compared with simple image features or optical flow information, the feature information in this embodiment can represent the target detection frames of the first video frame more accurately, providing more accurate matching conditions for the subsequent matching process.
In step S102, the feature information of each target detection frame of the first video frame is matched with feature information of each target detection frame of a second video frame.
In an optional implementation, step S102 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a matching module 402 run by the processor.
The second video frame can be understood as a video frame before the first video frame. For example, the second video frame is video frame A with a time interval of 00:08:20 to 00:08:40, and the first video frame is video frame B with a time interval of 00:08:41 to 00:08:60. That is, during playback, the first video frame is played immediately after the second video frame, no other video frame exists between the first video frame and the second video frame, and the first video frame and the second video frame are two consecutive or adjacent video frames.
The feature information of each target detection frame of the second video frame may be acquired and stored in advance; that is, before step S100, there is also a step of acquiring and storing the feature information of each target detection frame of the second video frame. Similarly, the feature information of each target detection frame of the first video frame acquired in step S100 may also be stored.
Since the feature information includes any one or more of the following: representation information, motion information, and shape information, when the feature information is matched, the representation information, motion information, and shape information may be matched separately.
In step S104, a tracking track of each target detection frame of the first video frame is determined according to the matching result.
In an optional implementation, step S104 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a determining module 404 run by the processor.
In one implementation, the matching result includes a successful match and a failed match.
When the matching result is a successful match, it indicates that the feature information of the target detection frame of the first video frame related to the matching result (a target detection frame related to the matching result may be referred to as a matching target detection frame) and the feature information of the target detection frame of the second video frame related to the matching result are close or identical in similarity; that is, the target in the matching target detection frame of the first video frame and the target in the matching target detection frame of the second video frame are close or identical in similarity. In this case, the matching target detection frame of the first video frame can be added to the tracking track of the matching target detection frame of the second video frame.
When the matching result is a failed match, it indicates that the feature information of the target detection frame of the first video frame related to the matching result and the feature information of the target detection frame of the second video frame related to the matching result are not close or not identical in similarity; that is, the target in the matching target detection frame of the first video frame and the target in the matching target detection frame of the second video frame are not close or not identical in similarity. In this case, the matching target detection frame of the first video frame can be used as the starting point of a new tracking track, or the feature information of the matching target detection frame of the first video frame can be matched with the feature information of the other target detection frames of the second video frame.
Through the target tracking method provided by the present disclosure, feature information of each target detection frame of the first video frame is acquired, where the feature information may include any one or more of the following: representation information, motion information, and shape information. The feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking track of each target detection frame of the first video frame is determined according to the matching result. In the present disclosure, the second video frame is a video frame before the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, compared with simple image features or optical flow information, the acquired feature information, such as representation information, motion information, or shape information, can represent the target detection frames of a video frame more accurately and identify the target detection frames more effectively. Therefore, matching with feature information, such as any combination of one or more of the representation information, motion information, and shape information, not only yields more accurate matching results but also improves the precision of target tracking.
As shown in FIG. 2, this embodiment of the present disclosure builds on the foregoing embodiments and emphasizes the differences from them; for the same parts, reference may be made to the introduction and description in the foregoing embodiments.
In FIG. 2, in step S200, the first video frame is detected by using a detector to obtain each target detection frame of the first video frame.
In an optional implementation, step S200 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a detecting module 508 run by the processor.
Optionally, the detector may be a high-performance detector based on a deep convolutional neural network, which is used to determine each target region (target detection frame) in a video frame, for example, a faster region-based convolutional neural network (Faster-RCNN); Faster-RCNN is currently an industry-leading detector with the advantages of high detection accuracy and fast speed.
The first video frame may be the current video frame; after the first video frame passes through the detector, each target detection frame of the first video frame can be obtained.
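As an illustrative sketch only (not a definitive implementation of the disclosure), the detection step could be realized with an off-the-shelf Faster-RCNN such as the one shipped with torchvision; the weights choice and the score threshold of 0.5 are assumptions:

```python
import torch
import torchvision

# A Faster-RCNN pretrained on COCO, standing in for the patent's
# deep-convolutional-network detector (assumed choice, torchvision >= 0.13).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_boxes(frame, score_threshold=0.5):
    """Return the target detection frames (boxes) of one video frame.

    `frame` is a float tensor of shape (3, H, W) with values in [0, 1].
    """
    with torch.no_grad():
        output = detector([frame])[0]
    keep = output["scores"] > score_threshold
    return output["boxes"][keep]  # each row is (x1, y1, x2, y2)
```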
In step S202, a convolutional neural network is determined according to the type of the target in the target detection frames; the first video frame and each target detection frame of the first video frame are detected by using the convolutional neural network to acquire the representation information of each target detection frame of the first video frame; and the motion information and shape information of each target detection frame of the first video frame are determined according to each target detection frame of the first video frame.
In an optional implementation, step S202 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the acquiring module 400 run by the processor.
In one implementation, the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, and a vehicle. Any object that can move and has a distinguishing identity can serve as the target in this embodiment; this embodiment does not limit the type of the target. In practical application scenarios, to track faces, a convolutional neural network for face recognition, such as FaceNet or DeepID, may be used; to track pedestrians, a convolutional neural network for pedestrian recognition may be used.
Optionally, the convolutional neural network can be understood as a convolutional neural network used to identify targets in computer vision processing. The first video frame and each target detection frame of the first video frame are input to the convolutional neural network; after computation by the convolutional neural network, feature information for identifying the target is output from the last fully connected layer of the convolutional neural network, and the feature information can be used to calculate the similarity between targets.
Moreover, to improve the accuracy of the feature information acquired by the convolutional neural network, during training the network parameters of the convolutional neural network may be fine-tuned using a specific training data set, or the convolutional neural network may be trained in a specific training manner. However, since this embodiment is a general target tracking method, it does not limit how a specific training data set is used for fine-tuning, or which network parameters of which layers of the convolutional neural network are fine-tuned.
In addition, to meet the requirements on the computational efficiency of the convolutional neural network, the convolutional neural network may be appropriately compressed, for example, by using a convolutional neural network with fewer network parameters, or by using a small-scale convolutional neural network to imitate the input-output mapping of a large-scale convolutional neural network.
In an optional implementation, the convolutional neural network in this embodiment may be a convolutional neural network trained on ImageNet (an image recognition database), or may be any convolutional neural network used to identify targets. Since the convolutional neural network involves a large amount of computation, it can be run on a graphics processor, whose powerful graphics computing capability improves the computational efficiency of the convolutional neural network. The present disclosure does not limit the convolutional neural network.
The first video frame and each target detection frame of the first video frame are input into the convolutional neural network, and the convolutional neural network performs feature extraction on them to extract the representation information of each target detection frame; the representation information includes the feature vector of the target detection frame and is a fixed-length feature vector. In addition to the representation information, the feature information of the target detection frame may further include motion information or shape information, where the motion information includes the position information of the target detection frame, and the shape information includes the size information of the target detection frame. In the process of determining the motion information and shape information according to the target detection frame, in one feasible implementation, the center point position information of the target detection frame may be determined as the motion information, and the length information and width information of the target detection frame may be determined as the shape information; after the target detection frame is obtained, the motion information and shape information can then be determined from the center point position, length, and width of the target detection frame. This embodiment does not limit how the motion information and shape information are determined according to the target detection frame.
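A minimal sketch of this feature-extraction step is given below; the embedding network `embed_net` (any CNN whose last fully connected layer yields a fixed-length vector) and the 128x128 crop size are assumptions not specified by the disclosure:

```python
import torch
import torch.nn.functional as F

def extract_features(frame, box, embed_net):
    """Compute (representation, motion, shape) information for one detection frame.

    `frame`: float tensor (3, H, W); `box`: tensor (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = box.int().tolist()
    crop = frame[:, y1:y2, x1:x2].unsqueeze(0)
    crop = F.interpolate(crop, size=(128, 128))                # fixed input size (assumed)
    appearance = embed_net(crop).flatten()                     # fixed-length feature vector
    motion = torch.tensor([(x1 + x2) / 2.0, (y1 + y2) / 2.0])  # box center position
    shape = torch.tensor([float(x2 - x1), float(y2 - y1)])     # box width and height
    return appearance, motion, shape
```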
In step S204, the feature information of each target detection frame of the first video frame is matched with the feature information of each target detection frame of the second video frame; if the matching result is a successful match, step S206 is performed; if the matching result is a failed match, step S208 is performed.
In an optional implementation, step S204 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the matching module 402 run by the processor.
In one implementation, step S204 can be divided into the following steps.
In step S2040, similarity calculation is performed on the feature information.
In an optional implementation, step S2040 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a similarity calculation submodule 5020 run by the processor.
Specifically, the similarity calculation may be performed between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; that is, the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame is calculated, the center distance between the target detection frames is calculated according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and the length-width difference between the target detection frames is calculated according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an optional implementation, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame may also comprise comparing the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame. The similarity data ranges from 0.0 to 1.0; the larger the similarity data, the higher the degree of similarity between the feature information. A sketch of the three pairwise terms follows.
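Under the assumption that cosine similarity, Euclidean center distance, and absolute width/height differences are the concrete measures (the disclosure names the quantities but not their exact formulas), the three terms could be computed as:

```python
import torch
import torch.nn.functional as F

def pairwise_terms(feat_a, feat_b):
    """Compute the three similarity terms between two detection frames.

    Each argument is an (appearance, motion, shape) tuple, e.g. from the
    hypothetical extract_features sketch above.
    """
    cos = F.cosine_similarity(feat_a[0], feat_b[0], dim=0)    # cosine of the angle
    center_dist = torch.norm(feat_a[1] - feat_b[1])           # center distance
    shape_diff = torch.sum(torch.abs(feat_a[2] - feat_b[2]))  # length-width difference
    return cos, center_dist, shape_diff
```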
In step S2042, the similarity calculation results are weighted to obtain a similarity matrix.
In an optional implementation, step S2042 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a weighting submodule 5022 run by the processor.
Optionally, the weighting may be performed according to the actually obtained similarity calculation results to obtain the similarity matrix; for example, weighted summation or weighted multiplication is performed on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix. This embodiment does not limit the weighting of the similarity results; one possible form of the weighted-sum variant is shown below.
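In the weighted-sum variant, for instance, one entry of the similarity matrix could take the following form, where a_i, c_i, and s_i denote the representation vector, box center, and box size of the i-th detection frame of the first video frame (primed symbols belong to the second video frame); the weights w_1, w_2, w_3 and the sign convention (distance terms penalized) are assumptions:

```latex
S_{ij} = w_1 \cos(\mathbf{a}_i, \mathbf{a}'_j)
       - w_2 \lVert \mathbf{c}_i - \mathbf{c}'_j \rVert_2
       - w_3 \lVert \mathbf{s}_i - \mathbf{s}'_j \rVert_1
```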
In step S2044, optimal matching is performed on the similarity matrix.
In an optional implementation, step S2044 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by an optimal matching submodule 5024 run by the processor.
Optionally, the optimal matching may use weighted bipartite graph matching, for example, the Hungarian algorithm; this embodiment does not limit the matching process of the similarity matrix.
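One way to realize the weighted bipartite matching is SciPy's Hungarian-style solver; the similarity threshold of 0.3 below, used to reject weak assignments as failed matches, is an assumed hyperparameter:

```python
from scipy.optimize import linear_sum_assignment

def optimal_matching(similarity, threshold=0.3):
    """Match rows (current-frame boxes) to columns (previous-frame boxes).

    `similarity` is an (n, m) numpy array; returns matched (i, j) pairs and
    the row indices left unmatched (candidate starts of new tracks).
    """
    rows, cols = linear_sum_assignment(-similarity)  # maximize total similarity
    matches = [(i, j) for i, j in zip(rows, cols) if similarity[i, j] >= threshold]
    matched_rows = {i for i, _ in matches}
    unmatched = [i for i in range(similarity.shape[0]) if i not in matched_rows]
    return matches, unmatched
```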
In step S206, the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result are associated into a continuous tracking track.
In an optional implementation, step S206 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the determining module 504 run by the processor.
If the matching result is a successful match, the target detection frame of the first video frame related to the matching result can be added to the tracking track of the target detection frame of the second video frame related to the matching result; it then serves as the next movement position, in the tracking track, of the target detection frame of the second video frame related to the matching result. For example, the target in the target detection frame of the first video frame related to the matching result is target A, and the target in the target detection frame of the second video frame related to the matching result is target A'. If the matching result is a successful match, target A and target A' are the same target, and the target detection frame of the first video frame related to the matching result can serve as the new track position of the existing tracking track of target A (i.e., target A').
It should be noted that the above target detection frame of the first video frame related to the matching result is the target detection frame whose feature information was matched with that of a target detection frame of the second video frame to obtain the matching result. For example, the first video frame includes target detection frame m1 and target detection frame m2; if the feature information of target detection frame m1 is matched with that of target detection frame m3 of the second video frame to obtain a matching result, the target detection frame of the first video frame related to the matching result is target detection frame m1, and the target detection frame of the second video frame related to the matching result is target detection frame m3.
In step S208, the target detection frame of the first video frame related to the matching result is used as the starting target detection frame of a new tracking track.
In an optional implementation, step S208 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the determining module 504 run by the processor.
If the matching result is a failed match, the target detection frame of the first video frame related to the matching result can be used as the starting target detection frame of a new tracking track. For example, the target in the target detection frame of the first video frame related to the matching result is target B, and the target in the target detection frame of the second video frame related to the matching result is target B'. If the matching result is a failed match, target B and target B' are different targets, and the target detection frame of the first video frame related to the matching result can serve as the starting position of the new tracking track of target B.
It should be noted that one precondition of this step is that the target detection frame of the first video frame related to the matching result has been matched, in terms of feature information, with every target detection frame in the second video frame, and every matching result is a failed match.
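Putting the two branches together, a minimal bookkeeping sketch might look as follows (the `tracks` dictionary of box histories is an assumed data structure, not dictated by the disclosure):

```python
def update_tracks(tracks, prev_track_ids, boxes, matches, unmatched):
    """Extend matched tracks (S206) and start new ones (S208).

    `tracks` maps an integer track id to the list of boxes on that track;
    `prev_track_ids[j]` is the track id of the j-th previous-frame box.
    Returns the track id of each current-frame box, in box order.
    """
    current_ids = [None] * len(boxes)
    for i, j in matches:                 # successful match: extend the track
        track_id = prev_track_ids[j]
        tracks[track_id].append(boxes[i])
        current_ids[i] = track_id
    for i in unmatched:                  # failed match: start a new track
        track_id = max(tracks, default=-1) + 1
        tracks[track_id] = [boxes[i]]
        current_ids[i] = track_id
    return current_ids
```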
Based on the above description, FIG. 3 shows a schematic diagram of the execution flow of the target tracking method provided by the present disclosure. A video frame to be detected is input into the detector to obtain the target detection frames of the video frame to be detected. The video frame to be detected and its target detection frames are both input into the convolutional neural network, and the feature information of the target detection frames is extracted. The extracted feature information is matched with the feature information of the existing tracking tracks; if the matching succeeds, the target detection frame of the video frame to be detected is added to the existing tracking track; if the matching fails, the target detection frame of the video frame to be detected is used as the starting point of a new tracking track. The sketch below chains the earlier hypothetical snippets into one such per-frame pass.
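Chaining the sketches above into one per-frame pass (all function names come from the earlier hypothetical snippets and are assumed to be in scope; the weights are assumed values):

```python
import torch

def process_frame(frame, tracks, prev_state, embed_net,
                  w1=1.0, w2=0.01, w3=0.01):
    """One pass of the flow in FIG. 3: detect, extract features, match, update."""
    boxes = detect_boxes(frame)
    feats = [extract_features(frame, b, embed_net) for b in boxes]
    prev_feats, prev_track_ids = prev_state
    similarity = torch.zeros(len(feats), len(prev_feats))
    for i in range(len(feats)):
        for j in range(len(prev_feats)):
            cos, dist, diff = pairwise_terms(feats[i], prev_feats[j])
            similarity[i, j] = w1 * cos - w2 * dist - w3 * diff  # weighted sum
    matches, unmatched = optimal_matching(similarity.numpy())
    current_ids = update_tracks(tracks, prev_track_ids, boxes, matches, unmatched)
    return feats, current_ids  # becomes prev_state for the next frame
```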
通过本实施例提供的目标跟踪方法,获取第一视频帧的每个目标检测框的特征信息,其中,特征信息包括但不限于以下任意一项或多项:表象信息、运动信息、形状信息。进而将第一视频帧的每个目标检测框的特征信息与第二视频帧的每个目标检测框的特征信息进行匹配,然后根据匹配结果确定第一视频帧的每个目标检测框的跟踪轨迹。本公开中,第一视频帧可以为当前视频帧,第二视频帧可以为第一视频帧之前的视频帧,可以预先储存已获取的第二视频帧的每个目标检测框的特征信息。The feature information of each target detection frame of the first video frame is obtained by the target tracking method provided by the embodiment, where the feature information includes but is not limited to any one or more of the following: representation information, motion information, and shape information. And further matching feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame, and then determining a tracking track of each target detection frame of the first video frame according to the matching result. . In the present disclosure, the first video frame may be the current video frame, and the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the acquired second video frame may be pre-stored.
本公开中,由于获取到的特征信息,如:表象信息、运动信息、形状信息,相比于简单的 图像特征或者光流信息,能够更加精确的表示视频帧的目标检测框,对于目标检测框的识别效果更好,因此,使用特征信息,如表象信息、运动信息、形状信息中的一种或几种的任意组合进行匹配,不仅可以获得更加精确的匹配结果,还可以提高目标跟踪的精度。In the present disclosure, due to the acquired feature information, such as representation information, motion information, and shape information, the target detection frame of the video frame can be more accurately represented than the simple image feature or optical flow information, and the target detection frame is The recognition effect is better. Therefore, matching with any combination of one or more of the feature information, such as the representation information, the motion information, and the shape information, can not only obtain more accurate matching results, but also improve the accuracy of the target tracking. .
Target tracking technology can generally be divided into online target tracking and offline target tracking. The target tracking method provided by the present disclosure can be applied to an online target tracking scenario, for example, acquiring online the feature information of each target detection frame of the video frames of a video played in real time, matching the feature information of the target detection frames of adjacent video frames, and then determining the tracking trajectory of each target detection frame according to the matching result.
In a feasible implementation, the target tracking method provided by this embodiment may be applied to an online video surveillance analysis solution. For example, in a face recognition application, online face detection needs to be performed on every video frame, and the detected face features are queried against a face feature database to determine the person corresponding to a face in a video frame. The face features of the faces in multiple video frames can be matched, and the successfully matched face features are then queried in the face feature database, which improves the accuracy of face recognition.
Meanwhile, the target tracking method provided by the present disclosure can also be applied to an offline target tracking scenario, for example, acquiring the feature information of each target detection frame of the video frames of an offline video, matching the feature information of the target detection frames of adjacent video frames, and then determining the tracking trajectory of each target detection frame according to the matching result.
Any method provided by the embodiments of the present disclosure may be performed by any appropriate device with data processing capability, including but not limited to a terminal device and a server. Alternatively, any method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor performs any method mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. Details are not described again below.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
When the technical solutions shown in FIG. 1 to FIG. 3 are applied and the processor implements target tracking, the target detection frames in the first video frame are matched against the target detection frames in the second video frame according to the feature information, and the tracking trajectories of the target detection frames are determined according to the matching results. Owing to the acquired feature information, such as representation information, motion information, or shape information, not only can more accurate matching results be obtained, but the precision of target tracking can also be improved.
Referring to FIG. 4, the system shown includes:
an acquisition module 400, configured to acquire feature information of each target detection frame of a first video frame;
a matching module 402, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame; and
a determination module 404, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In one implementation, the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information.
With the target tracking system provided by this embodiment, the acquisition module acquires the feature information of each target detection frame of the first video frame, the matching module matches the feature information of each target detection frame of the first video frame against the feature information of each target detection frame of the second video frame, and the determination module then determines the tracking trajectory of each target detection frame of the first video frame according to the matching result. In the present disclosure, the first video frame may be the current video frame, the second video frame may be a video frame preceding the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, the acquired feature information, such as representation information, motion information, and shape information, can represent the target detection frames of a video frame more precisely than simple image features or optical flow information, and yields a better recognition effect for the target detection frames. Therefore, matching with any combination of one or more kinds of feature information, such as representation information, motion information, and shape information, not only produces more accurate matching results but also improves the precision of target tracking.
Referring to FIG. 5, the system shown includes:
an acquisition module 500, configured to acquire feature information of each target detection frame of a first video frame, where the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information;
a matching module 502, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame; and
a determination module 504, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In one implementation, the acquisition module 500 includes: a first acquisition submodule 5000, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire the representation information of each target detection frame of the first video frame; and a second acquisition submodule 5002, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
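As a rough illustration (an assumption about the box format, not the disclosed implementation), the motion information and shape information can be read directly off a detection box given as (x1, y1, x2, y2):

    # Hypothetical sketch: derive motion information (box center) and shape
    # information (box width and height) from a box (x1, y1, x2, y2).
    def box_motion_and_shape(box):
        x1, y1, x2, y2 = box
        center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # motion information: position
        size = (x2 - x1, y2 - y1)                    # shape information: size
        return center, size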
In one implementation, the target tracking system provided by the present disclosure further includes: a convolutional neural network determination module 506, configured to determine the convolutional neural network according to the type of the target before the acquisition module 500 acquires the feature information of each target detection frame of the first video frame.
Optionally, the type of the target may include but is not limited to any one or more of the following: a face, a pedestrian, and a vehicle.
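One way to picture this module (purely hypothetical; the registry and model names below are invented for illustration) is a simple lookup from target type to feature-extraction network:

    # Hypothetical sketch: select a feature-extraction CNN by target type.
    FEATURE_NETS = {
        "face": "face_embedding_cnn",
        "pedestrian": "person_reid_cnn",
        "vehicle": "vehicle_reid_cnn",
    }

    def select_feature_net(target_type, load_model):
        # load_model is a user-supplied loader mapping a name to a CNN instance.
        return load_model(FEATURE_NETS[target_type])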
In one implementation, the matching module 502 includes: a similarity calculation submodule 5020, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting submodule 5022, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching submodule 5024, configured to perform optimal matching on the similarity matrix.
Optionally, the similarity calculation submodule 5020 is configured to compare, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
Optionally, the similarity calculation submodule 5020 is configured to calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
Optionally, the weighting submodule 5022 is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
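A minimal sketch of the three similarity terms, their weighted combination, and the optimal matching step is given below; the cosine similarity form, the conversion of distances into similarities, the weights, and the use of SciPy's Hungarian solver are all illustrative assumptions rather than the disclosed implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def similarity_matrix(feats1, feats2, centers1, centers2, sizes1, sizes2,
                          w_app=0.6, w_motion=0.2, w_shape=0.2):
        # Representation term: cosine similarity between feature vectors.
        f1 = feats1 / np.linalg.norm(feats1, axis=1, keepdims=True)
        f2 = feats2 / np.linalg.norm(feats2, axis=1, keepdims=True)
        cos_sim = f1 @ f2.T
        # Motion term: distance between target detection frame centers.
        center_dist = np.linalg.norm(
            centers1[:, None, :] - centers2[None, :, :], axis=2)
        # Shape term: absolute width/height (length-width) difference.
        shape_diff = np.abs(sizes1[:, None, :] - sizes2[None, :, :]).sum(axis=2)
        # Weighted summation; distances are mapped into (0, 1] similarities.
        return (w_app * cos_sim
                + w_motion / (1.0 + center_dist)
                + w_shape / (1.0 + shape_diff))

    def optimal_match(sim, threshold=0.5):
        # The Hungarian algorithm maximizes total similarity; pairs whose
        # similarity falls below the threshold count as matching failures.
        rows, cols = linear_sum_assignment(-sim)
        return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= threshold]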
Optionally, the determination module 504 is configured to, if the matching result is a matching success, associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
Optionally, the determination module 504 is configured to, if the matching result is a matching failure, use the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking trajectory.
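A corresponding sketch of this determination step follows, under the illustrative assumption that matches are (trajectory index, detection index) pairs produced by the optimal matching step:

    # Illustrative sketch: a matching success extends an existing trajectory;
    # a matching failure starts a new one.
    def update_tracks(tracks, matches, boxes):
        matched = set()
        for track_id, box_id in matches:       # matching success
            tracks[track_id].append(boxes[box_id])
            matched.add(box_id)
        for box_id, box in enumerate(boxes):   # matching failure
            if box_id not in matched:
                tracks.append([box])           # start of a new trajectory
        return tracks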
Optionally, the representation information includes a feature vector of a target detection frame, the motion information includes position information of a target detection frame, and the shape information includes size information of a target detection frame.
In one implementation, the target tracking system provided by this embodiment further includes: a detection module 508, configured to detect the first video frame by using a detector to obtain each target detection frame of the first video frame before the acquisition module 500 acquires the feature information of each target detection frame of the first video frame.
Optionally, the detector is a detector based on a deep convolutional neural network.
Optionally, the detector is a faster region-based convolutional neural network (Faster-RCNN).
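For illustration only, a Faster-RCNN detector of this kind could be obtained off the shelf, for example from torchvision; the specific model constructor and the score threshold below are assumptions about tooling, not part of the disclosure:

    import torch
    import torchvision

    # Illustrative only: obtain target detection frames with an off-the-shelf
    # Faster-RCNN (torchvision >= 0.13 weights API assumed).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def detect(frame_tensor, score_thresh=0.5):
        # frame_tensor: float tensor of shape (3, H, W), values in [0, 1]
        with torch.no_grad():
            out = model([frame_tensor])[0]
        keep = out["scores"] >= score_thresh
        return out["boxes"][keep]  # (N, 4) boxes as (x1, y1, x2, y2)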
The target tracking system of this embodiment can be used to implement the corresponding target tracking methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments; details are not described herein again.
In addition, an embodiment of the present application further provides an electronic device, including a processor and a memory.
The memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the target tracking method described in any one of the foregoing embodiments of the present application.
In addition, an embodiment of the present application further provides another electronic device, including:
a processor and the target tracking apparatus described in any one of the foregoing embodiments of the present application, where the units in the target tracking apparatus described in any one of the foregoing embodiments of the present application are run when the processor runs the target tracking apparatus.
An embodiment of the present disclosure further provides an electronic device, which may include but is not limited to a mobile terminal, a personal computer (PC), a tablet computer, or a server. Referring to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the system of an embodiment of the present disclosure is shown. As shown in FIG. 6, the electronic device 600 may include one or more processors and communication elements. The one or more processors are, for example, one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613; the processors may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage section 608 into a random access memory (RAM) 603. The communication elements include a communication component 612 and/or a communication interface 609. The communication component 612 may include but is not limited to a network card, and the network card may include but is not limited to an InfiniBand (IB) network card; the communication interface 609 includes a communication interface of a network interface card such as a LAN card or a modem, and the communication interface 609 performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 602 and/or the random access memory 603 to execute executable instructions, is connected to the communication component 612 through a communication bus 604, and communicates with other target devices via the communication component 612, so as to perform the operations corresponding to any target tracking method provided by the embodiments of the present disclosure, for example: acquiring feature information of each target detection frame of a first video frame, where the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame respectively with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In addition, the RAM 603 may further store various programs and data required for operation. The CPU 601 or the GPU 613, the ROM 602, and the RAM 603 are connected to one another through the communication bus 604. Executable instructions are written into the ROM 602, and the executable instructions cause the processor to perform the operations corresponding to the foregoing target tracking method. An input/output (I/O) interface 605 is also connected to the communication bus 604. The communication component 612 may be integrated, or may be configured to have multiple submodules (for example, multiple IB network cards) linked on the communication bus.
The following components are connected to the I/O interface 605: an input section 606 including but not limited to a keyboard and a mouse; an output section 607 including but not limited to a cathode-ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage section 608 including but not limited to a hard disk; and a communication interface 609 including a network interface card such as a LAN card or a modem. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from the removable medium can be installed into the storage section 608 as needed.
It should be noted that the architecture shown in FIG. 6 is merely an optional implementation. In practice, the number and types of the components in FIG. 6 may be selected, reduced, increased, or replaced according to actual requirements. Different functional components may be configured separately or in an integrated manner; for example, the GPU and the CPU may be configured separately, or the GPU may be integrated on the CPU, and the communication elements may be configured separately or integrated on the CPU or the GPU. These alternative implementations all fall within the protection scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program contains program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by any embodiment of the present disclosure, for example, instructions corresponding to the following steps: acquiring feature information of each target detection frame of a first video frame, where the feature information includes any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame respectively with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame. In such an embodiment, the computer program may be downloaded from a network and installed through the communication elements, and/or installed from the removable medium 611. When the computer program is executed by the processor, the above functions defined in the method of any embodiment of the present disclosure are performed.
In addition, an embodiment of the present application further provides a computer program, including computer-readable code; when the computer-readable code is run on a device, a processor in the device executes instructions for implementing the steps of the target tracking method described in any embodiment of the present application.
In addition, an embodiment of the present application further provides a computer-readable storage medium configured to store computer-readable instructions; when the instructions are executed, the operations of the steps of the target tracking method described in any embodiment of the present application are implemented.
The embodiments in this specification are all described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. The system embodiments basically correspond to the method embodiments and are therefore described relatively simply; for related parts, reference may be made to the descriptions of the method embodiments.
The methods and apparatuses of the present disclosure may be implemented in many ways. For example, the methods and apparatuses of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is merely for description; the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for performing the methods according to the present disclosure.
The description of the present disclosure is given for purposes of illustration and description and is not exhaustive or intended to limit the present disclosure to the disclosed form. Many modifications and variations will be apparent to a person of ordinary skill in the art. The embodiments were selected and described to better explain the principles and practical applications of the present disclosure, and to enable a person of ordinary skill in the art to understand the present disclosure and thereby design various embodiments with various modifications suited to particular uses.

Claims (34)

  1. A target tracking method, comprising:
    acquiring feature information of each target detection frame of a first video frame;
    matching the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame, wherein the second video frame is a video frame preceding the first video frame; and
    determining a tracking trajectory of each target detection frame of the first video frame according to a matching result.
  2. The method according to claim 1, wherein the feature information comprises any one or more of the following: representation information, motion information, and shape information.
  3. The method according to claim 1 or 2, wherein the acquiring feature information of each target detection frame of a first video frame comprises:
    detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire representation information of each target detection frame of the first video frame; and
    determining motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  4. The method according to claim 3, wherein before the acquiring feature information of each target detection frame of a first video frame, the method further comprises:
    determining the convolutional neural network according to a type of a target.
  5. The method according to claim 4, wherein the type of the target comprises any one or more of the following: a face, a pedestrian, and a vehicle.
  6. The method according to any one of claims 1 to 5, wherein the matching the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame comprises:
    performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame;
    weighting similarity calculation results to obtain a similarity matrix; and
    performing optimal matching on the similarity matrix.
  7. The method according to claim 6, wherein the performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame comprises:
    comparing, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  8. The method according to claim 6 or 7, wherein the performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame comprises:
    calculating a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculating a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculating a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  9. The method according to claim 8, wherein the weighting similarity calculation results to obtain a similarity matrix comprises:
    performing weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  10. The method according to any one of claims 1 to 9, wherein the determining a tracking trajectory of each target detection frame of the first video frame according to a matching result comprises:
    in response to the matching result being a matching success, associating the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  11. The method according to any one of claims 1 to 9, wherein the determining a tracking trajectory of each target detection frame of the first video frame according to a matching result comprises:
    in response to the matching result being a matching failure, using the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  12. The method according to any one of claims 1 to 11, wherein the representation information comprises a feature vector of a target detection frame, the motion information comprises position information of a target detection frame, and the shape information comprises size information of a target detection frame.
  13. The method according to any one of claims 1 to 12, wherein before the acquiring feature information of each target detection frame of a first video frame, the method further comprises:
    detecting the first video frame by using a detector to obtain each target detection frame of the first video frame.
  14. The method according to claim 13, wherein the detector is a detector based on a deep convolutional neural network.
  15. The method according to claim 13 or 14, wherein the detector is a faster region-based convolutional neural network (Faster-RCNN).
  16. A target tracking system, comprising:
    an acquisition module, configured to acquire feature information of each target detection frame of a first video frame;
    a matching module, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame, wherein the second video frame is a video frame preceding the first video frame; and
    a determination module, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result.
  17. The system according to claim 16, wherein the feature information comprises any one or more of the following: representation information, motion information, and shape information.
  18. The system according to claim 16 or 17, wherein the acquisition module comprises:
    a first acquisition submodule, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire representation information of each target detection frame of the first video frame; and
    a second acquisition submodule, configured to determine motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  19. The system according to claim 18, wherein the system further comprises:
    a convolutional neural network determination module, configured to determine the convolutional neural network according to a type of a target before the acquisition module acquires the feature information of each target detection frame of the first video frame.
  20. The system according to claim 19, wherein the type of the target comprises any one or more of the following: a face, a pedestrian, and a vehicle.
  21. The system according to any one of claims 16 to 20, wherein the matching module comprises:
    a similarity calculation submodule, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame;
    a weighting submodule, configured to weight similarity calculation results to obtain a similarity matrix; and
    an optimal matching submodule, configured to perform optimal matching on the similarity matrix.
  22. The system according to claim 21, wherein the similarity calculation submodule is configured to compare, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  23. The system according to claim 21 or 22, wherein the similarity calculation submodule is configured to calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  24. The system according to claim 23, wherein the weighting submodule is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  25. The system according to any one of claims 16 to 24, wherein the determination module is configured to, in response to the matching result being a matching success, associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  26. The system according to any one of claims 16 to 24, wherein the determination module is configured to, in response to the matching result being a matching failure, use the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  27. The system according to any one of claims 16 to 26, wherein the representation information comprises a feature vector of a target detection frame, the motion information comprises position information of a target detection frame, and the shape information comprises size information of a target detection frame.
  28. The system according to any one of claims 16 to 27, wherein the system further comprises:
    a detection module, configured to detect the first video frame by using a detector to obtain each target detection frame of the first video frame before the acquisition module acquires the feature information of each target detection frame of the first video frame.
  29. The system according to claim 28, wherein the detector is a detector based on a deep convolutional neural network.
  30. The system according to claim 28 or 29, wherein the detector is a faster region-based convolutional neural network (Faster-RCNN).
  31. An electronic device, comprising a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus; and
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the target tracking method according to any one of claims 1 to 15.
  32. An electronic device, comprising:
    a processor and the target tracking system according to any one of claims 16 to 30, wherein the units in the target tracking system according to any one of claims 16 to 30 are run when the processor runs the target tracking system.
  33. A computer program, comprising computer-readable code, wherein when the computer-readable code is run in a device, a processor in the device executes executable instructions for implementing the steps of the target tracking method according to any one of claims 1 to 15.
  34. A computer-readable medium configured to store computer-readable instructions, wherein when the instructions are executed, the operations of the steps of the target tracking method according to any one of claims 1 to 15 are implemented.
PCT/CN2018/076381 2017-03-03 2018-02-12 Target tracking method and system, and electronic device WO2018157735A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710124025.6A CN108230353A (en) 2017-03-03 2017-03-03 Method for tracking target, system and electronic equipment
CN201710124025.6 2017-03-03

Publications (1)

Publication Number Publication Date
WO2018157735A1 true WO2018157735A1 (en) 2018-09-07

Family

ID=62657301

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076381 WO2018157735A1 (en) 2017-03-03 2018-02-12 Target tracking method and system, and electronic device

Country Status (2)

Country Link
CN (1) CN108230353A (en)
WO (1) WO2018157735A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067594A (en) * 2020-08-05 2022-02-18 北京万集科技股份有限公司 Planning method and device of driving path, computer equipment and storage medium

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921875B (en) * 2018-07-09 2021-08-17 哈尔滨工业大学(深圳) Real-time traffic flow detection and tracking method based on aerial photography data
CN110866428B (en) * 2018-08-28 2023-12-15 杭州海康威视数字技术股份有限公司 Target tracking method, device, electronic equipment and storage medium
CN111127509B (en) * 2018-10-31 2023-09-01 杭州海康威视数字技术股份有限公司 Target tracking method, apparatus and computer readable storage medium
CN109492584A (en) * 2018-11-09 2019-03-19 联想(北京)有限公司 A kind of recognition and tracking method and electronic equipment
CN109635657B (en) * 2018-11-12 2023-01-06 平安科技(深圳)有限公司 Target tracking method, device, equipment and storage medium
CN109558505A (en) 2018-11-21 2019-04-02 百度在线网络技术(北京)有限公司 Visual search method, apparatus, computer equipment and storage medium
CN109726683B (en) 2018-12-29 2021-06-22 北京市商汤科技开发有限公司 Target object detection method and device, electronic equipment and storage medium
CN109840917B (en) * 2019-01-29 2021-01-26 北京市商汤科技开发有限公司 Image processing method and device and network training method and device
CN110163124A (en) * 2019-04-30 2019-08-23 北京易华录信息技术股份有限公司 A kind of trajectory track processing system
CN110378515A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 A kind of prediction technique of emergency event, device, storage medium and server
CN110516559B (en) * 2019-08-02 2022-02-22 西安天和防务技术股份有限公司 Target tracking method and device suitable for accurate monitoring and computer equipment
CN110717414B (en) * 2019-09-24 2023-01-03 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111027376A (en) * 2019-10-28 2020-04-17 中国科学院上海微系统与信息技术研究所 Method and device for determining event map, electronic equipment and storage medium
CN110827325B (en) * 2019-11-13 2022-08-09 阿波罗智联(北京)科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111402294B (en) * 2020-03-10 2022-10-18 腾讯科技(深圳)有限公司 Target tracking method, target tracking device, computer-readable storage medium and computer equipment
CN111784224A (en) * 2020-03-26 2020-10-16 北京京东乾石科技有限公司 Object tracking method and device, control platform and storage medium
CN114503160A (en) 2020-08-01 2022-05-13 商汤国际私人有限公司 Object association method, device, system, electronic equipment, storage medium and computer program
CN112070803A (en) * 2020-09-02 2020-12-11 安徽工程大学 Unmanned ship path tracking method based on SSD neural network model
CN112381092A (en) * 2020-11-20 2021-02-19 深圳力维智联技术有限公司 Tracking method, device and computer readable storage medium
CN113223052A (en) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 Trajectory optimization method, apparatus, device, storage medium, and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN103679186A (en) * 2012-09-10 2014-03-26 华为技术有限公司 Target detecting and tracking method and device
CN103778647A (en) * 2014-02-14 2014-05-07 中国科学院自动化研究所 Multi-target tracking method based on layered hypergraph optimization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931269A (en) * 2016-04-22 2016-09-07 海信集团有限公司 Tracking method for target in video and tracking device thereof
CN105976400B (en) * 2016-05-10 2017-06-30 北京旷视科技有限公司 Method for tracking target and device based on neural network model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679186A (en) * 2012-09-10 2014-03-26 华为技术有限公司 Target detecting and tracking method and device
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN103778647A (en) * 2014-02-14 2014-05-07 中国科学院自动化研究所 Multi-target tracking method based on layered hypergraph optimization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067594A (en) * 2020-08-05 2022-02-18 北京万集科技股份有限公司 Planning method and device of driving path, computer equipment and storage medium
CN114067594B (en) * 2020-08-05 2023-02-17 北京万集科技股份有限公司 Method and device for planning driving path, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108230353A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2018157735A1 (en) Target tracking method and system, and electronic device
WO2019105337A1 (en) Video-based face recognition method, apparatus, device, medium and program
US10540772B2 (en) Feature trackability ranking, systems and methods
US11301687B2 (en) Pedestrian re-identification methods and apparatuses, electronic devices, and storage media
US10891465B2 (en) Methods and apparatuses for searching for target person, devices, and media
WO2019091464A1 (en) Target detection method and apparatus, training method, electronic device and medium
WO2018019126A1 (en) Video category identification method and device, data processing device and electronic apparatus
CN107111787B (en) Stream processing
WO2020006961A1 (en) Image extraction method and device
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
US20130215113A1 (en) Systems and methods for animating the faces of 3d characters using images of human faces
CN108229532B (en) Image recognition method and device and electronic equipment
US9129152B2 (en) Exemplar-based feature weighting
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
US11842514B1 (en) Determining a pose of an object from rgb-d images
WO2020200095A1 (en) Action recognition method and apparatus, and electronic device and storage medium
WO2020007177A1 (en) Quotation method executed by computer, quotation device, electronic device and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CA3052846A1 (en) Character recognition method, device, electronic device and storage medium
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
WO2019170024A1 (en) Target tracking method and apparatus, and electronic device and storage medium
CN114674328B (en) Map generation method, map generation device, electronic device, storage medium, and vehicle
CN108229320B (en) Frame selection method and device, electronic device, program and medium
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN110942056A (en) Clothing key point positioning method and device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18760884

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 18760884

Country of ref document: EP

Kind code of ref document: A1