WO2018157735A1 - Target tracking method and system, and electronic device - Google Patents

Target tracking method and system, and electronic device

Info

Publication number
WO2018157735A1
Authority
WO
WIPO (PCT)
Prior art keywords
target detection
video frame
frame
detection frame
information
Application number
PCT/CN2018/076381
Other languages
French (fr)
Chinese (zh)
Inventor
余锋伟 (Yu Fengwei)
闫俊杰 (Yan Junjie)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Application filed by Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Publication of WO2018157735A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]

Definitions

  • the present disclosure relates to video analysis technologies, and in particular, to a target tracking method, system, and electronic device.
  • Target tracking technology is one of the important technologies in video analysis. It can be simply described as the following process: a video consists of multiple consecutive video frames; from the first video frame to the last video frame, each video frame contains multiple targets; targets continuously appear in or disappear from the video frames, and targets continuously move within the video frames. The purpose of target tracking is to distinguish each target in a video frame from the other targets, so as to obtain the trajectory of the same target across different video frames.
  • the present disclosure provides a target tracking method, system, and electronic device technical solution.
  • a target tracking method including:
  • the feature information includes any one or more of the following: representation information, motion information, and shape information.
  • acquiring feature information of each target detection frame of the first video frame includes: detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and determining the motion information and the shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: determining the convolutional neural network according to the type of the target.
  • the type of the target includes any one or more of the following: a face, a pedestrian, a vehicle.
  • matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame includes: performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; weighting the similarity calculation results to obtain a similarity matrix; and performing optimal matching on the similarity matrix.
  • the similarity calculation includes: comparing the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: calculating the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • weighting the similarity calculation results to obtain a similarity matrix includes: performing a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  • determining a tracking trajectory of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a matching success, associating the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory.
  • determining a tracking trajectory of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a matching failure, using the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: detecting the first video frame by a detector to obtain each target detection frame of the first video frame.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • a target tracking system, including: an acquiring module, configured to acquire feature information of each target detection frame of a first video frame; a matching module, configured to match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and a determining module, configured to determine a tracking trajectory of each target detection frame of the first video frame according to the matching result;
  • the second video frame is a video frame before the first video frame.
  • the feature information includes any one or more of the following: representation information, motion information, and shape information.
  • the acquiring module includes: a first acquiring sub-module, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and a second acquiring sub-module, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • the target tracking system further includes: a convolutional neural network determining module, configured to determine the convolutional neural network according to the type of the target before the acquiring module acquires the feature information of each target detection frame of the first video frame.
  • the type of the target includes any one or more of the following: a face, a pedestrian, a vehicle.
  • the matching module includes: a similarity calculation sub-module, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting sub-module, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching sub-module, configured to perform optimal matching on the similarity matrix.
  • the similarity calculation sub-module is configured to compare the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity calculation sub-module is configured to calculate the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • the weighting sub-module is configured to perform a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain a similarity matrix.
  • the determining module is configured to: in response to the matching result being a matching success, associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory.
  • the determining module is configured to: in response to the matching result being a matching failure, use the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • the target tracking system further includes: a detecting module, configured to detect the first video frame by the detector to obtain each target detection frame of the first video frame, before the acquiring module acquires the feature information of each target detection frame of the first video frame.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • an electronic device, comprising: a processor, a memory, a communication component, and a communication bus, where the processor, the memory, and the communication component communicate with each other through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to any one of the target tracking methods described above.
  • an electronic device, comprising: a processor and the target tracking system described above; when the processor runs the target tracking system, the units in the target tracking system described in any one of the above are run.
  • a computer program, comprising computer readable code; when the computer readable code is run in a device, a processor in the device executes the executable instructions for implementing the steps in any one of the target tracking methods described above.
  • a computer readable medium for storing computer readable instructions, where, when the instructions are executed, the operations of the steps in any one of the target tracking methods described above are implemented.
  • a computer readable storage medium storing executable instructions, including: executable instructions for acquiring feature information of each target detection frame of a first video frame, the feature information including any one or more of the following: representation information, motion information, and shape information; executable instructions for matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and executable instructions for determining a tracking trajectory of each target detection frame of the first video frame based on the matching result, where the second video frame is a video frame preceding the first video frame.
  • the feature information of each target detection frame of the first video frame is acquired, where the feature information includes any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the second video frame is a video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as any combination of one or more of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • FIG. 1 is a flow chart of steps of a target tracking method provided by the present disclosure
  • FIG. 2 is a flow chart of steps of a target tracking method provided by the present disclosure
  • FIG. 3 is a schematic flowchart of an execution process of a target tracking method provided by the present disclosure
  • FIG. 4 is a schematic structural diagram of a target tracking system provided by the present disclosure.
  • FIG. 5 is a schematic structural diagram of a target tracking system provided by the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device provided in accordance with the present disclosure.
  • the present disclosure can be applied to computer systems/servers that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above.
  • the computer system/server can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, and data structures that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • the target tracking technical solutions provided by the present disclosure are described below with reference to the accompanying drawings; any of the target tracking technical solutions provided by the present disclosure may be implemented by software, hardware, or a combination of software and hardware.
  • the target tracking technical solution provided by the present disclosure may be implemented by an electronic device or by a processor, where the electronic device may include, but is not limited to, a terminal or a server, and the processor may include, but is not limited to, a CPU or a GPU; details are not repeated below.
  • Step S100: Acquire feature information of each target detection frame of the first video frame.
  • step S100 may be performed by a processor calling a corresponding instruction stored in a memory, or may be performed by an acquisition module 400 executed by a processor.
  • the first video frame can be understood as the current video frame; the video frame in this embodiment can be any video frame of a video image obtained by the processor in real time, or any video frame of a collected complete video stream; in an actual application, the processor may perform frame-by-frame detection on the video image or the video stream to obtain video frames, or may perform detection on sampled frames of the video image or the video stream; this embodiment places no restrictions on the means and source of acquisition of video frames.
  • the acquired feature information is feature information of each target detection frame of the first video frame, that is, each target detection frame of the first video frame needs to be acquired before the feature information is acquired.
  • the process of acquiring each target detection frame of the first video frame will be described in detail in the subsequent embodiments.
  • the feature information includes but is not limited to any one or more of the following: representation information, motion information, and shape information.
  • the representation information is used to represent the feature vector of the target in the target detection frame; the motion information is used to represent the position of the target detection frame; and the shape information is used to represent the size of the target detection frame.
  • the representation information, the motion information and the shape information respectively represent the characteristics of the three different aspects of the target.
  • the representation information may be a high-dimensional feature extracted by a deep neural network trained for different types of targets (such as pedestrians and faces).
  • the feature information in this embodiment can more accurately represent the target detection frame of the first video frame, and provides a more accurate matching condition for the subsequent matching process.
  • Step S102: Match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame.
  • step S102 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a matching module 402 executed by the processor.
  • the second video frame can be understood as a video frame before the first video frame.
  • for example, the second video frame is video frame A, whose time interval is 00:08:20 - 00:08:40, and the first video frame is video frame B, whose time interval is 00:08:41 - 00:08:60; that is, during the video playing process, the first video frame is played next after the second video frame, there are no other video frames between the first video frame and the second video frame, and the first video frame and the second video frame are two consecutive or adjacent video frames.
  • the feature information of each target detection frame of the second video frame may be acquired and stored in advance; that is, before step S100, there is also a step of acquiring and storing the feature information of each target detection frame of the second video frame; similarly, the feature information of each target detection frame of the first video frame acquired in step S100 may also be stored.
  • the feature information includes any one or more of the following: representation information, motion information, and shape information
  • the representation information, the motion information, and the shape information may be matched separately when the feature information is matched.
  • Step S104: Determine a tracking trajectory of each target detection frame of the first video frame according to the matching result.
  • step S104 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a determining module 404 executed by the processor.
  • the matching result includes a matching success and a matching failure.
  • if the matching result is a matching success, it indicates that the feature information of the target detection frame of the first video frame related to the matching result (a target detection frame related to the matching result may be referred to as a matching target detection frame) and the feature information of the target detection frame of the second video frame related to the matching result are close or identical in similarity, that is, the target in the matching target detection frame of the first video frame is close to or identical to the target in the matching target detection frame of the second video frame; in this case, the matching target detection frame of the first video frame may be added to the tracking trajectory of the matching target detection frame of the second video frame.
  • if the matching result is a matching failure, it indicates that the feature information of the target detection frame of the first video frame related to the matching result and the feature information of the target detection frame of the second video frame related to the matching result are not close or not identical, that is, the target in the matching target detection frame of the first video frame is not close to or different from the target in the matching target detection frame of the second video frame; in this case, the matching target detection frame of the first video frame may serve as the starting point of a new tracking trajectory, or the feature information of the matching target detection frame of the first video frame may be matched with the feature information of the other target detection frames of the second video frame.
  • with the target tracking method provided by the present disclosure, the feature information of each target detection frame of the first video frame is acquired, where the feature information may include any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the second video frame is a video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as any combination of one or more of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • this embodiment builds on the above embodiments and focuses on its differences from them; for the same points, reference may be made to the descriptions in the foregoing embodiments.
  • Step S200: Detect the first video frame by the detector to obtain each target detection frame of the first video frame.
  • step S200 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a detection module 508 running by the processor.
  • the detector may be a high-performance detector based on a deep convolutional neural network, used for determining each target area (target detection frame) in a video frame, for example, a faster region-based convolutional neural network (Faster-Region with Convolutional Neural Network, Faster-RCNN); Faster-RCNN is an industry-leading detector with the advantages of high detection accuracy and fast speed.
  • the first video frame may be a current video frame, and each target detection frame of the first video frame may be obtained after the first video frame passes through the detector.
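  • as a minimal illustrative sketch (the disclosure does not prescribe a concrete detector implementation or API), the per-frame target detection frames could be obtained with the Faster-RCNN detector shipped in torchvision; the 0.5 score threshold is an assumed value:

```python
# Minimal sketch of the detection step, assuming torchvision's Faster-RCNN
# as the deep-CNN detector; the 0.5 score threshold is an assumed value.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_boxes(frame, score_threshold=0.5):
    """frame: CxHxW float tensor in [0, 1]; returns Nx4 boxes [x1, y1, x2, y2]."""
    with torch.no_grad():
        output = detector([frame])[0]
    keep = output["scores"] >= score_threshold
    return output["boxes"][keep]
```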
  • Step S202: Determine a convolutional neural network according to the type of the target in the target detection frame; detect the first video frame and each target detection frame of the first video frame by using the convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • step S202 may be performed by a processor calling a corresponding instruction stored in the memory, or may be performed by the acquisition module 400 executed by the processor.
  • the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, a vehicle; any object that can move and has a distinguishing identifier can be the target in this embodiment, and this embodiment does not limit the type of the target.
  • for example, if a face is to be tracked, a convolutional neural network for face recognition, such as FaceNet or DeepID, may be used; if a pedestrian is to be tracked, a convolutional neural network for pedestrian recognition may be used.
  • a convolutional neural network can be understood as a convolutional neural network for identifying a target during computer vision processing.
  • Each target detection frame of the first video frame and the first video frame is input to the convolutional neural network, and after being processed by the convolutional neural network, the feature information for identifying the target is output from the last fully connected layer of the convolutional neural network.
  • the feature information can be used to calculate the similarity between the targets.
  • the network parameters of the convolutional neural network can be fine-tuned using a specific training data set, or the convolutional neural network can be trained through a specific training method; since this embodiment is a general target tracking method, it places no restrictions on how a specific training data set is used for fine-tuning or on which network parameters of a specific layer of the convolutional neural network are fine-tuned.
  • the convolutional neural network can be appropriately compressed, for example, by using a convolutional neural network with fewer network parameters, or by using a small-scale convolutional neural network to imitate the mapping relationship between the input and output of a large-scale convolutional neural network.
  • the convolutional neural network in this embodiment may be a convolutional neural network trained on ImageNet (an image recognition database), or may be any convolutional neural network for identifying a target.
  • the convolutional neural network can be run on a graphics processor, and the computational efficiency of the convolutional neural network is improved by the powerful graphics computing capability of the graphics processor.
  • the present disclosure does not limit the convolutional neural network.
  • the first video frame and each target detection frame of the first video frame are input into the convolutional neural network; the convolutional neural network performs feature extraction on the first video frame and each of its target detection frames, and extracts the representation information of each target detection frame; the representation information includes a feature vector of the target detection frame and is a fixed-length feature vector.
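  • a minimal sketch of this extraction step is shown below, assuming a ResNet-18 backbone as a stand-in for the target-type-specific convolutional neural network and an assumed 224x224 crop size; the classification head is replaced so that the input features of the last fully connected layer serve as the fixed-length vector:

```python
# Minimal sketch: fixed-length representation vectors per detection frame.
# ResNet-18 stands in for the target-type-specific CNN (an assumption);
# its classification head is replaced so the penultimate features come out.
import torch
import torchvision
import torchvision.transforms.functional as TF

backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()
backbone.eval()

def representation_vectors(frame, boxes):
    """frame: CxHxW tensor in [0, 1]; boxes: Nx4 [x1, y1, x2, y2]."""
    crops = [TF.resize(frame[:, int(y1):int(y2), int(x1):int(x2)], [224, 224])
             for x1, y1, x2, y2 in boxes.tolist()]
    with torch.no_grad():
        feats = backbone(torch.stack(crops))  # N x 512 fixed-length vectors
    return torch.nn.functional.normalize(feats, dim=1)
```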
  • in addition to the representation information, the feature information of the target detection frame may include motion information and shape information.
  • the motion information includes location information of the target detection frame, and the shape information includes size information of the target detection frame.
  • the center point position information of the target detection frame may be determined as the motion information, and the length information and width information of the target detection frame may be determined as the shape information; after the target detection frame is obtained, the motion information and the shape information can be determined according to the center point position information, the length information, and the width information of the target detection frame; this embodiment places no restrictions on how the motion information and the shape information are determined according to the target detection frame.
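  • as a small illustrative sketch, both quantities can be read directly off an [x1, y1, x2, y2] detection frame:

```python
# Minimal sketch: motion information (center point position) and shape
# information (width and height) derived from an [x1, y1, x2, y2] frame.
def motion_and_shape(box):
    x1, y1, x2, y2 = box
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # motion information: position
    size = (x2 - x1, y2 - y1)                    # shape information: size
    return center, size
```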
  • Step S204: Match the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame; if the matching result is a matching success, perform step S206; if the matching result is a matching failure, perform step S208.
  • step S204 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a matching module 402 executed by the processor.
  • S204 may be specifically divided into the following steps:
  • Step S2040 Perform similarity calculation on the feature information.
  • step S2040 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a similarity calculation sub-module 5020 executed by the processor.
  • to perform similarity calculation on the feature information, similarity calculation may be performed between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame, that is: calculating the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • alternatively, the feature information of each target detection frame of the first video frame may be compared pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity data ranges from 0.0 to 1.0, and larger similarity data indicates a higher degree of similarity between the feature information.
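  • a minimal sketch of the three per-pair similarity terms follows; the disclosure only names the quantities, so the normalizations that map each term into the 0.0-1.0 range (the clamping, the exp(...) decay, and the pixel scale) are assumptions:

```python
# Minimal sketch of the three similarity terms; the clamping and exp(...)
# normalizations mapping each term into [0, 1] are assumed, not prescribed.
import numpy as np

def cosine_similarity(f1, f2):
    # Cosine of the angle between two representation vectors, clamped to [0, 1].
    cos = np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8)
    return float(max(0.0, cos))

def center_similarity(c1, c2, scale=100.0):
    # Center distance between two detection frames; `scale` (pixels) is assumed.
    return float(np.exp(-np.linalg.norm(np.subtract(c1, c2)) / scale))

def shape_similarity(s1, s2):
    # Length-width difference between two detection frames, relative to size.
    (w1, h1), (w2, h2) = s1, s2
    return float(np.exp(-(abs(w1 - w2) / (w1 + w2) + abs(h1 - h2) / (h1 + h2))))
```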
  • Step S2042: Weight the similarity calculation results to obtain a similarity matrix.
  • step S2042 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a weighting sub-module 5022 executed by the processor.
  • a similarity matrix may be obtained according to the obtained similarity calculation results, for example, by performing a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference; this embodiment places no restrictions on the weighting of the similarity results.
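  • a minimal sketch of the weighted summation is shown below, reusing the three similarity term functions from the previous sketch; the weights are assumed values, since the disclosure leaves the weighting unrestricted:

```python
# Minimal sketch: weighted summation of the three terms into an N x M
# similarity matrix (N and M detection frames in the first and second
# video frame); the weights are assumed values. Reuses cosine_similarity,
# center_similarity, and shape_similarity from the previous sketch.
import numpy as np

def similarity_matrix(feats1, centers1, shapes1, feats2, centers2, shapes2,
                      w_repr=0.6, w_center=0.2, w_shape=0.2):
    S = np.zeros((len(feats1), len(feats2)))
    for i in range(len(feats1)):
        for j in range(len(feats2)):
            S[i, j] = (w_repr * cosine_similarity(feats1[i], feats2[j])
                       + w_center * center_similarity(centers1[i], centers2[j])
                       + w_shape * shape_similarity(shapes1[i], shapes2[j]))
    return S
```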
  • Step S2044 Perform an optimal matching on the similarity matrix.
  • step S2044 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by an optimal matching sub-module 5024 operated by the processor.
  • the optimal matching may use weighted bipartite graph matching, for example, the Hungarian algorithm; this embodiment places no restrictions on the matching process of the similarity matrix.
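  • as a minimal sketch, the Hungarian algorithm is available as scipy's linear_sum_assignment; the 0.5 acceptance threshold below is an assumed value:

```python
# Minimal sketch: optimal weighted bipartite matching on the similarity
# matrix via the Hungarian algorithm; the 0.5 threshold is an assumption.
from scipy.optimize import linear_sum_assignment

def optimal_matching(S, accept_threshold=0.5):
    rows, cols = linear_sum_assignment(-S)  # negate: maximize total similarity
    matches, unmatched = [], set(range(S.shape[0]))
    for i, j in zip(rows, cols):
        if S[i, j] >= accept_threshold:     # matching success
            matches.append((i, j))
            unmatched.discard(i)
    return matches, sorted(unmatched)       # unmatched frames start new tracks
```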
  • Step S206 Associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  • step S206 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the determining module 504 being executed by the processor.
  • the target detection frame of the first video frame related to the matching result may be added to the tracking trajectory of the target detection frame of the second video frame related to the matching result, that is, the target detection frame of the first video frame related to the matching result may be used as the next movement position in the tracking trajectory of the target detection frame of the second video frame related to the matching result.
  • for example, the target in the target detection frame of the first video frame related to the matching result is target A, and the target in the target detection frame of the second video frame related to the matching result is target A′; the target detection frame of the first video frame related to the matching result can then be used as the new tracking trajectory position of the existing tracking trajectory of the target A or the target A′.
  • the target detection frame of the first video frame related to the matching result is the target detection frame whose feature information is matched with that of a target detection frame of the second video frame to obtain the matching result.
  • for example, the first video frame includes a target detection frame m1 and a target detection frame m2; the target detection frame m1 is matched with a target detection frame m3 of the second video frame to obtain a matching result; then the target detection frame of the first video frame related to the matching result is the target detection frame m1, and the target detection frame of the second video frame related to the matching result is the target detection frame m3.
  • Step S208 The target detection frame of the first video frame related to the matching result is used as a starting target detection frame of the new tracking trajectory.
  • step S208 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the determining module 504 being executed by the processor.
  • the target detection frame of the first video frame related to the matching result may be used as a starting target detection frame of the new tracking trajectory.
  • for example, the target in the target detection frame of the first video frame related to the matching result is target B, and the target in the target detection frame of the second video frame related to the matching result is target B′; the target B and the target B′ are different targets, and the target detection frame of the first video frame related to the matching result can be used as the starting position of a new tracking trajectory of the target B.
  • one of the prerequisites of this step is that the target detection frame of the first video frame related to the matching result has been matched with the feature information of each target detection frame in the second video frame, and each matching result is a matching failure.
  • referring to FIG. 3, a schematic flowchart of the execution process of a target tracking method according to the present disclosure is shown.
  • the video frame to be detected is input to the detector to obtain a target detection frame of the video frame to be detected.
  • the video frame to be detected and the target detection frame are input into the convolutional neural network, and the feature information of the target detection frame is extracted.
  • the extracted feature information is matched with the feature information of the existing tracking tracks; if the matching is successful, the target detection frame of the video frame to be detected is added to the existing tracking track; if the matching fails, the target detection frame of the video frame to be detected is used as the starting point of a new tracking track.
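  • chaining the earlier sketches gives a minimal end-to-end version of this execution flow; the `tracks` bookkeeping structure below is hypothetical, not the disclosure's own:

```python
# Minimal end-to-end sketch of the flow in FIG. 3, reusing detect_boxes,
# representation_vectors, motion_and_shape, similarity_matrix, and
# optimal_matching from the previous sketches. `tracks` (a list of
# per-target box histories) is a hypothetical bookkeeping structure.
def track_video(frames):
    tracks = []  # tracks[k]: list of boxes belonging to target k
    prev = None  # (feats, centers, shapes, track_ids) of the previous frame
    for frame in frames:
        boxes = detect_boxes(frame)
        feats = representation_vectors(frame, boxes)
        info = [motion_and_shape(b) for b in boxes.tolist()]
        centers, shapes = [c for c, _ in info], [s for _, s in info]
        ids = [-1] * len(boxes)
        if prev is not None:
            S = similarity_matrix(feats, centers, shapes,
                                  prev[0], prev[1], prev[2])
            matches, unmatched = optimal_matching(S)
            for i, j in matches:            # matching success: extend the track
                tracks[prev[3][j]].append(boxes[i])
                ids[i] = prev[3][j]
        else:
            matches, unmatched = [], range(len(boxes))
        for i in unmatched:                 # matching failure: start a new track
            tracks.append([boxes[i]])
            ids[i] = len(tracks) - 1
        prev = (feats, centers, shapes, ids)
    return tracks
```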
  • with the target tracking method provided by this embodiment, the feature information of each target detection frame of the first video frame is acquired, where the feature information includes, but is not limited to, any one or more of the following: representation information, motion information, and shape information; the feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking trajectory of each target detection frame of the first video frame is then determined according to the matching result.
  • the first video frame may be the current video frame, the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with any combination of one or more of the feature information, such as the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • Target tracking technology can be generally divided into target online tracking technology and target offline tracking technology.
  • the target tracking method provided by the present disclosure can be applied to an online target tracking scene, for example: acquiring, online, the feature information of each target detection frame of the video frames in a video image played in real time, matching the feature information of each target detection frame of adjacent video frames in the real-time video image, and then determining the tracking trajectory of each target detection frame according to the matching result.
  • the target tracking method provided in this embodiment may be applied to an online video surveillance analysis solution; for example, in a face recognition application, online face detection is required for each video frame, and the face features detected online are input into a face feature database for query, thereby determining the person corresponding to the face in the video frame.
  • with the target tracking method, the face features of the faces in multiple video frames can be matched, and the successfully matched face features are searched in the face feature database, thereby improving the accuracy of face recognition.
  • the target tracking method provided by the present disclosure may also be applied to an offline target tracking scene, for example: acquiring the feature information of each target detection frame of the video frames in an offline video image, matching the feature information of each target detection frame of adjacent video frames in the offline video image, and then determining the tracking trajectory of each target detection frame according to the matching result.
  • any of the methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: a terminal device and a server.
  • any of the methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor performs any of the methods mentioned in the embodiments of the present disclosure by executing corresponding instructions stored in a memory. This will not be repeated below.
  • the foregoing program may be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • with the target tracking system provided by the present disclosure, the target detection frames in the first video frame are matched with the target detection frames in the second video frame according to the feature information, and the tracking trajectory of each target detection frame is determined according to the matching result; because the acquired feature information, such as the representation information, the motion information, or the shape information, is used for matching, not only can more accurate matching results be obtained, but the accuracy of target tracking can also be improved.
  • the system shown in FIG. 4 includes:
  • the obtaining module 400 is configured to acquire feature information of each target detection frame of the first video frame
  • the matching module 402 is configured to match feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame;
  • the determining module 404 is configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame; wherein the second video frame is a video frame before the first video frame.
  • the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, shape information.
  • in this embodiment, the acquiring module acquires the feature information of each target detection frame of the first video frame; the matching module then matches the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame; and the determining module then determines the tracking trajectory of each target detection frame of the first video frame according to the matching result.
  • the first video frame may be the current video frame, the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the second video frame may be acquired and stored in advance.
  • compared with simple image features or optical flow information, the feature information can represent the target detection frames of a video frame more accurately and achieves a better recognition effect on the target detection frames; therefore, matching with the feature information, such as one or a combination of the representation information, the motion information, and the shape information, can not only obtain more accurate matching results but also improve the accuracy of target tracking.
  • the system shown in FIG. 5 includes:
  • the acquiring module 500 is configured to acquire feature information of each target detection frame of the first video frame, where the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, shape information;
  • the matching module 502 is configured to match feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame;
  • the determining module 504 is configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
  • the acquiring module 500 includes: a first acquiring sub-module 5000, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire the representation information of each target detection frame of the first video frame; and a second acquiring sub-module 5002, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  • the target tracking system provided by the present disclosure further includes: a convolutional neural network determining module 506, configured to determine the convolutional neural network according to the type of the target before the acquiring module 500 acquires the feature information of each target detection frame of the first video frame.
  • the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, a vehicle.
  • the matching module 502 includes: a similarity calculation sub-module 5020, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting sub-module 5022, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching sub-module 5024, configured to perform optimal matching on the similarity matrix.
  • optionally, the similarity calculation sub-module 5020 is configured to compare the feature information of each target detection frame of the first video frame pairwise with the feature information of each target detection frame of the second video frame, to obtain the similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  • the similarity calculation sub-module 5020 is configured to calculate the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate the center distance of the target detection frames between the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate the length-width difference of the target detection frames between the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  • the weighting sub-module 5022 is configured to perform a weighted summation or a weighted product on the cosine angle, the center distance, and the length-width difference to obtain a similarity matrix.
  • the determining module 504 is configured to associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame into a continuous tracking trajectory if the matching result is a matching success.
  • the determining module 504 is configured to: if the matching result is a matching failure, use a target detection frame of the first video frame related to the matching result as a starting target detection frame of the new tracking track.
  • the representation information includes a feature vector of the target detection frame
  • the motion information includes location information of the target detection frame
  • the shape information includes size information of the target detection frame
  • the target tracking system provided in this embodiment further includes: a detecting module 508, configured to detect the first video by using the detector before the acquiring module 500 acquires the feature information of each target detection frame of the first video frame. Frame, each target detection frame of the first video frame is obtained.
  • the detector is a deep convolutional neural network based detector.
  • the detector is a faster region-based convolutional neural network (Faster-RCNN).
  • the target tracking system of this embodiment can be used to implement the corresponding target tracking method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments; details are not described herein again.
  • an embodiment of the present application further provides an electronic device, including: a processor and a memory;
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the target tracking method described in any of the above embodiments of the present application.
  • the embodiment of the present application further provides another electronic device, including:
  • a processor and the target tracking system of any of the above embodiments of the present application; when the processor runs the target tracking system, the units in the target tracking system according to any of the above embodiments of the present application are run.
  • the embodiment of the present disclosure further provides an electronic device, which may include, but is not limited to, a mobile terminal, a personal computer (PC), a tablet, and a server.
  • an electronic device 600 can include one or more processors and communication elements; the one or more processors include, for example, one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613, and the processor may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage portion 608 into a random access memory (RAM) 603.
  • the communication component includes a communication component 612 and/or a communication interface 609.
  • the communication component 612 can include, but is not limited to, a network card, and the network card can include, but is not limited to, an IB (InfiniBand) network card.
  • the communication interface 609 includes a communication interface of a network interface card, such as a LAN card or a modem, and the communication interface 609 performs communication processing via a network such as the Internet.
  • the processor can communicate with the read-only memory 602 and/or the random access memory 603 to execute executable instructions, is connected to the communication component 612 via the communication bus 604, and communicates with other target devices via the communication component 612, to complete the operations corresponding to any of the target tracking methods provided by the embodiments of the present disclosure, for example: acquiring feature information of each target detection frame of a first video frame, where the feature information may include, but is not limited to, any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and determining the tracking trajectory of each target detection frame of the first video frame according to the matching result, where the second video frame is a video frame before the first video frame.
  • the CPU 601 or the GPU 613, the ROM 602, and the RAM 603 are connected to each other through a communication bus 604.
  • the executable instructions are written into the ROM 602, and the executable instructions cause the processor to perform the operations corresponding to the target tracking method.
  • An input/output (I/O) interface 605 is also coupled to communication bus 604.
  • the communication component 612 can be integrated, or can be configured to have multiple sub-modules (e.g., multiple IB network cards) and be linked on the communication bus.
  • the following components are connected to the I/O interface 605: an input portion 606 including, but not limited to, a keyboard and a mouse; an output portion 607 including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage portion 608 including, but not limited to, a hard disk; and a communication interface 609 including a network interface card such as a LAN card or a modem.
  • a drive 610 is also coupled to the I/O interface 605 as needed; a removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
  • the architecture shown in FIG. 6 is only an optional implementation manner; during practice, the number and types of components in FIG. 6 may be selected, deleted, added, or replaced according to actual needs; for different functional components, separate or integrated implementations can also be used, for example, the GPU and the CPU can be set separately or the GPU can be integrated on the CPU, and the communication component can be set separately, or can be integrated on the CPU or the GPU.
  • an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium; the computer program includes program code for executing the method illustrated in the flowchart, and the program code may include instructions corresponding to the method steps provided by any embodiment of the present disclosure, for example, instructions corresponding to the following steps: acquiring feature information of each target detection frame of a first video frame, where the feature information includes any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to the matching result, where the second video frame is a video frame before the first video frame.
  • the computer program can be downloaded and installed from a network via the communication component, and/or installed from the removable medium 611.
  • the above-described functions defined in the method of any of the embodiments of the present disclosure are performed when the computer program is executed by a processor.
  • the embodiment of the present application further provides a computer program, including computer readable code; when the computer readable code is run on a device, a processor in the device executes the instructions for implementing the steps of the target tracking method according to any embodiment of the present application.
  • the embodiment of the present application further provides a computer readable storage medium, configured to store computer readable instructions; when the instructions are executed, the operations of the steps of the target tracking method according to any embodiment of the present application are implemented.
  • the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another.
  • for the system embodiments, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiments.
The methods and apparatus of the present disclosure may be implemented in many ways, for example, in software, hardware, firmware, or any combination of software, hardware, and firmware.
The above-described sequence of the steps of the method is for illustrative purposes only; the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
In addition, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present disclosure; thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Abstract

A target tracking method and system, and an electronic device. The method comprises: acquiring feature information of each target detection frame of a first video frame (S100), the feature information comprising any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame (S102); and determining a tracking track of each target detection frame of the first video frame according to the matching result (S104), where the second video frame is a video frame before the first video frame. By means of the method, more precise matching results can be obtained, and the precision of target tracking can also be improved.

Description

Target tracking method, system and electronic device
The present disclosure claims priority to Chinese Patent Application No. 201710124025.6, filed with the Chinese Patent Office on March 3, 2017 and entitled "Target tracking method, system and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to video analysis technologies, and in particular, to a target tracking method, system, and electronic device.
Background
Target tracking technology is one of the important technologies in video analysis. It can be simply described as the following process: in a video consisting of multiple consecutive video frames, from the first video frame to the last video frame, each video frame contains multiple targets; targets continuously appear in or disappear from the video frames, and the targets move continuously across the video frames. The purpose of target tracking is to distinguish each target in the video frames from the other targets, so as to obtain the trajectory of the same target across different video frames.
Summary
The present disclosure provides technical solutions for a target tracking method, system, and electronic device.
According to an aspect of the present disclosure, a target tracking method is provided, including:
acquiring feature information of each target detection frame of a first video frame; matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame, where the second video frame is a video frame before the first video frame; and determining, according to the matching result, a tracking track of each target detection frame of the first video frame.
In an implementation of the present disclosure, the feature information includes any one or more of the following: representation information, motion information, and shape information.
In an implementation of the present disclosure, acquiring the feature information of each target detection frame of the first video frame includes: detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire representation information of each target detection frame of the first video frame; and determining motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
In an implementation of the present disclosure, before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: determining the convolutional neural network according to the type of the target.
In an implementation of the present disclosure, the type of the target includes any one or more of the following: a face, a pedestrian, and a vehicle.
In an implementation of the present disclosure, matching the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame includes: performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; weighting the similarity calculation results to obtain a similarity matrix; and performing optimal matching on the similarity matrix.
In an implementation of the present disclosure, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: comparing the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
In an implementation of the present disclosure, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame includes: calculating a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculating a center distance between the target detection frames according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculating a length-width difference between the target detection frames according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an implementation of the present disclosure, weighting the similarity calculation results to obtain the similarity matrix includes: performing weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
In an implementation of the present disclosure, determining the tracking track of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a successful match, associating the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result into a continuous tracking track.
In an implementation of the present disclosure, determining the tracking track of each target detection frame of the first video frame according to the matching result includes: in response to the matching result being a failed match, using the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking track.
In an implementation of the present disclosure, the representation information includes a feature vector of the target detection frame, the motion information includes position information of the target detection frame, and the shape information includes size information of the target detection frame.
In an implementation of the present disclosure, before the feature information of each target detection frame of the first video frame is acquired, the target tracking method further includes: detecting the first video frame by using a detector, to obtain each target detection frame of the first video frame.
In an implementation of the present disclosure, the detector is a detector based on a deep convolutional neural network.
In an implementation of the present disclosure, the detector is a faster region-based convolutional neural network (Faster-RCNN).
According to another aspect of the present disclosure, a target tracking system is further provided, including: an acquiring module configured to acquire feature information of each target detection frame of a first video frame; a matching module configured to match the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame; and a determining module configured to determine, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
In an implementation of the present disclosure, the feature information includes any one or more of the following: representation information, motion information, and shape information.
In an implementation of the present disclosure, the acquiring module includes: a first acquiring submodule configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network, to acquire representation information of each target detection frame of the first video frame; and a second acquiring submodule configured to determine motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
In an implementation of the present disclosure, the target tracking system further includes: a convolutional neural network determining module configured to determine the convolutional neural network according to the type of the target before the acquiring module acquires the feature information of each target detection frame of the first video frame.
In an implementation of the present disclosure, the type of the target includes any one or more of the following: a face, a pedestrian, and a vehicle.
In an implementation of the present disclosure, the matching module includes: a similarity calculation submodule configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting submodule configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching submodule configured to perform optimal matching on the similarity matrix.
In an implementation of the present disclosure, the similarity calculation submodule is configured to compare the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
In an implementation of the present disclosure, the similarity calculation submodule is configured to: calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame; calculate a center distance between the target detection frames according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame; and calculate a length-width difference between the target detection frames according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an implementation of the present disclosure, the weighting submodule is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
In an implementation of the present disclosure, the determining module is configured to, in response to the matching result being a successful match, associate the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result into a continuous tracking track.
In an implementation of the present disclosure, the determining module is configured to, in response to the matching result being a failed match, use the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking track.
In an implementation of the present disclosure, the representation information includes a feature vector of the target detection frame, the motion information includes position information of the target detection frame, and the shape information includes size information of the target detection frame.
In an implementation of the present disclosure, the target tracking system further includes: a detecting module configured to detect the first video frame by using a detector before the acquiring module acquires the feature information of each target detection frame of the first video frame, to obtain each target detection frame of the first video frame.
In an implementation of the present disclosure, the detector is a detector based on a deep convolutional neural network.
In an implementation of the present disclosure, the detector is a faster region-based convolutional neural network (Faster-RCNN).
According to another aspect of the present disclosure, an electronic device is further provided, including: a processor, a memory, a communication element, and a communication bus, where the processor, the memory, and the communication element communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to any one of the above target tracking methods.
According to another aspect of the present disclosure, an electronic device is further provided, including: a processor and the target tracking system described above; when the processor runs the target tracking system, the units in the target tracking system according to any one of the above are run.
According to another aspect of the present disclosure, a computer program is further provided, including computer-readable code; when the computer-readable code is run in a device, a processor in the device executes executable instructions for implementing the steps of the target tracking method according to any one of the above.
According to another aspect of the present disclosure, a computer-readable medium is further provided, configured to store computer-readable instructions, where when the instructions are executed, the operations of the steps in the target tracking method according to any one of the above are implemented.
According to another aspect of the present disclosure, a computer-readable storage medium is further provided, which stores: executable instructions for acquiring feature information of each target detection frame of a first video frame, the feature information including any one or more of the following: representation information, motion information, and shape information; executable instructions for matching the feature information of each target detection frame of the first video frame with feature information of each target detection frame of a second video frame; and executable instructions for determining, according to the matching result, a tracking track of each target detection frame of the first video frame, where the second video frame is a video frame before the first video frame.
According to the technical solutions provided by the present disclosure, feature information of each target detection frame of a first video frame is acquired, where the feature information includes any one or more of the following: representation information, motion information, and shape information. The feature information of each target detection frame of the first video frame is then matched with feature information of each target detection frame of a second video frame, and the tracking track of each target detection frame of the first video frame is determined according to the matching result. In the present disclosure, the second video frame is a video frame before the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, compared with simple image features or optical flow information, the acquired feature information, such as representation information, motion information, or shape information, can represent the target detection frames of a video frame more accurately and identify the target detection frames more effectively. Therefore, matching with feature information, such as any combination of one or more of the representation information, motion information, and shape information, not only yields more accurate matching results but also improves the precision of target tracking.
The technical solutions of the present disclosure are further described in detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of the steps of a target tracking method provided by the present disclosure;
FIG. 2 is a flowchart of the steps of a target tracking method provided by the present disclosure;
FIG. 3 is a schematic diagram of the execution flow of a target tracking method provided by the present disclosure;
FIG. 4 is a schematic structural diagram of a target tracking system provided by the present disclosure;
FIG. 5 is a schematic structural diagram of a target tracking system provided by the present disclosure; and
FIG. 6 is a schematic structural diagram of an electronic device provided according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Meanwhile, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.
Techniques, methods, and apparatus known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.
It should be noted that similar reference numerals and letters refer to similar items in the following accompanying drawings; therefore, once an item is defined in one drawing, it does not need to be further discussed in subsequent drawings.
The present disclosure can be applied to computer systems/servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above systems.
The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, and data structures, which perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
The target tracking technical solutions provided by the present disclosure are described below with reference to FIG. 1 to FIG. 6. Any target tracking technical solution provided by the present disclosure may be implemented by software, by hardware, or by a combination of software and hardware. For example, the target tracking technical solution provided by the present disclosure may be implemented by an electronic device or by a processor, which is not limited by the present disclosure; the electronic device may include, but is not limited to, a terminal or a server, and the processor may include, but is not limited to, a CPU or a GPU. Details are not described again below.
In FIG. 1, in step S100, feature information of each target detection frame of a first video frame is acquired.
In an optional implementation, step S100 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by an acquiring module 400 run by the processor.
In this embodiment, the first video frame can be understood as the current video frame. The video frame in this embodiment may be any video frame of a video image collected by the processor in real time, or any video frame of a collected complete video stream. Moreover, in practical applications, the processor may detect the video image or video stream frame by frame to obtain video frames, or may detect the video image or video stream by sampled frames to obtain video frames; this embodiment does not limit the means and source of acquiring video frames.
Optionally, in step S100, the acquired feature information is the feature information of each target detection frame of the first video frame; that is, before the feature information is acquired, each target detection frame of the first video frame needs to be acquired first. The process of acquiring each target detection frame of the first video frame will be described in detail in subsequent embodiments.
In an optional implementation, the feature information includes, but is not limited to, any one or more of the following: representation information, motion information, and shape information. The representation information is used to represent the feature vector of the target in the target detection frame, the motion information is used to represent the position of the target detection frame, and the shape information is used to represent the size of the target detection frame. The representation information, motion information, and shape information respectively characterize three different aspects of the target. To improve the accuracy of target tracking in this embodiment, it is preferable to jointly use the representation information, motion information, and shape information as the feature information; selecting only one of them, or a partial combination, as the feature information may affect the accuracy of target tracking. The representation information may be a high-dimensional feature extracted from a deep neural network trained for different types of targets (such as pedestrians or faces).
Compared with simple image features or optical flow information, the feature information in this embodiment can represent the target detection frames of the first video frame more accurately, providing more accurate matching conditions for the subsequent matching process.
In step S102, the feature information of each target detection frame of the first video frame is matched with feature information of each target detection frame of a second video frame.
In an optional implementation, step S102 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a matching module 402 run by the processor.
The second video frame can be understood as a video frame before the first video frame. For example, the second video frame is video frame A with a time interval of 00:08:20 to 00:08:40, and the first video frame is video frame B with a time interval of 00:08:41 to 00:08:60. That is, during playback, the first video frame is played immediately after the second video frame, no other video frame exists between the first video frame and the second video frame, and the first video frame and the second video frame are two consecutive or adjacent video frames.
The feature information of each target detection frame of the second video frame may be acquired and stored in advance; that is, before step S100, there is also a step of acquiring and storing the feature information of each target detection frame of the second video frame. Similarly, the feature information of each target detection frame of the first video frame acquired in step S100 may also be stored.
Since the feature information includes any one or more of the following: representation information, motion information, and shape information, when the feature information is matched, the representation information, motion information, and shape information may be matched separately.
In step S104, a tracking track of each target detection frame of the first video frame is determined according to the matching result.
In an optional implementation, step S104 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a determining module 404 run by the processor.
In one implementation, the matching result includes a successful match and a failed match.
When the matching result is a successful match, it indicates that the feature information of the target detection frame of the first video frame related to the matching result (a target detection frame related to the matching result may be referred to as a matching target detection frame) and the feature information of the target detection frame of the second video frame related to the matching result are close or identical in similarity; that is, the target in the matching target detection frame of the first video frame and the target in the matching target detection frame of the second video frame are close or identical in similarity. In this case, the matching target detection frame of the first video frame can be added to the tracking track of the matching target detection frame of the second video frame.
When the matching result is a failed match, it indicates that the feature information of the target detection frame of the first video frame related to the matching result and the feature information of the target detection frame of the second video frame related to the matching result are not close or not identical in similarity; that is, the target in the matching target detection frame of the first video frame and the target in the matching target detection frame of the second video frame are not close or not identical in similarity. In this case, the matching target detection frame of the first video frame can be used as the starting point of a new tracking track, or the feature information of the matching target detection frame of the first video frame can be matched with the feature information of the other target detection frames of the second video frame.
Through the target tracking method provided by the present disclosure, feature information of each target detection frame of the first video frame is acquired, where the feature information may include any one or more of the following: representation information, motion information, and shape information. The feature information of each target detection frame of the first video frame is then matched with the feature information of each target detection frame of the second video frame, and the tracking track of each target detection frame of the first video frame is determined according to the matching result. In the present disclosure, the second video frame is a video frame before the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, compared with simple image features or optical flow information, the acquired feature information, such as representation information, motion information, or shape information, can represent the target detection frames of a video frame more accurately and identify the target detection frames more effectively. Therefore, matching with feature information, such as any combination of one or more of the representation information, motion information, and shape information, not only yields more accurate matching results but also improves the precision of target tracking.
As shown in FIG. 2, this embodiment of the present disclosure builds on the foregoing embodiments and emphasizes the differences from them; for the same parts, reference may be made to the introduction and description in the foregoing embodiments.
In FIG. 2, in step S200, the first video frame is detected by using a detector to obtain each target detection frame of the first video frame.
In an optional implementation, step S200 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a detecting module 508 run by the processor.
Optionally, the detector may be a high-performance detector based on a deep convolutional neural network, which is used to determine each target region (target detection frame) in a video frame, for example, a faster region-based convolutional neural network (Faster-RCNN); Faster-RCNN is currently an industry-leading detector with the advantages of high detection accuracy and fast speed.
The first video frame may be the current video frame; after the first video frame passes through the detector, each target detection frame of the first video frame can be obtained.
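As an illustrative sketch only (not a definitive implementation of the disclosure), the detection step could be realized with an off-the-shelf Faster-RCNN such as the one shipped with torchvision; the weights choice and the score threshold of 0.5 are assumptions:

```python
import torch
import torchvision

# A Faster-RCNN pretrained on COCO, standing in for the patent's
# deep-convolutional-network detector (assumed choice, torchvision >= 0.13).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_boxes(frame, score_threshold=0.5):
    """Return the target detection frames (boxes) of one video frame.

    `frame` is a float tensor of shape (3, H, W) with values in [0, 1].
    """
    with torch.no_grad():
        output = detector([frame])[0]
    keep = output["scores"] > score_threshold
    return output["boxes"][keep]  # each row is (x1, y1, x2, y2)
```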
In step S202, a convolutional neural network is determined according to the type of the target in the target detection frames; the first video frame and each target detection frame of the first video frame are detected by using the convolutional neural network to acquire the representation information of each target detection frame of the first video frame; and the motion information and shape information of each target detection frame of the first video frame are determined according to each target detection frame of the first video frame.
In an optional implementation, step S202 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the acquiring module 400 run by the processor.
In one implementation, the type of the target may include, but is not limited to, any one or more of the following: a face, a pedestrian, and a vehicle. Any object that can move and has a distinguishing identity can serve as the target in this embodiment; this embodiment does not limit the type of the target. In practical application scenarios, to track faces, a convolutional neural network for face recognition, such as FaceNet or DeepID, may be used; to track pedestrians, a convolutional neural network for pedestrian recognition may be used.
Optionally, the convolutional neural network can be understood as a convolutional neural network used to identify targets in computer vision processing. The first video frame and each target detection frame of the first video frame are input to the convolutional neural network; after computation by the convolutional neural network, feature information for identifying the target is output from the last fully connected layer of the convolutional neural network, and the feature information can be used to calculate the similarity between targets.
Moreover, to improve the accuracy of the feature information acquired by the convolutional neural network, during training the network parameters of the convolutional neural network may be fine-tuned using a specific training data set, or the convolutional neural network may be trained in a specific training manner. However, since this embodiment is a general target tracking method, it does not limit how a specific training data set is used for fine-tuning, or which network parameters of which layers of the convolutional neural network are fine-tuned.
In addition, to meet the requirements on the computational efficiency of the convolutional neural network, the convolutional neural network may be appropriately compressed, for example, by using a convolutional neural network with fewer network parameters, or by using a small-scale convolutional neural network to imitate the input-output mapping of a large-scale convolutional neural network.
In an optional implementation, the convolutional neural network in this embodiment may be a convolutional neural network trained on ImageNet (an image recognition database), or may be any convolutional neural network used to identify targets. Since the convolutional neural network involves a large amount of computation, it can be run on a graphics processor, whose powerful graphics computing capability improves the computational efficiency of the convolutional neural network. The present disclosure does not limit the convolutional neural network.
The first video frame and each target detection frame of the first video frame are input into the convolutional neural network, and the convolutional neural network performs feature extraction on them to extract the representation information of each target detection frame; the representation information includes the feature vector of the target detection frame and is a fixed-length feature vector. In addition to the representation information, the feature information of the target detection frame may further include motion information or shape information, where the motion information includes the position information of the target detection frame, and the shape information includes the size information of the target detection frame. In the process of determining the motion information and shape information according to the target detection frame, in one feasible implementation, the center point position information of the target detection frame may be determined as the motion information, and the length information and width information of the target detection frame may be determined as the shape information; after the target detection frame is obtained, the motion information and shape information can then be determined from the center point position, length, and width of the target detection frame. This embodiment does not limit how the motion information and shape information are determined according to the target detection frame.
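A minimal sketch of this feature-extraction step is given below; the embedding network `embed_net` (any CNN whose last fully connected layer yields a fixed-length vector) and the 128x128 crop size are assumptions not specified by the disclosure:

```python
import torch
import torch.nn.functional as F

def extract_features(frame, box, embed_net):
    """Compute (representation, motion, shape) information for one detection frame.

    `frame`: float tensor (3, H, W); `box`: tensor (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = box.int().tolist()
    crop = frame[:, y1:y2, x1:x2].unsqueeze(0)
    crop = F.interpolate(crop, size=(128, 128))                # fixed input size (assumed)
    appearance = embed_net(crop).flatten()                     # fixed-length feature vector
    motion = torch.tensor([(x1 + x2) / 2.0, (y1 + y2) / 2.0])  # box center position
    shape = torch.tensor([float(x2 - x1), float(y2 - y1)])     # box width and height
    return appearance, motion, shape
```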
In step S204, the feature information of each target detection frame of the first video frame is matched with the feature information of each target detection frame of the second video frame; if the matching result is a successful match, step S206 is performed; if the matching result is a failed match, step S208 is performed.
In an optional implementation, step S204 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the matching module 402 run by the processor.
In one implementation, step S204 can be divided into the following steps.
In step S2040, similarity calculation is performed on the feature information.
In an optional implementation, step S2040 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a similarity calculation submodule 5020 run by the processor.
Specifically, the similarity calculation may be performed between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; that is, the cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame is calculated, the center distance between the target detection frames is calculated according to the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and the length-width difference between the target detection frames is calculated according to the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
In an optional implementation, performing the similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame may also comprise comparing the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame one by one, to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame. The similarity data ranges from 0.0 to 1.0; the larger the similarity data, the higher the degree of similarity between the feature information. A sketch of the three pairwise terms follows.
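Under the assumption that cosine similarity, Euclidean center distance, and absolute width/height differences are the concrete measures (the disclosure names the quantities but not their exact formulas), the three terms could be computed as:

```python
import torch
import torch.nn.functional as F

def pairwise_terms(feat_a, feat_b):
    """Compute the three similarity terms between two detection frames.

    Each argument is an (appearance, motion, shape) tuple, e.g. from the
    hypothetical extract_features sketch above.
    """
    cos = F.cosine_similarity(feat_a[0], feat_b[0], dim=0)    # cosine of the angle
    center_dist = torch.norm(feat_a[1] - feat_b[1])           # center distance
    shape_diff = torch.sum(torch.abs(feat_a[2] - feat_b[2]))  # length-width difference
    return cos, center_dist, shape_diff
```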
In step S2042, the similarity calculation results are weighted to obtain a similarity matrix.
In an optional implementation, step S2042 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by a weighting submodule 5022 run by the processor.
Optionally, the weighting may be performed according to the actually obtained similarity calculation results to obtain the similarity matrix; for example, weighted summation or weighted multiplication is performed on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix. This embodiment does not limit the weighting of the similarity results; one possible form of the weighted-sum variant is shown below.
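In the weighted-sum variant, for instance, one entry of the similarity matrix could take the following form, where a_i, c_i, and s_i denote the representation vector, box center, and box size of the i-th detection frame of the first video frame (primed symbols belong to the second video frame); the weights w_1, w_2, w_3 and the sign convention (distance terms penalized) are assumptions:

```latex
S_{ij} = w_1 \cos(\mathbf{a}_i, \mathbf{a}'_j)
       - w_2 \lVert \mathbf{c}_i - \mathbf{c}'_j \rVert_2
       - w_3 \lVert \mathbf{s}_i - \mathbf{s}'_j \rVert_1
```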
In step S2044, optimal matching is performed on the similarity matrix.
In an optional implementation, step S2044 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by an optimal matching submodule 5024 run by the processor.
Optionally, the optimal matching may use weighted bipartite graph matching, for example, the Hungarian algorithm; this embodiment does not limit the matching process of the similarity matrix.
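One way to realize the weighted bipartite matching is SciPy's Hungarian-style solver; the similarity threshold of 0.3 below, used to reject weak assignments as failed matches, is an assumed hyperparameter:

```python
from scipy.optimize import linear_sum_assignment

def optimal_matching(similarity, threshold=0.3):
    """Match rows (current-frame boxes) to columns (previous-frame boxes).

    `similarity` is an (n, m) numpy array; returns matched (i, j) pairs and
    the row indices left unmatched (candidate starts of new tracks).
    """
    rows, cols = linear_sum_assignment(-similarity)  # maximize total similarity
    matches = [(i, j) for i, j in zip(rows, cols) if similarity[i, j] >= threshold]
    matched_rows = {i for i, _ in matches}
    unmatched = [i for i in range(similarity.shape[0]) if i not in matched_rows]
    return matches, unmatched
```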
In step S206, the target detection frame of the first video frame related to the matching result and the target detection frame of the second video frame related to the matching result are associated into a continuous tracking track.
In an optional implementation, step S206 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the determining module 504 run by the processor.
If the matching result is a successful match, the target detection frame of the first video frame related to the matching result can be added to the tracking track of the target detection frame of the second video frame related to the matching result; it then serves as the next movement position, in the tracking track, of the target detection frame of the second video frame related to the matching result. For example, the target in the target detection frame of the first video frame related to the matching result is target A, and the target in the target detection frame of the second video frame related to the matching result is target A'. If the matching result is a successful match, target A and target A' are the same target, and the target detection frame of the first video frame related to the matching result can serve as the new track position of the existing tracking track of target A (i.e., target A').
It should be noted that the above target detection frame of the first video frame related to the matching result is the target detection frame whose feature information was matched with that of a target detection frame of the second video frame to obtain the matching result. For example, the first video frame includes target detection frame m1 and target detection frame m2; if the feature information of target detection frame m1 is matched with that of target detection frame m3 of the second video frame to obtain a matching result, the target detection frame of the first video frame related to the matching result is target detection frame m1, and the target detection frame of the second video frame related to the matching result is target detection frame m3.
In step S208, the target detection frame of the first video frame related to the matching result is used as the starting target detection frame of a new tracking track.
In an optional implementation, step S208 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the determining module 504 run by the processor.
If the matching result is a failed match, the target detection frame of the first video frame related to the matching result can be used as the starting target detection frame of a new tracking track. For example, the target in the target detection frame of the first video frame related to the matching result is target B, and the target in the target detection frame of the second video frame related to the matching result is target B'. If the matching result is a failed match, target B and target B' are different targets, and the target detection frame of the first video frame related to the matching result can serve as the starting position of the new tracking track of target B.
It should be noted that one precondition of this step is that the target detection frame of the first video frame related to the matching result has been matched, in terms of feature information, with every target detection frame in the second video frame, and every matching result is a failed match.
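Putting the two branches together, a minimal bookkeeping sketch might look as follows (the `tracks` dictionary of box histories is an assumed data structure, not dictated by the disclosure):

```python
def update_tracks(tracks, prev_track_ids, boxes, matches, unmatched):
    """Extend matched tracks (S206) and start new ones (S208).

    `tracks` maps an integer track id to the list of boxes on that track;
    `prev_track_ids[j]` is the track id of the j-th previous-frame box.
    Returns the track id of each current-frame box, in box order.
    """
    current_ids = [None] * len(boxes)
    for i, j in matches:                 # successful match: extend the track
        track_id = prev_track_ids[j]
        tracks[track_id].append(boxes[i])
        current_ids[i] = track_id
    for i in unmatched:                  # failed match: start a new track
        track_id = max(tracks, default=-1) + 1
        tracks[track_id] = [boxes[i]]
        current_ids[i] = track_id
    return current_ids
```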
Based on the above description, FIG. 3 shows a schematic diagram of the execution flow of the target tracking method provided by the present disclosure. A video frame to be detected is input into the detector to obtain the target detection frames of the video frame to be detected. The video frame to be detected and its target detection frames are both input into the convolutional neural network, and the feature information of the target detection frames is extracted. The extracted feature information is matched with the feature information of the existing tracking tracks; if the matching succeeds, the target detection frame of the video frame to be detected is added to the existing tracking track; if the matching fails, the target detection frame of the video frame to be detected is used as the starting point of a new tracking track. The sketch below chains the earlier hypothetical snippets into one such per-frame pass.
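Chaining the sketches above into one per-frame pass (all function names come from the earlier hypothetical snippets and are assumed to be in scope; the weights are assumed values):

```python
import torch

def process_frame(frame, tracks, prev_state, embed_net,
                  w1=1.0, w2=0.01, w3=0.01):
    """One pass of the flow in FIG. 3: detect, extract features, match, update."""
    boxes = detect_boxes(frame)
    feats = [extract_features(frame, b, embed_net) for b in boxes]
    prev_feats, prev_track_ids = prev_state
    similarity = torch.zeros(len(feats), len(prev_feats))
    for i in range(len(feats)):
        for j in range(len(prev_feats)):
            cos, dist, diff = pairwise_terms(feats[i], prev_feats[j])
            similarity[i, j] = w1 * cos - w2 * dist - w3 * diff  # weighted sum
    matches, unmatched = optimal_matching(similarity.numpy())
    current_ids = update_tracks(tracks, prev_track_ids, boxes, matches, unmatched)
    return feats, current_ids  # becomes prev_state for the next frame
```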
通过本实施例提供的目标跟踪方法,获取第一视频帧的每个目标检测框的特征信息,其中,特征信息包括但不限于以下任意一项或多项:表象信息、运动信息、形状信息。进而将第一视频帧的每个目标检测框的特征信息与第二视频帧的每个目标检测框的特征信息进行匹配,然后根据匹配结果确定第一视频帧的每个目标检测框的跟踪轨迹。本公开中,第一视频帧可以为当前视频帧,第二视频帧可以为第一视频帧之前的视频帧,可以预先储存已获取的第二视频帧的每个目标检测框的特征信息。The feature information of each target detection frame of the first video frame is obtained by the target tracking method provided by the embodiment, where the feature information includes but is not limited to any one or more of the following: representation information, motion information, and shape information. And further matching feature information of each target detection frame of the first video frame with feature information of each target detection frame of the second video frame, and then determining a tracking track of each target detection frame of the first video frame according to the matching result. . In the present disclosure, the first video frame may be the current video frame, and the second video frame may be the video frame before the first video frame, and the feature information of each target detection frame of the acquired second video frame may be pre-stored.
本公开中,由于获取到的特征信息,如:表象信息、运动信息、形状信息,相比于简单的 图像特征或者光流信息,能够更加精确的表示视频帧的目标检测框,对于目标检测框的识别效果更好,因此,使用特征信息,如表象信息、运动信息、形状信息中的一种或几种的任意组合进行匹配,不仅可以获得更加精确的匹配结果,还可以提高目标跟踪的精度。In the present disclosure, due to the acquired feature information, such as representation information, motion information, and shape information, the target detection frame of the video frame can be more accurately represented than the simple image feature or optical flow information, and the target detection frame is The recognition effect is better. Therefore, matching with any combination of one or more of the feature information, such as the representation information, the motion information, and the shape information, can not only obtain more accurate matching results, but also improve the accuracy of the target tracking. .
Target tracking technology can generally be divided into online target tracking and offline target tracking. The target tracking method provided by the present disclosure can be applied to an online target tracking scenario, for example, acquiring online the feature information of each target detection frame of the video frames of a video played in real time, matching the feature information of the target detection frames of adjacent video frames, and then determining the tracking trajectory of each target detection frame according to the matching result.
In a feasible implementation, the target tracking method provided by this embodiment may be applied to an online video surveillance analysis solution. For example, in a face recognition application, online face detection needs to be performed on every video frame, and the detected face features are queried against a face feature database to determine the person corresponding to a face in a video frame. The face features of the faces in multiple video frames can be matched, and the successfully matched face features are then queried in the face feature database, which improves the accuracy of face recognition.
Meanwhile, the target tracking method provided by the present disclosure can also be applied to an offline target tracking scenario, for example, acquiring the feature information of each target detection frame of the video frames of an offline video, matching the feature information of the target detection frames of adjacent video frames, and then determining the tracking trajectory of each target detection frame according to the matching result.
Any method provided by the embodiments of the present disclosure may be performed by any appropriate device with data processing capability, including but not limited to a terminal device and a server. Alternatively, any method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor performs any method mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. Details are not described again below.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
When the technical solutions shown in FIG. 1 to FIG. 3 are applied and the processor implements target tracking, the target detection frames in the first video frame are matched against the target detection frames in the second video frame according to the feature information, and the tracking trajectories of the target detection frames are determined according to the matching results. Owing to the acquired feature information, such as representation information, motion information, or shape information, not only can more accurate matching results be obtained, but the precision of target tracking can also be improved.
Referring to FIG. 4, the system shown includes:
an acquisition module 400, configured to acquire feature information of each target detection frame of a first video frame;
a matching module 402, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame; and
a determination module 404, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In one implementation, the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information.
With the target tracking system provided by this embodiment, the acquisition module acquires the feature information of each target detection frame of the first video frame, the matching module matches the feature information of each target detection frame of the first video frame against the feature information of each target detection frame of the second video frame, and the determination module then determines the tracking trajectory of each target detection frame of the first video frame according to the matching result. In the present disclosure, the first video frame may be the current video frame, the second video frame may be a video frame preceding the first video frame, and the acquired feature information of each target detection frame of the second video frame may be stored in advance.
In the present disclosure, the acquired feature information, such as representation information, motion information, and shape information, can represent the target detection frames of a video frame more precisely than simple image features or optical flow information, and yields a better recognition effect for the target detection frames. Therefore, matching with any combination of one or more kinds of feature information, such as representation information, motion information, and shape information, not only produces more accurate matching results but also improves the precision of target tracking.
Referring to FIG. 5, the system shown includes:
an acquisition module 500, configured to acquire feature information of each target detection frame of a first video frame, where the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information;
a matching module 502, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame; and
a determination module 504, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In one implementation, the acquisition module 500 includes: a first acquisition submodule 5000, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire the representation information of each target detection frame of the first video frame; and a second acquisition submodule 5002, configured to determine the motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
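As a rough illustration (an assumption about the box format, not the disclosed implementation), the motion information and shape information can be read directly off a detection box given as (x1, y1, x2, y2):

    # Hypothetical sketch: derive motion information (box center) and shape
    # information (box width and height) from a box (x1, y1, x2, y2).
    def box_motion_and_shape(box):
        x1, y1, x2, y2 = box
        center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # motion information: position
        size = (x2 - x1, y2 - y1)                    # shape information: size
        return center, size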
In one implementation, the target tracking system provided by the present disclosure further includes: a convolutional neural network determination module 506, configured to determine the convolutional neural network according to the type of the target before the acquisition module 500 acquires the feature information of each target detection frame of the first video frame.
Optionally, the type of the target may include but is not limited to any one or more of the following: a face, a pedestrian, and a vehicle.
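One way to picture this module (purely hypothetical; the registry and model names below are invented for illustration) is a simple lookup from target type to feature-extraction network:

    # Hypothetical sketch: select a feature-extraction CNN by target type.
    FEATURE_NETS = {
        "face": "face_embedding_cnn",
        "pedestrian": "person_reid_cnn",
        "vehicle": "vehicle_reid_cnn",
    }

    def select_feature_net(target_type, load_model):
        # load_model is a user-supplied loader mapping a name to a CNN instance.
        return load_model(FEATURE_NETS[target_type])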
In one implementation, the matching module 502 includes: a similarity calculation submodule 5020, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame; a weighting submodule 5022, configured to weight the similarity calculation results to obtain a similarity matrix; and an optimal matching submodule 5024, configured to perform optimal matching on the similarity matrix.
Optionally, the similarity calculation submodule 5020 is configured to compare, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
Optionally, the similarity calculation submodule 5020 is configured to calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
Optionally, the weighting submodule 5022 is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
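A minimal sketch of the three similarity terms, their weighted combination, and the optimal matching step is given below; the cosine similarity form, the conversion of distances into similarities, the weights, and the use of SciPy's Hungarian solver are all illustrative assumptions rather than the disclosed implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def similarity_matrix(feats1, feats2, centers1, centers2, sizes1, sizes2,
                          w_app=0.6, w_motion=0.2, w_shape=0.2):
        # Representation term: cosine similarity between feature vectors.
        f1 = feats1 / np.linalg.norm(feats1, axis=1, keepdims=True)
        f2 = feats2 / np.linalg.norm(feats2, axis=1, keepdims=True)
        cos_sim = f1 @ f2.T
        # Motion term: distance between target detection frame centers.
        center_dist = np.linalg.norm(
            centers1[:, None, :] - centers2[None, :, :], axis=2)
        # Shape term: absolute width/height (length-width) difference.
        shape_diff = np.abs(sizes1[:, None, :] - sizes2[None, :, :]).sum(axis=2)
        # Weighted summation; distances are mapped into (0, 1] similarities.
        return (w_app * cos_sim
                + w_motion / (1.0 + center_dist)
                + w_shape / (1.0 + shape_diff))

    def optimal_match(sim, threshold=0.5):
        # The Hungarian algorithm maximizes total similarity; pairs whose
        # similarity falls below the threshold count as matching failures.
        rows, cols = linear_sum_assignment(-sim)
        return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= threshold]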
Optionally, the determination module 504 is configured to, if the matching result is a matching success, associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
Optionally, the determination module 504 is configured to, if the matching result is a matching failure, use the target detection frame of the first video frame related to the matching result as the starting target detection frame of a new tracking trajectory.
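A corresponding sketch of this determination step follows, under the illustrative assumption that matches are (trajectory index, detection index) pairs produced by the optimal matching step:

    # Illustrative sketch: a matching success extends an existing trajectory;
    # a matching failure starts a new one.
    def update_tracks(tracks, matches, boxes):
        matched = set()
        for track_id, box_id in matches:       # matching success
            tracks[track_id].append(boxes[box_id])
            matched.add(box_id)
        for box_id, box in enumerate(boxes):   # matching failure
            if box_id not in matched:
                tracks.append([box])           # start of a new trajectory
        return tracks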
Optionally, the representation information includes a feature vector of a target detection frame, the motion information includes position information of a target detection frame, and the shape information includes size information of a target detection frame.
In one implementation, the target tracking system provided by this embodiment further includes: a detection module 508, configured to detect the first video frame by using a detector to obtain each target detection frame of the first video frame before the acquisition module 500 acquires the feature information of each target detection frame of the first video frame.
Optionally, the detector is a detector based on a deep convolutional neural network.
Optionally, the detector is a faster region-based convolutional neural network (Faster-RCNN).
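For illustration only, a Faster-RCNN detector of this kind could be obtained off the shelf, for example from torchvision; the specific model constructor and the score threshold below are assumptions about tooling, not part of the disclosure:

    import torch
    import torchvision

    # Illustrative only: obtain target detection frames with an off-the-shelf
    # Faster-RCNN (torchvision >= 0.13 weights API assumed).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def detect(frame_tensor, score_thresh=0.5):
        # frame_tensor: float tensor of shape (3, H, W), values in [0, 1]
        with torch.no_grad():
            out = model([frame_tensor])[0]
        keep = out["scores"] >= score_thresh
        return out["boxes"][keep]  # (N, 4) boxes as (x1, y1, x2, y2)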
The target tracking system of this embodiment can be used to implement the corresponding target tracking methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments; details are not described herein again.
In addition, an embodiment of the present application further provides an electronic device, including a processor and a memory.
The memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the target tracking method described in any one of the foregoing embodiments of the present application.
In addition, an embodiment of the present application further provides another electronic device, including:
a processor and the target tracking apparatus described in any one of the foregoing embodiments of the present application, where the units in the target tracking apparatus described in any one of the foregoing embodiments of the present application are run when the processor runs the target tracking apparatus.
An embodiment of the present disclosure further provides an electronic device, which may include but is not limited to a mobile terminal, a personal computer (PC), a tablet computer, or a server. Referring to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the system of an embodiment of the present disclosure is shown. As shown in FIG. 6, the electronic device 600 may include one or more processors and communication elements. The one or more processors are, for example, one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613; the processors may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage section 608 into a random access memory (RAM) 603. The communication elements include a communication component 612 and/or a communication interface 609. The communication component 612 may include but is not limited to a network card, and the network card may include but is not limited to an InfiniBand (IB) network card; the communication interface 609 includes a communication interface of a network interface card such as a LAN card or a modem, and the communication interface 609 performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 602 and/or the random access memory 603 to execute executable instructions, is connected to the communication component 612 through a communication bus 604, and communicates with other target devices via the communication component 612, so as to perform the operations corresponding to any target tracking method provided by the embodiments of the present disclosure, for example: acquiring feature information of each target detection frame of a first video frame, where the feature information may include but is not limited to any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame respectively with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame.
In addition, the RAM 603 may further store various programs and data required for operation. The CPU 601 or the GPU 613, the ROM 602, and the RAM 603 are connected to one another through the communication bus 604. Executable instructions are written into the ROM 602, and the executable instructions cause the processor to perform the operations corresponding to the foregoing target tracking method. An input/output (I/O) interface 605 is also connected to the communication bus 604. The communication component 612 may be integrated, or may be configured to have multiple submodules (for example, multiple IB network cards) linked on the communication bus.
The following components are connected to the I/O interface 605: an input section 606 including but not limited to a keyboard and a mouse; an output section 607 including but not limited to a cathode-ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage section 608 including but not limited to a hard disk; and a communication interface 609 including a network interface card such as a LAN card or a modem. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from the removable medium can be installed into the storage section 608 as needed.
It should be noted that the architecture shown in FIG. 6 is merely an optional implementation. In practice, the number and types of the components in FIG. 6 may be selected, reduced, increased, or replaced according to actual requirements. Different functional components may be configured separately or in an integrated manner; for example, the GPU and the CPU may be configured separately, or the GPU may be integrated on the CPU, and the communication elements may be configured separately or integrated on the CPU or the GPU. These alternative implementations all fall within the protection scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program contains program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by any embodiment of the present disclosure, for example, instructions corresponding to the following steps: acquiring feature information of each target detection frame of a first video frame, where the feature information includes any one or more of the following: representation information, motion information, and shape information; matching the feature information of each target detection frame of the first video frame respectively with the feature information of each target detection frame of a second video frame; and determining a tracking trajectory of each target detection frame of the first video frame according to a matching result, where the second video frame is a video frame preceding the first video frame. In such an embodiment, the computer program may be downloaded from a network and installed through the communication elements, and/or installed from the removable medium 611. When the computer program is executed by the processor, the above functions defined in the method of any embodiment of the present disclosure are performed.
In addition, an embodiment of the present application further provides a computer program, including computer-readable code; when the computer-readable code is run on a device, a processor in the device executes instructions for implementing the steps of the target tracking method described in any embodiment of the present application.
In addition, an embodiment of the present application further provides a computer-readable storage medium configured to store computer-readable instructions; when the instructions are executed, the operations of the steps of the target tracking method described in any embodiment of the present application are implemented.
The embodiments in this specification are all described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. The system embodiments basically correspond to the method embodiments and are therefore described relatively simply; for related parts, reference may be made to the descriptions of the method embodiments.
The methods and apparatuses of the present disclosure may be implemented in many ways. For example, the methods and apparatuses of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is merely for description; the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for performing the methods according to the present disclosure.
The description of the present disclosure is given for purposes of illustration and description and is not exhaustive or intended to limit the present disclosure to the disclosed form. Many modifications and variations will be apparent to a person of ordinary skill in the art. The embodiments were selected and described to better explain the principles and practical applications of the present disclosure, and to enable a person of ordinary skill in the art to understand the present disclosure and thereby design various embodiments with various modifications suited to particular uses.

Claims (34)

  1. A target tracking method, comprising:
    acquiring feature information of each target detection frame of a first video frame;
    matching the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame, wherein the second video frame is a video frame preceding the first video frame; and
    determining a tracking trajectory of each target detection frame of the first video frame according to a matching result.
  2. The method according to claim 1, wherein the feature information comprises any one or more of the following: representation information, motion information, and shape information.
  3. The method according to claim 1 or 2, wherein the acquiring feature information of each target detection frame of a first video frame comprises:
    detecting the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire representation information of each target detection frame of the first video frame; and
    determining motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  4. The method according to claim 3, wherein before the acquiring feature information of each target detection frame of a first video frame, the method further comprises:
    determining the convolutional neural network according to a type of a target.
  5. The method according to claim 4, wherein the type of the target comprises any one or more of the following: a face, a pedestrian, and a vehicle.
  6. The method according to any one of claims 1 to 5, wherein the matching the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame comprises:
    performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame;
    weighting similarity calculation results to obtain a similarity matrix; and
    performing optimal matching on the similarity matrix.
  7. The method according to claim 6, wherein the performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame comprises:
    comparing, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  8. The method according to claim 6 or 7, wherein the performing similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame comprises:
    calculating a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculating a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculating a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  9. The method according to claim 8, wherein the weighting similarity calculation results to obtain a similarity matrix comprises:
    performing weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  10. The method according to any one of claims 1 to 9, wherein the determining a tracking trajectory of each target detection frame of the first video frame according to a matching result comprises:
    in response to the matching result being a matching success, associating the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  11. The method according to any one of claims 1 to 9, wherein the determining a tracking trajectory of each target detection frame of the first video frame according to a matching result comprises:
    in response to the matching result being a matching failure, using the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  12. The method according to any one of claims 1 to 11, wherein the representation information comprises a feature vector of a target detection frame, the motion information comprises position information of a target detection frame, and the shape information comprises size information of a target detection frame.
  13. The method according to any one of claims 1 to 12, wherein before the acquiring feature information of each target detection frame of a first video frame, the method further comprises:
    detecting the first video frame by using a detector to obtain each target detection frame of the first video frame.
  14. The method according to claim 13, wherein the detector is a detector based on a deep convolutional neural network.
  15. The method according to claim 13 or 14, wherein the detector is a faster region-based convolutional neural network (Faster-RCNN).
  16. A target tracking system, comprising:
    an acquisition module, configured to acquire feature information of each target detection frame of a first video frame;
    a matching module, configured to match the feature information of each target detection frame of the first video frame respectively with feature information of each target detection frame of a second video frame, wherein the second video frame is a video frame preceding the first video frame; and
    a determination module, configured to determine a tracking trajectory of each target detection frame of the first video frame according to a matching result.
  17. The system according to claim 16, wherein the feature information comprises any one or more of the following: representation information, motion information, and shape information.
  18. The system according to claim 16 or 17, wherein the acquisition module comprises:
    a first acquisition submodule, configured to detect the first video frame and each target detection frame of the first video frame by using a convolutional neural network to acquire representation information of each target detection frame of the first video frame; and
    a second acquisition submodule, configured to determine motion information and shape information of each target detection frame of the first video frame according to each target detection frame of the first video frame.
  19. The system according to claim 18, wherein the system further comprises:
    a convolutional neural network determination module, configured to determine the convolutional neural network according to a type of a target before the acquisition module acquires the feature information of each target detection frame of the first video frame.
  20. The system according to claim 19, wherein the type of the target comprises any one or more of the following: a face, a pedestrian, and a vehicle.
  21. The system according to any one of claims 16 to 20, wherein the matching module comprises:
    a similarity calculation submodule, configured to perform similarity calculation between the feature information of each target detection frame of the first video frame and the feature information of each target detection frame of the second video frame;
    a weighting submodule, configured to weight similarity calculation results to obtain a similarity matrix; and
    an optimal matching submodule, configured to perform optimal matching on the similarity matrix.
  22. The system according to claim 21, wherein the similarity calculation submodule is configured to compare, one by one, the feature information of each target detection frame of the first video frame with the feature information of each target detection frame of the second video frame to obtain similarity data between each target detection frame of the first video frame and each target detection frame of the second video frame.
  23. The system according to claim 21 or 22, wherein the similarity calculation submodule is configured to calculate a cosine angle between the representation information of each target detection frame of the first video frame and the representation information of each target detection frame of the second video frame, calculate a center distance between target detection frames from the motion information of each target detection frame of the first video frame and the motion information of each target detection frame of the second video frame, and calculate a length-width difference between target detection frames from the shape information of each target detection frame of the first video frame and the shape information of each target detection frame of the second video frame.
  24. The system according to claim 23, wherein the weighting submodule is configured to perform weighted summation or weighted multiplication on the cosine angle, the center distance, and the length-width difference to obtain the similarity matrix.
  25. The system according to any one of claims 16 to 24, wherein the determination module is configured to, in response to the matching result being a matching success, associate the target detection frame of the first video frame and the target detection frame of the second video frame related to the matching result into a continuous tracking trajectory.
  26. The system according to any one of claims 16 to 24, wherein the determination module is configured to, in response to the matching result being a matching failure, use the target detection frame of the first video frame related to the matching result as a starting target detection frame of a new tracking trajectory.
  27. The system according to any one of claims 16 to 26, wherein the representation information comprises a feature vector of a target detection frame, the motion information comprises position information of a target detection frame, and the shape information comprises size information of a target detection frame.
  28. The system according to any one of claims 16 to 27, wherein the system further comprises:
    a detection module, configured to detect the first video frame by using a detector to obtain each target detection frame of the first video frame before the acquisition module acquires the feature information of each target detection frame of the first video frame.
  29. The system according to claim 28, wherein the detector is a detector based on a deep convolutional neural network.
  30. The system according to claim 28 or 29, wherein the detector is a faster region-based convolutional neural network (Faster-RCNN).
  31. An electronic device, comprising a processor, a memory, a communication element, and a communication bus, wherein the processor, the memory, and the communication element communicate with one another through the communication bus; and
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the target tracking method according to any one of claims 1 to 15.
  32. An electronic device, comprising:
    a processor and the target tracking system according to any one of claims 16 to 30, wherein the units in the target tracking system according to any one of claims 16 to 30 are run when the processor runs the target tracking system.
  33. A computer program, comprising computer-readable code, wherein when the computer-readable code is run in a device, a processor in the device executes executable instructions for implementing the steps of the target tracking method according to any one of claims 1 to 15.
  34. A computer-readable medium configured to store computer-readable instructions, wherein when the instructions are executed, the operations of the steps of the target tracking method according to any one of claims 1 to 15 are implemented.
PCT/CN2018/076381 2017-03-03 2018-02-12 Target tracking method and system, and electronic device WO2018157735A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710124025.6A CN108230353A (en) 2017-03-03 2017-03-03 Method for tracking target, system and electronic equipment
CN201710124025.6 2017-03-03

Publications (1)

Publication Number Publication Date
WO2018157735A1 true WO2018157735A1 (en) 2018-09-07

Family

ID=62657301

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076381 WO2018157735A1 (en) 2017-03-03 2018-02-12 Target tracking method and system, and electronic device

Country Status (2)

Country Link
CN (1) CN108230353A (en)
WO (1) WO2018157735A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067594A (en) * 2020-08-05 2022-02-18 北京万集科技股份有限公司 Planning method and device of driving path, computer equipment and storage medium

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921875B (en) * 2018-07-09 2021-08-17 哈尔滨工业大学(深圳) Real-time traffic flow detection and tracking method based on aerial photography data
CN110866428B (en) * 2018-08-28 2023-12-15 杭州海康威视数字技术股份有限公司 Target tracking method, device, electronic equipment and storage medium
CN111127509B (en) * 2018-10-31 2023-09-01 杭州海康威视数字技术股份有限公司 Target tracking method, apparatus and computer readable storage medium
CN109492584A (en) * 2018-11-09 2019-03-19 联想(北京)有限公司 A kind of recognition and tracking method and electronic equipment
CN109635657B (en) * 2018-11-12 2023-01-06 平安科技(深圳)有限公司 Target tracking method, device, equipment and storage medium
CN109558505A (en) 2018-11-21 2019-04-02 百度在线网络技术(北京)有限公司 Visual search method, apparatus, computer equipment and storage medium
CN109726683B (en) 2018-12-29 2021-06-22 北京市商汤科技开发有限公司 Target object detection method and device, electronic equipment and storage medium
CN109840917B (en) * 2019-01-29 2021-01-26 北京市商汤科技开发有限公司 Image processing method and device and network training method and device
CN110163124A (en) * 2019-04-30 2019-08-23 北京易华录信息技术股份有限公司 A kind of trajectory track processing system
CN110378515A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 A kind of prediction technique of emergency event, device, storage medium and server
CN110516559B (en) * 2019-08-02 2022-02-22 西安天和防务技术股份有限公司 Target tracking method and device suitable for accurate monitoring and computer equipment
CN110717414B (en) * 2019-09-24 2023-01-03 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111027376A (en) * 2019-10-28 2020-04-17 中国科学院上海微系统与信息技术研究所 Method and device for determining event map, electronic equipment and storage medium
CN110827325B (en) * 2019-11-13 2022-08-09 阿波罗智联(北京)科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111402294B (en) * 2020-03-10 2022-10-18 腾讯科技(深圳)有限公司 Target tracking method, target tracking device, computer-readable storage medium and computer equipment
CN111784224A (en) * 2020-03-26 2020-10-16 北京京东乾石科技有限公司 Object tracking method and device, control platform and storage medium
CN114503160A (en) 2020-08-01 2022-05-13 商汤国际私人有限公司 Object association method, device, system, electronic equipment, storage medium and computer program
CN112070803A (en) * 2020-09-02 2020-12-11 安徽工程大学 Unmanned ship path tracking method based on SSD neural network model
CN112381092A (en) * 2020-11-20 2021-02-19 深圳力维智联技术有限公司 Tracking method, device and computer readable storage medium
CN113223052A (en) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 Trajectory optimization method, apparatus, device, storage medium, and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN103679186A (en) * 2012-09-10 2014-03-26 华为技术有限公司 Target detecting and tracking method and device
CN103778647A (en) * 2014-02-14 2014-05-07 中国科学院自动化研究所 Multi-target tracking method based on layered hypergraph optimization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931269A (en) * 2016-04-22 2016-09-07 海信集团有限公司 Tracking method for target in video and tracking device thereof
CN105976400B (en) * 2016-05-10 2017-06-30 北京旷视科技有限公司 Method for tracking target and device based on neural network model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679186A (en) * 2012-09-10 2014-03-26 华为技术有限公司 Target detecting and tracking method and device
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN103778647A (en) * 2014-02-14 2014-05-07 中国科学院自动化研究所 Multi-target tracking method based on layered hypergraph optimization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067594A (en) * 2020-08-05 2022-02-18 北京万集科技股份有限公司 Planning method and device of driving path, computer equipment and storage medium
CN114067594B (en) * 2020-08-05 2023-02-17 北京万集科技股份有限公司 Method and device for planning driving path, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108230353A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2018157735A1 (en) Target tracking method and system, and electronic device
WO2019105337A1 (en) Video-based face recognition method, apparatus, device, medium and program
US10540772B2 (en) Feature trackability ranking, systems and methods
US11301687B2 (en) Pedestrian re-identification methods and apparatuses, electronic devices, and storage media
US10891465B2 (en) Methods and apparatuses for searching for target person, devices, and media
WO2019091464A1 (en) Target detection method and apparatus, training method, electronic device and medium
WO2018019126A1 (en) Video category identification method and device, data processing device and electronic apparatus
CN107111787B (en) Stream processing
WO2020006961A1 (en) Image extraction method and device
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
US20130215113A1 (en) Systems and methods for animating the faces of 3d characters using images of human faces
CN108229532B (en) Image recognition method and device and electronic equipment
US9129152B2 (en) Exemplar-based feature weighting
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
US11842514B1 (en) Determining a pose of an object from rgb-d images
WO2020200095A1 (en) Action recognition method and apparatus, and electronic device and storage medium
WO2020007177A1 (en) Quotation method executed by computer, quotation device, electronic device and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CA3052846A1 (en) Character recognition method, device, electronic device and storage medium
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
WO2019170024A1 (en) Target tracking method and apparatus, and electronic device and storage medium
CN114674328B (en) Map generation method, map generation device, electronic device, storage medium, and vehicle
CN108229320B (en) Frame selection method and device, electronic device, program and medium
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN110942056A (en) Clothing key point positioning method and device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18760884

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 18760884

Country of ref document: EP

Kind code of ref document: A1