WO2018133666A1

WO2018133666A1 - Method and apparatus for tracking video target

Info

Publication number: WO2018133666A1
Application number: PCT/CN2018/070090
Authority: WO
Inventors: 余三思
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-01-17
Filing date: 2018-01-03
Publication date: 2018-07-26
Also published as: TW201828158A; CN106845385A; TWI677825B

Abstract

A method and apparatus for tracking a video target. The method can be applied to a terminal or a server, and comprises: obtaining a video stream, and recognizing a face region according to a face detection algorithm, so as to obtain a first to-be-tracked target corresponding to a first video frame (S210); performing extraction on the first to-be-tracked target according to face features based on a deep neural network so as to obtain a first face feature, and adding the first face feature into a feature library corresponding to the first to-be-tracked target (S220); and recognizing a face region in a current video frame according to the face detection algorithm so as to obtain a current to-be-tracked target corresponding to the current video frame, performing extraction on the current to-be-tracked target according to the face features based on the deep neural network so as to obtain a second face feature, and performing feature matching on the current to-be-tracked target and the first to-be-tracked target according to the second face feature and the feature library so as to track the first to-be-tracked target starting from the first video frame, and in the tracking process, updating the feature library according to extracted updated face features (S230).

Description

Video target tracking method and device

The present application claims priority to a Chinese patent application filed on January 17, 2017.

Technical field

The present application relates to the field of computer technologies, and in particular, to a video object tracking method and apparatus.

Background

Target tracking technology has always been a hotspot in the field of computer vision and image processing, and is widely used in the fields of intelligent monitoring, intelligent transportation, visual navigation, human-computer interaction, and defense reconnaissance.

Target tracking algorithms typically use one or several simple traditional feature matching algorithms to distinguish targets, such as using the color, shape, and other characteristics of the image itself.

Summary of the invention

The embodiment of the present application provides a video object tracking method and apparatus, which can improve the continuity and robustness of tracking.

The embodiment of the present application provides a method for video target tracking, which is applied to a terminal or a server, and the method includes:

obtaining a video stream, and identifying a face region according to a face detection algorithm to obtain a first to-be-tracked target corresponding to a first video frame;

performing deep neural network based face feature extraction on the first to-be-tracked target to obtain a first face feature, and storing the first face feature in a feature library corresponding to the first to-be-tracked target; and

identifying a face region in a current video frame according to the face detection algorithm to obtain a current to-be-tracked target corresponding to the current video frame; performing deep neural network based face feature extraction on the current to-be-tracked target to obtain a second face feature; performing feature matching between the current to-be-tracked target and the first to-be-tracked target according to the second face feature and the feature library, so as to track the first to-be-tracked target starting from the first video frame; and, during tracking, updating the feature library according to extracted updated face features.

The embodiment of the present application further provides a video object tracking device, where the device includes:

a processor and a memory coupled to the processor, the memory having stored therein a machine readable instruction module executable by the processor; the machine readable instruction module comprising:

a detecting module, configured to acquire a video stream, and identify a face region according to a face detection algorithm, to obtain a first to-be-tracked target corresponding to the first video frame;

a face feature extraction module, configured to perform deep neural network based face feature extraction on the first to-be-tracked target to obtain a first face feature, and to store the first face feature in a feature library corresponding to the first to-be-tracked target;

the detecting module being further configured to identify a face region in a current video frame according to the face detection algorithm to obtain a current to-be-tracked target corresponding to the current video frame;

the face feature extraction module being further configured to perform deep neural network based face feature extraction on the current to-be-tracked target to obtain a second face feature;

a tracking module, configured to perform feature matching between the current to-be-tracked target and the first to-be-tracked target according to the second face feature and the feature library, so as to track the first to-be-tracked target starting from the first video frame; and

a learning module, configured to update the feature library according to extracted updated face features during tracking.

The embodiment of the present application further provides a non-transitory computer readable storage medium storing machine readable instructions, the machine readable instructions being executable by a processor to perform the following operations:

Brief description of the drawings

FIG. 1 is an application environment diagram of a video target tracking method according to an embodiment of the present application;

FIG. 2 is an internal structural diagram of the terminal in FIG. 1 according to an embodiment of the present application;

FIG. 3 is an internal structural diagram of the server in FIG. 1 according to an embodiment of the present application;

FIG. 4 is a flowchart of a video target tracking method according to an embodiment of the present application;

FIG. 5 is a flowchart of obtaining a current target to be tracked in an embodiment of the present application;

FIG. 6 is a flowchart of updating a feature library in an embodiment of the present application;

FIG. 7 is a schematic diagram comparing the matching performance of a video target tracking algorithm and a template matching algorithm according to an embodiment of the present application;

FIG. 8 is another flowchart of obtaining a current target to be tracked in an embodiment of the present application;

FIG. 9 is a schematic diagram of a target tracking system corresponding to a video target tracking method according to an embodiment of the present application;

FIG. 10 is a schematic diagram of video tracking results obtained by a video target tracking algorithm according to an embodiment of the present application;

FIG. 11 is a schematic diagram of video tracking results obtained by a TLD tracking algorithm according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a video target tracking apparatus according to an embodiment of the present application;

FIG. 13 is another schematic structural diagram of a video target tracking apparatus according to an embodiment of the present application;

FIG. 14 is another schematic structural diagram of a video target tracking apparatus according to an embodiment of the present application;

FIG. 15 is another schematic structural diagram of a video target tracking apparatus according to an embodiment of the present application;

FIG. 16 is another schematic structural diagram of a video target tracking apparatus according to an embodiment of the present application.

Detailed description

FIG. 1 is an application environment diagram of a video target tracking method in an embodiment of the present application. As shown in FIG. 1, the application environment includes a terminal 110, a server 120, and a video capture device 130, which communicate through a network 140.

In some embodiments of the present application, the terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited thereto. The video capture device 130 can be a camera disposed at a location such as an entrance to a building. Network 140 can be a wired network or a wireless network. In some embodiments of the present application, the video capture device 130 may send the collected video stream to the terminal 110 or the server 120, and the terminal 110 or the server 120 may perform target tracking on the video stream. In other embodiments of the present application, the video capture device 130 may directly perform target tracking on the video stream, and send the tracking result to the terminal 110 for display.

In an embodiment of the present application, the internal structure of the terminal 110 in FIG. 1 is as shown in FIG. 2. The terminal 110 includes a processor 1102, a graphics processing unit 1103, a storage medium 1104, a memory 1105, a network interface 1106, a display screen 1107, and an input device 1108, which are connected through a system bus 1101. The storage medium 1104 of the terminal 110 stores an operating system 11041 and a first video target tracking device 11042. The device 11042 is configured to implement a video target tracking method suitable for the terminal 110. The processor 1102 provides computing and control capabilities to support operation of the entire terminal 110. The graphics processing unit 1103 provides at least the rendering capability for the display interface. The memory 1105 provides an environment for running the first video target tracking device 11042 stored in the storage medium 1104. The network interface 1106 is configured to perform network communication with the video capture device 130, such as receiving a video stream collected by the video capture device 130. The display screen 1107 displays tracking results and the like. The input device 1108 receives commands or data input by the user. For a terminal 110 with a touch screen, the display screen 1107 and the input device 1108 may be the touch screen. The structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the terminal 110 to which the solution is applied. A specific terminal 110 may include more or fewer components than shown in FIG. 2, combine some components, or have a different arrangement of components.

In an embodiment of the present application, the internal structure of the server 120 in FIG. 1 is as shown in FIG. 3. The server 120 includes a processor 1202, a storage medium 1203, a memory 1204, and a network interface 1205 connected through a system bus 1201. The storage medium 1203 of the server 120 stores an operating system 12031, a database 12032, and a second video target tracking device 12033. Database 12032 is used to store data. The second video object tracking device 12033 is configured to implement a video object tracking method suitable for the server 120. The processor 1202 of the server 120 is used to provide computing and control capabilities to support the operation of the entire server 120. The memory 1204 of the server 120 provides an environment for the operation of the second video object tracking device 12033 in the storage medium 1203. The network interface 1205 of the server 120 is configured to communicate with the external video capture device 130 via a network connection, such as receiving a video stream sent by the video capture device 130.

As shown in FIG. 4, an embodiment of the present application provides a video target tracking method, which is applied to the terminal 110, the server 120, or the video capture device 130 in the above application environment. The method may be performed by the video target tracking apparatus provided by any embodiment of the present application, and includes the following steps:

Step S210: Acquire a video stream, and identify a face region according to the face detection algorithm to obtain a first to-be-tracked target corresponding to the first video frame.

Specifically, the video stream can be acquired by a video capture device distributed at the entrance of the building. If the video target tracking method is applied to a video capture device, the video stream can be obtained directly from the memory of the video capture device. If the video target tracking method is applied to a terminal or a server, the video capture device can transmit the collected video stream to the terminal or server in real time.

Face detection refers to searching a given image with a certain strategy to determine whether it contains a face and, if so, returning the position, size, and pose of the face. In some embodiments of the present application, the face region may be displayed with a recommendation box (such as the rectangular frame shown in FIG. 10) to obtain the first target to be tracked corresponding to the first video frame. Face detection is performed continuously on the video stream until a face is detected, and the detected face region is determined as the first target to be tracked. Since multiple faces may be detected in one frame, there may be multiple first targets to be tracked. In that case, different face regions may be marked with different identification information, for example with different recommendation boxes. The face detection algorithm can be chosen as needed, for example the NPD (Normalized Pixel Difference) face detection algorithm, or the NPD face detection algorithm combined with other algorithms to improve the accuracy of determining the target to be tracked.
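The detection loop of step S210 can be sketched as follows. This is only a minimal sketch: the `detect_faces` callable, the `first_targets` helper, and the (x, y, width, height) box layout are hypothetical names and conventions introduced for illustration, and the NPD detector itself is not implemented here.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); this layout is an assumption


def first_targets(
    frames: Iterable[object],
    detect_faces: Callable[[object], List[Box]],  # hypothetical detector interface
) -> Optional[Tuple[int, List[Box]]]:
    """Scan frames until at least one face region is found.

    Returns the index of the first video frame that contains a face together
    with all face regions detected in it, or None if no frame contains a face.
    """
    for index, frame in enumerate(frames):
        boxes = detect_faces(frame)
        if boxes:  # several faces may be detected in the same frame
            return index, boxes
    return None


# Stand-in detector for illustration: each "frame" is already a list of boxes.
fake_stream = [[], [], [(10, 20, 50, 50), (90, 15, 48, 48)]]
result = first_targets(fake_stream, detect_faces=lambda frame: frame)
print(result)
```

Each returned box would correspond to one first target to be tracked, so a frame with two faces yields two targets.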

Step S220: Perform deep neural network based face feature extraction on the first target to be tracked to obtain a first face feature, and store the first face feature in the feature library corresponding to the first target to be tracked.

Specifically, a deep neural network is a machine learning model under deep learning. Deep learning is a branch of machine learning: a class of algorithms that attempt high-level abstraction of data using multiple processing layers composed of complex structures or multiple nonlinear transforms. The deep neural network may adopt a VGG (Visual Geometry Group) network structure, whose recall rate and accuracy are better than those of the target matching algorithm.

A target identifier is assigned to the first target to be tracked and a feature library is established; an association relationship between the target identifier and the feature library is established and saved. When there are multiple first targets to be tracked, a target identifier may be assigned to each of them and a feature library established for each; an association relationship is established between each first target to be tracked and its corresponding first face feature, and the association relationship and the first face feature are stored in the feature library corresponding to that first target to be tracked. Performing feature matching on face features addresses a weakness of target tracking algorithms that do not make good use of face features: they frequently follow the wrong target, drift, or lose the target, and then cannot recover the correct tracking target.
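One possible way to hold the association between target identifiers and stored face features is a dictionary keyed by identifier. The class below is a minimal sketch; the name `FeatureLibrary`, the integer identifiers, and the two-dimensional example vectors are assumptions made for illustration and are not taken from the source.

```python
import itertools
from typing import Dict, List, Sequence

Feature = List[float]


class FeatureLibrary:
    """Maps each target identifier to the face features stored for that target."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)  # identifiers 1, 2, 3, ... (an assumption)
        self.features: Dict[int, List[Feature]] = {}

    def register(self, first_feature: Sequence[float]) -> int:
        """Assign a new target identifier and store the first face feature under it."""
        target_id = next(self._next_id)
        self.features[target_id] = [list(first_feature)]
        return target_id


library = FeatureLibrary()
# Two faces detected in the first video frame; the feature values are made up.
ids = [library.register(feature) for feature in ([0.1, 0.9], [0.8, 0.2])]
print(ids, library.features)
```

Storing a list per identifier leaves room for further face features of the same target to be appended during tracking.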

Step S230: Identify a face region in the current video frame according to the face detection algorithm to obtain the current target to be tracked corresponding to the current video frame; perform deep neural network based face feature extraction on the current target to be tracked to obtain a second face feature; perform feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library, so as to track the first target to be tracked starting from the first video frame; and update the feature library according to the extracted updated face features during tracking.

Specifically, the second face feature is matched against the first face feature corresponding to the first target to be tracked in the feature library. The specific feature matching algorithm can be customized; for example, the Euclidean distance between the vectors corresponding to the face features can be calculated directly, and whether the match succeeds is determined according to that distance. If the second face feature successfully matches the first face feature, the current target to be tracked is determined to be the first target to be tracked after continued movement. If there are multiple current targets to be tracked, they form a current group of targets to be tracked, and the second face feature of each current target in the group is matched against the face features of each historical target to be tracked in the feature library. If the matching succeeds, the target identifier of the historical target to be tracked is used as the target identifier of the current target to be tracked, and the current target to be tracked represents the position of that historical target after it has moved.
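The group matching described above can be sketched as follows, under stated assumptions: the function name `match_group`, the dictionary layout of the library, and the rule of picking the nearest stored feature when several fall below the threshold are illustrative choices. The source only requires that a match be decided from the Euclidean distance.

```python
import math
from typing import Dict, List, Optional, Sequence

Feature = Sequence[float]


def match_group(
    current_features: List[Feature],
    library: Dict[int, List[Feature]],
    threshold: float,
) -> List[Optional[int]]:
    """Return, for each current target, the identifier of the matching
    historical target, or None when no stored feature is close enough."""
    assigned: List[Optional[int]] = []
    for feature in current_features:
        best_id: Optional[int] = None
        best_distance = threshold  # only distances below the threshold count as a match
        for target_id, stored_features in library.items():
            for stored in stored_features:
                distance = math.dist(feature, stored)  # Euclidean distance
                if distance < best_distance:
                    best_id, best_distance = target_id, distance
        assigned.append(best_id)
    return assigned


library = {1: [[0.0, 0.0]], 2: [[1.0, 1.0]]}
matches = match_group([[0.9, 1.0], [5.0, 5.0]], library, threshold=0.5)
print(matches)  # prints [2, None]
```

This sketch does not prevent two current targets from receiving the same identifier; a real system would need an additional rule for that case.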

In some embodiments of the present application, the feature library may be updated during tracking according to the extracted updated face features. For example, when the illumination changes continuously or the face turns sideways, an updated face feature of the first target to be tracked is obtained from other frames. If the updated face feature differs from the first face feature, it may be added to the feature library corresponding to the first target to be tracked, associated with the target identifier of the first target to be tracked, and the association relationship stored in the feature library. Then, when the first target to be tracked shows a larger side-face angle or a larger change in light intensity in other frames, the second face feature corresponding to the current target to be tracked can be matched against the updated face feature of the first target to be tracked, which differs from it less than the first face feature does. This increases the probability of a successful match, reduces the sensitivity of tracking to changes in the tracked target such as tilt, occlusion, and illumination changes, and improves tracking continuity and robustness. The feature library can also store a large number of face features corresponding to the first target to be tracked in different frames. When the first target to be tracked disappears and then reappears, feature matching can be performed against the face features saved in its feature library before it disappeared, so that a good tracking effect is achieved for intermittently appearing targets. Updating the feature library, which maintains a library of positive and negative samples through tracking and detection, amounts to a semi-online tracking algorithm. Compared with a fully offline tracking algorithm it has a better recall rate, and compared with a fully online tracking algorithm it shows a higher accuracy rate.

In the embodiment of the present application, a video stream is obtained and a face region is identified according to the face detection algorithm to obtain the first target to be tracked corresponding to the first video frame. Deep neural network based face feature extraction is performed on the first target to be tracked to obtain the first face feature, which is added to the feature library. A face region is identified in the current video frame according to the face detection algorithm to obtain the current target to be tracked corresponding to the current video frame, and deep neural network based face feature extraction is performed on it to obtain the second face feature. The current target to be tracked is matched with the first target to be tracked according to the second face feature and the feature library, so as to track the first target to be tracked starting from the first video frame, and the feature library is updated during tracking according to the extracted updated face features. Performing feature matching on face features extracted by a deep neural network addresses the problem that target tracking algorithms which do not make good use of face features frequently follow the wrong target, drift, or lose the target and cannot recover the correct tracking target; this saves resources of the terminal or server device and increases the processing speed of the processor of the terminal or server. At the same time, the feature library is continuously updated during tracking, so it can store the different face features of the target to be tracked in different states. This improves the success rate of face feature matching, reduces the sensitivity of tracking to changes in the tracked target such as tilt, occlusion, and illumination changes, and improves tracking continuity and robustness, which in turn increases the processing speed of the processor of the terminal or server.

In an embodiment of the present application, the method further includes: identifying, by a face recognition algorithm and according to the face state of each target to be tracked, the face identity information corresponding to each target to be tracked, and obtaining, by an image feature extraction algorithm, the target feature corresponding to the face identity information.

In some embodiments of the present application, the face state refers to the deflection angle of the face. When the detected face is a frontal face, the corresponding face identity information can be identified by the face recognition algorithm. The face identity information describes the identity of the person to whom the face belongs. Face recognition refers to searching and matching the feature data extracted from a face image against feature templates stored in a database, such as face feature templates, and determining the face identity information according to the degree of similarity. For example, when performing face recognition on employees entering an enterprise, a feature template of each employee, such as a face feature template, is stored in the database in advance; the feature data of the currently extracted face image is then compared with the face feature templates stored in the database to obtain the employee's face identity information. The specific content of the face identity information can be customized as needed, such as employee name, job number, and department.

The image feature extraction algorithm extracts feature data according to characteristics of the image itself, such as color features, texture features, shape features, and spatial relationship features, to obtain the target feature, which is the set of all extracted feature data. The target feature, such as clothing color, clothing texture, body shape, and height ratio, is stored in the database in association with the face identity information. In this way, when the face is deflected or covered, identity recognition and determination of the face region can be performed using the other target features.

In an embodiment of the present application, as shown in FIG. 5, the step in step S230 of identifying a face region in the current video frame according to the face detection algorithm to obtain the current target to be tracked corresponding to the current video frame includes:

Step S231: Determine whether a face region is recognized in the current video frame according to the face detection algorithm; if no face region is recognized, acquire the current image feature corresponding to the current video frame according to the image feature extraction algorithm.

Specifically, if no face region is recognized in the current video frame according to the face detection algorithm, the detection may have failed because the face is turned away. In this case, the current image feature corresponding to the current video frame needs to be acquired according to the image feature extraction algorithm.

Step S232: Compare the current image feature with the target feature to obtain the matching target face identity information, and obtain the current target to be tracked corresponding to the current video frame according to the target face identity information.

Specifically, since the target feature has been associated with face identity information, the current image feature can be compared with the target feature to calculate a similarity. If the similarity exceeds a threshold, the matching succeeds, and the target face identity information corresponding to the matching target feature can be obtained, so that the current target to be tracked corresponding to the current video frame is obtained according to the target face identity information. The current target to be tracked is then matched with the first target to be tracked by face identity information, thereby tracking the first target to be tracked.
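The fallback of steps S231 and S232 can be sketched as a similarity lookup over stored target features. The source does not name a similarity measure, so cosine similarity is used here purely as an assumption; the function names, the identity strings, and the three-element feature vectors are likewise hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_by_image_feature(
    current_feature: Sequence[float],
    target_features: Dict[str, Sequence[float]],  # face identity -> stored target feature
    threshold: float,
) -> Optional[str]:
    """Return the face identity whose target feature is most similar to the
    current image feature, provided the similarity exceeds the threshold."""
    best_identity: Optional[str] = None
    best_score = threshold
    for identity, target_feature in target_features.items():
        score = cosine_similarity(current_feature, target_feature)
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity


# Made-up target features, e.g. encoding clothing color and body shape.
stored = {"employee-001": [1.0, 0.0, 0.2], "employee-002": [0.0, 1.0, 0.1]}
print(identify_by_image_feature([0.9, 0.1, 0.2], stored, threshold=0.9))
```

A returned identity can then be used to look up the corresponding target, while `None` means the frame yields no target by this route.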

In the embodiment of the present application, face identity information is introduced into target tracking and face features are combined with image features, so that the target can still be tracked when the face detection algorithm cannot recognize a face region, further improving tracking continuity and robustness.

In an embodiment of the present application, step S220 may include: acquiring first face identity information corresponding to the first target to be tracked, establishing a first face feature set corresponding to the first face identity information, adding the first face feature to the first face feature set, and storing the first face feature set in the feature library corresponding to the first target to be tracked.

Specifically, face recognition may be performed on the first target to be tracked to obtain the first face identity information corresponding to it. The first face feature set stores the first face features of the first target to be tracked in different states during its motion, where the different states include different angles, different illumination, different ranges of occlusion, and the like. The first face feature obtained by face feature extraction is added to the first face feature set, an association relationship is established between the first face feature set and the first face identity information, and the association relationship and the first face feature set are stored in the feature library corresponding to the first target to be tracked.

In an embodiment of the present application, as shown in FIG. 6, the step of updating the feature library according to the extracted updated facial features in the tracking process in step S230 may include:

Step S233: Acquire current face identity information corresponding to the current target to be tracked, and obtain a first face feature set corresponding to the current face identity information from the feature database.

Specifically, in one embodiment, the current face identity information corresponding to the current target to be tracked may be obtained by performing face recognition on the current target to be tracked. In another embodiment, the current image feature corresponding to the current target to be tracked may be obtained by applying the image feature extraction algorithm to it; the current image feature is then matched against the target features, and the face identity information corresponding to the matching target feature is used as the current face identity information, so that the current face identity information can be obtained even when no face region can be recognized for the current target to be tracked. According to the association relationship between face identity information and face feature sets, the first face feature set corresponding to the current face identity information is obtained, which indicates that the current target to be tracked and the first target to be tracked are the same target.

Step S234: Calculate the difference between the second face feature and the first face feature in the first face feature set; if the difference exceeds a preset threshold, add the second face feature to the first face feature set.

Specifically, a custom algorithm calculates the difference between the second face feature and the first face feature in the first face feature set. If the first face feature set contains multiple first face features, the difference between the second face feature and each first face feature is calculated separately, yielding multiple differences. The difference indicates how far the second face feature is from the face features of the same tracking target already saved in the feature library; the larger the difference, the larger the change in the face state of the tracking target. If the difference exceeds the preset threshold, the second face feature is added to the first face feature set and becomes available for subsequent feature matching. The more face features the set stores, the better it characterizes the same tracking target in different states. During feature matching, as long as any one of the stored face features matches successfully, the current target to be tracked is considered to match the first target to be tracked. This increases the probability of a successful match, reduces the sensitivity of tracking to changes in the tracked target such as tilt, occlusion, and illumination changes, and improves tracking continuity and robustness.
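The update rule of step S234 can be sketched as below. Two points are assumptions rather than statements from the source: the difference is measured here as Euclidean distance (the source leaves the algorithm open), and when the set already holds several features the new feature is added only if it differs from every one of them by more than the threshold. The function name `update_feature_set` is hypothetical.

```python
import math
from typing import List, Sequence


def update_feature_set(
    feature_set: List[List[float]],
    second_feature: Sequence[float],
    threshold: float,
) -> bool:
    """Add second_feature to feature_set if it is sufficiently different from
    every feature already stored. Returns True when the feature was added."""
    differences = [math.dist(second_feature, known) for known in feature_set]
    if all(difference > threshold for difference in differences):
        feature_set.append(list(second_feature))
        return True
    return False


face_features = [[0.0, 0.0]]  # made-up first face feature
print(update_feature_set(face_features, [0.1, 0.0], threshold=0.5))  # prints False
print(update_feature_set(face_features, [1.0, 0.0], threshold=0.5))  # prints True
print(len(face_features))  # prints 2
```

Requiring a minimum difference keeps the set from filling up with near-duplicates, which bounds the cost of later matching.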

In an embodiment of the present application, step S220 may include: performing face feature extraction on the first target to be tracked through the deep neural network to obtain a first feature vector.

Specifically, the deep neural network is trained to obtain a face feature extraction model. The pixel values corresponding to the first target to be tracked are input to the model to obtain the first feature vector, whose dimension is determined by the face feature extraction model.

Step S230 includes: performing face feature extraction on the current target to be tracked through the deep neural network to obtain a second feature vector, and calculating the Euclidean distance between the first feature vector and the second feature vector; if the Euclidean distance is less than a preset threshold, determining that feature matching between the first target to be tracked and the current target to be tracked succeeds.

Specifically, the pixel values corresponding to the current target to be tracked are input to the face feature extraction model to obtain the second feature vector. The Euclidean distance between the first feature vector and the second feature vector represents the similarity between the current target to be tracked and the first target to be tracked. If the Euclidean distance is less than the preset threshold, the current target to be tracked and the first target to be tracked are determined to match successfully, indicating that they are the same target, and target tracking is thereby achieved.
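Written out in symbols, the matching rule above reads as follows. The symbol names are chosen here for illustration only: $f_1$ is the first feature vector, $f_2$ the second feature vector, $n$ their common dimension, and $T$ the preset threshold.

```latex
d(f_1, f_2) = \lVert f_1 - f_2 \rVert_2
            = \sqrt{\sum_{i=1}^{n} \left( f_{1,i} - f_{2,i} \right)^{2}},
\qquad
\text{match} \iff d(f_1, f_2) < T
```

A smaller distance corresponds to a higher similarity, which is why the match condition is a strict upper bound on $d$.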

In an embodiment of the present application, the network structure of the deep neural network may be an 11-layer network that includes a stacked convolutional neural network and fully connected layers, where the stacked convolutional neural network is composed of multiple convolution layers and max pool layers. The specific network structure is:

Conv3-64*2+LRN+max pool

Conv3-128+max pool

Conv3-256*2+max pool

Conv3-512*2+max pool

Conv3-512*2+max pool

FC2048

FC1024,

Where conv3 represents a convolutional layer with a radius of 3, LRN represents a local response normalization (LRN) layer, max pool represents a max pooling layer, and FC represents a fully connected layer.

Specifically, the network structure is a simplified version of the VGG deep neural network structure, in which 64*2 denotes two groups of 64, the LRN layer has no trainable parameters, and FC2048 denotes a fully connected layer that outputs a 2048-dimensional vector. The output of the last fully connected layer, FC1024, is the face feature obtained by feature extraction, which is a 1024-dimensional vector. The optimized face features obtained with the simplified VGG network structure perform much better than the matching module in TLD (Tracking-Learning-Detection). The efficiency of face feature extraction is greatly improved, and the real-time performance required by the tracking algorithm is achieved. In one embodiment of the present application, the resolution of the target to be tracked can be limited to 112*112 pixels to reduce computational complexity. FIG. 7 is a schematic comparison of the matching performance of the face feature extraction algorithm VGG-S, which corresponds to the VGG network structure, and the template matching algorithm match template. As shown in FIG. 7, the abscissa represents the recall rate and the ordinate represents the accuracy. It can be seen that the face feature extraction algorithm corresponding to the VGG network structure has better accuracy in feature matching and improves the correct rate of target tracking.
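
As a check on the 11-layer figure, the sketch below counts the layers in the listing as recited in claim 7, which contains two Conv3-512*2 groups. It assumes that "*2" denotes two consecutive convolution layers and that only convolution and fully connected layers are counted. The spatial size calculation further assumes that each max pool layer halves the side length with rounding down, which the text does not state.

```python
import re

# Layer listing as recited in claim 7 of this application.
SPEC = [
    "Conv3-64*2+LRN+max pool",
    "Conv3-128+max pool",
    "Conv3-256*2+max pool",
    "Conv3-512*2+max pool",
    "Conv3-512*2+max pool",
    "FC2048",
    "FC1024",
]


def count_layers(spec):
    """Return (convolution layers, fully connected layers, max pool layers)."""
    conv = fc = pools = 0
    for line in spec:
        for part in line.split("+"):
            part = part.strip()
            match = re.fullmatch(r"Conv3-(\d+)(?:\*(\d+))?", part)
            if match:
                # "*2" is assumed to mean two consecutive convolution layers.
                conv += int(match.group(2) or 1)
            elif part.startswith("FC"):
                fc += 1
            elif part == "max pool":
                pools += 1
    return conv, fc, pools


conv, fc, pools = count_layers(SPEC)
weight_layers = conv + fc

# Assumed: every max pool halves the side length (floor division).
side = 112
for _ in range(pools):
    side //= 2
```

Under these assumptions the listing has nine convolution layers and two fully connected layers, which is consistent with the stated total of 11.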

In an embodiment of the present application, in step S230, the step of identifying the face region in the current video frame according to the face detection algorithm and obtaining the current target to be tracked corresponding to the current video frame may include: identifying the face region in the current video frame based on the normalized pixel difference feature and a human half-body recognition algorithm, to obtain the current target to be tracked corresponding to the current video frame.

Specifically, face detection is performed based on the Normalized Pixel Difference (NPD) feature, and the returned value is used as a face region recommendation box. For example, a strong classifier constructed with AdaBoost can be used to identify and distinguish faces based on the NPD feature. The human half-body recognition algorithm can be defined as needed and can perform upper body detection. Based on the upper body detection, the face region recommendation boxes can be screened and some of them filtered out, which greatly improves the recall rate and accuracy of face region detection and improves the overall performance of target tracking.
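
The two ingredients described above can be sketched as follows. The NPD feature of two pixel intensities x and y is commonly defined as (x - y) / (x + y), with the value 0 when both are 0; the text does not spell out this formula, so it is given here as background. The screening rule, which keeps a face box only if it lies inside some upper-body box, is an assumption for illustration, and the function names are hypothetical.

```python
def npd(x, y):
    """Normalized pixel difference of two intensities; 0 when both are 0."""
    if x + y == 0:
        return 0.0
    return (x - y) / (x + y)


def inside(face, body):
    """True if the face box (x, y, w, h) lies entirely within the body box."""
    fx, fy, fw, fh = face
    bx, by, bw, bh = body
    return bx <= fx and by <= fy and fx + fw <= bx + bw and fy + fh <= by + bh


def screen_faces(face_boxes, body_boxes):
    """Keep face recommendation boxes supported by at least one upper-body box."""
    return [f for f in face_boxes if any(inside(f, b) for b in body_boxes)]


faces = [(10, 10, 20, 20), (200, 200, 20, 20)]
bodies = [(0, 0, 60, 80)]
kept = screen_faces(faces, bodies)
```

Because the NPD value depends only on the ratio of the two intensities, it is unchanged when both pixels are scaled by the same factor, which is one reason the feature tolerates illumination changes.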

In an embodiment of the present application, as shown in FIG. 8, the step of identifying the face area according to the face detection algorithm in the current video frame in step S230, and obtaining the current to-be-tracked target corresponding to the current video frame may include:

Step S235, identifying a face region based on the normalized pixel difference feature, and obtaining a first recommended region in the current video frame.

Step S236, calculating, according to the optical flow analysis algorithm, a second recommended area of the first target to be tracked in the current video frame.

Specifically, the optical flow analysis algorithm lets I(x, y, t) denote the light intensity of a pixel in the first frame. The pixel moves by a distance (dx, dy) to the next frame in time dt. Because it is the same pixel, its light intensity is assumed not to change. According to the motion track of the first to-be-tracked target, the vector velocity model corresponding to the first target to be tracked is calculated by using the optical flow analysis principle. The current video frame, the previous frame of the current video frame, and the position of the first target to be tracked in the previous frame are input to the vector velocity model, and the second recommended area of the first to-be-tracked target in the current video frame is obtained, that is, the position where the first to-be-tracked target may appear in the current video frame.
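
The constant-intensity assumption above leads to the standard optical flow constraint. The derivation below is textbook background rather than text from the present application; the symbols u and v, denoting the velocity components dx/dt and dy/dt, are introduced here for illustration.

```latex
% Brightness constancy: the same pixel keeps its intensity after moving (dx, dy) in time dt.
I(x, y, t) = I(x + dx,\; y + dy,\; t + dt)

% First-order Taylor expansion of the right-hand side:
I(x + dx, y + dy, t + dt) \approx I(x, y, t)
  + \frac{\partial I}{\partial x}\,dx
  + \frac{\partial I}{\partial y}\,dy
  + \frac{\partial I}{\partial t}\,dt

% Subtracting I(x, y, t) and dividing by dt, with u = dx/dt and v = dy/dt:
\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0
```

This single equation has two unknowns, u and v, so practical methods add a further assumption, such as constant motion within a small neighborhood, to solve for the velocity.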

Step S237, obtaining a current target to be tracked according to the first recommended area and the second recommended area.

Specifically, the second recommended area obtained by the optical flow analysis algorithm is the area to which the first to-be-tracked target may have moved based on its historical moving speed. According to the position of the second recommended area, first recommended areas whose distance from the second recommended area exceeds a preset range may be excluded, thereby obtaining the current target to be tracked. Alternatively, the first recommended area and the second recommended area may all be used as the current target to be tracked. If there are multiple first to-be-tracked targets, each of them has a corresponding second recommended area.
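
The distance-based exclusion described above can be sketched as follows. This is a minimal sketch that assumes regions are (x, y, w, h) boxes and that distance is measured between box centers; the preset range MAX_DISTANCE and the function names are hypothetical.

```python
import math

MAX_DISTANCE = 50.0  # hypothetical preset range, in pixels


def center(box):
    """Center point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def filter_by_flow(first_regions, second_region, max_distance=MAX_DISTANCE):
    """Keep first recommended regions close to the optical flow prediction."""
    cx, cy = center(second_region)
    kept = []
    for box in first_regions:
        bx, by = center(box)
        if math.hypot(bx - cx, by - cy) <= max_distance:
            kept.append(box)
    return kept


first_regions = [(100, 100, 40, 40), (400, 300, 40, 40)]  # from NPD detection
second_region = (110, 105, 40, 40)                        # from optical flow
candidates = filter_by_flow(first_regions, second_region)
```

When there are several first to-be-tracked targets, the same filter would be applied once per target with that target's own second recommended area.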

In this embodiment, the normalized pixel difference feature is combined with the optical flow analysis algorithm to obtain the current target to be tracked. The addition of prior information improves the accuracy of subsequent feature matching.

In an embodiment, step S237 may include: performing motion prediction according to inter-frame correlation to obtain an expected motion range, and screening the first recommended area and the second recommended area according to the expected motion range to obtain a current target to be tracked.

Specifically, the inter-frame correlation uses the historical position information and the motion trajectory to predict the position of the target in the next frame or the next several frames, which is equivalent to using prior information to adjust the credibility of the NPD algorithm. The first recommended areas and second recommended areas outside the expected motion range are filtered out to obtain the current target to be tracked, which reduces the number of subsequent feature-matching computations and improves matching efficiency and accuracy.
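
One simple way to realize this prediction is a constant-velocity model, sketched below. The text does not name a specific motion model, so the constant-velocity assumption, the square expected motion range, and the names predict_center, expected_range, and screen are all hypothetical.

```python
def predict_center(history):
    """Constant-velocity prediction from the last two centers in history."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def expected_range(history, margin):
    """Square range (left, top, right, bottom) around the predicted center."""
    px, py = predict_center(history)
    return (px - margin, py - margin, px + margin, py + margin)


def screen(regions, motion_range):
    """Keep (x, y, w, h) regions whose center falls inside the motion range."""
    left, top, right, bottom = motion_range
    kept = []
    for (x, y, w, h) in regions:
        cx, cy = x + w / 2.0, y + h / 2.0
        if left <= cx <= right and top <= cy <= bottom:
            kept.append((x, y, w, h))
    return kept


history = [(100, 100), (110, 104)]           # past centers of the target
motion_range = expected_range(history, 30)   # margin in pixels
regions = [(105, 90, 30, 30), (300, 300, 30, 30)]
remaining = screen(regions, motion_range)
```

A larger margin tolerates sudden changes of speed at the cost of passing more candidate regions on to feature matching.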

In an embodiment of the present application, the video target tracking method may complete video target tracking by using three modules as shown in FIG. 9: a tracking module 310, a detecting module 320, and a learning module 330. Specifically, the video stream is obtained, the face region is identified according to the face detection algorithm, and the first to-be-tracked target corresponding to the first video frame is obtained. Tracking starts from the video frame where the first to-be-tracked target is located. The first facial feature of the tracked target is obtained by facial feature extraction based on the deep neural network and is added to the feature library. The learning module 330 updates the feature library according to the tracking condition, the detecting module 320 continuously obtains the current target to be tracked from the current video frame, and the tracking module 310 matches the current target to be tracked with the first target to be tracked according to the updated feature library, so as to track the first target to be tracked.

In an embodiment of the present application, the tracking area obtained by using the video target tracking method described above may be as shown in FIG. 10, and the tracking area obtained by using the TLD tracking algorithm may be as shown in FIG. 11. By comparison, it can be seen that the tracking area of the video target tracking method proposed by the embodiment of the present application is more accurate than that of the TLD tracking algorithm. The TLD tracking algorithm may fail to track when the face is completely deflected, whereas the video target tracking method proposed in the embodiment of the present application can still track successfully in that case. The correct rate and recall rate are improved compared to the TLD tracking algorithm. The specific data is as follows:

1. Version without head detection: the accuracy rate is increased by about 5 percentage points, the error rate is reduced by 100%, and the target tracking loss rate is reduced by 25%.

2. Version with head detection: the accuracy rate is increased by about 1%, the error rate is reduced by 100%, and the target tracking loss rate is reduced by 15%.

In terms of performance, at a resolution of 640*480, a machine with a 3.5 GHz CPU and an NVIDIA GeForce GTX 775M has a single-frame processing time of about 40 ms, that is, a frame rate of about 25 FPS or more.

The above video target tracking method is more accurate than the traditional method, which makes subsequent requirements such as people-flow statistics, identification, and behavior analysis feasible and convenient. Its performance also satisfies the requirements of online processing, and it improves the accuracy, scalability, and applicability of the monitoring and analysis system, which in turn increases the processing speed of the hardware processor and improves the processing performance of the processor.

In an embodiment of the present application, as shown in FIG. 12, a video object tracking device is provided, and the device may include:

The detecting module 410 is configured to acquire a video stream, and identify a face region according to the face detection algorithm to obtain a first to-be-tracked target corresponding to the first video frame.

The facial feature extraction module 420 is configured to obtain the first facial feature by performing facial feature extraction based on the deep neural network on the first to-be-tracked target, and to store the first facial feature into the feature library corresponding to the first to-be-tracked target.

The detecting module 410 is further configured to: in the current video frame, identify the face region according to the face detection algorithm, and obtain the current target to be tracked corresponding to the current video frame.

The facial feature extraction module 420 is further configured to obtain a second facial feature by performing facial feature extraction based on the deep neural network on the current target to be tracked.

The tracking module 430 is configured to perform feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library to track the first target to be tracked from the first video frame.

The learning module 440 is configured to update the feature library according to the extracted updated facial features during the tracking process.

In an embodiment of the present application, as shown in FIG. 13, the device further includes:

The feature identity processing module 450 is configured to identify the corresponding face identity information by using a face recognition algorithm according to the face state of the target to be tracked, obtain the target feature corresponding to the face identity information according to the image feature extraction algorithm, and establish an association relationship between the target feature and the face identity information.

The detecting module 410 can include:

The image feature extraction unit 411 is configured to determine whether a face region is recognized in the current video frame according to the face detection algorithm and, if no face region is recognized, to acquire the current image feature corresponding to the current video frame according to the image feature extraction algorithm.

The identity matching unit 412 is configured to compare the current image feature with the target feature to obtain matching target facial identity information based on the association relationship.

The first tracking target determining unit 413 is configured to obtain a current target to be tracked corresponding to the current video frame according to the target facial identity information.

In an embodiment of the present application, the facial feature extraction module 420 is further configured to acquire first facial identity information corresponding to the first target to be tracked, establish a first facial feature set corresponding to the first facial identity information, add the first face feature to the first face feature set, and store the first face feature set to the feature library.

The learning module 440 is further configured to acquire current face identity information corresponding to the current target to be tracked, obtain the first face feature set corresponding to the current face identity information from the feature library, calculate the difference amount between the first face feature in the first face feature set and the second face feature, and, if the difference amount exceeds a preset threshold, add the second face feature to the first face feature set.

In an embodiment of the present application, the detecting module 410 is further configured to identify a face region in the current video frame based on the normalized pixel difference feature and the human half-body recognition algorithm, to obtain a current target to be tracked corresponding to the current video frame.

In an embodiment of the present application, as shown in FIG. 14, the detecting module 410 may include:

The first recommending unit 414 is configured to identify the face region based on the normalized pixel difference feature, and obtain the first recommended region in the current video frame.

The second recommending unit 415 is configured to calculate, according to the optical flow analysis algorithm, the second recommended area of the first target to be tracked in the current video frame.

The second tracking target determining unit 416 is configured to obtain the current target to be tracked according to the first recommended area and the second recommended area.

In an embodiment of the present application, the second tracking target determining unit 416 is further configured to perform motion prediction according to the inter-frame correlation to obtain an expected motion range, and filter the first recommended area and the second recommended area according to the expected motion range to obtain the current target to be tracked.

In an embodiment of the present application, the network structure of the deep neural network is an 11-layer network that includes a stacked convolutional neural network and fully connected layers, where the stacked convolutional neural network consists of multiple convolution layers and max pool layers. The specific network structure is:

Conv3-64*2+LRN+max pool

Conv3-128+max pool

Conv3-256*2+max pool

Conv3-512*2+max pool

Conv3-512*2+max pool

FC2048

FC1024,

Where conv3 represents a convolutional layer with a radius of 3, LRN represents a local response normalization (LRN) layer, max pool represents a max pooling layer, and FC represents a fully connected layer.

In an embodiment of the present application, the facial feature extraction module 420 is further configured to perform facial feature extraction on the first to-be-tracked target through the deep neural network to obtain a first feature vector, and to perform facial feature extraction on the current target to be tracked through the deep neural network to obtain a second feature vector.

The tracking module 430 is further configured to calculate a Euclidean distance between the first feature vector and the second feature vector and, if the Euclidean distance is less than a preset threshold, determine that feature matching between the first to-be-tracked target and the current target to be tracked is successful.

FIG. 15 is another schematic structural diagram of a video object tracking apparatus according to an embodiment of the present application. As shown in FIG. 15, the video object tracking device includes a processor 510, a memory 520 coupled to the processor 510, and a port 530 for transmitting and receiving data. The memory 520 stores a machine readable instruction module executable by the processor 510, the machine readable instruction module comprising:

The detecting module 521 is configured to acquire a video stream, and identify a face region according to the face detection algorithm to obtain a first to-be-tracked target corresponding to the first video frame.

The facial feature extraction module 522 is configured to obtain the first facial feature by performing facial feature extraction based on the deep neural network on the first to-be-tracked target, and to store the first facial feature into the feature library corresponding to the first to-be-tracked target.

The detecting module 521 is further configured to identify the face area according to the face detection algorithm in the current video frame, and obtain the current target to be tracked corresponding to the current video frame.

The facial feature extraction module 522 is further configured to obtain a second facial feature by performing facial feature extraction based on the deep neural network on the current target to be tracked.

The tracking module 523 is configured to perform feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library to track the first target to be tracked from the first video frame.

The learning module 524 is configured to update the feature library according to the extracted updated facial features during the tracking process.

In an embodiment of the present application, as shown in FIG. 16, the machine readable instruction module may further include:

The feature identity processing module 525 is configured to identify the corresponding face identity information by using a face recognition algorithm according to the face state of the target to be tracked, obtain the target feature corresponding to the face identity information according to the image feature extraction algorithm, and establish an association relationship between the target feature and the face identity information.

In the embodiment of the present application, for the specific functions and implementation manners of the foregoing detecting module 521, facial feature extraction module 522, tracking module 523, learning module 524, and feature identity processing module 525, reference may be made to the related descriptions of the foregoing modules 410 to 450. Details are not repeated here.

One of ordinary skill in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program to instruct related hardware, and the program can be stored in a non-volatile computer readable storage medium. As in the embodiment of the present application, the program may be stored in a storage medium of the computer system and executed by at least one processor in the computer system to implement a flow including an embodiment of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Through the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of the present application can be implemented by means of software plus a necessary general-purpose hardware platform, that is, by machine readable instructions that instruct related hardware. Of course, they can also be implemented in hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the embodiments of the present application may be embodied, in essence, in the form of a software product that is stored in a storage medium and includes a plurality of instructions for causing a terminal device (which may be a cell phone, a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

The technical features of the above-described embodiments may be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction between the combinations of these technical features, all of them should be considered to fall within the scope of this specification.

The above-mentioned embodiments are merely illustrative of several embodiments of the present application, and the description thereof is not to be construed as limiting the scope of the application. It should be noted that a number of variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the present application. Therefore, the scope of protection of the application should be determined by the appended claims.

Claims

A video object tracking method is applied to a terminal or a server, and the method includes:

Obtaining a video stream, identifying a face region according to a face detection algorithm, and obtaining a first to-be-tracked target corresponding to the first video frame;

Obtaining a first facial feature by performing facial feature extraction based on a deep neural network on the first to-be-tracked target, and storing the first facial feature into a feature library corresponding to the first to-be-tracked target;

Identifying a face region in the current video frame according to the face detection algorithm to obtain a current to-be-tracked target corresponding to the current video frame; obtaining a second facial feature by performing facial feature extraction based on the deep neural network on the current to-be-tracked target; performing feature matching on the current to-be-tracked target and the first to-be-tracked target according to the second facial feature and the feature library, so as to track the first to-be-tracked target starting from the first video frame; and, in the tracking process, updating the feature library according to the extracted updated facial features.
The method of claim 1 further comprising:

Identifying corresponding facial identity information by a face recognition algorithm according to a face state of the target to be tracked, obtaining a target feature corresponding to the facial identity information according to an image feature extraction algorithm, and establishing an association relationship between the target feature and the facial identity information;

The step of identifying the face area according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame includes:

Determining whether a face region is recognized in the current video frame according to the face detection algorithm, and if no face region is recognized, acquiring the current image feature corresponding to the current video frame according to the image feature extraction algorithm;

Comparing the current image feature with the target feature to obtain matching target facial identity information based on the association relationship;

Obtaining a current to-be-tracked target corresponding to the current video frame according to the target facial identity information.
The method according to claim 1, wherein the step of obtaining a first facial feature by performing facial feature extraction based on a deep neural network on the first to-be-tracked target, and storing the first facial feature into the feature library corresponding to the first to-be-tracked target, includes:

Obtaining first face identity information corresponding to the first target to be tracked;

Establishing a first facial feature set corresponding to the first facial identity information, adding the first facial feature to the first facial feature set, and storing the first facial feature set to the feature library;

The step of updating the feature library according to the extracted updated facial features during the tracking process includes:

Obtaining current face identity information corresponding to the current target to be tracked;

Obtaining, from the feature library, a first facial feature set corresponding to the current facial identity information;

Calculating a difference amount between the first facial feature in the first facial feature set and the second facial feature, and adding the second facial feature to the first facial feature set if the difference amount exceeds a preset threshold.
The method according to claim 1, wherein the step of identifying the face region according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame comprises:

Identifying the face region in the current video frame based on the normalized pixel difference feature and the human half-body recognition algorithm, to obtain the current target to be tracked corresponding to the current video frame.
The method according to claim 1, wherein the step of identifying the face region according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame comprises:

Identifying a face region based on the normalized pixel difference feature, and obtaining a first recommended region in the current video frame;

Calculating, according to the optical flow analysis algorithm, the second recommended area of the first to-be-tracked target corresponding to the current video frame;

Obtaining the current target to be tracked according to the first recommended area and the second recommended area.
The method according to claim 5, wherein the step of obtaining the current target to be tracked according to the first recommended area and the second recommended area comprises:

Performing motion prediction according to the inter-frame correlation to obtain an expected motion range, and filtering the first recommended area and the second recommended area according to the expected motion range to obtain the current to-be-tracked target.
The method according to any one of claims 1 to 6, wherein the network structure of the deep neural network is an 11-layer network that includes a stacked convolutional neural network and fully connected layers, the stacked convolutional neural network consisting of multiple convolution layers and max pool layers, and the specific network structure is:

Conv3-64*2+LRN+max pool

Conv3-128+max pool

Conv3-256*2+max pool

Conv3-512*2+max pool

Conv3-512*2+max pool

FC2048

FC1024,

Where conv3 represents a convolutional layer with a radius of 3, LRN represents a local response normalization (LRN) layer, max pool represents a max pooling layer, and FC represents a fully connected layer.
The method according to any one of claims 1 to 6, wherein the step of obtaining a first facial feature by performing facial feature extraction based on a deep neural network on the first to-be-tracked target, and storing the first facial feature into the feature library corresponding to the first to-be-tracked target, includes:

Performing facial feature extraction on the first to-be-tracked target through a deep neural network to obtain a first feature vector;

The step of obtaining the second facial feature by performing facial feature extraction based on the deep neural network on the current to-be-tracked target, and performing feature matching on the current to-be-tracked target and the first to-be-tracked target according to the second facial feature and the feature library to track the first to-be-tracked target from the first video frame, includes:

Performing facial feature extraction on the current target to be tracked through the deep neural network to obtain a second feature vector;

Calculating a Euclidean distance between the first feature vector and the second feature vector, and if the Euclidean distance is less than a preset threshold, determining that feature matching between the first to-be-tracked target and the current target to be tracked is successful.
A video object tracking device, the device comprising:

a processor and a memory coupled to the processor, the memory having stored therein a machine readable instruction module executable by the processor; the machine readable instruction module comprising:

a detecting module, configured to acquire a video stream, and identify a face region according to a face detection algorithm, to obtain a first to-be-tracked target corresponding to the first video frame;

a face feature extraction module, configured to obtain a first face feature by performing face feature extraction based on a deep neural network on the first to-be-tracked target, and to store the first face feature into a feature library corresponding to the first to-be-tracked target;

The detecting module is further configured to: identify a face area according to a face detection algorithm in the current video frame, and obtain a current target to be tracked corresponding to the current video frame;

The face feature extraction module is further configured to obtain a second face feature by performing face feature extraction based on the deep neural network on the current target to be tracked;

a tracking module, configured to perform feature matching between the current to-be-tracked target and the first to-be-tracked target according to the second facial feature and the feature library, to track the first to-be-tracked target from the first video frame;

and a learning module, configured to update the feature library according to the extracted updated facial features during the tracking process.
The apparatus of claim 9 further comprising:

a feature identity processing module, configured to identify corresponding face identity information by using a face recognition algorithm according to a face state of the target to be tracked, obtain a target feature corresponding to the face identity information according to the image feature extraction algorithm, and establish an association relationship between the target feature and the face identity information;

The detection module includes:

an image feature extraction unit, configured to determine whether a face region is recognized in the current video frame according to the face detection algorithm, and if no face region is recognized, acquire the current image feature corresponding to the current video frame according to the image feature extraction algorithm;

An identity matching unit, configured to compare the current image feature with the target feature to obtain matching target face identity information based on the association relationship;

The first tracking target determining unit is configured to obtain, according to the target facial identity information, a current target to be tracked corresponding to the current video frame.
The device according to claim 9, wherein the facial feature extraction module is further configured to acquire first facial identity information corresponding to the first target to be tracked, and establish a first facial feature corresponding to the first facial identity information. Collecting, adding the first facial feature to the first facial feature set and storing the first facial feature set to the feature database;

The learning module is further configured to acquire current facial identity information corresponding to the current target to be tracked, obtain a first facial feature set corresponding to the current facial identity information from the feature database, and calculate the first facial feature And a difference amount of the first facial feature in the set and the second facial feature, and if the difference exceeds a preset threshold, adding the second facial feature to the first facial feature set.
The apparatus according to claim 9, wherein the detecting module is further configured to identify a face region in the current video frame based on a normalized pixel difference feature and a human half-body recognition algorithm, to obtain a current target to be tracked corresponding to the current video frame.
The apparatus according to claim 9, wherein the detecting module comprises:

a first recommending unit, configured to identify a face region based on the normalized pixel difference feature, and obtain a first recommended region in the current video frame;

a second recommending unit, configured to calculate, according to an optical flow analysis algorithm, a second recommended region corresponding to the first to-be-tracked target in the current video frame;

a second tracking target determining unit, configured to obtain the current target to be tracked according to the first recommended region and the second recommended region.
The apparatus according to claim 13, wherein the second tracking target determining unit is further configured to perform motion prediction according to inter-frame correlation to obtain an expected motion range, and filter the first recommended region and the second recommended region according to the expected motion range to obtain the current target to be tracked.
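One way to picture the filtering step in the preceding claim is sketched below. The claim does not specify the prediction model, so this sketch assumes a constant-velocity prediction of the target center and a circular expected motion range; the helper names (`center`, `expected_range`, `filter_regions`), the `(x, y, width, height)` box layout, and the `margin` parameter are all hypothetical.

```python
def center(box):
    """Center point of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def expected_range(prev_box, velocity, margin):
    """Predict where the target center should be in the current frame.

    Assumption: constant velocity between frames. The range is a circle
    (predicted_x, predicted_y, radius) with radius equal to margin.
    """
    cx, cy = center(prev_box)
    vx, vy = velocity
    return (cx + vx, cy + vy, margin)


def filter_regions(regions, motion_range):
    """Keep only candidate regions whose center lies inside the range."""
    px, py, radius = motion_range
    kept = []
    for box in regions:
        cx, cy = center(box)
        if (cx - px) ** 2 + (cy - py) ** 2 <= radius ** 2:
            kept.append(box)
    return kept


# Candidates from the detector and from optical flow (illustrative values).
first_recommended = (1, 1, 10, 10)
second_recommended = (50, 50, 10, 10)
motion_range = expected_range((0, 0, 10, 10), velocity=(2, 0), margin=5)
candidates = filter_regions([first_recommended, second_recommended], motion_range)
```

In this example only the first candidate falls inside the expected motion range, so it alone would be passed on as the current target to be tracked.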
The device according to any one of claims 9 to 14, wherein the facial feature extraction module is further configured to perform facial feature extraction on the first to-be-tracked target through a deep neural network to obtain a first feature vector, and perform facial feature extraction on the current target to be tracked through the deep neural network to obtain a second feature vector;

The tracking module is further configured to calculate a Euclidean distance between the first feature vector and the second feature vector, and, if the Euclidean distance is less than a preset threshold, determine that feature matching between the first target to be tracked and the current target to be tracked is successful.
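The matching test in the preceding claim can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the feature vectors are stand-ins for the output of the deep neural network, and the threshold value is an assumption since the claim says only that it is preset.

```python
import math


def euclidean_distance(first_vector, second_vector):
    """Euclidean distance between two equal-length feature vectors."""
    if len(first_vector) != len(second_vector):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(first_vector, second_vector))
    )


def is_match(first_vector, second_vector, threshold):
    """Matching succeeds when the distance is strictly below the threshold."""
    return euclidean_distance(first_vector, second_vector) < threshold


# Stand-in feature vectors; real ones would come from the network.
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
matched = is_match([0.0, 0.0], [3.0, 4.0], threshold=6.0)
```

A smaller threshold makes matching stricter, which reduces identity switches at the cost of more lost tracks.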
A non-transitory computer readable storage medium storing machine readable instructions, the machine readable instructions being executable by a processor to perform the following operations:

Obtaining a video stream, identifying a face region according to a face detection algorithm, and obtaining a first to-be-tracked target corresponding to the first video frame;

Performing deep-neural-network-based facial feature extraction on the first to-be-tracked target to obtain a first facial feature, and storing the first facial feature in a feature library corresponding to the first to-be-tracked target;

Identifying a face region in the current video frame according to the face detection algorithm to obtain a current target to be tracked corresponding to the current video frame, performing deep-neural-network-based facial feature extraction on the current target to be tracked to obtain a second facial feature, performing feature matching on the current target to be tracked and the first to-be-tracked target according to the second facial feature and the feature library so as to track the first to-be-tracked target starting from the first video frame, and, during tracking, updating the feature library according to the extracted updated facial features.
The non-transitory computer readable storage medium of claim 16, wherein the machine readable instructions are further executable by the processor to perform the following operations:

Identifying corresponding facial identity information by using a face recognition algorithm according to a face state of the target to be tracked, obtaining a target feature corresponding to the facial identity information according to an image feature extraction algorithm, and establishing an association relationship between the target feature and the facial identity information;

The step of identifying the face area according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame includes:

Determining whether a face region is recognized in the current video frame according to the face detection algorithm, and, if no face region is recognized, acquiring a current image feature corresponding to the current video frame according to the image feature extraction algorithm;

Comparing the current image feature with the target feature to obtain matching target facial identity information based on the association relationship;

Obtaining a current target to be tracked corresponding to the current video frame according to the target facial identity information.
The non-transitory computer readable storage medium according to claim 16, wherein the step of performing deep-neural-network-based facial feature extraction on the first to-be-tracked target to obtain a first facial feature, and storing the first facial feature in the feature library corresponding to the first to-be-tracked target comprises:

Obtaining first face identity information corresponding to the first target to be tracked;

Establishing a first facial feature set corresponding to the first facial identity information, adding the first facial feature to the first facial feature set, and storing the first facial feature set in the feature library;

The step of updating the feature library according to the extracted updated facial features during the tracking process includes:

Obtaining current face identity information corresponding to the current target to be tracked;

Obtaining, from the feature database, a first facial feature set corresponding to the current facial identity information;

Calculating a difference amount between the first facial feature in the first facial feature set and the second facial feature, and, if the difference amount exceeds a preset threshold, adding the second facial feature to the first facial feature set.
The non-transitory computer readable storage medium of claim 16, wherein the step of identifying the face region according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame comprises:

Identifying the face region in the current video frame based on a normalized pixel difference feature and a human half-body recognition algorithm, to obtain the current target to be tracked corresponding to the current video frame.
The non-transitory computer readable storage medium of claim 16, wherein the step of identifying the face region according to the face detection algorithm in the current video frame, and obtaining the current target to be tracked corresponding to the current video frame comprises:

Identifying a face region based on the normalized pixel difference feature, and obtaining a first recommended region in the current video frame;

Calculating, according to the optical flow analysis algorithm, the second recommended area of the first to-be-tracked target corresponding to the current video frame;

Obtaining the current target to be tracked according to the first recommended area and the second recommended area.
The non-transitory computer readable storage medium of claim 20, wherein the step of obtaining the current target to be tracked according to the first recommended area and the second recommended area comprises:

Performing motion prediction according to inter-frame correlation to obtain an expected motion range, and filtering the first recommended area and the second recommended area according to the expected motion range to obtain the current target to be tracked.
The non-transitory computer readable storage medium according to any one of claims 16 to 21, wherein the step of performing deep-neural-network-based facial feature extraction on the first to-be-tracked target to obtain a first facial feature, and storing the first facial feature in the feature library corresponding to the first to-be-tracked target comprises:

Performing facial feature extraction on the first to-be-tracked target through a deep neural network to obtain a first feature vector;

The step of performing deep-neural-network-based facial feature extraction on the current target to be tracked to obtain the second facial feature, and performing feature matching on the current target to be tracked and the first to-be-tracked target according to the second facial feature and the feature library so as to track the first to-be-tracked target starting from the first video frame comprises:

Performing facial feature extraction on the current target to be tracked through the deep neural network to obtain a second feature vector;

Calculating a Euclidean distance between the first feature vector and the second feature vector, and, if the Euclidean distance is less than a preset threshold, determining that feature matching between the first to-be-tracked target and the current target to be tracked is successful.