CN111145215A - Target tracking method and device - Google Patents

Target tracking method and device

Info

Publication number
CN111145215A
Authority
CN
China
Prior art keywords
frame
pedestrian
face
reduced
detection network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911357264.1A
Other languages
Chinese (zh)
Other versions
CN111145215B (en)
Inventor
沈磊
李伯勋
张弛
俞刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201911357264.1A priority Critical patent/CN111145215B/en
Publication of CN111145215A publication Critical patent/CN111145215A/en
Application granted granted Critical
Publication of CN111145215B publication Critical patent/CN111145215B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Abstract

The invention provides a target tracking method and device, relating to the technical field of image processing. The method includes: inputting video images into a face detection network and a pedestrian detection network respectively, outputting a face frame detected by the face detection network together with a first pedestrian frame corresponding to the face frame, and also outputting a second pedestrian frame detected by the pedestrian detection network; performing a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and performing a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame; and determining a tracking sequence according to the first and second reduced frames. By binding faces to persons, the method effectively avoids face-tracking breaks caused in dynamic video by occluded faces or by subjects turning their backs; the frame-reduction operation effectively alleviates the swapping of tracked-object identification codes that pedestrian overlap tends to cause, thereby improving face tracking performance.

Description

Target tracking method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a target tracking method and device.
Background
With the rapid development of computer technology, the continuous optimization of software environments and the steady improvement of hardware performance, computer vision technology has matured considerably. In the early stages of computer vision research, constrained by many factors, work focused mainly on objects of interest in still pictures. With the rapid development of computer technology and growing public concern over social security and related problems, however, research on video sequences has become a hot spot in the computer vision field and has attracted great attention and research enthusiasm.
Face tracking in dynamic video has long been an active research topic, owing to its broad prospects and its important position in practical business applications. It is also a challenging task, being susceptible to factors such as illumination, pose, expression, noise, occlusion and resolution. After a period of development, dynamic-video face tracking is now widely applied in intelligent access control, intelligent video surveillance, smart homes, video teleconferencing, visual navigation, virtual reality and other fields.
However, existing face trackers generally suffer from two shortcomings. The first is speed: current face trackers based on multi-target tracking are time-consuming, which greatly limits their application in real business scenarios. The second is accuracy: existing detection-based trackers (e.g., the IOU Tracker) are very fast when the video frame rate is high, but their tracking performance depends heavily on detection performance (for example, a face cannot be detected when a person turns away, so the face tracked before and after a turn is mistaken for two different people), and the identification codes of tracked objects are easily swapped (ID switch) when tracking large objects such as pedestrians. Improving both the speed and the accuracy of face trackers is the current bottleneck for practical business applications of face tracking technology.
Disclosure of Invention
To achieve at least some of the above objectives, an embodiment of the first aspect of the present invention provides a target tracking method, including:
inputting video images into a face detection network and a pedestrian detection network respectively, outputting a face frame detected by the face detection network and a first pedestrian frame corresponding to the face frame, and also outputting a second pedestrian frame detected by the pedestrian detection network;
performing a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and performing a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame;
and determining a tracking sequence according to the first and second reduced frames.
Further, the outputting of the face frame detected by the face detection network and the first pedestrian frame corresponding to the face frame includes: outputting the face frame contained in each frame of image in the video images as detected by the face detection network, the confidence of the face frame, and the first pedestrian frame bound to the face frame; and the outputting of the second pedestrian frame detected by the pedestrian detection network includes: outputting the second pedestrian frame contained in each frame of image in the video images as detected by the pedestrian detection network.
Further, the method also includes:
determining whether the tracking sequence satisfies an output condition according to the confidences of the face frames in the tracking sequence;
and outputting the tracking sequence when the output condition is satisfied.
Further, the output condition includes that the confidence of the face frame in every frame of image in the tracking sequence is greater than a lowest threshold, and the confidence of at least one face frame in the tracking sequence is greater than a highest threshold.
Further, the reducing of the first pedestrian frame toward the first preset direction to obtain the first reduced frame includes: proportionally shrinking the first pedestrian frame toward the face frame corresponding to the first pedestrian frame to obtain the first reduced frame; and the reducing of the second pedestrian frame toward the second preset direction to obtain the second reduced frame includes: proportionally shrinking the second pedestrian frame toward the head of the second pedestrian frame to obtain the second reduced frame.
Further, the proportional shrinking of the first pedestrian frame toward its corresponding face frame includes: reducing the size of the first pedestrian frame in the horizontal direction by a first reduction ratio, and reducing its size in the vertical direction by a second reduction ratio; and the proportional shrinking of the second pedestrian frame toward its head includes: reducing the size of the second pedestrian frame in the horizontal direction by a third reduction ratio, and reducing its size in the vertical direction by a fourth reduction ratio.
Further, reducing the size of the first pedestrian frame in the vertical direction by the second reduction ratio includes: keeping the top of the first pedestrian frame fixed and shrinking the bottom of the first pedestrian frame toward its top by the second reduction ratio; and reducing the size of the second pedestrian frame in the vertical direction by the fourth reduction ratio includes: keeping the top of the second pedestrian frame fixed and shrinking the bottom of the second pedestrian frame toward its top by the fourth reduction ratio.
Further, if the height of the first reduced frame is smaller than its width, the height of the first reduced frame is changed to equal its width; and if the height of the second reduced frame is smaller than its width, the height of the second reduced frame is changed to equal its width.
Further, the determining of a tracking sequence according to the first and second reduced frames includes:
performing intersection-over-union matching on the first reduced frames in adjacent frame images of the video and adding first reduced frames whose match exceeds a first matching threshold, together with their corresponding face frames, to the tracking sequence; and performing intersection-over-union matching on the second reduced frames in adjacent frame images of the video and adding second reduced frames whose match exceeds a second matching threshold to the tracking sequence.
Further, the determining of a tracking sequence according to the first and second reduced frames also includes:
adding a setting frame as the corresponding face frame if a second reduced frame exceeding the second matching threshold has no corresponding face frame.
Further, the confidence of the setting frame is set to a negative number.
Further, after the tracking sequence is determined according to the first and second reduced frames, the method further includes:
determining whether the tracking sequence satisfies the output condition according to the confidences of the face frames in the tracking sequence, wherein face frames with negative confidence do not participate in the judgment of whether the output condition is satisfied;
and outputting the tracking sequence when the output condition is satisfied.
Further, after the video images are respectively input into the face detection network and the pedestrian detection network and the face frame, the first pedestrian frame corresponding to the face frame and the second pedestrian frame are output, the method further includes:
matching, by intersection-over-union, the first pedestrian frame detected by the face detection network with the second pedestrian frame detected by the pedestrian detection network in the same frame of image;
and deleting the second pedestrian frame when the matching result is greater than a preset threshold.
To achieve the above objectives, an embodiment of the second aspect of the present invention further provides a target tracking apparatus, including:
an obtaining module, configured to input video images into a face detection network and a pedestrian detection network respectively, output a face frame detected by the face detection network and a first pedestrian frame corresponding to the face frame, and also output a second pedestrian frame detected by the pedestrian detection network;
a processing module, configured to perform a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and perform a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame;
and a tracking module, configured to determine a tracking sequence according to the first and second reduced frames.
With the target tracking method and device provided by the invention, the face frame detected in the video is bound to the first pedestrian frame, the first pedestrian frame is shrunk to obtain the corresponding first reduced frame, the second pedestrian frame in the video is detected by the pedestrian detection network, a tracking sequence is determined according to the first and second reduced frames, and the face is thus tracked through the body frame. The face-person binding tracking mode (i.e., determining the face frame and its corresponding first pedestrian frame) effectively avoids face-tracking breaks caused in dynamic video by face occlusion and turned backs; shrinking the pedestrian frames effectively alleviates the swapping of tracked-object identification codes (ID switch) that pedestrian overlap tends to cause, improving face tracking performance.
To achieve the above objectives, an embodiment of the third aspect of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the target tracking method according to the first aspect of the present invention is implemented.
To achieve the above objectives, an embodiment of the fourth aspect of the present invention provides a computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the target tracking method according to the first aspect of the present invention when executing the computer program.
The non-transitory computer-readable storage medium and the computing device of the present invention have beneficial effects similar to those of the target tracking method according to the first aspect of the present invention, which are not repeated here.
Drawings
FIG. 1 is a schematic diagram of a target Tracker IOU Tracker;
FIG. 2 is a schematic diagram of a target tracking method according to an embodiment of the invention;
FIG. 3 is a flowchart illustrating a target tracking method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of intersection-over-union matching according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second pedestrian frame deletion operation according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the output of a tracking sequence according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a target tracking device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computing device according to an embodiment of the invention.
Detailed Description
Embodiments in accordance with the present invention will now be described in detail with reference to the drawings, wherein like reference numerals refer to the same or similar elements throughout the different views unless otherwise specified. It should be noted that the embodiments described below do not represent all embodiments of the present invention; they are merely examples of apparatus and methods consistent with certain aspects of the invention as detailed in the claims, and the scope of the present disclosure is not limited in these respects. Features of the various embodiments of the invention may be combined with each other without departing from the scope of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the past few years, from object detection and classification and target tracking to smart homes and intelligent healthcare, deep learning has been applied throughout video image analysis, influencing and bringing convenience to people's lives. With the rapid development of hardware and the support of software technology, face detection and target tracking in video image analysis have gradually shifted from traditional hand-crafted features to deep-learning-based approaches.
The human face, one of the most important parts of the human body, has inherent biological characteristics and carries a large amount of information. Compared with traditional digital passwords, handwriting imitation and the like, it is harder to crack and offers stronger confidentiality. The application of face tracking in the field of intelligent security is therefore highly significant. However, faces in video are easily affected by factors such as illumination, deformation and video acquisition resolution, and missed or false detections sometimes occur.
In video image processing, target tracking and face recognition can be largely complementary. When a face cannot be detected, target detection and tracking can largely avoid losing the tracked target, but in scenes dense with faces the combined processing is slow and time-consuming. The IOU strategy adds almost negligible time beyond detection itself and is therefore fast, but it depends excessively on detection performance, and ID switches (swaps of tracked-object identification codes, i.e., tracked object A being mistaken for tracked object B) easily occur when tracking large objects. The present invention binds the face frame to the person frame and adds a frame-reduction strategy, tracking the face frame through the person frame and effectively improving both the speed and the accuracy of face tracking in dynamic video.
Fig. 1 is a schematic diagram of the IOU Tracker. When detection accuracy and video frame rate are both high, the IOU Tracker performs multi-threshold judgment on the overlap (i.e., intersection-over-union) between consecutive frames: a high threshold yields confirmed detection frames, a low threshold retains all plausible detection frames, and information from neighboring frames is used to recover undetected object frames and add them to the confirmed set. Tracks are then filtered by tracking duration and detection confidence to realize object tracking. However, in pedestrian tracking on dynamic video, bodies are often occluded or overlapping, so the IOU Tracker easily produces ID switches; the face information then no longer corresponds correctly, which harms face tracking accuracy.
FIG. 2 is a schematic diagram of the principle of a target tracking method according to an embodiment of the present invention, comprising steps S21-S23. FIG. 3 is a schematic flow chart of the target tracking method; the invention is explained below with reference to FIGS. 2 and 3.
In step S21, the video images are input into a face detection network and a pedestrian detection network respectively; the face frame detected by the face detection network and the first pedestrian frame corresponding to the face frame are output, and the second pedestrian frame detected by the pedestrian detection network is also output. In the embodiment of the invention, a segment of dynamic video is obtained and input into a pre-trained face detection network, which outputs bound pairs of a face frame and a first pedestrian frame. The face detection network may be trained on a training set whose labels bind each face to its person, so that for each frame of image in the input video it outputs the contained face frames and the first pedestrian frames bound to them. Binding means that the pedestrian frame corresponds to the face frame, i.e., both belong to the same person. In some embodiments, the face detection network also outputs a confidence for each detected face frame.
In the embodiment of the invention, the video is also input into a pre-trained pedestrian detection network, which outputs the second pedestrian frames contained in each frame of image of the video. The pedestrian detection network detects the pedestrians in every image frame and can obtain body frames of high quality: a body frame can be detected even when the pedestrian shows their back or the face is occluded. Therefore, even when the face of a tracked person in the video is occluded or turned away, so that the face detection network cannot detect the face frame and its corresponding first pedestrian frame, the person can still be continuously tracked through the second pedestrian frame detected by the pedestrian detection network. By binding the face frame to the pedestrian frame, the invention effectively avoids face-tracking breaks caused by face occlusion and turned backs in dynamic video.
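To make the data flow of step S21 concrete, the following minimal Python sketch illustrates the processing of one frame; the face_net and pedestrian_net wrappers and their return formats are assumptions for illustration, not part of the invention, and boxes are axis-aligned (x1, y1, x2, y2) tuples:

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), top-left image origin

@dataclass
class BoundDetection:
    """A face frame bound to its first pedestrian frame (same person)."""
    face_box: Box
    face_confidence: float  # in (0, 1), from the face detection network
    pedestrian_box: Box     # the first pedestrian frame

def detect_frame(image, face_net, pedestrian_net):
    """Step S21 for one frame of image. face_net is assumed to yield
    (face_box, confidence, bound_pedestrian_box) triples, and
    pedestrian_net to yield plain body boxes (the second pedestrian
    frames); both interfaces are illustrative placeholders."""
    bound = [BoundDetection(f, c, p) for f, c, p in face_net(image)]
    second_frames: List[Box] = list(pedestrian_net(image))
    return bound, second_frames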
In step S22, a reduction operation is performed on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and a reduction operation is performed on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame. In the embodiment of the present invention, the first preset direction is the direction of the face frame corresponding to the first pedestrian frame, and the second preset direction is the direction of the head of the second pedestrian frame. It should be understood that the first preset direction may also be the upper-left or upper-right direction of the first pedestrian frame, and the second preset direction may also be the upper-left or upper-right direction of the second pedestrian frame; the embodiment of the present invention does not limit this.
In the embodiment of the present invention, the first pedestrian frame is shrunk toward its corresponding face frame to obtain the first reduced frame, and the second pedestrian frame is shrunk toward its head to obtain the second reduced frame. Shrinking the first pedestrian frame toward its corresponding face frame consists of reducing the size of the first pedestrian frame in the horizontal direction by a first reduction ratio and in the vertical direction by a second reduction ratio, which yields the first reduced frame corresponding to the first pedestrian frame.
In the embodiment of the present invention, shrinking the second pedestrian frame toward its head consists of reducing the size of the second pedestrian frame in the horizontal direction by a third reduction ratio and in the vertical direction by a fourth reduction ratio, which yields the second reduced frame corresponding to the second pedestrian frame. Since pedestrians are often occluded or overlapping, ID switches easily occur when tracking pedestrians in dynamic video. Shrinking the detected first and second pedestrian frames therefore reduces the chance of mismatched associations when overlapping pedestrian frames are matched across frames, which reduces ID switches and improves face tracking accuracy. It should be understood that the reduction ratios of the first and second pedestrian frames may be the same, or may be set differently according to the actual task: the first reduction ratio may be the same as or different from the third, and the second reduction ratio may be the same as or different from the fourth.
In some embodiments, the bound first pedestrian frame and face frame detected by the face detection network are shown as solid-line frames in fig. 3; shrinking the first pedestrian frame toward the face frame gives a better effect. The first reduced frame obtained after shrinking is shown as a dotted-line frame in fig. 3. Because the first reduced frame is small, it is not easily occluded or overlapped; and because the first pedestrian frame is shrunk toward the bound face frame, the correspondence between pedestrian frame and face frame is more accurate, making ID switches less likely.
In some embodiments, with the horizontal direction of the current frame image as the X axis and the vertical direction as the Y axis, the width coordinates of the detected first pedestrian frame are X1 (left side) and X2 (right side), and its height coordinates are Y1 (top side) and Y2 (bottom side). In this embodiment the first reduction ratio is 1/2 and the second reduction ratio is 1/4, so the first pedestrian frame is reduced to half of its original size in the horizontal (X-axis) direction and to 1/4 of its original size in the vertical (Y-axis) direction. As shown in fig. 3, the frame reduction is performed while keeping Y1 unchanged, i.e., the first pedestrian frame is shrunk toward the face frame, which gives a better reduction effect.
In some embodiments, the first pedestrian frame is shrunk toward its corresponding face frame, that is, the top of the first pedestrian frame is kept fixed and the bottom of the first pedestrian frame is shrunk toward the top by the second reduction ratio. If the height of the first pedestrian frame after vertical reduction is smaller than its width after horizontal reduction, the height is set equal to the width. In practice, a pedestrian often walks partly out of the picture, so the height of the pedestrian frame is truncated; if the first pedestrian frame were still shrunk by the above ratio, the height of the reduced frame could become too small, which hinders its subsequent processing and harms face tracking accuracy.
A constraint is therefore added to the height reduction: in the embodiment of the invention, when the height of the first pedestrian frame reduced in the Y direction is smaller than its width reduced in the X direction, the height is changed to equal the width, so that the first reduced frame becomes a square. A specific, non-limiting example of the first frame-reduction strategy is given below:
X1_new = (X2 - X1)/4 + X1
X2_new = X2 - (X2 - X1)/4
Y1_new = Y1
Y2_new = max(Y2 - (Y2 - Y1) * 3/4, Y1 + (X2_new - X1_new))
This ensures that the height of the reduced first frame never becomes so small as to harm face tracking accuracy. It can be understood that the second pedestrian frame detected by the pedestrian detection network can be shrunk in the same way, toward the head of the second pedestrian frame, to obtain the corresponding second reduced frame: the top of the second pedestrian frame is kept fixed and its bottom is shrunk toward the top by the fourth reduction ratio, and if the height of the second reduced frame is smaller than its width, the height is changed to equal the width.
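The reduction strategy above translates directly into code. The following sketch implements exactly the X1_new/X2_new/Y1_new/Y2_new formulas, i.e. a 1/2 horizontal and 1/4 vertical reduction anchored at the top edge with the square clamp on the height; the ratios are the example values of this embodiment and would be parameters in a general implementation:

def shrink_box_toward_top(box):
    """Shrink a pedestrian frame toward the face/head end (the top edge):
    the width is halved symmetrically, the height is reduced to 1/4 with
    the top edge fixed, and the height is clamped so the reduced frame is
    never shorter than it is wide (a square at worst)."""
    x1, y1, x2, y2 = box
    x1_new = x1 + (x2 - x1) / 4.0
    x2_new = x2 - (x2 - x1) / 4.0            # width becomes half the original
    y1_new = y1                              # top edge (head side) unchanged
    y2_new = max(y2 - (y2 - y1) * 3.0 / 4.0, # height reduced to 1/4 ...
                 y1 + (x2_new - x1_new))     # ... but at least the new width
    return (x1_new, y1_new, x2_new, y2_new)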
In some embodiments, for tracked objects that are small in the image, a frame-expansion operation (i.e., enlarging the person frame) may first be applied for detection and recognition, after which the frame is shrunk for subsequent processing.
In step S23, a tracking sequence is determined according to the first and second reduced frames. In the embodiment of the invention, intersection-over-union matching is performed on the first reduced frames, and first reduced frames exceeding a first matching threshold, together with their corresponding face frames, are added to the tracking sequence; intersection-over-union matching is likewise performed on the second reduced frames, and second reduced frames exceeding a second matching threshold are added to the tracking sequence. The tracking sequence records the tracked object; in the subsequent evaluation stage it is scored based on the recorded tracked object, and tracking sequences satisfying the conditions are output.
Fig. 4 is a schematic diagram of intersection-over-union (IOU) matching according to an embodiment of the present invention. For the first reduced frames (i.e., the person frames after the size-reduction operation above) in two adjacent frame images, such as frame A and frame B, the areas of the intersection (A ∩ B) and of the union (A ∪ B) of the two frames are computed, and the IOU value of the two frames is:

IOU(A, B) = Area(A ∩ B) / Area(A ∪ B)
in some embodiments, a first matching threshold may be set according to task needs, where the first matching threshold is a number greater than 0 and less than 1, and a value of the first matching threshold may be set according to a specific application condition. For example, if the first matching threshold is set to 0.5, the first reduced frames with the IOU value greater than 0.5 are all considered to be correctly tracked, and the corresponding face frame and the corresponding first reduced frame are added to the tracking sequence, so that the accuracy of face tracking can be ensured. If the detection judgment is more accurate, the value of the first matching threshold value can be correspondingly improved, for example, the first matching threshold value is set to be 0.6, and the face tracking has higher precision. It is understood that the second reduced frames in the adjacent frame images are also subjected to cross-matching, and the second reduced frames larger than a second matching threshold, which may be equal to or different from the first matching threshold, are added to the tracking sequence.
In some embodiments, if a second reduced frame above the matching threshold has no corresponding face frame, a setting frame is added as its face frame. Some body frames detected by the pedestrian detection network may have no corresponding face frame in certain images (for example, when the pedestrian has turned their back to the camera), yet the object can still be tracked through IOU matching of the body frames; a placeholder setting frame is then allocated to it as the corresponding face frame. Thus the face can still be tracked accurately when it is occluded, overlapped or turned away, improving face tracking accuracy.
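In code, the placeholder can be a (box, confidence) entry whose confidence is the negative marker value; the -1 follows the distinguishable-value example given later in this description, and the record layout is an illustrative assumption:

PLACEHOLDER_CONFIDENCE = -1.0  # negative, so it never passes the output checks

def face_entry_for(second_reduced_frame, bound_face=None):
    """Return the (face_box, confidence) entry to record in the tracking
    sequence: the real bound face frame when one exists, otherwise a
    placeholder setting frame with negative confidence."""
    if bound_face is not None:
        return bound_face  # (face_box, confidence in (0, 1))
    return (second_reduced_frame, PLACEHOLDER_CONFIDENCE)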
In some embodiments, a second pedestrian frame deletion operation may follow step S21. Fig. 5 is a schematic diagram of the second pedestrian frame deletion operation according to an embodiment of the present invention, comprising steps S24-S25.
In step S24, intersection-over-union matching is performed between the first pedestrian frame detected by the face detection network and the second pedestrian frame detected by the pedestrian detection network in the same frame of image. In the embodiment of the invention, several pedestrian frames may be detected for the same frame of image input into the two networks, and duplicate detection frames may exist among them; judging by the intersection-over-union of the detection frames effectively reduces the duplication rate and safeguards tracking accuracy.
In step S25, when the matching result is greater than a preset threshold, the second pedestrian frame is deleted. In the embodiment of the invention, detection frames whose matching result exceeds the preset threshold can be considered to belong to the same person; the second pedestrian frame detected by the pedestrian detection network is then deleted and only the bound face frame and first pedestrian frame from the face detection network are retained. This effectively reduces the duplication rate, removes redundant detection frames, saves system storage resources and effectively improves processing speed.
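A sketch of the deduplication of steps S24-S25, reusing the iou() helper from the matching sketch above; the preset threshold value of 0.7 is an assumption for illustration:

def remove_duplicate_second_frames(first_frames, second_frames,
                                   preset_threshold=0.7):
    """Delete every second pedestrian frame whose IOU with some first
    pedestrian frame (from the face detection network) in the same image
    exceeds the preset threshold, keeping only the bound face/person pair."""
    kept = []
    for s in second_frames:
        if all(iou(f, s) <= preset_threshold for f in first_frames):
            kept.append(s)
    return kept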
In some embodiments, after the step S23, a tracking sequence output step may be further included. FIG. 6 is a schematic diagram illustrating the output of the tracking sequence according to the embodiment of the present invention, which includes steps S26-S27.
In step S26, whether the tracking sequence satisfies an output condition is determined according to the confidences of the face frames in the tracking sequence. In the embodiment of the present invention, the confidence of each face frame is obtained from the pre-trained face detection network and is a value in (0, 1). For frame images whose face frame is an added setting frame, the confidence of the setting frame is set to a clearly distinguishable value, for example -1, so that it is cleanly separated from the confidences of genuinely detected face frames. When judging whether to output a tracking sequence, face frames with negative confidence are not evaluated, i.e., they do not take part in the output-condition judgment. This enables accurate face tracking while ensuring that the output tracking images actually contain a face, improving face tracking precision.
In step S27, when the output condition is satisfied, the tracking sequence is output. In the embodiment of the present invention, the output condition is that the confidence of the face frame in every frame of image in the tracking sequence is greater than a lowest threshold, and the confidence of at least one face frame in the tracking sequence is greater than a highest threshold. Requiring every face frame's confidence to exceed the lowest threshold ensures the accuracy of the detected face frames; requiring at least one face frame's confidence to exceed the highest threshold ensures the accuracy of face frame tracking. A sequence meeting both is considered well tracked and can be output to display the face tracking result. The lowest threshold is smaller than the highest threshold, and their specific values can be set according to the specific application; the embodiment of the present invention does not limit them.
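The output condition of steps S26-S27 reduces to two checks over the face-frame confidences of a tracking sequence, with placeholder frames of negative confidence skipped as described above; the threshold values here are illustrative assumptions:

def satisfies_output_condition(face_confidences,
                               lowest_threshold=0.3,
                               highest_threshold=0.8):
    """True when every real face frame clears the lowest threshold and at
    least one clears the highest threshold; negative confidences mark
    placeholder setting frames and do not take part in the judgment."""
    real = [c for c in face_confidences if c >= 0]
    if not real:
        return False
    return (all(c > lowest_threshold for c in real)
            and any(c > highest_threshold for c in real))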
The following table compares the target tracking method of the embodiment of the invention with other face tracking methods experimentally, to better illustrate the performance gains of the invention for dynamic-video face tracking.
[Table: comparison of duplication rate and snapshot rate for Experiment 1 (baseline tracker), Experiment 2 (face-person binding) and Experiment 3 (face-person binding plus frame reduction); the numeric values appear only in the image of the original publication.]
The above experimental indexes show that:
1. Comparing Experiment 2 with Experiment 1: tracking by face-person binding greatly reduces the duplication rate (down 13.55, better performance) but loses a considerable amount of snapshot rate (down 3.6, worse performance).
2. Comparing Experiment 3 with Experiment 2: adding the frame-reduction tracking mode keeps the good low duplication rate (essentially unchanged) while markedly improving the snapshot rate (up 2.11, better performance).
3. Comparing Experiment 3 with Experiment 1: tracking by face-person binding plus frame reduction loses little snapshot rate (down 1.49, slightly worse performance) while greatly reducing the duplication rate (down 13.43, greatly improved performance).
4. In conclusion, combining face-person binding with the frame-reduction operation greatly reduces the duplication rate at only a small cost in snapshot rate, notably improving face tracking performance.
By adopting the target tracking method described above, the face frame detected in the video is bound to the first pedestrian frame, the second pedestrian frame in the video is detected by the pedestrian detection network, and the tracking sequence is determined according to the first and second reduced frames, so that the face is tracked through the pedestrian frame. Through the face-person binding tracking mode, the invention effectively avoids face-tracking breaks caused in dynamic video by face occlusion and turned backs; shrinking the pedestrian frames effectively alleviates the tracked-object identification code swapping (ID switch) that pedestrian overlap tends to cause, greatly improving face tracking performance.
The embodiment of the second aspect of the invention also provides a target tracking device. Fig. 7 is a schematic structural diagram of a target tracking apparatus 700 according to an embodiment of the present invention, which includes an obtaining module 701, a processing module 702, and a tracking module 703.
The obtaining module 701 is configured to input video images into a face detection network and a pedestrian detection network respectively, output a face frame detected by the face detection network and a first pedestrian frame corresponding to the face frame, and also output a second pedestrian frame detected by the pedestrian detection network. In this embodiment of the present invention, the obtaining module 701 includes a first output unit 7011 and a second output unit 7012 (both shown in the figure): the first output unit 7011 is configured to output the face frame contained in each frame of image in the video images as detected by the face detection network, the confidence of the face frame, and the first pedestrian frame bound to the face frame; the second output unit 7012 is configured to output the second pedestrian frame contained in each frame of image in the video images as detected by the pedestrian detection network.
The processing module 702 is configured to perform a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and to perform a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame. In this embodiment of the present invention, the processing module 702 includes a first frame-reduction unit 7021 and a second frame-reduction unit 7022 (both shown in the figure): the first frame-reduction unit 7021 is configured to shrink the first pedestrian frame toward the face frame corresponding to the first pedestrian frame to obtain the first reduced frame; the second frame-reduction unit 7022 is configured to shrink the second pedestrian frame toward the head of the second pedestrian frame to obtain the second reduced frame.
Optionally, the first frame-reduction unit 7021 is configured to reduce the size of the first pedestrian frame in the horizontal direction by a first reduction ratio and in the vertical direction by a second reduction ratio; the second frame-reduction unit 7022 is configured to reduce the size of the second pedestrian frame in the horizontal direction by a third reduction ratio and in the vertical direction by a fourth reduction ratio. Optionally, the first frame-reduction unit 7021 is configured to keep the top of the first pedestrian frame fixed and shrink the bottom of the first pedestrian frame toward its top by the second reduction ratio; the second frame-reduction unit 7022 is configured to keep the top of the second pedestrian frame fixed and shrink the bottom of the second pedestrian frame toward its top by the fourth reduction ratio. When the height of the first reduced frame is smaller than its width, the height of the first reduced frame is changed to equal its width; and when the height of the second reduced frame is smaller than its width, the height of the second reduced frame is changed to equal its width.
The tracking module 703 is configured to determine a tracking sequence according to the first and second reduced frames. In this embodiment of the present invention, the tracking module 703 includes a first matching unit 7031 and a second matching unit 7032 (not shown in the figure): the first matching unit 7031 is configured to perform intersection-over-union matching on the first reduced frames in adjacent frame images of the video and add first reduced frames exceeding a first matching threshold, together with their corresponding face frames, to the tracking sequence; the second matching unit 7032 is configured to perform intersection-over-union matching on the second reduced frames in adjacent frame images of the video and add second reduced frames exceeding a second matching threshold to the tracking sequence.
In this embodiment of the present invention, the tracking module 703 further includes a setting frame unit 7033 (not shown in the figure), configured to add a setting frame as the corresponding face frame when a second reduced frame exceeding the second matching threshold has no corresponding face frame, the confidence of the setting frame being set to a negative number.
In this embodiment of the present invention, the target tracking apparatus 700 further includes an output module 704 (not shown in the figure), configured to determine whether the tracking sequence satisfies an output condition according to the confidences of the face frames in the tracking sequence, and to output the tracking sequence when the output condition is satisfied. The output condition includes that the confidence of the face frame in every frame of image in the tracking sequence is greater than a lowest threshold, and the confidence of at least one face frame in the tracking sequence is greater than a highest threshold. In the embodiment of the invention, face frames with negative confidence do not participate in the judgment of whether the output condition is satisfied.
In this embodiment of the present invention, the target tracking apparatus 700 further includes a pedestrian frame deletion module 705 (not shown in the figure), configured to perform intersection-over-union matching between the first pedestrian frame detected by the face detection network and the second pedestrian frame detected by the pedestrian detection network in the same frame of image, and to delete the second pedestrian frame when the matching result is greater than a preset threshold.
For more specific implementations of the modules of the target tracking apparatus 700, reference may be made to the description of the target tracking method of the present invention; the beneficial effects are similar and are not repeated here.
An embodiment of the third aspect of the present invention proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a target tracking method according to an embodiment of the first aspect of the present invention.
Generally, computer instructions for carrying out the methods of the present invention may be carried using any combination of one or more computer-readable storage media. A non-transitory computer-readable storage medium may include any computer-readable medium except a transitorily propagating signal itself.
A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar languages; in particular, the Python language suited to neural network computing and platform frameworks based on TensorFlow or PyTorch may be employed. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
An embodiment of a fourth aspect of the present invention provides a computing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the object tracking method according to the embodiment of the first aspect of the present invention.
The non-transitory computer-readable storage medium and the computing device according to the third and fourth aspects of the present invention may be implemented with reference to the content specifically described in the embodiment according to the first aspect of the present invention, and have similar beneficial effects to the target tracking method according to the embodiment of the first aspect of the present invention, and are not described herein again.
FIG. 8 illustrates a block diagram of an exemplary computing device suitable for use to implement embodiments of the present disclosure. The computing device 12 shown in FIG. 8 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.
As shown in FIG. 8, computing device 12 may be implemented in the form of a general purpose computing device. Components of computing device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. Such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computing device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computing device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computing device 12 may further include other removable/non-removable, volatile/nonvolatile computer-readable storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown, but commonly referred to as a "hard drive"). Although not shown in FIG. 8, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described in this disclosure.
Computing device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with computing device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables computing device 12 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 22. Moreover, computing device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computing device 12 via bus 18. It should be noted that, although not shown, other hardware and/or software modules may be used in conjunction with computing device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes the programs stored in the system memory 28 to perform various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments.
The computing device of the invention can be a server or a terminal device with limited computing power.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are illustrative and are not to be construed as limiting the present invention; changes, modifications, substitutions, and alterations can be made to the above embodiments by those of ordinary skill in the art without departing from the scope of the present invention.

Claims (16)

1. A target tracking method, comprising:
inputting video images into a face detection network and a pedestrian detection network, respectively; outputting a face frame detected by the face detection network and a first pedestrian frame corresponding to the face frame; and further outputting a second pedestrian frame detected by the pedestrian detection network;
performing a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and performing a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame;
determining a tracking sequence according to the first and second reduced frames.
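
For illustration only, the claimed flow can be sketched in Python, the language the description above singles out for neural-network work. In this sketch, face_net and pedestrian_net are hypothetical detector callables rather than interfaces defined by this application, shrink stands for the reduction operation detailed in claims 5-8, and boxes are (x1, y1, x2, y2) tuples in image coordinates.

from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y grows downward

def detect_and_reduce(
    image,
    face_net: Callable,        # hypothetical: image -> (face_frames, first_pedestrian_frames)
    pedestrian_net: Callable,  # hypothetical: image -> second_pedestrian_frames
    shrink: Callable[[Box], Box],
) -> Tuple[List[Box], List[Box], List[Box]]:
    # Step 1: feed the same video image to both detection networks.
    face_frames, first_frames = face_net(image)
    second_frames = pedestrian_net(image)
    # Step 2: shrink both kinds of pedestrian frame toward their preset directions.
    first_reduced = [shrink(b) for b in first_frames]
    second_reduced = [shrink(b) for b in second_frames]
    # Step 3 happens across frames: the reduced frames of adjacent images are
    # matched to extend the tracking sequence (see claim 9).
    return face_frames, first_reduced, second_reduced
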
2. The target tracking method according to claim 1, wherein the outputting the face frame detected by the face detection network and the first pedestrian frame corresponding to the face frame comprises: outputting, for each frame of image in the video images, the face frame detected by the face detection network, a confidence of the face frame, and the first pedestrian frame bound to the face frame; and wherein the further outputting the second pedestrian frame detected by the pedestrian detection network comprises: outputting, for each frame of image in the video images, the second pedestrian frame detected by the pedestrian detection network.
3. The target tracking method of claim 2, further comprising:
determining whether the tracking sequence meets an output condition according to the confidences of the face frames in the tracking sequence;
and outputting the tracking sequence when the output condition is met.
4. The target tracking method of claim 3, wherein the output condition includes that the confidence of the face frame in each frame of image in the tracking sequence is greater than a lowest threshold, and the confidence of at least one face frame in the tracking sequence is greater than a highest threshold.
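
As a minimal sketch of the output condition of claims 3 and 4, assuming each face frame in the tracking sequence carries a confidence score; the two threshold values below are illustrative placeholders, not numbers disclosed in this application.

def meets_output_condition(confidences, lowest=0.3, highest=0.8):
    # Claim 4: every face-frame confidence must clear the lowest threshold,
    # and at least one must clear the highest threshold.
    return (all(c > lowest for c in confidences)
            and any(c > highest for c in confidences))

# e.g. meets_output_condition([0.5, 0.9, 0.4]) evaluates to True
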
5. The target tracking method according to claim 1, wherein the performing a reduction operation on the first pedestrian frame toward the first preset direction to obtain the first reduced frame corresponding to the first pedestrian frame comprises: proportionally reducing the size of the first pedestrian frame toward the face frame corresponding to the first pedestrian frame to obtain the first reduced frame; and wherein the performing a reduction operation on the second pedestrian frame toward the second preset direction to obtain the second reduced frame corresponding to the second pedestrian frame comprises: proportionally reducing the size of the second pedestrian frame toward the head direction of the second pedestrian frame to obtain the second reduced frame.
6. The method of claim 5, wherein the proportionally reducing the size of the first pedestrian frame toward the corresponding face frame comprises: reducing the size of the first pedestrian frame in the horizontal direction according to a first reduction ratio, and reducing the size of the first pedestrian frame in the vertical direction according to a second reduction ratio; and wherein the proportionally reducing the size of the second pedestrian frame toward the head direction of the second pedestrian frame to obtain the second reduced frame comprises: reducing the size of the second pedestrian frame in the horizontal direction according to a third reduction ratio, and reducing the size of the second pedestrian frame in the vertical direction according to a fourth reduction ratio.
7. The target tracking method of claim 6, wherein the reducing the size of the first pedestrian frame in the vertical direction according to the second reduction ratio comprises: keeping the top position of the first pedestrian frame unchanged and moving the bottom of the first pedestrian frame toward its top according to the second reduction ratio; and wherein the reducing the size of the second pedestrian frame in the vertical direction according to the fourth reduction ratio comprises: keeping the top position of the second pedestrian frame unchanged and moving the bottom of the second pedestrian frame toward its top according to the fourth reduction ratio.
8. The target tracking method according to claim 6, wherein when the height of the first reduced frame is smaller than the width of the first reduced frame, the height of the first reduced frame is changed to be equal to the width of the first reduced frame; and when the height of the second reduced frame is smaller than the width of the second reduced frame, the height of the second reduced frame is changed to be equal to the width of the second reduced frame.
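
Claims 5-8 describe a single geometric operation, sketched below under stated assumptions: boxes are (x1, y1, x2, y2) with y growing downward, so keeping the top fixed means keeping y1; the claims leave the horizontal anchor to the face-frame direction, and the sketch shrinks symmetrically about the vertical centre line as one simple choice; the ratio values are illustrative.

def reduce_frame(box, w_ratio=0.5, h_ratio=0.5):
    """Shrink a pedestrian frame toward its head/face end (claims 5-7)."""
    x1, y1, x2, y2 = box
    # Horizontal reduction about the centre line (first/third reduction ratio).
    cx = (x1 + x2) / 2.0
    half_w = (x2 - x1) * w_ratio / 2.0
    x1, x2 = cx - half_w, cx + half_w
    # Vertical reduction (second/fourth ratio): the top edge stays put and the
    # bottom edge moves up toward it.
    y2 = y1 + (y2 - y1) * h_ratio
    # Claim 8: if the reduced frame is wider than it is tall, stretch its
    # height back to equal its width.
    if (y2 - y1) < (x2 - x1):
        y2 = y1 + (x2 - x1)
    return (x1, y1, x2, y2)
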
9. The target tracking method of claim 1, wherein the determining a tracking sequence according to the first and second reduced frames comprises: performing intersection-over-union (IoU) matching on the first reduced frames in adjacent frame images of the video images, and adding the first reduced frames whose matching results are greater than a first matching threshold, together with their corresponding face frames, to the tracking sequence; and performing intersection-over-union matching on the second reduced frames in adjacent frame images of the video images, and adding the second reduced frames whose matching results are greater than a second matching threshold to the tracking sequence.
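
The intersection-over-union matching of claim 9 can be sketched as a greedy pairing of reduced frames across adjacent images; the matching threshold below is a placeholder.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_adjacent(prev_reduced, curr_reduced, threshold=0.5):
    """Pair each current reduced frame with the best previous reduced frame
    whose IoU exceeds the matching threshold (claim 9)."""
    pairs = []
    for j, cur in enumerate(curr_reduced):
        scores = [iou(prev, cur) for prev in prev_reduced]
        if scores and max(scores) > threshold:
            pairs.append((scores.index(max(scores)), j))
    return pairs
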
10. The target tracking method of claim 9, wherein the determining a tracking sequence according to the first and second reduced frames further comprises:
if a second reduced frame whose matching result is greater than the second matching threshold has no corresponding face frame, adding a preset frame as its corresponding face frame.
11. The target tracking method of claim 10, wherein the confidence of the preset frame is set to a negative number.
12. The target tracking method of claim 11, wherein after the determining a tracking sequence according to the first and second reduced frames, the method further comprises:
determining, according to the confidences of the face frames in the tracking sequence, whether the tracking sequence meets the output condition, wherein face frames with a negative confidence do not participate in judging whether the output condition is met;
and outputting the tracking sequence when the output condition is met.
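
One reading of claims 10-12, sketched below: a matched pedestrian-only track entry receives a preset placeholder face frame whose negative confidence keeps it out of the output-condition test. The sentinel value and dictionary layout are assumptions made purely for illustration.

PLACEHOLDER_CONF = -1.0  # illustrative sentinel; the claims only require a negative number

def add_placeholder_face(entry):
    # Claim 10: a second reduced frame without a face frame gets a preset
    # frame as its face frame; claim 11 marks it with negative confidence.
    if entry.get("face") is None:
        entry["face"] = {"box": None, "conf": PLACEHOLDER_CONF}
    return entry

def output_condition(track, lowest=0.3, highest=0.8):
    # Claim 12: face frames with negative confidence take no part in
    # judging the output condition.
    confs = [e["face"]["conf"] for e in track if e["face"]["conf"] >= 0]
    return (bool(confs)
            and all(c > lowest for c in confs)
            and any(c > highest for c in confs))
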
13. The target tracking method according to claim 1, wherein after the video images are respectively input into the face detection network and the pedestrian detection network, the face frame detected by the face detection network and the first pedestrian frame corresponding to the face frame are output, and the second pedestrian frame detected by the pedestrian detection network is also output, the method further comprises:
matching, in the same frame of image, the first pedestrian frame detected by the face detection network against the second pedestrian frame detected by the pedestrian detection network;
and deleting the second pedestrian frame when the matching result is greater than a preset threshold value.
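
Claim 13 removes redundancy between the two detectors within a single image; a minimal sketch, with the IoU computation inlined for self-containment and an assumed preset threshold.

def drop_duplicate_pedestrians(first_frames, second_frames, threshold=0.7):
    """Claim 13: delete any pedestrian-network frame that matches a
    face-network-bound pedestrian frame in the same image above a preset
    threshold."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    return [s for s in second_frames
            if all(iou(f, s) <= threshold for f in first_frames)]
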
14. A target tracking device, comprising:
an acquisition module, used for inputting video images into a face detection network and a pedestrian detection network, respectively, outputting a face frame detected by the face detection network and a first pedestrian frame corresponding to the face frame, and further outputting a second pedestrian frame detected by the pedestrian detection network;
a processing module, used for performing a reduction operation on the first pedestrian frame toward a first preset direction to obtain a first reduced frame corresponding to the first pedestrian frame, and performing a reduction operation on the second pedestrian frame toward a second preset direction to obtain a second reduced frame corresponding to the second pedestrian frame;
a tracking module, used for determining a tracking sequence according to the first and second reduced frames.
15. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the target tracking method according to any one of claims 1-13.
16. A computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the target tracking method according to any one of claims 1-13.
CN201911357264.1A 2019-12-25 2019-12-25 Target tracking method and device Active CN111145215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911357264.1A CN111145215B (en) 2019-12-25 2019-12-25 Target tracking method and device

Publications (2)

Publication Number Publication Date
CN111145215A true CN111145215A (en) 2020-05-12
CN111145215B CN111145215B (en) 2023-09-05

Family

ID=70520001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911357264.1A Active CN111145215B (en) 2019-12-25 2019-12-25 Target tracking method and device

Country Status (1)

Country Link
CN (1) CN111145215B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722698A (en) * 2012-05-17 2012-10-10 上海中原电子技术工程有限公司 Method and system for detecting and tracking multi-pose face
CN105512618A (en) * 2015-11-27 2016-04-20 北京航空航天大学 Video tracking method
CN105913453A (en) * 2016-04-01 2016-08-31 海信集团有限公司 Target tracking method and target tracking device
CN107066990A (en) * 2017-05-04 2017-08-18 厦门美图之家科技有限公司 A kind of method for tracking target and mobile device
CN107491742A (en) * 2017-07-28 2017-12-19 西安因诺航空科技有限公司 Stable unmanned plane target tracking when a kind of long
CN108537775A (en) * 2018-03-02 2018-09-14 浙江工业大学 A kind of cancer cell tracking based on deep learning detection
CN109492576A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Image-recognizing method, device and electronic equipment
CN109522790A (en) * 2018-10-08 2019-03-26 百度在线网络技术(北京)有限公司 Human body attribute recognition approach, device, storage medium and electronic equipment
CN109685870A (en) * 2018-11-21 2019-04-26 北京慧流科技有限公司 Information labeling method and device, tagging equipment and storage medium
CN109711332A (en) * 2018-12-26 2019-05-03 浙江捷尚视觉科技股份有限公司 A kind of face tracking method and application based on regression algorithm
CN109740516A (en) * 2018-12-29 2019-05-10 深圳市商汤科技有限公司 A kind of user identification method, device, electronic equipment and storage medium
CN110033473A (en) * 2019-04-15 2019-07-19 西安电子科技大学 Motion target tracking method based on template matching and depth sorting network
CN110070010A (en) * 2019-04-10 2019-07-30 武汉大学 A kind of face character correlating method identified again based on pedestrian
CN110390344A (en) * 2018-04-19 2019-10-29 华为技术有限公司 Alternative frame update method and device
CN110427905A (en) * 2019-08-08 2019-11-08 北京百度网讯科技有限公司 Pedestrian tracting method, device and terminal

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681260A (en) * 2020-06-15 2020-09-18 深延科技(北京)有限公司 Multi-target tracking method and tracking system for aerial images of unmanned aerial vehicle
CN111815674A (en) * 2020-06-23 2020-10-23 浙江大华技术股份有限公司 Target tracking method and device and computer readable storage device
CN111815674B (en) * 2020-06-23 2023-02-28 浙江大华技术股份有限公司 Target tracking method and device and computer readable storage device
CN111814612A (en) * 2020-06-24 2020-10-23 浙江大华技术股份有限公司 Target face detection method and related device thereof
CN112560700A (en) * 2020-12-17 2021-03-26 北京赢识科技有限公司 Information association method and device based on motion analysis and electronic equipment

Also Published As

Publication number Publication date
CN111145215B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN111145215B (en) Target tracking method and device
CN111723786B (en) Method and device for detecting wearing of safety helmet based on single model prediction
US8970696B2 (en) Hand and indicating-point positioning method and hand gesture determining method used in human-computer interaction system
US8718327B2 (en) Gesture recognition using depth images
WO2019114036A1 (en) Face detection method and device, computer device, and computer readable storage medium
CN107302658B (en) Realize face clearly focusing method, device and computer equipment
CN105894464B (en) A kind of medium filtering image processing method and device
CN111754541A (en) Target tracking method, device, equipment and readable storage medium
CN110197146A (en) Facial image analysis method, electronic device and storage medium based on deep learning
CN109409241A (en) Video checking method, device, equipment and readable storage medium storing program for executing
CN103105924A (en) Man-machine interaction method and device
CN112001886A (en) Temperature detection method, device, terminal and readable storage medium
US9367731B2 (en) Depth gradient based tracking
WO2019095117A1 (en) Facial image detection method and terminal device
JP2019212148A (en) Information processing device and information processing program
WO2022091577A1 (en) Information processing device and information processing method
JP2020109644A (en) Fall detection method, fall detection apparatus, and electronic device
Ling et al. Research on gesture recognition based on YOLOv5
CN111986229A (en) Video target detection method, device and computer system
JP2002366958A (en) Method and device for recognizing image
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
CN109241942B (en) Image processing method and device, face recognition equipment and storage medium
CN115346169B (en) Method and system for detecting sleep post behaviors
WO2023109086A1 (en) Character recognition method, apparatus and device, and storage medium
CN114461078B (en) Man-machine interaction method based on artificial intelligence

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant