CN114155571A - Method for mixed extraction of pedestrians and human faces in video - Google Patents

Method for mixed extraction of pedestrians and human faces in video Download PDF

Info

Publication number
CN114155571A
CN114155571A (application CN202111269098.7A)
Authority
CN
China
Prior art keywords
pedestrian
face
video
module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111269098.7A
Other languages
Chinese (zh)
Inventor
Xia Li
Zhou Xiang
Wang Xiang
Qian Kun
Wang Kang
Li Fengyue
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202111269098.7A priority Critical patent/CN114155571A/en
Publication of CN114155571A publication Critical patent/CN114155571A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for mixed extraction of pedestrians and human faces in a video. Based on deep learning and image processing algorithms, the method provides a pedestrian and face mixed extraction technique for surveillance video that runs on general-purpose CPU and GPU hardware in a conventional server. Using deep learning techniques, it realizes pedestrian and face snapshot for the real-time video stream output by a surveillance camera, extracting the best pedestrian image and the corresponding face image for each person passing through a specified monitoring area. The same technique also enables rapid extraction of pedestrians and faces from offline surveillance video, which facilitates the organization of historical surveillance data and greatly improves the efficiency with which security personnel monitor key persons.

Description

Method for mixed extraction of pedestrians and human faces in video
Technical Field
The invention relates to the technical field of video snapshot, in particular to a method for extracting pedestrians and human faces in a video in a mixed mode.
Background
In recent years, with the development of smart cities and the smart security field, video surveillance has become an indispensable means of data collection: security cameras deployed throughout cities safeguard public safety around the clock. Pedestrians and faces are the main concern of these cameras, and an efficient snapshot extraction method is needed to improve the working efficiency of monitoring personnel. At present the industry mainly relies on cameras with integrated face-snapshot and pedestrian-snapshot functions to extract pedestrians and faces passing through a monitored area, such as the snapshot cameras made by Hikvision and Dahua.
The existing products on the market that extract pedestrian and face information with snapshot cameras have the following four defects:
1. High cost of use
Compared with an ordinary surveillance camera, a camera with a snapshot function is expensive. Moreover, since many ordinary surveillance cameras are already deployed in cities, replacing them all would waste the existing capital investment and require an enormous amount of work.
2. Complicated algorithm upgrades and maintenance
Because the snapshot extraction algorithm is integrated into the snapshot camera, and snapshot algorithms grow more accurate and feature-rich as artificial intelligence evolves, updating the algorithm on current snapshot cameras requires upgrading the camera firmware one device at a time; some cameras even require an on-site connection. Optimization and upgrading are therefore complex and labor-intensive.
3. Limited coverage
Because snapshot-capable cameras were released relatively late and are more expensive, most surveillance cameras in cities today are ordinary cameras without a snapshot function. The area coverage in which pedestrian and face information can be intelligently extracted, and key persons quickly located, is therefore small.
4. Many varieties with differing functions and effects
At present every major surveillance-camera manufacturer offers cameras supporting pedestrian or face snapshot, but their effectiveness and feature sets differ. Some cameras support only face snapshot and not pedestrian snapshot; others support multiple functions but perform poorly, so missed snapshots or poor-quality snapshot pictures easily occur. Snapshot cameras are also constrained by the deployment and upgrade problems above, and once online they are hard to replace in the short term.
The present invention therefore improves on the above and provides a method for mixed extraction of pedestrians and human faces in a video.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses a method for mixed extraction of pedestrians and human faces in a video, which comprises the following steps:
S1, decoding a real-time video stream transmitted by the camera over a 5G network, or a surveillance video file stored offline, frame by frame with the ffmpeg video processing library to obtain serialized picture data, and then sending the picture data to the face detection module and the pedestrian detection module respectively;
S2, performing target detection on the video frames decoded in S1 to obtain face and pedestrian information, using a pedestrian detection module trained on the CenterNet deep learning network to detect the position and size of pedestrians in a video frame;
S3, locating and tracking the faces and pedestrians from S2, using Faceboxes as the detector of face position and size in a video frame to locate faces of different sizes, postures and scenes, and using the SORT algorithm to associate the targets currently detected in S2 with existing targets and manage the life cycle of each tracked target;
S4, evaluating the quality of the pedestrian and face pictures detected in S3, using a deep-learning-based quality classification algorithm and traditional image processing methods to obtain each quality evaluation result;
S5, performing associated comparison and identification of pedestrians and faces;
and S6, performing mixed extraction and screening, and extracting and outputting the final result according to the pedestrian, face and association evaluations.
As a preferred technical solution of the present invention, the target detection in S1 relies on CPU and GPU hardware and a video decoding module; the video decoding module is connected to the camera through a 5G network signal and performs frame-by-frame video decoding with the ffmpeg video processing library.
As a preferred technical solution of the present invention, a face detection module and a pedestrian detection module are provided in S2. The face detection module uses Faceboxes to detect the position and size of faces in a video frame; the face detection algorithm comprises the following three steps:
Step one, through the RDCL (Rapidly Digested Convolutional Layers), rapidly down-sample the 1024 × 1024 high-resolution input using large 7 × 7 and 5 × 5 convolution kernels together with CReLU, covering the necessary image information while greatly reducing the number of parameters;
Step two, introduce the MSCL (Multiple Scale Convolutional Layers) and an FPN to fuse information from convolution layers at different scales, further enlarging the model's receptive field and improving the recall of faces at different scales;
Step three, introduce the anchor densification strategy: densely sampled anchors raise the probability that small faces are successfully matched.
As a preferred technical solution of the present invention, the pedestrian detection module in S2 adopts the CenterNet target detection method: a location is kept as a hot point only if it is greater than or equal to its 8 surrounding neighbours, implemented with 3 × 3 max pooling.
As a preferred technical solution of the present invention, S2 includes a face tracking module and a pedestrian tracking module. Both perform multi-target tracking with the SORT algorithm, using the IoU between target detection boxes as the metric relating targets across consecutive frames to predict the current position and associate each detection box with a target.
As a preferred technical solution of the present invention, S4 includes a face quality evaluation module and a pedestrian quality evaluation module, both of which use a deep-learning-based quality classification algorithm and traditional image processing methods. The evaluation covers pedestrian dimensions and face dimensions: the pedestrian dimensions comprise pedestrian clarity, pedestrian overlap, pedestrian width-to-height ratio, whether a face is present, whether the pedestrian is complete, and pedestrian orientation; the face dimensions comprise face clarity, face angle, and face occlusion.
As a preferred technical solution of the present invention, S5 includes a face/pedestrian association module, which selects the face id with the largest number of associated frames as the pedestrian's optimal associated face, avoiding interference from transient association results and reducing association errors.
As a preferred technical solution of the present invention, a face/pedestrian result output module is provided in S6. It handles output after a pedestrian leaves as well as output after a pedestrian stays for a long time, and the best pedestrian and face pictures and attribute information under the same pedestrian id are comprehensively evaluated according to the output results given by the pedestrian, face and association evaluations.
The invention has the beneficial effects that:
1. The method achieves faster algorithm processing with lower resource consumption: on an NVIDIA 2080 Ti card, more than 8 real-time surveillance video streams can be processed in real time, and offline video can be processed at about 200 fps. The extraction effect is good, outputting the best pedestrian and face information both after a pedestrian leaves the monitored area and after a long stay. The upgrade cost is also low: the originally deployed ordinary surveillance cameras need not be replaced, since a FiberHome artificial-intelligence computing server can add the pedestrian and face extraction functions to them, greatly saving labor and capital costs;
2. The method combines pedestrians and faces while detecting each independently and evaluating the best picture in real time, so the best pedestrian and the best face passing through a monitored area are output in a single pass. The extraction algorithm is easy to upgrade and can be optimized in a targeted, customized way to obtain a more suitable extraction effect;
3. The association of pedestrians and faces exploits continuous-frame tracking information in the video: selecting the face most often associated with a pedestrian avoids the association errors that transient interactions would otherwise cause in the output. Pedestrian information extracted for the security field is therefore more complete, and associating pedestrian and face information facilitates tracing pedestrian attributes and facial features, whereas most snapshot cameras offer only a single face-snapshot or pedestrian-snapshot function and cannot resolve the association well;
4. The pedestrian quality evaluation module integrates dimensions such as clarity, overlap, width-to-height ratio and pedestrian completeness to filter out pedestrians whose features cannot be extracted well, and adds priority strategies for face visibility and pedestrian orientation, so the output pedestrian pictures are more comprehensive and easier to use for subsequent query and tracing. The same extraction can also be applied to historical surveillance video, facilitating the organization and tracing of historical data;
5. The face quality evaluation module integrates dimensions such as clarity, yaw and pitch attitude angles, and face occlusion to filter out pictures of poor face quality and output face pictures richer in facial information, facilitating subsequent query and tracing of pedestrians and faces. However, truncation of pedestrians or faces at the video edge is unavoidable, and edge-truncated pictures are difficult for a quality classification model to classify correctly;
6. A video-edge judgment strategy is therefore added: when qualified non-edge pictures of a pedestrian (face) exist among its consecutive frames, processing of that pedestrian's (face's) edge frames is abandoned, so the finally output pedestrians and faces are more complete; a minimal edge check is sketched below.
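The edge judgment from point 6 can be sketched as follows (Python; the 2-pixel margin is an assumed tolerance, not a value from the description):

```python
def touches_edge(box, frame_w, frame_h, margin=2):
    """Return True if a [x1, y1, x2, y2] box lies on the video frame
    border, i.e. the pedestrian/face is probably truncated. When a track
    has any qualified non-edge picture, its edge frames are skipped."""
    x1, y1, x2, y2 = box
    return (x1 <= margin or y1 <= margin
            or x2 >= frame_w - margin or y2 >= frame_h - margin)
```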
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the steps of the method for mixed extraction of pedestrians and faces in a video according to the invention;
FIG. 2 is a flow chart of the pedestrian and face hybrid extraction method of the invention;
FIG. 3 is a schematic diagram of the steps of the face detection module algorithm of the method.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are described here for the purpose of illustration and explanation only and do not limit the invention.
Embodiment: as shown in FIGS. 1-3, the method for mixed extraction of pedestrians and human faces in a video of the present invention comprises the following steps:
S1, decoding a real-time video stream transmitted by the camera over a 5G network, or a surveillance video file stored offline, frame by frame with the ffmpeg video processing library to obtain serialized picture data, and then sending the picture data to the face detection module and the pedestrian detection module respectively;
S2, performing target detection on the video frames decoded in S1 to obtain face and pedestrian information, using a pedestrian detection module trained on the CenterNet deep learning network to detect the position and size of pedestrians in a video frame;
S3, locating and tracking the faces and pedestrians from S2, using Faceboxes as the detector of face position and size in a video frame to locate faces of different sizes, postures and scenes, and using the SORT algorithm to associate the targets currently detected in S2 with existing targets and manage the life cycle of each tracked target;
S4, evaluating the quality of the pedestrian and face pictures detected in S3, using a deep-learning-based quality classification algorithm and traditional image processing methods to obtain each quality evaluation result;
S5, performing associated comparison and identification of pedestrians and faces;
and S6, performing mixed extraction and screening, and extracting and outputting the final result according to the pedestrian, face and association evaluations.
The target detection in S1 relies on CPU and GPU hardware and a video decoding module; the video decoding module is connected to the camera through a 5G network signal and decodes video frame by frame with the ffmpeg video processing library, as sketched below.
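The following is a minimal sketch of the frame-by-frame decoding step (Python, calling the ffmpeg command-line tool; the fixed 1920 × 1080 resolution is an assumption — a real implementation would probe the stream first, e.g. with ffprobe, or use the ffmpeg libraries directly):

```python
import subprocess
import numpy as np

def decode_frames(source, width=1920, height=1080):
    """Yield decoded BGR frames one by one from a video file or stream URL.

    `source` may be an offline surveillance video file or a live stream
    address; ffmpeg writes raw frames to stdout, which are reshaped into
    serialized picture data for the detection modules."""
    cmd = ["ffmpeg", "-i", source,
           "-f", "rawvideo", "-pix_fmt", "bgr24",
           "-loglevel", "quiet", "pipe:1"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    frame_bytes = width * height * 3
    while True:
        buf = proc.stdout.read(frame_bytes)
        if len(buf) < frame_bytes:
            break  # end of stream
        yield np.frombuffer(buf, np.uint8).reshape(height, width, 3)
    proc.stdout.close()
    proc.wait()
```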
A face detection module and a pedestrian detection module are provided in S2. The face detection module uses a Faceboxes detector to find the position and size of faces in a video frame; the face detection algorithm comprises the following three steps (a sketch of the anchor strategy follows the list):
Step one, through the RDCL (Rapidly Digested Convolutional Layers), rapidly down-sample the 1024 × 1024 high-resolution input using large 7 × 7 and 5 × 5 convolution kernels together with CReLU, covering the necessary image information while greatly reducing the number of parameters;
Step two, introduce the MSCL (Multiple Scale Convolutional Layers) and an FPN to fuse information from convolution layers at different scales, further enlarging the model's receptive field and improving the recall of faces at different scales;
Step three, introduce the anchor densification strategy: densely sampled anchors raise the probability that small faces are successfully matched.
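The following is a minimal sketch of anchor densification (Python/NumPy; the concrete sizes and densification factor follow the Faceboxes paper's typical setting and are assumptions here, not values from the description):

```python
import numpy as np

def densify_anchors(feat_size, stride, anchor_size, n):
    """Tile square anchors over a feat_size x feat_size map, replacing the
    single centre anchor of each cell with an n x n grid of shifted copies
    so that small anchors reach the same tiling density as large ones.
    Returns (num_anchors, 4) boxes as [cx, cy, w, h] in input pixels."""
    offsets = (np.arange(n) + 0.5) / n * stride  # densified centres per cell
    anchors = []
    for y in range(feat_size):
        for x in range(feat_size):
            for oy in offsets:
                for ox in offsets:
                    anchors.append([x * stride + ox, y * stride + oy,
                                    anchor_size, anchor_size])
    return np.asarray(anchors, np.float32)

# e.g. 32x32 anchors on a stride-32 map, densified 4x
dense = densify_anchors(feat_size=32, stride=32, anchor_size=32, n=4)
```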
The pedestrian detection module in S2 adopts the CenterNet target detection method: a location is kept as a hot point only if it is greater than or equal to its 8 surrounding neighbours, implemented with 3 × 3 max pooling.
S2 includes a face tracking module and a pedestrian tracking module. Both perform multi-target tracking with the SORT algorithm, using the IoU between target detection boxes as the metric relating targets across consecutive frames to predict the current position and associate each detection box with a target.
S4 includes a face quality evaluation module and a pedestrian quality evaluation module, both of which use a deep-learning-based quality classification algorithm and traditional image processing methods. The evaluation covers pedestrian dimensions and face dimensions: the pedestrian dimensions comprise pedestrian clarity, pedestrian overlap, pedestrian width-to-height ratio, whether a face is present, whether the pedestrian is complete, and pedestrian orientation; the face dimensions comprise face clarity, face angle, and face occlusion.
S5 includes a face/pedestrian association module, which selects the face id with the largest number of associated frames as the pedestrian's optimal associated face, avoiding interference from transient association results and reducing association errors.
A face/pedestrian result output module is provided in S6. It handles output after a pedestrian leaves as well as output after a pedestrian stays for a long time, and the best pedestrian and face pictures and attribute information under the same pedestrian id are comprehensively evaluated according to the output results given by the pedestrian, face and association evaluations.
The working principle is as follows: the video decoding module supports decoding both the real-time video stream transmitted by a camera over the network and surveillance video files stored offline. The ffmpeg video processing library decodes the video frame by frame to obtain serialized picture data, which is then sent to the face detection module and the pedestrian detection module respectively. A pedestrian detection module trained on the CenterNet deep learning network detects the position and size of pedestrians in each video frame; CenterNet is an anchor-free target detection method that predicts each target directly as a single point and completely dispenses with NMS post-processing.
The image is down-sampled and prediction is performed on the down-sampled feature map: a center-point heatmap is predicted for each class, and the hot points of each class are extracted from the output map. The extraction checks whether each current hot point is greater than or equal to its 8 surrounding neighbours, implemented with 3 × 3 max pooling; the top 100 points are then kept, and the final results are selected from these 100 according to the probability that an object exists at the current center point. Faceboxes is used as the detector of face position and size in the video frame; Faceboxes is a lightweight single-stage face detection algorithm, built from plain convolutions throughout with a network structure similar to SSD, and it is friendly to CPU deployment and hardware optimization.
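The hot-point extraction just described can be sketched as follows (Python/PyTorch; the (B, C, H, W) tensor layout is an assumption):

```python
import torch
import torch.nn.functional as F

def decode_heatmap(heat, k=100):
    """Extract per-class peak points from a centre-point heatmap.

    A location counts as a hot point only if it is >= its 8 neighbours,
    which the 3x3 max-pool comparison implements; the top k peaks are kept
    and the final detections are chosen by their object probability.
    This peak test replaces NMS entirely."""
    pooled = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    heat = heat * (pooled == heat).float()   # suppress non-peak locations
    b, c, h, w = heat.shape
    scores, idx = torch.topk(heat.view(b, c, -1), k)
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    return scores, xs, ys                    # peak scores and coordinates
```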
Multi-target tracking is realized with the SORT algorithm. SORT takes detection as its key component: it propagates each target's state into future frames, associates current detections with existing targets, and manages the life cycle of each tracked target. Practical verification shows that, as long as target detection performs well, pedestrian/face targets in a surveillance scene can be tracked and associated very quickly.
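The following is a minimal sketch of the IoU-based association step (Python; boxes are plain lists, and SORT's Kalman-filter state prediction is omitted):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_thresh=0.3):
    """Match predicted track boxes to current detections by maximising
    total IoU (Hungarian assignment). Unmatched detections start new
    tracks; tracks unmatched for too long are terminated by the caller."""
    if not track_boxes or not det_boxes:
        return []
    cost = np.array([[-iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_thresh]
```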
Quality evaluation is then performed separately on the detected pedestrian and face pictures; each quality evaluation result is obtained with a deep-learning-based quality classification algorithm and traditional image processing methods. The specific pedestrian evaluation dimensions are:
Pedestrian clarity: the clearer the pedestrian picture, the higher the clarity score;
Pedestrian overlap: the lower the overlap with other pedestrians, the higher the score;
Pedestrian width-to-height ratio: the closer the ratio is to 1:4, the higher the score;
Face presence: pedestrian pictures in which the detected pedestrian's face is visible are preferred;
Pedestrian completeness: pedestrian pictures that are neither occluded nor truncated are preferred;
Pedestrian orientation: pictures are selected with front preferred over back, and back preferred over side.
The face quality evaluation dimensions are:
Face clarity: the clearer the face picture, the higher the clarity score;
Face angle: the more frontal the face, the higher the score;
Face occlusion: the less the face is occluded, the higher the score.
For both pedestrians and faces, the per-dimension qualities are combined into an overall quality score and output priority, and the best picture is taken as the picture to output for the current pedestrian/face id; a sketch of such a combination follows.
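The following is a minimal sketch of combining the dimensions into one score and keeping the best crop per track (Python; the weights and orientation ranking values are illustrative assumptions — only the dimensions themselves come from the description):

```python
def pedestrian_quality(clarity, overlap, box_w, box_h,
                       has_face, complete, orientation):
    """Combine the pedestrian dimensions into one quality score.

    `clarity` and `overlap` are assumed normalised to [0, 1]; orientation
    is one of "front", "back", "side" (front > back > side)."""
    aspect_score = max(0.0, 1.0 - abs(box_w / box_h - 0.25) * 4)  # near 1:4
    orient_score = {"front": 1.0, "back": 0.7, "side": 0.4}[orientation]
    return (0.3 * clarity + 0.2 * (1.0 - overlap) + 0.2 * aspect_score
            + 0.2 * orient_score + 0.1 * float(complete)
            + (0.1 if has_face else 0.0))     # face visibility as a bonus

best = {}  # pedestrian/face track id -> (score, crop)

def update_best(track_id, score, crop):
    """Keep only the highest-scoring picture under the current id."""
    if score > best.get(track_id, (-1.0, None))[0]:
        best[track_id] = (score, crop)
```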
The extracted faces must be associated with pedestrians so that the pedestrian information in the surveillance video can be analyzed better. Within a single video frame, a face is judged to belong to a pedestrian when it lies within the top 15% of the pedestrian box. Since several different faces will inevitably appear inside a pedestrian's judgment region in a surveillance scene, the number of frames in which each face is associated is stored in the face-information structure attached to that single pedestrian, and finally the face id with the largest number of associated frames is selected as the pedestrian's optimal associated face, avoiding interference from transient association results and reducing association errors; a sketch follows.
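The following is a minimal sketch of this voting scheme (Python; the data structures are assumptions):

```python
from collections import Counter, defaultdict

face_votes = defaultdict(Counter)  # pedestrian id -> face id -> frame count

def face_in_head_region(face_box, ped_box, head_frac=0.15):
    """True if the face centre lies in the top 15% of the pedestrian box."""
    fx = (face_box[0] + face_box[2]) / 2
    fy = (face_box[1] + face_box[3]) / 2
    px1, py1, px2, py2 = ped_box
    return px1 <= fx <= px2 and py1 <= fy <= py1 + head_frac * (py2 - py1)

def vote(ped_id, ped_box, faces):
    """Per frame: count every face falling in this pedestrian's region.

    `faces` is a list of (face_id, face_box) detected in the current frame."""
    for fid, fbox in faces:
        if face_in_head_region(fbox, ped_box):
            face_votes[ped_id][fid] += 1

def best_face(ped_id):
    """Face id associated in the most frames, or None."""
    votes = face_votes[ped_id]
    return votes.most_common(1)[0][0] if votes else None
```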
The unified mixed-extraction and screening mechanism manages both output after the pedestrian leaves and output after the pedestrian stays for a long time, comprehensively evaluating the best pedestrian and face pictures and attribute information under the same pedestrian id according to the output results given by the pedestrian, face and association evaluations. A sketch of the two output triggers follows.
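The following is a minimal sketch of the two output triggers (Python; the track fields and frame thresholds are assumptions):

```python
LEAVE_FRAMES = 30        # unseen this long -> the pedestrian has left
LONG_STAY_FRAMES = 1500  # ~60 s at 25 fps -> long-stay output

def flush_tracks(tracks, frame_idx, emit):
    """Emit the best stored pedestrian/face pictures for each track either
    after the pedestrian leaves or after a long stay in the scene.

    `tracks` maps track id -> object with last_seen, first_seen,
    best_ped_crop and best_face_crop fields."""
    for tid, t in list(tracks.items()):
        left = frame_idx - t.last_seen > LEAVE_FRAMES
        stayed = frame_idx - t.first_seen > LONG_STAY_FRAMES
        if left or stayed:
            emit(tid, t.best_ped_crop, t.best_face_crop)
            if left:
                del tracks[tid]           # life cycle ends
            else:
                t.first_seen = frame_idx  # re-arm the long-stay timer
```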
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for mixed extraction of pedestrians and human faces in a video, characterized by comprising the following steps:
S1, decoding a real-time video stream transmitted by the camera over a 5G network, or a surveillance video file stored offline, frame by frame with the ffmpeg video processing library to obtain serialized picture data, and then sending the picture data to the face detection module and the pedestrian detection module respectively;
S2, performing target detection on the video frames decoded in S1 to obtain face and pedestrian information, using a pedestrian detection module trained on the CenterNet deep learning network to detect the position and size of pedestrians in a video frame;
S3, locating and tracking the faces and pedestrians from S2, using Faceboxes as the detector of face position and size in a video frame to locate faces of different sizes, postures and scenes, and using the SORT algorithm to associate the targets currently detected in S2 with existing targets and manage the life cycle of each tracked target;
S4, evaluating the quality of the pedestrian and face pictures detected in S3, using a deep-learning-based quality classification algorithm and traditional image processing methods to obtain each quality evaluation result;
S5, performing associated comparison and identification of pedestrians and faces;
and S6, performing mixed extraction and screening, and extracting and outputting the final result according to the pedestrian, face and association evaluations.
2. The method as claimed in claim 1, wherein the target detection in S1 relies on CPU and GPU hardware and a video decoding module, the video decoding module is connected to the camera through a 5G network signal, and the video decoding module performs frame-by-frame video decoding using the ffmpeg video processing library.
3. The method according to claim 1, wherein a face detection module and a pedestrian detection module are provided in S2, the face detection module employing Faceboxes to detect the position and size of faces in a video frame, and the face detection algorithm comprising the following three steps:
step one, through the RDCL, rapidly down-sampling the 1024 × 1024 high-resolution input using large 7 × 7 and 5 × 5 convolution kernels together with CReLU, covering the necessary image information while greatly reducing the number of parameters;
step two, introducing the MSCL and an FPN to fuse information from convolution layers at different scales, further enlarging the model's receptive field and improving the recall of faces at different scales;
and step three, introducing the anchor densification strategy, raising the probability that small faces are successfully matched through densely sampled anchors.
4. The method of claim 1, wherein the pedestrian detection module in S2 uses the CenterNet target detection method, keeping a hot point only if it is greater than or equal to its 8 surrounding neighbours, implemented with 3 × 3 max pooling.
5. The method according to claim 1, wherein S2 includes a face tracking module and a pedestrian tracking module, both of which perform multi-target tracking with the SORT algorithm, using the IoU between target detection boxes as the metric relating targets across consecutive frames to predict the current position and associate each detection box with a target.
6. The method according to claim 1, wherein S4 includes a face quality evaluation module and a pedestrian quality evaluation module, both of which use a deep-learning-based quality classification algorithm and traditional image processing methods; the evaluation covers pedestrian dimensions and face dimensions, the pedestrian dimensions consisting of pedestrian clarity, pedestrian overlap, pedestrian width-to-height ratio, whether a face is present, whether the pedestrian is complete, and pedestrian orientation, and the face dimensions consisting of face clarity, face angle, and face occlusion.
7. The method according to claim 1, wherein the S5 includes a face/pedestrian association module, and the face/pedestrian association module selects a face id with the largest number of associated frames as an optimal associated face for the pedestrian, so as to avoid interference of transient association results and reduce association errors.
8. The method according to claim 1, wherein a face/pedestrian result output module is provided in S6, the face/pedestrian result output module handling output after a pedestrian leaves and output after a pedestrian stays for a long time, the best pedestrian and face pictures and attribute information under the same pedestrian id being comprehensively evaluated according to the output results given by the pedestrian, face and association evaluations.
CN202111269098.7A 2021-10-29 2021-10-29 Method for mixed extraction of pedestrians and human faces in video Pending CN114155571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111269098.7A CN114155571A (en) 2021-10-29 2021-10-29 Method for mixed extraction of pedestrians and human faces in video


Publications (1)

Publication Number Publication Date
CN114155571A true CN114155571A (en) 2022-03-08

Family

ID=80458529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111269098.7A Pending CN114155571A (en) 2021-10-29 2021-10-29 Method for mixed extraction of pedestrians and human faces in video

Country Status (1)

Country Link
CN (1) CN114155571A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination