CN110879970A - Video interest area face abstraction method and device based on deep learning and storage device thereof - Google Patents
- Publication number
- CN110879970A (application CN201911002439.7A)
- Authority
- CN
- China
- Prior art keywords
- face
- video
- deep learning
- mtcnn
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/172—Recognition of human faces: classification, e.g. identification
- G06V10/25—Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
- G06V20/41—Video scenes: higher-level, semantic clustering, classification or understanding, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46—Video content: extracting features or characteristics, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a deep-learning-based method, device and storage device for summarizing faces in a video region of interest, comprising the following steps: collecting one million Asian face images with good saliency and using them to train the face detection and recognition models; improving the mtcnn_detector face detection algorithm; detecting the faces in the video sequence images with the improved mtcnn_detector algorithm; initializing a Kalman filter from the improved mtcnn_detector detections; recognizing the detected faces with the facenet face recognition algorithm; judging with a binary classification algorithm whether a face recognized by facenet is the target face; comparing the position predicted by the Kalman filter with the position of a face frame the binary classifier judged non-target; and synthesizing a video from the image frames that contain the target face recognized by facenet together with the frames in which the Kalman prediction coincides with a face frame the binary classifier judged non-target.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a deep-learning-based method and device for summarizing faces in a video region of interest, and a storage device therefor.
Background
Deep learning is an important research direction in artificial intelligence and is widely applied to image recognition and video analysis, where it has greatly improved recognition and analysis accuracy. The present invention applies a deep-learning-based face recognition algorithm to region-of-interest video summarization.
In recent years, research on face retrieval and summarization algorithms has produced many results. Patent document 1 (CN106682094A) proposes a face video retrieval method and system in which the search area of a key frame is determined from information in the non-compressed domain and a tracking search area is then obtained from motion and prediction information in the compressed domain, reducing the data volume and computation of video retrieval and improving its timeliness. Patent document 2 (CN204102129U) discloses a device for face retrieval in video comprising a preprocessing module, a face detection module, a face extraction module, a face recognition module and a face index association module, each connected to a system bus module; the system bus module is connected to a data interface module, which is connected to a display module. The device allows target videos containing specific face information to be browsed and played back quickly and accurately from massive video surveillance data, reducing operator workload, shortening operation time and improving efficiency. Patent document 3 (CN104731964A) proposes a face summarization method based on face recognition that generates face images of the different people appearing in an original video and forms a list of those images, with steps including scanning the image frames of the original video, determining whether a face region exists in a frame, face detection, face feature extraction, face feature clustering, and face summary image generation.
However, the face video retrieval methods and systems disclosed in patent documents 1, 2 and 3 do not involve deep-learning-based face detection and recognition algorithms, which perform well under angle and scene changes. The retrieval method of patent document 1 (CN106682094A) and the face-processing part of the device of patent document 2 (CN204102129U) both use traditional face detection algorithms, which adapt poorly to faces across multiple scenes, angles and scales; in the face summarization method of patent document 3 (CN104731964A), the recognition accuracy of the face recognition algorithm is likewise affected by the scene and the target's features. The present invention was made in view of these shortcomings. It aims to provide a deep-learning-based method for summarizing faces in a video region of interest that recognizes faces in video images quickly, frame by frame, across multiple scenes, angles and scales with deep-learning face detection and recognition algorithms, stores the recognized target frames, and forms a condensed video, thereby completing key-frame retrieval and video condensation.
Disclosure of Invention
In view of the above, the present invention provides a deep-learning-based method, device and storage device for summarizing faces in a video region of interest that perform well under angle and scene changes.
The deep-learning-based method for summarizing faces in a video region of interest provided by the invention comprises the following steps:
Step 1: collect one million Asian face images to train the face detection and recognition models;
Step 2: improve the mtcnn_detector face detection algorithm;
Step 3: select the image region of interest in the video sequence scene with the mouse;
Step 4: detect the faces appearing in the video sequence images of step 3 with the improved mtcnn_detector algorithm of step 2, and initialize a Kalman filter;
Step 5: recognize the faces detected in step 4 with the facenet face recognition model;
Step 6: judge with a binary classification algorithm whether a face recognized by facenet is the target face;
Step 7: synthesize a video from the image frames that contain the target face and the frames in which the Kalman filter prediction coincides with a face frame the binary classifier judged non-target. Specifically: after the improved mtcnn_detector algorithm detects a face in the next frame, the detected face is recognized with facenet. If the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than the threshold, the face is the target face and the frame can be synthesized into the video directly. If the ratio is less than the threshold, the face position is compared with the position predicted by the Kalman filter: the Kalman filter predicts the target face position in the next frame from the target face position detected by the mtcnn_detector algorithm, and if the predicted position coincides with the face position that the mtcnn_detector algorithm detects in the next frame, the frame can likewise be synthesized into the video.
Further, the face images used to train the models in step 1 cover multiple angles, multiple scales, multiple illumination changes and background changes, and have good saliency.
Further, the mtcnn_detector algorithm of step 2 is improved as follows: the upper and lower limits of the mtcnn_detector face detection frame are adjusted dynamically according to the practical application, with the lower limit set to 5% of the area of the image under detection and the upper limit to 90% of that area; the dynamic adjustment reduces false detections.
Further, the binary classification algorithm of step 6 judges whether a face recognized by facenet is the target face as follows: a threshold is set; if the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than the threshold, the face is the target face; if the ratio is less than the threshold, it is then judged whether the face position predicted by the Kalman filter coincides with the face position detected by the mtcnn_detector detection frame, and if they coincide, the face is the target face.
Further, the threshold in step 7 is set to 0.7.
A storage device stores instructions and data for implementing the above deep-learning-based video region-of-interest face summarization method. A deep-learning-based video region-of-interest face summarization device comprises a processor and the storage device; the processor loads and executes the instructions and data in the storage device to implement the method.
The beneficial effect of the technical scheme provided by the invention is that it removes the heavy workload of manual retrieval and overcomes the low retrieval precision of traditional methods, improving the efficiency and accuracy of video retrieval.
Drawings
FIG. 1 is a flow chart of the deep-learning-based video region-of-interest face summarization method of the present invention;
FIG. 2 shows the detailed operation procedure of the deep-learning-based video region-of-interest face summarization method of the present invention;
FIG. 3 is a schematic diagram of the operation of the hardware device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to FIG. 1 and FIG. 2, an embodiment of the present invention provides a deep-learning-based method, device and storage device for summarizing faces in a video region of interest, comprising the following steps:
Step 1: collect one million Asian face images and use them to train a face detection model that locates face positions in an image and a face recognition model that identifies face identity; the images cover multiple angles, multiple scales, multiple illumination changes and background changes, and have good saliency;
Step 2: improve the mtcnn_detector face detection algorithm. The upper and lower limits of the mtcnn_detector face detection frame are adjusted dynamically according to the practical application, with the lower limit set to 5% of the area of the image under detection and the upper limit to 90% of that area; the dynamic adjustment reduces false detections, as in the sketch below.
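The patent gives no code for this adjustment; the following is a minimal sketch of the area filter applied to the detector's output, assuming detections arrive as (x1, y1, x2, y2) pixel boxes, as MTCNN implementations commonly return. The function name filter_detections_by_area is illustrative.

```python
def filter_detections_by_area(boxes, image_shape, lower=0.05, upper=0.90):
    """Keep only face boxes whose area falls within the dynamic limits:
    between 5% and 90% of the area of the image under detection."""
    img_h, img_w = image_shape[:2]
    image_area = float(img_h * img_w)
    kept = []
    for (x1, y1, x2, y2) in boxes:
        box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if lower <= box_area / image_area <= upper:
            kept.append((x1, y1, x2, y2))
    return kept
```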
Step 3: select the image region of interest in the video sequence scene with the mouse (one possible implementation is sketched below);
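The source does not name the tool used for the mouse selection; one plausible realization, sketched here under that assumption, is OpenCV's built-in cv2.selectROI. The input file path is purely illustrative.

```python
import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # illustrative input path
ok, frame = cap.read()
if not ok:
    raise RuntimeError("could not read the first frame of the video")

# Drag a rectangle with the mouse and press Enter to confirm;
# returns the selected region of interest as (x, y, w, h).
x, y, w, h = cv2.selectROI("select region of interest", frame, showCrosshair=True)
cv2.destroyWindow("select region of interest")
roi = frame[y:y + h, x:x + w]  # detection is then restricted to this region
```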
Step 4: detect the faces appearing in the video sequence images of step 3 with the improved mtcnn_detector algorithm of step 2, and initialize a Kalman filter with the detected target face position (see the sketch below);
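The patent fixes neither the state model nor the library for the Kalman filter; a common choice, assumed in this sketch, is OpenCV's cv2.KalmanFilter with a constant-velocity model over the centre of the detected face box.

```python
import cv2
import numpy as np

def init_kalman(cx, cy):
    """Initialize a constant-velocity Kalman filter at the centre (cx, cy)
    of the target face box detected by the improved mtcnn_detector."""
    kf = cv2.KalmanFilter(4, 2)  # state [cx, cy, vx, vy], measurement [cx, cy]
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    kf.statePost = np.array([[cx], [cy], [0], [0]], dtype=np.float32)
    return kf

# Per frame, kf.predict() yields the expected centre for the next frame;
# after each confirmed target detection at (cx, cy), call
# kf.correct(np.array([[cx], [cy]], dtype=np.float32)) to update the state.
```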
Step 5: recognize the faces detected in step 4 with the facenet face recognition model (a sketch of the embedding computation follows);
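The patent names only "facenet"; the sketch below assumes the publicly available facenet-pytorch package, whose pretrained InceptionResnetV1 maps an aligned 160x160 face crop to a 512-dimensional embedding. The helper name embed is an assumption.

```python
import torch
from facenet_pytorch import InceptionResnetV1

model = InceptionResnetV1(pretrained="vggface2").eval()

def embed(face_crop):
    """face_crop: float tensor of shape (3, 160, 160) holding a standardized,
    aligned face; returns an L2-normalized 512-d facenet embedding."""
    with torch.no_grad():
        emb = model(face_crop.unsqueeze(0))[0]
    return emb / emb.norm()
```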
Step 6: judge with a binary classification algorithm whether a face recognized by facenet is the target face, as follows: set the threshold to 0.7; if the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than 0.7, the face is the target face; if it is less than 0.7, judge whether the face position in the frame coincides with the position the Kalman filter predicts for the mtcnn_detector detection frame, and if they coincide, the face is the target face (see the sketch below).
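A sketch of the thresholded library comparison. The patent speaks of a "ratio of feature values"; the cosine similarity of L2-normalized facenet embeddings used here is one plausible reading of that score, and the helper name is an assumption.

```python
import numpy as np

def is_target_face(embedding, face_library, threshold=0.7):
    """Binary classification of step 6: return True if the facenet
    embedding of the detected face matches the target face library.

    embedding: 1-D L2-normalized facenet feature vector.
    face_library: list of 1-D L2-normalized embeddings of the target person.
    """
    for reference in face_library:
        score = float(np.dot(embedding, reference))  # cosine similarity of unit vectors
        if score > threshold:
            return True
    return False
```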
Step 7: synthesize a video from the image frames that contain the target face and the frames in which the Kalman filter prediction coincides with a face frame the binary classifier judged non-target. Specifically: after the mtcnn_detector algorithm detects a face in the next frame, the detected face is recognized with facenet. If the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than 0.7, the face is the target face and the frame can be synthesized into the video directly. If the ratio is less than 0.7, the face position is compared with the position predicted by the Kalman filter: the Kalman filter predicts the target face position in the next frame from the target face position detected by the mtcnn_detector algorithm, and if the predicted position coincides with the face position that the mtcnn_detector algorithm detects in the next frame, the frame can likewise be synthesized into the video, as in the sketch below.
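The patent does not define how "coincidence" between the Kalman prediction and a detected face frame is measured; the sketch below assumes intersection-over-union above an illustrative 0.5 overlap, and suggests writing accepted frames out with cv2.VideoWriter. Both helper names are assumptions.

```python
import cv2

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / float(union) if union else 0.0

def keep_frame(similarity, detected_box, predicted_box,
               sim_threshold=0.7, overlap=0.5):
    """Step 7 decision: a frame joins the condensed video if facenet
    recognizes the target face (score above 0.7), or if the detected face
    coincides with the box the Kalman filter predicted for this frame."""
    if similarity > sim_threshold:
        return True
    return iou(detected_box, predicted_box) > overlap

# Accepted frames are appended to the summary, for example:
# writer = cv2.VideoWriter("summary.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 25, (w, h))
# if keep_frame(score, det_box, pred_box): writer.write(frame)
```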
Referring to FIG. 3, FIG. 3 is a schematic diagram of the hardware device of an embodiment of the present invention. The hardware device specifically includes: a deep-learning-based video region-of-interest face summarization device 401, a processor 402 and a storage device 403.
The deep-learning-based video region-of-interest face summarization device 401 implements the deep-learning-based video region-of-interest face summarization method.
The processor 402 loads and executes the instructions and data in the storage device 403 to implement the deep-learning-based video region-of-interest face summarization method.
The storage device 403 stores the instructions and data used to implement the deep-learning-based video region-of-interest face summarization method.
The embodiments described above and their features may be combined with each other in the absence of conflict. The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention are intended to fall within its scope.
Claims (7)
1. A deep-learning-based video region-of-interest face summarization method, characterized by comprising the following steps:
Step 1: collect one million Asian face images to train the face detection and recognition models;
Step 2: improve the mtcnn_detector face detection algorithm;
Step 3: select the image region of interest in the video sequence scene with the mouse;
Step 4: detect the faces appearing in the video sequence images of step 3 with the improved mtcnn_detector algorithm of step 2, and initialize a Kalman filter;
Step 5: recognize the faces detected in step 4 with the facenet face recognition model;
Step 6: judge with a binary classification algorithm whether a face recognized by facenet is the target face;
Step 7: synthesize a video from the image frames that contain the target face and the frames in which the Kalman filter prediction coincides with a face frame the binary classifier judged non-target, specifically: after the improved mtcnn_detector algorithm detects a face in the next frame, the detected face is recognized with facenet; if the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than the threshold, the face is the target face and the frame can be synthesized into the video directly; if the ratio is less than the threshold, the face position is compared with the position predicted by the Kalman filter: the Kalman filter predicts the target face position in the next frame from the target face position detected by the improved mtcnn_detector algorithm, and if the predicted position coincides with the face position that the mtcnn_detector algorithm detects in the next frame, the frame can likewise be synthesized into the video.
2. The deep-learning-based video region-of-interest face summarization method of claim 1, wherein the face images used to train the models in step 1 cover multiple angles, multiple scales, multiple illumination changes and background changes, and have good saliency.
3. The deep-learning-based video region-of-interest face summarization method of claim 1, wherein the mtcnn_detector algorithm of step 2 is improved as follows: the upper and lower limits of the mtcnn_detector face detection frame are adjusted dynamically according to the practical application, with the lower limit set to 5% of the area of the image under detection and the upper limit to 90% of that area; the dynamic adjustment reduces false detections.
4. The deep-learning-based video region-of-interest face summarization method of claim 1, wherein the binary classification algorithm of step 6 judges whether a face recognized by facenet is the target face as follows: a threshold is set; if the ratio of the feature value facenet computes for the image frame to the feature value in the face library is greater than the threshold, the face is the target face; if the ratio is less than the threshold, it is judged whether the face position predicted by the Kalman filter coincides with the face position detected by the mtcnn_detector detection frame, and if they coincide, the face is the target face.
5. The deep-learning-based video region-of-interest face summarization method of claim 1, wherein the threshold in step 7 is set to 0.7.
6. A storage device, characterized in that the storage device stores instructions and data for implementing the deep-learning-based video region-of-interest face summarization method as claimed in any one of claims 1 to 4.
7. A deep-learning-based video region-of-interest face summarization device, characterized by comprising: a processor and the storage device; the processor loads and executes the instructions and data in the storage device to implement the deep-learning-based video region-of-interest face summarization method as claimed in any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911002439.7A CN110879970A (en) | 2019-10-21 | 2019-10-21 | Video interest area face abstraction method and device based on deep learning and storage device thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110879970A true CN110879970A (en) | 2020-03-13 |
Family
ID=69728400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911002439.7A Pending CN110879970A (en) | 2019-10-21 | 2019-10-21 | Video interest area face abstraction method and device based on deep learning and storage device thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110879970A (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103092930A (en) * | 2012-12-30 | 2013-05-08 | 信帧电子技术(北京)有限公司 | Method of generation of video abstract and device of generation of video abstract |
CN103413330A (en) * | 2013-08-30 | 2013-11-27 | 中国科学院自动化研究所 | Method for reliably generating video abstraction in complex scene |
CN104244113A (en) * | 2014-10-08 | 2014-12-24 | 中国科学院自动化研究所 | Method for generating video abstract on basis of deep learning technology |
CN104331905A (en) * | 2014-10-31 | 2015-02-04 | 浙江大学 | Surveillance video abstraction extraction method based on moving object detection |
KR101496287B1 (en) * | 2014-11-11 | 2015-02-26 | (주) 강동미디어 | Video synopsis system and video synopsis method using the same |
CN104731964A (en) * | 2015-04-07 | 2015-06-24 | 上海海势信息科技有限公司 | Face abstracting method and video abstracting method based on face recognition and devices thereof |
CN106856577A (en) * | 2015-12-07 | 2017-06-16 | 北京航天长峰科技工业集团有限公司 | The video abstraction generating method of multiple target collision and occlusion issue can be solved |
CN109002744A (en) * | 2017-06-06 | 2018-12-14 | 中兴通讯股份有限公司 | Image-recognizing method, device and video monitoring equipment |
CN107943837A (en) * | 2017-10-27 | 2018-04-20 | 江苏理工学院 | A kind of video abstraction generating method of foreground target key frame |
CN108921038A (en) * | 2018-06-07 | 2018-11-30 | 河海大学 | A kind of classroom based on deep learning face recognition technology is quickly called the roll method of registering |
CN109325964A (en) * | 2018-08-17 | 2019-02-12 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of face tracking methods, device and terminal |
CN109934162A (en) * | 2019-03-12 | 2019-06-25 | 哈尔滨理工大学 | Facial image identification and video clip intercept method based on Struck track algorithm |
CN110321873A (en) * | 2019-07-12 | 2019-10-11 | 苏州惠邦医疗科技有限公司 | Sensitization picture recognition methods and system based on deep learning convolutional neural networks |
Non-Patent Citations (1)
Title |
---|
WANG Yapei et al., "Surveillance video summary extraction method combining objects and key frames" (对象和关键帧相结合的监控视频摘要提取方法), Industrial Control Computer (《工业控制计算机》), no. 03, 25 March 2015 (2015-03-25), pages 11-13 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576747B (en) * | 2023-11-15 | 2024-06-28 | 深圳市紫鹏科技有限公司 | Face data acquisition and analysis method, system and storage medium based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200313 |