CN112395922A - Face action detection method, device and system - Google Patents

Face action detection method, device and system

Info

Publication number
CN112395922A
CN112395922A (application CN201910760634.XA)
Authority
CN
China
Prior art keywords
face
images
target
video
color
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910760634.XA
Other languages
Chinese (zh)
Inventor
李强
王晶晶
王春茂
严经纬
谢迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910760634.XA priority Critical patent/CN112395922A/en
Publication of CN112395922A publication Critical patent/CN112395922A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Abstract

The application discloses a facial action detection method, apparatus and system, belonging to the technical field of video surveillance. The method comprises: acquiring a plurality of color face images and corresponding depth face images associated with a current video clip; determining the facial action of a target through a target network model based on the color face images and the corresponding depth face images; and acquiring target video data when the facial action of the target is determined to belong to an abnormal action, wherein the target video data comprises the current video clip and/or the video images corresponding to the facial action belonging to the abnormal action. The target network model automatically determines the facial action in any video clip and whether that action is abnormal, so that the target video data containing the abnormal action can be located without manually reviewing a large number of video clips, which improves the efficiency of determining the video clips in which abnormal actions occur.

Description

Face action detection method, device and system
Technical Field
The present application relates to the field of video surveillance technologies, and in particular, to a method, an apparatus, and a system for detecting facial movements.
Background
In daily life, a target such as a user may exhibit abnormal behavior, for example illegal or criminal behavior. Such behavior often has a negative effect on social stability, and in order to understand the on-site situation when it occurs, the scene in which the abnormal behavior takes place usually needs to be located.
In the related art, the scene in which a certain target exhibits abnormal behavior is usually located through video surveillance. For example, a plurality of video segments containing the target are first screened out of the surveillance video by manual inspection or face recognition, and these segments are then reviewed manually to find the segment in which the target behaves abnormally, such as the segment recording the scene of a crime, thereby locating the scene of the abnormal behavior.
However, because many video segments have to be checked manually, the workload is heavy and the review is slow, so the efficiency of determining the video segments that contain abnormal behavior is low.
Disclosure of Invention
The application provides a facial motion detection method, a facial motion detection device and a facial motion detection system, which can solve the problem of low efficiency of determining video clips including abnormal behaviors in the related art. The technical scheme is as follows:
in one aspect, a facial motion detection method is provided, the method comprising:
acquiring a plurality of color face images associated with a current video clip and corresponding depth face images;
determining the facial action of the target through a target network model based on the multiple color face images and the corresponding depth face images;
and acquiring target video data when the facial action of the target is determined to belong to an abnormal action, wherein the target video data comprises the current video clip and/or a video image corresponding to the facial action belonging to the abnormal action.
In a possible implementation manner of the present application, the acquiring a plurality of color face images and corresponding depth face images associated with a current video segment includes:
acquiring a plurality of color video images and corresponding depth video images which are associated with the video clip and comprise the target;
respectively carrying out face detection on the obtained multiple color video images and the corresponding depth video images;
determining a plurality of face color area images and corresponding face depth area images from the plurality of color video images and corresponding depth video images according to a face detection result;
and determining the plurality of face color area images and the corresponding face depth area images as the plurality of color face images and the corresponding depth face images.
In a possible implementation manner of the present application, the determining the plurality of face color region images and the corresponding face depth region images as the plurality of color face images and the corresponding depth face images includes:
respectively performing face alignment processing on the plurality of face color area images and the corresponding face depth area images;
adjusting the sizes of the plurality of face color area images and the corresponding face depth area images after the face alignment processing to be the same;
and taking the plurality of face color area images and the corresponding face depth area images after size adjustment as the plurality of color face images and the corresponding depth face images.
In one possible implementation manner of the present application, the determining, by the target network model, a facial action of the target based on the multiple color face images and the corresponding depth face images includes:
inputting the multiple color face images and the corresponding depth face images into the target network model, extracting key features of each face image and the corresponding depth face image through a feature fusion network in the target network model, and fusing to obtain fusion features corresponding to each face image;
and analyzing the obtained multiple fusion features through a multi-frame analysis network in the target network model, and determining the facial action of the target.
In one possible implementation manner of the present application, after determining the facial action of the target, the method further includes:
classifying facial movements of the target;
and determining whether the facial action belongs to abnormal actions according to the classification result.
In a possible implementation manner of the present application, after the obtaining the target video data, the method further includes:
extracting video sub-segments corresponding to the facial actions from other video segments comprising the target or extracting video sub-segments corresponding to action categories to which the facial actions belong;
synthesizing the target video data and the extracted video sub-segments into a video according to the shooting time of the target video data and the extracted video sub-segments and/or the image frame numbers of the target video data and the extracted video sub-segments;
and playing the synthesized video.
In one possible implementation manner of the present application, the method further includes:
acquiring image information of a video image corresponding to the facial action belonging to the specified category;
determining the position of a camera for shooting the facial action belonging to the specified category according to the image information;
and sending the determined position of the camera to a designated terminal, and/or adding the determined position of the camera to the image information and then displaying.
In a possible implementation manner of the present application, the target network model is obtained by training a network model to be trained based on a plurality of face color image samples, corresponding face depth image samples, and actual face action categories of faces in the plurality of face image samples.
In another aspect, there is provided a facial motion detection apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of color face images and corresponding depth face images which are associated with the current video clip;
the determining module is used for determining the facial action of the target through a target network model based on the plurality of color face images and the corresponding depth face images;
and the second acquisition module is used for acquiring target video data when the facial action of the target is determined to belong to an abnormal action, wherein the target video data comprises the current video clip and/or a video image corresponding to the facial action belonging to the abnormal action.
In one possible implementation manner of the present application, the first obtaining module is configured to:
acquiring a plurality of color video images and corresponding depth video images which are associated with the video clip and comprise the target;
respectively carrying out face detection on the obtained multiple color video images and the corresponding depth video images;
determining a plurality of face color area images and corresponding face depth area images from the plurality of color video images and corresponding depth video images according to a face detection result;
and determining the plurality of face color area images and the corresponding face depth area images as the plurality of color face images and the corresponding depth face images.
In one possible implementation manner of the present application, the first obtaining module is configured to:
respectively performing face alignment processing on the plurality of face color area images and the corresponding face depth area images;
adjusting the sizes of the plurality of face color area images and the corresponding face depth area images after the face alignment processing to be the same;
and taking the plurality of face color area images and the corresponding face depth area images after size adjustment as the plurality of color face images and the corresponding depth face images.
In one possible implementation manner of the present application, the target network model includes a feature fusion network and a multi-frame analysis network, and the determining module is configured to:
inputting the multiple color face images and the corresponding depth face images into the target network model, extracting key features of each face image and the corresponding depth face image through a feature fusion network in the target network model, and fusing to obtain fusion features corresponding to each face image;
and analyzing the obtained multiple fusion features through a multi-frame analysis network in the target network model, and determining the facial action of the target.
In one possible implementation manner of the present application, the determining module is further configured to:
classifying facial movements of the target;
and determining whether the facial action belongs to abnormal actions according to the classification result.
In a possible implementation manner of the present application, the second obtaining module is configured to:
extracting video sub-segments corresponding to the facial actions from other video segments comprising the target or extracting video sub-segments corresponding to action categories to which the facial actions belong;
synthesizing the target video data and the extracted video sub-segments into a video according to the shooting time of the target video data and the extracted video sub-segments and/or the image frame numbers of the target video data and the extracted video sub-segments;
and playing the synthesized video.
In a possible implementation manner of the present application, the second obtaining module is further configured to:
acquiring image information of a video image corresponding to the facial action belonging to the specified category;
determining the position of a camera for shooting the facial action belonging to the specified category according to the image information;
and sending the determined position of the camera to a designated terminal, and/or adding the determined position of the camera to the image information and then displaying.
In a possible implementation manner of the present application, the target network model is obtained by training a network model to be trained based on a plurality of face color image samples, corresponding face depth image samples, and actual face action categories of faces in the plurality of face image samples.
In another aspect, a monitoring system is provided, the monitoring system comprising a processor and at least one camera, the processor being configured to:
acquiring a plurality of color face images and corresponding depth face images which are acquired by the at least one camera and are associated with the current video clip;
determining the facial action of the target through a target network model based on the multiple color face images and the corresponding depth face images;
and acquiring target video data when the facial action of the target is determined to belong to an abnormal action, wherein the target video data comprises the current video clip and/or a video image corresponding to the facial action belonging to the abnormal action.
In one possible implementation manner of the present application, the processor is further configured to:
when the at least one camera comprises a red-green-blue-depth (RGBD) camera, acquiring a color video image through the RGBD camera, and acquiring a corresponding depth video image under infrared illumination; or,
when the at least one camera comprises two red-green-blue (RGB) cameras, acquiring color video images through the two RGB cameras respectively, and determining a corresponding depth video image according to the color video images respectively acquired by the two RGB cameras; or,
when the at least one camera comprises an RGB camera and a depth camera, acquiring a color video image through the RGB camera, and acquiring a depth video image through the depth camera.
In another aspect, a control device is provided, which includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus, the memory stores a computer program, and the processor executes the program stored in the memory to implement the steps of the facial motion detection method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, implements the steps of the facial motion detection method according to one aspect described above.
In another aspect, a computer program product is provided that comprises instructions which, when run on a computer, cause the computer to perform the steps of the facial motion detection method of one aspect described above.
The technical scheme provided by the application can at least bring the following beneficial effects:
A plurality of color face images and corresponding depth face images associated with the current video clip are acquired. The color face images and the corresponding depth face images are input into a target network model, and the target network model determines the facial action of the target based on them. When the facial action belongs to an abnormal action, target video data is acquired. Therefore, for any video clip, the facial action of the target can be determined automatically by the target network model from the color face images and the depth face images, so as to decide whether the facial action belongs to an abnormal action and thereby locate the target video data in which the abnormal action occurs. This removes the need to manually check a large number of video clips, which improves both the efficiency of determining the video clips with abnormal actions and the detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a facial motion detection method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a video image and a face area image according to an embodiment of the present application;
fig. 3 is a schematic diagram of a face region image according to an embodiment of the present application;
fig. 4 is a schematic diagram of another face region image provided in the embodiment of the present application;
fig. 5 is a schematic structural diagram of a facial movement detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a control device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before describing the facial motion detection method provided by the embodiment of the present application in detail, terms and implementation environments related to the embodiment of the present application will be briefly described.
First, terms related to the embodiments of the present application will be briefly described.
Depth image: refers to an image constructed with the distance from the camera to each point in the scene as a pixel value.
Color image: refers to an image comprising pixels each consisting of R, G, B components.
Next, a brief description will be given of an implementation environment related to the embodiments of the present application.
Embodiments of the present application may be directed to an implementation environment that includes a monitoring system that may include a processor and at least one camera. The processor may be configured in the control device, and the at least one camera may be connected to the control device, or may be configured on the control device.
As an example, the at least one camera may include an RGBD (Red Green Blue Depth) camera, and a color video image and a corresponding Depth video image may be obtained by exposing the RGBD camera twice in different manners. For example, the RGBD camera can acquire a color video image under a normal exposure condition, and acquire a corresponding depth video image under an infrared exposure condition.
As another example, the at least one camera may include two RGB (Red Green Blue) cameras, so that image acquisition is performed by the two RGB cameras respectively to obtain two color video images, and the obtained color video images may then be processed, for example by stereo matching, to obtain a depth video image.
As another example, the at least one camera may also include an RGB camera and a depth camera, where the RGB camera may be used to capture color video images and the depth camera may be used to capture depth video images. As an example, the depth camera may be based on monocular structured light, TOF (Time of Flight), binocular vision, and the like.
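As an illustration of the two-RGB-camera option above, the sketch below uses OpenCV's block-matching stereo to turn a rectified color image pair into a rough depth map. This is only one possible realization and is not prescribed by the application; the file names, calibration values and library choice are assumptions. An RGBD camera or a dedicated depth camera, as in the other two options, would return the depth video image directly without this step.

```python
# Minimal sketch (assumptions: OpenCV, a pre-rectified stereo pair; file names and
# calibration values are illustrative).
import cv2

left = cv2.imread("left_rgb.png")    # color video image from RGB camera 1
right = cv2.imread("right_rgb.png")  # color video image from RGB camera 2

# Block-matching stereo expects grayscale, rectified images.
left_gray = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
right_gray = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left_gray, right_gray).astype("float32") / 16.0

# Convert disparity to metric depth: depth = focal_length * baseline / disparity.
# focal_px and baseline_m are hypothetical calibration values; invalid disparities
# (<= 0) would need masking in practice.
focal_px, baseline_m = 700.0, 0.06
depth = (focal_px * baseline_m) / (disparity + 1e-6)
```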
As an example, the control device may be a computer, a palmtop computer (PPC, Pocket PC), a tablet computer, or the like.
After the terms and implementation environments related to the embodiments of the present application are described, a facial motion detection method provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating a facial motion detection method according to an exemplary embodiment, which may be applied in the above implementation environment, and the facial motion detection method may include the following steps:
step 101: and acquiring a plurality of color face images and corresponding depth face images associated with the current video clip.
The current video clip may be any video clip currently shot in real time, or any one of a plurality of pre-stored video clips. In addition, the monitored scene in the video clip may be preset by the user according to actual conditions; for example, it may be the withdrawal area of an ATM (Automatic Teller Machine), the entrance or exit area of a subway station, the entrance area of a hospital, and the like.
As an example, the control device may be provided with a user interaction interface, and when a certain target needs to be queried, the user may specify the target through the user interaction interface.
In the embodiment of the application, the control device determines the facial action of the target based on face images of the target. At present, only the color face images extracted from a surveillance video are generally processed, but color face images are easily disturbed by illumination, pose and the like, so the results obtained from them alone are not accurate enough.
The plurality of color face images and the depth face images may be in a one-to-one relationship, or may be in a many-to-one relationship, or may be in a one-to-many relationship.
In addition, since the facial motion of a target tends to be dynamic, the control device typically needs to determine the facial motion of the target by acquiring multiple color face images and corresponding depth face images.
As an example, a specific implementation of acquiring the plurality of color face images and corresponding depth face images associated with the current video segment may include: acquiring a plurality of color video images and corresponding depth video images which are associated with the video clip and include the target; respectively performing face detection on the obtained color video images and the corresponding depth video images; determining a plurality of face color area images and corresponding face depth area images from the color video images and the corresponding depth video images according to the face detection results; and determining the plurality of face color area images and the corresponding face depth area images as the plurality of color face images and the corresponding depth face images.
As described above, the control device may acquire a plurality of color video images and corresponding depth video images associated with the video clip through one RGB camera and one depth camera, respectively. Then, the control device performs face detection on the color video images and the corresponding depth video images respectively, determines the color video images and corresponding depth video images that contain the face of the target, and then crops the area where the face is located according to the face detection result, obtaining a face area image for each color video image and its corresponding depth video image.
For example, as shown in fig. 2, face detection is performed on a color video image to determine the target in the color video image, as shown in fig. 2(a); the area where the face is located is then cropped according to the face detection result, as shown in fig. 2(b), to obtain a face color area image.
As an example, the face detection may be implemented by a single-CNN (Convolutional Neural Network) face detection method, a cascaded-CNN face detection method, an OpenCV face detection method, and the like, so as to determine whether an image contains a face and whether that face belongs to the target.
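As a concrete sketch of this detection-and-cropping step, the snippet below uses OpenCV's bundled Haar cascade detector and reuses the bounding box found in the color frame to crop the depth frame. The Haar cascade merely stands in for the single-CNN, cascaded-CNN or OpenCV detectors mentioned above, and the pixel alignment of the two modalities is an assumption.

```python
# Minimal sketch (assumptions: color and depth frames are pixel-aligned; the Haar cascade
# is an illustrative stand-in for the detectors named in the text).
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_regions(color_img, depth_img):
    """Return (face_color_region, face_depth_region) for the largest detected face, or None."""
    gray = cv2.cvtColor(color_img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    return color_img[y:y + h, x:x + w], depth_img[y:y + h, x:x + w]
```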
Further, after the plurality of face color area images and the corresponding face depth area images are determined, their sizes can be adjusted to be the same, and the adjusted face color area images and corresponding face depth area images are determined as the plurality of color face images and corresponding depth face images of the target. The adjustment can be carried out according to a reference size, which can be preset according to actual requirements.
Further, before the plurality of determined face color area images and the corresponding face depth area images are determined as a plurality of color face images and corresponding depth face images, face alignment processing is respectively performed on the plurality of face color area images and the corresponding face depth area images, the sizes of the plurality of face color area images and the corresponding face depth area images after the face alignment processing are adjusted to be the same, and the plurality of face color area images and the corresponding face depth area images after the size adjustment are used as the plurality of color face images and the corresponding depth face images.
Because the face region images may have various sizes, and the pose of the face may differ from one face region image to another, the face region images may be adjusted, for example by face alignment processing and image resizing, in order to determine the facial action of the target more accurately.
The face alignment processing means that the face in each face region image is aligned through the coordinates of the key points of the face region image, that is, the face region images with different face poses are normalized, so that the poses of the faces in all the face region images are as same as possible.
For example, for any face region image, the coordinates of two eyes, a nose, a mouth and other key parts can be determined, and according to the coordinates, the face alignment processing can be performed on the face through a face alignment algorithm to obtain a face region image of the front face. As shown in fig. 3, fig. 3(a) is a face region image without face alignment processing, and fig. 3(b) is a face region image after face alignment processing.
Because the position of the target relative to the camera may vary while the video is being shot, the obtained face region images may have different sizes. To ensure that the judgement of the facial action category of the target is accurate, the face region images after face alignment processing can be resized so that all of them have exactly the same size. Standard data for a face region image, such as a standard width, a standard height and standard key point coordinates, can be preset and adjusted according to actual conditions, and all face region images are then adjusted according to the set standard data.
As an example, as shown in fig. 4, the width of each face region image is adjusted to a standard width, the height of each face region image is adjusted to a standard height, and the coordinates of the key points in each face region image are adjusted to standard key point coordinates, where fig. 4(a) and 4(b) are images after face alignment processing, and after image size adjustment, the images all become images with the same size in fig. 4 (c).
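This alignment-then-resizing step can be sketched as a similarity transform that levels the eye line followed by a resize to the standard size. The eye coordinates are assumed to come from a landmark detector, and the 128x128 reference size is illustrative rather than taken from the application.

```python
# Minimal sketch (assumptions: eye centers are available from a landmark detector; the
# 128x128 reference size is illustrative).
import math
import cv2

REF_SIZE = (128, 128)

def align_and_resize(face_img, left_eye, right_eye):
    """Rotate the face so the eye line is horizontal, then resize to the reference size."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2.0, (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = face_img.shape[:2]
    aligned = cv2.warpAffine(face_img, rot, (w, h))
    # The same transform would be applied to the face color area image and its corresponding
    # face depth area image so that the pair stays aligned.
    return cv2.resize(aligned, REF_SIZE)
```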
Step 102: and determining the facial action of the target through a target network model based on the plurality of color face images and the corresponding depth face images.
The facial actions may include smiling, frowning, eyes opening, etc., and the facial actions may generally indicate the mood of the target, so the facial actions of the target may be determined prior to subsequent processing.
As an example, when the target network model includes a feature fusion network and a multi-frame analysis network, determining the facial action of the target through the target network model based on the plurality of color face images and the corresponding depth face images includes: inputting the plurality of color face images and the corresponding depth face images into the target network model; extracting key features of each color face image and the corresponding depth face image through the feature fusion network in the target network model and fusing them to obtain a fusion feature corresponding to each color face image; and analyzing the obtained fusion features through the multi-frame analysis network in the target network model to determine the facial action of the target.
If feature extraction were performed on the face color image and the corresponding face depth image separately, neither the recognition result obtained by analyzing only the depth features nor the one obtained by analyzing only the color features would be accurate enough. To obtain a more accurate recognition result, the extracted color features and depth features need to be fused, and subsequent processing is then performed on the resulting fusion features; for this reason the target network model includes a feature fusion network.
The control device may perform convolution operations on the face color image and the corresponding face depth image through the feature fusion network in the target network model, extract the color features in the face color image and the depth features in the corresponding face depth image, concatenate the color features and the depth features, and extract a fusion feature from the concatenated features through a fully connected layer. The feature fusion can be realized with a neural network, which learns to select the key features among the color features and depth features and fuse them so as to achieve the best judgement result.
For example, when facial actions are determined, the most critical parts are usually the eyes, the mouth and the eyebrows, whose different movements express a rich variety of facial actions; by comparison, the nose moves little and can largely be ignored, which speeds up the judgement and improves its accuracy. That is, the color and depth features of the eye, mouth and eyebrow areas can be selected as the key features.
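Read literally, the feature fusion network described above is a two-branch convolutional network whose color and depth features are concatenated and passed through a fully connected layer. The PyTorch sketch below is one plausible reading; the layer sizes, the input resolution and the 128-dimensional fusion feature are assumptions, not values given in the application.

```python
# Minimal sketch (assumptions: PyTorch, 3-channel color + 1-channel depth input, 128-d fusion
# feature; none of these sizes are specified in the application).
import torch
import torch.nn as nn

def conv_branch(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten())           # -> 32*4*4 = 512 features per modality

class FeatureFusionNet(nn.Module):
    def __init__(self, fusion_dim=128):
        super().__init__()
        self.color_branch = conv_branch(3)
        self.depth_branch = conv_branch(1)
        self.fuse = nn.Linear(512 + 512, fusion_dim)     # fully connected layer learns which
                                                         # color/depth features to keep
    def forward(self, color, depth):
        feats = torch.cat([self.color_branch(color), self.depth_branch(depth)], dim=1)
        return self.fuse(feats)                          # one fusion feature per frame
```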
The control device obtains a plurality of fusion features through the feature fusion network in the target network model, and then inputs the fusion features into the multi-frame analysis network, so that the facial action of the target is determined by analyzing facial action units through the multi-frame analysis network.
For example, a facial action unit may be used to indicate the motion of one region of the face, such as the mouth corner, the eyelid or the philtrum. Referring to table 1, table 1 lists definitions of facial action units (AUs) according to an exemplary embodiment; different facial action units, once combined, can represent different facial actions.
TABLE 1

AU    Definition                                          AU    Definition
AU1   Inner brow corner raised                            AU14  Mouth corner tightened
AU2   Outer brow corner raised                            AU15  Mouth corner pulled down
AU4   Brow lowered (frown)                                AU16  Lower lip pulled down
AU5   Upper eyelid raised                                 AU17  Lower lip pushed up
AU6   Cheek raised, outer orbicularis oculi tightened     AU20  Mouth corner stretched
AU7   Eyelids tightened                                   AU23  Lips tightened
AU9   Nose wrinkled                                       AU24  Lips pressed together
AU10  Upper lip raised                                    AU25  Lips parted
AU11  Skin of the philtrum pulled up                      AU26  Mouth opened
AU12  Mouth corner pulled up obliquely                    AU32  Lip bitten
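The application characterizes the multi-frame analysis network only by its function: it receives the per-frame fusion features of a clip and outputs the facial action, which is described in terms of combinations of the action units in Table 1. The temporal architecture is not specified; a small recurrent network over the frame sequence, as sketched below, is one plausible choice, with the hidden size and the number of facial-action classes assumed for illustration.

```python
# Minimal sketch (assumptions: PyTorch, 128-d fusion features, an LSTM as the temporal model,
# and a fixed set of facial-action classes; the application does not specify these choices).
import torch
import torch.nn as nn

class MultiFrameAnalysisNet(nn.Module):
    def __init__(self, fusion_dim=128, hidden_dim=64, num_actions=10):
        super().__init__()
        self.temporal = nn.LSTM(fusion_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)   # each class is a facial action, i.e.
                                                         # a combination of the AUs in Table 1
    def forward(self, fusion_seq):
        # fusion_seq: (batch, num_frames, fusion_dim) - fusion features of one video clip
        _, (h_n, _) = self.temporal(fusion_seq)
        return self.head(h_n[-1])                        # logits over facial-action classes
```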
Further, the target network model is obtained by training the network model to be trained based on a plurality of face color image samples, corresponding face depth image samples and actual face action types of faces in the plurality of face image samples.
If the target network model processed and analyzed the color face images and corresponding depth face images based only on initial model parameters, the recognition result of the target's facial action might not be accurate enough, so the target network model needs to be trained first in order to determine the facial action accurately. As an example, a plurality of face color image samples and corresponding face depth image samples may be prepared in advance, choosing samples that represent different facial actions. Face color image samples and corresponding face depth image samples taken over a continuous period are then grouped together as one set of face image samples, that is, each group of face image samples corresponds to one video clip.
The actual facial actions of the face color image samples and corresponding face depth image samples are determined, and the samples are input into the network model to be trained. The network model to be trained analyzes the face color image samples and the corresponding face depth image samples based on its initial model parameters and outputs a facial action recognition result. The output recognition result is compared with the actual facial action, and if it is wrong the model parameters are adjusted. This is repeated until, after a large number of sample groups have been input (for example 1000 groups), the accuracy of the facial action recognition results is high, for example greater than or equal to 95%; at that point the network model to be trained can be considered trained, and the resulting trained network model is determined to be the target network model. In this way, the target network model can be used to detect the facial action of an arbitrary target based on a plurality of color face images and corresponding depth face images of that target.
Further, the number of face image samples used for training may be limited, for example, N groups may be selected, and correspondingly, when the face image of the target is detected, N groups of face images may be acquired.
Further, the face image samples used for training may also be resized to the same size, for example, according to the above-mentioned reference size.
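The training procedure above amounts to an ordinary supervised loop over grouped samples, with training stopped once the recognition accuracy reaches the chosen threshold. The sketch below assumes the two sub-networks sketched earlier, a cross-entropy loss and the Adam optimizer; only the grouped samples and the 95% accuracy criterion come from the description.

```python
# Minimal sketch (assumptions: PyTorch, the FeatureFusionNet and MultiFrameAnalysisNet
# sketched above, Adam and cross-entropy; only the grouped samples and the 95% accuracy
# target come from the text).
import torch
import torch.nn as nn

fusion_net, analysis_net = FeatureFusionNet(), MultiFrameAnalysisNet()
params = list(fusion_net.parameters()) + list(analysis_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train(sample_groups, target_accuracy=0.95):
    """sample_groups: iterable of (color_frames, depth_frames, label) per clip, where
    color_frames is (num_frames, 3, H, W), depth_frames is (num_frames, 1, H, W) and
    label is an integer facial-action class index."""
    while True:
        correct = total = 0
        for color_frames, depth_frames, label in sample_groups:
            fused = fusion_net(color_frames, depth_frames).unsqueeze(0)  # (1, T, fusion_dim)
            logits = analysis_net(fused)
            loss = criterion(logits, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += int(logits.argmax(dim=1).item() == label)
            total += 1
        if correct / total >= target_accuracy:   # e.g. the 95% threshold from the description
            break
```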
Further, after determining the facial action of the target, the method further includes: classifying the facial action of the target, and determining whether the facial action belongs to an abnormal action according to the classification result.
Illustratively, facial actions may be classified into different categories according to different classification rules. For example, when facial actions are classified according to the emotion of the target, facial actions indicating that the target is happy, such as smiling, laughing and relaxed eyebrows, may be placed in a happy category, while facial actions indicating that the target is angry, such as frowning, lowered eyes and a dropped lower lip, may be placed in an angry category. In this way, facial actions may be divided into categories such as angry, happy, fearful and tense according to the emotion of the target.
In this embodiment, whether the facial action belongs to an abnormal action can be determined according to the classification result of the facial action. The abnormal action can be configured by the user according to actual conditions. In general, because there is a certain correlation between the various facial action categories (for example, the angry category and the tense category both represent negative emotions), the abnormal action may cover not only a single facial action category but also multiple categories; for example, the abnormal action may be defined as the angry category alone, or as both the angry category and the tense category.
When the classification result shows that the facial action falls into a facial action category included in the abnormal action, the facial action is determined to belong to the abnormal action; when it does not, the facial action is determined not to belong to the abnormal action.
For example, when angry facial actions are set as the abnormal action and the target network model determines that the facial action of the target is glaring, the glaring action is classified, and the classification result shows that it is an angry facial action, so the facial action can be determined to belong to the abnormal action.
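Once a facial action has been classified into a category, deciding whether it is abnormal reduces to a membership test against the user-configured set of abnormal categories. The action names, categories and mapping in the sketch below are illustrative, not taken from the application.

```python
# Minimal sketch (assumptions: the action names and the action-to-category mapping are
# illustrative; only the rule "abnormal if the category is in the configured abnormal set"
# comes from the text).
ACTION_TO_CATEGORY = {
    "smile": "happy",
    "frown": "angry",
    "glare": "angry",
    "lip_tighten": "tense",
}
ABNORMAL_CATEGORIES = {"angry", "tense"}    # user-configured abnormal action categories

def is_abnormal(facial_action: str) -> bool:
    return ACTION_TO_CATEGORY.get(facial_action) in ABNORMAL_CATEGORIES
```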
Step 103: when the facial action of the target is determined to belong to an abnormal action, acquiring target video data, wherein the target video data comprises the current video segment and/or the video image corresponding to the facial action belonging to the abnormal action.
That is, the target video data may include a video image corresponding to the abnormal motion, may also include a video clip to which the video image corresponding to the abnormal motion belongs, and may also include a video image corresponding to the abnormal motion and a video clip to which the video image belongs.
Further, after the target video data is acquired, video sub-segments corresponding to the facial actions may be extracted from other video segments including the target, or video sub-segments corresponding to action categories to which the facial actions belong may be extracted. And synthesizing the target video data and the extracted video sub-segments into a video according to the shooting time of the target video data and the extracted video sub-segments and/or the image frame number of the target video data and the extracted video sub-segments, and playing the synthesized video.
That is, if the facial action of the target belongs to an abnormal action, there is a high probability that, when the video is reviewed later, all video data showing the same facial action will need to be retrieved. Therefore, after the target video data is acquired, video sub-segments containing that facial action, or video sub-segments corresponding to the action category to which the facial action belongs, can be extracted from other video segments, and the extracted video sub-segments can be synthesized with the acquired target video data and played.
Further, since the facial actions of one category indicate the same emotion, all facial actions in the category to which the detected facial action belongs may be regarded as abnormal actions, and in some embodiments all video data of that action category may need to be retrieved. Therefore, after the target video data is acquired, video sub-segments corresponding to the action category to which the facial action belongs can be extracted from other video segments, and the extracted video sub-segments can then be synthesized with the acquired target video data and played.
Therefore, the video data belonging to abnormal actions are synthesized into the video, so that a user can conveniently and quickly locate the abnormal video, a large number of video segments are prevented from being played and checked one by one, and the video searching efficiency is improved.
In the synthesizing process, the target video data and the extracted video sub-segments may be synthesized in chronological order according to their shooting times. Alternatively, they may be synthesized in shooting order according to their image frame numbers. Alternatively, the target video data and the extracted video sub-segments may be synthesized according to both their shooting times and their image frame numbers.
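Ordering the target video data and the extracted sub-segments by shooting time and concatenating them into one playable file can be done with plain OpenCV video I/O, as sketched below; the clip metadata structure, codec, frame rate and output size are assumptions.

```python
# Minimal sketch (assumptions: each clip is an mp4 file with a known shooting_time; codec,
# frame rate and output size are illustrative).
import cv2

def synthesize(clips, out_path="abnormal_actions.mp4", fps=25, size=(1280, 720)):
    """clips: list of dicts like {"path": "...", "shooting_time": datetime} - the target video
    data plus the extracted sub-segments; frames are written in shooting-time order."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for clip in sorted(clips, key=lambda c: c["shooting_time"]):
        cap = cv2.VideoCapture(clip["path"])
        ok, frame = cap.read()
        while ok:
            writer.write(cv2.resize(frame, size))
            ok, frame = cap.read()
        cap.release()
    writer.release()
```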
Further, image information of a video image corresponding to the facial action belonging to the specified category is acquired, the position of a camera for shooting the facial action belonging to the specified category is determined according to the image information, the determined position of the camera is sent to a specified terminal, and/or the determined position of the camera is added to the image information and then displayed.
The image information may include, but is not limited to, information about the camera, such as the number of the camera and the position of the camera. As an example, the position of the camera may be expressed as coordinates, or as specific address information, such as the south section of Street 1 in Block 1, or Gate 1 of Mall 1.
The designated terminal may be a security device, a device carried by the police, or an alarm device.
The image information of the video image corresponding to the facial action belonging to the abnormal action, that is, the number and/or position information of the camera that captured the video image, is acquired, and the position of that camera is determined, so that the location where the abnormal action occurred can be determined. After the position of the camera is determined, it can be sent to the designated terminal for display, added to the image information for display, or both. For example, a piece of location information such as "the shooting location is Gate 1 of Mall 1" may be displayed on the designated terminal.
Furthermore, the image information with the added camera position can be displayed on the video image containing the abnormal action, so that the location of the abnormal action can be read directly from the video image corresponding to the abnormal action.
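This camera-position step is essentially a lookup keyed by the camera information carried in the image information, followed by a notification and/or a text overlay. In the sketch below the camera registry, the overlay style and the delivery callback are all illustrative; how the position actually reaches the designated terminal is application-specific.

```python
# Minimal sketch (assumptions: image_info carries a camera identifier, the camera registry is
# a simple dict, and sending to the designated terminal is left to a caller-supplied callback).
import cv2

CAMERA_POSITIONS = {"cam_001": "Gate 1 of Mall 1"}      # illustrative camera registry

def report_camera_position(image_info, video_image, send_to_terminal):
    position = CAMERA_POSITIONS.get(image_info["camera_id"], "unknown")
    send_to_terminal(position)                           # e.g. push to a security/police device
    annotated = video_image.copy()
    cv2.putText(annotated, f"shooting location: {position}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
    return annotated
```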
In the embodiment of the application, a plurality of color face images and corresponding depth face images associated with the current video clip are acquired. The color face images and the corresponding depth face images are input into a target network model, and the target network model determines the facial action of the target based on them. When the facial action belongs to an abnormal action, target video data is acquired. Therefore, for any video clip, the facial action of the target can be determined automatically by the target network model from the color face images and the depth face images, so as to decide whether the facial action belongs to an abnormal action and thereby locate the target video data in which the abnormal action occurs. This removes the need to manually check a large number of video clips, which improves both the efficiency of determining the video clips with abnormal actions and the detection accuracy.
Fig. 5 is a schematic structural diagram illustrating a facial motion detection apparatus according to an exemplary embodiment. The facial motion detection apparatus may be implemented by software, by hardware, or by a combination of the two, and may include:
a first obtaining module 510, configured to obtain a plurality of color face images and corresponding depth face images associated with a current video segment;
a determining module 520, configured to determine the facial action of the target through a target network model based on the plurality of color face images and the corresponding depth face images;
a second obtaining module 530, configured to obtain target video data when it is determined that the facial action of the target belongs to an abnormal action, where the target video data includes the current video segment and/or a video image corresponding to the facial action belonging to the abnormal action.
In a possible implementation manner of the present application, the first obtaining module 510 is configured to:
acquiring a plurality of color video images and corresponding depth video images which are associated with the video clip and comprise the target;
respectively carrying out face detection on the obtained multiple color video images and the corresponding depth video images;
determining a plurality of face color area images and corresponding face depth area images from the plurality of color video images and corresponding depth video images according to a face detection result;
and determining the plurality of face color area images and the corresponding face depth area images as the plurality of color face images and the corresponding depth face images.
In a possible implementation manner of the present application, the first obtaining module 510 is configured to:
respectively performing face alignment processing on the plurality of face color area images and the corresponding face depth area images;
adjusting the sizes of the plurality of face color area images and the corresponding face depth area images after the face alignment processing to be the same;
and taking the plurality of face color area images and the corresponding face depth area images after size adjustment as the plurality of color face images and the corresponding depth face images.
In a possible implementation manner of the present application, the target network model includes a feature fusion network and a multi-frame analysis network, and the determining module 520 is configured to:
inputting the multiple color face images and the corresponding depth face images into the target network model, extracting key features of each face image and the corresponding depth face image through a feature fusion network in the target network model, and fusing to obtain fusion features corresponding to each face image;
and analyzing the obtained multiple fusion features through a multi-frame analysis network in the target network model, and determining the facial action of the target.
In one possible implementation manner of the present application, the determining module 520 is further configured to:
classifying facial movements of the target;
and determining whether the facial action belongs to abnormal actions according to the classification result.
In a possible implementation manner of the present application, the second obtaining module 530 is configured to:
extracting video sub-segments corresponding to the facial actions from other video segments comprising the target or extracting video sub-segments corresponding to action categories to which the facial actions belong;
synthesizing the target video data and the extracted video sub-segments into a video according to the shooting time of the target video data and the extracted video sub-segments and/or the image frame numbers of the target video data and the extracted video sub-segments;
and playing the synthesized video.
In a possible implementation manner of the present application, the second obtaining module 530 is further configured to:
acquiring image information of a video image corresponding to the facial action belonging to the specified category;
determining the position of a camera for shooting the facial action belonging to the specified category according to the image information;
and sending the determined position of the camera to a designated terminal, and/or adding the determined position of the camera to the image information and then displaying.
In a possible implementation manner of the present application, the target network model is obtained by training a network model to be trained based on a plurality of face color image samples, corresponding face depth image samples, and actual face action categories of faces in the plurality of face image samples.
In the embodiment of the application, a plurality of color face images and corresponding depth face images associated with the current video clip are acquired. The color face images and the corresponding depth face images are input into a target network model, and the target network model determines the facial action of the target based on them. When the facial action belongs to an abnormal action, target video data is acquired. Therefore, for any video clip, the facial action of the target can be determined automatically by the target network model from the color face images and the depth face images, so as to decide whether the facial action belongs to an abnormal action and thereby locate the target video data in which the abnormal action occurs. This removes the need to manually check a large number of video clips, which improves both the efficiency of determining the video clips with abnormal actions and the detection accuracy.
It should be noted that: in the face motion detection apparatus provided in the foregoing embodiment, when implementing the face motion detection method, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the facial motion detection apparatus provided by the above embodiment and the facial motion detection method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 6 is a schematic structural diagram of a control device 600 according to an embodiment of the present application. The control device 600 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where the memory 602 stores at least one instruction that is loaded and executed by the processor 601 to implement the facial motion detection method provided by the above method embodiments.
Of course, the control device 600 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the control device 600 may further include other components for implementing device functions, which are not described herein again.
Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the facial motion detection method provided in the embodiment shown in fig. 1.
Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the facial motion detection method provided in the embodiment shown in fig. 1.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A facial motion detection method, the method comprising:
acquiring a plurality of color face images associated with a current video clip and corresponding depth face images;
determining the facial action of the target through a target network model based on the multiple color face images and the corresponding depth face images;
and when the target face action is determined to belong to the abnormal action, acquiring target video data, wherein the target video data comprises the current video clip and/or a video image corresponding to the face action belonging to the abnormal action.
2. The method of claim 1, wherein the obtaining a plurality of color face images and corresponding depth face images associated with a current video segment comprises:
acquiring a plurality of color video images and corresponding depth video images which are associated with the video clip and comprise the target;
respectively carrying out face detection on the obtained multiple color video images and the corresponding depth video images;
determining a plurality of face color area images and corresponding face depth area images from the plurality of color video images and corresponding depth video images according to a face detection result;
and determining the plurality of face color area images and the corresponding face depth area images as the plurality of color face images and the corresponding depth face images.
3. The method of claim 2, wherein determining the plurality of face color area images and corresponding face depth area images as the plurality of color face images and corresponding depth face images comprises:
respectively performing face alignment processing on the plurality of face color area images and the corresponding face depth area images;
adjusting the sizes of the plurality of face color area images and the corresponding face depth area images after the face alignment processing to be the same;
and taking the plurality of face color area images and the corresponding face depth area images after size adjustment as the plurality of color face images and the corresponding depth face images.
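A minimal sketch of the alignment and resizing step of claim 3 is given below. The eye coordinates are assumed to come from a separate landmark detector, and the 112x112 target size is an assumption of this sketch; the application does not prescribe a particular alignment method or image size here.

```python
import cv2
import numpy as np

TARGET_SIZE = (112, 112)  # assumed size; the claims do not specify one

def align_and_resize(face_color, face_depth, left_eye, right_eye):
    """Rotate both crops so the eye line is horizontal, then resize to TARGET_SIZE.
    left_eye / right_eye are (x, y) pixel coordinates from a landmark detector."""
    h, w = face_color.shape[:2]
    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    angle = float(np.degrees(np.arctan2(dy, dx)))
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    aligned_color = cv2.warpAffine(face_color, M, (w, h))
    aligned_depth = cv2.warpAffine(face_depth, M, (w, h))
    return (cv2.resize(aligned_color, TARGET_SIZE),
            cv2.resize(aligned_depth, TARGET_SIZE))
```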
4. The method of claim 1, wherein the target network model comprises a feature fusion network and a multi-frame analysis network, and wherein the determining the facial action of the target through the target network model based on the plurality of color face images and the corresponding depth face images comprises:
inputting the plurality of color face images and the corresponding depth face images into the target network model, extracting key features of each color face image and the corresponding depth face image through the feature fusion network in the target network model, and fusing the key features to obtain a fusion feature corresponding to each color face image;
and analyzing the obtained plurality of fusion features through the multi-frame analysis network in the target network model to determine the facial action of the target.
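Claim 4 describes a feature fusion network followed by a multi-frame analysis network. The PyTorch sketch below shows one possible, purely illustrative reading of that structure: a small CNN fuses the concatenated color and depth channels of each frame into a per-frame feature, and a GRU aggregates the frame features to predict an action class. The layer sizes, the GRU, and the number of action classes are assumptions of this sketch, not the architecture disclosed in the application.

```python
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Per-frame feature extractor over concatenated RGB (3) + depth (1) channels."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))

    def forward(self, rgbd):          # rgbd: (B*T, 4, H, W)
        return self.backbone(rgbd)    # (B*T, feat_dim) fusion features

class MultiFrameAnalysisNet(nn.Module):
    """Temporal aggregation of per-frame fusion features with a GRU."""
    def __init__(self, feat_dim=128, num_actions=10):
        super().__init__()
        self.gru = nn.GRU(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_actions)

    def forward(self, feats):         # feats: (B, T, feat_dim)
        _, h = self.gru(feats)
        return self.head(h[-1])       # (B, num_actions) facial-action logits

# Usage: fuse color + depth per frame, then analyse the frame sequence.
B, T, H, W = 2, 8, 112, 112
rgbd_frames = torch.randn(B, T, 4, H, W)
fusion, temporal = FeatureFusionNet(), MultiFrameAnalysisNet()
feats = fusion(rgbd_frames.view(B * T, 4, H, W)).view(B, T, -1)
logits = temporal(feats)              # facial-action scores per clip
```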
5. The method of claim 4, wherein after determining the facial action of the target, the method further comprises:
classifying the facial action of the target;
and determining whether the facial action belongs to abnormal actions according to the classification result.
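The classification step of claim 5 can be illustrated as a softmax over the clip-level logits followed by a lookup against a set of categories treated as abnormal. The label names and the choice of which labels count as abnormal are hypothetical; the application does not enumerate the action categories.

```python
import torch

# Hypothetical label set and abnormal subset, used only for this sketch.
ACTION_LABELS = ["neutral", "blink", "smile", "yawn", "eyes_closed", "head_drop"]
ABNORMAL_ACTIONS = {"yawn", "eyes_closed", "head_drop"}

def classify_and_check(logits):
    """Map per-clip logits to an action label and an 'abnormal' flag."""
    probs = torch.softmax(logits, dim=-1)
    idx = int(probs.argmax(dim=-1))
    action = ACTION_LABELS[idx]
    return action, action in ABNORMAL_ACTIONS

# e.g. action, is_abnormal = classify_and_check(logits[0])
```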
6. The method of claim 5, wherein after acquiring the target video data, the method further comprises:
extracting, from other video clips that include the target, video sub-segments corresponding to the facial action or video sub-segments corresponding to the action category to which the facial action belongs;
synthesizing the target video data and the extracted video sub-segments into a video according to the shooting time of the target video data and the extracted video sub-segments and/or the image frame numbers of the target video data and the extracted video sub-segments;
and playing the synthesized video.
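A rough sketch of the synthesis step of claim 6 follows: segments are ordered by shooting time and frame index and written to a single file. The segment dictionary keys, the frame rate and the output format are assumptions of this sketch.

```python
import cv2

def synthesize_video(segments, out_path="abnormal_actions.mp4", fps=25):
    """segments: list of dicts with 'shooting_time', 'frame_index' and 'frames'
    (a list of BGR images). Order by time, then frame index, and write one video."""
    ordered = sorted(segments, key=lambda s: (s["shooting_time"], s["frame_index"]))
    h, w = ordered[0]["frames"][0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for seg in ordered:
        for frame in seg["frames"]:
            writer.write(frame)
    writer.release()
    return out_path
```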
7. The method of claim 1, wherein the method further comprises:
acquiring image information of a video image corresponding to the facial action belonging to the specified category;
determining the position of a camera for shooting the facial action belonging to the specified category according to the image information;
and sending the determined position of the camera to a designated terminal, and/or adding the determined position of the camera to the image information and displaying the image information.
8. The method of claim 1, wherein the target network model is obtained by training a network model to be trained based on a plurality of face color image samples and corresponding face depth image samples, and the actual facial action classes of the faces in the plurality of face color image samples.
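Reusing the FeatureFusionNet and MultiFrameAnalysisNet classes from the sketch under claim 4, one illustrative training step for claim 8 might look as follows. Random tensors stand in for the face color image samples, face depth image samples and their actual facial action classes, and the optimizer, batch shape and six-class label space are assumptions of this sketch.

```python
import torch
import torch.nn as nn

fusion, temporal = FeatureFusionNet(), MultiFrameAnalysisNet(num_actions=6)
optimizer = torch.optim.Adam(
    list(fusion.parameters()) + list(temporal.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Placeholder samples: (B, T, RGB+D channels, H, W) clips and per-clip labels.
rgbd_clips = torch.randn(4, 8, 4, 112, 112)
action_labels = torch.randint(0, 6, (4,))   # actual facial-action classes

B, T = rgbd_clips.shape[:2]
feats = fusion(rgbd_clips.view(B * T, 4, 112, 112)).view(B, T, -1)
loss = criterion(temporal(feats), action_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```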
9. A facial motion detection apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a plurality of color face images and corresponding depth face images which are associated with the current video clip;
the determining module is used for determining the facial action of the target through a target network model based on the plurality of color face images and the corresponding depth face images;
and the second acquisition module is used for acquiring target video data when the facial action of the target is determined to belong to an abnormal action, wherein the target video data comprises the current video clip and/or a video image corresponding to the facial action belonging to the abnormal action.
10. A monitoring system, comprising a processor and at least one camera, the processor configured to:
acquiring a plurality of color face images and corresponding depth face images which are acquired by the at least one camera and are associated with the current video clip;
determining the facial action of the target through a target network model based on the multiple color face images and the corresponding depth face images;
and when the facial action of the target is determined to belong to an abnormal action, acquiring target video data, wherein the target video data comprises the current video clip and/or a video image corresponding to the facial action belonging to the abnormal action.
11. The monitoring system of claim 10, wherein the processor is further configured to:
when the at least one camera comprises a red, green, blue and depth (RGBD) camera, acquiring a color video image through the RGBD camera and acquiring a corresponding depth video image in the presence of infrared light; or,
when the at least one camera comprises two red, green and blue (RGB) cameras, acquiring color video images through the two RGB cameras respectively, and determining corresponding depth video images according to the color video images respectively collected by the two RGB cameras; or,
when the at least one camera comprises an RGB camera and a depth camera, acquiring a color video image through the RGB camera, and acquiring a depth video image through the depth camera.
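For the two-RGB-camera option of claim 11, depth can in principle be recovered by stereo matching. The OpenCV sketch below assumes rectified left and right frames and a known focal length (in pixels) and baseline (in metres); it is a generic illustration, not the procedure disclosed in the application.

```python
import cv2
import numpy as np

def depth_from_stereo(left_bgr, right_bgr, focal_px=700.0, baseline_m=0.06):
    """Estimate a depth map (metres) from rectified left/right color frames."""
    left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
    disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point scale
    disparity[disparity <= 0] = np.nan        # mark invalid matches
    return focal_px * baseline_m / disparity  # depth = f * B / d
```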
CN201910760634.XA 2019-08-16 2019-08-16 Face action detection method, device and system Pending CN112395922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910760634.XA CN112395922A (en) 2019-08-16 2019-08-16 Face action detection method, device and system

Publications (1)

Publication Number Publication Date
CN112395922A true CN112395922A (en) 2021-02-23

Family

ID=74603119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910760634.XA Pending CN112395922A (en) 2019-08-16 2019-08-16 Face action detection method, device and system

Country Status (1)

Country Link
CN (1) CN112395922A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082298A (en) * 2022-07-15 2022-09-20 北京百度网讯科技有限公司 Image generation method, image generation device, electronic device, and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106774856A (en) * 2016-08-01 2017-05-31 深圳奥比中光科技有限公司 Exchange method and interactive device based on lip reading
CN106778506A (en) * 2016-11-24 2017-05-31 重庆邮电大学 A kind of expression recognition method for merging depth image and multi-channel feature
CN106919251A (en) * 2017-01-09 2017-07-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition
CN107368778A (en) * 2017-06-02 2017-11-21 深圳奥比中光科技有限公司 Method for catching, device and the storage device of human face expression
CN107368810A (en) * 2017-07-20 2017-11-21 北京小米移动软件有限公司 Method for detecting human face and device
CN107491726A (en) * 2017-07-04 2017-12-19 重庆邮电大学 A kind of real-time expression recognition method based on multi-channel parallel convolutional neural networks
CN108171212A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN109376667A (en) * 2018-10-29 2019-02-22 北京旷视科技有限公司 Object detection method, device and electronic equipment
CN109712105A (en) * 2018-12-24 2019-05-03 浙江大学 A kind of image well-marked target detection method of combination colour and depth information
GB201909300D0 (en) * 2019-06-28 2019-08-14 Facesoft Ltd Facial behaviour analysis

Similar Documents

Publication Publication Date Title
CN110235138B (en) System and method for appearance search
CN106372662B (en) Detection method and device for wearing of safety helmet, camera and server
CN105809144B (en) A kind of gesture recognition system and method using movement cutting
US8866931B2 (en) Apparatus and method for image recognition of facial areas in photographic images from a digital camera
CN110232369B (en) Face recognition method and electronic equipment
WO2019128507A1 (en) Image processing method and apparatus, storage medium and electronic device
WO2020215552A1 (en) Multi-target tracking method, apparatus, computer device, and storage medium
JP5569990B2 (en) Attribute determination method, attribute determination apparatus, program, recording medium, and attribute determination system
US20070116364A1 (en) Apparatus and method for feature recognition
CN109299658B (en) Face detection method, face image rendering device and storage medium
US10922531B2 (en) Face recognition method
CN104951773A (en) Real-time face recognizing and monitoring system
Chauhan et al. Study & analysis of different face detection techniques
CN105022999A (en) Man code company real-time acquisition system
JP6157165B2 (en) Gaze detection device and imaging device
CN106033539A (en) Meeting guiding method and system based on video face recognition
JPWO2008035411A1 (en) Mobile object information detection apparatus, mobile object information detection method, and mobile object information detection program
Putro et al. Adult image classifiers based on face detection using Viola-Jones method
CN109986553B (en) Active interaction robot, system, method and storage device
CN112395922A (en) Face action detection method, device and system
CN108197593B (en) Multi-size facial expression recognition method and device based on three-point positioning method
KR20190093372A (en) Make-up evaluation system and operating method thereof
CN114387670A (en) Gait recognition method and device based on space-time feature fusion and storage medium
CN113052087A (en) Face recognition method based on YOLOV5 model
CN112668357A (en) Monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination