CN113422982B - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN113422982B
Authority
CN
China
Prior art keywords
face
feature
video
detection result
sequence
Prior art date
Legal status
Active
Application number
CN202110965118.8A
Other languages
Chinese (zh)
Other versions
CN113422982A
Inventor
陈观钦
陈远
王摘星
陈斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110965118.8A
Publication of CN113422982A
Application granted
Publication of CN113422982B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/64322IP
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of this application disclose a data processing method, device, equipment and storage medium, relating to the technical field of image recognition. The method comprises the following steps: acquiring a face video to be detected; extracting an image frame from the face video and cropping a face frame diagram from the image frame; performing living body detection based on facial feature information contained in the face frame diagram to obtain a first detection result; selecting a plurality of image frames from the face video to generate a frame image sequence; performing living body detection based on spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; and determining, according to the first detection result and the second detection result, whether the face video is a face living body video. By detecting the face video at both the picture level and the video level, the detection can draw on multiple kinds of feature information such as texture, illumination, face state change and face spatial information, which improves the accuracy of living body detection and reduces the probability of missed detection.

Description

Data processing method, device, equipment and storage medium
Technical Field
The embodiments of this application relate to the technical field of image recognition, and in particular to a data processing method, device, equipment and storage medium.
Background
Face recognition technology has been widely used, but a face can easily be forged by means of photos, videos and the like, so living body detection needs to be performed on the face.
In the related art, a face living body detection method based on texture differences is provided. Non-living faces, such as printed face pictures and 3D (three-dimensional) masks, differ in texture from living faces collected by a device, and face living body detection is realized by exploiting this texture difference.
However, this approach considers only texture information; there is still a risk of it being bypassed by high-fidelity videos or pictures, resulting in false detections.
Disclosure of Invention
The embodiments of this application provide a data processing method, device, equipment and storage medium, which can improve the accuracy of face living body detection and reduce the probability of missed detection. The technical solutions provided by the embodiments of this application are as follows.
According to an aspect of an embodiment of the present application, there is provided a data processing method, including:
acquiring a face video to be detected;
extracting an image frame from the face video, and cropping a face frame diagram from the image frame; performing living body detection based on facial feature information contained in the face frame diagram to obtain a first detection result; the facial feature information is used for representing texture and illumination color and luster features of a face region in the face frame diagram;
selecting a plurality of image frames from the face video to generate a frame image sequence; performing living body detection based on the spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; the spatiotemporal feature information is used for representing human face state change features and human face background space features in the frame image sequence;
and determining whether the face video is a face living body video or not according to the first detection result and the second detection result.
According to an aspect of an embodiment of the present application, there is provided a method for training a data processing model, the method including:
acquiring a first sample set and a second sample set, wherein the first sample set comprises a plurality of first training samples, each first training sample being a face frame diagram cropped from an image frame of a face video, and the second sample set comprises a plurality of second training samples, each second training sample being a frame image sequence acquired from a face video;
training a picture detection model by using the first sample set; the picture detection model is used for performing living body detection based on facial feature information contained in a face frame diagram to obtain a first detection result; the facial feature information is used for representing texture and illumination color and luster features of a face region in the face frame diagram;
and training a video detection model by using the second sample set; the video detection model is used for performing living body detection based on spatio-temporal feature information contained in a frame image sequence to obtain a second detection result; the spatio-temporal feature information is used for representing face state change features and face background spatial features in the frame image sequence.
According to an aspect of an embodiment of the present application, there is provided a data processing apparatus, including:
the video acquisition module is used for acquiring a human face video to be detected;
the first detection module is used for extracting an image frame from the face video and cropping a face frame diagram from the image frame; performing living body detection based on facial feature information contained in the face frame diagram to obtain a first detection result; the facial feature information is used for representing texture and illumination color and luster features of a face region in the face frame diagram;
the second detection module is used for selecting a plurality of image frames from the face video and generating a frame image sequence; performing living body detection based on the spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; the spatiotemporal feature information is used for representing human face state change features and human face background space features in the frame image sequence;
and the living body detection module is used for determining whether the face video is the face living body video according to the first detection result and the second detection result.
According to an aspect of an embodiment of the present application, there is provided an apparatus for training a data processing model, the apparatus including:
the system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a first sample set and a second sample set, the first sample set comprises a plurality of first training samples, the first training samples are face image frames intercepted from image frames of a face video, the second sample set comprises a plurality of second training samples, and the second training samples are frame image sequences acquired from the face video;
the first training module is used for training a picture detection model by using the first sample set; the picture detection model is used for performing living body detection based on facial feature information contained in a face frame diagram to obtain a first detection result;
and the second training module is used for training a video detection model by using the second sample set; the video detection model is used for performing living body detection based on spatio-temporal feature information contained in a frame image sequence to obtain a second detection result.
According to an aspect of embodiments of the present application, there is provided a computer device, including a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the above-mentioned data processing method, or implement the training method of the above-mentioned data processing model.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the above-mentioned data processing method or implement the training method of the above-mentioned data processing model.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the data processing method or the training method of the data processing model.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the face block diagram and the frame diagram sequence are obtained from the face video, the face video is subjected to in-vivo detection from two aspects of the image and the video, the texture and the illumination color characteristic of the face area are extracted from the face block diagram to perform in-vivo detection, and the face state change characteristic and the face background space characteristic are extracted from the frame diagram sequence to perform in-vivo detection, so that the in-vivo detection can be performed based on multi-aspect characteristic information, the accuracy of the in-vivo detection is improved, and the omission probability is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic illustration of an environment for implementing an embodiment provided by an embodiment of the present application;
FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present application;
FIG. 3 is a flow chart of a data processing method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a picture detection model provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a picture detection model according to an embodiment of the present application;
FIG. 6 is a flow chart of a data processing method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a video detection model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a video detection model provided by an embodiment of the present application;
FIG. 9 is a flow diagram of a method for training a data processing model according to one embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an application scenario of a data processing method according to an embodiment of the present application;
FIG. 11 is a block diagram of a data processing apparatus provided in one embodiment of the present application;
FIG. 12 is a block diagram of a data processing model training apparatus provided in one embodiment of the present application;
fig. 13 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Refer to fig. 1, which illustrates a schematic diagram of an implementation environment of an embodiment of the present application. The implementation environment may be realized as a system architecture for data processing (such as face living body detection) and may include: a terminal 100 and a server 200.
The terminal 100 may be an electronic device such as a mobile phone, a tablet Computer, a vehicle-mounted terminal (car machine), a wearable device, a PC (Personal Computer), a door access device, an unmanned terminal, and the like. A client running a target application program, which may be a game application program or other application programs providing a data processing function (e.g., human face liveness detection), such as a social application program, a shopping application program, a payment application program, a financial application program, a life service application program, etc., may be installed in the terminal 100, which is not limited in this application. The form of the target Application is not limited in the present Application, and may include, but is not limited to, an App (Application program) installed in the terminal 100, an applet, and the like, and may be a web page form.
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The server 200 may be a background server of the target application program, and is configured to provide a background service for a client of the target application program.
The terminal 100 and the server 200 may communicate with each other through a network, such as a wired or wireless network.
In the data processing method (also referred to as a living human face detection method) provided by the embodiment of the application, the execution subject of each step may be a computer device, and the computer device refers to an electronic device with data calculation, processing and storage capabilities. Taking the embodiment environment shown in fig. 1 as an example, the data processing method may be executed by the terminal 100 (for example, the data processing method is executed by a client installed with a target application running in the terminal 100), may be executed by the server 200, or may be executed by the terminal 100 and the server 200 in an interactive cooperation manner, which is not limited in this application.
In addition, the technical scheme of the application can be combined with the block chain technology. For example, some data (such as face video, detection result, etc.) involved in the data processing method disclosed in the present application may be saved on the blockchain.
For convenience of explanation, in the following method embodiments, only the execution subject of each step of the data processing method is described as a computer device.
Referring to fig. 2, a flowchart of a data processing method according to an embodiment of the present application is shown. The method can include the following steps (210-260).
Step 210, acquiring a face video to be detected.
The face video refers to a video containing a face, the face video to be detected refers to a face video for which it cannot be determined whether the face video is a living body video, and subsequently, whether the face video is a living body video is determined by the method flow described in this embodiment.
In some embodiments, the terminal acquires a face video to be detected through the camera, and then sends the face video to the server, and the server executes the method provided by this embodiment to perform live body detection on the face video. Of course, in some other embodiments, the terminal may also perform live detection on the acquired face video. Or, the terminal or the server obtains the face video to be detected through other channels, for example, obtains the face video to be detected from a storage device or a database, which is not limited in this application.
Step 220, extracting image frames from the face video, and capturing a face frame diagram from the image frames.
In some embodiments, the face video is firstly subjected to framing processing to obtain a frame atlas of the face video, where the frame atlas includes each image frame of the face video. Optionally, an image frame is selected from the frame atlas, and face liveness detection based on the image is performed.
Optionally, after selecting an image frame from the frame atlas, detecting whether the image frame contains a human face, if so, using the image frame to perform picture-based human face living body detection, and if not, selecting an image frame from the frame atlas again.
Optionally, after the frame atlas is obtained, the image frames not containing the human face may be removed from the frame atlas, so as to obtain a frame atlas after being screened. And then selecting image frames from the screened frame image set to carry out human face living body detection based on the images.
In addition, in the embodiment of the present application, the selection manner of the image frame is not limited. For example, image frames may be randomly selected from a frame atlas for image-based face liveness detection. For example, an image frame at a predetermined position (for example, the 100 th image frame in the frame image set) may be selected from the frame image set, and face live body detection may be performed by using the image.
The number of image frames selected from the frame image set may be one or more, and picture-based face living body detection is performed for each selected image frame respectively.
The human face frame refers to an image area containing a human face in an image frame. For example, the image region corresponding to the face frame diagram may be a region where a minimum rectangular frame containing a face is located. In some embodiments, the face location in the image frame is detected by a face location model, resulting in a face frame diagram. Illustratively, the face localization model may be a CNN (Convolutional Neural Network) model, such as an MT-CNN (Multi-Task CNN) model.
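As an illustration of this step (not part of the original embodiment), the following minimal Python sketch grabs one image frame from a face video and crops the face frame diagram. It uses an OpenCV Haar cascade as a stand-in face localizer, whereas the embodiment above describes a CNN-based localization model such as MT-CNN; the function name, file path handling and frame index are assumptions.

```python
import cv2

def sample_face_frame_diagram(video_path: str, frame_index: int = 100):
    """Extract one image frame from the face video and crop the face frame diagram.

    The Haar cascade below is used purely for illustration; the embodiment
    above uses a CNN face localization model such as MT-CNN.
    """
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)    # jump to the selected frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return None                                  # frame could not be read

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None                                  # no face: select another frame
    x, y, w, h = boxes[0]                            # minimal rectangle containing the face
    return frame[y:y + h, x:x + w]                   # the face frame diagram
```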
Step 230, performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the face feature information is used for representing texture and illumination color features of a face area in the face block diagram.
The face region refers to the region corresponding to the face in the face frame diagram. The texture features of the face region reflect the pattern or texture of that region: the face region has tissue-structure arrangement attributes that change slowly or periodically, and the texture features are expressed through the gray-level distribution of pixels and their surrounding spatial neighborhoods. The illumination color and luster features of the face region reflect the illumination conditions of the region and its color and shading changes.
In the embodiment of the application, in the process of performing living body detection on a human face based on an image, the living body detection is performed according to facial feature information contained in a human face block diagram, and a first detection result is obtained. The first detection result is used for representing whether the face video is a face living body video or not, or whether an image frame extracted from the face video is a face living body image frame or not.
In some embodiments, the face block diagram is subjected to living body detection through a picture detection model, and a first detection result is output. The picture detection model may be a model obtained by machine learning training, and for the description of the picture detection model, refer to the following embodiments.
Step 240, selecting a plurality of image frames from the face video, and generating a frame image sequence.
Optionally, a plurality of image frames in the frame image set are selected and combined to generate a frame image sequence. In one example, a plurality of image frames in the frame map set are randomly extracted and combined to generate a frame map sequence. In another example, a plurality of consecutive image frames in the frame map set are selected and combined to generate a frame map sequence. The present application is not limited to the specific embodiment of selecting a plurality of image frames from a face video. The obtained frame image sequence is used for carrying out human face living body detection based on the video.
It should be noted that the sequence of each image frame included in the frame image sequence is the same as the sequence of the image frame in the face video, so that the change characteristics of the face in the face video along with time can be maintained.
Step 250, performing living body detection based on the space-time characteristic information contained in the frame image sequence to obtain a second detection result; the spatio-temporal feature information is used for representing the human face state change feature and the human face background spatial feature in the frame image sequence.
The human face state change characteristics in the frame image sequence represent the state changes of the human face in the human face video at different times. The spatial feature of the face background in the frame image sequence reflects the spatial change of the face in the face video and the change of the background of the face video. And performing living body detection according to the space-time characteristic information contained in the frame image sequence to obtain a second detection result. And the second detection result is used for representing whether the face video is a face living body video.
In some embodiments, the live body detection is performed on the frame image sequence through a video detection model, and a second detection result is output. The video detection model may be a model obtained by machine learning training, and for the description of the video detection model, refer to the following embodiments.
In some embodiments, a plurality of image frames are selected from the frame atlas to obtain an initial frame sequence. And performing edge filling on each image frame in the initial frame sequence, and adjusting the image frame to a set length-width ratio to obtain the filled image frame. And adjusting each filled image frame to a set size to obtain an adjusted image frame. Based on each adjusted image frame, a sequence of frame images is obtained, which serves as an input for the video detection model. Because the devices for recording the face video are different, the length-width ratio and the size of the face video image are different, and therefore, the image frame needs to be adjusted. First, a plurality of image frames are selected from the frame image set to obtain an initial frame sequence. And performing edge filling on each image frame in the initial frame sequence, and adjusting the image frame to the length-width ratio set by the video detection model to obtain the filled image frame. And finally, adjusting the filled image frames to the size set by the video detection model to obtain the adjusted image frames, and obtaining the frame image sequence based on the adjusted image frames.
In an exemplary embodiment, the aspect ratio set by the video detection model is 8:5 and the set size is 160 × 100 (pixels). First, a plurality of image frames are selected from the frame image set to obtain an initial frame sequence. Edge filling is performed on each image frame in the initial frame sequence to adjust its aspect ratio to 8:5, obtaining the filled image frames. Illustratively, if the aspect ratio of the image frames in the initial frame sequence is 2:1, the edges of each image frame are padded (for example, zero-padded) so that the aspect ratio becomes 8:5. Finally, each filled image frame is adjusted to 160 × 100 pixels to obtain the adjusted image frames, and the frame image sequence is obtained based on the adjusted image frames. Illustratively, a filled image frame of size 240 × 150 pixels is adjusted to 160 × 100 pixels using a resize function. The above resizing of the image frames in the initial frame sequence can be performed in software; illustratively, the initial frame sequence is processed with OpenCV interpolation-based scaling to obtain the frame image sequence.
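The padding and resizing described above can be sketched as follows. This is a minimal OpenCV example under stated assumptions: zero-padding split evenly between the two edges, bilinear interpolation, and the 160 × 100 target size taken from the example in the text.

```python
import cv2

def pad_and_resize(frame, target_w: int = 160, target_h: int = 100):
    """Zero-pad the frame edges to the 8:5 aspect ratio set by the video
    detection model, then resize to the set size (160 x 100 in the example)."""
    h, w = frame.shape[:2]
    target_ratio = target_w / target_h          # 8:5 = 1.6
    if w / h > target_ratio:                    # wider than 8:5 (e.g. 2:1): pad top/bottom
        new_h = int(round(w / target_ratio))
        pad = new_h - h
        frame = cv2.copyMakeBorder(frame, pad // 2, pad - pad // 2, 0, 0,
                                   cv2.BORDER_CONSTANT, value=0)
    elif w / h < target_ratio:                  # narrower than 8:5: pad left/right
        new_w = int(round(h * target_ratio))
        pad = new_w - w
        frame = cv2.copyMakeBorder(frame, 0, 0, pad // 2, pad - pad // 2,
                                   cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(frame, (target_w, target_h), interpolation=cv2.INTER_LINEAR)
```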
Step 260, determining whether the face video is a face living body video according to the first detection result and the second detection result.
In one example, determining whether the face video is a face living body video does not require human intervention and is performed automatically by the computer device. Illustratively, if both the first detection result and the second detection result indicate that the face video is a face living body video, the computer device determines, according to the two results, that the face video is a face living body video; if at least one of the first detection result and the second detection result indicates that the face video is not a face living body video, the computer device determines that the face video is not a face living body video. Alternatively, if at least one of the first detection result and the second detection result indicates that the face video is a face living body video, the computer device determines that the face video is a face living body video; and if both detection results indicate that the face video is not a face living body video, the computer device determines that the face video is not a face living body video.
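A minimal sketch of the two automatic fusion strategies just described (both results must indicate live, versus either result indicating live is enough); the function name and boolean interface are illustrative assumptions, not part of the embodiment.

```python
def fuse_detection_results(first_is_live: bool, second_is_live: bool,
                           require_both: bool = True) -> bool:
    """Combine the picture-level and video-level detection results.

    require_both=True : the face video is judged live only if both results say live.
    require_both=False: one result saying live is enough.
    """
    if require_both:
        return first_is_live and second_is_live
    return first_is_live or second_is_live
```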
In another example, determining whether the face video is a face living body video requires human intervention. Illustratively, if both the first detection result and the second detection result indicate that the face video is a face living body video, the computer device determines that it is a face living body video; if both results indicate that the face video is not a face living body video, the computer device determines that it is not a face living body video; if the first detection result indicates that the face video is a face living body video while the second detection result indicates that it is not, the computer device generates a report information request according to the two results so that the face video is manually reviewed, and determines whether the face video is a face living body video according to the manual review result; and if the first detection result indicates that the face video is not a face living body video while the second detection result indicates that it is, the computer device likewise generates a report information request for manual review and determines the result according to the manual review. Optionally, the report information request includes, but is not limited to, at least one of the following: the face video, the image frame extracted from the face video, the first detection result corresponding to the image frame, the frame image sequence extracted from the face video, and the second detection result corresponding to the frame image sequence, so that the relevant personnel can perform the manual review based on the information carried in the request.
It should be noted that, in the embodiment of the present application, the execution sequence of steps 220 to 230 and steps 240 to 250 is not limited, and steps 240 to 250 may be executed after steps 220 to 230, before steps 220 to 230, or simultaneously with steps 220 to 230.
In summary, in the technical solution provided by this embodiment, a face frame diagram and a frame image sequence are acquired from the face video, and living body detection is performed on the face video at both the picture level and the video level: texture and illumination color and luster features of the face region are extracted from the face frame diagram for living body detection, and face state change features and face background spatial features are extracted from the frame image sequence for living body detection. Living body detection can thus be performed based on multiple kinds of feature information, which improves the accuracy of living body detection and reduces the probability of missed detection.
In an exemplary embodiment, a flow of a method for performing living body detection based on a human face block diagram is described. As shown in FIG. 3, the method may include the following steps (310-360).
Step 310, extracting image frames from the face video, and capturing a face frame diagram from the image frames.
For the description of this step, refer to the description of step 220 in the above embodiments, and are not repeated herein.
Step 320, extracting first facial feature information from the face frame diagram, wherein the first facial feature information is used for representing texture features of the face region.
The texture features of the face region reflect that the region has tissue-structure arrangement attributes that change slowly or periodically, expressed through the gray-level distribution of pixels and their surrounding spatial neighborhoods. Extracting the texture features of the face region for living body detection makes it possible to effectively identify fake faces generated with masks.
Optionally, the picture detection model comprises a first picture detection model. In one example, the first facial feature information is extracted from the face frame diagram through the first picture detection model, and living body detection is performed based on the first facial feature information. Illustratively, the first picture detection model may adopt an MLP-Mixer architecture, an architecture for vision tasks built from Multi-Layer Perceptrons (MLPs); a description of the architecture is given below.
Step 330, performing living body detection based on the first facial feature information to obtain a first candidate detection result.
The first candidate detection result is used for indicating whether the face video is a face live body video or not, or whether the image frame extracted from the face video is a face live body image frame or not.
Optionally, the first picture detection model includes a first feature extraction layer, a first pooling layer, and a first classification layer; the first feature extraction layer is used for carrying out feature extraction processing on the face frame diagram to obtain a first feature extraction result; the first pooling layer is used for performing maximum pooling on the first feature extraction result to obtain first facial feature information; the first classification layer is used for carrying out living body detection based on the first face feature information to obtain a first candidate detection result.
Exemplarily, the first feature extraction layer performs feature extraction on the face frame diagram through a CNN to obtain the first feature extraction result; the first pooling layer performs maximum pooling on the first feature extraction result to obtain the first facial feature information; and the first classification layer performs live/non-live classification based on the first facial feature information through a fully connected classification layer to obtain the first candidate detection result.
Step 340, extracting second facial feature information from the face frame diagram, wherein the second facial feature information is used for representing the illumination color and luster features of the face region.
Extracting the illumination color and luster features of the face region preserves the illumination and color information of the face frame diagram, which makes it possible to effectively identify fake faces generated from re-shot pictures and re-shot videos.
Optionally, the picture detection model further includes a second picture detection model. In one example, second face feature information is extracted from the face frame diagram through a second picture detection model, and living body detection is performed based on the second face feature information. Illustratively, the second picture detection model may also employ an MLP-Mixer architecture.
Step 350, performing living body detection based on the second facial feature information to obtain a second candidate detection result.
The second candidate detection result is used for indicating whether the face video is the face living body video or not, or whether the image frame extracted from the face video is the face living body image frame or not.
Optionally, the second picture detection model includes a second feature extraction layer, a second pooling layer, and a second classification layer; the second feature extraction layer is used for carrying out feature extraction processing on the face frame diagram to obtain a second feature extraction result; the second pooling layer is used for carrying out average pooling on the second feature extraction result to obtain second face feature information; and the second classification layer is used for performing living body detection based on the second face characteristic information to obtain a second candidate detection result.
Exemplarily, the second feature extraction layer performs feature extraction on the face frame diagram through a CNN to obtain the second feature extraction result; the second pooling layer performs average pooling on the second feature extraction result to obtain the second facial feature information; and the second classification layer performs live/non-live classification based on the second facial feature information through a fully connected classification layer to obtain the second candidate detection result.
The structure of the first picture detection model and that of the second picture detection model may be the same or different, which is not limited in this application. For example, the number of feature extraction layers and the specific structure of the feature extraction layers in the two models may differ. In the model training process, training the first picture detection model with the first training samples yields a first picture detection model that performs living body detection based on the texture features of the face region; similarly, training the second picture detection model with the first training samples yields a second picture detection model that performs living body detection based on the illumination color and luster features of the face region.
Step 360, determining a first detection result according to the first candidate detection result and the second candidate detection result.
In one example, if the first candidate detection result and the second candidate detection result both indicate that the face video is the live face video, generating a first detection result indicating that the face video is the live face video; if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is not the face live video, a first detection result indicating that the face video is not the face live video is generated. In this way, as long as one candidate detection result is judged to be not the face live video, the first detection result which is not the face live video is output, and the coverage rate of non-live recognition is improved.
In another example, if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is a live-face video, a first detection result indicating that the face video is the live-face video is generated; and if the first candidate detection result and the second candidate detection result both indicate that the face video is not the face live video, generating a first detection result for indicating that the face video is not the face live video. In this way, when both candidate detection results are determined not to be the face live video, the first detection result not being the face live video is output, which is helpful for improving the accuracy of non-live recognition.
Of course, in some other embodiments, in a case where the first candidate detection result indicates that the face video is a face live video and the second candidate detection result indicates that the face video is not a face live video, it may be determined whether the face video is a face live video through manual intervention; or, in the case that the first candidate detection result indicates that the face video is not the face live video and the second candidate detection result indicates that the face video is the face live video, the human intervention may also be performed to determine whether the face video is the face live video.
It should be noted that, in the embodiment of the present application, the execution sequence of steps 320 to 330 and steps 340 to 350 is not limited, and steps 340 to 350 may be executed after steps 320 to 330, before steps 320 to 330, or simultaneously with steps 320 to 330.
Referring to fig. 4, a diagram of a picture detection model is exemplarily shown. The picture detection model comprises a first picture detection model and a second picture detection model.
Illustratively, the face frame diagram is first divided into blocks, and the obtained image blocks (such as the image blocks identified by 1, 2, 3, and 4 in fig. 4) are input into the first picture detection model and the second picture detection model. Taking the first picture detection model as an example, first facial feature information is extracted from the image blocks through the first feature extraction layer; the first facial feature information is input into the first pooling layer, where a maximum pooling operation is performed to obtain pooled first facial feature information; and the pooled first facial feature information is input into the first classification layer to obtain the first candidate detection result. Similarly, second facial feature information is extracted from the image blocks through the second feature extraction layer; the second facial feature information is input into the second pooling layer, where an average pooling operation is performed to obtain pooled second facial feature information; and the pooled second facial feature information is input into the second classification layer to obtain the second candidate detection result. Finally, the first detection result is determined according to the first candidate detection result and the second candidate detection result.
It should be noted that, in this embodiment of the application, the execution order of the first picture detection model and the second picture detection model is not limited, and the second picture detection model may be started after the first picture detection model, may be started before the first picture detection model, or may be started simultaneously with the first picture detection model.
In an exemplary embodiment, referring to fig. 5, taking the case where the picture detection model adopts an MLP-Mixer architecture, the picture detection model may include a mapping layer, a feature extraction layer, a pooling layer and a classification layer. The face frame diagram is divided into blocks to obtain a plurality of image blocks; the mapping layer then maps the image blocks into feature vectors corresponding to the respective image blocks, and these feature vectors are input together into the feature extraction layer. The feature extraction layer extracts facial feature information using multiple Mixer Layers, for example 4 Mixer Layers. The pooling layer pools the facial feature information to obtain pooled facial feature information. The pooled facial feature information is input into the classification layer, which obtains a candidate detection result from it. The pooling layer can adopt row-based pooling, column-based pooling, or both, with the two pooling results combined to obtain the pooled facial feature information.
Both the first picture detection model and the second picture detection model can adopt the model architecture shown in fig. 5. The difference between them is that the pooling layer of the first picture detection model (i.e., the first pooling layer introduced above) performs a maximum pooling operation, which is better suited to extracting texture features, while the pooling layer of the second picture detection model (i.e., the second pooling layer introduced above) performs an average pooling operation, which is better suited to extracting illumination color and luster features. The structure of the two models may otherwise be the same or different, which is not limited in this application; for example, the number and specific structure of the feature extraction layers may differ. Illustratively, the first picture detection model extracts the first facial feature information using 4 Mixer Layers, and the second picture detection model extracts the second facial feature information using 6 Mixer Layers.
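A minimal PyTorch sketch of a picture detection model in this style is given below. It assumes 3-channel inputs, a patch-embedding mapping layer, and illustrative image size, patch size, hidden dimensions and depths; only the choice between maximum pooling and average pooling distinguishes the two branches, as described above. This is a sketch under stated assumptions, not the exact model of the embodiment.

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """One Mixer Layer: token-mixing MLP across patches, then channel-mixing MLP."""
    def __init__(self, n_tokens: int, dim: int, token_hidden: int = 256, channel_hidden: int = 512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(n_tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x):                                    # x: (B, n_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

class PictureDetectionModel(nn.Module):
    """Mapping layer -> Mixer Layers -> max or average pooling -> classification layer."""
    def __init__(self, image_size: int = 224, patch_size: int = 16, dim: int = 128,
                 depth: int = 4, pooling: str = "max"):
        super().__init__()
        self.mapping = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        n_tokens = (image_size // patch_size) ** 2
        self.mixers = nn.Sequential(*[MixerLayer(n_tokens, dim) for _ in range(depth)])
        self.pooling = pooling
        self.classifier = nn.Linear(dim, 2)                  # live vs. non-live

    def forward(self, x):                                    # x: (B, 3, image_size, image_size)
        tokens = self.mapping(x).flatten(2).transpose(1, 2)  # (B, n_tokens, dim)
        tokens = self.mixers(tokens)
        pooled = tokens.max(dim=1).values if self.pooling == "max" else tokens.mean(dim=1)
        return self.classifier(pooled)

# First model: max pooling with 4 Mixer Layers; second model: average pooling with 6 Mixer Layers.
first_model = PictureDetectionModel(depth=4, pooling="max")
second_model = PictureDetectionModel(depth=6, pooling="avg")
```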
In summary, in this embodiment, face living body detection is performed on the face frame diagram through the first picture detection model and the second picture detection model, exploiting the different strengths of maximum pooling and average pooling for extracting different features, so that both the texture features and the illumination color and luster features of the face region are retained, which improves the accuracy of picture-based face living body detection.
In an exemplary embodiment, a flow of a method for performing living human face detection based on a frame image sequence is described. As shown in FIG. 6, the method may include the following steps (610-660).
Step 610, selecting a plurality of image frames from the face video, and generating a frame image sequence.
For the description of this step, refer to the description of step 240 in the above embodiments, and are not repeated herein.
Step 620, performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment spatio-temporal fusion features of the frame image sequence.
The short-time feature extraction of the frame image sequence refers to feature extraction of image frames in short-time segments in the frame image sequence, and the short-time feature extraction is used for extracting short-time segment space-time fusion features of the frame image sequence. The short-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frame in the short-time segment in the frame image sequence. Illustratively, the frame map sequence is divided into a plurality of short-time frame map sequences, and short-time feature extraction is performed on each short-time frame map sequence to obtain a plurality of shallow feature maps.
Illustratively, a multi-layer 3D CNN (Convolutional Neural Network) is adopted, with suitable convolution kernel sizes and numbers of kernels set for each layer, to extract the short-time segment spatio-temporal fusion features of the short-time frame image sequences. Optionally, the 3D CNN module may be built according to the characteristics of the service data, or an existing model such as C3D may be used.
Optionally, after the short-time feature extraction is performed on the frame image sequence, the method further includes: for frame image sequence inputs with a large spatial size, compressing the size of each image frame in the frame image sequence by a pooling operation. The pooling operation includes maximum pooling and/or average pooling.
Illustratively, for a frame image sequence with an input size of 160 × 100 × 60, three layers of 3D CNN operations are first applied, with convolution kernel sizes of 3 × 3 × 1, 3 × 3 × 2 and 3 × 3 × 2 in sequence and 24 convolution kernels per layer, which mine the associations and variations between adjacent frames in both the temporal and spatial dimensions. A maximum pooling operation with both window and stride of 2 × 2 × 1 then reduces the spatial size, resulting in shallow feature maps of size 80 × 50 × 60.
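A sketch of such a shallow 3D CNN stage in PyTorch follows. PyTorch orders 3D inputs as (batch, channels, time, height, width), so the kernels written above as 3 × 3 × 1 and 3 × 3 × 2 (height × width × time) appear here as (1, 3, 3) and (2, 3, 3); 'same' padding is assumed so that the stated 80 × 50 × 60 output size holds. The tensor layout and activation choice are assumptions.

```python
import torch
import torch.nn as nn

# Frame image sequence as a tensor: (batch, channels, time, height, width) = (B, 3, 60, 160, 100).
shallow_3d_cnn = nn.Sequential(
    nn.Conv3d(3, 24, kernel_size=(1, 3, 3), padding="same"), nn.ReLU(),
    nn.Conv3d(24, 24, kernel_size=(2, 3, 3), padding="same"), nn.ReLU(),
    nn.Conv3d(24, 24, kernel_size=(2, 3, 3), padding="same"), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # 160x100 -> 80x50, all 60 frames kept
)

x = torch.randn(1, 3, 60, 160, 100)          # one frame image sequence
shallow_feature_maps = shallow_3d_cnn(x)     # (1, 24, 60, 80, 50)
```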
Step 630, performing sequence feature extraction on the plurality of shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-time segment spatio-temporal fusion features of the frame image sequence.
The sequence feature extraction of the shallow feature maps refers to feature extraction of a sequence formed by the shallow feature maps, and the sequence feature extraction is used for extracting long-term segment space-time fusion features of the frame map sequence. The long-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the long-time segment in the frame image sequence. Illustratively, several continuous shallow feature maps are combined into a long-term sequence, and sequence feature extraction is performed to obtain a plurality of sequence feature maps.
Illustratively, the long-term segment space-time fusion features of the frame image sequence are further extracted through the ConvLSTM structure, and meanwhile, the spatial feature information can be retained to a certain extent.
Illustratively, for shallow feature maps of size 80 × 50 × 60, a convolutional recurrent neural network with 48 ConvLSTM units and a convolution kernel size of 3 × 3 outputs 48-channel sequence feature maps of size 80 × 50.
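ConvLSTM is not part of core PyTorch, so a minimal cell is sketched here under the assumption of a single layer with 48 hidden channels and a 3 × 3 kernel, matching the numbers above; looping the cell over the 60 time steps of the shallow feature maps is shown for illustration only.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gates computed with 2D convolutions."""
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, h, c):                      # x: (B, C_in, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                # cell state
        h = o * torch.tanh(c)                        # hidden state = sequence feature map
        return h, c

# 48 ConvLSTM units with a 3x3 kernel over 24-channel shallow feature maps of size 80x50.
cell = ConvLSTMCell(in_channels=24, hidden_channels=48)
B, T, H, W = 1, 60, 80, 50
h = torch.zeros(B, 48, H, W)
c = torch.zeros(B, 48, H, W)
shallow = torch.randn(B, 24, T, H, W)                # output of the 3D CNN stage
sequence_feature_maps = []
for t in range(T):
    h, c = cell(shallow[:, :, t], h, c)
    sequence_feature_maps.append(h)                  # each is a 48-channel 80x50 map
```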
Step 640, performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame image sequence.
The deep semantic extraction of the plurality of sequence feature maps refers to extracting the deep semantic of each sequence feature map, and the deep semantic extraction is used for extracting the deep spatial features of the frame map sequence. The deep spatial feature refers to a deep human face background spatial feature of an image frame in a frame image sequence. Illustratively, deep semantic extraction is performed on the plurality of sequence feature maps respectively to obtain a plurality of deep feature vectors.
Illustratively, a multi-layer 2D CNN module is adopted to extract, at a deeper spatial level, the face state change features and the face background spatial features of each image frame. Optionally, after the deep semantic extraction is performed on the plurality of sequence feature maps, the method further includes: for inputs with a large spatial size, compressing the size of each feature map by a pooling operation. The pooling operation includes maximum pooling and/or average pooling.
Illustratively, for sequence feature maps with an input size of 80 × 50, two layers of 2D CNN operations with a convolution kernel size of 3 × 3 are first applied, followed by a maximum pooling operation with a 2 × 2 window, with 64 convolution kernels per layer. Then 2D CNN operations with a 3 × 3 kernel and maximum pooling operations with a 2 × 2 window are applied alternately while the number of convolution kernels is doubled; for example, after 4 such alternations, 256 deep feature vectors with a size of 3 × 2 are output.
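A sketch of this deep semantic extraction stage is given below, following the alternation just described; the exact channel schedule beyond the stated 64 starting kernels and 256-channel output is an assumption (channels are doubled and capped at 256), and the resulting spatial size depends on rounding.

```python
import torch
import torch.nn as nn

def deep_semantic_extractor(in_channels: int = 48) -> nn.Sequential:
    """Two 3x3 conv layers + 2x2 max pooling, then 4 alternations of 3x3 conv and
    2x2 max pooling with channel doubling (capped at the stated 256 output channels)."""
    layers = [
        nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    ]
    channels = 64
    for _ in range(4):
        out = min(channels * 2, 256)                 # assumption: cap at 256 channels
        layers += [nn.Conv2d(channels, out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        channels = out
    return nn.Sequential(*layers)

extractor = deep_semantic_extractor()
seq_map = torch.randn(1, 48, 80, 50)                 # one 48-channel sequence feature map
deep = extractor(seq_map)                            # roughly (1, 256, 2, 1) after 5 poolings
deep_vector = deep.flatten(1)                        # one deep feature vector per frame
```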
Step 650, performing inter-frame feature extraction on the plurality of deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global spatio-temporal fusion features of the frame image sequence.
The inter-frame feature extraction of the deep feature vectors refers to extraction of inter-frame features between adjacent image frames in the image frame sequence, and the inter-frame feature extraction is used for extracting a global feature vector of the image frame sequence. The global feature vector refers to the fusion feature of the face state change feature and the face background spatial feature in the frame image sequence.
In an exemplary embodiment, performing inter-frame feature extraction on the plurality of deep feature vectors to obtain the global feature vector includes: merging the plurality of deep feature vectors to obtain a feature vector matrix; performing convolution processing on the feature vector matrix with a plurality of convolution kernels of different scales to obtain a plurality of feature vectors, wherein the feature vectors are used for representing deep spatio-temporal fusion features of the frame image sequence; fusing the plurality of feature vectors to obtain a fused feature vector; and performing gated filtering on the fused feature vector to obtain the global feature vector.
Illustratively, the plurality of deep feature vectors are merged into a feature vector matrix, and features are extracted from the feature vector matrix through multiple layers of sliding 1D CNN operations with different widths. Shallow features are first extracted from the feature matrix through a first sliding 1D CNN layer; a second sliding 1D CNN layer is stacked on the output of the first, a third on the output of the second, and so on, so that stacking multiple sliding 1D CNN layers expands the receptive field layer by layer and extracts higher-level abstract features hierarchically. After the convolution processing with convolution kernels of different scales, the method further includes: performing a pooling operation on the obtained convolution feature vectors to obtain the feature vectors, where the pooling operation includes maximum pooling and/or average pooling. The plurality of feature vectors are then fused to obtain a fused feature vector, and gated filtering is performed on the fused feature vector to obtain the global feature vector.
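The stacked multi-width 1D convolutions can be sketched as follows, assuming 6 widths, 3 layers per width and 64 kernels per layer as in the fig. 8 example below; the specific kernel widths and the dimension of each deep feature vector are assumptions.

```python
import torch
import torch.nn as nn

class MultiWidthConv1d(nn.Module):
    """Per-width stacks of sliding 1D convolutions over the frame-wise feature vector
    matrix, each branch max-pooled over time and the branch outputs concatenated."""
    def __init__(self, dim: int, widths=(2, 3, 4, 5, 6, 7), channels: int = 64, depth: int = 3):
        super().__init__()
        self.branches = nn.ModuleList()
        for w in widths:
            layers, cin = [], dim
            for _ in range(depth):                       # 3 stacked sliding 1D CNN layers
                layers += [nn.Conv1d(cin, channels, w, padding=w // 2), nn.ReLU()]
                cin = channels
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x):                                # x: (B, dim, T), one column per frame
        pooled = [branch(x).max(dim=-1).values for branch in self.branches]  # max pooling over time
        return torch.cat(pooled, dim=-1)                 # fused feature vector

fuser = MultiWidthConv1d(dim=512)                        # per-frame deep feature dimension (assumed)
matrix = torch.randn(1, 512, 60)                         # feature vector matrix: 60 frames
fused = fuser(matrix)                                    # (1, 6 * 64) fused feature vector
```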
Exemplarily, referring to fig. 8, the plurality of deep feature vectors are merged into a feature vector matrix, which is respectively processed by 3-layer 1D CNN operations of 6 different widths, with 64 convolution kernels in each layer. 64 feature vectors are then obtained through the maximum pooling operation, and these 64 feature vectors are fed into a highway (High Way) layer for gated filtering to obtain the global feature vector. The High Way layer is structured as follows:
gate = sigmoid(Input · W1^T)
trans = tanh(Input · W2^T)
output = trans ⊙ gate + Input ⊙ (1 - gate)
For the parameters in the above formulas, the definitions are as follows:
gate represents the gating value of the High Way layer, sigmoid() represents the sigmoid function, Input represents the layer input, and W1^T represents the (transposed) weight matrix of the gating branch.
trans represents the transform value of the High Way layer, tanh() represents the tanh function, Input represents the layer input, and W2^T represents the (transposed) weight matrix of the transform branch.
output is the output of the High Way layer, namely the global feature vector, and ⊙ denotes the element-wise product.
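To make the inter-frame feature extraction and the High Way gating above concrete, a sketch is given below. tf.keras is assumed, and the six convolution widths (2 to 7), the use of global max pooling per branch, and concatenation as the fusion step are assumptions where the description leaves the choice open.

```python
# Sketch of inter-frame feature extraction: stack the deep feature vectors into a
# feature vector matrix, apply 3-layer sliding 1D CNNs of 6 different widths
# (64 kernels per layer), max-pool each branch, fuse, then apply the High Way
# gating of the formulas above:
#   gate = sigmoid(Input . W1^T), trans = tanh(Input . W2^T),
#   output = trans * gate + Input * (1 - gate)      (element-wise)
import tensorflow as tf
from tensorflow.keras import layers

class Highway(layers.Layer):
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.gate_dense = layers.Dense(dim, activation="sigmoid")   # holds W1
        self.trans_dense = layers.Dense(dim, activation="tanh")     # holds W2

    def call(self, x):
        gate = self.gate_dense(x)
        trans = self.trans_dense(x)
        return trans * gate + x * (1.0 - gate)

def build_interframe_extractor(num_frames=40, feature_dim=256,
                               widths=(2, 3, 4, 5, 6, 7), filters=64):
    # Input: the feature vector matrix (one deep feature vector per image frame).
    seq = tf.keras.Input(shape=(num_frames, feature_dim))
    branches = []
    for w in widths:                       # one branch per convolution width
        x = seq
        for _ in range(3):                 # three stacked sliding 1D CNN layers
            x = layers.Conv1D(filters, w, padding="same", activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(x))   # 64 features per branch
    fused = layers.Concatenate()(branches)                # fusion of branch features
    global_vec = Highway()(fused)                         # gated filtering
    return tf.keras.Model(seq, global_vec, name="interframe_extractor")
```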
And 660, performing living body detection based on the global feature vector to obtain a second detection result.
Illustratively, the global feature vector undergoes non-linear and dimensional conversion through a fully connected layer, and the second detection result is finally output through a classification fully connected layer. Optionally, to prevent overfitting, regularization may be applied to the weight parameters of the fully connected layers.
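The classification step described above can be sketched as follows; the layer width, dropout rate and L2 coefficient are assumptions, since only the use of fully connected layers, a classification layer and weight regularization is specified.

```python
# Sketch of the classification head: fully connected layers for non-linear and
# dimensional conversion, L2-regularized weights to limit overfitting, and a
# two-way softmax that outputs the second detection result (live vs. not live).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def classification_head(global_vec):
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(global_vec)
    x = layers.Dropout(0.5)(x)
    return layers.Dense(2, activation="softmax",
                        kernel_regularizer=regularizers.l2(1e-4))(x)

# Usage: attach the head to the global feature vector produced upstream.
global_vec = tf.keras.Input(shape=(384,))          # assumed fused feature size
second_result = classification_head(global_vec)
```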
In the embodiment of the application, the video detection model comprises a short-time feature extraction layer, a sequence feature extraction layer, a deep semantic extraction layer, an inter-frame feature extraction layer and a third classification layer.
The short-time feature extraction layer is used for carrying out short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature images.
The sequence feature extraction layer is used for performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps.
The deep semantic extraction layer is used for carrying out deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors.
The interframe feature extraction layer is used for performing interframe feature extraction on the deep feature vectors to obtain global feature vectors.
And the third classification layer is used for performing living body detection based on the global feature vector to obtain a second detection result.
For the description of the specific structure of each functional layer, reference is made to the above description and no further description is made here.
In summary, the method adopts a space-time-compressed post-fusion structure for the frame image sequence data: the shallow short-time feature extraction layer and sequence feature extraction layer are first used to obtain sequence feature maps containing spatio-temporal fusion feature information, the deep spatial features of the face are then characterized by the deep semantic extraction layer, and finally the inter-frame feature extraction layer applies temporal compression. This avoids losing the face state change features between image frames, strengthens the characterization of the face background spatial features, and reduces false alarms and misses of the video detection model.
An exemplary embodiment of the present application further provides a training method of a data processing model (also referred to as a face living body detection model). The method may be applied in computer devices such as PCs, servers and the like. The method may include the following steps.
1. Obtaining a first sample set and a second sample set, wherein the first sample set comprises a plurality of first training samples, the first training samples are face frame images cropped from image frames of a face video, the second sample set comprises a plurality of second training samples, and the second training samples are frame image sequences obtained from the face video.
In one example, referring to fig. 9, after the first sample set is acquired, the similarity between each first training sample and a non-living template is calculated; if the similarity is greater than or equal to a threshold, the first training sample is determined to be a negative sample; if the similarity is less than the threshold, the first training sample is determined to be a positive sample.
The non-living template refers to a known fake-face sample and can be obtained from an existing face recognition sample library. A similarity threshold is set, for example 95%: if the similarity is greater than or equal to 95%, the first training sample is determined to be a negative sample; if the similarity is less than 95%, the first training sample is determined to be a positive sample.
In another example, referring to fig. 9, after the first sample set is obtained, clustering is performed on a plurality of first training samples to obtain a plurality of sample clusters; according to the corresponding cohesion degrees of the sample clusters, selecting a first training sample in the sample cluster with the cohesion degree meeting the condition as a negative sample, and selecting a first training sample in the sample cluster with the cohesion degree not meeting the condition as a positive sample; wherein the above conditions include at least one of: the cohesion degree is larger than a threshold value, the sequences are positioned at the first N positions of the sequences according to the descending order of the cohesion degree, and N is a positive integer.
Exemplarily, clustering the first training samples through FaceNet, selecting the first training sample in the sample cluster with the cohesion degree meeting the condition as a negative sample, and selecting the first training sample in the sample cluster with the cohesion degree not meeting the condition as a positive sample. For example, the first training sample in the N sample clusters with the highest cohesion is selected as a negative sample, and the first training samples in the other sample clusters are selected as positive samples. In some embodiments, the determination of positive and negative examples may also be manually intervened.
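A sketch of this cluster-based labelling is given below. The FaceNet embedding helper extract_facenet_embeddings() is hypothetical, scikit-learn KMeans stands in for the unspecified clustering algorithm, and cohesion is measured as the negative mean distance to the cluster centre; these are assumptions where the description does not fix the details.

```python
# Sketch: cluster FaceNet embeddings of the face crops, rank clusters by cohesion,
# and label samples in the N most cohesive clusters as negative (fake) samples.
import numpy as np
from sklearn.cluster import KMeans

def label_by_clustering(face_crops, n_clusters=50, top_n_negative=5):
    embeddings = extract_facenet_embeddings(face_crops)   # hypothetical FaceNet helper
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    cohesion = []
    for c in range(n_clusters):
        members = embeddings[km.labels_ == c]
        # Higher cohesion = members packed tightly around the cluster centre.
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        cohesion.append(-dists.mean())
    negative_clusters = set(np.argsort(cohesion)[::-1][:top_n_negative])
    # 0 = negative (fake) sample, 1 = positive (live) sample.
    return np.array([0 if lbl in negative_clusters else 1 for lbl in km.labels_])
```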
In yet another example, the negative and positive samples in the second set of samples are determined from the negative and positive samples in the first set of samples. For example, a second training sample (frame image sequence) to which a first training sample (face image frame) determined to be a positive sample belongs is also determined to be a positive sample; the second training sample (frame image sequence) to which the first training sample (face image frame) determined to be a negative sample belongs is also determined to be a negative sample.
In the embodiment of the present application, for each training sample in the first sample set and the second sample set, after the training sample is divided into a positive sample and a negative sample, a corresponding label is labeled, so as to facilitate subsequent training operations.
2. Training the picture detection model by adopting a first sample set; the image detection model is used for performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the face feature information is used for representing texture and illumination color features of a face area in the face block diagram.
Optionally, a training loss corresponding to the picture detection model is calculated according to the first detection result and the label corresponding to the first training sample, and parameters of the picture detection model are adjusted based on the training loss until the picture detection model meets the training stopping condition, and the training of the picture detection model is finished.
3. Training the video detection model by adopting a second sample set; the video detection model is used for performing living body detection based on spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; the spatio-temporal feature information is used for representing the human face state change feature and the human face background spatial feature in the frame image sequence.
Optionally, a training loss corresponding to the video detection model is calculated according to the second detection result and the label corresponding to the second training sample, and the parameters of the video detection model are adjusted based on the training loss until the video detection model meets the training stopping condition, and the training of the video detection model is finished.
In addition, the stop training condition of the data processing model (such as the picture detection model and the video detection model introduced above) may be preset, for example, the stop training condition includes that the prediction accuracy of the data processing model reaches a preset threshold, such as 95%.
And when the data processing model does not meet the training stopping condition, the computer equipment continues to train the model by adopting the training sample so as to optimize the parameters of the model until the data processing model meets the training stopping condition, and finally the data processing model meeting the practical application requirement is obtained. Illustratively, the loss function for calculating the training loss adopts a two-class cross entropy loss function based on Softmax, and adopts an Adam algorithm to optimize parameters of each layer of the data processing model, and the learning rate is set to be 0.001.
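Illustratively, the training configuration described above (Softmax-based two-class cross entropy, the Adam optimizer with learning rate 0.001, and stopping once the preset accuracy threshold is reached) can be sketched as follows, assuming tf.keras and one-hot labels:

```python
# Sketch of the training setup: two-class softmax cross entropy, Adam with
# learning rate 0.001, and training stopped once validation accuracy reaches
# the preset threshold (95% here).
import tensorflow as tf

class StopAtAccuracy(tf.keras.callbacks.Callback):
    def __init__(self, target=0.95):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.target:
            self.model.stop_training = True

def train(model, train_ds, val_ds, epochs=100):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",     # softmax two-class cross entropy
                  metrics=["accuracy"])
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs,
                     callbacks=[StopAtAccuracy(0.95)])
```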
The trained data processing model can be used for implementing the face living body detection introduced in the above embodiment on the face video.
It should be noted that the use process and the training process of the data processing model are described separately in different embodiments; the content involved in the use process and in the training process corresponds to and complements each other, so that where a detailed description is not provided on one side, reference may be made to the description on the other side.
The technical scheme of data processing provided by the embodiment of the application can be applied to any application scene with human face living body detection requirements. In the following, several possible application scenarios are exemplarily presented.
1. A game real-name verification system based on face recognition:
Compared with a game real-name verification system based on identity information, a game real-name verification system based on face recognition reduces the user's interaction cost and improves the user experience, and face-scanning identity verification is more accurate than uploading identity information. In such a system, performing living body detection on the face can effectively resist identity falsification by means such as printed face photos and high-definition screen replay, and reduces the cost of manual review.
Referring to fig. 10, the game real-name verification system may include a client for face video capture and a background processing device (e.g., a server) for processing the face video. When a target user needs real-name verification, the client collects a face video through the camera of the terminal device and sends the face video of the target user to the background processing device. The background processing device calls the face living body detection model to process the face video; if it determines that the target user passes face living body detection, it executes the subsequent face verification process and unbans the target user's account after the face verification passes. Alternatively, the background processing device calls a general face detection model to process the face video and unbans the target user's account once the target user passes the general face detection model; for face videos that pass the general face detection model, the background processing device further calls the data processing model to process them, and bans the accounts of users that do not pass the data processing model.
2. A payment verification scene based on face recognition:
Compared with a payment verification scene based on fingerprint recognition, in a payment verification scene based on face recognition the user only needs to stand in front of the camera, so interaction is simpler and faster. However, the cost of faking a face is lower than that of faking a fingerprint, and it is easy to collect a picture of a user's face. Therefore, in a payment verification scene based on face recognition, living body detection needs to be performed on the face to intercept malicious payments that imitate another person's face, so as to ensure the property safety of the user.
In a payment verification scene, a 3D camera of a terminal (such as a mobile phone) can collect a face video of a target user, and then the face video of the target user is sent to a server; and the server calls a data processing model to process the face video. And under the condition that the target user is a human face living body, the server can perform a further human face verification process on the target user according to the human face video of the target user, and inform the terminal after the human face verification is passed, and the terminal executes a payment process after receiving response information of the human face verification passing.
The above only introduces two possible application scenarios, and the technical solution provided in the embodiment of the present application can also be applied to any application scenario with human face live body detection requirements, such as a terminal unlocking scenario, an application login scenario, and a check-in scenario, which is not limited in the embodiment of the present application.
Based on the video detection model, a related experiment was designed to verify its recognition effect. The hardware platform used in the experimental environment was a Core(TM) i7-8700 CPU @ 3.6GHz processor, 16 GB of memory, a 256 GB solid-state drive and a STRIX-GTX1080TI-11G graphics card. The software platform was a 64-bit operating system based on Windows 10, with Python 2.7 and TensorFlow 1.8.
Taking image sequence data of N frames of size 160 × 100 × 3 per segment, i.e., single-sample data of dimension (N, 160, 100, 3), as an input sample, the specific structural parameters and output results of the experimental setup are shown in Table 1 (some Dropout and regularization auxiliary processing used to avoid overfitting is not shown in Table 1).
Table 1: experimental data provided in one embodiment of the present application
(Table 1 is provided as an image in the original publication; its contents are not reproduced here.)
A video detection model was constructed according to the structure shown in Table 1. Experiments show that the video detection model can cover 85%-90% of fake faces, with a model accuracy of about 95%. In addition, the picture detection model can further cover the remaining 10%-15%, so the dual-model scheme covers about 98% of the fake faces across the whole set of face videos. With the overall scheme, 98% of the dummy videos among all face videos can be found by manually reviewing only 3%-4% of the full volume, which improves review efficiency by a factor of 20 compared with random full-volume manual review. Meanwhile, the two models can be iteratively optimized against each other and, combined with the efficient active discovery module, the fake-face detection scheme can be operated and optimized at low cost, effectively countering black-market attackers over the long term and raising their cost of manufacturing fake faces.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 11, a block diagram of a data processing apparatus according to an embodiment of the present application is shown. The device has the function of realizing the data processing method, and the function can be realized by hardware or by hardware executing corresponding software. The apparatus 1100 may include: a video acquisition module 1110, a first detection module 1120, a second detection module 1130, and a liveness detection module 1140.
The video acquiring module 1110 is configured to acquire a face video to be detected.
A first detection module 1120, configured to extract an image frame from the face video, and intercept a face frame from the image frame; performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the facial feature information is used for representing texture and illumination color and luster features of the face area in the face block diagram.
A second detection module 1130, configured to select a plurality of image frames from the face video, and generate a frame image sequence; performing living body detection based on the spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; and the spatiotemporal feature information is used for representing the human face state change feature and the human face background spatial feature in the frame image sequence.
And a living body detection module 1140, configured to determine whether the face video is a face living body video according to the first detection result and the second detection result.
In some embodiments, the first detection module 1120 is configured to: extracting first face feature information from the face frame diagram, wherein the first face feature information is used for representing texture features of the face region; performing living body detection based on the first face feature information to obtain a first candidate detection result; extracting second face feature information from the face block diagram, wherein the second face feature information is used for representing illumination color and luster features of the face region of the face; performing living body detection based on the second face feature information to obtain a second candidate detection result; and determining the first detection result according to the first candidate detection result and the second candidate detection result.
In some embodiments, the first candidate detection result is obtained by a first picture detection model, and the second candidate detection result is obtained by a second picture detection model; the first picture detection model comprises a first feature extraction layer, a first pooling layer and a first classification layer; the first feature extraction layer is used for performing feature extraction processing on the face block diagram to obtain a first feature extraction result; the first pooling layer is used for performing maximum pooling on the first feature extraction result to obtain the first facial feature information; the first classification layer is used for performing living body detection based on the first face feature information to obtain the first candidate detection result; the second picture detection model comprises a second feature extraction layer, a second pooling layer and a second classification layer; the second feature extraction layer is used for performing feature extraction processing on the face block diagram to obtain a second feature extraction result; the second pooling layer is used for performing average pooling on the second feature extraction result to obtain the second face feature information; and the second classification layer is used for performing living body detection based on the second facial feature information to obtain the second candidate detection result.
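As a rough illustration of the two picture detection models described above, the sketch below assumes tf.keras, a small convolutional backbone and global pooling; the backbone depth, input size and shared builder are assumptions, the essential difference being max pooling versus average pooling before the classification layer.

```python
# Sketch of the first/second picture detection models: a feature extraction stack,
# then global max pooling (first model) or global average pooling (second model),
# then a two-way softmax classification layer producing the candidate detection result.
import tensorflow as tf
from tensorflow.keras import layers

def build_picture_detector(pooling="max", input_shape=(224, 224, 3)):
    inputs = tf.keras.Input(shape=input_shape)          # cropped face frame image
    x = inputs
    for filters in (32, 64, 128):                       # feature extraction layers (assumed depth)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    if pooling == "max":
        feat = layers.GlobalMaxPooling2D()(x)           # first facial feature information
    else:
        feat = layers.GlobalAveragePooling2D()(x)       # second facial feature information
    outputs = layers.Dense(2, activation="softmax")(feat)
    return tf.keras.Model(inputs, outputs)

first_model = build_picture_detector("max")    # texture-oriented branch
second_model = build_picture_detector("avg")   # illumination/colour-oriented branch
```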
In some embodiments, the first detection module 1120 is configured to: if the first candidate detection result and the second candidate detection result both indicate that the face video is a face live video, generating the first detection result for indicating that the face video is the face live video; if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is not a face live video, generating the first detection result for indicating that the face video is not a face live video.
In some embodiments, the first detection module 1120 is configured to: if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is a face live video, generating the first detection result for indicating that the face video is the face live video; and if the first candidate detection result and the second candidate detection result indicate that the face video is not the face live video, generating the first detection result for indicating that the face video is not the face live video.
In some embodiments, the second detection module 1130 is configured to: performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame map sequence; performing interframe feature extraction on the deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global space-time fusion features of the frame image sequence; and performing living body detection based on the global feature vector to obtain the second detection result.
In some embodiments, the second detection module 1130 is configured to: merging the plurality of deep feature vectors to obtain a feature vector matrix; performing convolution processing on the feature vector matrix with a plurality of convolution kernels of different scales to obtain a plurality of feature vectors, wherein the feature vectors are used for representing deep space-time fusion features of the frame image sequence; fusing the plurality of feature vectors to obtain a fused feature vector; and performing gated filtering on the fused feature vector to obtain the global feature vector.
In some embodiments, the second detection result is obtained through a video detection model, and the video detection model includes a short-time feature extraction layer, a sequence feature extraction layer, a deep semantic extraction layer, an inter-frame feature extraction layer, and a third classification layer; the short-time feature extraction layer is used for carrying out short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, and the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; the sequence feature extraction layer is used for performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, and the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; the deep semantic extraction layer is used for performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, and the deep feature vectors are used for representing deep spatial features of the frame map sequence; the inter-frame feature extraction layer is used for performing inter-frame feature extraction on the deep layer feature vectors to obtain a global feature vector, and the global feature vector is used for representing global space-time fusion features of the frame image sequence; and the third classification layer is used for performing living body detection based on the global feature vector to obtain the second detection result.
In some embodiments, the second detection module 1130 is configured to: performing framing processing on the face video to obtain a frame atlas of the face video; selecting a plurality of image frames from the frame image set to obtain an initial frame sequence; performing edge filling on each image frame in the initial frame sequence, and adjusting the image frame to a set length-width ratio to obtain a filled image frame; adjusting each filled image frame to a set size to obtain an adjusted image frame; and obtaining the frame image sequence based on each adjusted image frame.
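A sketch of this frame-sequence preparation is given below; OpenCV/NumPy, even frame sampling and zero padding are assumptions, and the 160 × 100 target size follows the experiment section rather than being a fixed requirement.

```python
# Sketch: split the face video into frames, sample a subset, pad each frame to the
# target aspect ratio with black borders, then resize to the set size (160 x 100).
import cv2
import numpy as np

def build_frame_sequence(video_path, num_frames=40, target_hw=(160, 100)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:                                   # framing: collect the frame atlas
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)   # even sampling
    th, tw = target_hw
    target_ratio = th / tw
    out = []
    for i in idx:
        img = frames[i]
        h, w = img.shape[:2]
        if h / w < target_ratio:                # too wide: pad top/bottom
            pad = int(round(w * target_ratio)) - h
            img = cv2.copyMakeBorder(img, pad // 2, pad - pad // 2, 0, 0,
                                     cv2.BORDER_CONSTANT, value=0)
        else:                                   # too tall: pad left/right
            pad = int(round(h / target_ratio)) - w
            img = cv2.copyMakeBorder(img, 0, 0, pad // 2, pad - pad // 2,
                                     cv2.BORDER_CONSTANT, value=0)
        out.append(cv2.resize(img, (tw, th)))   # cv2.resize expects (width, height)
    return np.stack(out)                        # shape: (num_frames, 160, 100, 3)
```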
Referring to fig. 12, a block diagram of a training apparatus for a data processing model according to an embodiment of the present application is shown. The device has the function of realizing the training method of the data processing model, and the function can be realized by hardware or by hardware executing corresponding software. The apparatus 1200 may include: a sample acquisition module 1210, a first training module 1220, and a second training module 1230.
A sample acquiring module 1210, configured to acquire a first sample set and a second sample set, where the first sample set includes a plurality of first training samples, the first training samples are face images captured from image frames of a face video, the second sample set includes a plurality of second training samples, and the second training samples are frame image sequences acquired from the face video.
A first training module 1220, configured to train the picture detection model using the first sample set; the image detection model is used for performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result.
A second training module 1230, configured to train the video detection model using the second sample set; the video detection model is used for performing living body detection based on the spatio-temporal feature information contained in the frame image sequence to obtain a second detection result.
In some embodiments, the sample acquisition module 1210 is further configured to: calculating the similarity between the first training sample and a non-living body template; determining the first training sample as a negative sample if the similarity is greater than or equal to a threshold; determining the first training sample as a positive sample if the similarity is less than or equal to a threshold.
In some embodiments, the sample acquisition module 1210 is further configured to: clustering the first training samples to obtain a plurality of sample clusters; according to the cohesion degrees corresponding to the sample clusters respectively, selecting the first training sample in the sample cluster with the cohesion degree meeting the condition as a negative sample, and selecting the first training sample in the sample cluster with the cohesion degree not meeting the condition as a positive sample; wherein the conditions include at least one of: and the cohesion degree is greater than a threshold value, the sequences are positioned at the front N positions of the sequences according to the descending order of the cohesion degree, and N is a positive integer.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 13, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer device may be any electronic device having data computing, processing, and storage functions. The computer device may be used to implement the data processing method provided in the above embodiment, and may also be used to implement the training method of the data processing model provided in the above embodiment. Specifically, the method comprises the following steps:
the computer device 1300 includes a Central Processing Unit (e.g., a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), etc.) 1301, a system Memory 1304 including a RAM (Random-Access Memory) 1302 and a ROM (Read-Only Memory) 1303, and a system bus 1305 connecting the system Memory 1304 and the Central Processing Unit 1301. The computer device 1300 also includes a basic Input/Output System (I/O System) 1306 for facilitating information transfer between various devices within the server, and a mass storage device 1307 for storing an operating System 1313, application programs 1314 and other program modules 1315.
In some embodiments, the basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 1308 and input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, the computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic storage, or other storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1304 and mass storage device 1307 described above may be collectively referred to as memory.
The computer device 1300 may also operate as a remote computer connected to a network via a network, such as the internet, according to embodiments of the present application. That is, the computer device 1300 may be connected to the network 1312 through the network interface unit 1311, which is connected to the system bus 1305, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1311.
The memory also includes at least one instruction, at least one program, set of codes, or set of instructions stored in the memory and configured to be executed by one or more processors to implement the above-described data processing method, or training method of a data processing model.
In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which when executed by a processor of a computer device implements the above-mentioned data processing method, or a training method of a data processing model.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State drive), or optical disk. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the data processing method or the training method of the data processing model.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method of data processing, the method comprising:
acquiring a face video to be detected;
extracting image frames from the face video, and intercepting a face frame diagram from the image frames; performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the face feature information is used for representing texture and illumination color features of a face region in the face frame diagram, and the first detection result is used for representing whether the face video is a living face video;
selecting a plurality of image frames from the face video to generate a frame image sequence; performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; the short-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the short-time segment in the frame image sequence; performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; the long-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the long-time segment in the frame image sequence; performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame map sequence; performing interframe feature extraction on the deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global space-time fusion features of the frame image sequence; performing living body detection based on the global feature vector to obtain a second detection result, wherein the second detection result is used for representing whether the face video is a face living body video;
and determining whether the face video is a face living body video or not according to the first detection result and the second detection result.
2. The method according to claim 1, wherein the performing living body detection based on facial feature information included in the face frame diagram to obtain a first detection result comprises:
extracting first face feature information from the face frame diagram, wherein the first face feature information is used for representing texture features of the face region; performing living body detection based on the first face feature information to obtain a first candidate detection result;
extracting second face feature information from the face block diagram, wherein the second face feature information is used for representing illumination color and luster features of the face region of the face; performing living body detection based on the second face feature information to obtain a second candidate detection result;
and determining the first detection result according to the first candidate detection result and the second candidate detection result.
3. The method according to claim 2, wherein the first candidate detection result is obtained by a first picture detection model, and the second candidate detection result is obtained by a second picture detection model;
the first picture detection model comprises a first feature extraction layer, a first pooling layer and a first classification layer; the first feature extraction layer is used for performing feature extraction processing on the face block diagram to obtain a first feature extraction result; the first pooling layer is used for performing maximum pooling on the first feature extraction result to obtain the first facial feature information; the first classification layer is used for performing living body detection based on the first face feature information to obtain the first candidate detection result;
the second picture detection model comprises a second feature extraction layer, a second pooling layer and a second classification layer; the second feature extraction layer is used for performing feature extraction processing on the face block diagram to obtain a second feature extraction result; the second pooling layer is used for performing average pooling on the second feature extraction result to obtain the second face feature information; and the second classification layer is used for performing living body detection based on the second facial feature information to obtain the second candidate detection result.
4. The method of claim 2, wherein determining the first detection result from the first candidate detection result and the second candidate detection result comprises:
if the first candidate detection result and the second candidate detection result both indicate that the face video is a face live video, generating the first detection result for indicating that the face video is the face live video; if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is not a face live video, generating the first detection result for indicating that the face video is not a face live video;
or,
if at least one of the first candidate detection result and the second candidate detection result indicates that the face video is a face live video, generating the first detection result for indicating that the face video is the face live video; if the first candidate detection result and the second candidate detection result both indicate that the face video is not the face live video, generating the first detection result for indicating that the face video is not the face live video.
5. The method of claim 1, wherein the performing inter-frame feature extraction on the plurality of deep feature vectors to obtain a global feature vector comprises:
merging the plurality of deep feature vectors to obtain a feature vector matrix;
performing convolution processing on the feature vector matrix with a plurality of convolution kernels of different scales to obtain a plurality of feature vectors, wherein the feature vectors are used for representing deep space-time fusion features of the frame image sequence;
fusing the plurality of feature vectors to obtain fused feature vectors;
and performing gating filtration on the fusion feature vector to obtain the global feature vector.
6. The method of claim 1, wherein the second detection result is obtained through a video detection model, and the video detection model comprises a short-time feature extraction layer, a sequence feature extraction layer, a deep semantic extraction layer, an inter-frame feature extraction layer, and a third classification layer;
the short-time feature extraction layer is used for carrying out short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, and the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence;
the sequence feature extraction layer is used for performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, and the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence;
the deep semantic extraction layer is used for performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, and the deep feature vectors are used for representing deep spatial features of the frame map sequence;
the inter-frame feature extraction layer is used for performing inter-frame feature extraction on the deep layer feature vectors to obtain a global feature vector, and the global feature vector is used for representing global space-time fusion features of the frame image sequence;
and the third classification layer is used for performing living body detection based on the global feature vector to obtain the second detection result.
7. The method according to any one of claims 1 to 6, wherein said selecting a plurality of image frames from said face video to generate a frame image sequence comprises:
performing framing processing on the face video to obtain a frame atlas of the face video;
selecting a plurality of image frames from the frame image set to obtain an initial frame sequence;
performing edge filling on each image frame in the initial frame sequence, and adjusting the image frame to a set length-width ratio to obtain a filled image frame;
adjusting each filled image frame to a set size to obtain an adjusted image frame;
and obtaining the frame image sequence based on each adjusted image frame.
8. A training method of a data processing model, wherein the data processing model comprises a picture detection model and a video detection model, and the method comprises the following steps:
acquiring a first sample set and a second sample set, wherein the first sample set comprises a plurality of first training samples, the first training samples are face image frames intercepted from image frames of a face video, the second sample set comprises a plurality of second training samples, and the second training samples are frame image sequences acquired from the face video;
training the picture detection model by adopting the first sample set; the image detection model is used for performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the face feature information is used for representing texture and illumination color features of a face region in the face frame diagram, and the first detection result is used for representing whether the face video is a living face video;
training the video detection model by adopting the second sample set; the video detection model is used for performing living body detection based on spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; the spatiotemporal feature information is used for representing human face state change features and human face background spatial features in the frame image sequence, and the second detection result is used for representing whether the human face video is a human face living body video;
wherein the video detection model is to:
performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; the short-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the short-time segment in the frame image sequence;
performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; the long-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the long-time segment in the frame image sequence;
performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame map sequence;
performing interframe feature extraction on the deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global space-time fusion features of the frame image sequence;
and performing living body detection based on the global feature vector to obtain the second detection result.
9. The method of claim 8, wherein after the obtaining the first set of samples, further comprising:
calculating the similarity between the first training sample and a non-living body template;
determining the first training sample as a negative sample if the similarity is greater than or equal to a threshold;
determining the first training sample as a positive sample if the similarity is less than or equal to a threshold.
10. The method of claim 8, wherein after the obtaining the first set of samples, further comprising:
clustering the first training samples to obtain a plurality of sample clusters;
according to the cohesion degrees corresponding to the sample clusters respectively, selecting the first training sample in the sample cluster with the cohesion degree meeting the condition as a negative sample, and selecting the first training sample in the sample cluster with the cohesion degree not meeting the condition as a positive sample;
wherein the conditions include at least one of: and the cohesion degree is greater than a threshold value, the sequences are positioned at the front N positions of the sequences according to the descending order of the cohesion degree, and N is a positive integer.
11. A data processing apparatus, characterized in that the apparatus comprises:
the video acquisition module is used for acquiring a human face video to be detected;
the first detection module is used for extracting image frames from the face video and intercepting face frame diagrams from the image frames; performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result; the face feature information is used for representing texture and illumination color features of a face region in the face frame diagram, and the first detection result is used for representing whether the face video is a living face video;
the second detection module is used for selecting a plurality of image frames from the face video and generating a frame image sequence; performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; the short-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the short-time segment in the frame image sequence; performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; the long-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the long-time segment in the frame image sequence; performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame map sequence; performing interframe feature extraction on the deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global space-time fusion features of the frame image sequence; performing living body detection based on the global feature vector to obtain a second detection result, wherein the second detection result is used for representing whether the face video is a face living body video;
and the living body detection module is used for determining whether the face video is the face living body video according to the first detection result and the second detection result.
12. An apparatus for training a data processing model, the apparatus comprising:
the system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a first sample set and a second sample set, the first sample set comprises a plurality of first training samples, the first training samples are face image frames intercepted from image frames of a face video, the second sample set comprises a plurality of second training samples, and the second training samples are frame image sequences acquired from the face video;
the first training module is used for training the picture detection model by adopting the first sample set; the image detection model is used for performing living body detection based on facial feature information contained in the human face block diagram to obtain a first detection result, and the first detection result is used for representing whether the human face video is a human face living body video;
the second training module is used for training a video detection model by adopting the second sample set; the video detection model is used for performing living body detection based on spatio-temporal feature information contained in the frame image sequence to obtain a second detection result; the spatiotemporal feature information is used for representing human face state change features and human face background spatial features in the frame image sequence, and the second detection result is used for representing whether the human face video is a human face living body video;
wherein the video detection model is to:
performing short-time feature extraction on the frame image sequence to obtain a plurality of shallow feature maps, wherein the shallow feature maps are used for representing short-time segment space-time fusion features of the frame image sequence; the short-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the short-time segment in the frame image sequence;
performing sequence feature extraction on the shallow feature maps to obtain a plurality of sequence feature maps, wherein the sequence feature maps are used for representing long-term segment space-time fusion features of the frame map sequence; the long-time segment space-time fusion feature refers to the fusion feature of the human face state change feature and the human face background space feature of the image frames in the long-time segment in the frame image sequence;
performing deep semantic extraction on the plurality of sequence feature maps to obtain a plurality of deep feature vectors, wherein the deep feature vectors are used for representing deep spatial features of the frame map sequence;
performing interframe feature extraction on the deep feature vectors to obtain a global feature vector, wherein the global feature vector is used for representing global space-time fusion features of the frame image sequence;
and performing living body detection based on the global feature vector to obtain the second detection result.
13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement a data processing method according to any one of claims 1 to 7, or to implement a training method of a data processing model according to any one of claims 8 to 10.
14. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a data processing method according to any one of claims 1 to 7 or to implement a training method of a data processing model according to any one of claims 8 to 10.
CN202110965118.8A 2021-08-23 2021-08-23 Data processing method, device, equipment and storage medium Active CN113422982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110965118.8A CN113422982B (en) 2021-08-23 2021-08-23 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110965118.8A CN113422982B (en) 2021-08-23 2021-08-23 Data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113422982A CN113422982A (en) 2021-09-21
CN113422982B true CN113422982B (en) 2021-12-14

Family

ID=77719190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110965118.8A Active CN113422982B (en) 2021-08-23 2021-08-23 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113422982B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049518A (en) * 2021-11-10 2022-02-15 北京百度网讯科技有限公司 Image classification method and device, electronic equipment and storage medium
CN114550312A (en) * 2022-01-12 2022-05-27 北京百度网讯科技有限公司 Face living body detection method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886640B1 (en) * 2016-08-08 2018-02-06 International Business Machines Corporation Method and apparatus to identify a live face image using a thermal radiation sensor and a visual radiation sensor
CN111209883A (en) * 2020-01-13 2020-05-29 南京大学 Time sequence self-adaptive video classification method based on multi-source motion feature fusion

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9928406B2 (en) * 2012-10-01 2018-03-27 The Regents Of The University Of California Unified face representation for individual recognition in surveillance videos and vehicle logo super-resolution system
CN108805047B (en) * 2018-05-25 2021-06-25 北京旷视科技有限公司 Living body detection method and device, electronic equipment and computer readable medium
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
CN110163078A (en) * 2019-03-21 2019-08-23 腾讯科技(深圳)有限公司 The service system of biopsy method, device and application biopsy method
CN110674680B (en) * 2019-08-12 2022-03-18 珠海格力电器股份有限公司 Living body identification method, living body identification device and storage medium
CN110765923B (en) * 2019-10-18 2024-05-24 腾讯科技(深圳)有限公司 Face living body detection method, device, equipment and storage medium
CN111814574B (en) * 2020-06-12 2023-09-15 浙江大学 Face living body detection system, terminal and storage medium applying double-branch three-dimensional convolution model

Similar Documents

Publication Title
CN107766786B (en) Activity test method and activity test computing device
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
JP7165742B2 (en) LIFE DETECTION METHOD AND DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
Sharma et al. Performance analysis of moving object detection using BGS techniques in visual surveillance
CN108491848B (en) Image saliency detection method and device based on depth information
CN112052831B (en) Method, device and computer storage medium for face detection
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN113422982B (en) Data processing method, device, equipment and storage medium
Li et al. GaitSlice: A gait recognition model based on spatio-temporal slice features
WO2021051547A1 (en) Violent behavior detection method and system
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
WO2021218238A1 (en) Image processing method and image processing apparatus
Atto et al. Timed-image based deep learning for action recognition in video sequences
CN114067431A (en) Image processing method, image processing device, computer equipment and storage medium
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN112651333A (en) Silence living body detection method and device, terminal equipment and storage medium
CN110807463B (en) Image segmentation method and device, computer equipment and storage medium
CN114842466A (en) Object detection method, computer program product and electronic device
CN116975828A (en) Face fusion attack detection method, device, equipment and storage medium
CN117037244A (en) Face security detection method, device, computer equipment and storage medium
Hu et al. Structure destruction and content combination for generalizable anti-spoofing
CN111402118A (en) Image replacement method and device, computer equipment and storage medium
CN111915713A (en) Three-dimensional dynamic scene creating method, computer equipment and storage medium
CN113723310B (en) Image recognition method and related device based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051312

Country of ref document: HK