CN115512419A - Video identification method, system, electronic equipment and storage medium - Google Patents

Video identification method, system, electronic equipment and storage medium

Info

Publication number
CN115512419A
Authority
CN
China
Prior art keywords
video
audio
features
signal
face
Prior art date
Legal status
Pending
Application number
CN202211204582.6A
Other languages
Chinese (zh)
Inventor
王小东
朱羽
吕文勇
周智杰
廖浩
Current Assignee
Chengdu New Hope Finance Information Co Ltd
Original Assignee
Chengdu New Hope Finance Information Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu New Hope Finance Information Co Ltd filed Critical Chengdu New Hope Finance Information Co Ltd
Priority to CN202211204582.6A
Publication of CN115512419A
Status: Pending


Classifications

    • G06V 40/172 — Image or video recognition; human faces; classification, e.g. identification
    • G06N 3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 20/41 — Scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 — Scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/161 — Image or video recognition; human faces; detection; localisation; normalisation


Abstract

Embodiments of the application provide a video identification method, system, electronic device and storage medium. The method includes: performing audio-video separation on a collected audio-video signal to obtain an audio signal and a video signal respectively, wherein the video signal contains an object to be verified; performing feature extraction on the audio signal and the video signal to obtain audio features corresponding to the audio signal and video features corresponding to the video signal; and inputting the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network, the recognition result indicating whether the object to be verified completed the video signal under the guidance of an intermediary. This technical scheme can judge whether face recognition was completed under the guidance of an intermediary.

Description

Video identification method, system, electronic equipment and storage medium
Technical Field
The application relates to the field of face recognition, in particular to a video recognition method, a video recognition system, electronic equipment and a storage medium.
Background
In the financial industry, personal identity verification is frequently required, so face recognition technology is widely used. In the prior art, the authenticity checks applied to face recognition focus on whether the captured image shows a real, live person and whether it shows the same person as the claimed identity.
The prior art does not consider whether the person performing face recognition is doing so voluntarily and in a natural state, where a natural state means that the recognition is not being performed unwillingly under the guidance of an intermediary with a stake in the outcome. The authenticity judgment of face recognition in the prior art is therefore rather limited.
Disclosure of Invention
An object of an embodiment of the present application is to provide a video identification method, so as to solve the technical problem in the prior art that face recognition cannot determine whether the recognized person is being recognized voluntarily rather than under the guidance of an intermediary.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a video identification method, which performs audio-video separation on a collected audio-video signal to obtain an audio signal and a video signal respectively, wherein the video signal contains an object to be verified; performs feature extraction on the audio signal and the video signal to obtain audio features corresponding to the audio signal and video features corresponding to the video signal; and inputs the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network, the recognition result indicating whether the object to be verified completed the video signal under the guidance of an intermediary. This technical scheme can judge whether face recognition was completed under the guidance of an intermediary.
In this embodiment, the collected audio-video signal is first separated into an independent audio signal and a video signal, where the video signal mainly contains the video recorded during face recognition, including the face video to be verified. Corresponding feature extraction is then performed: audio features are extracted from the audio signal and video features are extracted from the video signal. The audio features and video features are input into a pre-trained neural network for recognition, and finally it is judged whether the object to be verified completed the video signal under the guidance of an intermediary.
Further, the audio features comprise the number of speakers in the audio signal and/or whether the audio signal contains a first keyword; the video features include at least one of: shooting angle features, face features and shooting scene features.
In this embodiment, the audio features and video features are explained: the audio features include the number of speakers, used to determine whether more than one person is speaking, and whether the spoken content contains the first keyword. This technical scheme enables the extraction of the audio features and the video features.
Further, the performing feature extraction on the audio signal includes: intercepting the audio signal to obtain a plurality of audio segments; classifying the plurality of audio segments, and determining the number of speakers in the audio signal according to the classification result; and/or recognizing and converting the voice in the audio signal into a text, and performing keyword recognition on the text.
In this embodiment, the audio signal is first cut, that is, segmented into a plurality of audio segments; the segments are then classified, and the classification result is used to determine the number of speakers in the audio signal, so as to determine whether more than one person is speaking. In the audio feature extraction step, speech recognition is also performed on the speech in the audio signal, the result is converted into text, and keyword recognition is performed on the text. This technical scheme enables the audio extraction to judge the number of speakers and whether keywords are contained.
Further, the object to be verified is a human face; the extracting the features of the video signal comprises: framing the video signal to obtain a plurality of frames of image signals; and performing feature extraction on the image signal to obtain the shooting angle feature, the face feature and the shooting scene feature.
In this embodiment, video features are extracted and the specific object to be detected is a human face. The video signal containing the face information is first framed to obtain multi-frame image signals, and feature extraction is then performed on the image signals to obtain the shooting angle feature, the facial feature and the shooting scene feature. This technical scheme realizes the preprocessing of the video signal to obtain the multi-frame image signals.
Further, performing feature extraction on the image signal to obtain the shooting angle feature includes: extracting face angle features from the image signal to obtain the pitch angle, yaw angle and roll angle of the face in the image signal; judging whether the face is directly facing the screen according to the pitch angle, yaw angle and roll angle; and/or performing shooting angle recognition on the image signal and judging whether the shooting angle is a self-shot angle.
In this embodiment, in the step of extracting the shooting angle feature, angle feature extraction is first performed on the face in the image signal: the pitch angle, yaw angle and roll angle of the face are obtained, and from these three angles it is judged whether the face is directly facing the screen. In addition, the step of obtaining the shooting angle feature further includes performing shooting angle recognition on the image signal and judging whether the shooting angle corresponds to a self-shot or a shot taken by another person. This technical scheme enables judging whether the person is directly facing the screen and whether the video was self-shot or shot by someone else.
Further, the performing feature extraction on the image signal to obtain the facial feature includes: acquiring an eye region and a mouth region in the image signal; extracting the features of the eye region, and judging whether the eye attention is focused on a screen; and/or, extracting the characteristics of the mouth region, and judging whether the mouth moves; and/or identifying the face image in the image signal by using a micro expression identification model to determine whether micro expressions of the face image are abnormal or not.
In this embodiment, acquiring the facial features first requires determining the eye region and mouth region of the person in the image signal. Feature extraction is then performed on the eye region to judge whether the person's eye attention is focused on the screen. Feature extraction is also performed on the mouth region to judge whether the mouth moves in the video signal, and therefore whether the person is speaking. In addition, a pre-trained micro-expression recognition model is used to recognize the face image and judge whether the micro-expression of the face is abnormal. This technical scheme enables judging whether the person's attention is focused on face recognition, whether the person speaks during face recognition, and the micro-expression during face recognition.
Further, the performing feature extraction on the image signal to obtain shooting scene features includes: acquiring a portrait background picture of the image signal; identifying the portrait background picture and judging whether the portrait background picture is indoor or not; and/or carrying out keyword identification on the portrait background picture and judging whether keywords appear in the background.
In this embodiment, to extract the shooting scene features, the background picture in which the portrait in the image signal is located is first obtained; the background picture is then recognized to judge whether the corresponding scene is indoors or outdoors. In addition, keyword recognition can be performed on the background picture to judge whether a second keyword appears. This technical scheme enables judging whether the shooting scene is indoors or outdoors and whether keywords appear in the background.
Further, the inputting the audio features and the video features into a neural network trained in advance to obtain a recognition result output by the neural network includes: generating corresponding labels for the audio features and the video features respectively; and generating a feature vector according to the label, inputting the feature vector into the pre-trained neural network, and obtaining an identification result output by the neural network.
In this embodiment, since the audio features and video features have been obtained in the preceding steps, corresponding labels are generated from them, feature vectors are generated from the labels, and the feature vectors are input into the pre-trained neural network to obtain the recognition result output by the network. This technical scheme performs a comprehensive analysis using the extracted audio and video features; the spatial relationship among the features improves the accuracy of judging whether an intermediary is involved, so that intermediary intervention in face recognition can be judged more accurately.
In a second aspect, an embodiment of the present application provides a video identification system, including: a signal separation module, configured to perform audio-video separation on the collected audio-video signal and obtain the audio signal and the video signal respectively, wherein the video signal contains an object to be verified; a feature extraction module, configured to perform feature extraction on the audio signal and the video signal to obtain audio features corresponding to the audio signal and video features corresponding to the video signal; and a judging module, configured to input the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network, the recognition result indicating whether the object to be verified completed the live-body video signal under the guidance of an intermediary.
In this embodiment, the collected audio-video signal is first separated to obtain an independent audio signal and a video signal without audio, where the video signal mainly contains the video recorded during face recognition, including the face video to be verified. Corresponding feature extraction is then performed: audio features are extracted from the audio signal and video features are extracted from the video signal. The audio and video features are input into a pre-trained neural network for recognition, and finally it is judged whether the object to be verified completed the video signal under the guidance of an intermediary.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory and a bus; the processor and the memory communicate with each other through the bus; the memory stores program instructions executable by the processor, and when the program instructions are invoked, the processor can perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of the first aspect.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings based on these drawings without inventive effort.
Fig. 1 is a schematic diagram illustrating steps of a video identification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of video feature extraction provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a step of obtaining a recognition result output by a neural network according to an embodiment of the present application;
fig. 4 shows extraction rules and labels of audio features and video features provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a deep learning model provided in an embodiment of the present application;
FIG. 6 is a schematic view of a video recognition system according to an embodiment of the present application; and
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Fig. 1 is a schematic step diagram of a video identification method according to an embodiment of the present application.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating steps of a video identification method according to an embodiment of the present application, and in fig. 1, the method specifically includes:
step 101: carrying out audio and video separation on the collected audio and video signals to respectively obtain the audio signals and the video signals; wherein the video signal comprises an object to be authenticated.
In the specific implementation of step 101, the video recognition system first performs audio-video separation on the acquired audio-video signal; this mainly refers to separating the video portion from the sound portion of the original signal, so that the audio signal of the sound portion and the signal of the video portion can be analysed separately. In one acquisition scenario, the face recognition participant records the audio-video with a terminal device such as a mobile phone during video identification, and the recording is uploaded to the video recognition system for processing. Note that the participant may record the audio-video actively, or may hold the device and record it themselves yet do so not voluntarily but under the inducement of an intermediary. In another scenario, the terminal device is held by the intermediary or mounted at a fixed position, the recording is made while the face recognition participant is unaware, and it is then uploaded to the video recognition system for processing. The object to be verified is the face recognition participant, that is, the face video information contained in the audio-video recorded during face recognition.
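For illustration, a minimal sketch of this separation step is shown below, assuming the ffmpeg command-line tool is available; the file names, sample rate and codec options are illustrative, not taken from the embodiment.

```python
import subprocess

def split_audio_video(av_path: str, audio_out: str = "audio.wav", video_out: str = "video.mp4"):
    """Separate an uploaded audio-video file into an audio track and a silent video track.

    Assumes the ffmpeg CLI is installed; output paths and parameters are illustrative.
    """
    # Extract the audio track as 16 kHz mono PCM for later speech processing.
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path, "-vn", "-ac", "1", "-ar", "16000", audio_out],
        check=True,
    )
    # Keep only the video stream (no audio) for frame-level analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path, "-an", "-c:v", "copy", video_out],
        check=True,
    )
    return audio_out, video_out
```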
Step 102: and performing feature extraction on the audio signal and the video signal to obtain audio features corresponding to the audio signal and video features corresponding to the video signal.
In the specific implementation of step 102, the video recognition system performs feature extraction on the signals obtained in step 101: corresponding audio features are extracted from the audio signal, and corresponding video features are extracted from the video signal.
Step 103: and inputting the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network.
In the specific implementation of step 103, since the relevant neural network has been trained in advance, the video recognition system inputs the extracted audio features and video features into the network, which analyses them comprehensively and outputs an overall recognition result. This result is used to judge whether the video recorded by the face recognition participant was completed under the guidance of an intermediary rather than voluntarily.
Further, the audio features include the number of speakers in the audio signal and/or whether the first keyword is included. When an intermediary intervenes, the intermediary directs the operations of the person being recognized, so more than one speaker is likely to be heard; judging the number of speakers from the audio features therefore determines whether multiple people are speaking. In addition, when guiding the person being recognized, the intermediary may use certain keywords, such as instructions to blink, speak or look at the screen. Extracting these keywords during audio feature extraction allows a preliminary judgment of whether an intermediary is intervening. The video features include the shooting angle feature, the facial feature and the shooting scene feature, each of which supports a preliminary judgment of whether an intermediary is intervening.
In a preferred embodiment, the recognized features may also include whether the subject of the audio-video recording is a live person.
Further, in processing the audio signal, the video recognition system first cuts the audio signal, that is, segments the speech data, after determining the segmentation interval. In a preferred embodiment the speech data is cut at intervals of 1 second, producing a number of audio segments each 1 second long. The cutting can be performed with, for example, the ffmpeg command, an open-source set of programs for recording, converting and streaming digital audio and video. An MFCC (Mel-Frequency Cepstral Coefficients) feature map is extracted from each cut segment, and features are extracted through a trained neural network. Similarity is then computed between the audio features: the features are arranged in order, each segment is compared with the preceding one, and segments with high similarity are grouped into the same class. Segments in the same class can be regarded as coming from the same speaker, so the number of classes indicates the number of speakers and whether more than one person is speaking. Different segments of the same class are then concatenated into a new long audio clip, speech-to-text is applied to the long clip, and the resulting text is checked for the keywords.
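A rough sketch of the speaker-count estimate described above, assuming librosa is available for MFCC extraction; the mean MFCC vector stands in for the embedding of the trained speaker network, and the similarity threshold is an assumption.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

def estimate_speaker_count(wav_path: str, seg_sec: float = 1.0, sim_threshold: float = 0.85) -> int:
    """Estimate the number of speakers: 1-second segments, MFCC features,
    and greedy grouping of segments by cosine similarity."""
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_sec * sr)
    centroids = []  # one representative vector per detected "speaker" class
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        feat = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20).mean(axis=1)
        feat = feat / (np.linalg.norm(feat) + 1e-8)
        # Assign the segment to the most similar existing class, or open a new one.
        sims = [float(feat @ c) for c in centroids]
        if sims and max(sims) >= sim_threshold:
            idx = int(np.argmax(sims))
            centroids[idx] = (centroids[idx] + feat) / 2.0
        else:
            centroids.append(feat)
    return len(centroids) if centroids else -1  # -1: no speech segments found
```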
In a preferred embodiment, the keywords in the speech cover coaching of the different liveness checks: for action liveness, phrases such as "nod your head", "blink" and "look to the left"; for digit liveness, phrases such as "read one, two, three, four" and "read five, six, seven, eight"; and for light liveness, phrases such as "watch the screen without moving" and "keep still and look at the screen".
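A minimal keyword-hit check on the transcribed text; the English phrases below are illustrative renderings of the prompts listed above, not the exact production keyword list.

```python
LIVENESS_KEYWORDS = [
    "nod your head", "blink", "look to the left",              # action-liveness prompts
    "read one two three four", "read five six seven eight",    # digit-liveness prompts
    "watch the screen without moving", "keep still and look at the screen",  # light-liveness prompts
]

def hit_first_keyword(transcript: str) -> int:
    """Return 1 if any coaching keyword appears in the transcribed speech, else 0."""
    text = transcript.lower()
    return int(any(k in text for k in LIVENESS_KEYWORDS))
```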
Further, the video recognition system preprocesses the video signal, firstly frames the video signal to obtain a multi-frame image signal, and then correspondingly extracts the characteristics of the multi-frame image signal, wherein the characteristic extraction is mainly processed from three aspects of shooting angle characteristics, facial characteristics and shooting scene characteristics.
Furthermore, to obtain the shooting angle feature, the video recognition system first extracts the angle features of the face. The face angle is mainly used to judge whether the person is directly facing the screen while shooting, which helps detect an intermediary filming the person unnoticed and supports a preliminary judgment of intermediary intervention. In this process, after an image is obtained, the face image part is extracted and input into a pre-trained face recognition model for angle recognition; the model estimates the pitch, yaw and roll angles of the face. When the pitch angle is larger than a first threshold, or the yaw angle is larger than a second threshold, or the roll angle is larger than a third threshold, the face is judged not to be directly facing the screen. The first, second and third thresholds may be the same or different. The condition may also be stricter: the face is judged not to be facing the screen only when the pitch angle is larger than the first threshold, the yaw angle is larger than the second threshold and the roll angle is larger than the third threshold; or when any two of the three angles are larger than their corresponding thresholds.
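A small sketch of the screen-facing check under the "any angle exceeds its threshold" rule described above; the threshold values are assumptions.

```python
def face_not_facing_screen(pitch: float, yaw: float, roll: float,
                           t_pitch: float = 20.0, t_yaw: float = 20.0, t_roll: float = 20.0) -> bool:
    """Return True if the face pose suggests the person is not directly facing the screen.

    Thresholds are illustrative; the embodiment also allows stricter rules that
    require two or all three angles to exceed their thresholds.
    """
    return abs(pitch) > t_pitch or abs(yaw) > t_yaw or abs(roll) > t_roll
```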
In addition, it can be identified whether the user is taking a selfie or being filmed by someone else during face recognition. When an intermediary induces a user to perform face recognition, the intermediary is very likely to hold the shooting device and film the user, so recognizing self-shots versus shots by others also supports a preliminary judgment of intermediary intervention.
In a preferred embodiment, whether the image is a selfie can be determined by inputting the face image into a preset selfie recognition model. The selfie recognition model is generated by deep learning, trained in advance on many selfie and non-selfie images, so that it can distinguish between the two.
In facial feature extraction, the eye region and the mouth region in the image signal need to be acquired. Generally, a partial face image is taken; its upper half is treated as the eye region and its lower half as the mouth region. To analyse the eye region, the eye-region image is input into a pre-trained eye recognition model, which judges whether the eyes are focused on the screen.
To analyse the mouth region, it is judged whether the person's mouth-region images contain both open-mouth and closed-mouth states. In a preferred embodiment, a pre-trained mouth recognition model is used: multiple frames of mouth-region images are input into the model, and if some of the frames show an open mouth and others a closed mouth, it can be judged that the user was speaking while recording the video.
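A sketch of the simple region split described above (upper half as eye region, lower half as mouth region); the 50/50 split follows the embodiment's stated heuristic rather than landmark detection.

```python
import numpy as np

def split_eye_mouth_regions(face_img: np.ndarray):
    """Split a cropped face image into an eye region and a mouth region.

    face_img is an H x W x C array from a framed video image; the half-and-half
    split is the embodiment's stated heuristic, not a landmark-based crop.
    """
    h = face_img.shape[0]
    eye_region = face_img[: h // 2]    # upper half -> eye region
    mouth_region = face_img[h // 2 :]  # lower half -> mouth region
    return eye_region, mouth_region
```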
To analyse the person's micro-expression, the video recognition system uses a pre-trained micro-expression recognition model: the face image is input into the model, which judges whether the user's expression is abnormal. The micro-expression recognition model is generated by deep learning, trained in advance on many face images, so that it can judge the emotion conveyed by the micro-expression.
Further, to recognize the shooting scene features, the background picture of the acquired image signal is first extracted. The background picture is then analysed with a pre-trained background recognition model to judge whether the background is an environment where an intermediary may be present.
In a preferred embodiment, determining whether the background suggests an intermediary involves recognizing whether it is indoors, has white walls, and is an enclosed environment.
To perform keyword recognition on the image background, the background picture of the acquired image signal is first extracted; any text appearing in the background is then recognized with a pre-trained OCR (Optical Character Recognition) model, the recognition result is input into the pre-trained background recognition model, and it is judged whether a keyword appears in the background picture. Background keywords include terms such as "agency" or "loan".
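A hedged sketch of the background keyword check, assuming pytesseract as the OCR backend (the embodiment only requires a pre-trained OCR model) and illustrative English keywords.

```python
import pytesseract  # assumed OCR backend; any pre-trained OCR model would do
from PIL import Image

BACKGROUND_KEYWORDS = ["agency", "loan"]  # illustrative stand-ins for the listed terms

def background_contains_keyword(background_img_path: str) -> int:
    """Run OCR over the extracted background picture and flag suspicious wording (1 = hit)."""
    text = pytesseract.image_to_string(Image.open(background_img_path)).lower()
    return int(any(k in text for k in BACKGROUND_KEYWORDS))
```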
The following describes the video feature extraction and fraud identification by a preferred embodiment with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic diagram of video feature extraction according to an embodiment of the present disclosure;
as shown in fig. 2, after fraud identification based on the audio data, intermediary fraud identification is performed based on the video, that is, the user video data acquired by the video recognition system is used to identify whether fraud exists. Specifically, the obtained video data is parsed and framed, and each frame image is examined from several aspects to determine whether fraud exists. The recognition is divided into face detection on the image and OCR recognition on the image. The features to be extracted in face detection are the shooting angle feature, the facial feature and the shooting scene feature of the face in the image; face detection covers shot-by-other identification, shooting scene identification, eye attention identification, micro-expression identification, mouth state identification and screen-facing identification. OCR recognition extracts the textual information in the image.
Background identification: the background of a normal person performing liveness authentication is usually complex, whereas the background at an intermediary's premises is typically indoors, for example a white wall in an enclosed environment. The background of the portrait therefore needs to be identified, and fraud is judged from the image background. For example, the acquired video is parsed and framed into images, face detection is performed on each frame, and for images containing a face the face is located according to the detected face box. The face area is filled in, the filled image is input into a pre-trained background recognition model for background recognition, and it is judged whether intermediary fraud exists in the image.
Shot-by-other identification: a genuine user usually performs liveness authentication by taking a selfie and does not need anyone else to do it on their behalf. To prevent intermediary intervention in the liveness step, for example a real person appearing in front of the camera while in reality the intermediary holds the user's phone and coaches the user from the side, which amounts to an operation performed on the user's behalf, it is possible to identify whether the portrait undergoing liveness recognition is a selfie, and thus whether fraud exists.
The specific steps are: parse and frame the acquired video into images, perform face detection on each frame, and for images containing a face input the image into a pre-trained selfie recognition model, which performs selfie recognition and judges whether intermediary fraud exists in the image. The selfie recognition model is generated by deep learning, trained in advance on many selfie and non-selfie images, so that it can distinguish between the two.
Shooting scene recognition: when a user performs liveness authentication, the shooting scene is usually indoors and rarely outdoors or in a car; if the user is in a car, the likelihood of an intermediary is high. The specific steps are: parse and frame the acquired video into images, perform face detection on each frame, and for images containing a face input the image into a pre-trained scene recognition model, which recognizes the shooting scene and judges whether intermediary fraud exists in the image.
Eye attention recognition: eye-state attention recognition is performed on the user's portrait data. In some fraud scenarios the liveness authentication is not performed by the real person; instead, without the user's knowledge, video of the user is captured covertly and that footage is used for liveness authentication. This is generally done while the user is unaware, so the user's eyes are not looking at the camera; otherwise the user would easily notice being filmed illegally. For this covert-filming type of fraud, recognizing whether the eye attention is on the screen can determine whether fraud exists.
The specific steps are: parse and frame the acquired video into images, perform face detection on each frame, crop the face from images containing a face so that only the face part remains, and acquire the eye region and mouth region from the partial face image, taking the upper half as the eye region and the lower half as the mouth region. To analyse the eye region, the eye-region image is input into a pre-trained eye recognition model, which judges whether the eyes are focused on the screen.
Micro-expression recognition: normally the user's expression is naturally relaxed during liveness authentication. If the expression in front of the camera is frightened or aggrieved, there is a strong possibility that the authentication is being performed on someone's behalf under coercion by an intermediary, so fraud can be judged from the user's micro-expression.
The specific steps are: parse and frame the acquired video into images, perform face detection on each frame, and for images containing a face input the face image into the micro-expression recognition model pre-trained by the video recognition system, judge whether the user's expression is abnormal, and thus whether fraudulent behaviour exists. The micro-expression recognition model is generated by deep learning, trained in advance on many face images, so that it can judge the emotion conveyed by the micro-expression.
Mouth state recognition: when fraud is identified from the audio, it can be judged whether the user's speech contains keywords. At the same time, the video frames can be used to judge whether the user's mouth is synchronized with the speech: if the mouth does not move while sound is present, the user is probably being coached by someone nearby, so the state of the user's mouth needs to be recognized.
The specific steps are: parse and frame the acquired video into images, perform face detection on each frame, and for images containing a face crop the face and obtain the partial image containing only the mouth; specifically, split the face image into an upper and a lower part according to a fixed ratio and take the lower part containing the mouth. Multiple frames of mouth-region images are then input into a pre-trained mouth recognition model; if some frames show an open mouth and others a closed mouth, it can be judged that the user was speaking while recording the video.
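A minimal sketch of the speaking decision over the per-frame mouth states; the 1/0 encoding of open/closed is an assumption about the mouth model's output.

```python
def user_is_speaking(mouth_states: list[int]) -> bool:
    """Decide whether the user speaks during the recording.

    mouth_states holds the per-frame output of the mouth recognition model
    (1 = open, 0 = closed); speech is inferred when both states occur, i.e.
    the mouth visibly opens and closes across the sampled frames.
    """
    return 1 in mouth_states and 0 in mouth_states
```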
Screen-facing recognition: the angle features of the face are first extracted. The face angle is mainly used to judge whether the person is directly facing the screen while shooting, which helps detect an intermediary filming the person unnoticed and supports a preliminary judgment of intermediary intervention. After an image is obtained, the face image part is extracted and input into a pre-trained face recognition model for angle recognition; the model estimates the pitch, yaw and roll angles. Each of the three angles has a corresponding threshold, and whether the face is directly facing the screen is judged against these thresholds.
OCR recognition: OCR text extraction is performed on the images containing the portrait obtained during liveness authentication, the textual information in the image is extracted, and it is judged whether the text contains preset keywords. If a preset keyword is detected, the user may have been taken by an intermediary to a designated location for liveness authentication, and intermediary fraud can be determined. Specifically, for example, text detection and character recognition are performed on each frame with a pre-trained OCR recognition model, and it is judged whether a preset keyword appears in the image.
The method separates the audio and video of the liveness video, extracts features from the separated audio and video with the algorithms above, encodes and assembles the audio and video features into a two-dimensional feature vector, and performs fraud identification with a convolutional neural network. This performs fraud identification on many aspects at once, fully mines the possible fraud conditions in the liveness video, avoids intermediary fraud during liveness recognition, improves the accuracy of fraud identification, and provides a guiding direction for fraud identification based on unstructured audio-video features.
Fig. 3 is a schematic diagram of a step of obtaining a recognition result output by a neural network according to an embodiment of the present application.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating steps of obtaining a recognition result output by a neural network according to an embodiment of the present disclosure, where in fig. 3, the specific implementation steps of the method include:
step 201: and generating corresponding labels for the audio features and the video features respectively.
Referring to fig. 4, fig. 4 is a diagram illustrating an extraction rule and a label of an audio feature and a video feature according to an embodiment of the present application;
in the specific implementation process of step 201, audio and video separation is performed on unstructured video data, feature extraction is performed on the separated audio and video, and a corresponding tag is generated according to the extracted features.
In a preferred embodiment, for audio features, the labels are defined as follows: the extraction rule is whether more speakers exist, the labels of a plurality of speakers are a few, and if no speaker label exists, the label is-1; the extraction rule is whether the keyword is hit, the hit is 1, and the miss is 0; for the video features, if the extraction rule is that whether the eyes pay attention to the screen or not, the label defines that the eye attention-free screen is 1 and 0; if the extraction rule is that whether the head part is over against the screen or not, the label defines that the head part is not over against the screen as 1 and the over-against is 0; if the extraction rule is a live video shooting scene, the label defines that the live video shooting scene is 1 in the vehicle, 0 in the room and 2 outside the vehicle; if the extraction rule is that whether the live video is self-shot or not, the label defines that if the live video is shot by the user, the label is 1, and the self-shot is 0; if the extraction rule is micro expression, the label defines that the micro expression sadness is 0, the joy is 1, the difficulty is 2 and the like; if the extraction rule is an OCR text name, the label defines that the OCR of the living portrait characters contains a keyword of 1 and does not contain the keyword of 0; if the extraction rule is that whether the mouth of the user is opened or not, the label defines that the mouth of the living body video moves to be 1 and does not move to be 0; and if the extraction rule is living portrait background identification, the label defines that the living portrait background is a white wall of 0 and the non-white wall of 1. And according to the rules, completing the generation of specific labels of the audio features and the video features. The above rule is only an example, and the label corresponding to the specific feature may also be determined according to actual conditions.
Step 202: and generating a feature vector according to the label, inputting the feature vector into the pre-trained neural network, and obtaining an identification result output by the neural network.
In the specific implementation of step 202, the labelled features are one-hot encoded to form multi-dimensional feature codes, the encoded feature vectors are concatenated into a two-dimensional image code, and the code is recognized with the pre-trained neural network to complete the final judgment on whether an intermediary intervened.
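A sketch of the one-hot encoding and assembly step, assuming a fixed code depth; the depth and the clipping rule for out-of-range values are assumptions.

```python
import numpy as np

def encode_labels(labels: list[int], depth: int = 10) -> np.ndarray:
    """One-hot encode each label and stack the codes into a 2-D "image" code.

    The result is a len(labels) x depth array that can be fed to the
    convolutional network as a single-channel input.
    """
    codes = np.zeros((len(labels), depth), dtype=np.float32)
    for row, value in enumerate(labels):
        idx = min(max(int(value), -1), depth - 1)  # clip; -1 maps to the last slot
        codes[row, idx] = 1.0
    return codes
```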
Please refer to fig. 5, fig. 5 is a schematic structural diagram of a deep learning model according to an embodiment of the present disclosure.
In a preferred embodiment, regarding the pre-training of the neural network, a deep learning model is built with PyTorch: four convolution layers change the number of channels, downsample, enlarge the receptive field and focus on extracting spatial features, and Softmax then performs the classification. For example, as shown in fig. 5, the model may be a five-layer network with four convolution layers and one Softmax layer. Two-dimensional features of 10 × 1 are input; convolution layer Conv1 changes the number of channels without other transformation, forming a 10 × 128 feature map; convolution layer Conv2 changes the number of channels and downsamples, forming a 5 × 256 feature map; convolution layer Conv3 focuses on extracting spatial features and increases the receptive field, forming a 3 × 512 feature map; convolution layer Conv4 focuses on a larger receptive field and increases the number of channels in preparation for classification; finally, Softmax performs the classification.
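A sketch of a PyTorch model matching this description under one plausible reading of the stated shapes (1-D convolutions over the ten encoded features); kernel sizes, strides, activations and the final channel count are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FraudNet(nn.Module):
    """Four-convolution + Softmax classifier sketched from the description above."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 128, kernel_size=3, padding=1)              # 10 -> 10, 128 channels
        self.conv2 = nn.Conv1d(128, 256, kernel_size=3, stride=2, padding=1)  # 10 -> 5, downsample
        self.conv3 = nn.Conv1d(256, 512, kernel_size=3, stride=2, padding=1)  # 5 -> 3, larger receptive field
        self.conv4 = nn.Conv1d(512, 1024, kernel_size=3, stride=1, padding=1) # more channels before classification
        self.head = nn.Linear(1024, num_classes)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 10) — the encoded audio/video feature vector.
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = self.act(self.conv3(x))
        x = self.act(self.conv4(x))
        x = x.mean(dim=-1)                          # pool over the remaining positions
        return torch.softmax(self.head(x), dim=-1)  # intermediary-fraud / no-fraud probabilities
```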
Fig. 6 is a schematic diagram of a video recognition system 300 according to an embodiment of the present application. Fig. 6 shows a signal separation module 301, a feature extraction module 302, and a decision module 303.
The signal separation module 301 is configured to perform audio-video separation on the collected face recognition liveness video signal and obtain the audio signal and the video signal respectively; the video signal contains an object to be verified.
In the specific implementation of the signal separation module 301, audio-video separation is first performed on the acquired video signal; this mainly refers to separating the video portion from the sound portion of the original signal so that the audio signal of the sound portion and the signal of the video portion can be analysed separately.
The feature extraction module 302 is configured to perform feature extraction on the audio signal and the video signal to obtain the audio features corresponding to the audio signal and the video features corresponding to the video signal.
In the specific implementation of the feature extraction module 302, feature extraction is performed on the signals obtained by the signal separation module 301: corresponding audio features are extracted from the audio signal, and corresponding video features are extracted from the video signal.
The judging module 303 is configured to input the audio features and the video features into a pre-trained neural network and obtain the recognition result output by the network; the recognition result indicates whether the object to be verified completed the liveness video signal under the guidance of an intermediary.
In the specific implementation of the judging module 303, since the relevant neural network has been trained in advance, the extracted audio features and video features are input into the network, which analyses them comprehensively and outputs an overall recognition result, so as to judge whether the video recorded during the user's face recognition was completed under the guidance of an intermediary rather than voluntarily.
Further, in the specific implementation of the feature extraction module 302, the audio features include the number of speakers in the audio signal and/or whether the first keyword is included. When an intermediary intervenes, the intermediary directs the operations of the person being recognized, so more than one speaker is likely to be heard; judging the number of speakers from the audio features therefore determines whether multiple people are speaking. In addition, the intermediary may use certain keywords when guiding the person being recognized, and extracting these keywords during audio feature extraction allows a preliminary judgment of whether an intermediary is intervening. The video features include the shooting angle feature, the facial feature and the shooting scene feature, each of which supports a preliminary judgment of whether an intermediary is intervening.
Further, in processing the audio signal, the feature extraction module 302 first cuts the audio signal, that is, segments the speech data, after determining the segmentation interval. In a preferred embodiment the speech data is cut at intervals of 1 second, producing a number of 1-second audio segments. Similarity is then computed between the segments: the current speech feature is compared with the preceding one, and segments with high similarity are grouped into the same class. Each class can be regarded as the same speaker, so the number of classes indicates the number of speakers and whether more than one person is speaking. Segments of the same class are then concatenated into a new long audio clip, speech-to-text is applied to the long clip, and the resulting text is checked for the keywords.
Further, the feature extraction module 302 pre-processes the video signal, first performs framing on the video signal to obtain a multi-frame image signal, and then performs corresponding feature extraction on the multi-frame image signal, where the feature extraction is mainly performed from three aspects of a shooting angle feature, a facial feature, and a shooting scene feature.
Further, the feature extraction module 302 obtains the shooting angle feature by first extracting the angle features of the face. The face angle is mainly used to judge whether the person is directly facing the screen while shooting, which helps detect an intermediary filming the person unnoticed and supports a preliminary judgment of intermediary intervention. In this process, the pitch, yaw and roll angles are estimated for the image: after the image is obtained, the face image part is extracted and input into a pre-trained face recognition model for angle recognition, and when these angles exceed certain thresholds the face is judged not to be directly facing the screen. It can also be recognized whether the user is taking a selfie or being filmed by someone else during face recognition; since an intermediary inducing a user to perform face recognition is very likely to hold the shooting device, recognizing self-shots versus shots by others also supports a preliminary judgment of intermediary intervention.
Further, in facial feature extraction the feature extraction module 302 needs to acquire the eye region and mouth region in the image signal; generally, a partial face image is taken, its upper half is treated as the eye region and its lower half as the mouth region. To analyse the eye region, the eye-region image is input into a pre-trained eye recognition model, which judges whether the eyes are focused on the screen. To analyse the mouth region, it is judged whether the mouth-region images contain both open-mouth and closed-mouth states; in a preferred embodiment, the mouth-region images are input into a pre-trained mouth recognition model, and if they contain both an open mouth and a closed mouth it can be judged that the user was speaking while recording the video. To analyse the person's micro-expression, a pre-trained micro-expression recognition model is used: the face image is input into the model, which judges whether the user's expression is abnormal.
Further, to recognize the shooting scene features, the feature extraction module 302 first extracts the background picture of the acquired image signal, then analyses it with a pre-trained background recognition model to judge whether the background is an environment where an intermediary may be present.
Further, the judging module 303 generates corresponding labels from the extracted features, encodes the labelled features into multi-dimensional feature codes, concatenates the encoded feature vectors into a two-dimensional image code, and recognizes the code with the pre-trained neural network to complete the final judgment on whether an intermediary intervened.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 400 provided in an embodiment of the present application includes a processor 401 and a memory 402; the memory 402 stores machine-readable instructions executable by the processor 401, and when the machine-readable instructions are executed by the processor 401, the method described above is performed.
For example, the processor 401 of the embodiment of the present application may read the computer program from the memory 402 through the communication bus and execute the computer program to implement the video identification method described above, that is, it may perform the following steps: carrying out audio-video separation on the collected audio-video signal to respectively obtain the audio signal and the video signal, wherein the video signal comprises the object to be verified; performing feature extraction on the audio signal and the video signal to obtain the audio features corresponding to the audio signal and the video features corresponding to the video signal; and inputting the audio features and the video features into the pre-trained neural network to obtain the recognition result output by the neural network, wherein the recognition result is used for representing whether the object to be verified in the video signal is acting under the guidance of an intermediary.
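For orientation only, the following Python sketch strings these steps together into a single pipeline; every helper it references (signal separation, feature extraction, the trained judgment network) is a hypothetical placeholder rather than an implementation of the embodiment.

    # Minimal end-to-end sketch of the pipeline described above. Every helper
    # passed in here is a hypothetical placeholder standing in for the modules
    # of the embodiment (signal separation, feature extraction, judgment).

    def identify_video(av_path: str,
                       separate_audio_video,
                       extract_audio_features,
                       extract_video_features,
                       judgment_network) -> bool:
        """Return True when intermediary intervention is suspected."""
        audio_signal, video_signal = separate_audio_video(av_path)
        audio_features = extract_audio_features(audio_signal)
        video_features = extract_video_features(video_signal)
        return bool(judgment_network(audio_features, video_features))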
The processor 401 may be an integrated circuit chip having signal processing capabilities. The processor 401 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
The memory 402 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
It will be appreciated that the configuration shown in FIG. 7 is merely illustrative, and the electronic device 400 may include more or fewer components than shown in FIG. 7, or have a configuration different from that shown in FIG. 7. The components shown in FIG. 7 may be implemented in hardware, software, or a combination thereof. In the embodiment of the present application, the electronic device 400 may be, but is not limited to, a physical device such as a desktop computer, a laptop, a smart phone, a smart wearable device or a vehicle-mounted device, and may also be a virtual device such as a virtual machine. In addition, the electronic device 400 is not necessarily a single device and may also be a combination of multiple devices, such as a server cluster. In the embodiment of the present application, a server used in the video identification method may be implemented by the electronic device 400 shown in FIG. 7.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the video identification method in the foregoing embodiments, for example including: carrying out audio-video separation on the collected audio-video signal to respectively obtain the audio signal and the video signal; performing feature extraction on the audio signal and the video signal to obtain the audio features corresponding to the audio signal and the video features corresponding to the video signal; and inputting the audio features and the video features into a pre-trained neural network to obtain the recognition result output by the neural network.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices or units, and may be electrical, mechanical or in another form.
In addition, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A method for identifying a video, comprising:
carrying out audio-video separation on a collected audio-video signal to respectively obtain an audio signal and a video signal; wherein the video signal comprises an object to be verified;
extracting the characteristics of the audio signal and the video signal to obtain the audio characteristics corresponding to the audio signal and the video characteristics corresponding to the video signal;
inputting the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network; wherein the recognition result is used for representing whether the object to be verified in the video signal is acting under the guidance of an intermediary.
2. The method of claim 1, wherein the audio feature comprises the number of speakers in the audio signal and/or whether the audio signal contains a first keyword;
the video features include at least one of:
shooting angle features, face features and shooting scene features.
3. The method of claim 2, wherein performing feature extraction on the audio signal comprises:
intercepting the audio signal to obtain a plurality of audio segments; classifying the plurality of audio segments, and determining the number of speakers in the audio signal according to the classification result; and/or,
and recognizing and converting the voice in the audio signal into a text, and performing keyword recognition on the text.
4. The method according to claim 2, wherein the object to be verified is a human face, and performing feature extraction on the video signal comprises:
framing the video signal to obtain a plurality of frames of image signals;
and performing feature extraction on the image signal to obtain the shooting angle feature, the face feature and the shooting scene feature.
5. The method according to claim 4, wherein the performing feature extraction on the image signal to obtain the shooting angle feature comprises:
extracting the face angle features of the image signal to obtain the pitch angle, the yaw angle and the roll angle of the face in the image signal; judging whether the face is facing the screen according to the pitch angle, the yaw angle and the roll angle; and/or,
and carrying out shooting angle identification on the image signal, and judging whether the shooting angle is a self-shot angle.
6. The method according to claim 4, wherein the performing feature extraction on the image signal to obtain the facial feature comprises:
acquiring an eye region and a mouth region in the image signal;
extracting the features of the eye region, and judging whether the eye attention is focused on a screen; and/or,
extracting features of the mouth region, and judging whether the mouth moves; and/or,
and identifying the face image in the image signal by using a micro expression identification model so as to determine whether micro expressions of the face image are abnormal or not.
7. The method according to claim 4, wherein the performing feature extraction on the image signal to obtain shooting scene features comprises:
acquiring a portrait background picture of the image signal;
identifying the portrait background picture and judging whether the portrait background picture is indoor or not; and/or,
and carrying out keyword identification on the portrait background picture and judging whether a second keyword appears in the background.
8. The method according to any one of claims 1 to 7, wherein the inputting the audio features and the video features into a pre-trained neural network to obtain the recognition result output by the neural network comprises:
generating corresponding labels for the audio features and the video features respectively;
and generating a feature vector according to the label, inputting the feature vector into the pre-trained neural network, and obtaining an identification result output by the neural network.
9. A video recognition system, comprising:
the signal separation module is used for carrying out audio-video separation on a collected audio-video signal and respectively acquiring an audio signal and a video signal; wherein the video signal comprises an object to be verified;
the characteristic extraction module is used for extracting the characteristics of the audio signal and the video signal to obtain the audio characteristic corresponding to the audio signal and the video characteristic corresponding to the video signal;
the judging module is used for inputting the audio features and the video features into a pre-trained neural network to obtain a recognition result output by the neural network; wherein the recognition result is used for representing whether the object to be verified in the video signal is acting under the guidance of an intermediary.
10. An electronic device, comprising: a processor, a memory, and a bus;
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any one of claims 1-8.
11. A computer-readable storage medium, storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-8.
CN202211204582.6A 2022-09-29 2022-09-29 Video identification method, system, electronic equipment and storage medium Pending CN115512419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211204582.6A CN115512419A (en) 2022-09-29 2022-09-29 Video identification method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211204582.6A CN115512419A (en) 2022-09-29 2022-09-29 Video identification method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115512419A true CN115512419A (en) 2022-12-23

Family

ID=84508498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211204582.6A Pending CN115512419A (en) 2022-09-29 2022-09-29 Video identification method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115512419A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination