CN112507311A - High-security identity verification method based on multi-mode feature fusion


Info

Publication number
CN112507311A
Authority
CN
China
Prior art keywords
voiceprint
lip
video data
model
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011438711.9A
Other languages
Chinese (zh)
Inventor
董睿
蔡林希
涂煜洋
陈嘉博
安典坤
张柏礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011438711.9A priority Critical patent/CN112507311A/en
Publication of CN112507311A publication Critical patent/CN112507311A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities

Abstract

The invention discloses a high-security identity authentication method based on multi-modal feature fusion, which comprises the following steps: (1) collecting audio and video data while the user reads a verification code aloud; (2) carrying out face recognition verification on the collected audio and video data; (3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time; (4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen. The invention effectively improves the reliability of identity authentication.

Description

High-security identity verification method based on multi-mode feature fusion
Technical Field
The invention belongs to the field of identity verification, and particularly relates to a high-security identity verification method based on multi-modal feature fusion.
Background
As required, institutional staff regularly use an 'unmanned correction booth' to check in after identity verification and to report their thoughts and whereabouts. If the identity verification relies only on basic face recognition, it can be deceived with photographs, recorded video or three-dimensional models, so that someone can check in on another person's behalf. To solve such problems while taking ease of use and limited construction cost into account, a convenient and reliable identification method must be realized at low cost.
Liveness detection is currently the basic countermeasure, and the main prevention strategies fall into three categories: face recognition based on infrared imaging, silent RGB face recognition, and interaction-based face recognition. Infrared recognition requires a special camera, which greatly increases detection cost and limits its range of application. For RGB face recognition, photographs and videos have become so sharp that their optical characteristics can approach those of a real face. Interaction-based face recognition involves a cumbersome process, is not user-friendly, and cannot prevent attacks with 3D face models.
Disclosure of Invention
The purpose of the invention is as follows: in view of the above problems, the invention provides a high-security identity authentication method based on multi-modal feature fusion, which organically fuses identity authentication technologies such as face recognition, lip language recognition, speech recognition and speaker verification, and cross-checks them against basic face recognition and other authentication means, thereby effectively preventing photo attacks, video attacks and three-dimensional model attacks and improving the reliability of identity authentication.
The above purpose is achieved by the following technical scheme:
a high-security identity authentication method based on multi-modal feature fusion comprises the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
In the above high-security identity authentication method based on multi-modal feature fusion, the image recognition and judgment in step (3) comprise the following steps:
(31) for detection of the face region, Dlib is used to detect 68 key points of the face;
(32) the lip region is located using the 49th-68th of the 68 feature points, which serve as the feature data for lip language recognition and form the training input;
(33) a 3D-CNN model extracts features from the extracted lip image sequence; by performing 3D convolution it extracts features in both time and space, and training yields the output.
In the above high-security identity authentication method based on multi-modal feature fusion, the voice verification of the collected audio and video data in step (4) comprises voiceprint feature extraction and learning model selection. For voiceprint feature extraction, MFCC features of the voiceprint are used: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform, and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint. For the learning model, a convolutional neural network is combined with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence. The network model is built and trained with the TensorFlow and Keras frameworks; during prediction, the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. The method can complete identity verification with liveness detection on an ordinary computer or mobile phone; no special equipment is required and the cost is low.
2. The method is convenient to operate; the user only needs to read the digits on the screen aloud to pass verification.
3. The method is highly secure. Simple photo attacks cannot pass the speech and lip language verification. Video attacks are also ineffective, because the number of possible digit combinations is so large that pre-recording tens of thousands of videos is infeasible. Attacks with computer 3D models or simulated masks are likewise ineffective, because they cannot pass the voiceprint verification.
Drawings
Fig. 1 is a schematic flow chart of a high-security authentication method provided by the present invention.
Fig. 2 is a 3D ResNet structure diagram of the present invention.
FIG. 3 is a diagram of the convolutional neural network structure definition of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention provides a high security identity authentication method based on multi-modal feature fusion, which includes the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
The four verification methods can be freely combined according to actual needs and the required security level: if the security requirement is high, all four must pass; if the security requirement is moderate, passing 2-3 of them is sufficient. A minimal sketch of such a pass criterion is given below.
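As an illustration only, the following sketch shows how the pass criterion might be expressed in code; the function name, argument names and the "at least 3" rule for the moderate level are assumptions, not taken from the patent.

```python
# Hypothetical sketch of combining the four checks according to a security level.
# Each argument is the boolean result of one verification step.
def verify(face_ok: bool, lip_ok: bool, voiceprint_ok: bool, code_ok: bool,
           security_level: str = "high") -> bool:
    results = [face_ok, lip_ok, voiceprint_ok, code_ok]
    if security_level == "high":
        # High security: all four checks must pass.
        return all(results)
    # Moderate security: a subset of the checks (here at least 3) is sufficient.
    return sum(results) >= 3
```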
1) Voiceprint recognition
Data set
The data set used in this example is the Free ST Chinese Mandarin Corpus, which contains speech from 855 speakers, with 120 audio recordings per speaker and 102,600 utterances in total.
Training process
The audio files are small and numerous, so in this embodiment they are packed into TFRecord files to speed up training. create_data.py is used to generate the TFRecord files (TFRecord internally uses the Protocol Buffer binary encoding; it occupies a single memory block and only one binary file needs to be loaded at a time, which is simple, fast and especially friendly to large training data sets).
First, a data list is created in the format <voice file path>\t<classification label>. Creating the data list mainly makes later reading convenient, and it also makes it easy to read and use other speech data sets: by writing a corresponding list-generation function, different speech data sets can be written into the same data list, so that the TFRecord file can be generated directly in the next step. A minimal sketch of such a list-generation step is given below.
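The following sketch illustrates one way such a data list could be generated, assuming the audio is organized into one directory per speaker; the directory layout and function name are assumptions made for illustration.

```python
import os

def create_data_list(root_dir: str, list_path: str) -> None:
    """Write one '<voice file path>\t<classification label>' line per utterance."""
    speakers = sorted(d for d in os.listdir(root_dir)
                      if os.path.isdir(os.path.join(root_dir, d)))
    with open(list_path, "w", encoding="utf-8") as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(root_dir, speaker)
            for name in sorted(os.listdir(speaker_dir)):
                if name.endswith(".wav"):
                    f.write(f"{os.path.join(speaker_dir, name)}\t{label}\n")
```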
With the data list created above, the speech data can be converted into training data. The main step is converting each recording into a Mel spectrogram, which is easy to obtain with librosa; the split function is used to cut off the silent parts of the audio during conversion, which reduces noise in the training data and improves training accuracy. In this embodiment the length of each utterance defaults to 2.04 seconds; the length can be modified for specific situations, in which case the corresponding data values must be changed as indicated in the code comments. If an utterance is longer, the program crops it randomly 20 times to achieve a data-augmentation effect. A minimal sketch of this conversion is given below.
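A minimal sketch of this conversion step, assuming the standard librosa calls librosa.effects.split() for silence removal and librosa.feature.melspectrogram() for the Mel spectrogram; the sample rate, top_db value and the single fixed-length crop are illustrative assumptions, and the random 20-fold cropping mentioned above is omitted.

```python
import librosa
import numpy as np
import tensorflow as tf

def wav_to_melspec(path: str, sr: int = 16000, clip_seconds: float = 2.04) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    # Cut off silent parts so that the training data contains less noise.
    intervals = librosa.effects.split(y, top_db=30)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])
    y = y[: int(sr * clip_seconds)]                      # fixed clip length (2.04 s)
    return librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

def write_tfrecord(data_list: str, tfrecord_path: str) -> None:
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for line in open(data_list, encoding="utf-8"):
            path, label = line.strip().split("\t")
            # The spectrogram is flattened into a one-dimensional list before storage.
            mel = wav_to_melspec(path).astype(np.float32).flatten()
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(float_list=tf.train.FloatList(value=mel)),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())
```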
Model training is then started: a ResNet50 classification model is built, with input_shape set to (128, None, 1) mainly to accommodate audio inputs of other lengths, so that the prediction input can be of arbitrary size. class_dim is the total number of classes; the Free ST Chinese Mandarin Corpus contains speech from 855 speakers, so the total number of classes here is 855. A sketch of such a model is given below.
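A minimal sketch of such a classification model, assuming tf.keras's built-in ResNet50 application with random initialization; the optimizer and loss are illustrative choices, not specified by the patent.

```python
import tensorflow as tf

class_dim = 855                                    # speakers in the Free ST Chinese Mandarin Corpus
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None,
    input_shape=(128, None, 1),                    # (Mel bins, arbitrary number of frames, 1 channel)
    pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(class_dim, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```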
When training begins, note that the Mel spectrogram of each audio file was flattened into a one-dimensional list when the TFRecord file was created, so before the data is fed into the model it must be reshaped back to its previous shape; the operation used is reshape((-1, 128, 1)). Note that if audio of another length is used, the shape must be modified according to the Mel spectrogram, and the training data and the test data must be processed in the same way. The model is tested and saved, including the prediction model and the network weights, every 200 batches of training.
Voiceprint comparison
Voiceprint comparison is then implemented in a separate inference script; the names of the inputs and outputs of each layer can be inspected with netron.
Two functions are then written: one loads the data and the other performs prediction. The data-loading function places no limit on the size of the input audio; it only requires that the audio remaining after trimming silence is not shorter than 0.5 seconds, so audio of any length can be used as input. The output of the prediction step is the feature vector of the speech.
With these two functions, voiceprint recognition can be carried out: two utterances are input, their feature vectors are obtained with the prediction function, and the cosine similarity between the two feature vectors is computed; the result is used as the degree of similarity between the two utterances. The similarity threshold can be adjusted according to the accuracy requirements of the project, as in the sketch below.
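A minimal sketch of the comparison step; the threshold value of 0.8 is illustrative only, and the feature vectors are assumed to come from the prediction function described above.

```python
import numpy as np

def cosine_similarity(feat1: np.ndarray, feat2: np.ndarray) -> float:
    # Cosine of the angle between the two voiceprint feature vectors.
    return float(np.dot(feat1, feat2) /
                 (np.linalg.norm(feat1) * np.linalg.norm(feat2)))

def same_speaker(feat1: np.ndarray, feat2: np.ndarray, threshold: float = 0.8) -> bool:
    # The threshold is tuned to the accuracy requirement of the project.
    return cosine_similarity(feat1, feat2) >= threshold
```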
2) Speech recognition
Speech recognition comprises voiceprint feature extraction and learning model selection. For feature extraction, this embodiment mainly extracts the MFCC features of the voiceprint: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform (STFT), and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint. For the learning model, because sound is continuous in time and the amount of information within a given interval is uncertain, this embodiment combines a convolutional neural network (CNN) with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence. For the model implementation, this embodiment builds and trains the network model with the TensorFlow and Keras frameworks. During prediction, the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio. A minimal sketch of the MFCC extraction and the CNN+CTC setup is given below.
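The sketch below illustrates these two parts under stated assumptions: librosa.feature.mfcc() is assumed for MFCC extraction, the pinyin vocabulary size and network depth are placeholders, and the CTC loss uses tf.keras.backend.ctc_batch_cost; it is not the patent's exact architecture.

```python
import librosa
import tensorflow as tf

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 26):
    # Framing, short-time Fourier transform, Mel filtering and cepstral analysis
    # are all performed internally by librosa.feature.mfcc.
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (frames, n_mfcc)

num_pinyin = 1400                                # assumed pinyin vocabulary size
inputs = tf.keras.Input(shape=(None, 26))        # variable-length MFCC sequence
x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(x)
logits = tf.keras.layers.Dense(num_pinyin + 1, activation="softmax")(x)  # +1 for the CTC blank
model = tf.keras.Model(inputs, logits)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # CTC merges repeated symbols and removes blanks, so the network can be trained
    # on unaligned (audio, pinyin sequence) pairs.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```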
3) Lip language identification
Lip language identification dataset preprocessing
The lip language identification data set preprocessing mainly comprises the steps of extracting key frames in a lip movement video, detecting a face region and positioning and extracting the lip region.
For the lip-movement video, this embodiment uses a set of frontal-face pronunciation videos in which there is no significant relative motion between the speaker and the lens. The spoken content covers the digits 0-9; each pronunciation sequence lasts about one second, and the interval between the pronunciations of different digits is greater than 0.3 seconds, so the start time and end time of each individual pronunciation unit can be located accurately on the audio signal by speech analysis. Although each pronunciation sequence lasts about one second, the intercepted pronunciation durations are not identical, so a fixed-length sequence of frames, called the key frames, is sampled from each independent segment of voiced video, as sketched below.
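A minimal sketch of sampling a fixed-length key-frame sequence from one pronunciation segment, assuming OpenCV for video access; the number of frames per segment is an illustrative assumption.

```python
import cv2
import numpy as np

def sample_key_frames(video_path: str, start_s: float, end_s: float, num_frames: int = 16):
    """Uniformly sample a fixed-length frame sequence from one pronunciation segment."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_ids = np.linspace(start_s * fps, end_s * fps, num_frames).astype(int)
    frames = []
    for idx in frame_ids:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```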
For the detection of the face region, the present embodiment adopts Dlib to detect 68 key points of the face.
For locating and extracting the lip region, this embodiment uses the 49th-68th of the 68 feature points to locate the lip region, and these points serve as the feature data for lip language recognition when forming the training input. The 20 points comprise 12 points on the outer lip contour and 8 points on the inner contour; the 49th and 55th points are the left and right mouth corners, the 51st and 53rd points are the two highest points of the upper lip, and the 58th point is the lowest point of the lower lip, and these five key points determine the boundary of the lips within a picture. The extracted lip regions are then normalized to a uniform size, which completes the preprocessing of the data set. A minimal sketch of this landmark-based lip extraction is given below.
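A minimal sketch of this step using Dlib's 68-point landmark predictor; the model file path and the 64x64 output size are assumptions made for illustration.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model must be downloaded separately (path is an assumption).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_lip_region(frame, size=(64, 64)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 49-68 (0-based indices 48-67) describe the outer and inner lip contours.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    lip = frame[y:y + h, x:x + w]
    # Resize so that all lip crops have a uniform size before training.
    return cv2.resize(lip, size)
```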
This embodiment uses a 3D-CNN model to extract features from the extracted lip language image sequence: by performing 3D convolution it extracts features in both time and space, and training yields the output.
An ordinary CNN model is mainly used for 2D images, but prediction on video requires combining information from the preceding and following frames. One way to use a CNN for human motion recognition in video is to treat each frame as a still image and recognize the motion at the single-frame level, but this ignores the motion information encoded across multiple consecutive frames. To incorporate the motion information in video efficiently, 3D convolution can be performed in the CNN convolutional layers to obtain discriminative features in both the spatial and the temporal dimensions. The 3D CNN architecture generates multiple channels of information from adjacent video frames, performs convolution and downsampling separately in each channel, and obtains the final feature representation by combining the information from all the channels.
Implementation procedure
According to the specific practical situation, this embodiment constructs a 3D ResNet model for training and prediction. ResNet (deep residual network) learns a residual module F(x) = H(x) - x between the input x and the target mapping H(x); by introducing such residual modules and adding the input to the module output element-wise at the corresponding positions, learning an identity layer is greatly simplified.
The structure of 3D ResNet is shown in FIG. 2.
Building on the preceding work of slicing the video and extracting lip features, the structure of the convolutional neural network is defined as shown in Fig. 3.
The corresponding neural network structure is built with TensorFlow, with the depth of the convolution kernel matching that of the input data. In addition, the stride is set to 2, which increases the robustness of training. A minimal sketch of such a structure is given below.
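The following sketch shows one way a small 3D residual network of this kind could be assembled in tf.keras; the number of blocks, filter counts, input size (16 key frames of 64x64 grayscale lip crops) and the ten-class output are illustrative assumptions, not the exact structure of Fig. 2 or Fig. 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_3d(x, filters, stride=1):
    """One 3D residual block: two 3x3x3 convolutions plus a shortcut connection."""
    shortcut = x
    y = layers.Conv3D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    y = layers.Conv3D(filters, 3, strides=1, padding="same")(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # Project the shortcut so its shape matches the residual branch.
        shortcut = layers.Conv3D(filters, 1, strides=stride, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])          # element-wise addition of input and residual
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(16, 64, 64, 1))   # 16 key frames of 64x64 grayscale lip images
x = layers.Conv3D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # stride 2
x = residual_block_3d(x, 64, stride=2)
x = layers.GlobalAveragePooling3D()(x)
outputs = layers.Dense(10, activation="softmax")(x)   # ten digit classes 0-9
model = tf.keras.Model(inputs, outputs)
```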
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is also intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (3)

1. A high-security identity authentication method based on multi-modal feature fusion is characterized by comprising the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
2. The multi-modal feature fusion based high-security identity authentication method as claimed in claim 1, wherein the image recognition and judgment in step (3) comprises the following steps:
(31) for detection of the face region, Dlib is used to detect 68 key points of the face;
(32) the lip region is located using the 49th-68th of the 68 feature points, which serve as the feature data for lip language recognition and form the training input;
(33) a 3D-CNN model extracts features from the extracted lip image sequence; by performing 3D convolution it extracts features in both time and space, and training yields the output.
3. The high-security identity authentication method based on multi-modal feature fusion as claimed in claim 1, wherein the voice verification of the collected audio and video data in step (4) comprises voiceprint feature extraction and learning model selection; for voiceprint feature extraction, MFCC features of the voiceprint are used: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform, and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint; for the learning model, a convolutional neural network is combined with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence; for the model implementation, the network model is built and trained with the TensorFlow and Keras frameworks, and during prediction the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio.
CN202011438711.9A 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion Pending CN112507311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011438711.9A CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011438711.9A CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Publications (1)

Publication Number Publication Date
CN112507311A true CN112507311A (en) 2021-03-16

Family

ID=74970733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011438711.9A Pending CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN112507311A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real person living body identity verification method based on sound-type image feature
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946810A (en) * 2021-12-07 2022-01-18 荣耀终端有限公司 Application program running method and electronic equipment
WO2023103499A1 (en) * 2021-12-07 2023-06-15 荣耀终端有限公司 Method for running application, and electronic device
CN114491467A (en) * 2022-04-15 2022-05-13 北京快联科技有限公司 Identity authentication method and device
CN114780940A (en) * 2022-06-21 2022-07-22 中铁电气化局集团有限公司 Cross-system data sharing interaction project operation monitoring and early warning method and system
CN117155583A (en) * 2023-10-24 2023-12-01 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion
CN117155583B (en) * 2023-10-24 2024-01-23 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion
CN117174092A (en) * 2023-11-02 2023-12-05 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis
CN117174092B (en) * 2023-11-02 2024-01-26 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis

Similar Documents

Publication Publication Date Title
US10699699B2 (en) Constructing speech decoding network for numeric speech recognition
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
CN107731233B (en) Voiceprint recognition method based on RNN
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN111325817A (en) Virtual character scene video generation method, terminal device and medium
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN111785275A (en) Voice recognition method and device
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
CN111916054A (en) Lip-based voice generation method, device and system and storage medium
CN111402892A (en) Conference recording template generation method based on voice recognition
WO2021007856A1 (en) Identity verification method, terminal device, and storage medium
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
CN116246610A (en) Conference record generation method and system based on multi-mode identification
CN113923521B (en) Video scripting method
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
CN110232928B (en) Text-independent speaker verification method and device
Uzan et al. I know that voice: Identifying the voice actor behind the voice
Lucey et al. Continuous pose-invariant lipreading
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
Al-Shayea et al. Speaker identification: A novel fusion samples approach
KR20150035312A (en) Method for unlocking user equipment based on voice, user equipment releasing lock based on voice and computer readable medium having computer program recorded therefor
CN112466287B (en) Voice segmentation method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination