CN112507311A - High-security identity verification method based on multi-mode feature fusion


Info

Publication number
CN112507311A
Authority
CN
China
Prior art keywords
voiceprint
lip
video data
model
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011438711.9A
Other languages
Chinese (zh)
Inventor
董睿
蔡林希
涂煜洋
陈嘉博
安典坤
张柏礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011438711.9A priority Critical patent/CN112507311A/en
Publication of CN112507311A publication Critical patent/CN112507311A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities

Abstract

The invention discloses a high-security identity authentication method based on multi-modal feature fusion, which comprises the following steps: (1) collecting audio and video data while the user reads a verification code aloud; (2) carrying out face recognition verification on the collected audio and video data; (3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time; (4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen. The invention effectively improves the reliability of identity authentication.

Description

High-security identity verification method based on multi-mode feature fusion
Technical Field
The invention belongs to the field of identity verification, and particularly relates to a high-security identity verification method based on multi-modal feature fusion.
Background
As required, institutional staff regularly use an 'unmanned correction booth' to check in after identity verification and to report their thoughts and whereabouts. If the identity verification relies only on basic face recognition, it can be deceived with photographs, recorded video or three-dimensional models, so that someone can check in on another person's behalf. To solve such problems while taking ease of use and limited construction cost into account, a convenient and reliable identification method must be realized at low cost.
Liveness detection is currently the basic countermeasure, and the main prevention strategies fall into three categories: face recognition based on infrared imaging, silent RGB face recognition, and interaction-based face recognition. Infrared recognition requires a special camera, which greatly increases detection cost and limits its range of application. For RGB face recognition, photographs and videos have become so sharp that their optical characteristics can approach those of a real face. Interaction-based face recognition involves a cumbersome process, is not user-friendly, and cannot prevent attacks with 3D face models.
Disclosure of Invention
The purpose of the invention is as follows: in view of the above problems, the invention provides a high-security identity authentication method based on multi-modal feature fusion, which organically fuses identity authentication technologies such as face recognition, lip language recognition, speech recognition and speaker verification, and cross-checks them against basic face recognition and other authentication means, thereby effectively preventing photo attacks, video attacks and three-dimensional model attacks and improving the reliability of identity authentication.
The above purpose is achieved by the following technical scheme:
a high-security identity authentication method based on multi-modal feature fusion comprises the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
In the above high-security identity authentication method based on multi-modal feature fusion, the image recognition and judgment in step (3) comprise the following steps:
(31) for detection of the face region, Dlib is used to detect 68 key points of the face;
(32) the lip region is located using the 49th-68th of the 68 feature points, which serve as the feature data for lip language recognition and form the training input;
(33) a 3D-CNN model extracts features from the extracted lip image sequence; by performing 3D convolution it extracts features in both time and space, and training yields the output.
In the above high-security identity authentication method based on multi-modal feature fusion, the voice verification of the collected audio and video data in step (4) comprises voiceprint feature extraction and learning model selection. For voiceprint feature extraction, MFCC features of the voiceprint are used: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform, and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint. For the learning model, a convolutional neural network is combined with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence. The network model is built and trained with the TensorFlow and Keras frameworks; during prediction, the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. The method can complete identity verification with liveness detection on an ordinary computer or mobile phone; no special equipment is required and the cost is low.
2. The method is convenient to operate; the user only needs to read the digits on the screen aloud to pass verification.
3. The method is highly secure. Simple photo attacks cannot pass the speech and lip language verification. Video attacks are also ineffective, because the number of possible digit combinations is so large that pre-recording tens of thousands of videos is infeasible. Attacks with computer 3D models or simulated masks are likewise ineffective, because they cannot pass the voiceprint verification.
Drawings
Fig. 1 is a schematic flow chart of a high-security authentication method provided by the present invention.
Fig. 2 is a 3D ResNet structure diagram of the present invention.
FIG. 3 is a diagram of the convolutional neural network structure definition of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention provides a high security identity authentication method based on multi-modal feature fusion, which includes the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
The four verification methods can be freely combined according to actual needs and the required security level: if the security requirement is high, all four must pass; if the security requirement is moderate, passing 2-3 of them is sufficient. A minimal sketch of such a pass criterion is given below.
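As an illustration only, the following sketch shows how the pass criterion might be expressed in code; the function name, argument names and the "at least 3" rule for the moderate level are assumptions, not taken from the patent.

```python
# Hypothetical sketch of combining the four checks according to a security level.
# Each argument is the boolean result of one verification step.
def verify(face_ok: bool, lip_ok: bool, voiceprint_ok: bool, code_ok: bool,
           security_level: str = "high") -> bool:
    results = [face_ok, lip_ok, voiceprint_ok, code_ok]
    if security_level == "high":
        # High security: all four checks must pass.
        return all(results)
    # Moderate security: a subset of the checks (here at least 3) is sufficient.
    return sum(results) >= 3
```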
1) Voiceprint recognition
Data set
The data set used in this example is the Free ST Chinese Mandarin Corpus, which contains speech from 855 speakers, with 120 audio recordings per speaker and 102,600 utterances in total.
Training process
The audio files are small and numerous, so in this embodiment they are packed into TFRecord files to speed up training. create_data.py is used to generate the TFRecord files (TFRecord internally uses the Protocol Buffer binary encoding; it occupies a single memory block and only one binary file needs to be loaded at a time, which is simple, fast and especially friendly to large training data sets).
First, a data list is created in the format <voice file path>\t<classification label>. Creating the data list mainly makes later reading convenient, and it also makes it easy to read and use other speech data sets: by writing a corresponding list-generation function, different speech data sets can be written into the same data list, so that the TFRecord file can be generated directly in the next step. A minimal sketch of such a list-generation step is given below.
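The following sketch illustrates one way such a data list could be generated, assuming the audio is organized into one directory per speaker; the directory layout and function name are assumptions made for illustration.

```python
import os

def create_data_list(root_dir: str, list_path: str) -> None:
    """Write one '<voice file path>\t<classification label>' line per utterance."""
    speakers = sorted(d for d in os.listdir(root_dir)
                      if os.path.isdir(os.path.join(root_dir, d)))
    with open(list_path, "w", encoding="utf-8") as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(root_dir, speaker)
            for name in sorted(os.listdir(speaker_dir)):
                if name.endswith(".wav"):
                    f.write(f"{os.path.join(speaker_dir, name)}\t{label}\n")
```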
With the data list created above, the speech data can be converted into training data. The main step is converting each recording into a Mel spectrogram, which is easy to obtain with librosa; the split function is used to cut off the silent parts of the audio during conversion, which reduces noise in the training data and improves training accuracy. In this embodiment the length of each utterance defaults to 2.04 seconds; the length can be modified for specific situations, in which case the corresponding data values must be changed as indicated in the code comments. If an utterance is longer, the program crops it randomly 20 times to achieve a data-augmentation effect. A minimal sketch of this conversion is given below.
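A minimal sketch of this conversion step, assuming the standard librosa calls librosa.effects.split() for silence removal and librosa.feature.melspectrogram() for the Mel spectrogram; the sample rate, top_db value and the single fixed-length crop are illustrative assumptions, and the random 20-fold cropping mentioned above is omitted.

```python
import librosa
import numpy as np
import tensorflow as tf

def wav_to_melspec(path: str, sr: int = 16000, clip_seconds: float = 2.04) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    # Cut off silent parts so that the training data contains less noise.
    intervals = librosa.effects.split(y, top_db=30)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])
    y = y[: int(sr * clip_seconds)]                      # fixed clip length (2.04 s)
    return librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

def write_tfrecord(data_list: str, tfrecord_path: str) -> None:
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for line in open(data_list, encoding="utf-8"):
            path, label = line.strip().split("\t")
            # The spectrogram is flattened into a one-dimensional list before storage.
            mel = wav_to_melspec(path).astype(np.float32).flatten()
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(float_list=tf.train.FloatList(value=mel)),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())
```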
Model training is then started: a ResNet50 classification model is built, with input_shape set to (128, None, 1) mainly to accommodate audio inputs of other lengths, so that the prediction input can be of arbitrary size. class_dim is the total number of classes; the Free ST Chinese Mandarin Corpus contains speech from 855 speakers, so the total number of classes here is 855. A sketch of such a model is given below.
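A minimal sketch of such a classification model, assuming tf.keras's built-in ResNet50 application with random initialization; the optimizer and loss are illustrative choices, not specified by the patent.

```python
import tensorflow as tf

class_dim = 855                                    # speakers in the Free ST Chinese Mandarin Corpus
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None,
    input_shape=(128, None, 1),                    # (Mel bins, arbitrary number of frames, 1 channel)
    pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(class_dim, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```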
When training begins, note that the Mel spectrogram of each audio file was flattened into a one-dimensional list when the TFRecord file was created, so before the data is fed into the model it must be reshaped back to its previous shape; the operation used is reshape((-1, 128, 1)). Note that if audio of another length is used, the shape must be modified according to the Mel spectrogram, and the training data and the test data must be processed in the same way. The model is tested and saved, including the prediction model and the network weights, every 200 batches of training.
Voiceprint comparison
Voiceprint comparison is then implemented in a separate inference script; the names of the inputs and outputs of each layer can be inspected with netron.
Two functions are then written: one loads the data and the other performs prediction. The data-loading function places no limit on the size of the input audio; it only requires that the audio remaining after trimming silence is not shorter than 0.5 seconds, so audio of any length can be used as input. The output of the prediction step is the feature vector of the speech.
With these two functions, voiceprint recognition can be carried out: two utterances are input, their feature vectors are obtained with the prediction function, and the cosine similarity between the two feature vectors is computed; the result is used as the degree of similarity between the two utterances. The similarity threshold can be adjusted according to the accuracy requirements of the project, as in the sketch below.
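A minimal sketch of the comparison step; the threshold value of 0.8 is illustrative only, and the feature vectors are assumed to come from the prediction function described above.

```python
import numpy as np

def cosine_similarity(feat1: np.ndarray, feat2: np.ndarray) -> float:
    # Cosine of the angle between the two voiceprint feature vectors.
    return float(np.dot(feat1, feat2) /
                 (np.linalg.norm(feat1) * np.linalg.norm(feat2)))

def same_speaker(feat1: np.ndarray, feat2: np.ndarray, threshold: float = 0.8) -> bool:
    # The threshold is tuned to the accuracy requirement of the project.
    return cosine_similarity(feat1, feat2) >= threshold
```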
2) Speech recognition
Speech recognition comprises voiceprint feature extraction and learning model selection. For feature extraction, this embodiment mainly extracts the MFCC features of the voiceprint: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform (STFT), and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint. For the learning model, because sound is continuous in time and the amount of information within a given interval is uncertain, this embodiment combines a convolutional neural network (CNN) with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence. For the model implementation, this embodiment builds and trains the network model with the TensorFlow and Keras frameworks. During prediction, the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio. A minimal sketch of the MFCC extraction and the CNN+CTC setup is given below.
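The sketch below illustrates these two parts under stated assumptions: librosa.feature.mfcc() is assumed for MFCC extraction, the pinyin vocabulary size and network depth are placeholders, and the CTC loss uses tf.keras.backend.ctc_batch_cost; it is not the patent's exact architecture.

```python
import librosa
import tensorflow as tf

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 26):
    # Framing, short-time Fourier transform, Mel filtering and cepstral analysis
    # are all performed internally by librosa.feature.mfcc.
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (frames, n_mfcc)

num_pinyin = 1400                                # assumed pinyin vocabulary size
inputs = tf.keras.Input(shape=(None, 26))        # variable-length MFCC sequence
x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(x)
logits = tf.keras.layers.Dense(num_pinyin + 1, activation="softmax")(x)  # +1 for the CTC blank
model = tf.keras.Model(inputs, logits)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # CTC merges repeated symbols and removes blanks, so the network can be trained
    # on unaligned (audio, pinyin sequence) pairs.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```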
3) Lip language identification
Lip language identification dataset preprocessing
The lip language identification data set preprocessing mainly comprises the steps of extracting key frames in a lip movement video, detecting a face region and positioning and extracting the lip region.
For the lip-movement video, this embodiment uses a set of frontal-face pronunciation videos in which there is no significant relative motion between the speaker and the lens. The spoken content covers the digits 0-9; each pronunciation sequence lasts about one second, and the interval between the pronunciations of different digits is greater than 0.3 seconds, so the start time and end time of each individual pronunciation unit can be located accurately on the audio signal by speech analysis. Although each pronunciation sequence lasts about one second, the intercepted pronunciation durations are not identical, so a fixed-length sequence of frames, called the key frames, is sampled from each independent segment of voiced video, as sketched below.
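A minimal sketch of sampling a fixed-length key-frame sequence from one pronunciation segment, assuming OpenCV for video access; the number of frames per segment is an illustrative assumption.

```python
import cv2
import numpy as np

def sample_key_frames(video_path: str, start_s: float, end_s: float, num_frames: int = 16):
    """Uniformly sample a fixed-length frame sequence from one pronunciation segment."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_ids = np.linspace(start_s * fps, end_s * fps, num_frames).astype(int)
    frames = []
    for idx in frame_ids:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```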
For the detection of the face region, the present embodiment adopts Dlib to detect 68 key points of the face.
For locating and extracting the lip region, this embodiment uses the 49th-68th of the 68 feature points to locate the lip region, and these points serve as the feature data for lip language recognition when forming the training input. The 20 points comprise 12 points on the outer lip contour and 8 points on the inner contour; the 49th and 55th points are the left and right mouth corners, the 51st and 53rd points are the two highest points of the upper lip, and the 58th point is the lowest point of the lower lip, and these five key points determine the boundary of the lips within a picture. The extracted lip regions are then normalized to a uniform size, which completes the preprocessing of the data set. A minimal sketch of this landmark-based lip extraction is given below.
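A minimal sketch of this step using Dlib's 68-point landmark predictor; the model file path and the 64x64 output size are assumptions made for illustration.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model must be downloaded separately (path is an assumption).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_lip_region(frame, size=(64, 64)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 49-68 (0-based indices 48-67) describe the outer and inner lip contours.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    lip = frame[y:y + h, x:x + w]
    # Resize so that all lip crops have a uniform size before training.
    return cv2.resize(lip, size)
```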
This embodiment uses a 3D-CNN model to extract features from the extracted lip language image sequence: by performing 3D convolution it extracts features in both time and space, and training yields the output.
An ordinary CNN model is mainly used for 2D images, but prediction on video requires combining information from the preceding and following frames. One way to use a CNN for human motion recognition in video is to treat each frame as a still image and recognize the motion at the single-frame level, but this ignores the motion information encoded across multiple consecutive frames. To incorporate the motion information in video efficiently, 3D convolution can be performed in the CNN convolutional layers to obtain discriminative features in both the spatial and the temporal dimensions. The 3D CNN architecture generates multiple channels of information from adjacent video frames, performs convolution and downsampling separately in each channel, and obtains the final feature representation by combining the information from all the channels.
Implementation procedure
According to the specific practical situation, this embodiment constructs a 3D ResNet model for training and prediction. ResNet (deep residual network) learns a residual module F(x) = H(x) - x between the input x and the target mapping H(x); by introducing such residual modules and adding the input to the module output element-wise at the corresponding positions, learning an identity layer is greatly simplified.
The structure of 3D ResNet is shown in FIG. 2.
Building on the preceding work of slicing the video and extracting lip features, the structure of the convolutional neural network is defined as shown in Fig. 3.
The corresponding neural network structure is built with TensorFlow, with the depth of the convolution kernel matching that of the input data. In addition, the stride is set to 2, which increases the robustness of training. A minimal sketch of such a structure is given below.
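The following sketch shows one way a small 3D residual network of this kind could be assembled in tf.keras; the number of blocks, filter counts, input size (16 key frames of 64x64 grayscale lip crops) and the ten-class output are illustrative assumptions, not the exact structure of Fig. 2 or Fig. 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_3d(x, filters, stride=1):
    """One 3D residual block: two 3x3x3 convolutions plus a shortcut connection."""
    shortcut = x
    y = layers.Conv3D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    y = layers.Conv3D(filters, 3, strides=1, padding="same")(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # Project the shortcut so its shape matches the residual branch.
        shortcut = layers.Conv3D(filters, 1, strides=stride, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])          # element-wise addition of input and residual
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(16, 64, 64, 1))   # 16 key frames of 64x64 grayscale lip images
x = layers.Conv3D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # stride 2
x = residual_block_3d(x, 64, stride=2)
x = layers.GlobalAveragePooling3D()(x)
outputs = layers.Dense(10, activation="softmax")(x)   # ten digit classes 0-9
model = tf.keras.Model(inputs, outputs)
```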
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is also intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (3)

1. A high-security identity authentication method based on multi-modal feature fusion is characterized by comprising the following steps:
(1) collecting audio and video data when a user reads the verification code;
(2) carrying out face recognition verification on the collected audio and video data;
(3) carrying out image recognition on the collected audio and video data to judge whether the user's lip movements match the lip movements of reading the digits displayed in real time;
(4) carrying out voice verification on the collected audio and video data: verifying whether the speaker's voiceprint matches the voiceprint recorded at registration, and judging from the speech whether the digits read by the user are the verification code displayed randomly on the screen.
2. The multi-modal feature fusion based high-security identity authentication method as claimed in claim 1, wherein the image recognition and judgment in step (3) comprises the following steps:
(31) for detection of the face region, Dlib is used to detect 68 key points of the face;
(32) the lip region is located using the 49th-68th of the 68 feature points, which serve as the feature data for lip language recognition and form the training input;
(33) a 3D-CNN model extracts features from the extracted lip image sequence; by performing 3D convolution it extracts features in both time and space, and training yields the output.
3. The high-security identity authentication method based on multi-modal feature fusion as claimed in claim 1, wherein the voice verification of the collected audio and video data in step (4) comprises voiceprint feature extraction and learning model selection; for voiceprint feature extraction, MFCC features of the voiceprint are used: after preprocessing, the audio is divided into a number of separate frames, each frame is converted into its spectrum by a short-time Fourier transform, and Mel-frequency analysis and cepstral analysis of the spectrum then yield the MFCC features of the voiceprint; for the learning model, a convolutional neural network is combined with a connectionist temporal classification (CTC) model, where the CTC model merges identical phoneme symbols in the voiceprint to ensure the correctness of the output sequence; for the model implementation, the network model is built and trained with the TensorFlow and Keras frameworks, and during prediction the feature vector of the corresponding audio is computed and passed through the corresponding model to obtain the pinyin data contained in the audio.
CN202011438711.9A 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion Pending CN112507311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011438711.9A CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011438711.9A CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Publications (1)

Publication Number Publication Date
CN112507311A true CN112507311A (en) 2021-03-16

Family

ID=74970733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011438711.9A Pending CN112507311A (en) 2020-12-10 2020-12-10 High-security identity verification method based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN112507311A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real person living body identity verification method based on sound-type image feature
CN108763897A (en) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 Method of calibration, terminal device and the medium of identity legitimacy
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946810A (en) * 2021-12-07 2022-01-18 荣耀终端有限公司 Application program running method and electronic equipment
WO2023103499A1 (en) * 2021-12-07 2023-06-15 荣耀终端有限公司 Method for running application, and electronic device
CN114491467A (en) * 2022-04-15 2022-05-13 北京快联科技有限公司 Identity authentication method and device
CN114780940A (en) * 2022-06-21 2022-07-22 中铁电气化局集团有限公司 Cross-system data sharing interaction project operation monitoring and early warning method and system
CN117155583A (en) * 2023-10-24 2023-12-01 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion
CN117155583B (en) * 2023-10-24 2024-01-23 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion
CN117174092A (en) * 2023-11-02 2023-12-05 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis
CN117174092B (en) * 2023-11-02 2024-01-26 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis

Similar Documents

Publication Publication Date Title
US10699699B2 (en) Constructing speech decoding network for numeric speech recognition
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
CN107731233B (en) Voiceprint recognition method based on RNN
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN111325817A (en) Virtual character scene video generation method, terminal device and medium
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN111785275A (en) Voice recognition method and device
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
CN111916054A (en) Lip-based voice generation method, device and system and storage medium
CN111402892A (en) Conference recording template generation method based on voice recognition
WO2021007856A1 (en) Identity verification method, terminal device, and storage medium
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
CN116246610A (en) Conference record generation method and system based on multi-mode identification
CN113923521B (en) Video scripting method
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
CN110232928B (en) Text-independent speaker verification method and device
Uzan et al. I know that voice: Identifying the voice actor behind the voice
Lucey et al. Continuous pose-invariant lipreading
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
Al-Shayea et al. Speaker identification: A novel fusion samples approach
KR20150035312A (en) Method for unlocking user equipment based on voice, user equipment releasing lock based on voice and computer readable medium having computer program recorded therefor
CN112466287B (en) Voice segmentation method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination