CN110738985A - Cross-modal biometric feature recognition method and system based on voice signals - Google Patents

Cross-modal biometric feature recognition method and system based on voice signals

Info

Publication number
CN110738985A
Authority
CN
China
Prior art keywords
features
vector
voiceprint
modal
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910981216.3A
Other languages
Chinese (zh)
Inventor
潘成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Net Into Polytron Technologies Inc
Original Assignee
Jiangsu Net Into Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Net Into Polytron Technologies Inc filed Critical Jiangsu Net Into Polytron Technologies Inc
Priority to CN201910981216.3A priority Critical patent/CN110738985A/en
Publication of CN110738985A publication Critical patent/CN110738985A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/70 Multimodal biometrics, e.g. combining information from different biometric modalities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention provides a method for cross-modal biometric recognition based on voice signals. The method comprises: S1, acquiring the voice signal to be recognized and multi-modal biometric information of a plurality of persons; S2, extracting features for each single modality with a neural network model to obtain fixed-dimension vectors for the voiceprint features and for the corresponding biometric features of other modalities; S3, confirming whether a voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person; S4, performing supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selecting the model and parameters with the best loss-function evaluation, and outputting 0 or 1 as the recognition result. By inputting a voice signal, the system identifies biometric information of the speaker in other modalities.

Description

Cross-modal biometric feature recognition method and system based on voice signals
Technical Field
The invention relates to biometric recognition methods and systems, and in particular to a cross-modal biometric recognition method and system based on voice signals.
Background
With the widespread application of artificial intelligence in the field of biometric identification, technologies such as face recognition, voiceprint recognition, fingerprint recognition, iris recognition, palm-print recognition and gait recognition have achieved very high recognition rates and a large number of deployable application scenarios.
However, in some practical applications there is no registered data corresponding to the biometric modality to be identified. For example, a telephone recording of a fraud suspect may exist without any registered voice sample of that person, so voiceprint identification cannot be performed.
There are strong correlations between an individual's biometric data across modalities. For example, by listening to a segment of a recording we can often infer who the speaker is, their sex, approximate age, regional dialect, and vocal timbre (whether the voice is thin, sharp, etc.), and all of this information has a counterpart in the face image, because face recognition can likewise infer identity, sex, approximate age, region (south/north), height, character, and so on.
Therefore, it is necessary to provide a cross-modal biometric recognition method and system based on speech signals.
Disclosure of Invention
The invention aims to provide a cross-modal biometric recognition method and system based on voice signals, which identify biometric information of other modalities of the speaker from an input voice signal.
In order to achieve this aim, the invention adopts the following technical scheme: a cross-modal biometric recognition method for voice signals, comprising the following steps:
S1, acquiring the voice signal to be recognized and multi-modal biometric information of a plurality of persons;
S2, extracting features for each single modality with a neural network model, and acquiring fixed-dimension vectors for the voiceprint features and the corresponding biometric features of other modalities;
S3, confirming whether the voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person: the voiceprint feature vector extracted in step S2 and a feature vector of another modality are concatenated into a vector pair; the output of the vector pair is labeled 1 if the voiceprint feature and the other-modality feature come from the same person, and 0 if they come from two different persons.
S4: performing supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selecting the model and parameters with the best loss-function evaluation, and outputting 0 or 1 as the recognition result.
In step S2, the neural network model is used to process the voice signal to be recognized: mel-spectrum features of the input voice signal are extracted with a Python toolkit, and a Resnet neural network model is built. The input of the neural network model is the mel-spectrum vectors extracted by the Python toolkit, and the output is a g-vector feature with a fixed dimension of 128, the g-vector being the output of the neural network.
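The mel-spectrum front end described above can be sketched as follows. The patent does not name the Python toolkit, so this illustration computes log-mel features directly in NumPy; the filterbank construction and all parameter values (16 kHz sample rate, 512-point FFT, 160-sample hop, 64 mel bands) are assumptions for the sketch, not values taken from the patent.

```python
import numpy as np

def mel_filterbank(n_fft: int, n_mels: int, sr: int) -> np.ndarray:
    """Build a simplified triangular mel filterbank of shape (n_mels, n_fft//2+1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                 # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_features(signal: np.ndarray, sr: int = 16000, n_fft: int = 512,
                     hop: int = 160, n_mels: int = 64) -> np.ndarray:
    """Frame the waveform, take the power spectrum, apply the mel filterbank,
    and log-compress; returns a (n_frames, n_mels) feature matrix."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(n_fft, n_mels, sr).T
    return np.log(mel + 1e-10)
```

In practice a library such as librosa would typically replace this hand-rolled filterbank; the sketch only shows the shape of the data entering the Resnet.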
In step S4, an SVM (support vector machine) with a nonlinear kernel function is trained and evaluated. Based on the kernel trick, the nonlinear SVM model can be expressed as:

$$\min_{\alpha}\;\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j K(x_i, x_j)\;-\;\sum_{i=1}^{N}\alpha_i \qquad (1)$$

subject to the conditions

$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\quad i = 1,\dots,N \qquad (2)$$

Subject to condition (2), the parameter α minimizing formula (1) is obtained, where N denotes the number of samples, y the true label values, and x the input values; K(x_i, x_j) is a kernel function over the original low-dimensional feature space X.
The invention also provides a cross-modal biometric recognition system for voice signals, which comprises an acquisition module, an extraction module, a confirmation module and an output module. The acquisition module acquires the voice to be recognized and multi-modal biometric information of a plurality of persons. The extraction module extracts features for each single modality with a neural network model and acquires fixed-dimension vectors for the voiceprint features and the corresponding biometric features of other modalities. The confirmation module confirms whether a voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person: the output of the vector pair is labeled 1 if both features come from the same person, and 0 if they come from two different persons. The output module performs supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selects the model and parameters with the best loss-function evaluation, and outputs 0 or 1 as the recognition result.
Compared with the prior art, the cross-modal biometric recognition method and system based on voice signals have the following beneficial effect: given an input voice signal, the system identifies, among the biometric signals of other modalities from a plurality of candidate persons, the biometric information belonging to the speaker of that voice signal.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without inventive effort, wherein:
FIG. 1 is a flow chart of a cross-modal biometric identification method of a speech signal according to the present invention;
fig. 2 is a block diagram of a cross-modal biometric recognition system of a speech signal of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings, but it should be emphasized that the following embodiments are only exemplary and are not intended to limit the scope and application of the present invention.
Fig. 1 is a flow chart of a cross-modal biometric identification method of a speech signal according to the present invention.
S1: acquiring the voice signal to be recognized and multi-modal biometric information of a plurality of persons;
Specifically, the multi-modal biometric features include the voiceprint features of a speaker's voice, face features from facial information, gait features of walking posture, iris features of the human eye, and so on; these form a multi-modal biometric data set, or a public biometric data set is used, as the training set of the system model.
S2, extracting features for each single modality with a neural network model, and acquiring fixed-dimension vectors for the voiceprint features and the corresponding biometric features of other modalities;
the method comprises the steps of extracting a voice signal to be recognized by using a neural network model, extracting Mel spectral characteristics of the input voice signal to be recognized by using a python toolkit, building a Resnet neural network model through the network model, inputting the neural network model by using Mel spectral vectors extracted by the python program toolkit, outputting the model by using a fixed dimension 128-dimensional g-vector characteristic, and outputting the g-vector characteristic by using a neural network. The specific model structure is as follows:
In the above table, layer denotes the neural network layer, output size the output size of the corresponding layer, 3x3 a convolution kernel, stride the step size, T the number of time steps, and params the parameters of that layer.
The first layer of the network is the conv1 convolutional layer, the second layer is Res1, the third layer is Res2, and so on; the sixth layer is the GSP pooling layer, which adopts global statistics pooling; the seventh layer is a fully-connected layer (FC1); and the output layer is also a fully-connected layer (FC2).
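The GSP pooling layer described above is what turns a variable-length utterance into a fixed-length representation: it concatenates the per-dimension mean and standard deviation of the frame-level features over time. A minimal NumPy sketch of this pooling step (the 64-dimensional frame features are an illustrative assumption; in the network above, the fully-connected layers then project the pooled statistics to the 128-dimensional g-vector):

```python
import numpy as np

def global_statistics_pooling(frame_feats: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame-level features into a fixed 2*D
    vector by concatenating the per-dimension mean and standard deviation."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])
```

Utterances of any length T map to the same output dimension, which is why the g-vector can have a fixed dimension regardless of recording length.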
For the extraction of face features, a DeepID network model is adopted, and a fixed-dimension vector is obtained for each face. The DeepID model is based on a convolutional neural network and comprises 4 convolutional layers (each followed by a max-pooling layer) and a fully-connected layer (i.e. the 160-dimensional DeepID feature).
Fixed-dimension feature vectors reflecting the identity of each person are thus obtained through the network models (that is, the collected multi-modal biometric information is represented by mathematical vectors). During feature extraction, a text record similar to an Excel sheet is kept: for example, for classmate Xiaoming, numbered 1, the extracted voiceprint feature is named 1-vector and the extracted face feature is named 1-factor; other biometric features are recorded similarly. After Xiaoming's features are extracted, any further persons are recorded in the same way, giving records such as:

Name       Number   Voiceprint feature   Face feature   Other features
Xiaoming   1        1-vector             1-factor       1-xxxxxxx
Zhang San  2        2-vector             2-factor       2-xxxxxx
S3, confirming whether the voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person: if the voiceprint feature and the other-modality feature come from the same person, the output of the vector pair is labeled 1; if they come from two different persons, the label is 0.
A number of voiceprint feature vectors generated from the training data are each concatenated with feature vectors of the other biometric modalities to form vector pairs; that is, the voiceprint feature vector extracted in step S2 and another-modality biometric feature vector are concatenated into a vector pair, labeled 1 if the two vectors come from the same person and 0 if they come from two different persons.
For example, let A be a voiceprint feature and B a face feature, both feature vectors extracted as in step S2; A and B may be the face and voiceprint of the same person, or of two different persons.
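The pairing scheme of step S3 can be sketched as follows. The exhaustive all-against-all pairing and the function name are illustrative assumptions; in practice the negative (label-0) pairs would usually be subsampled to keep the classes balanced.

```python
import numpy as np

def make_vector_pairs(voiceprints, faces, voice_ids, face_ids):
    """Concatenate each voiceprint vector with each face vector into a pair;
    label the pair 1 when both vectors belong to the same person, else 0."""
    pairs, labels = [], []
    for v, pid_v in zip(voiceprints, voice_ids):
        for f, pid_f in zip(faces, face_ids):
            pairs.append(np.concatenate([v, f]))   # e.g. 128-d + 160-d = 288-d
            labels.append(1 if pid_v == pid_f else 0)
    return np.array(pairs), np.array(labels)
```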
S4: performing supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selecting the model and parameters with the best loss-function evaluation, and outputting 0 or 1 as the recognition result.
All obtained vector pairs are split in an 8:2 ratio: a random 80% of the pairs form the training set and the remaining 20% form the test set.
Training is carried out with an SVM (support vector machine) based on a nonlinear kernel function; based on the kernel trick, the nonlinear SVM model can be expressed as:

$$\min_{\alpha}\;\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j K(x_i, x_j)\;-\;\sum_{i=1}^{N}\alpha_i \qquad (1)$$

subject to the conditions

$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\quad i = 1,\dots,N \qquad (2)$$

Subject to condition (2), the parameter α minimizing formula (1) is obtained, where N denotes the number of samples, y the true label values, and x the input values; K(x_i, x_j) is a kernel function over the original low-dimensional feature space X, and the computation is carried out in the low-dimensional space, which is the essence of the kernel function.
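The 8:2 split and nonlinear-kernel SVM training of step S4 can be sketched with scikit-learn as a stand-in; the RBF kernel choice and the hyperparameter values (C, gamma) are assumptions for illustration, as the patent does not fix them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_pair_classifier(pairs: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Split the labeled vector pairs 80/20 at random, fit an RBF-kernel SVM
    on the training portion, and report accuracy on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        pairs, labels, test_size=0.2, random_state=seed, stratify=labels)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```

Stratifying the split keeps the proportion of same-person and different-person pairs consistent between training and test sets.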
In application, a fixed-dimension voiceprint feature reflecting the identity of the person is obtained from the input voice signal. The candidate data of the other modality to be recognized are preprocessed, denoised and filtered to obtain fixed-dimension feature vectors reflecting identity information in that modality. Each such candidate vector is concatenated with the fixed-dimension voiceprint feature to form a vector pair to be recognized. The generated vector pairs are then recognized by the supervised-learning classifier model of step S4: a pair is input to the classifier model, which outputs 0 or 1, where 1 indicates that the voiceprint-modality data and the other-modality data come from the same person and 0 indicates different persons.
For example, when an arbitrary person speaks to the system, the person's voiceprint feature A is extracted; face feature vectors B are then extracted from all candidate face pictures; A and B are concatenated into feature vector pairs and used as the input of the system; the supervised classification model then predicts by inference, where 1 indicates that the A-modality data and the B-modality data come from the same person and 0 indicates different persons.
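The inference flow just described can be sketched as below; the function name is hypothetical, and `classifier` is any object with a scikit-learn style `predict` method, such as a classifier trained as in step S4.

```python
import numpy as np

def match_speaker_to_faces(voiceprint, candidate_faces, classifier):
    """Pair the query voiceprint A with every candidate face vector B and
    return the indices of candidates the classifier labels 1 (same person)."""
    pairs = np.array([np.concatenate([voiceprint, f]) for f in candidate_faces])
    predictions = classifier.predict(pairs)
    return [i for i, p in enumerate(predictions) if p == 1]
```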
Fig. 2 is a block diagram of a cross-modal biometric recognition system of a speech signal according to the present invention, which includes:
the system comprises an acquisition module 1, a recognition module and a processing module, wherein the acquisition module is used for acquiring multi-modal biological characteristic information of a plurality of people including voices to be recognized;
specifically, the voice to be recognized is acquired through a microphone, the system has multi-modal biological characteristics including voiceprint characteristics of the voice of a speaker, face characteristics of human face information, gait characteristics of walking posture, iris characteristics of human eyes and the like, and a multi-modal biological characteristic data set is formed or a public biological characteristic data set is used as a training set of a system model.
The extraction module 2 is used for extracting features by utilizing a neural network model in each single modes, and obtaining fixed dimension vectors of voiceprint features and corresponding biological features of other modes;
the confirming module 3 is used for confirming whether the voiceprint feature vectors of the multi-modal biological features and the feature vectors of other dimensions are from the same person, if the voiceprint features and the features of other dimensions are from the same person, the output person of the vector pair is labeled as 1, otherwise, if the voiceprint features and the features of other dimensions are from two different persons, the label is labeled as 0;
and the output module 4 is used for carrying out supervision classification training on the obtained multiple parallel vector pairs and corresponding 0 or 1 labels, selecting a model and parameters with optimal loss function evaluation, and outputting a 0 or 1 confirmation recognition result.
The cross-modal biometric recognition method and system based on voice signals have the following beneficial effect: given an input voice signal, the system identifies, among the biometric signals of other modalities from a plurality of candidate persons, the biometric information belonging to the speaker of that voice signal.
Of course, persons skilled in the art should recognize that the above embodiments are illustrative only and not limiting, and that changes and modifications to the above embodiments fall within the scope of the appended claims as long as they are within the true spirit of the invention.

Claims (4)

1. A method for cross-modal biometric recognition of speech signals, characterized by the following steps:
S1, acquiring the voice signal to be recognized and multi-modal biometric information of a plurality of persons;
S2, extracting features for each single modality with a neural network model, and acquiring fixed-dimension vectors for the voiceprint features and the corresponding biometric features of other modalities;
S3, confirming whether the voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person: the voiceprint feature vector extracted in step S2 and a feature vector of another modality are concatenated into a vector pair; the output of the vector pair is labeled 1 if the voiceprint feature and the other-modality feature come from the same person, and 0 if they come from two different persons;
S4: performing supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selecting the model and parameters with the best loss-function evaluation, and outputting 0 or 1 as the recognition result.
2. The cross-modal biometric recognition method of a speech signal according to claim 1, wherein in step S2 the voice signal to be recognized is processed with a neural network model: the mel-spectrum features of the input voice signal are extracted with a Python toolkit, a Resnet neural network model is built, the input of the neural network model is the mel-spectrum vectors extracted by the Python toolkit, and the output is the g-vector feature with a fixed dimension of 128, the g-vector being the output of the neural network.
3. The cross-modal biometric recognition method of a speech signal according to claim 1,
wherein in step S4 an SVM (support vector machine) with a nonlinear kernel function is trained and evaluated, and based on the kernel trick the nonlinear SVM model can be expressed as:

$$\min_{\alpha}\;\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j K(x_i, x_j)\;-\;\sum_{i=1}^{N}\alpha_i \qquad (1)$$

subject to the conditions

$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\quad i = 1,\dots,N \qquad (2)$$

Subject to condition (2), the parameter α minimizing formula (1) is obtained, where N denotes the number of samples, y the true label values, and x the input values; K(x_i, x_j) is a kernel function over the original low-dimensional feature space X.
4. A system for cross-modal biometric recognition of speech signals, comprising:
an acquisition module for acquiring the voice to be recognized and multi-modal biometric information of a plurality of persons;
an extraction module for extracting features for each single modality with a neural network model, and acquiring fixed-dimension vectors for the voiceprint features and the corresponding biometric features of other modalities;
a confirmation module for confirming whether a voiceprint feature vector of the multi-modal biometric features and a feature vector of another modality come from the same person: the output of the vector pair is labeled 1 if both features come from the same person, and 0 if they come from two different persons;
and an output module for performing supervised classification training on the obtained concatenated vector pairs and their corresponding 0 or 1 labels, selecting the model and parameters with the best loss-function evaluation, and outputting 0 or 1 as the recognition result.
CN201910981216.3A 2019-10-16 2019-10-16 Cross-modal biometric feature recognition method and system based on voice signals Pending CN110738985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910981216.3A CN110738985A (en) 2019-10-16 2019-10-16 Cross-modal biometric feature recognition method and system based on voice signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910981216.3A CN110738985A (en) 2019-10-16 2019-10-16 Cross-modal biometric feature recognition method and system based on voice signals

Publications (1)

Publication Number Publication Date
CN110738985A true CN110738985A (en) 2020-01-31

Family

ID=69268977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910981216.3A Pending CN110738985A (en) 2019-10-16 2019-10-16 Cross-modal biometric feature recognition method and system based on voice signals

Country Status (1)

Country Link
CN (1) CN110738985A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401440A (en) * 2020-03-13 2020-07-10 重庆第二师范学院 Target classification recognition method and device, computer equipment and storage medium
CN114611400A (en) * 2022-03-18 2022-06-10 河北金锁安防工程股份有限公司 Early warning information screening method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190163891A1 (en) * 2013-05-08 2019-05-30 Jpmorgan Chase Bank, N.A. Systems and methods for high fidelity multi-modal out-of-band biometric authentication with human cross-checking
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN109902780A (en) * 2019-02-14 2019-06-18 广州番禺职业技术学院 Testimony of a witness unification verification terminal and system and method based on multi-modal recognition of face
CN109903774A (en) * 2019-04-12 2019-06-18 南京大学 A kind of method for recognizing sound-groove based on angle separation loss function
CN110109541A (en) * 2019-04-25 2019-08-09 广州智伴人工智能科技有限公司 A kind of method of multi-modal interaction
US20190278937A1 (en) * 2018-03-07 2019-09-12 Open Inference Holdings LLC Systems and methods for privacy-enabled biometric processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190163891A1 (en) * 2013-05-08 2019-05-30 Jpmorgan Chase Bank, N.A. Systems and methods for high fidelity multi-modal out-of-band biometric authentication with human cross-checking
US20190278937A1 (en) * 2018-03-07 2019-09-12 Open Inference Holdings LLC Systems and methods for privacy-enabled biometric processing
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN109902780A (en) * 2019-02-14 2019-06-18 广州番禺职业技术学院 Testimony of a witness unification verification terminal and system and method based on multi-modal recognition of face
CN109903774A (en) * 2019-04-12 2019-06-18 南京大学 A kind of method for recognizing sound-groove based on angle separation loss function
CN110109541A (en) * 2019-04-25 2019-08-09 广州智伴人工智能科技有限公司 A kind of method of multi-modal interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ARSHA NAGRANI ET AL.: "Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
ZHENG WANRONG: "A Survey of Cross-Modal Processing Methods for Sound and Image", Journal of Communication University of China (Natural Science Edition) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401440A (en) * 2020-03-13 2020-07-10 重庆第二师范学院 Target classification recognition method and device, computer equipment and storage medium
CN114611400A (en) * 2022-03-18 2022-06-10 河北金锁安防工程股份有限公司 Early warning information screening method and system
CN114611400B (en) * 2022-03-18 2023-08-29 河北金锁安防工程股份有限公司 Early warning information screening method and system

Similar Documents

Publication Publication Date Title
Kim et al. Person authentication using face, teeth and voice modalities for mobile device security
Frischholz et al. BiolD: a multimodal biometric identification system
Alshamsi et al. Automated facial expression and speech emotion recognition app development on smart phones using cloud computing
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
CN110991346A (en) Suspected drug addict identification method and device and storage medium
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
CN110738985A (en) Cross-modal biometric feature recognition method and system based on voice signals
Gawande et al. Biometric-based security system: Issues and challenges
Nahar et al. Twins and Similar Faces Recognition Using Geometric and Photometric Features with Transfer Learning
Shen et al. Secure mobile services by face and speech based personal authentication
CN110298331B (en) Witness comparison method
KR101208678B1 (en) Incremental personal autentication system and method using multi bio-data
Bigun et al. Combining biometric evidence for person authentication
Kadhim et al. A multimodal biometric database and case study for face recognition based deep learning
WO2006057475A1 (en) Face detection and authentication apparatus and method
Khalifa et al. Bimodal biometric verification with different fusion levels
Monica et al. Recognition of medicine using cnn for visually impaired
CN111460880B (en) Multimode biological feature fusion method and system
Shetty et al. Real-time translation of sign language for speech impaired
Raja et al. A Peculiar Reading System for Blind People using OCR Technology
Charishma et al. Smart Attendance System with and Without Mask using Face Recognition
Boujnah et al. Smartphone-captured ear and voice database in degraded conditions
CN106971725B (en) Voiceprint recognition method and system with priority
Shukla et al. A novel approach of speaker authentication by fusion of speech and image features using Artificial Neural Networks
Muruganantham et al. Biometric of speaker authentication using CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200131
