CN107731233B - Voiceprint recognition method based on RNN - Google Patents

Voiceprint recognition method based on RNN

Info

Publication number
CN107731233B
CN107731233B (application CN201711070510.6A)
Authority
CN
China
Prior art keywords
voice
speaker
neural network
voice data
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711070510.6A
Other languages
Chinese (zh)
Other versions
CN107731233A (en)
Inventor
冯毅夫
王华锋
徐雷
杜俊逸
付明霞
马晨南
齐一凡
潘海侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Fuji Robot Co.,Ltd.
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201711070510.6A priority Critical patent/CN107731233B/en
Publication of CN107731233A publication Critical patent/CN107731233A/en
Application granted granted Critical
Publication of CN107731233B publication Critical patent/CN107731233B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Abstract

The invention provides an RNN-based voiceprint recognition method. After the MFCC features of denoised voice data and their first-order differences are obtained, a recurrent neural network extracts high-level speaker features from the MFCC features, the extracted features are classified with a softmax classifier, and the speaker is finally identified with a naive Bayes method. Unlike the silence removal of traditional methods, this method retains the silent segments of the voice data; based on the recurrent neural network it can extract context-related features as well as high-level characteristics of the speaker's voice, such as speaking style and rhythm, so the feature information is more complete and more representative of the speaker. Compared with existing Gaussian-based voiceprint recognition methods, the method places relatively low demands on the voice data and achieves higher accuracy; even with large amounts of data the accuracy remains high and the running speed does not degrade.

Description

Voiceprint recognition method based on RNN
Technical Field
The invention provides an RNN-based voiceprint recognition method and relates to the fields of deep learning, pattern recognition and speech signal processing.
Background
With the rapid development of information technology, how to accurately authenticate a person's identity, protect personal privacy and guarantee information security has become a problem that urgently needs to be solved. Compared with traditional identity authentication, biometric identity authentication cannot be lost, stolen or forgotten in use; it is not only fast and convenient but also accurate and reliable. Voiceprint recognition is one of the most popular biometric technologies and has unique advantages in applications such as remote authentication, so it receives more and more attention: WeChat has launched a voice-lock login mode; the Lenovo A586, the first phone in the world to use voiceprint recognition for unlocking, opened the way for applying the technology; and private-banking clients of Barclays Wealth, a unit of Barclays Bank, complete identity verification with their own voice. Compared with face and fingerprint recognition, voiceprint recognition has always kept a low profile, and public awareness of it is not high. In fact, thanks to its high usability, high user acceptance and low acquisition cost, voiceprint recognition has been developing rapidly, if quietly, in recent years, and its range of applications keeps expanding. Annual investment in speech and voiceprint recognition by large global companies, including Apple, Google, Microsoft, Baidu and iFlytek, has been rising, and public data show that by 2020 the global market for speech-related pattern recognition will grow from 61.9 million US dollars in 2015 to 200 million US dollars, so the future market potential can be said to be enormous.
Common voiceprint recognition methods mainly include: the method comprises a voiceprint recognition method based on signal processing, a voiceprint recognition method based on acoustic characteristics and pattern matching, a voiceprint recognition method based on a Gaussian mixture model and a voiceprint recognition method based on deep learning.
Method based on signal processing: this is the earliest method applied in the development of voiceprint recognition. It computes signal-level parameters of the speech data using techniques from signal processing and then performs template matching, statistical variance analysis and the like. The method is extremely sensitive to the voice data, has low accuracy, and its recognition results are not ideal.
Recognition method based on acoustic features and pattern matching: from the late 1970s to the late 1980s, speaker recognition research focused on acoustic feature parameters and new pattern matching methods. Researchers successively proposed speaker recognition feature parameters such as LPC spectral coefficients, LSP spectral coefficients, perceptual linear prediction coefficients and Mel-frequency cepstral coefficients. During this period, techniques such as dynamic time warping, vector quantization, support vector machines and artificial neural networks became widely used in speech recognition and also formed the core techniques of speaker recognition. These speaker recognition models all impose certain limits on speech length, text and speech channel, whereas in practice short speech and cross-channel problems are common, and the cross-channel problem has the greatest impact on the performance of a voiceprint recognition system.
Recognition method based on the Gaussian mixture model: from the 1990s onward, the Gaussian mixture model (GMM), with its simplicity, flexibility, effectiveness and good robustness, quickly became the mainstream technology of text-independent speaker recognition and brought speaker recognition research into a new stage. A GMM is a probabilistic model that models the probability distribution of the features, unlike approaches that model the speech features directly; the decision mode also changes, with the similarity between models judged by likelihood scores. However, it requires a very large amount of voice data, is very sensitive to channel and environmental noise, and cannot meet the requirements of real scenarios.
Voiceprint recognition method based on deep learning: this kind of method uses a large number of training samples to learn voiceprint features automatically and can extract discriminative voiceprint features. However, existing deep-learning-based methods do not consider the context-dependent nature of the speech signal, the extracted features do not represent the speaker well, and the advantages of deep learning are not fully exploited.
To solve these problems, the invention provides an RNN-based voiceprint recognition method that can extract high-level voice features and complete the voiceprint recognition task accurately and efficiently.
Disclosure of Invention
The technical problem solved by the invention is as follows: existing voiceprint recognition methods do not consider the context correlation of voice data, the extracted features cannot represent the speaker well, and the strong feature-extraction capability of deep learning is not exploited. A voiceprint recognition method based on a recurrent neural network (RNN) is therefore provided.
The technical scheme adopted by the invention is as follows: the method comprises the following four steps:
Step (1): denoise the input voice data with spectral subtraction, eliminating the channel noise, i.e. the noise introduced by the recording equipment; the clean voice data obtained after removing the channel noise is used as the training input.
Step (2): frame the clean voice data obtained in step (1) with a frame length of 25 ms and a frame shift of 10 ms, so that each piece of voice data is divided into hundreds to thousands of frames; compute the MFCC feature parameters of each frame, keep the first 13 MFCC dimensions, then compute their first-order and second-order differences, take the first 13 dimensions of each, and splice them into a 39-dimensional feature vector that serves as the feature parameters of that frame; combine the 39-dimensional features of every 64 frames into one 64×39 two-dimensional block, discard any remainder of fewer than 64 frames, and label each block with the speaker's identity as the input of the neural network.
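For concreteness, the following is a minimal sketch of the feature pipeline of step (2), assuming 16 kHz audio and the librosa library for MFCC and difference computation (the library, the sampling rate and all function and variable names are illustrative assumptions, not part of the patent):

```python
import numpy as np
import librosa

def speech_to_feature_blocks(clean_wav, sr=16000, speaker_label=0):
    """Frame clean speech (25 ms window, 10 ms shift), extract 13 MFCCs plus
    their first- and second-order differences, and stack every 64 frames
    into one 64x39 block, as described in step (2)."""
    n_fft = int(0.025 * sr)   # 25 ms frame length
    hop = int(0.010 * sr)     # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=clean_wav, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)   # shape (13, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                  # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)                  # second-order difference
    feats = np.vstack([mfcc, d1, d2]).T                        # shape (n_frames, 39)

    # Group every 64 consecutive frames into one 64x39 block; drop the remainder.
    n_blocks = feats.shape[0] // 64
    blocks = feats[:n_blocks * 64].reshape(n_blocks, 64, 39)
    labels = np.full(n_blocks, speaker_label)   # every block keeps the speaker's label
    return blocks, labels
```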
Step (3): feed the two-dimensional data obtained in step (2) into the training of the recurrent neural network. The recurrent neural network has 64 LSTM units; each LSTM unit has 256 hidden neurons, and the network is unrolled over 64 time steps, each step sharing the same network model. A unidirectional recurrent neural network is used, so the last LSTM unit contains the information of all the preceding units, and the output of the last LSTM unit is taken as the final voice feature and passed to the recognition stage.
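A hypothetical PyTorch sketch of the network in step (3); the framework, the class name and everything beyond the stated 64 steps, 39-dimensional input, 256 hidden units and unidirectional structure are assumptions for illustration:

```python
import torch
import torch.nn as nn

class VoiceprintRNN(nn.Module):
    """Unidirectional LSTM unrolled over 64 time steps of 39-dim frame features.
    Only the output of the last step is used as the utterance-segment feature."""
    def __init__(self, n_speakers, input_dim=39, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                            num_layers=1, batch_first=True)   # unidirectional
        self.classifier = nn.Linear(hidden_dim, n_speakers)   # softmax layer

    def forward(self, x):                # x: (batch, 64, 39)
        out, _ = self.lstm(x)            # out: (batch, 64, 256)
        last = out[:, -1, :]             # output of the last LSTM step only
        return self.classifier(last)     # logits; softmax applied in the loss / at test time
```

During training a cross-entropy loss (which applies the softmax implicitly) would be used on these logits; at test time a softmax over them gives the per-speaker probabilities used in step (4).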
Step (4): recognize the voice features obtained in step (3) and determine the speaker to whom the voice data belongs.
Further, the spectral-subtraction denoising of step (1) has the advantage that only the channel noise is removed while the silent segments are retained. The junctions between silent and voiced segments represent high-level characteristics of the speaker, such as speaking style and rhythm, very well, and these high-level features are subsequently extracted with the recurrent neural network.
Further, the MFCC features of step (2) take human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear spectrum based on auditory perception and then converted to the cepstral domain, which makes them very prominent among hand-crafted speech features. The standard cepstral parameters (MFCCs) reflect only the static characteristics of the speech; the dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference reflects the dynamics of speech). Combining dynamic and static characteristics improves the recognition performance of the system.
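One common form of the difference computation used for such dynamic features (the regression window size $N$ is an assumption; $N = 2$ is typical) is

$$\Delta c_t \;=\; \frac{\sum_{n=1}^{N} n\,\bigl(c_{t+n} - c_{t-n}\bigr)}{2\sum_{n=1}^{N} n^{2}},$$

where $c_t$ is the 13-dimensional MFCC vector of frame $t$; applying the same formula to $\Delta c_t$ yields the second-order difference $\Delta^{2} c_t$, and $[\,c_t,\ \Delta c_t,\ \Delta^{2} c_t\,]$ forms the 39-dimensional vector of step (2).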
Furthermore, step (3) introduces a new technical means: applying a recurrent neural network (RNN) to speaker recognition. Voice data is continuous and highly context-dependent, and RNNs are outstanding at extracting context-related features, which is why they are widely used in natural language processing and speech recognition. Here the RNN provides high-level features containing context information on top of the traditional voice features, making the features more complete and more representative.
Further, recognizing the speech features in step (4) includes classifying the speech segments formed by splicing 64 frames of features, with softmax as the classifier. The speaker of the whole utterance is then confirmed with a naive Bayes method: since one utterance yields several 64-frame segments and therefore several classification results, the speaker that receives the most softmax classification results among the features of an utterance is confirmed as the speaker of that utterance.
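A minimal sketch of this whole-utterance decision (the function and variable names are illustrative):

```python
from collections import Counter

def identify_speaker(block_predictions):
    """Step (4) decision rule: every 64-frame block has already been assigned a
    speaker label by the softmax classifier; the label that occurs most often
    across the blocks of the utterance is taken as the final speaker."""
    return Counter(block_predictions).most_common(1)[0][0]

# For example, block-level predictions from one utterance:
# identify_speaker([7, 7, 3, 7, 12, 7])  ->  7
```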
The principle of the invention is as follows:
the invention provides an RNN-based voiceprint recognition method, overcomes the defects that the conventional deep learning-based method does not consider the context-related essence of a voice signal, the extracted features cannot represent a speaker well, the advantages of deep learning are not fully exerted, and the like, and has the characteristics of strong adaptability, good performance and high result accuracy. The method comprises four steps: firstly, denoising input voice data by adopting a spectral subtraction method, and taking the obtained pure voice data with channel noise eliminated as the input of training data. Pure voice data is divided into frames according to the frame length of 25ms and the frame shift of 10ms, MFCC characteristic parameters of each frame are respectively calculated, front 13-dimensional MFCC characteristic parameters are selected and are continuously calculated to calculate first-order and second-order differences, front 13-dimensional characteristics are respectively extracted and spliced into a 39-dimensional characteristic vector to serve as the characteristic parameters of the voice signal of the frame, each voice can be divided into hundreds of frames, 39-dimensional characteristics of every 64 frames of voice signals are combined into a 64 x 39 two-dimensional voice acoustic characteristic parameter, voice signals of less than 64 frames are discarded, and labels of all two-dimensional voice acoustic characteristic parameters generated by voice data spoken by the same speaker are represented by the identity of the same speaker and serve as the input of a neural network. The recurrent neural network has 64 LSTM units (equal to the number of rows of input data), with 256 neurons in each LSTM unit. The method adopts a one-way circulation neural network, so that the last LSTM unit contains the information of all the units, and the output of the last LSTM unit is used as the final voice characteristic to enter a recognition stage. And classifying the obtained speech features by using softmax and obtaining the result. Because a speech segment can obtain a plurality of speech features, that is, a speech segment may obtain a plurality of results, according to the naive bayes method, among the plurality of speech features obtained by a speech segment, the speaker with the largest classification result obtained by softmax is identified as the speaker to which the speech segment belongs.
The invention mainly comprises the following four aspects:
and preprocessing voice data. In an actual scene, due to the difference of recording equipment and environments, collected voice data can generate more channel noise, and great difficulty is brought to an identification task. Therefore, there is a need for a limited method for preprocessing speech data to improve the accuracy of the algorithmic prediction. The method adopts a spectral subtraction method to carry out denoising on the voice data, removes channel noise and completely saves all information related to a speaker.
Extraction and splicing of acoustic features of the voice signal. The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only the static characteristics of the speech parameters. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference reflects the dynamics of speech), and combining dynamic and static characteristics improves the recognition performance of the system. After the MFCC features and their second-order difference features are spliced, and considering that the neural network expects two-dimensional input, the 39-dimensional features of every 64 frames are combined into one 64×39 two-dimensional block, remainders of fewer than 64 frames are discarded, and each block is labeled with the speaker's identity as the input of the neural network.
Extraction of high-level features with an RNN. Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. In this method the recurrent network model has 256 hidden neurons and is unrolled over 64 time steps, each step sharing the same network model. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage.
Speaker recognition with softmax combined with naive Bayes. A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker.
Compared with the prior art, the invention has the advantages that:
1. The invention provides extraction and splicing of acoustic features of the voice signal. The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only static characteristics. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference), and combining dynamic and static characteristics improves recognition performance. After the MFCC features and their second-order difference features are spliced, and since the neural network expects two-dimensional input, the 39-dimensional features of every 64 frames are combined into one 64×39 block, remainders of fewer than 64 frames are discarded, and each block is labeled with the speaker's identity as the input of the neural network.
2. The invention provides extraction of high-level features with an RNN. Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. The recurrent network in this method has 64 LSTM units (equal to the number of rows of the input data), each with 256 neurons. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage.
3. The invention provides speaker confirmation with softmax combined with naive Bayes. A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker.
Drawings
FIG. 1 is a flowchart of the RNN-based voiceprint recognition method of the present invention;
FIG. 2 is a schematic diagram of speech denoising;
FIG. 3 is a schematic diagram of extraction and concatenation of acoustic features of a speech signal;
FIG. 4 is a schematic diagram of RNN feature extraction;
FIG. 5 is a schematic diagram of Softmax in conjunction with naive Bayes recognition.
Detailed Description
Figure 1 shows the overall processing flow of the invention. The RNN-based voiceprint recognition method mainly comprises the following steps: first, the input voice data is denoised with spectral subtraction; the resulting clean voice data is framed, and the MFCC feature parameters and their second-order differences are extracted; the MFCC parameters and second-order differences of several consecutive frames are then spliced into a two-dimensional feature parameter matrix that is used as the input of the recurrent neural network. The invention trains with the LSTM variant of the recurrent neural network; through its gate mechanisms, the forget gate, input gate and output gate (in FIG. 1, σ denotes the sigmoid activation function and tanh the hyperbolic tangent activation function), the LSTM retains long-term information of the sequence. Training yields an LSTM model used to recognize each short speech segment; finally, the recognition results of all short segments of a long utterance are counted, and the final speaker is confirmed with the naive Bayes method.
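The gate mechanism referred to above follows the standard LSTM formulation (the textbook form, not a formula stated in the patent). With $x_t$ the 39-dimensional feature of frame $t$ and $h_{t-1}$ the previous 256-dimensional hidden state,

$$\begin{aligned}
f_t &= \sigma\!\bigl(W_f\,[h_{t-1}, x_t] + b_f\bigr), \\
i_t &= \sigma\!\bigl(W_i\,[h_{t-1}, x_t] + b_i\bigr), \qquad \tilde{c}_t = \tanh\!\bigl(W_c\,[h_{t-1}, x_t] + b_c\bigr), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma\!\bigl(W_o\,[h_{t-1}, x_t] + b_o\bigr), \qquad h_t = o_t \odot \tanh(c_t),
\end{aligned}$$

so the cell state $c_t$ carries long-term sequence information forward, and the hidden state $h_{64}$ of the last step is the final voice feature.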
The invention is further described below with reference to other figures and embodiments.
1. Voice preprocessing module
In real scenarios, differences in recording equipment and environment introduce considerable channel noise into the collected voice data, which makes the recognition task much harder. The speech data therefore needs to be preprocessed to improve the accuracy of the algorithm's predictions. The method denoises the voice data with spectral subtraction, removing channel noise while fully preserving all speaker-related information. No separate noise recording is required: the first 5 frames of the voice data, about 0.1 s during which there is no speaker voice but only the noise of the recording itself, are taken as the channel-noise template, and the noise is removed with spectral subtraction to obtain clean voice data. As shown in FIG. 2, the noisy speech is first transformed with the FFT while its phase information is retained; the noise power spectrum is then subtracted from the power spectrum to obtain the power spectrum of the clean speech, which is combined with the previously retained phase and transformed back with the inverse FFT (IFFT) to recover the clean speech.
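A minimal numpy sketch of this denoising step, assuming 16 kHz audio so that 25 ms / 10 ms frames are 400 / 160 samples (the windowing, the simple overlap-add and all names are illustrative assumptions):

```python
import numpy as np

def spectral_subtract(noisy, frame_len=400, hop=160, n_noise_frames=5):
    """Spectral subtraction as described above: the first ~0.1 s (here 5 frames)
    is assumed to contain only channel noise; its average power spectrum is
    subtracted from every frame and the original phase is kept for the IFFT."""
    frames = np.stack([noisy[i:i + frame_len]
                       for i in range(0, len(noisy) - frame_len + 1, hop)])
    spec = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    phase = np.angle(spec)                                # keep the noisy phase
    power = np.abs(spec) ** 2
    noise_power = power[:n_noise_frames].mean(axis=0)     # channel-noise template
    clean_power = np.maximum(power - noise_power, 0.0)    # subtract, floor at zero
    clean_spec = np.sqrt(clean_power) * np.exp(1j * phase)
    clean_frames = np.fft.irfft(clean_spec, n=frame_len, axis=1)

    # Overlap-add the denoised frames back into a waveform.
    out = np.zeros(len(noisy))
    for k, start in enumerate(range(0, len(noisy) - frame_len + 1, hop)):
        out[start:start + frame_len] += clean_frames[k]
    return out
```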
2. Extraction and splicing of acoustic features of voice signals
The acoustic features commonly used for voiceprint or speech recognition are MFCCs; however, speech is a continuous signal with strong context correlation, and conventional MFCC features alone cannot represent the data well because they reflect only static characteristics. The dynamic characteristics of speech can be described by difference spectra of these static features (e.g. the second-order difference), and combining dynamic and static characteristics improves recognition performance. The invention frames the clean voice data obtained in the previous step, extracts the MFCC feature parameters and their second-order differences, and splices the MFCC parameters and second-order differences of several consecutive frames into a two-dimensional feature parameter matrix used as the input of the recurrent neural network. As shown in FIG. 3, the speech is framed with a frame length of 25 ms and a frame shift of 10 ms, the MFCC features and their second-order difference features are computed and spliced, the 39-dimensional features of every 64 frames are combined into one 64×39 two-dimensional acoustic feature block, and voice signals of fewer than 64 frames are discarded; all blocks produced from voice data spoken by the same speaker are labeled with that speaker's identity and used as the input of the neural network. For example, a 15 s utterance can be divided into 1498 frames and therefore produces 23 feature inputs of size 64×39, all labeled with the same speaker.
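The numbers in this example can be checked directly (a small illustrative computation, not code from the patent):

```python
# A 15 s utterance framed with a 25 ms window and a 10 ms shift:
duration_ms, win_ms, hop_ms = 15_000, 25, 10
n_frames = 1 + (duration_ms - win_ms) // hop_ms   # 1 + 1497 = 1498 frames
n_blocks = n_frames // 64                          # 23 complete 64x39 blocks
print(n_frames, n_blocks)                          # -> 1498 23 (26 leftover frames are dropped)
```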
3. Extraction of high-level features using an RNN
Speech is continuous data and highly context-dependent. RNNs have proven outstanding at extracting context-related features and have been very successful in speech recognition, natural language processing and related fields. The recurrent network in this method has 64 LSTM units (equal to the number of rows of the input data), each with 256 neurons. Because a unidirectional recurrent network is used, the last LSTM unit contains the information of all preceding units, and its output is taken as the final voice feature for the recognition stage. As shown in FIG. 4, the 64×39-dimensional features are fed into the neural network, i.e. the 39-dimensional features of each frame are supplied in sequence; since the training set used in this method contains 251 speakers, the final feature output is a 251-dimensional vector that is processed by softmax during recognition.
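Continuing the hypothetical PyTorch sketch from step (3), the shapes in the recognition stage would look as follows (the class name and the 251-speaker setting follow the description above; everything else is assumed):

```python
import torch

model = VoiceprintRNN(n_speakers=251)   # sketch class defined after step (3)
blocks = torch.randn(8, 64, 39)         # 8 feature blocks: 64 frames x 39 dims each
logits = model(blocks)                  # shape (8, 251): one score per enrolled speaker
probs = torch.softmax(logits, dim=-1)   # per-block speaker probabilities
block_labels = probs.argmax(dim=-1)     # block-level decisions passed to the vote in FIG. 5
```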
4. Recognition of speaker by softmax in cooperation with naive Bayes
A whole utterance is divided into many frames, each frame produces an MFCC feature (including its second-order difference), and every 64 frames of features entering the recurrent network yield one result, so an utterance yields multiple results. The method classifies each 64-frame feature input with softmax; since an utterance may yield multiple results, according to the naive Bayes method the speaker with the most softmax classification results among the features of that utterance is confirmed as its speaker. As shown in FIG. 5, the test speech is denoised in the same way as the training speech, its feature parameters are extracted and spliced to obtain several feature inputs (for example, 23 feature inputs from a 15 s utterance), and these are fed into the LSTM model to obtain 23 classification results; the speaker corresponding to the most frequent label is the speaker of the whole utterance.
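Tying the hypothetical sketches above together, test-time recognition of one utterance might look as follows (spectral_subtract, speech_to_feature_blocks, VoiceprintRNN and identify_speaker are the illustrative helpers defined earlier, not patent-specified APIs):

```python
import torch

def recognize_speaker(noisy_wav, model, sr=16000):
    """Denoise, cut into 64x39 feature blocks, classify each block with the
    LSTM + softmax model, and return the most frequent speaker label."""
    clean = spectral_subtract(noisy_wav)
    blocks, _ = speech_to_feature_blocks(clean, sr=sr)
    with torch.no_grad():
        logits = model(torch.tensor(blocks, dtype=torch.float32))
    block_labels = logits.argmax(dim=-1).tolist()
    return identify_speaker(block_labels)
```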
Technical contents not described in detail in the present invention belong to the well-known techniques of those skilled in the art.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of those embodiments. Various changes apparent to those skilled in the art are permitted as long as they remain within the spirit and scope of the invention as defined by the appended claims, and everything that makes use of the inventive concepts falls under protection.

Claims (4)

1. An RNN-based voiceprint recognition method, characterized by: the method comprises the following steps:
step (1): denoising the input voice data with spectral subtraction, wherein channel noise is eliminated, the channel noise being the noise caused by the recording equipment; the clean voice data obtained after eliminating the channel noise is used as the training-data input;
step (2): framing the clean voice data obtained in step (1) with a frame length of 25 ms and a frame shift of 10 ms, each piece of voice data being divided into hundreds to thousands of frames; computing the MFCC feature parameters of each frame, keeping the first 13 MFCC dimensions, computing their first-order and second-order differences, extracting the first 13 dimensions of each, and splicing them into a 39-dimensional feature vector serving as the feature parameters of that frame; combining the 39-dimensional features of every 64 frames into one 64×39 two-dimensional acoustic feature parameter block, discarding voice signals of fewer than 64 frames, and labeling all blocks produced from voice data spoken by the same speaker with that speaker's identity as the input of the neural network;
step (3): feeding the two-dimensional data obtained in step (2) into the training of the recurrent neural network; the recurrent neural network has 64 LSTM units; each LSTM unit has 256 hidden neurons, the network is unrolled over 64 time steps, and each step shares the same network model; the recurrent neural network is unidirectional, so the last LSTM unit contains the information of all preceding LSTM units, and the output of the last LSTM unit is taken as the final voice feature and passed to the recognition stage;
step (4): recognizing the voice features obtained in step (3) and determining, with a naive Bayes method, the speaker to whom the voice data belongs; specifically: classifying each 64-frame feature input using softmax as the classifier, and, according to the naive Bayes method, confirming the speaker with the most softmax classification results among the voice features obtained from a speech segment as the speaker to whom that segment belongs.
2. The voiceprint recognition method according to claim 1, characterized in that: in step (1), the input voice data is denoised with spectral subtraction, only the channel noise is eliminated and the silent segments are retained.
3. The voiceprint recognition method according to claim 1, characterized in that: the MFCC characteristics in the step (2) take the auditory characteristics of human beings into account; the linear spectrum is first mapped into the Mel nonlinear spectrum based on auditory perception and then converted onto the cepstrum.
4. The voiceprint recognition method according to claim 1, characterized in that: in step (4), recognizing the voice features includes classifying the speech segments formed by splicing 64 frames of voice features.
CN201711070510.6A 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN Active CN107731233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711070510.6A CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711070510.6A CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Publications (2)

Publication Number Publication Date
CN107731233A CN107731233A (en) 2018-02-23
CN107731233B true CN107731233B (en) 2021-02-09

Family

ID=61222539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711070510.6A Active CN107731233B (en) 2017-11-03 2017-11-03 Voiceprint recognition method based on RNN

Country Status (1)

Country Link
CN (1) CN107731233B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288974B (en) * 2018-03-19 2024-04-05 北京京东尚科信息技术有限公司 Emotion recognition method and device based on voice
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108922513B (en) * 2018-06-04 2023-03-17 平安科技(深圳)有限公司 Voice distinguishing method and device, computer equipment and storage medium
CN109256139A (en) * 2018-07-26 2019-01-22 广东工业大学 A kind of method for distinguishing speek person based on Triplet-Loss
CN108877812B (en) * 2018-08-16 2021-04-02 桂林电子科技大学 Voiceprint recognition method and device and storage medium
CN112863547B (en) * 2018-10-23 2022-11-29 腾讯科技(深圳)有限公司 Virtual resource transfer processing method, device, storage medium and computer equipment
US11114103B2 (en) * 2018-12-28 2021-09-07 Alibaba Group Holding Limited Systems, methods, and computer-readable storage media for audio signal processing
CN109712628B (en) * 2019-03-15 2020-06-19 哈尔滨理工大学 Speech noise reduction method and speech recognition method of DRNN noise reduction model established based on RNN
CN109903774A (en) * 2019-04-12 2019-06-18 南京大学 A kind of method for recognizing sound-groove based on angle separation loss function
CN110444223B (en) * 2019-06-26 2023-05-23 平安科技(深圳)有限公司 Speaker separation method and device based on cyclic neural network and acoustic characteristics
CN111951791A (en) * 2020-08-26 2020-11-17 上海依图网络科技有限公司 Voiceprint recognition model training method, recognition method, electronic device and storage medium
CN112420056A (en) * 2020-11-04 2021-02-26 乐易欢 Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle
CN113823290A (en) * 2021-08-31 2021-12-21 杭州电子科技大学 Multi-feature fusion voiceprint recognition method
CN114040052B (en) * 2021-11-01 2024-01-19 江苏号百信息服务有限公司 Method for identifying audio collection and effective audio screening of telephone voiceprint

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
CN106682089A (en) * 2016-11-26 2017-05-17 山东大学 RNNs-based method for automatic safety checking of short message
WO2017112466A1 (en) * 2015-12-21 2017-06-29 Microsoft Technology Licensing, Llc Multi-speaker speech separation

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923855A (en) * 2009-06-17 2010-12-22 复旦大学 Test-irrelevant voice print identifying system
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
US9824684B2 (en) * 2014-11-13 2017-11-21 Microsoft Technology Licensing, Llc Prediction-based sequence recognition
CN104408483B (en) * 2014-12-08 2017-08-25 西安电子科技大学 SAR texture image classification methods based on deep neural network
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
KR102313028B1 (en) * 2015-10-29 2021-10-13 삼성에스디에스 주식회사 System and method for voice recognition
CN106128465A (en) * 2016-06-23 2016-11-16 成都启英泰伦科技有限公司 A kind of Voiceprint Recognition System and method
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN107220588A (en) * 2017-04-20 2017-09-29 苏州神罗信息科技有限公司 A kind of real-time gesture method for tracing based on cascade deep neutral net
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160372119A1 (en) * 2015-06-19 2016-12-22 Google Inc. Speech recognition with acoustic models
WO2017112466A1 (en) * 2015-12-21 2017-06-29 Microsoft Technology Licensing, Llc Multi-speaker speech separation
CN106682089A (en) * 2016-11-26 2017-05-17 山东大学 RNNs-based method for automatic safety checking of short message

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"deep neural network based text-dependent speaker verification:preliminary results";Bhattacharya G;《Odyssey》;20161231;全文 *
"RNN-BLSTM声学模型的说话人自适应方法研究";黄智颖;《万方平台》;20170828;全文 *
"speaker recognition using artificial neural network";Fazal Mueen;《IEEE students conference 》;20021231;全文 *

Also Published As

Publication number Publication date
CN107731233A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
CN107731233B (en) Voiceprint recognition method based on RNN
Kabir et al. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities
Tirumala et al. Speaker identification features extraction methods: A systematic review
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
Zhang et al. Deep belief networks based voice activity detection
EP3156978A1 (en) A system and a method for secure speaker verification
Devi et al. Automatic speaker recognition from speech signals using self organizing feature map and hybrid neural network
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
KR20010102549A (en) Speaker recognition
CN113053410B (en) Voice recognition method, voice recognition device, computer equipment and storage medium
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
El-Moneim et al. Text-dependent and text-independent speaker recognition of reverberant speech based on CNN
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Birla A robust unsupervised pattern discovery and clustering of speech signals
Mohammed et al. Advantages and disadvantages of automatic speaker recognition systems
Dennis et al. Generalized Hough transform for speech pattern classification
Imam et al. Speaker recognition using automated systems
Nguyen et al. Vietnamese speaker authentication using deep models
GS et al. Synthetic speech classification using bidirectional LSTM Networks
Nainan et al. Performance evaluation of text independent automatic speaker recognition using VQ and GMM
Golik et al. Mobile music modeling, analysis and recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231206

Address after: Room 903-60, Building 59, Xiangshan Huijing Business Center, No. 2, Houtang Road, the Taihu Lake National Tourism Resort, Suzhou, Jiangsu

Patentee after: Suzhou Fuji Robot Co.,Ltd.

Address before: 100191 1010 Xueyuan international building, No.1 Zhichun Road, Haidian District, Beijing

Patentee before: Wang Huafeng

TR01 Transfer of patent right