Speaker identification method based on 3DCNN-LSTM and storage medium
Technical Field
The invention belongs to the field of voice signal processing and pattern recognition, and relates to a speaker recognition method based on 3DCNN-LSTM.
Background
Speaker recognition, also known as voiceprint recognition, is an important branch of biometric recognition. Compared with popular biometric identification modes such as fingerprints, hand shapes, retinas, irises and faces, voice is the most convenient and direct mode in human communication; at the same time, collecting a speaker's voice is easier, its cost is controllable, and the speaker's privacy can be well protected.
The task of speaker recognition is to identify which speaker in an established speaker library is speaking. Speaker recognition methods can be divided into text-dependent and text-independent recognition according to whether the speaking content is predefined, or into speaker verification and speaker identification according to whether a single speaker or one of many speakers is to be recognized. The basic system framework mainly consists of feature extraction and a speaker model.
Feature extraction extracts feature vectors from the speaker's voice signal; these vectors should fully reflect individual differences and remain stable over long periods. Speaker features are divided into time-domain features and transform-domain features. Common time-domain features include amplitude, energy, average zero-crossing rate and the like; however, these are feature vectors obtained by passing the voice signal directly through a filter, so the processing is simple but the stability is poor, the ability to express speaker identity information is weak, and they are now rarely applied. Transform-domain features are vector features obtained by transforming the voice signal; common transform-domain features are Linear Prediction Coefficients (LPC) [2], Line Spectral Pair (LSP) parameters, Mel Frequency Cepstrum Coefficients (MFCC) and Bark Frequency Cepstrum Coefficients (BFCC). Transform-domain characteristic parameters better simulate the characteristics of the human voice, so they are more robust, more stable, and widely applied.
Traditional speaker models include the dynamic time warping (DTW) algorithm based on template matching, which is simple and has strong real-time performance but small data storage capacity and poor robustness; classic probability-statistics algorithms such as the GMM-based hidden Markov model (HMM), which are widely applied in various pattern-recognition tasks and achieve good results, but as the required recognition accuracy rises, the number of parameters the model must determine grows, the computational complexity is high, and the recognition time increases accordingly; and the more recently and widely applied i-vector recognition algorithm, which, combined with several channel-compensation techniques, can express the differences between speakers well. Although the i-vector achieves a better recognition effect, the mismatch between the training phase and the testing phase is still large, which is particularly obvious in text-independent speaker recognition, and its robustness to environmental noise is weak.
Such modeling methods can effectively improve the signal-to-noise ratio of the feature information, reduce overfitting during training, and give the model better generalization performance, but the bottleneck features still contain many redundant features, and their ability to characterize individual speakers is weak. To solve the feature-redundancy problem, experts and scholars at home and abroad proposed mapping sentences of different lengths into feature-vector embeddings of fixed dimension: each speech frame of a speaker is fed into a DNN, and the activation output of the last hidden layer is taken as the speaker's feature vector, the d-vector, although this processing is relatively simple. Snyder then proposed the Time-Delay Neural Network (TDNN) to extract an x-vector feature vector from the speaker's voice; a statistical pooling layer in the network structure converts frame-level features into segment-level features, so that, compared with the d-vector, the added time-delay structure better captures the long-range correlation of speech features. However, the recognition performance for short utterances is not greatly improved, and because the TDNN structure does not fully exploit contextual time information under short speech durations, the speaker recognition rate may even decrease. Later, a compact speaker feature structure, the e-vector, was proposed, which avoids the system performance degradation caused by overly long speaker voice and, compared with a standard i-vector system, generates a more accurate speaker subspace without adding extra memory or computation cost.
The invention mainly aims to improve the robustness of the speaker recognition system under different speech durations and to solve the problem that high-dimensional speaker voice features are currently lost during dimension reduction and feature extraction.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by presenting a speaker recognition method based on 3DCNN-LSTM. The technical scheme of the invention is as follows:
A speaker identification method based on 3DCNN-LSTM is characterized by comprising the following steps:
S1, acquiring a voice signal and performing half-text processing on it, including pre-emphasis, windowing and framing, fast Fourier transform and the MFEC transform; the MFEC transform refers to Mel filter-bank log-energy features, i.e., Mel cepstrum coefficient features with the discrete cosine transform step removed;
S2, stacking the MFEC features of a plurality of consecutive frames of the voice signal processed in step S1, so as to process the two-dimensional spectrogram into three-dimensional data, which is used as the input of the 3DCNN; 3DCNN denotes a three-dimensional convolutional neural network;
S3, the 3DCNN extracts the space-time characteristics of the speaker's voice from the spectrogram; an improved 3D convolution kernel is designed in the 3DCNN, the improvement being that the designed internal structure parameters, including the number of convolution kernels, the convolution step length and a built-in BN layer, are optimized for extracting deep-level features; the data obtained by subjecting the three-dimensional data to the convolution and pooling of the 3D convolution kernels is in sequence form, and an LSTM network is introduced to extract time-sequence features;
S4, taking the output of the 3DCNN as the input of an LSTM model, extracting the long-term dependency of the space-time features through the LSTM, and arranging the output of the convolutional neural network in time order to learn the context of the speaker's voice;
S5, in the model training and optimization stage, the optimizer is set to Adam, the number of nodes of the fully connected layer is set to 3026, the dropout method is used with its initial value set to 0.95 and applied to each layer of the network, and the cross-entropy loss function is selected when calculating the loss;
and S6, verifying the trained model by using the test set, adjusting each parameter of the model to obtain a final network model, and finally classifying the speaker by using a Softmax layer.
2. The 3DCNN-LSTM-based speaker recognition method according to claim 1, wherein in step S1 half-text processing is performed on the speech signal according to its short-time stationarity to obtain MFEC features, with the following specific steps:
step A1: passing the speech signal through a high-pass filter to enhance the high-frequency portion of the signal and flatten its spectrum, with transfer function H(z) = 1 - a·z^(-1), where a takes the value 0.95; the signal after pre-emphasis processing is x(t);
step A2: dividing the voice signal into short-time frames to reduce the edge effect of the voice, framing the pre-emphasized signal into x(m, n), where n is the frame length and m is the number of frames, and windowing with a Hamming window

w(n) = 0.54 - 0.46·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;

the windowed and framed speech signal is s_w(m, n) = x(m, n) × w(n), where each frame contains N sample points;
step A3: then transforming the voice data from the time domain to the frequency domain by performing the fast Fourier transform on the windowed signal, obtaining the linear spectrum

E(k) = Σ_(n=0..N-1) s_w(m, n)·e^(-j2πnk/N), 0 ≤ k ≤ N - 1;

taking the squared modulus of the Fourier-transformed data gives the power spectrum X(k) = |E(k)|^2;
step A4: the linear spectrum obtained by the FFT is converted into the Mel spectrum through a Mel filter bank composed of a series of triangular band-pass filters H_m(k), whose frequency response is

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1);

wherein f(m) represents the center frequency of the m-th filter and k represents the frequency index;
step A5: the logarithm is taken on the output of the Mel filters, and the logarithmic spectrum obtained by the logarithm operation is

S(m) = ln( Σ_(k=0..N-1) X(k)·H_m(k) ), 0 ≤ m < M,

where M is the number of Mel filters.
further, step S2 is to convert the processed speech signal into a two-dimensional spectrogram, and process the two-dimensional spectrogram into three-dimensional data by stacking a plurality of spectrograms of consecutive frames, and the processing steps are as follows:
step B1: superposing, at intervals of n milliseconds, the voice frames of an m-second speaker voice signal to perform the half-text processing;
step B2: transforming the signal processed by B1 from the time domain to the frequency domain through the MFEC transform to obtain the two-dimensional spectrogram S(m);
step B3: processing a plurality of utterances of the speaker by B2 to obtain three-dimensional data, and convolving the three-dimensional data with the 3D convolution kernels to extract deep speaker features; a cube is formed by stacking the spectrograms of a plurality of consecutive frames and is convolved with the 3D convolution kernels, the input data being set to Time × Frequency × C, where C represents the number of the speaker's utterances.
Further, the 3D convolution kernels designed in step S3 extract the short-term space-time characteristics of the speaker's voice from the three-dimensional spectrogram. The number of convolution kernels in the first two layers is set to 16, with sizes 3x1x5 and 3x9x5, respectively; three-dimensional convolution over the time-frequency-utterance dimensions of the speaker's voice signal extracts the speaker's deep-level features. The number of convolution kernels in the third and fourth layers is set to 32, with sizes 3x1x4 and 3x8x1, respectively. Pooling is applied after every two layers; the step lengths of the first four layers alternate between 1x1x1 and 1x2x1, and a BN layer is arranged on each layer of the network to normalize the data;

the number of convolution kernels in the fifth and sixth layers is set to 64, with sizes 3x1x3 and 3x7x1, respectively, and step length 1x1x1; the number of convolution kernels in the seventh and eighth layers is set to 128, with sizes consistent with the two preceding layers; a BN layer is also arranged on each layer, and final pooling yields the speaker's deep individual features;
BN normalizes the activations of an intermediate layer of the deep neural network; the key of the algorithm is the introduction of two learnable parameters γ and β. Within one batch, BN operates on each feature: with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as

μ_j = (1/m) · Σ_(i=1..m) z_j^(i),  σ_j^2 = (1/m) · Σ_(i=1..m) (z_j^(i) - μ_j)^2,

ẑ_j = (z_j - μ_j) / √(σ_j^2 + ε),  z̃_j = γ · ẑ_j + β,

where z_j^(i) is the result of the linear calculation of the j-th dimension of the layer for the i-th sample, μ_j, σ_j^2 and ẑ_j denote the batch mean, batch variance and batch-normalized value, respectively, and ε prevents the variance from being 0.
Further, in step S4, the output of the 3DCNN is used as the input of the LSTM model, and the long-term dependency of these spatio-temporal features is extracted. The conventional LSTM unit consists of three gate structures, namely a forget gate, an input gate and an output gate. The forget gate determines which information in the cell state at the previous moment should be discarded and directly participates in updating the cell state; the cell-state update algorithm depends on the hidden-layer output at the previous moment and the input at the current moment, and the cell state at the previous moment serves as a parameter for updating the current state;

forget gate: f_t = σ(W_f · [h_(t-1), x_t] + b_f)

input gate: i_t = σ(W_i · [h_(t-1), x_t] + b_i)

candidate value: C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)

cell-state update: C_t = f_t ⊙ C_(t-1) + i_t ⊙ C̃_t

wherein C_(t-1) and h_(t-1) are the cell state and the hidden-layer output at the previous moment, respectively, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are the weights of the forget gate, the input gate and the candidate cell obtained from training, b_f, b_i and b_C are their biases, i_t is the activation of the input gate, and σ represents the logistic sigmoid function σ(x) = 1 / (1 + e^(-x)).
Further, in step S5, in the model training and optimization stage, the initial learning rate is 0.01, β1 = 0.9, β2 = 0.999 and ε = 10^(-8); the optimizer is set to Adam and the number of nodes of the fully connected layer is set to 3026. Meanwhile, to prevent the gradient-vanishing phenomenon during training, the dropout method is used, with its initial value set to 0.95 and applied to each layer of the network; when calculating the loss, the cross-entropy loss function is selected.

The cross-entropy loss is defined as

L = - Σ_(j=1..k) ŷ_j · log(y_j)

wherein ŷ_j is the true label of the j-th sample, k represents the total number of samples, and y_j is the predicted output of the network model for the j-th sample.
Further, in step S6, speaker classification is performed with the Softmax layer; the Softmax function is

S_i = e^(V_i) / Σ_j e^(V_j),

which represents the Softmax value of the i-th element V_i of the input array.
A storage medium, the storage medium being a computer readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of the above.
The invention has the following advantages and beneficial effects:
In conclusion, by adopting the above technical scheme, under the same experimental environment the 3DCNN-LSTM-based speaker recognition method can better solve the problems of low-dimensional speaker-voice feature loss and weak space-time correlation: the half-text processing, combined with the LSTM back-end network, better memorizes the context information of the speaker's voice, and the 3DCNN network structure extracts the speaker's deep personalized features, improving speaker recognition accuracy. In a word, the 3DCNN-LSTM-based speaker recognition method provided by the invention improves the performance of a speaker recognition system to a large extent.
The innovation of the invention lies mainly in steps S1 and S3. Step S1 converts the text-independent speaker recognition mode into a 'half-text' speaker recognition mode: superposing the voice signal destroys the text content the speaker intends to express but strengthens the individual characteristics of the speaker's voice, thereby improving the recognition rate of the speaker recognition system. Step S3 designs a convolution kernel for a 3D convolutional neural network with a completely new structure; the innovation is that convolutional neural networks are widely applied in image recognition with good results but rarely in speaker recognition, and the difficulty is that the recognition rate of traditional speaker recognition algorithms is generally higher than that of the currently popular neural network structures, whereas the 3D convolutional neural network designed herein can outperform traditional algorithms in speaker recognition on medium and long utterances.
Drawings
FIG. 1 is a general block diagram of a 3DCNN-LSTM based speaker recognition method according to a preferred embodiment of the present invention;
FIG. 2 illustrates the half-text processing;
FIG. 3 is a 3DCNN block diagram;
fig. 4 is a diagram of an LSTM network architecture.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the present invention provides a speaker recognition method based on 3DCNN-LSTM, which is characterized by comprising the following steps:
S1, according to the short-time stationarity of the voice signal, half-text processing is performed on the voice signal to obtain MFEC features, mainly including pre-emphasis, windowing and framing, fast Fourier transform, Mel filter bank and logarithmic energy computation; the specific steps are as follows:
step A1: the speech signal is passed through a high-pass filter to enhance the high-frequency portion of the signal and flatten its spectrum. The transfer function is H(z) = 1 - a·z^(-1), where a takes the value 0.95; the signal after pre-emphasis processing is x(t);
step A2: the speech signal is segmented into short-time frames to reduce the edge effects of speech. The pre-emphasized signal is framed into x(m, n), where n is the frame length and m is the number of frames. A Hamming window is used for windowing:

w(n) = 0.54 - 0.46·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1.

The windowed and framed speech signal is s_w(m, n) = x(m, n) × w(n), where each frame contains N sample points.
A3: next, the voice data is transformed from the time domain to the frequency domain; the Fast Fourier Transform (FFT) is performed on the windowed signal to obtain the linear spectrum

E(k) = Σ_(n=0..N-1) s_w(m, n)·e^(-j2πnk/N), 0 ≤ k ≤ N - 1.

Taking the squared modulus of the Fourier-transformed data gives the power spectrum X(k) = |E(k)|^2.
step A4: in order to better simulate the auditory properties of the human ear, the linear spectrum obtained by the FFT is converted into the Mel spectrum through a Mel filter bank composed of a series of triangular band-pass filters H_m(k), whose frequency response is

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1),

wherein f(m) represents the center frequency of the m-th filter and k represents the frequency index.
Step A5: the output of the Mel filters is logarithmized, which better reflects the perceived loudness of human hearing and also compensates for the natural downward tilt of the frequency amplitude spectrum. The logarithmic spectrum obtained by the logarithm operation is

S(m) = ln( Σ_(k=0..N-1) X(k)·H_m(k) ), 0 ≤ m < M,

where M is the number of Mel filters.
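For ease of understanding only, the following is a minimal Python sketch of the MFEC extraction of steps A1 to A5; the sample rate, frame sizes and the use of librosa merely to build the Mel filter bank are illustrative assumptions and do not limit the invention:

```python
import numpy as np
import librosa  # assumed helper, used only to build the triangular Mel filter bank

def mfec(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40, a=0.95):
    """MFEC features: Mel filter-bank log-energies, i.e. MFCC without the DCT."""
    # Step A1: pre-emphasis, H(z) = 1 - a*z^(-1) with a = 0.95
    x = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step A2: framing into x(m, n) and Hamming windowing
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # Step A3: FFT and power spectrum X(k) = |E(k)|^2
    X = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Step A4: triangular Mel filter bank H_m(k)
    H = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Step A5: logarithm of the filter-bank outputs S(m); no DCT follows
    return np.log(X @ H.T + 1e-10)   # shape: (n_frames, n_mels)
```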
S2: the processed voice signal is converted into a two-dimensional spectrogram, which is processed into three-dimensional data by stacking the spectrograms of a plurality of consecutive frames; the specific process is shown in fig. 2, and the processing steps are as follows:
step B1: the speech frames with the length of m seconds of the speaker are superposed at intervals of n milliseconds to carry out the half-text processing.
Step B2: the signal processed by B1 is transformed from the time domain to the frequency domain through the MFEC transform, obtaining the two-dimensional spectrogram S(m).
Step B3: a plurality of utterances of the speaker are processed by B2 to obtain three-dimensional data, which is then convolved with the 3D convolution kernels designed by the invention to extract deep speaker features. The invention forms a cube by stacking the spectrograms of a plurality of consecutive frames and then performs the convolution operation with the 3D convolution kernels within this cube; the input data is set to Time × Frequency × C, where C represents the number of the speaker's utterances.
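As a non-limiting illustration, the stacking of steps B1 to B3 may be sketched as follows; the concrete dimensions (80 frames × 40 Mel bins × 20 utterances) are assumptions chosen for demonstration:

```python
import numpy as np

def stack_utterances(mfec_maps, n_frames=80):
    """Steps B1-B3: stack C two-dimensional MFEC spectrograms of one speaker
    into a Time x Frequency x C cube that serves as the 3DCNN input."""
    clips = [m[:n_frames] for m in mfec_maps]   # crop each map to a fixed length
    return np.stack(clips, axis=-1).astype(np.float32)

# e.g. 20 utterances, each an (80, 40) MFEC map -> cube of shape (80, 40, 20)
```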
S3: the invention designs 3D convolution kernels to extract the short-term space-time characteristics of the speaker's voice from the three-dimensional spectrogram; the structure is shown in figure 3. The number of convolution kernels in the first two layers is set to 16, with sizes 3x1x5 and 3x9x5, so that three-dimensional convolution can be performed over the time-frequency-utterance dimensions of the speech signal to extract the speaker's deep-level features; the number of convolution kernels in the third and fourth layers is set to 32, with sizes 3x1x4 and 3x8x1, respectively. Pooling is applied after every two layers. In addition, the step lengths of the first four layers alternate between 1x1x1 and 1x2x1, which both fully extracts the speaker's individual characteristics and keeps parameter learning efficient; meanwhile, a BN (Batch Normalization) layer is arranged on each layer of the network to normalize the data and stabilize the parameters, thereby avoiding gradient vanishing or explosion.
The number of convolution kernels in the fifth and sixth layers is set to 64, with sizes 3x1x3 and 3x7x1, respectively, and step length 1x1x1; the number of convolution kernels in the seventh and eighth layers is set to 128, with sizes consistent with the two preceding layers. A BN layer is also arranged on each layer, and final pooling yields the speaker's deep individual features.
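The eight-layer front end described above may be sketched in PyTorch as follows, for illustration only; the input shape, the activation function (PReLU) and the pooling windows are assumptions chosen so that the stated kernel counts and sizes fit together:

```python
import torch
import torch.nn as nn

def block(c_in, c_out, k, s):
    # one 3D convolution layer with its built-in BN layer (PReLU assumed)
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, stride=s),
                         nn.BatchNorm3d(c_out), nn.PReLU())

class Front3DCNN(nn.Module):
    """Eight 3D conv layers (16-16-32-32-64-64-128-128 kernels), BN on every
    layer, pooling after every two of the first four layers and at the end."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            block(1,   16, (3, 1, 5), (1, 1, 1)),
            block(16,  16, (3, 9, 5), (1, 2, 1)),
            nn.MaxPool3d((1, 1, 2)),
            block(16,  32, (3, 1, 4), (1, 1, 1)),
            block(32,  32, (3, 8, 1), (1, 2, 1)),
            nn.MaxPool3d((1, 1, 2)),
            block(32,  64, (3, 1, 3), (1, 1, 1)),
            block(64,  64, (3, 7, 1), (1, 1, 1)),
            block(64, 128, (3, 1, 3), (1, 1, 1)),
            block(128, 128, (3, 7, 1), (1, 1, 1)),
            nn.MaxPool3d((1, 1, 2)),
        )

    def forward(self, x):
        # x: (batch, 1, 20 utterances, 80 frames, 40 Mel bins) (shape assumed)
        return self.net(x)   # -> (batch, 128, 4, 3, 1) for the shape above
```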
BN normalizes the activations of an intermediate layer of the deep neural network; the key of the algorithm is the introduction of two learnable parameters γ and β. Within one batch, BN operates on each feature: with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as

μ_j = (1/m) · Σ_(i=1..m) z_j^(i),  σ_j^2 = (1/m) · Σ_(i=1..m) (z_j^(i) - μ_j)^2,

ẑ_j = (z_j - μ_j) / √(σ_j^2 + ε),  z̃_j = γ · ẑ_j + β,

where z_j^(i) is the result of the linear calculation of the j-th dimension of the layer for the i-th sample, μ_j, σ_j^2 and ẑ_j denote the batch mean, batch variance and batch-normalized value, respectively, and ε prevents the variance from being 0.
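For illustration, the batch-normalization step defined above can be written as the following short NumPy sketch:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """z: (m, j) linear outputs of one layer for a batch of m samples;
    each of the j dimensions is normalized, then rescaled by the two
    learnable parameters gamma and beta."""
    mu = z.mean(axis=0)                     # batch mean per dimension
    var = z.var(axis=0)                     # batch variance per dimension
    z_hat = (z - mu) / np.sqrt(var + eps)   # eps prevents division by zero
    return gamma * z_hat + beta
```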
S4: the output of the 3DCNN is used as the input of the LSTM model, and the long-term dependency of the space-time features is extracted. The conventional LSTM unit consists of three gate structures, namely a forget gate, an input gate and an output gate; the structure is shown in fig. 4. The forget gate determines which information in the cell state at the previous moment should be discarded and directly participates in updating the cell state; the cell-state update algorithm depends on the hidden-layer output at the previous moment and the input at the current moment, and the cell state at the previous moment serves as a parameter for updating the current state.
Forget gate: f_t = σ(W_f · [h_(t-1), x_t] + b_f)

Input gate: i_t = σ(W_i · [h_(t-1), x_t] + b_i)

Candidate value: C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)

Cell-state update: C_t = f_t ⊙ C_(t-1) + i_t ⊙ C̃_t

Wherein C_(t-1) and h_(t-1) are the cell state and the hidden-layer output at the previous moment, respectively; x_t is the input at the current moment; C̃_t is the candidate value to be added to the memory cell; W_f, W_i and W_C are the weights of the forget gate, the input gate and the candidate cell obtained from training; b_f, b_i and b_C are their biases; i_t is the activation of the input gate; and σ represents the logistic sigmoid function σ(x) = 1 / (1 + e^(-x)).
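For illustration, one time step of the gate computations above is sketched in NumPy below; the weight shapes, and the output gate (W_o, b_o) that completes the standard LSTM cell but is not spelled out above, are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations above; every W_* has
    shape (hidden, hidden + input). The output gate (W_o, b_o) is assumed."""
    z = np.concatenate([h_prev, x_t])             # [h_(t-1), x_t]
    f_t = sigmoid(W_f @ z + b_f)                  # forget gate
    i_t = sigmoid(W_i @ z + b_i)                  # input gate
    c_tilde = np.tanh(W_C @ z + b_C)              # candidate cell value
    c_t = f_t * c_prev + i_t * c_tilde            # cell-state update
    h_t = sigmoid(W_o @ z + b_o) * np.tanh(c_t)   # hidden-layer output
    return h_t, c_t
```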
S5: in the model training and optimization stage, the initial learning rate is 0.01, β1 = 0.9, β2 = 0.999 and ε = 10^(-8); the optimizer is set to Adam and the number of nodes of the fully connected layer is set to 3026. To prevent the gradient-vanishing phenomenon during training, the dropout method is used, with its initial value set to 0.95 and applied to each layer of the network. When calculating the loss, the cross-entropy loss function is selected.
S6: the model is trained with a training set, with cross entropy as the loss function.
The cross-entropy loss is defined as

L = - Σ_(j=1..k) ŷ_j · log(y_j)

wherein ŷ_j is the true label of the j-th sample, k is the total number of samples, and y_j is the predicted output of the network model for the j-th sample.
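A non-limiting sketch of the training configuration of step S5 in PyTorch follows; `model` and `train_loader` are assumed to exist (the 3DCNN-LSTM network with its 3026-node fully connected layer, and a data loader yielding input cubes with speaker labels), and the dropout value 0.95 is interpreted here as a keep-probability:

```python
import torch
import torch.nn as nn

# S5 hyperparameters as stated above; `model` and `train_loader` are assumed
optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()    # cross-entropy loss function
dropout = nn.Dropout(p=1 - 0.95)     # keep-probability 0.95, applied per layer inside the model

for cube, speaker_id in train_loader:
    optimizer.zero_grad()
    logits = model(cube)             # 3DCNN -> LSTM -> FC(3026) forward pass
    loss = criterion(logits, speaker_id)
    loss.backward()
    optimizer.step()
```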
Speaker classification is then carried out with a Softmax layer; the Softmax function is

S_i = e^(V_i) / Σ_j e^(V_j),

which represents the Softmax value of the i-th element V_i of the input array.
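For completeness, the Softmax classification of this step can be sketched as:

```python
import numpy as np

def softmax(v):
    """S_i = exp(V_i) / sum_j exp(V_j): Softmax value of the i-th element."""
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

# the predicted speaker is the index with the largest Softmax probability:
# speaker = int(np.argmax(softmax(logits)))
```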
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.