CN111968652B - Speaker identification method based on 3DCNN-LSTM and storage medium - Google Patents


Info

Publication number
CN111968652B
CN111968652B (application CN202010674320.0A)
Authority
CN
China
Prior art keywords: speaker, layer, LSTM, voice, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010674320.0A
Other languages
Chinese (zh)
Other versions
CN111968652A (en
Inventor
胡章芳
斯星童
罗元
徐博浩
熊润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beta Intelligent Technology Beijing Co ltd
Shenzhen Hongyue Information Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202010674320.0A priority Critical patent/CN111968652B/en
Publication of CN111968652A publication Critical patent/CN111968652A/en
Application granted granted Critical
Publication of CN111968652B publication Critical patent/CN111968652B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a speaker identification method based on 3DCNN-LSTM and a storage medium, comprising the following steps: S1, performing semi-text processing on the voice signal and converting the speaker voice into a spectrogram by the MFEC transformation; S2, stacking a plurality of continuous frames so that the spectrogram becomes three-dimensional data used as the input of the 3DCNN; S3, extracting the space-time characteristics of the speaker voice from the spectrogram with the 3DCNN; S4, extracting the long-term dependence of the space-time characteristics through the LSTM, and arranging the output of the convolutional neural network in time order to learn the context of the speaker voice; S5, updating the parameters of the model during training to minimize the loss and obtaining the final model through continuous iterative optimization; and S6, performing speaker classification with a Softmax layer. The invention can effectively alleviate the loss of low-dimensional speaker-voice features and the weak space-time correlation, and improve speaker recognition accuracy.

Description

Speaker identification method based on 3DCNN-LSTM and storage medium
Technical Field
The invention belongs to the field of voice signal processing and pattern recognition, and relates to a speaker recognition method based on 3DCNN-LSTM.
Background
Speaker recognition, also known as voiceprint recognition, is an important branch of biometric recognition. Compared with popular biometric modalities such as fingerprints, hand shapes, retinas, irises and faces, voice is the most convenient and direct medium in human communication; speaker voice is easy to collect at a controllable cost, and the privacy of the speaker can be well protected.
The task of speaker recognition is to identify which speaker in an established speaker library is speaking. Speaker recognition can be divided into text-dependent and text-independent recognition according to whether the speaking content is predefined, or into speaker verification and speaker identification according to whether a single claimed identity or one of many enrolled speakers is to be decided. The basic system framework mainly consists of feature extraction and a speaker model.
Feature extraction derives feature vectors from the speaker's voice signal that fully reflect individual differences and remain stable over long periods. Speaker features are divided into time-domain features and transform-domain features. Common time-domain features include amplitude, energy and average zero-crossing rate, but these are obtained by passing the voice signal directly through a filter; the processing is simple, the stability is poor, and the ability to express speaker identity information is weak, so they are now rarely used. Transform-domain features are vectors obtained by transforming the voice signal; common examples are Linear Prediction Coefficients (LPC), Line Spectral Pairs (LSP), Mel Frequency Cepstral Coefficients (MFCC) and Bark Frequency Cepstral Coefficients (BFCC). Transform-domain parameters better model the characteristics of the human voice, so they are more robust, more stable, and widely applied.
Among traditional speaker models, dynamic time warping based on template matching is simple and has strong real-time performance, but it stores little data and has poor robustness. Classical probabilistic-statistical methods such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM) have been widely applied to various pattern recognition tasks with good results, but as the demand for recognition accuracy rises, the number of model parameters to be estimated grows, the computational complexity is high, and the recognition time increases correspondingly. The more widely applied i-vector recognition algorithm, combined with several channel compensation techniques, can express the differences between speakers well; although it achieves better recognition results, the mismatch between the training and testing phases remains large, which is particularly evident in text-independent speaker recognition, and its robustness to environmental noise is weak.
Deep-model-based methods can effectively improve the signal-to-noise ratio of the feature information, reduce overfitting during training and give the model better generalization, but the bottleneck features still contain considerable redundancy and characterize individual speakers weakly. To address feature redundancy, researchers have proposed mapping sentences of different lengths into embedding vectors of fixed dimension: each speech frame of a speaker is fed into a DNN, and the activation of the last hidden layer is output as the speaker feature vector (d-vector), although this processing is relatively simple. Snyder then proposed the Time-Delay Neural Network (TDNN) to extract x-vector feature vectors from speaker voice; a statistics pooling layer in the network converts frame-level features into segment-level features, so compared with the d-vector the added time-delay structure better captures the long-range correlation of speech features. However, recognition performance for short utterances is not greatly improved, and because the TDNN structure does not fully exploit the contextual timing under short-utterance conditions, the speaker recognition rate may even decrease. A compact speaker feature structure, the e-vector, was subsequently proposed; it avoids the performance degradation caused by overly long speaker voice and, compared with a standard i-vector system, generates a more accurate speaker subspace without adding extra memory or computation.
The invention mainly aims to improve the robustness of the speaker recognition system under different utterance durations and to address the loss of high-dimensional speaker voice features that currently occurs during dimension reduction and feature extraction.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by presenting a speaker identification method based on 3DCNN-LSTM and a storage medium. The technical scheme of the invention is as follows:
A speaker identification method based on 3DCNN-LSTM, characterized by comprising the following steps:
S1, acquiring a voice signal and performing semi-text processing on it, including pre-emphasis, windowing and framing, fast Fourier transform and MFEC transformation; the MFEC transformation refers to passing the signal through a Mel filter bank and taking the logarithmic energy spectrum, i.e. Mel cepstral coefficient features with the discrete cosine transform removed;
S2, stacking the MFEC features of a plurality of continuous frames of the voice signal processed in step S1, so that the two-dimensional spectrogram becomes three-dimensional data used as the input of the 3DCNN, where 3DCNN denotes a three-dimensional convolutional neural network;
S3, extracting the space-time characteristics of the speaker voice from the spectrogram with the 3DCNN; an improved 3D convolution kernel is designed in the 3DCNN, the improvement being that the internal structural parameters, including the number of convolution kernels, the convolution stride and a built-in BN layer, are optimized for extracting deep-level features; the data obtained after the three-dimensional data passes through the convolution and pooling of the 3D convolution kernels is in sequence form, so an LSTM network is introduced for time-sequence feature extraction;
S4, taking the output of the 3DCNN as the input of an LSTM model, extracting the long-term dependence of the space-time characteristics through the LSTM, and arranging the output of the convolutional neural network in time order to learn the context of the speaker voice;
S5, in the model training and optimization stage, setting the optimizer to Adam and the number of nodes of the fully connected layer to 3026, using the dropout method with an initial value of 0.95 applied to every layer of the network, and selecting the cross entropy loss function when computing the loss;
and S6, verifying the trained model with the test set, adjusting the parameters of the model to obtain the final network model, and finally classifying the speaker with a Softmax layer.
Further, in step S1, semi-text processing is performed on the speech signal according to its short-time stationarity to obtain the MFEC features; the specific steps are as follows:
Step A1: the speech signal is passed through a high-pass filter to enhance the high-frequency part of the signal and flatten the speech spectrum; the transfer function is H(z) = 1 - a·z^(-1), where a takes the value 0.95, and the signal after pre-emphasis processing is x(t);
step A2: dividing a voice signal into short-time frame windows to reduce the edge effect of voice, framing the pre-emphasized signal to be x (m, n), wherein n is the frame length, m is the number of frames, and windowing is performed by adopting a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
The windowed and framed speech signal is: s_w(m, n) = x(m, n) × w(n), where each frame contains N sample points;
a3: then, transforming the voice data x (n) from the time domain to the frequency domain, and performing fast fourier transform on the windowed signal to obtain a linear spectrum e (k) as follows:
E(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1
Taking the squared modulus of the Fourier-transformed data gives the power spectrum:
X(k) = |E(k)|²
step A4: the linear spectrum obtained by FFT is converted into Mel spectrum by a Mel filter bank composed of a series of triangular band-pass filters H m (k) The frequency response function expression of the filter is as follows:
H_m(k) =
  0,                               k < f(m-1) or k > f(m+1)
  (k - f(m-1)) / (f(m) - f(m-1)),  f(m-1) ≤ k ≤ f(m)
  (f(m+1) - k) / (f(m+1) - f(m)),  f(m) ≤ k ≤ f(m+1)
wherein f(m) represents the center frequency of the m-th filter and k represents the frequency index;
step A5: logarithm is taken on the output of the Mel filter, and the logarithmic spectrum S (m) obtained by logarithm operation is:
S(m) = ln( Σ_{k=0}^{N-1} X(k)·H_m(k) ), m = 1, 2, ..., M, where M is the number of Mel filters.
further, step S2 is to convert the processed speech signal into a two-dimensional spectrogram, and process the two-dimensional spectrogram into three-dimensional data by stacking a plurality of spectrograms of consecutive frames, and the processing steps are as follows:
step B1: superposing voice frames of n milliseconds at intervals on a speaker voice signal of m seconds for half-text processing;
step B2: transforming the signals processed by B1 from time domain data to frequency domain by MFEC transformation to obtain S (m) two-dimensional spectrogram;
step B3: processing a plurality of speeches of a speaker by B2 to obtain three-dimensional data, performing convolution on the three-dimensional data and a 3D convolution kernel to extract deep speaker characteristics, forming a cube by stacking a plurality of spectrogram of continuous frames, and performing convolution operation on the cube and the 3D convolution kernel, wherein input data is set as Time multiplied by Frequency multiplied by C, and C represents the speaking volume of the speaker.
Further, the 3D convolution kernels designed in step S3 extract the short-term space-time characteristics of the speaker's voice from the three-dimensional spectrogram. The number of convolution kernels of the first two layers is set to 16, their sizes are 3x1x5 and 3x9x5 respectively, and three-dimensional convolution is performed over the time-frequency-utterance dimensions of the speaker's voice signal to extract the deep-level characteristics of the speaker; the number of convolution kernels of the third and fourth layers is set to 32, their sizes are 3x1x4 and 3x8x1 respectively, and pooling is carried out after every two layers. In addition, the strides of the first four layers are 1x1x1 and 1x2x1 respectively, and a BN layer is also arranged in each layer of the network to normalize the data;
the number of convolution kernels of the fifth and sixth layers is set to 64 and their sizes are 3x1x3 and 3x7x1 respectively, with the stride set to 1x1x1; the number of convolution kernels of the seventh and eighth layers is set to 128 with sizes consistent with the previous two layers, a BN layer is likewise arranged in each layer, and pooling is finally carried out to obtain the deep individual characteristics of the speaker;
BN normalizes the activations of the intermediate layers of a deep neural network; the key of the algorithm is that two learnable parameters γ and β are introduced:
y_j = γ·x̂_j + β
In one batch with m training samples and j dimensions (j neuron nodes), BN normalizes each dimension j separately:
μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)
σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) - μ_j)²
x̂_j = (x_j - μ_j) / sqrt(σ_j² + ε)
wherein x_j^(i) is the result of the linear calculation of the j-th dimension for the i-th sample of the layer, μ_j, σ_j² and x̂_j denote the batch mean, the batch variance and the batch-normalized value respectively, and ε is added to prevent the variance from being 0.
Further, in step S4 the output of the 3DCNN is used as the input of the LSTM model and the long-term dependence of these spatio-temporal features is extracted. The conventional LSTM unit is composed of three gate structures, namely a forget gate, an input gate and an output gate. The forget gate determines which information in the cell state of the previous time step should be discarded and participates directly in updating the cell state; the cell-state update depends on the hidden-layer output of the previous time step and the input of the current time step, and the cell state of the previous time step is used as a parameter for updating the current state;
Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
Input gate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
Candidate value: C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
Cell-state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
wherein C_{t-1} and h_{t-1} are the cell state and the hidden-layer output at the previous time step respectively, x_t is the input at the current time step, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are the weights of the forget gate, the input gate and the candidate cell obtained from training, b_f, b_i and b_C are their biases, i_t is the input-gate activation, and σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^(-x))
further, in the step S5, in the optimization stage of model training, an initial learning rate is 0.01, β 1 is 0.9, β 2 is 0.999, and ∈ is 10E-8, the optimizer is set to Adam, the number of nodes in a full connection layer is set to 3026, and meanwhile, to prevent a gradient disappearance phenomenon from occurring in the training process, a dropout method is used, an initial value of the method is set to 0.95 and applied to each layer of network, and when a loss function is calculated, a cross entropy loss function is selected;
the cross entropy algorithm is defined as follows:
Figure BDA0002583501950000071
wherein the content of the first and second substances,
Figure BDA0002583501950000072
the true label of the jth sample, k representing the total number of samples;
y j : the predicted output of the network model for the jth sample.
Further, in step S6, speaker classification is performed with the Softmax layer; the formula of the Softmax function is as follows:
S_i = e^(V_i) / Σ_j e^(V_j)
wherein S_i is the Softmax value of the i-th element V_i of the input array.
A storage medium, the storage medium being a computer readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of the above.
The invention has the following advantages and beneficial effects:
in conclusion, by adopting the technical scheme, under the same experimental environment, the speaker recognition method based on the 3DCNN-LSTM can better solve the problems of speaker voice low-dimensional feature loss and weak space-time correlation, the processing mode of the culture can better memorize the context information of the speaker voice by combining the LSTM rear-end network, the deep personalized feature of the speaker is extracted by the network structure of the 3DCNN, and the speaker recognition accuracy is improved. In a word, the 3 DCNN-LSTM-based speaker recognition method provided by the invention improves the performance of a speaker recognition system to a greater extent.
The innovation of the invention lies mainly in step S1 and step S3. Step S1 converts text-independent speaker recognition into a "semi-text-related" speaker recognition mode: superimposing the voice signal destroys the textual content the speaker intends to express, but strengthens the individual characteristics of the speaker's voice, thereby improving the recognition rate of the speaker recognition system. Step S3 designs a 3D convolutional neural network with a completely new kernel structure; the innovation is that convolutional neural networks, although widely applied in image recognition with good results, are rarely applied to speaker recognition, and the difficulty is that traditional speaker recognition algorithms generally achieve higher recognition rates than the currently popular neural network structures, whereas the 3D convolutional neural network designed here can outperform the traditional algorithms on speaker recognition for medium and long utterances.
Drawings
FIG. 1 is a general block diagram of a 3DCNN-LSTM based speaker recognition method according to a preferred embodiment of the present invention;
FIG. 2 is a halftoning process;
FIG. 3 is a 3DCNN block diagram;
fig. 4 is a diagram of an LSTM network architecture.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the present invention provides a speaker recognition method based on 3DCNN-LSTM, which is characterized by comprising the following steps:
s1, according to the short-time stationarity of the voice signal, performing half-text processing on the voice signal to obtain MFEC characteristics, mainly including pre-emphasis, windowing and framing, fast Fourier transform, Mel filter bank and Log logarithmic energy, and the specific steps are as follows:
step A1: the speech signal is passed through a high pass filter to enhance the high frequency portion of the signal and flatten the speech signal. Its transfer function is H (z) ═ 1-az -1 A takes a value of 0.95, and the signal after pre-emphasis processing is x (t);
step A2: the speech signal is segmented into short temporal windows of frames to reduce the edge effects of speech. The pre-emphasized signal is framed into x (m, n) (n is the frame length, and m is the number of frames). We use hamming windows for windowing:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
The windowed and framed speech signal is: s_w(m, n) = x(m, n) × w(n), where each frame contains N sample points.
A3: next, voice data x (n) is transformed from the time domain to the frequency domain, and Fast Fourier Transform (FFT) is performed on the windowed signal, so as to obtain a linear spectrum e (k) as:
E(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1
Taking the squared modulus of the Fourier-transformed data gives the power spectrum:
X(k) = |E(k)|²
step A4: in order to better simulate the auditory properties of human ears, the linear spectrum obtained by FFT is converted into Mel spectrum by a Mel filter bank. The Mel filter bank is composed of a series of triangular band-pass filters H m (k) The frequency response function expression of the filter is as follows:
H_m(k) =
  0,                               k < f(m-1) or k > f(m+1)
  (k - f(m-1)) / (f(m) - f(m-1)),  f(m-1) ≤ k ≤ f(m)
  (f(m+1) - k) / (f(m+1) - f(m)),  f(m) ≤ k ≤ f(m+1)
wherein f(m) represents the center frequency of the m-th filter and k represents the frequency index.
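As an illustrative aid (not part of the claimed method), the triangular band-pass filter bank described by the frequency response above can be sketched in a few lines of numpy; the filter count, FFT size and sample rate below are assumptions, not values fixed by the patent.

```python
import numpy as np

def mel_filterbank(num_filters=40, n_fft=512, sample_rate=16000):
    """Build triangular Mel band-pass filters H_m(k) (parameters are assumptions)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Center frequencies f(m), evenly spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):   # rising slope (k - f(m-1)) / (f(m) - f(m-1))
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope (f(m+1) - k) / (f(m+1) - f(m))
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```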
Step A5: the output of the Mel filter is logarithmized, so that the perceptual loudness of people can be better reflected, and the Mel filter can also be used for compensating the natural downward inclination of a frequency amplitude spectrum. The logarithmic spectrum s (m) obtained by the logarithmic operation is:
S(m) = ln( Σ_{k=0}^{N-1} X(k)·H_m(k) ), m = 1, 2, ..., M, where M is the number of Mel filters.
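Steps A1-A5, together with the stacking of several utterances into the Time × Frequency × C cube of step B3, can likewise be sketched in numpy; the frame length, frame shift, pre-emphasis handling of the first sample and the mel_filterbank helper from the previous sketch are assumptions made only for illustration.

```python
import numpy as np

def mfec(signal, fbank, frame_len=400, frame_shift=160, a=0.95):
    """MFEC features: pre-emphasis, framing, Hamming window, FFT, Mel filtering, log."""
    # A1: pre-emphasis with H(z) = 1 - a*z^-1
    x = np.append(signal[0], signal[1:] - a * signal[:-1])
    # A2: framing and Hamming windowing
    n_fft = (fbank.shape[1] - 1) * 2
    num_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    # A3: FFT and squared modulus (power spectrum X(k))
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # A4 + A5: Mel filtering and logarithm (no discrete cosine transform, unlike MFCC)
    return np.log(power @ fbank.T + 1e-10)           # shape: (Time, Frequency)

def build_cube(utterances, fbank):
    """Stack the MFEC maps of C utterances into a Time x Frequency x C cube (step B3)."""
    maps = [mfec(u, fbank) for u in utterances]
    t = min(m.shape[0] for m in maps)                 # truncate to a common frame count
    return np.stack([m[:t] for m in maps], axis=-1)   # (Time, Frequency, C)
```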
s2: converting the processed voice signal into a two-dimensional spectrogram, and processing the two-dimensional spectrogram into three-dimensional data by a method of stacking a plurality of continuous frames of spectrograms, wherein the specific process is shown in fig. 2, and the processing steps are as follows:
step B1: the speech frames with the length of m seconds of the speaker are superposed at intervals of n milliseconds to carry out the half-text processing.
Step B2: and transforming the signal processed by the B1 from time domain data to a frequency domain through MFEC transformation to obtain an S (m) two-dimensional spectrogram.
Step B3: a plurality of speeches of the speaker are processed by B2 to obtain three-dimensional data, and then the three-dimensional data and the 3D convolution kernel designed by the invention are convoluted to extract deep speaker characteristics. The invention forms a cube by stacking a plurality of spectrogram of continuous frames, and then performs convolution operation with a 3D convolution kernel in the cube, wherein the input data is set as Time multiplied by Frequency multiplied by C, and C represents the speaking volume of a speaker.
S3: the invention designs a 3D convolution kernel to extract short-term space-time characteristics of speaker voice from a three-dimensional spectrogram, and the structure of the short-term space-time characteristics is shown in figure 3. The number of the convolution kernels of the first two layers is set to be 16, the sizes of the convolution kernels are 3x1x5 and 3x9x5, three-dimensional convolution can be carried out on the time-frequency-speech volume of the speech signal of the speaker, and the deep level features of the speaker are extracted; the number of the third and fourth convolution kernels is set to 32, and the sizes of the third and fourth convolution kernels are 3x1x4 and 3x8x1 respectively. Each two layers are subjected to pool treatment. In addition, the step lengths of the first four layers are respectively 1x1x1 and 1x2x1, so that the individual characteristics of the speaker can be fully extracted, the high efficiency of parameter learning can be ensured, meanwhile, a BN layer (Batch Normalization, BN) is also arranged on each layer of the network for carrying out normalized processing on data, and the stability of parameters is ensured, so that the problem of gradient disappearance or explosion is avoided.
The number of convolution kernels of the fifth layer and the sixth layer is set to 64, and the sizes of the convolution kernels are 3x1x3 and 3x7x1 respectively; step size is set to 1x1x 1; the number of convolution kernels of the seventh layer and the eighth layer is set to be 128, the size of the convolution kernels is consistent with that of the former two layers, a BN layer is also arranged on each layer, and finally pooling is carried out to obtain deep individual characteristics of the speaker.
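A PyTorch sketch of the eight-layer convolution stack described above is given below; the kernel counts and sizes follow the text, while the padding, pooling sizes, activation function, axis ordering and example input size are assumptions made only for illustration, so this is an illustrative layout rather than the exact network of the invention.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    """One 3D convolution with a built-in BN layer; ReLU activation is an assumption."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=kernel, stride=stride),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class Speaker3DCNN(nn.Module):
    """Kernel counts/sizes follow the description; pooling sizes are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1,   16, (3, 1, 5), (1, 1, 1)),
            conv_block(16,  16, (3, 9, 5), (1, 2, 1)),
            nn.MaxPool3d((1, 1, 2)),                 # pooling after every two layers
            conv_block(16,  32, (3, 1, 4), (1, 1, 1)),
            conv_block(32,  32, (3, 8, 1), (1, 2, 1)),
            nn.MaxPool3d((1, 1, 2)),
            conv_block(32,  64, (3, 1, 3), (1, 1, 1)),
            conv_block(64,  64, (3, 7, 1), (1, 1, 1)),
            conv_block(64, 128, (3, 1, 3), (1, 1, 1)),
            conv_block(128, 128, (3, 7, 1), (1, 1, 1)),
            nn.AdaptiveAvgPool3d((1, None, 1)),      # final pooling; keep the time axis for the LSTM
        )

    def forward(self, x):
        # x: (batch, 1, utterances, frames, mfec_coeffs), e.g. (N, 1, 20, 80, 40) - sizes are assumptions
        f = self.features(x)                         # (batch, 128, 1, T', 1)
        f = f.squeeze(-1).squeeze(2)                 # (batch, 128, T')
        return f.transpose(1, 2)                     # (batch, T', 128): a sequence over time for the LSTM
```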
BN normalizes the activations of the intermediate layers of a deep neural network; the key of the algorithm is that two learnable parameters γ and β are introduced:
y_j = γ·x̂_j + β
In one batch with m training samples and j dimensions (j neuron nodes), BN normalizes each dimension j separately:
μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)
σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) - μ_j)²
x̂_j = (x_j - μ_j) / sqrt(σ_j² + ε)
wherein x_j^(i) is the result of the linear calculation of the j-th dimension for the i-th sample of the layer, μ_j, σ_j² and x̂_j denote the batch mean, the batch variance and the batch-normalized value respectively, and ε is added to prevent the variance from being 0.
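The per-dimension normalization above can be checked with a short numpy sketch (illustrative only, outside the patent text):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (m samples, j dimensions). Returns gamma * x_hat + beta per dimension."""
    mu = x.mean(axis=0)                      # batch mean mu_j
    var = x.var(axis=0)                      # batch variance sigma_j^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized value x_hat_j
    return gamma * x_hat + beta              # scale and shift with the learnable gamma, beta
```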
S4: the output of the 3DCNN is used as the input of the LSTM model, and the long-term dependence relation of the space-time characteristics is extracted. The conventional LSTM unit consists of three gate structures, namely a forgetting gate, input gates and output gates, and the structures are shown in fig. 4. And determining which information should be discarded in the unit state at the previous moment by using a forgetting gate, directly participating in updating the unit state, wherein an updating algorithm of the unit state is related to the hidden layer output at the previous moment and the input at the current moment, and the unit state at the previous moment is taken as a parameter for updating the current state.
Forget gate algorithm: f. of t =σ(W f ×[h t-1 ,x t ]+b f )
The unit state updating algorithm: i.e. i t =σ(W i ×[h t-1 ,x t ]+b i )
Figure BDA0002583501950000114
Figure BDA0002583501950000115
Wherein C is t-1 And h t-1 The cell state and hidden layer output, x, at the previous moment, respectively t Is an input for the current time of day,
Figure BDA0002583501950000116
is a candidate value, W, to be added to the memory cell f 、W i And W C Are the weights of the forgetting gate, the input gate and the candidate cell, respectively, obtained from the training, b f 、b i And b C Is the deviation thereof, i t Is that
Figure BDA0002583501950000117
σ represents a logic sigmoid function:
Figure BDA0002583501950000118
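One time step of the gate equations above can be written out as follows; the weight shapes are assumptions, and the output gate, which the text does not detail, is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_state(x_t, h_prev, c_prev, W_f, W_i, W_c, b_f, b_i, b_c):
    """One step of the forget-gate / input-gate / cell-state equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate value to be added to the memory cell
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    return c_t                               # h_t would follow from the output gate (not detailed here)
```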
s5: in the model training optimization stage, the initial learning rate is 0.01, β 1 is 0.9, β 2 is 0.999, ε is 10e-8, the optimizer is set to Adam, the node number of the full connection layer is set to 3026, and in order to prevent the gradient disappearance phenomenon during the training process, the dropout method is used, the initial value is set to 0.95 and applied to each layer of the network, and when the loss function is calculated, the cross entropy loss function is selected.
S6: the model is trained with a training set, with cross entropy as the loss function.
The cross entropy is defined as follows:
C = -(1/k) Σ_{j=1}^{k} ŷ_j·ln(y_j)
wherein ŷ_j is the true label of the j-th sample, k represents the total number of samples, and y_j is the predicted output of the network model for the j-th sample.
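A PyTorch sketch of the training setup of S5 and the Softmax classification of S6 (whose formula is given in the next paragraph) follows; the SpeakerNet wrapper, the data loader, the interpretation of 0.95 as a keep probability and the use of the 3026-node layer as a hidden layer before the classifier are assumptions for illustration, not details fixed by the text.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Assumed wrapper: a 3DCNN backbone (e.g. the Speaker3DCNN sketch above) + LSTM + classifier."""
    def __init__(self, backbone, num_speakers, feat_dim=128):
        super().__init__()
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.05)        # keep probability 0.95 (one reading of the text)
        self.fc = nn.Linear(feat_dim, 3026)      # fully connected layer with 3026 nodes
        self.out = nn.Linear(3026, num_speakers)

    def forward(self, x):
        seq = self.backbone(x)                   # (batch, T', feat_dim)
        _, (h, _) = self.lstm(seq)
        z = self.dropout(torch.relu(self.fc(h[-1])))
        return self.out(z)                       # logits; Softmax is applied for classification

def train(model, train_loader, epochs=10):
    criterion = nn.CrossEntropyLoss()            # cross entropy loss of step S5
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                                 betas=(0.9, 0.999), eps=10e-8)
    for _ in range(epochs):
        for cubes, speaker_ids in train_loader:  # assumed loader of (input cube, speaker id) pairs
            optimizer.zero_grad()
            loss = criterion(model(cubes), speaker_ids)
            loss.backward()
            optimizer.step()

def identify(model, cube):
    """Step S6: speaker classification with a Softmax layer (cube: (1, utterances, frames, coeffs))."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(cube.unsqueeze(0)), dim=1)
    return probs.argmax(dim=1).item()
```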
Speaker classification is carried out with a Softmax layer; the formula of the Softmax function is as follows:
S_i = e^(V_i) / Σ_j e^(V_j)
wherein S_i is the Softmax value of the i-th element V_i of the input array.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A speaker identification method based on 3DCNN-LSTM, characterized by comprising the following steps:
S1, acquiring a voice signal and performing semi-text processing on it, including pre-emphasis, windowing and framing, fast Fourier transform and MFEC transformation, wherein the MFEC transformation refers to passing the signal through a Mel filter bank and taking the logarithmic energy spectrum, i.e. Mel cepstral coefficient features with the discrete cosine transform removed;
S2, stacking the MFEC features of a plurality of continuous frames of the voice signal processed in step S1, so that the two-dimensional spectrogram becomes three-dimensional data used as the input of the 3DCNN, wherein 3DCNN represents a three-dimensional convolutional neural network;
S3, extracting the space-time characteristics of the speaker voice, specifically the deep characteristics, from the three-dimensional data with the 3DCNN, and feeding the data obtained after the convolution and pooling of the 3D convolution kernels into the LSTM network for time-sequence feature extraction;
S4, taking the output of the 3DCNN as the input of an LSTM model, extracting the long-term dependence of the space-time characteristics through the LSTM, and arranging the output of the convolutional neural network in time order to learn the context of the speaker voice;
S5, in the model training and optimization stage, setting the optimizer to Adam and the number of nodes of the fully connected layer to 3026, using the dropout method with an initial value of 0.95 applied to every layer of the network, and selecting the cross entropy loss function when computing the loss;
S6, verifying the trained model with a test set, adjusting the parameters of the model to obtain the final network model, and finally classifying the speakers with a Softmax layer;
the convolution kernels in the 3DCNN designed in step S3 extract the short-term space-time characteristics of the speaker voice from the three-dimensional data; the number of convolution kernels of the first two layers is set to 16, their sizes are 3x1x5 and 3x9x5 respectively, and three-dimensional convolution is performed over the time-frequency-utterance dimensions of the speaker voice signal to extract the deep-level characteristics of the speaker; the number of convolution kernels of the third and fourth layers is set to 32, their sizes are 3x1x4 and 3x8x1 respectively, pooling is carried out after every two layers, the strides of the first four layers are 1x1x1 and 1x2x1 respectively, and a BN layer is also arranged in each layer of the network to normalize the data;
the number of convolution kernels of the fifth and sixth layers is set to 64 and their sizes are 3x1x3 and 3x7x1 respectively, with the stride set to 1x1x1; the number of convolution kernels of the seventh and eighth layers is set to 128 with sizes consistent with the previous two layers, a BN layer is likewise arranged in each layer, and pooling is finally carried out to obtain the deep individual characteristics of the speaker;
BN normalizes the activations of the intermediate layers of a deep neural network; the key of the algorithm is that two learnable parameters γ and β are introduced:
y_j = γ·x̂_j + β
In one batch with m training samples and j dimensions (j neuron nodes), BN normalizes each dimension j separately:
μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)
σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) - μ_j)²
x̂_j = (x_j - μ_j) / sqrt(σ_j² + ε)
wherein x_j^(i) is the result of the linear calculation of the j-th dimension for the i-th sample of the layer, μ_j, σ_j² and x̂_j denote the batch mean, the batch variance and the batch-normalized value respectively, and ε is added to prevent the variance from being 0.
2. The 3DCNN-LSTM-based speaker recognition method according to claim 1, wherein in step S1 semi-text processing is performed on the speech signal according to its short-time stationarity to obtain the MFEC features, and the specific steps are as follows:
Step A1: the speech signal is passed through a high-pass filter to enhance the high-frequency part of the signal and flatten the speech spectrum; its transfer function is H(z) = 1 - a·z^(-1), where a takes the value 0.95, and the signal after pre-emphasis processing is x(t);
step A2: dividing a voice signal into short-time frame windows to reduce the edge effect of voice, framing the pre-emphasized signal to be x (m, n), wherein n is the frame length, m is the number of frames, and windowing is performed by adopting a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
The windowed and framed speech signal is: s_w(m, n) = x(m, n) × w(n), where each frame contains N sample points;
a3: then, transforming the voice data x (n) from the time domain to the frequency domain, and performing fast fourier transform on the windowed signal to obtain a linear spectrum e (k) as follows:
E(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1
Taking the squared modulus of the Fourier-transformed data gives the power spectrum:
X(k) = |E(k)|²
step A4: the linear spectrum obtained by FFT is converted into Mel spectrum by a Mel filter bank composed of a series of triangular band-pass filters H m (k) The frequency response function expression of the filter is as follows:
H_m(k) =
  0,                               k < f(m-1) or k > f(m+1)
  (k - f(m-1)) / (f(m) - f(m-1)),  f(m-1) ≤ k ≤ f(m)
  (f(m+1) - k) / (f(m+1) - f(m)),  f(m) ≤ k ≤ f(m+1)
wherein f(m) represents the center frequency of the m-th filter and k represents the frequency index;
step A5: logarithm is taken on the output of the Mel filter, and the logarithmic spectrum S (m) obtained by logarithm operation is:
S(m) = ln( Σ_{k=0}^{N-1} X(k)·H_m(k) ), m = 1, 2, ..., M, where M is the number of Mel filters.
3. The 3DCNN-LSTM-based speaker recognition method according to claim 2, wherein in step S2 the processed speech signal is converted into a two-dimensional spectrogram, and the two-dimensional spectrogram is processed into three-dimensional data by stacking the spectrograms of a plurality of continuous frames, the processing steps being as follows:
Step B1: the m-second speech signal of the speaker is split into overlapping frames at intervals of n milliseconds for the semi-text processing;
Step B2: the signal processed by B1 is transformed from the time domain to the frequency domain by the MFEC transformation to obtain the two-dimensional spectrogram S(m);
Step B3: a plurality of utterances of the speaker are processed by B2 to obtain three-dimensional data, which is then convolved with the 3D convolution kernels to extract deep speaker characteristics; the spectrograms of a plurality of continuous frames are stacked into a cube, the convolution operation with the 3D convolution kernels is performed inside this cube, and the input data is set to Time × Frequency × C, where C represents the number of speaker utterances.
4. The 3DCNN-LSTM-based speaker recognition method according to claim 1, wherein step S4 uses the output of the 3DCNN as the input of the LSTM model to extract the long-term dependence of the spatio-temporal features; the LSTM unit is composed of three gate structures, namely a forget gate, an input gate and an output gate, wherein the forget gate determines the information that should be discarded in the cell state of the previous time step and participates directly in updating the cell state, the cell-state update depends on the hidden-layer output of the previous time step and the input of the current time step, and the cell state of the previous time step is used as a parameter for updating the current state;
Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
Input gate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
Candidate value: C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
Cell-state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
wherein C_{t-1} and h_{t-1} are the cell state and the hidden-layer output at the previous time step respectively, x_t is the input at the current time step, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are the weights of the forget gate, the input gate and the candidate cell obtained from training, b_f, b_i and b_C are their biases, i_t is the input-gate activation, and σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^(-x))
5. The 3DCNN-LSTM-based speaker recognition method according to claim 4, wherein in step S5, the model training and optimization stage, the initial learning rate is 0.01, β1 = 0.9, β2 = 0.999 and ε = 10e-8, the optimizer is set to Adam, and the number of nodes of the fully connected layer is set to 3026; to prevent the gradient from vanishing during training, the dropout method is used with an initial value of 0.95 and applied to every layer of the network, and the cross entropy loss function is selected when computing the loss;
the cross entropy is defined as follows:
C = -(1/k) Σ_{j=1}^{k} ŷ_j·ln(y_j)
wherein ŷ_j is the true label of the j-th sample, k represents the total number of samples, and y_j is the predicted output of the network model for the j-th sample.
6. The 3DCNN-LSTM-based speaker recognition method according to claim 5, wherein step S6 uses a Softmax layer for speaker classification, and the formula of the Softmax function is as follows:
S_i = e^(V_i) / Σ_j e^(V_j)
wherein S_i is the Softmax value of the i-th element V_i of the input array.
7. A storage medium being a computer readable storage medium storing one or more programs which, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of claims 1-6 above.
CN202010674320.0A 2020-07-14 2020-07-14 Speaker identification method based on 3DCNN-LSTM and storage medium Active CN111968652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010674320.0A CN111968652B (en) 2020-07-14 2020-07-14 Speaker identification method based on 3DCNN-LSTM and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010674320.0A CN111968652B (en) 2020-07-14 2020-07-14 Speaker identification method based on 3DCNN-LSTM and storage medium

Publications (2)

Publication Number Publication Date
CN111968652A CN111968652A (en) 2020-11-20
CN111968652B true CN111968652B (en) 2022-08-26

Family

ID=73361989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010674320.0A Active CN111968652B (en) 2020-07-14 2020-07-14 Speaker identification method based on 3DCNN-LSTM and storage medium

Country Status (1)

Country Link
CN (1) CN111968652B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614492A (en) * 2020-12-09 2021-04-06 通号智慧城市研究设计院有限公司 Voiceprint recognition method, system and storage medium based on time-space information fusion
CN113327616A (en) * 2021-06-02 2021-08-31 广东电网有限责任公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN114598565A (en) * 2022-05-10 2022-06-07 深圳市发掘科技有限公司 Kitchen electrical equipment remote control system and method and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN110718228A (en) * 2019-10-22 2020-01-21 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
WO2021191659A1 (en) * 2020-03-24 2021-09-30 Rakuten, Inc. Liveness detection using audio-visual inconsistencies

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN110718228A (en) * 2019-10-22 2020-01-21 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
WO2021191659A1 (en) * 2020-03-24 2021-09-30 Rakuten, Inc. Liveness detection using audio-visual inconsistencies

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A network model of speaker identification with new feature extraction methods and asymmetric BLSTM;Wang X,等;《Neurocomputing》;20200429;全文 *
Speaker Recognition Based on 3DCNN-LSTM;Hu Z F,等;《Engineering Letters》;20210630;全文 *
Research and implementation of multi-speaker recognition technology based on deep learning; 斯星童; China Master's Theses Full-text Database; 20220315; full text *
Research on speaker recognition based on deep learning; 陈甜甜; China Master's Theses Full-text Database; 20181115; full text *

Also Published As

Publication number Publication date
CN111968652A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
CN111968652B (en) Speaker identification method based on 3DCNN-LSTM and storage medium
CN111785301B (en) Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
Mashao et al. Combining classifier decisions for robust speaker identification
Huang et al. Query-by-example keyword spotting system using multi-head attention and soft-triple loss
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
Wang et al. I-vector features and deep neural network modeling for language recognition
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Nazid et al. Improved speaker-independent emotion recognition from speech using two-stage feature reduction
Dhar et al. A system to predict emotion from Bengali speech
Anand et al. Text-independent speaker recognition for Ambient Intelligence applications by using information set features
Medikonda et al. An information set-based robust text-independent speaker authentication
Renisha et al. Cascaded Feedforward Neural Networks for speaker identification using Perceptual Wavelet based Cepstral Coefficients
Ramani et al. Autoencoder based architecture for fast & real time audio style transfer
Singh et al. Application of different filters in mel frequency cepstral coefficients feature extraction and fuzzy vector quantization approach in speaker recognition
Slívová et al. Isolated word automatic speech recognition system
CN115019760A (en) Data amplification method for audio and real-time sound event detection system and method
CN115064175A (en) Speaker recognition method
Sefara et al. Gender identification in Sepedi speech corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231121

Address after: 302-8, floor 3, No. 8, caihefang Road, Haidian District, Beijing 100080

Patentee after: Beta Intelligent Technology (Beijing) Co.,Ltd.

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before: Shenzhen Hongyue Information Technology Co.,Ltd.

Effective date of registration: 20231121

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Hongyue Information Technology Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS