CN112967722A - Text-independent multi-source speaker identification method based on blind source separation - Google Patents
Text-independent multi-source speaker identification method based on blind source separation
Info
- Publication number
- CN112967722A (application CN202110137229.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- source
- matrix
- wavelet packet
- blind
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a text-independent multi-source speaker recognition method based on blind source separation, and relates to the technical field of voiceprint recognition. First, a segment of audio containing the voices of multiple persons is acquired, blind source separation and detection are performed on it with a blind source signal detection and separation algorithm, and the multi-source speech signal is separated into multiple single-source speech signals. Each separated single-source speech signal is then pre-emphasized, framed and windowed to obtain a time-series speech signal, which is decomposed and reconstructed with wavelet packets. A cochlear auditory filter is then used for human-ear feature filtering, and the speech features are extracted. Finally, a CNN model is constructed and the extracted speech features are input into it to realize multi-source speaker recognition. By combining wavelet packets with a Gammatone filter, the method achieves a high recognition rate in noisy environments.
Description
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a text-independent multi-source speaker recognition method based on blind source separation.
Background
Voice is a biometric trait; like other biometric characteristics it carries identity information and can be applied to identity authentication, information services, voice mail and other areas. Speech conveys not only the content spoken by the speaker but also speaker-specific information, which is the basis of voiceprint recognition. Voiceprint recognition is the process of automatically identifying a speaker from the personal information contained in the speech waveform. Speaker recognition in a cocktail-party environment is particularly challenging: the environment is noisy, and the voices of many speakers are mixed together.
The speaker recognition process can be divided into two parts: speech feature extraction and speaker model training. Most feature-extraction research targets the short-term spectral characteristics of the speech signal: the signal is decomposed into short frames of roughly 10-30 milliseconds, over which speech is approximately stationary, and spectral voiceprint features such as Mel-frequency cepstral coefficients and linear prediction cepstral coefficients are computed within each frame. Model training then fits a model to the extracted features; commonly used traditional models include vector quantization, dynamic time warping and Gaussian mixture models, while deep learning approaches apply deep neural networks with better performance, such as convolutional neural networks, to speaker recognition.
Disclosure of Invention
The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing a text-independent multi-source speaker identification method based on blind source separation that performs multi-source speaker identification in a cocktail-party environment.
To solve the above technical problem, the invention adopts the following technical scheme: a text-independent multi-source speaker recognition method based on blind source separation, in which multi-sound-source speech in a cocktail-party environment is separated and detected by a blind source signal detection and separation algorithm and the contained sound sources are separated; speech feature extraction is then performed on each sound source, namely the speech is subjected to a wavelet packet transform combined with a Gammatone filter for feature extraction, and the extracted features are passed through the deep learning model CNN to complete multi-source speaker recognition. The method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
firstly, centering and whitening the original mixed speech data matrix to obtain the whitened speech matrix and the whitening transformation matrix; then initializing a matrix W randomly and iteratively updating it with decorrelation processing to obtain the updated matrix W_new; finally, multiplying the updated matrix W_new, the whitening transformation matrix and the original mixed speech data matrix to separate the multi-source speech signal into multiple single-source speech signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
decomposing the time-series speech signal with wavelet packets so as to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components contained in the speech signal; the wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet coefficients; reconstructing the low-frequency and high-frequency speech signals after wavelet packet decomposition, wherein the time sequence of the reconstructed speech signal corresponds to the original time-domain information;
step 4: performing human-ear feature filtering on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and extracting the speech features;
passing the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear, and performing a short-time Fourier transform on the obtained speech feature vectors to obtain two-dimensional speech feature vectors, thereby completing the extraction of the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vector extracted in step 4 into a three-dimensional vector, and inputting the three-dimensional vector into the CNN model to realize multi-source speaker recognition;
the CNN model consists of 4 2D convolutional layers, 4 pooling layers, 2 full-connection layers and an output layer; the convolution kernel adopts a 3x 3 matrix; in each convolutional layer, the activation function ReLu is used; entering a pooling layer after each convolution layer operation; the strategy adopted by the pooling layer is maximum pooling, and the size of the pooling is 2 multiplied by 2; and outputting the probability of the class corresponding to the voice by using softmax as the activation function of the output layer.
The beneficial effect of the above technical scheme is as follows: the text-independent multi-source speaker recognition method based on blind source separation provided by the invention addresses the problem of low recognition accuracy in noisy environments; by combining wavelet packets with a Gammatone filter, it achieves a higher recognition rate under noise, which gives the method significant application value in real environments.
Drawings
Fig. 1 is a general architecture diagram of a text-independent multi-source speaker recognition method based on blind source separation according to an embodiment of the present invention.
Fig. 2 is a waveform diagram of an original single-source speech signal visualization provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating pre-emphasis effects of a single-source speech signal according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the effect of framing a single-source speech signal according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the effect of windowing a single-source speech signal according to an embodiment of the present invention;
fig. 6 is a diagram illustrating the effect of a single-source speech signal after three-level wavelet packet decomposition according to an embodiment of the present invention;
fig. 7 is a diagram illustrating the effect of the single-source speech signal filtered by the Gammatone filter according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In this embodiment, the software environment is a Windows 10 system, and the simulation environment is PyCharm 2018.3.3 x64.
In this embodiment, the overall architecture of the designed multi-source speaker recognition system is shown in Fig. 1. According to this architecture, the text-independent multi-source speaker recognition method based on blind source separation includes the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
Firstly, a segment of mixed multi-channel speech is input and opened with the wave module in the python language to obtain the mixed speech data matrix D. The matrix D is centered: the mean of each row is computed with the mean function of the numpy library, and this mean is subtracted from every row of D to obtain the centered matrix D_center. D_center is then whitened: the covariance matrix cov(D_center) is computed, its eigenvalues and eigenvectors are calculated, the eigenvalue vector is diagonalized into a diagonal matrix, the diagonal matrix is inverted, and the square root of the inverted diagonal matrix is multiplied by the transposed eigenvector matrix to obtain the whitening transformation matrix V. Multiplying D_center by V gives the whitened data matrix Z, which is processed with the FastICA algorithm: a random matrix W is first generated and decorrelated (the eigenvalues x and eigenvectors p of W·W^T are computed, and x is diagonalized and inverted to obtain div_x for the symmetric decorrelation of W). A maximum number of iterations is set and W is updated repeatedly as W_new = g(s)·Z^T − g′(s)·W′, where s = W′·Z, g(s) is the tanh() function of s, g′(s) is the derivative of g(s), W′ is W after each decorrelation update, and Z^T is the transpose of the whitened data matrix Z. From the resulting W_new, multiplying by the whitening transformation matrix V and the mixed speech data matrix D gives the estimate S_r of the reconstructed source signals, from which the n estimated single-source speech signals S1 = S_r[0, :], S2 = S_r[1, :], ..., Sn = S_r[n−1, :] are obtained.
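The following numpy sketch condenses the centering, whitening and FastICA iteration described above. The function name, the convergence test, and the use of sample averages in the update rule are assumptions made to give a runnable illustration of the same procedure, not details taken from the patent.

```python
import numpy as np

def separate_sources(D, n_iter=200, tol=1e-6):
    """Blind source separation of a mixed speech matrix D (channels x samples)
    by centering, whitening and a symmetric FastICA iteration with g = tanh."""
    D_center = D - D.mean(axis=1, keepdims=True)           # centering

    # Whitening: V = Lambda^(-1/2) * E^T from the covariance eigen-decomposition
    eigvals, eigvecs = np.linalg.eigh(np.cov(D_center))
    V = np.diag(eigvals ** -0.5) @ eigvecs.T
    Z = V @ D_center                                        # whitened data

    def decorrelate(W):
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
        s, u = np.linalg.eigh(W @ W.T)
        return u @ np.diag(s ** -0.5) @ u.T @ W

    n = Z.shape[0]
    W = decorrelate(np.random.rand(n, n))
    for _ in range(n_iter):
        s = W @ Z
        g, g_prime = np.tanh(s), 1.0 - np.tanh(s) ** 2
        W_new = decorrelate((g @ Z.T) / Z.shape[1] - np.diag(g_prime.mean(axis=1)) @ W)
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol:
            W = W_new
            break
        W = W_new

    S_r = W @ V @ D_center          # estimated single-source signals, one per row
    return S_r
```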
Step 2: speech preprocessing; each of the n single-source speech signals S1, S2, ..., Sn obtained in step 1 is preprocessed to obtain a time-series speech signal. Preprocessing comprises three stages: pre-emphasis, framing and windowing. Pre-emphasis compensates for the per-octave attenuation of the high-frequency part of speech caused by glottal excitation and oral-nasal radiation. The pre-emphasized speech signal is then divided into short-time stationary frames, which are numbered in time order; consecutive frames overlap to a certain extent so that no information between frames is lost. Finally, each frame is windowed: a Hamming window function is slid over the frame-length speech signal, and the preprocessed speech is kept in the form of a time-series discrete signal to facilitate decomposition and feature extraction.
In this embodiment, the implementation parameters of the pre-emphasis, framing and windowing stages are as follows:
Pre-emphasis: the speech data are read with wavfile in python to obtain the audio sampling rate and a numpy array. The one-dimensional speech array is pre-emphasized according to the formula y(t) = x(t) − α·x(t−1) with pre-emphasis coefficient α = 0.97, where y(t) is the pre-emphasized single-source speech sample at time t, x(t) is the single-source speech sample at time t, and x(t−1) is the single-source speech sample at time t−1. For the original single-source speech waveform shown in Fig. 2, the pre-emphasized waveform is shown in Fig. 3.
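As a minimal sketch of this pre-emphasis stage (assuming scipy.io.wavfile is the wavfile module referred to above, and an illustrative file name), the formula y(t) = x(t) − α·x(t−1) can be applied as follows:

```python
import numpy as np
from scipy.io import wavfile   # assumed to be the "wavfile" module used above

rate, x = wavfile.read("single_source.wav")   # file name is illustrative
x = x.astype(np.float64)

alpha = 0.97                                  # pre-emphasis coefficient
# y(t) = x(t) - alpha * x(t-1); the first sample is kept unchanged
y = np.append(x[0], x[1:] - alpha * x[:-1])
```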
Framing: the whole pre-emphasized speech signal is divided into frames using numpy in python; each frame is 30 ms long, and consecutive frames overlap by 2 ms. The result after framing is shown in Fig. 4.
Windowing: to keep the framed speech data continuous during processing, a sliding-window technique based on the Hamming window formula w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1, is adopted, where w(n) is the sliding-window factor, N is the length of each frame after framing, and n is the sample index within the frame; each frame is multiplied by the Hamming window. The effect after windowing is shown in Fig. 5.
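A short sketch of the framing and Hamming-windowing stages under the parameters above (30 ms frames with a 2 ms overlap); the helper name and the use of numpy's built-in np.hamming are assumptions, not part of the original description:

```python
import numpy as np

def frame_and_window(y, rate, frame_ms=30, overlap_ms=2):
    """Split the pre-emphasised signal y into overlapping frames and apply
    a Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame."""
    frame_len = int(rate * frame_ms / 1000)           # N, samples per frame
    step = frame_len - int(rate * overlap_ms / 1000)  # hop size with 2 ms overlap
    n_frames = 1 + max(0, (len(y) - frame_len) // step)
    frames = np.stack([y[i * step: i * step + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)             # windowed frames, shape (n_frames, N)
```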
Step 3: wavelet packet decomposition and reconstruction of the time-series signal; because the speech signal is non-stationary and contains both low-frequency and high-frequency components, the invention adopts wavelet packets to decompose the time-series speech signal and to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components it contains. The wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet coefficients. The low-frequency and high-frequency speech signals after wavelet packet decomposition are reconstructed, and the time sequence of the reconstructed speech signal corresponds to the original time-domain information.
In this embodiment, the third-party python library pywt is used to perform wavelet packet decomposition on the speech data preprocessed in step 2. The WaveletPacket function of the pywt library is used, the input data are the preprocessed speech data, the padding mode is symmetric, and a three-level db6 wavelet packet decomposition is performed; the resulting tree structure is shown in Fig. 6. The decomposed wavelet packet nodes, each identified by its level and coefficient data, are then traversed and reconstructed with the reconstruct function to obtain speech signal data whose high-frequency and low-frequency components are thoroughly decomposed.
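A hedged sketch of this step with pywt is shown below: a three-level db6 WaveletPacket with symmetric mode is built for one preprocessed frame, and each leaf node is reconstructed into its own sub-band signal. The per-node reconstruction strategy and the helper name are assumptions about how the description above maps onto the pywt API.

```python
import numpy as np
import pywt

def wp_decompose_reconstruct(frame, wavelet="db6", level=3):
    """Three-level db6 wavelet packet decomposition of one windowed frame,
    followed by reconstruction of each leaf node into a sub-band signal."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    bands = []
    for node in wp.get_level(level, order="freq"):    # leaf nodes of the binary tree
        sub = pywt.WaveletPacket(data=None, wavelet=wavelet,
                                 mode="symmetric", maxlevel=level)
        sub[node.path] = node.data                    # keep only this node's coefficients
        bands.append(sub.reconstruct(update=False)[:len(frame)])
    return np.stack(bands)                            # shape: (2**level, frame_len)
```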
Step 4: human-ear feature filtering is performed on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and the speech features are extracted. The speech signal obtained after wavelet packet decomposition and reconstruction in step 3 is passed through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear; the Gammatone filter is a standard auditory filter that conforms to the characteristics of the human cochlea. A short-time Fourier transform (STFT) is then applied to the obtained speech feature vectors to obtain two-dimensional speech feature vectors, completing the extraction of the speech features.
the Gamma atom filter is shown in the following formula:
h(t) = c·t^(l−1)·e^(−2πbt)·cos(2πf_i·t + φ)
where c is a tuning proportionality constant, l is the filter order (usually 4) and is a positive integer, b is the attenuation factor that determines how fast the filter decays, f_i is the center frequency of the filter, and φ is the phase, which can generally be omitted. The relation between the attenuation factor and the bandwidth is b = 1.019·ERB(f_i), where ERB(f_i) is the equivalent rectangular bandwidth: ERB(f_i) = 24.7·(4.37·f_i/1000 + 1).
in this embodiment, a visualization effect graph of the single-source speech signal after passing through the Gammatone filter is shown in fig. 7.
Step 5: a CNN model is constructed, the two-dimensional speech feature vector extracted in step 4 is converted into a three-dimensional vector, and the three-dimensional vector is input into the CNN model to realize multi-source speaker recognition.
in order to effectively train and predict the voice characteristic information irrelevant to the text of the speaker, the method of the invention combines the designed characteristic extraction method and adopts the convolutional neural network to design the deep learning and recognition network of the speaker, as shown in table 1. In the network structure, the network structure is composed of 4 2D convolutional layers (Conv2D _1-Conv2D _4), 4 pooling layers (Pool1-Pool4), 2 full-connection layers (Dense _1, Dense _2) and an output layer, and the convolutional layers adopt a 3 × 3 matrix. In each convolutional layer, the activation function ReLu is used. After each convolution layer operation, entering a pooling layer, wherein the strategy adopted by the pooling layer is a maximum pooling strategy, and the pooling size is 2 multiplied by 2. And the output layer outputs the probability of the corresponding class of the voice by adopting a softmax activation function.
TABLE 1 CNN-based speaker recognition deep learning recognition network structure parameters
| Layer | CNN parameter |
|---|---|
| Conv2d_1 | [3*3, 16] |
| Pool1 | 2*2, maxpool, stride 2 |
| Conv2d_2 | [3*3, 32] |
| Pool2 | 2*2, maxpool, stride 2 |
| Conv2d_3 | [3*3, 64] |
| Pool3 | 2*2, maxpool, stride 2 |
| Conv2d_4 | [3*3, 128] |
| Pool4 | 2*2, maxpool, stride 2 |
| Dense_1 | 5072*278 |
| Dense_2 | 278*69 |
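A hedged Keras sketch of the Table 1 network is given below. The input shape, the treatment of Dense_2 (278→69) as the softmax output layer, and the use of "same" padding are interpretations of the table rather than details stated in the patent; reshaping the 2-D features with an added channel axis (e.g. feats[..., np.newaxis]) gives the three-dimensional input mentioned in step 5.

```python
from tensorflow.keras import layers, models

def build_speaker_cnn(input_shape=(64, 64, 1), n_speakers=69):
    """Sketch of the Table 1 architecture: four 3x3 Conv2D blocks (16/32/64/128
    filters) with 2x2 max pooling (stride 2), a 278-unit dense layer and a
    softmax output over the speaker classes."""
    return models.Sequential([
        layers.Input(shape=input_shape),                  # input shape is an assumption
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Flatten(),
        layers.Dense(278, activation="relu"),             # Dense_1 in Table 1
        layers.Dense(n_speakers, activation="softmax"),   # Dense_2 / output layer
    ])
```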
In conclusion, the method of the invention extracts speech features with wavelet packets and a Gammatone filter after blind source separation, and meets the requirement of a high recognition rate for multi-source speaker recognition in noisy environments. It solves the problem that, in a cocktail-party environment, the recognition rate is low and multiple speakers cannot be distinguished.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (6)
1. A text-independent multi-source speaker identification method based on blind source separation, characterized in that: for multi-sound-source speech in a cocktail-party environment, the mixed speech is separated and detected by a blind source signal detection and separation algorithm and the contained sound sources are separated; speech feature extraction is then performed on each sound source, namely the speech is subjected to a wavelet packet transform combined with a Gammatone filter for feature extraction; the extracted features are then passed through the deep learning model CNN to complete multi-source speaker recognition.
2. The text-independent multi-source speaker recognition method based on blind source separation according to claim 1, characterized in that the recognition method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
step 4: performing human-ear feature filtering on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and extracting the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vector extracted in step 4 into a three-dimensional vector, and inputting the three-dimensional vector into the CNN model to realize multi-source speaker recognition.
3. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 1 specifically comprises:
firstly, centering and whitening the original mixed speech data matrix to obtain the whitened speech matrix and the whitening transformation matrix; then initializing a matrix W randomly and iteratively updating it with decorrelation processing to obtain the updated matrix W_new; and finally, multiplying the updated matrix W_new, the whitening transformation matrix and the original mixed speech data matrix to separate the multi-source speech signal into multiple single-source speech signals.
4. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 3 specifically comprises:
decomposing the time-series speech signal with wavelet packets so as to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components contained in the speech signal, wherein the wavelet packet decomposition follows a complete optimal binary-tree structure and each time-frequency node corresponds to a set of wavelet packet coefficients; and reconstructing the low-frequency and high-frequency speech signals after wavelet packet decomposition, the time sequence of the reconstructed speech signal corresponding to the original time-domain information.
5. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 4 specifically comprises:
passing the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear, and performing a short-time Fourier transform on the obtained speech feature vectors to obtain two-dimensional speech feature vectors, thereby completing the extraction of speech features.
6. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that in step 5 the CNN model consists of 4 2D convolutional layers, 4 pooling layers, 2 fully connected layers and an output layer; the convolution kernels are 3×3 matrices; each convolutional layer uses the ReLU activation function and is followed by a pooling layer; the pooling layers use max pooling with a pooling size of 2×2; and softmax is used as the activation function of the output layer to output the probability of the class corresponding to the speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110137229.XA CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110137229.XA CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112967722A true CN112967722A (en) | 2021-06-15 |
Family
ID=76272715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110137229.XA Pending CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112967722A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727329A (en) * | 2024-02-07 | 2024-03-19 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180299527A1 (en) * | 2015-12-22 | 2018-10-18 | Huawei Technologies Duesseldorf Gmbh | Localization algorithm for sound sources with known statistics |
CN109584900A (en) * | 2018-11-15 | 2019-04-05 | 昆明理工大学 | A kind of blind source separation algorithm of signals and associated noises |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
-
2021
- 2021-02-01 CN CN202110137229.XA patent/CN112967722A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180299527A1 (en) * | 2015-12-22 | 2018-10-18 | Huawei Technologies Duesseldorf Gmbh | Localization algorithm for sound sources with known statistics |
CN109584900A (en) * | 2018-11-15 | 2019-04-05 | 昆明理工大学 | A kind of blind source separation algorithm of signals and associated noises |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
Non-Patent Citations (2)
Title |
---|
XU XIAOMENG et al.: "Noise-robust voiceprint recognition algorithm based on full-frequency wavelet packet decomposition", Journal of Shenzhen University Science and Engineering *
ZHU JIA et al.: "Research on an automatic speaker identification method based on independent component analysis", Instrumentation and Analysis Monitoring *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727329A (en) * | 2024-02-07 | 2024-03-19 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
CN117727329B (en) * | 2024-02-07 | 2024-04-26 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Weninger et al. | Single-channel speech separation with memory-enhanced recurrent neural networks | |
CN107845389A (en) | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
US20230317056A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
Strauss et al. | A flow-based neural network for time domain speech enhancement | |
Adiga et al. | Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN. | |
Geng et al. | End-to-end speech enhancement based on discrete cosine transform | |
CN111816200B (en) | Multi-channel speech enhancement method based on time-frequency domain binary mask | |
Islam et al. | Supervised single channel dual domains speech enhancement using sparse non-negative matrix factorization | |
Do et al. | Speech Separation in the Frequency Domain with Autoencoder. | |
Fazel et al. | Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition | |
CN106653004A (en) | Speaker identification feature extraction method for sensing speech spectrum regularization cochlear filter coefficient | |
Saleem et al. | Unsupervised speech enhancement in low SNR environments via sparseness and temporal gradient regularization | |
Islam et al. | Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask | |
CN114360571A (en) | Reference-based speech enhancement method | |
CN112967722A (en) | Text-independent multi-source speaker identification method based on blind source separation | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
Zhao et al. | An Improved Speech Enhancement Method based on Teager Energy Operator and Perceptual Wavelet Packet Decomposition. | |
Baby et al. | Speech dereverberation using variational autoencoders | |
Singh et al. | Speech enhancement for Punjabi language using deep neural network | |
Sun et al. | Enhancement of Chinese speech based on nonlinear dynamics | |
Shu-Guang et al. | Isolated word recognition in reverberant environments | |
Amarjouf et al. | Denoising esophageal speech using combination of complex and discrete wavelet transform with wiener filter and time dilated Fourier Cepstra | |
Sharma et al. | Self-supervision and learnable strfs for age, emotion, and country prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210615 |