CN113903344B - Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction - Google Patents

Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Info

Publication number
CN113903344B
CN113903344B (application CN202111480885.6A)
Authority
CN
China
Prior art keywords
mode
signals
frequency
signal
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111480885.6A
Other languages
Chinese (zh)
Other versions
CN113903344A (en)
Inventor
曹祖杨
杜子哲
张凯强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Crysound Electronics Co Ltd
Original Assignee
Cry Sound Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cry Sound Co ltd filed Critical Cry Sound Co ltd
Priority to CN202111480885.6A priority Critical patent/CN113903344B/en
Publication of CN113903344A publication Critical patent/CN113903344A/en
Application granted granted Critical
Publication of CN113903344B publication Critical patent/CN113903344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/18: Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction comprises the following steps: a. acquiring signals from a plurality of channels; b. calculating the incoming wave direction through the array; c. preprocessing the signals; d. performing Fourier transform on each frame of the preprocessed signals; converting the actual frequency to the Mel frequency; determining the maxima at which the signal amplitude exceeds a threshold; determining the frequencies ω_1, …, ω_N corresponding to the maxima; taking (ω_n + ω_{n+1})/2 as the boundary between adjacent segments, thereby dividing the signal frequency range into N intervals; and then converting from the Mel frequency back to the actual frequency; e. performing wavelet transform on the signals processed in step d according to the N intervals; f. denoising each mode by hard thresholding, performing a cross-correlation operation between each mode and the preprocessed signals, and selecting the modes whose cross-correlation values exceed a set threshold; g. obtaining a voice information matrix from the selected modes; h. inputting the voice information matrix into a convolutional neural network based on an attention mechanism model to identify the object to which the voiceprint belongs.

Description

Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
Technical Field
The invention relates to a voiceprint recognition technology, in particular to a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
Background
Voiceprint recognition technology has numerous application scenarios in security, finance and other fields, for example identity confirmation and mobile payment. In a quiet environment, voiceprint recognition accuracy is already quite high. In real scenes, however, the environment is complex, noise sources are diverse, and multi-target situations arise in which several people speak simultaneously. When the collected sound signals are processed directly, the noise greatly degrades the accuracy of voiceprint recognition and causes recognition errors. It is therefore of great value to study how to identify the target sound source in the acquired non-stationary signal through a deep neural network and thereby improve recognition accuracy.
Commonly used voiceprint recognition models include HMM and GMM-UBM. With deepening research on neural networks, neural network models such as RNN and LSTM structures have also been applied to voiceprint recognition. However, these networks take a long time to train, and their voiceprint recognition rate drops in complex environments.
In view of the above, a voiceprint recognition method capable of improving recognition accuracy under a complex environment is needed.
Disclosure of Invention
In order to overcome the technical problems in the prior art, the invention provides a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
The invention discloses a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction, which comprises the following steps of:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of preprocessed signals; converting the actual frequency to a Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to these maxima; taking (ω_n + ω_{n+1})/2 as the boundary of adjacent segments, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from said Mel frequency back to said actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction treatment on each of the N modes in a hard processing mode, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold value;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix according to the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model for identifying the object to which the voiceprint belongs.
In one embodiment, step b comprises: carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
In one embodiment, the incoming wave direction is obtained as follows:
performing generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values tau of the signals reaching different array units;
solving the incoming wave direction according to the following formula and the distance R between array units in the multichannel spiral array, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
where θ represents the incoming wave direction.
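By way of illustration, the following Python sketch estimates the delay τ between two channel signals with a generalized cross-correlation and converts it into an arrival angle. It is a minimal example assuming two microphone channels sampled at fs and PHAT weighting of the cross-spectrum; the function names and the weighting choice are assumptions, not part of the method itself.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay between two channel signals via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                          # delay tau in seconds

def doa_from_delay(tau, R, c=343.0):
    """Arrival angle theta (degrees) from delay tau, unit spacing R and sound speed c."""
    return np.degrees(np.arccos(np.clip(c * tau / R, -1.0, 1.0)))
```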
In one embodiment, the pre-processing of step c comprises the steps of: pre-emphasis, framing, windowing, endpoint detection.
In one embodiment, the pre-emphasis step is: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) - 0.95x(l-1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sample point.
In one embodiment, the framing step is: and framing the high-pass filtered signal according to a fixed length N.
In one embodiment, the windowing step is: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), where
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
where N is the speech sequence length, and the empirical value a is 0.46.
In one embodiment, the endpoint detection step uses a double-threshold method, that is, the two thresholds of the double-threshold method are determined from the short-time energy and the zero-crossing rate, and when the windowed signal exceeds both thresholds simultaneously, the signal is considered to be in a speech segment.
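For concreteness, the preprocessing chain of pre-emphasis, fixed-length framing and windowing described above can be sketched as follows; this is a minimal Python example, and the non-overlapping framing and the function name are assumptions.

```python
import numpy as np

def preprocess(x, frame_len, a=0.46):
    """Pre-emphasis, framing and windowing of a spatially filtered signal.
    A minimal sketch; frame_len and the hop size (no overlap here) are assumptions."""
    # Pre-emphasis: y(l) = x(l) - 0.95*x(l-1)
    y = np.append(x[0], x[1:] - 0.95 * x[:-1])
    # Framing with fixed length N (non-overlapping frames for simplicity)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Windowing with w(l) = (1-a) - a*cos(2*pi*l/(N-1)), a = 0.46
    l = np.arange(frame_len)
    w = (1 - a) - a * np.cos(2 * np.pi * l / (frame_len - 1))
    return frames * w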
In one embodiment, the wavelet transform in step e is an empirical wavelet transform.
In one embodiment, in step e:
the scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet transform are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
wherein the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
wherein γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)], ω is the frequency, and n represents the n-th of the N modes;
let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
wherein f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1*(ω) is the complex conjugate of the scale function φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform;
the N modes are represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
where * denotes convolution and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
in one embodiment, in step f:
the hard processing mode is as follows: for each mode, selecting a sampling point with the amplitude value exceeding a general threshold value in the mode, and counting as frn[l]Wherein n represents the nth mode, and l represents the l sampling point in one mode;
the general threshold is set to
Figure GDA0003495343710000042
Where L is the signal length after framing in the time domain, fnFor the nth modality of the N modalities obtained in step e,
frn[l]the calculation method is as follows:
Figure GDA0003495343710000043
wherein N is 1,2, … N; l ═ 1,2,3,. L.
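A minimal Python sketch of the hard processing mode, assuming the general threshold takes the median-based form given above:

```python
import numpy as np

def hard_threshold(mode):
    """Hard thresholding of one mode with a universal-style threshold.
    The threshold form median(|f_n|)/0.6745 * sqrt(2 ln L) is an assumption."""
    L = mode.size
    lam = np.median(np.abs(mode)) / 0.6745 * np.sqrt(2.0 * np.log(L))
    out = mode.copy()
    out[np.abs(out) <= lam] = 0.0   # zero out samples below the threshold
    return out
```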
In one embodiment, in step g:
the logarithmic power spectrum of each selected mode is calculated as
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L;
according to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
wherein D represents the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficients, the speech information matrix is expressed as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)]
in one embodiment, in step h:
the attention mechanism model is represented as:
Figure GDA0003495343710000046
wherein S represents a weight of each of the plurality of channels, δ () and σ () represent a ReLu activation function and a sigmoid activation function, respectively, W1And W2For the coefficients of the full connection layer in the attention mechanism model, H and W represent the number of rows and columns of the matrix, i and j represent the ith row and the jth column respectively, and u is the direct input of the attention mechanism model.
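The channel weighting can be illustrated with a short Python sketch: global average pooling over the H×W positions followed by two fully connected layers, as in the formula above. The reduction ratio implied by the shapes of W1 and W2 and the final reweighting of u are assumptions.

```python
import numpy as np

def channel_attention(u, W1, W2):
    """Channel attention sketch: S = sigmoid(W2 . relu(W1 . z)), z = mean over H*W.
    u: feature map of shape (C, H, W); W1: (C//r, C); W2: (C, C//r)."""
    C, H, W = u.shape
    z = u.reshape(C, -1).mean(axis=1)           # global average pooling over H*W
    h = np.maximum(W1 @ z, 0.0)                 # ReLU (delta)
    S = 1.0 / (1.0 + np.exp(-(W2 @ h)))         # sigmoid (sigma), per-channel weights
    return u * S[:, None, None]                 # reweight each channel of u
```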
The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction divides the frequency intervals of the empirical wavelet transform in the Mel-frequency domain, obtains effective speech features through the empirical wavelet transform, and inputs them into a neural network to realize voiceprint recognition. The invention applies to voiceprint recognition in noisy, noise-rich environments and obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved by introducing an attention mechanism to obtain the weight of each channel, which improves the accuracy of recognition.
Drawings
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
FIG. 2 illustrates a flow diagram of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
fig. 3 shows a diagram of a neural network configuration according to an embodiment of the present invention.
Detailed Description
The detailed features and advantages of the present invention are described in detail in the detailed description which follows, and will be sufficient for anyone skilled in the art to understand the technical content of the present invention and to implement the present invention, and the related objects and advantages of the present invention will be easily understood by those skilled in the art from the description, claims and drawings disclosed in the present specification.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
The method mainly comprises a spatial filtering part, a wavelet denoising part and a voiceprint recognition part.
FIG. 2 shows a flowchart of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
With reference to fig. 1 and 2, the method includes, but is not limited to, the following steps.
Step 101: acquiring signals of a plurality of channels by using a spiral microphone array;
step 102: and carrying out spatial filtering on the collected signals of the plurality of channels to obtain signals after spatial filtering.
In one embodiment, the spatial filtering comprises: performing spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and performing phase synchronization on all channel signals according to the time delay values with which the signals reach different array units, and summing all the channel signals according to their weight ratios (because of the incoming wave direction θ, the weight ratio of the signal received by each array unit is different) to obtain the spatially filtered signals.
The incoming wave direction θ is obtained as follows:
and performing generalized cross-correlation operation on the acquired multi-channel time domain signals to obtain time delay values tau of the signals reaching different array units. Obtaining an incoming wave direction theta (namely an arrival angle of a signal) according to the distance R between the array units, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
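A delay-and-sum sketch of the spatial filtering (phase synchronization by the per-channel delays followed by a weighted sum); uniform weights are assumed when no weight ratio is supplied, and the function name is hypothetical.

```python
import numpy as np

def delay_and_sum(channels, delays, fs, weights=None):
    """Spatial filtering: advance each channel by its delay (seconds) so all
    channels are phase-aligned, then sum with per-channel weights.
    channels: array of shape (n_mics, n_samples)."""
    n_mics, n_samples = channels.shape
    if weights is None:
        weights = np.full(n_mics, 1.0 / n_mics)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spec = np.fft.rfft(channels[m])
        spec *= np.exp(2j * np.pi * freqs * delays[m])   # advance channel m by delays[m]
        out += weights[m] * np.fft.irfft(spec, n=n_samples)
    return out
```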
step 103: and preprocessing the signals after the spatial filtering to obtain primary signals. The preprocessing includes signal normalization, pre-emphasis, framing, windowing (e.g., window function selection hamming window), end point detection.
In one embodiment, the pre-emphasis process comprises: and carrying out pre-emphasis processing by adopting high-pass filtering, wherein the formula is as follows.
y(l)=x(l)-0.95x(l-1)
Where x (l) is the spatially filtered discrete signal and y (l) is the high-pass filtered signal, where l is the sample point.
In one embodiment, the framing process includes: the high-pass filtered signal is framed by a fixed length N, for example, each frame being 40ms in length.
In one embodiment, the windowing process comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), where
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
where N is the speech sequence length, l is the sample point, and the empirical value a is 0.46.
In one embodiment, the endpoint detection process employs a double-threshold method: the short-time energy and the zero-crossing rate determine the two thresholds, and when the signal exceeds both thresholds simultaneously, the signal is considered to be in a speech segment.
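A sketch of the double-threshold decision on framed signals; the threshold values themselves are assumptions to be tuned on real data.

```python
import numpy as np

def double_threshold_vad(frames, energy_thresh, zcr_thresh):
    """Mark a frame as speech when its short-time energy and zero-crossing rate
    both exceed their thresholds, following the double-threshold description above."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thresh) & (zcr > zcr_thresh)
```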
Step 104: performing Fourier transform on each frame of the preprocessed signal f(t), and converting the actual frequency into the Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to these maxima; taking (ω_n + ω_{n+1})/2 as the boundary between adjacent intervals, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from the Mel frequency back to the actual frequency.
the deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction converts signals into Mel frequency to divide frequency intervals to accord with auditory characteristics of human ears, divides the frequency intervals according to a maximum value to ensure that effective signals exist in each divided region, and increases the accuracy of extracted signal characteristics.
Step 105: performing empirical wavelet transform on the signals processed in step 104 according to the divided intervals to obtain N modes, each interval corresponding to one mode. The scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
where the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
where γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)], ω is the frequency, and n represents the n-th mode.
Let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
where f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1*(ω) is the complex conjugate of the scale function φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform.
The respective modes may be represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
where n represents the n-th mode, * denotes convolution, and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
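A compact Python sketch of the empirical wavelet decomposition driven by the N-1 internal interval boundaries (given as normalized angular frequencies in (0, π)); the value of γ, the treatment of the last band as a high-pass wavelet and the plain frequency-domain multiplication are assumptions.

```python
import numpy as np

def beta(x):
    """Transition polynomial beta(x) = x^4 (35 - 84x + 70x^2 - 20x^3), clipped to [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return x ** 4 * (35 - 84 * x + 70 * x ** 2 - 20 * x ** 3)

def ewt_modes(f, boundaries, gamma=0.1):
    """Build the frequency-domain scale/wavelet filters from the interval
    boundaries and recover one mode per band by filtering in the Fourier domain."""
    L = f.size
    w = np.abs(np.fft.fftfreq(L) * 2 * np.pi)    # |omega| for every FFT bin
    F = np.fft.fft(f)
    filters = []
    # scale function for the first band [0, omega_1]
    wn = boundaries[0]
    phi = np.where(w <= (1 - gamma) * wn, 1.0,
                   np.where(w <= (1 + gamma) * wn,
                            np.cos(np.pi / 2 * beta((w - (1 - gamma) * wn) / (2 * gamma * wn))),
                            0.0))
    filters.append(phi)
    # wavelet functions for the bands [omega_n, omega_{n+1}]
    for wn, wn1 in zip(boundaries[:-1], boundaries[1:]):
        psi = np.zeros(L)
        psi[(w >= (1 + gamma) * wn) & (w <= (1 - gamma) * wn1)] = 1.0
        up = (w > (1 - gamma) * wn1) & (w <= (1 + gamma) * wn1)
        psi[up] = np.cos(np.pi / 2 * beta((w[up] - (1 - gamma) * wn1) / (2 * gamma * wn1)))
        low = (w >= (1 - gamma) * wn) & (w < (1 + gamma) * wn)
        psi[low] = np.sin(np.pi / 2 * beta((w[low] - (1 - gamma) * wn) / (2 * gamma * wn)))
        filters.append(psi)
    # last band [omega_{N-1}, pi]: treated as a high-pass wavelet (assumption)
    wn = boundaries[-1]
    hi = np.zeros(L)
    hi[w >= (1 + gamma) * wn] = 1.0
    low = (w >= (1 - gamma) * wn) & (w < (1 + gamma) * wn)
    hi[low] = np.sin(np.pi / 2 * beta((w[low] - (1 - gamma) * wn) / (2 * gamma * wn)))
    filters.append(hi)
    return [np.real(np.fft.ifft(F * flt)) for flt in filters]   # one mode per band
```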
Step 106: denoising each mode in a hard processing mode, performing a cross-correlation operation between each mode and the preprocessed signal, and selecting the modes whose cross-correlation values exceed a set threshold.
The hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are retained and recorded as fr_n[l], where n represents the n-th mode and l represents the l-th sampling point in one mode.
The general threshold is set to
λ_n = (median(abs(f_n))/0.6745)·sqrt(2·ln L),
where L is the signal length after framing in the time domain, f_n is the n-th of the N modes obtained in step 105, abs() denotes the absolute value and median() denotes the median.
fr_n[l] is calculated as:
fr_n[l] = f_n[l], if abs(f_n[l]) > λ_n; fr_n[l] = 0, otherwise;
where n = 1, 2, …, N and l = 1, 2, 3, …, L.
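The mode selection of step 106 can be sketched as follows; the normalized form of the cross-correlation and the value of the correlation threshold are assumptions.

```python
import numpy as np

def select_modes(denoised_modes, reference, corr_threshold):
    """Keep the denoised modes whose normalized cross-correlation with the
    preprocessed reference signal exceeds corr_threshold."""
    kept = []
    for fr in denoised_modes:
        corr = np.abs(np.dot(fr, reference)) / (
            np.linalg.norm(fr) * np.linalg.norm(reference) + 1e-12)
        if corr > corr_threshold:
            kept.append(fr)
    return kept
```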
Step 107: performing logarithmic power spectrum calculation on the modes selected in step 106, calculating the d-th cepstrum coefficient of the n-th mode from the logarithmic power spectrum, and obtaining a feature vector, i.e. the voice information matrix, from the cepstrum coefficients.
The logarithmic power spectrum is calculated as:
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L.
According to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as:
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
where D represents the total number of cepstrum coefficients in each mode.
According to the cepstrum coefficients, the speech information matrix (i.e., feature vector) is represented as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)]
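Step 107 can be sketched as follows; the DCT-style projection follows the formulas above, and the small offset added inside the logarithm is a numerical-stability assumption.

```python
import numpy as np

def cepstral_features(denoised_modes, n_coeffs):
    """Log power spectrum of each denoised mode followed by a DCT-style projection
    to n_coeffs cepstrum coefficients per mode; concatenation over modes yields v."""
    feats = []
    for fr in denoised_modes:
        L = fr.size
        E = np.log(np.abs(np.fft.fft(fr)) ** 2 + 1e-12)            # log power spectrum
        k = np.arange(L) + 0.5                                     # k - 0.5, k = 1..L
        for d in range(1, n_coeffs + 1):
            feats.append(np.sum(E * np.cos(np.pi * d * k / L)))    # d-th cepstrum coefficient
    return np.array(feats)                                         # v = [c^(1)(1), ..., c^(N)(D)]
```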
step 108: and inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model for identifying the object to which the voiceprint belongs.
The neural network is configured as shown in Fig. 3: the voice information matrix is fed into convolutional and pooling layers, which extract low-dimensional features while reducing the spatial dimensions, and batch normalization (BN) layers between them improve the generalization capability of the model. The weight of each channel is then obtained through a residual channel attention module. Finally, a fully connected layer identifies the identity of the tested person.
The attention mechanism model may be expressed as:
S = σ(W_2·δ(W_1·z)), with z = (1/(H·W))·Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),
where S represents the weight of each channel, δ() and σ() represent the ReLU and sigmoid activation functions respectively, W_1 and W_2 are the coefficients of the fully connected layers in the residual channel attention module, H and W represent the numbers of rows and columns of the matrix, i and j represent the i-th row and the j-th column respectively, and u is the direct input of the attention module.
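A PyTorch sketch of such a network; the layer sizes, the reduction ratio of the attention block and the residual connection layout are assumptions rather than the exact configuration of Fig. 3.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Residual channel attention block: S = sigmoid(W2(relu(W1(avgpool(u)))))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.fc(self.pool(u).view(b, c)).view(b, c, 1, 1)
        return u + u * s                        # residual reweighting of the channels

class VoiceprintCNN(nn.Module):
    """Minimal recognition network: conv/BN/pool feature extraction,
    channel attention, and a fully connected classifier."""
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            ChannelAttention(32),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32, n_speakers)

    def forward(self, x):                       # x: (batch, 1, H, W) voice information matrix
        h = self.features(x).flatten(1)
        return self.classifier(h)
```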
Unlike the prior art, in which traditional MFCC features are extracted from the signal and used as the input of a deep neural network, the deep learning voiceprint recognition technology based on multi-channel wavelet decomposition common noise reduction enhances the effective signal through multi-channel spatial filtering, and uses the empirical wavelet transform to avoid the decomposition differences that arise in wavelet transforms when different basis wavelets are chosen. Computing cepstrum coefficients for the different modes reduces the influence of local noise on the global feature coefficients and greatly improves the robustness of the cepstral features in noise.
In summary, the voiceprint recognition method of the embodiments of the invention, applied to noisy, noise-rich scenes, obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved by introducing an attention mechanism to obtain the weight of each channel, which improves the accuracy of recognition.
The order of processing elements and sequences, the use of alphanumeric characters, or other designations in the present application is not intended to limit the order of the processes and methods in the present application, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
This application uses specific words to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
The terms and expressions which have been employed herein are used as terms of description and not of limitation. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that although the present invention has been described with reference to the current specific embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of the present invention, and various equivalent changes or substitutions may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the claims of the present application.

Claims (9)

1. A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction is characterized by comprising the following steps:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of preprocessed signals; converting the actual frequency to a Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to the maxima; taking (ω_n + ω_{n+1})/2 as the boundary of adjacent intervals, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from said Mel frequency back to said actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction treatment on each of the N modes in a hard processing mode, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold value;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix according to the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model, and identifying an object to which the voiceprint belongs;
wherein, in step f:
the hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are retained and recorded as fr_n[l], wherein n represents the n-th mode and l represents the l-th sampling point in one mode;
the general threshold is set to
λ_n = (median(abs(f_n))/0.6745)·sqrt(2·ln L),
wherein L is the signal length after framing in the time domain and f_n is the n-th of the N modes obtained in step e;
fr_n[l] is calculated as:
fr_n[l] = f_n[l], if abs(f_n[l]) > λ_n; fr_n[l] = 0, otherwise;
wherein n = 1, 2, …, N; l = 1, 2, 3, …, L.
2. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein step b comprises:
carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
3. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising as claimed in claim 2, wherein the incoming wave direction is obtained as follows:
performing generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values tau of the signals reaching different array units;
solving the incoming wave direction according to the following formula and the distance R between array units in the multichannel spiral array, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
where θ represents the incoming wave direction.
4. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein the preprocessing of step c comprises the steps of:
pre-emphasis, framing, windowing, endpoint detection.
5. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 4, wherein:
the pre-emphasis step comprises: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) - 0.95x(l-1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sample point;
the framing step is as follows: framing the high-pass filtered signal according to a fixed length N;
the windowing step comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), wherein
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
wherein N is the length of the speech sequence, and the empirical value a is 0.46;
the step of end point detection adopts a double threshold method, namely two thresholds of the double threshold method are determined by adopting short-time energy and zero crossing rate, and when the windowed signal exceeds the two thresholds simultaneously, the signal is considered to be in a speech stage.
6. The method for deep learning voiceprint recognition based on multi-channel wavelet decomposition common denoising of claim 1, wherein the wavelet transform in step e is an empirical wavelet transform.
7. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 6, wherein in step e:
the scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet transform are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
wherein the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
wherein γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)],
where ω is the frequency and n represents the n-th of the N modes; the n-th mode corresponds to the n-th interval, and ω_n denotes the frequency corresponding to the maximum value of the n-th interval;
let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
wherein f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1(ω) is the expression of the scale function for the 1st mode, φ̂_1*(ω) is the complex conjugate of φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform;
the N modes are represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
wherein t represents the time, * denotes convolution, and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
8. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step g:
the logarithmic power spectrum is calculated as:
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L;
according to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as:
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
wherein D represents the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficients, the speech information matrix is expressed as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)].
9. the deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step h:
the attention mechanism model is represented as:
S = σ(W_2·δ(W_1·z)), with z = (1/(H·W))·Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),
wherein S represents a weight of each of the plurality of channels, δ() and σ() represent the ReLU activation function and the sigmoid activation function respectively, W_1 and W_2 are the coefficients of the fully connected layers in the attention mechanism model, H and W represent the numbers of rows and columns of the matrix, i and j represent the i-th row and the j-th column respectively, and u is the direct input of the attention mechanism model.
CN202111480885.6A 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction Active CN113903344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480885.6A CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480885.6A CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Publications (2)

Publication Number Publication Date
CN113903344A CN113903344A (en) 2022-01-07
CN113903344B true CN113903344B (en) 2022-03-11

Family

ID=79025559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480885.6A Active CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Country Status (1)

Country Link
CN (1) CN113903344B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115249476A (en) * 2022-07-15 2022-10-28 北京市燃气集团有限责任公司 Intelligent linkage gas cooker based on voice recognition and intelligent linkage method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316653B (en) * 2016-04-27 2020-06-26 南京理工大学 Improved empirical wavelet transform-based fundamental frequency detection method
CN106568607A (en) * 2016-11-04 2017-04-19 东南大学 Rub-impact sound emission fault diagnosis method based on empirical wavelet transformation
US20200057932A1 (en) * 2018-08-16 2020-02-20 Gyrfalcon Technology Inc. System and method for generating time-spectral diagrams in an integrated circuit solution
CN111341307A (en) * 2020-03-13 2020-06-26 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111640437A (en) * 2020-05-25 2020-09-08 中国科学院空间应用工程与技术中心 Voiceprint recognition method and system based on deep learning
CN112001215B (en) * 2020-05-25 2023-11-24 天津大学 Text irrelevant speaker identity recognition method based on three-dimensional lip movement
CN112712814A (en) * 2020-12-04 2021-04-27 中国南方电网有限责任公司 Voiceprint recognition method based on deep learning algorithm
CN112784798B (en) * 2021-02-01 2022-11-08 东南大学 Multi-modal emotion recognition method based on feature-time attention mechanism
CN112908341B (en) * 2021-02-22 2023-01-03 哈尔滨工程大学 Language learner voiceprint recognition method based on multitask self-attention mechanism
CN113077795B (en) * 2021-04-06 2022-07-15 重庆邮电大学 Voiceprint recognition method under channel attention spreading and aggregation
CN113129897B (en) * 2021-04-08 2024-02-20 杭州电子科技大学 Voiceprint recognition method based on attention mechanism cyclic neural network

Also Published As

Publication number Publication date
CN113903344A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
Ayvaz et al. Automatic speaker recognition using mel-frequency cepstral coefficients through machine learning
CN112349297B (en) Depression detection method based on microphone array
KR20200115731A (en) Method and apparatus for recognition of sound events based on convolutional neural network
CN106847267B (en) Method for detecting overlapped voice in continuous voice stream
CN112331218B (en) Single-channel voice separation method and device for multiple speakers
Sharma et al. Study of robust feature extraction techniques for speech recognition system
CN113903344B (en) Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
Qu et al. Multimodal target speech separation with voice and face references
CN108053842B (en) Short wave voice endpoint detection method based on image recognition
Aroudi et al. Dbnet: Doa-driven beamforming network for end-to-end reverberant sound source separation
CN111508504A (en) Speaker recognition method based on auditory center perception mechanism
Hemavathi et al. Voice conversion spoofing detection by exploring artifacts estimates
AU2362495A (en) Speech-recognition system utilizing neural networks and method of using same
CN113963718B (en) Voice conversation segmentation method based on deep learning
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
Martín-Doñas et al. Multi-channel block-online source extraction based on utterance adaptation
CN111968671B (en) Low-altitude sound target comprehensive identification method and device based on multidimensional feature space
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Mukhedkar et al. Robust feature extraction methods for speech recognition in noisy environments
Shareef et al. Comparison between features extraction techniques for impairments arabic speech
Thakur et al. Design of Hindi key word recognition system for home automation system using MFCC and DTW
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method
CN113314127A (en) Space orientation-based bird song recognition method, system, computer device and medium
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 311121 building 3, No.10 Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Zhaohua Electronics Co.,Ltd.

Address before: 311122 building 3, No. 10, Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: CRY SOUND CO.,LTD.

CP03 Change of name, title or address