CN113903344B - Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction - Google Patents

Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Info

Publication number
CN113903344B
CN113903344B (application CN202111480885.6A)
Authority
CN
China
Prior art keywords
mode
signals
frequency
signal
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111480885.6A
Other languages
Chinese (zh)
Other versions
CN113903344A (en)
Inventor
曹祖杨
杜子哲
张凯强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Crysound Electronics Co Ltd
Original Assignee
Cry Sound Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cry Sound Co ltd filed Critical Cry Sound Co ltd
Priority to CN202111480885.6A priority Critical patent/CN113903344B/en
Publication of CN113903344A publication Critical patent/CN113903344A/en
Application granted granted Critical
Publication of CN113903344B publication Critical patent/CN113903344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/18: Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction comprises the following steps: a. acquiring signals from a plurality of channels; b. calculating the incoming wave direction through the array; c. preprocessing the signals; d. performing Fourier transform on each frame of the preprocessed signals; converting the actual frequency to the Mel frequency; determining the maxima at which the signal amplitude exceeds a threshold; determining the frequencies ω_1, …, ω_N corresponding to the maxima; taking (ω_n + ω_{n+1})/2 as the boundary between adjacent segments, thereby dividing the signal frequency range into N intervals; and then converting from the Mel frequency back to the actual frequency; e. performing wavelet transform on the signals processed in step d according to the N intervals; f. denoising each mode by hard thresholding, performing a cross-correlation operation between each mode and the preprocessed signals, and selecting the modes whose cross-correlation values exceed a set threshold; g. obtaining a voice information matrix from the selected modes; h. inputting the voice information matrix into a convolutional neural network based on an attention mechanism model to identify the object to which the voiceprint belongs.

Description

Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
Technical Field
The invention relates to a voiceprint recognition technology, in particular to a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
Background
Voiceprint recognition technology has numerous application scenarios in security, finance and other fields, for example identity confirmation and mobile payment. In a quiet environment, voiceprint recognition accuracy is already quite high. In real scenes, however, the environment is complex, noise sources are diverse, and multi-target situations arise in which several people speak simultaneously. When the collected sound signals are processed directly, the noise greatly degrades the accuracy of voiceprint recognition and causes recognition errors. It is therefore of great value to study how to identify the target sound source in the acquired non-stationary signal through a deep neural network and thereby improve recognition accuracy.
Commonly used voiceprint recognition models include HMM and GMM-UBM. With deepening research on neural networks, neural network models such as RNN and LSTM structures have also been applied to voiceprint recognition. However, these networks take a long time to train, and their voiceprint recognition rate drops in complex environments.
In view of the above, a voiceprint recognition method capable of improving recognition accuracy under a complex environment is needed.
Disclosure of Invention
In order to overcome the technical problems in the prior art, the invention provides a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
The invention discloses a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction, which comprises the following steps of:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of preprocessed signals; converting the actual frequency to a Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to these maxima; taking (ω_n + ω_{n+1})/2 as the boundary of adjacent segments, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from said Mel frequency back to said actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction treatment on each of the N modes in a hard processing mode, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold value;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix according to the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model for identifying the object to which the voiceprint belongs.
In one embodiment, step b comprises: carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
In one embodiment, the incoming wave direction is obtained as follows:
performing generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values tau of the signals reaching different array units;
solving the incoming wave direction according to the following formula and the distance R between array units in the multichannel spiral array, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
where θ represents the incoming wave direction.
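By way of illustration, the following Python sketch estimates the delay τ between two channel signals with a generalized cross-correlation and converts it into an arrival angle. It is a minimal example assuming two microphone channels sampled at fs and PHAT weighting of the cross-spectrum; the function names and the weighting choice are assumptions, not part of the method itself.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay between two channel signals via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                          # delay tau in seconds

def doa_from_delay(tau, R, c=343.0):
    """Arrival angle theta (degrees) from delay tau, unit spacing R and sound speed c."""
    return np.degrees(np.arccos(np.clip(c * tau / R, -1.0, 1.0)))
```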
In one embodiment, the pre-processing of step c comprises the steps of: pre-emphasis, framing, windowing, endpoint detection.
In one embodiment, the pre-emphasis step is: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) - 0.95x(l-1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sample point.
In one embodiment, the framing step is: and framing the high-pass filtered signal according to a fixed length N.
In one embodiment, the windowing step is: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), where
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
where N is the speech sequence length, and the empirical value a is 0.46.
In one embodiment, the endpoint detection step uses a double-threshold method, that is, the two thresholds of the double-threshold method are determined from the short-time energy and the zero-crossing rate, and when the windowed signal exceeds both thresholds simultaneously, the signal is considered to be in a speech segment.
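For concreteness, the preprocessing chain of pre-emphasis, fixed-length framing and windowing described above can be sketched as follows; this is a minimal Python example, and the non-overlapping framing and the function name are assumptions.

```python
import numpy as np

def preprocess(x, frame_len, a=0.46):
    """Pre-emphasis, framing and windowing of a spatially filtered signal.
    A minimal sketch; frame_len and the hop size (no overlap here) are assumptions."""
    # Pre-emphasis: y(l) = x(l) - 0.95*x(l-1)
    y = np.append(x[0], x[1:] - 0.95 * x[:-1])
    # Framing with fixed length N (non-overlapping frames for simplicity)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Windowing with w(l) = (1-a) - a*cos(2*pi*l/(N-1)), a = 0.46
    l = np.arange(frame_len)
    w = (1 - a) - a * np.cos(2 * np.pi * l / (frame_len - 1))
    return frames * w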
In one embodiment, the wavelet transform in step e is an empirical wavelet transform.
In one embodiment, in step e:
the scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet transform are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
wherein the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
wherein γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)], ω is the frequency, and n represents the n-th of the N modes;
let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
wherein f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1*(ω) is the complex conjugate of the scale function φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform;
the N modes are represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
where * denotes convolution and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
in one embodiment, in step f:
the hard processing mode is as follows: for each mode, selecting a sampling point with the amplitude value exceeding a general threshold value in the mode, and counting as frn[l]Wherein n represents the nth mode, and l represents the l sampling point in one mode;
the general threshold is set to
Figure GDA0003495343710000042
Where L is the signal length after framing in the time domain, fnFor the nth modality of the N modalities obtained in step e,
frn[l]the calculation method is as follows:
Figure GDA0003495343710000043
wherein N is 1,2, … N; l ═ 1,2,3,. L.
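A minimal Python sketch of the hard processing mode, assuming the general threshold takes the median-based form given above:

```python
import numpy as np

def hard_threshold(mode):
    """Hard thresholding of one mode with a universal-style threshold.
    The threshold form median(|f_n|)/0.6745 * sqrt(2 ln L) is an assumption."""
    L = mode.size
    lam = np.median(np.abs(mode)) / 0.6745 * np.sqrt(2.0 * np.log(L))
    out = mode.copy()
    out[np.abs(out) <= lam] = 0.0   # zero out samples below the threshold
    return out
```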
In one embodiment, in step g:
the logarithmic power spectrum of each selected mode is calculated as
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L;
according to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
wherein D represents the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficients, the speech information matrix is expressed as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)]
in one embodiment, in step h:
the attention mechanism model is represented as:
Figure GDA0003495343710000046
wherein S represents a weight of each of the plurality of channels, δ () and σ () represent a ReLu activation function and a sigmoid activation function, respectively, W1And W2For the coefficients of the full connection layer in the attention mechanism model, H and W represent the number of rows and columns of the matrix, i and j represent the ith row and the jth column respectively, and u is the direct input of the attention mechanism model.
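The channel weighting can be illustrated with a short Python sketch: global average pooling over the H×W positions followed by two fully connected layers, as in the formula above. The reduction ratio implied by the shapes of W1 and W2 and the final reweighting of u are assumptions.

```python
import numpy as np

def channel_attention(u, W1, W2):
    """Channel attention sketch: S = sigmoid(W2 . relu(W1 . z)), z = mean over H*W.
    u: feature map of shape (C, H, W); W1: (C//r, C); W2: (C, C//r)."""
    C, H, W = u.shape
    z = u.reshape(C, -1).mean(axis=1)           # global average pooling over H*W
    h = np.maximum(W1 @ z, 0.0)                 # ReLU (delta)
    S = 1.0 / (1.0 + np.exp(-(W2 @ h)))         # sigmoid (sigma), per-channel weights
    return u * S[:, None, None]                 # reweight each channel of u
```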
The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction divides the frequency intervals of the empirical wavelet transform in the Mel-frequency domain, obtains effective speech features through the empirical wavelet transform, and inputs them into a neural network to realize voiceprint recognition. The invention applies to voiceprint recognition in noisy, noise-rich environments and obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved by introducing an attention mechanism to obtain the weight of each channel, which improves the accuracy of recognition.
Drawings
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
FIG. 2 illustrates a flow diagram of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
fig. 3 shows a diagram of a neural network configuration according to an embodiment of the present invention.
Detailed Description
The detailed features and advantages of the present invention are described in detail in the detailed description which follows, and will be sufficient for anyone skilled in the art to understand the technical content of the present invention and to implement the present invention, and the related objects and advantages of the present invention will be easily understood by those skilled in the art from the description, claims and drawings disclosed in the present specification.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
The method mainly comprises a spatial filtering part, a wavelet denoising part and a voiceprint recognition part.
FIG. 2 shows a flowchart of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
With reference to fig. 1 and 2, the method includes, but is not limited to, the following steps.
Step 101: acquiring signals of a plurality of channels by using a spiral microphone array;
step 102: and carrying out spatial filtering on the collected signals of the plurality of channels to obtain signals after spatial filtering.
In one embodiment, the spatial filtering comprises: performing spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and performing phase synchronization on all channel signals according to the time delay values with which the signals reach different array units, and summing all the channel signals according to their weight ratios (because of the incoming wave direction θ, the weight ratio of the signal received by each array unit is different) to obtain the spatially filtered signals.
The incoming wave direction θ is obtained as follows:
and performing generalized cross-correlation operation on the acquired multi-channel time domain signals to obtain time delay values tau of the signals reaching different array units. Obtaining an incoming wave direction theta (namely an arrival angle of a signal) according to the distance R between the array units, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
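A delay-and-sum sketch of the spatial filtering (phase synchronization by the per-channel delays followed by a weighted sum); uniform weights are assumed when no weight ratio is supplied, and the function name is hypothetical.

```python
import numpy as np

def delay_and_sum(channels, delays, fs, weights=None):
    """Spatial filtering: advance each channel by its delay (seconds) so all
    channels are phase-aligned, then sum with per-channel weights.
    channels: array of shape (n_mics, n_samples)."""
    n_mics, n_samples = channels.shape
    if weights is None:
        weights = np.full(n_mics, 1.0 / n_mics)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spec = np.fft.rfft(channels[m])
        spec *= np.exp(2j * np.pi * freqs * delays[m])   # advance channel m by delays[m]
        out += weights[m] * np.fft.irfft(spec, n=n_samples)
    return out
```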
step 103: and preprocessing the signals after the spatial filtering to obtain primary signals. The preprocessing includes signal normalization, pre-emphasis, framing, windowing (e.g., window function selection hamming window), end point detection.
In one embodiment, the pre-emphasis process comprises: and carrying out pre-emphasis processing by adopting high-pass filtering, wherein the formula is as follows.
y(l)=x(l)-0.95x(l-1)
Where x (l) is the spatially filtered discrete signal and y (l) is the high-pass filtered signal, where l is the sample point.
In one embodiment, the framing process includes: the high-pass filtered signal is framed by a fixed length N, for example, each frame being 40ms in length.
In one embodiment, the windowing process comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), where
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
where N is the speech sequence length, l is the sample point, and the empirical value a is 0.46.
In one embodiment, the endpoint detection process employs a double-threshold method: the short-time energy and the zero-crossing rate determine the two thresholds, and when the signal exceeds both thresholds simultaneously, the signal is considered to be in a speech segment.
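A sketch of the double-threshold decision on framed signals; the threshold values themselves are assumptions to be tuned on real data.

```python
import numpy as np

def double_threshold_vad(frames, energy_thresh, zcr_thresh):
    """Mark a frame as speech when its short-time energy and zero-crossing rate
    both exceed their thresholds, following the double-threshold description above."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thresh) & (zcr > zcr_thresh)
```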
Step 104: performing Fourier transform on each frame of the preprocessed signal f(t), and converting the actual frequency into the Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to these maxima; taking (ω_n + ω_{n+1})/2 as the boundary between adjacent intervals, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from the Mel frequency back to the actual frequency.
the deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction converts signals into Mel frequency to divide frequency intervals to accord with auditory characteristics of human ears, divides the frequency intervals according to a maximum value to ensure that effective signals exist in each divided region, and increases the accuracy of extracted signal characteristics.
Step 105: performing empirical wavelet transform on the signals processed in step 104 according to the divided intervals to obtain N modes, each interval corresponding to one mode. The scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
where the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
where γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)], ω is the frequency, and n represents the n-th mode.
Let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
where f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1*(ω) is the complex conjugate of the scale function φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform.
The respective modes may be represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
where n represents the n-th mode, * denotes convolution, and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
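A compact Python sketch of the empirical wavelet decomposition driven by the N-1 internal interval boundaries (given as normalized angular frequencies in (0, π)); the value of γ, the treatment of the last band as a high-pass wavelet and the plain frequency-domain multiplication are assumptions.

```python
import numpy as np

def beta(x):
    """Transition polynomial beta(x) = x^4 (35 - 84x + 70x^2 - 20x^3), clipped to [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return x ** 4 * (35 - 84 * x + 70 * x ** 2 - 20 * x ** 3)

def ewt_modes(f, boundaries, gamma=0.1):
    """Build the frequency-domain scale/wavelet filters from the interval
    boundaries and recover one mode per band by filtering in the Fourier domain."""
    L = f.size
    w = np.abs(np.fft.fftfreq(L) * 2 * np.pi)    # |omega| for every FFT bin
    F = np.fft.fft(f)
    filters = []
    # scale function for the first band [0, omega_1]
    wn = boundaries[0]
    phi = np.where(w <= (1 - gamma) * wn, 1.0,
                   np.where(w <= (1 + gamma) * wn,
                            np.cos(np.pi / 2 * beta((w - (1 - gamma) * wn) / (2 * gamma * wn))),
                            0.0))
    filters.append(phi)
    # wavelet functions for the bands [omega_n, omega_{n+1}]
    for wn, wn1 in zip(boundaries[:-1], boundaries[1:]):
        psi = np.zeros(L)
        psi[(w >= (1 + gamma) * wn) & (w <= (1 - gamma) * wn1)] = 1.0
        up = (w > (1 - gamma) * wn1) & (w <= (1 + gamma) * wn1)
        psi[up] = np.cos(np.pi / 2 * beta((w[up] - (1 - gamma) * wn1) / (2 * gamma * wn1)))
        low = (w >= (1 - gamma) * wn) & (w < (1 + gamma) * wn)
        psi[low] = np.sin(np.pi / 2 * beta((w[low] - (1 - gamma) * wn) / (2 * gamma * wn)))
        filters.append(psi)
    # last band [omega_{N-1}, pi]: treated as a high-pass wavelet (assumption)
    wn = boundaries[-1]
    hi = np.zeros(L)
    hi[w >= (1 + gamma) * wn] = 1.0
    low = (w >= (1 - gamma) * wn) & (w < (1 + gamma) * wn)
    hi[low] = np.sin(np.pi / 2 * beta((w[low] - (1 - gamma) * wn) / (2 * gamma * wn)))
    filters.append(hi)
    return [np.real(np.fft.ifft(F * flt)) for flt in filters]   # one mode per band
```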
Step 106: denoising each mode in a hard processing mode, performing a cross-correlation operation between each mode and the preprocessed signal, and selecting the modes whose cross-correlation values exceed a set threshold.
The hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are retained and recorded as fr_n[l], where n represents the n-th mode and l represents the l-th sampling point in one mode.
The general threshold is set to
λ_n = (median(abs(f_n))/0.6745)·sqrt(2·ln L),
where L is the signal length after framing in the time domain, f_n is the n-th of the N modes obtained in step 105, abs() denotes the absolute value and median() denotes the median.
fr_n[l] is calculated as:
fr_n[l] = f_n[l], if abs(f_n[l]) > λ_n; fr_n[l] = 0, otherwise;
where n = 1, 2, …, N and l = 1, 2, 3, …, L.
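The mode selection of step 106 can be sketched as follows; the normalized form of the cross-correlation and the value of the correlation threshold are assumptions.

```python
import numpy as np

def select_modes(denoised_modes, reference, corr_threshold):
    """Keep the denoised modes whose normalized cross-correlation with the
    preprocessed reference signal exceeds corr_threshold."""
    kept = []
    for fr in denoised_modes:
        corr = np.abs(np.dot(fr, reference)) / (
            np.linalg.norm(fr) * np.linalg.norm(reference) + 1e-12)
        if corr > corr_threshold:
            kept.append(fr)
    return kept
```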
Step 107: performing logarithmic power spectrum calculation on the modes selected in step 106, calculating the d-th cepstrum coefficient of the n-th mode from the logarithmic power spectrum, and obtaining a feature vector, i.e. the voice information matrix, from the cepstrum coefficients.
The logarithmic power spectrum is calculated as:
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L.
According to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as:
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
where D represents the total number of cepstrum coefficients in each mode.
According to the cepstrum coefficients, the speech information matrix (i.e., feature vector) is represented as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)]
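Step 107 can be sketched as follows; the DCT-style projection follows the formulas above, and the small offset added inside the logarithm is a numerical-stability assumption.

```python
import numpy as np

def cepstral_features(denoised_modes, n_coeffs):
    """Log power spectrum of each denoised mode followed by a DCT-style projection
    to n_coeffs cepstrum coefficients per mode; concatenation over modes yields v."""
    feats = []
    for fr in denoised_modes:
        L = fr.size
        E = np.log(np.abs(np.fft.fft(fr)) ** 2 + 1e-12)            # log power spectrum
        k = np.arange(L) + 0.5                                     # k - 0.5, k = 1..L
        for d in range(1, n_coeffs + 1):
            feats.append(np.sum(E * np.cos(np.pi * d * k / L)))    # d-th cepstrum coefficient
    return np.array(feats)                                         # v = [c^(1)(1), ..., c^(N)(D)]
```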
step 108: and inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model for identifying the object to which the voiceprint belongs.
The neural network is configured as shown in Fig. 3: the voice information matrix is fed into convolutional and pooling layers, which extract low-dimensional features while reducing the spatial dimensions, and batch normalization (BN) layers between them improve the generalization capability of the model. The weight of each channel is then obtained through a residual channel attention module. Finally, a fully connected layer identifies the identity of the tested person.
The attention mechanism model may be expressed as:
S = σ(W_2·δ(W_1·z)), with z = (1/(H·W))·Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),
where S represents the weight of each channel, δ() and σ() represent the ReLU and sigmoid activation functions respectively, W_1 and W_2 are the coefficients of the fully connected layers in the residual channel attention module, H and W represent the numbers of rows and columns of the matrix, i and j represent the i-th row and the j-th column respectively, and u is the direct input of the attention module.
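A PyTorch sketch of such a network; the layer sizes, the reduction ratio of the attention block and the residual connection layout are assumptions rather than the exact configuration of Fig. 3.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Residual channel attention block: S = sigmoid(W2(relu(W1(avgpool(u)))))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.fc(self.pool(u).view(b, c)).view(b, c, 1, 1)
        return u + u * s                        # residual reweighting of the channels

class VoiceprintCNN(nn.Module):
    """Minimal recognition network: conv/BN/pool feature extraction,
    channel attention, and a fully connected classifier."""
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            ChannelAttention(32),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32, n_speakers)

    def forward(self, x):                       # x: (batch, 1, H, W) voice information matrix
        h = self.features(x).flatten(1)
        return self.classifier(h)
```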
Unlike the prior art, in which traditional MFCC features are extracted from the signal and used as the input of a deep neural network, the deep learning voiceprint recognition technology based on multi-channel wavelet decomposition common noise reduction enhances the effective signal through multi-channel spatial filtering, and uses the empirical wavelet transform to avoid the decomposition differences that arise in wavelet transforms when different basis wavelets are chosen. Computing cepstrum coefficients for the different modes reduces the influence of local noise on the global feature coefficients and greatly improves the robustness of the cepstral features in noise.
In summary, the voiceprint recognition method of the embodiments of the invention, applied to noisy, noise-rich scenes, obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved by introducing an attention mechanism to obtain the weight of each channel, which improves the accuracy of recognition.
The order of processing elements and sequences, the use of alphanumeric characters, or other designations in the present application is not intended to limit the order of the processes and methods in the present application, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
This application uses specific words to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
The terms and expressions which have been employed herein are used as terms of description and not of limitation. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that although the present invention has been described with reference to the current specific embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of the present invention, and various equivalent changes or substitutions may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the claims of the present application.

Claims (9)

1. A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction is characterized by comprising the following steps:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of preprocessed signals; converting the actual frequency to a Mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold; determining the frequencies ω_1, …, ω_N corresponding to the maxima; taking (ω_n + ω_{n+1})/2 as the boundary of adjacent intervals, where n = 1, …, N-1, thereby dividing the signal frequency range into N intervals; and then converting from said Mel frequency back to said actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction treatment on each of the N modes in a hard processing mode, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold value;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix according to the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model, and identifying an object to which the voiceprint belongs;
wherein, in step f:
the hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are retained and recorded as fr_n[l], wherein n represents the n-th mode and l represents the l-th sampling point in one mode;
the general threshold is set to
λ_n = (median(abs(f_n))/0.6745)·sqrt(2·ln L),
wherein L is the signal length after framing in the time domain and f_n is the n-th of the N modes obtained in step e;
fr_n[l] is calculated as:
fr_n[l] = f_n[l], if abs(f_n[l]) > λ_n; fr_n[l] = 0, otherwise;
wherein n = 1, 2, …, N; l = 1, 2, 3, …, L.
2. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein step b comprises:
carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
3. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising as claimed in claim 2, wherein the incoming wave direction is obtained as follows:
performing generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values tau of the signals reaching different array units;
solving the incoming wave direction according to the following formula and the distance R between array units in the multichannel spiral array, the sound velocity c and the time delay value tau:
θ = arccos(c·τ/R)
where θ represents the incoming wave direction.
4. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein the preprocessing of step c comprises the steps of:
pre-emphasis, framing, windowing, endpoint detection.
5. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 4, wherein:
the pre-emphasis step comprises: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) - 0.95x(l-1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sample point;
the framing step is as follows: framing the high-pass filtered signal according to a fixed length N;
the windowing step comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y_w(l) = y(l)·w(l), wherein
w(l) = (1-a) - a·cos(2πl/(N-1)), 0 ≤ l ≤ N-1,
wherein N is the length of the speech sequence, and the empirical value a is 0.46;
the step of end point detection adopts a double threshold method, namely two thresholds of the double threshold method are determined by adopting short-time energy and zero crossing rate, and when the windowed signal exceeds the two thresholds simultaneously, the signal is considered to be in a speech stage.
6. The method for deep learning voiceprint recognition based on multi-channel wavelet decomposition common denoising of claim 1, wherein the wavelet transform in step e is an empirical wavelet transform.
7. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 6, wherein in step e:
the scale function φ̂_n(ω) and the wavelet function ψ̂_n(ω) of the empirical wavelet transform are defined in the frequency domain as follows:
φ̂_n(ω) = 1, if |ω| ≤ (1-γ)ω_n;
φ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
φ̂_n(ω) = 0, otherwise;
ψ̂_n(ω) = 1, if (1+γ)ω_n ≤ |ω| ≤ (1-γ)ω_{n+1};
ψ̂_n(ω) = cos[(π/2)·β((|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}))], if (1-γ)ω_{n+1} ≤ |ω| ≤ (1+γ)ω_{n+1};
ψ̂_n(ω) = sin[(π/2)·β((|ω| - (1-γ)ω_n)/(2γω_n))], if (1-γ)ω_n ≤ |ω| ≤ (1+γ)ω_n;
ψ̂_n(ω) = 0, otherwise;
wherein the β function is β(x) = x^4·(35 - 84x + 70x^2 - 20x^3), and x stands for the respective argument (|ω| - (1-γ)ω_n)/(2γω_n) or (|ω| - (1-γ)ω_{n+1})/(2γω_{n+1}) of the β function;
wherein γ is a coefficient satisfying 0 < γ < 1 and γ < min_n[(ω_{n+1} - ω_n)/(ω_{n+1} + ω_n)],
where ω is the frequency and n represents the n-th of the N modes; the n-th mode corresponds to the n-th interval, and ω_n denotes the frequency corresponding to the maximum value of the n-th interval;
let the approximation coefficients be
W_f(0, t) = F^-1(f̂(ω)·φ̂_1*(ω))
and the detail coefficients be
W_f(n, t) = F^-1(f̂(ω)·ψ̂_n*(ω)),
wherein f̂(ω) is the frequency spectrum of the preprocessed signal f(t), ψ̂_n*(ω) is the complex conjugate of ψ̂_n(ω), φ̂_1(ω) is the expression of the scale function for the 1st mode, φ̂_1*(ω) is the complex conjugate of φ̂_1(ω), f̂, φ̂ and ψ̂ denote the Fourier transforms of f, φ and ψ respectively, and F^-1 is the inverse Fourier transform;
the N modes are represented as:
f_0(t) = W_f(0, t) * φ_1(t) and f_n(t) = W_f(n, t) * ψ_n(t), n = 1, 2, …, N-1,
wherein t represents the time, * denotes convolution, and φ_1(t), ψ_n(t) are the time-domain representations of φ̂_1(ω), ψ̂_n(ω).
8. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step g:
the logarithmic power spectrum is calculated as:
E_n(k) = ln(|FFT(fr_n)(k)|^2), k = 1, 2, …, L;
according to the logarithmic power spectrum, the d-th cepstrum coefficient c^(n)(d) of the n-th mode is calculated as:
c^(n)(d) = Σ_{k=1}^{L} E_n(k)·cos(π·d·(k-0.5)/L), d = 1, 2, …, D,
wherein D represents the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficients, the speech information matrix is expressed as:
v = [c^(1)(1), c^(1)(2), …, c^(N)(D)].
9. the deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step h:
the attention mechanism model is represented as:
S = σ(W_2·δ(W_1·z)), with z = (1/(H·W))·Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),
wherein S represents a weight of each of the plurality of channels, δ() and σ() represent the ReLU activation function and the sigmoid activation function respectively, W_1 and W_2 are the coefficients of the fully connected layers in the attention mechanism model, H and W represent the numbers of rows and columns of the matrix, i and j represent the i-th row and the j-th column respectively, and u is the direct input of the attention mechanism model.
CN202111480885.6A 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction Active CN113903344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480885.6A CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480885.6A CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Publications (2)

Publication Number Publication Date
CN113903344A CN113903344A (en) 2022-01-07
CN113903344B true CN113903344B (en) 2022-03-11

Family

ID=79025559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480885.6A Active CN113903344B (en) 2021-12-07 2021-12-07 Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction

Country Status (1)

Country Link
CN (1) CN113903344B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115249476A (en) * 2022-07-15 2022-10-28 北京市燃气集团有限责任公司 Intelligent linkage gas cooker based on voice recognition and intelligent linkage method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316653B (en) * 2016-04-27 2020-06-26 南京理工大学 Improved empirical wavelet transform-based fundamental frequency detection method
CN106568607A (en) * 2016-11-04 2017-04-19 东南大学 Rub-impact sound emission fault diagnosis method based on empirical wavelet transformation
US20200057932A1 (en) * 2018-08-16 2020-02-20 Gyrfalcon Technology Inc. System and method for generating time-spectral diagrams in an integrated circuit solution
CN111341307A (en) * 2020-03-13 2020-06-26 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111640437A (en) * 2020-05-25 2020-09-08 中国科学院空间应用工程与技术中心 Voiceprint recognition method and system based on deep learning
CN112001215B (en) * 2020-05-25 2023-11-24 天津大学 Text irrelevant speaker identity recognition method based on three-dimensional lip movement
CN112712814A (en) * 2020-12-04 2021-04-27 中国南方电网有限责任公司 Voiceprint recognition method based on deep learning algorithm
CN112784798B (en) * 2021-02-01 2022-11-08 东南大学 Multi-modal emotion recognition method based on feature-time attention mechanism
CN112908341B (en) * 2021-02-22 2023-01-03 哈尔滨工程大学 Language learner voiceprint recognition method based on multitask self-attention mechanism
CN113077795B (en) * 2021-04-06 2022-07-15 重庆邮电大学 Voiceprint recognition method under channel attention spreading and aggregation
CN113129897B (en) * 2021-04-08 2024-02-20 杭州电子科技大学 Voiceprint recognition method based on attention mechanism cyclic neural network

Also Published As

Publication number Publication date
CN113903344A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
Ayvaz et al. Automatic speaker recognition using mel-frequency cepstral coefficients through machine learning
CN112349297B (en) Depression detection method based on microphone array
KR20200115731A (en) Method and apparatus for recognition of sound events based on convolutional neural network
CN106847267B (en) Method for detecting overlapped voice in continuous voice stream
CN112331218B (en) Single-channel voice separation method and device for multiple speakers
Sharma et al. Study of robust feature extraction techniques for speech recognition system
CN113903344B (en) Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
Qu et al. Multimodal target speech separation with voice and face references
CN108053842B (en) Short wave voice endpoint detection method based on image recognition
Aroudi et al. Dbnet: Doa-driven beamforming network for end-to-end reverberant sound source separation
CN111508504A (en) Speaker recognition method based on auditory center perception mechanism
Hemavathi et al. Voice conversion spoofing detection by exploring artifacts estimates
AU2362495A (en) Speech-recognition system utilizing neural networks and method of using same
CN113963718B (en) Voice conversation segmentation method based on deep learning
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
Martín-Doñas et al. Multi-channel block-online source extraction based on utterance adaptation
CN111968671B (en) Low-altitude sound target comprehensive identification method and device based on multidimensional feature space
Salvati et al. Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features.
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Mukhedkar et al. Robust feature extraction methods for speech recognition in noisy environments
Shareef et al. Comparison between features extraction techniques for impairments arabic speech
Thakur et al. Design of Hindi key word recognition system for home automation system using MFCC and DTW
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method
CN113314127A (en) Space orientation-based bird song recognition method, system, computer device and medium
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 311121 building 3, No.10 Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Zhaohua Electronics Co.,Ltd.

Address before: 311122 building 3, No. 10, Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: CRY SOUND CO.,LTD.

CP03 Change of name, title or address