CN113903344B - Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction - Google Patents
- Publication number
- CN113903344B (application CN202111480885.6A)
- Authority
- CN
- China
- Prior art keywords
- mode
- signals
- frequency
- signal
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction comprises the following steps: a. acquiring signals of a plurality of channels; b. calculating the incoming wave direction through the array; c. preprocessing the signals; d. carrying out Fourier transform on each frame of the preprocessed signals; converting the actual frequency to the mel frequency; determining the maxima at which the signal amplitude exceeds a threshold value; determining the frequencies ω1, …, ωN corresponding to the maxima; taking (ωn + ωn+1)/2 as the boundary between adjacent segments, thereby dividing the signal frequency range into N intervals; and then converting from the mel frequency back to the actual frequency; e. performing wavelet transformation on the signals processed in step d according to the N intervals; f. performing noise reduction on each mode in a hard-thresholding manner, performing a cross-correlation operation between each mode and the preprocessed signals, and selecting the modes whose cross-correlation values exceed a set threshold; g. obtaining a voice information matrix from the selected modes; h. inputting the voice information matrix into a convolutional neural network based on an attention mechanism model to identify the object to which the voiceprint belongs.
Description
Technical Field
The invention relates to a voiceprint recognition technology, in particular to a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
Background
Voiceprint recognition technology has numerous application scenarios in security, finance and other fields, for example identity confirmation and mobile payment. In a quiet environment, voiceprint recognition accuracy is already quite high. In real scenes, however, the environment is complex, noise sources are varied, and multiple targets may speak simultaneously; when the collected sound signals are processed directly, the noise strongly degrades the accuracy of voiceprint recognition and causes recognition errors. It is therefore of great value to study how to identify the target sound source in the acquired non-stationary signal through a deep neural network and thereby improve recognition accuracy.
Commonly used voiceprint recognition models include HMM, GMM-UBM and the like; with the deepening of research on neural networks, some neural network models, such as RNN and LSTM structures, have also been applied to voiceprint recognition. However, these neural networks take a long time to train, and their voiceprint recognition rate drops in complex environments.
In view of the above, a voiceprint recognition method capable of improving recognition accuracy under a complex environment is needed.
Disclosure of Invention
In order to overcome the technical problems in the prior art, the invention provides a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction.
The invention discloses a deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction, which comprises the following steps of:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of the preprocessed signals; converting the actual frequency to the mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold value; determining the frequencies ω1, …, ωN corresponding to the maxima; taking (ωn + ωn+1)/2 as the boundary between adjacent segments, where 1 < n < N − 1, thereby dividing the signal frequency range into N intervals; and then converting from the mel frequency back to the actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction on each of the N modes in a hard-thresholding manner, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix from the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model to identify the object to which the voiceprint belongs.
In one embodiment, step b comprises: carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
In one embodiment, the incoming wave direction is obtained as follows:
performing a generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values τ of the signals arriving at the different array units;
solving the incoming wave direction θ from the distance R between array units in the multi-channel spiral array, the sound velocity c and the time delay value τ according to the following formula:

θ = arcsin(cτ / R)
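The delay estimation and angle computation above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `gcc_delay` and `doa_from_delay` are illustrative names, PHAT weighting is one common choice of generalized cross-correlation weighting, and θ = arcsin(cτ/R) is the assumed array geometry.

```python
import numpy as np

def gcc_delay(x1, x2, fs):
    # Generalized cross-correlation with PHAT weighting; returns the delay
    # (in seconds) of x2 relative to x1.
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12        # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def doa_from_delay(tau, R, c=343.0):
    # Incoming-wave angle from delay tau, element spacing R, sound speed c.
    return np.arcsin(np.clip(c * tau / R, -1.0, 1.0))
```

A delay of k samples between two array units then maps directly to an arrival angle through the arcsin relation.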
In one embodiment, the pre-processing of step c comprises the steps of: pre-emphasis, framing, windowing, endpoint detection.
In one embodiment, the pre-emphasis step is: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) − 0.95x(l−1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sampling point.
In one embodiment, the framing step is: and framing the high-pass filtered signal according to a fixed length N.
In one embodiment, the windowing step is: multiplying the framed signal by a window function w(l) to obtain the windowed signal y′(l) = y(l)·w(l), where w(l) = (1 − a) − a·cos(2πl/(N − 1)), 0 ≤ l ≤ N − 1, N is the speech sequence length, and the empirical value a is 0.46 (a Hamming window).
In one embodiment, the step of detecting the end point uses a double threshold method, that is, two thresholds of the double threshold method are determined by using a short-term energy and a zero-crossing rate, and when the windowed signal exceeds the two thresholds at the same time, the signal is considered to be in a speech phase.
In one embodiment, the wavelet transform in step e is an empirical wavelet transform.
In one embodiment, in step e:
defining the scale function $\hat{\phi}_n(\omega)$ and wavelet function $\hat{\psi}_n(\omega)$ of the empirical wavelet transform in the frequency domain as follows, $\gamma$ denoting the transition-band ratio:

$$\hat{\phi}_n(\omega)=\begin{cases}1, & |\omega|\le(1-\gamma)\omega_n\\ \cos\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_n}{2\gamma\omega_n}\right)\right], & (1-\gamma)\omega_n\le|\omega|\le(1+\gamma)\omega_n\\ 0, & \text{otherwise}\end{cases}$$

$$\hat{\psi}_n(\omega)=\begin{cases}1, & (1+\gamma)\omega_n\le|\omega|\le(1-\gamma)\omega_{n+1}\\ \cos\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_{n+1}}{2\gamma\omega_{n+1}}\right)\right], & (1-\gamma)\omega_{n+1}\le|\omega|\le(1+\gamma)\omega_{n+1}\\ \sin\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_n}{2\gamma\omega_n}\right)\right], & (1-\gamma)\omega_n\le|\omega|\le(1+\gamma)\omega_n\\ 0, & \text{otherwise}\end{cases}$$

wherein the β function is $\beta(x)=x^4(35-84x+70x^2-20x^3)$, x being the argument shown in each transition band above; ω is the frequency and n denotes the nth of the N modes;

let the approximation coefficient be $W_f(0,t)=\langle f,\phi_1\rangle=F^{-1}\big(\hat{f}(\omega)\,\overline{\hat{\phi}_1(\omega)}\big)$ and the detail coefficients be $W_f(n,t)=\langle f,\psi_n\rangle=F^{-1}\big(\hat{f}(\omega)\,\overline{\hat{\psi}_n(\omega)}\big)$, where $\hat{f}(\omega)$ is the spectrum of the preprocessed signal f(t), the overline denotes the complex conjugate, $\hat{f}$ and $\hat{\psi}_n$ denote the Fourier transforms of f and $\psi_n$, and $F^{-1}$ is the inverse Fourier transform;
in one embodiment, in step f:
the hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are selected and denoted frn[l], where n denotes the nth mode and l denotes the lth sampling point within a mode;
the general threshold is set to λn = (median(|fn|)/0.6745)·√(2 ln L), where L is the signal length after framing in the time domain and fn is the nth of the N modes obtained in step e;
frn[l] is calculated as follows:
frn[l] = fn[l] if |fn[l]| > λn, and frn[l] = 0 otherwise,
where n = 1, 2, …, N and l = 1, 2, 3, …, L.
In one embodiment, in step g:
the log power spectrum is calculated as follows:
E(n)(m) = log( |F(frn)(m)|² ), m = 1, …, M, where M is the number of spectral bins of the mode;
according to the logarithmic power spectrum, the dth cepstrum coefficient c(n)(d) of the nth mode is calculated as:
c(n)(d) = Σ_{m=1}^{M} E(n)(m) · cos( πd(m − 0.5)/M ), d = 1, …, D,
where D denotes the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficients, the speech information matrix is expressed as:
v = [c(1)(1), c(1)(2), …, c(N)(D)]
in one embodiment, in step h:
the attention mechanism model is represented as:
z = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),   S = σ( W2 · δ( W1 · z ) ),
wherein S represents the weight of each of the plurality of channels, δ() and σ() represent the ReLU activation function and the sigmoid activation function respectively, W1 and W2 are the coefficients of the fully connected layers in the attention mechanism model, H and W represent the number of rows and columns of the matrix, i and j index the ith row and jth column respectively, and u is the direct input of the attention mechanism model.
The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction divides the frequency intervals of the empirical wavelets on the mel scale, obtains effective voice characteristics through the empirical wavelet transform, and inputs them into a neural network to realize voiceprint recognition. The invention provides a voiceprint recognition method for scenes with noisy, noise-rich environments, and obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved: an attention mechanism is introduced to obtain the weight of each channel, improving the accuracy of signal recognition.
Drawings
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
FIG. 2 illustrates a flow diagram of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention;
fig. 3 shows a diagram of a neural network configuration according to an embodiment of the present invention.
Detailed Description
The detailed features and advantages of the present invention are described in the detailed description below in sufficient detail to enable anyone skilled in the art to understand and implement the technical content of the present invention; the related objects and advantages will be readily understood by those skilled in the art from the description, claims and drawings of this specification.
FIG. 1 is a diagram illustrating a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
The method mainly comprises a spatial filtering part, a wavelet denoising part and a voiceprint recognition part.
FIG. 2 shows a flowchart of a deep learning voiceprint recognition method based on multi-channel wavelet decomposition co-denoising according to an embodiment of the invention.
With reference to fig. 1 and 2, the method includes, but is not limited to, the following steps.
Step 101: acquiring signals of a plurality of channels by using a spiral microphone array;
step 102: and carrying out spatial filtering on the collected signals of the plurality of channels to obtain signals after spatial filtering.
In one embodiment, the spatial filtering comprises: carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and performing phase synchronization on all channel signals according to time delay values of the signals reaching different array units, and summing all the channel signals according to the weight ratio (because an incoming wave direction theta exists, the weight ratio of the signals received by each array unit is different) to obtain the signals after spatial filtering.
The incoming wave direction θ is obtained as follows:
A generalized cross-correlation operation is performed on the acquired multi-channel time-domain signals to obtain the time delay values τ of the signals arriving at the different array units. The incoming wave direction θ (i.e., the angle of arrival of the signal) is then obtained from the distance R between the array units, the sound velocity c and the time delay value τ:

θ = arcsin(cτ / R)
step 103: and preprocessing the signals after the spatial filtering to obtain primary signals. The preprocessing includes signal normalization, pre-emphasis, framing, windowing (e.g., window function selection hamming window), end point detection.
In one embodiment, the pre-emphasis process comprises: and carrying out pre-emphasis processing by adopting high-pass filtering, wherein the formula is as follows.
y(l)=x(l)-0.95x(l-1)
Where x (l) is the spatially filtered discrete signal and y (l) is the high-pass filtered signal, where l is the sample point.
In one embodiment, the framing process includes: the high-pass filtered signal is framed by a fixed length N, for example, each frame being 40ms in length.
In one embodiment, the windowing process comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y′(l) = y(l)·w(l), where w(l) = (1 − a) − a·cos(2πl/(N − 1)), 0 ≤ l ≤ N − 1, N is the speech sequence length, l is the sampling point, and the empirical value a is 0.46 (a Hamming window).
In one embodiment, the endpoint detection process employs a double threshold method: the short-term energy and the zero-crossing rate determine two thresholds of a double-threshold method, and when the signal simultaneously exceeds the two thresholds, the signal is considered to be in a speech stage.
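The preprocessing chain of steps 103 can be sketched as follows. This is a minimal illustration under stated assumptions — a Hamming-style window with a = 0.46 and a simple per-frame energy/zero-crossing test; `preprocess` and `is_speech` are illustrative names, not the patent's implementation.

```python
import numpy as np

def preprocess(x, fs, frame_ms=40, alpha=0.95, a=0.46):
    # Pre-emphasis y(l) = x(l) - 0.95 x(l-1), framing into fixed-length
    # frames, then multiplication by a Hamming-style window.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    N = int(fs * frame_ms / 1000)                     # samples per frame
    n_frames = len(y) // N
    frames = y[:n_frames * N].reshape(n_frames, N)
    l = np.arange(N)
    w = (1 - a) - a * np.cos(2 * np.pi * l / (N - 1))  # a = 0.46 -> Hamming
    return frames * w

def is_speech(frame, energy_thr, zcr_thr):
    # Double-threshold endpoint test: a frame counts as speech only when
    # short-time energy AND zero-crossing rate both exceed their thresholds.
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_thr and zcr > zcr_thr
```

At 16 kHz a 40 ms frame is 640 samples, so one second of audio yields 25 frames.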
Step 104: carrying out Fourier transform on each frame of the preprocessed signals f(t), and converting the actual frequency into the mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold value; determining the frequencies ω1, …, ωN corresponding to the maxima; taking (ωn + ωn+1)/2 as the boundary between adjacent intervals, where 1 < n < N − 1, thereby dividing the signal frequency range into N intervals; and then converting from the mel frequency back to the actual frequency.
the deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction converts signals into Mel frequency to divide frequency intervals to accord with auditory characteristics of human ears, divides the frequency intervals according to a maximum value to ensure that effective signals exist in each divided region, and increases the accuracy of extracted signal characteristics.
Step 105: the empirical wavelet transform is performed on the signals processed in step 104 according to the divided regions to obtain N modes, each interval corresponding to one mode. The scale function $\hat{\phi}_n(\omega)$ and wavelet function $\hat{\psi}_n(\omega)$ of the empirical wavelet are defined in the frequency domain as follows, $\gamma$ denoting the transition-band ratio:

$$\hat{\phi}_n(\omega)=\begin{cases}1, & |\omega|\le(1-\gamma)\omega_n\\ \cos\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_n}{2\gamma\omega_n}\right)\right], & (1-\gamma)\omega_n\le|\omega|\le(1+\gamma)\omega_n\\ 0, & \text{otherwise}\end{cases}$$

$$\hat{\psi}_n(\omega)=\begin{cases}1, & (1+\gamma)\omega_n\le|\omega|\le(1-\gamma)\omega_{n+1}\\ \cos\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_{n+1}}{2\gamma\omega_{n+1}}\right)\right], & (1-\gamma)\omega_{n+1}\le|\omega|\le(1+\gamma)\omega_{n+1}\\ \sin\!\left[\frac{\pi}{2}\,\beta\!\left(\frac{|\omega|-(1-\gamma)\omega_n}{2\gamma\omega_n}\right)\right], & (1-\gamma)\omega_n\le|\omega|\le(1+\gamma)\omega_n\\ 0, & \text{otherwise}\end{cases}$$

where the β function is $\beta(x)=x^4(35-84x+70x^2-20x^3)$, x being the argument shown in each transition band above; ω is the frequency and n represents the nth mode.

Let the approximation coefficient be $W_f(0,t)=\langle f,\phi_1\rangle=F^{-1}\big(\hat{f}(\omega)\,\overline{\hat{\phi}_1(\omega)}\big)$ and the detail coefficients be $W_f(n,t)=\langle f,\psi_n\rangle=F^{-1}\big(\hat{f}(\omega)\,\overline{\hat{\psi}_n(\omega)}\big)$, where $\hat{f}(\omega)$ is the spectrum of the preprocessed signal f(t), the overline denotes the complex conjugate, $\hat{f}$ and $\hat{\psi}_n$ denote the Fourier transforms of f and $\psi_n$, and $F^{-1}$ is the inverse Fourier transform.

The respective modes may be represented as:

$$f_1(t)=W_f(0,t)\ast\phi_1(t),\qquad f_n(t)=W_f(n-1,t)\ast\psi_{n-1}(t),\quad n=2,\dots,N.$$
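As a rough illustration of the band-splitting in step 105 — replacing the smooth β-transition filters with ideal rectangular masks, which is a simplification and not the patent's exact construction — the spectrum can be partitioned on the N intervals and each band inverted back to the time domain:

```python
import numpy as np

def ewt_modes(signal, fs, edges_hz):
    # Split `signal` into len(edges_hz)+1 modes by masking its spectrum on
    # the bands delimited by edges_hz. Rectangular masks stand in for the
    # smooth scale/wavelet filters of the empirical wavelet transform.
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    bounds = [0.0] + list(edges_hz) + [fs / 2 + 1.0]
    modes = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        modes.append(np.fft.irfft(spec * mask, n=len(signal)))
    return modes
```

Because the masks partition the spectrum, the modes sum back to the original signal, mirroring the perfect-reconstruction property of the true EWT filter bank.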
Step 106: carrying out noise reduction processing on each mode in a hard processing mode, performing a cross-correlation operation between each mode and the preprocessed signals, and selecting the modes whose cross-correlation values exceed a set threshold.
The hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are selected and denoted frn[l], where n denotes the nth mode and l denotes the lth sampling point within a mode.
The general threshold is set to λn = (median(abs(fn))/0.6745)·√(2 ln L), where L is the signal length after framing in the time domain, fn is the nth of the N modes obtained in step 105, abs() denotes the absolute value, and median() denotes the median.
frn[l] is calculated as follows:
frn[l] = fn[l] if abs(fn[l]) > λn, and frn[l] = 0 otherwise,
where n = 1, 2, …, N and l = 1, 2, 3, …, L.
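Step 106 can be sketched as follows, under the assumption that the "general threshold" is the universal threshold σ√(2 ln L) with σ estimated from the median absolute value, and that mode selection uses normalized cross-correlation against the preprocessed reference (function names are illustrative):

```python
import numpy as np

def hard_threshold(mode):
    # Zero samples whose magnitude falls below the universal threshold
    # lambda = sigma * sqrt(2 ln L), with sigma estimated from the median
    # absolute value (a common wavelet-denoising choice).
    L = len(mode)
    sigma = np.median(np.abs(mode)) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(L))
    out = mode.copy()
    out[np.abs(out) < lam] = 0.0
    return out

def keep_correlated(modes, reference, thr):
    # Keep only the modes whose normalized cross-correlation with the
    # preprocessed reference signal exceeds thr.
    kept = []
    for m in modes:
        denom = np.linalg.norm(m) * np.linalg.norm(reference)
        corr = np.abs(np.dot(m, reference)) / (denom + 1e-12)
        if corr > thr:
            kept.append(m)
    return kept
```

Modes dominated by noise correlate weakly with the reference and are dropped, while modes carrying speech energy survive both the thresholding and the selection.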
Step 107: the logarithmic power spectrum of each mode selected in step 106 is calculated, the dth cepstrum coefficient of the nth mode is computed from the logarithmic power spectrum, and the feature vector, i.e., the voice information matrix, is obtained from the cepstrum coefficients.
The log power spectrum is calculated as follows:
E(n)(m) = log( |F(frn)(m)|² ), m = 1, …, M, where M is the number of spectral bins of the mode;
according to the logarithmic power spectrum, the dth cepstrum coefficient c(n)(d) of the nth mode is calculated as:
c(n)(d) = Σ_{m=1}^{M} E(n)(m) · cos( πd(m − 0.5)/M ), d = 1, …, D,
where D denotes the total number of cepstrum coefficients in each mode;
from the cepstral coefficients, the speech information matrix (i.e., feature vector) is represented as:
v = [c(1)(1), c(1)(2), …, c(N)(D)]
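Step 107 can be sketched as follows, assuming the cepstrum coefficients come from a type-II-DCT-style projection of the log power spectrum (one common reading of this step; `cepstral_coeffs` and `feature_vector` are illustrative names):

```python
import numpy as np

def cepstral_coeffs(mode, D):
    # Log power spectrum of one mode, followed by a DCT-II-style projection
    # giving D cepstrum-like coefficients (the MFCC-style decorrelation step).
    power = np.abs(np.fft.rfft(mode)) ** 2
    log_power = np.log(power + 1e-12)       # small offset guards log(0)
    M = len(log_power)
    m = np.arange(M)
    coeffs = np.array([np.sum(log_power * np.cos(np.pi * d * (m + 0.5) / M))
                       for d in range(1, D + 1)])
    return coeffs

def feature_vector(modes, D):
    # Concatenate the D coefficients of every retained mode into the
    # speech-information vector v = [c1(1), ..., cN(D)].
    return np.concatenate([cepstral_coeffs(md, D) for md in modes])
```

With N retained modes and D coefficients per mode, the resulting vector has N·D entries.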
step 108: and inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model for identifying the object to which the voiceprint belongs.
The neural network is configured as shown in fig. 3: the voice information matrix is fed into convolutional and pooling layers, which extract low-dimensional features while reducing the spatial dimensions, with batch normalization (BN) layers between them to improve the generalization capability of the model. The weight of each channel is then obtained through a residual channel attention module. Finally, a fully connected layer identifies the identity of the tested person.
The attention mechanism model may be expressed as:
z = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),   S = σ( W2 · δ( W1 · z ) ),
where S represents the weight of each channel, δ() and σ() represent the ReLU and sigmoid activation functions respectively, W1 and W2 are the coefficients of the fully connected layers in the residual channel attention model, H and W represent the number of rows and columns of the matrix, i and j index the ith row and jth column respectively, and u is the direct input of the attention model.
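The channel-attention computation can be sketched in NumPy as a squeeze-and-excitation style block — global average pooling followed by two fully connected layers with ReLU and sigmoid. Shapes and names here are illustrative, not the patent's exact network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(u, W1, W2):
    # Squeeze-and-excitation style weights: global average pooling over the
    # H x W map of each channel, then two dense layers (ReLU, then sigmoid),
    # finally rescaling each channel of u by its learned weight.
    C, H, W = u.shape
    z = u.mean(axis=(1, 2))                 # squeeze: 1/(HW) sum_i sum_j u(i,j)
    s = sigmoid(W2 @ relu(W1 @ z))          # excitation: S = sigma(W2 d(W1 z))
    return u * s[:, None, None]             # per-channel rescaling
```

Because the sigmoid output lies in (0, 1), each channel is attenuated in proportion to its estimated importance.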
Unlike the prior art, which extracts traditional MFCC features from the signals as the input of a deep neural network, the deep learning voiceprint recognition technology based on multi-channel wavelet decomposition common noise reduction first enhances the effective signals through multi-channel spatial filtering, and uses the empirical wavelet transform to avoid the decomposition differences that arise in wavelet transforms from the choice of different basis wavelets. Computing cepstrum coefficients of the different modes reduces the influence of local noise on the global feature coefficients and greatly improves the robustness of the cepstral features to noise.
In summary, the voiceprint recognition method of the embodiments of the invention, applied to scenes with noisy, noise-rich environments, obtains the input feature matrix of the neural network through multi-angle (spatial and signal-domain) noise reduction. Meanwhile, the convolutional neural network is improved: an attention mechanism is introduced to obtain the weight of each channel, improving the accuracy of signal recognition.
The order of processing elements and sequences, the use of alphanumeric characters, or other designations in the present application is not intended to limit the order of the processes and methods in the present application, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
This application uses specific words to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
The terms and expressions which have been employed herein are used as terms of description and not of limitation. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that although the present invention has been described with reference to the current specific embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of the present invention, and various equivalent changes or substitutions may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the claims of the present application.
Claims (9)
1. A deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction is characterized by comprising the following steps:
a. acquiring signals of a plurality of channels by using a spiral microphone array;
b. carrying out spatial filtering on the collected signals of the plurality of channels to obtain spatially filtered signals;
c. preprocessing the signals after spatial filtering;
d. carrying out Fourier transform on each frame of the preprocessed signals; converting the actual frequency to the mel frequency; determining the maxima at which the signal amplitude exceeds a preset threshold value; determining the frequencies ω1, …, ωN corresponding to the maxima; taking (ωn + ωn+1)/2 as the boundary between adjacent intervals, where 1 < n < N − 1, thereby dividing the signal frequency range into N intervals; and then converting from the mel frequency back to the actual frequency;
e. performing wavelet transformation on the signals processed in the step d according to the N intervals to obtain N modes;
f. carrying out noise reduction on each of the N modes in a hard-thresholding manner, carrying out a cross-correlation operation between each mode and the preprocessed signals of step c, and selecting the modes whose cross-correlation values exceed a set threshold;
g. carrying out logarithmic power spectrum calculation and cepstrum coefficient calculation on the modes selected in step f, and obtaining a voice information matrix from the cepstrum coefficients;
h. inputting the obtained voice information matrix into a convolutional neural network based on an attention mechanism model, and identifying an object to which the voiceprint belongs;
wherein, in step f:
the hard processing mode is as follows: for each mode, the sampling points whose amplitude exceeds a general threshold are selected and denoted frn[l], wherein n denotes the nth mode and l denotes the lth sampling point within a mode;
the general threshold is set to λn = (median(|fn|)/0.6745)·√(2 ln L), where L is the signal length after framing in the time domain and fn is the nth of the N modes obtained in step e;
frn[l] is calculated as follows:
frn[l] = fn[l] if |fn[l]| > λn, and frn[l] = 0 otherwise,
wherein n = 1, 2, …, N and l = 1, 2, 3, …, L.
2. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein step b comprises:
carrying out spatial feature analysis on the acquired signals of the channels to confirm the incoming wave direction of the human voice, and adjusting the direction of the multi-channel spiral array according to the incoming wave direction to realize voice enhancement; and carrying out phase synchronization on the signals of the multiple channels according to the time delay values of the signals reaching different array units in the multi-channel spiral array, and summing the signals of the multiple channels according to the weight ratio to obtain the signals after spatial filtering.
3. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising as claimed in claim 2, wherein the incoming wave direction is obtained as follows:
performing a generalized cross-correlation operation on the acquired signals of the plurality of channels to obtain the time delay values τ of the signals arriving at the different array units;
solving the incoming wave direction θ from the distance R between array units in the multi-channel spiral array, the sound velocity c and the time delay value τ according to the following formula:

θ = arcsin(cτ / R)
4. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein the preprocessing of step c comprises the steps of:
pre-emphasis, framing, windowing, endpoint detection.
5. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 4, wherein:
the pre-emphasis step comprises: carrying out pre-emphasis processing according to the following formula by adopting a high-pass filtering mode to obtain a high-pass filtered signal;
y(l) = x(l) − 0.95x(l−1), where x(l) is the spatially filtered discrete signal, y(l) is the high-pass filtered signal, and l is the sampling point;
the framing step is as follows: framing the high-pass filtered signal according to a fixed length N;
the windowing step comprises: multiplying the framed signal by a window function w(l) to obtain the windowed signal y′(l) = y(l)·w(l), where w(l) = (1 − a) − a·cos(2πl/(N − 1)), 0 ≤ l ≤ N − 1, N is the length of the speech sequence, and the empirical value a is 0.46;
the step of end point detection adopts a double threshold method, namely two thresholds of the double threshold method are determined by adopting short-time energy and zero crossing rate, and when the windowed signal exceeds the two thresholds simultaneously, the signal is considered to be in a speech stage.
6. The method for deep learning voiceprint recognition based on multi-channel wavelet decomposition common denoising of claim 1, wherein the wavelet transform in step e is an empirical wavelet transform.
7. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 6, wherein in step e:
defining a scale function of the empirical wavelet transformSum wavelet functionIn the frequency domain the following:
wherein the expression of the beta function is beta (x) ═ x4(35-84x+70x2-20x3) X is inAndis replaced by the independent variable of the respective beta function;
where ω is frequency and N represents the nth of the N modes; the nth mode corresponds to the nth interval, ωnIndicating a frequency corresponding to the maximum value of the nth section;
let the approximation coefficient beDetail coefficient ofWherein the content of the first and second substances,as a function of the frequency spectrum of the preprocessed signal f (t),is composed ofThe complex conjugate of (a) and (b),as a function of scaleFor the expression of the 1 st modality,is composed ofThe complex conjugate of (a) to (b),respectively, represent the number of the symbols f,fourier transform of psi, F-1Is inverse Fourier transform;
the N modes are then represented as f₀(t) = W_f(0, t) ∗ φ₁(t) and f_n(t) = W_f(n, t) ∗ ψ_n(t) for n = 1, …, N − 1, where ∗ denotes convolution.
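A minimal numerical sketch of the empirical wavelet filter bank described in this claim, following Gilles' standard construction; the transition-band ratio gamma and the band edges used below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def ewt_beta(x):
    """Transition polynomial beta(x) = x^4 * (35 - 84x + 70x^2 - 20x^3),
    clamped to 0 below x=0 and to 1 above x=1."""
    return np.where(x <= 0, 0.0,
                    np.where(x >= 1, 1.0,
                             x ** 4 * (35 - 84 * x + 70 * x ** 2 - 20 * x ** 3)))

def ewt_scaling(omega, w1, gamma=0.1):
    """Empirical scale function phi_1(omega): low-pass up to w1,
    with a raised-cosine transition of relative width gamma."""
    a = np.abs(omega)
    out = np.zeros_like(a)
    out[a <= (1 - gamma) * w1] = 1.0
    trans = (a > (1 - gamma) * w1) & (a <= (1 + gamma) * w1)
    out[trans] = np.cos(np.pi / 2 *
                        ewt_beta((a[trans] - (1 - gamma) * w1) / (2 * gamma * w1)))
    return out

def ewt_wavelet(omega, wn, wn1, gamma=0.1):
    """Empirical wavelet psi_n(omega): band-pass between wn and wn1,
    cosine roll-off at the upper edge, sine roll-on at the lower edge."""
    a = np.abs(omega)
    out = np.zeros_like(a)
    out[(a >= (1 + gamma) * wn) & (a <= (1 - gamma) * wn1)] = 1.0
    up = (a > (1 - gamma) * wn1) & (a <= (1 + gamma) * wn1)
    out[up] = np.cos(np.pi / 2 *
                     ewt_beta((a[up] - (1 - gamma) * wn1) / (2 * gamma * wn1)))
    low = (a > (1 - gamma) * wn) & (a < (1 + gamma) * wn)
    out[low] = np.sin(np.pi / 2 *
                      ewt_beta((a[low] - (1 - gamma) * wn) / (2 * gamma * wn)))
    return out
```

By construction the filters form a tight partition: in the transition band around each ω_n, φ̂² and ψ̂² sum to one.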
8. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step g:
the logarithmic power spectrum of the nth mode is calculated as E⁽ⁿ⁾(m) = log|f̂_n(m)|², wherein f̂_n(m) is the mth spectral line of the nth mode f_n(t), m = 1, …, M;

according to the logarithmic power spectrum, the dth cepstrum coefficient c⁽ⁿ⁾(d) of the nth mode is calculated as

c⁽ⁿ⁾(d) = Σ_{m=1}^{M} E⁽ⁿ⁾(m)·cos[πd(m − 0.5)/M], d = 1, …, D,

wherein D represents the total number of cepstrum coefficients in each mode;
according to the cepstrum coefficient, the speech information matrix is expressed as:
v = [c⁽¹⁾(1), c⁽¹⁾(2), …, c⁽ᴺ⁾(D)].
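The feature assembly of step g can be sketched as follows. Since the patent's log-power-spectrum formula is given only as an image, the DCT-of-log-power-spectrum form used here is an assumption (the standard cepstral construction), and the FFT length is illustrative:

```python
import numpy as np

def mode_cepstrum(mode_signal, D):
    """Cepstrum coefficients of one mode: DCT of the log power spectrum.
    The exact spectral formula in the patent is not reproduced; this uses
    the conventional log|spectrum|^2 followed by a type-II DCT."""
    spec = np.fft.rfft(mode_signal)
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # small offset avoids log(0)
    M = len(log_power)
    m = np.arange(M)
    return np.array([np.sum(log_power * np.cos(np.pi * d * (m + 0.5) / M))
                     for d in range(1, D + 1)])

def speech_matrix(modes, D):
    """v = [c^(1)(1), ..., c^(N)(D)]: concatenate the D cepstrum
    coefficients of every selected mode into one feature vector."""
    return np.concatenate([mode_cepstrum(f, D) for f in modes])
```

For N selected modes the resulting vector has length N·D, matching the layout of v above.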
9. The deep learning voiceprint recognition method based on multi-channel wavelet decomposition common denoising of claim 1, wherein in step h:
the attention mechanism model is represented as:

S = σ(W₂·δ(W₁·z)), with z = (1/(H × W))·Σ_{i=1}^{H} Σ_{j=1}^{W} u(i, j),

wherein S represents the weight of each of the plurality of channels, δ(·) and σ(·) represent the ReLU and sigmoid activation functions respectively, W₁ and W₂ are the coefficients of the fully connected layers in the attention mechanism model, H and W represent the number of rows and columns of the matrix, i and j index the ith row and the jth column respectively, and u is the direct input of the attention mechanism model.
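The channel-attention computation described in this claim (global average pooling followed by two fully connected layers with ReLU and sigmoid, in the style of squeeze-and-excitation) can be sketched as follows; the tensor shapes and the reduction ratio are illustrative assumptions:

```python
import numpy as np

def channel_attention(u, W1, W2):
    """Squeeze-and-excitation style channel weights:
    z_c = (1/(H*W)) * sum_{i,j} u[c, i, j]   (global average pool per channel)
    S   = sigmoid(W2 @ relu(W1 @ z))          (two FC layers)"""
    z = u.mean(axis=(1, 2))                    # squeeze: shape (C,)
    h = np.maximum(0.0, W1 @ z)                # delta(): ReLU activation
    return 1.0 / (1.0 + np.exp(-(W2 @ h)))    # sigma(): sigmoid activation
```

The returned vector S lies in (0, 1) per channel and would rescale the feature maps before the convolutional layers that follow.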
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111480885.6A CN113903344B (en) | 2021-12-07 | 2021-12-07 | Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113903344A CN113903344A (en) | 2022-01-07 |
CN113903344B true CN113903344B (en) | 2022-03-11 |
Family
ID=79025559
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115249476A (en) * | 2022-07-15 | 2022-10-28 | 北京市燃气集团有限责任公司 | Intelligent linkage gas cooker based on voice recognition and intelligent linkage method |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107316653B (en) * | 2016-04-27 | 2020-06-26 | 南京理工大学 | Improved empirical wavelet transform-based fundamental frequency detection method |
CN106568607A (en) * | 2016-11-04 | 2017-04-19 | 东南大学 | Rub-impact sound emission fault diagnosis method based on empirical wavelet transformation |
US20200057932A1 (en) * | 2018-08-16 | 2020-02-20 | Gyrfalcon Technology Inc. | System and method for generating time-spectral diagrams in an integrated circuit solution |
CN111341307A (en) * | 2020-03-13 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN111640437A (en) * | 2020-05-25 | 2020-09-08 | 中国科学院空间应用工程与技术中心 | Voiceprint recognition method and system based on deep learning |
CN112001215B (en) * | 2020-05-25 | 2023-11-24 | 天津大学 | Text irrelevant speaker identity recognition method based on three-dimensional lip movement |
CN112712814A (en) * | 2020-12-04 | 2021-04-27 | 中国南方电网有限责任公司 | Voiceprint recognition method based on deep learning algorithm |
CN112784798B (en) * | 2021-02-01 | 2022-11-08 | 东南大学 | Multi-modal emotion recognition method based on feature-time attention mechanism |
CN112908341B (en) * | 2021-02-22 | 2023-01-03 | 哈尔滨工程大学 | Language learner voiceprint recognition method based on multitask self-attention mechanism |
CN113077795B (en) * | 2021-04-06 | 2022-07-15 | 重庆邮电大学 | Voiceprint recognition method under channel attention spreading and aggregation |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism cyclic neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address |
Address after: 311121 building 3, No.10 Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province Patentee after: Hangzhou Zhaohua Electronics Co.,Ltd. Address before: 311122 building 3, No. 10, Xianqiao Road, Zhongtai street, Yuhang District, Hangzhou City, Zhejiang Province Patentee before: CRY SOUND CO.,LTD. |