CN112967722A - Text-independent multi-source speaker identification method based on blind source separation - Google Patents
Text-independent multi-source speaker identification method based on blind source separation
Info
- Publication number
- CN112967722A (application CN202110137229.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- source
- matrix
- wavelet packet
- blind
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a text-independent multi-source speaker recognition method based on blind source separation, and relates to the technical field of voiceprint recognition. First, a segment of audio containing the voices of multiple persons is acquired, blind source separation and detection are performed on it with a blind source signal detection and separation algorithm, and the multi-source speech signal is separated into multiple single-source speech signals. Each separated single-source speech signal is then pre-emphasized, framed and windowed to obtain a time-series speech signal, which is decomposed and reconstructed with wavelet packets. A cochlear auditory filter is then used for human-ear feature filtering, and the speech features are extracted. Finally, a CNN model is constructed and the extracted speech features are input into it to realize multi-source speaker recognition. By combining wavelet packets with a Gammatone filter, the method achieves a high recognition rate in noisy environments.
Description
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a text-independent multi-source speaker recognition method based on blind source separation.
Background
Voice is a biometric trait; like other biometric characteristics it carries identity information and can be applied to identity authentication, information services, voice mail and other areas. Speech conveys not only the content spoken by the speaker but also speaker-specific information, which is the basis of voiceprint recognition. Voiceprint recognition is the process of automatically identifying a speaker from the personal information contained in the speech waveform. Speaker recognition in a cocktail-party environment is particularly challenging: the environment is noisy, and the voices of many speakers are mixed together.
The speaker recognition process can be divided into two parts: speech feature extraction and speaker model training. Most feature-extraction research targets the short-term spectral characteristics of the speech signal: the signal is decomposed into short frames of roughly 10-30 milliseconds, over which speech is approximately stationary, and spectral voiceprint features such as Mel-frequency cepstral coefficients and linear prediction cepstral coefficients are computed within each frame. Model training then fits a model to the extracted features; commonly used traditional models include vector quantization, dynamic time warping and Gaussian mixture models, while deep learning approaches apply deep neural networks with better performance, such as convolutional neural networks, to speaker recognition.
Disclosure of Invention
The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing a text-independent multi-source speaker identification method based on blind source separation that performs multi-source speaker identification in a cocktail-party environment.
To solve the above technical problem, the invention adopts the following technical scheme: a text-independent multi-source speaker recognition method based on blind source separation, in which multi-sound-source speech in a cocktail-party environment is separated and detected by a blind source signal detection and separation algorithm and the contained sound sources are separated; speech feature extraction is then performed on each sound source, namely the speech is subjected to a wavelet packet transform combined with a Gammatone filter for feature extraction, and the extracted features are passed through the deep learning model CNN to complete multi-source speaker recognition. The method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
firstly, centering and whitening the original mixed speech data matrix to obtain the whitened speech matrix and the whitening transformation matrix; then initializing a matrix W randomly and iteratively updating it with decorrelation processing to obtain the updated matrix W_new; finally, multiplying the updated matrix W_new, the whitening transformation matrix and the original mixed speech data matrix to separate the multi-source speech signal into multiple single-source speech signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
decomposing the time-series speech signal with wavelet packets so as to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components contained in the speech signal; the wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet coefficients; reconstructing the low-frequency and high-frequency speech signals after wavelet packet decomposition, wherein the time sequence of the reconstructed speech signal corresponds to the original time-domain information;
step 4: performing human-ear feature filtering on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and extracting the speech features;
passing the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear, and performing a short-time Fourier transform on the obtained speech feature vectors to obtain two-dimensional speech feature vectors, thereby completing the extraction of the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vector extracted in step 4 into a three-dimensional vector, and inputting the three-dimensional vector into the CNN model to realize multi-source speaker recognition;
the CNN model consists of 4 2D convolutional layers, 4 pooling layers, 2 full-connection layers and an output layer; the convolution kernel adopts a 3x 3 matrix; in each convolutional layer, the activation function ReLu is used; entering a pooling layer after each convolution layer operation; the strategy adopted by the pooling layer is maximum pooling, and the size of the pooling is 2 multiplied by 2; and outputting the probability of the class corresponding to the voice by using softmax as the activation function of the output layer.
The beneficial effect of the above technical scheme is as follows: the text-independent multi-source speaker recognition method based on blind source separation provided by the invention addresses the problem of low recognition accuracy in noisy environments; by combining wavelet packets with a Gammatone filter, it achieves a higher recognition rate under noise, which gives the method significant application value in real environments.
Drawings
Fig. 1 is a general architecture diagram of a text-independent multi-source speaker recognition method based on blind source separation according to an embodiment of the present invention.
Fig. 2 is a waveform diagram of an original single-source speech signal visualization provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating pre-emphasis effects of a single-source speech signal according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the effect of framing a single-source speech signal according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the effect of windowing a single-source speech signal according to an embodiment of the present invention;
fig. 6 is a diagram illustrating the effect of a single-source speech signal after three-level wavelet packet decomposition according to an embodiment of the present invention;
fig. 7 is a diagram illustrating the effect of the single-source speech signal filtered by the Gammatone filter according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In this embodiment, the software environment is a Windows 10 system, and the simulation environment is PyCharm 2018.3.3 x64.
In this embodiment, the overall architecture of the designed multi-source speaker recognition system is shown in Fig. 1. According to this architecture, the text-independent multi-source speaker recognition method based on blind source separation includes the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
Firstly, a segment of mixed multi-channel speech is input and opened with the wave module in the python language to obtain the mixed speech data matrix D. The matrix D is centered: the mean of each row is computed with the mean function of the numpy library, and this mean is subtracted from every row of D to obtain the centered matrix D_center. D_center is then whitened: the covariance matrix cov(D_center) is computed, its eigenvalues and eigenvectors are calculated, the eigenvalue vector is diagonalized into a diagonal matrix, the diagonal matrix is inverted, and the square root of the inverted diagonal matrix is multiplied by the transposed eigenvector matrix to obtain the whitening transformation matrix V. Multiplying D_center by V gives the whitened data matrix Z, which is processed with the FastICA algorithm: a random matrix W is first generated and decorrelated (the eigenvalues x and eigenvectors p of W·W^T are computed, and x is diagonalized and inverted to obtain div_x for the symmetric decorrelation of W). A maximum number of iterations is set and W is updated repeatedly as W_new = g(s)·Z^T − g′(s)·W′, where s = W′·Z, g(s) is the tanh() function of s, g′(s) is the derivative of g(s), W′ is W after each decorrelation update, and Z^T is the transpose of the whitened data matrix Z. From the resulting W_new, multiplying by the whitening transformation matrix V and the mixed speech data matrix D gives the estimate S_r of the reconstructed source signals, from which the n estimated single-source speech signals S1 = S_r[0, :], S2 = S_r[1, :], ..., Sn = S_r[n−1, :] are obtained.
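The following numpy sketch condenses the centering, whitening and FastICA iteration described above. The function name, the convergence test, and the use of sample averages in the update rule are assumptions made to give a runnable illustration of the same procedure, not details taken from the patent.

```python
import numpy as np

def separate_sources(D, n_iter=200, tol=1e-6):
    """Blind source separation of a mixed speech matrix D (channels x samples)
    by centering, whitening and a symmetric FastICA iteration with g = tanh."""
    D_center = D - D.mean(axis=1, keepdims=True)           # centering

    # Whitening: V = Lambda^(-1/2) * E^T from the covariance eigen-decomposition
    eigvals, eigvecs = np.linalg.eigh(np.cov(D_center))
    V = np.diag(eigvals ** -0.5) @ eigvecs.T
    Z = V @ D_center                                        # whitened data

    def decorrelate(W):
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
        s, u = np.linalg.eigh(W @ W.T)
        return u @ np.diag(s ** -0.5) @ u.T @ W

    n = Z.shape[0]
    W = decorrelate(np.random.rand(n, n))
    for _ in range(n_iter):
        s = W @ Z
        g, g_prime = np.tanh(s), 1.0 - np.tanh(s) ** 2
        W_new = decorrelate((g @ Z.T) / Z.shape[1] - np.diag(g_prime.mean(axis=1)) @ W)
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol:
            W = W_new
            break
        W = W_new

    S_r = W @ V @ D_center          # estimated single-source signals, one per row
    return S_r
```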
Step 2: speech preprocessing; each of the n single-source speech signals S1, S2, ..., Sn obtained in step 1 is preprocessed to obtain a time-series speech signal. Preprocessing comprises three stages: pre-emphasis, framing and windowing. Pre-emphasis compensates for the per-octave attenuation of the high-frequency part of speech caused by glottal excitation and oral-nasal radiation. The pre-emphasized speech signal is then divided into short-time stationary frames, which are numbered in time order; consecutive frames overlap to a certain extent so that no information between frames is lost. Finally, each frame is windowed: a Hamming window function is slid over the frame-length speech signal, and the preprocessed speech is kept in the form of a time-series discrete signal to facilitate decomposition and feature extraction.
In this embodiment, the implementation parameters of the pre-emphasis, framing and windowing stages are as follows:
Pre-emphasis: the speech data are read with wavfile in python to obtain the audio sampling rate and a numpy array. The one-dimensional speech array is pre-emphasized according to the formula y(t) = x(t) − α·x(t−1) with pre-emphasis coefficient α = 0.97, where y(t) is the pre-emphasized single-source speech sample at time t, x(t) is the single-source speech sample at time t, and x(t−1) is the single-source speech sample at time t−1. For the original single-source speech waveform shown in Fig. 2, the pre-emphasized waveform is shown in Fig. 3.
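As a minimal sketch of this pre-emphasis stage (assuming scipy.io.wavfile is the wavfile module referred to above, and an illustrative file name), the formula y(t) = x(t) − α·x(t−1) can be applied as follows:

```python
import numpy as np
from scipy.io import wavfile   # assumed to be the "wavfile" module used above

rate, x = wavfile.read("single_source.wav")   # file name is illustrative
x = x.astype(np.float64)

alpha = 0.97                                  # pre-emphasis coefficient
# y(t) = x(t) - alpha * x(t-1); the first sample is kept unchanged
y = np.append(x[0], x[1:] - alpha * x[:-1])
```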
Framing: the whole pre-emphasized speech signal is divided into frames using numpy in python; each frame is 30 ms long, and consecutive frames overlap by 2 ms. The result after framing is shown in Fig. 4.
Windowing: to keep the framed speech data continuous during processing, a sliding-window technique based on the Hamming window formula w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1, is adopted, where w(n) is the sliding-window factor, N is the length of each frame after framing, and n is the sample index within the frame; each frame is multiplied by the Hamming window. The effect after windowing is shown in Fig. 5.
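A short sketch of the framing and Hamming-windowing stages under the parameters above (30 ms frames with a 2 ms overlap); the helper name and the use of numpy's built-in np.hamming are assumptions, not part of the original description:

```python
import numpy as np

def frame_and_window(y, rate, frame_ms=30, overlap_ms=2):
    """Split the pre-emphasised signal y into overlapping frames and apply
    a Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame."""
    frame_len = int(rate * frame_ms / 1000)           # N, samples per frame
    step = frame_len - int(rate * overlap_ms / 1000)  # hop size with 2 ms overlap
    n_frames = 1 + max(0, (len(y) - frame_len) // step)
    frames = np.stack([y[i * step: i * step + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)             # windowed frames, shape (n_frames, N)
```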
Step 3: wavelet packet decomposition and reconstruction of the time-series signal; because the speech signal is non-stationary and contains both low-frequency and high-frequency components, the invention adopts wavelet packets to decompose the time-series speech signal and to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components it contains. The wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet coefficients. The low-frequency and high-frequency speech signals after wavelet packet decomposition are reconstructed, and the time sequence of the reconstructed speech signal corresponds to the original time-domain information.
In this embodiment, the third-party python library pywt is used to perform wavelet packet decomposition on the speech data preprocessed in step 2. The WaveletPacket function of the pywt library is used, the input data are the preprocessed speech data, the padding mode is symmetric, and a three-level db6 wavelet packet decomposition is performed; the resulting tree structure is shown in Fig. 6. The decomposed wavelet packet nodes, each identified by its level and coefficient data, are then traversed and reconstructed with the reconstruct function to obtain speech signal data whose high-frequency and low-frequency components are thoroughly decomposed.
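A hedged sketch of this step with pywt is shown below: a three-level db6 WaveletPacket with symmetric mode is built for one preprocessed frame, and each leaf node is reconstructed into its own sub-band signal. The per-node reconstruction strategy and the helper name are assumptions about how the description above maps onto the pywt API.

```python
import numpy as np
import pywt

def wp_decompose_reconstruct(frame, wavelet="db6", level=3):
    """Three-level db6 wavelet packet decomposition of one windowed frame,
    followed by reconstruction of each leaf node into a sub-band signal."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    bands = []
    for node in wp.get_level(level, order="freq"):    # leaf nodes of the binary tree
        sub = pywt.WaveletPacket(data=None, wavelet=wavelet,
                                 mode="symmetric", maxlevel=level)
        sub[node.path] = node.data                    # keep only this node's coefficients
        bands.append(sub.reconstruct(update=False)[:len(frame)])
    return np.stack(bands)                            # shape: (2**level, frame_len)
```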
Step 4: human-ear feature filtering is performed on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and the speech features are extracted. The speech signal obtained after wavelet packet decomposition and reconstruction in step 3 is passed through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear; the Gammatone filter is a standard auditory filter that conforms to the characteristics of the human cochlea. A short-time Fourier transform (STFT) is then applied to the obtained speech feature vectors to obtain two-dimensional speech feature vectors, completing the extraction of the speech features.
the Gamma atom filter is shown in the following formula:
h(t) = c·t^(l−1)·e^(−2πbt)·cos(2πf_i·t + φ)
where c is a tuning proportionality constant, l is the filter order (usually 4) and is a positive integer, b is the attenuation factor that determines how fast the filter decays, f_i is the center frequency of the filter, and φ is the phase, which can generally be omitted. The relation between the attenuation factor and the bandwidth is b = 1.019·ERB(f_i), where ERB(f_i) is the equivalent rectangular bandwidth: ERB(f_i) = 24.7·(4.37·f_i/1000 + 1).
in this embodiment, a visualization effect graph of the single-source speech signal after passing through the Gammatone filter is shown in fig. 7.
Step 5: a CNN model is constructed, the two-dimensional speech feature vector extracted in step 4 is converted into a three-dimensional vector, and the three-dimensional vector is input into the CNN model to realize multi-source speaker recognition.
in order to effectively train and predict the voice characteristic information irrelevant to the text of the speaker, the method of the invention combines the designed characteristic extraction method and adopts the convolutional neural network to design the deep learning and recognition network of the speaker, as shown in table 1. In the network structure, the network structure is composed of 4 2D convolutional layers (Conv2D _1-Conv2D _4), 4 pooling layers (Pool1-Pool4), 2 full-connection layers (Dense _1, Dense _2) and an output layer, and the convolutional layers adopt a 3 × 3 matrix. In each convolutional layer, the activation function ReLu is used. After each convolution layer operation, entering a pooling layer, wherein the strategy adopted by the pooling layer is a maximum pooling strategy, and the pooling size is 2 multiplied by 2. And the output layer outputs the probability of the corresponding class of the voice by adopting a softmax activation function.
TABLE 1 CNN-based speaker recognition deep learning recognition network structure parameters
| Layer | CNN parameter |
|---|---|
| Conv2d_1 | [3*3, 16] |
| Pool1 | 2*2, maxpool, stride 2 |
| Conv2d_2 | [3*3, 32] |
| Pool2 | 2*2, maxpool, stride 2 |
| Conv2d_3 | [3*3, 64] |
| Pool3 | 2*2, maxpool, stride 2 |
| Conv2d_4 | [3*3, 128] |
| Pool4 | 2*2, maxpool, stride 2 |
| Dense_1 | 5072*278 |
| Dense_2 | 278*69 |
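A hedged Keras sketch of the Table 1 network is given below. The input shape, the treatment of Dense_2 (278→69) as the softmax output layer, and the use of "same" padding are interpretations of the table rather than details stated in the patent; reshaping the 2-D features with an added channel axis (e.g. feats[..., np.newaxis]) gives the three-dimensional input mentioned in step 5.

```python
from tensorflow.keras import layers, models

def build_speaker_cnn(input_shape=(64, 64, 1), n_speakers=69):
    """Sketch of the Table 1 architecture: four 3x3 Conv2D blocks (16/32/64/128
    filters) with 2x2 max pooling (stride 2), a 278-unit dense layer and a
    softmax output over the speaker classes."""
    return models.Sequential([
        layers.Input(shape=input_shape),                  # input shape is an assumption
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), strides=2),
        layers.Flatten(),
        layers.Dense(278, activation="relu"),             # Dense_1 in Table 1
        layers.Dense(n_speakers, activation="softmax"),   # Dense_2 / output layer
    ])
```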
In conclusion, the method of the invention extracts speech features with wavelet packets and a Gammatone filter after blind source separation, and meets the requirement of a high recognition rate for multi-source speaker recognition in noisy environments. It solves the problem that, in a cocktail-party environment, the recognition rate is low and multiple speakers cannot be distinguished.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (6)
1. A text-independent multi-source speaker identification method based on blind source separation, characterized in that: for multi-sound-source speech in a cocktail-party environment, the mixed speech is separated and detected by a blind source signal detection and separation algorithm and the contained sound sources are separated; speech feature extraction is then performed on each sound source, namely the speech is subjected to a wavelet packet transform combined with a Gammatone filter for feature extraction; the extracted features are then passed through the deep learning model CNN to complete multi-source speaker recognition.
2. The text-independent multi-source speaker recognition method based on blind source separation according to claim 1, characterized in that the recognition method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
step 4: performing human-ear feature filtering on the speech signal after wavelet packet decomposition and reconstruction by using a cochlear auditory filter, and extracting the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vector extracted in step 4 into a three-dimensional vector, and inputting the three-dimensional vector into the CNN model to realize multi-source speaker recognition.
3. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 1 specifically comprises:
firstly, centering and whitening the original mixed speech data matrix to obtain the whitened speech matrix and the whitening transformation matrix; then initializing a matrix W randomly and iteratively updating it with decorrelation processing to obtain the updated matrix W_new; and finally, multiplying the updated matrix W_new, the whitening transformation matrix and the original mixed speech data matrix to separate the multi-source speech signal into multiple single-source speech signals.
4. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 3 specifically comprises:
decomposing the time-series speech signal with wavelet packets so as to perform time-frequency localized processing and analysis on the low-frequency and high-frequency components contained in the speech signal, wherein the wavelet packet decomposition follows a complete optimal binary-tree structure and each time-frequency node corresponds to a set of wavelet packet coefficients; and reconstructing the low-frequency and high-frequency speech signals after wavelet packet decomposition, the time sequence of the reconstructed speech signal corresponding to the original time-domain information.
5. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that step 4 specifically comprises:
passing the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 through a bank of Gammatone filters to obtain speech feature vectors consistent with the physiological characteristics of the human ear, and performing a short-time Fourier transform on the obtained speech feature vectors to obtain two-dimensional speech feature vectors, thereby completing the extraction of speech features.
6. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that in step 5 the CNN model consists of 4 2D convolutional layers, 4 pooling layers, 2 fully connected layers and an output layer; the convolution kernels are 3×3 matrices; each convolutional layer uses the ReLU activation function and is followed by a pooling layer; the pooling layers use max pooling with a pooling size of 2×2; and softmax is used as the activation function of the output layer to output the probability of the class corresponding to the speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110137229.XA CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110137229.XA CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112967722A true CN112967722A (en) | 2021-06-15 |
Family
ID=76272715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110137229.XA Pending CN112967722A (en) | 2021-02-01 | 2021-02-01 | Text-independent multi-source speaker identification method based on blind source separation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112967722A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727329A (en) * | 2024-02-07 | 2024-03-19 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180299527A1 (en) * | 2015-12-22 | 2018-10-18 | Huawei Technologies Duesseldorf Gmbh | Localization algorithm for sound sources with known statistics |
CN109584900A (en) * | 2018-11-15 | 2019-04-05 | 昆明理工大学 | A kind of blind source separation algorithm of signals and associated noises |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
-
2021
- 2021-02-01 CN CN202110137229.XA patent/CN112967722A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180299527A1 (en) * | 2015-12-22 | 2018-10-18 | Huawei Technologies Duesseldorf Gmbh | Localization algorithm for sound sources with known statistics |
CN109584900A (en) * | 2018-11-15 | 2019-04-05 | 昆明理工大学 | A kind of blind source separation algorithm of signals and associated noises |
CN111199741A (en) * | 2018-11-20 | 2020-05-26 | 阿里巴巴集团控股有限公司 | Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium |
Non-Patent Citations (2)
Title |
---|
XU XIAOMENG et al.: "Noise-robust voiceprint recognition algorithm based on full-frequency wavelet packet decomposition", Journal of Shenzhen University Science and Engineering *
ZHU JIA et al.: "Research on an automatic speaker identification method based on independent component analysis", Instrumentation and Analysis Monitoring *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727329A (en) * | 2024-02-07 | 2024-03-19 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
CN117727329B (en) * | 2024-02-07 | 2024-04-26 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Weninger et al. | Single-channel speech separation with memory-enhanced recurrent neural networks | |
CN107845389A (en) | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
US20230317056A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
Strauss et al. | A flow-based neural network for time domain speech enhancement | |
Adiga et al. | Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN. | |
Geng et al. | End-to-end speech enhancement based on discrete cosine transform | |
CN111816200B (en) | Multi-channel speech enhancement method based on time-frequency domain binary mask | |
Islam et al. | Supervised single channel dual domains speech enhancement using sparse non-negative matrix factorization | |
Do et al. | Speech Separation in the Frequency Domain with Autoencoder. | |
Fazel et al. | Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition | |
CN106653004A (en) | Speaker identification feature extraction method for sensing speech spectrum regularization cochlear filter coefficient | |
Saleem et al. | Unsupervised speech enhancement in low SNR environments via sparseness and temporal gradient regularization | |
Islam et al. | Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask | |
CN114360571A (en) | Reference-based speech enhancement method | |
CN112967722A (en) | Text-independent multi-source speaker identification method based on blind source separation | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
Zhao et al. | An Improved Speech Enhancement Method based on Teager Energy Operator and Perceptual Wavelet Packet Decomposition. | |
Baby et al. | Speech dereverberation using variational autoencoders | |
Singh et al. | Speech enhancement for Punjabi language using deep neural network | |
Sun et al. | Enhancement of Chinese speech based on nonlinear dynamics | |
Shu-Guang et al. | Isolated word recognition in reverberant environments | |
Amarjouf et al. | Denoising esophageal speech using combination of complex and discrete wavelet transform with wiener filter and time dilated Fourier Cepstra | |
Sharma et al. | Self-supervision and learnable strfs for age, emotion, and country prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210615 |