CN112967722A - Text-independent multi-source speaker identification method based on blind source separation - Google Patents

Text-independent multi-source speaker identification method based on blind source separation

Info

Publication number
CN112967722A
CN112967722A (application CN202110137229.XA)
Authority
CN
China
Prior art keywords
voice
source
matrix
wavelet packet
blind
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110137229.XA
Other languages
Chinese (zh)
Inventor
谭振华
徐晓梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
CERNET Corp
Original Assignee
Northeastern University China
CERNET Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China, CERNET Corp filed Critical Northeastern University China
Priority to CN202110137229.XA priority Critical patent/CN112967722A/en
Publication of CN112967722A publication Critical patent/CN112967722A/en
Pending legal-status Critical Current

Classifications

    • G10L21/0208 Noise filtering (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L17/02 Speaker identification or verification techniques: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/18 Speaker identification or verification techniques: artificial neural networks; connectionist approaches
    • G10L17/20 Speaker identification or verification techniques: pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L2021/02087 Noise filtering where the noise is separate speech, e.g. cocktail party

Abstract

The invention provides a text-independent multi-source speaker recognition method based on blind source separation, and relates to the technical field of voiceprint recognition. First, a sound recording containing the voices of multiple persons is acquired, and a blind source signal detection and separation algorithm is used to separate the multi-source speech signal into multiple single-source speech signals. Each separated single-source speech signal is then pre-emphasized, framed and windowed to obtain a time-series speech signal, which is decomposed and reconstructed with wavelet packets. A cochlear auditory filter is then applied to perform human-ear feature filtering and extract the speech features. Finally, a CNN model is constructed and the extracted speech features are fed into it to realize multi-source speaker recognition. By combining wavelet packets with a Gammatone filter, the method maintains a high recognition rate in noisy environments.

Description

Text-independent multi-source speaker identification method based on blind source separation
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a text-independent multi-source speaker recognition method based on blind source separation.
Background
Speech is a form of biometric identification; like other biometric features it carries corresponding information and can be applied to identity authentication, information services, voice mail and similar scenarios. From speech one can obtain not only the content of what is said but also information about who is speaking, which is the task of voiceprint recognition. Voiceprint recognition is the process of automatically identifying a speaker from the personal information contained in the speech waveform. Speaker identification in a cocktail-party environment is particularly challenging, first of all because such an environment is noisy and the voices of many speakers are mixed together.
The speaker recognition process can be divided into two parts: speech feature extraction and speaker model training. In speech feature extraction, most research focuses on the short-term spectral characteristics of the speech signal: the signal is decomposed into short frames of about 10-30 milliseconds, during which the speech signal is most stationary, and spectral voiceprint features such as Mel cepstral coefficients and linear cepstral coefficients are studied within each frame. Model training operates on the extracted features; commonly used traditional models include vector quantization, dynamic time warping and Gaussian mixture models, while deep learning approaches employ deep neural networks, with models such as convolutional neural networks having been applied to speaker recognition with good results.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a text-independent multi-source speaker identification method based on blind source separation, so that multi-source speaker identification can be performed in a cocktail-party environment.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a text-independent multi-source speaker recognition method based on blind source separation, in which multi-source speech in a cocktail-party environment is detected and separated by a blind source signal detection and separation algorithm into the individual sound sources it contains; speech feature extraction is then performed on each source, namely the speech is transformed with wavelet packets and combined with a Gammatone filter for feature extraction, and the extracted features are passed through a deep learning model (CNN) to complete multi-source speaker recognition. The method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
Firstly, the original mixed speech data matrix is normalized (centered) and whitened to obtain a whitening-transformed speech matrix; then a matrix W is initialized randomly and decorrelated at each iteration to obtain an updated matrix W_new; finally, the original mixed speech data matrix, the whitening-transformed speech matrix and the updated matrix W_new are multiplied together to separate the multi-source speech signal into multiple single-source speech signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
the time-series speech signal is decomposed with wavelet packets so that the low-frequency and high-frequency components contained in the speech signal can be processed and analysed with time-frequency localization; the wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet frequency coefficients; the low-frequency and high-frequency speech signals obtained by the wavelet packet decomposition are then reconstructed, and the time sequence of the reconstructed speech signal corresponds to the original time-domain information;
step 4: performing human-ear feature filtering on the wavelet-packet-decomposed and reconstructed speech signal with a cochlear auditory filter, and extracting the speech features;
the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 is passed through a bank of Gammatone filters to obtain speech feature vectors that conform to the physiology of the human ear; a short-time Fourier transform is applied to the resulting feature vectors to obtain two-dimensional speech feature vectors, completing the extraction of the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vectors extracted in step 4 into three-dimensional vectors, and inputting these into the CNN model to realize multi-source speaker recognition;
the CNN model consists of four 2D convolutional layers, four pooling layers, two fully connected layers and an output layer; the convolution kernels are 3×3 matrices; the ReLU activation function is used in each convolutional layer; each convolutional layer is followed by a pooling layer; the pooling layers use max pooling with a 2×2 pooling size; and the output layer uses softmax as its activation function to output the probability of the class corresponding to the speech.
The beneficial effect of the above technical scheme is as follows: the text-independent multi-source speaker recognition method based on blind source separation provided by the invention addresses the problem of low recognition accuracy in noisy environments; by combining wavelet packets with a Gammatone filter it achieves a higher recognition rate under noise and therefore has significant application value in real environments.
Drawings
Fig. 1 is a general architecture diagram of a text-independent multi-source speaker recognition method based on blind source separation according to an embodiment of the present invention.
Fig. 2 is a waveform diagram of an original single-source speech signal visualization provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating pre-emphasis effects of a single-source speech signal according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the effect of framing a single-source speech signal according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the effect of windowing a single-source speech signal according to an embodiment of the present invention;
fig. 6 is a diagram illustrating an effect of a single-source speech signal after decomposition of three-layer wavelet packets according to an embodiment of the present invention;
fig. 7 is a diagram illustrating the effect of the single-source speech signal filtered by the Gammatone filter according to the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In this embodiment, the software environment is Windows 10, and the simulation environment is PyCharm 2018.3.3 x64.
In this embodiment, a total architecture of the designed multi-source speaker recognition is shown in fig. 1, and a text-independent multi-source speaker recognition method based on blind source separation according to the architecture diagram includes the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
Firstly, a mixed multi-channel speech signal is taken as input; the input speech file is opened with the wave function in the Python language to obtain a mixed speech data matrix D, and D is centered: the mean of each row of D is computed with the mean function of the numpy library and subtracted from every element of that row, giving the centered matrix D_center. D_center is then whitened: the covariance matrix cov(D_center) is computed, its eigenvalues and eigenvectors are calculated, the eigenvalue vector is diagonalized into a diagonal matrix, this diagonal matrix is inverted, and the square root of the inverted diagonal matrix is multiplied by the transposed eigenvector matrix to obtain the whitening transformation matrix V. Multiplying D_center by V gives the whitened data matrix Z, which is processed with the FastICA algorithm: a random matrix W is first generated and decorrelated (the eigenvalues x and eigenvectors p of W·W^T are computed, and x is diagonalized, inverted and square-rooted to obtain div_x = diag(x)^(−1/2)), so that after decorrelation
W' = p · div_x · p^T · W
A maximum number of iterations is then set and W' is updated repeatedly according to
W_new = g(s)·Z^T − g'(s)·W'
where s = W'·Z, g(s) denotes the tanh() function applied to s, g'(s) denotes the corresponding probability density (derivative) term, W' denotes the decorrelated matrix after each update, and Z^T is the transpose of the whitened data matrix Z. From the resulting W_new, the estimate S_r of the reconstructed mixed source signals is obtained by multiplying W_new with the whitening transformation matrix V and the mixed speech data matrix D, from which the n estimated single-source speech signals S1 = S_r[0,:], S2 = S_r[1,:], ..., Sn = S_r[n,:] are obtained.
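For illustration, the centering, whitening, symmetric decorrelation and fixed-point iteration described above can be sketched with NumPy roughly as follows; this is only a minimal sketch of the described procedure, not the code of the embodiment, and the function name, iteration limit and tolerance are assumptions.

    import numpy as np

    def fastica_separate(D, n_iter=200, tol=1e-6):
        """D: mixed speech matrix of shape (n_channels, n_samples)."""
        # Centering: subtract the per-row (per-channel) mean
        D_center = D - D.mean(axis=1, keepdims=True)

        # Whitening: V = diag(eigvals)^(-1/2) @ eigvecs.T of the covariance matrix
        eigvals, eigvecs = np.linalg.eigh(np.cov(D_center))
        V = np.diag(eigvals ** -0.5) @ eigvecs.T
        Z = V @ D_center                        # whitened data matrix

        def decorrelate(W):
            # Symmetric decorrelation: W <- p @ diag(x)^(-1/2) @ p.T @ W
            x, p = np.linalg.eigh(W @ W.T)
            return p @ np.diag(x ** -0.5) @ p.T @ W

        n = Z.shape[0]
        W = decorrelate(np.random.randn(n, n))  # random initial unmixing matrix
        for _ in range(n_iter):
            s = W @ Z
            g, g_prime = np.tanh(s), 1.0 - np.tanh(s) ** 2
            W_new = decorrelate((g @ Z.T) / Z.shape[1] - np.diag(g_prime.mean(axis=1)) @ W)
            done = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1)) < tol
            W = W_new
            if done:
                break

        # Reconstruct the source estimates: each row is one single-source signal,
        # i.e. S1 = S[0, :], S2 = S[1, :], ...
        return W @ V @ D_center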
Step 2: speech feature preprocessing. The n single-source speech signals S1, S2, ..., Sn obtained in step 1 are each preprocessed to obtain time-series speech signals; the preprocessing comprises three stages, namely pre-emphasis, framing and windowing. Pre-emphasis compensates the attenuation of each octave of the high-frequency part of speech caused by glottal excitation and oral-nasal radiation. The pre-emphasized speech signal is then divided into short frames to obtain short-time stationary speech signal frames, which are numbered in time order, with a certain overlap between consecutive frame signals so that no information between frames is lost. Finally, the frame signals are windowed: a Hamming window function is slid over each frame-length speech segment, and the preprocessed speech signal is taken in the form of a time-series discrete signal to facilitate decomposition and feature extraction.
In this embodiment, the specific implementation parameters of the three links of pre-emphasis, framing and windowing are as follows:
Pre-emphasis: the speech data is read with wavfile in Python to obtain the audio sampling rate and a numpy array; the resulting one-dimensional speech array is pre-emphasized according to the formula y(t) = x(t) − α·x(t−1) with pre-emphasis coefficient α = 0.97, where y(t) is the pre-emphasized single-source speech sample at time t, x(t) is the single-source speech sample at time t, and x(t−1) is the sample at time t−1. In this embodiment, for the original single-source speech waveform shown in fig. 2, the pre-emphasized speech waveform is shown in fig. 3.
Framing: the whole voice is divided into frames, and the voice signal after the voice signal is pre-emphasized is processed by using python numpy data analysis, wherein the parameter of each frame is 30ms, certain overlap is performed between continuous frame signals, the overlap time is 2ms, and the result after the frame division is shown in fig. 4.
Windowing: in order to ensure that the voice data after the frame division is processed continuously, a sliding window technology is adopted according to a Hamming window formula
Figure BDA0002927446880000041
Wherein, w (N) represents the sliding window factor, N represents the total frame length of the voice data, and N represents the length of each frame after framing, and Hamming window processing is carried out. The effect after windowing is shown in fig. 5.
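As a worked example of the three preprocessing stages (pre-emphasis with α = 0.97, 30 ms frames with 2 ms overlap, Hamming windowing), a small NumPy sketch might look as follows; the function name and the single-channel handling are illustrative assumptions.

    import numpy as np
    from scipy.io import wavfile

    def preprocess(path, alpha=0.97, frame_ms=30, overlap_ms=2):
        rate, x = wavfile.read(path)
        x = x.astype(np.float64)
        if x.ndim > 1:                    # keep one channel for this sketch
            x = x[:, 0]

        # Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
        y = np.append(x[0], x[1:] - alpha * x[:-1])

        # Framing: 30 ms frames, consecutive frames overlapping by 2 ms
        frame_len = int(rate * frame_ms / 1000)
        step = frame_len - int(rate * overlap_ms / 1000)
        n_frames = 1 + (len(y) - frame_len) // step
        frames = np.stack([y[i * step: i * step + frame_len] for i in range(n_frames)])

        # Windowing: multiply every frame by a Hamming window
        frames *= np.hamming(frame_len)
        return rate, frames               # frames has shape (n_frames, frame_len)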
Step 3: wavelet packet decomposition and reconstruction of the time-series signal. The invention decomposes the time-series speech signal with wavelet packets and performs time-frequency localized processing and analysis of the low-frequency and high-frequency components contained in the speech signal; the wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet frequency coefficients; the low-frequency and high-frequency speech signals obtained by the decomposition are then reconstructed, and the time sequence of the reconstructed speech signal corresponds to the original time-domain information.
in this embodiment, a third-party library pywt existing in a python library is used to perform wavelet packet decomposition on the voice data preprocessed in step 2, a wavepacket function in the pywt library is used, the input data is the voice data preprocessed, a wavelet packet model is used as symmetric to perform three-layer db6 wavelet packet decomposition, and a tree structure is shown in fig. 6. Then, the decomposed wavelet packets are respectively layer number and data, and the data are subjected to traversal decomposition and are reconstructed by adopting a reconstruct function to obtain voice signal data with high frequency and low frequency which are decomposed thoroughly.
Step 4: human-ear feature filtering is performed on the wavelet-packet-decomposed and reconstructed speech signal with a cochlear auditory filter, and the speech features are extracted. The speech signal obtained after wavelet packet decomposition and reconstruction in step 3 is passed through a bank of Gammatone filters to obtain speech feature vectors that conform to the physiology of the human ear; the Gammatone filter is a standard auditory filter that matches the characteristics of the human cochlea. A short-time Fourier transform (STFT) is applied to the resulting feature vectors to obtain two-dimensional speech feature vectors, completing the extraction of the speech features.
the Gamma atom filter is shown in the following formula:
h(t)=ctl-1e-2πbt cos(2πfit+φ)
where c is the tuning proportionality constant, l is the number of filter stages (usually 4), b is the attenuation factor that determines the attenuation speed of the filter, and is a positive integer, fiIs the center frequency of the filter, phi is the function phase, and can be generally omitted; the relation between attenuation factor and bandwidth is b-1.019 ERB (f)i) Wherein ERB (f)i) Equivalent rectangular bandwidth:
Figure BDA0002927446880000051
in this embodiment, a visualization effect graph of the single-source speech signal after passing through the Gammatone filter is shown in fig. 7.
Step 5: a CNN model is constructed, the two-dimensional speech feature vectors extracted in step 4 are converted into three-dimensional vectors, and these are input into the CNN model to realize multi-source speaker recognition.
in order to effectively train and predict the voice characteristic information irrelevant to the text of the speaker, the method of the invention combines the designed characteristic extraction method and adopts the convolutional neural network to design the deep learning and recognition network of the speaker, as shown in table 1. In the network structure, the network structure is composed of 4 2D convolutional layers (Conv2D _1-Conv2D _4), 4 pooling layers (Pool1-Pool4), 2 full-connection layers (Dense _1, Dense _2) and an output layer, and the convolutional layers adopt a 3 × 3 matrix. In each convolutional layer, the activation function ReLu is used. After each convolution layer operation, entering a pooling layer, wherein the strategy adopted by the pooling layer is a maximum pooling strategy, and the pooling size is 2 multiplied by 2. And the output layer outputs the probability of the corresponding class of the voice by adopting a softmax activation function.
TABLE 1 Structure parameters of the CNN-based speaker recognition deep learning network
Layer      CNN parameters
Conv2d_1   [3*3, 16]
Pool1      2*2, maxpool, stride 2
Conv2d_2   [3*3, 32]
Pool2      2*2, maxpool, stride 2
Conv2d_3   [3*3, 64]
Pool3      2*2, maxpool, stride 2
Conv2d_4   [3*3, 128]
Pool4      2*2, maxpool, stride 2
Dense_1    5072*278
Dense_2    278*69
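A hedged Keras sketch of the network in Table 1 (four 3×3 convolutional layers with 16/32/64/128 filters and ReLU, 2×2 max pooling with stride 2 after each, two dense layers and a softmax output) is given below; the input shape, padding, optimizer and loss are assumptions, and the 69-class output is simply read off Table 1 and should be set to the actual number of speakers.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_speaker_cnn(input_shape=(64, 64, 1), n_speakers=69):
        model = keras.Sequential([
            keras.Input(shape=input_shape),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same'),   # Conv2d_1
            layers.MaxPooling2D((2, 2), strides=2),                         # Pool1
            layers.Conv2D(32, (3, 3), activation='relu', padding='same'),   # Conv2d_2
            layers.MaxPooling2D((2, 2), strides=2),                         # Pool2
            layers.Conv2D(64, (3, 3), activation='relu', padding='same'),   # Conv2d_3
            layers.MaxPooling2D((2, 2), strides=2),                         # Pool3
            layers.Conv2D(128, (3, 3), activation='relu', padding='same'),  # Conv2d_4
            layers.MaxPooling2D((2, 2), strides=2),                         # Pool4
            layers.Flatten(),
            layers.Dense(278, activation='relu'),                           # Dense_1
            layers.Dense(n_speakers, activation='softmax'),                 # Dense_2 / output
        ])
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model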
In conclusion, the method of the invention extracts speech features with wavelet packets and a Gammatone filter after blind source separation, meeting the requirement of a high recognition rate for multi-source speaker recognition in noisy environments, and solving the problem that in a cocktail-party environment the recognition rate is low and multiple speakers cannot be distinguished.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (6)

1. A text-independent multi-source speaker identification method based on blind source separation, characterized in that: multi-source speech in a cocktail-party environment is detected and separated by a blind source signal detection and separation algorithm into the individual sound sources it contains; speech feature extraction is then performed on each source, namely the speech is transformed with wavelet packets and combined with a Gammatone filter for feature extraction, and the extracted features are passed through a deep learning model (CNN) to complete multi-source speaker recognition.
2. The text-independent multi-source speaker recognition method based on blind source separation according to claim 1, characterized in that the identification method specifically comprises the following steps:
step 1: blind source separation and detection; acquiring a section of sound source containing voices of multiple persons, performing blind source separation and detection on the sound source by adopting a blind source signal detection and separation algorithm, and separating a multi-source voice signal into multiple single-source voice signals;
step 2: preprocessing voice characteristics; carrying out pre-emphasis, framing and windowing on each single-source voice signal separated in the step 1 to obtain a time sequence voice signal;
step 3: performing wavelet packet decomposition and reconstruction on the time-series speech signals;
step 4: performing human-ear feature filtering on the wavelet-packet-decomposed and reconstructed speech signal with a cochlear auditory filter, and extracting the speech features;
step 5: constructing a CNN model, converting the two-dimensional speech feature vectors extracted in step 4 into three-dimensional vectors, and inputting these into the CNN model to realize multi-source speaker recognition.
3. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that the specific method of step 1 is as follows:
firstly, the original mixed speech data matrix is normalized (centered) and whitened to obtain a whitening-transformed speech matrix; then a matrix W is initialized randomly and decorrelated at each iteration to obtain an updated matrix W_new; finally, the original mixed speech data matrix, the whitening-transformed speech matrix and the updated matrix W_new are multiplied together to separate the multi-source speech signal into multiple single-source speech signals.
4. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that the specific method of step 3 is as follows:
the time-series speech signal is decomposed with wavelet packets so that the low-frequency and high-frequency components contained in the speech signal can be processed and analysed with time-frequency localization; the wavelet packet decomposition follows a complete optimal binary-tree structure, and each time-frequency node corresponds to a set of wavelet packet frequency coefficients; and the low-frequency and high-frequency speech signals obtained by the wavelet packet decomposition are reconstructed, the time sequence of the reconstructed speech signal corresponding to the original time-domain information.
5. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that the specific method of step 4 is as follows:
the speech signal obtained after wavelet packet decomposition and reconstruction in step 3 is passed through a bank of Gammatone filters to obtain speech feature vectors that conform to the physiology of the human ear, and a short-time Fourier transform is applied to the resulting feature vectors to obtain two-dimensional speech feature vectors, completing the extraction of the speech features.
6. The text-independent multi-source speaker recognition method based on blind source separation according to claim 2, characterized in that the CNN model in step 5 consists of four 2D convolutional layers, four pooling layers, two fully connected layers and an output layer; the convolution kernels are 3×3 matrices; the ReLU activation function is used in each convolutional layer; each convolutional layer is followed by a pooling layer; the pooling layers use max pooling with a 2×2 pooling size; and the output layer uses softmax as its activation function to output the probability of the class corresponding to the speech.
CN202110137229.XA 2021-02-01 2021-02-01 Text-independent multi-source speaker identification method based on blind source separation Pending CN112967722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110137229.XA CN112967722A (en) 2021-02-01 2021-02-01 Text-independent multi-source speaker identification method based on blind source separation

Publications (1)

Publication Number Publication Date
CN112967722A (en) 2021-06-15

Family

ID=76272715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110137229.XA Pending CN112967722A (en) 2021-02-01 2021-02-01 Text-independent multi-source speaker identification method based on blind source separation

Country Status (1)

Country Link
CN (1) CN112967722A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180299527A1 (en) * 2015-12-22 2018-10-18 Huawei Technologies Duesseldorf Gmbh Localization algorithm for sound sources with known statistics
CN109584900A (en) * 2018-11-15 2019-04-05 昆明理工大学 A kind of blind source separation algorithm of signals and associated noises
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU Xiaomeng et al.: "Noise-robust voiceprint recognition algorithm based on full-frequency wavelet packet decomposition", Journal of Shenzhen University Science and Engineering *
ZHU Jia et al.: "Research on automatic speaker identification based on independent component analysis", Instrumentation and Analysis Monitoring *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727329A (en) * 2024-02-07 2024-03-19 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision
CN117727329B (en) * 2024-02-07 2024-04-26 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision

Similar Documents

Publication Publication Date Title
Weninger et al. Single-channel speech separation with memory-enhanced recurrent neural networks
CN107845389A (en) A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN105448302B (en) A kind of the speech reverberation removing method and system of environment self-adaption
US20230317056A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
Do et al. Speech source separation using variational autoencoder and bandpass filter
Strauss et al. A flow-based neural network for time domain speech enhancement
Adiga et al. Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN.
Geng et al. End-to-end speech enhancement based on discrete cosine transform
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
Islam et al. Supervised single channel dual domains speech enhancement using sparse non-negative matrix factorization
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
Fazel et al. Sparse auditory reproducing kernel (SPARK) features for noise-robust speech recognition
CN106653004A (en) Speaker identification feature extraction method for sensing speech spectrum regularization cochlear filter coefficient
Saleem et al. Unsupervised speech enhancement in low SNR environments via sparseness and temporal gradient regularization
Islam et al. Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask
CN114360571A (en) Reference-based speech enhancement method
CN112967722A (en) Text-independent multi-source speaker identification method based on blind source separation
CN113593588A (en) Multi-singer singing voice synthesis method and system based on generation countermeasure network
Zhao et al. An Improved Speech Enhancement Method based on Teager Energy Operator and Perceptual Wavelet Packet Decomposition.
Baby et al. Speech dereverberation using variational autoencoders
Singh et al. Speech enhancement for Punjabi language using deep neural network
Sun et al. Enhancement of Chinese speech based on nonlinear dynamics
Shu-Guang et al. Isolated word recognition in reverberant environments
Amarjouf et al. Denoising esophageal speech using combination of complex and discrete wavelet transform with wiener filter and time dilated Fourier Cepstra
Sharma et al. Self-supervision and learnable strfs for age, emotion, and country prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210615