CN111667836B - Text-independent multi-label speaker recognition method based on deep learning - Google Patents

Text-independent multi-label speaker recognition method based on deep learning

Info

Publication number
CN111667836B
Authority
CN
China
Prior art keywords
voice
marks
speaker
training
speaker recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010563201.8A
Other languages
Chinese (zh)
Other versions
CN111667836A (en)
Inventor
邓克琦
卢晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010563201.8A
Publication of CN111667836A
Application granted
Publication of CN111667836B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention discloses a text-independent multi-label speaker recognition method based on deep learning. The method comprises the following steps: (1) dividing the voice of each speaker in the training data set into N parts and assigning a different label to each part; (2) constructing a corresponding neural network model whose output-layer dimension equals the number of labels in the training data set; (3) inputting the training data into the neural network, comparing the output-layer result with the label corresponding to the data, and computing a cross entropy loss function for training; (4) presetting, for the voice data of each speaker, the N labels regarded as valid recognition according to the correspondence established in step (1), inputting the test data into the neural network, and comparing the label predicted by the model with the preset N labels; the prediction is judged correct as long as it matches any one of them. The method can effectively improve the speaker recognition performance of the model in both clean and noisy environments.

Description

Text-independent multi-label speaker recognition method based on deep learning
Technical Field
The invention relates to a text-independent multi-label speaker recognition method based on deep learning.
Background
Speaker recognition, also known as voiceprint recognition, aims to confirm the identity of a speaker from speech characteristics. Speaker recognition comprises two tasks, speaker identification and speaker verification: speaker identification determines, after processing and analyzing a speaker's voice, whether the speaker belongs to a recorded set of speakers; speaker verification further confirms whether the speaker corresponding to an input voice is the claimed target speaker.
The i-vector method may be used to implement speaker recognition (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011). The literature (D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, and Y. Carmiel, "Deep Neural Network Embeddings for Text Independent Speaker Verification," in Interspeech, 2017, pp. 999-1003) shows that deep learning methods can surpass the conventional i-vector method when large-scale data are used, particularly after data augmentation. Speaker recognition in noisy environments, however, remains a challenging problem.
A denoising autoencoder (DAE) may be used to generate enhanced speech from noisy speech to improve speaker recognition performance in noisy scenes (O. Plchot, L. Burget, H. Aronowitz, and P. Matějka, "Audio enhancing with DNN autoencoder for speaker recognition," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5090-5094). However, because this method uses an L2 loss function and performs speech enhancement first and speaker recognition afterwards, the enhancement and recognition stages are mismatched, and a high recognition rate on noisy speech cannot be obtained. The literature (Suwon Shon, Hao Tang, and James Glass, "VoiceID Loss: Speech Enhancement for Speaker Verification," in Interspeech, 2019, pp. 2888-2892) uses an end-to-end structure for speech enhancement and speaker recognition, but there is still room for improvement in the overall performance of the algorithm.
When deep learning is used for speaker recognition, large-scale data are required for training; however, as the amount of data grows further, the improvement in the model's discrimination ability slows down. The one-to-one correspondence between each speaker and the labels of the corresponding voice data is not the most efficient way to use the data.
Disclosure of Invention
In existing deep-learning-based speaker recognition training strategies, each speaker corresponds one-to-one to the labels of the corresponding voice data, which does not exploit the data to the greatest extent. The invention provides a text-independent multi-label speaker recognition method based on deep learning, which further improves the discrimination ability of the model and its recognition performance in both clean and noisy environments.
The technical scheme adopted by the invention is as follows:
the text irrelevant multi-label speaker recognition method based on deep learning comprises the following steps:
step 1, equally dividing the voice of each speaker in the training data set into N parts, and marking different marks on each part of voice so that the number of marks of the whole training data set is N times of the number of the speakers, wherein N is more than or equal to 2;
step 2, constructing a neural network model, wherein the dimension of an output layer of the model is consistent with the total number of the training data set labels;
step 3, inputting the training data set in the step 1 into the neural network model in the step 2, comparing the output result with the label corresponding to the voice data, and solving a cross entropy loss function so as to perform supervised training;
and 4, presetting N marks which are regarded as effective identification for the voice data of each speaker according to the corresponding relation between the voice of the training data set and the marks in the step 1 by the data of the test set, inputting the data of the test data set into the neural network model, comparing the predicted marks of the model with the N marks which are regarded as effective identification and are set before, and obtaining the speaker identification rate of the model as long as the predicted marks are one of the N marks which are regarded as effective identification, namely correct identification.
Further, the neural network model comprises a speech enhancement network and a speaker recognition network; the speech enhancement network enhances noisy speech and improves the robustness of the neural network model. In step 3, the speaker recognition network is pre-trained first, its parameters are locked after training converges, and then the complete neural network model, comprising the speech enhancement network and the speaker recognition network, is trained end to end.
Further, in step 3, the cross entropy loss function is:

Loss = -\sum_{i=1}^{N \times C} p_i \log q_i

where C is the total number of speakers; p_i is the ground-truth indicator determined by the voice data and its label y_i, i.e. exactly one of the N × C classification positions is 1 and all the others are zero; and q_i is the probability predicted by the system at each classification position.
The invention departs from previous deep-learning-based speaker recognition methods, in which each speaker's voice corresponds to a single speaker label, and instead adopts a multi-label scheme. The method can further improve the speaker recognition performance of the model in both clean and noisy environments.
Drawings
FIG. 1 is a schematic diagram of the neural network in an embodiment of the present invention, where N is the number of parts into which the training data set is divided.
FIG. 2 shows the specific structure of the speech enhancement network in an embodiment of the present invention.
FIG. 3 is a flow chart of the text-independent multi-label speaker recognition method based on deep learning of the present invention.
FIG. 4 is a flowchart of the specific algorithm of the method of the present invention, where n1 and n2 denote the total numbers of utterances in the training set and test set, respectively; D is the training set, T is the test set, and C is the number of speakers.
FIG. 5 is a comparison of a prior deep learning-based speaker recognition scheme with the method of the present invention for clean speech and various types of noisy speech.
FIG. 6 is a line graph of the comparison of the prior deep learning-based speaker recognition scheme with the inventive method for clean speech and various types of noisy speech.
Detailed Description
The text-independent multi-label speaker recognition method based on deep learning mainly comprises the following parts:
1. Segmenting the training data set
1) Define the training data set

D = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\},  (1)

where D is the training set, x and y are the voice and its original label respectively, and n_1 is the total number of training utterances;
2) Relabel the training set voice data
The voice of each speaker in the training data set is divided into N parts, and a different label is assigned to each part, so that the number of labels of the whole training data set is N times the number of speakers. The new labels can be expressed as:

\tilde{y}_i = y_i + mC,  m = 0, 1, \ldots, N-1,  (2)

where y_i is the original label of a training utterance, \tilde{y}_i is its modified label, C is the total number of speakers, and m is an integer between 0 and N-1.
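For illustration, a minimal sketch of this relabelling step is given below, assuming the mapping \tilde{y}_i = y_i + mC of equation (2) with 0-indexed speaker labels; the function name and the equal-split strategy are illustrative, not taken from the patent.

```python
import random
from collections import defaultdict

def relabel_training_set(dataset, num_speakers, n_parts):
    """Split each speaker's utterances into n_parts groups and assign each group
    its own label: new_label = original_label + m * num_speakers.
    `dataset` is a list of (utterance, speaker_label) pairs with labels in [0, num_speakers)."""
    by_speaker = defaultdict(list)
    for utt, label in dataset:
        by_speaker[label].append(utt)

    relabelled = []
    for label, utts in by_speaker.items():
        random.shuffle(utts)                  # equal split of each speaker's data into N parts
        for idx, utt in enumerate(utts):
            m = idx % n_parts                 # m in {0, ..., N-1}
            relabelled.append((utt, label + m * num_speakers))
    return relabelled
```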
2. Constructing the neural network
A corresponding neural network model is constructed, comprising a speech enhancement network and a speaker recognition network, with the output-layer dimension equal to the total number of labels in the training data set. The speech enhancement network enhances noisy speech, so the recognition performance on enhanced noisy speech can be measured after passing the speech through it. With the whole network, the recognition performance of the model on clean speech, noisy speech, and enhanced noisy speech can be compared, giving both the recognition performance and the robustness of the model.
3. Constructing the noisy speech data set
The data set (regarded at this point as the clean speech data set) is mixed with different types of noise at a given signal-to-noise ratio to obtain the noisy speech data set.
4. Training
1) Pre-training
The speaker recognition network is first trained with clean speech: the training data are input into the neural network, the output-layer result is compared with the label corresponding to the data, and a cross entropy loss function is computed for supervised training; after convergence, the parameters of the speaker recognition network are locked. Because the number of classes distinguished by the constructed network is N times the number of speakers in the data set, the number of classification positions involved in computing the cross entropy loss changes accordingly to N times that of the initial state. The specific formula is:
Loss = -\sum_{i=1}^{N \times C} p_i \log q_i,  (3)

where N is the number of parts into which the training data set is divided, C is the total number of speakers, p_i is the ground-truth indicator determined by the modified label \tilde{y}_i described above, i.e. exactly one of the N × C classification positions is 1 and all the others are zero, and q_i is the probability predicted by the system at each classification position.
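For clarity, a short sketch of how this loss can be computed in practice is shown below. PyTorch is assumed purely for illustration (the patent does not name a framework); the network's output layer has N×C units and the relabelled index \tilde{y}_i plays the role of the one-hot target p.

```python
import torch
import torch.nn as nn

N, C = 2, 1251                      # number of splits and number of speakers (example values)
criterion = nn.CrossEntropyLoss()   # combines log-softmax and negative log-likelihood

logits = torch.randn(8, N * C, requires_grad=True)   # model output for a batch of 8 utterances
targets = torch.randint(0, N * C, (8,))               # relabelled targets: index of the "1" in p

loss = criterion(logits, targets)   # equals -sum_i p_i * log(q_i), averaged over the batch
loss.backward()
```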
2) End-to-end training
The speech enhancement network is introduced, and the neural network model comprising the speech enhancement network and the parameter-locked speaker recognition network is trained end to end with noisy speech data. The speech enhancement network computes a ratio mask using the sigmoid function, and the mask is multiplied element-wise with the input speech features to achieve enhancement. The sigmoid function can be expressed as:

\sigma(z) = \frac{1}{1 + e^{-z}},  (4)
where z is the value at any point in the tensor output by the speech enhancement network.
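The following sketch illustrates this training arrangement under stated assumptions: the speaker recognition network's parameters are locked and only the enhancement network is updated through the end-to-end loss. The module definitions, shapes, and data loader below are dummy stand-ins, not the patent's architecture.

```python
import torch
import torch.nn as nn

# Dummy stand-ins; names, shapes, and layer choices are illustrative only.
enhance_net = nn.Conv1d(257, 257, kernel_size=3, padding=1)   # outputs mask logits
speaker_net = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(257, 2 * 1251))
loader = [(torch.randn(4, 257, 298), torch.randint(0, 2 * 1251, (4,)))]  # one fake batch

for p in speaker_net.parameters():            # lock the pre-trained recognizer
    p.requires_grad = False

optimizer = torch.optim.Adam(enhance_net.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for noisy_spec, target in loader:             # noisy magnitude spectra and relabelled targets
    mask = torch.sigmoid(enhance_net(noisy_spec))   # ratio mask in (0, 1), equation (4)
    enhanced = mask * noisy_spec                     # element-wise enhancement
    logits = speaker_net(enhanced)
    loss = criterion(logits, target)
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach only enhance_net
    optimizer.step()
```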
5. Testing
1) Define the test data set

T = \{(x_1, y_1), \ldots, (x_{n_2}, y_{n_2})\},  (5)

where T is the test set, x and y are the voice and its original label respectively, and n_2 is the number of utterances in the test set.
2) Preset the valid recognition labels
Each speaker is preset with the N labels regarded as valid recognition according to equation (2).
3) Clean speech recognition
Clean test speech is input into the speaker recognition network without the speech enhancement network, and the network's output is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the clean speech recognition rate.
4) Noisy speech recognition
Noisy test speech is input directly into the speaker recognition network without the speech enhancement network, and the network's output is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the noisy speech recognition rate.
5) Enhanced speech recognition
Noisy speech is input into the speech enhancement network, the enhanced speech is then input into the speaker recognition network, and the result is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the enhanced noisy speech recognition rate.
The recognition performance of the system under conditions 3), 4), and 5) is compared to check whether the result meets expectations: the clean speech recognition rate should be highest, the noisy speech recognition rate lowest, and the enhanced noisy speech recognition rate in between.
In summary, the method changes the existing deep-learning-based text-independent speaker recognition approach, replacing the one-to-one correspondence between a speaker's voice and the speaker label with a multi-label scheme.
Examples
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criterion
The clean speech samples of this embodiment come from the VoxCeleb1 data set, which is extracted from YouTube videos and contains 1251 speakers and approximately 150000 speech segments, with an average segment length of 7.8 seconds. To test the robustness of the model, this embodiment uses the NOISEX-92 noise data set. Seven different types of noise, namely White, Babble, Leopard (military vehicle noise), Volvo (in-car noise), Factory, Tank, and Gun (gun noise), are selected from NOISEX-92 and mixed with the clean speech at a signal-to-noise ratio of 10 dB to obtain noisy speech, which is used to train and test the performance of the system in a noisy environment. When mixing the noise with the clean speech, each noise recording is first divided into two parts, one mixed with the training set data and the other with the test set data, so that the training and test sets never share the same noise data.
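A minimal sketch of mixing noise into clean speech at a target SNR follows; the 10 dB value comes from the embodiment, while the function itself is an illustrative implementation rather than the patent's exact procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=10.0):
    """Mix a noise signal into a clean signal at the given SNR (in dB).
    Both inputs are 1-D float arrays; the noise is tiled/cropped to the clean length."""
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```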
The invention adopts Accuracy (Top-1 Accuracy) as the objective evaluation criterion. When computing Accuracy for the multi-label speaker recognition method, an output is counted as a valid recognition as long as it matches any one of the preset valid recognition labels, which still satisfies the Top-1 Accuracy requirement.
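A sketch of this evaluation rule is given below, assuming the label mapping of equation (2) with 0-indexed speaker IDs: a prediction counts as correct if it falls within the set of N labels assigned to the true speaker.

```python
def multi_label_accuracy(predictions, speaker_ids, num_speakers, n_parts):
    """predictions: predicted class indices in [0, N*C); speaker_ids: true speaker
    indices in [0, C). A prediction is correct if it matches any of the speaker's N labels."""
    correct = 0
    for pred, spk in zip(predictions, speaker_ids):
        valid = {spk + m * num_speakers for m in range(n_parts)}   # assumed mapping y + m*C
        correct += int(pred in valid)
    return correct / len(predictions)
```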
2. Parameter setting
1) Speaker recognition network
Referring to FIG. 1, the speaker recognition network used in this embodiment consists of four one-dimensional convolution layers and one fully connected layer (dimension 1500). The kernel sizes of the one-dimensional convolution layers are 5, 7, and 1, and the strides are 1, 2, and 1, respectively; a global average pooling layer is inserted between the final convolution layer and the fully connected layer. The 257-dimensional spectrum is taken directly as input, using a window length of 25 ms and a frame shift of 10 ms. This embodiment does not normalize the input data; it only raises the magnitude spectrum to the power of 0.3. A fixed length of 298 frames (257 frequency bins per frame) is used as the input for one utterance during training.
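A rough PyTorch sketch of such a recognizer is shown below for orientation; the channel widths, the assignment of kernel/stride values to layers, and the placement of the output layer are assumptions, since they are not fully specified above.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Four 1-D convolution layers, global average pooling, a 1500-dim fully
    connected layer, and an N*C-dim output layer (channel widths are assumed)."""
    def __init__(self, num_classes, in_bins=257):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_bins, 512, kernel_size=5, stride=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1, stride=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1, stride=1), nn.ReLU(),
        )
        self.fc = nn.Linear(1500, 1500)
        self.out = nn.Linear(1500, num_classes)

    def forward(self, x):            # x: (batch, 257 frequency bins, 298 frames)
        h = self.convs(x)
        h = h.mean(dim=-1)           # global average pooling over time
        return self.out(torch.relu(self.fc(h)))
```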
2) Speech enhancement network
The enhancement model consists of 11 dilated convolution layers; see FIG. 2 for the specific structure. The sigmoid function is applied to the output of the final convolution layer to generate a ratio mask of the same size, which is multiplied with the original input to achieve speech enhancement. This embodiment uses ReLU as the nonlinear activation. The final output layer produces the predicted result through a softmax.
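A simplified sketch of such an enhancement front end is given below; the dilation pattern, channel width, and kernel size are assumed values, since only the layer count (11 dilated convolutions) and the sigmoid ratio-mask output are stated above.

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    """Stack of 11 dilated 1-D convolutions producing a ratio mask with the same
    size as the input spectrum (dilations and channel width are assumed values)."""
    def __init__(self, bins=257, channels=128):
        super().__init__()
        layers, in_ch = [], bins
        dilations = [1, 2, 4, 8, 16, 32, 1, 2, 4, 8, 1]   # 11 layers, assumed pattern
        for i, d in enumerate(dilations):
            out_ch = bins if i == len(dilations) - 1 else channels
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=3, dilation=d, padding=d))
            if i < len(dilations) - 1:
                layers.append(nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_spec):                 # (batch, 257, frames)
        mask = torch.sigmoid(self.net(noisy_spec)) # ratio mask in (0, 1)
        return mask * noisy_spec                   # enhanced spectrum, same shape
```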
3) Data set construction
The original data set is first partitioned into a training set and a test set; both must contain the same set of speakers, and the ratio of training set size to test set size must be kept consistent across embodiments. In this embodiment, the test set and training set are constructed at a ratio of 1:3. Before the algorithm starts, the training data are sorted by speaker ID, i.e. utterances belonging to the same speaker are grouped together.
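A brief sketch of this partitioning is shown below, assuming each speaker's utterances are split 3:1 between training and test sets so that every speaker appears in both; the helper name and ratio argument are illustrative.

```python
from collections import defaultdict

def split_by_speaker(dataset, train_ratio=0.75):
    """dataset: list of (utterance, speaker_id) pairs. Every speaker appears in
    both subsets; utterances are grouped per speaker (sorted by ID) before splitting."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)

    train, test = [], []
    for spk in sorted(by_speaker):                    # sort by speaker ID
        utts = by_speaker[spk]
        cut = int(len(utts) * train_ratio)
        train += [(u, spk) for u in utts[:cut]]
        test += [(u, spk) for u in utts[cut:]]
    return train, test
```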
4) N value setting
On a data set of limited size, if N is too large, the data in each training subset become too sparse, which ultimately degrades the overall recognition performance of the system. Therefore, this embodiment only tests the cases N = 2 and N = 3.
3. Specific implementation flow of the method
1) Referring to FIG. 3 and FIG. 4, the algorithm is initialized according to formulas (1) and (5) and the parameter settings above; a training data buffer and a test data buffer are established to cache the data used in speaker recognition, and a speaker label buffer is established to cache the labels used in training and testing. At each step of model training, the following is computed: a new piece of voice data is obtained, a windowed short-time Fourier transform is applied to it to obtain a group of data with 298 frames and 257 frequency points per frame, the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated (a feature-extraction sketch is given after this list).
2) Referring to the newly assigned labels in step 4 of FIG. 4, the label corresponding to the voice data is obtained, and the label buffer is updated.
3) The data are input into the neural network, the output result is compared with the labels in the label buffer, the cross entropy loss function is computed, and the model parameters are optimized by back-propagation.
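The feature-extraction step referenced above is sketched here under the stated settings (25 ms window, 10 ms shift, 257 frequency bins, 298 frames, magnitude raised to the power 0.3); librosa and a 16 kHz sampling rate are assumptions made for illustration.

```python
import numpy as np
import librosa

def extract_features(wav, sr=16000, n_frames=298):
    """Return a (257, 298) compressed magnitude spectrogram for one utterance.
    Window 25 ms, hop 10 ms, 512-point FFT -> 257 bins; magnitude raised to 0.3."""
    spec = librosa.stft(wav, n_fft=512,
                        win_length=int(0.025 * sr),
                        hop_length=int(0.010 * sr))
    mag = np.abs(spec) ** 0.3                     # power-law compression instead of normalization
    # crop or zero-pad to a fixed 298 frames
    if mag.shape[1] >= n_frames:
        mag = mag[:, :n_frames]
    else:
        mag = np.pad(mag, ((0, 0), (0, n_frames - mag.shape[1])))
    return mag
```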
At each step of model testing, the following is computed:
1) A new piece of voice data is obtained, a windowed short-time Fourier transform is applied to it to obtain a group of data with 298 frames and 257 frequency points per frame, the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) The newly assigned label corresponding to the voice data is obtained, and the label buffer is updated.
3) The data are input into the neural network; referring to steps 11 to 13 of FIG. 4, the output result is compared with the labels in the label buffer to judge whether it matches any one of them:
31) if yes, the recognition is correct;
32) if no, the recognition is incorrect;
4) Referring to steps 13 to 17 of FIG. 4, the speaker recognition rate of the model on the test set is obtained.
To demonstrate the performance of the method of the invention for speaker recognition in clean and noisy environments, this embodiment compares existing deep-learning-based speaker recognition with the method of the invention. FIG. 5 shows the Accuracy scores of the existing deep-learning-based speaker recognition and of the method of the invention, before and after enhancement, in a clean environment and in various types of noisy environments, and FIG. 6 plots the results as a line graph.
In FIG. 5, D denotes training with clean speech and D_N denotes training with noisy speech. "Baseline" denotes the existing scheme, while "Proposed (N=3)" and "Proposed (N=2)" denote the method of the invention with N set to 3 and 2, respectively. The results show that the invention outperforms the existing scheme for both clean speech and enhanced noisy speech. The recognition rate on enhanced noisy speech is clearly higher than before enhancement, and the enhancement effect of the invention is more pronounced.
In FIG. 6, "baseline (original)", "proposed (N=3) (original)", and "proposed (N=2) (original)" denote the speaker recognition rates on clean speech and noisy speech without speech enhancement, while "baseline (enhanced)", "proposed (N=3) (enhanced)", and "proposed (N=2) (enhanced)" denote the speaker recognition rates on noisy speech after speech enhancement. The results show that, on the data set used in this embodiment, the method of the invention performs similarly for N = 2 and N = 3 on un-enhanced noisy speech; for enhanced noisy speech and clean speech, the method performs better with N = 2 and is clearly superior to the existing scheme.
As can be seen from the results in FIG. 5 and FIG. 6, the multi-label speaker recognition method based on deep learning can further improve the recognition performance of the model in both clean and noisy environments.

Claims (3)

1. A text-independent multi-label speaker recognition method based on deep learning, characterized by comprising the following steps:
step 1, equally dividing the voice of each speaker in the training data set into N parts and assigning a different label to each part, so that the number of labels of the whole training data set is N times the number of speakers, where N ≥ 2;
step 2, constructing a neural network model whose output-layer dimension equals the total number of labels in the training data set;
step 3, inputting the training data set of step 1 into the neural network model of step 2, comparing the output result with the label corresponding to the voice data, and computing a cross entropy loss function for supervised training;
step 4, for the test set, presetting for each speaker's voice data the N labels regarded as valid recognition according to the correspondence between voice and labels established in step 1, inputting the test data into the neural network model, and comparing the label predicted by the model with the preset N valid labels; the prediction is a correct recognition as long as it matches any one of the N valid labels, from which the speaker recognition rate of the model is obtained.
2. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein the neural network model comprises a speech enhancement network and a speaker recognition network, the speech enhancement network enhancing noisy speech and improving the robustness of the neural network model; in step 3, the speaker recognition network is pre-trained first, its parameters are locked after training converges, and then the complete neural network model, comprising the speech enhancement network and the speaker recognition network, is trained end to end.
3. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein in step 3 the cross entropy loss function is:

Loss = -\sum_{i=1}^{N \times C} p_i \log q_i

where C is the total number of speakers; p_i is the ground-truth indicator determined by the voice data and its label y_i, i.e. exactly one of the N × C classification positions is 1 and all the others are zero; and q_i is the probability predicted by the system at each classification position.
CN202010563201.8A 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning Active CN111667836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111667836A CN111667836A (en) 2020-09-15
CN111667836B true CN111667836B (en) 2023-05-05

Family

ID=72388943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563201.8A Active CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111667836B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507B (en) * 2021-03-15 2024-04-19 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629B (en) * 2011-10-17 2016-08-10 阿里巴巴集团控股有限公司 A kind of method and device detecting noise data
CN104732978B (en) * 2015-03-12 2018-05-08 上海交通大学 The relevant method for distinguishing speek person of text based on combined depth study
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lin Shudu; Shao Xi. Speaker recognition based on i-vector and deep learning. Computer Technology and Development, No. 06. *

Also Published As

Publication number Publication date
CN111667836A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN103824557B (en) A kind of audio detection sorting technique with custom feature
US9595257B2 (en) Downsampling schemes in a hierarchical neural network structure for phoneme recognition
KR101807948B1 (en) Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
US8301578B2 (en) System and method for tagging signals of interest in time variant data
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
US11100932B2 (en) Robust start-end point detection algorithm using neural network
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
Mallidi et al. Autoencoder based multi-stream combination for noise robust speech recognition
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Zou et al. Improved voice activity detection based on support vector machine with high separable speech feature vectors
KR102406512B1 (en) Method and apparatus for voice recognition
CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning
Kinoshita et al. Deep mixture density network for statistical model-based feature enhancement
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
Nicolson et al. Sum-product networks for robust automatic speaker identification
Matsui et al. N-best-based unsupervised speaker adaptation for speech recognition
Wang et al. Robust speech recognition from ratio masks
Reshma et al. A survey on speech emotion recognition
US7912715B2 (en) Determining distortion measures in a pattern recognition process
Techini et al. Robust Front-End Based on MVA and HEQ Post-processing for Arabic Speech Recognition Using Hidden Markov Model Toolkit (HTK)
Gu et al. Gaussian speaker embedding learning for text-independent speaker verification
Janicki et al. Improving GMM-based speaker recognition using trained voice activity detection
Morales et al. Adding noise to improve noise robustness in speech recognition.
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant