CN111667836A - Text-independent multi-label speaker identification method based on deep learning - Google Patents

Text-independent multi-label speaker identification method based on deep learning

Info

Publication number
CN111667836A
CN111667836A
Authority
CN
China
Prior art keywords
speaker
labels
voice
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010563201.8A
Other languages
Chinese (zh)
Other versions
CN111667836B (en)
Inventor
邓克琦
卢晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010563201.8A
Publication of CN111667836A
Application granted
Publication of CN111667836B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a text-independent multi-label speaker recognition method based on deep learning. The method comprises the following steps: (1) evenly divide the speech of each speaker in the training data set into N parts and assign a different label to each part; (2) construct a corresponding neural network model, ensuring that the dimension of the output layer is consistent with the number of labels in the training data set; (3) input the training data into the neural network, compare the output-layer result with the label corresponding to the data, and compute the cross-entropy loss function, thereby performing training; (4) for the test set, preset N labels regarded as valid identifications for the speech data of each speaker according to the correspondence established in step (1), input the test data into the neural network, and compare the label predicted by the model with the N preset labels; the speaker is identified correctly as long as the predicted label matches one of them. The method can effectively improve the speaker recognition performance of the model in both clean and noisy environments.

Description

Text-independent multi-label speaker identification method based on deep learning
Technical Field
The invention relates to a text-independent multi-label speaker identification method based on deep learning.
Background
Speaker recognition, also known as voiceprint recognition, aims at identifying a speaker from his or her speech characteristics. Speaker recognition comprises two processes, speaker identification and speaker verification: speaker identification refers to determining, after processing and analysing the corresponding speech, whether a speaker belongs to an enrolled speaker set; speaker verification refers to the process of further confirming whether the speaker corresponding to an input utterance is the target speaker.
The i-vector method can be used to achieve speaker recognition (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.). The literature (D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, and Y. Carmiel, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Interspeech, 2017, pp. 999-1003.) indicates that deep learning methods can already surpass the conventional i-vector method once large-scale data, and in particular data augmentation, are used. However, speaker recognition in noisy environments remains a challenging problem.
A denoising auto-encoder (DAE) may be used to generate enhanced speech from noisy speech, thereby improving speaker recognition performance in noisy scenes (O. Plchot, L. Burget, H. Aronowitz and P. Matějka, "Audio enhancement with DNN autoencoder for speaker recognition," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5090-5094.). However, since this method uses the L2 loss function to first enhance the speech and only then performs speaker recognition, the speech enhancement and the speaker recognition are mismatched, and a high recognition rate on noisy speech cannot be achieved. The literature (Suwon Shon, Hao Tang, and James Glass, "VoiceID Loss: Speech Enhancement for Speaker Verification," in Interspeech, 2019, pp. 2888-2892.) uses an end-to-end architecture for speech enhancement and speaker recognition, but the overall performance of the algorithm still has room for improvement.
When speaker recognition is performed using deep learning, large-scale data is required for training; however, as the amount of data increases further, the improvement in the discrimination capability of the model slows down. In existing methods, speakers are in one-to-one correspondence with the labels of their speech data, which is not the most efficient way to use the data.
Disclosure of Invention
In the existing deep-learning-based speaker recognition training strategy, speakers correspond one-to-one to the labels of their speech data, so the data cannot be exploited to the greatest extent. The invention provides a text-independent multi-label speaker recognition method based on deep learning, which further improves the discrimination capability of the model and further improves its recognition capability in both clean and noisy environments.
The technical scheme adopted by the invention is as follows:
the text irrelevant multi-label speaker identification method based on deep learning comprises the following steps:
step 1, dividing the voice of each speaker in a training data set into N parts on average, and marking different labels on each part of voice, so that the number of the labels of the whole training data set is N times of the number of the speakers, wherein N is more than or equal to 2;
step 2, constructing a neural network model, wherein the dimension of an output layer of the model is consistent with the total number of labels of the training data set;
step 3, inputting the training data set in the step 1 into the neural network model in the step 2, comparing the output result with the label corresponding to the voice data, and solving a cross entropy loss function so as to perform supervised training;
and 4, presetting N labels regarded as effective identification for the voice data of each speaker according to the corresponding relation between the voice of the training data set and the labels in the step 1 by the test set data, inputting the test data set data into the neural network model, comparing the labels predicted by the model with the N labels regarded as effective identification set before, and obtaining the speaker identification rate of the model as long as the predicted label is one of the N labels regarded as effective identification, namely correct identification.
Furthermore, the neural network model comprises a speech enhancement network and a speaker recognition network; the speech enhancement network is used to enhance noisy speech and improve the robustness of the neural network model. In step 3, the speaker recognition network is trained in advance and its parameters are locked after training converges; the complete neural network model comprising the speech enhancement network and the speaker recognition network is then trained end-to-end.
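A minimal PyTorch sketch of this two-stage scheme, assuming placeholder architectures (the module definitions, layer sizes and learning rate below are illustrative assumptions, not taken from the patent): pre-train the speaker recognition network, lock its parameters, then train the combined model end-to-end.

```python
import torch
import torch.nn as nn

N, C = 2, 1251  # parts per speaker and number of speakers (example values)

# Stand-in modules; the patent does not fix the exact architectures at this point.
recognizer = nn.Sequential(              # speaker recognition network (placeholder)
    nn.Conv1d(257, 64, kernel_size=5), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, N * C),
)
enhancer = nn.Sequential(                # speech enhancement network (placeholder)
    nn.Conv1d(257, 257, kernel_size=1), nn.Sigmoid(),
)

class Combined(nn.Module):
    """Enhancer produces a ratio mask; the masked spectrum goes to the recognizer."""
    def __init__(self, enh, rec):
        super().__init__()
        self.enh, self.rec = enh, rec
    def forward(self, x):                # x: (batch, 257, frames) noisy spectrum
        return self.rec(self.enh(x) * x)

# Stage 1 (assumed already done): recognizer trained on clean speech, then locked.
for p in recognizer.parameters():
    p.requires_grad = False

# Stage 2: end-to-end training on noisy speech; only the enhancer is updated.
model = Combined(enhancer, recognizer)
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```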
Further, in step 3, the specific formula of the cross-entropy loss function is:

$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$

where C is the total number of speakers; $p_i$ is the ground-truth indicator determined from the speech data according to its label $y_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero; and $q_i$ is the probability predicted by the system at each classification position.
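For illustration, a minimal PyTorch computation of this loss over N × C classification positions (the batch size and label values below are arbitrary examples):

```python
import torch
import torch.nn.functional as F

N, C = 2, 1251                            # parts per speaker, number of speakers (example values)
logits = torch.randn(8, N * C)            # model outputs for a batch of 8 speech segments
labels = torch.randint(0, N * C, (8,))    # integer labels produced by the relabelling of step 1

# F.cross_entropy applies log-softmax internally, i.e. q_i = softmax(logits)_i,
# while p_i is the one-hot vector implied by each integer label.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```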
The invention replaces the one-to-one correspondence between speaker speech and speaker labels used in previous deep-learning-based speaker recognition methods with a multi-label scheme. The method can further improve the speaker recognition performance of the model in both clean and noisy environments.
Drawings
FIG. 1 is a schematic diagram of the neural network in an embodiment of the present invention, where N is the number of parts into which the training data set is divided.
Fig. 2 is a specific structure of a speech enhancement network in an embodiment of the present invention.
FIG. 3 is a flow chart of the text-independent multi-label speaker recognition method based on deep learning according to the invention.
FIG. 4 is a flow chart of the specific algorithm of the method of the present invention, where n1 and n2 denote the numbers of utterances in the training set and the test set, respectively; D is the training set, T is the test set, and C is the number of speakers.
FIG. 5 is a comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and for various types of noisy speech.
FIG. 6 is a line graph of the comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and various types of noisy speech.
Detailed Description
The text-independent multi-label speaker identification method based on deep learning mainly comprises the following parts:
1. segmenting a training data set
1) Defining a training data set
$D = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\}$,  (1)
where D is the training set, x and y are the speech and the corresponding original label, respectively, and $n_1$ is the number of utterances in the training set;
2) training set speech data labels
Evenly divide the speech of each speaker in the training data set into N parts and assign a different label to each part, so that the number of labels in the whole training data set is N times the number of speakers. This can be expressed as
$\tilde{y}_i = y_i + mC, \quad m = 0, 1, \ldots, N-1$,  (2)
where $y_i$ is the label of an utterance in the training set in the initial state, $\tilde{y}_i$ is its modified label, C is the total number of speakers, and m, taking values from 0 to N-1, indexes the part of that speaker's speech to which the utterance belongs.
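A short Python sketch of one way to carry out this relabelling, assuming the utterances of each speaker are simply split in order into N consecutive parts (the helper name and data layout are illustrative assumptions, not from the patent):

```python
from collections import defaultdict

def relabel(dataset, num_speakers, n_parts):
    """dataset: list of (utterance, speaker_label) pairs with labels in [0, num_speakers).
    Returns a new list whose labels lie in [0, n_parts * num_speakers), per formula (2)."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)

    relabeled = []
    for spk, utts in by_speaker.items():
        part_len = max(len(utts) // n_parts, 1)              # split each speaker's speech evenly
        for i, utt in enumerate(utts):
            m = min(i // part_len, n_parts - 1)              # part index m in [0, n_parts)
            relabeled.append((utt, spk + m * num_speakers))  # new label y + m*C
    return relabeled

# Toy usage: 2 speakers with 4 utterances each, N = 2 parts.
data = [(f"utt{spk}_{i}", spk) for spk in range(2) for i in range(4)]
print(relabel(data, num_speakers=2, n_parts=2))
```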
2. Constructing neural networks
Construct a corresponding neural network model comprising a speech enhancement network and a speaker recognition network, ensuring that the dimension of the output layer is consistent with the total number of labels in the training data set. The speech enhancement network enhances noisy speech, and the enhanced speech is then fed into the speaker recognition network for recognition. With this structure, the recognition performance of the model on clean speech, noisy speech and enhanced noisy speech can be compared, from which the recognition performance and robustness of the model are obtained.
3. Constructing noisy speech data set
Mix the data set (regarded here as a clean speech data set) with different types of noise data at a given signal-to-noise ratio to obtain a noisy speech data set.
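A minimal numpy sketch of this mixing step at a target signal-to-noise ratio (the helper name, the synthetic signals and the 16 kHz sampling rate are illustrative assumptions; the 10 dB value matches the embodiment described later):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)              # repeat or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: 1 s of a synthetic "clean" signal mixed with white noise at 10 dB SNR.
fs = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
noisy = mix_at_snr(clean, np.random.randn(fs), snr_db=10.0)
```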
4. Training
1) Pre-training
First, the speaker recognition network is trained with clean speech: the training data are input into the neural network, the output-layer result is compared with the label corresponding to the data, and the cross-entropy loss function is computed, so that supervised training is performed; the parameters of the speaker recognition network are locked after training converges. Because the number of categories distinguished by the constructed neural network is N times the number of speakers in the data set, the number of classification positions involved in computing the cross-entropy loss function correspondingly becomes N times that of the initial state, specifically:
$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$,  (3)
where N is the number of parts into which the training data set is divided, C is the total number of speakers, $p_i$ is the ground-truth indicator determined from the modified label $\tilde{y}_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero, and $q_i$ is the probability predicted by the system at each classification position.
2) Training
A speech enhancement network is introduced, and a neural network model comprising the speech enhancement network and the parameter-locked speaker recognition network is trained end-to-end using noisy speech data. The speech enhancement network uses the sigmoid function to compute a ratio mask, which is multiplied element-wise by the input speech data to achieve the enhancement effect. The sigmoid function can be expressed as:
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$,  (4)
where z is the value at any point of the tensor output by the speech enhancement network.
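A short PyTorch illustration of the masking in equation (4), assuming the enhancement network outputs one value per time-frequency point (the shapes follow the 257-bin, 298-frame inputs of the embodiment; the random tensors are placeholders):

```python
import torch

# Noisy magnitude spectrum: (batch, 257 frequency bins, 298 frames).
noisy_spec = torch.rand(4, 257, 298)

# Raw (pre-activation) output of the enhancement network, same shape as the input.
enhancer_out = torch.randn_like(noisy_spec)

ratio_mask = torch.sigmoid(enhancer_out)      # equation (4): values in (0, 1)
enhanced_spec = ratio_mask * noisy_spec       # element-wise masking of the noisy input
```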
5. Testing
1) Defining a test data set
$T = \{(x_1, y_1), \ldots, (x_{n_2}, y_{n_2})\}$,  (5)
where T is the test set, x and y are the speech and the corresponding original label, respectively, and $n_2$ is the number of utterances in the test set.
2) Presetting valid identification label
N labels regarded as valid identifications are preset for each speaker according to formula (2).
3) Clean speech recognition
Clean test speech is input directly into the speaker recognition network without passing through the speech enhancement network; the result produced by the network is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the clean-speech recognition rate is obtained.
4) Noisy speech recognition
Noisy test speech is input directly into the speaker recognition network without passing through the speech enhancement network; the result produced by the network is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the noisy-speech recognition rate is obtained.
5) Enhanced speech recognition
Noisy test speech is first input into the speech enhancement network, and the enhanced speech is then input into the speaker recognition network; the obtained result is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the recognition rate on enhanced noisy speech is obtained.
The system identification performance under the three conditions 3), 4) and 5) is compared to check whether the results meet expectations: the recognition rate on clean speech should be the highest, the recognition rate on noisy speech the lowest, and the recognition rate on enhanced noisy speech in between.
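The decision rule shared by the three test conditions above can be sketched in Python as follows (the valid-label set follows formula (2); the function names and toy numbers are illustrative):

```python
def valid_labels(speaker_id, num_speakers, n_parts):
    """The N labels regarded as valid identifications for one speaker, per formula (2)."""
    return {speaker_id + m * num_speakers for m in range(n_parts)}

def accuracy(predictions, speaker_ids, num_speakers, n_parts):
    """Top-1 accuracy: a prediction is correct if it falls in the speaker's valid-label set."""
    correct = sum(
        pred in valid_labels(spk, num_speakers, n_parts)
        for pred, spk in zip(predictions, speaker_ids)
    )
    return correct / len(predictions)

# Toy usage with C = 3 speakers and N = 2 parts.
print(accuracy(predictions=[0, 4, 2], speaker_ids=[0, 1, 1], num_speakers=3, n_parts=2))
```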
In this way, the method builds on the existing deep-learning-based text-independent speaker identification method, replacing the previous one-to-one correspondence between speaker speech and speaker labels with a multi-label scheme.
Examples
The technical scheme in the embodiment of the invention is clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criteria
The clean speech samples of this example are taken from the VoxCeleb1 data set, which was extracted from YouTube videos and contains 1251 speakers and approximately 150000 speech segments with an average length of 7.8 seconds. To test the robustness of the model, this embodiment uses the NOISEX-92 data set. White, Babble, Leopard, Volvo, Factory, Tank and Gun noises are selected from NOISEX-92 and mixed with the clean speech at a signal-to-noise ratio of 10 dB to obtain noisy speech for training and for testing the performance of the system in noisy environments. When mixing the noise data with the clean speech, the noise data are first divided into two parts, one mixed with the training-set data and the other with the test-set data, so that the training set and the test set never use the same noise data.
The invention adopts the Accuracy (Top-1 Accuracy) score as the objective evaluation criterion. When computing the Accuracy of the multi-label speaker identification method, an output is counted as a valid identification as long as it matches one of the preset valid identification labels, which satisfies the requirement of Top-1 Accuracy.
2. Parameter setting
1) Speaker recognition network
Referring to FIG. 1, the speaker recognition network used in this embodiment consists of four one-dimensional convolutional layers and a fully connected layer (dimension 1500). The kernel sizes of the one-dimensional convolutional layers are 5, 7, 1 and the strides are 1, 2, 1, respectively; a global average pooling layer is inserted between the last convolutional layer and the fully connected layer. The 257-dimensional spectrum, obtained with a window length of 25 ms and a frame shift of 10 ms, is used directly as the input. This example does not normalize the input data; instead, the magnitude spectrum is raised to the power of 0.3. Training uses fixed-length segments of 298 frames (257 frequency bins per frame) as the input for each piece of speech.
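A sketch of this feature extraction in PyTorch (the 16 kHz sampling rate, the Hann window and the synthetic signal are assumptions; the 25 ms window, 10 ms shift, 257-point magnitude spectrum, 0.3 exponent and 298-frame segment length come from the text):

```python
import torch

fs = 16000                                                  # assumed sampling rate
win, hop, n_fft = int(0.025 * fs), int(0.010 * fs), 512     # 25 ms window, 10 ms shift

signal = torch.randn(fs * 4)                                # stand-in for a 4 s utterance

spec = torch.stft(
    signal, n_fft=n_fft, hop_length=hop, win_length=win,
    window=torch.hann_window(win), return_complex=True,
)                                                           # (257, num_frames) complex spectrum

features = spec.abs() ** 0.3                                # magnitude spectrum raised to the power 0.3
segment = features[:, :298]                                 # fixed-length input of 298 frames
print(segment.shape)                                        # torch.Size([257, 298])
```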
2) Speech enhancement network
The enhancement model consists of 11 dilated convolutional layers; its specific structure is shown in FIG. 2. From the output of the final convolutional layer, a ratio mask of the same size is generated using the sigmoid function and multiplied by the original input to achieve speech enhancement. This embodiment uses ReLU as the non-linear activation. The final prediction result is obtained after the output layer passes through a softmax.
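A hedged PyTorch sketch of such a dilated-convolution mask estimator (the channel count, kernel size and dilation schedule are assumptions for illustration only; the 11-layer depth, ReLU activations and sigmoid ratio mask follow the text):

```python
import torch
import torch.nn as nn

class DilatedMaskNet(nn.Module):
    """11 dilated 1-D convolutions producing a sigmoid ratio mask (illustrative sketch)."""
    def __init__(self, bins=257, channels=128):
        super().__init__()
        layers, in_ch = [], bins
        for i in range(10):                       # 10 hidden dilated conv layers
            d = 2 ** (i % 5)                      # assumed dilation schedule: 1, 2, 4, 8, 16, ...
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
            in_ch = channels
        layers += [nn.Conv1d(in_ch, bins, kernel_size=1)]   # 11th layer: back to 257 bins
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_spec):                # (batch, 257, frames)
        mask = torch.sigmoid(self.net(noisy_spec))
        return mask * noisy_spec                  # enhanced spectrum

enhanced = DilatedMaskNet()(torch.rand(2, 257, 298))
print(enhanced.shape)                             # torch.Size([2, 257, 298])
```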
3) Data set composition
The original data set is partitioned into a training set and a test set; both sets contain the same speakers, and the proportion between training-set and test-set sizes is kept consistent across the embodiments. In this example, the test set and the training set are constructed at a ratio of 1:3. Before the algorithm starts, the training data are sorted by the corresponding speaker ID, i.e., utterances belonging to the same speaker are grouped together.
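An illustrative helper for such a per-speaker split at a 1:3 test-to-train ratio (the function and the toy data are assumptions, not part of the patent):

```python
from collections import defaultdict

def split_per_speaker(dataset, test_ratio=0.25):
    """Split (utterance, speaker) pairs so every speaker appears in both sets,
    with roughly test:train = 1:3 per speaker."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)              # group utterances by speaker ID

    train, test = [], []
    for spk, utts in sorted(by_speaker.items()):
        n_test = max(1, int(len(utts) * test_ratio))
        test += [(u, spk) for u in utts[:n_test]]
        train += [(u, spk) for u in utts[n_test:]]
    return train, test

data = [(f"utt{spk}_{i}", spk) for spk in range(3) for i in range(8)]
train_set, test_set = split_per_speaker(data)
```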
4) Setting of the value of N
On a data set of limited size, if N is too large, the data in each training subset become too sparse, which ultimately degrades the overall recognition performance of the system. Therefore, only the cases N = 2 and N = 3 are tested in this embodiment.
3. Concrete implementation process of method
1) Referring to FIG. 3 and FIG. 4, the algorithm is initialized according to equations (1) and (5) and the parameter settings described above. Training-data and test-data buffers are established for caching the data used in speaker recognition, and a speaker-label buffer is established for caching the labels used in training and testing. At any moment of model training, the following computations are performed: a new piece of speech data is acquired, windowed and transformed by the short-time Fourier transform to obtain a group of data containing 298 frames with 257 frequency points per frame; the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) Referring to the labels newly set in step 4 of FIG. 4, the label corresponding to the speech data is obtained and the label buffer is updated.
3) The data are input into the neural network, the output result is compared with the labels in the label buffer, the cross-entropy loss function is computed, and the model parameters are optimized by back-propagation.
For any moment of the model test, the following calculation is performed:
1) A new piece of speech data is acquired, windowed and transformed by the short-time Fourier transform to obtain a group of data containing 298 frames with 257 frequency points per frame; the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) The newly set label corresponding to the speech data is acquired, and the label buffer is updated.
3) The data are input into the neural network; referring to steps 11 to 13 in FIG. 4, the output result is compared with the labels in the label buffer to judge whether it matches one of them:
31) if yes, the identification is correct;
32) if no, the identification is incorrect;
4) Referring to steps 13 to 17 of FIG. 4, the speaker recognition rate of the model on the test set is obtained.
To demonstrate the speaker recognition performance in clean and noisy environments, this embodiment compares existing deep-learning-based speaker recognition with the method of the present invention. FIG. 5 shows the Accuracy scores of the existing deep-learning-based speaker recognition and of the method of the present invention in a clean environment and in various types of noisy environments before and after enhancement, and FIG. 6 shows a line graph of the same results.
In FIG. 5, D denotes training with clean speech and D_N denotes training with noisy speech. "Baseline" indicates the existing scheme, while "Proposed (N = 3)" and "Proposed (N = 2)" indicate the method of the invention with N = 3 and N = 2, respectively. The results show that the invention performs better than the existing scheme on clean speech and on all types of enhanced noisy speech. Moreover, the recognition rate on enhanced noisy speech is clearly higher than before enhancement, and the enhancement effect of the invention is more pronounced.
In FIG. 6, "Baseline (original)", "Proposed (N = 3) (original)" and "Proposed (N = 2) (original)" denote the speaker recognition rates on clean speech and on noisy speech without speech enhancement, while "Baseline (enhanced)", "Proposed (N = 3) (enhanced)" and "Proposed (N = 2) (enhanced)" denote the speaker recognition rates on noisy speech after speech enhancement. The results show that, for noisy speech without enhancement, the method of the invention performs almost identically for N = 2 and N = 3; for enhanced noisy speech and clean speech, the method performs better with N = 2 and is clearly superior to the existing scheme.
As can be seen from the results shown in FIG. 5 and FIG. 6, the deep-learning-based speaker recognition method of the invention can further improve the recognition performance of the model in both clean and noisy environments.

Claims (3)

1. A text-independent multi-label speaker recognition method based on deep learning, characterized by comprising the following steps:
Step 1: evenly divide the speech of each speaker in a training data set into N parts and assign a different label to each part, so that the number of labels in the whole training data set is N times the number of speakers, where N ≥ 2;
Step 2: construct a neural network model whose output-layer dimension is consistent with the total number of labels in the training data set;
Step 3: input the training data set of step 1 into the neural network model of step 2, compare the output result with the label corresponding to the speech data, and compute the cross-entropy loss function, thereby performing supervised training;
Step 4: for the test set, preset for the speech data of each speaker the N labels regarded as valid identifications, according to the correspondence between training-set speech and labels established in step 1; input the test data into the neural network model and compare the label predicted by the model with the N preset valid labels; the identification is correct as long as the predicted label is one of the N valid labels, and the speaker identification rate of the model is thereby obtained.
2. The text-independent multi-label speaker recognition method based on deep learning according to claim 1, wherein the neural network model comprises a speech enhancement network and a speaker recognition network, the speech enhancement network being used to enhance noisy speech and improve the robustness of the neural network model; and wherein, in step 3, the speaker recognition network is trained in advance and its parameters are locked after training converges, and the complete neural network model comprising the speech enhancement network and the speaker recognition network is then trained end-to-end.
3. The text-independent multi-label speaker recognition method based on deep learning according to claim 1, wherein, in step 3, the specific formula of the cross-entropy loss function is:
$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$
where C is the total number of speakers; $p_i$ is the ground-truth indicator determined from the speech data according to its label $y_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero; and $q_i$ is the probability predicted by the system at each classification position.
CN202010563201.8A 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning Active CN111667836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111667836A true CN111667836A (en) 2020-09-15
CN111667836B CN111667836B (en) 2023-05-05

Family

ID=72388943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563201.8A Active CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111667836B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507A (en) * 2021-03-15 2021-07-02 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629A (en) * 2011-10-17 2013-04-17 阿里巴巴集团控股有限公司 Method and device for detecting noise data
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural network and spectrogram
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 Audio data recognition method and human-voice anti-replay identification system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629A (en) * 2011-10-17 2013-04-17 阿里巴巴集团控股有限公司 Method and device for detecting noise data
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural network and spectrogram
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 Audio data recognition method and human-voice anti-replay identification system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林舒都; 邵曦: "Speaker recognition based on i-vector and deep learning" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507A (en) * 2021-03-15 2021-07-02 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment
CN113066507B (en) * 2021-03-15 2024-04-19 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Also Published As

Publication number Publication date
CN111667836B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
Ding et al. Autospeech: Neural architecture search for speaker recognition
US9595257B2 (en) Downsampling schemes in a hierarchical neural network structure for phoneme recognition
CN103824557B (en) A kind of audio detection sorting technique with custom feature
Xue et al. Online end-to-end neural diarization with speaker-tracing buffer
Kim et al. Environmental noise embeddings for robust speech recognition
US11100932B2 (en) Robust start-end point detection algorithm using neural network
JP2000099080A (en) Voice recognizing method using evaluation of reliability scale
US7617101B2 (en) Method and system for utterance verification
CN109192200A (en) A kind of audio recognition method
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
CN110751278A (en) Neural network bit quantization method and system
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning
WO1995034064A1 (en) Speech-recognition system utilizing neural networks and method of using same
KR102429656B1 (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
Zhang et al. Deep Template Matching for Small-Footprint and Configurable Keyword Spotting.
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
Matsui et al. N-best-based unsupervised speaker adaptation for speech recognition
Yu et al. Bayesian adaptive inference and adaptive training
Li et al. Neural discriminant analysis for deep speaker embedding
Reshma et al. A survey on speech emotion recognition
Łopatka et al. State sequence pooling training of acoustic models for keyword spotting
US7912715B2 (en) Determining distortion measures in a pattern recognition process
Lin et al. Exploiting polynomial-fit histogram equalization and temporal average for robust speech recognition.
Sun et al. Combination of sparse classification and multilayer perceptron for noise-robust ASR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant