CN111667836A - Text-independent multi-label speaker recognition method based on deep learning - Google Patents
Text-independent multi-label speaker recognition method based on deep learning

- Publication number: CN111667836A (application CN202010563201.8A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications

- G10L17/04 — Speaker identification or verification; training, enrolment or model building
- G10L17/18 — Speaker identification or verification; artificial neural networks, connectionist approaches
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- Y02T10/40 — Engine management systems (climate-change mitigation tagging)
Abstract
The invention discloses a text-independent multi-label speaker recognition method based on deep learning. The method comprises the following steps: (1) dividing the speech of each speaker in the training data set into N equal parts and giving each part a different label; (2) constructing a corresponding neural network model whose output-layer dimension equals the number of labels in the training data set; (3) inputting training data into the neural network, comparing the output-layer result with the label corresponding to the data, and computing a cross-entropy loss function to carry out training; (4) for the test-set data, presetting N labels regarded as valid identifications for each speaker's speech according to the correspondence established in step (1), inputting the test data into the network, and comparing the label predicted by the model with the preset N labels: the speaker is identified correctly as long as one of them matches. The method effectively improves the speaker recognition performance of the model in both clean and noisy environments.
Description
Technical Field
The invention relates to a text-independent multi-label speaker recognition method based on deep learning.
Background
Speaker recognition, also known as voiceprint recognition, aims to identify a speaker from his or her speech characteristics. Speaker recognition comprises two tasks: speaker identification and speaker verification. Speaker identification determines, after processing and analysing the corresponding speech, whether a speaker belongs to a set of enrolled speakers; speaker verification further confirms whether the speaker of an input utterance is the claimed target speaker.
The i-vector method can be used to achieve speaker recognition (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011). The literature (D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero and Y. Carmiel, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Interspeech, 2017, pp. 999-1003) indicates that deep learning methods can already exceed the conventional i-vector method once large-scale data, and in particular data augmentation, is used. However, speaker recognition in noisy environments remains a challenging problem.
A denoising auto-encoder (DAE) may be used to generate enhanced speech from noisy speech, thereby improving speaker recognition performance in noisy scenes (O. Plchot, L. Burget, H. Aronowitz and P. Matějka, "Audio Enhancement with DNN Autoencoder for Speaker Verification," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5090-5094). However, because this method uses an L2 loss to enhance the speech first and performs speaker recognition afterwards, the speech enhancement and the speaker recognition are mismatched, so a high recognition rate on noisy speech cannot be achieved. The literature (Suwon Shon, Hao Tang and James Glass, "VoiceID Loss: Speech Enhancement for Speaker Verification," in Interspeech, 2019, pp. 2888-2892) uses an end-to-end architecture for speech enhancement and speaker recognition, but the overall performance of the algorithm still has room for improvement.
When deep learning is used for speaker recognition, large-scale data is required for training; however, as the amount of data grows further, the improvement in the model's discriminative capability slows down. Having each speaker correspond one-to-one with a single label for his or her speech data is not the most efficient way to use the data.
Disclosure of Invention
In the existing deep-learning-based speaker recognition training strategy, speakers correspond one-to-one with the labels of their speech data, so the data cannot be exploited to the maximum extent. The invention provides a text-independent multi-label speaker recognition method based on deep learning, which further improves the discriminative capability of the model and its recognition performance in both clean and noisy environments.
The technical scheme adopted by the invention is as follows:
the text irrelevant multi-label speaker identification method based on deep learning comprises the following steps:
and 4, presetting N labels regarded as effective identification for the voice data of each speaker according to the corresponding relation between the voice of the training data set and the labels in the step 1 by the test set data, inputting the test data set data into the neural network model, comparing the labels predicted by the model with the N labels regarded as effective identification set before, and obtaining the speaker identification rate of the model as long as the predicted label is one of the N labels regarded as effective identification, namely correct identification.
Furthermore, the neural network model comprises a speech enhancement network and a speaker recognition network; the speech enhancement network enhances noisy speech and improves the robustness of the model. In step 3, the speaker recognition network is trained in advance and its parameters are locked after training converges; the complete neural network model, comprising the speech enhancement network and the speaker recognition network, is then trained end-to-end.
Further, in step 3, the cross-entropy loss function is

Loss = -Σ_{i=1}^{N·C} p_i log(q_i),

where C is the total number of speakers; p_i is the true identity confirmed from the label y_i of the speech data, i.e. exactly one of the N·C classification positions is 1 and all other positions are 0; and q_i is the probability predicted by the system at each classification position.
The invention replaces the one-to-one correspondence between speaker speech and speaker labels used in previous deep-learning-based speaker recognition methods with a multi-label scheme. The method further improves the speaker recognition performance of the model in both clean and noisy environments.
Drawings
FIG. 1 is a schematic diagram of a neural network in an embodiment of the present invention; wherein N is the number of parts of the training data set.
Fig. 2 is a specific structure of a speech enhancement network in an embodiment of the present invention.
FIG. 3 is a flow chart of the text-independent multi-label speaker recognition method based on deep learning according to the invention.
FIG. 4 is a flow chart of the specific algorithm of the method of the present invention, where n1 and n2 are the numbers of training-set and test-set utterances respectively; D is the training set, T is the test set, and C is the number of speakers.
FIG. 5 is a comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and for various types of noisy speech.
FIG. 6 is a line graph of the comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and various types of noisy speech.
Detailed Description
The text-independent multi-label speaker recognition method based on deep learning mainly comprises the following parts:
1. segmenting a training data set
1) Defining a training data set
D = {(x_1, y_1), ..., (x_{n1}, y_{n1})},  (1)
where D is the training set, x and y are the speech and the corresponding original label respectively, and n1 is the number of training-set utterances;
2) training set speech data labels
The speech of each speaker in the training data set is divided into N equal parts, and each part is given a different label, so that the number of labels of the whole training data set is N times the number of speakers. This can be expressed as

ỹ_i = y_i + m·C,  m = 0, 1, ..., N-1,  (2)

where y_i is the label of a piece of speech data in the training set in the initial state, ỹ_i is the modified label, C is the total number of speakers, and m indexes which of the N parts of the speaker's speech the piece belongs to.
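As a sketch of the relabeling step (assuming the mapping ỹ = y + m·C of equation (2); the function name and the contiguity assumption are illustrative, not from the patent):

```python
from collections import Counter

def expand_labels(utterance_labels, num_speakers, n_parts):
    """Relabel each utterance: part m of speaker y gets label y + m*num_speakers,
    so the label set grows to n_parts * num_speakers entries.
    `utterance_labels` holds original speaker IDs (0..num_speakers-1), sorted so
    that each speaker's utterances are contiguous, as the embodiment requires."""
    counts = Counter(utterance_labels)   # utterances per speaker
    seen = Counter()                     # running index within each speaker
    new_labels = []
    for y in utterance_labels:
        k = seen[y]
        seen[y] += 1
        # which of the n_parts equal portions this utterance falls into
        m = min(k * n_parts // counts[y], n_parts - 1)
        new_labels.append(y + m * num_speakers)
    return new_labels
```

With two speakers and N = 2, `expand_labels([0, 0, 0, 0, 1, 1, 1, 1], 2, 2)` yields `[0, 0, 2, 2, 1, 1, 3, 3]`: four distinct labels, i.e. N times the speaker count.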
2. Constructing neural networks
A corresponding neural network model is constructed, comprising a speech enhancement network and a speaker recognition network, with the output-layer dimension equal to the total number of labels of the training data set. The speech enhancement network enhances noisy speech, and the enhanced speech is then recognized by the speaker recognition network. The whole network makes it possible to compare the model's recognition performance on clean speech, noisy speech and enhanced noisy speech, thereby assessing both recognition performance and robustness.
3. Constructing noisy speech data set
The data set (regarded as the clean speech data set at this point) is mixed with different types of noise data at a given signal-to-noise ratio to obtain a noisy speech data set.
4. Training
1) Pre-training
First the speaker recognition network is trained with clean speech: training data is input into the neural network, the output-layer result is compared with the label corresponding to the data, and a cross-entropy loss function is computed for supervised training; the parameters of the speaker recognition network are locked after training converges. Because the number of categories distinguished by the constructed network is N times the number of speakers in the data set, the number of classification positions involved in the cross-entropy loss correspondingly becomes N times that of the initial state:

Loss = -Σ_{i=1}^{N·C} p_i log(q_i),  (3)

where N is the number of parts the training data set is divided into, C is the total number of speakers, p_i is the true identity confirmed from the label arrangement described above, i.e. exactly one of the N·C classification positions is 1 and all others are 0, and q_i is the probability predicted by the system at each classification position.
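A minimal numerical sketch of this loss over the N·C output positions (the function name is illustrative; with a one-hot p, the sum -Σ p_i log q_i collapses to -log q at the single true position):

```python
import numpy as np

def cross_entropy_loss(logits, true_index):
    """Cross-entropy over the N*C output positions: p is one-hot at
    `true_index`, q is the softmax of the output layer, and the loss
    -sum_i p_i*log(q_i) reduces to -log(q) at the single true position."""
    z = logits - np.max(logits)            # shift for numerical stability
    q = np.exp(z) / np.sum(np.exp(z))      # predicted probabilities q_i
    return -np.log(q[true_index])
```

For example, if the softmax assigns probability 0.7 to the true position, the loss is -log(0.7) ≈ 0.357.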
2) Training
A speech enhancement network is introduced, and the neural network model comprising the speech enhancement network and the parameter-locked speaker recognition network is trained end-to-end with noisy speech data. The speech enhancement network uses the sigmoid function to compute a ratio mask, which is multiplied element-wise with the input speech data to achieve the enhancement effect. The sigmoid function is

σ(z) = 1 / (1 + e^{-z}),  (4)

where z is the value at any point of the tensor output by the speech enhancement network.
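The masking step can be sketched as follows (function names are illustrative; the network producing `mask_logits` is assumed, not implemented here):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid of equation (4): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def enhance(noisy_spectrum, mask_logits):
    """Squash the enhancement network's raw output through the sigmoid to
    obtain a ratio mask, then multiply it element-wise with the noisy input."""
    return sigmoid(mask_logits) * noisy_spectrum
```

Because the mask is bounded in (0, 1), each time-frequency point of the input can only be attenuated, never amplified.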
5. Testing
1) Defining a test data set
T = {(x_1, y_1), ..., (x_{n2}, y_{n2})},  (5)
where T is the test set, x and y are the speech and the corresponding original label respectively, and n2 is the number of test-set utterances.
2) Presetting valid identification label
N labels which can be regarded as effective identification are preset for each speaker according to the formula (2).
3) Pure speech recognition
Clean test speech is input directly into the speaker recognition network without passing through the speech enhancement network, and the network's prediction is compared with the preset valid identification labels. A match with any one of them is a correct identification, otherwise it is an error; the clean speech recognition rate is obtained in this way.
4) Noisy speech recognition
Noisy test speech is input directly into the speaker recognition network without passing through the speech enhancement network, and the prediction is compared with the preset valid identification labels in the same way, yielding the noisy speech recognition rate.
5) Enhanced speech recognition
Noisy speech is first input into the speech enhancement network, and the enhanced speech is then input into the speaker recognition network; the result is compared with the preset valid identification labels to decide correct or erroneous identification, yielding the enhanced noisy speech recognition rate.
The system's identification performance under conditions 3), 4) and 5) is compared to check that the result meets expectation: the recognition rate on clean speech is the highest, that on noisy speech is the lowest, and that on enhanced noisy speech lies in between.
Therefore, on the basis of the existing text-independent deep-learning speaker recognition method, this method replaces the one-to-one correspondence between speaker speech and speaker labels with a multi-label scheme.
Examples
The technical scheme in the embodiment of the invention is clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criteria
The clean speech samples of this example come from the VoxCeleb1 data set, which was extracted from YouTube videos and contains 1251 speakers and approximately 150000 speech segments with an average length of 7.8 seconds. To test the robustness of the model, this embodiment uses the NOISEX-92 data set. White, Babble, Leopard, Volvo, Factory, Tank and Gun noises are selected from NOISEX-92 and mixed with clean speech at a signal-to-noise ratio of 10 dB to obtain noisy speech for training and for testing system performance in noisy environments. When mixing noise with clean speech, the noise data is first split into two parts, one mixed with the training set and the other with the test set, so that the training and test sets never share the same noise data.
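The 10 dB mixing step can be sketched as below (the function name is illustrative; tiling the noise to the clean signal's length is an assumption about the mixing procedure):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=10.0):
    """Add noise to clean speech at the requested signal-to-noise ratio.
    The noise is tiled/cropped to the clean signal's length and rescaled
    so that 10*log10(P_clean / P_noise) equals snr_db."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

The scale factor follows directly from solving 10·log10(P_clean / (scale²·P_noise)) = snr_db for `scale`.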
The invention adopts Top-1 Accuracy as the objective evaluation standard. When computing the accuracy of the multi-label speaker recognition method, an output counts as a valid identification as long as it matches one of the preset valid labels, which still satisfies the Top-1 Accuracy requirement.
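A sketch of this multi-label Top-1 scoring (assuming the valid-label set {y + m·C} from equation (2); function names are illustrative):

```python
def valid_labels(speaker, num_speakers, n_parts):
    """The N labels preset as valid identifications for one speaker,
    assuming the relabeling rule label = speaker + m*num_speakers."""
    return {speaker + m * num_speakers for m in range(n_parts)}

def top1_accuracy(predictions, speakers, num_speakers, n_parts):
    """A prediction is correct if it matches ANY of the speaker's N valid
    labels; the score is still Top-1 since only the model's single best
    prediction is considered."""
    hits = sum(p in valid_labels(y, num_speakers, n_parts)
               for p, y in zip(predictions, speakers))
    return hits / len(predictions)
```

For example, with 3 speakers and N = 2, speaker 0's valid set is {0, 3}, so a model predicting label 3 for one of speaker 0's utterances is counted as correct.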
2. Parameter setting
1) Speaker recognition network
Referring to FIG. 1, the speaker recognition network used in this embodiment consists of four one-dimensional convolution layers and a fully connected layer (dimension 1500). The kernel sizes of the one-dimensional convolution layers are 5, 7, 1 respectively, with strides of 1, 2, 1 respectively; a global average pooling layer is applied between the last convolution layer and the fully connected layer. A window of length 25 ms with a 10 ms frame shift is used, and the 257-dimensional spectrum is taken directly as input. This example does not normalize the input data but raises the magnitude spectrum to the power 0.3. Training uses fixed-length segments of 298 frames (257 frequency bins per frame) as input.
2) Speech enhancement network
The enhancement part consists of 11 dilated convolution layers; the specific structure is shown in FIG. 2. The final convolution output is passed through the sigmoid function to generate a ratio mask of the same size, which is multiplied with the original input to achieve speech enhancement. This embodiment uses ReLU as the nonlinear activation. The final layer outputs the prediction after a softmax.
3) Data set composition
The original data set is partitioned into a training set and a test set; both contain the same number of speakers, and the ratio of training-set to test-set size must be consistent across all embodiments. In this example the test set and training set are constructed at a 1:3 ratio. Before the algorithm starts, the training data is sorted by speaker ID, i.e. utterances belonging to the same speaker are grouped together.
4) Setting of N value
On a data set of limited size, an overly large N makes the data of each training subset too sparse and ultimately degrades the system's overall recognition performance. Therefore only the cases N = 2 and N = 3 are tested in this embodiment.
3. Concrete implementation process of method
1) Referring to FIG. 3 and FIG. 4, the algorithm is initialized according to equations (1) and (5) and the parameter settings above. Training-data and test-data buffers are established for the data used in speaker recognition, and a label buffer for the labels used in training and testing. At any moment of model training the following computation is performed: a new piece of speech data is acquired, windowed and short-time Fourier transformed to obtain a group of 298 frames with 257 frequency bins per frame, the magnitude spectrum is raised to the power 0.3, and the data buffer is updated.
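The feature extraction described here can be sketched as follows (the 512-point FFT giving 257 bins and the Hamming window are assumptions, since the patent states only the window length, frame shift, and bin count):

```python
import numpy as np

def spectral_features(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=512, power=0.3):
    """Windowed STFT magnitude features: 25 ms window, 10 ms frame shift,
    512-point FFT giving 257 frequency bins, and the magnitude spectrum
    raised to the power 0.3. Returns an array of shape (n_frames, 257)."""
    win = sr * win_ms // 1000            # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000            # 160 samples at 16 kHz
    window = np.hamming(win)             # window choice is an assumption
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return mag ** power
```

With a 3-second utterance at 16 kHz (48000 samples) this yields exactly the 298 × 257 input described above.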
2) Referring to the labels newly assigned in step 4 of FIG. 4, the label corresponding to the speech data is obtained and the label buffer is updated.
3) The data is input into the neural network, the output is compared with the labels in the label buffer, the cross-entropy loss function is computed, and the model parameters are optimized by back-propagation.
For any moment of the model test, the following calculation is performed:
1) and acquiring a new piece of voice data, performing short-time Fourier transform on windowing the new piece of voice data to acquire a group of data containing 298 frames and 257 frequency points in each frame, taking an exponential time of 0.3 for the amplitude spectrum, and updating the data cache.
2) The label newly assigned to the speech data is acquired and the label buffer is updated.
3) The data is input into the neural network; referring to steps 11 to 13 in FIG. 4, the output result is compared with the labels in the label buffer to judge whether any of them matches:
31) if yes, the identification is correct;
32) if no, the identification is erroneous;
4) referring to steps 13 to 17 of fig. 4, the speaker recognition rate of the model on the test set is found.
In order to show the performance of the speaker recognition in the clean environment and the noisy environment, the present embodiment compares the speaker recognition based on deep learning with the method of the present invention. Fig. 5 shows Accuracy scoring results of the prior deep learning-based speaker recognition and the method of the present invention in clean environments and before and after enhancement of various types of noisy environments, and fig. 6 shows a line graph of the results.
In FIG. 5, D denotes training with clean speech and D_N denotes training with noisy speech. "Baseline" indicates the existing scheme; "Proposed (N=3)" and "Proposed (N=2)" indicate the method of the invention with N equal to 3 and 2 respectively. The results show that the invention outperforms the existing scheme on clean speech and on every type of enhanced noisy speech. Moreover, the recognition rate on enhanced noisy speech is clearly higher than before enhancement, so the enhancement effect of the invention is more pronounced.
In FIG. 6, "Baseline (original)", "Proposed (N=3) (original)" and "Proposed (N=2) (original)" denote speaker recognition rates on clean and noisy speech without speech enhancement; "Baseline (enhanced)", "Proposed (N=3) (enhanced)" and "Proposed (N=2) (enhanced)" denote recognition rates on noisy speech after speech enhancement. The results show that on unenhanced noisy speech the method of the invention performs almost identically for N = 2 and N = 3; on enhanced noisy speech and clean speech the method works better with N = 2, clearly surpassing the existing scheme.
As can be seen from the results shown in fig. 5 and fig. 6, the speaker recognition method based on deep learning of the present invention can further improve the recognition performance of the model in both clean environment and noisy environment.
Claims (3)
1. A text-independent multi-label speaker recognition method based on deep learning, characterized by comprising the following steps:
step 1, dividing the speech of each speaker in a training data set into N equal parts and giving each part a different label, so that the number of labels of the whole training data set is N times the number of speakers, where N ≥ 2;
step 2, constructing a neural network model, wherein the dimension of an output layer of the model is consistent with the total number of labels of the training data set;
step 3, inputting the training data set in the step 1 into the neural network model in the step 2, comparing the output result with the label corresponding to the voice data, and solving a cross entropy loss function so as to perform supervised training;
step 4, presetting, for the speech data of each speaker in the test set, N labels regarded as valid identifications according to the speech-label correspondence of step 1; inputting the test data into the neural network model and comparing the label predicted by the model with the N preset valid labels, the identification being correct as long as the predicted label is one of the N valid labels, whereby the speaker recognition rate of the model is obtained.
2. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein the neural network model comprises a speech enhancement network and a speaker recognition network, the speech enhancement network enhancing noisy speech and improving the robustness of the neural network model; and wherein, in step 3, the speaker recognition network is trained in advance, its parameters are locked after training converges, and the complete neural network model comprising the speech enhancement network and the speaker recognition network is then trained end-to-end.
3. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein in step 3 the cross-entropy loss function is

Loss = -Σ_{i=1}^{N·C} p_i log(q_i),

where C is the total number of speakers; p_i is the true identity confirmed from the label y_i of the speech data, i.e. exactly one of the N·C classification positions is 1 and all other positions are 0; and q_i is the probability predicted by the system at each classification position.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010563201.8A CN111667836B (en) | 2020-06-19 | 2020-06-19 | Text irrelevant multi-label speaker recognition method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010563201.8A CN111667836B (en) | 2020-06-19 | 2020-06-19 | Text irrelevant multi-label speaker recognition method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111667836A true CN111667836A (en) | 2020-09-15 |
CN111667836B CN111667836B (en) | 2023-05-05 |
Family
ID=72388943
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010563201.8A Active CN111667836B (en) | 2020-06-19 | 2020-06-19 | Text irrelevant multi-label speaker recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111667836B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066507A (en) * | 2021-03-15 | 2021-07-02 | 上海明略人工智能(集团)有限公司 | End-to-end speaker separation method, system and equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049629A (en) * | 2011-10-17 | 2013-04-17 | 阿里巴巴集团控股有限公司 | Method and device for detecting noise data |
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
CN107464568A (en) * | 2017-09-25 | 2017-12-12 | 四川长虹电器股份有限公司 | Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system |
US10210860B1 (en) * | 2018-07-27 | 2019-02-19 | Deepgram, Inc. | Augmented generalized deep learning with special vocabulary |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | A kind of audio data recognition methods and human voice anti-replay identifying system |
Non-Patent Citations (1)
Title |
---|
LIN Shudu; SHAO Xi: "Speaker recognition based on i-vector and deep learning" * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066507A (en) * | 2021-03-15 | 2021-07-02 | 上海明略人工智能(集团)有限公司 | End-to-end speaker separation method, system and equipment |
CN113066507B (en) * | 2021-03-15 | 2024-04-19 | 上海明略人工智能(集团)有限公司 | End-to-end speaker separation method, system and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111667836B (en) | 2023-05-05 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||