CN111951783A - Speaker recognition method based on phoneme filtering - Google Patents
- Publication number
- CN111951783A (application CN202010810083.6A / CN202010810083A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- voice
- speaker
- recognition
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a speaker recognition method based on phoneme filtering, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. To overcome the problem that traditional speaker recognition technology does not account for the influence of speech content information, the invention proposes a speaker recognition method based on phoneme filtering. The method builds a phoneme filter for each phoneme and, before speaker recognition, selects the filter corresponding to each frame's phoneme to remove content information. This reduces the influence of content information on speaker recognition and effectively improves recognition accuracy. The invention is characterized in that it comprises a model training stage and a testing stage, wherein model training comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.
Description
Technical Field
The invention belongs to the technical field of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on phoneme filtering.
Background
Speaker recognition refers to identifying a speaker from the speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition is receiving increasing attention and is widely applied in many fields, for example identity authentication, tracking criminals over telephone channels, identity confirmation in court based on telephone recordings, telephone voice tracking, and voice-controlled opening and closing of security doors. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and other fields.
In 2011, Kenny proposed the i-vector speaker recognition method based on the Gaussian mixture model, which achieved the best performance at that time. With the large-scale application of deep neural networks, the d-vector speaker recognition method based on deep neural networks received increasing attention from 2014 on; compared with the traditional Gaussian mixture model, a deep neural network has stronger descriptive power and can better model very complex data distributions. In 2017, Snyder took the temporal information of speech into account and proposed the x-vector speaker recognition method based on a time-delay neural network. The current state-of-the-art method in speaker recognition is the x-vector. It first preprocesses the speech data, extracts MFCC features, performs voice activity detection, and removes silent parts. The MFCC features of each speech frame are fed into a time-delay neural network to obtain a per-frame output; the outputs of all frames of a speech are pooled into their mean, from which the speaker of the speech is identified. Although this method achieves good results, it does not directly analyze and study the difficulty of speaker recognition. The difficulty is that speaker information is entangled with other information in the speech (e.g., noise, channel, content), and the principle of this mutual entanglement is unknown. Therefore, when analyzing speaker information, the uncertainty of other factors, especially the uncertainty of the speech content information, degrades system performance. Existing speaker recognition technology does not emphasize the influence of unmatched speech content on recognizing the speaker factor.
Disclosure of Invention
The invention aims to provide a speaker recognition method based on phoneme filtering, aiming at overcoming the problem that the influence of voice content information is not considered in the traditional speaker recognition technology. The method establishes a phoneme filter for each phoneme of the voice, and selects a corresponding phoneme filter to remove the content information according to the phoneme corresponding to each frame of voice when the speaker is identified. Therefore, the influence of the content information on the speaker identification is reduced, and the accuracy of the speaker identification is effectively improved.
The invention provides a speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage. As shown in fig. 1, model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition. The method specifically comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech. Each training speech x_i is divided into frames, and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech.
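As an illustration (not part of the patent), the framing step of 1-1) can be sketched as follows; the frame length, hop size, and the subsequent 23-dimensional MFCC computation (e.g. via librosa) are assumptions:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames (e.g. 25 ms windows
    with a 10 ms hop at 16 kHz). MFCC features x_i^t would then be
    computed per frame; only the framing step is sketched here."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# 1 second of a synthetic 16 kHz signal -> 98 frames of 400 samples
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
frames = frame_signal(sig)
print(frames.shape)
```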
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech, and N is the total number of phonemes.
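The patent relies on an external phoneme recognizer for this step. As a hypothetical sketch, per-frame phoneme labels q_t could be derived from such a recognizer's posterior matrix by an argmax decision:

```python
import numpy as np

def frame_phonemes(posteriors):
    """Given a (n_frames, N) matrix of per-frame phoneme posteriors
    produced by some phoneme recognizer, return the 1-based phoneme
    label q_t for each frame (argmax decision)."""
    return np.argmax(posteriors, axis=1) + 1  # labels in 1..N

post = np.array([[0.1, 0.7, 0.2],
                 [0.6, 0.3, 0.1]])
print(frame_phonemes(post))  # -> [2 1]
```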
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N). f_n can be a deep neural network, or another linear or nonlinear function, with parameter θ_n. The phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely: y_i^t = f_n(x_i^t; θ_n).
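A minimal sketch of the per-frame filter selection in step 1-3), using one linear transform per phoneme as a stand-in for the per-phoneme networks f_n (the patent allows deep networks; the dimensions and random weights here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 39, 23  # assumed: 39 phonemes, 23-dim MFCC features

# One linear filter (W_n, b_n) per phoneme, standing in for f_n.
W = rng.standard_normal((N, D, D)) * 0.1
b = np.zeros((N, D))

def phoneme_filter(x, q):
    """Apply filter f_{q_t} to each frame feature x_t.
    x: (T, D) frame features; q: (T,) 1-based phoneme labels."""
    return np.stack([W[qt - 1] @ xt + b[qt - 1] for xt, qt in zip(x, q)])

x = rng.standard_normal((300, D))      # one utterance of 300 frames
q = rng.integers(1, N + 1, size=300)   # per-frame phoneme labels
y = phoneme_filter(x, q)               # phoneme-filtered frame features
print(y.shape)                         # (300, 23)
```

Pooling (step 1-4) would then reduce `y` to its frame-wise mean, `y.mean(axis=0)`.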
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech. For example, the mean for the i-th training speech is: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t.
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker.
1-6) minimizing cross entropy
The objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CE(z'_i, z_i).
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, N) and the parameter φ of the speaker recognition network g are obtained through training.
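As a non-authoritative illustration of step 1-6), the cross-entropy objective can be computed as follows; the actual minimization over θ_n and φ would use a gradient-based optimizer, which is not shown:

```python
import numpy as np

def softmax(logits):
    # numerically stable row-wise softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true speaker.
    probs: (I, n_speakers) predicted probabilities z'_i;
    labels: (I,) integer speaker labels z_i."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.0], [0.0, 2.0]])  # toy network outputs
probs = softmax(logits)
labels = np.array([0, 1])
loss = cross_entropy(probs, labels)
print(round(loss, 4))
```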
The model training stage ends, yielding the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
2) The testing stage specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstrum feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech.
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/T) Σ_{t=1}^{T} y_t.
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, yielding the probability z' = g(y; φ) of the speech belonging to each speaker.
And finishing the speaker recognition corresponding to the test voice.
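The test stage 2-1) through 2-5) can be sketched end to end. This is an illustrative stand-in, not the patent's implementation: linear filters and a linear classifier replace the trained deep networks, and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, S = 39, 23, 10   # assumed: phonemes, MFCC dims, speakers

# Stand-ins for the trained components f_n and g.
W = rng.standard_normal((N, D, D)) * 0.1   # per-phoneme filters
Wg = rng.standard_normal((S, D)) * 0.1     # speaker classifier

def recognize_speaker(x, q):
    """Test pipeline: per-frame phoneme filtering (2-3), mean
    pooling (2-4), then speaker probabilities via softmax (2-5)."""
    y_frames = np.stack([W[qt - 1] @ xt for xt, qt in zip(x, q)])
    y = y_frames.mean(axis=0)
    logits = Wg @ y
    e = np.exp(logits - logits.max())
    return e / e.sum()  # probability of each speaker

x = rng.standard_normal((328, D))    # T = 328 frames, as in the embodiment
q = rng.integers(1, N + 1, size=328)
z = recognize_speaker(x, q)
print(z.shape, round(z.sum(), 6))    # (10,) 1.0
```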
The invention has the characteristics and beneficial effects that:
compared with existing speaker recognition technology, the invention focuses on reducing the influence of speech content information on speaker recognition. Since speech primarily carries content information, the speaker information is comparatively weak; it is easily submerged in the content information and hard to identify. The invention constructs a filter for each phoneme and filters out the phoneme information before speaker recognition, thereby reducing the influence of the speech content information on speaker recognition. The method of the invention improves the accuracy of speaker identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a speaker recognition method based on phoneme filtering, comprising a model training stage and a testing stage. As shown in fig. 1, model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, cross-entropy minimization, etc. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, etc. A specific embodiment is described in further detail below.
1) A model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech, z_i is the speaker label corresponding to the i-th training speech, and I is the total number of training speeches. Each training speech x_i is divided into frames and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes its total number of frames. In this embodiment, the number of training speeches is I = 8000, the Mel cepstrum feature of each frame is 23-dimensional, all training speeches have the same length, and each speech has T_i = 300 frames.
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes. In this embodiment, the phoneme recognizer is the open-source phoneme recognizer released by Brno University of Technology, with a total of N = 39 phonemes. The phoneme corresponding to each frame of each speech is obtained from this recognizer.
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N). f_n can be a deep neural network, or another linear or nonlinear function, with parameter θ_n. The phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely y_i^t = f_n(x_i^t; θ_n). In this embodiment, since the total number of phonemes is 39, 39 phoneme filters f_n (n = 1, …, 39) are constructed. Each phoneme filter is a 5-layer deep neural network with parameter θ_n; the n-th phoneme filter filters the n-th phoneme. For example, if the 125th frame of the 5th speech, x_5^125, belongs to the 13th phoneme, i.e. q_5^125 = 13, the phoneme filter f_13 is selected, namely: y_5^125 = f_13(x_5^125; θ_13).
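A hypothetical sketch of one 5-layer filter from this embodiment, written with plain numpy; the hidden width, activation choice, and random initialization are assumptions, and training is not shown:

```python
import numpy as np

rng = np.random.default_rng(2)
D, HIDDEN, LAYERS = 23, 64, 5  # 23-dim MFCC; hidden width assumed

def make_filter():
    """One per-phoneme filter f_n: a 5-layer fully connected network
    mapping a 23-dim frame feature to a 23-dim filtered feature."""
    dims = [D] + [HIDDEN] * (LAYERS - 1) + [D]
    return [(rng.standard_normal((dims[k + 1], dims[k])) * 0.1,
             np.zeros(dims[k + 1])) for k in range(LAYERS)]

def apply_filter(params, x):
    h = x
    for k, (Wk, bk) in enumerate(params):
        h = Wk @ h + bk
        if k < len(params) - 1:    # ReLU on all but the output layer
            h = np.maximum(h, 0.0)
    return h

filters = [make_filter() for _ in range(39)]       # one per phoneme
y = apply_filter(filters[12], rng.standard_normal(D))  # f_13 on one frame
print(y.shape)  # (23,)
```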
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t.
In this embodiment, the mean of the phoneme-filtered features of the i-th training speech is: y_i = (1/300) Σ_{t=1}^{300} y_i^t.
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker. In this embodiment, the speaker recognition network is an 8-layer deep neural network.
1-6) minimizing cross entropy
In this embodiment, the objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_39, φ} Σ_{i=1}^{I} CE(z'_i, z_i).
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, 39) and the parameter φ of the speaker recognition network g are obtained through training.
The model training stage ends, yielding the phoneme filter f_n (n = 1, …, 39) corresponding to each phoneme and the speaker recognition network g.
2) A testing stage; the method specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstrum feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech. In this embodiment, T = 328.
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, 39}, where q_t is the phoneme corresponding to the t-th frame of the test speech and 39 is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/328) Σ_{t=1}^{328} y_t.
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, yielding the probability z' = g(y; φ) of the speech belonging to each speaker.
And finishing the speaker recognition corresponding to the test voice.
The method of the present invention can be implemented by a program, which can be stored in a computer-readable storage medium, as will be understood by those skilled in the art.
While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (3)
1. A speaker recognition method based on phoneme filtering is characterized by comprising a model training stage and a testing stage, wherein the model training stage comprises a speech preprocessing stage, a phoneme recognition stage, a phoneme filtering stage, a pooling stage, a speaker recognition stage and a minimum cross entropy stage; the testing stage comprises the stages of voice preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.
2. The method as claimed in claim 1, wherein the model training stage comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech; each training speech x_i is divided into frames and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech;
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes;
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N), where f_n can be a deep neural network or another linear or nonlinear function with parameter θ_n; the phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out; according to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely: y_i^t = f_n(x_i^t; θ_n);
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech, where the mean for the i-th training speech is: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t;
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker;
1-6) minimizing cross entropy
The objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CE(z'_i, z_i);
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, N) and the parameter φ of the speaker recognition network g are obtained through training;
the model training stage ends, yielding the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
3. The method as claimed in claim 2, wherein the testing stage comprises the following steps:
2-1) Speech preprocessing
Framing the test speech x and extracting the Mel cepstrum feature x_t (t = 1, …, T) corresponding to each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech;
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes;
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t; the phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n);
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/T) Σ_{t=1}^{T} y_t;
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, obtaining the probability z' = g(y; φ) of the speech belonging to each speaker;
and finishing the speaker recognition corresponding to the test voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810083.6A CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810083.6A CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111951783A true CN111951783A (en) | 2020-11-17 |
CN111951783B CN111951783B (en) | 2023-08-18 |
Family
ID=73332504
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010810083.6A Active CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111951783B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5131043A (en) * | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
AU2004237046A1 (en) * | 2003-05-02 | 2004-11-18 | Giritech A/S | Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers |
CN1991976A (en) * | 2005-12-31 | 2007-07-04 | 潘建强 | Phoneme based voice recognition method and system |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN108172214A (en) * | 2017-12-27 | 2018-06-15 | 安徽建筑大学 | A kind of small echo speech recognition features parameter extracting method based on Mel domains |
CN108564956A (en) * | 2018-03-26 | 2018-09-21 | 京北方信息技术股份有限公司 | A kind of method for recognizing sound-groove and device, server, storage medium |
CN109119069A (en) * | 2018-07-23 | 2019-01-01 | 深圳大学 | Specific crowd recognition methods, electronic device and computer readable storage medium |
CN110600018A (en) * | 2019-09-05 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Voice recognition method and device and neural network training method and device |
-
2020
- 2020-08-12 CN CN202010810083.6A patent/CN111951783B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5131043A (en) * | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
AU2004237046A1 (en) * | 2003-05-02 | 2004-11-18 | Giritech A/S | Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers |
CN1991976A (en) * | 2005-12-31 | 2007-07-04 | 潘建强 | Phoneme based voice recognition method and system |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN108172214A (en) * | 2017-12-27 | 2018-06-15 | 安徽建筑大学 | A kind of small echo speech recognition features parameter extracting method based on Mel domains |
CN108564956A (en) * | 2018-03-26 | 2018-09-21 | 京北方信息技术股份有限公司 | A kind of method for recognizing sound-groove and device, server, storage medium |
CN109119069A (en) * | 2018-07-23 | 2019-01-01 | 深圳大学 | Specific crowd recognition methods, electronic device and computer readable storage medium |
CN110600018A (en) * | 2019-09-05 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Voice recognition method and device and neural network training method and device |
Non-Patent Citations (3)
Title |
---|
Wang Changlong; Zhou Fucai; Ling Yuping; Yu Feng: "Speaker recognition method based on characteristic phonemes", Chinese Journal of Scientific Instrument, no. 10 *
Tan Ping; Xing Yujuan: "Improvement of text-dependent speaker recognition methods in noisy environments", Journal of Xi'an Polytechnic University, no. 05 *
Chen Lei; Yang Jun'an; Wang Yi; Wang Long: "A feature extraction method based on discriminative and adaptive bottleneck deep belief networks for LVCSR systems", Journal of Signal Processing, no. 03 *
Also Published As
Publication number | Publication date |
---|---|
CN111951783B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3719798B1 (en) | Voiceprint recognition method and device based on memorability bottleneck feature | |
CN107731233B (en) | Voiceprint recognition method based on RNN | |
US7904295B2 (en) | Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers | |
CN111276131A (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN111429935B (en) | Voice caller separation method and device | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
Todkar et al. | Speaker recognition techniques: A review | |
CN109473102A (en) | A kind of robot secretary intelligent meeting recording method and system | |
CN113113022A (en) | Method for automatically identifying identity based on voiceprint information of speaker | |
KR100779242B1 (en) | Speaker recognition methods of a speech recognition and speaker recognition integrated system | |
Park et al. | The Second DIHARD Challenge: System Description for USC-SAIL Team. | |
Krishna et al. | Emotion recognition using dynamic time warping technique for isolated words | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
Dwijayanti et al. | Speaker identification using a convolutional neural network | |
CN113516987B (en) | Speaker recognition method, speaker recognition device, storage medium and equipment | |
CN111951783B (en) | Speaker recognition method based on phoneme filtering | |
Zailan et al. | Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context | |
Khanum et al. | A novel speaker identification system using feed forward neural networks | |
CN110875044B (en) | Speaker identification method based on word correlation score calculation | |
Juneja | Two-level noise robust and block featured PNN model for speaker recognition in real environment | |
Abd El-Moneim et al. | Effect of reverberation phenomena on text-independent speaker recognition based deep learning | |
Deepa et al. | A Report on Voice Recognition System: Techniques, Methodologies and Challenges using Deep Neural Network | |
Nosan et al. | Enhanced Feature Extraction Based on Absolute Sort Delta Mean Algorithm and MFCC for Noise Robustness Speech Recognition. | |
Dwijayanti et al. | JURNAL RESTI |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |