CN111951783A - Speaker recognition method based on phoneme filtering - Google Patents

Speaker recognition method based on phoneme filtering

Info

Publication number
CN111951783A
CN111951783A
Authority
CN
China
Prior art keywords: phoneme, voice, speaker, recognition, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010810083.6A
Other languages
Chinese (zh)
Other versions
CN111951783B (en)
Inventor
陈仙红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010810083.6A priority Critical patent/CN111951783B/en
Publication of CN111951783A publication Critical patent/CN111951783A/en
Application granted granted Critical
Publication of CN111951783B publication Critical patent/CN111951783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 21/0208 — Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L 25/12 — Speech or voice analysis techniques in which the extracted parameters are prediction coefficients
    • G10L 25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/30 — Speech or voice analysis techniques using neural networks
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • Y02T 10/40 — Engine management systems (Y02T: climate change mitigation technologies related to transportation; Y02T 10/10: internal combustion engine [ICE] based vehicles)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a speaker recognition method based on phoneme filtering, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. To overcome the problem that conventional speaker recognition technology does not consider the influence of speech content information, the invention establishes a phoneme filter for each phoneme of speech and, before speaker recognition, selects the phoneme filter corresponding to the phoneme of each speech frame to remove the content information. This reduces the influence of content information on speaker recognition and effectively improves recognition accuracy. The method comprises a model training stage and a testing stage: model training consists of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization; the testing stage consists of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.

Description

Speaker recognition method based on phoneme filtering
Technical Field
The invention belongs to the technical field of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on phoneme filtering.
Background
Speaker recognition refers to identifying a speaker's identity from the speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition is gaining increasing importance and is widely applied in many fields, for example identity authentication, apprehending criminals over telephone channels, confirming identity in court from telephone recordings, tracking telephone voice, and controlling anti-theft doors. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and other fields.
In 2011, Kenny proposed the i-vector speaker recognition method based on the Gaussian mixture model, which achieved the best performance at that time. With the large-scale application of deep neural networks, the d-vector speaker recognition method based on deep neural networks received increasing attention from 2014 onward; compared with the traditional Gaussian mixture model, a deep neural network has stronger descriptive power and can better model very complex data distributions. In 2017, Snyder took the temporal information of speech into account and proposed the x-vector speaker recognition method based on a time-delay neural network. The current state-of-the-art speaker recognition method is the x-vector. It first preprocesses the speech data, extracts MFCC features, performs voice activity detection and removes the silent parts. The MFCC features of each frame are fed into a time-delay neural network to obtain a frame-level output, the outputs of all frames of an utterance are pooled into a mean value, and the speaker of the utterance is identified from that mean. Although this method achieves good results, it does not directly analyze the core difficulty of speaker recognition: speaker information is entangled with other information in the speech (e.g., noise, channel, content), and the mechanism of this entanglement is unknown. Therefore, when speaker information is analyzed, the uncertainty of the other factors, especially of the speech content, degrades system performance. Existing speaker recognition technology does not pay sufficient attention to the influence of mismatched speech content on speaker recognition.
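Purely as an illustration of the prior-art idea summarized above (not a reproduction of any published x-vector recipe), a frame-level time-delay network with mean pooling could be sketched as follows in PyTorch; all layer sizes, names and the number of speakers are assumptions.

```python
import torch
import torch.nn as nn

class TDNNSketch(nn.Module):
    """Schematic only: frame-level time-delay (dilated 1-D convolution) layers,
    mean pooling over time, then a speaker classifier."""
    def __init__(self, feat_dim=23, n_speakers=1000):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(512, n_speakers)

    def forward(self, mfcc):                 # mfcc: (batch, feat_dim, frames)
        frame_out = self.frame_net(mfcc)     # frame-level outputs
        utt_vec = frame_out.mean(dim=2)      # pooling: mean over all frames
        return self.classifier(utt_vec)      # speaker decision from the mean
```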
Disclosure of Invention
The invention aims to provide a speaker recognition method based on phoneme filtering, so as to overcome the problem that conventional speaker recognition technology does not consider the influence of speech content information. The method establishes a phoneme filter for each phoneme of speech and, during speaker recognition, selects the phoneme filter corresponding to the phoneme of each speech frame to remove the content information. This reduces the influence of content information on speaker recognition and effectively improves recognition accuracy.
The invention provides a speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage, as shown in Fig. 1. Model training includes the stages of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization. The testing stage includes speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition. The method specifically comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i), i = 1, …, I, where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech. Each training speech x_i is divided into frames, and the Mel cepstral features x_i^t (t = 1, …, T_i) are extracted for each frame, where x_i^t denotes the features of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech.
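By way of illustration only (not part of the patented method), a minimal preprocessing sketch using the librosa library might look as follows; the sample rate, window length and hop length are assumptions, and the 23-dimensional feature size follows the embodiment described later.

```python
# Illustrative sketch: framing and 23-dimensional Mel-cepstral (MFCC) extraction.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=23):
    """Return an array of shape (T, n_mfcc): one feature vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop are common choices (assumption, not from the patent).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T  # (frames, n_mfcc)
```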
1-2) phoneme recognition
Using the Mel cepstral features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified with a phoneme recognizer, yielding q_i^t ∈ {1, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes.
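The patent does not fix a particular phoneme recognizer. Purely as an interface sketch, a frame-level recognizer that outputs per-frame phoneme posteriors could be reduced to hard phoneme labels as follows; `phoneme_posteriors` is a hypothetical stand-in for whatever recognizer is actually used, and indices are 0-based here for convenience.

```python
import numpy as np

def recognize_phonemes(mfcc_frames, phoneme_posteriors, n_phonemes=39):
    """mfcc_frames: (T, D) features; phoneme_posteriors: hypothetical callable
    returning a (T, n_phonemes) matrix of per-frame phoneme probabilities.
    Returns q[t] in {0, ..., n_phonemes-1}, the phoneme label of frame t."""
    posts = phoneme_posteriors(mfcc_frames)   # (T, N) posterior matrix
    return np.argmax(posts, axis=1)           # hard per-frame phoneme decision
```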
1-3) phoneme filtering
For each phoneme n (n = 1, …, N), a dedicated phoneme filter f_n is constructed. f_n can be a deep neural network or another linear or nonlinear function, with parameter θ_n. The input of the phoneme filter is the Mel cepstral feature x_i^t extracted in step 1-1), and the output is the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the phoneme filter f_n corresponding to q_i^t is selected, i.e.:

y_i^t = f_n(x_i^t; θ_n)
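A minimal PyTorch sketch of this step, under the assumption that each phoneme filter f_n is a small fully connected network: one module per phoneme, with the filter chosen frame by frame according to the recognized phoneme. Layer sizes and hidden widths are illustrative assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class PhonemeFilterBank(nn.Module):
    """One filter f_n per phoneme; each maps a frame feature x_t to y_t."""
    def __init__(self, n_phonemes=39, feat_dim=23, hidden_dim=256):
        super().__init__()
        self.filters = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, feat_dim))
            for _ in range(n_phonemes)
        ])

    def forward(self, frames, phonemes):
        # frames: (T, feat_dim); phonemes: (T,) integer phoneme labels q_t
        out = torch.zeros_like(frames)
        for n, f in enumerate(self.filters):
            mask = (phonemes == n)
            if mask.any():
                out[mask] = f(frames[mask])  # y_t = f_n(x_t; theta_n) when q_t = n
        return out
```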
1-4) pooling
The phoneme-filtered features of all frames of a training speech are pooled to obtain their mean. For example, the mean of the phoneme-filtered features of the i-th training speech is:

y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function, with parameter φ. Its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability of each speaker for that speech: z'_i = g(y_i; φ).
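Continuing the same illustrative PyTorch sketch, the pooling of step 1-4) is a simple mean over frames, and g of step 1-5) can be another fully connected network producing per-speaker scores; the hidden width and the number of speakers are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerRecognizer(nn.Module):
    """g: maps the pooled, phoneme-filtered feature y_i to speaker scores."""
    def __init__(self, feat_dim=23, hidden_dim=256, n_speakers=1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, n_speakers))

    def forward(self, filtered_frames):
        y = filtered_frames.mean(dim=0)   # step 1-4): pooling over frames
        return self.net(y)                # step 1-5): unnormalized speaker scores z'_i
```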
1-6) minimizing cross entropy
The objective function is to minimize the cross entropy between the speaker probabilities z'_i predicted by the model for the training speech and the labels z_i, i.e.:

min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CrossEntropy(z'_i, z_i)

By minimizing this objective function, the parameters θ_n (n = 1, …, N) of the phoneme filters f_n corresponding to each phoneme and the parameter φ of the speaker recognition network g are obtained through training.
The model training stage then ends, having produced the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
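A hedged sketch of the joint training in step 1-6): minimizing the cross entropy between predicted and true speaker labels updates both the phoneme-filter parameters θ_n and the recognizer parameter φ. The optimizer, learning rate and epoch count are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(dataset, filter_bank, recognizer, n_epochs=10, lr=1e-3):
    """dataset: iterable of (frames, phonemes, speaker_id) per training utterance,
    where speaker_id is an integer class index. Jointly updates all theta_n and phi."""
    params = list(filter_bank.parameters()) + list(recognizer.parameters())
    opt = optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()                                   # cross entropy of step 1-6)
    for _ in range(n_epochs):
        for frames, phonemes, speaker_id in dataset:
            logits = recognizer(filter_bank(frames, phonemes))   # predicted z'_i (unnormalized)
            target = torch.tensor([speaker_id])                  # label z_i
            loss = ce(logits.unsqueeze(0), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return filter_bank, recognizer
```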
2) The testing stage specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstral feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the features of the t-th frame of the test speech and T denotes the total number of frames of the test speech.
2-2) phoneme recognition
Using the Mel cepstral features x_t extracted in step 2-1), the phoneme of each frame is identified with the phoneme recognizer used in step 1-2), yielding q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, i.e.:

y = (1/T) Σ_{t=1}^{T} y_t
2-5) speaker recognition
The speaker corresponding to the test speech is identified with the network g trained in the model training stage, giving the probability that the speech belongs to each speaker: z'' = g(y; φ).
This completes the speaker recognition for the test speech.
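For completeness, an inference sketch corresponding to steps 2-1) through 2-5), reusing the trained modules and the hypothetical helpers `extract_mfcc`, `recognize_phonemes` and `phoneme_posteriors` from the sketches above; this is an illustration under those assumptions, not the patented implementation itself.

```python
import torch

def identify_speaker(wav_path, filter_bank, recognizer, phoneme_posteriors):
    """Run the test-stage pipeline on one utterance and return the most likely speaker."""
    frames = torch.tensor(extract_mfcc(wav_path), dtype=torch.float32)      # step 2-1)
    phonemes = torch.tensor(recognize_phonemes(frames.numpy(),
                                               phoneme_posteriors))         # step 2-2)
    with torch.no_grad():
        logits = recognizer(filter_bank(frames, phonemes))   # steps 2-3), 2-4), 2-5)
        probs = torch.softmax(logits, dim=0)                 # probability per speaker (z'')
    return int(torch.argmax(probs))
```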
The invention has the characteristics and beneficial effects that:
Compared with existing speaker recognition technology, the invention focuses on reducing the influence of speech content information on speaker recognition. Speech mainly carries content information, while speaker information is comparatively weak, easily submerged in the content information, and therefore difficult to identify. The invention constructs a filter for each phoneme and filters out the phoneme information before speaker recognition, thereby reducing the influence of speech content on speaker recognition. The method of the invention improves the accuracy of speaker identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a speaker recognition method based on phoneme filtering, comprising a model training stage and a testing stage. As shown in Fig. 1, model training includes the stages of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization; the testing stage includes speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition. A specific embodiment is described in further detail below.
1) A model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i), i = 1, …, I, where x_i is the i-th training speech, z_i is the speaker label corresponding to the i-th training speech, and I is the total number of training speeches. Each training speech x_i is divided into frames and the Mel cepstral features x_i^t (t = 1, …, T_i) are extracted, where x_i^t denotes the features of the t-th frame of the i-th training speech and T_i denotes its total number of frames. In this embodiment, the number of training speeches is I = 8000, the Mel cepstral feature of each frame is 23-dimensional, all training speeches have the same length, and each speech has T_i = 300 frames.
1-2) phoneme recognition
Using the Mel cepstral features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified with a phoneme recognizer, yielding q_i^t ∈ {1, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes. In this embodiment, the phoneme recognizer is the open-source phoneme recognizer released by Brno University of Technology, and the total number of phonemes is N = 39. The phoneme corresponding to each frame of each speech is obtained with this recognizer.
1-3) phoneme filtering
For each phoneme n (n = 1, …, N), a dedicated phoneme filter f_n is constructed. f_n can be a deep neural network or another linear or nonlinear function, with parameter θ_n. The input of the phoneme filter is the Mel cepstral feature x_i^t extracted in step 1-1), and the output is the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the phoneme filter f_n corresponding to q_i^t is selected, i.e. y_i^t = f_n(x_i^t; θ_n).

In this embodiment, since the total number of phonemes is 39, 39 phoneme filters f_n (n = 1, …, 39) are constructed. Each phoneme filter is a 5-layer deep neural network with parameter θ_n, and the n-th phoneme filter filters the n-th phoneme. For example, if the 125th frame of the 5th speech x_5^125 belongs to the 13th phoneme, i.e. q_5^125 = 13, then the phoneme filter corresponding to x_5^125 is f_13, i.e.: y_5^125 = f_13(x_5^125; θ_13).
1-4) pooling
The phoneme-filtered features of all frames of a training speech are pooled to obtain their mean. For example, the mean of the phoneme-filtered features of the i-th training speech is:

y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t

In this embodiment, with T_i = 300, the mean of the phoneme-filtered features of each training speech is y_i = (1/300) Σ_{t=1}^{300} y_i^t.
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function, with parameter φ. Its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability of each speaker for that speech: z'_i = g(y_i; φ). In this embodiment, the speaker recognition network is an 8-layer deep neural network.
1-6) minimizing cross entropy
In this embodiment, the objective function is to minimize the cross entropy between the speaker probabilities z'_i predicted by the model for the training speech and the labels z_i, i.e.:

min_{θ_1,…,θ_39, φ} Σ_{i=1}^{I} CrossEntropy(z'_i, z_i)

By minimizing this objective function, the parameters θ_n (n = 1, …, 39) of the phoneme filters f_n corresponding to each phoneme and the parameter φ of the speaker recognition network g are obtained through training.

The model training stage then ends, having produced the phoneme filters f_n (n = 1, …, 39) corresponding to each phoneme and the speaker recognition network g.
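Tying the embodiment's stated sizes together (39 phonemes, 5-layer filters, an 8-layer recognition network, 23-dimensional features), a hedged configuration of the earlier sketches could look as follows; the hidden widths, activation choices and number of speakers are assumptions the patent does not specify.

```python
import torch.nn as nn

def build_embodiment_models(n_phonemes=39, feat_dim=23, n_speakers=1000):
    """Illustrative configuration matching only the embodiment's stated layer counts."""
    def mlp(sizes):
        layers = []
        for a, b in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(a, b), nn.ReLU()]
        return nn.Sequential(*layers[:-1])   # drop the final ReLU

    # 39 five-layer phoneme filters f_1..f_39 (5 weight layers each)
    filters = nn.ModuleList([mlp([feat_dim, 256, 256, 256, 256, feat_dim])
                             for _ in range(n_phonemes)])
    # 8-layer speaker recognition network g
    recognizer = mlp([feat_dim, 512, 512, 512, 512, 512, 512, 512, n_speakers])
    return filters, recognizer
```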
2) A testing stage; the method specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstral feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the features of the t-th frame of the test speech and T denotes the total number of frames of the test speech. In this embodiment, T = 328.
2-2) phoneme recognition
Using the Mel cepstral features x_t extracted in step 2-1), the phoneme of each frame is identified with the phoneme recognizer used in step 1-2), yielding q_t ∈ {1, 2, …, 39}, where q_t is the phoneme corresponding to the t-th frame of the test speech and 39 is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, i.e.:

y = (1/T) Σ_{t=1}^{T} y_t = (1/328) Σ_{t=1}^{328} y_t
2-5) speaker recognition
The speaker corresponding to the test speech is identified with the deep neural network g trained in the model training stage, giving the probability that the speech belongs to each speaker: z'' = g(y; φ).
This completes the speaker recognition for the test speech.
The method of the present invention can be implemented by a program, which can be stored in a computer-readable storage medium, as will be understood by those skilled in the art.
While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.

Claims (3)

1. A speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage, wherein the model training stage comprises the stages of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization; the testing stage comprises the stages of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.
2. The method as claimed in claim 1, wherein the model training stage comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i), i = 1, …, I, where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech; each training speech x_i is divided into frames and the Mel cepstral features x_i^t (t = 1, …, T_i) are extracted for each frame, where x_i^t denotes the features of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech;
1-2) phoneme recognition
using the Mel cepstral features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified with a phoneme recognizer, yielding q_i^t ∈ {1, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes;
1-3) phoneme filtering
for each phoneme n (n = 1, …, N), a dedicated phoneme filter f_n is constructed; f_n can be a deep neural network or another linear or nonlinear function, with parameter θ_n; the input of the phoneme filter is the Mel cepstral feature x_i^t extracted in step 1-1), and the output is the feature y_i^t with the phoneme information filtered out; according to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the phoneme filter f_n corresponding to q_i^t is selected, i.e.: y_i^t = f_n(x_i^t; θ_n);
1-4) pooling
the phoneme-filtered features of all frames of a training speech are pooled to obtain the mean of the phoneme-filtered features of that speech; the mean of the phoneme-filtered features of the i-th training speech is: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t;
1-5) speaker identification
a speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function, with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech and its output is the probability of each speaker for that speech: z'_i = g(y_i; φ);
1-6) minimizing cross entropy
the objective function is to minimize the cross entropy between the speaker probabilities z'_i predicted by the model for the training speech and the labels z_i, i.e.:

min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CrossEntropy(z'_i, z_i)

by minimizing this objective function, the parameters θ_n (n = 1, …, N) of the phoneme filters f_n corresponding to each phoneme and the parameter φ of the speaker recognition network g are obtained through training;

the model training stage then ends, having produced the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
3. The method as claimed in claim 2, wherein the testing stage comprises the following steps:
2-1) Speech preprocessing
the test speech x is divided into frames and the Mel cepstral feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the features of the t-th frame of the test speech and T denotes the total number of frames of the test speech;
2-2) phoneme recognition
using the Mel cepstral features x_t extracted in step 2-1), the phoneme of each frame of speech is identified with the phoneme recognizer used in step 1-2), yielding q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes;
2-3) phoneme filtering
according to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t; the phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n);
2-4) pooling
the phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, i.e.: y = (1/T) Σ_{t=1}^{T} y_t;
2-5) speaker recognition
the speaker corresponding to the test speech is identified with the deep neural network g trained in the model training stage, giving the probability that the speech belongs to each speaker: z'' = g(y; φ);
this completes the speaker recognition for the test speech.
CN202010810083.6A 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering Active CN111951783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010810083.6A CN111951783B (en) 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010810083.6A CN111951783B (en) 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering

Publications (2)

Publication Number Publication Date
CN111951783A true CN111951783A (en) 2020-11-17
CN111951783B CN111951783B (en) 2023-08-18

Family

ID=73332504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010810083.6A Active CN111951783B (en) 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering

Country Status (1)

Country Link
CN (1) CN111951783B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5131043A (en) * 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
AU2004237046A1 (en) * 2003-05-02 2004-11-18 Giritech A/S Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers
CN1991976A (en) * 2005-12-31 2007-07-04 潘建强 Phoneme based voice recognition method and system
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN108172214A (en) * 2017-12-27 2018-06-15 安徽建筑大学 A kind of small echo speech recognition features parameter extracting method based on Mel domains
CN108564956A (en) * 2018-03-26 2018-09-21 京北方信息技术股份有限公司 A kind of method for recognizing sound-groove and device, server, storage medium
CN109119069A (en) * 2018-07-23 2019-01-01 深圳大学 Specific crowd recognition methods, electronic device and computer readable storage medium
CN110600018A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王昌龙; 周福才; 凌裕平; 於锋: "Speaker recognition method based on characteristic phonemes", 仪器仪表学报 (Chinese Journal of Scientific Instrument), no. 10 *
谭萍; 邢玉娟: "Improvement of text-dependent speaker recognition methods in noisy environments", 西安工程大学学报 (Journal of Xi'an Polytechnic University), no. 05 *
陈雷; 杨俊安; 王一; 王龙: "A feature extraction method based on discriminative and adaptive bottleneck deep belief networks for LVCSR systems", 信号处理 (Journal of Signal Processing), no. 03 *

Also Published As

Publication number Publication date
CN111951783B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
EP3719798B1 (en) Voiceprint recognition method and device based on memorability bottleneck feature
CN107731233B (en) Voiceprint recognition method based on RNN
US7904295B2 (en) Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers
CN111276131A (en) Multi-class acoustic feature integration method and system based on deep neural network
CN108986798B (en) Processing method, device and the equipment of voice data
CN111429935B (en) Voice caller separation method and device
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
Todkar et al. Speaker recognition techniques: A review
CN109473102A (en) A kind of robot secretary intelligent meeting recording method and system
CN113113022A (en) Method for automatically identifying identity based on voiceprint information of speaker
KR100779242B1 (en) Speaker recognition methods of a speech recognition and speaker recognition integrated system
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
Krishna et al. Emotion recognition using dynamic time warping technique for isolated words
Koolagudi et al. Speaker recognition in the case of emotional environment using transformation of speech features
Dwijayanti et al. Speaker identification using a convolutional neural network
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN111951783B (en) Speaker recognition method based on phoneme filtering
Zailan et al. Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context
Khanum et al. A novel speaker identification system using feed forward neural networks
CN110875044B (en) Speaker identification method based on word correlation score calculation
Juneja Two-level noise robust and block featured PNN model for speaker recognition in real environment
Abd El-Moneim et al. Effect of reverberation phenomena on text-independent speaker recognition based deep learning
Deepa et al. A Report on Voice Recognition System: Techniques, Methodologies and Challenges using Deep Neural Network
Nosan et al. Enhanced Feature Extraction Based on Absolute Sort Delta Mean Algorithm and MFCC for Noise Robustness Speech Recognition.
Dwijayanti et al. JURNAL RESTI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant