CN111951783A - Speaker recognition method based on phoneme filtering - Google Patents
- Publication number
- CN111951783A (application CN202010810083.6A / CN202010810083A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- voice
- speaker
- recognition
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a speaker recognition method based on phoneme filtering, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. To overcome the problem that traditional speaker recognition technology does not account for the influence of speech content information, the invention proposes a speaker recognition method based on phoneme filtering. The method builds a phoneme filter for each phoneme and, before speaker recognition, selects the filter corresponding to each frame's phoneme to remove content information. This reduces the influence of content information on speaker recognition and effectively improves recognition accuracy. The invention is characterized in that it comprises a model training stage and a testing stage, wherein model training comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition and cross-entropy minimization. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.
Description
Technical Field
The invention belongs to the technical field of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on phoneme filtering.
Background
Speaker recognition refers to identifying a speaker from the speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition is receiving increasing attention and is widely applied in many fields, for example identity authentication, tracking criminals over telephone channels, identity confirmation in court based on telephone recordings, telephone voice tracking, and voice-controlled opening and closing of security doors. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and other fields.
In 2011, Kenny proposed the i-vector speaker recognition method based on the Gaussian mixture model, which achieved the best performance at that time. With the large-scale application of deep neural networks, the d-vector speaker recognition method based on deep neural networks received increasing attention from 2014 on; compared with the traditional Gaussian mixture model, a deep neural network has stronger descriptive power and can better model very complex data distributions. In 2017, Snyder took the temporal information of speech into account and proposed the x-vector speaker recognition method based on a time-delay neural network. The current state-of-the-art method in speaker recognition is the x-vector. It first preprocesses the speech data, extracts MFCC features, performs voice activity detection, and removes silent parts. The MFCC features of each speech frame are fed into a time-delay neural network to obtain a per-frame output; the outputs of all frames of a speech are pooled into their mean, from which the speaker of the speech is identified. Although this method achieves good results, it does not directly analyze and study the difficulty of speaker recognition. The difficulty is that speaker information is entangled with other information in the speech (e.g., noise, channel, content), and the principle of this mutual entanglement is unknown. Therefore, when analyzing speaker information, the uncertainty of other factors, especially the uncertainty of the speech content information, degrades system performance. Existing speaker recognition technology does not emphasize the influence of unmatched speech content on recognizing the speaker factor.
Disclosure of Invention
The invention aims to provide a speaker recognition method based on phoneme filtering, aiming at overcoming the problem that the influence of voice content information is not considered in the traditional speaker recognition technology. The method establishes a phoneme filter for each phoneme of the voice, and selects a corresponding phoneme filter to remove the content information according to the phoneme corresponding to each frame of voice when the speaker is identified. Therefore, the influence of the content information on the speaker identification is reduced, and the accuracy of the speaker identification is effectively improved.
The invention provides a speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage. As shown in fig. 1, model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition. The method specifically comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech. Each training speech x_i is divided into frames, and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech.
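As an illustration (not part of the patent), the framing step of 1-1) can be sketched as follows; the frame length, hop size, and the subsequent 23-dimensional MFCC computation (e.g. via librosa) are assumptions:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames (e.g. 25 ms windows
    with a 10 ms hop at 16 kHz). MFCC features x_i^t would then be
    computed per frame; only the framing step is sketched here."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# 1 second of a synthetic 16 kHz signal -> 98 frames of 400 samples
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
frames = frame_signal(sig)
print(frames.shape)
```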
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech, and N is the total number of phonemes.
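The patent relies on an external phoneme recognizer for this step. As a hypothetical sketch, per-frame phoneme labels q_t could be derived from such a recognizer's posterior matrix by an argmax decision:

```python
import numpy as np

def frame_phonemes(posteriors):
    """Given a (n_frames, N) matrix of per-frame phoneme posteriors
    produced by some phoneme recognizer, return the 1-based phoneme
    label q_t for each frame (argmax decision)."""
    return np.argmax(posteriors, axis=1) + 1  # labels in 1..N

post = np.array([[0.1, 0.7, 0.2],
                 [0.6, 0.3, 0.1]])
print(frame_phonemes(post))  # -> [2 1]
```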
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N). f_n can be a deep neural network, or another linear or nonlinear function, with parameter θ_n. The phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely: y_i^t = f_n(x_i^t; θ_n).
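A minimal sketch of the per-frame filter selection in step 1-3), using one linear transform per phoneme as a stand-in for the per-phoneme networks f_n (the patent allows deep networks; the dimensions and random weights here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 39, 23  # assumed: 39 phonemes, 23-dim MFCC features

# One linear filter (W_n, b_n) per phoneme, standing in for f_n.
W = rng.standard_normal((N, D, D)) * 0.1
b = np.zeros((N, D))

def phoneme_filter(x, q):
    """Apply filter f_{q_t} to each frame feature x_t.
    x: (T, D) frame features; q: (T,) 1-based phoneme labels."""
    return np.stack([W[qt - 1] @ xt + b[qt - 1] for xt, qt in zip(x, q)])

x = rng.standard_normal((300, D))      # one utterance of 300 frames
q = rng.integers(1, N + 1, size=300)   # per-frame phoneme labels
y = phoneme_filter(x, q)               # phoneme-filtered frame features
print(y.shape)                         # (300, 23)
```

Pooling (step 1-4) would then reduce `y` to its frame-wise mean, `y.mean(axis=0)`.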
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech. For example, the mean for the i-th training speech is: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t.
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker.
1-6) minimizing cross entropy
The objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CE(z'_i, z_i).
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, N) and the parameter φ of the speaker recognition network g are obtained through training.
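As a non-authoritative illustration of step 1-6), the cross-entropy objective can be computed as follows; the actual minimization over θ_n and φ would use a gradient-based optimizer, which is not shown:

```python
import numpy as np

def softmax(logits):
    # numerically stable row-wise softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true speaker.
    probs: (I, n_speakers) predicted probabilities z'_i;
    labels: (I,) integer speaker labels z_i."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.0], [0.0, 2.0]])  # toy network outputs
probs = softmax(logits)
labels = np.array([0, 1])
loss = cross_entropy(probs, labels)
print(round(loss, 4))
```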
The model training stage ends, yielding the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
2) The testing stage specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstrum feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech.
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/T) Σ_{t=1}^{T} y_t.
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, yielding the probability z' = g(y; φ) of the speech belonging to each speaker.
And finishing the speaker recognition corresponding to the test voice.
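The test stage 2-1) through 2-5) can be sketched end to end. This is an illustrative stand-in, not the patent's implementation: linear filters and a linear classifier replace the trained deep networks, and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, S = 39, 23, 10   # assumed: phonemes, MFCC dims, speakers

# Stand-ins for the trained components f_n and g.
W = rng.standard_normal((N, D, D)) * 0.1   # per-phoneme filters
Wg = rng.standard_normal((S, D)) * 0.1     # speaker classifier

def recognize_speaker(x, q):
    """Test pipeline: per-frame phoneme filtering (2-3), mean
    pooling (2-4), then speaker probabilities via softmax (2-5)."""
    y_frames = np.stack([W[qt - 1] @ xt for xt, qt in zip(x, q)])
    y = y_frames.mean(axis=0)
    logits = Wg @ y
    e = np.exp(logits - logits.max())
    return e / e.sum()  # probability of each speaker

x = rng.standard_normal((328, D))    # T = 328 frames, as in the embodiment
q = rng.integers(1, N + 1, size=328)
z = recognize_speaker(x, q)
print(z.shape, round(z.sum(), 6))    # (10,) 1.0
```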
The invention has the characteristics and beneficial effects that:
compared with existing speaker recognition technology, the invention focuses on reducing the influence of speech content information on speaker recognition. Since speech primarily carries content information, the speaker information is comparatively weak; it is easily submerged in the content information and hard to identify. The invention constructs a filter for each phoneme and filters out the phoneme information before speaker recognition, thereby reducing the influence of the speech content information on speaker recognition. The method of the invention improves the accuracy of speaker identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a speaker recognition method based on phoneme filtering, comprising a model training stage and a testing stage. As shown in fig. 1, model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, cross-entropy minimization, etc. The testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, etc. A specific embodiment is described in further detail below.
1) A model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech, z_i is the speaker label corresponding to the i-th training speech, and I is the total number of training speeches. Each training speech x_i is divided into frames and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes its total number of frames. In this embodiment, the number of training speeches is I = 8000, the Mel cepstrum feature of each frame is 23-dimensional, all training speeches have the same length, and each speech has T_i = 300 frames.
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes. In this embodiment, the phoneme recognizer is the open-source phoneme recognizer released by Brno University of Technology, with a total of N = 39 phonemes. The phoneme corresponding to each frame of each speech is obtained from this recognizer.
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N). f_n can be a deep neural network, or another linear or nonlinear function, with parameter θ_n. The phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out. According to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely y_i^t = f_n(x_i^t; θ_n). In this embodiment, since the total number of phonemes is 39, 39 phoneme filters f_n (n = 1, …, 39) are constructed. Each phoneme filter is a 5-layer deep neural network with parameter θ_n; the n-th phoneme filter filters the n-th phoneme. For example, if the 125th frame of the 5th speech, x_5^125, belongs to the 13th phoneme, i.e. q_5^125 = 13, the phoneme filter f_13 is selected, namely: y_5^125 = f_13(x_5^125; θ_13).
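A hypothetical sketch of one 5-layer filter from this embodiment, written with plain numpy; the hidden width, activation choice, and random initialization are assumptions, and training is not shown:

```python
import numpy as np

rng = np.random.default_rng(2)
D, HIDDEN, LAYERS = 23, 64, 5  # 23-dim MFCC; hidden width assumed

def make_filter():
    """One per-phoneme filter f_n: a 5-layer fully connected network
    mapping a 23-dim frame feature to a 23-dim filtered feature."""
    dims = [D] + [HIDDEN] * (LAYERS - 1) + [D]
    return [(rng.standard_normal((dims[k + 1], dims[k])) * 0.1,
             np.zeros(dims[k + 1])) for k in range(LAYERS)]

def apply_filter(params, x):
    h = x
    for k, (Wk, bk) in enumerate(params):
        h = Wk @ h + bk
        if k < len(params) - 1:    # ReLU on all but the output layer
            h = np.maximum(h, 0.0)
    return h

filters = [make_filter() for _ in range(39)]       # one per phoneme
y = apply_filter(filters[12], rng.standard_normal(D))  # f_13 on one frame
print(y.shape)  # (23,)
```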
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t.
In this embodiment, the mean of the phoneme-filtered features of the i-th training speech is: y_i = (1/300) Σ_{t=1}^{300} y_i^t.
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech, and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker. In this embodiment, the speaker recognition network is an 8-layer deep neural network.
1-6) minimizing cross entropy
In this embodiment, the objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_39, φ} Σ_{i=1}^{I} CE(z'_i, z_i).
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, 39) and the parameter φ of the speaker recognition network g are obtained through training.
The model training stage ends, yielding the phoneme filter f_n (n = 1, …, 39) corresponding to each phoneme and the speaker recognition network g.
2) A testing stage; the method specifically comprises the following steps:
2-1) Speech preprocessing
The test speech x is divided into frames and the Mel cepstrum feature x_t (t = 1, …, T) is extracted for each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech. In this embodiment, T = 328.
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, 39}, where q_t is the phoneme corresponding to the t-th frame of the test speech and 39 is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t. The phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n).
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/328) Σ_{t=1}^{328} y_t.
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, yielding the probability z' = g(y; φ) of the speech belonging to each speaker.
And finishing the speaker recognition corresponding to the test voice.
The method of the present invention can be implemented by a program, which can be stored in a computer-readable storage medium, as will be understood by those skilled in the art.
While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (3)
1. A speaker recognition method based on phoneme filtering is characterized by comprising a model training stage and a testing stage, wherein the model training stage comprises a speech preprocessing stage, a phoneme recognition stage, a phoneme filtering stage, a pooling stage, a speaker recognition stage and a minimum cross entropy stage; the testing stage comprises the stages of voice preprocessing, phoneme recognition, phoneme filtering, pooling and speaker recognition.
2. The method as claimed in claim 1, wherein the model training stage comprises the following steps:
1-1) Speech preprocessing
The training speech data set is (x_i, z_i) (i = 1, …, I), where x_i is the i-th training speech and z_i is the speaker label corresponding to the i-th training speech; each training speech x_i is divided into frames and the Mel cepstrum feature x_i^t (t = 1, …, T_i) is extracted for each frame, where x_i^t denotes the feature of the t-th frame of the i-th training speech and T_i denotes the total number of frames of the i-th training speech;
1-2) phoneme recognition
According to the Mel cepstrum features x_i^t extracted in step 1-1), the phoneme of each frame of speech is identified using a phoneme recognizer: q_i^t ∈ {1, 2, …, N}, where q_i^t is the phoneme corresponding to the t-th frame of the i-th training speech and N is the total number of phonemes;
1-3) phoneme filtering
A phoneme filter f_n is constructed for each phoneme n (n = 1, …, N), where f_n can be a deep neural network or another linear or nonlinear function with parameter θ_n; the phoneme filter takes as input the Mel cepstrum feature x_i^t extracted in step 1-1) and outputs the feature y_i^t with the phoneme information filtered out; according to the phoneme q_i^t obtained in step 1-2), if q_i^t = n, the corresponding phoneme filter f_n is selected, namely: y_i^t = f_n(x_i^t; θ_n);
1-4) pooling
The phoneme-filtered features of all frames of the training speech are pooled to obtain the mean of the phoneme-filtered features of the speech, where the mean for the i-th training speech is: y_i = (1/T_i) Σ_{t=1}^{T_i} y_i^t;
1-5) speaker identification
A speaker recognition network g is constructed, where g can be a deep neural network or another linear or nonlinear function with parameter φ; its input is the mean y_i of the phoneme-filtered features of the speech and its output is the probability z'_i = g(y_i; φ) of the speech belonging to each speaker;
1-6) minimizing cross entropy
The objective function minimizes the cross entropy between the model-predicted speaker probability z'_i for the training speech and the label z_i, namely: min_{θ_1,…,θ_N, φ} Σ_{i=1}^{I} CE(z'_i, z_i);
By minimizing this objective function, the parameter θ_n of the phoneme filter f_n corresponding to each phoneme (n = 1, …, N) and the parameter φ of the speaker recognition network g are obtained through training;
the model training stage ends, yielding the phoneme filter f_n corresponding to each phoneme and the speaker recognition network g.
3. The method as claimed in claim 2, wherein the testing stage comprises the following steps:
2-1) Speech preprocessing
Framing the test speech x and extracting the Mel cepstrum feature x_t (t = 1, …, T) corresponding to each frame, where x_t denotes the feature of the t-th frame of the test speech and T denotes the total number of frames of the test speech;
2-2) phoneme recognition
According to the Mel cepstrum feature x_t extracted in step 2-1), the phoneme of each frame of speech is identified using the phoneme recognizer used in step 1-2): q_t ∈ {1, 2, …, N}, where q_t is the phoneme corresponding to the t-th frame of the test speech and N is the total number of phonemes;
2-3) phoneme filtering
According to the phoneme q_t obtained in step 2-2), if q_t = n, the phoneme filter f_n trained in the model training stage is selected as the filter for x_t; the phoneme-filtered feature of the t-th frame of the test speech is: y_t = f_n(x_t; θ_n);
2-4) pooling
The phoneme-filtered features of all frames of the test speech are pooled to obtain their mean, namely: y = (1/T) Σ_{t=1}^{T} y_t;
2-5) speaker recognition
According to the deep neural network g trained in the model training stage, the speaker corresponding to the test speech is identified, obtaining the probability z' = g(y; φ) of the speech belonging to each speaker;
and finishing the speaker recognition corresponding to the test voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810083.6A CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810083.6A CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111951783A true CN111951783A (en) | 2020-11-17 |
CN111951783B CN111951783B (en) | 2023-08-18 |
Family
ID=73332504
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010810083.6A Active CN111951783B (en) | 2020-08-12 | 2020-08-12 | Speaker recognition method based on phoneme filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111951783B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5131043A (en) * | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
AU2004237046A1 (en) * | 2003-05-02 | 2004-11-18 | Giritech A/S | Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers |
CN1991976A (en) * | 2005-12-31 | 2007-07-04 | 潘建强 | Phoneme based voice recognition method and system |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN108172214A (en) * | 2017-12-27 | 2018-06-15 | 安徽建筑大学 | A kind of small echo speech recognition features parameter extracting method based on Mel domains |
CN108564956A (en) * | 2018-03-26 | 2018-09-21 | 京北方信息技术股份有限公司 | A kind of method for recognizing sound-groove and device, server, storage medium |
CN109119069A (en) * | 2018-07-23 | 2019-01-01 | 深圳大学 | Specific crowd recognition methods, electronic device and computer readable storage medium |
CN110600018A (en) * | 2019-09-05 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Voice recognition method and device and neural network training method and device |
-
2020
- 2020-08-12 CN CN202010810083.6A patent/CN111951783B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5131043A (en) * | 1983-09-05 | 1992-07-14 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for speech recognition wherein decisions are made based on phonemes |
AU2004237046A1 (en) * | 2003-05-02 | 2004-11-18 | Giritech A/S | Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers |
CN1991976A (en) * | 2005-12-31 | 2007-07-04 | 潘建强 | Phoneme based voice recognition method and system |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN108172214A (en) * | 2017-12-27 | 2018-06-15 | 安徽建筑大学 | A kind of small echo speech recognition features parameter extracting method based on Mel domains |
CN108564956A (en) * | 2018-03-26 | 2018-09-21 | 京北方信息技术股份有限公司 | A kind of method for recognizing sound-groove and device, server, storage medium |
CN109119069A (en) * | 2018-07-23 | 2019-01-01 | 深圳大学 | Specific crowd recognition methods, electronic device and computer readable storage medium |
CN110600018A (en) * | 2019-09-05 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Voice recognition method and device and neural network training method and device |
Non-Patent Citations (3)
Title |
---|
Wang Changlong; Zhou Fucai; Ling Yuping; Yu Feng: "Speaker recognition method based on characteristic phonemes", Chinese Journal of Scientific Instrument, no. 10 *
Tan Ping; Xing Yujuan: "Improvement of text-dependent speaker recognition methods in noisy environments", Journal of Xi'an Polytechnic University, no. 05 *
Chen Lei; Yang Jun'an; Wang Yi; Wang Long: "A feature extraction method based on discriminative and adaptive bottleneck deep belief networks for LVCSR systems", Journal of Signal Processing, no. 03 *
Also Published As
Publication number | Publication date |
---|---|
CN111951783B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3719798B1 (en) | Voiceprint recognition method and device based on memorability bottleneck feature | |
CN107731233B (en) | Voiceprint recognition method based on RNN | |
US7904295B2 (en) | Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers | |
CN111276131A (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN111429935B (en) | Voice caller separation method and device | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
Todkar et al. | Speaker recognition techniques: A review | |
CN109473102A (en) | A kind of robot secretary intelligent meeting recording method and system | |
CN113113022A (en) | Method for automatically identifying identity based on voiceprint information of speaker | |
KR100779242B1 (en) | Speaker recognition methods of a speech recognition and speaker recognition integrated system | |
Park et al. | The Second DIHARD Challenge: System Description for USC-SAIL Team. | |
Krishna et al. | Emotion recognition using dynamic time warping technique for isolated words | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
Dwijayanti et al. | Speaker identification using a convolutional neural network | |
CN113516987B (en) | Speaker recognition method, speaker recognition device, storage medium and equipment | |
CN111951783B (en) | Speaker recognition method based on phoneme filtering | |
Zailan et al. | Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context | |
Khanum et al. | A novel speaker identification system using feed forward neural networks | |
CN110875044B (en) | Speaker identification method based on word correlation score calculation | |
Juneja | Two-level noise robust and block featured PNN model for speaker recognition in real environment | |
Abd El-Moneim et al. | Effect of reverberation phenomena on text-independent speaker recognition based deep learning | |
Deepa et al. | A Report on Voice Recognition System: Techniques, Methodologies and Challenges using Deep Neural Network | |
Nosan et al. | Enhanced Feature Extraction Based on Absolute Sort Delta Mean Algorithm and MFCC for Noise Robustness Speech Recognition. | |
Dwijayanti et al. | JURNAL RESTI |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |