CN115440193A - Pronunciation evaluation scoring method based on deep learning - Google Patents
Pronunciation evaluation scoring method based on deep learning
- Publication number
- CN115440193A (application CN202211085643.1A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- phonemes
- score
- pronunciation
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning. The method first uses a speech recognition model to recognize the true text of the audio, then obtains the posterior probabilities of the audio from an HMM-DNN model, and finally scores each phoneme with a scoring model. Because the correct text of the audio is recognized before forced alignment, the method avoids the problem that the audio cannot be aligned to the correct positions when it does not match the given text. At the same time, the scoring model is built with a deep neural network, so it can fit multiple kinds of information (posterior probability, vowel/consonant class, part of speech, tone, and pronunciation duration), making phoneme scoring more reasonable and accurate.
Description
Technical Field
The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning.
Background
Spoken language receives increasing attention in language courses. One-on-one interaction between teacher and student is the most effective way to improve spoken English, but it can hardly satisfy the demand of the large number of learners. With the rapid progress of computer technology and pronunciation evaluation technology, various spoken-language evaluation schemes based on artificial intelligence have been deployed. Such systems provide students with additional learning opportunities and rich learning material; they can assist or replace teachers in guiding students through targeted pronunciation exercises, point out pronunciation errors, give effective diagnostic feedback, and evaluate a student's overall pronunciation level, effectively improving learning efficiency and spoken-language ability.
The current mainstream approach to pronunciation evaluation obtains the posterior probabilities of the speech from a hidden Markov model-deep neural network (HMM-DNN) model, performs forced alignment against the evaluation text, and scores with the GOP (Goodness of Pronunciation) method.
Forced alignment can reach high accuracy, but only under one precondition: the given text and the audio must match. If a user reads "I am a teacher" as "I was a teacher", then when the audio segment corresponding to "was" is processed, it may be wrongly matched against the phonemes of "am", and the subsequent "a" and "teacher" are likely not aligned to the correct positions, degrading scoring accuracy.
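The GOP method referenced above is conventionally computed as the average log posterior of the expected phoneme over its aligned frames. A minimal sketch of that conventional measure follows; the posterior matrix and frame span are toy values for illustration, not output of the described system:

```python
import numpy as np

def gop_score(posteriors, phone_idx, start, end):
    """Goodness of Pronunciation: mean log posterior of the
    expected phoneme over its force-aligned frame span."""
    frames = posteriors[start:end, phone_idx]      # posterior of the target phone per frame
    return float(np.mean(np.log(frames + 1e-10)))  # epsilon guards against log(0)

# toy posterior matrix: 4 frames x 3 phonemes, rows sum to 1
post = np.array([[0.7, 0.2, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.9, 0.05, 0.05]])
score = gop_score(post, phone_idx=0, start=0, end=4)
```

A well-pronounced phoneme yields a score near zero; mispronunciations push it strongly negative.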
Disclosure of Invention
In order to solve these problems, the invention provides a speech evaluation scoring method based on deep learning. First, the text of the audio is recognized by a speech recognition model, and the recognized text is then used for forced alignment, making the alignment result more accurate. Finally, phoneme scores are predicted by a scoring model built with a deep neural network, and the scores of words and sentences are computed from the phoneme scores.
The pronunciation evaluation scoring method based on deep learning first uses a speech recognition model to recognize the true text of the audio. Next, an HMM-DNN model produces the posterior probabilities of the audio. Then the recognized text and the posterior probabilities are used for forced alignment to determine the time boundary of each phoneme. Finally, each phoneme is scored by a scoring model.
The specific technical scheme is as follows:
Step one: extract acoustic features of the speech to be evaluated, feed them into a speech recognition model, and recognize the true text of the speech.
Step two: feed the acoustic features extracted in step one into an HMM-DNN model and predict the posterior probability of each frame.
Step three: perform forced alignment using the text recognized in step one and the per-frame posterior probabilities obtained in step two, determining the time boundary of each phoneme.
Step four: compute the average posterior probability of each phoneme from the time boundaries obtained in step three and the per-frame posteriors from step two, concatenate this average with the phoneme's feature information (vowel/consonant class, part of speech, tone, pronunciation duration, and so on), and feed the result into a scoring model to obtain the phoneme score.
Step five: align the phonemes of the text recognized in step one with those of the reference text, determining which phonemes are extra (multi-read) and which are missed.
Step six: compute the final scores, calculating the word score and the whole-sentence score from the multi-read and missed phonemes found in step five.
Advantageous effects
By recognizing the correct text of the audio with a speech recognition model before forced alignment, the invention avoids the problem that the audio cannot be aligned to the correct positions when it does not match the given text. At the same time, the scoring model is built with a deep neural network, so it can fit multiple kinds of information (posterior probability, vowel/consonant class, part of speech, tone, pronunciation duration), making phoneme scoring more reasonable and accurate.
1. Recognizing the correct text of the audio with the speech recognition model avoids the problem that the audio cannot be aligned to the correct positions during forced alignment when the audio does not match the text.
2. Building the scoring model with a deep neural network allows it to fit information such as posterior probability, vowel/consonant class, part of speech, tone and pronunciation duration, making phoneme scoring more reasonable and accurate.
Drawings
FIG. 1 is a schematic flow diagram.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawing.
FIG. 1 is a flow chart of the pronunciation evaluation method based on deep learning according to the present invention. As shown in FIG. 1, the method mainly comprises the following steps:
Step one: extract acoustic features of the speech to be evaluated. The features may be Fbank features; when extracting them, the sampling rate is 16000 Hz, the window length is 25 ms, and the frame shift is 10 ms. The extracted features are fed into a speech recognition model, which may be a wenet model, to recognize the true text of the speech to be evaluated.
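The Fbank extraction described above (16 kHz sampling, 25 ms windows, 10 ms frame shift) can be sketched as a plain-NumPy log mel filterbank. The mel-bin count and FFT size below are illustrative defaults, not values fixed by the description:

```python
import numpy as np

def fbank(signal, sr=16000, win_ms=25, shift_ms=10, n_mels=40, n_fft=512):
    """Log mel filterbank (Fbank) features: 25 ms windows, 10 ms frame shift."""
    win, hop = sr * win_ms // 1000, sr * shift_ms // 1000    # 400 and 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)                        # taper each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # power spectrum per frame
    # triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return np.log(power @ fb.T + 1e-10)                      # (n_frames, n_mels)

feats = fbank(np.random.randn(16000))   # 1 s of noise -> 98 frames of 40-dim Fbank
```

In practice a toolkit routine (e.g. a Kaldi-compatible Fbank implementation) would be used; the sketch only makes the windowing arithmetic concrete.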
Step two: feed the acoustic features of the speech to be evaluated (e.g., the Fbank features) extracted in step one into an HMM-DNN model and predict the posterior probability of each frame. Assuming m frames in total and n phonemes, an m × n posterior probability matrix is finally generated.
Step three: perform forced alignment using the text recognized in step one and the per-frame posterior probabilities obtained in step two. The alignment uses a greedy or Viterbi algorithm to find the maximum-probability path, finally determining the time boundary of each phoneme.
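The Viterbi-style forced alignment of step three can be sketched as a dynamic program over the m × n log-posterior matrix and the phoneme sequence of the recognized text: at each frame the path either stays in the current phoneme or advances to the next one (a left-to-right topology). The toy posteriors below are illustrative:

```python
import numpy as np

def force_align(log_post, phone_seq):
    """Viterbi alignment of a left-to-right phoneme sequence to frames.
    log_post: (m, n) log posteriors; phone_seq: phoneme indices in reading order.
    Returns (start, end) frame boundaries for each phoneme in the sequence."""
    m, k = log_post.shape[0], len(phone_seq)
    score = np.full((m, k), -np.inf)
    back = np.zeros((m, k), dtype=int)          # 0 = stayed in phoneme, 1 = advanced
    score[0, 0] = log_post[0, phone_seq[0]]
    for t in range(1, m):
        for j in range(k):
            stay = score[t - 1, j]
            move = score[t - 1, j - 1] if j > 0 else -np.inf
            back[t, j] = int(move > stay)
            score[t, j] = max(stay, move) + log_post[t, phone_seq[j]]
    # trace the best path back and read off each phoneme's time boundary
    j, path = k - 1, []
    for t in range(m - 1, -1, -1):
        path.append(j)
        if t > 0 and back[t, j]:
            j -= 1
    path.reverse()
    return [(path.index(j), len(path) - path[::-1].index(j)) for j in range(k)]

# toy example: 6 frames, 2 phonemes; phone 0 dominates the first 3 frames
log_post = np.log(np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3))
bounds = force_align(log_post, [0, 1])
```

The returned boundaries partition the frames among the phonemes of the recognized text, which is exactly what step four consumes.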
Step four: compute the average posterior probability of each phoneme from the time boundaries obtained in step three and the per-frame posteriors obtained in step two.
Then the feature information of each phoneme is collected: the vowel/consonant class, part of speech, tone, and pronunciation duration.
1. Vowel/consonant feature: whether the current phoneme is a vowel or a consonant. Scores differ by phoneme type, with vowels carrying more weight.
2. Part-of-speech feature: the part of speech of the word containing the current phoneme. Scoring differs by part-of-speech importance; content words such as verbs and nouns weigh more than function words.
3. Tone feature: whether the current phoneme carries a tone. If tone information is marked in the reference text but no tone is read in the audio, the final score is reduced.
4. Pronunciation duration feature: the duration of the current phoneme, normalized by the duration of the standard phoneme. If the normalized duration is too low or too high, the score is reduced.
Finally, the average posterior probability of the phoneme is concatenated with its feature information (vowel/consonant class, part of speech, tone, pronunciation duration) and fed into the scoring model to obtain the phoneme score.
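Assembling the scoring-model input could look as follows. The exact numeric encoding of the vowel/consonant, part-of-speech, tone and duration features is an assumption for illustration; only the averaging of posteriors over the aligned frames is taken directly from the description:

```python
import numpy as np

def phone_features(posteriors, phone_idx, start, end, is_vowel, pos_weight,
                   has_tone, ref_duration_frames):
    """Concatenate the mean posterior of a phoneme over its aligned frames
    with its categorical and duration features, forming the scorer input."""
    avg_post = posteriors[start:end, phone_idx].mean()
    norm_dur = (end - start) / ref_duration_frames   # duration normalized by the reference length
    return np.array([avg_post, float(is_vowel), pos_weight,
                     float(has_tone), norm_dur])

# toy posterior matrix: 3 frames x 2 phonemes; phoneme 0 spans frames [0, 3)
post = np.array([[0.7, 0.3], [0.8, 0.2], [0.6, 0.4]])
x = phone_features(post, 0, 0, 3, is_vowel=True, pos_weight=1.0,
                   has_tone=False, ref_duration_frames=3)
```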
The scoring model can be a three-layer deep neural network (DNN) with hidden dimension 128 and a final layer of dimension 1 that outputs the scoring result.
Modeling the various phoneme features with a deep neural network fits a complex scoring function, making the phoneme scores more reasonable and accurate.
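The three-layer DNN described above (hidden dimension 128, final dimension 1) can be sketched as a forward pass. The weights here are random placeholders standing in for a trained model, and the sigmoid output squashing is an assumption:

```python
import numpy as np

class ScoringDNN:
    """Three-layer DNN scorer: two 128-dim hidden layers and a 1-dim output,
    matching the dimensions in the description (untrained placeholder weights)."""
    def __init__(self, in_dim=5, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, hidden))
        self.w3 = rng.normal(0, 0.1, (hidden, 1))

    def __call__(self, x):
        h = np.maximum(x @ self.w1, 0)            # ReLU hidden layer 1
        h = np.maximum(h @ self.w2, 0)            # ReLU hidden layer 2
        return 1 / (1 + np.exp(-(h @ self.w3)))   # sigmoid -> score in (0, 1)

model = ScoringDNN()
# input: [avg posterior, vowel flag, POS weight, tone flag, normalized duration]
score = model(np.array([0.7, 1.0, 1.0, 0.0, 1.0]))[0]
```

In deployment the network would be trained on phoneme-score labels; the sketch only fixes the shapes.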
Step five: align the phonemes of the text recognized in step one with those of the reference text, determining which phonemes are extra (multi-read), which are missed, and which are normal.
The alignment can be implemented by computing an edit distance. First convert the recognized text into the corresponding phoneme list X[1, …, n] and the reference text into the phoneme list Y[1, …, m]. Define D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m). It is solved by dynamic programming with the state transition equations:
D(i, 0) = i
D(0, j) = j
D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost(i, j)), with cost(i, j) = 0 if X[i] = Y[j] and 1 otherwise,
wherein the D(i-1, j) + 1 case (insert) represents a multi-read phoneme, the D(i, j-1) + 1 case (delete) represents a missed phoneme, and the zero-cost diagonal case (norm) represents a normal phoneme.
By recording which sub-problem each cell was solved from, the optimal path can be traced back, finally yielding the alignment of the two phoneme strings.
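The edit-distance alignment with backtracking can be sketched as follows. The description names only the insert, delete and norm cases, so the handling of substituted phonemes (labelled `sub` here) is an assumption; the example phoneme strings for "I was" versus "I am" are illustrative:

```python
def align_phones(hyp, ref):
    """Edit-distance alignment of recognized (hyp) vs. reference (ref) phoneme
    lists; labels each position insert (extra), delete (missed), norm, or sub."""
    n, m = len(hyp), len(ref)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i          # base case D(i, 0) = i
    for j in range(m + 1): D[0][j] = j          # base case D(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # backtrack: recover which sub-problem produced each cell
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] and hyp[i - 1] == ref[j - 1]:
            ops.append(('norm', ref[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(('insert', hyp[i - 1])); i -= 1    # extra (multi-read) phoneme
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append(('delete', ref[j - 1])); j -= 1    # missed phoneme
        else:  # mismatched pair: a reference phoneme read as something else
            ops.append(('sub', ref[j - 1])); i, j = i - 1, j - 1
    return ops[::-1]

# "I was" read against reference "I am" (illustrative ARPAbet-style phonemes)
ops = align_phones(['ay', 'w', 'aa', 'z'], ['ay', 'ae', 'm'])
```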
Step six: compute the final scores, calculating the word score and the whole-sentence score from the multi-read and missed phonemes found in step five.
The score of a word depends on the scores of all normal phonemes in the word and on the proportion of multi-read and missed phonemes. The score of a sentence depends not only on the normal-phoneme scores and the multi-read/missed proportion but also on the number of frames occupied by words and the number of silent frames between words; only silences longer than a normal pause are counted.
The score of the word is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read (extra) phonemes, and del_{cnt} denotes the number of missed phonemes.
The score of the sentence is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, and del_{cnt} denotes the number of missed phonemes; frame_{word} denotes the number of frames occupied by all words, and frame_{sil} denotes the number of silent frames between words.
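The text above does not reproduce the word-score formula itself, only its inputs. The sketch below merely illustrates the stated dependence (normal-phoneme scores scaled down by the share of multi-read and missed phonemes); the specific weighting is an assumption, not the patented formula:

```python
def word_score(norm_phone_scores, n_ref, ins_cnt, del_cnt):
    """Illustrative word score: average of the normal-phoneme scores,
    penalized by the share of multi-read (ins) and missed (del) phonemes.
    The exact penalty form is assumed, not taken from the patent."""
    if not norm_phone_scores:
        return 0.0
    avg = sum(norm_phone_scores) / len(norm_phone_scores)
    penalty = (ins_cnt + del_cnt) / n_ref       # proportion of error phonemes
    return avg * max(0.0, 1.0 - penalty)

# 3 normal phonemes scored, 1 missed, out of 4 reference phonemes
s = word_score([0.9, 0.8, 0.85], n_ref=4, ins_cnt=0, del_cnt=1)
```

A sentence-level analogue would additionally fold in the word-frame and inter-word silence counts described above.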
Claims (8)
1. A pronunciation evaluation scoring method based on deep learning is characterized by comprising the following steps:
firstly, identifying a real text result of an audio frequency through a speech recognition model;
secondly, acquiring the posterior probability of the audio through an HMM-DNN model;
then, forcibly aligning the recognition text result of the audio and the posterior probability of the audio to determine the time boundary of each phoneme;
finally, scoring the phoneme through a scoring model;
the method comprises the following specific steps:
extracting acoustic features of a voice to be evaluated, sending the acoustic features into a voice recognition model, and recognizing a real text result of the voice to be evaluated;
step two, the acoustic features of the speech to be evaluated extracted in the step one are sent into an HMM-DNN model, and the posterior probability of each frame is predicted;
step three, performing forced alignment according to the text result identified in the step one and the posterior probability of each frame obtained in the step two, and determining the time boundary of each phoneme;
step four, calculating the average value of the posterior probability of each phoneme according to the time boundary of each phoneme obtained in the step three and the posterior probability of each frame obtained in the step two, splicing the average value of the posterior probability of each phoneme with the characteristic information of the vowel consonant, the part of speech, the tone and the pronunciation duration of the phoneme, and sending the spliced value into a scoring model to obtain the score of the phoneme;
step five, aligning the phonemes of the text recognized in step one with those of the reference text, and determining which phonemes are multi-read and which are missed;
and step six, calculating the final score, and calculating the score of the word and the score of the whole sentence according to the multi-reading and missing-reading conditions in the step five.
2. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the acoustic features extracted in step one are Fbank features.
3. The pronunciation assessment scoring method based on deep learning of claim 1, wherein the speech recognition model in step one uses a wenet model.
4. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the forced alignment in step three uses the text result recognized in step one.
5. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein step three is performed by using greedy or Viterbi algorithm to find a path with the largest probability and finally determine the time boundary of each phoneme.
6. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the characteristics used by the step four scoring model are the average of the posterior probabilities of the phoneme and the vowel, part of speech, tone, pronunciation duration characteristics of the phoneme.
7. The pronunciation assessment scoring method based on deep learning according to claim 1, wherein the phoneme alignment in step five is implemented by calculating an edit distance: firstly converting the recognized text result into the corresponding phoneme list X[1, …, n] and the reference text into the corresponding phoneme list Y[1, …, m], and defining the distance D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m);
and solving by a dynamic programming method, wherein the state transition equations are:
D(i, 0) = i
D(0, j) = j
D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost(i, j)), with cost(i, j) = 0 if X[i] = Y[j] and 1 otherwise;
wherein the D(i-1, j) + 1 case (insert) represents a multi-read phoneme, the D(i, j-1) + 1 case (delete) represents a missed phoneme, and the zero-cost diagonal case (norm) represents a normal phoneme;
and tracing the result back by recording which sub-problem each cell was solved from, finally obtaining the alignment result of the two phoneme strings.
8. The pronunciation assessment scoring method based on deep learning as claimed in claim 1, wherein the score of the word is calculated by using the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, and del_{cnt} denotes the number of missed phonemes;
the score of the sentence is calculated by the following formula:
wherein score_{norm_phone} denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_{cnt} denotes the number of normal phonemes, ins_{cnt} denotes the number of multi-read phonemes, del_{cnt} denotes the number of missed phonemes, frame_{word} denotes the number of frames occupied by all words, and frame_{sil} denotes the number of silent frames between words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211085643.1A CN115440193A (en) | 2022-09-06 | 2022-09-06 | Pronunciation evaluation scoring method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115440193A true CN115440193A (en) | 2022-12-06 |
Family
ID=84247794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211085643.1A Pending CN115440193A (en) | 2022-09-06 | 2022-09-06 | Pronunciation evaluation scoring method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115440193A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116403604A (en) * | 2023-06-07 | 2023-07-07 | 北京奇趣万物科技有限公司 | Child reading ability evaluation method and system |
CN116403604B (en) * | 2023-06-07 | 2023-11-03 | 北京奇趣万物科技有限公司 | Child reading ability evaluation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |