CN115440193A - Pronunciation evaluation scoring method based on deep learning - Google Patents

Pronunciation evaluation scoring method based on deep learning

Info

Publication number: CN115440193A
Application number: CN202211085643.1A
Authority: CN (China)
Prior art keywords: phoneme, phonemes, score, pronunciation, deep learning
Legal status: Pending (an assumption by Google Patents, not a legal conclusion)
Original language: Chinese (zh)
Inventors: 王龙标, 李志刚, 关昊天, 王宇光
Current and Original Assignee: Suzhou Zhiyan Information Technology Co ltd
Filing date: 2022-09-06
Publication date: 2022-12-06

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning. A speech recognition model first recognizes the real text of the audio. An HMM-DNN model then produces the posterior probabilities of the audio, and each phoneme is finally scored by a scoring model. Because the correct text of the audio is recognized by the speech recognition model before forced alignment, the method avoids the problem that audio cannot be aligned to the correct position when the audio and the text are inconsistent. Meanwhile, the scoring model is built with a deep neural network, so it can fit multiple kinds of information such as posterior probability, vowel/consonant type, part of speech, tone, and pronunciation duration, making the phoneme scores more reasonable and accurate.

Description

Pronunciation evaluation scoring method based on deep learning
Technical Field
The invention relates to the technical field of speech evaluation, and in particular to a pronunciation evaluation scoring method based on deep learning.
Background
Spoken language is receiving more and more attention in language education. One-on-one communication and teaching between teacher and student is the most effective way to improve spoken English, but it can hardly satisfy the demand of the large number of spoken-language learners. With the rapid progress of computer technology and pronunciation evaluation technology, various spoken-language evaluation schemes based on artificial intelligence have been deployed one after another. Such systems provide students with additional learning opportunities and rich learning materials; they can assist or replace teachers in guiding students through more targeted pronunciation exercises, point out students' pronunciation errors, provide effective diagnostic feedback, and evaluate students' overall pronunciation level, effectively improving students' spoken-language learning efficiency and proficiency.
The current mainstream method for pronunciation evaluation obtains the posterior probabilities of the speech from a hidden Markov model and deep neural network (HMM-DNN) acoustic model, performs forced alignment against the evaluation text, and scores with the Goodness of Pronunciation (GOP) method.
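The GOP formula itself is not reproduced in the patent; for reference, one common DNN-era formulation scores a phoneme p by its average frame-level log posterior over the force-aligned segment [t_s, t_e]:

```latex
% Common DNN-era GOP formulation (supplied for reference; not spelled out in the patent):
% the average log posterior of phoneme p over its force-aligned frames t_s..t_e,
% where o_t is the acoustic observation at frame t.
\mathrm{GOP}(p) = \frac{1}{t_e - t_s + 1} \sum_{t = t_s}^{t_e} \log P\!\left(p \mid \mathbf{o}_t\right)
```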
The forced alignment method can achieve high accuracy, but only under one precondition: the given text and the audio must match. If a user reads "I am a teacher" as "I was a teacher", then when the audio segment corresponding to "was" is processed it may be wrongly matched against the phonemes of "am", and the subsequent "a" and "teacher" are likely not aligned to the correct positions, which degrades scoring accuracy.
Disclosure of Invention
In order to solve these problems, the invention provides a speech evaluation scoring method based on deep learning. First, the text of the audio is recognized by a speech recognition model; the recognized text is then used for forced alignment, making the alignment result more accurate. Finally, phoneme scores are predicted by a scoring model built from a deep neural network, and word and sentence scores are calculated from the phoneme scores.
The pronunciation evaluation scoring method based on deep learning first uses a speech recognition model to recognize the real text of the audio. Next, an HMM-DNN model is used to obtain the posterior probabilities of the audio. Then, the recognized text and the posterior probabilities are used for forced alignment to determine the time boundary of each phoneme. Finally, each phoneme is scored by a scoring model.
The specific technical scheme is as follows:
Step one: extract acoustic features from the speech to be evaluated, feed them into a speech recognition model, and recognize the real text of the speech.
Step two: feed the acoustic features extracted in step one into an HMM-DNN model and predict the posterior probability of each frame.
Step three: perform forced alignment according to the text recognized in step one and the per-frame posterior probabilities obtained in step two, and determine the time boundary of each phoneme.
Step four: according to the phoneme time boundaries from step three and the per-frame posteriors from step two, compute the average posterior probability of each phoneme; splice this average with the phoneme's feature information such as vowel/consonant type, part of speech, tone, and pronunciation duration, and feed the spliced vector into a scoring model to obtain the phoneme's score.
Step five: align the phonemes of the text recognized in step one against the reference text, and determine which phonemes are multi-read (inserted) and which are missed (skipped).
Step six: compute the final scores; according to the multi-read and missed phonemes found in step five, calculate the score of each word and of the whole sentence.
Advantageous effects
Because the correct text of the audio is recognized by the speech recognition model before forced alignment, the invention avoids the problem that audio cannot be aligned to the correct position when the audio and the text are inconsistent. Meanwhile, the scoring model built with a deep neural network can fit multiple kinds of information such as posterior probability, vowel/consonant type, part of speech, tone, and pronunciation duration, making the phoneme scores more reasonable and accurate.
1. Recognizing the correct text of the audio with the speech recognition model avoids the problem that audio cannot be aligned to the correct position when the audio and the text are inconsistent during forced alignment.
2. Building the scoring model with a deep neural network allows it to fit multiple kinds of information such as posterior probability, vowel/consonant type, part of speech, tone, and pronunciation duration, making the phoneme scores more reasonable and accurate.
Drawings
FIG. 1 is a schematic flow diagram.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the deep-learning-based pronunciation evaluation method of the present invention. As shown in FIG. 1, the method mainly comprises the following steps:
Step one: extract acoustic features from the speech to be evaluated. The extracted features may be Fbank features; when extracting them, the sampling frequency is 16000 Hz, the window length is set to 25 ms, and the frame shift is set to 10 ms. The features are then fed into a speech recognition model, which may be a WeNet model, to recognize the real text of the speech to be evaluated.
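A minimal sketch of this feature-extraction step using torchaudio's Kaldi-compatible frontend, with the 16 kHz / 25 ms / 10 ms settings from the text; the filterbank size (80) and the file name are illustrative assumptions, since the patent fixes neither.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a mono 16 kHz recording of the speech to be evaluated (file name assumed).
waveform, sample_rate = torchaudio.load("utterance.wav")
assert sample_rate == 16000, "the method assumes a 16 kHz sampling frequency"

fbank = kaldi.fbank(
    waveform,
    num_mel_bins=80,          # assumed; the patent does not fix the filterbank size
    frame_length=25.0,        # 25 ms window, as specified
    frame_shift=10.0,         # 10 ms frame shift, as specified
    sample_frequency=16000.0,
)
print(fbank.shape)            # (number of frames, num_mel_bins)
```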
Step two: feed the acoustic features (e.g., the Fbank features) extracted in step one into an HMM-DNN model and predict the posterior probability of each frame. Assuming m frames in total and n phonemes, this finally yields an m × n matrix of posterior probabilities.
Step three: perform forced alignment according to the text recognized in step one and the per-frame posteriors obtained in step two. The alignment uses a greedy or Viterbi algorithm to find the path with the maximum probability, and finally determines the time boundary of each phoneme. A simplified sketch is given below.
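The sketch below is a simplified Viterbi-style alignment over the m × n log-posterior matrix: it monotonically assigns each frame to one phoneme of the expected sequence so as to maximize the summed log posterior. It is an assumption-laden reduction, ignoring HMM state topology and transition probabilities that a full implementation would include, and it assumes at least as many frames as phonemes.

```python
import numpy as np

def force_align(log_post: np.ndarray, phone_ids: list[int]) -> dict[int, list[int]]:
    """Monotonic frame-to-phoneme alignment maximizing summed log posteriors.
    log_post: (m, n) per-frame log posteriors; phone_ids: expected phoneme ids."""
    m, k = log_post.shape[0], len(phone_ids)
    score = np.full((m, k), -np.inf)
    back = np.zeros((m, k), dtype=int)
    score[0, 0] = log_post[0, phone_ids[0]]
    for t in range(1, m):
        for j in range(k):
            stay = score[t - 1, j]                            # remain in phoneme j
            move = score[t - 1, j - 1] if j > 0 else -np.inf  # advance from phoneme j-1
            prev, back[t, j] = (move, j - 1) if move > stay else (stay, j)
            score[t, j] = prev + log_post[t, phone_ids[j]]
    # Backtrace from the last phoneme so the whole expected sequence is consumed.
    path = [k - 1]
    for t in range(m - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    bounds: dict[int, list[int]] = {}
    for t, j in enumerate(path):
        bounds.setdefault(j, [t, t])[1] = t
    return bounds  # phoneme index -> [first frame, last frame]
```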
Step four: according to the phoneme time boundaries from step three and the per-frame posteriors from step two, compute the average posterior probability of each phoneme.
Then the feature information of each phoneme is obtained, namely its vowel/consonant type, part of speech, tone, and pronunciation duration features.
1. Vowel/consonant feature: whether the current phoneme is a vowel or a consonant. Phonemes are weighted differently by type, with vowels given more importance.
2. Part-of-speech feature: the part of speech of the word containing the current phoneme. Scores differ with the importance of the part of speech; content words such as verbs and nouns carry more weight than function words.
3. Tone feature: whether the current phoneme carries a tone. If tone information is marked in the reference text but no tone is produced in the audio, the final score is reduced.
4. Pronunciation duration feature: the duration of the current phoneme, normalized by the duration of the standard phoneme. If the normalized duration is too low or too high, the score is reduced.
Finally, the phoneme's average posterior probability is spliced with its feature information (vowel/consonant type, part of speech, tone, pronunciation duration, etc.), and the spliced vector is fed into the scoring model to obtain the phoneme's score. A sketch of this splicing follows.
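A sketch of the step-four feature splicing. The patent lists the feature types but not their numeric encodings, so the scalar encodings below (binary vowel flag, part-of-speech weight, binary tone flag, duration ratio) and the function name are assumptions.

```python
import numpy as np

def phoneme_feature_vector(log_post: np.ndarray, phone_id: int,
                           start: int, end: int,
                           is_vowel: bool, pos_weight: float,
                           has_tone: bool, ref_duration_frames: int) -> np.ndarray:
    """Splice the mean posterior of one phoneme with its auxiliary features."""
    frames = log_post[start:end + 1, phone_id]
    mean_posterior = float(np.exp(frames).mean())             # average posterior over the segment
    norm_duration = (end - start + 1) / ref_duration_frames   # duration vs. the standard phoneme
    return np.array([
        mean_posterior,
        1.0 if is_vowel else 0.0,   # vowel/consonant feature
        pos_weight,                 # e.g. content words weighted above function words
        1.0 if has_tone else 0.0,   # tone feature
        norm_duration,              # pronunciation duration feature
    ], dtype=np.float32)
```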
The scoring model may be built with a DNN: a three-layer deep neural network whose hidden layers have dimensionality 128 and whose last layer has dimensionality 1, outputting the final score.
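A minimal PyTorch sketch of such a scoring model: three fully connected layers, hidden dimensionality 128, scalar output. The ReLU activations, the 5-dimensional input (matching the spliced vector sketched above), and the absence of output squashing are assumptions the patent does not fix.

```python
import torch
import torch.nn as nn

class PhonemeScorer(nn.Module):
    """Three-layer DNN: hidden dimensionality 128, final layer of dimensionality 1."""
    def __init__(self, in_dim: int = 5):  # 5 = spliced feature vector sketched above
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),   # last layer outputs the phoneme score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

scorer = PhonemeScorer()
scores = scorer(torch.randn(4, 5))  # a batch of four spliced phoneme feature vectors
```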
Modeling the various phoneme features with a deep neural network fits a complex scoring function, making the phoneme scores more reasonable and accurate.
Step five: align the phonemes of the text recognized in step one against the reference text, and determine which phonemes are multi-read, which are missed, and which are normal.
The alignment may be implemented by computing the edit distance. First convert the recognized text into its corresponding phoneme string X[1, …, n] and the reference text into its corresponding phoneme string Y[1, …, m]; define D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m). It is solved by dynamic programming with the following state transition equations:
D(i, 0) = i
D(0, j) = j
D(i, j) = min{ D(i-1, j) + 1 (insert), D(i, j-1) + 1 (delete), D(i-1, j-1) + c(i, j) (norm) }
where c(i, j) = 0 if X[i] = Y[j] and c(i, j) = 1 otherwise.
wherein insert denotes a multi-read (inserted) phoneme, delete denotes a missed phoneme, and norm denotes a normal phoneme.
By recording which subproblem each result was solved from, the solution can be backtracked to finally obtain the alignment of the two phoneme strings, as sketched below.
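A sketch of this step over hypothetical phoneme lists: it fills the edit-distance table and backtracks to label each position norm (normal), ins (multi-read) or del (missed). Substitutions are decomposed into an insertion plus a deletion here (substitution cost 2), which is one possible reading of the patent's three-label scheme.

```python
def align_phonemes(recognized: list[str], reference: list[str]):
    """Label phonemes as norm / ins (multi-read) / del (missed) via edit distance."""
    n, m = len(recognized), len(reference)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if recognized[i - 1] == reference[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,            # insertion: multi-read phoneme
                          D[i][j - 1] + 1,            # deletion: missed phoneme
                          D[i - 1][j - 1] + match)    # norm: matching phoneme
    ops, i, j = [], n, m                              # backtrack by re-checking each move
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and recognized[i - 1] == reference[j - 1]
                and D[i][j] == D[i - 1][j - 1]):
            ops.append(("norm", reference[j - 1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("ins", recognized[i - 1])); i -= 1
        else:
            ops.append(("del", reference[j - 1])); j -= 1
    return list(reversed(ops))

# Toy usage with hypothetical phoneme symbols ("I was" read against "I am"):
print(align_phonemes(["AY", "W", "AA", "Z"], ["AY", "AE", "M"]))
```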
Step six: compute the final scores; according to the multi-read and missed phonemes found in step five, calculate the score of each word and of the whole sentence.
The score of a word depends on the scores of all normal phonemes in the word and on the proportion of multi-read and missed phonemes. The score of a sentence depends not only on the scores of all normal phonemes in the sentence and the multi-read/missed proportion, but also on the number of frames occupied by the words and the number of silent frames between words; only silences longer than a normal pause are counted toward the inter-word silent frames.
The score of the word is calculated by the following formula:
score_word = Σ score_norm_phone / (n + ins_cnt)
(The original formula appears only as an unrendered image; the form above is one reading consistent with the variable definitions below, in which missed phonemes contribute nothing to the numerator and multi-read phonemes enlarge the denominator.)
wherein score_norm_phone denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_cnt denotes the number of normal phonemes, ins_cnt denotes the number of multi-read phonemes, and del_cnt denotes the number of missed phonemes.
The score of the sentence is calculated by the following formula:
score_sentence = (Σ score_norm_phone / (n + ins_cnt)) × frame_word / (frame_word + frame_sil)
(Again, the original formula is an unrendered image; the form above is one reading consistent with the variable definitions and the stated dependence on word frames and inter-word silence.)
wherein score_norm_phone denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_cnt denotes the number of normal phonemes, ins_cnt denotes the number of multi-read phonemes, del_cnt denotes the number of missed phonemes, frame_word denotes the number of frames occupied by all words, and frame_sil denotes the number of silent frames between words.
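A small sketch implementing the reconstructed aggregation above; since the patent's equations are unrendered images, both functions encode the assumed reading, not a confirmed formula.

```python
def word_score(norm_phone_scores: list[float], n: int, ins_cnt: int) -> float:
    """Assumed reading: missed phonemes score zero; multi-reads enlarge the denominator."""
    return sum(norm_phone_scores) / (n + ins_cnt)

def sentence_score(norm_phone_scores: list[float], n: int, ins_cnt: int,
                   frame_word: int, frame_sil: int) -> float:
    """Assumed reading: the phoneme average is discounted by long inter-word silences."""
    return word_score(norm_phone_scores, n, ins_cnt) * frame_word / (frame_word + frame_sil)
```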

Claims (8)

1. A pronunciation evaluation scoring method based on deep learning, characterized by comprising the following steps:
first, recognizing the real text result of the audio through a speech recognition model;
next, obtaining the posterior probability of the audio through an HMM-DNN model;
then, performing forced alignment using the recognized text of the audio and the posterior probability of the audio, to determine the time boundary of each phoneme;
finally, scoring each phoneme through a scoring model;
the method comprises the following specific steps:
step one, extracting acoustic features of the speech to be evaluated, feeding them into a speech recognition model, and recognizing the real text of the speech to be evaluated;
step two, feeding the acoustic features extracted in step one into an HMM-DNN model and predicting the posterior probability of each frame;
step three, performing forced alignment according to the text recognized in step one and the per-frame posteriors obtained in step two, and determining the time boundary of each phoneme;
step four, computing the average posterior probability of each phoneme according to the phoneme time boundaries from step three and the per-frame posteriors from step two, splicing this average with the phoneme's vowel/consonant, part-of-speech, tone and pronunciation duration feature information, and feeding the spliced vector into a scoring model to obtain the phoneme's score;
step five, aligning the phonemes of the text recognized in step one against the reference text, and determining which phonemes are multi-read and which are missed;
step six, computing the final scores, and calculating the score of each word and of the whole sentence according to the multi-read and missed phonemes found in step five.
2. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the acoustic features extracted in step one are Fbank features.
3. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the speech recognition model in step one is a WeNet model.
4. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the forced alignment in step three uses the text recognized in step one.
5. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein step three uses a greedy or Viterbi algorithm to find the path with the maximum probability and finally determine the time boundary of each phoneme.
6. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the features used by the scoring model in step four are the average posterior probability of the phoneme and the phoneme's vowel/consonant, part-of-speech, tone and pronunciation duration features.
7. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the phoneme alignment in step five is implemented by computing the edit distance: first converting the recognized text into the corresponding phoneme string X[1, …, n] and the reference text into the corresponding phoneme string Y[1, …, m], and defining D(i, j) as the distance between X[1, …, i] and Y[1, …, j], so that the edit distance between X and Y is D(n, m);
and solving by using a dynamic programming method, wherein a state transition equation is as follows:
D(i, 0) = i
D(0, j) = j
D(i, j) = min{ D(i-1, j) + 1 (insert), D(i, j-1) + 1 (delete), D(i-1, j-1) + c(i, j) (norm) }
where c(i, j) = 0 if X[i] = Y[j] and c(i, j) = 1 otherwise;
wherein insert denotes a multi-read phoneme, delete denotes a missed phoneme, and norm denotes a normal phoneme;
and backtracking, by recording which subproblem each result was solved from, to finally obtain the alignment of the two phoneme strings.
8. The pronunciation evaluation scoring method based on deep learning according to claim 1, wherein the score of a word is calculated by the following formula:
score_word = Σ score_norm_phone / (n + ins_cnt)
(one reading of the unrendered original formula, consistent with the variable definitions below)
wherein score_norm_phone denotes the score of a normal phoneme in the word, n denotes the number of all phonemes in the reference text, nor_cnt denotes the number of normal phonemes, ins_cnt denotes the number of multi-read phonemes, and del_cnt denotes the number of missed phonemes;
the score of the sentence is calculated by the following formula:
score_sentence = (Σ score_norm_phone / (n + ins_cnt)) × frame_word / (frame_word + frame_sil)
(one reading of the unrendered original formula, consistent with the variable definitions below)
wherein score_norm_phone denotes the score of a normal phoneme in the sentence, n denotes the number of all phonemes in the reference text, nor_cnt denotes the number of normal phonemes, ins_cnt denotes the number of multi-read phonemes, del_cnt denotes the number of missed phonemes, frame_word denotes the number of frames occupied by all words, and frame_sil denotes the number of silent frames between words.
Application CN202211085643.1A, priority date 2022-09-06, filing date 2022-09-06: Pronunciation evaluation scoring method based on deep learning. Status: Pending. Publication: CN115440193A (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211085643.1A | 2022-09-06 | 2022-09-06 | Pronunciation evaluation scoring method based on deep learning


Publications (1)

Publication Number | Publication Date
CN115440193A | 2022-12-06

Family

ID=84247794

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202211085643.1A | Pronunciation evaluation scoring method based on deep learning | 2022-09-06 | 2022-09-06 | Pending

Country Status (1)

Country Link
CN (1) CN115440193A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403604A (en) * 2023-06-07 2023-07-07 北京奇趣万物科技有限公司 Child reading ability evaluation method and system
CN116403604B (en) * 2023-06-07 2023-11-03 北京奇趣万物科技有限公司 Child reading ability evaluation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination