CN109545243B - Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium


Info

Publication number: CN109545243B
Application number: CN201910062339.7A
Authority: CN (China)
Prior art keywords: phoneme; evaluation value; evaluated; fluency; pronunciation
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN109545243A
Inventors: 刘顺鹏, 钟贵平, 李宝祥
Current assignee: Beijing Orion Star Technology Co Ltd
Original assignee: Beijing Orion Star Technology Co Ltd
Application filed by Beijing Orion Star Technology Co Ltd; priority to CN201910062339.7A
Publication of application CN109545243A; grant published as CN109545243B


Classifications

    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for measuring the quality of voice signals
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the technical field of speech recognition, and discloses a pronunciation quality evaluation method and device, electronic equipment and a storage medium. The method comprises the following steps: in the speech to be evaluated, determining the audio frames corresponding to each phoneme of a reference text and the matching probability of each phoneme with its corresponding audio frames, the reference text being the text corresponding to the speech to be evaluated; for each phoneme, calculating a pronunciation accuracy evaluation value of the phoneme according to the matching probability corresponding to the phoneme and the audio frames corresponding to the phoneme; and obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme. By setting a weight value for each phoneme, the technical scheme provided by the embodiment of the invention widens the gap in accuracy evaluation values between well-pronounced and poorly-pronounced speech and improves the accuracy and reliability of pronunciation quality evaluation.

Description

Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a pronunciation quality evaluation method and device, electronic equipment and a storage medium.
Background
With the development of the internet, internet-based language learning applications have also developed rapidly. In language learning, besides grammar and vocabulary, an important aspect is listening and speaking ability, particularly spoken-language ability. In existing language learning applications, a user records speech through a recording device at the user terminal, and the system compares the recorded speech with an existing acoustic model according to the reference text corresponding to the speech, providing the user with a pronunciation score for the whole recorded sentence and feedback on whether each word is pronounced correctly. The accuracy of the pronunciation evaluation method therefore directly affects the user's learning outcome.
Currently, spoken-language pronunciation is evaluated mainly with the GOP (Goodness of Pronunciation) algorithm: a pronunciation accuracy evaluation value is calculated with the GOP algorithm for each phoneme of the reference text corresponding to the speech recorded by the user; the pronunciation accuracy evaluation values of the phonemes in each word are averaged to obtain the word's pronunciation accuracy evaluation value; and the average of the pronunciation accuracy evaluation values of all words in the reference text is taken as the pronunciation score of the speech. However, the conventional GOP algorithm yields an accuracy evaluation value per word, and the granularity of a word is relatively coarse, so a more detailed quality evaluation result cannot be reflected; the pronunciation quality evaluation result is therefore not accurate enough and has relatively low reliability.
Disclosure of Invention
The embodiment of the invention provides a pronunciation quality evaluation method and device, electronic equipment and a storage medium, and aims to solve the problems of inaccurate pronunciation quality evaluation and low reliability in the prior art.
In a first aspect, an embodiment of the present invention provides a pronunciation quality evaluation method, including:
in the speech to be evaluated, determining the audio frames corresponding to each phoneme of a reference text and the matching probability of each phoneme with its corresponding audio frames, wherein the reference text is the text corresponding to the speech to be evaluated;
calculating a pronunciation accuracy evaluation value of each phoneme according to the matching probability corresponding to the phoneme and the audio frame corresponding to the phoneme;
and obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and the weight value determined for each phoneme in advance.
In a second aspect, an embodiment of the present invention provides a pronunciation quality evaluation apparatus, including:
the determining module is used for determining, in the speech to be evaluated, the audio frames corresponding to each phoneme of the reference text and the matching probability of each phoneme with its corresponding audio frames, wherein the reference text is the text corresponding to the speech to be evaluated;
the phoneme accuracy calculation module is used for calculating a pronunciation accuracy evaluation value of each phoneme according to the matching probability corresponding to the phoneme and the audio frame corresponding to the phoneme;
and the accuracy calculation module is used for obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and the weight value determined for each phoneme in advance.
In a third aspect, an embodiment of the present invention provides an electronic device, including a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the transceiver is configured to receive and transmit data under the control of the processor, and the processor implements the steps of any one of the methods when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which computer program instructions are stored, which computer program instructions, when executed by a processor, implement the steps of any one of the methods described above.
According to the technical scheme provided by the embodiment of the invention, when the pronunciation accuracy of a word or sentence is evaluated based on the pronunciation accuracy of its phonemes, a weight value is set for each phoneme, which raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap in accuracy evaluation values between well-pronounced and poorly-pronounced speech, and improves the accuracy and reliability of pronunciation quality evaluation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a pronunciation quality evaluation method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a pronunciation quality evaluation method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a pronunciation quality evaluation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a pronunciation quality evaluation device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a pronunciation quality evaluation device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
the GOP (Pronunciation accuracy) algorithm is proposed by Silke Witt, university of Massachusetts, in his doctor's paper. The basic idea of the GOP algorithm is to utilize a reference text known in advance, make a forced alignment (force alignment) between a speech and the reference text corresponding to the speech, identify a speech segment (i.e. a plurality of continuous audio frames in the speech) corresponding to each phoneme in the reference text, and then calculate the matching probability of the phoneme in the reference text corresponding to the speech segment on the premise that the speech segment is observed, wherein the higher the matching probability is, the more accurate the pronunciation is, the lower the matching probability is, the worse the pronunciation is. Intuitively, the GOP algorithm calculates the likelihood that the input speech corresponds to a known word, and if the likelihood is higher, the pronunciation is more standard.
A phone (phoneme) is the smallest unit of speech, obtained by analyzing the articulatory actions within a syllable: one action constitutes one phoneme. Phonemes fall into two broad categories, vowels (such as a, e, ai) and consonants (such as p, t, h).
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In practice, spoken-language pronunciation evaluation mainly adopts the GOP algorithm: the pronunciation accuracy evaluation value of each phoneme of the reference text corresponding to the user's recorded speech is calculated with the GOP algorithm, the values of the phonemes within each word are averaged to obtain the word's pronunciation accuracy evaluation value, and the average over all words in the reference text is taken as the pronunciation score of the speech. The inventors found that when the GOP algorithm is used to score word pronunciation, its output is sensitive to the pronunciation accuracy of some phonemes (good and bad pronunciations of those phonemes score very differently) and insensitive to that of others (good and bad pronunciations score almost the same). For example, if speaker A pronounces a given vowel well and speaker B pronounces it poorly, A's score is much higher than B's; but if A pronounces a given consonant well and B pronounces it poorly, their scores differ only slightly. Consequently, if the existing method of averaging the phonemes' pronunciation accuracy evaluation values is used to calculate the pronunciation accuracy evaluation value of a speech, the share of the sensitive phonemes in the final value is diluted, the score gap between well-pronounced and poorly-pronounced speech shrinks, and the accuracy of the evaluation value falls. The inventors also found that the current evaluation index covers only the single dimension of pronunciation accuracy: the existing GOP algorithm yields a per-word accuracy and simply averages the accuracies of all words in a sentence, which easily ignores the relations between words during pronunciation, so a fluent, complete utterance and a less fluent, less complete one receive poorly separated scores, and the pronunciation score is not objective or accurate enough.
For this reason, the inventors considered that, when the pronunciation accuracy of a word or sentence is evaluated from the pronunciation accuracy of its phonemes, setting a weight value for each phoneme raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap in accuracy evaluation values between well-pronounced and poorly-pronounced speech, and improves the accuracy and reliability of pronunciation quality evaluation. In addition, once the accuracy evaluation value of the speech to be evaluated is obtained, an integrity evaluation value and a fluency evaluation value of its pronunciation are also determined, and the pronunciation score of the speech is determined by combining the accuracy, integrity and fluency evaluation values. The introduced integrity and fluency evaluation values take full account of the relations between words during pronunciation, yielding sentence-level scoring indices; combining the word-level and sentence-level indices makes the pronunciation score more comprehensive, objective and accurate, and improves its reliability.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic view of an application scenario of the pronunciation quality evaluation method according to an embodiment of the present invention. The user 10 interacts with the user terminal 11 through an application program in the user terminal 11. The user terminal 11 can display or play a reference text, and the user 10 reads the reference text aloud; the user terminal 11 then starts, through the application program, a voice collecting device 12 (such as a microphone) built into or externally connected to the user terminal 11, and collects the user's reading of the reference text as the speech to be evaluated. The application program sends the speech to be evaluated and the reference text to the server 13; the server 13 evaluates the pronunciation quality of the speech according to the speech and the reference text, obtains the pronunciation score of the speech, and feeds the score back to the user terminal 11, which displays it.
In this application scenario, the user terminal 11 and the server 13 are communicatively connected through a network, which may be a local area network, a wide area network, or the like. The user terminal 11 may be a portable device (e.g., a mobile phone, a tablet or a notebook computer) or a personal computer (PC); mobile phones, tablets and notebook computers generally have a built-in microphone, while a PC can collect the user's voice through an external voice collecting device. The server 13 may be any device capable of providing speech recognition and pronunciation quality evaluation services.
In addition, the pronunciation quality evaluation method provided by the embodiment of the invention can also be executed locally at the user terminal. Specifically, the user 10 interacts with the user terminal 11 through an application program in the user terminal 11; the user terminal 11 displays or plays a reference text, and the user 10 reads it aloud. The user terminal 11 then starts the voice collecting device 12 (such as a microphone) built into or externally connected to it through the application program, collects the user's reading of the reference text as the speech to be evaluated, evaluates the pronunciation quality of the speech according to the speech and the reference text, obtains the pronunciation score of the speech, and displays it.
The following describes a technical solution provided by an embodiment of the present invention with reference to an application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present invention provides a pronunciation quality evaluation method, including the following steps:
s201, in the speech to be evaluated, determining an audio frame corresponding to each phoneme corresponding to a reference text and the matching probability of each phoneme and the corresponding audio frame, wherein the reference text is the reference text corresponding to the speech to be evaluated.
In this embodiment, the reference text is usually a complete sentence and contains at least one word. The phoneme string corresponding to the reference text can be determined by looking up a pronunciation dictionary. For example, if the reference text is "good morning", the corresponding phoneme string contains eight phonemes: [g], [u], [d], [m], [ɔː], [n], [i], [ŋ]. If the reference text is "hello" in Chinese ("你好"), the corresponding phoneme string contains four phonemes: [n], [i], [h], [ao]. In specific implementation, the pronunciation dictionary corresponding to the language of the speech to be evaluated is selected; for example, if the language to be evaluated is English, an English pronunciation dictionary is selected.
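As a rough illustration only, the dictionary lookup described above might be sketched as follows (Python; the dictionary entries and the ASCII stand-ins for the IPA phoneme symbols are placeholders, not the patent's actual lexicon):

```python
# A minimal sketch of expanding a reference text into its phoneme string
# via a pronunciation dictionary. The entries below are illustrative only.
PRONUNCIATION_DICT = {
    "good": ["g", "u", "d"],
    "morning": ["m", "ao", "n", "i", "ng"],  # ASCII stand-ins for IPA symbols
}

def text_to_phonemes(reference_text: str) -> list[str]:
    """Look up each word of the reference text and concatenate its phonemes."""
    phonemes = []
    for word in reference_text.lower().split():
        if word not in PRONUNCIATION_DICT:
            raise KeyError(f"word not in pronunciation dictionary: {word}")
        phonemes.extend(PRONUNCIATION_DICT[word])
    return phonemes

print(text_to_phonemes("good morning"))  # 8 phonemes, as in the example above
```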
In specific implementation, step S201 may be implemented by alignment processing. Before alignment, the speech to be evaluated is preprocessed: it is divided into audio frames and an acoustic feature vector, a multi-dimensional feature vector, is extracted for each frame, so that each frame of audio is represented by one multi-dimensional feature vector and the speech to be evaluated is converted into a sequence of audio frames. Typically 10-30 ms is taken as one frame; framing can be implemented with a moving window function, with adjacent frames overlapping so that the window boundaries do not miss signal. The extracted acoustic features may be Fbank features, MFCC (Mel-Frequency Cepstral Coefficients) features, spectrogram features, and the like. Methods for extracting Fbank and MFCC features are prior art and are not described in detail.
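The preprocessing step can be pictured with the following sketch, which assumes 16 kHz mono audio, 25 ms frames with a 10 ms hop, and MFCC features extracted with the librosa library; the exact frame length, hop and feature type are implementation choices, not fixed by the patent:

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Convert speech into a sequence of multi-dimensional feature vectors."""
    signal, sr = librosa.load(wav_path, sr=16000)  # mono, 16 kHz
    frame_len = int(0.025 * sr)                    # 25 ms window
    hop_len = int(0.010 * sr)                      # 10 ms hop, so frames overlap
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len)
    return mfcc.T                                  # shape: (n_frames, n_mfcc)
```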
In specific implementation, the alignment process is roughly as follows. The acoustic feature vectors of the speech to be evaluated are input into the alignment model to obtain a conditional probability matrix, which describes, for each audio frame, the conditional probability of that frame being recognized as each phoneme; for one audio frame it may include, for example, the conditional probability that the frame is recognized as [u] and the conditional probability that it is recognized as [ɔː]. The conditional probability matrix is then input into a decoder for path search, with the phoneme string of the reference text as the constraint during the search, yielding the audio frames corresponding to each phoneme in that phoneme string; generally one phoneme corresponds to several consecutive audio frames of the speech to be evaluated, and the decoder is constructed in advance from all the phonemes. The alignment model may be implemented with a deep neural network (DNN)-HMM model, or with a convolutional neural network (CNN) plus LSTM (Long Short-Term Memory) network. The state transition probabilities used during decoding may be determined by a pre-trained Gaussian mixture model (GMM)-hidden Markov model (HMM). Since the correspondence between each phoneme of the reference text and the audio frames has been determined, for each phoneme the conditional probabilities between that phoneme and its corresponding audio frames can be read from the conditional probability matrix, and the matching probability of the phoneme with its corresponding frames determined from them. For example, if the phoneme [u] corresponds to 10 audio frames, the 10 conditional probabilities between those frames and [u] are read from the conditional probability matrix, and their average, maximum or median is taken as the matching probability of [u] with its corresponding audio frames.
In this embodiment, the speech to be evaluated and the reference text can be aligned through the alignment model, thereby determining the correspondence between each phoneme of each word in the reference text and a portion of the speech to be evaluated (i.e. several consecutive audio frames).
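A sketch of how the matching probability could be read off once the alignment is known; the posterior matrix layout, the span convention and the function names are assumptions for illustration:

```python
import numpy as np

def matching_probability(posteriors: np.ndarray, phoneme_id: int,
                         span: tuple[int, int], reduce: str = "mean") -> float:
    """posteriors: conditional probability matrix, shape (n_frames, n_phonemes).
    span: half-open frame range [start, end) the decoder aligned to the phoneme."""
    start, end = span
    frame_probs = posteriors[start:end, phoneme_id]
    if reduce == "mean":                   # average of the per-frame probabilities
        return float(frame_probs.mean())
    if reduce == "max":                    # or their maximum
        return float(frame_probs.max())
    return float(np.median(frame_probs))   # or their median, as the text allows
```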
S202, for each phoneme, calculating the pronunciation accuracy evaluation value of the phoneme according to the matching probability corresponding to the phoneme and the audio frames corresponding to the phoneme.
In a specific embodiment, the GOP value may be used as the pronunciation accuracy evaluation value. Specifically, the GOP value of a phoneme can be calculated by the following formula:

$$\mathrm{GOP}(p) = \frac{\log P(p \mid o)}{NF(p)}$$

where $p$ is a phoneme in the reference text, $P(p \mid o)$ is the matching probability corresponding to the phoneme $p$, $NF(p)$ is the number of audio frames corresponding to the phoneme $p$, and $o$ denotes the audio frames corresponding to the phoneme $p$.
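A direct transcription of this formula; since the worked examples later in the text quote probability-like values in [0, 1], any rescaling of this log score onto that range is left open here:

```python
import math

def gop(matching_prob: float, n_frames: int) -> float:
    """GOP(p) = log P(p|o) / NF(p); the small floor avoids log(0)."""
    return math.log(max(matching_prob, 1e-10)) / n_frames
```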
S203, obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and the weight value determined for each phoneme in advance.
In specific implementation, the pronunciation accuracy evaluation value corresponding to each phoneme may be weighted according to a weight value determined for each phoneme in advance, so as to obtain an accuracy evaluation value of the speech to be evaluated.
For example, the phonemes of the word good are [g], [u], [d]; assume the weight values of [g] and [d] are both 0.15 and the weight value of [u] is 0.7. After user A, whose pronunciation is better, speaks the word good, the GOP values of [g], [u], [d] are 0.9, 0.8 and 0.8, and the weighted accuracy evaluation value of the speech is 0.815. After user B, whose pronunciation is worse, speaks the word good, the GOP values of [g], [u], [d] are 0.85, 0.6 and 0.8, and the weighted accuracy evaluation value is 0.6675. Without the weight values, user A's accuracy evaluation value would be 0.83 and user B's 0.75; the two values are close, and good pronunciation cannot be well distinguished from poor pronunciation.
Obviously, when the accuracy evaluation value of the pronunciation of a word or sentence is calculated based on the accuracy of its phonemes, weighting the phonemes enlarges the gap in accuracy evaluation values between well-pronounced and poorly-pronounced speech and improves the accuracy and credibility of the pronunciation quality evaluation.
In this embodiment, the weight value of each phoneme may be determined in advance from test results in the actual application scenario, and is not limited here. For example, tests show that the accuracy evaluation value output by the pronunciation accuracy evaluation algorithm is more sensitive to the accuracy of vowel pronunciation and less sensitive to that of consonant pronunciation; if the existing averaging method were used to calculate the pronunciation accuracy evaluation value of a speech, the share of the vowels in the final value would be diluted, shrinking the score gap between well-pronounced and poorly-pronounced speech and reducing the accuracy of the evaluation value. For this reason, in this embodiment, among the weight values determined in advance for the phonemes, the weight value of a vowel is greater than the weight value of a consonant.
In a specific implementation, all vowels may share one weight value and all consonants another. In this case, the average of the pronunciation accuracy evaluation values of all vowels in the reference text and the average of those of all consonants are calculated, the two averages are weighted, and the weighted result is taken as the accuracy evaluation value of the speech to be evaluated. The specific settings of the vowel weight and the consonant weight are not limited here. Of course, a weight value may also be set individually for each phoneme.
In a specific implementation, a truncation threshold may be set: when the pronunciation accuracy evaluation value of a phoneme is lower than the truncation threshold, that phoneme's weight value is set to 0 when the accuracy evaluation value of the speech to be evaluated is calculated from the per-phoneme values and the predetermined weights. For example, the phonemes of the word good are [g], [u], [d]; after the user speaks the word good, the GOP values of [g], [u], [d] are 0.9, 0.09 and 0.8. With a truncation threshold of 0.1, the weight value of [u] is adjusted to 0, and the accuracy evaluation value of the speech obtained after weighting is 0.57.
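A sketch of this weighting scheme under stated assumptions: a shared vowel weight and consonant weight, and a truncation threshold that zeroes a phoneme's weight. How the zeroed weight mass is handled (renormalized here) is a design choice; the worked example above suggests other conventions are equally possible:

```python
VOWELS = {"a", "e", "i", "o", "u", "ai", "ao"}  # placeholder vowel inventory

def accuracy_score(phonemes, gop_values, vowel_w=0.7, consonant_w=0.15,
                   truncation_threshold=0.1):
    weights = [vowel_w if p in VOWELS else consonant_w for p in phonemes]
    # Zero the weight of phonemes scored below the truncation threshold
    weights = [0.0 if g < truncation_threshold else w
               for w, g in zip(weights, gop_values)]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * g for w, g in zip(weights, gop_values)) / total

print(accuracy_score(["g", "u", "d"], [0.9, 0.8, 0.8]))  # 0.815, user A above
```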
According to the pronunciation quality evaluation method provided by the embodiment of the invention, when the pronunciation accuracy of a word or sentence is evaluated from the pronunciation accuracy of its phonemes, setting a weight value for each phoneme raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap in accuracy evaluation values between well-pronounced and poorly-pronounced speech, and improves the accuracy and reliability of pronunciation quality evaluation.
Based on any of the above embodiments, further, as shown in fig. 3, the method of the embodiment of the present invention further includes the following processing steps:
and S204, determining the integrity evaluation value of the voice to be evaluated.
In this embodiment, the speech to be evaluated may be recognized by an existing speech recognition method and converted into corresponding recognition text.
In specific implementation, the integrity evaluation value of the speech to be evaluated is determined according to the number of words contained in the recognition text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to it, the recognition text being obtained by performing speech recognition on the speech to be evaluated. Specifically, the integrity evaluation value may be determined according to the difference between the number of words in the recognition text and the number of words in the reference text. For example, the integrity evaluation value I of the speech to be evaluated can be calculated by the following formula:
$$I = 1 - \frac{\lvert N - N_0 \rvert}{N_0}$$

where $N_0$ is the number of words contained in the reference text and $N$ is the number of words contained in the recognition text. This formula for the integrity evaluation value is merely an example; other formulas may be used in actual applications.
For example, if the recognition text contains 9 words but the reference text contains 10, one word is evidently missing from the speech recognition result. This may happen because irregular pronunciation caused an individual word to be missed or two words to be recognized as one, or because the user skipped a word when reading the reference text; by the formula above, the integrity evaluation value of the speech to be evaluated is 0.9. If the recognition text contains 11 words but the reference text contains 10, the recognition result has one word too many, possibly because irregular pronunciation caused one word to be recognized as two; by the formula above, the integrity evaluation value is again 0.9.
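A direct transcription of the integrity formula as given above; both worked examples reproduce:

```python
def integrity_score(n_reference_words: int, n_recognized_words: int) -> float:
    """I = 1 - |N - N0| / N0, with N0 words in the reference text."""
    return 1.0 - abs(n_recognized_words - n_reference_words) / n_reference_words

print(integrity_score(10, 9))   # 0.9: one word missing from the recognition text
print(integrity_score(10, 11))  # 0.9: one extra word in the recognition text
```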
S205, determining the fluency evaluation value of the speech to be evaluated.
In specific implementation, the fluency of the speech to be evaluated can be determined as follows: for each phoneme of the reference text, determine the actual pronunciation duration of the phoneme from its corresponding audio frames, and determine the phoneme's fluency evaluation value from its actual pronunciation duration and its reference pronunciation duration; then determine the fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the phonemes of the reference text. The closer a phoneme's actual pronunciation duration in the speech to be evaluated is to its reference pronunciation duration, the more fluent the user's production of that phoneme. For example, in practical applications, the fluency evaluation value F of a phoneme can be calculated by the following formula:
$$F = \frac{\min(T, T_0)}{\max(T, T_0)}$$

where $T_0$ is the reference pronunciation duration corresponding to the phoneme and $T$ is the actual pronunciation duration corresponding to the phoneme.
In this embodiment, the actual pronunciation duration may be determined from the number of audio frames corresponding to the phoneme and the duration of one frame of audio. For example, if the phoneme [g] corresponds to 30 frames of audio and each frame lasts 20 ms, its actual pronunciation duration is 600 ms; assuming the reference pronunciation duration of [g] is 400 ms, the fluency evaluation value of [g] in the speech to be evaluated is 0.667. Likewise, if the phoneme [i:] corresponds to 30 frames of 20 ms each, its actual pronunciation duration is 600 ms; assuming the reference pronunciation duration of [i:] is 1000 ms, the fluency evaluation value of [i:] in the speech to be evaluated is 0.6.
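A transcription of the per-phoneme fluency formula given above, with the two examples as checks; durations over- and under-shooting the reference are penalized symmetrically:

```python
def phoneme_fluency(actual_ms: float, reference_ms: float) -> float:
    """F = min(T, T0) / max(T, T0)."""
    return min(actual_ms, reference_ms) / max(actual_ms, reference_ms)

print(phoneme_fluency(600, 400))   # about 0.667, the [g] example
print(phoneme_fluency(600, 1000))  # 0.6, the [i:] example
```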
In specific implementation, the fluency evaluation value corresponding to each phoneme may be weighted according to a weight value determined for each phoneme in advance, so as to obtain the fluency evaluation value of the speech to be evaluated.
In specific implementation, the fluency evaluation value of each word in the reference text can first be determined according to the fluency evaluation values of its phonemes, and the fluency evaluation value of the speech to be evaluated then determined according to the fluency evaluation values of the words.
The fluency evaluation value of each word in the reference text is calculated from the fluency evaluation values of its phonemes; the specific calculation method can be as follows:
for each word, calculate the average of the fluency evaluation values of its phonemes to obtain the word's first fluency evaluation value; calculate the word's first time length, i.e. the time from the first audio frame of the word's first phoneme to the last audio frame of its last phoneme; calculate the sum of the actual pronunciation durations of the word's phonemes to obtain the word's second time length, and determine the word's second fluency evaluation value from the first and second time lengths; then obtain the word's fluency evaluation value from its first and second fluency evaluation values.
Specifically, the first fluency evaluation value and the second fluency evaluation value of the word may be weighted, and the result of the weighting taken as the fluency evaluation value of the word. The weights used in the weighting may be set freely according to actual conditions and are not limited here.
Of course, in a specific implementation, only the first fluency evaluation value of the word may be calculated and used directly as the word's fluency evaluation value; alternatively, only the second fluency evaluation value may be calculated and used directly as the word's fluency evaluation value.
Specifically, the first fluency evaluation value of a word is calculated by averaging, for each word in the reference text, the fluency evaluation values of the phonemes corresponding to that word.
For example, the phonemes corresponding to the word good are [g], [u], [d]; assuming the fluency evaluation values of [g], [u], [d] are 0.9, 0.8 and 0.84 respectively, the first fluency evaluation value of the word good is 0.847. Further, for each word in the reference text, the fluency evaluation values of the phonemes corresponding to the word may be weighted according to the weight value determined in advance for each phoneme, to obtain the word's first fluency evaluation value.
The second fluency evaluation value of a word is determined from the blank audio frames between its phonemes. Blank audio frames are frames that, after alignment, are determined not to belong to any phoneme; the more blank frames between two adjacent phonemes of the same word, the less fluently the user read the word. Specifically, for each word in the reference text, the first time length of the word is calculated, the sum of the actual pronunciation durations of its phonemes is calculated as the word's second time length, and the word's second fluency evaluation value is determined from the two. The first time length is the time from the first audio frame of the word's first phoneme to the last audio frame of its last phoneme.
For example, the word morning corresponds to the phonemes [m], [ɔː], [n], [i], [ŋ]. Suppose [m] corresponds to the 11th to 40th audio frames of the speech to be evaluated, [ɔː] to the 41st to 80th frames, [n] to the 101st to 130th frames, [i] to the 131st to 160th frames, and [ŋ] to the 161st to 190th frames. The first time length of the word morning is the time from the first frame of [m] (the 11th frame) to the last frame of [ŋ] (the 190th frame), i.e. the duration of 180 audio frames. The numbers of frames corresponding to [m], [ɔː], [n], [i] and [ŋ] sum to 160, so the second time length of the word morning is the duration of 160 audio frames. The first and second time lengths differ by the duration of 20 frames, a difference caused by insufficient fluency in reading the word; the greater the difference between the two time lengths, the less fluently the user read the word. The correspondence between the first and second time lengths of a word and the word's second fluency evaluation value can be determined according to the actual situation and is not limited here.
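A sketch of the two time lengths for the morning example above; the spans are (first frame, last frame) pairs, 1-indexed and inclusive as in the text, and the mapping from the resulting gap to a second fluency evaluation value is left open, as it is in the patent:

```python
def word_time_lengths(phoneme_spans: list[tuple[int, int]]) -> tuple[int, int]:
    """Return (first time length, second time length), in frames."""
    first_frame = phoneme_spans[0][0]
    last_frame = phoneme_spans[-1][1]
    first_len = last_frame - first_frame + 1               # span including gaps
    second_len = sum(e - s + 1 for s, e in phoneme_spans)  # summed phoneme durations
    return first_len, second_len

spans = [(11, 40), (41, 80), (101, 130), (131, 160), (161, 190)]
print(word_time_lengths(spans))  # (180, 160): a 20-frame gap from pauses
```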
After the fluency evaluation value of each word in the reference text is calculated, a fluency evaluation value of the speech to be evaluated may be calculated based on the fluency evaluation value of each word.
The fluency evaluation value of the speech to be evaluated can be calculated as follows: the average of the fluency evaluation values of the words in the reference text is taken as the fluency evaluation value of the speech to be evaluated. For example, if the fluency evaluation value of the word good is 0.847 and that of the word morning is 0.78, the fluency evaluation value corresponding to the speech to be evaluated "good morning" is 0.814.
Alternatively, the fluency evaluation value of the speech to be evaluated may be calculated as follows: take the value calculated by the above method as the first fluency evaluation value of the speech to be evaluated; determine a second fluency evaluation value of the speech from the blank audio frames between adjacent words; and determine the fluency evaluation value of the speech by combining its first and second fluency evaluation values. Specifically, the two values may be weighted and the weighted result taken as the fluency evaluation value of the speech to be evaluated; the weights used may be set freely according to actual conditions and are not limited here.
Within a sentence, once the number of blank audio frames between two adjacent words exceeds a certain amount, the more blank frames there are, the longer the user's pause and the less fluent the reading. Specifically, the number of blank audio frames between any two adjacent words is determined from the audio frames corresponding to each word in the reference text, and the second fluency evaluation value of the speech to be evaluated is determined from those numbers.
In specific implementation, determining the second fluency evaluation value of the speech to be evaluated from the number of blank audio frames between adjacent words can be done as follows: determine the pause duration between two adjacent words from the number of blank frames between them, count the number of pauses exceeding a preset duration, and determine the second fluency evaluation value from that count and the amounts by which the pauses exceed the preset duration. The more pauses exceed the preset duration, and the larger the excess, the lower the second fluency evaluation value of the speech to be evaluated. The preset duration can be determined from statistics of the average inter-word pause in normal speech and is not limited here.
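A sketch of the pause statistic, assuming word spans in the same (first frame, last frame) convention as above; the 20 ms frame duration and 300 ms threshold are illustrative values only:

```python
def long_pauses(word_spans: list[tuple[int, int]],
                frame_ms: float = 20.0,
                threshold_ms: float = 300.0) -> list[float]:
    """Return the durations (ms) of inter-word pauses exceeding the threshold."""
    pauses = []
    for (_, prev_end), (next_start, _) in zip(word_spans, word_spans[1:]):
        gap_ms = max(next_start - prev_end - 1, 0) * frame_ms  # blank frames between words
        if gap_ms > threshold_ms:
            pauses.append(gap_ms)
    return pauses  # count and sizes both feed the second fluency evaluation value
```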
Of course, in a specific implementation, only the second fluency evaluation value of the speech to be evaluated may be calculated, and the second fluency evaluation value of the speech to be evaluated may be directly used as the fluency evaluation value of the speech to be evaluated.
In this embodiment, the reference pronunciation duration of each phoneme may be determined in advance through the following steps:
Step one: for each piece of speech information in the corpus, determine the audio frames corresponding to each phoneme of its text information, the text information being the reference text corresponding to that speech information.
In this embodiment, the corpus stores speech in the same language as the speech to be evaluated. The speech information in the corpus comes from different speakers, and all of it is speech with standard pronunciation.
In specific implementation, the audio frame corresponding to each phoneme corresponding to the text information may be determined in the speech information by an alignment processing method, and the specific implementation may refer to the specific implementation of S201, which is not described again.
Step two: determine the pronunciation duration corresponding to each phoneme according to the audio frames corresponding to each phoneme of the text information.
In this embodiment, the pronunciation duration of the phoneme may be determined according to the number of audio frames corresponding to the phoneme and the duration of one frame of audio. For example, the phoneme [ g ] corresponds to 30 frames of audio, the duration of each frame of audio is 20ms, and the pronunciation duration of the phoneme [ g ] is 600 ms.
Step three: according to the pronunciation duration corresponding to each phoneme, count the pronunciation duration distribution of each phoneme in the phoneme set, the phoneme set being the set of all phonemes contained in the specified language.
For example, if the language is english, and english contains 48 phonemes, the phoneme set corresponding to english contains these 48 phonemes.
Step four: take the central value of the pronunciation duration distribution of each phoneme as the reference pronunciation duration of that phoneme in the phoneme set.
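These four steps might be sketched as follows; `align` stands in for the alignment procedure of step S201 and is assumed to yield (phoneme, frame span) pairs, and the median is used as the central value:

```python
from collections import defaultdict
from statistics import median

def reference_durations(corpus, align, frame_ms: float = 20.0) -> dict[str, float]:
    """corpus: iterable of (speech, reference text) pairs with standard pronunciation."""
    durations = defaultdict(list)
    for speech, text in corpus:
        for phoneme, (start, end) in align(speech, text):  # inclusive frame span
            durations[phoneme].append((end - start + 1) * frame_ms)
    # Central value of each phoneme's pronunciation duration distribution
    return {p: median(ds) for p, ds in durations.items()}
```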
And S206, determining the pronunciation score of the speech to be evaluated according to the integrity evaluation value, the fluency evaluation value and the accuracy evaluation value.
In specific implementation, the integrity evaluation value, the fluency evaluation value and the accuracy evaluation value can be weighted according to pre-fitted weight coefficients to obtain the pronunciation score of the speech to be evaluated. Specifically, the weight coefficients corresponding to the three evaluation values may be determined by linear regression; this embodiment does not limit the weight coefficients.
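A sketch of the final combination; fitting the coefficients against human-rated scores with scikit-learn is one plausible reading of "determined by linear regression", not the patent's mandated procedure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_score_weights(evals: np.ndarray, human_scores: np.ndarray) -> LinearRegression:
    """evals: (n_samples, 3) array, columns = integrity, fluency, accuracy."""
    return LinearRegression().fit(evals, human_scores)

def pronunciation_score(model: LinearRegression, integrity: float,
                        fluency: float, accuracy: float) -> float:
    raw = model.predict(np.array([[integrity, fluency, accuracy]]))[0]
    return float(np.clip(raw, 0.0, 1.0)) * 100  # report on a 100-point scale
```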
In specific implementation, the pronunciation score fed back to the user side can be converted to a 100-point scale.
On the premise of obtaining the accuracy evaluation value of the pronunciation with the GOP algorithm, the pronunciation quality evaluation method of this embodiment also determines the integrity evaluation value and the fluency evaluation value of the speech to be evaluated, and determines the pronunciation score by combining the accuracy, integrity and fluency evaluation values. The introduced integrity and fluency evaluation values take full account of the relations between words during pronunciation, thereby providing sentence-level scoring indices.
It should be noted that there is no fixed order among steps S203, S204 and S205; for example, they may be executed simultaneously, or sequentially in a preset order.
The method of the present embodiment can be used to evaluate speech in any language, such as chinese, english, japanese, korean, etc. In specific implementation, for different languages, only the models such as the decoder and the alignment model used in the method of this embodiment need to be trained using the corpora corresponding to the different languages, and for different languages, the model training method is the same and is not repeated.
As shown in fig. 4, based on the same inventive concept as the pronunciation quality evaluation method described above, an embodiment of the present invention further provides a pronunciation quality evaluation apparatus 40 including a determination module 401, a phoneme accuracy calculation module 402, and an accuracy calculation module 403.
The determining module 401 is configured to determine, in the speech to be evaluated, the audio frames corresponding to each phoneme of the reference text and the matching probability between each phoneme and its corresponding audio frames, the reference text being the text corresponding to the speech to be evaluated.
A phoneme accuracy calculation module 402, configured to calculate, for each phoneme, a pronunciation accuracy evaluation value of the phoneme according to the matching probability corresponding to the phoneme and the audio frame corresponding to the phoneme.
And an accuracy calculation module 403, configured to obtain an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme.
Further, in the weight values determined for each phoneme in advance, the weight value corresponding to the vowel is greater than the weight value corresponding to the consonant.
Further, the accuracy calculation module 403 is specifically configured to: and weighting the pronunciation accuracy evaluation value corresponding to each phoneme according to the weight value determined for each phoneme in advance to obtain the accuracy evaluation value of the speech to be evaluated.
Further, as shown in fig. 5, the pronunciation quality evaluation device 40 according to the embodiment of the invention further includes an integrity calculation module 404, a fluency calculation module 405 and a scoring module 406.
And an integrity calculation module 404, configured to determine an integrity evaluation value of the speech to be evaluated.
And a fluency calculation module 405, configured to determine a fluency evaluation value of the speech to be evaluated.
And the scoring module 406 is used for determining the pronunciation score of the voice to be evaluated according to the integrity evaluation value, the fluency evaluation value and the accuracy evaluation value.
Further, the integrity calculation module 404 is specifically configured to: and determining the integrity evaluation value of the speech to be evaluated according to the number of words contained in the recognition text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to the speech to be evaluated, wherein the recognition text is the text obtained by performing speech recognition on the speech to be evaluated.
Further, the fluency calculation module 405 includes a phoneme fluency calculation unit and a speech fluency calculation unit.
And the phoneme fluency calculation unit is used for determining the actual pronunciation duration corresponding to the phoneme according to the audio frame corresponding to the phoneme aiming at each phoneme corresponding to the reference text, and determining the fluency evaluation value corresponding to the phoneme according to the actual pronunciation duration corresponding to the phoneme and the reference pronunciation duration corresponding to the phoneme.
And the speech fluency calculation unit is used for determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value corresponding to each phoneme.
Further, the speech fluency calculation unit is specifically configured to: weight the fluency evaluation value corresponding to each phoneme according to the weight value determined in advance for each phoneme, to obtain the fluency evaluation value of the speech to be evaluated.
Further, the speech fluency calculation unit is specifically configured to: calculate, for each word in the reference text, the average of the fluency evaluation values of the word's phonemes to obtain the word's first fluency evaluation value; calculate, for each word, the word's first time length, i.e. the time from the first audio frame of its first phoneme to the last audio frame of its last phoneme, and the sum of the actual pronunciation durations of its phonemes as the word's second time length, and determine the word's second fluency evaluation value from the two; obtain, for each word, the word's fluency evaluation value from its first and second fluency evaluation values; and determine the fluency evaluation value of the speech to be evaluated according to the fluency evaluation values of the words in the reference text.
Further, the speech fluency calculation unit is further configured to: determine the number of blank audio frames between any two adjacent words according to the audio frames corresponding to each word in the reference text, and determine the second fluency evaluation value of the speech to be evaluated according to those numbers; determine the first fluency evaluation value of the speech to be evaluated according to the fluency evaluation values of the words in the reference text; and take a weighted average of the first and second fluency evaluation values of the speech to be evaluated to obtain its fluency evaluation value.
Further, the pronunciation quality evaluation device 40 according to the embodiment of the present invention further includes a reference pronunciation duration determination module, configured to: aiming at each section of voice information in the corpus, determining an audio frame corresponding to each phoneme corresponding to the text information in the voice information, wherein the text information is a reference text corresponding to the voice information; determining the pronunciation duration corresponding to each phoneme according to the audio frame corresponding to each phoneme corresponding to the text information; according to the pronunciation duration corresponding to each phoneme, counting the pronunciation duration distribution corresponding to each phoneme in a phoneme set, wherein the phoneme set is a set consisting of all phonemes contained in a specified language; and taking the central value of the pronunciation duration distribution corresponding to each phoneme as the reference pronunciation duration of each phoneme in the phoneme set.
Further, the scoring module 406 is specifically configured to: weight the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value according to the pre-fitted weight coefficients to obtain the pronunciation score of the speech to be evaluated.
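For illustration, the final combination might look as follows; the weight coefficients shown are placeholders standing in for coefficients fitted in advance, for example by regression against human scores.

```python
def pronunciation_score(completeness, fluency, accuracy,
                        weights=(0.2, 0.3, 0.5)):
    """Weighted combination of the three evaluation values; the weights
    here are assumed values, not the fitted coefficients."""
    return (weights[0] * completeness
            + weights[1] * fluency
            + weights[2] * accuracy)
```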
The pronunciation quality evaluation device provided by the embodiment of the present invention adopts the same inventive concept as the pronunciation quality evaluation method described above and achieves the same beneficial effects, which are therefore not repeated here.
Based on the same inventive concept as the pronunciation quality evaluation method, an embodiment of the present invention further provides an electronic device, which may specifically be a user-side device such as a smart speaker, a desktop computer, a portable computer, a smart phone or a tablet computer, or a cloud-side device such as a server. As shown in fig. 6, the electronic device 60 may include a processor 601, a memory 602, and a transceiver 603. The transceiver 603 is used for receiving and transmitting data under the control of the processor 601.
The memory 602 may include a read-only memory (ROM) and a random access memory (RAM), and provides the processor 601 with the program instructions and data stored therein. In the embodiment of the present invention, the memory may be used to store a program implementing the pronunciation quality evaluation method.
The processor 601 may be a CPU (Central Processing Unit), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or a CPLD (Complex Programmable Logic Device), and implements the pronunciation quality evaluation method of any of the above embodiments by calling the program instructions stored in the memory 602.
An embodiment of the present invention further provides a computer-readable storage medium for storing the computer program instructions used by the above electronic device, which includes a program for executing the pronunciation quality evaluation method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above embodiments are intended only to describe the technical solutions of the present application in detail and to help in understanding the method of the embodiments of the present invention, and should not be construed as limiting the embodiments of the present invention. Variations or substitutions that are readily apparent to one skilled in the art are intended to fall within the scope of the embodiments of the present invention.
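To make the accuracy and completeness evaluation values recited in the claims below concrete, a hypothetical sketch follows. The vowel set, the weight values, the truncation threshold and the capped word-count ratio are all assumptions of the sketch, not values fixed by the claims.

```python
VOWELS = {"AA", "AE", "AH", "IY", "UW"}  # illustrative subset of a phoneme set

def accuracy_score(phonemes, accuracy_values, cutoff=0.1,
                   vowel_weight=2.0, consonant_weight=1.0):
    """phonemes: phoneme labels; accuracy_values: per-phoneme pronunciation
    accuracy evaluation values in [0, 1]. A phoneme scoring below the
    truncation threshold receives weight 0."""
    weights = []
    for p, v in zip(phonemes, accuracy_values):
        if v < cutoff:
            weights.append(0.0)
        else:
            weights.append(vowel_weight if p in VOWELS else consonant_weight)
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(v * w for v, w in zip(accuracy_values, weights)) / total

def completeness_score(recognized_words: int, reference_words: int) -> float:
    """One plausible reading: ratio of recognized to reference word counts,
    capped at 1.0."""
    if reference_words <= 0:
        return 0.0
    return min(1.0, recognized_words / reference_words)
```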

Claims (18)

1. A pronunciation quality evaluation method is characterized by comprising the following steps:
in a speech to be evaluated, determining an audio frame corresponding to each phoneme corresponding to a reference text and the matching probability of each phoneme and the corresponding audio frame, wherein the reference text is the reference text corresponding to the speech to be evaluated;
calculating a pronunciation accuracy evaluation value of each phoneme according to the matching probability corresponding to the phoneme and the audio frame corresponding to the phoneme;
obtaining an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined for each phoneme in advance, wherein the phonemes comprise vowels and consonants, the weight value corresponding to a vowel is greater than the weight value corresponding to a consonant, and if the pronunciation accuracy evaluation value of a phoneme is lower than a preset truncation threshold, the weight value of the phoneme is 0;
determining a completeness evaluation value and a fluency evaluation value of the speech to be evaluated;
determining the pronunciation score of the speech to be evaluated according to the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value;
the determining the fluency evaluation value of the speech to be evaluated comprises the following steps:
for each phoneme, determining an actual pronunciation duration corresponding to the phoneme according to an audio frame corresponding to the phoneme, and determining a fluency evaluation value corresponding to the phoneme according to the actual pronunciation duration corresponding to the phoneme and a reference pronunciation duration corresponding to the phoneme;
and determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value corresponding to each phoneme.
2. The method according to claim 1, wherein the obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme comprises:
and weighting the pronunciation accuracy evaluation value corresponding to each phoneme according to a weight value determined for each phoneme in advance to obtain the accuracy evaluation value of the speech to be evaluated.
3. The method according to claim 1 or 2, wherein the determining the completeness evaluation value of the speech to be evaluated comprises:
and determining the completeness evaluation value of the speech to be evaluated according to the number of words contained in the recognition text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to the speech to be evaluated, wherein the recognition text is obtained by performing speech recognition on the speech to be evaluated.
4. The method according to claim 1, wherein the determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value corresponding to each phoneme comprises:
and weighting the fluency evaluation value corresponding to each phoneme according to a weight value determined for each phoneme in advance to obtain the fluency evaluation value of the voice to be evaluated.
5. The method according to claim 1, wherein the determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value corresponding to each phoneme comprises:
calculating the average value of the fluency evaluation values of the phonemes corresponding to the words aiming at each word in the reference text to obtain a first fluency evaluation value of the word;
calculating a first time length corresponding to each word in the reference text, wherein the first time length is the time length from a first audio frame corresponding to a first phoneme corresponding to the word to a last audio frame corresponding to a last phoneme, calculating the sum of actual pronunciation durations corresponding to each phoneme corresponding to the word to obtain a second time length corresponding to the word, and determining a second fluency evaluation value of the word according to the first time length and the second time length corresponding to the word;
aiming at each word in the reference text, obtaining a fluency evaluation value of the word according to a first fluency evaluation value and a second fluency evaluation value of the word;
and determining the fluency evaluation value of the voice to be evaluated according to the fluency evaluation value of each word in the reference text.
6. The method of claim 5, further comprising:
determining the number of blank audio frames between any two adjacent words according to the audio frame corresponding to each word in the reference text, and determining a second fluency evaluation value of the speech to be evaluated according to the number of blank audio frames between any two adjacent words;
the determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value of each word in the reference text comprises the following steps:
determining a first fluency evaluation value of the voice to be evaluated according to the fluency evaluation value of each word in the reference text;
and carrying out weighted average on the first fluency evaluation value and the second fluency evaluation value of the voice to be evaluated to obtain the fluency evaluation value of the voice to be evaluated.
7. The method of claim 1, wherein determining the reference utterance duration comprises:
aiming at each section of voice information in a corpus, determining an audio frame corresponding to each phoneme corresponding to text information in the voice information, wherein the text information is a reference text corresponding to the voice information;
determining the pronunciation duration corresponding to each phoneme according to the audio frame corresponding to each phoneme corresponding to the text information;
according to the pronunciation duration corresponding to each phoneme, counting the pronunciation duration distribution corresponding to each phoneme in a phoneme set, wherein the phoneme set is a set formed by all phonemes contained in a specified language;
and taking the central value of the pronunciation duration distribution corresponding to each phoneme as the reference pronunciation duration of each phoneme in the phoneme set.
8. The method according to claim 1 or 2, wherein determining the pronunciation score of the speech to be evaluated according to the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value comprises:
and weighting the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value according to a pre-fitted weight coefficient to obtain the pronunciation score of the speech to be evaluated.
9. An utterance quality evaluation device, comprising:
the determining module is used for determining an audio frame corresponding to each phoneme corresponding to a reference text and the matching probability of each phoneme and the corresponding audio frame in the speech to be evaluated, wherein the reference text is the reference text corresponding to the speech to be evaluated;
a phoneme accuracy calculation module, configured to calculate, for each phoneme, a pronunciation accuracy evaluation value of the phoneme according to the matching probability corresponding to the phoneme and the audio frame corresponding to the phoneme;
the accuracy calculation module is used for obtaining an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined for each phoneme in advance, wherein the phonemes comprise vowels and consonants, the weight value corresponding to a vowel is greater than the weight value corresponding to a consonant, and if the pronunciation accuracy evaluation value of a phoneme is lower than a preset truncation threshold, the weight value of the phoneme is 0;
the completeness calculation module is used for determining a completeness evaluation value of the speech to be evaluated;
the fluency calculation module is used for determining a fluency evaluation value of the voice to be evaluated;
the scoring module is used for determining the pronunciation score of the speech to be evaluated according to the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value;
wherein the fluency calculation module comprises:
a phoneme fluency calculation unit, configured to determine, for each phoneme, an actual pronunciation duration corresponding to the phoneme according to the audio frame corresponding to the phoneme, and determine a fluency evaluation value corresponding to the phoneme according to the actual pronunciation duration corresponding to the phoneme and the reference pronunciation duration corresponding to the phoneme;
and the speech fluency calculation unit is used for determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value corresponding to each phoneme.
10. The apparatus of claim 9, wherein the accuracy calculation module is specifically configured to:
and weighting the pronunciation accuracy evaluation value corresponding to each phoneme according to a weight value determined for each phoneme in advance to obtain the accuracy evaluation value of the speech to be evaluated.
11. The apparatus according to claim 9 or 10, wherein the completeness calculation module is specifically configured to:
and determining the completeness evaluation value of the speech to be evaluated according to the number of words contained in the recognition text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to the speech to be evaluated, wherein the recognition text is obtained by performing speech recognition on the speech to be evaluated.
12. The apparatus according to claim 9, wherein the speech fluency calculation unit is specifically configured to:
and weighting the fluency evaluation value corresponding to each phoneme according to a weight value determined for each phoneme in advance to obtain the fluency evaluation value of the voice to be evaluated.
13. The apparatus according to claim 9, wherein the speech fluency calculation unit is specifically configured to:
calculating the average value of the fluency evaluation values of the phonemes corresponding to the words aiming at each word in the reference text to obtain a first fluency evaluation value of the word;
calculating a first time length corresponding to each word in the reference text, wherein the first time length is the time length from a first audio frame corresponding to a first phoneme corresponding to the word to a last audio frame corresponding to a last phoneme, calculating the sum of actual pronunciation durations corresponding to each phoneme corresponding to the word to obtain a second time length corresponding to the word, and determining a second fluency evaluation value of the word according to the first time length and the second time length corresponding to the word;
aiming at each word in the reference text, obtaining a fluency evaluation value of the word according to a first fluency evaluation value and a second fluency evaluation value of the word;
and determining the fluency evaluation value of the voice to be evaluated according to the fluency evaluation value of each word in the reference text.
14. The apparatus of claim 13, wherein the speech fluency calculation unit is further configured to:
determining the number of blank audio frames between any two adjacent words according to the audio frame corresponding to each word in the reference text, and determining a second fluency evaluation value of the speech to be evaluated according to the number of blank audio frames between any two adjacent words;
determining a first fluency evaluation value of the voice to be evaluated according to the fluency evaluation value of each word in the reference text;
and carrying out weighted average on the first fluency evaluation value and the second fluency evaluation value of the voice to be evaluated to obtain the fluency evaluation value of the voice to be evaluated.
15. The apparatus of claim 9, further comprising a reference utterance duration determination module configured to:
aiming at each section of voice information in a corpus, determining an audio frame corresponding to each phoneme corresponding to text information in the voice information, wherein the text information is a reference text corresponding to the voice information;
determining the pronunciation duration corresponding to each phoneme according to the audio frame corresponding to each phoneme corresponding to the text information;
according to the pronunciation duration corresponding to each phoneme, counting the pronunciation duration distribution corresponding to each phoneme in a phoneme set, wherein the phoneme set is a set formed by all phonemes contained in a specified language;
and taking the central value of the pronunciation duration distribution corresponding to each phoneme as the reference pronunciation duration of each phoneme in the phoneme set.
16. The apparatus according to claim 9 or 10, wherein the scoring module is specifically configured to:
and weighting the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value according to a pre-fitted weight coefficient to obtain the pronunciation score of the speech to be evaluated.
17. An electronic device comprising a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the transceiver is configured to receive and transmit data under control of the processor, and wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.
18. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
CN201910062339.7A 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium Active CN109545243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910062339.7A CN109545243B (en) 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910062339.7A CN109545243B (en) 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109545243A CN109545243A (en) 2019-03-29
CN109545243B true CN109545243B (en) 2022-09-02

Family

ID=65838414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910062339.7A Active CN109545243B (en) 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109545243B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176249A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of appraisal procedure and device of spoken language pronunciation
CN109979484B (en) * 2019-04-03 2021-06-08 北京儒博科技有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN110164422A (en) * 2019-04-03 2019-08-23 苏州驰声信息科技有限公司 A kind of the various dimensions appraisal procedure and device of speaking test
CN110782921B (en) * 2019-09-19 2023-09-22 腾讯科技(深圳)有限公司 Voice evaluation method and device, storage medium and electronic device
CN112927696A (en) * 2019-12-05 2021-06-08 中国科学院深圳先进技术研究院 System and method for automatically evaluating dysarthria based on voice recognition
CN111402924B (en) * 2020-02-28 2024-04-19 联想(北京)有限公司 Spoken language evaluation method, device and computer readable storage medium
CN113707178B (en) * 2020-05-22 2024-02-06 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN111402895B (en) * 2020-06-08 2020-10-02 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN111915940A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN111916108B (en) * 2020-07-24 2021-04-02 北京声智科技有限公司 Voice evaluation method and device
CN111798868B (en) * 2020-09-07 2020-12-08 北京世纪好未来教育科技有限公司 Voice forced alignment model evaluation method and device, electronic equipment and storage medium
CN112492343B (en) * 2020-12-16 2023-11-10 浙江大华技术股份有限公司 Video live broadcast monitoring method and related device
CN112614510B (en) * 2020-12-23 2024-04-30 北京猿力未来科技有限公司 Audio quality assessment method and device
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
CN113035238B (en) * 2021-05-20 2021-08-27 北京世纪好未来教育科技有限公司 Audio evaluation method, device, electronic equipment and medium
CN113299278B (en) * 2021-05-20 2023-06-13 北京大米科技有限公司 Acoustic model performance evaluation method and device and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006337667A (en) * 2005-06-01 2006-12-14 Ntt Communications Kk Pronunciation evaluating method, phoneme series model learning method, device using their methods, program and recording medium
US9613638B2 (en) * 2014-02-28 2017-04-04 Educational Testing Service Computer-implemented systems and methods for determining an intelligibility score for speech

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739868A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN101826263A (en) * 2009-03-04 2010-09-08 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof
JP2016103081A (en) * 2014-11-27 2016-06-02 Kddi株式会社 Conversation analysis device, conversation analysis system, conversation analysis method and conversation analysis program
CN104485115A (en) * 2014-12-04 2015-04-01 上海流利说信息技术有限公司 Pronunciation evaluation equipment, method and system

Also Published As

Publication number Publication date
CN109545243A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109545243B (en) Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium
US10319250B2 (en) Pronunciation guided by automatic speech recognition
O’Shaughnessy Automatic speech recognition: History, methods and challenges
EP1557822B1 (en) Automatic speech recognition adaptation using user corrections
US6694296B1 (en) Method and apparatus for the recognition of spelled spoken words
US6618702B1 (en) Method of and device for phone-based speaker recognition
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
US7890325B2 (en) Subword unit posterior probability for measuring confidence
US20150112685A1 (en) Speech recognition method and electronic apparatus using the method
EP1675102A2 (en) Method for extracting feature vectors for speech recognition
Pellegrino et al. Automatic language identification: an alternative approach to phonetic modelling
Liu et al. Deriving disyllabic word variants from a Chinese conversational speech corpus
JP4864783B2 (en) Pattern matching device, pattern matching program, and pattern matching method
US11043212B2 (en) Speech signal processing and evaluation
Tabibian A survey on structured discriminative spoken keyword spotting
KR100474253B1 (en) Speech recognition method using utterance of the first consonant of word and media storing thereof
Hmad Deep neural network acoustic models for multi-dialect Arabic speech recognition
Nouza Strategies for developing a real-time continuous speech recognition system for czech language
Tan et al. Speech recognition in mobile phones
Leung et al. Articulatory-feature-based confidence measures
Fanty et al. Neural networks for alphabet recognition
Amdal Learning pronunciation variation: A data-driven approach to rule-based lexicon adaptation for automatic speech recognition
JP3917880B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN114255758A (en) Spoken language evaluation method and device, equipment and storage medium
Hasegawa-Johnson Lecture notes in speech production, speech coding, and speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant