CN112397056B - Voice evaluation method and computer storage medium - Google Patents

Info

Publication number
CN112397056B
CN112397056B (application CN202110072627.8A)
Authority
CN
China
Prior art keywords
pronunciation
phoneme
word
speech
pronunciation phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110072627.8A
Other languages
Chinese (zh)
Other versions
CN112397056A (en)
Inventor
孟凡昌
杨嵩
袁军峰
张家源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110072627.8A priority Critical patent/CN112397056B/en
Publication of CN112397056A publication Critical patent/CN112397056A/en
Application granted granted Critical
Publication of CN112397056B publication Critical patent/CN112397056B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/60: for measuring the quality of voice signals
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

An embodiment of the invention provides a speech evaluation method and a computer storage medium. The method comprises the following steps: determining a reference text for speech evaluation of the speech data to be evaluated; if a word in the reference text is detected to have a custom pronunciation label, converting the custom pronunciation label of the word to obtain the word's pronunciation phonemes; if the word is detected to have no custom pronunciation label, looking the word up in a pre-configured pronunciation dictionary to retrieve its pronunciation phonemes; if the word is not found in the pronunciation dictionary, virtually pronouncing the word to obtain its pronunciation phonemes; and performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed from the words' pronunciation phonemes, to obtain a speech evaluation result of the speech data to be evaluated. Embodiments of the invention can simply, conveniently, and flexibly meet personalized pronunciation requirements in speech evaluation services.

Description

Voice evaluation method and computer storage medium
Technical Field
Embodiments of the invention relate to the field of speech technology, and in particular to a speech evaluation method and a computer storage medium.
Background
Online teaching is increasingly popular, and to improve the interactive experience and teaching effect, speech technology now participates in the interactive links of online teaching. In online language teaching in particular, spoken pronunciation is one of the core links: both in-class pronunciation learning and after-class pronunciation practice require a large amount of spoken pronunciation evaluation as feedback on the learning effect. In addition, as the online-teaching student population grows younger, more and more interaction shifts toward speech, such as spoken answers for children who have not yet mastered enough vocabulary.
Speech evaluation technology is needed for both spoken-pronunciation evaluation and spoken answers. Generally, when a user performs speech evaluation, the speech evaluation system first runs a default system initialization, in which it loads its original input resource files, such as the pronunciation dictionary and the acoustic model. The system then loads the user-supplied reference text for speech evaluation, and evaluates the user's speech against the pronunciation dictionary and the reference text to obtain an evaluation result of the user's speech.
However, in actual teaching evaluation activities, different speech evaluation services may raise personalized pronunciation requirements, so two special situations can arise with the pronunciation dictionary that the speech evaluation system loads at initialization. In one, a word of the reference text is absent from the pronunciation dictionary; in the other, the word is present, but the user requires a personalized pronunciation of it. In either case the pronunciation dictionary cannot meet the personalized pronunciation requirements of the speech evaluation service.
In the prior art, the main solution is to manually add the pronunciation of each word requiring a personalized pronunciation to the pronunciation dictionary. Specifically, the original resource package loaded by the speech evaluation system is decompressed to obtain the pronunciation dictionary, the pronunciations of the words requiring personalized pronunciations are added to it, and the dictionary is repackaged and handed back to the speech evaluation system, which then evaluates the user's speech with the updated dictionary. The whole process is cumbersome and inflexible.
Therefore, how to simply, conveniently, and flexibly meet personalized pronunciation requirements in speech evaluation services has become an urgent technical problem.
Disclosure of Invention
In view of the above, an embodiment of the present invention provides a speech evaluation method and a computer storage medium to solve at least one of the above problems.
An embodiment of the invention provides a speech evaluation method comprising the following steps: determining a reference text for speech evaluation of speech data to be evaluated; if a word in the reference text is detected to have a custom pronunciation label, converting the custom pronunciation label of the word to obtain the word's pronunciation phonemes; if the word is detected to have no custom pronunciation label, looking the word up in a pre-configured pronunciation dictionary to retrieve its pronunciation phonemes; if the word is not found in the pronunciation dictionary, virtually pronouncing the word to obtain its pronunciation phonemes; and performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed from the words' pronunciation phonemes, to obtain a speech evaluation result of the speech data to be evaluated.
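The three phoneme-resolution branches above can be sketched as a small Python function. This is a toy illustration, not the patent's implementation; all names, the conversion table, and the letter-to-phoneme rules are invented for the example.

```python
def convert_label(label):
    """Toy phonetic-symbol-to-phoneme table (illustrative only)."""
    table = {"ɒ": "AO", "æ": "AE", "n": "N", "t": "T"}
    return [table[ch] for ch in label if ch in table]

def virtual_pronounce(word, g2p_rules):
    """Naive per-letter fallback pronunciation (illustrative only)."""
    return [g2p_rules[ch] for ch in word if ch in g2p_rules]

def phonemes_for_word(word, custom_labels, pronunciation_dict, g2p_rules):
    """Resolve a word to pronunciation phonemes via the three branches."""
    if word in custom_labels:                  # branch 1: custom pronunciation label
        return convert_label(custom_labels[word])
    if word in pronunciation_dict:             # branch 2: pre-configured dictionary
        return pronunciation_dict[word]
    return virtual_pronounce(word, g2p_rules)  # branch 3: virtual pronunciation
```

For example, with `custom_labels = {"aunt": "ɒnt"}` the first branch yields `["AO", "N", "T"]`, matching the custom transcription [ɒnt] discussed later in the description.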
An embodiment of the present invention further provides a computer-readable medium storing a readable program, the readable program comprising: instructions for determining a reference text for speech evaluation of speech data to be evaluated; instructions for converting a word's custom pronunciation label to obtain the word's pronunciation phonemes if it is detected that the word in the reference text has a custom pronunciation label; instructions for looking the word up in a pre-configured pronunciation dictionary to retrieve its pronunciation phonemes if it is detected that the word does not have a custom pronunciation label; instructions for virtually pronouncing the word to obtain its pronunciation phonemes if the word is not found in the pronunciation dictionary; and instructions for performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed from the words' pronunciation phonemes, to obtain a speech evaluation result of the speech data to be evaluated.
According to the speech evaluation scheme provided by embodiments of the invention, a reference text for speech evaluation of the speech data to be evaluated is determined. If a word in the reference text is detected to have a custom pronunciation label, the label is converted to obtain the word's pronunciation phonemes; if the word has no custom pronunciation label, a pre-configured pronunciation dictionary is searched for the word's pronunciation phonemes; and if the word is not found in the dictionary, the word is virtually pronounced to obtain its pronunciation phonemes. Speech evaluation is then performed on the speech data according to the pronunciation phoneme sequence of the reference text formed from the words' phonemes, yielding a speech evaluation result. Compared with existing approaches, the scheme first detects whether a word in the reference text carries a custom pronunciation label expressing a personalized pronunciation requirement; if so, the label is converted into the word's pronunciation phonemes, and if not, the pronunciation dictionary and virtual pronunciation are combined to obtain them. Personalized pronunciation requirements in speech evaluation services can thus be met simply, conveniently, and flexibly.
Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some embodiments of the present invention, and a person skilled in the art can obtain other drawings based on them.
FIG. 1 is a flow chart illustrating steps of a speech evaluation method according to a first embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of a speech evaluation method according to a second embodiment of the invention;
Fig. 3 is a schematic diagram illustrating a speech evaluation method according to the second embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, these solutions are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention fall within the protection scope of the embodiments of the present invention.
The following further describes specific implementation of the embodiments of the present invention with reference to the drawings.
Embodiment One
Referring to fig. 1, a flowchart illustrating steps of a speech evaluation method according to a first embodiment of the present invention is shown. The voice evaluation method provided by the embodiment of the invention comprises the following steps:
In step S101, a reference text for speech evaluation of the speech data to be evaluated is determined.
In this embodiment, the speech data to be evaluated can be understood as speech data produced by a user reading the reference text aloud. The reference text may be Chinese, English, French, German, Spanish, etc., and the words in it may correspondingly be English words, Chinese words, French words, German words, Spanish words, etc. A custom pronunciation label can be understood as a pronunciation label defined according to a personalized pronunciation requirement; custom pronunciation labels can be phonetic symbols of English, French, German, Spanish, etc., or Chinese pinyin. A pronunciation phoneme is the smallest phonetic unit divided according to the natural attributes of speech: acoustically, it is the smallest unit divided by sound quality; physiologically, one articulatory action forms one phoneme. For example, English phonemes fall into two broad categories, vowels and consonants: English has 48 phonemes, comprising 20 vowel phonemes such as /iː/, /ɪ/ and /e/, and 28 consonant phonemes such as /p/, /t/, /k/ and /f/. Chinese pinyin examples include "a o e b p m". It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the reference text for speech evaluation of the speech data to be evaluated may be generated by a speech evaluation system in an electronic device, or received as input by that speech evaluation system. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In step S102, if it is detected that a word in the reference text has a custom pronunciation label, the custom pronunciation label of the word is converted to obtain the word's pronunciation phonemes.
In one specific example, whether a word in the reference text has a custom pronunciation label can be determined by detecting whether the word has a custom option: if a custom option is detected, the word is determined to have a custom pronunciation label; if not, the word is determined to have no custom pronunciation label. If the word is determined to have a custom pronunciation label, the label is converted according to the pronunciation phonemes of the language of the reference text to obtain the word's pronunciation phonemes. For example, the British phonetic transcription of the English word "aunt" is [ɑːnt] and the American transcription is [ænt]; converting the American transcription yields the phonemes "AE N T". If the user's personalized pronunciation requirement for "aunt" is the custom transcription [ɒnt], conversion yields the phonemes "AO N T". It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
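The transcription-to-phoneme conversion in the "aunt" example can be sketched as a greedy longest-match lookup. The mapping table below is a tiny illustrative fragment (ARPAbet-style phoneme names are assumed, not specified by the patent):

```python
# Illustrative phonetic-symbol-to-phoneme fragment; a real table covers all symbols.
SYMBOL_TO_PHONEME = {
    "ɑː": "AA", "æ": "AE", "ɒ": "AO", "n": "N", "t": "T",
}

def convert_custom_label(label):
    """Greedy longest-match conversion of a phonetic transcription to phonemes."""
    phonemes, i = [], 0
    while i < len(label):
        # try two-character symbols (e.g. long vowels) before single characters
        for length in (2, 1):
            symbol = label[i:i + length]
            if symbol in SYMBOL_TO_PHONEME:
                phonemes.append(SYMBOL_TO_PHONEME[symbol])
                i += length
                break
        else:
            i += 1  # skip characters not in the table, e.g. stress marks
    return phonemes
```

With this table, `convert_custom_label("ænt")` gives `["AE", "N", "T"]` and the custom label "ɒnt" gives `["AO", "N", "T"]`, mirroring the example in the text.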
In step S103, if it is detected that the word does not have a custom pronunciation label, a pre-configured pronunciation dictionary is searched for the word to retrieve the word's pronunciation phonemes.
In this embodiment, the pronunciation dictionary can be understood as a set describing the correspondence between words and their pronunciation labels; the pronunciation of each word can be determined from the labels recorded in the dictionary. For example, the pinyin label [wǒ] of the Chinese character for "I" yields the pronunciation phonemes "w" and "o". Likewise, the phonetic transcription of the English word "good" is /gʊd/, from which the pronunciation phonemes /g/, /ʊ/ and /d/ can be obtained. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
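A pronunciation dictionary lookup of this kind can be sketched as a plain mapping; the entries and function name below are illustrative, not from the patent:

```python
# Toy pre-configured pronunciation dictionary (entries are illustrative)
PRONUNCIATION_DICT = {
    "good": ["G", "UH", "D"],
    "morning": ["M", "AO", "R", "N", "IH", "NG"],
}

def lookup_pronunciation(word):
    """Return the word's pronunciation phonemes, or None if the word is absent."""
    return PRONUNCIATION_DICT.get(word.lower())
```

A `None` result corresponds to the case handled in step S104, where the dictionary does not contain the word and virtual pronunciation takes over.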
In step S104, if the word is not retrieved from the pronunciation dictionary, the word is virtually pronounced to obtain a pronunciation phoneme of the word.
In this embodiment, virtual pronunciation can be understood as follows: when the pronunciation dictionary lacks a word of the reference text and no custom pronunciation has been given for that word, the speech evaluation system pronounces the word virtually. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, when a word is virtually pronounced, it is pronounced according to a pre-configured mapping between words and pronunciation phonemes, so as to obtain its pronunciation phonemes. The pre-configured word-to-phoneme mapping thus allows the word to be virtually pronounced accurately. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the word may also be virtually pronounced according to a pre-configured mapping between the letters in a word and pronunciation phonemes, so as to obtain the word's pronunciation phonemes. In addition, a Chinese word can be virtually pronounced according to a pre-configured mapping between the Chinese characters in the word and pronunciation phonemes, so as to obtain the Chinese word's pronunciation phonemes. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
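A letter-level virtual pronunciation of the kind just described can be sketched as follows. The letter-to-phoneme mapping is a made-up fragment for illustration; a real grapheme-to-phoneme table would be far larger and context-sensitive:

```python
# Illustrative letter-to-phoneme fragment for virtual pronunciation
LETTER_TO_PHONEME = {
    "b": ["B"], "a": ["AE"], "t": ["T"], "o": ["AO"], "x": ["K", "S"],
}

def virtual_pronounce_letters(word):
    """Fall back to per-letter phonemes for a word absent from the dictionary."""
    phonemes = []
    for letter in word.lower():
        phonemes.extend(LETTER_TO_PHONEME.get(letter, []))
    return phonemes
```

Note that a letter can map to several phonemes (here "x" maps to K S), which is why the mapping's values are lists.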
In step S105, performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the word, so as to obtain a speech evaluation result of the speech data to be evaluated.
In this embodiment, the pronunciation phonemes of the words are sorted according to the position sequence of the words in the reference text, so as to obtain a pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words. Then, speech evaluation can be performed on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words, so as to obtain a speech evaluation result of the speech data to be evaluated. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
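Assembling the reference text's phoneme sequence by word position, as described above, amounts to concatenating per-word phoneme lists in text order. A minimal sketch (function and variable names are illustrative):

```python
def phoneme_sequence(reference_text, word_phonemes):
    """Concatenate per-word phonemes in the order the words appear in the text."""
    sequence = []
    for word in reference_text.split():
        sequence.extend(word_phonemes[word])
    return sequence
```

For example, with per-word phonemes for "good" and "morning", the reference text "good morning" yields one flat sequence covering both words in order.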
In some optional embodiments, when speech evaluation is performed on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text, a hidden Markov model corresponding to each pronunciation phoneme in the sequence is obtained, each model having been trained in advance for that phoneme; speech evaluation is then performed on the speech data to be evaluated according to these hidden Markov models, yielding the speech evaluation result of the speech data. In this way, evaluating the speech data with the hidden Markov models of the phonemes in the sequence allows the speech evaluation result to be obtained accurately. A hidden Markov model trained in advance for a pronunciation phoneme can be understood as one whose model parameters for that phoneme are known; the parameters may include the state transition probabilities of the hidden Markov model and the observation probabilities in each state. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, each hidden Markov model is labeled with the identifier of its corresponding pronunciation phoneme. When the hidden Markov models for the phonemes in the pronunciation phoneme sequence are obtained, the phonemes are first mapped through a pre-configured monophone dictionary to obtain their identifiers, and the hidden Markov models are then obtained from those identifiers. In this way, the hidden Markov model of each phoneme in the reference text's pronunciation phoneme sequence can be obtained accurately through the phoneme identifiers. The monophone dictionary can be understood as a set describing the correspondence between pronunciation phonemes and their identifiers; a phoneme's identifier may be its number or serial number. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
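The two-step mapping above, phoneme name to identifier via the monophone dictionary, then identifier to hidden Markov model, can be sketched with toy dictionaries (identifiers and model placeholders are invented for illustration; real entries would hold trained model parameters):

```python
# Toy monophone dictionary: phoneme name -> integer identifier (illustrative)
MONOPHONE_DICT = {"G": 7, "UH": 21, "D": 4}
# HMMs indexed by the same identifiers; strings stand in for trained models
HMM_BY_ID = {7: "hmm_G", 21: "hmm_UH", 4: "hmm_D"}

def hmms_for_sequence(phonemes):
    """Map each phoneme to its identifier, then to its hidden Markov model."""
    ids = [MONOPHONE_DICT[p] for p in phonemes]  # phoneme -> identifier
    return [HMM_BY_ID[i] for i in ids]           # identifier -> HMM
```

The returned list preserves the order of the pronunciation phoneme sequence, which is what allows the models to be concatenated in the next step.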
In some optional embodiments, when speech evaluation is performed on the speech data to be evaluated according to the hidden Markov models of the phonemes in the pronunciation phoneme sequence, those models are concatenated to obtain the hidden Markov model corresponding to the reference text, and speech evaluation is performed on the speech data according to that model to obtain the speech evaluation result. Concatenating the phoneme models in sequence thus yields the reference text's hidden Markov model accurately, and evaluating the speech data to be evaluated against it yields an accurate speech evaluation result. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, when speech evaluation is performed on the speech data to be evaluated according to the hidden Markov model of the reference text, acoustic features are first extracted from the speech data frames of the speech data to obtain their acoustic feature data; the speech data frames corresponding to the phonemes in the pronunciation phoneme sequence are then determined from this acoustic feature data and the reference text's hidden Markov model; and speech evaluation is performed on those frames to obtain the speech evaluation result of the speech data to be evaluated. Evaluating the frames aligned to each phoneme in this way allows the speech evaluation result to be obtained accurately. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the acoustic feature data is information that effectively distinguishes speech, such as time-domain or frequency-domain resolution. Specifically, the acoustic feature data may include Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), or the like. MFCC features are acoustic features extracted based on the characteristics of human hearing; they stand in a nonlinear correspondence with frequency, and the spectral features of the speech data can be computed on the basis of this correspondence. There are many ways to form the acoustic feature sequence. Taking MFCC as the acoustic feature, extracting the MFCC feature sequence of the speech data to be evaluated may include: balancing the high- and low-frequency components of the speech data with pre-emphasis; sampling the speech data and dividing it into a number of speech data frames; multiplying each frame by a Hamming window to increase the continuity of its left and right ends, and converting the frame's time-domain signal to a frequency-domain signal via the Discrete Fourier Transform (DFT); smoothing the frequency-domain signal with a Mel filterbank and removing the effect of harmonics; taking the logarithm of the M energy values output by the Mel filterbank to produce an M-dimensional feature vector; applying the Discrete Cosine Transform (DCT) to this M-dimensional vector to obtain the MFCC features of each frame; and assembling the MFCC features of all frames of the speech data into its MFCC feature sequence, i.e. the acoustic feature sequence.
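The MFCC pipeline enumerated above can be sketched end to end in NumPy. This is a simplified educational implementation, not the patent's; the frame length, hop, filter count, and coefficient count are common defaults chosen for the example:

```python
import numpy as np

def mfcc_features(signal, sample_rate=16000, frame_len=400, hop=160,
                  nfft=512, n_mels=26, n_mfcc=13):
    """Simplified MFCC extraction: pre-emphasis, framing, windowing, DFT,
    Mel filterbank, log, DCT. Parameter values are typical defaults."""
    # 1. pre-emphasis balances high- and low-frequency components
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2./3. split into frames and apply a Hamming window to each frame
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 4. DFT of each frame -> power spectrum
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 5. triangular Mel filterbank, equally spaced on the Mel scale
    mel_max = 2595 * np.log10(1 + sample_rate / 2 / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # 6. log of the M filterbank energies -> M-dimensional vector per frame
    log_energy = np.log(np.maximum(power @ fbank.T, 1e-10))
    # 7. DCT-II decorrelates the log energies into cepstral coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * k + 1) / (2 * n_mels))
    return log_energy @ dct.T  # shape: (n_frames, n_mfcc)
```

One second of 16 kHz audio with these settings produces 98 frames of 13 coefficients each, i.e. the per-frame MFCC features that are assembled into the acoustic feature sequence.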
It should be noted that other acoustic features, such as Linear Prediction Cepstral Coefficients (LPCC), may also be used in this embodiment; such features can be extracted with methods common in the art, which are not repeated here. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the pronunciation of each phoneme is generally regarded as produced by deformation of the vocal organs. Each shape of the vocal organs corresponds to one hidden state of a hidden Markov model (3-5 hidden states are typically used for modeling), and in each shape (hidden state) the vocal organs produce particular sounds (acoustic features) with certain probabilities. Although the shape of the vocal organs (the hidden state) cannot be observed directly, the acoustic features (the observations) can. Therefore, when the model parameters of the hidden Markov models of the pronunciation phonemes are known (i.e. the models have been trained offline in advance), the most likely phoneme state sequence can be computed from the acoustic feature sequence. Furthermore, the hidden Markov models of the pronunciation phonemes can be concatenated (a property of the model, not elaborated here): the ending state of the current phoneme can jump to the starting state of the next phoneme. The hidden Markov models of the 8 pronunciation phonemes of a known phoneme sequence (e.g. "Hello World", [hɛˈloʊ wɝld]) can thus be concatenated into one large hidden Markov model describing the pronunciation of the whole phrase. From the observed acoustic feature sequence and the model parameters, the most likely phoneme state sequence of the current speech data under the "Hello World" model can then be computed, i.e. which pronunciation phoneme each frame of speech data belongs to. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, when the speech data frames corresponding to the pronunciation phonemes in the pronunciation phoneme sequence are determined according to the acoustic feature data of the speech data frames and the hidden Markov model corresponding to the reference text, the acoustic feature data of each speech data frame is recognized by an acoustic model to obtain the conditional probability that the speech data frame is recognized as any pronunciation phoneme; a decoder then performs a path search according to these conditional probabilities and the hidden Markov model corresponding to the reference text to obtain the speech data frames corresponding to the pronunciation phonemes in the pronunciation phoneme sequence. In this way, the speech data frames corresponding to the pronunciation phonemes in the pronunciation phoneme sequence can be obtained accurately. Here, the acoustic model can be understood as a model that classifies the acoustic features of speech into phonemes, such as a DNN (Deep Neural Network)-HMM (Hidden Markov Model) hybrid or a CNN (Convolutional Neural Network) + LSTM (Long Short-Term Memory) network. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the acoustic feature data of the speech data frames of the speech data to be evaluated is input into the acoustic model to obtain a conditional probability matrix describing, for each speech data frame, the conditional probability that the frame is recognized as each pronunciation phoneme (for example, the conditional probability that a frame is recognized as [g] and the conditional probability that it is recognized as [s]). The conditional probability matrix is then input into a decoder, which performs a path search using the Viterbi algorithm, with the hidden Markov model corresponding to the reference text acting as a constraint on the search, to obtain the speech data frames corresponding to each pronunciation phoneme in the pronunciation phoneme sequence of the reference text. Typically, one pronunciation phoneme corresponds to several successive frames of the speech data to be evaluated. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, when performing speech evaluation on the speech data frame corresponding to a pronunciation phoneme in the pronunciation phoneme sequence, determining pronunciation accuracy of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to a conditional probability that the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as a pronunciation phoneme in the pronunciation phoneme sequence; and determining the pronunciation accuracy of the voice data to be evaluated according to the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. Thereby, the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence can be accurately determined by the conditional probability that the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as the pronunciation phoneme in the pronunciation phoneme sequence. Furthermore, the pronunciation accuracy of the speech data to be evaluated can be accurately determined through the pronunciation accuracy of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the pronunciation accuracy of the speech data frame corresponding to a pronunciation phoneme in the pronunciation phoneme sequence can be calculated by the following formula:
$$\mathrm{GOP} = \frac{\log P(p \mid o)}{NF(p)}$$
where p is a pronunciation phoneme in the pronunciation phoneme sequence, o denotes the speech data frames corresponding to the pronunciation phoneme p, P(p|o) is the conditional probability that those speech data frames are recognized as the pronunciation phoneme p, NF(p) is the number of speech data frames corresponding to the pronunciation phoneme p, and GOP is the pronunciation accuracy of the speech data frames corresponding to the pronunciation phoneme p. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
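Assuming the usual reading of the GOP (goodness of pronunciation) score defined by these symbols — the log conditional probability P(p|o) normalized by the frame count NF(p) — the per-phoneme computation can be sketched as:

```python
import math

def gop(frame_log_probs):
    """Goodness of pronunciation for one phoneme: the average per-frame
    log posterior of that phoneme over its aligned frames,
    i.e. log P(p|o) / NF(p). Closer to 0 means more accurate."""
    return sum(frame_log_probs) / len(frame_log_probs)

# Hypothetical posteriors of phoneme p over its 4 aligned frames.
probs = [0.9, 0.8, 0.85, 0.9]
score = gop([math.log(x) for x in probs])
print(round(score, 4))
```

The sentence-level accuracy in the next example is then just the mean of these per-phoneme GOP values.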
In a specific example, when determining the pronunciation accuracy of the speech data to be evaluated according to the pronunciation accuracy of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, calculating the average value of the pronunciation accuracy of the speech data frame corresponding to all the pronunciation phonemes in the pronunciation phoneme sequence; and determining the average value as the pronunciation accuracy of the voice data to be evaluated. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, when performing speech evaluation on the speech data frame corresponding to a pronunciation phoneme in the pronunciation phoneme sequence, determining a pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to an actual pronunciation duration of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and a standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence; and determining the pronunciation fluency of the speech data to be evaluated according to the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. The standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is preset or calculated in advance. Therefore, the pronunciation fluency of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence can be accurately determined through the actual pronunciation duration of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and the standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. Furthermore, the pronunciation fluency of the speech data to be evaluated can be accurately determined through the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence can be calculated by the following formula:
$$F = \frac{\min(T, T_0)}{\max(T, T_0)}$$
where T0 is the standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, T is the actual pronunciation duration of the speech data frames corresponding to that pronunciation phoneme, and F is the pronunciation fluency of those speech data frames. The closer the actual pronunciation duration is to the standard pronunciation duration, the higher the fluency with which the user reads the pronunciation phoneme. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the actual pronunciation duration may be determined according to the number of frames of speech data corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and the duration of one frame of speech data. For example, if the pronunciation phoneme [ g ] in the pronunciation phoneme sequence corresponds to 30 frames of speech data, and the duration of each frame of speech data is 20ms, the actual pronunciation duration of the speech data frame corresponding to the pronunciation phoneme [ g ] in the pronunciation phoneme sequence is 600ms, and if the standard pronunciation duration of the pronunciation phoneme [ g ] in the pronunciation phoneme sequence is 400ms, the fluency of the speech data frame corresponding to the pronunciation phoneme [ g ] in the pronunciation phoneme sequence is 0.667. For another example, the pronunciation phoneme [ i: ] in the pronunciation phoneme sequence corresponds to 30 frames of voice data, and the duration of each frame of voice data is 20ms, the actual pronunciation duration of the voice data frame corresponding to the pronunciation phoneme [ i: ] in the pronunciation phoneme sequence is 600ms, and the fluency of the voice data frame corresponding to the pronunciation phoneme [ i: ] in the pronunciation phoneme sequence is 0.6 assuming that the standard pronunciation duration of the pronunciation phoneme [ i: ] in the pronunciation phoneme sequence is 1000 ms. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
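The duration-ratio computation, reproducing the two worked examples above, can be sketched as follows. The min/max form of the fluency ratio is inferred from those examples (600 ms vs. 400 ms gives 0.667; 600 ms vs. 1000 ms gives 0.6):

```python
def fluency(n_frames, frame_ms, standard_ms):
    """Pronunciation fluency: actual duration (frame count x frame length)
    vs. standard duration, F = min(T, T0) / max(T, T0), so F = 1 when
    the two durations match exactly."""
    actual_ms = n_frames * frame_ms
    return min(actual_ms, standard_ms) / max(actual_ms, standard_ms)

# The two worked examples from the text: 30 frames of 20 ms = 600 ms actual.
print(round(fluency(30, 20, 400), 3))   # standard 400 ms  -> 0.667
print(fluency(30, 20, 1000))            # standard 1000 ms -> 0.6
```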
In a specific example, when determining the pronunciation fluency of the speech data to be evaluated according to the pronunciation fluency of the speech data frames corresponding to the pronunciation phonemes in the pronunciation phoneme sequence, calculating an average value of the pronunciation fluency of the speech data frames corresponding to all the pronunciation phonemes in the pronunciation phoneme sequence; and determining the average value as the pronunciation fluency of the voice data to be evaluated. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, when performing speech evaluation on the speech data frame corresponding to a pronunciation phoneme in the pronunciation phoneme sequence, determining the number of pronunciation phonemes being pronounced in the pronunciation phoneme sequence according to the conditional probability that the speech data frame corresponding to a pronunciation phoneme in the pronunciation phoneme sequence is recognized as a pronunciation phoneme in the pronunciation phoneme sequence; and determining the pronunciation integrity of the speech data to be evaluated according to the number of the pronunciation phonemes to be pronounced in the pronunciation phoneme sequence and the number of the pronunciation phonemes in the pronunciation phoneme sequence. Thereby, the number of pronounced phonemes being pronounced in the sequence of pronounced phonemes can be accurately determined by the conditional probability that the frame of speech data corresponding to a pronounced phoneme in the sequence of pronounced phonemes is recognized as a pronounced phoneme in the sequence of pronounced phonemes. Furthermore, the pronunciation integrity of the speech data to be evaluated can be accurately determined by the number of the pronunciation phonemes to be pronounced in the pronunciation phoneme sequence. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, when the number of the pronounced phonemes to be pronounced in the pronounced phoneme sequence is determined, comparing the conditional probability that the voice data frame corresponding to the pronounced phonemes in the pronounced phoneme sequence is recognized as the pronounced phonemes in the pronounced phoneme sequence with a preset conditional probability threshold to obtain a comparison result; and if the condition probability that the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is identified as the pronunciation phoneme in the pronunciation phoneme sequence is larger than or equal to a preset condition probability threshold value according to the comparison result, determining that the pronunciation phoneme in the pronunciation phoneme sequence is pronounced, and further determining the number of the pronunciation phonemes which are pronounced in the pronunciation phoneme sequence. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, when determining the pronunciation integrity of the speech data to be evaluated, the number of pronunciation phonemes to be pronounced in the pronunciation phoneme sequence is divided by the number of pronunciation phonemes in the pronunciation phoneme sequence to obtain the pronunciation integrity of the speech data to be evaluated. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
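A minimal sketch of the threshold-and-divide computation described in the last two examples; the per-phoneme posterior values and the 0.5 threshold are hypothetical, not values from this document:

```python
def completeness(phone_posteriors, threshold):
    """Pronunciation completeness: the fraction of reference phonemes whose
    recognition posterior reaches the threshold, i.e. the number of phonemes
    judged to have been pronounced divided by the total phoneme count."""
    pronounced = sum(1 for p in phone_posteriors if p >= threshold)
    return pronounced / len(phone_posteriors)

# Hypothetical posteriors for an 8-phoneme reference sequence; two phonemes
# (0.2 and 0.1) fall below the threshold and count as not pronounced.
posteriors = [0.9, 0.85, 0.2, 0.8, 0.75, 0.1, 0.9, 0.95]
print(completeness(posteriors, threshold=0.5))  # 6 of 8 pronounced -> 0.75
```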
In some optional embodiments, after performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, the method further includes: performing voice recognition on the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain an actual pronunciation phoneme of the voice data frame; and if the actual pronunciation phoneme of the voice data frame is different from the pronunciation phoneme corresponding to the voice data frame, outputting pronunciation error information of the pronunciation phoneme corresponding to the voice data frame and the actual pronunciation phoneme of the voice data frame. Therefore, when the actual pronunciation phoneme of the voice data frame is different from the pronunciation phoneme of the voice data frame, the pronunciation error information of the pronunciation phoneme of the voice data frame and the actual pronunciation phoneme of the voice data frame are output, and the pronunciation error information of the pronunciation phoneme of the voice data frame and the actual pronunciation phoneme of the voice data frame can be shown to a user, so that the user is helped to gradually improve the pronunciation level. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, the actual pronunciation phoneme of a speech data frame can be obtained by directly performing speech recognition on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence through a hidden Markov model, without any restriction on the pronunciation phonemes; that is, the pronunciation phoneme that best fits the user's pronunciation is selected from the list of all phonemes. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
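The error-reporting step can be sketched as a comparison between the expected phonemes (from the reference text) and the unconstrained recognition result; the phoneme strings below are hypothetical examples, not output from the patent's system:

```python
def mispronunciations(expected, recognized):
    """Compare the reference phoneme sequence with the free (unconstrained)
    recognition result and report each position where they differ, so the
    user can be shown the expected vs. actual pronunciation."""
    errors = []
    for i, (exp, act) in enumerate(zip(expected, recognized)):
        if exp != act:
            errors.append({"position": i, "expected": exp, "actual": act})
    return errors

# Hypothetical case: the speaker produced [a] where [e] was expected.
errs = mispronunciations(["h", "e", "l", "ou"], ["h", "a", "l", "ou"])
print(errs)  # one error at position 1: expected 'e', got 'a'
```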
According to the speech evaluation method provided by the embodiment of the present invention, a reference text for performing speech evaluation on speech data to be evaluated is determined; if it is detected that a word in the reference text has a self-defined pronunciation label, the self-defined pronunciation label of the word is converted to obtain the pronunciation phoneme of the word; if it is detected that the word does not have a self-defined pronunciation label, a pre-configured pronunciation dictionary is searched according to the word to retrieve the pronunciation phoneme of the word; and if the word is not found in the pronunciation dictionary, the word is virtually pronounced to obtain its pronunciation phoneme. Speech evaluation is then performed on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words, to obtain the speech evaluation result of the speech data to be evaluated. Compared with other existing approaches, this method first detects whether a word in the reference text carries a self-defined pronunciation label reflecting a personalized evaluation pronunciation requirement; if it does, the self-defined pronunciation label of the word is converted to obtain the pronunciation phoneme of the word, and if it does not, the pronunciation dictionary and virtual pronunciation are combined to obtain the pronunciation phoneme of the word in the reference text. The personalized evaluation pronunciation requirements in the speech evaluation service can thereby be met simply, conveniently and flexibly.
The speech evaluation method provided by the present embodiment can be executed by any suitable device with data processing capability, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, Personal Digital Assistants (PDAs), tablet computers, notebook computers, handheld game consoles, smart glasses, smart watches, wearable devices, and virtual reality or augmented reality devices (such as Google Glass, Oculus Rift, HoloLens, Gear VR), and the like.
Example two
Referring to fig. 2, a flowchart illustrating steps of a speech evaluation method according to a second embodiment of the present invention is shown. The voice evaluation method provided by the embodiment of the invention comprises the following steps:
in step S201, a reference text for performing speech evaluation on speech data to be evaluated is determined.
Since the specific implementation of step S201 is similar to the specific implementation of step S101 in the first embodiment, it is not repeated herein.
In step S202, if it is detected that a word in the reference text has a customized pronunciation label, the customized pronunciation label of the word is converted to obtain a pronunciation phoneme of the word.
Since the specific implementation of step S202 is similar to the specific implementation of step S102 in the first embodiment, it is not repeated herein.
In step S203, if it is detected that the word does not have the customized pronunciation label, a pre-configured pronunciation dictionary is retrieved according to the word to retrieve a pronunciation phoneme of the word.
Since the specific implementation of step S203 is similar to the specific implementation of step S103 in the first embodiment, it is not repeated here.
In step S204, if the word is not retrieved from the pronunciation dictionary, the word is virtually pronounced through a virtual pronunciation model to obtain a pronunciation phoneme of the word.
In this embodiment, the virtual pronunciation model may be a recurrent neural network or a long short-term memory network used for virtual pronunciation. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In some optional embodiments, before virtually pronouncing the term through the virtual pronunciation model, the method further comprises: and training the virtual pronunciation model to be trained according to pronunciation phoneme labeling data of the word sample so as to obtain the trained virtual pronunciation model. Therefore, the words are subjected to virtual pronunciation through the trained virtual pronunciation model, and the words can be accurately subjected to virtual pronunciation. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, when the virtual pronunciation model to be trained is trained according to pronunciation phoneme labeling data of a word sample, the word sample is subjected to virtual pronunciation through the virtual pronunciation model to be trained so as to obtain pronunciation phoneme detection data of the word sample; and training the virtual pronunciation model to be trained according to the pronunciation phoneme detection data and the pronunciation phoneme labeling data of the word sample so as to obtain the trained virtual pronunciation model. It should be understood that the above description is only exemplary, and the present embodiment is not limited thereto.
In a specific example, when the virtual pronunciation model to be trained is trained according to pronunciation phoneme detection data and pronunciation phoneme tagging data of the word sample, determining a difference value between the pronunciation phoneme detection data and the pronunciation phoneme tagging data through a target loss function; and adjusting the model parameters of the virtual pronunciation model based on the difference value. The target loss function can be any loss function such as a cross entropy loss function, a softmax loss function, an L1 loss function, and an L2 loss function. In adjusting the model parameters of the virtual pronunciation model, a back propagation algorithm or a stochastic gradient descent algorithm may be used to adjust the model parameters of the virtual pronunciation model. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the currently obtained phoneme detection data is evaluated by determining a difference value between the phoneme detection data and the phoneme tagging data, so as to be used as a basis for subsequently training the virtual pronunciation model. In particular, the discrepancy values may be transmitted back to the virtual pronunciation model, thereby iteratively training the virtual pronunciation model. The training of the virtual pronunciation model is an iterative process, and the embodiment of the present application only describes one training process, but it should be understood by those skilled in the art that this training mode may be adopted for each training of the virtual pronunciation model until the training of the virtual pronunciation model is completed. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
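As a hedged illustration of the iterative training loop described above (not the patent's actual virtual pronunciation model), the following replaces the recurrent network with a toy linear softmax classifier on synthetic data, so that the cross-entropy target loss, the difference-driven gradient, and the parameter update fit in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-in for the virtual-pronunciation (grapheme-to-phoneme) model:
# a linear classifier mapping synthetic letter features to 2 phoneme labels.
X = rng.normal(size=(64, 5))                     # synthetic word-sample features
y = (X @ rng.normal(size=5) > 0).astype(int)     # synthetic phoneme labels
W = np.zeros((5, 2))                             # model parameters

for step in range(200):
    probs = softmax(X @ W)                       # phoneme "detection data"
    # Cross-entropy between detection data and labeling data (the target loss).
    loss = -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()
    grad = probs.copy()
    grad[np.arange(len(y)), y] -= 1.0            # dLoss/dLogits
    W -= 0.5 * (X.T @ grad) / len(y)             # gradient-descent update

accuracy = (softmax(X @ W).argmax(axis=1) == y).mean()
print(round(loss, 4), accuracy)
```

The real model would be a sequence network with back-propagation through time, but each iteration has the same shape: compute detection data, measure the difference against the labeling data with the loss function, and adjust the parameters from that difference.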
In step S205, performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the word, so as to obtain a speech evaluation result of the speech data to be evaluated.
Since the specific implementation of step S205 is similar to the specific implementation of step S105 in the first embodiment, it is not repeated herein.
In a specific example, as shown in fig. 3, the speech evaluation system loads initialization resources, including a pronunciation dictionary, a monophone dictionary, an acoustic model, the speech data to be evaluated, the reference text for speech evaluation, and the customized pronunciation labels. The speech evaluation system then performs speech evaluation on the speech data to be evaluated. The specific speech evaluation process is as follows: if it is detected that a word in the reference text for speech evaluation has a customized pronunciation label, the customized pronunciation label of the word is converted to obtain the pronunciation phoneme of the word; if it is detected that the word does not have a customized pronunciation label, a pre-configured pronunciation dictionary is searched according to the word to retrieve the pronunciation phoneme of the word; and if the word is not found in the pronunciation dictionary, the word is virtually pronounced to obtain its pronunciation phoneme. After the pronunciation phonemes of all the words in the reference text are obtained, the pronunciation phoneme sequence of the reference text is determined from the pronunciation phonemes of all the words. Then, the pronunciation phonemes in the pronunciation phoneme sequence are mapped according to the pre-configured monophone dictionary to obtain the identifications of the pronunciation phonemes, the hidden Markov model corresponding to each pronunciation phoneme is obtained according to its identification, and the hidden Markov model corresponding to the reference text (the text composition) is determined from the hidden Markov models corresponding to the pronunciation phonemes in the sequence.
After determining the hidden Markov model corresponding to the reference text, performing acoustic feature extraction on a voice data frame in the voice data to be evaluated to obtain acoustic feature data of the voice data frame, identifying the acoustic feature data of the voice data frame through the acoustic model to obtain a conditional probability that the voice data frame is identified as any pronunciation phoneme, and performing path search through a decoder according to the conditional probability that the voice data frame is identified as any pronunciation phoneme and the hidden Markov model corresponding to the reference text to obtain the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. After the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is obtained, determining the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the conditional probability that the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is identified as the pronunciation phoneme in the pronunciation phoneme sequence, and determining the pronunciation accuracy (voice evaluation) of the voice data to be evaluated according to the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. 
Or determining the pronunciation fluency of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the actual pronunciation duration of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and the standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, and determining the pronunciation fluency of the voice data to be evaluated (voice evaluation) according to the pronunciation fluency of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence. Or determining the number of the pronouncing phonemes to be pronouncing in the pronouncing phoneme sequence according to the conditional probability that the voice data frame corresponding to the pronouncing phoneme in the pronouncing phoneme sequence is recognized as the pronouncing phoneme in the pronouncing phoneme sequence, and determining the pronunciation integrity (voice evaluation) of the voice data to be evaluated according to the number of the pronouncing phonemes to be pronouncing in the pronouncing phoneme sequence and the number of the pronouncing phonemes in the pronouncing phoneme sequence. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
According to the speech evaluation method provided by the embodiment of the present invention, a reference text for performing speech evaluation on speech data to be evaluated is determined; if it is detected that a word in the reference text has a self-defined pronunciation label, the self-defined pronunciation label of the word is converted to obtain the pronunciation phoneme of the word; if it is detected that the word does not have a self-defined pronunciation label, a pre-configured pronunciation dictionary is searched according to the word to retrieve the pronunciation phoneme of the word; and if the word is not found in the pronunciation dictionary, the word is virtually pronounced through a virtual pronunciation model to obtain its pronunciation phoneme. Speech evaluation is then performed on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words, to obtain the speech evaluation result of the speech data to be evaluated. Compared with other existing approaches, this method first detects whether a word in the reference text carries a self-defined pronunciation label reflecting a personalized evaluation pronunciation requirement; if it does, the self-defined pronunciation label of the word is converted to obtain the pronunciation phoneme of the word, and if it does not, the pronunciation dictionary and virtual pronunciation are combined to obtain the pronunciation phoneme of the word in the reference text. The personalized evaluation pronunciation requirements in the speech evaluation service can thereby be met simply, conveniently and flexibly.
The speech evaluation method provided by the present embodiment can be executed by any suitable device with data processing capability, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, Personal Digital Assistants (PDAs), tablet computers, notebook computers, handheld game consoles, smart glasses, smart watches, wearable devices, and virtual reality or augmented reality devices (such as Google Glass, Oculus Rift, HoloLens, Gear VR), and the like.
EXAMPLE III
An embodiment of the present invention further provides a computer-readable medium, where a readable program is stored in the computer-readable medium, and the readable program includes: the instruction is used for determining a reference text for carrying out voice evaluation on voice data to be evaluated; instructions for converting the self-defined pronunciation labels of the words to obtain pronunciation phonemes of the words if it is detected that the words in the reference text have the self-defined pronunciation labels; instructions for retrieving a pre-configured pronunciation dictionary according to the word if it is detected that the word does not have a custom pronunciation label, to retrieve a pronunciation phoneme of the word; instructions for virtually pronouncing the word to obtain a pronunciation phoneme for the word if the word is not retrieved from the pronunciation dictionary; and performing voice evaluation on the voice data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words to obtain a voice evaluation result of the voice data to be evaluated.
Optionally, the instructions for virtually pronouncing the word to obtain the pronunciation phoneme of the word include: instructions for virtually pronouncing the word according to a pre-configured mapping relationship between words and pronunciation phonemes to obtain the pronunciation phoneme of the word.
Optionally, the instructions for virtually pronouncing the word to obtain a pronunciation phoneme of the word include: instructions for virtually pronouncing the word through a virtual pronunciation model to obtain a pronunciation phoneme of the word.
Optionally, the instructions for performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for obtaining a hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, where the hidden Markov model is a pre-trained hidden Markov model for the pronunciation phoneme in the pronunciation phoneme sequence; and instructions for performing speech evaluation on the speech data to be evaluated according to the hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated.
Optionally, the hidden Markov model is labeled with an identification of the corresponding pronunciation phoneme in the pronunciation phoneme sequence. The instructions for obtaining a hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence include: instructions for mapping the pronunciation phoneme in the pronunciation phoneme sequence according to a pre-configured monophone dictionary to obtain the identification of the pronunciation phoneme in the pronunciation phoneme sequence; and instructions for obtaining the hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the identification of the pronunciation phoneme in the pronunciation phoneme sequence.
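For illustration only, the monophone-dictionary lookup described above may be sketched as a two-step mapping. The dictionary contents and the HMM store are hypothetical stand-ins for pre-trained models:

```python
# Assumption: the monophone dictionary maps each phoneme symbol to an
# integer identification, and a separate store maps each identification
# to its pre-trained hidden Markov model (stubbed here as strings).
MONOPHONE_DICT = {"DH": 0, "AH": 1, "K": 2}
HMM_STORE = {0: "hmm_DH", 1: "hmm_AH", 2: "hmm_K"}

def phonemes_to_ids(phoneme_seq, monophone_dict=MONOPHONE_DICT):
    # Map each pronunciation phoneme in the sequence to its identification.
    return [monophone_dict[p] for p in phoneme_seq]

def ids_to_hmms(ids, hmm_store=HMM_STORE):
    # Fetch the hidden Markov model labeled with each identification.
    return [hmm_store[i] for i in ids]
```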
Optionally, the instructions for performing speech evaluation on the speech data to be evaluated according to the hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for connecting the hidden Markov models corresponding to the pronunciation phonemes in the pronunciation phoneme sequence in series to obtain a hidden Markov model corresponding to the reference text; and instructions for performing speech evaluation on the speech data to be evaluated according to the hidden Markov model corresponding to the reference text, to obtain a speech evaluation result of the speech data to be evaluated.
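A minimal sketch of the serial connection described above, with each per-phoneme HMM reduced to a list of state labels for illustration (a real concatenation would also join transition probabilities between models):

```python
def concatenate_hmms(phoneme_hmms):
    """Each element of phoneme_hmms is one phoneme's HMM, reduced here to a
    list of state labels. The utterance-level (reference-text) HMM is their
    left-to-right concatenation, with each state tagged by the index of the
    phoneme it belongs to so frames can later be attributed to phonemes."""
    utterance_states = []
    for idx, states in enumerate(phoneme_hmms):
        for s in states:
            utterance_states.append((idx, s))
    return utterance_states
```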
Optionally, the instructions for performing speech evaluation on the speech data to be evaluated according to the hidden Markov model corresponding to the reference text, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for performing acoustic feature extraction on a speech data frame in the speech data to be evaluated to obtain acoustic feature data of the speech data frame; instructions for determining the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the acoustic feature data of the speech data frame and the hidden Markov model corresponding to the reference text; and instructions for performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated.
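A hedged sketch of the acoustic feature extraction step described above: the speech data is split into overlapping frames and a simple per-frame feature (log energy) is computed. A real evaluation system would use MFCC or filter-bank features; the frame length and hop size here are illustrative (roughly 25 ms and 10 ms at a 16 kHz sampling rate).

```python
import math

def frame_signal(signal, frame_len=400, hop=160):
    """Split a list of samples into overlapping frames of frame_len samples."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]

def log_energy(frame):
    """Log energy of one frame; the small epsilon avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + 1e-10)
```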
Optionally, the instructions for determining the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the acoustic feature data of the speech data frame and the hidden Markov model corresponding to the reference text include: instructions for recognizing the acoustic feature data of the speech data frame through an acoustic model to obtain the conditional probability that the speech data frame is recognized as any pronunciation phoneme; and instructions for performing, through a decoder, a path search according to the conditional probability that the speech data frame is recognized as any pronunciation phoneme and the hidden Markov model corresponding to the reference text, to obtain the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
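For illustration, the decoder's path search can be reduced to a dynamic-programming alignment over a linear phoneme sequence: given each frame's log-probability of being each phoneme in the sequence, find the best monotone frame-to-phoneme path. This is a simplified stand-in for searching the full HMM state graph; it assumes at least as many frames as phonemes.

```python
import math

def align_frames(log_probs):
    """log_probs[t][p]: log-probability that frame t is recognized as the
    p-th pronunciation phoneme of the sequence. Returns, per frame, the
    phoneme index it aligns to, constrained to be non-decreasing, to start
    at phoneme 0, and to end at the last phoneme (so every phoneme gets at
    least one frame)."""
    T, P = len(log_probs), len(log_probs[0])
    NEG = float("-inf")
    dp = [[NEG] * P for _ in range(T)]
    back = [[0] * P for _ in range(T)]
    dp[0][0] = log_probs[0][0]
    for t in range(1, T):
        for p in range(P):
            stay = dp[t - 1][p]               # remain in the same phoneme
            move = dp[t - 1][p - 1] if p > 0 else NEG  # advance by one
            best = stay if stay >= move else move
            dp[t][p] = best + log_probs[t][p]
            back[t][p] = p if stay >= move else p - 1
    # Backtrack from the last phoneme at the last frame.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```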
Optionally, the instructions for performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for determining the pronunciation accuracy of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the conditional probability that the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as the pronunciation phoneme in the pronunciation phoneme sequence; and instructions for determining the pronunciation accuracy of the speech data to be evaluated according to the pronunciation accuracy of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
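One way the accuracy aggregation described above might look, sketched under the assumption that a phoneme's score is the mean probability of its aligned frames being recognized as that phoneme, and the utterance score is the mean over phonemes (the averaging scheme is an assumption, not stated by the source):

```python
def phoneme_accuracy(frame_probs):
    """frame_probs: conditional probabilities (from the acoustic model) that
    each frame aligned to a phoneme is recognized as that phoneme."""
    return sum(frame_probs) / len(frame_probs)

def utterance_accuracy(per_phoneme_frame_probs):
    # Average the per-phoneme accuracies to score the whole utterance.
    scores = [phoneme_accuracy(fp) for fp in per_phoneme_frame_probs]
    return sum(scores) / len(scores)
```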
Optionally, the instructions for performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for determining the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the actual pronunciation duration of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and the standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence; and instructions for determining the pronunciation fluency of the speech data to be evaluated according to the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
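An illustrative fluency measure following the comparison described above: each phoneme's actual duration (from the alignment) is compared with its standard duration, and the per-phoneme scores are averaged. The clipped duration ratio is an assumed scoring function, not one specified by the source.

```python
def phoneme_fluency(actual_dur, standard_dur):
    # Ratio of the shorter to the longer duration, in [0, 1]:
    # pronouncing a phoneme much too fast or too slow lowers the score.
    return min(actual_dur, standard_dur) / max(actual_dur, standard_dur)

def utterance_fluency(actual_durs, standard_durs):
    scores = [phoneme_fluency(a, s) for a, s in zip(actual_durs, standard_durs)]
    return sum(scores) / len(scores)
```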
Optionally, the instructions for performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, to obtain a speech evaluation result of the speech data to be evaluated, include: instructions for determining the number of pronunciation phonemes in the pronunciation phoneme sequence that are actually pronounced according to the conditional probability that the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as the pronunciation phoneme in the pronunciation phoneme sequence; and instructions for determining the pronunciation integrity of the speech data to be evaluated according to the number of pronunciation phonemes in the pronunciation phoneme sequence that are actually pronounced and the total number of pronunciation phonemes in the pronunciation phoneme sequence.
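A hedged sketch of the integrity (completeness) measure described above: count the phonemes whose recognition probability exceeds a threshold, and divide by the total phoneme count. The threshold value and the thresholding rule are assumptions for illustration.

```python
def utterance_integrity(per_phoneme_probs, threshold=0.5):
    """per_phoneme_probs: for each pronunciation phoneme in the sequence, the
    probability that its aligned frames are recognized as that phoneme.
    A phoneme counts as pronounced when its probability reaches the
    (assumed) threshold."""
    pronounced = sum(1 for p in per_phoneme_probs if p >= threshold)
    return pronounced / len(per_phoneme_probs)
```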
Optionally, after the instructions for performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, the readable program further includes: instructions for performing speech recognition on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain an actual pronunciation phoneme of the speech data frame; and instructions for outputting, if the actual pronunciation phoneme of the speech data frame is different from the pronunciation phoneme corresponding to the speech data frame, pronunciation error information of the pronunciation phoneme corresponding to the speech data frame and the actual pronunciation phoneme of the speech data frame.
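The comparison-and-feedback step above can be sketched minimally as follows; the structure of the error report is an assumption for illustration:

```python
def pronunciation_errors(expected_phonemes, actual_phonemes):
    """Compare each expected pronunciation phoneme with the actually
    recognized phoneme and report the mismatches so the learner can be
    told which phonemes were mispronounced."""
    errors = []
    for exp, act in zip(expected_phonemes, actual_phonemes):
        if exp != act:
            errors.append({"expected": exp, "actual": act})
    return errors
```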
Through the computer-readable medium provided by the embodiment of the present application, a reference text for performing speech evaluation on the speech data to be evaluated is determined. If a word in the reference text is detected to have a custom pronunciation label, the custom pronunciation label of the word is converted to obtain the pronunciation phoneme of the word; if the word is detected not to have a custom pronunciation label, a pre-configured pronunciation dictionary is searched according to the word to retrieve the pronunciation phoneme of the word; and if the word is not found in the pronunciation dictionary, the word is virtually pronounced to obtain the pronunciation phoneme of the word. Speech evaluation is then performed on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words, so as to obtain a speech evaluation result of the speech data to be evaluated. Compared with other existing approaches, this medium first detects whether a word in the reference text has a custom pronunciation label reflecting a personalized evaluation pronunciation requirement: if the word has such a label, the label is converted to obtain the pronunciation phoneme of the word; if it does not, the pronunciation dictionary and virtual pronunciation are combined to obtain the pronunciation phoneme of the word. The personalized pronunciation requirements of a speech evaluation service can thus be met simply, conveniently, and flexibly.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, such that the method described herein can be rendered in software stored on a recording medium and executed using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the speech evaluation methods described herein. Further, when a general-purpose computer accesses code for implementing the speech evaluation methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the speech evaluation methods illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims (11)

1. A speech evaluation method, the method comprising:
determining a reference text for voice evaluation of voice data to be evaluated;
if it is detected that a word in the reference text has a custom pronunciation label, converting the custom pronunciation label of the word to obtain the pronunciation phoneme of the word;
if it is detected that the word does not have a custom pronunciation label, searching a pre-configured pronunciation dictionary according to the word to retrieve the pronunciation phoneme of the word;
if the word is not retrieved from the pronunciation dictionary, performing virtual pronunciation on the word to obtain a pronunciation phoneme of the word;
performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words to obtain a speech evaluation result of the speech data to be evaluated,
wherein the virtually pronouncing the word to obtain a pronunciation phoneme of the word comprises:
the method comprises the steps of virtually pronouncing a word through a virtual pronunciation model to obtain pronunciation phonemes of the word, or virtually pronouncing the word according to a mapping relation between the pre-configured word and the pronunciation phonemes to obtain pronunciation phonemes of the word, wherein the mapping relation between the pre-configured word and the pronunciation phonemes is independent of a pronunciation dictionary.
2. The speech evaluation method according to claim 1, wherein the speech evaluation of the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the word to obtain a speech evaluation result of the speech data to be evaluated comprises:
acquiring a hidden Markov model corresponding to a pronunciation phoneme in the pronunciation phoneme sequence, wherein the hidden Markov model is a pre-trained hidden Markov model aiming at the pronunciation phoneme in the pronunciation phoneme sequence;
and performing voice evaluation on the voice data to be evaluated according to the hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain a voice evaluation result of the voice data to be evaluated.
3. The speech evaluation method according to claim 2, wherein the hidden Markov model is labeled with an identification of the corresponding pronunciation phoneme in the pronunciation phoneme sequence,
the obtaining of the hidden markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence includes:
mapping the pronunciation phonemes in the pronunciation phoneme sequence according to a pre-configured monophone dictionary to obtain the identification of the pronunciation phonemes in the pronunciation phoneme sequence;
and obtaining a hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the identification of the pronunciation phoneme in the pronunciation phoneme sequence.
4. The speech evaluation method according to claim 2, wherein the performing speech evaluation on the speech data to be evaluated according to the hidden Markov model corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain the speech evaluation result of the speech data to be evaluated comprises:
connecting the hidden Markov models corresponding to the pronunciation phonemes in the pronunciation phoneme sequence in series to obtain the hidden Markov models corresponding to the reference text;
and performing voice evaluation on the voice data to be evaluated according to the hidden Markov model corresponding to the reference text to obtain a voice evaluation result of the voice data to be evaluated.
5. The speech evaluation method according to claim 4, wherein the speech evaluation of the speech data to be evaluated according to the hidden Markov model corresponding to the reference text to obtain a speech evaluation result of the speech data to be evaluated comprises:
extracting acoustic features of the voice data frame in the voice data to be evaluated to obtain acoustic feature data of the voice data frame;
determining the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the acoustic feature data of the voice data frame and the hidden Markov model corresponding to the reference text;
and performing voice evaluation on the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain a voice evaluation result of the voice data to be evaluated.
6. The speech evaluation method according to claim 5, wherein the determining the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the acoustic feature data of the speech data frame and the hidden Markov model corresponding to the reference text comprises:
recognizing acoustic feature data of the voice data frame through an acoustic model to obtain the conditional probability that the voice data frame is recognized as any pronunciation phoneme;
and performing path search through a decoder according to the conditional probability that the voice data frame is recognized as any pronunciation phoneme and the hidden Markov model corresponding to the reference text to obtain the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
7. The speech evaluation method according to claim 5, wherein the performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain the speech evaluation result of the speech data to be evaluated comprises:
determining the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the conditional probability that the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as the pronunciation phoneme in the pronunciation phoneme sequence;
and determining the pronunciation accuracy of the voice data to be evaluated according to the pronunciation accuracy of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
8. The speech evaluation method according to claim 5, wherein the performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain the speech evaluation result of the speech data to be evaluated comprises:
determining the pronunciation fluency of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence according to the actual pronunciation duration of the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence and the standard pronunciation duration corresponding to the pronunciation phoneme in the pronunciation phoneme sequence;
and determining the pronunciation fluency of the speech data to be evaluated according to the pronunciation fluency of the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence.
9. The speech evaluation method according to claim 5, wherein the performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain the speech evaluation result of the speech data to be evaluated comprises:
determining the number of pronunciation phonemes in the pronunciation phoneme sequence that are actually pronounced according to the conditional probability that the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence is recognized as the pronunciation phoneme in the pronunciation phoneme sequence;
and determining the pronunciation integrity of the speech data to be evaluated according to the number of pronunciation phonemes in the pronunciation phoneme sequence that are actually pronounced and the total number of pronunciation phonemes in the pronunciation phoneme sequence.
10. The speech evaluation method according to claim 5, wherein after performing speech evaluation on the speech data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence, the method further comprises:
performing voice recognition on the voice data frame corresponding to the pronunciation phoneme in the pronunciation phoneme sequence to obtain an actual pronunciation phoneme of the voice data frame;
and if the actual pronunciation phoneme of the voice data frame is different from the pronunciation phoneme corresponding to the voice data frame, outputting pronunciation error information of the pronunciation phoneme corresponding to the voice data frame and the actual pronunciation phoneme of the voice data frame.
11. A computer storage medium, characterized in that the computer storage medium stores a readable program, the readable program comprising:
instructions for determining a reference text for performing speech evaluation on speech data to be evaluated;
instructions for converting the custom pronunciation label of a word to obtain the pronunciation phoneme of the word if it is detected that the word in the reference text has a custom pronunciation label;
instructions for searching a pre-configured pronunciation dictionary according to the word to retrieve the pronunciation phoneme of the word if it is detected that the word does not have a custom pronunciation label;
instructions for virtually pronouncing the word to obtain a pronunciation phoneme for the word if the word is not retrieved from the pronunciation dictionary;
instructions for performing speech evaluation on the speech data to be evaluated according to the pronunciation phoneme sequence of the reference text formed by the pronunciation phonemes of the words to obtain a speech evaluation result of the speech data to be evaluated,
wherein the virtually pronouncing the word to obtain a pronunciation phoneme of the word comprises:
the method comprises the steps of virtually pronouncing a word through a virtual pronunciation model to obtain pronunciation phonemes of the word, or virtually pronouncing the word according to a mapping relation between the pre-configured word and the pronunciation phonemes to obtain pronunciation phonemes of the word, wherein the mapping relation between the pre-configured word and the pronunciation phonemes is independent of a pronunciation dictionary.
CN202110072627.8A 2021-01-20 2021-01-20 Voice evaluation method and computer storage medium Active CN112397056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110072627.8A CN112397056B (en) 2021-01-20 2021-01-20 Voice evaluation method and computer storage medium

Publications (2)

Publication Number Publication Date
CN112397056A CN112397056A (en) 2021-02-23
CN112397056B true CN112397056B (en) 2021-04-09

Family

ID=74625554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110072627.8A Active CN112397056B (en) 2021-01-20 2021-01-20 Voice evaluation method and computer storage medium

Country Status (1)

Country Link
CN (1) CN112397056B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112802494B (en) * 2021-04-12 2021-07-16 北京世纪好未来教育科技有限公司 Voice evaluation method, device, computer equipment and medium
CN112802456A (en) * 2021-04-14 2021-05-14 北京世纪好未来教育科技有限公司 Voice evaluation scoring method and device, electronic equipment and storage medium
CN112992184B (en) * 2021-04-20 2021-09-10 北京世纪好未来教育科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium
CN113793593B (en) * 2021-11-18 2022-03-18 北京优幕科技有限责任公司 Training data generation method and device suitable for speech recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium
CN110085257A (en) * 2019-03-29 2019-08-02 语文出版社有限公司 A kind of rhythm automated decision system based on the study of national literature classics
CN110136747A (en) * 2019-05-16 2019-08-16 上海流利说信息技术有限公司 A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
US10395640B1 (en) * 2014-07-23 2019-08-27 Nvoq Incorporated Systems and methods evaluating user audio profiles for continuous speech recognition
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant