CN116386665A - Reading scoring method, device, equipment and storage medium - Google Patents

Reading scoring method, device, equipment and storage medium

Info

Publication number
CN116386665A
Authority
CN
China
Prior art keywords
pronunciation
sentence
phoneme
characterization
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211712395.9A
Other languages
Chinese (zh)
Inventor
金海�
吴奎
盛志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211712395.9A priority Critical patent/CN116386665A/en
Publication of CN116386665A publication Critical patent/CN116386665A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiments of the present application disclose a read-aloud scoring method, apparatus, device, and storage medium. The read-aloud text is converted into a corresponding phoneme sequence, and acoustic features are extracted for each speech frame of the voice data corresponding to the text. The pronunciation characterization of each phoneme in the phoneme sequence is obtained based on the acoustic features of the speech frames; the pronunciation characterization of any phoneme is at least used to determine the pronunciation error detection result of that phoneme. The pronunciation characterization of any word is obtained based on the pronunciation characterizations of the phonemes it contains, and is used to determine the pronunciation score of that word. The pronunciation characterization of any sentence is obtained based on the pronunciation characterizations of the words it contains, and the pronunciation comprehensive score of that sentence is determined from it. The pronunciation comprehensive scores of the sentences are then weighted and summed to obtain the pronunciation comprehensive score of the voice data. This improves the accuracy of the scoring result of the voice data corresponding to the read-aloud text.

Description

Reading scoring method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a read-aloud scoring method, apparatus, device, and storage medium.
Background
With the deepening of education reform and continuous breakthroughs in artificial intelligence technology, spoken-language examinations are being rolled out in provinces and cities across the country. The read-aloud question is the most common question type in spoken-language examinations and spoken-language learning, and read-aloud scoring systems based on artificial intelligence technology have begun to be deployed in many regions.
Existing read-aloud scoring systems, however, fall short in accuracy, so improving the accuracy of read-aloud scoring is a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, the present application provides a read-aloud scoring method, apparatus, device, and storage medium, so as to improve the accuracy of read-aloud scoring.
In order to achieve the above object, the following solutions have been proposed:
a speakable scoring method comprising:
converting the speakable text into a corresponding phoneme sequence, and extracting acoustic features of each voice frame of voice data corresponding to the speakable text;
obtaining pronunciation characterizations of each phoneme in the phoneme sequence based on the acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
obtaining the pronunciation characterization of any word based on the pronunciation characterization of the phonemes contained in the word in the speakable text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
obtaining the pronunciation characterization of any sentence in the speakable text based on the pronunciation characterizations of the words contained in that sentence;
determining the pronunciation comprehensive score of any sentence according to the pronunciation characterization of that sentence;
weighting and summing the pronunciation comprehensive scores of all sentences to obtain the pronunciation comprehensive score of the voice data; the weight of any sentence is positively correlated with the length of that sentence.
In the above method, preferably, obtaining the pronunciation characterization of each phoneme in the phoneme sequence based on the acoustic features of each speech frame includes:
encoding acoustic features of each voice frame to obtain an encoding result of each voice frame;
performing attention interaction on the vector representation of any phoneme and the coding result of each voice frame respectively to obtain attention weights of any phoneme and each voice frame;
weighting and summing the acoustic features of the voice frames to obtain the pronunciation characterization of any phoneme, wherein the weight of any voice frame is the attention weight of that phoneme and that voice frame.
The above method, preferably, further comprises:
determining the pronunciation error detection result of any phoneme according to the pronunciation characterization of that phoneme;
and determining the pronunciation score of any word according to the pronunciation characterization of that word.
In the above method, preferably,
the pronunciation characterization of any phoneme is further used to determine the pronunciation skill of that phoneme;
and/or,
the pronunciation characterization of any sentence is also used to determine the pronunciation fluency score of that sentence;
and/or,
the pronunciation characterization of any sentence is also used to determine the pronunciation prosody score of that sentence.
The above method, preferably, further comprises:
determining the pronunciation skill of any phoneme according to the pronunciation characterization of that phoneme;
and/or,
determining the pronunciation fluency score of any sentence according to the pronunciation characterization of that sentence, and weighting and summing the pronunciation fluency scores of all sentences to obtain the pronunciation fluency score of the voice data, wherein the weight of any sentence is positively correlated with the length of that sentence;
and/or,
determining the pronunciation prosody score of any sentence according to the pronunciation characterization of that sentence, and weighting and summing the pronunciation prosody scores of all sentences to obtain the pronunciation prosody score of the voice data, wherein the weight of any sentence is positively correlated with the length of that sentence.
In the above method, preferably, the process of obtaining pronunciation characterizations of each phoneme, pronunciation characterizations of each word, pronunciation characterizations of each sentence, and determining pronunciation comprehensive score of each sentence is implemented through a scoring model;
the scoring model is obtained by training with a text sample and the voice sample corresponding to the text sample as a training sample, and multiple pieces of annotated information as labels;
the multiple pieces of information include at least: the pronunciation error detection results of the phonemes, the pronunciation scores of the words, and the pronunciation comprehensive scores of the sentences corresponding to the voice sample.
In the above method, preferably, the scoring model is trained by:
converting a text sample in a training sample into a corresponding phoneme sequence, and extracting acoustic characteristics of each voice frame of a voice sample in the training sample;
obtaining pronunciation characterization of each phoneme in a phoneme sequence corresponding to the text sample based on acoustic characteristics of each voice frame of the voice sample through the scoring model;
determining the pronunciation error detection result of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of any phoneme in the phoneme sequence corresponding to the text sample through the scoring model;
obtaining, by the scoring model, a pronunciation characterization of any word in the text sample based on the pronunciation characterizations of the phonemes contained in that word;
determining a pronunciation score of any word in the text sample according to the pronunciation characterization of any word in the text sample through the scoring model;
obtaining, by the scoring model, a pronunciation characterization of any sentence in the text sample based on a pronunciation characterization of words contained in any sentence in the text sample;
determining the pronunciation comprehensive score of any sentence in the text sample according to the pronunciation characterization of any sentence in the text sample through the scoring model;
and updating the parameters of the scoring model with the objective that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample approaches that phoneme's pronunciation error detection label, the pronunciation score of each word in the text sample approaches that word's pronunciation score label, and the pronunciation comprehensive score of each sentence in the text sample approaches that sentence's pronunciation comprehensive score label.
In the above method, preferably, the multiple pieces of information further include at least one of: the pronunciation skill of a phoneme, the pronunciation fluency score of a sentence, and the pronunciation prosody score of a sentence.
The above method, preferably, further comprises:
determining, by the scoring model, the pronunciation skill of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of that phoneme; and/or determining the pronunciation fluency score of any sentence in the text sample according to the pronunciation characterization of that sentence; and/or determining the pronunciation prosody score of any sentence in the text sample according to the pronunciation characterization of that sentence;
correspondingly, the process of updating the parameters of the scoring model comprises the following steps:
updating the parameters of the scoring model with a first objective that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample approaches that phoneme's pronunciation error detection label, the pronunciation score of each word in the text sample approaches that word's pronunciation score label, and the pronunciation comprehensive score of each sentence in the text sample approaches that sentence's pronunciation comprehensive score label; and with a second objective that the pronunciation skill of each phoneme in the phoneme sequence corresponding to the text sample approaches that phoneme's pronunciation skill label, and/or the pronunciation fluency score of each sentence in the text sample approaches that sentence's pronunciation fluency score label, and/or the pronunciation prosody score of each sentence in the text sample approaches that sentence's pronunciation prosody score label.
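The joint objective above can be sketched as a multi-task loss. This is a hedged illustration: the squared-error form, the task names, and the secondary-task weighting are assumptions, since the patent only states that each prediction should approach its label.

```python
def multitask_loss(predictions, labels, secondary_weight=1.0):
    # Primary targets: phoneme error detection, word pronunciation score,
    # sentence comprehensive score. Secondary targets (skill, fluency, prosody)
    # are optional ("and/or" in the claims) and only counted when labeled.
    primary = {"phoneme_error", "word_score", "sentence_score"}
    loss = 0.0
    for task, preds in predictions.items():
        if task not in labels:
            continue
        task_loss = sum((p - y) ** 2 for p, y in zip(preds, labels[task]))
        loss += task_loss if task in primary else secondary_weight * task_loss
    return loss

preds = {"phoneme_error": [0.9, 0.1], "word_score": [0.7],
         "sentence_score": [0.8], "sentence_fluency": [0.6]}
golds = {"phoneme_error": [1.0, 0.0], "word_score": [0.8],
         "sentence_score": [0.9], "sentence_fluency": [0.5]}
loss = multitask_loss(preds, golds)
```

Because each task's loss is summed independently, the model can be trained on samples where only some of the labels are present.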
A read-aloud scoring apparatus comprising:
the preprocessing unit is used for converting the speakable text into a corresponding phoneme sequence and extracting acoustic characteristics of each voice frame of voice data corresponding to the speakable text;
a phoneme-level feature obtaining unit configured to obtain a pronunciation characterization of each phoneme in the phoneme sequence based on acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
the word level feature obtaining unit is used for obtaining the pronunciation characterization of any word based on the pronunciation characterization of the phonemes contained in the word in the reading text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
a sentence-level feature obtaining unit, configured to obtain the pronunciation characterization of any sentence based on the pronunciation characterizations of the words contained in that sentence in the speakable text;
a sentence comprehensive scoring unit, configured to determine the pronunciation comprehensive score of any sentence according to the pronunciation characterization of that sentence;
a fusion unit, configured to weight and sum the pronunciation comprehensive scores of all sentences to obtain the pronunciation comprehensive score of the voice data, wherein the weight of any sentence is positively correlated with the length of that sentence.
A read-aloud scoring device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the read-aloud scoring method described in any one of the above.
A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the read-aloud scoring method as defined in any one of the preceding claims.
As can be seen from the above technical solutions, with the read-aloud scoring method, apparatus, device, and storage medium provided in the embodiments of the present application, the read-aloud text is converted into a corresponding phoneme sequence, and acoustic features are extracted for each speech frame of the voice data corresponding to the text; the pronunciation characterization of each phoneme in the phoneme sequence is obtained based on the acoustic features of the speech frames, and is at least used to determine the pronunciation error detection result of that phoneme; the pronunciation characterization of any word is obtained based on the pronunciation characterizations of the phonemes it contains, and is used to determine the pronunciation score of that word; the pronunciation characterization of any sentence is obtained based on the pronunciation characterizations of the words it contains, and the pronunciation comprehensive score of that sentence is determined from it; finally, the pronunciation comprehensive scores of all sentences are weighted and summed to obtain the pronunciation comprehensive score of the voice data, where the weight of any sentence is positively correlated with its length.
In this method, the pronunciation characterization of any sentence is obtained based on the pronunciation characterizations of the words it contains, and the pronunciation characterization of any word is obtained based on the pronunciation characterizations of the phonemes it contains. The pronunciation characterization of any phoneme is at least used to determine that phoneme's pronunciation error detection result, the pronunciation characterization of any word is used to determine that word's pronunciation score, and the pronunciation characterization of any sentence is used to determine that sentence's pronunciation comprehensive score. The pronunciation characterization of each sentence therefore carries feature information for pronunciation assessment at three granularities (phoneme, word, and sentence); that is, it has multi-scale characterization capability. This improves the accuracy of the pronunciation scoring result of each sentence, and in turn the accuracy of the scoring result of the voice data corresponding to the read-aloud text.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of one implementation of the speakable scoring method disclosed in embodiments of the present application;
FIG. 2 is a flowchart of one implementation of obtaining a pronunciation characterization of each phoneme in a phoneme sequence based on acoustic features of each speech frame as disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a scoring model disclosed in an embodiment of the present application;
FIG. 4 is another exemplary diagram of a scoring model disclosed in an embodiment of the present application;
FIG. 5 is a schematic view of another structure of the scoring model disclosed in the embodiments of the present application;
FIG. 6 is a schematic diagram of an example of the read-aloud scoring method disclosed in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a reading scoring device disclosed in an embodiment of the present application;
fig. 8 is a hardware block diagram of a read scoring device disclosed in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The current read-aloud scoring scheme is mainly based on feature engineering and works as follows: the read-aloud text and the examinee's speech are forcibly aligned to obtain insertion/omission information, word- and phoneme-level time boundaries, per-frame state posterior probabilities, and the like; scoring features such as GOP (Goodness of Pronunciation) features, the word insertion/omission ratio, and the average phoneme duration are then computed from the phoneme boundaries produced by forced alignment and used to measure the examinee's pronunciation accuracy, completeness, and fluency; finally, the score of the examinee's speech is obtained from these scoring features.
The current scheme computes GOP features from phoneme boundaries to measure phoneme pronunciation accuracy, but GOP has two inherent drawbacks. First, GOP is sensitive to the alignment boundaries: even a slight change in a boundary can change the GOP value greatly. Second, the current scheme performs forced alignment with an acoustic model trained on a speech recognition task, so it depends heavily on that model's performance; if the acoustic model is not robust, the GOP computation is inaccurate. Measuring pronunciation accuracy with GOP features therefore makes the scoring non-robust.
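For reference, a common formulation of GOP (a hedged sketch; the patent does not specify the exact formula the prior scheme uses) averages, over the frames force-aligned to a phoneme, the log ratio between the target phoneme's posterior and the best posterior in that frame. The posteriors below are made up for illustration.

```python
import math

def gop_score(frame_posteriors, target_phoneme):
    # Average, over the frames aligned to the target phoneme, the log ratio
    # between the target's posterior and the best posterior in each frame.
    # Near 0 suggests a correct pronunciation; strongly negative suggests an error.
    total = 0.0
    for posteriors in frame_posteriors:  # one dict {phoneme: P(phoneme | frame)} per frame
        total += math.log(posteriors[target_phoneme] / max(posteriors.values()))
    return total / len(frame_posteriors)

# Two frames force-aligned to "AE" (illustrative posteriors, not real model output).
frames = [{"AE": 0.8, "EH": 0.2}, {"AE": 0.4, "EH": 0.6}]
score = gop_score(frames, "AE")
```

Shifting the alignment boundary changes which frames enter the average and thus changes the score directly, which makes the boundary-sensitivity drawback described above concrete.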
In addition, the current scheme relies on manually extracted (i.e., manually defined) scoring features (such as GOP features, the word insertion/omission ratio, and the average phoneme duration), and such hand-crafted features easily fail to cover the scoring criteria fully, which also lowers scoring accuracy.
The solution of the present application is proposed to improve the accuracy of read-aloud scoring.
As shown in fig. 1, an implementation flow of the read-aloud scoring method provided in the embodiments of the present application may include:
step S101: and converting the speakable text into a corresponding phoneme sequence, and extracting acoustic features of each voice frame of voice data corresponding to the speakable text.
The speakable text may be text in any language, such as Chinese, English, or Russian.
For some languages, such as Chinese or English, polyphones exist (one spelling with multiple pronunciations), so the phoneme sequence corresponding to the speakable text can be determined in advance through forced alignment (Force Alignment), or pronunciation prediction can be performed with Grapheme-to-Phoneme (G2P) conversion, to improve the accuracy of the phoneme sequence.
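As an illustration of the idea (the mini-lexicon, the words, and the ARPAbet-style phonemes below are hypothetical, not taken from the patent), a dictionary-based G2P step might look like:

```python
# Hypothetical mini-lexicon; polyphones such as "read" carry several variants,
# and a real system would use forced alignment or a trained G2P model to choose.
LEXICON = {
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],
    "the": [["DH", "AH"]],
}

def text_to_phonemes(words, pick=lambda word, variants: variants[0]):
    # Flatten the word sequence into a phoneme sequence, remembering which
    # phoneme span belongs to which word (needed later for word-level pooling).
    phonemes, word_spans = [], []
    for word in words:
        chosen = pick(word, LEXICON[word.lower()])
        word_spans.append((len(phonemes), len(phonemes) + len(chosen)))
        phonemes.extend(chosen)
    return phonemes, word_spans

phonemes, spans = text_to_phonemes(["read", "the"])
# phonemes: ['R', 'IY', 'D', 'DH', 'AH']; spans: [(0, 3), (3, 5)]
```

Keeping the word spans alongside the flat phoneme sequence is what later allows phoneme-level characterizations to be pooled back into word-level ones.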
The voice data corresponding to the speakable text is the voice data collected by the recording device when the reader speaks the speakable text. The recording device may be a headset microphone or a collar clip microphone worn by the reader, although other types of microphones are possible, such as a podium microphone, etc.
The speech data may first be split into frames and pre-emphasized, and the acoustic features of each frame (each frame of speech data is recorded as one speech frame) are then extracted in sequence. The acoustic features may be, for example, spectral features of the speech data, such as Mel Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Predictive (PLP) features.
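The framing and pre-emphasis steps can be sketched as follows. This is a minimal pure-Python illustration; the frame length, hop size, and pre-emphasis coefficient are typical values rather than values stated in the patent, and the order of the two steps may differ in practice.

```python
def preemphasize(samples, alpha=0.97):
    # Pre-emphasis filter y[n] = x[n] - alpha * x[n-1]; boosts high frequencies.
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_signal(samples, frame_len, hop):
    # Split the signal into overlapping frames; each frame later yields one
    # acoustic feature vector (e.g. MFCC or PLP features).
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

signal = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.5, 1.0]
frames = frame_signal(preemphasize(signal), frame_len=4, hop=2)
# 3 frames of 4 samples each, with 50% overlap
```

Real systems operate on audio sampled at, say, 16 kHz with 25 ms frames and a 10 ms hop; the toy 8-sample signal here only shows the mechanics.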
Step S102: obtaining pronunciation characterization of each phoneme in the phoneme sequence based on the acoustic features of each speech frame; the pronunciation characterization of any phoneme is used at least to determine the pronunciation error detection result of any phoneme. The pronunciation error detection result of any phoneme indicates whether the pronunciation of any phoneme is correct.
For the i-th phoneme in the phoneme sequence (i = 1, 2, ..., N, where N is the total number of phonemes in the sequence), the pronunciation characterization of the i-th phoneme can be obtained from the acoustic features of the speech frames and the association between the i-th phoneme and each speech frame.
In this application, the pronunciation characterization of the i-th phoneme can be used to determine whether the pronunciation of the i-th phoneme is correct, and can also be used for other tasks; see the following embodiments for details. In other words, what is obtained for the i-th phoneme based on the acoustic features of the speech frames is a pronunciation characterization that is at least usable for determining whether the i-th phoneme is pronounced correctly.
Step S103: obtaining a pronunciation characterization of any word based on the pronunciation characterization of phonemes contained in any word in the speakable text; the pronunciation characterization of any word is used to determine a score (abbreviated pronunciation score for ease of distinction and description) for the pronunciation of any word.
For Chinese text, a single character serves as a word.
Since the speakable text is converted into the phoneme sequence word by word, it is known which phonemes belong to the same word. That is, for the j-th word in the speakable text (j = 1, 2, ..., M, where M is the total number of words in the text), it is known which phonemes the j-th word contains.
The pronunciation characterizations of the j-th word used to determine the pronunciation score of the j-th word are obtained based on the pronunciation characterizations of the phonemes contained in the j-th word in the speakable text.
Alternatively, long Short-Term Memory (LSTM) coding may be performed on the pronunciation representation of each phoneme included in the jth word to obtain a coding result with context information of each phoneme included in the jth word, and the coding result of the last phoneme included in the jth word may be determined as the pronunciation representation of the jth word. Alternatively, the long-short-time memory code may be a unidirectional long-short-time memory code or a bidirectional long-short-time memory code. In the case of performing the bidirectional long-short-time memory coding on the pronunciation characterizations of the phonemes included in the jth word, the coding result of the first phoneme included in the jth word may be determined as the pronunciation characterizations of the jth word.
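The pooling choice described above can be sketched as follows. Here `toy_context_encoder` is a hypothetical simplification standing in for a unidirectional LSTM, shown only to make the "take the last phoneme's encoding" idea concrete; it is not the model the patent uses.

```python
def toy_context_encoder(vectors, decay=0.5):
    # Hypothetical stand-in for a unidirectional LSTM: each output mixes the
    # current input with the previous output, so later positions carry left context.
    outputs, state = [], [0.0] * len(vectors[0])
    for v in vectors:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, v)]
        outputs.append(state)
    return outputs

def word_characterization(phoneme_characterizations):
    # Encode the word's phoneme characterizations with context, then take the
    # last phoneme's encoding as the word-level pronunciation characterization.
    return toy_context_encoder(phoneme_characterizations)[-1]

# Toy 2-dimensional characterizations for a 3-phoneme word.
word_vec = word_characterization([[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]])
```

With a bidirectional encoder, the first phoneme's backward-direction encoding would be an equally valid pooling choice, as the passage above notes.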
Step S104: the pronunciation characterization of any sentence is obtained based on the pronunciation characterization of the words contained in any sentence in the speakable text.
For the k-th sentence in the speakable text (k = 1, 2, ..., Q, where Q is the total number of sentences in the text), the pronunciation characterization of the k-th sentence is obtained based on the pronunciation characterizations of the words it contains.
Alternatively, long-short-term memory coding can be performed on the pronunciation characterizations of the words contained in the kth sentence, so as to obtain the coding result with the context information of each word contained in the kth sentence, and the coding result of the last word contained in the kth sentence can be determined as the pronunciation characterizations of the kth sentence. Alternatively, the long-short-time memory code may be a unidirectional long-short-time memory code or a bidirectional long-short-time memory code. In the case of bi-directional long-short-term memory coding of pronunciation characterizations of respective words contained in the kth sentence, the coding result of the first word contained in the kth sentence may also be determined as the pronunciation characterizations of the kth sentence.
Step S105: a composite score (abbreviated as pronunciation composite score for ease of distinction and description) of the pronunciation of any sentence is determined based on the pronunciation characterization of any sentence.
Optionally, fully connected (FC) processing (for convenience of description and distinction, denoted as the first fully connected processing) may be applied to the pronunciation characterization of the k-th sentence to obtain the first fully connected result corresponding to the k-th sentence; the first fully connected result corresponding to the k-th sentence is then processed through an activation function to obtain the pronunciation comprehensive score of the k-th sentence.
Alternatively, the activation function may be a Sigmoid function.
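The first fully connected processing followed by a Sigmoid can be sketched as below; the weight and bias values are illustrative only, since in the application they are learned parameters of the scoring model.

```python
import math

def sentence_comprehensive_score(sentence_vec, fc_weights, fc_bias):
    # First fully connected processing: project the sentence's pronunciation
    # characterization to a scalar, then squash it into (0, 1) with a Sigmoid.
    logit = sum(w * x for w, x in zip(fc_weights, sentence_vec)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))

score = sentence_comprehensive_score([0.3, -0.1, 0.8],
                                     fc_weights=[1.0, 0.5, 2.0], fc_bias=-0.5)
# score lies in (0, 1); it can be rescaled to a 0-100 mark if needed
```

The Sigmoid keeps every sentence score in a common bounded range, which is what makes the later length-weighted averaging across sentences well behaved.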
Step S106: weighting and summing the pronunciation comprehensive scores of all sentences to obtain pronunciation comprehensive scores of the voice data; the weight of any sentence is positively correlated with the length of any sentence.
The length of the kth sentence may be characterized by the number of words contained in the kth sentence, and as an example, the weight of the kth sentence may be: the ratio of the number of words contained in the kth sentence to the sum of the number of words contained in the Q sentences in the speakable text.
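The length-weighted fusion of step S106, with the weights defined as above, can be sketched as:

```python
def utterance_comprehensive_score(sentence_scores, sentence_word_counts):
    # Weight each sentence's comprehensive score by its share of the total word
    # count, so that longer sentences contribute proportionally more.
    total_words = sum(sentence_word_counts)
    return sum(score * count / total_words
               for score, count in zip(sentence_scores, sentence_word_counts))

# Three sentences of 10, 5, and 5 words: weights 0.5, 0.25, and 0.25.
overall = utterance_comprehensive_score([0.9, 0.6, 0.8], [10, 5, 5])
```

Because the weights sum to 1, the overall score stays on the same scale as the per-sentence scores; the same fusion applies to the fluency and prosody scores mentioned later.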
According to the read-aloud scoring method provided in the embodiments of the present application, the pronunciation characterization of any sentence is obtained based on the pronunciation characterizations of the words it contains, and the pronunciation characterization of any word is obtained based on the pronunciation characterizations of the phonemes it contains. The pronunciation characterization of any phoneme is at least used to determine that phoneme's pronunciation error detection result, the pronunciation characterization of any word is used to determine that word's pronunciation score, and the pronunciation characterization of any sentence is used to determine that sentence's pronunciation comprehensive score. The pronunciation characterization of each sentence thus carries feature information for pronunciation assessment at three granularities (phoneme, word, and sentence), i.e., it has multi-scale characterization capability, which improves the accuracy of the pronunciation scoring result of each sentence and, further, the accuracy of the scoring result of the voice data corresponding to the reading text.
In an alternative embodiment, a flowchart of an implementation of obtaining the pronunciation characterizations of each phoneme in the phoneme sequence based on the acoustic features of each speech frame as shown in fig. 2 may include:
step S201: and encoding the acoustic characteristics of each voice frame to obtain the encoding result of each voice frame.
Step S202: and respectively carrying out attention interaction on the vector representation of any phoneme and the encoding result of each voice frame to obtain the attention weight of any phoneme and each voice frame.
The vector representation of the ith phoneme may be obtained by performing one-hot encoding (one-hot) on the ith phoneme, or may be obtained by encoding the ith phoneme in other encoding manners.
Alternatively, multi-head attention interaction may be performed between the vector representation of the ith phoneme and the encoding result of each voice frame, so as to obtain the attention weights of the ith phoneme and each voice frame.
Step S203: the acoustic characteristics of each voice frame are weighted and summed to obtain the pronunciation representation of any phoneme; the weight of any speech frame is the attention weight of any phoneme and any speech frame.
In the present application, when the acoustic features of the speech frames are weighted and summed, the weight of each speech frame is the attention weight between that speech frame and the same phoneme. For convenience of distinction and description, the attention weight between the ith phoneme and the pth speech frame (p = 1, 2, 3, ..., R, where R is the total number of speech frames contained in the speech data) is denoted $w_{ip}$, and the acoustic feature of the pth speech frame is denoted $F_p$. The pronunciation characterization $F_i$ of the ith phoneme can then be expressed as:

$$F_i = \sum_{p=1}^{R} w_{ip} F_p$$
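Steps S202–S203 can be sketched as follows. This is a single-head, dot-product stand-in for the patent's multi-head attention, and the function name and toy dimensions are assumptions:

```python
import math

def attention_pool(phoneme_vec, frame_encodings, frame_features):
    """For one phoneme: dot-product attention scores against each
    frame's encoding result (step S202), softmax-normalized into the
    weights w_ip, then a weighted sum of the frames' acoustic
    features (step S203) giving the pronunciation characterization F_i."""
    scores = [sum(p * e for p, e in zip(phoneme_vec, enc))
              for enc in frame_encodings]
    # numerically stable softmax over frames
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    dim = len(frame_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, frame_features))
            for d in range(dim)]
```

Because the weights sum to one, the result is a convex combination of the frames' acoustic features.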
in an alternative embodiment, the reading scoring method provided in the present application may further include:
determining the pronunciation error detection result of any phoneme according to the pronunciation characterization of the any phoneme.
Optionally, the pronunciation characterization of the ith phoneme may be subjected to full-connection processing (for convenience of description and distinction, denoted as second full-connection processing), so as to obtain a second full-connection result corresponding to the ith phoneme; and processing a second full-connection result corresponding to the ith phoneme through an activation function to obtain a pronunciation error detection result of the ith phoneme.
Alternatively, the activation function may be a Sigmoid function.
In an alternative embodiment, the reading scoring method provided in the present application may further include:
the pronunciation score of any word is determined based on the pronunciation characterizations of the word.
Optionally, the pronunciation characterization of the jth word may be fully connected (for convenience of description and distinction, denoted as third fully connected processing) to obtain a third fully connected result corresponding to the jth word; and processing a third full-connection result corresponding to the jth word through an activation function to obtain the pronunciation score of the jth word.
Alternatively, the activation function may be a Sigmoid function.
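The FC-plus-Sigmoid heads described above (first, second and third full-connection processing) all share the same shape, which can be sketched as one linear layer followed by a Sigmoid; the weights here are illustrative, not learned:

```python
import math

def fc_sigmoid_head(representation, weights, bias):
    """One fully connected layer over a pronunciation characterization,
    then a Sigmoid squashing the result into a score in (0, 1) —
    the common shape of the error-detection, word-score and
    sentence-composite-score heads."""
    z = sum(w * x for w, x in zip(weights, representation)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

A zero pre-activation maps to a score of exactly 0.5, and larger pre-activations map monotonically toward 1.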
In the present application, by determining the pronunciation error detection result of each phoneme and the pronunciation score of each word, the pronunciation error detection result of each phoneme, the pronunciation score of each word and the pronunciation comprehensive score of each sentence can be output when required, so that the pronunciation scoring results for words, sentences and the voice data are all interpretable.
Alternatively, the pronunciation error detection result of each phoneme, the pronunciation score of each word, and the pronunciation comprehensive score of each sentence may be output at the same time as the pronunciation comprehensive score of the speech data. Alternatively, after the pronunciation comprehensive score of the speech data is output, the pronunciation error detection result of each phoneme, the pronunciation score of each word, and the pronunciation comprehensive score of each sentence may be output when an interpretation information display instruction is received.
In an alternative embodiment, the method comprises, among other things,
the pronunciation characterization of any phoneme is also used to determine the pronunciation skill of the any phoneme. I.e. the pronunciation characterization of the i-th phoneme is also used to determine the pronunciation skill of the i-th phoneme. Pronunciation skills may include, but are not limited to, the following: continuous reading, explosion loss, turbidity, no pronunciation, etc.
And/or,
the pronunciation characterization of any sentence is also used to determine a pronunciation fluency score for the any sentence. I.e., the pronunciation characterization of the kth sentence is also used to determine the pronunciation fluency score of the kth sentence.
And/or,
the pronunciation characterization of any sentence is also used to determine a pronunciation prosody score for the any sentence. I.e., the pronunciation characterization of the kth sentence is also used to determine the pronunciation prosody score of the kth sentence.
Pronunciation skill, pronunciation fluency and pronunciation prosody are suprasegmental information. That is, in the present application, in addition to carrying feature information for pronunciation assessment at the three granularities of phonemes, words and sentences, the pronunciation characterization of each sentence also carries feature information for pronunciation assessment from at least one of three angles: pronunciation skill, pronunciation fluency and pronunciation prosody. In other words, the pronunciation characterization of each sentence has multi-scale, multi-angle (namely segmental plus at least one suprasegmental) characterization capability, which further improves the accuracy of the pronunciation scoring result of each sentence and, in turn, the accuracy of the scoring result of the voice data corresponding to the speakable text.
In an optional embodiment, the reading scoring method provided in the embodiment of the present application may further include:
and determining the pronunciation skill of any phoneme according to the pronunciation characterization of any phoneme. I.e. determining the pronunciation skill of the ith phoneme based on the pronunciation characterizations of the ith phoneme.
And/or,
determining the pronunciation fluency score of any sentence according to the pronunciation characterization of the sentence, namely determining the pronunciation fluency score of the kth sentence according to the pronunciation characterization of the kth sentence; and weighting and summing the pronunciation fluency scores of the sentences to obtain pronunciation fluency scores corresponding to the voice data. The weight of any sentence is positively correlated with the length of that sentence. The determination of the weight of the sentence can be referred to the foregoing embodiment, and will not be described herein.
And/or,
determining a pronunciation prosody score of any sentence according to the pronunciation characterization of the sentence, namely determining the pronunciation prosody score of the kth sentence according to the pronunciation characterization of the kth sentence; weighting and summing the pronunciation rhythm scores of each sentence to obtain a pronunciation rhythm score corresponding to the voice data; the weight of any sentence is positively correlated with the length of that sentence. The determination of the weight of the sentence can be referred to the foregoing embodiment, and will not be described herein.
In the method, the pronunciation skill of each phoneme, the pronunciation fluency score of each sentence and the pronunciation prosody score can be output when required by determining the pronunciation skill of each phoneme, the pronunciation fluency score of each sentence and the pronunciation prosody score, so that the pronunciation scoring result of the word, the sentence and the voice data has stronger interpretability.
Alternatively, the pronunciation error detection result and pronunciation skill of each phoneme, the pronunciation score of each word, and the pronunciation comprehensive score, pronunciation fluency score and pronunciation prosody score of each sentence may be output simultaneously with the pronunciation comprehensive score of the speech data. Alternatively, after the pronunciation comprehensive score, pronunciation fluency score and pronunciation prosody score of the speech data are output, the pronunciation error detection result and pronunciation skill of each phoneme, the pronunciation score of each word, and the pronunciation comprehensive score, pronunciation fluency score and pronunciation prosody score of each sentence may be output when an interpretation information display instruction is received.
In an alternative embodiment, an implementation of determining pronunciation skills of any phoneme based on the pronunciation characterizations of any phoneme may include:
and performing full-connection processing (for convenience of description and distinction, denoted as fourth full-connection processing) on the pronunciation characterization of the ith phoneme to obtain a fourth full-connection result corresponding to the ith phoneme.
And processing a fourth full-connection result corresponding to the ith phoneme through an activation function to obtain the pronunciation skill of the ith phoneme.
Alternatively, the activation function may be a softmax function.
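A hedged sketch of the FC-plus-softmax skill head: a linear layer maps the phoneme's pronunciation characterization to one logit per skill class (e.g. liaison, loss of plosion, voicing, silence), and a softmax turns the logits into a distribution. The class layout and weights are assumptions:

```python
import math

def fc_softmax_head(representation, weight_matrix, biases):
    """Fourth full-connection processing followed by softmax:
    one row of weight_matrix (plus bias) per pronunciation skill
    class, producing a probability distribution over skills."""
    logits = [sum(w * x for w, x in zip(row, representation)) + b
              for row, b in zip(weight_matrix, biases)]
    m = max(logits)  # numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The predicted skill would then be the argmax of the returned distribution.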
In an alternative embodiment, an implementation of determining a pronunciation fluency score for any sentence based on a pronunciation characterization of the any sentence may include:
and carrying out full-connection processing (fifth full-connection processing is marked for convenience of description and distinction) on the pronunciation characterization of the kth sentence to obtain a fifth full-connection result corresponding to the kth sentence.
And processing a fifth full-connection result corresponding to the kth sentence through an activation function to obtain the pronunciation fluency score of the kth sentence.
Alternatively, the activation function may be a Sigmoid function.
In an alternative embodiment, one implementation of determining the pronunciation prosody score of any sentence based on the pronunciation characterization of that sentence may include:
performing full-connection processing on the pronunciation characterization of the kth sentence (denoted, for convenience of description and distinction, as sixth full-connection processing) to obtain a sixth full-connection result corresponding to the kth sentence.
And processing a sixth full-connection result corresponding to the kth sentence through an activation function to obtain the pronunciation rhythm score of the kth sentence.
Alternatively, the activation function may be a Sigmoid function.
In an alternative embodiment, the process of obtaining the pronunciation characterizations of each phoneme, the pronunciation characterizations of each word, the pronunciation characterizations of each sentence, and determining the pronunciation comprehensive score of each sentence is implemented by a scoring model;
The scoring model is obtained by training with a text sample and the voice sample corresponding to the text sample serving as training samples, and a plurality of labeled items of information serving as labels. The voice sample corresponding to the text sample is the voice data collected by the voice collection device when a reader reads the text sample aloud.
The labeled information may include at least: the pronunciation error detection result of each phoneme, the pronunciation score of each word and the pronunciation comprehensive score of each sentence corresponding to the voice sample.
As shown in fig. 3, a schematic structural diagram of a scoring model provided in an embodiment of the present application may include:
a phoneme feature obtaining module 301, a word feature obtaining module 302, a sentence feature obtaining module 303 and a sentence comprehensive score determining module 304; wherein,
the phoneme characteristic obtaining module 301 is configured to obtain a pronunciation representation of each phoneme in the phoneme sequence based on the acoustic characteristic of each speech frame.
Alternatively, the phoneme feature obtaining module 301 may be implemented by using an Encoder-Decoder framework, and may include an encoding module 3011 and a decoding module 3012; the coding module 3011 is configured to code acoustic features of each voice frame to obtain a coding result of each voice frame; the decoding module 3012 is configured to perform attention interaction on the vector representation of any phoneme and the encoding result of each speech frame, so as to obtain attention weights of the any phoneme and each speech frame; the acoustic characteristics of each voice frame are weighted and summed to obtain the pronunciation representation of any phoneme; the weight of any speech frame is the attention weight of any phoneme and any speech frame.
The word feature obtaining module 302 is configured to obtain a pronunciation representation of any word in the speakable text based on a pronunciation representation of phonemes contained in the any word.
Alternatively, the word feature obtaining module 302 may be implemented through the first LSTM network, and the specific implementation process is described in the foregoing embodiment, which is not described herein again.
The sentence characteristic obtaining module 303 is configured to obtain a pronunciation representation of any sentence in the speakable text based on a pronunciation representation of a word contained in the any sentence.
Alternatively, the sentence characteristic obtaining module 303 may be implemented through the second LSTM network, and the specific implementation process is described in the foregoing embodiment, which is not described herein again.
The sentence integrated score determining module 304 is configured to determine a pronunciation integrated score of any sentence according to the pronunciation characterizations of the any sentence.
Alternatively, the sentence integrated score determining module 304 may be implemented through the first fully-connected network and the first activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein.
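The roll-up performed by modules 301–303 — phoneme characterizations to word characterizations to sentence characterizations — can be sketched structurally as below. The patent uses LSTM networks (the first and second LSTM networks) for the word- and sentence-level encoding; mean pooling stands in here purely to keep the sketch short:

```python
def hierarchical_characterizations(phoneme_reps, word_to_phonemes, sentence_to_words):
    """Structural sketch of the fig. 3 pipeline: word characterizations
    are built from the phonemes each word contains, and sentence
    characterizations from the words each sentence contains.
    (Pooling is a stand-in for the patent's LSTM roll-up.)"""
    def mean(vectors):
        n = len(vectors)
        return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]
    word_reps = [mean([phoneme_reps[i] for i in idxs]) for idxs in word_to_phonemes]
    sent_reps = [mean([word_reps[j] for j in idxs]) for idxs in sentence_to_words]
    return word_reps, sent_reps
```

Each level's characterization thus inherits information from the level below, which is what gives the sentence characterization its multi-scale capability.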
Further, the scoring model may further include a phoneme error detection module 401 and a word scoring module 402, as shown in fig. 4, which is another exemplary diagram of the scoring model provided in the embodiment of the present application. Wherein,
the phoneme error detecting module 401 is configured to determine a pronunciation error detecting result of any phoneme according to the pronunciation representation of the any phoneme.
Optionally, the phoneme error detection module 401 may be implemented through the second fully-connected network and the second activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein again.
The word scoring module 402 is configured to determine a pronunciation score for any word based on a pronunciation characterization of the word.
Alternatively, the word scoring module 402 may be implemented through a third fully-connected network and a third activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein.
The scoring models of fig. 3 and fig. 4 can both be obtained by training the scoring model of the structure shown in fig. 4. The scoring model shown in fig. 4 is obtained by training with a text sample and its corresponding voice sample as training samples, and a plurality of labeled items of information as labels. The labeled information includes at least: the pronunciation error detection result of each phoneme, the pronunciation score of each word and the pronunciation comprehensive score of each sentence corresponding to the voice sample.
The process of training the scoring model shown in fig. 4 includes:
converting text samples in the training samples into corresponding phoneme sequences, and extracting acoustic features of each voice frame of the voice samples in the training samples;
the phoneme characteristic obtaining module 301 of the scoring model obtains pronunciation characterizations of each phoneme in the phoneme sequence corresponding to the text sample based on the acoustic characteristics of each speech frame of the speech sample;
Determining the pronunciation error detection result of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of the any phoneme in the phoneme sequence corresponding to the text sample by the phoneme error detection module 401 of the scoring model;
obtaining, by the word feature obtaining module 302 of the scoring model, a pronunciation characterization of any word in the text sample based on the pronunciation characterizations of phonemes contained by the any word in the text sample;
determining, by the word scoring module 402 of the scoring model, a pronunciation score for any word in the text sample based on the pronunciation characterizations of the any word in the text sample;
the sentence characteristic obtaining module 303 of the scoring model obtains a pronunciation characterization of any sentence in the text sample based on the pronunciation characterizations of words contained in the any sentence in the text sample;
determining the pronunciation comprehensive score of any sentence in the text sample according to the pronunciation characterization of the sentence in the text sample by the sentence comprehensive score determining module 304 of the scoring model;
the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample is close to the pronunciation error detection result label of each phoneme of the training sample, the pronunciation score of each word in the text sample is close to the pronunciation score label of each word of the training sample, the pronunciation comprehensive score of each sentence in the text sample is close to the pronunciation comprehensive score label of each sentence of the training sample, and the parameters of the scoring model are updated.
Optionally, a first difference between the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample and the pronunciation error detection result label of each phoneme of the training sample may be calculated by a mean square error function; a second difference between the pronunciation score of each word in the text sample and the pronunciation score label of each word of the training sample may be calculated by a mean square error function; a third difference between the pronunciation comprehensive score of each sentence in the text sample and the pronunciation comprehensive score label of each sentence of the training sample may be calculated by a mean square error function; and the parameters of the scoring model may be updated with the goal of making the first difference, the second difference and the third difference smaller and smaller.
The first difference can be expressed as:

$$\mathcal{L}_{md} = \frac{1}{I}\sum_{i=1}^{I}\left(y_i^{md} - f_{md}(h_i)\right)^2$$

wherein $\mathcal{L}_{md}$ represents the first difference, $I$ represents the number of phonemes in the phoneme sequence, $h_i$ represents the pronunciation characterization of the ith phoneme, $y_i^{md}$ represents the pronunciation error detection result label of the ith phoneme, and $f_{md}(h_i)$ represents the pronunciation error detection result of the ith phoneme output by the scoring model.
The second difference can be expressed as:

$$\mathcal{L}_{word} = \frac{1}{J}\sum_{j=1}^{J}\left(y_j^{word} - f_{word}(h_j)\right)^2$$

wherein $\mathcal{L}_{word}$ represents the second difference, $J$ represents the number of words in the text sample, $h_j$ represents the pronunciation characterization of the jth word, $y_j^{word}$ represents the pronunciation score label of the jth word, and $f_{word}(h_j)$ represents the pronunciation score of the jth word output by the scoring model.
The third difference can be expressed as:

$$\mathcal{L}_{total} = \frac{1}{K}\sum_{k=1}^{K}\left(y_k^{total} - f_{total}(s_k)\right)^2$$

wherein $\mathcal{L}_{total}$ represents the third difference, $K$ represents the number of sentences in the text sample, $s_k$ represents the pronunciation characterization of the kth sentence, $y_k^{total}$ represents the pronunciation comprehensive score label of the kth sentence, and $f_{total}(s_k)$ represents the pronunciation comprehensive score of the kth sentence output by the scoring model.
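The first, second and third differences are all mean square errors, which a minimal sketch can make concrete (all numeric values are illustrative):

```python
def mse(preds, labels):
    """Mean square error, the form of the first difference (phoneme
    error detection), the second difference (word scores) and the
    third difference (sentence comprehensive scores)."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)

# Training shrinks all three jointly; the combined objective can be
# a simple sum of the per-granularity differences.
loss = (mse([0.9, 0.1], [1.0, 0.0])   # phonemes: first difference
        + mse([0.8], [0.75])          # words: second difference
        + mse([0.82], [0.8]))         # sentences: third difference
```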
After training the model shown in fig. 4, a trained scoring model shown in fig. 4 is obtained, and the phoneme error detection module 401 and the word scoring module 402 are removed or turned off from the trained scoring model shown in fig. 4, so that the trained scoring model shown in fig. 3 is obtained.
As shown in fig. 5, which is a schematic structural diagram of a scoring model provided in an embodiment of the present application, on the basis of the embodiment shown in fig. 4, the scoring model may further include at least one of the following three modules:
a pronunciation skill determination module 501, a sentence pronunciation fluency scoring module 502 and a sentence pronunciation prosody scoring module 503; wherein,
the pronunciation skill determination module 501 is configured to determine a pronunciation skill of any phoneme according to the pronunciation characterization of the any phoneme.
The pronunciation skill determination module 501 may be implemented by the fourth fully-connected network and the fourth activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein.
The sentence pronunciation fluency scoring module 502 is configured to determine a pronunciation fluency score of any sentence according to the pronunciation characterizations of the any sentence.
The sentence pronunciation fluency scoring module 502 may be implemented through a fifth fully-connected network and a fifth activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein.
The sentence prosody score module 503 is configured to determine a prosody score of any sentence according to the pronunciation characterizations of the any sentence.
The sentence pronunciation prosody scoring module 503 may be implemented through a sixth fully-connected network and a sixth activation function, and the specific implementation process is described in the foregoing embodiment, which is not described herein.
Training the scoring model shown in fig. 5 may also obtain the scoring model shown in fig. 3 or fig. 4, and specifically, the training the scoring model shown in fig. 5 may include:
converting a text sample in a training sample into a corresponding phoneme sequence, and extracting acoustic characteristics of each voice frame of a voice sample in the training sample;
the phoneme characteristic obtaining module 301 of the scoring model obtains pronunciation characterizations of each phoneme in the phoneme sequence corresponding to the text sample based on the acoustic characteristics of each speech frame of the speech sample;
Determining the pronunciation error detection result of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of the any phoneme in the phoneme sequence corresponding to the text sample by the phoneme error detection module 401 of the scoring model;
optionally, the pronunciation skill determination module 501 of the scoring model determines the pronunciation skill of any phoneme according to the pronunciation characterization of the any phoneme. If the scoring model does not include the pronunciation skill determination module 501, the training process does not include this step.
Obtaining, by the word feature obtaining module 302 of the scoring model, a pronunciation characterization of any word in the text sample based on the pronunciation characterizations of phonemes contained by the any word in the text sample;
determining, by the word scoring module 402 of the scoring model, a pronunciation score for any word in the text sample based on the pronunciation characterizations of the any word in the text sample;
the sentence characteristic obtaining module 303 of the scoring model obtains a pronunciation characterization of any sentence in the text sample based on the pronunciation characterizations of words contained in the any sentence in the text sample;
determining the pronunciation comprehensive score of any sentence in the text sample according to the pronunciation characterization of the sentence in the text sample by the sentence comprehensive score determining module 304 of the scoring model;
Optionally, the sentence pronunciation fluency scoring module 502 of the scoring model determines the pronunciation fluency score of any sentence according to the pronunciation characterization of the any sentence. If the scoring model does not include the sentence pronunciation fluency scoring module 502, the training process does not include this step.
Optionally, the sentence prosody score module 503 of the scoring model determines the prosody score of any sentence according to the pronunciation characterizations of the any sentence. If the scoring model does not include the sentence-pronunciation-prosody scoring module 503, the training process does not include this step.
With the first target that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample approaches the pronunciation error detection result label of each phoneme of the training sample, the pronunciation score of each word in the text sample approaches the pronunciation score label of each word of the training sample, and the pronunciation comprehensive score of each sentence in the text sample approaches the pronunciation comprehensive score label of each sentence of the training sample, and with the second target that the pronunciation skill of each phoneme in the phoneme sequence corresponding to the text sample approaches the pronunciation skill label of each phoneme of the training sample, and/or the pronunciation fluency score of each sentence in the text sample approaches the pronunciation fluency score label of each sentence of the training sample, and/or the pronunciation prosody score of each sentence in the text sample approaches the pronunciation prosody score label of each sentence of the training sample, the parameters of the scoring model are updated.
Optionally, a first difference between the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample and the pronunciation error detection result label of each phoneme of the training sample may be calculated by a mean square error function; a second difference between the pronunciation score of each word in the text sample and the pronunciation score label of each word of the training sample may be calculated by a mean square error function; and a third difference between the pronunciation comprehensive score of each sentence in the text sample and the pronunciation comprehensive score label of each sentence of the training sample may be calculated by a mean square error function;
in the case where the scoring model includes the pronunciation skill determination module 501, a fourth difference of the pronunciation skill of each phoneme in the phoneme sequence corresponding to the text sample and the pronunciation skill label of each phoneme of the training sample may be calculated by the cross entropy function;
the fourth difference can be expressed as:
Figure BDA0004027930640000181
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0004027930640000182
indicating the fourth difference, ++>
Figure BDA0004027930640000183
Pronunciation skill label, h, representing the ith phoneme i Representing a pronunciation characterization of the ith phoneme, f psd (h i ) Representing the pronunciation skill of the ith phoneme output by the scoring model.
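The fourth difference is a cross entropy between the predicted skill distribution and the one-hot skill label, which can be sketched as follows (the function name and toy values are assumptions; labels are given as class indices rather than one-hot vectors):

```python
import math

def cross_entropy(pred_dists, label_indices):
    """Average cross entropy over phonemes: for each phoneme, the
    negative log-probability the model assigns to the labeled
    pronunciation skill class."""
    return -sum(math.log(dist[y])
                for dist, y in zip(pred_dists, label_indices)) / len(pred_dists)
```

A perfectly confident correct prediction contributes zero loss; any spread of probability mass away from the labeled class increases it.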
In the case where the scoring model includes a sentence pronunciation fluency scoring module 502, a fifth difference between the pronunciation fluency score of each sentence in the text sample and the pronunciation fluency score tag of each sentence of the training sample may be calculated by a mean square error function;
The fifth difference can be expressed as:

$$\mathcal{L}_{fluency} = \frac{1}{K}\sum_{k=1}^{K}\left(y_k^{fluency} - f_{fluency}(s_k)\right)^2$$

wherein $\mathcal{L}_{fluency}$ represents the fifth difference, $s_k$ represents the pronunciation characterization of the kth sentence, $y_k^{fluency}$ represents the pronunciation fluency score label of the kth sentence, and $f_{fluency}(s_k)$ represents the pronunciation fluency score of the kth sentence output by the scoring model.
Where the scoring model includes a sentence prosody scoring module 503, a sixth difference of the prosody score of each sentence in the text sample and the prosody score tag of each sentence of the training sample may be calculated by a mean square error function.
The sixth difference can be expressed as:

$$\mathcal{L}_{prosody} = \frac{1}{K}\sum_{k=1}^{K}\left(y_k^{prosody} - f_{prosody}(s_k)\right)^2$$

wherein $\mathcal{L}_{prosody}$ represents the sixth difference, $s_k$ represents the pronunciation characterization of the kth sentence, $y_k^{prosody}$ represents the pronunciation prosody score label of the kth sentence, and $f_{prosody}(s_k)$ represents the pronunciation prosody score of the kth sentence output by the scoring model.
In the case that the scoring model includes at least one of the pronunciation skill determination module 501, the sentence pronunciation fluency scoring module 502 and the sentence pronunciation prosody scoring module 503, the parameters of the scoring model are updated with the goal of making the first difference, the second difference and the third difference smaller and smaller as the first target, and the goal of making the difference corresponding to each included module smaller and smaller as the second target.
After training is completed, a trained scoring model shown in fig. 5 is obtained. The trained scoring model of the structure shown in fig. 4 can be obtained by removing or turning off the pronunciation skill determination module 501, the sentence pronunciation fluency scoring module 502 and the sentence pronunciation prosody scoring module 503 from the trained scoring model shown in fig. 5. The trained scoring model of the structure shown in fig. 3 can be obtained by removing or turning off the pronunciation skill determination module 501, the sentence pronunciation fluency scoring module 502, the sentence pronunciation prosody scoring module 503, the phoneme error detection module 401 and the word scoring module 402 from the trained scoring model shown in fig. 5.
As shown in fig. 6, a schematic illustration of a speakable scoring method provided in an embodiment of the present application is shown, in this example,
after the voice data and the corresponding reading text are obtained, acoustic features are extracted from the voice data, namely, the voice data are divided into frames, acoustic features of each voice frame are extracted, and a phoneme sequence is obtained based on the reading text.
The acoustic features of each speech frame are encoded by an encoder to obtain an encoding result of each speech frame, and the encoding result of each speech frame is input to a decoder. The phoneme sequence is also supplied to the decoder, which obtains the pronunciation characterization of each phoneme in the phoneme sequence based on the acoustic features of the speech frames.
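The decoder step above can be sketched as attention pooling, consistent with the attention unit described later in this application: the phoneme's vector representation is scored against each frame's encoding result, and the acoustic features are summed with the resulting weights. This is a hypothetical minimal version (dot-product scores, pure-Python lists); the patent does not fix the score function or the dimensions:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def phoneme_characterization(phoneme_vec, frame_encodings, frame_features):
    """Attention weight of the phoneme with respect to each speech frame comes
    from the phoneme's vector representation and the frame's encoding result;
    the characterization is the weighted sum of the frames' acoustic features."""
    scores = [sum(p * e for p, e in zip(phoneme_vec, enc)) for enc in frame_encodings]
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, frame_features)) for d in range(dim)]
```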
In the phoneme-level pronunciation evaluation stage, the pronunciation characterization of each phoneme is subjected to a second full-connection process through a second full-connection network FC-2 to obtain a second full-connection result of each phoneme, and the second full-connection result of each phoneme is processed through a sigmoid function to obtain the pronunciation error detection result of each phoneme.
Also in the phoneme-level pronunciation evaluation stage, the pronunciation characterization of each phoneme is subjected to a fourth full-connection process through a fourth full-connection network FC-4 to obtain a fourth full-connection result of each phoneme, and the fourth full-connection result of each phoneme is processed through a softmax function to obtain the pronunciation skill of each phoneme.
The pronunciation characterization of any word in the speakable text is obtained through a first long short-term memory network LSTM-1 based on the pronunciation characterizations of the phonemes contained in that word. That is, the pronunciation characterizations of phonemes belonging to the same word are encoded through the LSTM-1 network to obtain the pronunciation characterization of that word.
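The grouping step above can be sketched as follows. The patent feeds each group of phoneme characterizations through LSTM-1; to keep this sketch dependency-free, mean pooling stands in for the LSTM as the aggregation, and the `word_ids` alignment input is an assumption:

```python
def word_characterizations(phoneme_reprs, word_ids):
    """Group phoneme characterizations by the word they belong to and
    aggregate each group into one word characterization. Mean pooling is
    a stand-in here for the patent's LSTM-1 aggregation."""
    groups = {}
    for repr_, wid in zip(phoneme_reprs, word_ids):
        groups.setdefault(wid, []).append(repr_)
    out = {}
    for wid, reps in groups.items():
        dim = len(reps[0])
        out[wid] = [sum(r[d] for r in reps) / len(reps) for d in range(dim)]
    return out
```

The same grouping pattern applies one level up, where word characterizations belonging to the same sentence are aggregated (LSTM-2 in the patent).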
In the word-level pronunciation evaluation stage, the pronunciation characterization of each word is subjected to a third full-connection process through a third full-connection network FC-3 to obtain a third full-connection result of each word, and the third full-connection result of each word is processed through a sigmoid function to obtain the pronunciation score (i.e. the word score in fig. 6) of each word.
The pronunciation characterization of any sentence in the speakable text is obtained through a second long short-term memory network LSTM-2 based on the pronunciation characterizations of the words contained in that sentence. That is, the pronunciation characterizations of words belonging to the same sentence are encoded through the LSTM-2 network to obtain the pronunciation characterization of that sentence.
In the sentence-level pronunciation evaluation stage, the pronunciation characterization of each sentence is subjected to a first full-connection process through a first full-connection network FC-1 to obtain a first full-connection result of each sentence, and the first full-connection result of each sentence is processed through a sigmoid function to obtain a pronunciation comprehensive score (i.e., the comprehensive score in fig. 6) of each sentence.
In the sentence-level pronunciation evaluation stage, the pronunciation characterization of each sentence is further processed through a fifth full-connection network FC-5 to obtain a fifth full-connection result of each sentence, and the fifth full-connection result of each sentence is processed through a sigmoid function to obtain a pronunciation fluency score (i.e., fluency score in fig. 6) of each sentence.
In the sentence-level pronunciation evaluation stage, the pronunciation characterization of each sentence is further subjected to a sixth full-connection process through a sixth full-connection network FC-6 to obtain a sixth full-connection result of each sentence, and the sixth full-connection result of each sentence is processed through a sigmoid function to obtain a pronunciation prosody score (i.e., a prosody score in fig. 6) of each sentence.
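Each of the three sentence-level heads above (FC-1, FC-5, FC-6) has the same shape: one fully connected layer followed by a sigmoid. A minimal single-output sketch, with illustrative parameter names, assuming the score is read directly from the sigmoid output in (0, 1):

```python
import math

def score_head(sentence_repr, weights, bias):
    """One fully connected layer followed by a sigmoid, mapping a sentence
    characterization to a score in (0, 1). FC-1, FC-5 and FC-6 in fig. 6
    share this shape with separate, independently trained parameters."""
    z = sum(w * x for w, x in zip(weights, sentence_repr)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```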
After the pronunciation evaluation result of each sentence is obtained, the chapter level score can be calculated. Specifically, the pronunciation comprehensive scores of all sentences can be weighted and summed to obtain pronunciation comprehensive scores of the voice data; weighting and summing the pronunciation fluency scores of all sentences to obtain pronunciation fluency scores of the voice data; and weighting and summing the pronunciation rhythm scores of all sentences to obtain the pronunciation rhythm score of the voice data.
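The chapter-level weighted sum can be sketched as below. Normalizing each weight as length / total length is one concrete choice consistent with the requirement elsewhere in this application that a sentence's weight be positively correlated with its length; the patent does not mandate this exact weighting:

```python
def chapter_score(sentence_scores, sentence_lengths):
    """Weighted sum of per-sentence scores; each sentence's weight is its
    length divided by the total length, so longer sentences contribute more.
    The same routine serves the comprehensive, fluency and prosody scores."""
    total = sum(sentence_lengths)
    return sum(s * (n / total) for s, n in zip(sentence_scores, sentence_lengths))
```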
Corresponding to the method embodiment, the present application further provides a reading scoring device, and a schematic structural diagram of the reading scoring device provided in the embodiment of the present application is shown in fig. 7, which may include:
a preprocessing unit 701, a phoneme-level feature obtaining unit 702, a word-level feature obtaining unit 703, a sentence-level feature obtaining unit 704, a sentence comprehensive scoring unit 705, and a fusion unit 706; wherein:
the preprocessing unit 701 is configured to convert a speakable text into a corresponding phoneme sequence, and extract acoustic features of each voice frame of voice data corresponding to the speakable text;
a phoneme-level feature obtaining unit 702 is configured to obtain pronunciation characterizations of each phoneme in the phoneme sequence based on acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
The word-level feature obtaining unit 703 is configured to obtain a pronunciation characterization of any word based on a pronunciation characterization of a phoneme included in the any word in the speakable text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
the sentence-level feature obtaining unit 704 is configured to obtain a pronunciation characterization of any sentence based on a pronunciation characterization of a word included in the sentence in the speakable text;
the sentence comprehensive scoring unit 705 is configured to determine a pronunciation comprehensive score of the any sentence according to the pronunciation characterization of the any sentence;
the fusion unit 706 is configured to weight and sum the pronunciation comprehensive scores of the sentences to obtain a pronunciation comprehensive score of the speech data; the weight of any sentence is positively correlated with the length of the any sentence.
According to the reading scoring device provided by the embodiment of the application, the pronunciation characterization of any sentence is obtained based on the pronunciation characterizations of the words contained in that sentence, and the pronunciation characterization of any word is obtained based on the pronunciation characterizations of the phonemes contained in that word. The pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of that phoneme, the pronunciation characterization of any word is used for determining the pronunciation score of that word, and the pronunciation characterization of any sentence is used for determining the pronunciation comprehensive score of that sentence. Therefore, in the present application, the pronunciation characterization of each sentence carries feature information of three granularities (phoneme, word and sentence) for pronunciation assessment; that is, the pronunciation characterization of each sentence has multi-scale characterization capability, which improves the accuracy of the pronunciation scoring result of each sentence and further improves the accuracy of the scoring result of the voice data corresponding to the reading text.
In an alternative embodiment, the phoneme level feature obtaining unit 702 includes:
the coding unit is used for coding the acoustic characteristics of each voice frame to obtain a coding result of each voice frame;
the attention unit is used for respectively carrying out attention interaction on the vector representation of any phoneme and the coding result of each voice frame to obtain the attention weight of any phoneme and each voice frame;
the acquisition unit is used for weighting and summing the acoustic characteristics of each voice frame to obtain the pronunciation characterization of any phoneme; the weight of any voice frame is the attention weight of any phoneme and any voice frame.
In an alternative embodiment, the reading scoring device further comprises:
the phoneme pronunciation evaluation unit is used for determining the pronunciation error detection result of any phoneme according to the pronunciation characterization of any phoneme;
and the word pronunciation evaluation unit is used for determining the pronunciation score of any word according to the pronunciation characterization of any word.
In an alternative embodiment,
the pronunciation characterization of any one of the phonemes is further used to determine pronunciation skills of the any one of the phonemes;
and/or,
the pronunciation characterization of any sentence is also used for determining pronunciation fluency score of any sentence;
and/or,
the pronunciation characterization of any sentence is also used to determine a pronunciation prosody score for the any sentence.
In an alternative embodiment, the reading scoring device further comprises:
a phoneme pronunciation skill evaluating unit for determining the pronunciation skill of any phoneme according to the pronunciation characterization of any phoneme;
and/or,
the pronunciation fluency evaluation unit is used for determining the pronunciation fluency score of any sentence according to the pronunciation characterization of the any sentence; weighting and summing the pronunciation fluency scores of all sentences to obtain pronunciation fluency scores corresponding to the voice data; the weight of any sentence is positively correlated with the length of the any sentence;
and/or,
the pronunciation prosody evaluation unit is used for determining the pronunciation prosody score of any sentence according to the pronunciation characterization of the any sentence; weighting and summing the pronunciation rhythm scores of all sentences to obtain pronunciation rhythm scores corresponding to the voice data; the weight of any sentence is positively correlated with the length of the any sentence.
In an alternative embodiment, the process of obtaining pronunciation characterizations of each phoneme, pronunciation characterizations of each word, pronunciation characterizations of each sentence, and determining pronunciation comprehensive scores of each sentence by the reading device is implemented by a scoring model;
The scoring model is obtained by training with a text sample and a voice sample corresponding to the text sample serving as a training sample, and a plurality of pieces of annotated information serving as labels;
the plurality of information at least includes: the pronunciation error detection results of the phonemes, the pronunciation scores of the words, and the pronunciation comprehensive scores of the sentences corresponding to the voice sample.
In an alternative embodiment, the reading scoring device further comprises a training unit for:
converting a text sample in a training sample into a corresponding phoneme sequence, and extracting acoustic characteristics of each voice frame of a voice sample in the training sample;
obtaining pronunciation characterization of each phoneme in a phoneme sequence corresponding to the text sample based on acoustic characteristics of each voice frame of the voice sample through the scoring model;
determining the pronunciation error detection result of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of any phoneme in the phoneme sequence corresponding to the text sample through the scoring model;
obtaining, by the scoring model, a pronunciation characterization of any word in the text sample based on a pronunciation characterization of phonemes contained by any word in the text sample;
determining a pronunciation score of any word in the text sample according to the pronunciation characterization of any word in the text sample through the scoring model;
Obtaining, by the scoring model, a pronunciation characterization of any sentence in the text sample based on a pronunciation characterization of words contained in any sentence in the text sample;
determining the pronunciation comprehensive score of any sentence in the text sample according to the pronunciation characterization of any sentence in the text sample through the scoring model;
and updating parameters of the scoring model by taking the fact that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample is close to the pronunciation error detection result label of each phoneme of the training sample, the pronunciation score of each word in the text sample is close to the pronunciation score label of each word of the training sample, and the pronunciation comprehensive score of each sentence in the text sample is close to the pronunciation comprehensive score label of each sentence of the training sample as a target.
In an alternative embodiment, the plurality of information further includes at least one of: pronunciation skill of phonemes, pronunciation fluency score of sentences, pronunciation prosody score of sentences.
In an alternative embodiment, the training unit is further configured to:
determining the pronunciation skill of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of that phoneme through the scoring model; and/or determining a pronunciation fluency score of any sentence in the text sample according to the pronunciation characterization of that sentence; and/or determining a pronunciation prosody score of any sentence in the text sample according to the pronunciation characterization of that sentence;
Correspondingly, when the training unit updates the parameters of the scoring model, the training unit is used for:
taking, as a first target, that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample is close to the pronunciation error detection result label of each phoneme of the training sample, that the pronunciation score of each word in the text sample is close to the pronunciation score label of each word of the training sample, and that the pronunciation comprehensive score of each sentence in the text sample is close to the pronunciation comprehensive score label of each sentence of the training sample; taking, as a second target, that the pronunciation skill of each phoneme in the phoneme sequence corresponding to the text sample is close to the pronunciation skill label of each phoneme of the training sample, and/or that the pronunciation fluency score of each sentence in the text sample is close to the pronunciation fluency score label of each sentence of the training sample, and/or that the pronunciation prosody score of each sentence in the text sample is close to the pronunciation prosody score label of each sentence of the training sample; and updating the parameters of the scoring model.
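The two-target update described above can be sketched as one joint objective. This is a hypothetical minimal form: the patent does not state how the first-target and second-target differences are combined, so an unweighted sum is assumed:

```python
def training_objective(first_target_diffs, second_target_diffs):
    """Combine the first-target differences (phoneme error detection, word
    score, sentence comprehensive score) with whichever second-target
    differences (skill, fluency, prosody) are enabled. An unweighted sum
    is an assumption; the patent leaves the weighting unspecified."""
    return sum(first_target_diffs) + sum(second_target_diffs)
```

The scoring model's parameters would then be updated to reduce this combined value, e.g. by gradient descent.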
The reading scoring device provided by the embodiment of the application can be applied to reading scoring equipment, such as PC terminals, cloud platforms, servers, server clusters and the like. Optionally, fig. 8 shows a block diagram of a hardware structure of the speakable scoring device, and referring to fig. 8, the hardware structure of the speakable scoring device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
converting the speakable text into a corresponding phoneme sequence, and extracting acoustic features of each voice frame of voice data corresponding to the speakable text;
obtaining pronunciation characterizations of each phoneme in the phoneme sequence based on the acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
obtaining the pronunciation characterization of any word based on the pronunciation characterization of the phonemes contained in the word in the speakable text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
Obtaining the pronunciation characterization of any sentence in the reading text based on the pronunciation characterization of the word contained in the sentence;
determining the pronunciation comprehensive score of any sentence according to the pronunciation characterization of the any sentence;
weighting and summing the pronunciation comprehensive scores of all sentences to obtain pronunciation comprehensive scores of the voice data; the weight of any sentence is positively correlated with the length of the any sentence.
Optionally, for the refinement and extension functions of the program, reference may be made to the corresponding description above.
The embodiment of the application also provides a storage medium, which may store a program adapted to be executed by a processor, the program being configured to:
converting the speakable text into a corresponding phoneme sequence, and extracting acoustic features of each voice frame of voice data corresponding to the speakable text;
obtaining pronunciation characterizations of each phoneme in the phoneme sequence based on the acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
obtaining the pronunciation characterization of any word based on the pronunciation characterization of the phonemes contained in the word in the speakable text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
Obtaining the pronunciation characterization of any sentence in the reading text based on the pronunciation characterization of the word contained in the sentence;
determining the pronunciation comprehensive score of any sentence according to the pronunciation characterization of the any sentence;
weighting and summing the pronunciation comprehensive scores of all sentences to obtain pronunciation comprehensive scores of the voice data; the weight of any sentence is positively correlated with the length of the any sentence.
Optionally, for the refinement and extension functions of the program, reference may be made to the corresponding description above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of reading scoring comprising:
converting the speakable text into a corresponding phoneme sequence, and extracting acoustic features of each voice frame of voice data corresponding to the speakable text;
obtaining pronunciation characterizations of each phoneme in the phoneme sequence based on the acoustic features of each speech frame; the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of any phoneme;
obtaining the pronunciation characterization of any word based on the pronunciation characterization of the phonemes contained in the word in the speakable text; the pronunciation characterization of any word is used for determining the pronunciation score of any word;
obtaining the pronunciation characterization of any sentence in the reading text based on the pronunciation characterization of the word contained in the sentence;
determining the pronunciation comprehensive score of any sentence according to the pronunciation characterization of the any sentence;
weighting and summing the pronunciation comprehensive scores of all sentences to obtain pronunciation comprehensive scores of the voice data; the weight of any sentence is positively correlated with the length of the any sentence.
2. The method of claim 1, wherein said obtaining a pronunciation characterization of each phoneme in the sequence of phonemes based on acoustic features of the respective speech frame comprises:
Encoding acoustic features of each voice frame to obtain an encoding result of each voice frame;
performing attention interaction on the vector representation of any phoneme and the coding result of each voice frame respectively to obtain attention weights of any phoneme and each voice frame;
the acoustic characteristics of the voice frames are weighted and summed to obtain pronunciation characterization of any phoneme; the weight of any voice frame is the attention weight of any phoneme and any voice frame.
3. The method as recited in claim 1, further comprising:
determining the pronunciation error detection result of any phoneme according to the pronunciation characterization of any phoneme;
and determining the pronunciation score of any word according to the pronunciation characterization of the any word.
4. The method of claim 1, wherein,
the pronunciation characterization of any one of the phonemes is further used to determine pronunciation skills of the any one of the phonemes;
and/or,
the pronunciation characterization of any sentence is also used for determining pronunciation fluency score of any sentence;
and/or,
the pronunciation characterization of any sentence is also used to determine a pronunciation prosody score for the any sentence.
5. The method as recited in claim 4, further comprising:
determining pronunciation skills of any phoneme according to the pronunciation characterization of the any phoneme;
and/or,
determining pronunciation fluency scores of any sentence according to pronunciation characterizations of the any sentence; weighting and summing the pronunciation fluency scores of all sentences to obtain pronunciation fluency scores corresponding to the voice data; the weight of any sentence is positively correlated with the length of the any sentence;
and/or,
determining a pronunciation prosody score of any sentence according to the pronunciation characterization of the any sentence; weighting and summing the pronunciation rhythm scores of all sentences to obtain pronunciation rhythm scores corresponding to the voice data; the weight of any sentence is positively correlated with the length of the any sentence.
6. The method of any one of claims 1-5, wherein the steps of obtaining a pronunciation characterization for each phoneme, a pronunciation characterization for each word, a pronunciation characterization for each sentence, and determining a pronunciation composite score for each sentence are performed by a scoring model;
the scoring model is obtained by training a text sample and a voice sample corresponding to the text sample serving as a training sample and a plurality of marked information serving as labels;
the plurality of information at least includes: the pronunciation error detection results of the phonemes, the pronunciation scores of the words, and the pronunciation comprehensive scores of the sentences corresponding to the voice sample.
7. The method of claim 6, wherein the scoring model is trained by:
converting a text sample in a training sample into a corresponding phoneme sequence, and extracting acoustic characteristics of each voice frame of a voice sample in the training sample;
obtaining pronunciation characterization of each phoneme in a phoneme sequence corresponding to the text sample based on acoustic characteristics of each voice frame of the voice sample through the scoring model;
determining the pronunciation error detection result of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of any phoneme in the phoneme sequence corresponding to the text sample through the scoring model;
obtaining, by the scoring model, a pronunciation characterization of any word in the text sample based on a pronunciation characterization of phonemes contained by any word in the text sample;
determining a pronunciation score of any word in the text sample according to the pronunciation characterization of any word in the text sample through the scoring model;
Obtaining, by the scoring model, a pronunciation characterization of any sentence in the text sample based on a pronunciation characterization of words contained in any sentence in the text sample;
determining the pronunciation comprehensive score of any sentence in the text sample according to the pronunciation characterization of any sentence in the text sample through the scoring model;
and updating parameters of the scoring model by taking the fact that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample is close to the pronunciation error detection result label of each phoneme of the training sample, the pronunciation score of each word in the text sample is close to the pronunciation score label of each word of the training sample, and the pronunciation comprehensive score of each sentence in the text sample is close to the pronunciation comprehensive score label of each sentence of the training sample as a target.
8. The method of claim 6, wherein the plurality of information further comprises at least one of: pronunciation skill of phonemes, pronunciation fluency score of sentences, pronunciation prosody score of sentences.
9. The method as recited in claim 8, further comprising:
determining the pronunciation skill of any phoneme in the phoneme sequence corresponding to the text sample according to the pronunciation characterization of that phoneme through the scoring model; and/or determining a pronunciation fluency score of any sentence in the text sample according to the pronunciation characterization of that sentence; and/or determining a pronunciation prosody score of any sentence in the text sample according to the pronunciation characterization of that sentence;
Correspondingly, the process of updating the parameters of the scoring model comprises the following steps:
taking, as a first objective, that the pronunciation error detection result of each phoneme in the phoneme sequence corresponding to the text sample approaches the pronunciation error detection result label of each phoneme of the training sample, the pronunciation score of each word in the text sample approaches the pronunciation score label of each word of the training sample, and the pronunciation comprehensive score of each sentence in the text sample approaches the pronunciation comprehensive score label of each sentence of the training sample; taking, as a second objective, that the pronunciation skill of each phoneme in the phoneme sequence corresponding to the text sample approaches the pronunciation skill label of each phoneme of the training sample, and/or the pronunciation fluency score of each sentence in the text sample approaches the pronunciation fluency score label of each sentence of the training sample, and/or the pronunciation prosody score of each sentence in the text sample approaches the pronunciation prosody score label of each sentence of the training sample; and updating the parameters of the scoring model according to the first objective and the second objective.
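The two-objective training described above can be sketched as a weighted sum of per-task losses. The sketch below is illustrative only, not the patent's actual implementation; the function name, dictionary keys, and the use of mean squared error for every target are all assumptions:

```python
def multitask_loss(outputs, labels, aux_weight=0.5):
    """Combine the primary targets (phoneme error detection, word score,
    sentence comprehensive score) with the optional auxiliary targets
    (phoneme skill, sentence fluency, sentence prosody)."""
    def mse(pred, target):
        # mean squared error over parallel lists of model outputs and labels
        return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

    # First objective: the three mandatory targets.
    loss = (mse(outputs["phoneme_error"], labels["phoneme_error"])
            + mse(outputs["word_score"], labels["word_score"])
            + mse(outputs["sentence_score"], labels["sentence_score"]))

    # Second objective: each auxiliary target is added only when its
    # labels are present, matching the "and/or" wording of the claim.
    for key in ("phoneme_skill", "sentence_fluency", "sentence_prosody"):
        if key in labels:
            loss += aux_weight * mse(outputs[key], labels[key])
    return loss
```

In practice the error-detection target would more likely use a classification loss such as cross-entropy rather than MSE; the uniform MSE here only keeps the sketch short.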
10. A reading scoring device, comprising:
a preprocessing unit, configured to convert the reading text into a corresponding phoneme sequence and to extract acoustic features of each speech frame of the speech data corresponding to the reading text;
a phoneme-level feature obtaining unit, configured to obtain a pronunciation characterization of each phoneme in the phoneme sequence based on the acoustic features of the speech frames, wherein the pronunciation characterization of any phoneme is at least used for determining the pronunciation error detection result of that phoneme;
a word-level feature obtaining unit, configured to obtain a pronunciation characterization of any word in the reading text based on the pronunciation characterizations of the phonemes contained in that word, wherein the pronunciation characterization of any word is used for determining the pronunciation score of that word;
a sentence-level feature obtaining unit, configured to obtain a pronunciation characterization of any sentence in the reading text based on the pronunciation characterizations of the words contained in that sentence;
a sentence comprehensive scoring unit, configured to determine the pronunciation comprehensive score of any sentence according to the pronunciation characterization of that sentence;
and a fusion unit, configured to obtain a pronunciation comprehensive score of the speech data by computing a weighted sum of the pronunciation comprehensive scores of all sentences, wherein the weight of any sentence is positively correlated with the length of that sentence.
11. A reading scoring apparatus, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the reading scoring method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the reading scoring method of any one of claims 1 to 9.
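The length-weighted fusion performed by the fusion unit of claim 10 can be sketched as follows. This is an illustrative sketch under the assumption that the weight is simply proportional to sentence length (the claim only requires positive correlation); the function and argument names are hypothetical:

```python
def fuse_sentence_scores(sentence_scores, sentence_lengths):
    """Weighted sum of per-sentence comprehensive scores, with each
    sentence's weight positively correlated with its length (here:
    proportional to its share of the total length)."""
    total = sum(sentence_lengths)
    weights = [length / total for length in sentence_lengths]
    return sum(w * s for w, s in zip(weights, sentence_scores))
```

For example, with sentence scores [80, 90] and lengths [1, 3], the longer sentence dominates and the fused score is 0.25 * 80 + 0.75 * 90 = 87.5.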
CN202211712395.9A 2022-12-29 2022-12-29 Reading scoring method, device, equipment and storage medium Pending CN116386665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211712395.9A CN116386665A (en) 2022-12-29 2022-12-29 Reading scoring method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116386665A true CN116386665A (en) 2023-07-04

Family

ID=86962232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211712395.9A Pending CN116386665A (en) 2022-12-29 2022-12-29 Reading scoring method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116386665A (en)

Similar Documents

Publication Publication Date Title
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
US9672816B1 (en) Annotating maps with user-contributed pronunciations
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN111640418B (en) Prosodic phrase identification method and device and electronic equipment
CN108766415B (en) Voice evaluation method
CN107886968B (en) Voice evaluation method and system
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
CN112397056B (en) Voice evaluation method and computer storage medium
US11935523B2 (en) Detection of correctness of pronunciation
CN112349300A (en) Voice evaluation method and device
Keshet Automatic speech recognition: A primer for speech-language pathology researchers
Prakoso et al. Indonesian Automatic Speech Recognition system using CMUSphinx toolkit and limited dataset
CN114783464A (en) Cognitive detection method and related device, electronic equipment and storage medium
CN111915940A (en) Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN109697975B (en) Voice evaluation method and device
Shen et al. Self-supervised pre-trained speech representation based end-to-end mispronunciation detection and diagnosis of Mandarin
CN111833859B (en) Pronunciation error detection method and device, electronic equipment and storage medium
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
Alkhatib et al. Building an assistant mobile application for teaching arabic pronunciation using a new approach for arabic speech recognition
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
CN112687291B (en) Pronunciation defect recognition model training method and pronunciation defect recognition method
CN116386665A (en) Reading scoring method, device, equipment and storage medium
CN111199750B (en) Pronunciation evaluation method and device, electronic equipment and storage medium
Wiśniewski et al. Automatic detection and classification of phoneme repetitions using HTK toolkit
Hou et al. Automatic speech attribute transcription (asat)-the front end processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination