CN105845134B - Spoken language evaluation method and system for freely reading question types - Google Patents


Info

Publication number: CN105845134B
Application number: CN201610423082.XA
Authority: CN (China)
Prior art keywords: voice, word, evaluation, voice signal, fluency
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Language: Chinese (zh)
Other versions: CN105845134A
Inventors: 宋碧霄, 潘颂声, 宋铁, 高前勇
Original and current assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; priority to CN201610423082.XA; application granted; published as CN105845134A (application) and CN105845134B (grant)

Classifications

    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G09B 19/04: Educational or demonstration appliances; teaching of speaking
    • G10L 15/04: Segmentation; word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/183: Speech classification or search using natural language modelling with context dependencies, e.g. language models
    • G10L 15/19: Grammatical context, e.g. disambiguation of recognition hypotheses based on word sequence rules
    • G10L 15/26: Speech-to-text systems
    • G10L 2015/027: Syllables as the recognition units
    • G10L 2015/088: Word spotting

Abstract

The invention discloses a spoken language evaluation method and system for the free-reading question type. The method comprises the following steps: receiving a voice signal to be evaluated; performing voice recognition on the voice signal, and segmenting the signal based on boundaries constrained by the recognition text to obtain the voice segment corresponding to each basic voice unit in the signal; extracting pronunciation accuracy characteristics of the voice signal according to the recognition text and the voice segments, and taking them as evaluation characteristics of the voice signal, where the pronunciation accuracy characteristics include any one or more of the following: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words; and calculating an evaluation score for the voice signal according to its evaluation characteristics and a pre-constructed evaluation model. The invention enables accurate automatic evaluation of the free-reading question type.

Description

Spoken language evaluation method and system for freely reading question types
Technical Field
The invention relates to the field of speech signal processing, and in particular to a spoken language evaluation method and system for the free-reading question type.
Background
Spoken language, as an important medium of interpersonal communication, plays an extremely important role in daily life. With continuing social and economic development and accelerating globalization, people place ever higher demands on the efficiency of language learning and on the objectivity, fairness, and scalability of language assessment. For this reason, a number of language teaching and evaluation systems have been developed.
As spoken language evaluation technology matures, more and more learners and teachers of spoken language rely on it for study and instruction. At present, common spoken-language learning scenarios all specify the text to be read aloud, and then evaluate pronunciation accuracy and fluency on the speech read by the learner. However, specifying the reading text restricts the learner to practicing on a given topic or content. To let learners practice more conveniently, the free-reading question type came into being: the learner may choose any text at will and read it aloud to practice speaking.
Because existing spoken language evaluation technology targets a specified reading text and evaluates against known text content, while the free-reading question type has no standard answer, existing technology cannot accurately evaluate the free-reading question type automatically.
Disclosure of Invention
The embodiment of the invention provides a spoken language evaluation method and a spoken language evaluation system for a free-reading question type, which are used for accurately and automatically evaluating the free-reading question type.
Therefore, the embodiment of the invention provides the following technical scheme:
a spoken language evaluating method of a free-reading question type comprises the following steps:
receiving a voice signal to be evaluated;
performing voice recognition on the voice signal, and segmenting the signal based on boundaries constrained by the recognition text to obtain the voice segment corresponding to each basic voice unit in the voice signal;
extracting pronunciation accuracy characteristics of the voice signal according to the recognition text and the voice segments, and taking the pronunciation accuracy characteristics as evaluation characteristics of the voice signal; the pronunciation accuracy characteristics comprise any one or more of the following: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words;
and calculating the evaluation score of the voice signal according to the evaluation characteristics of the voice signal and a pre-constructed evaluation model.
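The four claimed steps can be sketched end to end as follows. Every function name and the toy linear scorer here are illustrative assumptions for exposition, not the patent's actual implementation:

```python
def evaluate_free_reading(signal, recognize, extract_features, score_model):
    """Steps of the claim: recognize and segment, extract evaluation
    features, score with a pre-built evaluation model."""
    text, segments = recognize(signal)           # steps 1-2
    features = extract_features(text, segments)  # step 3
    return score_model(features)                 # step 4

# Toy stand-ins so the sketch runs end to end.
def toy_recognize(signal):
    # pretend recognizer: recognition text plus (word, posterior) segments
    return "hello world", [("hello", 0.9), ("world", 0.7)]

def toy_features(text, segments):
    posteriors = [p for _, p in segments]
    return [sum(posteriors) / len(posteriors)]   # mean posterior only

def toy_score_model(features):
    return 100.0 * features[0]                   # toy linear scorer

score = evaluate_free_reading(None, toy_recognize, toy_features, toy_score_model)
```

The interface deliberately takes the recognizer and the evaluation model as parameters, since the claims leave both unspecified.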
Preferably, the extracting pronunciation accuracy characteristics of the speech signal according to the recognition text and each speech segment includes:
obtaining the classification characteristics of each word in the recognition text according to the recognition text and each voice segment, wherein the classification characteristics comprise any one or more of the following characteristics: acoustic features, language model features, grammatical features;
determining the category of each word based on the classification features and a pre-trained word classification model, wherein the categories comprise: misrecognition, misreading and correct reading;
acquiring any one or more of the following characteristics from the recognition text: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words.
Preferably, the obtaining the acoustic features of each word in the recognition text according to the recognition text and each voice segment includes:
obtaining the posterior probability of all basic voice units contained in each word in the recognition text according to the voice fragments;
for each word, calculating the mean value of the posterior probabilities of all basic phonetic units contained in the word, and taking the mean value as the acoustic feature of the word.
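The mean-of-posteriors acoustic feature described above can be computed as in this minimal sketch; the word-to-unit mapping and the posterior values are made-up examples, not output of any real recognizer:

```python
def word_acoustic_features(word_to_units, unit_posterior):
    """Mean posterior over a word's basic speech units (phonemes or
    syllables), used as that word's acoustic feature."""
    return {word: sum(unit_posterior[u] for u in units) / len(units)
            for word, units in word_to_units.items()}

# hypothetical phoneme-level posteriors from the recognition step
word_to_units = {"hello": ["HH", "EH", "L", "OW"], "cat": ["K", "AE", "T"]}
unit_posterior = {"HH": 0.9, "EH": 0.8, "L": 0.7, "OW": 0.6,
                  "K": 0.95, "AE": 0.85, "T": 0.9}
feats = word_acoustic_features(word_to_units, unit_posterior)
```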
Preferably, the obtaining of the language model feature of each word in the recognition text according to the recognition text and each voice segment includes:
obtaining language model scores of all basic voice units contained in each word in the recognition text according to the voice fragments;
for each word, calculating the average value of the language model scores of all basic phonetic units contained in the word, and taking the average value as the language model characteristic of the word.
Preferably, the obtaining of the grammatical features of each word in the recognition text according to the recognition text and each voice segment includes:
syntax error detection is carried out on the recognition text according to syntax rules to obtain an error detection result;
and determining the grammatical features of each word in the recognition text according to the error detection result.
Preferably, the evaluation feature further comprises: fluency characteristics; the method further comprises the following steps:
extracting acoustic features of each basic classification unit in the voice signal, wherein the basic classification unit is a syllable, a word or a phrase;
determining augmented speech and non-augmented speech in the voice signal by using the acoustic features and a pre-constructed speech classification model;
extracting fluency characteristics of the voice signal according to the augmented speech and non-augmented speech in the voice signal, wherein the fluency characteristics comprise any one or more of the following: the duration proportion of augmented speech, the number of occurrences of augmented speech, and the average speech rate;
the step of calculating the evaluation score of the voice signal according to the evaluation characteristics of the voice signal and a pre-constructed evaluation model comprises the following steps:
respectively calculating pronunciation accuracy scores and fluency scores of the voice signals according to the extracted pronunciation accuracy characteristics and fluency characteristics of the voice signals and an evaluation model corresponding to the characteristics, and then calculating evaluation scores of the voice signals according to the pronunciation accuracy scores and the fluency scores; or
And taking the pronunciation accuracy characteristic and the fluency characteristic of the voice signal as comprehensive evaluation characteristics, and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation characteristics.
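The fluency features named in the claim (duration proportion of augmented speech, count of augmented-speech occurrences, average speech rate) can be sketched as follows. The segment labels, timestamps, and unit count are hypothetical outputs of the speech classifier, not values from the patent:

```python
def fluency_features(segments, n_units):
    """segments: list of (label, start_s, end_s) where label is
    'augmented' or 'normal'; n_units: number of basic speech units
    spoken over the whole signal."""
    total = segments[-1][2] - segments[0][1]
    aug = [(s, e) for label, s, e in segments if label == "augmented"]
    return {
        "augmented_duration_ratio": sum(e - s for s, e in aug) / total,
        "augmented_count": len(aug),
        "average_rate": n_units / total,  # basic units per second
    }

segs = [("normal", 0.0, 2.0), ("augmented", 2.0, 2.5),
        ("normal", 2.5, 4.0), ("augmented", 4.0, 4.5)]
feats = fluency_features(segs, n_units=12)
```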
Preferably, the speech classification model is constructed in the following way:
collecting different types of augmented voice data and non-augmented voice data, and taking the collected voice data as training data;
extracting acoustic features of the training data;
and training by using the acoustic features to obtain the voice classification model.
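The training procedure above leaves the model family unspecified. As one possible stand-in, this sketch fits a one-dimensional Gaussian per class on a single acoustic feature (here, hypothetically, segment duration in seconds) and classifies by the higher class likelihood; the feature choice and all data values are assumptions:

```python
import math

def train_speech_classifier(augmented_feats, normal_feats):
    """Fit a 1-D Gaussian per class from labeled training data;
    the returned function classifies a new feature value."""
    def fit(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        return mean, max(var, 1e-6)  # floor the variance for stability

    params = {"augmented": fit(augmented_feats),
              "non-augmented": fit(normal_feats)}

    def classify(x):
        def loglik(mean, var):
            return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return max(params, key=lambda c: loglik(*params[c]))

    return classify

# hypothetical training data: durations of collected speech segments
classify = train_speech_classifier([0.50, 0.60, 0.55], [0.15, 0.20, 0.18])
```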
Preferably, the evaluation feature further comprises: language expressive power features; the method further comprises the following steps:
extracting language expressiveness characteristics of the voice signal according to the recognition text, wherein the language expressiveness characteristics comprise any one or more of the following:
the semantic coherence characteristic, which indicates whether each sentence or semantic segment in the recognition text is coherent;
the lexical characteristic, which includes any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognition text;
the literary-quality characteristic, which indicates whether each sentence or paragraph in the recognition text is elegantly expressed;
the step of calculating the evaluation score of the voice signal according to the evaluation characteristics of the voice signal and a pre-constructed evaluation model comprises the following steps:
respectively calculating the pronunciation accuracy score, the fluency score and the language expression ability score of the voice signal according to the pronunciation accuracy feature, the fluency feature and the language expression ability feature of the extracted voice signal and an evaluation model corresponding to the feature, and then calculating the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the language expression ability score; or
And taking the pronunciation accuracy characteristic, the fluency characteristic and the language expression capability characteristic of the voice signal as comprehensive evaluation characteristics, and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation characteristics.
Preferably, the extracting the linguistic expression capability feature of the speech signal according to the recognition text comprises:
determining misrecognized words in the recognized text;
and performing correction processing on the recognition text, wherein the correction processing comprises the following steps: removing the misrecognized word from the recognized text or correcting the misrecognized word in the recognized text;
and extracting the language expression capability characteristics of the voice signal according to the corrected recognition text.
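The correction step above can be sketched with the simpler of its two options, dropping the misrecognized words before feature extraction; the word list and the flagged word are made-up examples:

```python
def correct_recognized_text(words, misrecognized):
    """Drop words judged to be recognition errors before extracting
    language-expressiveness features (the claim alternatively allows
    correcting them in place instead of removing them)."""
    return [w for w in words if w not in misrecognized]

# hypothetical recognition output with one word flagged as misrecognized
words = ["i", "like", "grean", "tea"]
cleaned = correct_recognized_text(words, {"grean"})
```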
A spoken language evaluation system of a free-reading question type comprises:
the receiving module is used for receiving a voice signal to be evaluated;
the voice recognition module is used for performing voice recognition on the voice signal and segmenting the signal based on boundaries constrained by the recognition text to obtain the voice segment corresponding to each basic voice unit in the voice signal;
the pronunciation accuracy feature extraction module is used for extracting pronunciation accuracy characteristics of the voice signal according to the recognition text and the voice segments, and taking the pronunciation accuracy characteristics as evaluation characteristics of the voice signal; the pronunciation accuracy characteristics comprise any one or more of the following: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words;
and the evaluation module is used for calculating the evaluation score of the voice signal according to the evaluation characteristics of the voice signal and a pre-constructed evaluation model.
Preferably, the pronunciation accuracy feature extraction module comprises:
a classification feature obtaining unit, configured to obtain, according to the recognition text and each voice segment, a classification feature of each word in the recognition text, where the classification feature includes any one or more of the following: acoustic features, language model features, grammatical features;
a word category determination unit, configured to determine a category of each word based on the classification features and a pre-trained word classification model, where the category includes: misrecognition, misreading and correct reading;
the first computing unit is used for acquiring any one or more of the following characteristics from the recognition text: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words.
Preferably, the classification feature acquisition unit includes:
the acoustic feature acquisition subunit is used for acquiring the posterior probabilities of all basic voice units contained in each word in the recognition text according to the voice fragments; then, for each word, calculating the mean value of the posterior probabilities of all basic phonetic units contained in the word, and taking the mean value as the acoustic feature of the word;
the language model characteristic obtaining subunit is used for obtaining the language model scores of all basic voice units contained in each word in the recognition text according to the voice fragments; then, for each word, calculating the average value of the language model scores of all basic voice units contained in the word, and taking the average value as the language model characteristic of the word;
the grammar characteristic obtaining subunit is used for carrying out grammar error detection on the recognition text according to grammar rules to obtain an error detection result; and then determining the grammatical features of each word in the recognition text according to the error detection result.
Preferably, the system further comprises:
the fluency feature extraction module is used for extracting fluency characteristics of the voice signal, wherein the fluency characteristics comprise any one or more of the following: the duration proportion of augmented speech, the number of occurrences of augmented speech, and the average speech rate; the fluency feature extraction module comprises:
the acoustic feature extraction unit, which is used for extracting acoustic features of each basic classification unit in the voice signal, wherein the basic classification unit is a syllable, a word or a phrase;
the voice category determining unit, which is used for determining augmented speech and non-augmented speech in the voice signal by using the acoustic features and a pre-constructed speech classification model;
the second computing unit, which is used for extracting fluency characteristics of the voice signal according to the augmented speech and non-augmented speech in the voice signal, wherein the fluency characteristics comprise any one or more of the following: the duration proportion of augmented speech, the number of occurrences of augmented speech, and the average speech rate;
the evaluation module is specifically used for taking the pronunciation accuracy characteristic and the fluency characteristic of the voice signal as comprehensive evaluation characteristics and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation characteristics; or
The evaluation module comprises:
the pronunciation accuracy evaluating unit is used for calculating a pronunciation accuracy score of the voice signal according to the pronunciation accuracy characteristics of the voice signal and the pronunciation accuracy evaluating model;
the fluency evaluating unit is used for calculating the fluency score of the voice signal according to the fluency characteristic of the voice signal and the fluency evaluating model;
and the first evaluation score calculating unit is used for calculating the evaluation score of the voice signal according to the pronunciation accuracy score and the fluency score.
Preferably, the fluency feature extraction module further comprises: the voice classification model building unit is used for building the voice classification model; the speech classification model construction unit includes:
the training data collecting subunit is used for collecting the augmented voice data and the non-augmented voice data of different types and taking the collected voice data as training data;
the acoustic feature extraction subunit is used for extracting acoustic features of the training data;
and the training subunit is used for training by using the acoustic features to obtain the voice classification model.
Preferably, the system further comprises:
a language expression ability feature extraction module, configured to extract language expressiveness characteristics of the voice signal according to the recognition text, wherein the language expressiveness characteristics comprise any one or more of the following:
the semantic coherence characteristic, which indicates whether each sentence or semantic segment in the recognition text is coherent;
the lexical characteristic, which includes any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognition text;
the literary-quality characteristic, which indicates whether each sentence or paragraph in the recognition text is elegantly expressed;
the evaluation module is specifically used for taking the pronunciation accuracy characteristic, the fluency characteristic and the language expression capability characteristic of the voice signal as comprehensive evaluation characteristics, and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation characteristics; or
The evaluation module comprises:
the pronunciation accuracy evaluating unit is used for calculating a pronunciation accuracy score of the voice signal according to the pronunciation accuracy characteristics of the voice signal and the pronunciation accuracy evaluating model;
the fluency evaluating unit is used for calculating the fluency score of the voice signal according to the fluency characteristic of the voice signal and the fluency evaluating model;
the speech signal evaluation unit is used for evaluating the speech signal according to the speech signal speech expression ability characteristic and the speech expression ability evaluation model;
and the second evaluation score calculating unit is used for calculating the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the language expression ability score.
Preferably, the language expression ability feature extraction module includes:
a recognition error determination unit for determining a misrecognized word in the recognized text;
a correction unit configured to perform correction processing on the recognition text, the correction processing including: removing the misrecognized word from the recognized text or correcting the misrecognized word in the recognized text;
and the extraction unit is used for extracting the language expression capability characteristics of the voice signal according to the corrected recognition text.
According to the spoken language evaluation method and system for the free-reading question type provided by the embodiments of the invention, and in view of the fact that the free-reading question type has no standard answer, voice recognition is first performed on the voice signal to be evaluated, and the signal is segmented based on boundaries constrained by the recognition text to obtain the voice segment corresponding to each basic voice unit. Pronunciation accuracy characteristics are then extracted according to the recognition text and the voice segments. Because these characteristics comprise the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words, the interference of misrecognized words is excluded, so that the final evaluation score accurately reflects the reader's true level.
Furthermore, augmented speech and non-augmented speech in the speech signal are determined according to the acoustic characteristics of the speech signal and a pre-constructed speech classification model, fluency characteristics of the speech signal are extracted according to the augmented speech and the non-augmented speech in the speech signal, the speech signal is comprehensively evaluated by integrating accuracy characteristics and fluency characteristics, and the spoken language level of a reader is more comprehensively reflected.
Furthermore, the language expression capability characteristics of the voice signal are extracted according to the recognition text, and then various different characteristics are integrated to perform spoken language evaluation on the reader, so that the evaluation result is more comprehensive.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description are obviously only some embodiments of the invention; other drawings can be obtained by those skilled in the art from these drawings.
FIG. 1 is a flow chart of a spoken language evaluation method of a freely-read question type according to an embodiment of the present invention;
FIG. 2 is a flow chart of extracting pronunciation accuracy features of a speech signal according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating fluency evaluation of spoken language of the free-reading topic type in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a spoken language evaluation system of the freely readable topic type according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an exemplary structure of a pronunciation accuracy feature extraction module according to an embodiment of the present invention;
FIG. 6 is another schematic structural diagram of a spoken language evaluation system of the freely readable topic type according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the fluency feature extraction module;
fig. 8 is another schematic structural diagram of the spoken language evaluation system of the freely readable topic type according to the embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and specific implementations.
Aiming at the limitation that the spoken language evaluation method in the prior art only aims at the specified reading text with standard answers, the embodiment of the invention provides the spoken language evaluation method and the spoken language evaluation system for the free reading question type.
Fig. 1 is a flowchart of a spoken language evaluation method of a freely-read question type according to an embodiment of the present invention, which includes the following steps:
step 101, receiving a speech signal to be evaluated.
The voice signal is produced by the user reading aloud a text chosen at will, and can be obtained by recording, for example.
Step 102, performing voice recognition on the voice signal, and segmenting the signal based on boundaries constrained by the recognition text to obtain the voice segment corresponding to each basic voice unit in the voice signal.
The basic speech unit may be a syllable, a phoneme, etc.
The speech recognition may be implemented with existing technology. Different speech recognition systems may decode the voice signal based on different acoustic features, such as an acoustic model based on MFCC (Mel-Frequency Cepstral Coefficient) features or one based on PLP (Perceptual Linear Prediction) features; with different acoustic models, such as an HMM-GMM (Hidden Markov Model with Gaussian Mixture Model observations) or a DBN (Dynamic Bayesian Network) neural-network acoustic model; or even with different decoding methods, such as Viterbi search or A* search. In this way, the basic voice units of the voice signal and the corresponding sequence of voice segments are obtained.
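As a rough illustration of the acoustic front end mentioned here, the following NumPy sketch computes log-mel filterbank features, the basis on which MFCCs are built (pre-emphasis and the final DCT step are omitted for brevity). All parameter values are typical defaults, not values taken from the patent:

```python
import numpy as np

def logmel_features(signal, sr=16000, frame_len=400, hop=160, n_mels=26):
    """Frame the signal, window it, take the power spectrum, and
    apply triangular mel filters; return log filterbank energies."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum

    # mel-spaced filter edges mapped back to FFT bin indices
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor(frame_len * edges_hz / sr).astype(int)

    fb = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return np.log(spec @ fb.T + 1e-10)

# one second of a 440 Hz tone as a stand-in for real speech
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = logmel_features(tone)
```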
Step 103, extracting pronunciation accuracy characteristics of the voice signal according to the recognition text and the voice segments, and taking the pronunciation accuracy characteristics as evaluation characteristics of the voice signal; the pronunciation accuracy characteristics comprise any one or more of the following: the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words.
In the prior art, for pronunciation accuracy evaluation on specified reading content, information such as each voice segment's similarity to the pronunciation acoustic model of its basic voice unit and the proportion of omitted words can reflect the reader's pronunciation accuracy. For the free-reading question type, however, there is no standard answer and the recognition text may contain recognition errors. The embodiments of the invention therefore exclude the interference of misrecognized words and characterize pronunciation accuracy by the posterior probability of non-misrecognized words, the proportion of misread words, and the proportion of correctly read words, or any combination of the three. The misread and correctly read proportions are computed over the words remaining after the misrecognized words are removed.
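Concretely, the three features can be computed as in this sketch once each word has been labeled; the word labels and posterior values are made-up examples:

```python
def pronunciation_accuracy_features(labeled_words, word_posterior):
    """labeled_words: list of (word, category) with category in
    {'misrecognized', 'misread', 'correct'}. Misrecognized words are
    excluded; both ratios are over the remaining words."""
    kept = [w for w, cat in labeled_words if cat != "misrecognized"]
    n_misread = sum(1 for _, cat in labeled_words if cat == "misread")
    n_correct = sum(1 for _, cat in labeled_words if cat == "correct")
    posts = [word_posterior[w] for w in kept]
    return {
        "mean_posterior": sum(posts) / len(posts),
        "misread_ratio": n_misread / len(kept),
        "correct_ratio": n_correct / len(kept),
    }

labeled = [("a", "correct"), ("b", "misread"),
           ("c", "misrecognized"), ("d", "correct")]
posterior = {"a": 0.9, "b": 0.5, "d": 0.8}
feats = pronunciation_accuracy_features(labeled, posterior)
```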
Specifically, a classification model and corresponding classification features may be used to determine the category of each word, and this can be done in various ways. For example, words may be divided into three categories, misrecognized, misread, and correctly read, and the classification features of each word are input into the corresponding classification model in turn to obtain the word's category. Alternatively, a binary classification model may first be used to separate misrecognized from non-misrecognized words in the recognition text, and the non-misrecognized words are then judged as misread or correctly read; the embodiments of the invention place no restriction on this.
The following describes the process of extracting pronunciation accuracy features by taking the first mode as an example. As shown in fig. 2, a specific process for extracting pronunciation accuracy characteristics of a speech signal according to an embodiment of the present invention includes the following steps:
step 201, obtaining the classification characteristics of each word in the recognition text according to the recognition text and each voice segment, where the classification characteristics include any one or more of the following: acoustic features, language model features, grammatical features.
The acoustic features of the words refer to the mean value of the posterior probabilities of all basic speech units contained in the words obtained in the speech recognition and speech segment boundary division processes. Specifically, the posterior probabilities of all the basic speech units included in each word in the recognition text may be first obtained according to each speech fragment; then, for each word, calculating the mean value of the posterior probabilities of all the basic phonetic units contained in the word, and taking the mean value as the acoustic feature of the word.
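As a concrete illustration of the mean-posterior computation just described, the following minimal sketch assumes the per-unit posterior probabilities for a word have already been obtained from recognition and segment-boundary division; the data layout is an illustrative assumption.

```python
# Hypothetical sketch: the acoustic feature of a word is the mean of the
# posterior probabilities of the basic speech units it contains.

def word_acoustic_feature(unit_posteriors):
    """Mean posterior probability over the word's basic speech units."""
    if not unit_posteriors:
        return 0.0
    return sum(unit_posteriors) / len(unit_posteriors)

# e.g. a word aligned to three basic-unit segments with these posteriors
print(word_acoustic_feature([0.9, 0.8, 0.7]))  # 0.8 (up to float rounding)
```

The language model feature of a word (next paragraph) would be computed the same way over language model scores instead of posteriors.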
Similarly, the language model feature of the word refers to the average of the language model scores of all basic phonetic units contained in the word. Specifically, the language model scores of all basic speech units contained in each word in the recognition text may be first obtained according to each speech fragment; for each word, calculating the average value of the language model scores of all basic phonetic units contained in the word, and taking the average value as the language model characteristic of the word.
The grammatical features of the words are obtained by performing grammar error detection on the recognized text according to grammar rules; common grammar errors include subject-verb disagreement, verb tense or morphology errors, noun singular/plural errors, adjective or adverb misuse, and the like. Specifically, grammar error detection can be performed on the recognized text according to the grammar rules to obtain an error detection result, and then the grammatical feature of each word in the recognized text is determined according to that result. For example, different grammar errors may be numbered: with subject-verb disagreement as the grammar error numbered 1, if this error occurs in a read sentence, the subject and predicate words in the sentence are marked 1, that is, the grammatical features of the subject and predicate words are 1. Of course, the grammatical feature of a word may also be multidimensional.
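The error-numbering scheme described above (e.g., subject-verb disagreement numbered 1) might be realized as in this sketch; the error catalogue and the (error type, word indices) detection-result format are illustrative assumptions, not part of the patent.

```python
# Illustrative: turn a rule-based grammar error detection result into
# per-word grammatical features, where 0 means "no grammar error".
GRAMMAR_ERROR_IDS = {
    "subject_verb_disagreement": 1,
    "verb_tense_error": 2,
    "noun_number_error": 3,
    "adj_adv_misuse": 4,
}

def grammar_features(words, detected_errors):
    """detected_errors: list of (error_type, word_indices) pairs."""
    feats = [0] * len(words)
    for err_type, indices in detected_errors:
        err_id = GRAMMAR_ERROR_IDS[err_type]
        for i in indices:
            feats[i] = err_id  # mark each word involved in the error
    return feats

words = ["he", "go", "home"]
errs = [("subject_verb_disagreement", [0, 1])]
print(grammar_features(words, errs))  # [1, 1, 0]
```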
Step 202, determining the category of each word based on the classification features and a pre-trained word classification model, wherein the category comprises: misrecognition, misreading and correct reading.
Specifically, the acoustic features, language model features and grammatical features of each word are input in turn into the classification model (such as an SVM), which outputs likelihood scores of the word for the three categories of misrecognized, misread and correctly read, or directly outputs the category of the word.
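The three-way word classification can be sketched as follows. The patent suggests a classifier such as an SVM; to keep the example dependency-free, this sketch substitutes a simple nearest-centroid classifier over the same [acoustic, language model, grammar] feature layout, with synthetic training data.

```python
# Minimal stand-in for the word classifier: nearest-centroid over
# [acoustic feature, language model feature, grammar feature] vectors.
def nearest_centroid(train, x):
    """train: {label: list of feature vectors}. Returns the closest label."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    centroids = {
        label: [sum(col) / len(col) for col in zip(*vecs)]
        for label, vecs in train.items()
    }
    return min(centroids, key=lambda lbl: dist2(centroids[lbl], x))

# synthetic training vectors, purely for illustration
train = {
    "misrecognized": [[0.2, 0.1, 1.0], [0.3, 0.2, 1.0]],
    "misread":       [[0.5, 0.5, 0.0], [0.6, 0.5, 0.0]],
    "correct":       [[0.9, 0.8, 0.0], [0.85, 0.9, 0.0]],
}
print(nearest_centroid(train, [0.88, 0.85, 0.0]))  # correct
```

A real system would replace the centroid rule with a trained discriminative model, as the text states.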
The classification model can be obtained by collecting a large number of speech signals for training, and the specific training mode is the same as that of the prior art and is not described herein again.
Step 203, acquiring any one or more of the following characteristics in the recognition text: the posterior probability of the non-misrecognized word, the ratio of the misread word and the ratio of the correct reading word.
After the category of each word in the recognition text is determined, the posterior probability, the misreading word proportion and the correct reading word proportion of the non-misrecognized word in the recognition text can be obtained through calculation.
It should be noted that the extraction of the pronunciation accuracy feature may be in units of sentences, paragraphs, chapters, and the like, and accordingly, when performing evaluation score calculation on the speech signal in the following, may also be in units of sentences, paragraphs, or chapters, which is not limited in the embodiment of the present invention.
For example, taking the sentence as a unit, the "posterior probability of non-misrecognized words" in the feature may be the average of the posterior probabilities of all non-misrecognized words in the sentence, the proportion of words whose posterior probability is below a certain threshold, the proportion of words whose posterior probability is above a certain threshold, or the like. The posterior probability of each non-misrecognized word is the mean of the posterior probabilities of all the basic speech units contained in that word.
Taking all three of the above statistics as an example, together with the misread word proportion and the correctly-read word proportion, the pronunciation accuracy feature is a 5-dimensional feature vector.
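The 5-dimensional sentence-level feature vector just described can be assembled as in the following sketch; the threshold values and the (category, mean posterior) per-word data layout are illustrative assumptions.

```python
# Sketch: sentence-level pronunciation accuracy features after per-word
# classification. Misrecognized words are excluded before any statistic.
def accuracy_features(words, low=0.4, high=0.8):
    """words: list of (category, mean_posterior); category is one of
    'misrecognized', 'misread', 'correct'."""
    kept = [(c, p) for c, p in words if c != "misrecognized"]
    if not kept:
        return [0.0] * 5
    posts = [p for _, p in kept]
    n = len(kept)
    return [
        sum(posts) / n,                                # mean posterior
        sum(1 for p in posts if p < low) / n,          # low-posterior ratio
        sum(1 for p in posts if p > high) / n,         # high-posterior ratio
        sum(1 for c, _ in kept if c == "misread") / n, # misread word ratio
        sum(1 for c, _ in kept if c == "correct") / n, # correct word ratio
    ]

sent = [("correct", 0.9), ("misread", 0.3),
        ("misrecognized", 0.2), ("correct", 0.85)]
print(accuracy_features(sent))
```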
And 104, calculating the evaluation score of the voice signal according to the evaluation characteristics of the voice signal and a pre-constructed evaluation model.
As mentioned above, the evaluation score may be calculated in units of sentences, paragraphs or chapters; if sentences or paragraphs are used, the evaluation scores of all sentences or paragraphs in the chapter to be evaluated may be averaged to serve as the evaluation score of that chapter.
It should be noted that, when pronunciation accuracy evaluation alone is performed on the speech signal of the free-reading question type, the evaluation model is a pronunciation accuracy evaluation model for the pronunciation accuracy dimension. When multidimensional comprehensive evaluation is performed on the speech signal, an evaluation model covering the corresponding multiple dimensions is adopted to obtain a comprehensive score.
In the spoken language evaluation method of the free-reading question type provided by the embodiment of the present invention, in view of the fact that the free-reading question type has no standard answer, speech recognition is first performed on the speech signal to be evaluated, and the speech segments corresponding to the basic speech units in the speech signal are obtained by segmentation based on boundaries defined by the recognized text. Pronunciation accuracy features of the speech signal are then extracted according to the recognized text and the speech segments, the features including any one or more of the following: the posterior probability of non-misrecognized words, the misread word proportion and the correctly-read word proportion. The interference of misrecognized words is thereby excluded, so that the final evaluation score can accurately reflect the reader's true level.
When spoken language is evaluated, another important index is fluency, that is, the fluency of utterance expression. For speech signals of the free-reading question type, the present invention also provides a corresponding fluency evaluation method.
As shown in fig. 3, which is a flowchart for fluency evaluation of spoken language of the free-reading question type in the embodiment of the present invention, the method includes the following steps:
Step 301, extracting acoustic features of each speech-classification basic unit in the speech signal.
The speech-classification basic unit may be a syllable, a word, a phrase, etc.; it is either the same as the basic speech unit used in speech recognition or composed of such units. The acoustic features may be one or more of MFCC features, PLP features, LPC features, etc., and may be extracted using prior-art methods.
Step 302, determining augmented speech and non-augmented speech in the speech signal by using the acoustic features and a pre-constructed speech classification model.
The augmented speech refers to filler sounds the user habitually utters (e.g., "en", "e", "a"), redundant conjunctions (e.g., "and", "so"), noise (e.g., a cough, a door opening, a table knock), and the like.
The speech classification model is constructed as follows: collect a plurality of different types of augmented speech data and normal (i.e., non-augmented) speech data as training data; extract acoustic features of the training data; train the model with these features, its input being the acoustic features and its output being likelihood scores for the augmented and non-augmented classes, or directly whether the input speech is augmented or non-augmented. The speech classification model may be trained with existing training methods (e.g., MCE).
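The construction steps above (collect labelled data, extract acoustic features, train) can be sketched as follows; the clip format and the stand-in feature extractor are assumptions, since a real system would compute MFCC/PLP/LPC features and train an actual classifier.

```python
# Illustrative assembly of the training set for the augmented /
# non-augmented speech classifier.
def build_training_set(clips, extract_features):
    """clips: list of (waveform, label); label in {'augmented', 'normal'}."""
    X, y = [], []
    for waveform, label in clips:
        X.append(extract_features(waveform))  # acoustic feature vector
        y.append(label)
    return X, y

# stand-in feature extractor (real systems would compute MFCC/PLP/LPC)
fake_feature = lambda wav: [sum(wav) / len(wav)]
clips = [([0.1, 0.2], "augmented"), ([0.5, 0.7], "normal")]
X, y = build_training_set(clips, fake_feature)
print(X, y)
```

The resulting (X, y) pairs would then be passed to whatever training procedure (e.g., MCE) the system uses.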
Step 303, extracting fluency features of the speech signal according to the augmented speech and the non-augmented speech in the speech signal, where the fluency features include any one or more of the following: the augmented-speech duration proportion, the number of occurrences of augmented speech, and the average speech rate. The augmented-speech duration proportion may be the ratio of the augmented-speech duration to the total speech duration, or to the non-augmented-speech duration. The average speech rate may be calculated from the duration of the speech signal and the number of valid words in the corresponding recognized text, where the number of valid words refers to the number of characters in the recognized text.
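Assuming the signal has already been segmented into augmented and non-augmented portions, the three fluency features can be computed as in this illustrative sketch; the (label, duration) segment format and the words-per-second rate convention are assumptions.

```python
# Sketch: fluency features from labelled speech segments.
def fluency_features(segments, n_valid_words):
    """segments: list of (label, duration_seconds), label in
    {'augmented', 'normal'}."""
    aug = [d for lbl, d in segments if lbl == "augmented"]
    total = sum(d for _, d in segments)
    return {
        "aug_duration_ratio": sum(aug) / total if total else 0.0,
        "aug_count": len(aug),
        "avg_speech_rate": n_valid_words / total if total else 0.0,
    }

segs = [("normal", 4.0), ("augmented", 0.5),
        ("normal", 3.0), ("augmented", 0.5)]
print(fluency_features(segs, n_valid_words=24))
```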
And step 304, calculating the fluency score of the voice signal according to the extracted fluency characteristics and a pre-constructed fluency evaluation model.
It should be noted that, in practical application, for spoken language evaluation of the free-reading question type, pronunciation accuracy evaluation and fluency evaluation can each be performed on the speech signal to be evaluated with the corresponding evaluation model to obtain a pronunciation accuracy score and a fluency score; if desired, a composite score of the speech signal may also be calculated (e.g., as a weighted sum) from the pronunciation accuracy score and the fluency score. Alternatively, the pronunciation accuracy feature and the fluency feature of the speech signal can be used together as a comprehensive evaluation feature, and the evaluation score of the speech signal calculated with the corresponding comprehensive evaluation model. Evaluating the speech signal by combining the accuracy and fluency features reflects the reader's spoken language level more comprehensively.
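A weighted combination of the dimension scores, one of the options mentioned above, might look like this minimal sketch; the weight values are purely illustrative.

```python
# Illustrative weighted composite of per-dimension evaluation scores.
def composite_score(scores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to 1
    return sum(scores[k] * weights[k] for k in scores)

scores = {"accuracy": 85.0, "fluency": 75.0}
weights = {"accuracy": 0.6, "fluency": 0.4}
print(composite_score(scores, weights))  # 81.0, up to float rounding
```

The same function extends to a third "expression ability" dimension by adding an entry to both dictionaries.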
In addition, in view of the characteristics of the free-reading question type, the spoken language evaluation method of the embodiment of the present invention can further extract language expression ability features of the speech signal from the recognized text, so as to evaluate the reader's language expression ability, or combine a plurality of different features so that the evaluation result is more comprehensive.
The language expression ability characteristics comprise any one or more of the following characteristics:
(1) Semantic continuity feature, which indicates whether each sentence or paragraph in the recognized text is semantically continuous.
Specifically, a large number of texts with good semantic continuity (for example, textbooks and articles from high-quality supplementary teaching materials) and texts with poor semantic continuity (for example, obtained by randomly shuffling and combining several texts) are collected as training corpora, and a classifier of semantic continuity is trained. The input of the model is the word vectors of the previous sentence and the current sentence (or the previous paragraph and the current paragraph), and the output is one of two classes: semantically continuous or discontinuous.
(2) Lexical features, including any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognized text.
The lexical features can be obtained by a statistical method against a preset idiom library and an advanced-vocabulary library.
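The statistical counting against preset lexicons might look like the following sketch; the tiny idiom and advanced-word sets stand in for the preconfigured libraries, and the naive substring/token matching is an illustrative simplification.

```python
# Illustrative lexical-feature counter using stand-in lexicons.
IDIOMS = {"piece of cake", "break the ice"}
ADVANCED_WORDS = {"meticulous", "ubiquitous", "paradigm"}

def lexical_features(text):
    lowered = text.lower()
    words = lowered.split()
    return {
        "distinct_words": len(set(words)),
        "idioms": sum(1 for idm in IDIOMS if idm in lowered),
        "advanced_words": sum(1 for w in words if w in ADVANCED_WORDS),
    }

print(lexical_features(
    "Learning English is a piece of cake with meticulous practice"))
```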
(3) Literary grace feature, which indicates whether each sentence or paragraph in the recognized text is elegantly expressed.
Specifically, deep learning may be adopted to classify sentences as graceful or non-graceful. The input of the classification model is a sentence vector, and the output is one of the two classes. The classification model may be an RNN (Recurrent Neural Network); the specific training method is the same as in the prior art and is not described here again.
It should be noted that, in practical applications, the linguistic expression capability feature of the speech signal can be directly extracted from the recognition text; or processing the misrecognized words in the recognized text, and then extracting corresponding features, namely, the method comprises the following steps:
determining misrecognized words in the recognized text;
and performing correction processing on the recognition text, wherein the correction processing comprises the following steps: removing the misrecognized word from the recognized text or correcting the misrecognized word in the recognized text;
and extracting the language expression capability characteristics of the voice signal according to the corrected recognition text.
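The removal variant of the correction step above can be sketched minimally as follows; the token-index representation of the misrecognized words is an assumption.

```python
# Illustrative: drop misrecognized words from the recognized text before
# extracting language expression ability features.
def remove_misrecognized(tokens, misrecognized_indices):
    drop = set(misrecognized_indices)
    return [w for i, w in enumerate(tokens) if i not in drop]

tokens = ["the", "wether", "is", "nice"]
print(remove_misrecognized(tokens, [1]))  # ['the', 'is', 'nice']
```

The correction variant would instead replace each flagged token, e.g. by manual correction as the system embodiment mentions.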
When the language expression ability evaluation is carried out, the language expression ability score of the voice signal can be calculated according to the extracted language expression ability characteristics and a pre-constructed language expression ability evaluation model.
Similarly, in practical application, for spoken language evaluation of the free-reading question type, pronunciation accuracy evaluation, fluency evaluation and language expression ability evaluation can each be performed on the speech signal to be evaluated with the corresponding evaluation model to obtain a pronunciation accuracy score, a fluency score and a language expression ability score; if desired, a composite score of the speech signal may also be calculated (e.g., as a weighted sum) from the pronunciation accuracy score, the fluency score and the expression ability score. Alternatively, the pronunciation accuracy feature, the fluency feature and the language expression ability feature of the speech signal can be used together as a comprehensive evaluation feature, and the evaluation score of the speech signal calculated with the corresponding comprehensive evaluation model. Evaluating the speech signal by combining the accuracy, fluency and language expression ability features reflects the reader's spoken language level more comprehensively.
Correspondingly, the embodiment of the invention also provides a spoken language evaluation system of a free-reading question type, which is a structural schematic diagram of the system as shown in fig. 4.
In this embodiment, the system includes:
a receiving module 401, configured to receive a speech signal to be evaluated;
a voice recognition module 402, configured to perform voice recognition on the voice signal and segment it based on boundaries defined by the recognized text to obtain the voice segment corresponding to each basic voice unit in the voice signal;
a pronunciation accuracy feature extraction module 403, configured to extract a pronunciation accuracy feature of the speech signal according to the recognition text and each speech segment, and use the pronunciation accuracy feature as an evaluation feature of the speech signal; the pronunciation accuracy characteristics include any one or more of the following: the posterior probability of the non-misrecognized word, the ratio of the misread word and the ratio of the correct reading word;
and the evaluating module 404 is configured to calculate an evaluation score of the speech signal according to the evaluation feature of the speech signal and a pre-constructed evaluating model.
The speech recognition module 402 can decode the speech signal by using the prior art to obtain a recognition text, each basic speech unit in the speech signal, and a corresponding speech fragment sequence.
In the embodiment of the present invention, for the characteristics that the free-reading question type has no standard answer and the recognition text may have recognition errors, the pronunciation accuracy feature extraction module 403 may determine the category of each word by using the classification model, so as to eliminate the interference of the misrecognized word, and characterize the pronunciation accuracy feature by one or more of the posterior probability of the non-misrecognized word, the ratio of the misrecognized word, and the ratio of the correct reading word.
Fig. 5 is a schematic diagram of a specific structure of the pronunciation accuracy feature extraction module according to an embodiment of the present invention.
In this embodiment, the pronunciation accuracy feature extraction module includes:
a classification feature obtaining unit 51, configured to obtain, according to the recognition text and each voice segment, a classification feature of each word in the recognition text, where the classification feature includes any one or more of the following: acoustic features, language model features, and grammatical features;
a word class determining unit 52, configured to determine a class of each word based on the classification features and a pre-trained word classification model, where the class includes: misrecognition, misreading and correct reading;
a first calculating unit 53, configured to obtain any one or more of the following features in the recognition text: the posterior probability of the non-misrecognized word, the ratio of the misread word and the ratio of the correct reading word.
The meaning of each classification feature is described in detail in the foregoing, and the obtaining process is completed by a corresponding subunit in the classification feature obtaining unit, which is as follows:
the acoustic feature acquisition subunit is used for acquiring the posterior probabilities of all basic voice units contained in each word in the recognition text according to the voice fragments; then, for each word, calculating the mean value of the posterior probabilities of all basic phonetic units contained in the word, and taking the mean value as the acoustic feature of the word;
the language model characteristic obtaining subunit is used for obtaining the language model scores of all basic voice units contained in each word in the recognition text according to the voice fragments; then, for each word, calculating the average value of the language model scores of all basic voice units contained in the word, and taking the average value as the language model characteristic of the word;
the grammar characteristic obtaining subunit is used for carrying out grammar error detection on the recognition text according to grammar rules to obtain an error detection result; and then determining the grammatical features of each word in the recognition text according to the error detection result.
The pronunciation accuracy feature extraction module 403 is not limited to the above-described structure and may have other specific structures. For example, in another embodiment it may comprise a functional unit for determining the misrecognized and non-misrecognized words in the recognized text, and a functional unit for determining whether each non-misrecognized word is misread or correctly read.
It should be noted that the extraction of the pronunciation accuracy feature may be in units of sentences, paragraphs, chapters, and the like, and accordingly, when performing evaluation score calculation on the speech signal in the following, may also be in units of sentences, paragraphs, or chapters, which is not limited in the embodiment of the present invention.
In addition, it should be noted that, when the pronunciation accuracy of the speech signal of the free-reading question type is evaluated separately, the evaluation model may be a pronunciation accuracy evaluation model for a pronunciation accuracy evaluation dimension. When carrying out multidimensional comprehensive evaluation on the voice signal, a comprehensive score is obtained by adopting an evaluation model aiming at corresponding multiple dimensions.
In the spoken language evaluation system of the free-reading question type provided by the embodiment of the present invention, in view of the fact that the free-reading question type has no standard answer, speech recognition is first performed on the speech signal to be evaluated, and the speech segments corresponding to the basic speech units in the speech signal are obtained by segmentation based on boundaries defined by the recognized text. Pronunciation accuracy features of the speech signal are then extracted according to the recognized text and the speech segments, the features including any one or more of the following: the posterior probability of non-misrecognized words, the misread word proportion and the correctly-read word proportion. The interference of misrecognized words is thereby excluded, so that the final evaluation score can accurately reflect the reader's true level.
Fig. 6 is another schematic structural diagram of the spoken language evaluation system of the freely readable topic type according to the embodiment of the present invention.
Unlike the embodiment shown in fig. 4, in this embodiment, the system further includes:
a fluency feature extraction module 405, configured to extract fluency features of the speech signal, where the fluency features include any one or more of the following: the augmented-speech duration proportion, the number of occurrences of augmented speech, and the average speech rate.
Fig. 7 shows a specific structure of the fluency feature extraction module, which includes:
an acoustic feature extraction unit 71, configured to extract acoustic features of each speech-classification basic unit in the speech signal, where the basic unit is a syllable, a word or a phrase;
a speech type determining unit 72, configured to determine augmented speech and non-augmented speech in the speech signal by using the acoustic features and a pre-constructed speech classification model;
a second calculating unit 73, configured to extract fluency features of the speech signal according to the augmented speech and non-augmented speech in the speech signal, where the fluency features include any one or more of the following: the augmented-speech duration proportion, the number of occurrences of augmented speech, and the average speech rate.
In the embodiment shown in fig. 6, the evaluating module 404 may use the pronunciation accuracy feature and the fluency feature of the speech signal together as a comprehensive evaluation feature and calculate the evaluation score of the speech signal with the evaluation model corresponding to that comprehensive feature; or it may calculate the pronunciation accuracy score and the fluency score separately with the corresponding evaluation models and then calculate the evaluation score of the speech signal from the two scores, in which case its specific structure comprises the following units:
the pronunciation accuracy evaluating unit is used for calculating a pronunciation accuracy score of the voice signal according to the pronunciation accuracy characteristics of the voice signal and the pronunciation accuracy evaluating model;
the fluency evaluating unit is used for calculating the fluency score of the voice signal according to the fluency characteristic of the voice signal and the fluency evaluating model;
and the first evaluation score calculating unit is used for calculating the evaluation score of the voice signal according to the pronunciation accuracy score and the fluency score.
It should be noted that the speech classification model may be constructed by a corresponding speech classification model construction unit, and the unit may be independent of the fluency feature extraction module, even independent of the system of the embodiment of the present invention, or may be integrated with the fluency feature extraction module, and the embodiment of the present invention is not limited thereto.
One specific structure of the speech classification model building unit may include the following subunits:
the training data collecting subunit is used for collecting the augmented voice data and the non-augmented voice data of different types and taking the collected voice data as training data;
the acoustic feature extraction subunit is used for extracting acoustic features of the training data;
and the training subunit is used for training by using the acoustic features to obtain the voice classification model.
Fig. 8 is another schematic structural diagram of the spoken language evaluation system of the freely readable topic type according to the embodiment of the present invention.
Unlike the embodiment shown in fig. 6, in this embodiment, the system further includes:
a language expressibility feature extraction module 406, configured to extract a language expressibility feature of the speech signal according to the recognition text, where the language expressibility feature includes any one or more of the following features:
a semantic continuity feature, which indicates whether each sentence or paragraph in the recognized text is semantically continuous;
lexical features, including any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognized text;
and a literary grace feature, which indicates whether each sentence or paragraph in the recognized text is elegantly expressed.
Accordingly, in the embodiment shown in fig. 8, the evaluation module 404 may use the pronunciation accuracy feature, the fluency feature, and the language expression ability feature of the speech signal as the comprehensive evaluation feature, and calculate the evaluation score of the speech signal by using the evaluation model corresponding to the comprehensive evaluation feature; or respectively calculating pronunciation accuracy score, fluency score and linguistic expression ability score by using corresponding evaluation models, and then calculating the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the linguistic expression ability score, wherein the corresponding specific structure comprises the following units:
the pronunciation accuracy evaluating unit is used for calculating a pronunciation accuracy score of the voice signal according to the pronunciation accuracy characteristics of the voice signal and the pronunciation accuracy evaluating model;
the fluency evaluating unit is used for calculating the fluency score of the voice signal according to the fluency characteristic of the voice signal and the fluency evaluating model;
the language expression ability evaluating unit is used for calculating the language expression ability score of the voice signal according to the language expression ability feature of the voice signal and the language expression ability evaluation model;
and the second evaluation score calculating unit is used for calculating the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the language expression ability score.
It should be noted that, in practical applications, the speech expressiveness feature extraction module 406 may directly extract the speech expressiveness features of the speech signal from the recognition text; or processing the misrecognized words in the recognized text, and then extracting the corresponding features, accordingly, an embodiment of the module includes the following units:
a recognition error determination unit for determining a misrecognized word in the recognized text;
a correction unit configured to perform correction processing on the recognition text, the correction processing including: removing the misrecognized word from the recognized text or correcting the misrecognized word in the recognized text; such as by manual misrecognition correction;
and the extraction unit is used for extracting the language expression capability characteristics of the voice signal according to the corrected recognition text.
It should be noted that the various classification models and evaluation models mentioned in the foregoing embodiments may be constructed offline, and the corresponding model construction modules or units may be independent physical entities or may be integrated in the system of the present invention; the embodiment of the present invention is not limited in this respect. In addition, in practical application, the corresponding modules and units can be selected according to evaluation requirements to realize independent or comprehensive evaluation of pronunciation accuracy, fluency and language expression ability. Because the characteristics of the free-reading question type are fully considered, an accurate evaluation score can be obtained.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention is described herein using specific embodiments, but the above embodiments are only used to help understanding the method and system of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (16)

1. A spoken language evaluation method for a free-reading question type, characterized by comprising:
receiving a voice signal to be evaluated;
performing voice recognition on the voice signal, and segmenting the voice signal using the recognition text as a constrained boundary to obtain the voice segments corresponding to the basic voice units in the voice signal;
extracting pronunciation accuracy features of the voice signal according to the recognition text and each voice segment, and taking the pronunciation accuracy features as evaluation features of the voice signal, wherein the extraction comprises classifying each word in the recognition text and obtaining the pronunciation accuracy features based on the category of each word; the pronunciation accuracy features comprise any one or more of the following: the posterior probability of non-misrecognized words, the ratio of misread words and the ratio of correctly read words;
and calculating an evaluation score of the voice signal according to the evaluation features of the voice signal and a pre-constructed evaluation model.
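As an illustrative sketch of the final step of claim 1, the evaluation features can be combined into a score by a pre-constructed model. The linear form, the weights and the feature order below are assumptions for illustration only; the claim does not fix the model family (a regression tree or neural network trained on human-scored speech would fit equally well).

```python
def evaluation_score(features, weights, bias=0.0):
    """Combine evaluation features into an evaluation score.

    A weighted linear combination stands in for the 'pre-constructed
    evaluation model' of claim 1; weights would in practice be fitted
    on human-scored training speech.
    """
    return bias + sum(w * f for w, f in zip(weights, features))


# Hypothetical feature vector: [posterior of non-misrecognized words,
# misread-word ratio, correctly-read-word ratio]; weights are illustrative.
score = evaluation_score([0.8, 0.1, 0.7], weights=[60.0, -30.0, 40.0], bias=10.0)
print(score)  # 83.0
```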
2. The method according to claim 1, wherein the extracting of pronunciation accuracy features of the voice signal according to the recognition text and each voice segment specifically comprises:
obtaining classification features of each word in the recognition text according to the recognition text and each voice segment, wherein the classification features comprise any one or more of the following: acoustic features, language model features and grammatical features;
determining the category of each word based on the classification features and a pre-trained word classification model, wherein the categories comprise: misrecognized, misread and correctly read;
obtaining any one or more of the following features of the recognition text: the posterior probability of non-misrecognized words, the ratio of misread words and the ratio of correctly read words.
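A minimal sketch of the feature computation in claim 2, assuming each word has already been categorised by the word classification model. The dict layout (`category`, `posterior` fields) is an assumption for illustration, not the patent's data format.

```python
def accuracy_features(words):
    """Pronunciation accuracy features from categorised words (claim 2).

    `words`: list of dicts with 'category' in
    {'misrecognized', 'misread', 'correct'} and a per-word 'posterior'
    probability (field names are illustrative assumptions).
    """
    n = len(words)
    non_misrec = [w for w in words if w["category"] != "misrecognized"]
    return {
        # mean posterior over words that were not misrecognized
        "posterior": sum(w["posterior"] for w in non_misrec) / max(len(non_misrec), 1),
        "misread_ratio": sum(w["category"] == "misread" for w in words) / n,
        "correct_ratio": sum(w["category"] == "correct" for w in words) / n,
    }
```

Any subset of these three values can serve as the pronunciation accuracy features, matching the "any one or more" wording of the claim.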
3. The method according to claim 2, wherein obtaining the acoustic features of the words in the recognition text according to the recognition text and the voice segments comprises:
obtaining, according to the voice segments, the posterior probabilities of all basic voice units contained in each word of the recognition text;
for each word, calculating the mean of the posterior probabilities of all basic voice units contained in the word, and taking the mean as the acoustic feature of the word.
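The per-word averaging in claim 3 reduces to a mean over the word's basic voice units; a sketch, assuming the per-unit posteriors have already been produced by the acoustic model over the aligned voice segments:

```python
def word_acoustic_feature(unit_posteriors):
    """Mean posterior probability over a word's basic voice units
    (e.g. its phones), taken as the word's acoustic feature (claim 3)."""
    return sum(unit_posteriors) / len(unit_posteriors)


# A word aligned to three phones with posteriors 0.9, 0.7, 0.8:
feature = word_acoustic_feature([0.9, 0.7, 0.8])  # mean = 0.8
```

The language model feature of claim 4 is the same mean, computed over per-unit language model scores instead of posteriors.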
4. The method according to claim 2, wherein obtaining the language model features of the words in the recognition text according to the recognition text and the voice segments comprises:
obtaining, according to the voice segments, the language model scores of all basic voice units contained in each word of the recognition text;
for each word, calculating the mean of the language model scores of all basic voice units contained in the word, and taking the mean as the language model feature of the word.
5. The method according to claim 2, wherein obtaining the grammatical features of each word in the recognition text according to the recognition text and the voice segments comprises:
performing grammar error detection on the recognition text according to grammar rules to obtain an error detection result;
and determining the grammatical feature of each word in the recognition text according to the error detection result.
6. The method according to any one of claims 1 to 5, wherein the evaluation features further comprise fluency features; the method further comprises:
extracting acoustic features of the basic voice units for voice classification in the voice signal, wherein a basic voice unit for voice classification is a syllable, a word or a phrase;
determining the augmented speech and the non-augmented speech in the voice signal by using the acoustic features and a pre-constructed voice classification model;
extracting fluency features of the voice signal according to the augmented speech and the non-augmented speech in the voice signal, wherein the fluency features comprise any one or more of the following: the duration ratio of augmented speech, the number of occurrences of augmented speech and the average speech rate;
the calculating of the evaluation score of the voice signal according to the evaluation features of the voice signal and the pre-constructed evaluation model comprises:
calculating a pronunciation accuracy score and a fluency score of the voice signal according to the extracted pronunciation accuracy features and fluency features of the voice signal and the evaluation model corresponding to each kind of feature, and then calculating the evaluation score of the voice signal according to the pronunciation accuracy score and the fluency score; or
taking the pronunciation accuracy features and the fluency features of the voice signal as comprehensive evaluation features, and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation features.
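The three fluency features of claim 6 can be sketched as follows, assuming the voice classification model has already labelled each segment as augmented or non-augmented speech. The `(label, duration)` pair layout and the words-per-second rate definition are assumptions for this sketch.

```python
def fluency_features(segments, word_count):
    """Fluency features of claim 6 from classified speech segments.

    `segments`: list of (label, duration_seconds) pairs, with label
    'augmented' or 'normal' as assigned by the voice classification model.
    `word_count`: number of recognized words, used for the average rate.
    """
    total = sum(dur for _, dur in segments)
    augmented = [dur for label, dur in segments if label == "augmented"]
    return {
        "augmented_duration_ratio": sum(augmented) / total,
        "augmented_count": len(augmented),
        "average_rate": word_count / total,  # words per second (assumed unit)
    }
```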
7. The method according to claim 6, wherein the voice classification model is constructed as follows:
collecting augmented speech data and non-augmented speech data of different types, and taking the collected speech data as training data;
extracting acoustic features of the training data;
and training with the acoustic features to obtain the voice classification model.
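The three training steps of claim 7 can be sketched with a deliberately simple classifier: one centroid per class over the acoustic feature vectors, with nearest-centroid assignment at classification time. This is a stand-in only; the patent leaves the classifier family open, and the feature vectors here are placeholders for real acoustic features.

```python
def train_voice_classifier(features, labels):
    """Claim 7 sketch: build one centroid per class ('augmented' /
    'normal') from labelled acoustic feature vectors."""
    centroids = {}
    for lab in set(labels):
        rows = [f for f, l in zip(features, labels) if l == lab]
        dim = len(rows[0])
        centroids[lab] = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return centroids


def classify(feature, centroids):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(feature, centroids[lab]))
```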
8. The method according to claim 6, wherein the evaluation features further comprise language expression ability features; the method further comprises:
extracting language expression ability features of the voice signal according to the recognition text, wherein the language expression ability features comprise any one or more of the following:
a semantic coherence feature, which indicates whether each sentence or semantic segment in the recognition text is coherent;
lexical features, comprising any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognition text;
a literary quality feature, which indicates whether the expression of each sentence or paragraph in the recognition text is elegant;
the calculating of the evaluation score of the voice signal according to the evaluation features of the voice signal and the pre-constructed evaluation model comprises:
calculating a pronunciation accuracy score, a fluency score and a language expression ability score of the voice signal according to the extracted pronunciation accuracy features, fluency features and language expression ability features of the voice signal and the evaluation model corresponding to each kind of feature, and then calculating the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the language expression ability score; or
taking the pronunciation accuracy features, the fluency features and the language expression ability features of the voice signal as comprehensive evaluation features, and calculating the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation features.
9. The method according to claim 8, wherein the extracting of language expression ability features of the voice signal according to the recognition text comprises:
determining the misrecognized words in the recognition text;
performing correction processing on the recognition text, wherein the correction processing comprises: removing the misrecognized words from the recognition text or correcting the misrecognized words in the recognition text;
and extracting the language expression ability features of the voice signal according to the corrected recognition text.
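The correction step of claim 9 can be sketched as below. The `(token, category)` pair layout and the optional correction dictionary are illustrative assumptions; the claim only requires that misrecognized words be either removed or corrected before the expression ability features are extracted.

```python
def correct_recognition_text(words, corrections=None):
    """Claim 9 sketch: repair the recognition text before feature extraction.

    `words`: list of (token, category) pairs, category being one of
    'misrecognized', 'misread', 'correct'. A misrecognized token is
    replaced when `corrections` provides a substitute, otherwise dropped.
    """
    corrections = corrections or {}
    corrected = []
    for token, category in words:
        if category == "misrecognized":
            if token in corrections:
                corrected.append(corrections[token])
            # otherwise drop the misrecognized token entirely
        else:
            corrected.append(token)
    return corrected
```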
10. A spoken language evaluation system for a free-reading question type, characterized by comprising:
a receiving module, configured to receive a voice signal to be evaluated;
a voice recognition module, configured to perform voice recognition on the voice signal and segment the voice signal using the recognition text as a constrained boundary to obtain the voice segments corresponding to the basic voice units in the voice signal;
a pronunciation accuracy feature extraction module, configured to extract pronunciation accuracy features of the voice signal according to the recognition text and each voice segment and take the pronunciation accuracy features as evaluation features of the voice signal, specifically by classifying each word in the recognition text and obtaining the pronunciation accuracy features based on the category of each word; the pronunciation accuracy features comprise any one or more of the following: the posterior probability of non-misrecognized words, the ratio of misread words and the ratio of correctly read words;
and an evaluation module, configured to calculate an evaluation score of the voice signal according to the evaluation features of the voice signal and a pre-constructed evaluation model.
11. The system according to claim 10, wherein the pronunciation accuracy feature extraction module comprises:
a classification feature obtaining unit, configured to obtain classification features of each word in the recognition text according to the recognition text and each voice segment, wherein the classification features comprise any one or more of the following: acoustic features, language model features and grammatical features;
a word category determining unit, configured to determine the category of each word based on the classification features and a pre-trained word classification model, wherein the categories comprise: misrecognized, misread and correctly read;
a first calculating unit, configured to obtain any one or more of the following features of the recognition text: the posterior probability of non-misrecognized words, the ratio of misread words and the ratio of correctly read words.
12. The system according to claim 11, wherein the classification feature obtaining unit comprises:
an acoustic feature obtaining subunit, configured to obtain, according to the voice segments, the posterior probabilities of all basic voice units contained in each word of the recognition text, then, for each word, calculate the mean of the posterior probabilities of all basic voice units contained in the word, and take the mean as the acoustic feature of the word;
a language model feature obtaining subunit, configured to obtain, according to the voice segments, the language model scores of all basic voice units contained in each word of the recognition text, then, for each word, calculate the mean of the language model scores of all basic voice units contained in the word, and take the mean as the language model feature of the word;
a grammatical feature obtaining subunit, configured to perform grammar error detection on the recognition text according to grammar rules to obtain an error detection result, and then determine the grammatical feature of each word in the recognition text according to the error detection result.
13. The system according to any one of claims 10 to 12, further comprising:
a fluency feature extraction module, configured to extract fluency features of the voice signal, wherein the fluency features comprise any one or more of the following: the duration ratio of augmented speech, the number of occurrences of augmented speech and the average speech rate; the fluency feature extraction module comprises:
an acoustic feature extraction unit, configured to extract acoustic features of the basic voice units for voice classification in the voice signal, wherein a basic voice unit for voice classification is a syllable, a word or a phrase;
a voice category determining unit, configured to determine the augmented speech and the non-augmented speech in the voice signal by using the acoustic features and a pre-constructed voice classification model;
a second calculating unit, configured to extract the fluency features of the voice signal according to the augmented speech and the non-augmented speech in the voice signal, wherein the fluency features comprise any one or more of the following: the duration ratio of augmented speech, the number of occurrences of augmented speech and the average speech rate;
the evaluation module is specifically configured to take the pronunciation accuracy features and the fluency features of the voice signal as comprehensive evaluation features and calculate the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation features; or
the evaluation module comprises:
a pronunciation accuracy evaluating unit, configured to calculate a pronunciation accuracy score of the voice signal according to the pronunciation accuracy features of the voice signal and a pronunciation accuracy evaluation model;
a fluency evaluating unit, configured to calculate a fluency score of the voice signal according to the fluency features of the voice signal and a fluency evaluation model;
and a first evaluation score calculating unit, configured to calculate the evaluation score of the voice signal according to the pronunciation accuracy score and the fluency score.
14. The system according to claim 13, wherein the fluency feature extraction module further comprises a voice classification model constructing unit, configured to construct the voice classification model; the voice classification model constructing unit comprises:
a training data collecting subunit, configured to collect augmented speech data and non-augmented speech data of different types and take the collected speech data as training data;
an acoustic feature extraction subunit, configured to extract acoustic features of the training data;
and a training subunit, configured to train with the acoustic features to obtain the voice classification model.
15. The system according to claim 13, further comprising:
a language expression ability feature extraction module, configured to extract language expression ability features of the voice signal according to the recognition text, wherein the language expression ability features comprise any one or more of the following:
a semantic coherence feature, which indicates whether each sentence or semantic segment in the recognition text is coherent;
lexical features, comprising any one or more of the following: the number of distinct words, the number of idioms and the number of advanced words in the recognition text;
a literary quality feature, which indicates whether the expression of each sentence or paragraph in the recognition text is elegant;
the evaluation module is specifically configured to take the pronunciation accuracy features, the fluency features and the language expression ability features of the voice signal as comprehensive evaluation features and calculate the evaluation score of the voice signal by using an evaluation model corresponding to the comprehensive evaluation features; or
the evaluation module comprises:
a pronunciation accuracy evaluating unit, configured to calculate a pronunciation accuracy score of the voice signal according to the pronunciation accuracy features of the voice signal and a pronunciation accuracy evaluation model;
a fluency evaluating unit, configured to calculate a fluency score of the voice signal according to the fluency features of the voice signal and a fluency evaluation model;
a language expression ability evaluating unit, configured to calculate a language expression ability score of the voice signal according to the language expression ability features of the voice signal and a language expression ability evaluation model;
and a second evaluation score calculating unit, configured to calculate the evaluation score of the voice signal according to the pronunciation accuracy score, the fluency score and the language expression ability score.
16. The system according to claim 15, wherein the language expression ability feature extraction module comprises:
a recognition error determining unit, configured to determine the misrecognized words in the recognition text;
a correcting unit, configured to perform correction processing on the recognition text, wherein the correction processing comprises: removing the misrecognized words from the recognition text or correcting the misrecognized words in the recognition text;
and an extraction unit, configured to extract the language expression ability features of the voice signal according to the corrected recognition text.
CN201610423082.XA 2016-06-14 2016-06-14 Spoken language evaluation method and system for freely reading question types Active CN105845134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610423082.XA CN105845134B (en) 2016-06-14 2016-06-14 Spoken language evaluation method and system for freely reading question types


Publications (2)

Publication Number Publication Date
CN105845134A CN105845134A (en) 2016-08-10
CN105845134B true CN105845134B (en) 2020-02-07

Family

ID=56576747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610423082.XA Active CN105845134B (en) 2016-06-14 2016-06-14 Spoken language evaluation method and system for freely reading question types

Country Status (1)

Country Link
CN (1) CN105845134B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297828B (en) * 2016-08-12 2020-03-24 苏州驰声信息科技有限公司 Detection method and device for false sounding detection based on deep learning
CN107067834A (en) * 2017-03-17 2017-08-18 麦片科技(深圳)有限公司 Point-of-reading system with oral evaluation function
KR102490752B1 (en) * 2017-08-03 2023-01-20 링고챔프 인포메이션 테크놀로지 (상하이) 컴퍼니, 리미티드 Deep context-based grammatical error correction using artificial neural networks
CN107808674B (en) * 2017-09-28 2020-11-03 上海流利说信息技术有限公司 Method, medium and device for evaluating voice and electronic equipment
CN109697975B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN107818795B (en) * 2017-11-15 2020-11-17 苏州驰声信息科技有限公司 Method and device for evaluating oral English
CN107886968B (en) * 2017-12-28 2021-08-24 广州讯飞易听说网络科技有限公司 Voice evaluation method and system
CN108320734A (en) * 2017-12-29 2018-07-24 安徽科大讯飞医疗信息技术有限公司 Audio signal processing method and device, storage medium, electronic equipment
CN108831503B (en) * 2018-06-07 2021-11-19 邓北平 Spoken language evaluation method and device
US11527174B2 (en) * 2018-06-18 2022-12-13 Pearson Education, Inc. System to evaluate dimensions of pronunciation quality
CN109147765B (en) * 2018-11-16 2021-09-03 安徽听见科技有限公司 Audio quality comprehensive evaluation method and system
CN110164422A (en) * 2019-04-03 2019-08-23 苏州驰声信息科技有限公司 A kind of the various dimensions appraisal procedure and device of speaking test
CN110136721A (en) * 2019-04-09 2019-08-16 北京大米科技有限公司 A kind of scoring generation method, device, storage medium and electronic equipment
CN112151018A (en) * 2019-06-10 2020-12-29 阿里巴巴集团控股有限公司 Voice evaluation and voice recognition method, device, equipment and storage medium
CN110610630B (en) * 2019-08-02 2021-05-14 广州千课教育科技有限公司 Intelligent English teaching system based on error dispersion checking
CN110503941B (en) * 2019-08-21 2022-04-12 北京隐虚等贤科技有限公司 Language ability evaluation method, device, system, computer equipment and storage medium
CN111128181B (en) * 2019-12-09 2023-05-30 科大讯飞股份有限公司 Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN113707178B (en) * 2020-05-22 2024-02-06 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN111916108B (en) * 2020-07-24 2021-04-02 北京声智科技有限公司 Voice evaluation method and device
TWI760856B (en) * 2020-09-23 2022-04-11 亞東學校財團法人亞東科技大學 Habitual wrong language grammar learning correcting system and method
CN112767932A (en) * 2020-12-11 2021-05-07 北京百家科技集团有限公司 Voice evaluation system, method, device, equipment and computer readable storage medium
CN112686020A (en) * 2020-12-29 2021-04-20 科大讯飞股份有限公司 Composition scoring method and device, electronic equipment and storage medium
CN112908358B (en) * 2021-01-31 2022-10-18 云知声智能科技股份有限公司 Open type voice evaluation method and device
JP7371644B2 (en) * 2021-02-01 2023-10-31 カシオ計算機株式会社 Pronunciation training program and terminal device
CN115346421A (en) * 2021-05-12 2022-11-15 北京猿力未来科技有限公司 Spoken language fluency scoring method, computing device and storage medium
CN113486970B (en) * 2021-07-15 2024-04-05 北京全未来教育科技有限公司 Reading capability evaluation method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2375210B (en) * 2001-04-30 2005-03-23 Vox Generation Ltd Grammar coverage tool for spoken language interface
US7657221B2 (en) * 2005-09-12 2010-02-02 Northwest Educational Software, Inc. Virtual oral recitation examination apparatus, system and method
KR100687441B1 (en) * 2006-03-16 2007-02-27 장성옥 Method and system for evaluation of foring language voice
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN101740024B (en) * 2008-11-19 2012-02-08 中国科学院自动化研究所 Method for automatic evaluation of spoken language fluency based on generalized fluency
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN102568475B (en) * 2011-12-31 2014-11-26 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN103559894B (en) * 2013-11-08 2016-04-20 科大讯飞股份有限公司 Oral evaluation method and system
CN103761975B (en) * 2014-01-07 2017-05-17 苏州驰声信息科技有限公司 Method and device for oral evaluation
CN104952444B (en) * 2015-04-27 2018-07-17 桂林电子科技大学 A kind of Chinese's Oral English Practice method for evaluating quality that text is unrelated

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Several Studies and Improvements of Acoustic Models in Text-Independent Pronunciation Quality Assessment Systems; Jiang Tonghai et al.; Network New Media Technology; 2012-03-31; pp. 47-53 *

Also Published As

Publication number Publication date
CN105845134A (en) 2016-08-10

Similar Documents

Publication Publication Date Title
CN105845134B (en) Spoken language evaluation method and system for freely reading question types
CN101645271B (en) Rapid confidence-calculation method in pronunciation quality evaluation system
CN101650886B (en) Method for automatically detecting reading errors of language learners
CN108766415B (en) Voice evaluation method
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
Lee Language-independent methods for computer-assisted pronunciation training
Sefara et al. HMM-based speech synthesis system incorporated with language identification for low-resourced languages
Mao et al. Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech
Kyriakopoulos et al. Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech
CN110675292A (en) Child language ability evaluation method based on artificial intelligence
Cámara-Arenas et al. Automatic pronunciation assessment vs. automatic speech recognition: A study of conflicting conditions for L2-English
Mary et al. Searching speech databases: features, techniques and evaluation measures
WO2019075827A1 (en) Voice evaluation method and device
KR101145440B1 (en) A method and system for estimating foreign language speaking using speech recognition technique
Proença Automatic assessment of reading ability of children
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
Razik et al. Frame-synchronous and local confidence measures for automatic speech recognition
Harere et al. Quran recitation recognition using end-to-end deep learning
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
Li et al. English sentence pronunciation evaluation using rhythm and intonation
Wang et al. Automatic Detection of Speaker Attributes Based on Utterance Text.
van Doremalen Developing automatic speech recognition-enabled language learning applications: from theory to practice
CN113053414A (en) Pronunciation evaluation method and device
Pranjol et al. Bengali speech recognition: An overview
Al-Barhamtoshy et al. Speak correct: phonetic editor approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant