CN103559892A

CN103559892A - Method and system for evaluating spoken language

Info

Publication number: CN103559892A
Application number: CN201310554703.4A
Authority: CN
Inventors: 王士进; 刘丹; 陈进; 魏思; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2014-02-05
Anticipated expiration: 2033-11-08
Also published as: CN103559892B

Abstract

The invention relates to the technical field of voice signal processing, and discloses a method and system for evaluating a spoken language. The method comprises the step of receiving a voice signal to be evaluated, the step of respectively obtaining voice fragments corresponding to basic voice units in the voice signal through at least two different voice recognition systems, the step of respectively extracting evaluating characteristics corresponding to different characteristics types from the voice fragments, the step of calculating original scores of the evaluating characteristics, the step of optimizing and fusing the original scores obtained based on the different voice recognition systems according to the characteristic types to obtain the comprehensive score of the evaluating characteristics, and the step of calculating the score of the voice signal according to the comprehensive score of the different evaluating characteristics. By means of the method and system for evaluating the spoken language, the accuracy of evaluating the spoken language can be improved, and abnormal grading can be reduced.

Description

Spoken evaluating method and system

Technical field

The present invention relates to voice process technology field, be specifically related to a kind of spoken evaluating method and system.

Background technology

As the important medium of interpersonal communication, conversational language occupies extremely important status in real life.Along with the aggravation of socioeconomic development and the trend of globalization, people have proposed more and more higher requirement to the objectivity of the efficiency of language learning and language assessment, fairness and scale test.Traditional artificial spoken language proficiency evaluating method is very limited Faculty and Students on instructional blocks of time and space, at aspects such as qualified teachers' strength, teaching place, funds expenditures, also has gap and the imbalance on many hardware; Artificial evaluation and test cannot be avoided evaluator's self individual deviation, thereby can not guarantee the unification of standards of grading, sometimes even cannot accurately reflect measured's true horizon; And for extensive oral test, need a large amount of human and material resources and financial support, limited assessment test regular, scale.For this reason, industry has been developed some language teachings and evaluating system in succession.

In the prior art, spoken evaluating system adopts single recognizer to carry out speech recognition (as question-and-answer problem) or speech text alignment (as reading aloud topic) to the voice signal receiving conventionally, thereby obtains the voice snippet that each basic voice unit is corresponding.System is extracted and is described the feature that each basic voice unit pronunciation standard degree or fluency etc. are weighed spoken evaluating standard respectively from each voice snippet subsequently, finally based on described feature, by forecast analysis, obtains evaluating and testing final score.

While using the sound pick-up outfit of high-fidelity under quiet environment, speech recognition system is owing to providing higher recognition accuracy thereby follow-up spoken evaluation and test also can provide comparatively objective and accurate result.Yet in actual applications particularly for extensive SET, playback environ-ment unavoidably can be subject to the impact of the factors such as examination hall noise, neighbourhood noise, and speech recognition accuracy rate declines and to cause there will be in spoken evaluation and test process a certain proportion of abnormal scoring voice.It is real practical that obvious this phenomenon is difficult to extensive SET Computer automatic scoring, range of application and the popularization of spoken evaluating system have been limited, to a lot of vital examinations, cannot apply, once otherwise occur that abnormal scoring will cause the accident of marking examination papers.

Summary of the invention

The embodiment of the present invention provides a kind of spoken evaluating method and system, to improve the accuracy of spoken evaluation and test, reduces scoring extremely.

For this reason, the invention provides following technical scheme:

An evaluating method, comprising:

Receive voice signal to be evaluated;

Utilize at least two kinds of different speech recognition systems to obtain respectively voice snippet corresponding to each basic voice unit in described voice signal;

From described voice snippet, extract respectively the evaluation and test feature of corresponding different characteristic type;

Calculate the original score of described evaluation and test feature;

According to described characteristic type, the described original score obtaining based on different phonetic recognition system is optimized to fusion, obtains the integrate score of described evaluation and test feature;

According to the integrate score of difference evaluation and test feature, calculate the score of described voice signal.

Preferably, described characteristic type comprise following one or more: integrity feature, pronunciation accuracy feature, fluency feature, prosodic features.

Preferably, the original score of the described evaluation and test feature of described calculating comprises:

Load the score in predicting model corresponding with the characteristic type of described evaluation and test feature;

Calculate described evaluation and test feature corresponding to the similarity of described score in predicting model, and the original score using described similarity as described evaluation and test feature.

Preferably, the score in predicting model of the same characteristic type of corresponding different topic types is different.

Preferably, describedly according to described characteristic type, the described original score obtaining based on different phonetic recognition system is optimized to fusion, the integrate score that obtains described evaluation and test feature comprises:

For the original score of the evaluation and test feature obtaining based on different phonetic recognition system of same characteristic type, get wherein maximum score or meta score or average, as the integrate score of described evaluation and test feature.

An evaluating system, comprising:

Receiver module, for receiving voice signal to be evaluated;

Voice snippet acquisition module, for utilizing at least two kinds of different speech recognition systems to obtain respectively voice snippet corresponding to each basic voice unit of described voice signal;

Characteristic extracting module, for extracting respectively the evaluation and test feature of corresponding different characteristic type from described voice snippet;

Computing module, for calculating the original score of described evaluation and test feature;

Optimization fusion module, for according to described characteristic type, the described original score obtaining based on different phonetic recognition system being optimized to fusion, obtains the integrate score of described evaluation and test feature;

Grading module, for calculating the score of described voice signal according to the integrate score of difference evaluation and test feature.

Preferably, described computing module comprises:

Loading unit, for loading the score in predicting model corresponding with the characteristic type of described evaluation and test feature;

Similarity calculated, for calculating described evaluation and test feature corresponding to the similarity of described score in predicting model, and the original score using described similarity as described evaluation and test feature.

Preferably, described grading module, specifically for the original score of the evaluation and test feature obtaining based on different phonetic recognition system for same characteristic type, gets wherein maximum score or meta score or average, as the integrate score of described evaluation and test feature.

Spoken evaluating method and system that the embodiment of the present invention provides, by adopting the more voice recognition system comprehensive mode of marking respectively, identification and the abnormal situation of evaluation and test feature extraction that single system scoring brings have been reduced, and then reduced the error score that identification error brings, realized the accurately evaluation and test comprehensively to user's spoken language proficiency.

Accompanying drawing explanation

In order to be illustrated more clearly in the embodiment of the present application or technical scheme of the prior art, to the accompanying drawing of required use in embodiment be briefly described below, apparently, the accompanying drawing the following describes is only some embodiment that record in the present invention, for those of ordinary skills, can also obtain according to these accompanying drawings other accompanying drawing.

Fig. 1 is the process flow diagram of the spoken evaluating method of the embodiment of the present invention;

Fig. 2 is the process flow diagram that builds score in predicting model in the embodiment of the present invention;

Fig. 3 is the structural representation of the spoken evaluating system of the embodiment of the present invention.

Embodiment

In order to make those skilled in the art person understand better the scheme of the embodiment of the present invention, below in conjunction with drawings and embodiments, the embodiment of the present invention is described in further detail.

For being subject to the decline of such environmental effects speech recognition accuracy rate can cause occurring in spoken evaluation and test process the problem of a certain proportion of abnormal scoring voice in prior art, the embodiment of the present invention provides a kind of spoken evaluating method and system, first to voice signal to be evaluated, adopt multiple voice recognition method to identify, obtain multiple recognition result; Then from every kind of recognition result, extract respectively the evaluation and test feature based on different characteristic type, and calculate respectively scoring according to described evaluation and test feature; According to characteristic type, the described scoring of each recognition result is optimized and merges the comprehensive grading that obtains different characteristic type subsequently; Finally the comprehensive grading of different characteristic type is changed to the final score of determining described voice signal.

As shown in Figure 1, be the process flow diagram of the spoken evaluating method of the embodiment of the present invention, comprise the following steps:

Step 101, receives voice signal to be evaluated.

Step 102, adopts at least two kinds of different speech recognition systems to obtain respectively voice snippet corresponding to each basic voice unit in described voice signal.

Described basic voice unit can be syllable, phoneme etc.Different speech recognition systems by the acoustic feature based on different as based on MFCC(Mel-Frequency Cepstrum Coefficients, Mel-cepstrum coefficient) acoustic model of feature, based on PLP(Perceptual Linear Predictive, perception linear prediction) acoustic model of feature etc., or adopt different acoustic models as HMM-GMM(Hidden Markov Model-Gaussian Mixture Model, hidden Markov model-gauss hybrid models), based on DBN(Dynamic BeyesianNetwork, dynamic bayesian network) neural network acoustic model etc., even adopting different decoding processes searches for as Viterbi, A* search etc., voice signal is decoded.Like this, can obtain basic voice unit and the corresponding voice segment sequence of described voice signal.

Particularly, for question-and-answer problem etc., not having the voice signal of text marking to obtain the text that described voice signal is corresponding by continuous speech recognition is basic voice unit sequence, and the corresponding voice snippet of each basic voice unit.For reading aloud topic, wait the voice signal with model answer to adopt voice alignment thereof to obtain the time boundary of the corresponding voice snippet of each basic voice unit.

Because different speech recognition systems has different decoding advantages, between its recognition result, often there is certain complementarity.

Step 103 is extracted respectively the evaluation and test feature of corresponding different characteristic type from described voice snippet.

Described characteristic type can comprise following one or more: integrity feature, pronunciation accuracy feature, fluency feature, prosodic features etc.Wherein:

Described integrity feature is for describing basic voice unit sequence corresponding to described voice segment sequence corresponding to the text integrity degree of model answer.

In embodiments of the present invention, can, by described basic voice unit sequence is mated with the model answer network building in advance, obtain optimal path, using the matching degree of optimal path and voice unit sequence as integrity feature.

It should be noted that, for different topic types, the form of described model answer network can be different, such as, to reading aloud topic type, its model answer is topic face words sequence, and for semi-open topic types such as question-and-answer problems, its model answer often consists of the core words of determining and other complementary connection words.In addition due to the uncertainty of answer, its expression-form is often more, and corresponding model answer network consists of a plurality of model answers conventionally, shows as the model answer of a plurality of Answer Sentence formulas or grid configuration.

Certainly, when model answer is not unique, can also build according to the probability of occurrence of each model answer the model answer network of a Weight, and select corresponding weighted registration rate to calculate the matching degree of optimal path and voice unit sequence, using the matching degree of corresponding each voice unit as integrity feature.

Further, in the answer network of semi-open topic type, in answer definite core words check on one's answers the importance that correctness describes will be far above other connectivity words, in order to highlight the check on one's answers importance of integrity degree of core words, can be respectively to core words and connect the weight that words arranges different numerical value, in the model answer network of Weight, search for the optimal path of described basic voice unit sequence, and using the cumulative score of optimal path as matching degree.

Described pronunciation accuracy feature is for describing the pronunciation standard degree of each voice snippet.Particularly, can calculate respectively each voice snippet corresponding to its similarity of the default pronunciation acoustic model of corresponding basic voice unit, using described similarity as pronunciation accuracy feature.

Described fluency feature is for describing the smoothness of user's statement statement, includes but not limited to the average word speed of statement (as the ratio of voice duration and voice unit number etc.), the average flow length of statement, effectively pause ratio of statement etc.In addition, in order to compensate the difference of different speaker in word speed, can also adopt phoneme section feature, all pronunciation parts are normalized to rear common composition fluency feature.Particularly, can be by the duration discrete probability distribution of statistics context-free phoneme, the logarithm probability of duration scoring after calculating normalization, obtains segment length's scoring of phoneme.

Described prosodic features, for describing the rhythm feature of user pronunciation, comprises the features such as pitch variation fluctuating.Particularly, can extract the fundamental frequency characteristic sequence of each voice snippet, also can further obtain subsequently its dynamic change characterization, as extracted the prosodic features as a supplement such as first order difference, second order difference.

The evaluation and test feature of above-mentioned corresponding different characteristic type has been described respectively the feature of active user's pronunciation from different perspectives, has each other certain complementarity.

Step 104, calculates every kind of original score of evaluating and testing feature.

Evaluation and test feature for different characteristic type can load respectively corresponding score in predicting model and calculate described evaluation and test feature corresponding to the similarity of this score in predicting model, the original score using described similarity as described evaluation and test feature.

It should be noted that, in actual applications, can also load corresponding score in predicting model according to difference topic type, the score in predicting model of the same characteristic type of corresponding different topic types can be identical, also can be different, thus fineness and the accuracy of scoring further improved.The structure of each score in predicting model will describe in detail in the back.

Step 105, is optimized fusion according to described characteristic type to the described original score obtaining based on different phonetic recognition system, obtains the integrate score of described evaluation and test feature.

Because different speech recognition systems has adopted different recognizers or acoustic model, often there is different recognition results, the evaluation and test feature of the same characteristic type extracting based on different phonetic segment accordingly is also not quite similar, and the score of evaluation and test feature also exists certain complementarity (integrality, accuracy, fluency, the rhythm etc.).

In embodiments of the present invention, the original score of the evaluation and test feature for same characteristic type first obtaining for different phonetic recognition system is optimized fusion, weighs the user pronunciation level of this evaluation and test characteristic present comprehensively.Particularly, can adopt and get maximum according to the demand of difference examination and the number of speech recognition system, get median, the mode such as average is optimized fusion to described score.Such as, if the original phase-splitting difference that obtains of the evaluation and test feature obtaining based on different phonetic recognition system is in the threshold value of setting, the mean value of each original score is evaluated and tested to the integrate score of feature as this; If the original score of this evaluation and test feature that the original score of the evaluation and test feature that certain or some speech recognition systems obtain obtains higher than other speech recognition systems, gets near mean value maximal value wherein or maximal value as the integrate score of this evaluation and test feature.

By above-mentioned integrate score, can reduce to a certain extent the score abnormal conditions that individual voice recognition system is abnormal or evaluation and test feature extraction causes extremely.

Step 106, calculates the score of described voice signal according to the integrate score of difference evaluation and test feature.

After the fusion process of above-mentioned steps 105, can obtain the integrate score of different evaluation and test features.In embodiments of the present invention, can consider that the integrate score of dissimilar evaluation and test feature has certain correlativity from practical application, the conversion method based on linear regression, calculates PTS, i.e. the score of computing voice signal as follows:

S = \frac{1}{N} Σ_{i = 1}^{N} w_{i} s_{i}

Wherein, w _ithe correlation parameter of respectively evaluating and testing feature, w _ifor positive number, by system, set in advance and meet

s _iit is the integrate score of respectively evaluating and testing feature; N is the number of integrate score.

Visible, the spoken evaluating method of the embodiment of the present invention, by adopting the more voice recognition system comprehensive mode of marking respectively, identification and the abnormal situation of evaluation and test feature extraction that single system scoring brings have been reduced, and then reduced the error score that identification error brings, realized the accurately evaluation and test comprehensively to user's spoken language proficiency.

Before mention, when calculating the score of evaluation and test feature, need to load the score in predicting model corresponding with the characteristic type of described evaluation and test feature.It should be noted that, described score in predicting model in advance off-line builds.

In embodiments of the present invention, score in predicting model arranges respectively for each characteristic type, its input is that the evaluation and test feature of a certain special characteristic of correspondence that extracts from voice snippet is (as integrity feature, pronunciation accuracy feature etc.), output is mark, is actually the mapping of having set up from evaluation and test feature to scoring.It should be noted that, every kind of evaluation and test feature has all been set up respectively to a score in predicting model.Further, the identical scoring characteristic type of corresponding different topic types, also can set up respectively corresponding score in predicting model.

As shown in Figure 2, be the process flow diagram that builds score in predicting model in the embodiment of the present invention, comprise the following steps:

Step 201, gathers scoring training data.

Particularly, can collect respectively to each exercise question a plurality of users' answer speech data, as scoring training data.

Step 202, manually marks described training data, comprises text marking and cutting and the artificial marking of spoken evaluation and test etc.

Described text marking refers to the conversion from speech-to-text.Cutting refers to by artificial monitoring, and continuous speech signal is divided, and determines the voice snippet that each basic voice unit is corresponding.The artificial marking of spoken evaluation and test refers to by the mode of artificial audiometry marks to spoken language proficiency.

In actual applications, can to above-mentioned different evaluation and test feature, mark respectively respectively, described evaluation and test feature comprises integrity feature, pronunciation accuracy feature, fluency feature, prosodic features etc.

Step 203, extracts respectively the evaluation and test feature of different characteristic type according to annotation results.

That is to say, according to the basic voice unit in annotation results and corresponding voice snippet, from described voice snippet, according to the mode of introducing, extract respectively the evaluation and test feature of different characteristic type above.

Step 204, utilizes described evaluation and test feature to build respectively the score in predicting model relevant to described characteristic type.

Particularly, can utilize forecasting techniques training under the guidance of artificial scoring to obtain the parameter of score in predicting model, then obtain score in predicting model.Further, can also set up respectively the score in predicting model relevant to topic type according to different test question types.

In embodiments of the present invention, need respectively specific evaluation and test feature to be set up to independent score in predicting model.Building process is roughly as follows:

First suppose that score in predicting model is for the mapping function of evaluation and test feature.As to integrity feature, its intrinsic dimensionality is 1, and this forecast model is linear function y=a*x+b, the pronunciation accuracy feature of x for extracting wherein, and y is the evaluation and test score of prediction, a, b is prediction model parameters.

Then from the training data obtaining in advance, extract the integrity feature X and the corresponding artificial integrity feature scoring Y that obtain each sample.Then at LSE(Least Squares Error, least mean-square error) or MSE(Mean Squared Error) under criterion training obtain a, the prediction model parameters of b.

Certainly score in predicting model is not limited to above-mentioned linear mapping function, can also adopt NN(Neural Network, neural network) etc. the method for statistical model, be not described in detail here.

Correspondingly, the embodiment of the present invention also provides a kind of spoken evaluating system, as shown in Figure 3, is the structural representation of this system.

In this embodiment, described system comprises:

Receiver module 301, for receiving voice signal to be evaluated.

Voice snippet acquisition module 302, for utilizing at least two kinds of different speech recognition systems to obtain respectively voice snippet corresponding to each basic voice unit of described voice signal.

Above-mentioned basic voice unit can be syllable, phoneme etc.Different speech recognition systems by the acoustic feature based on different as the acoustic model based on MFCC feature, acoustic model based on PLP feature etc., or adopt different acoustic models as HMM-GMM, neural network acoustic model based on DBN etc., even adopt different decoding processes to search for as Viterbi, A ^*search etc., decode to voice signal.Like this, can obtain basic voice unit and the corresponding voice segment sequence of described voice signal.

Characteristic extracting module 303, for extracting respectively the evaluation and test feature of corresponding different characteristic type from described voice snippet.

Described characteristic type can comprise following one or more: integrity feature, pronunciation accuracy feature, fluency feature, prosodic features etc., being defined in of various characteristic types has been described in detail above, do not repeat them here.

Computing module 304, for calculating the original score of described evaluation and test feature.

Optimization fusion module 305, for according to described characteristic type, the described original score obtaining based on different phonetic recognition system being optimized to fusion, obtains the integrate score of described evaluation and test feature.

Because different speech recognition systems has adopted different recognizers or acoustic model, often there is different recognition results, the evaluation and test feature of the same characteristic type extracting based on different phonetic segment accordingly is also not quite similar, and the score of evaluation and test feature also exists certain complementarity.

For this reason, in embodiments of the present invention, the original score of the evaluation and test feature for same characteristic type that optimization fusion module 305 obtains for different phonetic recognition system is optimized fusion, weighs the user pronunciation level of this evaluation and test characteristic present comprehensively.Particularly, optimization fusion module 305 can adopt and get maximum according to the demand of difference examination and the number of speech recognition system, gets median, the mode such as average is optimized fusion to described score.Such as, if original the phase-splitting difference of the evaluation and test feature obtaining based on different phonetic recognition system in the threshold value of setting, the integrate score of optimization fusion module 305 using the mean value of each original score as this evaluation and test feature; If the original score of this evaluation and test feature that the original score of the evaluation and test feature that certain or some speech recognition systems obtain obtains higher than other speech recognition systems, optimization fusion module 305 is got near mean value maximal value wherein or maximal value as the integrate score of this evaluation and test feature.

Grading module 306, calculates the score of described voice signal according to the integrate score of difference evaluation and test feature.

Grading module 306 can be based on linear regression conversion method, calculate PTS, concrete account form elaborates in the spoken evaluating method of the embodiment of the present invention above, does not repeat them here.

Visible, the spoken evaluating system of the embodiment of the present invention, by adopting the more voice recognition system comprehensive mode of marking respectively, identification and the abnormal situation of evaluation and test feature extraction that single system scoring brings have been reduced, and then reduced the error score that identification error brings, realized the accurately evaluation and test comprehensively to user's spoken language proficiency.

It should be noted that, in embodiments of the present invention, above-mentioned computing module 304 specifically can utilize the score in predicting model of corresponding different evaluation and test features to calculate described evaluation and test feature corresponding to the similarity of this score in predicting model, the original score using described similarity as described evaluation and test feature.

For this reason, a kind of implementation of described computing module 304 comprises: loading unit and similarity calculated (not shown).Wherein:

Described loading unit, for loading the score in predicting model corresponding with the characteristic type of described evaluation and test feature;

Described similarity calculated, for calculating described evaluation and test feature corresponding to the similarity of described score in predicting model, and the original score using described similarity as described evaluation and test feature.

Above-mentioned score in predicting model in advance off-line builds, and concrete building process is described in detail above, does not repeat them here.

Each embodiment in this instructions all adopts the mode of going forward one by one to describe, between each embodiment identical similar part mutually referring to, each embodiment stresses is the difference with other embodiment.Especially, for system embodiment, because it is substantially similar in appearance to embodiment of the method, so describe fairly simplely, relevant part is referring to the part explanation of embodiment of the method.System embodiment described above is only schematic, wherein said module or unit as separating component explanation can or can not be also physically to separate, the parts that show as module or unit can be or can not be also physical locations, can be positioned at a place, or also can be distributed in a plurality of network element.Can select according to the actual needs some or all of module wherein to realize the object of the present embodiment scheme.Those of ordinary skills, in the situation that not paying creative work, are appreciated that and implement.

All parts embodiment of the present invention can realize with hardware, or realizes with the software module moved on one or more processor, or realizes with their combination.It will be understood by those of skill in the art that and can use in practice microprocessor or digital signal processor (DSP) to realize according to the some or all functions of the some or all parts in the spoken evaluating system of the embodiment of the present invention.The present invention for example can also be embodied as, for carrying out part or all equipment or device program (, computer program and computer program) of method as described herein.Realizing like this program of the present invention can be stored on computer-readable medium, or can have the form of one or more signal.Such signal can be downloaded and obtain from internet website, or provides on carrier signal, or provides with any other form.

Above the embodiment of the present invention is described in detail, has applied embodiment herein the present invention is set forth, the explanation of above embodiment is just for helping to understand method and apparatus of the present invention; , for one of ordinary skill in the art, according to thought of the present invention, all will change in specific embodiments and applications, in sum, this description should not be construed as limitation of the present invention meanwhile.

Claims

1. a spoken evaluating method, is characterized in that, comprising:

Receive voice signal to be evaluated;

Calculate the original score of described evaluation and test feature;

2. method according to claim 1, is characterized in that, described characteristic type comprise following one or more: integrity feature, pronunciation accuracy feature, fluency feature, prosodic features.

3. method according to claim 1, is characterized in that, the original score of the described evaluation and test feature of described calculating comprises:

4. method according to claim 3, is characterized in that, the score in predicting model of the same characteristic type of corresponding different topic types is different.

5. according to the method described in claim 1 to 4 any one, it is characterized in that, describedly according to described characteristic type, the described original score obtaining based on different phonetic recognition system is optimized to fusion, the integrate score that obtains described evaluation and test feature comprises:

6. a spoken evaluating system, is characterized in that, comprising:

Receiver module, for receiving voice signal to be evaluated;

7. system according to claim 6, is characterized in that, described characteristic type comprise following one or more: integrity feature, pronunciation accuracy feature, fluency feature, prosodic features.

8. system according to claim 6, is characterized in that, described computing module comprises:

9. system according to claim 8, is characterized in that, the score in predicting model of the same characteristic type of corresponding different topic types is different.

10. according to the system described in claim 6 to 9 any one, it is characterized in that,

Described grading module, specifically for the original score of the evaluation and test feature obtaining based on different phonetic recognition system for same characteristic type, gets wherein maximum score or meta score or average, as the integrate score of described evaluation and test feature.