Summary of the invention
Embodiments of the present invention provide an oral evaluation method and system, to improve the accuracy of oral evaluation and reduce abnormal scoring.
To this end, the invention provides the following technical scheme:
An oral evaluation method, comprising:
receiving a voice signal to be evaluated;
using at least two different speech recognition systems to respectively obtain the voice snippet corresponding to each basic voice unit in the voice signal;
fusing the obtained voice snippets to obtain an effective voice snippet sequence corresponding to the voice signal;
extracting evaluation features from the effective voice snippet sequence;
scoring according to the evaluation features.
Preferably, fusing the obtained voice snippets to obtain an effective voice snippet sequence corresponding to the voice signal comprises:
respectively performing dynamic matching between the text corresponding to the voice snippets obtained by the different speech recognition systems and a pre-built standard answer network, to obtain optimum matching results;
generating sets of different corresponding units in turn according to the optimum matching results, a corresponding unit being a recognition result unit that correctly matches the standard answer network and whose voice snippets, obtained by the different speech recognition systems, overlap in time;
determining the optimum unit in each set;
splicing the optimum units of the sets in turn to obtain the effective voice snippet sequence corresponding to the voice signal.
Preferably, determining the optimum unit in each set comprises:
respectively calculating the acoustic model probability or the pronunciation posterior probability of the voice snippet of each corresponding unit in the set;
selecting the corresponding unit with the maximum acoustic model probability or pronunciation posterior probability as the optimum unit in the set.
Preferably, the evaluation features correspond to one characteristic type, the characteristic type being any one of the following: a completeness feature, a pronunciation accuracy feature, a fluency feature, or a prosodic feature;
and scoring according to the evaluation features comprises:
loading the score prediction model corresponding to the characteristic type of the evaluation features;
calculating the similarity of the evaluation features with respect to the score prediction model, and taking the similarity as the score of the voice signal.
Preferably, the evaluation features comprise at least two groups of evaluation features corresponding to different characteristic types, each characteristic type being any one of the following: a completeness feature, a pronunciation accuracy feature, a fluency feature, or a prosodic feature;
and scoring according to the evaluation features comprises:
for each group of evaluation features, loading the score prediction model corresponding to the characteristic type of that group;
calculating the similarity of the evaluation features with respect to the score prediction model, and taking the similarity as the score of that group of evaluation features;
calculating the score of the voice signal according to the scores of the groups of evaluation features.
An oral evaluation system, comprising:
a receiver module, configured to receive a voice signal to be evaluated;
a voice snippet acquisition module, configured to use at least two different speech recognition systems to respectively obtain the voice snippet corresponding to each basic voice unit in the voice signal;
a fusion module, configured to fuse the voice snippets obtained by the voice snippet acquisition module, to obtain an effective voice snippet sequence corresponding to the voice signal;
a feature extraction module, configured to extract evaluation features from the effective voice snippet sequence;
a scoring module, configured to score according to the evaluation features.
Preferably, the fusion module comprises:
a matching unit, configured to respectively perform dynamic matching between the text corresponding to the voice snippets obtained by the different speech recognition systems and a pre-built standard answer network, to obtain optimum matching results;
a set generation unit, configured to generate sets of different corresponding units in turn according to the optimum matching results, a corresponding unit being a recognition result unit that correctly matches the standard answer network and whose voice snippets, obtained by the different speech recognition systems, overlap in time;
a determining unit, configured to determine the optimum unit in each set;
a concatenation unit, configured to splice the optimum units of the sets in turn, to obtain the effective voice snippet sequence corresponding to the voice signal.
Preferably, the determining unit comprises:
a computing unit, configured to respectively calculate the acoustic model probability or the pronunciation posterior probability of the voice snippet of each corresponding unit in the set;
a selection unit, configured to select the corresponding unit with the maximum acoustic model probability or pronunciation posterior probability as the optimum unit in the set.
Preferably, the evaluation features correspond to one characteristic type, the characteristic type being any one of the following: a completeness feature, a pronunciation accuracy feature, a fluency feature, or a prosodic feature;
and the scoring module comprises:
a loading unit, configured to load the score prediction model corresponding to the characteristic type of the evaluation features;
a computing unit, configured to calculate the similarity of the evaluation features with respect to the score prediction model, and take the similarity as the score of the voice signal.
Preferably, the evaluation features comprise at least two groups of evaluation features corresponding to different characteristic types, each characteristic type being any one of the following: a completeness feature, a pronunciation accuracy feature, a fluency feature, or a prosodic feature;
and the scoring module comprises:
a loading unit, configured to load, for each group of evaluation features, the score prediction model corresponding to the characteristic type of that group;
a first computing unit, configured to calculate the similarity of the evaluation features with respect to the score prediction model, and take the similarity as the score of that group of evaluation features;
a second computing unit, configured to calculate the score of the voice signal according to the scores of the groups of evaluation features.
The oral evaluation method and system provided by the embodiments of the present invention use multiple speech recognition methods to recognize the voice signal to be evaluated, obtaining multiple voice snippet sequences; these sequences are then fused to obtain an effective voice snippet sequence, and finally oral evaluation is performed on the effective voice snippet sequence to obtain an evaluation result. By improving the accuracy of the speech recognition results and the validity and rationality of the oral evaluation, the method and system greatly reduce the proportion of abnormal scores, and thus better meet the application demands of large-scale spoken language tests.
Embodiment
To enable those skilled in the art to better understand the scheme of the embodiments of the present invention, the embodiments are described in further detail below in conjunction with the drawings.
In the prior art, environmental interference can degrade speech recognition accuracy, which causes a certain proportion of abnormally scored utterances during oral evaluation. To address this, the embodiments of the present invention provide an oral evaluation method and system: first, multiple speech recognition methods are used to recognize the voice signal to be evaluated, obtaining multiple voice snippet sequences; these sequences are then fused to obtain an effective voice snippet sequence, and finally spoken-language scoring is performed on the effective voice snippet sequence to obtain an evaluation result.
As shown in Figure 1, the oral evaluation method of an embodiment of the present invention comprises the following steps:
Step 101: receive the voice signal to be evaluated.
Step 102: use at least two different speech recognition systems to respectively obtain the voice snippet corresponding to each basic voice unit in the voice signal.
The basic voice unit may be a syllable, a phoneme, etc. The different speech recognition systems decode the voice signal based on different acoustic features (e.g., an acoustic model based on MFCC features, an acoustic model based on PLP features) or using different acoustic models (e.g., a discriminatively trained HMM-GMM acoustic model, a DBN-based neural network acoustic model). In this way, the voice snippet sequences corresponding to the voice signal are obtained.
Specifically, for voice signals without text annotation, such as answers to question-and-answer items, the text corresponding to the voice signal and the snippet of each corresponding basic voice unit can be obtained by continuous speech recognition. For voice signals with a standard answer, such as read-aloud items, a voice alignment method is used to obtain the time boundary of each basic voice unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to some extent.
Step 103: fuse the obtained voice snippets to obtain the effective voice snippet sequence corresponding to the voice signal.
A single speech recognition system may produce partially erroneous recognition results, whereas a combination of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the voice snippets therefore improves the accuracy and rationality of scoring each voice snippet.
In embodiments of the present invention, the text corresponding to the voice snippets obtained by the different speech recognition systems can first be dynamically matched against the pre-built standard answer network to obtain the optimum matching results. Specifically, the DTW (Dynamic Time Warping) algorithm can be used to compute the cumulative probability of each path of the text through the standard answer network, and the path with the maximum probability is selected as the optimal path when the search ends. For example, if the recognition result obtained by speech recognition system 1 is "ABCDE", matching it against the standard answer network yields the optimum matching result "A(+)BC(+)D(+)E(+)", i.e., units A, C, D, and E match the answer while B does not.
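The patent gives no implementation of this matching step; the following is a minimal sketch of the idea, using a plain edit-distance alignment against a flat answer token list in place of the full standard answer network. The function name, token lists, and "(+)" marking are illustrative assumptions.

```python
def match_against_answer(recognized, answer):
    """Align recognized units to the answer by edit distance and mark
    units that match the answer with '(+)'. A flat token list stands
    in for the standard answer network of the embodiment."""
    n, m = len(recognized), len(answer)
    # dp[i][j] = minimum edit cost aligning recognized[:i] to answer[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (recognized[i - 1] != answer[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back to find which recognized units matched the answer.
    matched = set()
    i, j = n, m
    while i > 0 and j > 0:
        if recognized[i - 1] == answer[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            matched.add(i - 1)
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return [u + "(+)" if k in matched else u
            for k, u in enumerate(recognized)]

print(match_against_answer(list("ABCDE"), list("AFCDE")))
# ['A(+)', 'B', 'C(+)', 'D(+)', 'E(+)']
```

With the answer taken as "AFCDE", this reproduces the marking of the example: A, C, D, and E match while B does not.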
Then an effective unit sequence is generated from the combined optimum matching results, an effective unit being a recognition result unit that matches the standard answer network and whose voice snippets, obtained by the different speech recognition systems, overlap in time. The effective voice snippet corresponding to each effective unit is determined, and the voice snippets corresponding to the optimum units are spliced in turn to obtain the effective voice snippet sequence corresponding to the voice signal.
When determining the effective voice snippet corresponding to each effective unit, the same effective unit and its corresponding voice snippet may appear in several recognition results. Therefore, sets of different corresponding units can first be generated in turn according to the optimum matching results; for each set, the acoustic model probability or pronunciation posterior probability of the voice snippet corresponding to each unit is calculated, and the corresponding unit with the maximum probability score is selected as the optimum unit of that set. The voice snippets corresponding to the optimum units of the sets are then spliced in chronological order to obtain the effective voice snippet sequence corresponding to the voice signal.
For example, suppose two speech recognition systems output the recognition results shown in Figure 2: speech recognition system 1 obtains "ABCDE" and speech recognition system 2 obtains "AFCGE". Matching the two recognition results against the standard answer network yields the optimum matching results "A(+)BC(+)D(+)E(+)" and "A(+)F(+)C(+)GE(+)", where "(+)" marks a unit that matches the standard answer, i.e., a correct recognition result. The vertical lines in Figure 2 mark the time boundaries of the voice snippets.
The effective voice snippet sequence obtained by fusion is "AFCDE". Clearly, the accuracy of the fused recognition result is markedly higher than that of either single speech recognition system.
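The fusion just described can be sketched as follows. The tuple layout (unit, start, end, acoustic log-probability, matched-answer flag) and the example timings and scores are illustrative assumptions, not taken from the patent.

```python
# Each recognition result: list of (unit, start, end, acoustic_log_prob, matched)
sys1 = [("A", 0.0, 0.4, -1.0, True), ("B", 0.4, 0.9, -3.0, False),
        ("C", 0.9, 1.3, -1.2, True), ("D", 1.3, 1.8, -0.8, True),
        ("E", 1.8, 2.2, -1.1, True)]
sys2 = [("A", 0.0, 0.4, -1.1, True), ("F", 0.4, 0.9, -1.4, True),
        ("C", 0.9, 1.3, -1.5, True), ("G", 1.3, 1.8, -2.9, False),
        ("E", 1.8, 2.2, -1.0, True)]

def fuse(*results):
    """Group time-overlapping units that match the answer network into
    sets, keep the unit with the maximum acoustic score in each set,
    and splice the survivors in chronological order."""
    def overlaps(a, b):
        return min(a[2], b[2]) > max(a[1], b[1])
    # Candidate units are those that matched the standard answer.
    candidates = [u for r in results for u in r if u[4]]
    sets = []
    for u in sorted(candidates, key=lambda u: u[1]):
        if sets and overlaps(sets[-1][-1], u):
            sets[-1].append(u)
        else:
            sets.append([u])
    best = [max(s, key=lambda u: u[3]) for s in sets]
    return [u[0] for u in best]

print(fuse(sys1, sys2))  # ['A', 'F', 'C', 'D', 'E']
```

With these scores, the unmatched B and G are discarded, F is recovered from system 2, and D from system 1, reproducing the fused sequence "AFCDE" of the example.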
Step 104: extract evaluation features from the effective voice snippet sequence.
It should be noted that, in practical applications, evaluation features of a single characteristic type, such as a completeness feature, pronunciation accuracy feature, fluency feature, or prosodic feature, can be extracted as the application requires, and scoring performed according to those evaluation features.
Of course, evaluation features of multiple characteristic types can also be extracted; that is, two or more groups of evaluation features can be extracted simultaneously, each group corresponding to one characteristic type, such as the completeness feature, pronunciation accuracy feature, fluency feature, or prosodic feature.
The completeness feature describes the textual completeness, relative to the standard answer, of the voice unit sequence corresponding to the voice snippet sequence.
In embodiments of the present invention, the basic voice unit sequence can be matched against the pre-built standard answer network to obtain the optimal path, and the matching degree between the optimal path and the voice unit sequence taken as the completeness feature.
It should be noted that the form of the standard answer network can differ for different item types: for a read-aloud item it is simply the prompt text, for a question-and-answer item it is a set of key words, and for a picture-description or statement item it is a set of key sentences.
Because the answers to question-and-answer items, statement items, and the like are somewhat uncertain, these are semi-open item types; their standard answers often provide multiple alternative answers built around key words, and the standard answer network then takes the form of multiple answer entries.
For open item types, the standard answer often comprises sentences containing key words. Key words are clearly more important than auxiliary words, so a larger weight can be assigned to the key words and a smaller weight to the auxiliary words, improving the rationality of the semantic matching. Therefore, for open item types, a weighted standard answer network can also be built according to the occurrence probability of the key words in each standard answer; the path with the highest similarity to the voice unit sequence is then found in the standard answer network, and the matching degree of each voice unit in the voice unit sequence that agrees with a unit on the optimal path is taken as the completeness feature, the matching degree being the weight assigned to each matched voice unit.
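As a rough illustration of this weighted completeness idea (the word list and weights are hypothetical, and a plain membership test stands in for the optimal-path search through the answer network):

```python
def completeness(voice_units, answer_weights):
    """Weighted completeness: total weight of the answer units the
    speaker actually produced, divided by the total weight of the
    answer. answer_weights maps each answer unit to its preset weight
    (key words larger, auxiliary words smaller)."""
    hit = sum(w for unit, w in answer_weights.items() if unit in voice_units)
    return hit / sum(answer_weights.values())

# Hypothetical answer: key words weighted 0.3, auxiliary words 0.05.
weights = {"weather": 0.3, "sunny": 0.3, "today": 0.3,
           "the": 0.05, "is": 0.05}
print(completeness(["the", "weather", "is", "sunny"], weights))  # 0.7
```

An answer missing the key word "today" loses 0.3 of the total weight 1.0, while missing an auxiliary word would cost only 0.05, matching the intent of weighting key words more heavily.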
The pronunciation accuracy feature describes how standard the pronunciation of each voice snippet is. Specifically, the similarity of each voice snippet with respect to a preset pronunciation acoustic model can be calculated, and that similarity taken as the pronunciation accuracy feature.
The fluency feature describes the smoothness of the user's utterance, including but not limited to the average speech rate of the utterance (e.g., the ratio of voice duration to the number of voice units), the average run length of the utterance, the proportion of effective pauses in the utterance, etc. In addition, to compensate for speech-rate differences between speakers, phoneme duration features can be used, in which the durations of all pronounced parts are normalized before being combined into the fluency feature. Specifically, by collecting the discrete probability distribution of context-independent phoneme durations and computing the log probability of each normalized duration, a duration score is obtained for each phoneme.
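A toy sketch of the phoneme-duration part of the fluency feature, assuming a pre-collected discrete duration distribution; the bin scheme, distribution values, and back-off floor are illustrative assumptions rather than the patent's actual procedure.

```python
import math

def phone_duration_score(durations, dist):
    """Average log probability of rate-normalised phoneme durations
    under a discrete duration distribution collected offline.
    dist maps a duration bin to its probability."""
    # Normalise out the speaker's overall speech rate.
    mean = sum(durations) / len(durations)
    floor = min(dist.values()) * 1e-2          # back-off for unseen bins
    logps = []
    for d in durations:
        bin_ = round(d / mean * 10)            # 10 = average-duration bin
        logps.append(math.log(dist.get(bin_, floor)))
    return sum(logps) / len(logps)

# Hypothetical duration distribution peaked at the average bin.
dist = {8: 0.1, 9: 0.2, 10: 0.35, 11: 0.2, 12: 0.1, 13: 0.05}
print(phone_duration_score([9, 10, 11, 10, 10], dist))   # regular speaker
print(phone_duration_score([3, 20, 4, 18, 5], dist))     # erratic durations
```

A speaker with regular phoneme durations scores higher (a less negative log probability) than one with erratic durations, even when both have the same average rate.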
The prosodic feature describes the rhythm of the user's pronunciation, including features such as the rise and fall of pitch. Specifically, the fundamental frequency (F0) sequence of each voice snippet can be extracted and its dynamics, such as the first-order and second-order differences, taken as the prosodic feature.
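The first- and second-order pitch differences can be computed as simply as the following (the F0 values in Hz are hypothetical):

```python
def prosodic_features(f0):
    """First- and second-order differences of an F0 sequence, a simple
    stand-in for the pitch-dynamics features described above."""
    d1 = [b - a for a, b in zip(f0, f0[1:])]   # first-order difference
    d2 = [b - a for a, b in zip(d1, d1[1:])]   # second-order difference
    return d1, d2

d1, d2 = prosodic_features([120.0, 125.0, 133.0, 130.0])
print(d1)  # [5.0, 8.0, -3.0]
print(d2)  # [3.0, -11.0]
```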
The evaluation features of these different characteristic types describe the current user's pronunciation from different angles and are complementary to some extent.
Step 105: score according to the evaluation features.
For the evaluation features of each characteristic type, the corresponding score prediction model can be loaded and the similarity of the evaluation features with respect to that score prediction model calculated.
It should be noted that, in practical applications, the score prediction model can also be loaded according to the item type; the score prediction models for the same characteristic type under different item types may be identical or different, which further improves the granularity and accuracy of the scoring. The construction of each score prediction model is described in detail below.
If evaluation features of only one characteristic type are extracted, the similarity of those evaluation features with respect to the score prediction model, calculated as above, can be taken directly as the score of the voice signal.
If evaluation features of multiple characteristic types are extracted, the similarities calculated above are taken as the scores of the corresponding groups of evaluation features, and the score of the voice signal is then calculated from the scores of the groups. Specifically, considering that in practice the scores of the different types of evaluation features are correlated to some extent, a linear-regression combination can be used to calculate the total score of the voice signal as follows:
Score = w_1·s_1 + w_2·s_2 + … + w_N·s_N
where w_i is the weight of the i-th group of evaluation features, a positive number preset by the system and satisfying w_1 + w_2 + … + w_N = 1; s_i is the score of the i-th group of evaluation features; and N is the number of groups.
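The weighted combination of group scores reduces to a few lines; the weights and group scores below are hypothetical:

```python
def total_score(scores, weights):
    """Linear combination of per-feature-type scores; the weights are
    positive and must sum to 1, as described above."""
    assert all(w > 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical group scores: completeness, accuracy, fluency, prosody.
print(total_score([85.0, 90.0, 70.0, 80.0], [0.3, 0.3, 0.2, 0.2]))  # 82.5
```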
It can be seen that the oral evaluation method of the embodiments of the present invention uses multiple speech recognition methods to recognize the voice signal to be evaluated, obtaining multiple voice snippet sequences; these sequences are then fused to obtain an effective voice snippet sequence, and finally oral evaluation is performed on the effective voice snippet sequence to obtain an evaluation result. By improving the accuracy of the speech recognition results and the validity and rationality of the oral evaluation, the method greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
As mentioned above, when calculating the score of the evaluation features, the score prediction model corresponding to the characteristic type of those evaluation features must be loaded. It should be noted that the score prediction model can be built offline in advance.
As shown in Figure 3, building the score prediction model in an embodiment of the present invention comprises the following steps:
Step 301: collect scoring training data.
Specifically, answer speech data from multiple users can be collected for each exercise item as the scoring training data.
Step 302: manually annotate the training data, including text annotation, segmentation, and manual oral-evaluation marking.
Text annotation is the transcription from speech to text. Segmentation divides the continuous voice signal, under human supervision, to determine the voice snippet corresponding to each basic voice unit. Manual oral-evaluation marking scores the spoken language proficiency by human listening.
In practical applications, each of the different evaluation features described above, including the completeness feature, pronunciation accuracy feature, fluency feature, and prosodic feature, can be marked separately.
Step 303: extract the evaluation features of the different characteristic types from the annotation results.
That is, according to the basic voice units in the annotation results and their corresponding voice snippets, the evaluation features of the different characteristic types are extracted from the voice snippets in the manner described above.
Step 304: use the evaluation features to build the score prediction model for each characteristic type.
Specifically, the parameters of each score prediction model can be obtained by training a prediction technique under the guidance of the manual scores. Furthermore, score prediction models related to the item type can also be built separately for different test item types.
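The patent does not specify the prediction technique. As one possible sketch, model parameters can be fitted against the manual marks by least squares via the normal equations; the tiny data set below is fabricated for illustration.

```python
def fit_weights(features, human_scores):
    """Least-squares fit of per-feature weights against manual marks,
    via the normal equations; a toy stand-in for the offline training
    of the score prediction model."""
    n = len(features[0])
    # Normal equations: (X^T X) w = X^T y, solved by Gaussian elimination.
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(n)]
           for i in range(n)]
    xty = [sum(f[i] * y for f, y in zip(features, human_scores))
           for i in range(n)]
    # Forward elimination (no pivoting checks; fine for a toy example).
    for col in range(n):
        pivot = xtx[col][col]
        for row in range(col + 1, n):
            factor = xtx[row][col] / pivot
            for j in range(col, n):
                xtx[row][j] -= factor * xtx[col][j]
            xty[row] -= factor * xty[col]
    # Back substitution.
    w = [0.0] * n
    for row in reversed(range(n)):
        w[row] = (xty[row] - sum(xtx[row][j] * w[j]
                                 for j in range(row + 1, n))) / xtx[row][row]
    return w

# Two evaluation features per utterance, manual marks as targets.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]
print(fit_weights(X, y))  # close to [1.0, 2.0]
```

Here the marks were generated exactly by the weights (1, 2), so the fit recovers them; real training data would of course be noisy.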
Correspondingly, an embodiment of the present invention also provides an oral evaluation system; Figure 4 shows a structural schematic of this system.
In this embodiment, the system comprises a receiver module 401, a voice snippet acquisition module 402, a fusion module 403, a feature extraction module 404, and a scoring module 405, wherein:
the receiver module 401 is configured to receive the voice signal to be evaluated;
the voice snippet acquisition module 402 is configured to use at least two different speech recognition systems to respectively obtain the voice snippet corresponding to each basic voice unit in the voice signal.
For voice signals without text annotation, such as answers to question-and-answer items, the text corresponding to the voice signal and the snippet of each corresponding basic voice unit can be obtained by continuous speech recognition. For voice signals with a standard answer, such as read-aloud items, a voice alignment method can be used to obtain the time boundary of each basic voice unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to some extent.
The fusion module 403 is configured to fuse the voice snippets obtained by the voice snippet acquisition module 402, to obtain the effective voice snippet sequence corresponding to the voice signal.
The feature extraction module 404 is configured to extract evaluation features from the effective voice snippet sequence.
The scoring module 405 is configured to score according to the evaluation features.
A single speech recognition system may produce partially erroneous recognition results, whereas a combination of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the voice snippets therefore improves the accuracy and rationality of scoring each voice snippet.
To this end, in an embodiment of the present invention, a specific implementation structure of the fusion module 403 is shown in Figure 5.
In this embodiment, the fusion module comprises:
a matching unit 501, configured to respectively perform dynamic matching between the text corresponding to the voice snippets obtained by the different speech recognition systems and the pre-built standard answer network, to obtain optimum matching results;
a set generation unit 502, configured to generate sets of different corresponding units in turn according to the optimum matching results, a corresponding unit being a recognition result unit that correctly matches the standard answer network and whose voice snippets, obtained by the different speech recognition systems, overlap in time;
a determining unit 503, configured to determine the optimum unit in each set;
a concatenation unit 504, configured to splice the optimum units of the sets in turn, to obtain the effective voice snippet sequence corresponding to the voice signal.
The determining unit 503 may comprise a computing unit and a selection unit (not shown), wherein the computing unit is configured to respectively calculate the acoustic model probability or the pronunciation posterior probability of the voice snippet of each corresponding unit in the set, and the selection unit is configured to select the corresponding unit with the maximum acoustic model probability or pronunciation posterior probability as the optimum unit in the set.
Through the fusion of the voice snippets by the fusion module, the accuracy of the fused recognition result is considerably higher than that of any single speech recognition system.
It can be seen that the oral evaluation system of the embodiments of the present invention uses multiple speech recognition methods to recognize the voice signal to be evaluated, obtaining multiple voice snippet sequences; these sequences are then fused to obtain an effective voice snippet sequence, and finally oral evaluation is performed on the effective voice snippet sequence to obtain an evaluation result. By improving the accuracy of the speech recognition results and the validity and rationality of the oral evaluation, the system greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
It should be noted that, in practical applications, the feature extraction module 404 can extract evaluation features of a single characteristic type as the application requires, such as a completeness feature, pronunciation accuracy feature, fluency feature, or prosodic feature, with scoring performed according to those evaluation features. Of course, evaluation features of multiple characteristic types can also be extracted; that is, two or more groups of evaluation features can be extracted simultaneously, each group corresponding to one characteristic type, such as the completeness feature, pronunciation accuracy feature, fluency feature, or prosodic feature.
The specific meaning and extraction of each type of evaluation feature have been described above and are not repeated here. The evaluation features of these different characteristic types describe the current user's pronunciation from different angles and are complementary to some extent.
Specific implementations of the scoring module for the different cases of extracted evaluation features are described below.
As shown in Figure 6, one specific implementation structure of the scoring module in an embodiment of the present invention is as follows.
In this embodiment, the scoring module comprises:
a loading unit 601, configured to load the score prediction model corresponding to the characteristic type of the evaluation features;
a computing unit 602, configured to calculate the similarity of the evaluation features with respect to the score prediction model, and take the similarity as the score of the voice signal.
The scoring module of this embodiment, for the evaluation features of a single characteristic type extracted by the feature extraction module, calculates the similarity of those evaluation features with respect to the score prediction model and takes the similarity as the score of the voice signal.
As shown in Figure 7, another specific implementation structure of the scoring module in an embodiment of the present invention is as follows.
In this embodiment, the scoring module comprises:
a loading unit 701, configured to load, for each group of evaluation features, the score prediction model corresponding to the characteristic type of that group;
a first computing unit 702, configured to calculate the similarity of the evaluation features with respect to the score prediction model, and take the similarity as the score of that group of evaluation features;
a second computing unit 703, configured to calculate the score of the voice signal according to the scores of the groups of evaluation features.
Considering that the scores of the different types of evaluation features are correlated to some extent, the second computing unit 703 can use a linear-regression combination to calculate the total score of the voice signal as follows:
Score = w_1·s_1 + w_2·s_2 + … + w_N·s_N
where w_i is the weight of the i-th group of evaluation features, a positive number preset by the system and satisfying w_1 + w_2 + … + w_N = 1; s_i is the score of the i-th group of evaluation features; and N is the number of groups.
The scoring module of this embodiment, for the evaluation features of the multiple characteristic types extracted by the feature extraction module, calculates the similarity of each group of evaluation features with respect to its score prediction model to obtain the score of each group, and then calculates the score of the voice signal from the group scores, which further improves the validity and rationality of the oral evaluation and significantly reduces the proportion of abnormal scores.
It should be noted that the score prediction models corresponding to the characteristic types of the different evaluation features can be built offline in advance, as described in detail above and not repeated here.
The embodiments in this specification are described progressively; for identical or similar parts the embodiments can be consulted against one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are basically similar to the method embodiments and are therefore described more simply; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are merely schematic: the modules or units described as separate components may or may not be physically separate, and the components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to realize the purpose of the embodiments. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The component embodiments of the present invention can be realized in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that, in practice, a microprocessor or a digital signal processor (DSP) can be used to realize some or all of the functions of some or all of the components of the oral evaluation system according to the embodiments of the present invention. The present invention can also be embodied as a device or apparatus program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program realizing the present invention can be stored on a computer-readable medium or can take the form of one or more signals; such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to elaborate the invention; the above description of the embodiments is only meant to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes can be made to the specific implementations and the scope of application according to the idea of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.