CN103559894A - Method and system for evaluating spoken language - Google Patents

Method and system for evaluating spoken language

Info

Publication number
CN103559894A
Authority
CN
China
Prior art keywords
voice
evaluation
feature
score
test feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310554431.8A
Other languages
Chinese (zh)
Other versions
CN103559894B (en)
Inventor
Wang Shijin (王士进)
Liu Dan (刘丹)
Wei Si (魏思)
Hu Yu (胡郁)
Liu Qingfeng (刘庆峰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xunfei Yi Heard Network Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201310554431.8A priority Critical patent/CN103559894B/en
Publication of CN103559894A publication Critical patent/CN103559894A/en
Application granted granted Critical
Publication of CN103559894B publication Critical patent/CN103559894B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to the technical field of speech signal processing, and discloses a method and system for evaluating spoken language. The method comprises: receiving a speech signal to be evaluated; obtaining, through at least two different speech recognition systems, the speech segments corresponding to the basic speech units in the speech signal; fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal; extracting evaluation features from the valid speech segment sequence; and scoring according to the evaluation features. By means of this method and system, the accuracy of spoken language evaluation can be improved and abnormal scores can be reduced.

Description

Method and system for spoken language evaluation
Technical field
The present invention relates to the field of speech signal processing, and in particular to a method and system for spoken language evaluation.
Background art
As an important medium of interpersonal communication, spoken language occupies an extremely important position in everyday life. With continuing socioeconomic development and globalization, people place ever higher demands on the efficiency of language learning and on the objectivity, fairness, and scalability of language assessment. Traditional manual evaluation of spoken language proficiency constrains teachers and students in teaching time and space, and suffers from hardware gaps and imbalances in qualified teaching staff, teaching facilities, and funding. Manual evaluation also cannot avoid the individual bias of each rater, so uniform scoring standards cannot be guaranteed, and sometimes a test-taker's true proficiency cannot be accurately reflected. Moreover, large-scale oral examinations require substantial human, material, and financial resources, limiting regular, large-scale assessment. For these reasons, the industry has successively developed a number of language teaching and evaluation systems.
In the prior art, a spoken language evaluation system usually applies a single recognizer to the received speech signal, performing speech recognition (for question-and-answer items) or speech-text alignment (for read-aloud items), to obtain the speech segment corresponding to each basic speech unit. The system then extracts, from each speech segment, features that measure spoken language evaluation criteria such as pronunciation accuracy and fluency for each basic speech unit, and finally obtains the final evaluation score from these features by predictive analysis.
When a high-fidelity recording device is used in a quiet environment, the speech recognition system provides high recognition accuracy, so the subsequent spoken language evaluation can also produce fairly objective and accurate results. In practical applications, however, particularly in large-scale spoken language tests, the recording environment is inevitably affected by factors such as examination-room noise and ambient noise; recognition accuracy declines, causing a certain proportion of abnormally scored utterances in the evaluation process. This phenomenon clearly makes automatic computer scoring difficult to put into real use in large-scale spoken language tests, limits the application scope and adoption of spoken language evaluation systems, and prevents their use in many high-stakes examinations, since any abnormal score would cause a grading incident.
Summary of the invention
The embodiments of the present invention provide a method and system for spoken language evaluation, so as to improve the accuracy of spoken language evaluation and reduce abnormal scoring.
To this end, the present invention provides the following technical solutions:
A spoken language evaluation method, comprising:
receiving a speech signal to be evaluated;
using at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal;
extracting evaluation features from the valid speech segment sequence; and
scoring according to the evaluation features.
Preferably, fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal comprises:
dynamically matching the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
generating, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
determining the optimal unit in each set; and
splicing the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
Preferably, determining the optimal unit in each set comprises:
calculating the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
selecting the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit in the set.
Preferably, the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
loading the score prediction model corresponding to the feature type of the evaluation features; and
calculating the similarity of the evaluation features to the score prediction model, and taking the similarity as the score of the speech signal.
Preferably, the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
for each group of evaluation features, loading the score prediction model corresponding to the feature type of that group;
calculating the similarity of the group of evaluation features to the score prediction model, and taking the similarity as the score of that group; and
calculating the score of the speech signal from the scores of the groups of evaluation features.
A spoken language evaluation system, comprising:
a receiving module, configured to receive a speech signal to be evaluated;
a speech segment acquisition module, configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
a fusion module, configured to fuse the speech segments obtained by the speech segment acquisition module, to obtain a valid speech segment sequence corresponding to the speech signal;
a feature extraction module, configured to extract evaluation features from the valid speech segment sequence; and
a scoring module, configured to score according to the evaluation features.
Preferably, the fusion module comprises:
a matching unit, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit, configured to determine the optimal unit in each set; and
a splicing unit, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
Preferably, the determining unit comprises:
a calculation unit, configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
a selection unit, configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
Preferably, the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load the score prediction model corresponding to the feature type of the evaluation features; and
a calculation unit, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
Preferably, the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group; and
a second calculation unit, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
The spoken language evaluation method and system provided by the embodiments of the present invention apply multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the method and system greatly reduce the proportion of abnormal scores, and thus better meet the application demands of large-scale spoken language tests.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application and of the prior art more clearly, the drawings needed by the embodiments are briefly introduced below. The drawings described below are clearly only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of the spoken language evaluation method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the matching information of the recognition results of different speech recognition systems in an embodiment of the present invention;
Fig. 3 is a flowchart of building a score prediction model in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the spoken language evaluation system of an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of one implementation of the fusion module in the spoken language evaluation system of an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of one implementation of the scoring module in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another implementation of the scoring module in an embodiment of the present invention.
Detailed description of embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
To address the prior-art problem that environmental factors degrade speech recognition accuracy and thereby cause a certain proportion of abnormally scored utterances during spoken language evaluation, the embodiments of the present invention provide a method and system for spoken language evaluation: multiple speech recognition methods are first applied to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence; finally, spoken language scoring is performed on the valid speech segment sequence to obtain the evaluation result.
Fig. 1 is a flowchart of the spoken language evaluation method of an embodiment of the present invention, comprising the following steps:
Step 101: receive a speech signal to be evaluated.
Step 102: use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal.
The basic speech units may be syllables, phonemes, and the like. The different speech recognition systems decode the speech signal based on different acoustic features (e.g., an acoustic model based on MFCC features, or an acoustic model based on PLP features) or with different acoustic model types (e.g., a discriminatively trained HMM-GMM acoustic model, or a DBN-based neural network acoustic model). In this way, speech segment sequences corresponding to the speech signal can be obtained.
Specifically, for speech signals without a text transcript, such as answers to open questions, continuous speech recognition is used to obtain the text corresponding to the speech signal and the segment of each corresponding basic speech unit; for speech signals with a standard answer, such as read-aloud items, forced alignment of the speech to the text is used to obtain the time boundary of each basic speech unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to a certain extent.
Step 103: fuse the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal.
A single speech recognition system may produce partially erroneous recognition results, whereas a set of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the candidate speech segments then improves the accuracy and soundness of the scoring of each speech segment.
In the embodiments of the present invention, the text corresponding to the speech segments obtained by each speech recognition system may first be dynamically matched against a pre-built standard answer network to obtain an optimal matching result. Specifically, a DTW (Dynamic Time Warping) algorithm may be used to compute the cumulative probability of each partial path of the text through the standard answer network, and the path with the largest probability when the search ends is selected as the optimal path. For example, if the recognition result obtained by speech recognition system 1 is "A B C D E", matching it against the standard answer network may yield the optimal matching result "A(+) B C(+) D(+) E(+)", i.e., units A, C, D, and E match the answer while B does not.
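By way of illustration only (this sketch is not part of the patent text; flattening the answer network to a single token path and using uniform edit costs are assumptions made for the sketch), the matching step can be pictured as an edit-distance style dynamic program that flags which recognized units match the answer:

```python
def match_against_answer(recognized, answer):
    """Align a recognized unit sequence with one flattened answer path
    using a Levenshtein-style dynamic program, then flag the recognized
    units that align to identical answer units (the '(+)' marks)."""
    n, m = len(recognized), len(answer)
    # dp[i][j] = minimal edit cost aligning recognized[:i] with answer[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (recognized[i - 1] != answer[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back to recover which recognized units matched the answer.
    flags, i, j = [False] * n, n, m
    while i > 0 and j > 0:
        if recognized[i - 1] == answer[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            flags[i - 1] = True
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return flags

# Reproduces the example above: B is the only unmatched unit.
print(match_against_answer(list("ABCDE"), list("ACDE")))
# [True, False, True, True, True]
```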
Then, an effective unit sequence is generated from the combined optimal matching results, where an effective unit is a recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time. The valid speech segment corresponding to each effective unit is determined, and the speech segments corresponding to these optimal units are spliced in order, yielding the valid speech segment sequence corresponding to the speech signal.
When determining the valid speech segment of each effective unit, several recognition results may all contain the effective unit together with a corresponding speech segment. Therefore, the sets of corresponding units may first be generated, in order, according to the optimal matching results; the acoustic model probability or pronunciation posterior probability of the speech segment of each unit in a set is calculated, and the unit with the largest probability score is selected as the optimal unit of that set. The speech segments corresponding to the optimal units of all sets are then spliced in chronological order, giving the valid speech segment sequence corresponding to the speech signal.
For example, suppose two speech recognition systems output the recognition results shown in Fig. 2: system 1 produces "A B C D E" and system 2 produces "A F C G E". Matching the two recognition results against the standard answer network yields the optimal matching results "A(+) B C(+) D(+) E(+)" and "A(+) F(+) C(+) G E(+)", where "(+)" marks a unit that matches the standard answer, i.e., a correct recognition result. In Fig. 2, the vertical lines indicate the time boundaries of the speech segments.
The valid speech segment sequence obtained by fusion is "A F C D E". The accuracy of the fused recognition result is clearly much higher than that of any single speech recognition system.
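Purely as an illustration of this fusion step (the time spans and probability scores below are hypothetical, not taken from the patent), a sketch that groups time-overlapping matched units across recognizers and keeps the highest-scoring segment in each group:

```python
from dataclasses import dataclass

@dataclass
class Unit:
    label: str        # recognized basic speech unit
    start: float      # segment start time (s)
    end: float        # segment end time (s)
    score: float      # acoustic model log-probability or posterior
    matched: bool     # True if the unit matched the answer network

def overlaps(a, b):
    return min(a.end, b.end) - max(a.start, b.start) > 0.0

def fuse(hypotheses):
    """hypotheses: one list of Units per recognizer, each in time order.
    Greedily group time-overlapping matched units across recognizers,
    then keep the best-scoring unit of each group, in time order."""
    matched = sorted((u for h in hypotheses for u in h if u.matched),
                     key=lambda u: u.start)
    groups = []
    for u in matched:
        if groups and overlaps(groups[-1][-1], u):
            groups[-1].append(u)
        else:
            groups.append([u])
    return [max(g, key=lambda u: u.score) for g in groups]

# Hypothetical numbers echoing the Fig. 2 example.
sys1 = [Unit("A", 0.0, 0.4, -1.0, True), Unit("B", 0.4, 0.8, -3.0, False),
        Unit("C", 0.8, 1.2, -1.2, True), Unit("D", 1.2, 1.6, -0.9, True),
        Unit("E", 1.6, 2.0, -1.1, True)]
sys2 = [Unit("A", 0.0, 0.4, -1.4, True), Unit("F", 0.4, 0.8, -0.8, True),
        Unit("C", 0.8, 1.2, -1.5, True), Unit("G", 1.2, 1.6, -2.0, False),
        Unit("E", 1.6, 2.0, -1.0, True)]
print("".join(u.label for u in fuse([sys1, sys2])))  # AFCDE
```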
Step 104: extract evaluation features from the valid speech segment sequence.
It should be noted that, in practical applications, evaluation features of a single feature type may be extracted as the application requires, for example completeness features, pronunciation accuracy features, fluency features, or prosodic features, with scoring performed on those evaluation features.
Of course, evaluation features of multiple feature types may also be extracted at the same time; that is, two or more groups of evaluation features may be extracted, each group corresponding to one feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features.
The completeness features describe how completely the speech unit sequence corresponding to the speech segment sequence covers the text of the standard answer.
In the embodiments of the present invention, the basic speech unit sequence may be matched against the pre-built standard answer network to obtain an optimal path, and the matching degree between the optimal path and the speech unit sequence is taken as the completeness feature.
It should be noted that the form of the standard answer network may differ across question types: for a read-aloud item it is simply the prompt text, for a question-and-answer item it is a set of key words, and for a picture-description or statement item it is a set of key sentences.
Question-and-answer items, statement items, and the like, whose answers carry a degree of uncertainty, belong to semi-open question types; their standard answers are therefore often organized as several alternative answers built around key words, and the standard answer network may formally be a list of answer entries.
For open question types, the standard answer usually consists of sentences containing key words. The key words are clearly more important than the other auxiliary words, so larger weights can be assigned to key words and smaller weights to auxiliary words, improving the soundness of the semantic match. Hence, for open question types, a weighted standard answer network may also be built from the occurrence probability of each key word across the standard answers; a search then finds the path in the standard answer network with the highest similarity to the speech unit sequence, and the matching degree of the speech units that agree with the units on the optimal path is taken as the completeness feature, the matching degree being the weight of each matched speech unit.
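For illustration only (the answer units and weights below are hypothetical, and a single flattened answer entry stands in for the full answer network), a minimal sketch of such a weighted completeness feature:

```python
def completeness(units, answer_weights):
    """units: recognized basic speech units, in order.
    answer_weights: hypothetical weight per answer unit (key words high,
    auxiliary words low). Returns the fraction of answer weight covered
    by the recognized sequence, in [0, 1]."""
    present = set(units)
    covered = sum(w for u, w in answer_weights.items() if u in present)
    total = sum(answer_weights.values())
    return covered / total if total else 0.0

answer_weights = {"weather": 1.0, "sunny": 1.0, "is": 0.2, "the": 0.1}
print(completeness(["the", "weather", "is", "sunny"], answer_weights))  # 1.0
print(completeness(["weather", "nice"], answer_weights))                # ~0.43
```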
The pronunciation accuracy features describe how standard the pronunciation of each speech segment is. Specifically, the similarity of each speech segment to a preset pronunciation acoustic model may be calculated, and the similarity is taken as the pronunciation accuracy feature.
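The patent does not fix a particular similarity measure here; one common realization of segment-to-model similarity is a Goodness-of-Pronunciation (GOP) style score, sketched below with a hypothetical `am_loglik` acoustic-model call that is an assumption of this sketch, not an API from the patent:

```python
def pronunciation_accuracy(frames, unit, am_loglik, all_units):
    """GOP-style pronunciation score for one speech segment: the
    log-likelihood of the intended unit minus that of the best competing
    unit, normalized by segment length in frames.
    am_loglik(frames, unit) -> float is a hypothetical interface."""
    target = am_loglik(frames, unit)
    best = max(am_loglik(frames, u) for u in all_units)
    return (target - best) / max(len(frames), 1)
```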
The fluency features describe how smoothly the user utters a sentence, including but not limited to the average speech rate of the sentence (e.g., the ratio of speech duration to the number of speech units), the average run length of the sentence, and the effective pause ratio of the sentence. In addition, to compensate for speech-rate differences between speakers, phone duration features may also be used: all pronounced parts are normalized and then combined into the fluency features. Specifically, the discrete duration probability distribution of context-independent phones may be collected, and the log-probability of the normalized duration computed, to obtain a duration score for each phone.
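A minimal sketch of such sentence-level fluency features (the per-segment and per-pause durations, in seconds, are hypothetical placeholders):

```python
def fluency_features(seg_durations, pause_durations, num_units):
    """Hypothetical fluency features from segment/pause durations (s):
    average duration per speech unit, mean speech-run length between
    pauses, and the fraction of time spent pausing."""
    speech = sum(seg_durations)
    pauses = sum(pause_durations)
    total = speech + pauses
    return {
        "avg_rate": speech / max(num_units, 1),          # duration per unit
        "avg_run": speech / (len(pause_durations) + 1),  # speech per run
        "pause_ratio": pauses / total if total else 0.0,
    }

print(fluency_features([0.3, 0.25, 0.4, 0.35], [0.5, 0.2], 4))
```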
The prosodic features describe the rhythm and intonation of the user's pronunciation, including features such as the rise and fall of pitch. Specifically, the fundamental frequency (F0) sequence of each speech segment may be extracted, and its dynamics, such as the first-order and second-order differences, may then be derived as the prosodic features.
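For the pitch-dynamics features just described, a minimal sketch (the F0 values are hypothetical voiced-frame samples):

```python
def deltas(f0):
    """First- and second-order differences of an F0 sequence (Hz),
    a minimal stand-in for the pitch-dynamics features described above."""
    d1 = [b - a for a, b in zip(f0, f0[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    return d1, d2

f0 = [180.0, 185.0, 192.0, 190.0, 183.0]  # hypothetical voiced frames
print(deltas(f0))  # ([5.0, 7.0, -2.0, -7.0], [2.0, -9.0, -5.0])
```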
The evaluation features of the different feature types described above characterize the current user's pronunciation from different angles and are complementary to one another.
Step 105: score according to the evaluation features.
For the evaluation features of each feature type, the corresponding score prediction model may be loaded, and the similarity of the evaluation features to that score prediction model calculated.
It should be noted that, in practical applications, score prediction models may also be loaded per question type; the score prediction models of the same feature type under different question types may be identical or different, further improving the granularity and accuracy of the scoring. The construction of the score prediction models is described in detail below.
If evaluation features of only one feature type were extracted, the similarity of those evaluation features to the score prediction model, calculated as above, may be taken directly as the score of the speech signal.
If evaluation features of multiple feature types were extracted, the similarity calculated above is taken as the score of the corresponding group of evaluation features, and the score of the speech signal is then calculated from the scores of the groups. Specifically, considering that in practice the scores of different types of evaluation features are correlated to a certain extent, a linear-regression combination may be used to calculate the total score, i.e., the score of the speech signal:

S = (1/N) · Σ_{i=1}^{N} w_i · s_i

where w_i is the weighting coefficient of the i-th evaluation feature, each w_i being a positive number preset by the system subject to a normalization constraint; s_i is the score of the i-th group of evaluation features; and N is the number of such scores.
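A worked sketch of this combination (the weights and per-feature scores are hypothetical placeholders):

```python
def total_score(scores, weights):
    """Weighted combination S = (1/N) * sum(w_i * s_i) over the group
    scores; weights are positive, system-preset coefficients."""
    assert len(scores) == len(weights) and all(w > 0 for w in weights)
    n = len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / n

# e.g. completeness, pronunciation accuracy, fluency, prosody scores
print(total_score([4.2, 3.8, 4.0, 3.5], [1.2, 1.1, 0.9, 0.8]))  # 3.905
```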
It can be seen that the spoken language evaluation method of the embodiments of the present invention applies multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the method greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
As mentioned above, when calculating the score of a group of evaluation features, the score prediction model corresponding to the feature type of those features must be loaded. It should be noted that the score prediction models are built offline in advance.
Fig. 3 is a flowchart of building a score prediction model in an embodiment of the present invention, comprising the following steps:
Step 301: collect scoring training data.
Specifically, the answer speech data of multiple users may be collected for each test question as the scoring training data.
Step 302: manually annotate the training data, including text annotation, segmentation, and manual spoken-language scoring.
Text annotation is the conversion from speech to text. Segmentation divides the continuous speech signal under manual supervision and determines the speech segment corresponding to each basic speech unit. Manual spoken-language scoring assigns spoken proficiency scores by human listening.
In practical applications, each of the different evaluation features may be scored separately; the evaluation features include completeness features, pronunciation accuracy features, fluency features, prosodic features, and so on.
Step 303: extract the evaluation features of the different feature types from the annotation results.
That is, given the basic speech units and their corresponding speech segments in the annotation results, the evaluation features of the different feature types are extracted from the speech segments in the manner introduced above.
Step 304: use the evaluation features to build the score prediction model for each feature type.
Specifically, a prediction technique may be trained under the supervision of the manual scores to obtain the parameters of each score prediction model. Further, score prediction models may also be built per question type.
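As one possible realization (the patent does not name a specific regression technique, and the data below are hypothetical), a sketch that fits one score prediction model per feature type against the manual scores using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_score_models(features_by_type, human_scores):
    """features_by_type: {feature_type: (num_utterances, dim) array}.
    human_scores: (num_utterances,) manual scores for the same data.
    Returns one fitted score prediction model per feature type."""
    return {ftype: LinearRegression().fit(X, human_scores)
            for ftype, X in features_by_type.items()}

# Hypothetical toy data: 100 utterances with per-type feature vectors.
rng = np.random.default_rng(0)
feats = {"fluency": rng.normal(size=(100, 3)),
         "prosody": rng.normal(size=(100, 4))}
scores = rng.uniform(1, 5, size=100)
models = train_score_models(feats, scores)
print(models["fluency"].predict(feats["fluency"][:2]))
```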
Correspondingly, an embodiment of the present invention also provides a spoken language evaluation system; Fig. 4 shows a schematic structural diagram of this system.
In this embodiment, the system comprises a receiving module 401, a speech segment acquisition module 402, a fusion module 403, a feature extraction module 404, and a scoring module 405, wherein:
the receiving module 401 is configured to receive a speech signal to be evaluated;
the speech segment acquisition module 402 is configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal.
For speech signals without a text transcript, such as answers to open questions, continuous speech recognition may be used to obtain the text corresponding to the speech signal and the segment of each corresponding basic speech unit; for speech signals with a standard answer, such as read-aloud items, forced alignment may be used to obtain the time boundary of each basic speech unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to a certain extent.
The fusion module 403 is configured to fuse the speech segments obtained by the speech segment acquisition module 402, to obtain a valid speech segment sequence corresponding to the speech signal.
The feature extraction module 404 is configured to extract evaluation features from the valid speech segment sequence.
The scoring module 405 is configured to score according to the evaluation features.
A single speech recognition system may produce partially erroneous recognition results, whereas a set of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the candidate speech segments then improves the accuracy and soundness of the scoring of each speech segment.
To this end, in an embodiment of the present invention, one implementation structure of the fusion module 403 is shown in Fig. 5.
In this embodiment, the fusion module comprises:
a matching unit 501, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit 502, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit 503, configured to determine the optimal unit in each set;
a splicing unit 504, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
The determining unit 503 may comprise a calculation unit and a selection unit (not shown), wherein the calculation unit is configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set, and the selection unit is configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
Through the fusion of speech segments by the fusion module, the accuracy of the fused recognition result is considerably higher than that of any single speech recognition system.
It can be seen that the spoken language evaluation system of the embodiments of the present invention applies multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the system greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
It should be noted that, in practical applications, the feature extraction module 404 may, as the application requires, extract evaluation features of a single feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features, with scoring performed on those features. Of course, evaluation features of multiple feature types may also be extracted at the same time; that is, two or more groups of evaluation features may be extracted, each group corresponding to one feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features.
The specific meaning and extraction of the various types of evaluation features have been explained above and are not repeated here. The evaluation features of the different feature types characterize the current user's pronunciation from different angles and are complementary to one another.
Implementations of the scoring module for the different cases of extracted evaluation features are described below.
Fig. 6 shows a schematic structural diagram of one implementation of the scoring module in an embodiment of the present invention.
In this embodiment, the scoring module comprises:
a loading unit 601, configured to load the score prediction model corresponding to the feature type of the evaluation features;
a calculation unit 602, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
In this embodiment, for evaluation features of a single feature type extracted by the feature extraction module, the scoring module calculates the similarity of those features to the corresponding score prediction model and takes the similarity as the score of the speech signal.
Fig. 7 shows a schematic structural diagram of another implementation of the scoring module in an embodiment of the present invention.
In this embodiment, the scoring module comprises:
a loading unit 701, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit 702, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group;
a second calculation unit 703, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
Considering that the scores of different types of evaluation features are correlated to a certain extent, the second calculation unit 703 may use a linear-regression combination to calculate the total score, i.e., the score of the speech signal:

S = (1/N) · Σ_{i=1}^{N} w_i · s_i

where w_i is the weighting coefficient of the i-th evaluation feature, each w_i being a positive number preset by the system subject to a normalization constraint; s_i is the score of the i-th group of evaluation features; and N is the number of such scores.
In this embodiment, for evaluation features of multiple feature types extracted by the feature extraction module, the scoring module calculates the similarity of each group of evaluation features to its score prediction model to obtain the score of the group, and then calculates the score of the speech signal from those group scores, further improving the validity and soundness of the spoken language evaluation and greatly reducing the proportion of abnormal scores.
It should be noted that the score prediction models corresponding to the feature types of the different evaluation features are built offline in advance, as described in detail above; the description is not repeated here.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are merely schematic: the modules or units described as separate components may or may not be physically separate, and components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the spoken language evaluation system of the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (e.g., a computer program or computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
The embodiments of the present invention are described in detail above; specific examples are used herein to set forth the present invention, and the explanation of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, both the specific implementation and the application scope will vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A spoken language evaluation method, characterized by comprising:
receiving a speech signal to be evaluated;
using at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal;
extracting evaluation features from the valid speech segment sequence; and
scoring according to the evaluation features.
2. The method according to claim 1, characterized in that fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal comprises:
dynamically matching the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
generating, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
determining the optimal unit in each set; and
splicing the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
3. The method according to claim 2, characterized in that determining the optimal unit in each set comprises:
calculating the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
selecting the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit in the set.
4. The method according to claim 1, characterized in that the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
loading the score prediction model corresponding to the feature type of the evaluation features; and
calculating the similarity of the evaluation features to the score prediction model, and taking the similarity as the score of the speech signal.
5. The method according to claim 1, characterized in that the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
for each group of evaluation features, loading the score prediction model corresponding to the feature type of that group;
calculating the similarity of the group of evaluation features to the score prediction model, and taking the similarity as the score of that group; and
calculating the score of the speech signal from the scores of the groups of evaluation features.
6. A spoken language evaluation system, characterized by comprising:
a receiving module, configured to receive a speech signal to be evaluated;
a speech segment acquisition module, configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
a fusion module, configured to fuse the speech segments obtained by the speech segment acquisition module, to obtain a valid speech segment sequence corresponding to the speech signal;
a feature extraction module, configured to extract evaluation features from the valid speech segment sequence; and
a scoring module, configured to score according to the evaluation features.
7. The system according to claim 6, characterized in that the fusion module comprises:
a matching unit, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit, configured to determine the optimal unit in each set; and
a splicing unit, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
8. The system according to claim 7, characterized in that the determining unit comprises:
a calculation unit, configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
a selection unit, configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
9. The system according to claim 6, characterized in that the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load the score prediction model corresponding to the feature type of the evaluation features; and
a calculation unit, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
10. The system according to claim 6, characterized in that the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group; and
a second calculation unit, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
CN201310554431.8A 2013-11-08 2013-11-08 Oral evaluation method and system Active CN103559894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310554431.8A CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310554431.8A CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Publications (2)

Publication Number Publication Date
CN103559894A true CN103559894A (en) 2014-02-05
CN103559894B CN103559894B (en) 2016-04-20

Family

ID=50014121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310554431.8A Active CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Country Status (1)

Country Link
CN (1) CN103559894B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104318921A (en) * 2014-11-06 2015-01-28 科大讯飞股份有限公司 Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN104978971A (en) * 2014-04-08 2015-10-14 安徽科大讯飞信息科技股份有限公司 Oral evaluation method and system
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN107894882A (en) * 2017-11-21 2018-04-10 马博 A kind of pronunciation inputting method of mobile terminal
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108829894A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word identification and method for recognizing semantics and its device
CN109273023A (en) * 2018-09-20 2019-01-25 科大讯飞股份有限公司 A kind of data evaluating method, device, equipment and readable storage medium storing program for executing
CN109300474A (en) * 2018-09-14 2019-02-01 北京网众共创科技有限公司 A kind of audio signal processing method and device
CN109308118A (en) * 2018-09-04 2019-02-05 安徽大学 Chinese eye write signal identifying system and its recognition methods based on EOG
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
WO2020181800A1 (en) * 2019-03-12 2020-09-17 平安科技(深圳)有限公司 Apparatus and method for predicting score for question and answer content, and storage medium
CN111833853A (en) * 2020-07-01 2020-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONATHAN G. FISCUS: "A POST-PROCESSING SYSTEM TO YIELD REDUCED WORD ERROR RATES:RECOGNIZER OUTPUT VOTING ERROR REDUCTION (ROVER)", 《AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, 1997. PROCEEDINGS., 1997 IEEE WORKSHOP ON》, 17 December 1997 (1997-12-17) *
SATOSHI NATORI ET AL: "Spoken Term Detection Using Phoneme Transition Network from Multiple Speech Recognizers’ Outputs", 《JOURNAL OF INFORMATION PROCESSING》, vol. 21, no. 2, 30 April 2013 (2013-04-30) *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978971A (en) * 2014-04-08 2015-10-14 安徽科大讯飞信息科技股份有限公司 Oral evaluation method and system
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN104318921A (en) * 2014-11-06 2015-01-28 科大讯飞股份有限公司 Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CN104318921B (en) * 2014-11-06 2017-08-25 科大讯飞股份有限公司 Segment cutting detection method and system, method and system for evaluating spoken language
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN109697988B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
CN107894882B (en) * 2017-11-21 2021-02-09 南京硅基智能科技有限公司 Voice input method of mobile terminal
CN107894882A (en) * 2017-11-21 2018-04-10 马博 A kind of pronunciation inputting method of mobile terminal
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN107945788B (en) * 2017-11-27 2021-11-02 桂林电子科技大学 Method for detecting pronunciation error and scoring quality of spoken English related to text
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108829894A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word identification and method for recognizing semantics and its device
CN108829894B (en) * 2018-06-29 2021-11-12 北京百度网讯科技有限公司 Spoken word recognition and semantic recognition method and device
CN109308118A (en) * 2018-09-04 2019-02-05 安徽大学 Chinese eye write signal identifying system and its recognition methods based on EOG
CN109308118B (en) * 2018-09-04 2021-12-14 安徽大学 Chinese eye writing signal recognition system based on EOG and recognition method thereof
CN109300474A (en) * 2018-09-14 2019-02-01 北京网众共创科技有限公司 A kind of audio signal processing method and device
CN109300474B (en) * 2018-09-14 2022-04-26 北京网众共创科技有限公司 Voice signal processing method and device
CN109273023A (en) * 2018-09-20 2019-01-25 科大讯飞股份有限公司 A kind of data evaluating method, device, equipment and readable storage medium storing program for executing
CN109273023B (en) * 2018-09-20 2022-05-17 科大讯飞股份有限公司 Data evaluation method, device and equipment and readable storage medium
WO2020181800A1 (en) * 2019-03-12 2020-09-17 平安科技(深圳)有限公司 Apparatus and method for predicting score for question and answer content, and storage medium
CN111833853A (en) * 2020-07-01 2020-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111833853B (en) * 2020-07-01 2023-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN111916108B (en) * 2020-07-24 2021-04-02 北京声智科技有限公司 Voice evaluation method and device
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product

Also Published As

Publication number Publication date
CN103559894B (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN103559894B (en) Oral evaluation method and system
CN103559892B (en) Oral evaluation method and system
CN110782921B (en) Voice evaluation method and device, storage medium and electronic device
CN103594087B (en) Improve the method and system of oral evaluation performance
CN101740024B (en) Method for automatic evaluation of spoken language fluency based on generalized fluency
CN101751919B (en) Spoken Chinese stress automatic detection method
CN102568475B (en) System and method for assessing proficiency in Putonghua
CN105845134A (en) Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
US9262941B2 (en) Systems and methods for assessment of non-native speech using vowel space characteristics
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
CN102214462A (en) Method and system for estimating pronunciation
CN102034475A (en) Method for interactively scoring open short conversation by using computer
CN103985392A (en) Phoneme-level low-power consumption spoken language assessment and defect diagnosis method
CN110415725B (en) Method and system for evaluating pronunciation quality of second language using first language data
Yin et al. Automatic cognitive load detection from speech features
CN106157974A (en) Text recites quality assessment device and method
Ghanem et al. Pronunciation features in rating criteria
CN104700831B (en) The method and apparatus for analyzing the phonetic feature of audio file
CN104347071A (en) Method and system for generating oral test reference answer
Shashidhar et al. Automatic spontaneous speech grading: A novel feature derivation technique using the crowd
Gao et al. Spoken english intelligibility remediation with pocketsphinx alignment and feature extraction improves substantially over the state of the art
Li et al. Techware: Speaker and spoken language recognition resources [best of the web]
CN112116181B (en) Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device
Loukina et al. Pronunciation accuracy and intelligibility of non-native speech
CN109065024A (en) abnormal voice data detection method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Applicant after: Iflytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
CB03 Change of inventor or designer information

Inventor after: Wei Si

Inventor after: Wang Shijin

Inventor after: Liu Dan

Inventor after: Hu Yu

Inventor after: Liu Qingfeng

Inventor before: Wang Shijin

Inventor before: Liu Dan

Inventor before: Wei Si

Inventor before: Hu Yu

Inventor before: Liu Qingfeng

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171207

Address after: Room 177, Self-built Building 15, No. 788 Guangzhou Avenue South, Haizhu District, Guangzhou, Guangdong 510000

Patentee after: Guangzhou Xunfei Yi heard Network Technology Co. Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Iflytek Co., Ltd.