CN103559894A - Method and system for evaluating spoken language - Google Patents

Method and system for evaluating spoken language

Info

Publication number
CN103559894A
Authority
CN
China
Prior art keywords
voice
evaluation
feature
score
test feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310554431.8A
Other languages
Chinese (zh)
Other versions
CN103559894B (en)
Inventor
Wang Shijin (王士进)
Liu Dan (刘丹)
Wei Si (魏思)
Hu Yu (胡郁)
Liu Qingfeng (刘庆峰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xunfei Yi Heard Network Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201310554431.8A priority Critical patent/CN103559894B/en
Publication of CN103559894A publication Critical patent/CN103559894A/en
Application granted granted Critical
Publication of CN103559894B publication Critical patent/CN103559894B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to the technical field of speech signal processing, and discloses a method and system for evaluating spoken language. The method comprises: receiving a speech signal to be evaluated; obtaining, through at least two different speech recognition systems, the speech segments corresponding to the basic speech units in the speech signal; fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal; extracting evaluation features from the valid speech segment sequence; and scoring according to the evaluation features. By means of this method and system, the accuracy of spoken language evaluation can be improved and abnormal scores can be reduced.

Description

Method and system for spoken language evaluation
Technical field
The present invention relates to the field of speech signal processing, and in particular to a method and system for spoken language evaluation.
Background art
As an important medium of interpersonal communication, spoken language occupies an extremely important position in everyday life. With continuing socioeconomic development and globalization, people place ever higher demands on the efficiency of language learning and on the objectivity, fairness, and scalability of language assessment. Traditional manual evaluation of spoken language proficiency constrains teachers and students in teaching time and space, and suffers from hardware gaps and imbalances in qualified teaching staff, teaching facilities, and funding. Manual evaluation also cannot avoid the individual bias of each rater, so uniform scoring standards cannot be guaranteed, and sometimes a test-taker's true proficiency cannot be accurately reflected. Moreover, large-scale oral examinations require substantial human, material, and financial resources, limiting regular, large-scale assessment. For these reasons, the industry has successively developed a number of language teaching and evaluation systems.
In the prior art, a spoken language evaluation system usually applies a single recognizer to the received speech signal, performing speech recognition (for question-and-answer items) or speech-text alignment (for read-aloud items), to obtain the speech segment corresponding to each basic speech unit. The system then extracts, from each speech segment, features that measure spoken language evaluation criteria such as pronunciation accuracy and fluency for each basic speech unit, and finally obtains the final evaluation score from these features by predictive analysis.
When a high-fidelity recording device is used in a quiet environment, the speech recognition system provides high recognition accuracy, so the subsequent spoken language evaluation can also produce fairly objective and accurate results. In practical applications, however, particularly in large-scale spoken language tests, the recording environment is inevitably affected by factors such as examination-room noise and ambient noise; recognition accuracy declines, causing a certain proportion of abnormally scored utterances in the evaluation process. This phenomenon clearly makes automatic computer scoring difficult to put into real use in large-scale spoken language tests, limits the application scope and adoption of spoken language evaluation systems, and prevents their use in many high-stakes examinations, since any abnormal score would cause a grading incident.
Summary of the invention
The embodiments of the present invention provide a method and system for spoken language evaluation, so as to improve the accuracy of spoken language evaluation and reduce abnormal scoring.
To this end, the present invention provides the following technical solutions:
A spoken language evaluation method, comprising:
receiving a speech signal to be evaluated;
using at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal;
extracting evaluation features from the valid speech segment sequence; and
scoring according to the evaluation features.
Preferably, fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal comprises:
dynamically matching the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
generating, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
determining the optimal unit in each set; and
splicing the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
Preferably, determining the optimal unit in each set comprises:
calculating the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
selecting the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit in the set.
Preferably, the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
loading the score prediction model corresponding to the feature type of the evaluation features; and
calculating the similarity of the evaluation features to the score prediction model, and taking the similarity as the score of the speech signal.
Preferably, the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
for each group of evaluation features, loading the score prediction model corresponding to the feature type of that group;
calculating the similarity of the group of evaluation features to the score prediction model, and taking the similarity as the score of that group; and
calculating the score of the speech signal from the scores of the groups of evaluation features.
A spoken language evaluation system, comprising:
a receiving module, configured to receive a speech signal to be evaluated;
a speech segment acquisition module, configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
a fusion module, configured to fuse the speech segments obtained by the speech segment acquisition module, to obtain a valid speech segment sequence corresponding to the speech signal;
a feature extraction module, configured to extract evaluation features from the valid speech segment sequence; and
a scoring module, configured to score according to the evaluation features.
Preferably, the fusion module comprises:
a matching unit, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit, configured to determine the optimal unit in each set; and
a splicing unit, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
Preferably, the determining unit comprises:
a calculation unit, configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
a selection unit, configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
Preferably, the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load the score prediction model corresponding to the feature type of the evaluation features; and
a calculation unit, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
Preferably, the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group; and
a second calculation unit, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
The spoken language evaluation method and system provided by the embodiments of the present invention apply multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the method and system greatly reduce the proportion of abnormal scores, and thus better meet the application demands of large-scale spoken language tests.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application and of the prior art more clearly, the drawings needed by the embodiments are briefly introduced below. The drawings described below are clearly only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of the spoken language evaluation method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the matching information of the recognition results of different speech recognition systems in an embodiment of the present invention;
Fig. 3 is a flowchart of building a score prediction model in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the spoken language evaluation system of an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of one implementation of the fusion module in the spoken language evaluation system of an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of one implementation of the scoring module in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another implementation of the scoring module in an embodiment of the present invention.
Detailed description of embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
To address the prior-art problem that environmental factors degrade speech recognition accuracy and thereby cause a certain proportion of abnormally scored utterances during spoken language evaluation, the embodiments of the present invention provide a method and system for spoken language evaluation: multiple speech recognition methods are first applied to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence; finally, spoken language scoring is performed on the valid speech segment sequence to obtain the evaluation result.
Fig. 1 is a flowchart of the spoken language evaluation method of an embodiment of the present invention, comprising the following steps:
Step 101: receive a speech signal to be evaluated.
Step 102: use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal.
The basic speech units may be syllables, phonemes, and the like. The different speech recognition systems decode the speech signal based on different acoustic features (e.g., an acoustic model based on MFCC features, or an acoustic model based on PLP features) or with different acoustic model types (e.g., a discriminatively trained HMM-GMM acoustic model, or a DBN-based neural network acoustic model). In this way, speech segment sequences corresponding to the speech signal can be obtained.
Specifically, for speech signals without a text transcript, such as answers to open questions, continuous speech recognition is used to obtain the text corresponding to the speech signal and the segment of each corresponding basic speech unit; for speech signals with a standard answer, such as read-aloud items, forced alignment of the speech to the text is used to obtain the time boundary of each basic speech unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to a certain extent.
Step 103: fuse the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal.
A single speech recognition system may produce partially erroneous recognition results, whereas a set of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the candidate speech segments then improves the accuracy and soundness of the scoring of each speech segment.
In the embodiments of the present invention, the text corresponding to the speech segments obtained by each speech recognition system may first be dynamically matched against a pre-built standard answer network to obtain an optimal matching result. Specifically, a DTW (Dynamic Time Warping) algorithm may be used to compute the cumulative probability of each partial path of the text through the standard answer network, and the path with the largest probability when the search ends is selected as the optimal path. For example, if the recognition result obtained by speech recognition system 1 is "A B C D E", matching it against the standard answer network may yield the optimal matching result "A(+) B C(+) D(+) E(+)", i.e., units A, C, D, and E match the answer while B does not.
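By way of illustration only (this sketch is not part of the patent text; flattening the answer network to a single token path and using uniform edit costs are assumptions made for the sketch), the matching step can be pictured as an edit-distance style dynamic program that flags which recognized units match the answer:

```python
def match_against_answer(recognized, answer):
    """Align a recognized unit sequence with one flattened answer path
    using a Levenshtein-style dynamic program, then flag the recognized
    units that align to identical answer units (the '(+)' marks)."""
    n, m = len(recognized), len(answer)
    # dp[i][j] = minimal edit cost aligning recognized[:i] with answer[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (recognized[i - 1] != answer[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back to recover which recognized units matched the answer.
    flags, i, j = [False] * n, n, m
    while i > 0 and j > 0:
        if recognized[i - 1] == answer[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            flags[i - 1] = True
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return flags

# Reproduces the example above: B is the only unmatched unit.
print(match_against_answer(list("ABCDE"), list("ACDE")))
# [True, False, True, True, True]
```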
Then, an effective unit sequence is generated from the combined optimal matching results, where an effective unit is a recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time. The valid speech segment corresponding to each effective unit is determined, and the speech segments corresponding to these optimal units are spliced in order, yielding the valid speech segment sequence corresponding to the speech signal.
When determining the valid speech segment of each effective unit, several recognition results may all contain the effective unit together with a corresponding speech segment. Therefore, the sets of corresponding units may first be generated, in order, according to the optimal matching results; the acoustic model probability or pronunciation posterior probability of the speech segment of each unit in a set is calculated, and the unit with the largest probability score is selected as the optimal unit of that set. The speech segments corresponding to the optimal units of all sets are then spliced in chronological order, giving the valid speech segment sequence corresponding to the speech signal.
For example, suppose two speech recognition systems output the recognition results shown in Fig. 2: system 1 produces "A B C D E" and system 2 produces "A F C G E". Matching the two recognition results against the standard answer network yields the optimal matching results "A(+) B C(+) D(+) E(+)" and "A(+) F(+) C(+) G E(+)", where "(+)" marks a unit that matches the standard answer, i.e., a correct recognition result. In Fig. 2, the vertical lines indicate the time boundaries of the speech segments.
The valid speech segment sequence obtained by fusion is "A F C D E". The accuracy of the fused recognition result is clearly much higher than that of any single speech recognition system.
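Purely as an illustration of this fusion step (the time spans and probability scores below are hypothetical, not taken from the patent), a sketch that groups time-overlapping matched units across recognizers and keeps the highest-scoring segment in each group:

```python
from dataclasses import dataclass

@dataclass
class Unit:
    label: str        # recognized basic speech unit
    start: float      # segment start time (s)
    end: float        # segment end time (s)
    score: float      # acoustic model log-probability or posterior
    matched: bool     # True if the unit matched the answer network

def overlaps(a, b):
    return min(a.end, b.end) - max(a.start, b.start) > 0.0

def fuse(hypotheses):
    """hypotheses: one list of Units per recognizer, each in time order.
    Greedily group time-overlapping matched units across recognizers,
    then keep the best-scoring unit of each group, in time order."""
    matched = sorted((u for h in hypotheses for u in h if u.matched),
                     key=lambda u: u.start)
    groups = []
    for u in matched:
        if groups and overlaps(groups[-1][-1], u):
            groups[-1].append(u)
        else:
            groups.append([u])
    return [max(g, key=lambda u: u.score) for g in groups]

# Hypothetical numbers echoing the Fig. 2 example.
sys1 = [Unit("A", 0.0, 0.4, -1.0, True), Unit("B", 0.4, 0.8, -3.0, False),
        Unit("C", 0.8, 1.2, -1.2, True), Unit("D", 1.2, 1.6, -0.9, True),
        Unit("E", 1.6, 2.0, -1.1, True)]
sys2 = [Unit("A", 0.0, 0.4, -1.4, True), Unit("F", 0.4, 0.8, -0.8, True),
        Unit("C", 0.8, 1.2, -1.5, True), Unit("G", 1.2, 1.6, -2.0, False),
        Unit("E", 1.6, 2.0, -1.0, True)]
print("".join(u.label for u in fuse([sys1, sys2])))  # AFCDE
```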
Step 104: extract evaluation features from the valid speech segment sequence.
It should be noted that, in practical applications, evaluation features of a single feature type may be extracted as the application requires, for example completeness features, pronunciation accuracy features, fluency features, or prosodic features, with scoring performed on those evaluation features.
Of course, evaluation features of multiple feature types may also be extracted at the same time; that is, two or more groups of evaluation features may be extracted, each group corresponding to one feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features.
The completeness features describe how completely the speech unit sequence corresponding to the speech segment sequence covers the text of the standard answer.
In the embodiments of the present invention, the basic speech unit sequence may be matched against the pre-built standard answer network to obtain an optimal path, and the matching degree between the optimal path and the speech unit sequence is taken as the completeness feature.
It should be noted that the form of the standard answer network may differ across question types: for a read-aloud item it is simply the prompt text, for a question-and-answer item it is a set of key words, and for a picture-description or statement item it is a set of key sentences.
Question-and-answer items, statement items, and the like, whose answers carry a degree of uncertainty, belong to semi-open question types; their standard answers are therefore often organized as several alternative answers built around key words, and the standard answer network may formally be a list of answer entries.
For open question types, the standard answer usually consists of sentences containing key words. The key words are clearly more important than the other auxiliary words, so larger weights can be assigned to key words and smaller weights to auxiliary words, improving the soundness of the semantic match. Hence, for open question types, a weighted standard answer network may also be built from the occurrence probability of each key word across the standard answers; a search then finds the path in the standard answer network with the highest similarity to the speech unit sequence, and the matching degree of the speech units that agree with the units on the optimal path is taken as the completeness feature, the matching degree being the weight of each matched speech unit.
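For illustration only (the answer units and weights below are hypothetical, and a single flattened answer entry stands in for the full answer network), a minimal sketch of such a weighted completeness feature:

```python
def completeness(units, answer_weights):
    """units: recognized basic speech units, in order.
    answer_weights: hypothetical weight per answer unit (key words high,
    auxiliary words low). Returns the fraction of answer weight covered
    by the recognized sequence, in [0, 1]."""
    present = set(units)
    covered = sum(w for u, w in answer_weights.items() if u in present)
    total = sum(answer_weights.values())
    return covered / total if total else 0.0

answer_weights = {"weather": 1.0, "sunny": 1.0, "is": 0.2, "the": 0.1}
print(completeness(["the", "weather", "is", "sunny"], answer_weights))  # 1.0
print(completeness(["weather", "nice"], answer_weights))                # ~0.43
```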
The pronunciation accuracy features describe how standard the pronunciation of each speech segment is. Specifically, the similarity of each speech segment to a preset pronunciation acoustic model may be calculated, and the similarity is taken as the pronunciation accuracy feature.
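The patent does not fix a particular similarity measure here; one common realization of segment-to-model similarity is a Goodness-of-Pronunciation (GOP) style score, sketched below with a hypothetical `am_loglik` acoustic-model call that is an assumption of this sketch, not an API from the patent:

```python
def pronunciation_accuracy(frames, unit, am_loglik, all_units):
    """GOP-style pronunciation score for one speech segment: the
    log-likelihood of the intended unit minus that of the best competing
    unit, normalized by segment length in frames.
    am_loglik(frames, unit) -> float is a hypothetical interface."""
    target = am_loglik(frames, unit)
    best = max(am_loglik(frames, u) for u in all_units)
    return (target - best) / max(len(frames), 1)
```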
The fluency features describe how smoothly the user utters a sentence, including but not limited to the average speech rate of the sentence (e.g., the ratio of speech duration to the number of speech units), the average run length of the sentence, and the effective pause ratio of the sentence. In addition, to compensate for speech-rate differences between speakers, phone duration features may also be used: all pronounced parts are normalized and then combined into the fluency features. Specifically, the discrete duration probability distribution of context-independent phones may be collected, and the log-probability of the normalized duration computed, to obtain a duration score for each phone.
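A minimal sketch of such sentence-level fluency features (the per-segment and per-pause durations, in seconds, are hypothetical placeholders):

```python
def fluency_features(seg_durations, pause_durations, num_units):
    """Hypothetical fluency features from segment/pause durations (s):
    average duration per speech unit, mean speech-run length between
    pauses, and the fraction of time spent pausing."""
    speech = sum(seg_durations)
    pauses = sum(pause_durations)
    total = speech + pauses
    return {
        "avg_rate": speech / max(num_units, 1),          # duration per unit
        "avg_run": speech / (len(pause_durations) + 1),  # speech per run
        "pause_ratio": pauses / total if total else 0.0,
    }

print(fluency_features([0.3, 0.25, 0.4, 0.35], [0.5, 0.2], 4))
```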
The prosodic features describe the rhythm and intonation of the user's pronunciation, including features such as the rise and fall of pitch. Specifically, the fundamental frequency (F0) sequence of each speech segment may be extracted, and its dynamics, such as the first-order and second-order differences, may then be derived as the prosodic features.
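For the pitch-dynamics features just described, a minimal sketch (the F0 values are hypothetical voiced-frame samples):

```python
def deltas(f0):
    """First- and second-order differences of an F0 sequence (Hz),
    a minimal stand-in for the pitch-dynamics features described above."""
    d1 = [b - a for a, b in zip(f0, f0[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    return d1, d2

f0 = [180.0, 185.0, 192.0, 190.0, 183.0]  # hypothetical voiced frames
print(deltas(f0))  # ([5.0, 7.0, -2.0, -7.0], [2.0, -9.0, -5.0])
```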
The evaluation features of the different feature types described above characterize the current user's pronunciation from different angles and are complementary to one another.
Step 105: score according to the evaluation features.
For the evaluation features of each feature type, the corresponding score prediction model may be loaded, and the similarity of the evaluation features to that score prediction model calculated.
It should be noted that, in practical applications, score prediction models may also be loaded per question type; the score prediction models of the same feature type under different question types may be identical or different, further improving the granularity and accuracy of the scoring. The construction of the score prediction models is described in detail below.
If evaluation features of only one feature type were extracted, the similarity of those evaluation features to the score prediction model, calculated as above, may be taken directly as the score of the speech signal.
If evaluation features of multiple feature types were extracted, the similarity calculated above is taken as the score of the corresponding group of evaluation features, and the score of the speech signal is then calculated from the scores of the groups. Specifically, considering that in practice the scores of different types of evaluation features are correlated to a certain extent, a linear-regression combination may be used to calculate the total score, i.e., the score of the speech signal:

S = (1/N) · Σ_{i=1}^{N} w_i · s_i

where w_i is the weighting coefficient of the i-th evaluation feature, each w_i being a positive number preset by the system subject to a normalization constraint; s_i is the score of the i-th group of evaluation features; and N is the number of such scores.
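A worked sketch of this combination (the weights and per-feature scores are hypothetical placeholders):

```python
def total_score(scores, weights):
    """Weighted combination S = (1/N) * sum(w_i * s_i) over the group
    scores; weights are positive, system-preset coefficients."""
    assert len(scores) == len(weights) and all(w > 0 for w in weights)
    n = len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / n

# e.g. completeness, pronunciation accuracy, fluency, prosody scores
print(total_score([4.2, 3.8, 4.0, 3.5], [1.2, 1.1, 0.9, 0.8]))  # 3.905
```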
It can be seen that the spoken language evaluation method of the embodiments of the present invention applies multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the method greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
As mentioned above, when calculating the score of a group of evaluation features, the score prediction model corresponding to the feature type of those features must be loaded. It should be noted that the score prediction models are built offline in advance.
Fig. 3 is a flowchart of building a score prediction model in an embodiment of the present invention, comprising the following steps:
Step 301: collect scoring training data.
Specifically, the answer speech data of multiple users may be collected for each test question as the scoring training data.
Step 302: manually annotate the training data, including text annotation, segmentation, and manual spoken-language scoring.
Text annotation is the conversion from speech to text. Segmentation divides the continuous speech signal under manual supervision and determines the speech segment corresponding to each basic speech unit. Manual spoken-language scoring assigns spoken proficiency scores by human listening.
In practical applications, each of the different evaluation features may be scored separately; the evaluation features include completeness features, pronunciation accuracy features, fluency features, prosodic features, and so on.
Step 303: extract the evaluation features of the different feature types from the annotation results.
That is, given the basic speech units and their corresponding speech segments in the annotation results, the evaluation features of the different feature types are extracted from the speech segments in the manner introduced above.
Step 304: use the evaluation features to build the score prediction model for each feature type.
Specifically, a prediction technique may be trained under the supervision of the manual scores to obtain the parameters of each score prediction model. Further, score prediction models may also be built per question type.
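As one possible realization (the patent does not name a specific regression technique, and the data below are hypothetical), a sketch that fits one score prediction model per feature type against the manual scores using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_score_models(features_by_type, human_scores):
    """features_by_type: {feature_type: (num_utterances, dim) array}.
    human_scores: (num_utterances,) manual scores for the same data.
    Returns one fitted score prediction model per feature type."""
    return {ftype: LinearRegression().fit(X, human_scores)
            for ftype, X in features_by_type.items()}

# Hypothetical toy data: 100 utterances with per-type feature vectors.
rng = np.random.default_rng(0)
feats = {"fluency": rng.normal(size=(100, 3)),
         "prosody": rng.normal(size=(100, 4))}
scores = rng.uniform(1, 5, size=100)
models = train_score_models(feats, scores)
print(models["fluency"].predict(feats["fluency"][:2]))
```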
Correspondingly, an embodiment of the present invention also provides a spoken language evaluation system; Fig. 4 shows a schematic structural diagram of this system.
In this embodiment, the system comprises a receiving module 401, a speech segment acquisition module 402, a fusion module 403, a feature extraction module 404, and a scoring module 405, wherein:
the receiving module 401 is configured to receive a speech signal to be evaluated;
the speech segment acquisition module 402 is configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal.
For speech signals without a text transcript, such as answers to open questions, continuous speech recognition may be used to obtain the text corresponding to the speech signal and the segment of each corresponding basic speech unit; for speech signals with a standard answer, such as read-aloud items, forced alignment may be used to obtain the time boundary of each basic speech unit.
Because different speech recognition systems have different decoding strengths, their recognition results are often complementary to a certain extent.
The fusion module 403 is configured to fuse the speech segments obtained by the speech segment acquisition module 402, to obtain a valid speech segment sequence corresponding to the speech signal.
The feature extraction module 404 is configured to extract evaluation features from the valid speech segment sequence.
The scoring module 405 is configured to score according to the evaluation features.
A single speech recognition system may produce partially erroneous recognition results, whereas a set of speech recognition systems with complementary characteristics can largely avoid this problem; a reasonable selection among the candidate speech segments then improves the accuracy and soundness of the scoring of each speech segment.
To this end, in an embodiment of the present invention, one implementation structure of the fusion module 403 is shown in Fig. 5.
In this embodiment, the fusion module comprises:
a matching unit 501, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit 502, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit 503, configured to determine the optimal unit in each set;
a splicing unit 504, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
The determining unit 503 may comprise a calculation unit and a selection unit (not shown), wherein the calculation unit is configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set, and the selection unit is configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
Through the fusion of speech segments by the fusion module, the accuracy of the fused recognition result is considerably higher than that of any single speech recognition system.
It can be seen that the spoken language evaluation system of the embodiments of the present invention applies multiple speech recognition methods to the speech signal to be evaluated, obtaining multiple speech segment sequences; these sequences are then fused into a valid speech segment sequence, and spoken language evaluation is finally performed on the valid speech segment sequence to obtain the evaluation result. By improving the accuracy of the recognition results and the validity and soundness of what the evaluation examines, the system greatly reduces the proportion of abnormal scores, and thus better meets the application demands of large-scale spoken language tests.
It should be noted that, in practical applications, the feature extraction module 404 may, as the application requires, extract evaluation features of a single feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features, with scoring performed on those features. Of course, evaluation features of multiple feature types may also be extracted at the same time; that is, two or more groups of evaluation features may be extracted, each group corresponding to one feature type, such as completeness features, pronunciation accuracy features, fluency features, or prosodic features.
The specific meaning and extraction of the various types of evaluation features have been explained above and are not repeated here. The evaluation features of the different feature types characterize the current user's pronunciation from different angles and are complementary to one another.
Implementations of the scoring module for the different cases of extracted evaluation features are described below.
Fig. 6 shows a schematic structural diagram of one implementation of the scoring module in an embodiment of the present invention.
In this embodiment, the scoring module comprises:
a loading unit 601, configured to load the score prediction model corresponding to the feature type of the evaluation features;
a calculation unit 602, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
In this embodiment, for evaluation features of a single feature type extracted by the feature extraction module, the scoring module calculates the similarity of those features to the corresponding score prediction model and takes the similarity as the score of the speech signal.
Fig. 7 shows a schematic structural diagram of another implementation of the scoring module in an embodiment of the present invention.
In this embodiment, the scoring module comprises:
a loading unit 701, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit 702, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group;
a second calculation unit 703, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
Considering that the scores of different types of evaluation features are correlated to a certain extent, the second calculation unit 703 may use a linear-regression combination to calculate the total score, i.e., the score of the speech signal:

S = (1/N) · Σ_{i=1}^{N} w_i · s_i

where w_i is the weighting coefficient of the i-th evaluation feature, each w_i being a positive number preset by the system subject to a normalization constraint; s_i is the score of the i-th group of evaluation features; and N is the number of such scores.
In this embodiment, for evaluation features of multiple feature types extracted by the feature extraction module, the scoring module calculates the similarity of each group of evaluation features to its score prediction model to obtain the score of the group, and then calculates the score of the speech signal from those group scores, further improving the validity and soundness of the spoken language evaluation and greatly reducing the proportion of abnormal scores.
It should be noted that the score prediction models corresponding to the feature types of the different evaluation features are built offline in advance, as described in detail above; the description is not repeated here.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are merely schematic: the modules or units described as separate components may or may not be physically separate, and components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the spoken language evaluation system of the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (e.g., a computer program or computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
The embodiments of the present invention are described in detail above; specific examples are used herein to set forth the present invention, and the explanation of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, both the specific implementation and the application scope will vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A spoken language evaluation method, characterized by comprising:
receiving a speech signal to be evaluated;
using at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal;
extracting evaluation features from the valid speech segment sequence; and
scoring according to the evaluation features.
2. The method according to claim 1, characterized in that fusing the obtained speech segments to obtain a valid speech segment sequence corresponding to the speech signal comprises:
dynamically matching the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
generating, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
determining the optimal unit in each set; and
splicing the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
3. The method according to claim 2, characterized in that determining the optimal unit in each set comprises:
calculating the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
selecting the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit in the set.
4. The method according to claim 1, characterized in that the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
loading the score prediction model corresponding to the feature type of the evaluation features; and
calculating the similarity of the evaluation features to the score prediction model, and taking the similarity as the score of the speech signal.
5. The method according to claim 1, characterized in that the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and scoring according to the evaluation features comprises:
for each group of evaluation features, loading the score prediction model corresponding to the feature type of that group;
calculating the similarity of the group of evaluation features to the score prediction model, and taking the similarity as the score of that group; and
calculating the score of the speech signal from the scores of the groups of evaluation features.
6. A spoken language evaluation system, characterized by comprising:
a receiving module, configured to receive a speech signal to be evaluated;
a speech segment acquisition module, configured to use at least two different speech recognition systems to respectively obtain the speech segment corresponding to each basic speech unit in the speech signal;
a fusion module, configured to fuse the speech segments obtained by the speech segment acquisition module, to obtain a valid speech segment sequence corresponding to the speech signal;
a feature extraction module, configured to extract evaluation features from the valid speech segment sequence; and
a scoring module, configured to score according to the evaluation features.
7. The system according to claim 6, characterized in that the fusion module comprises:
a matching unit, configured to dynamically match the text corresponding to the speech segments obtained by each speech recognition system against a pre-built standard answer network, to obtain an optimal matching result;
a set generation unit, configured to generate, in order, sets of corresponding units according to the optimal matching result, a corresponding unit being a correctly recognized unit that matches the standard answer network and whose speech segments, as obtained by the different speech recognition systems, overlap in time;
a determining unit, configured to determine the optimal unit in each set; and
a splicing unit, configured to splice the optimal units of the sets in order, to obtain the valid speech segment sequence corresponding to the speech signal.
8. The system according to claim 7, characterized in that the determining unit comprises:
a calculation unit, configured to calculate the acoustic model probability or pronunciation posterior probability of the speech segment of each corresponding unit in the set; and
a selection unit, configured to select the corresponding unit with the largest acoustic model probability or pronunciation posterior probability as the optimal unit of the set.
9. The system according to claim 6, characterized in that the evaluation features correspond to one feature type, the feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load the score prediction model corresponding to the feature type of the evaluation features; and
a calculation unit, configured to calculate the similarity of the evaluation features to the score prediction model, and to take the similarity as the score of the speech signal.
10. The system according to claim 6, characterized in that the evaluation features comprise at least two groups of evaluation features of different feature types, each feature type being any one of: completeness features, pronunciation accuracy features, fluency features, and prosodic features;
and the scoring module comprises:
a loading unit, configured to load, for each group of evaluation features, the score prediction model corresponding to the feature type of that group;
a first calculation unit, configured to calculate the similarity of each group of evaluation features to its score prediction model, and to take the similarity as the score of that group; and
a second calculation unit, configured to calculate the score of the speech signal from the scores of the groups of evaluation features.
CN201310554431.8A 2013-11-08 2013-11-08 Oral evaluation method and system Active CN103559894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310554431.8A CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310554431.8A CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Publications (2)

Publication Number Publication Date
CN103559894A true CN103559894A (en) 2014-02-05
CN103559894B CN103559894B (en) 2016-04-20

Family

ID=50014121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310554431.8A Active CN103559894B (en) 2013-11-08 2013-11-08 Oral evaluation method and system

Country Status (1)

Country Link
CN (1) CN103559894B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104318921A (en) * 2014-11-06 2015-01-28 科大讯飞股份有限公司 Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN104978971A (en) * 2014-04-08 2015-10-14 安徽科大讯飞信息科技股份有限公司 Oral evaluation method and system
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN107894882A (en) * 2017-11-21 2018-04-10 马博 A kind of pronunciation inputting method of mobile terminal
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108829894A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word identification and method for recognizing semantics and its device
CN109273023A (en) * 2018-09-20 2019-01-25 科大讯飞股份有限公司 A kind of data evaluating method, device, equipment and readable storage medium storing program for executing
CN109300474A (en) * 2018-09-14 2019-02-01 北京网众共创科技有限公司 A kind of audio signal processing method and device
CN109308118A (en) * 2018-09-04 2019-02-05 安徽大学 Chinese eye write signal identifying system and its recognition methods based on EOG
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
WO2020181800A1 (en) * 2019-03-12 2020-09-17 平安科技(深圳)有限公司 Apparatus and method for predicting score for question and answer content, and storage medium
CN111833853A (en) * 2020-07-01 2020-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency
CN101645271A (en) * 2008-12-23 2010-02-10 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONATHAN G. FISCUS: "A POST-PROCESSING SYSTEM TO YIELD REDUCED WORD ERROR RATES:RECOGNIZER OUTPUT VOTING ERROR REDUCTION (ROVER)", 《AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, 1997. PROCEEDINGS., 1997 IEEE WORKSHOP ON》, 17 December 1997 (1997-12-17) *
SATOSHI NATORI ET AL: "Spoken Term Detection Using Phoneme Transition Network from Multiple Speech Recognizers’ Outputs", 《JOURNAL OF INFORMATION PROCESSING》, vol. 21, no. 2, 30 April 2013 (2013-04-30) *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978971A (en) * 2014-04-08 2015-10-14 安徽科大讯飞信息科技股份有限公司 Oral evaluation method and system
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN104464757B (en) * 2014-10-28 2019-01-18 科大讯飞股份有限公司 Speech evaluating method and speech evaluating device
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN104318921A (en) * 2014-11-06 2015-01-28 科大讯飞股份有限公司 Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CN104318921B (en) * 2014-11-06 2017-08-25 科大讯飞股份有限公司 Segment cutting detection method and system, method and system for evaluating spoken language
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN109697988B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
CN107894882B (en) * 2017-11-21 2021-02-09 南京硅基智能科技有限公司 Voice input method of mobile terminal
CN107894882A (en) * 2017-11-21 2018-04-10 马博 A kind of pronunciation inputting method of mobile terminal
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN107945788B (en) * 2017-11-27 2021-11-02 桂林电子科技大学 Method for detecting pronunciation error and scoring quality of spoken English related to text
CN108597538A (en) * 2018-03-05 2018-09-28 标贝(北京)科技有限公司 The evaluating method and system of speech synthesis system
CN108829894A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word identification and method for recognizing semantics and its device
CN108829894B (en) * 2018-06-29 2021-11-12 北京百度网讯科技有限公司 Spoken word recognition and semantic recognition method and device
CN109308118A (en) * 2018-09-04 2019-02-05 安徽大学 Chinese eye write signal identifying system and its recognition methods based on EOG
CN109308118B (en) * 2018-09-04 2021-12-14 安徽大学 Chinese eye writing signal recognition system based on EOG and recognition method thereof
CN109300474A (en) * 2018-09-14 2019-02-01 北京网众共创科技有限公司 A kind of audio signal processing method and device
CN109300474B (en) * 2018-09-14 2022-04-26 北京网众共创科技有限公司 Voice signal processing method and device
CN109273023A (en) * 2018-09-20 2019-01-25 科大讯飞股份有限公司 A kind of data evaluating method, device, equipment and readable storage medium storing program for executing
CN109273023B (en) * 2018-09-20 2022-05-17 科大讯飞股份有限公司 Data evaluation method, device and equipment and readable storage medium
WO2020181800A1 (en) * 2019-03-12 2020-09-17 平安科技(深圳)有限公司 Apparatus and method for predicting score for question and answer content, and storage medium
CN111833853A (en) * 2020-07-01 2020-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111833853B (en) * 2020-07-01 2023-10-27 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN111916108B (en) * 2020-07-24 2021-04-02 北京声智科技有限公司 Voice evaluation method and device
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product

Also Published As

Publication number Publication date
CN103559894B (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN103559894B (en) Oral evaluation method and system
CN103559892B (en) Oral evaluation method and system
CN110782921B (en) Voice evaluation method and device, storage medium and electronic device
CN103594087B (en) Improve the method and system of oral evaluation performance
CN101740024B (en) Method for automatic evaluation of spoken language fluency based on generalized fluency
CN101751919B (en) Spoken Chinese stress automatic detection method
CN102568475B (en) System and method for assessing proficiency in Putonghua
CN105845134A (en) Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
US9262941B2 (en) Systems and methods for assessment of non-native speech using vowel space characteristics
US9489864B2 (en) Systems and methods for an automated pronunciation assessment system for similar vowel pairs
CN102214462A (en) Method and system for estimating pronunciation
CN102034475A (en) Method for interactively scoring open short conversation by using computer
CN103985392A (en) Phoneme-level low-power consumption spoken language assessment and defect diagnosis method
CN110415725B (en) Method and system for evaluating pronunciation quality of second language using first language data
Yin et al. Automatic cognitive load detection from speech features
CN106157974A (en) Text recites quality assessment device and method
Ghanem et al. Pronunciation features in rating criteria
CN104700831B (en) The method and apparatus for analyzing the phonetic feature of audio file
CN104347071A (en) Method and system for generating oral test reference answer
Shashidhar et al. Automatic spontaneous speech grading: A novel feature derivation technique using the crowd
Gao et al. Spoken english intelligibility remediation with pocketsphinx alignment and feature extraction improves substantially over the state of the art
Li et al. Techware: Speaker and spoken language recognition resources [best of the web]
CN112116181B (en) Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device
Loukina et al. Pronunciation accuracy and intelligibility of non-native speech
CN109065024A (en) abnormal voice data detection method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Applicant after: Iflytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
CB03 Change of inventor or designer information

Inventor after: Wei Si

Inventor after: Wang Shijin

Inventor after: Liu Dan

Inventor after: Hu Yu

Inventor after: Liu Qingfeng

Inventor before: Wang Shijin

Inventor before: Liu Dan

Inventor before: Wei Si

Inventor before: Hu Yu

Inventor before: Liu Qingfeng

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171207

Address after: Room 177, Self-built Building 15, No. 788 Guangzhou Avenue South, Haizhu District, Guangzhou, Guangdong 510000

Patentee after: Guangzhou Xunfei Yi heard Network Technology Co. Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Iflytek Co., Ltd.