CN104599678A - Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method - Google Patents


Info

Publication number
CN104599678A
CN104599678A (application CN201310524855.XA)
Authority
CN
China
Prior art keywords
spoken language
pronunciation
speech
speech data
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310524855.XA
Other languages
Chinese (zh)
Inventor
林晖 (Lin Hui)
王翌 (Wang Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Crown Information Technology (Shanghai) Co Ltd
Original Assignee
Crown Information Technology (Shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Crown Information Technology (Shanghai) Co Ltd
Priority to CN201310524855.XA
Publication of CN104599678A
Legal status: Pending

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a spoken language pronunciation evaluation system and a spoken language pronunciation evaluation method. The system comprises: an acquisition module for collecting the speech data of a spoken pronunciation to be evaluated; an extraction module for extracting at least one acoustic feature from the speech data collected by the acquisition module and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and an evaluation module for inputting the speech feature vector obtained by the extraction module into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to a preset standard spoken pronunciation. With the system and method of the embodiments of the invention, whether a speaker pronounces the language in a standard way can be distinguished automatically: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation the system gives it.

Description

Spoken language pronunciation evaluation system and method
Technical field
The present invention relates to the field of computer technology, and in particular to a spoken language pronunciation evaluation system and method.
Background technology
Most current pronunciation evaluation systems rely on speech recognition technology. A speech recognition system is relatively complex and needs substantial computational resources to run. In addition, because speech recognition technology is still imperfect, and recognition accuracy for accented speech is especially low, a pronunciation evaluation system based on speech recognition usually needs to know in advance what the person being evaluated is saying in order to improve the accuracy of the evaluation.
Because an existing speech recognition system must know the content of the speech being recognized, the recognition computation takes a great deal of time on the one hand, and on the other hand the accuracy of the recognition is reduced, degrading the user experience.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a spoken language pronunciation evaluation system and method that overcome the above problems or at least partially solve them.
According to one aspect of embodiments of the invention, a spoken language pronunciation evaluation system is provided, comprising: an acquisition module for collecting the speech data of a spoken pronunciation to be evaluated; an extraction module for extracting at least one acoustic feature from the speech data collected by the acquisition module and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and an evaluation module for inputting the speech feature vector obtained by the extraction module into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
Optionally, the system further comprises: a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, the system further comprises: a speech feature vector generation module for inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and a second training module for training a binary classifier, which serves as the preset classifier, from the training speech feature vectors generated by the speech feature vector generation module and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the evaluation module is further configured to obtain the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
Optionally, the acoustic features comprise Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients.
According to another aspect of embodiments of the invention, a spoken language pronunciation evaluation method is also provided, comprising: collecting the speech data of a spoken pronunciation to be evaluated; extracting at least one acoustic feature from the collected speech data, inputting the extracted features into a preset universal background model, and obtaining the speech feature vector output by the model; and inputting the obtained speech feature vector into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
Optionally, the method further comprises: extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, the method further comprises: inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and training a binary classifier, which serves as the preset classifier, from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the step of inputting the obtained speech feature vector into the preset classifier and having the classifier output the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation comprises:
obtaining the evaluation value according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
The spoken language pronunciation evaluation system and method of embodiments of the invention can automatically distinguish whether a speaker pronounces the language in a standard way: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation given to it. Because the content of the speaker's utterance does not need to be recognized when computing the pronunciation evaluation, the computation time of the evaluation is reduced on the one hand, and on the other hand the accuracy of the evaluation is improved, effectively improving the user experience.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be better understood and implemented according to the contents of the specification, and in order to make the above and other objects, features and advantages of the invention more apparent, specific embodiments of the invention are set forth below.
Accompanying drawing explanation
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 schematically shows a structural block diagram of a spoken language pronunciation evaluation system 100 according to an embodiment of the invention; and
Fig. 2 schematically shows a flowchart of a spoken language pronunciation evaluation method 200 according to an embodiment of the invention.
Embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
It should also be appreciated that those skilled in the art can devise various structures that, although not explicitly described or recorded in this specification, embody the principles of the invention and are included within its spirit and scope.
All examples and conditional language cited in this specification are intended for purposes of illustration and teaching, to help the reader understand the principles and concepts contributed by the inventors to the advancement of the art, and are to be construed as not limited to the specifically cited examples and conditions.
Moreover, all statements in this specification citing principles, aspects and embodiments of the invention, as well as specific examples thereof, are intended to encompass their structural and functional equivalents. Such equivalents include both currently known equivalents and equivalents developed in the future, that is, any developed elements that perform the same function, regardless of structure.
Those skilled in the art will appreciate that the block diagrams presented in the figures represent illustrative schematics of structures or circuits embodying the invention. Similarly, it should be appreciated that any flowcharts and the like presented in the figures represent various processes that may in practice be executed by various computers or processors, whether or not such a computer or processor is explicitly shown in the figures.
In the claims, a module for performing a specified function is intended to encompass any way of performing that function, including, for example, (a) a combination of circuit elements that performs the function, or (b) software in any form, including firmware, microcode and the like, combined with appropriate circuitry for executing that software to perform the function. Since the functions provided by the various modules are combined in the manner advocated by the claims, it should be understood that any module, component or element capable of providing those functions is equivalent to the module defined in the claims.
Term " embodiment " in instructions means that the specific features, structure etc. that describe in conjunction with this embodiment are at least one embodiment of the present invention involved, therefore, the term " in an embodiment " occurred everywhere at instructions differs to establish a capital and refers to identical embodiment.
As shown in Fig. 1, a spoken language pronunciation evaluation system 100 according to an embodiment of the invention, which may run on a mobile device, mainly comprises an acquisition module 110, an extraction module 130 and an evaluation module 150. It should be understood that the connections between the modules shown in Fig. 1 are only an example; those skilled in the art may adopt other connections entirely, as long as the modules can still realize the functions of the invention under such connections.
In this specification, the functions of the modules may be realized using dedicated hardware, or hardware capable of executing processing in combination with appropriate software. Such hardware or dedicated hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and so on. When realized by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or multiple independent processors (some of which may be shared). Moreover, a processor should not be understood to refer exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM) and non-volatile storage.
In an embodiment of the invention, the acquisition module 110 collects the speech data of the spoken pronunciation to be evaluated.
Optionally, in an embodiment of the invention, the acquisition module 110 may convert the speech data from an analog signal into a digital signal, and may determine the starting point and the end point of the spoken pronunciation to be evaluated (i.e., endpoint detection).
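The patent does not spell out the endpoint-detection algorithm. As a minimal sketch, thresholding short-time energy is one common way to find the start and end of speech; everything here (the function name, frame sizes, and the 35 dB-below-peak threshold) is an illustrative assumption, not the patent's method:

```python
import numpy as np

def detect_endpoints(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Return (start, end) sample indices of speech, found by thresholding
    short-time energy relative to the loudest frame. Illustrative only."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    if n_frames < 1:
        return 0, len(signal)
    energy_db = np.array([
        10 * np.log10(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    active = np.where(energy_db > energy_db.max() + threshold_db)[0]
    if active.size == 0:
        return 0, len(signal)
    return int(active[0] * hop), int(active[-1] * hop + frame)
```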
In an embodiment of the invention, the extraction module 130 extracts at least one acoustic feature from the speech data collected by the acquisition module 110, inputs the extracted features into the preset universal background model, and obtains the speech feature vector output by the model. In other words, the extraction module 130 converts the speech waveform into feature vector form.
Optionally, the acoustic features may be amplitude, energy, zero-crossing rate, linear prediction coefficients (LPC), LPC cepstral coefficients (LPCC), line spectrum pair parameters (LSP), short-time spectrum, formant spectrum, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients. Preferably, in an embodiment of the invention, the acoustic features comprise MFCC and PLP coefficients. It should be understood, of course, that embodiments of the invention do not limit the particular type of acoustic feature.
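As a concrete illustration of this feature-extraction step, the following sketch computes MFCC features with the librosa library; the 16 kHz sample rate, 13 coefficients, and appended deltas are my choices rather than the patent's, and PLP extraction, which would be analogous with a suitable library, is omitted:

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Load an utterance and return frame-level MFCCs plus first-order
    deltas as a (frames, features) array. Parameter choices are
    illustrative assumptions."""
    y, sr = librosa.load(wav_path, sr=16000)      # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)           # first-order dynamics
    return np.vstack([mfcc, delta]).T
```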
Optionally, in an embodiment of the invention, the extraction module 130 may also normalize the obtained speech feature vectors, so that feature parameters exhibiting individual differences can be compared against the same baseline. Commonly used methods include interpolation, linear scaling, linear translation, speaker normalization, and so on.
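Of the normalization options listed above, per-utterance mean and variance normalization is one simple instance; this sketch assumes the (frames, features) layout of the earlier extraction snippet:

```python
import numpy as np

def normalize_features(feats):
    """Zero-mean, unit-variance normalization per feature dimension, so
    that speaker-dependent parameters are compared on a common baseline."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-10   # avoid division by zero
    return (feats - mu) / sigma
```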
In an embodiment of the invention, the evaluation module 150 inputs the speech feature vector obtained by the extraction module 130 into the preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation. Here, the preset standard spoken pronunciation corresponds to the pronunciation of a native speaker.
Optionally, in an embodiment of the invention, the evaluation module 150 may use log-likelihood scores, log-posterior probability scores and the like to evaluate the spoken pronunciation to be evaluated.
According to embodiments of the invention, the spoken language pronunciation evaluation system 100 may further comprise one or more optional modules to realize extra or additional functions, but these optional modules are not indispensable for realizing the object of the invention; the system 100 according to an embodiment of the invention can fully realize the object of the invention without them. Although these optional modules are not shown in Fig. 1, their connections with the modules described above can easily be derived by those skilled in the art from the teaching below.
Optionally, in an embodiment of the invention, the system 100 further comprises a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model (GMM), which serves as the universal background model (UBM). The preset speech data comprise speech data of standard spoken pronunciation, corresponding to native speakers, and speech data of non-standard spoken pronunciation, corresponding to non-native speakers.
Optionally, in an embodiment of the invention, the universal background model is trained with the classic EM (expectation-maximization) algorithm. Let λ denote the UBM, a Gaussian mixture model with M component Gaussians; the probability the UBM assigns to an acoustic feature X can be expressed as:

p(X \mid \lambda) = p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} g_i \, N(\vec{x};\, \vec{\mu}_i, \Sigma_i)

where g_i is the weight of the i-th Gaussian, N(\vec{x}; \vec{\mu}_i, \Sigma_i) is a single Gaussian distribution, and \vec{\mu}_i is its mean (with covariance \Sigma_i).
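A minimal sketch of this UBM training step, using scikit-learn's EM-based GaussianMixture; the component count, covariance type, and function name are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(utterance_features, n_components=256):
    """Fit the universal background model: one GMM trained with EM on
    frames pooled from both standard and non-standard speech.
    `utterance_features` is a list of (frames, dims) arrays."""
    pooled = np.vstack(utterance_features)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    ubm.fit(pooled)   # classic EM, as the description states
    return ubm
```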
Optionally, in an embodiment of the invention, the system 100 further comprises a speech feature vector generation module and a second training module. The speech feature vector generation module inputs each item of preset speech data into the universal background model and then generates the training speech feature vectors by maximum conditional posterior probability (MAP) estimation.
Optionally, the speech feature vector generation module may compute the speech feature vector with the following formula:

\vec{s}_i = \frac{n_i}{n_i + \tau}\, \vec{\mu}_i + \frac{\tau}{n_i + \tau}\, \hat{\vec{\mu}}_i

where n_i is the weight (soft frame count) of this speech data for the i-th Gaussian, τ is an adjustment parameter, \hat{\vec{\mu}}_i is the feature mean of this speech data for the i-th Gaussian, and \vec{\mu}_i is the corresponding UBM mean.
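A numpy sketch of this adaptation step, mirroring the formula exactly as printed above (note that, as printed, the n_i/(n_i+τ) weight falls on the UBM mean, whereas classical MAP adaptation usually places it on the data mean; treat this as a faithful-to-the-text illustration). The per-component adapted means are concatenated into the speech feature vector:

```python
import numpy as np

def map_supervector(ubm, frames, tau=10.0):
    """Adapt the UBM means to one utterance and concatenate them into a
    supervector. `ubm` is a fitted sklearn GaussianMixture; the names and
    the value of tau are illustrative."""
    post = ubm.predict_proba(frames)                # (T, M) responsibilities
    n = post.sum(axis=0)                            # n_i: soft frame counts
    mu_hat = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]  # data means
    w = (n / (n + tau))[:, None]                    # n_i / (n_i + tau)
    s = w * ubm.means_ + (1 - w) * mu_hat           # formula as printed
    return s.ravel()
```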
The second training module uses the training speech feature vectors generated by the speech feature vector generation module, together with the labels of the preset speech data, to train a binary classifier, which serves as the preset classifier. The labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the above binary classifier may be a Gaussian classifier; it should be understood, of course, that any binary classifier capable of outputting classification probabilities, together with its training algorithm, can be applied in embodiments of the invention.
Optionally, in an embodiment of the invention, the evaluation module 150 is further configured to obtain the evaluation value score of the spoken pronunciation to be evaluated according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
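To make the scoring concrete, this sketch fits one Gaussian per class on training supervectors (the Gaussian binary classifier the description permits) and applies the formula above; the probabilities are computed in the log domain to avoid numerical underflow, which leaves the score unchanged. All names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian_classifier(sv_standard, sv_nonstandard):
    """Fit lambda_a on standard-pronunciation supervectors and lambda_b on
    non-standard ones (rows of each array are training supervectors)."""
    def fit(sv):
        var = sv.var(axis=0) + 1e-6   # diagonal covariance, floored
        return multivariate_normal(mean=sv.mean(axis=0), cov=np.diag(var))
    return fit(sv_standard), fit(sv_nonstandard)

def pronunciation_score(x, lam_a, lam_b):
    """score = p(X|lambda_a) / (p(X|lambda_a) + p(X|lambda_b)) * 100."""
    la, lb = lam_a.logpdf(x), lam_b.logpdf(x)
    m = max(la, lb)                   # log-domain shift against underflow
    pa, pb = np.exp(la - m), np.exp(lb - m)
    return 100.0 * pa / (pa + pb)
```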
Optionally, in an embodiment of the invention, the centesimal (0 to 100) score is mapped to a qualitative rating by a linear or nonlinear transformation.
Rating  Excellent  Good   Fairly good  Average  Pass   Poor   Very poor
Score   90-100     80-90  70-80        55-70    45-55  30-45  <30
Through the above centesimal score or qualitative rating, learners can clearly recognize the problems in their pronunciation, grasp the essentials of pronunciation quickly, and improve their pronunciation level as early as possible.
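The table's mapping is straightforward to apply in code; this helper simply thresholds the centesimal score into the qualitative terms above (a linear mapping; the patent also allows nonlinear ones):

```python
def score_to_rating(score):
    """Map a 0-100 evaluation value to the rating terms in the table."""
    bands = [(90, "Excellent"), (80, "Good"), (70, "Fairly good"),
             (55, "Average"), (45, "Pass"), (30, "Poor")]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Very poor"
```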
According to a second aspect of the invention, corresponding to the spoken language pronunciation evaluation system 100 described above, the invention also provides a spoken language pronunciation evaluation method 200.
Referring to Fig. 2, which schematically shows a flowchart of a spoken language pronunciation evaluation method 200 according to an embodiment of the invention: as shown in Fig. 2, the method 200 comprises steps S210, S230 and S250. The method 200 starts with step S210, in which the speech data of the spoken pronunciation to be evaluated is collected.
Next, in step S230, at least one acoustic feature is extracted from the collected speech data, the extracted features are input into the preset universal background model, and the speech feature vector output by the model is received.
Next, in step S250, the obtained speech feature vector is input into the preset classifier, and the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation, output by the classifier, is received.
Optionally, in an embodiment of the invention, in step S250 the evaluation value score of the spoken pronunciation to be evaluated is obtained according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
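Putting the earlier sketches together, a hypothetical end-to-end run of steps S210 through S250 could look like this; every name comes from the illustrative snippets above, not from the patent, and the training inputs (train_wavs, sv_standard, sv_nonstandard) are assumed to exist:

```python
# Offline: train the UBM and the two-class Gaussian classifier once.
ubm = train_ubm([normalize_features(extract_features(p)) for p in train_wavs])
lam_a, lam_b = train_gaussian_classifier(sv_standard, sv_nonstandard)

# Online: evaluate one utterance (S210 collect, S230 extract, S250 score).
feats = normalize_features(extract_features("utterance.wav"))
score = pronunciation_score(map_supervector(ubm, feats), lam_a, lam_b)
print(score, score_to_rating(score))
```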
According to embodiments of the invention, the spoken language pronunciation evaluation method 200 may further comprise one or more optional steps to realize extra or additional functions, but these optional steps are not indispensable for realizing the object of the invention; the method 200 according to an embodiment of the invention can fully realize the object of the invention without them. These optional steps are not shown in Fig. 2, but their order of execution relative to the steps above can easily be derived by those skilled in the art from the teaching below. It should be pointed out that, unless otherwise specified, these optional steps and the execution order of the above steps can be selected according to actual needs.
Optionally, in an embodiment of the invention, the method 200 further comprises: extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, in an embodiment of the invention, the method 200 further comprises: inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and training a classifier from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
The spoken language pronunciation evaluation system and method of embodiments of the invention can automatically distinguish whether a speaker pronounces the language in a standard way: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation given to it. Because the content of the speaker's utterance does not need to be recognized when computing the pronunciation evaluation, the computation time of the evaluation is reduced on the one hand, and on the other hand the accuracy of the evaluation is improved, effectively improving the user experience.
Because the above method embodiments correspond to the foregoing device embodiments, the method embodiments are not described again in detail.
In this specification, numerous details are described. It should be understood, however, that embodiments of the invention may be practiced without these details. In some embodiments, well-known methods, structures and techniques are not shown in detail, so as not to obscure the reader's understanding of the principles of this specification.
Those skilled in the art will understand that the modules in the devices of the embodiments can be changed adaptively and arranged in one or more devices different from the embodiment. Modules in an embodiment can be combined into one module, unit or assembly, and can also be divided into multiple sub-modules, sub-units or sub-assemblies. Except where features and/or processes are mutually exclusive, all steps of any method and all modules of any device disclosed in this specification may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent or similar purpose.
The device embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the modules in a device according to an embodiment of the invention. The invention may also be embodied as a device program (for example, a computer program and a computer program product) for performing the methods described herein.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design various alternative embodiments without departing from the scope of the claims. In the claims, the order of features does not imply any particular order among them; in particular, the order of the steps in a method claim does not mean the steps must be performed in that order. Rather, the steps may be performed in any suitable order. Likewise, in a device claim, the order in which the modules execute processing should not be limited by the order of the modules in the claim; the processing may be performed in any suitable order. In the claims, any reference sign placed in parentheses shall not be construed as limiting the claim. The term "comprising" or "including" does not exclude the presence of modules or steps not listed in a claim. The word "a" or "an" preceding a module or step does not exclude the presence of a plurality of such modules or steps. The invention may be implemented by means of hardware comprising several distinct modules, or by means of a suitably programmed computer or processor. In a device claim enumerating several means, several of these modules may be embodied by one and the same item of hardware. The use of the terms "first", "second", "third" and so on does not indicate any order; these terms may be interpreted as names. The terms "connected", "coupled" and the like, when used in this specification, are defined as operably connected in any desired form, for example connected mechanically, electronically, digitally, in analog form, directly, indirectly, through software, through hardware, and so on.

Claims (9)

1. A spoken language pronunciation evaluation system (100), comprising:
an acquisition module (110) for collecting the speech data of a spoken pronunciation to be evaluated;
an extraction module (130) for extracting at least one acoustic feature from the speech data collected by the acquisition module (110) and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and
an evaluation module (150) for inputting the speech feature vector obtained by the extraction module (130) into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
2. The system according to claim 1, further comprising:
a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
3. The system according to claim 2, further comprising:
a speech feature vector generation module for inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and
a second training module for training a binary classifier, which serves as the preset classifier, from the training speech feature vectors generated by the speech feature vector generation module and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
4. The system according to claim 1, wherein the evaluation module is further configured to obtain the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
5. The system according to any one of claims 1 to 4, wherein the acoustic features comprise: Mel-frequency cepstral coefficients and perceptual linear prediction coefficients.
6. A spoken language pronunciation evaluation method (200), comprising:
collecting the speech data of a spoken pronunciation to be evaluated (S210);
extracting at least one acoustic feature from the collected speech data, inputting the extracted features into a preset universal background model, and obtaining the speech feature vector output by the model (S230); and
inputting the obtained speech feature vector into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation (S250).
7. The method according to claim 6, further comprising:
extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
8. The method according to claim 7, further comprising:
inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and
training a binary classifier, which serves as the preset classifier, from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
9. The method according to claim 6, wherein the step of inputting the obtained speech feature vector into the preset classifier and having the classifier output the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation comprises:
obtaining the evaluation value according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
CN201310524855.XA 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method Pending CN104599678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310524855.XA CN104599678A (en) 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method


Publications (1)

Publication Number Publication Date
CN104599678A true CN104599678A (en) 2015-05-06

Family

ID=53125411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310524855.XA Pending CN104599678A (en) 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method

Country Status (1)

Country Link
CN (1) CN104599678A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101321387A (en) * 2008-07-10 2008-12-10 中国移动通信集团广东有限公司 Voiceprint recognition method and system based on communication system
CN101661675A (en) * 2009-09-29 2010-03-03 苏州思必驰信息科技有限公司 Self-sensing error tone pronunciation learning method and system
EP2450877B1 (en) * 2010-11-09 2013-04-24 Sony Computer Entertainment Europe Limited System and method of speech evaluation
CN102354495A (en) * 2011-08-31 2012-02-15 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
纪现清: "文本无关说话人确认及其应用研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205763A (en) * 2015-11-06 2015-12-30 陈国庆 Teaching method and apparatus based on new media modes
CN105702263A (en) * 2016-01-06 2016-06-22 清华大学 Voice playback detection method and device
CN105702263B (en) * 2016-01-06 2019-08-30 清华大学 Speech playback detection method and device
CN109727609A (en) * 2019-01-11 2019-05-07 龙马智芯(珠海横琴)科技有限公司 Spoken language pronunciation appraisal procedure and device, computer readable storage medium

Similar Documents

Publication Publication Date Title
CN106560891B (en) Speech recognition apparatus and method using acoustic modeling
CN107731228B (en) Text conversion method and device for English voice information
CN105304080A (en) Speech synthesis device and speech synthesis method
CN106297826A (en) Speech emotional identification system and method
CN101833951B (en) Multi-background modeling method for speaker recognition
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN105869641A (en) Speech recognition device and speech recognition method
CN108735199B (en) Self-adaptive training method and system of acoustic model
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN105895103A (en) Speech recognition method and device
CN1298172A (en) Context correlated acoustic mode for medium and large vocabulary speech recognition
CN106803422A (en) A kind of language model re-evaluation method based on memory network in short-term long
CN109243466A (en) A kind of vocal print authentication training method and system
CN110704590B (en) Method and apparatus for augmenting training samples
US20180033427A1 (en) Speech recognition transformation system
CN104599678A (en) Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method
CN106782502A (en) A kind of speech recognition equipment of children robot
CN105654955A (en) Voice recognition method and device
CN111128211A (en) Voice separation method and device
KR102167157B1 (en) Voice recognition considering utterance variation
CN106297765A (en) Phoneme synthesizing method and system
CN106384587B (en) A kind of audio recognition method and system
CN103797535A (en) Reducing false positives in speech recognition systems
CN106708950B (en) Data processing method and device for intelligent robot self-learning system
CN113782030B (en) Error correction method based on multi-mode voice recognition result and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2015-05-06)