CN104599678A - Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method - Google Patents


Info

Publication number
CN104599678A
CN104599678A (application CN201310524855.XA)
Authority
CN
China
Prior art keywords
spoken language
pronunciation
speech
speech data
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310524855.XA
Other languages
Chinese (zh)
Inventor
林晖 (Lin Hui)
王翌 (Wang Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Crown Information Technology (Shanghai) Co Ltd
Original Assignee
Crown Information Technology (Shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Crown Information Technology (Shanghai) Co Ltd
Priority to CN201310524855.XA
Publication of CN104599678A
Legal status: Pending

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a spoken language pronunciation evaluation system and a spoken language pronunciation evaluation method. The system comprises: an acquisition module for collecting the speech data of a spoken pronunciation to be evaluated; an extraction module for extracting at least one acoustic feature from the speech data collected by the acquisition module and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and an evaluation module for inputting the speech feature vector obtained by the extraction module into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to a preset standard spoken pronunciation. With the system and method of the embodiments of the invention, whether a speaker pronounces the language in a standard way can be distinguished automatically: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation the system gives it.

Description

Spoken language pronunciation evaluation system and method
Technical field
The present invention relates to the field of computer technology, and in particular to a spoken language pronunciation evaluation system and method.
Background technology
Most current pronunciation evaluation systems rely on speech recognition technology. A speech recognition system is relatively complex and needs substantial computational resources to run. In addition, because speech recognition technology is still imperfect, and recognition accuracy for accented speech is especially low, a pronunciation evaluation system based on speech recognition usually needs to know in advance what the person being evaluated is saying in order to improve the accuracy of the evaluation.
Because an existing speech recognition system must know the content of the speech being recognized, the recognition computation takes a great deal of time on the one hand, and on the other hand the accuracy of the recognition is reduced, degrading the user experience.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a spoken language pronunciation evaluation system and method that overcome the above problems or at least partially solve them.
According to one aspect of embodiments of the invention, a spoken language pronunciation evaluation system is provided, comprising: an acquisition module for collecting the speech data of a spoken pronunciation to be evaluated; an extraction module for extracting at least one acoustic feature from the speech data collected by the acquisition module and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and an evaluation module for inputting the speech feature vector obtained by the extraction module into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
Optionally, the system further comprises: a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, the system further comprises: a speech feature vector generation module for inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and a second training module for training a binary classifier, which serves as the preset classifier, from the training speech feature vectors generated by the speech feature vector generation module and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the evaluation module is further configured to obtain the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
Optionally, the acoustic features comprise Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients.
According to another aspect of embodiments of the invention, a spoken language pronunciation evaluation method is also provided, comprising: collecting the speech data of a spoken pronunciation to be evaluated; extracting at least one acoustic feature from the collected speech data, inputting the extracted features into a preset universal background model, and obtaining the speech feature vector output by the model; and inputting the obtained speech feature vector into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
Optionally, the method further comprises: extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, the method further comprises: inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and training a binary classifier, which serves as the preset classifier, from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the step of inputting the obtained speech feature vector into the preset classifier and having the classifier output the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation comprises:
obtaining the evaluation value according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
The spoken language pronunciation evaluation system and method of embodiments of the invention can automatically distinguish whether a speaker pronounces the language in a standard way: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation given to it. Because the content of the speaker's utterance does not need to be recognized when computing the pronunciation evaluation, the computation time of the evaluation is reduced on the one hand, and on the other hand the accuracy of the evaluation is improved, effectively improving the user experience.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be better understood and implemented according to the contents of the specification, and in order to make the above and other objects, features and advantages of the invention more apparent, specific embodiments of the invention are set forth below.
Accompanying drawing explanation
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 schematically shows a structural block diagram of a spoken language pronunciation evaluation system 100 according to an embodiment of the invention; and
Fig. 2 schematically shows a flowchart of a spoken language pronunciation evaluation method 200 according to an embodiment of the invention.
Embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
It should also be appreciated that those skilled in the art can devise various structures that, although not explicitly described or recorded in this specification, embody the principles of the invention and are included within its spirit and scope.
All examples and conditional language cited in this specification are intended for purposes of illustration and teaching, to help the reader understand the principles and concepts contributed by the inventors to the advancement of the art, and are to be construed as not limited to the specifically cited examples and conditions.
Moreover, all statements in this specification citing principles, aspects and embodiments of the invention, as well as specific examples thereof, are intended to encompass their structural and functional equivalents. Such equivalents include both currently known equivalents and equivalents developed in the future, that is, any developed elements that perform the same function, regardless of structure.
Those skilled in the art will appreciate that the block diagrams presented in the figures represent illustrative schematics of structures or circuits embodying the invention. Similarly, it should be appreciated that any flowcharts and the like presented in the figures represent various processes that may in practice be executed by various computers or processors, whether or not such a computer or processor is explicitly shown in the figures.
In the claims, a module for performing a specified function is intended to encompass any way of performing that function, including, for example, (a) a combination of circuit elements that performs the function, or (b) software in any form, including firmware, microcode and the like, combined with appropriate circuitry for executing that software to perform the function. Since the functions provided by the various modules are combined in the manner advocated by the claims, it should be understood that any module, component or element capable of providing those functions is equivalent to the module defined in the claims.
Term " embodiment " in instructions means that the specific features, structure etc. that describe in conjunction with this embodiment are at least one embodiment of the present invention involved, therefore, the term " in an embodiment " occurred everywhere at instructions differs to establish a capital and refers to identical embodiment.
As shown in Fig. 1, a spoken language pronunciation evaluation system 100 according to an embodiment of the invention, which may run on a mobile device, mainly comprises an acquisition module 110, an extraction module 130 and an evaluation module 150. It should be understood that the connections between the modules shown in Fig. 1 are only an example; those skilled in the art may adopt other connections entirely, as long as the modules can still realize the functions of the invention under such connections.
In this specification, the functions of the modules may be realized using dedicated hardware, or hardware capable of executing processing in combination with appropriate software. Such hardware or dedicated hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and so on. When realized by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or multiple independent processors (some of which may be shared). Moreover, a processor should not be understood to refer exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM) and non-volatile storage.
In an embodiment of the invention, the acquisition module 110 collects the speech data of the spoken pronunciation to be evaluated.
Optionally, in an embodiment of the invention, the acquisition module 110 may convert the speech data from an analog signal into a digital signal, and may determine the starting point and the end point of the spoken pronunciation to be evaluated (i.e., endpoint detection).
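The patent does not spell out the endpoint-detection algorithm. As a minimal sketch, thresholding short-time energy is one common way to find the start and end of speech; everything here (the function name, frame sizes, and the 35 dB-below-peak threshold) is an illustrative assumption, not the patent's method:

```python
import numpy as np

def detect_endpoints(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Return (start, end) sample indices of speech, found by thresholding
    short-time energy relative to the loudest frame. Illustrative only."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    if n_frames < 1:
        return 0, len(signal)
    energy_db = np.array([
        10 * np.log10(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    active = np.where(energy_db > energy_db.max() + threshold_db)[0]
    if active.size == 0:
        return 0, len(signal)
    return int(active[0] * hop), int(active[-1] * hop + frame)
```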
In an embodiment of the invention, the extraction module 130 extracts at least one acoustic feature from the speech data collected by the acquisition module 110, inputs the extracted features into the preset universal background model, and obtains the speech feature vector output by the model. In other words, the extraction module 130 converts the speech waveform into feature vector form.
Optionally, the acoustic features may be amplitude, energy, zero-crossing rate, linear prediction coefficients (LPC), LPC cepstral coefficients (LPCC), line spectrum pair parameters (LSP), short-time spectrum, formant spectrum, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients. Preferably, in an embodiment of the invention, the acoustic features comprise MFCC and PLP coefficients. It should be understood, of course, that embodiments of the invention do not limit the particular type of acoustic feature.
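As a concrete illustration of this feature-extraction step, the following sketch computes MFCC features with the librosa library; the 16 kHz sample rate, 13 coefficients, and appended deltas are my choices rather than the patent's, and PLP extraction, which would be analogous with a suitable library, is omitted:

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Load an utterance and return frame-level MFCCs plus first-order
    deltas as a (frames, features) array. Parameter choices are
    illustrative assumptions."""
    y, sr = librosa.load(wav_path, sr=16000)      # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)           # first-order dynamics
    return np.vstack([mfcc, delta]).T
```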
Optionally, in an embodiment of the invention, the extraction module 130 may also normalize the obtained speech feature vectors, so that feature parameters exhibiting individual differences can be compared against the same baseline. Commonly used methods include interpolation, linear scaling, linear translation, speaker normalization, and so on.
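Of the normalization options listed above, per-utterance mean and variance normalization is one simple instance; this sketch assumes the (frames, features) layout of the earlier extraction snippet:

```python
import numpy as np

def normalize_features(feats):
    """Zero-mean, unit-variance normalization per feature dimension, so
    that speaker-dependent parameters are compared on a common baseline."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-10   # avoid division by zero
    return (feats - mu) / sigma
```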
In an embodiment of the invention, the evaluation module 150 inputs the speech feature vector obtained by the extraction module 130 into the preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation. Here, the preset standard spoken pronunciation corresponds to the pronunciation of a native speaker.
Optionally, in an embodiment of the invention, the evaluation module 150 may use log-likelihood scores, log-posterior probability scores and the like to evaluate the spoken pronunciation to be evaluated.
According to embodiments of the invention, the spoken language pronunciation evaluation system 100 may further comprise one or more optional modules to realize extra or additional functions, but these optional modules are not indispensable for realizing the object of the invention; the system 100 according to an embodiment of the invention can fully realize the object of the invention without them. Although these optional modules are not shown in Fig. 1, their connections with the modules described above can easily be derived by those skilled in the art from the teaching below.
Optionally, in an embodiment of the invention, the system 100 further comprises a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model (GMM), which serves as the universal background model (UBM). The preset speech data comprise speech data of standard spoken pronunciation, corresponding to native speakers, and speech data of non-standard spoken pronunciation, corresponding to non-native speakers.
Optionally, in an embodiment of the invention, the universal background model is trained with the classic EM (expectation-maximization) algorithm. Let λ denote the UBM, a Gaussian mixture model with M component Gaussians; the probability the UBM assigns to an acoustic feature X can be expressed as:

p(X \mid \lambda) = p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} g_i \, N(\vec{x};\, \vec{\mu}_i, \Sigma_i)

where g_i is the weight of the i-th Gaussian, N(\vec{x}; \vec{\mu}_i, \Sigma_i) is a single Gaussian distribution, and \vec{\mu}_i is its mean (with covariance \Sigma_i).
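A minimal sketch of this UBM training step, using scikit-learn's EM-based GaussianMixture; the component count, covariance type, and function name are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(utterance_features, n_components=256):
    """Fit the universal background model: one GMM trained with EM on
    frames pooled from both standard and non-standard speech.
    `utterance_features` is a list of (frames, dims) arrays."""
    pooled = np.vstack(utterance_features)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    ubm.fit(pooled)   # classic EM, as the description states
    return ubm
```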
Optionally, in an embodiment of the invention, the system 100 further comprises a speech feature vector generation module and a second training module. The speech feature vector generation module inputs each item of preset speech data into the universal background model and then generates the training speech feature vectors by maximum conditional posterior probability (MAP) estimation.
Optionally, the speech feature vector generation module may compute the speech feature vector with the following formula:

\vec{s}_i = \frac{n_i}{n_i + \tau}\, \vec{\mu}_i + \frac{\tau}{n_i + \tau}\, \hat{\vec{\mu}}_i

where n_i is the weight (soft frame count) of this speech data for the i-th Gaussian, τ is an adjustment parameter, \hat{\vec{\mu}}_i is the feature mean of this speech data for the i-th Gaussian, and \vec{\mu}_i is the corresponding UBM mean.
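A numpy sketch of this adaptation step, mirroring the formula exactly as printed above (note that, as printed, the n_i/(n_i+τ) weight falls on the UBM mean, whereas classical MAP adaptation usually places it on the data mean; treat this as a faithful-to-the-text illustration). The per-component adapted means are concatenated into the speech feature vector:

```python
import numpy as np

def map_supervector(ubm, frames, tau=10.0):
    """Adapt the UBM means to one utterance and concatenate them into a
    supervector. `ubm` is a fitted sklearn GaussianMixture; the names and
    the value of tau are illustrative."""
    post = ubm.predict_proba(frames)                # (T, M) responsibilities
    n = post.sum(axis=0)                            # n_i: soft frame counts
    mu_hat = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]  # data means
    w = (n / (n + tau))[:, None]                    # n_i / (n_i + tau)
    s = w * ubm.means_ + (1 - w) * mu_hat           # formula as printed
    return s.ravel()
```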
The second training module uses the training speech feature vectors generated by the speech feature vector generation module, together with the labels of the preset speech data, to train a binary classifier, which serves as the preset classifier. The labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
Optionally, the above binary classifier may be a Gaussian classifier; it should be understood, of course, that any binary classifier capable of outputting classification probabilities, together with its training algorithm, can be applied in embodiments of the invention.
Optionally, in an embodiment of the invention, the evaluation module 150 is further configured to obtain the evaluation value score of the spoken pronunciation to be evaluated according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
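To make the scoring concrete, this sketch fits one Gaussian per class on training supervectors (the Gaussian binary classifier the description permits) and applies the formula above; the probabilities are computed in the log domain to avoid numerical underflow, which leaves the score unchanged. All names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian_classifier(sv_standard, sv_nonstandard):
    """Fit lambda_a on standard-pronunciation supervectors and lambda_b on
    non-standard ones (rows of each array are training supervectors)."""
    def fit(sv):
        var = sv.var(axis=0) + 1e-6   # diagonal covariance, floored
        return multivariate_normal(mean=sv.mean(axis=0), cov=np.diag(var))
    return fit(sv_standard), fit(sv_nonstandard)

def pronunciation_score(x, lam_a, lam_b):
    """score = p(X|lambda_a) / (p(X|lambda_a) + p(X|lambda_b)) * 100."""
    la, lb = lam_a.logpdf(x), lam_b.logpdf(x)
    m = max(la, lb)                   # log-domain shift against underflow
    pa, pb = np.exp(la - m), np.exp(lb - m)
    return 100.0 * pa / (pa + pb)
```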
Optionally, in an embodiment of the invention, the centesimal (0 to 100) score is mapped to a qualitative rating by a linear or nonlinear transformation.
Rating  Excellent  Good   Fairly good  Average  Pass   Poor   Very poor
Score   90-100     80-90  70-80        55-70    45-55  30-45  <30
Through the above centesimal score or qualitative rating, learners can clearly recognize the problems in their pronunciation, grasp the essentials of pronunciation quickly, and improve their pronunciation level as early as possible.
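The table's mapping is straightforward to apply in code; this helper simply thresholds the centesimal score into the qualitative terms above (a linear mapping; the patent also allows nonlinear ones):

```python
def score_to_rating(score):
    """Map a 0-100 evaluation value to the rating terms in the table."""
    bands = [(90, "Excellent"), (80, "Good"), (70, "Fairly good"),
             (55, "Average"), (45, "Pass"), (30, "Poor")]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Very poor"
```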
According to a second aspect of the invention, corresponding to the spoken language pronunciation evaluation system 100 described above, the invention also provides a spoken language pronunciation evaluation method 200.
Referring to Fig. 2, which schematically shows a flowchart of a spoken language pronunciation evaluation method 200 according to an embodiment of the invention: as shown in Fig. 2, the method 200 comprises steps S210, S230 and S250. The method 200 starts with step S210, in which the speech data of the spoken pronunciation to be evaluated is collected.
Next, in step S230, at least one acoustic feature is extracted from the collected speech data, the extracted features are input into the preset universal background model, and the speech feature vector output by the model is received.
Next, in step S250, the obtained speech feature vector is input into the preset classifier, and the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation, output by the classifier, is received.
Optionally, in an embodiment of the invention, in step S250 the evaluation value score of the spoken pronunciation to be evaluated is obtained according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
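Putting the earlier sketches together, a hypothetical end-to-end run of steps S210 through S250 could look like this; every name comes from the illustrative snippets above, not from the patent, and the training inputs (train_wavs, sv_standard, sv_nonstandard) are assumed to exist:

```python
# Offline: train the UBM and the two-class Gaussian classifier once.
ubm = train_ubm([normalize_features(extract_features(p)) for p in train_wavs])
lam_a, lam_b = train_gaussian_classifier(sv_standard, sv_nonstandard)

# Online: evaluate one utterance (S210 collect, S230 extract, S250 score).
feats = normalize_features(extract_features("utterance.wav"))
score = pronunciation_score(map_supervector(ubm, feats), lam_a, lam_b)
print(score, score_to_rating(score))
```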
According to embodiments of the invention, the spoken language pronunciation evaluation method 200 may further comprise one or more optional steps to realize extra or additional functions, but these optional steps are not indispensable for realizing the object of the invention; the method 200 according to an embodiment of the invention can fully realize the object of the invention without them. These optional steps are not shown in Fig. 2, but their order of execution relative to the steps above can easily be derived by those skilled in the art from the teaching below. It should be pointed out that, unless otherwise specified, these optional steps and the execution order of the above steps can be selected according to actual needs.
Optionally, in an embodiment of the invention, the method 200 further comprises: extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
Optionally, in an embodiment of the invention, the method 200 further comprises: inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and training a classifier from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
The spoken language pronunciation evaluation system and method of embodiments of the invention can automatically distinguish whether a speaker pronounces the language in a standard way: the closer a person's pronunciation is to the standard spoken pronunciation, the higher the evaluation given to it. Because the content of the speaker's utterance does not need to be recognized when computing the pronunciation evaluation, the computation time of the evaluation is reduced on the one hand, and on the other hand the accuracy of the evaluation is improved, effectively improving the user experience.
Because the above method embodiments correspond to the foregoing device embodiments, the method embodiments are not described again in detail.
In this specification, numerous details are described. It should be understood, however, that embodiments of the invention may be practiced without these details. In some embodiments, well-known methods, structures and techniques are not shown in detail, so as not to obscure the reader's understanding of the principles of this specification.
Those skilled in the art will understand that the modules in the devices of the embodiments can be changed adaptively and arranged in one or more devices different from the embodiment. Modules in an embodiment can be combined into one module, unit or assembly, and can also be divided into multiple sub-modules, sub-units or sub-assemblies. Except where features and/or processes are mutually exclusive, all steps of any method and all modules of any device disclosed in this specification may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent or similar purpose.
The device embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the modules in a device according to an embodiment of the invention. The invention may also be embodied as a device program (for example, a computer program and a computer program product) for performing the methods described herein.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design various alternative embodiments without departing from the scope of the claims. In the claims, the order of features does not imply any particular order among them; in particular, the order of the steps in a method claim does not mean the steps must be performed in that order. Rather, the steps may be performed in any suitable order. Likewise, in a device claim, the order in which the modules execute processing should not be limited by the order of the modules in the claim; the processing may be performed in any suitable order. In the claims, any reference sign placed in parentheses shall not be construed as limiting the claim. The term "comprising" or "including" does not exclude the presence of modules or steps not listed in a claim. The word "a" or "an" preceding a module or step does not exclude the presence of a plurality of such modules or steps. The invention may be implemented by means of hardware comprising several distinct modules, or by means of a suitably programmed computer or processor. In a device claim enumerating several means, several of these modules may be embodied by one and the same item of hardware. The use of the terms "first", "second", "third" and so on does not indicate any order; these terms may be interpreted as names. The terms "connected", "coupled" and the like, when used in this specification, are defined as operably connected in any desired form, for example connected mechanically, electronically, digitally, in analog form, directly, indirectly, through software, through hardware, and so on.

Claims (9)

1. A spoken language pronunciation evaluation system (100), comprising:
an acquisition module (110) for collecting the speech data of a spoken pronunciation to be evaluated;
an extraction module (130) for extracting at least one acoustic feature from the speech data collected by the acquisition module (110) and inputting the extracted features into a preset universal background model, which outputs a speech feature vector; and
an evaluation module (150) for inputting the speech feature vector obtained by the extraction module (130) into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation.
2. The system according to claim 1, further comprising:
a first training module for extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
3. The system according to claim 2, further comprising:
a speech feature vector generation module for inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and
a second training module for training a binary classifier, which serves as the preset classifier, from the training speech feature vectors generated by the speech feature vector generation module and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
4. The system according to claim 1, wherein the evaluation module is further configured to obtain the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
5. The system according to any one of claims 1 to 4, wherein the acoustic features comprise: Mel-frequency cepstral coefficients and perceptual linear prediction coefficients.
6. A spoken language pronunciation evaluation method (200), comprising:
collecting the speech data of a spoken pronunciation to be evaluated (S210);
extracting at least one acoustic feature from the collected speech data, inputting the extracted features into a preset universal background model, and obtaining the speech feature vector output by the model (S230); and
inputting the obtained speech feature vector into a preset classifier, which outputs the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation (S250).
7. The method according to claim 6, further comprising:
extracting at least one acoustic feature from each item of preset speech data and using the extracted features to train a Gaussian mixture model, which serves as the universal background model, wherein the preset speech data comprise speech data of standard spoken pronunciation and speech data of non-standard spoken pronunciation.
8. The method according to claim 7, further comprising:
inputting each item of preset speech data into the universal background model and then generating training speech feature vectors by maximum conditional posterior probability (MAP) estimation; and
training a binary classifier, which serves as the preset classifier, from the generated training speech feature vectors and the labels of the preset speech data, wherein the labels of the preset speech data comprise standard spoken pronunciation and non-standard spoken pronunciation.
9. The method according to claim 6, wherein the step of inputting the obtained speech feature vector into the preset classifier and having the classifier output the evaluation value of the spoken pronunciation to be evaluated relative to the preset standard spoken pronunciation comprises:
obtaining the evaluation value according to the following formula:

score = \frac{p(X \mid \lambda_a)}{p(X \mid \lambda_a) + p(X \mid \lambda_b)} \times 100

where score is the spoken pronunciation evaluation value; λa is the model of the speech data of standard spoken pronunciation and λb is the model of the speech data of non-standard spoken pronunciation; p(X | λa) is the likelihood value output by the classifier for the speech feature vector X, obtained by extracting acoustic features from the speech data and converting them through the universal background model, under the model λa; and p(X | λb) is the likelihood value output by the classifier for the same speech feature vector X under the model λb.
CN201310524855.XA 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method Pending CN104599678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310524855.XA CN104599678A (en) 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method


Publications (1)

Publication Number Publication Date
CN104599678A true CN104599678A (en) 2015-05-06

Family

ID=53125411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310524855.XA Pending CN104599678A (en) 2013-10-30 2013-10-30 Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method

Country Status (1)

Country Link
CN (1) CN104599678A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101105939A (en) * 2007-09-04 2008-01-16 安徽科大讯飞信息科技股份有限公司 Sonification guiding method
CN101321387A (en) * 2008-07-10 2008-12-10 中国移动通信集团广东有限公司 Voiceprint recognition method and system based on communication system
CN101661675A (en) * 2009-09-29 2010-03-03 苏州思必驰信息科技有限公司 Self-sensing error tone pronunciation learning method and system
EP2450877B1 (en) * 2010-11-09 2013-04-24 Sony Computer Entertainment Europe Limited System and method of speech evaluation
CN102354495A (en) * 2011-08-31 2012-02-15 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
纪现清: "文本无关说话人确认及其应用研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205763A (en) * 2015-11-06 2015-12-30 陈国庆 Teaching method and apparatus based on new media modes
CN105702263A (en) * 2016-01-06 2016-06-22 清华大学 Voice playback detection method and device
CN105702263B (en) * 2016-01-06 2019-08-30 清华大学 Speech playback detection method and device
CN109727609A (en) * 2019-01-11 2019-05-07 龙马智芯(珠海横琴)科技有限公司 Spoken language pronunciation appraisal procedure and device, computer readable storage medium

Similar Documents

Publication Publication Date Title
CN106560891B (en) Speech recognition apparatus and method using acoustic modeling
CN107731228B (en) Text conversion method and device for English voice information
CN105304080A (en) Speech synthesis device and speech synthesis method
CN106297826A (en) Speech emotional identification system and method
CN101833951B (en) Multi-background modeling method for speaker recognition
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN105869641A (en) Speech recognition device and speech recognition method
CN108735199B (en) Self-adaptive training method and system of acoustic model
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN105895103A (en) Speech recognition method and device
CN1298172A (en) Context correlated acoustic mode for medium and large vocabulary speech recognition
CN106803422A (en) A kind of language model re-evaluation method based on memory network in short-term long
CN109243466A (en) A kind of vocal print authentication training method and system
CN110704590B (en) Method and apparatus for augmenting training samples
US20180033427A1 (en) Speech recognition transformation system
CN104599678A (en) Spoken language pronunciation evaluation system and spoken language pronunciation evaluation method
CN106782502A (en) A kind of speech recognition equipment of children robot
CN105654955A (en) Voice recognition method and device
CN111128211A (en) Voice separation method and device
KR102167157B1 (en) Voice recognition considering utterance variation
CN106297765A (en) Phoneme synthesizing method and system
CN106384587B (en) A kind of audio recognition method and system
CN103797535A (en) Reducing false positives in speech recognition systems
CN106708950B (en) Data processing method and device for intelligent robot self-learning system
CN113782030B (en) Error correction method based on multi-mode voice recognition result and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2015-05-06)