CN109300339A - A kind of exercising method and system of Oral English Practice - Google Patents
- Publication number
- CN109300339A CN109300339A CN201811376417.2A CN201811376417A CN109300339A CN 109300339 A CN109300339 A CN 109300339A CN 201811376417 A CN201811376417 A CN 201811376417A CN 109300339 A CN109300339 A CN 109300339A
- Authority
- CN
- China
- Prior art keywords
- audio
- testing
- standard
- standard audio
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/06—Foreign languages
Abstract
The invention discloses a method and system for practicing spoken English. The method comprises the following steps: receiving a spoken test audio and converting the test audio into a computer text file; converting the computer text file into a standard audio; extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively, calculating the phoneme posterior probabilities of the test audio and the standard audio according to a hidden Markov model, determining a percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and outputting the standard audio and the percentage score. The above method and system can effectively output the standard audio corresponding to the input audio while also providing a reasonable score for the input audio, which can effectively improve the user's spoken English ability and has high practicability.
Description
Technical field
The present invention relates to the field of language learning, and in particular to a method and system for practicing spoken English.
Background technique
With the deepening of economic globalization and the rise of China's comprehensive national strength, exchanges between China and the rest of the world are becoming increasingly frequent, and the demand for knowledge of international languages is soaring. Meanwhile, as information technology advances rapidly, computer-assisted language learning has matured, making it possible to learn spoken language online. However, existing terminal-based teaching still habitually follows the traditional teaching pattern, which mostly stresses the study of vocabulary and grammar, while the few available spoken-language practice applications offer only simulated-conversation read-aloud or follow-along reading functions and cannot fundamentally improve the user's ability to use English.
Summary of the invention
In view of the above technical problems, the present invention provides a method and system for practicing spoken English that can effectively output the standard audio corresponding to the received audio, provide a corresponding score at the same time, and effectively improve the user's spoken English ability.
In order to solve the above technical problems, the technical solution adopted by the present invention is to provide a method for practicing spoken English, which comprises the following steps:
Receiving a spoken test audio and converting the test audio into a computer text file;
Converting the computer text file into a standard audio;
Extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively, calculating the phoneme posterior probabilities of the test audio and the standard audio according to a hidden Markov model, determining the percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and outputting the standard audio and the percentage score.
With the above technical scheme, the invention achieves the following technical effects: the method for practicing spoken English provided by the invention can effectively output the corresponding computer text file from the test audio and output the corresponding standard audio from the computer text file; by extracting the mel-frequency cepstral coefficients of the test audio and the standard audio, the phoneme posterior probabilities of the test audio and the standard audio can be effectively determined, and the score of the test audio relative to the standard audio is determined from the phoneme posterior probabilities while the standard audio is output. The above method can effectively output the standard audio corresponding to the input audio while also providing a reasonable score for the input audio, which can effectively improve the user's spoken English ability and has high practicability.
Preferably, in the above technical scheme, converting the test audio into a computer text file specifically comprises the following steps:
Converting the test audio into a speech waveform signal, performing spectrum or cepstrum analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, performing model recognition training on the acoustic feature values, and determining the corresponding acoustic model and language model;
Establishing, through the acoustic model, the association between the acoustic feature values and the sentence pronunciation modeling units, and determining the probability that a given text produces the corresponding speech;
Decomposing, through the language model, a complete sentence into single words according to the chain rule, and determining the probability of each current word;
Outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word.
Preferably, in the above technical scheme, before extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively, and after converting the computer text file into the standard audio, the method further comprises the following steps:
Compensating, for the speech power spectra corresponding to the standard audio and the test audio respectively, the inherent decline and the high-frequency portion suppressed by the articulatory system;
Performing framing on the compensated standard audio and test audio.
Preferably, in the above technical scheme, extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively comprises the following steps:
Converting the time-domain waveform of each frame of the compensated standard audio and test audio into a frequency-domain representation;
Extracting the component frequency features of each frame respectively;
Applying a discrete cosine transform to the component frequency features to obtain the mel-frequency cepstral coefficients.
Additionally provide a kind of exercise system of Oral English Practice, comprising:
The testing audio is translated to computer literal herein for receiving spoken testing audio by audio conversion module
Part;
Text conversion module, for described computer literal this document to be translated to standard audio;
Audio comparison module, for extracting the mel-frequency cepstrum coefficient for stating testing audio and the standard audio respectively,
The phoneme posterior probability that the testing audio Yu the standard audio are calculated according to hidden Markov model, after the phoneme
It tests the opposite percentage with the standard audio of testing audio described in determine the probability to score, and exports the standard audio and institute
State percentage scoring.
Preferably, in the above technical scheme, the concrete operations performed by the audio conversion module to convert the test audio into a computer text file are as follows:
Converting the test audio into a speech waveform signal, performing spectrum or cepstrum analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, performing model recognition training on the acoustic feature values, and determining the corresponding acoustic model and language model;
Establishing, through the acoustic model, the association between the acoustic feature values and the sentence pronunciation modeling units, and determining the probability that a given text produces the corresponding speech;
Decomposing, through the language model, a complete sentence into single words according to the chain rule, and determining the probability of each current word;
Outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word.
Preferably, in the above technical scheme, the audio comparison module is also used to compensate, for the speech power spectra corresponding to the standard audio and the test audio respectively, the inherent decline and the high-frequency portion suppressed by the articulatory system, and to perform framing on the compensated standard audio and test audio.
Preferably, in the above technical scheme, the audio comparison module is also used to convert the time-domain waveform of each frame of the compensated standard audio and test audio into a frequency-domain representation, extract the component frequency features of each frame respectively, and apply a discrete cosine transform to the component frequency features to obtain the mel-frequency cepstral coefficients.
With the above technical scheme, the invention achieves the following technical effects: the system for practicing spoken English provided by the invention can effectively output the corresponding computer text file from the test audio and output the corresponding standard audio from the computer text file; by extracting the mel-frequency cepstral coefficients of the test audio and the standard audio, the phoneme posterior probabilities of the test audio and the standard audio can be effectively determined, and the score of the test audio relative to the standard audio is determined from the phoneme posterior probabilities while the standard audio is output. The above system can effectively output the standard audio corresponding to the input audio while also providing a reasonable score for the input audio, which can effectively improve the user's spoken English ability and has high practicability.
A storage medium is also provided, on which program instructions are stored; when the program instructions are executed by a processor, the method described above is implemented.
Brief description of the drawings
The present invention will be further explained below with reference to the accompanying drawings:
Fig. 1 is a schematic flow chart of the method for practicing spoken English provided by the invention;
Fig. 2 is a schematic flow chart of the audio-to-text conversion provided by the invention;
Fig. 3 is a schematic block diagram of the system for practicing spoken English provided by the invention.
Specific embodiment
In order to effectively improve the user's spoken English ability, the present invention provides a method for practicing spoken English, shown in Fig. 1, which is a schematic flow chart of the method. The method specifically comprises the following steps:
Step S10: receiving a spoken test audio and converting the test audio into a computer text file;
Step S20: converting the computer text file into a standard audio;
Step S30: extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively, calculating the phoneme posterior probabilities of the test audio and the standard audio according to a hidden Markov model, determining the percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and outputting the standard audio and the percentage score.
The above method can output the corresponding computer text file from the test audio and output the corresponding standard audio from the computer text file; by extracting the mel-frequency cepstral coefficients of the test audio and the standard audio, the phoneme posterior probabilities of the test audio and the standard audio can be effectively determined, and the score of the test audio relative to the standard audio is determined from the phoneme posterior probabilities while the standard audio is output. This enables users to improve their own speaking ability according to a concrete score, effectively improves the user's spoken English ability, and has high practicability.
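As a rough orientation, steps S10 to S30 above can be sketched as a pipeline of three functions. This is only an illustrative skeleton: every function body below is a stand-in (a real system would call an ASR engine, a speech synthesizer, and the HMM-based scoring described later in this specification), and all names are invented for the example.

```python
def audio_to_text(test_audio):
    # Step S10: speech recognition (placeholder for a real ASR engine).
    return "hello world"

def text_to_standard_audio(text):
    # Step S20: synthesize a reference ("standard") pronunciation
    # (placeholder: a real system would return a TTS waveform).
    return [0.0] * 100

def percentage_score(test_audio, standard_audio):
    # Step S30: compare phoneme posteriors of the two audios
    # (placeholder value; the real computation is described below).
    return 85.0

def practice_session(test_audio):
    text = audio_to_text(test_audio)
    standard = text_to_standard_audio(text)
    score = percentage_score(test_audio, standard)
    # The system outputs both the standard audio and the percentage score.
    return standard, score
```

The skeleton only fixes the data flow between the three steps; each stage is elaborated in the embodiments below.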
The embodiment corresponding to Fig. 1 can be further refined. See Fig. 2, which is a schematic flow chart of the audio-to-text conversion provided by the invention. The conversion specifically comprises the following steps:
Step S11: converting the test audio into a speech waveform signal, performing spectrum or cepstrum analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, performing model recognition training on the acoustic feature values, and determining the corresponding acoustic model and language model;
Step S12: establishing, through the acoustic model, the association between the acoustic feature values and the sentence pronunciation modeling units, and determining the probability that a given text produces the corresponding speech;
Step S13: decomposing, through the language model, a complete sentence into single words according to the chain rule, and determining the probability of each current word;
Step S14: outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word.
The above technical scheme can effectively output the optimal text sequence from the test audio, which serves as the basis for the subsequent generation of the standard speech. Through this series of operations on the test audio, the accuracy and uniqueness of the output text sequence are ensured, providing data support for the subsequent scoring and the output of the standard audio.
On the basis of the embodiment corresponding to Fig. 2, a further improvement is made to ensure that the system achieves a good recognition effect. Specifically, before the mel-frequency cepstral coefficients of the test audio and the standard audio are extracted, and after the computer text file has been converted into the standard audio, the method further comprises the following steps:
Compensating, for the speech power spectra corresponding to the standard audio and the test audio respectively, the inherent decline and the high-frequency portion suppressed by the articulatory system;
Performing framing on the compensated standard audio and test audio.
The above processing of the test audio and the standard audio effectively guarantees the subsequent extraction of the mel-frequency cepstral coefficients of both audios and improves the efficiency of audio recognition.
Preferably, in the above technical scheme, extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively comprises the following steps:
Converting the time-domain waveform of each frame of the compensated standard audio and test audio into a frequency-domain representation;
Extracting the component frequency features of each frame respectively;
Applying a discrete cosine transform to the component frequency features to obtain the mel-frequency cepstral coefficients.
The frequency-domain conversion of the standard audio and the test audio and the extraction of the component frequency features of each frame effectively guarantee the extraction of the mel-frequency cepstral coefficients and ensure that this extraction is both accurate and efficient.
On the basis of the method embodiment corresponding to Fig. 1, the present invention also provides a system for practicing spoken English. See Fig. 3, which is a schematic block diagram of the system. The system comprises:
An audio conversion module, for receiving a spoken test audio and converting the test audio into a computer text file;
A text conversion module, for converting the computer text file into a standard audio;
An audio comparison module, for extracting the mel-frequency cepstral coefficients of the test audio and the standard audio respectively, calculating the phoneme posterior probabilities of the test audio and the standard audio according to a hidden Markov model, determining the percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and outputting the standard audio and the percentage score.
When scoring against the standard audio, the scoring must be built upon the concrete points of articulation that the user is able to produce; at the same time, the result fed back by the software must be close to the auditory judgment of a native English speaker.
The audio comparison module is based on an automatic speech recognition (ASR) module. The text output by the ASR system in the earlier speech recognition stage is converted into audio, which serves as the standard model and is compared with the practitioner's test audio in order to give the practitioner's test audio a score.
Preprocessing: first, the test speech and the standard reference pronunciation are preprocessed separately. Preprocessing of the pronunciation includes: 1) Pre-emphasis: compensating the inherent decline of the speech power spectrum and the high-frequency portion suppressed by the articulatory system, so as to reduce the influence of noise on the subsequent endpoint detection and feature extraction modules. 2) Framing and windowing: cutting the long, non-stationary speech into short 20-50 millisecond "frames" so as to satisfy the conditions of the Fourier transform. 3) Endpoint detection: dividing a segment of preprocessed speech into independent words as far as possible. The purpose of preprocessing is to guarantee that the system achieves a good recognition effect.
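The three preprocessing operations described above (pre-emphasis, framing with windowing, and endpoint detection) can be sketched in simplified form as follows. The pre-emphasis coefficient 0.97 and the energy-threshold endpoint rule are common textbook choices, not values taken from the patent.

```python
import math

def preemphasis(signal, alpha=0.97):
    # 1) Pre-emphasis: y[n] = x[n] - alpha * x[n-1] lifts the high
    #    frequencies that the speech power spectrum inherently loses.
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, hop):
    # 2) Framing + Hamming window: cut the non-stationary signal into
    #    short quasi-stationary frames before the Fourier transform.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
                       for i, x in enumerate(frame)])
    return frames

def voiced_frames(frames, threshold):
    # 3) Crude endpoint detection: keep frames whose short-time energy
    #    exceeds a threshold, approximating word boundaries.
    return [f for f in frames if sum(x * x for x in f) > threshold]
```

At an 8 kHz sample rate, a frame length of 160 samples with a hop of 80 corresponds to 20 ms frames with 50% overlap, which matches the 20-50 ms range stated above.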
Extracting the MFCCs: the speech to be tested is first preprocessed. Each frame is transformed from a time-domain waveform into the frequency domain by a fast Fourier transform (FFT); the component frequency features of the frame are obtained through a mel filter bank designed according to the auditory properties of the human ear; the MFCCs are then obtained after a discrete cosine transform (DCT).
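A minimal, pure-Python sketch of this MFCC pipeline (a naive DFT instead of an optimized FFT, a small triangular mel filter bank, then a DCT-II) might look as follows; the filter count, coefficient count, and sample rate are arbitrary illustrative choices, not values from the patent.

```python
import cmath
import math

def dft_magnitude(frame):
    # |DFT| of one frame, keeping only the non-negative frequencies.
    N = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / N)
                    for n, x in enumerate(frame)))
            for k in range(N // 2 + 1)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    # Triangular filters spaced evenly on the mel scale (human-ear model).
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    points = [low + i * (high - low) / (n_filters + 1)
              for i in range(n_filters + 2)]
    bins = [int((n_bins - 1) * mel_to_hz(m) / (sample_rate / 2.0))
            for m in points]
    banks = []
    for j in range(1, n_filters + 1):
        bank = [0.0] * n_bins
        for k in range(bins[j - 1], bins[j]):
            if bins[j] != bins[j - 1]:
                bank[k] = (k - bins[j - 1]) / (bins[j] - bins[j - 1])
        for k in range(bins[j], bins[j + 1]):
            if bins[j + 1] != bins[j]:
                bank[k] = (bins[j + 1] - k) / (bins[j + 1] - bins[j])
        banks.append(bank)
    return banks

def dct2(values, n_coeffs):
    # DCT-II decorrelates log filterbank energies into cepstral coefficients.
    N = len(values)
    return [sum(v * math.cos(math.pi * c * (n + 0.5) / N)
                for n, v in enumerate(values))
            for c in range(n_coeffs)]

def mfcc(frame, sample_rate=8000, n_filters=12, n_coeffs=6):
    power = [s * s for s in dft_magnitude(frame)]
    energies = [sum(p * w for p, w in zip(power, bank))
                for bank in mel_filterbank(n_filters, len(power), sample_rate)]
    log_energies = [math.log(e + 1e-10) for e in energies]  # avoid log(0)
    return dct2(log_energies, n_coeffs)
```

Production systems would use an FFT library and typically 20-40 filters with 12-13 coefficients; the structure (spectrum, mel filter bank, log, DCT) is the point of the sketch.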
HMM model: the hidden Markov model (Hidden Markov Model, HMM), as a statistical model of the speech signal, is the major technical model in every field of current speech processing. An HMM comprises five basic elements and three basic algorithms, among which the decoding algorithm, Viterbi, is also the basis of the pronunciation scoring algorithm in spoken English learning. For a given observation sequence and model λ = (A, B, π), the Viterbi algorithm can not only find a sufficiently good state sequence Q = q1 q2 ... qt to explain the observation sequence, but can also obtain the output probability corresponding to that path.
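A standard implementation of the Viterbi decoding step described above, which returns both a good state sequence and the probability of that path for a discrete-observation HMM λ = (A, B, π), could look like this; the toy parameters in the usage are invented, not a real speech model.

```python
def viterbi(obs, pi, A, B):
    # obs: observation indices; pi: initial state distribution;
    # A[i][j]: transition probability i -> j; B[i][o]: emission probability.
    n_states = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    back = []
    for o in obs[1:]:
        psi, new_delta = [], []
        for j in range(n_states):
            # Best predecessor state for state j at this time step.
            best_i = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        back.append(psi)
        delta = new_delta
    last = max(range(n_states), key=lambda s: delta[s])
    # Trace the backpointers to recover the best state sequence.
    path = [last]
    for psi in reversed(back):
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]
```

For long utterances, real decoders work in the log domain to avoid numerical underflow; the plain-probability form is kept here for readability.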
After the above processing, the phoneme posterior probabilities of the test speech against the standard reference model are available. If the process being carried out is the generation of grading parameters, experts are needed to give empirical marks to the pronunciations; from the correspondence between the phoneme posterior probabilities and the expert scores, the adaptive scoring parameters x and y can be trained, and the score function used for pronunciation scoring is then determined. If the operation being carried out is pronunciation scoring, the system substitutes the phoneme posterior probabilities of the test speech into the score function and finally obtains the pronunciation score.
Scoring algorithm:
The scoring process can be regarded as a pattern recognition process based on the HMM model. After feature extraction, let the output observation sequence of the speech to be scored be O = (O1, O2, ..., Ot), and let λ = (A, B, π) denote the standard reference HMM model, where π is the initial state distribution, A is the state transition probability matrix from S(t-1) to S(t), and B is the output probability matrix of the observation sequence corresponding to the state sequence. The model also contains a hidden state sequence S = (s1, s2, ..., st). Speech assessment is then the operation of obtaining the probability P(O | λ) of the input speech observation sequence O when the standard reference HMM model λ is known. The Viterbi algorithm is applied to the phonemes in the feature sequence to perform segmentation and alignment and to obtain the hidden state sequence S most likely to correspond to the observation sequence O. The HMM model is trained repeatedly and its parameters are updated, and the optimal probability P*(O | λ) of the HMM model matching the observation sequence is output; this optimal probability is the posterior probability score.
For each frame Ot, the posterior probability P(qi | Ot) of phoneme qi is calculated as:
P(qi | Ot) = P(Ot | qi) P(qi) / Σq P(Ot | q) P(q)
where P(Ot | qi) is the probability distribution of the observation vector Ot given phoneme qi, P(qi) is the prior probability of phoneme qi, and the denominator sums the probability of obtaining the observation Ot over the phonemes of all text positions. Taking the logarithm of the posterior probability of phoneme qi at each frame of segment i and accumulating the results yields the posterior probability score of phoneme qi over the i-th speech segment.
The posterior probability score of the entire sentence is then the average of the phoneme scores:
Score(O) = (1/N) Σ Score(qi)
where N is the number of phonemes in the sentence.
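Under the definitions above, the frame-level posterior, the per-phoneme log-posterior accumulation, and the sentence average can be sketched as follows; the likelihoods and priors passed in are toy values, not trained model outputs.

```python
import math

def phoneme_posterior(likelihoods, priors, phoneme):
    # Bayes rule: P(q_i | O_t) = P(O_t | q_i) P(q_i) / sum_q P(O_t | q) P(q)
    denom = sum(likelihoods[q] * priors[q] for q in priors)
    return likelihoods[phoneme] * priors[phoneme] / denom

def segment_score(frame_likelihoods, priors, phoneme):
    # Log posterior of the phoneme accumulated over the frames of its segment.
    return sum(math.log(phoneme_posterior(lk, priors, phoneme))
               for lk in frame_likelihoods)

def sentence_score(segment_scores):
    # Sentence score: average over the N phoneme segment scores.
    return sum(segment_scores) / len(segment_scores)
```

In a full system the per-frame likelihoods would come from the HMM emission distributions after Viterbi alignment, and a trained score function would map this raw posterior score to the percentage score output to the user.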
Considering that speaking rate is also an index for judging spoken proficiency, the pronunciation rate should be included in the judging standard, and a duration score can finally be defined for each phoneme in terms of a normalization function f(di), where di is the duration of the i-th segment corresponding to phoneme qi. The normalization makes the measure independent of the text and of the speaker: the speech duration is normalized by the rate of speech (ROS), i.e. the number of phonemes per unit duration within a sentence or over all of a speaker's pronunciations. Usually f(di) = ROS · di is taken.
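The ROS normalization described above can be illustrated as follows; the duration values in the usage are invented for the example.

```python
def rate_of_speech(n_phonemes, total_duration):
    # ROS: number of phonemes per unit of duration, measured over the
    # sentence (or over all of the speaker's pronunciations).
    return n_phonemes / total_duration

def normalized_durations(durations, ros):
    # f(d_i) = ROS * d_i: a duration measure independent of the text
    # and of the speaker's overall tempo.
    return [ros * d for d in durations]
```

A fast speaker (high ROS) and a slow speaker (low ROS) producing the same relative phoneme lengths then receive the same normalized durations, which is the point of the normalization.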
The above method can output the corresponding computer text file from the test audio and output the corresponding standard audio from the computer text file; by extracting the mel-frequency cepstral coefficients of the test audio and the standard audio, the phoneme posterior probabilities of the test audio and the standard audio can be effectively determined, and the score of the test audio relative to the standard audio is determined from the phoneme posterior probabilities while the standard audio is output. This enables users to improve their own speaking ability according to a concrete score, effectively improves the user's spoken English ability, and has high practicability.
Preferably, in the above technical scheme, the concrete operations performed by the audio conversion module to convert the test audio into a computer text file are as follows:
Converting the test audio into a speech waveform signal, performing spectrum or cepstrum analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, performing model recognition training on the acoustic feature values, and determining the corresponding acoustic model and language model;
Establishing, through the acoustic model, the association between the acoustic feature values and the sentence pronunciation modeling units, and determining the probability that a given text produces the corresponding speech;
Decomposing, through the language model, a complete sentence into single words according to the chain rule, and determining the probability of each current word;
Outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word.
Specifically, the audio conversion module is mainly composed of four major modules: front-end processing, an acoustic model, a language model, and a decoder.
The front-end processing module mainly takes the received speech waveform signal through preprocessing, performs spectrum or cepstrum analysis on the speech signal, and extracts the corresponding acoustic feature values for the recognition training of the models; the quality of feature extraction directly affects the precision of recognition.
The task of the acoustic model is to calculate p(X | W), i.e. the probability of producing a given segment of speech given the text sequence. The acoustic model is the major part of an automatic speech recognition system; it occupies most of the computing cost and decides the performance of the system. The acoustic model connects the observed features of the speech signal with the pronunciation modeling units of the sentence. Traditional speech recognition systems generally adopt an acoustic model based on GMM-HMM (Gaussian mixture hidden Markov model). In 2011, Dong Yu, Li Deng and others at Microsoft Research put forward an acoustic model based on context-dependent (Context Dependent, CD) deep neural networks and hidden Markov models (CD-DNN-HMM), which brought a qualitative improvement in the accuracy of speech recognition.
The language model (Language Model, LM) predicts the probability p(W) of generating a character (word) sequence. The language model generally uses the chain rule to decompose the probability of a sentence into the product of the probabilities of each of its words. If W is composed of w1, w2, ..., wn, then P(W) can be split into:
P(W) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, w2, ..., wn-1)
Each factor is the probability of the current word conditioned on all preceding words. To improve efficiency, the most common approach is to assume that the probability distribution of each word depends only on the last few words of the history. Such a language model is known as an n-gram model: in an n-gram model, the probability distribution of each word depends only on the preceding n-1 words. For example, in a 2-gram (bigram) model, the probability is split into the following form:
P(W) = P(w1) P(w2 | w1) P(w3 | w2) ... P(wn | wn-1)
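Under the bigram factorization above, the probability of a sentence can be computed as in this small sketch; the probability tables are toy values, and a real language model would additionally smooth unseen word pairs.

```python
def bigram_prob(words, unigram, bigram):
    # P(W) = P(w1) * product over k of P(w_k | w_{k-1})
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p
```

The same loop with a window of n-1 previous words generalizes to any n-gram order.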
Usually the training of the language model and that of the acoustic model are relatively independent. After each model has been trained, the two need to be combined through a decoding stage, for example by a formula of the form:
W* = argmax over W of p(X | W) p(W)
The final purpose of decoding is to combine the language model and the acoustic model and to obtain an optimal output sequence through search. The Viterbi algorithm (Viterbi Algorithm) is generally used in current mainstream decoders. In practice, these four modules run and condition each other simultaneously, pruning insufficiently promising hypotheses at every moment, so that an optimal solution is finally found within an acceptable time.
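The decoding step that combines the two models can be sketched as a simple argmax over candidate sentences in the log domain; the language-model weight and all scores below are illustrative values, and a real decoder searches a lattice rather than a fixed candidate list.

```python
def decode(candidates, acoustic_logp, lm_logp, lm_weight=1.0):
    # W* = argmax_W [ log p(X|W) + w * log p(W) ]: combine the acoustic
    # model score and the language model score, pick the best hypothesis.
    return max(candidates,
               key=lambda w: acoustic_logp[w] + lm_weight * lm_logp[w])
```

Note how the language model can overturn the acoustic model's preference: a hypothesis that sounds slightly closer to the audio may still lose to one that forms a far more probable word sequence.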
The above technical scheme can effectively output the optimal text sequence from the test audio, which serves as the basis for the subsequent generation of the standard speech. Through this series of operations on the test audio, the accuracy and uniqueness of the output text sequence are ensured, providing data support for the subsequent scoring and the output of the standard audio.
Preferably, in the above technical scheme, the audio comparison module is also used to compensate, for the speech power spectra corresponding to the standard audio and the test audio respectively, the inherent decline and the high-frequency portion suppressed by the articulatory system, and to perform framing on the compensated standard audio and test audio.
The above processing of the test audio and the standard audio effectively guarantees the subsequent extraction of the mel-frequency cepstral coefficients of both audios and improves the efficiency of audio recognition.
Preferably, in the above technical scheme, the audio comparison module is also used to convert the time-domain waveform of each frame of the compensated standard audio and test audio into a frequency-domain representation, extract the component frequency features of each frame respectively, and apply a discrete cosine transform to the component frequency features to obtain the mel-frequency cepstral coefficients.
The frequency-domain conversion of the standard audio and the test audio and the extraction of the component frequency features of each frame effectively guarantee the extraction of the mel-frequency cepstral coefficients and ensure that this extraction is both accurate and efficient.
A storage medium is also provided, on which program instructions are stored; when the program instructions are executed by a processor, the method described above is implemented.
The reader should understand that, in the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such schematic statements need not refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the features of the different embodiments or examples described in this specification.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the device and units described above may refer to the corresponding processes in the foregoing method embodiment, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be realized in other ways. For example, the device embodiments described above are merely exemplary; the division into units is only a division by logical function, and there may be other ways of division in actual implementation: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.
The above embodiments are intended to illustrate the present invention for implementation or use by those skilled in the art. Modifications to the above embodiments will be readily apparent to such persons; therefore, the present invention includes, but is not limited to, the above embodiments. Any method, process or product that conforms to the claims or the description and accords with the principles and the novel and inventive features disclosed herein falls within the scope of protection of the present invention.
Claims (9)
1. A method for practising spoken English, characterised by comprising the following steps:
receiving spoken test audio, and translating the test audio into a computer text file;
translating the computer text file into standard audio;
extracting the mel-frequency cepstral coefficients of the test audio and of the standard audio respectively, calculating the phoneme posterior probabilities of the test audio and of the standard audio according to a hidden Markov model, determining a percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and outputting the standard audio and the percentage score.
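The claim does not specify how the phoneme posterior probabilities are mapped to a percentage score, so the averaging-and-ratio formula below, and the function name `percentage_score`, are illustrative assumptions only — a minimal sketch of one way the scoring step could work:

```python
import numpy as np

def percentage_score(test_posteriors, standard_posteriors):
    """Map frame-level phoneme posterior probabilities of the test audio
    and the standard audio to a 0-100 score. Inputs are arrays of
    per-frame posteriors of the aligned phonemes, such as an HMM forced
    alignment would produce. The formula is illustrative, not the patent's."""
    test_ll = np.mean(np.log(np.clip(test_posteriors, 1e-10, 1.0)))
    std_ll = np.mean(np.log(np.clip(standard_posteriors, 1e-10, 1.0)))
    if test_ll == 0.0:          # all test posteriors are 1.0: perfect match
        return 100.0
    # The standard audio sets the ceiling; a test log-posterior further
    # below it yields a proportionally lower score.
    return round(100.0 * min(1.0, std_ll / test_ll), 1)
```

A test utterance whose posteriors match the standard audio scores 100; less confident posteriors score proportionally lower.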
2. The method for practising spoken English according to claim 1, characterised in that translating the test audio into a computer text file specifically comprises the following steps:
converting the test audio into a speech waveform signal, performing spectral or cepstral analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, and performing model recognition training on the acoustic feature values to determine the corresponding acoustic model and language model;
creating, via the acoustic model, the association between the acoustic feature values and the sentence-pronunciation modelling units, and determining the probability that a given text produces the corresponding speech;
decomposing, via the language model, the complete sentence into single words according to the chain rule, and determining the probability of each current word occurring;
outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word occurring.
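The chain-rule language model and its combination with the acoustic probability in claim 2 can be sketched as follows. The bigram approximation, the 1e-6 back-off floor, and the helper names are assumptions for illustration; the claim only requires decomposing the sentence into words by the chain rule and combining each word's probability with the acoustic probability to output the optimal text sequence:

```python
import math

def chain_rule_lm(sentence, bigram_probs):
    """Chain-rule decomposition of a complete sentence into single words,
    approximated here with bigrams: P(w1..wn) = prod P(wi | w(i-1))."""
    logp = 0.0
    prev = "<s>"                       # sentence-start token (assumed)
    for word in sentence.lower().split():
        p = bigram_probs.get((prev, word), 1e-6)  # back-off floor (assumed)
        logp += math.log(p)
        prev = word
    return logp

def best_text(candidates, acoustic_logp, bigram_probs):
    """Output the optimal text sequence: the candidate maximising
    acoustic log-probability plus language-model log-probability."""
    return max(candidates,
               key=lambda s: acoustic_logp[s] + chain_rule_lm(s, bigram_probs))
```

With equal acoustic scores, the language model breaks the tie in favour of the more probable word sequence.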
3. The method for practising spoken English according to claim 1, characterised in that, before the extracting of the mel-frequency cepstral coefficients of the test audio and the standard audio, and after the translating of the computer text file into standard audio, the method further comprises the following steps:
supplementing, for the standard audio and the test audio respectively, the high-frequency portion attenuated by the inherent decline of the speech power spectrum and suppressed by the articulatory system;
performing framing processing on the supplemented standard audio and test audio.
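The "supplementing the attenuated high-frequency portion" of claim 3 is conventionally done with a first-order pre-emphasis filter, followed by framing. A sketch under that assumption — the 0.97 coefficient and the 25 ms / 10 ms frame parameters are common defaults, not values from the claim:

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """Boost the high frequencies attenuated by the speech power
    spectrum's inherent decline and by the articulatory system,
    via the standard first-order filter y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split the emphasised signal into overlapping frames, e.g.
    25 ms frames with a 10 ms hop at a 16 kHz sampling rate."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])
```

Both audios (standard and test) would pass through the same two steps before feature extraction.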
4. The method for practising spoken English according to claim 3, characterised in that the extracting of the mel-frequency cepstral coefficients of the test audio and the standard audio specifically comprises the following steps:
converting the time-domain waveform of each frame of the supplemented standard audio and test audio into a frequency-domain representation;
extracting the component frequency features of each frame of audio respectively;
obtaining the mel-frequency cepstral coefficients after performing a discrete cosine transform on the component frequency features.
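Claim 4's per-frame pipeline (time domain → frequency domain → component frequency features → discrete cosine transform → MFCC) can be sketched as below, assuming the "component frequency features" are mel filterbank energies, as is standard in MFCC extraction; the filter count, FFT size, and cepstral order are illustrative defaults:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    """One frame: windowed FFT power spectrum (frequency domain),
    triangular mel filterbank energies ('component frequency features'),
    log, then a type-II DCT to obtain the MFCC vector."""
    n_fft = 512
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters evenly spaced on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    log_energies = np.log(fbank @ spec + 1e-10)
    # Type-II DCT decorrelates the log filterbank energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                   / (2 * n_filters))
    return basis @ log_energies
```

Applied to every frame of both the standard audio and the test audio, this yields the MFCC sequences whose phoneme posteriors the HMM then compares.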
5. A system for practising spoken English, characterised by comprising:
an audio conversion module, configured to receive spoken test audio and translate the test audio into a computer text file;
a text conversion module, configured to translate the computer text file into standard audio;
an audio comparison module, configured to extract the mel-frequency cepstral coefficients of the test audio and of the standard audio respectively, calculate the phoneme posterior probabilities of the test audio and of the standard audio according to a hidden Markov model, determine a percentage score of the test audio relative to the standard audio according to the phoneme posterior probabilities, and output the standard audio and the percentage score.
6. The system for practising spoken English according to claim 5, characterised in that the specific operations performed by the audio conversion module to translate the test audio into a computer text file are:
converting the test audio into a speech waveform signal, performing spectral or cepstral analysis on the speech waveform signal, extracting acoustic feature values corresponding to the speech waveform signal, and performing model recognition training on the acoustic feature values to determine the corresponding acoustic model and language model;
creating, via the acoustic model, the association between the acoustic feature values and the sentence-pronunciation modelling units, and determining the probability that a given text produces the corresponding speech;
decomposing, via the language model, the complete sentence into single words according to the chain rule, and determining the probability of each current word occurring;
outputting the optimal text sequence according to the probability that the given text produces the corresponding speech and the probability of each current word occurring.
7. The system for practising spoken English according to claim 5, characterised in that the audio comparison module is further configured to:
supplement, for the standard audio and the test audio respectively, the high-frequency portion attenuated by the inherent decline of the speech power spectrum and suppressed by the articulatory system; and
perform framing processing on the supplemented standard audio and test audio.
8. The system for practising spoken English according to claim 7, characterised in that the audio comparison module is further configured to:
convert the time-domain waveform of each frame of the supplemented standard audio and test audio into a frequency-domain representation;
extract the component frequency features of each frame of audio respectively; and
obtain the mel-frequency cepstral coefficients after performing a discrete cosine transform on the component frequency features.
9. A storage medium on which program instructions are stored, characterised in that, when the program instructions are executed by a processor, the method according to any one of claims 1 to 4 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811376417.2A CN109300339A (en) | 2018-11-19 | 2018-11-19 | A kind of exercising method and system of Oral English Practice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109300339A true CN109300339A (en) | 2019-02-01 |
Family
ID=65144144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811376417.2A Pending CN109300339A (en) | 2018-11-19 | 2018-11-19 | A kind of exercising method and system of Oral English Practice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109300339A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013057735A (en) * | 2011-09-07 | 2013-03-28 | National Institute Of Information & Communication Technology | Hidden markov model learning device for voice synthesis and voice synthesizer |
CN104517606A (en) * | 2013-09-30 | 2015-04-15 | 腾讯科技(深圳)有限公司 | Method and device for recognizing and testing speech |
CN103985392A (en) * | 2014-04-16 | 2014-08-13 | 柳超 | Phoneme-level low-power consumption spoken language assessment and defect diagnosis method |
CN103985391A (en) * | 2014-04-16 | 2014-08-13 | 柳超 | Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation |
CN104732977A (en) * | 2015-03-09 | 2015-06-24 | 广东外语外贸大学 | On-line spoken language pronunciation quality evaluation method and system |
CN104810017A (en) * | 2015-04-08 | 2015-07-29 | 广东外语外贸大学 | Semantic analysis-based oral language evaluating method and system |
CN107886968A (en) * | 2017-12-28 | 2018-04-06 | 广州讯飞易听说网络科技有限公司 | Speech evaluating method and system |
CN108305639A (en) * | 2018-05-11 | 2018-07-20 | 南京邮电大学 | Speech-emotion recognition method, computer readable storage medium, terminal |
Non-Patent Citations (1)
Title |
---|
涂惠燕 (Tu Huiyan): "Speech Recognition Technology for Spoken English Learning on Mobile Device Platforms", China Masters' Theses Full-text Database, Information Science and Technology Series * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111640452A (en) * | 2019-03-01 | 2020-09-08 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN109979257A (en) * | 2019-04-27 | 2019-07-05 | 深圳市数字星河科技有限公司 | A method of partition operation is carried out based on reading English auto-scoring and is precisely corrected |
CN109979257B (en) * | 2019-04-27 | 2021-01-08 | 深圳市数字星河科技有限公司 | Method for performing accurate splitting operation correction based on English reading automatic scoring |
CN110797049A (en) * | 2019-10-17 | 2020-02-14 | 科大讯飞股份有限公司 | Voice evaluation method and related device |
CN112837679A (en) * | 2020-12-31 | 2021-05-25 | 北京策腾教育科技集团有限公司 | Language learning method and system |
CN115346421A (en) * | 2021-05-12 | 2022-11-15 | 北京猿力未来科技有限公司 | Spoken language fluency scoring method, computing device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107221318B (en) | English spoken language pronunciation scoring method and system | |
CN103928023B (en) | A kind of speech assessment method and system | |
Shobaki et al. | The OGI kids’ speech corpus and recognizers | |
CN101751919B (en) | Spoken Chinese stress automatic detection method | |
CN109300339A (en) | A kind of exercising method and system of Oral English Practice | |
WO2006034200A2 (en) | Method and system for the automatic generation of speech features for scoring high entropy speech | |
CN107886968B (en) | Voice evaluation method and system | |
Yin et al. | Automatic cognitive load detection from speech features | |
CN106653002A (en) | Literal live broadcasting method and platform | |
Mohammed et al. | Quranic verses verification using speech recognition techniques | |
Shah et al. | Effectiveness of PLP-based phonetic segmentation for speech synthesis | |
JP2001166789A (en) | Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
Sinha et al. | Acoustic-phonetic feature based dialect identification in Hindi Speech | |
Chauhan et al. | Emotion recognition using LP residual | |
Hämäläinen et al. | Improving speech recognition through automatic selection of age group–specific acoustic models | |
CN110853669B (en) | Audio identification method, device and equipment | |
Gültekin et al. | Turkish dialect recognition using acoustic and phonotactic features in deep learning architectures | |
Hanani et al. | Palestinian Arabic regional accent recognition | |
Fatima et al. | Vowel-category based short utterance speaker recognition | |
Cahyaningtyas et al. | HMM-based indonesian speech synthesis system with declarative and question sentences intonation | |
Hanani et al. | Speech-based identification of social groups in a single accent of British English by humans and computers | |
Li et al. | English sentence pronunciation evaluation using rhythm and intonation | |
Rai et al. | An efficient online examination system using speech recognition | |
Lachhab et al. | Improving the recognition of pathological voice using the discriminant HLDA transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190201 |