CN109545243A - Pronunciation quality evaluating method, device, electronic equipment and storage medium - Google Patents

Pronunciation quality evaluating method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN109545243A
CN109545243A (application number CN201910062339.7A)
Authority
CN
China
Prior art keywords
phoneme
voice
estimate
evaluated
evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910062339.7A
Other languages
Chinese (zh)
Other versions
CN109545243B (en)
Inventor
刘顺鹏
钟贵平
李宝祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201910062339.7A priority Critical patent/CN109545243B/en
Publication of CN109545243A publication Critical patent/CN109545243A/en
Application granted granted Critical
Publication of CN109545243B publication Critical patent/CN109545243B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to the technical field of speech recognition, and discloses a pronunciation quality evaluation method and apparatus, an electronic device, and a storage medium. The method includes: in the speech to be evaluated, determining the audio frames corresponding to each phoneme of the reference text and the matching probability between each phoneme and its corresponding audio frames, the reference text being the text to which the speech to be evaluated corresponds; for each phoneme, calculating a pronunciation accuracy evaluation value of the phoneme according to the phoneme's matching probability and its corresponding audio frames; and obtaining an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme. By setting a weight value for each phoneme, the technical solution provided by the embodiments of the present invention widens the gap between the accuracy evaluation values of well-pronounced and poorly pronounced speech, improving the accuracy and credibility of pronunciation quality evaluation.

Description

Pronunciation quality evaluating method, device, electronic equipment and storage medium
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a pronunciation quality evaluation method and apparatus, an electronic device, and a storage medium.
Background technique
With the development of the Internet, Internet-based language learning applications have also developed rapidly. In language learning, besides grammar and vocabulary, an important aspect is listening and speaking ability, especially speaking. In existing language learning applications, the user records speech through the recording device of a user terminal, and the system compares the recorded speech against an existing acoustic model according to the reference text corresponding to that speech, so as to give the user a pronunciation score for the whole sentence and feedback on whether each word is pronounced correctly. The accuracy of the pronunciation evaluation method therefore directly affects the user's learning.
Currently, spoken pronunciation is mainly evaluated with the GOP (Goodness of Pronunciation) algorithm: the GOP algorithm computes a pronunciation accuracy evaluation value for each phoneme of the reference text corresponding to the speech recorded by the user; the pronunciation accuracy evaluation value of a word is obtained by averaging the values of its phonemes; and the pronunciation score of the speech is the average of the values of all words in the reference text. However, the existing GOP algorithm yields an accuracy evaluation value per word, and the granularity of a word is relatively coarse, so it cannot reflect a more detailed quality evaluation result. As a result, the pronunciation quality evaluation is not accurate enough and has low credibility.
Summary of the invention
Embodiments of the present invention provide a pronunciation quality evaluation method and apparatus, an electronic device, and a storage medium, to solve the problem in the prior art that pronunciation quality evaluation is not accurate enough and has low credibility.
In a first aspect, an embodiment of the invention provides a pronunciation quality evaluation method, comprising:
in the speech to be evaluated, determining the audio frames corresponding to each phoneme of the reference text and the matching probability between each phoneme and its corresponding audio frames, the reference text being the text to which the speech to be evaluated corresponds;
for each phoneme, calculating a pronunciation accuracy evaluation value of the phoneme according to the phoneme's matching probability and its corresponding audio frames; and
obtaining an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme.
In a second aspect, an embodiment of the invention provides a pronunciation quality evaluation apparatus, comprising:
a determining module, configured to determine, in the speech to be evaluated, the audio frames corresponding to each phoneme of the reference text and the matching probability between each phoneme and its corresponding audio frames, the reference text being the text to which the speech to be evaluated corresponds;
a phoneme accuracy computing module, configured to calculate, for each phoneme, a pronunciation accuracy evaluation value of the phoneme according to the phoneme's matching probability and its corresponding audio frames; and
an accuracy computing module, configured to obtain an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme.
In a third aspect, an embodiment of the invention provides an electronic device, comprising a transceiver, a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the transceiver sends and receives data under the control of the processor, and the processor implements the steps of any of the above methods when executing the computer program.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium on which computer program instructions are stored, the instructions implementing the steps of any of the above methods when executed by a processor.
In the technical solutions provided by the embodiments of the present invention, when the pronunciation accuracy of a word or sentence is evaluated based on phoneme-level pronunciation accuracy, setting a weight value for each phoneme raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap between the accuracy evaluation values of well-pronounced and poorly pronounced speech, and improves the accuracy and credibility of pronunciation quality evaluation.
Detailed description of the invention
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the pronunciation quality evaluation method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the pronunciation quality evaluation method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the pronunciation quality evaluation method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the pronunciation quality evaluation apparatus provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the pronunciation quality evaluation apparatus provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the electronic device provided by an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings.
For ease of understanding, terms used in the embodiments of the present invention are explained below:
The GOP (Goodness of Pronunciation) algorithm was proposed by Silke Witt of the Massachusetts Institute of Technology in her doctoral thesis. The basic idea of the GOP algorithm is to exploit the reference text that is known in advance: the speech is force-aligned (forced alignment) with its corresponding reference text to identify the speech segment (i.e., multiple consecutive audio frames in the speech) corresponding to each phoneme of the reference text, and then, given that speech segment as the observation, the matching probability between the segment and the phoneme in the reference text is calculated. A higher matching probability indicates a more accurate pronunciation; a lower one, a poorer pronunciation. Intuitively, the GOP algorithm computes the likelihood that the input speech corresponds to the known text; the higher the likelihood, the more standard the pronunciation.
A phoneme (phone) is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes fall into two broad classes, vowels and consonants; for example, vowels include a, e, and ai, and consonants include p, t, and h.
Any number of elements in the drawings is for illustration rather than limitation, and any naming is used only for distinction and carries no limiting meaning.
In practice, spoken pronunciation is mainly evaluated with the GOP algorithm: the pronunciation accuracy evaluation value of each phoneme of the reference text corresponding to the user's recorded speech is calculated by the GOP algorithm; the average of the values of the phonemes of each word gives the word's pronunciation accuracy evaluation value; and the average of the values of all words in the reference text is taken as the pronunciation score of the speech. The inventors found that, when the GOP algorithm is used to calculate word pronunciation accuracy, the score it outputs is more sensitive to the pronunciation accuracy of certain phonemes, i.e., how well those phonemes are pronounced causes a large difference in score, and less sensitive to the pronunciation accuracy of other phonemes, i.e., whether those phonemes are pronounced well or badly, the resulting difference in score is small. For example, when two speakers produce the same vowel, the first pronouncing it very well and the second poorly, the first speaker's score is much higher than the second's; but when two speakers produce the same consonant, the first very well and the second poorly, their scores differ little. Therefore, if the accuracy evaluation value of the speech is calculated with the existing method of averaging the phonemes' pronunciation accuracy evaluation values, the share of the sensitive phonemes in the final value is diluted, which narrows the gap between the values of well-pronounced and poorly pronounced speech and reduces the accuracy of the pronunciation accuracy evaluation value. In addition, the inventors also found that pronunciation accuracy is currently the only dimension used to evaluate speech; the existing GOP algorithm yields a per-word accuracy, and simply averaging the accuracies of all words in a sentence tends to ignore the relationships between words during pronunciation, so that a fluent, complete utterance and a less fluent or incomplete one receive pronunciation scores with little discrimination, making the scores insufficiently objective and accurate.
To this end, the inventors considered that, when evaluating the pronunciation accuracy of a word or sentence based on phoneme-level pronunciation accuracy, setting a weight value for each phoneme raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap between the accuracy evaluation values of well-pronounced and poorly pronounced speech, and improves the accuracy and credibility of pronunciation quality evaluation. In addition, on the basis of the accuracy evaluation value of the speech to be evaluated, a completeness evaluation value and a fluency evaluation value of the pronunciation are determined, and the pronunciation score of the speech to be evaluated is determined by combining its accuracy, completeness, and fluency evaluation values. The completeness and fluency evaluation values introduced here fully account for the relationships between words during pronunciation and provide sentence-level scoring indicators; by combining word-level and sentence-level indicators, the pronunciation score becomes more comprehensive, objective, and accurate, and its credibility is improved.
Having introduced the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
Refer first to Fig. 1, a schematic diagram of an application scenario of the pronunciation quality evaluation method provided by an embodiment of the present invention. A user 10 interacts with a user terminal 11 through an application program on the terminal. The user terminal 11 can display or play a reference text, and the user 10 reads the reference text aloud; the application activates a voice acquisition device 12 (such as a microphone) built into or connected to the user terminal 11 to capture the user's reading of the reference text as the speech to be evaluated. The application sends the speech to be evaluated and the reference text to a server 13, which evaluates the pronunciation quality of the speech according to the speech and the reference text, obtains a pronunciation score for the speech, and feeds the score back to the user terminal 11, which displays it.
In this application scenario, the user terminal 11 and the server 13 communicate over a network, which may be a local area network, a wide area network, or the like. The user terminal 11 may be a portable device (e.g., a mobile phone, tablet, or laptop) or a personal computer (PC). Mobile phones, tablets, and laptops generally have a built-in microphone, while a personal computer can capture the user's speech through an external voice acquisition device. The server 13 may be any device capable of providing speech recognition and pronunciation quality evaluation services.
In addition, the pronunciation quality evaluation method provided by the embodiments of the present invention can also be executed locally on the user terminal. Specifically, the user 10 interacts with the user terminal 11 through the application program; the terminal displays or plays the reference text, the user 10 reads it aloud, and the application activates the built-in or external voice acquisition device 12 (such as a microphone) to capture the user's reading as the speech to be evaluated. The terminal then evaluates the pronunciation quality of the speech according to the speech and the reference text, obtains the pronunciation score, and displays it.
The technical solutions provided by the embodiments of the present invention are described below with reference to the application scenario shown in Fig. 1.
Referring to Fig. 2, an embodiment of the present invention provides a pronunciation quality evaluation method comprising the following steps:
S201: in the speech to be evaluated, determine the audio frames corresponding to each phoneme of the reference text and the matching probability between each phoneme and its corresponding audio frames, the reference text being the text to which the speech to be evaluated corresponds.
In this embodiment, the reference text is usually a complete sentence and contains at least one word. The phone string corresponding to the reference text can be determined by looking it up in a pronunciation dictionary. For example, if the reference text is "good morning", the corresponding phone string contains eight phonemes: [g], [u], [d], [m], [ɔ], [n], [i], [ŋ]. If the reference text is "你好" ("hello"), the corresponding phone string contains four phonemes: [n], [i], [h], [ao]. In a specific implementation, the pronunciation dictionary of the language of the speech to be evaluated is selected; for example, if the speech to be evaluated is English, an English pronunciation dictionary is selected.
In a specific implementation, step S201 can be realized by alignment processing. Before the alignment, the speech to be evaluated needs to be preprocessed: it is cut into several audio frames and an acoustic feature vector is extracted for each frame. The acoustic feature vector is a multidimensional feature vector, each audio frame is represented by one such vector, and the speech to be evaluated is thereby converted into a sequence of audio frames. A frame is generally 10-30 ms; framing can be implemented with a moving window function, with overlap between adjacent frames to avoid omitting signal at the window boundaries. The extracted acoustic features may be Fbank features, MFCC (Mel Frequency Cepstral Coefficients) features, spectrogram features, or the like. The extraction of Fbank and MFCC features is prior art and is not repeated here.
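The framing step just described can be sketched as follows with an overlapping moving window. The 25 ms frame length, 10 ms hop, 16 kHz sample rate, and Hamming window are assumed, illustrative values within the 10-30 ms range the text mentions, not parameters taken from the patent.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping frames (illustrative sketch).

    frame_ms/hop_ms are typical assumed values; the text only states that
    a frame is generally 10-30 ms and that adjacent frames overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Apply a window function per frame; feature extraction (Fbank, MFCC,
    # spectrogram) would then run on each windowed frame.
    return frames * np.hamming(frame_len)

# 1 second of audio at 16 kHz -> 25 ms frames every 10 ms
frames = frame_signal(np.zeros(16000), 16000)
print(frames.shape)  # (98, 400)
```

Each row of the result is one audio frame, ready to be turned into the multidimensional acoustic feature vector the text describes.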
In a specific implementation, the alignment processing is roughly as follows. The acoustic feature vectors of the speech to be evaluated are input into an alignment model to obtain a conditional probability matrix, which describes the conditional probability that each audio frame is recognized as any given phoneme; that is, for one audio frame, the matrix gives the conditional probabilities between that frame and multiple phonemes, for example the conditional probability that the frame is recognized as [u], the conditional probability that it is recognized as some other phoneme, and so on. The conditional probability matrix is then input into a decoder for path search, with the phone string of the reference text as a constraint on the search, yielding the audio frames corresponding to each phoneme in that phone string; in general, one phoneme corresponds to multiple consecutive audio frames in the speech to be evaluated, the alignment model having been trained in advance over all phonemes. The alignment model may be a DNN (Deep Neural Network)-HMM model, or may be implemented with a CNN (Convolutional Neural Network) + LSTM (Long Short-Term Memory) network. The state transition probabilities used during decoding can be determined by a GMM (Gaussian Mixture Model)-HMM (Hidden Markov Model) trained in advance. Since the correspondence between each phoneme of the reference text and the audio frames has been determined, for each phoneme of the reference text the conditional probabilities between the phoneme and its corresponding audio frames can be read from the conditional probability matrix, and from these the matching probability between the phoneme and its audio frames is determined. For example, if phoneme [u] corresponds to 10 audio frames, the conditional probabilities between these 10 frames and [u] are taken from the matrix, and their average, maximum, or median is taken as the matching probability between [u] and its corresponding audio frames.
In this embodiment, alignment processing can be performed on the speech to be evaluated and the reference text by the alignment model, so as to determine the correspondence between each phoneme of each word in the reference text and the speech segments (i.e., multiple consecutive audio frames) of the speech to be evaluated.
S202: for each phoneme, calculate the pronunciation accuracy evaluation value of the phoneme according to the matching probability of the phoneme and the phoneme's corresponding audio frames.
In a specific implementation, the GOP value can be used as the pronunciation accuracy evaluation value. Specifically, the GOP value of a phoneme can be calculated by the following formula:

GOP(p) = log(P(p | o)) / NF(p)

where p is a phoneme in the reference text, P(p | o) is the matching probability of phoneme p, NF(p) is the number of audio frames corresponding to phoneme p, and o denotes the audio frames corresponding to phoneme p.
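The per-phoneme GOP computation can be sketched as below. This assumes the standard log-probability form of GOP, the log of the matching probability normalized by the frame count NF(p), so values are at most 0 and closer to 0 is better; the worked examples later in the text quote GOP values in the 0-1 range, which suggests a rescaled variant whose exact form is not given here.

```python
import math

def gop(matching_prob, n_frames):
    """GOP(p) = log(P(p | o)) / NF(p): the log matching probability of
    phoneme p normalized by the number NF(p) of aligned audio frames.
    Values are <= 0; closer to 0 means a more standard pronunciation."""
    if matching_prob <= 0:
        raise ValueError("matching probability must be positive")
    return math.log(matching_prob) / n_frames

# A phoneme matched with probability 0.8 over 10 frames scores closer to 0
# than one matched with probability 0.4 over the same 10 frames.
print(gop(0.8, 10) > gop(0.4, 10))  # True
```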
S203: obtain the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme.
In a specific implementation, the pronunciation accuracy evaluation values of the phonemes can be weighted according to the weight values determined in advance for the phonemes, so as to obtain the accuracy evaluation value of the speech to be evaluated.
For example, the phonemes of the word "good" are [g], [u], [d]. Suppose the weights of [g] and [d] are 0.15 each and the weight of [u] is 0.7. After user A, who pronounces well, says "good", the GOP values obtained are 0.9 for [g], 0.8 for [u], and 0.8 for [d]; after weighting, the accuracy evaluation value of the speech "good" is 0.815. After user B, who pronounces poorly, says "good", the GOP values obtained are 0.85 for [g], 0.6 for [u], and 0.8 for [d]; after weighting, the accuracy evaluation value is 0.6675. If no weights were set, user A's accuracy evaluation value for "good" would be 0.83 and user B's would be 0.75; the two values are close together and cannot distinguish the better pronunciation from the poorer one well.
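This worked example can be reproduced directly as a weighted sum; the 0.15/0.7/0.15 weights and the per-phoneme GOP values are the ones given in the text.

```python
def weighted_accuracy(gop_values, weights):
    """Accuracy evaluation value of a word as the weighted sum of its
    phonemes' GOP values (weights predetermined per phoneme, summing to 1)."""
    return sum(g * w for g, w in zip(gop_values, weights))

weights = [0.15, 0.7, 0.15]                                    # [g], [u], [d]
print(round(weighted_accuracy([0.9, 0.8, 0.8], weights), 4))   # 0.815  (user A)
print(round(weighted_accuracy([0.85, 0.6, 0.8], weights), 4))  # 0.6675 (user B)
# Unweighted averages would be ~0.83 and 0.75: much harder to tell apart.
```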
Clearly, when the accuracy evaluation value of a word or sentence is calculated based on phoneme accuracy, weighting the phonemes widens the gap between the accuracy evaluation values of well-pronounced and poorly pronounced speech, and improves the accuracy and credibility of pronunciation quality evaluation.
In this embodiment, the weight value of each phoneme can be determined in advance from test results in the actual application scenario, without limitation here. For example, testing shows that the accuracy evaluation value output by the pronunciation accuracy evaluation algorithm is more sensitive to the accuracy of vowel pronunciation and less sensitive to the accuracy of consonant pronunciation; therefore, if the existing averaging method is used to calculate the pronunciation accuracy evaluation value of the speech, the share of the vowels in the final value is diluted, which narrows the gap between the values of well-pronounced and poorly pronounced speech and reduces the accuracy of the pronunciation accuracy evaluation value. For this reason, in this embodiment, among the weight values determined in advance for the phonemes, the weights of vowels are greater than those of consonants.
In a specific implementation, all vowels may share one weight value and all consonants another. In that case, the average of the pronunciation accuracy evaluation values of all vowels in the reference text and the average of those of all consonants are calculated, the two averages are weighted, and the weighted result is taken as the accuracy evaluation value of the speech to be evaluated. The specific settings of the vowel weight and the consonant weight are not limited here. Of course, a weight value may also be set separately for each phoneme.
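One way to realize the shared vowel/consonant weights described above is sketched below. The vowel set and the 0.7/0.3 class weights are illustrative assumptions, since the text deliberately leaves the concrete values open; it requires only that the vowel weight exceed the consonant weight.

```python
VOWEL_PHONEMES = {"a", "e", "i", "o", "u", "ai", "ao"}  # illustrative set only

def class_weighted_accuracy(phoneme_gops, vowel_weight=0.7, consonant_weight=0.3):
    """Average the vowel GOPs and the consonant GOPs separately, then weight
    the two class averages (vowel weight > consonant weight, per the text).
    phoneme_gops: list of (phoneme, gop_value) pairs from the reference text."""
    vowels = [g for p, g in phoneme_gops if p in VOWEL_PHONEMES]
    consonants = [g for p, g in phoneme_gops if p not in VOWEL_PHONEMES]
    vowel_avg = sum(vowels) / len(vowels) if vowels else 0.0
    consonant_avg = sum(consonants) / len(consonants) if consonants else 0.0
    return vowel_weight * vowel_avg + consonant_weight * consonant_avg

# "good" = [g], [u], [d] with the GOP values from user A in the earlier example
score = class_weighted_accuracy([("g", 0.9), ("u", 0.8), ("d", 0.8)])
print(round(score, 3))  # 0.815
```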
In a specific implementation, a cutoff threshold may also be set: when the pronunciation accuracy evaluation value of a phoneme falls below the cutoff threshold, the weight of that phoneme is set to 0 when calculating the accuracy evaluation value of the speech to be evaluated from the phonemes' pronunciation accuracy evaluation values and the weights determined in advance for them. For example, the phonemes of the word "good" are [g], [u], [d]; after the user says "good", the GOP values obtained are 0.9 for [g], 0.09 for [u], and 0.8 for [d]. Assuming a cutoff threshold of 0.1, the weight of [u] is adjusted to 0, and the accuracy of the speech "good" obtained after weighting is 0.57.
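The cutoff behaviour can be sketched as follows. To reproduce the 0.57 quoted in this example, the sketch assumes uniform weights of 1/3 per phoneme (zeroing [u] then leaves 0.9/3 + 0.8/3 ≈ 0.57); with the 0.15/0.7/0.15 weights of the earlier example the result would differ, so the weighting used here is an assumption.

```python
def truncated_accuracy(gop_values, weights, cutoff=0.1):
    """Weighted sum of phoneme GOP values, with the weight of any phoneme
    whose GOP falls below the cutoff threshold set to 0."""
    return sum(g * (w if g >= cutoff else 0.0)
               for g, w in zip(gop_values, weights))

uniform = [1 / 3] * 3                                   # assumed uniform weights
score = truncated_accuracy([0.9, 0.09, 0.8], uniform)   # [g], [u], [d]
print(round(score, 2))  # 0.57
```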
In the pronunciation quality evaluation method of the embodiment of the present invention, when the pronunciation accuracy of a word or sentence is evaluated based on phoneme-level pronunciation accuracy, setting a weight value for each phoneme raises the contribution of some phonemes to the pronunciation accuracy evaluation value, widens the gap between the accuracy evaluation values of well-pronounced and poorly pronounced speech, and improves the accuracy and credibility of pronunciation quality evaluation.
On the basis of any of the above embodiments, as shown in Fig. 3, the method of the embodiment of the present invention further includes the following processing steps:
S204: determine the completeness evaluation value of the speech to be evaluated.
In this embodiment, the speech to be evaluated can be recognized by an existing speech recognition method and converted into a corresponding recognized text; the specific speech recognition method is prior art and is not repeated here.
In a specific implementation, the completeness evaluation value of the speech to be evaluated is determined from the number of words contained in the recognized text corresponding to the speech to be evaluated and the number of words contained in the corresponding reference text, the recognized text being the text obtained by performing speech recognition on the speech to be evaluated. Specifically, the completeness evaluation value can be determined from the difference between the number of words in the recognized text and the number of words in the reference text. For example, the completeness evaluation value I of the speech to be evaluated can be calculated by the following formula:

I = 1 - |N0 - N| / N0

where N0 is the number of words contained in the reference text and N is the number of words contained in the recognized text. It should be noted that the above formula for calculating the completeness evaluation value is only an example; in practice, other formulas may be chosen.
For example, if the recognized text contains 9 words but the reference text contains 10 words, the speech recognition result obtained from the speech to be evaluated is clearly missing a word. This may be because the user's pronunciation was non-standard, causing the recognizer to miss a word or to recognize two words as one, or because the user skipped a word when reading the reference text aloud. According to the above formula for the completeness evaluation value, the completeness evaluation value of the speech to be evaluated is 0.9. If the recognized text contains 11 words but the reference text contains 10 words, the speech recognition result contains an extra word, possibly because non-standard pronunciation caused one word to be recognized as two; according to the above formula, the completeness evaluation value of the speech to be evaluated is again 0.9.
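The completeness calculation can be sketched as below. The formula I = 1 - |N0 - N| / N0 is a reconstruction consistent with both worked examples (a missing word and an extra word each yield 0.9), and the function name is illustrative:

```python
def completeness(n_recognized, n_reference):
    """Completeness evaluation value I = 1 - |N0 - N| / N0.

    Penalizes the word-count mismatch between the recognized text (N)
    and the reference text (N0), whether words are missing or extra.
    """
    return 1.0 - abs(n_reference - n_recognized) / n_reference

print(completeness(9, 10))   # missing word  -> 0.9
print(completeness(11, 10))  # extra word    -> 0.9
```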
S205: determine the fluency evaluation value of the speech to be evaluated.
In specific implementations, the fluency of the speech to be evaluated can be determined by the following steps: for each phoneme corresponding to the reference text, determine the actual pronunciation duration of the phoneme from the audio frames corresponding to the phoneme, and determine the fluency evaluation value of the phoneme from the actual pronunciation duration of the phoneme and the reference pronunciation duration corresponding to the phoneme; then determine the fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the phonemes corresponding to the reference text. The closer the actual pronunciation duration of a phoneme in the speech to be evaluated is to its reference pronunciation duration, the more fluently the user pronounced that phoneme. For example, in practical applications, the fluency evaluation value F of a phoneme can be calculated by the following formula:

F = min(T, T0) / max(T, T0)

where T0 is the reference pronunciation duration of the phoneme and T is the actual pronunciation duration of the phoneme.
In this embodiment, the actual pronunciation duration can be determined from the number of audio frames corresponding to the phoneme and the duration of one audio frame. For example, if the phoneme [g] corresponds to 30 audio frames and each frame lasts 20 ms, the actual pronunciation duration of [g] is 600 ms; assuming the reference pronunciation duration of [g] is 400 ms, the fluency evaluation value of [g] in the speech to be evaluated is 0.667. Similarly, if the phoneme [iː] corresponds to 30 audio frames of 20 ms each, its actual pronunciation duration is 600 ms; assuming the reference pronunciation duration of [iː] is 1000 ms, the fluency evaluation value of [iː] in the speech to be evaluated is 0.6.
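A minimal sketch of the per-phoneme fluency computation, assuming the ratio form F = min(T, T0) / max(T, T0), which reproduces both worked examples (600 ms vs 400 ms gives 0.667; 600 ms vs 1000 ms gives 0.6):

```python
def phoneme_fluency(n_frames, frame_ms, ref_ms):
    """Fluency of one phoneme as the ratio of actual to reference duration.

    The actual pronunciation duration T is the number of aligned audio
    frames times the per-frame duration; F approaches 1 as T approaches
    the reference duration T0, whether the phoneme was too long or too short.
    """
    actual_ms = n_frames * frame_ms  # actual pronunciation duration T
    return min(actual_ms, ref_ms) / max(actual_ms, ref_ms)

print(round(phoneme_fluency(30, 20, 400), 3))   # [g]: too long  -> 0.667
print(round(phoneme_fluency(30, 20, 1000), 3))  # [iː]: too short -> 0.6
```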
In specific implementations, the fluency evaluation values of the phonemes can be weighted according to the weight values determined in advance for each phoneme, to obtain the fluency evaluation value of the speech to be evaluated.
In specific implementations, it is also possible to determine the fluency evaluation value of each word in the reference text from the fluency evaluation values of its phonemes, and then determine the fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the words.
The fluency evaluation value of each word in the reference text can be calculated from the fluency evaluation values of the phonemes as follows:
For each word, calculate the average of the fluency evaluation values of the phonemes corresponding to the word, to obtain the first fluency evaluation value of the word. Calculate the first time length corresponding to the word, which is the time span from the first audio frame corresponding to the first phoneme of the word to the last audio frame corresponding to the last phoneme of the word; calculate the sum of the actual pronunciation durations of the phonemes corresponding to the word, to obtain the second time length of the word; and determine the second fluency evaluation value of the word from the first and second time lengths. Then obtain the fluency evaluation value of the word from its first and second fluency evaluation values.
Specifically, the first and second fluency evaluation values of the word can be combined by weighting, with the weighted result taken as the fluency evaluation value of the word. The weights used in the weighting can be freely set according to the actual situation and are not limited here.
Of course, in specific implementations it is also possible to calculate only the first fluency evaluation value of the word and use it directly as the fluency evaluation value of the word, or to calculate only the second fluency evaluation value of the word and use it directly as the fluency evaluation value of the word.
The first fluency evaluation value of a word is calculated as follows: for each word in the reference text, calculate the average of the fluency evaluation values of the phonemes corresponding to the word, to obtain the first fluency evaluation value of the word.
For example, the word "good" corresponds to the phonemes [g], [u], [d]. Assuming the fluency evaluation values of [g], [u] and [d] are 0.9, 0.8 and 0.84 respectively, the first fluency evaluation value of "good" is 0.847. Further, for each word in the reference text, the fluency evaluation values of the phonemes corresponding to the word can be weighted according to the weight values determined in advance for each phoneme, to obtain the first fluency evaluation value of the word.
The second fluency evaluation value of a word is determined from the blank audio frames between its phonemes. A blank audio frame is an audio frame that, after the alignment processing, does not belong to any phoneme; the more blank audio frames there are between two adjacent phonemes within the same word, the less fluently the user read the word. Specifically, for each word in the reference text, calculate the first time length corresponding to the word, calculate the sum of the actual pronunciation durations of the phonemes corresponding to the word to obtain the second time length of the word, and determine the second fluency evaluation value of the word from the first and second time lengths, where the first time length is the time span from the first audio frame corresponding to the first phoneme of the word to the last audio frame corresponding to the last phoneme.
For example, the word "morning" corresponds to the phonemes [m], [ɔː], [n], [ɪ], [ŋ]. Suppose that in the speech to be evaluated [m] corresponds to audio frames 11-40, [ɔː] to frames 41-80, [n] to frames 101-130, [ɪ] to frames 131-160, and [ŋ] to frames 161-190. The first time length of "morning" is then the time span from the first frame of [m] (frame 11) to the last frame of [ŋ] (frame 190), i.e. the duration of 180 audio frames. The total number of audio frames corresponding to [m], [ɔː], [n], [ɪ] and [ŋ] is 160, so the second time length of "morning" is the duration of 160 audio frames. The first and second time lengths differ by the duration of 20 frames, a difference caused by the user not reading the word fluently enough. The larger the difference between the first and second time lengths, the less fluently the user read the word. The correspondence between a word's first and second time lengths and its second fluency evaluation value can be determined according to the actual situation and is not limited here.
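The two time lengths in the example above can be computed as in the following sketch; the data structure (a mapping from phoneme to an inclusive frame range) is an assumption for illustration:

```python
def word_time_lengths(phoneme_frames):
    """First and second time lengths of a word, in audio frames.

    `phoneme_frames` maps each phoneme to its (first_frame, last_frame)
    range, inclusive. Frames inside the word's span that belong to no
    phoneme are blank frames; they make the first length exceed the second.
    """
    ranges = list(phoneme_frames.values())
    first_frame = min(start for start, _ in ranges)
    last_frame = max(end for _, end in ranges)
    first_length = last_frame - first_frame + 1             # overall span
    second_length = sum(end - start + 1 for start, end in ranges)
    return first_length, second_length

# "morning" with a 20-frame blank gap between [ɔː] and [n]
frames = {"m": (11, 40), "ɔː": (41, 80), "n": (101, 130),
          "ɪ": (131, 160), "ŋ": (161, 190)}
print(word_time_lengths(frames))  # (180, 160)
```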
After the fluency evaluation value of each word in the reference text has been calculated, the fluency evaluation value of the speech to be evaluated can be calculated from the fluency evaluation values of the words.
One specific calculation method for the fluency evaluation value of the speech to be evaluated is to take the average of the fluency evaluation values of the words in the reference text. For example, if the fluency evaluation value of the word "good" is 0.847 and the fluency evaluation value of the word "morning" is 0.78, the fluency evaluation value of the speech to be evaluated, "good morning", is 0.814.
Another specific calculation method for the fluency evaluation value of the speech to be evaluated is: take the fluency evaluation value calculated by the above method as the first fluency evaluation value of the speech to be evaluated; determine a second fluency evaluation value of the speech to be evaluated from the blank audio frames between adjacent words; and determine the overall fluency evaluation value of the speech to be evaluated from its first and second fluency evaluation values. Specifically, the first and second fluency evaluation values of the speech to be evaluated can be combined by weighting, with the weighted result taken as the fluency evaluation value of the speech to be evaluated. The weights used in the weighting can be freely set according to the actual situation and are not limited here.
Within a sentence, once the number of blank audio frames between two adjacent words exceeds a certain amount, the more blank audio frames there are, the longer the user paused and the less fluently the user read the sentence. Specifically, the number of blank audio frames between any two adjacent words is determined from the audio frames corresponding to each word in the reference text, and the second fluency evaluation value of the speech to be evaluated is determined from the numbers of blank audio frames between adjacent words.
In specific implementations, determining the second fluency evaluation value of the speech to be evaluated from the numbers of blank audio frames between adjacent words can be carried out as follows: determine the pause duration between two adjacent words from the number of blank audio frames between them; count the pauses whose duration exceeds a preset duration; and determine the second fluency evaluation value of the speech to be evaluated from this count and the pause durations that exceed the preset duration. The more pauses exceed the preset duration, the lower the second fluency score of the speech to be evaluated; likewise, the longer those pauses are, the lower the second fluency score. The preset duration can be determined from the statistical average pause duration between words when people speak, and is not limited here.
Of course, in specific implementations it is also possible to calculate only the second fluency evaluation value of the speech to be evaluated and use it directly as the fluency evaluation value of the speech to be evaluated.
In this embodiment, the reference pronunciation duration of each phoneme can be determined in advance by the following steps:
Step 1: for each segment of speech information in a corpus, determine, within the speech information, the audio frames corresponding to each phoneme of the text information, where the text information is the reference text corresponding to the speech information.
In this embodiment, the corpus stores speech of the same language as the speech to be evaluated. The speech information in the corpus comes from different speakers, and the speech in the corpus has standard pronunciation.
In specific implementations, the audio frames corresponding to each phoneme of the text information can be determined within the speech information by alignment processing; for details, refer to the specific implementation of S201, which is not repeated here.
Step 2: determine the pronunciation duration of each phoneme from the audio frames corresponding to each phoneme of the text information.
In this embodiment, the pronunciation duration of a phoneme can be determined from the number of audio frames corresponding to the phoneme and the duration of one audio frame. For example, if the phoneme [g] corresponds to 30 audio frames and each frame lasts 20 ms, the pronunciation duration of [g] is 600 ms.
Step 3: from the pronunciation durations of the phonemes, compute the pronunciation duration distribution of each phoneme in a phoneme set, where the phoneme set is the set of all phonemes of the specified language.
For example, if the specified language is English, which has 48 phonemes in total, the phoneme set corresponding to English contains these 48 phonemes.
Step 4: take the central value of the pronunciation duration distribution of each phoneme in the phoneme set as the reference pronunciation duration of that phoneme.
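The four steps above can be sketched as follows, assuming the corpus has already been force-aligned into (phoneme, frame-count) pairs and taking the median as the "central value" (the embodiment does not fix a particular central value, so the median is an assumption):

```python
from collections import defaultdict
from statistics import median

def reference_durations(aligned_corpus, frame_ms=20):
    """Reference pronunciation duration per phoneme from an aligned corpus.

    `aligned_corpus` is a list of utterances, each a list of
    (phoneme, n_frames) pairs produced by forced alignment. Each
    phoneme's durations across the corpus form a distribution, and
    the median serves as that phoneme's reference duration in ms.
    """
    durations = defaultdict(list)
    for utterance in aligned_corpus:
        for phoneme, n_frames in utterance:
            durations[phoneme].append(n_frames * frame_ms)
    return {p: median(ds) for p, ds in durations.items()}

corpus = [[("g", 30), ("u", 25)], [("g", 20), ("u", 25)], [("g", 25)]]
print(reference_durations(corpus))
```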
S206: determine the pronunciation score of the speech to be evaluated from the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value.
In specific implementations, the completeness evaluation value, fluency evaluation value and accuracy evaluation value can be weighted according to weight coefficients fitted in advance, to obtain the pronunciation score of the speech to be evaluated. Specifically, the weight coefficients corresponding to the completeness, fluency and accuracy evaluation values can be determined by linear regression; this embodiment does not limit the weight coefficients.
In specific implementations, the pronunciation score fed back to the user terminal can be converted to a 100-point scale.
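The final combination can be sketched as follows. The weight coefficients (0.5, 0.25, 0.25) are purely illustrative placeholders for the regression-fitted coefficients, and the conversion to a 100-point scale is one possible choice:

```python
def pronunciation_score(accuracy, completeness, fluency,
                        weights=(0.5, 0.25, 0.25)):
    """Combine the three evaluation values into one pronunciation score.

    In practice the weight coefficients would be fitted, e.g. by linear
    regression against human ratings; the defaults here are illustrative.
    The weighted result is scaled to a 100-point score for the user.
    """
    w_acc, w_com, w_flu = weights
    score = w_acc * accuracy + w_com * completeness + w_flu * fluency
    return score * 100

print(pronunciation_score(0.57, 0.9, 0.814))
```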
With the pronunciation quality evaluation method of this embodiment, on the premise that the accuracy evaluation value of the pronunciation of the speech to be evaluated is obtained with the GOP algorithm, the completeness evaluation value and fluency evaluation value of the pronunciation are also determined, and the pronunciation score of the speech to be evaluated is determined by combining its accuracy, completeness and fluency evaluation values. The introduced completeness and fluency evaluation values fully take into account the relationships between words during pronunciation, yielding sentence-level scoring indicators; combining word-level and sentence-level scoring indicators makes the pronunciation score more comprehensive, objective and accurate, and improves the credibility of the pronunciation score.
It should be noted that there is no necessary order among steps S203, S204 and S205; for example, they can be performed simultaneously, or performed one after another in a preset order.
The method of this embodiment can be used to evaluate speech in any language, for example Chinese, English, Japanese or Korean. In specific implementations, for a different language it is only necessary to train the models used in the method of this embodiment, such as the decoder and the alignment model, with a corpus of that language; the model training method is the same across languages and is not described again.
As shown in Fig. 4, based on the same inventive concept as the above pronunciation quality evaluation method, an embodiment of the present invention also provides a pronunciation quality evaluation apparatus 40, including a determining module 401, a phoneme accuracy calculation module 402 and an accuracy calculation module 403.
The determining module 401 determines, in the speech to be evaluated, the audio frames corresponding to each phoneme of the reference text and the matching probability of each phoneme with its corresponding audio frames, where the reference text is the reference text corresponding to the speech to be evaluated.
The phoneme accuracy calculation module 402 is configured to calculate, for each phoneme, the pronunciation accuracy evaluation value of the phoneme from the matching probability corresponding to the phoneme and the audio frames corresponding to the phoneme.
The accuracy calculation module 403 is configured to obtain the accuracy evaluation value of the speech to be evaluated from the pronunciation accuracy evaluation value of each phoneme and the weight value determined in advance for each phoneme.
Further, among the weight values determined in advance for the phonemes, the weight values corresponding to vowels are greater than the weight values corresponding to consonants.
Further, the accuracy calculation module 403 is specifically configured to weight the pronunciation accuracy evaluation values of the phonemes according to the weight values determined in advance for each phoneme, to obtain the accuracy evaluation value of the speech to be evaluated.
Further, as shown in Fig. 5, the pronunciation quality evaluation apparatus 40 of the embodiment of the present invention further includes a completeness calculation module 404, a fluency calculation module 405 and a scoring module 406.
The completeness calculation module 404 is configured to determine the completeness evaluation value of the speech to be evaluated.
The fluency calculation module 405 is configured to determine the fluency evaluation value of the speech to be evaluated.
The scoring module 406 is configured to determine the pronunciation score of the speech to be evaluated from the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value.
Further, the completeness calculation module 404 is specifically configured to determine the completeness evaluation value of the speech to be evaluated from the number of words contained in the recognized text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to the speech to be evaluated, where the recognized text is the text obtained by performing speech recognition on the speech to be evaluated.
Further, the fluency calculation module 405 includes a phoneme fluency calculation unit and a speech fluency calculation unit.
The phoneme fluency calculation unit is configured to determine, for each phoneme corresponding to the reference text, the actual pronunciation duration of the phoneme from the audio frames corresponding to the phoneme, and to determine the fluency evaluation value of the phoneme from the actual pronunciation duration of the phoneme and the reference pronunciation duration corresponding to the phoneme.
The speech fluency calculation unit is configured to determine the fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the phonemes.
Further, the speech fluency calculation unit is specifically configured to weight the fluency evaluation values of the phonemes according to the weight values determined in advance for each phoneme, to obtain the fluency evaluation value of the speech to be evaluated.
Further, the speech fluency calculation unit is specifically configured to: for each word in the reference text, calculate the average of the fluency evaluation values of the phonemes corresponding to the word, to obtain the first fluency evaluation value of the word; for each word in the reference text, calculate the first time length corresponding to the word (the time span from the first audio frame corresponding to the first phoneme of the word to the last audio frame corresponding to the last phoneme), calculate the sum of the actual pronunciation durations of the phonemes corresponding to the word to obtain the second time length of the word, and determine the second fluency evaluation value of the word from the first and second time lengths; for each word in the reference text, obtain the fluency evaluation value of the word from its first and second fluency evaluation values; and determine the fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the words in the reference text.
Further, the speech fluency calculation unit is also configured to: determine, from the audio frames corresponding to each word in the reference text, the number of blank audio frames between any two adjacent words, and determine the second fluency evaluation value of the speech to be evaluated from the numbers of blank audio frames between adjacent words; determine the first fluency evaluation value of the speech to be evaluated from the fluency evaluation values of the words in the reference text; and take a weighted average of the first and second fluency evaluation values of the speech to be evaluated, to obtain its fluency evaluation value.
Further, the pronunciation quality evaluation apparatus 40 of the embodiment of the present invention further includes a reference pronunciation duration determining module, configured to: for each segment of speech information in a corpus, determine, within the speech information, the audio frames corresponding to each phoneme of the text information, where the text information is the reference text corresponding to the speech information; determine the pronunciation duration of each phoneme from the audio frames corresponding to each phoneme of the text information; from the pronunciation durations of the phonemes, compute the pronunciation duration distribution of each phoneme in a phoneme set, where the phoneme set is the set of all phonemes of the specified language; and take the central value of the pronunciation duration distribution of each phoneme in the phoneme set as the reference pronunciation duration of that phoneme.
Further, the scoring module 406 is specifically configured to weight the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value according to weight coefficients fitted in advance, to obtain the pronunciation score of the speech to be evaluated.
The pronunciation quality evaluation apparatus of the embodiment of the present invention uses the same inventive concept as the above pronunciation quality evaluation method and can achieve the same beneficial effects, which are not described again here.
Based on the same inventive concept as the above pronunciation quality evaluation method, an embodiment of the present invention also provides an electronic device, which may specifically be a user-side device such as a smart speaker, desktop computer, portable computer, smartphone or tablet computer, or a cloud-side device such as a server. As shown in Fig. 6, the electronic device 60 may include a processor 601, a memory 602 and a transceiver 603. The transceiver 603 is used to send and receive data under the control of the processor 601.
The memory 602 may include read-only memory (ROM) and random access memory (RAM), and provides the processor with the program instructions and data stored in the memory. In embodiments of the present invention, the memory can be used to store a program for the pronunciation quality evaluation method.
The processor 601 can be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or a CPLD (Complex Programmable Logic Device). By calling the program instructions stored in the memory, the processor implements, according to the obtained program instructions, the pronunciation quality evaluation method of any of the above embodiments.
An embodiment of the present invention provides a computer-readable storage medium storing computer program instructions for the above electronic device, containing a program for executing the above pronunciation quality evaluation method.
The above computer storage medium can be any available medium or data storage device accessible to a computer, including but not limited to magnetic storage (such as floppy disks, hard disks, magnetic tape, magneto-optical disks (MO), etc.), optical storage (such as CD, DVD, BD, HVD, etc.) and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH) and solid-state drives (SSD)).
The above embodiments only describe the technical solution of the application in detail, but the description of the above embodiments is merely intended to help understand the embodiments of the present invention and should not be construed as limiting the embodiments of the present invention. Any changes or substitutions readily conceivable by those skilled in the art shall fall within the protection scope of the embodiments of the present invention.

Claims (10)

1. A pronunciation quality evaluation method, characterized by comprising:
determining, in speech to be evaluated, audio frames corresponding to each phoneme of a reference text and a matching probability of each phoneme with its corresponding audio frames, the reference text being the reference text corresponding to the speech to be evaluated;
for each phoneme, calculating a pronunciation accuracy evaluation value of the phoneme according to the matching probability corresponding to the phoneme and the audio frames corresponding to the phoneme;
obtaining an accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and a weight value determined in advance for each phoneme.
2. The method according to claim 1, characterized in that, among the weight values determined in advance for the phonemes, the weight values corresponding to vowels are greater than the weight values corresponding to consonants.
3. The method according to claim 1, characterized in that obtaining the accuracy evaluation value of the speech to be evaluated according to the pronunciation accuracy evaluation value of each phoneme and the weight value determined in advance for each phoneme comprises:
weighting the pronunciation accuracy evaluation value corresponding to each phoneme according to the weight values determined in advance for each phoneme, to obtain the accuracy evaluation value of the speech to be evaluated.
4. The method according to any one of claims 1 to 3, characterized by further comprising:
determining a completeness evaluation value and a fluency evaluation value of the speech to be evaluated;
determining a pronunciation score of the speech to be evaluated according to the completeness evaluation value, the fluency evaluation value and the accuracy evaluation value.
5. The method according to claim 4, characterized in that determining the completeness evaluation value of the speech to be evaluated comprises:
determining the completeness evaluation value of the speech to be evaluated according to the number of words contained in a recognized text corresponding to the speech to be evaluated and the number of words contained in the reference text corresponding to the speech to be evaluated, the recognized text being a text obtained by performing speech recognition on the speech to be evaluated.
6. The method according to claim 4, characterized in that determining the fluency evaluation value of the speech to be evaluated comprises:
for each phoneme, determining an actual pronunciation duration of the phoneme according to the audio frames corresponding to the phoneme, and determining a fluency evaluation value of the phoneme according to the actual pronunciation duration of the phoneme and a reference pronunciation duration corresponding to the phoneme;
determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value of each phoneme.
7. The method according to claim 6, characterized in that determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value of each phoneme comprises:
weighting the fluency evaluation value of each phoneme according to the weight values determined in advance for each phoneme, to obtain the fluency evaluation value of the speech to be evaluated.
8. The method according to claim 6, wherein the determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation value of each phoneme comprises:
for each word in the reference text, calculating the average of the fluency evaluation values of the phonemes corresponding to the word, to obtain a first fluency evaluation value of the word;
for each word in the reference text, calculating a first time length of the word, the first time length being the time span from the first audio frame of the first phoneme of the word to the last audio frame of the last phoneme of the word; calculating the sum of the actual pronunciation durations of the phonemes of the word, to obtain a second time length of the word; and determining a second fluency evaluation value of the word according to the first time length and the second time length;
for each word in the reference text, obtaining the fluency evaluation value of the word according to its first fluency evaluation value and second fluency evaluation value;
determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation values of the words in the reference text.
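The word-level scheme of claim 8 can be sketched as below. The claim does not say how the two time lengths are combined into the second score, nor how the two scores are merged; the voiced-fraction formula, the equal-weight average, and the 10 ms frame shift are assumptions made here.

```python
from dataclasses import dataclass

FRAME_MS = 10  # assumed frame shift; not specified in the patent

@dataclass
class AlignedPhoneme:
    start_frame: int   # first audio frame of the phoneme
    end_frame: int     # last audio frame of the phoneme (inclusive)
    fluency: float     # per-phoneme fluency evaluation value

def word_fluency(phones: list) -> float:
    """Combine the claim-8 first and second word-level fluency scores."""
    # First score: mean of the per-phoneme fluency scores of the word.
    first = sum(p.fluency for p in phones) / len(phones)
    # First time length: span from the word's first frame to its last frame.
    span_ms = (phones[-1].end_frame - phones[0].start_frame + 1) * FRAME_MS
    # Second time length: sum of the phonemes' actual pronunciation durations.
    voiced_ms = sum((p.end_frame - p.start_frame + 1) * FRAME_MS for p in phones)
    # Second score (assumption): the voiced fraction of the word's span,
    # which drops when there are long gaps between the word's phonemes.
    second = voiced_ms / span_ms
    return (first + second) / 2  # equal-weight merge (assumption)

word = [AlignedPhoneme(0, 9, 1.0), AlignedPhoneme(15, 24, 0.8)]
print(round(word_fluency(word), 2))
```

In this example the 5-frame gap between the two phonemes lowers the voiced fraction to 0.8, pulling the word's score below the phoneme average of 0.9.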
9. The method according to claim 8, further comprising:
determining the number of blank audio frames between any two adjacent words according to the audio frames corresponding to each word in the reference text, and determining a second fluency evaluation value of the speech to be evaluated according to the number of blank audio frames between any two adjacent words;
wherein the determining the fluency evaluation value of the speech to be evaluated according to the fluency evaluation values of the words in the reference text comprises:
determining a first fluency evaluation value of the speech to be evaluated according to the fluency evaluation values of the words in the reference text;
taking a weighted average of the first fluency evaluation value and the second fluency evaluation value of the speech to be evaluated, to obtain the fluency evaluation value of the speech to be evaluated.
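The inter-word pause measure of claim 9 can be sketched as follows; counting blank frames between word boundaries matches the claim, but the linear mapping from gap length to a score, and the 50-frame cutoff, are illustrative choices made here.

```python
def blank_frames_between(words: list) -> list:
    """words: list of (first_frame, last_frame) per word, in reading order.
    Returns the number of blank frames in each gap between adjacent words."""
    return [max(0, b[0] - a[1] - 1) for a, b in zip(words, words[1:])]

def pause_fluency(gaps: list, max_gap: int = 50) -> float:
    """Map pause lengths to a score in [0, 1]; longer pauses score lower.
    The max_gap cutoff is an assumption, not taken from the patent."""
    if not gaps:
        return 1.0
    return sum(max(0.0, 1.0 - g / max_gap) for g in gaps) / len(gaps)

gaps = blank_frames_between([(0, 9), (15, 30), (31, 40)])
print(gaps)
print(pause_fluency(gaps))
```

Here the first word pair is separated by 5 blank frames and the second pair is contiguous, so only the first gap reduces the pause score.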
10. The method according to claim 6, wherein the method of determining the reference pronunciation duration comprises:
for each piece of speech information in a corpus, determining, within the speech information, the audio frames corresponding to each phoneme of its text information, the text information being the reference text corresponding to the speech information;
determining the pronunciation duration of each phoneme according to the audio frames corresponding to each phoneme of the text information;
according to the pronunciation duration of each phoneme, computing the distribution of pronunciation durations for each phoneme in a phone set, the phone set being the set of all phonemes included in a specified language;
taking the central value of the pronunciation-duration distribution of each phoneme in the phone set as the reference pronunciation duration of that phoneme.
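The corpus statistics of claim 10 amount to grouping aligned durations by phoneme and taking a central value per group. The claim does not name the central value; the median used below is one natural choice (the mean or mode would fit the claim equally well), and the ARPAbet-style phoneme labels are illustrative.

```python
import statistics
from collections import defaultdict

def reference_durations(alignments: list) -> dict:
    """alignments: (phoneme, duration_ms) pairs gathered by forced alignment
    over a corpus. Returns each phoneme's reference pronunciation duration
    as the median of its observed duration distribution."""
    by_phone = defaultdict(list)
    for phone, dur in alignments:
        by_phone[phone].append(dur)
    return {p: statistics.median(ds) for p, ds in by_phone.items()}

# Toy corpus: the 400 ms outlier for "AH" barely moves the median,
# which is one reason a median is a robust central value here.
corpus = [("AH", 60), ("AH", 80), ("AH", 400), ("S", 120), ("S", 140)]
print(reference_durations(corpus))
```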
CN201910062339.7A 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium Active CN109545243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910062339.7A CN109545243B (en) 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109545243A true CN109545243A (en) 2019-03-29
CN109545243B CN109545243B (en) 2022-09-02

Family

ID=65838414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910062339.7A Active CN109545243B (en) 2019-01-23 2019-01-23 Pronunciation quality evaluation method, pronunciation quality evaluation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109545243B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006337667A (en) * 2005-06-01 2006-12-14 Ntt Communications Kk Pronunciation evaluating method, phoneme series model learning method, device using their methods, program and recording medium
CN101739868A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN101826263A (en) * 2009-03-04 2010-09-08 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof
US20150248898A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Computer-Implemented Systems and Methods for Determining an Intelligibility Score for Speech
JP2016103081A (en) * 2014-11-27 2016-06-02 Kddi株式会社 Conversation analysis device, conversation analysis system, conversation analysis method and conversation analysis program
CN104485115A (en) * 2014-12-04 2015-04-01 上海流利说信息技术有限公司 Pronunciation evaluation equipment, method and system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979484A (en) * 2019-04-03 2019-07-05 北京儒博科技有限公司 Pronounce error-detecting method, device, electronic equipment and storage medium
CN110164422A (en) * 2019-04-03 2019-08-23 苏州驰声信息科技有限公司 A kind of the various dimensions appraisal procedure and device of speaking test
CN110176249A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of appraisal procedure and device of spoken language pronunciation
CN109979484B (en) * 2019-04-03 2021-06-08 北京儒博科技有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN110782921A (en) * 2019-09-19 2020-02-11 腾讯科技(深圳)有限公司 Voice evaluation method and device, storage medium and electronic device
CN110782921B (en) * 2019-09-19 2023-09-22 腾讯科技(深圳)有限公司 Voice evaluation method and device, storage medium and electronic device
CN112541651A (en) * 2019-09-20 2021-03-23 卡西欧计算机株式会社 Electronic device, pronunciation learning method, and server device
CN112927696A (en) * 2019-12-05 2021-06-08 中国科学院深圳先进技术研究院 System and method for automatically evaluating dysarthria based on voice recognition
CN111402924A (en) * 2020-02-28 2020-07-10 联想(北京)有限公司 Spoken language evaluation method and device and computer readable storage medium
CN111402924B (en) * 2020-02-28 2024-04-19 联想(北京)有限公司 Spoken language evaluation method, device and computer readable storage medium
CN113707178B (en) * 2020-05-22 2024-02-06 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN113707178A (en) * 2020-05-22 2021-11-26 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN111402895A (en) * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN111915940A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN111916108A (en) * 2020-07-24 2020-11-10 北京声智科技有限公司 Voice evaluation method and device
CN111916108B (en) * 2020-07-24 2021-04-02 北京声智科技有限公司 Voice evaluation method and device
WO2022048354A1 (en) * 2020-09-07 2022-03-10 北京世纪好未来教育科技有限公司 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
US11749257B2 (en) 2020-09-07 2023-09-05 Beijing Century Tal Education Technology Co., Ltd. Method for evaluating a speech forced alignment model, electronic device, and storage medium
CN112492343B (en) * 2020-12-16 2023-11-10 浙江大华技术股份有限公司 Video live broadcast monitoring method and related device
CN112492343A (en) * 2020-12-16 2021-03-12 浙江大华技术股份有限公司 Video live broadcast monitoring method and related device
CN112614510A (en) * 2020-12-23 2021-04-06 北京猿力未来科技有限公司 Audio quality evaluation method and device
CN112614510B (en) * 2020-12-23 2024-04-30 北京猿力未来科技有限公司 Audio quality assessment method and device
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
CN113299278A (en) * 2021-05-20 2021-08-24 北京大米科技有限公司 Acoustic model performance evaluation method and device and electronic equipment
CN113035238A (en) * 2021-05-20 2021-06-25 北京世纪好未来教育科技有限公司 Audio evaluation method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN109545243B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN109545243A (en) Pronunciation quality evaluating method, device, electronic equipment and storage medium
CN107610717B (en) Many-to-one voice conversion method based on voice posterior probability
Sun et al. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training
CN109545244A (en) Speech evaluating method, device, electronic equipment and storage medium
Muthusamy et al. Reviewing automatic language identification
EP1647970A1 (en) Hidden conditional random field models for phonetic classification and speech recognition
Li et al. Improving short utterance speaker recognition by modeling speech unit classes
Qian et al. Bidirectional LSTM-RNN for Improving Automated Assessment of Non-Native Children's Speech.
Vadwala et al. Survey paper on different speech recognition algorithm: challenges and techniques
Yue et al. Automatic Scoring of Shadowing Speech Based on DNN Posteriors and Their DTW.
JP5007401B2 (en) Pronunciation rating device and program
Muthusamy et al. Automatic language identification: a review/tutorial
Fu et al. Automatic assessment of English proficiency for Japanese learners without reference sentences based on deep neural network acoustic models
Keshet Automatic speech recognition: A primer for speech-language pathology researchers
Lee Language-independent methods for computer-assisted pronunciation training
Fatima et al. Short utterance speaker recognition a research agenda
CN109697975B (en) Voice evaluation method and device
CN111833859B (en) Pronunciation error detection method and device, electronic equipment and storage medium
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Furui Selected topics from 40 years of research on speech and speaker recognition.
CN112908360A (en) Online spoken language pronunciation evaluation method and device and storage medium
Kabashima et al. Dnn-based scoring of language learners’ proficiency using learners’ shadowings and native listeners’ responsive shadowings
JP4753412B2 (en) Pronunciation rating device and program
CN112908361B (en) Spoken language pronunciation evaluation system based on small granularity
CN115312030A (en) Display control method and device of virtual role and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant