CN105609114A - Method and device for detecting pronunciation - Google Patents

Method and device for detecting pronunciation

Info

Publication number
CN105609114A
Authority
CN
China
Prior art keywords
unit
basic voice
voice unit
frame
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410692378.2A
Other languages
Chinese (zh)
Other versions
CN105609114B (en)
Inventor
高前勇
魏思
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201410692378.2A priority Critical patent/CN105609114B/en
Publication of CN105609114A publication Critical patent/CN105609114A/en
Application granted granted Critical
Publication of CN105609114B publication Critical patent/CN105609114B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and a device for detecting pronunciation. The method comprises the steps of: receiving a speech signal to be detected; determining each basic speech unit of the speech signal, as well as the speech frames and frame count corresponding to each basic speech unit; calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames; re-framing the normalized basic speech unit according to the variable frame shift and a preset fixed frame length; extracting the segmental features of the re-framed basic speech unit; calculating the likelihood between the segmental features of the basic speech unit and a preset standard pronunciation model corresponding to that unit, the model having been obtained in advance by extracting acoustic features of the basic speech unit on a training set and training a statistical model; and determining, according to the likelihood, whether the pronunciation of the basic speech unit is correct. The method improves the accuracy of pronunciation detection.

Description

Pronunciation detection method and device
Technical field
The present invention relates to the field of signal processing, and in particular to a pronunciation detection method and device.
Background art
Pronunciation detection refers to having a computer automatically decide whether a user's pronunciation contains errors, and to identifying errors at different levels, such as phoneme, syllable, or tone errors. Pronunciation detection is the core of Computer Assisted Pronunciation Training (CAPT) systems; based on its detection results, targeted and effective coaching of the user's pronunciation can be provided.
Among prior-art methods, one with relatively high detection accuracy is the pronunciation detection method based on a Deep Neural Network (DNN) recognizer. It mainly comprises: receiving a speech signal to be detected, and extracting the acoustic features of the speech signal. In DNN systems, frame-splicing feature extraction is used: specifically, the Mel Frequency Cepstral Coefficients (MFCC), or the combination of MFCC and fundamental frequency (F0) features, of several consecutive frames are concatenated into a multi-dimensional supervector that serves as the extended feature of the current speech frame. As shown in Fig. 1, the per-frame feature vectors of the frames before and after speech frame t are concatenated head to tail to form the supervector of frame t, and the sequence of supervectors over all speech frames forms the acoustic feature sequence of the speech signal. Then, a decoding result corresponding to the acoustic feature sequence is searched for in a search space built from the system's preset acoustic model, language model, or word graph; each basic speech unit of the speech signal and its corresponding speech segment are determined from the decoding result; finally, the likelihood is calculated from the supervectors of the speech frames of each basic speech unit obtained during frame-splicing feature extraction, yielding a pronunciation accuracy score for each basic speech unit, and pronunciation detection is then performed according to that score.
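As an illustration of the frame-splicing scheme just described, here is a minimal sketch, assuming per-frame MFCC vectors are already available (the function name, array shapes, and the ±2-frame context are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def splice_frames(features, context=2):
    """Build the supervector of each frame t by concatenating the feature
    vectors of frames t-context .. t+context head to tail, as in Fig. 1.
    `features` has shape (num_frames, dim); edges are padded by repeating
    the first/last frame."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(len(features))])

# e.g. 100 frames of 39-dim MFCCs -> (100, 195) supervectors
supervectors = splice_frames(np.random.randn(100, 39))
```

Note that for a frame at the boundary of a basic speech unit, the window t-context .. t+context necessarily reaches into the neighboring unit, which is exactly the weakness discussed next.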
However, during frame-splicing feature extraction, the supervector of a speech frame located at the start or end boundary of a basic speech unit may contain feature vectors of speech frames belonging to the preceding or following basic speech unit. When calculating the pronunciation accuracy of a basic speech unit, if the current unit is mispronounced while its preceding and following units are pronounced correctly, the likelihood score of the spliced supervectors of the current unit will be inflated, and so will the unit's pronunciation accuracy, causing the error to escape detection. Therefore, calculating pronunciation accuracy from the supervectors of the speech frames of a basic speech unit is prone to errors, especially at the unit's start and end boundaries, which reduces detection accuracy.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a pronunciation detection method and device that can reduce the error in calculating the pronunciation accuracy of basic speech units and improve the accuracy of pronunciation detection.
The technical solution provided by the embodiments of the present invention is as follows:
A pronunciation detection method, comprising:
receiving a speech signal to be detected;
determining each basic speech unit of the speech signal, as well as the speech frames and frame count corresponding to the basic speech unit;
calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames;
re-framing the normalized basic speech unit according to the variable frame shift and a preset fixed frame length;
extracting the segmental features of the re-framed basic speech unit;
calculating the likelihood between the segmental features of the basic speech unit and a preset standard pronunciation model corresponding to the basic speech unit, the preset standard pronunciation model having been obtained by extracting the acoustic features of the basic speech unit on a training set in advance and training a statistical model;
determining, according to the likelihood, whether the pronunciation of the basic speech unit is correct.
Preferably, calculating the variable frame shift required to normalize the basic speech unit to the preset fixed number of frames comprises:
calculating the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1, and the second difference being the fixed frame count minus 1;
taking the product of the ratio of the first difference to the second difference and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
Preferably, before calculating the variable frame shift required to normalize the basic speech unit to the preset fixed number of frames, the method further comprises:
determining, according to the type of the basic speech unit, the fixed frame count corresponding to the basic speech unit.
Preferably, extracting the segmental features of the re-framed basic speech unit comprises:
extracting the acoustic features of each speech frame corresponding to the re-framed basic speech unit;
concatenating, in order, the acoustic features of the speech frames within the unit to obtain the segmental features of the basic speech unit.
Optionally, determining according to the likelihood whether the pronunciation of the basic speech unit is correct comprises:
if the likelihood is greater than a set likelihood threshold, determining that the basic speech unit is pronounced correctly;
otherwise, determining that the basic speech unit is mispronounced.
Preferably, determining according to the likelihood whether the pronunciation of the basic speech unit is correct comprises:
calculating the pronunciation posterior probability of the basic speech unit from the likelihood;
if the posterior probability is greater than a set probability threshold, determining that the basic speech unit is pronounced correctly;
otherwise, determining that the basic speech unit is mispronounced.
A pronunciation detection device, comprising:
a signal receiving unit for receiving a speech signal to be detected;
a determining unit for determining each basic speech unit of the speech signal, as well as the speech frames and frame count corresponding to the basic speech unit;
a variable-frame-shift calculating unit for calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames;
a framing unit for re-framing the normalized basic speech unit according to the variable frame shift and a preset fixed frame length;
a segmental feature extraction unit for extracting the segmental features of the re-framed basic speech unit;
a likelihood calculating unit for calculating the likelihood between the segmental features of the basic speech unit and a preset standard pronunciation model corresponding to the basic speech unit, the preset standard pronunciation model having been obtained by extracting the acoustic features of the basic speech unit on a training set in advance and training a statistical model;
a detecting unit for determining, according to the likelihood, whether the pronunciation of the basic speech unit is correct.
Preferably, the variable-frame-shift calculating unit is specifically configured to calculate the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1 and the second difference being the fixed frame count minus 1, and to take the product of the ratio and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
Preferably, the device further comprises:
a fixed-frame-count determining unit for determining, according to the type of the basic speech unit, the fixed frame count corresponding to the basic speech unit.
Preferably, the segmental feature extraction unit comprises:
an extraction subunit for extracting the acoustic features of each speech frame corresponding to the re-framed basic speech unit;
a splicing subunit for concatenating, in order, the acoustic features of the speech frames within the unit to obtain the segmental features of the basic speech unit.
Optionally, the detecting unit is specifically configured to determine, when the likelihood is greater than a set likelihood threshold, that the basic speech unit is pronounced correctly, and otherwise that the basic speech unit is mispronounced.
Preferably, the detecting unit comprises:
a calculating subunit for calculating the pronunciation posterior probability of the basic speech unit from the likelihood;
a judging subunit for determining, when the posterior probability is greater than a set probability threshold, that the basic speech unit is pronounced correctly, and otherwise that the basic speech unit is mispronounced.
In the embodiments of the present invention, after the basic speech units of the speech signal and their corresponding speech frames are obtained, a variable-frame-shift technique is applied within each unit's phoneme boundaries to extract segmental features of equal length. This avoids the situation in which the acoustic features of a speech frame at the start or end boundary of a basic speech unit contain the acoustic features of frames belonging to the preceding or following basic speech unit, so that a unit's pronunciation accuracy is not affected by how its neighboring basic speech units are pronounced. The error in calculating the pronunciation accuracy of basic speech units is thereby reduced, and the precision of the calculated pronunciation accuracy improved.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of forming the supervector of speech frame t in the prior art;
Fig. 2 is a flow chart of a pronunciation detection method according to an embodiment of the present invention;
Fig. 3 is a flow chart of a method for calculating the variable frame shift of the speech frames of a basic speech unit according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a pronunciation detection device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the frame-shift determining unit in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the detecting unit in an embodiment of the present invention.
Detailed description of the invention
To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
To make the above objects, features, and advantages of the present invention more apparent and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 2, which is a flow chart of a pronunciation detection method according to an embodiment of the present invention, the method may comprise:
Step 201: receive the speech signal to be detected.
Step 202: determine each basic speech unit of the speech signal and the speech frames and frame count corresponding to the basic speech unit.
Specifically, after the speech signal to be detected is received, the acoustic features of the signal, such as MFCC and/or F0 features, are extracted; a decoding result corresponding to the extracted acoustic features is then searched for in a search space built from a preset acoustic model, language model, or word graph, and each basic speech unit of the speech signal, together with its corresponding speech frames and frame count, is determined from the decoding result. This process of obtaining the basic speech units of the speech signal and their corresponding speech frames is similar to the prior art and is not repeated here.
Step 203: calculate the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames.
In this embodiment, to reduce the influence of speaking rate, the existing frame shift needs to be modified: the frame shift of each basic speech unit is adjusted adaptively and the unit is re-framed, normalizing the speech frames of each basic speech unit to the same frame count.
In this step, the variable frame shift of the speech frames of the basic speech unit is first calculated. As shown in Fig. 3, the calculation may comprise:
Step 301: calculate the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1, and the second difference being the preset fixed frame count minus 1.
Suppose the start and end boundary frames of the speech frames of basic speech unit $M_i$ are $N_s$ and $N_e$, respectively. The first difference is $N_e - N_s - 1$, where $N_e$ is the frame index of the end boundary frame and $N_s$ is the frame index of the start boundary frame. Let the preset fixed frame count be $N_{norm}$; the second difference is $N_{norm} - 1$, and the ratio is $\frac{N_e - N_s - 1}{N_{norm} - 1}$.
It should be noted that, in practical applications, the same fixed frame count may be set for different basic speech units, or different fixed frame counts may be set; for example, different fixed frame counts may be set for different types of basic speech units, e.g., separate fixed frame counts for vowels and consonants. If more than one fixed frame count is preset, the fixed frame count corresponding to the type of the current basic speech unit may first be determined before this step, for instance via a lookup as sketched below.
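A hedged sketch of such a per-type lookup (the concrete values are illustrative assumptions only):

```python
# Hypothetical per-type fixed frame counts N_norm; the patent only requires
# that the count may differ by unit type (e.g. vowel vs. consonant).
N_NORM_BY_TYPE = {"vowel": 24, "consonant": 12}

def fixed_frame_count(unit_type: str) -> int:
    return N_NORM_BY_TYPE[unit_type]
```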
Step 302: take the product of the ratio of the first difference to the second difference and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
It should be noted that the frame shift of the speech frames of the basic speech unit before normalization may be a fixed frame shift or a non-fixed frame shift; this embodiment of the present invention places no limitation on it.
Taking a fixed frame shift as an example, the variable frame shift $F_{mov}^{var}$ is calculated as follows:

$$F_{mov}^{var} = \frac{N_e - N_s - 1}{N_{norm} - 1} \cdot F_{mov}^{fix}$$

where $F_{mov}^{fix}$ is the preset fixed frame shift, which may take the frame-shift size commonly used in the prior art.
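A minimal sketch of this computation (function and variable names, and the 10 ms example shift, are illustrative assumptions):

```python
def variable_frame_shift(n_s, n_e, n_norm, f_mov_fix):
    """Variable frame shift per the formula above: the unit spans boundary
    frames n_s .. n_e, n_norm is the preset fixed frame count, and
    f_mov_fix is the original fixed frame shift."""
    return (n_e - n_s - 1) / (n_norm - 1) * f_mov_fix

# A short, fast-spoken 13-frame unit normalized to 20 frames gets a
# shorter shift than the original 10 ms:
shift = variable_frame_shift(n_s=100, n_e=112, n_norm=20, f_mov_fix=0.010)
```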
Step 204: re-frame the normalized basic speech unit according to the variable frame shift and the preset fixed frame length.
After the variable frame shift is determined, the speech frames corresponding to the basic speech unit can be re-framed in combination with the fixed frame length $F_{len}$, thereby normalizing the speech frames corresponding to each basic speech unit to the same frame count.
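A sketch of the re-framing itself, assuming the unit boundary, fixed frame length, and variable shift are all expressed in samples (names are illustrative; boundary checks are omitted):

```python
import numpy as np

def reframe_unit(samples, unit_start, n_norm, frame_len, var_shift):
    """Cut n_norm frames of fixed length frame_len from the unit's signal,
    spacing consecutive frame starts by the variable shift, so every
    basic speech unit ends up with exactly n_norm frames."""
    return np.stack([
        samples[unit_start + int(round(k * var_shift)):
                unit_start + int(round(k * var_shift)) + frame_len]
        for k in range(n_norm)
    ])  # shape (n_norm, frame_len), identical for every unit
```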
Step 205: extract the segmental features of the re-framed basic speech unit.
Specifically, the acoustic features of each speech frame corresponding to the re-framed basic speech unit may first be extracted; the acoustic features of the speech frames within the basic speech unit are then concatenated in order to obtain the segmental features of the basic speech unit.
The acoustic features may be extracted with prior-art acoustic feature extraction methods: the feature vector of each speech frame within the boundaries of the basic acoustic unit is extracted, e.g., MFCC features and/or the F0 fundamental frequency feature, and the feature vectors of the speech frames within the unit are concatenated into a supervector, yielding the supervector sequence of the speech frames of the basic speech unit; this supervector sequence serves as the segmental feature of the basic speech unit.
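A hedged sketch of this step, assuming librosa is available for MFCC extraction (any extractor would do); the fixed dimensionality follows from every unit having the same number of frames after re-framing:

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

def segmental_feature(frames, sr=16000, n_mfcc=13):
    """One MFCC vector per re-framed speech frame, concatenated in order
    into a fixed-dimension segmental feature (dim = n_norm * n_mfcc)."""
    per_frame = [
        librosa.feature.mfcc(y=np.asarray(f, dtype=float), sr=sr,
                             n_mfcc=n_mfcc, n_fft=len(f),
                             hop_length=len(f), center=False).reshape(-1)
        for f in frames
    ]
    return np.concatenate(per_frame)
```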
Step 206: calculate the likelihood between the segmental features of the basic speech unit and the preset standard pronunciation model corresponding to the basic speech unit.
The preset standard pronunciation model corresponding to the basic speech unit is obtained in advance by extracting the acoustic features of the basic speech unit on a training set and then training a statistical model, such as a DNN model.
The calculation of the likelihood is similar to the prior-art process of calculating likelihoods from the supervectors of the speech frames of each basic speech unit and is not repeated here.
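Since the patent leaves the statistical model open (e.g. a DNN), a minimal stand-in sketch using a per-unit diagonal Gaussian over segmental features illustrates the likelihood step; this choice of model is an assumption for illustration only:

```python
import numpy as np

def log_likelihood(seg_feat, mean, var):
    """Log-likelihood log P(O | M) of a segmental feature under a diagonal
    Gaussian standing in for the trained standard pronunciation model;
    mean and var would be estimated on the training set."""
    return -0.5 * np.sum(np.log(2 * np.pi * var)
                         + (seg_feat - mean) ** 2 / var)
```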
Step 207: determine according to the likelihood whether the pronunciation of the basic speech unit is correct.
Specifically, if the likelihood is greater than a set likelihood threshold, the basic speech unit is determined to be pronounced correctly; otherwise, the basic speech unit is determined to be mispronounced.
To further improve the accuracy of pronunciation detection, in another embodiment of the present invention the pronunciation posterior probability of the basic speech unit may also be calculated from the likelihood, and whether the pronunciation of the basic speech unit is correct determined from this posterior probability: if the posterior probability is greater than a set probability threshold, the basic speech unit is determined to be pronounced correctly; otherwise, it is determined to be mispronounced.
The pronunciation posterior probability of a basic speech unit is calculated as follows:
1) calculate the likelihood $P(O_t|M_i)$ of the feature vector $O_t$ of the i-th basic speech unit $M_i$ with respect to the preset standard pronunciation model of $M_i$, where t denotes the frame index;
2) calculate the likelihood $P(O_t|M_j)$ of the feature vector $O_t$ of $M_i$ with respect to the j-th speech unit $M_j$, where $j \neq i$;
3) calculate the pronunciation posterior probability of $M_i$ as:

$$PP_{M_i} = \frac{P(O_t|M_i)}{\sum_j P(O_t|M_j)} \qquad (1)$$

or

$$GOP_{M_i} = \frac{P(O_t|M_i)}{\max_j P(O_t|M_j)} \qquad (2)$$

where the index j in formulas (1) and (2) may include i.
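A numeric sketch of formulas (1) and (2), assuming per-unit log-likelihoods are available from the previous step (working in the log domain for numerical stability is an implementation detail, not from the patent):

```python
import numpy as np

def pronunciation_scores(loglik, i):
    """loglik[j] = log P(O_t | M_j) over all candidate units j (j may
    include i). Returns the posterior PP of formula (1) and the GOP of
    formula (2) for target unit i."""
    shifted = np.exp(loglik - loglik.max())  # stabilized exponentiation
    pp = shifted[i] / shifted.sum()          # formula (1)
    gop = shifted[i] / shifted.max()         # formula (2)
    return pp, gop

pp, gop = pronunciation_scores(np.array([-42.0, -40.5, -47.3]), i=1)
# Step 207: accept the unit's pronunciation when the score exceeds a set threshold
```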
With the pronunciation detection method of this embodiment of the present invention, the segmental features of each basic speech unit are extracted using an adaptive frame-shift technique. The segmental features not only retain the rich correlation properties of multi-frame spliced feature vectors, containing abundant spectral and tonal information, but also make the pronunciation accuracy of a basic speech unit independent of how its neighboring basic speech units are pronounced. This reduces the error in calculating the pronunciation accuracy of basic speech units, improves the precision of the calculated pronunciation accuracy, and facilitates the detection of pronunciation errors; moreover, it strongly suppresses and eliminates the effect of speaking rate, significantly improving the accuracy of pronunciation detection. For fast speech, when MFCC or F0 features are extracted with a "fixed" frame shift, the variation between adjacent frame features is larger than for slow speech, and the phoneme durations of fast speech are generally shorter than those of slow speech. With a "variable" frame shift, the computed frame shift for fast speech is smaller than for slow speech, which reduces the difference between adjacent frame features; similarly, the frame shift for slow speech is longer than for fast speech, correspondingly increasing the difference between adjacent frame features.
Correspondingly, an embodiment of the present invention also provides a pronunciation detection device; Fig. 4 shows a schematic structural diagram of the device.
The device comprises:
a signal receiving unit 401 for receiving the speech signal to be detected;
a determining unit 402 for determining each basic speech unit of the speech signal and the speech frames and frame count corresponding to the basic speech unit;
a variable-frame-shift calculating unit 403 for calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames;
a framing unit 404 for re-framing the normalized basic speech unit according to the variable frame shift and the preset fixed frame length;
a segmental feature extraction unit 405 for extracting the segmental features of the re-framed basic speech unit;
a likelihood calculating unit 406 for calculating the likelihood between the segmental features of the basic speech unit and the preset standard pronunciation model corresponding to the basic speech unit, the preset standard pronunciation model having been obtained by extracting the acoustic features of the basic speech unit on a training set in advance and training a statistical model;
a detecting unit 407 for determining according to the likelihood whether the pronunciation of the basic speech unit is correct.
The variable-frame-shift calculating unit 403 is specifically configured to calculate the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1 and the second difference being the fixed frame count minus 1, and to take the product of the ratio and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
As shown in Fig. 5, one specific structure of the variable-frame-shift calculating unit 403 comprises:
a first calculating subunit 501 for calculating the ratio of the first difference to the second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1 and the second difference being the preset fixed frame count minus 1;
a second calculating subunit 502 for taking the product of the ratio of the first difference to the second difference and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift of the speech frames of the basic speech unit.
One specific structure of the segmental feature extraction unit 405 comprises:
an extraction subunit for extracting the acoustic features of each speech frame corresponding to the re-framed basic speech unit;
a splicing subunit for concatenating, in order, the acoustic features of the speech frames within the unit to obtain the segmental features of the basic speech unit.
With the pronunciation detection device of this embodiment of the present invention, the segmental features of each basic speech unit are extracted using an adaptive frame-shift technique. The segmental features not only retain the rich correlation properties of multi-frame spliced feature vectors, containing abundant spectral and tonal information, but also make the pronunciation accuracy of a basic speech unit independent of how its neighboring basic speech units are pronounced. This reduces the error in calculating the pronunciation accuracy of basic speech units, improves the precision of the calculated pronunciation accuracy, facilitates the detection of pronunciation errors, and strongly suppresses and eliminates the effect of speaking rate, significantly improving the accuracy of pronunciation detection.
It should be noted that, in practical applications, the same fixed frame count may be set for different basic speech units, or different fixed frame counts may be set; for example, different fixed frame counts may be set for different types of basic speech units, e.g., separate fixed frame counts for vowels and consonants. If more than one fixed frame count is preset, the device may further comprise a fixed-frame-count determining unit (not shown) for determining, according to the type of the basic speech unit, the fixed frame count corresponding to the basic speech unit.
The detecting unit 407 may determine directly from the likelihood calculated by the likelihood calculating unit 406 whether the basic speech unit is pronounced correctly or not; for example, when the likelihood is greater than a set likelihood threshold, the basic speech unit is determined to be pronounced correctly, and otherwise to be mispronounced.
The detecting unit 407 may also further calculate the pronunciation posterior probability of the basic speech unit from the likelihood calculated by the likelihood calculating unit 406, and determine from this posterior probability whether the basic speech unit is pronounced correctly or not. Correspondingly, one specific structure of the detecting unit 407, shown in Fig. 6, comprises:
a calculating subunit 601 for calculating the pronunciation posterior probability of the basic speech unit from the likelihood;
a judging subunit 602 for determining, when the posterior probability is greater than a set probability threshold, that the basic speech unit is pronounced correctly, and otherwise that the basic speech unit is mispronounced.
For the calculation of the posterior probability, reference may be made to the description in the method embodiment above; it is not repeated here.
For convenience of description, the device above has been described as divided into various units by function. Of course, when the present invention is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the description of the embodiments above, those skilled in the art can clearly understand that the present invention may be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments of the present invention or in certain parts thereof.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may be referred to mutually, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the description of the method embodiment for the relevant parts. The system embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The present invention may be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
The present invention may be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
The above are only specific embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may also make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (12)

1. A pronunciation detection method, characterized by comprising:
receiving a speech signal to be detected;
determining each basic speech unit of the speech signal, as well as the speech frames and frame count corresponding to the basic speech unit;
calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames;
re-framing the normalized basic speech unit according to the variable frame shift and a preset fixed frame length;
extracting the segmental features of the re-framed basic speech unit;
calculating the likelihood between the segmental features of the basic speech unit and a preset standard pronunciation model corresponding to the basic speech unit, the preset standard pronunciation model having been obtained by extracting the acoustic features of the basic speech unit on a training set in advance and training a statistical model;
determining, according to the likelihood, whether the pronunciation of the basic speech unit is correct.
2. The method according to claim 1, characterized in that calculating the variable frame shift required to normalize the basic speech unit to the preset fixed number of frames comprises:
calculating the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1, and the second difference being the fixed frame count minus 1;
taking the product of the ratio of the first difference to the second difference and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
3. The method according to claim 1, characterized in that, before calculating the variable frame shift required to normalize the basic speech unit to the preset fixed number of frames, the method further comprises:
determining, according to the type of the basic speech unit, the fixed frame count corresponding to the basic speech unit.
4. The method according to claim 1, characterized in that extracting the segmental features of the re-framed basic speech unit comprises:
extracting the acoustic features of each speech frame corresponding to the re-framed basic speech unit;
concatenating, in order, the acoustic features of the speech frames within the unit to obtain the segmental features of the basic speech unit.
5. The method according to any one of claims 1 to 4, characterized in that determining according to the likelihood whether the pronunciation of the basic speech unit is correct comprises:
if the likelihood is greater than a set likelihood threshold, determining that the basic speech unit is pronounced correctly;
otherwise, determining that the basic speech unit is mispronounced.
6. The method according to any one of claims 1 to 4, characterized in that determining according to the likelihood whether the pronunciation of the basic speech unit is correct comprises:
calculating the pronunciation posterior probability of the basic speech unit from the likelihood;
if the posterior probability is greater than a set probability threshold, determining that the basic speech unit is pronounced correctly;
otherwise, determining that the basic speech unit is mispronounced.
7. A pronunciation detection device, characterized by comprising:
a signal receiving unit for receiving a speech signal to be detected;
a determining unit for determining each basic speech unit of the speech signal, as well as the speech frames and frame count corresponding to the basic speech unit;
a variable-frame-shift calculating unit for calculating the variable frame shift required to normalize the basic speech unit to a preset fixed number of frames;
a framing unit for re-framing the normalized basic speech unit according to the variable frame shift and a preset fixed frame length;
a segmental feature extraction unit for extracting the segmental features of the re-framed basic speech unit;
a likelihood calculating unit for calculating the likelihood between the segmental features of the basic speech unit and a preset standard pronunciation model corresponding to the basic speech unit, the preset standard pronunciation model having been obtained by extracting the acoustic features of the basic speech unit on a training set in advance and training a statistical model;
a detecting unit for determining, according to the likelihood, whether the pronunciation of the basic speech unit is correct.
8. The device according to claim 7, characterized in that the variable-frame-shift calculating unit is specifically configured to calculate the ratio of a first difference to a second difference, the first difference being the frame count of the speech frames corresponding to the basic speech unit minus 1 and the second difference being the fixed frame count minus 1, and to take the product of the ratio and the frame shift of the speech frames of the basic speech unit before normalization as the variable frame shift.
9. The device according to claim 7, characterized in that the device further comprises:
a fixed-frame-count determining unit for determining, according to the type of the basic speech unit, the fixed frame count corresponding to the basic speech unit.
10. The device according to claim 7, characterized in that the segmental feature extraction unit comprises:
an extraction subunit for extracting the acoustic features of each speech frame corresponding to the re-framed basic speech unit;
a splicing subunit for concatenating, in order, the acoustic features of the speech frames within the unit to obtain the segmental features of the basic speech unit.
11. The device according to any one of claims 7 to 10, characterized in that the detecting unit is specifically configured to determine, when the likelihood is greater than a set likelihood threshold, that the basic speech unit is pronounced correctly, and otherwise that the basic speech unit is mispronounced.
12. The device according to any one of claims 7 to 10, characterized in that the detecting unit comprises:
a calculating subunit for calculating the pronunciation posterior probability of the basic speech unit from the likelihood;
a judging subunit for determining, when the posterior probability is greater than a set probability threshold, that the basic speech unit is pronounced correctly, and otherwise that the basic speech unit is mispronounced.
CN201410692378.2A 2014-11-25 2014-11-25 Pronunciation detection method and device Active CN105609114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410692378.2A CN105609114B (en) 2014-11-25 2014-11-25 Pronunciation detection method and device

Publications (2)

Publication Number Publication Date
CN105609114A true CN105609114A (en) 2016-05-25
CN105609114B CN105609114B (en) 2019-11-15

Family

ID=55988997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410692378.2A Active CN105609114B (en) Pronunciation detection method and device

Country Status (1)

Country Link
CN (1) CN105609114B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1750121A (en) * 2004-09-16 2006-03-22 北京中科信利技术有限公司 A kind of pronunciation evaluating method based on speech recognition and speech analysis
CN101383103A (en) * 2006-02-28 2009-03-11 安徽中科大讯飞信息科技有限公司 Spoken language pronunciation level automatic test method
GB2458461A (en) * 2008-03-17 2009-09-23 Kai Yu Spoken language learning system
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN102214462A (en) * 2011-06-08 2011-10-12 北京爱说吧科技有限公司 Method and system for estimating pronunciation
CN103065626A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic grading method and automatic grading equipment for read questions in test of spoken English
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN103617799A (en) * 2013-11-28 2014-03-05 广东外语外贸大学 Method for detecting English statement pronunciation quality suitable for mobile device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106205603A (en) * 2016-08-29 2016-12-07 北京语言大学 A kind of tone appraisal procedure
CN106205603B (en) * 2016-08-29 2019-06-07 北京语言大学 A kind of tone appraisal procedure
CN106448660A (en) * 2016-10-31 2017-02-22 闽江学院 Natural language fuzzy boundary determining method with introduction of big data analysis
CN106448660B (en) * 2016-10-31 2019-09-17 闽江学院 It is a kind of introduce big data analysis natural language smeared out boundary determine method
CN106782609A (en) * 2016-12-20 2017-05-31 杨白宇 A kind of spoken comparison method
CN107886943A (en) * 2017-11-21 2018-04-06 广州势必可赢网络科技有限公司 A kind of method for recognizing sound-groove and device
CN111951825A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, medium, device and computing equipment

Also Published As

Publication number Publication date
CN105609114B (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
US10930270B2 (en) Processing audio waveforms
CN108010515B (en) Voice endpoint detection and awakening method and device
CN108305641B (en) Method and device for determining emotion information
CN108305643B (en) Method and device for determining emotion information
JP6400936B2 (en) Voice search method, voice search device, and program for voice search device
CN103559879B (en) Acoustic feature extracting method and device in language recognition system
CN105702250B (en) Speech recognition method and device
EP2700071B1 (en) Speech recognition using multiple language models
CN105609114A (en) Method and device for detecting pronunciation
CN103400577A (en) Acoustic model building method and device for multi-language voice identification
CN108630200B (en) Voice keyword detection device and voice keyword detection method
CN105529028A (en) Voice analytical method and apparatus
CN110503944B (en) Method and device for training and using voice awakening model
US9799325B1 (en) Methods and systems for identifying keywords in speech signal
KR20120072145A (en) Method and apparatus for recognizing speech
CN110600008A (en) Voice wake-up optimization method and system
CN109858038A (en) A kind of text punctuate determines method and device
CN103903633A (en) Method and apparatus for detecting voice signal
KR102167157B1 (en) Voice recognition considering utterance variation
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
CN103559289A (en) Language-irrelevant keyword search method and system
CN111386566A (en) Device control method, cloud device, intelligent device, computer medium and device
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN110349567B (en) Speech signal recognition method and device, storage medium and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant