CN102982811A - Voice endpoint detection method based on real-time decoding - Google Patents

Voice endpoint detection method based on real-time decoding

Info

Publication number
CN102982811A
CN102982811A CN2012104830464A CN201210483046A CN102982811A CN 102982811 A CN102982811 A CN 102982811A CN 2012104830464 A CN2012104830464 A CN 2012104830464A CN 201210483046 A CN201210483046 A CN 201210483046A CN 102982811 A CN102982811 A CN 102982811A
Authority
CN
China
Prior art keywords
voice
text
point
starting point
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104830464A
Other languages
Chinese (zh)
Other versions
CN102982811B (en)
Inventor
吴玲
王兵
赵乾
潘颂声
何春江
朱群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Hear Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201210483046.4A priority Critical patent/CN102982811B/en
Publication of CN102982811A publication Critical patent/CN102982811A/en
Application granted granted Critical
Publication of CN102982811B publication Critical patent/CN102982811B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A voice endpoint detection method based on real-time decoding includes the following steps: (1) input the text related to the speech recognition task and parse it; (2) construct a decoding network from the parsing result; (3) input speech, extract its acoustic features, and decode the features against the constructed network to obtain a decoded speech-unit sequence; (4) perform endpoint judgment on the decoded sequence to decide whether a voice endpoint has been reached, endpoints being divided into voice starting points and voice ending points; if so, feed the endpoint information back to the external application system, otherwise return to step (3). Starting-point judgment in step (4) is optional: it is skipped if the external application system does not need the starting point. The method solves the problem that, even when the recognition text is known in advance, conventional endpoint detection responds slowly and cannot specifically detect the speech the user cares about.

Description

Voice endpoint detection method based on real-time decoding
Technical field
The present invention relates to a voice endpoint detection method based on decoding results, and in particular to a method that can feed back the voice ending point promptly.
Background technology
Voice endpoint detection determines the starting point and ending point of speech in a signal and excludes the silent segments. Its accuracy has a great impact on speech recognition performance. In a speech evaluation system, the content of the user's recording is determined by the prompt text; returning the ending point as soon as the user finishes reading the prompt, and stopping computation at that moment, improves both system performance and evaluation quality. In external application systems, the quality of endpoint detection directly affects the user experience.
For example, in language-learning software, endpoint detection runs while the user records for evaluation; when the end of speech is detected, recording stops automatically, sparing the user a stop-button operation, which is a large convenience for frequent users. In a voice control system such as a smart home, the user controls a lamp with commands like "turn on the light" and "turn off the light"; if endpoint detection is slow, these commands respond sluggishly and the experience is poor, whereas if the lamp switches on the instant the user finishes the word "light", the experience is excellent.
Existing endpoint detection methods fall into two classes: threshold methods and pattern-recognition methods.
(1) Threshold methods
These extract a feature of the speech, such as short-time energy, short-time average magnitude, or zero-crossing rate, compute its value, set a threshold from the actual conditions and experience, and decide by some strategy whether a frame is a speech start or end frame. Typical algorithms use short-time energy together with short-time zero-crossing rate, or cepstral features.
(2) Pattern-recognition methods
These treat endpoint detection as frame-by-frame classification: a detection criterion is built and each frame of speech is classified as background noise or speech. Methods in this class include endpoint detection based on auto-correlation similarity distance and endpoint detection based on HMM models.
All of the above methods are independent of the text the user reads aloud.
When the text of the speech to be recognized is known, for example in English- or Chinese-learning systems, the text the user reads aloud is determined; the external application cares only about the speech matching that text, and expects the endpoint detection module to report the ending position the moment the user finishes the last word of the specified text or command.
Even when the user reads the specified text normally, existing endpoint detection, not knowing or not using the text being read, must wait for a stretch of non-speech data before deciding, so the response time is long.
If, after finishing the specified text, the user continues reading content unrelated to it, existing endpoint detection cannot distinguish this speech the system does not care about and report a suitable ending point.
In some applications the ending point should be reported, and recording stopped, only when the user has read the complete command or sentence. If the user reads half the text and then pauses for a long time, existing endpoint detection may classify the pause as silence and report the ending point too early, failing to meet the requirement.
Summary of the invention
The technical problem addressed by the present invention: to overcome the deficiencies of the prior art and provide a voice endpoint detection method based on real-time decoding, solving the problem that, when the recognition text is known, existing endpoint detection responds slowly and cannot specifically detect the speech the user cares about.
The technical solution of the present invention: a voice endpoint detection method based on real-time decoding, an endpoint detection method combined with the text content, implemented as follows:
Step 1: input the text related to speech recognition and parse it;
Step 2: construct a decoding network from the parsing result;
Step 3: input speech, extract its acoustic features, and decode the features against the network constructed in step 2 to obtain a decoded speech-unit sequence; each unit in the sequence is called a frame. An acoustic feature here is a set of values describing the short-time characteristics of the speech, normally a feature vector of fixed dimension (e.g. a 39-dimensional MFCC vector).
Step 4: perform endpoint judgment on the decoded speech-unit sequence to decide whether an endpoint has been reached, endpoints being divided into voice starting points and voice ending points. If the result is a voice ending point, feed the endpoint information back to the external application system; otherwise return to step 3. Starting-point judgment in step 4 is optional: if the external application does not need the starting point, it is not judged.
The voice starting point in step 4 is judged as follows:
(1.1) Get the optimal path in the decoder. The decoder is one of the cores of a speech recognition system; its task is, given the input acoustic features, to search, using the acoustic model and the decoding network, for the speech-unit sequence that outputs the signal with maximum probability. The decoding network (also called the grammar network) is one of the decoder's inputs: without it the decoder cannot work, and it defines the range of speech-unit sequences the decoder can output.
(1.2) Starting-point early warning: from the optimal path in the decoder, judge whether the current speech may have reached the starting point of the text; if so, go to step (1.3), otherwise exit.
(1.3) Early-warning confirmation: judge whether the speech contains a phoneme of the text or valid garbage speech, thereby confirming whether the starting point has really been reached. If so, the starting point is obtained; otherwise exit directly.
The voice ending point in step 4 is judged as follows:
(2.1) Get the current optimal path in the decoder;
(2.2) Ending-point early warning: from the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise exit;
(2.3) Early-warning confirmation: decide whether the last phoneme of the text has really been spoken, using indicators such as frame count and average per-frame likelihood. If so, the ending point is obtained and the process ends; otherwise end directly.
In some application scenarios the user may not finish reading the text, yet an ending point must still be returned; this requires combining the detection method of the present invention with a traditional endpoint detection method, as follows:
(1) input the text related to speech recognition and parse it;
(2) construct a decoding network from the parsing result of step (1);
(3) input speech; extract its acoustic features and, in parallel, pass the speech to the traditional endpoint detection module;
(4) run the method of the present invention and the traditional method simultaneously, each detecting endpoints independently;
(5) decide the endpoint by combining the results of the two methods: either accept an endpoint detected by either method, or accept one only when both methods detect it;
(6) feed the endpoint back to the external application system.
The decoding network in step 2 is constructed as follows:
(1) obtain the minimum modeling units (phonemes, syllables, or words) from the text parsed in step 1;
(2) from the number of minimum modeling units, compute the number of dummy nodes and the total node count, allocate memory for the nodes, and associate the minimum modeling units with network nodes;
(3) compute the number of arcs from the permitted reading rules and allocate memory for the arcs; the permitted reading rules include rereading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
The optimal path in steps (1.1) and (2.1) is obtained as follows:
(1) traverse all paths currently in the decoder, parsing each path into its corresponding speech-unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the highest probability as the optimal path.
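These three steps amount to a sort-and-select over the decoder's live hypotheses; a minimal Python sketch, in which `Path` and its fields are illustrative names rather than the patent's own data structures:

```python
from dataclasses import dataclass

@dataclass
class Path:
    units: list      # decoded speech-unit sequence, e.g. ["zh", "ong", "g", "uo"]
    log_prob: float  # accumulated log probability of the path

def best_path(paths):
    """Steps (1)-(3): parse each path, sort by probability, take the maximum."""
    # Sort in descending log probability; the head of the list is the optimal path.
    ranked = sorted(paths, key=lambda p: p.log_prob, reverse=True)
    return ranked[0]
```

In practice a decoder would keep the hypotheses in a beam rather than re-sorting on every frame, but the selection rule is the same.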
In step 3, the acoustic feature is the Mel-frequency cepstral coefficient (MFCC), cepstral coefficient (CEP), linear prediction coefficient (LPC), or perceptual linear prediction coefficient (PLP).
In step 3, the speech-unit sequence is a phoneme sequence, syllable sequence, or word sequence.
In step 3, decoding is Viterbi decoding, or decoding based on dynamic time warping (DTW).
Compared with the prior art, the present invention has the following advantages:
(1) When the user reads the specified text normally, the invention reports the ending point as soon as the user finishes the last word; the response time is shorter than that of existing endpoint detection and the real-time performance is high.
(2) When the user continues reading unrelated content after finishing the specified text, the scheme intelligently distinguishes this garbage speech the system does not care about, improving the effect for the external application system.
(3) The invention can be used in recording scenarios where the completeness of the user's reading matters: no ending point is reported until the user has read the given content, which existing endpoint detection cannot do.
Description of drawings
Fig. 1 is the implementation flowchart of the present invention;
Fig. 2 is the flowchart of voice starting-point judgment in the present invention;
Fig. 3 is the flowchart of voice ending-point judgment in the present invention;
Fig. 4 is the implementation flowchart of the present invention combined with existing endpoint detection;
Fig. 5 is an example decoding network with Chinese initials and finals as the minimum units;
Fig. 6 is the traditional MFCC feature-extraction flow;
Fig. 7 is the traditional endpoint-detection flow.
Embodiment
The present invention is a new text-dependent endpoint detection method. Viterbi decoding is taken as the example decoding process (the invention is not limited to Viterbi decoding). The flowchart of the invention is shown in Fig. 1:
Step 1: input the text related to speech recognition and parse it:
The input text is the content the user is scheduled to read aloud, and is one basis for constructing the decoding network. This step performs two tasks. First, the text encoding is converted to a unified format, e.g. UTF-8, so that the parsing code need only be implemented once. Second, the text is parsed at the granularity of the acoustic model's units (word, syllable, or phoneme; phonemes work best as modeling units, and the description below uses phonemes throughout). Parsing generates a tree structure containing complete information at six levels: chapter, sentence, word, character, syllable, and phoneme. The first four levels can be parsed with a text front-end word-segmentation algorithm, and the last two from a pronunciation dictionary.
Step 2: construct the decoding network from the parsing result, specifically:
(2.1) obtain the minimum modeling units (phonemes, syllables, or words) from the text parsed in step 1;
(2.2) from the number of minimum modeling units, compute the number of dummy nodes and the total node count, allocate node memory, and associate the minimum modeling units with network nodes;
(2.3) compute the number of arcs in the network from the permitted reading rules, such as rereading and skipping, and allocate arc memory;
(2.4) connect the nodes with arcs according to the reading rules;
(2.5) output the decoding network.
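The construction steps above can be sketched as follows; the node-and-arc representation, the special `<start>`/`<end>` nodes, and the skip/reread limits are illustrative assumptions, not the patent's implementation:

```python
def build_network(units, max_skip=2, allow_reread=True):
    """Sketch of steps (2.1)-(2.5): nodes are minimum modeling units
    (here phonemes); arcs encode normal reading, rereading, and skipping."""
    nodes = ["<start>"] + list(units) + ["<end>"]
    arcs = []
    for i in range(len(nodes) - 1):
        arcs.append((i, i + 1, "read"))            # normal advance to next unit
        if allow_reread and 0 < i < len(nodes) - 1:
            arcs.append((i, i, "reread"))          # self-loop: unit read again
        for s in range(2, max_skip + 2):           # arc (i, i+s) skips s-1 units
            if i + s < len(nodes):
                arcs.append((i, i + s, "skip"))
    return nodes, arcs
```

For the text "中国" with phonemes zh, ong, g, uo, such a network accepts the normal reading as well as readings that skip or repeat units, as the Fig. 5 example describes.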
Suppose the recognition text is the two characters "中国" (China) and the minimum modeling unit is the phoneme (initials and finals). The constructed decoding network is shown in Fig. 5. As the figure shows, the network allows skipping 1-2 characters and rereading 1-2 characters, so a user who skips characters (reading only "国") or rereads them can still be decoded correctly.
Step 3: input speech, extract acoustic features, and decode against the decoding network:
An acoustic feature is a set of values describing the speech signal, normally a sequence of fixed-dimension feature vectors, such as 39-dimensional MFCC (Mel-frequency cepstral coefficient) features, where 39 floating-point values represent one frame of speech. The MFCC extraction flow is shown in Fig. 6; the steps are as follows:
(3.1) A/D conversion: convert the analog signal to a digital signal;
(3.2) Pre-emphasis: pass the signal through a first-order finite-impulse-response high-pass filter to flatten its spectrum and make it less susceptible to finite-word-length effects;
(3.3) Framing: exploiting the short-time stationarity of speech, process the signal frame by frame, typically taking 25 milliseconds (ms) as one frame;
(3.4) Windowing: apply a Hamming window to each frame to reduce the Gibbs effect;
(3.5) Fast Fourier transform (FFT): convert the time-domain signal into the signal's power spectrum;
(3.6) Triangular-filter filtering: filter the power spectrum with a bank of triangular filters (24 in total) spaced linearly on the Mel scale; the range covered by each filter approximates a critical band of the human ear, simulating the ear's masking effect;
(3.7) Logarithm: take the logarithm of the triangular filter-bank outputs, yielding a result close to a homomorphic transform;
(3.8) Discrete cosine transform (DCT): remove the correlation between dimensions and map the signal to a lower-dimensional space;
(3.9) Spectral weighting: because the low-order cepstral parameters are affected by speaker and channel characteristics, while the high-order parameters have low resolution, weight the cepstrum to suppress its low- and high-order parameters;
(3.10) Cepstral mean subtraction (CMS): CMS effectively reduces the effect of the input channel on the feature parameters;
(3.11) Differential parameters: extensive experiments show that adding differential parameters characterizing the speech dynamics improves recognition performance; the first- and second-order differences of the MFCC parameters are also used here.
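A minimal numpy sketch of the early steps, (3.2) through (3.5); frame overlap, the Mel filter bank, log, DCT, weighting, CMS, and deltas are omitted, and the parameter values are common defaults rather than values from the patent:

```python
import numpy as np

def frames_power_spectrum(signal, sample_rate=16000, frame_ms=25,
                          preemph=0.97, n_fft=512):
    """Pre-emphasis, 25 ms framing, Hamming window, FFT power spectrum.
    The remaining MFCC steps (filter bank, log, DCT, ...) would follow."""
    # (3.2) pre-emphasis: first-order high-pass filter y[n] = x[n] - a*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # (3.3) framing: non-overlapping 25 ms frames, for simplicity
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[:n_frames * frame_len].reshape(n_frames, frame_len)
    # (3.4) Hamming window applied to every frame
    frames = frames * np.hamming(frame_len)
    # (3.5) FFT of each frame -> power spectrum
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2
```

Real front ends overlap frames (e.g. 10 ms shift) so that the dynamics between frames are not lost.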
After feature extraction, real-time decoding proceeds; decoding is an important step of the present invention (Viterbi decoding as the example). The decoding process is: for each input frame of acoustic features, compute the output probability and the intra-node state-transition probability of the node corresponding to each currently feasible path in the decoding network, and update each path's accumulated probability. The output probability is computed from the phoneme hidden Markov model (HMM) corresponding to the node and the acoustic features; the intra-node transition probabilities are read directly from the model. When decoding reaches the last state inside a node, the current decoding path is expanded by following the decoding network: if the node connects to several nodes, several paths must be expanded to continue decoding, and if an arc of the network carries a path penalty, the penalty is added to the path's accumulated probability. Decoding thus generates multiple paths in real time.
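The per-frame update described above can be illustrated with a generic Viterbi sketch over log probabilities; the table-based state representation here is an assumption for illustration, not the patent's network-based decoder:

```python
def viterbi(obs_logprob, trans_logprob, init_logprob):
    """For each frame, each state's accumulated score is its best incoming
    transition plus its output (observation) log probability, as in the
    decoding loop described in the text."""
    n_frames, n_states = len(obs_logprob), len(init_logprob)
    score = [init_logprob[s] + obs_logprob[0][s] for s in range(n_states)]
    back = []                                   # backpointers per frame
    for t in range(1, n_frames):
        prev, ptr = score[:], [0] * n_states
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda p: prev[p] + trans_logprob[p][s])
            ptr[s] = best
            score[s] = prev[best] + trans_logprob[best][s] + obs_logprob[t][s]
        back.append(ptr)
    # trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path)), max(score)
```

A network decoder additionally expands paths across node boundaries and applies arc penalties, but the inner accumulation is the same dynamic program.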
Step 4: judge whether an endpoint has been reached; endpoints are divided into voice starting points and voice ending points. If it is the ending point, feed the ending point back to the external application system; otherwise return to step 3. Starting-point judgment in step 4 is optional: if the external system does not need starting-point information, the starting point need not be judged, and the invention covers both cases.
As shown in Fig. 2, endpoint judgment can comprise one or both of voice starting-point judgment and voice ending-point judgment.
Voice starting-point judgment: the flow is shown in Fig. 2 and consists roughly of the following steps.
Step 1: get the decoder's optimal path. A path is a route the decoder takes to reach an end node; since it is not known in advance which route is best, the decoder explores many routes, each corresponding to a speech-unit sequence, and the optimal path is the one whose speech-unit sequence has the maximum probability for the current speech. In detail:
(1.1) traverse all paths currently in the decoder, parsing each into its speech-unit sequence and probability;
(1.2) sort the paths by probability;
(1.3) take the path with the highest probability as the optimal path.
Step 2: starting-point early warning, i.e. judge from the optimal path whether the starting point may have been reached; if so, go to step 3, otherwise exit the flow. Taking decoding on the network of Fig. 5 with recognition text "中国" as an example, zh is the text's starting point, so the judgment here is whether decoding has reached the "zh" node of the network.
Step 3: early-warning confirmation, i.e. confirm whether the starting point has really been reached by judging whether the speech contains a phoneme of the text or valid garbage speech. If so, the starting point is obtained; otherwise exit the flow. In the Fig. 5 example, the judgment is whether the probability of the decoding path reaching "zh" is large enough.
Voice ending-point judgment: the flow is shown in Fig. 3.
Step 1: get the current optimal path in the decoder, by the same method as in starting-point judgment above.
Step 2: ending-point early warning, i.e. judge from the optimal path whether the last phoneme of the text may have been spoken; if so, go to step 3, otherwise exit the flow. In the Fig. 5 example, the judgment here is whether the decoder's optimal path has decoded to the "uo" node.
Step 3: early-warning confirmation, i.e. judge whether the last phoneme of the text has really been spoken, deciding by indicators such as frame count and average per-frame likelihood. If so, the ending point is obtained and the flow ends; otherwise the flow ends directly. In the Fig. 5 example, the test is whether the frame count and average likelihood of the optimal path reaching "uo" are reasonable.
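The confirmation step can be sketched as a simple threshold test over the frames aligned to the final phoneme; the threshold values and the alignment representation are illustrative assumptions, not values from the patent:

```python
def confirm_end_point(path_units, frame_counts, frame_loglik,
                      last_phoneme, min_frames=3, min_avg_loglik=-8.0):
    """Accept the ending point only if the optimal path has reached the
    text's last phoneme and the frames aligned to it are both long enough
    (frame count) and likely enough (average per-frame log likelihood)."""
    if not path_units or path_units[-1] != last_phoneme:
        return False                        # early warning not satisfied
    n = frame_counts[-1]                    # frames aligned to the last phoneme
    avg = frame_loglik[-1] / max(n, 1)      # average per-frame log likelihood
    return n >= min_frames and avg >= min_avg_loglik
```

In the Fig. 5 example, `last_phoneme` would be "uo", and tuning the two thresholds trades response speed against false ending points.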
Combination with traditional endpoint detection (extension scheme):
In some scenarios the user may not finish reading the text, yet the endpoint detection system must still return an ending point; this requires combination with traditional endpoint detection, with the flow shown in Fig. 4. Suppose the traditional flow in Fig. 4 is energy-based endpoint detection, shown in Fig. 7 and divided into the following steps:
Step 1: input speech;
Step 2: segment the speech into frames and extract short-time energy;
Step 3: comprehensive judgment: classify speech and non-speech segments from the current short-time energy, using either four thresholds or a double threshold;
Step 4: endpoint feedback: report the detection result according to the judgment of step 3.
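The energy-based flow of Fig. 7 can be sketched with a double-threshold rule: speech starts when energy exceeds a high threshold and ends when it falls back below a low threshold. The threshold values are illustrative:

```python
def energy_endpoints(frames_energy, high=1.0, low=0.3, min_speech=3):
    """Double-threshold sketch: return (start_frame, end_frame) of the first
    speech segment of at least min_speech frames, or (None, None)."""
    start = end = None
    in_speech, run = False, 0
    for i, e in enumerate(frames_energy):
        if not in_speech and e > high:
            in_speech, start, run = True, i, 1    # candidate segment begins
        elif in_speech:
            if e > low:
                run += 1                          # segment continues
            elif run >= min_speech:
                end = i                           # long enough: segment ends
                break
            else:
                in_speech, start, run = False, None, 0   # too short: discard
    if in_speech and end is None and run >= min_speech:
        end = len(frames_energy)                  # speech ran to the last frame
    return start, end
```

This is exactly the text-independent detector whose slow, text-blind decisions the present invention improves on when the text is known.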
The steps of combining the present invention with the traditional endpoint detection method are as follows:
Step 1: input the text related to speech recognition and parse it;
Step 2: construct the decoding network from the parsing result of step 1;
Step 3: input speech; extract the acoustic features and, in parallel, pass the speech to the traditional endpoint detection module;
Step 4: run the method of the present invention and the traditional method simultaneously, each detecting endpoints independently;
Step 5: decide the endpoint by combining the results of the two methods: either accept an endpoint detected by either method, or accept one only when both methods detect it;
Step 6: feed the endpoint back to the external application system.
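Step 5's two fusion strategies can be sketched as follows; the function and parameter names are illustrative, with endpoints given as frame indices and `None` meaning a detector has not fired:

```python
def fuse_endpoints(decode_end, energy_end, strategy="either"):
    """Combine the ending points of the decoding-based method and the
    traditional energy-based method. 'either' accepts whichever detector
    fires first; 'both' waits until the two detectors agree."""
    candidates = [e for e in (decode_end, energy_end) if e is not None]
    if strategy == "either":
        return min(candidates) if candidates else None
    if strategy == "both":
        return max(candidates) if len(candidates) == 2 else None
    raise ValueError("unknown strategy: " + strategy)
```

"either" gives the fastest response (and covers the user who never finishes the text), while "both" is more conservative and reduces premature ending points.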
" whether detect starting point " among upper Fig. 4 and comprised the process that starting point does not then detect that detects.
In summary, the invention solves the problem that, when the speech recognition text is known, traditional endpoint detection responds slowly and cannot specifically detect the speech the user cares about.
Parts of the invention not elaborated here belong to techniques well known to those skilled in the art.

Claims (7)

1. A voice endpoint detection method based on real-time decoding, characterized in that the implementation steps are as follows:
Step 1: input the text related to speech recognition and parse it;
Step 2: construct a decoding network from the parsing result;
Step 3: input speech in real time, extract its acoustic features, and decode the features against the network constructed in step 2 to obtain a decoded speech-unit sequence; each unit in the sequence is called a frame;
Step 4: perform endpoint judgment on the decoded speech-unit sequence to decide whether an endpoint has been reached, endpoints being divided into voice starting points and voice ending points; if the result is a voice ending point, feed the endpoint information back to the external application system, otherwise return to step 3; starting-point judgment in step 4 is optional: if the external application does not need the starting point, it is not judged;
the voice starting point in step 4 is judged as follows:
(1.1) get the optimal path in the decoder;
(1.2) starting-point early warning: from the optimal path in the decoder, judge whether the current speech may have reached the starting point of the text; if so, go to step (1.3), otherwise end the judgment;
(1.3) early-warning confirmation: judge whether the speech contains a phoneme of the text or valid garbage speech, thereby confirming whether the starting point has really been reached; if so, the starting point is obtained, otherwise exit directly;
the voice ending point in step 4 is judged as follows:
(2.1) get the current optimal path in the decoder;
(2.2) ending-point early warning: from the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise end the judgment;
(2.3) early-warning confirmation: decide whether the last phoneme of the text has really been spoken, using indicators such as frame count and average per-frame likelihood; if so, the ending point is obtained and the process ends, otherwise end directly.
2. The voice endpoint detection method based on real-time decoding according to claim 1, characterized in that: in some application scenarios the user may not finish reading the text, yet an ending point must still be returned, which requires combining the present detection method with a traditional endpoint detection method as follows:
(1) input the text related to speech recognition and parse it;
(2) construct a decoding network from the parsing result of step (1);
(3) input speech; extract its acoustic features and, in parallel, pass the speech to the traditional endpoint detection module;
(4) run the present method and the traditional method simultaneously, each detecting endpoints independently;
(5) decide the endpoint by combining the results of the two methods: either accept an endpoint detected by either method, or accept one only when both methods detect it;
(6) feed the endpoint back to the external application system.
3. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding network in step 2 is constructed as follows:
(1) obtain the minimum modeling units (phonemes, syllables, or words) from the text parsed in step 1;
(2) from the number of minimum modeling units, compute the number of dummy nodes and the total node count, allocate memory for the nodes, and associate the minimum modeling units with network nodes;
(3) compute the number of arcs from the permitted reading rules and allocate memory for the arcs; the permitted reading rules include rereading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
4. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the steps of obtaining the optimal path in the decoder in step (1.1) and step (2.1) are as follows:
(1) traverse all the paths in the current decoder, and parse each path to obtain its corresponding voice unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the maximum posterior probability after sorting as the optimal path.
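The optimal-path selection of claim 4 reduces to a sort over (unit sequence, probability) pairs; a hypothetical sketch (the tuple representation is an assumption, not the patent's decoder structure):

```python
def best_path(paths):
    """Pick the decoder path with the highest probability.

    `paths` is a list of (unit_sequence, log_probability) tuples, one tuple
    per active path in the decoder.
    """
    if not paths:
        return None
    # Sorting by probability and taking the last element mirrors the claimed
    # steps: sort, then take the maximum-probability path.
    return sorted(paths, key=lambda p: p[1])[-1]
```

In a real decoder a single `max()` over the active paths would suffice; the explicit sort is kept here only to match the claimed step order.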
5. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the acoustic feature in the third step is the Mel-frequency cepstral coefficient (MFCC), the cepstral coefficient (CEP), the linear prediction coefficient (LPC), or the perceptual linear prediction coefficient (PLP).
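The feature families named in claim 5 share a cepstral backbone; the plain cepstrum (CEP) of one frame can be sketched as follows (an illustrative Python/NumPy sketch, not taken from the patent; MFCC additionally warps the spectrum onto the mel scale and applies a DCT instead of the inverse FFT):

```python
import numpy as np

def cepstrum(frame):
    """Real cepstrum of one speech frame: inverse FFT of the log magnitude
    spectrum. This is the 'CEP' feature family referenced in claim 5."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_spectrum = np.log(spectrum + 1e-10)   # small floor avoids log(0)
    return np.fft.irfft(log_spectrum, n=len(frame))
```

A typical pipeline would window the signal into 20–30 ms frames first and keep only the first dozen or so cepstral coefficients as the feature vector.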
6. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the voice unit sequence in the third step is a phoneme sequence, a syllable sequence, or a word sequence.
7. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding in the third step is Viterbi decoding, or decoding based on dynamic time warping (DTW).
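Of the two decoding options named in claim 7, DTW admits a compact illustration (a textbook dynamic-programming sketch, not the patent's implementation; the distance function is an assumption):

```python
import math

def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic-time-warping distance between two feature sequences.

    Fills the standard DP table where cell (i, j) holds the cost of the best
    alignment of a[:i] with b[:j]."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: insertion, deletion, or match/substitution.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In practice the scalar elements would be frame-level feature vectors (e.g. MFCCs) and `dist` a Euclidean distance; Viterbi decoding replaces the template with an HMM state graph but uses the same dynamic-programming recursion pattern.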
CN201210483046.4A 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding Active CN102982811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210483046.4A CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Publications (2)

Publication Number Publication Date
CN102982811A true CN102982811A (en) 2013-03-20
CN102982811B CN102982811B (en) 2015-01-14

Family

ID=47856719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210483046.4A Active CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Country Status (1)

Country Link
CN (1) CN102982811B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
US20120072211A1 (en) * 2010-09-16 2012-03-22 Nuance Communications, Inc. Using codec parameters for endpoint detection in speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, JIE ET AL.: "HMM-based speech signal endpoint detection method in noisy environments", JOURNAL OF SHANGHAI JIAO TONG UNIVERSITY *

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593048A (en) * 2013-10-28 2014-02-19 浙江大学 Voice navigation system and method of animal robot system
CN103593048B (en) * 2013-10-28 2017-01-11 浙江大学 Voice navigation system and method of animal robot system
CN105981099A (en) * 2014-02-06 2016-09-28 三菱电机株式会社 Speech search device and speech search method
CN105374352A (en) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 Voice activation method and system
CN106205607B (en) * 2015-05-05 2019-10-29 联想(北京)有限公司 Voice information processing method and speech information processing apparatus
CN106205607A (en) * 2015-05-05 2016-12-07 联想(北京)有限公司 Voice information processing method and speech information processing apparatus
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN107004407B (en) * 2015-09-03 2020-12-25 谷歌有限责任公司 Enhanced speech endpoint determination
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced speech endpoint determination
CN105261357B (en) * 2015-09-15 2016-11-23 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device based on statistical model
CN105261357A (en) * 2015-09-15 2016-01-20 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device based on statistics model
CN106653022A (en) * 2016-12-29 2017-05-10 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN106653022B (en) * 2016-12-29 2020-06-23 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN107146633A (en) * 2017-05-09 2017-09-08 广东工业大学 A kind of complete speech data preparation method and device
CN112581982A (en) * 2017-06-06 2021-03-30 谷歌有限责任公司 End of query detection
CN107423275A (en) * 2017-06-27 2017-12-01 北京小度信息科技有限公司 Sequence information generation method and device
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN108257616A (en) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Interactive detection method and device
CN110364171A (en) * 2018-01-09 2019-10-22 深圳市腾讯计算机系统有限公司 A kind of audio recognition method, speech recognition system and storage medium
CN110364171B (en) * 2018-01-09 2023-01-06 深圳市腾讯计算机系统有限公司 Voice recognition method, voice recognition system and storage medium
CN108538310B (en) * 2018-03-28 2021-06-25 天津大学 Voice endpoint detection method based on long-time signal power spectrum change
CN108538310A (en) * 2018-03-28 2018-09-14 天津大学 Voice endpoint detection method based on long-time signal power spectrum change
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN109087645A (en) * 2018-10-24 2018-12-25 科大讯飞股份有限公司 A kind of decoding network generation method, device, equipment and readable storage medium storing program for executing
CN111583910A (en) * 2019-01-30 2020-08-25 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN111583910B (en) * 2019-01-30 2023-09-26 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110070885A (en) * 2019-02-28 2019-07-30 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN112151073A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice processing method, system, device and medium
WO2022016580A1 (en) * 2020-07-21 2022-01-27 南京智金科技创新服务中心 Intelligent voice recognition method and device
CN112511698A (en) * 2020-12-03 2021-03-16 普强时代(珠海横琴)信息技术有限公司 Real-time call analysis method based on universal boundary detection
CN112614514A (en) * 2020-12-15 2021-04-06 科大讯飞股份有限公司 Valid voice segment detection method, related device and readable storage medium
CN112614514B (en) * 2020-12-15 2024-02-13 中国科学技术大学 Effective voice fragment detection method, related equipment and readable storage medium
CN112669880A (en) * 2020-12-16 2021-04-16 北京读我网络技术有限公司 Method and system for adaptively detecting voice termination
CN112669880B (en) * 2020-12-16 2023-05-02 北京读我网络技术有限公司 Method and system for adaptively detecting voice ending
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
CN112652306B (en) * 2020-12-29 2023-10-03 珠海市杰理科技股份有限公司 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN102982811B (en) 2015-01-14

Similar Documents

Publication Publication Date Title
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN102779508B (en) Speech database generation apparatus and method, and speech synthesis system and method thereof
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN100411011C (en) Pronunciation quality evaluating method for language learning machine
CN101751919B (en) Spoken Chinese stress automatic detection method
CN103177733B (en) Pronunciation quality evaluation method and system for standard Chinese erhua (retroflex "r") syllables
CN110706690A (en) Speech recognition method and device
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
WO2018192186A1 (en) Speech recognition method and apparatus
CN104464755B (en) Speech evaluating method and device
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
CN111640456B (en) Method, device and equipment for detecting overlapping sound
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN106558306A (en) Method for voice recognition, device and equipment
CN1787070B (en) On-chip system for language learner
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
KR20190112682A (en) Data mining apparatus, method and system for speech recognition using the same
CN110853669B (en) Audio identification method, device and equipment
CN103035244B (en) Voice tracking method capable of feeding back loud-reading progress of user in real time
CN114550706A (en) Smart campus voice recognition method based on deep learning
Rabiee et al. Persian accents identification using an adaptive neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee after: Iflytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Anhui USTC iFLYTEK Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20170823

Address after: Room 2, Building H2, No. 248 Innovation Industry Park, No. 2800 Innovation Avenue, High-tech Zone, Hefei, Anhui 230088

Patentee after: Anhui hear Technology Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Iflytek Co., Ltd.

TR01 Transfer of patent right