CN102982811B - Voice endpoint detection method based on real-time decoding - Google Patents


Info

Publication number: CN102982811B
Application number: CN201210483046.4A
Authority: CN (China)
Prior art keywords: voice, text, starting point, decoding, point
Legal status: Active (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN102982811A (en)
Inventors: 吴玲, 王兵, 赵乾, 潘颂声, 何春江, 朱群
Current assignee: Anhui Hear Technology Co Ltd
Original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; application granted; published as CN102982811A, then granted as CN102982811B


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A voice endpoint detection method based on real-time decoding includes the following steps: input the text associated with the speech to be recognized and parse it; build a decoding network from the parsing result; input speech, extract its acoustic features, and decode the features against the constructed decoding network to obtain a decoded speech-unit sequence; perform endpoint judgment on the decoded sequence to decide whether a voice endpoint has been reached, endpoints being divided into voice starting points and voice end points; if an endpoint is found, feed the endpoint information back to the external application system, otherwise continue decoding. The starting-point judgment in the endpoint-judgment step is optional: if the external application system does not care about the voice starting point, it is not judged. The method solves the problem that, when the recognition text is known in advance, conventional endpoint detection technology responds too slowly and cannot selectively detect the speech the user cares about.

Description

A voice endpoint detection method based on real-time decoding
Technical field
The present invention relates to a voice endpoint detection method based on decoding results, and in particular to a method for feeding back the voice end point in a timely manner.
Background art
Voice endpoint detection determines the starting point and the end point of speech, excluding the silent segments from the speech signal. The correctness of endpoint detection has a great impact on speech recognition performance. In a speech evaluation system, the content of the user's recording is determined by a prompt text; providing the voice end point promptly once the user has finished reading the prompt, so that computation can stop, helps improve system performance and evaluation quality. In external application systems, the quality of endpoint detection directly affects the user experience.
For example, in language-learning software, endpoint detection runs while the user records for evaluation: when the end of speech is detected, recording stops automatically, sparing the user the trouble of pressing a stop button and greatly improving the experience over repeated use. In a speech control system such as a smart home, the user can control a lamp with commands such as "turn on the light" and "turn off the light"; when endpoint detection is not responsive enough, these commands respond too slowly and the experience is poor, whereas if the lamp turns on just as the user finishes saying "light", the experience is excellent.
Existing endpoint detection methods fall into two classes: threshold methods and pattern recognition methods.
(1) Threshold methods
A feature of the speech signal is extracted, such as short-time energy, short-time average magnitude, or zero-crossing rate; its value is computed, a threshold is chosen based on practical conditions and experience, and a decision strategy determines whether a frame is a speech start frame or end frame. Typical algorithms use short-time energy together with the short-time zero-crossing rate, or apply cepstral features.
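As a concrete illustration of the threshold method, the following sketch (an illustrative assumption, not part of the patent) computes short-time energy and zero-crossing rate per frame and applies a single energy threshold; the frame length and threshold values are chosen for demonstration only:

```python
import numpy as np

def short_time_features(signal, frame_len=200):
    """Return per-frame (short-time energy, zero-crossing count)."""
    n_frames = len(signal) // frame_len
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames, dtype=int)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy[i] = np.sum(frame ** 2)                          # short-time energy
        zcr[i] = np.sum(np.abs(np.diff(np.sign(frame))) > 0)    # zero crossings
    return energy, zcr

def is_speech(energy, threshold=1.0):
    """Single-threshold decision: a frame is speech if its energy exceeds the threshold."""
    return energy > threshold

# 1 s of low-level noise followed by 1 s of a 440 Hz tone, sampled at 8 kHz.
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([0.001 * rng.standard_normal(sr),
                         0.5 * np.sin(2 * np.pi * 440 * t)])
energy, zcr = short_time_features(signal)
speech = is_speech(energy)   # first 40 frames silent, last 40 frames speech
```

In a real system the threshold would be tuned to the recording conditions, which is precisely the fragility the patent's text-based method aims to avoid.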
(2) Pattern recognition methods
These methods treat endpoint detection as classifying every frame of the signal: a detection criterion is established, each frame is classified, and the frame is judged to be background noise or speech. This class includes endpoint detection based on autocorrelation similarity distance and endpoint detection based on HMM models.
All of the above methods are unrelated to the text the user reads aloud.
When the text content of the speech to be recognized is known, as in English- or Chinese-learning systems, the text the user reads is fixed and the external application only cares about the speech corresponding to that text; it is desirable that the endpoint detection module report the voice end position the moment the user finishes the last word of the specified text or command.
When the user reads the specified text normally, existing endpoint detection techniques, not knowing or not using the text content, must wait for a subsequent stretch of non-speech data before making a decision, so the response time is long.
If the user continues speaking content unrelated to the specified text after finishing it, existing endpoint detection cannot distinguish this speech the system does not care about, and cannot report a suitable voice end point.
In some application scenarios, the voice end point should be reported, and recording stopped, only after the user has read the complete command word or sentence; if the user pauses for a long time after reading half of the text, existing endpoint detection may detect that silence and report the end point prematurely, failing to meet this requirement.
Summary of the invention
Problem addressed by the invention: to overcome the deficiencies of the prior art by providing a voice endpoint detection method based on real-time decoding, solving the problem that, when the recognition text is known, existing endpoint detection techniques respond too slowly and cannot selectively detect the speech the user cares about.
Technical solution: a voice endpoint detection method based on real-time decoding, that is, an endpoint detection method that incorporates the text content, implemented as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build a decoding network from the text parsing result;
Step 3: input speech and extract its acoustic features; decode the features against the decoding network built in step 2 to obtain a decoded speech-unit sequence, in which each unit is called a frame. An acoustic feature here is a set of values describing the essential short-time characteristics of speech, normally a feature vector of fixed dimensionality (e.g. a 39-dimensional MFCC feature vector).
Step 4: perform voice endpoint judgment on the decoded speech-unit sequence to decide whether a voice endpoint has been reached; endpoints are divided into the voice starting point and the voice end point. If the judgment yields a voice end point, the endpoint information is fed back to the external application system; otherwise step 3 continues. The starting-point judgment in step 4 is optional: if the external application system does not care about the voice starting point, it is not judged.
The voice starting-point judgment in step 4 proceeds as follows:
(1.1) take the optimal path in the decoder. The decoder is one of the cores of a speech recognition system; its task is, given the input acoustic features, to find, according to the acoustic model and the decoding network, the speech-unit sequence that outputs the signal with maximum probability. The decoding network, also called the grammar network, is one of the decoder's inputs: the decoder cannot work without it, and it defines the range of speech-unit sequences the decoder can output;
(1.2) starting-point warning: based on the optimal path in the decoder, judge whether the current speech may have reached the voice starting point of the text; if so, go to step (1.3), otherwise exit;
(1.3) warning confirmation: judge whether the speech contains in-text phonemes or valid garbage speech, thereby confirming whether the voice starting point has really been reached; if so, the starting point is obtained, otherwise exit directly.
The voice end-point judgment in step 4 proceeds as follows:
(2.1) take the current optimal path in the decoder;
(2.2) end-point warning: based on the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise exit;
(2.3) warning confirmation: judge whether the last phoneme of the text was really spoken, using the frame count and the average per-frame likelihood as indicators; if it is judged to have really been spoken, the voice end point is obtained and the process ends, otherwise the process ends directly.
In some application scenarios the user may not finish reading the text content, yet the end point of the speech must still be returned; this requires combining the present detection method with a traditional endpoint detection method, as follows:
(1) input the text associated with the speech to be recognized, and parse the text;
(2) build the decoding network from the text parsing result of step (1);
(3) input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
(4) run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
(5) decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts, or the strategy that both methods must detect it;
(6) feed the voice endpoint back to the external application system.
The decoding network in step 2 is built as follows:
(1) obtain the minimum modeling units from the text parsed in step 1; they may be phonemes, syllables, or words;
(2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate memory for the nodes, and associate the minimum modeling units with the network nodes;
(3) compute the number of arcs in the network according to the allowed reading rules, and allocate memory for the arcs; the allowed reading rules include re-reading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
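The network-construction steps above can be sketched as follows; the adjacency-list representation and the penalty values are illustrative assumptions, not the patent's actual data structures:

```python
def build_decoding_network(units, allow_skip=True, allow_reread=True,
                           skip_penalty=-2.0, reread_penalty=-1.0):
    """Build a decoding network over minimum modeling units (phonemes here).

    Nodes are unit indices; arcs is {node: [(next_node, log-penalty)]}.
    Forward arcs follow the text order; skip and re-read arcs implement
    the allowed reading rules.
    """
    n = len(units)
    arcs = {i: [] for i in range(n)}
    for i in range(n - 1):
        arcs[i].append((i + 1, 0.0))                 # normal reading order
        if allow_skip and i + 2 < n:
            arcs[i].append((i + 2, skip_penalty))    # skip the next unit
        if allow_reread and i > 0:
            arcs[i].append((i - 1, reread_penalty))  # re-read the previous unit
    return {"units": units, "arcs": arcs}

# "中国" (China) decomposed into initials and finals, as in the Fig. 5 example.
net = build_decoding_network(["zh", "ong", "g", "uo"])
```

Penalized skip and re-read arcs let imperfect readings still be decoded, while the penalties keep the normal reading order the preferred path.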
The optimal path in the decoder is taken in steps (1.1) and (2.1) as follows:
(1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the maximum probability after sorting as the optimal path.
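The three sub-steps above amount to taking the maximum-probability path; a minimal sketch, with the path representation (unit sequence, log probability) assumed for illustration:

```python
def optimal_path(paths):
    """Sort decoder paths by accumulated log probability and take the best."""
    ranked = sorted(paths, key=lambda p: p[1])   # step (2): sort by probability
    return ranked[-1]                            # step (3): maximum after sorting

# Each candidate pairs a decoded speech-unit sequence with its log probability.
paths = [
    (["zh", "ong"], -12.5),
    (["zh", "ong", "g"], -9.8),
    (["ong"], -15.1),
]
best = optimal_path(paths)
```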
In step 3, the acoustic features are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC), or perceptual linear prediction coefficients (PLP).
In step 3, the speech-unit sequence is a phoneme sequence, a syllable sequence, or a word sequence.
The decoding in step 3 is Viterbi decoding, or decoding based on dynamic time warping (DTW).
Compared with the prior art, the advantages of the present invention are:
(1) When the user reads the specified text normally, the invention reports the voice end point promptly as the user finishes the last word; the response time is shorter than that of existing endpoint detection techniques and the real-time performance is high.
(2) When the user continues speaking unrelated content after finishing the specified text, the scheme intelligently distinguishes this garbage speech the system does not care about, so the external application system behaves better.
(3) The invention can be used in recording scenarios that require the reading to be complete: no voice end point is reported until the user has read the given content, which existing endpoint detection techniques cannot achieve.
Brief description of the drawings
Fig. 1 is the implementation flowchart of the present invention;
Fig. 2 is the voice starting-point decision flowchart of the present invention;
Fig. 3 is the voice end-point decision flowchart of the present invention;
Fig. 4 is the flowchart of combining the present invention with an existing endpoint detection technique;
Fig. 5 is an example decoding network using Chinese initials and finals as the minimum units;
Fig. 6 is the traditional MFCC feature extraction flow;
Fig. 7 is the traditional endpoint detection flow.
Detailed description
The present invention is a new text-dependent endpoint detection method; Viterbi decoding is taken as the example decoding process (the invention is not limited to Viterbi decoding). The flowchart of the invention is shown in Fig. 1:
Step 1: input the text associated with the speech to be recognized, and parse the text.
The input text is the content the user is expected to read aloud, and is one of the inputs for building the decoding network. This step completes two tasks. First, the text encoding is normalized, e.g. converted uniformly to UTF-8, so that only one set of text parsing code is needed. Second, the text is parsed at the granularity of the modeling units of the acoustic model (word, syllable, or phoneme; phonemes generally work best as modeling units, and the description below uses phonemes throughout), producing a parse tree that contains complete information at six levels: chapter, sentence, word, character, syllable, and phoneme. The first four levels can be parsed with a text front-end segmentation algorithm, and the last two with a pronunciation dictionary.
Step 2: build the decoding network from the text parsing result, specifically:
(2.1) obtain the minimum modeling units (phonemes, syllables, or words) from the text parsed in step 1;
(2.2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate node memory, and associate the minimum modeling units with the network nodes;
(2.3) compute the number of arcs in the network according to the allowed reading rules, such as re-reading and skipping, and allocate arc memory;
(2.4) connect the nodes with arcs according to the reading rules;
(2.5) output the decoding network.
Suppose the recognition text is the two characters "中国" (China) and the minimum modeling units of the network are phonemes (initials and finals). The constructed decoding network is shown in Fig. 5. As can be seen from the figure, the network allows skipping 1-2 characters and re-reading 1-2 characters, so readings with extra or skipped characters, such as the user reading only "国", can still be decoded correctly.
Step 3: input speech, extract the acoustic features, and decode against the decoding network.
Acoustic features are values describing the characteristics of speech, normally a sequence of feature vectors of fixed dimensionality, e.g. 39-dimensional MFCC features, where 39 floating-point values represent one frame of speech. The MFCC extraction flow is shown in Fig. 6; the concrete steps are as follows:
(3.1) A/D conversion: convert the analog signal into a digital signal;
(3.2) pre-emphasis: apply a first-order finite-impulse-response high-pass filter to flatten the spectrum of the signal and make it less susceptible to finite word-length effects;
(3.3) framing: by the short-time stationarity of speech, the signal can be processed in units of frames, typically 25 milliseconds (ms) per frame;
(3.4) windowing: apply a Hamming window to each frame to reduce the effect of the Gibbs phenomenon;
(3.5) fast Fourier transform (FFT): convert the time-domain signal into its power spectrum;
(3.6) triangular-filter filtering: filter the power spectrum with a bank of triangular filters spaced linearly on the Mel scale (24 triangular filters in total); the range covered by each filter approximates one critical bandwidth of the human ear, simulating the ear's masking effect;
(3.7) logarithm: take the logarithm of the outputs of the triangular filter bank, yielding a result close to a homomorphic transform;
(3.8) discrete cosine transform (DCT): remove the correlation between the dimensions of the signal and map it to a lower-dimensional space;
(3.9) spectral weighting: because the low-order cepstral parameters are affected by speaker and channel characteristics, and the high-order parameters have low discriminability, the cepstrum is weighted to suppress its low- and high-order parameters;
(3.10) cepstral mean subtraction (CMS): CMS effectively reduces the influence of the input channel on the feature parameters;
(3.11) delta parameters: extensive experiments show that adding delta parameters, which characterize the dynamic properties of speech, to the features improves recognition performance; the first-order and second-order deltas of the MFCC parameters are also used here.
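Steps (3.2)-(3.8) can be sketched compactly with NumPy. The 25 ms frames and 24-filter Mel bank follow the text; the sample rate, FFT size, and coefficient count are typical values assumed for illustration, and the spectral weighting, CMS, and delta parameters of steps (3.9)-(3.11) are omitted for brevity:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, n_fft=512, n_filters=24, n_ceps=13):
    # (3.2) pre-emphasis: first-order FIR high-pass filter
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (3.3) framing: 25 ms frames at 16 kHz (no overlap, for brevity)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    # (3.4) Hamming window
    frames = frames * np.hamming(frame_len)
    # (3.5) FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # (3.6) 24 triangular filters spaced linearly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # (3.7) log of the filter-bank outputs
    log_fb = np.log(power @ fbank.T + 1e-10)
    # (3.8) DCT: decorrelate and keep the first n_ceps coefficients
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_fb @ dct.T

# One second of a synthetic tone at 16 kHz -> 40 frames of 13 coefficients.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 300 * t))
```

Appending the first- and second-order deltas of step (3.11) to 13 static coefficients yields the 39-dimensional vectors mentioned in the text.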
After feature extraction, real-time decoding begins. Decoding is an important step of the present invention (Viterbi decoding is taken as the example). The decoding process is: for every input frame of acoustic features, compute the output probability and the intra-node state transition probability of the corresponding node on each currently feasible path in the decoding network, and update the accumulated probability of the current path. The output probability is computed from the hidden Markov model corresponding to the node's phoneme and the acoustic features; the intra-node state transition probability is read directly from the model. When decoding reaches the last state inside a node, the current decoding path is expanded by following the decoding network: when the node connects to multiple nodes, multiple paths must be expanded and decoding proceeds on each, and if an arc of the decoding network carries a path penalty, the penalty is added to the accumulated probability of the path. The decoding process thus generates multiple paths in real time.
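A simplified sketch of this per-frame update follows. The emission scoring is a stand-in for the HMM output probabilities, node-internal states are collapsed into a single state, and the two-node network is an illustrative assumption:

```python
def viterbi_step(tokens, arcs, emit, frame):
    """Advance all active decoding paths by one frame, keeping the best token per node.

    tokens: list of (node, accumulated_log_prob);
    arcs: {node: [(next_node, arc_penalty)]};
    emit(node, frame): log output probability of the frame at that node.
    """
    best = {}
    for node, logp in tokens:
        scored = logp + emit(node, frame)            # output probability of this frame
        # Stay in the node (self-loop) or expand along the network's arcs;
        # an arc penalty is added to the path's accumulated probability.
        for nxt, penalty in [(node, 0.0)] + arcs.get(node, []):
            cand = scored + penalty
            if nxt not in best or cand > best[nxt]:
                best[nxt] = cand
    return list(best.items())

# Two-node network "a" -> "b"; the emission strongly favors the matching label.
arcs = {"a": [("b", 0.0)]}
emit = lambda node, frame: 0.0 if node == frame else -5.0
tokens = [("a", 0.0)]
for frame in ["a", "a", "b"]:
    tokens = viterbi_step(tokens, arcs, emit, frame)
best_node = max(tokens, key=lambda t: t[1])[0]   # best path ends in "b"
```

Keeping only the best token per node is the standard Viterbi pruning that keeps the number of live paths bounded during real-time decoding.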
Step 4: judge whether a voice endpoint has been reached. Endpoints are divided into the voice starting point and the voice end point. If a voice end point is found, it is fed back to the external application system; otherwise step 3 continues. The starting-point judgment is optional: if the external system does not care about the starting point, it need not be judged; the present invention covers both cases.
Voice endpoint judgment comprises one or both of the voice starting-point judgment and the voice end-point judgment.
Judgment of the voice starting point: the flow is shown in Fig. 2. It divides roughly into the following steps:
Step 1: take the decoder's optimal path. A path is the route the decoder travels to reach a node; because it is not known at the start which route is best, the decoder explores many routes, each corresponding to a speech-unit sequence, and the optimal path is the one whose speech-unit sequence currently has the maximum probability. In detail:
(1.1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(1.2) sort the paths by probability;
(1.3) take the path with the maximum probability after sorting as the optimal path.
Step 2: starting-point warning: judge from the optimal path whether the starting point may currently have been reached; if so, go to step 3, otherwise exit the flow. Taking decoding with the network of Fig. 5 as an example, with recognition text "中国", "zh" is the starting phoneme of the text, so this step judges whether decoding has reached the "zh" node of the network.
Step 3: warning confirmation: confirm whether the voice starting point has really been reached by judging whether the speech contains in-text phonemes or valid garbage speech. If so, the starting point is obtained; otherwise exit the flow directly. With the network of Fig. 5, this amounts to judging whether the probability of the decoding path reaching "zh" is large enough.
Judgment of the voice end point: the flow is shown in Fig. 3.
Step 1: take the current optimal path in the decoder, by the same method as in the starting-point judgment above.
Step 2: end-point warning: judge from the optimal path whether the last phoneme of the text may have been spoken; if so, go to step 3, otherwise exit the flow. With the network of Fig. 5, this means judging whether the decoder's optimal path has decoded to the "uo" node.
Step 3: warning confirmation: judge whether the last phoneme of the text was really spoken; the decision can be made with indicators such as the frame count and the average per-frame likelihood. If it is judged to have really been spoken, the voice end point is obtained and the flow ends; otherwise the flow ends directly. With the network of Fig. 5, this means checking whether the frame count and average likelihood of the optimal path reaching "uo" are reasonable.
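The two-stage end-point decision above can be sketched as follows; the frame-count and likelihood thresholds are illustrative assumptions, to be tuned in a real system:

```python
def endpoint_warning(best_path_units, last_phoneme):
    """Stage 1 (warning): has the decoder's best path reached the text's final phoneme?"""
    return len(best_path_units) > 0 and best_path_units[-1] == last_phoneme

def endpoint_confirm(frames_in_last, avg_likelihood,
                     min_frames=5, min_avg_likelihood=-8.0):
    """Stage 2 (confirmation): was the final phoneme really spoken?

    Decided by the number of frames spent in the phoneme and their
    average log likelihood, per the indicators named in the text.
    """
    return frames_in_last >= min_frames and avg_likelihood >= min_avg_likelihood

def detect_endpoint(best_path_units, last_phoneme, frames_in_last, avg_likelihood):
    if not endpoint_warning(best_path_units, last_phoneme):
        return False                      # warning not raised: keep decoding
    return endpoint_confirm(frames_in_last, avg_likelihood)

# "中国": the best path has reached "uo" and stayed there for 8 frames with a
# plausible average likelihood, so the end point is confirmed.
confirmed = detect_endpoint(["zh", "ong", "g", "uo"], "uo", 8, -3.2)
```

The warning stage keeps the confirmation cheap: the frame-count and likelihood checks only run once the best path has plausibly reached the end of the text.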
Combination with traditional endpoint detection (extended scheme):
In some application scenarios the user may not finish reading the text content, yet the endpoint detection system must still return the end point of the speech; this requires combination with traditional endpoint detection technology, with the flowchart shown in Fig. 4. Suppose the traditional endpoint detection flow in Fig. 4 is energy-based endpoint detection, whose flowchart is shown in Fig. 7; it divides into the following steps:
Step 1: input speech;
Step 2: segment the speech into frames and extract the short-time energy;
Step 3: make a comprehensive decision, classifying speech and non-speech segments according to the current short-time energy; a single-threshold or a double-threshold method can be used;
Step 4: feed back the endpoint according to the decision result of step 3.
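The traditional energy-based flow above can be sketched with the double-threshold strategy: a high energy threshold finds definite speech, and a low threshold extends that region to its boundaries. The thresholds and frame size are illustrative assumptions:

```python
import numpy as np

def energy_endpoints(signal, frame_len=200, low=0.5, high=5.0):
    """Return (start_frame, end_frame) of the detected speech region, or None."""
    n_frames = len(signal) // frame_len
    energy = np.array([np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    definite = np.flatnonzero(energy > high)       # frames above the high threshold
    if definite.size == 0:
        return None
    start, end = int(definite[0]), int(definite[-1])
    while start > 0 and energy[start - 1] > low:   # extend left with the low threshold
        start -= 1
    while end < n_frames - 1 and energy[end + 1] > low:   # extend right
        end += 1
    return start, end

# Noise, then 1 s of a 440 Hz tone, then noise again, at 8 kHz (25 ms frames).
rng = np.random.default_rng(1)
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([0.001 * rng.standard_normal(sr),
                         0.5 * np.sin(2 * np.pi * 440 * t),
                         0.001 * rng.standard_normal(sr)])
endpoints = energy_endpoints(signal)   # speech occupies frames 40-79
```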
The processing steps for combining the present invention with a traditional endpoint detection method are as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build the decoding network from the text parsing result of step 1;
Step 3: input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
Step 4: run the endpoint detection method of the present invention and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
Step 5: decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts, or the strategy that both methods must detect it;
Step 6: feed the voice endpoint back to the external application system.
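The combination decision of step 5 reduces to a boolean policy over the two detectors' outputs, sketched here with assumed policy names:

```python
def combine_endpoint_decisions(decoded_hit, traditional_hit, policy="either"):
    """Combine the text-decoding-based and traditional endpoint decisions.

    policy "either": report an endpoint as soon as either detector fires
    (faster response); policy "both": require both detectors (stricter).
    """
    if policy == "either":
        return decoded_hit or traditional_hit
    if policy == "both":
        return decoded_hit and traditional_hit
    raise ValueError("policy must be 'either' or 'both'")

# Example: the decoder has confirmed the last phoneme, but the energy-based
# detector still treats trailing noise as speech.
decision = combine_endpoint_decisions(True, False, policy="either")
```

"Either" favors responsiveness (the decoder can fire the moment the text is finished), while "both" favors robustness against false endpoints.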
" whether starting point detected " in upper Fig. 4 and contain the process detecting that starting point does not then detect.
In a word, the invention solves when speech recognition text is determined, the real-time that legacy endpoint detection technique shows is not high, cannot carry out aimed detection problem to the voice that user is concerned about.
Non-elaborated part of the present invention belongs to the known technology of those skilled in the art.

Claims (7)

1. A voice endpoint detection method based on real-time decoding, characterized in that the implementation steps are as follows:
Step 1: input the text associated with the speech to be recognized, and parse the text;
Step 2: build a decoding network from the text parsing result;
Step 3: input speech in real time and extract its acoustic features; decode the acoustic features against the decoding network built in step 2 to obtain a decoded speech-unit sequence, in which each unit is called a frame;
Step 4: perform voice endpoint judgment on the decoded speech-unit sequence to decide whether a voice endpoint has been reached, the endpoints being divided into the voice starting point and the voice end point; if the judgment yields a voice end point, feed the endpoint information back to the external application system, otherwise continue with step 3; the starting-point judgment in step 4 is optional, and if the external application system does not care about the voice starting point, the starting point is not judged;
the voice starting-point judgment in step 4 is as follows:
(1.1) take the optimal path in the decoder;
(1.2) starting-point warning: based on the optimal path in the decoder, judge whether the current speech may have reached the voice starting point of the text; if so, go to step (1.3), otherwise end the judgment;
(1.3) warning confirmation: judge whether the speech contains in-text phonemes or valid garbage speech, thereby confirming whether the voice starting point has really been reached; if so, the starting point is obtained, otherwise exit directly;
the voice end-point judgment in step 4 is as follows:
(2.1) take the current optimal path in the decoder;
(2.2) end-point warning: based on the optimal path in the decoder, judge whether the last phoneme of the text may have been spoken; if so, go to step (2.3), otherwise end the judgment;
(2.3) warning confirmation: judge whether the last phoneme of the text was really spoken, using the frame count and the average per-frame likelihood as indicators; if it is judged to have really been spoken, the voice end point is obtained and the process ends, otherwise the process ends directly.
2. The voice endpoint detection method based on real-time decoding according to claim 1, characterized in that in some application scenarios the user may not finish reading the text content yet the end point of the speech must still be returned, which requires combining the method with a traditional endpoint detection method, the combined processing steps being as follows:
(1) input the text associated with the speech to be recognized, and parse the text;
(2) build the decoding network from the text parsing result of step (1);
(3) input speech; on one hand extract its acoustic features, and on the other hand pass the speech to the traditional endpoint detection module;
(4) run the voice endpoint detection method and the traditional endpoint detection simultaneously, each detecting voice endpoints independently;
(5) decide whether a voice endpoint has been reached by combining the decisions of the two methods, adopting either the strategy that an endpoint detected by either method counts as an endpoint, or the strategy that both methods must detect it;
(6) feed the voice endpoint back to the external application system.
3. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding network in step 2 is built as follows:
(1) obtain the minimum modeling units from the text parsed in step 1, the minimum modeling units being phonemes, syllables, or words;
(2) compute the number of virtual nodes and the total number of nodes in the network from the number of minimum modeling units, allocate memory for the nodes, and associate the minimum modeling units with the network nodes;
(3) compute the number of arcs in the network according to the allowed reading rules, and allocate memory for the arcs, the allowed reading rules including re-reading and skipping;
(4) connect the nodes with arcs according to the reading rules;
(5) output the decoding network.
4. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the optimal path in the decoder is taken in steps (1.1) and (2.1) as follows:
(1) traverse all paths in the current decoder and parse each path to obtain its speech-unit sequence and probability;
(2) sort the paths by probability;
(3) take the path with the maximum probability after sorting as the optimal path.
5. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the acoustic features in step 3 are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC), or perceptual linear prediction coefficients (PLP).
6. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the speech-unit sequence in step 3 is a phoneme sequence, a syllable sequence, or a word sequence.
7. The voice endpoint detection method based on real-time decoding according to claim 1 or 2, characterized in that the decoding in step 3 is Viterbi decoding, or decoding based on dynamic time warping (DTW).
CN201210483046.4A 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding Active CN102982811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210483046.4A CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Publications (2)

Publication Number Publication Date
CN102982811A CN102982811A (en) 2013-03-20
CN102982811B true CN102982811B (en) 2015-01-14

Family

ID=47856719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210483046.4A Active CN102982811B (en) 2012-11-24 2012-11-24 Voice endpoint detection method based on real-time decoding

Country Status (1)

Country Link
CN (1) CN102982811B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593048B (en) * 2013-10-28 2017-01-11 浙江大学 Voice navigation system and method of animal robot system
CN105981099A (en) * 2014-02-06 2016-09-28 三菱电机株式会社 Speech search device and speech search method
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN106205607B (en) * 2015-05-05 2019-10-29 联想(北京)有限公司 Voice information processing method and speech information processing apparatus
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
US20170069309A1 (en) * 2015-09-03 2017-03-09 Google Inc. Enhanced speech endpointing
CN105261357B (en) * 2015-09-15 2016-11-23 百度在线网络技术(北京)有限公司 Sound end detecting method based on statistical model and device
CN106653022B (en) * 2016-12-29 2020-06-23 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN107146633A (en) * 2017-05-09 2017-09-08 广东工业大学 A kind of complete speech data preparation method and device
CN110520925B (en) * 2017-06-06 2020-12-15 谷歌有限责任公司 End of query detection
CN107423275A (en) * 2017-06-27 2017-12-01 北京小度信息科技有限公司 Sequence information generation method and device
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN108257616A (en) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Interactive detection method and device
CN110364171B (en) * 2018-01-09 2023-01-06 深圳市腾讯计算机系统有限公司 Voice recognition method, voice recognition system and storage medium
CN108538310B (en) * 2018-03-28 2021-06-25 天津大学 Voice endpoint detection method based on long-time signal power spectrum change
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN109087645B (en) * 2018-10-24 2021-04-30 科大讯飞股份有限公司 Decoding network generation method, device, equipment and readable storage medium
CN111583910B (en) * 2019-01-30 2023-09-26 北京猎户星空科技有限公司 Model updating method and device, electronic equipment and storage medium
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110070885B (en) * 2019-02-28 2021-12-24 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN112151073B (en) * 2019-06-28 2024-07-09 北京声智科技有限公司 Voice processing method, system, equipment and medium
CN113160854B (en) * 2020-01-22 2024-10-18 阿里巴巴集团控股有限公司 Voice interaction system, related method, device and equipment
CN111754979A (en) * 2020-07-21 2020-10-09 南京智金科技创新服务中心 Intelligent voice recognition method and device
CN112511698B (en) * 2020-12-03 2022-04-01 普强时代(珠海横琴)信息技术有限公司 Real-time call analysis method based on universal boundary detection
CN112614514B (en) * 2020-12-15 2024-02-13 中国科学技术大学 Effective voice fragment detection method, related equipment and readable storage medium
CN112669880B (en) * 2020-12-16 2023-05-02 北京读我网络技术有限公司 Method and system for adaptively detecting voice ending
CN112652306B (en) * 2020-12-29 2023-10-03 珠海市杰理科技股份有限公司 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762150B2 (en) * 2010-09-16 2014-06-24 Nuance Communications, Inc. Using codec parameters for endpoint detection in speech recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech signal endpoint detection method based on HMM models in noisy environments; Zhu Jie et al.; Journal of Shanghai Jiao Tong University; 1998-10-31; Vol. 32, No. 10; pp. 14-16 *


Similar Documents

Publication Publication Date Title
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
KR100755677B1 (en) Apparatus and method for dialogue speech recognition using topic detection
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN110706690A (en) Speech recognition method and device
CN111640456B (en) Method, device and equipment for detecting overlapping sound
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
WO2018192186A1 (en) Speech recognition method and apparatus
US11495234B2 (en) Data mining apparatus, method and system for speech recognition using the same
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN106558306A (en) Method for voice recognition, device and equipment
CN103035244B (en) Voice tracking method capable of feeding back loud-reading progress of user in real time
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Kadyan et al. Developing children’s speech recognition system for low resource Punjabi language
CN118471201B (en) Efficient self-adaptive hotword error correction method and system for speech recognition engine
Stanek et al. Algorithms for vowel recognition in fluent speech based on formant positions
CN110853669A (en) Audio identification method, device and equipment
CN117765932A (en) Speech recognition method, device, electronic equipment and storage medium
CN113823265A (en) Voice recognition method and device and computer equipment
CN113327596A (en) Training method of voice recognition model, voice recognition method and device
CN113053358A (en) Voice recognition customer service system for regional dialects
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee after: iFlytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: Anhui USTC iFLYTEK Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20170823

Address after: 230088, Hefei province high tech Zone, 2800 innovation Avenue, 248 innovation industry park, H2 building, room two, Anhui

Patentee after: Anhui Hear Technology Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui 230088

Patentee before: iFlytek Co., Ltd.

TR01 Transfer of patent right