CN103366739B - Adaptive endpoint detection method and system for isolated-word speech recognition - Google Patents

Info

Publication number
CN103366739B
Authority
CN
China
Prior art keywords
frame
word
short
end points
isolated word
Prior art date
Application number
CN201210085584.8A
Other languages
Chinese (zh)
Other versions
CN103366739A (en)
Inventor
霍小四
尹明理
刘军江
Original Assignee
郑州市科学技术情报研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 郑州市科学技术情报研究所
Priority to CN201210085584.8A
Publication of CN103366739A
Application granted
Publication of CN103366739B

Abstract

The invention discloses an adaptive endpoint detection method and system for isolated-word speech recognition, comprising: a. speech input: a speech signal containing the isolated word to be recognized is input; b. speech preprocessing: the speech signal is amplitude-shifted, normalized, and divided into frames, and the short-time average energy and short-time average zero-crossing rate of each frame are calculated; c. coarse detection of the isolated-word endpoints: the endpoints are roughly estimated from the short-time average energy and short-time average zero-crossing rate of each frame, subject to a minimum-length constraint on the runs of consecutive speech frames before and after each endpoint; d. adaptive adjustment of the detection thresholds and precise endpoint detection: the detection thresholds are adjusted dynamically using the limits on the minimum and maximum duration of an isolated word, and the endpoints are fine-tuned earlier or later to obtain precise isolated-word endpoints; e. output of the endpoints and isolated-word recognition: the precise endpoints are output and speech recognition technology is used to recognize the isolated word.

Description

Adaptive endpoint detection method and system for isolated-word speech recognition

Technical field

The present invention relates to a voice activity detection algorithm for isolated-word speech recognition, and more particularly to an endpoint detection algorithm for speaker-independent isolated-word speech recognition that automatically adjusts its detection thresholds according to the background noise.

Background art

Isolated-word speech recognition is the technology by which a machine converts a speech signal containing an isolated word into the corresponding text or command. It has a wide range of applications and a broad market, for example in command-and-control systems and voice-operated toys. In an isolated-word speech recognition system, the input signal contains the isolated-word speech together with background noise and other components; locating the start point and end point of the speech in the input signal is called endpoint detection. The accuracy of endpoint detection directly affects the recognition rate of such a system.

A conventional endpoint detection algorithm is the double-threshold algorithm based on short-time average amplitude and short-time zero-crossing rate. It uses the short-time average amplitude to separate voiced speech from silence, and the short-time average zero-crossing rate to separate unvoiced speech from silence. The algorithm detects endpoints well in signals with a high signal-to-noise ratio, but it is strongly affected by noise and performs poorly on noisy speech.

Summary of the invention

In view of this, the object of the invention is to overcome the noise sensitivity of the existing double-threshold algorithm by adapting the detection thresholds to the strength of the background noise and combining this with a limit on the length of the isolated-word speech, thereby providing a highly robust endpoint detection algorithm for isolated-word speech recognition.

To achieve this object, the invention adopts the following technical solution, which comprises the following steps:

a. Speech input: a speech signal containing the isolated word to be recognized is input;

b. Speech preprocessing: the speech signal is amplitude-shifted, normalized, and divided into frames, and the short-time average energy and short-time average zero-crossing rate of each frame are calculated;

c. Coarse detection of the isolated-word endpoints: the endpoints are roughly estimated from the short-time average energy and short-time average zero-crossing rate of each frame, subject to a minimum-length constraint on the runs of consecutive speech frames before and after each endpoint;

d. Adaptive adjustment of the detection thresholds and precise endpoint detection: the detection thresholds are adjusted dynamically using the limits on the minimum and maximum duration of an isolated word, and the endpoints are fine-tuned earlier or later to obtain precise isolated-word endpoints;

e. Output of the endpoints and isolated-word recognition: the precise isolated-word endpoints are output, and speech recognition technology is used to recognize the isolated word.

In step c, a constraint on the length of the runs of consecutive speech frames before and after each endpoint is introduced when the endpoints are roughly estimated.

In step e, when the endpoints are precisely detected, the detection thresholds are adjusted adaptively according to the length constraint on the isolated word. Four cases are distinguished:

When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time energy high threshold is increased, the start point is moved later and the end point is moved earlier, so that the frame average energy at the start point and at the end point each exceed the new high threshold.

When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time zero-crossing rate threshold is reduced, the start point is moved later and the end point is moved earlier, so that the average zero-crossing rates of the frame preceding the start point and of the frame following the end point each exceed the new zero-crossing rate threshold.

When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time energy high threshold is reduced, the start point is moved earlier and the end point is moved later, so that the frame average energy at the start point and at the end point each exceed the new high threshold.

When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time zero-crossing rate threshold is increased, the start point is moved earlier and the end point is moved later, so that the average zero-crossing rates of the frame preceding the start point and of the frame following the end point each exceed the new zero-crossing rate threshold.

The adaptive endpoint detection system for isolated-word speech recognition comprises: an input device for the isolated-word speech signal to be recognized; a device that amplitude-shifts, normalizes, and frames the speech signal and calculates the short-time average energy and short-time average zero-crossing rate of each frame; a device that roughly estimates the isolated-word endpoints from the short-time average energy and short-time average zero-crossing rate of each frame, subject to a minimum-length constraint on the runs of consecutive speech frames before and after each endpoint; a device that dynamically adjusts the detection thresholds using the limits on the minimum and maximum duration of an isolated word and fine-tunes the endpoints earlier or later to obtain precise isolated-word endpoints; and a device that outputs the precise isolated-word endpoints and recognizes the isolated word using speech recognition technology.

The beneficial effects of the invention are as follows. Because the traditional double-threshold endpoint detection algorithm for isolated-word speech is strongly affected by noise, the invention provides a new endpoint detection algorithm that has a degree of noise immunity and adjusts its detection thresholds adaptively. Compared with the prior art, the invention introduces a duration constraint on consecutive speech frames when detecting endpoints, which makes detection more robust; it also introduces duration limits tied to the isolated word being detected, which are used to adjust the detection thresholds automatically. The algorithm is simple to implement, effective, and fast, and it has a degree of noise immunity. It is particularly suited to small and embedded devices and can serve as the front end of an isolated-word speech recognition system.

Brief description of the drawings

Fig. 1 is a flowchart of the overall framework of the method of the invention.

Fig. 2 is a flowchart of the adaptive adjustment of the detection thresholds and the precise detection of the isolated-word endpoints in the method of the invention.

Detailed description of the embodiments

The specific embodiments of the invention are further described below with reference to the accompanying drawings.

As shown in Fig. 1 and Fig. 2, the invention comprises the following steps:

1. Speech input

A speech signal containing the isolated word to be detected is input.

2. Speech preprocessing and selection of the detection threshold parameters

The speech signal is amplitude-shifted and normalized, then divided into frames, and the short-time average energy and short-time average zero-crossing rate of each frame are calculated. The high threshold EFVU and low threshold EFVL on the energy of a frame, the zero-crossing threshold ZCRT, and the upper limit LENU and lower limit LENL on the length of the isolated-word speech are preset from experiment and experience.
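The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not define "amplitude shift" or "normalization" precisely, so the sketch assumes mean (DC offset) removal and peak normalization, takes the short-time average energy as the mean squared sample value of a frame, and takes the zero-crossing rate as the fraction of adjacent sample pairs that change sign. The function name `frame_features` and the parameters `frame_len` and `hop` are hypothetical.

```python
from typing import List, Tuple


def frame_features(signal: List[float], frame_len: int = 256,
                   hop: int = 128) -> Tuple[List[float], List[float]]:
    """Return per-frame (short-time average energy, zero-crossing rate)."""
    if not signal:
        return [], []
    # Amplitude shift (assumed here to mean removing the DC offset).
    mean = sum(signal) / len(signal)
    shifted = [s - mean for s in signal]
    # Normalization (assumed here to mean scaling the peak magnitude to 1).
    peak = max(abs(s) for s in shifted)
    if peak > 0:
        shifted = [s / peak for s in shifted]
    energies: List[float] = []
    zcrs: List[float] = []
    # Framing: fixed-length frames taken every `hop` samples.
    for start in range(0, len(shifted) - frame_len + 1, hop):
        frame = shifted[start:start + frame_len]
        # Short-time average energy: mean of the squared samples.
        energies.append(sum(s * s for s in frame) / frame_len)
        # Zero-crossing rate: share of adjacent pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        zcrs.append(crossings / (frame_len - 1))
    return energies, zcrs
```

The two returned lists are indexed by frame number (starting at 0) and are the only inputs the later detection steps need, together with the preset thresholds EFVU, EFVL, and ZCRT.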

3. Coarse estimation of the isolated-word start point and end point

3-1. Coarse estimation of the isolated-word start point

The frames of the speech signal are examined from the first frame onward to find the coarse estimate x1 of the isolated-word start point. The estimate must satisfy three conditions: (1) the short-time average energy of several consecutive frames following x1 is greater than the high threshold EFVU; (2) the short-time average energy of several consecutive frames preceding x1 is less than the low threshold EFVL; (3) the short-time zero-crossing rate of frame x1-1 is greater than the zero-crossing threshold ZCRT.

Specifically, search for N1 consecutive frames whose short-time average energy is greater than the high threshold EFVU, where N1 is preferably 5 to 7, and denote the first of these frames as1.

Search from as1 toward earlier frames for the frame nearest to as1 whose short-time average energy is less than EFVL, and denote the frame after it a1.

Examine the N2 frames preceding a1, where N2 is preferably 10, and count the frames among them whose short-time average energy is greater than EFVL. If the count exceeds N3, where N3 is 2 to 4, the earliest of those frames whose energy is greater than EFVL becomes the new a1.

Continue searching from the new a1 until fewer than N3+1 of the next N2 frames have energy greater than EFVL, or until the first frame of the speech signal is reached; the result is the new a1.

Search from a1 toward earlier frames again for a frame whose short-time zero-crossing rate is greater than ZCRT; the frame after it is the coarse estimate x1 of the speech start point.
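The start-point search above can be sketched as follows. This is a literal reading of the steps, offered as an illustration rather than the definitive procedure. Frame indices are 0-based, `energy` and `zcr` are per-frame lists, and the function name `coarse_start` is hypothetical. The text does not say what happens when a search finds nothing, so the fallbacks (returning `None` when no speech run exists, using frame 0 when no low-energy frame precedes as1, keeping a1 when no high zero-crossing frame precedes it) are assumptions.

```python
from typing import List, Optional


def coarse_start(energy: List[float], zcr: List[float],
                 efvu: float, efvl: float, zcrt: float,
                 n1: int = 5, n2: int = 10, n3: int = 3) -> Optional[int]:
    """Coarse estimate x1 of the start frame, or None if no speech is found."""
    n = len(energy)
    # as1: first frame of the first run of n1 frames with energy > EFVU.
    as1 = None
    for i in range(n - n1 + 1):
        if all(e > efvu for e in energy[i:i + n1]):
            as1 = i
            break
    if as1 is None:
        return None  # assumption: report "no speech" to the caller
    # a1: frame after the nearest earlier frame with energy < EFVL.
    a1 = 0  # assumption: fall back to the first frame of the signal
    for j in range(as1 - 1, -1, -1):
        if energy[j] < efvl:
            a1 = j + 1
            break
    # Extend a1 earlier while more than n3 of the n2 preceding frames
    # still have energy above EFVL.
    while a1 > 0:
        lo = max(0, a1 - n2)
        above = [k for k in range(lo, a1) if energy[k] > efvl]
        if len(above) <= n3:
            break
        a1 = above[0]  # earliest qualifying frame, always smaller than a1
    # x1: frame after the nearest earlier frame with ZCR > ZCRT.
    for k in range(a1 - 1, -1, -1):
        if zcr[k] > zcrt:
            return k + 1
    return a1  # assumption: keep a1 when no such frame exists
```

Because each pass of the extension loop strictly decreases a1, the loop always terminates, either when the window holds too few energetic frames or when the first frame is reached.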

3-2. Coarse estimation of the isolated-word end point

Search from frame x1 toward later frames to find the coarse estimate x2 of the isolated-word end point. The estimate must satisfy three conditions: (1) the short-time average energy of several consecutive frames following x2 is less than the low threshold EFVL; (2) the short-time average energy of several frames preceding x2 is greater than the high threshold EFVU; (3) the short-time zero-crossing rate of frame x2+1 is greater than the zero-crossing threshold ZCRT.

Specifically, search from frame x1 toward later frames for N4 consecutive frames whose short-time average energy is less than EFVL, where N4 is 20 to 30, and record the number of the first of these frames as as2.

Examine the N2 frames preceding as2. If fewer than N3+1 of them have energy greater than EFVU, record the last of them whose energy is greater than EFVU as the new as2, and continue searching toward earlier frames from the new as2 until more than N3 frames in a window of N2 consecutive frames have energy greater than EFVU.

Search from as2 toward later frames for the first frame after as2 whose energy is less than EFVL, and record it as a2.

Continue searching from a2 toward later frames for a frame whose short-time zero-crossing rate is greater than ZCRT; the frame before it is the coarse estimate x2 of the speech end point.
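The end-point search can be sketched in the same way. Again this is a literal reading under stated assumptions, not the definitive procedure: frame indices are 0-based, the function name `coarse_end` is hypothetical, and the behavior when a search finds nothing is not specified in the text. The sketch assumes that as2 and a2 fall back to the last frame, that the backward window never reaches before x1, that the backward refinement stops when a window holds no frame above EFVU, and that x2 stays at a2 when no high zero-crossing frame follows.

```python
from typing import List


def coarse_end(energy: List[float], zcr: List[float], x1: int,
               efvu: float, efvl: float, zcrt: float,
               n4: int = 20, n2: int = 10, n3: int = 3) -> int:
    """Coarse estimate x2 of the end frame, searching from start frame x1."""
    n = len(energy)
    # as2: first frame of the first run of n4 frames with energy < EFVL.
    as2 = n - 1  # assumption: fall back to the last frame
    for i in range(x1, n - n4 + 1):
        if all(e < efvl for e in energy[i:i + n4]):
            as2 = i
            break
    # Move as2 earlier while the n2 preceding frames hold at most n3
    # frames with energy above EFVU.
    while as2 > x1:
        lo = max(x1, as2 - n2)
        above = [k for k in range(lo, as2) if energy[k] > efvu]
        if len(above) > n3 or not above:
            break  # enough speech frames, or (assumption) none at all
        as2 = above[-1]  # last qualifying frame, always smaller than as2
    # a2: first frame after as2 with energy < EFVL.
    a2 = n - 1  # assumption: fall back to the last frame
    for j in range(as2 + 1, n):
        if energy[j] < efvl:
            a2 = j
            break
    # x2: frame before the first later frame with ZCR > ZCRT.
    for k in range(a2 + 1, n):
        if zcr[k] > zcrt:
            return k - 1
    return a2  # assumption: keep a2 when no such frame exists
```

With x1 taken from the start-point search, the pair (x1, x2) is the coarse segmentation that the adaptive stage then refines.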

4. Adaptive adjustment of the detection thresholds and precise determination of the isolated-word start point and end point

The detection thresholds are adjusted adaptively using the isolated-word length constraint, and the precise endpoints are searched for near the coarse endpoint estimates.

4-1. Selection of the adaptive parameters

Set the adaptive parameters α, β, μ, and λ, where 1.20≤α≤2.00, 0.50≤β≤0.80, 1.20≤μ≤2.00, and 0.50≤λ≤0.80. Let GEFVU=LEFVU=EFVU and GZCRT=LZCRT=ZCRT.

4-2. Adaptive threshold adjustment when the detected speech is too long

4-2-1. If x2-x1>LENU, the short-time energy high threshold EFVU is too small. Multiply the high threshold by α, then move x1 later and x2 earlier so that the frame average energy at the start point and at the end point each exceed the new high threshold. Specifically, let GEFVU=α*GEFVU. Search from x1 toward later frames and record the first frame whose energy is greater than GEFVU as the new x1. Likewise, search from x2 toward earlier frames and record the first frame whose energy is greater than GEFVU as the new x2.

4-2-2. If x2-x1>LENU still holds, the short-time zero-crossing rate threshold is too large. Reduce it to β times its value, then move x1 later and x2 earlier so that the average zero-crossing rates of frame x1-1 and frame x2+1 each exceed the new threshold. Specifically, let LZCRT=β*LZCRT. Continue searching from x1 toward later frames and record the first frame whose zero-crossing rate is less than LZCRT as the new x1. Likewise, continue searching from x2 toward earlier frames and record the first frame whose zero-crossing rate is less than LZCRT as the new x2.

4-2-3. Repeat steps 4-2-1 and 4-2-2 until x2-x1≤LENU; if x2-x1>LENU still holds after N4 iterations, end the loop.
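The loop of steps 4-2-1 to 4-2-3 can be sketched as below. It is an illustration under stated assumptions, not the definitive procedure: frame indices are 0-based, each search includes the current endpoint and is confined to the current segment, and an endpoint is left unchanged when no frame qualifies. The function name `shrink_if_too_long`, the default values of `alpha` and `beta` (picked from inside the ranges given in 4-1), and `max_rounds` (standing in for the iteration limit of 4-2-3) are hypothetical.

```python
from typing import List, Tuple


def shrink_if_too_long(energy: List[float], zcr: List[float],
                       x1: int, x2: int, efvu: float, zcrt: float,
                       lenu: int, alpha: float = 1.5, beta: float = 0.65,
                       max_rounds: int = 5) -> Tuple[int, int, float, float]:
    """Tighten (x1, x2) while the segment is longer than LENU frames."""
    gefvu, lzcrt = efvu, zcrt  # GEFVU = EFVU, LZCRT = ZCRT initially
    for _ in range(max_rounds):
        if x2 - x1 <= lenu:
            break
        # 4-2-1: raise the energy high threshold, then move x1 later and
        # x2 earlier to the first frames whose energy exceeds it.
        gefvu *= alpha
        x1 = next((i for i in range(x1, x2 + 1) if energy[i] > gefvu), x1)
        x2 = next((i for i in range(x2, x1 - 1, -1)
                   if energy[i] > gefvu), x2)
        if x2 - x1 <= lenu:
            break
        # 4-2-2: lower the zero-crossing threshold, then move x1 later and
        # x2 earlier to the first frames whose ZCR is below it.
        lzcrt *= beta
        x1 = next((i for i in range(x1, x2 + 1) if zcr[i] < lzcrt), x1)
        x2 = next((i for i in range(x2, x1 - 1, -1)
                   if zcr[i] < lzcrt), x2)
    return x1, x2, gefvu, lzcrt
```

The function also returns the adjusted thresholds so that a caller can inspect how far they drifted from the presets, which is a design convenience of this sketch rather than something the text requires.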

4-3. Adaptive threshold adjustment when the detected speech is too short

4-3-1. If x2-x1<LENL, the short-time energy high threshold EFVU is too large. Reduce the high threshold to λ times its value, then move x1 earlier and x2 later so that the frame average energy at the start point and at the end point each exceed the new high threshold. Specifically, let LEFVU=λ*LEFVU. Search from x1 toward earlier frames for the first frame whose energy is less than LEFVU and record it as the new x1. Likewise, search from x2 toward later frames for the first frame whose energy is less than LEFVU and record it as the new x2.

4-3-2. If x2-x1<LENL still holds, the short-time average zero-crossing rate threshold is too small. Increase it to μ times its value, then move x1 earlier and x2 later so that the average zero-crossing rates of frame x1-1 and frame x2+1 each exceed the new threshold. Specifically, let GZCRT=μ*GZCRT. Continue searching from x1 toward earlier frames and record the first frame whose zero-crossing rate is less than GZCRT as the new x1. Likewise, continue searching from x2 toward later frames and record the first frame whose zero-crossing rate is less than GZCRT as the new x2.

4-3-3. Repeat steps 4-3-1 and 4-3-2 until x2-x1≥LENL; if x2-x1<LENL still holds after N5 iterations, end the loop.
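The loop of steps 4-3-1 to 4-3-3 can be sketched in the same style. As before, this is a literal reading under stated assumptions: frame indices are 0-based, each search includes the current endpoint, and an endpoint is left unchanged when no frame qualifies. The function name `grow_if_too_short`, the default values of `lam` and `mu` (picked from inside the ranges given in 4-1), and `max_rounds` (standing in for N5) are hypothetical.

```python
from typing import List, Tuple


def grow_if_too_short(energy: List[float], zcr: List[float],
                      x1: int, x2: int, efvu: float, zcrt: float,
                      lenl: int, lam: float = 0.65, mu: float = 1.5,
                      max_rounds: int = 5) -> Tuple[int, int, float, float]:
    """Widen (x1, x2) while the segment is shorter than LENL frames."""
    n = len(energy)
    lefvu, gzcrt = efvu, zcrt  # LEFVU = EFVU, GZCRT = ZCRT initially
    for _ in range(max_rounds):
        if x2 - x1 >= lenl:
            break
        # 4-3-1: lower the energy high threshold, then move x1 earlier and
        # x2 later to the first frames whose energy falls below it.
        lefvu *= lam
        x1 = next((i for i in range(x1, -1, -1) if energy[i] < lefvu), x1)
        x2 = next((i for i in range(x2, n) if energy[i] < lefvu), x2)
        if x2 - x1 >= lenl:
            break
        # 4-3-2: raise the zero-crossing threshold, then move x1 earlier
        # and x2 later to the first frames whose ZCR is below it.
        gzcrt *= mu
        x1 = next((i for i in range(x1, -1, -1) if zcr[i] < gzcrt), x1)
        x2 = next((i for i in range(x2, n) if zcr[i] < gzcrt), x2)
    return x1, x2, lefvu, gzcrt
```

Under this reading the zero-crossing step leaves an endpoint in place whenever its own frame is already below the raised threshold; a variant that starts each search one frame beyond the endpoint would be an equally plausible interpretation of "continue searching".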

5. Output the start point x1 and the end point x2 of the isolated-word speech, and perform isolated-word recognition.

Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the invention and not to limit it. Other modifications or equivalent substitutions made to the technical solution by those of ordinary skill in the art shall fall within the scope of the claims of the invention, provided they do not depart from the spirit and scope of the technical solution.

Claims (4)

1. An adaptive endpoint detection method for isolated-word speech recognition, characterized in that the method comprises the following steps:
a. Speech input
A speech signal containing the isolated word to be recognized is input;
b. Speech preprocessing
The speech signal is amplitude-shifted, normalized, and divided into frames, and the short-time average energy and short-time average zero-crossing rate of each frame are calculated;
c. Coarse detection of the isolated-word endpoints
The isolated-word endpoints are roughly estimated from the short-time average energy and short-time average zero-crossing rate of each frame, subject to a minimum-length constraint on the runs of consecutive speech frames before and after each endpoint;
d. Adaptive adjustment of the detection thresholds and precise endpoint detection
The detection thresholds are adjusted dynamically using the limits on the minimum and maximum duration of an isolated word, and the endpoints are fine-tuned earlier or later to obtain precise isolated-word endpoints;
When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time energy high threshold is increased, the start point is moved later and the end point is moved earlier, so that the frame average energy at the start point and at the end point each exceed the new high threshold;
When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time zero-crossing rate threshold is reduced, the start point is moved later and the end point is moved earlier, so that the average zero-crossing rates of the frame preceding the start point and of the frame following the end point each exceed the new zero-crossing rate threshold;
When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time energy high threshold is reduced, the start point is moved earlier and the end point is moved later, so that the frame average energy at the start point and at the end point each exceed the new high threshold;
When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time zero-crossing rate threshold is increased, the start point is moved earlier and the end point is moved later, so that the average zero-crossing rates of the frame preceding the start point and of the frame following the end point each exceed the new zero-crossing rate threshold;
e. Output of the endpoints and isolated-word recognition
The precise isolated-word endpoints are output, and speech recognition technology is used to recognize the isolated word.
2. The adaptive endpoint detection method for isolated-word speech recognition according to claim 1, characterized in that: in step c, a constraint on the length of the runs of consecutive speech frames before and after each endpoint is introduced when the endpoints are roughly estimated.
3. The adaptive endpoint detection method for isolated-word speech recognition according to claim 1, characterized in that: in step e, when the endpoints are precisely detected, the detection thresholds are adjusted adaptively according to the length constraint on the isolated word.
4. An adaptive endpoint detection system for isolated-word speech recognition that implements the method of claim 1, characterized in that the system comprises:
An input device for the isolated-word speech signal to be recognized;
A device that amplitude-shifts, normalizes, and frames the speech signal and calculates the short-time average energy and short-time average zero-crossing rate of each frame;
A device that roughly estimates the isolated-word endpoints from the short-time average energy and short-time average zero-crossing rate of each frame, subject to a minimum-length constraint on the runs of consecutive speech frames before and after each endpoint;
A device that dynamically adjusts the detection thresholds using the limits on the minimum and maximum duration of an isolated word and fine-tunes the endpoints earlier or later to obtain precise isolated-word endpoints;
A device that outputs the precise isolated-word endpoints and recognizes the isolated word using speech recognition technology.

CN201210085584.8A 2012-03-28 2012-03-28 Adaptive endpoint detection method and system for isolated-word speech recognition CN103366739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210085584.8A CN103366739B (en) 2012-03-28 2012-03-28 Adaptive endpoint detection method and system for isolated-word speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210085584.8A CN103366739B (en) 2012-03-28 2012-03-28 Adaptive endpoint detection method and system for isolated-word speech recognition

Publications (2)

Publication Number Publication Date
CN103366739A CN103366739A (en) 2013-10-23
CN103366739B true CN103366739B (en) 2015-12-09

Family

ID=49367940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210085584.8A CN103366739B (en) 2012-03-28 2012-03-28 Adaptive endpoint detection method and system for isolated-word speech recognition

Country Status (1)

Country Link
CN (1) CN103366739B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104700830B (en) * 2013-12-06 2018-07-24 中国移动通信集团公司 A kind of sound end detecting method and device
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN106601234A (en) * 2016-11-16 2017-04-26 华南理工大学 Implementation method of placename speech modeling system for goods sorting
CN106847270A (en) * 2016-12-09 2017-06-13 华南理工大学 A kind of double threshold place name sound end detecting method
CN106448659B (en) * 2016-12-19 2019-09-27 广东工业大学 A kind of sound end detecting method based on short-time energy and fractal dimension
CN106601233A (en) * 2016-12-22 2017-04-26 北京元心科技有限公司 Voice command recognition method and device and electronic equipment
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A kind of the Method of Speech Endpoint Detection of feature based value coding
CN108665889A (en) * 2018-04-20 2018-10-16 百度在线网络技术(北京)有限公司 The Method of Speech Endpoint Detection, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970447A (en) * 1998-01-20 1999-10-19 Advanced Micro Devices, Inc. Detection of tonal signals
CN101206858A (en) * 2007-12-12 2008-06-25 北京中星微电子有限公司 Method and system for testing alone word voice endpoint
CN101226741A (en) * 2007-12-28 2008-07-23 无敌科技(西安)有限公司 Method for detecting movable voice endpoint
CN101625860A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for self-adaptively adjusting background noise in voice endpoint detection
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method

Also Published As

Publication number Publication date
CN103366739A (en) 2013-10-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
COR Change of bibliographic data
CB03 Change of inventor or designer information

Inventor after: Huo Xiaosi

Inventor after: Yin Mingli

Inventor after: Liu Junjiang

Inventor after: Zhang Juzhou

Inventor after: Huang Xinchao

Inventor before: Huo Xiaosi

Inventor before: Yin Mingli

Inventor before: Liu Junjiang

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151209

Termination date: 20180328