CN103366739A

CN103366739A - Adaptive endpoint detection method and adaptive endpoint detection system for isolated-word speech recognition

Info

Publication number: CN103366739A
Application number: CN2012100855848A
Authority: CN
Inventors: 霍小四; 尹明理; 刘军江
Original assignee: Zhengzhou Science Technology Information Research Institute
Current assignee: Zhengzhou Science Technology Information Research Institute
Priority date: 2012-03-28
Filing date: 2012-03-28
Publication date: 2013-10-23
Anticipated expiration: 2032-03-28
Also published as: CN103366739B

Abstract

The invention discloses an adaptive endpoint detection method and an adaptive endpoint detection system for isolated-word speech recognition. The method comprises the following steps: a, a speech input step, in which a speech signal containing an isolated word to be recognized is input; b, a speech preprocessing step, in which the speech signal undergoes amplitude shifting, normalization and framing, and the short-time average energy and short-time average zero-crossing rate of each frame are calculated; c, a rough isolated-word endpoint detection step, in which the isolated-word endpoints are roughly estimated using the short-time average energy and short-time average zero-crossing rate of each frame, together with a constraint on the minimum number of consecutive speech frames before and after the endpoints; d, a detection-threshold adaptive adjustment and accurate endpoint detection step, in which the detection thresholds are adjusted dynamically using constraints on the minimum and maximum duration of the isolated word, and the speech endpoints are fine-tuned in both directions to obtain accurate isolated-word endpoints; e, an endpoint output and isolated-word recognition step, in which the accurate isolated-word endpoints are output and isolated-word recognition is performed using speech recognition technology.

Description

Adaptive endpoint detection method and system for isolated-word speech recognition

Technical field

The present invention relates to a voice activity detection algorithm for isolated-word speech recognition, and more particularly to an endpoint detection algorithm for speaker-independent isolated-word speech recognition that automatically adjusts its detection thresholds according to the background noise.

Background art

Isolated-word speech recognition is a technology by which a machine converts a speech signal containing an isolated word into the corresponding text or command. It has a wide range of applications and markets, such as various command-and-control systems and voice-operated toys. In an isolated-word speech recognition system, the input signal contains the isolated-word speech together with background noise. Locating the start point and end point of the speech within the input signal is called endpoint detection, and its accuracy directly affects the recognition rate.

A commonly used endpoint detection algorithm is the dual-threshold algorithm based on short-time average amplitude and short-time zero-crossing rate. It uses the short-time average amplitude to distinguish voiced segments from silent segments, and the short-time average zero-crossing rate to distinguish unvoiced segments from silent segments. The algorithm works well on speech with a high signal-to-noise ratio (SNR), but it is strongly affected by noise and performs relatively poorly on noisy speech.
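The per-frame decision behind the conventional dual-threshold idea can be sketched as follows. This is a simplified illustration only: the function name `classify_frames` and the threshold parameter names are hypothetical, and a practical dual-threshold detector also tracks state across frames rather than labelling each frame in isolation.

```python
from typing import List


def classify_frames(amplitude: List[float], zcr: List[float],
                    amp_threshold: float, zcr_threshold: float) -> List[str]:
    """Label each frame as voiced, unvoiced, or silence (illustrative only)."""
    labels = []
    for a, z in zip(amplitude, zcr):
        if a > amp_threshold:
            # High short-time average amplitude: treat as voiced speech.
            labels.append("voiced")
        elif z > zcr_threshold:
            # Low amplitude but many zero crossings: treat as unvoiced speech.
            labels.append("unvoiced")
        else:
            labels.append("silence")
    return labels
```

With fixed thresholds like these, added noise raises both amplitude and zero-crossing rate in silent frames, which is one way to see why the fixed-threshold approach degrades on noisy input.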

Summary of the invention

In view of this, the object of the invention is to overcome the strong noise sensitivity of the existing dual-threshold detection algorithm. By adaptively changing the detection thresholds according to the level of the background noise, and by applying constraints on the length of the isolated-word speech, the invention provides a highly robust endpoint detection algorithm for isolated-word speech recognition.

To achieve the above object, the invention adopts the following technical solution, which comprises the following steps:

a. Speech input: input a speech signal containing the isolated word to be recognized;

b. Speech preprocessing: perform amplitude shifting, normalization and framing on the speech signal, and calculate the short-time average energy and short-time average zero-crossing rate of each frame;

c. Rough detection of the isolated-word endpoints: roughly estimate the isolated-word endpoints using the short-time average energy and short-time average zero-crossing rate of each frame, together with a constraint on the minimum number of consecutive speech frames before and after the endpoints;

d. Adaptive adjustment of the detection thresholds and accurate endpoint detection: dynamically adjust the detection thresholds using constraints on the minimum and maximum duration of the isolated word, and fine-tune the speech endpoints in both directions to obtain accurate isolated-word endpoints;

e. Endpoint output and isolated-word speech recognition: output the accurate isolated-word endpoints and perform isolated-word recognition using speech recognition technology.

In step c, when the endpoints are roughly estimated, a constraint on the number of consecutive speech frames before and after the endpoints is introduced.

In step e, when the endpoints are detected accurately, the detection thresholds are adjusted adaptively according to the length constraints on the isolated word. When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time energy high threshold is increased, the start point is moved later and the end point is moved earlier, so that the frame average energy at the start point and at the end point is each greater than the new high threshold. When the detected isolated-word speech is longer than the maximum isolated-word length, the short-time zero-crossing rate threshold is reduced, the start point is moved later and the end point is moved earlier, so that the average zero-crossing rates of the frame before the start point and of the frame after the end point are greater than the new short-time zero-crossing rate threshold. When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time energy high threshold is reduced, the start point is moved earlier and the end point is moved later, so that the frame average energy at the start point and at the end point is each greater than the new high threshold. When the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time zero-crossing rate threshold is increased, the start point is moved earlier and the end point is moved later, so that the average zero-crossing rates of the frame before the start point and of the frame after the end point are greater than the new short-time zero-crossing rate threshold.

An adaptive endpoint detection system for isolated-word speech recognition comprises: an input device for the isolated-word speech signal to be recognized; a device that performs amplitude shifting, normalization and framing on the speech signal and calculates the short-time average energy and short-time average zero-crossing rate of each frame; a device that roughly estimates the isolated-word endpoints using the short-time average energy and short-time average zero-crossing rate of each frame, together with a constraint on the minimum number of consecutive speech frames before and after the endpoints; a device that dynamically adjusts the detection thresholds using constraints on the minimum and maximum duration of the isolated word and fine-tunes the speech endpoints in both directions to obtain accurate isolated-word endpoints; and a device that outputs the accurate isolated-word endpoints and performs isolated-word recognition using speech recognition technology.

The beneficial effects of the invention are as follows. Because the conventional isolated-word endpoint detection algorithm based on dual-threshold detection is strongly affected by noise, the invention provides a new endpoint detection algorithm that has a degree of noise immunity and adjusts its detection thresholds adaptively. Compared with the prior art, the invention introduces a duration constraint on consecutive speech frames when detecting endpoints, which increases the robustness of detection. At the same time, by introducing duration constraints tied to the isolated word to be detected, it adjusts the detection thresholds automatically. The algorithm is simple to implement, effective and fast, and has a degree of noise immunity. It is particularly suitable for small devices and embedded devices, and can serve as the front end of an isolated-word speech recognition system.

Description of drawings

Fig. 1 is a flowchart of the overall framework of the method of the invention.

Fig. 2 is a flowchart of the adaptive adjustment of the detection thresholds and the accurate detection of the isolated-word endpoints in the method of the invention.

Embodiment

The specific implementation of the invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1 and Fig. 2, the invention comprises the following steps:

1. Speech input

Input a speech signal containing the isolated word to be detected.

2. Speech preprocessing and selection of the detection threshold parameters

Perform amplitude shifting on the speech signal, normalize it, then divide it into frames and calculate the short-time average energy and short-time average zero-crossing rate of each frame. Based on experiments and experience, preset the high threshold EFVU and the low threshold EFVL for the per-frame short-time energy, the zero-crossing threshold ZCRT, and the upper limit LENU and lower limit LENL of the isolated-word speech length.
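The preprocessing step can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `preprocess`, the frame length and hop size, the reading of "amplitude shifting" as mean removal, and the choice of peak normalization are all assumptions, since the text does not specify them.

```python
from typing import List, Tuple


def preprocess(samples: List[float], frame_len: int = 256, hop: int = 128
               ) -> Tuple[List[float], List[float]]:
    """Return per-frame (energy, zcr). Assumes frame_len >= 2."""
    if not samples:
        return [], []
    # Amplitude shifting (assumed to mean DC removal): subtract the mean.
    mean = sum(samples) / len(samples)
    shifted = [s - mean for s in samples]
    # Normalization (assumed peak normalization): largest magnitude becomes 1.
    peak = max(abs(s) for s in shifted) or 1.0
    norm = [s / peak for s in shifted]

    energy, zcr = [], []
    for start in range(0, len(norm) - frame_len + 1, hop):
        frame = norm[start:start + frame_len]
        # Short-time average energy: mean of squared samples in the frame.
        energy.append(sum(v * v for v in frame) / frame_len)
        # Short-time average zero-crossing rate: fraction of adjacent
        # sample pairs whose signs differ.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        zcr.append(crossings / (frame_len - 1))
    return energy, zcr
```

The two returned lists are indexed by frame number, which is the unit in which the thresholds EFVU, EFVL and ZCRT and the length limits LENU and LENL are applied in the later steps.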

3. Rough estimation of the isolated-word start point and end point

3-1. Rough estimation of the isolated-word start point

Examine the frames of the speech signal in order from first to last to find the rough estimate x1 of the isolated-word start point. This estimate must satisfy three conditions: first, the short-time average energy of several consecutive frames following x1 is greater than the high threshold EFVU; second, the short-time average energy of several consecutive frames preceding x1 is less than the low threshold EFVL; third, the short-time zero-crossing rate of frame x1-1 is greater than the zero-crossing threshold ZCRT.

Specifically, search for N1 consecutive frames whose short-time average energy is greater than the high threshold EFVU, where N1 is preferably 5 to 7, and denote the first of these frames as1.

Search toward earlier frames from as1 to find the frame nearest to as1 whose short-time average energy is less than EFVL, and denote the frame after it a1.

Examine the N2 frames preceding a1, where N2 is preferably 10, and count the frames whose short-time average energy is greater than EFVL. If the count exceeds N3, where N3 is 2 to 4, denote the earliest of those frames whose energy is greater than EFVL as the new a1.

Continue searching from the new a1 in the same way, until fewer than N3+1 of the new N2 frames have energy greater than EFVL or the first frame of the speech is reached, and denote the result as the new a1.

Then search toward earlier frames from a1 to find a frame whose short-time zero-crossing rate is greater than ZCRT. The frame after that frame is the rough estimate x1 of the speech start point.
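One literal reading of the start-point search above can be sketched as follows. The function name `rough_start`, the defaults chosen inside the stated ranges, and the fallbacks used when a search finds nothing (returning `None` when no speech run exists, keeping a1 when no high zero-crossing frame precedes it) are assumptions; the text does not define those cases.

```python
from typing import List, Optional


def rough_start(energy: List[float], zcr: List[float],
                efvu: float, efvl: float, zcrt: float,
                n1: int = 5, n2: int = 10, n3: int = 3) -> Optional[int]:
    """Rough start-point estimate x1 (frame index), or None if no speech."""
    # as1: first frame of the first run of n1 consecutive frames above EFVU.
    as1, run = None, 0
    for i, e in enumerate(energy):
        run = run + 1 if e > efvu else 0
        if run == n1:
            as1 = i - n1 + 1
            break
    if as1 is None:
        return None  # assumption: no run found means no speech detected

    # a1: the frame after the nearest earlier frame whose energy is below EFVL.
    a1 = 0
    for i in range(as1 - 1, -1, -1):
        if energy[i] < efvl:
            a1 = i + 1
            break

    # Extend a1 earlier while more than n3 of the n2 preceding frames
    # still exceed EFVL; stop at the first frame of the signal.
    while a1 > 0:
        lo = max(0, a1 - n2)
        above = [i for i in range(lo, a1) if energy[i] > efvl]
        if len(above) <= n3:
            break
        a1 = above[0]

    # x1: the frame after the nearest earlier frame with ZCR above ZCRT.
    x1 = a1  # assumption: keep a1 when no such frame exists
    for i in range(a1 - 1, -1, -1):
        if zcr[i] > zcrt:
            x1 = i + 1
            break
    return x1
```

The extension loop always moves a1 to a strictly smaller index, so it terminates after at most as many passes as there are frames.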

3-2. Rough estimation of the isolated-word end point

Search toward later frames starting from frame x1 to find the rough estimate x2 of the isolated-word end point. This estimate must satisfy three conditions: first, the short-time average energy of several consecutive frames following x2 is less than the low threshold EFVL; second, the short-time average energy of several frames preceding x2 is greater than the high threshold EFVU; third, the short-time zero-crossing rate of frame x2+1 is greater than the zero-crossing threshold ZCRT.

Specifically, search toward later frames starting from frame x1 to find N4 consecutive frames whose short-time average energy is less than EFVL, where N4 is 20 to 30, and record the index of the first of these frames as as2.

Starting from as2, examine the N2 frames preceding it. If fewer than N3+1 of these frames have energy greater than EFVU, record the last frame among them whose energy is greater than EFVU as the new as2, and continue searching toward earlier frames from the new as2, until more than N3 of N2 consecutive frames have energy greater than EFVU.

Search toward later frames from as2 to find the first frame after as2 whose energy is less than EFVL, and record this frame as a2.

Continue searching toward later frames from a2 to find a frame whose short-time zero-crossing rate is greater than ZCRT. The frame before that frame is the rough estimate x2 of the speech end point.
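The end-point search mirrors the start-point search and can be sketched the same way. As before, this is one literal reading rather than a definitive implementation: the function name `rough_end` and the fallbacks (using the last frame when no long low-energy run exists, stepping to the window start when a window contains no frame above EFVU, keeping a2 when no later high zero-crossing frame exists) are assumptions not stated in the text.

```python
from typing import List


def rough_end(energy: List[float], zcr: List[float], x1: int,
              efvu: float, efvl: float, zcrt: float,
              n2: int = 10, n3: int = 3, n4: int = 20) -> int:
    """Rough end-point estimate x2 (frame index), searching from x1."""
    n = len(energy)
    # as2: first frame of the first run of n4 consecutive frames below EFVL.
    as2, run = None, 0
    for i in range(x1, n):
        run = run + 1 if energy[i] < efvl else 0
        if run == n4:
            as2 = i - n4 + 1
            break
    if as2 is None:
        as2 = n - 1  # assumption: no long silence found, use the last frame

    # Move as2 earlier until more than n3 of the n2 preceding frames
    # exceed EFVU, i.e. until solid speech lies just before as2.
    while as2 > x1:
        lo = max(x1, as2 - n2)
        above = [i for i in range(lo, as2) if energy[i] > efvu]
        if len(above) > n3:
            break
        as2 = above[-1] if above else lo  # assumption for an empty window

    # a2: first frame after as2 whose energy is below EFVL.
    a2 = n - 1
    for i in range(as2 + 1, n):
        if energy[i] < efvl:
            a2 = i
            break

    # x2: the frame before the next later frame with ZCR above ZCRT.
    x2 = a2  # assumption: keep a2 when no such frame exists
    for i in range(a2 + 1, n):
        if zcr[i] > zcrt:
            x2 = i - 1
            break
    return x2
```

Because the rough estimate depends on fixed thresholds, it can overshoot or undershoot the true word boundary on noisy input, which is what the length-constrained adjustment in the next step is meant to correct.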

4. Adaptive adjustment of the detection thresholds and accurate determination of the isolated-word start point and end point

Adjust the detection thresholds adaptively using the isolated-word length constraints, and search near the rough endpoint estimates of the isolated word for their accurate estimates.

4-1. Selection of the adaptive parameters

Set the adaptive parameters α, β, μ and λ, where 1.20 ≤ α ≤ 2.00, 0.50 ≤ β ≤ 0.80, 1.20 ≤ μ ≤ 2.00 and 0.50 ≤ λ ≤ 0.80. Let GEFVU = LEFVU = EFVU and GZCRT = LZCRT = ZCRT.
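The parameter ranges and the initialization of the working thresholds can be captured in a small configuration sketch. The class name `AdaptiveParams`, the helper `init_working_thresholds`, and the default values (picked from inside the stated ranges) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AdaptiveParams:
    alpha: float = 1.5  # growth factor for the energy high threshold
    beta: float = 0.6   # shrink factor for the zero-crossing threshold
    mu: float = 1.5     # growth factor for the zero-crossing threshold
    lam: float = 0.6    # shrink factor for the energy high threshold (λ)

    def __post_init__(self) -> None:
        # Ranges as stated in step 4-1.
        ranges = {"alpha": (1.20, 2.00), "beta": (0.50, 0.80),
                  "mu": (1.20, 2.00), "lam": (0.50, 0.80)}
        for name, (lo, hi) in ranges.items():
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")


def init_working_thresholds(efvu: float, zcrt: float) -> Dict[str, float]:
    """GEFVU = LEFVU = EFVU and GZCRT = LZCRT = ZCRT, as in step 4-1."""
    return {"GEFVU": efvu, "LEFVU": efvu, "GZCRT": zcrt, "LZCRT": zcrt}
```

Keeping separate working copies means the preset values EFVU and ZCRT stay unchanged while the two adjustment branches scale their own thresholds.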

4-2. Adaptive threshold adjustment when the detected speech length is too long

4-2-1. If x2 - x1 > LENU, the short-time energy high threshold EFVU is too small. Multiply the high threshold by α, and move x1 later and x2 earlier so that the frame average energy at the start point and at the end point is each greater than the new high threshold. Specifically, let GEFVU = α × GEFVU. Search toward later frames from x1, and denote the first frame whose energy is greater than GEFVU as the new x1. Likewise, search toward earlier frames from x2, and denote the first frame whose energy is greater than GEFVU as the new x2.

4-2-2. If x2 - x1 > LENU still holds, the short-time zero-crossing rate threshold is too large. Reduce the short-time zero-crossing rate threshold to β times its original value, and move x1 later and x2 earlier so that the average zero-crossing rates of frames x1-1 and x2+1 are each greater than the new short-time zero-crossing rate threshold. Specifically, let LZCRT = β × LZCRT. Continue searching toward later frames from x1, and denote the first frame whose zero-crossing rate is less than LZCRT as the new x1. Likewise, continue searching toward earlier frames from x2, and denote the first frame whose zero-crossing rate is less than LZCRT as the new x2.

4-2-3. Repeat steps 4-2-1 and 4-2-2 until x2 - x1 ≤ LENU, or until x2 - x1 > LENU still holds after N4 iterations, then end the loop.
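The loop for an over-long detection can be sketched as below. The function name `shrink_endpoints` and the parameter `max_iter` (standing in for the iteration cap the text calls N4) are hypothetical, and two details are assumptions: each search starts at the current endpoint itself, and an endpoint is left unchanged when no frame satisfies the condition.

```python
from typing import List, Tuple


def shrink_endpoints(energy: List[float], zcr: List[float],
                     x1: int, x2: int, efvu: float, zcrt: float, lenu: int,
                     alpha: float = 1.5, beta: float = 0.6,
                     max_iter: int = 20) -> Tuple[int, int, float, float]:
    """Tighten (x1, x2) while x2 - x1 > LENU; returns x1, x2, GEFVU, LZCRT."""
    gefvu, lzcrt = efvu, zcrt
    for _ in range(max_iter):
        if x2 - x1 <= lenu:
            break
        # Energy step: raise the high threshold, move x1 later and
        # x2 earlier to the first frames whose energy exceeds it.
        gefvu *= alpha
        x1 = next((i for i in range(x1, x2 + 1) if energy[i] > gefvu), x1)
        x2 = next((i for i in range(x2, x1 - 1, -1) if energy[i] > gefvu), x2)
        if x2 - x1 <= lenu:
            break
        # Zero-crossing step: lower the threshold, move x1 later and
        # x2 earlier to the first frames whose ZCR falls below it.
        lzcrt *= beta
        x1 = next((i for i in range(x1, x2 + 1) if zcr[i] < lzcrt), x1)
        x2 = next((i for i in range(x2, x1 - 1, -1) if zcr[i] < lzcrt), x2)
    return x1, x2, gefvu, lzcrt
```

Both searches are bounded by the current pair of endpoints, so x1 never passes x2, and the iteration cap guarantees termination even when the length limit cannot be met.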

4-3. Adaptive threshold adjustment when the detected speech length is too short

4-3-1. If x2 - x1 < LENL, the short-time energy high threshold EFVU is too large. Reduce the high threshold to λ times its original value, and move x1 earlier and x2 later so that the frame average energy at the start point and at the end point is each greater than the new high threshold. Specifically, let LEFVU = λ × LEFVU. Search toward earlier frames from x1, and denote the first frame whose energy is less than LEFVU as the new x1. Likewise, search toward later frames from x2, and denote the first frame whose energy is less than LEFVU as the new x2.

4-3-2. If x2 - x1 < LENL still holds, the short-time average zero-crossing rate threshold is too small. Increase the short-time zero-crossing rate threshold to μ times its original value, and move x1 earlier and x2 later so that the average zero-crossing rates of frames x1-1 and x2+1 are each greater than the new short-time zero-crossing rate threshold. Specifically, let GZCRT = μ × GZCRT. Continue searching toward earlier frames from x1, and denote the first frame whose zero-crossing rate is less than GZCRT as the new x1. Likewise, continue searching toward later frames from x2, and denote the first frame whose zero-crossing rate is less than GZCRT as the new x2.

4-3-3. Repeat steps 4-3-1 and 4-3-2 until x2 - x1 ≥ LENL, or until x2 - x1 < LENL still holds after N5 iterations, then end the loop.
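The loop for an over-short detection is the mirror image and can be sketched in the same style. The function name `grow_endpoints` and the parameter `max_iter` (standing in for the iteration cap the text calls N5) are hypothetical. As an assumption, each search starts at the current endpoint itself and leaves it unchanged when no frame satisfies the condition.

```python
from typing import List, Tuple


def grow_endpoints(energy: List[float], zcr: List[float],
                   x1: int, x2: int, efvu: float, zcrt: float, lenl: int,
                   lam: float = 0.6, mu: float = 1.5,
                   max_iter: int = 20) -> Tuple[int, int, float, float]:
    """Widen (x1, x2) while x2 - x1 < LENL; returns x1, x2, LEFVU, GZCRT."""
    n = len(energy)
    lefvu, gzcrt = efvu, zcrt
    for _ in range(max_iter):
        if x2 - x1 >= lenl:
            break
        # Energy step: lower the high threshold, move x1 earlier and
        # x2 later to the first frames whose energy falls below it.
        lefvu *= lam
        x1 = next((i for i in range(x1, -1, -1) if energy[i] < lefvu), x1)
        x2 = next((i for i in range(x2, n) if energy[i] < lefvu), x2)
        if x2 - x1 >= lenl:
            break
        # Zero-crossing step: raise the threshold, move x1 earlier and
        # x2 later to the first frames whose ZCR falls below it.
        gzcrt *= mu
        x1 = next((i for i in range(x1, -1, -1) if zcr[i] < gzcrt), x1)
        x2 = next((i for i in range(x2, n) if zcr[i] < gzcrt), x2)
    return x1, x2, lefvu, gzcrt
```

In this sketch x1 can only decrease and x2 can only increase, so the detected segment never becomes shorter during this branch.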

5. Output x1 as the start point and x2 as the end point of the isolated-word speech, and perform isolated-word recognition.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the invention. Other modifications or equivalent substitutions made by those of ordinary skill in the art to the technical solution of the invention, provided they do not depart from its spirit and scope, shall all fall within the scope of the claims of the invention.

Claims

1. An adaptive endpoint detection method for isolated-word speech recognition, characterized in that the method comprises the following steps:

a. Speech input

Input a speech signal containing the isolated word to be recognized;

b. Speech preprocessing

Perform amplitude shifting, normalization and framing on the speech signal, and calculate the short-time average energy and short-time average zero-crossing rate of each frame;

c. Rough detection of the isolated-word endpoints

Roughly estimate the isolated-word endpoints using the short-time average energy and short-time average zero-crossing rate of each frame, together with a constraint on the minimum number of consecutive speech frames before and after the endpoints;

d. Adaptive adjustment of the detection thresholds and accurate endpoint detection

Dynamically adjust the detection thresholds using constraints on the minimum and maximum duration of the isolated word, and fine-tune the speech endpoints in both directions to obtain accurate isolated-word endpoints;

e. Endpoint output and isolated-word speech recognition

Output the accurate isolated-word endpoints and perform isolated-word recognition using speech recognition technology.

2. The adaptive endpoint detection method for isolated-word speech recognition according to claim 1, characterized in that: in step c, when the endpoints are roughly estimated, a constraint on the number of consecutive speech frames before and after the endpoints is introduced.

3. The adaptive endpoint detection method for isolated-word speech recognition according to claim 1, characterized in that: in step e, when the endpoints are detected accurately, the detection thresholds are adjusted adaptively according to the length constraints on the isolated word.

4. The adaptive endpoint detection method for isolated-word speech recognition according to claim 3, characterized in that: when the detected isolated-word speech is longer than the maximum isolated-word length, the short-time energy high threshold is increased, the start point is moved later and the end point is moved earlier, so that the frame average energy at the start point and at the end point is each greater than the new high threshold.

5. The adaptive endpoint detection method for isolated-word speech recognition according to claim 3, characterized in that: when the detected isolated-word speech is longer than the maximum isolated-word length, the short-time zero-crossing rate threshold is reduced, the start point is moved later and the end point is moved earlier, so that the average zero-crossing rates of the frame before the start point and of the frame after the end point are greater than the new short-time zero-crossing rate threshold.

6. The adaptive endpoint detection method for isolated-word speech recognition according to claim 3, characterized in that: when the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time energy high threshold is reduced, the start point is moved earlier and the end point is moved later, so that the frame average energy at the start point and at the end point is each greater than the new high threshold.

7. The adaptive endpoint detection method for isolated-word speech recognition according to claim 3, characterized in that: when the detected isolated-word speech is shorter than the minimum isolated-word length, the short-time zero-crossing rate threshold is increased, the start point is moved earlier and the end point is moved later, so that the average zero-crossing rates of the frame before the start point and of the frame after the end point are greater than the new short-time zero-crossing rate threshold.

8. An adaptive endpoint detection system for isolated-word speech recognition implementing the method of claim 1, characterized in that the system comprises:

an input device for the isolated-word speech signal to be recognized;

a device that performs amplitude shifting, normalization and framing on the speech signal and calculates the short-time average energy and short-time average zero-crossing rate of each frame;

a device that roughly estimates the isolated-word endpoints using the short-time average energy and short-time average zero-crossing rate of each frame, together with a constraint on the minimum number of consecutive speech frames before and after the endpoints;

a device that dynamically adjusts the detection thresholds using constraints on the minimum and maximum duration of the isolated word and fine-tunes the speech endpoints in both directions to obtain accurate isolated-word endpoints;

a device that outputs the accurate isolated-word endpoints and performs isolated-word recognition using speech recognition technology.