CN101206858B

CN101206858B - Method and system for testing alone word voice endpoint

Info

Publication number: CN101206858B
Application number: CN2007101793424A
Authority: CN
Inventors: 邓昊
Original assignee: Vimicro Corp
Current assignee: Vimicro Corp
Priority date: 2007-12-12
Filing date: 2007-12-12
Publication date: 2011-07-13
Anticipated expiration: 2027-12-12
Also published as: CN101206858A

Abstract

The invention discloses a speech endpoint detection method for isolated words. The method comprises the following steps: the start point of an isolated word is determined in the received speech signal frames; then, for the speech signal frames received after the start point has been determined, feature parameter calculation and detection of the isolated word's end point are performed synchronously. The invention also discloses a speech endpoint detection device for isolated words. Because feature parameter calculation and end-point detection are performed synchronously, a large amount of speech data need not be buffered, which facilitates real-time implementation.

Description

Method and system for isolated-word speech endpoint detection

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a method and system for isolated-word speech endpoint detection.

Background technology

Fig. 1 shows the general block diagram of an isolated-word speech recognition system. As Fig. 1 shows, the basic flow of isolated-word speech recognition is as follows. The input speech signal is divided into frames and fed to the speech endpoint detection module, which detects the start and end of each isolated word, i.e., determines that the signal frames corresponding to a given isolated word are frames n₁ to n₂. The parameter extraction module then computes the feature parameter vector of each of these speech signal frames in turn, and these vectors form the feature set of the isolated word. In training mode, the feature set is stored in the template library for later use. In recognition mode, the feature set is sent to the pattern matching module and compared one by one with the feature sets stored in the template library to compute feature distances. The decision logic module then gives the recognition result according to these feature distances.

In the system described above, isolated-word speech endpoint detection is an important module. The existing dual-threshold speech endpoint detection method relies mainly on the short-time average magnitude parameter M and the short-time zero-crossing rate parameter Z of the speech signal. Its theoretical basis is this: compared with silent segments, the short-time average magnitude of speech segments, and of voiced segments in particular, is larger; and although the short-time average magnitude of consonant segments is small, their short-time zero-crossing rate is clearly higher than that of silent segments. The detailed procedure is as follows. First, a high amplitude threshold M_H is set; when the M value of a frame exceeds M_H, that frame is considered to lie in a speech segment. M_H thus yields two endpoint frames. The speech signal frames between them are considered to be speech, and most likely voiced speech, but the exact start point should lie before the first of these two frames and the exact end point after the second. Second, a low amplitude threshold M_L is set. Searching backward from the first frame and forward from the second, the points where the M value of the signal frames falls to M_L give two further endpoints, frame n′₁ and frame n′₂, and everything between them is considered speech. Finally, searching backward from frame n′₁ and forward from frame n′₂, the short-time zero-crossing rate threshold Z_s is used to determine the exact start point of the isolated word, frame n₁, and its exact end point, frame n₂.

This method is simple in theory and computationally light, and it is widely used in isolated-word speech recognition systems. However, because it determines the start and end points of the isolated word first and only then extracts parameters, it must buffer a large amount of speech data, which hinders real-time implementation.

Summary of the invention

An embodiment of the invention provides a method for isolated-word speech endpoint detection, in order to solve the prior-art problem that a large amount of speech data must be buffered because the start and end points of the isolated word are determined before parameter extraction is carried out.

Correspondingly, an embodiment of the invention also provides a device for isolated-word speech endpoint detection.

A method for isolated-word speech endpoint detection comprises the steps of:

Each time a speech signal frame is received, performing the following processing on a signal group formed by combining this frame with at least one adjacent, previously received speech signal frame, until the start point of the isolated word is determined: judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than a preset high amplitude threshold is greater than or equal to a preset frame-count threshold; and, when it is, determining as the start point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than a preset low amplitude threshold;

For the speech signal frames received after the determined start point of the isolated word, performing feature parameter calculation and detection of the isolated word's end point synchronously.

A device for isolated-word speech endpoint detection comprises:

A start-point detection unit, configured to determine the start point of an isolated word in the received speech signal frames;

A feature parameter calculation unit, configured to calculate feature parameters for the speech signal frames received after the start point determined by the start-point detection unit;

An end-point detection unit, configured to detect the end point of the isolated word in the speech signal frames received after the start point determined by the start-point detection unit, the end-point detection being performed synchronously with the feature parameter calculation carried out by the feature parameter calculation unit.

The start-point detection unit comprises:

A first combining subunit, configured to, each time a speech signal frame is received, combine this frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the start point of the isolated word is determined;

A first judging subunit, configured to judge whether the number of speech signal frames in the signal group formed by the first combining subunit whose short-time average magnitude parameter is higher than a preset high amplitude threshold is greater than or equal to a preset frame-count threshold;

A first start-point determining subunit, configured to, when the judgment of the first judging subunit is affirmative, determine as the start point of the isolated word the earliest-received signal frame in the signal group whose short-time average magnitude parameter is higher than a preset low amplitude threshold.

The embodiment of the invention first detects the start point of the isolated word and begins feature parameter calculation from the start-point speech signal frame, detecting the end point of the isolated word while the feature parameters of the speech signal frames are being calculated. Once the end point is determined, feature parameter calculation can stop. A large amount of speech data therefore need not be buffered, which facilitates real-time implementation.

Description of drawings

Fig. 1 is the general block diagram of an isolated-word speech recognition system in the prior art;

Fig. 2 is the flowchart of the isolated-word speech endpoint detection method in the embodiment of the invention;

Fig. 3 is the structural diagram of the isolated-word speech endpoint detection device in the embodiment of the invention;

Fig. 4A and Fig. 4B are structural diagrams of the start-point detection unit in the embodiment of the invention;

Fig. 5A and Fig. 5B are structural diagrams of the end-point detection unit in the embodiment of the invention.

Embodiment

In addition to needing to buffer a large amount of speech data, the existing dual-threshold speech endpoint detection method has the following defects:

1. Because the start and end points are detected using the parameters of a single speech signal frame, a burst glitch signal such as a knock is easily judged to be the start point of an isolated word, and a third-tone isolated word is easily detected as two isolated words.

2. When uncertain background noise is present, the accuracy of endpoint detection is hard to guarantee.

To address these shortcomings of the existing dual-threshold speech endpoint detection method, the embodiment of the invention provides a method that first determines the start point of the isolated word and then detects its end point while the feature parameters of the speech signal frames are being calculated. In addition, endpoint detection uses the parameters of several adjacent speech signal frames, and each speech signal frame undergoes short-time spectrum adjustment processing, with the resulting speech presence probability parameter also serving as one of the endpoint detection parameters.

Specific implementations of the embodiment of the invention are described in detail below with reference to the accompanying drawings.

Fig. 2 shows the flowchart of the isolated-word speech endpoint detection method in the embodiment of the invention. The specific steps are as follows:

Step 201: compute the mean short-time average magnitude parameter M_n of the first N signal frames collected after system start-up and use it as the background noise amplitude parameter. That is, the first N signal frames are assumed to contain only background noise; they are used only to estimate the background noise amplitude parameter and undergo no further processing.

M_n = \frac{1}{N} \sum_{i=1}^{N} M(i) \qquad (1)

The short-time average magnitude parameter M(i) of each signal frame is computed as:

M(i) = \frac{1}{K} \sum_{l=1}^{K} \mathrm{abs}(s(i, l)) \qquad (2)

where s(i, l) is the amplitude of the l-th sample of the i-th signal frame, and abs() is the absolute-value function.

When the signal sampling rate is 8 kHz, a typical value of the frame count N is 10 and a typical value of the frame length K is 128.
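Formulas (1) and (2) can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function names and the toy four-sample frames are assumptions introduced here, and real frames would hold K = 128 samples.

```python
from typing import Sequence


def short_time_average_magnitude(frame: Sequence[float]) -> float:
    """Formula (2): mean absolute amplitude over the K samples of one frame."""
    return sum(abs(s) for s in frame) / len(frame)


def background_noise_magnitude(frames: Sequence[Sequence[float]]) -> float:
    """Formula (1): mean of M(i) over the first N frames, assumed noise-only."""
    return sum(short_time_average_magnitude(f) for f in frames) / len(frames)


# Toy data (hypothetical): two very short "noise" frames.
noise_frames = [[1.0, -1.0, 2.0, -2.0], [0.5, -0.5, 0.5, -0.5]]
m_n = background_noise_magnitude(noise_frames)
```

Here the two frames have M(1) = 1.5 and M(2) = 0.5, so the estimated background noise amplitude parameter is M_n = 1.0.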

Step 202: apply short-time spectrum adjustment processing to each received speech signal frame and determine the speech presence probability parameter of each speech signal frame.

In this step, short-time Fourier analysis is applied to the received time-domain signal frame; parameters such as the noise amplitude and the speech presence probability on each frequency component are estimated; the weighting factor of each frequency component is computed from them; and, after each frequency component has been weighted by its weighting factor, short-time Fourier synthesis yields the enhanced speech signal. To avoid the speech distortion this method may introduce, the relevant parameters should be adjusted so that only relatively stationary noise components are filtered out.

While performing speech enhancement, this method can estimate the probability p(i, k) that speech is present on the k-th frequency component of the i-th signal frame; p(i, k) takes values between 0 and 1. Considering that the energy of speech components is concentrated mainly in the low-frequency band, the invention takes the mean of p(i, k) over the low-frequency band as the speech presence probability parameter P(i) of the i-th signal frame:

P(i) = \frac{1}{L} \sum_{k=k_0}^{k_1} p(i, k) \qquad (3)

where k₀ and k₁ are the indices of the first and last frequency points of the selected low-frequency band, and L is the number of selected frequency points, i.e., L = k₁ - k₀ + 1.

Assuming a signal sampling rate of 8 kHz and a short-time Fourier transform length of 256 points, one set of reference values for k₀ and k₁ is 15 and 80, corresponding to a frequency range of about 450 Hz to 2400 Hz.

In this step, short-time spectrum adjustment processing is applied to each received speech signal frame to enhance the speech, remove noise components, and estimate the speech presence probability parameter. Using this parameter as one of the decision parameters for the subsequent endpoint detection improves the accuracy of endpoint detection in noisy environments.
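Formula (3) amounts to averaging the per-bin probabilities over the selected band. The sketch below assumes that p(i, k) for one frame is already available as a list indexed from bin 0 (how it is estimated is not shown), and the function names are hypothetical. The helper `bin_frequency_hz` uses the standard relation between an FFT bin index and its centre frequency.

```python
def speech_presence_probability(p_frame, k0=15, k1=80):
    """Formula (3): mean of p(i, k) for k = k0..k1 inclusive (L = k1 - k0 + 1)."""
    band = p_frame[k0:k1 + 1]
    if len(band) != k1 - k0 + 1:
        raise ValueError("p_frame does not cover bins k0..k1")
    return sum(band) / len(band)


def bin_frequency_hz(k, sample_rate=8000, fft_len=256):
    """Centre frequency of FFT bin k, assuming zero-based bin indexing."""
    return k * sample_rate / fft_len


# Toy data (hypothetical): 129 bins, probability 0.5 only inside the band.
p_frame = [0.0] * 15 + [0.5] * 66 + [0.0] * 48
p_i = speech_presence_probability(p_frame)
```

Under the zero-based indexing assumed here, bins 15 and 80 sit at 468.75 Hz and 2500 Hz, which is in the same region as the approximate range quoted above; a different indexing convention would shift these values by one bin (31.25 Hz).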

Step 203: determine the start point of the isolated word in the received speech signal frames.

In this step, the speech presence probability parameter P(i) and the short-time average magnitude parameter M(i) of several adjacent speech signal frames are used to judge whether the start point of an isolated word lies within these frames. The basic principle is that, near the start point of an isolated word, the P(i) and M(i) values of the speech signal frames are relatively large. The start point is determined as follows:

Each time a speech signal frame is received, the following processing is performed on a signal group formed by combining this frame with at least one adjacent, previously received speech signal frame, until the start point of the isolated word is determined:

Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is higher than the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;

When both conditions hold, determine as the start point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold and whose speech presence probability parameter is higher than the preset low speech presence probability threshold.

Step 204: for the speech signal frames received after the determined start point of the isolated word, perform feature parameter calculation and detection of the isolated word's end point synchronously.

Once the start point of the isolated word is determined, feature parameters must be calculated for the speech signal frames after the start point, giving for each frame a feature vector that carries the characteristic information of that signal frame. The feature vectors of the signal frames of the isolated word, arranged in order, form the feature set of the isolated word. The feature parameters mainly include cepstral coefficients, MFCC coefficients, and various parameters derived from them. Compared with the raw speech data, these feature parameters are more stable and robust and need far less storage. The feature parameters are calculated as in the prior art, so the calculation is not described again here.

The end point of the isolated word is detected while the feature parameters are being calculated. The detection process is as follows:

Each time a speech signal frame is received, the following processing is performed on a signal group formed by combining this frame with at least one adjacent, previously received speech signal frame, until the end point of the isolated word is determined:

Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is lower than the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;

When both conditions hold, determine as the end point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold and whose speech presence probability parameter is lower than the preset low speech presence probability threshold.

Step 205: judge whether the distance between the determined start point and end point of the isolated word is greater than a preset minimum distance value. If it is, endpoint detection ends; if it is less than or equal to that value, the end point just determined is ignored and step 204 is carried out again.

This step verifies the end point determined in step 204 and removes "pseudo" end points, mainly by examining the distance between that end point and the start point determined in step 203. Let the frame number of the end point be i₁ and the frame number of the start point be i₀. If

i₁ - i₀ > IL_T (4)

is satisfied, the end point is accepted; otherwise it is regarded as a "pseudo" end point, and parameter calculation and end-point detection continue. IL_T is the preset minimum distance value, with a typical value of 10.
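The check in formula (4) is a single comparison. The sketch below is only an illustration; the function and parameter names are hypothetical, and the default of 10 is the typical IL_T value given above.

```python
def is_valid_end_point(start_frame: int, end_frame: int, min_distance: int = 10) -> bool:
    """Formula (4): accept the end point only if i1 - i0 > IL_T."""
    return end_frame - start_frame > min_distance


accepted = is_valid_end_point(100, 111)   # distance 11, greater than IL_T
rejected = is_valid_end_point(100, 110)   # distance 10, not greater than IL_T
```

Note that the comparison is strict: a candidate end point exactly IL_T frames after the start point is still treated as a "pseudo" end point.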

The detailed process of determining the start point of the isolated word in step 203 is illustrated below.

A pair of high and low amplitude thresholds is determined from the background noise amplitude parameter and the input signal-to-noise ratio threshold:

M_H = α_H · M_n (5a)

M_L = α_L · M_n (5b)

where M_n is the background noise amplitude parameter computed in step 201; M_H and M_L are the high and low amplitude thresholds; and α_H and α_L are the high and low amplitude threshold scale factors, with typical values α_H = 5.5 and α_L = 4.

High and low speech presence probability thresholds P_H and P_L are set. Their values should be obtained by experiment; typical values are P_H = 0.65 and P_L = 0.55.

Let the number of the currently received speech signal frame be i. Arrays Ma[5] and Pa[5] record, in order, the short-time average magnitude parameters and the speech presence probability parameters of the 5 frames i-4 to i, that is:

Ma[k] = M(i-4+k) (6a)

Pa[k] = P(i-4+k) (6b)

where k = 0 to 4. The short-time average magnitude parameter M(l) is computed as in formula (2), and the speech presence probability parameter P(l) as in formula (3).
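The thresholds of formula (5) and the five-frame arrays of formula (6) might be kept as in the following sketch. It assumes a bounded `collections.deque` as the array type; `amplitude_thresholds` and `FrameWindow` are hypothetical names introduced here, not part of the patent.

```python
from collections import deque


def amplitude_thresholds(m_n, alpha_h=5.5, alpha_l=4.0):
    """Formulas (5a), (5b): scale the noise magnitude M_n into M_H and M_L."""
    return alpha_h * m_n, alpha_l * m_n


class FrameWindow:
    """Holds M and P of the most recent `size` frames, oldest first (formula 6)."""

    def __init__(self, size=5):
        self.ma = deque(maxlen=size)
        self.pa = deque(maxlen=size)

    def push(self, m, p):
        # Appending to a full deque drops the oldest entry, which plays
        # the role of "updating the arrays" when a new frame arrives.
        self.ma.append(m)
        self.pa.append(p)

    def is_full(self):
        return len(self.ma) == self.ma.maxlen


m_h, m_l = amplitude_thresholds(2.0)
window = FrameWindow()
for i in range(7):                 # frames 0..6 with toy values
    window.push(float(i), i / 10)
```

After seven frames have been pushed, the window holds only frames 2 to 6, so memory use stays constant however long the input runs.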

Count the number C_MH of elements of Ma[5] whose value is greater than M_H, and the number C_PH of elements of Pa[5] whose value is greater than P_H.

Pseudo code for computing C_MH and C_PH is as follows:

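/* pseudo code to calculate C_MH and C_PH */
C_MH = C_PH = 0;
for (k = 0; k < 5; k++)
{
    if (Pa[k] > P_H) ++C_PH;   /* frames with high speech presence probability */
    if (Ma[k] > M_H) ++C_MH;   /* frames with high short-time average magnitude */
}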

Clearly, C_MH and C_PH range from 0 to 5. If both are large enough, i.e., they simultaneously satisfy:

C_MH >= C_MH_T (7a)

C_PH >= C_PH_T (7b)

then the start point of the isolated word is considered to lie within the 5 frames i-4 to i; otherwise the arrays Ma[5] and Pa[5] are updated and start-point detection continues. Here C_MH_T and C_PH_T are the preset frame-count thresholds, with typical values C_MH_T = 3 and C_PH_T = 4.

Once the start point is determined to lie within the 5 signal frames i-4 to i, the frames are examined in order starting from frame (i-4). If the parameters of some speech signal frame k simultaneously satisfy:

Ma[k] > M_L (8a)

Pa[k] > P_L (8b)

then the exact start point is taken to be the k-th speech signal frame, and feature parameters are calculated for all speech signal frames from that point on. Here k = i-4 to i is a frame number, and Ma[k] and Pa[k] denote the array entries recorded for that frame. If none of the 5 speech signal frames satisfies (8), the arrays Ma[5] and Pa[5] are updated and start-point detection continues.
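Conditions (7) and (8) can be combined into one function, sketched below under the typical values quoted above. The function name, its signature, and the toy data are assumptions made for illustration; `ma` and `pa` stand for the arrays Ma[5] and Pa[5] holding frames i-4 to i, oldest first.

```python
def detect_start_point(ma, pa, i, m_h, m_l, p_h, p_l, c_mh_t=3, c_ph_t=4):
    """Return the frame number of the start point, or None if not found yet."""
    c_mh = sum(1 for m in ma if m > m_h)      # count for condition (7a)
    c_ph = sum(1 for p in pa if p > p_h)      # count for condition (7b)
    if c_mh < c_mh_t or c_ph < c_ph_t:
        return None                           # no start point in this window
    first_frame = i - len(ma) + 1             # frame number of ma[0]
    for offset, (m, p) in enumerate(zip(ma, pa)):
        if m > m_l and p > p_l:               # conditions (8a) and (8b)
            return first_frame + offset
    return None


# Thresholds for M_n = 1.0 with the typical scale factors and probabilities.
M_H, M_L, P_H, P_L = 5.5, 4.0, 0.65, 0.55

# Speech onset: frames 16..20, magnitude and probability both rising.
onset = detect_start_point([3.0, 4.5, 6.0, 7.0, 8.0],
                           [0.5, 0.7, 0.8, 0.9, 0.9], 20, M_H, M_L, P_H, P_L)

# Single-frame burst (for example a knock): only one frame exceeds M_H.
burst = detect_start_point([1.0, 1.0, 9.0, 1.0, 1.0],
                           [0.1, 0.1, 0.9, 0.1, 0.1], 20, M_H, M_L, P_H, P_L)
```

In the onset example three frames exceed M_H and four exceed P_H, and the first frame passing both low thresholds is frame 17, so that frame is returned. The burst example fails the count in (7a) and yields no start point.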

In this step, whether the start point of the isolated word lies within these frames may also be judged from the short-time average magnitude parameter M(i) of the adjacent speech signal frames alone, that is:

Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;

When it is, determine as the start point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold.

Like the existing dual-threshold speech endpoint detection method, this start-point detection method uses two sets of threshold parameters, high and low, which helps detect weak-amplitude unvoiced consonant onsets. In addition, because the parameters of 5 consecutive speech signal frames are examined together, glitch signals such as knocks are not mistaken for the start point of an isolated word.

The end-point detection process in step 204 mirrors the start-point detection process above, except that it uses different threshold parameters and looks for frames whose parameters fall below the thresholds. Briefly:

High and low amplitude thresholds are set:

M′_H = α′_H · M_n (9a)

M′_L = α′_L · M_n (9b)

where the typical values of α′_H and α′_L are α′_H = 4 and α′_L = 3.2. The meaning of each parameter is as in formula (5).

High and low speech presence probability thresholds P′_H and P′_L are set, with typical values P′_H = 0.52 and P′_L = 0.45.

Count the number C′_MH of elements of Ma[5] whose value is less than M′_H, and the number C′_PH of elements of Pa[5] whose value is less than P′_H. If they simultaneously satisfy:

C′_MH >= C′_MH_T (10a)

C′_PH >= C′_PH_T (10b)

then the end point of the isolated word is considered to lie within the 5 frames i-4 to i; otherwise the arrays Ma[5] and Pa[5] are updated and end-point detection continues. Here C′_MH_T and C′_PH_T are the preset frame-count thresholds, with typical values C′_MH_T = 4 and C′_PH_T = 4.

Once the end point is determined to lie within the 5 signal frames i-4 to i, the frames are examined in order starting from frame (i-4). If the parameters of some speech signal frame k simultaneously satisfy:

Ma[k] < M′_L (11a)

Pa[k] < P′_L (11b)

then the exact end point is taken to be the k-th speech signal frame. Here k = i-4 to i is a frame number, and Ma[k] and Pa[k] denote the array entries recorded for that frame. If none of the 5 speech signal frames satisfies (11), the arrays Ma[5] and Pa[5] are updated and end-point detection continues.
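The end-point test mirrors the start-point test with the inequalities reversed. The sketch below uses the typical values of formulas (9) to (11); the function name, signature, and toy data are assumptions made for illustration, and `ma` and `pa` again stand for Ma[5] and Pa[5] holding frames i-4 to i, oldest first.

```python
def detect_end_point(ma, pa, i, m_h, m_l, p_h, p_l, c_mh_t=4, c_ph_t=4):
    """Return the frame number of the end point, or None if not found yet."""
    c_mh = sum(1 for m in ma if m < m_h)      # count for condition (10a)
    c_ph = sum(1 for p in pa if p < p_h)      # count for condition (10b)
    if c_mh < c_mh_t or c_ph < c_ph_t:
        return None                           # speech is still in progress
    first_frame = i - len(ma) + 1             # frame number of ma[0]
    for offset, (m, p) in enumerate(zip(ma, pa)):
        if m < m_l and p < p_l:               # conditions (11a) and (11b)
            return first_frame + offset
    return None


# End thresholds for M_n = 1.0 with the typical scale factors and probabilities.
M_H_END, M_L_END, P_H_END, P_L_END = 4.0, 3.2, 0.52, 0.45

# Speech decay: frames 46..50, magnitude and probability both falling.
decay = detect_end_point([5.0, 3.5, 3.0, 2.0, 1.0],
                         [0.6, 0.5, 0.4, 0.3, 0.2], 50,
                         M_H_END, M_L_END, P_H_END, P_L_END)

# Ongoing speech: nothing falls below the high thresholds.
ongoing = detect_end_point([6.0] * 5, [0.9] * 5, 50,
                           M_H_END, M_L_END, P_H_END, P_L_END)
```

In the decay example four frames fall below M′_H and four below P′_H, and the first frame below both low thresholds is frame 48, so that frame is returned.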

Clearly, in this step too, whether the end point of the isolated word lies within these frames may be judged from the short-time average magnitude parameter M(i) of the adjacent speech signal frames alone, that is:

Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;

When it is, determine as the end point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.

Because the parameters of 5 consecutive speech signal frames are used together for detection, this helps avoid a third-tone Chinese character, such as the character for "nine", being detected as two isolated words.
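To show how the pieces fit together without buffering a whole utterance, the following sketch runs the amplitude-only variants of start-point and end-point detection together with the minimum-distance check in one streaming loop. It is an illustration under stated assumptions, not the patent's implementation: `run_endpoint_detection` and `extract_features` are hypothetical names, the feature function is a stand-in for cepstral or MFCC computation, and only the last 5 frames are ever held in memory.

```python
from collections import deque


def run_endpoint_detection(frames, m_n, extract_features, min_distance=10):
    """Return (start, end, features); end is None if no valid end point is seen."""
    start_h, start_l = 5.5 * m_n, 4.0 * m_n    # formulas (5a), (5b)
    end_h, end_l = 4.0 * m_n, 3.2 * m_n        # formulas (9a), (9b)
    window = deque(maxlen=5)                   # (frame number, M, samples)
    start, features = None, []
    for i, samples in enumerate(frames):
        m = sum(abs(s) for s in samples) / len(samples)   # formula (2)
        window.append((i, m, samples))
        if start is None:
            # Start-point search: at least 3 of the 5 frames above the high threshold.
            if len(window) == 5 and sum(v > start_h for _, v, _ in window) >= 3:
                start = next(n for n, v, _ in window if v > start_l)
                features = [extract_features(s) for n, _, s in window if n >= start]
            continue
        # After the start point: features and end-point search proceed together.
        features.append(extract_features(samples))
        if sum(v < end_h for _, v, _ in window) >= 4:
            end = next((n for n, v, _ in window if v < end_l), None)
            # Formula (4): ignore "pseudo" end points too close to the start.
            if end is not None and end - start > min_distance:
                return start, end, features[: end - start + 1]
    return start, None, features


# Toy signal (hypothetical): 10 noise frames, 20 loud frames, 10 noise frames.
frames = [[1.0, -1.0]] * 10 + [[10.0, -10.0]] * 20 + [[1.0, -1.0]] * 10
start, end, feats = run_endpoint_detection(frames, 1.0, max)
```

On this toy signal the start point is frame 10 and the end point is frame 30, with one feature value per frame from the start point to the end point. Features computed for the few frames after the end point, before it was confirmed, are simply discarded.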

Correspondingly, the embodiment of the invention also provides a device for isolated-word speech endpoint detection. As shown in Fig. 3, it comprises a start-point detection unit 310, a feature parameter calculation unit 320, an end-point detection unit 330, and a judging unit 340.

The start-point detection unit 310 is configured to determine the start point of an isolated word in the received speech signal frames;

The feature parameter calculation unit 320 is configured to calculate feature parameters for the speech signal frames received after the start point determined by the start-point detection unit 310;

The end-point detection unit 330 is configured to detect the end point of the isolated word in the speech signal frames received after the start point determined by the start-point detection unit 310, synchronously with the feature parameter calculation carried out by the feature parameter calculation unit 320.

The judging unit 340 is configured to judge whether the distance between the start point determined by the start-point detection unit 310 and the end point determined by the end-point detection unit 330 is greater than a preset minimum distance value;

When the judging unit 340 finds that the distance is less than or equal to the preset minimum distance value, the end point determined by the end-point detection unit 330 is ignored, and, for the speech signal frames received after that end point, the feature parameter calculation unit 320 and the end-point detection unit 330 continue to perform feature parameter calculation and end-point detection synchronously.

Preferably, as shown in Fig. 4A, the start-point detection unit 310 comprises a first combining subunit 311, a first judging subunit 312, and a first start-point determining subunit 313.

The first combining subunit 311 is configured to, each time a speech signal frame is received, combine this frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the start point of the isolated word is determined;

The first judging subunit 312 is configured to judge whether the number of speech signal frames in the signal group formed by the first combining subunit 311 whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;

The first start-point determining subunit 313 is configured to, when the judgment of the first judging subunit 312 is affirmative, determine as the start point of the isolated word the earliest-received signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold.

Preferably, as shown in Fig. 5A, the end-point detection unit 330 comprises a second combining subunit 331, a second judging subunit 332, and a first end-point determining subunit 333.

The second combining subunit 331 is configured to, each time a speech signal frame is received, combine this frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the end point of the isolated word is determined;

The second judging subunit 332 is configured to judge whether the number of speech signal frames in the signal group formed by the second combining subunit 331 whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;

The first end-point determining subunit 333 is configured to, when the judgment of the second judging subunit 332 is affirmative, determine as the end point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.

Preferably, the isolated-word speech endpoint detection device described above further comprises a short-time spectrum adjustment processing unit 350.

The short-time spectrum adjustment processing unit 350 is configured to apply short-time spectrum adjustment processing to each received speech signal frame and determine the speech presence probability parameter of each speech signal frame.

Preferably, as shown in Fig. 4B, the start-point detection unit 310 comprises the first combining subunit 311, a third judging subunit 314, and a second start-point determining subunit 315.

The third judging subunit 314 is configured to judge whether the number of speech signal frames in the signal group formed by the first combining subunit 311 whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is higher than the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;

The second start-point determining subunit 315 is configured to, when the judgment of the third judging subunit 314 is affirmative, determine as the start point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold and whose speech presence probability parameter is higher than the preset low speech presence probability threshold.

Preferably, as shown in Fig. 5B, the end-point detection unit 330 comprises the second combining subunit 331, a fourth judging subunit 334, and a second end-point determining subunit 335.

The fourth judging subunit 334 is configured to judge whether the number of speech signal frames in the signal group formed by the second combining subunit 331 whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is lower than the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;

The second end-point determining subunit 335 is configured to, when the judgment of the fourth judging subunit 334 is affirmative, determine as the end point of the isolated word the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold and whose speech presence probability parameter is lower than the preset low speech presence probability threshold.

In the technical solution proposed by the embodiments of the present invention, once the isolated word starting point has been detected, detection of the isolated word end point is carried out while the characteristic parameters are being calculated, and the calculation of characteristic parameters stops as soon as the isolated word end point is determined; a large amount of voice data therefore need not be buffered. Moreover, because the isolated word endpoints are detected using the relevant parameters of several adjacent voice signal frames, a burr (spike) signal is effectively prevented from being determined as an isolated word starting point, and a single isolated word is not misjudged as two or three isolated word segments. In addition, short-time spectrum adjustment processing is applied to each voice signal frame, and the resulting voice presence probability parameter is used as one of the endpoint detection parameters, which effectively improves the accuracy and robustness of endpoint detection in the presence of uncertain background noise.
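The frame-by-frame flow summarized above might look roughly like the following sketch. It is not the patented implementation: the function `process_after_start`, the `extract_features` callback, the fixed `group_size` window, and the choice to keep features up to and including the end frame are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

# (short-time average magnitude, voice presence probability) of one frame
Frame = Tuple[float, float]


def process_after_start(
    frames: Iterable[Frame],
    extract_features: Callable[[Frame], float],
    amp_high: float,
    amp_low: float,
    prob_high: float,
    prob_low: float,
    min_frames: int,
    group_size: int,
) -> Tuple[List[float], Optional[int]]:
    """Compute features frame by frame while watching for the end point."""
    # Only a short signal group of adjacent frames is kept, not the whole word.
    window: deque = deque(maxlen=group_size)
    features: List[float] = []
    for index, frame in enumerate(frames):
        # Feature calculation proceeds in step with end point detection.
        features.append(extract_features(frame))
        window.append((index, frame))
        low_amp = sum(1 for _, (a, _p) in window if a < amp_high)
        low_prob = sum(1 for _, (_a, p) in window if p < prob_high)
        if low_amp >= min_frames and low_prob >= min_frames:
            # Earliest frame in the group below both low thresholds.
            for i, (a, p) in window:
                if a < amp_low and p < prob_low:
                    # Stop here; drop features computed past the end frame
                    # (assumption: the end frame itself is kept).
                    return features[: i + 1], i
    return features, None
```

In this sketch the only per-frame state retained is the short window and the features already computed, which is the property that removes the need to buffer raw voice data for the whole word.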

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A method for isolated word voice endpoint detection, characterized in that it comprises the steps of:

2. The method of claim 1, characterized in that it further comprises the steps of:

Judging whether the distance between the determined isolated word starting point and end point is greater than a preset minimum distance value;

When the judgment result is that the distance is less than or equal to the preset minimum distance value, ignoring the determined isolated word end point and, for the voice signal frames received after said isolated word end point, continuing to carry out the calculation of characteristic parameters and the detection of the isolated word end point synchronously.

3. The method of claim 1 or 2, characterized in that the process of detecting the isolated word end point comprises:

When the judgment result is "greater than or equal to", determining as the isolated word end point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.

4. The method of claim 1 or 2, characterized in that it further comprises the step of:

Carrying out short-time spectrum adjustment processing on each received voice signal frame, and determining the voice presence probability parameter of each voice signal frame.

5. The method of claim 4, characterized in that the process of determining the isolated word starting point comprises:

When the judgment result is "greater than or equal to", determining as the isolated word starting point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold and whose voice presence probability parameter is higher than the preset low voice presence probability threshold.

6. The method of claim 4, characterized in that the process of detecting the isolated word end point comprises:

When the judgment result is "greater than or equal to", determining as the isolated word end point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold and whose voice presence probability parameter is lower than the preset low voice presence probability threshold.

7. A device for isolated word voice endpoint detection, characterized in that it comprises:

An end point detection unit, configured to carry out isolated word end point detection on the voice signal frames received after the isolated word starting point determined by said starting point detection unit, the isolated word end point detection being carried out synchronously with the calculation of characteristic parameters by said characteristic parameter calculation unit;

Said starting point detection unit comprises:

8. The device of claim 7, characterized in that it further comprises:

A judging unit, configured to judge whether the distance between the isolated word starting point determined by said starting point detection unit and the end point determined by said end point detection unit is greater than a preset minimum distance value;

When the judgment result of said judging unit is that the distance is less than or equal to the preset minimum distance value, the isolated word end point determined by said end point detection unit is ignored and, for the voice signal frames received after said isolated word end point, said characteristic parameter calculation unit and said end point detection unit continue to carry out the calculation of characteristic parameters and the detection of the isolated word end point synchronously.

9. The device of claim 7 or 8, characterized in that said end point detection unit comprises:

A second combining subunit, configured to, each time a voice signal frame is received, combine that voice signal frame with at least one adjacent, already-received voice signal frame into a signal group, said combining continuing until the isolated word end point is determined;

A second judgment subunit, configured to judge whether, in the signal group formed by said second combining subunit, the number of voice signal frames whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;

A first end point determination subunit, configured to, when the judgment result of said second judgment subunit is "greater than or equal to", determine as the isolated word end point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.

10. The device of claim 7 or 8, characterized in that it further comprises:

A short-time spectrum adjustment processing unit, configured to carry out short-time spectrum adjustment processing on each received voice signal frame and to determine the voice presence probability parameter of each voice signal frame.

11. The device of claim 10, characterized in that said starting point detection unit comprises:

A third judgment subunit, configured to judge whether, in the signal group formed by said first combining subunit, the number of voice signal frames whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of voice signal frames whose voice presence probability parameter is higher than the preset high voice presence probability threshold is greater than or equal to the preset frame-count threshold;

A second starting point determination subunit, configured to, when the judgment result of said third judgment subunit is "greater than or equal to", determine as the isolated word starting point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold and whose voice presence probability parameter is higher than the preset low voice presence probability threshold.

12. The device of claim 10, characterized in that said end point detection unit comprises:

A fourth judgment subunit, configured to judge whether, in the signal group formed by said second combining subunit, the number of voice signal frames whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of voice signal frames whose voice presence probability parameter is lower than the preset high voice presence probability threshold is greater than or equal to the preset frame-count threshold;

A second end point determination subunit, configured to, when the judgment result of said fourth judgment subunit is "greater than or equal to", determine as the isolated word end point the earliest-received voice signal frame in said signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold and whose voice presence probability parameter is lower than the preset low voice presence probability threshold.