CN101206858B - Method and system for testing alone word voice endpoint - Google Patents


Publication number
CN101206858B
Authority
CN
China
Prior art keywords
voice signal
signal frame
voice
isolated word
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101793424A
Other languages
Chinese (zh)
Other versions
CN101206858A (en
Inventor
邓昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vimicro Corp
Original Assignee
Vimicro Corp
Application filed by Vimicro Corp
Priority to CN2007101793424A
Publication of CN101206858A
Application granted
Publication of CN101206858B


Abstract

The invention discloses a speech endpoint detection method for isolated words. The method comprises the following steps: the start point of an isolated word is determined among the received speech signal frames; then, for the speech signal frames received after the start point has been determined, feature parameter calculation and detection of the isolated-word end point are performed synchronously. The invention also discloses a speech endpoint detection device for isolated words. Because feature parameter calculation and end-point detection are performed synchronously, a large amount of speech data need not be buffered, which facilitates real-time implementation.

Description

Method and system for isolated-word speech endpoint detection
Technical field
The present invention relates to the field of speech recognition technology, and in particular to a method and system for isolated-word speech endpoint detection.
Background art
Fig. 1 shows the general block diagram of an isolated-word speech recognition system. As Fig. 1 shows, the basic flow of isolated-word speech recognition is as follows. The input speech signal is divided into frames and sent to the speech endpoint detection module, which detects the start and end of each isolated word, that is, determines that the signal frames corresponding to a given isolated word span frames $n_1$ to $n_2$. The parameter extraction module then computes the feature parameter vector of each of these speech signal frames in turn, forming the feature set of the isolated word. In training mode, the feature set is stored in the template library for later use. In recognition mode, the feature set is sent to the pattern matching module and compared one by one with the feature sets stored in the template library to compute feature distances. The decision logic module then gives the recognition result according to these feature distances.
In the isolated-word speech recognition system described above, isolated-word speech endpoint detection is an important module. The existing double-threshold speech endpoint detection method relies mainly on the short-time average magnitude parameter $M$ and the short-time zero-crossing rate parameter $Z$ of the speech signal. Its theoretical basis is as follows: compared with silent segments, speech segments, and voiced segments in particular, have a larger short-time average magnitude; unvoiced consonant segments have a smaller short-time average magnitude, but their short-time zero-crossing rate is markedly higher than that of silent segments. The detailed procedure is:
First, a high amplitude threshold $M_H$ is set. When the $M$ value of a frame exceeds $M_H$, the frame is considered to lie in a speech segment. Two endpoint frames can thus be detected with $M_H$ (their symbols survive only as inline image references S2007101793424D00011 and S2007101793424D00012 in this text). The speech signal frames between them are considered to be speech, and most likely voiced, but the accurate start point should lie before the first of these frames and the accurate end point after the second.
Second, a low amplitude threshold $M_L$ is set. Searching toward earlier frames from the first high-threshold frame and toward later frames from the second, the points at which the $M$ value of the signal frames falls to $M_L$ yield two endpoints, frame $n'_1$ and frame $n'_2$, and everything between them is considered speech.
Finally, searching toward earlier frames from frame $n'_1$ and toward later frames from frame $n'_2$, the short-time zero-crossing rate threshold $Z_s$ is used to determine the accurate start frame $n_1$ and end frame $n_2$ of the isolated word.
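The prior-art procedure above can be sketched as follows. This is a minimal illustration only, assuming the per-frame magnitudes `M` and zero-crossing rates `Z` of the whole utterance are already available as lists; the function name and argument names are hypothetical and do not come from the patent. Note that the sketch needs the complete utterance before it can return, which is the buffering drawback discussed next.

```python
def double_threshold_endpoints(M, Z, m_high, m_low, z_s):
    """Offline double-threshold endpoint search over a fully buffered utterance.

    M, Z   : per-frame short-time average magnitude and zero-crossing rate
    m_high : high amplitude threshold (M_H)
    m_low  : low amplitude threshold (M_L)
    z_s    : zero-crossing rate threshold (Z_s)
    Returns (start_frame, end_frame) or None if no frame exceeds m_high.
    """
    above = [i for i, m in enumerate(M) if m > m_high]
    if not above:
        return None
    start, end = above[0], above[-1]          # coarse endpoints from M_H

    # Widen the segment while the magnitude stays above the low threshold.
    while start > 0 and M[start - 1] > m_low:
        start -= 1
    while end < len(M) - 1 and M[end + 1] > m_low:
        end += 1

    # Widen further while the zero-crossing rate indicates unvoiced speech.
    while start > 0 and Z[start - 1] > z_s:
        start -= 1
    while end < len(M) - 1 and Z[end + 1] > z_s:
        end += 1
    return start, end
```

Every search step runs in both directions from frames found earlier, so nothing can be discarded until the end of the word has been located.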
This method is theoretically simple and computationally cheap, and is widely used in isolated-word speech recognition systems. However, because it first determines both the start point and the end point of the isolated word and only then performs parameter extraction, it must buffer a large amount of speech data, which hinders real-time implementation.
Summary of the invention
Embodiments of the invention provide a method for isolated-word speech endpoint detection, to solve the prior-art problem that a large amount of speech data must be buffered because the start point and end point of the isolated word are determined first and parameter extraction is performed only afterwards.
Correspondingly, embodiments of the invention also provide a device for isolated-word speech endpoint detection.
A method for isolated-word speech endpoint detection comprises the steps of:
Each time a speech signal frame is received, performing the following processing on a signal group formed from the received frame and at least one adjacent, previously received speech signal frame, until the isolated-word start point is determined: judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than a preset high amplitude threshold is greater than or equal to a preset frame-count threshold; and, when it is, determining as the isolated-word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than a preset low amplitude threshold;
For the speech signal frames received after the determined isolated-word start point, performing feature parameter calculation and isolated-word end-point detection synchronously.
A device for isolated-word speech endpoint detection comprises:
A start-point detection unit, configured to determine the isolated-word start point among the received speech signal frames;
A feature parameter calculation unit, configured to calculate feature parameters for the speech signal frames received after the isolated-word start point determined by the start-point detection unit;
An end-point detection unit, configured to detect the isolated-word end point in the speech signal frames received after the isolated-word start point determined by the start-point detection unit, the end-point detection being performed synchronously with the feature parameter calculation performed by the feature parameter calculation unit.
The start-point detection unit comprises:
A first combining subunit, configured to, each time a speech signal frame is received, combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, this combining continuing until the isolated-word start point is determined;
A first judging subunit, configured to judge whether the number of speech signal frames in the signal group formed by the first combining subunit whose short-time average magnitude parameter is higher than a preset high amplitude threshold is greater than or equal to a preset frame-count threshold;
A first start-point determining subunit, configured to, when the first judging subunit's result is greater than or equal to, determine as the isolated-word start point the earliest-received signal frame in the signal group whose short-time average magnitude parameter is higher than a preset low amplitude threshold.
Embodiments of the invention first detect the isolated-word start point and begin feature parameter calculation from the start-point speech signal frame. End-point detection is carried out while the feature parameters of the speech signal frames are being computed, so feature parameter calculation can stop as soon as the isolated-word end point is determined. A large amount of speech data no longer needs to be buffered, which facilitates real-time implementation.
Description of drawings
Fig. 1 is the general block diagram of an isolated-word speech recognition system in the prior art;
Fig. 2 is the flowchart of the isolated-word speech endpoint detection method in an embodiment of the invention;
Fig. 3 is the structural diagram of the isolated-word speech endpoint detection device in an embodiment of the invention;
Fig. 4A and Fig. 4B are structural diagrams of the start-point detection unit in an embodiment of the invention;
Fig. 5A and Fig. 5B are structural diagrams of the end-point detection unit in an embodiment of the invention.
Detailed description of the embodiments
Besides having to buffer a large amount of speech data, the existing double-threshold speech endpoint detection method has the following defects:
1. Because start-point and end-point detection rely on the parameters of individual speech signal frames, a burst spike such as a knock is easily judged to be the isolated-word start point, and a third-tone isolated word is easily detected as two separate isolated words.
2. When unpredictable background noise is present, the accuracy of endpoint detection is hard to guarantee.
To address these shortcomings of the existing double-threshold method, embodiments of the invention provide a method that first determines the isolated-word start point and then detects the isolated-word end point while the feature parameters of the speech signal frames are being calculated. In addition, endpoint detection uses the parameters of several adjacent speech signal frames, and each speech signal frame undergoes short-time spectral adjustment processing, with the resulting speech presence probability parameter serving as one of the endpoint detection parameters.
Embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 2 shows the flowchart of the isolated-word speech endpoint detection method in an embodiment of the invention. The concrete steps are as follows:
Step 201: Compute the mean $M_n$ of the short-time average magnitude parameters of the first $N$ signal frames collected after system start-up, and use it as the background noise magnitude parameter. That is, the first $N$ signal frames are assumed to contain only background noise; they are used only to estimate the background noise magnitude parameter and undergo no subsequent processing.
$$M_n = \frac{1}{N}\sum_{i=1}^{N} M(i) \tag{1}$$
The short-time average magnitude parameter $M(i)$ of each signal frame is computed as:
$$M(i) = \frac{1}{K}\sum_{l=1}^{K} \mathrm{abs}\big(s(i,l)\big) \tag{2}$$
where $s(i,l)$ is the amplitude of the $l$-th sample of the $i$-th signal frame and abs() is the absolute-value function.
When the signal sampling rate is 8 kHz, a typical value of the frame count $N$ is 10 and a typical value of the frame length $K$ is 128.
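Formulas (1) and (2) can be illustrated with a short sketch. It assumes each frame is a plain list of sample amplitudes; the function names are hypothetical and chosen only for this example.

```python
def short_time_avg_magnitude(frame):
    """Formula (2): mean absolute amplitude of the K samples in one frame."""
    return sum(abs(s) for s in frame) / len(frame)


def background_noise_magnitude(frames, n=10):
    """Formula (1): mean of M(i) over the first n frames, which are
    assumed to contain background noise only (n=10 is the typical N)."""
    first = frames[:n]
    return sum(short_time_avg_magnitude(f) for f in first) / len(first)
```

With the typical values, `frames` would hold 128-sample frames of 8 kHz audio, and the first 10 of them would be consumed by the noise estimate.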
Step 202: Apply short-time spectral adjustment processing to each received speech signal frame and determine the speech presence probability parameter of each speech signal frame.
In this step, short-time Fourier analysis is applied to each received time-domain signal frame, and parameters such as the noise magnitude and the speech presence probability in each frequency component are estimated. A weighting factor for each frequency component is computed from these estimates, each frequency component is weighted by its factor, and short-time Fourier synthesis then yields the enhanced speech signal. To avoid the speech distortion this method may introduce, the relevant parameters should be tuned so that only relatively stationary noise components are filtered out.
While performing speech enhancement, this method can estimate the probability $p(i,k)$ that speech is present in the $k$-th frequency component of the $i$-th signal frame, where $p(i,k)$ takes values between 0 and 1. Since the energy of speech is concentrated mainly in the low-frequency band, the invention takes the mean of $p(i,k)$ over the low-frequency band as the speech presence probability parameter $P(i)$ of the $i$-th signal frame:
$$P(i) = \frac{1}{L}\sum_{k=k_0}^{k_1} p(i,k) \tag{3}$$
where $k_0$ and $k_1$ are the start and end frequency-bin indices of the selected low-frequency band and $L$ is the number of selected frequency bins, i.e. $L = k_1 - k_0 + 1$.
Assuming a signal sampling rate of 8 kHz and a short-time Fourier transform length of 256 points, one set of reference values for $k_0$ and $k_1$ is 15 and 80, corresponding to a frequency range of approximately 450 Hz to 2400 Hz.
In this step, short-time spectral adjustment processing is applied to each received speech signal frame to enhance the speech, remove noise components, and estimate the speech presence probability parameter. Using this parameter as one of the decision parameters in subsequent endpoint detection improves the accuracy of endpoint detection in noisy environments.
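Formula (3) reduces to a band average, sketched below. The per-bin probabilities $p(i,k)$ are assumed to be supplied by whatever spectral enhancement stage is in use; the sketch does not implement that stage. Indexing the list directly by bin number, starting at 0, is also an assumption, since the text does not say whether bins are counted from 0 or 1.

```python
def speech_presence_probability(p_frame, k0=15, k1=80):
    """Formula (3): mean of the per-bin speech presence probabilities
    p(i, k) over the low-frequency bins k0..k1 (inclusive).

    p_frame : list of p(i, k) values for one frame, indexed by bin k
              (assumed 0-based here).
    """
    band = p_frame[k0:k1 + 1]          # L = k1 - k0 + 1 bins
    return sum(band) / len(band)
```

The defaults are the reference values 15 and 80 given for a 256-point transform at 8 kHz.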
Step 203: Determine the isolated-word start point among the received speech signal frames.
In this step, the speech presence probability parameter $P(i)$ and the short-time average magnitude parameter $M(i)$ of several adjacent speech signal frames are used to judge whether an isolated-word start point exists among those frames. The basic principle is that, near the isolated-word start point, the $P(i)$ and $M(i)$ values of the speech signal frames are relatively large. The start point is determined as follows:
Each time a speech signal frame is received, the following processing is performed on a signal group formed from the received frame and at least one adjacent, previously received speech signal frame, until the isolated-word start point is determined:
Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to a preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is higher than the preset high speech presence probability threshold is greater than or equal to a preset frame-count threshold;
When both results are greater than or equal to, determine as the isolated-word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold and whose speech presence probability parameter is higher than the preset low speech presence probability threshold.
Step 204: For the speech signal frames received after the determined isolated-word start point, perform feature parameter calculation and isolated-word end-point detection synchronously.
Once the isolated-word start point has been determined, feature parameters must be calculated for the speech signal frames after it, yielding a feature vector that carries the characteristic information of each signal frame. The feature vectors of the signal frames of the isolated word, arranged in order, form the feature set of the isolated word. The feature parameters mainly include cepstral coefficients, MFCC coefficients, and various parameters derived from them. Compared with the raw speech data, these feature parameters are more stable and robust and need far less storage. The feature parameter calculation is the same as in the prior art and is not described further here.
End-point detection is carried out while the feature parameters are being calculated. The detection process is:
Each time a speech signal frame is received, the following processing is performed on a signal group formed from the received frame and at least one adjacent, previously received speech signal frame, until the isolated-word end point is determined:
Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to a preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is lower than the preset high speech presence probability threshold is greater than or equal to a preset frame-count threshold;
When both results are greater than or equal to, determine as the isolated-word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold and whose speech presence probability parameter is lower than the preset low speech presence probability threshold.
Step 205: Judge whether the distance between the determined isolated-word start point and end point is greater than a preset minimum distance. If it is greater, endpoint detection ends; if it is less than or equal to the minimum distance, the determined end point is ignored and step 204 is carried out again.
This step verifies the end point determined in step 204 and removes "pseudo" end points, mainly by examining the distance between that end point and the start point determined in step 203. Let the end-point frame number be $i_1$ and the start-point frame number be $i_0$. If
$$i_1 - i_0 > IL_T \tag{4}$$
is satisfied, the end point is accepted; otherwise it is regarded as a "pseudo" end point, and parameter calculation and end-point detection continue. $IL_T$ is the preset minimum distance, with a typical value of 10.
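The interplay of steps 203 to 205 can be sketched as a single streaming loop. This is only an illustrative skeleton under stated assumptions: `run_endpoint_detector`, `find_start`, `find_end` and `extract` are hypothetical names, the two detection callbacks stand in for the window tests of steps 203 and 204, and `extract` stands in for whatever feature calculation is used. The point it shows is that only a 5-frame window and the feature vectors are retained, never the raw utterance.

```python
from collections import deque


def run_endpoint_detector(frames, find_start, find_end, extract,
                          min_len=10, win=5):
    """Stream frames once; compute features and look for the end point
    in the same pass.

    find_start(window) / find_end(window): return the offset inside the
    window of the start / end frame, or None if none is detected.
    Returns (start_index, end_index, features); end_index is None if the
    input ends before an accepted end point.
    """
    window = deque(maxlen=win)      # the only raw frames kept in memory
    start = None
    features = []
    for i, frame in enumerate(frames):
        window.append(frame)
        if len(window) < win:
            continue
        first = i - win + 1         # absolute index of window[0]
        if start is None:
            off = find_start(list(window))
            if off is not None:
                start = first + off
                # Features for the start frame and later frames in the window.
                features = [extract(f) for f in list(window)[off:]]
            continue
        features.append(extract(frame))          # feature calculation ...
        off = find_end(list(window))             # ... and end detection together
        if off is not None:
            end = first + off
            if end - start > min_len:            # formula (4)
                return start, end, features[:end - start + 1]
            # Otherwise: pseudo end point, ignore it and keep going.
    return start, None, features
```

A short burst followed by silence produces candidate end points that fail formula (4), so the loop simply continues until an end point lies far enough from the start.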
The detailed process of determining the isolated-word start point in step 203 is illustrated below:
A pair of high and low amplitude thresholds is determined from the background noise magnitude parameter and the input signal-to-noise ratio threshold:
$$M_H = \alpha_H \cdot M_n \tag{5a}$$
$$M_L = \alpha_L \cdot M_n \tag{5b}$$
where $M_n$ is the background noise magnitude parameter computed in step 201, $M_H$ and $M_L$ are the high and low amplitude thresholds, and $\alpha_H$ and $\alpha_L$ are the high and low amplitude threshold scale parameters, with typical values $\alpha_H = 5.5$ and $\alpha_L = 4$.
High and low speech presence probability thresholds $P_H$ and $P_L$ are set. Their values should be obtained by experiment; typical values are $P_H = 0.65$ and $P_L = 0.55$.
Let the frame number of the currently received speech signal frame be $i$. The arrays Ma[5] and Pa[5] record, in order, the short-time average magnitude parameters and speech presence probability parameters of the 5 frames $i-4$ to $i$, that is:
$$Ma[k] = M(i-4+k) \tag{6a}$$
$$Pa[k] = P(i-4+k) \tag{6b}$$
where $k = 0, \dots, 4$. The short-time average magnitude parameter $M(i)$ is computed according to formula (2), and the speech presence probability parameter $P(i)$ according to formula (3).
Count the number $C_{MH}$ of elements in Ma[5] whose value is greater than $M_H$, and the number $C_{PH}$ of elements in Pa[5] whose value is greater than $P_H$.
The pseudo-code for computing $C_{MH}$ and $C_{PH}$ is as follows:
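/* Pseudo-code to calculate C_MH and C_PH over the 5-frame window */
int C_MH = 0, C_PH = 0;
for (int k = 0; k < 5; k++)
{
    if (Pa[k] > P_H) ++C_PH;   /* frames whose speech presence probability exceeds P_H */
    if (Ma[k] > M_H) ++C_MH;   /* frames whose short-time average magnitude exceeds M_H */
}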
Clearly, $C_{MH}$ and $C_{PH}$ take values from 0 to 5. If $C_{MH}$ and $C_{PH}$ are both large enough, simultaneously satisfying:
$$C_{MH} \ge C_{MH\_T} \tag{7a}$$
$$C_{PH} \ge C_{PH\_T} \tag{7b}$$
then an isolated-word start point is considered to exist within the 5 frames $i-4$ to $i$; otherwise the arrays Ma[5] and Pa[5] are updated and start-point detection continues. Here $C_{MH\_T}$ and $C_{PH\_T}$ are the preset frame-count thresholds, with typical values $C_{MH\_T} = 3$ and $C_{PH\_T} = 4$.
Once it has been determined that an isolated-word start point exists within the 5 signal frames $i-4$ to $i$, the frames are examined in order beginning with frame $i-4$. If the parameters of some speech signal frame $k$ simultaneously satisfy:
$$Ma[k] > M_L \tag{8a}$$
$$Pa[k] > P_L \tag{8b}$$
then the accurate start point is taken to be the $k$-th speech signal frame, and feature parameters are calculated for all speech signal frames from then on. Here $k$ ranges over frames $i-4$ to $i$, which correspond to array indices 0 to 4 in formula (6). If none of the 5 speech signal frames satisfies (8), the arrays Ma[5] and Pa[5] are updated and start-point detection continues.
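Formulas (5) to (8) can be combined into one window test, sketched here with the typical values from the text as defaults. The function name `detect_start` and its argument names are hypothetical; the return value is an index into the 5-element window (0 corresponds to frame $i-4$), which is an interpretation of how formula (8) maps onto the arrays of formula (6).

```python
def detect_start(Ma, Pa, m_n, alpha_h=5.5, alpha_l=4.0,
                 p_h=0.65, p_l=0.55, c_mh_t=3, c_ph_t=4):
    """Return the window index of the isolated-word start frame, or None.

    Ma, Pa : magnitudes M and speech presence probabilities P of the
             5 most recent frames (index 0 is the oldest, frame i-4).
    m_n    : background noise magnitude from formula (1).
    """
    m_h, m_l = alpha_h * m_n, alpha_l * m_n        # formulas (5a), (5b)
    c_mh = sum(m > m_h for m in Ma)                # count for (7a)
    c_ph = sum(p > p_h for p in Pa)                # count for (7b)
    if c_mh < c_mh_t or c_ph < c_ph_t:
        return None                                # no start in this window
    for k, (m, p) in enumerate(zip(Ma, Pa)):       # oldest frame first
        if m > m_l and p > p_l:                    # formulas (8a), (8b)
            return k
    return None
```

Because the counts in (7) must be met by several frames at once, a single loud frame such as a knock does not trigger a start point.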
In this step, it is also possible to judge whether an isolated-word start point exists among the frames using only the short-time average magnitude parameter $M(i)$ of several adjacent speech signal frames, that is:
Each time a speech signal frame is received, the following processing is performed on a signal group formed from the received frame and at least one adjacent, previously received speech signal frame, until the isolated-word start point is determined:
Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
When it is, determine as the isolated-word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold.
Like the existing double-threshold speech endpoint detection method, this start-point detection method uses two sets of threshold parameters, high and low, which helps detect the onset of weak-amplitude unvoiced consonants. In addition, because the parameters of 5 consecutive speech signal frames are examined together, burst signals such as knocks are not mistaken for the isolated-word start point.
The end-point detection process in step 204 mirrors the start-point detection process above, except that it uses different threshold parameters and looks for speech signal frames whose parameters fall below the thresholds. It is described briefly as follows:
High and low amplitude thresholds are set:
$$M'_H = \alpha'_H \cdot M_n \tag{9a}$$
$$M'_L = \alpha'_L \cdot M_n \tag{9b}$$
where typical values of $\alpha'_H$ and $\alpha'_L$ are $\alpha'_H = 4$ and $\alpha'_L = 3.2$. The meaning of each parameter is as in formula (5).
High and low speech presence probability thresholds $P'_H$ and $P'_L$ are set, with typical values $P'_H = 0.52$ and $P'_L = 0.45$.
Count the number $C'_{MH}$ of elements in Ma[5] whose value is less than $M'_H$, and the number $C'_{PH}$ of elements in Pa[5] whose value is less than $P'_H$. If the following are satisfied simultaneously:
$$C'_{MH} \ge C'_{MH\_T} \tag{10a}$$
$$C'_{PH} \ge C'_{PH\_T} \tag{10b}$$
then an isolated-word end point is considered to exist within the 5 frames $i-4$ to $i$; otherwise the arrays Ma[5] and Pa[5] are updated and end-point detection continues. Here $C'_{MH\_T}$ and $C'_{PH\_T}$ are the preset frame-count thresholds, with typical values $C'_{MH\_T} = 4$ and $C'_{PH\_T} = 4$.
Once it has been determined that an isolated-word end point exists within the 5 signal frames $i-4$ to $i$, the frames are examined in order beginning with frame $i-4$. If the parameters of some speech signal frame $k$ simultaneously satisfy:
$$Ma[k] < M'_L \tag{11a}$$
$$Pa[k] < P'_L \tag{11b}$$
then the accurate end point is taken to be the $k$-th speech signal frame, where $k$ ranges over frames $i-4$ to $i$. If none of the 5 speech signal frames satisfies (11), the arrays Ma[5] and Pa[5] are updated and end-point detection continues.
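The end-point test of formulas (9) to (11) is the mirror image of the start-point test and can be sketched the same way. As before, `detect_end` and its argument names are hypothetical, the defaults are the typical values quoted above, and the returned value is a window index (0 corresponds to frame $i-4$).

```python
def detect_end(Ma, Pa, m_n, alpha_h=4.0, alpha_l=3.2,
               p_h=0.52, p_l=0.45, c_mh_t=4, c_ph_t=4):
    """Return the window index of the isolated-word end frame, or None.

    Ma, Pa : magnitudes M and speech presence probabilities P of the
             5 most recent frames (index 0 is the oldest, frame i-4).
    m_n    : background noise magnitude from formula (1).
    """
    m_h, m_l = alpha_h * m_n, alpha_l * m_n        # formulas (9a), (9b)
    c_mh = sum(m < m_h for m in Ma)                # count for (10a)
    c_ph = sum(p < p_h for p in Pa)                # count for (10b)
    if c_mh < c_mh_t or c_ph < c_ph_t:
        return None                                # no end in this window
    for k, (m, p) in enumerate(zip(Ma, Pa)):       # oldest frame first
        if m < m_l and p < p_l:                    # formulas (11a), (11b)
            return k
    return None
```

A brief dip inside a word leaves most of the window above the thresholds, so the counts in (10) are not reached and no end point is reported.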
Clearly, in this step it is also possible to judge whether an isolated-word end point exists among the frames using only the short-time average magnitude parameter $M(i)$ of several adjacent speech signal frames, that is:
Each time a speech signal frame is received, the following processing is performed on a signal group formed from the received frame and at least one adjacent, previously received speech signal frame, until the isolated-word end point is determined:
Judge whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
When it is, determine as the isolated-word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.
Because the parameters of 5 consecutive speech signal frames are examined together, this helps avoid detecting a third-tone Chinese character, such as "nine" (九, jiǔ), as two isolated words.
Correspondingly, embodiments of the invention also provide a device for isolated-word speech endpoint detection. Its structure, shown in Fig. 3, comprises: a start-point detection unit 310, a feature parameter calculation unit 320, an end-point detection unit 330, and a judging unit 340.
The start-point detection unit 310 is configured to determine the isolated-word start point among the received speech signal frames;
The feature parameter calculation unit 320 is configured to calculate feature parameters for the speech signal frames received after the isolated-word start point determined by the start-point detection unit 310;
The end-point detection unit 330 is configured to detect the isolated-word end point in the speech signal frames received after the isolated-word start point determined by the start-point detection unit 310, synchronously with the feature parameter calculation performed by the feature parameter calculation unit 320.
The judging unit 340 is configured to judge whether the distance between the isolated-word start point determined by the start-point detection unit 310 and the end point determined by the end-point detection unit 330 is greater than a preset minimum distance;
When the judging unit 340 finds the distance to be less than or equal to the preset minimum distance, the isolated-word end point determined by the end-point detection unit 330 is ignored, and for the speech signal frames received after that end point, the feature parameter calculation unit 320 and the end-point detection unit 330 continue to perform feature parameter calculation and isolated-word end-point detection synchronously.
Preferably, as shown in Fig. 4A, the start-point detection unit 310 comprises: a first combining subunit 311, a first judging subunit 312, and a first start-point determining subunit 313.
The first combining subunit 311 is configured to, each time a speech signal frame is received, combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, this combining continuing until the isolated-word start point is determined;
The first judging subunit 312 is configured to judge whether the number of speech signal frames in the signal group formed by the first combining subunit 311 whose short-time average magnitude parameter is higher than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
The first start-point determining subunit 313 is configured to, when the result of the first judging subunit 312 is greater than or equal to, determine as the isolated-word start point the earliest-received signal frame in the signal group whose short-time average magnitude parameter is higher than the preset low amplitude threshold.
Preferably, as shown in Fig. 5A, the end-point detection unit 330 comprises: a second combining subunit 331, a second judging subunit 332, and a first end-point determining subunit 333.
The second combining subunit 331 is configured to, each time a speech signal frame is received, combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, this combining continuing until the isolated-word end point is determined;
The second judging subunit 332 is configured to judge whether the number of speech signal frames in the signal group formed by the second combining subunit 331 whose short-time average magnitude parameter is lower than the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
The first end-point determining subunit 333 is configured to, when the result of the second judging subunit 332 is greater than or equal to, determine as the isolated-word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is lower than the preset low amplitude threshold.
Preferably, above-mentioned alone word voice endpoint pick-up unit also comprises: short-time spectrum is adjusted processing unit 350.
Short-time spectrum is adjusted processing unit 350, is used for that each the voice signal frame that receives is carried out the short-time spectrum adjustment and handles, and determines that there is probability parameter in the voice of each voice signal frame.
Preferably, as shown in Fig. 4B, the start point detection unit 310 comprises a first combining subunit 311, a third judging subunit 314 and a second start point determining subunit 315.
The first combining subunit 311 is configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group; this combining continues until the isolated word start point is determined.
The third judging subunit 314 is configured to judge whether, in the signal group formed by the first combining subunit 311, the number of speech signal frames whose short-time average magnitude parameter is above the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is above the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold.
The second start point determining subunit 315 is configured, when the third judging subunit 314 finds that both numbers are greater than or equal to the preset frame-count threshold, to determine as the isolated word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is above the preset low amplitude threshold and whose speech presence probability parameter is above the preset low speech presence probability threshold.
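The combined amplitude and speech presence probability test performed by subunits 314 and 315 can be sketched as below. The `Frame` type, the function name and all parameter names are hypothetical and introduced only for illustration; how the speech presence probability itself is obtained is not shown here.

```python
from typing import NamedTuple, Optional, Sequence


class Frame(NamedTuple):
    magnitude: float    # short-time average magnitude of the frame
    speech_prob: float  # speech presence probability of the frame


def find_start_point(group: Sequence[Frame],
                     mag_high: float, mag_low: float,
                     prob_high: float, prob_low: float,
                     min_frame_count: int) -> Optional[int]:
    """Return the position (within the group) of the start-point frame, or None."""
    # Judging step: both counts must reach the frame-count threshold.
    loud = sum(1 for f in group if f.magnitude > mag_high)
    likely_speech = sum(1 for f in group if f.speech_prob > prob_high)
    if loud < min_frame_count or likely_speech < min_frame_count:
        return None

    # Determining step: earliest frame that exceeds both low thresholds.
    for position, f in enumerate(group):
        if f.magnitude > mag_low and f.speech_prob > prob_low:
            return position
    return None  # assumption: no qualifying frame means no start point yet
```

Because a single loud frame cannot satisfy the frame-count condition on its own, an isolated spike surrounded by quiet frames yields `None` rather than a start point.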
Preferably, as shown in Fig. 5B, the end point detection unit 330 comprises a second combining subunit 331, a fourth judging subunit 334 and a second end point determining subunit 335.
The second combining subunit 331 is configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group; this combining continues until the isolated word end point is determined.
The fourth judging subunit 334 is configured to judge whether, in the signal group formed by the second combining subunit 331, the number of speech signal frames whose short-time average magnitude parameter is below the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is below the preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold.
The second end point determining subunit 335 is configured, when the fourth judging subunit 334 finds that both numbers are greater than or equal to the preset frame-count threshold, to determine as the isolated word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is below the preset low amplitude threshold and whose speech presence probability parameter is below the preset low speech presence probability threshold.
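The frame-by-frame behaviour of subunits 331, 334 and 335 can be sketched as a small streaming detector. This is an illustration under stated assumptions: the class name, the fixed group size and the use of a bounded `collections.deque` as the signal group are choices made here, not details given in the text.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class EndPointDetector:
    """Keeps the most recent frames as the signal group and tests it on every push."""

    def __init__(self, group_size: int, mag_high: float, mag_low: float,
                 prob_high: float, prob_low: float, min_frame_count: int) -> None:
        # Each entry is (frame_index, magnitude, speech_prob); old frames drop out.
        self.group: Deque[Tuple[int, float, float]] = deque(maxlen=group_size)
        self.mag_high, self.mag_low = mag_high, mag_low
        self.prob_high, self.prob_low = prob_high, prob_low
        self.min_frame_count = min_frame_count

    def push(self, frame_index: int, magnitude: float,
             speech_prob: float) -> Optional[int]:
        """Add one received frame; return the end-point frame index or None."""
        self.group.append((frame_index, magnitude, speech_prob))
        quiet = sum(1 for _, m, _ in self.group if m < self.mag_high)
        unlikely = sum(1 for _, _, p in self.group if p < self.prob_high)
        if quiet < self.min_frame_count or unlikely < self.min_frame_count:
            return None
        # Earliest frame in the group that is below both low thresholds.
        for index, m, p in self.group:
            if m < self.mag_low and p < self.prob_low:
                return index
        return None
```

Only the last `group_size` parameter tuples are held in memory, which is in line with the stated goal of not buffering large amounts of speech data.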
In the technical scheme proposed by the embodiments of the invention, once the isolated word start point has been detected, detection of the isolated word end point is carried out while the feature parameters are being calculated, and calculation of the feature parameters stops as soon as the isolated word end point is determined, so there is no need to buffer a large amount of speech data. Because the relevant parameters of several adjacent speech signal frames are used to detect the isolated word endpoints, a burr (spike) signal is effectively prevented from being taken as the isolated word start point, and a single isolated word is not mistaken for two or three separate isolated words. In addition, each speech signal frame undergoes short-time spectrum adjustment processing, and the resulting speech presence probability parameter is also used as one of the endpoint detection parameters, which effectively improves the accuracy and robustness of endpoint detection in the presence of uncertain background noise.
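The overall flow summarized above, in which feature calculation and end point detection run together on each frame received after the start point is determined, can be sketched as follows. All names are hypothetical. The detectors and the feature function are passed in as callables because their internals are described elsewhere, and the short-time average magnitude is taken here to be the mean absolute sample value of a frame, which is a common definition assumed for illustration.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Samples = Sequence[float]


def short_time_average_magnitude(samples: Samples) -> float:
    """Mean absolute sample value of one frame (assumed definition)."""
    return sum(abs(s) for s in samples) / len(samples)


def run_pipeline(frames: Iterable[Samples],
                 detect_start: Callable[[int, float], Optional[int]],
                 detect_end: Callable[[int, float], Optional[int]],
                 compute_features: Callable[[Samples], List[float]],
                 ) -> Tuple[Optional[int], Optional[int], List[List[float]]]:
    """Process frames one at a time; raw samples are never accumulated."""
    start: Optional[int] = None
    end: Optional[int] = None
    features: List[List[float]] = []
    for index, samples in enumerate(frames):
        magnitude = short_time_average_magnitude(samples)
        if start is None:
            # Still searching for the start point; nothing is stored.
            start = detect_start(index, magnitude)
            continue
        # After the start point: compute features and test for the end point
        # on the same frame, then discard the samples.
        features.append(compute_features(samples))
        end = detect_end(index, magnitude)
        if end is not None:
            break  # feature calculation stops once the end point is known
    return start, end, features
```

Only the per-frame feature vectors are kept, so memory use grows with the length of the word's features rather than with the amount of raw audio received.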
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims (12)

1. A method for isolated word speech endpoint detection, characterized by comprising the steps of:
each time a speech signal frame is received, performing the following processing on a signal group formed by combining the received frame with at least one adjacent, previously received speech signal frame, until the isolated word start point is determined: judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is above a preset high amplitude threshold is greater than or equal to a preset frame-count threshold; and, when it is, determining as the isolated word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is above a preset low amplitude threshold;
performing calculation of feature parameters and detection of the isolated word end point synchronously on the speech signal frames received after the isolated word start point is determined.
2. The method of claim 1, characterized by further comprising the steps of:
judging whether the distance between the determined isolated word start point and the determined isolated word end point is greater than a preset minimum distance value;
when the distance is less than or equal to the preset minimum distance value, ignoring the determined isolated word end point and continuing to perform calculation of feature parameters and detection of the isolated word end point synchronously on the speech signal frames received after that end point.
3. The method of claim 1 or 2, characterized in that the process of detecting the isolated word end point comprises:
each time a speech signal frame is received, performing the following processing on a signal group formed by combining the received frame with at least one adjacent, previously received speech signal frame, until the isolated word end point is determined:
judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is below the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
when it is, determining as the isolated word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is below the preset low amplitude threshold.
4. The method of claim 1 or 2, characterized by further comprising the step of:
performing short-time spectrum adjustment processing on each received speech signal frame and determining the speech presence probability parameter of each speech signal frame.
5. The method of claim 4, characterized in that the process of determining the isolated word start point comprises:
each time a speech signal frame is received, performing the following processing on a signal group formed by combining the received frame with at least one adjacent, previously received speech signal frame, until the isolated word start point is determined:
judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is above the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is above a preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;
when both are, determining as the isolated word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is above the preset low amplitude threshold and whose speech presence probability parameter is above a preset low speech presence probability threshold.
6. The method of claim 4, characterized in that the process of detecting the isolated word end point comprises:
each time a speech signal frame is received, performing the following processing on a signal group formed by combining the received frame with at least one adjacent, previously received speech signal frame, until the isolated word end point is determined:
judging whether the number of speech signal frames in the signal group whose short-time average magnitude parameter is below the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is below a preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;
when both are, determining as the isolated word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is below the preset low amplitude threshold and whose speech presence probability parameter is below a preset low speech presence probability threshold.
7. A device for isolated word speech endpoint detection, characterized by comprising:
a start point detection unit, configured to determine the isolated word start point in received speech signal frames;
a feature parameter calculation unit, configured to calculate feature parameters for the speech signal frames received after the start point detection unit determines the isolated word start point;
an end point detection unit, configured to detect the isolated word end point in the speech signal frames received after the start point detection unit determines the isolated word start point, the detection of the isolated word end point being performed synchronously with the calculation of feature parameters by the feature parameter calculation unit;
wherein the start point detection unit comprises:
a first combining subunit, configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the isolated word start point is determined;
a first judging subunit, configured to judge whether, in the signal group formed by the first combining subunit, the number of speech signal frames whose short-time average magnitude parameter is above a preset high amplitude threshold is greater than or equal to a preset frame-count threshold;
a first start point determining subunit, configured, when the first judging subunit finds that the number is greater than or equal to the preset frame-count threshold, to determine as the isolated word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is above a preset low amplitude threshold.
8. The device of claim 7, characterized by further comprising:
a judging unit, configured to judge whether the distance between the isolated word start point determined by the start point detection unit and the isolated word end point determined by the end point detection unit is greater than a preset minimum distance value;
wherein, when the judging unit finds that the distance is less than or equal to the preset minimum distance value, the isolated word end point determined by the end point detection unit is ignored, and the feature parameter calculation unit and the end point detection unit continue to perform calculation of feature parameters and detection of the isolated word end point synchronously on the speech signal frames received after that end point.
9. The device of claim 7 or 8, characterized in that the end point detection unit comprises:
a second combining subunit, configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the isolated word end point is determined;
a second judging subunit, configured to judge whether, in the signal group formed by the second combining subunit, the number of speech signal frames whose short-time average magnitude parameter is below the preset high amplitude threshold is greater than or equal to the preset frame-count threshold;
a first end point determining subunit, configured, when the second judging subunit finds that the number is greater than or equal to the preset frame-count threshold, to determine as the isolated word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is below the preset low amplitude threshold.
10. The device of claim 7 or 8, characterized by further comprising:
a short-time spectrum adjustment processing unit, configured to perform short-time spectrum adjustment processing on each received speech signal frame and to determine the speech presence probability parameter of each speech signal frame.
11. The device of claim 10, characterized in that the start point detection unit comprises:
a first combining subunit, configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the isolated word start point is determined;
a third judging subunit, configured to judge whether, in the signal group formed by the first combining subunit, the number of speech signal frames whose short-time average magnitude parameter is above the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is above a preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;
a second start point determining subunit, configured, when the third judging subunit finds that both numbers are greater than or equal to the preset frame-count threshold, to determine as the isolated word start point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is above the preset low amplitude threshold and whose speech presence probability parameter is above a preset low speech presence probability threshold.
12. The device of claim 10, characterized in that the end point detection unit comprises:
a second combining subunit, configured, each time a speech signal frame is received, to combine the received frame with at least one adjacent, previously received speech signal frame into a signal group, the combining continuing until the isolated word end point is determined;
a fourth judging subunit, configured to judge whether, in the signal group formed by the second combining subunit, the number of speech signal frames whose short-time average magnitude parameter is below the preset high amplitude threshold is greater than or equal to the preset frame-count threshold, and whether the number of speech signal frames whose speech presence probability parameter is below a preset high speech presence probability threshold is greater than or equal to the preset frame-count threshold;
a second end point determining subunit, configured, when the fourth judging subunit finds that both numbers are greater than or equal to the preset frame-count threshold, to determine as the isolated word end point the earliest-received speech signal frame in the signal group whose short-time average magnitude parameter is below the preset low amplitude threshold and whose speech presence probability parameter is below a preset low speech presence probability threshold.
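
The minimum-distance check recited in claims 2 and 8 can be sketched as below. This is a hedged illustration only: the function name is hypothetical, and measuring the distance as a difference of frame indices is an assumption, since the claims do not state the unit of distance.

```python
from typing import Iterable, Optional


def first_valid_end_point(start_index: int,
                          candidate_end_points: Iterable[int],
                          min_distance: int) -> Optional[int]:
    """Return the first candidate end point that lies far enough from the start.

    `candidate_end_points` yields, in order, the frame indices that the end
    point detection reports as processing continues.
    """
    for end_index in candidate_end_points:
        if end_index - start_index > min_distance:
            return end_index  # distance exceeds the minimum: accept
        # Distance is less than or equal to the minimum: ignore this end
        # point and keep detecting on the frames that follow it.
    return None
```

With a start point at frame 10 and a minimum distance of 5 frames, candidate end points at frames 12 and 14 are ignored and a candidate at frame 40 is accepted.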
CN2007101793424A 2007-12-12 2007-12-12 Method and system for testing alone word voice endpoint Expired - Fee Related CN101206858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101793424A CN101206858B (en) 2007-12-12 2007-12-12 Method and system for testing alone word voice endpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101793424A CN101206858B (en) 2007-12-12 2007-12-12 Method and system for testing alone word voice endpoint

Publications (2)

Publication Number Publication Date
CN101206858A CN101206858A (en) 2008-06-25
CN101206858B true CN101206858B (en) 2011-07-13

Family

ID=39566997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101793424A Expired - Fee Related CN101206858B (en) 2007-12-12 2007-12-12 Method and system for testing alone word voice endpoint

Country Status (1)

Country Link
CN (1) CN101206858B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029130A (en) * 1996-08-20 2000-02-22 Ricoh Company, Ltd. Integrated endpoint detection for improved speech recognition method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110713

Termination date: 20111212