CN106340310A - Speech detection method and device - Google Patents

Speech detection method and device

Info

Publication number
CN106340310A
CN106340310A
Authority
CN
China
Prior art keywords
acoustic image
speech detection
image rule
present frame
characteristic vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510401974.5A
Other languages
Chinese (zh)
Other versions
CN106340310B (en)
Inventor
孙廷玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN201510401974.5A priority Critical patent/CN106340310B/en
Publication of CN106340310A publication Critical patent/CN106340310A/en
Application granted granted Critical
Publication of CN106340310B publication Critical patent/CN106340310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a speech detection method and device. The speech detection method includes: framing the sound data corresponding to an input sound signal to obtain a plurality of sound frames; calculating a feature vector of the current frame, the feature vector including a wide-window energy difference, a narrow-window energy difference and a zero-crossing difference; matching the feature vector of the current frame against preset fuzzy acoustic-image rules to obtain a corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples; and, if the calculated speech detection score is greater than a score threshold, detecting the sound data corresponding to the current frame. The scheme improves the speed of speech detection and reduces the cost of speech detection.

Description

Speech detection method and device
Technical field
The present invention relates to the technical field of speech detection, and in particular to a speech detection method and device.
Background
A mobile terminal is a computer device used while on the move. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capability and have evolved from simple calling tools into integrated information processing platforms, which opens up broader development space for them.
A traditional mobile terminal usually requires manual operation by the user, so the user has to devote a certain amount of attention to it. Speech detection methods and always-listening systems make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a sound signal, the speech detection system is activated and the detected sound signal is analysed; the mobile terminal then performs the operation corresponding to the detected sound signal. For example, when the user speaks "dial xx's mobile phone", the mobile terminal detects the voice information "dial xx's mobile phone" and, after correct detection, obtains xx's phone number from the mobile terminal and dials it.
However, speech detection methods in the prior art generally use complex mathematical models to detect the input sound signal, and therefore suffer from slow detection speed and high cost.
Summary of the invention
The problem solved by the embodiments of the present invention is how to improve the speed of speech detection and reduce the cost of speech detection.
To solve the above problem, an embodiment of the present invention provides a speech detection method, the speech detection method comprising:
framing the sound data corresponding to an input sound signal to obtain a plurality of sound frames;
calculating a feature vector of the current frame, the feature vector including a wide-window energy difference, a narrow-window energy difference and a zero-crossing difference;
matching the feature vector of the current frame against preset fuzzy acoustic-image rules to obtain a corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples; and
when the calculated speech detection score is greater than a score threshold, detecting the sound data corresponding to the current frame.
Optionally, the fuzzy acoustic-image rules are obtained by training on the sound training samples using a combination of a neural network algorithm and a genetic algorithm.
Optionally, the fuzzy acoustic-image rules include first-type fuzzy acoustic-image rules and second-type fuzzy acoustic-image rules, and matching the feature vector of the current frame against the preset fuzzy acoustic-image rules to obtain the corresponding speech detection score comprises:
when it is determined that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score obtained for the current frame is 0;
when it is determined that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score obtained for the current frame is 1.
Optionally, the score threshold is the signal-to-noise ratio of the sound data of the current frame.
An embodiment of the present invention further provides a speech detection device, the device comprising:
a framing unit, adapted to frame the sound data corresponding to an input sound signal to obtain a plurality of sound frames;
a calculation unit, adapted to calculate a feature vector of the current frame, the feature vector including a wide-window energy difference, a narrow-window energy difference and a zero-crossing difference;
a matching unit, adapted to match the feature vector of the current frame against preset fuzzy acoustic-image rules to obtain a corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples; and
a detection unit, adapted to detect the sound data corresponding to the current frame when the calculated speech detection score is greater than a score threshold.
Optionally, the fuzzy acoustic-image rules are obtained by training on the sound training samples using a combination of a neural network algorithm and a genetic algorithm.
Optionally, the fuzzy acoustic-image rules include first-type fuzzy acoustic-image rules and second-type fuzzy acoustic-image rules; the matching unit obtains a speech detection score of 0 for the current frame when it determines that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, and obtains a speech detection score of 1 for the current frame when it determines that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule.
Optionally, the score threshold is the signal-to-noise ratio of the sound data of the current frame.
Compared with the prior art, the technical solution of the present invention has the following advantages:
In the above scheme, the speech detection score of the feature vector of each sound frame is calculated through the preset fuzzy acoustic-image rules to determine whether to detect the input sound signal. Because the fuzzy acoustic-image rules are only used to detect whether the current frame contains speech information, without caring about the specific content of the speech data in the current frame, the speed of speech detection can be improved and the cost of speech detection reduced.
Further, the calculated speech detection score of the current frame is compared with the signal-to-noise ratio of the current frame, and when the speech detection score of the current frame is greater than the signal-to-noise ratio of the current frame, it is determined that the current frame contains speech data. Because the signal-to-noise ratio of the current frame accurately reflects the background noise contained in the current frame, the accuracy of speech detection can be improved and the user experience enhanced.
Brief description of the drawings
Fig. 1 is a flow chart of a speech detection method in an embodiment of the present invention;
Fig. 2 is a flow chart of another speech detection method in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech detection device in an embodiment of the present invention.
Detailed description
Always-listening systems in the prior art use voice activity detection (VAD) technology to detect sound. Existing voice activity detection methods, however, usually need to train a mathematical model for speech detection and use it to detect the input sound data. Because such mathematical models are relatively complex, the speech detection process is complicated, and detection is therefore slow and costly.
To solve the above problems in the prior art, the technical solution adopted in the embodiments of the present invention calculates the speech detection score of the feature vector of each sound frame through preset fuzzy acoustic-image rules to determine whether to detect the input sound signal, which can improve the speed of speech detection and reduce the cost of speech detection.
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a speech detection method in an embodiment of the present invention. The speech detection method shown in Fig. 1 may include:
Step S101: framing the sound data corresponding to the input sound signal to obtain a plurality of sound frames.
In a specific implementation, a microphone (mic) may be used to collect an external sound signal. When a sound signal is collected, it is processed to obtain the corresponding sound data, and the obtained sound data is divided into two or more frames.
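A minimal sketch of this framing step is given below, assuming mono PCM input. The frame length (20 ms) and the choice of non-overlapping frames are illustrative assumptions, since the patent only requires that the sound data be divided into two or more frames.

import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=20):
    """Split a 1-D array of PCM samples into consecutive, non-overlapping frames.

    The frame length (20 ms) and the absence of overlap are illustrative
    assumptions; the patent only requires dividing the sound data into
    two or more frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Drop the trailing samples that do not fill a complete frame.
    return np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))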
Step S102: calculating the feature vector of the current frame, the feature vector including the wide-window energy difference, the narrow-window energy difference and the zero-crossing difference.
In a specific implementation, after the sound data has been framed into two or more frames, the feature vector of each frame is calculated frame by frame in time order, and whether speech detection is performed on a frame is determined according to its feature vector. For ease of description, the frame whose feature vector is currently being calculated is referred to as the current frame.
Step S103: matching the feature vector of the current frame against the preset fuzzy acoustic-image rules to obtain the corresponding speech detection score.
In a specific implementation, the fuzzy acoustic-image rules are obtained by training on sound training samples and comprise a plurality of acoustic-image rules, i.e. a set of acoustic-image rules. Each acoustic-image rule carries a corresponding decision score. When the feature vector of the current frame matches any one of the fuzzy acoustic-image rules, the decision score of the matching rule is taken as the speech detection score of the current frame.
Step S104: when the calculated speech detection score is greater than the score threshold, detecting the sound data corresponding to the current frame.
In a specific implementation, the score threshold may be fixed or may vary from frame to frame; those skilled in the art can set it according to actual needs.
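The per-frame flow of steps S101 to S104 can be summarised by the following sketch. The helper callables compute_feature_vector, match_fuzzy_rules and score_threshold are placeholders introduced here for illustration; they stand for step S102, step S103 and the threshold of step S104 respectively.

def detect_speech(frames, compute_feature_vector, match_fuzzy_rules, score_threshold):
    """Return the indices of frames whose speech detection score exceeds the threshold.

    compute_feature_vector(frame) -> (d_ew, d_en, d_z)
    match_fuzzy_rules(features)   -> 0 or 1 (decision score of the matching rule)
    score_threshold(frame)        -> per-frame threshold, e.g. the frame SNR
    """
    detected = []
    for i, frame in enumerate(frames):
        features = compute_feature_vector(frame)   # step S102
        score = match_fuzzy_rules(features)        # step S103
        if score > score_threshold(frame):         # step S104
            detected.append(i)                     # frame contains speech data
    return detected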
The speech detection method in the embodiment of the present invention is described in more detail below with reference to Fig. 2.
Step S201: framing the sound data corresponding to the input sound signal to obtain a plurality of sound frames.
In a specific implementation, a microphone (mic) may be used to collect an external sound signal. When a sound signal is collected, it is processed to obtain the corresponding sound data, and the obtained sound data is divided into two or more frames.
Step S202: calculating the feature vector of the current frame.
In a specific implementation, the feature vector of the current frame includes the wide-window energy difference, the narrow-window energy difference and the zero-crossing difference, which can be calculated by the following formulas respectively:
δew = ew - ewa (1)
δen = en - ena (2)
δz = z - za (3)
where δew denotes the wide-window energy difference, ew denotes the wide-window energy of the current frame and ewa denotes the long-term average of the wide-window energy; δen denotes the narrow-window energy difference, en denotes the narrow-window energy of the current frame and ena denotes the long-term average of the narrow-window energy; δz denotes the zero-crossing difference, z denotes the zero-crossing energy of the current frame and za denotes the long-term average of the zero-crossing energy.
Because the wide-window energy difference δew, the narrow-window energy difference δen and the zero-crossing difference δz are all calculated in the time domain rather than the frequency domain, computing resources can be saved and the speed of speech detection improved.
It should be pointed out that, in order to reflect changes in the background noise, the long-term average ewa of the wide-window energy, the long-term average ena of the narrow-window energy and the long-term average za of the zero-crossing energy are updated only when noise is present.
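The three features of formulas (1) to (3) can be computed entirely in the time domain, as the following sketch illustrates. The window sizes, the exponential update of the long-term averages and the external is_noise flag are assumptions introduced here, since the patent does not specify them numerically.

import numpy as np

class FeatureExtractor:
    """Time-domain features of formulas (1) to (3): d_ew, d_en, d_z.

    Window sizes and the smoothing factor are illustrative assumptions.
    """

    def __init__(self, wide_frames=20, narrow_frames=4, alpha=0.05):
        self.wide = []                         # recent frame energies, wide window
        self.narrow = []                       # recent frame energies, narrow window
        self.wide_frames = wide_frames
        self.narrow_frames = narrow_frames
        self.alpha = alpha
        self.ewa = self.ena = self.za = 0.0    # long-term averages

    def process(self, frame, is_noise):
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        zero_crossings = float(np.sum(np.abs(np.diff(np.sign(frame)))) / 2)

        self.wide = (self.wide + [energy])[-self.wide_frames:]
        self.narrow = (self.narrow + [energy])[-self.narrow_frames:]
        ew = float(np.mean(self.wide))         # wide-window energy
        en = float(np.mean(self.narrow))       # narrow-window energy

        d_ew = ew - self.ewa                   # formula (1)
        d_en = en - self.ena                   # formula (2)
        d_z = zero_crossings - self.za         # formula (3)

        # The long-term averages are updated only when the frame is noise.
        if is_noise:
            self.ewa += self.alpha * (ew - self.ewa)
            self.ena += self.alpha * (en - self.ena)
            self.za += self.alpha * (zero_crossings - self.za)

        return d_ew, d_en, d_z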
Step S203: matching the feature vector of the current frame against the preset fuzzy acoustic-image rules, and judging whether the feature vector of the current frame matches a first-type fuzzy acoustic-image rule or a second-type fuzzy acoustic-image rule of the fuzzy acoustic-image rules; when it is determined that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule, step S204 is executed; when it is determined that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule, step S205 is executed.
In a specific implementation, grammatical expressions in human language define empirical rules, through which people can express a problem in language using their own experience. The preset fuzzy acoustic-image rules in the embodiment of the present invention are precisely a set of acoustic-image rules obtained manually according to heuristic criteria derived from knowledge of the problem.
In a specific implementation, a genetic algorithm and a neural network algorithm are combined to train on the sound training samples to obtain the fuzzy acoustic-image rules. A genetic algorithm can search for the global optimum of a function of several variables, is quite flexible and is insensitive to local-optimum problems, and therefore has good robustness. A neural network algorithm can reduce the execution time and the error rate of the genetic algorithm. Moreover, the genetic algorithm and the neural network algorithm require relatively little computation when training on the sound samples, which saves computing resources.
Training on the sound samples with the genetic algorithm and the neural network algorithm yields a series of acoustic-image sequences, each associated with the variables of the feature vector (the wide-window energy difference δew, the narrow-window energy difference δen and the zero-crossing difference δz). A decision score is then added manually to each output sequence to obtain the final fuzzy acoustic-image rules: when the sound training sample corresponding to an acoustic-image sequence contains speech, the decision score added to the sequence is 1, giving a second-type fuzzy acoustic-image rule; otherwise, the decision score added is 0, giving a first-type fuzzy acoustic-image rule. Table 1 shows an example of the fuzzy acoustic-image rules in the embodiment of the present invention:
Table 1
In Table 1, each feature of the feature vector is mapped to one of three acoustic-image symbols according to its value: the first symbol indicates that the value of the corresponding feature is large, the second symbol indicates that the value is medium, and the symbol △ indicates that the value is small.
In a specific implementation, when a calculated feature of the current frame falls into the corresponding interval, the corresponding acoustic-image symbol is assigned, and the symbols together form the acoustic-image sequence of the current frame. The acoustic-image sequence of the current frame is then compared against the preset fuzzy acoustic-image rules.
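A minimal sketch of this matching step follows. The symbols 'S', 'M' and 'L' stand in for the three acoustic-image symbols of Table 1, and the quantisation thresholds and example rule entries are assumptions for illustration only; in the patent, the rules come from training on sound samples.

def quantize(value, low, high):
    """Map a feature value to a symbol: 'S' (small), 'M' (medium) or 'L' (large).

    The thresholds low/high are illustrative; the patent does not specify them.
    """
    if value < low:
        return 'S'
    if value > high:
        return 'L'
    return 'M'

# Example rule set: acoustic-image sequence (d_ew, d_en, d_z) -> decision score.
# A score of 0 corresponds to a first-type rule (no speech), a score of 1 to a
# second-type rule (speech). These entries are made up for illustration only.
FUZZY_RULES = {
    ('S', 'S', 'S'): 0,
    ('L', 'L', 'M'): 1,
    ('L', 'M', 'L'): 1,
}

def match_rules(d_ew, d_en, d_z, low=0.0, high=1.0):
    """Form the acoustic-image sequence of the current frame and look it up."""
    sequence = (quantize(d_ew, low, high),
                quantize(d_en, low, high),
                quantize(d_z, low, high))
    # Unmatched sequences default to 0 (treated as non-speech) in this sketch.
    return FUZZY_RULES.get(sequence, 0)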
Step S204: when it is determined that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, outputting a speech detection score of 0 for the current frame.
In a specific implementation, as shown in Table 1, the decision score in a first-type fuzzy acoustic-image rule is 0; therefore, when it is determined that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score output for the current frame is 0.
Step S205: when it is determined that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, outputting a speech detection score of 1 for the current frame.
In a specific implementation, as shown in Table 1, the decision score in a second-type fuzzy acoustic-image rule is 1; therefore, when it is determined that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score output for the current frame is 1.
Step S206: comparing the speech detection score of the current frame with the signal-to-noise ratio of the current frame and judging whether the speech detection score of the current frame is greater than the signal-to-noise ratio of the current frame; if so, step S207 may be executed; otherwise, no operation is performed.
In a specific implementation, the signal-to-noise ratio (SNR) of each frame reflects the ratio of the speech signal to the noise, and it satisfies the following conditions:
(1) it is evenly distributed over all possible numerical intervals;
(2) it varies slowly between adjacent frames;
(3) different optimal thresholds take different values;
(4) its magnitude can be minimized according to the signal-to-noise ratio.
Therefore, given that the signal-to-noise ratio satisfies the above conditions, using the signal-to-noise ratio of the current frame as the score threshold and comparing it with the speech detection score of the current frame to determine whether to perform speech detection on the current frame can improve the accuracy of speech detection.
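A sketch of step S206 under stated assumptions is given below. The patent does not say how the frame SNR is estimated or scaled, so the noise-floor-based estimate and the normalisation of the SNR into [0, 1] (so that it is comparable with the 0/1 speech detection score) are assumptions introduced here.

import numpy as np

def normalized_snr(frame, noise_floor, eps=1e-12):
    """A rough per-frame SNR estimate mapped into [0, 1].

    noise_floor is an externally tracked estimate of the noise energy.
    Both the estimator and the normalisation are illustrative assumptions;
    the patent only states that the frame SNR is used as the score threshold.
    """
    energy = float(np.mean(frame.astype(np.float64) ** 2))
    snr_db = 10.0 * np.log10((energy + eps) / (noise_floor + eps))
    # Map roughly 0..30 dB into 0..1 so the threshold is comparable
    # with the 0/1 speech detection score.
    return float(np.clip(snr_db / 30.0, 0.0, 1.0))

def should_detect(score, frame, noise_floor):
    """Step S206: run speech detection on the frame only if score > frame SNR."""
    return score > normalized_snr(frame, noise_floor)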
Step S207: performing speech detection on the current frame.
The speech detection device corresponding to the speech detection method in the embodiment of the present invention is described in further detail below with reference to Fig. 3.
Fig. 3 shows a schematic structural diagram of a speech detection device in an embodiment of the present invention. The speech detection device 300 shown in Fig. 3 may include a framing unit 301, a calculation unit 302, a matching unit 303 and a detection unit 304, wherein:
the framing unit 301 is adapted to frame the sound data corresponding to the input sound signal to obtain a plurality of sound frames;
the calculation unit 302 is adapted to calculate the feature vector of the current frame, the feature vector including the wide-window energy difference, the narrow-window energy difference and the zero-crossing difference;
the matching unit 303 is adapted to match the feature vector of the current frame against the preset fuzzy acoustic-image rules to obtain the corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples, specifically by training on the sound training samples using a combination of a neural network algorithm and a genetic algorithm.
In a specific implementation, the fuzzy acoustic-image rules include first-type fuzzy acoustic-image rules and second-type fuzzy acoustic-image rules; the matching unit 303 obtains a speech detection score of 0 for the current frame when it determines that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, and obtains a speech detection score of 1 for the current frame when it determines that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule.
The detection unit 304 is adapted to detect the sound data corresponding to the current frame when the calculated speech detection score is greater than the score threshold. To improve the accuracy of speech detection, the score threshold is the signal-to-noise ratio of the sound data of the current frame.
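As an illustration of how the units of device 300 fit together, a minimal object-oriented sketch follows. The unit interfaces (plain callables) are assumptions introduced here, mirroring the earlier sketches, and do not reflect any particular implementation of the patent.

class SpeechDetectionDevice:
    """Sketch of device 300: framing unit 301, calculation unit 302,
    matching unit 303 and detection unit 304, wired together.

    The callables passed in stand for the four units; their exact
    interfaces are assumptions for illustration.
    """

    def __init__(self, framing_unit, calculation_unit, matching_unit, detection_unit):
        self.framing_unit = framing_unit          # sound data -> frames
        self.calculation_unit = calculation_unit  # frame -> (d_ew, d_en, d_z)
        self.matching_unit = matching_unit        # features -> 0 or 1
        self.detection_unit = detection_unit      # (frame, score) -> result or None

    def run(self, sound_data):
        results = []
        for frame in self.framing_unit(sound_data):
            features = self.calculation_unit(frame)
            score = self.matching_unit(*features)
            result = self.detection_unit(frame, score)
            if result is not None:
                results.append(result)
        return results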
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium; the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall therefore be defined by the claims.

Claims (8)

1. A speech detection method, characterised by comprising:
framing the sound data corresponding to an input sound signal to obtain a plurality of sound frames;
calculating a feature vector of the current frame, the feature vector including a wide-window energy difference, a narrow-window energy difference and a zero-crossing difference;
matching the feature vector of the current frame against preset fuzzy acoustic-image rules to obtain a corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples; and
when the calculated speech detection score is greater than a score threshold, detecting the sound data corresponding to the current frame.
2. The speech detection method according to claim 1, characterised in that the fuzzy acoustic-image rules are obtained by training on the sound training samples using a combination of a neural network algorithm and a genetic algorithm.
3. The speech detection method according to claim 1, characterised in that the fuzzy acoustic-image rules include first-type fuzzy acoustic-image rules and second-type fuzzy acoustic-image rules, and matching the feature vector of the current frame against the preset fuzzy acoustic-image rules to obtain the corresponding speech detection score comprises:
when it is determined that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score obtained for the current frame is 0;
when it is determined that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, the speech detection score obtained for the current frame is 1.
4. The speech detection method according to claim 1, characterised in that the score threshold is the signal-to-noise ratio of the sound data of the current frame.
5. A speech detection device, characterised by comprising:
a framing unit, adapted to frame the sound data corresponding to an input sound signal to obtain a plurality of sound frames;
a calculation unit, adapted to calculate a feature vector of the current frame, the feature vector including a wide-window energy difference, a narrow-window energy difference and a zero-crossing difference;
a matching unit, adapted to match the feature vector of the current frame against preset fuzzy acoustic-image rules to obtain a corresponding speech detection score, the fuzzy acoustic-image rules being obtained by training on sound training samples; and
a detection unit, adapted to detect the sound data corresponding to the current frame when the calculated speech detection score is greater than a score threshold.
6. The speech detection device according to claim 5, characterised in that the fuzzy acoustic-image rules are obtained by training on the sound training samples using a combination of a neural network algorithm and a genetic algorithm.
7. The speech detection device according to claim 5, characterised in that the fuzzy acoustic-image rules include first-type fuzzy acoustic-image rules and second-type fuzzy acoustic-image rules, and the matching unit obtains a speech detection score of 0 for the current frame when it determines that the feature vector of the current frame matches a first-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules, and obtains a speech detection score of 1 for the current frame when it determines that the feature vector of the current frame matches a second-type fuzzy acoustic-image rule of the preset fuzzy acoustic-image rules.
8. The speech detection device according to claim 5, characterised in that the score threshold is the signal-to-noise ratio of the sound data of the current frame.
CN201510401974.5A 2015-07-09 2015-07-09 Speech detection method and device Active CN106340310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510401974.5A CN106340310B (en) 2015-07-09 2015-07-09 Speech detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510401974.5A CN106340310B (en) 2015-07-09 2015-07-09 Speech detection method and device

Publications (2)

Publication Number Publication Date
CN106340310A true CN106340310A (en) 2017-01-18
CN106340310B CN106340310B (en) 2019-06-07

Family

ID=57827293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510401974.5A Active CN106340310B (en) 2015-07-09 2015-07-09 Speech detection method and device

Country Status (1)

Country Link
CN (1) CN106340310B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055504A1 (en) * 2002-10-29 2007-03-08 Chu Wai C Optimized windows and interpolation factors, and methods for optimizing windows, interpolation factors and linear prediction analysis in the ITU-T G.729 speech coding standard
US20040181403A1 (en) * 2003-03-14 2004-09-16 Chien-Hua Hsu Coding apparatus and method thereof for detecting audio signal transient
CN1763844A (en) * 2004-10-18 2006-04-26 中国科学院声学研究所 End-point detecting method, device and speech recognition system based on moving window
WO2006114101A1 (en) * 2005-04-26 2006-11-02 Aalborg Universitet Detection of speech present in a noisy signal and speech enhancement making use thereof
CN102405495A (en) * 2009-03-11 2012-04-04 谷歌公司 Audio classification for information retrieval using sparse features
US20100268533A1 (en) * 2009-04-17 2010-10-21 Samsung Electronics Co., Ltd. Apparatus and method for detecting speech
CN101937675A (en) * 2009-06-29 2011-01-05 展讯通信(上海)有限公司 Voice detection method and equipment thereof
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447501A (en) * 2018-03-27 2018-08-24 中南大学 Pirate video detection method and system based on audio word under a kind of cloud storage environment
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
CN111862985B (en) * 2019-05-17 2024-05-31 北京嘀嘀无限科技发展有限公司 Speech recognition device, method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106340310B (en) 2019-06-07

Similar Documents

Publication Publication Date Title
CN108899044B (en) Voice signal processing method and device
CN110288978B (en) Speech recognition model training method and device
CN113113039B (en) Noise suppression method and device and mobile terminal
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN109087669B (en) Audio similarity detection method and device, storage medium and computer equipment
CN106486131B (en) A kind of method and device of speech de-noising
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN103065631B (en) A kind of method of speech recognition, device
CN103971680B (en) A kind of method, apparatus of speech recognition
CN111210021A (en) Audio signal processing method, model training method and related device
CN105976812A (en) Voice identification method and equipment thereof
CN107610706A (en) The processing method and processing unit of phonetic search result
CN110335593A (en) Sound end detecting method, device, equipment and storage medium
CN110600008A (en) Voice wake-up optimization method and system
CN107274892A (en) Method for distinguishing speek person and device
CN109688271A (en) The method, apparatus and terminal device of contact information input
CN106024017A (en) Voice detection method and device
CN110931028A (en) Voice processing method and device and electronic equipment
CN106340310A (en) Speech detection method and device
CN110895930B (en) Voice recognition method and device
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
CN113064118A (en) Sound source positioning method and device
CN116364107A (en) Voice signal detection method, device, equipment and storage medium
CN105788590A (en) Speech recognition method, device, mobile terminal
CN110537223B (en) Voice detection method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant