CN102201231B

CN102201231B - Voice sensing method

Info

Publication number: CN102201231B
Application number: CN201010139851A
Authority: CN
Inventors: 林颖聪; 丁永祯; 金判燮
Original assignee: Integrated System Solution Corp
Current assignee: British Cayman Islands Business Miley electronic Limited by Share Ltd.; Microchip Technology Inc
Priority date: 2010-03-23
Filing date: 2010-03-23
Publication date: 2012-10-24
Anticipated expiration: 2030-03-23
Also published as: CN102201231A

Abstract

The invention discloses a voice sensing method comprising the following steps: sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device, wherein the first sound receiving device is closer to a voice signal source than the second sound receiving device; calculating a first energy of the first signal in an interval, calculating a second energy of the second signal in the same interval, and calculating a first ratio from the first energy and the second energy; converting the first ratio into a second ratio; setting a threshold; and judging whether the voice signal source is detected by comparing the second ratio with the threshold. With the voice sensing method disclosed by the invention, a voice signal can be detected accurately when a user utters it.

Description

Voice detection method

Technical field

The present invention relates to a voice detection method, and in particular to a voice detection method that uses two sound receiving devices.

Background technology

In recent years, hands-free voice communication systems have come into widespread use. Generally, a hands-free voice communication system can connect to a mobile communication device through a Bluetooth communication module. After digitization and modulation, the hands-free voice communication system converts the voice signal into a series of packets and then uses the Bluetooth communication module to transmit these packets to the mobile communication module.

In a real environment, however, the hands-free voice communication system is subject to interference from ambient noise, which reduces the clarity of the original voice signal. For example, when the user uses the system beside a road with heavy traffic or in a crowded rapid transit station, the microphone of the system picks up a great deal of background noise. If the volume of this background noise is greater than the volume of the user's own speech, the background noise seriously interferes with the voice signal uttered by the user.

In addition, studies of user behavior show that, over an entire call, the user speaks for less than half of the call duration. If the hands-free voice communication system keeps transmitting packets continuously throughout the entire call, it consumes power unnecessarily. Because the system is powered by a battery, continued unnecessary power consumption significantly shortens its talk time or standby time, which in turn reduces its competitiveness on the market.

Summary of the invention

In view of the above problems, the present invention proposes a voice detection method for accurately detecting a voice signal when the user utters it.

The voice detection method proposed by the present invention comprises the following steps: sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device, wherein the first sound receiving device is closer to a voice signal source than the second sound receiving device; calculating a first energy of the first signal in an interval, calculating a second energy of the second signal in the interval, and calculating a first ratio from the first energy and the second energy; converting the first ratio into a second ratio; setting a threshold; and judging whether the voice signal source is detected according to the relative magnitudes of the second ratio and the threshold.

In addition to the above method, the present invention also discloses a voice detection method comprising: sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device, wherein the first sound receiving device is closer to a voice signal source than the second sound receiving device; performing a voice energy judgment step to obtain a first judgment result; performing a voice direction judgment step to obtain a second judgment result; and judging whether the voice signal source is detected according to the first judgment result and the second judgment result.

The voice energy judgment step comprises: calculating a first energy of the first signal in an interval, calculating a second energy of the second signal in the interval, and calculating a first ratio from the first energy and the second energy; converting the first ratio into a second ratio; setting a threshold; and comparing the second ratio with the threshold to output the first judgment result.

The voice direction judgment step comprises: calculating, from the first signal and the second signal, a first correlation value in a first direction and a second correlation value in a second direction; and outputting the second judgment result according to the first correlation value and the second correlation value. Here the first direction is the direction corresponding to the voice signal source, and the second direction is a direction other than the first direction.

With the voice detection method proposed by the present invention, the threshold can be adjusted according to the level of background noise, which improves detection accuracy. In addition, the voice direction judgment step can assist the decision and further increase detection accuracy.

The present invention is described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the invention.

Description of drawings

Fig. 1A, Fig. 1B and Fig. 1C are schematic external views of the hands-free voice communication system proposed by the present invention;

Fig. 2 is a flowchart of the first embodiment of the voice detection method proposed by the present invention;

Fig. 3A and Fig. 3B are simulated signal waveform diagrams of the present invention;

Fig. 4 is a flowchart of the second embodiment of the voice detection method proposed by the present invention;

Fig. 5 is a side view of the hands-free voice communication system proposed by the present invention.

Reference numerals:

10 hands-free voice communication system

11 first side

12 second side

20 first sound receiving device

30 second sound receiving device

100 line segment

200 line segment

300 line segment

Embodiment

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments, so that its objects, solutions and effects can be better understood; the description is not intended to limit the scope of protection of the appended claims.

Please refer to Fig. 1A, Fig. 1B and Fig. 1C, which are schematic external views of the hands-free voice communication system.

Fig. 1A and Fig. 1B are schematic external views of the first embodiment. The hands-free voice communication system 10 comprises a first sound receiving device 20 and a second sound receiving device 30, each of which may be a microphone. The hands-free voice communication system 10 has a first side 11 and a second side 12. When the user uses the hands-free voice communication system 10, the first side 11 is closer to the user's face and the second side 12 is farther from it. In this embodiment, the first sound receiving device 20 is located on the first side 11 and the second sound receiving device 30 is located on the second side 12. In addition, the first sound receiving device 20 is closer to the voice signal source than the second sound receiving device 30; the voice signal source is generally the user's face.

Fig. 1C is a schematic external view of the second embodiment. The hands-free voice communication system 10 comprises a first sound receiving device 20 and a second sound receiving device 30. The hands-free voice communication system 10 has a first side 11 and a second side 12. When the user uses the hands-free voice communication system 10, the first side 11 is closer to the user's face and the second side 12 is farther from it. In this embodiment, the first sound receiving device 20 and the second sound receiving device 30 are both located on the first side 11. The first sound receiving device 20 is closer to the voice signal source than the second sound receiving device 30; the voice signal source is generally the user's face.

Please refer to Fig. 2, which is a flowchart of the first embodiment of the voice detection method proposed by the present invention. This method is a voice energy judgment flow and comprises the following steps: sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device (S110); calculating a first energy of the first signal in an interval and a second energy of the second signal in the same interval (S120); calculating a first ratio from the first energy and the second energy (S130); converting the first ratio into a second ratio (S140); setting a threshold (S150); and judging whether the voice signal source is detected according to the relative magnitudes of the second ratio and the threshold (S160).

In step S110, after picking up the voice signal, the first sound receiving device 20 and the second sound receiving device 30 subject it to periodic sampling and analog-to-digital (A/D) conversion; the first sound receiving device 20 then outputs the first signal and the second sound receiving device 30 outputs the second signal. In this embodiment, the sampling frequency must be at least twice the highest frequency of the voice signal. Typically the sampling frequency is 8,000 Hz. For better results, a higher sampling frequency such as 16,000 Hz or 32,000 Hz may be used. The analog-to-digital conversion is typically 8-bit, but a higher resolution such as 12-bit or 16-bit may also be used.

For convenience, the first signal is denoted P[t] and the second signal is denoted R[t], where t is a positive integer representing the index in discrete time. For example, when the sampling frequency is 8,000 Hz and the sampling time is one second, t is a positive integer from 1 to 8000.

In step S120, the first energy EP[n] of the first signal P[t] and the second energy ER[n] of the second signal R[t] in an interval are calculated as follows:

EP[n] = \sum_{t = D(n-1)+1}^{Dn} |P[t]|^{2},

ER[n] = \sum_{t = D(n-1)+1}^{Dn} |R[t]|^{2},

where D is the length of the interval. For example, if the interval is 64 samples long, D is 64. In this case EP[1] is the sum of the squares of P[1], P[2], ..., P[64], EP[2] is the sum of the squares of P[65], P[66], ..., P[128], and so on for the other values of the first energy. The second energy is calculated in the same way as the first energy.
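As an illustration of step S120, the time-domain energy calculation could be sketched in Python as follows. The function name `frame_energies`, the 0-based indexing, and the choice to drop a trailing partial interval are assumptions of this sketch, not part of the disclosed method.

```python
def frame_energies(signal, frame_len=64):
    """Return the energy of each complete interval of `frame_len` samples.

    The energy of an interval is the sum of the squared samples in it.
    Assumption: a trailing partial interval is dropped.
    """
    n_frames = len(signal) // frame_len
    energies = []
    for n in range(n_frames):
        frame = signal[n * frame_len:(n + 1) * frame_len]
        energies.append(sum(x * x for x in frame))
    return energies
```

With this indexing, `frame_energies(P, 64)[0]` corresponds to EP[1] in the notation above, and applying the same function to R gives the second energy.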

The first energy EP[n] and the second energy ER[n] above are computed in the time domain. Alternatively, they may be computed in the frequency domain. In that case, the time-domain signals P[1], P[2], ..., P[64] are converted by a fast Fourier transform (FFT) into the frequency-domain signals P'[1], P'[2], ..., P'[64]. Likewise, the time-domain signals R[1], R[2], ..., R[64] are converted by a fast Fourier transform into the frequency-domain signals R'[1], R'[2], ..., R'[64].

The first energy EP[n] and the second energy ER[n] are then calculated as follows:

EP[n] = \sum_{t = D(n-1)+1}^{Dn} |P'[t]|^{2},

ER[n] = \sum_{t = D(n-1)+1}^{Dn} |R'[t]|^{2}.

For better detection performance, the time-domain signals P[t] and R[t], or the frequency-domain signals P'[f] and R'[f], may first be passed through a low-pass filter to remove part of the noise before the energy is calculated.
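The frequency-domain variant of the energy calculation can be sketched as follows, assuming an unnormalized discrete Fourier transform. A naive DFT stands in for the FFT so that the example needs only the standard library, and the names `dft` and `spectral_energy` are hypothetical. Under this convention the frequency-domain energy of an interval equals D times its time-domain energy (Parseval's relation), so, absent any filtering, the ratio of the two energies is the same in either domain.

```python
import cmath


def dft(frame):
    """Naive O(D^2) unnormalized DFT, used here as a stand-in for an FFT."""
    d = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / d) for t, x in enumerate(frame))
        for k in range(d)
    ]


def spectral_energy(frame):
    """Sum of the squared magnitudes of the DFT bins of one interval."""
    return sum(abs(c) ** 2 for c in dft(frame))
```

A low-pass filter, if used, could be approximated in this sketch by summing only the lower-frequency bins; which bins to keep is a design choice the text leaves open.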

In step S130, the first ratio D[n] is calculated from the first energy EP[n] and the second energy ER[n]. The first ratio D[n] may be the second energy ER[n] divided by the first energy EP[n], that is,

D[n] = \frac{ER[n]}{EP[n]}.

When the user utters a voice signal, the first sound receiving device 20 is closer to the voice signal source than the second sound receiving device 30, and sound energy is inversely proportional to the square of the propagation distance; in theory, therefore, the first energy EP[n] is greater than the second energy ER[n]. In other words, D[n] is less than 1.
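As a rough worked illustration of the inverse-square argument, assume an idealized point source in free space at distance r_P from the first sound receiving device and r_R from the second. The symbols r_P and r_R and the numerical distances are illustrative assumptions and do not appear in the original text.

```latex
D[n] = \frac{ER[n]}{EP[n]} \approx \frac{1/r_R^{2}}{1/r_P^{2}} = \left(\frac{r_P}{r_R}\right)^{2},
\qquad
r_P = 2\ \text{cm},\ r_R = 6\ \text{cm}
\;\Rightarrow\;
D[n] \approx \frac{1}{9} \approx 0.11 .
```

For a distant noise source the two distances are nearly equal, so under the same idealization the ratio approaches 1.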

In step S140, to obtain a smoother ratio, the exponentially weighted moving average (EWMA) method may be used to convert the first ratio D[n] into the second ratio M[n], computed as M[n] = (1 - α) * D[n] + α * M[n-1], where 0 ≤ α < 1. The larger α is, the smoother the second ratio M[n]. Typically α is 0.99.
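The smoothing of step S140 can be sketched as below. The text does not state the starting value M[0]; seeding it with the first ratio is an assumption of this sketch, as are the function name `ewma` and the `initial` parameter.

```python
def ewma(ratios, alpha=0.99, initial=None):
    """Smooth D[n] into M[n] = (1 - alpha) * D[n] + alpha * M[n-1].

    Assumption: when `initial` is None, M[0] is seeded with the first ratio.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    prev = ratios[0] if initial is None and ratios else initial
    smoothed = []
    for d in ratios:
        prev = (1.0 - alpha) * d + alpha * prev
        smoothed.append(prev)
    return smoothed
```

With α = 0.99 each new ratio contributes only 1% to the smoothed value, so M[n] reflects roughly the last hundred intervals.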

In step S150, a threshold Th[n] is set for judging whether a voice signal is detected. The threshold Th[n] may be a fixed value or may be adjusted dynamically according to the second ratio M[n].

If the threshold Th[n] is adjusted dynamically according to the second ratio M[n], it may be adjusted as follows:

Th[n] = \beta \times \max_{t = 1 \sim n} M[t], if Th[n] \leq \beta \times \max_{t = 1 \sim n} M[t];

Th[n] = \sigma \times Th[n-1], if Th[n] > \beta \times \max_{t = 1 \sim n} M[t];

where \max_{t = 1 \sim n} M[t] is the regional maximum, that is, the maximum of M[1] through M[n]; β is a sensitivity constant and σ is an attenuation constant. β is a constant between 0 and 1, and the larger β is, the larger the threshold Th[n]. Typically β is 0.5. σ is a constant between 0 and 1, which makes the threshold Th[n] decrease gradually over time.

The purpose of adjusting the threshold Th[n] dynamically according to the second ratio M[n] is to let the threshold follow the level of the background noise. When the user is in an environment with very loud background noise, the voice signal will be hard to detect unless the threshold Th[n] is raised accordingly. The purpose of lowering the threshold Th[n] gradually is to handle the case in which the user moves from a very noisy environment to a very quiet one, so that the background noise drops sharply. If the threshold Th[n] were not lowered gradually, it would remain at a very high value and non-voice signals would easily be detected as well.
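One possible reading of the dynamic threshold rule is sketched below. Several points are assumptions of this sketch rather than statements of the original: the comparison is made against the previous threshold Th[n-1]; the starting threshold is 0; the default σ = 0.9 is a placeholder, since no typical value is given; and the optional `window` parameter is hypothetical. The window is offered because, under this reading, a maximum taken over all of M[1] through M[n] never decreases, so the decay branch would never be taken; restricting the regional maximum to recent intervals lets the threshold fall after the noise level drops.

```python
def dynamic_thresholds(smoothed, beta=0.5, sigma=0.9, window=None, initial=0.0):
    """Return Th[n] for each smoothed ratio M[n].

    window=None takes the maximum over M[1]..M[n] as stated in the text;
    an integer window (hypothetical) uses only the most recent intervals.
    """
    thresholds = []
    prev = initial
    for n in range(len(smoothed)):
        start = 0 if window is None else max(0, n + 1 - window)
        candidate = beta * max(smoothed[start:n + 1])
        # Assumption: compare the previous threshold with beta * max.
        prev = candidate if prev <= candidate else sigma * prev
        thresholds.append(prev)
    return thresholds
```

For example, with β = 0.5, σ = 0.5 and a window of one interval, the ratios 1.0, 0.2, 0.2 give thresholds 0.5, 0.25, 0.125: the threshold jumps up with the ratio and then decays.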

Finally, in step S160, whether the voice signal source is detected is judged according to the relative magnitudes of the second ratio M[n] and the threshold Th[n]. When the second ratio M[n] is less than the threshold Th[n], the voice signal is deemed detected.

Please refer to Fig. 3A and Fig. 3B, which are simulated signal waveform diagrams. Line segment 100 in Fig. 3A represents the first ratio D[n]; as the figure shows, the first ratio D[n] changes quite rapidly. Line segment 200 in Fig. 3B represents the second ratio M[n], and line segment 300 represents the threshold Th[n]. As the figure shows, the second ratio M[n] changes much more slowly than the first ratio D[n], and the threshold Th[n] is adjusted dynamically along with the second ratio M[n].

According to the above method, two different sound receiving devices capture two different signals. After the energy ratio of the two signals is calculated, the threshold is set dynamically according to the energy ratio. Finally, whether a voice signal is detected is judged by comparing the threshold with the energy ratio. In this way, the voice energy judgment flow proposed by the present invention can adjust the threshold according to the level of background noise and thereby improve detection accuracy.
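Steps S120 through S160 can be put together in a compact sketch. For brevity it uses the fixed-threshold option rather than the dynamic one; the function name, the seeding of the smoothed ratio with the first ratio, and the handling of an all-zero interval are assumptions of this sketch.

```python
def detect_voice_by_energy(primary, reference, frame_len=64, alpha=0.99, threshold=0.5):
    """Return one boolean per interval: True where M[n] < threshold.

    `primary` is the signal from the device nearer the mouth (P[t]) and
    `reference` is the signal from the farther device (R[t]).
    """
    decisions = []
    smoothed = None
    n_frames = min(len(primary), len(reference)) // frame_len
    for n in range(n_frames):
        lo, hi = n * frame_len, (n + 1) * frame_len
        e_p = sum(x * x for x in primary[lo:hi])    # EP[n]
        e_r = sum(x * x for x in reference[lo:hi])  # ER[n]
        if e_p == 0:
            # Assumption: on digital silence, hold the previous value (or 1.0).
            ratio = 1.0 if smoothed is None else smoothed
        else:
            ratio = e_r / e_p                        # D[n]
        if smoothed is None:
            smoothed = ratio                         # seed M with the first ratio
        else:
            smoothed = (1 - alpha) * ratio + alpha * smoothed
        decisions.append(smoothed < threshold)
    return decisions
```

Since each decision is tied to one interval, a transmitter could, for instance, skip sending packets for intervals whose decision is False, which is the power-saving motivation given in the background section.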

In addition to the above method, the present invention also proposes a voice direction judgment flow to further increase the accuracy of voice judgment. Please refer to Fig. 4, which is a flowchart of the second embodiment of the voice detection method proposed by the present invention. The voice direction judgment flow comprises the following steps: sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device (S210); calculating, from the first signal and the second signal, a first correlation value in a first direction and a second correlation value in a second direction (S220); and judging whether the voice signal source is detected according to the first correlation value and the second correlation value (S230).

Step S210 is the same as step S110 and is not described again. As before, the first signal is denoted P[t] and the second signal is denoted R[t].

In step S220, the first correlation value C1[t] in the first direction is calculated as C1[t] = α * C1[t-1] + (1 - α) * P[t-τ] * R[t], where τ is the difference between the times at which a voice signal travelling along the first direction arrives at the first sound receiving device 20 and at the second sound receiving device 30. Because P[t] and R[t] are sampled discrete-time signals, τ must also be expressed in units of the sampling period.

Please refer to Fig. 5, which is a side view of the hands-free voice communication system. The difference between the distances travelled by a voice signal along the first direction to the first sound receiving device 20 and to the second sound receiving device 30 is d centimeters. Assume that the speed of sound at room temperature is 33,000 cm/s. The arrival time difference along the first direction is then d/33,000 seconds. Assume further that the first signal P[t] and the second signal R[t] are sampled at 8,000 Hz, so that the sampling period is 1/8000 second. Expressed in samples, the time difference τ is therefore (d/33,000)/(1/8000) samples, that is, d × 8/33 samples. If this formula yields a non-integer number of samples, the result is rounded to a nearby integer.
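The conversion of the path difference d into a delay in samples can be sketched as follows. The function and parameter names are hypothetical, and rounding to the nearest integer is one way to pick a nearby integer.

```python
def delay_in_samples(path_difference_cm, sample_rate_hz=8000, speed_cm_per_s=33000):
    """Convert a path-length difference d (in cm) into the delay tau in samples.

    tau = (d / speed) / (1 / sample_rate) = d * sample_rate / speed.
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    return round(path_difference_cm * sample_rate_hz / speed_cm_per_s)
```

At 8,000 Hz one sample corresponds to about 4.1 cm of path difference, so a spacing of only 2 cm rounds to a delay of 0 samples, in which case P[t-τ] equals P[t] and the correlation values coincide. A higher sampling frequency gives a finer delay resolution.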

The second correlation value C2[t] in the second direction is calculated as C2[t] = α * C2[t-1] + (1 - α) * P[t] * R[t].

Because the voice signal always arrives from the first direction, the first correlation value C1[t] is greater than the second correlation value C2[t] when a voice signal is uttered. Conversely, when noise arrives from the second direction, the second correlation value C2[t] is greater than the first correlation value C1[t]. Whether a voice signal is detected can therefore be judged by comparing the first correlation value C1[t] with the second correlation value C2[t].

To further increase detection accuracy, this step may additionally calculate a third correlation value C3[t] in a third direction, as C3[t] = α * C3[t-1] + (1 - α) * P[t] * R[t-τ].

Then, if the first correlation value C1[t] is greater than the second correlation value C2[t] and also greater than the third correlation value C3[t], a voice signal is judged to have been detected. To improve voice detection accuracy further, the condition may be tightened: a voice signal is judged to have been detected only if the first correlation value C1[t] is greater than the second correlation value C2[t] plus a threshold H and greater than the third correlation value C3[t] plus the threshold H.
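The three recursive correlation values and the comparison described above can be sketched as follows. Starting the correlation values at zero, treating samples before the start of the signal as zero, and the names `direction_decisions` and `margin` (for the threshold H) are assumptions of this sketch.

```python
def direction_decisions(primary, reference, tau, alpha=0.99, margin=0.0):
    """Return one boolean per sample: True where C1 > C2 + H and C1 > C3 + H.

    primary is P[t], reference is R[t], tau is the delay in samples and
    margin plays the role of the threshold H.
    """
    c1 = c2 = c3 = 0.0  # assumption: correlation values start at zero
    decisions = []
    for t in range(min(len(primary), len(reference))):
        p_now, r_now = primary[t], reference[t]
        # Assumption: samples before the start of the signal count as zero.
        p_delayed = primary[t - tau] if t >= tau else 0.0    # P[t - tau]
        r_delayed = reference[t - tau] if t >= tau else 0.0  # R[t - tau]
        c1 = alpha * c1 + (1 - alpha) * p_delayed * r_now    # first direction
        c2 = alpha * c2 + (1 - alpha) * p_now * r_now        # second direction
        c3 = alpha * c3 + (1 - alpha) * p_now * r_delayed    # third direction
        decisions.append(c1 > c2 + margin and c1 > c3 + margin)
    return decisions
```

In this sketch C1 is largest when the second signal is a delayed copy of the first, which matches sound reaching the nearer device first; swapping the two inputs makes C3 dominate instead.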

The voice energy judgment flow and the voice direction judgment flow may be used together as the basis for the decision. That is, a voice signal may be deemed detected only when both the voice energy judgment flow and the voice direction judgment flow judge that a voice signal has been detected. Alternatively, a voice signal may be deemed detected when either the voice energy judgment flow or the voice direction judgment flow judges that one has been detected.
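The two ways of combining the judgment results can be expressed in a few lines. The function name and the `require_both` flag are hypothetical; how the per-interval energy result is aligned in time with the per-sample direction result is left open here, as it is in the text.

```python
def combine_decisions(energy_detected, direction_detected, require_both=True):
    """Combine the first and second judgment results.

    require_both=True is the stricter AND rule; False is the OR rule.
    """
    if require_both:
        return energy_detected and direction_detected
    return energy_detected or direction_detected
```

The AND rule trades missed detections for fewer false alarms, while the OR rule does the opposite; which is preferable depends on the application.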

The voice detection method described above may be implemented in various ways, for example in hardware, firmware, software, or a combination thereof. In a hardware embodiment, it may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.

In a firmware and/or software embodiment, the voice detection method disclosed by the present invention may be implemented with program instructions. For example, the program instructions may be stored in a memory and executed by a processor.

Of course, the present invention may have various other embodiments. Those of ordinary skill in the art may make corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

1. A voice detection method, characterized by comprising:

sampling a first signal with a first sound receiving device and sampling a second signal with a second sound receiving device, wherein the first sound receiving device is closer to a voice signal source than the second sound receiving device;

calculating a first energy of the first signal in an interval, calculating a second energy of the second signal in the interval, and calculating a first ratio from the first energy and the second energy;

converting the first ratio into a second ratio;

setting a threshold; and

judging whether the voice signal source is detected according to the relative magnitudes of the second ratio and the threshold,

wherein, in the step of converting the first ratio, an exponentially weighted moving average method is used to convert the first ratio into the second ratio.

2. The voice detection method of claim 1, characterized in that, in the step of setting a threshold, the threshold is a regional maximum of the second ratio multiplied by a coefficient β and then multiplied by an attenuation parameter σ, where 0 < β ≤ 1 and 0 < σ ≤ 1.

3. The voice detection method of claim 2, characterized in that, in the step of judging the relative magnitudes of the second ratio and the threshold, the voice signal source is deemed detected if the second ratio is less than the threshold.

4. A voice detection method, characterized by comprising:

performing a voice energy judgment step, comprising:

converting the first ratio into a second ratio;

setting a threshold; and

comparing the second ratio with the threshold and outputting a first judgment result;

performing a voice direction judgment step, comprising:

calculating, from the first signal and the second signal, a first correlation value in a first direction and a second correlation value in a second direction, wherein the first direction is the direction corresponding to the voice signal source and the second direction is a direction other than the first direction; and

outputting a second judgment result according to the first correlation value and the second correlation value; and

judging whether the voice signal source is detected according to the first judgment result and the second judgment result,

5. The voice detection method of claim 4, characterized in that, in the step of judging whether the voice signal source is detected according to the first judgment result and the second judgment result, the voice signal source is deemed detected when the second ratio is less than the threshold and the first correlation value is greater than the second correlation value.

6. The voice detection method of claim 4, characterized in that, in the step of judging whether the voice signal source is detected according to the first judgment result and the second judgment result, the voice signal source is deemed detected when the second ratio is less than the threshold or the first correlation value is greater than the second correlation value.

7. The voice detection method of claim 4, characterized in that, in the step of setting a threshold, the threshold is a regional maximum of the second ratio multiplied by a coefficient β and then multiplied by an attenuation parameter σ, where 0 < β ≤ 1 and 0 < σ ≤ 1.