Detailed Description of the Embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments, so that the objects, solutions, and effects of the invention may be further understood. The description is not intended to limit the scope of protection of the appended claims.
Please refer to Fig. 1A, Fig. 1B, and Fig. 1C, which are schematic external views of a hands-free voice communication system.
Fig. 1A and Fig. 1B are schematic external views of a first embodiment. The hands-free voice communication system 10 comprises a first sound receiving device 20 and a second sound receiving device 30, each of which may be a microphone. The hands-free voice communication system 10 has a first side 11 and a second side 12. When a user uses the hands-free voice communication system 10, the first side 11 is closer to the user's face and the second side 12 is farther from it. In this embodiment, the first sound receiving device 20 is located on the first side 11, and the second sound receiving device 30 is located on the second side 12. In addition, the first sound receiving device 20 is closer than the second sound receiving device 30 to the speech signal source, which is generally the user's mouth.
Fig. 1C is a schematic external view of a second embodiment. The hands-free voice communication system 10 comprises a first sound receiving device 20 and a second sound receiving device 30. The hands-free voice communication system 10 has a first side 11 and a second side 12. When a user uses the hands-free voice communication system 10, the first side 11 is closer to the user's face and the second side 12 is farther from it. In this embodiment, the first sound receiving device 20 and the second sound receiving device 30 are both located on the first side 11. The first sound receiving device 20 is again closer than the second sound receiving device 30 to the speech signal source, which is generally the user's mouth.
Please refer to Fig. 2, which is a flowchart of a first embodiment of the voice detection method proposed by the present invention. This method is a speech energy determination flow comprising the following steps: sampling a first signal by a first sound receiving device and sampling a second signal by a second sound receiving device (S110); calculating a first energy of the first signal in a section and a second energy of the second signal in the same section (S120); calculating a first ratio according to the first energy and the second energy (S130); converting the first ratio into a second ratio (S140); setting a threshold (S150); and determining, according to the relative magnitudes of the second ratio and the threshold, whether a speech signal from the source has been detected (S160).
In step S110, after capturing sound, the first sound receiving device 20 and the second sound receiving device 30 subject the captured signals to periodic sampling and analog-to-digital (A/D) conversion. The first sound receiving device 20 then outputs the first signal, and the second sound receiving device 30 outputs the second signal. In this embodiment, the sampling frequency must be at least twice the highest frequency of the speech signal. Generally, the sampling frequency may be 8,000 Hz. For better results, a higher sampling frequency such as 16,000 Hz or 32,000 Hz may be used. The A/D conversion is generally 8-bit, but a higher resolution such as 12-bit or 16-bit may also be used.
For ease of expression, the first signal is denoted P[t] and the second signal is denoted R[t], where t is a positive integer representing the order in discrete time. For example, when the sampling frequency is 8,000 Hz and the sampling duration is one second, t is a positive integer from 1 to 8000.
In step S120, the first energy E_P[n] of the first signal P[t] and the second energy E_R[n] of the second signal R[t] in a section are calculated as follows:

$$E_P[n] = \sum_{t=(n-1)D+1}^{nD} P[t]^2, \qquad E_R[n] = \sum_{t=(n-1)D+1}^{nD} R[t]^2$$
where D is the length of the section. For example, if the section is 64 sampling points long, D is 64. In this step, E_P[1] is the sum of the squares of P[1], P[2], ..., P[64], and E_P[2] is the sum of the squares of P[65], P[66], ..., P[128]; the remaining values of the first energy follow in the same way. The second energy is calculated in the same manner as the first energy.
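The time-domain energy calculation of step S120 can be sketched as follows. This is only an illustrative sketch: the function name `section_energies` and the choice to ignore an incomplete trailing section are assumptions, not part of the described method.

```python
def section_energies(signal, section_length=64):
    """Return one energy value per complete section of `signal`.

    The text numbers samples and sections from 1; Python lists start at 0,
    so section index k here corresponds to n = k + 1 in the text.
    """
    d = section_length
    if d <= 0:
        raise ValueError("section_length must be positive")
    count = len(signal) // d  # assumption: an incomplete trailing section is ignored
    return [sum(x * x for x in signal[k * d:(k + 1) * d]) for k in range(count)]


# Toy input: 64 samples of amplitude 1 followed by 64 samples of amplitude 2.
p = [1.0] * 64 + [2.0] * 64
print(section_energies(p))  # [64.0, 256.0]
```

The same function would be applied to R[t] to obtain the second energy.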
The first energy E_P[n] and the second energy E_R[n] described above are computed in the time domain. Alternatively, the first energy E_P[n] and the second energy E_R[n] may be computed in the frequency domain. In that case, the time-domain signals P[1], P[2], ..., P[64] are converted by a fast Fourier transform (FFT) into the frequency-domain signals P'[1], P'[2], ..., P'[64]. Likewise, the time-domain signals R[1], R[2], ..., R[64] are converted by an FFT into the frequency-domain signals R'[1], R'[2], ..., R'[64].
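The first energy E_P[n] and the second energy E_R[n] are then calculated from the frequency-domain signals in a manner analogous to the time-domain calculation, as the sum of the squared magnitudes of the frequency-domain samples of the section:

$$E_P[n] = \sum_{f=(n-1)D+1}^{nD} \left|P'[f]\right|^2, \qquad E_R[n] = \sum_{f=(n-1)D+1}^{nD} \left|R'[f]\right|^2$$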
To achieve better detection, the time-domain signals P[t] and R[t], or the frequency-domain signals P'[f] and R'[f], may first be passed through a low-pass filter to remove part of the noise before the energies are calculated.
In step S130, the first ratio D[n] is calculated according to the first energy E_P[n] and the second energy E_R[n]. The first ratio D[n] may be the second energy E_R[n] divided by the first energy E_P[n], that is,

$$D[n] = \frac{E_R[n]}{E_P[n]}$$
When the user emits a speech signal, the first sound receiving device 20 is closer to the speech signal source than the second sound receiving device 30, and acoustic energy is inversely proportional to the square of the propagation distance. In theory, therefore, the first energy E_P[n] is greater than the second energy E_R[n]; that is, D[n] is less than 1.
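The ratio of step S130 and the inverse-square reasoning above can be illustrated with a short sketch. The function names, the `eps` guard against division by zero, and the distances used in the example are all hypothetical and serve only as illustration.

```python
def first_ratio(e_p, e_r, eps=1e-12):
    """D[n] = E_R[n] / E_P[n] for each section.

    `eps` is an added safeguard (not in the text) for sections where E_P[n] is zero.
    """
    return [er / max(ep, eps) for ep, er in zip(e_p, e_r)]


def ideal_ratio(dist_first_cm, dist_second_cm):
    """Free-field estimate: energy is proportional to 1/distance^2,
    so E_R / E_P = (dist_first / dist_second)^2."""
    return (dist_first_cm / dist_second_cm) ** 2


# Hypothetical geometry: source 5 cm from the first device, 10 cm from the second.
print(ideal_ratio(5.0, 10.0))                  # 0.25, i.e. less than 1
print(first_ratio([4.0, 8.0], [1.0, 2.0]))     # [0.25, 0.25]
```

For background noise arriving from far away, both distances are nearly equal and the ratio approaches 1, which is what allows the ratio to separate near speech from ambient noise.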
In step S140, to obtain a smoother ratio, the first ratio D[n] may be converted into the second ratio M[n] by an exponentially weighted moving average (EWMA). The calculation is as follows: M[n] = (1 - α) × D[n] + α × M[n-1], where 0 ≤ α < 1. The larger α is, the smoother the second ratio M[n] becomes. Generally, α may be 0.99.
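A minimal sketch of the smoothing in step S140 follows. The text does not state the starting value of M; seeding it with the first ratio value is an assumption made here, as is the function name `smooth_ratio`.

```python
def smooth_ratio(d, alpha=0.99, initial=None):
    """Apply M[n] = (1 - alpha) * D[n] + alpha * M[n-1] to the sequence `d`.

    `initial` is the value used as M before the first section. The text does
    not specify it; by default the first ratio value is used (an assumption).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    prev = d[0] if (initial is None and d) else initial
    m = []
    for value in d:
        prev = (1.0 - alpha) * value + alpha * prev
        m.append(prev)
    return m


print(smooth_ratio([0.0, 1.0], alpha=0.5, initial=0.0))  # [0.0, 0.5]
```

With α = 0.99 each new section contributes only 1% to M[n], so a single outlying ratio has little effect on the decision.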
In step S150, a threshold Th[n] is set for determining whether a speech signal has been detected. The threshold Th[n] may be a fixed value or may be adjusted dynamically along with the second ratio M[n].
If the threshold Th[n] is adjusted dynamically along with the second ratio M[n], it may be adjusted according to the following method:
If
Th [n]=σ * Th [n-1], if
Wherein the local maximum is the maximum value among M[1] to M[n], β is a sensitivity constant, and σ is an attenuation constant. β is a constant between 0 and 1; the larger β is, the larger the threshold Th[n]. Generally, β may be 0.5. σ is a constant between 0 and 1 that causes the threshold Th[n] to decrease gradually over time.
The purpose of adjusting the threshold Th[n] dynamically along with the second ratio M[n] is to let the threshold follow the level of the background noise. When the user is in an environment with very loud background noise, the speech signal will be difficult to detect unless the threshold Th[n] is raised accordingly. The purpose of letting the threshold Th[n] decrease gradually is that the background noise drops considerably when the user moves from a very noisy environment to a very quiet one. If the threshold Th[n] did not decrease gradually, it would remain at a very high value, and non-speech signals would also easily be detected as speech.
Finally, in step S160, whether a speech signal from the source has been detected is determined according to the relative magnitudes of the second ratio M[n] and the threshold Th[n]. When the second ratio M[n] is less than the threshold Th[n], a speech signal is deemed to have been detected.
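The comparison of step S160 can be sketched as follows. The function name `detect_speech` and the numeric values in the example are hypothetical; the sketch accepts either a fixed threshold or one threshold per section.

```python
def detect_speech(m, threshold):
    """Return one boolean per section: True where M[n] < Th[n].

    `threshold` may be a single number (fixed threshold) or a sequence with
    one value per section (dynamically adjusted threshold).
    """
    if isinstance(threshold, (int, float)):
        threshold = [threshold] * len(m)
    return [ratio < th for ratio, th in zip(m, threshold)]


# Hypothetical smoothed ratios compared against a fixed threshold of 0.5.
print(detect_speech([0.2, 0.9, 0.4], 0.5))  # [True, False, True]
```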
Please refer to Fig. 3A and Fig. 3B, which are simulated signal waveform diagrams. Line 100 in Fig. 3A represents the first ratio D[n]; as the figure shows, the first ratio D[n] changes quite rapidly. Line 200 in Fig. 3B represents the second ratio M[n], and line 300 represents the threshold Th[n]. As the figure shows, the second ratio M[n] changes much more slowly than the first ratio D[n], and the threshold Th[n] is adjusted dynamically along with the second ratio M[n].
According to the above method, two different sound receiving devices capture two different signals. After the energy ratio of the two signals is calculated, the threshold is set dynamically according to the energy ratio. Finally, whether a speech signal has been detected is determined by comparing the threshold with the energy ratio. In this way, the speech energy determination flow proposed by the present invention can adjust the threshold according to the level of background noise and thereby improve detection accuracy.
In addition to the above method, the present invention further proposes a speech direction determination flow to further increase the precision of speech determination. Please refer to Fig. 4, which is a flowchart of a second embodiment of the voice detection method proposed by the present invention. The speech direction determination flow comprises the following steps: sampling a first signal by a first sound receiving device and sampling a second signal by a second sound receiving device (S210); calculating, according to the first signal and the second signal, a first correlation value in a first direction and a second correlation value in a second direction (S220); and determining, according to the first correlation value and the second correlation value, whether a speech signal from the source has been detected (S230).
Step S210 is the same as step S110 and is therefore not described again. Likewise, the first signal is denoted P[t] and the second signal is denoted R[t].
In step S220, the first correlation value C1[t] in the first direction is calculated as follows: C1[t] = α × C1[t-1] + (1 - α) × P[t-τ] × R[t], where τ is the difference between the times at which a speech signal travelling along the first direction arrives at the first sound receiving device 20 and at the second sound receiving device 30. Because P[t] and R[t] are sampled, discrete-time signals, τ must also be converted into a number of sampling points according to the sampling frequency.
Please refer to Fig. 5, which is a side view of the hands-free voice communication system. The difference between the distances travelled by a speech signal along the first direction to reach the first sound receiving device 20 and the second sound receiving device 30 is d centimeters. Assume that the speed of sound at room temperature is 33,000 cm/s. The time difference with which the speech signal arrives at the first sound receiving device 20 and the second sound receiving device 30 along the first direction is therefore d/33,000 seconds. Further assume that the sampling frequency of the first signal P[t] and the second signal R[t] is 8,000 Hz, so that the sampling period is 1/8000 second. Converted according to the sampling frequency, the time difference τ is (d/33,000)/(1/8000) sampling points, that is, d × 8/33 sampling points. If the number of sampling points given by this formula is not an integer, the result may be rounded to a neighboring integer.
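The conversion of the path difference d into a delay in sampling points can be sketched as follows. The function name is hypothetical, and rounding to the nearest integer with Python's `round` is one possible reading of "a neighboring integer".

```python
def delay_in_samples(path_difference_cm, sampling_hz=8000, sound_speed_cm_s=33000):
    """Convert a path difference d (cm) into a delay tau in sampling points.

    tau = (d / speed) / (1 / sampling_hz) = d * sampling_hz / speed.
    Python's round() resolves exact .5 ties to the even integer; the text
    only asks for a neighboring integer, so this choice is an assumption.
    """
    return round(path_difference_cm * sampling_hz / sound_speed_cm_s)


# With the defaults this is d * 8 / 33: 10 cm gives about 2.42, rounded to 2.
print(delay_in_samples(10))  # 2
```

At 8,000 Hz one sampling point corresponds to about 4.1 cm of path difference, so a higher sampling frequency gives a finer delay resolution for closely spaced devices.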
The second correlation value C2[t] in the second direction, on the other hand, is calculated as follows: C2[t] = α × C2[t-1] + (1 - α) × P[t] × R[t].
Because speech signals always come from the first direction, the first correlation value C1[t] in the first direction is greater than the second correlation value C2[t] in the second direction when a speech signal is emitted. Conversely, when noise comes from the second direction, the second correlation value C2[t] is greater than the first correlation value C1[t]. Whether a speech signal has been detected can therefore be determined by comparing the first correlation value C1[t] with the second correlation value C2[t].
To further increase detection accuracy, this step may additionally calculate a third correlation value C3[t] in a third direction, as follows: C3[t] = α × C3[t-1] + (1 - α) × P[t] × R[t-τ].
Thereafter, if the first correlation value C1[t] is greater than the second correlation value C2[t] and also greater than the third correlation value C3[t], it is determined that a speech signal has been detected. To further improve the accuracy of speech detection, the condition may be changed as follows: if the first correlation value C1[t] is greater than the second correlation value C2[t] plus a threshold H, and the first correlation value C1[t] is greater than the third correlation value C3[t] plus the threshold H, it is determined that a speech signal has been detected.
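The three correlation updates and the comparison with threshold H can be sketched together as follows. The function names, the zero starting values of C1, C2, and C3, and the synthetic test signal are assumptions made for illustration only.

```python
import random


def direction_correlations(p, r, tau, alpha=0.99):
    """Run the recursive updates for C1, C2, C3 and return their final values.

    Starting all three at zero is an assumption; the text does not give
    initial values. The loop starts at t = tau so that t - tau is a valid index.
    """
    c1 = c2 = c3 = 0.0
    for t in range(tau, min(len(p), len(r))):
        c1 = alpha * c1 + (1.0 - alpha) * p[t - tau] * r[t]  # first direction
        c2 = alpha * c2 + (1.0 - alpha) * p[t] * r[t]        # second direction
        c3 = alpha * c3 + (1.0 - alpha) * p[t] * r[t - tau]  # third direction
    return c1, c2, c3


def speech_detected(c1, c2, c3, h=0.0):
    """True when C1 > C2 + H and C1 > C3 + H."""
    return c1 > c2 + h and c1 > c3 + h


# Synthetic broadband source that reaches the first device tau samples
# before it reaches the second device (i.e. it arrives via the first direction).
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
tau = 2
p = source
r = [0.0] * tau + source[:-tau]
print(speech_detected(*direction_correlations(p, r, tau)))
```

Swapping the roles of `p` and `r` in the example models a sound arriving from the opposite direction, in which case C3 dominates and the check fails.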
The speech energy determination flow and the speech direction determination flow described above may be used together as the basis for the determination. That is, a speech signal may be deemed detected only when both the speech energy determination flow and the speech direction determination flow determine that a speech signal has been detected. Alternatively, a speech signal may be deemed detected when either the speech energy determination flow or the speech direction determination flow determines that a speech signal has been detected.
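The two ways of combining the flows amount to a logical AND or a logical OR of their results, as in the following sketch. The function name and the `require_both` flag are hypothetical.

```python
def combine_decisions(energy_detected, direction_detected, require_both=True):
    """Combine the results of the energy flow and the direction flow.

    require_both=True  -> speech only when both flows report detection (AND).
    require_both=False -> speech when at least one flow reports detection (OR).
    """
    if require_both:
        return energy_detected and direction_detected
    return energy_detected or direction_detected


print(combine_decisions(True, False))                      # False
print(combine_decisions(True, False, require_both=False))  # True
```

Requiring both flows reduces false detections at the cost of possibly missing some speech, while accepting either flow does the opposite.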
The voice detection method described above may be implemented in various ways. For example, the technique may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the method may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the voice detection method disclosed in the embodiments of the present invention may be implemented with program instructions. For example, the program instructions may be stored in a memory and executed by a processor.
Of course, the present invention may have various other embodiments. Those of ordinary skill in the art may make various corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such changes and modifications shall fall within the scope of protection of the appended claims.