CN1271593C - Voice signal detection method - Google Patents

Info

Publication number
CN1271593C
Authority
CN
China
Prior art keywords
value
frame
energy threshold
speech energy
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004101025375A
Other languages
Chinese (zh)
Other versions
CN1622193A (en)
Inventor
施健标
杨劲松
傅群
焉勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vimicro Corp
Original Assignee
Vimicro Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vimicro Corp filed Critical Vimicro Corp
Priority to CNB2004101025375A priority Critical patent/CN1271593C/en
Publication of CN1622193A publication Critical patent/CN1622193A/en
Application granted granted Critical
Publication of CN1271593C publication Critical patent/CN1271593C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention discloses a voice signal detection method. It addresses a problem in the prior art: the speech energy threshold used as the criterion for distinguishing speech frames from silent frames cannot be modified dynamically according to actual conditions, so speech is not detected accurately enough. The method comprises the following steps: audio stream data in a detection period is obtained and divided into a number of frames by time; the energy value of each frame is calculated and compared with the speech energy threshold to identify speech frames; the frames in the detection period whose energy values are greater than or equal to the current threshold, and those whose energy values are below it, are counted; if the frames at or above the current threshold are in the majority, the average of the maximum frame energy in the detection period and the current threshold is used as the threshold for the next detection period; otherwise, the average of the minimum frame energy in the detection period and the current threshold is used as the threshold for the next detection period. This cycle repeats until all the audio stream data has been processed.

Description

Voice signal detection method
Technical field
The present invention relates to the field of audio transmission, and in particular to a voice signal detection method.
Background art
In ordinary conversation, speech typically accounts for only about 50% of the whole audio stream, and in VoIP (Voice over IP, voice transmission carried over an IP network) services such as video conferencing or video chat the proportion can be even lower. Extracting the speech signal from the audio stream is therefore very useful for conserving system resources. Once the speech has been extracted, only the speech data needs to be stored and processed, and the rest can be discarded, which reduces storage requirements. For VoIP services it also reduces the volume of transmitted data, saves network bandwidth, relieves network congestion, and improves voice quality.
To this end, a speech detection method known as VAD (Voice Activity Detection) is currently in use in this field, for example in the widely used speech coders GSM and G273. Based on the characteristics of speech signals, it divides the audio stream into frames of 25 milliseconds, computes parameters such as the average energy and average zero-crossing rate of each frame, and compares the results with a preset threshold. A frame whose result exceeds the threshold is treated as a speech frame; otherwise it is treated as a silent frame. With VAD, a codec encodes speech frames normally and merely flags silent frames as silent, which greatly reduces the data volume and greatly improves coding efficiency. In most cases, however, VAD cannot detect speech accurately and effectively. The sources of an audio signal are complex, and the speech energy threshold used to separate speech frames from silent frames is configured in advance and cannot be modified dynamically according to actual conditions. As a result, speech detection is not accurate enough, noise is not effectively suppressed, and the played-back audio stream still carries continuous noise.
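The fixed-threshold scheme described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of any particular codec: the function name `fixed_threshold_vad` and its parameters are hypothetical, only the average-energy feature is used (the zero-crossing rate is left out for brevity), and a trailing partial frame is simply dropped.

```python
def fixed_threshold_vad(samples, sample_rate_hz, energy_threshold, frame_ms=25):
    """Label each frame True (speech) or False (silence) using a preset threshold.

    Hypothetical helper: the threshold never changes while the stream is processed,
    which is the limitation the text describes.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Average energy of the frame: mean of the squared amplitudes.
        energy = sum(s * s for s in frame) / frame_len
        labels.append(energy > energy_threshold)
    return labels


# Two 25 ms frames at 8 kHz: one silent, one loud.
stream = [0] * 200 + [1000, -1000] * 100
labels = fixed_threshold_vad(stream, 8000, energy_threshold=500)
```

Because `energy_threshold` stays constant, a rise in background noise above it would cause every frame to be labelled as speech.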
In practice, noise enters the audio stream at several points during capture and processing. First, the speaker's environment contributes various noises, such as the roar of traffic on a highway, equipment noise in a machine room, or the sound of rain on a rainy day. These may be regular noises or irregular bursts, and such background sounds degrade voice quality to varying degrees.
Second, the audio capture device itself may output noise. For example, 50 Hz or 60 Hz mains power is a major noise source, and the electronic components of the capture device also generate noise, which is why some computers record noise even with no microphone plugged in. The noise also varies with the build quality, component selection, and type of the capture device. Common computer audio capture devices are sound cards, capture cards, and capture devices embedded in cameras. Sound cards are the most widely used and have become standard equipment on computers; capture cards give the best sound quality; devices embedded in cameras give comparatively poor sound quality.
Finally, noise is also introduced during analog-to-digital conversion. Sound propagates through air as a wave and is an analog signal; converting it to a digital signal after capture requires sampling and quantization. Human hearing covers roughly 20 Hz to 20 kHz, so according to the Nyquist sampling theorem a sampling rate of about 44 kHz is needed to reproduce sound without distortion. Because the human voice lies mainly in the 300-3400 Hz range, speech is in most cases sampled at 8 kHz. After sampling, each sample must be quantized. Two quantization schemes are in common use: 8-bit and 16-bit. The fewer bits used, the greater the distortion and the more noise is introduced; at present the vast majority of systems use 16-bit quantization.
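The sampling and quantization figures above can be checked with a short calculation. The sketch below is an illustration added here, not part of the patent: it uses the Nyquist criterion (sampling rate of at least twice the highest frequency) and the standard textbook approximation for the signal-to-quantization-noise ratio of an ideal uniform quantizer driven by a full-scale sine wave, about 6.02 dB per bit plus 1.76 dB. The function names are hypothetical.

```python
def nyquist_rate_hz(max_frequency_hz):
    """Minimum sampling rate needed to represent content up to max_frequency_hz."""
    return 2 * max_frequency_hz


def quantization_snr_db(bits):
    """Approximate SNR of an ideal uniform quantizer for a full-scale sine wave."""
    return 6.02 * bits + 1.76


hearing_rate = nyquist_rate_hz(20_000)   # 40000 Hz, in line with "about 44 kHz"
speech_rate = nyquist_rate_hz(3_400)     # 6800 Hz, covered by 8 kHz sampling
snr_8_bit = quantization_snr_db(8)       # roughly 50 dB
snr_16_bit = quantization_snr_db(16)     # roughly 98 dB
```

Under this approximation, moving from 8-bit to 16-bit quantization lowers the quantization noise floor by about 48 dB, which matches the statement that fewer bits introduce more noise.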
Figure 1 shows the waveform of an audio stream of speech recorded in everyday conditions. The recording environment is an office with machine noise, the capture device is an embedded one, and the noise signal is relatively strong. VAD cannot effectively distinguish the speech signal from the noise signal here, so playback carries a large amount of continuous noise.
To obtain better sound on top of VAD, some VoIP systems have added an improvement: automatic control of the microphone volume. The system estimates the noise level and, when the noise is high, automatically lowers the microphone capture volume. This reduces the noise and sounds somewhat better, but it also reduces the energy of the speech signal, so the speech volume drops and the speech may become inaudible.
Summary of the invention
The invention provides a voice signal detection method to solve the following problem in the prior art: the speech energy threshold used as the criterion for distinguishing speech frames from silent frames cannot be modified dynamically according to actual conditions, so speech detection is not accurate enough and noise is not effectively suppressed.
The voice signal detection method provided by the invention comprises the following steps:
A. Obtain the audio stream data in one detection period, divide it into frames by time, calculate the energy value of each frame of audio stream data, and compare it with the speech energy threshold; if the energy value is greater than or equal to the speech energy threshold, mark the frame as a speech frame, otherwise mark it as a silent frame;
B. Count the frames in the current period whose energy values are greater than or equal to the speech energy threshold and the frames whose energy values are less than it; if the frames greater than or equal to the threshold are more numerous, take the mean of the maximum frame energy value in the period and the current speech energy threshold as the speech energy threshold for the next detection period; otherwise, take the mean of the minimum frame energy value in the period and the current speech energy threshold as the speech energy threshold for the next detection period;
C. Return to step A and repeat the detection process until all the audio stream data has been processed.
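The update rule of step B can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the patent: $T_k$ is the speech energy threshold used in detection period $k$, $n_{\ge}$ and $n_{<}$ are the numbers of frames in that period whose energy is at or above and below $T_k$, and $E_{\max}$ and $E_{\min}$ are the largest and smallest frame energies in the period.

```latex
T_{k+1} =
\begin{cases}
\dfrac{E_{\max} + T_k}{2}, & n_{\ge} > n_{<} \\[6pt]
\dfrac{E_{\min} + T_k}{2}, & \text{otherwise}
\end{cases}
```

Read this way, the threshold moves halfway toward the loudest frame when most frames were classified as speech, and halfway toward the quietest frame otherwise. A tie between the two counts falls under the "otherwise" branch.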
The initial value of the speech energy threshold is a preset value.
In step B, the frames in the current period at or above the speech energy threshold and the frames below it are counted as follows:
A first counter is set with an initial value of 0. If the energy value of the current frame is greater than or equal to the current speech energy threshold, this counter is incremented by 1. After all frames in the current period have been compared, the value of the first counter is the number of frames in the period at or above the speech energy threshold;
A second counter is set with an initial value of 0. If the energy value of the current frame is less than the current speech energy threshold, this counter is incremented by 1. After all frames in the current period have been compared, the value of the second counter is the number of frames in the period below the speech energy threshold.
The energy value of each frame of audio stream data may be calculated by squaring the amplitude of each sample in the frame and then taking the mean.
Alternatively, the energy value of each frame of audio stream data may be calculated by taking the absolute value of the amplitude of each sample in the frame and then taking the mean.
Each frame consists of 2 consecutive milliseconds of audio stream data.
The detection period is 500 milliseconds.
The present invention compares the energy value of each frame in a detection period with the current speech energy threshold and obtains the number of frames at or above the threshold and the number of frames below it. The two counts are then compared. If the frames at or above the current threshold are more numerous, the mean of the maximum frame energy value in the detection period and the current threshold becomes the new speech energy threshold; otherwise the mean of the minimum frame energy value in the detection period and the current threshold becomes the new speech energy threshold. By repeating this procedure while the audio stream is processed, the speech energy threshold is changed once per fixed interval (the detection period). The threshold used to separate speech frames from silent frames is therefore no longer a fixed value configured in advance, but changes dynamically in real time as conditions change. This makes it possible to distinguish speech more accurately, and in turn to suppress noise effectively and improve voice quality.
Description of drawings
Figure 1 shows the waveform of an audio stream recorded in everyday conditions;
Figure 2 is a flow chart of the steps of the method of the invention;
Figure 3 is a flow chart of the new threshold calculation in the method of the invention.
Embodiment
The present invention relates to a voice signal detection method. Fig. 2 is a flow chart of the steps of the method, and Fig. 3 is a flow chart of the new threshold calculation in the method. A specific implementation of the method is described below with reference to Fig. 2 and Fig. 3.
S1. Obtain the audio stream data in one detection period, divide it into frames by time, calculate the energy value of each frame of audio stream data, and compare it with the speech energy threshold; if the energy value is greater than or equal to the speech energy threshold, mark the frame as a speech frame, otherwise mark it as a silent frame.
Because speech signals are complex, they generally show no regular pattern over long spans, but they are regular over short intervals. To make analysis and processing easier, the audio stream therefore needs to be segmented. For example, the audio stream can be divided by time into frames of 2 ms each. At a sampling rate of 8 kHz each frame then contains 16 samples, and at a sampling rate of 16 kHz each frame contains 32 samples. Because the invention divides frames by time slice, it can be applied to speech detection at any sampling frequency.
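The time-based framing above can be sketched as follows. This is a minimal illustration with hypothetical function names; dropping a trailing partial frame is an assumption made here, since the text does not say how leftover samples are handled.

```python
def samples_per_frame(sample_rate_hz, frame_ms=2):
    """Number of samples in one frame of frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000


def split_into_frames(samples, sample_rate_hz, frame_ms=2):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumption: samples that do not fill a whole frame at the end are dropped.
    """
    n = samples_per_frame(sample_rate_hz, frame_ms)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


frames_8k = split_into_frames(list(range(100)), 8000)   # frames of 16 samples
frames_16k = split_into_frames(list(range(100)), 16000)  # frames of 32 samples
```

Because the frame length is derived from the sampling rate, the same 2 ms framing applies unchanged at 8 kHz, 16 kHz, or any other rate.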
The invention predefines a detection period and a preset initial value for the speech energy threshold. The detection period may be, for example, 500 milliseconds. It should not be set too short: a period that is too short causes the speech energy threshold to be modified too often and loses the correlation of the speech signal, so that a large amount of speech is misjudged as silence. It should not be set too long either: with a period that is too long, the threshold changes too few times while the audio stream is being processed, so that a large amount of silence is misjudged as speech and the point of modifying the threshold dynamically is lost.
Taking 2 ms frames as an example, the first 2 ms of audio stream data, that is, the first frame, is fetched and its energy value is calculated. There are two calculation methods:
Method 1: square the amplitude of each sample in the frame, then take the mean.
The formula is: $$W = \frac{1}{N}\sum_{i=1}^{N} S_i^{2}$$
Method 2: take the absolute value of the amplitude of each sample in the frame, then take the mean.
The formula is: $$W = \frac{1}{N}\sum_{i=1}^{N} \lvert S_i \rvert$$
In these formulas, $N$ is the number of samples in the frame, $S_i$ is the amplitude of the $i$-th sample, and $W$ is the resulting energy value of the frame.
The first method gives a more accurate result, and the subsequent steps work better with it, but it is computationally more complex and consumes more system resources. The second method gives a comparatively less accurate result, but the computation is simple and places low demands on the system. Users can choose either method according to their own conditions and requirements.
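The two energy measures can be sketched as follows, assuming a frame is a non-empty sequence of numeric sample amplitudes. The function names are hypothetical.

```python
def energy_mean_square(frame):
    """Method 1: mean of the squared sample amplitudes."""
    return sum(s * s for s in frame) / len(frame)


def energy_mean_abs(frame):
    """Method 2: mean of the absolute sample amplitudes (no multiplications)."""
    return sum(abs(s) for s in frame) / len(frame)


frame = [3, -4, 0, 5]
w1 = energy_mean_square(frame)  # (9 + 16 + 0 + 25) / 4
w2 = energy_mean_abs(frame)     # (3 + 4 + 0 + 5) / 4
```

The two measures are on different scales, so a threshold chosen for one is not interchangeable with a threshold chosen for the other.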
After the energy value of the current frame has been calculated, the frame is marked as a speech frame if its energy value is greater than or equal to the current speech energy threshold. A counter of frames at or above the current threshold is kept, with an initial value of 0, and it is incremented by 1 whenever the energy value of the current frame is greater than or equal to the current threshold. If the energy value of the current frame is less than the current speech energy threshold, the frame is marked as a silent frame. A counter of frames below the current threshold is also kept, with an initial value of 0, and it is incremented by 1 whenever the energy value of the current frame is less than the current threshold. This is repeated to decide, for every frame in the detection period, whether it is a speech frame or a silent frame.
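The per-frame decision and the two counters can be sketched as follows. This is an illustration only: the function name `classify_period` and the string labels are hypothetical, and the frame energies are assumed to have been computed already.

```python
def classify_period(energies, threshold):
    """Label every frame in one detection period and count both classes."""
    speech_count = 0   # frames with energy >= threshold
    silent_count = 0   # frames with energy < threshold
    labels = []
    for energy in energies:
        if energy >= threshold:
            labels.append("speech")
            speech_count += 1
        else:
            labels.append("silent")
            silent_count += 1
    return labels, speech_count, silent_count


labels, n_speech, n_silent = classify_period([5, 1, 7, 2, 9], threshold=5)
```

A frame whose energy equals the threshold exactly counts as speech, matching the "greater than or equal to" wording of the text.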
S2. Calculate and update the speech energy threshold from the statistics gathered in the current detection period.
This step runs alongside the speech/silence decision of the previous step. After the energy value of the first frame of audio stream data has been calculated, it is taken as both the current maximum energy value and the current minimum energy value.
After the first frame has been processed, the second frame of audio stream data is fetched and its energy value is calculated with the formula above. This value is compared with the current maximum and minimum energy values: if it is greater than the current maximum, it becomes the new maximum; if it is less than the current minimum, it becomes the new minimum. At the same time it is compared with the current speech energy threshold: if it is greater than or equal to the threshold, the counter of frames at or above the threshold is incremented by 1; if it is less than the threshold, the counter of frames below the threshold is incremented by 1.
This continues until the configured 500 ms has elapsed, that is, until 250 audio frames have been processed. The two counters are then compared. If the counter of frames at or above the current threshold is larger than the counter of frames below it, the mean of the maximum frame energy value within the 500 ms and the current speech energy threshold becomes the threshold for the next detection period; otherwise the mean of the minimum frame energy value within the 500 ms and the current threshold becomes the threshold for the next detection period. Thus, when the first 500 ms has elapsed, the preset threshold is replaced by a new value computed from statistics of the speech signal actually captured and quantized in real time.
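The end-of-period update can be sketched as a small function of the values accumulated during the period. The function name and parameter names are hypothetical; the rule itself follows the text, including the fact that a tie between the two counters falls into the "otherwise" branch.

```python
def next_threshold(threshold, speech_count, silent_count, max_energy, min_energy):
    """Return the speech energy threshold for the next detection period."""
    if speech_count > silent_count:
        # Mostly speech: move halfway toward the loudest frame.
        return (max_energy + threshold) / 2
    # Mostly silence (or a tie): move halfway toward the quietest frame.
    return (min_energy + threshold) / 2


raised = next_threshold(5, speech_count=3, silent_count=2, max_energy=9, min_energy=1)
lowered = next_threshold(5, speech_count=1, silent_count=3, max_energy=8, min_energy=1)
```

In the first call three of five frames were at or above the threshold, so the threshold rises from 5 to 7; in the second call most frames were below it, so the threshold falls from 5 to 3.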
S3. Return to step S1 and repeat the detection process until all the audio stream data has been processed.
After the data in the first 500 ms has been counted and the threshold calculated and updated, the second 500 ms begins. Before it begins, the two counters and the current maximum and minimum energy values must be reset so that the statistics for the second 500 ms are accurate. During this 500 ms, the threshold updated at the end of the previous 500 ms is used for comparison. The process continues in this way, updating the speech energy threshold period after period, until the audio stream has been fully processed.
With this method the speech energy threshold is updated once every 500 ms, so the detection can adapt to a wide range of complex acoustic environments and deliver better sound quality.
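Steps S1 to S3 can be put together in one loop, sketched below under stated assumptions. The function name `detect_speech` and its return format are hypothetical, the mean-square energy (method 1) is used, a trailing partial frame or partial period is processed without a final threshold update, and resetting the running maximum and minimum is modelled with `None` rather than a literal zero.

```python
def detect_speech(samples, sample_rate_hz, initial_threshold,
                  frame_ms=2, period_ms=500):
    """Return (per-frame speech flags, list of thresholds used per period)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    frames_per_period = period_ms // frame_ms
    threshold = initial_threshold
    thresholds = [threshold]
    labels = []
    speech_count = silent_count = frames_in_period = 0
    max_energy = min_energy = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        # Track the extreme frame energies of the current period.
        max_energy = energy if max_energy is None else max(max_energy, energy)
        min_energy = energy if min_energy is None else min(min_energy, energy)
        is_speech = energy >= threshold
        labels.append(is_speech)
        if is_speech:
            speech_count += 1
        else:
            silent_count += 1
        frames_in_period += 1
        if frames_in_period == frames_per_period:
            # End of the detection period: update the threshold, then reset.
            extreme = max_energy if speech_count > silent_count else min_energy
            threshold = (extreme + threshold) / 2
            thresholds.append(threshold)
            speech_count = silent_count = frames_in_period = 0
            max_energy = min_energy = None
    return labels, thresholds


# Tiny demonstration with artificial parameters: 2-sample frames, 4 frames per period.
demo = [10] * 6 + [0] * 2 + [8] * 2 + [0] * 4 + [10] * 2
flags, used = detect_speech(demo, 1000, initial_threshold=50, period_ms=8)
```

In the demonstration the first period is mostly loud, so the threshold rises from 50 to 75; the second period is mostly quiet under the new threshold, so it falls to 37.5. With the values from the text (2 ms frames, 500 ms period) each period would contain 250 frames.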

Claims (7)

1. A voice signal detection method, characterized by comprising the following steps:
A. Obtaining the audio stream data in one detection period, dividing it into frames by time, calculating the energy value of each frame of audio stream data, and comparing it with the speech energy threshold; if the energy value is greater than or equal to the speech energy threshold, marking the frame as a speech frame, otherwise marking it as a silent frame;
B. Counting the frames in the current period whose energy values are greater than or equal to the speech energy threshold and the frames whose energy values are less than it; if the frames greater than or equal to the threshold are more numerous, taking the mean of the maximum frame energy value in the period and the current speech energy threshold as the speech energy threshold for the next detection period; otherwise, taking the mean of the minimum frame energy value in the period and the current speech energy threshold as the speech energy threshold for the next detection period;
C. Returning to step A and repeating the detection process until all the audio stream data has been processed.
2. The method of claim 1, characterized in that the initial value of the speech energy threshold is a preset value.
3. The method of claim 2, characterized in that, in step B, the frames in the current period at or above the speech energy threshold and the frames below it are counted as follows:
A first counter is set with an initial value of 0; if the energy value of the current frame is greater than or equal to the current speech energy threshold, this counter is incremented by 1; after all frames in the current period have been compared, the value of the first counter is the number of frames in the period at or above the speech energy threshold;
A second counter is set with an initial value of 0; if the energy value of the current frame is less than the current speech energy threshold, this counter is incremented by 1; after all frames in the current period have been compared, the value of the second counter is the number of frames in the period below the speech energy threshold.
4. The method of claim 1, characterized in that the energy value of each frame of audio stream data is calculated by squaring the amplitude of each sample in the frame and then taking the mean.
5. The method of claim 1, characterized in that the energy value of each frame of audio stream data is calculated by taking the absolute value of the amplitude of each sample in the frame and then taking the mean.
6. The method of claim 1, characterized in that each frame consists of 2 consecutive milliseconds of audio stream data.
7. The method of claim 1, characterized in that the detection period is 500 milliseconds.
CNB2004101025375A 2004-12-24 2004-12-24 Voice signal detection method Expired - Fee Related CN1271593C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004101025375A CN1271593C (en) 2004-12-24 2004-12-24 Voice signal detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004101025375A CN1271593C (en) 2004-12-24 2004-12-24 Voice signal detection method

Publications (2)

Publication Number Publication Date
CN1622193A CN1622193A (en) 2005-06-01
CN1271593C true CN1271593C (en) 2006-08-23

Family

ID=34766806

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004101025375A Expired - Fee Related CN1271593C (en) 2004-12-24 2004-12-24 Voice signal detection method

Country Status (1)

Country Link
CN (1) CN1271593C (en)

Also Published As

Publication number Publication date
CN1622193A (en) 2005-06-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060823

Termination date: 20111224