CN1271593C

CN1271593C - Voice signal detection method

Info

Publication number: CN1271593C
Application number: CNB2004101025375A
Authority: CN
Inventors: 施健标; 杨劲松; 傅群; 焉勇
Original assignee: Vimicro Corp
Current assignee: Vimicro Corp
Priority date: 2004-12-24
Filing date: 2004-12-24
Publication date: 2006-08-23
Anticipated expiration: 2024-12-24
Also published as: CN1622193A

Abstract

The present invention discloses a voice signal detection method. It addresses the problem in the prior art that the voice energy threshold used as the criterion for distinguishing voice signal frames from silent frames cannot be modified dynamically according to actual conditions, so that voice signals are not judged accurately enough. The method comprises the following steps: audio stream data within a detection period is obtained and divided into a plurality of frames by time; the energy value of each frame is calculated and compared with the voice energy threshold to identify voice frames; the number of frames in the detection period whose energy value is greater than or equal to the current voice energy threshold is compared with the number of frames whose energy value is below it; if the frames at or above the threshold are more numerous, the average of the maximum frame energy value in the detection period and the current voice energy threshold is used as the voice energy threshold of the next detection period; otherwise, the average of the minimum frame energy value in the detection period and the current voice energy threshold is used as the voice energy threshold of the next detection period. This operation is repeated until all of the audio stream data has been processed.

Description

A voice signal detection method

Technical field

The present invention relates to the field of audio transmission, and in particular to a voice signal detection method.

Background art

When people talk, the voice signal usually accounts for only 50% of the whole audio stream, and in VoIP (Voice over IP, a technology that carries voice over an IP network) services such as video conferencing or video chat the proportion can be even lower. Extracting the voice signal from the audio stream is therefore very necessary for conserving system resources. Once the voice signal has been extracted, only the voice signal data needs to be stored and processed; the rest of the data can be ignored, which reduces the storage space required. For VoIP services, this also reduces the volume of transmitted data, saves network bandwidth, relieves network congestion, and improves voice quality.

To this end, the art at present (for example, the very widely used speech coders GSM and G273) uses a voice signal judgment method called VAD (Voice Activity Detection). Based on the characteristics of the voice signal, VAD divides the audio stream into frames of 25 milliseconds, analyzes and calculates parameters of each frame such as the average energy and the average zero-crossing rate, and compares the result with a preset threshold: if the result is higher than the threshold, the frame is regarded as a voice signal frame; otherwise it is regarded as a silent frame. With VAD, a codec encodes voice signal frames normally and merely marks silent frames as silent, so the data volume is greatly reduced and the coding efficiency is greatly improved. In most cases, however, VAD cannot judge the voice signal accurately and effectively. The sources of an audio signal are complex, yet the voice energy threshold that serves as the criterion for distinguishing voice signal frames from silent frames is configured in advance and cannot be modified dynamically according to actual conditions. As a result, the voice signal is not judged accurately enough, noise is not effectively masked, and the audio stream is still accompanied by continuous noise when played back.
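As a rough illustration of the fixed-threshold scheme described above, the following Python sketch labels frames against a threshold that never changes. The function names, the sample values, and the threshold are all hypothetical. The decision uses average energy only, and the zero-crossing rate is computed merely to show the second parameter the text mentions. This is not the algorithm of any particular codec.

```python
def average_energy(frame):
    """Mean squared amplitude of one frame (a list of sample amplitudes)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumes the frame holds at least two samples.
    """
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def fixed_threshold_vad(frames, energy_threshold):
    """Label each frame using a threshold that is never updated."""
    return [
        "voice" if average_energy(f) > energy_threshold else "silence"
        for f in frames
    ]


# Hypothetical data: speech recorded over steady, fairly loud background noise.
noise = [300, -300, 300, -300]
speech = [500, -400, 450, -500]

# The threshold was chosen in advance for a quieter room, so the noise-only
# frames also exceed it and are labelled "voice".
labels = fixed_threshold_vad([noise, speech, noise], energy_threshold=50000)
print(labels)
```

With this data the noise-only frames pass the preset threshold, which is the failure mode the paragraph attributes to a threshold that cannot follow actual conditions.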

In actually acquiring and processing an audio stream, noise arises from several sources. First, the speaker's environment varies and contains all kinds of noise, such as the roar of cars on a highway, equipment noise in a machine room, or the sound of rain on a rainy day. These may be regular, steady noises or irregular bursts, and such background sounds degrade voice quality to varying degrees. Second, the audio capture device itself may output noise. For example, the 50 Hz or 60 Hz power supply is a major noise source, and the electronic components that make up the capture device also generate noise; this is why some computers can still record noise even when no microphone is plugged in. In addition, the noise produced varies with the build quality, component selection, and type of the capture device. Common computer audio capture devices are sound cards, capture cards, and capture devices embedded in cameras. Of these, the sound card is the most widely used and has become standard equipment on computers, the capture card gives the best captured sound quality, and capture devices embedded in cameras give relatively poor captured sound quality. Finally, noise is also introduced when sound is converted from analog to digital form. Sound propagates through the air as a wave and is an analog signal; converting it into a digital signal after capture requires sampling and quantization. The range of human hearing is 20 Hz to 20 kHz, so according to the Nyquist sampling theorem a sampling frequency of about 44 kHz must be used to ensure that sound is not distorted. Because the frequency range of the human voice is 300 to 3400 Hz, voice is in most cases sampled at 8 kHz. After sampling, each sample must be quantized. Two quantization schemes are in common use: 8-bit quantization and 16-bit quantization. The fewer bits used, the greater the distortion and the more noise is introduced; at present, the overwhelming majority of systems use 16-bit quantization.
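The sampling figures above can be checked against the Nyquist criterion. The symbols f_s (sampling frequency) and f_{\max} (highest frequency to be preserved) are introduced here for illustration and do not appear in the original text:

f_s \geq 2 f_{\max}

2 \times 20\,\text{kHz} = 40\,\text{kHz} \leq 44\,\text{kHz}

2 \times 3400\,\text{Hz} = 6800\,\text{Hz} \leq 8000\,\text{Hz}

Both quoted sampling frequencies therefore satisfy the criterion for their respective bands.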

Figure 1 shows the waveform of an audio stream of speech recorded in everyday life. The recording environment is an office with machine noise, the capture device is an embedded one, and the noise signal is strong. VAD cannot effectively distinguish the voice signal from the noise signal in this stream, so playback is accompanied by a large amount of continuous noise.

To obtain a better sound effect, some VoIP systems have made improvements on the basis of VAD by providing automatic microphone volume control: the noise level is judged, and when the noise is loud, the microphone capture volume is reduced automatically. This technique reduces the noise and sounds relatively better, but it also reduces the energy of the voice signal, so the speech volume drops and the speech may become inaudible.

Summary of the invention

The invention provides a voice signal detection method to solve the problem in the prior art that the voice energy threshold used as the criterion for distinguishing voice signal frames from silent frames cannot be modified dynamically according to actual conditions, so that the voice signal is not judged accurately enough and noise cannot be effectively masked.

The voice signal detection method provided by the invention comprises the following steps:

A. Obtain the audio stream data within one detection period and divide it into a number of frames by time; calculate the energy value of each frame of audio stream data and compare it with the voice energy threshold; if the energy value is greater than or equal to the voice energy threshold, mark the frame as a voice frame; otherwise, mark it as a silent frame;

B. Count the number of frames in the current period whose energy value is greater than or equal to the voice energy threshold and the number of frames whose energy value is less than the voice energy threshold; if the frames at or above the threshold are more numerous, take the average of the maximum frame energy value in this period and the current voice energy threshold as the voice energy threshold of the next detection period; otherwise, take the average of the minimum frame energy value in this period and the current voice energy threshold as the voice energy threshold of the next detection period;

C. Return to step A and repeat the above detection process until all of the audio stream data has been processed.
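The three steps above can be sketched end to end as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `detect_voice`, its parameters, the use of mean squared amplitude as the energy value, the dropping of a trailing partial frame, and the handling of a final partial period are all choices made for this sketch.

```python
def detect_voice(samples, sample_rate, initial_threshold,
                 frame_ms=2, period_ms=500):
    """Return ([(is_voice, energy), ...] per frame, final threshold).

    Assumes sample_rate * frame_ms >= 1000 so that a frame holds at
    least one sample. A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    frames_per_period = period_ms // frame_ms
    threshold = initial_threshold
    results = []

    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]

    for start in range(0, len(frames), frames_per_period):
        period = frames[start:start + frames_per_period]
        energies = [sum(s * s for s in f) / len(f) for f in period]

        # Step A: compare each frame's energy with the current threshold.
        flags = [e >= threshold for e in energies]
        results.extend(zip(flags, energies))

        # Step B: compare the two frame counts and derive the threshold
        # for the next detection period.
        voiced = sum(flags)
        silent = len(flags) - voiced
        if voiced > silent:
            threshold = (max(energies) + threshold) / 2
        else:
            threshold = (min(energies) + threshold) / 2
        # Step C: the loop continues with the next detection period.

    return results, threshold
```

A call such as `detect_voice(pcm, 8000, initial_threshold=1000.0)` would process a hypothetical list `pcm` of 8 kHz samples with the 2 ms frame and 500 ms period given in the text.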

The initial value of the voice energy threshold is a preset value.

In step B, the specific method for counting the frames in the current period whose energy value is greater than or equal to the voice energy threshold and those whose energy value is less than the voice energy threshold is as follows:

A first counter is provided with a preset initial value of 0; if the energy value of the current frame is greater than or equal to the current voice energy threshold, the counter is incremented by 1. After all frames in the current period have been compared, the value of the first counter is the number of frames in the current period whose energy value is greater than or equal to the voice energy threshold;

A second counter is provided with a preset initial value of 0; if the energy value of the current frame is less than the current voice energy threshold, the counter is incremented by 1. After all frames in the current period have been compared, the value of the second counter is the number of frames in the current period whose energy value is less than the voice energy threshold.

The specific method for calculating the energy value of each frame of audio stream data is: square the amplitude of each sample in the frame, then take the weighted average.

Alternatively, the specific method for calculating the energy value of each frame of audio stream data is: take the absolute value of the amplitude of each sample in the frame, then take the weighted average.

Each frame of data is 2 consecutive milliseconds of audio stream data.

The detection period is 500 milliseconds.

The invention compares the energy value of each frame in a detection period with the current voice energy threshold, obtains the number of frames whose energy value is at or above the threshold and the number whose energy value is below it, and then compares the two counts. If the frames at or above the current voice energy threshold are more numerous, the average of the maximum frame energy value in the detection period and the current voice energy threshold is taken as the new voice energy threshold; otherwise, the average of the minimum frame energy value in the detection period and the current voice energy threshold is taken as the new voice energy threshold. By applying this method repeatedly while the audio stream is processed, the voice energy threshold is changed once every specified interval (the detection period). The voice energy threshold used as the criterion for distinguishing voice signal frames from silent frames is thus no longer a fixed value configured in advance but changes dynamically, in real time, as actual conditions change. This makes it possible to distinguish the voice signal more accurately, and in turn to mask the noise signal effectively and improve voice quality.

Brief description of the drawings

Figure 1 shows the waveform of an audio stream from everyday life;

Figure 2 is a flow chart of the steps of the method of the invention;

Figure 3 is a flow chart of the new threshold calculation in the method of the invention.

Detailed description of the embodiments

The present invention relates to a voice signal detection method. Fig. 2 is a flow chart of the steps of the method, and Fig. 3 is a flow chart of the new threshold calculation in the method. A specific implementation of the method is described below with reference to Figs. 2 and 3.

S1. Obtain the audio stream data within one detection period and divide it into a number of frames by time; calculate the energy value of each frame of audio stream data and compare it with the voice energy threshold; if the energy value is greater than or equal to the voice energy threshold, mark the frame as a voice frame; otherwise, mark it as a silent frame.

Because voice signals are complex, they generally show no discernible regularity, but they do exhibit regularity over short time spans; to make analysis and processing easier, the audio stream therefore needs to be segmented. For example, the audio stream is segmented by time at 2 ms per frame: at a sampling rate of 8 kHz each frame then contains 16 samples, and at a sampling rate of 16 kHz each frame contains 32 samples. Because the invention segments data frames by time slice, it is suited to voice detection at various sampling frequencies.
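The frame sizes quoted above follow directly from the sampling rate and the frame duration. A small Python sketch of time-based segmentation is shown below; the function names are hypothetical, and dropping an incomplete trailing frame is an assumption the text does not address.

```python
def samples_per_frame(sample_rate_hz, frame_ms=2):
    """Number of samples covered by one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def split_into_frames(samples, sample_rate_hz, frame_ms=2):
    """Cut a sample sequence into consecutive fixed-duration frames.

    Samples left over after the last complete frame are dropped
    (an assumption made for this sketch).
    """
    n = samples_per_frame(sample_rate_hz, frame_ms)
    usable = len(samples) - len(samples) % n
    return [samples[i:i + n] for i in range(0, usable, n)]


print(samples_per_frame(8000))    # 16 samples per 2 ms frame at 8 kHz
print(samples_per_frame(16000))   # 32 samples per 2 ms frame at 16 kHz
```

Because the frame length is derived from the duration rather than fixed as a sample count, the same code applies unchanged at other sampling rates.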

The invention predefines a detection period and presets an initial value for the voice energy threshold. The detection period may be, for example, 500 milliseconds. It should not be set too short, because too short a period causes the voice energy threshold to be modified frequently and loses the correlation of the voice signal, so that a large amount of voice signal is misjudged as silence. Nor should it be set too long, because with too long a period the voice energy threshold is changed too few times while the audio stream is processed, so that a large amount of silence is misjudged as voice signal and the point of dynamically modifying the voice energy threshold is lost.

Taking segmentation of the audio stream at 2 ms per frame as an example, first take the first 2 ms of audio stream data, that is, the first frame, and calculate the energy value of this frame. There are two specific calculation methods:

Method one: square the amplitude of each sample in the frame, then take the weighted average.

The formula is:

W = \frac{1}{N} \sum_{i=1}^{N} S_i^{2}

Method two: take the absolute value of the amplitude of each sample in the frame, then take the weighted average.

The formula is:

W = \frac{1}{N} \sum_{i=1}^{N} | S_i |

In these formulas, N is the number of samples in the frame, S_i is the amplitude of the i-th sample, and W is the resulting energy value of the frame of audio stream data.
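The two formulas above translate directly into code. The following Python sketch uses hypothetical function names and gives every sample equal weight, as the 1/N factor in the formulas does.

```python
def energy_squared(frame):
    """Method one: mean of the squared sample amplitudes."""
    return sum(s * s for s in frame) / len(frame)


def energy_absolute(frame):
    """Method two: mean of the absolute sample amplitudes."""
    return sum(abs(s) for s in frame) / len(frame)


frame = [3, -4, 0, 5]            # hypothetical sample amplitudes
print(energy_squared(frame))     # 12.5
print(energy_absolute(frame))    # 3.0
```

The two measures are on different scales for the same frame, so the initial voice energy threshold has to be chosen for whichever method is in use.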

The first calculation method gives a more accurate result, and the subsequent steps achieve a better effect, but its computation is more complex and consumes more system resources. The second calculation method gives a relatively less accurate result, but its computation is simple and places low demands on the system. The user can choose one of the two methods according to their own conditions and requirements.

After the energy value of the current frame of audio stream data has been calculated, if it is greater than or equal to the current voice energy threshold, the frame is marked as a voice frame. At the same time, a counter of frames at or above the current voice energy threshold is provided, with a preset initial value of 0, and it is incremented by 1 whenever the energy value of the current frame is greater than or equal to the current voice energy threshold. If the energy value of the current frame is less than the current voice energy threshold, the frame is marked as a silent frame. A counter of frames below the current voice energy threshold is likewise provided, with a preset initial value of 0, and it is incremented by 1 whenever the energy value of the current frame is less than the current voice energy threshold. This is repeated to judge whether each frame in the detection period is a voice frame or a silent frame.

S2. Calculate and change the voice energy threshold according to the statistics gathered within the current detection period.

In conjunction with the process in the previous step of judging whether a frame is a voice frame or a silent frame, once the energy value of the first frame of audio stream data has been calculated, it is set as both the current maximum energy value and the current minimum energy value.

After the first frame of audio stream data has been processed, the second frame is taken and its energy value is calculated using the formula above. This energy value is compared with the current maximum and minimum energy values: if it is greater than the current maximum energy value, it becomes the new maximum energy value; if it is less than the current minimum energy value, it becomes the new minimum energy value. At the same time, the energy value is compared with the current voice energy threshold: if it is greater than or equal to the threshold, the counter of frames at or above the current voice energy threshold is incremented by 1; if it is less than the threshold, the counter of frames below the current voice energy threshold is incremented by 1.
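The per-frame bookkeeping described above (two counters plus a running maximum and minimum) can be kept in one small record that is updated once per frame. The Python sketch below is illustrative only; the class name `PeriodStats` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeriodStats:
    """Statistics gathered over one detection period."""
    voiced: int = 0                      # frames at or above the threshold
    silent: int = 0                      # frames below the threshold
    max_energy: Optional[float] = None
    min_energy: Optional[float] = None

    def update(self, energy, threshold):
        """Record one frame; return True if it is marked as a voice frame."""
        if self.max_energy is None:
            # First frame of the period: it is both the maximum and minimum.
            self.max_energy = energy
            self.min_energy = energy
        else:
            if energy > self.max_energy:
                self.max_energy = energy
            if energy < self.min_energy:
                self.min_energy = energy

        is_voice = energy >= threshold
        if is_voice:
            self.voiced += 1
        else:
            self.silent += 1
        return is_voice


stats = PeriodStats()
flags = [stats.update(e, threshold=10.0) for e in [4.0, 12.0, 10.0, 1.0]]
print(flags, stats)
```

Updating the counters and the extremes in the same pass means each frame's energy is examined once and need not be stored.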

This is repeated until the set 500 ms has elapsed, that is, until 250 audio frames have been processed. The values of the two counters are then compared. If the counter of frames at or above the current voice energy threshold holds the larger count, the average of the maximum frame energy value within this 500 ms and the current voice energy threshold is taken as the voice energy threshold of the next detection period; otherwise, the average of the minimum frame energy value within this 500 ms and the current voice energy threshold is taken as the voice energy threshold of the next detection period. Thus, when the first 500 ms has elapsed, the originally preset voice energy threshold is replaced by a new value obtained by calculation and statistics from the quantized voice signal captured in real time.
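The update rule at the end of a detection period is a single comparison followed by an average. The Python sketch below uses a hypothetical function name and made-up numbers to show both branches, along with the frame count per period implied by the 2 ms and 500 ms figures.

```python
FRAME_MS = 2
PERIOD_MS = 500
FRAMES_PER_PERIOD = PERIOD_MS // FRAME_MS   # 250 frames per period


def next_threshold(voiced, silent, max_energy, min_energy, threshold):
    """Threshold for the next detection period.

    Moves halfway toward the period maximum when frames at or above the
    threshold are more numerous, otherwise halfway toward the minimum.
    """
    if voiced > silent:
        return (max_energy + threshold) / 2
    return (min_energy + threshold) / 2


# Hypothetical period: current threshold 100, max energy 900, min energy 20.
print(next_threshold(180, 70, 900, 20, 100))   # mostly voice: 500.0
print(next_threshold(70, 180, 900, 20, 100))   # mostly silence: 60.0
```

Note that a tie between the two counters falls into the "otherwise" branch and uses the minimum energy value.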

S3. Return to step S1 and repeat the above detection process until all of the audio stream data has been processed.

After the data within the first 500 ms has been counted and calculated and the voice energy threshold has been changed, the second 500 ms begins. Before it begins, the two counters, the current maximum energy value, and the current minimum energy value must be reset so that the statistics for the second 500 ms are accurate. During this 500 ms, the voice energy threshold updated at the end of the previous 500 ms is used as the comparison value. The voice energy threshold continues to be updated in this manner until the audio stream has been fully processed.
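To show how the threshold carries over from one period to the next while the statistics start fresh, the following Python sketch traces the threshold across several periods. It works on precomputed frame energies rather than raw samples; the function name and the energy values are hypothetical, and each period is assumed to contain at least one frame.

```python
def threshold_trajectory(periods, initial_threshold):
    """Return the threshold before the first period and after each period.

    `periods` is a list of non-empty lists of frame energies, one inner
    list per detection period.
    """
    threshold = initial_threshold
    history = [threshold]
    for energies in periods:
        # Reset the counters and the extremes at the start of every period.
        voiced = silent = 0
        max_e = min_e = None
        for e in energies:
            if max_e is None:
                max_e = min_e = e
            else:
                max_e = max(max_e, e)
                min_e = min(min_e, e)
            # The threshold from the previous period is the comparison value.
            if e >= threshold:
                voiced += 1
            else:
                silent += 1
        extreme = max_e if voiced > silent else min_e
        threshold = (extreme + threshold) / 2
        history.append(threshold)
    return history


# Hypothetical energies: background noise, then speech, then noise again.
periods = [[50, 60, 55], [50, 600, 700], [50, 60, 55]]
print(threshold_trajectory(periods, initial_threshold=1000))
# [1000, 525.0, 612.5, 331.25]
```

In this made-up run the preset threshold of 1000 is too high for the recording, drops during the noise-only period, rises when speech dominates, and drops again afterwards.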

With this method the voice energy threshold is updated once every 500 ms, so the method can adapt to a wide variety of complex sound environments and deliver a better sound effect.

Claims

1. A voice signal detection method, characterized in that it comprises the following steps:

2. The method of claim 1, characterized in that the initial value of the voice energy threshold is a preset value.

3. The method of claim 2, characterized in that, in step B, the specific method for counting the frames in the current period whose energy value is greater than or equal to the voice energy threshold and those whose energy value is less than the voice energy threshold is:

4. The method of claim 1, characterized in that the specific method for calculating the energy value of each frame of audio stream data is: square the amplitude of each sample in the frame, then take the weighted average.

5. The method of claim 1, characterized in that the specific method for calculating the energy value of each frame of audio stream data is: take the absolute value of the amplitude of each sample in the frame, then take the weighted average.

6. The method of claim 1, characterized in that each frame of data is 2 consecutive milliseconds of audio stream data.

7. The method of claim 1, characterized in that the detection period is 500 milliseconds.