CN106024017A

CN106024017A - Voice detection method and device

Info

Publication number: CN106024017A
Application number: CN201510119374.XA
Authority: CN
Inventors: 孙廷玮; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2015-03-18
Filing date: 2015-03-18
Publication date: 2016-10-12

Abstract

The invention discloses a speech detection method and device. The method comprises: dividing a collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames; windowing the obtained sound frames; converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame; converting the spectrum corresponding to each sound frame to the cepstral domain to obtain the corresponding cepstrum; calculating the cepstrum distance between the cepstra of two adjacent sound frames; and performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold. This scheme can reduce the time spent on speech detection.

Description

Speech detection method and device

Technical field

The present invention relates to the technical field of speech detection, and in particular to a speech detection method and device.

Background art

A mobile terminal is a computing device that can be used while on the move; in a broad sense this includes mobile phones, notebook computers, tablet computers, POS machines, in-vehicle computers, and the like. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capabilities and have evolved from simple calling tools into integrated information processing platforms, which opens up broader room for their development.

Using a mobile terminal usually requires a certain amount of the user's attention. Today's mobile terminals are equipped with a touch screen, which the user must touch to perform the corresponding operation. However, when the user cannot touch the mobile terminal, operating it becomes very inconvenient, for example while driving a vehicle or while carrying articles in their hands.

Speech detection methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a sound signal, the speech detection system is activated and recognizes the detected sound signal, after which the mobile terminal performs the corresponding operation according to the recognized sound signal. For example, when the user speaks "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" and, once it has been correctly recognized, obtains XX's mobile phone number from the mobile terminal and dials it.

However, when prior-art speech detection methods are applied in an always-listening system, they must remain switched on at all times to detect the user's voice activity, and are therefore time-consuming.

Summary of the invention

The problem addressed by the embodiments of the present invention is that speech detection is time-consuming.

To solve the above problem, an embodiment of the present invention provides a speech detection method, comprising:

dividing the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames;

windowing the obtained plurality of sound frames;

converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;

converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;

calculating the cepstrum distance between the cepstra of two adjacent sound frames;

performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.

Optionally, converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame comprises: applying a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.

Optionally, converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum comprises:

c = \int_{-\pi}^{\pi} (\log S(w) - \alpha) \frac{dw}{2\pi}

where c denotes the cepstral coefficient, S(w) denotes the sound frame in the frequency domain (its spectrum), and α is a preset correction term.

Optionally, calculating the cepstrum distance between the cepstra of two adjacent sound frames comprises:

D = \sum_{j=1}^{k} |a_j - b_j|

where D denotes the cepstrum distance, j denotes the index of a sampling frequency point in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames respectively, and k denotes the number of sampling frequency points.

Optionally, the number of sampling frequency points of the sound frame is 32.

Optionally, the duration of the collected sound signal is 200 ms to 1 s.

Optionally, the distance threshold is obtained by applying pre-emphasis to a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.

An embodiment of the present invention further provides a speech detection device, comprising:

a framing unit, adapted to divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames;

a windowing unit, adapted to window the obtained plurality of sound frames;

a frequency-domain conversion unit, adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;

a cepstral-domain conversion unit, adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;

a computing unit, adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames;

a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.

Optionally, the frequency-domain conversion unit is adapted to apply a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.

Optionally, the number of sampling frequency points of the sound frame is 32.

Compared with the prior art, the technical solution of the present invention has the following advantages:

Whether to perform detection on the input sound signal is determined by calculating the cepstrum distance between the cepstra of adjacent sound frames. Because calculating the cepstrum distance between different sound frames is a relatively simple operation, computing resources and speech detection time can be saved.

Further, because the number of sampling frequency points in each sound frame is 32, good speech detection performance can be obtained while saving computation cost.

Brief description of the drawings

Fig. 1 is a flowchart of a speech detection method in an embodiment of the present invention;

Fig. 2 is a flowchart of another speech detection method in an embodiment of the present invention;

Fig. 3 is a schematic diagram of simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under different clean-speech conditions;

Fig. 4 is a schematic diagram of simulation results for the speech recognition accuracy of a speech detection method using the ITU-T G.729B standard under different clean-speech conditions;

Fig. 5 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on a statistical model under different clean-speech conditions;

Fig. 6 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on long-term speech information under different clean-speech conditions;

Fig. 7 is a schematic diagram of simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under white-noise conditions;

Fig. 8 is a schematic diagram of simulation results for the speech recognition accuracy of a speech detection method using the ITU-T G.729B standard under white-noise conditions;

Fig. 9 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on a statistical model under white-noise conditions;

Fig. 10 is a schematic diagram of simulation results for the speech recognition accuracy of a VAD based on long-term speech information under white-noise conditions;

Fig. 11 is a schematic structural diagram of a speech detection device in an embodiment of the present invention.

Detailed description of the embodiments

Prior-art always-listening systems use voice activity detection (Voice Activity Detection, VAD) technology to detect sound.

The voice activity detection method most commonly used in the GSM standard updates the background noise during noise intervals. Such frequency-domain voice activity detection methods generally use a feature vector comprising the linear prediction spectrum, full-band energy, low-band (0-1 kHz) energy, and zero-crossing rate. Specifically, after the input sound signal has been filtered by a filter bank, the sound level of each frequency band is calculated, and a submodule with a pre-measured results model is used to determine a probability, or to determine whether the energy level of the current frame is greater than that of the stored noise. Such a voice activity detection method usually requires a reliable submodule to update and store the noise model.

To address this problem, improvements to the above voice activity detection method have been proposed that estimate the noise spectrum by dynamically tracking the power envelope. One such method is compared with the original voice activity detection method by using receiver operating characteristic curves to check, under a number of representative noises and conditions, whether the non-speech false alarm rate decreases and the speech hit rate increases. Another prior-art speech detection method constructs a cumbersome voice activity detection method from six cumbersome rules.

The above voice activity detection methods can perform excellently under specific conditions and on specific platforms. However, when they are applied in an always-listening system, the always-listening system must remain switched on at all times to detect the input sound signal, which consumes computing resources and computation time.

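To solve the above problems in the prior art, the technical solution adopted by the embodiments of the present invention first calculates the cepstrum distance between the cepstra of adjacent sound frames and uses it to determine whether to perform speech detection on the input sound signal.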

To make the above objects, features, and advantages of the present invention more apparent and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of a speech detection method in an embodiment of the present invention. The speech detection method shown in Fig. 1 may include:

Step S101: divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames.

In a specific implementation, to process the collected sound signal, the collected sound signal may first be divided into overlapping frames to obtain a plurality of sound frames. Framing the collected sound signal is in essence short-time analysis of the sound signal: the signal is divided into short time segments of fixed duration, each of which is a relatively stationary, continuous sound segment.

In a specific implementation, two adjacent sound frames partially overlap, and the extent of the overlap can be chosen according to the actual situation.
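A minimal sketch of overlapping framing, assuming the signal is held as a list of samples. The function name `split_into_frames` and the parameters `frame_len` and `hop` are illustrative and do not come from the patent; choosing `hop` smaller than `frame_len` produces the partial overlap described above.

```python
def split_into_frames(samples, frame_len, hop):
    """Split samples into frames of frame_len that start every hop samples.

    hop < frame_len makes adjacent frames overlap by frame_len - hop samples.
    Trailing samples that do not fill a whole frame are dropped (an assumption
    of this sketch; the patent does not say how the tail is handled).
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must be in 1..frame_len")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Ten samples, frames of 4 samples, a new frame every 2 samples (50% overlap).
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

With these toy values the second half of each frame reappears as the first half of the next one, which is the overlap that later steps rely on when comparing adjacent frames.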

Step S102: window the obtained plurality of sound frames.

In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, Hanning window, or rectangular window, may be selected. The frame length may be 10 to 40 ms, with a typical value of 20 ms.
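As an illustration of the windowing step, the sketch below builds a Hamming window from its standard definition, 0.54 - 0.46·cos(2πn/(N-1)), and multiplies a frame by it. The helper names `hamming` and `apply_window` are hypothetical, and the 160-sample length assumes a 20 ms frame at the 8 kHz sampling rate used in one embodiment.

```python
import math


def hamming(n_points):
    """Return the n_points-sample Hamming window 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


window = hamming(160)                      # 20 ms at 8 kHz (assumed rate)
windowed = apply_window([1.0] * 160, window)
```

The window tapers each frame toward its edges, which reduces the discontinuities introduced by cutting the signal into frames.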

In a specific implementation, framing the sound signal destroys the naturalness of the sound signal; this problem can be addressed by applying windowing and related processing to the sound frames.

Step S103: convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.

In a specific implementation, the collected sound signal in theory varies with time and is a non-stationary process, so it cannot be converted to the frequency domain directly. However, because the collected sound signal has been framed (short-time analysis), the sound signal in each frame can be regarded as quasi-stationary, so frequency-domain conversion can be applied. The spectrum obtained for each sound frame contains the relationship between frequency and energy.
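One way to carry out the frequency-domain conversion is the fast Fourier transform, which the description names as an option. The sketch below is a plain recursive radix-2 FFT written for readability rather than speed; the function names `fft` and `power_spectrum` are illustrative, and the power-of-two length requirement is a property of this particular sketch.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Squared magnitude of each FFT bin: energy as a function of frequency."""
    return [abs(c) ** 2 for c in fft(frame)]


# One cosine cycle across 8 samples puts its energy in bin 1 and, because the
# input is real, in the mirror bin 7.
frame = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = power_spectrum(frame)
```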

Step S104: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.

In a specific implementation, frequency-domain conversion may be performed on each sound frame signal obtained after framing and windowing.

In a specific implementation, the cepstrum is obtained by applying the inverse Fourier transform (Inverse Fourier Transform, IFT) to the logarithm of the power spectrum. This turns a complicated convolution relationship into a simple linear superposition, so that the frequency components of each sound frame signal can be identified relatively easily in the cepstrum.
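The definition above (inverse Fourier transform of the log power spectrum) can be sketched as follows. This is a minimal illustration using a naive O(N²) DFT so that the block is self-contained; the names `dft` and `real_cepstrum` and the `floor` guard against log(0) are assumptions of the sketch, not part of the patent.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse includes the 1/N scale)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log power spectrum of a frame.

    floor keeps the logarithm finite for empty bins (assumed value).
    """
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in dft(frame)]
    return [c.real for c in dft(log_power, inverse=True)]


# A unit impulse has a flat power spectrum of 1, so its log spectrum, and
# therefore its cepstrum, is zero everywhere.
cepstrum = real_cepstrum([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

Scaling a frame by a constant gain only shifts the zeroth cepstral value, which illustrates how the logarithm turns a multiplicative relationship into an additive one.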

Step S105: calculate the cepstrum distance between the cepstra of adjacent sound frames.

In a specific implementation, the cepstrum distance between the cepstra of adjacent sound frames is calculated to determine whether to perform speech detection on the collected sound signal. Compared with the prior-art approach of calculating the spectral energy of adjacent sound frames (the current sound frame and a delayed sound frame with a preset delay), calculating the cepstrum distance between the cepstra of adjacent sound frames reduces computational complexity and therefore saves computing resources and computation time.

Step S106: when the calculated cepstrum distance is greater than a preset distance threshold, perform speech detection on the collected sound signal.

In a specific implementation, a calculated cepstrum distance greater than the preset distance threshold indicates that the input sound signal contains a speech signal. Speech detection can then be performed on the collected sound signal to recognize the speech signal in it.

Fig. 2 shows a flowchart of another speech detection method in an embodiment of the present invention. The speech detection method shown in Fig. 2 may include:

Step S201: frame and window a sound signal of preset duration.

In a specific implementation, the input sound signal may first be divided into overlapping frames to obtain a frame-by-frame signal, where the frame length may be 20 ms and each pair of adjacent sound frames partially overlaps. A 256-point Hamming window may then be applied to each sound frame. With a sampling rate of 8 kHz, a frame length of 20 ms, and an inter-frame overlap of 50%, one frame of the sound signal has 160 sampling points; zero-padding the end of the signal yields 256 sampling points.
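The numbers in this step can be checked with a few lines of arithmetic. The sketch below uses the stated sampling rate, frame length, overlap, and 256-point size; the constant names and the `zero_pad` helper are illustrative only.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
OVERLAP = 0.5
FFT_SIZE = 256

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples in one 20 ms frame
hop = int(frame_len * (1 - OVERLAP))            # samples between frame starts


def zero_pad(frame, size):
    """Append zeros to the end of a frame until it holds size samples."""
    if len(frame) > size:
        raise ValueError("frame is longer than the target size")
    return list(frame) + [0.0] * (size - len(frame))


padded = zero_pad([1.0] * frame_len, FFT_SIZE)  # frame samples followed by zeros
```

Padding to 256 points gives a power-of-two length, which is convenient for a radix-2 FFT in the following step.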

In a specific implementation, the delay between adjacent sound frames plays an important role in calculating the cepstrum distance. If the delay is set too short, a long vowel signal with a continuous spectrum may be misclassified; if the delay is set too long, speech detection needs a longer start-up time and the spectral vectors of more sound frames must be stored. In the embodiments of the present invention, the duration of the collected sound signal may be set to 200 ms to 1 s to improve speech detection performance.

In a specific implementation, when determining the delay between different spectral vectors, the following formula may be used to apply a simple z-transform to the delay between different spectral vectors, converting it to the frequency domain:

f(x) = x(n - m) \;\Rightarrow\; F(z) = z^{-m} X(z) \qquad (2)

where f(x) denotes the difference in the time domain between two sampling points, n denotes the index of the current sampling point, m denotes the index of any sampling point before the current sampling point, F(z) denotes the function expression of f after the z-transform, and X(z) denotes the function expression of x after the z-transform.
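The time-shift relationship in formula (2) follows directly from the definition of the z-transform. A short derivation, assuming the bilateral z-transform (the patent does not state which variant it uses) and writing the transform operator as \mathcal{Z}:

```latex
\begin{aligned}
X(z) &= \sum_{n=-\infty}^{\infty} x(n)\, z^{-n} \\
\mathcal{Z}\{x(n-m)\} &= \sum_{n=-\infty}^{\infty} x(n-m)\, z^{-n}
  = \sum_{k=-\infty}^{\infty} x(k)\, z^{-(k+m)} \qquad (k = n - m) \\
  &= z^{-m} \sum_{k=-\infty}^{\infty} x(k)\, z^{-k} = z^{-m} X(z)
\end{aligned}
```

A delay of m samples in the time domain therefore appears as multiplication by z^{-m} in the z-domain.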

Step S202: apply an FFT to the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame.

In an embodiment of the present invention, a fast Fourier transform (Fast Fourier Transform, FFT) is applied to the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame. The spectrum of each sound frame contains the correspondence between frequency and amplitude.

Step S203: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.

In a specific implementation, the spectrum corresponding to each obtained sound frame is converted to the cepstral domain, and the resulting cepstrum contains the correspondence between quefrency (q) and the cepstral coefficient (c). In an embodiment of the present invention, the following formula is used to calculate the cepstral coefficient corresponding to the input sound frame signal:

c = \int_{-\pi}^{\pi} (\log S(w) - \alpha) \frac{dw}{2\pi} \qquad (3)

where c denotes the cepstral coefficient, S(w) denotes the input sound frame in the frequency domain (its spectrum), and α is a preset correction term.
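Formula (3) can be approximated numerically when S(w) is only known at N equally spaced frequencies over one full period: the integral divided by 2π then reduces to the mean of log S(w) - α. The sketch below implements the formula exactly as printed, which yields a single value per frame; the function name `cepstral_coefficient` and the sample values are illustrative assumptions.

```python
import math


def cepstral_coefficient(spectrum, alpha=0.0):
    """Approximate c = integral over [-pi, pi] of (log S(w) - alpha) dw / (2*pi).

    spectrum holds positive samples of S(w) at N equally spaced frequencies
    covering one full period, so the integral becomes a simple average.
    """
    if not spectrum:
        raise ValueError("spectrum must not be empty")
    return sum(math.log(s) - alpha for s in spectrum) / len(spectrum)


# For a flat spectrum S(w) = e the integrand equals 1 - alpha at every frequency.
c = cepstral_coefficient([math.e] * 256, alpha=0.25)
```

How the correction term α is chosen is not specified in the description, so it is left as a plain parameter here.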

Step S204: calculate the cepstrum distance between the cepstra of adjacent sound frames.

In a specific implementation, any of the various distance formulas known in the prior art may be used to calculate the cepstrum distance between the cepstra of adjacent sound frames. In an embodiment of the present invention, the Manhattan distance (city block distance) may be used:

D = \sum_{j=1}^{k} |a_j - b_j| \qquad (4)

where D denotes the cepstrum distance, j denotes the index of a sampling frequency point in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames respectively, and k denotes the number of sampling frequency points.

In an embodiment of the present invention, k is 32, so only 32 subtractions and 31 additions are needed to calculate the cepstrum distance between cepstra having different delay bands. This greatly reduces computational complexity and saves computing resources.
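Formula (4) translates almost directly into code. A minimal sketch, with hypothetical names and made-up cepstrum values chosen only to make the arithmetic easy to follow:

```python
def cepstrum_distance(a, b):
    """Manhattan (city block) distance between two equal-length cepstra."""
    if len(a) != len(b):
        raise ValueError("cepstra must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


K = 32  # number of sampling frequency points per frame, as in the embodiment
prev_cepstrum = [0.0] * K
curr_cepstrum = [0.5] * K
distance = cepstrum_distance(prev_cepstrum, curr_cepstrum)   # 32 * 0.5
```

No multiplications, square roots, or logarithms are involved at this stage, which is where the low per-frame cost of the comparison comes from.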

It should be noted here that as the number of cepstrum distances between the cepstra of adjacent sound frames used in the speech detection of a sound signal increases, speech detection performance improves as well. However, practical analysis shows that once more than 4 such cepstrum distances are used, the improvement in speech detection performance becomes very small.

Step S205: judge whether the calculated cepstrum distance is greater than a preset distance threshold.

In a specific implementation, the distance threshold in the embodiments of the present invention can be calculated adaptively and is independent of the other parts of the embodiments. However, practice has shown that when the distance threshold is held fixed, the speech detection performance of the method in the embodiments of the present invention is fairly consistent across a number of speakers and background-noise conditions.

To save resources and storage space, in an embodiment of the present invention the distance threshold may be calculated under the following conditions: pre-emphasis is applied to a sampled signal with a sampling frequency of 8 kHz, and a 256-point Hamming window is applied to sound frames with a frame length of 20 ms.

It should be noted here that the distance threshold can be set according to the actual needs of the end user.

In a specific implementation, when the judgment result is yes, step S206 may be performed; when the judgment result is no, no action is performed.
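The decision in step S205 amounts to a simple gate in front of the more expensive detector. The sketch below assumes the cepstrum distances have already been computed; `detect` is a hypothetical callback standing in for step S206, and the threshold and distance values are made up for illustration.

```python
def should_run_detection(distance, threshold):
    """Return True only when the cepstrum distance exceeds the preset threshold."""
    return distance > threshold


def gate(distances, threshold, detect):
    """Call detect(index) for each frame pair whose distance exceeds threshold.

    detect stands in for the downstream speech detector (hypothetical callback).
    Frame pairs at or below the threshold trigger no action.
    """
    triggered = []
    for index, distance in enumerate(distances):
        if should_run_detection(distance, threshold):
            detect(index)
            triggered.append(index)
    return triggered


calls = []
triggered = gate([0.2, 1.7, 0.9, 3.1], threshold=1.0, detect=calls.append)
```

Because the detector is only invoked for frame pairs that pass the gate, quiet stretches of input cost no more than the distance computation itself.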

Step S206: perform speech detection on the collected sound signal.

In a specific implementation, a calculated cepstrum distance between the cepstra of adjacent sound frames that is greater than the preset distance threshold indicates that the collected input sound signal contains a speech signal, so speech detection can be performed on the collected input sound signal.

In a specific implementation, once the voice information in the input sound signal has been recognized, the mobile terminal can perform the corresponding operation according to the recognized sound signal. For example, when the user speaks "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" and, once it has been correctly recognized, obtains XX's mobile phone number from the mobile terminal and dials it.

The speech detection method in the embodiments of the present invention is compared below with prior-art VAD technology and with the ITU-T G.729 standard.

Table 1:

Although the speech detection time may be affected by the coding technique, the comparison in Table 1 shows that the speech detection method in the embodiments of the present invention requires a short computation time. Compared with a typical prior-art speech detection method that uses frequency-domain processing, the method in the embodiments of the present invention saves more than 60% of the time; compared with the ITU-T standard, it saves more than 40%.

Figs. 3 to 6 show schematic diagrams of simulation results for the speech recognition accuracy, under different clean-speech conditions, of the speech detection method in the embodiments of the present invention, a speech detection method using the ITU-T G.729B standard, a VAD based on a statistical model, and a VAD based on long-term speech information.

The comparison of Figs. 3 to 6 shows that the speech recognition accuracy of the method in the embodiments of the present invention can reach 90% and is not affected by whether the speaker is a native speaker.

Figs. 7 to 10 show schematic diagrams of simulation results for the speech recognition accuracy, under white-noise conditions, of the speech detection method in the embodiments of the present invention, a speech detection method using the ITU-T G.729B standard, a VAD based on a statistical model, and a VAD based on long-term speech information.

The comparison of Figs. 7 to 10 shows that, in noisy environments and at different signal-to-noise ratios, the speech recognition performance of the method in the embodiments of the present invention is higher than that of the prior-art speech detection method using the ITU-T G.729B standard, especially in noisy environments with a low signal-to-noise ratio. Compared with the other VADs, however, the performance of the speech detection method in the embodiments of the present invention is lower, because the method is comparatively simple. At the same time, the method saves 90% of the computation time while its performance only falls to 85% to 90%, which demonstrates the effectiveness and usability of the method and makes it suitable for speech detection in always-listening systems.

Fig. 11 shows a schematic structural diagram of a speech detection device in an embodiment of the present invention. The speech detection device 1100 shown in Fig. 11 may include a framing unit 1101, a windowing unit 1102, a frequency-domain conversion unit 1103, a cepstral-domain conversion unit 1104, a computing unit 1105, and a speech detection unit 1106, wherein:

The framing unit 1101 is adapted to divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames.

The windowing unit 1102 is adapted to window the obtained plurality of sound frames.

The frequency-domain conversion unit 1103 is adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.

In a specific implementation, the frequency-domain conversion unit 1103 is adapted to apply a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.

The cepstral-domain conversion unit 1104 is adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.

The computing unit 1105 is adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames.

The speech detection unit 1106 is adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
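The division into units can be mirrored in software by chaining one callable per unit. The sketch below is one possible arrangement under that assumption, not the device's actual implementation: the class name, field names, and the identity stand-ins used in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]


@dataclass
class SpeechDetectionDevice:
    """Chains the six units; each unit is supplied by the caller as a callable."""
    framing_unit: Callable[[Sequence[float]], List[Frame]]
    windowing_unit: Callable[[Frame], Frame]
    frequency_unit: Callable[[Frame], Frame]
    cepstrum_unit: Callable[[Frame], Frame]
    computing_unit: Callable[[Frame, Frame], float]
    detection_unit: Callable[[Sequence[float]], bool]
    threshold: float

    def process(self, signal: Sequence[float]) -> bool:
        """Run detection only if some adjacent-frame distance exceeds the threshold."""
        cepstra = [
            self.cepstrum_unit(self.frequency_unit(self.windowing_unit(f)))
            for f in self.framing_unit(signal)
        ]
        for prev, curr in zip(cepstra, cepstra[1:]):
            if self.computing_unit(prev, curr) > self.threshold:
                return self.detection_unit(signal)
        return False


# Placeholder units: two-sample overlapping frames and identity transforms.
# They only exercise the control flow and perform no real signal processing.
device = SpeechDetectionDevice(
    framing_unit=lambda s: [list(s[i:i + 2]) for i in range(len(s) - 1)],
    windowing_unit=lambda f: f,
    frequency_unit=lambda f: f,
    cepstrum_unit=lambda f: f,
    computing_unit=lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    detection_unit=lambda s: True,
    threshold=1.0,
)
quiet = device.process([0.0, 0.0, 0.0, 0.0])
loud = device.process([0.0, 0.0, 5.0, 0.0])
```

Keeping each unit behind a plain callable makes it straightforward to swap in a different window function or distance measure without touching the control flow.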

A person of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, or the like.

The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall therefore be that defined by the claims.

Claims

1. A speech detection method, characterized by comprising:

windowing the obtained plurality of sound frames;

converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;

performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.

2. The speech detection method according to claim 1, characterized in that converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame comprises: applying a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.

3. The speech detection method according to claim 2, characterized in that converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum comprises:

c = \int_{-\pi}^{\pi} (\log S(w) - \alpha) \frac{dw}{2\pi}

4. The speech detection method according to claim 1, characterized in that calculating the cepstrum distance between the cepstra of two adjacent sound frames comprises:

D = \sum_{j=1}^{k} |a_j - b_j|

5. The speech detection method according to claim 1, characterized in that the number of sampling frequency points of the sound frame is 32.

6. The speech detection method according to claim 1, characterized in that the duration of the collected sound signal is 200 ms to 1 s.

7. The speech detection method according to claim 1, characterized in that the distance threshold is obtained by applying pre-emphasis to a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.

8. A speech detection device, characterized by comprising:

a framing unit, adapted to divide the collected sound signal into overlapping frames to obtain a plurality of corresponding sound frames;

a frequency-domain conversion unit, adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;

a cepstral-domain conversion unit, adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;

a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.

9. The speech detection device according to claim 8, characterized in that the frequency-domain conversion unit is adapted to apply a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.

10. The speech detection device according to claim 8, characterized in that the number of sampling frequency points of the sound frame is 32.

11. The speech detection device according to claim 8, characterized in that the duration of the collected sound signal is 200 ms to 1 s.

12. The speech detection device according to claim 8, characterized in that the distance threshold is obtained by applying pre-emphasis to a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.