CN108053834A

CN108053834A - audio data processing method, device, terminal and system

Info

Publication number: CN108053834A
Application number: CN201711272872.3A
Authority: CN
Inventors: 陈日林; 陈孝良; 冯大航; 苏少炜; 常乐
Original assignee: BEIJING WISDOM TECHNOLOGY Co Ltd
Current assignee: BEIJING WISDOM TECHNOLOGY Co Ltd
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2018-05-18
Anticipated expiration: 2037-12-05
Also published as: CN108053834B

Abstract

An embodiment of the invention discloses an audio data processing method, device, terminal and system. The method includes: obtaining spatially filtered audio data; applying a first Wiener filter and a second Wiener filter to the audio data to obtain first filtered data and second filtered data respectively, where the first Wiener filter suppresses noise more strongly than the second Wiener filter; and performing endpoint determination on the second filtered data using the first filtered data, then processing the second filtered data according to the endpoint determination result. Because voice activity detection and automatic speech recognition place different demands on their input, the embodiment applies Wiener filtering of a different strength for each. This preserves the accuracy of automatic speech recognition while keeping interference from affecting voice activity detection, so the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response to voice commands is faster, and the user experience improves.

Description

Audio data processing method, device, terminal and system

Technical field

The present invention relates to the technical field of data processing, and in particular to an audio data processing method, device, terminal and system.

Background technology

Intelligent voice interaction is an important branch of artificial intelligence. Accurate, unconstrained voice interaction would greatly free people's hands and give them freer access to, and control over, information in the physical world. Intelligent voice interaction mainly comprises near-field and far-field voice interaction. Over the past 20 years near-field speech has developed enormously, and near-field recognition rates now approach human recognition rates, but the freer form of interaction is far-field voice interaction. In far-field voice interaction there is a certain distance between the speaker and the interactive device. This enlarges the speaker's free space, but it also introduces considerable background noise, which makes voice activity detection and automatic speech recognition much harder.

Voice activity detection identifies, within a continuous stretch of audio data, the speech that the speaker actually uttered. Accurate voice activity detection improves the accuracy of subsequent automatic speech recognition and also reduces the feedback delay of voice interaction: the result can be returned as soon as the user's voice command ends, giving the user a better experience.

At present, the original audio data is generally processed with array signal processing, and the processed audio data is then used for both voice activity detection and automatic speech recognition. The processed audio data still contains some interference, however, which seriously degrades the accuracy of voice activity detection, causes detection errors, and slows the response to voice commands.

Summary of the invention

In view of this, the present invention provides an audio data processing method, device, terminal and system that address the problem in the prior art whereby interference remaining in the processed audio data causes voice activity detection errors and slows the response to voice commands.

An audio data processing method provided by an embodiment of the present invention includes:

Obtaining spatially filtered audio data;

Applying a first Wiener filter and a second Wiener filter to the audio data to obtain first filtered data and second filtered data respectively, where the first Wiener filter suppresses noise more strongly than the second Wiener filter;

Performing endpoint determination on the second filtered data using the first filtered data, and processing the second filtered data according to the endpoint determination result.

Optionally, applying the first Wiener filter and the second Wiener filter to the audio data to obtain the first filtered data and the second filtered data specifically includes:

Applying the first Wiener filter to the audio data using the M-th power of a strength factor to obtain the first filtered data; applying the second Wiener filter to the audio data using the N-th power of the strength factor to obtain the second filtered data; M is greater than N.

Optionally, applying the first Wiener filter to the audio data using the M-th power of the strength factor to obtain the first filtered data specifically includes:

Applying the first Wiener filter to the audio data Y(jω) according to the formula [formula omitted] to obtain the first filtered data Y_VAD(jω);

Applying the second Wiener filter to the audio data using the N-th power of the strength factor to obtain the second filtered data specifically includes:

Applying the second Wiener filter to the audio data Y(jω) according to the formula [formula omitted] to obtain the second filtered data Y_ASR(jω);

Where M = 1, N = 1/2, the strength factor is [formula omitted], P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, and EPS is a very small value.

Optionally, before the endpoint determination on the second filtered data using the first filtered data, the method further includes:

Applying interference removal to the first filtered data;

The interference removal includes one or more of transient noise elimination, noise reduction, and noise smoothing.

Optionally, the transient noise elimination specifically includes:

Obtaining the gain of the first Wiener filter at each frequency bin of the audio data within a preset frequency range;

Counting the frequency bins of the audio data within the preset frequency range to obtain a first value; counting the frequency bins whose gain magnitude is within a preset gain threshold to obtain a second value;

Obtaining a transient elimination gain from the first value and the second value;

Eliminating the transient noise in the first filtered data according to the transient elimination gain.
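The four steps above can be sketched as follows. The text does not state how the transient elimination gain is derived from the two counts, so this sketch makes two loudly labeled assumptions: "within the gain threshold" is read as "gain magnitude at or below the threshold", and the transient elimination gain is taken as the fraction of in-band bins that are not heavily attenuated. The names `transient_gain`, `remove_transient` and the threshold value are hypothetical.

```python
def transient_gain(wiener_gains, band, gain_threshold=0.3):
    """Derive one frame-level gain from per-bin gains of the first Wiener filter."""
    lo, hi = band
    in_band = wiener_gains[lo:hi]
    first_value = len(in_band)  # number of bins in the preset frequency range
    # ASSUMPTION: "within the threshold" means magnitude <= gain_threshold.
    second_value = sum(1 for g in in_band if abs(g) <= gain_threshold)
    if first_value == 0:
        return 1.0
    # ASSUMPTION: the gain is the share of bins that were NOT heavily attenuated.
    return (first_value - second_value) / first_value


def remove_transient(y_vad, wiener_gains, band):
    """Scale every bin of the first filtered data by the transient gain."""
    g = transient_gain(wiener_gains, band)
    return [g * y for y in y_vad]


gains = [0.9, 0.1, 0.2, 0.8, 0.05, 0.1]
cleaned = remove_transient([1 + 0j] * 6, gains, band=(1, 5))
```

In this example three of the four in-band bins are strongly attenuated, so the whole frame is scaled down to a quarter of its level before it reaches the voice activity detector.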

Optionally, obtaining the spatially filtered audio data specifically includes:

Obtaining original audio data captured by a sound pickup device;

Applying a short-time Fourier transform to the original audio data to obtain the frequency-domain signal of each channel of the sound pickup device;

Applying spatial filtering to the frequency-domain signals of the channels to obtain the spatially filtered audio data;

The endpoint determination on the second filtered data using the first filtered data specifically includes:

Applying the inverse short-time Fourier transform to the first filtered data and the second filtered data;

Performing endpoint determination on the processed second filtered data using the processed first filtered data, and processing the processed second filtered data according to the endpoint determination result.

An embodiment of the present invention further provides an audio data processing method applied to a first terminal device, the method including:

Obtaining spatially filtered audio data;

Sending the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs endpoint determination on the second filtered data using the first filtered data and processes the second filtered data according to the endpoint determination result.

Applying the first Wiener filter to the audio data Y(jω) according to the formula [formula omitted] to obtain the first filtered data Y_VAD(jω);

Applying interference removal to the first filtered data;

Counting the frequency bins of the audio data within the preset frequency range to obtain a first value; counting the frequency bins whose gain magnitude is within a first preset gain threshold to obtain a second value;

Obtaining original audio data captured by a sound pickup device;

Applying spatial filtering to the frequency-domain signals of the channels to obtain the spatially filtered audio data.

Optionally, sending the first filtered data to the second terminal device specifically includes:

Downsampling the first filtered data before sending it to the second terminal device.
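As a rough illustration of the downsampling option, the sketch below reduces the sample rate of time-domain samples by an integer factor. The text does not name a downsampling method, so averaging each group of samples before keeping one value per group is an assumption chosen here as a crude guard against aliasing; the function name `downsample` and the factor are hypothetical.

```python
def downsample(samples, factor=2):
    """Average each group of `factor` samples and keep one value per group.

    ASSUMPTION: block averaging stands in for whatever low-pass filter a real
    implementation would apply before decimation.
    """
    usable = len(samples) - len(samples) % factor  # drop an incomplete tail
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


reduced = downsample([1.0, 3.0, 2.0, 4.0, 5.0], factor=2)
```

Because the first filtered data is only used for endpoint determination, a lower sample rate may be acceptable for it, which reduces the amount of data sent to the second terminal device.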

Optionally, sending the first filtered data and the second filtered data to the second terminal device specifically includes:

Packing and compressing the first filtered data and the second filtered data before sending them to the second terminal device.
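One possible way to pack and compress the two streams is sketched below using Python's standard `struct` and `zlib` modules. The payload layout (two little-endian 32-bit lengths followed by 32-bit floats) and the function names `pack_streams` and `unpack_streams` are assumptions made for illustration; the text does not specify a packet format or a compression algorithm.

```python
import struct
import zlib


def pack_streams(first, second):
    """Serialize two lists of float samples into one zlib-compressed payload."""
    # ASSUMPTION: header = two uint32 lengths; body = float32 samples.
    header = struct.pack("<II", len(first), len(second))
    body = struct.pack(f"<{len(first) + len(second)}f", *first, *second)
    return zlib.compress(header + body)


def unpack_streams(payload):
    """Inverse of pack_streams, as the second terminal device would run it."""
    raw = zlib.decompress(payload)
    n_first, n_second = struct.unpack_from("<II", raw, 0)
    values = struct.unpack_from(f"<{n_first + n_second}f", raw, 8)
    return list(values[:n_first]), list(values[n_first:])


payload = pack_streams([0.5, -0.25], [0.125, 0.0, 1.0])
first, second = unpack_streams(payload)
```

Sending both streams in one payload keeps them aligned in time, so the endpoints found in the first stream map directly onto the second.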

An audio data processing device provided by an embodiment of the present invention includes: a data acquisition module, a first filter module, a second filter module and a first processing module;

The data acquisition module is configured to obtain spatially filtered audio data;

The first filter module is configured to apply a first Wiener filter to the audio data to obtain first filtered data;

The second filter module is configured to apply a second Wiener filter to the audio data to obtain second filtered data, where the first Wiener filter suppresses noise more strongly than the second Wiener filter;

The first processing module is configured to perform endpoint determination on the second filtered data using the first filtered data, and to process the second filtered data according to the endpoint determination result.

Optionally, the first filter module is specifically configured to:

Apply the first Wiener filter to the audio data using the M-th power of a strength factor to obtain the first filtered data;

The second filter module is specifically configured to:

Apply the second Wiener filter to the audio data using the N-th power of the strength factor to obtain the second filtered data;

Where M is greater than N.

Optionally, the first filter module includes a first processing submodule;

The first processing submodule is configured to apply the first Wiener filter to the audio data Y(jω) according to the formula [formula omitted] to obtain the first filtered data Y_VAD(jω);

The second filter module includes a second processing submodule;

The second processing submodule is configured to apply the second Wiener filter to the audio data Y(jω) according to the formula [formula omitted] to obtain the second filtered data Y_ASR(jω);

Where M = 1, N = 1/2, the strength factor is [formula omitted], P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, and EPS is a very small value.

Optionally, the device further includes a second processing module;

The second processing module is configured to apply interference removal to the first filtered data; the interference removal includes one or more of transient noise elimination, noise reduction, and noise smoothing;

The first processing module is specifically configured to perform endpoint determination on the second filtered data using the first filtered data processed by the second processing module, and to perform audio recognition on the second filtered data according to the endpoint determination result.

Optionally, when the interference removal includes the transient noise elimination, the second processing module specifically includes a noise reduction submodule;

The noise reduction submodule is specifically configured to:

Obtain the gain of the first Wiener filter at each frequency bin of the audio data within a preset frequency range; count the frequency bins of the audio data within the preset frequency range to obtain a first value; count the frequency bins whose gain magnitude is within a first preset gain threshold to obtain a second value; obtain a transient elimination gain from the first value and the second value; and eliminate the transient noise in the first filtered data according to the transient elimination gain.

Optionally, the data acquisition module is specifically configured to:

Obtain original audio data captured by a sound pickup device; apply a short-time Fourier transform to the original audio data to obtain the frequency-domain signal of each channel of the sound pickup device; and apply spatial filtering to the frequency-domain signals of the channels to obtain the spatially filtered audio data;

The first processing module is specifically configured to:

Apply the inverse short-time Fourier transform to the first filtered data and the second filtered data; perform endpoint determination on the processed second filtered data using the processed first filtered data; and perform audio recognition on the processed second filtered data according to the endpoint determination result.

An embodiment of the present invention further provides an audio data processing device applied to a first terminal device, including: a data acquisition module, a first filter module, a second filter module and a data transmission module;

The data transmission module is configured to send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs endpoint determination on the second filtered data using the first filtered data and processes the second filtered data according to the endpoint determination result.

Optionally, the first filter module is specifically configured to:

The second filter module is specifically configured to:

Where M is greater than N.

Optionally, the first filter module includes a first processing submodule;

The second filter module includes a second processing submodule;

Optionally, the device further includes a data processing module;

The data processing module is configured to apply interference removal to the first filtered data; the interference removal includes one or more of transient noise elimination, noise reduction, and noise smoothing;

The data transmission module is specifically configured to send the first filtered data processed by the data processing module and the second filtered data to the second terminal device, so that the second terminal device performs endpoint determination on the second filtered data using the processed first filtered data and processes the second filtered data according to the endpoint determination result.

Optionally, when the interference removal includes the transient noise elimination, the data processing module specifically includes a noise reduction submodule;

The noise reduction submodule is specifically configured to:

Optionally, the data acquisition module is specifically configured to:

Obtain original audio data captured by a sound pickup device; apply a short-time Fourier transform to the original audio data to obtain the frequency-domain signal of each channel of the sound pickup device; and apply spatial filtering to the frequency-domain signals of the channels to obtain the spatially filtered audio data.

Optionally, the data transmission module is specifically configured to:

Optionally, the data transmission module is further specifically configured to:

An embodiment of the present invention further provides an audio data processing terminal, including: a memory and a processor;

The memory is configured to store program code and to transmit the program code to the processor;

The processor is configured to execute, according to the instructions in the program code, the audio data processing method of any one of claims 1 to 6.

An embodiment of the present invention further provides an audio data processing system, including: a first device and a second device;

The first device is configured to obtain spatially filtered audio data; to apply a first Wiener filter and a second Wiener filter to the audio data to obtain first filtered data and second filtered data, where the first Wiener filter suppresses noise more strongly than the second Wiener filter; and to send the first filtered data and the second filtered data to the second device;

The second device is configured to perform endpoint determination on the second filtered data using the first filtered data, and to perform audio recognition on the second filtered data according to the endpoint determination result.

Compared with the prior art, the present invention has at least the following advantages:

In the embodiments of the present invention, Wiener filtering of different strengths is applied to the spatially filtered audio data, yielding two streams of filtered data with different degrees of noise suppression: the first filtered data, in which noise is suppressed more strongly, and the second filtered data, in which noise is suppressed less strongly. Endpoint determination is then performed on the second filtered data using the first filtered data, and the second filtered data is processed according to the result. Because noise is suppressed more strongly in the first filtered data, the influence of interference on voice activity detection is largely avoided, which improves the precision of voice activity detection and the response speed of automatic speech recognition. Because noise is suppressed less strongly in the second filtered data, the loss of recognition accuracy that stronger suppression would cause is avoided. By applying Wiener filtering of different strengths according to the different requirements of voice activity detection and automatic speech recognition, the embodiments preserve the accuracy of automatic speech recognition while keeping interference from affecting voice activity detection, so the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response to voice commands is faster, and the user experience improves.

Description of the drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of existing processed audio data;

Fig. 2 is a flow diagram of an audio data processing method provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the original audio data, first filtered data and second filtered data in a specific embodiment of the invention;

Fig. 4 is a flow diagram of another audio data processing method provided by an embodiment of the present invention;

Fig. 5 is a flow diagram of the transient noise elimination in a specific embodiment of the invention;

Fig. 6 is a flow diagram of yet another audio data processing method provided by an embodiment of the present invention;

Fig. 7 is a structure diagram of an audio data processing device provided by an embodiment of the present invention;

Fig. 8 is a structure diagram of another audio data processing device provided by an embodiment of the present invention.

Specific embodiment

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

For ease of understanding, several technical terms used in the embodiments are introduced first.

Voice activity detection (VAD), also called speech endpoint detection or speech boundary detection, detects the presence or absence of target speech in a noisy environment. It is commonly used in speech processing systems such as speech coding, speech enhancement and speech recognition.

Automatic speech recognition (ASR) is a technology that converts human speech into text.

In current voice interaction, array signal processing usually outputs a single stream of audio data directly. This stream serves both as ASR data for speech recognition and as VAD data for determining where ASR processing starts and ends (endpoint determination). However, the output audio data contains considerable interference, which affects the accuracy of VAD processing. See Fig. 1, which shows an example of existing processed audio data. As Fig. 1 shows, the voice command actually entered by the user starts at node A and ends at node B, but VAD processing on this data places the end of the command at node B′. Only at node B′ is the command judged to have ended, and ASR processing is then run on the data between node A and node B′. This slows ASR processing and delays the response to the voice command.

The inventors found in their research that ASR processing needs the input audio data to have as little nonlinear distortion as possible, whereas VAD processing needs interference to be suppressed more strongly, and stronger suppression of interference introduces more nonlinear distortion. If the processed data suppresses noise strongly, nonlinear effects distort the audio data and reduce recognition accuracy. If recognition accuracy is preserved, the data still contains serious interference that affects the VAD endpoint determination, causing VAD errors and slowing the response to voice commands. That is, a single input cannot simultaneously satisfy the noise suppression requirements of ASR processing and VAD processing.

For this reason, embodiments of the present invention provide an audio data processing method, device, terminal and system. According to the different noise suppression requirements of ASR and VAD processing, Wiener filtering of different strengths is applied to the spatially filtered audio data, yielding two streams of data with different degrees of noise suppression. The more strongly suppressed stream is used for VAD processing, that is, for endpoint determination of the voice command, and the less strongly suppressed stream is used for ASR processing. While preserving ASR accuracy, this minimizes the signal components that hinder VAD processing and improves endpoint accuracy, thereby speeding up ASR processing and the response to voice commands.

Based on the above idea, and to make the objects, features and advantages of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the drawings.

Refer to Fig. 2, which is a flow diagram of an audio data processing method provided by an embodiment of the present invention.

It should first be noted that the audio data processing method provided by the embodiments can be applied to any terminal device that is configured with, or connected to, a multi-channel sound-collecting microphone for receiving voice commands from the user. For example, the terminal device may be a smartphone, tablet computer, PC or server; further examples are not enumerated here.

The audio data processing method provided by the embodiment specifically includes the following steps S201 to S203.

S201: Obtain the spatially filtered audio data Y(jω).

In practice, the user's speech (for example a voice command) is generally captured by multiple microphones, and the interference in the captured original audio data differs across spatial positions and microphone channels. To remove the noise interference according to these spatial differences, spatial filtering is applied to the original audio data, giving audio data with interference preliminarily removed (that is, the spatially filtered audio data Y(jω)).

It should also be noted that, before spatial filtering, the user's speech may first be preprocessed, for example by removing the DC offset and applying a window function. A short-time Fourier transform (STFT) is then applied to the preprocessed audio to obtain the frequency-domain signals X₁(jω), X₂(jω), ..., X_N(jω) of the user's speech. Spatial filtering of X₁(jω), X₂(jω), ..., X_N(jω) then yields Y(jω).

That is, in a possible implementation of the embodiment, step S201 may specifically include: obtaining original audio data captured by a sound pickup device; applying a short-time Fourier transform to the original audio data to obtain the frequency-domain signals X₁(jω), X₂(jω), ..., X_N(jω) of the channels of the sound pickup device; and applying spatial filtering to the frequency-domain signals of the channels to obtain the spatially filtered audio data Y(jω).
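The preprocessing and transform steps can be sketched for a single channel as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the frame length, hop size, Hann window and the naive DFT are choices made here for brevity, and the function names `hann`, `dft` and `stft` are hypothetical. A real system would use an FFT and run the same routine once per microphone channel.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (ASSUMPTION: window type is not given)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive O(n^2) DFT returning bins 0..n/2; adequate only for a sketch."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def stft(samples, frame_len=8, hop=4):
    """Remove the DC offset, window each frame, and return one spectrum per frame."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]  # DC offset removal
    win = hann(frame_len)
    frames = []
    for start in range(0, len(centered) - frame_len + 1, hop):
        frame = [centered[start + i] * win[i] for i in range(frame_len)]
        frames.append(dft(frame))
    return frames


# One channel X_i(jω): a tone at bin 2 riding on a DC offset of 1.0.
signal = [1.0 + math.sin(2.0 * math.pi * 2 * t / 8) for t in range(16)]
spectra = stft(signal)
```

The spatial filter would then combine the per-channel spectra frame by frame into the single spectrum Y(jω).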

In a specific implementation, Y(jω) can be obtained with generalized sidelobe cancellation (General Sidelobe Cancellation, GSC) or a minimum variance distortionless response (Minimum Variance Distortionless Response, MVDR) beamformer; the specific processing is not described here.

S202: Apply a first Wiener filter and a second Wiener filter to the spatially filtered audio data to obtain first filtered data and second filtered data respectively, where the first Wiener filter suppresses noise more strongly than the second Wiener filter.

In the embodiments of the present invention, ASR processing and VAD processing place different demands on the degree of noise suppression in their input audio data. To meet both and ensure the accuracy of each, Wiener filtering of different strengths is applied to the spatially filtered audio data Y(jω) for ASR processing and for VAD processing. The first Wiener filter, with stronger noise suppression, removes as far as possible the signal components in Y(jω) that hinder VAD processing, reducing the interference with VAD processing in the first filtered data and improving VAD accuracy. The second Wiener filter, with weaker noise suppression, avoids the effect of distortion on ASR processing and preserves recognition accuracy. In this way the voice command is recognized accurately, VAD accuracy improves, and the voice command receives a timely response.

In practice, Wiener filtering with different degrees of noise suppression can be applied to the spatially filtered audio data by adjusting the strength factor of the Wiener filter, giving the first filtered data and the second filtered data.

In some possible implementations of the embodiment, step S202 may specifically include the following:

Applying the first Wiener filter to the spatially filtered audio data Y(jω) using the M-th power of a strength factor to obtain the first filtered data; applying the second Wiener filter to the spatially filtered audio data Y(jω) using the N-th power of the strength factor to obtain the second filtered data; where M is greater than N.

It should be noted that the strength factor determines how strongly the Wiener filter suppresses noise: the larger the strength factor, the stronger the suppression. In the embodiments, therefore, applying the first and second Wiener filters with different powers of the same strength factor yields first and second filtered data with different degrees of noise suppression.

As an example, the strength factor may be set to [formula omitted], with M and N taken as 1 and 1/2 respectively. P_yy(jω) is the power spectrum of the audio data and can be obtained from formula (1) below; P_xx(jω) is the average power spectrum of the original audio data before spatial filtering and can be obtained from formula (2) below; EPS is a very small value.

P_yy(jω) = α·P_yy(jω) + (1 − α)·Y(jω)·Y*(jω)    (1)

In practice, stable power spectra P_xx(jω) and P_yy(jω) can be obtained by first-order smoothing.
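The recursive update in formula (1) can be sketched per frequency bin as below. The smoothing constant α is not given in the text, so the values used here are assumptions, and the function name `smooth_power_spectrum` is hypothetical.

```python
def smooth_power_spectrum(prev_pyy, spectrum, alpha=0.9):
    """One update of formula (1): P_yy <- alpha*P_yy + (1 - alpha)*Y*conj(Y)."""
    return [alpha * p + (1.0 - alpha) * (y * y.conjugate()).real
            for p, y in zip(prev_pyy, spectrum)]


# Two frames with two frequency bins each; alpha = 0.5 is an illustrative value.
pyy = [0.0, 0.0]
for frame in ([1 + 1j, 2 + 0j], [1 - 1j, 0 + 2j]):
    pyy = smooth_power_spectrum(pyy, frame, alpha=0.5)
```

A value of α closer to 1 gives a smoother but slower-reacting estimate; a smaller α tracks changes faster at the cost of more fluctuation.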

The first Wiener filter is then applied to the audio data Y(jω) to obtain the first filtered data Y_VAD(jω), which may specifically use formula (3) [formula omitted].

The second Wiener filter is applied to the audio data Y(jω) to obtain the second filtered data Y_ASR(jω), which may specifically use formula (4) [formula omitted].
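The two filters can be sketched as one factor raised to two different powers. The exact expression for the strength factor is not preserved in this text, so the form used below, P_yy / (P_xx + EPS) capped at 1, is an assumption inferred from the symbols defined above and should not be read as the patent's formula; `dual_wiener` is a hypothetical name.

```python
EPS = 1e-12  # very small value guarding against division by zero


def dual_wiener(spectrum, pyy, pxx, m=1.0, n=0.5):
    """Return (Y_VAD, Y_ASR): the same strength factor raised to powers m > n."""
    y_vad, y_asr = [], []
    for y, p_out, p_in in zip(spectrum, pyy, pxx):
        # ASSUMPTION: strength factor = P_yy / (P_xx + EPS), limited to [0, 1].
        factor = min(p_out / (p_in + EPS), 1.0)
        y_vad.append((factor ** m) * y)  # stronger suppression, for VAD
        y_asr.append((factor ** n) * y)  # weaker suppression, for ASR
    return y_vad, y_asr


y_vad, y_asr = dual_wiener([2 + 0j, 4 + 0j], pyy=[1.0, 4.0], pxx=[4.0, 4.0])
```

Under this assumption the factor lies between 0 and 1, so the larger exponent M = 1 attenuates a noisy bin more than N = 1/2 does, while a bin the spatial filter left untouched passes through both filters almost unchanged.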

S203: Perform endpoint determination on the second filtered data using the first filtered data, and process the second filtered data according to the endpoint determination result.

It can be understood that, because the influences on VAD processing have been removed from the first filtered data, the first filtered data can be used to accurately locate the target speech (for example a voice command) in the second filtered data, that is, to perform endpoint determination on the second filtered data. Processing the second filtered data (for example by automatic speech recognition) according to the endpoint determination result then improves the response speed of data processing while preserving its accuracy.
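The idea of finding endpoints in one stream and cutting the other can be sketched as below. The text does not specify the voice activity detector, so a simple frame-energy threshold with a hangover counter is swapped in purely for illustration; the threshold, hangover length and the names `find_endpoints` and `cut_for_asr` are assumptions.

```python
def find_endpoints(vad_frames, threshold=0.01, hangover=2):
    """Return (start, end) frame indices of speech in the first filtered data.

    ASSUMPTION: an energy threshold with a hangover counter stands in for the
    unspecified voice activity detector.
    """
    start, end, silent = None, None, 0
    for i, frame in enumerate(vad_frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = i
            end, silent = i + 1, 0
        elif start is not None:
            silent += 1
            if silent >= hangover:
                break  # speech has ended; stop without waiting for more audio
    return start, end


def cut_for_asr(asr_frames, vad_frames):
    """Slice the second filtered data at the endpoints found in the first."""
    start, end = find_endpoints(vad_frames)
    return [] if start is None else asr_frames[start:end]


vad = [[0.0, 0.0], [0.5, -0.5], [0.4, 0.4], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
asr = [[0.1, 0.0], [0.6, -0.5], [0.5, 0.4], [0.1, 0.1], [0.0, 0.1], [0.1, 0.0]]
segment = cut_for_asr(asr, vad)
```

In the example the residual noise in the weakly filtered stream would keep a naive detector from seeing silence promptly, whereas the strongly filtered stream falls to zero right after the command, so the segment handed to recognition ends where the speech does.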

It should also be noted that, in a specific implementation, before step S203 the processed frequency-domain signals need to undergo the inverse short-time Fourier transform (ISTFT) to obtain time-domain signals; the first filtered data and the second filtered data are then obtained by windowed overlap-add, after which stop determination and data processing are performed.
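The inverse transform with windowed overlap-add can be sketched as follows. The window type and overlap are not given in the text; this sketch assumes a periodic Hann synthesis window with 50% overlap and a rectangular analysis window, for which the overlapping windows sum to 1. The naive inverse DFT is used only to keep the example self-contained.

```python
import cmath
import math


def inverse_dft(spectrum):
    """Naive inverse DFT of one frame; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def overlap_add(frame_spectra, hop):
    """Rebuild a time-domain signal from per-frame spectra by windowed overlap-add."""
    frame_len = len(frame_spectra[0])
    # Periodic Hann window (assumed); shifted copies at hop = frame_len / 2 sum to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    out = [0.0] * (hop * (len(frame_spectra) - 1) + frame_len)
    for i, spectrum in enumerate(frame_spectra):
        frame = inverse_dft(spectrum)
        for t in range(frame_len):
            out[i * hop + t] += window[t] * frame[t]
    return out
```

In this sketch only samples covered by two overlapping frames are reconstructed at full amplitude; the first and last half frames are tapered by the window.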

The advantages of the above embodiment are described in detail below with reference to a concrete scenario. Referring to Fig. 3, the figure shows the original audio data, the first filtered data and the second filtered data in a specific embodiment of the present invention. The original audio data carries a voice instruction and interference. The first filtered data is obtained by applying the first Wiener filtering, which suppresses noise more strongly, to the original audio data; the second filtered data is obtained by applying the second Wiener filtering, which suppresses noise less strongly, to the original audio data. As can be seen from Fig. 3, using the second filtered data ensures the accuracy of speech recognition, while using the first Wiener filtering improves the accuracy of stop determination, so that the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response speed to voice instructions is improved, and a better user experience is provided.

Referring to Fig. 4, the figure is a schematic flowchart of another audio data processing method provided by an embodiment of the present invention.

To further improve the accuracy of stop determination, the embodiment of the present invention further includes, before step S203:

S204: Perform interference removal processing on the first filtered data.

In the embodiment of the present invention, the interference removal processing may specifically include one or more of transient noise elimination, noise reduction and noise smoothing. In a specific implementation, transient noise elimination, noise reduction and noise smoothing may be performed on the first filtered data one after another.

Transient noise elimination, noise reduction and noise smoothing are illustrated below.

First, transient noise elimination, as shown in Fig. 5, specifically includes the following steps S501-S504.

S501: Obtain the gain of the first Wiener filtering corresponding to each frequency bin of the audio data within a preset frequency range.

In the embodiment of the present application, the gain of the first Wiener filtering is the value of the above strength coefficient at each frequency bin. As an example, the gain of the first Wiener filtering corresponding to each frequency bin is the value of the strength coefficient (whose expression is not reproduced in this text) at that frequency bin.

S502: Count the number of frequency bins of the audio data within the preset frequency range to obtain a first value; count the number of frequency bins whose gain magnitude is within a preset gain threshold to obtain a second value.

S503: Obtain a transient elimination gain according to the first value and the second value.

It should be noted that high-frequency attenuation and reflection make the high frequencies more random. To obtain better stability, only the proportion of gains below a certain threshold within a certain frequency range is counted, and noise smoothing is performed on the first filtered data on this basis.

In a specific implementation, the preset frequency range may be 0-2000 Hz, and the preset gain threshold may be 0.3.

In one example, the transient elimination gain, denoted gain, can be obtained according to formula (5), which is not reproduced in this text; in that formula, all_bin is the first value and count_bin is the second value.

S504: Eliminate the transient noise in the first filtered data according to the transient elimination gain.

In the embodiment of the present invention, the transient noise in the first filtered data can be eliminated by applying the transient elimination gain to the first filtered data.
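Steps S501-S504 can be sketched as follows. Formula (5) is not available in this text, so the mapping from the two counts to a gain used here, `1 - count_bin / all_bin`, is an assumption chosen so that a frame in which most low-frequency bins are heavily attenuated is itself attenuated. The function names and the `bin_hz` parameter (frequency spacing between bins) are likewise hypothetical.

```python
def transient_elimination_gain(wiener_gains, bin_hz, max_hz=2000.0, gain_threshold=0.3):
    """Derive one frame-level gain from the per-bin gains of the first Wiener filter."""
    # S501: keep only the bins inside the preset frequency range (0-2000 Hz here).
    in_band = [g for k, g in enumerate(wiener_gains) if k * bin_hz <= max_hz]
    # S502: first value = all bins in range; second value = bins with a small gain.
    all_bin = len(in_band)
    count_bin = sum(1 for g in in_band if abs(g) <= gain_threshold)
    # S503: assumed stand-in for formula (5).
    return 1.0 - count_bin / all_bin if all_bin else 1.0


def apply_gain(spectrum, gain):
    """S504: scale every bin of the first filtered data by the frame-level gain."""
    return [gain * y for y in spectrum]


gain = transient_elimination_gain([0.1, 0.2, 0.9, 0.8, 0.1], bin_hz=1000.0)
```

Restricting the count to the low-frequency range follows the remark above that high-frequency bins are too random to give a stable statistic.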

Second, noise reduction may use any noise reduction algorithm, which is not described in detail here.

Finally, noise smoothing can be implemented by performing noise estimation on the first filtered data.

In one example, the first-order smoothed power spectrum P_noise(jω) is first calculated for each windowed frame of the first filtered data, specifically using formula (1) above. Then, the first-order smoothed power spectra of the frames of the first filtered data are compared and the historical minimum power spectrum minP_noise(jω) is updated according to formula (6), which is not reproduced in this text and in which β and ρ are coefficients.

For the first several frames (for example, 50 frames) of the first filtered data, the noise estimate of a frame is the first-order smoothed power spectrum P_noise(jω) of that frame; for each frame after these initial frames, the noise estimate is the current historical minimum power spectrum minP_noise(jω). The noise estimate of each frame is then superimposed on that frame of the first filtered data, which makes the noise in the first filtered data stationary and avoids VAD errors caused by abrupt changes in noise.
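The noise estimation rule described above can be sketched as follows. Since formula (6) and its coefficients β and ρ are not available in this text, the sketch replaces the weighted update with a plain running minimum; this substitution, the function name `noise_estimates`, and the list-based data layout are assumptions. The superposition of the estimate onto each frame is not shown.

```python
def noise_estimates(smoothed_spectra, start_frames=50):
    """Return one noise power estimate per frame.

    smoothed_spectra -- first-order smoothed power spectrum of every frame
    start_frames     -- number of initial frames that use their own spectrum
    """
    estimates = []
    min_power = None
    for index, power in enumerate(smoothed_spectra):
        if min_power is None:
            min_power = list(power)
        else:
            # Plain running minimum per bin; the weighted update of formula (6)
            # with coefficients beta and rho is not reproduced here.
            min_power = [min(m, p) for m, p in zip(min_power, power)]
        if index < start_frames:
            estimates.append(list(power))      # initial frames: the frame's own spectrum
        else:
            estimates.append(list(min_power))  # later frames: historical minimum
    return estimates
```

Using the historical minimum after the initial frames keeps the estimate from jumping when a loud interference appears, which is the stability property the paragraph above relies on.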

Based on the audio data processing method provided by the above embodiment, the embodiment of the present invention further provides another audio data processing method, in which a first terminal device (for example, a smartphone, a tablet computer or a server) is responsible for processing the original audio data to obtain the first filtered data and the second filtered data, and a second terminal device (for example, a server) is responsible for stop determination and data processing. This not only keeps the amount of computation on the first terminal device from becoming excessive, but also allows a more complex VAD algorithm to be used on the second terminal device to obtain more accurate VAD results.

Specifically, referring to Fig. 6, the figure is a schematic flowchart of another audio data processing method provided by an embodiment of the present invention.

An audio data processing method provided by an embodiment of the present invention is applied to a first terminal device and may specifically include the following steps S601-S603.

S601: Obtain the audio data after spatial filtering.

Optionally, step S601 specifically includes: obtaining original audio data collected by a sound pickup device; performing a short-time Fourier transform on the original audio data to obtain the frequency-domain signal corresponding to each channel of the sound pickup device; and performing spatial filtering on the frequency-domain signals corresponding to the channels to obtain the audio data after spatial filtering.

S602: Perform the first Wiener filtering and the second Wiener filtering on the audio data to obtain the first filtered data and the second filtered data respectively, where the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering.

In a possible implementation of the embodiment of the present invention, step S602 may specifically include:

performing the first Wiener filtering on the audio data using the M-th power of the strength coefficient to obtain the first filtered data; and performing the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data, where M is greater than N.

Optionally, performing the first Wiener filtering on the audio data using the M-th power of the strength coefficient to obtain the first filtered data specifically includes:

performing the first Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the first filtered data Y_VAD(jω).

Performing the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data specifically includes:

performing the second Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the second filtered data Y_ASR(jω).

Here, M = 1 and N = 1/2; the expression for the strength coefficient is not reproduced in this text; P_yy(jω) is the power spectrum of the audio data; P_xx(jω) is the average power spectrum of the original audio data before spatial filtering; and EPS is a very small value.

It can be understood that steps S601-S602 in this embodiment are similar to steps S201-S202 in the above embodiment; for details, refer to the related description above, which is not repeated here.

In a possible implementation of the embodiment of the present invention, the method further includes, before step S603: performing interference removal processing on the first filtered data. Specifically, the interference removal processing may include one or more of transient noise elimination, noise reduction and noise smoothing.

As an example, the transient noise elimination may specifically include:

obtaining the gain of the first Wiener filtering corresponding to each frequency bin of the audio data within a preset frequency range; counting the number of frequency bins of the audio data within the preset frequency range to obtain a first value; counting the number of frequency bins whose gain magnitude is within a first preset gain threshold to obtain a second value; obtaining a transient elimination gain according to the first value and the second value; and eliminating the transient noise in the first filtered data according to the transient elimination gain.

It can be understood that the interference removal processing in this embodiment is similar to the interference removal processing described in the above embodiment; for details, refer to the related description, which is not repeated here.

S603: Send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs stop determination on the second filtered data using the first filtered data and performs data processing on the second filtered data according to the stop determination result.

It can be understood that filtering of the audio data, and stop determination and data processing using the filtered data, are performed by different terminal devices (or servers). This not only ensures the accuracy and speed of the filtering, but also allows more complex VAD and ASR algorithms to be used, so that accurate stop determination results and speech recognition results are obtained.

In a possible implementation of the embodiment of the present invention, to reduce the amount of data transmitted and increase the transmission speed, and thereby improve the response speed to voice instructions, uploading the first filtered data to the server may specifically include:

performing downsampling on the first filtered data and then sending it to the second terminal device.

As an example, the sampling rate of the first filtered data can be reduced from 16 kHz to 8 kHz.
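Halving the sampling rate, as in the 16 kHz to 8 kHz example, can be sketched as follows. The text does not describe the downsampling filter; the 3-tap smoothing used here is a deliberately crude anti-aliasing stand-in, and the function name `downsample_by_two` is hypothetical. A production system would normally use a properly designed low-pass filter or resampler.

```python
def downsample_by_two(samples):
    """Halve the sample rate: smooth with a 3-tap low-pass, then keep every second sample."""
    if not samples:
        return []
    # Repeat the edge samples so the filter is defined at both ends.
    padded = [samples[0]] + list(samples) + [samples[-1]]
    smoothed = [
        0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]
    return smoothed[::2]
```

Only the first filtered data is downsampled in the text above, which fits its role: it is used for the stop determination, whereas the second filtered data is the stream on which recognition is performed.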

Optionally, uploading the first filtered data and the second filtered data to the server specifically includes:

packing and compressing the first filtered data and the second filtered data and then sending them to the second terminal device.
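Packing and compressing the two streams before transmission can be sketched as follows. The text specifies neither a container format nor a codec, so everything here is an assumption: 16-bit little-endian PCM samples, a header holding the two stream lengths, and general-purpose `zlib` compression. The function names `pack_streams` and `unpack_streams` are hypothetical.

```python
import struct
import zlib


def pack_streams(vad_samples, asr_samples):
    """Pack two 16-bit PCM streams into one compressed payload."""
    # Header: number of samples in each stream, as two unsigned 32-bit integers.
    header = struct.pack("<II", len(vad_samples), len(asr_samples))
    body = struct.pack(f"<{len(vad_samples)}h", *vad_samples)
    body += struct.pack(f"<{len(asr_samples)}h", *asr_samples)
    return zlib.compress(header + body)


def unpack_streams(payload):
    """Inverse of pack_streams, as it might run on the second terminal device."""
    raw = zlib.decompress(payload)
    n_vad, n_asr = struct.unpack_from("<II", raw, 0)
    vad = list(struct.unpack_from(f"<{n_vad}h", raw, 8))
    asr = list(struct.unpack_from(f"<{n_asr}h", raw, 8 + 2 * n_vad))
    return vad, asr
```

Storing the stream lengths in the header lets the receiver split the payload even when the first stream has been downsampled and is therefore shorter than the second.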

In the embodiment of the present invention, the first terminal device applies Wiener filtering of different strengths to the audio data after spatial filtering, obtaining two streams of filtered data with different degrees of noise suppression, namely the first filtered data with stronger noise suppression and the second filtered data with weaker noise suppression, and sends them to the second terminal device. The second terminal device then performs stop determination on the second filtered data using the first filtered data and performs data processing on the second filtered data according to the determination result. Because noise is suppressed more strongly in the first filtered data, the influence of interference on voice activity detection is largely avoided, which improves the precision of voice activity detection and the response speed of automatic speech recognition. Because noise is suppressed less strongly in the second filtered data, the loss of speech recognition accuracy caused by heavy noise suppression is avoided. Filtering of the audio data, and stop determination and data processing using the filtered data, are performed by the first terminal device and the second terminal device respectively; this not only ensures the accuracy and speed of the filtering but also allows more complex VAD and ASR algorithms to be used, so that accurate stop determination results and speech recognition results are obtained. According to the different requirements of voice activity detection and automatic speech recognition, the embodiment of the present invention applies Wiener filtering of different degrees, which not only ensures the accuracy of automatic speech recognition but also avoids the influence of interference on voice activity detection, so that the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response speed to voice instructions is improved, and a better user experience is provided.

Based on the audio data processing method provided by the above embodiment, the embodiment of the present invention further provides an audio data processing apparatus.

Referring to Fig. 7, the figure is a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present invention.

An audio data processing apparatus provided by an embodiment of the present invention includes: a data acquisition module 701, a first filter module 702, a second filter module 703 and a first processing module 704;

the data acquisition module 701 is configured to obtain the audio data after spatial filtering;

the first filter module 702 is configured to perform the first Wiener filtering on the audio data obtained by the data acquisition module 701 to obtain the first filtered data;

the second filter module 703 is configured to perform the second Wiener filtering on the audio data obtained by the data acquisition module 701 to obtain the second filtered data, where the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering;

the first processing module 704 is configured to perform stop determination on the second filtered data using the first filtered data, and to perform data processing on the second filtered data according to the stop determination result.

In a possible implementation of the embodiment of the present invention, the first filter module 702 is specifically configured to perform the first Wiener filtering on the audio data using the M-th power of the strength coefficient to obtain the first filtered data;

the second filter module 703 is specifically configured to perform the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data, where M is greater than N.

In a possible implementation of the embodiment of the present invention, the first filter module 702 includes a first processing submodule;

the first processing submodule is configured to perform the first Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the first filtered data Y_VAD(jω);

the second filter module 703 includes a second processing submodule;

the second processing submodule is configured to perform the second Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the second filtered data Y_ASR(jω);

In a possible implementation of the embodiment of the present invention, the audio data processing apparatus further includes a second processing module;

the second processing module is configured to perform interference removal processing on the first filtered data, the interference removal processing including one or more of transient noise elimination, noise reduction and noise smoothing;

the first processing module 704 is specifically configured to perform stop determination on the second filtered data using the first filtered data processed by the second processing module, and to perform audio recognition on the second filtered data according to the stop determination result.

Optionally, when the interference removal processing includes transient noise elimination, the second processing module specifically includes a noise reduction submodule;

the noise reduction submodule is specifically configured to:

In a possible implementation of the embodiment of the present invention, the data acquisition module 701 is specifically configured to:

obtain original audio data collected by a sound pickup device; perform a short-time Fourier transform on the original audio data to obtain the frequency-domain signal corresponding to each channel of the sound pickup device; and perform spatial filtering on the frequency-domain signals corresponding to the channels to obtain the audio data after spatial filtering;

the first processing module 704 is specifically configured to:

perform the inverse short-time Fourier transform on the first filtered data and the second filtered data; perform stop determination on the processed second filtered data using the processed first filtered data; and perform audio recognition on the processed second filtered data according to the stop determination result.

Based on the audio data processing method and apparatus provided by the above embodiments, the embodiment of the present invention further provides another audio data processing apparatus.

Referring to Fig. 8, the figure is a schematic structural diagram of another audio data processing apparatus provided by an embodiment of the present invention.

An audio data processing apparatus provided by an embodiment of the present invention is applied to a first terminal device and includes: a data acquisition module 801, a first filter module 802, a second filter module 803 and a data transmission module 804;

the data acquisition module 801 is configured to obtain the audio data after spatial filtering;

the first filter module 802 is configured to perform the first Wiener filtering on the audio data obtained by the data acquisition module 801 to obtain the first filtered data;

the second filter module 803 is configured to perform the second Wiener filtering on the audio data obtained by the data acquisition module 801 to obtain the second filtered data, where the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering;

the data transmission module 804 is configured to send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs stop determination on the second filtered data using the first filtered data and performs data processing on the second filtered data according to the stop determination result.

In a possible implementation of the embodiment of the present invention, the first filter module 802 is specifically configured to perform the first Wiener filtering on the audio data using the M-th power of the strength coefficient to obtain the first filtered data; the second filter module 803 is specifically configured to perform the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data, where M is greater than N.

Optionally, the first filter module 802 includes a first processing submodule, and the second filter module 803 includes a second processing submodule;

In a possible implementation of the embodiment of the present invention, the audio data processing apparatus further includes a data processing module;

the data processing module is configured to perform interference removal processing on the first filtered data, the interference removal processing including one or more of transient noise elimination, noise reduction and noise smoothing;

the data transmission module 804 is specifically configured to send the first filtered data processed by the data processing module and the second filtered data to the second terminal device, so that the second terminal device performs stop determination on the second filtered data using the processed first filtered data and performs data processing on the second filtered data according to the stop determination result.

Optionally, when the interference removal processing includes transient noise elimination, the data processing module specifically includes a noise reduction submodule;

the noise reduction submodule is specifically configured to:

In a possible implementation of the embodiment of the present invention, the data acquisition module 801 is specifically configured to:

obtain original audio data collected by a sound pickup device; perform a short-time Fourier transform on the original audio data to obtain the frequency-domain signal corresponding to each channel of the sound pickup device; and perform spatial filtering on the frequency-domain signals corresponding to the channels to obtain the audio data after spatial filtering.

In a possible implementation of the embodiment of the present invention, the data transmission module 804 is specifically configured to perform downsampling on the first filtered data and then send it to the second terminal device.

Optionally, the data transmission module 804 is further specifically configured to pack and compress the first filtered data and the second filtered data and then send them to the second terminal device.

Based on the audio data processing method and apparatus provided by the above embodiments, the embodiment of the present invention further provides an audio data processing terminal. The audio data processing terminal includes a memory and a processor. The memory is configured to store program code and to transmit the program code to the processor; the processor is configured to execute, according to instructions in the program code, the audio data processing method described in any one of the above embodiments.

Based on the audio data processing method and apparatus provided by the above embodiments, the embodiment of the present invention further provides an audio data processing system. The audio data processing system includes a first device and a second device;

the first device is configured to obtain the audio data after spatial filtering; it is further configured to perform the first Wiener filtering and the second Wiener filtering on the audio data to obtain the first filtered data and the second filtered data, where the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering; and it is further configured to send the first filtered data and the second filtered data to the second device;

the second device is configured to perform stop determination on the second filtered data using the first filtered data, and to perform audio recognition on the second filtered data according to the stop determination result.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. For the apparatus or system disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively brief; for relevant details, refer to the description of the method.

It should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. An audio data processing method, characterized in that the method comprises:

obtaining audio data after spatial filtering;

performing first Wiener filtering and second Wiener filtering on the audio data to obtain first filtered data and second filtered data respectively, wherein the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering;

performing stop determination on the second filtered data using the first filtered data, and performing data processing on the second filtered data according to the stop determination result.

2. The method according to claim 1, characterized in that performing the first Wiener filtering and the second Wiener filtering on the audio data to obtain the first filtered data and the second filtered data respectively specifically comprises:

performing the first Wiener filtering on the audio data using the M-th power of a strength coefficient to obtain the first filtered data; and performing the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data, wherein M is greater than N.

3. The method according to claim 2, characterized in that

performing the first Wiener filtering on the audio data using the M-th power of the strength coefficient to obtain the first filtered data specifically comprises:

performing the first Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the first filtered data Y_VAD(jω);

performing the second Wiener filtering on the audio data using the N-th power of the strength coefficient to obtain the second filtered data specifically comprises:

performing the second Wiener filtering on the audio data Y(jω) according to a formula (not reproduced in this text) to obtain the second filtered data Y_ASR(jω);

wherein M = 1, N = 1/2, the expression for the strength coefficient is not reproduced in this text, P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, and EPS is a very small value.

4. The method according to claim 1, characterized in that, before performing stop determination on the second filtered data using the first filtered data, the method further comprises:

performing interference removal processing on the first filtered data;

wherein the interference removal processing comprises one or more of transient noise elimination, noise reduction and noise smoothing.

5. The method according to claim 4, characterized in that the transient noise elimination specifically comprises:

obtaining the gain of the first Wiener filtering corresponding to each frequency bin of the audio data within a preset frequency range;

counting the number of frequency bins of the audio data within the preset frequency range to obtain a first value; counting the number of frequency bins whose gain magnitude is within a preset gain threshold to obtain a second value;

6. An audio data processing method, characterized in that it is applied to a first terminal device and the method comprises:

obtaining audio data after spatial filtering;

sending the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs stop determination on the second filtered data using the first filtered data and performs data processing on the second filtered data according to the stop determination result.

7. The method according to claim 6, characterized in that performing the first Wiener filtering and the second Wiener filtering on the audio data to obtain the first filtered data and the second filtered data respectively specifically comprises:

8. The method according to claim 6, characterized in that, before performing stop determination on the second filtered data using the first filtered data, the method further comprises:

performing interference removal processing on the first filtered data;

9. The method according to any one of claims 6-8, characterized in that sending the first filtered data and the second filtered data to the second terminal device specifically comprises:

performing downsampling on the first filtered data;

sending the processed first filtered data and the second filtered data to the second terminal device; or, packing and compressing the processed first filtered data and the second filtered data and then sending them to the second terminal device.

10. An audio data processing apparatus, characterized in that the apparatus comprises: a data acquisition module, a first filter module, a second filter module and a first processing module;

the second filter module is configured to perform second Wiener filtering on the audio data to obtain second filtered data, wherein the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering;

the first processing module is configured to perform stop determination on the second filtered data using the first filtered data, and to perform data processing on the second filtered data according to the stop determination result.

11. An audio data processing apparatus, characterized in that it is applied to a first terminal device and the apparatus comprises: a data acquisition module, a first filter module, a second filter module and a data transmission module;

the data transmission module is configured to send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device performs stop determination on the second filtered data using the first filtered data and performs data processing on the second filtered data according to the stop determination result.

12. An audio data processing terminal, characterized by comprising: a memory and a processor;

the processor is configured to execute, according to instructions in the program code, the audio data processing method according to any one of claims 1 to 6.

13. An audio data processing system, characterized by comprising: a first device and a second device;

the first device is configured to obtain audio data after spatial filtering; it is further configured to perform first Wiener filtering and second Wiener filtering on the audio data to obtain first filtered data and second filtered data, wherein the first Wiener filtering suppresses noise to a greater degree than the second Wiener filtering; and it is further configured to send the first filtered data and the second filtered data to the second device;

the second device is configured to perform stop determination on the second filtered data using the first filtered data, and to perform audio recognition on the second filtered data according to the stop determination result.