CN108053834B - Audio data processing method, device, terminal and system

Info

Publication number: CN108053834B
Application number: CN201711272872.3A
Authority: CN
Inventors: 陈日林; 陈孝良; 冯大航; 苏少炜; 常乐
Original assignee: Beijing Sound Intelligence Technology Co Ltd
Current assignee: Beijing Sound Intelligence Technology Co Ltd
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2020-02-21
Anticipated expiration: 2037-12-05
Also published as: CN108053834A

Abstract

The embodiment of the invention discloses an audio data processing method, device, terminal and system. The method comprises the following steps: obtaining spatially filtered audio data; performing first Wiener filtering on the audio data to obtain first filtered data; performing second Wiener filtering on the audio data to obtain second filtered data, wherein the noise suppression degree of the first Wiener filtering is greater than that of the second Wiener filtering; and using the first filtered data to determine a start node and an end node for processing the second filtered data, and processing the second filtered data according to the determination result. Because voice activity detection and automatic speech recognition have different noise suppression requirements, the embodiment applies Wiener filtering of a different strength for each. This preserves the accuracy of automatic speech recognition while keeping interference from affecting voice activity detection, so that the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response to voice instructions is faster, and the user experience is improved.

Description

Audio data processing method, device, terminal and system

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to an audio data processing method, an audio data processing device, an audio data processing terminal, and an audio data processing system.

Background

Intelligent voice interaction is an important branch of artificial intelligence. Free and accurate voice interaction frees people's hands and allows information to flow to, and control to be exercised over, the physical world more freely. Intelligent voice interaction mainly comprises near-field and far-field voice interaction. Near-field speech technology has developed greatly over the last two decades, and near-field speech recognition rates now approach those of humans, but the freer mode of interaction is far-field voice interaction. In far-field voice interaction there is a certain distance between the speaker and the interactive device. This gives the speaker more freedom of movement, but it also introduces much more background noise interference, which greatly increases the difficulty of voice activity detection and automatic speech recognition.

Voice activity detection means detecting, in a continuous piece of audio data, the portion actually spoken by the speaker. Accurate voice activity detection improves the accuracy of subsequent automatic speech recognition and also reduces the feedback delay of voice interaction: an execution result can be given as soon as the user's voice instruction ends, which brings a better experience to the user.

At present, original audio data is generally processed by array signal processing, and the processed audio data is then used for both voice activity detection and automatic speech recognition. However, the processed audio data still contains some interference, which can seriously reduce the accuracy of voice activity detection, cause detection errors, and in turn slow the response to voice commands.

Disclosure of Invention

In view of this, the present invention provides an audio data processing method, device, terminal and system, which can solve the prior-art problem that interference remaining in the processed audio data causes voice activity detection errors and thereby slows the response to voice commands.

The audio data processing method provided by the embodiment of the invention comprises the following steps:

obtaining spatially filtered audio data;

carrying out first wiener filtering on the audio data to obtain first filtering data; performing second wiener filtering on the audio data to obtain second filtering data, wherein the noise suppression degree of the first wiener filtering is greater than that of the second wiener filtering;

and judging a starting node and an ending node for processing second filtering data by using the first filtering data, and processing the second filtering data according to a judgment result.

Optionally, the audio data is respectively subjected to first wiener filtering and second wiener filtering to obtain first filtering data and second filtering data, and the method specifically includes:

performing the first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain the first filtered data; performing the second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain the second filtered data; wherein M is greater than N.

Optionally, performing the first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain the first filtered data specifically includes:

according to the formula

performing the first Wiener filtering on the audio data Y(jω) to obtain the first filtered data Y_VAD(jω);

performing the second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain the second filtered data specifically includes:

according to the formula

performing the second Wiener filtering on the audio data Y(jω) to obtain the second filtered data Y_ASR(jω);

Wherein, M is 1, N is 1/2, and the intensity coefficient is

P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, and EPS is a very small value.

Optionally, before determining the start node and the end node for processing the second filtered data by using the first filtered data, the method further includes:

performing interference removal processing on the first filtered data;

wherein the interference removal processing comprises one or more of transient noise cancellation processing, noise reduction processing and noise smoothing processing.

Optionally, the transient noise cancellation processing specifically includes:

obtaining the gain of the first wiener filter corresponding to each frequency domain point in a preset frequency domain range of the audio data;

counting the number of frequency domain points of the audio data in the preset frequency domain range to obtain a first value; counting the number of frequency domain points of which the gain amplitude is within a preset gain threshold value to obtain a second value;

obtaining a transient cancellation gain according to the first value and the second value;

and eliminating transient noise in the first filtering data according to the transient elimination gain.
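As a non-limiting illustration, the four counting steps above can be sketched as follows. The text does not state how the transient cancellation gain is derived from the first value and the second value, so this sketch assumes the gain is simply the ratio of the two counts, and it reads "within the gain threshold" as "gain magnitude at or above the threshold". The function names, the 300 Hz to 3400 Hz band and the 0.5 threshold are hypothetical choices made only for the example.

```python
def transient_cancellation_gain(gains, freqs_hz, band_hz=(300.0, 3400.0),
                                gain_threshold=0.5):
    """Return a scalar gain in [0, 1] for one frame.

    gains    : per-bin gains of the first Wiener filter
    freqs_hz : centre frequency of each bin, same length as gains
    ASSUMPTION: the gain is second_value / first_value; the source does
    not give this mapping.
    """
    # Gains of the bins that fall inside the preset frequency range.
    in_band = [abs(g) for g, f in zip(gains, freqs_hz)
               if band_hz[0] <= f <= band_hz[1]]
    first_value = len(in_band)           # number of bins in the range
    if first_value == 0:
        return 1.0                       # nothing to judge, leave frame as is
    # ASSUMPTION: "within the threshold" means magnitude >= threshold.
    second_value = sum(1 for g in in_band if g >= gain_threshold)
    return second_value / first_value


def suppress_transient(frame_bins, gains, freqs_hz, **kwargs):
    """Scale every bin of the first filtered data by the transient gain."""
    g_t = transient_cancellation_gain(gains, freqs_hz, **kwargs)
    return [g_t * y for y in frame_bins]
```

Under these assumptions, a frame in which only a few in-band bins keep a high Wiener gain is attenuated as a whole, while a frame with many high-gain bins passes almost unchanged.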

Optionally, the obtaining the spatially filtered audio data specifically includes:

acquiring original audio data acquired by a recording device;

after short-time Fourier transform is carried out on the original audio data, a frequency domain signal corresponding to each channel in the recording equipment is obtained;

performing spatial filtering processing on the frequency domain signal corresponding to each channel to obtain the audio data after spatial filtering;

judging a start node and an end node for processing second filtered data by using the first filtered data, which specifically comprises the following steps:

performing inverse transform processing of short-time fourier transform on the first filtered data and the second filtered data;

and judging a starting node and an ending node for processing the second filtering data by using the processed first filtering data, and performing data processing on the processed second filtering data according to a judgment result.

The embodiment of the invention also provides an audio data processing method which is applied to the first terminal equipment, and the method comprises the following steps:

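obtaining spatially filtered audio data;

performing first Wiener filtering and second Wiener filtering on the audio data respectively to obtain first filtered data and second filtered data, wherein the noise suppression degree of the first Wiener filtering is greater than that of the second Wiener filtering;

and sending the first filtered data and the second filtered data to a second terminal device, so that the second terminal device determines a start node and an end node for processing the second filtered data by using the first filtered data, and performs data processing on the second filtered data according to the determination result.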

according to the formula

according to the formula

Wherein, M is 1, N is 1/2, and the intensity coefficient is

P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, and EPS is a very small value.

performing interference removal processing on the first filtering data;

Optionally, the transient noise cancellation processing specifically includes:

counting the number of frequency domain points of the audio data in the preset frequency domain range to obtain a first value; counting the number of frequency domain points of which the gain amplitude is within a first preset gain threshold value to obtain a second value;

acquiring original audio data acquired by a recording device;

and carrying out spatial filtering processing on the frequency domain signal corresponding to each channel to obtain the audio data after spatial filtering.

Optionally, the sending the first filtered data to a second terminal device specifically includes:

and performing down-sampling processing on the first filtering data and then sending the first filtering data to the second terminal equipment.
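For illustration only, down-sampling before transmission might look like the sketch below. It assumes the first filtered data is available as a list of time-domain samples and uses block averaging as a crude low-pass filter before decimation; the text does not specify the down-sampling method or factor, so `downsample` and `factor` are hypothetical.

```python
def downsample(samples, factor):
    """Reduce the sample rate by an integer factor.

    Each block of `factor` samples is averaged (a crude anti-alias
    low-pass) and replaced by one value. A trailing partial block is
    dropped. ASSUMPTION: the source does not name a method; a proper
    decimation filter could be used instead.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for start in range(0, len(samples) - factor + 1, factor):
        block = samples[start:start + factor]
        out.append(sum(block) / factor)
    return out
```

Because the first filtered data is used only to locate the start and end of speech, a lower sample rate may be acceptable for it, which reduces the amount of data sent to the second terminal device.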

Optionally, the sending the first filtered data and the second filtered data to a second terminal device specifically includes:

and after the first filtering data and the second filtering data are packed and compressed, sending the first filtering data and the second filtering data to the second terminal equipment.
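A minimal sketch of packing and compressing the two data paths is given below. It assumes both paths are 16-bit PCM sample lists and uses a hypothetical layout (two 32-bit length fields followed by the samples) compressed with DEFLATE from the Python standard library; the text does not specify any container or codec.

```python
import struct
import zlib


def pack_streams(vad_samples, asr_samples):
    """Pack both paths behind a small length header and compress them.

    ASSUMPTION: 16-bit little-endian samples and this header layout are
    illustrative only.
    """
    header = struct.pack("<II", len(vad_samples), len(asr_samples))
    body = struct.pack(f"<{len(vad_samples)}h", *vad_samples)
    body += struct.pack(f"<{len(asr_samples)}h", *asr_samples)
    return zlib.compress(header + body)


def unpack_streams(payload):
    """Inverse of pack_streams, as the second terminal device would run it."""
    raw = zlib.decompress(payload)
    n_vad, n_asr = struct.unpack_from("<II", raw, 0)
    vad = list(struct.unpack_from(f"<{n_vad}h", raw, 8))
    asr = list(struct.unpack_from(f"<{n_asr}h", raw, 8 + 2 * n_vad))
    return vad, asr
```

Sending both paths in one payload keeps them aligned, so the start and end nodes found on the first filtered data can be applied directly to the second filtered data.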

An audio data processing apparatus provided in an embodiment of the present invention includes: a data acquisition module, a first filtering module, a second filtering module and a first processing module;

the data acquisition module is used for acquiring audio data after spatial filtering;

the first filtering module is used for performing first wiener filtering on the audio data to obtain first filtering data;

the second filtering module is configured to perform second wiener filtering on the audio data to obtain second filtered data, where the first wiener filtering has a larger noise suppression degree than the second wiener filtering;

the first processing module is configured to determine a start node and an end node of processing second filtered data by using the first filtered data, and perform data processing on the second filtered data according to a determination result.

Optionally, the first filtering module is specifically configured to:

performing the first wiener filtering on the audio data by using the power M of the intensity coefficient to obtain first filtering data;

the second filtering module is specifically configured to:

performing the second wiener filtering on the audio data by using the power N of the intensity coefficient to obtain second filtering data;

wherein M is greater than N.

Optionally, the first filtering module includes: a first processing sub-module;

the first processing submodule is used for processing according to a formula

The second filtering module includes: a second processing sub-module;

the second processing submodule is used for processing according to a formula

Wherein, M is 1, N is 1/2, and the intensity coefficient is

Optionally, the apparatus further includes: a second processing module;

the second processing module is configured to perform interference cancellation processing on the first filtered data; the interference elimination processing comprises one or more of transient noise elimination processing, noise reduction processing and noise smoothing processing;

the first processing module is specifically configured to determine, by the first filtered data processed by the second processing module, a start node and an end node of processing of second filtered data, and perform audio recognition on the second filtered data according to a determination result.

Optionally, when the interference cancellation processing includes the transient noise cancellation processing, the second processing module specifically includes: a noise reduction submodule;

the noise reduction submodule is specifically configured to:

obtaining the gain of the first wiener filter corresponding to each frequency domain point in a preset frequency domain range of the audio data; counting the number of frequency domain points of the audio data in the preset frequency domain range to obtain a first value; counting the number of frequency domain points of which the gain amplitude is within a first preset gain threshold value to obtain a second value; obtaining a transient cancellation gain according to the first value and the second value; and eliminating transient noise in the first filtering data according to the transient elimination gain.

Optionally, the data obtaining module is specifically configured to:

acquiring original audio data acquired by a recording device; after short-time Fourier transform is carried out on the original audio data, a frequency domain signal corresponding to each channel in the recording equipment is obtained; performing spatial filtering processing on the frequency domain signal corresponding to each channel to obtain the audio data after spatial filtering;

the first processing module is specifically configured to:

performing an inverse short-time Fourier transform on the first filtered data and the second filtered data; and determining a start node and an end node for processing the second filtered data by using the processed first filtered data, and performing audio recognition on the processed second filtered data according to the determination result.

An embodiment of the present invention further provides an audio data processing apparatus, applied to a first terminal device, including: a data acquisition module, a first filtering module, a second filtering module and a data transmission module;

the data transmission module is configured to send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device determines a start node and an end node of processing the second filtered data by using the first filtered data, and performs data processing on the second filtered data according to a determination result.

Optionally, the first filtering module is specifically configured to:

the second filtering module is specifically configured to:

wherein M is greater than N.

Optionally, the first filtering module includes: a first processing sub-module;

the first processing submodule is used for processing according to a formula

The second filtering module includes: a second processing sub-module;

the second processing submodule is used for processing according to a formula

Wherein, M is 1, N is 1/2, and the intensity coefficient is

Optionally, the apparatus further includes: a data processing module;

the data processing module is used for carrying out interference elimination processing on the first filtering data; the interference elimination processing comprises one or more of transient noise elimination processing, noise reduction processing and noise smoothing processing;

the data transmission module is specifically configured to send the first filtered data and the second filtered data processed by the data processing module to a second terminal device, so that the second terminal device determines a start node and an end node of processing the second filtered data by using the processed first filtered data, and performs data processing on the second filtered data according to a determination result.

Optionally, when the interference elimination processing includes the transient noise elimination processing, the data processing module specifically includes: a noise reduction submodule;

the noise reduction submodule is specifically configured to:

Optionally, the data obtaining module is specifically configured to:

acquiring original audio data acquired by a recording device; after short-time Fourier transform is carried out on the original audio data, a frequency domain signal corresponding to each channel in the recording equipment is obtained; and carrying out spatial filtering processing on the frequency domain signal corresponding to each channel to obtain the audio data after spatial filtering.

Optionally, the data transmission module is specifically configured to:

Optionally, the data transmission module is further specifically configured to:

An embodiment of the present invention further provides an audio data processing terminal, including: a memory and a processor;

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the audio data processing method according to any one of claims 1 to 6, according to instructions in the program code.

An embodiment of the present invention further provides an audio data processing system, including: a first device and a second device;

the first device is configured to obtain spatially filtered audio data; the first device is further configured to perform first Wiener filtering and second Wiener filtering on the audio data respectively to obtain first filtered data and second filtered data, wherein the first Wiener filtering has a larger noise suppression degree than the second Wiener filtering; the first device is further configured to send the first filtered data and the second filtered data to the second device;

and the second equipment is used for judging a starting node and an ending node for processing the second filtering data by using the first filtering data and carrying out audio identification on the second filtering data according to the judgment result.

Compared with the prior art, the invention has at least the following advantages:

In the embodiment of the invention, Wiener filtering of different strengths is applied to the spatially filtered audio data to obtain two paths of filtered data with different degrees of noise suppression: first filtered data with a higher degree of noise suppression and second filtered data with a lower degree. The first filtered data is then used to determine a start node and an end node for processing the second filtered data, and the second filtered data is processed according to the determination result. Because noise is suppressed more strongly in the first filtered data, the influence of interference on voice activity detection is largely avoided, which improves the accuracy of voice activity detection and the response speed of automatic speech recognition. Because noise is suppressed less strongly in the second filtered data, the loss of recognition accuracy that heavy noise suppression would cause is avoided. By applying Wiener filtering of different strengths according to the different requirements of voice activity detection and automatic speech recognition, the embodiment preserves recognition accuracy, detects the voice activity state more accurately, shortens the feedback delay of voice interaction, speeds up the response to voice instructions, and brings a better experience to users.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a diagram illustrating processed voice data according to the prior art;

FIG. 2 is a flowchart illustrating an audio data processing method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating original audio data, first filtered data, and second filtered data according to an embodiment of the present invention;

FIG. 4 is a flow chart illustrating another audio data processing method according to an embodiment of the invention;

FIG. 5 is a flow chart illustrating transient noise cancellation processing according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating another audio data processing method according to an embodiment of the present invention;

FIG. 7 is a block diagram of an audio data processing apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of another audio data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For ease of understanding, a plurality of technical terms involved in the embodiments of the present invention are first described.

Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection, refers to detecting the presence or absence of a target voice in a noisy environment. It is commonly used in speech processing systems such as speech coding, speech enhancement and speech recognition.

Automatic Speech Recognition (ASR) is a technology that converts human Speech into text.

In current voice interaction, array signal processing is generally used directly to output a single path of voice data, which serves both as the ASR data for speech recognition and as the VAD data for determining when ASR processing starts and ends. However, the output voice data still contains considerable interference, which reduces the accuracy of VAD processing. See FIG. 1, which shows processed voice data in the prior art. As can be seen from FIG. 1, the voice command actually input by the user starts at node A and ends at node B. VAD processing on this data, however, determines that the voice command ends at node B', and ASR processing is then performed on the data between node A and node B'. This not only delays the determination that the voice command has ended, but also slows ASR processing, so the response to the voice command is slow.

The inventor of the present invention found in research that ASR processing requires the input audio data to have as little nonlinear distortion as possible, whereas VAD processing requires interference to be suppressed to a higher degree; but the stronger the interference suppression, the more nonlinear distortion is introduced. If the processed data is heavily noise-suppressed, the voice data may be distorted by nonlinear effects and the accuracy of speech recognition falls. If instead recognition accuracy is preserved, more interference remains in the data, which seriously affects the VAD decision, causes VAD errors, and slows the response to voice commands. That is, a single path of input data cannot simultaneously satisfy the noise suppression requirements of ASR processing and VAD processing.

The embodiment of the invention provides an audio data processing method, device, terminal and system. According to the different noise suppression requirements of ASR and VAD processing, Wiener filtering of different strengths is applied to the spatially filtered audio data to obtain two paths of data with different degrees of noise suppression. The filtered data with the higher degree of noise suppression is used for VAD processing, to determine the start node and the end node for processing a voice instruction, and the filtered data with the lower degree of noise suppression is used for ASR processing.

In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments accompanying the drawings are described in detail below.

Referring to fig. 2, a flowchart of an audio data processing method according to an embodiment of the present invention is shown.

It should be noted that the audio data processing method provided in the embodiments of the present invention may be applied to any terminal device, and the terminal device may be equipped with or connected to a multi-channel microphone to receive a voice command input by a user. As an example, the terminal device may be a smart phone, a tablet computer, a personal computer, a server, and the like, which are not listed exhaustively here.

The audio data processing method provided by the embodiment of the invention specifically comprises the following steps S201-S203.

S201: spatially filtered audio data Y (j ω) is obtained.

In practical applications, the voice input by the user (such as a voice command) is generally received by multiple microphones, and the interference in the original audio data differs between microphone channels at different spatial positions. Spatial filtering exploits these spatial differences to remove noise interference from the original audio data, yielding audio data from which interference has been preliminarily removed (i.e., the spatially filtered audio data Y(jω)).

It should further be noted that, before spatial filtering, the speech input by the user may be preprocessed, for example by DC offset removal and windowing, and a short-time Fourier transform (STFT) is then performed on the preprocessed audio to obtain the frequency domain signals X₁(jω), X₂(jω), ..., X_N(jω) of the different channels. Spatial filtering is then performed on the frequency domain signals X₁(jω), X₂(jω), ..., X_N(jω) to obtain Y(jω).
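The preprocessing chain described above (DC offset removal, windowing, then a short-time Fourier transform) can be sketched for a single channel as follows. The Hann window, the frame length of 8 and the hop of 4 are assumptions chosen to keep the example small, and the naive DFT stands in for the FFT that a practical implementation would use.

```python
import cmath
import math


def preprocess_frame(frame):
    """Remove the DC offset of one frame and apply a Hann window.

    ASSUMPTION: the source only says "windowing"; Hann is one common choice.
    """
    n = len(frame)
    mean = sum(frame) / n
    return [(x - mean) * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
            for i, x in enumerate(frame)]


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
            for k in range(n)]


def stft(signal, frame_len=8, hop=4):
    """Split one channel into overlapping frames and transform each frame."""
    frames = [signal[s:s + frame_len]
              for s in range(0, len(signal) - frame_len + 1, hop)]
    return [dft(preprocess_frame(f)) for f in frames]
```

Running `stft` once per microphone channel would give the per-channel spectra X₁(jω) to X_N(jω) that the spatial filter consumes.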

That is, in a possible implementation manner of the embodiment of the present invention, step S201 may specifically include: acquiring original audio data collected by a recording device; performing a short-time Fourier transform on the original audio data to obtain the frequency domain signals X₁(jω), X₂(jω), ..., X_N(jω) corresponding to each channel of the recording device; and performing spatial filtering on the frequency domain signal corresponding to each channel to obtain the spatially filtered audio data Y(jω).

In a specific implementation, a Generalized Sidelobe Canceller (GSC) or a Minimum Variance Distortionless Response (MVDR) beamformer may be used to obtain Y (j ω), and a specific processing method is not described herein again.
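GSC and MVDR beamformers are too long to reproduce here. Purely to show where spatial filtering sits in the data flow, the sketch below substitutes a much simpler delay-and-sum beamformer, which is not the method named in the text. The per-channel steering delays are assumed to be known, and all names are hypothetical.

```python
import cmath


def delay_and_sum(channel_bins, omegas, delays_s):
    """Combine N channel spectra into one spectrum Y for a single frame.

    channel_bins[c][k] : spectrum of channel c at bin k
    omegas[k]          : angular frequency of bin k in rad/s
    delays_s[c]        : arrival delay of channel c in seconds (assumed known)
    NOTE: delay-and-sum is a stand-in for the GSC/MVDR beamformers in the text.
    """
    n_ch = len(channel_bins)
    out = []
    for k, omega in enumerate(omegas):
        acc = 0j
        for c in range(n_ch):
            # Undo each channel's delay so the target adds coherently.
            acc += channel_bins[c][k] * cmath.exp(1j * omega * delays_s[c])
        out.append(acc / n_ch)
    return out
```

When the delays match the target direction, the target speech is preserved while signals from other directions add incoherently and are attenuated.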

S202: and respectively carrying out first wiener filtering and second wiener filtering on the audio data after the spatial filtering to respectively obtain first filtering data and second filtering data, wherein the noise suppression degree of the first wiener filtering is greater than that of the second wiener filtering.

In the embodiment of the invention, in order to meet the different requirements of ASR processing and VAD processing on the degree of noise suppression of the input audio data, Wiener filtering of different strengths is applied to the spatially filtered audio data Y(jω) for ASR processing and for VAD processing, so that the accuracy of both is ensured. The first Wiener filtering, with the higher degree of noise suppression, reduces as far as possible the signal components in Y(jω) that are unfavorable for VAD processing, which reduces the interference with VAD processing in the first filtered data and improves the accuracy of VAD processing. The second Wiener filtering, with the lower degree of noise suppression, avoids the influence of distortion on ASR processing and preserves recognition accuracy. The method therefore both preserves the recognition accuracy of the voice command and improves the accuracy of VAD processing, so that the voice command is responded to in time.

In practical application, the intensity coefficient of the Wiener filtering can be adjusted so that Wiener filtering with different degrees of noise suppression is applied to the spatially filtered audio data, thereby obtaining the first filtered data and the second filtered data.

In some possible implementation manners of the embodiment of the present invention, the step S202 may specifically include the following steps:

performing first Wiener filtering on the spatially filtered audio data Y(jω) by using the M-th power of the intensity coefficient to obtain first filtered data; performing second Wiener filtering on the spatially filtered audio data Y(jω) by using the N-th power of the intensity coefficient to obtain second filtered data; wherein M is greater than N.

The intensity coefficient affects the degree of suppression of noise in wiener filtering, and the larger the intensity coefficient is, the higher the degree of suppression of noise is. Therefore, in the embodiment of the present invention, the first wiener filtering and the second wiener filtering are performed by using different powers of the same intensity coefficient, so that the first filtering data and the second filtering data with different noise suppression degrees can be obtained.

As an example, the intensity coefficient may be set to

M and N are taken as 1 and 1/2, respectively. Here, P_yy(jω) is the power spectrum of the audio data, which can be obtained by the following formula (1); P_xx(jω) is the average power spectrum of the original audio data before spatial filtering, which can be obtained by the following formula (2); and EPS is a very small value.

P_yy(jω) = α·P_yy(jω) + (1 − α)·Y(jω)·Y*(jω)    (1)

In practical applications, stable power spectra P_xx(jω) and P_yy(jω) can be obtained by first-order smoothing.
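The first-order smoothing of formula (1) can be sketched as a per-bin recursive update. The value of α is not given in the text, so the default below is an assumption, and `smooth_power` is a hypothetical helper name.

```python
from typing import List


def smooth_power(prev: List[float], frame: List[complex],
                 alpha: float = 0.9) -> List[float]:
    """One step of formula (1): P <- alpha*P + (1 - alpha)*Y*conj(Y), per bin."""
    return [alpha * p + (1.0 - alpha) * (y * y.conjugate()).real
            for p, y in zip(prev, frame)]


# Feed three identical made-up frames; alpha = 0.5 is chosen only for the demo.
p_yy = [0.0, 0.0]
for frame in ([2 + 0j, 0 + 1j], [2 + 0j, 0 + 1j], [2 + 0j, 0 + 1j]):
    p_yy = smooth_power(p_yy, frame, alpha=0.5)
```

A value of α closer to 1 gives a steadier but slower-reacting estimate; the same update can be applied to P_xx(jω) with the pre-filtering spectrum as input.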

Then, first Wiener filtering is performed on the audio data Y(jω) to obtain the first filtered data Y_VAD(jω); specifically, the following formula (3) can be used:

Next, second Wiener filtering is performed on the audio data Y(jω) to obtain the second filtered data Y_ASR(jω); specifically, the following formula (4) can be used:

S203: Use the first filtered data to judge the start node and the end node for processing the second filtered data, and process the second filtered data according to the judgment result.

It can be understood that, because the interference affecting VAD processing has been removed from the first filtered data, the first filtered data can be used to accurately locate the target voice (such as a voice command) in the second filtered data, and thus to determine the start node and the end node for processing the second filtered data. Data processing (such as automatic speech recognition) is then performed on the second filtered data according to the judgment result, which improves the response speed of the data processing while preserving its accuracy.
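The relationship between the two data paths in step S203 can be sketched as follows: endpoints are found on the strongly filtered path and then used to cut the mildly filtered path. The text does not specify the VAD algorithm, so the frame-energy detector, the threshold, and the `hangover` parameter below are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def find_endpoints(vad_frames: List[List[float]], threshold: float,
                   hangover: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first active segment, end exclusive."""
    start = None
    silent = 0
    for i, frame in enumerate(vad_frames):
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy >= threshold:
            if start is None:
                start = i          # start node
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= hangover:  # enough trailing silence: end node found
                return start, i - hangover + 1
    if start is None:
        return None
    return start, len(vad_frames) - silent


def gate_asr(asr_frames: List[List[float]],
             endpoints: Optional[Tuple[int, int]]) -> List[List[float]]:
    """Keep only the second-filtered frames between the start and end nodes."""
    if endpoints is None:
        return []
    start, end = endpoints
    return asr_frames[start:end]


# Made-up time-domain frames: frames 2 and 3 contain speech.
vad = [[0.0, 0.0], [0.0, 0.0], [1.0, -1.0], [1.0, 1.0],
       [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
asr = [[float(i)] * 2 for i in range(7)]
endpoints = find_endpoints(vad, threshold=0.5)
segment = gate_asr(asr, endpoints)
```

Only `segment` would be handed to the recognizer, which is what allows recognition to start and stop promptly without the recognizer seeing the more heavily distorted signal.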

In a specific implementation, before step S203, an inverse short-time Fourier transform (ISTFT) is performed on the filtered frequency-domain signals to obtain time-domain signals, and the time-domain first filtered data and second filtered data are obtained by windowing and overlap-add; the judgment and the data processing are then performed.

The above advantages of the embodiments of the present invention are described below with reference to a specific scenario. Referring to fig. 3, the original audio data, the first filtered data, and the second filtered data in a specific embodiment of the present invention are shown. The original audio data carries a voice instruction and interference; the first filtered data is obtained by performing the first Wiener filtering, with high noise suppression strength, on the original audio data, and the second filtered data is obtained by performing the second Wiener filtering, with low noise suppression strength, on the original audio data. As can be seen from fig. 3, using the second filtered data ensures the accuracy of speech recognition, while the first Wiener filtering improves the accuracy of the judgment, so that the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response speed to voice instructions is improved, and a better user experience is provided.

Referring to fig. 4, a flowchart of another audio data processing method according to an embodiment of the present invention is shown.

In order to further improve the accuracy of the determination, before step S203, the embodiment of the present invention further includes:

S204: Perform interference elimination processing on the first filtered data.

In an embodiment of the present invention, the interference removing process may specifically include: one or more of a transient noise cancellation process, a noise reduction process, and a noise smoothing process. In specific implementation, the transient noise elimination processing, the noise reduction processing and the noise smoothing processing may be performed on the first filtered data one by one.

The transient noise cancellation processing, the noise reduction processing, and the noise smoothing processing are explained below.

First, the transient noise cancellation process, as shown in fig. 5, specifically includes the following steps S501 to S504.

S501: Obtain the gain of the first Wiener filter corresponding to each frequency-domain point within a preset frequency-domain range of the audio data.

In the embodiment of the present application, the gain of the first Wiener filter is the value that the intensity coefficient takes at each frequency-domain point. As an example, it is the value at each frequency-domain point of the example intensity coefficient given above.

S502: Count the number of frequency-domain points of the audio data within the preset frequency-domain range to obtain a first value, and count the number of those frequency-domain points whose gain amplitude is within a preset gain threshold to obtain a second value.

S503: Obtain the transient elimination gain from the first value and the second value.

It should be noted that high-frequency components are more random because of high-frequency attenuation and reflection. To obtain higher robustness, only the proportion of frequency-domain points whose gain is smaller than a certain threshold is counted, and only within a certain frequency range; noise smoothing processing is then performed on the first filtered data based on this proportion.

In a specific implementation, the preset frequency-domain range may be 0-2000 Hz, and the preset gain threshold may be 0.3.

In one example, the transient cancellation gain may be obtained according to equation (5) below.

where all_bin is the first value and count_bin is the second value.

S504: Eliminate transient noise in the first filtered data according to the transient elimination gain.

In the embodiment of the invention, the transient noise in the first filtered data can be eliminated by applying the transient elimination gain to the first filtered data.
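Steps S501 to S504 can be sketched as below. The counting of all_bin and count_bin follows the text, as do the 0-2000 Hz range and the 0.3 threshold. Formula (5) is not reproduced in this text, so the mapping from the two counts to a gain (one minus their ratio) is purely an assumed stand-in, and the function names are hypothetical.

```python
from typing import List


def transient_gain(gains: List[float], freqs_hz: List[float],
                   max_freq_hz: float = 2000.0,
                   gain_threshold: float = 0.3) -> float:
    """Derive one frame-wide gain from the low-frequency first-filter gains."""
    in_band = [abs(g) for g, f in zip(gains, freqs_hz) if 0.0 <= f <= max_freq_hz]
    all_bin = len(in_band)                                      # first value
    count_bin = sum(1 for g in in_band if g < gain_threshold)   # second value
    if all_bin == 0:
        return 1.0
    # ASSUMPTION: stand-in for formula (5). The more low-frequency bins are
    # heavily suppressed, the smaller the gain applied to the whole frame.
    return 1.0 - count_bin / all_bin


def apply_transient_gain(spectrum: List[complex], gain: float) -> List[complex]:
    """S504: scale the first filtered data of the frame by the transient gain."""
    return [gain * y for y in spectrum]


freqs = [0.0, 500.0, 1000.0, 1500.0, 2000.0, 2500.0]   # made-up bin centres
gains = [0.1, 0.2, 0.9, 0.8, 0.25, 0.05]               # made-up first-filter gains
g_t = transient_gain(gains, freqs)
cleaned = apply_transient_gain([1 + 1j] * 6, g_t)
```

In this example five bins lie within 0-2000 Hz and three of them have a gain below 0.3, so the assumed mapping yields a frame gain of 0.4; the bin at 2500 Hz is ignored, in line with the remark about high-frequency randomness.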

Secondly, any noise reduction algorithm may be specifically adopted for the noise reduction processing, which is not described in detail herein.

Finally, the noise smoothing process may be specifically implemented by performing noise estimation on the first filtered data.

In one example, the first-order smoothed power spectrum P_noise(jω) of each windowed frame of the first filtered data is calculated first; it can be obtained by formula (1) above. The first-order smoothed power spectra of the frames are then compared, and the historical minimum power spectrum minP_noise(jω) is updated as shown in formula (6) below,

where β and ρ are both coefficients.

For the first several frames (e.g., 50 frames) of the first filtered data, the noise estimate of a frame is the first-order smoothed power spectrum P_noise(jω) of that frame; for each later frame, the noise estimate is the current historical minimum power spectrum minP_noise(jω). The noise estimate of each frame is then superimposed on that frame of the first filtered data, so that the noise in the first filtered data is stationary and VAD processing errors caused by abrupt changes in noise are avoided.
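The frame-by-frame noise estimate described above can be sketched as a small tracker. The 50-frame warm-up follows the text. Formula (6) and the roles of β and ρ are not reproduced here, so the minimum update below (a running minimum that is allowed to drift upward slowly, controlled by a hypothetical `rise` parameter) is an assumed stand-in, as is the value of α.

```python
from typing import List

WARMUP_FRAMES = 50  # "the first several frames (e.g., 50 frames)" in the text


class NoiseTracker:
    """Per-bin noise estimate for successive frames of the first filtered data."""

    def __init__(self, num_bins: int, alpha: float = 0.9, rise: float = 1.01):
        self.alpha = alpha          # smoothing factor of formula (1); value assumed
        self.rise = rise            # hypothetical slow upward drift of the minimum
        self.p_noise: List[float] = [0.0] * num_bins
        self.min_p: List[float] = [float("inf")] * num_bins
        self.frames_seen = 0

    def update(self, frame: List[complex]) -> List[float]:
        power = [abs(y) ** 2 for y in frame]
        if self.frames_seen == 0:
            self.p_noise = power    # start from the first frame instead of zero
        else:
            self.p_noise = [self.alpha * p + (1.0 - self.alpha) * x
                            for p, x in zip(self.p_noise, power)]
        # ASSUMPTION: stand-in for formula (6), which is not reproduced here.
        self.min_p = [min(p, m * self.rise)
                      for p, m in zip(self.p_noise, self.min_p)]
        self.frames_seen += 1
        if self.frames_seen <= WARMUP_FRAMES:
            return list(self.p_noise)   # early frames: smoothed power spectrum
        return list(self.min_p)         # later frames: historical minimum


tracker = NoiseTracker(num_bins=2)
for _ in range(60):
    estimate = tracker.update([0.1 + 0j, 0.2 + 0j])
```

Because later frames report the tracked minimum rather than the smoothed power, a sudden loud frame does not immediately raise the estimate. Superimposing noise shaped by this estimate onto each frame is left out of the sketch.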

Based on the audio data processing method provided by the above embodiment, another audio data processing method is also provided in the embodiments of the present invention. In this method, a first terminal device (e.g., a smart phone, a tablet computer, a server, etc.) is responsible for processing the original audio data to obtain the first filtered data and the second filtered data, and a second terminal device (e.g., a server) is responsible for the judgment and the data processing. This keeps the amount of computation on the first terminal device from becoming too large, and allows a more complex VAD processing algorithm to be used on the second terminal device to obtain a more accurate VAD processing result.

Specifically, refer to fig. 6, which is a flowchart illustrating another audio data processing method according to an embodiment of the present invention.

The audio data processing method provided by the embodiment of the invention is applied to the first terminal device, and specifically includes the following steps S601 to S603.

S601: spatially filtered audio data is obtained.

Optionally, step S601 specifically includes: acquiring original audio data acquired by a recording device; carrying out short-time Fourier transform on the original audio data to obtain a frequency domain signal corresponding to each channel in the recording equipment; and carrying out spatial filtering processing on the frequency domain signal corresponding to each channel to obtain audio data after spatial filtering.

S602: Perform first Wiener filtering and second Wiener filtering on the audio data to obtain first filtered data and second filtered data respectively, wherein the noise suppression degree of the first Wiener filtering is greater than that of the second Wiener filtering.

In a possible implementation manner of the embodiment of the present invention, step S602 may specifically include:

performing first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain the first filtered data; and performing second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain the second filtered data; wherein M is greater than N.

Optionally, performing the first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain the first filtered data specifically includes:

according to the formula

performing first Wiener filtering on the audio data Y(jω) to obtain the first filtered data Y_VAD(jω).

Performing the second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain the second filtered data specifically includes:

according to the formula

performing second Wiener filtering on the audio data Y(jω) to obtain the second filtered data Y_ASR(jω).

Here, M is 1, N is 1/2, and the intensity coefficient is

P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering of the audio data, and EPS is a very small value.

It can be understood that steps S601 to S602 in this embodiment are similar to steps S201 to S202 in the above embodiment; refer to the related description for details, which is not repeated here.

In a possible implementation manner of the embodiment of the present invention, before step S603, the method further includes: performing interference elimination processing on the first filtered data. Specifically, the interference elimination processing may include one or more of transient noise elimination processing, noise reduction processing, and noise smoothing processing.

As an example, the transient noise cancellation processing may specifically include:

obtaining the gain of a first wiener filter corresponding to each frequency domain point in a preset frequency domain range of the audio data; counting the number of frequency domain points of the audio data in a preset frequency domain range to obtain a first value; counting the number of frequency domain points of which the gain amplitude is within a first preset gain threshold value to obtain a second value; obtaining a transient cancellation gain according to the first value and the second value; and eliminating transient noise in the first filtered data according to the transient elimination gain.

It can be understood that the interference elimination processing in this embodiment is similar to the interference elimination processing described in the foregoing embodiment, and specific reference may be made to relevant descriptions, which are not described herein again.

S603: Send the first filtered data and the second filtered data to a second terminal device, so that the second terminal device uses the first filtered data to judge the start node and the end node for processing the second filtered data, and processes the second filtered data according to the judgment result.

It can be understood that the filtering of the audio data, on the one hand, and the judgment and data processing that use the filtered data, on the other, are executed by different terminal devices (or servers). This not only ensures the accuracy and speed of the filtering, but also allows more complex VAD and ASR algorithms to be used, so that an accurate judgment result and an accurate speech recognition result are obtained.

In a possible implementation manner of the embodiment of the present invention, in order to reduce the amount of data transmitted and increase the transmission speed, and thereby further increase the response speed to the voice command, sending the first filtered data to the second terminal device (e.g., a server) may specifically include:

performing down-sampling processing on the first filtered data and then sending it to the second terminal device.

As an example, the sampling rate of the first filtered data may be reduced from 16 kHz to 8 kHz.
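Halving the sampling rate, as in the 16 kHz to 8 kHz example, can be sketched as low-pass filtering followed by keeping every second sample. The text does not describe the down-sampling method; the windowed-sinc filter, its length, and its cutoff below are assumptions chosen for a short, self-contained illustration.

```python
import math
from typing import List


def lowpass_taps(num_taps: int = 31, cutoff: float = 0.22) -> List[float]:
    """Hamming-windowed sinc taps; cutoff in cycles/sample (just under fs/4)."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]        # normalise to unity gain at DC


def downsample_by_2(samples: List[float]) -> List[float]:
    """Anti-alias filter, then keep every second sample (e.g. 16 kHz -> 8 kHz)."""
    taps = lowpass_taps()
    half = len(taps) // 2
    out = []
    for n in range(0, len(samples), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            idx = n + k - half              # centred (zero-delay) filtering
            if 0 <= idx < len(samples):     # samples outside the signal count as 0
                acc += t * samples[idx]
        out.append(acc)
    return out


# A 6 kHz tone sampled at 16 kHz cannot be represented at 8 kHz and is removed.
tone = [math.sin(2 * math.pi * 6000 * n / 16000) for n in range(160)]
tone_8k = downsample_by_2(tone)
```

Only the first filtered data is down-sampled in the text's example, which fits its role: it is used for endpoint judgment rather than recognition, so losing content above 4 kHz matters less.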

Optionally, sending the first filtered data and the second filtered data to the second terminal device (e.g., a server) specifically includes:

packaging and compressing the first filtered data and the second filtered data, and then sending them to the second terminal device.
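Packaging and compressing the two data paths into one payload could look like the sketch below, which uses Python's standard `struct` and `zlib` modules. The container layout (two 32-bit sample counts followed by 16-bit little-endian PCM) is a hypothetical format invented for illustration; the text does not define one.

```python
import struct
import zlib
from typing import List, Tuple

HEADER = "<II"  # hypothetical header: sample counts of the two streams


def pack_streams(vad_pcm: List[int], asr_pcm: List[int]) -> bytes:
    """Bundle both streams as 16-bit PCM and compress the result."""
    header = struct.pack(HEADER, len(vad_pcm), len(asr_pcm))
    body = (struct.pack(f"<{len(vad_pcm)}h", *vad_pcm)
            + struct.pack(f"<{len(asr_pcm)}h", *asr_pcm))
    return zlib.compress(header + body)


def unpack_streams(blob: bytes) -> Tuple[List[int], List[int]]:
    """Inverse of pack_streams, as the receiving device would run it."""
    raw = zlib.decompress(blob)
    n_vad, n_asr = struct.unpack_from(HEADER, raw, 0)
    offset = struct.calcsize(HEADER)
    vad = list(struct.unpack_from(f"<{n_vad}h", raw, offset))
    offset += 2 * n_vad                     # each sample occupies two bytes
    asr = list(struct.unpack_from(f"<{n_asr}h", raw, offset))
    return vad, asr


vad_pcm = [0, 100, -100, 32767]             # made-up first filtered samples
asr_pcm = [1, -32768, 5]                    # made-up second filtered samples
payload = pack_streams(vad_pcm, asr_pcm)
restored = unpack_streams(payload)
```

Storing the two lengths separately lets the streams differ in size, which is the case when only the first filtered data has been down-sampled.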

In the embodiment of the present invention, the first terminal device performs Wiener filtering of different strengths on the spatially filtered audio data to obtain two paths of filtered data with different noise suppression degrees, namely first filtered data with a higher noise suppression degree and second filtered data with a lower noise suppression degree, and sends both to the second terminal device. The second terminal device then uses the first filtered data to judge the start node and the end node for processing the second filtered data, and performs data processing on the second filtered data according to the judgment result. Because noise is suppressed to a higher degree in the first filtered data, the influence of interference on voice activity detection is avoided to a greater extent, which improves the accuracy of voice activity detection and the response speed of automatic speech recognition. Because noise is suppressed to a lower degree in the second filtered data, the loss of recognition accuracy that stronger noise suppression would cause is avoided. Since the filtering of the audio data is executed by the first terminal device and the judgment and data processing that use the filtered data are executed by the second terminal device, the accuracy and speed of the filtering are ensured, and more complex VAD and ASR algorithms can be used to obtain an accurate judgment result and an accurate speech recognition result.
According to the embodiment of the invention, Wiener filtering of different degrees is performed according to the different requirements of voice activity detection and automatic speech recognition, so that the accuracy of automatic speech recognition is ensured, the influence of interference on voice activity detection is avoided, the voice activity state is detected more accurately, the feedback delay of voice interaction is shortened, the response speed to voice instructions is improved, and a better user experience is provided.

Based on the audio data processing method provided by the embodiment, the embodiment of the invention also provides an audio data processing device.

Referring to fig. 7, a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present invention is shown.

An audio data processing apparatus provided in an embodiment of the present invention includes: a data acquisition module 701, a first filtering module 702, a second filtering module 703 and a first processing module 704;

a data obtaining module 701, configured to obtain spatially filtered audio data;

a first filtering module 702, configured to perform first wiener filtering on the audio data obtained by the data obtaining module 701 to obtain first filtered data;

the second filtering module 703 is configured to perform second wiener filtering on the audio data obtained by the data obtaining module 701 to obtain second filtered data, where the suppression degree of the first wiener filtering on noise is greater than that of the second wiener filtering;

a first processing module 704, configured to determine a start node and an end node of processing the second filtered data by using the first filtered data, and perform data processing on the second filtered data according to a determination result.

In a possible implementation manner of the embodiment of the present invention, the first filtering module 702 is specifically configured to: perform first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain first filtered data;

the second filtering module 703 is specifically configured to: perform second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain second filtered data; wherein M is greater than N.

In a possible implementation manner of the embodiment of the present invention, the first filtering module 702 includes: a first processing sub-module;

a first processing sub-module, configured to perform, according to the formula

first Wiener filtering on the audio data Y(jω) to obtain the first filtered data Y_VAD(jω);

the second filtering module 703 includes: a second processing sub-module;

a second processing sub-module, configured to perform, according to the formula

second Wiener filtering on the audio data Y(jω) to obtain the second filtered data Y_ASR(jω);

wherein M is 1, N is 1/2, and the intensity coefficient is

P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering of the audio data, and EPS is a very small value.

In a possible implementation manner of the embodiment of the present invention, the audio data processing apparatus further includes: a second processing module;

the second processing module is used for carrying out interference elimination processing on the first filtering data; interference removal processing including one or more of transient noise cancellation processing, noise reduction processing, and noise smoothing processing;

the first processing module 704 is specifically configured to determine a start node and an end node of processing the second filtered data by using the first filtered data, and perform audio recognition on the second filtered data according to a determination result.

Optionally, when the interference cancellation processing includes transient noise cancellation processing, the second processing module specifically includes: a noise reduction submodule;

a noise reduction submodule, specifically configured to:

In a possible implementation manner of the embodiment of the present invention, the data obtaining module 701 is specifically configured to:

acquiring original audio data acquired by a recording device; carrying out short-time Fourier transform on the original audio data to obtain a frequency domain signal corresponding to each channel in the recording equipment; carrying out spatial filtering processing on the frequency domain signal corresponding to each channel to obtain audio data after spatial filtering;

the first processing module 704 is specifically configured to:

performing an inverse short-time Fourier transform on the first filtered data and the second filtered data; and judging a start node and an end node for processing the second filtered data by using the first filtered data, and performing audio recognition on the processed second filtered data according to the judgment result.

Based on the audio data processing method and device provided by the above embodiment, the embodiment of the invention also provides another audio data processing device.

Referring to fig. 8, a schematic structural diagram of another audio data processing apparatus according to an embodiment of the present invention is shown.

An audio data processing apparatus provided in an embodiment of the present invention is applied to a first terminal device, and includes: a data acquisition module 801, a first filtering module 802, a second filtering module 803 and a data transmission module 804;

a data obtaining module 801, configured to obtain spatially filtered audio data;

a first filtering module 802, configured to perform first wiener filtering on the audio data obtained by the data obtaining module 801 to obtain first filtered data;

a second filtering module 803, configured to perform second wiener filtering on the audio data obtained by the data obtaining module 801 to obtain second filtered data, where a suppression degree of the first wiener filtering on noise is greater than that of the second wiener filtering;

the data transmission module 804 is configured to send the first filtered data and the second filtered data to the second terminal device, so that the second terminal device determines a start node and an end node of processing the second filtered data by using the first filtered data, and performs data processing on the second filtered data according to a determination result.

In a possible implementation manner of the embodiment of the present invention, the first filtering module 802 is specifically configured to: perform first Wiener filtering on the audio data by using the M-th power of the intensity coefficient to obtain first filtered data; the second filtering module 803 is specifically configured to: perform second Wiener filtering on the audio data by using the N-th power of the intensity coefficient to obtain second filtered data; wherein M is greater than N.

Optionally, the first filtering module 802 includes: a first processing sub-module; the second filtering module 803 includes: a second processing sub-module;

a first processing sub-module, configured to process according to a formula

a second processing sub-module, configured to process according to a formula

wherein M is 1, N is 1/2, and the intensity coefficient is

P_yy(jω) is the power spectrum of the audio data, P_xx(jω) is the average power spectrum of the original audio data before spatial filtering of the audio data, and EPS is a very small value.

In a possible implementation manner of the embodiment of the present invention, the audio data processing apparatus further includes: a data processing module;

the data processing module is used for carrying out interference removal processing on the first filtering data; interference removal processing including one or more of transient noise cancellation processing, noise reduction processing, and noise smoothing processing;

the data transmission module 804 is specifically configured to send the first filtered data and the second filtered data processed by the data processing module to the second terminal device, so that the second terminal device determines a start node and an end node of processing the second filtered data by using the processed first filtered data, and performs data processing on the second filtered data according to a determination result.

Optionally, when the interference removal processing includes transient noise cancellation processing, the data processing module specifically includes: a noise reduction submodule;

a noise reduction submodule, specifically configured to:

In a possible implementation manner of the embodiment of the present invention, the data obtaining module 801 is specifically configured to:

acquiring original audio data acquired by a recording device; carrying out short-time Fourier transform on the original audio data to obtain a frequency domain signal corresponding to each channel in the recording equipment; and carrying out spatial filtering processing on the frequency domain signal corresponding to each channel to obtain audio data after spatial filtering.

In a possible implementation manner of the embodiment of the present invention, the data transmission module 804 is specifically configured to: and performing down-sampling processing on the first filtering data and then sending the first filtering data to second terminal equipment.

Optionally, the data transmission module 804 is further specifically configured to: and packaging and compressing the first filtering data and the second filtering data, and then sending the first filtering data and the second filtering data to second terminal equipment.

Based on the audio data processing method and device provided by the embodiment, the embodiment of the invention also provides an audio data processing terminal. The audio data processing terminal includes: a memory and a processor. The memory is used for storing the program codes and transmitting the program codes to the processor; a processor for executing the audio data processing method according to any of the embodiments described above, according to instructions in the program code.

Based on the audio data processing method and device provided by the embodiment, the embodiment of the invention also provides an audio data processing system. The audio data processing system includes: a first device and a second device;

a first device, configured to obtain spatially filtered audio data; the first device is further configured to perform first Wiener filtering and second Wiener filtering on the audio data to obtain first filtered data and second filtered data respectively, wherein the noise suppression degree of the first Wiener filtering is greater than that of the second Wiener filtering; the first device is further configured to send the first filtered data and the second filtered data to the second device;

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device or system disclosed in an embodiment corresponds to the method disclosed in the embodiments, its description is relatively brief, and the description of the method may be referred to for the relevant details.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Using the methods and techniques disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, provided it does not depart from the content of the technical solution of the present invention, still falls within the protection scope of that technical solution.

Claims

1. A method of audio data processing, the method comprising:

obtaining spatially filtered audio data;

carrying out first wiener filtering on the audio data to obtain first filtering data; performing second wiener filtering on the audio data to obtain second filtering data; the first wiener filter has a larger noise suppression degree than the second wiener filter;

2. The method according to claim 1, wherein the performing the first wiener filtering and the second wiener filtering on the audio data to obtain first filtered data and second filtered data respectively comprises:

3. The method of claim 2,

the performing, by using the M-th power of the intensity coefficient, the first wiener filtering on the audio data to obtain the first filtered data specifically includes:

according to the formula

according to the formula

Wherein, M is 1, N is 1/2, and the intensity coefficient is

4. The method of claim 1, wherein determining a start node and an end node for processing second filtered data using the first filtered data further comprises:

performing interference removal processing on the first filtering data;

5. The method according to claim 4, wherein the transient noise cancellation process specifically comprises:

6. An audio data processing method applied to a first terminal device, the method comprising:

obtaining spatially filtered audio data;

7. The method according to claim 6, wherein the performing the first wiener filtering and the second wiener filtering on the audio data to obtain first filtered data and second filtered data respectively comprises:

8. The method of claim 6, wherein determining a start node and an end node for processing second filtered data using the first filtered data further comprises:

performing interference removal processing on the first filtering data;

9. The method according to any one of claims 6 to 8, wherein the sending the first filtered data and the second filtered data to a second terminal device specifically includes:

down-sampling the first filtered data;

sending the processed first filtering data and the second filtering data to the second terminal device; or, the processed first filtered data and the second filtered data are packed and compressed, and then are sent to the second terminal device.

10. An audio data processing apparatus, characterized in that the apparatus comprises: the device comprises a data acquisition module, a first filtering module, a second filtering module and a first processing module;

11. An audio data processing apparatus, applied to a first terminal device, the apparatus comprising: the device comprises a data acquisition module, a first filtering module, a second filtering module and a data transmission module;

12. An audio data processing terminal, comprising: a memory and a processor;

13. An audio data processing system, comprising: a first device and a second device;

the first device is configured to obtain spatially filtered audio data; the first device is further configured to perform first wiener filtering on the audio data to obtain first filtered data, and to perform second wiener filtering on the audio data to obtain second filtered data, wherein the noise suppression degree of the first wiener filtering is greater than that of the second wiener filtering; the first device is further configured to send the first filtered data and the second filtered data to a second device;