WO2021103672A1 - Audio data processing method and apparatus, and electronic device and storage medium

Info

Publication number
WO2021103672A1
WO2021103672A1 (PCT/CN2020/110038)
Authority
WO
WIPO (PCT)
Prior art keywords: audio data, channel audio, time, channel, mask
Application number: PCT/CN2020/110038
Other languages: French (fr), Chinese (zh)
Inventor: 罗大为 (Luo Dawei)
Original assignee: 北京搜狗科技发展有限公司 (Beijing Sogou Technology Development Co., Ltd.)
Application filed by 北京搜狗科技发展有限公司
Publication of WO2021103672A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • This application relates to the field of audio data processing, and in particular to an audio data processing method and device, electronic equipment, and storage medium.
  • Microphone array technology usually relies on a unified array system with synchronous acquisition, and such a unified, synchronously acquiring array system places high requirements on hardware design, manufacturing, and deployment.
  • In order to overcome the above problems, or at least partially solve them, an audio data processing method and device, electronic equipment, and a storage medium are proposed, including:
  • a method for audio data processing comprising:
  • acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • obtaining class-target voice data corresponding to the first time-frequency mask, and combining the class-target voice data to generate a second time-frequency mask for target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • the de-reverberation parameter is used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • the method also includes:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the method before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
  • the audio data in the first multi-channel audio data is aligned.
  • An audio data processing device comprising:
  • the first multi-channel audio data acquisition module is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • a de-reverberation processing module configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • a time-frequency mask generating module configured to generate a time-frequency mask for the second multi-channel audio data
  • a beamforming processing module configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data
  • the audio signal output module is configured to use the first single-channel audio data to output audio signals.
  • the beamforming processing module includes:
  • the function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight
  • the first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the function and matrix determination submodule includes:
  • a target speech covariance matrix generating unit configured to use the target speech mask to generate a target speech covariance matrix
  • a channel transfer function obtaining unit configured to use the target voice covariance matrix to calculate the channel transfer function
  • the interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
  • the time-frequency mask generation module includes:
  • the first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for class-target voice data in the second multi-channel audio data
  • the time-frequency mask determination sub-module is configured to determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the time-frequency mask determination sub-module includes:
  • the class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask
  • the second time-frequency mask generating unit is configured to combine the class-target voice data to generate a second time-frequency mask for the target voice data in the second multi-channel audio data; wherein the class-target voice data includes the target voice data;
  • the time-frequency mask combining unit is configured to combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
  • the first audio signal output module includes:
  • An adaptive filter processing sub-module configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data
  • the second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
  • the second audio signal output submodule includes:
  • the current application type determining unit is used to determine the current application type
  • a third single-channel audio data obtaining unit configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
  • the de-reverberation processing module includes:
  • a de-reverberation parameter acquisition sub-module, configured to acquire de-reverberation parameters
  • the second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the device also includes:
  • An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
  • the device further includes:
  • a correlation degree determination module configured to determine the correlation degree of audio data in the first multi-channel audio data
  • the alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
  • An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
  • acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • the de-reverberation parameter is used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • the electronic device also includes instructions for performing the following operations:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the electronic device further includes instructions for performing the following operations:
  • the audio data in the first multi-channel audio data is aligned.
  • A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the audio data processing method described above.
  • In the embodiments of this application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; the first multi-channel audio data is de-reverberated to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and beamforming is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data, which is used for audio signal output. This realizes audio processing over multiple asynchronously collecting microphone arrays, avoids the high cost of relying solely on a unified, synchronously collecting array, expands the pickup range, and improves robustness; and, because a time-frequency mask is adopted, the processing does not need to rely on the position information of the microphone arrays, which improves the noise reduction and anti-interference capabilities.
  • FIG. 1 is a flow chart of the steps of a method for processing audio data according to an embodiment of the present application
  • FIG. 2 is a flow chart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 3 is a flow chart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application.
  • FIG. 6 is a structural block diagram of an electronic device for audio data processing provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another electronic device for audio data processing provided by an embodiment of the present application.
  • Step 101 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • One or more microphone arrays can form a non-synchronized acquisition array system; specifically, the obtained multi-channel signals may not be completely synchronized in time due to inconsistent synchronization clocks or transmission delays, while collection within a single microphone array can be synchronous. If a single microphone array contains microphones that are not collected synchronously, those microphones can likewise each be treated as a separate microphone array. The sampling rate of the audio data collected by each microphone array is the same.
  • a control module can control the working state of one or more microphone arrays, and then can control one or more microphone arrays to perform synchronous start and data transmission.
  • the control module can control one or more microphone arrays to start and start recording, and the one or more microphone arrays will send the collected data to the transmission module.
  • the transmission module can adopt a preset packetization strategy to synchronously transmit the data collected by each microphone array to the processing module; the data transmission can be wired or wireless, and the processing module can then obtain the first multi-channel audio data composed of audio data collected by the one or more microphone arrays.
  • the missing data can be marked with zeros and transmitted to the processing module.
  • Step 102 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the processing module can use linear prediction, Kalman filtering, or other filtering methods to de-reverberate the first multi-channel audio data, suppressing the reverberation in the original signal to obtain the second multi-channel audio data; the de-reverberation processing can ensure that the phase relationships of the data do not change and subsequent processing is not affected.
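  • The patent names linear prediction or Kalman filtering for this step. As a rough single-channel illustration of the linear-prediction idea only (the multi-channel and Kalman variants are not shown, and the function name and parameters below are illustrative), a delayed linear predictor can estimate the late reverberation from past samples and subtract it:

```python
import numpy as np

def dlp_dereverb(x, order=8, delay=3):
    """Delayed linear prediction: predict x[n] from the delayed past
    samples x[n-delay-order+1 .. n-delay] via least squares and subtract
    the prediction, suppressing late reverberation while leaving the
    direct sound (within `delay` samples) untouched."""
    n_samples = len(x)
    rows = n_samples - (order - 1) - delay
    # data matrix of delayed past samples, one row per predicted sample
    A = np.stack([x[i:i + order] for i in range(rows)])
    y = x[order - 1 + delay:]                     # samples to be cleaned
    g, *_ = np.linalg.lstsq(A, y, rcond=None)     # prediction coefficients
    out = x.copy()
    out[order - 1 + delay:] = y - A @ g           # subtract predicted reverb
    return out
```

On a synthetic signal with an artificial echo, the output is measurably closer to the dry signal than the input; a production system would work per frequency band and per channel as the patent describes.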
  • the method may further include the following steps:
  • the audio data collected by each microphone array may have an offset; for example, if there is a clock offset of 20 milliseconds, the degree of correlation of the audio data in the first multi-channel audio data can be determined and alignment processing performed according to it, ensuring that the data offset stays within one frame and does not affect subsequent processing.
  • a reference frequency band and a reference channel can be selected; the cross-correlation coefficient (that is, the degree of correlation) of the first multi-channel audio data in the reference frequency band is then calculated within the preset maximum offset range, with a search granularity finer than the frame length used in subsequent processing; the offset corresponding to the maximum cross-correlation coefficient between channels is determined, and alignment is performed with respect to the reference channel.
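  • A sketch of this offset search (time-domain, full-band, integer-sample granularity for simplicity; the function names are illustrative): the lag maximizing the normalized cross-correlation against the reference channel is found, and the channel is shifted accordingly:

```python
import numpy as np

def estimate_offset(ref, ch, max_offset):
    """Estimate the lag (in samples) of `ch` relative to `ref` by
    maximizing the normalized cross-correlation over +/- max_offset."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_offset, max_offset + 1):
        if lag >= 0:
            a, b = ref[lag:], ch[:len(ch) - lag]
        else:
            a, b = ref[:lag], ch[-lag:]
        n = min(len(a), len(b))
        corr = np.dot(a[:n], b[:n]) / (
            np.linalg.norm(a[:n]) * np.linalg.norm(b[:n]) + 1e-12)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

def align(ref, ch, max_offset):
    """Shift `ch` onto the reference timeline (circular shift: the few
    edge samples wrap around)."""
    return np.roll(ch, estimate_offset(ref, ch, max_offset))
```

A real implementation would restrict the correlation to the chosen reference frequency band and could search at sub-sample granularity.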
  • Step 103 Generate a time-frequency mask for the second multi-channel audio data
  • the time-frequency mask can generate a corresponding masking coefficient according to the relative magnitudes of the different components at each time-frequency point, which can be used for tasks such as the separation of speech and noise.
  • a classifier can be used to separate the target voice signal and other interference and noise signals in the second multi-channel audio data in the time-frequency domain, such as separating human voice and environmental noise, and then the target voice signal can be obtained.
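  • The masking idea can be sketched on a toy time-frequency grid (all magnitudes below are made up for illustration); each bin receives a coefficient determined by the relative size of its components:

```python
import numpy as np

# toy |speech| and |noise| magnitudes on a 2x2 (frequency x time) grid
speech_mag = np.array([[3.0, 0.1],
                       [0.2, 2.0]])
noise_mag = np.array([[0.5, 1.0],
                      [1.0, 0.4]])

# ratio-mask coefficient per time-frequency point: close to 1 where the
# target speech dominates, close to 0 where interference/noise dominates
mask = speech_mag ** 2 / (speech_mag ** 2 + noise_mag ** 2)

# applying the mask to the mixture keeps the speech-dominated bins
mixture_mag = speech_mag + noise_mag
separated_mag = mask * mixture_mag
```

In practice the mask is of course estimated from the mixture alone (by the classifier or preset model described next), not computed from known speech and noise components as in this toy.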
  • step 103 may include the following sub-steps:
  • Sub-step 11 Generate a first time-frequency mask for the class-target voice data in the second multi-channel audio data
  • the second multi-channel audio data can be input into a first preset model, which outputs the first time-frequency mask for the class-target voice data in the second multi-channel audio data. For example, the second multi-channel audio data can include audio data corresponding to human voice and audio data corresponding to environmental noise; the class-target audio data is the audio data corresponding to human voice, and the first time-frequency mask for the human-voice audio data can be obtained.
  • the first preset model can adopt a generative model, such as a complex Gaussian mixture model, or a discriminative model, such as a neural network structure like DNN (deep neural network), TDNN (time-delay neural network), LSTM (long short-term memory), CNN (convolutional neural network), or TCNN.
  • Sub-step 12 Determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the first time-frequency mask can be directly used as the time-frequency mask for the second multi-channel audio data, or further optimization can be performed according to the first time-frequency mask to achieve a masking effect for specified target audio data within the class-target audio data.
  • sub-step 12 may include the following sub-steps:
  • Sub-step 121 Acquire target-like voice data corresponding to the first time-frequency mask
  • the first time-frequency mask can be used to process the second multi-channel audio data, and the class-target voice data corresponding to the first time-frequency mask can then be obtained from the second multi-channel audio data.
  • Sub-step 122 Combine the class-target voice data to generate a second time-frequency mask for target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
  • the class-target voice data can be input into a second preset model, which can generate a second time-frequency mask for the target voice data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voice and audio data corresponding to environmental noise, and the human-voice audio data may include audio data corresponding to user A and audio data corresponding to user B; if the target audio data is the audio data corresponding to user A, the second time-frequency mask for user A's audio data can be obtained, realizing the masking effect of a designated person, which can be applied to scenarios such as home human-computer interaction.
  • the second preset model may be a model such as SpeakerBeam or iVector+DeepCluster.
  • Sub-step 123 Combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
  • the first time-frequency mask and the second time-frequency mask can be multiplied elementwise (dot-multiplied) to obtain the time-frequency mask for the second multi-channel audio data.
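  • A minimal sketch of this combination (all values illustrative): the first mask passes human voice versus environmental noise, the second passes the designated speaker within the voice; the elementwise product keeps only the bins that both masks pass:

```python
import numpy as np

# first time-frequency mask: human voice vs. environmental noise
mask1 = np.array([[0.9, 0.1],
                  [0.8, 0.2]])
# second time-frequency mask: designated speaker within the voice
mask2 = np.array([[1.0, 0.5],
                  [0.1, 0.9]])

# dot (elementwise) product gives the final time-frequency mask
tf_mask = mask1 * mask2
```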
  • Step 104 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • beamforming is a technology that uses the spatial spectrum characteristics of the signal received by the array to spatially filter the signal and achieve directional reception.
  • the time-frequency mask can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • Step 105 Use the first single-channel audio data to output an audio signal.
  • the first single-channel audio data can be used for audio signal output, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • In the embodiments of this application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; the first multi-channel audio data is de-reverberated to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and beamforming is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data, which is used for audio signal output. This realizes audio processing over multiple asynchronously collecting microphone arrays, avoids the high cost of relying solely on a unified, synchronously collecting array, expands the pickup range, and improves robustness; and, because a time-frequency mask is adopted, the processing does not need to rely on the position information of the microphone arrays, which improves the noise reduction and anti-interference capabilities.
  • FIG. 2 there is shown a step flowchart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 201 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 202 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • Step 203 Generate a time-frequency mask for the second multi-channel audio data
  • Step 204 Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the channel transfer function and the interference noise covariance matrix can be determined according to the time-frequency mask.
  • the time-frequency mask may include a target speech mask and an interference noise mask
  • the sum of the target speech mask and the interference noise mask may be a fixed value; for example, the sum of the target speech mask and the interference noise mask can be 1. Step 204 can include the following sub-steps:
  • Sub-step 21 Use the target voice mask to generate a target voice covariance matrix, and use the target voice covariance matrix to calculate a channel transfer function;
  • the target voice mask can be used to generate the target voice covariance matrix, and then the target voice covariance matrix can be used to calculate the channel transfer function, as follows:
  • the signal model of the microphone array can be expressed as:
  • x_i(t) = f_i(t) * s(t) + n_i(t)
  • where x_i(t) is the signal received by the i-th microphone, s(t) is the target voice signal, f_i(t) is the channel transfer function (impulse response) through which the i-th microphone receives the signal, * denotes convolution, and n_i(t) is the noise and interference signal received by the i-th microphone.
  • in the short-time frequency domain, each frequency point can be expressed as:
  • x_{f,t} = D_f s_{f,t} + n_{f,t}
  • where x_{f,t} and n_{f,t} are, respectively, the multi-channel data vector (i.e. the second multi-channel audio data) and the noise/interference signal received at frequency f at time t, s_{f,t} is the target voice signal at that time-frequency point, and D_f is the corresponding channel transfer function vector.
  • since the noise and interference are uncorrelated with the target speech signal, it can be further derived that:
  • Φ_{s,f} = ( Σ_t m_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m_{f,t} ) ≈ σ²_{s,f} D_f D_f^H
  • where Φ_{s,f} is the target speech covariance matrix estimate at the current frequency, m_{f,t} is the target speech mask at frequency f and time t, and D_f and σ²_{s,f} are the estimates of the channel transfer function vector and the target variance, respectively; that is, eigendecomposition is performed on Φ_{s,f}, and the principal eigenvalue and principal eigenvector give the target variance and the channel transfer function vector.
  • the multi-frame accumulation can be changed to an accumulation with a fading coefficient, which is convenient for real-time processing.
  • Sub-step 22 Use the interference noise mask to calculate the interference noise covariance matrix.
  • the interference noise mask can also be used to calculate the interference noise covariance matrix in the same way, as follows:
  • Φ_{n,f} = ( Σ_t (1 − m_{f,t}) x_{f,t} x_{f,t}^H ) / ( Σ_t (1 − m_{f,t}) )
  • where 1 − m_{f,t} is the interference noise mask (the two masks summing to 1, as above).
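  • A sketch of the mask-weighted covariance accumulation (here in the recursive, fading-coefficient form mentioned above for real-time use) and of the eigendecomposition step; `alpha` and the function names are illustrative:

```python
import numpy as np

def update_cov(phi, x_ft, m_ft, alpha=0.95):
    """One-frame recursive update of a mask-weighted spatial covariance
    at one frequency: phi <- alpha*phi + (1-alpha)*m*x*x^H.  Feeding the
    target speech mask gives Phi_s; feeding the interference noise mask
    (1 - m) gives Phi_n."""
    return alpha * phi + (1 - alpha) * m_ft * np.outer(x_ft, x_ft.conj())

def channel_transfer_function(phi_s):
    """Eigendecomposition of the target speech covariance: the principal
    eigenvector estimates the channel transfer function vector D_f, and
    the principal eigenvalue estimates the target variance."""
    eigvals, eigvecs = np.linalg.eigh(phi_s)   # Hermitian input, ascending order
    return eigvecs[:, -1], eigvals[-1]
```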
  • Step 205 Use the channel transfer function and the interference noise covariance matrix to determine beam weights
  • using the channel transfer function and the interference noise covariance matrix, the beam weight w_f can be calculated; the minimum variance distortionless response (MVDR) beamforming method can be used, as follows:
  • w_f = Φ_{n,f}^{-1} D_f / ( D_f^H Φ_{n,f}^{-1} D_f )
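  • A minimal sketch of the MVDR weight computation and its application at one frequency (function names illustrative):

```python
import numpy as np

def mvdr_weights(phi_n, d):
    """w_f = Phi_n^{-1} d / (d^H Phi_n^{-1} d): minimizes the
    interference-plus-noise output power subject to a distortionless
    (unit-gain) constraint toward the channel transfer function d."""
    num = np.linalg.solve(phi_n, d)     # Phi_n^{-1} d without an explicit inverse
    return num / (d.conj() @ num)

def beamform(w, x):
    """First single-channel output at one frequency: s_hat = w^H x."""
    return w.conj() @ x                 # x: (channels,) or (channels, frames)
```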
  • Step 206 Perform beamforming processing on the second multi-channel audio data by using the beam weight to obtain first single-channel audio data
  • the beam weight may be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • Step 207 Use the first single-channel audio data to output an audio signal.
  • In the embodiment of this application, the channel transfer function and the interference noise covariance matrix are determined according to the time-frequency mask; the channel transfer function and the interference noise covariance matrix are then used to determine the beam weights, and the beam weights are used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data. Estimating the channel transfer function and the interference noise covariance matrix from the time-frequency mask before beamforming reduces the voice distortion caused by beamforming, does not need to rely on the position information of the microphone arrays, can obtain processing performance similar to that of a synchronous array, and improves the noise reduction and anti-interference capabilities.
  • Referring to FIG. 3, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 301 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 302 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • Step 303 Generate a time-frequency mask for the second multi-channel audio data
  • Step 304 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • Step 305 Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data
  • the first single-channel audio data after beamforming may still have some noise and interference
  • the first single-channel audio data can be adaptively filtered to obtain the second single-channel audio data, for example, using a Generalized Sidelobe Canceller (GSC) structure.
  • the interference noise time-frequency mask can be used to produce the output of the blocking branch, and whether the current segment is target speech is judged to gate the adaptive filter coefficient update: the filter is updated in noise segments, and the filter coefficients are fixed in speech segments.
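A much-simplified single-bin sketch of the mask-gated adaptation described above: an NLMS canceller that subtracts the blocking-branch interference estimate from the beamformer output, adapting only when the target speech mask indicates a noise segment. Names, tap count, and step size are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def gsc_adaptive_cancel(main, blocked, speech_mask, mu=0.1, taps=4, thresh=0.5):
    """Per-frequency-bin NLMS interference canceller (one bin shown).

    main:        (T,) beamformer output (fixed branch)
    blocked:     (T,) blocking-branch reference (interference estimate)
    speech_mask: (T,) target-speech mask; updates are frozen when mask > thresh
    """
    w = np.zeros(taps, dtype=complex)
    buf = np.zeros(taps, dtype=complex)
    out = np.empty_like(main)
    for t in range(len(main)):
        buf = np.roll(buf, 1)
        buf[0] = blocked[t]
        y = w.conj() @ buf            # interference estimate
        e = main[t] - y               # enhanced output
        out[t] = e
        if speech_mask[t] <= thresh:  # adapt only in noise segments
            norm = (buf.conj() @ buf).real + 1e-10
            w = w + mu * np.conj(e) * buf / norm
    return out
```

Freezing the update in speech segments prevents the canceller from subtracting target speech that leaks into the blocking branch.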
  • Step 306 Use the second single-channel audio data to output an audio signal.
  • the second single-channel audio data can be used to output the audio signal, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • the second single-channel audio data is obtained by performing adaptive filtering processing on the first single-channel audio data, and then the second single-channel audio data is used to output the audio signal, which realizes adaptive filtering of the audio data and improves the purity of the output voice.
  • Referring to FIG. 4, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 401 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 402 Obtain dereverberation parameters
  • a de-reverberation parameter can be obtained, and the de-reverberation parameter can be related to the voice variance of the target voice data, and it can be used as a filter coefficient of a filter for de-reverberation processing.
  • Step 403 Using the de-reverberation parameters, perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the de-reverberation parameter can be used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • Step 404 Generate a time-frequency mask for the second multi-channel audio data
  • Step 405 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • Step 406 Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data
  • Step 407 Determine the current application type
  • the current application type can be determined.
  • Step 408 Use the single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the single-channel noise reduction strategy corresponding to the current application type can be used to perform noise reduction processing on the second single-channel audio data to obtain the third single-channel audio data, for example, noise reduction schemes based on signal statistics such as log-MMSE (log Minimum Mean Square Error), IMCRA (Improved Minima Controlled Recursive Averaging) and OMLSA (Optimally Modified Log-Spectral Amplitude Estimator), or a noise reduction network composed of structures such as DNN, LSTM, TDNN, CNN and TCNN.
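As a drastically simplified stand-in for the statistics-based schemes named above (this is not log-MMSE, IMCRA, or OMLSA themselves, only the shared skeleton): a decision-directed a priori SNR estimate driving a floored Wiener gain. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def wiener_noise_reduction(S, noise_psd, alpha=0.98, gain_floor=0.1):
    """S: (T, F) complex STFT of the noisy single-channel audio.
    noise_psd: (F,) noise power estimate. Returns the enhanced STFT."""
    T, F = S.shape
    out = np.empty_like(S)
    prev_clean = np.zeros(F)
    for t in range(T):
        gamma = np.abs(S[t]) ** 2 / (noise_psd + 1e-12)       # a posteriori SNR
        xi = alpha * prev_clean / (noise_psd + 1e-12) \
             + (1 - alpha) * np.maximum(gamma - 1, 0)          # decision-directed a priori SNR
        gain = np.maximum(xi / (1 + xi), gain_floor)           # Wiener gain with floor
        out[t] = gain * S[t]
        prev_clean = np.abs(out[t]) ** 2
    return out
```

Application-specific strategies would differ mainly in how aggressive the gain is: a communication application might tolerate a lower gain floor than a recognition front end.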
  • Step 409 Use the third single-channel audio data to output an audio signal.
  • the third single-channel audio data can be used to output the audio signal, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • the method may further include the following steps:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the first single-channel audio data, the second single-channel audio data, or the third single-channel audio data can be used to iteratively update the de-reverberation parameters, thereby obtaining more accurate de-reverberation parameters and improving the de-reverberation effect.
  • the single-channel noise reduction strategy corresponding to the current application type is used to perform noise reduction processing on the second single-channel audio data to obtain the third single-channel audio data, which is then used for audio signal output. This realizes the use of different noise reduction strategies for different application requirements, so that the output voice better matches the application requirements.
  • the de-reverberation parameters are updated iteratively, realizing positive feedback within the system and iteratively improving the system performance and the de-reverberation effect.
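The text ties the de-reverberation parameters to the target voice variance and updates them from the enhanced output. A WPE-style (weighted prediction error) single-channel, single-bin sketch of such an iteration is shown below; this is one plausible realization, not necessarily the patent's method, and tap counts, delay, and iteration count are illustrative.

```python
import numpy as np

def wpe_iteration(X, taps=8, delay=2, iters=3):
    """Single-channel, single-frequency-bin WPE-style de-reverberation.
    X: (T,) complex STFT sequence. Returns the dereverberated sequence."""
    T = len(X)
    # delayed-tap matrix: Y[t, k] = X[t - delay - k]
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        Y[d:, k] = X[:T - d]
    D = X.copy()
    for _ in range(iters):
        var = np.maximum(np.abs(D) ** 2, 1e-8)  # target-speech variance estimate
        Yw = Y.conj().T / var                   # variance-weighted taps
        R = Yw @ Y                              # weighted correlation matrix
        r = Yw @ X
        g = np.linalg.solve(R + 1e-6 * np.eye(taps), r)  # de-reverberation filter
        D = X - Y @ g                           # subtract predicted late reverberation
    return D
```

Each pass re-estimates the target variance from the enhanced output and re-solves for the filter, which is the positive-feedback loop the paragraph above describes.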
  • Referring to FIG. 5, there is shown a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application, which may specifically include the following modules:
  • the first multi-channel audio data acquisition module 501 is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the de-reverberation processing module 502 is configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the time-frequency mask generation module 503 is configured to generate a time-frequency mask for the second multi-channel audio data;
  • the beamforming processing module 504 is configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • the first audio signal output module 505 is configured to use the first single-channel audio data to output audio signals.
  • the beamforming processing module 504 includes:
  • the function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight
  • the first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the function and matrix determination submodule includes:
  • a target speech covariance matrix generating unit configured to use the target speech mask to generate a target speech covariance matrix
  • a channel transfer function obtaining unit configured to use the target voice covariance matrix to calculate the channel transfer function
  • the interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
  • the time-frequency mask generation module 503 includes:
  • the first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for target-like voice data in the second multi-channel audio data
  • the time-frequency mask determination sub-module is configured to determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the time-frequency mask determination sub-module includes:
  • the class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask
  • the second time-frequency mask generating unit is configured to combine the class-target voice data to generate a second time-frequency mask for the target voice data in the second multi-channel audio data; wherein, the class-target voice data includes the target voice data;
  • the combined time-frequency mask unit is used to combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
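The exact combination rule for the two masks is not spelled out here. One plausible (assumed) rule is an elementwise product for the target-speech mask, with its complement serving as the interference noise mask:

```python
import numpy as np

def combine_masks(mask_class_target, mask_target):
    """Combine the first (class-target) and second (target) time-frequency
    masks into a final target-speech / interference-noise mask pair.
    Elementwise product and complement are assumed rules, for illustration."""
    speech_mask = mask_class_target * mask_target
    noise_mask = 1.0 - speech_mask
    return speech_mask, noise_mask
```

The product keeps only time-frequency points that both branches attribute to the target, which makes the downstream covariance estimates conservative.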
  • the first audio signal output module 505 includes:
  • An adaptive filter processing sub-module configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data
  • the second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
  • the second audio signal output submodule includes:
  • the current application type determining unit is used to determine the current application type
  • a third single-channel audio data obtaining unit configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
  • the de-reverberation processing module 502 includes:
  • De-reverberation parameter acquisition sub-module for acquiring de-reverberation parameters
  • the second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the device also includes:
  • An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
  • the device further includes:
  • a correlation degree determination module configured to determine the correlation degree of audio data in the first multi-channel audio data
  • the alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
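The correlation-based alignment of the two modules above can be sketched as follows: find the lag that maximizes the cross-correlation between a reference channel and another channel, then shift. This is a hedged illustration for asynchronous arrays; the circular shift and the `max_lag` window are simplifying assumptions.

```python
import numpy as np

def align_channels(ref, sig, max_lag=1600):
    """Align sig to ref by the lag maximizing their cross-correlation.

    ref, sig: equal-length 1-D signals; max_lag: search window in samples.
    Returns the (circularly) shifted signal and the detected lag.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(ref[max(0, -l):len(ref) - max(0, l)],
                   sig[max(0, l):len(sig) - max(0, -l)]) for l in lags]
    best = lags[int(np.argmax(corr))]
    return np.roll(sig, -best), best
```

In practice the lag search would run per block so that clock drift between independent arrays can be tracked over time.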
  • the first multi-channel audio data is composed of audio data collected by one or more microphone arrays; the first multi-channel audio data is de-reverberated to obtain the second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and the time-frequency mask is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data, which is used for audio signal output. This realizes audio processing with multiple asynchronously-collecting microphone arrays, avoids the high cost of relying solely on a synchronously-collecting unified array, expands the pickup range, and improves robustness. By adopting the time-frequency mask, the processing does not need to rely on the position information of the microphone array, which improves the noise reduction and anti-interference capabilities.
  • the device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement this without creative work.
  • as for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for related parts, please refer to the description of the method embodiment.
  • Fig. 6 is a block diagram showing an electronic device 600 for audio data processing according to an exemplary embodiment.
  • the electronic device 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
  • the processing component 602 generally controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing element 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 602 may include one or more modules to facilitate the interaction between the processing component 602 and other components.
  • the processing component 602 may include a multimedia module to facilitate the interaction between the multimedia component 608 and the processing component 602.
  • the memory 604 is configured to store various types of data to support operations in the electronic device 600. Examples of these data include instructions for any application or method operating on the electronic device 600, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 606 provides power for various components of the electronic device 600.
  • the power supply component 606 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 600.
  • the multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 610 is configured to output and/or input audio signals.
  • the audio component 610 includes a microphone (MIC), and when the electronic device 600 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal can be further stored in the memory 604 or sent via the communication component 616.
  • the audio component 610 further includes a speaker for outputting audio signals.
  • the I/O interface 612 provides an interface between the processing component 602 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 614 includes one or more sensors for providing the electronic device 600 with various aspects of state evaluation.
  • the sensor component 614 can detect the on/off status of the device 600 and the relative positioning of components, for example, the display and keypad of the electronic device 600.
  • the sensor component 614 can also detect a position change of the electronic device 600 or a component of the electronic device 600, the presence or absence of contact between the user and the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and temperature changes of the electronic device 600.
  • the sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices.
  • the electronic device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 616 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the electronic device 600 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components to implement the above methods.
  • non-transitory computer-readable storage medium including instructions, such as the memory 604 including instructions, and the foregoing instructions may be executed by the processor 620 of the electronic device 600 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • a non-transitory computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal can execute an audio data processing method.
  • the method includes:
  • acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • the method also includes:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
  • the audio data in the first multi-channel audio data is aligned.
  • FIG. 7 is a schematic structural diagram of an electronic device 700 for audio data processing according to an embodiment of the present application.
  • the electronic device 700 may be a server, and the server 700 may vary considerably due to different configurations or performance. It may include one or more central processing units (CPU) 722 (for example, one or more processors) and memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
  • the central processing unit 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server 700.
  • the server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input and output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so on.
  • the embodiments of the present application may be provided as methods, devices, or computer program products. Therefore, the embodiments of the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

An audio data processing method and apparatus, and an electronic device (600, 700) and a storage medium (730). The method comprises: acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays (101); performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data (102); generating a time-frequency mask for the second multi-channel audio data (103); performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data (104); and outputting an audio signal by using the first single-channel audio data (105). The method realizes audio processing of a plurality of microphone arrays used for non-synchronous collection, thereby preventing the high cost caused by the fact that only unified arrays used for synchronous collection can be used for audio processing, enlarging the pickup range, and improving the robustness.

Description

一种音频数据处理的方法及装置、电子设备、存储介质Method and device for audio data processing, electronic equipment and storage medium
本申请要求在2019年11月29日提交中国专利局、申请号为201911207689.4、发明名称为“一种音频数据处理的方法及装置、电子设备、存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911207689.4, and the invention title is "a method and device for audio data processing, electronic equipment, and storage medium" on November 29, 2019, all of which The content is incorporated in this application by reference.
技术领域Technical field
本申请涉及音频数据处理领域,特别是涉及一种音频数据处理的方法及装置、电子设备、存储介质。This application relates to the field of audio data processing, and in particular to an audio data processing method and device, electronic equipment, and storage medium.
背景技术Background technique
目前,麦克风阵列技术通常集中于同步采集的统一阵列系统,而同步采集的统一阵列系统对硬件设计、制造及部署均有较高的要求。At present, the microphone array technology usually focuses on a unified array system for synchronous acquisition, and the unified array system for synchronous acquisition has higher requirements for hardware design, manufacturing, and deployment.
而且,由于只能单点部署,若要覆盖更大的范围,则需要部署大孔径且数量较多的麦克风,而随着阵列系统中麦克风数量的增强,成本会快速上升,空间部署难度也会增加,且鲁棒性会显著下降。Moreover, because it can only be deployed at a single point, if you want to cover a larger range, you need to deploy a large aperture and a large number of microphones. As the number of microphones in the array system increases, the cost will rise rapidly and the space deployment will be difficult Increase, and the robustness will decrease significantly.
发明内容Summary of the invention
鉴于上述问题,提出了以便提供克服上述问题或者至少部分地解决上述问题的一种音频数据处理的方法及装置、电子设备、存储介质,包括:In view of the above problems, an audio data processing method and device, electronic equipment, and storage medium are proposed in order to overcome the above problems or at least partially solve the above problems, including:
一种音频数据处理的方法,所述方法包括:A method for audio data processing, the method comprising:
获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
生成针对所述第二多通道音频数据的时频掩码;Generating a time-frequency mask for the second multi-channel audio data;
根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
采用所述第一单通道音频数据,进行音频信号输出。The audio signal output is performed by using the first single-channel audio data.
可选地,所述根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据的步骤包括:Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;Using the channel transfer function and the interference noise covariance matrix to determine beam weights;
采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据。The beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
可选地,所述时频掩码包括目标语音掩码和干扰噪声掩码,所述根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵的步骤包括:Optionally, the time-frequency mask includes a target speech mask and an interference noise mask, and the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
采用所述目标语音掩码,生成目标语音协方差矩阵;Using the target voice mask to generate a target voice covariance matrix;
采用所述目标语音协方差矩阵,计算得到信道传递函数;Using the target speech covariance matrix to calculate the channel transfer function;
采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。Using the interference noise mask, the interference noise covariance matrix is calculated.
可选地,所述生成针对所述第二多通道音频数据的时频掩码的步骤包括:Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
生成针对所述第二多通道音频数据中类目标语音数据的第一时频掩码;Generating a first time-frequency mask for the target voice data in the second multi-channel audio data;
根据所述第一时频掩码,确定针对所述第二多通道音频数据的时频掩码。According to the first time-frequency mask, a time-frequency mask for the second multi-channel audio data is determined.
Optionally, the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring the target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the step of using the first single-channel audio data for audio signal output includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
using the second single-channel audio data for audio signal output.
Optionally, the step of using the second single-channel audio data for audio signal output includes:
determining a current application type;
performing noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
using the third single-channel audio data for audio signal output.
Optionally, the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring de-reverberation parameters;
performing de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The method further includes:
iteratively updating the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
determining a degree of correlation of the audio data in the first multi-channel audio data;
performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
An audio data processing apparatus, the apparatus including:
a first multi-channel audio data acquisition module, configured to acquire first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
a de-reverberation processing module, configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
a time-frequency mask generation module, configured to generate a time-frequency mask for the second multi-channel audio data;
a beamforming processing module, configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
an audio signal output module, configured to use the first single-channel audio data for audio signal output.
Optionally, the beamforming processing module includes:
a function and matrix determination sub-module, configured to determine a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
a beam weight determination sub-module, configured to determine beam weights using the channel transfer function and the interference-noise covariance matrix;
a first single-channel audio data obtaining sub-module, configured to perform beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the function and matrix determination sub-module includes:
a target speech covariance matrix generation unit, configured to generate a target speech covariance matrix using the target speech mask;
a channel transfer function obtaining unit, configured to calculate the channel transfer function using the target speech covariance matrix;
an interference-noise covariance matrix obtaining unit, configured to calculate the interference-noise covariance matrix using the interference-noise mask.
Optionally, the time-frequency mask generation module includes:
a first time-frequency mask generation sub-module, configured to generate a first time-frequency mask for target-like speech data in the second multi-channel audio data;
a time-frequency mask determination sub-module, configured to determine the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the time-frequency mask determination sub-module includes:
a target-like speech data acquisition unit, configured to acquire the target-like speech data corresponding to the first time-frequency mask;
a second time-frequency mask generation unit, configured to generate, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
a combined time-frequency mask determination unit, configured to generate the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the audio signal output module includes:
an adaptive filtering processing sub-module, configured to perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
a second audio signal output sub-module, configured to use the second single-channel audio data for audio signal output.
Optionally, the second audio signal output sub-module includes:
a current application type determination unit, configured to determine a current application type;
a third single-channel audio data obtaining unit, configured to perform noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
a third audio signal output unit, configured to use the third single-channel audio data for audio signal output.
Optionally, the de-reverberation processing module includes:
a de-reverberation parameter acquisition sub-module, configured to acquire de-reverberation parameters;
a second multi-channel audio data obtaining sub-module, configured to perform de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The apparatus further includes:
an iterative update module, configured to iteratively update the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, the apparatus further includes:
a correlation degree determination module, configured to determine a degree of correlation of the audio data in the first multi-channel audio data;
an alignment processing module, configured to perform alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
An electronic device, including a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
generating a time-frequency mask for the second multi-channel audio data;
performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
using the first single-channel audio data for audio signal output.
Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data includes:
determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
determining beam weights using the channel transfer function and the interference-noise covariance matrix;
performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask includes:
generating a target speech covariance matrix using the target speech mask;
calculating the channel transfer function using the target speech covariance matrix;
calculating the interference-noise covariance matrix using the interference-noise mask.
Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring the target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the step of using the first single-channel audio data for audio signal output includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
using the second single-channel audio data for audio signal output.
Optionally, the step of using the second single-channel audio data for audio signal output includes:
determining a current application type;
performing noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
using the third single-channel audio data for audio signal output.
Optionally, the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring de-reverberation parameters;
performing de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The electronic device further includes instructions for performing the following operation:
iteratively updating the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, the electronic device further includes instructions for performing the following operations:
determining a degree of correlation of the audio data in the first multi-channel audio data;
performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
A readable storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the audio data processing method described above.
The embodiments of the present application have the following advantages:
In the embodiments of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is performed on the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; beamforming processing is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing across multiple asynchronously sampling microphone arrays, avoids the high cost of audio processing that could otherwise only use a single synchronously sampled unified array, expands the sound pickup range, and improves robustness. In addition, because a time-frequency mask is used, audio processing does not depend on the position information of the microphone arrays, which improves noise reduction and anti-interference capability.
The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objectives, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are set forth below.
Description of the Drawings
In order to explain the technical solutions of the present application more clearly, the drawings required in the description of the present application are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the steps of an audio data processing method provided by an embodiment of the present application;
FIG. 2 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 4 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application;
FIG. 6 is a structural block diagram of an electronic device for audio data processing provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another electronic device for audio data processing provided by an embodiment of the present application.
Specific Embodiments
In order to make the above objectives, features, and advantages of the present application more apparent and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific implementations. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Referring to FIG. 1, a flowchart of the steps of an audio data processing method provided by an embodiment of the present application is shown, which may specifically include the following steps:
Step 101: acquire first multi-channel audio data, where the first multi-channel audio data is composed of audio data collected by one or more microphone arrays.
One or more microphone arrays may form an asynchronously sampling array system. Specifically, because of inconsistent synchronization clocks, transmission delays, and the like, the resulting multi-channel signals are not completely synchronized in time, while sampling within a single microphone array can be synchronous. If a single microphone array contains microphones that are not sampled synchronously, each such microphone may also be treated as a separate microphone array. The audio data collected by all microphone arrays share the same sampling rate.
In practical applications, a control module, a transmission module, and a processing module may be provided. The control module can control the working state of the one or more microphone arrays, and can thus control the one or more microphone arrays to start and transmit data synchronously.
When signals are collected, the control module can control the one or more microphone arrays to start up and begin recording. The one or more microphone arrays send the collected data to the transmission module, which can use a preset packetization strategy to transmit the data collected by each microphone array synchronously to the processing module, over either a wired or a wireless connection. The processing module can then obtain the first multi-channel audio data composed of the audio data collected by the one or more microphone arrays.
In one example, when some data packets are not transmitted in time, the system can wait for a preset duration; if a packet has still not been received when the wait times out, the missing data can be zero-filled, marked, and then transmitted to the processing module.
Step 102: perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data.
Sound undergoes multipath propagation through reflection and refraction, so the audio signal received by a microphone contains multipath signals in addition to the direct signal. These multipath signals that closely follow the direct wave are called reverberation, and they often adversely affect human-computer interaction functions such as voice wake-up and speech recognition.
After obtaining the first multi-channel audio data, the processing module performs de-reverberation processing on the first multi-channel audio data using filtering methods such as linear prediction or Kalman filtering, thereby suppressing the reverberation in the original signal and obtaining the second multi-channel audio data. The de-reverberation processing can ensure that the phase relationship of the data is unchanged and does not affect subsequent processing.
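As a toy illustration of the linear-prediction idea (not the actual multi-channel filter of the embodiment), the sketch below fits a single one-tap delayed linear predictor on one channel and subtracts the predicted late echo. The delay `D`, echo gain `a`, and pseudo-random test signal are all hypothetical values chosen for demonstration.

```python
def dereverb_1tap(x, delay):
    """Subtract the best one-tap linear prediction of x[n] from x[n - delay].

    Late reverberation is modeled as a scaled, delayed copy of the signal;
    the least-squares tap is g = sum(x[n]*x[n-D]) / sum(x[n-D]**2).
    """
    num = sum(x[n] * x[n - delay] for n in range(delay, len(x)))
    den = sum(x[n - delay] ** 2 for n in range(delay, len(x)))
    g = num / den if den else 0.0
    out = list(x[:delay])  # first samples pass through unchanged
    out += [x[n] - g * x[n - delay] for n in range(delay, len(x))]
    return out, g

def lcg(seed, count):
    """Deterministic pseudo-random 'dry' source, roughly uniform in [-1, 1)."""
    vals, s = [], seed
    for _ in range(count):
        s = (1103515245 * s + 12345) % (1 << 31)
        vals.append(s / (1 << 30) - 1.0)
    return vals

# Hypothetical reverberant signal: dry source plus a delayed, attenuated echo.
dry = lcg(7, 2000)
D, a = 40, 0.5
wet = [dry[n] + (a * dry[n - D] if n >= D else 0.0) for n in range(len(dry))]
dereverbed, g = dereverb_1tap(wet, D)
```

Because the least-squares residual never has more energy than the input, the de-reverberated signal's energy does not exceed that of the reverberant one; a practical system would use a multi-tap, multi-channel predictor and estimate the delay rather than assume it.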
In an embodiment of the present application, before step 102, the method may further include the following steps:
determining a degree of correlation of the audio data in the first multi-channel audio data; and performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
Because the audio data collected by the different microphone arrays may be offset from one another, for example by a 20-millisecond clock offset, the degree of correlation of the audio data in the first multi-channel audio data can be determined, and alignment processing can then be performed according to the degree of correlation, so as to keep the data offset within one frame and leave subsequent processing unaffected.
Specifically, a reference frequency band and a reference channel can be selected, and the cross-correlation coefficients (i.e., the degree of correlation) of the first multi-channel audio data in the reference frequency band can be computed within a preset maximum offset range, with a search precision finer than the subsequent processing frame length. The offset corresponding to the maximum of the inter-channel cross-correlation coefficient is determined, and the channels are then aligned with respect to the reference channel.
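The alignment described above can be sketched as follows. This is an illustrative sketch only: it searches raw time-domain samples rather than a reference frequency band, and the `max_offset` search range and test signal are hypothetical.

```python
def best_lag(ref, ch, max_offset):
    """Return the lag in [-max_offset, max_offset] that maximizes the
    normalized cross-correlation between ref[n] and ch[n + lag]."""
    def corr_at(lag):
        pairs = [(ref[n], ch[n + lag])
                 for n in range(len(ref))
                 if 0 <= n + lag < len(ch)]
        num = sum(r * c for r, c in pairs)
        den = (sum(r * r for r, _ in pairs) *
               sum(c * c for _, c in pairs)) ** 0.5
        return num / den if den else 0.0
    return max(range(-max_offset, max_offset + 1), key=corr_at)

def align(ref, ch, max_offset):
    """Shift ch so it lines up with the reference channel, zero-padding."""
    lag = best_lag(ref, ch, max_offset)
    if lag > 0:  # ch is delayed relative to ref: drop its first samples
        return ch[lag:] + [0.0] * lag
    return [0.0] * (-lag) + ch[:len(ch) + lag]  # ch is ahead: delay it
```

A production implementation would search at sub-frame precision and over a frequency band, as the embodiment describes, but the maximize-the-cross-correlation principle is the same.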
Step 103: generate a time-frequency mask for the second multi-channel audio data.
A time-frequency mask generates a corresponding masking coefficient for each time-frequency bin according to the relative magnitudes of the different components in that bin, and can be used for tasks such as separating speech from noise.
After the second multi-channel audio data is obtained, a classifier can be used to separate, in the time-frequency domain, the target speech signal from the other interference and noise signals in the second multi-channel audio data, for example separating human voices from environmental noise, so that a time-frequency mask for the second multi-channel audio data can be obtained.
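As a toy illustration of per-bin masking coefficients (in the embodiment the mask would come from a trained classifier or mixture model, not from oracle knowledge), the sketch below computes an "ideal ratio mask" from known speech and noise powers at each time-frequency bin; the spectrogram values are hypothetical.

```python
def ideal_ratio_mask(speech_pow, noise_pow):
    """Per-bin masking coefficient in [0, 1]: speech / (speech + noise)."""
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        mask.append([s / (s + n) if (s + n) > 0 else 0.0
                     for s, n in zip(s_row, n_row)])
    return mask

# Toy 2-frame x 3-bin power spectrograms (hypothetical values).
speech = [[4.0, 1.0, 0.0],
          [9.0, 0.0, 1.0]]
noise = [[1.0, 1.0, 2.0],
         [1.0, 3.0, 3.0]]
mask = ideal_ratio_mask(speech, noise)  # e.g. mask[0][0] == 0.8
```

Bins dominated by speech get coefficients near 1 and noise-dominated bins near 0, which is exactly the per-bin "size relationship" the mask encodes.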
In an embodiment of the present application, step 103 may include the following sub-steps:
Sub-step 11: generate a first time-frequency mask for target-like speech data in the second multi-channel audio data.
In a specific implementation, the second multi-channel audio data can be input into a first preset model, and the first preset model can output the first time-frequency mask for the target-like speech data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voices and audio data corresponding to environmental noise; if the target-like audio data is the audio data corresponding to human voices, a first time-frequency mask for the human-voice audio data can be obtained.
In an example, the first preset model can be a generative model, such as a complex Gaussian mixture model, or a discriminative model composed of neural network structures such as DNN (Deep Neural Networks), TDNN (Time-Delay Neural Networks), LSTM (Long Short-Term Memory), CNN (Convolutional Neural Networks), or TCNN.
Sub-step 12: determine the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
After the first time-frequency mask is obtained, it can be used directly as the time-frequency mask for the second multi-channel audio data, or it can be further refined to achieve a masking effect for specified target audio data within the target-like audio data.
In an embodiment of the present application, sub-step 12 may include the following sub-steps:
Sub-step 121: acquire the target-like speech data corresponding to the first time-frequency mask.
In a specific implementation, the first time-frequency mask can be used to process the second multi-channel audio data, so that the target-like speech data corresponding to the first time-frequency mask can be obtained from the second multi-channel audio data.
Sub-step 122: in combination with the target-like speech data, generate a second time-frequency mask for target speech data in the second multi-channel audio data, where the target-like speech data contains the target speech data.
After the target-like speech data is obtained, it can be input into a second preset model, and the second preset model can generate the second time-frequency mask for the target speech data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voices and audio data corresponding to environmental noise, and the audio data corresponding to human voices may include audio data corresponding to user A and audio data corresponding to user B. If the target audio data is the audio data corresponding to user A, a second time-frequency mask for the audio data corresponding to user A can be obtained, thereby achieving a masking effect for a designated speaker, which is applicable to scenarios such as home human-computer interaction.
In an example, the second preset model can be a model such as SpeakerBeam or iVector+DeepCluster.
Sub-step 123: generate the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
After the first time-frequency mask and the second time-frequency mask are obtained, they can be point-multiplied (multiplied element-wise) to obtain the time-frequency mask for the second multi-channel audio data.
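The point-multiplication of the two masks can be sketched as follows; the mask values below are hypothetical.

```python
def combine_masks(mask_a, mask_b):
    """Element-wise (point) product of two equally shaped time-frequency masks."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

first_mask = [[0.9, 0.2], [0.5, 1.0]]   # e.g. "speech vs. noise" mask
second_mask = [[1.0, 0.5], [0.2, 0.8]]  # e.g. "target speaker" mask
combined = combine_masks(first_mask, second_mask)
```

A bin is kept only where both masks are large, so the product mask selects time-frequency bins belonging to the designated speaker's speech.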
Step 104: perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data.
Beamforming is a technique that achieves directional reception by spatially filtering signals according to the spatial spectral characteristics of the signals received by an array.
After the time-frequency mask is obtained, it can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
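A minimal sketch of the final filter-and-sum step, assuming the per-frequency beam weights have already been determined: the single-channel output at each frame is y(t, f) = sum over channels c of conj(w_c(f)) * X_c(t, f). The weights and STFT values below are hypothetical.

```python
def apply_beam(weights, stft_frames):
    """weights: per-channel complex weights for one frequency bin.
    stft_frames: list of frames, each a list of per-channel complex bins."""
    return [sum(w.conjugate() * x for w, x in zip(weights, frame))
            for frame in stft_frames]

w = [0.5 + 0.0j, 0.5 + 0.0j]         # two channels, equal weighting
frames = [[1.0 + 1.0j, 1.0 + 1.0j],  # both channels see the same signal
          [2.0 + 0.0j, 0.0 + 0.0j]]  # signal present on one channel only
out = apply_beam(w, frames)
```

With coherent in-phase channels the output preserves the signal, while uncorrelated components on individual channels are attenuated, which is the directional-reception effect described above.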
Step 105: use the first single-channel audio data for audio signal output.
After the first single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference and noise.
In the embodiment of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is performed on the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; beamforming processing is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing across multiple asynchronously sampling microphone arrays, avoids the high cost of audio processing that could otherwise only use a single synchronously sampled unified array, expands the sound pickup range, and improves robustness. In addition, because a time-frequency mask is used, audio processing does not depend on the position information of the microphone arrays, which improves noise reduction and anti-interference capability.
参照图2,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 2, there is shown a step flowchart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤201,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 201: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤202,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 202: Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
步骤203,生成针对所述第二多通道音频数据的时频掩码;Step 203: Generate a time-frequency mask for the second multi-channel audio data;
步骤204,根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;Step 204: Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
在获得时频掩码后,对于每个频点,可以根据时频掩码,确定信道传递 函数和干扰噪声协方差矩阵。After obtaining the time-frequency mask, for each frequency point, the channel transfer function and the interference noise covariance matrix can be determined according to the time-frequency mask.
In an embodiment of the present application, the time-frequency mask may include a target speech mask and an interference noise mask, and the sum of the target speech mask and the interference noise mask may be a fixed value (for example, 1). In that case, step 204 may include the following sub-steps:
子步骤21,采用所述目标语音掩码,生成目标语音协方差矩阵,并采用所述目标语音协方差矩阵,计算得到信道传递函数;Sub-step 21, using the target voice mask to generate a target voice covariance matrix, and using the target voice covariance matrix to calculate a channel transfer function;
在具体实现中,可以采用目标语音掩码,生成目标语音协方差矩阵,然后可以采用目标语音协方差矩阵,计算得到信道传递函数,具体如下:In specific implementation, the target voice mask can be used to generate the target voice covariance matrix, and then the target voice covariance matrix can be used to calculate the channel transfer function, as follows:
The signal model of the microphone array can be expressed as:

x_i(t) = f_i(t) * s(t) + n_i(t)

where x_i(t) is the signal received by the i-th microphone, s(t) is the target speech signal, f_i(t) is the channel transfer function (impulse response) from the source to the i-th microphone, * denotes convolution, and n_i(t) is the noise and interference received by the i-th microphone.
Applying a time-frequency transform to the above equation, each frequency bin can be expressed as:

x_{f,t} = d_f · s_{f,t} + n_{f,t}

where x_{f,t} and n_{f,t} are, respectively, the multi-channel data vector (i.e., the second multi-channel audio data) and the noise/interference signal received at frequency f and time t, s_{f,t} is the target speech signal at that time, and d_f is the corresponding channel transfer function vector.
Since the reverberation has been basically suppressed, and assuming that the noise and interference are uncorrelated with the target speech signal, it can further be derived that:

Φ_{x,f} = (1/N) Σ_{t=1}^{N} x_{f,t} x_{f,t}^H = Φ_{s,f} + Φ_{n,f},  with  Φ_{s,f} = σ²_{s,f} d_f d_f^H

where Φ_{x,f}, Φ_{s,f} and Φ_{n,f} are, respectively, the data, target and interference-noise covariance matrices at frequency bin f, σ²_{s,f} is the variance of the target speech signal at that bin, and N is the length of the time window used.
Using the obtained time-frequency mask:

Φ̂_{s,f} = ( Σ_t m^s_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m^s_{f,t} ) ≈ σ̂²_{s,f} d̂_f d̂_f^H

where Φ̂_{s,f} is the target speech covariance matrix estimate at the current frequency, m^s_{f,t} is the target speech mask for this frequency bin at time t, and d̂_f and σ̂²_{s,f} are the estimates of the channel transfer function vector and the target variance, respectively. That is, the channel transfer function vector is obtained by performing an eigendecomposition of Φ̂_{s,f} and taking the principal eigenvalue and its eigenvector. For online estimation, the multi-frame accumulation can be replaced with accumulation using a fading (forgetting) coefficient, which is convenient for real-time processing.
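The mask-weighted covariance estimate and the principal-eigenvector step described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the function name, the array shapes, and the synthetic data in the usage are assumptions.

```python
import numpy as np

def estimate_steering_vector(X, speech_mask):
    """Estimate the channel transfer function vector for one frequency bin.

    X:           (T, M) complex STFT frames (T time frames, M microphones).
    speech_mask: (T,) target-speech mask values in [0, 1] for this bin.

    Implements Phi_s = sum_t m_t x_t x_t^H / sum_t m_t, then takes the
    principal eigenvector as the steering estimate d_hat and the principal
    eigenvalue as the target-variance estimate.
    """
    w = speech_mask / (speech_mask.sum() + 1e-12)
    Phi_s = np.einsum('t,tm,tn->mn', w, X, X.conj())   # mask-weighted covariance
    eigvals, eigvecs = np.linalg.eigh(Phi_s)           # Hermitian eigendecomposition
    d_hat = eigvecs[:, -1]                             # principal eigenvector
    sigma2_hat = eigvals[-1].real                      # principal eigenvalue
    return d_hat, sigma2_hat
```

For the online variant mentioned above, the sums over t would be replaced by exponentially weighted running sums with a fading coefficient.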
子步骤22,采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。In sub-step 22, the interference noise covariance matrix is calculated by using the interference noise mask.
基于上述说明,也可以采用干扰噪声掩码,计算得到干扰噪声协方差矩阵,具体如下:Based on the above description, the interference noise mask can also be used to calculate the interference noise covariance matrix, as follows:
Φ̂_{n,f} = ( Σ_t m^n_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m^n_{f,t} )

where Φ̂_{n,f} is the interference-noise covariance matrix estimate at the current frequency and m^n_{f,t} is the interference noise mask for this frequency bin at time t.
步骤205,采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;Step 205, using the channel transfer function and the interference and noise covariance matrix to determine beam weights;
After the channel transfer function and the interference-noise covariance matrix are obtained, the beam weights w_f can be computed, for example with the minimum variance distortionless response (MVDR) beamforming method:

w_f = ( Φ̂_{n,f}^{-1} d̂_f ) / ( d̂_f^H Φ̂_{n,f}^{-1} d̂_f )
步骤206,采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 206: Perform beamforming processing on the second multi-channel audio data by using the beam weight to obtain first single-channel audio data;
After the beam weights are obtained, they can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
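The MVDR weight computation and its application to the frames of one frequency bin can be sketched as follows (a minimal illustration; the function names and array shapes are assumptions, and the noise covariance would come from the mask-weighted estimate of step 204):

```python
import numpy as np

def mvdr_weights(d, Phi_n):
    """MVDR beamformer: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)."""
    num = np.linalg.solve(Phi_n, d)   # Phi_n^{-1} d without forming an explicit inverse
    return num / (d.conj() @ num)

def apply_beamformer(w, X):
    """Beamform frames: y_t = w^H x_t; X has shape (T, M), result shape (T,)."""
    return X @ w.conj()
```

The distortionless constraint means w^H d = 1, so the target direction passes unchanged while power from other directions is minimized.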
步骤207,采用所述第一单通道音频数据,进行音频信号输出。Step 207: Use the first single-channel audio data to output an audio signal.
In this embodiment of the present application, the channel transfer function and the interference-noise covariance matrix are determined from the time-frequency mask; they are then used to determine the beam weights, and the beam weights are used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data. Estimating the channel transfer function and the interference-noise covariance matrix from the time-frequency mask before beamforming reduces the speech distortion introduced by beamforming and, without relying on the position information of the microphone arrays, achieves processing performance similar to that of a synchronized array, improving noise reduction and interference resistance.
参照图3,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 3, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤301,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 301: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤302,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 302: Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
步骤303,生成针对所述第二多通道音频数据的时频掩码;Step 303: Generate a time-frequency mask for the second multi-channel audio data;
步骤304,根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 304: Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
步骤305,对所述第一单通道音频数据进行自适应滤波处理,得到第二单通道音频数据;Step 305: Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
Since some noise and interference may remain in the single-channel audio data after beamforming, the first single-channel audio data can be adaptively filtered after it is obtained, yielding the second single-channel audio data. Specifically, a Generalized Sidelobe Canceller (GSC) can be used: the interference noise time-frequency mask serves as the blocking-branch output, and whether the current segment is target speech determines the adaptive filter coefficient update, i.e., the filter is updated in non-speech segments and its coefficients are frozen in speech segments.
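A minimal sketch of the mask-gated update described above, using a single-reference NLMS canceller as a stand-in for a full GSC (the function name, tap count and step size are assumptions; a real GSC would derive the noise reference from a blocking matrix):

```python
import numpy as np

def mask_gated_nlms(primary, reference, speech_active, taps=8, mu=0.5, eps=1e-8):
    """Adaptive interference canceller with mask-gated coefficient updates.

    primary:       beamformer output samples (target speech + residual noise).
    reference:     noise-reference samples (e.g. blocking-branch output).
    speech_active: boolean per sample; True freezes the filter (speech segment),
                   False runs the NLMS update (noise-only segment).
    """
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    buf = np.zeros(taps)
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        y = w @ buf                  # noise estimate from the reference
        e = primary[n] - y           # subtract the estimated residual noise
        out[n] = e
        if not speech_active[n]:     # update only in noise-only segments
            w += mu * e * buf / (buf @ buf + eps)
    return out
```

Freezing the coefficients during speech prevents the canceller from adapting to (and cancelling) the target signal itself.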
步骤306,采用所述第二单通道音频数据,进行音频信号输出。Step 306: Use the second single-channel audio data to output an audio signal.
After the second single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference noise.
In this embodiment of the present application, adaptive filtering is applied to the first single-channel audio data to obtain the second single-channel audio data, which is then used for audio signal output. This realizes adaptive filtering of the audio data and improves the purity of the output speech.
参照图4,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 4, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤401,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 401: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤402,获取解混响参数;Step 402: Obtain dereverberation parameters;
在具体实现中,可以获取解混响参数,该解混响参数可以与目标语音数据的语音方差相关,其可以作为用于解混响处理的滤波器的滤波器系数。In a specific implementation, a de-reverberation parameter can be obtained, and the de-reverberation parameter can be related to the voice variance of the target voice data, and it can be used as a filter coefficient of a filter for de-reverberation processing.
步骤403,采用所述解混响参数,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 403: Using the de-reverberation parameters, perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
在获得解混响参数后,可以采用该解混响参数,对第一多通道音频数据进行解混响处理,得到第二多通道音频数据。After the de-reverberation parameter is obtained, the de-reverberation parameter can be used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
步骤404,生成针对所述第二多通道音频数据的时频掩码;Step 404: Generate a time-frequency mask for the second multi-channel audio data;
步骤405,根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 405: Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
Step 406: Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
步骤407,确定当前应用类型;Step 407: Determine the current application type;
在具体实现中,为了满足不同的应用需求,如音频通信、语音唤醒和语音识别等应用,可以确定当前应用类型。In specific implementation, in order to meet different application requirements, such as audio communication, voice wake-up, and voice recognition applications, the current application type can be determined.
步骤408,采用所述当前应用类型对应的单通道降噪策略,对所述第二单通道音频数据进行降噪处理,得到第三单通道音频数据;Step 408: Use the single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
After the current application type is determined, the single-channel noise reduction strategy corresponding to it can be applied to the second single-channel audio data to obtain the third single-channel audio data. For example, noise reduction schemes based on signal statistics can be used, such as log-MMSE (Minimum Mean Square Error), IMCRA (Improved Minima Controlled Recursive Averaging) and OMLSA (Optimally Modified Log-Spectral Amplitude Estimator), or a noise reduction network built from structures such as DNN, LSTM, TDNN, CNN and TCNN.
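As one illustration of the signal-statistics family mentioned above, the following sketch computes Wiener gains with decision-directed a-priori SNR estimation, a simplified relative of log-MMSE/OMLSA. The function name, parameter values and shapes are assumptions, not the patent's implementation.

```python
import numpy as np

def wiener_decision_directed(power_spec, noise_psd, alpha=0.98, gain_floor=0.1):
    """Per-frame Wiener gains with decision-directed a-priori SNR estimation.

    power_spec: (T, F) noisy power spectrogram |Y|^2.
    noise_psd:  (F,) noise power estimate (e.g. from noise-only frames).
    Returns (T, F) spectral gains in [gain_floor, 1].
    """
    T, F = power_spec.shape
    gains = np.empty((T, F))
    prev_clean = np.zeros(F)
    for t in range(T):
        post_snr = power_spec[t] / (noise_psd + 1e-12)
        # decision-directed a-priori SNR: mix of previous clean estimate
        # and the instantaneous (posterior - 1) SNR
        prio_snr = alpha * prev_clean / (noise_psd + 1e-12) \
                   + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)
        g = np.maximum(prio_snr / (1.0 + prio_snr), gain_floor)  # Wiener gain, floored
        gains[t] = g
        prev_clean = (g ** 2) * power_spec[t]
    return gains
```

A gain floor keeps residual noise audible but natural, avoiding the musical-noise artifacts of hard spectral subtraction; the application type could select different floors or estimators (e.g. more aggressive suppression for recognition, gentler for communication).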
步骤409,采用所述第三单通道音频数据,进行音频信号输出。Step 409: Use the third single-channel audio data to output an audio signal.
After the third single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference noise.
在本申请一实施例中,该方法还可以包括如下步骤:In an embodiment of the present application, the method may further include the following steps:
采用所述第一单通道音频数据和/或,所述第二单通道音频数据和/或,所述第三单通道音频数据,迭代更新所述解混响参数。The first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
In a specific implementation, since the obtained first, second and third single-channel audio data are comparatively pure target speech, any of them can be used to iteratively update the de-reverberation parameters, so that more accurate de-reverberation parameters are obtained and the de-reverberation effect is improved.
In this embodiment of the present application, the current application type is determined, and the single-channel noise reduction strategy corresponding to it is applied to the second single-channel audio data to obtain the third single-channel audio data, which is then used for audio signal output. Different noise reduction strategies are thus applied for different application requirements, so that the output speech better matches the application.
Moreover, by iteratively updating the de-reverberation parameters with the first, second or third single-channel audio data, positive feedback on the internal performance of the whole system is achieved, iteratively improving system performance and effectively enhancing the de-reverberation effect.
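The patent does not name a specific de-reverberation algorithm. Assuming a WPE-style delayed linear-prediction filter whose weighting depends on the target speech variance (consistent with step 402, where the de-reverberation parameter is related to the speech variance), one update iteration for a single channel and frequency bin might look like:

```python
import numpy as np

def wpe_filter_update(X, psd_est, delay=2, taps=4, eps=1e-8):
    """One WPE-style update of the de-reverberation filter for a frequency bin.

    X:       (T,) complex STFT of one channel at one frequency.
    psd_est: (T,) current target-speech variance estimate per frame
             (e.g. fed back from the enhanced output of a later stage).
    Returns (g, X_clean) with X_clean[t] = X[t] - sum_k g[k] * X[t - delay - k].
    """
    T = len(X)
    # delayed tap matrix: row t holds X[t-delay], ..., X[t-delay-taps+1]
    Xtil = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        idx = delay + k
        Xtil[idx:, k] = X[:T - idx]
    w_var = 1.0 / (psd_est + eps)                     # variance-weighted least squares
    R = (Xtil.conj().T * w_var) @ Xtil
    p = (Xtil.conj().T * w_var) @ X
    g = np.linalg.solve(R + eps * np.eye(taps), p)    # regularized normal equations
    return g, X - Xtil @ g
```

Feeding the enhanced output's power back in as `psd_est` on the next pass is what makes the update iterative, matching the positive-feedback loop described above.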
It should be noted that, for the sake of brevity, the method embodiments are all described as a series of action combinations; however, those skilled in the art should appreciate that the embodiments of this application are not limited by the described sequence of actions, because according to the embodiments of the present application, some steps may be performed in another order or simultaneously. Secondly, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
参照图5,示出了本申请一实施例提供的一种音频数据处理的装置的结构示意图,具体可以包括如下模块:Referring to FIG. 5, there is shown a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application, which may specifically include the following modules:
第一多通道音频数据获取模块501,用于获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;The first multi-channel audio data acquisition module 501 is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
解混响处理模块502,用于对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;The de-reverberation processing module 502 is configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
时频掩码生成模块503,用于生成针对所述第二多通道音频数据的时频掩码;A time-frequency mask generating module 503, configured to generate a time-frequency mask for the second multi-channel audio data;
波束形成处理模块504,用于根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;The beamforming processing module 504 is configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
第一音频信号输出模块505,用于采用所述第一单通道音频数据,进行音频信号输出。The first audio signal output module 505 is configured to use the first single-channel audio data to output audio signals.
在本申请一实施例中,所述波束形成处理模块504包括:In an embodiment of the present application, the beamforming processing module 504 includes:
函数和矩阵确定子模块,用于根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;The function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
波束权值确定子模块,用于采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;The beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight;
第一单通道音频数据得到子模块,用于采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据。The first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
在本申请一实施例中,所述时频掩码包括目标语音掩码和干扰噪声掩码,所述函数和矩阵确定子模块包括:In an embodiment of the present application, the time-frequency mask includes a target speech mask and an interference noise mask, and the function and matrix determination submodule includes:
目标语音协方差矩阵生成单元,用于采用所述目标语音掩码,生成目标语音协方差矩阵;A target speech covariance matrix generating unit, configured to use the target speech mask to generate a target speech covariance matrix;
信道传递函数得到单元,用于采用所述目标语音协方差矩阵,计算得到信道传递函数;A channel transfer function obtaining unit, configured to use the target voice covariance matrix to calculate the channel transfer function;
干扰噪声协方差矩阵得到单元,用于采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。The interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
在本申请一实施例中,所述时频掩码生成模块503包括:In an embodiment of the present application, the time-frequency mask generation module 503 includes:
第一时频掩码生成子模块,用于生成针对所述第二多通道音频数据中类目标语音数据的第一时频掩码;The first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for target-like voice data in the second multi-channel audio data;
The time-frequency mask determining sub-module is configured to determine, according to the first time-frequency mask, the time-frequency mask for the second multi-channel audio data.
In an embodiment of the present application, the time-frequency mask determining sub-module includes:
类目标语音数据获取单元,用于获取所述第一时频掩码对应的类目标语音数据;The class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask;
The second time-frequency mask generating unit is configured to generate, in combination with the class-target voice data, a second time-frequency mask for the target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
The combining determination unit is configured to combine the first time-frequency mask and the second time-frequency mask to generate the time-frequency mask for the second multi-channel audio data.
在本申请一实施例中,所述第一音频信号输出模块505包括:In an embodiment of the present application, the first audio signal output module 505 includes:
自适应滤波处理子模块,用于对所述第一单通道音频数据进行自适应滤波处理,得到第二单通道音频数据;An adaptive filter processing sub-module, configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data;
第二音频信号输出子模块,用于采用所述第二单通道音频数据,进行音频信号输出。The second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
在本申请一实施例中,所述第二音频信号输出子模块包括:In an embodiment of the present application, the second audio signal output submodule includes:
当前应用类型确定单元,用于确定当前应用类型;The current application type determining unit is used to determine the current application type;
第三单通道音频数据得到单元,用于采用所述当前应用类型对应的单通道降噪策略,对所述第二单通道音频数据进行降噪处理,得到第三单通道音频数据;A third single-channel audio data obtaining unit, configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
第三音频信号输出单元,用于采用所述第三单通道音频数据,进行音频信号输出。The third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
在本申请一实施例中,所述解混响处理模块502包括:In an embodiment of the present application, the de-reverberation processing module 502 includes:
解混响参数获取子模块,用于获取解混响参数;De-reverberation parameter acquisition sub-module for acquiring de-reverberation parameters;
第二多通道音频数据得到子模块,用于采用所述解混响参数,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;The second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
所述装置还包括:The device also includes:
迭代更新模块,用于采用所述第一单通道音频数据和/或,所述第二单通道音频数据和/或,所述第三单通道音频数据,迭代更新所述解混响参数。An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
在本申请一实施例中,所述装置还包括:In an embodiment of the present application, the device further includes:
相关程度确定模块,用于确定所述第一多通道音频数据中音频数据的相关程度;A correlation degree determination module, configured to determine the correlation degree of audio data in the first multi-channel audio data;
对齐处理模块,用于按照所述相关程度,对所述第一多通道音频数据中音频数据进行对齐处理。The alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
In this embodiment of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is applied to the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; the time-frequency mask is used to perform beamforming processing on the second multi-channel audio data to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing over multiple asynchronously sampled microphone arrays, avoids the high cost of being restricted to a single synchronously sampled unified array, enlarges the sound pickup range, and improves robustness. Moreover, because a time-frequency mask is used, the processing does not depend on the position information of the microphone arrays, which improves noise reduction and interference resistance.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in One place, or it can be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement it without creative work.
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
图6是根据一示例性实施例示出的一种用于音频数据处理的电子设备600的框图。例如,电子设备600可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Fig. 6 is a block diagram showing an electronic device 600 for audio data processing according to an exemplary embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
参照图6,电子设备600可以包括以下一个或多个组件:处理组件602,存储器604,电源组件606,多媒体组件608,音频组件610,输入/输出(I/O)的接口612,传感器组件614,以及通信组件616。6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, and a sensor component 614 , And the communication component 616.
处理组件602通常控制电子设备600的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理元件602可以包括一个或多个处理器620来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件602可以包括一个或多个模块,便于处理组件602和其他组件之间的交互。例如,处理部件602可以包括多媒体模块,以方便多媒体组件608和处理组件602之间的交互。The processing component 602 generally controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the foregoing method. In addition, the processing component 602 may include one or more modules to facilitate the interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate the interaction between the multimedia component 608 and the processing component 602.
存储器604被配置为存储各种类型的数据以支持在电子设备600的操 作。这些数据的示例包括用于在电子设备600上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器604可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 604 is configured to store various types of data to support operations in the electronic device 600. Examples of these data include instructions for any application or method operating on the electronic device 600, contact data, phone book data, messages, pictures, videos, etc. The memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.
电源组件606为电子设备600的各种组件提供电力。电源组件606可以包括电源管理系统,一个或多个电源,及其他与为电子设备600生成、管理和分配电力相关联的组件。The power supply component 606 provides power for various components of the electronic device 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 600.
The multimedia component 608 includes a screen providing an output interface between the electronic device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focal-length and optical-zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device 600 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device 600. For example, the sensor component 614 can detect the on/off state of the device 600 and the relative positioning of components (for example, the display and keypad of the electronic device 600), and can also detect a change in position of the electronic device 600 or one of its components, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and temperature changes of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices. The electronic device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 604 including instructions, which can be executed by the processor 620 of the electronic device 600 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform an audio data processing method, the method including:
acquiring first multi-channel audio data, where the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
generating a time-frequency mask for the second multi-channel audio data;
performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
outputting an audio signal using the first single-channel audio data.
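By way of a non-limiting illustration of the mask-generation step above: a time-frequency mask assigns each short-time Fourier transform (STFT) bin a weight reflecting how speech-dominated it is. The application does not fix a particular estimator, so the magnitude-ratio form sketched below is an assumption, and the function name and inputs are hypothetical:

```python
import numpy as np

def ratio_mask(speech_spec, noise_spec, eps=1e-8):
    """Magnitude-ratio time-frequency mask with values in [0, 1].

    speech_spec, noise_spec: (frames, bins) STFT magnitude estimates
    (hypothetical inputs; the application does not specify how they
    are obtained). Values near 1 mark speech-dominated bins, values
    near 0 mark noise-dominated bins.
    """
    s2 = np.abs(speech_spec) ** 2
    n2 = np.abs(noise_spec) ** 2
    # eps avoids division by zero in silent bins
    return s2 / (s2 + n2 + eps)
```

Such a mask is applied per bin; downstream steps (covariance estimation, beamforming) consume it as a per-frame weighting rather than applying it directly to the signal.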
Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
determining beam weights using the channel transfer function and the interference-noise covariance matrix;
performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
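The beam-weight computation in these steps can be sketched with a minimum-variance distortionless-response (MVDR) beamformer, a common choice when a channel transfer function and an interference-noise covariance matrix are available. The application does not name a specific beamformer, so MVDR and the diagonal-loading regularization below are assumptions:

```python
import numpy as np

def mvdr_weights(h, phi_n, diag_load=1e-6):
    """MVDR beam weights w = Phi_n^{-1} h / (h^H Phi_n^{-1} h).

    h: (C,) complex channel transfer function (steering vector).
    phi_n: (C, C) interference-noise covariance matrix.
    diag_load regularizes the matrix inversion (an assumed detail).
    """
    C = h.shape[0]
    num = np.linalg.solve(phi_n + diag_load * np.eye(C), h)
    # Normalization enforces the distortionless constraint w^H h = 1
    return num / (h.conj() @ num)

def beamform(Y, w):
    """Apply weights to (frames, C) multi-channel STFT data -> (frames,)."""
    return Y @ w.conj()
```

With the distortionless constraint, the target direction is passed at unit gain while interference-plus-noise power at the output is minimized.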
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask includes:
generating a target speech covariance matrix using the target speech mask;
calculating a channel transfer function using the target speech covariance matrix;
calculating an interference-noise covariance matrix using the interference-noise mask.
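A hedged sketch of these three sub-steps for a single frequency bin: both covariance matrices are mask-weighted outer products of the multi-channel STFT frames, and the channel transfer function is often taken as the principal eigenvector of the target speech covariance. Neither estimator is fixed by the application; both are common choices assumed here:

```python
import numpy as np

def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance for one frequency bin.

    Y: (frames, C) complex STFT frames; mask: (frames,) weights in [0, 1].
    Using the target speech mask yields the speech covariance; using the
    interference-noise mask yields the interference-noise covariance.
    """
    outer = Y[:, :, None] * Y[:, None, :].conj()      # (frames, C, C)
    return (mask[:, None, None] * outer).sum(axis=0) / (mask.sum() + eps)

def transfer_function(phi_speech):
    """Channel transfer function as the principal eigenvector of the
    target speech covariance (an assumed estimator)."""
    _, vecs = np.linalg.eigh(phi_speech)  # eigenvalues in ascending order
    return vecs[:, -1]
```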
Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, where the target-like speech data contains the target speech data;
generating a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
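The application states only that the first and second time-frequency masks are combined; one simple, assumed reading is an element-wise product, so that a time-frequency bin is retained only to the extent that both the coarse (target-like) mask and the refined (target) mask flag it:

```python
import numpy as np

def combine_masks(first_mask, second_mask):
    """Element-wise combination of two time-frequency masks.

    The product form is an assumption, not specified by the application;
    it keeps a bin only if both the coarse and the refined mask agree.
    """
    combined = first_mask * second_mask
    return np.clip(combined, 0.0, 1.0)
```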
Optionally, the step of outputting an audio signal using the first single-channel audio data includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
outputting an audio signal using the second single-channel audio data.
Optionally, the step of outputting an audio signal using the second single-channel audio data includes:
determining a current application type;
performing noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
outputting an audio signal using the third single-channel audio data.
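Selecting a single-channel noise-reduction strategy per application type can be sketched as a dispatch table. The application types and the strategies below are hypothetical, chosen only to illustrate the selection step, not taken from the application:

```python
# Hypothetical strategies: stronger suppression for calls, milder for recording.
def aggressive_suppress(samples):
    return [0.5 * v for v in samples]

def light_suppress(samples):
    return [0.9 * v for v in samples]

# Hypothetical application-type names mapped to strategies.
STRATEGIES = {"call": aggressive_suppress, "recording": light_suppress}

def denoise_for_app(app_type, audio):
    """Pick the noise-reduction strategy for the current application type
    and apply it; fall back to the mild strategy for unknown types."""
    return STRATEGIES.get(app_type, light_suppress)(audio)
```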
Optionally, the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring dereverberation parameters;
performing dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data.
The method further includes:
iteratively updating the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
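The application leaves the dereverberation algorithm open. A common family is weighted-prediction-error (WPE)-style delayed linear prediction, in which late reverberation is predicted from delayed past frames and subtracted, and the prediction filter (the dereverberation parameter) is re-estimated iteratively from the enhanced output. The sketch below is a minimal single-channel, single-frequency-bin version under that assumption:

```python
import numpy as np

def delayed_prediction_dereverb(x, taps=3, delay=2, iters=2, eps=1e-8):
    """Minimal WPE-style dereverberation for one STFT bin (single channel).

    x: (frames,) complex sequence for one frequency bin.
    The filter g plays the role of the 'dereverberation parameter' and is
    iteratively re-estimated from the current output's power (an assumed
    update scheme, not specified by the application).
    """
    T = len(x)
    # Regressor of delayed past frames: column k holds x shifted by delay+k.
    X = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        X[shift:, k] = x[: T - shift]
    d = x.copy()
    for _ in range(iters):
        lam = np.abs(d) ** 2 + eps           # time-varying power weights
        A = (X.conj().T / lam) @ X           # weighted normal equations
        b = (X.conj().T / lam) @ x
        g = np.linalg.solve(A + eps * np.eye(taps), b)
        d = x - X @ g                        # subtract predicted late reverb
    return d
```

The delay keeps the direct sound and early reflections untouched; only the late tail is predicted and removed.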
Optionally, before the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
determining the degree of correlation of the audio data in the first multi-channel audio data;
aligning the audio data in the first multi-channel audio data according to the degree of correlation.
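A hedged sketch of this correlation-and-alignment pre-processing: estimate each channel's lag relative to a reference channel from the peak of their cross-correlation, then shift the channel accordingly. The application does not specify the correlation measure; plain cross-correlation and the circular shift below are simplifying assumptions:

```python
import numpy as np

def estimate_lag(ref, ch):
    """Lag (in samples) of `ch` relative to `ref` via the cross-correlation peak."""
    corr = np.correlate(ch, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def align_channels(channels):
    """Shift every channel so it lines up with the first (reference) channel."""
    ref = channels[0]
    aligned = [ref.copy()]
    for ch in channels[1:]:
        lag = estimate_lag(ref, ch)
        aligned.append(np.roll(ch, -lag))  # circular shift; a simplification
    return aligned
```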
FIG. 7 is a schematic structural diagram of an electronic device 700 for audio data processing according to an embodiment of the present application. The electronic device 700 may be a server. The server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (for example, one or more processors), memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage media 730 may provide transient or persistent storage. A program stored on a storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 722 may be configured to communicate with the storage medium 730 and execute, on the server 700, the series of instruction operations in the storage medium 730.
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that the embodiments have in common, reference may be made among them.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so on) containing computer-usable program code.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data-processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data-processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise," "include," or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The audio data processing method and apparatus, electronic device, and storage medium provided above have been described in detail. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application in accordance with the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (28)

  1. An audio data processing method, characterized in that the method comprises:
    acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    generating a time-frequency mask for the second multi-channel audio data;
    performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    outputting an audio signal using the first single-channel audio data.
  2. The method according to claim 1, characterized in that the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data comprises:
    determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    determining beam weights using the channel transfer function and the interference-noise covariance matrix;
    performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
  3. The method according to claim 2, characterized in that the time-frequency mask comprises a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask comprises:
    generating a target speech covariance matrix using the target speech mask;
    calculating a channel transfer function using the target speech covariance matrix;
    calculating an interference-noise covariance matrix using the interference-noise mask.
  4. The method according to claim 1, 2, or 3, characterized in that the step of generating a time-frequency mask for the second multi-channel audio data comprises:
    generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
    determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  5. The method according to claim 4, characterized in that the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask comprises:
    acquiring target-like speech data corresponding to the first time-frequency mask;
    generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
    generating a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
  6. The method according to claim 1, characterized in that the step of outputting an audio signal using the first single-channel audio data comprises:
    performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    outputting an audio signal using the second single-channel audio data.
  7. The method according to claim 6, characterized in that the step of outputting an audio signal using the second single-channel audio data comprises:
    determining a current application type;
    performing noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    outputting an audio signal using the third single-channel audio data.
  8. The method according to claim 7, characterized in that the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data comprises:
    acquiring dereverberation parameters;
    performing dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data;
    the method further comprising:
    iteratively updating the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
  9. The method according to claim 1, characterized in that, before the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further comprises:
    determining the degree of correlation of the audio data in the first multi-channel audio data;
    aligning the audio data in the first multi-channel audio data according to the degree of correlation.
  10. An audio data processing apparatus, characterized in that the apparatus comprises:
    a first multi-channel audio data acquisition module, configured to acquire first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    a dereverberation processing module, configured to perform dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    a time-frequency mask generation module, configured to generate a time-frequency mask for the second multi-channel audio data;
    a beamforming processing module, configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    an audio signal output module, configured to output an audio signal using the first single-channel audio data.
  11. The apparatus according to claim 10, characterized in that the beamforming processing module comprises:
    a function and matrix determination submodule, configured to determine a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    a beam weight determination submodule, configured to determine beam weights using the channel transfer function and the interference-noise covariance matrix;
    a first single-channel audio data obtaining submodule, configured to perform beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
  12. The apparatus according to claim 11, characterized in that the time-frequency mask comprises a target speech mask and an interference-noise mask, and the function and matrix determination submodule comprises:
    a target speech covariance matrix generation unit, configured to generate a target speech covariance matrix using the target speech mask;
    a channel transfer function obtaining unit, configured to calculate a channel transfer function using the target speech covariance matrix;
    an interference-noise covariance matrix obtaining unit, configured to calculate an interference-noise covariance matrix using the interference-noise mask.
  13. The apparatus according to claim 10, 11, or 12, characterized in that the time-frequency mask generation module comprises:
    a first time-frequency mask generation submodule, configured to generate a first time-frequency mask for target-like speech data in the second multi-channel audio data;
    a time-frequency mask determination submodule, configured to determine, according to the first time-frequency mask, a time-frequency mask for the second multi-channel audio data.
  14. The apparatus according to claim 13, characterized in that the time-frequency mask determination submodule comprises:
    a target-like speech data acquisition unit, configured to acquire target-like speech data corresponding to the first time-frequency mask;
    a second time-frequency mask generation unit, configured to generate, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
    a combined time-frequency mask determination unit, configured to generate a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
  15. The apparatus according to claim 10, characterized in that the audio signal output module comprises:
    an adaptive filtering processing submodule, configured to perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    a second audio signal output submodule, configured to output an audio signal using the second single-channel audio data.
  16. The apparatus according to claim 15, characterized in that the second audio signal output submodule comprises:
    a current application type determination unit, configured to determine a current application type;
    a third single-channel audio data obtaining unit, configured to perform noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    a third audio signal output unit, configured to output an audio signal using the third single-channel audio data.
  17. The apparatus according to claim 16, characterized in that the dereverberation processing module comprises:
    a dereverberation parameter acquisition submodule, configured to acquire dereverberation parameters;
    a second multi-channel audio data obtaining submodule, configured to perform dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data;
    the apparatus further comprising:
    an iterative update module, configured to iteratively update the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
  18. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a correlation degree determination module, configured to determine the degree of correlation of the audio data in the first multi-channel audio data;
    an alignment processing module, configured to align the audio data in the first multi-channel audio data according to the degree of correlation.
  19. 一种电子设备,其特征在于,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行以下操作的指令:An electronic device characterized by comprising a memory and one or more programs, wherein one or more programs are stored in the memory and configured to be executed by one or more processors. The program contains instructions for the following operations:
    获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    生成针对所述第二多通道音频数据的时频掩码;Generating a time-frequency mask for the second multi-channel audio data;
    根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    采用所述第一单通道音频数据,进行音频信号输出。The audio signal output is performed by using the first single-channel audio data.
  20. The electronic device according to claim 19, wherein the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data comprises:
    determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    determining beam weights by using the channel transfer function and the interference-noise covariance matrix;
    performing beamforming processing on the second multi-channel audio data by using the beam weights to obtain the first single-channel audio data.
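The two-step weight computation recited above (transfer function plus interference-noise covariance, then beam weights) matches the shape of a minimum-variance distortionless-response (MVDR) beamformer, a standard choice for this step. The application does not name a specific beamformer, so the following numpy sketch for one frequency bin is an illustrative assumption; all names are hypothetical:

```python
import numpy as np

def mvdr_weights(d, Rn, diag_load=1e-6):
    """MVDR beam weights w = Rn^{-1} d / (d^H Rn^{-1} d) for one frequency bin.

    d  : (M,) complex channel transfer function (steering vector)
    Rn : (M, M) interference-plus-noise covariance matrix
    """
    M = Rn.shape[0]
    # Diagonal loading keeps the solve stable when Rn is near-singular.
    Rn = Rn + diag_load * np.trace(Rn).real / M * np.eye(M)
    Rn_inv_d = np.linalg.solve(Rn, d)
    return Rn_inv_d / (np.conj(d) @ Rn_inv_d)

def apply_beamformer(w, X):
    """Single-channel output y[t] = w^H x[t] for STFT frames X of shape (T, M)."""
    return X @ np.conj(w)

# Toy example: 4 microphones, identity noise covariance, unit steering vector.
M = 4
d = np.ones(M, dtype=complex)
Rn = np.eye(M, dtype=complex)
w = mvdr_weights(d, Rn)
# MVDR is distortionless toward the steering direction: w^H d = 1.
assert np.isclose(np.conj(w) @ d, 1.0)
```

The distortionless constraint `w^H d = 1` is what makes the beamformer pass the target speech unchanged while minimizing interference-noise power.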
  21. The electronic device according to claim 20, wherein the time-frequency mask comprises a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask comprises:
    generating a target speech covariance matrix by using the target speech mask;
    calculating the channel transfer function by using the target speech covariance matrix;
    calculating the interference-noise covariance matrix by using the interference-noise mask.
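Mask-based covariance estimation of the kind recited here is commonly implemented as a mask-weighted average of outer products of the multi-microphone STFT vectors, with the channel transfer function then taken as the principal eigenvector of the target-speech covariance. The eigenvector choice and all names below are assumptions for the sketch, not details from this application:

```python
import numpy as np

def masked_covariance(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance for one frequency bin.

    X    : (T, M) complex STFT frames for M microphones
    mask : (T,) real time-frequency mask values in [0, 1]
    """
    # R = sum_t mask[t] * x_t x_t^H / sum_t mask[t]
    num = np.einsum('t,tm,tn->mn', mask, X, np.conj(X))
    return num / (mask.sum() + eps)

def channel_transfer_function(R_speech):
    """Principal eigenvector of the target-speech covariance as steering vector."""
    vals, vecs = np.linalg.eigh(R_speech)
    return vecs[:, -1]  # eigenvector of the largest eigenvalue

# Toy example: a rank-one "speech" field d d^H plus weak sensor noise.
rng = np.random.default_rng(0)
M, T = 3, 200
d = np.array([1.0, 0.5 + 0.5j, -0.3j])
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
X = np.outer(s, d) + 0.01 * (rng.standard_normal((T, M))
                             + 1j * rng.standard_normal((T, M)))
R = masked_covariance(X, np.ones(T))
d_hat = channel_transfer_function(R)
# The estimate should align with d up to scale and phase.
cos = abs(np.vdot(d_hat, d)) / (np.linalg.norm(d_hat) * np.linalg.norm(d))
assert cos > 0.99
```

The same `masked_covariance` routine serves both masks: applied with the target speech mask it yields the speech covariance, and applied with the interference-noise mask it yields the interference-noise covariance used by the beamformer.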
  22. The electronic device according to claim 19, 20 or 21, wherein the step of generating a time-frequency mask for the second multi-channel audio data comprises:
    generating a first time-frequency mask for quasi-target speech data in the second multi-channel audio data;
    determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  23. The electronic device according to claim 22, wherein the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask comprises:
    acquiring quasi-target speech data corresponding to the first time-frequency mask;
    generating, in combination with the quasi-target speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the quasi-target speech data contains the target speech data;
    generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
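The claim leaves open how the two masks are combined. One simple, commonly used choice is an elementwise product, so that a time-frequency point survives only if both the quasi-target (first) mask and the target-refining (second) mask retain it. A hedged sketch under that assumption:

```python
import numpy as np

def combine_masks(first_mask, second_mask):
    """Elementwise product of the two stage masks (illustrative choice):
    a point is kept only when both stages keep it."""
    return first_mask * second_mask

# Toy 2x3 time-frequency grids: stage 1 keeps speech-like points,
# stage 2 narrows them down to the target speaker.
m1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 1.0, 1.0]])
m2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
combined = combine_masks(m1, m2)
assert np.array_equal(combined, np.array([[1.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0]]))
```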
  24. The electronic device according to claim 19, wherein the step of outputting an audio signal by using the first single-channel audio data comprises:
    performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    outputting an audio signal by using the second single-channel audio data.
  25. The electronic device according to claim 24, wherein the step of outputting an audio signal by using the second single-channel audio data comprises:
    determining a current application type;
    performing noise reduction processing on the second single-channel audio data by using a single-channel noise reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    outputting an audio signal by using the third single-channel audio data.
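Selecting a noise reduction strategy by application type, as recited above, can be illustrated with a simple dispatch table over the parameters of a spectral-subtraction rule: recognition typically tolerates residual noise better than it tolerates speech distortion, while communication favors stronger suppression. The strategy names and parameter values below are assumptions for the sketch, not taken from this application:

```python
import numpy as np

# Illustrative per-application settings (assumed values): recognition keeps
# more of the signal, communication suppresses noise harder.
STRATEGIES = {
    'speech_recognition': {'over_subtract': 1.0, 'floor': 0.2},
    'communication':      {'over_subtract': 2.0, 'floor': 0.05},
}

def spectral_subtraction(power_spec, noise_power, app='speech_recognition'):
    """Power-domain spectral subtraction with an app-dependent spectral floor."""
    p = STRATEGIES[app]
    clean = power_spec - p['over_subtract'] * noise_power
    # Floor at a fraction of the input power to limit musical-noise artifacts.
    return np.maximum(clean, p['floor'] * power_spec)

spec = np.array([1.0, 4.0, 0.5])
noise = np.array([0.5, 0.5, 0.5])
out = spectral_subtraction(spec, noise, 'communication')
assert np.allclose(out, [0.05, 3.0, 0.025])
```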
  26. The electronic device according to claim 25, wherein the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data comprises:
    acquiring dereverberation parameters;
    performing dereverberation processing on the first multi-channel audio data by using the dereverberation parameters, to obtain the second multi-channel audio data;
    the electronic device further comprising instructions for performing the following operation:
    iteratively updating the dereverberation parameters by using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
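Iteratively updating dereverberation parameters from the enhanced output, as recited above, resembles the weighted prediction error (WPE) family of methods, in which each pass re-weights a linear-prediction fit by the power of the current dereverberated estimate. The following is a much-simplified single-channel, single-frequency-bin sketch; all parameter names and values are assumptions, not details from this application:

```python
import numpy as np

def wpe_dereverb(x, taps=8, delay=3, iters=3, floor=1e-3, ridge=1e-6):
    """Iteratively re-weighted linear-prediction dereverberation for one
    single-channel frequency bin (a much-simplified WPE-style update).

    x : (T,) real or complex frame sequence for one bin
    """
    T = len(x)
    # Delayed observation matrix: column k holds x shifted by (delay + k).
    Xt = np.zeros((T, taps), dtype=x.dtype)
    for k in range(taps):
        shift = delay + k
        Xt[shift:, k] = x[:T - shift]
    d = x.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, floor)    # per-frame power weights
        A = (Xt.conj().T / lam) @ Xt               # X^H diag(1/lam) X
        b = (Xt.conj().T / lam) @ x
        g = np.linalg.solve(A + ridge * np.eye(taps), b)
        d = x - Xt @ g                             # subtract predicted late reverb
    return d

# Toy example: synthetic "late reverberation" feeds back at lag 4.
rng = np.random.default_rng(2)
s = rng.standard_normal(2000)
x = s.copy()
for t in range(4, len(x)):
    x[t] += 0.6 * x[t - 4]
d = wpe_dereverb(x, taps=4, delay=3)
# Removing the predictable tail lowers the signal power toward that of s.
assert np.var(d) < np.var(x)
```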
  27. The electronic device according to claim 19, wherein the electronic device further comprises instructions for performing the following operations:
    determining the degree of correlation of the audio data in the first multi-channel audio data;
    aligning the audio data in the first multi-channel audio data according to the degree of correlation.
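Correlation-based alignment of the kind recited here is often realized by estimating the inter-channel lag that maximizes the cross-correlation with a reference channel, then shifting each channel to compensate. The application does not specify the correlation measure, so the following numpy sketch is an illustrative assumption:

```python
import numpy as np

def estimate_lag(ref, sig):
    """Lag (in samples) that maximizes the cross-correlation of sig with ref."""
    corr = np.correlate(sig, ref, mode='full')
    return np.argmax(corr) - (len(ref) - 1)

def align_channels(channels):
    """Shift each channel so it lines up with the first (reference) channel."""
    ref = channels[0]
    aligned = [ref]
    for sig in channels[1:]:
        lag = estimate_lag(ref, sig)
        aligned.append(np.roll(sig, -lag))  # compensate the estimated delay
    return np.stack(aligned)

# Toy example: the second channel is the first delayed by 5 samples.
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
delayed = np.roll(x, 5)
out = align_channels([x, delayed])
assert estimate_lag(x, delayed) == 5
assert np.allclose(out[1], x)
```

In a practical device the circular `np.roll` would be replaced by a delay line, but the lag estimate itself is the same.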
  28. A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the audio data processing method according to any one of claims 1 to 9.
PCT/CN2020/110038 2019-11-29 2020-08-19 Audio data processing method and apparatus, and electronic device and storage medium WO2021103672A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911207689.4A CN110970046B (en) 2019-11-29 2019-11-29 Audio data processing method and device, electronic equipment and storage medium
CN201911207689.4 2019-11-29

Publications (1)

Publication Number Publication Date
WO2021103672A1 true WO2021103672A1 (en) 2021-06-03

Family ID: 70032376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/110038 WO2021103672A1 (en) 2019-11-29 2020-08-19 Audio data processing method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN110970046B (en)
WO (1) WO2021103672A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970046B (en) * 2019-11-29 2022-03-11 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium
CN112420073B (en) * 2020-10-12 2024-04-16 北京百度网讯科技有限公司 Voice signal processing method, device, electronic equipment and storage medium
CN113270097B (en) * 2021-05-18 2022-05-17 成都傅立叶电子科技有限公司 Unmanned mechanical control method, radio station voice instruction conversion method and device
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN113644947A (en) * 2021-10-14 2021-11-12 西南交通大学 Adaptive beam forming method, device, equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788607A (en) * 2016-05-20 2016-07-20 中国科学技术大学 Speech enhancement method applied to dual-microphone array
CN106448722A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Sound recording method, device and system
CN108335701A (en) * 2018-01-24 2018-07-27 青岛海信移动通信技术股份有限公司 A kind of method and apparatus carrying out noise reduction
CN109166590A (en) * 2018-08-21 2019-01-08 江西理工大学 A kind of two-dimentional time-frequency mask estimation modeling method based on spatial correlation
WO2019049276A1 (en) * 2017-09-07 2019-03-14 三菱電機株式会社 Noise elimination device and noise elimination method
US10249299B1 (en) * 2013-06-27 2019-04-02 Amazon Technologies, Inc. Tailoring beamforming techniques to environments
CN110503971A (en) * 2018-05-18 2019-11-26 英特尔公司 Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN110970046A (en) * 2019-11-29 2020-04-07 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244036A (en) * 2014-06-27 2016-01-13 中兴通讯股份有限公司 Microphone speech enhancement method and microphone speech enhancement device
US10475466B2 (en) * 2014-07-17 2019-11-12 Ford Global Technologies, Llc Adaptive vehicle state-based hands-free phone noise reduction with learning capability
CN204117590U (en) * 2014-09-24 2015-01-21 广东外语外贸大学 Voice collecting denoising device and voice quality assessment system
US11133011B2 (en) * 2017-03-13 2021-09-28 Mitsubishi Electric Research Laboratories, Inc. System and method for multichannel end-to-end speech recognition
CN107316649B (en) * 2017-05-15 2020-11-20 百度在线网络技术(北京)有限公司 Speech recognition method and device based on artificial intelligence
US10192566B1 (en) * 2018-01-17 2019-01-29 Sorenson Ip Holdings, Llc Noise reduction in an audio system
CN108831495B (en) * 2018-06-04 2022-11-29 桂林电子科技大学 Speech enhancement method applied to speech recognition in noise environment
CN108806707B (en) * 2018-06-11 2020-05-12 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment and storage medium
CN109817236A (en) * 2019-02-01 2019-05-28 安克创新科技股份有限公司 Audio defeat method, apparatus, electronic equipment and storage medium based on scene


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689870A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Multi-channel voice enhancement method and device, terminal and readable storage medium
CN114898767A (en) * 2022-04-15 2022-08-12 中国电子科技集团公司第十研究所 Airborne voice noise separation method, device and medium based on U-Net
CN114898767B (en) * 2022-04-15 2023-08-15 中国电子科技集团公司第十研究所 U-Net-based airborne voice noise separation method, equipment and medium

Also Published As

Publication number Publication date
CN110970046B (en) 2022-03-11
CN110970046A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
WO2021103672A1 (en) Audio data processing method and apparatus, and electronic device and storage medium
CN108510987B (en) Voice processing method and device
US11284190B2 (en) Method and device for processing audio signal with frequency-domain estimation, and non-transitory computer-readable storage medium
CN110493690B (en) Sound collection method and device
US9489963B2 (en) Correlation-based two microphone algorithm for noise reduction in reverberation
WO2015184893A1 (en) Mobile terminal call voice noise reduction method and device
KR102497549B1 (en) Audio signal processing method and device, and storage medium
EP3657497B1 (en) Method and device for selecting target beam data from a plurality of beams
US10186278B2 (en) Microphone array noise suppression using noise field isotropy estimation
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
CN111128221A (en) Audio signal processing method and device, terminal and storage medium
CN110634488B (en) Information processing method, device and system and storage medium
CN114363770A (en) Filtering method and device in pass-through mode, earphone and readable storage medium
WO2022062531A1 (en) Multi-channel audio signal acquisition method and apparatus, and system
CN113506582A (en) Sound signal identification method, device and system
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111510846B (en) Sound field adjusting method and device and storage medium
CN105244037B (en) Audio signal processing method and device
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN110459236A (en) Noise estimation method, device and the storage medium of audio signal
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
WO2022198820A1 (en) Speech processing method and apparatus, and apparatus for speech processing
CN112785997B (en) Noise estimation method and device, electronic equipment and readable storage medium
CN113362841B (en) Audio signal processing method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20894066

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20894066

Country of ref document: EP

Kind code of ref document: A1