CN114283832A - Processing method and device for multi-channel audio signal

Info

Publication number: CN114283832A
Application number: CN202111058595.2A
Authority: CN (China)
Prior art keywords: audio signal, time, estimated, domain features, signal
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 罗艺, 王珺, 林永业
Current/Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority to CN202111058595.2A


Abstract

Embodiments of the present disclosure provide a processing method, apparatus, device and computer-readable storage medium for a multi-channel audio signal. The method performs beamforming of a signal source's audio signal based on the time-domain features of the multi-channel audio signal; the beamforming process involves no complex-domain operations and can be applied directly within any existing neural network beamforming framework. In addition, using the disclosed beamforming method in place of the frequency-domain beamforming method in a cascaded neural network beamforming framework significantly improves the audio processing performance of the system.

Description

Processing method and device for multi-channel audio signal
Technical Field
The present disclosure relates to the field of artificial intelligence and signal processing, and more particularly, to a method, an apparatus, a device, and a storage medium for processing a multi-channel audio signal.
Background
As spatial filters that extract a target signal from the mixed signal received by a microphone array, beamforming techniques have found wide application in voice communication systems, teleconferencing, and speech recognition. Recent research into neural network beamforming techniques has greatly advanced the development of speech signal processing techniques such as multi-channel speech enhancement and separation. Neural network beamforming techniques typically first apply a neural network to extract a target signal from a mixed signal, and then apply conventional beamforming techniques to perform spatial filtering that enhances the target signal.
Existing beamforming techniques and neural network beamforming techniques most often operate on the frequency-domain features of the signal; for example, neural network beamforming typically performs spatial filtering with methods such as multi-channel wiener filtering (MCWF) and minimum variance distortionless response (MVDR) beamforming. However, such frequency-domain processing methods have two main problems: (1) their theoretical upper-bound performance is limited by the precision of frequency-domain feature extraction and the quality of target signal extraction; (2) the frequency-domain features of a signal are generally complex-valued, and how to reasonably perform complex-domain nonlinear operations inside a neural network remains an unsolved problem.
Therefore, there is a need for an efficient beamforming method that raises the theoretical upper-bound performance of beamforming while avoiding complex-domain nonlinear operations in neural networks.
Disclosure of Invention
To solve the above problems, the present disclosure provides a multi-channel audio signal processing method that operates on the time-domain features of the signal. By performing beamforming on the multi-channel audio signal in the time domain, the theoretical upper-bound performance of audio signal processing is improved, and no complex-domain operations are involved.
According to an aspect of the present disclosure, a processing method for a multi-channel audio signal is proposed, comprising: acquiring the multichannel audio signal, wherein the multichannel audio signal is acquired in a specific environment through a plurality of microphones, and the specific environment comprises one or more signal sources; determining time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal; determining time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; and generating one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
In some embodiments, generating one or more output audio signals corresponding respectively to the one or more signal sources based on the time-domain features of the multichannel audio signal and the time-domain features of the one or more estimated audio signals comprises: for each of the one or more estimated audio signals, determining a feature transform matrix based on time-domain features of the multichannel audio signal and time-domain features of the estimated audio signal; for each of the one or more estimated audio signals, determining a time-domain feature of an output audio signal corresponding to the estimated audio signal based on a feature transform matrix corresponding to the estimated audio signal and a time-domain feature of the multichannel audio signal; and generating the one or more output audio signals based on the time-domain features of each of the one or more output audio signals.
In some embodiments, determining a feature transform matrix based on the time-domain features of the multi-channel audio signal and the time-domain features of the estimated audio signal comprises: determining a system function of a wiener filter by taking the time-domain features of the multi-channel audio signal as the input of the wiener filter and the time-domain features of the estimated audio signal as the output of the wiener filter; and taking the system function of the wiener filter as the feature transform matrix.
In some embodiments, determining the time-domain features of the multi-channel audio signal based on the acquired multi-channel audio signal further comprises: grouping the time-domain features of the multi-channel audio signal to generate a first number of time-domain feature groups of the multi-channel audio signal; and determining the time-domain features of the one or more estimated audio signals based on the acquired multi-channel audio signal further comprises: grouping the time-domain features of each of the one or more estimated audio signals to generate a first number of time-domain feature groups for each of the one or more estimated audio signals.
In some embodiments, for each of the one or more estimated audio signals: determining a feature transform matrix based on the time-domain features of the multi-channel audio signal and the time-domain features of the estimated audio signal further comprises: for each of the first number of time-domain feature groups of the estimated audio signal and the respective one of the first number of time-domain feature groups of the multi-channel audio signal, determining a respective feature transform matrix, to obtain a first number of feature transform matrices corresponding to the estimated audio signal; and determining the time-domain features of the output audio signal corresponding to the estimated audio signal based on the feature transform matrices corresponding to the estimated audio signal and the time-domain features of the multi-channel audio signal further comprises: determining a first number of time-domain feature groups of the output audio signal corresponding to the estimated audio signal based on the first number of time-domain feature groups of the multi-channel audio signal and the first number of feature transform matrices corresponding to the estimated audio signal, and concatenating the first number of time-domain feature groups of the output audio signal to obtain the time-domain features of the output audio signal.
In some embodiments, determining a time-domain feature of the multi-channel audio signal based on the acquired multi-channel audio signal comprises: performing time domain linear transformation on the acquired multi-channel audio signal to determine the time domain characteristics of the multi-channel audio signal; wherein determining the time-domain features of the one or more estimated audio signals based on the acquired multi-channel audio signal comprises: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and separating the time domain characteristic of the estimated audio signal corresponding to the signal source from the time domain characteristic of the audio signal of the selected channel through a pre-trained audio separation network.
According to another aspect of the present disclosure, a processing method for a multi-channel audio signal is proposed, comprising: acquiring the multichannel audio signal, wherein the multichannel audio signal is acquired in a specific environment through a plurality of microphones, and the specific environment comprises one or more signal sources; determining time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal; determining time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; updating the time domain features of the one or more estimated audio signals based on the time domain features of the multi-channel audio signal and the time domain features of the one or more estimated audio signals; and generating one or more output audio signals respectively corresponding to the one or more signal sources based on the updated time-domain features of the one or more estimated audio signals.
In some embodiments, determining the time-domain features of the one or more estimated audio signals based on the acquired multi-channel audio signal comprises: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and separating the time domain characteristics of the estimated audio signal corresponding to the signal source from the time domain characteristics of the audio signal of the selected channel through a pre-trained first audio separation network.
In some embodiments, updating the time domain features of the one or more estimated audio signals based on the time domain features of the multichannel audio signal and the time domain features of the one or more estimated audio signals comprises: determining one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals based on time-domain features of the multi-channel audio signal and time-domain features of the one or more estimated audio signals; generating time-domain features of one or more temporary audio signals respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multi-channel audio signal and the one or more feature transformation matrices; and updating the time domain features of the one or more estimated audio signals based on the time domain features of the one or more temporary audio signals and the time domain features of the multichannel audio signal and/or the time domain features of the one or more estimated audio signals.
In some embodiments, determining one or more feature transform matrices respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multichannel audio signal and the time-domain features of the one or more estimated audio signals comprises: for each of the one or more estimated audio signals, determining a system function of a wiener filter with the time-domain features of the multichannel audio signal as the input of the wiener filter and the time-domain features of the estimated audio signal as the output of the wiener filter; and using the system function of the wiener filter as the feature transform matrix corresponding to the estimated audio signal.
In some embodiments, updating the time-domain features of the one or more estimated audio signals based on the time-domain features of the one or more temporary audio signals and the time-domain features of the multichannel audio signal and/or the time-domain features of the one or more estimated audio signals comprises: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and taking the time-domain features of the temporary audio signal corresponding to the signal source together with the time-domain features of the audio signal of the selected channel and/or the time-domain features of the estimated audio signal corresponding to the signal source as inputs to a pre-trained second audio separation network, and outputting through the second audio separation network the time-domain features of the estimated audio signal corresponding to the signal source as the updated time-domain features of the estimated audio signal corresponding to that signal source.
According to yet another aspect of the present disclosure, a processing apparatus for a multi-channel audio signal is presented, comprising: an audio signal acquisition module configured to acquire the multi-channel audio signal, the multi-channel audio signal being acquired by a plurality of microphones in a specific environment including one or more signal sources; a time domain feature determination module configured to determine time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal and to determine time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; and a target signal generation module configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
According to yet another aspect of the present disclosure, a processing apparatus for a multi-channel audio signal is presented, comprising: an audio signal acquisition module configured to acquire the multi-channel audio signal, the multi-channel audio signal being acquired by a plurality of microphones in a specific environment including one or more signal sources; a time domain feature determination module configured to determine time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal and to determine time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; a time domain feature updating module configured to update time domain features of the one or more estimated audio signals based on time domain features of the multi-channel audio signal and time domain features of the one or more estimated audio signals; and a target signal generation module configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the updated time-domain features of the one or more estimated audio signals.
An embodiment of the present disclosure provides a processing device for a multi-channel audio signal, including: one or more processors; and one or more memories, wherein the one or more memories have stored therein a computer-executable program that, when executed by the one or more processors, performs the method described above.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method as described above when executed by a processor.
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the processing method for multi-channel audio signals according to the embodiments of the present disclosure.
Compared with conventional beamforming methods based on frequency-domain processing, the method provided by the embodiments of the present disclosure introduces more degrees of freedom and has a higher theoretical upper-bound performance, involves only real-domain operations, and reduces computational complexity.
The method provided by the embodiments of the present disclosure performs beamforming of a signal source's audio signal based on the time-domain features of the multi-channel audio signal; the beamforming process involves no complex-domain operations and can be applied directly within any existing neural network beamforming framework. In addition, using the disclosed beamforming method in place of the frequency-domain beamforming method in a cascaded neural network beamforming framework significantly improves the audio processing performance of the system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description depict only exemplary embodiments of the disclosure, and that other drawings may be derived from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram illustrating a scene for acquiring a multi-channel audio signal by a microphone array according to an embodiment of the present disclosure;
fig. 2A is a flow chart illustrating a processing method for a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 2B is a schematic diagram illustrating feature processing in a processing method for a multi-channel audio signal according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating processing of a multi-channel audio signal according to an embodiment of the disclosure;
fig. 4A is a flowchart illustrating a processing method for a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 4B is a schematic diagram illustrating feature processing in a processing method for a multi-channel audio signal according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating neural network based processing of a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 6 is a simulation scene schematic diagram illustrating a processing method for a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram illustrating a processing apparatus for a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram illustrating a processing apparatus for a multi-channel audio signal according to an embodiment of the present disclosure;
fig. 9 shows a schematic diagram of a processing device for a multi-channel audio signal according to an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure; and
FIG. 11 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure, and that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments only and is not intended to limit the present disclosure.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
The processing method for a multi-channel audio signal of the present disclosure may be Artificial Intelligence (AI)-based. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. For example, an artificial-intelligence-based processing method for multi-channel audio signals can separate a source signal from the mixed audio signal received by a microphone array, much as the human auditory system distinguishes a desired audio signal in a noisy environment. By drawing on the design principles and implementation methods of various intelligent machines, the disclosed processing method for multi-channel audio signals can automatically and accurately separate the audio signals related to a source signal from the multi-channel audio signal, and achieve more accurate beamforming through multiple filtering and separation iterations.
For example, the processing method for multi-channel audio signals of the present disclosure may be based on a beamforming method. In multi-channel audio signal processing with a microphone array, the task of beamforming is to extract a target signal from the multi-channel audio signal, i.e., to combine the microphone array signals so as to suppress interfering signals from non-target directions and enhance sound signals from the target direction. Beamforming may be performed in the frequency domain or the time domain, and methods are accordingly divided into frequency-domain and time-domain beamforming methods. Because generic beamforming methods struggle to meet the requirements of real-time operation and multi-source tracking, beamforming has in recent years expanded rapidly into emerging directions, including but not limited to neural network methods, genetic algorithms, and higher-order statistics methods.
Among these, research into neural-network-based beamforming methods (i.e., neural network beamforming methods) has greatly advanced multi-channel speech enhancement and separation systems. Neural network beamforming methods typically first apply a neural network to extract a target signal from a multi-channel mixed signal, and then apply conventional beamforming techniques to perform spatial filtering that enhances the target signal. Since both microphone-array and target-source characteristics can be estimated more easily in the frequency domain, most neural network beamforming methods operate on the frequency-domain features of the signal, for example using spatial filtering methods such as multi-channel wiener filtering (MCWF) and minimum variance distortionless response (MVDR) beamforming.
For example, the processing method for a multi-channel audio signal of the present disclosure may also be based on a wiener filtering method. The wiener filter is the optimal estimator of a stationary signal under the minimum-mean-square-error criterion; it strongly suppresses additive noise in the environment and drives the output of the filtering system as close as possible to the true target signal. Wiener filtering assumes a linear filtering process: the process is treated as a linear time-invariant system whose output, for a given input signal, is obtained under the optimality criterion of minimizing the mean squared error between the system output and the desired signal. The optimal filtering system (system function) can be computed in either the time domain or the frequency domain. Accordingly, the beamforming method and the neural network beamforming method of the present disclosure employ a time-domain wiener filtering method to perform spatial filtering.
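For intuition, the following minimal NumPy sketch illustrates the minimum-mean-square-error principle behind time-domain wiener filtering. It is not the patent's algorithm; the signal model, noise level, tap count, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4000                                  # number of samples (illustrative)
s = rng.standard_normal(T)                # desired (target) signal
y = s + 0.5 * rng.standard_normal(T)      # noisy observation

# Stack delayed copies of y so the filter has a few taps of temporal context.
taps = 8
Y = np.stack([np.roll(y, k) for k in range(taps)])    # (taps, T)

# MMSE / least-squares Wiener solution of ||w Y - s||^2:
#   w = s Y^T (Y Y^T)^{-1}
w = (s @ Y.T) @ np.linalg.inv(Y @ Y.T)                # (taps,)
s_hat = w @ Y                                         # filtered estimate

print("MSE before:", np.mean((y - s) ** 2))
print("MSE after Wiener filtering:", np.mean((s_hat - s) ** 2))
```

The printed error after filtering is smaller than before: the filter shrinks the noisy observation toward the target under the MMSE criterion, which is the same criterion the disclosed time-domain method applies to multi-channel features.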
In summary, the embodiments of the present disclosure provide solutions related to artificial intelligence, neural network beamforming, and the like, which will be further described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram illustrating a scenario in which a multi-channel audio signal is acquired by a microphone array according to an embodiment of the present disclosure.
As shown in fig. 1, when a plurality of speakers (target sources for the microphone array; Q are shown in fig. 1) make sounds within the detection range of the microphone array (shown in fig. 1 as 6 microphones uniformly distributed in a circular array), the sounds can be collected by the microphone array, so that a multi-channel audio signal is acquired through the array's multiple channels.
The acquired multi-channel audio signal is then transmitted to an audio signal processing terminal for application to various microphone array processing tasks according to specific requirements, including but not limited to speech enhancement, speech separation, automatic speech recognition, keyword spotting, and speaker diarization. The audio signal processing terminal may be a processing device for multi-channel audio signals according to an embodiment of the present disclosure as described below, or may be a processing device serving other purposes.
As previously described, since microphone-array and target-source characteristics can be estimated more easily in the frequency domain, most beamforming methods and neural network beamforming methods operate on the frequency-domain features of the signals, typically performing spatial filtering with methods such as multi-channel wiener filtering (MCWF) and minimum variance distortionless response (MVDR) beamforming. However, similar to the potential drawbacks of conventional time-frequency masking discussed in prior studies of time-domain single-channel speech separation, conventional frequency-domain neural beamforming methods have two core limitations: theoretical upper-bound performance and complex-domain operations. On the one hand, even when an oracle target source (i.e., the ideally noise-free target signal) is used to compute target-source-specific statistics such as the spatial covariance matrix, the performance of a neural network beamforming method is bounded by the beamformer's own upper-bound performance; when the upper-bound performance of the selected beamformer is poor, the neural network beamforming method may fail. On the other hand, as more and more research applies neural networks to complex-domain processing, how to correctly handle the real and imaginary parts of signal features under nonlinear transformations, so as to effectively incorporate nonlinear complex-domain operations into neural network beamforming, remains an unsolved problem.
Therefore, the present disclosure addresses the above-mentioned problem by providing a multi-channel audio signal processing method for processing based on time-domain characteristics of a signal, which performs beamforming on audio signals from multiple channels in the time domain to improve the audio signal processing performance.
Compared with conventional beamforming methods based on frequency-domain processing, the method provided by the embodiments of the present disclosure introduces more degrees of freedom and has a higher theoretical upper-bound performance, involves only real-domain operations, and reduces computational complexity.
Specifically, the method provided by the embodiments of the present disclosure performs wiener filtering based on the time-domain features of a multi-channel audio signal, carrying out minimum-mean-square-error estimation against the audio signal separated from the multi-channel audio signal to obtain the optimal wiener filter coefficients, thereby implementing beamforming of the signal source's audio signal. The method of the embodiments of the present disclosure does not require a frequency-domain signal transform (e.g., the short-time Fourier transform (STFT)), and therefore involves no complex-domain operations; it can be applied directly in any existing neural network beamforming framework, and because it introduces more degrees of freedom, the theoretical upper-bound performance of beamforming is significantly improved. In addition, using the disclosed beamforming method in place of the frequency-domain beamforming method in a cascaded neural network beamforming framework significantly improves the audio processing performance of the system.
Fig. 2A is a flow chart illustrating a processing method 200 for a multi-channel audio signal according to an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating feature processing in a processing method 200 for a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in fig. 2A, in step S201, the multi-channel audio signal may be acquired, the multi-channel audio signal being captured through a plurality of microphones in a specific environment including one or more signal sources.
Alternatively, the multi-channel audio signal may be acquired by a microphone array as shown in fig. 1. Such an array comprises a plurality of microphones regularly arranged in a certain shape to acquire single-channel audio signals from different directions in space, and may be classified by topology into linear arrays, planar arrays, stereo arrays, and so on; the microphone array shown in fig. 1, for example, is a planar array.
It should be understood that the planar array is used as an example in the present disclosure to facilitate the description of the method of the present disclosure, but the method is also applicable to other types of microphone arrays, and the microphone array shown in fig. 1 is merely used as an example and not a limitation.
For example, the one or more signal sources may be multiple human voices as shown in fig. 1, when the method of the present disclosure is used for speech audio signal processing such as speech separation, speech enhancement or speech recognition for these speakers. In addition, the one or more signal sources may also include music, such as instrumental performance sounds.
In step S202, a time-domain feature of the multi-channel audio signal may be determined based on the acquired multi-channel audio signal.
According to an embodiment of the present disclosure, step S202 may include performing a time-domain linear transformation on the acquired multi-channel audio signal, and determining a time-domain feature of the multi-channel audio signal.
Alternatively, the windowed multi-channel audio signal (observation signal) may be subjected to a time-domain linear transformation to obtain its time-domain features.
For example, assume the microphone array comprises M microphones, so that the multi-channel audio signal comprises M single-channel audio signals, each corresponding to one microphone. Let $\mathbf{y}_{m,t} \in \mathbb{R}^{1 \times P}$ denote the windowed audio signal with P sample points on the m-th channel of the t-th frame, let $\mathbf{B} \in \mathbb{R}^{P \times N}$ denote a linear transformation matrix (a real-valued waveform encoder) applied to the windowed audio signal, and let $\mathbf{Y} \in \mathbb{R}^{M \times N \times T}$ denote the time-domain features of the multi-channel audio signal, where N represents the feature dimension of each channel's audio signal and T represents the number of sampling frames. The time-domain feature $\mathbf{Y}_{m,t} \in \mathbb{R}^{1 \times N}$ of the corresponding single-channel audio signal can then be expressed as the real-valued linear transformation:

$$\mathbf{Y}_{m,t} = \mathbf{y}_{m,t}\mathbf{B} \tag{1}$$
alternatively, B can be viewed as a linear combination of the features of the observed signal to form other richer or more representative signal features. In embodiments of the present disclosure, the linear transformation matrix may be an identity matrix or other predefined or adaptively optimized matrix, which is not limited by the present disclosure.
In step S203, time domain features of one or more estimated audio signals, which correspond to the one or more signal sources, respectively, may be determined based on the acquired multi-channel audio signal.
According to an embodiment of the present disclosure, step S203 may include: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and separating the time domain characteristic of the estimated audio signal corresponding to the signal source from the time domain characteristic of the audio signal of the selected channel through a pre-trained audio separation network.
Alternatively, as shown in fig. 2B, the estimated time-domain features of the signal sources may be separated from the mixed signal using a pre-trained neural network (i.e., an audio separation network). The audio separation network may be a single-channel or multi-channel separation network.
Optionally, in an embodiment of the present disclosure, for cost saving, a single-channel separation network is used to separate, for each signal source, a time-domain feature of an estimated audio signal corresponding to the signal source from the multi-channel audio signal. Specifically, for each signal source, one channel may be selected from a plurality of channels of the microphone array as a reference channel, and the time domain feature of the audio signal of the reference channel may be passed through a single-channel separation network to output the time domain feature of the estimated audio signal corresponding to the signal source.
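A minimal sketch of this reference-channel scheme follows; `separator` is a placeholder for any pre-trained single-channel separation network (its name and interface are assumptions), and the dummy separator exists only so the example runs.

```python
import numpy as np

def estimate_source_features(Y, separator, num_sources, ref_channel=0):
    """Y: (M, N, T) mixture features -> list of C arrays of shape (N, T)."""
    Y_ref = Y[ref_channel]                   # reference-channel features (N, T)
    return separator(Y_ref, num_sources)     # [S_hat_1, ..., S_hat_C]

# Dummy stand-in for a trained network, for illustration only.
def dummy_separator(feats, num_sources):
    return [feats / num_sources for _ in range(num_sources)]

Y = np.random.randn(6, 256, 124)
S_hat = estimate_source_features(Y, dummy_separator, num_sources=2)
print(len(S_hat), S_hat[0].shape)            # 2 (256, 124)
```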
In step S204, one or more output audio signals respectively corresponding to the one or more signal sources may be generated based on the time domain features of the multi-channel audio signal and the time domain features of the one or more estimated audio signals.
According to an embodiment of the present disclosure, step S204 may include: for each of the one or more estimated audio signals, determining a feature transform matrix based on time-domain features of the multichannel audio signal and time-domain features of the estimated audio signal; for each of the one or more estimated audio signals, determining a time-domain feature of an output audio signal corresponding to the estimated audio signal based on a feature transform matrix corresponding to the estimated audio signal and a time-domain feature of the multichannel audio signal; and generating the one or more output audio signals based on time domain features of each of the one or more output audio signals.
Alternatively, as shown in fig. 2B, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals respectively obtained from it, one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals may be determined; by then applying each feature transformation matrix to the time-domain features of the multi-channel audio signal, the time-domain features of the one or more output audio signals respectively corresponding to the one or more estimated audio signals may be obtained.
According to an embodiment of the present disclosure, determining a feature transformation matrix based on the time-domain features of the multi-channel audio signal and the time-domain features of the estimated audio signal may comprise: determining the system function of a wiener filter by taking the time-domain features of the multi-channel audio signal as the input of the wiener filter and the time-domain features of the estimated audio signal as the output of the wiener filter; and taking the system function of the wiener filter as the feature transformation matrix.
Optionally, in an embodiment of the present disclosure, a time-domain generalized Wiener filter (TD-GWF) is used to filter the multi-channel audio signal. For each of the one or more signal sources present in the environment, after the particular system function of the TD-GWF has been determined from the time-domain features of the corresponding estimated audio signal, the multi-channel audio signal may be filtered for that signal source by taking it as the input to the TD-GWF, obtaining the time-domain features of the output audio signal corresponding to that signal source.
It should be understood that in the method of the present disclosure, each of the estimated audio signals, the output audio signals, and the temporary audio signals mentioned later corresponds to one signal source; these signals therefore also correspond to one another through their common signal source.
Alternatively, by applying the inverse of the time-domain linear transformation described previously to the obtained time-domain features of the one or more output audio signals, the one or more output audio signals respectively corresponding to the one or more signal sources may be obtained as the result of the beamforming process on the multi-channel audio signal.
In the processing method 200 for a multi-channel audio signal described above, all time-domain features of the multi-channel audio signal are taken as inputs to the TD-GWF to output the complete time-domain features of the output audio signal corresponding to a single signal source. However, solving for the system function of the TD-GWF involves costly operations (e.g., matrix inversion), and the complete time-domain features of an actually acquired multi-channel audio signal constitute a huge amount of data. The method of the present disclosure can therefore reduce computational complexity and relieve pressure on the computing system by grouping the time-domain features of these audio signals, converting a single high-dimensional solve into multiple low-dimensional solves.
In particular, fig. 3 is a schematic diagram illustrating the processing of a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in FIG. 3, 301 represents the complete time-domain features $\mathbf{Y} \in \mathbb{R}^{M \times N \times T}$ of the multi-channel audio signal, as described with respect to step S202. It comprises data in three dimensions M, N and T, where M represents the number of channels of the microphone array, N represents the feature dimension of each channel's audio signal, and T represents the number of sampling frames.
Therefore, as described above, according to the embodiment of the present disclosure, step S202 may further include: the time domain features of the multi-channel audio signal are grouped to generate a first number of time domain feature groups of the multi-channel audio signal.
As shown at 302, the complete time-domain features $\mathbf{Y}$ of the multi-channel audio signal are grouped along dimension N to form a first number (shown as V in 302) of time-domain feature groups $\{\mathbf{Y}^v\}_{v=1}^{V}$, where the v-th time-domain feature group can be expressed as $\mathbf{Y}^v \in \mathbb{R}^{M \times N/V \times T}$.
Similarly, step S203 may further include: the time domain features of each of the one or more estimated audio signals are grouped, generating a first number of sets of time domain features for each of the one or more estimated audio signals.
Optionally, after the time-domain features $\hat{\mathbf{S}} \in \mathbb{R}^{N \times T}$ of the estimated audio signal corresponding to each signal source are separated from the multi-channel audio signal by the pre-trained audio separation network, $\hat{\mathbf{S}}$ can similarly be grouped along dimension N to obtain a first number (shown as V in 303) of time-domain feature groups $\{\hat{\mathbf{S}}^v\}_{v=1}^{V}$, where the v-th time-domain feature group can be expressed as $\hat{\mathbf{S}}^v \in \mathbb{R}^{N/V \times T}$.
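A hedged sketch of the grouping shown at 302 and 303: the feature dimension N is split into V groups. V is a free hyperparameter here, and N is assumed divisible by V.

```python
import numpy as np

def group_features(A, V):
    """Split (..., N, T) features into V groups of N/V features each."""
    return np.split(A, V, axis=-2)           # split along feature dimension N

Y = np.random.randn(6, 256, 124)             # mixture features (M, N, T)
S_hat = np.random.randn(256, 124)            # one source's estimate (N, T)
Y_groups = group_features(Y, V=8)            # each group: (6, 32, 124)
S_groups = group_features(S_hat, V=8)        # each group: (32, 124)
print(Y_groups[0].shape, S_groups[0].shape)
```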
According to an embodiment of the present disclosure, for each of the one or more estimated audio signals, determining a feature transformation matrix based on the time-domain features of the multichannel audio signal and the time-domain features of the estimated audio signal may further comprise: for each of the first number of sets of time-domain features of the estimated audio signal and a corresponding set of time-domain features of the first number of sets of time-domain features of the multi-channel audio signal, determining a corresponding feature transformation matrix to obtain a first number of feature transformation matrices corresponding to the estimated audio signal.
Alternatively, a corresponding feature transformation matrix may be determined from each pair $\mathbf{Y}^v$ and $\hat{\mathbf{S}}^v$ described above; that is, $\mathbf{Y}^v$ (with its channel and feature dimensions stacked, i.e., reshaped to $\mathbb{R}^{MN/V \times T}$) is taken as the input of the TD-GWF and $\hat{\mathbf{S}}^v$ as its output, in order to determine the system function (i.e., the feature transformation matrix) $\mathbf{w}^v \in \mathbb{R}^{N/V \times MN/V}$ of the TD-GWF for that signal source, as shown at 304 in fig. 3. According to the wiener filtering algorithm, the system function of the TD-GWF can be solved under the minimum-mean-square-error estimation criterion:

$$\mathbf{w}^v = \operatorname*{arg\,min}_{\mathbf{w}^v} \left\lVert \mathbf{w}^v \mathbf{Y}^v - \hat{\mathbf{S}}^v \right\rVert_2^2 \tag{2}$$

so that $\mathbf{w}^v$ can be obtained as:

$$\mathbf{w}^v = \hat{\mathbf{S}}^v \left(\mathbf{Y}^v\right)^{\mathsf{T}} \left( \mathbf{Y}^v \left(\mathbf{Y}^v\right)^{\mathsf{T}} \right)^{-1} \tag{3}$$

where $(\cdot)^{\mathsf{T}}$ denotes the transpose operator and $(\cdot)^{-1}$ denotes the matrix inversion operator.
As can be seen from formula (3), computing $\mathbf{w}^v$ requires inverting the matrix $\mathbf{Y}^v \left(\mathbf{Y}^v\right)^{\mathsf{T}}$, and the amount of computation for the matrix inversion grows with the matrix size. Since $\mathbf{Y}^v \left(\mathbf{Y}^v\right)^{\mathsf{T}}$ has size $\frac{MN}{V} \times \frac{MN}{V}$, the computation decreases as the number of feature groups V increases. Meanwhile, after the time-domain features are grouped according to the method of the present disclosure, the solution of the original system function is converted into the parallel solution of the V matrices $\mathbf{w}^v$, which can improve the operation speed and performance of the system.
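To make the per-group solve concrete, here is a hedged NumPy sketch of Eqs. (2)-(3) under the shape conventions reconstructed above; the small ridge term eps is an illustrative numerical safeguard, not part of the disclosed formulas.

```python
import numpy as np

def tdgwf_weights(Y_v, S_v, eps=1e-6):
    """Y_v: (M, N/V, T) mixture group, S_v: (N/V, T) estimated-source group.
    Returns w_v of shape (N/V, M*N/V):  w_v = S_v Y^T (Y Y^T + eps*I)^{-1}."""
    M, G, T = Y_v.shape
    Y = Y_v.reshape(M * G, T)                # stack channels: (M*G, T)
    R = Y @ Y.T + eps * np.eye(M * G)        # autocorrelation matrix from Eq. (3)
    return (S_v @ Y.T) @ np.linalg.inv(R)    # closed-form solution, Eq. (3)

w_v = tdgwf_weights(np.random.randn(6, 32, 124), np.random.randn(32, 124))
print(w_v.shape)                             # (32, 192)
```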
According to an embodiment of the present disclosure, for each of the one or more estimated audio signals, determining a time-domain feature of an output audio signal corresponding to the estimated audio signal based on a feature transformation matrix corresponding to the estimated audio signal and a time-domain feature of the multi-channel audio signal may further include: determining a first number of time domain feature sets of an output audio signal corresponding to the estimated audio signal based on the first number of time domain feature sets of the multi-channel audio signal and the first number of feature transformation matrices corresponding to the estimated audio signal, and splicing the first number of time domain feature sets of the output audio signal to obtain time domain features of the output audio signal.
Optionally, for the v-th time-domain feature group $\mathbf{Y}^v$ of the multi-channel audio signal and the v-th time-domain feature group $\hat{\mathbf{S}}^v$ of the estimated audio signal corresponding to the current signal source, the v-th time-domain feature group $\hat{\mathbf{Z}}^v$ of the output audio signal corresponding to the current signal source may be obtained based on $\mathbf{w}^v$ and $\mathbf{Y}^v$. Specifically, the feature transformation matrix $\mathbf{w}^v$ is applied to the feature group $\mathbf{Y}^v$, as shown at 305:

$$\hat{\mathbf{Z}}^v = \mathbf{w}^v \mathbf{Y}^v \tag{4}$$

Thus, based on the V time-domain feature groups $\{\mathbf{Y}^v\}_{v=1}^{V}$ of the multi-channel audio signal and the V time-domain feature groups $\{\hat{\mathbf{S}}^v\}_{v=1}^{V}$ of the estimated audio signal corresponding to the current signal source, the V time-domain feature groups $\{\hat{\mathbf{Z}}^v\}_{v=1}^{V}$ of the output audio signal corresponding to the current signal source can be obtained. Therefore, optionally, the V time-domain feature groups of the output audio signal may be concatenated, as shown at 306, to obtain the time-domain features of the output audio signal:

$$\hat{\mathbf{Z}} = \left[ \hat{\mathbf{Z}}^1; \ldots; \hat{\mathbf{Z}}^V \right] \in \mathbb{R}^{N \times T} \tag{5}$$
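Putting 302-306 together, the following sketch groups the features, solves one TD-GWF per group (Eqs. (2)-(3)), filters each group (Eq. (4)), and concatenates the results (Eq. (5)). It reuses group_features() and tdgwf_weights() from the sketches above; the shapes and the choice V=8 are illustrative.

```python
import numpy as np

def tdgwf_beamform(Y, S_hat, V):
    """Y: (M, N, T) mixture features; S_hat: (N, T) one source's estimate.
    Returns Z_hat: (N, T), the beamformed features for that source."""
    Z_groups = []
    for Y_v, S_v in zip(group_features(Y, V), group_features(S_hat, V)):
        M, G, T = Y_v.shape
        w_v = tdgwf_weights(Y_v, S_v)                  # (G, M*G), Eqs. (2)-(3)
        Z_groups.append(w_v @ Y_v.reshape(M * G, T))   # Eq. (4): (G, T)
    return np.concatenate(Z_groups, axis=0)            # Eq. (5): (N, T)

Z_hat = tdgwf_beamform(np.random.randn(6, 256, 124),
                       np.random.randn(256, 124), V=8)
print(Z_hat.shape)                                     # (256, 124)
```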
the processing method 200 performs wiener filtering based on the time domain characteristics of the multi-channel audio signal, the processing process does not involve any complex domain operation, and the theoretical upper limit performance of beam forming is remarkably improved due to the introduction of more degrees of freedom in the process.
The processing method 200 describes only a single pass of wiener filtering over the time-domain features of the multi-channel audio signal. Given the method's applicability to any existing neural network beamforming framework and its ability to raise the beamforming performance upper bound, the audio processing performance of the system can be further optimized by applying the method within a cascaded neural network beamforming framework.
Fig. 4A is a flow chart illustrating a processing method 400 for a multi-channel audio signal according to an embodiment of the disclosure. Fig. 4B is a schematic diagram illustrating feature processing in a processing method 400 for a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in fig. 4A, in step S401, the multi-channel audio signal may be acquired, which is captured through a plurality of microphones in a specific environment including one or more signal sources.
Similarly to step S201, the multi-channel audio signal may be acquired by a microphone array comprising a plurality of microphones regularly arranged in a specific shape to acquire single-channel audio signals from different directions in space; such arrays may be classified by topology into linear arrays, planar arrays, stereo arrays, and so on. The microphone array shown in fig. 1, for example, is a planar array.
In step S402, a time-domain feature of the multi-channel audio signal is determined based on the acquired multi-channel audio signal.
Similar to step S202, the time-domain features of the windowed multi-channel audio signal (observation signal) may be obtained by a time-domain linear transformation and represented as $\mathbf{Y} \in \mathbb{R}^{M \times L}$, where M represents the number of channels of the microphone array and L represents the feature dimension of each channel's audio signal. L may cover the time-domain features of multiple sampling frames (i.e., N × T as described above), or only the time-domain feature dimension of a single sampling frame (which can be regarded as real-time filtering of the multi-channel audio signal); the present disclosure is not limited in this respect.
In step S403, time domain features of one or more estimated audio signals, which correspond to the one or more signal sources, respectively, may be determined based on the acquired multi-channel audio signal.
According to an embodiment of the present disclosure, step S403 may include: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and separating the time domain characteristics of the estimated audio signal corresponding to the signal source from the time domain characteristics of the audio signal of the selected channel through a pre-trained first audio separation network.
Similar to step S203, a pre-trained neural network (i.e., the first audio separation network) may be used to separate rough estimates of the signal sources' time-domain features from the mixed signal. The roughly estimated time-domain features of the one or more signal sources may be expressed as $\{\hat{\mathbf{S}}_c\}_{c=1}^{C}$, where C represents the number of signal sources, c indexes the c-th signal source, and $\hat{\mathbf{S}}_c$ represents the time-domain features of the estimated audio signal corresponding to the c-th signal source as separated by the first audio separation network. The first audio separation network may be a single-channel or multi-channel separation network. Optionally, for cost-saving reasons, a single-channel separation network is used to separate, for each signal source, the time-domain features of the estimated audio signal corresponding to that signal source from the multi-channel audio signal.
Alternatively, the first audio separation network may adopt an encoder-separator-decoder structure, such as DPRNN-TasNet (dual-path recurrent neural network, time-domain audio separation network), which in the present disclosure can perform single-channel audio separation on a long sequence of time-domain feature inputs and output the separated time-domain feature sequence of the estimated audio signal corresponding to a particular signal source. Beyond the single-channel DPRNN-TasNet, the audio separation network of the present disclosure may use any other form of separator; the above is merely an example and not a limitation.
In step S404, the time-domain features of the one or more estimated audio signals are updated based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
Alternatively, the update of the time-domain features of the one or more estimated audio signals may comprise a filtering part and a separation part. The filtering part may employ the TD-GWF described above, and the separation part may employ a network similar to the first audio separation network to further refine the filtering result. With each round of filtering and separation, updated time-domain features of the one or more estimated audio signals are obtained.
According to an embodiment of the present disclosure, step S404 may include: determining one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals based on time-domain features of the multi-channel audio signal and time-domain features of the one or more estimated audio signals; generating time-domain features of one or more temporary audio signals respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multi-channel audio signal and the one or more feature transformation matrices; and updating the time domain features of the one or more estimated audio signals based on the time domain features of the one or more temporary audio signals and the time domain features of the multichannel audio signal. Optionally, the time domain features of the one or more estimated audio signals may be updated based on the time domain features of the one or more temporary audio signals, the time domain features of the multichannel audio signal, and the time domain features of the one or more estimated audio signals.
According to an embodiment of the present disclosure, determining one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals may comprise: for each of the one or more estimated audio signals, determining the system function of a wiener filter with the time-domain features of the multichannel audio signal as the input of the wiener filter and the time-domain features of the estimated audio signal as the output of the wiener filter; and using the system function of the wiener filter as the feature transformation matrix corresponding to the estimated audio signal.
As shown in fig. 4B, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals respectively obtained from it, one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals may be determined and then applied to the time-domain features of the multi-channel audio signal; this yields the time-domain features of one or more temporary audio signals respectively corresponding to the one or more estimated audio signals, which are used in the subsequent refinement of the filtering result.
Next, to achieve a further subdivision of the filtering result, the obtained time-domain features of the one or more temporary audio signals and the time-domain features of the multi-channel audio signal may be input to a second audio separation network similar to the first audio separation network. Optionally, the obtained time domain features of the one or more temporary audio signals, the time domain features of the multi-channel audio signal, and the time domain features of the one or more estimated audio signals may be input to a second audio separation network similar to the first audio separation network.
Optionally, for the first update, one or more transform coefficients corresponding to the one or more estimated audio signals may be determined based on time-domain features of the multi-channel audio signal and time-domain features of the one or more estimated audio signals; applying the one or more transform coefficients to time-domain features of the multi-channel audio signal to generate time-domain features of the respective one or more temporary audio signals; and updating the time domain features of the one or more estimated audio signals based on the time domain features of the one or more temporary audio signals and the time domain features of the multichannel audio signal (and optionally the time domain features of the one or more estimated audio signals).
Optionally, for subsequent updates (including a second update), a system function of one or more wiener filters corresponding to the one or more estimated audio signals may be determined based on the time-domain features of the multichannel audio signal and the time-domain features of the one or more estimated audio signals of a previous update; applying the system function of the one or more wiener filters to the time-domain features of the multi-channel audio signal to generate time-domain features of one or more temporary audio signals; and updating the time domain features of the one or more estimated audio signals based on the time domain features of the one or more temporary audio signals and the time domain features of the multichannel audio signal (and optionally the time domain features of the one or more estimated audio signals).
According to an embodiment of the present disclosure, updating the time-domain features of the one or more estimated audio signals based on the time-domain features of the one or more temporary audio signals and the time-domain features of the multichannel audio signal (and optionally the time-domain features of the one or more estimated audio signals) may comprise: for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and outputting, through a pre-trained second audio separation network, the time-domain features of the estimated audio signal corresponding to the signal source, taking as input the time-domain features of the temporary audio signal corresponding to the signal source and the time-domain features of the audio signal of the selected channel (and optionally the time-domain features of the estimated audio signal corresponding to the signal source), wherein the output is used as the updated time-domain features of the estimated audio signal corresponding to the signal source.
Similarly, the second audio separation network may comprise a structure such as DPRNN-TasNet that may separate a time domain feature sequence of the estimated audio signal corresponding to the particular signal source as the updated time domain feature sequence of the estimated audio signal based on at least the time domain features of the one or more temporary audio signals and the time domain features of the multi-channel audio signal.
For example, the time-domain features of the one or more estimated audio signals separated by the first audio separation network may be updated via multiple iterations through the second audio separation network, until a satisfactory estimate of the time-domain features of the audio signal is obtained or a preset limit on the number of updates has been reached.
In step S405, one or more output audio signals respectively corresponding to the one or more signal sources are generated based on the updated time-domain features of the one or more estimated audio signals.
Optionally, through multiple iterations of filtering and separation in the cascaded neural network beamforming process, the output of the neural network, i.e., the most recently updated time-domain features of the one or more estimated audio signals, may be taken as the time-domain features of the one or more output audio signals corresponding to the one or more signal sources, respectively.
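For illustration, assuming the time-domain features were obtained by a windowed real-valued linear transform, the output waveforms can be recovered with a synthesis transform and overlap-add; the basis matrix and hop size below are assumptions of this sketch.

```python
import numpy as np

def overlap_add_decode(features, basis, hop):
    """Reconstruct a waveform from time-domain features by inverting the
    windowed linear transform with overlap-add.

    features: (K, T) final updated time-domain features of one source.
    basis:    (K, L) synthesis transform mapping each feature vector back
              to a length-L time-domain segment.
    hop:      hop size in samples between adjacent frames.
    """
    K, T = features.shape
    L = basis.shape[1]
    out = np.zeros(hop * (T - 1) + L)
    for t in range(T):
        # Map each frame of features to a waveform segment and accumulate
        # it at its frame position (overlap-add).
        out[t * hop : t * hop + L] += features[:, t] @ basis
    return out
```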
Fig. 5 is a schematic diagram illustrating neural network based processing of a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in fig. 5, optionally, for each of the one or more signal sources, the audio signal of one channel may be selected from the multi-channel audio signal and its time-domain features taken as the reference time-domain features. Inputting the reference time-domain features into the pre-trained first audio separation network (i.e., the pre-separation module) separates from them the time-domain features of one or more estimated audio signals corresponding to the one or more signal sources. Next, in each iteration, the time-domain features of one or more temporary audio signals are obtained by TD-GWF filtering based on the time-domain features of the multi-channel audio signal (denoted Y in fig. 5) and the latest time-domain features of the one or more estimated audio signals (the pre-separation outputs in the first iteration, the first-iteration updates in the second iteration, and so on). The time-domain features of the one or more estimated audio signals are then updated based on the time-domain features of the multi-channel audio signal together with the time-domain features of the one or more temporary audio signals and/or the latest time-domain features of the one or more estimated audio signals, where the post-separation module performing this update is the second audio separation network.
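As a non-authoritative sketch, the cascade of fig. 5 can be summarized in a few lines of Python, reusing the td_gwf_transform_matrix and apply_transform sketches above; pre_sep and post_sep are hypothetical callables standing in for the first and second audio separation networks.

```python
def cascaded_beamforming(Y, ref_feats, pre_sep, post_sep, n_iters=2):
    """Sketch of the fig. 5 cascade under assumed interfaces.

    Y:         (C*K, T) time-domain features of the multi-channel mixture.
    ref_feats: (K, T) time-domain features of the selected reference channel.
    pre_sep:   stands in for the first audio separation network; returns a
               list of per-source feature estimates.
    post_sep:  stands in for the second audio separation network.
    """
    # Pre-separation: initial per-source estimates from the reference channel.
    estimates = pre_sep(ref_feats)                   # list of (K, T) arrays
    for _ in range(n_iters):
        updated = []
        for s_hat in estimates:
            # Filtering part: TD-GWF yields a temporary beamformed signal.
            W = td_gwf_transform_matrix(Y, s_hat)    # see sketch above
            z = apply_transform(Y, W)                # temporary features
            # Separation part: refine the temporary signal given the
            # reference channel and, optionally, the previous estimate.
            updated.append(post_sep(z, ref_feats, s_hat))
        estimates = updated
    return estimates                                 # final updated features
```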
Fig. 6 is a simulation scene schematic diagram illustrating a processing method for a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in fig. 6, the length and width of the simulated room are randomly sampled between 3 and 10 meters, and the height is randomly sampled between 2.5 and 4 meters. The reverberation time is randomly sampled between 0.1 and 0.5 seconds. A circular microphone array of 10 centimeters (cm) diameter, consisting of 6 equally spaced microphones, is arranged in the simulated room. Furthermore, there are two loudspeakers in the simulated room, located at an average distance of 2.9 ± 1.6 meters from the center of the microphone array. The multi-channel audio signal obtained from the simulated room contains noise and reverberation.
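Such a setup can be reproduced, for example, with the pyroomacoustics toolkit; the sketch below follows the sampled parameters in the text, while the 16 kHz sampling rate, the 1.5 m source and microphone heights, and the placeholder source signals are assumptions of the sketch.

```python
import numpy as np
import pyroomacoustics as pra

rng = np.random.default_rng(0)
fs = 16000  # assumed sampling rate

# Randomly sampled room geometry and reverberation time, as in the text.
room_dim = [rng.uniform(3, 10), rng.uniform(3, 10), rng.uniform(2.5, 4)]
rt60 = rng.uniform(0.1, 0.5)
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# 6-microphone circular array of 10 cm diameter at the room center.
center = [room_dim[0] / 2, room_dim[1] / 2]
mic_xy = pra.circular_2D_array(center=center, M=6, phi0=0, radius=0.05)
mic_xyz = np.vstack([mic_xy, 1.5 * np.ones((1, 6))])  # fixed 1.5 m height
room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))

# Two sources placed in the room (positions and signals illustrative only).
for _ in range(2):
    pos = [rng.uniform(0.5, room_dim[0] - 0.5),
           rng.uniform(0.5, room_dim[1] - 0.5), 1.5]
    room.add_source(pos, signal=rng.standard_normal(fs * 4))

room.simulate()
multichannel = room.mic_array.signals  # (6, n_samples), reverberant mixture
```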
According to the embodiments of the present disclosure, the neural networks involved in the cascaded neural network beamforming method can be trained, validated, and tested. For example, 20,000, 5,000, and 3,000 four-second-long utterances may be used as the training set, validation set, and test set, respectively. For each utterance, two speech signals may be randomly selected from a speech corpus such as LibriSpeech (a large-scale English corpus) and one noise signal from a non-speech corpus.
Alternatively, a model such as a single-channel DPRNN-TasNet may be used for the pre- and post-separation modules in the cascaded neural network beamforming process described above, where each module may include, for example, 3 DPRNN (dual-path recurrent neural network) blocks. By setting reasonable model parameters, the window size of the TD-GWF can be varied, for example to 2 milliseconds (ms), 4 ms, 8 ms, and 16 ms, for performance comparison.
On this basis, table 1 compares the performance of the TD-GWF described above with the conventional frequency-domain beamforming method FD-MCWF (frequency-domain multi-channel Wiener filtering), using, by way of example and not limitation, the signal-to-distortion ratio (SDR) and the scale-invariant signal-to-distortion ratio (SI-SDR) for signal quality evaluation.
TABLE 1
(Table 1 is rendered as an image in the original publication; it reports the SDR and SI-SDR upper-bound performance of FD-MCWF and TD-GWF at different window sizes.)
As shown in table 1, the upper-bound performance of FD-MCWF is lower when the window size is smaller, whereas the TD-GWF proposed by the present disclosure, with a window size of only 8 ms, exceeds the upper-bound performance of FD-MCWF with a window size of 256 ms.
Moreover, an FD-MCWF with a window size of 32 ms (typically used as the default configuration of frequency-domain beamformers) has a lower upper-bound performance than a TD-GWF with a window size of only 2 ms; thus, in the absence of strict requirements on frequency-domain resolution, the TD-GWF proposed by the present disclosure can be considered to have a much higher upper-bound performance than the conventional FD-MCWF.
Furthermore, the results of the performance comparison using different TasNet (time domain audio separation network) models can be seen in table 2, where the overlap ratio between the two speakers is uniformly sampled between 0% and 100%.
TABLE 2
(Table 2 is rendered as an image in the original publication; it reports separation results for single-channel DPRNN-TasNet baselines and for cascaded neural network beamforming using TD-GWF or FD-MCWF.)
As shown in Table 2, the first two rows provide audio signal separation results using a single-channel DPRNN-TasNet model, where "-S" and "-L" represent "small" and "large" models with 3 and 6 blocks of DPRNN, respectively. The remainder of table 2 includes the results of cascaded neural network beamforming processes that utilize TD-GWF or FD-MCWF for beamforming.
As can be seen, in the configurations with 1 and 2 iterations, the FD-MCWF with a window size of 32 ms performs significantly worse than the TD-GWF with a window size of only 2 ms. Furthermore, while the FD-MCWF with a window size of 512 ms may separate signals better than the TD-GWF when the speaker angle or speaker overlap ratio is small, its computational cost is much higher than that of the TD-GWF with a window size of 4 ms, because its spatial covariance matrix is then very large (4097 × 4097, matching the 8192/2 + 1 = 4097 non-redundant frequency bins of an 8192-sample window, i.e., 512 ms at a 16 kHz sampling rate).
Therefore, through the above simulation scenario and comparison of performance results, it can be seen that the method of the present disclosure has a higher theoretical upper limit performance and reduces the computational complexity and computational cost compared to the conventional beamforming method based on frequency domain processing.
Fig. 7 is a schematic diagram illustrating a processing apparatus 700 for a multi-channel audio signal according to an embodiment of the present disclosure. Fig. 8 is a schematic diagram illustrating a processing apparatus 800 for a multi-channel audio signal according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the processing apparatus 700 for a multi-channel audio signal may include an audio signal acquisition module 701, a time domain feature determination module 702, and a target signal generation module 703.
According to an embodiment of the present disclosure, a processing apparatus 800 for a multi-channel audio signal may include an audio signal acquisition module 801, a time domain feature determination module 802, a time domain feature update module 803, and a target signal generation module 804.
Therein, the audio signal acquisition modules 701 and 801 may similarly be configured to acquire the multi-channel audio signal acquired by the plurality of microphones in a specific environment including one or more signal sources.
For example, the multi-channel audio signal may be acquired by a microphone array as shown in fig. 1, which may include a plurality of microphones regularly arranged in a certain shape to acquire single-channel audio signals from different directions in space. According to topology, microphone arrays may be classified into linear arrays, planar arrays, stereo (three-dimensional) arrays, and so on; the microphone array shown in fig. 1, for example, is a planar array.
For example, the one or more signal sources may be multiple speakers as shown in fig. 1, for instance when the method of the present disclosure is used for speech audio signal processing such as speech separation, speech enhancement, or speech recognition for these speakers.
The time domain feature determination modules 702 and 802 may similarly be configured to determine time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal, and to determine time domain features of one or more estimated audio signals, which correspond to the one or more signal sources, respectively, based on the acquired multi-channel audio signal.
For example, a windowed multi-channel audio signal (observation signal) may be subjected to a time-domain linear transformation to obtain its time-domain features.
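A minimal sketch of such a windowed, purely real time-domain linear transform is given below; the shapes and the transform matrix are illustrative assumptions (a learned 1-D convolution would play the same role).

```python
import numpy as np

def time_domain_features(x, window, hop, transform):
    """Windowed time-domain linear transform of a multi-channel signal.

    x:         (C, n_samples) multi-channel waveform.
    window:    frame length in samples.
    hop:       hop size in samples.
    transform: (window, K) real-valued analysis matrix (learned or fixed).
    Returns features of shape (C, K, T).
    """
    C, n = x.shape
    T = 1 + (n - window) // hop
    frames = np.stack([x[:, t * hop : t * hop + window] for t in range(T)],
                      axis=-1)                        # (C, window, T)
    # Purely real linear transform: no STFT, hence no complex-domain ops.
    return np.einsum('cwt,wk->ckt', frames, transform)
```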
Alternatively, the estimated time-domain characteristics of the signal sources may be separated from the mixed signal using a pre-trained neural network (i.e., an audio separation network), which may be a single-channel or multi-channel separation network.
For example, in embodiments of the present disclosure, for cost saving considerations, a single-channel separation network is used to separate, for each signal source, the time-domain features of the estimated audio signal corresponding to that signal source from the multi-channel audio signal.
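Where a concrete single-channel separation backbone is needed, an off-the-shelf DPRNN-TasNet implementation can stand in for this step; the sketch below uses the asteroid toolkit purely as an assumed example, and any separation model with the same interface would do.

```python
import torch
from asteroid.models import DPRNNTasNet

# Single-channel pre-separation: one reference channel in, n_src estimates out.
model = DPRNNTasNet(n_src=2)              # randomly initialized here; in
                                          # practice this would be pre-trained
ref_channel = torch.randn(1, 16000 * 4)   # (batch, samples): 4 s at 16 kHz
with torch.no_grad():
    est_sources = model(ref_channel)      # (batch, n_src, samples)
```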
The target signal generation module 703 may be configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
Alternatively, based on time-domain features of a multi-channel audio signal and time-domain features of one or more estimated audio signals respectively obtained from the multi-channel audio signal, one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals may be determined and then applied to the time-domain features of the multi-channel audio signal respectively, and time-domain features of one or more output audio signals respectively corresponding to the one or more estimated audio signals may be obtained.
Unlike the processing apparatus 700, the processing apparatus 800 may include iterative updates to the time-domain features in a cascaded neural network beamforming process. Thus, the time domain feature updating module 803 in the processing apparatus 800 may be configured to update the time domain features of the one or more estimated audio signals based on the time domain features of the multichannel audio signal and the time domain features of the one or more estimated audio signals.
Alternatively, the updating of the time-domain features of the one or more estimated audio signals may comprise two parts: filtering and separation. The filtering part may employ the TD-GWF described above, and the separation part may employ a second audio separation network similar to the first audio separation network to achieve further subdivision of the filtering result. With each round of filtering and separation, updated time-domain features of the one or more estimated audio signals may be obtained.
The target signal generation module 804 may be configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the updated time-domain features of the one or more estimated audio signals.
Optionally, through multiple iterations of filtering and separation in the cascaded neural network beamforming process, the output of the neural network, i.e., the most recently updated time-domain features of the one or more estimated audio signals, may be taken as the time-domain features of the one or more output audio signals corresponding to the one or more signal sources, respectively.
According to yet another aspect of the disclosure, a processing device for a multi-channel audio signal is also provided. Fig. 9 shows a schematic diagram of a processing device 2000 for a multi-channel audio signal according to an embodiment of the present disclosure.
As shown in fig. 9, the processing device 2000 for multi-channel audio signals may include one or more processors 2010, and one or more memories 2020. Wherein the memory 2020 has stored therein computer readable code, which when executed by the one or more processors 2010 may perform a method as described above.
The processor in the embodiments of the present disclosure may be an integrated circuit chip having signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 or ARM architecture.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus in accordance with embodiments of the present disclosure may also be implemented by way of the architecture of computing device 3000 shown in fig. 10. As shown in fig. 10, computing device 3000 may include a bus 3010, one or more CPUs 3020, a read-only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 connected to a network, input/output components 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the methods provided by the present disclosure, as well as program instructions executed by the CPU. Computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 10 is merely exemplary, and one or more components of the computing device shown in fig. 10 may be omitted as needed when implementing different devices.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. FIG. 11 shows a schematic diagram 4000 of a storage medium according to the present disclosure.
As shown in fig. 11, the computer storage medium 4020 has stored thereon computer readable instructions 4010. The computer readable instructions 4010, when executed by a processor, may perform methods according to embodiments of the present disclosure described with reference to the above figures. The computer-readable storage medium in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the processing method for multi-channel audio signals according to the embodiments of the present disclosure.
Compared with traditional beamforming methods based on frequency-domain processing, the method provided by the embodiments of the present disclosure introduces more degrees of freedom, has a higher theoretical upper-bound performance, involves only real-domain operations, and reduces computational complexity.
The method provided by the embodiments of the present disclosure performs Wiener filtering based on the time-domain features of a multi-channel audio signal, performing minimum mean square error (MMSE) estimation against the audio signal separated from the multi-channel audio signal to obtain the optimal Wiener filter coefficients, thereby realizing beamforming of the audio signal of a signal source. The method of embodiments of the present disclosure does not require a signal transformation (e.g., the short-time Fourier transform (STFT)) and therefore does not involve any complex-domain operations; it can be directly applied in any existing neural network beamforming framework, and since it introduces more degrees of freedom, the theoretical upper-bound performance of beamforming is significantly improved. In addition, by using the beamforming method of the present disclosure as a substitute for the frequency-domain beamforming method in a cascaded neural network beamforming framework, the audio processing performance of the system is significantly improved.
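Concretely, for each source this is the classical real-valued MMSE problem; assuming (purely for illustration) that y(t) stacks the windowed multi-channel features and ŝ(t) denotes the separated estimate at frame t, the optimal filter has the closed-form Wiener solution

```latex
\hat{\mathbf{W}}
  = \arg\min_{\mathbf{W}} \; \mathbb{E}\!\left[\bigl\|\mathbf{W}^{\top}\mathbf{y}(t)-\hat{\mathbf{s}}(t)\bigr\|_2^2\right]
  = \mathbf{R}_{yy}^{-1}\,\mathbf{r}_{y\hat{s}},
\qquad
\mathbf{R}_{yy}=\mathbb{E}\!\left[\mathbf{y}(t)\,\mathbf{y}(t)^{\top}\right],\quad
\mathbf{r}_{y\hat{s}}=\mathbb{E}\!\left[\mathbf{y}(t)\,\hat{\mathbf{s}}(t)^{\top}\right],
```

which involves only transposes rather than Hermitian conjugates, i.e., only real-domain operations.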
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are merely illustrative, and not restrictive. It will be appreciated by those skilled in the art that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and that such modifications are intended to be within the scope of the disclosure.

Claims (15)

1. A processing method for a multi-channel audio signal, comprising:
acquiring the multichannel audio signal, wherein the multichannel audio signal is acquired in a specific environment through a plurality of microphones, and the specific environment comprises one or more signal sources;
determining time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal;
determining time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; and
generating one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
2. The processing method of claim 1, wherein generating one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multichannel audio signal and the time-domain features of the one or more estimated audio signals comprises:
for each of the one or more estimated audio signals, determining a feature transform matrix based on time-domain features of the multichannel audio signal and time-domain features of the estimated audio signal;
for each of the one or more estimated audio signals, determining a time-domain feature of an output audio signal corresponding to the estimated audio signal based on a feature transform matrix corresponding to the estimated audio signal and a time-domain feature of the multichannel audio signal; and
generating the one or more output audio signals based on time domain features of each of the one or more output audio signals.
3. The processing method of claim 2, wherein determining a feature transformation matrix based on the time-domain features of the multi-channel audio signal and the time-domain features of the estimated audio signal comprises:
determining a system function of the wiener filter by taking the time domain characteristics of the multi-channel audio signal as the input of the wiener filter and taking the time domain characteristics of the estimated audio signal as the output of the wiener filter; and
taking the system function of the wiener filter as the feature transformation matrix.
4. The processing method of claim 3,
determining a time-domain feature of the multi-channel audio signal based on the acquired multi-channel audio signal further comprises:
grouping time domain features of the multi-channel audio signal to generate a first number of time domain feature groups of the multi-channel audio signal; and
determining the time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal further comprises:
the time domain features of each of the one or more estimated audio signals are grouped, generating a first number of sets of time domain features for each of the one or more estimated audio signals.
5. The processing method of claim 4, wherein, for each of the one or more estimated audio signals:
determining a feature transformation matrix based on the time-domain features of the multi-channel audio signal and the time-domain features of the estimated audio signal further comprises:
for each of the first number of time-domain feature groups of the estimated audio signal and a respective one of the first number of time-domain feature groups of the multi-channel audio signal, determining a respective feature transformation matrix to obtain a first number of feature transformation matrices corresponding to the estimated audio signal;
determining a time-domain feature of an output audio signal corresponding to the estimated audio signal based on a feature transform matrix corresponding to the estimated audio signal and the time-domain feature of the multi-channel audio signal further comprises:
determining a first number of sets of time domain features of an output audio signal corresponding to the estimated audio signal based on the first number of sets of time domain features of the multi-channel audio signal and the first number of feature transformation matrices corresponding to the estimated audio signal; and
splicing the first number of time domain feature groups of the output audio signal to obtain the time domain features of the output audio signal.
6. The processing method of claim 1, wherein determining a time-domain feature of the multi-channel audio signal based on the acquired multi-channel audio signal comprises:
performing time domain linear transformation on the acquired multi-channel audio signal to determine the time domain characteristics of the multi-channel audio signal;
wherein determining the time-domain features of the one or more estimated audio signals based on the acquired multi-channel audio signal comprises:
for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and
separating the time domain features of the estimated audio signal corresponding to the signal source from the time domain features of the audio signal of the selected channel through a pre-trained audio separation network.
7. A processing method for a multi-channel audio signal, comprising:
acquiring the multichannel audio signal, wherein the multichannel audio signal is acquired in a specific environment through a plurality of microphones, and the specific environment comprises one or more signal sources;
determining time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal;
determining time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively;
updating the time domain features of the one or more estimated audio signals based on the time domain features of the multi-channel audio signal and the time domain features of the one or more estimated audio signals; and
generating one or more output audio signals respectively corresponding to the one or more signal sources based on the updated time domain features of the one or more estimated audio signals.
8. The processing method of claim 7, wherein determining the time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal comprises:
for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and
separating the time domain features of the estimated audio signal corresponding to the signal source from the time domain features of the audio signal of the selected channel through a pre-trained first audio separation network.
9. The processing method of claim 8, wherein updating the time domain features of the one or more estimated audio signals based on the time domain features of the multichannel audio signal and the time domain features of the one or more estimated audio signals comprises:
determining one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals based on time-domain features of the multi-channel audio signal and time-domain features of the one or more estimated audio signals;
generating time-domain features of one or more temporary audio signals respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multi-channel audio signal and the one or more feature transformation matrices; and
updating the time-domain features of the one or more estimated audio signals based on the time-domain features of the one or more temporary audio signals and the time-domain features of the multichannel audio signal.
10. The processing method of claim 9, wherein determining one or more feature transformation matrices respectively corresponding to the one or more estimated audio signals based on the time-domain features of the multichannel audio signal and the time-domain features of the one or more estimated audio signals comprises:
for each of the one or more estimated audio signals, determining a system function of the wiener filter with the time-domain features of the multichannel audio signal as an input of the wiener filter and with the time-domain features of the estimated audio signal as an output of the wiener filter; and
the system function of the wiener filter is taken as a characteristic transformation matrix corresponding to the estimated audio signal.
11. The processing method of claim 9, wherein updating the time domain features of the one or more estimated audio signals based on the time domain features of the one or more temporary audio signals and the time domain features of the multichannel audio signal and/or the time domain features of the one or more estimated audio signals comprises:
for each of the one or more signal sources, selecting an audio signal of one channel from the multi-channel audio signal; and
outputting, through a pre-trained second audio separation network, the time domain features of the estimated audio signal corresponding to the signal source, with the time domain features of the temporary audio signal corresponding to the signal source and the time domain features of the audio signal of the selected channel as input, wherein the output time domain features are used as the updated time domain features of the estimated audio signal corresponding to the signal source.
12. A processing apparatus for a multi-channel audio signal, comprising:
an audio signal acquisition module configured to acquire the multi-channel audio signal, the multi-channel audio signal being acquired by a plurality of microphones in a specific environment including one or more signal sources;
a time domain feature determination module configured to determine time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal and to determine time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively; and
a target signal generation module configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the time-domain features of the multi-channel audio signal and the time-domain features of the one or more estimated audio signals.
13. A processing apparatus for a multi-channel audio signal, comprising:
an audio signal acquisition module configured to acquire the multi-channel audio signal, the multi-channel audio signal being acquired by a plurality of microphones in a specific environment including one or more signal sources;
a time domain feature determination module configured to determine time domain features of the multi-channel audio signal based on the acquired multi-channel audio signal and to determine time domain features of one or more estimated audio signals based on the acquired multi-channel audio signal, the one or more estimated audio signals corresponding to the one or more signal sources, respectively;
a time domain feature updating module configured to update time domain features of the one or more estimated audio signals based on time domain features of the multi-channel audio signal and time domain features of the one or more estimated audio signals; and
a target signal generation module configured to generate one or more output audio signals corresponding to the one or more signal sources, respectively, based on the updated time domain features of the one or more estimated audio signals.
14. A processing device for a multi-channel audio signal, comprising:
one or more processors; and
one or more memories having stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-11.
15. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-11 when executed by a processor.
CN202111058595.2A 2021-09-09 2021-09-09 Processing method and device for multi-channel audio signal Pending CN114283832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111058595.2A CN114283832A (en) 2021-09-09 2021-09-09 Processing method and device for multi-channel audio signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111058595.2A CN114283832A (en) 2021-09-09 2021-09-09 Processing method and device for multi-channel audio signal

Publications (1)

Publication Number Publication Date
CN114283832A (en) 2022-04-05

Family

ID=80868520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111058595.2A Pending CN114283832A (en) 2021-09-09 2021-09-09 Processing method and device for multi-channel audio signal

Country Status (1)

Country Link
CN (1) CN114283832A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102469387A (en) * 2010-11-15 2012-05-23 财团法人工业技术研究院 System and method for suppressing noises
CN102938254A (en) * 2012-10-24 2013-02-20 中国科学技术大学 Voice signal enhancement system and method
CN105206281A (en) * 2015-09-14 2015-12-30 胡旻波 Voice enhancement device based on distributed microphone array network
US20190139563A1 (en) * 2017-11-06 2019-05-09 Microsoft Technology Licensing, Llc Multi-channel speech separation
CN108564963A (en) * 2018-04-23 2018-09-21 百度在线网络技术(北京)有限公司 Method and apparatus for enhancing voice
CN110600050A (en) * 2019-09-12 2019-12-20 深圳市华创技术有限公司 Microphone array voice enhancement method and system based on deep neural network
CN111081267A (en) * 2019-12-31 2020-04-28 中国科学院声学研究所 Multi-channel far-field speech enhancement method
CN111564160A (en) * 2020-04-21 2020-08-21 重庆邮电大学 Voice noise reduction method based on AEWGAN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Chengran; SONG Xiaoxiao; QU Dan; YANG Xukui: "Far-field speech recognition with Wiener post-filtering DNN front-end enhancement", Journal of Information Engineering University, vol. 20, no. 04, pages 405-409 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701921A (en) * 2023-08-08 2023-09-05 电子科技大学 Time-frequency characteristic extraction circuit and self-adaptive noise suppression circuit of multichannel time sequence signal
CN116701921B (en) * 2023-08-08 2023-10-20 电子科技大学 Multi-channel time sequence signal self-adaptive noise suppression circuit


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination