CN111028857B - Method and system for reducing noise of multichannel audio-video conference based on deep learning - Google Patents
Info
- Publication number
- CN111028857B CN111028857B CN201911378821.8A CN201911378821A CN111028857B CN 111028857 B CN111028857 B CN 111028857B CN 201911378821 A CN201911378821 A CN 201911378821A CN 111028857 B CN111028857 B CN 111028857B
- Authority
- CN
- China
- Prior art keywords
- noise
- covariance matrix
- calculating
- frequency domain
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to a method and a system for reducing noise in a multichannel audio-video conference based on deep learning, wherein the method comprises the following steps: collecting original multichannel signals and converting the collected time domain signals into frequency domain signals; calculating the noise probability existing on each frequency band by using a neural network, and calculating a covariance matrix of the noise from the noise probability; calculating the eigenvectors of the covariance matrix of the noise, and calculating the weights for combining the multiple channels from the covariance matrix of the noise and its eigenvectors; and outputting a noise reduction result according to the combining weights and the frequency domain signal. The invention has the advantages of high recognition efficiency, a small amount of calculation, and good performance in actual use.
Description
Technical Field
The invention relates to the technical field of noise reduction processing, in particular to a multichannel audio-video conference noise reduction method based on deep learning.
Background
Noise is common in audio-video conferences, for example table-knocking sounds, keyboard-typing sounds, or squeaks from a desk, all of which greatly degrade the quality of the conference. In addition, relatively loud noise may also come from the other end of the video conference; for example, the remote party may be on a train or otherwise in transit. When the noise is loud, participants must concentrate hard to follow what is being said, which costs considerable mental effort and leaves them very tired.
Solving the problem of conference noise generally involves acoustic processing: noise must be removed from the acoustic signal by exploiting its acoustic characteristics. The acoustic signal is a one-dimensional time domain signal, and a common processing approach is to decompose it into a two-dimensional time-frequency representation using a mathematical transform such as the Fourier transform. However, human speech and noises such as table knocks overlap in time-frequency space, so there is no straightforward way to separate them.
In recent years, with the development of deep learning, deep learning methods have been applied to the noise reduction problem; for example, in "Recurrent Neural Networks for Noise Reduction in Robust ASR" the authors use an RNN to denoise the acoustic signal. In actual use, however, the following problems remain: noise that should theoretically be estimated within 3-5 seconds takes 8-16 seconds in practice, so the method is too slow; for noise types not seen during training, recognition efficiency is very low; and the amount of computation is too large, so the method performs poorly in practical use.
Disclosure of Invention
Therefore, the invention aims to solve the technical problems of sound-quality loss and poor practical performance in the prior art, and provides a deep-learning-based method and system for reducing noise in a multichannel audio-video conference that preserve sound quality and perform well in actual use.
In order to solve the technical problems, the method for reducing noise of the multichannel audio-video conference based on deep learning comprises the following steps: collecting original multichannel signals, and converting the collected time domain signals into frequency domain signals; calculating noise probability existing on each frequency band by using a neural network, and calculating a covariance matrix of the noise by using the noise probability; calculating eigenvectors of a covariance matrix of the noise through the covariance matrix of the noise, and calculating weights of the combined multiple channels according to the covariance matrix of the noise and the eigenvectors of the covariance matrix of the noise; and outputting a noise reduction result according to the weight of the combined multi-channel and the frequency domain signal.
In one embodiment of the present invention, the method for acquiring the original multi-channel signal is as follows: the raw multichannel signal is acquired by a microphone array.
In one embodiment of the invention, the method for converting the acquired time domain signal into the frequency domain signal comprises the following steps: the acquired time domain signal is converted into the frequency domain signal by fast fourier transform using a single filter or a plurality of filters.
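As an illustrative sketch (not part of the patent's disclosure), the time-to-frequency conversion can be performed with a windowed FFT; the frame length, hop size, and Hann window below are assumed values, not taken from the embodiment:

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Convert a 1-D time-domain signal into a time-frequency matrix via
    windowed FFT. frame_len and hop are illustrative, not from the patent."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant positive-frequency bins
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len//2 + 1)

# Multichannel case: apply the same transform to every microphone channel
fs = 16000  # the embodiment states a 16 kHz sampling rate
t = np.arange(fs) / fs
channels = [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 440 * t + 0.1)]
Y = np.stack([stft(c) for c in channels])  # shape: (N_channels, T, F)
```

Each channel yields a complex time-frequency matrix, so the stacked array `Y` holds the per-channel frequency domain signals Y_{i,t} used in the later steps.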
In one embodiment of the present invention, the noise probability existing on each frequency band is calculated by the neural network as follows: pre-labelled data are input into the neural network, which after computation outputs the noise probability for each frequency band.
In one embodiment of the present invention, the covariance matrix of the noise is calculated as follows: let the covariance matrix of the noise be Φ_f and the frequency domain signal be Y_{i,t}; then Φ_f = Σ_t Y_t · Y_t^H, where Y_t = (Y_{1,t}, …, Y_{N,t})^T, Y_{i,t} represents the frequency domain signal of the i-th channel at time t, N represents the number of channels, and Y_t^H is the conjugate transpose of Y_t.
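A minimal numerical sketch of the covariance estimate for a single frequency bin. Weighting each frame's outer product by the network's noise probability is an assumption made here to connect this step with step S2; the patent's exact formula image is not reproduced in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 100            # channels and frames (illustrative sizes)
# Complex frequency-domain samples of one bin f across channels and time
Y = rng.normal(size=(N, T)) + 1j * rng.normal(size=(N, T))
p = rng.uniform(size=T)  # per-frame noise probability from the network

# Probability-weighted sum of outer products, normalised by the total weight:
# Phi_f = sum_t p_t * y_t y_t^H / sum_t p_t
Phi_f = (p * Y) @ Y.conj().T / p.sum()   # shape (N, N), Hermitian
```

Frames the network judges noisier contribute more to the estimate, which is why the covariance can converge quickly once the probabilities are available.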
In one embodiment of the present invention, the eigenvectors of the covariance matrix of the noise are obtained from the eigendecomposition Φ_f W_f = W_f Λ, where W_f is the matrix of eigenvectors of the noise covariance matrix Φ_f and Λ is the diagonal matrix of eigenvalues.
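The relation Φ_f W_f = W_f Λ can be computed directly with `numpy.linalg.eigh`, which is appropriate because a covariance matrix is Hermitian; the 2×2 matrix below is a made-up example, not data from the patent:

```python
import numpy as np

# A small Hermitian matrix standing in for the noise covariance Phi_f
Phi_f = np.array([[2.0, 1j],
                  [-1j, 2.0]])
eigvals, W_f = np.linalg.eigh(Phi_f)  # eigh: Hermitian eigendecomposition
Lam = np.diag(eigvals)                # diagonal matrix of eigenvalues
# By construction, Phi_f @ W_f equals W_f @ Lam
```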
In one embodiment of the present invention, the weights for combining the multiple channels are calculated from the noise covariance matrix Φ_f and its eigenvectors W_f, where w_f denotes the combining weight and W_f^H denotes the conjugate transpose of W_f.
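One common minimum-variance choice consistent with the description — an assumption on our part, since the patent's weight formula image is not reproduced in the text — is the MVDR solution w = Φ⁻¹ d / (dᴴ Φ⁻¹ d), sketched here with a steering vector d that could be taken from an eigenvector of the covariance:

```python
import numpy as np

# Illustrative noise covariance and steering vector (made-up values)
Phi_f = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
d = np.array([1.0, 1.0]) / np.sqrt(2)

# MVDR weights: minimise noise power subject to w^H d = 1
Phi_inv_d = np.linalg.solve(Phi_f, d)     # Phi_f^{-1} d without forming the inverse
w_f = Phi_inv_d / (d.conj() @ Phi_inv_d)  # normalise to satisfy the constraint
```

The denominator enforces the distortionless constraint, so the target direction passes through with unit gain while correlated noise is suppressed.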
In one embodiment of the present invention, the noise reduction result is output by applying the combining weights w_f to the frequency domain signal.
the invention also discloses a multichannel audio-video conference noise reduction system based on deep learning, which comprises: the acquisition module is used for acquiring original multichannel signals and converting the acquired time domain signals into frequency domain signals; the first calculation module is used for calculating noise probability existing on each frequency band by using the neural network, and calculating a covariance matrix of the noise by the noise probability; the second calculation module is used for calculating the eigenvectors of the covariance matrix of the noise through the covariance matrix of the noise, and calculating the weights of the combined multiple channels according to the covariance matrix of the noise and the eigenvectors of the covariance matrix of the noise; and the output module is used for outputting a noise reduction result according to the weight of the combined multi-channel and the frequency domain signal.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the method and the system for reducing noise of the multichannel audio-video conference based on deep learning, the covariance matrix of noise can be calculated more rapidly and effectively, and then the covariance matrix is brought into a traditional signal processing frame, so that the covariance matrix of noise can be converged rapidly, and the spectrum matrix of the noise can be calculated; in addition, the invention uses the physical characteristics of the signals to reduce dryness and uses the traditional signal processing framework with physical significance, so the restored original sound is more real.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a flow chart of a method for reducing noise of a multichannel audio-video conference based on deep learning;
fig. 2 is a schematic diagram of a system for noise reduction in a multi-channel audio-video conference based on deep learning.
Detailed Description
Example 1
As shown in fig. 1, this embodiment provides a method for reducing noise in a multichannel audio-video conference based on deep learning, which includes the following steps: Step S1: collecting original multichannel signals, and converting the collected time domain signals into frequency domain signals; Step S2: calculating the noise probability existing on each frequency band by using a neural network, and calculating a covariance matrix of the noise from the noise probability; Step S3: calculating the eigenvectors of the covariance matrix of the noise, and calculating the weights for combining the multiple channels from the covariance matrix of the noise and its eigenvectors; Step S4: outputting a noise reduction result according to the combining weights and the frequency domain signal.
In the above method, step S1 collects the original multichannel signals and converts the collected time domain signals into frequency domain signals, which facilitates subsequent processing. Step S2 calculates the noise probability on each frequency band using the neural network and computes the noise covariance matrix from that probability, so the covariance matrix converges quickly, which helps in calculating the noise spectrum matrix. Step S3 calculates the eigenvectors of the noise covariance matrix and the multichannel combining weights from the covariance matrix and its eigenvectors; because the physical characteristics of the signals are exploited to reduce the noise, the recognition efficiency is high. Step S4 outputs the noise reduction result from the combining weights and the frequency domain signal, which not only helps restore the original sound more realistically but is also fast and effective in actual use.
The original multichannel signals are collected through a microphone array, so the acquisition is both accurate and fast. In addition, in this embodiment, the sampling rate is 16 kHz.
The method for converting the acquired time domain signals into frequency domain signals comprises the following steps: the acquired time domain signal is converted into the frequency domain signal by fast fourier transform using a single filter or a plurality of filters. In this embodiment, a multi-filter bank is used, so that signals of each frequency band can be effectively reserved.
The noise probability existing on each frequency band is calculated by the neural network as follows: pre-labelled data are input into the neural network, which outputs the noise probability for each frequency band. This approach is simple and the amount of calculation is small, so it is fast.
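As a toy illustration of such a per-band probability estimator — the architecture, sizes, and random weights below are stand-in assumptions; the patent only specifies that the network is trained on pre-labelled data — a one-hidden-layer network with a sigmoid output keeps every band's probability in (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_probability(mag_frame, W1, b1, W2, b2):
    """Toy one-hidden-layer network mapping a magnitude-spectrum frame to a
    per-band noise probability. Weights are random stand-ins; in the patent
    they would be learned from pre-labelled training data."""
    h = np.tanh(mag_frame @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid per frequency band

F = 257  # frequency bands, e.g. from a 512-point FFT (an assumption)
W1 = rng.normal(scale=0.1, size=(F, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, F)); b2 = np.zeros(F)

mag = np.abs(rng.normal(size=F))          # one magnitude-spectrum frame
p = noise_probability(mag, W1, b1, W2, b2)  # noise probability per band
```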
The covariance matrix of the noise is calculated as follows: let the covariance matrix of the noise be Φ_f and the frequency domain signal be Y_{i,t}; then Φ_f = Σ_t Y_t · Y_t^H, where Y_t = (Y_{1,t}, …, Y_{N,t})^T, Y_{i,t} represents the frequency domain signal of the i-th channel at time t, N represents the number of channels, and Y_t^H is the conjugate transpose of Y_t; Φ_f characterizes the spectrum of the noise. The eigenvectors of the noise covariance matrix are obtained from the eigendecomposition Φ_f W_f = W_f Λ, where W_f is the matrix of eigenvectors of Φ_f and Λ is the diagonal matrix of eigenvalues.
The weights for combining the multiple channels are calculated from the noise covariance matrix Φ_f and its eigenvectors W_f, where w_f denotes the combining weight and W_f^H is the conjugate transpose of W_f. Because the noise covariance matrix Φ_f is substituted into a traditional minimum variance filter, the calculation is simple and fast.
The noise reduction result is output by applying the combining weights w_f to the frequency domain signal. Because the invention reduces noise by exploiting the physical characteristics of the signals and uses a traditional, physically meaningful signal processing framework, the restored original sound is more realistic.
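Applying the combining weights to the frequency domain signal can be sketched as a weighted sum across channels; the uniform weights below are a placeholder assumption standing in for the minimum-variance solution:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 4, 50  # channels and frames (illustrative sizes)
# Multichannel frequency-domain samples of one bin f
Y = rng.normal(size=(N, T)) + 1j * rng.normal(size=(N, T))

w_f = np.ones(N) / N     # placeholder weights; in practice the MVDR solution
S_hat = w_f.conj() @ Y   # shape (T,): one noise-reduced output per frame
```

Repeating this per frequency bin and inverting the FFT yields the single-channel, noise-reduced time-domain output.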
Example two
Based on the same inventive concept, this embodiment provides a system for reducing noise in a multichannel audio-video conference based on deep learning. Since its problem-solving principle is similar to that of the method described above, repeated details are omitted.
Referring to fig. 2, the system for noise reduction in a multi-channel audio/video conference based on deep learning according to the present embodiment includes:
the acquisition module is used for acquiring original multichannel signals and converting the acquired time domain signals into frequency domain signals;
the first calculation module is used for calculating noise probability existing on each frequency band by using the neural network, and calculating a covariance matrix of the noise by the noise probability;
the second calculation module is used for calculating the eigenvectors of the covariance matrix of the noise through the covariance matrix of the noise, and calculating the weights of the combined multiple channels according to the covariance matrix of the noise and the eigenvectors of the covariance matrix of the noise;
and the output module is used for outputting a noise reduction result according to the weight of the combined multi-channel and the frequency domain signal.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of the invention.
Claims (6)
1. The method for reducing noise of the multichannel audio-video conference based on the deep learning is characterized by comprising the following steps of:
step S1: collecting original multichannel signals, and converting the collected time domain signals into frequency domain signals;
step S2: the method for calculating the covariance matrix of the noise by using the neural network to calculate the noise probability existing on each frequency band comprises the following steps: if the covariance matrix of the noise is phi f The frequency domain signal is Y i,t ThenWherein Y is i,t Representing the frequency domain signal of the ith channel at time t, N representing the number of channels, +.>Is Y i,t The eigenvector calculation method of the covariance matrix of the noise is phi f W f =W f Λ, wherein the eigenvector of the covariance matrix of the noise is W f The covariance matrix of the noise is phi f And a represents a matrix of characteristic values, and the method for calculating the weight of the combined multi-channel is as follows: />Wherein the weight of the merging multi-channel is +.> Is W f Is a conjugate transpose of (2);
step S3: calculating eigenvectors of a covariance matrix of the noise through the covariance matrix of the noise, and calculating weights of the combined multiple channels according to the covariance matrix of the noise and the eigenvectors of the covariance matrix of the noise;
step S4: and outputting a noise reduction result according to the weight of the combined multi-channel and the frequency domain signal.
2. The deep learning-based multichannel audio-video conference noise reduction method according to claim 1, wherein: the method for collecting the original multichannel signals comprises the following steps: the raw multichannel signal is acquired by a microphone array.
3. The method for noise reduction of a deep learning-based multichannel audio-video conference according to claim 1, wherein: the method for converting the acquired time domain signals into frequency domain signals comprises the following steps: the acquired time domain signal is converted into the frequency domain signal by fast fourier transform using a single filter or a plurality of filters.
4. The method for noise reduction of a deep learning-based multichannel audio-video conference according to claim 1, wherein: the method for calculating the noise probability existing on each frequency band by using the neural network comprises the following steps: and inputting the data marked in advance into the neural network, and outputting noise probability existing on each frequency band after calculation of the neural network.
5. The method for noise reduction of a deep learning-based multichannel audio-video conference according to claim 1, wherein: the noise reduction result is output by applying the combining weights w_f to the frequency domain signal.
6. a system for noise reduction in a multichannel audio-video conference based on deep learning, comprising:
the acquisition module is used for acquiring original multichannel signals and converting the acquired time domain signals into frequency domain signals;
the first calculating module is used for calculating noise probability existing on each frequency band by using a neural network, and calculating a covariance matrix of the noise by the noise probability, wherein the calculating method of the covariance matrix of the noise comprises the following steps: if the covariance matrix of the noise is phi f The frequency domain signal is Y i,t ThenWherein Y is i,t Representing the frequency domain signal of the ith channel at time t, N representing the number of channels, +.>Is Y i,t The eigenvector calculation method of the covariance matrix of the noise is phi f W f =W f Λ, wherein the eigenvector of the covariance matrix of the noise is W f The covariance matrix of the noise is phi f And a represents a matrix of characteristic values, and the method for calculating the weight of the combined multi-channel is as follows: />Wherein the weight of the merging multi-channel is +.> Is the conjugate transpose of Wf;
the second calculation module is used for calculating the eigenvectors of the covariance matrix of the noise through the covariance matrix of the noise, and calculating the weights of the combined multiple channels according to the covariance matrix of the noise and the eigenvectors of the covariance matrix of the noise;
and the output module is used for outputting a noise reduction result according to the weight of the combined multi-channel and the frequency domain signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911378821.8A CN111028857B (en) | 2019-12-27 | 2019-12-27 | Method and system for reducing noise of multichannel audio-video conference based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111028857A CN111028857A (en) | 2020-04-17 |
CN111028857B true CN111028857B (en) | 2024-01-19 |
Family
ID=70196500
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911378821.8A Active CN111028857B (en) | 2019-12-27 | 2019-12-27 | Method and system for reducing noise of multichannel audio-video conference based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111028857B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113628633A (en) * | 2021-10-14 | 2021-11-09 | 辰风策划(深圳)有限公司 | Noise reduction method for multi-channel information transmission of enterprise multi-party meeting |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013054258A (en) * | 2011-09-06 | 2013-03-21 | Nippon Telegr & Teleph Corp <Ntt> | Sound source separation device and method, and program |
CN103517185A (en) * | 2012-06-26 | 2014-01-15 | 鹦鹉股份有限公司 | Method for suppressing noise in an acoustic signal for a multi-microphone audio device operating in a noisy environment |
CN103811020A (en) * | 2014-03-05 | 2014-05-21 | 东北大学 | Smart voice processing method |
CN106653047A (en) * | 2016-12-16 | 2017-05-10 | 广州视源电子科技股份有限公司 | Automatic gain control method and device for audio data |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109994120A (en) * | 2017-12-29 | 2019-07-09 | 福州瑞芯微电子股份有限公司 | Sound enhancement method, system, speaker and storage medium based on diamylose |
CN110136737A (en) * | 2019-06-18 | 2019-08-16 | 北京拙河科技有限公司 | A kind of voice de-noising method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9173025B2 (en) * | 2012-02-08 | 2015-10-27 | Dolby Laboratories Licensing Corporation | Combined suppression of noise, echo, and out-of-location signals |
-
2019
- 2019-12-27 CN CN201911378821.8A patent/CN111028857B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013054258A (en) * | 2011-09-06 | 2013-03-21 | Nippon Telegr & Teleph Corp <Ntt> | Sound source separation device and method, and program |
CN103517185A (en) * | 2012-06-26 | 2014-01-15 | 鹦鹉股份有限公司 | Method for suppressing noise in an acoustic signal for a multi-microphone audio device operating in a noisy environment |
CN103811020A (en) * | 2014-03-05 | 2014-05-21 | 东北大学 | Smart voice processing method |
CN106653047A (en) * | 2016-12-16 | 2017-05-10 | 广州视源电子科技股份有限公司 | Automatic gain control method and device for audio data |
CN109994120A (en) * | 2017-12-29 | 2019-07-09 | 福州瑞芯微电子股份有限公司 | Sound enhancement method, system, speaker and storage medium based on diamylose |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN110136737A (en) * | 2019-06-18 | 2019-08-16 | 北京拙河科技有限公司 | A kind of voice de-noising method and device |
Non-Patent Citations (1)
Title |
---|
He Xi; Yang Xuemei; Xu Jiapin. A spectrum sensing algorithm based on the maximum-eigenvalue distribution of random matrices. Computer Measurement & Control, no. 02, full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111028857A (en) | 2020-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110970053B (en) | Multichannel speaker-independent voice separation method based on deep clustering | |
Grais et al. | Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders | |
WO2015196729A1 (en) | Microphone array speech enhancement method and device | |
JP6482173B2 (en) | Acoustic signal processing apparatus and method | |
JP2007017900A (en) | Data-embedding device and method, and data-extracting device and method | |
US9530429B2 (en) | Reverberation suppression apparatus used for auditory device | |
CN111863015A (en) | Audio processing method and device, electronic equipment and readable storage medium | |
CN111429939A (en) | Sound signal separation method of double sound sources and sound pickup | |
CN106664472A (en) | Signal processing apparatus, signal processing method, and computer program | |
CN110503967B (en) | Voice enhancement method, device, medium and equipment | |
CN116405823B (en) | Intelligent audio denoising enhancement method for bone conduction earphone | |
US20240177726A1 (en) | Speech enhancement | |
CN111028857B (en) | Method and system for reducing noise of multichannel audio-video conference based on deep learning | |
CN107592600B (en) | Pickup screening method and pickup device based on distributed microphones | |
CN113744715A (en) | Vocoder speech synthesis method, device, computer equipment and storage medium | |
CN112908353A (en) | Voice enhancement method for hearing aid by combining edge computing and cloud computing | |
Kates et al. | Integrating cognitive and peripheral factors in predicting hearing-aid processing effectiveness | |
CN114023352B (en) | Voice enhancement method and device based on energy spectrum depth modulation | |
CN110992966B (en) | Human voice separation method and system | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
CN114283832A (en) | Processing method and device for multi-channel audio signal | |
Donley et al. | DARE-Net: Speech dereverberation and room impulse response estimation | |
Muhsina et al. | Signal enhancement of source separation techniques | |
JP2008278406A (en) | Sound source separation apparatus, sound source separation program and sound source separation method | |
JP3787103B2 (en) | Speech processing apparatus, speech processing method, speech processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: No. 229, Lingqiao Road, Haishu District, Ningbo, Zhejiang 315000 Applicant after: Suzhou Auditoryworks Co.,Ltd. Address before: 215000 unit 2-b504, creative industry park, 328 Xinghu street, Suzhou Industrial Park, Jiangsu Province Applicant before: Suzhou frog sound technology Co.,Ltd. |
|
GR01 | Patent grant | ||