
Audio noise reduction method, device, equipment and medium

Info

Publication number
CN113345435A
Authority
CN
China
Prior art keywords
audio
matrix
signal
target
frame
Legal status
Pending
Application number
CN202110751408.2A
Other languages
Chinese (zh)
Inventor
陈孝良
冯大航
奚少亨
常乐
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Publication of CN113345435A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention relates to an audio noise reduction method, device, equipment, and medium, applied to voice control scenarios such as elevators and intelligent automobiles, for performing noise reduction processing on the audio of each object and optimizing the speech recognition process. The method comprises the following steps: acquiring the voice audio of a target object; determining target Bark bands corresponding to the amplitudes of a target speech signal at a plurality of preset frequencies based on the transform relation between the frequency domain and the Bark domain, wherein the target speech signal is any frame of speech signal of the voice audio; determining the audio features of the target speech signal by using a matrix composed of the determined target Bark bands; inputting the audio features of the target speech signal into a noise reduction network model to obtain a signal-to-noise ratio value matrix corresponding to the target speech signal; and determining the denoised target speech signal based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies.

Description

Audio noise reduction method, device, equipment and medium
The present application claims priority to the Chinese patent application entitled "A method, apparatus, device, and medium for processing audio signals", application No. 202010635457.5, filed with the China National Intellectual Property Administration on 3/7/2020, which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of speech processing, and in particular, to an audio noise reduction method, apparatus, device, and medium.
Background
In the field of voice control, a voice control command is typically issued by one user at a time, and the specific command intent is determined from the audio captured of that user. However, in a scene where voice control authority is shared (e.g., an elevator scene), multiple users all have voice control authority; if several users issue voice control commands at the same time, the collected audio contains the voice control commands of all of them. The quality of each user's voice control command audio separated from the collected audio is poor, so the specific command intent of each user cannot be determined, and the voice control commands issued by the users must be confirmed one by one, resulting in low processing efficiency.
Disclosure of Invention
The invention provides an audio noise reduction method, device, equipment, and medium, used for performing noise reduction processing on the audio of an object and optimizing the speech recognition process.
The technical scheme of the invention is as follows:
according to a first aspect of the embodiments of the present invention, there is provided an audio noise reduction method, including:
acquiring a voice audio of a target object;
determining target Bark bands corresponding to the amplitudes of a target speech signal at a plurality of preset frequencies based on the transform relation between the frequency domain and the Bark domain, wherein the target speech signal is any frame of speech signal of the voice audio;
determining the audio features of the target speech signal by using a matrix composed of the determined target Bark bands;
inputting the audio features of the target speech signal into a noise reduction network model to obtain a signal-to-noise ratio value matrix corresponding to the target speech signal;
and determining the denoised target speech signal based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction method provided by an embodiment of the present invention, the determining of the denoised target speech signal based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies includes:
determining the amplitudes of the denoised target speech signal at the plurality of preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies;
and converting the amplitudes of the denoised target speech signal at the plurality of preset frequencies into the denoised target speech signal based on a preset conversion relation.
In a possible implementation manner, in the audio noise reduction method provided by an embodiment of the present invention, the determining of the amplitudes of the denoised target speech signal at the plurality of preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies includes:
determining the matrix obtained after the signal-to-noise ratio value matrix is transformed to the frequency domain, using the transform relation between the frequency domain and the Bark domain, as the noise reduction matrix of the target speech signal;
and determining the product of a first matrix, composed of the amplitudes of the target speech signal at the plurality of preset frequencies, and the noise reduction matrix of the target speech signal as a second matrix, wherein the second matrix is composed of the amplitudes of the denoised target speech signal at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction method provided by the embodiment of the present invention, the acquiring a speech audio of the target object includes:
acquiring multiple paths of audio signals, wherein the multiple paths of audio signals are acquired simultaneously by multiple audio acquisition devices arranged in the same scene, each path of audio signal contains the voice signals of multiple objects, and the target object is any one of the multiple objects;
determining the amplitude matrix of each frame of audio signal based on the predetermined amplitudes of each frame of audio signal in each path of audio signals at a plurality of preset frequencies;
and determining the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies according to the amplitude matrix of each frame of audio signal and the predetermined unmixing matrix of each frame of audio signal, and determining the voice audio of each object according to the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction method provided by the embodiment of the present invention, the unmixing matrix of each frame of audio signal is determined by using the following steps:
determining an intermediate unmixing matrix of the first frame of audio signal in each path of audio signals, and determining the unmixing matrix of the first frame of audio signal based on the intermediate unmixing matrices of the first frames of audio signal in all paths;
and determining an intermediate unmixing matrix of each non-first frame of audio signal in each path of audio signals, and determining the unmixing matrix of a non-first frame of audio signal based on the intermediate unmixing matrices of that frame of audio signal in all paths.
In a possible implementation manner, in the audio denoising method provided in an embodiment of the present invention, the denoising network model is trained by using the following steps:
obtaining a signal-to-noise ratio value matrix by point-dividing the Bark-band matrix of a noiseless audio sample by the Bark-band matrix of a pure-noise audio sample, taking the audio features of the pure-noise audio sample and the audio features of the noiseless audio sample as the input of a neural network model, and training the neural network with the obtained signal-to-noise ratio value matrix as the output;
taking the neural network model after training as the noise reduction network model;
wherein the audio features of the noiseless audio sample are determined from the Bark-band matrix of the noiseless audio sample, and the audio features of the pure-noise audio sample are determined from the Bark-band matrix of the pure-noise audio sample.
In a possible implementation manner, in the audio noise reduction method provided by an embodiment of the present invention, the determining of the audio features of the target speech signal by using the matrix composed of the determined target Bark bands includes:
calculating the average value and variance of all elements in the matrix composed of the target Bark bands;
and determining the matrix obtained after preset processing of the matrix composed of the target Bark bands as the audio feature of the target speech signal, wherein the preset processing is to subtract the average value from each element in the matrix and divide the difference by the variance.
According to a second aspect of embodiments of the present invention, there is provided an audio noise reduction apparatus, comprising:
an acquisition unit configured to acquire the voice audio of a target object;
the processing unit is used for determining the target Bark bands corresponding to the amplitudes of the target speech signal at the plurality of preset frequencies based on the transform relation between the frequency domain and the Bark domain, wherein the target speech signal is any frame of speech signal of the voice audio; determining the audio features of the target speech signal by using the matrix composed of the determined target Bark bands; inputting the audio features of the target speech signal into a noise reduction network model to obtain the signal-to-noise ratio value matrix corresponding to the target speech signal; and determining the denoised target speech signal based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
determining the amplitudes of the denoised target speech signal at the plurality of preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target speech signal at the plurality of preset frequencies;
and converting the amplitudes of the denoised target speech signal at the plurality of preset frequencies into the denoised target speech signal based on a preset conversion relation.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
determining the matrix obtained after the signal-to-noise ratio value matrix is transformed to the frequency domain, using the transform relation between the frequency domain and the Bark domain, as the noise reduction matrix of the target speech signal;
and determining the product of a first matrix, composed of the amplitudes of the target speech signal at the plurality of preset frequencies, and the noise reduction matrix of the target speech signal as a second matrix, wherein the second matrix is composed of the amplitudes of the denoised target speech signal at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
acquiring multiple paths of audio signals, wherein the multiple paths of audio signals are acquired simultaneously by multiple audio acquisition devices arranged in the same scene, each path of audio signal contains the voice signals of multiple objects, and the target object is any one of the multiple objects;
determining the amplitude matrix of each frame of audio signal based on the predetermined amplitudes of each frame of audio signal in each path of audio signals at a plurality of preset frequencies;
and determining the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies according to the amplitude matrix of each frame of audio signal and the predetermined unmixing matrix of each frame of audio signal, and determining the voice audio of each object according to the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
determining the unmixing matrix of each frame of audio signal by adopting the following steps:
determining an intermediate unmixing matrix of the first frame of audio signal in each path of audio signals, and determining the unmixing matrix of the first frame of audio signal based on the intermediate unmixing matrices of the first frames of audio signal in all paths;
and determining an intermediate unmixing matrix of each non-first frame of audio signal in each path of audio signals, and determining the unmixing matrix of a non-first frame of audio signal based on the intermediate unmixing matrices of that frame of audio signal in all paths.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
the noise reduction network model is trained by adopting the following steps:
obtaining a signal-to-noise ratio value matrix by point-dividing the Bark-band matrix of a noiseless audio sample by the Bark-band matrix of a pure-noise audio sample, taking the audio features of the pure-noise audio sample and the audio features of the noiseless audio sample as the input of a neural network model, and training the neural network with the obtained signal-to-noise ratio value matrix as the output;
taking the neural network model after training as the noise reduction network model;
wherein the audio features of the noiseless audio sample are determined from the Bark-band matrix of the noiseless audio sample, and the audio features of the pure-noise audio sample are determined from the Bark-band matrix of the pure-noise audio sample.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit is specifically configured to:
calculating the average value and variance of all elements in the matrix composed of the target Bark bands;
and determining the matrix obtained after preset processing of the matrix composed of the target Bark bands as the audio feature of the target speech signal, wherein the preset processing is to subtract the average value from each element in the matrix and divide the difference by the variance.
According to a third aspect of embodiments of the present invention, there is provided an audio noise reduction device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio noise reduction method of any of the first aspect.
According to a fourth aspect of embodiments of the present invention, there is provided a storage medium having instructions that, when executed by a processor of an audio noise reduction device, enable the audio noise reduction device to perform the audio noise reduction method of any one of the first aspects.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
in the process of denoising any frame of speech signal in the voice audio of the target object, namely the target speech signal, the signal-to-noise ratio value matrix corresponding to the target speech signal is determined using the audio features of the target speech signal and the noise reduction network model. The signal-to-noise ratio value matrix may be used to perform noise reduction on the corresponding speech signal. Based on the amplitudes of the target speech signal at the plurality of preset frequencies and the signal-to-noise ratio value matrix corresponding to the target speech signal, the denoised target speech signal can be determined, realizing the denoising of any frame of speech signal of the target object and thus of the target object's voice audio, which facilitates optimization of the speech recognition process and improves the speech recognition efficiency for the object.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention and are not to be construed as limiting the invention.
Fig. 1 is a schematic flow chart illustrating a method of processing an audio signal according to an exemplary embodiment.
Fig. 2 is a schematic flow chart diagram illustrating another method of processing an audio signal in accordance with an exemplary embodiment.
Fig. 3 is a signal flow diagram illustrating an audio noise reduction method according to an exemplary embodiment.
Fig. 4 is a schematic flow chart illustrating yet another audio noise reduction method according to an exemplary embodiment.
Fig. 5 is a schematic structural diagram illustrating an audio noise reduction apparatus according to an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a structure of an audio noise reduction apparatus according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating the structure of another audio noise reduction device according to an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The application scenarios described in the embodiments of the present invention are intended to illustrate the technical solutions of the embodiments more clearly and do not limit the technical solutions provided therein; as those skilled in the art will appreciate, with the emergence of new application scenarios, the technical solutions provided in the embodiments of the present invention are equally applicable to similar technical problems. In the description of the present invention, the term "plurality" means two or more unless otherwise specified.
The audio noise reduction method in the embodiment of the invention can be applied to scenes in which multiple persons (multiple objects) perform voice control simultaneously, for example the voice control scene of an elevator, of an intelligent automobile, or of other voice-controlled intelligent devices.
In a speech control scene, especially one in which multiple objects issue speech control commands, two implementations are possible. In the first, the audio noise reduction method provided by the embodiment of the present invention is applied during the process of separating the audio signal of each object from the collected audio; this design separates already-denoised audio of each object from the collected audio. In the second, the audio of each object is first separated from the collected audio and then subjected to noise reduction. When performing noise reduction on the audio of any object, the audio noise reduction method provided by the embodiment of the present invention may be adopted.
The following first describes the process of separating the audio signal of each object from the collected audio. Fig. 1 is a flowchart illustrating a processing method of an audio signal according to an exemplary embodiment; as shown in Fig. 1, the processing method of the audio signal includes the following steps:
step S101, acquiring multiple channels of audio signals, wherein the multiple channels of audio signals are acquired simultaneously by using multiple audio acquisition devices arranged in the same scene, and each channel of audio signal comprises a voice signal of multiple objects.
In a specific implementation, in the same scene, multiple audio acquisition devices are used to acquire multiple paths of audio signals; for example, each audio acquisition device acquires one path of audio signal, and the audio acquisition devices may be devices such as microphones. Consider an elevator scene: a passenger controls the elevator to stop at a certain floor through a voice control command. When several passengers issue voice control commands simultaneously, the voice of each passenger cannot be recognized because the collected audio mixes the voices of all passengers. With the audio signal processing method provided by the embodiment of the present invention, the voice of each passenger can be determined and semantically recognized, so that the voice control command of each passenger can be determined. It should be noted that the audio signal processing method provided by the embodiment of the present invention may be applied to voice control scenes with or without a wake-up word.
In a practical application scenario, in order to improve the quality of the determined voice signal of each occupant, constraints on the capture conditions of the audio acquisition devices may be added; for example, the audio acquisition devices capture the audio signals at a preset sampling frequency (e.g., 16000 Hz). To facilitate the description of the audio signal processing method provided by the embodiment of the present invention, the m-th path among the acquired multiple paths of audio signals is denoted x_m.
Step S102, determining the amplitude matrix of each frame of audio signal based on the predetermined amplitudes of each frame of audio signal in each path of audio signals at a plurality of preset frequencies.
In specific implementation, it can be determined from the acquisition frequency that each path of audio signal contains the same number of frames, and the n-th frame of the m-th path of audio signal is denoted x_m(n). Using the short-time Fourier transform, the amplitude of each frame of audio signal in each path at a plurality of preset frequencies can be determined; for example, the amplitude of the n-th frame of the m-th path at the k-th frequency (frequency point) among the plurality of preset frequencies is denoted X_m(k, n). It should be noted that the amplitudes of the audio signal at the plurality of preset frequencies after transforming from the time domain to the frequency domain are actually complex numbers (comprising a real part and an imaginary part). From the amplitudes of each frame of audio signal at the plurality of preset frequencies in every path, the amplitude matrix X of each frame of audio signal can be determined, for example (reconstructed from the lost figure):
X(k, n) = [X_1(k, n), X_2(k, n), …, X_num(k, n)]^T
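As an illustrative aside (not part of the original disclosure), the following minimal Python sketch shows one way such per-frame amplitudes could be computed with a short-time Fourier transform; the sampling rate, frame length, channel count, and variable names are assumptions:

    import numpy as np
    from scipy.signal import stft

    fs = 16000                        # preset sampling frequency mentioned above
    num = 4                           # number of audio acquisition devices (assumed)
    x = np.random.randn(num, 2 * fs)  # stand-in for the num recorded signals x_m

    # Zxx[m, k, n] corresponds to X_m(k, n): the complex amplitude (real and
    # imaginary part) of the n-th frame of the m-th path at the k-th frequency.
    f, t, Zxx = stft(x, fs=fs, nperseg=512)

    # Amplitude matrix X(k, n): one column vector over all paths.
    k, n = 10, 5
    X_kn = Zxx[:, k, n]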
Step S103, determining the amplitude of each frame of voice signal of each object at a plurality of preset frequencies according to the amplitude matrix of each frame of audio signal and the predetermined unmixing matrix of each frame of audio signal.
In specific implementation, using the amplitude matrix X of each frame of audio signal and the predetermined unmixing matrix W of each frame of audio signal, the product of the conjugate transpose W^H of the unmixing matrix W and the amplitude matrix of each frame of audio signal is computed: W^H X = Y.
Then, based on the amplitude matrix Y of each frame of speech signal and the number of paths of the multi-path audio, the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies are determined.
Generally, the number of paths of the acquired multi-path audio signals is greater than or equal to the number of objects. The amplitude matrix Y of each frame of audio signal contains the amplitudes of each frame of speech signal of all objects at the plurality of preset frequencies.
If the number num of the acquired paths of audio signals is larger than the number p of actual objects, the amplitude matrix of each frame of speech signal is (reconstructed from the lost figure):
Y(k, n) = [Y_1(k, n), …, Y_p(k, n), Y_{p+1}(k, n), …, Y_num(k, n)]^T
wherein Y_1(k, n) is the amplitude matrix of the speech signal of the first object, Y_p(k, n) is the amplitude matrix of the speech signal of the p-th object, and Y_{p+1}(k, n) to Y_num(k, n) are amplitude matrices of the speech leakage signals of the p objects.
If the number num of the acquired paths of audio signals is equal to the number p of actual objects, the amplitude matrix of each frame of speech signal is:
Y(k, n) = [Y_1(k, n), …, Y_p(k, n)]^T
Y_1(k, n) is the amplitude matrix of the speech signal of the first object, and Y_p(k, n) is the amplitude matrix of the speech signal of the p-th object.
In the matrix calculation process, the number of rows and columns of the matrix of amplitudes X(k, n) of the n-th frame at the k-th frequency is fixed, and the number of rows and columns of the unmixing matrix W(k, n) of the n-th frame at the k-th frequency is fixed. The number of rows and columns of the speech signal amplitude matrix Y(k, n) of the n-th frame at the k-th frequency can therefore be determined from the matrix calculation between the amplitude matrix of the audio signal and the unmixing matrix. Thus, the amplitude matrix of the speech signal of each object can be read off block by block from the amplitude matrix of each frame of speech signal: for example, if for the n-th frame the amplitude matrix Y(k, n) of the speech signal at the k-th frequency has d rows per object, the elements of the first d rows of Y constitute the amplitude matrix of the speech signal of the first object, and the elements from the (d+1)-th row to the 2d-th row constitute the amplitude matrix of the speech signal of the second object. In this way, the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies can be determined from the amplitude matrix Y of each frame of audio signal. Here there is no need to distinguish which blocks of Y are amplitude matrices of actual objects' speech signals and which are amplitude matrices of speech leakage signals, which reduces the amount and complexity of data calculation.
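As an illustration of the separation step above (a sketch under assumed shapes, not the patent's exact data layout), the per-frequency product W^H X = Y and the per-object split might look as follows in Python:

    import numpy as np

    num, K, N = 4, 257, 100  # paths, preset frequencies, frames (assumed)
    rng = np.random.default_rng(0)
    W = rng.standard_normal((K, N, num, num)) + 1j * rng.standard_normal((K, N, num, num))
    X = rng.standard_normal((num, K, N)) + 1j * rng.standard_normal((num, K, N))

    # Y(k, n) = W^H(k, n) X(k, n) at every frequency point k and frame n.
    Y = np.einsum('knji,jkn->ikn', W.conj(), X)

    p = 2            # number of actual objects
    speech = Y[:p]   # amplitudes of each object's speech signal
    leakage = Y[p:]  # amplitudes of the speech leakage signals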
In practical application, the determined amplitudes of each frame of speech signal of each object can be transformed into the time domain, and speech recognition can be used to determine whether the result is the voice audio of an object or the voice audio of an invalid object (i.e., audio corresponding to the amplitude matrix of a speech leakage signal). Alternatively, in combination with image recognition, the number p of objects is determined by capturing an image containing the multiple objects, and the amplitude matrices of the speech signals of the objects are taken to be the first p among the determined num amplitude matrices of speech signals.
In one possible embodiment, the predetermined unmixing matrix of each frame of audio signal is determined by the following steps:
determining an intermediate unmixing matrix of the first frame of audio signal in each path of audio signals, and determining the unmixing matrix of the first frame of audio signal based on the intermediate unmixing matrices of the first frames of audio signal in all paths;
and determining an intermediate unmixing matrix of each non-first frame of audio signal in each path of audio signals, and determining the unmixing matrix of a non-first frame of audio signal based on the intermediate unmixing matrices of that frame of audio signal in all paths.
In specific implementation, in the unmixing matrix W of each frame of audio signal, the unmixing matrix at the n-th frame and the k-th frequency point is denoted (reconstructed from the lost figure):
W(k, n) = [w_1(k, n), w_2(k, n), …, w_num(k, n)]
w_m(k, n) denotes the intermediate unmixing matrix of the n-th frame of the m-th path of audio signal at the k-th frequency (frequency point) among the plurality of preset frequencies, and w_m(k, 1) denotes the intermediate unmixing matrix of the first frame of audio signal (n = 1) in the m-th path. The unmixing matrix of the first frame of audio signal can be determined from the intermediate unmixing matrices of the first frames in all paths:
W(k, 1) = [w_1(k, 1), w_2(k, 1), …, w_num(k, 1)]
The intermediate unmixing matrix of a non-first frame of audio signal (n ≠ 1) in the m-th path is denoted w_m(k, n); based on the intermediate unmixing matrices of the n-th frames in all paths, the unmixing matrix of the n-th frame of audio signal can be determined:
W(k, n) = [w_1(k, n), w_2(k, n), …, w_num(k, n)]
This implements the determination of the unmixing matrix for each frame of audio signal. It should be noted that, to distinguish the intermediate unmixing matrix from the unmixing matrix: the intermediate unmixing matrix in the embodiment of the present invention belongs to one frame of audio signal within one path, while the unmixing matrix is the matrix formed by the intermediate unmixing matrices of the same frame order across all paths. That is, an intermediate unmixing matrix contains information of one frame of one path of audio signal, while the unmixing matrix contains information of the same frame order across the multiple paths of audio signals.
In a practical application scenario, determining an intermediate unmixing matrix of a first frame audio signal in each path of audio signals includes:
for each path of audio signal, determining a preset matrix as the intermediate unmixing matrix of the first frame of audio signal in that path.
In specific implementation, the intermediate unmixing matrix of the first frame of each path of audio signal may be set to an identity matrix, that is, the preset matrix is set as an identity matrix. For example, in the intermediate unmixing matrix w_m(k, 1) of the first frame of audio signal (n = 1) in the m-th path, the amplitude at each frequency point is 1; if the number of preset frequency points is k = 3, then (reconstructed from the lost figure):
[w_m(1, 1), w_m(2, 1), w_m(3, 1)] = [1, 1, 1]
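A one-line sketch of this initialization (array shapes assumed; the preset matrix taken as the identity, as stated above):

    import numpy as np

    num, K = 4, 257  # paths and preset frequency points (assumed)
    # W(k, 1) is the num x num identity matrix at every frequency point k.
    W1 = np.tile(np.eye(num, dtype=complex), (K, 1, 1))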
In a practical application scenario, determining an intermediate unmixing matrix of a non-first-frame audio signal in each audio signal includes:
aiming at the audio signals of the non-first frame in each path of audio signals:
determining the signal energy of the current frame audio signal according to the amplitudes of the current frame audio signal at a plurality of preset frequencies and the unmixing matrix of the previous frame audio signal;
determining a covariance matrix of the current frame audio signal based on the signal energy of the current frame audio signal, the amplitudes of the current frame audio signal at a plurality of preset frequencies, and the covariance matrix of the previous frame audio signal;
determining the intermediate unmixing matrix of the current frame of audio signal according to the covariance matrix of the current frame of audio signal and the unmixing matrix of the previous frame of audio signal;
the covariance matrix of the first frame audio signal in each channel is determined based on the preset matrix and the amplitudes of the first frame audio signal in each channel of audio signals at a plurality of preset frequencies.
In specific implementation, when determining the intermediate unmixing matrix of each non-first frame of audio signal in each path, the following processing is performed for the current frame (the n-th frame) of the m-th path of audio signal:
From the amplitudes X_m(k, n) of the n-th frame of audio signal at the plurality of preset frequencies, the intermediate unmixing matrix w_m(k, n-1) of the previous frame of the m-th path can be taken from the unmixing matrix W(k, n-1) of the previous frame of audio signal, and the signal energy of the current frame (n ≠ 1) of each path is then determined by a formula of the following form (reconstructed from the lost figure):
r_m(n) = Σ_k | w_m^H(k, n-1) X(k, n) |^2
wherein w_m^H(k, n-1) is the conjugate transpose of w_m(k, n-1).
Based on the signal energy r_m(n) of the n-th frame of audio signal, the amplitudes X_m(k, n) of the n-th frame at the plurality of preset frequencies, and the covariance matrix V_m(k, n-1) of the previous frame of audio signal, the covariance matrix of the n-th frame of the m-th path of audio signal is determined by a formula of the form:
V_m(k, n) = a · V_m(k, n-1) + (1 - a) · G(r_m(n)) · X(k, n) X^H(k, n)
where a is a preset smoothing coefficient, and in practical application scenarios G(r_m(n)) may take the value 1.
According to the covariance matrix of the n-th frame of the m-th path of audio signal and the unmixing matrix W(k, n-1) of the previous frame of audio signal, the intermediate unmixing matrix w_m(k, n) of the n-th frame of the m-th path is determined by the formula:
w_m(k, n) = (W(k, n-1) V_m(k, n))^{-1} e_k
It should be noted that, since the intermediate unmixing matrix of the first frame of each path is determined from the preset matrix, the covariance matrix of the first frame of each path is determined based on the preset matrix and the amplitudes of the first frame of audio signal in each path at the plurality of preset frequencies; when determining this covariance matrix, the smoothing coefficient a may take the value 0.
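The following Python sketch renders the recursion above executable under stated assumptions (array shapes, G(r) = 1, and e taken as a unit vector selecting the m-th column); since the patent's formula images are unavailable, this is one plausible reading, not an authoritative implementation:

    import numpy as np

    def update_unmixing(W_prev, V_prev, X_n, a=0.98):
        """One frame of the per-frequency unmixing update sketched above.

        W_prev: (K, num, num), unmixing matrices W(k, n-1); column m is w_m(k, n-1).
        V_prev: (num, K, num, num), covariance matrices V_m(k, n-1).
        X_n:    (num, K), complex amplitudes X(k, n) of the current frame.
        """
        K, num, _ = W_prev.shape
        W_new, V_new = W_prev.copy(), V_prev.copy()
        for m in range(num):
            # Signal energy r_m(n) = sum_k |w_m^H(k, n-1) X(k, n)|^2
            r = sum(abs(W_prev[k, :, m].conj() @ X_n[:, k]) ** 2 for k in range(K))
            g = 1.0  # G(r_m(n)) taken as 1, as stated in the text; in general it may depend on r
            for k in range(K):
                xx = np.outer(X_n[:, k], X_n[:, k].conj())
                # V_m(k, n) = a V_m(k, n-1) + (1 - a) G(r_m(n)) X(k, n) X^H(k, n)
                V_new[m, k] = a * V_prev[m, k] + (1 - a) * g * xx
                e = np.zeros(num)
                e[m] = 1.0
                # w_m(k, n) = (W(k, n-1) V_m(k, n))^{-1} e
                W_new[k, :, m] = np.linalg.solve(W_prev[k] @ V_new[m, k], e)
        return W_new, V_new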
And step S104, determining the voice audio frequency of each object according to the amplitude values of each frame of voice signal of each object at a plurality of preset frequencies.
In specific implementation, the method of converting the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies into the voice audio of each object may be determined according to the method used in step S102 to determine the amplitudes of each frame of audio signal in each path at the plurality of preset frequencies.
For example, based on a preset conversion relation, the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies are converted into each frame of speech signal of each object. In a practical application scenario, if the amplitudes of each frame of audio signal in each path at the plurality of preset frequencies were determined by the short-time Fourier transform, the preset conversion relation may be the inverse short-time Fourier transform; the amplitudes Y_p(k, n) of each frame of speech signal of each object at the plurality of preset frequencies are thus converted into each frame of speech signal y_p(n) through the preset conversion relation. The frames of speech signal y_p(n) of each object are spliced in frame time order to obtain the voice audio of each object, y_p = {y_p(1), …, y_p(n)}.
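Assuming the forward transform was scipy's STFT, the splicing step reduces to an inverse STFT call (a sketch; shapes and parameters are illustrative):

    import numpy as np
    from scipy.signal import istft

    fs, K, N = 16000, 257, 100
    Yp = np.zeros((K, N), dtype=complex)  # stand-in for Y_p(k, n) of one object

    # The inverse short-time Fourier transform as the preset conversion
    # relation; istft also splices the frames y_p(n) in time order.
    _, yp = istft(Yp, fs=fs, nperseg=512)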
Because everyday environments are full of noise, when multiple audio acquisition devices are used to acquire the multiple paths of audio signals, the noise in the environment is also captured. To improve the quality of the voice audio of each object, noise reduction processing may be performed on the voice audio after it is determined. Noise reduction may also be performed after determining the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies, before determining the voice audio of each object.
In step S103, the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies are determined, that is, the speech signal of each object is obtained. Before step S104 is executed, noise reduction may be performed on the speech signal of each object, for example by denoising the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies; in this way, noise reduction is applied to the voice audio of each object separated from the audio signals of multiple objects.
When audio noise reduction is performed on the voice audio of the target object, the target Bark bands corresponding to the amplitudes of the target speech signal at the plurality of preset frequencies may be determined based on the transform relation between the frequency domain and the Bark domain, where the target speech signal is any frame of speech signal of the target object's voice audio and the target object may be any of the multiple objects.
In a specific implementation, after the voice audio of each object is determined, noise reduction is performed on it. Any one of the multiple objects can be taken as the target object, and any frame n of speech signal y_p(n) of the target object's voice audio taken as the target speech signal. The target speech signal is transformed to the frequency domain (e.g., by short-time Fourier transform) to obtain its amplitudes Y_p(k, n) at the plurality of preset frequencies, and the Bark bands (target Bark bands) corresponding to the amplitudes Y_p(k, n) are determined using the transform relation between the frequency domain and the Bark domain.
If noise reduction is instead performed after determining the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies but before determining the voice audio of each object, the Bark bands (target Bark bands) corresponding to the amplitudes Y_p(k, n) of the target speech signal at the plurality of preset frequencies can be determined directly using the transform relation between the frequency domain and the Bark domain, where the target speech signal is any frame of speech signal of any object.
For example, the Bark band corresponding to Y_p(1, n) is B1, the Bark band corresponding to Y_p(2, n) is B2, the Bark band corresponding to Y_p(3, n) is B3, …, and the Bark band corresponding to Y_p(k, n) is Bm; a matrix B composed of the target Bark bands can then be obtained: B = [B1, B2, B3, …, Bm].
Then, the audio features of the target speech signal are determined using the matrix composed of the determined target Bark bands. In the frequency-domain-to-Bark-domain transform relation, if frequencies k1 and k2 belong to the same band, their corresponding Bark bands are the same, so adjacent entries of the target Bark band matrix may repeat, e.g., [B1, B1, B3, …, Bm]. Only one copy of a repeated Bark band is kept and the duplicates are deleted, e.g., [B1, B1, B3, …, Bm] becomes [B1, B3, …, Bm]. The matrix B with duplicates deleted is either determined directly as the audio feature of the target speech signal, or input into a preset high-pass filter for filtering, with the filtered matrix used as the audio feature of the target speech signal.
The average value a and the variance s of all elements in the matrix B (after duplicate Bark bands have been deleted) are computed. The matrix B is then subjected to preset processing, which may be to subtract the average value from each element and divide the difference by the variance; that is, any element of matrix B becomes (b_{i,j} - a)/s. The matrix M obtained after this preset processing may be determined as the audio feature of the target speech signal.
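The feature-extraction steps above can be sketched in Python as follows; the Bark formula, the use of magnitude values, and the band lookup are assumptions, since the patent only fixes the overall steps:

    import numpy as np

    def bark(f_hz):
        # A common Bark-scale approximation (an assumption, not from the patent).
        return 13 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

    def bark_feature(Y_kn, freqs_hz):
        bands = np.floor(bark(freqs_hz)).astype(int)    # Bark band per frequency
        _, first = np.unique(bands, return_index=True)  # keep one copy of each band
        B = np.abs(Y_kn)[np.sort(first)]                # matrix B, duplicates deleted
        a, s = B.mean(), B.var()                        # average a and variance s
        return (B - a) / s                              # preset processing (b - a) / s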
The application provides a noise reduction network model, and the training process of the noise reduction network model is as follows:
a signal-to-noise ratio value matrix is obtained by point-dividing the Bark-band matrix of a noiseless audio sample by the Bark-band matrix of a pure-noise audio sample; the audio features of the pure-noise audio sample and the audio features of the noiseless audio sample are taken as the input of a neural network model, and the neural network is trained with the obtained signal-to-noise ratio value matrix as the output;
the trained neural network model is taken as the noise reduction network model;
wherein the audio features of the noiseless audio sample are determined from the Bark-band matrix of the noiseless audio sample, and the audio features of the pure-noise audio sample are determined from the Bark-band matrix of the pure-noise audio sample.
In specific implementation, the noiseless audio sample is transformed into the frequency domain, and the Bark-band matrix corresponding to the noiseless audio sample is determined according to the transform relation between the frequency domain and the Bark domain. The Bark-band matrix M1, with duplicate Bark bands deleted, is used as the audio feature of the noiseless audio sample; alternatively, the matrix M1 is input into a preset high-pass filter for filtering, and the filtered matrix is used as the audio feature of the noiseless audio sample. In the same manner, the pure-noise audio sample is transformed into the frequency domain, its corresponding Bark-band matrix is determined according to the transform relation between the frequency domain and the Bark domain, and the Bark-band matrix M2, with duplicate Bark bands deleted, is used as the audio feature of the pure-noise audio sample; alternatively, M2 is input into the preset high-pass filter and the filtered matrix is used. Note that a noiseless audio sample refers to pure speech audio without noise, and a pure-noise audio sample refers to audio containing only noise.
The signal-to-noise ratio value matrix is obtained by point-dividing the Bark-band matrix M1 of the noiseless audio sample by the Bark-band matrix M2 of the pure-noise audio sample; that is, each element M1_{i,j} of matrix M1 is divided by the corresponding element M2_{i,j} of matrix M2, and the resulting ratios Z_{i,j} form the signal-to-noise ratio value matrix Z.
The signal-to-noise ratio value matrix Z, the audio features of the pure-noise audio sample, and the audio features of the noiseless audio sample are used in training the neural network model: the features are taken as input and Z as the output of the model, or training is targeted at making the output matrix approximate Z. The trained neural network model is used as the noise reduction network model in the noise reduction process. A target iteration count can also be set; when the number of iterations during training reaches the target count, training of the neural network model is deemed complete, and the trained model is used as the noise reduction network model. In a practical application scenario, the neural network model may be a Long Short-Term Memory network (LSTM).
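As an illustration of this training setup, a sketch with PyTorch follows (the patent names only an LSTM; the framework, shapes, loss, optimizer, and the concatenation of the two feature sets are assumptions):

    import torch
    import torch.nn as nn

    n_bands = 22  # Bark bands per frame (assumed)

    class DenoiseNet(nn.Module):
        # LSTM regressor from per-frame features to the ratio matrix Z.
        def __init__(self, n_in, n_out):
            super().__init__()
            self.lstm = nn.LSTM(n_in, 64, batch_first=True)
            self.head = nn.Linear(64, n_out)

        def forward(self, feats):  # feats: (batch, frames, n_in)
            h, _ = self.lstm(feats)
            return self.head(h)

    # Stand-in Bark-band matrices of the two kinds of samples.
    M1 = torch.rand(8, 100, n_bands)        # noiseless audio samples
    M2 = torch.rand(8, 100, n_bands) + 0.1  # pure-noise audio samples
    Z = M1 / M2                             # Z_{i,j} = M1_{i,j} / M2_{i,j}

    # The text lists the features of both samples as input; concatenating
    # them is one possible reading.
    feats = torch.cat([M1, M2], dim=-1)

    model = DenoiseNet(2 * n_bands, n_bands)
    opt = torch.optim.Adam(model.parameters())
    for _ in range(100):  # target iteration number (illustrative)
        loss = nn.functional.mse_loss(model(feats), Z)
        opt.zero_grad()
        loss.backward()
        opt.step()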
The audio features of the target speech signal are input into the noise reduction network model to obtain the signal-to-noise ratio value matrix Z_p(n) corresponding to the target speech signal (the signal-to-noise ratio value matrix of the n-th frame of speech signal of object p).
Based on the signal-to-noise ratio value matrix Z_p(n) corresponding to the target speech signal and the amplitudes Y_p(k, n) of the target speech signal at the plurality of preset frequencies, the amplitudes of the denoised target speech signal at the plurality of preset frequencies may be determined.
For example, based on the signal-to-noise ratio value matrix Z_p(n) corresponding to the target speech signal, a noise reduction matrix corresponding to the target speech signal may be determined. Then, using the noise reduction matrix corresponding to the target speech signal and the amplitudes Y_p(k, n) of the target speech signal at the plurality of preset frequencies, the amplitudes of the denoised target speech signal at the plurality of preset frequencies may be determined.
In specific implementation, when determining the noise reduction matrix from Z_p(n), the transform relation between the frequency domain and the Bark domain can be used to convert the signal-to-noise ratio value matrix Z_p(n) into the frequency domain; the matrix T so obtained is determined as the noise reduction matrix of the target speech signal, and the elements of T(k, n) are mask values at the plurality of preset frequencies of the target speech signal. Using the noise reduction matrix of the target speech signal and the amplitudes Y_p(k, n), a first matrix Y composed of the amplitudes Y_p(k, n) is multiplied by the noise reduction matrix T to obtain a second matrix C, whose elements are the amplitudes of the denoised target speech signal at the plurality of preset frequencies.
Through the above process, the target speech signal is denoised in the frequency domain. Further, the amplitudes of the denoised target speech signal at the plurality of preset frequencies are converted into the denoised target speech signal based on the preset conversion relation. In a practical application scenario, if the amplitudes of each frame of audio signal at the plurality of preset frequencies were determined by short-time Fourier transform, the preset conversion relation may be the inverse short-time Fourier transform, so the amplitudes of the denoised target speech signal at the plurality of preset frequencies are converted into the denoised target speech signal c_p(n).
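A sketch of the masking and reconstruction steps above (the expansion of the Bark-domain matrix to per-frequency mask values by band lookup is an assumption):

    import numpy as np
    from scipy.signal import istft

    def denoise_amplitudes(Yp, Z, bands):
        # Yp: (K, N) amplitudes Y_p(k, n); Z: (n_bands, N) ratio values Z_p(n);
        # bands: (K,) Bark band index of each frequency k.
        T = Z[bands, :]  # noise reduction matrix T(k, n): one mask per frequency
        return Yp * T    # second matrix C: denoised amplitudes

    K, N, n_bands = 257, 100, 22
    Yp = np.zeros((K, N), dtype=complex)                 # stand-in amplitudes
    Z = np.ones((n_bands, N))                            # stand-in ratio matrix
    bands = np.minimum(np.arange(K) // 12, n_bands - 1)  # stand-in band lookup
    C = denoise_amplitudes(Yp, Z, bands)
    _, cp = istft(C, fs=16000, nperseg=512)              # denoised speech c_p(n)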
The denoised frames of speech signal c_p(n) of each object are spliced in frame time order to obtain the denoised voice audio of each object, c_p = {c_p(1), …, c_p(n)}. In the embodiments of the present application, not only is the voice audio of each object separated from the acquired multi-path audio signals, but background noise or diffuse noise in the separated voice audio is also removed, enhancing the voice audio; in a voice-controlled elevator scenario or a voice-controlled intelligent automobile scenario, the enhanced voice audio can improve the control effect on the elevator or the automobile.
Fig. 2 is a schematic flow diagram of an exemplary audio separation method, which, as shown in Fig. 2, includes:
In step S201, multiple paths of audio signals are acquired.
In specific implementation, multiple paths of audio signals are acquired using multiple audio acquisition devices arranged in the same scene, and each path of audio signal mixes the voice signals of multiple objects. The multiple objects may be multiple speakers, and the voice signals may be the voices of those speakers. Fig. 3 is a schematic diagram of the signal flow of the audio separation method; the acquired m-th path of audio signal is denoted x_m, and each frame of each path is denoted x_m(n).
Step S202, determining the amplitude of each frame of audio signal in each path of audio signal at a plurality of preset frequencies, and determining the amplitude matrix of each frame of audio signal.
In specific implementation, as shown in Fig. 3, each frame of each path of audio signal is transformed by the short-time Fourier transform 301 shown in Fig. 3 to determine its amplitudes X_m(k, n) at a plurality of preset frequencies in the frequency domain, forming the amplitude matrix of each frame of audio signal (reconstructed from the lost figure):
X(k, n) = [X_1(k, n), X_2(k, n), …, X_num(k, n)]^T
In step S203, a downmix matrix for each frame of audio signal is determined.
In specific implementation, determining the unmixing matrix of each frame of audio signal is an iterative process; that is, determining the unmixing matrix of the current frame of audio signal requires the unmixing matrix of the previous frame of audio signal. For any frequency point k, the unmixing matrix W(k, n) of the n-th frame of audio signal is determined from the intermediate unmixing matrices w_m(k, n) of the n-th frames of audio signal in each path (reconstructed from the lost figure):
W(k, n) = [w_1(k, n), w_2(k, n), …, w_num(k, n)]
If the n-th frame of audio signal is a non-first frame (n ≠ 1), the intermediate unmixing matrix of the n-th frame of audio signal in each path is determined by the following process:
determining the signal energy of the current frame audio signal according to the amplitudes of the current frame audio signal at a plurality of preset frequencies and the unmixing matrix of the previous frame audio signal;
determining a covariance matrix of the current frame audio signal based on the signal energy of the current frame audio signal, the amplitudes of the current frame audio signal at a plurality of preset frequencies, and the covariance matrix of the previous frame audio signal;
and determining the intermediate unmixing matrix of the current frame of audio signal according to the covariance matrix of the current frame of audio signal and the unmixing matrix of the previous frame of audio signal.
In specific implementation, from the amplitudes X_m(k, n) of the n-th frame of audio signal at the plurality of preset frequencies, the intermediate unmixing matrix w_m(k, n-1) of the previous frame of the m-th path can be taken from the unmixing matrix W(k, n-1) of the previous frame of audio signal, and the signal energy of the current frame (n ≠ 1) of each path is then determined by a formula of the following form (reconstructed from the lost figure):
r_m(n) = Σ_k | w_m^H(k, n-1) X(k, n) |^2
Based on the signal energy r_m(n) of the n-th frame of audio signal, the amplitudes X_m(k, n) of the n-th frame at the plurality of preset frequencies, and the covariance matrix V_m(k, n-1) of the previous frame of audio signal, the covariance matrix of the n-th frame of the m-th path of audio signal is determined by a formula of the form:
V_m(k, n) = a · V_m(k, n-1) + (1 - a) · G(r_m(n)) · X(k, n) X^H(k, n)
where a is a preset smoothing coefficient, and in practical application scenarios G(r_m(n)) may take the value 1.
According to the covariance matrix of the n-th frame of the m-th path of audio signal and the unmixing matrix W(k, n-1) of the previous frame of audio signal, the intermediate unmixing matrix w_m(k, n) of the n-th frame of the m-th path is determined by the formula:
w_m(k, n) = (W(k, n-1) V_m(k, n))^{-1} e_k
If the n-th frame of audio signal is the first frame (n = 1), a preset matrix is determined as the intermediate unmixing matrix of the first frame of audio signal in each path; for example, the preset matrix may be an identity matrix. The covariance matrix of the first frame of audio signal in each path is determined based on the preset matrix and the amplitudes of the first frame of audio signal in each path at the plurality of preset frequencies, and the smoothing coefficient a may take the value 0 when determining this covariance matrix.
Step S204, determining the amplitude of each frame of voice signal of each object at a plurality of preset frequencies.
In specific implementation, for any frequency point k, the separated amplitudes are obtained from the amplitude matrix X(k, n) of the nth frame audio signal and the unmixing matrix W(k, n) of the nth frame audio signal as Y(k, n) = W(k, n) X(k, n). Generally, the number num of acquired channels of audio signals is greater than or equal to the number of objects. The amplitude matrix Y of each frame of audio signal contains the amplitudes of each frame of speech signal of all objects at the plurality of preset frequencies:

Y(k, n) = [Y_1(k, n), Y_2(k, n), ..., Y_num(k, n)]^T
if the number of objects is 2, Y1(k, n) is the amplitude matrix of the speech signal of the first object, Y2(k, n) is a matrix of magnitudes of speech signals of the second object.
Step S205, determining the speech audio of each object according to the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies.
In specific implementation, the amplitudes of the nth frame of speech signal of each object at the plurality of preset frequencies can be converted into the nth frame of speech audio of each object by the inverse short-time Fourier transform 302 shown in fig. 3. The speech audio of each object is composed of all of its frames of speech audio, which achieves separation of the speech audio of each object from the audio signals in which a plurality of objects are mixed.
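Putting steps S202 to S205 together, the following is a minimal sketch of the separation path, assuming scipy's STFT and inverse STFT stand in for transforms 301 and 302, and assuming unmixing matrices W of shape (N, K, M, M), for example produced frame by frame with update_unmixing above; all names and shapes are illustrative assumptions.

    import numpy as np
    from scipy.signal import stft, istft

    def separate_objects(channels, fs, W, nperseg=512):
        # channels: (M, T) multi-channel recording; W: (N, K, M, M) per-frame,
        # per-bin unmixing matrices.
        _, _, X = stft(channels, fs=fs, nperseg=nperseg)  # (M, K, N)
        X = X.transpose(2, 1, 0)                          # (N, K, M)
        # Step S204: Y(k, n) = W(k, n) X(k, n) for every bin and frame.
        Y = np.einsum('nkij,nkj->nki', W, X)
        # Step S205: inverse short-time Fourier transform 302 per object.
        _, y = istft(Y.transpose(2, 1, 0), fs=fs, nperseg=nperseg)
        return y                                          # (M, T'): audio per object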
FIG. 4 shows a schematic flow diagram of a method of audio noise reduction according to an exemplary embodiment, comprising the steps of:
In step S401, an audio signal of a target object is acquired.
The audio noise reduction method provided by the embodiment of the application may use the audio separation method described above to obtain the audio signal of at least one object, or may obtain the audio signal of at least one object in other ways. Noise reduction is then performed on the audio signal of any one object (the target object) of the at least one object; the noise reduction processing may be performed on each frame of the audio signal of the target object, for example on the amplitudes of each frame of the audio signal at the plurality of preset frequencies.
In one possible implementation, the audio signal of each object may be obtained by performing the operations in steps S201 to S205 described above; the amplitudes of each frame of speech signal of each object at the plurality of preset frequencies may likewise be obtained by performing the operations in steps S201 to S204 described above.
Step S402, determining a target Bark band corresponding to the amplitudes of the target voice signal at a plurality of preset frequencies based on the transformation relation between the frequency domain and the Bark domain, wherein the target voice signal is any frame of voice signal of the voice audio of the target object, and the target object is any one of the plurality of objects.
In specific implementation, the target object is any one object p of the plurality of objects, the target voice signal is any frame n of voice signal of the voice audio of the target object, and the amplitudes of the target voice signal at the plurality of preset frequencies can be recorded as Y_p(k, n). The Bark band (target Bark band) corresponding to the amplitudes Y_p(k, n) of the target voice signal at the plurality of preset frequencies is determined through the transformation relation between the frequency domain and the Bark domain.
And step S403, determining the audio characteristics of the target voice signal by using the matrix formed by the determined target Bark bands.
In specific implementation, repeated Bark bands in the determined target Bark bands are removed and a matrix is formed, and the matrix is determined as the audio characteristic of the target voice signal. Alternatively, the average value and variance of all elements in the matrix can be calculated, and the matrix obtained by subtracting the average value from each element in the matrix and dividing the difference by the variance is determined as the audio characteristic of the target voice signal.
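As a sketch of steps S402 and S403, the following assumes a widely used frequency-to-Bark approximation (the patent does not specify which transformation relation is used) and sums the magnitudes falling into the same Bark band when removing repeats; both choices, and all names, are illustrative assumptions.

    import numpy as np

    def hz_to_bark(f):
        # A common Zwicker-style frequency-to-Bark mapping; the exact
        # transformation relation is an assumption here.
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def bark_features(mag, freqs):
        # mag: (K,) amplitudes |Y_p(k, n)| of the target frame;
        # freqs: (K,) the preset frequencies in Hz.
        bands = np.floor(hz_to_bark(freqs)).astype(int)   # target Bark band per bin
        uniq = np.unique(bands)
        # Remove repeated Bark bands: one pooled value per distinct band.
        feat = np.array([mag[bands == b].sum() for b in uniq])
        # Step S403 normalization: subtract the mean, divide by the variance.
        return (feat - feat.mean()) / np.maximum(feat.var(), 1e-12), bands

The returned per-bin band indices are kept so that a Bark-domain mask can later be expanded back to the frequency domain.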
Step S404, inputting the audio characteristics of the target voice signal into the noise reduction network model to obtain a signal-to-noise ratio value matrix corresponding to the target voice signal.

In specific implementation, after the audio characteristics of the target voice signal are input into a pre-trained noise reduction network model, the signal-to-noise ratio value matrix corresponding to the target voice signal is output.
Step S405, determining the denoised target voice signal based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the preset frequencies.

In specific implementation, based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the plurality of preset frequencies, the amplitudes of the denoised target voice signal at the plurality of preset frequencies can be determined. For example, using the transformation relation between the frequency domain and the Bark domain, the matrix obtained by transforming the signal-to-noise ratio value matrix into the frequency domain is determined as the noise reduction matrix of the target voice signal; the product of a first matrix, formed by the amplitudes of the target voice signal at the plurality of preset frequencies, and the noise reduction matrix of the target voice signal is determined as a second matrix, formed by the amplitudes of the denoised target voice signal at the plurality of preset frequencies.

Then, based on a preset conversion relation, the amplitudes of the denoised target voice signal at the plurality of preset frequencies are converted into the denoised target voice signal. For example, if the amplitudes of each frame of audio signal in each channel of audio signal at the plurality of preset frequencies were obtained by a short-time Fourier transform, the preset conversion relation may be the inverse of that short-time Fourier transform, through which the amplitudes of the denoised target voice signal at the plurality of preset frequencies are converted into the denoised target voice signal.
In a possible implementation manner, the denoised voice signals of the frames are spliced according to their time order, so that the denoised voice audio of the target object is obtained.
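The following is a minimal sketch of step S405 and the splicing step, assuming the Bark-to-frequency expansion simply gives every frequency bin the value of the Bark band it belongs to, taking the product with the noise reduction matrix as element-wise, and reusing the bands array from bark_features above; the overlap-add inside scipy's inverse STFT performs the frame splicing. All of these choices are illustrative assumptions.

    import numpy as np
    from scipy.signal import istft

    def denoise(Yp, bands, mask_bark, fs, nperseg=512):
        # Yp: (K, N) complex spectrum of the target object's voice signal;
        # bands: (K,) Bark band index of each bin; mask_bark: (B, N) per-frame
        # signal-to-noise ratio values from the noise reduction network model,
        # with B equal to the number of distinct Bark bands.
        uniq = np.unique(bands)
        idx = np.searchsorted(uniq, bands)
        noise_reduction = mask_bark[idx, :]            # (K, N) noise reduction matrix
        Y_clean = noise_reduction * Yp                 # element-wise product
        _, y = istft(Y_clean, fs=fs, nperseg=nperseg)  # inverse transform + splicing
        return y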
In a quiet environment and in an interference environment, respectively, an audio playing device (such as a Bluetooth speaker) plays a known sound source s, and a plurality of microphones collect multi-channel audio signals. The collected multi-channel audio signals are separated and denoised using the audio noise reduction method provided by the embodiment of the application, and the processed voice audio is recorded as c.
The signal-to-noise ratio is counted separately in the quiet environment and in the interference environment:

SDR = 10 log10( sum(s^2) / sum((c - s)^2) )
Let the signal-to-noise ratio in the quiet environment be denoted as SDR_ref, the signal-to-noise ratio in the interference environment be denoted as SDRi, and the number of interference sources present in the interference environment be denoted as i (the energy level of each interference source is 55-60 dB). The test results are shown in Table 1 below:
TABLE 1

Number of interference sources i    1       2       3       4       6
SDRi - SDR_ref                      16 dB   13 dB   10 dB   9 dB    8 dB
In addition, the signal-to-interference-plus-noise ratio is counted separately in the quiet environment and in the interference environment:

SINR = 10 log10( sum(s^2) / sum((n + v)^2) )
where n is the background noise when the sound source s is recorded, and v is the interference source in the interference environment. Under different input signal-to-interference-plus-noise ratios IN_SINR, the signal-to-interference-plus-noise ratio improvement of the separated and denoised voice audio c is shown in Table 2 below:
TABLE 2

In_SINR    -5 dB       0 dB        5 dB
SINRi      15-20 dB    15-20 dB    15-20 dB
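For reference, a minimal sketch of the two metrics as reconstructed above, assuming equal-length, time-aligned signals; the function names are illustrative.

    import numpy as np

    def sdr_db(s, c):
        # Signal-to-noise ratio of processed audio c against the known source s.
        return 10.0 * np.log10(np.sum(s ** 2) / np.sum((c - s) ** 2))

    def sinr_db(s, n, v):
        # Signal-to-interference-plus-noise ratio: n is the recording's
        # background noise, v the interference source.
        return 10.0 * np.log10(np.sum(s ** 2) / np.sum((n + v) ** 2))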
The separated and denoised voice audio c is then fed to a common wake-up model for wake-up word recognition. The wake-up success rate (the number of wake-up words recognized divided by the total number of wake-up word utterances) under different input signal-to-interference-plus-noise ratios IN_SINR is shown in Table 3 below:
TABLE 3

In_SINR                 -5 dB      0 dB       5 dB
Wake-up success rate    83%-95%    87%-98%    90%-100%
Fig. 5 is a schematic diagram illustrating a structure of an audio noise reduction apparatus according to an exemplary embodiment, and as shown in fig. 5, the apparatus includes an obtaining unit 501 and a processing unit 502.
An obtaining unit 501, configured to obtain multiple channels of audio signals, where the multiple channels of audio signals are collected simultaneously by multiple audio collecting devices arranged in the same scene, and each channel of audio signal includes voice signals of multiple objects;
a processing unit 502, configured to determine, for each frame of voice signal of the audio signal, an audio feature of the voice signal, and determine, based on the audio feature, a noise reduction matrix corresponding to the voice signal; determine the denoised voice signal according to the noise reduction matrix and the voice signal; and determine the denoised audio signal of the target object according to each frame of denoised voice signal.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
determining the amplitudes of the denoised target voice signal at the preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the preset frequencies;
and converting the amplitudes of the denoised target speech signal at the preset frequencies into the denoised target speech signal based on a preset conversion relation.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
determining the matrix obtained after the signal-to-noise ratio value matrix is transformed into the frequency domain as the noise reduction matrix of the target voice signal by using the transformation relation between the frequency domain and the Bark domain;
and determining a product of a first matrix formed by the amplitudes of the target voice signal at the preset frequencies and a noise reduction matrix of the target voice signal as a second matrix, wherein the second matrix is formed by the amplitudes of the denoised target voice signal at the preset frequencies.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
acquiring multiple channels of audio signals, wherein the multiple channels of audio signals are collected simultaneously by using multiple audio collecting devices arranged in the same scene, each channel of audio signal includes voice signals of multiple objects, and the target object is any one of the multiple objects;
determining an amplitude matrix of each frame of audio signal based on the predetermined amplitudes of each frame of audio signal in each channel of audio signal at a plurality of preset frequencies;
and determining the amplitude of each frame of voice signal of each object at the plurality of preset frequencies according to the amplitude matrix of each frame of audio signal and the predetermined unmixing matrix of each frame of audio signal, and determining the voice audio of each object according to the amplitude of each frame of voice signal of each object at the plurality of preset frequencies.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
determining the unmixing matrix of each frame of audio signal by adopting the following steps:
determining an intermediate unmixing matrix of the first frame audio signal in each channel of audio signal, and determining the unmixing matrix of the first frame audio signal based on the intermediate unmixing matrix of the first frame audio signal in each channel of audio signal;

and determining an intermediate unmixing matrix of each non-first frame audio signal in each channel of audio signal, and determining the unmixing matrix of the non-first frame audio signal based on the intermediate unmixing matrix of the non-first frame audio signal in each channel of audio signal.
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
the noise reduction network model is trained by adopting the following steps:
performing point-wise division of the Bark band matrix of a noise-free audio sample by the Bark band matrix of a pure-noise audio sample to obtain a signal-to-noise ratio value matrix, taking the audio characteristics of the pure-noise audio sample and the audio characteristics of the noise-free audio sample as the input of a neural network model, and taking the signal-to-noise ratio value matrix so obtained as the output, to train the neural network model;

taking the trained neural network model as the noise reduction network model;

wherein the audio characteristics of the noise-free audio sample are determined from the Bark band matrix of the noise-free audio sample, and the audio characteristics of the pure-noise audio sample are determined from the Bark band matrix of the pure-noise audio sample.
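A minimal sketch of how the training target could be formed under this description; the epsilon guard and the function name are illustrative assumptions.

    import numpy as np

    def snr_target(bark_clean, bark_noise, eps=1e-12):
        # Point-wise division of the noise-free sample's Bark band matrix by
        # the pure-noise sample's Bark band matrix yields the signal-to-noise
        # ratio value matrix used as the network's training output.
        return bark_clean / np.maximum(bark_noise, eps)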
In a possible implementation manner, in the audio noise reduction apparatus provided in an embodiment of the present invention, the processing unit 502 is specifically configured to:
calculating the average value and variance of all elements in the matrix formed by the target Bark bands;

and performing preset processing on the matrix formed by the target Bark bands to obtain a matrix, and determining the obtained matrix as the audio characteristic of the target voice signal, wherein the preset processing is to subtract the average value from each element in the matrix and divide the difference by the variance.
Based on the same concept of the above-described embodiment of the present invention, fig. 6 is a schematic structural diagram of an audio noise reduction apparatus 600 according to an exemplary embodiment, and as shown in fig. 6, the audio noise reduction apparatus 600 according to the embodiment of the present invention includes:
a processor 610;
a memory 620 for storing instructions executable by the processor 610;
wherein, the processor 610 is configured to execute the instructions to implement the audio noise reduction method in the embodiment of the present invention.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 620 comprising instructions, executable by the processor 610 of the audio noise reduction apparatus 600 to perform the above-described method, is also provided. Optionally, the storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In addition, the audio noise reduction method and apparatus provided by the embodiments of the present invention described in conjunction with fig. 1, 2, 3, and 4 may be implemented by an audio noise reduction device. Fig. 7 shows a schematic structural diagram of an audio noise reduction device according to an embodiment of the present invention.
The audio noise reduction device may comprise a processor 701 and a memory 702 storing computer program instructions.
Specifically, the processor 701 may include a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing an embodiment of the present invention.
Memory 702 may include mass storage for data or instructions. By way of example, and not limitation, memory 702 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 702 may include removable or non-removable (or fixed) media, where appropriate. The memory 702 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 702 is non-volatile solid-state memory. In a particular embodiment, the memory 702 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 701 implements the audio noise reduction method in the above-described embodiments by reading and executing computer program instructions stored in the memory 702.
In one example, the audio noise reduction device may also include a communication interface 703 and a bus 710. As shown in fig. 7, the processor 701, the memory 702, and the communication interface 703 are connected by a bus 710 to complete mutual communication.
The communication interface 703 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiment of the present invention.
Bus 710 includes hardware, software, or both to couple the components of the audio noise reduction device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 710 may include one or more buses, where appropriate. Although specific buses have been described and shown in the embodiments of the invention, any suitable buses or interconnects are contemplated by the invention.
In addition, in combination with the audio noise reduction method in the foregoing embodiments, the embodiments of the present invention may be implemented by providing a computer-readable storage medium. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the audio noise reduction methods of the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for audio noise reduction, the method comprising:
acquiring a voice audio of a target object;
determining a target Bark band corresponding to the amplitudes of a target voice signal at a plurality of preset frequencies based on the transformation relation between the frequency domain and the Bark domain, wherein the target voice signal is any frame of voice signal of the voice audio;
determining the audio characteristics of the target voice signal by using a matrix formed by the determined target Bark bands;
inputting the audio characteristics of the target voice signal into a noise reduction network model to obtain a signal-to-noise ratio value matrix corresponding to the target voice signal;
and determining the denoised target voice signal based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the preset frequencies.
2. The method of claim 1, wherein determining the denoised target voice signal based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the plurality of preset frequencies comprises:

determining the amplitudes of the denoised target voice signal at the preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the preset frequencies;
and converting the amplitudes of the denoised target speech signal at the preset frequencies into the denoised target speech signal based on a preset conversion relation.
3. The method of claim 2, wherein determining the amplitudes of the denoised target voice signal at the plurality of preset frequencies based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the plurality of preset frequencies comprises:

determining the matrix obtained after the signal-to-noise ratio value matrix is transformed into the frequency domain as a noise reduction matrix of the target voice signal by using the transformation relation between the frequency domain and the Bark domain;
and determining a product of a first matrix formed by the amplitudes of the target voice signal at the preset frequencies and a noise reduction matrix of the target voice signal as a second matrix, wherein the second matrix is formed by the amplitudes of the denoised target voice signal at the preset frequencies.
4. The method of claim 1, wherein the obtaining the speech audio of the target object comprises:
acquiring multiple channels of audio signals, wherein the multiple channels of audio signals are collected simultaneously by using multiple audio collecting devices arranged in the same scene, each channel of audio signal includes voice signals of multiple objects, and the target object is any one of the multiple objects;

determining an amplitude matrix of each frame of audio signal based on the predetermined amplitudes of each frame of audio signal in each channel of audio signal at a plurality of preset frequencies;
determining the amplitude of each frame of voice signal of each object at the plurality of preset frequencies according to the amplitude matrix of each frame of audio signal and a predetermined unmixing matrix of each frame of audio signal;
and determining the voice audio frequency of each object according to the amplitude of each frame of voice signal of each object at the preset frequencies.
5. The method of claim 4, wherein the unmixing matrix of each frame of audio signal is determined by:

determining an intermediate unmixing matrix of the first frame audio signal in each channel of audio signal, and determining the unmixing matrix of the first frame audio signal based on the intermediate unmixing matrix of the first frame audio signal in each channel of audio signal;

and determining an intermediate unmixing matrix of each non-first frame audio signal in each channel of audio signal, and determining the unmixing matrix of the non-first frame audio signal based on the intermediate unmixing matrix of the non-first frame audio signal in each channel of audio signal.
6. The method of claim 1, wherein the noise reduction network model is trained using the steps of:
performing point-wise division of the Bark band matrix of a noise-free audio sample by the Bark band matrix of a pure-noise audio sample to obtain a signal-to-noise ratio value matrix, taking the audio characteristics of the pure-noise audio sample and the audio characteristics of the noise-free audio sample as the input of a neural network model, and taking the signal-to-noise ratio value matrix so obtained as the output, to train the neural network model;

taking the trained neural network model as the noise reduction network model;

wherein the audio characteristics of the noise-free audio sample are determined from the Bark band matrix of the noise-free audio sample, and the audio characteristics of the pure-noise audio sample are determined from the Bark band matrix of the pure-noise audio sample.
7. The method of claim 1, wherein determining the audio characteristics of the target voice signal by using the matrix formed by the determined target Bark bands comprises:

calculating the average value and variance of all elements in the matrix formed by the target Bark bands;

and performing preset processing on the matrix formed by the target Bark bands to obtain a matrix, and determining the obtained matrix as the audio characteristic of the target voice signal, wherein the preset processing is to subtract the average value from each element in the matrix and divide the difference by the variance.
8. An audio noise reduction apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a voice audio of a target object;
the processing unit is used for determining a target Bark band corresponding to the amplitudes of the target voice signal at the preset frequencies based on the transformation relation between the frequency domain and the Bark domain, wherein the target voice signal is any frame of voice signal of the voice audio; determining the audio characteristics of the target voice signal by using a matrix formed by the determined target Bark bands; inputting the audio characteristics of the target voice signal into a noise reduction network model to obtain a signal-to-noise ratio value matrix corresponding to the target voice signal; and determining the denoised target voice signal based on the signal-to-noise ratio value matrix and the amplitudes of the target voice signal at the preset frequencies.
9. An audio noise reduction device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio noise reduction method of any of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an audio noise reduction device, enable the audio noise reduction device to perform the audio noise reduction method of any of claims 1 to 7.
CN202110751408.2A 2020-07-03 2021-07-02 Audio noise reduction method, device, equipment and medium Pending CN113345435A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010635457.5A CN111916075A (en) 2020-07-03 2020-07-03 Audio signal processing method, device, equipment and medium
CN2020106354575 2020-07-03

Publications (1)

Publication Number Publication Date
CN113345435A (en)

Family

ID=73227376

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010635457.5A Withdrawn CN111916075A (en) 2020-07-03 2020-07-03 Audio signal processing method, device, equipment and medium
CN202110751408.2A Pending CN113345435A (en) 2020-07-03 2021-07-02 Audio noise reduction method, device, equipment and medium


Country Status (1)

Country Link
CN (2) CN111916075A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838473A (en) * 2021-09-26 2021-12-24 科大讯飞股份有限公司 Voice processing method and device of equipment and equipment
CN113936698A (en) * 2021-09-26 2022-01-14 度小满科技(北京)有限公司 Audio data processing method and device and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916075A (en) * 2020-07-03 2020-11-10 北京声智科技有限公司 Audio signal processing method, device, equipment and medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667425A (en) * 2009-09-22 2010-03-10 山东大学 Method for carrying out blind source separation on convolutionary aliasing voice signals
CN106887238A (en) * 2017-03-01 2017-06-23 中国科学院上海微系统与信息技术研究所 A kind of acoustical signal blind separating method based on improvement Independent Vector Analysis algorithm
CN109285557A (en) * 2017-07-19 2019-01-29 杭州海康威视数字技术股份有限公司 A kind of orientation sound pick-up method, device and electronic equipment
CN110335620A (en) * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 A kind of noise suppressing method, device and mobile terminal
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN111009256A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111179960A (en) * 2020-03-06 2020-05-19 北京松果电子有限公司 Audio signal processing method and device and storage medium
CN111916075A (en) * 2020-07-03 2020-11-10 北京声智科技有限公司 Audio signal processing method, device, equipment and medium


Also Published As

Publication number Publication date
CN111916075A (en) 2020-11-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination