CN113763979A - Audio noise reduction and audio noise reduction model processing method, device, equipment and medium - Google Patents
- Publication number
- CN113763979A (application number CN202110557785.2A)
- Authority
- CN
- China
- Prior art keywords
- submodel
- audio signal
- frequency domain
- real part
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application relates to audio noise reduction and to a method, apparatus, device, and medium for processing an audio noise reduction model, and involves artificial intelligence technology. The processing method of the audio noise reduction model comprises the following steps: through a real part processing network and an imaginary part processing network in a neural-network-based first submodel, respectively feature-coding the real part sequence and the imaginary part sequence obtained after a sample audio signal is transformed into a frequency domain signal, to obtain the real part attention and the imaginary part attention corresponding to the sample audio signal, and obtaining the frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence together with the real part attention and the imaginary part attention; performing model training on the first submodel according to a first loss determined from the frequency domain coding features and the frequency domain transform sequence corresponding to the clean audio signal, to obtain a frequency domain processing submodel; and connecting the frequency domain processing submodel with a time domain processing submodel and training them together to obtain the audio noise reduction model, whose noise reduction effect can thereby be improved.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular to an audio noise reduction method, apparatus, computer device, and storage medium, as well as to a processing method, apparatus, computer device, and storage medium for an audio noise reduction model.
Background
Audio signals are generally mixed with varying levels of noise; for example, a user making an audio call may be in any of a variety of scenes, and noisy background sound will interfere with the call. To obtain better audio quality, the original audio signal is usually subjected to noise reduction processing. Conventional audio noise reduction methods include adaptive filtering, spectral subtraction, and Wiener filtering, among others.
With the spread of deep learning within the field of artificial intelligence, using neural-network-based deep learning models for noise reduction processing of audio signals has become a research hotspot, and such methods can outperform traditional noise reduction algorithms. However, some existing audio noise reduction models do not make full use of the frequency domain information of the input audio signal, resulting in a poor audio noise reduction effect.
Disclosure of Invention
It is therefore necessary to provide an audio noise reduction method, apparatus, computer device, and storage medium capable of improving the audio noise reduction effect, and likewise a processing method, apparatus, computer device, and storage medium for an audio noise reduction model capable of improving the audio noise reduction effect.
A method of audio noise reduction, the method comprising:
acquiring an original audio signal of a time domain;
respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal;
obtaining frequency domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and transforming the frequency domain coding characteristics into time domain signals, and carrying out noise reduction processing on the time domain signals through a trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
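By way of illustration only, the following Python sketch mirrors the four steps above; `freq_submodel` and `time_submodel` are hypothetical stand-ins for the trained frequency domain and time domain processing submodels, and the element-wise masking shown is just one of the disclosed ways of applying the attention:

```python
import torch

def denoise(original_audio: torch.Tensor, freq_submodel, time_submodel,
            n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Sketch of the claimed flow; module names and parameters are assumptions."""
    window = torch.hann_window(n_fft)
    # Transform the original time-domain audio signal into a frequency domain signal.
    spec = torch.stft(original_audio, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    real_seq, imag_seq = spec.real, spec.imag
    # Real part / imaginary part attention from the frequency domain processing submodel.
    real_attn, imag_attn = freq_submodel(real_seq, imag_seq)
    # Apply the attention to obtain the frequency domain coding features.
    coded = torch.complex(real_seq * real_attn, imag_seq * imag_attn)
    # Transform the coding features back into a time domain signal ...
    time_signal = torch.istft(coded, n_fft=n_fft, hop_length=hop, window=window)
    # ... and denoise it with the trained time domain processing submodel.
    return time_submodel(time_signal)
```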
An audio noise reduction apparatus, the apparatus comprising:
the acquisition module is used for acquiring an original audio signal of a time domain;
the frequency domain coding module is used for respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is converted into the frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal; obtaining frequency domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and the time domain noise reduction module is used for converting the frequency domain coding characteristics into time domain signals and carrying out noise reduction processing on the time domain signals through a trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an original audio signal of a time domain;
respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal;
obtaining frequency domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and transforming the frequency domain coding characteristics into time domain signals, and carrying out noise reduction processing on the time domain signals through a trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an original audio signal of a time domain;
respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal;
obtaining frequency domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and transforming the frequency domain coding characteristics into time domain signals, and carrying out noise reduction processing on the time domain signals through a trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the processor executing the computer instructions to cause the computer device to perform the steps of the audio noise reduction method described above.
After the original audio signal in the time domain is obtained, the real part processing network and the imaginary part processing network in the trained frequency domain processing submodel perform feature coding on the real part sequence and the imaginary part sequence of the original audio signal, respectively. This network structure can make full use of the frequency domain information of the original audio signal, namely the amplitude information and the phase information represented jointly by the real part sequence and the imaginary part sequence, so that the real part attention and the imaginary part attention obtained by the coding give greater and more accurate attention to the clean audio signal within the original audio signal. The frequency domain coding features obtained from the real part sequence, the imaginary part sequence, the real part attention, and the imaginary part attention can therefore accurately represent the frequency domain characteristics of the clean audio signal, and the time domain signal recovered from these frequency domain coding features accurately expresses the clean audio signal, giving a better noise reduction effect. In addition, the time domain signal is subsequently further denoised by the time domain processing submodel, which can further improve its sound quality, so that the resulting noise reduction signal is better still.
A method of processing an audio noise reduction model, the method comprising:
obtaining a sample audio signal, the sample audio signal being generated from a clean audio signal;
respectively performing feature coding on a real part sequence and an imaginary part sequence obtained after the sample audio signal is transformed into a frequency domain signal through a real part processing network and an imaginary part processing network in a first sub-model based on a neural network to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtaining frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, the real part attention and the imaginary part attention;
performing model training on the first submodel according to a first loss determined based on the frequency domain coding features and a frequency domain transform sequence corresponding to the clean audio signal to obtain a frequency domain processing submodel;
and connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training together to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
In one embodiment, the obtaining of the real part attention and the imaginary part attention corresponding to the sample audio signal by feature-coding the real part sequence and the imaginary part sequence obtained after transforming the sample audio signal into the frequency domain signal through the real part processing network and the imaginary part processing network in the first sub-model based on the neural network comprises:
inputting the sample audio signal into a first sub-model based on a neural network;
in the first submodel, performing frequency domain transformation on the sample audio signal to obtain a real part sequence and an imaginary part sequence corresponding to the sample audio signal;
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network in the first submodel to obtain a real part first coding feature and an imaginary part first coding feature;
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the first submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part;
and obtaining the real part attention corresponding to the sample audio signal according to the real part first coding feature and the imaginary part second coding feature, and obtaining the imaginary part attention corresponding to the sample audio signal according to the real part second coding feature and the imaginary part first coding feature.
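As an illustrative sketch only, the four coding features of this embodiment can be produced by two LSTMs and combined following the complex-product convention (the embodiment names the feature pairings but not the exact combining operation; layer counts and sizes are assumptions):

```python
import torch
import torch.nn as nn

class FirstSubmodelLayer(nn.Module):
    """One complex-LSTM layer: a real part network and an imaginary part
    network each encode both input sequences, and the four coding features
    are combined into the two attentions."""
    def __init__(self, freq_bins: int = 257, hidden: int = 257):
        super().__init__()
        self.real_net = nn.LSTM(freq_bins, hidden, batch_first=True)
        self.imag_net = nn.LSTM(freq_bins, hidden, batch_first=True)

    def forward(self, real_seq, imag_seq):  # (batch, frames, freq_bins)
        real_first, _ = self.real_net(real_seq)    # real part first coding feature
        imag_first, _ = self.real_net(imag_seq)    # imaginary part first coding feature
        real_second, _ = self.imag_net(real_seq)   # real part second coding feature
        imag_second, _ = self.imag_net(imag_seq)   # imaginary part second coding feature
        # Combined as in a complex product (an assumption).
        real_attn = real_first - imag_second
        imag_attn = real_second + imag_first
        return real_attn, imag_attn
```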
In one embodiment, the obtaining the corresponding frequency-domain coding features of the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises:
multiplying the real part sequence and the real part attention to obtain a real part of the frequency domain coding feature corresponding to the sample audio signal;
and multiplying the imaginary part sequence and the imaginary part attention to obtain an imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
In one embodiment, the obtaining the corresponding frequency-domain coding features of the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises:
multiplying the real part sequence and the real part attention to obtain a first result, multiplying the imaginary part sequence and the imaginary part attention to obtain a second result, and taking the difference between the first result and the second result as the real part of the frequency domain coding feature corresponding to the sample audio signal;
and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
In one embodiment, the obtaining the corresponding frequency-domain coding features of the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises:
obtaining original amplitude information and original phase information of the sample audio signal based on the real part sequence and the imaginary part sequence; obtaining predicted amplitude information and predicted phase information of the sample audio signal based on the real part attention and the imaginary part attention;
obtaining amplitude information of frequency domain coding features corresponding to the sample audio signal according to the product of the original amplitude information and the predicted amplitude information;
and obtaining the phase information of the frequency domain coding characteristics corresponding to the sample audio signal according to the sum of the original phase information and the predicted phase information.
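The three embodiments above amount to three ways of applying the attention as a mask. A sketch, with assumed tensor names r, i (real/imaginary sequences) and ar, ai (real/imaginary attention):

```python
import torch

def mask_separate(r, i, ar, ai):
    # Variant 1: element-wise masking of the real and imaginary parts.
    return r * ar, i * ai

def mask_complex(r, i, ar, ai):
    # Variant 2: full complex multiplication (r + j*i)(ar + j*ai):
    # real = first result - second result; imag = third result + fourth result.
    return r * ar - i * ai, r * ai + i * ar

def mask_polar(r, i, ar, ai):
    # Variant 3: multiply the magnitudes, add the phases.
    mag = torch.sqrt(r**2 + i**2) * torch.sqrt(ar**2 + ai**2)
    phase = torch.atan2(i, r) + torch.atan2(ai, ar)
    return mag * torch.cos(phase), mag * torch.sin(phase)
```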
In one embodiment, the step of determining the first loss comprises:
performing frequency domain transformation processing on the clean audio signal to obtain a corresponding frequency domain transformation sequence, wherein the frequency domain transformation sequence comprises a real part sequence and an imaginary part sequence;
determining a first loss according to a difference between a real part sequence corresponding to the clean audio signal and a real part feature in a frequency domain coding feature corresponding to the sample audio signal, and a difference between an imaginary part sequence corresponding to the clean audio signal and an imaginary part feature in a frequency domain coding feature corresponding to the sample audio signal.
In one embodiment, the determining of the second loss comprises:
coding the time domain signal through a coder in the second submodel to obtain a time domain coding vector, performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the second submodel to obtain a hidden feature corresponding to the time domain signal, and decoding based on the time domain coding vector and the hidden feature through a decoder in the second submodel to obtain a noise reduction signal corresponding to a sample audio signal in the second sample set;
constructing a second loss from the noise reduction signal and the clean audio signal.
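An illustrative sketch of this encoder / temporal feature extraction / decoder pipeline (the patent does not fix the layer types; the choices and sizes below are assumptions):

```python
import torch
import torch.nn as nn

class SecondSubmodel(nn.Module):
    """Sketch of the second submodel: a 1-D convolutional encoder, a temporal
    feature extraction network (an LSTM here), and a transposed-convolution
    decoder that decodes from the time domain coding vector together with
    the hidden feature."""
    def __init__(self, channels: int = 256, kernel: int = 16, stride: int = 8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.extractor = nn.LSTM(channels, channels, batch_first=True)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, samples)
        enc = self.encoder(x)                             # time domain coding vector
        hidden, _ = self.extractor(enc.transpose(1, 2))   # hidden feature
        hidden = hidden.transpose(1, 2)
        # Decode from the coding vector and the hidden feature (combined here
        # by element-wise gating, a Conv-TasNet-style assumption).
        return self.decoder(enc * hidden)
```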
In one embodiment, said constructing a second loss from said noise reduction signal and said clean audio signal comprises:
projecting the noise reduction signal corresponding to the sample audio signal to the vertical direction and the horizontal direction of the clean audio signal respectively to obtain a vertical projection vector and a horizontal projection vector;
and obtaining a second loss according to the vertical projection vector and the horizontal projection vector.
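One plausible reading of this projection construction is an SI-SNR-style objective; the exact formula below is an assumption, since the embodiment does not spell it out:

```python
import torch

def second_loss(denoised: torch.Tensor, clean: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    # Zero-mean both signals, then project the noise reduction signal onto
    # the clean signal's direction (horizontal projection vector) and its
    # orthogonal complement (vertical projection vector), and penalise the
    # energy ratio between them.
    clean = clean - clean.mean(dim=-1, keepdim=True)
    denoised = denoised - denoised.mean(dim=-1, keepdim=True)
    dot = (denoised * clean).sum(dim=-1, keepdim=True)
    horizontal = dot * clean / (clean.pow(2).sum(dim=-1, keepdim=True) + eps)
    vertical = denoised - horizontal
    ratio = horizontal.pow(2).sum(dim=-1) / (vertical.pow(2).sum(dim=-1) + eps)
    return (-10 * torch.log10(ratio + eps)).mean()
```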
In one embodiment, the step of determining the third loss comprises:
coding the time domain signal through a coder in the third submodel to obtain a time domain coding vector, extracting the characteristics of the time domain coding vector through a time sequence characteristic extraction network in the third submodel to obtain hidden characteristics corresponding to the time domain signal, and predicting the noise scene category of the sample audio signal based on the hidden characteristics through an output layer in the third submodel;
and constructing a third loss according to the noise scene category and the noise label category of the noise signal used for generating the sample audio signal.
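Since the third submodel ends in a classification output layer, the third loss can be realised as a standard cross entropy; a minimal sketch with assumed names:

```python
import torch
import torch.nn as nn

# scene_logits: output of the third submodel's output layer, shape
# (batch, num_noise_scenes); noise_label: class ids of the noise signals
# used to generate the sample audio signals, shape (batch,).
def third_loss(scene_logits: torch.Tensor, noise_label: torch.Tensor) -> torch.Tensor:
    return nn.functional.cross_entropy(scene_logits, noise_label)
```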
An apparatus for processing an audio noise reduction model, the apparatus comprising:
an obtaining module, configured to obtain a sample audio signal, where the sample audio signal is generated according to a clean audio signal;
a frequency domain coding training module, configured to perform feature coding on a real part sequence and an imaginary part sequence obtained after the sample audio signal is transformed into a frequency domain signal through a real part processing network and an imaginary part processing network in a first sub-model based on a neural network, respectively, to obtain a real part attention and an imaginary part attention corresponding to the sample audio signal, and obtain a frequency domain coding feature corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, the real part attention and the imaginary part attention; performing model training on the first submodel according to a first loss determined based on the frequency domain coding features and a frequency domain transform sequence corresponding to the clean audio signal to obtain a frequency domain processing submodel;
and the integrated training module is used for connecting the frequency domain processing submodel with a time domain processing submodel to be trained and then training together to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining a sample audio signal, the sample audio signal being generated from a clean audio signal;
respectively performing feature coding on a real part sequence and an imaginary part sequence obtained after the sample audio signal is transformed into a frequency domain signal through a real part processing network and an imaginary part processing network in a first sub-model based on a neural network to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtaining frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, the real part attention and the imaginary part attention;
performing model training on the first submodel according to a first loss determined based on the frequency domain coding features and a frequency domain transform sequence corresponding to the clean audio signal to obtain a frequency domain processing submodel;
and connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training together to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining a sample audio signal, the sample audio signal being generated from a clean audio signal;
respectively performing feature coding on a real part sequence and an imaginary part sequence obtained after the sample audio signal is transformed into a frequency domain signal through a real part processing network and an imaginary part processing network in a first sub-model based on a neural network to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtaining frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, the real part attention and the imaginary part attention;
performing model training on the first submodel according to a first loss determined based on the frequency domain coding features and a frequency domain transform sequence corresponding to the clean audio signal to obtain a frequency domain processing submodel;
and connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training together to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the processor executing the computer instructions to cause the computer device to perform the steps of the method of processing an audio noise reduction model as described above.
The audio noise reduction model comprises a frequency domain processing submodel and a time domain processing submodel, and the frequency domain processing submodel comprises a real part processing network and an imaginary part processing network. After a sample audio signal is obtained, the real part processing network and the imaginary part processing network in the frequency domain processing submodel perform feature coding on the real part sequence and the imaginary part sequence of the sample audio signal, respectively. This network structure can fully learn the frequency domain information of the audio signal, namely the amplitude information and the phase information represented jointly by the real part sequence and the imaginary part sequence, so that the real part attention and the imaginary part attention obtained by the coding give greater and more accurate attention to the clean audio signal within the sample audio signal. Frequency domain coding features corresponding to the sample audio signal are then obtained from the real part sequence, the imaginary part sequence, the real part attention, and the imaginary part attention, and the first submodel is trained according to a first loss determined from the frequency domain coding features and the frequency domain transform sequence corresponding to the clean audio signal used to generate the sample audio signal, so that the resulting frequency domain processing submodel accurately learns the frequency domain characteristics of the clean signal in the sample audio signal. Finally, the frequency domain processing submodel and the time domain processing submodel are combined and trained together, so that the obtained audio noise reduction model achieves a better noise reduction effect.
Drawings
FIG. 1 is a diagram of an exemplary embodiment of an application environment for an audio denoising method;
FIG. 2 is a flow diagram illustrating a method for processing an audio noise reduction model in one embodiment;
FIG. 3 is a model framework diagram of a frequency domain processing submodel in one embodiment;
FIG. 4 is a diagram illustrating a frequency domain processing sub-model and a time domain processing sub-model in a cascade to obtain an audio noise reduction model according to an embodiment;
FIG. 5 is a schematic flow chart illustrating obtaining real part attention and imaginary part attention corresponding to a sample audio signal according to an embodiment;
FIG. 6 is a diagram illustrating feature encoding of real and imaginary sequences by real and imaginary processing networks in one embodiment;
FIG. 7 is a schematic diagram of two layers of a real part processing network and an imaginary part processing network in one embodiment;
FIG. 8 is a schematic flow chart illustrating the process of connecting the frequency domain processing submodel with the time domain processing submodel to be trained and then training them together to obtain an audio noise reduction model according to an embodiment;
FIG. 9(a) is a diagram illustrating a model structure of a second sub-model in an embodiment;
FIG. 9(b) is a diagram illustrating a second loss versus projection vector in one embodiment;
FIG. 10 is a diagram illustrating a model structure of a third submodel in one embodiment;
FIG. 11 is a diagram illustrating a model structure for performing multi-task learning on a time-domain noise reduction task and a noise scene classification task in one embodiment;
FIG. 12 is a schematic flow chart illustrating the process of connecting the frequency domain processing submodel with the time domain processing submodel to be trained and then training them together to obtain an audio noise reduction model according to another embodiment;
FIG. 13 is a diagram illustrating the steps of training an audio noise reduction model in one embodiment;
FIG. 14 is a schematic flow chart illustrating model distillation training of a time domain noise reduction submodel according to one embodiment;
FIG. 15 is a block diagram illustrating model distillation training of a time domain noise reduction submodel according to an embodiment;
FIG. 16 is a flow diagram illustrating an exemplary method for audio noise reduction;
FIG. 17 is a schematic flow chart illustrating obtaining real part attention and imaginary part attention corresponding to an original audio signal according to an embodiment;
FIG. 18 is a schematic flow chart illustrating obtaining real part attention and imaginary part attention corresponding to an original audio signal through a real part processing network and an imaginary part processing network in a trained frequency domain processing submodel according to an embodiment;
FIG. 19 is a block diagram showing the structure of a processing means of an audio noise reduction model in one embodiment;
FIG. 20 is a block diagram showing the structure of an audio noise reducing device according to an embodiment;
FIG. 21 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The audio noise reduction method and the processing method of the audio noise reduction model provided by the application realize audio noise reduction by using machine learning and other techniques from Artificial Intelligence (AI) technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Machine Learning (ML) is a multi-domain cross discipline, which relates to multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc., and is used for specially researching how a computer simulates or realizes human Learning behaviors to acquire new knowledge or skills and reorganizes an existing knowledge structure to continuously improve the performance of the knowledge structure. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. The artificial neural network is an important machine learning technology and has wide application prospects in the fields of system identification, pattern recognition, intelligent control and the like.
Deep Learning learns the intrinsic regularities and representation levels of sample data, and is a new research direction in the field of artificial intelligence; the information obtained during learning is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to give machines a human-like capability for analysis and learning, enabling them to recognize data such as text, images, and sound. The motivation for studying deep learning is to build neural networks that simulate the human brain for analytical learning, mimicking the mechanism by which the human brain interprets data such as images, sounds, and text. It is to be appreciated that the present application trains and uses an audio noise reduction model by means of deep learning techniques.
The audio noise reduction method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may obtain an original audio signal in the time domain; respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after an original audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal; acquiring frequency domain coding characteristics corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention; and transforming the frequency domain coding characteristics into time domain signals, and carrying out noise reduction processing on the time domain signals through the trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
For example, the terminal 102 may use the voice call signal as an original audio signal during an audio call, and perform voice noise reduction processing on the voice call signal by using the audio noise reduction method provided in the embodiment of the present application to obtain a corresponding noise reduction signal.
It is understood that in other embodiments, the terminal 102 may pass the raw audio signal to the server 104 after acquiring the raw audio signal. After obtaining the original audio signal, the server 104 inputs the original audio signal into the trained frequency domain processing submodel, then obtains the frequency domain coding feature corresponding to the original audio signal through the frequency domain processing submodel, transforms the frequency domain coding feature into a time domain signal, inputs the time domain signal into the trained time domain processing submodel, and then performs noise reduction processing on the time domain signal through the time domain processing submodel to obtain a noise reduction signal corresponding to the original audio signal.
The processing method of the audio noise reduction model provided by the embodiment of the application can also be applied to the application environment shown in fig. 1. The server 104 may obtain a sample audio signal, the sample audio signal being generated from the clean audio signal; respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after a sample audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in a first submodel based on a neural network to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtaining frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence as well as the real part attention and the imaginary part attention; performing model training on the first submodel according to a first loss determined by a frequency domain transformation sequence corresponding to the clean audio signal based on the frequency domain coding characteristics to obtain a frequency domain processing submodel; and connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training together to obtain an audio noise reduction model for carrying out noise reduction processing on the audio signal.
It is understood that in other embodiments, the audio noise reduction model may also be trained by the terminal 102.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
In the audio noise reduction method provided by the embodiment of the present application, an execution main body of the audio noise reduction method may be the audio noise reduction device provided by the embodiment of the present application, or a computer device integrated with the audio noise reduction device, where the audio noise reduction device may be implemented in a hardware or software manner. In the processing method of the audio noise reduction model provided in the embodiment of the present application, an execution main body of the processing method may be a processing apparatus of the audio noise reduction model provided in the embodiment of the present application, or a computer device integrated with the processing apparatus of the audio noise reduction model, where the processing apparatus of the audio noise reduction model may be implemented in a hardware or software manner. The computer device may be the terminal 102 or the server 104 shown in fig. 1.
In one embodiment, the terminal 102 or server 104 used to train the speech noise reduction model may be a node in a blockchain network. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The block chain underlying platform can comprise processing modules such as user management, basic service, intelligent contract and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, and comprises public and private key generation maintenance (account management), key management, user real identity and blockchain address corresponding relation maintenance (authority management) and the like, and under the authorization condition, the user management module supervises and audits the transaction condition of certain real identities and provides rule configuration (wind control audit) of risk control; the basic service module is deployed on all block chain node equipment and used for verifying the validity of the service request, recording the service request to storage after consensus on the valid request is completed, for a new service request, the basic service firstly performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts service information (consensus management) through a consensus algorithm, transmits the service information to a shared account (network communication) completely and consistently after encryption, and performs recording and storage; the intelligent contract module is responsible for registering and issuing contracts, triggering the contracts and executing the contracts, developers can define contract logics through a certain programming language, issue the contract logics to a block chain (contract registration), call keys or other event triggering and executing according to the logics of contract clauses, complete the contract logics and simultaneously provide the function of upgrading and canceling the contracts; the operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, cloud adaptation in the product release process and visual output of real-time states in product operation, such as: alarm, monitoring network conditions, monitoring node equipment health status, and the like.
The platform product service layer provides basic capability and an implementation framework of typical application, and developers can complete block chain implementation of business logic based on the basic capability and the characteristics of the superposed business. The application service layer provides the application service based on the block chain scheme for the business participants to use.
In one embodiment, as shown in fig. 2, a method for processing an audio noise reduction model is provided, which is described by taking the method as an example applied to a computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
in step 202, a sample audio signal is obtained, the sample audio signal being generated from a clean audio signal.
The sample audio signal is an audio signal used for training the audio noise reduction model. When a requirement to train a model for audio noise reduction arises, sample audio signals for training the audio noise reduction model need to be generated.
In one embodiment, the computer device may mix the clean audio signal with a noise signal according to different signal-to-noise ratios to obtain sample audio signals, and during model training the clean audio signal may serve as the label information of the generated sample audio signal. The clean audio signal may be, for example, a clean human voice signal; the languages of the clean voices used include English, Chinese, and regional dialects. The noise signal used may be noise from various scenes, such as white noise, wind noise, subway noise, keyboard noise, and mouse noise. In some embodiments, the class of the noise signal may also serve as label information for the sample audio signal.
Optionally, the computer device may read in clean audio signals and noise signals, and then randomly mix the clean audio signals and the noise signals according to different signal-to-noise ratios to obtain sample audio signals, which may perform data enhancement on training sample data to a certain extent, thereby improving the generalization capability of the model.
In some embodiments, the computer device may also obtain sample audio signals delivered by other computer devices, for example the server 104 in fig. 1 obtaining sample audio signals delivered by the terminal 102; the computer device may also obtain sample audio signals generated locally.
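A minimal sketch of the mixing step above, assuming NumPy arrays and a power-based signal-to-noise ratio; this is illustrative, not code from the patent:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean signal with noise at a target SNR to produce one sample
    audio signal; the clean signal doubles as its label information."""
    noise = np.resize(noise, clean.shape)            # crop/tile noise to length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Random mixing for data enhancement, e.g. one SNR drawn per sample:
# sample = mix_at_snr(clean, noise, snr_db=np.random.uniform(-5, 20))
```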
Through the sample audio signals, the neural-network-based first submodel can learn, during model training, the frequency domain characteristics of the clean audio signal within the sample audio signal. The first submodel may use a neural-network-based deep learning model such as an LSTM (Long Short-Term Memory) network, a recurrent neural network with a special structure that can learn long-term dependencies in long input sequences.
Both the time domain and the frequency domain are fundamental properties of a signal; analyzing a signal along different dimensions means approaching the problem from different angles, each of which may be called a domain. The time domain describes the correspondence between a mathematical function or physical signal and time; it is the domain in which signals exist in the real world and the only objectively existing one. The frequency domain is a coordinate system used to describe the frequency characteristics of a signal, showing how much of the signal lies within each frequency range; it is an auxiliary viewpoint constructed from a mathematical perspective.
In order to sufficiently learn the frequency domain characteristics of the clean audio signal mixed into the sample audio signal, the neural-network-based first submodel is designed to include a real part processing network and an imaginary part processing network. The real part processing network is designed to exploit the amplitude information of the sample audio signal, and the imaginary part processing network to exploit its phase information. Both networks may be LSTM (Long Short-Term Memory) based structures comprising at least one LSTM layer; that is, the overall LSTM structure capable of processing frequency domain signals in complex form consists of an LSTM that processes the real part sequence and an LSTM that processes the imaginary part sequence, and may be referred to as a complex-LSTM.
The real part processing network may be configured to obtain a real part attention corresponding to the sample audio signal and the imaginary part processing network may be configured to obtain an imaginary part attention corresponding to the sample audio signal. The real part attention may be used to reflect attention to a clean signal in the real part frequency domain features of the sample audio signal and the imaginary part attention may be used to reflect attention to a clean signal in the imaginary part frequency domain features of the sample audio signal.
In training, the first submodel sets the goal that multiplying the real part feature of the output of the real part and imaginary part processing networks with the real part sequence of the original audio signal should yield the real part of the clean audio signal, and that multiplying the imaginary part feature of that output with the imaginary part sequence of the original audio signal should yield the imaginary part of the clean audio signal. Based on this structure, which is equivalent to an attention mechanism (Attention), the outputs of the real part and imaginary part processing networks come to express greater and more accurate attention to the clean audio signal within the sample audio signal; for this reason these outputs are called the real part attention and the imaginary part attention. The attention mechanism imitates the way living beings observe: a mechanism that aligns internal experience with external stimuli, thereby increasing attention to particular regions.
Generally, after an input signal is input into a neural network, network parameters in a network layer of the neural network operate on the input signal to obtain an operation result. Each layer network receives the operation result output by the previous layer network, and outputs the operation result of the layer through the operation of the layer network as the input of the next layer. In this embodiment, the real part processing network and the imaginary part processing network in the first submodel include at least one layer.
Specifically, the computer device may input the obtained sample audio signal into the first submodel. In the first submodel, the sample audio signal is first transformed into a frequency domain signal; the real part sequence and the imaginary part sequence of the frequency domain signal are then each fed to both the real part processing network and the imaginary part processing network, which perform feature coding on them, that is, the network parameters in the two networks operate on the input real part and imaginary part sequences, and the real part attention and imaginary part attention corresponding to the sample audio signal are obtained from the outputs of the last layers of the real part and imaginary part processing networks.
After obtaining the attention to the clean audio signal in the sample audio signal, the computer device then obtains the frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence together with the real part attention and the imaginary part attention. The frequency domain coding features reflect the audio signal from a frequency domain perspective. As mentioned above, the real part attention and the imaginary part attention output by the real part processing network and the imaginary part processing network in the first submodel give greater and more accurate attention to the clean audio signal in the sample audio signal; based on this attention, the computer device can extract, from the real part and imaginary part sequences of the frequency domain signal, more of the portions related to the clean audio signal, and thus obtain the frequency domain coding features corresponding to the sample audio signal. It is understood that the frequency domain coding features include a real part and an imaginary part.
In one embodiment, in the first submodel the computer device may transform the original audio signal into a frequency domain signal using a forward Fourier transform, the frequency domain signal comprising a real part sequence and an imaginary part sequence. The Fourier transform may be a short-time Fourier transform (STFT). For example, suppose the sample audio signal is a 15 s audio signal with a sampling rate of 16K, the length of the Fourier transform window is set to 512, and the overlap of adjacent windows is set to 75%, i.e. the displacement step of the window is 128 sampling points. Then 1872 discrete sequences of length 512 can be obtained from the sample audio signal, and after the short-time Fourier transform, the obtained frequency domain signal comprises 1872 real part sequences of length 257 and 1872 imaginary part sequences of length 257.
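As an illustration (not part of the embodiment itself), the window arithmetic above can be reproduced with PyTorch's torch.stft; the signal here is synthetic and the variable names are ours:

```python
import torch

# 15 s of toy audio at a 16 kHz sampling rate: 240000 samples.
x = torch.randn(15 * 16000)
spec = torch.stft(
    x,
    n_fft=512,
    hop_length=128,                    # 512 * (1 - 0.75) = 128 sampling points
    win_length=512,
    window=torch.hann_window(512),
    center=False,
    return_complex=True,
)
print(spec.shape)                      # torch.Size([257, 1872]): 257 bins x 1872 frames
real_seq, imag_seq = spec.real, spec.imag  # the real part and imaginary part sequences
```

With center=False the frame count is 1 + (240000 − 512) // 128 = 1872, matching the example.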
To enable the first submodel to learn the frequency domain features of the clean audio signal in the sample audio signal, namely the amplitude information and phase information of the signal (both determined by the real part and the imaginary part), the computer device can construct a loss function, namely the first loss, from the frequency domain coding features and the frequency domain transform features corresponding to the clean audio signal in the sample audio signal. Model training is performed using the first loss to update the model parameters of the first submodel. When the training stop condition is met, the first submodel has learned the ability to mine the frequency domain features of the clean signal from an audio signal, and the trained model can be called the frequency domain processing submodel.
In one embodiment, the first loss determining step comprises: carrying out frequency domain transformation processing on the clean audio signal to obtain a corresponding frequency domain transformation sequence, wherein the frequency domain transformation sequence comprises a real part sequence and an imaginary part sequence; the first loss is determined based on a difference between a real part sequence corresponding to the clean audio signal and a real part feature in the frequency domain coding features corresponding to the sample audio signal, and a difference between an imaginary part sequence corresponding to the clean audio signal and an imaginary part feature in the frequency domain coding features corresponding to the sample audio signal.
Specifically, the computer device may perform a forward Fourier transform on the clean audio signal used in generating the sample audio signal, obtaining the frequency domain transform features corresponding to the clean audio signal, including a real part sequence and an imaginary part sequence. The computer device can then jointly calculate the first loss over the real part sequence y_r and the imaginary part sequence y_i according to the following formula:

Loss1 = ||y_r − f_r^w(x)||² + ||y_i − f_i^w(x)||²;

wherein y_r represents the real part sequence in the frequency domain transform features obtained by forward Fourier transform of the clean audio signal, y_i represents the imaginary part sequence in those features, f_r^w(x) represents the real part of the frequency domain coding features corresponding to the sample audio signal, f_i^w(x) represents the imaginary part of those coding features, the subscript r denotes the real part of a complex number, and the subscript i denotes the imaginary part of a complex number.
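A minimal sketch of this loss, assuming complex STFT tensors and reading ||·||² as a sum of squared differences:

```python
import torch
import torch.nn.functional as F

def first_loss(clean_spec: torch.Tensor, coded_spec: torch.Tensor) -> torch.Tensor:
    """Sketch of Loss1 (our reading of the formula, not the patent's code).

    clean_spec: STFT of the clean signal (complex tensor), i.e. y_r + j*y_i.
    coded_spec: frequency domain coding features of the sample signal,
                i.e. f_r^w(x) + j*f_i^w(x).
    """
    real_term = F.mse_loss(coded_spec.real, clean_spec.real, reduction="sum")
    imag_term = F.mse_loss(coded_spec.imag, clean_spec.imag, reduction="sum")
    return real_term + imag_term
```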
Fig. 3 is a schematic diagram of a model framework of a frequency domain processing submodel according to an embodiment. Referring to fig. 3, the frequency domain processing submodel includes a fourier transform module, a real part processing network, and an imaginary part processing network. The real part processing network and the imaginary part processing network are complex-LSTM structures based on real part and imaginary part operation. After the sample audio signal is input into the frequency domain processing submodel, Fourier transform is carried out through a Fourier transform module to obtain a real part sequence and an imaginary part sequence of the frequency domain signal, and the real part sequence and the imaginary part sequence are subjected to complex-LSTM feature coding to obtain real part attention and imaginary part attention. Frequency domain coding features of a clean audio signal are mined from the real part sequence and the imaginary part sequence based on the real part attention and the imaginary part attention. The frequency domain coding features are subjected to inverse Fourier transform to obtain a time domain signal.
In one embodiment, since the frequency domain processing submodel obtained according to the first loss training has the capability of mining the frequency domain features of the clean signal from the audio signal, the computer device may directly connect the frequency domain processing submodel with the inverse fourier transform module to obtain the audio noise reduction model. That is, when the original audio signal needs to be subjected to noise reduction processing, the original audio signal only needs to be input into the trained frequency domain processing submodel to obtain the corresponding frequency domain coding characteristic, and then the frequency domain coding characteristic is subjected to inverse fourier transform, so that the noise reduction signal corresponding to the original audio signal can be obtained.
And step 208, connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training together to obtain an audio noise reduction model for carrying out noise reduction processing on the audio signal.
Wherein the temporal processing submodel is a model for refining a clean signal in the audio signal from a temporal perspective. The time domain processing submodel may employ a neural network model. In this embodiment, in order to obtain a noise reduction signal with better sound quality, the computer device further learns the characteristics of the sample audio signal in the time domain. Specifically, after obtaining the frequency domain processing submodel through model training, the computer device continues to perform integrated training on the frequency domain processing submodel and the time domain processing submodel to be trained to obtain the trained time domain processing submodel and the frequency domain processing submodel with updated model parameters, and after the integrated training is finished, the updated frequency domain processing submodel is connected with the trained time domain processing submodel to obtain the audio noise reduction model.
That is, the first half of the audio noise reduction model is learned in the frequency domain, and in order to fully utilize the amplitude and phase information of the audio signal, a complex-LSTM structure based on real and imaginary part operation is designed, and the second half of the audio noise reduction model is further additionally learned in the time domain to obtain a noise reduction signal with better sound quality.
In one embodiment, when the computer device continues the integrated training of the frequency domain processing submodel and the time domain processing submodel to be trained, the loss of the frequency domain processing submodel is no longer introduced; instead, the model parameters of the time domain processing submodel are updated only according to the loss of the time domain noise reduction task, while the model parameters of the frequency domain processing submodel are adjusted at the same time.
During the integrated training, the computer device can input the sample audio signal into the frequency domain processing submodel obtained in the previous step; after the corresponding frequency domain coding features are output by the frequency domain processing submodel, the frequency domain coding features are subjected to an inverse Fourier transform to obtain a time domain signal, which is then input into the time domain processing submodel, and a noise reduction signal is output after processing by the time domain processing submodel. The computer device can compare the clean audio signal used to generate the sample audio signal with the noise reduction signal output by the time domain processing submodel, calculate the loss function of the time domain processing submodel accordingly, and then perform gradient back propagation based on this loss function to adjust the model parameters of both the time domain processing submodel and the frequency domain processing submodel. The loss function of the time domain processing submodel can be calculated using an index for evaluating the audio noise reduction effect, such as SNR (Signal Noise Ratio) or SI-SDR (Scale Invariant Signal-to-Distortion Ratio).
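A hedged sketch of one such integrated training step; freq_model, time_model and si_sdr_loss are placeholder names for the frequency domain processing submodel, the time domain processing submodel and a negative-SI-SDR criterion, none of which are specified at this level of detail in the embodiment:

```python
import torch

def train_step(freq_model, time_model, si_sdr_loss, optimizer,
               sample_audio, clean_audio):
    # Frequency domain coding features (assumed complex STFT-shaped output).
    coded = freq_model(sample_audio)
    # Inverse Fourier transform back to a time domain signal.
    time_sig = torch.istft(coded, n_fft=512, hop_length=128,
                           window=torch.hann_window(512))
    # Noise reduction signal from the time domain processing submodel.
    denoised = time_model(time_sig)
    # Only the time domain loss is used at this stage.
    loss = si_sdr_loss(denoised, clean_audio)
    optimizer.zero_grad()
    loss.backward()   # gradients also flow back into freq_model,
    optimizer.step()  # so its parameters are adjusted as well
    return loss.item()
```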
Fig. 4 is a schematic diagram of an embodiment in which a frequency domain processing sub-model and a time domain processing sub-model are cascaded to obtain an audio noise reduction model. Referring to fig. 4, during model training, a first loss is first constructed by using a difference between a frequency domain coding feature corresponding to a sample audio signal and a frequency domain transform feature corresponding to a clean audio signal for generating the sample audio signal, a frequency domain processing sub-model is then cascaded with a time domain processing sub-model to be trained, and model training is performed jointly according to a difference between the clean audio signal and a noise reduction signal output by the time domain processing sub-model. After the co-training is finished, in the obtained audio noise reduction model, the frequency domain processing submodel and the time domain processing submodel can fully utilize the frequency domain information and the time domain information of the audio signal, so that a very good audio noise reduction effect is obtained.
In the processing method of the audio noise reduction model above, the model comprises a frequency domain processing submodel and a time domain processing submodel, and the frequency domain processing submodel comprises a real part processing network and an imaginary part processing network. After a sample audio signal in the time domain is obtained, the real part processing network and the imaginary part processing network in the frequency domain processing submodel respectively perform feature coding on the real part sequence and the imaginary part sequence of the sample audio signal. This network structure can fully learn the frequency domain information of the signal, namely the amplitude information and phase information represented by the real part sequence and the imaginary part sequence, so that the real part attention and imaginary part attention obtained by coding give increasingly accurate attention to the clean audio signal in the sample audio signal. The frequency domain coding features corresponding to the sample audio signal are then obtained based on the real part sequence and imaginary part sequence together with the real part attention and imaginary part attention, and the frequency domain processing submodel is trained according to the first loss determined by the frequency domain coding features and the frequency domain transform sequence corresponding to the clean audio signal used to generate the sample audio signal, so that the frequency domain processing submodel accurately learns the frequency domain features of the clean signal in the sample audio signal. Finally, the frequency domain processing submodel and the time domain processing submodel are trained together, so that the resulting audio noise reduction model achieves a better noise reduction effect.
As shown in fig. 5, in one embodiment, the obtaining of the real part attention and the imaginary part attention corresponding to the sample audio signal by respectively performing feature coding on the real part sequence and the imaginary part sequence obtained after transforming the sample audio signal into the frequency domain signal through the real part processing network and the imaginary part processing network in the first sub-model based on the neural network comprises:
Specifically, the computer device may set a model structure of the first submodel in advance, and call the first submodel to input the acquired sample audio signal into the first submodel.
In particular, in the first submodel, the computer device may perform a frequency domain transform on the sample audio signal to obtain a frequency domain signal. For example, a short-time Fourier transform may be performed on the sample audio signal to obtain the real part sequence and imaginary part sequence of the frequency domain signal.
And step 508, respectively performing feature coding on the real part sequence and the imaginary part sequence through the imaginary part processing network in the first submodel to obtain a real part second coding feature and an imaginary part second coding feature.
In order to obtain a good noise reduction effect, the computer device learns the frequency domain features of the audio signal from the frequency domain perspective. After converting the time domain signal into a frequency domain signal, in order to fully learn the amplitude information and phase information of the audio signal, the computer device performs feature coding on the real part sequence and the imaginary part sequence, mining the frequency domain features of the clean audio signal from the frequency domain signal. Specifically, in the real part processing network and the imaginary part processing network, the computer device operates on the input real part sequence and imaginary part sequence with the parameter matrices in the networks to obtain the corresponding coding features.
With reference to the complex multiplication formula, for two complex numbers:

complex1 = a + jb;
complex2 = c + jd;

the product of complex1 and complex2 is:

complex1 · complex2 = (a·c − b·d) + j(a·d + b·c);
the embodiment of the invention defines a formula for operating the real part sequence and the imaginary part sequence by the real part processing network and the imaginary part processing network:
L_rr = LSTM_r(X_r); L_ir = LSTM_r(X_i);
L_ri = LSTM_i(X_r); L_ii = LSTM_i(X_i);
L_out = L_rr − L_ii + j(L_ri + L_ir);

wherein X_r represents the real part sequence obtained by frequency domain transformation of the input sample audio signal X, and X_i represents the imaginary part sequence obtained by frequency domain transformation of the input sample audio signal X; L_rr represents the operation result obtained after the real part processing network LSTM_r processes X_r, i.e. the real part first coding feature; L_ir represents the operation result obtained after the real part processing network LSTM_r processes X_i, i.e. the imaginary part first coding feature; L_ri represents the operation result obtained after the imaginary part processing network LSTM_i processes X_r, i.e. the real part second coding feature; L_ii represents the operation result obtained after the imaginary part processing network LSTM_i processes X_i, i.e. the imaginary part second coding feature; L_out represents the operation result output by each layer of the real part processing network and imaginary part processing network, including the real part composed of L_rr − L_ii and the imaginary part composed of L_ri + L_ir.
That is, the output result of each layer in the complex-LSTM is also divided into a real part having a relationship with both of the real part sequence and the imaginary part sequence in the frequency domain signal corresponding to the sample audio signal and an imaginary part having a relationship with both of the real part sequence and the imaginary part sequence in the frequency domain signal corresponding to the sample audio signal.
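For concreteness, a single complex-LSTM layer implementing exactly the formulas above might be sketched as follows (hidden sizes and tensor layout are our assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class ComplexLSTMLayer(nn.Module):
    """One complex-LSTM layer: L_out = (L_rr - L_ii) + j(L_ri + L_ir)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.lstm_r = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_r
        self.lstm_i = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_i

    def forward(self, x_r: torch.Tensor, x_i: torch.Tensor):
        # x_r, x_i: (batch, frames, bins) real / imaginary part sequences
        l_rr, _ = self.lstm_r(x_r)   # real part first coding feature
        l_ir, _ = self.lstm_r(x_i)   # imaginary part first coding feature
        l_ri, _ = self.lstm_i(x_r)   # real part second coding feature
        l_ii, _ = self.lstm_i(x_i)   # imaginary part second coding feature
        out_r = l_rr - l_ii          # real part of L_out
        out_i = l_ri + l_ir          # imaginary part of L_out
        return out_r, out_i
```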
Fig. 6 is a schematic diagram of a real part processing network and an imaginary part processing network performing feature coding on a real part sequence and an imaginary part sequence in one embodiment. Referring to FIG. 6, the real part sequence X_r and the imaginary part sequence X_i are mapped, via the network parameters W_r of the real part processing network together with the network parameters −W_i of the imaginary part processing network, to the real part attention (i.e., L_rr − L_ii) in a first manner; the real part sequence X_r and the imaginary part sequence X_i are mapped, via the network parameters W_r of the real part processing network together with the network parameters W_i of the imaginary part processing network, to the imaginary part attention (i.e., L_ri + L_ir) in a second manner.
And step 510, obtaining the real part attention corresponding to the sample audio signal according to the real part first coding feature and the imaginary part second coding feature, and obtaining the imaginary part attention corresponding to the sample audio signal according to the real part second coding feature and the imaginary part first coding feature.
According to the formula defined above: L_out = L_rr − L_ii + j(L_ri + L_ir);

the computer device can take the difference between the real part first coding feature L_rr output by the real part processing network and the imaginary part second coding feature L_ii output by the imaginary part processing network as the real part attention corresponding to the sample audio signal, and take the sum of the real part second coding feature L_ri output by the imaginary part processing network and the imaginary part first coding feature L_ir output by the real part processing network as the imaginary part attention corresponding to the sample audio signal.
In this way, the real part attention obtained through feature coding by the real part processing network and the imaginary part processing network in the first submodel draws on both the real part and the imaginary part of the sample audio signal, and likewise the imaginary part attention draws on both the real part and the imaginary part. Information from multiple aspects of the sample audio signal is thus fully utilized, providing interpretability for the better noise reduction effect obtained subsequently.
In one embodiment, the real part processing network and the imaginary part processing network in the first submodel include at least two layers. The computer device obtains the complex result output by the real part processing network and imaginary part processing network of the previous layer, splits it into a real part and an imaginary part, and uses these as the input of the current layer. The real part processing network and imaginary part processing network of the current layer respectively perform feature coding to obtain the respective coding features, which are combined according to the formula above to obtain the complex result output by the current layer; this complex result is input into the next layer for the same feature coding and operation, and so on, until the complex result output by the last layer is obtained and split into a real part and an imaginary part, which serve as the real part attention and the imaginary part attention respectively.
In some embodiments, the complex result output by the last layer may also be processed by the fully connected layer, so as to obtain the final real attention and imaginary attention. The full connection layer is used for carrying out matrix multiplication processing on the input characteristics and the network parameters corresponding to the full connection layer, and therefore corresponding characteristics are output. Specifically, the real part processing network of the last layer is connected to the first fully-connected layer, and the imaginary part processing network of the last layer is connected to the second fully-connected layer, that is, the real part of the complex results output by the real part processing network and the imaginary part processing network of the last layer is the input of the first fully-connected layer, and the imaginary part of the complex results output by the real part processing network and the imaginary part processing network of the last layer is the input of the second fully-connected layer. The first fully-connected layer may be configured to perform matrix multiplication on the real part and the network parameter corresponding to the first fully-connected layer to obtain the real part attention corresponding to the original audio signal, and the second fully-connected layer may be configured to perform matrix multiplication on the imaginary part and the network parameter corresponding to the second fully-connected layer to obtain the imaginary part attention corresponding to the original audio signal.
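Continuing the sketch, stacking two ComplexLSTMLayer instances from the snippet above and attaching the two fully connected heads could look like this (the layer count of two matches fig. 7; widths are illustrative assumptions):

```python
import torch.nn as nn

class FrequencyEncoder(nn.Module):
    """Sketch: two complex-LSTM layers plus two fully connected heads."""

    def __init__(self, bins: int = 257, hidden: int = 256):
        super().__init__()
        self.layer1 = ComplexLSTMLayer(bins, hidden)
        self.layer2 = ComplexLSTMLayer(hidden, hidden)
        self.fc_real = nn.Linear(hidden, bins)   # first fully connected layer
        self.fc_imag = nn.Linear(hidden, bins)   # second fully connected layer

    def forward(self, x_r, x_i):
        r, i = self.layer1(x_r, x_i)             # complex result of the first layer
        r, i = self.layer2(r, i)                 # complex result of the last layer
        return self.fc_real(r), self.fc_imag(i)  # real part / imaginary part attention
```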
It should be noted that the frequency domain processing sub-network includes at least two layers of real part processing network and imaginary part processing network, wherein the real part processing network and imaginary part processing network of the first layer are used for receiving the real part sequence and imaginary part sequence corresponding to the original audio signal, and the real part processing network and imaginary part processing network of the last layer are used for outputting the real part attention and imaginary part attention corresponding to the original audio signal, the real part processing network and imaginary part processing network of the "current layer" are used for describing the layer currently performing feature coding in the frequency domain processing sub-network, and the real part processing network and imaginary part processing network of the "previous layer" are used for describing the previous layer in the frequency domain processing sub-network at the current layer, and the input data of the real part processing network and imaginary part processing network of the current layer is the output data of the real part processing network and imaginary part processing network of the previous layer. The "current layer" is a concept of relative change, for example, after the output result of the current layer s is obtained by performing feature coding on the current layer s using the output result of the previous layer s-1, the output result of the current layer s is input to the next layer s +1, and at this time, the next layer s +1 performs feature coding on the current layer s using the output result of the current layer s, so that the next layer s +1 can be used as a new "current layer", and the current layer s can be used as a new previous layer. For example, the current layer may be the first layer, may be the last layer; the upper layer can be a first layer and can be a second layer; the next layer may be the second layer or the last layer.
It can be understood that, in the embodiment of the present application, the current layer, the previous layer, and the last layer are not limited to the deployment location in the frequency domain processing sub-network, but are related to the processing order of the data in the frequency domain processing sub-network. For example, in the frequency domain processing subnetwork, when data is processed sequentially from left to right, the previous layer may be disposed on the left side of the next layer, when data is processed sequentially from right to left, the previous layer may also be disposed on the right side of the next layer, when data is processed sequentially from top to bottom, the previous layer may also be disposed on the top side of the next layer, and when data is processed sequentially from bottom to top, the previous layer may also be disposed on the bottom side of the next layer. Similarly, the deployment positions of the last layer and the first layer in the frequency domain processing sub-network are also related to the processing sequence in the frequency domain processing sub-network. Referring to fig. 7, data is sequentially processed from top to bottom in the frequency domain processing sub-network, where the previous layer is disposed above the next layer, the first layer is disposed as the first layer, and the last layer is the last layer, i.e., the second layer.
Fig. 7 is a schematic diagram of two layers of real part processing networks and imaginary part processing networks in one embodiment. Referring to fig. 7, the real part sequence and imaginary part sequence in the frequency domain signal corresponding to the sample audio signal are input into the real part processing network and imaginary part processing network of the first layer to obtain the complex result output by the first layer; the real part and imaginary part of this complex result are input into the real part processing network and imaginary part processing network of the second layer to obtain the complex result output by the second layer; the real part of the complex result output by the second layer is input into the first fully connected layer, which outputs the real part attention corresponding to the sample audio signal, and the imaginary part of that complex result is input into the second fully connected layer, which outputs the imaginary part attention corresponding to the sample audio signal.
After the computer device obtains the real part attention and the imaginary part attention through the real part processing network and the imaginary part processing network based on the real part and imaginary part operation in the frequency domain processing submodel, the frequency domain coding characteristics corresponding to the sample audio signal can be expressed in a form of multiplying the attention and the original signal.
In one embodiment, obtaining frequency-domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: multiplying the real part sequence by the real part attention to obtain a real part of the frequency domain coding characteristics corresponding to the sample audio signal; and multiplying the imaginary part sequence and the imaginary part attention to obtain the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
Specifically, the obtained attention to the clean audio signal in the sample audio signal includes a real attention part and an imaginary attention part, and the computer device may ignore the phase and multiply in a real multiplication form. That is, the real part attention is directly multiplied by the real part sequence in the frequency domain signal corresponding to the sample audio signal, the product result is used as the real part of the frequency domain coding feature corresponding to the sample audio signal, the imaginary part attention is multiplied by the imaginary part sequence in the frequency domain signal corresponding to the sample audio signal, and the product result is used as the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
I.e. by the following formula:

X̂ = (A_r · X_r) + j(A_i · X_i);

wherein X̂ represents the frequency domain coding features corresponding to the sample audio signal, X_r represents the real part sequence obtained by frequency domain transformation of the input sample audio signal X, X_i represents the imaginary part sequence obtained by frequency domain transformation of the input sample audio signal X, A_r represents the real part attention, and A_i represents the imaginary part attention.
In one embodiment, obtaining frequency-domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: multiplying the real part sequence and the real part attention to obtain a first result, multiplying the imaginary part sequence and the imaginary part attention to obtain a second result, and taking the difference between the first result and the second result as the real part of the frequency domain coding feature corresponding to the sample audio signal; and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
In particular, the obtained attention to the clean audio signal in the sample audio signal comprises the real part attention and the imaginary part attention, and the computer device may multiply according to the format of real part and imaginary part, that is, according to the formula of complex multiplication:

X̂ = (A_r · X_r − A_i · X_i) + j(A_r · X_i + A_i · X_r);

wherein X̂ represents the frequency domain coding features corresponding to the sample audio signal, X_r represents the real part sequence obtained by frequency domain transformation of the input sample audio signal X, X_i represents the imaginary part sequence obtained by frequency domain transformation of the input sample audio signal X, A_r represents the real part attention, and A_i represents the imaginary part attention.
In one embodiment, obtaining frequency-domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: obtaining original amplitude information and original phase information of the sample audio signal based on the real part sequence and the imaginary part sequence; obtaining predicted amplitude information and predicted phase information of the sample audio signal based on the real part attention and the imaginary part attention; obtaining amplitude information of frequency domain coding characteristics corresponding to the sample audio signal according to the product of the original amplitude information and the predicted amplitude information; and obtaining the phase information of the frequency domain coding characteristics corresponding to the sample audio signal according to the sum of the original phase information and the predicted phase information.
In particular, the obtained attention to the clean audio signal in the sample audio signal comprises both the real part attention and the imaginary part attention, and the computer device may multiply using the amplitude and phase information, i.e. according to the formula for amplitude-phase multiplication:

X̂ = (X_mag · A_mag) · e^{j(X_phase + A_phase)};

wherein X̂ represents the frequency domain coding features corresponding to the sample audio signal, X_mag represents the original amplitude information obtained based on the real part sequence X_r and the imaginary part sequence X_i of the sample audio signal, X_phase represents the original phase information obtained based on the real part sequence X_r and the imaginary part sequence X_i, A_mag represents the predicted amplitude information of the sample audio signal obtained based on the real part attention and the imaginary part attention, A_phase represents the predicted phase information of the sample audio signal obtained based on the real part attention and the imaginary part attention, and j denotes the imaginary unit.
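The three fusion modes described in the preceding embodiments can be sketched as follows; A_r and A_i stand for the real part and imaginary part attention, and deriving the predicted amplitude and phase from them via sqrt/atan2 in the third mode is our assumption:

```python
import torch

def fuse_real_mult(x: torch.Tensor, a_r, a_i):
    """Mode 1: elementwise real multiplication, ignoring the phase cross terms."""
    return torch.complex(x.real * a_r, x.imag * a_i)

def fuse_complex_mult(x: torch.Tensor, a_r, a_i):
    """Mode 2: full complex multiplication (a_r + j*a_i) * (x_r + j*x_i)."""
    return torch.complex(x.real * a_r - x.imag * a_i,
                         x.real * a_i + x.imag * a_r)

def fuse_mag_phase(x: torch.Tensor, a_r, a_i):
    """Mode 3: multiply the amplitudes, add the phases."""
    mag = x.abs() * torch.sqrt(a_r ** 2 + a_i ** 2)       # X_mag * A_mag
    phase = x.angle() + torch.atan2(a_i, a_r)             # X_phase + A_phase
    return torch.polar(mag, phase)
```

Here x is the complex frequency domain signal of the sample audio, so x.real and x.imag correspond to X_r and X_i.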
In the above embodiment, the final frequency-domain coding feature is obtained by multiplying the output real part attention and imaginary part attention by the real part sequence and imaginary part sequence in the original signal, respectively, so that the frequency-domain processing submodel learns the capability of obtaining a clean signal when the output attention is multiplied by the original signal step by step.
Further, every machine learning task carries some noise; for example, when training only the submodel corresponding to a single task, it is easy to fit the data-dependent noise and neglect generalization performance. Since different tasks have different noise patterns, learning multiple tasks simultaneously can result in a more general representation. Multi-task learning means putting several related machine learning tasks together for learning; through a shared shallow representation, the learning processes share and complement each other's domain information, so that the tasks promote one another. Therefore, in order to improve the generalization capability of the model, when training the time domain processing submodel, the computer device may further set up a noise scene classification task and perform multi-task learning on the time domain noise reduction task of the time domain processing submodel together with the noise scene classification task. In this embodiment, the noise scene classification task and the time domain noise reduction task are highly correlated, so the effect of both subtasks can be improved while the generalization capability of the whole model is improved.
Noise scene classification also has practical significance: users in different scenes have different sensitivity to noise, so different noise reduction levels can be adopted. For example, in a daily scene where a user chats with relatives and friends, moderately suppressing the noise is sufficient, and the noise reduction degree can be slightly weaker; in a scene where the user attends a multi-person meeting, the requirement for eliminating noisy background sound is higher. Applying different degrees of noise reduction according to the noise scene the user is in helps improve the user experience. Then, after the noise scene category in the audio is identified, the noise reduction signal may be output automatically at the corresponding noise reduction level, or at a noise reduction level input by the user.
In an embodiment, after obtaining the frequency domain processing submodel, the computer device may set a second submodel and a third submodel connected to the frequency domain processing submodel, where the second submodel is used to learn a time domain feature of a signal to implement noise reduction on the time domain signal, and the third submodel is used to learn a time domain feature of a signal to implement classification on a noise scene category of the time domain signal. The second submodel and the third submodel are both neural network-based models. When the time domain noise reduction task and the noise scene classification task are used for multi-task learning, the label information of the input sample audio signal also comprises the noise scene category. As mentioned above, the computer device may mix the clean audio signal and the noise signal according to different signal-to-noise ratios to obtain the sample audio signal, and then the tag information of the sample audio signal further includes a noise scene category of the mixed noise signal, such as white noise, wind noise, subway noise, keyboard noise, mouse noise, and so on.
As shown in fig. 8, in an embodiment, in step 208, the frequency domain processing submodel and the time domain processing submodel to be trained are connected and then trained together to obtain an audio noise reduction model for performing noise reduction processing on an audio signal, which includes:
and step 802, connecting the frequency domain processing submodel with a second submodel and a third submodel based on a neural network.
Specifically, the time domain noise reduction task and the noise scene classification task are two parallel subtasks, and when multi-task learning is performed, the computer device may connect the frequency domain processing submodel with the second submodel and the third submodel, respectively.
Specifically, the time domain noise reduction task and the noise scene classification task both process signals from the time domain perspective. Therefore, the computer device may transform the frequency domain coding features obtained by the frequency domain processing submodel into time domain signals, and then input the time domain signals into the second submodel and the third submodel, respectively. The frequency domain coding features may be inverse fourier transformed to obtain a time domain signal.
Optionally, the computer device transforms the frequency domain coding features into time domain signals through the frequency domain processing submodel after the frequency domain processing submodel obtains the frequency domain coding features. The computer device may input the frequency domain coding features into the second submodel and the third submodel, respectively, and transform the frequency domain coding features into time domain signals by inverse fourier transform, respectively.
The computer device designs loss functions, namely a second loss and a third loss, according to the respective targets of the time domain noise reduction task and the noise scene classification task. A multi-task objective function is an objective function that fuses the losses of multiple tasks. The multi-task objective function in this embodiment fuses the second loss of the time domain noise reduction task with the third loss corresponding to the noise scene classification task; after the two losses are fused, the same multi-task objective function is optimized, thereby optimizing both tasks.
In some embodiments, when the computer device continues to perform integrated training on the frequency domain processing submodel, the second submodel, and the third submodel, the constructed multitask objective function does not introduce loss of the frequency domain processing submodel, but updates the model parameters of the second submodel and the third submodel only according to loss of the time domain noise reduction task corresponding to the second submodel and loss of the noise scene classification task corresponding to the third submodel, and simultaneously adjusts the model parameters of the frequency domain processing submodel.
In some embodiments, in order to avoid that the whole audio noise reduction model is dominated by a certain task in the training process and the final performance is deteriorated, the weight of each task in the multitask is determined by using the variance uncertainty. The derivation process of the multitask objective function is as follows:
assuming that the input of the model is x and the weights of the model are W, the output of the model is f^w(x).

Then, for a regression task, a probability model of the Gaussian likelihood function can be defined:

p(y | f^w(x)) = N(f^w(x), σ²);

wherein N denotes the Gaussian likelihood function, whose mean is the output f^w(x) of the model; the standard deviation σ of the Gaussian likelihood function, which also serves as the noise of the model, is obtained statistically from the output f^w(x) of the model.
For multiple tasks, a probability model of the multi-task likelihood function can be defined as:

p(y1, ..., yk | f^w(x)) = p(y1 | f^w(x)) · ... · p(yk | f^w(x));
then, for two regression tasks with outputs y1 and y2 respectively, the probability model of the multi-task objective function based on the model parameters W, σ1 and σ2 is:

p(y1, y2 | f^w(x)) = p(y1 | f^w(x)) · p(y2 | f^w(x))
                   = N(y1; f^w(x), σ1²) · N(y2; f^w(x), σ2²);
and taking the negative logarithm of this likelihood further yields the objective function for multiple regression tasks:

L(W, σ1, σ2) = (1 / (2σ1²)) · L1(W) + (1 / (2σ2²)) · L2(W) + log(σ1 · σ2);

where L1 and L2 represent the loss functions of the two regression tasks, respectively, and the goal of the model is to minimize this multi-task objective function L(W, σ1, σ2). It can be seen from the objective function that as the noise of a task's model increases, the corresponding weight of that loss decreases, and if the model noise decreases, the weight in the loss function increases.
For the embodiment of the present application, the multiple tasks include a time domain noise reduction task and a noise scene classification task; the former is a regression task and the latter is a classification task. A classification task generally feeds the network output into a Softmax function to obtain classification probabilities and determines the final classification result from these probabilities, that is, the probability model of the classification task is: p(y | f^w(x)) = Softmax(f^w(x));
Then, referring to the above multi-task objective function of the two regression tasks, the multi-task objective function of the time domain noise reduction task and the noise scene classification task in the embodiment of the present application may be determined as:

L(W, σ1, σ2) = (1 / (2σ1²)) · L1(W) + (1 / (2σ2²)) · L2(W) + log(σ1 · σ2);

wherein L1 represents the second loss corresponding to the time domain noise reduction task, and L2 represents the third loss corresponding to the noise scene classification task; σ1 represents the standard deviation statistically obtained from the output results of the second submodel, and σ2 represents the standard deviation statistically obtained from the output results of the third submodel.
In one embodiment, the constructing step of the multitask objective function includes: obtaining a plurality of noise reduction signals obtained by carrying out noise reduction processing on time domain signals corresponding to the plurality of sample audio signals by a second submodel, and determining a second lost weight according to standard deviations of the plurality of noise reduction signals; obtaining a plurality of noise scene categories obtained by carrying out noise classification on time domain signals corresponding to the plurality of sample audio signals by a third submodel, and determining a third loss weight according to standard deviations of the plurality of noise scene categories; and fusing the second loss and the third loss according to respective weights to obtain the multitask objective function.
Based on the multi-task objective function obtained through the derivation, the computer device may construct the multi-task objective function according to a second loss corresponding to the time domain denoising task of the second submodel and a third loss corresponding to the noise scene classification task of the third submodel. Wherein the second loss is determined according to a difference between the noise reduction signal output by the second submodel and the clean audio signal of the generated sample audio signal, and the weight of the second loss is determined according to a standard deviation of the noise reduction signal corresponding to the plurality of sample audio signals output by the second submodel. The third penalty is determined from a difference between a noise scene class output by the third submodel and a label class of a noise signal used to generate the sample audio signal, and a weight of the third penalty is determined from a standard deviation of noise scene classes corresponding to a plurality of sample audio signals output by the third submodel.
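Under these definitions, the multi-task objective can be sketched as a plain function; how σ1 and σ2 are statistically estimated from the submodel outputs is left abstract here:

```python
import torch

def multitask_objective(loss2: torch.Tensor, loss3: torch.Tensor,
                        sigma1: torch.Tensor, sigma2: torch.Tensor) -> torch.Tensor:
    # sigma1 / sigma2: standard deviations statistically obtained from the
    # outputs of the second (noise reduction) and third (classification)
    # submodels; a noisier task receives a smaller weight on its loss.
    return (loss2 / (2 * sigma1 ** 2)
            + loss3 / (2 * sigma2 ** 2)
            + torch.log(sigma1 * sigma2))
```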
And 808, performing model training on the frequency domain processing submodel, the second submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, a trained time domain processing submodel and a trained noise classification submodel.
Specifically, when the computer device performs multitask training, a sample audio signal is input into the frequency domain processing submodel, after a corresponding frequency domain coding characteristic is output through the frequency domain processing submodel, the frequency domain coding characteristic is subjected to inverse fourier transform to obtain a time domain signal, the time domain signal is input into the second submodel and the third submodel, a noise reduction signal is output through the processing of the second submodel, and a noise scene type is output through the third submodel. The computer equipment can compare the clean audio signal in the sample audio signal with the noise reduction signal output by the second submodel to further calculate second loss, calculate third loss according to the noise label category of the noise signal in the sample audio signal and the noise scene category output by the third submodel, construct a multi-task objective function according to the second loss and the third loss, and perform gradient back propagation according to the multi-task objective function to train the second submodel and the third submodel, obtain a trained time domain processing submodel and a trained noise classification submodel, and update the model parameters of the frequency domain processing submodel.
And 810, connecting the updated frequency domain processing submodel with the trained time domain processing submodel to obtain an audio noise reduction model for performing noise reduction processing on the audio signal.
Specifically, the computer device may connect the updated frequency domain processing submodel with the time domain processing submodel obtained through the multitask learning, as an audio noise reduction model. In other embodiments, the computer device may also connect the updated frequency domain processing submodel with the time domain processing submodel obtained through the multitask learning, and further connect with the noise classification submodel obtained through the multitask learning, so as to obtain an audio noise reduction model for simultaneously performing noise reduction processing and noise scene classification on the audio signal.
In the embodiment, the noise scene classification task and the time domain noise reduction task have higher correlation, the effect of two subtasks can be simultaneously improved based on multi-task learning, and the generalization capability of the whole model is improved.
In one embodiment, the second loss determining step comprises: coding the time domain signal through a coder in a second submodel to obtain a time domain coding vector, extracting the characteristics of the time domain coding vector through a time sequence characteristic extraction network in the second submodel to obtain hidden characteristics corresponding to the time domain signal, and decoding the time domain coding vector and the hidden characteristics through a decoder in the second submodel to obtain a noise reduction signal corresponding to the sample audio signal in a second sample set; a second loss is constructed from the noise reduction signal and the clean audio signal.
The second submodel adopts a convolution-based Encoder-Decoder structure. The Encoder-Decoder structure converts an input sequence into another output sequence. In this framework, the encoder converts the sequence corresponding to the input time domain signal into a vector, and the decoder receives that vector and generates the output sequence in temporal order. The encoder and decoder may employ the same type of neural network model or different types. For example, the encoder and decoder may both be CNN (Convolutional Neural Network) models, or the encoder may use an RNN (Recurrent Neural Network) model while the decoder uses a CNN model. The timing feature extraction network of the second submodel may use an LSTM model, for example a 2-layer LSTM network, which can learn the temporal dependencies of the input time domain signal.
Specifically, the computer equipment inputs the time domain signal output by the frequency domain submodel into a second submodel, converts the input time domain signal into a time domain coding vector through an encoder of the second submodel, extracts the inherent relation of the time domain coding vector on a time sequence through an intermediate time sequence characteristic extraction network, and excavates hidden characteristics. In this embodiment, the time series feature extraction network is an intermediate layer, also called a hidden layer, with respect to the encoder as an input layer and the decoder as an output layer, and therefore the features extracted by the time series feature extraction network are called hidden features. And finally, decoding the signal based on the time domain coding vector and the hidden feature through a decoder, and outputting a noise reduction signal. The computer device constructs a second loss according to the noise reduction signal output by the second submodel and the clean audio signal used for generating the sample audio signal.
Fig. 9(a) is a schematic structural diagram of a second submodel for training to obtain a time domain processing submodel in an embodiment. Referring to fig. 9(a), the second sub-model adopts a convolution-based Encoder-Decoder structure, the Encoder and Decoder may be 1-dimensional convolution, and the middle layer of the second sub-model adopts an LSTM network.
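A hedged sketch of such a second submodel; the channel counts, kernel sizes, and the way the decoder combines the coding vectors with the hidden features (simple addition here) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TimeDomainDenoiser(nn.Module):
    """Sketch: 1-D conv encoder, 2-layer LSTM middle, 1-D deconv decoder."""

    def __init__(self, channels: int = 256, kernel: int = 16, stride: int = 8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.rnn = nn.LSTM(channels, channels, num_layers=2, batch_first=True)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) time domain signal
        enc = self.encoder(wav)                 # time domain coding vectors
        hid, _ = self.rnn(enc.transpose(1, 2))  # hidden features over the time axis
        mix = enc + hid.transpose(1, 2)         # decoder sees codes + hidden features
        return self.decoder(mix)                # noise reduction signal
```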
In one embodiment, constructing the second loss from the noise reduction signal and the clean audio signal comprises: projecting the noise reduction signal corresponding to the sample audio signal to the vertical direction and the horizontal direction of the clean audio signal respectively to obtain a vertical projection vector and a horizontal projection vector; and obtaining a second loss according to the vertical projection vector and the horizontal projection vector.
Specifically, the computer device may project the vector corresponding to the output noise reduction signal onto the direction of the vector corresponding to the clean audio signal and onto the direction perpendicular to it. Taking the horizontal projection vector as the numerator and the vertical projection vector as the denominator, the second loss is obtained after taking the negative logarithm. It can be seen that the second loss is smaller when the noise reduction signal output by the second submodel is parallel to the clean audio signal; the larger the deviation between the direction of the vector corresponding to the noise reduction signal and that of the clean audio signal, the larger the vertical projection vector, and the larger the second loss value.
In one embodiment, the second loss may be determined by the following equations:

Y_T = (⟨Y, Y_true⟩ / ||Y_true||²) · Y_true;
Y_E = Y − Y_T;
Loss2 = −10 · log10(||Y_T||² / ||Y_E||²);

wherein Y_E represents the vertical projection vector, Y_T represents the horizontal projection vector, Y_true represents the clean audio signal, Y represents the noise reduction signal output by the second submodel, and ⟨·,·⟩ denotes the inner product.
Referring to fig. 9(b), a schematic diagram of the relationship between the second loss and the projection vectors in one embodiment is shown. Referring to the right side of fig. 9(b), when the noise reduction signal Y output by the second submodel is parallel to the clean audio signal Y_true, the horizontal projection vector Y_T is at its maximum and the vertical projection vector Y_E at its minimum, so the second loss is smallest; the larger the deviation between the direction of the vector corresponding to the noise reduction signal and that of the clean audio signal, the smaller the horizontal projection vector Y_T, the larger the vertical projection vector Y_E, and the larger the second loss value.
In this embodiment, by projecting the vector corresponding to the noise reduction signal output by the second submodel onto the direction of the clean audio signal, the difference between the clean audio signal and the output noise reduction signal can be represented reasonably and effectively.
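A sketch of this projection-based loss, following the equations above (batch handling and reduction to a mean are our choices):

```python
import torch

def projection_loss(denoised: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    # Horizontal projection Y_T: the component of the estimate along Y_true.
    scale = (denoised * clean).sum(-1, keepdim=True) / (clean ** 2).sum(-1, keepdim=True)
    y_t = scale * clean
    # Vertical projection Y_E: the residual orthogonal to Y_true.
    y_e = denoised - y_t
    # Negative log energy ratio: smallest when the noise reduction signal
    # is parallel to the clean signal, larger as the deviation grows.
    return -(10 * torch.log10((y_t ** 2).sum(-1) / (y_e ** 2).sum(-1))).mean()
```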
In one embodiment, the third loss determining step comprises: coding the time domain signal through a coder in a third submodel to obtain a time domain coding vector, extracting the characteristics of the time domain coding vector through a time sequence characteristic extraction network in the third submodel to obtain hidden characteristics corresponding to the time domain signal, and predicting the noise scene category of the sample audio signal based on the hidden characteristics through an output layer in the third submodel; and constructing a third loss according to the noise scene category and the noise label category of the noise signal used for generating the sample audio signal.
Specifically, the computer device inputs the time domain signal output by the frequency domain submodel into a third submodel, converts the input time domain signal into a time domain coding vector through an encoder of the third submodel, extracts the inherent relation of the time domain coding vector on the time sequence through an intermediate time sequence feature extraction network, and excavates the hidden feature. And finally outputting the noise scene category of the sample audio signal based on the hidden features through an output layer. And the computer equipment constructs a third loss according to the noise scene category output by the third submodel and the noise label category of the noise signal used for generating the sample audio signal. The timing feature extraction network of the third submodel may employ an LSTM, for example, a layer 2 LSTM network.
The output layer of the third submodel comprises a full connection layer, an activation layer and a normalization layer. And a full connection layer in the output layer receives the hidden features extracted by the feature extraction network, matrix multiplication is carried out on the hidden features and model parameters corresponding to the full connection layer, the hidden features are mapped to a sample space, and finally, after nonlinear characteristics are introduced through the activation layer, noise scene categories corresponding to sample audio signals are output through a normalization layer (softmax function).
In one embodiment, the third loss corresponding to the noise scene classification task may be represented by a cross entropy loss function, expressed by the following formula:

Loss3 = −∑_{i=1}^{K} y_i · log(p_i);

wherein K represents the total number of noise classes; for example, if noise from 5 different scenes is used to generate the sample audio signals, the value of K is 5. y_i represents the noise label class of the noise signal used to generate the currently processed sample audio signal; for example, if the true label is the i-th class, y_i is 1, otherwise y_i is 0. p_i represents the noise scene class of the sample audio signal predicted by the third submodel, i.e. the probability that the noise belongs to class i, calculated by the softmax function of the output layer.
Fig. 10 is a schematic structural diagram of a third sub-model for training to obtain a noise classification sub-model in an embodiment. Referring to fig. 10, the input layer of the third submodel is an encoder, the middle layer of the third submodel adopts a two-layer LSTM structure, and noise scenes are classified according to noise information learned by the structure.
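A hedged sketch of such a third submodel; taking the last time step for classification and all layer sizes are our assumptions, and softmax is applied implicitly by nn.CrossEntropyLoss during training:

```python
import torch
import torch.nn as nn

class NoiseSceneClassifier(nn.Module):
    """Sketch: conv encoder, 2-layer LSTM, fully connected output layer."""

    def __init__(self, num_classes: int = 5, channels: int = 256):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, 16, stride=8)
        self.rnn = nn.LSTM(channels, channels, num_layers=2, batch_first=True)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        enc = self.encoder(wav).transpose(1, 2)  # time domain coding vectors
        hid, _ = self.rnn(enc)                   # hidden features
        return self.fc(hid[:, -1])               # logits over K noise scene classes

# Third loss: cross entropy between the predicted scene and the noise label.
criterion = nn.CrossEntropyLoss()
```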
Fig. 11 is a schematic diagram of a model structure for performing multi-task learning on a time-domain noise reduction task and a noise scene classification task in an embodiment. Referring to fig. 11, a sample audio signal is input to a frequency domain processing submodel 1102, the sample audio signal is transformed into a frequency domain signal by a fourier forward transform model in the frequency domain processing submodel 1102, the frequency domain signal includes a real part sequence and an imaginary part sequence, the real part sequence and the imaginary part sequence are respectively feature-encoded by a real part processing network and an imaginary part processing network based on complex-LSTM in the frequency domain processing submodel to obtain real part attention and imaginary part attention, frequency domain encoding features are obtained based on the real part attention, the imaginary part attention, the real part sequence and the imaginary part sequence, the frequency domain processing submodel is transformed into a time domain signal by an inverse fourier transform module in the frequency domain processing submodel, and the time domain processing submodel 1104 and the noise classification submodel 1106 are respectively input. The time domain signals are encoded through an encoder in the time domain processing submodel 1104 to obtain time domain encoding vectors, the time domain encoding vectors are subjected to feature extraction through a time sequence feature extraction network based on LSTM in the time domain processing submodel 1104 to obtain hidden features corresponding to the time domain signals, the time domain encoding vectors are input into a decoder in the time domain processing submodel 1104, the decoder outputs noise reduction signals corresponding to the sample audio signals, and the output noise reduction signals and the clean audio signals in the sample audio signals can construct second loss. The time domain signal is noise classified by the noise classification submodel 1106, and the output noise scene class and the noise label class of the noise signal in the sample audio signal can construct a third loss. And constructing a multitask objective function according to the second loss and the third loss to carry out multitask learning.
In an embodiment, before performing the multi-task learning, the computer device may, after obtaining the frequency domain processing submodel, connect a second submodel to it and pre-train the second submodel to obtain a pre-trained time domain processing submodel; the computer device then introduces a third submodel for classifying noise scene categories and performs integrated training of the entire model based on a multitask objective function covering the pre-trained time domain processing submodel and the third submodel.
As shown in fig. 12, in an embodiment, in step 208, after connecting the frequency domain processing submodel and the time domain processing submodel to be trained, training together to obtain an audio noise reduction model for performing noise reduction processing on an audio signal, the method includes:
and step 1202, connecting the frequency domain processing submodel with a second submodel and a third submodel based on a neural network.
Specifically, the time domain noise reduction task and the noise scene classification task are two parallel subtasks, and when multi-task learning is required, the computer device may connect the frequency domain processing submodel with the second submodel and the third submodel, respectively.
And step 1204, transforming the frequency domain coding features obtained by the frequency domain processing submodel into a time domain signal.

Specifically, the time domain noise reduction task and the noise scene classification task both process signals from the time domain perspective. Therefore, the computer device transforms the frequency domain coding features obtained by the frequency domain processing submodel into a time domain signal, for example by performing an inverse Fourier transform on the frequency domain coding features.
And step 1206, performing model training on the frequency domain processing submodel and the second submodel together, based on a second loss determined from the clean audio signal and the noise reduction signal obtained by the second submodel denoising the time domain signal, to obtain an updated frequency domain processing submodel and a pre-trained time domain processing submodel.
The computer device first connects the frequency domain processing submodel to the second submodel and pre-trains the second submodel before the multi-task learning, so that the second submodel enters the subsequent multi-task learning with good initial network parameters, which accelerates the joint learning with the noise classification subtask.
In some embodiments, when training the frequency domain processing submodel and the second submodel together, the computer device performs gradient back propagation according to the second loss of the time domain noise reduction task corresponding to the second submodel, training the second submodel into the pre-trained time domain processing submodel while also updating the model parameters of the frequency domain processing submodel.
And step 1208, inputting the sample audio signal into the updated frequency domain processing submodel to obtain corresponding frequency domain coding features, transforming the frequency domain coding features into a time domain signal, and inputting the time domain signal into the pre-trained time domain processing submodel and the third submodel respectively.
After the pre-trained time domain processing submodel is obtained, for each subsequent sample audio signal used for multi-task learning, the computer device inputs the time domain signal output by the frequency domain processing submodel into the pre-trained time domain processing submodel and the third submodel simultaneously.
And step 1210, constructing a multitask objective function according to the second loss of the time domain noise reduction task corresponding to the pre-trained time domain processing submodel and the third loss of the noise scene classification task corresponding to the third submodel.
And 1212, performing model training on the frequency domain processing submodel, the trained time domain processing submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, an updated time domain processing submodel and a noise classification submodel.
The computer device performs gradient back propagation according to the multitask objective function, training the third submodel into the noise classification submodel while simultaneously updating the model parameters of the frequency domain processing submodel and the time domain processing submodel.
And step 1214, connecting the updated frequency domain processing submodel with the updated time domain processing submodel to obtain an audio noise reduction model for performing noise reduction processing on the audio signal.
Specifically, the computer device may connect the updated frequency domain processing submodel with the time domain processing submodel obtained through the multitask learning, as an audio noise reduction model. In other embodiments, the computer device may also connect the updated frequency domain processing submodel with the time domain processing submodel obtained through the multitask learning, and further connect with the noise classification submodel obtained through the multitask learning, so as to obtain an audio noise reduction model for simultaneously performing noise reduction processing and noise scene classification on the audio signal.
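For illustration, one joint training step over steps 1208 to 1212 might look like the following sketch; it reuses the hypothetical multitask_forward from above, the submodels and batch tensors are assumed to exist, and the fixed weights w2 and w3 are placeholders for whatever weighting is actually used (the application elsewhere mentions determining the weights by variance uncertainty):

```python
import torch

# Hypothetical submodels (freq_model, time_model, noise_cls) and batch tensors
# (noisy, clean, labels); w2 and w3 are placeholder task weights.
w2, w3 = 1.0, 0.5
params = (list(freq_model.parameters()) + list(time_model.parameters())
          + list(noise_cls.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

second_loss, third_loss = multitask_forward(freq_model, time_model, noise_cls,
                                            noisy, clean, labels)
multitask_objective = w2 * second_loss + w3 * third_loss  # step 1210
optimizer.zero_grad()
multitask_objective.backward()  # gradients reach all three submodels (step 1212)
optimizer.step()
```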
FIG. 13 is a diagram illustrating the steps of training an audio noise reduction model according to an embodiment. Referring to fig. 13, the training of the model is divided into three steps. First, the frequency domain processing submodel is learned: it takes a noisy sample audio signal as input and outputs real part attention and imaginary part attention through the complex-LSTM network. Second, the pre-trained frequency domain processing submodel is connected to the time domain processing submodel: the noisy sample audio signal is input into the pre-trained frequency domain processing submodel, the time domain processing submodel outputs a noise reduction signal, and the time domain processing submodel is trained with SI-SNR as the loss function computed from the clean audio signal in the sample audio signal and the noise reduction signal. Finally, the noise classification submodel is connected, multi-task learning is performed on the whole network to obtain the complete audio noise reduction model, and the performance of the whole audio noise reduction model is observed on the test set.
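The SI-SNR loss used in the second step can be sketched as follows: a standard scale-invariant signal-to-noise ratio, negated so that minimizing the loss maximizes SI-SNR. The function name and shapes are illustrative:

```python
import torch

def si_snr_loss(estimate: torch.Tensor, clean: torch.Tensor, eps: float = 1e-8):
    """Negative scale-invariant SNR, averaged over a [batch, time] batch."""
    # remove the DC offset so the measure is invariant to constant shifts
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    clean = clean - clean.mean(dim=-1, keepdim=True)
    # project the estimate onto the clean signal to get the target component
    dot = torch.sum(estimate * clean, dim=-1, keepdim=True)
    s_target = dot * clean / (clean.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_snr.mean()  # minimize the negative SI-SNR to maximize SI-SNR
```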
In a specific task, increasing the number of model parameters can improve the representation learning ability of the model and its index performance to a certain extent, but models are constrained by various performance requirements and usually need to reach the best possible index evaluation under a limited parameter budget. When the two subtasks of noise reduction and classification are completed simultaneously, the network structure used for a single task already carries a large computation cost, which hinders model deployment. To ensure the performance of the model, the computer device may apply a distillation strategy to the audio noise reduction model, further improving its evaluation index on the audio time domain noise reduction task.
FIG. 14 is a schematic flow chart illustrating model distillation training for the temporal noise reduction submodel according to an embodiment. Referring to fig. 14, the method includes the steps of:
and 1402, obtaining a teacher model for performing noise reduction processing on the audio signal according to the audio noise reduction model, and constructing a student model for performing noise reduction processing on the audio signal according to the frequency domain processing sub-model and the light-weight time domain noise reduction network.
Specifically, the computer device may first set a teacher model with a large amount of network parameters, and obtain a trained teacher model under full data training according to the embodiment provided above. In addition, the computer equipment constructs a student model for carrying out noise reduction processing on the audio signal according to the trained frequency domain processing submodel and the light-weight time domain noise reduction network.
Compared with the time domain processing submodel in the teacher model, the lightweight time domain noise reduction network is smaller in size, has fewer network parameters and requires less computation; for example, the teacher model may have about 10 times as many network parameters as the student model. The teacher model is an audio noise reduction model with a large volume, many network parameters and a large computation cost; trained on the full data, it achieves better evaluation indexes than the student model and can therefore guide the student model on the audio time domain noise reduction task.
The first noise reduction signal is a prediction result obtained by performing noise reduction processing on an input sample audio signal by a teacher model, and the prediction result can be used as labeling information of a student model to guide the student model to learn. In addition, the first time domain encoding vector is an encoding vector obtained by encoding the time domain signal by an encoder in the time domain processing submodel in the teacher model.
And the second noise reduction signal is a prediction result obtained by performing noise reduction processing on the input sample audio signal by the student model. The second time domain coding vector is a coding vector obtained by coding a time domain signal by a coder in a lightweight time domain noise reduction network in the student model.
In this embodiment, the model distillation training refers to guiding the training of the student model by using the prediction result output by the trained teacher model with high accuracy, so as to realize the transfer of knowledge. Therefore, the computer equipment can construct a model distillation loss function according to the second noise reduction signal output by the student model and the first noise reduction signal output by the trained teacher model, and update the parameters of the student model by using the model distillation loss function.
In order to improve the performance of the student model while further optimizing the evaluation index, the teacher model and the student model are aligned at the coding layer, the decoding layer and the output result, all of which have definite physical meanings.
First, the coding layer and the decoding layer are aligned: the first time domain coding vector and the second time domain coding vector are the vectors output by the coding layers, and the first noise reduction signal and the second noise reduction signal are the results output by the decoding layers. The Mean Square Error (MSE) is used to promote global approximation of the feature distributions of the two models at the coding layer and the decoding layer.
In addition, the time domain coding vectors and noise reduction signals output by the encoders and decoders of the student model and the teacher model are normalized before the mean square error loss is calculated; this prevents the feature distributions of the two models from being distorted by noise or outliers, and the normalization operation yields a certain index improvement.
The normalization operation is shown in the following equation:

μ(Z) = (Z − mean(Z)) / std(Z), with mean(Z) = (1/c) Σ_{i=1}^{c} Z_i;

wherein Z represents an original vector, e.g. a time domain coding vector output by a coding layer or the vector corresponding to a noise reduction signal output by a decoding layer, Z_i represents the i-th component of Z, and c represents the dimension of the original vector.
Then, the mean square error loss of the coding layer and the decoding layer can be represented by the following formula:

L_mse = (1/|χ|) Σ_{x_i∈χ} [ MSE(μ(t_i^e), μ(s_i^e)) + MSE(μ(t_i), μ(s_i)) ];

wherein MSE denotes the mean square error, χ represents the set of sample audio signals in an input batch, |χ| is the number of sample audio signals in the set, and x_i represents any sample audio signal in the set. t_i represents the noise reduction signal output by the decoder in the time domain processing submodel of the teacher model for the sample audio signal x_i, and s_i represents the noise reduction signal output by the decoder in the lightweight time domain noise reduction network of the student model for x_i. t_i^e represents the first time domain coding vector output by the encoder in the time domain processing submodel of the teacher model for x_i, and s_i^e represents the second time domain coding vector output by the encoder of the lightweight time domain noise reduction network in the student model for x_i. L_mse is the mean square error loss between the outputs of the teacher model and the student model at the coding layer and the decoding layer.
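Under the reconstruction above, the alignment loss can be sketched as follows; the zero-mean, unit-variance form of μ is an assumption, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def normalize(z: torch.Tensor, eps: float = 1e-8):
    """mu: zero-mean, unit-variance normalization over the feature dimension (assumed form)."""
    return (z - z.mean(dim=-1, keepdim=True)) / (z.std(dim=-1, keepdim=True) + eps)

def alignment_mse(t_enc, s_enc, t_dec, s_dec):
    """L_mse: align the student's encoder/decoder outputs with the teacher's.

    t_enc/s_enc: first/second time domain coding vectors;
    t_dec/s_dec: first/second noise reduction signals (matching shapes assumed).
    """
    return (F.mse_loss(normalize(s_enc), normalize(t_enc)) +
            F.mse_loss(normalize(s_dec), normalize(t_dec)))
```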
In addition, this embodiment also prompts the student model to fit the audio output of the teacher model at the data structure level, by comparing the internal relations of the output results of the teacher model and the student model at the data level. The data structure loss comprises a distance loss and an angle loss over the output results of the teacher model and the student model.
Wherein the distance loss can be expressed by the following formula:

L_D = (1/|χ²|) Σ_{(x_i,x_j)∈χ²} l_δ( ‖μ(t_i) − μ(t_j)‖₂ , ‖μ(s_i) − μ(s_j)‖₂ );

wherein (x_i, x_j) is any pair of sample audio signals in an input batch, χ² represents the set formed by pairs of sample audio signals in the batch, and |χ²| is the number of such pairs in the set. t_i represents the noise reduction signal output by the teacher model for the sample audio signal x_i, t_j the noise reduction signal output by the teacher model for x_j, s_i the noise reduction signal output by the student model for x_i, and s_j the noise reduction signal output by the student model for x_j; μ represents the normalization processing of the output data; l_δ represents the Huber loss; L_D represents the loss in data distance between the teacher model output and the student model output.
Wherein the angle loss can be expressed by the following formula:

L_A = (1/|χ³|) Σ_{(x_i,x_j,x_k)∈χ³} l_δ( ψ_A(t_i,t_j,t_k), ψ_A(s_i,s_j,s_k) );

ψ_A(t_i,t_j,t_k) = cos∠t_i t_j t_k = ⟨e_ij, e_kj⟩, with e_ij = (t_i − t_j)/‖t_i − t_j‖₂ and e_kj = (t_k − t_j)/‖t_k − t_j‖₂;

ψ_A(s_i,s_j,s_k) = cos∠s_i s_j s_k, defined analogously on the student outputs s_i, s_j, s_k;

wherein (x_i,x_j,x_k) is any triple of sample audio signals in an input batch, χ³ represents the set of such triples, and |χ³| is the number of triples in the set. t_i and t_j represent the noise reduction signals output by the teacher model for x_i and x_j, e_ij and e_kj represent the unit vectors along the directions t_i − t_j and t_k − t_j, and s_i and s_j represent the noise reduction signals output by the student model for x_i and x_j; μ represents the normalization processing of the output data; l_δ represents the Huber loss; L_A represents the loss between the teacher model output and the student model output in terms of data angles.
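These two terms can be sketched in the style of relational knowledge distillation, assuming batched, normalized outputs of shape [batch, dim]; torch.cdist computes the pairwise distances, F.smooth_l1_loss plays the role of the Huber loss l_δ, and the μ helper is repeated for self-containment:

```python
import torch
import torch.nn.functional as F

def normalize(z: torch.Tensor, eps: float = 1e-8):
    """mu: zero-mean, unit-variance normalization (assumed form)."""
    return (z - z.mean(dim=-1, keepdim=True)) / (z.std(dim=-1, keepdim=True) + eps)

def rkd_distance_loss(t, s):
    """L_D: match pairwise distances between normalized teacher/student outputs."""
    td = torch.cdist(normalize(t), normalize(t))  # ||mu(t_i) - mu(t_j)||_2 for all pairs
    sd = torch.cdist(normalize(s), normalize(s))
    return F.smooth_l1_loss(sd, td)               # Huber loss, averaged over pairs

def rkd_angle_loss(t, s):
    """L_A: match the cosines of the angles formed by output triplets."""
    t, s = normalize(t), normalize(s)
    te = F.normalize(t.unsqueeze(0) - t.unsqueeze(1), p=2, dim=-1)  # unit vectors e_ij
    se = F.normalize(s.unsqueeze(0) - s.unsqueeze(1), p=2, dim=-1)
    t_angle = torch.einsum('ijd,kjd->ijk', te, te)  # <e_ij, e_kj> over all triplets
    s_angle = torch.einsum('ijd,kjd->ijk', se, se)
    return F.smooth_l1_loss(s_angle, t_angle)
```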
The computer device can determine the model distillation loss L_KD according to the following formula:

L_KD = λ_mse · L_mse + λ_D · L_D + λ_A · L_A;

wherein L_mse is the mean square error loss of the output features of the coding layer and the decoding layer, L_D is the distance comparison loss between the output features of the teacher model and the student model, L_A is the angle comparison loss between the output features of the teacher model and the student model, and λ_mse, λ_D, λ_A respectively represent the tunable hyperparameters weighting the corresponding losses.
FIG. 15 is a block diagram illustrating model distillation training for the time domain noise reduction submodel according to an embodiment. Referring to fig. 15, a teacher model is constructed and trained according to the previous embodiments to obtain a trained teacher model. When the student model is constructed, the frequency domain processing submodel of the teacher model is reused directly and connected to a lightweight time domain noise reduction network; the combination of the two forms the student model. Model distillation training is carried out on the lightweight time domain noise reduction network of the student model according to the mean square error loss between the output features of the coding layers and decoding layers of the teacher model and the student model, and the data structure loss of the decoding layer output features. After the student model has basically converged, the lightweight time domain noise reduction network of the student model is fine-tuned using the multi-task learning training mode described above, finally yielding the trained student model.
In one specific embodiment, referring to the model structure of fig. 11, the design parameters of the audio noise reduction model based on attention mechanism and multitask learning are listed as follows:
the fourier transform here uses a short-time fourier transform which divides a longer time signal into shorter segments of the same length, on each of which a fourier transform, i.e. a fourier spectrum, is calculated. The sampling rate is a sampling rate of the audio signal, and the discrete audio sequence can be obtained by sampling the audio signal according to the sampling rate. The length of the sample audio is the duration of the audio signal, and may be 15s, for example. The fourier transform window length is the length of the above-described shorter segment divided into the same length, and is set to 512, for example. And Fourier transform window overlapping rate used for determining the displacement step of the window. The Batch Size is the number of samples of the audio signal per Batch of the input audio noise reduction model. The number of the LSTM hidden units is the number of the hidden units in each time step in the real part processing network and the imaginary part processing network of each layer in the frequency domain processing sub-network, for example, 128 hidden units. The number of LSTM layers is the number of layers of the real part processing network and the imaginary part processing network in the frequency domain processing sub-network, for example, 2 layers. The inactivation rate of the fully-connected layer is the proportion of neurons that do not participate in the operation in the fully-connected layer, and for example, 25% of the neurons may be randomly selected to not participate in the operation. The number of convolutional layer channels is the depth of convolutional layers in the time domain processing sub-network and the noise classification sub-network, and the size of the convolutional kernel refers to the size of the convolutional kernel adopted by an encoder and a decoder in the time domain processing sub-network and the noise classification sub-network.
The numerical values of the parameters above are for illustration only. Having obtained the embodiments provided by the present application, a person skilled in the art may adjust the values of these parameters according to actual needs; the resulting models also belong to the protection scope of the present application.
According to the processing method of the audio noise reduction model, an end-to-end deep learning model is adopted: the input is audio containing noise, and the output is the denoised audio together with the current noise scene category. Based on a deep learning network structure design, an attention-based audio frequency domain information learning network is provided, so that noise and audio can be better distinguished. In the first half of the model, the common LSTM structure is improved into a complex-number-based complex-LSTM structure so that the frequency domain information of the audio signal can be fully learned, while the second half of the model further supplements this learning in the time domain. In addition, an audio noise reduction and audio classification method combining an attention mechanism with multi-task learning is provided, yielding an end-to-end model that realizes noise reduction and noise scene classification simultaneously; it is further proposed to determine the weights of the multitask loss function by variance uncertainty. Finally, to further improve the model effect, the adopted model distillation training method raises the performance upper limit of the model without changing the parameter magnitude; model deployment is considered during model training, the performance of the time domain noise reduction submodel is further improved, and the engineering practicability of the model is increased.
In one embodiment, as shown in fig. 16, an audio noise reduction method is provided, which is described by taking the method as an example applied to the computer device (terminal 102 or server 104) in fig. 1, and includes the following steps:
in step 1602, an original audio signal in the time domain is obtained.
Wherein the original audio signal is a sound signal to be subjected to noise reduction processing. For example, the original audio signal may be the voice signal of a voice call, the sound signal captured while recording or playing a song, or the recorded sound signal of a speech.
The original audio signal is a signal carrying a noise signal, since the original audio signal is in a variety of noisy environments when generated. For example, when video and audio calls are made on a daily basis, a caller may be in a variety of noisy environments, so that various noises are entrained in a voice call signal, such as: sounds made by vehicles coming and going on noisy streets, cluttered background sounds of multi-person conversations in dining halls, loud keyboard clicks and mouse clicks in office scenes, and the like. In the conversation process, a user naturally wants to keep the undisturbed and high-quality conversation level, and then through the audio noise reduction method provided by the embodiment of the application, the computer equipment can perform noise reduction processing on the original audio signal so as to obtain a corresponding noise reduction signal.
Both the time domain and the frequency domain are fundamental properties of a signal. Analyzing a signal along different dimensions means approaching the problem from different angles, and each such angle may be called a domain. The time domain reflects the correspondence between a mathematical function or physical signal and time; it is the feedback of the real world and the only objectively existing domain. The frequency domain is a coordinate system used to describe the frequency characteristics of a signal, showing the signal content over a frequency range; it is an auxiliary way of thinking constructed from a mathematical point of view.
In the embodiment of the present application, the original audio signal acquired by the computer device is a time domain signal. The original audio signal may be a continuous time domain signal representing the variation of audio intensity over continuous time, or a discrete time domain signal representing the variation of audio intensity over sampled points. For example, the original audio signal may be a 15 s audio signal, or the discrete sequence obtained by sampling it; at a sampling rate of 16K, a 15 s original audio signal is a discrete sequence of length 240000. Optionally, after acquiring a continuous original audio signal, the computer device may input it into the trained frequency domain processing submodel, which samples the signal according to the sampling rate and divides it into a plurality of discrete sub-sequences according to the preset window length. Optionally, the computer device may instead itself sample the continuous original audio signal and divide it into a plurality of discrete sub-sequences according to the preset window length, and then input the discrete sub-sequences into the trained frequency domain processing submodel.
In some embodiments, the computer device may capture the audio signal instantly by a local audio capture device, and use the captured audio signal as an original audio signal, for example, the voice signal of the user captured by the terminal 102 in fig. 1 through a microphone as an original audio signal. In other embodiments, the original audio signal may be a signal transmitted from another computer device.
The frequency domain processing submodel is a model, obtained by training in advance, that has learned to extract the frequency domain characteristics of the clean audio signal in the original audio signal. The frequency domain processing submodel may adopt a neural-network-based deep learning model, such as LSTM (Long Short-Term Memory network), a recurrent neural network with a special structure that can learn the long-term dependencies of long input sequences. In this embodiment, the frequency domain processing submodel is able to learn the inherent associations between the discrete sub-sequences of the original audio signal.
The frequency domain processing submodel may be structurally partitioned according to function. In this embodiment, to make full use of the amplitude and phase information of the original audio signal, the frequency domain processing submodel includes a real part processing network and an imaginary part processing network: the real part processing network is designed to utilize the real part information of the original audio signal, and the imaginary part processing network to utilize its imaginary part information. Both networks may be LSTM-based structures containing at least one LSTM layer each; that is, the overall LSTM structure capable of processing frequency domain signals in complex form comprises LSTMs processing the real part sequence and LSTMs processing the imaginary part sequence, and may be referred to as complex-LSTM.
The real part processing network may be used to obtain the real part attention corresponding to the original audio signal, and the imaginary part processing network the imaginary part attention. The real part attention reflects the attention paid to the clean signal within the real part frequency domain features of the original audio signal, and the imaginary part attention reflects the attention paid to the clean signal within its imaginary part frequency domain features. The training goal of the frequency domain processing submodel is defined so that multiplying the real part feature of the output of the real part and imaginary part processing networks with the real part sequence of the original audio signal yields the real part of the clean audio signal, and multiplying the imaginary part feature of the output with the imaginary part sequence yields the imaginary part of the clean audio signal. Based on this structure, which is equivalent to Attention, the outputs of the real part processing network and the imaginary part processing network express increasingly accurate attention to the clean audio signal in the original audio signal, and are therefore called real part attention and imaginary part attention.
Generally, after an input signal is input into a neural network, network parameters in a network layer of the neural network operate on the input signal to obtain an operation result. Each layer network receives the operation result output by the previous layer network, and outputs the operation result of the layer through the operation of the layer network as the input of the next layer. In this embodiment, the real part processing network and the imaginary part processing network in the frequency domain processing submodel include at least one layer.
Specifically, the computer device may input the acquired original audio signal into the trained frequency domain processing submodel. Inside the submodel, the original audio signal is first converted into a frequency domain signal; the real part sequence and the imaginary part sequence of the frequency domain signal are then input into the real part processing network and the imaginary part processing network respectively, where they are feature-coded, i.e. the network parameters of the two networks operate on the input real part sequence and imaginary part sequence. The real part attention and imaginary part attention corresponding to the original audio signal are obtained from the output results of the last layer of the real part processing network and the imaginary part processing network.
In one embodiment, the computer device may transform the original audio signal into a frequency domain signal using a fourier forward transform in the frequency domain processing submodel, the frequency domain signal comprising a real part sequence and an imaginary part sequence.
In one embodiment, the frequency domain processing submodel may be obtained by performing model training together with a time domain processing submodel described below, and the trained frequency domain processing submodel is connected with the time domain processing submodel to serve as an audio noise reduction model. That is, the output time domain signal of the frequency domain processing submodel is input to the time domain processing submodel for further noise reduction processing, and a final noise reduction signal is obtained.
In one embodiment, the computer device may design the model structure of a deep learning model in advance to obtain an initial model, and then perform model training on the initial model with the sample audio signals to obtain the frequency domain processing submodel.
In one embodiment, the frequency domain processing submodel may include a fourier transform module, divided by function. After inputting the original audio signal into the trained frequency domain processing submodel, the computer device performs Fourier transform on the original audio signal through a Fourier transform module to obtain a corresponding frequency domain signal, wherein the frequency domain signal comprises a real part sequence and an imaginary part sequence. The Fourier transform may be a short-time Fourier transform (STFT).
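A minimal sketch of such a Fourier transform module, assuming PyTorch's torch.stft; the Hann window and hop length are assumptions, since the text only fixes the window length:

```python
import torch

def to_frequency_domain(audio: torch.Tensor, n_fft: int = 512, hop: int = 256):
    """Short-time Fourier transform of a [batch, time] signal.

    Returns the real part sequence X_r and imaginary part sequence X_i,
    each of shape [batch, freq_bins, frames].
    """
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.real, spec.imag
```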
And 1606, obtaining a frequency domain coding feature corresponding to the original audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention.
Wherein the frequency domain coding features reflect the characteristics of the audio signal from the frequency domain. As mentioned above, the real part attention and the imaginary part attention output by the real part processing network and the imaginary part processing network in the frequency domain processing submodel give increasingly accurate attention to the clean audio signal in the original audio signal. Based on the real part attention and the imaginary part attention, the computer device can therefore extract more of the components related to the clean audio signal from the real part sequence and the imaginary part sequence of the frequency domain signal of the original audio signal, and thereby obtain the frequency domain coding features corresponding to the original audio signal. It is understood that the frequency domain coding features include a real part and an imaginary part.
The time domain processing submodel is a model that performs noise reduction processing on a signal from the time domain. The time domain processing submodel may employ a neural network model; for example, its structure may combine a convolutional neural network with a long short-term memory neural network. The trained time domain processing submodel has the ability to denoise a time domain signal: its input is a time domain signal and its output is the denoised time domain signal.
Specifically, after obtaining the frequency domain coding features corresponding to the original audio signal output by the frequency domain processing submodel, the computer device may perform an inverse Fourier transform on the frequency domain coding features to obtain a time domain signal. To further improve the noise reduction effect, the computer device then performs noise reduction processing on the time domain signal through the trained time domain processing submodel to obtain the noise reduction signal corresponding to the original audio signal.
In some embodiments, the computer device may transform the frequency domain coding features into the time domain signal within the frequency domain processing submodel: in this case the frequency domain processing submodel includes an inverse Fourier transform module, performs the inverse Fourier transform on the frequency domain coding features through this module, and outputs the time domain signal as the input of the time domain processing submodel, which performs further noise reduction from the time domain perspective. In other embodiments, the computer device may transform the frequency domain coding features into the time domain signal within the time domain processing submodel: in this case the frequency domain processing submodel does not include an inverse Fourier transform module while the time domain processing submodel does; the frequency domain coding features output by the frequency domain processing submodel serve as the input of the time domain processing submodel, which performs the inverse Fourier transform on them through its inverse Fourier transform module to obtain the time domain signal.
In some embodiments, step 1608 may also be replaced with: transforming the frequency domain coding features into a time domain signal to obtain the noise reduction signal corresponding to the original audio signal. Since the frequency domain coding features represent, to a certain extent, the frequency domain features of the clean audio signal in the original audio signal, the time domain signal obtained through the inverse Fourier transform likewise represents, to a certain extent, the clean audio signal in the original audio signal, i.e. the signal obtained by noise reduction. On this basis, the computer device can directly transform the frequency domain coding features into the time domain signal as the noise reduction signal corresponding to the original audio signal, without denoising the time domain signal through a time domain processing submodel, so that the efficiency of the noise reduction processing is improved.
Referring to fig. 11, when noise reduction processing is performed on an original audio signal, the original audio signal is input to a frequency domain processing sub-model. Then, the original audio signal is transformed into a frequency domain signal through a Fourier forward transform module in the frequency domain processing submodel to obtain a real part sequence and an imaginary part sequence, and the real part sequence and the imaginary part sequence are subjected to feature coding through a real part processing network and an imaginary part processing network which are based on real part and imaginary part operation in the frequency domain processing submodel to obtain real part attention and imaginary part attention. Then, through an attention module in the frequency domain processing submodel, the frequency domain coding features corresponding to the original audio signal are obtained based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention. And finally, converting the frequency domain coding characteristics into time domain signals through an inverse Fourier transform module in the frequency domain processing submodel, inputting the time domain signals into the trained time domain processing submodel, performing signal noise reduction processing on the time domain signals through the time domain processing submodel, and outputting final noise reduction signals.
In the audio noise reduction method, the trained frequency domain processing submodel comprises a real part processing network and an imaginary part processing network. After the original audio signal of the time domain is obtained, the real part sequence and the imaginary part sequence of the original audio signal are feature-coded by these two networks respectively. This network structure makes full use of the frequency domain information of the original audio signal, namely the amplitude and phase information represented by the real part sequence and the imaginary part sequence, so that the real part attention and imaginary part attention obtained by coding give increasingly accurate attention to the clean audio signal in the original audio signal. The frequency domain coding features obtained from the real part sequence and the imaginary part sequence together with the real part attention and the imaginary part attention can therefore accurately represent the frequency domain features of the clean audio signal in the original audio signal, the time domain signal obtained from the frequency domain coding features can accurately express the clean audio signal, and the noise reduction effect is better. In addition, the subsequent noise reduction processing of the time domain signal by the time domain processing submodel can further improve the sound quality of the time domain signal, making the resulting noise reduction signal even better.
As shown in fig. 17, in an embodiment, the obtaining of the real part attention and the imaginary part attention corresponding to the original audio signal by respectively performing feature coding on the real part sequence and the imaginary part sequence obtained after the original audio signal is transformed into the frequency domain signal through the real part processing network and the imaginary part processing network in the trained frequency domain processing submodel includes:
And step 1702, inputting the original audio signal into the trained frequency domain processing submodel.

Specifically, the computer device may invoke the frequency domain processing submodel and input the acquired original audio signal into it.

And step 1704, transforming the original audio signal into a frequency domain signal through frequency domain transformation in the frequency domain processing submodel, the frequency domain signal comprising a real part sequence and an imaginary part sequence.

Specifically, the computer device may transform the original audio signal into a frequency domain signal by performing a frequency domain transformation in the frequency domain processing submodel, for example a short-time Fourier transform, to obtain the real part sequence and the imaginary part sequence of the frequency domain signal.

And step 1706, respectively performing feature coding on the real part sequence and the imaginary part sequence through a real part processing network in the frequency domain processing submodel to obtain a real part first coding feature and an imaginary part first coding feature.
And 1708, respectively performing feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part.
In order to obtain a good noise reduction effect, the computer device converts the time domain signal into a frequency domain signal and makes full use of its amplitude and phase information for feature coding, mining the frequency domain features of the clean audio signal from it; that is, feature coding is performed using the real part sequence and the imaginary part sequence. Specifically, inside the real part processing network and the imaginary part processing network, the input real part sequence and imaginary part sequence are multiplied with the internal parameter matrices of the networks.
With reference to the complex multiplication formula: for two complex numbers complex_1 = a + jb and complex_2 = c + jd, their product is

complex_1 · complex_2 = (a·c − b·d) + j(a·d + b·c);
the embodiment of the application defines the formulas by which the real part processing network and the imaginary part processing network operate on the real part sequence and the imaginary part sequence:

L_rr = LSTM_r(X_r); L_ir = LSTM_r(X_i);
L_ri = LSTM_i(X_r); L_ii = LSTM_i(X_i);
L_out = L_rr − L_ii + j(L_ri + L_ir);

wherein X_r represents the real part sequence obtained by frequency domain transformation of the input original audio signal X, and X_i represents the imaginary part sequence obtained by frequency domain transformation of X; L_rr represents the operation result of the real part processing network LSTM_r on X_r, i.e. the real part first coding feature; L_ir represents the operation result of the real part processing network LSTM_r on X_i, i.e. the imaginary part first coding feature; L_ri represents the operation result of the imaginary part processing network LSTM_i on X_r, i.e. the real part second coding feature; L_ii represents the operation result of the imaginary part processing network LSTM_i on X_i, i.e. the imaginary part second coding feature; L_out represents the operation result obtained after each layer of the real part processing network and the imaginary part processing network.
That is, the output result of each layer in the complex-LSTM is also divided into a real part having a relationship with both the real part sequence and the imaginary part sequence in the frequency domain signal corresponding to the original audio signal and an imaginary part having a relationship with both the real part sequence and the imaginary part sequence in the frequency domain signal corresponding to the original audio signal.
According to the formula defined above, L_out = L_rr − L_ii + j(L_ri + L_ir), the computer device can take the difference between the real part first coding feature L_rr output by the real part processing network and the imaginary part second coding feature L_ii output by the imaginary part processing network as the real part attention corresponding to the original audio signal, and take the sum of the real part second coding feature L_ri output by the imaginary part processing network and the imaginary part first coding feature L_ir output by the real part processing network as the imaginary part attention corresponding to the original audio signal.
In this way, the real part attention obtained by feature coding through the real part processing network and the imaginary part processing network in the frequency domain processing submodel draws on both the real part and the imaginary part of the original audio signal, and so does the imaginary part attention; the information of the original audio signal is therefore fully utilized in several aspects, which provides interpretability for the better noise reduction effect obtained subsequently.
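A minimal PyTorch sketch of one complex-LSTM layer implementing the formulas above; the module and variable names are hypothetical:

```python
import torch
import torch.nn as nn

class ComplexLSTMLayer(nn.Module):
    """One complex-LSTM layer: L_out = (L_rr - L_ii) + j(L_ri + L_ir)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.lstm_r = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_r
        self.lstm_i = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_i

    def forward(self, x_r: torch.Tensor, x_i: torch.Tensor):
        l_rr, _ = self.lstm_r(x_r)  # real part first coding feature
        l_ir, _ = self.lstm_r(x_i)  # imaginary part first coding feature
        l_ri, _ = self.lstm_i(x_r)  # real part second coding feature
        l_ii, _ = self.lstm_i(x_i)  # imaginary part second coding feature
        # real part attention, imaginary part attention of this layer
        return l_rr - l_ii, l_ri + l_ir
```

As described next, layers can be stacked by feeding the returned pair back in as the next layer's (x_r, x_i) inputs.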
In one embodiment, the real part processing network and the imaginary part processing network in the frequency domain processing submodel include at least two layers. The computer device obtains the complex result output by the real part processing network and the imaginary part processing network of the previous layer, splits it into a real part and an imaginary part, and uses them as the input of the current layer. The real part processing network and the imaginary part processing network of the current layer perform feature coding to obtain the respective coding features, which are combined according to the formula above into the complex result output by the current layer; this complex result is input into the next layer for the same feature coding and operation, and so on, until the complex result output by the last layer is obtained, which is split into a real part and an imaginary part used as the real part attention and the imaginary part attention respectively.
In some embodiments, the complex result output by the last layer may also be processed by the fully connected layer, so as to obtain the final real attention and imaginary attention. The full connection layer is used for carrying out matrix multiplication processing on the input characteristics and the network parameters corresponding to the full connection layer, and therefore corresponding characteristics are output. Specifically, the real part processing network of the last layer is connected to the first fully-connected layer, and the imaginary part processing network of the last layer is connected to the second fully-connected layer, that is, the real part of the complex results output by the real part processing network and the imaginary part processing network of the last layer is the input of the first fully-connected layer, and the imaginary part of the complex results output by the real part processing network and the imaginary part processing network of the last layer is the input of the second fully-connected layer. The first fully-connected layer may be configured to perform matrix multiplication on the real part and the network parameter corresponding to the first fully-connected layer to obtain the real part attention corresponding to the original audio signal, and the second fully-connected layer may be configured to perform matrix multiplication on the imaginary part and the network parameter corresponding to the second fully-connected layer to obtain the imaginary part attention corresponding to the original audio signal.
As shown in fig. 18, in an embodiment, the obtaining of the real part attention and the imaginary part attention corresponding to the original audio signal by respectively performing feature coding on the real part sequence and the imaginary part sequence obtained after the original audio signal is transformed into the frequency domain signal through the real part processing network and the imaginary part processing network in the trained frequency domain processing submodel includes:
step 1802, respectively performing feature coding on the real part sequence and the imaginary part sequence through a real part processing network of a first layer in the frequency domain processing submodel to obtain a real part first coding feature and an imaginary part first coding feature.
And 1804, respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network of a first layer in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part.
Specifically, in the real part processing network and the imaginary part processing network of the first layer of the frequency domain processing sub-network, the real part sequence and the imaginary part sequence of the original audio signal are input and feature-coded by the first-layer real part processing network LSTM_r_1 and imaginary part processing network LSTM_i_1, obtaining a complex result L_out_1 whose real part and imaginary part are the real part attention and the imaginary part attention corresponding to the first layer, namely:

L_rr_1 = LSTM_r_1(X_r); L_ir_1 = LSTM_r_1(X_i);
L_ri_1 = LSTM_i_1(X_r); L_ii_1 = LSTM_i_1(X_i);
L_out_1 = L_rr_1 − L_ii_1 + j(L_ri_1 + L_ir_1).
step 1810, iteratively performing feature coding on the real part attention and the imaginary part attention corresponding to the previous layer through the real part processing network and the imaginary part processing network of the current layer to obtain the real part attention and the imaginary part attention corresponding to the current layer, and stopping iteration until the real part attention and the imaginary part attention corresponding to the last layer are obtained.
Then, in the real part processing network LSTM_r_2 and the imaginary part processing network LSTM_i_2 of the second layer, the real part of L_out_1 is used as the real part sequence X_r input to the second layer, and the imaginary part of L_out_1 as the imaginary part sequence X_i input to the second layer:

L_rr_2 = LSTM_r_2(X_r); L_ir_2 = LSTM_r_2(X_i);
L_ri_2 = LSTM_i_2(X_r); L_ii_2 = LSTM_i_2(X_i);
L_out_2 = L_rr_2 − L_ii_2 + j(L_ri_2 + L_ir_2).

L_out_2 is the complex result output by the second layer, comprising the real part attention and the imaginary part attention of the second layer.

By analogy, when the real part processing network and the imaginary part processing network in the frequency domain processing submodel comprise three layers, the real part of L_out_2 is used as the real part sequence X_r input to the third-layer real part processing network LSTM_r_3 and imaginary part processing network LSTM_i_3, and the imaginary part of L_out_2 as the imaginary part sequence X_i input to the third layer; and so on, until the real part attention and imaginary part attention corresponding to the last layer are obtained, which are taken as the final real part attention and imaginary part attention corresponding to the original audio signal.
Each pass of the above process through one layer of real part and imaginary part processing networks is referred to as one "iteration". The process is repeated layer by layer, i.e. over multiple iterations, and the real part attention and imaginary part attention output by the last layer are taken as the final real part attention and imaginary part attention corresponding to the original audio signal.
After the computer device obtains the real part attention and the imaginary part attention through the real part processing network and the imaginary part processing network based on the real part and imaginary part operation in the frequency domain processing submodel, the frequency domain coding characteristics corresponding to the original audio signal can be expressed in a form of multiplying the attention and the original signal.
In one embodiment, obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: multiplying the real part sequence by the real part attention to obtain a real part of the frequency domain coding characteristics corresponding to the original audio signal; and multiplying the imaginary part sequence and the imaginary part attention to obtain the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
Specifically, the obtained attention to the clean audio signal in the original audio signal includes both the real attention and the imaginary attention, and the computer device may ignore the phase and multiply in the form of real multiplication. That is, the real part attention is directly multiplied by the real part sequence in the frequency domain signal corresponding to the original audio signal, the product result is used as the real part of the frequency domain coding feature corresponding to the original audio signal, the imaginary part attention is multiplied by the imaginary part sequence in the frequency domain signal corresponding to the original audio signal, and the product result is used as the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
That is, by the following formula:

X̂ = A_r · X_r + j(A_i · X_i);

wherein X̂ represents the frequency domain coding features corresponding to the original audio signal, X_r represents the real part sequence obtained by frequency domain transformation of the input original audio signal X, X_i represents the imaginary part sequence obtained by frequency domain transformation of X, A_r represents the real part attention, and A_i represents the imaginary part attention.
In one embodiment, obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: multiplying the real part sequence and the real part attention to obtain a first result, multiplying the imaginary part sequence and the imaginary part attention to obtain a second result, and taking the difference between the first result and the second result as the real part of the frequency domain coding feature corresponding to the original audio signal; and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
Specifically, the obtained attention to the clean audio signal in the original audio signal comprises the real part attention and the imaginary part attention, and the computer device may multiply according to the format of real part and imaginary part, that is, according to the formula of complex multiplication:

X̂ = (A_r · X_r − A_i · X_i) + j(A_r · X_i + A_i · X_r);

wherein X̂ represents the frequency domain coding features corresponding to the original audio signal, X_r represents the real part sequence obtained by frequency domain transformation of the input original audio signal X, X_i represents the imaginary part sequence obtained by frequency domain transformation of X, A_r represents the real part attention, and A_i represents the imaginary part attention.
In one embodiment, obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention comprises: obtaining original amplitude information and original phase information of the original audio signal based on the real part sequence and the imaginary part sequence, and obtaining predicted amplitude information and predicted phase information of the original audio signal based on the real part attention and the imaginary part attention; obtaining amplitude information of frequency domain coding characteristics corresponding to the original audio signal according to the product of the original amplitude information and the predicted amplitude information; and obtaining the phase information of the frequency domain coding characteristics corresponding to the original audio signal according to the sum of the original phase information and the predicted phase information.
Specifically, the obtained attention to the clean audio signal in the original audio signal comprises both the real part attention and the imaginary part attention, and the computer device may multiply using the amplitude and phase information, i.e. according to the formula of amplitude-phase multiplication:

X̂ = (X_mag · A_mag) · e^{j(X_phase + A_phase)};

wherein X̂ represents the frequency domain coding features corresponding to the original audio signal; X_mag represents the original amplitude information obtained from the real part sequence X_r and the imaginary part sequence X_i of the original audio signal, and X_phase the original phase information obtained from X_r and X_i; A_mag represents the predicted amplitude information of the original audio signal obtained from the real part attention A_r and the imaginary part attention A_i, and A_phase the predicted phase information obtained from A_r and A_i; j denotes the imaginary unit.
In the above embodiments, the final frequency domain coding feature is obtained by multiplying the output real part attention and imaginary part attention with the real part sequence and imaginary part sequence of the original signal, respectively, so that the frequency domain processing submodel gradually learns to recover a clean signal when the attention it outputs is multiplied with the original signal.
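For illustration, the three mask-application variants above can be summarized in one short NumPy sketch; the function and variable names are illustrative, not from the patent, and all arrays are assumed to share one shape (e.g. frames by frequency bins):

```python
import numpy as np

def apply_mask(X_r, X_i, M_r, M_i, mode="complex"):
    """Combine the real/imaginary attention (M_r, M_i) with the real/imaginary
    sequences (X_r, X_i) of the noisy spectrum; multiplication is element-wise."""
    if mode == "real":
        # Variant 1: ignore phase, multiply component-wise.
        Y_r, Y_i = M_r * X_r, M_i * X_i
    elif mode == "complex":
        # Variant 2: full complex multiplication (X_r + jX_i)(M_r + jM_i).
        Y_r = X_r * M_r - X_i * M_i
        Y_i = X_r * M_i + X_i * M_r
    else:
        # Variant 3: multiply the magnitudes, add the phases.
        X_mag, X_phase = np.hypot(X_r, X_i), np.arctan2(X_i, X_r)
        M_mag, M_phase = np.hypot(M_r, M_i), np.arctan2(M_i, M_r)
        Y_mag, Y_phase = X_mag * M_mag, X_phase + M_phase
        Y_r, Y_i = Y_mag * np.cos(Y_phase), Y_mag * np.sin(Y_phase)
    return Y_r, Y_i
```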
In one embodiment, performing noise reduction processing on the time domain signal through the trained time domain processing submodel to obtain a noise reduction signal corresponding to the original audio signal includes: inputting the time domain signal into the trained time domain processing submodel; encoding the time domain signal through an encoder in the time domain processing submodel to obtain a time domain coding vector; performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the time domain processing submodel to obtain hidden features corresponding to the time domain signal; and decoding based on the time domain coding vector and the hidden features through a decoder in the time domain processing submodel to obtain the noise reduction signal corresponding to the original audio signal.
The time domain processing submodel adopts a convolution-based Encoder-Decoder structure. The Encoder-Decoder structure converts an input sequence into another output sequence. In this framework, the encoder converts the sequence corresponding to the input time domain signal into a vector, and the decoder receives the vector and generates the output sequence in time order. The encoder and the decoder may employ the same type of neural network model or different types of neural network models. For example, the encoder and the decoder may both be CNN (Convolutional Neural Network) models, or the encoder may use an RNN (Recurrent Neural Network) model while the decoder uses a CNN model. The time sequence feature extraction network in the time domain processing submodel may use an LSTM model, for example a 2-layer LSTM network, which can learn the temporal dependencies of the input time domain signal.
Specifically, the computer device inputs the time domain signal output by the frequency domain submodel into the time domain processing submodel, converts the input time domain signal into a time domain coding vector through the encoder of the time domain processing submodel, and extracts the inherent temporal relations of the time domain coding vector through the intermediate time sequence feature extraction network to mine the hidden features. In this embodiment, the time sequence feature extraction network is an intermediate layer, also called a hidden layer, relative to the encoder as the input layer and the decoder as the output layer; the features it extracts are therefore called hidden features. Finally, the decoder decodes based on the time domain coding vector and the hidden features, and outputs the noise reduction signal.
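As a concrete illustration, the following is a minimal PyTorch sketch of this encoder, 2-layer LSTM, decoder structure; the layer sizes and the way the decoder combines the coding vector with the hidden features (element-wise modulation) are assumptions, not specified by the patent:

```python
import torch
import torch.nn as nn

class TimeDomainSubmodel(nn.Module):
    """Sketch of the convolutional Encoder -> LSTM -> Decoder structure."""
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.lstm = nn.LSTM(channels, channels, num_layers=2, batch_first=True)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, x):                           # x: (batch, 1, samples)
        enc = self.encoder(x)                       # time domain coding vector
        hidden, _ = self.lstm(enc.transpose(1, 2))  # temporal hidden features
        # Decode based on the coding vector and the hidden features
        # (here combined by element-wise modulation, an assumption).
        return self.decoder(enc * hidden.transpose(1, 2))  # noise-reduced waveform
```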
In one embodiment, the method further comprises: and classifying the time domain signals through the trained noise classification submodel to obtain the noise scene category of the original audio signal, wherein the noise classification submodel is obtained by carrying out model training together with the time domain processing submodel.
Specifically, the computer device inputs the time domain signal output by the frequency domain submodel into the noise classification submodel, and outputs the noise scene category of the noise signal in the original audio signal through the noise classification submodel.
In one embodiment, classifying the time-domain signal by a trained noise classification submodel to obtain a noise scene category of the original audio signal, including: inputting the time domain signal into a trained noise classification submodel; coding the time domain signal through a coder in the noise classification submodel to obtain a time domain coding vector; performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the noise classification submodel to obtain hidden features corresponding to the time domain signals; and predicting the noise scene category of the original audio signal based on the hidden features through an output layer in the noise classification submodel.
Specifically, the computer device inputs the time domain signal output by the frequency domain submodel into the noise classification submodel, converts the input time domain signal into a time domain coding vector through the encoder of the noise classification submodel, and extracts the inherent temporal relations of the time domain coding vector through the intermediate time sequence feature extraction network to mine the hidden features. Finally, the output layer outputs the noise scene category of the original audio signal based on the hidden features. The time sequence feature extraction network of the noise classification submodel may employ an LSTM, for example a 2-layer LSTM network.
The output layer of the noise classification submodel comprises a full connection layer, an activation layer and a normalization layer. And a full connection layer in the output layer receives the hidden features extracted by the feature extraction network, matrix multiplication is carried out on the hidden features and model parameters corresponding to the full connection layer, the hidden features are mapped to a sample space, and finally, after nonlinear characteristics are introduced through the activation layer, noise scene categories corresponding to original audio signals are output through a normalization layer (softmax function).
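A minimal sketch of such an output head follows, assuming the hidden feature is a fixed-size vector and the number of noise scene categories is known; ReLU stands in for the unspecified activation:

```python
import torch.nn as nn

class NoiseClassifierHead(nn.Module):
    """Output layer described above: fully connected layer mapping hidden
    features to the sample space, an activation layer introducing
    nonlinearity, then softmax normalization over categories."""
    def __init__(self, hidden_dim=64, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)  # matrix multiply with layer parameters
        self.act = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, hidden):                        # hidden: (batch, hidden_dim)
        return self.softmax(self.act(self.fc(hidden)))  # noise scene probabilities
```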
In one embodiment, the method further comprises: acquiring an input noise reduction level; determining a signal weight corresponding to an input noise reduction level, the signal weight including a first weight and a second weight for adjusting a ratio between an original audio signal and a noise reduction signal, respectively; and according to the first weight and the second weight, fusing the original audio signal and the noise reduction signal to obtain an audio output signal corresponding to the input noise reduction level.
In this embodiment, the computer device may provide the user with a one-key noise reduction function together with a selection of multiple noise reduction levels, so that the user can select a suitable noise reduction level for the environment in which the user is located. The computer device can not only output the noise reduction signal in real time, but also indicate whether the user is currently in a noisy environment and, if so, what type of noise environment it is.
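For illustration, a minimal sketch of the level-to-weight mapping and the fusion step; the concrete weight values per level are assumptions:

```python
# Hypothetical mapping from a user-selected noise reduction level to the
# first weight (original signal) and second weight (noise reduction signal).
LEVEL_WEIGHTS = {"low": (0.5, 0.5), "medium": (0.2, 0.8), "high": (0.0, 1.0)}

def fuse_output(original, denoised, level="medium"):
    """Fuse the original audio signal and the noise reduction signal
    according to the weights of the chosen noise reduction level."""
    w1, w2 = LEVEL_WEIGHTS[level]
    return w1 * original + w2 * denoised
```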
In a specific embodiment, the audio noise reduction method comprises the following steps; a consolidated code sketch of the full pipeline follows the list:
1. acquiring an original audio signal of a time domain;
2. inputting an original audio signal into a trained frequency domain processing sub-model;
3. in the frequency domain processing submodel, carrying out frequency domain transformation on the original audio signal to obtain a real part sequence and an imaginary part sequence corresponding to the original audio signal;
4. respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network in the frequency domain processing submodel to obtain a first coding feature of the real part and a first coding feature of the imaginary part;
5. respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part;
6. acquiring real part attention corresponding to the original audio signal according to the real part first coding feature and the imaginary part second coding feature;
7. obtaining the imaginary part attention corresponding to the original audio signal according to the real part second coding feature and the imaginary part first coding feature;
8. multiplying the real part sequence by the real part attention to obtain a real part of the frequency domain coding characteristics corresponding to the original audio signal;
9. multiplying the imaginary part sequence and the imaginary part attention to obtain an imaginary part of the frequency domain coding feature corresponding to the original audio signal;
10. transforming the frequency domain coding features into time domain signals;
11. inputting the time domain signal into the trained time domain processing submodel;
12. coding the time domain signal through a coder in the time domain processing submodel to obtain a time domain coding vector;
13. performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the time domain processing submodel to obtain hidden features corresponding to the time domain signals;
14. decoding based on the time domain coding vector and the hidden feature through a decoder in the time domain processing submodel to obtain a noise reduction signal corresponding to the original audio signal;
15. inputting the time domain signal into a trained noise classification submodel;
16. coding the time domain signal through a coder in the noise classification submodel to obtain a time domain coding vector;
17. performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the noise classification submodel to obtain hidden features corresponding to the time domain signals;
18. predicting the noise scene category of the original audio signal based on the hidden features through an output layer in the noise classification submodel;
19. acquiring an input noise reduction level;
20. determining a signal weight corresponding to an input noise reduction level, the signal weight including a first weight and a second weight for adjusting a ratio between an original audio signal and a noise reduction signal, respectively;
21. according to the first weight and the second weight, fusing the original audio signal and the noise reduction signal to obtain an audio output signal corresponding to the input noise reduction level.
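The sketch below consolidates steps 1-21 into a single inference function, treating the three trained submodels as opaque callables on PyTorch tensors; their interfaces and the fixed weights are illustrative assumptions:

```python
def audio_noise_reduction(raw_audio, freq_model, time_model, noise_classifier,
                          w1=0.2, w2=0.8):
    """End-to-end sketch of steps 1-21 above."""
    # Steps 2-10: frequency domain attention/masking, transformed back to a
    # time domain signal inside the frequency domain processing submodel.
    time_signal = freq_model(raw_audio)
    # Steps 11-14: time domain noise reduction (encoder -> LSTM -> decoder).
    denoised = time_model(time_signal)
    # Steps 15-18: noise scene classification on the same time domain signal.
    scene_probs = noise_classifier(time_signal)
    # Steps 19-21: fuse original and noise reduction signals per the level.
    output = w1 * raw_audio + w2 * denoised
    return output, scene_probs.argmax(dim=-1)
```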
In a specific application scenario, an instant messaging client runs on a computer device. When a user makes a voice call, the instant messaging client can collect an original voice signal and input it into a trained voice noise reduction model according to the method provided by the embodiments of the present application. In the frequency domain processing submodel of the voice noise reduction model, a real part processing network and an imaginary part processing network respectively perform feature coding on the real part sequence and the imaginary part sequence obtained after the original voice signal is transformed into a frequency domain signal, to obtain the real part attention and imaginary part attention corresponding to the original voice signal, and the frequency domain coding features corresponding to the original voice signal are obtained based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention. In the time domain processing submodel of the voice noise reduction model, the frequency domain coding features are transformed into time domain signals, the time domain signals are subjected to noise reduction processing to obtain noise reduction signals corresponding to the original voice signals, and the noise reduction signals are then transmitted to the other party of the call. Alternatively, after the computer device transmits the original audio signal to the other party, the noise reduction processing may be performed on the original audio signal by the computer device used by that party according to the speech noise reduction method provided in the embodiments of the present application.
In addition, in the noise classification submodel of the speech noise reduction model, the frequency domain coding features can be transformed into time domain signals, and the time domain signals are subjected to noise classification to obtain the noise scene category corresponding to the original speech signal. Furthermore, a noise reduction level switching control can be arranged on the voice call interaction interface, with different controls corresponding to different noise reduction degrees; in response to a trigger operation on the noise reduction level switching control, the computer device adjusts the proportion between the original audio signal and the noise reduction signal and outputs the mixed signal, so that different degrees of noise reduction can be applied and user experience is improved.
It should be understood that, although the steps in the above flowcharts are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is also not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 19, there is provided an audio noise reduction model processing apparatus 1900, which may adopt a software module or a hardware module, or a combination of the two, as a part of a computer device, and specifically includes: an acquisition module 1902, a frequency domain coding training module 1904, and an integrated training module 1906, wherein:
an obtaining module 1902, configured to obtain a sample audio signal, where the sample audio signal is generated from a clean audio signal;
a frequency domain coding training module 1904, configured to perform feature coding on a real part sequence and an imaginary part sequence obtained after transforming a sample audio signal into a frequency domain signal respectively through a real part processing network and an imaginary part processing network in a first sub-model based on a neural network, to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtain frequency domain coding features corresponding to the sample audio signal based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention; performing model training on the first submodel according to a first loss determined by a frequency domain transformation sequence corresponding to the clean audio signal based on the frequency domain coding characteristics to obtain a frequency domain processing submodel;
an integrated training module 1906, configured to connect the frequency domain processing sub-model with the time domain processing sub-model to be trained, and then train together to obtain an audio noise reduction model for performing noise reduction on the audio signal.
In one embodiment, the frequency domain coding training module 1904 is specifically configured to input the sample audio signal into a first sub-model based on a neural network; in the first submodel, carrying out frequency domain transformation on the sample audio signal to obtain a real part sequence and an imaginary part sequence corresponding to the sample audio signal; respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network in a first submodel to obtain a real part first coding feature and an imaginary part first coding feature; respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the first submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part; and obtaining the real part attention corresponding to the sample audio signal according to the real part first coding feature and the imaginary part second coding feature, and obtaining the imaginary part attention corresponding to the sample audio signal according to the real part second coding feature and the imaginary part first coding feature.
In one embodiment, the frequency-domain coding training module 1904 is specifically configured to multiply the real part sequence with the real part attention to obtain the real part of the frequency-domain coding feature corresponding to the sample audio signal; and multiplying the imaginary part sequence and the imaginary part attention to obtain the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
In an embodiment, the frequency-domain coding training module 1904 is specifically configured to multiply the real part sequence and the real part attention to obtain a first result, multiply the imaginary part sequence and the imaginary part attention to obtain a second result, and use a difference between the first result and the second result as a real part of the frequency-domain coding feature corresponding to the sample audio signal; and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the sample audio signal.
In an embodiment, the frequency domain coding training module 1904 is specifically configured to obtain original amplitude information and original phase information of the sample audio signal based on the real part sequence and the imaginary part sequence; obtaining predicted amplitude information and predicted phase information of the sample audio signal based on the real part attention and the imaginary part attention; obtaining amplitude information of frequency domain coding characteristics corresponding to the sample audio signal according to the product of the original amplitude information and the predicted amplitude information; and obtaining the phase information of the frequency domain coding characteristics corresponding to the sample audio signal according to the sum of the original phase information and the predicted phase information.
In an embodiment, the frequency domain coding training module 1904 is specifically configured to perform frequency domain transformation processing on the clean audio signal to obtain a corresponding frequency domain transformation sequence, where the frequency domain transformation sequence includes a real part sequence and an imaginary part sequence; the first loss is determined based on a difference between a real part sequence corresponding to the clean audio signal and a real part feature in the frequency domain coding features corresponding to the sample audio signal, and a difference between an imaginary part sequence corresponding to the clean audio signal and an imaginary part feature in the frequency domain coding features corresponding to the sample audio signal.
In one embodiment, the integrated training module 1906 is further configured to connect the frequency domain processing submodel with the second and third neural network-based submodels; transforming the frequency domain coding features corresponding to the sample audio signals into time domain signals; based on a second loss determined by a noise reduction signal obtained by carrying out noise reduction processing on the time domain signal by the second submodel and the clean audio signal, and a third loss determined by a noise scene category obtained by carrying out noise classification on the time domain signal by the third submodel and the noise label category of the sample audio signal, constructing a multitask objective function; performing model training on the frequency domain processing submodel, the second submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, a trained time domain processing submodel and a trained noise classification submodel; and connecting the updated frequency domain processing submodel with the trained time domain processing submodel to obtain an audio noise reduction model for carrying out noise reduction processing on the audio signal.
In an embodiment, the integrated training module 1906 is further configured to obtain a plurality of noise reduction signals obtained by performing noise reduction processing on time domain signals corresponding to the plurality of sample audio signals by using a second submodel, and determine a second lost weight according to a standard deviation of the plurality of noise reduction signals; obtaining a plurality of noise scene categories obtained by carrying out noise classification on time domain signals corresponding to the plurality of sample audio signals by a third submodel, and determining a third loss weight according to standard deviations of the plurality of noise scene categories; and fusing the second loss and the third loss according to respective weights to obtain the multitask objective function.
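A minimal sketch of this weighting scheme; the patent states only that each weight is determined from the standard deviation of the corresponding submodel's outputs over multiple sample audio signals, so the inverse-std form below is an assumption:

```python
import torch

def multitask_objective(denoised_batch, loss2, scene_preds_batch, loss3):
    """Fuse the noise reduction loss (loss2) and the classification loss
    (loss3) using weights derived from the standard deviation of each
    submodel's outputs over a batch of sample audio signals."""
    w2 = 1.0 / (denoised_batch.std() + 1e-8)
    w3 = 1.0 / (scene_preds_batch.std() + 1e-8)
    return w2 * loss2 + w3 * loss3
```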
In an embodiment, the integrated training module 1906 is further configured to encode the time-domain signal through an encoder in the second sub-model to obtain a time-domain encoding vector, perform feature extraction on the time-domain encoding vector through a time sequence feature extraction network in the second sub-model to obtain a hidden feature corresponding to the time-domain signal, and perform decoding based on the time-domain encoding vector and the hidden feature through a decoder in the second sub-model to obtain a noise reduction signal corresponding to the sample audio signal in the second sample set; a second loss is constructed from the noise reduction signal and the clean audio signal.
In an embodiment, the integrated training module 1906 is further configured to project the noise reduction signal corresponding to the sample audio signal to the vertical direction and the horizontal direction of the clean audio signal, respectively, to obtain a vertical projection vector and a horizontal projection vector; and obtaining a second loss according to the vertical projection vector and the horizontal projection vector.
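This projection construction matches the widely used scale-invariant SNR (SI-SNR) loss; the sketch below assumes that formulation, which the patent does not spell out:

```python
import torch

def projection_loss(denoised, clean, eps=1e-8):
    """Second loss built from projecting the noise reduction signal onto the
    clean signal (horizontal direction) and its orthogonal (vertical)
    direction; minimizing it maximizes an SI-SNR-style ratio."""
    clean = clean - clean.mean(dim=-1, keepdim=True)
    denoised = denoised - denoised.mean(dim=-1, keepdim=True)
    # Horizontal projection: component of the denoised signal along the clean signal.
    dot = (denoised * clean).sum(dim=-1, keepdim=True)
    horizontal = dot * clean / (clean.pow(2).sum(dim=-1, keepdim=True) + eps)
    # Vertical projection: residual orthogonal to the clean signal.
    vertical = denoised - horizontal
    ratio = horizontal.pow(2).sum(dim=-1) / (vertical.pow(2).sum(dim=-1) + eps)
    return -10 * torch.log10(ratio + eps)
```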
In an embodiment, the integrated training module 1906 is configured to encode the time-domain signal through an encoder in the third sub-model to obtain a time-domain encoding vector, perform feature extraction on the time-domain encoding vector through a time sequence feature extraction network in the third sub-model to obtain a hidden feature corresponding to the time-domain signal, and predict a noise scene category of the sample audio signal based on the hidden feature through an output layer in the third sub-model; and constructing a third loss according to the noise scene category and the noise label category of the noise signal used for generating the sample audio signal.
In one embodiment, the integrated training module 1906 is further configured to connect the frequency domain processing submodel with the second and third neural network-based submodels; transforming the frequency domain coding features corresponding to the sample audio signals into time domain signals; performing model training on the frequency domain processing submodel and the second submodel together based on a second loss determined by a noise reduction signal obtained by performing noise reduction on the time domain signal and the clean audio signal by the second submodel to obtain an updated frequency domain processing submodel and a trained time domain processing submodel; inputting the sample audio signal into the updated frequency domain processing submodel to obtain corresponding frequency domain coding characteristics, converting the frequency domain coding characteristics into time domain signals, and then respectively inputting the trained time domain processing submodel and a third submodel; based on a second loss determined by a noise reduction signal and a clean audio signal obtained by performing noise reduction on a time domain signal by a trained time domain processing submodel, and a third loss determined by a noise scene category obtained by performing noise classification on the time domain signal by a third submodel and a noise label category of a sample audio signal, constructing a multi-task objective function; performing model training on the frequency domain processing submodel, the trained time domain processing submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, an updated time domain processing submodel and a noise classification submodel; and connecting the updated frequency domain processing submodel with the updated time domain processing submodel to obtain an audio noise reduction model for carrying out noise reduction processing on the audio signal.
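The staged schedule just described can be sketched as follows; the optimizer, the MSE stand-in for the noise reduction loss, and the cross-entropy classification loss are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def staged_training(freq_model, time_model, noise_model, loader, epochs=1):
    """Stage 1 trains the frequency domain and time domain submodels on the
    noise reduction loss; stage 2 adds the noise classification submodel and
    trains all three jointly (std-based loss weighting omitted for brevity)."""
    opt1 = torch.optim.Adam(list(freq_model.parameters()) + list(time_model.parameters()))
    for _ in range(epochs):                          # stage 1
        for noisy, clean, _ in loader:
            loss2 = F.mse_loss(time_model(freq_model(noisy)), clean)
            opt1.zero_grad(); loss2.backward(); opt1.step()

    all_params = (list(freq_model.parameters()) + list(time_model.parameters())
                  + list(noise_model.parameters()))
    opt2 = torch.optim.Adam(all_params)
    for _ in range(epochs):                          # stage 2: joint multitask training
        for noisy, clean, noise_label in loader:
            t = freq_model(noisy)                    # intermediate time domain signal
            loss2 = F.mse_loss(time_model(t), clean)
            loss3 = F.cross_entropy(noise_model(t), noise_label)
            opt2.zero_grad(); (loss2 + loss3).backward(); opt2.step()
```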
In an embodiment, the processing device 1900 of the audio noise reduction model further includes: a distillation training module, configured to obtain a teacher model for performing noise reduction processing on audio signals according to the audio noise reduction model, and construct a student model for performing noise reduction processing on audio signals from the frequency domain processing submodel and a lightweight time domain noise reduction network; input a sample audio signal into the teacher model, obtain the corresponding frequency domain coding features through the frequency domain processing submodel in the teacher model, transform the frequency domain coding features into a time domain signal, encode the time domain signal through the encoder in the time domain processing submodel of the teacher model to obtain a first time domain coding vector corresponding to the sample audio signal, and obtain a first noise reduction signal corresponding to the sample audio signal based on the first time domain coding vector through the decoder in the time domain processing submodel; input the sample audio signal into the student model, obtain the corresponding frequency domain coding features through the frequency domain processing submodel in the student model, transform the frequency domain coding features into a time domain signal, encode the time domain signal through the encoder in the lightweight time domain noise reduction network to obtain a second time domain coding vector corresponding to the sample audio signal, and obtain a second noise reduction signal corresponding to the sample audio signal based on the second time domain coding vector through the decoder in the lightweight time domain noise reduction network; and determine a model distillation loss based on the mean square error loss between the first time domain coding vector and the second time domain coding vector, the mean square error loss between the first noise reduction signal and the second noise reduction signal, and the data structure loss between the first noise reduction signal and the second noise reduction signal, and perform model training on the student model according to the model distillation loss to obtain a lightweight audio noise reduction model.
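A minimal sketch of such a distillation objective; the "data structure loss" is approximated here by comparing pairwise similarity matrices of teacher and student outputs within a batch, which is an assumption about the patent's unspecified formulation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_enc, s_enc, t_out, s_out):
    """Model distillation loss: MSE between teacher/student time domain
    coding vectors, MSE between the two noise reduction signals, and a
    structural term comparing pairwise cosine-similarity matrices."""
    enc_loss = F.mse_loss(s_enc, t_enc)
    out_loss = F.mse_loss(s_out, t_out)
    t_flat = F.normalize(t_out.flatten(1), dim=1)
    s_flat = F.normalize(s_out.flatten(1), dim=1)
    struct_loss = F.mse_loss(s_flat @ s_flat.T, t_flat @ t_flat.T)
    return enc_loss + out_loss + struct_loss
```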
The processing apparatus 1900 of the above audio noise reduction model includes a frequency domain processing submodel and a time domain processing submodel, wherein the frequency domain processing submodel includes a real part processing network and an imaginary part processing network. After a sample audio signal in the time domain is obtained, the real part processing network and the imaginary part processing network in the frequency domain processing submodel respectively perform feature coding on the real part sequence and the imaginary part sequence of the sample audio signal. Such a network structure can fully learn the frequency domain information of the sample audio signal, that is, the amplitude information and phase information characterized by the real part sequence and the imaginary part sequence, so that the real part attention and imaginary part attention obtained by coding give more accurate attention to the clean audio signal in the sample audio signal. In this way, the frequency domain coding features corresponding to the sample audio signal are obtained based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention, and the frequency domain processing submodel is trained according to the first loss determined by the frequency domain coding features and the frequency domain transformation sequence corresponding to the clean audio signal used to generate the sample audio signal, so that the frequency domain processing submodel can accurately learn the frequency domain characteristics of the clean signal in the sample audio signal. Model training is then performed jointly on the frequency domain processing submodel and the time domain processing submodel, so that the resulting audio noise reduction model achieves a better noise reduction effect.
For specific definition of the processing apparatus 1900 of the audio noise reduction model, reference may be made to the above definition of the processing method of the audio noise reduction model, which is not described herein again. The modules in the processing device of the audio noise reduction model can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, as shown in fig. 20, there is provided an audio noise reduction apparatus 2000, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: an obtaining module 2002, a frequency domain coding module 2004 and a time domain noise reduction module 2006, wherein:
an obtaining module 2002, configured to obtain an original audio signal in a time domain;
a frequency domain coding module 2004, configured to perform feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is transformed into the frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel, respectively, to obtain a real part attention and an imaginary part attention corresponding to the original audio signal; acquiring frequency domain coding characteristics corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and the time domain noise reduction module 2006 is configured to transform the frequency domain coding features into time domain signals, and perform noise reduction processing on the time domain signals through the trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
In one embodiment, the frequency domain coding module 2004 is further configured to input the original audio signal into a trained frequency domain processing submodel; in the frequency domain processing submodel, carrying out frequency domain transformation on the original audio signal to obtain a real part sequence and an imaginary part sequence corresponding to the original audio signal; respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network in the frequency domain processing submodel to obtain a first coding feature of the real part and a first coding feature of the imaginary part; respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part; and obtaining the real part attention corresponding to the original audio signal according to the real part first coding feature and the imaginary part second coding feature, and obtaining the imaginary part attention corresponding to the original audio signal according to the real part second coding feature and the imaginary part first coding feature.
In one embodiment, the real part processing network and the imaginary part processing network in the frequency domain processing submodel include at least two layers; the frequency domain coding module 2004 is further configured to perform feature coding on the real part sequence and the imaginary part sequence respectively through a real part processing network of a first layer in the frequency domain processing submodel to obtain a real part first coding feature and an imaginary part first coding feature; respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network of a first layer in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part; acquiring real part attention of the original audio signal corresponding to the first layer according to the real part first coding feature and the imaginary part second coding feature, and acquiring imaginary part attention of the original audio signal corresponding to the first layer according to the real part second coding feature and the imaginary part first coding feature; and iteratively performing feature coding on the real part attention and the imaginary part attention corresponding to the previous layer through the real part processing network and the imaginary part processing network of the current layer respectively to obtain the real part attention and the imaginary part attention corresponding to the current layer, and stopping iteration until the real part attention and the imaginary part attention corresponding to the last layer are obtained.
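As an illustration of this layered structure, here is a minimal PyTorch sketch; the use of Linear layers and the complex-arithmetic-style combination rule are assumptions, since the patent only states that each attention is obtained "according to" the cross-paired coding features:

```python
import torch
import torch.nn as nn

class LayeredComplexAttention(nn.Module):
    """Stacked real/imaginary processing layers; each layer's attentions
    feed the next layer until the last layer's attentions are returned."""
    def __init__(self, freq_bins=257, num_layers=2):
        super().__init__()
        self.real_nets = nn.ModuleList(nn.Linear(freq_bins, freq_bins) for _ in range(num_layers))
        self.imag_nets = nn.ModuleList(nn.Linear(freq_bins, freq_bins) for _ in range(num_layers))

    def forward(self, real_seq, imag_seq):     # (batch, frames, freq_bins)
        att_r, att_i = real_seq, imag_seq
        for real_net, imag_net in zip(self.real_nets, self.imag_nets):
            rr = real_net(att_r)               # real part first coding feature
            ri = real_net(att_i)               # imaginary part first coding feature
            ir = imag_net(att_r)               # real part second coding feature
            ii = imag_net(att_i)               # imaginary part second coding feature
            att_r, att_i = rr - ii, ri + ir    # cross-combination (assumption)
        return att_r, att_i                    # final real/imaginary attention
```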
In one embodiment, the frequency domain coding module 2004 is further configured to multiply the real part sequence with the real part attention to obtain a real part of the frequency domain coding feature corresponding to the original audio signal; and multiplying the imaginary part sequence and the imaginary part attention to obtain the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
In one embodiment, the frequency-domain coding module 2004 is further configured to multiply the real part sequence with the real part attention to obtain a first result, multiply the imaginary part sequence with the imaginary part attention to obtain a second result, and use a difference between the first result and the second result as a real part of the frequency-domain coding feature corresponding to the original audio signal; and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
In one embodiment, the frequency domain coding module 2004 is further configured to obtain raw amplitude information and raw phase information of the raw audio signal based on the real part sequence and the imaginary part sequence, and obtain predicted amplitude information and predicted phase information of the raw audio signal based on the real part attention and the imaginary part attention; obtaining amplitude information of frequency domain coding characteristics corresponding to the original audio signal according to the product of the original amplitude information and the predicted amplitude information; and obtaining the phase information of the frequency domain coding characteristics corresponding to the original audio signal according to the sum of the original phase information and the predicted phase information.
In one embodiment, the time domain noise reduction module 2006 is further configured to input the time domain signal into a trained time domain processing sub-model; coding the time domain signal through a coder in the time domain processing submodel to obtain a time domain coding vector; performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the time domain processing submodel to obtain hidden features corresponding to the time domain signals; and decoding based on the time domain coding vector and the hidden feature through a decoder in the time domain processing submodel to obtain a noise reduction signal corresponding to the original audio signal.
In one embodiment, the audio noise reduction apparatus 2000 further comprises:
and the noise classification module is used for classifying the time domain signals through the trained noise classification submodel to obtain the noise scene category of the original audio signal, and the noise classification submodel is obtained by carrying out model training together with the time domain processing submodel.
In the above embodiment, the noise classification module is further configured to input the time domain signal into a trained noise classification submodel; coding the time domain signal through a coder in the noise classification submodel to obtain a time domain coding vector; performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the noise classification submodel to obtain hidden features corresponding to the time domain signals; and predicting the noise scene category of the original audio signal based on the hidden features through an output layer in the noise classification submodel.
In one embodiment, the audio noise reduction apparatus 2000 further comprises:
the noise reduction gear adjusting module is used for acquiring the input noise reduction level; determining a signal weight corresponding to an input noise reduction level, the signal weight including a first weight and a second weight for adjusting a ratio between an original audio signal and a noise reduction signal, respectively; and according to the first weight and the second weight, fusing the original audio signal and the noise reduction signal to obtain an audio output signal corresponding to the input noise reduction level.
In the audio noise reduction apparatus 2000, the trained frequency domain processing submodel includes a real part processing network and an imaginary part processing network. After the original audio signal in the time domain is obtained, the real part processing network and the imaginary part processing network in the trained frequency domain processing submodel respectively perform feature coding on the real part sequence and the imaginary part sequence of the original audio signal. Such a network structure can fully utilize the frequency domain information of the original audio signal, that is, the amplitude information and phase information characterized by the real part sequence and the imaginary part sequence, so that the real part attention and imaginary part attention obtained by coding give more accurate attention to the clean audio signal in the original audio signal. The frequency domain coding features obtained based on the real part sequence and the imaginary part sequence, and the real part attention and the imaginary part attention, can therefore accurately characterize the frequency domain characteristics of the clean audio signal in the original audio signal, and the time domain signal obtained from the frequency domain coding features can accurately express the clean audio signal in the original audio signal, yielding a better noise reduction effect. In addition, the time domain processing submodel subsequently performs further noise reduction processing on the time domain signal, which can further improve the sound quality, so that the resulting noise reduction signal is even better.
For specific limitations of the audio noise reduction apparatus 2000, reference may be made to the above limitations of the audio noise reduction method, which will not be described herein again. The modules in the audio noise reduction device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, and the computer device may be the terminal or the server in fig. 1, and its internal structure diagram may be as shown in fig. 21. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio noise reduction method and/or a processing method of an audio noise reduction model.
Those skilled in the art will appreciate that the architecture shown in fig. 21 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (15)
1. A method for audio noise reduction, the method comprising:
acquiring an original audio signal of a time domain;
respectively carrying out feature coding on a real part sequence and an imaginary part sequence obtained after the original audio signal is converted into a frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel to obtain real part attention and imaginary part attention corresponding to the original audio signal;
obtaining frequency domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention;
and transforming the frequency domain coding characteristics into time domain signals, and carrying out noise reduction processing on the time domain signals through a trained time domain processing submodel to obtain noise reduction signals corresponding to the original audio signals.
2. The method as claimed in claim 1, wherein the obtaining of the real part attention and the imaginary part attention corresponding to the original audio signal by respectively performing feature coding on a real part sequence and an imaginary part sequence obtained after transforming the original audio signal into the frequency domain signal through a real part processing network and an imaginary part processing network in the trained frequency domain processing submodel comprises:
inputting the original audio signal into a trained frequency domain processing sub-model;
in the frequency domain processing submodel, carrying out frequency domain transformation on the original audio signal to obtain a real part sequence and an imaginary part sequence corresponding to the original audio signal;
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network in the frequency domain processing submodel to obtain a first coding feature of the real part and a first coding feature of the imaginary part;
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part;
and obtaining the real part attention corresponding to the original audio signal according to the real part first coding feature and the imaginary part second coding feature, and obtaining the imaginary part attention corresponding to the original audio signal according to the real part second coding feature and the imaginary part first coding feature.
3. The method of claim 1, wherein the real part processing network and the imaginary part processing network in the frequency domain processing submodel comprise at least two layers;
the obtaining of the real part attention and the imaginary part attention corresponding to the original audio signal by respectively performing feature coding on the real part sequence and the imaginary part sequence obtained after the original audio signal is transformed into the frequency domain signal through the real part processing network and the imaginary part processing network in the trained frequency domain processing submodel includes:
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through a real part processing network of a first layer in the frequency domain processing submodel to obtain a first coding feature of a real part and a first coding feature of an imaginary part;
respectively carrying out feature coding on the real part sequence and the imaginary part sequence through an imaginary part processing network of a first layer in the frequency domain processing submodel to obtain a second coding feature of the real part and a second coding feature of the imaginary part;
according to the real part first coding feature and the imaginary part second coding feature, obtaining a real part attention of the original audio signal corresponding to a first layer, and according to the real part second coding feature and the imaginary part first coding feature, obtaining an imaginary part attention of the original audio signal corresponding to the first layer;
and iteratively performing feature coding on the real part attention and the imaginary part attention corresponding to the previous layer through the real part processing network and the imaginary part processing network of the current layer respectively to obtain the real part attention and the imaginary part attention corresponding to the current layer, and stopping iteration until the real part attention and the imaginary part attention corresponding to the last layer are obtained.
4. The method according to claim 1, wherein the obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention comprises:
multiplying the real part sequence and the real part attention to obtain a real part of the frequency domain coding feature corresponding to the original audio signal;
and multiplying the imaginary part sequence and the imaginary part attention to obtain an imaginary part of the frequency domain coding feature corresponding to the original audio signal.
5. The method according to claim 1, wherein the obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention comprises:
multiplying the real part sequence and the real part attention to obtain a first result, multiplying the imaginary part sequence and the imaginary part attention to obtain a second result, and taking the difference between the first result and the second result as the real part of the frequency domain coding feature corresponding to the original audio signal;
and multiplying the real part sequence and the imaginary part attention to obtain a third result, multiplying the imaginary part sequence and the real part attention to obtain a fourth result, and taking the sum of the third result and the fourth result as the imaginary part of the frequency domain coding feature corresponding to the original audio signal.
6. The method according to claim 1, wherein the obtaining the frequency-domain coding features corresponding to the original audio signal based on the real part sequence and the imaginary part sequence and the real part attention and the imaginary part attention comprises:
obtaining original amplitude information and original phase information of the original audio signal based on the real part sequence and the imaginary part sequence, and obtaining predicted amplitude information and predicted phase information of the original audio signal based on the real part attention and the imaginary part attention;
obtaining amplitude information of frequency domain coding features corresponding to the original audio signal according to the product of the original amplitude information and the predicted amplitude information;
and obtaining the phase information of the frequency domain coding characteristics corresponding to the original audio signal according to the sum of the original phase information and the predicted phase information.
7. The method of claim 1, wherein the performing noise reduction processing on the time domain signal through the trained time domain processing submodel to obtain a noise reduction signal corresponding to the original audio signal comprises:
inputting the time domain signal into a trained time domain processing sub-model;
coding the time domain signal through a coder in the time domain processing submodel to obtain a time domain coding vector;
extracting the characteristics of the time domain coding vector through a time sequence characteristic extraction network in the time domain processing submodel to obtain hidden characteristics corresponding to the time domain signal;
and decoding by a decoder in the time domain processing submodel based on the time domain coding vector and the hidden features to obtain a noise reduction signal corresponding to the original audio signal.
8. The method of claim 1, further comprising:
and classifying the time domain signals through a trained noise classification submodel to obtain the noise scene category of the original audio signals, wherein the noise classification submodel is obtained by carrying out model training together with the time domain processing submodel.
9. The method of claim 8, wherein the classifying the time-domain signal by the trained noise classification submodel to obtain the noise scene class of the original audio signal comprises:
inputting the time domain signal into a trained noise classification submodel;
coding the time domain signal through a coder in the noise classification submodel to obtain a time domain coding vector;
performing feature extraction on the time domain coding vector through a time sequence feature extraction network in the noise classification submodel to obtain hidden features corresponding to the time domain signals;
predicting, by an output layer in the noise classification submodel, a noise scene class of the original audio signal based on the hidden features.
10. The method according to any one of claims 1 to 9, further comprising:
acquiring an input noise reduction level;
determining a signal weight corresponding to the input noise reduction level, the signal weight including a first weight and a second weight for adjusting a ratio between the original audio signal and the noise reduction signal, respectively;
and according to the first weight and the second weight, fusing the original audio signal and the noise reduction signal to obtain an audio output signal corresponding to the input noise reduction level.
11. A method for processing an audio noise reduction model, the method comprising:
obtaining a sample audio signal, the sample audio signal being generated from a clean audio signal;
performing feature coding, through a real part processing network and an imaginary part processing network in a neural-network-based first submodel, respectively, on a real part sequence and an imaginary part sequence obtained after the sample audio signal is transformed into a frequency domain signal, to obtain real part attention and imaginary part attention corresponding to the sample audio signal, and obtaining frequency domain coding features corresponding to the sample audio signal based on the real part sequence, the imaginary part sequence, the real part attention, and the imaginary part attention;
performing model training on the first submodel according to a first loss determined based on the frequency domain coding features and a frequency domain transform sequence corresponding to the clean audio signal to obtain a frequency domain processing submodel;
and connecting the frequency domain processing submodel with a time domain processing submodel to be trained, and then training them jointly to obtain the audio noise reduction model for performing noise reduction processing on an audio signal.
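One training step for the first submodel of claim 11 might look like the sketch below. It assumes `first_submodel` maps a (real part, imaginary part) spectrum pair to a complex spectrum of frequency domain coding features, and uses MSE as the first loss; the claim only requires some loss against the clean signal's frequency domain transform sequence, so both assumptions are illustrative:

```python
import torch
import torch.nn.functional as F

def train_first_submodel_step(first_submodel, optimizer, sample_wav, clean_wav,
                              n_fft=512, hop=128):
    """One step of the claim 11 training loop (shapes and loss are assumptions)."""
    window = torch.hann_window(n_fft)
    sample_spec = torch.stft(sample_wav, n_fft, hop, window=window,
                             return_complex=True)
    clean_spec = torch.stft(clean_wav, n_fft, hop, window=window,
                            return_complex=True)

    # Real part and imaginary part sequences of the sample audio signal
    features = first_submodel(sample_spec.real, sample_spec.imag)

    # First loss: distance to the clean signal's frequency domain transform
    loss = F.mse_loss(torch.view_as_real(features),
                      torch.view_as_real(clean_spec))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```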
12. The method of claim 11, wherein connecting the frequency domain processing submodel with the time domain processing submodel to be trained and then training them jointly to obtain the audio noise reduction model for performing noise reduction processing on an audio signal comprises:
connecting the frequency domain processing submodel with a second submodel and a third submodel based on a neural network;
transforming the frequency domain coding features corresponding to the sample audio signal into a time domain signal;
constructing a multitask objective function based on a second loss, determined from the clean audio signal and a noise reduction signal obtained by the second submodel performing noise reduction processing on the time domain signal, and a third loss, determined from the noise label category of the sample audio signal and a noise scene category obtained by the third submodel performing noise classification on the time domain signal;
performing model training on the frequency domain processing submodel, the second submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, a trained time domain processing submodel and a trained noise classification submodel;
and connecting the updated frequency domain processing submodel with the trained time domain processing submodel to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
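The multitask objective of claim 12 is a weighted sum of the two losses. A sketch with fixed placeholder weights (claim 13, below, derives the weights from output standard deviations instead); MSE for denoising and cross-entropy for classification are assumed loss forms:

```python
import torch.nn.functional as F

def multitask_objective(denoised, clean, class_logits, noise_label,
                        w2=1.0, w3=1.0):
    """Claim 12 multitask objective (loss forms and weights are assumptions)."""
    second_loss = F.mse_loss(denoised, clean)                 # denoising vs clean
    third_loss = F.cross_entropy(class_logits, noise_label)   # scene classification
    return w2 * second_loss + w3 * third_loss
```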
13. The method of claim 12, wherein the step of constructing the multitask objective function comprises:
obtaining a plurality of noise reduction signals obtained by the second submodel performing noise reduction processing on time domain signals corresponding to a plurality of sample audio signals, and determining the weight of the second loss according to the standard deviation of the plurality of noise reduction signals;
obtaining a plurality of noise scene categories obtained by the third submodel performing noise classification on time domain signals corresponding to a plurality of sample audio signals, and determining the weight of the third loss according to the standard deviation of the plurality of noise scene categories;
and fusing the second loss and the third loss according to respective weights to obtain the multitask objective function.
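Claim 13 only says the weights are determined from the standard deviations of each task's outputs. One plausible reading, borrowed from homoscedastic uncertainty weighting, down-weights the task whose outputs vary more; using class probabilities rather than discrete categories for the classification branch is an additional assumption:

```python
import torch

def loss_weights(denoised_batch, class_prob_batch, eps=1e-8):
    """Weights for the second and third losses from output standard deviations
    (claim 13 sketch; the 1/(2*sigma^2) form is an assumed instantiation)."""
    sigma_denoise = denoised_batch.std()
    sigma_class = class_prob_batch.std()
    w2 = 1.0 / (2.0 * sigma_denoise ** 2 + eps)
    w3 = 1.0 / (2.0 * sigma_class ** 2 + eps)
    return w2, w3
```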
14. The method of claim 11, wherein connecting the frequency domain processing submodel with the time domain processing submodel to be trained and then training them jointly to obtain the audio noise reduction model for performing noise reduction processing on an audio signal comprises:
connecting the frequency domain processing submodel with a second submodel and a third submodel based on a neural network;
transforming the frequency domain coding features corresponding to the sample audio signal into a time domain signal;
performing model training on the frequency domain processing submodel and the second submodel together, based on a second loss determined from the clean audio signal and a noise reduction signal obtained by the second submodel performing noise reduction processing on the time domain signal, to obtain an updated frequency domain processing submodel and a trained time domain processing submodel;
inputting the sample audio signal into the updated frequency domain processing submodel to obtain corresponding frequency domain coding features, transforming the frequency domain coding features into a time domain signal, and inputting the time domain signal into the trained time domain processing submodel and the third submodel, respectively;
constructing a multitask objective function based on a second loss, determined from the clean audio signal and a noise reduction signal obtained by the trained time domain processing submodel performing noise reduction on the time domain signal, and a third loss, determined from the noise label category of the sample audio signal and a noise scene category obtained by the third submodel performing noise classification on the time domain signal;
performing model training on the frequency domain processing submodel, the trained time domain processing submodel and the third submodel together according to the multitask objective function to obtain an updated frequency domain processing submodel, an updated time domain processing submodel and a noise classification submodel;
and connecting the updated frequency domain processing submodel with the updated time domain processing submodel to obtain the audio noise reduction model for carrying out noise reduction processing on the audio signal.
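Claim 14 differs from claim 12 in staging: the denoising pair is trained first, then the classifier joins under the multitask objective. A compact sketch of that schedule, reusing `multitask_objective` from the claim 12 sketch above; the optimizers, step counts, and the assumption that `freq_model` folds in the inverse transform back to a time domain signal are all illustrative:

```python
import itertools
import torch
import torch.nn.functional as F

def train_two_stage(freq_model, time_model, classifier, loader,
                    steps=(1000, 1000)):
    """Claim 14 staged schedule (hyperparameters are assumptions)."""
    # Stage 1: frequency domain submodel + second submodel, denoising loss only
    opt1 = torch.optim.Adam(itertools.chain(freq_model.parameters(),
                                            time_model.parameters()))
    for _, (noisy, clean, label) in zip(range(steps[0]), loader):
        denoised = time_model(freq_model(noisy))   # freq_model returns a time signal
        loss = F.mse_loss(denoised, clean)
        opt1.zero_grad()
        loss.backward()
        opt1.step()

    # Stage 2: all three submodels under the multitask objective (claim 12 sketch)
    opt2 = torch.optim.Adam(itertools.chain(freq_model.parameters(),
                                            time_model.parameters(),
                                            classifier.parameters()))
    for _, (noisy, clean, label) in zip(range(steps[1]), loader):
        time_signal = freq_model(noisy)
        loss = multitask_objective(time_model(time_signal), clean,
                                   classifier(time_signal), label)
        opt2.zero_grad()
        loss.backward()
        opt2.step()
```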
15. The method according to any one of claims 11 to 14, further comprising:
obtaining a teacher model for performing noise reduction processing on audio signals according to the audio noise reduction model, and constructing a student model for performing noise reduction processing on audio signals according to the frequency domain processing submodel and a lightweight time domain noise reduction network;
inputting a sample audio signal into the teacher model, obtaining corresponding frequency domain coding features through the frequency domain processing submodel in the teacher model, transforming the frequency domain coding features into a time domain signal, encoding the time domain signal through an encoder in the time domain processing submodel in the teacher model to obtain a first time domain coding vector corresponding to the sample audio signal, and obtaining a first noise reduction signal corresponding to the sample audio signal based on the first time domain coding vector through a decoder in the time domain processing submodel;
inputting the sample audio signal into the student model, obtaining corresponding frequency domain coding features through the frequency domain processing submodel in the student model, transforming the frequency domain coding features into a time domain signal, encoding the time domain signal through an encoder in the lightweight time domain noise reduction network to obtain a second time domain coding vector corresponding to the sample audio signal, and obtaining a second noise reduction signal corresponding to the sample audio signal based on the second time domain coding vector through a decoder in the lightweight time domain noise reduction network;
and performing model training on the student model according to a model distillation loss, determined based on the mean square error loss between the first time domain coding vector and the second time domain coding vector, the mean square error loss between the first noise reduction signal and the second noise reduction signal, and the data structure loss between the first noise reduction signal and the second noise reduction signal, to obtain a lightweight audio noise reduction model.
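A sketch of the claim 15 distillation loss. The two MSE terms follow the claim directly; the "data structure loss" is read here as a cosine-similarity structure term between the two noise reduction signals, which is one plausible interpretation the patent does not pin down:

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_enc, s_enc, t_out, s_out):
    """Model distillation loss (claim 15 sketch; structure term is assumed)."""
    enc_mse = F.mse_loss(s_enc, t_enc)   # between the two time domain coding vectors
    out_mse = F.mse_loss(s_out, t_out)   # between the two noise reduction signals
    # "Data structure loss": penalize directional mismatch between the signals
    structure = 1.0 - F.cosine_similarity(
        s_out.flatten(1), t_out.flatten(1)).mean()
    return enc_mse + out_mse + structure
```

Training the student against teacher intermediates as well as teacher outputs is what lets the lightweight network approach the teacher's quality at a fraction of the compute.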
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110557785.2A CN113763979A (en) | 2021-05-21 | 2021-05-21 | Audio noise reduction and audio noise reduction model processing method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113763979A true CN113763979A (en) | 2021-12-07 |
Family
ID=78787139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110557785.2A Pending CN113763979A (en) | 2021-05-21 | 2021-05-21 | Audio noise reduction and audio noise reduction model processing method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113763979A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114121031A (en) * | 2021-12-08 | 2022-03-01 | 思必驰科技股份有限公司 | Device voice noise reduction, electronic device, and storage medium |
CN114067820A (en) * | 2022-01-18 | 2022-02-18 | 深圳市友杰智新科技有限公司 | Training method of voice noise reduction model, voice noise reduction method and related equipment |
CN114067820B (en) * | 2022-01-18 | 2022-06-28 | 深圳市友杰智新科技有限公司 | Training method of voice noise reduction model, voice noise reduction method and related equipment |
CN114792524A (en) * | 2022-06-24 | 2022-07-26 | 腾讯科技(深圳)有限公司 | Audio data processing method, apparatus, program product, computer device and medium |
WO2024051676A1 (en) * | 2022-09-08 | 2024-03-14 | 维沃移动通信有限公司 | Model training method and apparatus, electronic device, and medium |
CN117577124A (en) * | 2024-01-12 | 2024-02-20 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
CN117577124B (en) * | 2024-01-12 | 2024-04-16 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112071329B (en) | Multi-person voice separation method and device, electronic equipment and storage medium | |
CN113763979A (en) | Audio noise reduction and audio noise reduction model processing method, device, equipment and medium | |
CN112071330B (en) | Audio data processing method and device and computer readable storage medium | |
CN111966800B (en) | Emotion dialogue generation method and device and emotion dialogue model training method and device | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN112767910B (en) | Audio information synthesis method, device, computer readable medium and electronic equipment | |
Zou et al. | Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation | |
CN112837669B (en) | Speech synthesis method, device and server | |
Wang et al. | Multi-source domain adaptation for text-independent forensic speaker recognition | |
Lee et al. | Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities | |
CN113822017A (en) | Audio generation method, device, equipment and storage medium based on artificial intelligence | |
Wu et al. | Acoustic to articulatory mapping with deep neural network | |
CN113761841A (en) | Method for converting text data into acoustic features | |
CN114783459B (en) | Voice separation method and device, electronic equipment and storage medium | |
CN114360502A (en) | Processing method of voice recognition model, voice recognition method and device | |
Devi et al. | Automatic speaker recognition from speech signal using bidirectional long‐short‐term memory recurrent neural network | |
Zhang et al. | Speaker-independent lipreading by disentangled representation learning | |
Ho et al. | Cross-lingual voice conversion with controllable speaker individuality using variational autoencoder and star generative adversarial network | |
Cai et al. | Music creation and emotional recognition using neural network analysis | |
Chen et al. | Speaker-independent emotional voice conversion via disentangled representations | |
Yang et al. | 3D head-talk: speech synthesis 3D head movement face animation | |
Hagos et al. | Recent advances in generative ai and large language models: Current status, challenges, and perspectives | |
Shen et al. | Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation | |
CN116343820A (en) | Audio processing method, device, equipment and storage medium | |
Huang et al. | Identification of depression state based on multi‐scale acoustic features in interrogation environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||