CN113470684A - Audio noise reduction method, device, equipment and storage medium - Google Patents

Audio noise reduction method, device, equipment and storage medium

Info

Publication number
CN113470684A
Authority
CN
China
Prior art keywords
audio
frequency
information
time
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110837937.4A
Other languages
Chinese (zh)
Other versions
CN113470684B (en)
Inventor
张之勇
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110837937.4A priority Critical patent/CN113470684B/en
Publication of CN113470684A publication Critical patent/CN113470684A/en
Application granted granted Critical
Publication of CN113470684B publication Critical patent/CN113470684B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention relates to artificial intelligence and provides an audio noise reduction method, device, equipment and storage medium. The method preprocesses a noisy audio to obtain frequency spectrum information; processes the frequency spectrum information based on a frequency domain signal processing network to obtain frequency spectrum mask characteristics; obtains time-frequency characteristics according to the frequency spectrum information and the frequency spectrum mask characteristics; processes the time-frequency characteristics based on a time domain signal processing network to obtain time-frequency mask characteristics; generates a predicted audio according to the time-frequency characteristics and the time-frequency mask characteristics; adjusts network parameters of a preset learner based on the predicted audio and a pure audio to obtain a noise reduction model; and obtains a request audio and performs noise reduction processing on the request audio based on the noise reduction model to obtain a target audio. The invention can improve the noise reduction accuracy and the real-time performance for the request audio. Furthermore, the invention also relates to blockchain technology: the target audio can be stored in a blockchain.

Description

Audio noise reduction method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an audio noise reduction method, device, equipment and storage medium.
Background
In teleconferences such as remote office calls, audio noise reduction must be both real-time and accurate. However, current noise reduction approaches usually process frame-level information over the complete speech sequence, which results in low noise reduction efficiency.
Therefore, how to improve the real-time performance and accuracy of audio noise reduction becomes a technical problem which needs to be solved urgently.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio noise reduction method, apparatus, device and storage medium, which can improve the noise reduction accuracy and noise reduction real-time performance of the requested audio.
In one aspect, the present invention provides an audio noise reduction method, where the audio noise reduction method includes:
acquiring an audio sample and acquiring a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
preprocessing the noisy audio to obtain frequency spectrum information;
processing the frequency spectrum information based on the frequency domain signal processing network to obtain frequency spectrum mask characteristics corresponding to the frequency spectrum information;
acquiring the time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics;
processing the time-frequency characteristics based on the time-domain signal processing network to obtain time-frequency mask characteristics;
generating a prediction audio according to the time-frequency characteristics and the time-frequency mask characteristics;
adjusting network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
and acquiring a request audio, and performing noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
According to a preferred embodiment of the present invention, the obtaining the audio sample comprises:
determining the audio duration of the pure audio;
acquiring audios with the duration less than or equal to the audio duration from a recording library to obtain a plurality of recorded audios;
carrying out arbitrary synthesis processing on the pure audio and each recorded audio to obtain a plurality of noisy audio;
determining a plurality of the noisy audio and the clean audio as the audio samples.
According to a preferred embodiment of the present invention, the preprocessing the noisy audio to obtain frequency spectrum information includes:
acquiring a preset moving window function;
carrying out Fourier transform on the noisy audio based on the preset moving window function to obtain a spectrogram;
acquiring a preset processing duration, and calculating the ratio of the audio duration to the preset processing duration;
and carrying out segmentation processing on the spectrogram according to the preset processing duration to obtain the frequency spectrum information, wherein the number of pieces of frequency spectrum information is equal to the ratio.
According to a preferred embodiment of the present invention, the frequency domain signal processing network includes a gated neural network, a fully connected network, and an activation function, the gated neural network includes a reset gate and an update gate, and the processing the spectral information based on the frequency domain signal processing network to obtain a spectral mask feature corresponding to the spectral information includes:
acquiring time sequence information of the frequency spectrum information, wherein the time sequence information comprises a first frequency spectrum at a first moment and a second frequency spectrum at a second moment;
analyzing the first frequency spectrum and the second frequency spectrum based on the reset parameter of the reset gate to obtain candidate information of the second moment;
calculating an amount of information for the first spectrum based on an update parameter in the update gate, the first spectrum, and the second spectrum;
generating output information of the second moment according to the first frequency spectrum, the candidate information and the information amount, and taking the output information as the first frequency spectrum for the next moment, until all of the time sequence information has been processed, to obtain a first network output of the gated neural network;
analyzing the first network output according to the weight matrix and the bias value in the fully-connected network to obtain a second network output;
and processing the second network output based on the activation function to obtain the spectrum mask characteristic.
According to a preferred embodiment of the present invention, the obtaining the time-frequency feature of the noisy audio according to the spectrum information and the spectrum mask feature includes:
calculating amplitude information in the frequency spectrum information, and extracting phase information from the frequency spectrum information;
calculating the product of the amplitude information, the phase information and the spectrum mask characteristic to obtain a predicted spectrum;
carrying out inverse Fourier transform processing on the predicted frequency spectrum to obtain a predicted time frequency;
and extracting the characteristics in the predicted time frequency based on the first preset convolution layer to obtain the time frequency characteristics.
According to the preferred embodiment of the present invention, the generating a prediction audio according to the time-frequency feature and the time-frequency mask feature comprises:
calculating the product of the time-frequency characteristic and the time-frequency mask characteristic to obtain an enhanced characteristic;
performing upsampling processing on the enhanced features based on a second preset convolution layer to obtain a restored signal;
acquiring initial information of the restored signal on each time sequence;
if the number of the initial information on any time sequence is multiple, calculating the average value of the multiple initial information on any time sequence to obtain the overlapped information on any time sequence;
generating prediction information according to the initial information and the overlapping information;
and converting the prediction information to obtain the prediction audio.
According to a preferred embodiment of the present invention, the adjusting the network parameters of the preset learner based on the predicted audio and the pure audio to obtain the noise reduction model includes:
acquiring first time domain information of the pure audio and acquiring second time domain information of the predicted audio;
calculating a loss value of the preset learner according to the following formula:
[Formula: Figure BDA0003177844240000041]
wherein loss refers to the loss value, $y_t$ refers to the first time domain information, and the symbol shown in Figure BDA0003177844240000042 refers to the second time domain information;
and adjusting the network parameters according to the loss value until the loss value is not reduced any more, so as to obtain the noise reduction model.
In another aspect, the present invention further provides an audio noise reduction apparatus, including:
an obtaining unit, configured to acquire an audio sample and acquire a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
a preprocessing unit, configured to preprocess the noisy audio to obtain frequency spectrum information;
a processing unit, configured to process the frequency spectrum information based on the frequency domain signal processing network to obtain frequency spectrum mask characteristics corresponding to the frequency spectrum information;
the obtaining unit is further configured to acquire time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics;
the processing unit is further configured to process the time-frequency characteristics based on the time domain signal processing network to obtain time-frequency mask characteristics;
a generating unit, configured to generate a predicted audio according to the time-frequency characteristics and the time-frequency mask characteristics;
an adjusting unit, configured to adjust the network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
the obtaining unit is further configured to acquire a request audio, and perform noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
In another aspect, the present invention further provides an electronic device, including:
a memory storing computer readable instructions; and
a processor executing computer readable instructions stored in the memory to implement the audio noise reduction method.
In another aspect, the present invention also provides a computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the audio noise reduction method.
According to the above technical scheme, preprocessing the noisy audio converts the whole noisy audio into frequency spectrum information, which improves the processing efficiency of the frequency spectrum information and therefore the noise reduction efficiency for the noisy audio. The frequency domain signal processing network reduces the noise of the noisy audio in the frequency domain, and the time domain signal processing network enhances the phase information of the target sound source in the time domain, so that noise is reduced in both domains. This improves the noise reduction accuracy of the noise reduction model and, in turn, the speech enhancement effect of the target audio.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the audio noise reduction method of the present invention.
Fig. 2 is a functional block diagram of an audio noise reduction device according to a preferred embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device implementing a method for audio noise reduction according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of a preferred embodiment of the audio noise reduction method according to the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
The audio noise reduction method is applied to one or more electronic devices, which are devices capable of automatically performing numerical calculation and/or information processing according to computer readable instructions set or stored in advance, and the hardware thereof includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive Internet Protocol Television (IPTV), a smart wearable device, and the like.
The electronic device may include a network device and/or a user device. The network device includes, but is not limited to, a single network electronic device, an electronic device group consisting of a plurality of network electronic devices, or a cloud-computing-based cloud consisting of a large number of hosts or network electronic devices.
The network in which the electronic device is located includes, but is not limited to: the internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
S10, acquiring an audio sample and acquiring a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network.
In at least one embodiment of the present invention, the noisy audio refers to an audio including noise information, and the noisy audio is synthesized according to the clean audio and the recorded audio.
The clean audio refers to audio that does not contain noise information.
The frequency domain signal processing network is a network for eliminating noise information from the frequency domain of the noisy audio.
The time domain signal processing network refers to a network which eliminates noise information from the time domain of the noisy audio frequency.
In at least one embodiment of the present invention, the electronic device obtaining an audio sample comprises:
determining the audio duration of the pure audio;
acquiring audios with the duration less than or equal to the audio duration from a recording library to obtain a plurality of recorded audios;
carrying out arbitrary synthesis processing on the pure audio and each recorded audio to obtain a plurality of noisy audio;
determining a plurality of the noisy audio and the clean audio as the audio samples.
Wherein the audio duration refers to a total duration of the pure audio.
The recording library stores a plurality of mapping relations of audio and duration.
The duration of each of the multiple recorded audios is less than or equal to the audio duration, and the recorded audios can be background sounds such as sirens.
Selecting the recorded audios according to the audio duration ensures that the duration of each synthesized noisy audio is the same as the audio duration of the pure audio.
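The sample construction described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `synthesize_noisy`, the use of NumPy arrays for audio, and the interpretation of "arbitrary synthesis" as adding each recording at a random offset inside the pure audio are all assumptions.

```python
import numpy as np

def synthesize_noisy(clean, recordings, rng):
    """Mix one clean signal with each recording (hypothetical helper)."""
    noisy = []
    for rec in recordings:
        # The source requires each recording to be no longer than the clean audio.
        if len(rec) > len(clean):
            raise ValueError("recording is longer than the clean audio")
        # Assumption: "arbitrary synthesis" means a random placement of the recording.
        offset = int(rng.integers(0, len(clean) - len(rec) + 1))
        mixed = clean.copy()
        mixed[offset:offset + len(rec)] += rec
        noisy.append(mixed)
    return noisy

rng = np.random.default_rng(0)
# One second of a 440 Hz tone at 16 kHz stands in for the pure audio.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
# Two synthetic "recordings" of different lengths stand in for the recording library.
recordings = [0.1 * rng.standard_normal(n) for n in (4000, 16000)]
samples = synthesize_noisy(clean, recordings, rng)
```

Because the recording is added into a copy of the pure audio, every noisy sample has exactly the duration of the pure audio, which is the property the text relies on.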
S11, preprocessing the noisy audio to obtain frequency spectrum information.
In at least one embodiment of the present invention, the frequency spectrum information refers to information of the noisy audio in the frequency domain.
In at least one embodiment of the present invention, the electronic device preprocessing the noisy audio to obtain frequency spectrum information includes:
acquiring a preset moving window function;
carrying out Fourier transform on the noisy audio based on the preset moving window function to obtain a spectrogram;
acquiring a preset processing duration, and calculating the ratio of the audio duration to the preset processing duration;
and carrying out segmentation processing on the spectrogram according to the preset processing duration to obtain the frequency spectrum information, wherein the number of pieces of frequency spectrum information is equal to the ratio.
The preset moving window function can be set according to requirements; it makes the noisy audio yield a stationary signal within a limited time width.
The spectrogram refers to the time-energy mapping of the noisy audio.
The preset processing duration is set according to the noise reduction efficiency requirement.
Carrying out the Fourier transform on the noisy audio based on the preset moving window function makes the generated spectrogram more stable, and segmenting the spectrogram allows the frequency spectrum information to be processed in parallel afterwards, which improves the noise reduction efficiency for the noisy audio.
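The preprocessing step can be sketched with a windowed Fourier transform followed by segmentation. The sketch below is illustrative only: the Hann window, the frame length of 512, the hop of 128, the 16 kHz sample rate, and the function names `spectrogram` and `segment` are assumptions, since the source only states that the window function and the processing duration are preset.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=128):
    """Windowed Fourier transform; returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)  # assumed moving window function
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def segment(spec, n_segments):
    """Split the spectrogram along the time axis into n_segments pieces."""
    return np.array_split(spec, n_segments, axis=0)

sample_rate = 16000          # assumed
audio_duration = 2.0         # seconds, duration of the noisy audio
processing_duration = 0.5    # seconds, the preset processing duration
noisy = np.random.default_rng(0).standard_normal(int(sample_rate * audio_duration))

spec = spectrogram(noisy)
ratio = int(audio_duration / processing_duration)
spectrum_info = segment(spec, ratio)  # number of pieces equals the ratio
```

Each element of `spectrum_info` covers roughly one processing duration of audio and can be handed to a separate worker, which is the parallelism the text refers to.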
S12, processing the frequency spectrum information based on the frequency domain signal processing network to obtain a frequency spectrum mask characteristic corresponding to the frequency spectrum information.
In at least one embodiment of the invention, the spectral mask feature is used to mask noise information of the noisy audio in the frequency domain. Each piece of frequency spectrum information has a corresponding spectral mask feature.
In at least one embodiment of the present invention, the frequency domain signal processing network includes a gated neural network, a fully connected network, and an activation function, the gated neural network includes a reset gate and an update gate, and the electronic device processes the spectral information based on the frequency domain signal processing network to obtain a spectral mask feature corresponding to the spectral information includes:
acquiring time sequence information of the frequency spectrum information, wherein the time sequence information comprises a first frequency spectrum at a first moment and a second frequency spectrum at a second moment;
analyzing the first frequency spectrum and the second frequency spectrum based on the reset parameter of the reset gate to obtain candidate information of the second moment;
calculating an amount of information for the first spectrum based on an update parameter in the update gate, the first spectrum, and the second spectrum;
generating output information of the second moment according to the first frequency spectrum, the candidate information and the information amount, and taking the output information as the first frequency spectrum for the next moment, until all of the time sequence information has been processed, to obtain a first network output of the gated neural network;
analyzing the first network output according to the weight matrix and the bias value in the fully-connected network to obtain a second network output;
and processing the second network output based on the activation function to obtain the spectrum mask characteristic.
Wherein the reset parameter, the update parameter, the weight matrix and the bias value are network parameters initially set in the preset learner.
The information amount is the amount of information from the first spectrum that is retained at the second moment.
The activation function is typically set to a sigmoid function.
Analyzing the time sequence information with the gated neural network addresses the vanishing gradient and exploding gradient problems, so that the accuracy of the frequency spectrum mask characteristics can be improved.
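The gate computations above resemble a standard gated recurrent unit (GRU) followed by a fully connected layer and a sigmoid. The sketch below uses the textbook GRU equations as a stand-in; the exact gate formulas of the source are not given, so the parameter names (`Wr`, `Ur`, `Wz`, `Uz`, `Wh`, `Uh`, `W_fc`, `b_fc`), the random initial values, and the choice of the first frame as the initial state are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    """One step: h_prev is the 'first spectrum', x is the 'second spectrum'."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate: amount of h_prev kept
    cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))     # candidate information
    return z * h_prev + (1.0 - z) * cand                     # output at the second moment

def spectral_mask(frames, p):
    """Run the gated network over all frames, then the dense layer and the sigmoid."""
    h = frames[0]            # assumption: the first frame is the initial state
    outputs = [h]
    for x in frames[1:]:
        h = gru_step(h, x, p)  # the output becomes the 'first spectrum' of the next step
        outputs.append(h)
    first_out = np.stack(outputs)                     # first network output
    second_out = first_out @ p["W_fc"].T + p["b_fc"]  # fully connected network
    return sigmoid(second_out)                        # mask values in (0, 1)

rng = np.random.default_rng(0)
n_bins = 8
params = {name: 0.1 * rng.standard_normal((n_bins, n_bins))
          for name in ("Wr", "Ur", "Wz", "Uz", "Wh", "Uh", "W_fc")}
params["b_fc"] = np.zeros(n_bins)
frames = rng.random((5, n_bins))   # toy magnitude frames
mask = spectral_mask(frames, params)
```

The sigmoid at the end keeps every mask value between 0 and 1, so multiplying the spectrum by the mask can only attenuate a frequency bin, never amplify it.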
S13, acquiring the time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics.
In at least one embodiment of the present invention, the time-frequency characteristics refer to characteristics of the noisy audio in time and frequency.
In at least one embodiment of the present invention, the obtaining, by the electronic device, the time-frequency feature of the noisy audio according to the spectrum information and the spectrum mask feature includes:
calculating amplitude information in the frequency spectrum information, and extracting phase information from the frequency spectrum information;
calculating the product of the amplitude information, the phase information and the spectrum mask characteristic to obtain a predicted spectrum;
carrying out inverse Fourier transform processing on the predicted frequency spectrum to obtain a predicted time frequency;
and extracting the characteristics in the predicted time frequency based on the first preset convolution layer to obtain the time frequency characteristics.
Wherein the convolution kernel size of the first preset convolution layer is typically set to 1 x 1.
The frequency spectrum mask characteristics allow noise information in the noisy audio to be eliminated accurately, which improves the accuracy of the predicted frequency spectrum, so that the time-frequency characteristics can then be extracted accurately by the convolution layer.
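The masking and inverse transform steps can be sketched as below. This is an illustration under assumptions: the phase information is represented as a unit-magnitude complex factor, the function name `apply_spectral_mask` is hypothetical, and the feature extraction by the first preset convolution layer is not shown.

```python
import numpy as np

def apply_spectral_mask(spec, mask, frame_len):
    """Multiply amplitude, phase and mask, then return time-domain frames."""
    magnitude = np.abs(spec)                    # amplitude information
    phase = np.exp(1j * np.angle(spec))         # phase information as a unit complex factor
    predicted_spec = magnitude * phase * mask   # predicted spectrum
    return np.fft.irfft(predicted_spec, n=frame_len, axis=1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 512))          # toy time-domain frames
spec = np.fft.rfft(frames, axis=1)

# An all-ones mask leaves the signal unchanged; an all-zeros mask silences it.
restored = apply_spectral_mask(spec, np.ones(spec.shape), 512)
silenced = apply_spectral_mask(spec, np.zeros(spec.shape), 512)
```

The two extreme masks are a quick sanity check: a learned mask lies between them and attenuates only the bins the network considers noisy.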
S14, processing the time-frequency characteristics based on the time-domain signal processing network to obtain time-frequency mask characteristics.
In at least one embodiment of the present invention, the time-frequency mask feature is used to mask noise information of the noisy audio in the time domain.
In at least one embodiment of the present invention, the time domain signal processing network comprises an instantaneous normalization layer, a gated recurrent unit layer, a fully connected layer, and an activation function. The electronic device processes the time-frequency characteristics based on the instantaneous normalization layer, the gated recurrent unit layer, the fully connected layer and the activation function to obtain the time-frequency mask characteristics.
In at least one embodiment of the present invention, a manner in which the electronic device processes the time-frequency feature based on the time-domain signal processing network is similar to a manner in which the electronic device processes the frequency spectrum information based on the frequency-domain signal processing network, which is not described herein again.
S15, generating a predicted audio according to the time-frequency characteristics and the time-frequency mask characteristics.
In at least one embodiment of the present invention, the predicted audio refers to an audio obtained by denoising the noisy audio in the frequency domain and the time domain by the preset learner.
In at least one embodiment of the present invention, the generating, by the electronic device, a predicted audio according to the time-frequency feature and the time-frequency mask feature includes:
calculating the product of the time-frequency characteristic and the time-frequency mask characteristic to obtain an enhanced characteristic;
performing upsampling processing on the enhanced features based on a second preset convolution layer to obtain a restored signal;
acquiring initial information of the restored signal on each time sequence;
if the number of the initial information on any time sequence is multiple, calculating the average value of the multiple initial information on any time sequence to obtain the overlapped information on any time sequence;
generating prediction information according to the initial information and the overlapping information;
and converting the prediction information to obtain the prediction audio.
Wherein the prediction information refers to information of the predicted audio in a time domain.
With this embodiment, the generated prediction information is smoother, which improves the noise reduction effect of the predicted audio.
Specifically, the electronic device generates prediction information according to the initial information and the overlapping information.
For example: the initial information at the first timing is $n_1$, the initial information at the second timing is $n_2$, $n_3$ and $n_4$, and the initial information at the third timing is $n_5$. Since a plurality of pieces of initial information are detected at the second timing, the overlapping information at the second timing is calculated as $\frac{n_2 + n_3 + n_4}{3}$.
The prediction information can then be generated as: $n_1$, $\frac{n_2 + n_3 + n_4}{3}$, $n_5$.
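The averaging of overlapping values in the example above can be sketched as follows. The function name `average_overlaps` and the representation of the restored signal as (timing, value) pairs are assumptions made for illustration.

```python
from collections import defaultdict

def average_overlaps(initial_info):
    """initial_info: iterable of (timing, value) pairs from the restored signal."""
    grouped = defaultdict(list)
    for timing, value in initial_info:
        grouped[timing].append(value)
    # A timing with one value keeps it; a timing with several values gets their mean.
    return [sum(grouped[t]) / len(grouped[t]) for t in sorted(grouped)]

# Timing 1 has one value, timing 2 has three overlapping values, timing 3 has one.
initial = [(1, 2.0), (2, 4.0), (2, 6.0), (2, 8.0), (3, 10.0)]
prediction = average_overlaps(initial)
print(prediction)  # [2.0, 6.0, 10.0]
```

Here the three values at the second timing collapse to their mean, 6.0, matching the pattern of the worked example.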
S16, adjusting the network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model.
In at least one embodiment of the present invention, the network parameters include initialization configuration parameters of the frequency domain signal processing network and the time domain signal processing network.
The noise reduction model is used for eliminating noise information in the audio.
In at least one embodiment of the present invention, the adjusting, by the electronic device, the network parameter of the preset learner based on the predicted audio and the pure audio to obtain the noise reduction model includes:
acquiring first time domain information of the pure audio and acquiring second time domain information of the predicted audio;
calculating a loss value of the preset learner according to the following formula:
[Formula: Figure BDA0003177844240000111]
wherein loss refers to the loss value, $y_t$ refers to the first time domain information, and the symbol shown in Figure BDA0003177844240000112 refers to the second time domain information;
and adjusting the network parameters according to the loss value until the loss value is not reduced any more, so as to obtain the noise reduction model.
The accuracy of the loss value can be improved through the first time domain information and the second time domain information, and therefore the noise reduction precision of the noise reduction model can be ensured according to the loss value.
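The stopping rule "until the loss value is not reduced any more" can be sketched as a simple plateau check. The loss formula itself is only available as an image in the source and is not reproduced here; `step_fn` is a hypothetical callable that performs one parameter update and returns the resulting loss value, and the `patience` threshold is an assumption.

```python
def train_until_plateau(step_fn, patience=3, max_steps=1000):
    """Call step_fn repeatedly and stop once the loss has not improved for `patience` steps."""
    best = float("inf")
    stale = 0
    history = []
    for _ in range(max_steps):
        loss = step_fn()          # one update of the network parameters (hypothetical)
        history.append(loss)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break             # the loss value is no longer decreasing
    return best, history

# A canned loss sequence stands in for real training steps.
losses = iter([1.0, 0.5, 0.3, 0.3, 0.31, 0.3, 0.2])
best, history = train_until_plateau(lambda: next(losses), patience=3)
```

With this toy sequence the loop stops after the sixth step, because three consecutive steps fail to improve on the best loss of 0.3; the later value 0.2 is never reached, which shows the trade-off that a small `patience` brings.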
S17, acquiring the request audio, and carrying out noise reduction processing on the request audio based on the noise reduction model to obtain the target audio.
In at least one embodiment of the present invention, the requested audio refers to audio that requires noise reduction. The requested audio may be any audio received in real-time.
The target audio is the audio obtained after noise reduction is performed on the request audio. If the accuracy of the noise reduction model reaches 100%, the target audio does not contain any noise information.
It is emphasized that, to further ensure the privacy and security of the target audio, the target audio may also be stored in a node of a blockchain.
In at least one embodiment of the present invention, a manner in which the electronic device performs noise reduction processing on the request audio based on the noise reduction model is similar to a manner in which the electronic device performs processing on the noisy audio based on the preset learner to obtain the predicted audio, which is not described in detail herein again.
According to the above technical scheme, the model loss value of the preset learner can be accurately determined from the pure audio and the decoded audio that the preset learner predicts for the noisy audio, so that the network parameters can be accurately adjusted according to the model loss value, thereby improving the enhancement effect of the audio noise reduction model. In addition, the coding network is used to encode the noisy audio, and the audio coding information includes phase information in each voice time sequence state, which further improves the enhancement effect of the audio noise reduction model and thus of the target audio.
Fig. 2 is a functional block diagram of an audio noise reduction device according to a preferred embodiment of the present invention. The audio noise reduction apparatus 11 includes an obtaining unit 110, a preprocessing unit 111, a processing unit 112, a generating unit 113, and an adjusting unit 114. The module/unit referred to herein is a series of computer readable instruction segments that can be accessed by the processor 13 and perform a fixed function and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.
The obtaining unit 110 obtains an audio sample, and obtains a preset learner, where the audio sample includes a noisy audio and a clean audio, and the preset learner includes a frequency domain signal processing network and a time domain signal processing network.
In at least one embodiment of the present invention, the noisy audio refers to an audio including noise information, and the noisy audio is synthesized according to the clean audio and the recorded audio.
The clean audio refers to audio that does not contain noise information.
The frequency domain signal processing network is a network for eliminating noise information from the frequency domain of the noisy audio.
The time domain signal processing network refers to a network that eliminates noise information from the time domain of the noisy audio.
In at least one embodiment of the present invention, the obtaining unit 110 obtains the audio sample including:
determining the audio duration of the pure audio;
acquiring audios with the duration less than or equal to the audio duration from a recording library to obtain a plurality of recorded audios;
carrying out arbitrary synthesis processing on the pure audio and each recorded audio to obtain a plurality of noisy audio;
determining a plurality of the noisy audio and the clean audio as the audio samples.
Wherein the audio duration refers to a total duration of the pure audio.
The recording library stores a plurality of mapping relations of audio and duration.
The time length of the multiple recorded audios is less than or equal to the audio time length, and the multiple recorded audios can be background sounds such as sirens and the like.
The multiple recorded audios can be obtained through the audio time length, so that the time length of the synthesized noisy audio is the same as the audio time length of the pure audio.
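The sample construction described above can be sketched as follows. This is a minimal illustration, assuming that "arbitrary synthesis" means adding each recording to the pure audio at a random offset; the original does not specify the mixing rule. The function names, the `noise_gain` parameter and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def select_recordings(library: Sequence[Sequence[float]],
                      clean_len: int) -> List[Sequence[float]]:
    """Keep recordings whose duration (in samples) does not exceed the pure audio's."""
    return [rec for rec in library if len(rec) <= clean_len]


def mix(clean: Sequence[float], noise: Sequence[float],
        rng: random.Random, noise_gain: float = 0.5) -> List[float]:
    """Add `noise` to `clean` at a random offset (assumed mixing rule).
    The result always has the same length as the pure audio."""
    offset = rng.randint(0, len(clean) - len(noise))
    noisy = list(clean)
    for i, value in enumerate(noise):
        noisy[offset + i] += noise_gain * value
    return noisy


def build_samples(clean: Sequence[float], library: Sequence[Sequence[float]],
                  seed: int = 0) -> List[Tuple[List[float], List[float]]]:
    """Return one (noisy audio, pure audio) pair per eligible recording."""
    rng = random.Random(seed)
    recordings = select_recordings(library, len(clean))
    return [(mix(clean, rec, rng), list(clean)) for rec in recordings]
```

Because only recordings no longer than the pure audio are selected, every synthesized noisy audio keeps the duration of the pure audio, which is the property the paragraph above relies on.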
The preprocessing unit 111 preprocesses the noisy audio to obtain frequency spectrum information.
In at least one embodiment of the present invention, the spectrum information refers to information of the noisy audio in the frequency domain.
In at least one embodiment of the present invention, the preprocessing unit 111 preprocessing the noisy audio to obtain frequency spectrum information includes:
acquiring a preset moving window function;
carrying out Fourier transform on the noisy audio based on the preset moving window function to obtain a spectrogram;
acquiring a preset processing duration, and calculating the ratio of the audio duration to the preset processing duration;
and carrying out segmentation processing on the spectrogram according to the preset processing duration to obtain the frequency spectrum information, wherein the number of pieces of frequency spectrum information is the same as the ratio.
The preset moving window function can be set according to requirements, and enables the noisy audio to yield a stable signal within a limited time width.
The spectrogram refers to the mapping of the noisy audio between time and energy.
The preset processing duration is set according to the noise reduction efficiency requirement.
Performing the Fourier transform on the noisy audio by means of the preset moving window function makes the generated spectrogram more stable, and segmenting the spectrogram facilitates subsequent parallel processing of the frequency spectrum information, thereby improving the noise reduction efficiency for the noisy audio.
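The preprocessing steps above can be sketched with NumPy as follows. This is an illustration under assumptions: a Hann window stands in for the preset moving window function, and the frame length, hop size, sample rate and durations are hypothetical values, none of which are given in the original.

```python
import numpy as np


def windowed_spectrogram(signal: np.ndarray, frame_len: int = 512,
                         hop: int = 128) -> np.ndarray:
    """Short-time Fourier transform: slide a window over the signal and take
    the FFT of each windowed frame. Returns shape (n_frames, frame_len // 2 + 1)."""
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # assumed choice of moving window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[k * hop:k * hop + frame_len] for k in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)


def split_spectrogram(spec: np.ndarray, audio_duration: float,
                      processing_duration: float) -> list:
    """Split the spectrogram along time into as many pieces as the ratio
    audio_duration / processing_duration."""
    n_segments = int(round(audio_duration / processing_duration))
    return np.array_split(spec, n_segments, axis=0)


sample_rate = 16000          # hypothetical
audio_duration = 2.0         # seconds, hypothetical
processing_duration = 0.5    # seconds, hypothetical
noisy = np.random.default_rng(0).standard_normal(int(sample_rate * audio_duration))

spec = windowed_spectrogram(noisy)
segments = split_spectrogram(spec, audio_duration, processing_duration)
```

Each element of `segments` is one piece of frequency spectrum information and can be handed to the frequency domain network independently, which is what makes parallel processing possible.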
The processing unit 112 processes the spectrum information based on the frequency domain signal processing network, and obtains a spectrum mask feature corresponding to the spectrum information.
In at least one embodiment of the invention, the spectral mask feature is used to mask noise information of the noisy audio in the frequency domain. Each piece of spectrum information corresponds to a respective spectral mask feature.
In at least one embodiment of the present invention, the frequency domain signal processing network includes a gated neural network, a fully connected network, and an activation function, the gated neural network includes a reset gate and an update gate, and the processing unit 112 processes the spectral information based on the frequency domain signal processing network to obtain a spectral mask feature corresponding to the spectral information includes:
acquiring time sequence information of the frequency spectrum information, wherein the time sequence information comprises a first frequency spectrum at a first moment and a second frequency spectrum at a second moment;
analyzing the first frequency spectrum and the second frequency spectrum based on the reset parameter of the reset gate to obtain candidate information of the second moment;
calculating an amount of information for the first spectrum based on an update parameter in the update gate, the first spectrum, and the second spectrum;
generating output information of the second moment according to the first frequency spectrum, the candidate information and the information amount, and determining the output information as the first frequency spectrum for the next moment, until all of the time sequence information has participated in training, to obtain a first network output of the gated neural network;
analyzing the first network output according to the weight matrix and the bias value in the fully-connected network to obtain a second network output;
and processing the second network output based on the activation function to obtain the spectrum mask characteristic.
Wherein the reset parameter, the update parameter, the weight matrix and the bias value are network parameters initially set in the preset learner.
The information amount is an information amount of the first spectrum reserved in the second time.
The activation function is typically set to a sigmoid function.
Analyzing the time sequence information through the gated neural network mitigates the problems of vanishing and exploding gradients, and therefore improves the accuracy of the spectrum mask feature.
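The gated recurrence, fully connected layer and sigmoid activation described above can be sketched as follows. This is a minimal sketch assuming the standard gated recurrent unit equations; the original does not give the exact formulas, and the class name, weight names, hidden size and random initialization are hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


class GRUMaskNet:
    """Reset gate + update gate recurrence, then a fully connected layer and sigmoid."""

    def __init__(self, n_bins: int, hidden: int, seed: int = 0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, size=shape)

        self.hidden = hidden
        # Reset parameters, update parameters and candidate parameters.
        self.Wr, self.Ur, self.br = init(hidden, n_bins), init(hidden, hidden), np.zeros(hidden)
        self.Wz, self.Uz, self.bz = init(hidden, n_bins), init(hidden, hidden), np.zeros(hidden)
        self.Wc, self.Uc, self.bc = init(hidden, n_bins), init(hidden, hidden), np.zeros(hidden)
        # Weight matrix and bias value of the fully connected layer.
        self.Wo, self.bo = init(n_bins, hidden), np.zeros(n_bins)

    def step(self, prev: np.ndarray, current: np.ndarray) -> np.ndarray:
        reset = sigmoid(self.Wr @ current + self.Ur @ prev + self.br)
        candidate = np.tanh(self.Wc @ current + self.Uc @ (reset * prev) + self.bc)
        keep = sigmoid(self.Wz @ current + self.Uz @ prev + self.bz)  # retained amount
        return keep * prev + (1.0 - keep) * candidate

    def mask(self, magnitudes: np.ndarray) -> np.ndarray:
        """magnitudes: (n_frames, n_bins). Returns a mask of the same shape in (0, 1)."""
        state = np.zeros(self.hidden)
        outputs = []
        for frame in magnitudes:
            state = self.step(state, frame)  # the output becomes the next "first" input
            outputs.append(state)
        first_output = np.stack(outputs)
        second_output = first_output @ self.Wo.T + self.bo
        return sigmoid(second_output)


net = GRUMaskNet(n_bins=257, hidden=64)
demo_mask = net.mask(np.abs(np.random.default_rng(1).standard_normal((10, 257))))
```

The sigmoid keeps every mask value between 0 and 1, so multiplying a spectrum by the mask can only attenuate each frequency bin.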
The obtaining unit 110 obtains the time-frequency feature of the noisy audio according to the spectrum information and the spectrum mask feature.
In at least one embodiment of the present invention, the time-frequency characteristics refer to characteristics of the noisy audio in time and frequency.
In at least one embodiment of the present invention, the obtaining unit 110 obtains the time-frequency characteristic of the noisy audio according to the spectrum information and the spectrum mask characteristic includes:
calculating amplitude information in the frequency spectrum information, and extracting phase information from the frequency spectrum information;
calculating the product of the amplitude information, the phase information and the spectrum mask characteristic to obtain a predicted spectrum;
carrying out inverse Fourier transform processing on the predicted frequency spectrum to obtain a predicted time frequency;
and extracting the characteristics in the predicted time frequency based on the first preset convolution layer to obtain the time frequency characteristics.
Wherein the convolution kernel size of the first preset convolution layer is typically set to 1 x 1.
Noise information in the noisy audio can be accurately eliminated through the spectrum mask feature, which improves the accuracy of the predicted spectrum, so that the time-frequency features can then be accurately extracted by the first preset convolution layer.
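The four steps above can be sketched as follows. This is an illustration under assumptions: the spectrum is taken to be a per-frame real FFT, the phase information is represented as a unit-modulus complex factor, and the convolution with kernel size 1 is written as the same linear map applied to every frame. The function names and the sizes used are hypothetical.

```python
import numpy as np


def apply_spectral_mask(spec: np.ndarray, mask: np.ndarray,
                        frame_len: int) -> np.ndarray:
    """Multiply amplitude, phase and mask, then invert the Fourier transform.
    spec, mask: (n_frames, frame_len // 2 + 1). Returns (n_frames, frame_len)."""
    magnitude = np.abs(spec)
    phase = np.exp(1j * np.angle(spec))       # unit-modulus phase factor
    predicted_spec = magnitude * phase * mask
    return np.fft.irfft(predicted_spec, n=frame_len, axis=1)


def pointwise_conv(frames: np.ndarray, weight: np.ndarray,
                   bias: np.ndarray) -> np.ndarray:
    """Kernel-size-1 convolution over the frame sequence: each frame is
    projected with the same weights. weight: (n_features, frame_len)."""
    return frames @ weight.T + bias


rng = np.random.default_rng(0)
frame_len, n_features = 512, 256              # hypothetical sizes
frames = rng.standard_normal((5, frame_len))
spec = np.fft.rfft(frames, axis=1)

restored = apply_spectral_mask(spec, np.ones(spec.shape), frame_len)
features = pointwise_conv(restored,
                          rng.normal(0.0, 0.1, size=(n_features, frame_len)),
                          np.zeros(n_features))
```

With an all-ones mask the frames are recovered unchanged, and with an all-zeros mask the output is silent; a learned mask falls between these two extremes for each frequency bin.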
The processing unit 112 processes the time-frequency feature based on the time-domain signal processing network to obtain a time-frequency mask feature.
In at least one embodiment of the present invention, the time-frequency mask feature is used to mask noise information of the noisy audio in the time domain.
In at least one embodiment of the present invention, the time domain signal processing network comprises an instantaneous normalization layer, a gated recurrent unit layer, a fully connected layer, and an activation function. The processing unit 112 processes the time-frequency feature based on the instantaneous normalization layer, the gated recurrent unit layer, the fully connected layer, and the activation function to obtain the time-frequency mask feature.
In at least one embodiment of the present invention, a manner in which the processing unit 112 processes the time-frequency feature based on the time-domain signal processing network is similar to a manner in which the processing unit 112 processes the frequency spectrum information based on the frequency-domain signal processing network, which is not described herein again.
The generating unit 113 generates a prediction audio according to the time-frequency feature and the time-frequency mask feature.
In at least one embodiment of the present invention, the predicted audio refers to an audio obtained by denoising the noisy audio in the frequency domain and the time domain by the preset learner.
In at least one embodiment of the present invention, the generating unit 113 generates the prediction audio according to the time-frequency feature and the time-frequency mask feature includes:
calculating the product of the time-frequency characteristic and the time-frequency mask characteristic to obtain an enhanced characteristic;
performing upsampling processing on the enhanced features based on a second preset convolution layer to obtain a restored signal;
acquiring initial information of the restored signal on each time sequence;
if the number of the initial information on any time sequence is multiple, calculating the average value of the multiple initial information on any time sequence to obtain the overlapped information on any time sequence;
generating prediction information according to the initial information and the overlapping information;
and converting the prediction information to obtain the prediction audio.
Wherein the prediction information refers to information of the predicted audio in a time domain.
According to this embodiment, the generated prediction information can be made smoother, so that the noise reduction effect of the predicted audio is improved.
Specifically, the generating unit 113 generates the prediction information according to the initial information and the overlapping information.
For example: the initial information at the first timing is $n_1$, the initial information at the second timing is $n_2$, $n_3$ and $n_4$, and the initial information at the third timing is $n_5$. Upon detecting that there are multiple pieces of initial information at the second timing, the overlapping information at the second timing is calculated as $\frac{n_2 + n_3 + n_4}{3}$.
The prediction information can then be generated as: $n_1$, $\frac{n_2 + n_3 + n_4}{3}$, $n_5$.
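The averaging in the example above can be sketched as follows. This is a minimal illustration in which the restored signal is assumed to be already grouped by timing; the function name `merge_overlaps` and the numeric values are hypothetical.

```python
from typing import List, Sequence


def merge_overlaps(initial_by_timing: Sequence[Sequence[float]]) -> List[float]:
    """For each timing, keep the single initial value, or replace several
    overlapping initial values by their average."""
    prediction: List[float] = []
    for values in initial_by_timing:
        if not values:
            raise ValueError("every timing needs at least one initial value")
        if len(values) > 1:
            prediction.append(sum(values) / len(values))  # overlapping information
        else:
            prediction.append(values[0])
    return prediction


# n1 = 1.0; n2, n3, n4 = 2.0, 3.0, 4.0; n5 = 5.0
prediction = merge_overlaps([[1.0], [2.0, 3.0, 4.0], [5.0]])
# prediction == [1.0, 3.0, 5.0]
```

Averaging the overlapping values, rather than picking one of them, avoids abrupt jumps where neighbouring frames disagree.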
The adjusting unit 114 adjusts the network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model.
In at least one embodiment of the present invention, the network parameters include initialization configuration parameters of the frequency domain signal processing network and the time domain signal processing network.
The noise reduction model is used for eliminating noise information in the audio.
In at least one embodiment of the present invention, the adjusting unit 114 adjusts the network parameters of the preset learner based on the predicted audio and the pure audio, and obtaining the noise reduction model includes:
acquiring first time domain information of the pure audio and acquiring second time domain information of the predicted audio;
calculating a loss value of the preset learner according to the following formula:
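[Formula image BDA0003177844240000171: expression for the loss value; not recoverable from the extracted text]
wherein loss refers to the loss value, $y_t$ refers to the first time domain information, and the symbol shown in formula image BDA0003177844240000172 refers to the second time domain information;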
and adjusting the network parameters according to the loss value until the loss value is not reduced any more, so as to obtain the noise reduction model.
The accuracy of the loss value can be improved through the first time domain information and the second time domain information, and therefore the noise reduction precision of the noise reduction model can be ensured according to the loss value.
The obtaining unit 110 obtains the requested audio, and performs noise reduction processing on the requested audio based on the noise reduction model to obtain the target audio.
In at least one embodiment of the present invention, the requested audio refers to audio that requires noise reduction. The requested audio may be any audio received in real-time.
The target audio is the audio obtained after noise reduction is performed on the requested audio. If the accuracy of the noise reduction model reaches 100%, the target audio does not contain any noise information.
It is emphasized that, to further ensure the privacy and security of the target audio, the target audio may also be stored in a node of a blockchain.
In at least one embodiment of the present invention, a manner of performing noise reduction processing on the request audio by the obtaining unit 110 based on the noise reduction model is similar to a manner of obtaining the predicted audio by processing the noisy audio based on the preset learner, and details thereof are not repeated herein.
According to the above technical scheme, the model loss value of the preset learner can be accurately determined from the pure audio and the decoded audio that the preset learner predicts for the noisy audio, so that the network parameters can be accurately adjusted according to the model loss value, thereby improving the enhancement effect of the audio noise reduction model. In addition, the coding network is used to encode the noisy audio, and the audio coding information includes phase information in each voice time sequence state, which further improves the enhancement effect of the audio noise reduction model and thus of the target audio.
Fig. 3 is a schematic structural diagram of an electronic device implementing a method for audio noise reduction according to a preferred embodiment of the present invention.
In one embodiment of the present invention, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer readable instructions, such as an audio noise reduction program, stored in the memory 12 and executable on the processor 13.
It will be appreciated by a person skilled in the art that the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation of the electronic device 1, and that it may comprise more or less components than shown, or some components may be combined, or different components, e.g. the electronic device 1 may further comprise an input output device, a network access device, a bus, etc.
The Processor 13 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The processor 13 is an operation core and a control center of the electronic device 1, and is connected to each part of the whole electronic device 1 by various interfaces and lines, and executes an operating system of the electronic device 1 and various installed application programs, program codes, and the like.
Illustratively, the computer readable instructions may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to implement the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing specific functions, which are used for describing the execution process of the computer readable instructions in the electronic device 1. For example, the computer readable instructions may be partitioned into an acquisition unit 110, a pre-processing unit 111, a processing unit 112, a generation unit 113, and an adjustment unit 114.
The memory 12 may be used for storing the computer readable instructions and/or modules, and the processor 13 implements various functions of the electronic device 1 by running or executing the computer readable instructions and/or modules stored in the memory 12 and invoking data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. The memory 12 may include non-volatile and volatile memories, such as: a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory having a physical form, such as a memory stick, a TF Card (Trans-flash Card), or the like.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by computer readable instructions instructing the relevant hardware; the computer readable instructions may be stored in a computer readable storage medium, and when the computer readable instructions are executed by a processor, the steps of the method embodiments may be implemented.
Wherein the computer readable instructions comprise computer readable instruction code which may be in source code form, object code form, an executable file or some intermediate form, and the like. The computer-readable medium may include: any entity or device capable of carrying said computer readable instruction code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM).
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In conjunction with fig. 1, the memory 12 of the electronic device 1 stores computer-readable instructions to implement an audio noise reduction method, and the processor 13 executes the computer-readable instructions to implement:
obtaining an audio sample and obtaining a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
preprocessing the noisy audio to obtain frequency spectrum information;
processing the frequency spectrum information based on the frequency domain signal processing network to obtain frequency spectrum mask characteristics corresponding to the frequency spectrum information;
acquiring the time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics;
processing the time-frequency characteristics based on the time-domain signal processing network to obtain time-frequency mask characteristics;
generating a prediction audio according to the time-frequency characteristics and the time-frequency mask characteristics;
adjusting network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
and acquiring a request audio, and performing noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
Specifically, for the specific implementation of the computer readable instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The computer readable storage medium has computer readable instructions stored thereon, wherein the computer readable instructions when executed by the processor 13 are configured to implement the steps of:
obtaining an audio sample and obtaining a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
preprocessing the noisy audio to obtain frequency spectrum information;
processing the frequency spectrum information based on the frequency domain signal processing network to obtain frequency spectrum mask characteristics corresponding to the frequency spectrum information;
acquiring the time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics;
processing the time-frequency characteristics based on the time-domain signal processing network to obtain time-frequency mask characteristics;
generating a prediction audio according to the time-frequency characteristics and the time-frequency mask characteristics;
adjusting network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
and acquiring a request audio, and performing noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. The plurality of units or devices may also be implemented by one unit or device through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An audio noise reduction method, comprising:
obtaining an audio sample and obtaining a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
preprocessing the noisy audio to obtain frequency spectrum information;
processing the frequency spectrum information based on the frequency domain signal processing network to obtain frequency spectrum mask characteristics corresponding to the frequency spectrum information;
acquiring the time-frequency characteristics of the noisy audio according to the frequency spectrum information and the frequency spectrum mask characteristics;
processing the time-frequency characteristics based on the time-domain signal processing network to obtain time-frequency mask characteristics;
generating a prediction audio according to the time-frequency characteristics and the time-frequency mask characteristics;
adjusting network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
and acquiring a request audio, and performing noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
2. The audio noise reduction method of claim 1, wherein the obtaining audio samples comprises:
determining the audio duration of the pure audio;
acquiring audios with the duration less than or equal to the audio duration from a recording library to obtain a plurality of recorded audios;
carrying out arbitrary synthesis processing on the pure audio and each recorded audio to obtain a plurality of noisy audio;
determining a plurality of the noisy audio and the clean audio as the audio samples.
3. The method of claim 2, wherein the pre-processing the noisy audio to obtain spectral information comprises:
acquiring a preset moving window function;
carrying out Fourier transform on the noisy audio based on the preset moving window function to obtain a spectrogram;
acquiring a preset processing duration, and calculating the ratio of the audio duration to the preset processing duration;
and carrying out segmentation processing on the spectrogram according to the preset processing duration to obtain the frequency spectrum information, wherein the number of pieces of frequency spectrum information is the same as the ratio.
4. The method of audio noise reduction according to claim 1, wherein the frequency-domain signal processing network comprises a gated neural network, a fully connected network, and an activation function, the gated neural network comprises a reset gate and an update gate, and the processing the spectral information based on the frequency-domain signal processing network to obtain the spectral mask feature corresponding to the spectral information comprises:
acquiring time sequence information of the frequency spectrum information, wherein the time sequence information comprises a first frequency spectrum at a first moment and a second frequency spectrum at a second moment;
analyzing the first frequency spectrum and the second frequency spectrum based on the reset parameter of the reset gate to obtain candidate information of the second moment;
calculating an amount of information for the first spectrum based on an update parameter in the update gate, the first spectrum, and the second spectrum;
generating output information of the second moment according to the first frequency spectrum, the candidate information and the information amount, and determining the output information as the first frequency spectrum for the next moment, until all of the time sequence information has participated in training, to obtain a first network output of the gated neural network;
analyzing the first network output according to the weight matrix and the bias value in the fully-connected network to obtain a second network output;
and processing the second network output based on the activation function to obtain the spectrum mask characteristic.
5. The method of claim 1, wherein the obtaining the time-frequency feature of the noisy audio according to the spectral information and the spectral mask feature comprises:
calculating amplitude information in the frequency spectrum information, and extracting phase information from the frequency spectrum information;
calculating the product of the amplitude information, the phase information and the spectrum mask characteristic to obtain a predicted spectrum;
carrying out inverse Fourier transform processing on the predicted frequency spectrum to obtain a predicted time frequency;
and extracting the characteristics in the predicted time frequency based on the first preset convolution layer to obtain the time frequency characteristics.
6. The method of claim 1, wherein the generating the predicted audio according to the time-frequency features and the time-frequency mask features comprises:
calculating the product of the time-frequency feature and the time-frequency mask feature to obtain an enhanced feature;
performing upsampling processing on the enhanced feature based on a second preset convolution layer to obtain a restored signal;
acquiring initial information of the restored signal at each time step;
if any time step has multiple pieces of initial information, calculating the average value of the multiple pieces of initial information at that time step to obtain overlapping information for that time step;
generating prediction information according to the initial information and the overlapping information;
and converting the prediction information to obtain the predicted audio.
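The masking and overlap-averaging steps of claim 6 can be sketched as below. This is a simplified illustration: the matrix `basis` is a hypothetical stand-in for the second preset convolution layer (each enhanced feature vector is mapped to one time-domain frame), and the function names and the `hop` parameter are assumptions.

```python
import numpy as np

def restore_frames(features, tf_mask, basis):
    """Enhanced feature = feature * mask; 'basis' stands in for the upsampling layer."""
    enhanced = features * tf_mask
    return enhanced @ basis            # shape: (n_frames, frame_len)

def overlap_average(frames, hop):
    """Merge overlapping frames, averaging wherever several frames cover a sample."""
    n_frames, frame_len = frames.shape
    if hop > frame_len:
        raise ValueError("hop must not exceed frame_len, or gaps would be left uncovered")
    total = (n_frames - 1) * hop + frame_len
    acc = np.zeros(total)              # sum of all values landing on each time step
    count = np.zeros(total)            # how many frames contributed to each time step
    for i, frame in enumerate(frames):
        start = i * hop
        acc[start:start + frame_len] += frame
        count[start:start + frame_len] += 1
    return acc / count                 # a single contribution is returned unchanged
```

Averaging rather than summing keeps the output amplitude independent of how many frames overlap at a given sample.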
7. The method of claim 1, wherein the adjusting the network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model comprises:
acquiring first time domain information of the pure audio and acquiring second time domain information of the predicted audio;
calculating a loss value of the preset learner according to the following formula:
[formula image FDA0003177844230000031, not reproduced in this text]
wherein loss refers to the loss value, y_t refers to the first time domain information, and the symbol shown in [formula image FDA0003177844230000032] refers to the second time domain information;
and adjusting the network parameters according to the loss value until the loss value no longer decreases, so as to obtain the noise reduction model.
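The stopping rule in claim 7 (keep adjusting until the loss stops decreasing) can be sketched as a generic plateau check. Because the claimed loss formula survives only as an image reference in this text, the mean squared error below is a stand-in assumption, not the patent's formula; the single-gain "model", the `patience` and `tol` parameters, and all function names are likewise hypothetical.

```python
import numpy as np

def mse_loss(clean, predicted):
    # Stand-in loss (assumption): the claimed formula is not recoverable from the text.
    return float(np.mean((clean - predicted) ** 2))

def train_until_plateau(step_fn, loss_fn, max_steps=1000, patience=3, tol=1e-9):
    """Call step_fn repeatedly; stop once the loss has not improved for 'patience' steps."""
    best = float("inf")
    stale = 0
    history = []
    for _ in range(max_steps):
        step_fn()
        loss = loss_fn()
        history.append(loss)
        if loss < best - tol:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return history

def fit_gain(noisy, clean, lr=0.1):
    """Toy model: predicted = g * noisy, trained by gradient descent on the stand-in loss."""
    state = {"g": 0.0}

    def step():
        pred = state["g"] * noisy
        grad = np.mean(2.0 * (pred - clean) * noisy)   # d(MSE)/dg
        state["g"] -= lr * grad

    def loss():
        return mse_loss(clean, state["g"] * noisy)

    history = train_until_plateau(step, loss)
    return state["g"], history
```

In a real setting `step_fn` would update all network parameters of the preset learner, and `loss_fn` would compare the pure audio with the predicted audio in the time domain.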
8. An audio noise reduction apparatus, comprising:
an acquisition unit, configured to acquire an audio sample and acquire a preset learner, wherein the audio sample comprises a noisy audio and a pure audio, and the preset learner comprises a frequency domain signal processing network and a time domain signal processing network;
a preprocessing unit, configured to preprocess the noisy audio to obtain frequency spectrum information;
a processing unit, configured to process the frequency spectrum information based on the frequency domain signal processing network to obtain a spectral mask feature corresponding to the frequency spectrum information;
the acquisition unit is further configured to acquire a time-frequency feature of the noisy audio according to the frequency spectrum information and the spectral mask feature;
the processing unit is further configured to process the time-frequency feature based on the time domain signal processing network to obtain a time-frequency mask feature;
a generating unit, configured to generate a predicted audio according to the time-frequency feature and the time-frequency mask feature;
an adjusting unit, configured to adjust the network parameters of the preset learner based on the predicted audio and the pure audio to obtain a noise reduction model;
the acquisition unit is further configured to acquire a request audio, and perform noise reduction processing on the request audio based on the noise reduction model to obtain a target audio.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing computer readable instructions; and
a processor executing computer readable instructions stored in the memory to implement the audio noise reduction method of any of claims 1 to 7.
10. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein computer-readable instructions that are executed by a processor in an electronic device to implement the audio noise reduction method of any of claims 1 to 7.
CN202110837937.4A 2021-07-23 2021-07-23 Audio noise reduction method, device, equipment and storage medium Active CN113470684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110837937.4A CN113470684B (en) 2021-07-23 2021-07-23 Audio noise reduction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110837937.4A CN113470684B (en) 2021-07-23 2021-07-23 Audio noise reduction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113470684A (en) 2021-10-01
CN113470684B (en) 2024-01-12

Family

ID=77882114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110837937.4A Active CN113470684B (en) 2021-07-23 2021-07-23 Audio noise reduction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113470684B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113921022A (en) * 2021-12-13 2022-01-11 北京世纪好未来教育科技有限公司 Audio signal separation method, device, storage medium and electronic equipment
WO2023102930A1 (en) * 2021-12-10 2023-06-15 清华大学深圳国际研究生院 Speech enhancement method, electronic device, program product, and storage medium
WO2023140488A1 (en) * 2022-01-20 2023-07-27 Samsung Electronics Co., Ltd. Bandwidth extension and speech enhancement of audio
WO2023226193A1 (en) * 2022-05-23 2023-11-30 神盾股份有限公司 Audio processing method and apparatus, and non-transitory computer-readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010239424A (en) * 2009-03-31 2010-10-21 Kddi Corp Method, device and program for suppressing noise
CN104240717A (en) * 2014-09-17 2014-12-24 河海大学常州校区 Voice enhancement method based on combination of sparse code and ideal binary system mask
US20150245137A1 (en) * 2014-02-27 2015-08-27 JVC Kenwood Corporation Audio signal processing device
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN112567458A (en) * 2018-08-16 2021-03-26 三菱电机株式会社 Audio signal processing system, audio signal processing method, and computer-readable storage medium
CN112652321A (en) * 2020-09-30 2021-04-13 北京清微智能科技有限公司 Voice noise reduction system and method based on deep learning phase friendlier

Also Published As

Publication number Publication date
CN113470684B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN113470684B (en) Audio noise reduction method, device, equipment and storage medium
US10621971B2 (en) Method and device for extracting speech feature based on artificial intelligence
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2020248393A1 (en) Speech synthesis method and system, terminal device, and readable storage medium
CN109766925B (en) Feature fusion method and device, electronic equipment and storage medium
CN111696029B (en) Virtual image video generation method, device, computer equipment and storage medium
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN111508508A (en) Super-resolution audio generation method and equipment
CN113470664B (en) Voice conversion method, device, equipment and storage medium
CN112927707A (en) Training method and device of voice enhancement model and voice enhancement method and device
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
WO2023226839A1 (en) Audio enhancement method and apparatus, and electronic device and readable storage medium
CN113450822B (en) Voice enhancement method, device, equipment and storage medium
CN111858891A (en) Question-answer library construction method and device, electronic equipment and storage medium
CN113903361A (en) Speech quality detection method, device, equipment and storage medium based on artificial intelligence
CN113268597A (en) Text classification method, device, equipment and storage medium
WO2021253722A1 (en) Medical image reconstruction technology method and apparatus, storage medium and electronic device
CN113470672B (en) Voice enhancement method, device, equipment and storage medium
CN113438374A (en) Intelligent outbound call processing method, device, equipment and storage medium
CN114842859A (en) Voice conversion method, system, terminal and storage medium based on IN and MI
CN113486680A (en) Text translation method, device, equipment and storage medium
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN113470686A (en) Voice enhancement method, device, equipment and storage medium
CN113421575B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant