CN117409792A - Reverberation suppression method, system, medium and equipment - Google Patents


Info

Publication number
CN117409792A
Authority
CN
China
Prior art keywords
training
reverberation
audio
spectrum
frequency domain
Prior art date
Legal status
Pending
Application number
CN202311350534.2A
Other languages
Chinese (zh)
Inventor
Li Qiang (李强)
Wang Lingzhi (王凌志)
Ye Dongxiang (叶东翔)
Zhu Yong (朱勇)
Current Assignee
Bairui Interconnection Integrated Circuit Shanghai Co ltd
Original Assignee
Bairui Interconnection Integrated Circuit Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Bairui Interconnection Integrated Circuit Shanghai Co ltd filed Critical Bairui Interconnection Integrated Circuit Shanghai Co ltd
Priority claimed from application CN202311350534.2A
Publication of CN117409792A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 — Changing voice quality, e.g. pitch or formants
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 — characterised by the type of extracted parameters
    • G10L 25/18 — the extracted parameters being spectral information of each sub-band
    • G10L 25/24 — the extracted parameters being the cepstrum
    • G10L 25/27 — characterised by the analysis technique
    • G10L 25/30 — using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The application discloses a reverberation suppression method, system, medium and equipment, belonging to the technical field of audio coding. The method comprises the following steps: when audio is encoded, obtaining the frequency domain spectral coefficients and the room reverberation time parameter corresponding to an audio frame; performing feature extraction on the audio frame according to the frequency domain spectral coefficients and the room reverberation time parameter to obtain feature parameters; inputting the feature parameters into a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame; multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectral coefficients to obtain a second unmixed amplitude spectrum; inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second subband gain corresponding to the audio frame; and obtaining unmixed spectral coefficients from the frequency domain spectral coefficients and the second subband gain, and encoding the audio frame according to the unmixed spectral coefficients. The application improves the suppression of audio reverberation and improves the user experience.

Description

Reverberation suppression method, system, medium and equipment
Technical Field
The present disclosure relates to the field of audio coding technologies, and in particular, to a method, a system, a medium, and an apparatus for suppressing reverberation.
Background
In Bluetooth calls, particularly indoors, if the speaker is far from the microphone, the sound collected by the microphone may contain reverberation in addition to speech and background noise, and this reverberation reduces the clarity and intelligibility of the sound. When the prior art performs reverberation suppression, the resulting sound quality is mediocre, the computational load is large, and the end-to-end delay increases, degrading the user experience.
Disclosure of Invention
Aiming at the problems of mediocre processing effect and heavy computation in reverberation suppression, the application provides a reverberation suppression method, system, medium and equipment.
In a first aspect, the present application proposes a reverberation suppression method comprising: when audio is encoded, obtaining frequency domain spectral coefficients and a room reverberation time parameter corresponding to an audio frame; performing feature extraction on the audio frame according to the frequency domain spectral coefficients and the room reverberation time parameter to obtain feature parameters; inputting the feature parameters into a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame; multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectral coefficients to obtain a second unmixed amplitude spectrum; inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second subband gain corresponding to the audio frame; obtaining unmixed spectral coefficients according to the frequency domain spectral coefficients and the second subband gain; and encoding the audio frame according to the unmixed spectral coefficients.
Optionally, when encoding the audio, acquiring the frequency domain spectrum coefficient and the room reverberation time parameter corresponding to the audio frame includes: performing time-frequency transformation on the audio frame to obtain a frequency domain spectrum coefficient; and calculating the audio frame to obtain the room reverberation time parameter.
Optionally, performing time-frequency transformation on the audio frame to obtain a frequency domain spectrum coefficient, including: framing the audio to obtain an audio frame; discrete cosine transform is carried out on the audio frame to obtain frequency domain spectrum coefficients.
Optionally, feature extraction is performed on the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain feature parameters, including: calculating according to the frequency domain spectrum coefficient to obtain an amplitude spectrum corresponding to the audio frame; and determining a characteristic context according to the magnitude spectrum and the room reverberation time parameter, and further obtaining the characteristic parameter.
Optionally, the training process of the first pre-training model includes: acquiring room acoustic impulse response data and pure voice data; mixing room acoustic impulse response data and pure voice data to obtain training reverberation voice; respectively extracting features of the training reverberation voice and the pure voice to obtain corresponding training feature parameters; and inputting the characteristic parameters for training into a first neural network for training to obtain a first pre-training model.
Optionally, the training process of the second pre-training model includes: acquiring a first unmixed magnitude spectrum for training and a second unmixed magnitude spectrum for training; and training the second neural network according to the first unmixed magnitude spectrum for training and the second unmixed magnitude spectrum for training to obtain a second pre-training model.
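Taken together, the steps of the first aspect amount to a per-frame encoder-side pipeline. The sketch below is illustrative only: `subband_gain_model` and `fusion_model` are hypothetical stand-ins for the two pre-trained networks, whose internals the claims do not specify at code level.

```python
def subband_gain_model(features):
    # Hypothetical stand-in for the first pre-training model: returns a
    # per-bin gain and a directly mapped dereverberated magnitude spectrum.
    gains = [min(1.0, 1.0 / (1.0 + abs(f))) for f in features]
    mapped = [abs(f) * g for f, g in zip(features, gains)]
    return gains, mapped

def fusion_model(spec1, spec2):
    # Hypothetical stand-in for the second pre-training model: fuses the two
    # dereverberated magnitude spectra into a final per-bin gain.
    return [0.5 * (a + b) / (abs(b) + 1e-9) for a, b in zip(spec1, spec2)]

def suppress_reverb_frame(spectral_coeffs, rt60_ms):
    """One encoder-side frame of the claimed method (illustrative only)."""
    magnitude = [abs(c) for c in spectral_coeffs]          # original amplitude spectrum
    gain1, dereverb1 = subband_gain_model(magnitude)       # first gain + first unmixed spectrum
    dereverb2 = [m * g for m, g in zip(magnitude, gain1)]  # second unmixed spectrum
    gain2 = fusion_model(dereverb1, dereverb2)             # second subband gain
    # Unmixed spectral coefficients handed on to the rest of the encoder.
    return [c * g for c, g in zip(spectral_coeffs, gain2)]
```

Note that the RT60 parameter enters the real method through feature extraction (the context length, described later); it is omitted from the gain math here.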
In a second aspect, the present application proposes a reverberation suppression system comprising: a module for acquiring frequency domain spectral coefficients and a room reverberation time parameter corresponding to an audio frame when encoding audio; a module for performing feature extraction on the audio frame according to the frequency domain spectral coefficients and the room reverberation time parameter to obtain feature parameters; a module for inputting the feature parameters into the first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio; a module for multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectral coefficients to obtain a second unmixed amplitude spectrum; and a module for inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second subband gain corresponding to the audio.
In a third aspect, the present application proposes a computer-readable storage medium storing a computer program, wherein the computer program is operative to perform the reverberation suppression method of the first aspect.
In a fourth aspect, the present application proposes a computer device comprising a processor and a memory, the memory storing a computer program, wherein the processor runs the computer program to perform the reverberation suppression method of the first aspect.
In the audio coding process, the audio is processed using the first pre-training model to obtain the corresponding first subband gain and first unmixed amplitude spectrum, so that the audio is treated both through subband gains and through spectrum mapping, improving the suppression of audio reverberation. The second pre-training model then combines the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into the final unmixed spectral coefficients of the audio, safeguarding the sound quality of the audio after reverberation suppression.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the embodiments are briefly described below with reference to the accompanying drawings, which illustrate some embodiments of the present application.
FIG. 1 is a schematic diagram of one example of a reverberation suppression method;
FIG. 2 is a schematic diagram of one embodiment of a reverberation suppression method of the present application;
FIG. 3 is a schematic diagram of one example of a training and reasoning process of the pre-training model of the present application;
FIG. 4 is a schematic diagram of one example of a reverberation suppression method of the present application;
FIG. 5 is a schematic diagram of another example of a reverberation suppression method of the present application;
FIG. 6 is a schematic diagram of one embodiment of a reverberation suppression system of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
The preferred embodiments of the present application will be described in detail below with reference to the drawings so that the advantages and features of the present application can be more easily understood by those skilled in the art, and the protection scope of the present application can thereby be defined more clearly and definitely.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
In Bluetooth calls, particularly indoors, if the speaker is far from the microphone, the sound collected by the microphone may contain reverberation in addition to speech and background noise, and this reverberation reduces the clarity and intelligibility of the sound. When the prior art performs reverberation suppression, the resulting sound quality is mediocre, the computational load is large, and the end-to-end delay increases, degrading the user experience.
For the problem of reverberation in audio, fig. 1 shows a schematic diagram of a prior-art reverberation suppression method.
In the example shown in fig. 1, most prior-art reverberation suppression methods are based on complex-cepstrum filtering. The basic principle is as follows: first, the input speech is framed and each audio frame is Fourier-transformed to obtain its complex spectrum; the complex cepstrum of the audio frame is then computed; low-pass filtering is applied to filter out the reverberant part of the complex cepstrum; an inverse complex-cepstrum operation restores a complex spectrum; and finally an inverse Fourier transform, followed by overlap-add with historical audio frames, yields the time-domain audio signal. This method is computationally heavy, and the overlap-add algorithm it relies on adds delay from the audio transmitting end to the receiving end, reducing the user experience.
Therefore, in view of the above problems, the present application proposes a reverberation suppression method, system, medium, and device. The method comprises the following steps: when audio is encoded, obtaining frequency domain spectral coefficients and a room reverberation time parameter corresponding to an audio frame; performing feature extraction on the audio frame according to the frequency domain spectral coefficients and the room reverberation time parameter to obtain feature parameters; inputting the feature parameters into a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame; multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectral coefficients to obtain a second unmixed amplitude spectrum; inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second subband gain corresponding to the audio frame; obtaining unmixed spectral coefficients according to the frequency domain spectral coefficients and the second subband gain; and encoding the audio frame according to the unmixed spectral coefficients.
In the audio coding process, the first pre-training model is used to process the audio and obtain the first subband gain and first unmixed amplitude spectrum corresponding to the encoded audio frame, so that reverberation is treated both through subband gains and through spectrum mapping, improving the suppression of audio reverberation. The second pre-training model then combines the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into the final unmixed spectral coefficients corresponding to the audio frame, and the subsequent encoding process uses these unmixed coefficients, ensuring the sound quality of the audio after reverberation suppression and improving the user experience.
The following describes the technical solution of the present application and how the technical solution of the present application solves the above technical problems in detail with specific embodiments. The specific embodiments described below may be combined with one another to form new embodiments. The same or similar ideas or processes described in one embodiment may not be repeated in certain other embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
FIG. 2 is a schematic diagram of one embodiment of a reverberation suppression method of the present application.
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S201, when encoding audio, obtaining a frequency domain spectral coefficient and a room reverberation time parameter corresponding to an audio frame.
In this embodiment, the reverberation suppression method of the present application mainly targets the scenario in which a user speaks into a microphone in a room, where the microphone collects background noise and reverberation in addition to the user's voice. In addition, the method builds on the existing audio encoding and decoding process and reuses the audio-frame parameters already generated during encoding and decoding for its calculations, thereby reducing the required computation.
Optionally, when encoding the audio, acquiring the frequency domain spectrum coefficient and the room reverberation time parameter corresponding to the audio frame includes: performing time-frequency transformation on the audio frame to obtain a frequency domain spectrum coefficient; and calculating the audio frame to obtain the room reverberation time parameter.
In this alternative embodiment, during the process of encoding audio, the frequency domain spectral coefficients corresponding to the audio are obtained, and the audio is analyzed to obtain its room reverberation time parameter for the subsequent reverberation suppression.
Optionally, performing time-frequency transformation on the audio frame to obtain a frequency domain spectrum coefficient, including: framing the audio to obtain an audio frame; discrete cosine transform is carried out on the audio frame to obtain frequency domain spectrum coefficients.
In this alternative embodiment, in the process of encoding audio, the encoded audio is first framed to obtain audio frames, and then discrete cosine transform is performed on the audio frames to obtain frequency domain spectral coefficients corresponding to the audio frames.
Specifically, the application describes its technical scheme taking the Bluetooth Low Energy LC3 codec as an example. In the LC3 audio encoder, the audio data to be encoded is framed according to the LC3 standard specification, and a discrete cosine transform is performed on each audio frame (current frame index m) to output spectral coefficients:
t(n) = x_m(Z − N_F + n), for n = 0 … 2·N_F − 1 − Z
t(2·N_F − Z + n) = 0, for n = 0 … Z − 1

where x_m(n) is the input time-domain PCM audio signal of the m-th frame, t(n) is the transform input buffer of length 2·N_F, and X_m(k) denotes the frequency domain spectral coefficients after the discrete cosine transform.
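As a rough illustration of the time-frequency step, the sketch below computes a plain DCT-II of one frame and its magnitude spectrum; the actual LC3 codec uses a windowed low-delay MDCT with the t(n) buffer above, as defined in the LC3 specification, so this is a simplified stand-in.

```python
import math

def dct(frame):
    """Plain DCT-II of one audio frame (illustrative only; LC3 itself
    specifies a windowed low-delay MDCT)."""
    n_f = len(frame)
    return [sum(frame[n] * math.cos(math.pi / n_f * (n + 0.5) * k)
                for n in range(n_f))
            for k in range(n_f)]

def magnitude_spectrum(coeffs):
    # |X_m(k)| as used for the feature extraction described later.
    return [abs(c) for c in coeffs]
```

For a constant (DC) frame, all energy lands in the k = 0 coefficient, which is a quick sanity check of the transform.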
Specifically, the room reverberation time parameter RT60 is a classical index of reverberation time: the time required for the sound energy to decay by 60 dB after the sound source is switched off. Mature methods for calculating RT60 exist in the prior art and are not repeated here.
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S202, performing feature extraction on an audio frame according to a frequency domain spectral coefficient and a room reverberation time parameter to obtain a feature parameter.
In this embodiment, the application processes the audio through neural network models, thereby suppressing the reverberation in the audio. In the technical scheme of the application, after the frequency domain spectral coefficients and the room reverberation time parameter corresponding to the audio are obtained, feature extraction is performed on them to obtain the corresponding feature parameters.
Optionally, feature extraction is performed on the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain feature parameters, including: calculating a corresponding magnitude spectrum according to the frequency domain spectrum coefficient; and determining a characteristic context according to the magnitude spectrum and the room reverberation time parameter, and further obtaining the characteristic parameter.
Specifically, when determining the characteristic parameters, the amplitude spectrum |X_m(k)| corresponding to the audio frame is first obtained from the time-frequency transform.
The feature-extraction context is then determined from the value of RT60:
|X_{m−N}(k)|, |X_{m−N+1}(k)|, …, |X_m(k)|
The larger the value of RT60, the stronger the correlation between adjacent frames and the longer the feature context that must be selected; in practical applications, the context length may be chosen according to typical RT60 values.
Table 1 lists empirical context lengths, derived from typical simulation results, for a frame length of 10 ms.

TABLE 1

RT60 (ms)        <100   [100,200)   [200,400)   [400,600)   [600,900)   >900
Length (frames)    8       12          15          20          24        30
Specifically, in determining the characteristic parameters, the value of RT60 is first determined, and the amplitude spectra of the corresponding number of frames are then selected as the feature parameters according to the range in which RT60 falls in Table 1. For example, if RT60 is in the range [200, 400), the current frame and the 14 frames before it are taken, and the amplitude spectra of these 15 frames form the feature vector. Likewise, if RT60 is 150, Table 1 gives a length of 12, so the current frame and the 11 frames before it, 12 frames in total, form the feature vector. In a specific implementation, the length of the feature vector input to the neural network is usually fixed, and shorter contexts are zero-padded: for instance, if the network's feature length is fixed at 15 but RT60 is 150 (context length 12), the vector is padded with zeros up to 15 for subsequent calculation.
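The Table 1 lookup and the zero-padding described above can be sketched as follows; the fixed input length of 15 mirrors the example in the text, and per-frame features are simplified to scalars for illustration:

```python
def context_length(rt60_ms):
    """Map RT60 (in ms) to the feature-context length per Table 1
    (10 ms frame length)."""
    for upper, length in ((100, 8), (200, 12), (400, 15),
                          (600, 20), (900, 24)):
        if rt60_ms < upper:
            return length
    return 30  # RT60 > 900 ms

def build_feature_vector(frame_features, rt60_ms, fixed_len=15):
    """Take the most recent `context_length` frames and zero-pad up to the
    network's fixed input length (frames simplified to scalar features)."""
    n = context_length(rt60_ms)
    frames = frame_features[-n:]
    return frames + [0.0] * (fixed_len - len(frames))
```

With RT60 = 150 ms this selects 12 frames and pads three zeros, matching the worked example in the text.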
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S203, inputting the characteristic parameters into a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame.
In this embodiment, the present application processes the obtained feature parameters through a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame.
Optionally, the training process of the first pre-training model includes: acquiring room acoustic impulse response data and pure voice data; mixing room acoustic impulse response data and pure voice data to obtain training reverberation voice; respectively extracting features of the training reverberation voice and the pure voice to obtain corresponding training feature parameters; and inputting the characteristic parameters for training into a first neural network for training to obtain a first pre-training model.
In this alternative embodiment, training data is first acquired while training of the first pre-training model is performed. Room acoustic impulse response data (Room Impulse Response, RIR) is first obtained, which may be obtained by adapting the typical room size and the desired supported reverberation time, wherein specific methods may employ conventional techniques in the art and are not described in detail herein. And mixing the acquired room acoustic impulse response data with the pure voice data to obtain the training reverberation voice. And then, respectively carrying out feature extraction on the pure voice data and the reverberation voice data to obtain corresponding feature parameters, and training the first neural network by utilizing the feature parameters to obtain a first pre-training model.
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S204 of multiplying the original amplitude spectrum corresponding to the frequency domain spectral coefficient by the first subband gain to obtain a second unmixed amplitude spectrum.
In this embodiment, after the first subband gain and the first unmixed amplitude spectrum are obtained by the first pre-training model, the second unmixed amplitude spectrum is obtained by multiplying the original amplitude spectrum corresponding to the frequency domain spectral coefficient by the first subband gain, and then the second unmixed amplitude spectrum is further processed by the second pre-training model.
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S205, inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second subband gain corresponding to the audio frame.
In this embodiment, the second pre-training model processes the first and second unmixed amplitude spectra to obtain the second subband gain corresponding to the audio, which is subsequently used to suppress the audio's reverberation and improve the reverberation suppression effect.
Optionally, the training process of the second pre-training model includes: acquiring a first unmixed magnitude spectrum for training and a second unmixed magnitude spectrum for training; and training the second neural network according to the first unmixed magnitude spectrum for training and the second unmixed magnitude spectrum for training to obtain a second pre-training model.
In this alternative embodiment, the second neural network is trained using the obtained first unmixed magnitude spectrum for training and the second unmixed magnitude spectrum for training, resulting in a corresponding second pre-training model.
Specifically, FIG. 3 is a schematic diagram of one example of the training and reasoning process of the pre-training model of the present application.
In the example shown in fig. 3, the acquired room acoustic impulse response data (Room Impulse Response, RIR) is mixed with the clean speech using convolution operation during the training of the model to obtain reverberant speech. And then respectively carrying out feature extraction on the reverberant voice and the pure voice to obtain reverberation features and pure features. And when the feature extraction is carried out on the pure voice, the subband gain and the spectrum mapping corresponding to the pure voice are obtained.
Specifically, in the convolution operation, assuming that the input speech is x(n) and the impulse response of the RIR is h(n), the reverberant speech y(n) is: y(n) = x(n) ∗ h(n), where ∗ denotes convolution.
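The mixing step y(n) = x(n) ∗ h(n) is ordinary discrete convolution; a minimal sketch (illustrative, not code from the patent):

```python
def convolve(x, h):
    """Discrete convolution of clean speech x with an RIR h, producing the
    reverberant training speech y(n) = (x * h)(n)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y
```

In practice an FFT-based convolution would be used for realistic RIR lengths; the direct form above just makes the operation explicit.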
When feature extraction is performed, the feature parameters of the reverberant speech are extracted to obtain the reverberant magnitude spectrum |Y_m(k)|, computed in the same way as |X_m(k)| above and not repeated here.
And when feature extraction is performed on the clean speech, clean feature 1 and clean feature 2 are obtained. Clean feature 1 is the ideal reference for the first target of the neural-network training output. Taking the configuration of a 10 ms frame length at a 16 kHz sampling rate as an example, the reverberant speech spectral coefficients Y_m(k) are first calculated; the ideal subband gain (i.e., clean-speech feature 1) is then:
In calculating clean feature 2, clean feature 2 is the ideal reference for the second target of the first neural network's training output: |X_m(k)|.
In the example shown in fig. 3, the first neural network and the second neural network are trained after obtaining the reverberation characteristics and the clean characteristics. Wherein the first neural network may be a cyclic neural network RNN, the reverberant voice feature is input to the first neural network, and the processed first neural network outputs a subband Gain rnn (k) And unmixed magnitude spectrum 2, calculating a loss function by taking pure voice features 1 and 2 as targets, adjusting weights and bias of the neural network, and determining the loss function:
unmixed magnitude spectrum 1: z m,dereverb- (k)|=|X m (k)|·Gain rnn (k) Wherein Gain is rnn (k) Is the subband gain of the first neural network output;
Unmixed magnitude spectrum 2: |Z_m,dereverb2(k)|, i.e., the unmixed magnitude spectrum directly output by the first neural network. The parameters of the first neural network are then adjusted, using the loss function, to improve the accuracy of its output, finally yielding the first pre-training model.
In prior-art FFT-based signal processing, after the speech signal is transformed to the frequency domain by the FFT, it can be split into a magnitude spectrum and a phase spectrum and processed separately. Since human hearing is insensitive to the phase spectrum, deep-learning approaches usually do not learn the phase: the learning target is the magnitude spectrum, obtained either by outputting the magnitude spectrum directly or by first generating a magnitude-spectrum gain and multiplying it with the input magnitude spectrum. The new magnitude spectrum and the unprocessed phase spectrum are then used in the inverse transform to generate the enhanced speech signal.
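The FFT-based magnitude-only enhancement just described can be sketched per frame as follows; the fixed per-bin gain stands in for a deep network's output, and the function name is illustrative:

```python
import numpy as np

def enhance_frame(frame: np.ndarray, gain: np.ndarray) -> np.ndarray:
    """Magnitude-only enhancement of one frame: transform with the FFT,
    scale the magnitude spectrum by a per-bin gain (here a stand-in for
    the network output), keep the unprocessed phase, and invert."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    new_mag = mag * gain                  # the phase is left untouched
    return np.fft.irfft(new_mag * np.exp(1j * phase), n=len(frame))

frame = np.sin(2 * np.pi * 4 * np.arange(64) / 64)
unity = np.ones(33)                       # rfft of 64 samples -> 33 bins
# With a unity gain the frame is reconstructed exactly:
print(np.allclose(enhance_frame(frame, unity), frame))   # True
```

This separability of magnitude and phase is exactly what the MDCT lacks, which motivates the second network discussed next.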
However, in MDCT-based signal processing the corresponding phase spectrum cannot be obtained, and the derived magnitude spectrum does not correspond strictly to the original signal, so the unmixed magnitude spectrum obtained by the neural network achieves only a limited dereverberation effect. For this reason, the present invention proposes a second neural network, which combines the first and second unmixed magnitude spectra and no longer trains toward a magnitude-only target, but toward an MDFT-domain target that contains both magnitude and phase. A model trained in this way can recover the magnitude and phase of the original signal well, guaranteeing the audio processing effect.
Specifically, the second neural network is a fully connected neural network (DNN). Its inputs are the first unmixed magnitude spectrum and the second unmixed magnitude spectrum, and its output is the final unmixed subband gain Gain_dnn,mdft(k). The training target, i.e., the ideal unmixed subband gain, is defined in the MDFT domain:
the loss function calculation method of the second neural network comprises the following steps:
wherein the method comprises the steps of
t(n)=x(Z-N F +n),for n=0…2`N F -1-Z
t(2N F -Z+n)=0,for n=0…Z-1
Through the above process, the first neural network and the second neural network are trained to obtain the corresponding first pre-training model and second pre-training model, which are then used to process the audio data.
In the embodiment shown in fig. 2, the reverberation suppression method of the present application includes a process S206 of obtaining a downmix spectral coefficient according to the frequency domain spectral coefficient and the second subband gain, and encoding an audio frame according to the downmix spectral coefficient.
In this embodiment, the second subband gain corresponding to the audio frame is obtained through the second pre-training model, and the frequency domain spectral coefficients of the audio frame are then processed using the second subband gain to obtain the downmix spectral coefficients corresponding to the audio frame, completing the audio dereverberation processing. After the downmix spectral coefficients of the audio frame are determined, the subsequent audio encoding and decoding process is performed using these spectral coefficients.
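A minimal sketch of applying a subband gain to the frame's frequency domain coefficients to obtain the downmix spectral coefficients; the uniform band layout, function name, and toy values are assumptions for illustration, not the patent's actual band configuration:

```python
import numpy as np

def apply_subband_gain(coeffs, gains, band_edges):
    """Expand per-subband gains to per-bin gains and scale the frame's
    frequency-domain (e.g. MDCT) coefficients, yielding the downmix
    spectral coefficients passed on to the encoder. The band layout
    here is an assumed uniform one."""
    per_bin = np.empty_like(coeffs)
    for g, lo, hi in zip(gains, band_edges[:-1], band_edges[1:]):
        per_bin[lo:hi] = g
    return coeffs * per_bin

coeffs = np.array([4.0, 4.0, 2.0, 2.0, 1.0, 1.0])
gains = np.array([0.25, 0.5, 1.0])   # one gain per subband
edges = [0, 2, 4, 6]
print(apply_subband_gain(coeffs, gains, edges))   # [1. 1. 1. 1. 1. 1.]
```

Because only per-coefficient scaling is involved, this step adds very little complexity on top of the codec's existing transform.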
Specifically, fig. 4 is a schematic diagram of one example of the reverberation suppression method of the present application.
As shown in fig. 4, during audio encoding, time-frequency transformation is performed on the input audio data to obtain the corresponding spectral coefficients. Meanwhile, the room reverberation time parameter RT60 is calculated and feature extraction is carried out to obtain the feature parameters. The feature parameters are input into the first pre-training model, trained from the first neural network, to obtain the first subband gain and the first unmixed magnitude spectrum. The original magnitude spectrum corresponding to the frequency domain spectral coefficients is then multiplied by the first subband gain to obtain the second unmixed magnitude spectrum. The first and second unmixed magnitude spectra are input into the second pre-training model, trained from the second neural network, to obtain the second subband gain. The second subband gain and the frequency domain spectral coefficients are then used to derive the downmix spectral coefficients, followed by the subsequent encoding process. In the audio decoding process, the inverse time-frequency transformation and overlap-add are performed, and the audio data is output at the decoding end.
Specifically, fig. 5 is a schematic diagram of another example of the reverberation suppression method of the present application.
In the example shown in fig. 5, the reverberation suppression method of the present application is performed in conjunction with the audio codec process. As shown in fig. 5, in the encoding process of the LC3 audio encoder, the low-delay modified discrete cosine transform is used to obtain the spectral coefficients, and the room reverberation time parameter RT60 is calculated so as to perform feature extraction on the audio to be encoded, obtaining the feature parameters. These are then processed by the first pre-training model trained from the first neural network and the second pre-training model trained from the second neural network, finally yielding the downmix spectral coefficients corresponding to the audio frame, after which the subsequent encoding process is carried out according to the downmix spectral coefficients. In the decoding process, the inverse low-delay modified discrete cosine transform is performed on the decoded result, followed by overlap-add. Compared with the prior art, placing the overlap-add process at the decoding end reduces the computational power consumption at the encoding end.
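The analysis/synthesis split described above, with overlap-add deferred to the decoder, can be sketched as follows. Plain windowed 50%-overlap framing stands in for the LD-MDCT to keep the sketch short (the actual codec transform is more involved); a sqrt-Hann window applied on both sides satisfies the Princen-Bradley condition, so interior samples reconstruct exactly:

```python
import numpy as np

def frames_ola(x: np.ndarray, frame_len: int = 8) -> np.ndarray:
    """Encoder side: windowed 50%-overlap framing.
    Decoder side: window again and overlap-add.
    sqrt-Hann on both sides => squared windows sum to 1 at 50% overlap,
    giving exact reconstruction away from the signal edges."""
    hop = frame_len // 2
    win = np.sqrt(np.hanning(frame_len + 1)[:-1])   # periodic sqrt-Hann
    frames = [win * x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, hop)]
    y = np.zeros(len(x))
    for idx, f in enumerate(frames):                # decoder: overlap-add
        start = idx * hop
        y[start:start + frame_len] += win * f
    return y

x = np.arange(32, dtype=float)
y = frames_ola(x)
# Interior samples (one hop in from each end) are reconstructed exactly.
print(np.allclose(y[4:28], x[4:28]))   # True
```

Deferring the second windowing and summation to the decoder, as in fig. 5, saves the encoder the overlap-add buffering and additions.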
In the audio encoding process, the audio is processed using the first pre-training model to obtain the corresponding first subband gain and first unmixed magnitude spectrum, so that the audio is processed in multiple aspects such as subband gain and spectrum mapping, improving the suppression of audio reverberation. The second pre-training model then combines the first and second unmixed magnitude spectra to obtain the final unmixed spectral coefficients of the audio, guaranteeing the sound quality of the audio after reverberation suppression.
FIG. 6 is a schematic diagram of one embodiment of a reverberation suppression system of the present application.
In the embodiment shown in fig. 6, the reverberation suppression system of the present application includes: a module 601 for acquiring frequency domain spectral coefficients and room reverberation time parameters corresponding to an audio frame when encoding the audio; a module 602 for extracting features of the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain a feature parameter; a module 603 for inputting the feature parameters into the first pre-training model to obtain a first subband gain and a first unmixed magnitude spectrum corresponding to the audio; a module 604 for multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectral coefficients to obtain a second unmixed amplitude spectrum; a module 605 for inputting the first and second downmix amplitude spectra to a second pre-training model, resulting in a second subband gain corresponding to the audio; a module 606 for obtaining a downmix spectral coefficient from the frequency domain spectral coefficient and the second subband gain; a module 607 for encoding the audio frame according to the downmix coefficients.
Optionally, when encoding the audio, acquiring the frequency domain spectrum coefficient and the room reverberation time parameter corresponding to the audio frame includes: performing time-frequency transformation on the audio frame to obtain a frequency domain spectrum coefficient; and testing the audio to obtain the corresponding room reverberation time parameter.
Optionally, performing time-frequency transformation on the audio frame to obtain the frequency domain spectrum coefficients includes: framing the audio data to be encoded to obtain the audio frame; and performing discrete cosine transform on the audio frame to obtain the frequency domain spectrum coefficients.
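The framing-plus-transform steps can be sketched as follows; a DCT-II stands in for the codec's low-delay MDCT variant, and the function name and frame length are illustrative:

```python
import numpy as np
from scipy.fft import dct

def audio_to_spectra(audio: np.ndarray, frame_len: int = 160) -> np.ndarray:
    """Split the audio into non-overlapping frames, then apply a discrete
    cosine transform (DCT-II here; the codec itself uses a low-delay MDCT
    variant) to obtain frequency-domain spectral coefficients per frame."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return dct(frames, type=2, norm="ortho", axis=-1)

# 16 kHz audio with 10 ms frames -> 160 samples per frame
audio = np.sin(2 * np.pi * 440 * np.arange(480) / 16000)
spectra = audio_to_spectra(audio)
print(spectra.shape)   # (3, 160)
```

Each row of `spectra` is one frame's frequency domain spectral coefficients, ready for the feature extraction step.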
Optionally, feature extraction is performed on the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain feature parameters, including: calculating a corresponding magnitude spectrum according to the frequency domain spectrum coefficient; and determining a characteristic context according to the magnitude spectrum and the room reverberation time parameter, and further obtaining the characteristic parameter.
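One plausible assembly of the feature context is sketched below. The context length, the stacking of past frames, and the appending of RT60 as a single scalar are all assumptions; the text only states that the context is determined from the magnitude spectrum and the room reverberation time parameter:

```python
import numpy as np

def build_features(mag_frames: np.ndarray, rt60: float, context: int = 4):
    """Hypothetical feature assembly: for each frame, stack the magnitude
    spectra of the current and the previous `context` frames (the feature
    context), then append the room reverberation time RT60 as one extra
    scalar. Frames before the start are padded by repeating frame 0."""
    n_frames, n_bins = mag_frames.shape
    feats = []
    for m in range(n_frames):
        ctx = [mag_frames[max(m - d, 0)] for d in range(context, -1, -1)]
        feats.append(np.concatenate(ctx + [[rt60]]))
    return np.array(feats)

mags = np.abs(np.random.default_rng(0).normal(size=(10, 64)))
feats = build_features(mags, rt60=0.5)
print(feats.shape)   # (10, 321)  = 5 frames x 64 bins + 1
```

Making RT60 part of the feature vector lets a single model adapt its dereverberation strength to rooms with different reverberation times.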
Optionally, the training process of the first pre-training model includes: acquiring room acoustic impulse response data and pure voice data; mixing room acoustic impulse response data and pure voice data to obtain training reverberation voice; respectively extracting features of the training reverberation voice and the pure voice to obtain corresponding training feature parameters; and inputting the characteristic parameters for training into a first neural network for training to obtain a first pre-training model.
Optionally, the training process of the second pre-training model includes: acquiring a first unmixed magnitude spectrum for training and a second unmixed magnitude spectrum for training; and training the second neural network according to the first unmixed magnitude spectrum for training and the second unmixed magnitude spectrum for training to obtain a second pre-training model.
In the audio encoding process, the audio is processed using the first pre-training model to obtain the corresponding first subband gain and first unmixed magnitude spectrum, so that the audio is processed in multiple aspects such as subband gain and spectrum mapping, improving the suppression of audio reverberation. The second pre-training model then combines the first and second unmixed magnitude spectra to obtain the final unmixed spectral coefficients of the audio, guaranteeing the sound quality of the audio after reverberation suppression.
In one embodiment of the present application, a computer-readable storage medium stores computer instructions, wherein the computer instructions are operative to perform the reverberation suppression method described in any one of the embodiments. The storage medium may reside directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the present application, a computer device includes a processor and a memory storing computer instructions, wherein: the processor operates the computer instructions to perform the reverberation suppression method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing is only examples of the present application, and is not intended to limit the scope of the patent application, and all equivalent structural changes made by the specification and drawings of the present application, or direct or indirect application in other related technical fields, are included in the scope of the patent protection of the present application.

Claims (10)

1. A method of suppressing reverberation, comprising:
when audio is encoded, frequency domain spectrum coefficients and room reverberation time parameters corresponding to an audio frame are obtained;
carrying out feature extraction on the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain a feature parameter;
inputting the characteristic parameters into a first pre-training model to obtain a first subband gain and a first unmixed amplitude spectrum corresponding to the audio frame;
multiplying the first subband gain by the original amplitude spectrum corresponding to the frequency domain spectrum coefficient to obtain a second unmixed amplitude spectrum;
inputting the first unmixed amplitude spectrum and the second unmixed amplitude spectrum into a second pre-training model to obtain a second sub-band gain corresponding to the audio frame;
and obtaining a downmix spectral coefficient according to the frequency domain spectral coefficient and the second subband gain, and encoding the audio frame according to the downmix spectral coefficient.
2. The method for suppressing reverberation as recited in claim 1, wherein the step of obtaining the frequency domain spectral coefficients and the room reverberation time parameters corresponding to the audio frame when encoding the audio comprises:
performing time-frequency transformation on the audio frame to obtain the frequency domain spectrum coefficient;
and calculating the audio frame to acquire the room reverberation time parameter.
3. The method of reverberation suppression according to claim 2, wherein the performing time-frequency transformation on the audio frame to obtain the frequency domain spectral coefficients includes:
framing the audio to obtain the audio frame;
and performing discrete cosine transform on the audio frame to obtain the frequency domain spectrum coefficient.
4. The method of claim 1, wherein the performing feature extraction on the audio frame according to the frequency domain spectral coefficient and the room reverberation time parameter to obtain feature parameters includes:
calculating an amplitude spectrum corresponding to the audio frame according to the frequency domain spectrum coefficient;
and determining a characteristic context according to the magnitude spectrum and the room reverberation time parameter, and further obtaining the characteristic parameter.
5. The method of reverberation suppression according to claim 1, wherein the training process of the first pre-training model comprises:
acquiring room acoustic impulse response data and pure voice data;
mixing the room acoustic impulse response data and the pure voice data to obtain training reverberation voice;
respectively extracting the characteristics of the training reverberation voice and the pure voice to obtain corresponding training characteristic parameters;
and inputting the training characteristic parameters into a first neural network for training to obtain the first pre-training model.
6. The method of reverberation suppression according to claim 1, wherein the training process of the second pre-training model comprises:
acquiring a first unmixed magnitude spectrum for training and a second unmixed magnitude spectrum for training;
and training a second neural network according to the first unmixed magnitude spectrum for training and the second unmixed magnitude spectrum for training to obtain the second pre-training model.
7. A reverberation suppression system comprising:
a module for acquiring frequency domain spectral coefficients and room reverberation time parameters corresponding to the audio frames when encoding the audio;
the module is used for carrying out characteristic extraction on the audio frame according to the frequency domain spectrum coefficient and the room reverberation time parameter to obtain a characteristic parameter;
the module is used for inputting the characteristic parameters into a first pre-training model to obtain a first sub-band gain and a first unmixed amplitude spectrum corresponding to the audio;
the module is used for multiplying the first sub-band gain by the original amplitude spectrum corresponding to the frequency domain spectrum coefficient to obtain a second unmixed amplitude spectrum;
a module for inputting the first and second downmix amplitude spectra to a second pre-training model, resulting in a second subband gain corresponding to the audio;
a module for obtaining a downmix spectral coefficient from the frequency domain spectral coefficient and the second subband gain;
and means for encoding the audio frame based on the downmix coefficients.
8. The reverberation suppression system of claim 7, wherein the training process of the first pre-training model comprises:
acquiring room acoustic impulse response data and pure voice data;
mixing the room acoustic impulse response data and the pure voice data to obtain training reverberation voice;
respectively extracting the characteristics of the training reverberation voice and the pure voice to obtain corresponding training characteristic parameters;
and inputting the training characteristic parameters into a first neural network for training to obtain the first pre-training model.
9. A computer readable storage medium storing a computer program, wherein the computer program is operative to perform the reverberation suppression method of any one of claims 1-6.
10. A computer device comprising a processor and a memory, the memory storing a computer program, wherein: a processor operates a computer program to perform the reverberation suppression method of any one of claims 1-6.
Publication number CN117409792A, published 2024-01-16.