CN111312273A - Reverberation elimination method, apparatus, computer device and storage medium - Google Patents


Info

Publication number
CN111312273A
Authority
CN
China
Prior art keywords
reverberation
voice signal
amplitude spectrum
voice
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010389871.2A
Other languages
Chinese (zh)
Inventor
李娟娟
朱睿
王燕南
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010389871.2A priority Critical patent/CN111312273A/en
Publication of CN111312273A publication Critical patent/CN111312273A/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to a reverberation cancellation method, apparatus, computer device and storage medium. The method comprises the following steps: acquiring a speech signal with reverberation; processing the speech signal with reverberation to obtain a first magnitude spectrum, and obtaining the with-reverberation speech feature of the signal based on the first magnitude spectrum; determining a corresponding time-frequency masking amount according to the with-reverberation speech feature, and performing reverberation cancellation on the first magnitude spectrum based on the time-frequency masking amount to obtain a second magnitude spectrum; and determining the dereverberated speech signal according to the second magnitude spectrum. By adopting the method, both the reverberation cancellation effect and the speech quality after cancellation can be improved.

Description

Reverberation elimination method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a reverberation cancellation method and apparatus, a computer device, and a storage medium.
Background
Indoor reverberation is a common phenomenon in daily life, but the reverberation component degrades the clarity and intelligibility of an audio signal, which in turn degrades performance in applications such as speech recognition, hearing aids and sound source localization. It is therefore necessary to cancel reverberation.
In the conventional art, the reverberation signal is deconvolved by an inverse filter that estimates the room impulse response. However, because the room impulse response is unknown, time-varying and long, tracking and estimating it is difficult, resulting in a poor reverberation cancellation effect. Current reverberation cancellation methods based on machine learning usually estimate the magnitude spectrum of the audio signal directly, but the magnitude spectrum has a large dynamic range and is difficult to learn, so the speech quality after dereverberation is poor.
Disclosure of Invention
In view of the above, it is necessary to provide a reverberation cancellation method, apparatus, computer device and storage medium capable of improving both the reverberation cancellation effect and the quality of the speech after dereverberation.
A method of reverberation cancellation, the method comprising:
acquiring a voice signal with reverberation;
processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice characteristic with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
A reverberation cancellation device, the device comprising:
the acquisition module is used for acquiring a voice signal with reverberation;
the processing module is used for processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice feature with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
the elimination module is used for determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and the determining module is used for determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a voice signal with reverberation;
processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice characteristic with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a voice signal with reverberation;
processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice characteristic with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
According to the above reverberation cancellation method, apparatus, computer device and storage medium, a speech signal with reverberation is acquired and processed to obtain a first magnitude spectrum; the with-reverberation speech feature of the signal is obtained based on the first magnitude spectrum; a corresponding time-frequency masking amount is determined according to that feature; reverberation cancellation is performed on the first magnitude spectrum based on the time-frequency masking amount to obtain a second magnitude spectrum; and the dereverberated speech signal is determined according to the second magnitude spectrum. By introducing the time-frequency masking amount and performing reverberation cancellation on the magnitude spectrum of the reverberant speech signal, reverberation can be removed effectively while speech damage is reduced, improving the speech quality after dereverberation.
Drawings
FIG. 1 is a flow diagram of a reverberation cancellation method in one embodiment;
FIG. 2 is a diagram illustrating a conversion of a reverberant speech signal from the time domain to the frequency domain in one embodiment;
FIG. 3 is a diagram illustrating the structure of a reverberation cancellation model in one embodiment;
FIG. 4 is a schematic flow chart illustrating the steps of training to obtain a reverberation cancellation model in one embodiment;
FIG. 5 is a flow diagram of a reverberation cancellation method in one embodiment;
FIG. 6 shows the result of quality testing of a reverberant speech signal in one embodiment;
FIG. 7 shows the result of a quality test of a speech signal after reverberation is removed in one embodiment;
FIG. 8 is a block diagram of the structure of the reverberation removal device in one embodiment;
FIG. 9 is a diagram showing an internal structure of a computer device in one embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions. It is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is one of the most promising human-computer interaction modes.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and taught learning.
The application relates to a voice technology and machine learning in artificial intelligence, in particular to a neural network model, which is applied to the technical field of reverberation elimination and is used for carrying out reverberation elimination on reverberation voice signals.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a reverberation cancellation method is provided. This embodiment is described by taking the application of the method to a terminal as an example; it should be understood that the method may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the two. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services. The terminal may be, but is not limited to, a smartphone, tablet computer, laptop, desktop computer, smart speaker or smart watch. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application. In this embodiment, the method includes the following steps S102 to S108.
S102, obtaining the voice signal with reverberation.
The speech signal with reverberation may be a speech signal from any scene that produces reverberation. For example, in an indoor training conference, the lecturer's speech can be regarded as a speech signal with reverberation: when a participant records the lecture for later study using a terminal with a recording function (such as a mobile phone or a voice recorder), the terminal obtains a speech signal with reverberation.
S104, processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice characteristic with reverberation of the voice signal with reverberation based on the first amplitude spectrum.
The first magnitude spectrum is a frequency-domain description of the speech signal with reverberation, representing the distribution of amplitude values over the frequency components that make up the signal. The with-reverberation speech feature characterizes the frequency-domain properties of the signal; the first magnitude spectrum may be used directly as this feature, or the corresponding log-magnitude spectrum or log-energy spectrum may be used instead.
Specifically, the terminal may perform time-frequency analysis on the speech signal with reverberation, converting it from the time domain to the frequency domain to obtain its spectrum, and then compute the corresponding magnitude spectrum from that spectrum.
And S106, determining a corresponding time-frequency masking amount according to the characteristics of the voice with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking amount to obtain a second amplitude spectrum.
Here, the time-frequency masking amount is defined as the ratio of the magnitude spectrum of the clean speech signal contained in the reverberant speech signal to the magnitude spectrum of the reverberant speech signal, expressed for example as Mask = X/Y, where X denotes the magnitude spectrum of the clean speech signal and Y denotes the first magnitude spectrum of the reverberant speech signal. Mask may be a sequence of values, each corresponding to an amplitude ratio. The second magnitude spectrum is the magnitude spectrum of the reverberant speech signal after reverberation has been removed.
After obtaining the with-reverberation speech feature, the terminal can predict the corresponding time-frequency masking amount from it using a trained prediction model. Given the first magnitude spectrum (Y) and the time-frequency masking amount (Mask), the terminal can then perform reverberation cancellation on the first magnitude spectrum: from the relation Mask = X/Y, the magnitude spectrum (X) of the clean speech signal contained in the reverberant signal is computed. Since the clean speech signal can be understood as the reverberant speech signal after reverberation removal, the computed magnitude spectrum X serves as the second magnitude spectrum.
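The Mask = X/Y relation and its use for dereverberation can be sketched on toy spectra as follows; the arrays here are synthetic stand-ins, not the model's actual predictions, and the epsilon guard is an implementation assumption:

```python
import numpy as np

# X: magnitude spectrum of the clean signal, Y: magnitude spectrum of the
# reverberant signal (Y >= X here, since reverberation adds energy).
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(4, 8))          # 4 frames, 8 frequency bins
Y = X + rng.uniform(0.0, 0.5, size=X.shape)     # reverberant magnitudes

eps = 1e-8                                      # guard against division by zero
mask = X / (Y + eps)                            # ideal time-frequency mask

# Dereverberation: element-wise product recovers the clean magnitudes.
X_hat = mask * Y
```

With an ideal mask the recovered spectrum `X_hat` matches `X`; in practice the mask is predicted and the recovery is approximate.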
And S108, determining the voice signal without reverberation according to the second amplitude spectrum.
The dereverberated speech signal is the signal obtained after reverberation cancellation is performed on the speech signal with reverberation. After obtaining the second magnitude spectrum, the terminal may apply an inverse time-frequency transform to convert it from the frequency domain back to the time domain, obtaining the speech signal corresponding to the second magnitude spectrum, i.e., the dereverberated speech signal.
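A minimal single-frame sketch of this frequency-to-time conversion, under the common assumption that the phase of the original reverberant frame is reused (the patent does not specify the phase handling): combine a magnitude spectrum with a phase spectrum and apply the inverse Fourier transform.

```python
import numpy as np

# One synthetic frame stands in for a frame of the reverberant signal.
frame = np.sin(2 * np.pi * 5 * np.arange(256) / 256)

spec = np.fft.rfft(frame)        # complex spectrum of the frame
phase = np.angle(spec)           # phase of the original frame, reused
magnitude = np.abs(spec)         # stand-in for the second magnitude spectrum

# Recombine magnitude and phase, then inverse-transform to the time domain.
recombined = magnitude * np.exp(1j * phase)
frame_out = np.fft.irfft(recombined, n=len(frame))
```

When the magnitude is left unchanged, as here, the frame is reconstructed exactly; with a dereverberated magnitude, the output is the processed frame.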
In the above reverberation cancellation method, a speech signal with reverberation is acquired and processed to obtain a first magnitude spectrum; the with-reverberation speech feature is obtained based on the first magnitude spectrum; a corresponding time-frequency masking amount is determined from that feature; reverberation cancellation is performed on the first magnitude spectrum based on the time-frequency masking amount to obtain a second magnitude spectrum; and the dereverberated speech signal is determined from the second magnitude spectrum. By introducing the time-frequency masking amount to cancel reverberation on the magnitude spectrum of the reverberant signal, reverberation can be removed effectively while speech damage is reduced, improving the speech quality after dereverberation.
In an embodiment, the step of processing the reverberant speech signal to obtain the first amplitude spectrum may specifically include the following steps: performing framing and windowing processing on the voice signal with reverberation to obtain a voice signal frame; and carrying out Fourier transform on each voice signal frame to obtain a corresponding Fourier transform coefficient, carrying out modulus taking on each Fourier transform coefficient to obtain a corresponding amplitude value, and obtaining a first amplitude spectrum based on each amplitude value.
As shown in fig. 2, a schematic illustration of the conversion of a reverberant speech signal from the time domain to the frequency domain in one embodiment is provided. The terminal obtains an initial reverberant speech signal as a time-domain signal, and performs framing and windowing on it according to a preset frame length and a preset frame shift to obtain a plurality of speech signal frames. The preset frame length and frame shift may be set according to actual conditions, for example a frame length of 20 ms and a frame shift of 10 ms. The terminal then performs a Fourier transform (FFT) on each speech frame to obtain its Fourier transform coefficients, and takes the modulus of each coefficient to obtain the corresponding amplitude value. In particular, a Fourier transform coefficient comprises a real part and an imaginary part, e.g. a + bi, and its modulus, the amplitude value, is √(a² + b²). The terminal then forms the first magnitude spectrum from the obtained amplitude values, thereby converting the initial reverberant speech signal from a time-domain signal into a frequency-domain signal.
In this embodiment, the reverberation-containing speech signal is subjected to framing, windowing and Fourier transform processing, converting the time-domain signal into a frequency-domain signal; the magnitude spectrum computed from the frequency-domain signal can describe the characteristics of the reverberation-containing speech signal more accurately.
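The framing, windowing and FFT pipeline above can be sketched as follows. The 16 kHz sample rate and Hann window are illustrative assumptions; the patent only fixes the 20 ms frame length and 10 ms frame shift.

```python
import numpy as np

# 16 kHz sample rate: 20 ms frames = 320 samples, 10 ms shift = 160 samples.
sr, frame_len, hop = 16000, 320, 160
signal = np.random.default_rng(1).standard_normal(sr)  # 1 s of toy audio

window = np.hanning(frame_len)
n_frames = 1 + (len(signal) - frame_len) // hop
frames = np.stack(
    [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
)

coeffs = np.fft.rfft(frames, axis=1)        # Fourier coefficients a + bi per frame
first_magnitude_spectrum = np.abs(coeffs)   # modulus |a+bi| = sqrt(a^2 + b^2)
```

Each row of `first_magnitude_spectrum` is the magnitude spectrum of one frame; stacked over frames, they form the first magnitude spectrum of the signal.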
In an embodiment, the step of obtaining the feature of the speech with reverberation of the speech signal with reverberation based on the first amplitude spectrum may specifically be to use the first amplitude spectrum as the feature of the speech with reverberation corresponding to the speech signal with reverberation.
The first amplitude spectrum can describe the frequency domain characteristics of the voice signal with reverberation, the corresponding amplitude spectrums of different voice signals with reverberation are different, and the first amplitude spectrum is used as the voice feature with reverberation to help to distinguish the reverberation component in the voice signal with reverberation.
In an embodiment, the step of obtaining the feature of the speech signal with reverberation based on the first amplitude spectrum may specifically be performing logarithm processing on each amplitude value in the first amplitude spectrum to obtain a corresponding logarithm amplitude spectrum, which is used as the feature of the speech signal with reverberation corresponding to the speech signal with reverberation.
In an embodiment, the step of obtaining the feature of the speech signal with reverberation based on the first amplitude spectrum may specifically be to perform square calculation and then logarithm processing on each amplitude value in the first amplitude spectrum to obtain a corresponding logarithm energy spectrum, which is used as the feature of the speech signal with reverberation corresponding to the speech signal with reverberation.
After obtaining the first magnitude spectrum, the terminal can take the logarithm of each amplitude value to obtain the corresponding log-magnitude spectrum, or square each amplitude value first and then take the logarithm to obtain the corresponding log-energy spectrum. Using the log-magnitude spectrum or log-energy spectrum as the with-reverberation speech feature compresses the scale of the feature data without changing its nature or relative relationships, making the feature data more stable and reducing errors caused by excessively large value differences during feature processing.
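The two feature options can be computed in a few lines; the small epsilon guarding against log(0) is an implementation assumption, not part of the patent text.

```python
import numpy as np

# Toy first magnitude spectrum: frames x frequency bins.
rng = np.random.default_rng(2)
magnitude = rng.uniform(0.01, 10.0, size=(5, 161))
eps = 1e-10

log_magnitude = np.log(magnitude + eps)   # log of each amplitude value
log_energy = np.log(magnitude**2 + eps)   # square first, then take the log
```

Note that the log-energy spectrum is essentially twice the log-magnitude spectrum, and the log compresses the dynamic range without changing the relative ordering of values.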
In an embodiment, the steps of determining the corresponding time-frequency masking amount from the with-reverberation speech feature and performing reverberation cancellation on the first magnitude spectrum based on that masking amount to obtain the second magnitude spectrum may be implemented by a trained reverberation cancellation model.
The input of the trained reverberation cancellation model is the with-reverberation speech feature of the reverberant speech signal, and the output is the magnitude spectrum obtained after dereverberation, i.e., the second magnitude spectrum. Specifically, the model has a two-part structure: the first part takes the with-reverberation speech feature as input, performs prediction on it, and outputs the corresponding time-frequency masking amount; the second part takes the time-frequency masking amount and the first magnitude spectrum as input, multiplies them, and outputs the second magnitude spectrum.
In this embodiment, the trained reverberation cancellation model predicts the time-frequency masking amount corresponding to the reverberant speech signal, and the predicted masking amount is then multiplied by the first magnitude spectrum to obtain the second magnitude spectrum of the dereverberated signal. Because the magnitude spectrum has a large dynamic range and is difficult to learn, predicting it directly tends to damage the recovered speech and yield low intelligibility and naturalness. The model used in this embodiment therefore does not predict the second magnitude spectrum directly; instead, it introduces the time-frequency masking amount as an intermediate quantity, which reduces speech damage in the dereverberated signal and improves intelligibility and speech quality.
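A minimal numpy sketch of this two-part structure follows: a single LSTM cell maps reverberant speech features to a mask in (0, 1), and a masking step multiplies it with the first magnitude spectrum. The class name, layer sizes and random (untrained) weights are all illustrative assumptions; the sketch shows the wiring, not the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyMaskLSTM:
    def __init__(self, n_bins, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # one weight matrix and bias per LSTM gate: input, forget, cell, output
        self.W = {g: rng.normal(0, 0.1, (hidden, n_bins + hidden)) for g in "ifco"}
        self.b = {g: np.zeros(hidden) for g in "ifco"}
        self.Wm = rng.normal(0, 0.1, (n_bins, hidden))  # hidden -> mask projection
        self.bm = np.zeros(n_bins)

    def predict_mask(self, features):                   # features: (frames, n_bins)
        hidden = self.b["i"].size
        h, c = np.zeros(hidden), np.zeros(hidden)
        masks = []
        for x in features:
            z = np.concatenate([x, h])
            i = sigmoid(self.W["i"] @ z + self.b["i"])
            f = sigmoid(self.W["f"] @ z + self.b["f"])
            g = np.tanh(self.W["c"] @ z + self.b["c"])
            o = sigmoid(self.W["o"] @ z + self.b["o"])
            c = f * c + i * g                           # cell state carries memory
            h = o * np.tanh(c)
            masks.append(sigmoid(self.Wm @ h + self.bm))  # mask values in (0, 1)
        return np.stack(masks)

model = TinyMaskLSTM(n_bins=161, hidden=32)
first_mag = np.abs(np.random.default_rng(1).standard_normal((10, 161)))
mask = model.predict_mask(np.log(first_mag + 1e-10))  # log features as input
second_mag = mask * first_mag                         # time-frequency masking layer
```

Because the sigmoid keeps the mask below 1, the masked output never exceeds the input magnitudes, matching the intuition that dereverberation only removes energy.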
In one embodiment, as shown in fig. 3, a schematic structural diagram of a reverberation cancellation model is provided, which includes a long-short term memory network layer and a time-frequency masking processing layer. As shown in fig. 4, the method for training to obtain the reverberation cancellation model may specifically include the following steps S402 to S410.
S402, obtaining the label amplitude spectrum of the voice signal with the reverberation sample and the clean sample voice signal corresponding to the voice signal with the reverberation sample.
The with-reverberation sample speech signals can be obtained by synthesizing clean sample speech signals with sample reverberation signals, and can cover most indoor reverberation scenes, such as conference rooms, classrooms, homes and halls. Because room size, wall and floor materials, and the distance between microphone and sound source all affect the degree of reverberation, a large number of sample reverberation signals can be obtained by simulating different degrees of reverberation in various indoor scenes, so that the synthesized with-reverberation samples cover most indoor reverberation scenes. The label magnitude spectrum of the clean sample speech signal serves as the training target of the reverberation cancellation model. Specifically, the terminal may obtain the clean sample speech signal and apply framing, windowing, Fourier transform and similar processing to obtain its magnitude spectrum, which is used as the label magnitude spectrum.
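One common way to simulate such samples, sketched here under the assumption of an exponentially decaying noise room impulse response (a simple stand-in for measured or simulated room responses), is to convolve the clean sample with the impulse response:

```python
import numpy as np

rng = np.random.default_rng(3)
sr = 16000
clean = rng.standard_normal(sr)                      # 1 s toy clean sample

t60 = 0.4                                            # reverberation time in seconds
rir_len = int(t60 * sr)
decay = np.exp(-6.9 * np.arange(rir_len) / rir_len)  # ~60 dB decay over t60
rir = decay * rng.standard_normal(rir_len)
rir[0] = 1.0                                         # direct-path component

# Convolve and truncate to the clean length: the with-reverberation sample.
reverberant = np.convolve(clean, rir)[: len(clean)]
```

Varying `t60` and the decay shape yields samples with different degrees of reverberation, approximating different rooms and microphone distances.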
S404, processing the voice signal with the reverberation sample to obtain a magnitude spectrum with the reverberation sample, and obtaining the voice characteristic with the reverberation sample of the voice signal with the reverberation sample based on the magnitude spectrum with the reverberation sample.
Specifically, the terminal may perform framing, windowing, fourier transform, and other processing on the voice signal with the reverberation sample to obtain a magnitude spectrum of the voice signal with the reverberation sample, and then square and logarithm each amplitude value in the magnitude spectrum of the voice signal with the reverberation sample to obtain a corresponding logarithm energy spectrum of the sample with the reverberation sample as a voice feature of the sample with the reverberation.
S406, predicting the voice characteristics of the reverberation sample by adopting the long-term and short-term memory network layer of the reverberation elimination model to be trained to obtain the corresponding predicted time-frequency masking amount.
Specifically, the with-reverberation sample speech features are input into a long short-term memory network layer (LSTM), which performs prediction on them and outputs the corresponding predicted time-frequency masking amount. The LSTM considers not only the current input but also retains a memory of previous content; the input gate, output gate, forget gate and cell state in its structure improve its sequence-modeling capability, allowing it to retain more information and effectively capture long-term dependencies in the data.
And S408, multiplying the predicted time-frequency masking amount by the amplitude spectrum of the reverberation sample to obtain a predicted amplitude spectrum by using the time-frequency masking processing layer of the reverberation elimination model to be trained.
Specifically, after the LSTM outputs the predicted time-frequency masking amount, the predicted time-frequency masking amount and the amplitude spectrum with the reverberation sample are input into the time-frequency masking processing layer together, the predicted time-frequency masking amount and the amplitude spectrum with the reverberation sample are multiplied element by element through the time-frequency masking processing layer, and finally, the predicted amplitude spectrum with the reverberation removed voice signal is output, namely the predicted amplitude spectrum.
S410, adjusting parameters of the reverberation elimination model to be trained based on the error between the prediction amplitude spectrum and the label amplitude spectrum, and obtaining the trained reverberation elimination model.
Specifically, when the training end condition is not met, the parameters of the LSTM in the reverberation elimination model to be trained are adjusted based on the error between the predicted magnitude spectrum and the label magnitude spectrum, and the process then returns to steps S406 to S408 for further iterations until the training end condition is met, yielding the trained reverberation elimination model. The training end condition may be that the number of iterations reaches a preset number, or that the loss value of the predicted amplitude spectrum relative to the label amplitude spectrum falls below a preset threshold.
In the model training process, the input of the model is the reverberant voice feature obtained from the amplitude spectrum of the voice signal with the reverberation sample, the output of the model is the predicted amplitude spectrum of that voice signal, and the training target is to reduce the difference between the predicted amplitude spectrum and the label amplitude spectrum of the corresponding clean sample voice signal. Specifically, the objective function based on the minimum mean square error can be defined as follows:
J(W, b) = (1/N) Σ_{n=1}^{N} ‖ y_n − x_n ‖²

wherein y_n and x_n respectively represent the predicted magnitude spectrum and the corresponding label magnitude spectrum of the n-th frame of the voice signal with the reverberation sample; y_n is obtained by multiplying the predicted time-frequency masking amount element-wise with the magnitude spectrum of the reverberation sample; and W and b are respectively the weights and biases of the model. When the model is optimized, the label magnitude spectrum (namely, the magnitude spectrum the voice signal with the reverberation sample is expected to reach after its reverberation is removed) participates in guiding the learning of the model. This training method directly pursues the approximation of the predicted magnitude spectrum to the label magnitude spectrum, so the reverberation elimination model obtained by it directly optimizes the dereverberated magnitude spectrum and performs better than a model that optimizes the dereverberated magnitude spectrum only indirectly.
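A minimal numeric sketch of training under this minimum-mean-square-error objective is shown below; it replaces the LSTM with a single sigmoid-linear mask predictor purely to keep the example short, so the model form, sizes, and learning rate are all assumptions rather than the patent's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
F, D = 40, 16                                   # frames and frequency bins (toy)
feats = rng.standard_normal((F, D))             # stand-in reverberant features
reverb_mag = np.abs(rng.standard_normal((F, D))) + 1.0
label_mag = rng.uniform(0.2, 0.9, (F, D)) * reverb_mag  # synthetic label spectra

W, b = np.zeros((D, D)), np.zeros(D)            # mask-predictor parameters

losses = []
for _ in range(200):
    m = 1.0 / (1.0 + np.exp(-(feats @ W + b)))  # predicted time-frequency mask
    y = m * reverb_mag                          # y_n: predicted magnitude spectrum
    e = y - label_mag                           # error against label magnitude x_n
    losses.append(np.mean(e ** 2))              # J(W, b), the MMSE objective
    grad_a = 2.0 * e * reverb_mag * m * (1.0 - m) / e.size  # backprop through mask
    W -= 0.5 * feats.T @ grad_a                 # gradient step on the weights
    b -= 0.5 * grad_a.sum(axis=0)               # gradient step on the bias

print(round(losses[0], 4), round(losses[-1], 4))  # the loss should shrink
```

Note that the gradient flows through the masking product (the implicit-masking idea): the loss is taken on the masked magnitude, not on the mask itself, so no oracle mask labels are needed.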
In this embodiment, time-frequency masking is fused into the LSTM by an implicit time-frequency masking method to form the reverberation elimination model, and the time-frequency masking processing layer serves as an intermediate layer that assists in predicting the amplitude spectrum of the voice signal after reverberation removal, thereby realizing reverberation elimination for the reverberant voice signal. In addition, the reverberation elimination model obtained by training in this embodiment has good generalization capability and removes reverberation well in most reverberation scenes. For example, in a conference room with glass walls, reflections from the glass make the reverberation of the audio actually collected there heavier, with a blurred spectrum and an obvious smearing (tailing) phenomenon; even in such a scene, the trained model can still effectively eliminate the reverberation.
In an embodiment, the step of obtaining the voice signal with the reverberation sample may specifically include: acquiring a clean sample voice signal and a simulated room impulse response signal; and convolving the simulated room impulse response signal with the clean sample voice signal to obtain the voice signal with the reverberation sample.
The clean sample voice signal is a voice signal recorded in an environment without reverberation or with negligible reverberation; specifically, the terminal can collect the voice signal in such an environment to obtain the clean sample voice signal. The simulated room impulse response signal is a signal that simulates a room impulse response; specifically, it can be obtained by using a simulation tool to generate room impulse responses under a variety of indoor reverberation scenes.
In this embodiment, convolving the simulated room impulse response signal with the clean sample voice signal yields voice signals with reverberation samples that cover most indoor reverberation scenes. Used as training data for the reverberation elimination model, they give the trained model good generalization capability.
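The data-synthesis step can be sketched as follows. The exponentially decaying noise impulse response is only a stand-in for the simulation tool mentioned above (such tools typically use image-source methods), so its shape and parameters are assumptions for illustration.

```python
import numpy as np

def synthetic_rir(length=2048, decay=0.9995, seed=0):
    """Toy room impulse response: a direct-path impulse plus exponentially
    decaying noise. A real setup would generate RIRs with a room simulator;
    this closed-form shape is only an assumption."""
    rng = np.random.default_rng(seed)
    rir = rng.standard_normal(length) * decay ** np.arange(length)
    rir[0] = 1.0                                   # direct-path impulse
    return rir / np.max(np.abs(rir))

clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)  # clean-speech stand-in
reverberant = np.convolve(clean, synthetic_rir())[: len(clean)]
print(reverberant.shape)  # (8000,): same length as the clean sample
```

Varying the RIR length, decay, and random seed yields training pairs for many different (simulated) rooms, which is what gives the trained model its coverage of reverberation scenes.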
In an embodiment, the step of determining the voice signal after the reverberation elimination according to the second magnitude spectrum may specifically be performing inverse Fourier transform on the second magnitude spectrum to obtain the voice signal after the reverberation elimination.
The second amplitude spectrum represents an amplitude spectrum obtained after dereverberation processing is carried out on the voice signal with reverberation, and the second amplitude spectrum is subjected to inverse Fourier transform to realize conversion from a frequency domain to a time domain, so that the voice signal of the time domain after the reverberation is eliminated is obtained.
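One way to realize this frequency-to-time conversion is a per-frame inverse Fourier transform followed by overlap-add. Since a magnitude spectrum alone carries no phase, the sketch below takes a phase matrix as input (in practice the reverberant signal's phase is commonly reused); that phase choice is an assumption, not something the description states.

```python
import numpy as np

def overlap_add_reconstruct(magnitude, phase, frame_len=512, hop=256):
    """Per-frame inverse Fourier transform plus overlap-add resynthesis."""
    spectrum = magnitude * np.exp(1j * phase)     # recombine magnitude and phase
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop: i * hop + frame_len] += frame  # overlap-add
    return out

# Round trip on a toy signal: analyze with a Hann window, then reconstruct.
x = np.sin(np.linspace(0, 20 * np.pi, 2048))
window = np.hanning(512)
frames = np.stack([x[i * 256: i * 256 + 512] * window for i in range(7)])
spec = np.fft.rfft(frames, axis=1)
y = overlap_add_reconstruct(np.abs(spec), np.angle(spec))
# Away from the signal edges, y closely matches x, because Hann windows at
# 50% overlap sum to approximately one.
```

In the actual method the dereverberated second magnitude spectrum would replace `np.abs(spec)` while the phase term is supplied separately.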
In one embodiment, as shown in fig. 5, a reverberation cancellation method is provided, which includes the following steps S501 to S507.
S501, obtaining the voice signal with reverberation.
S502, framing and windowing the voice signal with reverberation to obtain a voice signal frame.
S503, performing Fourier transform on each voice signal frame to obtain a corresponding Fourier transform coefficient, taking the modulus of each Fourier transform coefficient to obtain a corresponding amplitude value, and obtaining a first amplitude spectrum based on the amplitude values.
S504, squaring each amplitude value in the first amplitude spectrum and taking the logarithm to obtain the corresponding log energy spectrum, which is used as the reverberant voice feature corresponding to the voice signal with reverberation.
S505, predicting the reverberant voice features by using the long short-term memory network layer of the trained reverberation elimination model to obtain the corresponding time-frequency masking amount.
And S506, multiplying the time-frequency masking amount and the first amplitude spectrum by adopting a time-frequency masking processing layer of the trained reverberation elimination model to obtain a second amplitude spectrum.
And S507, performing inverse Fourier transform on the second amplitude spectrum to obtain the voice signal without reverberation.
For a detailed description of steps S501 to S507, reference may be made to the foregoing embodiments, which are not repeated here. In this embodiment, time-frequency masking is fused into the LSTM by an implicit time-frequency masking method to form the reverberation elimination model; the log energy spectrum of the reverberant voice signal is used as the model input, and the model outputs the amplitude spectrum of the voice signal after reverberation removal. This realizes reverberation elimination for the reverberant voice signal while reducing speech damage, improving both the intelligibility and the quality of the dereverberated speech.
Referring to fig. 6 and 7, fig. 6 shows the quality test result of the voice signal with reverberation (before dereverberation), and fig. 7 shows the quality test result of the voice signal after dereverberation. The quality test results comprise a plurality of objective indicators: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), segmental signal-to-noise ratio (SSNR), log spectral distance (LSD), and speech-to-reverberation modulation energy ratio (SRMR). A larger PESQ index indicates better overall voice quality; a larger STOI index indicates better speech intelligibility; a larger SSNR index indicates better elimination of interfering sound; a smaller LSD index indicates less speech damage; and a larger SRMR index indicates more reverberation eliminated. As can be seen by comparing the test results in fig. 6 and fig. 7, after the method performs reverberation elimination on voice signals at three different reverberation levels (mild, moderate, and severe), the objective indexes of the dereverberated voice signals are better than those before reverberation elimination, showing that reverberation is effectively eliminated, speech damage is reduced, and voice quality is improved.
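Among the listed indicators, the log spectral distance can be computed directly from two magnitude spectrograms. The sketch below uses one common definition of LSD (RMS difference of the log power spectra per frame, averaged over frames); the evaluation tool actually used may differ in detail, so this is illustrative rather than the exact metric behind figs. 6 and 7.

```python
import numpy as np

def log_spectral_distance(mag_a, mag_b, eps=1e-10):
    """LSD in dB between two magnitude spectrograms shaped (frames, bins)."""
    d = 10.0 * np.log10(mag_a ** 2 + eps) - 10.0 * np.log10(mag_b ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))

flat = np.ones((5, 10))
print(log_spectral_distance(flat, flat))      # 0.0 for identical spectra
print(log_spectral_distance(flat, 2 * flat))  # ~6.02 dB (a factor of 4 in power)
```

Because smaller LSD means less spectral distortion, comparing LSD before and after dereverberation quantifies how much speech damage the processing introduces.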
It should be understood that although the various steps in the flowcharts of fig. 1 and figs. 4-5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in fig. 1 and figs. 4-5 may include multiple sub-steps or stages, which are not necessarily completed at the same moment and may be performed at different times; nor is their order necessarily sequential, as they may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a reverberation cancellation device 800, which may be implemented as part of a computer device using software modules, hardware modules, or a combination of the two, and specifically comprises: an obtaining module 810, a processing module 820, an eliminating module 830, and a determining module 840, wherein:
an obtaining module 810, configured to obtain the voice signal with reverberation.
The processing module 820 is configured to process the voice signal with reverberation to obtain a first amplitude spectrum, and obtain a voice feature with reverberation of the voice signal with reverberation based on the first amplitude spectrum.
And the eliminating module 830 is configured to determine a corresponding time-frequency masking amount according to the feature of the voice with reverberation, and perform reverberation elimination on the first amplitude spectrum based on the time-frequency masking amount to obtain a second amplitude spectrum.
And the determining module 840 is configured to determine the speech signal without reverberation according to the second amplitude spectrum.
In an embodiment, when processing the reverberant voice signal to obtain the first amplitude spectrum, the processing module 820 is specifically configured to: perform framing and windowing on the voice signal with reverberation to obtain voice signal frames; and perform Fourier transform on each voice signal frame to obtain a corresponding Fourier transform coefficient, take the modulus of each Fourier transform coefficient to obtain a corresponding amplitude value, and obtain the first amplitude spectrum based on the amplitude values.
In one embodiment, when obtaining the reverberant voice feature of the voice signal with reverberation based on the first amplitude spectrum, the processing module 820 is specifically configured to use the first amplitude spectrum as the reverberant voice feature corresponding to the voice signal with reverberation.
In an embodiment, when obtaining the reverberant voice feature of the voice signal with reverberation based on the first amplitude spectrum, the processing module 820 is specifically configured to take the logarithm of each amplitude value in the first amplitude spectrum to obtain the corresponding log magnitude spectrum as the reverberant voice feature corresponding to the voice signal with reverberation.
In an embodiment, when obtaining the reverberant voice feature of the voice signal with reverberation based on the first amplitude spectrum, the processing module 820 is specifically configured to square each amplitude value in the first amplitude spectrum and take the logarithm to obtain the corresponding log energy spectrum as the reverberant voice feature corresponding to the voice signal with reverberation.
In an embodiment, when determining the corresponding time-frequency masking amount according to the reverberant voice feature and performing reverberation elimination on the first magnitude spectrum based on that amount to obtain the second magnitude spectrum, the elimination module 830 is specifically configured to: predict from the reverberant voice features using the trained reverberation elimination model to determine the corresponding time-frequency masking amount, and multiply the time-frequency masking amount by the first amplitude spectrum to obtain the second amplitude spectrum.
In one embodiment, the reverberation cancellation model includes a long-short term memory network layer and a time-frequency masking processing layer; the device also comprises a training module used for training to obtain a reverberation elimination model; the training module comprises: the device comprises an acquisition unit, a processing unit, a first prediction unit, a second prediction unit and an adjustment unit, wherein:
and the acquisition unit is used for acquiring the label amplitude spectrum of the voice signal with the reverberation sample and the clean sample voice signal corresponding to the voice signal with the reverberation sample.
And the processing unit is used for processing the voice signal with the reverberation sample to obtain a magnitude spectrum with the reverberation sample, and obtaining the voice characteristic with the reverberation sample of the voice signal with the reverberation sample based on the magnitude spectrum with the reverberation sample.
And the first prediction unit is used for predicting the voice characteristics of the reverberation sample by adopting a long-term and short-term memory network layer of the reverberation elimination model to be trained to obtain the corresponding prediction time-frequency masking amount.
And the second prediction unit is used for multiplying the prediction time-frequency masking amount by the amplitude spectrum of the reverberation sample to obtain the prediction amplitude spectrum by adopting the time-frequency masking processing layer of the reverberation elimination model to be trained.
And the adjusting unit is used for adjusting the parameters of the reverberation elimination model to be trained based on the error between the prediction amplitude spectrum and the label amplitude spectrum to obtain the trained reverberation elimination model.
In an embodiment, when obtaining the voice signal with the reverberation sample, the obtaining unit is specifically configured to: acquire a clean sample voice signal and a simulated room impulse response signal; and convolve the simulated room impulse response signal with the clean sample voice signal to obtain the voice signal with the reverberation sample.
In an embodiment, when determining the voice signal after reverberation elimination according to the second magnitude spectrum, the determining module 840 is specifically configured to perform inverse Fourier transform on the second magnitude spectrum to obtain the dereverberated voice signal.
For the specific definition of the reverberation cancellation device, reference may be made to the above definition of the reverberation cancellation method, which is not described herein again. The various modules in the reverberation canceling device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a reverberation cancellation method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a reverberation cancellation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the configurations shown in fig. 9 or 10 are merely block diagrams of some structures relevant to the present disclosure and do not limit the computer devices to which the present disclosure may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be understood that the terms "first", "second", etc. in the above-described embodiments are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination is not contradictory, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of reverberation cancellation, the method comprising:
acquiring a voice signal with reverberation;
processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice characteristic with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
2. The method of claim 1, wherein processing the reverberant speech signal into a first magnitude spectrum comprises:
performing framing and windowing processing on the voice signal with reverberation to obtain a voice signal frame;
performing Fourier transform on each voice signal frame to obtain a corresponding Fourier transform coefficient, performing modulus operation on each Fourier transform coefficient to obtain a corresponding amplitude value, and obtaining a first amplitude spectrum based on each amplitude value.
3. The method of claim 2, wherein obtaining the reverberant speech feature of the reverberant speech signal based on the first magnitude spectrum comprises any one of:
the first item: taking the first amplitude spectrum as a feature with reverberation voice corresponding to the voice signal with reverberation;
the second term is: carrying out logarithm processing on each amplitude value in the first amplitude spectrum to obtain a corresponding logarithm amplitude spectrum which is used as a feature of the voice with reverberation corresponding to the voice signal with reverberation;
the third item: and carrying out square calculation and logarithm processing on each amplitude value in the first amplitude spectrum to obtain a corresponding logarithm energy spectrum which is used as the feature of the voice with reverberation corresponding to the voice signal with reverberation.
4. The method of claim 1, wherein determining a corresponding amount of time-frequency masking according to the feature of the speech with reverberation, and performing reverberation cancellation on the first magnitude spectrum based on the amount of time-frequency masking to obtain a second magnitude spectrum comprises:
and predicting the characteristics of the voice with reverberation by adopting a trained reverberation elimination model, determining a corresponding time-frequency masking quantity, and multiplying the time-frequency masking quantity by the first amplitude spectrum to obtain a second amplitude spectrum.
5. The method of claim 4, wherein the reverberation cancellation model comprises a long-short term memory network layer and a time-frequency masking processing layer, and the method for training to obtain the reverberation cancellation model comprises:
acquiring a voice signal with reverberation sample and a label amplitude spectrum of a clean sample voice signal corresponding to the voice signal with reverberation sample;
processing the voice signal with the reverberation sample to obtain an amplitude spectrum with the reverberation sample, and obtaining the voice feature with the reverberation sample of the voice signal with the reverberation sample based on the amplitude spectrum with the reverberation sample;
predicting the voice characteristics of the reverberation sample by adopting a long-term and short-term memory network layer of a reverberation elimination model to be trained to obtain corresponding prediction time-frequency masking quantity;
multiplying the predicted time-frequency masking quantity by the amplitude spectrum of the reverberation sample to obtain a predicted amplitude spectrum by adopting a time-frequency masking processing layer of the reverberation elimination model to be trained;
and adjusting parameters of the reverberation elimination model to be trained based on the error between the predicted magnitude spectrum and the label magnitude spectrum to obtain a trained reverberation elimination model.
6. The method of claim 5, wherein obtaining a reverberant sample speech signal comprises:
acquiring a clean sample voice signal and a simulated room impulse response signal;
and convolving the simulated room impulse response signal with the clean sample voice signal to obtain a voice signal with a reverberation sample.
8. The method according to any of claims 1 to 6, wherein determining the voice signal after the reverberation is eliminated according to the second magnitude spectrum comprises:
and carrying out inverse Fourier transform on the second amplitude spectrum to obtain the voice signal without reverberation.
8. A reverberation cancellation device, characterized in that said device comprises:
the acquisition module is used for acquiring a voice signal with reverberation;
the processing module is used for processing the voice signal with reverberation to obtain a first amplitude spectrum, and obtaining the voice feature with reverberation of the voice signal with reverberation based on the first amplitude spectrum;
the elimination module is used for determining a corresponding time-frequency masking quantity according to the voice feature with reverberation, and eliminating the reverberation of the first amplitude spectrum based on the time-frequency masking quantity to obtain a second amplitude spectrum;
and the determining module is used for determining the voice signal after the reverberation is eliminated according to the second amplitude spectrum.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010389871.2A 2020-05-11 2020-05-11 Reverberation elimination method, apparatus, computer device and storage medium Pending CN111312273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010389871.2A CN111312273A (en) 2020-05-11 2020-05-11 Reverberation elimination method, apparatus, computer device and storage medium


Publications (1)

Publication Number Publication Date
CN111312273A true CN111312273A (en) 2020-06-19

Family

ID=71161140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010389871.2A Pending CN111312273A (en) 2020-05-11 2020-05-11 Reverberation elimination method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN111312273A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768796A (en) * 2020-07-14 2020-10-13 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN112542177A (en) * 2020-11-04 2021-03-23 北京百度网讯科技有限公司 Signal enhancement method, device and storage medium
CN112687284A (en) * 2020-12-21 2021-04-20 中国科学院声学研究所 Reverberation suppression method and device for reverberation voice
CN112767960A (en) * 2021-02-05 2021-05-07 云从科技集团股份有限公司 Audio noise reduction method, system, device and medium
CN113643714A (en) * 2021-10-14 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method and device, computer equipment and storage medium
CN114283827A (en) * 2021-08-19 2022-04-05 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium
CN114299977A (en) * 2021-11-30 2022-04-08 北京百度网讯科技有限公司 Method and device for processing reverberation voice, electronic equipment and storage medium
CN114446316A (en) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 Audio separation method, and training method, device and equipment of audio separation model
WO2023016018A1 (en) * 2021-08-12 2023-02-16 北京荣耀终端有限公司 Voice processing method and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523999A (en) * 2018-12-26 2019-03-26 中国科学院声学研究所 A kind of front end processing method and system promoting far field speech recognition
US10283140B1 (en) * 2018-01-12 2019-05-07 Alibaba Group Holding Limited Enhancing audio signals using sub-band deep neural networks
CN110136733A (en) * 2018-02-02 2019-08-16 腾讯科技(深圳)有限公司 A kind of the solution reverberation method and device of audio signal
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN110503972A (en) * 2019-08-26 2019-11-26 北京大学深圳研究生院 Sound enhancement method, system, computer equipment and storage medium
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN110838303A (en) * 2019-11-05 2020-02-25 南京大学 Voice sound source positioning method using microphone array


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising", IEEE/ACM Transactions on Audio, Speech, and Language Processing *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768796B (en) * 2020-07-14 2024-05-03 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN111768796A (en) * 2020-07-14 2020-10-13 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN112542177A (en) * 2020-11-04 2021-03-23 北京百度网讯科技有限公司 Signal enhancement method, device and storage medium
CN112687284A (en) * 2020-12-21 2021-04-20 中国科学院声学研究所 Reverberation suppression method and device for reverberation voice
CN112687284B (en) * 2020-12-21 2022-05-24 中国科学院声学研究所 Reverberation suppression method and device for reverberation voice
CN112767960A (en) * 2021-02-05 2021-05-07 云从科技集团股份有限公司 Audio noise reduction method, system, device and medium
CN112767960B (en) * 2021-02-05 2022-04-26 云从科技集团股份有限公司 Audio noise reduction method, system, device and medium
WO2023016018A1 (en) * 2021-08-12 2023-02-16 北京荣耀终端有限公司 Voice processing method and electronic device
CN114283827A (en) * 2021-08-19 2022-04-05 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium
CN114283827B (en) * 2021-08-19 2024-03-29 腾讯科技(深圳)有限公司 Audio dereverberation method, device, equipment and storage medium
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN113643714A (en) * 2021-10-14 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
WO2023061258A1 (en) * 2021-10-14 2023-04-20 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method and apparatus, storage medium and computer program
CN114299977A (en) * 2021-11-30 2022-04-08 北京百度网讯科技有限公司 Method and device for processing reverberation voice, electronic equipment and storage medium
CN114299977B (en) * 2021-11-30 2022-11-25 北京百度网讯科技有限公司 Method and device for processing reverberation voice, electronic equipment and storage medium
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method and device, computer equipment and storage medium
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN114446316A (en) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 Audio separation method, and training method, device and equipment of audio separation model

Similar Documents

Publication Publication Date Title
CN111312273A (en) Reverberation elimination method, apparatus, computer device and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110288978B (en) Speech recognition model training method and device
KR20180127171A (en) Apparatus and method for student-teacher transfer learning network using knowledge bridge
CN109597022A (en) Method, apparatus and device for computing the sound bearing angle and locating target audio
Sun et al. Monaural source separation in complex domain with long short-term memory neural network
CN110459241B (en) Method and system for extracting voice features
Wang et al. Recurrent deep stacking networks for supervised speech separation
CN111048061B (en) Method, device and equipment for obtaining step length of echo cancellation filter
Sarroff Complex neural networks for audio
CN113205820B (en) Method for generating voice coder for voice event detection
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN115410589A (en) Attention generative adversarial network speech enhancement method based on joint perceptual loss
Deng et al. Online Blind Reverberation Time Estimation Using CRNNs.
Defraene et al. Real-time perception-based clipping of audio signals using convex optimization
CN112786028B (en) Acoustic model processing method, apparatus, device and readable storage medium
Chan et al. Speech enhancement strategy for speech recognition microcontroller under noisy environments
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
CN111354374A (en) Voice processing method, model training method and electronic equipment
CN117373468A (en) Far-field voice enhancement processing method, far-field voice enhancement processing device, computer equipment and storage medium
Astapov et al. Directional Clustering with Polyharmonic Phase Estimation for Enhanced Speaker Localization
CN112687284B (en) Reverberation suppression method and device for reverberation voice
CN114302301A (en) Frequency response correction method and related product
Hu et al. Learnable spectral dimension compression mapping for full-band speech enhancement
CN113766405A (en) Method and device for detecting noise of loudspeaker, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40024658
Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20200619