CN111353258A - Echo suppression method based on coding and decoding neural network, audio device and equipment - Google Patents
- Publication number: CN111353258A (application CN202010084801.6A)
- Authority
- CN
- China
- Prior art keywords: spectrogram, network, anechoic, echo, neural network
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models
- G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses an echo suppression method based on a coding and decoding neural network, together with an audio apparatus and a device. Echo-free audio data are acquired, and simulated echoes are added to them to generate echoed audio data; the anechoic audio data are converted into an anechoic spectrogram and its spectral features are extracted; the echoed audio data are converted into an echoed spectrogram and its spectral features are extracted; a coding and decoding neural network model is built from the spectral features of the anechoic and echoed spectrograms; the model is trained to obtain a prediction model for the de-echoed spectrogram; audio data to be processed are converted into a spectrogram and fed into the prediction model, which outputs a de-echoed spectrogram that is converted back into de-echoed audio data. The method can remove echoes in different scenes, can fit the linear behavior of echoes in different environments, and is highly adaptable.
Description
Technical Field
The invention relates to the technical field of audio and network communication, and in particular to an echo suppression method, apparatus, device, and storage medium based on a coding and decoding neural network.
Background
In current speech recognition systems, suppressing echo in speech is a critical step, and the quality of echo processing directly affects recognition accuracy. This matters especially for teleconferencing, such as video and telephone conferences: the signal transmitted from the far end (the far-end signal) is played through the speaker of a telephone in the conference room, the sound waves are reflected by walls, floors, and ceilings, and both the reflected and the direct waves are picked up by the telephone's microphone and transmitted back as part of the near-end signal, forming an echo. Sound leaking from the handset can likewise be picked up by the microphone and sent to the far end as part of the near-end signal, which also forms an echo. These echoes carry a delay; when the delay exceeds 50 milliseconds with little or no attenuation, the far-end user perceives a distinct echo. Because this echo arises along an acoustic path, it is called acoustic echo.
Traditional echo suppression relies on linear filtering: the gradual-change law of the acoustic echo is identified and a corresponding linear filtering algorithm is derived, which then reduces and cancels the echo in the sound. However, such a method can only model the echo of a fixed scene; when the scene varies and the echo is complex, the suppression effect degrades, so the method lacks robustness.
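For contrast with the learned approach, the adaptive linear filtering described above can be sketched as a normalized-LMS (NLMS) update. The function below is an illustrative stand-in, not the patent's algorithm; all names and parameters are assumptions.

```python
def nlms_step(weights, x_buf, mic_sample, mu=0.5, eps=1e-8):
    """One normalized-LMS update of a linear echo canceller.

    `x_buf` holds the most recent far-end samples, `mic_sample` is the
    microphone signal (near end plus echo); the estimated echo is
    subtracted and the filter taps adapt toward the true echo path.
    """
    echo_est = sum(w * x for w, x in zip(weights, x_buf))
    err = mic_sample - echo_est                      # residual after cancellation
    norm = sum(x * x for x in x_buf) + eps           # normalization term
    weights = [w + mu * err * x / norm for w, x in zip(weights, x_buf)]
    return err, weights

# A one-tap echo path with gain 0.5: the tap converges toward 0.5.
w = [0.0]
for _ in range(40):
    _, w = nlms_step(w, [1.0], 0.5 * 1.0)
```

Such a filter fits one fixed echo path; the patent's point is precisely that it fails when the scene, and hence the path, changes.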
Disclosure of Invention
To solve the above problems, the present invention provides an echo suppression method based on a coding and decoding neural network, together with an audio apparatus, a device, and a storage medium, which can remove echoes in different scenes, fit the linear behavior of echoes in different environments, and offer high adaptability.
To achieve this object, the invention adopts the following technical scheme:
an echo suppression method based on a coding and decoding neural network comprises the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
Preferably, the anechoic audio data is obtained by recording audio associated with a human voice in a quiet anechoic environment.
Preferably, the anechoic audio data is converted into an anechoic spectrogram through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the anechoic spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as label data of the coding and decoding neural network model.
Preferably, simulated echoes are added to the audio data through an indoor acoustics processing library in Python: the library simulates the spatial dimensions of the environment, and the magnitude of the echoes and their delay time are set to generate the corresponding echoed audio data.
Further, the audio data with echo is converted into a spectrogram with echo through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the echoed spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as input data of the coding and decoding neural network model.
Preferably, the encoding and decoding neural network model further comprises an encoding network and a decoding network, wherein:
the coding network is used for coding the frequency spectrum characteristics of the anechoic spectrogram or the frequency spectrum characteristics of the echoic spectrogram and converting the characteristic dimension of the spectrogram into a two-dimensional matrix structure;
the decoding network is used for decoding the encoded data and converting the decoded data into characteristic dimensions of a spectrogram.
Preferably, the encoding network further comprises:
a preprocessing network (prenet) comprising three fully connected layers, each followed by dropout;
a CBHG network comprising a convolution filter bank, a high-dimensional (highway) network, and a bidirectional GRU network; the convolution filter bank convolves the output of the preprocessing network; after the filter bank, a residual network merges the convolution result with the Embedding-reduced feature data; the merged data is input into the highway network, and finally the output of the highway network is taken as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an attention network for calculating attention probability values over the forward and backward RNN results.
Preferably, the decoding network further comprises:
a preprocessing network with the same structure as that of the encoding network, which applies a nonlinear transformation to the attention probability values through a fully connected network;
an attention recurrent neural network comprising one layer of 256 gated recurrent units (GRUs), which takes the output of the preprocessing network and the attention probability values as input and passes its output through the GRU units to the decoder recurrent neural network;
a decoder recurrent neural network comprising two layers of residual gated recurrent units, each layer likewise containing 256 GRUs; the output dimension of the residual GRUs is then converted into the spectrogram dimension, either by a dimension conversion or through an RNN.
Preferably, when the prediction model is trained, a loss value is further calculated between the anechoic spectrogram and the predicted de-echoed spectrogram, and the model is iteratively trained on this loss with the Adam optimization algorithm in TensorFlow to obtain the fitted prediction model.
Another object of the present invention is to provide an audio apparatus comprising a memory and a processor, the memory storing instructions; by executing the instructions stored in the memory, the processor causes the audio apparatus to implement the echo suppression method according to any of the above embodiments.
It is a further object of the present invention to provide an apparatus comprising the audio device.
It is still another object of the present invention to provide a computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and the instructions are executed by an echo suppressing apparatus to enable the echo suppressing apparatus to implement any one of the above echo suppressing methods.
The invention has the beneficial effects that:
(1) the echo suppression method based on the coding and decoding neural network extracts echo-free audio data, adds simulated echoes to them, feeds both into the coding and decoding network model, and trains it to obtain a prediction model; audio data to be processed are then input into the prediction model, which outputs de-echoed audio data; the method can remove echoes in different scenes, fit the linear behavior of echoes in different environments, and is highly adaptable;
(2) the coding and decoding neural network adopts residual connections: during encoding, the convolution result is merged with the Embedding-reduced feature data through a residual network and the merged data are input into the high-dimensional (highway) network, so deeper network layers can be used without causing vanishing gradients;
(3) the prediction model trained by the coding and decoding neural network generalizes well and can remove echoes in a variety of scenes;
(4) when the prediction model is trained, a loss value is calculated between the anechoic spectrogram and the predicted de-echoed spectrogram and iteratively minimized with the Adam optimization algorithm in TensorFlow, making the fitted prediction model more accurate;
(5) applied to speech recognition, the echo suppression method reduces the influence of echo on recognition, improving accuracy and reducing the probability of misrecognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a simplified flowchart of the prediction-model training process of the echo suppression method based on a coding and decoding neural network according to the present invention;
FIG. 2 is a simplified flowchart of the prediction process of the echo suppression method based on a coding and decoding neural network according to the present invention;
FIG. 3 is a network structure diagram of the coding and decoding neural network model of the echo suppression method according to the present invention;
FIG. 4 is a structure diagram of the encoding network of the coding and decoding neural network model according to the present invention;
FIG. 5 is a structure diagram of the decoding network of the coding and decoding neural network model according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more clear and obvious, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1 and fig. 2, an echo suppression method based on a codec neural network of the present invention includes the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
In this embodiment, the anechoic audio data is obtained by recording human-voice audio in a quiet, echo-free environment. Specifically, real recordings of human speech are collected; the content of the recording does not matter, but the voice and its content must be clear. After 10,000 to 20,000 qualifying recordings are collected, the original anechoic audio data serve as the training label data. The simulated echo is produced by passing the anechoic audio data through a linear filter that realizes the echo effect; a linear echo function is involved here, and the filtered anechoic audio yields the echoed audio data.
The anechoic audio data is further converted into an anechoic spectrogram by an FFT algorithm through an audio processing library in Python (the Librosa library), and the converted spectrogram data adopts a two-dimensional matrix structure; the spectral features of the anechoic spectrogram are converted into dimensionality-reduced feature data through Embedding processing and used as the label data of the coding and decoding neural network model. Further, in this embodiment, a simulated echo is added to the anechoic audio data through an indoor acoustics processing library in Python (the pyroomacoustics library), which simulates the spatial dimensions of the environment; the magnitude of the echo and its delay time are set to generate the corresponding echoed audio data.
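As an illustration of the simulation step, the room-acoustics library's effect can be approximated by adding delayed, attenuated copies of the dry signal. The helper below is a hypothetical stand-in for the pyroomacoustics simulation, with the reflection list playing the role of the simulated room; all names and values are assumptions.

```python
def add_simulated_echo(dry, reflections, sample_rate=16000):
    """Add echo to `dry` (a list of float samples).

    `reflections` is a list of (delay_seconds, gain) pairs standing in
    for the simulated room: the gain is the echo magnitude, the delay
    is the echo's delay time.
    """
    max_delay = max((int(d * sample_rate) for d, _ in reflections), default=0)
    out = list(dry) + [0.0] * max_delay      # direct path plus room for the tail
    for delay_s, gain in reflections:
        delay = int(delay_s * sample_rate)
        for i, s in enumerate(dry):
            out[i + delay] += gain * s       # delayed, attenuated copy
    return out

# One reflection, 1-sample delay at 1 kHz, gain 0.5.
echoed = add_simulated_echo([1.0, 0.5, 0.25], [(0.001, 0.5)], sample_rate=1000)
```

The real library additionally models room geometry and absorption; this sketch only shows the structure of the resulting echoed signal.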
Further, the audio data with echo is converted into a spectrogram with echo through an FFT algorithm by an audio processing library (Librosa library) in Python, and the converted spectrogram data adopts a two-dimensional matrix structure; and converting the spectrum characteristics of the echoed spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as input data of the coding and decoding neural network model.
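The Librosa conversion described above can be sketched with NumPy alone: frame the waveform, window each frame, and take the magnitude of an FFT per frame. Parameter values here (`n_fft`, `hop`) are illustrative, not the patent's settings; note that the first dimension of the result varies with the audio duration while the second is fixed.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window and an FFT per frame, and
    return the magnitude spectrogram as a 2-D array (frames x bins)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude of each bin
    return np.array(frames)

# One second of a 440 Hz tone at 16 kHz.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = magnitude_spectrogram(sig)
# spec.shape[0] depends on duration; spec.shape[1] == n_fft // 2 + 1 is fixed.
```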
As shown in fig. 3, the codec neural network model further includes an encoding network (Encoder network) and a decoding network (Decoder network), wherein:
the encoding network (Encoder network) is used for encoding the spectral features of the anechoic spectrogram or the spectral features of the echoic spectrogram and converting the characteristic dimensions of the spectrogram into a two-dimensional matrix structure;
the decoding network (Decoder network) is used for decoding the encoded data and converting the decoded data into the characteristic dimension of the spectrogram.
In the matrix structure, the first dimension differs between audio clips and is determined by the audio duration; the second dimension is fixed, and the larger it is, the more information it describes. Dimensionality reduction shrinks this second dimension, which then describes abstract speech features.
As shown in fig. 4, the encoding network (Encoder network) further includes:
a pre-processing network (prenet network) comprising three fully connected layers, each followed by a dropout operation;
the CBHG network comprises a convolution filter bank, a high-dimensional (highway) network, and a bidirectional GRU network; the convolution filter bank convolves the output of the preprocessing network; after the filter bank, a residual network merges (by sequence addition) the convolution result with the Embedding-reduced feature data; the merged data is input into the highway network (equivalent to a dimension-raising step that restores the reduced low-dimensional features to high-dimensional spectrogram features); finally, the output of the highway network is taken as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an Attention network (Attention network) for calculating an Attention probability value of the RNN result in the forward direction and the RNN result in the reverse direction.
The convolution filter bank consists, in order, of a convolution layer, a pooling layer, and two one-dimensional convolution layers. The first one-dimensional convolution layer has a filter (kernel) size of 3 and a stride of 1 and uses the ReLU activation function; the second has the same kernel size and stride as the first but uses no activation function.
Each layer of the high-dimensional (highway) network works as follows: the input is fed simultaneously into two single-layer fully connected networks whose activation functions are ReLU and sigmoid respectively. Denoting the input as input, the ReLU output as output1, and the sigmoid output as output2, the output of the highway layer is output = output1 * output2 + input * (1 - output2). In this example, 4 highway layers are used.
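The highway-layer formula above can be checked with a scalar sketch; the weights below are illustrative, not trained values.

```python
import math

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer on a scalar feature: H(x)*T(x) + x*(1 - T(x)),
    where H uses ReLU (output1) and the gate T uses sigmoid (output2)."""
    h = max(0.0, w_h * x + b_h)                    # ReLU branch  -> output1
    t = 1.0 / (1.0 + math.exp(-(w_t * x + b_t)))   # sigmoid gate -> output2
    return h * t + x * (1.0 - t)
```

With the gate at 0.5 and H(x) = x the layer is an identity; with the gate driven toward 0 the input passes straight through, which is what lets the network stack many layers without vanishing gradients.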
The attention network wraps the created GRU recurrent network with attention through the AttentionWrapper function provided by TensorFlow; the attention mechanism computes the attention required by the different cells of the GRU network and expresses it as probabilities in the range 0-1.
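The probabilities the attention mechanism produces can be illustrated with a plain dot-product-plus-softmax sketch. The patent wraps a GRU network with TensorFlow's AttentionWrapper; the standalone function below only shows why the values land in (0, 1) and sum to 1, and its inputs are hypothetical.

```python
import math

def attention_probabilities(query, keys):
    """Dot-product scores between one decoder query and the encoder
    states, normalised with softmax so each weight lies in (0, 1)
    and the weights sum to 1."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                  # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Query most similar to the first encoder state gets the largest weight.
probs = attention_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```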
Once the attended Encoder result is obtained through the attention network, the Decoder network is built; it decodes the Encoder's information into the required result information.
As shown in fig. 5, the decoding network (Decoder network) further includes:
a pre-processing network (prenet network) having the same structure as the pre-processing network of the encoding network, and performing a non-linear transformation process on the attention probability value through a full-connection network;
an attention recurrent neural network (Attention-RNN network) comprising one layer of 256 gated recurrent units (GRU units), which takes the output of the preprocessing network and the attention probability values as input and passes its output through the GRU units to the Decoder-RNN network;
a decoder recurrent neural network (Decoder-RNN network) comprising two layers of residual gated recurrent units (residual GRU units), each layer likewise containing 256 GRU units; the decoder input at the first step is a zero matrix, and the output of step t is used as the input of step t+1; the output dimension of the residual GRU units is then converted into the spectrogram dimension, either by a dimension conversion or through an RNN.
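The zero-matrix start and the feeding of step t's output into step t+1 can be sketched as a generic autoregressive loop; `step_fn` below is a dummy stand-in for the residual-GRU stack, not the patent's network.

```python
def greedy_decode(step_fn, state, frame_dim, n_steps):
    """Autoregressive decoding: the first input is a zero frame, and
    the output of step t becomes the input of step t+1."""
    frame = [0.0] * frame_dim          # step-1 input is a zero matrix/frame
    outputs = []
    for _ in range(n_steps):
        frame, state = step_fn(frame, state)   # output of step t...
        outputs.append(frame)                  # ...is fed back as input of t+1
    return outputs

def dummy_step(frame, state):
    """Illustrative cell: add the (constant) state to every value."""
    return [f + state for f in frame], state

frames = greedy_decode(dummy_step, 1.0, 2, 3)
```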
Preferably, when the prediction model is trained, a loss value is calculated between the anechoic spectrogram and the predicted de-echoed spectrogram: the two spectrograms are matrices of the same shape, and the loss is the Euclidean distance computed from the differences of the values at each corresponding position. The loss is iteratively minimized with the Adam optimization algorithm in TensorFlow to obtain a fitted prediction model; this embodiment trains for 20,000 batches, with 64 audio spectrograms fed in per batch.
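The loss described above, a Euclidean distance over same-shaped spectrogram matrices, reduces to the following sketch:

```python
import math

def euclidean_loss(predicted, target):
    """Sum the squared differences at each corresponding position of
    two same-shaped matrices and take the square root."""
    total = 0.0
    for row_p, row_t in zip(predicted, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
    return math.sqrt(total)

# A 3-4-5 example: differences of 4 and 3 give a distance of 5.
loss = euclidean_loss([[0.0, 3.0]], [[4.0, 0.0]])
```

In the patent this scalar is the quantity that Adam minimizes over the 20,000 training batches.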
Once the trained and fitted coding and decoding neural network model is obtained, it is taken out as the prediction model. The echoed audio data to be processed is first converted into spectrogram data through the Librosa library and fed into the prediction model, which outputs de-echoed spectrogram data; finally, an open-source vocoder converts the spectrogram back into audio data, which is stored as an audio file.
Another object of the present invention is to provide an audio apparatus, which includes a memory and a processor, wherein the memory stores instructions, and the processor causes an echo suppressing apparatus to implement the echo suppressing method according to any of the above embodiments by executing the instructions stored in the memory. The audio devices include, but are not limited to: power amplifier, audio amplifier, multimedia console, digital sound console, audio sampling card, synthesizer, middle and high frequency audio amplifier, microphone, sound card in the PC, earphone, etc.
It is a further object of the present invention to provide an apparatus comprising the audio device. The device may be a general purpose computer device or a special purpose computer device, which may be a server. In a specific implementation, the device may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, or the like.
One embodiment of the present application also provides a computer storage medium having instructions stored therein; the echo suppression device (which may be a computer device, such as a server) executes the instructions, for example, a processor in the computer device executes the instructions, so that the echo suppression device implements the echo suppression method according to the above embodiment.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the apparatus and device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points. Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
While the above shows and describes preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; these are not to be construed as excluding other embodiments, and the invention may be used in various other combinations, modifications, and environments, and may be changed within the scope of the inventive concept described herein, in accordance with the above teachings or the skill and knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the scope of the appended claims.
Claims (12)
1. An echo suppression method based on a coding and decoding neural network is characterized by comprising the following steps:
obtaining anechoic audio data, and adding simulated echo to the anechoic audio data to generate audio data with echo;
converting the anechoic audio data into an anechoic spectrogram, and extracting the frequency spectrum characteristics of the anechoic spectrogram; converting the echo audio data into an echo spectrogram, and extracting the spectral characteristics of the echo spectrogram;
building a coding and decoding neural network model according to the spectral characteristics of the anechoic spectrogram and the spectral characteristics of the echoic spectrogram;
taking the spectral features of the anechoic spectrogram as label data and the spectral features of the echoed spectrogram as input data, and training the coding and decoding neural network model to obtain a prediction model for the de-echoed spectrogram;
and converting the audio data to be processed into a spectrogram, inputting the spectrogram into the prediction model, outputting a de-echoed spectrogram, and converting the de-echoed spectrogram to obtain de-echoed audio data.
2. The echo suppression method based on the codec neural network of claim 1, wherein: the anechoic audio data is obtained by recording audio associated with a human voice in a quiet anechoic environment.
3. The echo suppression method based on the codec neural network of claim 2, wherein: converting the anechoic audio data into an anechoic spectrogram through an FFT algorithm by an audio processing library in Python; and converting the spectrum characteristics of the anechoic spectrogram into characteristic data after dimensionality reduction through Embedding processing, and taking the characteristic data as label data of the coding and decoding neural network model.
4. The echo suppression method based on the codec neural network of claim 1, wherein: simulated echoes are added to the audio data through an indoor audio array processing algorithm library in Python, which simulates the spatial dimensions of the environment; the echo level and the echo duration are set to generate the corresponding echoed audio data.
5. The echo suppression method based on the codec neural network of claim 4, wherein: the echoed audio data is converted into an echoed spectrogram using an FFT algorithm from an audio processing library in Python; and the spectral features of the echoed spectrogram are converted into dimensionality-reduced feature data through Embedding processing, the feature data serving as the input data of the coding and decoding neural network model.
6. The echo suppression method based on the codec neural network according to claim 3 or 5, wherein the codec neural network model further includes an encoding network and a decoding network, wherein:
the encoding network is used for encoding the spectral features of the anechoic spectrogram or the spectral features of the echoed spectrogram, converting the feature dimensions of the spectrogram into a two-dimensional matrix structure;
the decoding network is used for decoding the encoded data and converting the decoded data back into the feature dimensions of a spectrogram.
7. The echo suppression method based on the codec neural network of claim 6, wherein the encoding network further comprises:
a preprocessing network comprising three fully connected layers, each fully connected layer applying dropout processing;
a CBHG network comprising a bank of convolutional filters, a highway network, and a bidirectional GRU network, wherein the convolutional filters convolve the output of the preprocessing network; the convolution result is merged, through a residual connection after the convolutional filters, with the Embedding-processed feature data; the merged data is input into the highway network; and the output of the highway network finally serves as the input of the bidirectional GRU network, which outputs a forward RNN result and a backward RNN result;
an attention network for calculating an attention probability value from the forward RNN result and the backward RNN result.
8. The echo suppression method based on the codec neural network of claim 7, wherein the decoding network further comprises:
a preprocessing network having the same structure as that of the encoding network, which applies a nonlinear transformation to the attention probability value through a fully connected network;
an attention recurrent neural network comprising one layer of 256 gated recurrent units (GRUs), wherein the output of the preprocessing network and the attention probability value are taken as input and, after passing through the GRU layer, are output to the decoding recurrent neural network;
the decoding recurrent neural network comprising two layers of residual gated recurrent units, each layer likewise containing 256 gated recurrent units, wherein the output dimension of the residual gated recurrent units is further converted into the dimensions of a spectrogram by dimension conversion or through an RNN network.
9. The echo suppression method based on the codec neural network of claim 1, wherein: when the prediction model is trained, a loss value is further calculated between the predicted spectrogram and the anechoic spectrogram, and the loss value is iteratively minimized through the Adam optimization algorithm in Tensorflow, so as to obtain the fitted prediction model.
10. An audio apparatus, characterized by: comprising a memory having instructions stored therein and a processor that, upon executing the instructions stored in the memory, causes the audio apparatus to implement the echo suppression method according to any one of claims 1 to 9.
11. An apparatus, characterized by: comprising the audio device of claim 10.
12. A computer-readable storage medium having stored therein instructions, execution of which by an echo suppression device causes the echo suppression device to implement the echo suppression method of any one of claims 1 to 9.
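The FFT-based spectrogram conversion in claim 3 can be sketched briefly. The claim only names "an audio processing library in Python" (librosa's `stft` would be one common choice, but that is an assumption); the minimal NumPy version below shows the framed-FFT step itself:

```python
import numpy as np

def stft_spectrogram(audio, frame_len=512, hop=256):
    """Convert a 1-D audio signal into a magnitude spectrogram via framed FFTs."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins (frame_len // 2 + 1 of them)
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting (frames × frequency-bins) matrix is what the claims then compress via Embedding processing into the network's input or label features.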
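Claim 4's echo simulation goes through "an indoor audio array processing algorithm library in Python" (pyroomacoustics is a plausible candidate, but the claim does not name one). To stay self-contained, the sketch below instead builds a toy room impulse response by hand, with a configurable echo level (`decay`) and echo delay, and convolves it with the clean signal:

```python
import numpy as np

def add_simulated_echo(audio, sr=16000, delay_s=0.05, decay=0.4, taps=3):
    """Add echo by convolving with a synthetic room impulse response:
    a unit impulse (direct path) followed by `taps` attenuated reflections."""
    delay = int(delay_s * sr)
    rir = np.zeros(delay * taps + 1)
    rir[0] = 1.0                       # direct path
    for k in range(1, taps + 1):
        rir[k * delay] = decay ** k    # each reflection is quieter
    return np.convolve(audio, rir)[:len(audio)]

clean = np.random.default_rng(0).standard_normal(16000)
echoed = add_simulated_echo(clean)
print(echoed.shape)
```

A real room simulator would derive the impulse response from the room geometry (the "spatial dimensions" the claim mentions) rather than from hand-placed taps.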
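The "attention probability value" of claim 7 is, in Tacotron-style encoder-decoder models (the non-patent citation here), a softmax over alignment scores between a decoder state and the encoder's forward/backward RNN outputs. A minimal sketch with dot-product scoring (the patent does not specify the scoring function, so dot-product is an assumption):

```python
import numpy as np

def attention_probabilities(query, encoder_outputs):
    """Softmax-normalized alignment between one decoder query vector
    and each encoder time step (dot-product scoring assumed)."""
    scores = encoder_outputs @ query   # one score per encoder step, shape (T,)
    scores -= scores.max()             # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()     # probabilities over the T steps

rng = np.random.default_rng(1)
enc = rng.standard_normal((5, 8))      # T=5 encoder steps, 8-dim states
q = rng.standard_normal(8)             # one decoder query
p = attention_probabilities(q, enc)
print(p.sum())
```

The decoder of claim 8 then consumes these probabilities (after the preprocessing network's nonlinear transformation) as part of its GRU input.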
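Claim 9's training step, loss between the predicted and anechoic spectrograms minimized with Adam, would in practice use `tf.keras.optimizers.Adam` in TensorFlow as the claim states. The NumPy sketch below reimplements the Adam update rule (Kingma & Ba) on a toy mean-squared-error objective so the mechanics are visible; the 4×4 "spectrograms" are stand-ins, not the patent's actual features:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction for the warm-up steps
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy fit: drive a "predicted spectrogram" toward the anechoic label by MSE.
label = np.ones((4, 4))                # stand-in for anechoic label features
pred = np.zeros((4, 4))                # stand-in for the model's output
m = v = np.zeros_like(pred)
for t in range(1, 2001):
    grad = 2.0 * (pred - label) / pred.size   # d(MSE)/d(pred)
    pred, m, v = adam_step(pred, grad, m, v, t, lr=0.01)
loss = float(np.mean((pred - label) ** 2))
print(loss)
```

In the patent's setting the gradient would flow through the whole encoder-decoder rather than directly into the output, but the per-parameter update is the same.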
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010084801.6A CN111353258A (en) | 2020-02-10 | 2020-02-10 | Echo suppression method based on coding and decoding neural network, audio device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111353258A true CN111353258A (en) | 2020-06-30 |
Family
ID=71192219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010084801.6A Pending CN111353258A (en) | 2020-02-10 | 2020-02-10 | Echo suppression method based on coding and decoding neural network, audio device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111353258A (en) |
2020
- 2020-02-10 CN CN202010084801.6A patent/CN111353258A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110476206A (en) * | 2017-03-29 | 2019-11-19 | 谷歌有限责任公司 | End-to-end Text To Speech conversion |
CN107481728A (en) * | 2017-09-29 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | Background sound removing method, device and terminal device |
KR20190085883A (en) * | 2018-01-11 | 2019-07-19 | 네오사피엔스 주식회사 | Method and apparatus for voice translation using a multilingual text-to-speech synthesis model |
CN109841206A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | A kind of echo cancel method based on deep learning |
CN110148419A (en) * | 2019-04-25 | 2019-08-20 | 南京邮电大学 | Speech separating method based on deep learning |
CN110647980A (en) * | 2019-09-18 | 2020-01-03 | 成都理工大学 | Time sequence prediction method based on GRU neural network |
Non-Patent Citations (1)
Title |
---|
Yuxuan Wang et al., "Tacotron: Towards End-to-End Speech Synthesis", 18th Annual Conference of the International Speech Communication Association (Interspeech 2017) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023044961A1 (en) * | 2021-09-23 | 2023-03-30 | 武汉大学 | Multi-feature fusion echo cancellation method and system based on self-attention transform network |
CN116402906A (en) * | 2023-06-08 | 2023-07-07 | 四川省医学科学院·四川省人民医院 | Signal grade coding method and system based on kidney echo |
CN116402906B (en) * | 2023-06-08 | 2023-08-11 | 四川省医学科学院·四川省人民医院 | Signal grade coding method and system based on kidney echo |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
JP7258182B2 (en) | Speech processing method, device, electronic device and computer program | |
CN111755019A (en) | System and method for acoustic echo cancellation using deep multitask recurrent neural networks | |
KR102276951B1 (en) | Output method for artificial intelligence speakers based on emotional values calculated from voice and face | |
CN111353258A (en) | Echo suppression method based on coding and decoding neural network, audio device and equipment | |
CN111710344A (en) | Signal processing method, device, equipment and computer readable storage medium | |
CN114792524B (en) | Audio data processing method, apparatus, program product, computer device and medium | |
CN113870874A (en) | Multi-feature fusion echo cancellation method and system based on self-attention transformation network | |
CN114338623B (en) | Audio processing method, device, equipment and medium | |
CN113299306B (en) | Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium | |
JP2023548707A (en) | Speech enhancement methods, devices, equipment and computer programs | |
CN116737895A (en) | Data processing method and related equipment | |
CN116030823A (en) | Voice signal processing method and device, computer equipment and storage medium | |
CN112687284B (en) | Reverberation suppression method and device for reverberation voice | |
JP2024502287A (en) | Speech enhancement method, speech enhancement device, electronic device, and computer program | |
Romaniuk et al. | Efficient low-latency speech enhancement with mobile audio streaming networks | |
CN112750452A (en) | Voice processing method, device and system, intelligent terminal and electronic equipment | |
CN115798497B (en) | Time delay estimation system and device | |
CN113763978B (en) | Voice signal processing method, device, electronic equipment and storage medium | |
CN117351983B (en) | Transformer-based voice noise reduction method and system | |
CN115565543B (en) | Single-channel voice echo cancellation method and device based on deep neural network | |
WO2024055751A1 (en) | Audio data processing method and apparatus, device, storage medium, and program product | |
CN116248229B (en) | Packet loss compensation method for real-time voice communication | |
US20240096332A1 (en) | Audio signal processing method, audio signal processing apparatus, computer device and storage medium | |
CN110880325B (en) | Identity recognition method and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200630 ||