CN115116471A - Audio signal processing method and apparatus, training method, device, and medium - Google Patents

Audio signal processing method and apparatus, training method, device, and medium

Info

Publication number
CN115116471A
Authority
CN
China
Prior art keywords
audio signal
matrix
processed
neural network
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210459690.1A
Other languages
Chinese (zh)
Other versions
CN115116471B (en)
Inventor
马东鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210459690.1A
Publication of CN115116471A
Application granted
Publication of CN115116471B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Audio signal processing method and apparatus, training method, device, and medium. The present disclosure provides an audio signal processing method for performing time-frequency domain echo cancellation processing by using a neural network, including: acquiring a reference audio signal and an audio signal to be processed; obtaining an amplitude spectrum matrix of the reference audio signal and an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network to generate a first processed audio signal; obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal and a real part data matrix and an imaginary part data matrix of the first processed audio signal; and performing time domain echo cancellation processing on the first processed audio signal by using a second neural network to generate a second processed audio signal. The present disclosure also provides a neural network training method, an audio signal processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.

Description

Audio signal processing method and apparatus, training method, device, and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an audio signal processing method, an audio signal processing apparatus using the audio signal processing method, a training method for a neural network model, a computing device, a computer-readable storage medium, and a computer program product.
Background
With the continuous development of audio signal processing technology, target objects using terminal devices place increasingly high demands on audio quality. If echo occurs during a call, the call quality is severely degraded. Echo is generated as follows: an audio signal played by a loudspeaker undergoes multiple reflections in a closed or semi-closed environment, causing signal distortion, and is finally picked up by a microphone together with the local audio to form an echo. The echo interferes with the delivery of the local audio and severely impacts the communication experience.
In the conventional technique, an adaptive filter is generally used to cancel the linear part of the echo, but the nonlinear part of the echo is often difficult to cancel. Recently, neural network models have begun to be applied to echo cancellation. A neural network model can be used in combination with an adaptive filter to post-process the audio signal processed by the adaptive filter, removing nonlinear residuals and a small amount of linear residuals from the signal; alternatively, a neural network model may be used in place of the adaptive filter to cancel both the linear and nonlinear portions of the echo. Currently used echo cancellation processing based on a neural network model generally converts an audio signal into the frequency domain to obtain a corresponding magnitude spectrum and phase spectrum, performs echo cancellation processing based on the magnitude spectrum, and then transforms the magnitude spectrum and the phase spectrum back to the time domain to generate a processed audio signal. However, in this processing method, only the amplitude spectrum of the audio signal is processed and the phase is discarded, which adversely affects the echo cancellation quality of the audio signal.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided an audio signal processing method including: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal collected through a local microphone; obtaining an amplitude spectrum matrix of the reference audio signal, and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal; performing time domain echo cancellation processing on the first processed audio signal with a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal.
According to some exemplary embodiments of the present disclosure, the obtaining of the magnitude spectrum matrix of the reference audio signal comprises: performing a fast Fourier transform on the reference audio signal to obtain the magnitude spectrum matrix of the reference audio signal; and the obtaining of the magnitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed comprises: performing a fast Fourier transform on the audio signal to be processed to obtain the magnitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed.
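To make this concrete, the following is a minimal numpy sketch (not taken from the patent) of how a framed fast Fourier transform can yield the magnitude spectrum matrix and phase spectrum matrix; the frame length, hop size, and window choice are illustrative assumptions.

```python
import numpy as np

def magnitude_phase_spectra(signal, frame_len=512, hop=256):
    """Frame a 1-D audio signal, apply an FFT per frame, and return the
    magnitude spectrum matrix and phase spectrum matrix (frames x bins)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    spectrum = np.fft.rfft(frames, axis=-1)      # complex spectrum per frame
    return np.abs(spectrum), np.angle(spectrum)  # magnitude matrix, phase matrix
```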
According to some exemplary embodiments of the present disclosure, the performing, with the first neural network, a frequency-domain echo cancellation process on the audio signal to be processed based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix of the audio signal to be processed, and the phase spectrum matrix of the audio signal to be processed, and the generating the first processed audio signal includes: splicing the amplitude spectrum matrix of the reference audio signal with the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix; inputting the spliced amplitude spectrum matrix into the first neural network to generate an amplitude spectrum filtering matrix; generating a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and the magnitude spectrum matrix of the audio signal to be processed; and taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters, and performing inverse fast Fourier transform to generate the first processed audio signal.
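As an illustrative sketch only (the patent does not prescribe an implementation), the splicing, first-neural-network filtering, and masking described above could look roughly as follows in PyTorch; the layer sizes, the sigmoid mask, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class FrequencyDomainStage(nn.Module):
    """First stage (sketch): generate an amplitude spectrum filtering matrix
    from the spliced magnitude spectra and apply it as an element-wise mask."""
    def __init__(self, num_bins=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * num_bins, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, num_bins)

    def forward(self, ref_mag, mic_mag):
        # ref_mag, mic_mag: (batch, frames, num_bins) magnitude spectrum matrices
        spliced = torch.cat([ref_mag, mic_mag], dim=-1)   # spliced amplitude spectrum matrix
        h, _ = self.lstm(spliced)
        mask = torch.sigmoid(self.proj(h))                # amplitude spectrum filtering matrix
        return mask * mic_mag                             # filtered amplitude spectrum matrix
```

The filtered magnitude spectrum matrix, combined with the phase spectrum matrix of the audio signal to be processed, would then be passed through an inverse fast Fourier transform to obtain the first processed audio signal.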
According to some exemplary embodiments of the present disclosure, the size of the magnitude spectrum filtering matrix is the same as the size of the magnitude spectrum matrix of the audio signal to be processed, and the generating of a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and the magnitude spectrum matrix of the audio signal to be processed comprises: multiplying each element in the magnitude spectrum matrix of the audio signal to be processed by the element at the corresponding position in the magnitude spectrum filtering matrix to generate the filtered magnitude spectrum matrix.
According to some exemplary embodiments of the disclosure, the generating of a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and the magnitude spectrum matrix of the audio signal to be processed comprises: convolving the magnitude spectrum matrix of the audio signal to be processed with the magnitude spectrum filtering matrix to generate the filtered magnitude spectrum matrix.
According to some exemplary embodiments of the disclosure, the obtaining the real and imaginary data matrices of the reference audio signal and the obtaining the real and imaginary data matrices of the first processed audio signal comprise: performing a fast fourier transform on the reference audio signal to obtain a real part data matrix and an imaginary part data matrix of the reference audio signal; a fast fourier transform is performed on the first processed audio signal to obtain a real part data matrix and an imaginary part data matrix of the first processed audio signal.
According to some exemplary embodiments of the disclosure, the performing time domain echo cancellation processing on the first processed audio signal with a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating the second processed audio signal comprises: splicing the real part data matrix of the reference audio signal with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and splicing the imaginary part data matrix of the reference audio signal with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix; inputting the spliced real part data matrix and the spliced imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix; generating a filtered real part data matrix based on the real part data matrix of the first processed audio signal and the real part data filtering matrix, and generating a filtered imaginary part data matrix based on the imaginary part data matrix of the first processed audio signal and the imaginary part data filtering matrix; and taking the filtered real part data matrix and the filtered imaginary part data matrix as input parameters together, and performing inverse fast Fourier transform to generate the second processed audio signal.
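A corresponding sketch for the second stage, again only illustrative, assuming an element-wise (masking) filter for both the real part and the imaginary part; names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainStage(nn.Module):
    """Second stage (sketch): generate real-part and imaginary-part data
    filtering matrices and apply them as element-wise masks."""
    def __init__(self, num_bins=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4 * num_bins, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * num_bins)

    def forward(self, ref_real, ref_imag, first_real, first_imag):
        # each input: (batch, frames, num_bins)
        spliced = torch.cat([ref_real, first_real, ref_imag, first_imag], dim=-1)
        h, _ = self.lstm(spliced)
        real_mask, imag_mask = self.proj(h).chunk(2, dim=-1)   # filtering matrices
        return real_mask * first_real, imag_mask * first_imag  # filtered real/imaginary matrices
```

The filtered real part and imaginary part data matrices would then be passed through an inverse fast Fourier transform to obtain the second processed audio signal.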
According to some exemplary embodiments of the disclosure, the real part data filtering matrix has a size same as a real part data matrix of the first processed audio signal and the imaginary part data filtering matrix has a size same as an imaginary part data matrix of the first processed audio signal, and wherein the generating a filtered real part data matrix based on the real part data matrix of the first processed audio signal and the real part data filtering matrix, the generating a filtered imaginary part data matrix based on the imaginary part data matrix of the first processed audio signal and the imaginary part data filtering matrix comprises: multiplying each element in the real data matrix of the first processed audio signal with an element at a corresponding position in the real data filter matrix to generate the filtered real data matrix; each element of the imaginary data matrix of the first processed audio signal is multiplied with an element of the imaginary data filter matrix at a corresponding position to generate the filtered imaginary data matrix.
According to some exemplary embodiments of the disclosure, the generating a filtered real data matrix based on the real data matrix of the first processed audio signal and the real data filtering matrix comprises: convolving the real part data matrix of the first processed audio signal with the real part data filter matrix to generate the filtered real part data matrix; and generating a filtered imaginary data matrix based on the imaginary data matrix of the first processed audio signal and the imaginary data filtering matrix comprises: convolving the imaginary data matrix of the first processed audio signal with the imaginary data filter matrix to generate the filtered imaginary data matrix.
According to some exemplary embodiments of the present disclosure, the first neural network and the second neural network are both long-short term memory neural networks.
According to a second aspect of the present disclosure, there is provided a neural network training method, including: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal acquired by a local microphone with a real audio signal; obtaining an amplitude spectrum matrix of the reference audio signal, and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal; performing time domain echo cancellation processing on the first processed audio signal using a second neural network based on a real part data matrix and an imaginary part data matrix of the reference audio signal and a real part data matrix and an imaginary part data matrix of the first processed audio signal, and generating a second processed audio signal; obtaining a first signal-to-noise ratio loss value based on the first processed audio signal and the real audio signal, and obtaining a second signal-to-noise ratio loss value based on the second processed audio signal and the real audio signal; adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
According to some exemplary embodiments of the disclosure, adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises: performing a weighted summation of the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain a combined signal-to-noise ratio loss value; and adjusting at least one of the first neural network and the second neural network based on the combined signal-to-noise ratio loss value.
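For illustration only, the weighted summation might be written as follows; the weights 0.7 and 0.3 are assumed values (the disclosure only states that the first weight may be larger than the second, see below).

```python
def combined_snr_loss(first_snr_loss, second_snr_loss, w_first=0.7, w_second=0.3):
    """Weighted sum of the first and second SNR loss values (weights are illustrative)."""
    return w_first * first_snr_loss + w_second * second_snr_loss
```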
According to some exemplary embodiments of the present disclosure, the weight of the first signal-to-noise ratio loss is greater than the weight of the second signal-to-noise ratio loss.
According to some exemplary embodiments of the present disclosure, adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises: inputting the first processed audio signal into a CTC-based acoustic model to obtain a CTC loss value; performing a weighted summation of the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value to obtain a combined loss value; and adjusting at least one of the first neural network and the second neural network based on the combined loss value.
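Analogously, a sketch of the three-term weighted summation with the additional CTC loss; all three weights are assumed values for illustration.

```python
def combined_loss(first_snr_loss, second_snr_loss, ctc_loss, w1=0.5, w2=0.3, w3=0.2):
    """Weighted sum of the two SNR loss values and the CTC loss value (illustrative weights)."""
    return w1 * first_snr_loss + w2 * second_snr_loss + w3 * ctc_loss
```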
According to some exemplary embodiments of the present disclosure, the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value are both scale-invariant signal-to-noise ratio loss values.
According to a third aspect of the present disclosure, there is provided an audio signal processing apparatus comprising: an audio signal acquisition module configured to: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal collected through a local microphone; an audio signal frequency domain information acquisition module configured to: obtaining an amplitude spectrum matrix of the reference audio signal, and obtaining an amplitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed; a frequency domain echo cancellation processing module configured to: based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal; an audio signal time domain information acquisition module configured to: obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal; a time domain echo cancellation processing module configured to: performing time domain echo cancellation processing on the first processed audio signal with a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal.
According to a fourth aspect of the present disclosure, there is provided a computing device comprising a processor and a memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and exemplary embodiments thereof, or cause the processor to perform the neural network training method according to the second aspect of the present disclosure and exemplary embodiments thereof.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium configured to store computer-executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and exemplary embodiments thereof, or to perform the neural network training method according to the second aspect of the present disclosure and exemplary embodiments thereof.
According to a sixth aspect of the present disclosure, there is provided a computer program product comprising computer executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method according to the first aspect of the present disclosure and exemplary embodiments thereof, or cause the processor to perform the neural network training method according to the second aspect of the present disclosure and exemplary embodiments thereof.
Therefore, according to the audio signal processing method provided by the present disclosure, the frequency domain echo cancellation processing is performed on the frequency domain based on the amplitude spectrum of the audio signal, and then the time domain echo cancellation processing is performed on the time domain based on the real part data and the imaginary part data of the audio signal, so as to take the effect of the phase information into consideration, thereby significantly improving the echo cancellation effect of the audio signal.
In addition, the neural network training method provided by the present disclosure can perform joint training on the first neural network and/or the second neural network based on both the result of the frequency domain echo cancellation and the result of the time domain echo cancellation, or based on the result of the frequency domain echo cancellation, the result of the time domain echo cancellation, and the evaluation result of the acoustic model, and therefore, the quality of the audio signal processing method on echo cancellation can be further improved.
Drawings
So that the manner in which the above recited features, characteristics and advantages of the present disclosure can be understood in detail, a more particular description of embodiments of the present disclosure, briefly summarized above, may be had by reference to the appended drawings, in which:
fig. 1 schematically illustrates an application scenario of an audio signal processing method according to some exemplary embodiments of the present disclosure;
FIG. 2 schematically illustrates an echo cancellation model based on a neural network model in the related art;
fig. 3 schematically illustrates, in flow chart form, a method of audio signal processing provided in accordance with some exemplary embodiments of the present disclosure;
fig. 4 schematically illustrates some details of the audio signal processing method illustrated in fig. 3, according to some exemplary embodiments of the present disclosure;
fig. 5 schematically illustrates some details of the audio signal processing method illustrated in fig. 3, according to some exemplary embodiments of the present disclosure;
fig. 6 schematically illustrates an audio signal processing model provided according to some exemplary embodiments of the present disclosure;
figure 7a schematically illustrates, in flow chart form, a neural network training method provided in accordance with some exemplary embodiments of the present disclosure;
fig. 7b schematically illustrates some details of the neural network training method illustrated in fig. 7a, in accordance with some exemplary embodiments of the present disclosure;
FIG. 8 schematically illustrates a neural network training model provided in accordance with some exemplary embodiments of the present disclosure;
FIG. 9 schematically illustrates, in flow chart form, another neural network training method provided in accordance with some exemplary embodiments of the present disclosure;
fig. 10 schematically illustrates another neural network training model provided in accordance with some exemplary embodiments of the present disclosure;
fig. 11 schematically illustrates, in block diagram form, the structure of an audio signal processing apparatus according to some exemplary embodiments of the present disclosure;
FIG. 12 schematically illustrates, in block diagram form, the structure of a computing device in accordance with some embodiments of the present disclosure.
It is to be understood that the matter shown in the figures is merely schematic and thus it is not necessarily drawn to scale. Further, throughout the drawings, the same or similar features are indicated by the same or similar reference numerals.
Detailed Description
The following description provides specific details of various exemplary embodiments of the present disclosure so that those skilled in the art can fully understand and implement the technical solutions according to the present disclosure.
First, some terms referred to in exemplary embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
artificial Intelligence (AI): the theory, method, technique and application system that uses digital computer or machine controlled by digital computer to simulate, extend and expand human intelligence, sense environment, obtain knowledge and use knowledge to obtain optimal result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to implement a new intelligent machine that can react in a manner similar to human intelligence. The artificial intelligence researches the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills, reorganize existing knowledge structures, and continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Acoustic Echo Cancellation (AEC) is a method of canceling the echo that arises between a loudspeaker and a microphone along the acoustic echo path of an audio signal. The main task of AEC is to estimate the acoustic transfer function from the local loudspeaker to the local microphone, including reflection paths, and to filter the incoming audio signal through the estimated acoustic transfer function to obtain an estimate of the echo signal. The estimated echo is then subtracted from the microphone signal to obtain an echo-free signal, and this echo-free signal, rather than the raw microphone signal, is transmitted over the channel to the far end. Traditional AEC tends to employ adaptive filtering; recently, with the development of artificial intelligence and machine learning, neural networks have been proposed for AEC and are particularly advantageous for removing nonlinear residuals in audio signals.
Signal-to-Noise Ratio (SNR): the ratio of signal (or information) to noise in an electronic device or electronic system. Here, the signal refers to the signal from outside the device that needs to be processed, and the noise refers to irregular extra signals (or information) that are not present in the original signal and are generated during processing; such noise does not change as the original signal changes.
The Scale-Invariant Signal-to-Noise Ratio (SI-SNR) refers to a signal-to-noise ratio that is not affected by scaling of the signal. For example, with the influence of signal scaling reduced by normalization, the SI-SNR can be calculated according to the following formula:
SI-SNR = 10 · log10( ||s_target||² / ||e_noise||² ), where s_target = (⟨ŝ, s⟩ / ||s||²) · s and e_noise = ŝ − s_target,
where ŝ is the estimated (evaluated) signal, s is the clean signal, ⟨·,·⟩ denotes the operation of multiplying corresponding elements and summing the products (the inner product), and ||·|| denotes the 2-norm.
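A small numpy sketch of this computation (the zero-mean step stands in for the normalization mentioned above; the epsilon guard is an added assumption):

```python
import numpy as np

def si_snr(estimate, clean, eps=1e-8):
    """Scale-invariant SNR (dB) between an estimated signal and a clean signal."""
    estimate = estimate - np.mean(estimate)   # reduce the influence of signal offset/scale
    clean = clean - np.mean(clean)
    s_target = np.dot(estimate, clean) / (np.dot(clean, clean) + eps) * clean
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
```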
The Character Error Rate (CER) is one of the most common metrics for speech recognition accuracy. The CER may be calculated as CER = (S + D + I)/(S + D + C), where S is the number of substituted characters, D is the number of deleted characters, I is the number of inserted characters, and C is the number of correct characters.
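For reference, S + D + I can be computed directly as the standard edit distance between the recognized text and the reference text; the sketch below assumes the denominator S + D + C equals the reference length.

```python
def character_error_rate(reference, hypothesis):
    """CER = (S + D + I) / (S + D + C), with S + D + C equal to the reference length."""
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                        # i deletions
    for j in range(n + 1):
        dist[0][j] = j                        # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution or match
    return dist[m][n] / max(m, 1)
```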
Acoustic models based on Connectionist Temporal Classification (CTC) are well known in the art. CTC is a temporal classification method that does not require frame-level alignment in the time dimension, so the model can predict a result directly from the input speech features. Compared with a conventional acoustic model (e.g., one based on a Deep Neural Network-Hidden Markov Model (DNN-HMM)), a CTC-based acoustic model greatly simplifies the training procedure and thus the modeling process.
Referring to fig. 1, an application scenario of an audio signal processing method according to some exemplary embodiments of the present disclosure is schematically illustrated. As shown in fig. 1, the exemplary application scenario 100 includes a first terminal device 110, a server 120, and a second terminal device 130, and the first terminal device 110 and the second terminal device 130 can be connected to the server 120 in a wired or wireless manner for communication through a network 140.
The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, which is not limited in this disclosure. The first terminal device 110 and the second terminal device 130 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. Examples of the network 140 may include any combination of a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a communication network such as the Internet, to which the disclosure is not limited. Accordingly, each of the first terminal device 110, the server 120, and the second terminal device 130 may include at least one communication interface (not shown) capable of communicating over the network 140. Such a communication interface may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth. It should be understood that the first terminal device 110 and the second terminal device 130 included in the application scenario 100 shown in fig. 1 are merely exemplary, and the application scenario 100 may include more terminal devices, for example, in a real-time communication conference with multiple participating target objects. Furthermore, it should also be understood that in other exemplary application scenarios, the first terminal device 110 and the second terminal device 130 may communicate directly with each other, in which case the server 120 is not required.
In the exemplary application scenario 100 shown in fig. 1, the second terminal device 130 may acquire an audio signal and perform channel coding to obtain a reference audio signal, and then transmit the reference audio signal to the server 120, which forwards it to the first terminal device 110. The first terminal device 110 receives the reference audio signal, performs channel decoding on it, and plays it through the speaker of the first terminal device 110. Meanwhile, the first terminal device 110 may also collect a local audio signal through its microphone; because the reference audio signal played through the speaker is also picked up in this process, an echo signal is generated, and the audio signal collected by the first terminal device 110 through the microphone therefore includes the echo signal. Similarly, the second terminal device 130 may receive a reference audio signal from the first terminal device 110 and play it through its speaker, while at the same time collecting a local audio signal through its microphone; in this process an echo signal is likewise generated because the reference audio signal played through the speaker is picked up, so the audio signal collected by the second terminal device 130 through the microphone also includes an echo signal. Therefore, for the terminal devices in the exemplary application scenario shown in fig. 1, echo cancellation processing needs to be performed on the acquired audio signals.
In the conventional technology, an adaptive filter is generally used to eliminate the linear part in the echo, but the nonlinear part in the echo is often difficult to eliminate. Recently, neural networks have begun to be applied to echo cancellation, and compared to conventional adaptive filters, neural network models can remove linear and nonlinear portions of an echo that an audio signal has.
Referring to fig. 2, a neural network-based echo cancellation model in the related art is schematically illustrated. As shown in fig. 2, the echo cancellation model 200 first obtains a reference audio signal and a to-be-processed audio signal, where the reference audio signal is an audio signal played through a local speaker, and the to-be-processed audio signal is an audio signal collected through a local microphone. The reference audio signal is input to the first fast fourier transform module 210a for fast fourier transform to obtain a magnitude spectrum matrix 220 of the reference audio signal, and the audio signal to be processed is input to the second fast fourier transform module 210b for fast fourier transform to obtain a magnitude spectrum matrix 230 and a phase spectrum matrix 240 of the audio signal to be processed. Then, the amplitude spectrum matrix 220 of the reference audio signal and the amplitude spectrum matrix 230 of the audio signal to be processed are input to the splicing processing module 250 for splicing processing. The stitched magnitude spectrum matrix is input into a long-short term memory neural network (i.e., an LSTM neural network) 260 to generate a magnitude spectrum masking matrix 270. The magnitude spectrum masking matrix 270 and the magnitude spectrum matrix 230 of the audio signal to be processed are input to the multiplication processing module 280 to multiply each element in the magnitude spectrum matrix 230 of the audio signal to be processed with one element at a corresponding position in the magnitude spectrum masking matrix 270 to generate a masked magnitude spectrum matrix. The masked magnitude spectrum matrix and the phase spectrum matrix 240 of the audio signal to be processed are input together to the inverse fourier transform module 290 for inverse fourier transform to generate the processed audio signal. The processed audio signal is the audio signal processed by the echo cancellation model 200.
As can be seen from the above analysis, the echo cancellation model 200 converts the audio signal to be processed into a frequency domain, obtains a corresponding magnitude spectrum and a corresponding phase spectrum, performs echo cancellation processing on the magnitude spectrum, and then transforms the magnitude spectrum into a time domain in combination with the phase spectrum to generate a processed audio signal. In the echo cancellation model 200, the LSTM neural network 260 performs echo cancellation processing only for the magnitude spectrum of the audio signal, and discards the phase spectrum. However, the phase information of the audio signal is substantially beneficial for the processing of the signal by the neural network, which can further improve the performance of the neural network in canceling the echo signal. Therefore, in the echo cancellation model 200 of the related art, the operation of the LSTM neural network 260 discarding the phase spectrum during the processing procedure may adversely affect the quality of echo cancellation.
Referring to fig. 3, an audio signal processing method provided according to some exemplary embodiments of the present disclosure is schematically illustrated in the form of a flowchart. As shown in fig. 3, the audio signal processing method 300 includes steps 310, 320, 330, 340, and 350.
In step 310, a reference audio signal and a to-be-processed audio signal are obtained, where the reference audio signal is an audio signal played through a local speaker, and the to-be-processed audio signal is an audio signal collected through a local microphone. It should be understood that a local speaker and a local microphone mean that the speaker and the microphone are in the same space, and thus, an audio signal (i.e., a reference audio signal) played by the speaker may be reflected multiple times in the space and may be picked up by the microphone, thereby generating an echo signal included in the audio signal to be processed. The reference audio signal may have a variety of sources, and by way of non-limiting example, the reference audio signal may be an audio signal captured and transmitted by the remote end device, or may be an audio signal received from a server that is playable by a speaker, or may be any audio signal that is playable by a speaker at the local end device, so long as the audio signal is played by the local speaker and thus may be captured by the local microphone to form an echo signal. The present disclosure does not limit the source of the reference audio signal. Since the audio signal to be processed includes an echo signal caused by playing the reference audio signal, it is necessary to perform echo cancellation processing on the audio signal to be processed.
In step 320, a magnitude spectrum matrix of the reference audio signal is obtained, and a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed are obtained. It should be understood that any suitable method may be employed to obtain the amplitude spectrum matrix of the reference audio signal, and to obtain the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, which is not limited by the present disclosure. In some exemplary embodiments of the present disclosure, a fast fourier transform may be performed on a reference audio signal to obtain a magnitude spectral matrix of the reference audio signal, and a fast fourier transform may also be performed on an audio signal to be processed to obtain a magnitude spectral matrix and a phase spectral matrix of the audio signal to be processed. Through fast Fourier transform, the frequency domain characteristics of the reference audio signal and the audio signal to be processed, namely the amplitude spectrum and the phase spectrum, can be quickly and conveniently obtained. It should be understood that any suitable method of obtaining the magnitude and phase spectra of an audio signal is possible, and the disclosure is not limited thereto. For example, in other exemplary embodiments of the present disclosure, the reference audio signal and the audio signal to be processed may also be input into the trained neural network model, respectively, to obtain a magnitude spectrum matrix of the reference audio signal and a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed, respectively.
In step 330, a first neural network is used to perform frequency domain echo cancellation processing on the audio signal to be processed based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix of the audio signal to be processed, and the phase spectrum matrix of the audio signal to be processed, and a first processed audio signal is generated. In this step, by performing corresponding processing on the frequency domain characteristics of the audio signal to be processed, echo cancellation processing of the audio signal to be processed can be realized in the frequency domain. As a non-limiting example, the first neural network may be configured to: and generating a magnitude spectrum filtering matrix based on the magnitude spectrum matrix of the reference audio signal and the magnitude spectrum matrix of the audio signal to be processed. For example, the magnitude spectrum filtering matrix may be a masking matrix having the same size as that of the magnitude spectrum matrix of the audio signal to be processed, or may be a convolution matrix capable of being convolved with the magnitude spectrum matrix of the audio signal to be processed. It follows that the first neural network functions as: based on the obtained frequency domain features of both the reference audio signal and the audio signal to be processed (i.e. the amplitude spectrum matrices of both), a corresponding filter matrix is adaptively generated, which can then be used to filter the frequency domain features of the audio signal to be processed (i.e. the amplitude spectrum matrices of the audio signal to be processed). Thereby, based on the first neural network, echo cancellation processing of the audio signal to be processed in the frequency domain can be achieved. The function of the first neural network will be described in more detail below. Further, it should be understood that the first neural network may be any suitable neural network model, such as, but not limited to, a fully-connected neural network, a convolutional neural network, a recurrent neural network, and an LSTM neural network, among others. The present disclosure does not limit the type of neural network that may be used as the first neural network.
Referring to fig. 4, details of step 330 of the audio signal processing method 300 shown in fig. 3 are schematically illustrated, according to some exemplary embodiments of the present disclosure. As shown in fig. 4, in some exemplary embodiments of the present disclosure, step 330 of audio signal processing method 300 includes steps 330-1, 330-2, 330-3, and 330-4.
In step 330-1, the amplitude spectrum matrix of the reference audio signal is spliced with the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix. In step 330-2, the spliced amplitude spectrum matrix is input to the first neural network to generate an amplitude spectrum filtering matrix. In step 330-3, a filtered amplitude spectrum matrix is generated based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed. In step 330-4, the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed are taken together as input parameters, and an inverse fast Fourier transform is performed to generate the first processed audio signal. In some exemplary embodiments of the present disclosure, the amplitude spectrum filtering matrix generated in step 330-2 has the same size as the amplitude spectrum matrix of the audio signal to be processed, and thus, in step 330-3, each element of the amplitude spectrum matrix of the audio signal to be processed may be multiplied by the element at the corresponding position in the amplitude spectrum filtering matrix to generate the filtered amplitude spectrum matrix. It is to be understood that the term "corresponding position" here means that the position number of an element in the amplitude spectrum matrix of the audio signal to be processed is the same as the position number of the corresponding element in the amplitude spectrum filtering matrix; for example, both are the elements with position number (m, n), where m and n are any integers greater than 0. Further, according to other exemplary embodiments of the present disclosure, in step 330-3, the amplitude spectrum matrix of the audio signal to be processed may be convolved with the amplitude spectrum filtering matrix to generate the filtered amplitude spectrum matrix. In this case, the size of the amplitude spectrum filtering matrix generated in step 330-2 is not necessarily the same as the size of the amplitude spectrum matrix of the audio signal to be processed.
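A tiny concrete example of this element-by-element ("corresponding position") multiplication, with made-up 2 x 2 matrices:

```python
import numpy as np

mag = np.array([[0.8, 1.2],
                [0.5, 2.0]])   # amplitude spectrum matrix of the audio signal to be processed
mask = np.array([[0.9, 0.1],
                 [1.0, 0.4]])  # amplitude spectrum filtering matrix of the same size
filtered = mag * mask          # element (m, n) of mag times element (m, n) of mask
# filtered == [[0.72, 0.12],
#              [0.50, 0.80]]
```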
Directly multiplying elements at corresponding positions is simple to compute, involves a relatively small amount of calculation, and keeps the model complexity low. The convolution operation involves more computation, but its signal filtering effect is better, and it is therefore very beneficial for improving echo cancellation quality. Further, it should be understood that the filtered amplitude spectrum matrix may be generated based on the amplitude spectrum filtering matrix and the amplitude spectrum matrix of the audio signal to be processed in any suitable manner, which is not limited by the present disclosure.
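For the convolution variant, a possible sketch using a small 2-D kernel (the kernel size, the averaging kernel, and the scipy usage are assumptions, not part of the disclosure):

```python
import numpy as np
from scipy.signal import convolve2d

mag = np.abs(np.random.randn(100, 257))          # amplitude spectrum matrix (frames x bins), illustrative
kernel = np.full((3, 3), 1.0 / 9.0)              # a small amplitude spectrum filtering (convolution) matrix
filtered = convolve2d(mag, kernel, mode='same')  # filtered matrix, same size as the input
```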
Referring back to fig. 3, in step 340, a real data matrix and an imaginary data matrix of the reference audio signal are obtained, and a real data matrix and an imaginary data matrix of the first processed audio signal are obtained. It should be understood that any suitable method may be employed to obtain the real and imaginary data matrices of the reference audio signal and to obtain the real and imaginary data matrices of the audio signal to be processed, which is not limited by the present disclosure. In some exemplary embodiments of the present disclosure, a reference audio signal may be fast fourier transformed to obtain a real part data matrix and an imaginary part data matrix of the reference audio signal, and an audio signal to be processed may also be fast fourier transformed to obtain a real part data matrix and an imaginary part data matrix of the audio signal to be processed. The time domain characteristics of the reference audio signal and the audio signal to be processed, namely a real part data matrix and an imaginary part data matrix, can be obtained quickly and conveniently through fast Fourier transform. It should be understood that any suitable method of obtaining the real and imaginary data matrices of the audio signal is possible, and the disclosure is not limited thereto. For example, in further exemplary embodiments of the present disclosure, the reference audio signal and the audio signal to be processed may also be input into the trained neural network model, respectively, to obtain a real part data matrix and an imaginary part data matrix of the reference audio signal and a real part data matrix and an imaginary part data matrix of the audio signal to be processed, respectively.
In step 350, a time domain echo cancellation process is performed on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal, and a second processed audio signal is generated. In this step, by performing corresponding processing on the time domain feature of the audio signal to be processed, the echo cancellation processing of the audio signal to be processed can be implemented in the time domain. As a non-limiting example, the second neural network may be configured to: a real data filtering matrix and an imaginary data filtering matrix are generated based on the real data matrix and the imaginary data matrix of the reference audio signal and the real data matrix and the imaginary data matrix of the first processed audio signal, respectively. For example, the real data filtering matrix and the imaginary data filtering matrix may be masking matrices having the same size as that of the real data matrix and the imaginary data matrix of the first processed audio signal, respectively, or may be convolution matrices capable of being convolved with the real data matrix and the imaginary data matrix of the first processed audio signal, respectively. It follows that the second neural network functions: based on the obtained time-domain characteristics of both the reference audio signal and the first processed audio signal (i.e. the real data matrix and the imaginary data matrix of both), respective filter matrices are adaptively generated, which can then be used to filter the time-domain characteristics of the first processed audio signal (i.e. the real data matrix and the imaginary data matrix of the first processed audio signal). Thus, based on the second neural network, echo cancellation processing of the first processed audio signal in the time domain can be achieved. The function of the second neural network will be described in more detail below. Further, it should be understood that the second neural network may likewise be any suitable neural network model, such as, but not limited to, a fully-connected neural network, a convolutional neural network, a recurrent neural network, and an LSTM neural network, among others. The present disclosure does not impose any limitation on the type of neural network that may be used as the second neural network.
Referring to fig. 5, details of step 350 of the audio signal processing method 300 shown in fig. 3 are schematically illustrated, according to some exemplary embodiments of the present disclosure. As shown in fig. 5, in some exemplary embodiments of the present disclosure, step 350 of audio signal processing method 300 includes steps 350-1, 350-2, 350-3, and 350-4.
In step 350-1, the real part data matrix of the reference audio signal is spliced with the real part data matrix of the first processed audio signal to generate a spliced real part data matrix, and the imaginary part data matrix of the reference audio signal is spliced with the imaginary part data matrix of the first processed audio signal to generate a spliced imaginary part data matrix. In step 350-2, the spliced real part data matrix and the spliced imaginary part data matrix are input to the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix. In step 350-3, a filtered real part data matrix is generated based on the real part data matrix of the first processed audio signal and the real part data filtering matrix, and a filtered imaginary part data matrix is generated based on the imaginary part data matrix of the first processed audio signal and the imaginary part data filtering matrix. In step 350-4, the filtered real part data matrix and the filtered imaginary part data matrix are taken together as input parameters, and an inverse fast Fourier transform is performed to generate the second processed audio signal. In some exemplary embodiments of the present disclosure, the size of the real part data filtering matrix generated in step 350-2 is the same as the size of the real part data matrix of the first processed audio signal, and the size of the generated imaginary part data filtering matrix is the same as the size of the imaginary part data matrix of the first processed audio signal. Thus, in step 350-3, each element of the real part data matrix of the first processed audio signal may be multiplied by the element at the corresponding position in the real part data filtering matrix to generate the filtered real part data matrix, and each element of the imaginary part data matrix of the first processed audio signal may be multiplied by the element at the corresponding position in the imaginary part data filtering matrix to generate the filtered imaginary part data matrix. As explained in detail above, the term "corresponding position" here means that the position number of an element in the real part data matrix of the first processed audio signal is the same as the position number of the corresponding element in the real part data filtering matrix, and that the position number of an element in the imaginary part data matrix of the first processed audio signal is the same as the position number of the corresponding element in the imaginary part data filtering matrix. Further, according to other exemplary embodiments of the present disclosure, in step 350-3, the real part data matrix of the first processed audio signal may be convolved with the real part data filtering matrix to generate the filtered real part data matrix, and the imaginary part data matrix of the first processed audio signal may be convolved with the imaginary part data filtering matrix to generate the filtered imaginary part data matrix. In this case, the size of the real part data filtering matrix generated in step 350-2 is not necessarily the same as the size of the real part data matrix of the first processed audio signal, and the size of the generated imaginary part data filtering matrix is not necessarily the same as the size of the imaginary part data matrix of the first processed audio signal.
Directly multiplying elements at corresponding positions is a simple operation, involves a relatively small amount of computation, and keeps the complexity of the model low. A convolution operation involves a relatively larger amount of computation but provides a better signal filtering effect, and is therefore very beneficial to improving the echo cancellation quality. It should be appreciated that the filtered real part data matrix may be generated based on the real part data filtering matrix and the real part data matrix of the first processed audio signal, and the filtered imaginary part data matrix may be generated based on the imaginary part data filtering matrix and the imaginary part data matrix of the first processed audio signal, in any suitable manner, and the disclosure is not limited in this respect.
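To make the flow of steps 350-1 to 350-4 concrete, a minimal Python/NumPy sketch is given below. It assumes the real part and imaginary part data are already arranged as two-dimensional matrices, uses a hypothetical `second_network` callable in place of the trained second neural network, and illustrates both the element-wise multiplication variant and the convolution variant discussed above; it is not the implementation of the disclosure.

```python
import numpy as np
from scipy.signal import convolve2d


def time_domain_filtering(ref_real, ref_imag, proc_real, proc_imag,
                          second_network, use_convolution=False):
    """Sketch of steps 350-1 to 350-4.

    All inputs are 2-D matrices (frames x frequency bins); `second_network`
    is a hypothetical callable standing in for the trained second neural
    network and is assumed to return the real part and imaginary part data
    filtering matrices.
    """
    # Step 350-1: splice the reference matrices with the first-processed matrices.
    spliced_real = np.concatenate([ref_real, proc_real], axis=-1)
    spliced_imag = np.concatenate([ref_imag, proc_imag], axis=-1)

    # Step 350-2: the second neural network produces the two filtering matrices.
    real_filter, imag_filter = second_network(spliced_real, spliced_imag)

    # Step 350-3: element-wise multiplication (same-size matrices) or convolution.
    if use_convolution:
        filtered_real = convolve2d(proc_real, real_filter, mode="same")
        filtered_imag = convolve2d(proc_imag, imag_filter, mode="same")
    else:
        filtered_real = proc_real * real_filter
        filtered_imag = proc_imag * imag_filter

    # Step 350-4: inverse FFT of the filtered complex data, frame by frame.
    return np.fft.ifft(filtered_real + 1j * filtered_imag, axis=-1).real
```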
Referring to fig. 6, there is schematically shown an audio signal processing model provided according to some exemplary embodiments of the present disclosure, which corresponds to the audio signal processing method shown in fig. 3 to 5. As shown in fig. 6, the audio signal processing model 400 first obtains a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is an audio signal collected through a local microphone. The reference audio signal is input to a first fast Fourier transform module 410a for fast Fourier transform to obtain a magnitude spectrum matrix 420a of the reference audio signal, and the audio signal to be processed is input to a second fast Fourier transform module 410b for fast Fourier transform to obtain a magnitude spectrum matrix 430a and a phase spectrum matrix 440 of the audio signal to be processed. The magnitude spectrum matrix 420a of the reference audio signal and the magnitude spectrum matrix 430a of the audio signal to be processed are input to a first stitching processing module 450a for stitching processing to generate a stitched magnitude spectrum matrix. The stitched magnitude spectrum matrix is then input into the first LSTM neural network 460a to generate a magnitude spectrum filtering matrix 470a. The magnitude spectrum filtering matrix 470a and the magnitude spectrum matrix 430a of the audio signal to be processed are input to a first filtering processing module 480a. The first filtering processing module 480a may multiply each element in the magnitude spectrum matrix 430a of the audio signal to be processed by the element at the corresponding position in the magnitude spectrum filtering matrix 470a to generate a filtered magnitude spectrum matrix, or the first filtering processing module 480a may convolve the magnitude spectrum matrix 430a of the audio signal to be processed with the magnitude spectrum filtering matrix 470a to generate the filtered magnitude spectrum matrix. The filtered magnitude spectrum matrix and the phase spectrum matrix 440 of the audio signal to be processed are input to a first inverse Fourier transform module 490a for inverse Fourier transform to generate the first processed audio signal. The reference audio signal is also input to a third fast Fourier transform module 410c for fast Fourier transform to obtain a real part data matrix 420b and an imaginary part data matrix 420c of the reference audio signal. The first processed audio signal is input to a fourth fast Fourier transform module 410d for fast Fourier transform to obtain a real part data matrix 430b and an imaginary part data matrix 430c of the first processed audio signal. The real part data matrix 420b and the imaginary part data matrix 420c of the reference audio signal, and the real part data matrix 430b and the imaginary part data matrix 430c of the first processed audio signal, are input to a second stitching processing module 450b for respective stitching processing to generate a stitched real part data matrix and a stitched imaginary part data matrix. The stitched real part data matrix and the stitched imaginary part data matrix are then input into a second LSTM neural network 460b to generate a real part data filtering matrix 470b and an imaginary part data filtering matrix 470c, respectively. The real part data filtering matrix 470b and the real part data matrix 430b of the first processed audio signal are input to a second filtering processing module 480b.
The second filtering processing module 480b may multiply each element of the real part data matrix 430b of the first processed audio signal by the element at the corresponding position in the real part data filtering matrix 470b to generate a filtered real part data matrix, or the second filtering processing module 480b may convolve the real part data matrix 430b of the first processed audio signal with the real part data filtering matrix 470b to generate the filtered real part data matrix. The imaginary part data filtering matrix 470c and the imaginary part data matrix 430c of the first processed audio signal are input to a third filtering processing module 480c. The third filtering processing module 480c may multiply each element of the imaginary part data matrix 430c of the first processed audio signal by the element at the corresponding position in the imaginary part data filtering matrix 470c to generate a filtered imaginary part data matrix, or the third filtering processing module 480c may convolve the imaginary part data matrix 430c of the first processed audio signal with the imaginary part data filtering matrix 470c to generate the filtered imaginary part data matrix. The filtered real part data matrix and the filtered imaginary part data matrix are input to a second inverse Fourier transform module 490b for inverse Fourier transform to generate the second processed audio signal.
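For illustration only, the magnitude spectrum branch of model 400 (modules 410a and 410b through 490a) can be sketched as follows. The framing parameters, the rectangular-window framing, and the `first_network` callable are assumptions standing in for the actual front end and the trained first LSTM neural network; `first_network` is assumed to return a filtering matrix with the same shape as the magnitude spectrum matrix of the audio signal to be processed.

```python
import numpy as np


def frequency_domain_stage(reference, to_process, first_network,
                           frame_len=512, hop=256):
    """Sketch of modules 410a/410b through 490a of model 400."""
    def framed_fft(x):
        frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
        return np.fft.rfft(np.stack(frames), axis=-1)

    ref_spec = framed_fft(reference)                 # 410a
    mic_spec = framed_fft(to_process)                # 410b

    ref_mag = np.abs(ref_spec)                       # 420a
    mic_mag = np.abs(mic_spec)                       # 430a
    mic_phase = np.angle(mic_spec)                   # 440

    stitched = np.concatenate([ref_mag, mic_mag], axis=-1)   # 450a
    mag_filter = first_network(stitched)                     # 460a -> 470a

    filtered_mag = mic_mag * mag_filter              # 480a, element-wise variant
    first_spec = filtered_mag * np.exp(1j * mic_phase)
    # 490a: inverse transform per frame; a real system would also overlap-add.
    return np.fft.irfft(first_spec, n=frame_len, axis=-1)
```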
As analyzed above, frequency domain modeling has a coarser granularity, can learn only the magnitude spectrum, and therefore has weaker modeling capability, but it consumes less computational power; time domain modeling has a finer granularity and therefore stronger modeling capability, but it requires a larger model and more computational power. Therefore, the audio signal processing method and the audio signal processing model provided according to the exemplary embodiments of the present disclosure adopt a time-frequency domain mixing manner, which combines the advantages of the two and compensates for their deficiencies, so that the method and model achieve finer granularity and stronger modeling capability while consuming less computational power.
Therefore, according to the audio signal processing method and the audio signal processing model provided by the exemplary embodiments of the present disclosure, frequency domain echo cancellation is first performed based on the magnitude spectrum of the audio signal in the frequency domain, and time domain echo cancellation is then performed based on the real part data and the imaginary part data of the audio signal, so that the effect of the phase information is taken into consideration and the echo cancellation effect is significantly improved. For example, after the echo cancellation processing, speech recognition may be performed on the processed speech signal and the recognition result compared with the content of the real speech signal to obtain a character error rate (CER) for verifying the effectiveness of the echo cancellation processing. Compared with conventional echo cancellation processing, when the audio signal processing method provided by the exemplary embodiments of the present disclosure is used to perform echo cancellation on a speech signal, the CER obtained in speech recognition is relatively reduced by about 16%, as shown in the following table:
Echo cancellation processing method                                   Test set CER
Conventional AEC                                                      9.42
Audio signal processing method according to the present disclosure    7.88
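(As a check of the figure quoted above: (9.42 - 7.88) / 9.42 ≈ 0.163, i.e., a relative CER reduction of approximately 16%.)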
Therefore, the audio signal processing method and the audio signal processing scheme provided according to the exemplary embodiments of the present disclosure improve the echo cancellation effect and improve the quality of voice transmission compared to the conventional echo cancellation method, thereby also improving the user experience. In addition, both the first neural network 460a and the second neural network 460b may be trained with a particular audio signal to enable echo cancellation processing in both the frequency domain and the time domain of the corresponding audio signal, as will be described in detail below.
Referring to fig. 7a, there is schematically shown in flowchart form a neural network training method provided according to some exemplary embodiments of the present disclosure, which may be used for training the first and second neural networks used in the audio signal processing method and audio signal processing model described above. As shown in fig. 7a, the neural network training method 500 includes steps 510, 520, 530, 540, 550, 560, and 570.
In step 510, a reference audio signal and an audio signal to be processed are obtained, where the reference audio signal is an audio signal played through a local speaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal collected by a local microphone with a real audio signal. In the neural network training method 500, the audio signal collected by the local microphone is essentially an echo signal caused by the reference audio signal played by the local speaker; mixing this echo signal with the real audio signal therefore yields a mixed audio signal that, together with the reference audio signal and the real audio signal, can be used for training the first and second neural networks. Steps 520 to 550 are the same as steps 320 to 350 of the audio signal processing method 300 shown in fig. 3 and are therefore not described again here. It should be understood that, in some exemplary embodiments, step 530 may also include the steps shown in fig. 4 and step 550 may also include the steps shown in fig. 5. In step 560, a first signal-to-noise ratio loss value is obtained based on the first processed audio signal and the real audio signal, and a second signal-to-noise ratio loss value is obtained based on the second processed audio signal and the real audio signal. In some exemplary embodiments of the present disclosure, both the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value may be scale-invariant signal-to-noise ratio loss values (i.e., SI-SNR Loss). However, any other suitable indicator for measuring signal quality loss is possible, and the disclosure is not limited in this respect. In step 570, at least one of the first neural network and the second neural network is adjusted based on at least the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
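For reference, one common formulation of the scale-invariant signal-to-noise ratio loss mentioned in step 560 is sketched below; this is a generic definition rather than code from the disclosure, and the zero-mean normalization is an assumption of this sketch.

```python
import numpy as np


def si_snr_loss(estimate, target, eps=1e-8):
    """One common definition of the scale-invariant SNR loss (SI-SNR Loss).

    `estimate` and `target` are 1-D waveforms of equal length.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled target component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    si_snr = 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps) + eps)
    return -si_snr  # negated so that minimizing the loss maximizes the SI-SNR
```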
Referring to fig. 7b, step 570 is further defined, in accordance with some exemplary embodiments of the present disclosure. As shown in fig. 7b, step 570 further comprises: step 570-1, performing a weighted summation of the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain an integrated signal-to-noise ratio loss value; and step 570-2, adjusting at least one of the first neural network and the second neural network based on the integrated signal-to-noise ratio loss value. In some exemplary embodiments of the present disclosure, the first signal-to-noise ratio loss value may be weighted differently from the second signal-to-noise ratio loss value depending on the degree to which the application depends on the first neural network and the second neural network; for example, the first signal-to-noise ratio loss value may be weighted more heavily than the second signal-to-noise ratio loss value. In this case, the first neural network and the second neural network may be jointly trained using the neural network training method according to the exemplary embodiments of the present disclosure, but in practical applications only the first neural network is used for echo cancellation processing and the second neural network is not used. Thus, in practical applications, the complexity of the echo cancellation process, and hence the amount of computation, may be reduced while still ensuring the quality of the echo cancellation. In other exemplary embodiments of the present disclosure, the weight of the first signal-to-noise ratio loss value may be less than, or equal to, the weight of the second signal-to-noise ratio loss value. It should be understood that the signal-to-noise ratio loss values and the scale-invariant signal-to-noise ratio loss values recited in the present disclosure are exemplary; any other suitable metric may be used to train the neural networks, and the present disclosure is not limited thereto.
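A minimal sketch of the weighted summation in step 570-1 follows; the weights w1 and w2 are hypothetical placeholders chosen to reflect the example in which the first signal-to-noise ratio loss value is weighted more heavily.

```python
def integrated_snr_loss(first_snr_loss, second_snr_loss, w1=0.7, w2=0.3):
    """Step 570-1: weighted summation of the two signal-to-noise ratio loss
    values; w1 > w2 mirrors the case in which the application relies mainly
    on the first (frequency domain) neural network."""
    return w1 * first_snr_loss + w2 * second_snr_loss
```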
Referring to fig. 8, a neural network training model provided according to some exemplary embodiments of the present disclosure is schematically illustrated, which corresponds to the neural network training method illustrated in fig. 7. It should be understood that the neural network training model 600 shown in fig. 8 is substantially the same as the audio Signal processing model 400 shown in fig. 6, except that the neural network training model 600 shown in fig. 8 further includes a first Signal to Noise Ratio (SNR) loss calculation module 610a, a second SNR loss calculation module 610b, and an integrated SNR loss calculation module 620. Only these differences will be explained below, and the same parts as those shown in fig. 6 will not be described again.
The first SNR loss calculation module 610a receives the first processed audio signal from the first inverse Fourier transform module 490a and calculates a first SNR loss value. The second SNR loss calculation module 610b receives the second processed audio signal from the second inverse Fourier transform module 490b and calculates a second SNR loss value. The integrated SNR loss calculation module 620 receives the first SNR loss value and the second SNR loss value and performs a weighted summation of the two to obtain an integrated SNR loss value. As described above, the first SNR loss value may be weighted differently from the second SNR loss value depending on the degree to which the application depends on the first and second neural networks. For example, in some exemplary embodiments of the present disclosure, the weight of the first SNR loss value may be greater than the weight of the second SNR loss value; in other exemplary embodiments, the weight of the first SNR loss value may be less than or equal to the weight of the second SNR loss value. At least one of the first neural network 460a and the second neural network 460b may then be adjusted based on the integrated SNR loss value.
Thus, the neural network training method 500 and the neural network training model 600 are capable of jointly training the first neural network and/or the second neural network based on both the results of the frequency domain echo cancellation and the results of the time domain echo cancellation. In addition, after the joint training, only the frequency domain echo cancellation (i.e., the first neural network 460a and its related links) may be used in practical applications without the time domain echo cancellation (i.e., the second neural network 460b and its related links), so that the complexity of the neural network model used in the application may not increase, and at the same time, the high quality of the echo cancellation is ensured.
Referring to fig. 9, another neural network training method provided in accordance with some exemplary embodiments of the present disclosure, which may be used to train the first and second neural networks used in the audio signal processing method and audio signal processing model described above, is schematically illustrated in the form of a flowchart. As shown in fig. 9, the neural network training method 700 includes steps 710, 720, 730, 740, 750, 760, 770, 780, and 790.
Steps 710 to 760 in the neural network training method 700 are the same as steps 510 to 560 in the neural network training method 500 shown in fig. 7, and thus are not described again here. Further, it should be understood that, in some exemplary embodiments, step 730 may also include the steps shown in fig. 4 and step 750 may also include the steps shown in fig. 5. In step 770, the first processed audio signal is input to a CTC-based acoustic model to obtain a CTC loss value. In step 780, the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value are weighted and summed to obtain a composite loss value. In step 790, at least one of the first neural network and the second neural network is adjusted based on the composite loss value. In some exemplary embodiments of the present disclosure, both the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value may be scale-invariant signal-to-noise ratio loss values (i.e., SI-SNR Loss). However, any other suitable indicator for measuring signal quality loss is possible, and the disclosure is not limited thereto. Further, in some exemplary embodiments of the present disclosure, the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value may have different weights. For example, the first signal-to-noise ratio loss value may be weighted more heavily than both the second signal-to-noise ratio loss value and the CTC loss value. However, it should be understood that any other weighting of the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value is possible depending on the actual situation.
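Analogously, the composite loss of step 780 can be sketched as a three-term weighted sum; the weight values below are hypothetical placeholders, with the first SI-SNR loss weighted most heavily as in the example above.

```python
def composite_loss(first_snr_loss, second_snr_loss, ctc_loss,
                   w_snr1=0.5, w_snr2=0.3, w_ctc=0.2):
    """Step 780: weighted summation of the two SI-SNR loss values and the CTC
    loss value; step 790 then adjusts the networks based on this value."""
    return w_snr1 * first_snr_loss + w_snr2 * second_snr_loss + w_ctc * ctc_loss
```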
Referring to fig. 10, another neural network training model provided according to some exemplary embodiments of the present disclosure is schematically illustrated, which corresponds to the neural network training method illustrated in fig. 9. It should be understood that the neural network training model 800 shown in fig. 10 is substantially the same as the neural network training model 600 shown in fig. 8, except that the neural network training model 800 shown in fig. 10 further includes a composite loss value calculation module 810 and a CTC-based acoustic model. Only these differences will be described below, and the same parts as those shown in fig. 8 will not be described again.
As shown in fig. 10, the first processed audio signal from the first inverse Fourier transform module 490a is fed to a filter bank (FBank) feature extraction module 820 to obtain frequency band features, which are then passed sequentially through a difference processing module 830 and a mean square difference processing module 840 to obtain acoustic features. The acoustic features are then passed through a convolution module 850, fed to a third neural network module 860 for processing, linearized by a linearization module 870, and finally fed to a CTC loss value calculation module 880 for calculating a CTC loss value. As shown in fig. 10, in this exemplary embodiment, the third neural network module 860 may include an LSTM network and a Batch Normalization layer (BN layer). However, the third neural network module 860 may also include other types of neural network models, such as fully-connected neural networks, convolutional neural networks, recurrent neural networks, and so on, and the present disclosure is not limited in this respect. In addition, the FBank feature extraction module 820, the difference processing module 830, the mean square difference processing module 840, the convolution module 850, the third neural network module 860, the linearization module 870, and the CTC loss value calculation module 880 described above together form an exemplary CTC-based acoustic model, which is well known to those skilled in the art and thus will not be described in detail herein. It should be understood that other types of CTC-based acoustic models are possible, and the present disclosure is not limited thereto. The composite loss value calculation module 810 receives the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value, and the CTC loss value and performs a weighted summation of the three to obtain a composite loss value. Then, based on the composite loss value, at least one of the first neural network 460a and the second neural network 460b may be adjusted, or at least one of the first neural network 460a, the second neural network 460b, and the third neural network module 860 may be adjusted.
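As an illustration of the acoustic model head described above, a rough PyTorch sketch follows. The layer sizes, vocabulary size, and the assumption that the FBank-derived acoustic features have already been extracted (modules 820 to 840) are hypothetical, and an actual CTC-based acoustic model may differ considerably.

```python
import torch
import torch.nn as nn


class CTCAcousticHead(nn.Module):
    """Rough sketch of modules 850 to 880 of the CTC-based acoustic model."""

    def __init__(self, feat_dim=80, hidden=256, vocab_size=5000):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)  # 850
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)              # 860, LSTM
        self.bn = nn.BatchNorm1d(hidden)                                   # 860, BN layer
        self.linear = nn.Linear(hidden, vocab_size)                        # 870
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)                 # 880

    def forward(self, feats, feat_lens, labels, label_lens):
        # feats: (batch, time, feat_dim); labels: padded matrix of label ids.
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        log_probs = self.linear(x).log_softmax(dim=-1)
        # nn.CTCLoss expects log-probabilities shaped (time, batch, vocab).
        return self.ctc(log_probs.transpose(0, 1), labels, feat_lens, label_lens)
```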
Thus, the neural network training method 700 and the neural network training model 800 can jointly train the first neural network and/or the second neural network based on the results of the frequency domain echo cancellation, the results of the time domain echo cancellation, and the evaluation result of the acoustic model. In the joint training, the gradient of the acoustic model can be propagated back to the first neural network and the second neural network to train them further, so that the performance of the trained neural networks can be further improved. In addition, because the acoustic model itself is not updated during training, the trained audio signal processing model can be directly connected to other acoustic models that did not take part in the joint training when it is deployed, so that the trained audio signal processing model has strong generality.
Referring to fig. 11, a structure of an audio signal processing apparatus according to some exemplary embodiments of the present disclosure is schematically illustrated in a block diagram. As shown in fig. 11, the audio signal processing apparatus 900 includes: an audio signal obtaining module 910, an audio signal frequency domain information obtaining module 920, a frequency domain echo cancellation processing module 930, an audio signal time domain information obtaining module 940, and a time domain echo cancellation processing module 950.
The audio signal acquisition module 910 is configured to: a reference audio signal and an audio signal to be processed are obtained. The audio signal frequency domain information acquisition module 920 is configured to: and obtaining a magnitude spectrum matrix of the reference audio signal, and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed. The frequency domain echo cancellation processing module 930 is configured to: and performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed to generate a first processed audio signal. The audio signal time domain information acquisition module 940 is configured to: obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal. The time domain echo cancellation processing module 950 is configured to: and performing time domain echo cancellation processing on the first processed audio signal by using a second neural network based on the real part data matrix and the imaginary part data matrix of the reference audio signal and the real part data matrix and the imaginary part data matrix of the first processed audio signal to generate a second processed audio signal. The above modules relate to the operations of steps 310 to 350 described above with respect to fig. 3, and thus are not described in detail herein.
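To show how the five modules fit together, a minimal sketch of the apparatus follows; each callable passed to the constructor is a hypothetical stand-in for a concrete module implementation realizing the corresponding step described above.

```python
class AudioSignalProcessingApparatus:
    """Sketch of apparatus 900; each argument stands in for one module."""

    def __init__(self, acquire, freq_info, freq_aec, time_info, time_aec):
        self.acquire = acquire        # audio signal acquisition module 910
        self.freq_info = freq_info    # frequency domain information module 920
        self.freq_aec = freq_aec      # frequency domain echo cancellation module 930
        self.time_info = time_info    # time domain information module 940
        self.time_aec = time_aec      # time domain echo cancellation module 950

    def process(self):
        reference, to_process = self.acquire()
        ref_mag, mic_mag, mic_phase = self.freq_info(reference, to_process)
        first_processed = self.freq_aec(ref_mag, mic_mag, mic_phase)
        ref_parts, first_parts = self.time_info(reference, first_processed)
        return self.time_aec(ref_parts, first_parts)
```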
Furthermore, the various modules described above with respect to fig. 11 may each be implemented in hardware or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer-executable code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of these modules may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
Referring to fig. 12, the structure of a computing device 1000 in accordance with some embodiments of the present disclosure is schematically illustrated in block diagram form. The computing device 1000 may be used for various application scenarios described in the present disclosure, and it may implement the audio signal processing methods and/or neural network training methods described in the present disclosure.
Computing device 1000 may include at least one processor 1002, memory 1004, communication interface(s) 1006, display device 1008, other input/output (I/O) devices 1010, and one or more mass storage devices 1012, capable of communicating with each other, such as by system bus 1014 or other appropriate connection.
The processor 1002 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 1002 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 1002 may be configured to retrieve and execute computer readable instructions stored in the memory 1004, mass storage device 1012, or other computer readable medium, such as program code for an operating system 1016, program code for an application program 1018, program code for other programs 1020, and so forth.
Memory 1004 and mass storage devices 1012 are examples of computer readable storage media for storing instructions that can be executed by processor 1002 to implement the various functions described above. By way of example, the memory 1004 may generally include both volatile and nonvolatile memory (e.g., RAM, ROM, and the like). In addition, mass storage devices 1012 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. The memory 1004 and the mass storage device 1012 may both be collectively referred to herein as a computer-readable memory or computer-readable storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer-executable code that may be executed by the processor 1002 as a particular machine configured to implement the operations and functions described in various exemplary embodiments of the present disclosure (e.g., the above-described audio signal processing methods and models, and the corresponding neural network training methods and models).
A number of program modules may be stored on the mass storage device 1012. These program modules include an operating system 1016, one or more application programs 1018, other programs 1020, and program data 1022, and can be executed by processor 1002. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer-executable code or instructions) for implementing the following components/functions: an audio signal obtaining module 910, an audio signal frequency domain information obtaining module 920, a frequency domain echo cancellation processing module 930, an audio signal time domain information obtaining module 940, and a time domain echo cancellation processing module 950.
Although illustrated in fig. 12 as being stored in memory 1004 of computing device 1000, the audio signal processing methods and models, the neural network training methods and models, and the audio signal acquisition module 910, the audio signal frequency domain information acquisition module 920, the frequency domain echo cancellation processing module 930, the audio signal time domain information acquisition module 940, and the time domain echo cancellation processing module 950, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computing device 1000. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media as defined by the present disclosure does not include communication media.
Computing device 1000 may also include one or more communication interfaces 1006 for exchanging data with other devices, such as over a network, direct connection, and so forth. Communication interface 1006 may facilitate communications within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. Communication interface 1006 may also provide for communication with external storage devices (not shown), such as in a storage array, network attached storage, storage area network, or the like.
In some examples, computing device 1000 may also include a display device 1008, such as a display, for displaying information and images. Other I/O devices 1010 may be devices that receive various inputs from a target object and provide various outputs to the target object, including but not limited to touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
The present disclosure also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computing device to perform the audio signal processing method and/or the neural network training method provided in the various alternative embodiments described above.
The terminology used herein is for the purpose of describing embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this disclosure, specify the presence of stated features but do not preclude the presence or addition of one or more other features. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various features, these features should not be limited by these terms. These terms are only used to distinguish one feature from another.
Unless otherwise defined, all terms (including technical and scientific terms) used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the description of the present specification, the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The present disclosure describes various techniques in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used in this disclosure generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described in this disclosure are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. Moreover, it should be further appreciated that the various steps of the methods illustrated in the flowcharts or otherwise described herein are merely exemplary and are not meant to imply that the steps must be performed in the order illustrated or described. Rather, various steps of the methods shown in the flowcharts or otherwise described herein may be performed in a different order than presented in the present disclosure or may be performed concurrently. Further, the methods shown in the flowcharts or otherwise described herein may include other additional steps as desired.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, any one or a combination of the following techniques, which are well known in the art, may be used: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays, field programmable gate arrays, and the like.
It will be understood by those skilled in the art that all or part of the steps of the method of the above embodiments may be performed by hardware associated with program instructions, and that the program may be stored in a computer readable storage medium, and that the program, when executed, includes one or a combination of the steps of performing the method embodiments.
Although the present disclosure has been described in detail in connection with some exemplary embodiments, it is not intended to be limited to the specific form set forth in the disclosure. Rather, the scope of the present disclosure is limited only by the accompanying claims.

Claims (19)

1. An audio signal processing method comprising:
acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal collected through a local microphone;
obtaining a magnitude spectrum matrix of the reference audio signal, and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal;
obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal;
performing time domain echo cancellation processing on the first processed audio signal with a second neural network based on real and imaginary data matrices of the reference audio signal and real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal.
2. The audio signal processing method of claim 1, wherein the obtaining of the magnitude spectrum matrix of the reference audio signal comprises: performing a fast Fourier transform on the reference audio signal to obtain a magnitude spectrum matrix of the reference audio signal, and
The obtaining of the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed includes: and carrying out fast Fourier transform on the audio signal to be processed to obtain a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed.
3. The audio signal processing method of claim 1, wherein the performing a frequency-domain echo cancellation process on the audio signal to be processed using a first neural network based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix of the audio signal to be processed, and the phase spectrum matrix of the audio signal to be processed, and generating a first processed audio signal comprises:
splicing the amplitude spectrum matrix of the reference audio signal with the amplitude spectrum matrix of the audio signal to be processed to generate a spliced amplitude spectrum matrix;
inputting the spliced amplitude spectrum matrix into the first neural network to generate an amplitude spectrum filtering matrix;
generating a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and the magnitude spectrum matrix of the audio signal to be processed;
and taking the filtered amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed together as input parameters, and performing inverse fast Fourier transform to generate the first processed audio signal.
4. The audio signal processing method of claim 3, wherein a size of the magnitude spectrum filtering matrix is the same as a size of a magnitude spectrum matrix of the audio signal to be processed, and wherein the generating a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and the magnitude spectrum matrix of the audio signal to be processed comprises:
and multiplying each element in the amplitude spectrum matrix of the audio signal to be processed with one element at a corresponding position in the amplitude spectrum filtering matrix to generate the filtered amplitude spectrum matrix.
5. The audio signal processing method of claim 3, wherein the generating a filtered magnitude spectrum matrix based on the magnitude spectrum filtering matrix and a magnitude spectrum matrix of the audio signal to be processed comprises:
and convolving the amplitude spectrum matrix of the audio signal to be processed with the amplitude spectrum filtering matrix to generate the filtered amplitude spectrum matrix.
6. The audio signal processing method of claim 1, wherein the obtaining a real data matrix and an imaginary data matrix of the reference audio signal and obtaining a real data matrix and an imaginary data matrix of the first processed audio signal comprises:
performing a fast Fourier transform on the reference audio signal to obtain a real part data matrix and an imaginary part data matrix of the reference audio signal;
performing a fast Fourier transform on the first processed audio signal to obtain a real part data matrix and an imaginary part data matrix of the first processed audio signal.
7. The audio signal processing method of claim 1, wherein the performing time-domain echo cancellation processing on the first processed audio signal using a second neural network based on the real and imaginary data matrices of the reference audio signal and the real and imaginary data matrices of the first processed audio signal and generating a second processed audio signal comprises:
concatenating the real data matrix of the reference audio signal with the real data matrix of the first processed audio signal to generate a concatenated real data matrix, and concatenating the imaginary data matrix of the reference audio signal with the imaginary data matrix of the first processed audio signal to generate a concatenated imaginary data matrix;
inputting the stitched real part data matrix and the stitched imaginary part data matrix into the second neural network to generate a real part data filtering matrix and an imaginary part data filtering matrix;
generating a filtered real part data matrix based on the real part data matrix and the real part data filtering matrix of the first processed audio signal, and generating a filtered imaginary part data matrix based on the imaginary part data matrix and the imaginary part data filtering matrix of the first processed audio signal;
and taking the filtered real part data matrix and the filtered imaginary part data matrix as input parameters together, and performing inverse fast Fourier transform to generate the second processed audio signal.
8. The audio signal processing method of claim 7, wherein a size of the real data filtering matrix is the same as a size of a real data matrix of the first processed audio signal and a size of the imaginary data filtering matrix is the same as a size of an imaginary data matrix of the first processed audio signal, and wherein the generating a filtered real data matrix based on the real data matrix and the real data filtering matrix of the first processed audio signal and generating a filtered imaginary data matrix based on the imaginary data matrix and the imaginary data filtering matrix of the first processed audio signal comprises:
multiplying each element of a real data matrix of the first processed audio signal with an element of a corresponding position in the real data filter matrix to generate the filtered real data matrix;
multiplying each element of the imaginary data matrix of the first processed audio signal with an element of a corresponding position in the imaginary data filter matrix to generate the filtered imaginary data matrix.
9. The audio signal processing method of claim 7, wherein the generating a filtered real data matrix based on the real data matrix of the first processed audio signal and the real data filtering matrix comprises: convolving the real data matrix of the first processed audio signal with the real data filtering matrix to generate the filtered real data matrix, and
Said generating a filtered imaginary data matrix based on the imaginary data matrix and the imaginary data filtering matrix of the first processed audio signal comprises: convolving the imaginary data matrix of the first processed audio signal with the imaginary data filtering matrix to generate the filtered imaginary data matrix.
10. The audio signal processing method of claim 1, wherein the first neural network and the second neural network are both long-short term memory neural networks.
11. A neural network training method, comprising:
acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is a mixed audio signal obtained by mixing an audio signal acquired by a local microphone with a real audio signal;
obtaining a magnitude spectrum matrix of the reference audio signal, and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
based on the amplitude spectrum matrix of the reference audio signal, the amplitude spectrum matrix and the phase spectrum matrix of the audio signal to be processed, performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network, and generating a first processed audio signal;
obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal;
performing a time domain echo cancellation process on the first processed audio signal with a second neural network based on real and imaginary data matrices of the reference audio signal and real and imaginary data matrices of the first processed audio signal and generating a second processed audio signal;
obtaining a first signal-to-noise ratio loss value based on the first processed audio signal and the real audio signal, and obtaining a second signal-to-noise ratio loss value based on the second processed audio signal and the real audio signal;
adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value.
12. The neural network training method of claim 11, wherein adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises:
carrying out weighted summation on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value to obtain a comprehensive signal-to-noise ratio loss value;
adjusting at least one of the first neural network and the second neural network based on the integrated signal-to-noise ratio loss value.
13. The neural network training method of claim 12, wherein the first signal-to-noise ratio penalty is weighted more heavily than the second signal-to-noise ratio penalty.
14. The neural network training method of claim 11, wherein adjusting at least one of the first neural network and the second neural network based at least on the first signal-to-noise ratio loss value and the second signal-to-noise ratio loss value comprises:
inputting the first processed audio signal into a CTC-based acoustic model to obtain a CTC loss value;
carrying out weighted summation on the first signal-to-noise ratio loss value, the second signal-to-noise ratio loss value and the CTC loss value to obtain a comprehensive loss value; and
adjusting at least one of the first neural network and the second neural network based on the composite loss value.
15. The neural network training method of any one of claims 11-14, wherein the first and second signal-to-noise ratio loss values are both scale-invariant signal-to-noise ratio loss values.
16. An audio signal processing apparatus comprising:
an audio signal acquisition module configured to: acquiring a reference audio signal and an audio signal to be processed, wherein the reference audio signal is an audio signal played through a local loudspeaker, and the audio signal to be processed is an audio signal collected through a local microphone;
an audio signal frequency domain information acquisition module configured to: obtaining a magnitude spectrum matrix of the reference audio signal, and obtaining a magnitude spectrum matrix and a phase spectrum matrix of the audio signal to be processed;
a frequency domain echo cancellation processing module configured to: performing frequency domain echo cancellation processing on the audio signal to be processed by using a first neural network based on the amplitude spectrum matrix of the reference audio signal and the amplitude spectrum matrix of the audio signal to be processed, and generating a first processed audio signal;
an audio signal time domain information acquisition module configured to: obtaining a real part data matrix and an imaginary part data matrix of the reference audio signal, and obtaining a real part data matrix and an imaginary part data matrix of the first processed audio signal;
a time domain echo cancellation processing module configured to: performing time domain echo cancellation processing on the first processed audio signal with a second neural network based on real and imaginary data matrices of the reference audio signal and real and imaginary data matrices of the first processed audio signal, and generating a second processed audio signal.
17. A computing device comprising a processor and a memory, the memory configured to store computer-executable instructions configured to, when executed on the processor, cause the processor to perform the audio signal processing method of any of claims 1 to 10, or cause the processor to perform the neural network training method of any of claims 11 to 15.
18. A computer readable storage medium configured to store computer executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method of any one of claims 1 to 10 or cause the processor to perform the neural network training method of any one of claims 11 to 15.
19. A computer program product comprising computer-executable instructions configured to, when executed on a processor, cause the processor to perform the audio signal processing method of any one of claims 1 to 10, or to perform the neural network training method of any one of claims 11 to 15.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210459690.1A CN115116471B (en) 2022-04-28 2022-04-28 Audio signal processing method and device, training method, training device and medium

Publications (2)

Publication Number Publication Date
CN115116471A (en) 2022-09-27
CN115116471B CN115116471B (en) 2024-02-13
