CN116013337B - Audio signal processing method, training method, device, equipment and medium for model

Info

Publication number: CN116013337B
Application number: CN202310038041.9A
Authority: CN (China)
Prior art keywords: audio signal, audio, original, signal, original audio
Legal status: Active (granted)
Other versions: CN116013337A (Chinese-language application publication)
Inventors: 蒋逸恒, 李峥, 张策
Current assignee: Beijing Baidu Netcom Science and Technology Co., Ltd.
Classification: Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo

Abstract

The present disclosure provides an audio signal processing method, a training method for a model, a training apparatus for the model, a device, and a medium, relating to the field of artificial intelligence and in particular to speech technology. In one implementation, the audio signal processing method includes: linearly filtering an original audio signal according to a reference signal and the original audio signal to obtain a first audio signal; nonlinearly filtering the first audio signal according to the original audio signal and the first audio signal to obtain a second audio signal; and determining an output audio signal from the second audio signal.

Description

Audio signal processing method, training method, device, equipment and medium for model
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of speech technology.
Background
With the continued development of the Internet, human-computer interaction has become increasingly intelligent and speech technology increasingly widespread. Smart speakers, smart car systems, smartphones, smart home systems, and similar products all need to transmit and process voice information when interacting with a user.
Disclosure of Invention
The present disclosure provides an audio signal processing method, a training method of a deep learning model, an apparatus, a device, a storage medium, and a program product.
According to an aspect of the present disclosure, there is provided an audio signal processing method including: according to the reference signal and the original audio signal, performing linear filtering on the original audio signal to obtain a first audio signal; according to the original audio signal and the first audio signal, nonlinear filtering is carried out on the first audio signal to obtain a second audio signal; and determining an output audio signal from the second audio signal.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: processing the sample original audio signal and the sample mixed audio signal by using a deep learning model to obtain output effective audio parameters and output echo audio parameters; and adjusting parameters of the deep learning model according to the difference value between the output effective audio parameter and the effective audio parameter label and the difference value between the output echo audio parameter and the echo audio parameter label to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an audio signal processing apparatus including: the first filtering module is used for carrying out linear filtering on the original audio signal according to the reference signal and the original audio signal to obtain a first audio signal; the second filtering module is used for carrying out nonlinear filtering on the first audio signal according to the original audio signal and the first audio signal to obtain a second audio signal; and a determining module for determining an output audio signal based on the second audio signal.
According to another aspect of the present disclosure, there is provided a training apparatus of a deep learning model, including: the processing module is used for processing the sample original audio signal and the sample mixed audio signal by using the deep learning model to obtain output effective audio parameters and output echo audio parameters; and the adjusting module is used for adjusting parameters of the deep learning model according to the difference value between the output effective audio parameter and the effective audio parameter label and the difference value between the output echo audio parameter and the echo audio parameter label to obtain a trained deep learning model.
Another aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods shown in the embodiments of the present disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method shown in the disclosed embodiments.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the steps of the method shown in the disclosed embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is an application scenario schematic diagram of an audio signal processing method, apparatus, electronic device, and storage medium according to an embodiment of the present disclosure;
fig. 2 schematically illustrates a schematic diagram of an audio signal according to an embodiment of the present disclosure;
fig. 3 schematically illustrates a flowchart of an audio signal processing method according to an embodiment of the present disclosure;
fig. 4 schematically illustrates a schematic diagram of processing an audio signal according to an embodiment of the present disclosure;
fig. 5 schematically illustrates a schematic diagram of processing an audio signal according to another embodiment of the present disclosure;
fig. 6 schematically illustrates a schematic diagram of processing an audio signal according to another embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the disclosure;
fig. 8 schematically illustrates a block diagram of an audio signal processing apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An application scenario of the audio signal processing method and apparatus provided in the present disclosure will be described below with reference to fig. 1.
Fig. 1 is an application scenario schematic diagram of an audio signal processing method, apparatus, electronic device, and storage medium according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, the application scenario 100 includes a microphone 101 and a speaker 102.
Microphone 101 may be used to collect audio signals, such as user speech, from the periphery of the device. The speaker 102 is used to play audio data, such as answer voices corresponding to the user's speech.
For example, the microphone 101 and the speaker 102 may be located in the same mobile terminal device. In that case, the mobile terminal device processes the audio signal collected by the microphone 101, and the audio data obtained from the processing is output by the speaker 102 of the same device.
For example, the microphone 101 and the speaker 102 may instead be located in two mobile terminal devices connected by wired or wireless means. In one variant, the device containing the microphone 101 processes the audio signal acquired by the microphone 101 and sends the resulting audio data to the device containing the speaker 102, which outputs it. In another variant, the device containing the microphone 101 sends the collected audio signal to the device containing the speaker 102, which processes the audio signal and outputs the resulting audio data through the speaker 102.
Mobile terminal devices may include, for example, smart speakers, smartphones, tablets, laptop and desktop computers, and the like.
According to an embodiment of the present disclosure, the audio signal received by the microphone 101 is an original audio signal. The original audio signal includes an audio signal emitted by the user and audio signals from the surrounding environment, for example the audio signal output by the speaker 102. The user's audio signal and the ambient audio signals are mixed together, so the ambient audio becomes interference with respect to the user's audio signal and affects the audio signal processing of the mobile terminal device.
For example, before the speaker 102 starts operating, the mobile terminal device processes the user audio signal from the microphone 101, generates audio data for interaction with the user, and outputs the audio data through the speaker 102. While the speaker 102 is outputting audio data, the user may continue to speak into the microphone 101; the original audio signal received by the microphone 101 then includes both the audio signal from the user and the audio data output by the speaker 102. When the mobile terminal device processes this original audio signal, the audio data output by the speaker 102 interferes with the audio signal sent by the user, degrading the speech recognition performance of the mobile terminal device.
According to the audio signal processing method of the embodiment of the disclosure, the mobile terminal device can eliminate the audio data output by the loudspeaker 102 in the original audio signal, so that the mobile terminal device can accurately identify the audio signal sent by the user. For example, the mobile terminal device may perform echo cancellation (Acoustic Echo Cancellation, AEC) on the original audio signal, ensuring the clarity and intelligibility of the audio signal emitted by the user.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the user comply with the relevant laws and regulations and do not violate public order and good morals.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
The audio signal provided by the present disclosure will be described below in connection with fig. 2.
Fig. 2 schematically shows a schematic diagram of an audio signal according to an embodiment of the present disclosure.
As shown in fig. 2, the mobile terminal device 200 includes a microphone 201 and a speaker 202. For example, the microphone 201 and the speaker 202 are located in the same mobile terminal device 200. The original audio signal received by the microphone 201 may include echoes generated from the audio signal output by the speaker 202.
According to an embodiment of the present disclosure, the echo comprises the sound emitted by the speaker 202 that is received by the microphone 201 after propagating through a medium such as air. This sound overlaps the real user's voice content and interferes with the performance of the speech recognition system.
According to an embodiment of the present disclosure, the echo includes a linear echo E_line and a nonlinear echo E_non-line. Taking the audio signal output by the speaker 202 as the reference signal, the linear echo E_line comprises the components of the reference signal that reach the microphone 201 via multiple reflection paths. The nonlinear echo E_non-line comprises the audio signal produced by nonlinear distortion of the reference signal caused by the power amplification hardware and the spatial transmission process.
According to embodiments of the present disclosure, audio signals received by microphone 201 from speaker 202 may be canceled using the AEC algorithm.
According to an embodiment of the present disclosure, an original audio signal is subjected to a filtering process using a Linear AEC (LAEC) and a Model AEC (MAEC) based on deep learning, thereby obtaining an accurate user audio signal. LAEC is used to cancel linear echoes in the original audio signal and MAEC is used to cancel nonlinear echoes remaining after LAEC.
However, the implementation of the AEC algorithm is tied to the hardware design of the mobile terminal device. For example, to achieve a good echo cancellation effect, the hardware design must keep nonlinear distortion as small as possible. For example, the captured reference signal must not arrive later than the corresponding user audio signal received by microphone 201. For example, the smaller the reverberation of the original audio signal, the greater the correlation between the reference signal and the user audio signal, and the better the AEC cancellation effect.
For example, in some example AEC algorithms the hardware of the mobile terminal device 200 must perform echo cancellation for different specific scenarios while meeting their respective requirements, such as a multi-speaker conference scenario, an autonomous-driving in-car scenario, or a smart speaker interaction scenario. In these specific scenarios, the reference signal is acquired by hardware loopback capture (the audio signal sent to the speaker 202 is acquired directly from hardware) using a dedicated hardware structure in the hardware layer of the mobile terminal apparatus 200, so that a hardware-adapted algorithm can cancel, based on the reference signal, the echo signal received by the microphone 201 and guarantee the echo cancellation effect.
However, because mobile terminal devices come in many types with differing hardware, their power amplification hardware readily produces severe nonlinear distortion of varying forms, and it is difficult to acquire the reference signal by installing a general-purpose hardware structure in the hardware layer of every device. In addition, modifying the hardware of mobile terminal devices is costly and the modification process is complex. Owing to the memory and computing-power limits of the mobile terminal device, it may be unable to run a hardware-structure-based AEC algorithm at all, so such a hardware structure is also difficult to make compatible with many device types.
According to an embodiment of the present disclosure, the audio signal processing method performed by the mobile terminal device is implemented as a software program. The AEC algorithm is realized in an upper-layer APP of the system architecture, improving the adaptability of the AEC algorithm to many types of mobile terminal devices.
According to embodiments of the present disclosure, the audio signal processing method may use software loopback capture (acquiring, in software, the audio signal sent to the speaker 202) to obtain the reference signal. Because the scheduling of the processor of the mobile terminal device 200 may introduce a non-fixed delay, a non-fixed delay difference may occur between the software-captured reference signal and the audio signal received by the microphone 201. For example, the delay difference ranges from 20 ms to 600 ms.
For example, in some mobile terminal devices a large delay difference may occur between the reference signal and the audio signal received by the microphone 201 because the processor's computing power is weak. When the delay difference is large, the cancellation performance of the AEC algorithm is reduced; moreover, because the delay is uncertain, the echo in the original audio signal may not be cancelled by a conventional AEC algorithm at all.
According to the embodiment of the disclosure, the delay difference is compensated by using a delay estimation (Time Delay Estimation, TDE) algorithm realized based on a software program, so that the mobile terminal equipment can accurately eliminate the echo in the original audio signal, and the elimination effect is improved. In addition, the audio signal processing method provided by the embodiment of the present disclosure may also control the computation overhead generated in the echo cancellation process, so that the audio signal processing method provided by the embodiment of the present disclosure may be used for a mobile terminal device with weak computation capability.
The audio signal processing method provided by the present disclosure will be described below with reference to fig. 3.
Fig. 3 schematically shows a flowchart of an audio signal processing method according to an embodiment of the present disclosure.
As shown in fig. 3, the audio signal processing method 300 includes performing linear filtering on an original audio signal according to a reference signal and the original audio signal to obtain a first audio signal in operation S310.
According to an embodiment of the present disclosure, the original audio signal is the audio signal received by the microphone of the previous embodiments; it includes the audio signal from the user collected by the microphone and the audio signal from the speaker of the previous embodiments collected by the microphone. The reference signal comprises the audio signal sent to the speaker, acquired by software loopback capture.
According to an embodiment of the present disclosure, the audio signal from the speaker collected by the microphone includes a linear echo and a nonlinear echo. The first audio signal is an audio signal from which linear echoes in the original audio signal are removed.
Then, in operation S320, the first audio signal is non-linearly filtered according to the original audio signal and the first audio signal, to obtain a second audio signal.
According to an embodiment of the present disclosure, after the linear echo in the original audio signal is cancelled, the first audio signal further comprises a residual nonlinear echo. The second audio signal is an audio signal from which nonlinear echoes in the first audio signal are eliminated.
In operation S330, an output audio signal is determined according to the second audio signal.
According to the embodiment of the disclosure, the output audio signal can be obtained by applying gain processing to the second audio signal. For example, the output audio signal may be the user's audio relayed through a speaker: the mobile terminal device receives the original audio signal, performs linear and nonlinear filtering on it to obtain a second audio signal characterizing the user's audio, and uses that second audio signal as the output audio signal played by the speaker. As another example, the output audio signal may be an audio signal used for interaction with the user: the mobile terminal device receives the original audio signal, performs linear and nonlinear filtering on it to obtain a second audio signal characterizing the user's audio, and then generates, based on the second audio signal, an output audio signal that replies to the voice content of the second audio signal; this reply is output through a speaker.
According to the embodiment of the disclosure, gain processing may further be applied to the second audio signal to obtain the output audio signal. For example, the second audio signal may be amplified, improving its speech quality and the user's experience.
For example, the second audio signal may be gain processed using automatic gain control (Automatic Gain Control, abbreviated AGC).
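For illustration only, a minimal digital AGC might look like the following Python sketch; the target RMS level, smoothing factor, and function name are assumptions rather than details of the disclosure.

```python
import numpy as np

def simple_agc(frame: np.ndarray, state: float, target_rms: float = 0.1,
               smoothing: float = 0.9) -> tuple[np.ndarray, float]:
    """Scale one audio frame toward a target RMS level.

    `target_rms` and `smoothing` are illustrative values, not taken
    from the disclosure.
    """
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-8   # avoid division by zero
    gain = target_rms / rms
    # Smooth the gain over time so the volume does not jump between frames.
    state = smoothing * state + (1.0 - smoothing) * gain
    return frame * state, state
```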
According to an embodiment of the present disclosure, the reference signal, the original audio signal, the first audio signal, the second audio signal, and the output audio signal processed by the audio signal processing method 300 may be frequency-domain signals. The reference signal and the original audio signal received by the microphone are time-domain signals; they are converted from time-domain to frequency-domain signals before the filtering is performed. After the output audio signal is determined, it is converted from a frequency-domain signal back to a time-domain signal, and the time-domain output audio signal is played by a speaker.
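The time/frequency conversions described above are commonly implemented with a short-time Fourier transform and overlap-add resynthesis. A minimal Python sketch, with assumed frame and hop sizes:

```python
import numpy as np

FRAME = 512   # assumed frame length (samples)
HOP = 256     # assumed hop size (50% overlap)

def stft(x: np.ndarray) -> np.ndarray:
    """Split a time-domain signal into windowed frames and transform each."""
    window = np.hanning(FRAME)
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)   # shape: (n_frames, FRAME // 2 + 1)

def istft(spec: np.ndarray, length: int) -> np.ndarray:
    """Overlap-add the inverse-transformed frames back into a waveform."""
    window = np.hanning(FRAME)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=FRAME, axis=-1)):
        out[i * HOP:i * HOP + FRAME] += frame * window
        norm[i * HOP:i * HOP + FRAME] += window ** 2
    return out / np.maximum(norm, 1e-8)   # compensate the window overlap
```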
In accordance with an embodiment of the present disclosure, in the linear filtering the input signals comprise the original audio signal and the reference signal. When the original audio signal is linearly filtered based on the reference signal, the delay difference produced between the original audio signal and the reference signal by delay jitter can be estimated, and the linear filtering can then take that delay difference into account; this improves the accuracy of the linear filtering and reduces the speech distortion of the resulting first audio signal. In the nonlinear filtering, the input signals comprise the original audio signal and the first audio signal. Because the delay difference between the original audio signal and the reference signal is only an estimate, a small delay difference may remain between the original audio signal and the first audio signal. To further suppress the influence of delay jitter, the nonlinear filtering of the first audio signal can take the delay difference between the original audio signal and the first audio signal into account, improving the accuracy of the nonlinear filtering.
The audio signal processing method provided by the present disclosure will be described below with reference to fig. 4.
Fig. 4 schematically shows a schematic diagram of processing an audio signal according to an embodiment of the present disclosure.
As shown in fig. 4, the process 400 of processing an audio signal includes applying linear filtering 401 and nonlinear filtering 402 to an original audio signal S_input to obtain an output audio signal S_output.
According to embodiments of the present disclosure, the linear filtering 401 and the nonlinear filtering 402 may comprise a two-stage AEC algorithm, in which the first-stage AEC is the linear filtering 401 and the second-stage AEC is the nonlinear filtering 402. For example, the linear filtering 401 may be linear echo cancellation based on the LAEC algorithm, and the nonlinear filtering 402 may be a gated recurrent unit (GRU)-based model AEC algorithm.
According to embodiments of the present disclosure, the LAEC algorithm may include an adaptive filtering algorithm such as normalized least mean squares (NLMS). The NLMS algorithm processes the reference signal based on the filtering parameters and outputs an adapted reference signal. Then, according to the original audio signal S_input and the adapted reference signal, the original audio signal is linearly filtered to obtain the first audio signal. The first audio signal is associated with a filtering error, and the filtering error is associated with the reference signal and the filtering parameters. For example, the optimal filter coefficients may be solved for using the minimum mean square error. The LAEC algorithm cancels the linear echo in the original audio signal S_input stably and improves the echo cancellation capability of the linear filtering 401. In addition, because the LAEC algorithm occupies little memory, it can be applied to mobile terminal devices with weak computing capability.
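For illustration, a minimal time-domain NLMS echo canceller of the kind named above is sketched below in Python; the filter length, step size, and function name are assumptions, not values from the disclosure.

```python
import numpy as np

def nlms_cancel(ref: np.ndarray, mic: np.ndarray, taps: int = 256,
                mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Cancel the linear echo of `ref` contained in `mic` with NLMS.

    Returns the error signal e = mic - w @ ref_window, which plays the
    role of the first audio signal S1.  `taps`, `mu`, and `eps` are
    illustrative, not values from the disclosure.
    """
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]          # most recent reference samples
        y = w @ x                          # estimated linear echo
        e = mic[n] - y                     # echo-cancelled sample
        # Normalized step: the update is scaled by the reference power.
        w += (mu / (x @ x + eps)) * e * x
        out[n] = e
    return out
```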
According to embodiments of the present disclosure, the model AEC algorithm may employ a Wiener post-filter processing method, so that the echo cancellation amount is controllable. The original audio signal S_input is introduced into the model AEC algorithm, which reduces the delay difference with respect to the reference signal and optimizes the nonlinear filtering 402. In the model AEC algorithm, the mean square error (MSE) of the magnitude spectrum can be employed as the loss function to increase the echo cancellation amount of the nonlinear filtering 402. The deep-learning-based model AEC has a strong nonlinear learning capability and can cancel the linear and nonlinear echo remaining after LAEC.
The audio signal processing method provided by the present disclosure will be described in detail below with reference to fig. 5.
Fig. 5 schematically illustrates a schematic diagram of processing an audio signal according to another embodiment of the present disclosure.
As shown in fig. 5, the process 500 of processing an audio signal includes performing delay estimation 501 on a reference signal S_ref and an original audio signal S_input to obtain an aligned reference signal S_ref-delay; performing linear filtering 502 according to the aligned reference signal S_ref-delay and the original audio signal S_input to obtain a first audio signal S1; processing the first audio signal S1 and the original audio signal S_input with a deep learning model to obtain effective audio parameters H_x and echo audio parameters H_r; and, according to the effective audio parameters H_x and the echo audio parameters H_r, performing nonlinear filtering on the first audio signal S1 to obtain a second audio signal S2.
According to an embodiment of the present disclosure, performing linear filtering on the original audio signal according to the reference signal and the original audio signal to obtain the first audio signal may include: acquiring a reference signal S_ref and an original audio signal S_input; aligning the reference signal S_ref with the original audio signal S_input to obtain an aligned reference signal S_ref-delay; and linearly filtering the original audio signal S_input using the aligned reference signal S_ref-delay to obtain a first audio signal S1.
According to an embodiment of the present disclosure, there is a delay difference between the reference signal S_ref obtained by software loopback capture and the speaker audio mixed into the original audio signal S_input received by the microphone: the reference signal S_ref leads that speaker audio, for example by 20 ms to 600 ms.
According to the embodiment of the disclosure, the reference signal s_ref is delayed by a corresponding delay difference based on the original audio signal s_input, so that the reference signal s_ref is aligned with the original audio signal s_input, and an aligned reference signal s_ref-delay is obtained.
According to an embodiment of the present disclosure, the step of aligning the reference signal with the original audio signal to obtain an aligned reference signal may include: determining a correlation between the reference signal and the original audio signal using a generalized cross-correlation phase transformation; determining a signal delay between the reference signal and the original audio signal according to the correlation; and aligning the reference signal with the original audio signal according to the signal delay.
According to an embodiment of the present disclosure, the delay difference between the reference signal S_ref and the original audio signal S_input is estimated by computing the correlation between the two signals using the generalized cross-correlation with phase transform (GCC-PHAT). The smaller the reverberation, the greater the correlation between the reference signal S_ref and the original audio signal S_input. To reduce the computation of the delay difference, the calculation can be restricted to the frequency bins of the reference signal S_ref and the original audio signal S_input that are most correlated with human speech.
According to an embodiment of the present disclosure, in the case where the detected delay difference between the reference signal S_ref and the original audio signal S_input is less than 5 ms, the delay difference may be ignored, saving the computation generated by the delay estimation 501.
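A minimal Python sketch of GCC-PHAT delay estimation as described above, including the strategy of ignoring small delays; the sample rate, the threshold handling, and the function name are assumptions:

```python
import numpy as np

def gcc_phat_delay(ref: np.ndarray, mic: np.ndarray, fs: int = 16000,
                   min_delay_ms: float = 5.0) -> int:
    """Estimate the delay (in samples) of `ref` inside `mic` via GCC-PHAT.

    Returns 0 when the estimated delay is below `min_delay_ms`, mirroring
    the strategy of ignoring small delay differences.  `fs` and the
    threshold are assumed values.
    """
    n = len(ref) + len(mic)
    R = np.fft.rfft(ref, n=n)
    M = np.fft.rfft(mic, n=n)
    cross = M * np.conj(R)
    # Phase transform: keep only phase information to sharpen the peak.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    delay = int(np.argmax(cc[: n // 2]))   # assume ref leads mic
    return 0 if delay < fs * min_delay_ms / 1000 else delay
```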
According to an embodiment of the present disclosure, performing nonlinear filtering on the first audio signal according to the original audio signal and the first audio signal to obtain the second audio signal may include: processing the original audio signal S_input and the first audio signal S1 with a deep learning model to obtain effective audio parameters H_x and echo audio parameters H_r; and, according to the effective audio parameters H_x, the echo audio parameters H_r, and a preset filter coefficient, performing nonlinear filtering on the first audio signal S1 to obtain the second audio signal S2.
According to the embodiment of the disclosure, after sensing the difference between the output signal (the first audio signal S1) and the input signal (the original audio signal s_input) of the linear filtering 502, the neural network of the deep learning model may determine the residual echo in the first audio signal S1, thereby improving the cancellation capability of the nonlinear filtering 504 and the cancellation stability of the two-stage AEC scheme.
According to an embodiment of the present disclosure, the deep learning model includes a convolutional layer, a recurrent neural network, and a fully connected layer. Processing the original audio signal and the first audio signal using the deep learning model to obtain the effective audio parameters and the echo audio parameters may include: inputting the original audio signal S_input and the first audio signal S1 into the convolutional layer to obtain original audio signal features and first audio signal features; inputting the original audio signal features and the first audio signal features into the recurrent neural network to obtain difference features between them; and inputting the difference features into the fully connected layer to obtain the effective audio parameters H_x and the echo audio parameters H_r.
According to the embodiment of the disclosure, the original audio signal s_input is introduced into the model AEC, and the original audio signal features are used as the input features of the model AEC, so that the model AEC input features are optimized. The recurrent neural network can more accurately find the residual echo in the first audio signal S1 by perceiving the difference between the output signal of LAEC (first audio signal S1) and the input signal (original audio signal s_input). In addition, due to the delay estimation 501, the delay difference between the original audio signal s_input and the first audio signal S1 input to the model AEC is small, so that the elimination effect of the model AEC and the stability of the model AEC can be improved.
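As an illustrative sketch only, a convolution + GRU + fully-connected network of the shape described above might be written as follows in PyTorch; all layer sizes, the sigmoid mask heads, and the class name are assumptions, not the patented model:

```python
import torch
from torch import nn

class MaskAEC(nn.Module):
    """Illustrative network: per-frame spectral features of the original
    signal and of the LAEC output pass through convolutions, a GRU, and
    a fully connected layer that emits two masks (H_x, H_r)."""

    def __init__(self, n_bins: int = 257, hidden: int = 128):
        super().__init__()
        # One convolution branch per input signal, as described above.
        self.conv_orig = nn.Conv1d(n_bins, hidden, kernel_size=3, padding=1)
        self.conv_laec = nn.Conv1d(n_bins, hidden, kernel_size=3, padding=1)
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2 * n_bins)   # H_x and H_r heads

    def forward(self, orig_mag, laec_mag):
        # Inputs: (batch, frames, n_bins) magnitude spectra.
        f1 = self.conv_orig(orig_mag.transpose(1, 2)).transpose(1, 2)
        f2 = self.conv_laec(laec_mag.transpose(1, 2)).transpose(1, 2)
        diff, _ = self.gru(torch.cat([f1, f2], dim=-1))
        h = torch.sigmoid(self.fc(diff))           # masks in (0, 1)
        h_x, h_r = h.chunk(2, dim=-1)
        return h_x, h_r
```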
According to an embodiment of the present disclosure, in the nonlinear filtering 504 a Wiener post-filter processing method is employed to determine the second audio signal S2. For example, in determining the effective audio parameters H_x and the echo audio parameters H_r, the deep learning model 503 may use mask learning, and the resulting mask filters the original audio signal S_input to obtain a clean second audio signal S2.
According to an embodiment of the present disclosure, the second audio signal S2 may be determined by formula (1), in which alpha is a preset filter coefficient, a constant greater than 0 and not greater than 2. The final echo cancellation amount can be controlled by adjusting the preset filter coefficient alpha: the larger its value, the greater the cancellation amount. Because the value of the preset filter coefficient is adjustable, the cancellation amount of the model AEC becomes controllable.
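Formula (1) itself is not reproduced in this text; the following Python sketch shows one common Wiener-style post-filter consistent with the surrounding description, where the exact mask form and its application to S_input are assumptions:

```python
import numpy as np

def wiener_postfilter(spec_in: np.ndarray, h_x: np.ndarray,
                      h_r: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Apply a Wiener-style mask built from H_x and H_r.

    One plausible form consistent with the description; the exact
    formula (1) of the patent is not reproduced here.  Larger `alpha`
    (0 < alpha <= 2) suppresses the echo more aggressively.
    """
    mask = h_x ** alpha / (h_x ** alpha + h_r ** alpha + 1e-12)
    return mask * spec_in   # spectrum of the second audio signal S2
```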
According to the embodiment of the disclosure, under the condition that the elimination amount of the mobile terminal equipment cannot meet the elimination requirement, the echo elimination effect of the mobile terminal equipment can be improved by adjusting the preset filter coefficient alpha, so that more types of mobile terminal equipment are compatible.
According to an embodiment of the present disclosure, the deep learning model is implemented with 8-bit integer data. Using model quantization, model parameters with a larger bit width are mapped to parameters with a smaller bit width, further reducing the memory consumption and computation of the mobile terminal device. For example, float32 model parameters can be mapped to int8, which is feasible in practical applications.
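A minimal Python sketch of symmetric per-tensor int8 quantization of the kind described; the scaling scheme is one simple choice, not necessarily the one used:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 with a per-tensor scale
    (symmetric quantization; one simple illustrative scheme)."""
    scale = np.abs(w).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation or inspection."""
    return q.astype(np.float32) * scale
```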
Fig. 6 schematically illustrates a schematic diagram of processing an audio signal according to another embodiment of the present disclosure.
As shown in fig. 6, the reference signal S_ref can be detected by loopback capture during periods T1 and T3, but cannot be detected during period T2. Therefore, to reduce the computation and memory footprint of the mobile terminal device as much as possible, a switch-off strategy may be adopted. For example, before linearly filtering the original audio signal according to the reference signal and the original audio signal, the audio signal processing method may further include: periodically detecting the reference signal S_ref and the original audio signal S_input at predetermined time intervals; and, in response to detecting the reference signal S_ref, performing the operation of linearly filtering the original audio signal S_input according to the reference signal S_ref and the original audio signal S_input.
According to the embodiment of the disclosure, because different mobile terminal devices differ in computing speed and capacity, the two-stage AEC algorithm may occupy considerable memory and computation time on devices with low computing capability. To reduce the computation and memory footprint as much as possible, the reference signal is detected first; when the reference signal S_ref is not detected, the mobile terminal device need not perform any echo cancellation on the original audio signal S_input.
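A minimal Python sketch of such a switch-off strategy; the energy-based detection, the threshold, and the `aec_pipeline` callable are assumptions:

```python
import numpy as np

def process_frame(ref_frame, mic_frame, aec_pipeline,
                  energy_threshold: float = 1e-6):
    """Skip the two-stage AEC when no reference signal is detected.

    Detection here is a simple energy test on the captured reference;
    the threshold and detection method are assumptions.
    """
    if ref_frame is None or np.mean(ref_frame ** 2) < energy_threshold:
        return mic_frame                             # period T2: no echo
    return aec_pipeline(ref_frame, mic_frame)        # periods T1 / T3
```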
According to the audio signal processing method of the present disclosure, a two-stage AEC algorithm is adopted; through Wiener post-filtering, delay estimation (TDE), feeding the original audio signal into the model AEC, model quantization, and the switch-off strategy, the AEC algorithm runs in the upper-layer application of the mobile terminal device as a pure software program and is therefore compatible with many kinds of mobile terminal devices.
According to the embodiment of the disclosure, a training method of the deep learning model is also provided. With this training method, a deep learning model according to an embodiment of the present disclosure may be obtained, and the effective audio parameters and echo audio parameters described above may be obtained with the trained deep learning model. The training method of the deep learning model provided by the present disclosure will be described below with reference to fig. 7.
Fig. 7 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 7, the training method 700 of the deep learning model includes processing a sample original audio signal and a sample mixed audio signal using the deep learning model to obtain output effective audio parameters and output echo audio parameters in operation S710.
Then, in operation S720, parameters of the deep learning model are adjusted according to the difference value between the output valid audio parameter and the valid audio parameter tag and the difference value between the output echo audio parameter and the echo audio parameter tag, so as to obtain a trained deep learning model.
According to the embodiment of the disclosure, because different mobile terminal devices have different hardware, the nonlinear effects they produce also differ. To cover the AEC processing of various mobile terminal devices, the original audio signals and mixed audio signals of various mobile terminal devices can be collected as the sample original audio signals and sample mixed audio signals used to train the deep neural network.
According to an embodiment of the present disclosure, the inputs of the deep learning model are a sample original audio signal and a sample mixed audio signal, and its outputs are the output effective audio parameters and the output echo audio parameters. The sample original audio signal is an original audio signal acquired by a microphone of a mobile terminal device, and the sample mixed audio signal is an audio signal that has been processed by linear filtering.
According to an embodiment of the present disclosure, a deep learning model includes a convolutional layer, a recurrent neural network, and a fully connected layer. In operation S710, processing the sample original audio signal and the sample mixed audio signal using the deep learning model to obtain the output effective audio parameter and the output echo audio parameter may include: inputting the sample original audio signal and the sample mixed audio signal into a convolution layer to obtain the original audio signal characteristics and the mixed audio signal characteristics; inputting the original audio signal characteristics and the mixed audio signal characteristics into a cyclic neural network to obtain difference characteristics between the original audio signal characteristics and the mixed audio signal characteristics; and inputting the difference characteristic into the full connection layer to obtain an output effective audio parameter and an output echo audio parameter.
According to embodiments of the present disclosure, the convolutional layer may comprise 3 to 5 convolution layers; for example, it may include a sample-original-audio-signal convolution layer and a sample-mixed-audio-signal convolution layer. The sample original audio signal and the sample mixed audio signal are input into these two convolution layers, respectively, to obtain the original audio signal features and the mixed audio signal features. The convolutional layer may also include a sample-reference-signal convolution layer, into which the sample reference signal is input to obtain reference signal features. The recurrent neural network is a GRU-based recurrent neural network structure.
According to an embodiment of the present disclosure, adjusting parameters of a deep learning model according to a difference value between an output effective audio parameter and an effective audio parameter tag and a difference value between an output echo audio parameter and an echo audio parameter tag includes: determining a first amplitude spectrum mean square error between the output effective audio parameters and the effective audio parameter labels; determining a second amplitude spectrum mean square error between the output echo audio parameters and the echo audio parameter labels; and adjusting parameters of the deep learning model according to the first amplitude spectrum mean square error and the second amplitude spectrum mean square error.
According to the embodiment of the disclosure, the effective audio parameter labels and the echo audio parameter labels are the training set data of the deep learning model training process; the sample data is trained against these labels so as to learn the parameters of the neural network.
According to the embodiment of the disclosure, using the magnitude-spectrum mean square error as the loss function during training keeps the speech distortion of the audio signal small and improves the speech recognition rate.
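A minimal PyTorch sketch of the two magnitude-spectrum MSE terms described above; the equal weighting of the two terms and the function name are assumptions:

```python
import torch
from torch import nn

mse = nn.MSELoss()

def aec_training_loss(h_x_pred, h_r_pred, h_x_label, h_r_label):
    """Sum of the two magnitude-spectrum MSE terms described above.

    Equal weighting is an assumption; the disclosure only states that
    both differences are used to adjust the model parameters.
    """
    loss_effective = mse(h_x_pred, h_x_label)   # first magnitude-spectrum MSE
    loss_echo = mse(h_r_pred, h_r_label)        # second magnitude-spectrum MSE
    return loss_effective + loss_echo
```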
According to an embodiment of the present disclosure, the sample original audio signal and the sample mixed audio signal are frequency domain signals, and the training object of the deep learning model is a frequency domain signal.
According to an embodiment of the present disclosure, the deep learning model is implemented with 8-bit integer data. Model quantization is used to map model parameters with a larger bit width to parameters with a smaller bit width, further reducing memory consumption and computation. For example, float32 model parameters can be mapped to int8.
The audio signal processing apparatus provided by the present disclosure will be described below with reference to fig. 8.
Fig. 8 schematically shows a block diagram of an audio signal processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the audio signal processing apparatus 800 includes a first filtering module 810, a second filtering module 820, and a determining module 830.
The first filtering module 810 is configured to perform linear filtering on the original audio signal according to the reference signal and the original audio signal, so as to obtain a first audio signal.
The second filtering module 820 is configured to perform nonlinear filtering on the first audio signal according to the original audio signal and the first audio signal, so as to obtain a second audio signal.
The determining module 830 is configured to determine an output audio signal according to the second audio signal.
According to an embodiment of the present disclosure, the first filtering module 810 includes an acquisition unit, an alignment unit, and a first filtering unit. The acquisition unit is used for acquiring the reference signal and the original audio signal. The alignment unit is used for aligning the reference signal with the original audio signal to obtain an aligned reference signal. The first filtering unit is used for performing linear filtering on the original audio signal by using the aligned reference signal to obtain a first audio signal.
According to an embodiment of the present disclosure, the second filtering module 820 includes a processing unit and a second filtering unit. The processing unit is used for processing the original audio signal and the first audio signal by utilizing the deep learning model to obtain effective audio parameters and echo audio parameters. The second filtering unit is used for performing nonlinear filtering on the first audio signal according to the effective audio parameter, the echo audio parameter and the preset filtering coefficient to obtain a second audio signal.
According to an embodiment of the present disclosure, a deep learning model includes a convolutional layer, a recurrent neural network, and a fully connected layer.
According to an embodiment of the present disclosure, the processing unit is further configured to: inputting the original audio signal and the first audio signal into a convolution layer to obtain the original audio signal characteristics and the first audio signal characteristics; inputting the original audio signal characteristics and the first audio signal characteristics into a cyclic neural network to obtain difference characteristics between the original audio signal characteristics and the first audio signal characteristics; and inputting the difference characteristic into the full connection layer to obtain the effective audio parameter and the echo audio parameter.
According to an embodiment of the present disclosure, the preset filter coefficient is a constant greater than 0 and not greater than 2.
According to an embodiment of the present disclosure, the audio signal processing apparatus 800 further includes: the device comprises a detection module and an execution module. The detection module is used for periodically detecting the reference signal and the original audio signal at preset time intervals. The execution module is used for responding to the detection of the reference signal and executing the operation of linear filtering on the original audio signal according to the reference signal and the original audio signal.
According to an embodiment of the present disclosure, the alignment unit is further for determining a correlation between the reference signal and the original audio signal using a generalized cross-correlation phase transformation; determining a signal delay between the reference signal and the original audio signal according to the correlation; and aligning the reference signal with the original audio signal according to the signal delay.
According to an embodiment of the present disclosure, the deep learning model is implemented based on 8-bit integer data.
The training apparatus of the deep learning model provided by the present disclosure will be described below with reference to fig. 9.
Fig. 9 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 9, the training apparatus 900 of the deep learning model includes a processing module 910 and an adjusting module 920.
The processing module 910 is configured to process the sample original audio signal and the sample mixed audio signal by using the deep learning model, so as to obtain output valid audio parameters and output echo audio parameters.
The adjusting module 920 is configured to adjust parameters of the deep learning model according to a difference value between the output valid audio parameter and the valid audio parameter tag and a difference value between the output echo audio parameter and the echo audio parameter tag, so as to obtain a trained deep learning model.
According to an embodiment of the present disclosure, a deep learning model includes a convolutional layer, a recurrent neural network, and a fully connected layer.
According to an embodiment of the present disclosure, the processing module 910 includes a first input unit, a second input unit, and a third input unit. And the first input unit is used for inputting the sample original audio signal and the sample mixed audio signal into the convolution layer to obtain the original audio signal characteristics and the mixed audio signal characteristics. And the second input unit is used for inputting the original audio signal characteristics and the mixed audio signal characteristics into the cyclic neural network to obtain difference characteristics between the original audio signal characteristics and the mixed audio signal characteristics. And the third input unit is used for inputting the difference characteristic into the full-connection layer to obtain the output effective audio parameter and the output echo audio parameter.
According to an embodiment of the present disclosure, the adjustment module 920 includes a first determining unit, a second determining unit, and an adjusting unit. The first determining unit is used to determine a first magnitude-spectrum mean square error between the output effective audio parameters and the effective audio parameter labels. The second determining unit is used to determine a second magnitude-spectrum mean square error between the output echo audio parameters and the echo audio parameter labels. The adjusting unit is used to adjust the parameters of the deep learning model according to the first and second magnitude-spectrum mean square errors.
According to an embodiment of the present disclosure, the sample raw audio signal and the sample mixed audio signal are frequency domain signals.
According to an embodiment of the present disclosure, the sample mixed audio signal is an audio signal processed by linear filtering.
According to an embodiment of the present disclosure, the deep learning model is implemented based on 8-bit integer data.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 schematically illustrates a block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as an audio signal processing method and a training method of a deep learning model. For example, in some embodiments, the audio signal processing method and the training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the above-described audio signal processing method and training method of the deep learning model may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the audio signal processing method and the training method of the deep learning model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (30)

1. An audio signal processing method, comprising:
according to a reference signal and an original audio signal, performing linear filtering on the original audio signal to obtain a first audio signal;
processing the original audio signal and the first audio signal by using a deep learning model to obtain effective audio parameters and echo audio parameters;
performing nonlinear filtering on the first audio signal according to the effective audio parameters, the echo audio parameters and an adjustable preset filter coefficient to obtain a second audio signal; and
determining an output audio signal according to the second audio signal.
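For orientation, the following Python sketch shows one way the three claimed stages could be chained per STFT frame. It is a minimal illustration under stated assumptions, not the patent's implementation: the names (process_frame, linear_filter, mask_model, beta) are invented, and the Wiener-like gain in the third step is only one plausible way to combine the two parameters with the adjustable coefficient.

```python
def process_frame(ref_spec, mic_spec, linear_filter, mask_model, beta=1.0):
    """Hypothetical per-frame pipeline for claim 1.

    ref_spec / mic_spec: complex STFT frames of the reference and the
    original (microphone) signals; beta is the adjustable preset filter
    coefficient, assumed to be a constant in (0, 2] per claim 5.
    """
    # Linear filtering of the original signal using the reference signal
    # yields the first audio signal (linear echo removed).
    first = linear_filter(ref_spec, mic_spec)

    # The deep learning model maps the original and first signals to the
    # effective (near-end) and echo audio parameters, e.g. per-bin gains.
    valid, echo = mask_model(mic_spec, first)

    # Nonlinear filtering of the first signal from the two parameters and
    # beta; a Wiener-like suppression gain is an assumption here, not the
    # patent's stated formula.
    gain = valid / (valid + beta * echo + 1e-8)
    second = gain * first

    # The output audio signal is determined from the second signal.
    return second
```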
2. The method of claim 1, wherein linearly filtering the original audio signal based on a reference signal and the original audio signal to obtain a first audio signal comprises:
acquiring a reference signal and an original audio signal;
aligning the reference signal with the original audio signal to obtain an aligned reference signal; and
linearly filtering the original audio signal by using the aligned reference signal to obtain the first audio signal.
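Claim 2 leaves the linear filter unspecified; a normalized least-mean-squares (NLMS) adaptive canceller is a common choice for this stage. The sketch below is a time-domain assumption with illustrative names (nlms_cancel, taps, mu), not the patent's stated algorithm.

```python
import numpy as np

def nlms_cancel(aligned_ref, mic, taps=256, mu=0.5, eps=1e-8):
    """Hypothetical NLMS realization of the linear filtering in claim 2:
    estimate the linear echo from the aligned reference and subtract it,
    leaving the first audio signal. Assumes equal-length signals."""
    w = np.zeros(taps)                     # adaptive filter weights
    first = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = aligned_ref[n - taps:n][::-1]  # most recent reference samples
        echo_hat = w @ x                   # linear echo estimate
        e = mic[n] - echo_hat              # residual = first audio signal
        w += mu * e * x / (x @ x + eps)    # normalized weight update
        first[n] = e
    return first
```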
3. The method of claim 1, wherein the deep learning model comprises a convolutional layer, a recurrent neural network, and a fully connected layer.
4. The method of claim 3, wherein the processing the original audio signal and the first audio signal by using the deep learning model to obtain the effective audio parameters and the echo audio parameters comprises:
inputting the original audio signal and the first audio signal into the convolution layer to obtain original audio signal characteristics and first audio signal characteristics;
inputting the original audio signal characteristics and the first audio signal characteristics into the recurrent neural network to obtain difference characteristics between the original audio signal characteristics and the first audio signal characteristics; and
inputting the difference characteristics into the fully connected layer to obtain the effective audio parameters and the echo audio parameters.
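Claim 4 only names the three layer types; a concrete reading in PyTorch might look like the sketch below, where the layer sizes, the GRU choice, the channel stacking of the two magnitude spectra, and the sigmoid outputs are all assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

class EchoMaskNet(nn.Module):
    """Sketch of the claim-4 structure: a convolutional layer over the
    stacked input spectra, a recurrent network over time, and a fully
    connected layer emitting the two parameter sets."""

    def __init__(self, bins=257, hidden=128):
        super().__init__()
        # Convolutional layer: the original and first-signal magnitude
        # spectra enter as stacked features; per-frame features come out.
        self.conv = nn.Conv1d(2 * bins, 256, kernel_size=3, padding=1)
        # Recurrent network: tracks the difference between the two
        # signals' characteristics across frames.
        self.rnn = nn.GRU(256, hidden, batch_first=True)
        # Fully connected layer: maps the difference features to the
        # effective and echo audio parameters (here per-bin gains).
        self.fc = nn.Linear(hidden, 2 * bins)

    def forward(self, mic_mag, first_mag):
        # mic_mag, first_mag: (batch, bins, frames) magnitude spectrograms.
        x = torch.cat([mic_mag, first_mag], dim=1)        # (B, 2*bins, T)
        feats = torch.relu(self.conv(x)).transpose(1, 2)  # (B, T, 256)
        diff, _ = self.rnn(feats)                         # (B, T, hidden)
        params = torch.sigmoid(self.fc(diff))             # (B, T, 2*bins)
        valid, echo = params.chunk(2, dim=-1)             # two (B, T, bins)
        return valid, echo
```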
5. The method of claim 1, wherein the preset filter coefficient is a constant greater than 0 and not greater than 2.
6. The method of claim 1, further comprising, prior to the linearly filtering the original audio signal according to the reference signal and the original audio signal:
periodically detecting a reference signal and an original audio signal at predetermined time intervals; and
in response to detecting the reference signal, performing the operation of linearly filtering the original audio signal according to the reference signal and the original audio signal.
7. The method of claim 2, wherein the aligning the reference signal with the original audio signal to obtain an aligned reference signal comprises:
determining a correlation between the reference signal and the original audio signal by using a generalized cross-correlation phase transform (GCC-PHAT);
determining a signal delay between the reference signal and the original audio signal according to the correlation; and
the reference signal is aligned with the original audio signal according to the signal delay.
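A standard GCC-PHAT delay estimator of the kind recited in claim 7 can be sketched as follows; the sampling rate, the search window, and the circular np.roll alignment are illustrative simplifications, not values from the patent.

```python
import numpy as np

def gcc_phat_delay(ref, mic, fs=16000, max_delay=0.5):
    """Estimate how far the reference lags inside the microphone signal
    using the generalized cross-correlation phase transform (claim 7)."""
    n = len(ref) + len(mic)
    cross = np.fft.rfft(mic, n) * np.conj(np.fft.rfft(ref, n))
    # Phase transform: discard the magnitude so the correlation peak stays
    # sharp regardless of the signal's spectral coloration.
    cross /= np.abs(cross) + 1e-12
    corr = np.fft.irfft(cross, n)
    shift = int(fs * max_delay)
    corr = np.concatenate((corr[-shift:], corr[:shift + 1]))
    return np.argmax(np.abs(corr)) - shift

# Per claim 7, the reference is then shifted by the estimated delay;
# np.roll is a circular stand-in for proper padding/truncation.
# aligned_ref = np.roll(ref, gcc_phat_delay(ref, mic))
```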
8. The method of claim 1, wherein the deep learning model is implemented based on 8-bit integer data.
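Claim 8 recites an 8-bit integer implementation without naming a scheme; one readily available route is PyTorch dynamic quantization, shown below on the hypothetical EchoMaskNet from the earlier sketch. Dynamic quantization covers the Linear and GRU modules, storing their weights as int8, while the convolution stays in floating point.

```python
import torch

# Assumed: EchoMaskNet is the illustrative model sketched under claim 4.
model = EchoMaskNet()
int8_model = torch.quantization.quantize_dynamic(
    model,                            # float32 model to convert
    {torch.nn.Linear, torch.nn.GRU},  # module types to store as int8
    dtype=torch.qint8,
)
```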
9. A training method of a deep learning model, comprising:
processing a sample original audio signal and a sample mixed audio signal by using a deep learning model to obtain an output effective audio parameter and an output echo audio parameter, wherein the sample mixed audio signal is an audio signal processed by linear filtering; and
adjusting parameters of the deep learning model according to a difference between the output effective audio parameter and an effective audio parameter label and a difference between the output echo audio parameter and an echo audio parameter label, to obtain a trained deep learning model.
10. The method of claim 9, wherein the deep learning model comprises a convolutional layer, a recurrent neural network, and a fully connected layer.
11. The method of claim 10, wherein the processing the sample original audio signal and the sample mixed audio signal by using the deep learning model to obtain the output effective audio parameter and the output echo audio parameter comprises:
inputting the sample original audio signal and the sample mixed audio signal into the convolution layer to obtain original audio signal characteristics and mixed audio signal characteristics;
inputting the original audio signal characteristics and the mixed audio signal characteristics into the recurrent neural network to obtain difference characteristics between the original audio signal characteristics and the mixed audio signal characteristics; and
inputting the difference characteristics into the fully connected layer to obtain the output effective audio parameter and the output echo audio parameter.
12. The method of claim 9, wherein the adjusting parameters of the deep learning model according to the difference between the output effective audio parameter and the effective audio parameter label and the difference between the output echo audio parameter and the echo audio parameter label comprises:
determining a first amplitude spectrum mean square error between the output effective audio parameter and the effective audio parameter label;
determining a second amplitude spectrum mean square error between the output echo audio parameter and the echo audio parameter label; and
adjusting parameters of the deep learning model according to the first amplitude spectrum mean square error and the second amplitude spectrum mean square error.
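Claims 9 and 12 describe the parameter update in terms of two amplitude-spectrum mean square errors. One hypothetical training step is sketched below; summing the two losses with equal weight, the magnitude-spectrogram input format, and the function name are assumptions (the model could be the illustrative EchoMaskNet above).

```python
import torch.nn.functional as F

def training_step(model, optimizer, sample_original, sample_mixed,
                  valid_label, echo_label):
    """One assumed training step for claims 9 and 12. Inputs are taken to
    be magnitude spectrograms; labels must match the model's output shape."""
    out_valid, out_echo = model(sample_original, sample_mixed)
    loss = (F.mse_loss(out_valid, valid_label)   # first amplitude-spectrum MSE
            + F.mse_loss(out_echo, echo_label))  # second amplitude-spectrum MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```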
13. The method of claim 9, wherein the sample original audio signal and the sample mixed audio signal are frequency domain signals.
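The frequency-domain samples of claim 13 would typically come from a short-time Fourier transform; the frame parameters below (512-point FFT, 50% overlap) are common choices, not values taken from the patent.

```python
import torch

def to_frequency_domain(x, n_fft=512, hop=256):
    """Convert a time-domain waveform to the frequency-domain form that
    claim 13 assumes for the sample signals."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)  # (n_fft // 2 + 1, frames)
    return spec.abs()                       # magnitude spectrogram
```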
14. The method of claim 9, wherein the deep learning model is implemented based on 8-bit integer data.
15. An audio signal processing apparatus comprising:
a first filtering module configured to perform linear filtering on an original audio signal according to a reference signal and the original audio signal to obtain a first audio signal;
a processing module configured to process the original audio signal and the first audio signal by using a deep learning model to obtain effective audio parameters and echo audio parameters;
a second filtering module configured to perform nonlinear filtering on the first audio signal according to the effective audio parameters, the echo audio parameters and an adjustable preset filter coefficient to obtain a second audio signal; and
a determining module configured to determine an output audio signal according to the second audio signal.
16. The apparatus of claim 15, wherein the first filtering module comprises:
an acquisition unit configured to acquire a reference signal and an original audio signal;
an alignment unit, configured to align the reference signal with the original audio signal, to obtain an aligned reference signal; and
a first filtering unit configured to linearly filter the original audio signal by using the aligned reference signal to obtain the first audio signal.
17. The apparatus of claim 15, wherein the deep learning model comprises a convolutional layer, a recurrent neural network, and a fully connected layer.
18. The apparatus of claim 17, wherein the processing module is further configured to:
inputting the original audio signal and the first audio signal into the convolution layer to obtain original audio signal characteristics and first audio signal characteristics;
inputting the original audio signal characteristics and the first audio signal characteristics into the recurrent neural network to obtain difference characteristics between the original audio signal characteristics and the first audio signal characteristics; and
inputting the difference characteristics into the fully connected layer to obtain the effective audio parameters and the echo audio parameters.
19. The apparatus of claim 15, wherein the preset filter coefficient is a constant greater than 0 and not greater than 2.
20. The apparatus of claim 15, further comprising:
a detection module for periodically detecting the reference signal and the original audio signal at predetermined time intervals; and
an execution module for performing, in response to detection of the reference signal, the operation of linearly filtering the original audio signal according to the reference signal and the original audio signal.
21. The apparatus of claim 16, wherein the alignment unit is further configured to:
determine a correlation between the reference signal and the original audio signal by using a generalized cross-correlation phase transform (GCC-PHAT);
determining a signal delay between the reference signal and the original audio signal according to the correlation; and
the reference signal is aligned with the original audio signal according to the signal delay.
22. The apparatus of claim 15, wherein the deep learning model is implemented based on 8-bit integer data.
23. A training device for a deep learning model, comprising:
the processing module is used for processing the sample original audio signal and the sample mixed audio signal by using the deep learning model to obtain an output effective audio parameter and an output echo audio parameter, wherein the sample mixed audio signal is an audio signal processed by linear filtering; and
the adjustment module is used for adjusting parameters of the deep learning model according to the difference between the output effective audio parameter and the effective audio parameter label and the difference between the output echo audio parameter and the echo audio parameter label, to obtain a trained deep learning model.
24. The training device of claim 23, wherein the deep learning model comprises a convolutional layer, a recurrent neural network, and a fully connected layer.
25. The training device of claim 24, wherein the processing module comprises:
the first input unit is used for inputting the sample original audio signal and the sample mixed audio signal into the convolution layer to obtain original audio signal characteristics and mixed audio signal characteristics;
the second input unit is used for inputting the original audio signal characteristics and the mixed audio signal characteristics into the recurrent neural network to obtain difference characteristics between the original audio signal characteristics and the mixed audio signal characteristics; and
the third input unit is used for inputting the difference characteristics into the fully connected layer to obtain the output effective audio parameter and the output echo audio parameter.
26. The training device of claim 23, wherein the adjustment module comprises:
a first determining unit, configured to determine a first amplitude spectrum mean square error between the output effective audio parameter and the effective audio parameter label;
a second determining unit, configured to determine a second amplitude spectrum mean square error between the output echo audio parameter and the echo audio parameter label; and
an adjusting unit, configured to adjust parameters of the deep learning model according to the first amplitude spectrum mean square error and the second amplitude spectrum mean square error.
27. The training device of claim 23, wherein the sample original audio signal and the sample mixed audio signal are frequency domain signals.
28. The training device of claim 23, wherein the deep learning model is implemented based on 8-bit integer data.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8 or the method of any one of claims 9-14.
30. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8 or the method of any one of claims 9-14.
CN202310038041.9A 2023-01-10 2023-01-10 Audio signal processing method, training method, device, equipment and medium for model Active CN116013337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310038041.9A CN116013337B (en) 2023-01-10 2023-01-10 Audio signal processing method, training method, device, equipment and medium for model

Publications (2)

Publication Number Publication Date
CN116013337A CN116013337A (en) 2023-04-25
CN116013337B 2023-12-29

Family

ID=86023073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310038041.9A Active CN116013337B (en) 2023-01-10 2023-01-10 Audio signal processing method, training method, device, equipment and medium for model

Country Status (1)

Country Link
CN (1) CN116013337B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362843A (en) * 2021-06-30 2021-09-07 北京小米移动软件有限公司 Audio signal processing method and device
CN113763977A (en) * 2021-04-16 2021-12-07 腾讯科技(深圳)有限公司 Method, apparatus, computing device and storage medium for eliminating echo signal
CN114141224A (en) * 2021-11-30 2022-03-04 北京百度网讯科技有限公司 Signal processing method and device, electronic equipment and computer readable medium
CN114242100A (en) * 2021-12-16 2022-03-25 北京百度网讯科技有限公司 Audio signal processing method, training method and device, equipment and storage medium thereof
CN115083431A (en) * 2022-06-20 2022-09-20 北京达佳互联信息技术有限公司 Echo cancellation method and device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
US11323807B2 (en) Echo cancellation method and apparatus based on time delay estimation
CN109767783B (en) Voice enhancement method, device, equipment and storage medium
US11276414B2 (en) Method and device for processing audio signal using audio filter having non-linear characteristics to prevent receipt of echo signal
US20210327448A1 (en) Speech noise reduction method and apparatus, computing device, and computer-readable storage medium
CN111418010A (en) Multi-microphone noise reduction method and device and terminal equipment
CN105825864B Double-talk detection and echo cancellation method based on zero-crossing rate index
US11456007B2 (en) End-to-end multi-task denoising for joint signal distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) optimization
CN111968658B (en) Speech signal enhancement method, device, electronic equipment and storage medium
van Waterschoot et al. Double-talk-robust prediction error identification algorithms for acoustic echo cancellation
WO2021007841A1 (en) Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
CN114974280A (en) Training method of audio noise reduction model, and audio noise reduction method and device
CN114360562A (en) Voice processing method, device, electronic equipment and storage medium
CN114242100B (en) Audio signal processing method, training method, device, equipment and storage medium thereof
CN116013337B (en) Audio signal processing method, training method, device, equipment and medium for model
WO2024017110A1 (en) Voice noise reduction method, model training method, apparatus, device, medium, and product
US10650839B2 (en) Infinite impulse response acoustic echo cancellation in the frequency domain
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN114171043B (en) Echo determination method, device, equipment and storage medium
CN112489669B (en) Audio signal processing method, device, equipment and medium
CN113205824B (en) Sound signal processing method, device, storage medium, chip and related equipment
CN114302286A (en) Method, device and equipment for reducing noise of call voice and storage medium
CN112165558B (en) Method and device for detecting double-talk state, storage medium and terminal equipment
CN111048096B (en) Voice signal processing method and device and terminal
CN113055787A (en) Echo cancellation method, echo cancellation device, electronic equipment and storage medium
CN107170461B (en) Voice signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant