WO2023226234A1

WO2023226234A1 - Model training method and apparatus, and computer-readable non-transitory storage medium

Info

Publication number: WO2023226234A1
Application number: PCT/CN2022/117526
Authority: WO
Inventors: 林功艺
Original assignee: 神盾股份有限公司
Priority date: 2022-05-23
Filing date: 2022-09-07
Publication date: 2023-11-30
Also published as: WO2023226193A1; CN115294952A; TW202347318A; TW202347319A

Abstract

A model training method, a model training apparatus and a computer-readable non-transitory storage medium. The model training method comprises: processing a first audio signal on the basis of a prediction model so as to generate a first control instruction; on the basis of the first control instruction, generating an audio signal corresponding to the first control instruction as a second audio signal; outputting the second audio signal so as to suppress a third audio signal, wherein the time when the first audio signal occurs is earlier than the time when the third audio signal occurs; determining an audio error signal on the basis of the second audio signal and the third audio signal; in response to the audio error signal not meeting an error condition, adjusting the prediction model, and processing the first audio signal again on the basis of the prediction model until the audio error signal meets the error condition; and in response to the audio error signal meeting the error condition, keeping the prediction model unchanged.

Description

Model training method and device, non-transitory computer-readable storage medium

This application requires U.S. Provisional Patent Application No. 63/344,642 filed on May 23, 2022, U.S. Provisional Patent Application No. 63/351,439 filed on June 13, 2022, and U.S. Provisional Patent Application No. 63/351,439 filed on June 14, 2022. The priority of U.S. Provisional Patent Application No. 63/352,213 and the priority of PCT International Application No. PCT/CN2022/110275 filed on August 4, 2022 are hereby cited in full. The content is included as part of this application.

Technical field

Embodiments of the present disclosure relate to a model training method, a model training device, and a non-transitory computer-readable storage medium.

Background technique

At present, noise reduction methods mainly include active noise reduction and passive noise reduction. Active noise reduction uses the noise reduction system to generate an inverse signal that is equal to the external noise to neutralize the noise, thereby achieving the noise reduction effect. Passive noise reduction mainly achieves the noise reduction effect by forming a closed space around the object or using sound insulation materials to block external noise.

Active noise reduction can use the noise cancellation model to achieve the destructive superposition of backward inverted audio with the originally received audio (for example, noise) to achieve the effect of suppressing the audio. An active noise reduction process is as follows: First, the audio Vn generated by the sound source is received through the microphone, and the received audio Vn is sent to the processor. Then, the processor performs inversion processing on the audio Vn to generate inverted audio. Vn' and output the inverted audio Vn' to the speaker, and the speaker emits the inverted audio Vn'. The human ear can receive the inverted audio Vn’ and the audio Vn, and the inverted audio Vn’ and the audio Vn can be destructively superimposed to achieve the effect of suppressing the audio. In this active noise reduction, due to the time required for signal processing and signal transmission, the time of the inverted audio Vn' output by the speaker must lag behind the time of the audio Vn originally received by the microphone. Therefore, the human ear receives The time to the inverted audio Vn' must also lag behind the time when the human ear receives the audio Vn, and the silencing effect is poor, and may even be impossible to achieve. There must be a delay from the input end (i.e. microphone) to the output end (i.e. speaker). The lower the delay from the input end to the output end, the smaller the time difference between the human ear receiving the inverted audio Vn' and the received audio Vn, the smaller the noise reduction. The better. Therefore, active noise reduction has extremely strict requirements on end-to-end delay, so the architecture of the active noise reduction system must use high-speed analog-to-digital converters and high-speed computing hardware to achieve low latency and achieve better audio suppression effects. , resulting in high development costs and less elastic architecture. Therefore, how to avoid the impact of end-to-end delay on active noise reduction and how to achieve better audio suppression effects have become problems that need to be solved.

Currently, the cancellation model can be trained in advance and then applied to actual scenarios. However, due to the variety of audio signals in different scenarios, the number of training samples used to train the cancellation model is limited and cannot fully simulate the real environment. The audio signal in the training sample may not be exactly the same as the audio signal generated in the real environment, so the cancellation model may not be able to achieve the cancellation function. Therefore, how to make the silencing model more suitable for real environments, so that the silencing model can better achieve the effect of suppressing audio, and the insufficient number of samples used to train the silencing model have become problems that need to be solved.

Contents of the invention

In response to the above problems, at least one embodiment of the present disclosure provides a model training method, including: based on a prediction model, processing a first audio signal to generate a first control instruction; based on the first control instruction, generating a signal corresponding to the first control instruction. The audio signal corresponding to a control instruction is used as the second audio signal; the second audio signal is output to suppress the third audio signal, wherein the first audio signal appears earlier than the third audio signal. ; Based on the second audio signal and the third audio signal, determine an audio error signal; in response to the audio error signal not satisfying the error condition, adjust the prediction model, and adjust the prediction model again based on the prediction model The first audio signal is processed until the audio error signal satisfies the error condition; in response to the audio error signal satisfying the error condition, the prediction model is maintained unchanged.

For example, in the model training method provided by at least one embodiment of the present disclosure, the prediction model includes a neural network, and determining the audio error signal based on the second audio signal and the third audio signal includes: based on the second audio signal and the third audio signal. For the second audio signal and the third audio signal, a loss value is calculated through the loss function of the neural network, wherein the audio error signal includes the loss value.

For example, in the model training method provided by at least one embodiment of the present disclosure, adjusting the prediction model in response to the audio error signal not meeting the error condition includes: responding to the loss value not meeting the error condition. Error conditions, using the loss value to adjust the parameters of the neural network.

For example, in the model training method provided by at least one embodiment of the present disclosure, processing the first audio signal again based on the prediction model includes: in response to the audio error signal not meeting the error condition, Based on the neural network, the first audio signal is processed again to generate a second control instruction, wherein the second control instruction is different from the first control instruction; based on the second control instruction, generate and output the audio signal corresponding to the second control instruction as the second audio signal.

For example, in the model training method provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, and adjusting the prediction model in response to the audio error signal not meeting an error condition includes: responding to The audio error signal does not satisfy the error condition, audio feature coding is generated based on the first audio signal and the third audio signal, and the lookup table is adjusted based on the audio feature coding.

For example, in the model training method provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, and processing the first audio signal again based on the prediction model includes: responding to the audio error If the signal does not meet the error condition, the first audio signal is processed again to generate a second control instruction based on the lookup table, where the second control instruction is different from the first control instruction; based on The second control instruction generates and outputs an audio signal corresponding to the second control instruction as the second audio signal.

For example, in the model training method provided by at least one embodiment of the present disclosure, determining the audio error signal based on the second audio signal and the third audio signal includes: calculating the second audio signal and the The root mean square error between the third audio signals to obtain the audio error signal.

For example, in the model training method provided by at least one embodiment of the present disclosure, processing the first audio signal to generate the first control instruction based on the prediction model includes: obtaining the first audio signal; based on the prediction The model processes the first audio signal to predict a fourth audio signal; and generates the first control instruction based on the fourth audio signal.

For example, in the model training method provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, and processing the first audio signal based on the prediction model to predict a fourth audio signal includes: Generate a first audio feature code based on the first audio signal; query the lookup table based on the first audio feature code to obtain a second audio feature code; predict the third audio feature code based on the second audio feature code Four audio signals.

For example, in the model training method provided by at least one embodiment of the present disclosure, the phase of the second audio signal is opposite to the phase of the fourth audio signal.

For example, in the model training method provided by at least one embodiment of the present disclosure, the absolute value of the time difference between the time when the audio signal corresponding to the first control instruction is output and the time when the third audio signal starts to appear is less than the time threshold.

At least one embodiment of the present disclosure also provides a model training device, including: an instruction generation module configured to process the first audio signal to generate a first control instruction based on the prediction model; and the audio generation module is configured to process the first audio signal based on the prediction model. The first control instruction generates an audio signal corresponding to the first control instruction as a second audio signal; the output module is configured to output the second audio signal to suppress the third audio signal, wherein the third audio signal is An audio signal occurs earlier than the third audio signal; an error calculation module is configured to determine an audio error signal based on the second audio signal and the third audio signal; an adjustment module is configured In response to the audio error signal not meeting the error condition, the prediction model is adjusted; in response to the audio error signal meeting the error condition, the prediction model is kept unchanged; wherein the instruction generation module further configured to, in response to the audio error signal not satisfying an error condition, process the first audio signal again based on the prediction model until the audio error signal satisfies the error condition.

For example, in the model training device provided by at least one embodiment of the present disclosure, the prediction model includes a neural network, and when performing the operation of determining an audio error signal based on the second audio signal and the third audio signal , the error calculation module is configured to calculate a loss value through a loss function of the neural network based on the second audio signal and the third audio signal, wherein the audio error signal includes the loss value.

For example, in the model training device provided by at least one embodiment of the present disclosure, when performing the operation of adjusting the prediction model in response to the audio error signal not meeting an error condition, the adjustment module is configured to : In response to the loss value not meeting the error condition, use the loss value to adjust the parameters of the neural network.

For example, in the model training device provided by at least one embodiment of the present disclosure, when performing the operation of processing the first audio signal again based on the prediction model, the instruction generation module is configured to: respond to If the audio error signal does not meet the error condition, the first audio signal is processed again based on the neural network to generate a second control instruction, where the second control instruction is the same as the first control instruction. Not the same; the audio generation module is further configured to generate and output an audio signal corresponding to the second control instruction as the second audio signal based on the second control instruction.

For example, in the model training device provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, and the adjustment module includes a feature code generation submodule and a lookup table adjustment submodule, and the feature code generation submodule is configured To: in response to the audio error signal not satisfying the error condition, generate audio feature encoding based on the first audio signal and the third audio signal; the lookup table adjustment sub-module is configured to based on the audio feature Coding adjusts the lookup table.

For example, in the model training device provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, and when performing the operation of processing the first audio signal again based on the prediction model, the instruction The generation module is configured to: in response to the audio error signal not satisfying the error condition, process the first audio signal again to generate a second control instruction based on the lookup table, wherein the second control instruction The instruction is different from the first control instruction; the audio generation module is further configured to generate and output an audio signal corresponding to the second control instruction as the second audio signal based on the second control instruction.

For example, in the model training device provided by at least one embodiment of the present disclosure, when performing the operation of determining an audio error signal based on the second audio signal and the third audio signal, the error calculation module is configured is: calculating the root mean square error between the second audio signal and the third audio signal to obtain the audio error signal.

For example, in the model training device provided by at least one embodiment of the present disclosure, the instruction generation module includes an audio acquisition sub-module, a prediction sub-module and a generation sub-module, and the audio acquisition sub-module is configured to acquire the first audio signal; the prediction sub-module is configured to process the first audio signal based on the prediction model to predict a fourth audio signal; the generation sub-module is configured to generate the fourth audio signal based on the fourth audio signal. The first control instruction.

For example, in the model training device provided by at least one embodiment of the present disclosure, the prediction model includes a lookup table, the prediction sub-module includes a query unit and a prediction unit, the query unit is configured to based on the first audio signal Generate a first audio feature code; query the lookup table based on the first audio feature code to obtain a second audio feature code; the prediction unit is configured to predict the first audio feature code based on the second audio feature code. Four audio signals.

For example, in the model training device provided by at least one embodiment of the present disclosure, the phase of the second audio signal is opposite to the phase of the fourth audio signal.

For example, in the model training device provided by at least one embodiment of the present disclosure, the absolute value of the time difference between the time when the audio signal corresponding to the first control instruction is output and the time when the third audio signal starts to appear is less than the time threshold.

At least one embodiment of the present disclosure also provides a model training device, including: one or more memories non-transiently storing computer-executable instructions; one or more processors configured to run the computer-executable instructions, Wherein, the computer-executable instructions implement the model training method according to any embodiment of the present disclosure when run by the one or more processors.

At least one embodiment of the present disclosure also provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are implemented when executed by a processor. A model training method according to any embodiment of the present disclosure.

According to the model training method, model training device and non-transitory computer-readable storage medium provided by any embodiment of the present disclosure, the current audio signal (ie, the first audio signal) and the future audio signal (ie, the third audio signal) are used ) conduct real-time training of the prediction model to improve the accuracy of the prediction results output by the prediction model, avoid the problem that the prediction results based on the prediction model output cannot suppress future audio signals, and improve the effect of noise reduction based on the prediction model.

In addition, at least one embodiment of the present disclosure provides an audio processing method, including: generating a control instruction based on a first audio signal; generating a second audio signal based on the control instruction; and outputting the second audio signal to suppress the second audio signal. Three audio signals, wherein the sum of the phases of the second audio signal and the third audio signal is less than a phase threshold, and the first audio signal appears earlier than the third audio signal.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the outputting the second audio signal to suppress the third audio signal includes: based on the control instruction, determining the output of the second audio signal. The first moment; outputting the second audio signal at the first moment, wherein the third audio signal starts to appear from the second moment, and the absolute time difference between the first moment and the second moment is The value is less than the time threshold.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the time difference between the first moment and the second moment is 0.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, generating a control instruction based on a first audio signal includes: acquiring the first audio signal; processing the first audio signal to predict a fourth audio signal; based on the fourth audio signal, the control instruction is generated.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the second audio signal and/or the third audio signal and/or the fourth audio signal are periodic or intermittent time domain Signal.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, processing the first audio signal to predict a fourth audio signal includes: generating a first audio feature code based on the first audio signal ; Query the lookup table based on the first audio feature coding to obtain the second audio feature coding; predict and obtain the fourth audio signal based on the second audio feature coding.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the lookup table includes at least one first encoding field.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the lookup table further includes at least one second encoding field, and multiple first encoding fields constitute one second encoding field.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the second audio feature encoding includes at least one of the first encoding field and/or at least one of the second encoding field.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, obtaining the first audio signal includes: collecting an initial audio signal; performing downsampling processing on the initial audio signal to obtain the first audio signal. Signal.

For example, in the audio processing method provided by at least one embodiment of the present disclosure, obtaining the first audio signal includes: collecting an initial audio signal; filtering the initial audio signal to obtain the first audio signal .

For example, in the audio processing method provided by at least one embodiment of the present disclosure, the phase of the second audio signal is opposite to the phase of the third audio signal.

At least one embodiment of the present disclosure also provides an audio processing device, including: an instruction generation module configured to generate a control instruction based on a first audio signal; and an audio generation module configured to generate a second audio based on the control instruction. signal; an output module configured to output the second audio signal to suppress a third audio signal; wherein the sum of the phases of the second audio signal and the phase of the third audio signal is less than a phase threshold, the The first audio signal appears earlier than the third audio signal.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the output module includes a time determination sub-module and an output sub-module, and the time determination sub-module is configured to determine to output the first time based on the control instruction. The first moment of the second audio signal; the output sub-module is configured to output the second audio signal at the first moment, wherein the third audio signal begins to appear from the second moment, and the first moment The absolute value of the time difference between the second moment and the second moment is less than the time threshold.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the time difference between the first time and the second time is 0.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the instruction generation module includes an audio acquisition sub-module, a prediction sub-module and a generation sub-module, and the audio acquisition sub-module is configured to acquire the first audio signal; the prediction sub-module is configured to process the first audio signal to predict a fourth audio signal; the generation sub-module is configured to generate the control instruction based on the fourth audio signal.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the second audio signal and/or the third audio signal and/or the fourth audio signal are periodic or intermittent time domain Signal.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the prediction sub-module includes a query unit and a prediction unit, the query unit is configured to generate a first audio feature encoding based on the first audio signal and a prediction unit based on the first audio signal. The first audio feature coding queries a lookup table to obtain a second audio feature coding; the prediction unit is configured to predict the fourth audio signal based on the second audio feature coding.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the lookup table includes at least one first encoding field.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the lookup table further includes at least one second encoding field, and multiple first encoding fields constitute one second encoding field.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the second audio feature encoding includes at least one of the first encoding field and/or at least one of the second encoding field.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the audio acquisition sub-module includes a collection unit and a down-sampling processing unit, the collection unit is configured to collect an initial audio signal; the down-sampling processing unit is Configured to perform downsampling processing on the initial audio signal to obtain the first audio signal.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the audio acquisition sub-module includes an acquisition unit and a filtering unit, the acquisition unit is configured to acquire an initial audio signal; the filtering unit is configured to The initial audio signal is filtered to obtain the first audio signal.

For example, in the audio processing device provided by at least one embodiment of the present disclosure, the phase of the second audio signal is opposite to the phase of the third audio signal.

At least one embodiment of the present disclosure also provides an audio processing device, including: one or more memories non-transiently storing computer-executable instructions; one or more processors configured to run the computer-executable instructions, Wherein, the computer-executable instructions implement the audio processing method according to any embodiment of the present disclosure when run by the one or more processors.

At least one embodiment of the present disclosure also provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, which are implemented when executed by a processor. An audio processing method according to any embodiment of the present disclosure.

According to the audio processing method, audio processing device and non-transitory computer-readable storage medium provided by any embodiment of the present disclosure, an inversion of the future audio signal is generated by learning the characteristics of the current audio signal (ie, the first audio signal) audio signal (i.e., the second audio signal) to suppress the future audio signal (i.e., the third audio signal) to avoid the inversion audio signal due to the delay between the input end and the output end being out of sync with the audio signal that needs to be suppressed Problem, improving the noise cancellation effect can significantly reduce or even eliminate the impact of the input-to-output delay on noise cancellation, and the audio suppression effect is better than the backward active cancellation system commonly used in the industry.

Description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below. Obviously, the drawings in the following description only relate to some embodiments of the present disclosure and do not limit the present disclosure. .

Figure 1 is a schematic block diagram of an audio processing system provided by at least one embodiment of the present disclosure;

Figure 2A is a schematic flow chart of an audio processing method provided by at least one embodiment of the present disclosure;

Figure 2B is a schematic flow chart of step S10 shown in Figure 2A;

Figure 2C is a schematic flow chart of step S102 shown in Figure 2B;

Figure 3 is a schematic diagram of a first audio signal and a third audio signal provided by at least one embodiment of the present disclosure;

Figure 4 is a schematic diagram of a third audio signal and a fourth audio signal provided by at least one embodiment of the present disclosure;

Figure 5A is a schematic diagram of an audio signal provided by some embodiments of the present disclosure;

Figure 5B is an enlarged schematic diagram of the audio signal in the dotted rectangular frame P1 in Figure 5A;

Figure 6 is a schematic block diagram of an audio processing device provided by at least one embodiment of the present disclosure;

Figure 7 is a schematic block diagram of another audio processing device provided by at least one embodiment of the present disclosure;

Figure 8 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure;

Figure 9 is a schematic block diagram of a model training system provided by at least one embodiment of the present disclosure;

Figure 10A is a schematic flow chart of a model training method provided by at least one embodiment of the present disclosure;

Figure 10B is a schematic flow chart of step S200 shown in Figure 10A;

Figure 10C is a schematic flow chart of step S2002 shown in Figure 10B;

Figure 11 is a schematic diagram of a first audio signal and a third audio signal provided by at least one embodiment of the present disclosure;

Figure 12A is a schematic diagram of an audio error signal and the number of training iterations provided by at least one embodiment of the present disclosure;

Figure 12B is a schematic diagram between another audio error signal and the number of training iterations provided by at least one embodiment of the present disclosure;

Figure 13 is a schematic block diagram of a model training device provided by at least one embodiment of the present disclosure;

Figure 14 is a schematic block diagram of another model training device provided by at least one embodiment of the present disclosure; and

Figure 15 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings of the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present disclosure.

Unless otherwise defined, technical terms or scientific terms used in this disclosure shall have the usual meaning understood by a person with ordinary skill in the art to which this disclosure belongs. "First", "second" and similar words used in this disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. Words such as "include" or "comprising" mean that the elements or things appearing before the word include the elements or things listed after the word and their equivalents, without excluding other elements or things. Words such as "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.

In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits detailed descriptions of some well-known functions and well-known components.

At least one embodiment of the present disclosure provides an audio processing method. The audio processing method includes: generating a control instruction based on the first audio signal; generating a second audio signal based on the control instruction; and outputting the second audio signal to suppress the third audio signal. The sum of the phases of the second audio signal and the phase of the third audio signal is less than the phase threshold, and the first audio signal appears earlier than the third audio signal.

In the audio processing method provided by embodiments of the present disclosure, by learning the characteristics of the current audio signal (ie, the first audio signal), an inverted audio signal (ie, the second audio signal) of the future audio signal is generated to suppress the future audio signal (i.e., the third audio signal), to avoid the problem of out-of-synchronization between the inverted audio signal and the audio signal that needs to be suppressed due to the delay between the input end and the output end, and improve the noise canceling effect, which can significantly reduce or even eliminate the input end. Regarding the impact of output delay on noise cancellation, the audio suppression effect is better than the audio suppression effect of the backward active cancellation system commonly used in the industry.

Embodiments of the present disclosure also provide an audio processing device and a non-transitory computer-readable storage medium. The audio processing method can be applied to the audio processing device provided by the embodiment of the present disclosure, and the audio processing device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, a car headrest, etc. The mobile terminal may be a mobile phone, a headset, a tablet computer or other hardware devices.

The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

Figure 1 is a schematic block diagram of an audio processing system provided by at least one embodiment of the present disclosure. Figure 2A is a schematic flow chart of an audio processing method provided by at least one embodiment of the present disclosure. Figure 2B is shown in Figure 2A Figure 2C is a schematic flow chart of step S10 shown in Figure 2B. Figure 3 is a schematic diagram of a first audio signal and a third audio signal provided by at least one embodiment of the present disclosure.

The audio processing system shown in Figure 1 can be used to implement the audio processing method provided by any embodiment of the present disclosure, for example, the audio processing method shown in Figure 2A. As shown in Figure 1, the audio processing system may include an audio receiving part, an audio processing part and an audio output part. The audio receiving part can receive the audio signal Sn1 emitted by the sound source at time tt1, and then transmit the audio signal Sn1 to the audio processing part. The audio processing part processes the audio signal Sn1 to predict the inverted audio signal Sn2 of the future audio signal Sn3. ;Then the inverted audio signal Sn2 is output through the audio output section. The inverted audio signal Sn2 may be used to suppress future audio signals Sn3 generated by the sound source at time tt2 later than time tt1. For example, the target object (for example, a human ear, etc.) can receive the inverted audio signal Sn2 and the future audio signal Sn3 at the same time, so that the inverted audio signal Sn2 and the future audio signal Sn3 can be destructively superimposed, thereby achieving silence.

For example, the audio receiving part may include a microphone, an amplifier (for example, a microphone amplifier), an analog to digital converter (ADC), a downsampler, etc., and the audio processing part may include an AI engine and/or a digital signal Processor (Digital Signal Processing, DSP), etc., the audio output part can include an upsampler, a digital to analog converter (digital to analog converter, DAC), an amplifier (for example, a speaker amplifier), a speaker, etc.

As shown in Figure 2A, an audio processing method provided by one embodiment of the present disclosure includes steps S10 to S12. In step S10, a control instruction is generated based on the first audio signal; in step S11, a second audio signal is generated based on the control instruction; in step S12, the second audio signal is output to suppress the third audio signal.

For example, the first audio signal may be the audio signal Sn1 shown in FIG. 1 , the second audio signal may be the inverted audio signal Sn2 shown in FIG. 1 , and the third audio signal may be the future audio signal Sn3 shown in FIG. 1 .

For example, the audio receiving part can receive a first audio signal; the audio processing part can process the first audio signal to generate a control instruction, and generate a second audio signal based on the control instruction; the audio output part can output the second audio signal, thereby achieving Suppress third audio signal.

For example, the first audio signal appears earlier than the third audio signal. As shown in Figure 3, the time when the first audio signal starts to appear is t11, and the time when the third audio signal starts to appear is t21. On the time axis t, time t11 is earlier than time t21. For example, the time period during which the first audio signal exists may be the time period between time t11 and time t12, and the time period during which the third audio signal exists may be the time period between time t21 and time t22. Taking into account factors such as the time of the signal processing process, time t12 and time t21 may not be the same time, and time t12 is earlier than time t21.

It should be noted that, in the embodiment of the present disclosure, "the time period in which the audio signal exists or the time in which it appears" means the time period in which the audio corresponding to the audio signal exists or the time in which it appears.

For example, the sum of the phases of the second audio signal and the phase of the third audio signal is less than the phase threshold. The phase threshold can be set according to the actual situation, and this disclosure does not specifically limit this. For example, in some embodiments, the phase of the second audio signal is opposite to the phase of the third audio signal, so that complete silence can be achieved, that is, the third audio signal is completely suppressed. At this time, when the second audio signal and the third audio signal When received by an audio collection device (for example, a microphone, etc.), the error energy of the audio signal received by the audio collection device is 0; if the second audio signal and the third audio signal are received by the human ear, it is equivalent to the person not hearing the sound. .

For example, in some embodiments, the first audio signal may be the time-domain audio signal with the maximum volume (maximum amplitude) between time t11 and time t12, and the first audio signal is not an audio signal of a specific frequency, so the implementation of the present disclosure The audio processing method provided in the example does not need to extract spectral features from the audio signal to generate a spectrogram, which can simplify the audio signal processing process and save processing time.

For example, the first audio signal and the third audio signal may be audio signals generated by the external environment, machines, etc., the sound of machine operation, the sound of electric drills and electric saws during decoration, etc. For example, machines may include household appliances (air conditioners, range hoods, washing machines, etc.) and the like.

For example, in some embodiments, as shown in Figure 2B, step S10 may include steps S101 to 103. In step S101, a first audio signal is obtained; in step S102, the first audio signal is processed to predict Fourth audio signal; in step S103, a control instruction is generated based on the fourth audio signal. In the audio processing method provided by embodiments of the present disclosure, the audio signal (ie, the fourth audio signal) is predicted by learning the characteristics of the current audio signal (ie, the first audio signal).

For example, the fourth audio signal is a predicted future audio signal. For example, on the time axis, the time period in which the fourth audio signal exists is later than the time period in which the first audio signal exists, for example, the time period in which the fourth audio signal exists. The segment is the same as the time period in which the third audio signal exists, so the time period in which the fourth audio signal exists may also be the time period between time t21 and time t22 shown in FIG. 3 .

Figure 4 is a schematic diagram of a third audio signal and a fourth audio signal provided by at least one embodiment of the present disclosure. In the example shown in Figure 4, the horizontal axis represents time (Time), the vertical axis represents amplitude (Amplitude), and the amplitude can be expressed as a voltage value. As shown in Figure 4, in one embodiment, the predicted fourth audio signal is substantially the same as the third audio signal.

For example, in one embodiment, the third audio signal and the fourth audio signal may be exactly the same. In this case, the phase of the second audio signal finally generated based on the fourth audio signal is opposite to the phase of the third audio signal, thereby achieving complete Silencing.

For example, in step S102, processing the first audio signal to predict the fourth audio signal may include processing the first audio signal through a neural network to predict the fourth audio signal.

For example, neural networks may include recurrent neural networks, long short-term memory networks, or generative adversarial networks. In embodiments of the present disclosure, the characteristics of the audio signal can be learned based on artificial intelligence, thereby predicting the audio signal of a certain future time period that has not yet occurred, and thereby generating an inverted audio signal of the future time period to suppress the time period audio signal.

For example, in some embodiments, as shown in Figure 2C, step S102 may include steps S1021 to 1023. In step S1021, a first audio feature code is generated based on the first audio signal; in step S1022, based on the first audio signal, The feature coding queries the lookup table to obtain the second audio feature coding; in step S1023, based on the second audio feature coding, a fourth audio signal is predicted.

For example, the first audio signal may be an analog signal, and the first audio signal may be processed through an analog-to-digital converter to obtain a processed first audio signal. The processed first audio signal may be a digital signal. Based on the processed The first audio signal may generate a first audio feature code.

For another example, the first audio signal may be a digital signal, such as a PDM (Pulse-density-modulation, pulse density modulation) signal. In this case, the first audio feature code may be generated directly based on the first audio signal. PDM signals can be represented by binary numbers 0 and 1.

For example, any suitable encoding method may be used to implement the first audio feature encoding. For example, in some embodiments, when representing an audio signal, the changing state of the audio signal can be used to describe the audio signal, and multi-bits can be used to represent the changing state of the audio signal. For example, two bits (2bits) can be used to represent the changing state of the audio signal. In some examples, as shown in Table 1 below, 00 means that the audio signal becomes larger, 01 means that the audio signal becomes smaller, 10 means that there is no audio signal, and 11 means that there is no audio signal. The audio signal remains unchanged.

比特Bits	音频信号的变化状态The changing state of the audio signal
0000	音频信号变大Audio signal becomes louder
0101	音频信号变小Audio signal becomes smaller
1010	没有音频信号no audio signal
1111	音频信号不变Audio signal remains unchanged

Table 1

"The audio signal becomes larger" means that the amplitude of the audio signal in the unit time period (each time step) becomes larger with time, and "the audio signal becomes smaller" means that the amplitude of the audio signal in the unit time period increases with time. The time becomes smaller, "the audio signal remains unchanged" means that the amplitude of the audio signal in the unit time period does not change with time, and "no audio signal" means that there is no audio signal in the unit time period, that is, the amplitude of the audio signal is 0.

Figure 5A is a schematic diagram of an audio signal provided by some embodiments of the present disclosure. Figure 5B is an enlarged schematic diagram of the audio signal in the dotted rectangular box P1 in Figure 5A.

In Figure 5A, the abscissa is time (ms, milliseconds), and the ordinate is the amplitude of the audio signal (volts, volts). As shown in FIG. 5A , the audio signal V is a periodically changing signal, and the periodic pattern of the audio signal V is the pattern shown by the dotted rectangular frame P2.

As shown in Figure 5B, the amplitude of the audio signal represented by the waveform segment 30 does not change with time t, and the time corresponding to the waveform segment 30 is a unit time period, then the waveform segment 30 can be expressed as audio feature coding (11); similarly Ground, the amplitude of the audio signal represented by waveform segment 31 gradually increases with time t, and the time corresponding to waveform segment 31 is four unit time segments, then waveform segment 31 can be expressed as audio feature encoding (00,00,00, 00); the amplitude of the audio signal represented by waveform segment 32 remains unchanged with time t, the time corresponding to waveform segment 32 is a unit time period, and waveform segment 32 can be represented as audio feature encoding (11); represented by waveform segment 33 The amplitude of the audio signal gradually becomes smaller with time t, and the time corresponding to the waveform segment 33 is six unit time periods, then the waveform segment 33 can be expressed as the audio feature code (01,01,01,01,01,01); The amplitude of the audio signal represented by waveform segment 34 does not change with time t, and the time corresponding to waveform segment 34 is a unit time period, then waveform segment 34 can be expressed as audio feature encoding (11); the audio signal represented by waveform segment 35 The amplitude of the signal gradually increases with time t, and the time corresponding to the waveform segment 35 is eight unit time segments, then the waveform segment 35 can be expressed as audio feature encoding (00,00,00,00,00,00,00,00 ); By analogy, waveform segment 36 can be expressed as audio feature coding (01,01,01,01,01,01,01,01,01,01,01,01), and waveform segment 37 can be expressed as audio feature coding (11), the waveform segment 38 can be expressed as audio feature encoding (00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00). Therefore, the audio feature encoding corresponding to the audio signal shown in Figure 5B can be expressed as {11,00,00,00,00,11,01,01,01,01,01,01,11,00,00,00, 00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11,00,00,00,00,00,00,00, 00,00,00,00,00,00,00,00,00,…}.

For example, in some embodiments, a lookup table (codebook) includes at least one first code field. For example, in other embodiments, the lookup table further includes at least one second encoding field, and multiple first encoding fields constitute a second encoding field, so that dimensionally reduced high-order features can be formed from combinations of low-level features. For example, the coding method of the coding field (codeword, for example, the codeword may include a first coding field and a second coding field) in the lookup table may be the same as the coding method of the above-mentioned first audio feature coding.

For example, in some embodiments, when two bits are used to represent the changing state of the audio signal to implement feature encoding, the first encoding field may be one of 00, 01, 10, and 11. 00, 01, 10 and 11 can be combined to form the second encoding field. For example, a second encoding field may be represented as {00,00,00,01,01,01,11,11,01,…}, which is composed of a combination of 00, 01 and 11.

For example, when the lookup table includes a plurality of second encoding fields, the number of first encoding fields included in each of the plurality of second encoding fields may be different.

It should be noted that when more bits (for example, 3 bits, 4 bits, etc.) are used to represent the changing state of the audio signal to implement feature encoding, the types of the first coding field can be more, for example, when 3 bits are used to represent When the audio signal changes state, the first encoding field can have up to 8 types. At this time, the first encoding field can be part or all of 000, 001, 010, 011, 100, 101, 110 and 111.

For example, one or more second encoding fields can also be combined to obtain a third encoding field, or one or more second encoding fields and one or more first encoding fields can be combined to obtain a third encoding field, similarly Alternatively, one or more third coding fields may be combined or one or more third coding fields may be combined with the first coding field and/or the second coding field to obtain a higher order coding field. In embodiments of the present disclosure, low-order feature codes can be combined to obtain high-order feature codes, thereby achieving more efficient and longer predictions.

For example, the second audio feature encoding includes at least one first encoding field and/or at least one second encoding field. For example, in some embodiments, the second audio feature encoding may include one or more complete second encoding fields, or the second audio feature encoding may include part of the first encoding field in one second encoding field.

It should be noted that when the lookup table includes a third encoding field, the second audio feature encoding may include at least one first encoding field and/or at least one second encoding field and/or at least one third encoding field.

For example, in one embodiment, the lookup table includes the second encoding field W1, the second encoding field W2, and the second encoding field W3, and W1={11,00,00,00,00,11,01,01,01 ,01,01,01,11,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11 ,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,….}, W2＝{11,01,00,00,01, 01,01,01,01,01,01,….}, W3＝{11,00,01,00,00,01,01,01,11,00,00,00,01,01,01,01 ,01,01,01,01,01,….}.

In one embodiment, as shown in Figure 5B, starting from time t31, the audio collection device continues to collect the first audio signal. When the first feature encoding field corresponding to the first audio signal collected by the audio collection device is expressed as {11 }, corresponding to waveform segment 30, a query is performed based on the lookup table to determine whether there is a certain coding field (including the first coding field and the second coding field) in the lookup table, including {11}. In the above example, the query The second encoding field W1, the second encoding field W2, and the second encoding field W3 in the lookup table all include {11}. At this time, the second encoding field W1, the second encoding field W2, and the second encoding field W3 are all used as to-be-coded fields. The encoding fields to be output in the output encoding field list.

Then, as shown in Figure 5B, when the second feature encoding field corresponding to the first audio signal collected by the audio collection device is represented as {00}, corresponding to the first unit time period in the waveform segment 31, continue the search. Query the table (at this time, you can only query the coding field to be output in the coding field column to be output, which can save query time. However, you can also query the entire lookup table) to determine whether a certain encoding exists in the lookup table. The field includes {11,00}. In the above example, it is found that the second encoding field W1 and the second encoding field W3 in the lookup table both include {11,00}, because the second encoding field W2 includes {11,01}. , and does not include {11,00}, thus not meeting the characteristics of the first audio signal collected by the audio collection device. Therefore, the second encoding field W2 can be deleted from the list of encoding fields to be output. At this time, the second encoding field W2 Field W1 and the second encoding field W3 serve as the encoding fields to be output in the encoding field list to be output.

Then, when the third feature encoding field corresponding to the first audio signal collected by the audio collection device is represented as {00}, corresponding to the second unit time period in the waveform segment 31, continue to query the lookup table to determine Check whether there is a certain encoding field in the lookup table that includes {11,00,00}. In the above example, the second encoding field W1 in the lookup table is queried and includes {11,00,00}. Then, it can be predicted that the next audio signal should be the pattern of the second encoding field W1. For the first three coding fields {11,00,00} in the second coding field W1, since their corresponding audio signals have passed in time, the fourth field in the second coding field W1 can be output (ie {00}) is used as the predicted second audio coding feature. At this time, the second audio feature coding is expressed as {00,00,11,01,01,01,01,01,01 ,11,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11,00,00,00 ,00,00,00,00,00,00,00,00,00,00,00,00,00,…….}.

It should be noted that in actual applications, how many feature coding fields are matched before determining the second audio feature coding can be adjusted according to actual application scenarios, design requirements and other factors. For example, in the above example, when 3 matching fields (in actual In the application, if 10, 20, 50, etc.) feature coding fields can be matched, the second audio feature coding can be determined.

For example, in the above example, the first audio feature code corresponding to the first audio signal includes 3 feature code fields and is represented as {11,00,00}. As shown in Figure 5B, the time period corresponding to the first audio signal It is from time t31 to time t32. When considering factors such as the system's signal processing time, the system actually needs to output the second audio signal at time t33, which is later than time t32. At this time, the first two feature coding fields in the second audio feature coding {00 The time period corresponding to ,00} (that is, the time period between time t32 and time t33) has passed, so the audio feature encoding corresponding to the predicted fourth audio signal is actually expressed as {11,01,01,01,01 ,01,01,11,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11,00 ,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,…}.

For example, if the third audio signal and the fourth audio signal are exactly the same, the audio feature code corresponding to the third audio signal is also expressed as {11,01,01,01,01,01,01,11,00,00,00 ,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11,00,00,00,00,00,00,00 ,00,00,00,00,00,00,00,00,00,…}.

For example, the second audio signal is a signal obtained by inverting the fourth audio signal, that is, the second audio signal can be {11,01,01,01,01,01,01,11,00,00,00, 00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,11,00,00,00,00,00,00,00, 00,00,00,00,00,00,00,00,00,….} The inverted audio signal of this pattern.

For example, in some embodiments, the duration of the second audio signal, the duration of the third audio signal, and the duration of the fourth audio signal are substantially the same, eg, identical.

For example, in some embodiments, the leading feature coding field may be set for at least part of the first coding field and/or the second coding field in the lookup table. For example, the leading feature coding field may be set for the second coding field W1 {11,00 ,00}, when the leading feature coding field is detected, the second coding field W1 is output as the second audio feature coding. In this case, when it is detected that the first audio feature code corresponding to the first audio signal is {11,00,00}, the first audio feature code corresponding to the first audio signal and the preamble feature code field {11,00, 00} matching, so that the second encoding field W1 can be output as the second audio feature encoding.

For another example, the leading feature coding field {11,00,00,01,01} can be set for the second coding field W1. When some fields in the leading feature coding field are detected, the second coding field W1 and the leading feature coding field are The remaining fields in the feature encoding field are output as the second audio feature encoding. In this case, when it is detected that the first audio feature encoding corresponding to the first audio signal is {11,00,00}, the first audio signal corresponding The first audio feature encoding matches the first three fields {11,00,00} in the leading feature encoding field, so that the remaining fields {01,01} and the second encoding field W1 in the leading feature encoding field can be output as the third 2. Audio feature encoding. At this time, the time corresponding to the first two feature coding fields {01,01} in the second audio feature coding (i.e., the remaining fields in the leading feature coding field) can be the time for the system to process the signal, so that the predicted first The audio feature encoding corresponding to the four audio signals may be the complete second encoding field W1.

It should be noted that the length of the leading feature encoding field can be adjusted according to actual conditions, and this disclosure does not limit this.

It is worth noting that for look-up tables, when the memory used to store the look-up table is large enough and the content stored in the look-up table is rich enough (that is, there are enough combinations of encoding fields in the look-up table), the user's desire to eliminate all types of audio signals. For neural networks, when the samples used to train the neural network are rich enough and the types of samples are rich enough, any type of audio signal that the user wants to eliminate can be predicted based on the neural network.

For example, the lookup table may be stored in the memory in the form of a table, etc. The embodiments of the present disclosure do not limit the specific form of the lookup table.

For example, predictions in neural networks can be achieved by looking up tables.

For example, the second audio signal and/or the third audio signal and/or the fourth audio signal are periodic or intermittent time domain signals, and the second audio signal and/or the third audio signal and/or the fourth audio signal The signal characteristics are periodic or intermittent time domain amplitude changes, that is, the second audio signal and/or the third audio signal and/or the fourth audio signal have the characteristics of continuous repetition or intermittence repetition, and have a fixed pattern. For intermittent audio signals, since there is no audio signal during the pause period of the intermittent audio signal, there is no spectral feature to be extracted during the pause period, but the pause period can become the time domain feature of the intermittent audio signal. one.

For example, in some embodiments, step S101 may include: collecting an initial audio signal; performing downsampling on the initial audio signal to obtain a first audio signal.

Since the sampling rate of the initial audio signal collected by the audio acquisition device is high, it is not conducive to the back-end audio signal processing device (for example, artificial intelligence engine (AI (Artificial Intelligence) Engine), digital signal processor (Digital Signal) Processing (DSP for short), etc.), therefore, the initial audio signal can be down-sampled to achieve frequency reduction, which is convenient for processing by the audio signal processing device. For example, the frequency can be reduced to 48K Hz or even lower.

For example, in other embodiments, step S101 may include: collecting an initial audio signal; and filtering the initial audio signal to obtain a first audio signal.

In some application scenarios, being too quiet is not safe. Therefore, filtering can also be performed through a bandwidth controller (Bandwidth controller) to suppress audio signals within a specific frequency range. For continuous and intermittent audio signals (for example, knocking or dripping noise, etc.), the effective bandwidth of the first audio signal is set to the frequency range corresponding to the audio signal that needs to be suppressed, for example, 1K ~ 6K Hz , thereby ensuring that users can still hear more important sounds. For example, when used in the automotive field, it must be ensured that the driver can hear the horn, etc. to improve driving safety.

For example, in some embodiments, filtering processing and downsampling processing can also be used in combination, and the present disclosure does not limit the processing order of filtering processing and downsampling processing. For example, in some embodiments, obtaining the first audio signal may include: collecting an initial audio signal; filtering the initial audio signal to obtain an audio signal within a predetermined frequency range; and downsampling the audio signal within the predetermined frequency range. Processing to obtain the first audio signal; alternatively, obtaining the first audio signal may include: collecting an initial audio signal; performing downsampling processing on the initial audio signal; and performing filtering processing on the downsampled audio signal to obtain the first audio signal.

For example, the control instruction may include the time at which the second audio signal is output, the fourth audio signal, a control signal instructing to invert the fourth audio signal, and the like.

For example, in some embodiments, step S11 may include: based on the control instruction, determining a fourth audio signal and a control signal indicating inverting the fourth audio signal; based on the control signal, inverting the fourth audio signal Processed to generate a second audio signal.

For example, in some embodiments, step S12 may include: determining a first moment to output the second audio signal based on the control instruction; and outputting the second audio signal at the first moment.

For example, the third audio signal starts to appear from the second moment, and the absolute value of the time difference between the first moment and the second moment is less than the time threshold. It should be noted that the time threshold can be specifically set according to the actual situation, and this disclosure does not limit this. The smaller the time threshold, the better the silencing effect.

For example, in some embodiments, the time difference between the first moment and the second moment is 0, that is, the moment when the second audio signal starts to be output and the moment when the third audio signal starts to appear are the same. In the example shown in Figure 3 , the time when the second audio signal starts to be output and the time when the third audio signal starts to appear are both time t21.

For example, the time difference between the first moment and the second moment can be set according to the actual situation. For example, the first moment and the second moment can be set to ensure that the second audio signal and the third audio signal are transmitted to the target object at the same time, thereby avoiding The transmission of audio signals causes the second audio signal and the third audio signal to be out of sync, further improving the noise canceling effect. For example, the target object can be a human ear, a microphone, etc.

For example, the second audio signal can be output through a device such as a speaker that can convert an electrical signal into a sound signal for output.

It should be noted that when the audio collection device does not collect the audio signal, the audio processing method provided by the present disclosure may not be executed until the audio collection device collects the audio signal, thereby saving power consumption.

In embodiments of the present disclosure, the audio processing method can reduce or eliminate periodic audio signals (for example, noise) in environmental audio signals. For example, in application scenarios such as libraries, the sound of construction at a nearby construction site can be eliminated. wait. This type of scenario does not require special knowledge of the audio signals that you want to keep. It simply reduces the target sounds to be silenced in the environment that need to be eliminated. These target sounds to be silenced usually have the characteristics of continuous repetition or intermittence repetition, so they can be predicted through prediction. Predicted. It should be noted that the "target sound to be silenced" can be determined according to the actual situation. For example, for an application scenario such as a library, when there is a construction site around the library, the external environment audio signal can include two audio signals. The first The audio signal can be the sound of drilling at the construction site, and the second audio signal can be the sound of discussions by people around you. Usually, the sound of construction site drilling has periodic characteristics and usually has a fixed pattern. However, the discussion sound most likely does not have a fixed pattern and does not have periodic characteristics. At this time, the target sound to be silenced is the construction site drilling sound. Through the audio processing method provided by the embodiments of the present disclosure, it is possible to predict the drilling sound at the construction site, thereby eliminating or reducing the drilling sound at the construction site.

The audio processing method provided by embodiments of the present disclosure can be applied to automobile driving headrests to create a silent zone near the driver's ears to avoid unnecessary external audio signals (such as engine noise, road noise, wind noise, and tire noise). Noise signals while the car is driving) interfere with the driver. For another example, this audio processing method can also be applied to hair dryers, range hoods, vacuum cleaners, non-inverter air conditioners and other equipment to reduce the operating sound emitted by these equipment, allowing users to stay in noisy environments without being affected by the surrounding environment. The impact of environmental noise. This audio processing method can also be applied to headphones, etc., to reduce or eliminate external sounds, so that users can better receive the sounds from the headphones (music or phone calls, etc.).

At least one embodiment of the present disclosure also provides an audio processing device. Figure 6 is a schematic block diagram of an audio processing device provided by at least one embodiment of the present disclosure.

As shown in FIG. 6 , the audio processing device 600 includes an instruction generation module 601 , an audio generation module 602 and an output module 603 . The components and structures of the audio processing device 600 shown in FIG. 6 are only exemplary and not restrictive. The audio processing device 600 may also include other components and structures as needed.

The instruction generation module 601 is configured to generate a control instruction based on the first audio signal. The instruction generation module 601 is used to execute step S10 shown in Figure 2A.

The audio generation module 602 is configured to generate a second audio signal based on the control instruction. The audio generation module 602 is used to perform step S11 shown in Figure 2A.

The output module 603 is configured to output the second audio signal to suppress the third audio signal. The output module 603 is used to perform step S12 shown in Figure 2A.

For a specific description of the functions implemented by the instruction generation module 601, please refer to the relevant description of step S10 shown in FIG. 2A in the embodiment of the above audio processing method. For a specific description of the functions implemented by the audio generation module 602, please refer to the above audio For the relevant description of step S11 shown in FIG. 2A in the embodiment of the processing method, for a specific description of the functions implemented by the output module 603, please refer to the relevant description of step S12 shown in FIG. 2A in the embodiment of the audio processing method. . The audio processing device can achieve similar or identical technical effects to the foregoing audio processing method, which will not be described again here.

For example, the first audio signal appears earlier than the third audio signal.

For example, the sum of the phases of the second audio signal and the third audio signal is less than the phase threshold. In some embodiments, the phase of the second audio signal is opposite to the phase of the third audio signal, so that the third audio signal can be completely suppressed. .

For example, in some embodiments, the instruction generation module 601 may include an audio acquisition sub-module, a prediction sub-module and a generation sub-module. The audio acquisition sub-module is configured to acquire the first audio signal; the prediction sub-module is configured to process the first audio signal to predict a fourth audio signal; the generation sub-module is configured to generate a control instruction based on the fourth audio signal.

For example, the second audio signal and/or the third audio signal and/or the fourth audio signal are periodic or intermittent time domain signals.

For example, the third audio signal and the fourth audio signal may be exactly the same.

For example, in some embodiments, the prediction sub-module may process the first audio signal based on a neural network to predict the fourth audio signal. For example, the prediction sub-module may include the AI engine and/or digital signal processor in the audio processing part shown in Figure 1. The AI engine may include a neural network. For example, the AI engine may include a recurrent neural network, a long short-term memory network, or At least one neural network among generative adversarial networks and the like.

For example, in some implementations, the prediction sub-module includes a query unit and a prediction unit. The query unit is configured to generate a first audio feature code based on the first audio signal and query the lookup table based on the first audio feature code to obtain a second audio feature code. The prediction unit is configured to predict the fourth audio signal based on the second audio feature encoding.

For example, the lookup unit may include memory for storing lookup tables.

For example, in some embodiments, the lookup table may include at least one first encoding field. For example, in other embodiments, the lookup table further includes at least one second encoding field, and multiple first encoding fields constitute one second encoding field. Regarding the specific content of the lookup table, reference may be made to the relevant descriptions in the embodiments of the audio processing method described above, and repeated details will not be described again.

For example, the second audio feature encoding includes at least one first encoding field and/or at least one second encoding field.

For example, in some embodiments, the audio acquisition sub-module includes an acquisition unit and a downsampling processing unit. The acquisition unit is configured to collect the initial audio signal; the down-sampling processing unit is configured to perform down-sampling processing on the initial audio signal to obtain the first audio signal.

For example, in some embodiments, the audio acquisition sub-module includes an acquisition unit and a filtering unit. The acquisition unit is configured to acquire an initial audio signal; and the filtering unit is configured to filter the initial audio signal to obtain a first audio signal.

For example, the audio acquisition sub-module can be implemented as the audio receiving part shown in Figure 1. For example, the collection unit may include an audio collection device, such as a microphone in the audio receiving part shown in FIG. 1 , or the like. For example, the acquisition unit may also include an amplifier, an analog-to-digital converter, etc.

For example, in some embodiments, the output module 603 may include a moment determination sub-module and an output sub-module. The time determination sub-module is configured to determine a first time to output the second audio signal based on the control instruction; the output sub-module is configured to output the second audio signal at the first time.

For example, the output module 603 may be implemented as the audio output part shown in FIG. 1 .

For example, the third audio signal starts to appear from the second moment, and the absolute value of the time difference between the first moment and the second moment is less than the time threshold.

For example, the time difference between the first time and the second time may be zero.

For example, the output sub-module may include audio output devices such as speakers. For example, the output sub-module may also include a digital-to-analog converter, etc.

For example, the instruction generation module 601, the audio generation module 602, and/or the output module 603 may be hardware, software, firmware, or any feasible combination thereof. For example, the instruction generation module 601, the audio generation module 602 and/or the output module 603 can be a dedicated or general-purpose circuit, chip or device, or a combination of a processor and a memory. The embodiments of the present disclosure do not limit the specific implementation forms of each of the above modules, sub-modules and units.

At least one embodiment of the present disclosure also provides an audio processing device. FIG. 7 is a schematic block diagram of another audio processing device provided by at least one embodiment of the present disclosure.

For example, as shown in FIG. 7 , the audio processing device 700 includes one or more memories 701 and one or more processors 702 . One or more memories 701 are configured to store non-transitory computer-executable instructions; one or more processors 702 are configured to execute the computer-executable instructions. The computer-executable instructions, when executed by one or more processors 702, implement the audio processing method according to any of the above embodiments. For the specific implementation and related explanations of each step of the audio processing method, please refer to the description of the above embodiments of the audio processing method, and will not be described again here.

For example, in some embodiments, the audio processing device 700 may further include a communication interface and a communication bus. The memory 701, the processor 702 and the communication interface can communicate with each other through the communication bus, and the memory 701, the processor 6702 and the communication interface and other components can also communicate through a network connection. This disclosure does not limit the type and function of the network.

For example, the communication bus may be a Peripheral Component Interconnect Standard (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into address bus, data bus, control bus, etc.

For example, the communication interface is used to implement communication between the audio processing device 700 and other devices. The communication interface may be a Universal Serial Bus (USB) interface, etc.

For example, the processor 702 and the memory 701 can be provided on the server side (or cloud).

For example, processor 702 may control other components in audio processing device 700 to perform desired functions. The processor 702 may be a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable Logic devices, discrete gate or transistor logic devices, discrete hardware components. The central processing unit (CPU) can be X86 or ARM architecture, etc.

For example, memory 701 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-executable instructions may be stored on the computer-readable storage medium, and the processor 702 may execute the computer-executable instructions to implement various functions of the audio processing device 700 . Various applications and various data can also be stored in the storage medium.

For example, for detailed description of the process of audio processing performed by the audio processing device 700, reference may be made to the relevant descriptions in the embodiments of the audio processing method, and repeated details will not be described again.

For example, in some embodiments, the audio processing device 700 may be embodied in the form of a chip, a small device/device, or the like.

FIG. 8 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure. For example, as shown in Figure 8, one or more computer-executable instructions 1001 may be non-transitory stored on a non-transitory computer-readable storage medium 1000. For example, one or more steps in the audio processing method described above may be performed when the computer-executable instructions 1001 are executed by a processor.

For example, the non-transitory computer-readable storage medium 1000 can be applied in the above-mentioned audio processing device 700, and for example, it can include the memory 701 in the audio processing device 700.

For description of the non-transitory computer-readable storage medium 1000, reference may be made to the description of the memory 701 in the embodiment of the audio processing device 600 shown in FIG. 7, and repeated descriptions will not be repeated.

At least one embodiment of the present disclosure provides an audio processing method, an audio processing device and a non-transitory computer-readable storage medium. By learning the characteristics of the current audio signal, the audio signal (ie, the fourth audio signal) is predicted, and the audio signal is predicted based on the characteristics of the current audio signal. The obtained audio signal generates an inverted audio signal of the future audio signal to suppress the future audio signal, avoiding the problem of out-of-sync between the inverted audio signal and the audio signal that needs to be suppressed due to the delay between the input end and the output end, and improving the silencing effect. , can significantly reduce or even eliminate the impact of the input-to-output delay on noise reduction, and the audio suppression effect is better than that of the backward active noise reduction system commonly used in the industry; because the first audio signal is a time domain signal, The first audio signal is not an audio signal of a specific frequency, so the audio processing method provided by embodiments of the present disclosure does not need to extract spectral features from the audio signal to generate a spectrogram, thereby simplifying the audio signal processing process and saving processing time; In the lookup table, low-order feature codes can be combined to obtain high-order feature codes, thereby achieving more efficient and longer predictions; and in this audio processing method, filtering processing can also be performed through a bandwidth controller, This enables the suppression of audio signals within a specific frequency range to ensure that users can still hear more important sounds. For example, when used in the automotive field, it must be ensured that the driver can hear the horn, etc. to improve driving safety. ; In addition, when the audio signal is not collected, the audio processing method provided by the present disclosure may not be executed until the audio signal is collected, thereby saving power consumption.

At least one embodiment of the present disclosure provides a model training method. The model training method includes: based on the prediction model, processing the first audio signal to generate a first control instruction; based on the first control instruction, generating an audio signal corresponding to the first control instruction as a second audio signal; outputting the second audio signal signal to suppress the third audio signal, wherein the first audio signal appears earlier than the third audio signal; determines the audio error signal based on the second audio signal and the third audio signal; in response to the audio error signal not When the error condition is met, the prediction model is adjusted, and the first audio signal is processed again based on the prediction model until the audio error signal meets the error condition; in response to the audio error signal meeting the error condition, the prediction model remains unchanged.

It should be noted that in the following description of the model training method with reference to the accompanying drawings, the limitations of ordinal numbers such as “first”, “second”, and “third” are only to distinguish multiple signals in the same embodiment (for example, , the first audio signal, the second audio signal, the third audio signal, the fourth audio signal). In the present disclosure, signals defined by the same ordinal word in different embodiments (for example, the "first audio signal" in the description of the above audio processing method Audio signal" is not necessarily the same as the "first audio signal" in the model training method).

In the model training method provided by the embodiment of the present disclosure, the current audio signal (ie, the first audio signal) and the future audio signal (ie, the third audio signal) are used to perform real-time training on the prediction model to improve the prediction of the prediction model output. The accuracy of the results avoids the problem that the prediction results based on the prediction model output cannot suppress future audio signals, and improves the effect of noise reduction based on the prediction model.

Embodiments of the present disclosure also provide a model training device and a non-transitory computer-readable storage medium. The model training method can be applied to the model training device provided by the embodiment of the present disclosure, and the model training device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, a car headrest, etc. The mobile terminal may be a mobile phone, a headset, a tablet computer or other hardware devices.

Figure 9 is a schematic block diagram of a model training system provided by at least one embodiment of the present disclosure. Figure 10A is a schematic flow chart of a model training method provided by at least one embodiment of the present disclosure. Figure 10B is shown in Figure 10A 10C is a schematic flow chart of step S200 shown in FIG. 10B , and FIG. 11 is a schematic diagram of a first audio signal and a third audio signal provided by at least one embodiment of the present disclosure.

In embodiments of the present disclosure, the prediction model can be trained using a pre-training method and/or an on-site training method. The pre-training method means training the prediction model based on the training audio samples in the training set obtained in advance; on-site training means training the prediction model based on audio signals collected in actual application scenarios.

The model training system shown in Figure 9 can be used to implement the model training method provided by any embodiment of the present disclosure, for example, the model training method shown in Figure 10A. The model training system shown in Figure 9 can be applied to on-site training or pre-training.

As shown in Figure 9, the model training system may include an audio acquisition part, an error calculation part, a prediction part and an audio output part. The audio acquisition part can acquire the audio signal Sn11, and then transmit the audio signal Sn11 to the prediction part; the prediction part processes the audio signal Sn11 to predict the inverted audio signal Sn12 of the future audio signal Sn13. The inverted audio signal Sn12 can be output through the audio output part to suppress the future audio signal Sn13. For example, the target object Ta (eg, a human ear, etc.) can receive the inverted audio signal Sn12 and the future audio signal Sn13 at the same time, so that The inverted audio signal Sn12 and the future audio signal Sn13 can be destructively superimposed. At this time, the audio acquisition part can also collect the audio signal in the current application scenario. The collected audio signal is the destructive superposition result Sr of the inverted audio signal Sn12 and the future audio signal Sn13 that appears later than the audio signal Sn11. , for example, when the inverted audio signal Sn12 can be used to completely silence the future audio signal Sn13, then the superposition result Sr may be a silent signal, that is, there is no audio signal. Then, the audio acquisition part can transmit the superposition result Sr to the error calculation part; the error calculation part can generate an error audio signal ES based on the superposition result Sr. Finally, the error calculation part can transmit the error audio signal ES to the prediction part. When the error audio signal does not meet the conditions, the prediction part can adjust the prediction model in response to the error audio signal. When the error audio signal meets the conditions, the prediction part can not The predictive model is adjusted so that the predictive model remains unchanged.

In one embodiment, the audio acquisition part can also acquire the inverted audio signal Sn12 from the prediction part and collect the audio signal in the current application scenario (ie, the superposition result Sr shown in Figure 9). Then, the audio acquisition part can transmit the inverted audio signal Sn12 and the superposition result Sr to the error calculation part; the error calculation part can obtain the future audio signal Sn13 based on the inverse audio signal Sn12 and the superposition result Sr, and calculate the inverted audio signal Sn12 and the future audio signal Sn13 are processed to generate an error audio signal ES.

In one embodiment, for the pre-training method, the audio acquisition part can also obtain the inverted audio signal Sn12 from the prediction part, and can also obtain the future audio signal Sn13 whose appearance time is later than the audio signal Sn11, and then, the inverted audio signal The signal Sn12 and the future audio signal Sn13 are transmitted to the error calculation part; the error calculation part can process the inverted audio signal Sn12 and the future audio signal Sn13 to generate an error audio signal ES.

For example, the audio acquisition part may include a microphone, an amplifier (for example, a microphone amplifier), an analog to digital converter (ADC), a downsampler (downsampler), etc., and the error calculation part may include a processor, etc.; the prediction part may Including AI engine and/or digital signal processor (Digital Signal Processing, DSP), etc., the audio output part can include upsampler (Upsampler), digital to analog converter (digital to analog converter, DAC), amplifier (for example, speaker amplifier ) and speakers, etc.

As shown in Figure 10A, a model training method provided by an embodiment of the present disclosure includes steps S200 to S207. In step S200, based on the prediction model, the first audio signal is processed to generate a first control instruction; in step S201, based on the first control instruction, an audio signal corresponding to the first control instruction is generated as a second audio signal; in step S201, based on the first control instruction, an audio signal corresponding to the first control instruction is generated as a second audio signal; S202, output the second audio signal to suppress the third audio signal; in step S203, determine the audio error signal based on the second audio signal and the third audio signal; in step S204, determine whether the audio error signal satisfies the error condition; in response to If the audio error signal does not meet the error condition, corresponding to the N branch of Figure 10A, steps S205 and S207 are executed. In step S205, the prediction model is adjusted. In step S207, the first audio signal is processed again based on the prediction model. Until the audio error signal satisfies the error condition; in response to the audio error signal satisfying the error condition, corresponding to the Y branch of FIG. 10A, step S206 is executed. In step S206, the prediction model is kept unchanged.

For example, the first audio signal appears earlier than the third audio signal. That is to say, relative to the first audio signal, the third audio signal belongs to the future audio signal.

For example, the first audio signal may be the audio signal Sn11 shown in FIG. 9 , the second audio signal may be the inverted audio signal Sn12 shown in FIG. 9 , and the third audio signal may be the future audio signal Sn13 shown in FIG. 9 . The audio acquisition part can acquire the first audio signal; the prediction part can process the first audio signal based on the prediction model to generate a first control instruction, and generate a second audio signal based on the first control instruction; and then the error calculation part can process the second audio signal based on the prediction model. The audio signal and the third audio signal are processed to obtain an error audio signal. The prediction part can determine whether to adjust the prediction model based on the error audio signal, thereby achieving training of the prediction model.

It should be noted that in the embodiment of the model training method of the present disclosure, the "first audio signal" represents a type of audio signal processed by the prediction model to generate a second audio signal, for example, the first audio signal in step S200 The signal may be different from the first audio signal in step S207; the "second audio signal" represents a type of audio signal generated for suppressing future audio signals. "Third audio signal" represents a type of audio signal that needs to be suppressed. The "first control instruction" represents the control instruction obtained by processing the first audio signal for the first time by the prediction model.

For example, in one embodiment, the prediction model can be trained in a pre-training manner. Each training audio sample in the training set can include a first training audio signal and a second training audio signal. The time when the first training audio signal appears Earlier than the time when the second training audio signal appears, the second training audio signal is a future audio signal relative to the first training audio signal. In the pre-training, the prediction model is trained using the training set until the prediction result obtained by processing the first training audio signal by the prediction model is consistent with the second training audio signal. The first training audio signal in the training audio sample is the above-mentioned first audio signal, and the second training audio signal in the training audio sample is the above-mentioned third audio signal.

For the pre-training method, because the audio in the training audio samples in the training set is pre-recorded, it may not be exactly the same as the audio in the real application scenario. The training audio samples in the training set cannot be like the audio in the real application scenario. So real, this may cause problems that cannot be silenced when the trained prediction model is applied to actual application scenarios. Therefore, in the embodiment of the present disclosure, the prediction model can be further trained using on-site training. In the on-site training method, it takes a period of time to train the model at the beginning, but after a period of time, the training results of the prediction model will become better and better. Since on-site real-time training is performed through audio signals in actual application scenarios, the accuracy of the trained prediction model will be higher than that of the prediction model trained using training audio samples in the training set. The prediction based on on-site training The model can be more suitable for actual application scenarios, avoid the problem that the prediction model cannot suppress audio signals in actual application scenarios, and improve the adaptability of the prediction model to different application scenarios, so that the prediction model can adapt to different application scenarios, and in The prediction accuracy of the prediction model is high in different application scenarios, which improves the noise reduction effect in actual application scenarios. In addition, since the prediction model can be trained based on audio signals in actual application scenarios, the requirement for the sample size used to train the prediction model can be reduced.

For example, in another embodiment, the model training method shown in FIG. 10A can be executed based on audio signals collected in real time in the current application scenario. At this time, the audio acquisition part can collect the audio signal emitted by the sound source in the current application scenario starting from the current moment to obtain the first audio signal. The audio acquisition part can collect the audio signal emitted by the sound source starting at a certain moment after the current moment. signal as the third audio signal. For example, as shown in Figure 11, in one embodiment, in the current application scenario, the time when audio signal A starts to appear is t100 and the time period it exists can be the time period between time t100 and time t101. Audio signal B The time when audio signal C starts to appear is t200 and the existing time period can be the time period between time t200 and time t201. The time when audio signal C starts to appear is t300 and the existing time period can be the time period between time t300 and time t301. , the time when the audio signal D starts to appear is t400 and the time period it exists may be the time period between time t400 and time t401. On the time axis t, time t101 is earlier than time t200, time t201 is earlier than time t300, and time t301 is earlier than time t400. As shown in Figure 11, if the current time is t100, the audio acquisition part can collect audio signal A as the first audio signal, and the audio acquisition part can collect audio signal B as the third audio signal.

It should be noted that the pre-training method and the on-site training method can be combined to achieve training of the prediction model. For example, the prediction model can be pre-trained in a pre-training manner, and then the pre-trained prediction model can be applied to actual application scenarios, and then on-site training can be used to continue training the prediction model, thus saving the time required for the model to be used in actual applications. Time for on-site training in the application scenario.

In the following description, unless otherwise specified, the first audio signal and the third audio signal are audio signals collected in the current actual application scenario as an example.

For example, the first audio signal and the third audio signal may be audio signals generated by the external environment, machines, etc. in the current actual application scenario, the sound of machine operation, the sound of electric drills and electric saws during decoration, etc. For example, machines may include household appliances (air conditioners, range hoods, washing machines, etc.) and the like.

For example, in some embodiments, the first audio signal may be a time domain audio signal with the largest volume (largest amplitude) in the current actual application scenario during the time period in which the first audio signal exists, and the first audio signal is not audio of a specific frequency. signal, so that the model training method provided by embodiments of the present disclosure does not need to extract spectral features from the audio signal to generate a spectrogram, thereby simplifying the audio signal processing process and saving processing time.

For example, in some embodiments, as shown in FIG. 10B , step S200 may include steps S2001 to S2003. In step S2001, the first audio signal is obtained; in step S2002, the first audio signal is processed based on the prediction model. The fourth audio signal is obtained by prediction; in step S2003, a first control instruction is generated based on the fourth audio signal. In the model training method provided by embodiments of the present disclosure, the prediction model can learn the characteristics of the current audio signal (ie, the first audio signal) to predict the audio signal (ie, the fourth audio signal).

For example, the fourth audio signal is a predicted future audio signal. For example, on the time axis, the time period in which the fourth audio signal exists is later than the time period in which the first audio signal exists. For example, the time period during which the fourth audio signal exists is the same as the time period during which the third audio signal exists.

For example, in some embodiments, step S2001 may include: collecting an initial audio signal; and performing downsampling processing on the initial audio signal to obtain a first audio signal.

For example, in other embodiments, step S2001 may include: collecting an initial audio signal; and filtering the initial audio signal to obtain a first audio signal.

For example, in some embodiments, filtering processing and downsampling processing can also be used in combination, that is, filtering processing and downsampling processing can be performed on the initial audio signal to obtain the first audio signal. The processing of filtering processing and downsampling processing in this disclosure There is no restriction on the order.

For example, in one embodiment, the prediction model includes a lookup table. As shown in Figure 10C, step S2002 may include steps S2012 to S2032. In step S2012, a first audio feature code is generated based on the first audio signal; in step S2022 In step S2032, the lookup table is queried based on the first audio feature coding to obtain the second audio feature coding; in step S2032, the fourth audio signal is predicted based on the second audio feature coding.

For another example, the first audio signal may be a digital signal, such as a PDM signal. In this case, the first audio feature code may be generated directly based on the first audio signal. PDM signals can be represented by binary numbers 0 and 1.

For example, any suitable encoding method may be used to implement the first audio feature encoding. For example, in some embodiments, when representing an audio signal, the changing state of the audio signal can be used to describe the audio signal, and multiple bits can be used to represent the changing state of the audio signal. For example, two bits may be used to represent the changing state of the audio signal. For the relevant description of using two bits to represent the changing state of the audio signal, please refer to the relevant description in the embodiment of the audio processing method above, and the repeated details will not be repeated.

For example, in some embodiments, a lookup table (codebook) includes at least one first code field. For example, in other embodiments, the lookup table further includes at least one second encoding field, and multiple first encoding fields constitute a second encoding field, so that dimensionally reduced high-order features can be formed from combinations of low-level features. For example, the second audio feature encoding includes at least one first encoding field and/or at least one second encoding field.

For example, in some embodiments, the second audio feature encoding may include one or more complete second encoding fields, or the second audio feature encoding may include part of the first encoding field in one second encoding field.

It should be noted that for the specific description of the lookup table, reference can be made to the relevant descriptions in the embodiments of the audio processing method above, and repeated details will not be repeated.

For example, in one embodiment, the prediction model includes a neural network, and in step S2002, the first audio signal can be processed through the neural network to predict a fourth audio signal. For example, neural networks may include recurrent neural networks, long short-term memory networks, or generative adversarial networks.

For example, the first control instruction may include a time at which the second audio signal is output, a fourth audio signal, a control signal instructing to invert the fourth audio signal, and the like.

For example, step S201 may include: determining a fourth audio signal and a control signal indicating inverting the fourth audio signal based on the first control instruction; performing inversion processing on the fourth audio signal based on the control signal to generate second audio signal.

For example, the phase of the second audio signal is opposite to the phase of the fourth audio signal.

For example, in step S202, the second audio signal may be output to the audio acquisition part, and the audio acquisition part may transmit the second audio signal to the error calculation part for calculation by the error calculation part.

For example, in step S202, the second audio signal can also be output to the audio output part, and the audio output part can output the second audio signal, so that the third audio signal can be suppressed. At this time, the audio acquisition part can collect the third audio signal. The superposition result after the second audio signal and the third audio signal are superimposed, and the superposition result is transmitted to the error calculation part for calculation.

For example, the absolute value of the time difference between the time when the audio signal corresponding to the first control instruction (ie, the second audio signal) is output and the time when the third audio signal starts to appear is less than the time threshold. In one embodiment, the time difference between the output and the first audio signal is less than the time threshold. The time difference between the time of the audio signal corresponding to a control instruction and the time when the third audio signal starts to appear may be 0. The time at which the audio signal corresponding to the first control instruction is output may be determined based on the first control instruction.

It should be noted that the time threshold can be specifically set according to the actual situation, and this disclosure does not limit this. The smaller the time threshold, the better the noise reduction effect achieved by the trained prediction model.

For example, in an embodiment, step S203 may include: calculating the root mean square error between the second audio signal and the third audio signal to obtain the audio error signal. For example, in one embodiment, before performing calculation of the root mean square error between the second audio signal and the third audio signal, in a pre-training manner, the second audio signal and the third audio signal may first be acquired through the audio acquisition part. signal, and then transmit the second audio signal and the third audio signal to the error calculation part for calculation; for the on-site training method, first, the second audio signal can be obtained through the audio acquisition part, and the second audio signal can be collected through the audio acquisition part The superposition result after destructive superposition of the audio signal and the third audio signal; then, the audio acquisition part can transmit the second audio signal and the superposition result to the error calculation part; then, the error calculation part can be based on the second audio signal and the The superposition result obtains a third audio signal, and calculation is performed on the second audio signal and the third audio signal.

Figure 12A is a schematic diagram of an audio error signal and the number of training iterations provided by at least one embodiment of the present disclosure. As shown in Figure 12A, the audio error signal is the root mean square error between the second audio signal and the third audio signal. After iteratively training the prediction model for approximately 100 times, the audio error signal is the root mean square error between the second audio signal and the third audio signal. The root mean square error is reduced to close to 0.

For example, in one embodiment, the prediction model includes a neural network. At this time, since the second audio signal is determined based on the predicted fourth audio signal, the second audio signal can be used as the corresponding output of the neural network, using the neural network The output (embodied as the second audio signal) and the label data groundtruth corresponding to the first audio signal (embodied as the third audio signal) construct a loss function of the neural network and calculate the loss value based on the loss function. At this time, step S203 may include: calculating the loss value through the loss function of the neural network based on the second audio signal and the third audio signal. The audio error signal includes loss values.

FIG. 12B is a schematic diagram of another audio error signal and the number of training iterations provided by at least one embodiment of the present disclosure. As shown in Figure 12B, the audio error signal is the loss value calculated through the loss function of the neural network. After the prediction model is iteratively trained about 50 times, the loss value is reduced to close to 0.

For example, when the second audio signal suppresses the third audio signal better, the audio error signal becomes smaller. When the phase of the second audio signal is opposite to the phase of the third audio signal, complete silence can be achieved. At this time, the audio error signal can be minimum, for example, 0.

For example, in step S204, it is determined whether the audio error signal satisfies the error condition. When the audio error signal satisfies the error condition, it means that the third audio signal can be better suppressed based on the second audio signal, thereby achieving silence. At this time , the prediction effect of the prediction model is better, so that the prediction model can be kept unchanged; when the audio error signal does not meet the error condition, it means that the suppression of the third audio signal may not be achieved based on the second audio signal, even due to the second audio signal The generation of causes the audio signal in the current environment to be larger. At this time, the prediction effect of the prediction model is poor, and the prediction model needs to be adjusted.

For example, in one embodiment, the prediction model includes a neural network, and in response to the audio error signal not meeting the error condition, in step S205, adjusting the prediction model includes: in response to the loss value not meeting the error condition, using the loss value to Network parameters are adjusted. Processing the first audio signal again based on the prediction model includes: in response to the audio error signal not meeting the error condition, processing the first audio signal again based on the neural network to generate a second control instruction; based on the second control instruction, generating and output the audio signal corresponding to the second control instruction as the second audio signal.

For example, the first audio signal can be processed again based on the neural network after parameter adjustment to generate the second control instruction.

For example, in another embodiment, the prediction model includes a lookup table, and in response to the audio error signal not meeting the error condition, in step S205, adjusting the prediction model includes: in response to the audio error signal not meeting the error condition, based on the first The audio signal and the third audio signal generate an audio feature code; the lookup table is adjusted based on the audio feature code. Processing the first audio signal again based on the prediction model includes: in response to the audio error signal not satisfying the error condition, processing the first audio signal again based on the lookup table to generate a second control instruction; based on the second control instruction, generating and output the audio signal corresponding to the second control instruction as the second audio signal.

For example, the second control instruction is different from the first control instruction.

It should be noted that in the embodiment of the model training method of the present disclosure, the "second control instruction" represents the control instruction obtained when the prediction model is repeatedly iteratively trained.

For example, when based on the second audio signal (the audio signal corresponding to the first control instruction generated based on the first audio signal (audio signal A shown in Figure 11)) and the third audio signal (audio signal B shown in Figure 11) If the determined audio error signal does not satisfy the error condition, the audio feature code F can be generated based on the first audio signal (audio signal A shown in Figure 11) and the third audio signal (audio signal B shown in Figure 11), and then based on Audio feature encoding F adjustment lookup table.

For example, adjusting the lookup table based on the audio feature coding F may include: comparing the audio feature coding F with all coding fields in the lookup table. When the audio feature coding F is not the same as any coding field in the lookup table, then the audio feature coding F is different from any coding field in the lookup table. Feature code F is added to the lookup table to update the lookup table to obtain an updated lookup table; when the audio feature code F is the same as a certain coding field in the lookup table, the lookup table remains unchanged, that is, the lookup table is not updated. For example, in one embodiment, the lookup table before adjustment may include encoding field A, encoding field B, and encoding field C, if the audio feature encoding F is different from any one of encoding field A, encoding field B, and encoding field C. , at this time, the adjusted lookup table can include coding field A, coding field B, coding field C and audio feature coding F; when audio feature coding F is the same as coding field A, at this time, the lookup table remains unchanged and after adjustment The lookup table is the same as the lookup table before adjustment, that is, the adjusted lookup table can include encoding field A, encoding field B, and encoding field C.

For example, in one embodiment, the first audio signal can be processed again to generate the second control instruction based on the lookup table before updating; in another embodiment, the first audio signal can be processed again based on the updated lookup table. The audio signal is processed to generate a second control instruction.

It should be noted that before adding the audio feature code F to the lookup table, when the number of coding fields in the lookup table reaches the maximum, that is, the storage space of the lookup table is full, you can select from the lookup table a frequency lower than the frequency used. A coding field of the threshold, delete the coding field, and then add the audio feature code F to the lookup table to update the lookup table, thereby avoiding the problem of being unable to store the audio feature code F, and also avoiding the storage space required for the lookup table. is too big.

For example, error conditions can be set according to actual conditions.

The following briefly describes the overall process of the model training method provided by the embodiments of the present disclosure based on an example of pre-training and on-site training.

In an example of pre-training, first, the first training audio sample can be obtained from the training set through, for example, the audio acquisition part, and a training process (including steps S200 to S206) is performed on the prediction model based on the first training audio sample. During the training process, the first training audio signal in the first training audio sample is used as the first audio signal, and the second training audio signal in the first training audio sample is used as the third audio signal. In step S204, When the audio error signal during the training process meets the error condition, step S206 is executed, that is, the prediction model is kept unchanged; when the audio error signal during the training process does not meet the error condition, step S205 and step S207 are executed. In step S205 , the prediction model is adjusted, and then in step S207, the second training audio sample can be obtained from the training set through the audio acquisition part, and the next training process is performed on the prediction model based on the second training audio sample (repeat step S200 ~S206), in the next training process, the first training audio signal in the second training audio sample is used as the first audio signal, and the second training audio signal in the second training audio sample is used as the third audio signal Signal. By analogy, in pre-training, the prediction model is iteratively trained.

For example, the first training audio sample and the second training audio sample can be the same training audio sample. That is to say, the same training audio sample can be used to train the prediction model for multiple iterations. At this time, in step S200 The first audio signal is the same as the first audio signal in step S207; the first training audio sample and the second training audio sample can also be different training audio samples. In this case, the first audio signal in step S200 is the same as the first audio signal in step S207. The first audio signals in S207 are not the same.

It should be noted that in the pre-training mode, when step S206 is executed, the model training method may also include: checking whether the training set includes training audio samples that have not been used to train the prediction model. When the training set includes training audio samples that have not been used to train the prediction model, Training audio samples used to train the prediction model, then obtain training audio samples that have not been used to train the prediction model to train the prediction model until all training audio samples in the training set are used to train the prediction model.

In an example of on-site training, as shown in Figure 11, if the current time is t100, the audio signal A can be collected as the first audio signal through, for example, the audio acquisition part to perform a training process on the prediction model. During this training process In steps S200 to S201, a second audio signal is generated based on the first audio signal (i.e. audio signal A); in step S202 of the training process, the audio signal B can be collected through the audio acquisition part as the same as the first audio signal ( That is, the third audio signal corresponding to audio signal A); in step S203, determine the audio frequency between the second audio signal obtained based on the first audio signal (ie, audio signal A) and the third audio signal (ie, audio signal B). Error signal; in step S204 of the training process, when the audio error signal between the second audio signal and the third audio signal (ie audio signal B) obtained based on the first audio signal (ie audio signal A) satisfies the error condition , then execute step S206, that is, keep the prediction model unchanged; when the audio error signal between the second audio signal and the third audio signal (ie, audio signal B) obtained based on the first audio signal (ie, audio signal A) does not satisfy If there is an error condition, then execute step S205 to adjust the prediction model; then execute step S207. When executing step S207, time t201 has passed, and the audio acquisition part needs to collect the audio signal that starts to appear at the current time (later than time t201) again as The first audio signal, as shown in Figure 11, if the current time becomes time t300, then in step S207, the audio acquisition part can collect the audio signal C as the first audio signal to perform the next training process (repeated execution) on the prediction model Steps S200 to S206), in the next training process, the audio acquisition part collects the audio signal D as the third audio signal corresponding to the first audio signal (ie, the audio signal C). By analogy, in field training, the predictive model is trained iteratively.

At least one embodiment of the present disclosure also provides a model training device. Figure 13 is a schematic block diagram of a model training device provided by at least one embodiment of the present disclosure.

As shown in Figure 13, the model training device 1300 includes an instruction generation module 1301, an audio generation module 1302, an output module 1303, an error calculation module 1304 and an adjustment module 1305. The components and structures of the model training device 1300 shown in FIG. 13 are only exemplary and not restrictive. The model training device 1300 may also include other components and structures as needed.

The instruction generation module 1301 is configured to process the first audio signal to generate a first control instruction based on the prediction model. The instruction generation module 1301 is used to execute step S200 shown in Figure 10A.

The audio generation module 1302 is configured to generate an audio signal corresponding to the first control instruction as a second audio signal based on the first control instruction. The audio generation module 1302 is used to perform step S201 shown in Figure 10A.

The output module 1303 is configured to output the second audio signal to suppress the third audio signal. The output module 1303 is used to perform step S202 shown in Figure 10A. For example, the first audio signal appears earlier than the third audio signal.

The error calculation module 1304 is configured to determine an audio error signal based on the second audio signal and the third audio signal. The error calculation module 1304 is used to perform step S203 shown in Figure 10A.

The adjustment module 1305 is configured to adjust the prediction model in response to the audio error signal not meeting the error condition; and to keep the prediction model unchanged in response to the audio error signal meeting the error condition. The adjustment module 1305 is used to execute steps S205 to S206 shown in FIG. 10A. The adjustment module 1305 is also configured to determine whether the audio error signal satisfies the error condition, that is, the adjustment module 1305 is also configured to perform step S204 shown in FIG. 10A.

The instruction generation module 1301 is further configured to, in response to the audio error signal not satisfying the error condition, process the first audio signal again based on the prediction model until the audio error signal satisfies the error condition. The instruction generation module 1301 is also used to execute step S207 shown in Figure 10A.

For a specific description of the functions implemented by the instruction generation module 1301, please refer to the relevant description of step S200 and step S207 shown in Figure 10A in the embodiment of the above model training method. For a specific description of the functions implemented by the audio generation module 1302, please refer to Referring to the relevant description of step S201 shown in FIG. 10A in the embodiment of the above model training method, for a specific description of the functions implemented by the output module 1303, refer to step S202 shown in FIG. 10A in the embodiment of the above model training method. For a detailed description of the functions implemented by the error calculation module 1404, please refer to the relevant description of step S203 shown in Figure 10A in the embodiment of the above model training method. For a specific description of the functions implemented by the adjustment module 1305, please refer to Refer to the relevant description of steps S204 to S206 shown in FIG. 10A in the embodiment of the above model training method. The model training device can achieve similar or identical technical effects to the foregoing model training method, which will not be described again here.

For example, in some embodiments, the instruction generation module 1301 includes an audio acquisition sub-module, a prediction sub-module and a generation sub-module. The audio acquisition sub-module is configured to acquire the first audio signal; the prediction sub-module is configured to process the first audio signal based on the prediction model to predict a fourth audio signal; the generation sub-module is configured to generate based on the fourth audio signal. First control command.

For example, the audio acquisition sub-module can be implemented as the audio acquisition part shown in Figure 9.

For example, in some embodiments, the prediction model includes a neural network, and the prediction sub-module can process the first audio signal based on the neural network to predict the fourth audio signal. For example, the prediction sub-module may include an AI engine and/or a digital signal processor in the prediction part shown in Figure 9, and the AI engine may include a neural network.

For example, in some embodiments, the prediction model includes a lookup table, the prediction sub-module includes a query unit and a prediction unit, the query unit is configured to generate a first audio feature encoding based on the first audio signal; query the lookup table based on the first audio feature encoding , to obtain the second audio feature coding; the prediction unit is configured to predict and obtain the fourth audio signal based on the second audio feature coding.

For example, the lookup unit may include memory for storing lookup tables.

For example, the absolute value of the time difference between the time when the output module 1303 outputs the audio signal (ie, the second audio signal) corresponding to the first control instruction and the time when the third audio signal starts to appear is less than the time threshold.

For example, the output module 1303 can be implemented as the audio output part shown in FIG. 9 . For example, the output module 1303 may include audio output devices such as speakers, and may also include a digital-to-analog converter and the like.

For example, in some embodiments, the prediction model includes a neural network, and when performing the operation of determining the audio error signal based on the second audio signal and the third audio signal, the error calculation module 1304 is configured to determine the audio error signal based on the second audio signal and the third audio signal. For audio signals, the loss value is calculated through the loss function of the neural network. The audio error signal includes loss values. When performing the operation of adjusting the prediction model in response to the audio error signal not meeting the error condition, the adjustment module 1305 is configured to: use the loss value to adjust parameters of the neural network in response to the loss value not meeting the error condition. When performing the operation of processing the first audio signal again based on the prediction model, the instruction generation module 1301 is configured to: in response to the audio error signal not satisfying the error condition, based on the neural network, process the first audio signal again to generate the third audio signal. 2. Control instructions. The second control instruction is different from the first control instruction. The audio generation module 1302 is further configured to generate and output an audio signal corresponding to the second control instruction as a second audio signal based on the second control instruction.

For example, in some embodiments, the prediction model includes a lookup table, and the adjustment module 1305 includes a feature encoding generation submodule and a lookup table adjustment submodule. The feature encoding generation submodule is configured to: in response to the audio error signal not satisfying the error condition, based on The first audio signal and the third audio signal generate audio feature coding; the lookup table adjustment sub-module is configured to adjust the lookup table based on the audio feature coding.

For example, in some embodiments, the prediction model includes a lookup table, and when performing the operation of processing the first audio signal again based on the prediction model, the instruction generation module 1301 is configured to: in response to the audio error signal not satisfying the error condition, based on Look up the table and process the first audio signal again to generate the second control instruction. The second control instruction is different from the first control instruction. The audio generation module 1302 is further configured to generate and output an audio signal corresponding to the second control instruction as a second audio signal based on the second control instruction.

For example, when performing the operation of determining the audio error signal based on the second audio signal and the third audio signal, the error calculation module 1304 is configured to: calculate the root mean square error between the second audio signal and the third audio signal to Get the audio error signal.

For example, the instruction generation module 1301, the audio generation module 1302, the output module 1303, the error calculation module 1304 and/or the adjustment module 1305 can be hardware, software, firmware, and any feasible combination thereof. For example, the instruction generation module 1301, audio generation module 1302, output module 1303, error calculation module 1304 and/or adjustment module 1305 can be a dedicated or general circuit, chip or device, or a combination of a processor and a memory. The embodiments of the present disclosure do not limit the specific implementation forms of each of the above modules, sub-modules and units.

At least one embodiment of the present disclosure also provides a model training device. FIG. 14 is a schematic block diagram of another model training device provided by at least one embodiment of the present disclosure.

For example, as shown in Figure 14, the model training device 1400 includes one or more memories 1401 and one or more processors 1402. One or more memories 1401 are configured to store non-transitory computer-executable instructions; one or more processors 1402 are configured to execute the computer-executable instructions. The computer-executable instructions, when executed by one or more processors 1402, implement the model training method according to any of the above embodiments. For the specific implementation and related explanations of each step of the model training method, please refer to the description of the above embodiment of the model training method, and will not be described again here.

For example, in some embodiments, the model training device 1400 may also include a communication interface and a communication bus. The memory 1401, the processor 1402 and the communication interface can communicate with each other through a communication bus, and the memory 1401, the processor 1402 and the communication interface and other components can also communicate through a network connection. This disclosure does not limit the type and function of the network.

For example, the communication interface is used to implement communication between the model training device 1400 and other devices. The communication interface may be a Universal Serial Bus (USB) interface, etc.

For example, the processor 1402 and the memory 1401 can be provided on the server side (or cloud).

For example, processor 1402 may control other components in model training device 1400 to perform desired functions. The processor 1402 may be a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable Logic devices, discrete gate or transistor logic devices, discrete hardware components. The central processing unit (CPU) can be X86 or ARM architecture, etc.

For example, memory 1401 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-executable instructions may be stored on the computer-readable storage medium, and the processor 1402 may execute the computer-executable instructions to implement various functions of the model training device 1400. Various applications and various data can also be stored in the storage medium.

For example, for a detailed description of the process of model training performed by the model training device 1400, reference may be made to the relevant descriptions in the embodiments of the model training method, and repeated descriptions will not be repeated.

Figure 15 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure. For example, as shown in Figure 15, one or more computer-executable instructions 2001 may be non-transitory stored on a non-transitory computer-readable storage medium 2000. For example, one or more steps in the model training method described above may be performed when the computer-executable instructions 2001 are executed by a processor.

For example, the non-transitory computer-readable storage medium 2000 can be applied in the above-mentioned model training device 1400. For example, it can include the memory 1401 in the model training device 1400.

For description of the non-transitory computer-readable storage medium 2000, reference may be made to the description of the memory 1401 in the embodiment of the model training device 1400 shown in FIG. 14, and repeated descriptions will not be repeated.

At least one embodiment of the present disclosure provides a model training method, a model training device and a non-transitory computer-readable storage medium, using a current audio signal (ie, a first audio signal) and a future audio signal (ie, a third audio signal). ) Conduct real-time training of the prediction model to improve the accuracy of the prediction results output by the prediction model, avoid the problem that the prediction results based on the prediction model output cannot suppress future audio signals, and improve the effect of noise reduction based on the prediction model; in addition, you can Through on-site real-time training of audio signals in current actual application scenarios, the accuracy of the trained prediction model will be higher than that of the prediction model trained using training audio samples in the training set. The prediction based on on-site training The model can be more suitable for actual application scenarios, avoid the problem that the prediction model cannot suppress audio signals in actual application scenarios, and improve the adaptability of the prediction model to different application scenarios, so that the prediction model can adapt to different application scenarios, and in The prediction accuracy of the prediction model in different application scenarios is high, which improves the noise cancellation effect in actual application scenarios; because the prediction model can be trained based on the audio signals in actual application scenarios, the number of samples used to train the prediction model can be reduced. Quantitative requirements; since the first audio signal is a time domain signal and the first audio signal is not an audio signal of a specific frequency, the model training method provided by the embodiment of the present disclosure does not need to extract spectral features from the audio signal to generate a spectrogram, This can simplify the audio signal processing process and save processing time; before adding the audio feature code F to the lookup table, select a coding field with a frequency lower than the frequency threshold from the lookup table, and delete the coding field, and then, The audio feature code F is then added to the lookup table to update the lookup table, thereby avoiding the problem of being unable to store the audio feature code F and also avoiding excessive storage space required for the lookup table.

Regarding this disclosure, there are still several points that need to be explained:

(1) The drawings of the embodiments of this disclosure only refer to structures related to the embodiments of this disclosure, and other structures may refer to common designs.

(2) Without conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.

The above are only specific implementation modes of the present disclosure, but the protection scope of the present disclosure is not limited thereto. The protection scope of the present disclosure should be subject to the protection scope of the claims.

Claims

A model training method including:

Based on the prediction model, process the first audio signal to generate a first control instruction;

Based on the first control instruction, generate an audio signal corresponding to the first control instruction as a second audio signal;

Outputting the second audio signal to suppress a third audio signal, wherein the first audio signal occurs earlier than the third audio signal;

determining an audio error signal based on the second audio signal and the third audio signal;

In response to the audio error signal not satisfying the error condition, adjusting the prediction model and processing the first audio signal again based on the prediction model until the audio error signal satisfies the error condition;

In response to the audio error signal satisfying the error condition, the prediction model is maintained unchanged.
The model training method according to claim 1, wherein the prediction model includes a neural network,

Determining the audio error signal based on the second audio signal and the third audio signal includes: calculating a loss value through a loss function of the neural network based on the second audio signal and the third audio signal. ,

Wherein, the audio error signal includes the loss value.
The model training method according to claim 2, wherein the adjusting the prediction model in response to the audio error signal not satisfying an error condition includes:

In response to the loss value not satisfying the error condition, the parameters of the neural network are adjusted using the loss value.
The model training method according to claim 3, wherein processing the first audio signal again based on the prediction model includes:

In response to the audio error signal not satisfying the error condition, based on the neural network, the first audio signal is processed again to generate a second control instruction, wherein the second control instruction is consistent with the first The control instructions are not the same;

Based on the second control instruction, an audio signal corresponding to the second control instruction is generated and output as the second audio signal.
The model training method according to claim 1, wherein the prediction model includes a lookup table,

In response to the audio error signal not meeting an error condition, adjusting the prediction model includes:

In response to the audio error signal not satisfying the error condition, generating audio feature encoding based on the first audio signal and the third audio signal;

The lookup table is adjusted based on the audio feature encoding.
The model training method according to claim 1, wherein the prediction model includes a lookup table,

Processing the first audio signal again based on the prediction model includes:

In response to the audio error signal not satisfying the error condition, the first audio signal is processed again to generate a second control instruction based on the lookup table, wherein the second control instruction is consistent with the first The control instructions are not the same;

Based on the second control instruction, an audio signal corresponding to the second control instruction is generated and output as the second audio signal.
The model training method according to any one of claims 1 to 6, wherein determining the audio error signal based on the second audio signal and the third audio signal includes:

The root mean square error between the second audio signal and the third audio signal is calculated to obtain the audio error signal.
The model training method according to any one of claims 1 to 6, wherein the processing of the first audio signal to generate the first control instruction based on the prediction model includes:

Obtain the first audio signal;

Process the first audio signal based on the prediction model to predict a fourth audio signal;

The first control instruction is generated based on the fourth audio signal.
The model training method according to claim 8, wherein the prediction model includes a lookup table,

Processing the first audio signal based on the prediction model to predict a fourth audio signal includes:

Generate a first audio feature code based on the first audio signal;

Query the lookup table based on the first audio feature code to obtain a second audio feature code;

Based on the second audio feature encoding, the fourth audio signal is predicted.
The model training method according to claim 8, wherein the phase of the second audio signal is opposite to the phase of the fourth audio signal.
The model training method according to any one of claims 1 to 6, wherein the absolute value of the time difference between the time when the audio signal corresponding to the first control instruction is output and the time when the third audio signal starts to appear less than the time threshold.
A model training device including:

an instruction generation module configured to process the first audio signal to generate a first control instruction based on the prediction model;

an audio generation module configured to generate an audio signal corresponding to the first control instruction as a second audio signal based on the first control instruction;

An output module configured to output the second audio signal to suppress a third audio signal, wherein the first audio signal appears earlier than the third audio signal;

an error calculation module configured to determine an audio error signal based on the second audio signal and the third audio signal;

an adjustment module configured to adjust the prediction model in response to the audio error signal not meeting the error condition; and to keep the prediction model unchanged in response to the audio error signal meeting the error condition;

Wherein, the instruction generation module is further configured to, in response to the audio error signal not satisfying the error condition, process the first audio signal again based on the prediction model until the audio error signal satisfies the error condition. .
The model training device according to claim 12, wherein the prediction model includes a neural network,

When performing the operation of determining an audio error signal based on the second audio signal and the third audio signal, the error calculation module is configured to based on the second audio signal and the third audio signal, The loss value is calculated through the loss function of the neural network,

Wherein, the audio error signal includes the loss value.
The model training device according to claim 13, wherein when performing the operation of adjusting the prediction model in response to the audio error signal not satisfying an error condition, the adjustment module is configured to: respond to If the loss value does not satisfy the error condition, the loss value is used to adjust the parameters of the neural network.
The model training device according to claim 14, wherein when performing the operation of processing the first audio signal again based on the prediction model, the instruction generation module is configured to: respond to the audio signal. The error signal does not satisfy the error condition, and based on the neural network, the first audio signal is processed again to generate a second control instruction, wherein the second control instruction is different from the first control instruction;

The audio generation module is further configured to generate and output an audio signal corresponding to the second control instruction as the second audio signal based on the second control instruction.
The model training device according to claim 12, wherein the prediction model includes a lookup table, and the adjustment module includes a feature encoding generation submodule and a lookup table adjustment submodule,

The feature code generation sub-module is configured to: in response to the audio error signal not satisfying the error condition, generate an audio feature code based on the first audio signal and the third audio signal;

The lookup table adjustment sub-module is configured to adjust the lookup table based on the audio feature encoding.
The model training device according to claim 12, wherein the prediction model includes a lookup table, and when performing the operation of processing the first audio signal again based on the prediction model, the instruction generation module is configured to: in response to the audio error signal not satisfying the error condition, based on the lookup table, process the first audio signal again to generate a second control instruction, wherein the second control instruction is consistent with the The above first control instructions are different;

The audio generation module is further configured to generate and output an audio signal corresponding to the second control instruction as the second audio signal based on the second control instruction.
The model training device according to any one of claims 12 to 17, wherein when performing the operation of determining an audio error signal based on the second audio signal and the third audio signal, the error calculation module Configured to: calculate a root mean square error between the second audio signal and the third audio signal to obtain the audio error signal.
The model training device according to any one of claims 12 to 17, wherein the instruction generation module includes an audio acquisition sub-module, a prediction sub-module and a generation sub-module,

The audio acquisition sub-module is configured to acquire the first audio signal;

The prediction sub-module is configured to process the first audio signal based on the prediction model to predict a fourth audio signal;

The generating sub-module is configured to generate the first control instruction based on the fourth audio signal.
The model training device according to claim 19, wherein the prediction model includes a lookup table, and the prediction sub-module includes a query unit and a prediction unit,

The query unit is configured to generate a first audio feature code based on the first audio signal; query the lookup table based on the first audio feature code to obtain a second audio feature code;

The prediction unit is configured to predict the fourth audio signal based on the second audio feature encoding.
The model training device according to claim 19, wherein the phase of the second audio signal is opposite to the phase of the fourth audio signal.
The model training device according to any one of claims 12 to 17, wherein the absolute value of the time difference between the time when the audio signal corresponding to the first control instruction is output and the time when the third audio signal starts to appear less than the time threshold.
A model training device including:

One or more memories that non-transitoryly store computer-executable instructions;

one or more processors configured to execute the computer-executable instructions,

Wherein, the computer-executable instructions implement the model training method according to any one of claims 1 to 11 when run by the one or more processors.
A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when executed by a processor, the computer-executable instructions implement any one of claims 1 to 11 The model training method described in the item.