CN117854488A - Audio processing method and electronic equipment

Info

Publication number: CN117854488A
Application number: CN202410065674.3A
Authority: CN (China)
Prior art keywords: audio, enhanced, audio data, signal, processing
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵伟康, 廖廷康, 李志飞
Applicant and current assignee: Mobvoi Innovation Technology Co Ltd
Priority to: CN202410065674.3A
Abstract

An embodiment of the invention discloses an audio processing method and an electronic device. A plurality of initial audio signals are acquired, and predetermined processing is performed on at least one initial audio signal to generate an enhanced audio data set, where the predetermined processing includes audio fusion processing and/or audio mixing processing. The audio signals in the enhanced audio data set and a preset audio signal are then subjected to reverberation processing to generate training audio data. Training audio data can therefore be generated from the initial audio signals, which gives the method high applicability.

Description

Audio processing method and electronic equipment
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method and an electronic device.
Background
In the prior art, audio signals are typically processed with trained audio processing models (e.g., convolutional neural networks (Convolutional Neural Networks, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and long short-term memory networks (Long Short Term Memory, LSTM)) to implement different functions (e.g., audio recognition, noise suppression, and speech enhancement). To improve the robustness of an audio processing model, a large amount of audio training data is required for training the model. How to acquire audio training data therefore becomes a problem to be solved.
Currently, in order to obtain audio training data for different scenes (e.g., malls, buildings, and streets), it is common to build a dedicated test environment that simulates each scene (e.g., a recording studio or recording room) and to simulate and collect the required audio training data in that environment. However, this prior-art approach of constructing test environments for different scenes to obtain audio training data is limited and has low applicability.
Disclosure of Invention
In view of this, embodiments of the invention provide an audio processing method and an electronic device that can generate training audio data from an initial audio signal and therefore have high applicability.
In a first aspect, an embodiment of the present invention provides an audio processing method, including:
acquiring a plurality of initial audio signals;
performing predetermined processing on at least one of the initial audio signals to generate an enhanced audio data set, wherein the predetermined processing comprises audio fusion processing and/or audio mixing processing;
and carrying out reverberation processing on the audio signals in the enhanced audio data set and preset audio signals to generate training audio data.
In a second aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method according to the first aspect.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor; and
a memory for storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
An embodiment of the invention acquires a plurality of initial audio signals and performs predetermined processing on at least one initial audio signal to generate an enhanced audio data set, where the predetermined processing includes audio fusion processing and/or audio mixing processing. The audio signals in the enhanced audio data set and a preset audio signal are then subjected to reverberation processing to generate training audio data. Training audio data can therefore be generated from the initial audio signals, which gives the method high applicability.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a diagram of an audio processing architecture in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of an audio processing method of an embodiment of the present invention;
FIG. 3 is a flow chart of a method of generating a first enhanced audio in an embodiment of the invention;
FIG. 4 is a schematic diagram of an initial audio signal in an embodiment of the invention;
FIG. 5 is a schematic diagram of a plurality of audio segments intercepted in an embodiment of the invention;
FIG. 6 is a flow chart of a method of generating second enhanced audio in an embodiment of the invention;
FIG. 7 is a schematic diagram of candidate audio data in an embodiment of the invention;
FIG. 8 is a flow chart of a method of generating third enhanced audio in an embodiment of the invention;
FIG. 9 is a flow chart of a method of generating training audio data in an embodiment of the invention;
FIG. 10 is a flow chart of a method of determining weight information in an embodiment of the invention;
FIG. 11 is a schematic diagram of an audio processing apparatus according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The present application is described below based on examples, but it is not limited to these examples. In the following detailed description of the present application, certain specific details are set forth; the application can, however, be fully understood by those skilled in the art without a description of these details. Well-known methods, procedures, flows, components, and circuits are not described in detail so as not to obscure the substance of the present application.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise," "comprising," and the like throughout the application are to be construed in an inclusive rather than an exclusive or exhaustive sense; that is, as meaning "including but not limited to."
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the following description, an example is described in which the audio processing method is applied to a scenario of suppressing noise (e.g., a reverberant audio signal), and the training audio data is applied to an audio processing model that suppresses noise. Reverberation is a physical phenomenon in which sound waves emitted by a sound source are reflected when they encounter obstacles (such as floors, walls, and vegetation), so that many reflected sound waves persist and mix for a period of time after the sound source stops sounding. It should be understood that the audio processing method according to the embodiment of the present invention may be applied to various other scenarios that require audio processing, and correspondingly the training audio data may be applied to training an audio processing model with the functions required by that scenario. For example, when the audio processing method is applied to a speech recognition scenario, the training audio data may be used to train an audio processing model with a speech recognition function. As another example, when the audio processing method is applied to a speech enhancement scenario, the training audio data may be used to train an audio processing model with a speech enhancement function.
As noted above, the prior-art approach of acquiring audio training data by building test environments for different scenes is limited and has low applicability. The present embodiment therefore first performs fusion processing and/or mixing processing on the initial audio signals to obtain an enhanced audio data set, and then performs reverberation processing on the audio signals in the enhanced audio data set and a preset audio signal to generate training audio data. Training audio data can thus be generated from the initial audio signals, which gives the method high applicability. The audio processing architecture of this embodiment is shown in fig. 1.
Fig. 1 is a diagram of an audio processing architecture according to an embodiment of the present invention. As shown in fig. 1, the audio processing architecture of the present embodiment includes a first mixing module 11, an attenuation coefficient measuring module 12, a first signal generator 13, a fusion module 14, a second mixing module 15, a second signal generator 16, a first reverberation module 17, and a second reverberation module 18.
In this embodiment, each module in the software architecture (i.e., the first mixing module 11, the attenuation coefficient measuring module 12, the first signal generator 13, the fusion module 14, the second mixing module 15, the second signal generator 16, the first reverberation module 17, and the second reverberation module 18) is configured to implement a different predetermined function. A module may be a processing unit deployed on a separate hardware computing platform, or the modules may be software programs or service interfaces deployed on a unified hardware platform or a cloud platform. In the following description, the modules of the software architecture are taken to be deployed on unified hardware (for example, an electronic device with data processing, data storage, and data transmission functions).
Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present invention. As shown in fig. 2, the audio processing procedure of the present embodiment includes the following steps:
step S100, a plurality of initial audio signals are acquired.
In this embodiment, the plurality of initial audio signals may be acquired in a number of ways, such as collection in a target environment, custom generation, or acquisition from a network. The present embodiment is described using these ways of acquiring a plurality of initial audio signals as examples; it should be understood that the initial audio signals may also be obtained by means of acoustic design software, such as ODEON, SketchUp, EASE, SoundPLAN, or Insul.
Alternatively, a plurality of initial audio signals may be collected as needed in a target environment, such as a building, subway station, street, or dedicated test room. For example, audio is played through a speaker in a target room, and a plurality of initial audio signals are collected by an audio collection device such as a microphone, an audio recorder, or a digital audio interface. As another example, environmental audio in the target environment is captured by an audio collection device and then analyzed with software (e.g., MATLAB (Matrix Laboratory) or NumPy (Numerical Python)) to extract the desired plurality of initial audio signals.
Alternatively, a plurality of initial audio signals may be custom generated according to requirements. For example, parameters such as the duration, frequency, and amplitude of each initial audio signal are set as needed, and a target function (e.g., the random function or the randn function) is then called from a function library (e.g., the MATLAB or NumPy function library) to generate the plurality of initial audio signals.
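As an illustration of this custom-generation option, the following is a minimal Python sketch; the function name, parameter defaults, and the choice of white Gaussian noise as the generated signal are assumptions made for the example, not requirements of the method.

```python
import numpy as np

def make_initial_signal(duration_s=1.0, sample_rate=16000, amplitude=0.5, seed=None):
    """Generate a synthetic initial audio signal (white Gaussian noise here)
    with a chosen duration, sampling frequency, and amplitude."""
    rng = np.random.default_rng(seed)
    n_samples = int(duration_s * sample_rate)
    return amplitude * rng.standard_normal(n_samples)

# Three initial signals with different durations, sampling frequencies, and amplitudes.
initial_signals = [
    make_initial_signal(1.0, 16000, 0.5, seed=0),
    make_initial_signal(2.0, 16000, 0.8, seed=1),
    make_initial_signal(0.5, 8000, 1.0, seed=2),
]
```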
Alternatively, a plurality of initial audio signals may be acquired from a network. Specifically, a network audio data set of a target type (e.g., noise or speech) may be acquired from a cloud server over a network, and a plurality of audio signals in the network audio data set are then determined as the plurality of initial audio signals. The network may be implemented by long term evolution (LTE) technology, 5G mobile communication network technology (5th Generation, 5G), and the like.
In this embodiment, the plurality of initial audio signals may be noise signals. It is easy to understand that the type of noise signal differs across application scenarios. For example, if a user sings in a room, the noise signal may be the room impulse response (Room Impulse Response, RIR). As another example, if the user is in a voice call with a friend through an electronic device, the noise signal may be the environmental audio of sea waves striking the beach. As another example, if the user is in a voice call with a friend through an electronic device on a street, the noise signal may be nearby vehicle horns, sound reflected by street signs, and so on. That is, the initial audio signals may be various types of audio, such as speech and environmental audio. In the following description, the plurality of initial audio signals is taken to include a room impulse response (RIR) and a preset noise signal, and the preset noise signal is taken to be white Gaussian noise (White Gaussian Noise, WGN). It is easy to understand that the preset noise signal may also be uniform white noise, impulse noise, pink noise, brown noise, and the like.
Step S200, performing a predetermined process on at least one of the initial audio signals to generate an enhanced audio data set, where the predetermined process includes an audio fusion process and/or a mixing process.
In this embodiment, the enhanced audio data set may be generated in a number of ways from a plurality of initial audio signals. In particular, one of the plurality of initial audio signals may be subjected to a predetermined process to generate a corresponding one of the enhanced audio data sets. Each of the plurality of initial audio signals may also be subjected to a predetermined process to generate a corresponding plurality of enhanced audio data sets. In the following description, a procedure of generating a corresponding one of the enhanced audio data sets by performing a predetermined processing on a room impulse response (i.e., an initial audio signal) is described as an example.
In this embodiment, the initial audio signal may be subjected to a mixing process to generate an enhanced audio data set. Specifically, the initial audio signal is input to the first mixing module 11, and the first mixing module 11 may perform mixing processing on the initial audio signal in various ways to generate the enhanced audio data set.
Alternatively, the first mixing module 11 may perform mixing processing on the room impulse response and different preset noise signals to generate the enhanced audio data set. For example, the room impulse response is convolved with the different preset noise signals to obtain a plurality of enhanced audio data sets, where the different preset noise signals are white Gaussian noise with different sampling frequencies, durations, and amplitudes.
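A minimal sketch of this convolution-based mixing (assuming NumPy/SciPy) is shown below; the toy decaying impulse response and the particular noise amplitudes and durations are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_rir_with_noises(rir, noise_signals):
    """Convolve one room impulse response with several preset noise signals,
    yielding one enhanced audio per noise signal."""
    return [fftconvolve(rir, noise, mode="full") for noise in noise_signals]

rng = np.random.default_rng(0)
# Toy decaying room impulse response (4000 samples at 16 kHz).
rir = rng.standard_normal(4000) * np.exp(-np.linspace(0.0, 6.0, 4000))
# White Gaussian noise signals with different amplitudes and durations.
noises = [a * rng.standard_normal(int(16000 * d))
          for a, d in [(0.5, 1.0), (1.0, 0.5), (0.2, 2.0)]]
enhanced_set = mix_rir_with_noises(rir, noises)
```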
Optionally, the enhanced audio data set comprises first enhanced audio. The first mixing module 11 may intercept a plurality of audio segments from the room impulse response and then mix the audio segments to generate the enhanced audio data set. Specifically, step S200 may include steps S211 and S212; see fig. 3.
Fig. 3 is a flow chart of a method of generating first enhanced audio in an embodiment of the invention. As shown in fig. 3, the process of generating the first enhanced audio of the present embodiment includes the steps of:
step S211, a plurality of audio segments are intercepted from the initial audio signal, and a preset time interval is reserved between the audio segments.
In this embodiment, for the room impulse response, the room has a certain size, so there is a time interval between the moment at which the sound wave emitted by the sound source reaches the audio collection device directly and the moment at which it reaches the device after being reflected by objects such as walls. Similarly, there are time intervals between the arrivals of different reflections from different objects (e.g., different walls). To improve the authenticity of the enhanced audio data set, the present embodiment uses the first mixing module 11 to intercept a plurality of audio segments from the initial audio signal, with a predetermined time interval between the audio segments.
Alternatively, an initial time stamp at which the initial audio signal is received and reflection time stamps at which its reflections are received may be obtained, and candidate time intervals between audio segments of the initial audio signal may then be determined from the initial time stamp and the reflection time stamps (for example, by computing the difference between the initial time stamp and the reflection time stamp closest to it). The candidate time intervals may be determined directly as the predetermined time intervals between the intercepted audio segments. Alternatively, the candidate time intervals corresponding to the initial audio signal may be combined (e.g., by averaging or weighting) to determine the predetermined time interval between the intercepted audio segments. That is, the predetermined time intervals between the audio segments may be the same or different.
Alternatively, the predetermined time interval between audio segments may be custom set by the user according to the requirements.
Alternatively, the plurality of audio segments may be intercepted according to the sampling frequency of the room impulse response. The sampling frequency characterizes the number of samples (i.e., sampling points) extracted per unit time from the continuous signal to form the discrete signal, in Hz. For example, a sampling frequency of 16000 Hz for the room impulse response means that the signal has 16000 sampling points per second. That is, the corresponding sampling points can be determined from the sampling frequency of the room impulse response.
For example, the plurality of audio segments includes a first audio segment and a second audio segment, reference may be made to fig. 4 and 5.
Fig. 4 is a schematic diagram of an initial audio signal in an embodiment of the invention. Fig. 5 is a schematic diagram of intercepting a plurality of audio segments in an embodiment of the invention. As shown in fig. 4 and 5, in the initial audio signal of fig. 4, the sampling points corresponding to the intercepted first audio segment are 0 to A1, and the sampling points corresponding to the intercepted second audio segment are A2 to A3 (see fig. 5). There is a predetermined time interval between the first audio segment and the second audio segment, so an un-intercepted audio segment lies between them; its sampling points A1 to A2 correspond to the predetermined time interval. That is, the first audio segment is intercepted from the initial audio signal first, and the second audio segment is intercepted after the predetermined time interval following the first audio segment.
Step S212, mixing processing is carried out on each audio segment to generate first enhanced audio.
In this embodiment, the first enhanced audio may include a plurality of audios. Specifically, the audio segments may be mixed so as to be integrated into one audio. For example, if the audio segments are B, C, and D, mixing them into a single audio yields BCD, BDC, CBD, CDB, DBC, or DCB. Any one of the audio segments may also be mixed with another audio segment to obtain a plurality of audios; for example, with segments B, C, and D, mixing pairs of segments yields the audios BC, BD, CB, CD, DB, and DC. A sketch of segment interception and a simple form of mixing is given below.
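The following Python sketch illustrates interception with gaps and a concatenation-style mixing of the segments; the segment length, the gap length, the peak normalisation, and the choice of concatenation as the mixing operation are assumptions made for the example.

```python
import numpy as np

def intercept_segments(signal, segment_len, gap_len, n_segments):
    """Cut n_segments pieces of segment_len samples out of `signal`, leaving
    gap_len un-intercepted samples between consecutive pieces."""
    segments, start = [], 0
    for _ in range(n_segments):
        end = start + segment_len
        if end > len(signal):
            break
        segments.append(signal[start:end])
        start = end + gap_len  # skip the predetermined time interval
    return segments

def mix_segments(segments):
    """Integrate the segments into a single mono track by concatenating them
    after peak normalisation (one simple form of mixing)."""
    normed = [s / (np.max(np.abs(s)) + 1e-12) for s in segments]
    return np.concatenate(normed)

rir = np.random.default_rng(1).standard_normal(16000)
segs = intercept_segments(rir, segment_len=2000, gap_len=500, n_segments=3)
first_enhanced = mix_segments(segs)
```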
In this embodiment, mixing refers to the process of integrating different original audio signals (e.g., any of the intercepted audio segments) into a stereo track or a single track to obtain a mixed audio signal. Specifically, parameters such as the frequency, amplitude, and sound field of any original audio signal can be adjusted as required to realize the mixing of different original audio signals. The mixing of different original audio signals can be implemented in various ways, such as with a mixer (e.g., a digital signal processor), mixing software (e.g., Pro Tools or Logic Pro), or a custom design (e.g., mixing code written in the Python programming language).
In this embodiment, any of the initial audio signals may be subjected to an audio fusion process in a number of ways to generate the enhanced audio data set.
Alternatively, the room impulse response may be convolved separately with different preset noise signals (e.g., white Gaussian noise, brown noise) to generate the enhanced audio data set.
Optionally, the enhanced audio data set may comprise second enhanced audio. Specifically, the attenuation coefficient measuring module 12, the first signal generator 13, and the fusion module 14 may perform audio fusion processing on the initial audio signal to generate the second enhanced audio. Correspondingly, step S200 may include steps S221 to S224; see fig. 6.
Fig. 6 is a flow chart of a method of generating second enhanced audio in an embodiment of the invention. As shown in fig. 6, the process of generating the second enhanced audio of the present embodiment includes the steps of:
step S221, determining an attenuation coefficient of the initial audio signal.
In this embodiment, the attenuation coefficient measuring module 12 may determine the attenuation coefficient of the initial audio signal in various ways.
Alternatively, the attenuation coefficient measuring module 12 may input the initial audio signal into a pre-trained attenuation coefficient model to output the attenuation coefficient. The attenuation coefficient model is, for example, a convolutional neural network or a recurrent neural network.
Optionally, the attenuation coefficient measuring module 12 may obtain the amplitude of each sampling point of the initial audio signal, calculate the amplitude difference between adjacent sampling points to obtain a plurality of amplitude variations, and then compute the attenuation coefficient from these amplitude variations by linear regression, curve fitting, or similar methods. A sketch of this option follows.
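A minimal sketch of the regression-based option, assuming an exponential envelope |r(t)| ~ d·e^(-u·t) and a straight-line fit to the log-envelope; the helper name, the small eps term, and the toy signal are illustrative only and are not taken from the patent.

```python
import numpy as np

def estimate_attenuation(signal, sample_rate=16000, eps=1e-8):
    """Estimate u and d for |signal(t)| ~ d * exp(-u * t) by fitting a straight
    line to the log-envelope (simple linear regression)."""
    t = np.arange(len(signal)) / sample_rate
    log_env = np.log(np.abs(signal) + eps)
    slope, intercept = np.polyfit(t, log_env, 1)  # log|r(t)| ~ log d - u * t
    return -slope, np.exp(intercept)              # (attenuation coefficient u, amplitude d)

# Toy check: a noise burst decaying with u = 4.
rng = np.random.default_rng(2)
t = np.arange(16000) / 16000
toy = 0.8 * np.exp(-4.0 * t) * rng.standard_normal(16000)
u_est, d_est = estimate_attenuation(toy)
```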
Optionally, the attenuation coefficient measuring module 12 constructs a loss function for the room impulse response (i.e., the initial audio signal). Iterative calculations are then performed on a preset learning rate, a preset amplitude, and a preset attenuation coefficient according to the loss function until a target amplitude and a target attenuation coefficient that no longer change are obtained, and the target attenuation coefficient is determined as the attenuation coefficient of the room impulse response.
For example, the room impulse response is the function r1(t). The loss function loss(u, d) constructed by the attenuation coefficient measuring module 12 measures, over N sampling points, the deviation between |r1(t)| and the exponential decay model d·e^(-u·t). In the loss function loss(u, d), u represents the attenuation coefficient, d represents the amplitude, r1(t) represents the function of the room impulse response, |r1(t)| represents its absolute value, e represents the natural constant, t represents time, N is a positive integer greater than or equal to 2, and ε represents a correction term, which may take a value of, for example, 0.01.
First, a partial-derivative calculation is performed according to the first attenuation coefficient partial derivative function, using a preset first learning rate α1, a preset first amplitude d1, and a preset first attenuation coefficient u1, to obtain a second attenuation coefficient u2. The first attenuation coefficient partial derivative function is:
u2 = u1 - α1·∂loss(u1, d1)/∂u
In this function, α1 takes a value of, for example, 0.0001, d1 takes a value of, for example, 1, and u1 takes a value of, for example, 0.01. loss(u1, d1) represents substituting u1 and d1 into the loss function loss(u, d).
Further, if the second attenuation coefficient u2 is smaller than a first threshold, u2 is updated according to a first preset value. If u2 is greater than or equal to the first threshold, u2 remains unchanged. The first threshold is, for example, 0, and the first preset value is, for example, 0.
Then, a partial-derivative calculation is performed according to the first amplitude partial derivative function, using the preset first learning rate α1, the preset first amplitude d1, and the second attenuation coefficient u2, to obtain a second amplitude d2. The first amplitude partial derivative function is:
d2 = d1 - α1·∂loss(u2, d1)/∂d
In this function, loss(u2, d1) represents substituting u2 and d1 into the loss function loss(u, d).
Correspondingly, if the second amplitude d2 is smaller than a second threshold, d2 is updated according to a second preset value. If d2 is greater than or equal to the second threshold, d2 remains unchanged. The second threshold is, for example, 0, and the second preset value is, for example, 0. The second attenuation coefficient u2 and the second amplitude d2 are thus obtained.
Further, if the second attenuation coefficient u2 is less than or equal to the preset first attenuation coefficient u1, the first learning rate α1 is taken as the second learning rate α2. If u2 is greater than u1, the second learning rate α2 is calculated from the first learning rate α1 according to the first learning rate formula; that is, if the overall loss increases (the calculated attenuation coefficient is larger than the input attenuation coefficient value), the learning rate is adjusted. The first learning rate formula is:
α2 = 0.5·α1
Further, similarly to the calculation of the second attenuation coefficient u2, a partial-derivative calculation is performed according to the second attenuation coefficient partial derivative function, using the second learning rate α2, the second amplitude d2, and the second attenuation coefficient u2, to obtain a third attenuation coefficient u3. The second attenuation coefficient partial derivative function is:
u3 = u2 - α2·∂loss(u2, d2)/∂u
In this function, loss(u2, d2) represents substituting u2 and d2 into the loss function loss(u, d).
Correspondingly, if the third attenuation coefficient u3 is smaller than the first threshold, u3 is updated according to the first preset value. If u3 is greater than or equal to the first threshold, u3 remains unchanged.
Then, similarly to the calculation of the second amplitude d2, a partial-derivative calculation is performed according to the second amplitude partial derivative function, using the second learning rate α2, the second amplitude d2, and the third attenuation coefficient u3, to obtain a third amplitude d3. The second amplitude partial derivative function is:
d3 = d2 - α2·∂loss(u3, d2)/∂d
In this function, loss(u3, d2) represents substituting u3 and d2 into the loss function loss(u, d).
Correspondingly, if the third amplitude d3 is smaller than the second threshold, d3 is updated according to the second preset value. If d3 is greater than or equal to the second threshold, d3 remains unchanged. The third attenuation coefficient u3 and the third amplitude d3 are thus obtained.
Correspondingly, if the third attenuation coefficient u3 is less than or equal to the second attenuation coefficient u2, the second learning rate α2 is taken as the third learning rate α3. If u3 is greater than u2, the third learning rate α3 is calculated from the second learning rate α2 according to the second learning rate formula. The second learning rate formula is:
α3 = 0.5·α2
Similarly, a partial-derivative calculation is performed according to the nth attenuation coefficient partial derivative function, using the nth learning rate α_n, the nth amplitude d_n, and the nth attenuation coefficient u_n, to obtain the (n+1)th attenuation coefficient u_{n+1}, where n is a positive integer greater than or equal to 3. The nth attenuation coefficient partial derivative function is:
u_{n+1} = u_n - α_n·∂loss(u_n, d_n)/∂u
In this function, loss(u_n, d_n) represents substituting u_n and d_n into the loss function loss(u, d).
Correspondingly, if the (n+1)th attenuation coefficient u_{n+1} is smaller than the first threshold, u_{n+1} is updated according to the first preset value. If u_{n+1} is greater than or equal to the first threshold, u_{n+1} remains unchanged.
Similarly, a partial-derivative calculation is performed according to the nth amplitude partial derivative function, using the nth learning rate α_n, the nth amplitude d_n, and the (n+1)th attenuation coefficient u_{n+1}, to obtain the (n+1)th amplitude d_{n+1}, where n is a positive integer greater than or equal to 3. The nth amplitude partial derivative function is:
d_{n+1} = d_n - α_n·∂loss(u_{n+1}, d_n)/∂d
In this function, loss(u_{n+1}, d_n) represents substituting u_{n+1} and d_n into the loss function loss(u, d).
Correspondingly, if the (n+1)th amplitude d_{n+1} is smaller than the second threshold, d_{n+1} is updated according to the second preset value. If d_{n+1} is greater than or equal to the second threshold, d_{n+1} remains unchanged. The (n+1)th attenuation coefficient u_{n+1} and the (n+1)th amplitude d_{n+1} are thus obtained.
Correspondingly, if the (n+1)th attenuation coefficient u_{n+1} is less than or equal to the nth attenuation coefficient u_n, the learning rate α_n is taken as the (n+1)th learning rate α_{n+1}. If u_{n+1} is greater than u_n, the (n+1)th learning rate α_{n+1} is calculated from the nth learning rate α_n according to the nth learning rate formula, where n is a positive integer greater than or equal to 3. The nth learning rate formula is:
α_{n+1} = 0.5·α_n
Further, if the attenuation coefficients calculated over a predetermined number of consecutive iterations (e.g., 2, 3, or 21 iterations) are the same and the calculated amplitudes are also the same, the attenuation coefficient obtained over those iterations is determined as the target attenuation coefficient (i.e., the attenuation coefficient of the room impulse response r1(t)). Correspondingly, if any two of the attenuation coefficients calculated over the predetermined number of iterations differ, or any two of the calculated amplitudes differ, the calculation continues according to the nth attenuation coefficient partial derivative function, the nth amplitude partial derivative function, and the nth learning rate formula.
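The iteration above can be summarised as alternating gradient-descent updates with non-negativity clipping and learning-rate halving. The sketch below follows that structure only; the exact loss function and analytic partial derivatives of the patent are not reproduced, so a log-domain squared error with correction term eps and numerical partial derivatives are used here purely as assumptions.

```python
import numpy as np

def fit_attenuation(rir, sample_rate=16000, u0=0.01, d0=1.0, lr0=1e-4,
                    eps=0.01, n_iter=200, patience=3):
    """Alternating gradient descent on (u, d) for |rir(t)| ~ d * exp(-u * t),
    with clipping to 0 and learning-rate halving when u increases."""
    t = np.arange(len(rir)) / sample_rate
    target = np.abs(rir)

    def loss(u, d):  # assumed form of loss(u, d); the patent's exact loss is not shown
        return np.sum((np.log(target + eps) - np.log(d * np.exp(-u * t) + eps)) ** 2)

    def grads(u, d, h=1e-6):  # numerical partial derivatives of loss(u, d)
        gu = (loss(u + h, d) - loss(u - h, d)) / (2 * h)
        gd = (loss(u, d + h) - loss(u, d - h)) / (2 * h)
        return gu, gd

    u, d, lr, unchanged = u0, d0, lr0, 0
    for _ in range(n_iter):
        gu, _ = grads(u, d)
        u_new = max(u - lr * gu, 0.0)   # update u, clip to the first preset value (0)
        _, gd = grads(u_new, d)
        d_new = max(d - lr * gd, 0.0)   # update d, clip to the second preset value (0)
        if u_new > u:                   # attenuation coefficient grew: halve the learning rate
            lr *= 0.5
        unchanged = unchanged + 1 if (np.isclose(u_new, u) and np.isclose(d_new, d)) else 0
        u, d = u_new, d_new
        if unchanged >= patience:       # stable for a predetermined number of iterations
            break
    return u, d
```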
Step S222, candidate audio data is generated according to the attenuation coefficient and the preset noise signal.
In this embodiment, the attenuation coefficient measuring module 12 inputs the attenuation coefficient of the initial audio signal to the first signal generator 13, and the first signal generator 13 then generates candidate audio data from the attenuation coefficient and the preset noise signal. It is readily appreciated that the attenuation coefficient of the candidate audio data is identical to that of the initial audio signal input to the first mixing module 11 (i.e., the room impulse response r1(t)).
For example, the preset noise signal σ(t) is white Gaussian noise with a mean of 0 and a variance of 1, and the attenuation coefficient of the room impulse response r1(t) is u32 (the target attenuation coefficient determined above). The candidate audio data r3(t) can then be determined by the following formula:
r3(t) = σ(t)·e^(-u32·t)
In the formula for r3(t), e represents the natural constant and t represents time.
Further, a schematic diagram of candidate audio data may refer to fig. 7.
Fig. 7 is a schematic diagram of candidate audio data in an embodiment of the invention. As shown in fig. 7, the amplitude of the candidate audio data gradually decreases as the number of sampling points increases.
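Under the formula above, generating candidate audio data reduces to multiplying unit-variance white Gaussian noise by a decaying exponential. The sketch below assumes the time variable is expressed in seconds; the exact time scaling of the attenuation coefficient is not specified in the description and is an assumption here.

```python
import numpy as np

def make_candidate_audio(u, duration_s=1.0, sample_rate=16000, seed=None):
    """Candidate audio r3(t) = sigma(t) * exp(-u * t), with sigma(t) white
    Gaussian noise of mean 0 and variance 1."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    sigma = rng.standard_normal(t.size)
    return sigma * np.exp(-u * t)

r3 = make_candidate_audio(u=4.0, seed=3)  # amplitude decays as the sampling points increase
```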
Alternatively, the first signal generator 13 may generate a plurality of candidate audio data from the attenuation coefficient and different preset noise signals. In the following description, the candidate audio data r3(t) is used as an example.
Step S223, determining the audio data to be enhanced, which is the same as the attenuation coefficient, from a plurality of initial audio signals.
Alternatively, the initial audio signal input to the first mixing module 11 (i.e., the room impulse response function r1(t)) may be determined as the audio data to be enhanced.
Alternatively, the attenuation coefficient measuring module 12 may determine the attenuation coefficient corresponding to each initial audio signal, and determine, among the plurality of initial audio signals, at least one other initial audio signal whose attenuation coefficient is identical to that of the room impulse response input to the first mixing module 11 as the audio data to be enhanced. In the following description, the audio data to be enhanced is taken to be the room impulse response function r1(t) as an example.
Step S224, performing audio fusion processing on the audio data to be enhanced and the candidate audio data to generate second enhanced audio.
In this embodiment, the fusion module 14 performs audio fusion processing according to the weight information corresponding to the audio data to be enhanced and the candidate audio data, so as to generate the second enhanced audio. It is easy to understand that the attenuation coefficient of the second enhanced audio is the same as the attenuation coefficients of the audio data to be enhanced and the candidate audio data.
Alternatively, multiple sets of weight information corresponding to the audio data to be enhanced and the candidate audio data may be set first. The fusion module 14 performs audio fusion processing according to the plurality of sets of weight information corresponding to the audio data to be enhanced and the candidate audio data, so as to obtain a plurality of fusion audios, so as to determine the second enhanced audio. For example, the sets of weight information corresponding to the audio data to be enhanced and the candidate audio data are (E1, E2), (E3, E4).
Optionally, the fusion module 14 may also randomly generate weight information corresponding to the audio data to be enhanced and the candidate audio data.
For example, the audio data to be enhanced is r1(t) with corresponding weight β, and the candidate audio data is r3(t) with corresponding weight (1 - β). The second enhanced audio r4(t) can then be determined by the following formula:
r4(t) = β·r1(t) + (1 - β)·r3(t)
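A one-line weighted fusion implements this formula; the optional random generation of β shown below mirrors the randomly generated weight information mentioned above and is otherwise an assumption made for the example.

```python
import numpy as np

def fuse(audio_to_enhance, candidate, beta=None, rng=None):
    """Weighted fusion r4(t) = beta * r1(t) + (1 - beta) * r3(t)."""
    rng = rng or np.random.default_rng()
    if beta is None:
        beta = rng.uniform(0.0, 1.0)  # randomly generated weight information
    n = min(len(audio_to_enhance), len(candidate))
    return beta * audio_to_enhance[:n] + (1 - beta) * candidate[:n]
```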
in this embodiment, an audio fusion process and a mixing process may be performed on the initial audio signal to generate an enhanced audio data set.
Alternatively, the room impulse response and different preset noise signals (e.g., gaussian white noise, brown noise, etc.) may be respectively convolved to obtain the fused audio signal. The fused audio signal and the different preset noise signals are then subjected to a mixing process to generate an enhanced audio data set.
Optionally, the enhanced audio data set may comprise third enhanced audio. Specifically, the attenuation coefficient measuring module 12, the first signal generator 13, the fusion module 14, and the second mixing module 15 may perform audio fusion processing and mixing processing on the initial audio signal. The audio fusion processing is similar to steps S221 to S224 above and is not repeated here. Correspondingly, step S200 may further comprise steps S225 and S226; see fig. 8.
Fig. 8 is a flow chart of a method of generating third enhanced audio in an embodiment of the invention. As shown in fig. 8, the process of generating the third enhanced audio of the present embodiment includes the steps of:
step S225, a plurality of audio segments are intercepted from the second enhanced audio, and a preset time interval is reserved between the audio segments.
In this embodiment, the fusion module 14 inputs the second enhanced audio into the second mixing module 15, and the second mixing module 15 intercepts a plurality of audio segments from the second enhanced audio, with a predetermined time interval between the audio segments. The specific implementation is similar to step S211 and is not repeated here.
Step S226, performing a mixing process on each audio segment to generate a third enhanced audio.
In this embodiment, the second mixing module 15 performs mixing processing on the audio segments to generate the third enhanced audio. The specific implementation is similar to step S212 and is not repeated here.
Optionally, the enhanced audio data set may further comprise fourth enhanced audio. A room impulse response in the plurality of initial audio signals may be determined as fourth enhanced audio. The preset noise signal, the environmental audio, etc. among the plurality of initial audio signals may also be determined as the fourth enhanced audio according to the need. Alternatively, all of the plurality of initial audio signals may be determined as fourth enhanced audio.
Optionally, the second signal generator 16 may generate the fourth enhanced audio from a preset noise signal among the plurality of initial audio signals and a preset attenuation coefficient; the specific implementation is similar to step S222 and is not repeated here. The preset attenuation coefficient may be determined in various ways, for example, randomly generated by the second signal generator 16 or obtained as a data set of attenuation coefficients from a network. Similarly, the preset noise signal may be acquired in advance or randomly generated by the second signal generator 16. It is readily appreciated that the fourth enhanced audio may also be generated from any initial audio signal and the preset attenuation coefficient as required.
And step S300, performing reverberation processing on the audio signals in the enhanced audio data set and preset audio signals to generate training audio data.
Optionally, the audio signal in the enhanced audio data set and the preset audio signal may be subjected to reverberation processing by an application program, a neural network model, or the like to generate the training audio data. Examples of such application programs include Adobe Audition, Avid Pro Tools, and Steinberg Cubase. Adobe Audition is widely used in reverberant audio design; Avid Pro Tools is a common computer audio application with reverberation, editing, control, and effect-processing functions; Steinberg Cubase is audio production software that provides multi-channel reverberation processing among other functions.
In this embodiment, the first reverberation module 17 and/or the second reverberation module 18 may perform reverberation processing on the audio signal in the enhanced audio data set and the preset audio signal to generate the training audio data.
Alternatively, the type of the preset audio signal may be determined according to the application scenario. For example, for a speech recognition scenario, the preset audio signal may be the target speech to be recognized. For a speech enhancement scenario, the preset audio signal may be the target speech to be enhanced. For a scenario in which noise (e.g., a reverberant audio signal) is suppressed, the preset audio signal may be non-reverberant audio data. In the following description, the preset audio signal is taken to be non-reverberant audio data as an example.
In this embodiment, the non-reverberant audio data can be acquired in various ways. For example, sound-source audio can be captured by an audio collection device in an anechoic chamber or recording room; the walls and floors of such rooms are usually designed with sound-absorbing materials (such as plant fiber and foam), which effectively prevents the sound waves emitted by the source from being reflected by obstacles, so that clean source audio is collected. As another example, the non-reverberant audio data may be obtained by custom generation, acquisition from a network, and other means; the specific implementation is similar to step S100 and is not repeated here.
Alternatively, the first reverberation module 17 may convolve any audio signal in the enhanced audio data set with the non-reverberant audio data to generate the training audio data. The resulting training audio data can thus simulate the effect of the non-reverberant audio data passing through a stable reverberant field. Moreover, since the first enhanced audio, second enhanced audio, third enhanced audio, and fourth enhanced audio in the enhanced audio data set may each include a plurality of audios, the first reverberation module 17 may convolve each audio signal in the enhanced audio data set with given non-reverberant audio data to obtain a large amount of training audio data. A sketch of this convolution step follows.
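A minimal sketch of the convolution step (assuming SciPy); the peak normalisation is an added convenience for the example, not part of the described method.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry_audio, enhanced_audio_set):
    """Convolve non-reverberant (dry) audio with each signal in the enhanced
    audio data set, simulating a stable reverberant field."""
    training = []
    for h in enhanced_audio_set:
        y = fftconvolve(dry_audio, h, mode="full")
        y = y / (np.max(np.abs(y)) + 1e-12)  # simple peak normalisation
        training.append(y)
    return training
```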
Optionally, the second reverberation module 18 may perform reverberation processing on the audio signals in the enhanced audio data set and the preset audio signal according to preset weight information to generate the training audio data. Specifically, step S300 includes steps S310 to S330; see fig. 9.
Fig. 9 is a flow chart of a method of generating training audio data in an embodiment of the invention. As shown in fig. 9, the process of generating training audio data of the present embodiment includes the steps of:
step S310, a first audio signal and a second audio signal with the same attenuation coefficient are obtained from the enhanced audio data set.
In this embodiment, the second reverberation module 18 obtains the first and second audio signals having the same attenuation coefficients from the enhanced audio data set. For example, if the first audio signal is a certain initial audio signal, the second audio signal may be a second enhanced audio or a third enhanced audio corresponding to the initial audio signal.
Step S320, convolving the first audio signal and the second audio signal with a preset audio signal respectively to obtain a third audio signal and a fourth audio signal.
In this embodiment, the second reverberation module 18 performs convolution processing on the first audio signal and the preset audio signal to obtain a third audio signal, and performs convolution processing on the second audio signal and the preset audio signal to obtain a fourth audio signal.
Step S330, performing weighted calculation according to weight information corresponding to the third audio signal and the fourth audio signal to obtain training audio data, wherein the weight information is determined according to a predetermined corresponding relation between the sampling frequency and the weight value of the audio signal.
In the present embodiment, it is considered that in practical use, reverberation may vary unstably with time. In this case, the present embodiment performs weighting calculation according to the weight information corresponding to the third audio signal and the fourth audio signal by determining the correspondence between the sampling frequency of the audio signal and the weight value in advance, so as to obtain training audio data. Therefore, the obtained training audio data can simulate the effect that the reverberant-free audio data pass through the unstable reverberant field changing along with time, so that the quality of the training audio data is improved.
Optionally, weight values corresponding to different sampling-frequency ranges are preset, and the corresponding weight value can be determined directly from the sampling frequencies of the third audio signal and the fourth audio signal. For example, if the weight value corresponding to audio signals with sampling frequencies in the range 15000 Hz to 18000 Hz is preset to F1, then the weight value corresponding to the third audio signal is F1 when its sampling frequency is 16000 Hz.
Optionally, the corresponding relation between each sampling frequency (that is, the sampling point corresponding to the sampling frequency) and the weight value is preset, and the corresponding weight information can be obtained by calculating according to the sampling frequencies corresponding to the third audio signal and the fourth audio signal.
For example, for an audio signal having a time length of 1 second and a sampling frequency of 16000Hz, the number of samples characterizing the signal within 1 second is 16000. The weight value corresponding to the sampling frequency of 11Hz-8000Hz may be preset to 0, that is, the weight value corresponding to the 11 th sampling point to the 8000 th sampling point within the 1 second is 0.
For another example, for an audio signal with a time length of 2 seconds and a sampling frequency of 16000 Hz. The weight value corresponding to the sampling frequency of 11Hz to 8000Hz may be set to 0 in advance, that is, the weight value corresponding to the 11 th to 8000 th sampling points in each second is set to 0. It is easy to understand that since the audio signal has a time length of 2 seconds and a sampling frequency of 16000Hz, the audio signal has 32000 sampling points in total, and the preset weight value corresponding to the sampling frequency of 11Hz-8000Hz is 0, that is, the weight values corresponding to the 11 th to 8000 th sampling points and the 16011 th to 24000 th sampling points are 0.
Specifically, step S330 may include steps S331 to S334; see fig. 10.
Fig. 10 is a flowchart of a method of determining weight information in an embodiment of the present invention. As shown in fig. 10, the process of determining weight information of the present embodiment includes the steps of:
step S331, based on a predetermined correspondence between the sampling frequency of the audio signal and the weight value, generating a corresponding first weight sequence according to the sampling frequency of the preset audio signal.
In this embodiment, the correspondence between the sampling points corresponding to the sampling frequencies and the weight values may be preset. A weight value may be a preset constant, or it may be a generated random number. In the following description, the weight values are taken to be random numbers, and the first weight sequence corresponding to the preset audio signal is generated as a random sequence, as an example.
For example, the time length of the non-reverberant audio data (i.e., the preset audio signal) is 1 second and the sampling frequency is 16000 Hz. Since the sampling frequency characterizes the number of samples (i.e., sampling points) extracted per unit time from the continuous signal to form the discrete signal, the product of the time length and the sampling frequency gives a total of 16000 sampling points for the non-reverberant audio data. The generated first weight sequence may refer to the following formula:
X(n1) = randn1(0,1) + randn1(0,1)·i,  for 1 < n1 ≤ 10
X(n1) = 0,                            for 10 < n1 ≤ 8000
X(n1) = conj(X(16001 - n1)),          for 8001 < n1 ≤ 16000
In the formula, n1 is a positive integer representing the n1-th sampling point, X(n1) represents the first weight sequence, randn1(0,1) represents a Gaussian distribution with mean 0 and variance 1, i represents the imaginary unit, and conj represents the complex conjugate. The piece "X(n1) = conj(X(16001 - n1)), 8001 < n1 ≤ 16000" indicates that the first weight sequence X(n1) corresponding to the 8002nd to 16000th sampling points equals the complex conjugate of X(16001 - n1). That is, in the FFT (Fast Fourier Transform) domain of the non-reverberant audio data, a slowly varying random sequence with the same time length as the non-reverberant audio data is generated. Here, "slowly varying" means that only a small number of sampling points of the first weight sequence X(n1) have non-zero weight values (e.g., for 1 < n1 ≤ 10, X(n1) = randn1(0,1) + randn1(0,1)·i). Decomposing a time-domain signal (i.e., the non-reverberant audio data) by FFT expresses it as a sum of sinusoidal signals of different frequencies; the lower the frequency of a sinusoidal signal, the longer its wavelength and the slower its variation. By setting the weight values of the sampling points corresponding to each sinusoidal signal in this way, the first weight sequence X(n1) becomes a slowly varying random sequence in the time domain. That is, a random frequency-domain signal whose energy is concentrated at low frequencies is generated first, and it is then converted by the inverse Fourier transform into a random sequence that varies slowly over the time domain. A low-frequency random frequency-domain signal here refers to a first weight sequence X(n1) in which only the low-frequency sampling points have non-zero weight values, for example the 2nd to 10th sampling points with X(n1) = randn1(0,1) + randn1(0,1)·i.
In the formula for the first weight sequence X(n1), the upper limit of n1 can be determined from the total number of sampling points of the non-reverberant audio data; that is, because the total number of sampling points of the non-reverberant audio data is 16000, the upper limit of n1 in the formula is 16000, so that the generated first weight sequence has the same time length as the non-reverberant audio data.
Similarly, if a first piece of non-reverberant audio data has a time length of 10 seconds and a sampling frequency of 16000 Hz, a random sequence with a total of 160000 sampling points is required. The first non-reverberant audio data may be split into 10 segments of non-reverberant audio data, each with a time length of 1 second and a sampling frequency of 16000 Hz, and the random sequences corresponding to these 10 segments (i.e., the corresponding first weight sequences X(n1)) may then be combined to obtain the weight sequence for the first non-reverberant audio data.
In the formula for the first weight sequence X(n1), the value range of n1 can be adjusted and customized according to requirements; that is, the weight values corresponding to different sampling points can be custom set as needed. Different random sequences can thus be obtained by changing the weight values corresponding to the sampling points, so that the training audio data obtained later can simulate the effect of non-reverberant audio data passing through different time-varying reverberant fields.
For example, if the weight values corresponding to the 21st to 8000th sampling points are all set to 0, the first weight sequence X(n1) is adjusted to obtain a second weight sequence X(n2). The second weight sequence may refer to the following formula:
X(n2) = randn2(0,1) + randn2(0,1)·i,  for 1 < n2 ≤ 20
X(n2) = 0,                            for 20 < n2 ≤ 8000
X(n2) = conj(X(16001 - n2)),          for 8001 < n2 ≤ 16000
As with the first weight sequence X(n1), in the formula for the second weight sequence X(n2), n2 is a positive integer representing the n2-th sampling point, randn2(0,1) represents a Gaussian distribution with mean 0 and variance 1, i represents the imaginary unit, and conj represents the complex conjugate. The piece "X(n2) = conj(X(16001 - n2)), 8001 < n2 ≤ 16000" indicates that the second weight sequence X(n2) corresponding to the 8002nd to 16000th sampling points equals the complex conjugate of X(16001 - n2). Whereas the first weight sequence has X(n1) = 0 for the 11th to 20th sampling points, the second weight sequence has X(n2) = randn2(0,1) + randn2(0,1)·i for the 11th to 20th sampling points. The second weight sequence X(n2) therefore has more sampling points with non-zero weight values, so X(n2) varies more rapidly than X(n1). That is, the more sampling points of the random sequence have non-zero weight values, the more rapidly the random sequence varies, so that the training audio data obtained later can simulate the effect of non-reverberant audio data passing through different time-varying reverberant fields.
Step S332, performing inverse fourier transform on the first weight sequence to obtain a time domain signal.
In this embodiment, an inverse Fourier transform may be performed on the first weight sequence X(n1) to obtain the time-domain signal γ(t), with reference to the following formula:
γ(t) = IFFT(X(n1))
In the formula, IFFT represents the inverse fast Fourier transform, and t represents time.
Step S333, performing normalization processing on the time domain signal to obtain a second weight sequence.
In this embodiment, the second weight sequence, denoted here as w(t), is obtained by normalizing the time domain signal γ(t) with reference to the following formula:

w(t) = (γ(t) - min(γ(t))) / (max(γ(t)) - min(γ(t)))

In the formula, t represents time, max(γ(t)) represents the maximum value of the time domain signal γ(t), and min(γ(t)) represents the minimum value of the time domain signal γ(t).
Step S334, determining weight information corresponding to the third audio signal and the fourth audio signal according to the second weight sequence.
Alternatively, the second weight sequence w(t) may be determined as the weight information of the third audio signal, with 1 - w(t) as the weight information of the fourth audio signal; or, conversely, w(t) may be determined as the weight information of the fourth audio signal, with 1 - w(t) as the weight information of the third audio signal.
In the following description, an example is taken in which the third audio signal is s₁(t), the weight information corresponding to the third audio signal is the second weight sequence w(t), the fourth audio signal is s₂(t), and the weight information corresponding to the fourth audio signal is 1 - w(t). Specifically, a weighted calculation is performed on the third audio signal s₁(t) and the fourth audio signal s₂(t) according to the corresponding weight information w(t) and 1 - w(t) to obtain the training audio data S(t), with reference to the following formula:

S(t) = w(t)·s₁(t) + (1 - w(t))·s₂(t)
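As a hedged illustration of this weighted combination, the following sketch convolves the same non-reverberant signal with two room impulse responses and crossfades the results with the normalized weight sequence. The use of 1 - w(t) as the complementary weight, as well as the inputs dry, rir_a and rir_b, are assumptions made for the example rather than details taken from the original formulas.

import numpy as np
from scipy.signal import fftconvolve

def mix_time_varying_reverb(dry, rir_a, rir_b, w):
    """Crossfade two reverberant versions of the same non-reverberant
    signal with a time-varying weight sequence w in [0, 1]."""
    s1 = fftconvolve(dry, rir_a)[: len(dry)]   # third audio signal s1(t)
    s2 = fftconvolve(dry, rir_b)[: len(dry)]   # fourth audio signal s2(t)
    return w * s1 + (1.0 - w) * s2             # training audio data S(t)

# Hypothetical inputs: 1 second of audio at 16 kHz and two short impulse responses.
dry = np.random.randn(16000)
rir_a, rir_b = np.random.randn(4000) * 0.1, np.random.randn(4000) * 0.1
w = np.linspace(0.0, 1.0, 16000)
training_audio = mix_time_varying_reverb(dry, rir_a, rir_b, w)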
further, a training data set is generated from the plurality of initial audio signals and the corresponding training audio data generated by the first and second reverberation modules 17, 18, and then the audio processing model may be trained according to the training data set to obtain the target model.
For example, suppose the audio processing model is a convolutional neural network and the target model is used to suppress reverberant audio signals. The data in the training data set is first pre-processed, for example normalized (e.g., converted to a target format) and subjected to feature extraction (e.g., frequency, amplitude and other features of the audio). A loss function of the convolutional neural network is determined, for example the MSE (mean square error) function. The data in the training data set is then used as the input of the convolutional neural network to obtain output information, and the non-reverberant audio data is used as the target information to be output by the convolutional neural network. The loss value between the output information and the target information is calculated according to the loss function. The parameters of the convolutional neural network (such as the number of nodes and the activation functions) are then adjusted according to the loss value, so as to train the convolutional neural network and obtain the target model.
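The training procedure can be illustrated with a minimal PyTorch sketch. The network architecture, feature shapes and data below are placeholders; only the overall flow (forward pass, MSE loss against the non-reverberant target, parameter update) follows the description above.

import torch
import torch.nn as nn

# Placeholder 1-D convolutional model; the actual architecture is not specified here.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(reverberant, clean):
    """One optimization step: reverberant training audio in, non-reverberant
    target out, MSE loss, gradient update."""
    optimizer.zero_grad()
    output = model(reverberant)          # shape: (batch, 1, samples)
    loss = loss_fn(output, clean)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical batch: 8 one-second clips at 16 kHz.
reverberant = torch.randn(8, 1, 16000)
clean = torch.randn(8, 1, 16000)
print(train_step(reverberant, clean))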
The embodiment of the invention generates the enhanced audio data set by acquiring a plurality of initial audio signals and then performing predetermined processing on at least one initial audio signal, wherein the predetermined processing comprises audio fusion processing and/or audio mixing processing. And performing reverberation processing on the audio signals in the enhanced audio data set and the preset audio signals to generate training audio data. Therefore, training audio data can be generated according to the initial audio signal, and the method has high applicability.
Fig. 11 is a schematic diagram of an audio processing apparatus according to an embodiment of the present invention. As shown in fig. 11, the audio processing apparatus of this embodiment includes an audio acquisition unit 401, an audio processing unit 402 and an audio generation unit 403. The audio acquisition unit 401 is configured to acquire a plurality of initial audio signals. The audio processing unit 402 is configured to perform predetermined processing on at least one of the initial audio signals to generate an enhanced audio data set, where the predetermined processing includes audio fusion processing and/or audio mixing processing. The audio generation unit 403 is configured to perform reverberation processing on the audio signals in the enhanced audio data set and a preset audio signal to generate training audio data.
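Purely as an illustrative mapping of the three units onto code, the pipeline might be organized as below; the class name, parameter names and callables are hypothetical and only mirror reference numerals 401 to 403.

from typing import Callable, List
import numpy as np

class AudioProcessingApparatus:
    """Toy mirror of units 401-403: acquisition, predetermined processing,
    and reverberation-based training-data generation."""

    def __init__(self,
                 acquire: Callable[[], List[np.ndarray]],
                 enhance: Callable[[List[np.ndarray]], List[np.ndarray]],
                 reverberate: Callable[[np.ndarray, np.ndarray], np.ndarray]):
        self.acquire = acquire          # audio acquisition unit (401)
        self.enhance = enhance          # audio processing unit (402)
        self.reverberate = reverberate  # audio generation unit (403)

    def generate_training_data(self, preset: np.ndarray) -> List[np.ndarray]:
        initial = self.acquire()
        enhanced = self.enhance(initial)
        return [self.reverberate(signal, preset) for signal in enhanced]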
The embodiment of the invention generates the enhanced audio data set by acquiring a plurality of initial audio signals and then performing predetermined processing on at least one initial audio signal, wherein the predetermined processing comprises audio fusion processing and/or audio mixing processing. And performing reverberation processing on the audio signals in the enhanced audio data set and the preset audio signals to generate training audio data. Therefore, training audio data can be generated according to the initial audio signal, and the method has high applicability.
Fig. 12 is a schematic diagram of an electronic device according to an embodiment of the invention. As shown in fig. 12, the electronic device is a general-purpose audio processing apparatus comprising a general-purpose computer hardware structure that includes at least a processor 510 and a memory 520. The processor 510 and the memory 520 are connected by a bus 530. The memory 520 is adapted to store instructions or programs executable by the processor 510. The processor 510 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 510 implements the processing of data and the control of other devices by executing the instructions stored in the memory 520, so as to perform the method flows of the embodiments of the invention described above. The bus 530 connects the above components together and also connects them to a display controller 540, a display device and an input/output (I/O) device 550. The input/output (I/O) device 550 may be a mouse, keyboard, modem, network interface, touch input device, somatosensory input device, printer, or other device known in the art. Typically, the input/output device 550 is connected to the system through an input/output (I/O) controller 560.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each of the flows in the flowchart may be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the present invention is directed to a non-volatile storage medium storing a computer readable program for causing a computer to perform some or all of the method embodiments described above.
That is, it will be understood by those skilled in the art that all or part of the steps of the method embodiments described above may be implemented by a program instructing relevant hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing description covers only the preferred embodiments of the present application and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within its scope.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring a plurality of initial audio signals;
performing predetermined processing on at least one of the initial audio signals to generate an enhanced audio data set, wherein the predetermined processing comprises audio fusion processing and/or audio mixing processing;
and carrying out reverberation processing on the audio signals in the enhanced audio data set and preset audio signals to generate training audio data.
2. The method of claim 1, wherein the enhanced audio data set comprises first enhanced audio, and wherein the predetermined processing of at least one of the initial audio signals to generate the enhanced audio data set comprises:
intercepting a plurality of audio segments from the initial audio signal, each of the audio segments having a predetermined time interval therebetween;
and mixing each audio segment to generate the first enhanced audio.
3. The method of claim 1, wherein the enhanced audio data set comprises a second enhanced audio, and wherein the predetermined processing of at least one of the initial audio signals to generate the enhanced audio data set comprises:
determining an attenuation coefficient of the initial audio signal;
generating candidate audio data according to the attenuation coefficient and a preset noise signal;
determining, from the plurality of initial audio signals, audio data to be enhanced that has the same attenuation coefficient;
and carrying out audio fusion processing on the audio data to be enhanced and the candidate audio data to generate the second enhanced audio.
4. The method of claim 3, wherein the enhanced audio data comprises third enhanced audio, and wherein the predetermined processing of at least one of the initial audio signals to generate an enhanced audio data set further comprises:
intercepting a plurality of audio segments from the second enhanced audio, each audio segment having a predetermined time interval therebetween;
and mixing each audio segment to generate the third enhanced audio.
5. The method of claim 1, wherein the plurality of initial audio signals comprise a room impulse response and a preset noise signal, the enhanced audio data set comprises a fourth enhanced audio, the method further comprising:
determining the room impulse response as the fourth enhanced audio; and/or
and generating the fourth enhanced audio according to a preset attenuation coefficient and the preset noise signal.
6. The method of claim 1, wherein reverberation processing the audio signals in the enhanced audio dataset with a preset audio signal to generate training audio data comprises:
and convolving any audio signal in the enhanced audio data set with the preset audio signal to generate the training audio data.
7. The method of claim 1, wherein reverberation processing the audio signals in the enhanced audio dataset with a preset audio signal to generate training audio data comprises:
acquiring a first audio signal and a second audio signal with the same attenuation coefficient from the enhanced audio data set;
performing convolution processing on the first audio signal and the second audio signal, respectively, with the preset audio signal to obtain a third audio signal and a fourth audio signal;
performing a weighted calculation on the third audio signal and the fourth audio signal according to the corresponding weight information to obtain the training audio data;
the weight information is determined according to a predetermined corresponding relation between the sampling frequency of the audio signal and the weight value.
8. The method according to claim 1, wherein the method further comprises:
determining a training data set from a plurality of the initial audio signals and the training audio data;
and training the audio processing model according to the training data set.
9. A computer readable storage medium, on which computer program instructions are stored, which computer program instructions, when executed by a processor, implement the method of any of claims 1-8.
10. An electronic device, the electronic device comprising:
a memory and a processor for storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.
CN202410065674.3A 2024-01-16 2024-01-16 Audio processing method and electronic equipment Pending CN117854488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410065674.3A CN117854488A (en) 2024-01-16 2024-01-16 Audio processing method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410065674.3A CN117854488A (en) 2024-01-16 2024-01-16 Audio processing method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117854488A true CN117854488A (en) 2024-04-09

Family

ID=90528928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410065674.3A Pending CN117854488A (en) 2024-01-16 2024-01-16 Audio processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117854488A (en)

Similar Documents

Publication Publication Date Title
Nam et al. Filteraugment: An acoustic environmental data augmentation method
EP3158560B1 (en) Parametric wave field coding for real-time sound propagation for dynamic sources
KR102118411B1 (en) Systems and methods for source signal separation
WO2018008395A1 (en) Acoustic field formation device, method, and program
CN109658935B (en) Method and system for generating multi-channel noisy speech
KR102191736B1 (en) Method and apparatus for speech enhancement with artificial neural network
Ramírez et al. A general-purpose deep learning approach to model time-varying audio effects
US20240177726A1 (en) Speech enhancement
CN108200526A (en) A kind of sound equipment adjustment method and device based on confidence level curve
Tang et al. Low-frequency compensated synthetic impulse responses for improved far-field speech recognition
Forssén et al. Auralization of traffic noise within the LISTEN project—Preliminary results for passenger car pass-by
WO2023246327A1 (en) Audio signal processing method and apparatus, and computer device
WO2023051622A1 (en) Method for improving far-field speech interaction performance, and far-field speech interaction system
CN117854488A (en) Audio processing method and electronic equipment
Scharrer et al. Sound field classification in small microphone arrays using spatial coherences
CN117643075A (en) Data augmentation for speech enhancement
CN113327624B (en) Method for intelligent monitoring of environmental noise by adopting end-to-end time domain sound source separation system
Wang et al. Blind estimation of speech transmission index and room acoustic parameters by using extended model of room impulse response derived from speech signals
Lee et al. Real-Time Sound Synthesis of Audience Applause
Shen et al. Data-driven feedback delay network construction for real-time virtual room acoustics
Cadavid et al. Performance of low frequency sound zones based on truncated room impulse responses
CN114946199A (en) Generating audio signals associated with virtual sound sources
CN113393850A (en) Parameterized auditory filter bank for end-to-end time domain sound source separation system
Rumsey Reverberation and How to Remove It
Pelzer et al. Room modeling for acoustic simulation and auralization tasks: Resolution of structural detail

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination