CN114338623A - Audio processing method, device, equipment, medium and computer program product

Info

Publication number: CN114338623A
Authority: CN (China)
Prior art keywords: audio, data, audio data, voice, frequency
Legal status: Granted
Application number: CN202210007064.9A
Other languages: Chinese (zh)
Other versions: CN114338623B (en)
Inventors: 高毅, 杨清山, 罗程, 李斌, 张思宇
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210007064.9A
Publication of CN114338623A
Application granted
Publication of CN114338623B
Legal status: Active

Abstract

The application discloses an audio processing method, device, equipment, medium and computer program product, relating to the field of audio processing. The method comprises the following steps: acquiring audio features corresponding to audio data, where the audio data is audio to be transmitted as voice and the audio features indicate the energy distribution of the audio data; performing noise suppression processing on the audio data based on the audio features to obtain noise reduction audio data; determining voice detection data based on the energy distribution corresponding to the audio features, where the voice detection data indicates whether a voice signal is present in the audio data; and performing volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, where the target audio data is the audio used for voice transmission. By first determining from the energy distribution of the audio features whether the audio data contains a voice signal and then adjusting the volume of the noise reduction audio data, the gain effect of the audio data during volume adjustment is improved.

Description

Audio processing method, device, equipment, medium and computer program product
Technical Field
The present application relates to the field of audio processing, and in particular, to a method, an apparatus, a device, a medium, and a computer program product for processing audio.
Background
Voice over Internet Protocol (VoIP) is a widely used voice call technology in which multiple users on different terminal devices communicate by voice over the Internet; it can be regarded as an "Internet phone".
Because the audio acquisition hardware differs between terminal devices, or because users speak at different volumes, the voice heard in a call between terminal devices may be too loud or too quiet. In the related art, this problem is addressed by adjusting digital gain settings: for example, the voice sending side increases or decreases the microphone volume setting of the terminal device, or the voice receiving side increases or decreases the playback volume setting.
However, when the volume is adjusted in this manner, increasing the digital gain also amplifies the noise, and directly increasing the microphone volume easily causes saturation distortion of the acquired audio data during digital signal processing, or pushes the acquired audio data beyond the optimal coding volume level of the vocoder, which reduces the coding quality of the encoder and results in poor voice quality.
Disclosure of Invention
The embodiment of the application provides an audio processing method, an audio processing device, audio processing equipment, an audio processing medium and a computer program product, which can improve the gain effect of audio data in the volume adjustment process. The technical scheme is as follows:
in one aspect, a method for processing audio is provided, where the method includes:
acquiring audio characteristics corresponding to audio data, wherein the audio data are audio to be subjected to voice transmission, and the audio characteristics are used for indicating the energy distribution condition of the audio data;
carrying out noise suppression processing on the audio data based on the audio features to obtain noise reduction audio data;
determining voice detection data based on the energy distribution situation corresponding to the audio features, wherein the voice detection data is used for indicating the existence situation of voice signals in the audio data;
and carrying out volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, wherein the target audio data is audio used for voice transmission.
In another aspect, an apparatus for processing audio is provided, the apparatus comprising:
an acquisition module, configured to acquire audio features corresponding to audio data, where the audio data is audio to be transmitted as voice and the audio features indicate the energy distribution of the audio data;
the noise reduction module is used for carrying out noise suppression processing on the audio data based on the audio features to obtain noise reduction audio data;
a detection module, configured to determine voice detection data based on the energy distribution corresponding to the audio feature, where the voice detection data is used to indicate existence of a voice signal in the audio data;
and the processing module is used for carrying out volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, and the target audio data is audio used for carrying out voice transmission.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio processing method according to any of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the audio processing method described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the audio processing method described in any of the above embodiments.
The technical scheme provided by the application at least comprises the following beneficial effects:
when volume adjustment processing is performed on audio data used for voice transmission, noise suppression processing is performed based on the audio features of the audio data to obtain noise reduction audio data; voice detection data is determined according to the energy distribution corresponding to the audio features; and whether volume adjustment needs to be performed on the noise reduction audio data is determined according to the voice detection data, so as to obtain the target audio data after volume processing, thereby improving the gain effect of the audio data during volume adjustment.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for processing audio provided by an exemplary embodiment of the present application;
FIG. 3 is a block diagram of an audio denoising method based on a neural network model according to an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method for processing audio provided by another exemplary embodiment of the present application;
FIG. 5 is a flow chart of a method for processing audio provided by another exemplary embodiment of the present application;
FIG. 6 is a block diagram of speech detection provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for processing audio provided by another exemplary embodiment of the present application;
FIG. 8 is a block diagram of noise reduction processing and speech detection provided by an exemplary embodiment of the present application;
FIG. 9 is a block diagram of an audio processing system provided in an exemplary embodiment of the present application;
FIG. 10 is a block diagram of an audio processing device according to an exemplary embodiment of the present application;
FIG. 11 is a block diagram of an audio processing device according to another exemplary embodiment of the present application;
fig. 12 is a block diagram of a terminal according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are briefly described:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
In the embodiment of the application, machine learning or deep learning is applied to a voice communication technology, and noise reduction and/or voice detection of audio data are completed through a machine learning model or a deep learning model, so that the gain effect of the audio data for transmitting voice signals in the volume adjustment process is improved.
Next, an application scenario of the embodiment of the present application will be schematically described.
First, the audio processing method provided in the embodiment of the present application may be applied to instant voice communication, for example an "Internet phone" between users over the Internet: a terminal device collects the user's voice audio through a microphone, processes the voice audio with the audio processing method provided in the embodiment of the present application and performs volume scaling to obtain gain audio, and transmits the gain audio over the Internet to the terminal device with which a voice connection is currently established, thereby improving the quality of the voice call.
Secondly, the audio processing method provided in the embodiment of the present application may be applied to a recording function: for example, a user records a segment of voice information through the microphone of a terminal device; after the terminal device finishes acquiring the voice information, it preprocesses the audio data corresponding to the voice information with the audio processing method provided in the embodiment of the present application to obtain gain audio, and plays the gain audio as the recording or transmits it to other terminals according to the user's instruction.
The above description of the application scenario only takes the application to the instant voice communication and the recording function as an example, and is not limited to a specific application scenario, and the audio processing method may also be applied to other application scenarios related to audio processing.
The implementation environment of the embodiments of the present application is described in conjunction with the above noun explanations and descriptions of application scenarios. As shown in fig. 1, the computer system of the embodiment includes: a terminal device, a server 120 and a communication network 130, wherein the terminal device comprises a first terminal 111 and a second terminal 112.
The first terminal 111 includes various types of devices such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, and a vehicle-mounted terminal. Illustratively, the first terminal 111 has a target application running therein, the target application being for providing a function of voice communication. Optionally, the target application includes various forms of applications such as a stand-alone application program, a web application, an applet in a host application, or a functional module in an application program, which is not limited herein.
The second terminal 112 includes various types of devices such as a mobile phone, a tablet computer, a desktop computer, a laptop computer, and a vehicle-mounted terminal. Illustratively, the second terminal 112 also has a target application running therein for providing voice communication functionality. Optionally, the target application includes various forms of applications such as a stand-alone application program, a web application, an applet in a host application, or a functional module in an application program, which is not limited herein. The first terminal 111 and the second terminal 112 establish a voice connection relationship through the internet.
The server 120 is configured to provide back-end support for the target application, i.e. the voice connection established between the first terminal 111 and the second terminal 112 is implemented through the server 120. It should be noted that the server 120 may be an independent physical server or a communication base station, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Cloud Technology is a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support for technical network systems whose background services require a large amount of computing and storage resources, such as video websites, picture websites and other web portals. With the rapid development of the Internet industry, each item may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels will be processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
In some embodiments, the server 120 described above may also be implemented as a node in a blockchain system. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as management, basic services, smart contracts and operation. The management module is responsible for identity management of all blockchain participants, including maintaining public/private key generation (account management) and key management; the basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus is reached on valid requests, record them to storage: for a new service request, the basic service first performs interface adaptation analysis and authentication (interface adaptation), encrypts the service information through a consensus algorithm (consensus management), and transmits it completely and consistently to the shared ledger (network communication) for recording and storage; the smart contract module is responsible for contract registration and issuance, contract triggering and contract execution: developers can define contract logic in a programming language and publish it to the blockchain (contract registration), and the contract is triggered by keys or other events and executed according to the logic of its clauses, with functions for upgrading and cancelling contracts also provided; the operation module is mainly responsible for deployment, configuration modification, contract settings and cloud adaptation during product release, as well as visual output of real-time status during product operation.
In some embodiments, taking the first terminal 111 as the current sender of audio data and the second terminal 112 as the receiver of the audio data as an example: a microphone of the first terminal 111 records the sound around the terminal to obtain audio data; noise reduction audio data and voice detection data are obtained according to the audio features corresponding to the audio data; volume scaling processing is performed according to the noise reduction audio data and the voice detection data to obtain target audio data; the first terminal 111 transmits the target audio data to the server 120; the server 120 determines the second terminal 112 that has established a voice connection with the first terminal 111 and forwards the target audio data to it; and the second terminal 112 plays the target audio data.
In other embodiments, the second terminal 112 may also be a sender of audio data, and the first terminal 111 may also be a receiver of audio data, which is not limited herein. Optionally, the numbers of first terminals 111 and second terminals 112 are not limited; for example, there may be one first terminal 111 and one second terminal 112, i.e., the first terminal 111 and the second terminal 112 form one-to-one voice communication, or there may be multiple first terminals 111 or second terminals 112, i.e., they form one-to-many or many-to-many voice communication.
Alternatively, the audio processing process of the audio data may be completed in the first terminal 111, the server 120, or the second terminal 112, which is not limited herein.
The first terminal 111 and the server 120, and the second terminal 112 and the server 120 are illustratively connected through a communication network 130, where the communication network 130 may be a wired network or a wireless network, and is not limited herein.
Referring to fig. 2, a method for processing audio according to an embodiment of the present application is shown. In the embodiment of the present application, the method is described as applied to a terminal device (the first terminal or the second terminal) shown in fig. 1; the method may also be applied to a server, but for brevity it is described here only as applied to a terminal device. The method comprises the following steps:
step 201, obtaining an audio feature corresponding to the audio data.
The audio data is audio to be subjected to voice transmission, and the audio characteristics are used for indicating the energy distribution condition of the audio data.
Optionally, when the audio processing method is applied to a sender terminal device in voice communication, the audio data may be audio acquired by the terminal device through an audio acquisition device; when the audio processing method is applied to a receiving terminal device in voice communication, the audio data may be audio received by the terminal device from a terminal device with which a voice connection is established. In other embodiments, the audio data may also be audio that is read from a storage area by the terminal device.
In some embodiments, the audio feature is an audio feature obtained by feature extraction of the audio data. Alternatively, the technique used for extracting the features of the audio data to obtain the audio features may be at least one of Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficients (LPCCs), Line Spectral Frequencies (LSFs), Discrete Wavelet Transform (DWT), and Perceptual Linear Prediction (PLPs).
Optionally, the audio feature includes at least one feature that characterizes a speech signal, such as a vector of spectral amplitude values, a vector of spectral logarithmic energy values, a MFCC vector, a Filter bank (Fbanks) vector, a Bark-Frequency Cepstral Coefficients (BFCC) vector, a pitch period, and so on. In other embodiments, the audio feature may be a temporal first-order or second-order difference of the feature vector to reflect a dynamic change characteristic of the feature over time.
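By way of illustration only, the following Python sketch extracts a few of the features listed above using the open-source librosa library; the frame/hop sizes, the choice of 13 MFCCs and the 60-400 Hz pitch search range are assumptions chosen for the example, not values specified by the application.

```python
import numpy as np
import librosa


def extract_audio_features(y, sr=16000, n_fft=512, hop=160):
    """Illustrative extraction of spectral log-energy, MFCC and pitch."""
    # Magnitude spectrum per frame, shape (1 + n_fft // 2, n_frames)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # Spectral log-energy per frequency bin (floor avoids log(0))
    log_energy = np.log(np.maximum(spec ** 2, 1e-10))

    # Mel-frequency cepstral coefficients (MFCC), 13 per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

    # Pitch (fundamental frequency) per frame via the YIN estimator
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)

    return log_energy, mfcc, f0
```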
In some embodiments, the audio data may be further preprocessed before feature extraction, where the preprocessing includes at least one of digital-to-analog conversion, sampling, pre-emphasis, windowing and framing. The digital-to-analog conversion processing converts audio data whose audio signal is an analog signal into a digital signal, or vice versa; schematically, audio acquisition data whose audio signal is an analog signal is acquired through an audio acquisition device and sampled at a target sampling rate to obtain audio data whose audio signal is a digital signal. The sampling process converts the voice signal from an analog signal into a digital signal; the pre-emphasis processing emphasizes the high-frequency part of the voice, removes the influence of lip radiation and increases the high-frequency resolution of the voice; windowing and framing weight the speech signal with a movable finite-length window and then transform or filter each frame, dividing the speech signal into short segments (analysis frames).
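The pre-emphasis, framing and windowing steps described above can be sketched in a few lines of numpy; the 0.97 pre-emphasis coefficient, the 10 ms frame length and the Hamming window are conventional choices assumed for illustration.

```python
import numpy as np


def preprocess(y, sr=16000, frame_ms=10, pre_emph=0.97):
    """Pre-emphasis, framing and windowing with conventional values."""
    # Pre-emphasis: y'[n] = y[n] - a * y[n-1], boosting high frequencies
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])

    # Framing: split the signal into 10 ms analysis frames
    frame_len = int(sr * frame_ms / 1000)  # 160 samples at 16 kHz
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Windowing: weight each frame with a finite-length (Hamming) window
    return frames * np.hamming(frame_len)
```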
And 202, performing noise suppression processing on the audio data based on the audio characteristics to obtain noise reduction audio data.
In the embodiment of the application, the noise suppression processing is completed through a denoising neural network obtained by pre-training. Optionally, the denoising neural network may be at least one of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Deep Neural Network (DNN), such as a forward fully-connected deep neural network, a Long Short-Term Memory network (LSTM), a Gated Recurrent Unit network (GRU), and the like. In other embodiments, the network structure of the denoising neural network may also be a combination of the above networks, which is not limited herein.
In some embodiments, the denoising neural network includes an input layer, a hidden layer, and an output layer.
The number of neurons in the input layer is consistent with the length of the input audio feature vector. In one example, the input audio feature vector includes 129 spectral log-energy values and one pitch period value, i.e., 130 values in total, and the input layer has 130 neurons.
The number of hidden layers and the number of neurons per hidden layer may be determined according to the scale of the training data used to train the denoising neural network and/or the computing resources available to run it: for example, if the training data scale is large, a larger network scale is adopted (i.e., more hidden layers and more neurons per hidden layer) to obtain a better effect, while if the training data scale is small or computing resources are limited, fewer hidden layers and fewer neurons per hidden layer are adopted.
The number of neurons in the output layer is related to the number of audio gains to be calculated, and the number of the audio gains corresponds to a dividing mode for dividing audio features corresponding to the audio data. In some embodiments, when the audio feature corresponding to the audio data is an audio amplitude feature, the number of audio gains corresponds to the number of amplitude ranges into which the audio amplitude is divided; in other embodiments, when the audio data corresponds to audio frequency characteristics, the number of audio gains corresponds to the number of frequency ranges into which the audio frequencies are divided.
Illustratively, taking the audio frequency feature as an example: if frequency gain data is to be calculated for each frequency point of the audio data, the output frequency gain data is G2(k), k = 1, 2, …, N/2+1, and the number of neurons in the output layer is N/2+1. In another example, the number of neurons in the output layer may be less than N/2+1; for example, if the N/2+1 frequency points are divided into different frequency subbands, each neuron in the output layer only needs to predict the frequency gain data of one subband, and the number of neurons in the output layer corresponds to the number of frequency subbands obtained by the division.
Illustratively, when the audio feature is an audio frequency feature, the frequency gain data output by the output layer is the gain predicted by the denoising neural network according to the different frequency distributions of the speech signal and the noise signal: when the denoising neural network recognizes the audio signal in a target frequency range as a noise signal, the frequency gain data output by the corresponding neuron is used to attenuate the audio signal in that frequency range; when the denoising neural network recognizes the audio signal in a target frequency range as a voice signal, the frequency gain data output by the corresponding neuron is used to enhance the audio signal in that frequency range.
Illustratively, after the frequency gain data corresponding to the audio features is determined, the noise reduction audio data is determined according to the frequency correspondence between the spectral data corresponding to the audio data and the frequency gain data. In some embodiments, the spectral data corresponding to the audio data may be obtained by transforming the time-domain audio data into a frequency-domain audio signal through a Short-Time Fourier Transform (STFT) operation.
In one example, as shown in fig. 3, which illustrates an audio denoising block diagram based on a neural network model according to an exemplary embodiment of the present application, the audio data first passes through a time-domain-to-frequency-domain conversion 310 to obtain the corresponding spectral data; a voice feature extraction 320 operation is then performed on the spectral data to obtain the audio features, which are input into the denoising neural network 330, where the denoising neural network 330 includes an input layer, a hidden layer and an output layer; the audio gain G2(k) output by the output layer is multiplied by the spectral data corresponding to the audio data to obtain denoised spectral data represented as a frequency-domain signal, which is then output as a time-domain signal after the frequency-domain-to-time-domain conversion 340.
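As a rough illustration of such a denoising network (not the application's actual model), the following PyTorch sketch uses the input/output dimensions from the example above: 130 input features and one gain per frequency bin; the hidden sizes, activations and use of plain fully-connected layers are assumptions.

```python
import torch.nn as nn


class DenoiseNet(nn.Module):
    """Sketch of the denoising network: 130 inputs (129 spectral
    log-energy values + 1 pitch period value), one gain per frequency
    bin out. Hidden sizes and activations are assumptions."""

    def __init__(self, n_in=130, n_bins=129, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
            # Gains in (0, 1): near 0 attenuates noise bins,
            # near 1 preserves speech bins
            nn.Sigmoid(),
        )

    def forward(self, features):
        return self.net(features)  # frequency gain data G2(k)
```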
Step 203, determining voice detection data based on the energy distribution situation corresponding to the audio features.
Wherein the speech detection data is used to indicate the presence of speech signals in the audio data.
Illustratively, the acquisition of Voice Detection data is achieved by Voice Activity Detection (VAD). In the embodiment of the present application, the process of implementing voice detection by VAD includes a voice distinguishing portion and a VAD decision portion, the voice distinguishing portion judges whether the current audio data includes a voice signal according to the audio features, the VAD decision portion is configured to determine an output VAD flag according to a judgment result of the voice distinguishing portion, and the VAD flag is used as voice detection data.
In some embodiments, the voice determination part may input the audio features to a pre-trained voice recognition model, determine a probability that the voice signal is included in the audio data according to the audio features by the voice recognition model, and input the probability to the VAD decision part for determination. In other embodiments, the voice determination portion may determine whether the audio data includes a voice signal according to an energy distribution corresponding to the audio feature, taking the audio feature as the audio frequency feature as an example, and the voice determination portion determines whether a frequency band corresponding to the voice signal satisfies a voice signal existence condition according to a distribution condition of the audio frequency feature to output a corresponding result.
Illustratively, the VAD flag output by the VAD decision portion is used to indicate whether the audio data contains a voice signal, and in one example, when the VAD flag is 1, it indicates that the audio data contains the voice signal, and when the VAD flag is 0, it indicates that the audio data does not contain the voice signal.
And step 204, carrying out volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data.
The target audio data is audio for voice transmission. Illustratively, when the voice detection data indicates that a voice signal exists in the audio data, volume scaling processing is performed according to the audio energy condition corresponding to the noise reduction audio data to obtain the target audio data. The audio energy condition is used to determine whether the audio energy of the noise reduction audio data meets a volume adjustment condition, that is, whether the volume corresponding to the noise reduction audio data needs to be adjusted into a required range, so as to prevent the volume of the audio data from being too high or too low.
In some embodiments, the Root Mean Square energy (RMS level) corresponding to the voice signal in the noise reduction audio data may be calculated and compared with the target volume: if the root mean square energy is lower than the target volume, the digital gain is gradually increased until a preset upper limit of the digital gain is reached, for example amplification by no more than 30 dB; if the root mean square energy is higher than the target volume, the digital gain is gradually decreased until a preset lower limit of the digital gain is reached, for example reduction by no more than -10 dB.
In some embodiments, when the voice detection data indicates that no voice signal exists in the audio data, a volume reduction operation is performed on the noise reduction audio data, for example reducing its volume to 0. That is, if the current audio data contains no voice signal, the signals it contains are all noise signals, and reducing the volume to 0 achieves further noise reduction; meanwhile, when the volume is 0, the amount of data transmitted for the audio data is reduced, which reduces the consumption of bandwidth resources during data transmission.
In other embodiments, when the voice detection data indicates that no voice signal exists in the audio data, the volume scaling operation on the noise reduction audio data is skipped, i.e., the noise reduction audio data is directly output as the target audio data. This scheme is suitable when the audio duration corresponding to the audio data is long, as it avoids the unnatural impression that long periods of complete silence would give the receiving user.
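The volume scaling logic of this step can be sketched as follows; the target RMS value is an assumed placeholder, while the +30 dB / -10 dB limits follow the example figures quoted above.

```python
import numpy as np


def volume_scale(frame, vad_flag, target_rms=0.1,
                 max_gain_db=30.0, min_gain_db=-10.0, mute_when_silent=True):
    """Illustrative volume scaling for step 204; target_rms is a
    placeholder, the gain limits follow the figures quoted above."""
    if not vad_flag:
        # No speech detected: mute the frame, or pass it through unchanged
        return np.zeros_like(frame) if mute_when_silent else frame

    # Root mean square energy of the (noise-reduced) frame
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    # Digital gain needed to reach the target volume, clamped to limits;
    # a real system would also ramp this gradually across frames
    gain_db = np.clip(20.0 * np.log10(target_rms / rms),
                      min_gain_db, max_gain_db)
    return frame * (10.0 ** (gain_db / 20.0))
```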
To sum up, in the audio processing method provided in the embodiment of the present application, when volume adjustment processing is performed on audio data used for voice transmission, noise suppression processing is performed based on the audio features of the audio data to obtain noise reduction audio data; voice detection data is determined according to the energy distribution corresponding to the audio features; and whether volume adjustment needs to be performed on the noise reduction audio data is determined according to the voice detection data, so as to obtain the target audio data after volume processing, thereby improving the gain effect of the audio data during volume adjustment.
Referring to fig. 4, a method for processing audio according to an embodiment of the present application is shown, in which a pre-processing process of audio data is schematically illustrated. The method comprises the following steps:
step 401, acquiring audio acquisition data through an audio acquisition device.
The audio signal in the audio acquisition data is an analog signal. The audio acquisition device can be an independent microphone device connected with the terminal device, and can also be a microphone module carried by the terminal device.
It is understood that in the specific implementation manner of the present application, when the above embodiments of the present application are applied to specific products or technologies, the audio acquisition data acquired by the audio acquisition device needs to obtain user permission or consent before being processed or uploaded to the server, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related countries and regions.
Step 402, performing framing processing on the audio acquisition data to obtain a target number of framed audio data.
In the preprocessing process, the audio acquisition data needs to be divided into frames, and the resulting target number of framed audio data is sequentially input into the processing system; for example, one frame of framed audio data is obtained every 10 milliseconds, and the processing system sequentially processes the voice signal corresponding to each frame.
Step 403, sampling the audio data of the subframe to obtain the audio data.
To facilitate processing of the audio data by the device, the obtained framed audio data is sampled at a target sampling rate to obtain audio data whose audio signal is a digital signal. Optionally, the target sampling rate may be preset by the system, specified by the terminal device, or determined according to the device condition. In one example, the device condition includes the computing resource condition of the terminal device, where the amount of computing resources allocated to the processing system is positively correlated with the target sampling rate; in another example, the device condition further includes the current network condition of the terminal device, where a higher target sampling rate is used when the network condition is better and a lower target sampling rate is used when it is worse.
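A hypothetical selection policy consistent with the two examples above (more computing resources and a better network both favor a higher rate) might look like the following; the scores, thresholds and 8/16/48 kHz tiers are invented for illustration.

```python
def pick_target_sample_rate(cpu_score: float, network_score: float) -> int:
    """Hypothetical policy: both scores in [0, 1], higher is better.
    The thresholds and rate tiers are illustrative, not from the text."""
    score = min(cpu_score, network_score)  # bounded by the weaker resource
    if score > 0.8:
        return 48000  # ample compute and bandwidth: full-band audio
    if score > 0.4:
        return 16000  # wideband speech
    return 8000       # narrowband fallback
```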
Step 404, performing echo cancellation operation on the audio data to obtain preprocessed audio data.
When the audio processing method provided by the application is applied to instant voice communication, the current terminal device is not only a sender of audio data but also a receiver of the audio data sent by the far-end terminal device; to prevent the far-end speaker from hearing their own voice, an echo cancellation operation needs to be performed on the audio data. Illustratively, the echo cancellation operation may include Acoustic Echo Cancellation (AEC), which cancels echoes caused by the speaker sound being fed back to the microphone multiple times in hands-free or conferencing applications, and/or Line Echo Cancellation (LEC), which cancels echoes caused by the two-wire/four-wire hybrid coupling in physical circuits. In some embodiments, the echo cancellation operation may be accomplished by an adaptive filter.
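As one possible form of the adaptive filter mentioned above, the following sketch implements a basic normalized LMS (NLMS) echo canceller; the filter length and step size are illustrative assumptions.

```python
import numpy as np


def nlms_echo_cancel(mic, far_end, taps=256, mu=0.5, eps=1e-6):
    """Basic NLMS adaptive echo canceller sketch.

    mic: microphone signal = near-end speech + echo of far_end.
    """
    w = np.zeros(taps)                 # adaptive estimate of the echo path
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]  # most recent far-end samples
        echo_est = np.dot(w, x)        # predicted echo at the microphone
        e = mic[n] - echo_est          # error = echo-cancelled sample
        w += mu * e * x / (np.dot(x, x) + eps)  # normalized LMS update
        out[n] = e
    return out
```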
The pre-processed audio data is used for feature extraction to obtain audio features.
To sum up, the audio processing method provided in the embodiment of the present application preprocesses the audio collected by the audio acquisition device through framing, analog-to-digital conversion and echo cancellation: the framing processing reduces the amount of audio feature data input to the neural network, the analog-to-digital conversion improves the efficiency with which the processing system handles the data, and the echo cancellation improves the voice call quality in instant voice communication scenarios.
Referring to fig. 5, a method for processing audio according to an embodiment of the present application is shown, in which voice detection is performed on audio gain data output by a noise reduction neural network to obtain voice detection data. The method comprises the following steps:
step 501, extracting the characteristics of the audio data to obtain audio characteristics.
Illustratively, the audio data is preprocessed audio data, wherein the preprocessing process is shown in steps 401 to 404, which is not described herein again.
Optionally, the audio feature includes at least one feature that characterizes a feature of the speech signal, such as a vector of spectral amplitude values, a vector of spectral logarithmic energy values, an MFCC vector, an Fbanks vector, a BFCC vector, a pitch period, and the like. In other embodiments, the audio feature may be a temporal first-order or second-order difference of the feature vector to reflect a dynamic change characteristic of the feature over time.
Step 502, obtaining spectrum data corresponding to the audio data.
The spectral data is used to indicate the frequency distribution of the audio data.
In some embodiments, the spectral data corresponding to the audio data may be obtained by transforming the audio data in the time domain into an audio signal in the frequency domain by using a short-time fourier transform operation.
In step 503, frequency gain data is obtained based on the audio features.
The frequency gain data is used for indicating the gain conditions corresponding to the audio data in at least two frequency ranges respectively.
In the embodiment of the present application, the denoising process of the audio data is completed through a deep learning model. Illustratively, the gain conditions corresponding to the audio features in at least two frequency ranges are predicted through a first network to obtain the frequency gain data, where the first network is used to predict the gain conditions according to the spectral energy condition and pitch period indicated by the audio features.
Optionally, the first network may be at least one of a convolutional neural network, a cyclic neural network, and a deep neural network, for example, a forward fully-connected deep neural network, a long-short term memory artificial neural network, a closed recurrent unit network, and the like. In other embodiments, the network structure of the first network may also be a combination of the above networks, and is not limited herein.
Illustratively, voice recognition is performed on the audio features through a first network, frequency ranges corresponding to a voice signal and a noise signal in the audio data are determined, and according to a frequency distribution situation corresponding to the audio features, a frequency band corresponding to the audio features is divided into at least two frequency ranges, where the at least two frequency ranges include a first frequency range. In response to the audio feature in the first frequency range being a frequency range corresponding to the speech signal, determining first gain data based on the audio feature in the first frequency range, the first gain data being used to enhance the audio signal of the audio data in the first frequency range; or, in response to the audio characteristic in the first frequency range being a frequency range corresponding to the noise signal, determining second gain data based on the audio characteristic in the first frequency range, the second gain data being used for attenuating the audio signal of the audio data in the first frequency range, and determining the frequency gain data from the first gain data or the second gain data corresponding to at least two frequency ranges respectively.
The at least two frequency ranges may be specified by a technician before training of the first network, or may be a plurality of frequency ranges that are automatically learned by the first network in a training process, the number of frequency ranges corresponding to the voice signal is at least one, and the number of frequency ranges corresponding to the noise signal is at least one.
Step 504, determining noise reduction audio data based on the frequency correspondence between the frequency gain data and the spectral data.
Illustratively, after the frequency gain data corresponding to the audio features is determined, the noise reduction audio data is determined according to the frequency correspondence between the spectral data corresponding to the audio data and the frequency gain data. In the embodiment of the present application, the frequency gain data and the spectral data are multiplied according to the correspondence between frequencies to obtain the noise reduction audio data. In one example, the noise reduction audio data Xout2(k) is calculated by formula one, where X(k) is the spectral data, G2(k) is the frequency gain data, and k is a frequency point on the spectrum.
Formula one: Xout2(k) = X(k) × G2(k)
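Formula one can be applied frame by frame around an STFT/ISTFT round trip, as in the following sketch; the transform parameters and gain shapes are assumptions for the example.

```python
import numpy as np
import librosa


def apply_frequency_gains(y, g2, n_fft=256, hop=160):
    """Formula one per frequency point: Xout2(k) = X(k) * G2(k).

    g2: per-bin gains from the first network, either one vector of
    shape (n_fft // 2 + 1,) or one vector per frame (bins, frames).
    """
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop)  # spectral data X(k)
    X_out = X * g2 if g2.ndim == 2 else X * g2[:, None]
    # Back to the time domain: the noise reduction audio data
    return librosa.istft(X_out, hop_length=hop)
```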
Step 505, performing band selection operation on at least two frequency ranges, and determining a second frequency range meeting the requirement of the voice band.
The above voice band requirement is used to determine the frequency range containing the voice signal in the audio data.
Illustratively, if the energy of the voice in the frequency range corresponding to the voice signal is large and the signal-to-noise ratio of the audio data is high, the trained neural network can accurately distinguish noise from human voice, especially non-stationary noise such as keyboard clicks. In the embodiment of the present application, since the neural network used for noise suppression has a very strong voice detection capability, the VAD detection can be reliably guided by the frequency gain data G2(k); that is, the output of the VAD flag is derived from the frequency gain data.
Schematically, fig. 6 shows a speech detection block diagram provided by an exemplary embodiment of the present application: the frequency gain data G2(k) 601 is input to a band selection module 610 and then processed by the VAD decision module 620, which outputs the VAD flag 602. The band selection module 610 is configured to determine the frequency range containing the speech signal from the at least two frequency ranges. Optionally, the band selection module 610 presets the frequency range corresponding to the voice signal; alternatively, in the frequency gain data G2(k) 601 output by the first network, the frequency points whose gains the first network recognized as belonging to a voice signal carry a preset identifier, and the band selection module 610 may determine the frequency band corresponding to the voice signal according to this preset identifier. The VAD decision module 620 determines the output VAD flag according to the voice gain data corresponding to the determined frequency band.
Step 506, voice gain data corresponding to the voice signal is determined based on the second frequency range.
After the second frequency range corresponding to the voice signal is determined, the voice gain data corresponding to the voice signal is determined from the frequency gain data output by the first network. In one example, the band selection section automatically selects M low-band gains as candidate frequency gains, M being a positive integer; for example, when M = 3, the candidate frequency gains of the three lower bands G2(2), G2(3) and G2(4) are selected.
After the candidate frequency gains are determined, they may be weighted and summed according to a preset weight relationship to obtain the voice gain data corresponding to the voice signal. Optionally, the preset weight relationship may be preset by the system or set by the user, which is not limited herein.
Step 507, determining voice detection data based on the matching condition between the voice gain data and the target gain threshold.
Illustratively, comparing the voice gain data corresponding to the acquired voice signal with a target gain threshold, if the voice gain data is greater than or equal to the target gain threshold, determining that the output VAD flag is 1, that is, indicating that the audio data includes the voice signal, and if the voice gain data is less than the target gain threshold, determining that the output VAD flag is 0, that is, indicating that no voice signal exists in the audio data. Optionally, the target gain threshold may be preset by the system, or may be set by a user, which is not limited herein.
In one example, the VAD decision module first calculates the gain-weighted sum X = a × G2(2) + b × G2(3) + c × G2(4); when X is greater than or equal to a target gain threshold THG, the current audio data is considered to contain a voice signal and the VAD flag is set to 1, otherwise it is set to 0.
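The band selection and VAD decision of this example can be sketched as follows; the weights a, b, c and the threshold THG are unspecified in the text, so the values below are placeholders.

```python
def vad_from_gains(g2, weights=(0.4, 0.3, 0.3), threshold=0.5):
    """VAD decision from three low-band gains as in the example above;
    g2 is indexed so that g2[k] matches the text's G2(k). The weights
    a, b, c and the threshold THG are placeholder values."""
    a, b, c = weights
    x = a * g2[2] + b * g2[3] + c * g2[4]  # gain-weighted sum X
    return 1 if x >= threshold else 0      # VAD flag: 1 = speech present
```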
Step 508, performing volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data.
Illustratively, when the voice detection data indicates that a voice signal exists in the audio data, the volume scaling processing is performed according to the audio energy condition corresponding to the noise reduction audio data to obtain the target audio data.
In some embodiments, the root mean square energy corresponding to the voice signal in the noise reduction audio data may be calculated and compared with the target volume: if the root mean square energy is lower than the target volume, the digital gain is gradually increased until the preset upper limit of the digital gain is reached, and if it is higher than the target volume, the digital gain is gradually decreased until the preset lower limit is reached.
In some embodiments, when the voice detection data indicates that no voice signal exists in the audio data, a volume reduction operation is performed on the noise reduction audio data, for example reducing its volume to 0. That is, if the current audio data contains no voice signal, the signals it contains are all noise signals, and reducing the volume to 0 achieves further noise reduction; meanwhile, when the volume is 0, the amount of data transmitted for the audio data is reduced, which reduces the consumption of bandwidth resources during data transmission.
Illustratively, taking the application of the audio processing method to instant voice communication as an example, if the audio data is audio collected by a sending terminal of the audio, after the target audio data is obtained, the sending terminal inputs the target audio data to an encoder for encoding, so as to obtain audio data to be transmitted, and transmits the audio data to be transmitted to a receiving terminal, where the receiving terminal is a terminal performing voice communication with the sending terminal.
To sum up, in the audio processing method provided in the embodiment of the present application, when volume adjustment processing is performed on audio data used for voice transmission, noise suppression processing is performed based on the audio features of the audio data to obtain noise reduction audio data; voice detection data is determined according to the energy distribution corresponding to the audio features; and whether volume adjustment needs to be performed on the noise reduction audio data is determined according to the voice detection data, so as to obtain the target audio data after volume processing, thereby improving the gain effect of the audio data during volume adjustment.
In the embodiment of the present application, the VAD detection process is completed by directly using the frequency gain data output by the first network, so that the data amount processed by the whole processing system is reduced.
Referring to fig. 7, a method for processing audio according to an embodiment of the present application is shown, in which voice detection is completed through a branch network (a second network) of the noise reduction neural network to obtain voice detection data. The method comprises the following steps:
step 701, performing feature extraction on the audio data to obtain audio features.
In the embodiment of the present application, the process of extracting the audio features is shown as step 201 or step 501, which is not described herein again.
Step 702, inputting the audio data to a first network to obtain noise reduction audio data.
In the embodiment of the present application, the denoising process of the audio data is completed through a deep learning model, and the first network is configured to predict a gain condition according to a spectrum energy condition and a pitch period indicated by the audio feature.
Optionally, the first network may be at least one of a convolutional neural network, a cyclic neural network, and a deep neural network, for example, a forward fully-connected deep neural network, a long-short term memory artificial neural network, a closed recurrent unit network, and the like. In other embodiments, the network structure of the first network may also be a combination of the above networks, and is not limited herein.
Illustratively, voice recognition is performed on the audio features through a first network, frequency ranges corresponding to a voice signal and a noise signal in the audio data are determined, and according to a frequency distribution situation corresponding to the audio features, a frequency band corresponding to the audio features is divided into at least two frequency ranges, where the at least two frequency ranges include a first frequency range. In response to the audio feature in the first frequency range being a frequency range corresponding to the speech signal, determining first gain data based on the audio feature in the first frequency range, the first gain data being used to enhance the audio signal of the audio data in the first frequency range; or, in response to the audio characteristic in the first frequency range being a frequency range corresponding to the noise signal, determining second gain data based on the audio characteristic in the first frequency range, the second gain data being used for attenuating the audio signal of the audio data in the first frequency range, and determining the frequency gain data from the first gain data or the second gain data corresponding to at least two frequency ranges respectively.
Illustratively, after frequency gain data corresponding to the audio features are determined, noise reduction audio data are determined according to frequency spectrum data corresponding to the audio data and a frequency correspondence between the frequency gain data. In the embodiment of the present application, the frequency gain data and the spectrum data are multiplied by each other according to the correspondence between the frequencies, so as to obtain the noise reduction audio data.
Step 703, predicting, through a second network, the probability that a voice signal exists in the audio data based on the energy distribution corresponding to the audio features, to obtain voice probability data.
In the embodiment of the present application, the voice detection process is completed through a second network, which is schematically a branch network of the first network, and at least one layer of network neurons is shared between the first network and the second network.
Illustratively, the first network includes an input layer comprising one layer of neurons, a hidden portion comprising N layers of neurons, and an output layer comprising one layer of neurons. Optionally, the second network may share the input layer with the first network, or may share the input layer and K hidden layers with the first network, where N and K are positive integers, and K is less than or equal to N.
Optionally, the second network may be at least one of a convolutional neural network, a recurrent neural network, and a deep neural network, for example, a forward fully-connected deep neural network, a long short-term memory (LSTM) network, a gated recurrent unit (GRU) network, and the like. In other embodiments, the network structure of the second network may also be a combination of the above networks, which is not limited herein.
Optionally, during training, the first network and the second network may be trained jointly; alternatively, after the training of the first network is completed, the parameters of the neurons shared by the first network and the second network are fixed, and only the second network is trained.
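As a rough illustration of this shared-neuron arrangement and the two training options, the following PyTorch sketch defines a trunk shared between the gain-prediction branch (first network) and the voice-detection branch (second network). The layer sizes, feature dimension, and band count are assumptions for the example, not values taken from the embodiment.

```python
import torch
import torch.nn as nn

class DenoiseVadNet(nn.Module):
    """Minimal sketch: the gain branch and the VAD branch share the
    input layer and one hidden layer (K = 1 here). Sizes are assumed."""

    def __init__(self, n_features=42, n_bands=22, hidden=128):
        super().__init__()
        # Neurons shared between the first and second networks
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # First-network branch: one gain per frequency band, in (0, 1)
        self.gain_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bands), nn.Sigmoid(),
        )
        # Second-network branch: voice probability s in (0, 1)
        self.vad_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):
        h = self.shared(feats)
        return self.gain_head(h), self.vad_head(h)

model = DenoiseVadNet()

# Option 1: train both heads jointly with a combined loss.
# Option 2: train the gain path first, then freeze the shared neurons
# and train only the VAD branch:
for p in model.shared.parameters():
    p.requires_grad_(False)
```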
Step 704, determining voice detection data based on the matching between the voice probability data and the target probability threshold.
In some embodiments, the voice probability data s is a number between 0 and 1. When s is greater than or equal to the target probability threshold, the corresponding VAD flag is set to 1; when s is less than the target probability threshold, the corresponding VAD flag is set to 0. The VAD flag is the voice detection data.
Optionally, the target probability threshold may be preset by the system or user-defined, which is not limited herein.
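The threshold comparison itself reduces to a one-line mapping; a minimal sketch, assuming a placeholder threshold of 0.5:

```python
def vad_flag(speech_prob: float, threshold: float = 0.5) -> int:
    """Binary VAD flag from the voice probability data s in [0, 1].
    The 0.5 default is a placeholder for the preset or user-defined
    target probability threshold."""
    return 1 if speech_prob >= threshold else 0
```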
Schematically, fig. 8 illustrates a noise reduction and voice detection block diagram provided in an exemplary embodiment of the present application. After the audio data passes through a time-domain-to-frequency-domain conversion 810, the corresponding spectrum data is obtained; voice feature extraction 820 is then performed on the spectrum data to obtain the audio features, which are input into a first network 830 comprising an input layer, hidden layers, and an output layer. The audio gain G2(k) output by the output layer is multiplied by the spectrum data corresponding to the audio data to obtain noise reduction spectrum data represented by a frequency-domain signal, and after a frequency-domain-to-time-domain conversion 840, noise reduction audio data represented by a time-domain signal is output. The first network 830 also corresponds to a branch network for voice detection, namely a second network 850; in this example, the first network 830 and the second network 850 share the input layer and the first hidden layer. The output layer of the second network 850 outputs the voice probability data, and the VAD flag is obtained by comparing the voice probability data with the target probability threshold.
Step 705, performing volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data.
Illustratively, when the voice detection data indicates that a voice signal exists in the audio data, the volume scaling processing is performed according to the audio energy condition corresponding to the noise reduction audio data to obtain the target audio data.
In some embodiments, the root mean square energy corresponding to the voice signal in the noise reduction audio data is measured and compared with the target volume; if the root mean square energy is smaller than the target volume, the digital gain is gradually increased until a preset upper limit of the digital gain is reached.
In some embodiments, when the voice detection data indicates that no voice signal exists in the audio data, a volume reduction operation is performed on the noise reduction audio data, for example, the volume is reduced to 0. If the current audio data contains no voice signal, the signals it contains are all noise signals, so reducing the volume to 0 achieves further noise reduction. Meanwhile, when the volume is 0, the amount of data transmitted for the audio is reduced, which reduces the consumption of bandwidth resources during data transmission.
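The following sketch combines both branches of the volume scaling step: when the VAD flag indicates speech, the digital gain is raised step by step toward a target RMS level up to a preset cap; when it indicates no speech, the frame is muted. The constants (target RMS, step factor, gain ceiling) are illustrative assumptions.

```python
import numpy as np

def gain_control(frame, vad, state, target_rms=0.1, gain_step=1.05, max_gain=8.0):
    """One AGC step per frame. `state` carries the digital gain across
    frames; target_rms, gain_step, and max_gain are assumed constants."""
    if not vad:
        return frame * 0.0                       # no voice signal: mute the frame
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)   # root mean square energy
    if rms * state["gain"] < target_rms:
        # quieter than the target volume: raise the digital gain gradually,
        # never past the preset upper limit
        state["gain"] = min(state["gain"] * gain_step, max_gain)
    return frame * state["gain"]

state = {"gain": 1.0}
out = gain_control(np.random.randn(480) * 0.01, vad=1, state=state)
```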
To sum up, in the audio processing method provided in the embodiment of the present application, when volume adjustment is performed on audio data for voice transmission, noise suppression is performed based on the audio features of the audio data to obtain noise reduction audio data, voice detection data is determined according to the energy distribution corresponding to the audio features, and whether volume adjustment needs to be performed on the noise reduction audio data is decided according to the voice detection data, so as to obtain the volume-processed target audio data.
In the embodiment of the application, part of the neurons are shared between the first network for denoising and the second network for voice detection, which reduces the data volume of the overall model; meanwhile, because both the denoising process and the voice detection process are completed through deep learning, high processing accuracy is achieved, improving the audio processing effect.
With reference to the foregoing embodiments, fig. 9 shows a block diagram of an audio processing system provided in an exemplary embodiment of the present application. A speech signal 901 is collected by an audio collection device 910 to obtain audio data, which is passed to the subsequent modules frame by frame. The audio data is first input to an echo canceller 920 for echo cancellation and then to a noise reduction network 930. The noise reduction network 930 inputs the noise reduction audio data to a gain controller 940 and, at the same time, inputs the frequency gain data to a voice detection module 950; the voice detection module 950 outputs a VAD flag to the gain controller 940 through voice detection. The gain controller 940 processes the VAD flag and the noise reduction audio data to obtain the target audio data, which is transmitted to an encoder 960 and encoded by the encoder 960 into data convenient for transmission over the internet.
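Tying the fig. 9 modules together, the sketch below wires the earlier examples into a per-frame processing loop. The stubs (echo_cancel, extract_features, encode, and the stub model) are hypothetical placeholders for the echo canceller 920, the feature extraction, the encoder 960, and the trained networks; apply_band_gains, vad_flag, and gain_control are the sketches given above.

```python
import numpy as np

# Hypothetical stand-ins for the fig. 9 modules (not the patent's code):
echo_cancel = lambda x: x                                     # echo canceller 920 (stub)
extract_features = lambda spec: np.abs(spec)                  # feature extraction (stub)
encode = lambda x: x.astype(np.float32).tobytes()             # encoder 960 (stub)
model = lambda feats: (np.array([0.1, 1.0, 1.0, 0.2]), 0.9)   # trained networks (stub)
edges = np.array([0, 32, 96, 192, 257])

def process_frame(raw_frame, state, fft_size=512):
    frame = echo_cancel(raw_frame)
    spec = np.fft.rfft(frame, n=fft_size)                     # time -> frequency domain
    gains, s = model(extract_features(spec))                  # noise reduction net 930 + VAD branch 950
    denoised = np.fft.irfft(apply_band_gains(spec, gains, edges), n=fft_size)
    target = gain_control(denoised, vad_flag(s), state)       # gain controller 940
    return encode(target)                                     # bytes for transmission over the internet

payload = process_frame(np.random.randn(512) * 0.01, {"gain": 1.0})
```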
Referring to fig. 10, a block diagram of an audio processing apparatus according to an exemplary embodiment of the present application is shown, where the apparatus includes the following modules:
an obtaining module 1010, configured to obtain an audio feature corresponding to audio data, where the audio data is an audio to be subjected to voice transmission, and the audio feature is used to indicate an energy distribution condition of the audio data;
a noise reduction module 1020, configured to perform noise suppression processing on the audio data based on the audio feature to obtain noise reduction audio data;
a detecting module 1030, configured to determine voice detection data based on the energy distribution corresponding to the audio feature, where the voice detection data is used to indicate existence of a voice signal in the audio data;
the processing module 1040 is configured to perform volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, where the target audio data is an audio used for performing the voice transmission.
In some optional embodiments, as shown in fig. 11, the noise reduction module 1020 further includes:
a first obtaining unit 1021, configured to obtain spectrum data corresponding to the audio data, where the spectrum data is used to indicate a frequency distribution of the audio data;
the first obtaining unit 1021, further configured to obtain, based on the audio feature, frequency gain data, where the frequency gain data is used to indicate respective gain conditions corresponding to the audio data in at least two frequency ranges;
a first determining unit 1022, configured to determine the noise reduction audio data based on a frequency correspondence between the frequency gain data and the spectrum data.
In some optional embodiments, the first obtaining unit 1021 is further configured to predict gain conditions corresponding to the audio feature in the at least two frequency ranges through a first network to obtain the frequency gain data, where the first network is configured to predict the gain conditions according to a spectrum energy condition and a pitch period indicated by the audio feature.
In some optional embodiments, the noise reduction module 1020 further includes:
a first identifying unit 1023, configured to perform speech recognition on the audio features through the first network, and determine frequency ranges corresponding to a speech signal and a noise signal in the audio data, respectively;
the first identifying unit 1023 is further configured to divide the frequency band corresponding to the audio feature into the at least two frequency ranges according to the frequency distribution corresponding to the audio feature, where the at least two frequency ranges include a first frequency range;
the first determining unit 1022 is further configured to, in response to that the audio feature in the first frequency range is a frequency range corresponding to the speech signal, determine first gain data based on the audio feature in the first frequency range, where the first gain data is used to enhance the audio signal of the audio data in the first frequency range; or, in response to the audio feature in the first frequency range being a frequency range corresponding to the noise signal, determining second gain data based on the audio feature in the first frequency range, the second gain data being for attenuating the audio signal of the audio data in the first frequency range;
the first determining unit 1022 is further configured to determine the frequency gain data from the first gain data or the second gain data respectively corresponding to the at least two frequency ranges.
In some optional embodiments, the detecting module 1030 further includes:
a second obtaining unit 1031, configured to obtain frequency gain data based on the audio features, where the frequency gain data is used to indicate respective corresponding gain conditions of the audio data in at least two frequency ranges;
a second determining unit 1032, configured to perform a band selection operation on the at least two frequency ranges, and determine a second frequency range that meets a voice band requirement, where the voice band requirement is used to determine a frequency range including a voice signal in the audio data;
the second determining unit 1032 is further configured to determine voice gain data corresponding to the voice signal based on the second frequency range;
the second determining unit 1032 is further configured to determine the voice detection data based on a matching condition between the voice gain data and a target gain threshold.
In some optional embodiments, the detecting module 1030 further includes:
a second identifying unit 1033, configured to determine, based on the energy distribution condition corresponding to the audio feature, a probability that a speech signal exists in the audio data through a second network for prediction, so as to obtain speech probability data, where at least one layer of network neurons is shared between the first network and the second network;
the second determining unit 1032 is further configured to determine the voice detection data based on a matching condition between the voice probability data and a target probability threshold.
In some optional embodiments, the apparatus further comprises: a pre-processing module 1050;
the preprocessing module 1050 further includes:
the audio acquisition device comprises an acquisition unit 1051, a processing unit and a processing unit, wherein the acquisition unit 1051 is used for acquiring audio acquisition data through audio acquisition equipment, and audio signals in the audio acquisition data are analog signals;
the conversion unit 1052 is configured to sample the audio acquisition data according to a target sampling rate to obtain the audio data, where an audio signal in the audio data is a digital signal;
a feature extraction unit 1053, configured to perform feature extraction on the audio data to obtain the audio features.
In some optional embodiments, the preprocessing module 1050 further includes:
an echo cancellation unit 1054, configured to perform an echo cancellation operation on the audio data to obtain preprocessed audio data, where the preprocessed audio data is used to perform feature extraction to obtain the audio feature.
In some optional embodiments, the preprocessing module 1050 further includes:
a framing unit 1055, configured to perform framing processing on the audio acquisition data to obtain a target number of framed audio data;
the feature extraction unit 1053 is further configured to sample the framed audio data to obtain the audio data.
In some optional embodiments, the audio data is audio collected by a sending terminal;
the device further comprises:
the encoding module 1060 is configured to input the target audio data to an encoder for encoding, so as to obtain audio data to be transmitted;
the sending module 1070 is configured to transmit the audio data to be transmitted to a receiving terminal, where the receiving terminal is a terminal performing voice communication with the sending terminal.
To sum up, when performing volume adjustment on audio data for voice transmission, the audio processing apparatus provided in the embodiment of the present application performs noise suppression based on the audio features of the audio data to obtain noise reduction audio data, determines voice detection data according to the energy distribution corresponding to the audio features, and decides according to the voice detection data whether volume adjustment needs to be performed on the noise reduction audio data, so as to obtain the volume-processed target audio data. That is, volume adjustment is performed on the noise reduction audio data after determining, from the energy distribution corresponding to the audio features, whether the audio data contains a voice signal, thereby improving the gain effect of the audio data during volume adjustment and improving the voice quality in the audio data.
It should be noted that: the audio processing apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the audio processing apparatus and the audio processing method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Fig. 12 shows a block diagram of a terminal 1200 according to an exemplary embodiment of the present application. The terminal 1200 may be: a smart phone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a notebook computer, or a desktop computer. Terminal 1200 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.
In general, terminal 1200 includes: a processor 1201 and a memory 1202.
The processor 1201 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1201 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1201 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1201 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1201 may also include an Artificial Intelligence (AI) processor for processing computing operations related to machine learning.
Memory 1202 may include one or more computer-readable storage media, which may be non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1202 is used to store at least one instruction for execution by processor 1201 to implement a method of processing audio provided by method embodiments herein.
Those skilled in the art will appreciate that the configuration illustrated in fig. 12 is not intended to be limiting of terminal 1200, which may include more or fewer components than those illustrated, combine some components, or adopt a different arrangement of components.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium. The computer-readable storage medium may be the computer-readable storage medium contained in the memory of the above embodiments, or a separate computer-readable storage medium not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio processing method described in any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for processing audio, the method comprising:
acquiring audio characteristics corresponding to audio data, wherein the audio data are audio to be subjected to voice transmission, and the audio characteristics are used for indicating the energy distribution condition of the audio data;
carrying out noise suppression processing on the audio data based on the audio features to obtain noise reduction audio data;
determining voice detection data based on the energy distribution situation corresponding to the audio features, wherein the voice detection data is used for indicating the existence situation of voice signals in the audio data;
and carrying out volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, wherein the target audio data is audio used for voice transmission.
2. The method of claim 1, wherein the denoising the audio data based on the audio feature to obtain denoised audio data comprises:
acquiring spectrum data corresponding to the audio data, wherein the spectrum data is used for indicating the frequency distribution condition of the audio data;
acquiring frequency gain data based on the audio features, wherein the frequency gain data is used for indicating the gain conditions respectively corresponding to the audio data in at least two frequency ranges;
determining the noise reduction audio data based on a frequency correspondence between the frequency gain data and the spectrum data.
3. The method of claim 2, wherein the obtaining frequency gain data based on the audio features comprises:
and predicting gain conditions respectively corresponding to the audio features in the at least two frequency ranges through a first network to obtain the frequency gain data, wherein the first network is used for predicting the gain conditions according to the spectrum energy condition and the pitch period indicated by the audio features.
4. The method according to claim 3, wherein the predicting, by the first network, gain conditions corresponding to the audio feature in the at least two frequency ranges, respectively, to obtain the frequency gain data comprises:
performing voice recognition on the audio features through the first network, and determining frequency ranges corresponding to voice signals and noise signals in the audio data respectively;
dividing the frequency band corresponding to the audio feature into the at least two frequency ranges according to the frequency distribution condition corresponding to the audio feature, wherein the at least two frequency ranges comprise a first frequency range;
in response to the audio feature in the first frequency range being a frequency range corresponding to the speech signal, determining first gain data based on the audio feature in the first frequency range, the first gain data being used to enhance the audio signal of the audio data in the first frequency range; or, in response to the audio feature in the first frequency range being a frequency range corresponding to the noise signal, determining second gain data based on the audio feature in the first frequency range, the second gain data being for attenuating the audio signal of the audio data in the first frequency range;
and determining the frequency gain data according to the first gain data or the second gain data respectively corresponding to the at least two frequency ranges.
5. The method according to any one of claims 1 to 4, wherein the determining the voice detection data based on the energy distribution corresponding to the audio feature comprises:
acquiring frequency gain data based on the audio features, wherein the frequency gain data is used for indicating the gain conditions respectively corresponding to the audio data in at least two frequency ranges;
performing a band selection operation on the at least two frequency ranges, and determining a second frequency range meeting a voice band requirement, where the voice band requirement is used to determine a frequency range including a voice signal in the audio data;
determining voice gain data corresponding to the voice signal based on the second frequency range;
determining the voice detection data based on a match between the voice gain data and a target gain threshold.
6. The method according to any one of claims 3 to 4, wherein the determining the voice detection data based on the energy distribution corresponding to the audio feature comprises:
determining the probability of voice signals existing in the audio data through a second network to predict based on the energy distribution condition corresponding to the audio features, so as to obtain voice probability data, wherein at least one layer of network neurons is shared between the first network and the second network;
and determining the voice detection data based on the matching condition between the voice probability data and a target probability threshold.
7. The method according to any one of claims 1 to 4, wherein the obtaining of the audio features corresponding to the audio data comprises:
acquiring audio acquisition data through audio acquisition equipment, wherein audio signals in the audio acquisition data are analog signals;
sampling the audio acquisition data according to a target sampling rate to obtain the audio data, wherein audio signals in the audio data are digital signals;
and performing feature extraction on the audio data to obtain the audio features.
8. The method of claim 7, wherein before the extracting the features of the audio data to obtain the audio features, the method further comprises:
and performing echo cancellation operation on the audio data to obtain preprocessed audio data, wherein the preprocessed audio data is used for performing feature extraction to obtain the audio features.
9. The method of claim 7, wherein sampling the audio acquisition data according to a target sampling rate to obtain the audio data comprises:
performing framing processing on the audio acquisition data to obtain frame audio data of a target number;
and sampling the frame audio data to obtain the audio data.
10. The method according to any one of claims 1 to 4, wherein the audio data is audio collected by a transmitting terminal;
after the volume scaling processing is performed on the noise reduction audio data according to the voice detection data to obtain target audio data, the method further includes:
inputting the target audio data into an encoder for encoding to obtain audio data to be transmitted;
and transmitting the audio data to be transmitted to a receiving terminal, wherein the receiving terminal is a terminal performing voice communication with the sending terminal.
11. An apparatus for processing audio, the apparatus comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring audio characteristics corresponding to audio data, the audio data is audio to be subjected to voice transmission, and the audio characteristics are used for indicating the energy distribution condition of the audio data;
the noise reduction module is used for carrying out noise suppression processing on the audio data based on the audio features to obtain noise reduction audio data;
a detection module, configured to determine voice detection data based on the energy distribution corresponding to the audio feature, where the voice detection data is used to indicate existence of a voice signal in the audio data;
and the processing module is used for carrying out volume scaling processing on the noise reduction audio data according to the voice detection data to obtain target audio data, and the target audio data is audio used for carrying out voice transmission.
12. The apparatus of claim 11, wherein the noise reduction module further comprises:
the first acquisition unit is used for acquiring spectrum data corresponding to the audio data, and the spectrum data is used for indicating the frequency distribution condition of the audio data;
the first obtaining unit is further configured to obtain frequency gain data based on the audio features, where the frequency gain data is used to indicate respective corresponding gain conditions of the audio data in at least two frequency ranges;
a first determining unit configured to determine the noise reduction audio data based on a frequency correspondence between the frequency gain data and the spectrum data.
13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes or set of instructions, which is loaded and executed by the processor to implement a method of processing audio as claimed in any one of claims 1 to 10.
14. A computer-readable storage medium, in which at least one program code is stored, the program code being loaded and executed by a processor to implement the method of processing audio according to any one of claims 1 to 10.
15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement a method of processing audio as claimed in any one of claims 1 to 10.