CN113571075B - Audio processing method, device, electronic equipment and storage medium

Audio processing method, device, electronic equipment and storage medium

Info

Publication number
CN113571075B
Authority
CN
China
Prior art keywords
audio
features
processed
audio data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110118489.2A
Other languages
Chinese (zh)
Other versions
CN113571075A (en)
Inventor
鲍枫
李娟娟
李岳鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110118489.2A priority Critical patent/CN113571075B/en
Publication of CN113571075A publication Critical patent/CN113571075A/en
Application granted granted Critical
Publication of CN113571075B publication Critical patent/CN113571075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present application relates to the field of computer technologies, and in particular, to a method and apparatus for audio processing, an electronic device, and a computer readable storage medium. The method comprises the steps of obtaining original audio characteristics corresponding to audio data to be processed; invoking a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features; invoking a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features; according to the second audio characteristics and the original audio characteristics, calling a fully-connected network model to obtain a gain result corresponding to the audio data to be processed; and generating denoising audio data according to the gain result and the audio data to be processed. The method can improve the denoising effect, so that the voice in the audio can be accurately judged, and the judgment accuracy is improved.

Description

Audio processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for audio processing, an electronic device, and a computer readable storage medium.
Background
With the development of computer technology, web conferences are increasingly accepted and have become the preferred solution for teleconferencing. In a web conference, a participant typically chooses to turn off the microphone when not speaking, so as not to interfere with the current speaker. The conference moderator can also maintain conference order by muting some or all of the other participants through a permission control function.
Currently, the turning on and off of the microphone while a user participates in a conference may be controlled by the conference program. The online conference program detects whether the user is speaking and, when it determines that the user is speaking, actively turns on the microphone to allow the user to speak.
However, noise interference generally exists in the environment in which the user participates, so the online conference program may misjudge surrounding environmental noise as the user speaking and turn on the microphone, which reduces both the accuracy of the speaking judgment and the user experience.
Disclosure of Invention
Based on the above technical problems, the present application provides an audio processing method to improve the denoising effect, so that the human voice in the audio can be judged accurately and the judgment accuracy is improved.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method of audio processing, including:
acquiring original audio characteristics corresponding to audio data to be processed;
invoking a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
invoking a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
according to the second audio characteristics and the original audio characteristics, a fully connected network model is called to obtain a gain result corresponding to the audio data to be processed;
and generating denoising audio data according to the gain result and the audio data to be processed.
According to an aspect of an embodiment of the present application, there is provided an audio processing apparatus including:
the acquisition module is used for acquiring original audio characteristics corresponding to the audio data to be processed;
The calling module is used for calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
The calling module is further used for calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
The calling module is further used for calling a fully connected network model to obtain a gain result corresponding to the audio data to be processed according to the second audio feature and the original audio feature;
And the generating module is used for generating denoising audio data according to the gain result and the audio data to be processed.
In some embodiments of the present application, based on the above technical solutions, the acquiring module includes:
The interval dividing unit is used for dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;
A subband dividing unit, configured to divide the frequencies of the first frequency interval and the second frequency interval into subbands, and perform sparsification processing on the subbands of the second frequency interval to obtain a subband set, where the number of subbands divided from the first frequency interval is greater than the number of subbands divided from the second frequency interval, and the subband set includes audio segment data corresponding to each subband;
And the characteristic calculation unit is used for calculating the original audio characteristic according to the subband set.
In some embodiments of the present application, based on the above technical solution, the feature calculating unit includes:
A first calculating subunit, configured to calculate Bark frequency cepstrum coefficients of each subband in the subband set, to obtain a first feature set;
A second calculating subunit, configured to calculate, for at least two subbands in the subband set, a difference coefficient and a discrete cosine transform value between the subbands, to obtain a second feature set;
And the characteristic determining subunit is used for determining the original audio characteristic according to the first characteristic set and the second characteristic set.
In some embodiments of the present application, based on the above technical solution, the calling module includes:
The model calling unit is used for calling a third network model to process the original audio feature, the first audio feature and the second audio feature to obtain a third audio feature, wherein the feature quantity of the third audio feature is larger than that of the second audio feature;
The model calling unit is further configured to call a fully connected network model according to the third audio feature, and obtain a gain result corresponding to the audio data to be processed.
In some embodiments of the present application, based on the above technical solution, the generating module includes:
The gain calculation unit is used for carrying out multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;
And the audio conversion unit is used for carrying out inverse fast Fourier transform on the audio gain result to obtain denoising audio data.
In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus further includes:
the acquisition module is also used for acquiring training audio characteristics corresponding to the audio data to be trained;
The calling module is further used for calling a first network model included in the model to be trained, and processing the training audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
The calling module is further configured to call a second network model included in the model to be trained, and process the training audio feature and the first audio feature to obtain a second audio feature, where a dimension of the second audio feature is greater than a dimension of the first audio feature;
The calling module is further configured to call a fully connected network model included in the to-be-trained model according to the second audio feature and the training audio feature, and obtain a gain result corresponding to the to-be-processed audio data;
and the training module is used for adjusting the model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be processed to obtain an audio processing model.
In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus further includes:
the acquisition module is used for acquiring the audio data to be processed through the audio acquisition device;
the recognition module is used for performing recognition processing on the denoised audio data to obtain an audio recognition result;
And the switching module is used for controlling the audio acquisition device to transmit the audio data if the audio recognition result indicates that the audio data to be processed is human voice, otherwise, controlling the audio acquisition device to stop the audio data transmission.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of audio processing as in the above technical solutions via execution of the executable instructions.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of audio processing as in the above technical solution.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the method of providing audio processing in the various alternative implementations described above.
In the technical solutions provided by some embodiments of the present application, denoising processing is performed on the audio data to be processed through network models. During processing, the original input features and the output results of the preceding network models are fed together into the subsequent network models for calculation, so that the noise characteristics in the original audio features can be fully considered during model calculation. The noise is therefore filtered more thoroughly, the denoising effect is improved, the human voice in the audio can be judged more accurately, and the judgment accuracy is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
In the drawings:
FIG. 1 is a schematic diagram of an interface of a conferencing application according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of audio processing according to an embodiment of the application;
FIG. 3 is a flow chart of a method of audio processing according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of audio processing according to an embodiment of the application;
FIG. 5 is a flow chart of a method of audio processing according to an embodiment of the present application;
FIG. 6 is an algorithm block diagram of an audio processing apparatus according to an embodiment of the present application;
FIG. 7 is a flow chart of a method of audio processing according to an embodiment of the present application;
FIG. 8 is a flow chart of a method of audio processing in an embodiment of the application;
FIG. 9 is a flow chart of a method of audio processing in an embodiment of the application;
Fig. 10 schematically shows a block diagram of the audio processing apparatus in the embodiment of the present application;
Fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Network online conferencing programs are becoming the preferred way of teleconferencing. The participants access the cloud conference server through the network online conference program and listen to the conference and speak through the speakers and microphones on the terminal.
It can be understood that the audio processing method and the related apparatus in the embodiments of the present application can be applied to voice call devices such as computers and mobile phones, Internet devices such as smartphones and smart televisions, and dedicated devices such as landline telephones and teleconference cameras. In all of these applications, a microphone is used to collect the user's voice audio data, and the audio processing method of the embodiments of the present application is used to perform denoising and obtain the audio with noise removed. For specific implementations, refer to the following detailed description of applying the embodiments of the present application to a cloud conference.
Referring to fig. 1, fig. 1 is an interface schematic diagram of a conference application according to an embodiment of the application. The conference application runs on a terminal (e.g., a computer) that connects to a cloud conference server via the Internet and receives and transmits video, audio, and text information to participate in the conference. The computer has a built-in or external microphone. After connecting to the online conference, the user may switch the microphone to a mute mode in the conference application. At this point, the conference application will not send audio signals to the cloud conference server. However, the microphone on the computer is not turned off and will continue to collect audio information for the conference application to analyze the user's speaking state. When the user needs to speak during the conference, the user can simply start speaking: the conference application first processes the audio information collected by the microphone with the audio processing method of the present application, filters out the noise in the audio information (such as the sound of mouse and keyboard operation, notification sounds from other applications or mobile phone messages, the sound of objects being moved on the desktop or of desks and chairs, and the like), and then analyzes the denoised audio information by means of Voice Activity Detection (VAD). When it is detected that the user is speaking, the conference application will prompt the user to turn on the microphone to enter the speaking mode, or directly turn on the microphone, so that the user can speak.
It can be appreciated that the cloud conference server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Networks (CDN), big data, and artificial intelligence platforms. The terminal may be a computer (such as a notebook computer or a desktop computer), a smartphone, a tablet computer, a smart speaker, a smart watch, or the like, but is not limited thereto. The terminal and the cloud conference server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In specific implementations, the audio processing method in the embodiments of the application can be realized by means of machine learning, and can in particular be applied to cloud conferences.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Cloud conferencing is an efficient, convenient, and low-cost form of conferencing based on cloud computing technology. Through a simple and easy-to-use Internet interface, users can quickly and efficiently share voice, data files and video with teams and customers around the world in a synchronized manner, while the cloud conference service provider handles the complex technologies in the conference, such as data transmission and processing, on the users' behalf.
At present, domestic cloud conferencing mainly focuses on service content provided in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; a video conference based on cloud computing is called a cloud conference.
In the cloud conference era, the transmission, processing and storage of data are all handled by the computing resources of video conference providers, so users can hold efficient remote conferences without purchasing expensive hardware or installing cumbersome software.
The cloud conference system supports dynamic cluster deployment across multiple servers and provides multiple high-performance servers, which greatly improves conference stability, security and availability. In recent years, video conferencing has become popular with many users because it greatly improves communication efficiency, continuously reduces communication cost, and upgrades internal management; it has been widely used in government, transportation, finance, operators, education, enterprises and other fields. Undoubtedly, with cloud computing applied, video conferencing is even more attractive in terms of convenience, speed, and ease of use, which will surely stimulate wider adoption of video conference applications.
The scheme of the application is suitable for filtering noise in voice audio information to obtain the denoised voice information for subsequent processing so as to improve the accuracy of voice information operation. The following describes the technical scheme provided by the application in detail by combining the specific embodiments. The method of the present embodiment can be applied to a computer terminal, and is specifically performed by an audio processing apparatus.
Referring to fig. 2, fig. 2 is a flowchart of a method for audio processing according to an embodiment of the application, where the flowchart includes at least the following steps S201 to S205:
Step S201, obtaining original audio features corresponding to the audio data to be processed.
In the embodiment of the application, the audio processing device can acquire the audio data to be processed through a microphone. The audio data to be processed may include both noise data and voice data. The audio data to be processed may typically be sampled at a frequency of 16000 Hz. According to a preset frequency band division rule, the audio processing device divides the frequency band of the audio data to be processed to obtain a plurality of subbands. Then, for each subband, its parameter features are calculated.
The number and manner of subband division and the selection of the parameter features may take various suitable forms. In particular, the subbands may be divided into a plurality of Bark bands in the Bark domain. For each Bark band, parameters such as the cepstrum coefficients and difference coefficients within the band can be calculated as the parameter features.
Step S202, a first network model is called, and the original audio features are processed to obtain first audio features, wherein the first audio features comprise at least one-dimensional features.
In the embodiment of the application, the audio processing device inputs the original audio features into the first network model for processing to obtain the first audio features. The first network model may be a recurrent neural network model, which takes sequence data as input, recurses along the evolution direction of the sequence, and connects all its recurrent units in a chain. The first network model is one such recurrent unit and may be implemented with models such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU). The first network model receives the original audio features as input and outputs a multi-dimensional vector, i.e. the first audio features.
Step S203, calling a second network model, and processing the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
In the embodiment of the application, the audio processing device combines the first audio feature and the original audio feature into the input feature and inputs the input feature into the second network model for processing to obtain the second audio feature.
In particular, the second network model is also a recurrent neural network model of the same type as the first network model, and it receives as inputs the output result of the first network model as well as the original audio features. The dimension of the output result of the second network model (i.e. the second audio features) is typically larger than the dimension of the output result of the first network model (i.e. the first audio features), in order to better emphasize the features in the audio data. Specifically, if the first audio features include 60 feature values, the second audio features include at least 61 feature values, for example 70 or 80 feature values. This is because the input data of the second network model includes both the first audio features and the original audio features; making the feature quantity of the second audio features greater than that of the first audio features gives the second audio features enough capacity to accommodate the feature details in the original audio features and increases the weight of the original audio features in the calculation of the second audio features, thereby improving the learning capability and the denoising effect of the model. If the feature quantity of the second audio features were equal to or smaller than that of the first audio features, the insufficient capacity of the second audio features would lower the weight of the original audio features, reduce the learning ability, lose the feature details in the original audio features, and weaken the denoising effect.
The types of activation functions used by the second network model and the first network model may be the same or different, and are not limited herein.
It is understood that, in the present application, each feature value in the audio features obtained through network model processing may be data whose value lies within a predetermined range. The feature values are intermediate values in the learning process, do not necessarily correspond to actual physical meanings, and do not necessarily have a direct relationship with the audio data to be processed. The value range of each feature value depends on the activation function adopted by the corresponding network model. For example, if the first network model employs the hyperbolic tangent function, the first audio features output by the first network model will include, for example, 60 feature values, each of which is a value ranging from -1 to 1 and does not necessarily have an actual physical meaning.
Step S204, calling a fully connected network model according to the second audio characteristics and the original audio characteristics, and obtaining a gain result corresponding to the audio data to be processed.
In particular, the audio processing device may invoke at least two recurrent units. If only the first network model and the second network model are invoked, the audio processing device can input the second audio features output by the second network model into the fully connected network model for processing to obtain the gain result of the audio. The number of dimensions of the gain result is the same as the number of subbands in the original audio features. For example, if the audio data to be processed is divided into 50 subbands, the gain result has 50 dimensions.
If the audio processing device invokes three or more recurrent units, the audio processing device continues to invoke the subsequent recurrent units for further processing according to the second audio features and the original audio features: the output results of the preceding recurrent units, together with the original audio features, are used as the input of the next recurrent unit, until the last recurrent unit in the sequence finishes processing and the final audio features are obtained. The fully connected network model is then used for processing to obtain the gain result.
It can be understood that the number of dimensions of the output result of each recurrent unit should rise stepwise, so that the voice features and the noise features in the audio data to be processed are embodied gradually and fully, which is beneficial for extracting the noise.
Step S205, generating denoising audio data according to the gain result and the audio data to be processed.
Specifically, after the gain result is obtained, a denoising operation can be performed on the gain result and the audio data to be processed to obtain the denoised audio data. For example, the gain result includes M dimensions that correspond to the M subbands into which the audio data to be processed is partitioned. The denoising operation is performed according to the signal values of each subband and the feature values in the corresponding gain result, so that denoised signal values are obtained, and all the calculated signal values are combined to obtain the denoised audio data.
In the embodiment of the application, the audio data to be processed is processed through neural network models. During processing, the original input features and the output results of the preceding recurrent units are fed into the subsequent recurrent units for calculation, so that the noise characteristics in the original audio features can be fully considered during model calculation. The noise is therefore filtered thoroughly, the denoising effect is improved, the human voice in the audio can be judged more accurately, and the judgment accuracy is improved.
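To make the overall flow of steps S201 to S205 easier to follow, the following is a minimal Python sketch of how the steps could be orchestrated. The helper names (feature_extractor, model, apply_subband_gains) are illustrative placeholders rather than functions defined in this application.

```python
import numpy as np

def denoise_frame(frame_16k, feature_extractor, model, apply_subband_gains):
    """Illustrative sketch of steps S201-S205; all helper names are placeholders.

    frame_16k:           one frame of time-domain audio sampled at 16000 Hz
    feature_extractor:   computes the original audio features per subband (step S201)
    model:               the chained recurrent units plus the fully connected gain layer
    apply_subband_gains: multiplies each subband of a spectrum by its gain value
    """
    # Step S201: original audio features (e.g. BFCC and difference coefficients per subband)
    features = feature_extractor(frame_16k)

    # Steps S202-S204: recurrent units + fully connected layer -> one gain per subband
    gains = model(features)

    # Step S205: apply the per-subband gains in the frequency domain, then return to the time domain
    spectrum = np.fft.rfft(frame_16k)
    gained_spectrum = apply_subband_gains(spectrum, gains)
    return np.fft.irfft(gained_spectrum, n=len(frame_16k))
```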
In one embodiment of the present application, in order to reduce the resource consumption of the algorithm and improve the computing efficiency on the basis of sufficiently recognizing the voice feature, as shown in fig. 3, the step S201 of obtaining the original audio feature corresponding to the audio data to be processed may include the following steps S301 to S303, which are described in detail as follows:
step S301, dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;
Step S302, frequency division is carried out on the frequencies of the first frequency interval and the second frequency interval, and the sub-bands of the second frequency interval are subjected to sparsification processing, so that a sub-band set is obtained, wherein the number of sub-bands divided from the first frequency interval is larger than the number of sub-bands divided from the second frequency interval, and the sub-band set comprises audio fragment data corresponding to each sub-band;
Step S303, calculating the original audio characteristics according to the subband set.
In the embodiment of the application, the audio data to be processed is sampled at 16000 Hz to obtain a wideband speech signal with a bandwidth of 8000 Hz. The audio processing device divides the audio data to be processed into a first frequency interval and a second frequency interval. The first frequency interval is chosen according to the usual frequencies of human speech and typically covers a relatively low frequency band; for example, the first frequency interval may be 0 to 2000 Hz. The second frequency interval mainly covers the frequency ranges related to various environmental noises and does not overlap with the first frequency interval; for example, the second frequency interval may be 2000 Hz to 8000 Hz.
The first frequency interval and the second frequency interval are respectively divided into a plurality of sub-bands, each sub-band corresponding to one piece of audio clip data. In the present application, the frequency bands are divided in the manner of the bark (bark) domain. Before obtaining a plurality of characteristic parameters of the audio data to be processed, carrying out Fourier transform on the audio data to be processed to obtain an amplitude spectrum of the audio data to be processed, and then carrying out bark sub-band division on the amplitude spectrum of the current audio data to be processed according to critical frequency band definition to obtain the characteristic parameters of a plurality of sub-bands.
For example, a short-time Fourier transform may be performed on the audio data to be processed and the amplitude spectrum of the current audio signal segment calculated. The audio data to be processed is a noisy audio signal consisting of a clean human voice signal s(t) and uncorrelated noise w(t); for example, w(t) may be noise in the environment. The time-domain expression of the audio data to be processed satisfies: x(t) = s(t) + w(t), where t represents time. Performing the short-time Fourier transform on both sides of this expression gives the frequency-domain expression of the current audio signal segment: X(k) = S(k) + W(k), where X(k) represents the amplitude spectrum of the noisy signal, S(k) represents the amplitude spectrum of the human voice signal, W(k) represents the noise amplitude spectrum, and k represents the frequency bin. For example, a 512-point short-time Fourier transform may be performed on the audio data to be processed.
The first frequency interval may be divided directly into a preset number of Bark bands; the specific number of divisions typically depends on experience. For example, the first frequency interval may be divided into 36 Bark bands. The second frequency interval is first sparsified before the Bark bands are divided, so that the influence of the noise signals in the second frequency interval on the result is moderately reduced. The number of subbands in the second frequency interval may be smaller than the number of subbands in the first frequency interval, so as to avoid interfering with the calculation result. For example, the second frequency interval may be divided into 28 subbands.
The audio segments corresponding to the subbands of the first frequency interval and the second frequency interval constitute the subband set, i.e. 64 subbands in total. For each of these 64 subbands, various audio features can be calculated as the original audio features, such as the spectrogram, the short-time power spectral density, the fundamental frequency, the formants, and the cepstral coefficients.
In this embodiment, by dividing the audio data to be processed into two different intervals and performing the sparsification processing on the interval involving noise, the noise signal in the audio data to be processed can be represented with a small number of features.
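As a rough illustration of this subband division, the sketch below groups the bins of a 512-point short-time Fourier transform into 36 + 28 = 64 Bark-style subbands, following the example numbers above. The exact Bark band edges are not listed in this application, so the band_edges table must be supplied by the caller and is an assumption of the sketch.

```python
import numpy as np

FFT_SIZE = 512          # 512-point short-time Fourier transform, as in the example above
NUM_LOW_BANDS = 36      # denser bands for the first (speech-dominated) frequency interval
NUM_HIGH_BANDS = 28     # sparser bands for the second (noise-dominated) frequency interval

def magnitude_spectrum(frame):
    """X(k) = S(k) + W(k): magnitude spectrum of one noisy frame x(t) = s(t) + w(t)."""
    window = np.hanning(len(frame))
    return np.abs(np.fft.rfft(frame * window, n=FFT_SIZE))

def split_into_subbands(magnitude, band_edges):
    """Group FFT bins into Bark-style subbands; band_edges holds 64 + 1 assumed bin indices."""
    return [magnitude[band_edges[b]:band_edges[b + 1]] for b in range(len(band_edges) - 1)]

def subband_energies(magnitude, band_edges):
    """Per-band energy, a typical starting point for the per-subband features."""
    return np.array([np.sum(band ** 2) for band in split_into_subbands(magnitude, band_edges)])
```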
In one embodiment of the present application, in order to more accurately determine the normally speaking voice and the non-speaking voice, as shown in fig. 4, the step S303 may calculate the original audio feature according to the subband set, and may include the following steps S401 to S403, which are described in detail below:
step S401, calculating the Bark frequency cepstrum coefficient of each sub-band in the sub-band set to obtain a first feature set;
Step S402, calculating a difference coefficient and a discrete cosine transform value between sub-bands for at least two sub-bands in the sub-band set to obtain a second feature set;
step S403, determining the original audio feature according to the first feature set and the second feature set.
The audio processing device may calculate the parametric characteristics within each sub-band. Specifically, for example, the audio data to be processed is divided into 56 bands, wherein 32 bands are divided for the lower frequency portion (0 to 1000 Hz) and 24 bands are divided for the higher frequency portion (1000 to 8000 Hz). For each of the 56 bark bands, coefficients of the bark frequency cepstrum (Bark Frequency Cepstrum Coefficient, BFCC) within the band, i.e., bark domain parameter features, are calculated, thereby resulting in 56 features, forming a first feature set.
It will be appreciated that the above frequency band division and parameter features are merely examples, and the BFCC coefficients may also be replaced by other parameters, such as Mel frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC), which are not limited herein.
For a portion of the sub-band, the audio processing device may calculate the difference coefficients of BFCC coefficients within its band, as well as discrete cosine transform values. Specifically, for example, for the first 1 to 6 bark bands, the first order differential coefficient and the second order differential coefficient of the band thereof of BFCC coefficients may be calculated, and also the discrete cosine transform value of the in-band signal cross-correlation coefficient may be calculated, thereby obtaining 18 features, forming the second feature set.
The first-order difference is the difference between the BFCC coefficients of two adjacent subbands and can be used to reflect the relationship between the two adjacent subbands. Illustratively, the first-order difference of the BFCC coefficients of the subbands may be obtained according to the following formula: Y(b) = X(b+1) - X(b), where X(b) is the BFCC coefficient of subband b and Y(b) is the first-order difference. The second-order difference of the BFCC coefficients is the difference between two adjacent first-order differences; it represents the relationship between the preceding and following first-order differences and can be used to characterize the dynamic relationship among three adjacent subbands of the audio amplitude spectrum. For example, the second-order difference of the BFCC coefficients can be obtained according to the following formula: Z(b) = Y(b+1) - Y(b) = X(b+2) - 2X(b+1) + X(b), where X(b) is the BFCC coefficient of subband b, Y(b) is the first-order difference, and Z(b) is the second-order difference.
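The following Python sketch shows one way the difference features above could be computed. The BFCC computation itself (a DCT of the log subband energies) is an assumption made for illustration; only the first-order and second-order difference formulas come directly from the description above.

```python
import numpy as np
from scipy.fft import dct

def bfcc(subband_energies, eps=1e-12):
    """Assumed BFCC sketch: DCT of the log subband energies (details are not specified here)."""
    return dct(np.log(subband_energies + eps), norm='ortho')

def first_order_difference(x):
    """Y(b) = X(b+1) - X(b): relationship between two adjacent subbands."""
    return x[1:] - x[:-1]

def second_order_difference(x):
    """Z(b) = Y(b+1) - Y(b) = X(b+2) - 2X(b+1) + X(b): relationship across three adjacent subbands."""
    y = first_order_difference(x)
    return y[1:] - y[:-1]
```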
In one embodiment, because the subband resolution is not fine enough to resolve the speech between the harmonics of the fundamental frequency, the noise between the harmonics may not be suppressed carefully enough. A post-filtering method may therefore be added: a comb filter is used within one pitch period to remove the inter-harmonic noise. Accordingly, the original audio features may incorporate, as additional features, the fundamental frequency (pitch) of the pitch period used by the comb filter as well as an energy parameter.
Thus, from the first feature set and the second feature set described above, together with the additional features, the original audio features comprising 76 dimensions can be determined.
It will be appreciated that the above-described numbers of first feature set, second feature set, and additional features are merely examples and not limiting, and that one skilled in the art may determine the number of features depending on the particular implementation.
In the embodiment of the application, the original audio characteristics are determined by calculating the bark frequency cepstrum coefficient, the difference coefficient and the discrete cosine transform value in the sub-band, so that the voice and the noise condition in the audio to be processed can be fully represented.
In one embodiment of the present application, in order to more fully filter noise data in the audio data to be processed, as shown in fig. 5, the step S204 may call a fully connected network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed, and may include the following steps S501 to S502, which are described in detail below:
Step S501, calling a third network model, and processing the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the number of the features of the third audio features is larger than that of the features of the second audio features;
Step S502, calling a full-connection network model according to the third audio characteristics, and obtaining a gain result corresponding to the audio data to be processed.
In an embodiment of the application, the audio processing device invokes three network models. For convenience of description, refer to fig. 6, which is an algorithm structure diagram of an audio processing apparatus according to an embodiment of the present application. Specifically, all three network models are implemented with GRU models. For example, assume that the original audio features include 76 features calculated from 56 subbands. The audio processing device inputs the values of these 76 features into the first GRU model. The first GRU model uses the tanh function as its activation function, and the first audio features it outputs include 60 features. The 60 features of this first output result, together with the 76 original audio features, are then input into a second GRU model, which uses the ReLU function as its activation function; the second audio features it outputs include 70 features. Similarly, the audio processing device invokes a third GRU model, which also uses the tanh function as its activation function, to process the 76 original audio features, the 60 features of the first audio features, and the 70 features of the second audio features; the third audio features it outputs include 130 features. It can be understood that, following the sequence order of the GRU models, the number of output features increases gradually so as to retain more detailed features, making the representation of the speech signal and the noise signal more specific, which allows the gain to be calculated accurately and improves the denoising effect.
After the 130 features of the third audio feature are obtained, the audio processing device inputs them into the fully connected model. In this embodiment, the fully connected model uses the Sigmoid function as its activation function and, from the 130 input features, calculates 56 feature values corresponding to the 56 subbands; the calculated values are used as the output gain result.
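A minimal PyTorch sketch of the structure described above (76 → 60 → 70 → 130 → 56) is given below, assuming PyTorch as the framework. Note that the standard nn.GRU cell uses its built-in tanh/sigmoid gating, so the per-unit choice of tanh or ReLU activations in the example would require a custom cell; the class name and layer sizes here are illustrative only.

```python
import torch
import torch.nn as nn

class GainEstimator(nn.Module):
    """Sketch of three chained GRU units (76 -> 60 -> 70 -> 130) and a dense gain layer (56)."""

    def __init__(self, n_features=76, n_bands=56):
        super().__init__()
        self.gru1 = nn.GRU(n_features, 60, batch_first=True)             # first recurrent unit
        self.gru2 = nn.GRU(n_features + 60, 70, batch_first=True)        # fed original features + first output
        self.gru3 = nn.GRU(n_features + 60 + 70, 130, batch_first=True)  # fed original features + both outputs
        self.dense = nn.Linear(130, n_bands)                             # fully connected gain layer

    def forward(self, x):
        # x: (batch, time, n_features) sequence of original audio features
        h1, _ = self.gru1(x)
        h2, _ = self.gru2(torch.cat([x, h1], dim=-1))
        h3, _ = self.gru3(torch.cat([x, h1, h2], dim=-1))
        return torch.sigmoid(self.dense(h3))  # one gain value in (0, 1) per subband
```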
Similar to the description about the first network model and the second network model, the feature number of the third audio feature output by the third network model is greater than the feature number of the second audio feature, and each feature value in the third audio feature does not necessarily have an actual physical meaning, and detailed descriptions about the first audio feature and the second audio feature are omitted herein.
It should be noted that the above-described GRU models may be replaced by other neural network models, such as a long short-term memory network or another recurrent neural network model. The activation functions of the individual GRU models may also be replaced with other similar activation functions. The dimensions of the output results of the individual GRU models may likewise depend on the input values and the specific implementation, as long as they follow the trend of increasing along the model sequence. The types of the neural network models, the types of the activation functions, and the dimensions of the output results are not limited here.
In the embodiment of the application, the audio processing device specifically calls three network model units, so that the denoising capability of the audio processing device can be improved, and meanwhile, the quantization volume of the audio processing device is maintained to meet the requirement of real-time communication, so that the accuracy of a subsequent voice detection algorithm is improved, and the user experience is improved.
In one embodiment of the present application, in order to obtain the denoising audio data, as shown in fig. 7, step S205 may specifically be described above, and the generating the denoising audio data according to the gain result and the audio data to be processed may include steps S701 to S702 as follows:
step S701, performing multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;
step S702, performing inverse fast Fourier transform on the audio gain result to obtain denoising audio data.
Specifically, for each subband of the audio data to be processed, the gain result includes a corresponding gain feature value. The spectral values of the subband are multiplied by the gain feature value, so that the noise components therein are filtered out and the voice signal therein is amplified, thereby performing the denoising operation. The calculation results obtained by multiplying each subband by its corresponding gain feature value are combined to obtain the audio gain result.
Then, an inverse fast Fourier transform is performed on the audio gain result, so that the data of the audio gain result is converted from the frequency domain into the time domain, and the denoised audio data is obtained.
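The two steps above can be sketched as follows. The band_edges table is the same assumed subband-to-bin mapping used for feature extraction; the helper name is illustrative.

```python
import numpy as np

def apply_gains_and_reconstruct(spectrum, gains, band_edges, frame_len):
    """Steps S701-S702 sketch: multiply each subband by its gain, then inverse-FFT to the time domain.

    spectrum:   complex rfft of the frame of audio data to be processed
    gains:      one gain feature value per subband (output of the fully connected layer)
    band_edges: bin indices delimiting the subbands (assumed mapping, as in feature extraction)
    """
    gained = spectrum.copy()
    for b, g in enumerate(gains):
        gained[band_edges[b]:band_edges[b + 1]] *= g  # attenuate noise-dominated subbands
    return np.fft.irfft(gained, n=frame_len)          # denoised time-domain frame
```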
In the embodiment of the application, the audio data to be processed is denoised by utilizing the gain result, so that the influence of external noise factors is effectively eliminated, and the quality and effect of the generated denoised audio data are improved.
In one embodiment of the present application, in order to obtain a trained audio processing model, where the audio processing model includes the first network model, the second network model and the fully connected network model, as shown in fig. 8, the method may include the following steps S801 to S805 before step S201 obtains the original audio features corresponding to the audio data to be processed, which are described in detail below:
step S801, obtaining training audio characteristics corresponding to audio data to be trained;
Step S802, a first network model included in a model to be trained is called, and training audio features are processed to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
step S803, a second network model included in the model to be trained is called, and the training audio features and the first audio features are processed to obtain second audio features, wherein the dimensions of the second audio features are larger than those of the first audio features;
step S804, calling a fully connected network model included in the model to be trained according to the second audio characteristics and the training audio characteristics, and obtaining a gain result corresponding to the audio data to be processed;
step S805, adjusting model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be processed, so as to obtain an audio processing model.
The audio processing model comprises a plurality of sub-models, specifically a first network model, a second network model and a fully connected network model. In one embodiment, the audio processing model may include further network models, such as a third network model, each connected in sequence and taking the output results of the preceding models as well as the original input features as its own input features. The last model in the sequence feeds its output result into the fully connected network model to obtain the final gain result. The number of network models included in the audio processing model may depend on the specific implementation, and the application is not limited in this respect.
Specifically, the audio data to be trained includes noisy audio data. The training set of the neural network can be constructed from the collected fundamental-frequency information of a large number of audio signals and the feature parameters of a plurality of subbands. The original noisy training data satisfy: X(b) = S(b) + W(b), and the target enhanced training data satisfy: X'(b) = g(b) × S(b) + W(b), which are used for parameter training. The purpose of the algorithm is to optimize this target enhancement factor g(b). Here b is the subband index, X(b) represents the original noisy amplitude spectrum, X'(b) represents the noisy amplitude spectrum after human voice enhancement, S(b) represents the noise-free human voice amplitude spectrum, and W(b) represents the noise amplitude spectrum. The loss function describes the relationship between the target enhancement result and the enhancement result output by the audio model to be trained, and may be L(p(x), p'(x)) = (p(x) - p'(x))², where p(x) represents the target enhancement result and p'(x) represents the enhancement result output by the audio model to be trained. The target enhancement result can be calculated from the audio data to be processed and the corresponding noiseless audio data. In neural networks, the degree of fitting is usually measured with a loss function: minimizing the loss function means the fit is best, and the corresponding model parameters are the optimal parameters.
Therefore, in the training process of the audio processing model, firstly, according to the above parameter feature calculation method, subband division and feature calculation are performed on the audio data to be trained to obtain the training audio features. Then, according to the number of recurrent units in the model to be trained, the output results of the preceding units and the original training audio features are used as inputs to calculate each output result. For a model to be trained with a two-layer structure, the first network model included in the model to be trained is first called to process the training audio features to obtain the first audio features, where the first audio features include at least one-dimensional features; then the second network model included in the model to be trained is called to process the training audio features and the first audio features to obtain the second audio features, where the dimension of the second audio features is larger than the dimension of the first audio features.
Then, the output of the second network model is passed through the fully connected model to obtain the final gain result. A target gain result is obtained from the audio data to be trained and the corresponding noiseless audio data. The loss function is calculated from the target gain result and the gain result output by the model to be trained, and the model parameters of the model to be trained are adjusted according to the loss result to obtain the audio processing model.
The training process of the model to be trained may be performed iteratively; specifically, a plurality of training batches may be set, each batch inputting a certain amount of audio data to be trained as a training data set. During iterative training, the loss value may be iteratively optimized by an adaptive moment estimation (Adam) optimizer.
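A minimal PyTorch training-loop sketch along these lines is shown below, assuming the GainEstimator sketch from earlier. How the target gain g(b) is derived from the clean/noisy pair is not spelled out in the description, so the target_gains helper (an energy-ratio estimate clipped to [0, 1]) is an assumption.

```python
import torch
import torch.nn as nn

def target_gains(clean_band_energy, noisy_band_energy, eps=1e-12):
    """Assumed target g(b): clean-to-noisy subband energy ratio, clipped to [0, 1]."""
    return torch.clamp(torch.sqrt(clean_band_energy / (noisy_band_energy + eps)), 0.0, 1.0)

def train(model, loader, epochs=10, lr=1e-3):
    """Minimize L(p(x), p'(x)) = (p(x) - p'(x))^2 between target and predicted gains with Adam."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for features, clean_energy, noisy_energy in loader:    # one batch of audio data to be trained
            predicted = model(features)                         # p'(x): gains output by the model
            target = target_gains(clean_energy, noisy_energy)   # p(x): target enhancement result
            loss = mse(predicted, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```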
In the embodiment, the audio data to be trained is utilized to train the model to be trained, so that the audio processing model is obtained, and feasibility of a scheme is improved.
In one embodiment of the present application, in order to control the state of the audio capturing device for the user to speak, as shown in fig. 9, the method may include the following steps S901 to S903, which are described in detail as follows:
before the original audio features corresponding to the audio data to be processed are obtained in step S201, the method further includes:
step S901, collecting audio data to be processed through an audio collecting device;
After generating the denoised audio data according to the gain result and the audio data to be processed in step S205, the method further includes:
step S902, performing recognition processing on the denoised audio data to obtain an audio recognition result;
In step S903, if the audio recognition result indicates that the audio data to be processed is human voice, the audio acquisition device is controlled to transmit the audio data, otherwise, the audio acquisition device is controlled to stop transmitting the audio data.
The audio capturing device may be any kind of microphone or other device with audio capturing functionality. Specifically, after a user participates in the cloud conference server through the conference application, the conference application acquires audio data to be processed through a microphone. The user can switch the microphone to a mute state, at this time, the conference application will not send audio data to the cloud conference server to speak, but still will collect the audio data to be processed through the microphone, so that the background program of the conference application can analyze whether the user is speaking.
When the user speaks, the conference application uses the audio processing apparatus of the above embodiments to perform denoising processing on the collected audio data to be processed, obtaining the denoised audio data. Then, in step S902, voice recognition may be performed on the denoised audio data by using a VAD algorithm or another type of detection algorithm, so as to obtain an audio recognition result.
If the audio recognition result indicates that the audio data to be processed includes human voice, it can be judged that the user is currently speaking, and the microphone can be switched to the talk state. In the talk state, the microphone sends audio data to the remote cloud conference server through the conference application so that the user can speak. Otherwise, if the audio recognition result indicates that the audio data to be processed does not include human voice, the microphone is kept in the mute state. In the mute state, the microphone stops transmitting audio data. In one embodiment, if the microphone is already in the talk state, no processing needs to be done. In one embodiment, before the recognition processing is performed on the denoised data, or before the audio data to be processed is collected by the microphone, the state of the microphone may first be detected: if the microphone is in the talk state, no operation is performed; if the microphone is in the mute state, the above steps are started.
It will be appreciated that the above-mentioned states of the microphone refer to states set for the microphone in an application environment such as the conference application, rather than to the on-off state of the microphone hardware itself. The mute state and the talk state are used to distinguish whether the conference application is sending audio data to the remote server; in both states the microphone is powered on and able to collect audio data.
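The microphone-state logic described above can be sketched as follows; the Microphone wrapper, denoise() and contains_voice() callables are hypothetical placeholders standing in for the conference application's device state, the audio processing model and the VAD-style detector, and do not represent an actual API of any particular product.

```python
from dataclasses import dataclass

@dataclass
class Microphone:
    muted: bool = True                    # application-level state, not the hardware power state

def control_microphone(mic, frame, denoise, contains_voice):
    """Decide whether the conference application should start transmitting audio.

    denoise(frame)        -> denoised audio frame (the audio processing model above)
    contains_voice(frame) -> True if the denoised frame is judged to contain human voice (e.g. VAD)
    """
    if not mic.muted:
        return                            # already in the talk state: nothing to do
    denoised = denoise(frame)             # remove noise before making the voice decision
    if contains_voice(denoised):
        mic.muted = False                 # switch to the talk state: start sending audio to the server
    # otherwise the microphone stays in the mute state and no audio data is transmitted
```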
In this embodiment, the collected audio is denoised by the method in the embodiments of the present application, human voice is then detected in the denoised audio, and whether the audio device transmits audio data is controlled according to the detection result. In this way, when the user forgets to turn on the audio device, the application can turn it on on the user's behalf so that the user can be heard, which saves the user from having to repeat what was said and improves the usability of the application.
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes apparatus embodiments of the present application, which may be used to perform the audio processing method in the above-described embodiments of the present application. Fig. 10 schematically shows a block diagram of the audio processing apparatus in the embodiment of the present application. As shown in fig. 10, the audio processing apparatus 1000 may mainly include:
an obtaining module 1001, configured to obtain an original audio feature corresponding to audio data to be processed;
a calling module 1002, configured to call a first network model to process the original audio feature to obtain a first audio feature, where the first audio feature includes at least one-dimensional feature;
The calling module is further used for calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
The calling module is further used for calling a full-connection network model to obtain a gain result corresponding to the audio data to be processed according to the second audio feature and the original audio feature;
And a generating module 1003, configured to generate denoising audio data according to the gain result and the audio data to be processed.
In some embodiments of the present application, based on the above technical solutions, the obtaining module 1001 includes:
The interval dividing unit is used for dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;
A subband dividing unit, configured to perform frequency division on the frequencies of the first frequency interval and the second frequency interval, and perform sparsification processing on the subbands of the second frequency interval to obtain a subband set, where the number of subbands divided from the first frequency interval is greater than the number of subbands divided from the second frequency interval, and the subband set includes audio segment data corresponding to each subband;
And the characteristic calculation unit is used for calculating the original audio characteristic according to the subband set.
In some embodiments of the present application, based on the above technical solution, the feature calculating unit includes:
A first calculating subunit, configured to calculate Bark-frequency cepstral coefficients of each subband in the subband set, to obtain a first feature set;
A second calculating subunit, configured to calculate, for at least two subbands in the subband set, a difference coefficient and a discrete cosine transform value between the subbands, to obtain a second feature set;
And the characteristic determining subunit is used for determining the original audio characteristic according to the first characteristic set and the second characteristic set.
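Purely as an illustration of the feature pipeline described by the above units, the sketch below divides a frame into dense low-frequency subbands and sparser high-frequency subbands, computes cepstral coefficients from the subband energies and adds simple difference coefficients; the band edges, subband counts and coefficient choices are assumptions made for the example and are not values specified by the embodiments.

```python
import numpy as np
from scipy.fftpack import dct

def extract_features(frame, sample_rate=48000):
    """Compute a hypothetical original-audio feature vector for one audio frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # First frequency interval: many narrow subbands; second interval: few wide (sparsified) subbands.
    low_edges = np.linspace(0, 8000, 17)          # 16 subbands below an assumed 8 kHz split
    high_edges = np.linspace(8000, 24000, 7)      # 6 subbands above it
    edges = np.concatenate([low_edges, high_edges[1:]])

    band_energy = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum() + 1e-10
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

    cepstral = dct(np.log(band_energy), norm='ortho')  # cepstral coefficients of the subband energies (first feature set)
    deltas = np.diff(cepstral)                          # simple difference coefficients between subbands (second feature set)
    return np.concatenate([cepstral, deltas])           # original audio feature
```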
In some embodiments of the present application, based on the above technical solutions, the calling module 1002 includes:
The model calling unit is used for calling a third network model to process the original audio feature, the first audio feature and the second audio feature to obtain a third audio feature, wherein the feature quantity of the third audio feature is larger than that of the second audio feature;
The model calling unit is further configured to call a fully connected network model according to the third audio feature, and obtain a gain result corresponding to the audio data to be processed.
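As an illustration of how the cascaded first, second and third network models and the fully connected output might be wired together, the sketch below uses GRU layers and feeds each stage the original features together with all earlier outputs; the choice of GRUs and the layer widths are assumptions for the example, not a structure stated in the embodiments.

```python
import torch
from torch import nn

class GainEstimator(nn.Module):
    """Hypothetical cascade in which each stage sees more features than the previous one."""

    def __init__(self, feat_dim=42, bands=22):
        super().__init__()
        self.net1 = nn.GRU(feat_dim, 24, batch_first=True)            # first network model
        self.net2 = nn.GRU(feat_dim + 24, 48, batch_first=True)       # second network model (more features)
        self.net3 = nn.GRU(feat_dim + 24 + 48, 96, batch_first=True)  # third network model (even more features)
        self.fc = nn.Sequential(nn.Linear(96, bands), nn.Sigmoid())   # fully connected gain head

    def forward(self, x):                                  # x: (batch, time, feat_dim) original audio features
        f1, _ = self.net1(x)                               # first audio features
        f2, _ = self.net2(torch.cat([x, f1], dim=-1))      # second audio features
        f3, _ = self.net3(torch.cat([x, f1, f2], dim=-1))  # third audio features
        return self.fc(f3)                                 # per-subband gain result in [0, 1]
```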
In some embodiments of the present application, based on the above technical solution, the generating module 1003 includes:
The gain calculation unit is used for carrying out multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;
And the audio conversion unit is used for carrying out inverse fast Fourier transform on the audio gain result to obtain denoising audio data.
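The gain application and inverse transform performed by these units can be sketched as follows; how the per-subband gains are spread over FFT bins, and the band edges themselves, are assumptions made for the example.

```python
import numpy as np

def apply_gains(frame, gains, band_edges, sample_rate=48000):
    """Multiply the noisy spectrum by the predicted per-subband gains, then inverse-transform."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    gain_per_bin = np.ones_like(freqs)
    for g, (lo, hi) in zip(gains, zip(band_edges[:-1], band_edges[1:])):
        gain_per_bin[(freqs >= lo) & (freqs < hi)] = g    # spread each subband gain over its bins

    denoised_spectrum = spectrum * gain_per_bin           # gain result multiplied with the audio to be processed
    return np.fft.irfft(denoised_spectrum, n=len(frame))  # inverse fast Fourier transform -> denoised audio data
```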
In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus 1000 further includes:
the obtaining module 1001 is further configured to obtain a training audio feature corresponding to the audio data to be trained;
The invoking module 1002 is further configured to invoke a first network model included in the model to be trained, and process the training audio feature to obtain a first audio feature, where the first audio feature includes at least one-dimensional feature;
The invoking module 1002 is further configured to invoke a second network model included in the model to be trained, and process the training audio feature and the first audio feature to obtain a second audio feature, where a dimension of the second audio feature is greater than a dimension of the first audio feature;
The invoking module 1002 is further configured to invoke a fully connected network model included in the to-be-trained model according to the second audio feature and the training audio feature, to obtain a gain result corresponding to the to-be-processed audio data;
and the training module is used for adjusting the model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be processed to obtain an audio processing model.
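One plausible way the training module's target could be derived from each noisy frame and its noiseless counterpart is sketched below; treating the clean-to-noisy subband energy ratio as the ideal gain is an assumption made in the spirit of this kind of training and is not a procedure stated in the embodiments.

```python
import numpy as np

def ideal_gains(noisy_frame, clean_frame, band_edges, sample_rate=48000):
    """Per-subband training target: the fraction of each band's amplitude that should survive denoising."""
    freqs = np.fft.rfftfreq(len(noisy_frame), d=1.0 / sample_rate)
    noisy = np.abs(np.fft.rfft(noisy_frame)) ** 2
    clean = np.abs(np.fft.rfft(clean_frame)) ** 2

    gains = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        ratio = np.sqrt(clean[mask].sum() / (noisy[mask].sum() + 1e-10))
        gains.append(min(1.0, ratio))          # clamp so the target never amplifies a band
    return np.array(gains)
```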
In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus 1000 further includes:
the acquisition module is used for acquiring the audio data to be processed through the audio acquisition device;
the identification module is used for performing recognition processing on the denoised audio data to obtain an audio recognition result;
And the switching module is used for controlling the audio acquisition device to transmit audio data if the audio recognition result indicates that the audio data to be processed is human voice, and otherwise controlling the audio acquisition device to stop transmitting audio data.
It should be noted that, the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept, and a specific manner in which each module performs an operation has been described in detail in the method embodiment, which is not described herein again.
Fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1109, and/or installed from the removable medium 1111. When the computer program is executed by the Central Processing Unit (CPU) 1101, the various functions defined in the system of the present application are performed.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), a flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A method of audio processing, comprising:
acquiring original audio characteristics corresponding to audio data to be processed;
invoking a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
invoking a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
According to the second audio characteristics and the original audio characteristics, a full-connection network model is called to obtain a gain result corresponding to the audio data to be processed;
generating denoising audio data according to the gain result and the audio data to be processed;
The obtaining the original audio features corresponding to the audio data to be processed includes:
dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;
Frequency division is carried out on the frequencies of the first frequency interval and the second frequency interval, and sub-bands of the second frequency interval are subjected to sparse processing, so that a sub-band set is obtained, wherein the number of the sub-bands divided by the first frequency interval is larger than that of the sub-bands divided by the second frequency interval, and the sub-band set comprises audio fragment data corresponding to each sub-band;
calculating the Bark-frequency cepstral coefficient of each sub-band in the sub-band set to obtain a first feature set;
calculating differential coefficients and discrete cosine transform values between sub-bands for at least two sub-bands in the sub-band set to obtain a second feature set;
and determining the original audio features according to the first feature set and the second feature set.
2. The method according to claim 1, wherein the calling a fully connected network model according to the second audio feature and the original audio feature to obtain the gain result corresponding to the audio data to be processed includes:
Invoking a third network model to process the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the number of the features of the third audio features is larger than that of the features of the second audio features;
And calling a full-connection network model according to the third audio characteristics to obtain a gain result corresponding to the audio data to be processed.
3. The method according to claim 1, wherein the generating denoising audio data according to the gain result and the audio data to be processed comprises:
multiplying according to the gain result and the audio data to be processed to obtain an audio gain result;
and performing inverse fast Fourier transform on the audio gain result to obtain denoising audio data.
4. The method of claim 1, wherein the audio processing model includes the first network model, the second network model, and the fully-connected network model, and wherein prior to the obtaining the original audio features corresponding to the audio data to be processed, the method further comprises:
Acquiring training audio characteristics corresponding to audio data to be trained;
Invoking a first network model included in a model to be trained, and processing the training audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
invoking a second network model included in the model to be trained, and processing the training audio feature and the first audio feature to obtain a second audio feature, wherein the dimension of the second audio feature is larger than that of the first audio feature;
According to the second audio characteristics and the training audio characteristics, invoking a fully-connected network model included in the to-be-trained model to obtain a gain result corresponding to the to-be-processed audio data;
And adjusting model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be processed to obtain an audio processing model.
5. The method of claim 1, wherein prior to the obtaining the original audio feature corresponding to the audio data to be processed, the method further comprises:
collecting the audio data to be processed through an audio collecting device;
After generating the denoising audio data according to the gain result and the audio data to be processed, the method further comprises:
Performing recognition processing on the denoising audio data to obtain an audio recognition result;
And if the audio recognition result indicates that the audio data to be processed is human voice, controlling the audio acquisition device to transmit the audio data, otherwise, controlling the audio acquisition device to stop transmitting the audio data.
6. An audio processing apparatus, comprising:
the acquisition module is used for acquiring original audio characteristics corresponding to the audio data to be processed;
The calling module is used for calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;
The calling module is further used for calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;
The calling module is further used for calling a full-connection network model to obtain a gain result corresponding to the audio data to be processed according to the second audio feature and the original audio feature;
the generating module is used for generating denoising audio data according to the gain result and the audio data to be processed;
Wherein, the acquisition module includes:
The interval dividing unit is used for dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;
A subband dividing unit, configured to perform frequency division on frequencies of the first frequency interval and the second frequency interval, and perform thinning processing on subbands of the second frequency interval, so as to obtain a subband set, where the number of subbands divided by the first frequency interval is greater than the number of subbands divided by the second frequency interval, and the subband set includes audio segment data corresponding to each subband;
A first calculating subunit, configured to calculate Bark-frequency cepstral coefficients of each subband in the subband set, to obtain a first feature set;
A second calculating subunit, configured to calculate, for at least two subbands in the subband set, a difference coefficient and a discrete cosine transform value between the subbands, to obtain a second feature set;
And the characteristic determining subunit is used for determining the original audio characteristic according to the first characteristic set and the second characteristic set.
7. An electronic device, comprising:
A processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of audio processing of any of claims 1 to 5 via execution of the executable instructions.
8. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of audio processing according to any of claims 1 to 5.
9. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the method of audio processing according to any one of claims 1 to 5.
CN202110118489.2A 2021-01-28 Audio processing method, device, electronic equipment and storage medium Active CN113571075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118489.2A CN113571075B (en) 2021-01-28 Audio processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118489.2A CN113571075B (en) 2021-01-28 Audio processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113571075A CN113571075A (en) 2021-10-29
CN113571075B true CN113571075B (en) 2024-07-09


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554309A (en) * 2020-05-15 2020-08-18 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Karthik et al. Efficient speech enhancement using recurrent convolution encoder and decoder
EP4394761A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN112102846B (en) Audio processing method and device, electronic equipment and storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN113096682B (en) Real-time voice noise reduction method and device based on mask time domain decoder
CN113571078B (en) Noise suppression method, device, medium and electronic equipment
US20230097520A1 (en) Speech enhancement method and apparatus, device, and storage medium
CN111710344A (en) Signal processing method, device, equipment and computer readable storage medium
WO2023216760A1 (en) Speech processing method and apparatus, and storage medium, computer device and program product
CN111722696B (en) Voice data processing method and device for low-power-consumption equipment
US11996114B2 (en) End-to-end time-domain multitask learning for ML-based speech enhancement
CN113571082B (en) Voice call control method and device, computer readable medium and electronic equipment
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN113823313A (en) Voice processing method, device, equipment and storage medium
CN116612778B (en) Echo and noise suppression method, related device and medium
CN114373473A (en) Simultaneous noise reduction and dereverberation through low-delay deep learning
CN113571075B (en) Audio processing method, device, electronic equipment and storage medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium
CN111667842B (en) Audio signal processing method and device
CN115083440A (en) Audio signal noise reduction method, electronic device, and storage medium
CN113571075A (en) Audio processing method and device, electronic equipment and storage medium
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
CN114333892A (en) Voice processing method and device, electronic equipment and readable medium
CN113707149A (en) Audio processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant