CN116013332A - Audio processing method and device - Google Patents

Audio processing method and device

Info

Publication number
CN116013332A
CN116013332A (application CN202211716020.XA)
Authority
CN
China
Prior art keywords
audio
information
layer
clustering
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211716020.XA
Other languages
Chinese (zh)
Inventor
(Request not to publish the inventor's name)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202211716020.XA priority Critical patent/CN116013332A/en
Publication of CN116013332A publication Critical patent/CN116013332A/en
Pending legal-status Critical Current

Landscapes

  • Stereophonic System (AREA)

Abstract

The application provides an audio processing method, which comprises the following steps: acquiring audio coding information of an audio track; clustering the audio coding information of the audio track to obtain tone color clustering information of the audio track; acquiring mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track; and, in the case that there are a plurality of audio tracks, acquiring mixed audio according to the mixing parameters of the plurality of audio tracks. The application also provides an audio processing apparatus, a computer device and a computer-readable storage medium. The technical scheme provided by the application can improve the audio processing effect and the user experience.

Description

Audio processing method and device
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to an audio processing method, an audio processing device, a computer device, and a computer readable storage medium.
Background
With the development of computer technology, electronic products such as mobile phones, computers and tablets have become daily necessities. Users' processing requirements for media information such as audio and video are increasingly diversified; for example, specific audio and video are produced by means of stereo mixing.
Stereo mixing refers to the integration of sound from multiple sources into a stereo audio track. In manual mixing, a mixing engineer can independently adjust the frequency, dynamics, tone quality, positioning, reverberation and sound field of each original signal to optimize each audio track, and then superimpose the audio tracks into a final product with distinct layers. In recent years, with the development of the music industry, many digital audio workstations and digital music platforms have emerged, and the demand for music production keeps increasing. Accordingly, mixing is increasingly performed by automatic audio mixing techniques. However, with existing automatic audio mixing techniques, the audio processing effect is poor and so is the user experience.
It should be noted that the foregoing is not necessarily prior art, and is not intended to limit the scope of the patent protection of the present application.
Disclosure of Invention
An object of an embodiment of the present application is to provide an audio processing method, an apparatus, a computer device, and a computer readable storage medium, for solving the above-mentioned problems.
An aspect of an embodiment of the present application provides an audio processing method, including:
acquiring audio coding information of an audio track;
Clustering the audio coding information of the audio track to obtain tone clustering information of the audio track;
acquiring mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track;
and, in the case that there are a plurality of audio tracks, acquiring mixed audio according to the mixing parameters of the plurality of audio tracks.
Optionally, the audio coding information comprises a multi-dimensional feature vector of a frequency domain;
the clustering the audio coding information of the audio track to obtain timbre clustering information of the audio track includes:
inputting the multidimensional feature vector of the audio track into a corresponding clustering model to obtain tone color clustering information of the audio track; each audio track corresponds to one clustering model, and the clustering model is used for tone color matching.
Optionally, the obtaining of the mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track includes:
acquiring the mixing parameters of the audio track according to the audio coding information, the tone color clustering information and the style information respectively corresponding to the audio track.
Optionally, the obtaining the mixing parameters of the audio track according to the audio coding information, the timbre clustering information and the style information respectively corresponding to the audio track includes:
Inputting the audio coding information, the tone color clustering information and the style information into a parameter generation network in parallel to obtain the mixing parameters;
the parameter generation network is used for converting the audio coding information, the tone color clustering information and the style information into information with preset dimensions.
Optionally, the parameter generation network comprises a feature fusion layer, a multi-layer perceptron and an output layer; correspondingly, the parallel input of the audio coding information, the tone color clustering information and the style information into the parameter generation network to obtain the mixing parameters includes:
inputting the audio coding information, the tone color clustering information and the style information into the feature fusion layer in parallel to obtain fusion features;
inputting the fusion features into the multi-layer perceptron to output high-dimensional features with preset dimensions through the multi-layer perceptron; and
inputting the high-dimensional features into an activation function layer to output mixing parameters with preset dimensions through the activation function layer.
Optionally, in the case that there are a plurality of audio tracks, the acquiring of mixed audio according to the mixing parameters of the plurality of audio tracks includes:
inputting the audio data of the audio track and corresponding mixing parameters into a pre-trained mixing neural network model to acquire target audio information of the audio track; a plurality of the audio tracks correspond to a plurality of groups of the target audio information;
and carrying out an audio mixing operation on the plurality of groups of target audio information corresponding to the plurality of audio tracks to obtain the mixed audio.
Optionally, the mixing neural network model includes a deep network model;
the deep network model comprises, in sequence:
a batch normalization layer for receiving and processing audio data;
a plurality of coding layers connected in series, each coding layer being used for processing the output of the previous layer and the mixing parameters respectively;
a plurality of intermediate convolution layers in series, each intermediate convolution layer being respectively used for processing the output of the previous layer;
a plurality of decoding layers connected in series, each decoding layer being used for processing the output of the previous layer and the output of the corresponding coding layer respectively;
an intermediate convolution layer for processing the output of the previous layer;
a convolution layer for processing the output of the intermediate convolution layer and outputting the target audio information of the audio track via an output layer;
the plurality of decoding layers are connected in one-to-one correspondence with the plurality of coding layers, in the following manner: the coding layer and the decoding layer closest to each other in the hierarchy are connected in correspondence according to the hierarchical order; among the remaining coding layers and decoding layers, the closest coding layer and decoding layer are again connected in correspondence according to the hierarchical order, and this manner of connection continues until the connections are completed.
Another aspect of an embodiment of the present application provides an audio processing apparatus, including:
the first acquisition module is used for acquiring audio coding information of the audio track;
the clustering module is used for clustering the audio coding information of the audio track to obtain tone color clustering information of the audio track;
the second acquisition module is used for acquiring mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track;
and the third acquisition module is used for acquiring mixed audio according to the mixing parameters of the plurality of audio tracks when there are a plurality of audio tracks.
Another aspect of the embodiments of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above-mentioned audio processing method when executing the computer program.
Another aspect of the embodiments of the present application provides a computer-readable storage medium having a computer program stored therein, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the audio processing method described above.
The audio processing method, the audio processing device, the computer device and the computer-readable storage medium provided by the embodiments of the application acquire the tone color clustering information of the audio track through clustering. Because the frequency response is determined by the different timbre types, the mixing parameters can be adjusted according to the tone color clustering information of the audio track, and the required single-track audio is then obtained. When one piece of audio corresponds to a plurality of audio tracks, each audio track obtains its own mixing parameters according to its own tone color clustering information, and mixed audio adaptively adjusted based on the timbre of each track is then obtained. It can be seen that the technical solution provided by the embodiments obtains the tone color clustering information of each audio track by clustering the audio coding information of each audio track and uses it for mixing, thereby improving the audio processing effect and the user experience.
Drawings
Fig. 1 schematically illustrates an operation environment diagram of an audio processing method according to a first embodiment of the present application;
fig. 2 schematically shows a flow chart of an audio processing method according to an embodiment of the present application;
FIG. 3 schematically illustrates an exemplary architecture of a parameter generation network;
FIG. 4 schematically illustrates a processing link of an application example;
fig. 5 schematically shows a block diagram of an audio processing device according to a second embodiment of the present application; and
Fig. 6 schematically shows a hardware architecture diagram of a computer device according to a third embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be noted that the descriptions of "first," "second," etc. in the embodiments of the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combinations can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, that combination should be regarded as not existing and as outside the protection scope of the present application.
In the description of the present application, it should be understood that the numerical references before the steps do not indicate the order in which the steps are performed, but are only used for convenience of description and to distinguish the steps, and thus should not be construed as limiting the present application.
The term interpretation referred to in this application:
stereo mixing, which is an important step in music production, integrates sound from multiple sources into a stereo audio track. These mixed sound signals may originate from different instruments, voices or strings, respectively.
CQT (constant Q transform) refers to a filter bank in which the center frequencies are exponentially distributed, the filter bandwidths are different, but the ratio of the center frequencies to the bandwidths is a constant Q.
Mel spectrum: a spectrum whose frequency axis has been converted to the Mel scale. The Mel scale is a nonlinear scale unit, defined based on frequency, that represents the human ear's perception of pitch change.
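For illustration only (the application names no toolkit), the following sketch computes the two representations above with the librosa library; the file name and the parameter values are placeholder assumptions.

```python
# A minimal sketch of computing Mel and CQT spectrograms with librosa.
# "track.wav", the sample rate and the bin counts are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("track.wav", sr=22050, mono=True)

# Mel spectrum: frequency axis warped onto the Mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# CQT: filter bank with exponentially spaced center frequencies and constant Q
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)

print(mel_db.shape, cqt_db.shape)  # (n_mels, frames), (n_bins, frames)
```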
For the convenience of understanding the technical solutions provided in the embodiments of the present application, the following related technologies are described below:
According to research, intelligent mixing schemes generally fall into two kinds. One is rule-based: frequency response characteristics, timbre characteristics, loudness information, dynamics information, reverberation time and the like of a target reference audio are calculated as the target reference values for mixing, and measures such as EQ (equalization), dynamic compression and added reverberation are applied to each audio track to approach the target reference values and obtain the target mixing result. However, this method is limited by its fixed targets: different pieces of music require different frequency responses and corresponding musical expression, so the same target values are not universally applicable. The other is a data-driven mixing method, which learns some nonlinear audio processing, such as that of a mastering engineer, from a large amount of data. For example, although an end-to-end mixing method for multi-track audio can improve the mixing quality, it is not controllable because the training data are of mixed quality, and it cannot make all types of music sound good. In addition, with the development of electronic music, more and more timbres of non-real musical instruments appear in musical compositions, so no data set can cover all timbres; such schemes therefore cannot obtain a good mixing effect for timbres outside the data set.
For this reason, the embodiments of the application provide an intelligent audio mixing scheme. In this scheme, multi-track music can be adaptively mixed according to the style of the input music and the timbre characteristics of each track of audio. See below for details.
Fig. 1 schematically shows an environmental operation diagram of an audio processing method according to an embodiment of the present application.
As shown in fig. 1, the environment schematic includes a server 2, a network 4, and an electronic device 6, where:
server 2, as a digital streaming service platform, may be comprised of a single or multiple computing devices. The single or multiple computing devices may include virtualized computing instances. Virtualized computing instances may include virtual machines such as emulation of computer systems, operating systems, servers, and the like. The computing device may load the virtual machine based on a virtual image and/or other data defining particular software (e.g., operating system, dedicated application, server) for emulation. As the demand for different types of processing services changes, different virtual machines may be loaded and/or terminated on one or more computing devices. A hypervisor may be implemented to manage the use of different virtual machines on the same computing device.
The server 2 may provide digital streaming services, for example, providing audio, video data, etc., to electronic devices. The server 2 may be configured to communicate with the electronic device 6 or the like via the network 4. Electronic device 6 may be any type of computing device such as a mobile device, tablet device, laptop computer, virtual reality device, gaming device, set top box, vehicle terminal, smart television, headset, among others. In some embodiments, a virtual terminal is also possible.
The electronic device 6 can provide various functions such as recording, downloading, uploading and processing of audio and video. Specifically, the electronic device 6 may be configured with an audio program. The audio program outputs (plays) audio content to the user.
The following describes the audio processing scheme by way of various embodiments, with the electronic device 6 as the execution subject. Where the server 2 performs the processing instead, the mixed audio generated by the server 2 may be returned to the electronic device 6.
Example 1
Fig. 2 schematically shows a flow chart of an audio processing method according to a first embodiment of the present application.
As shown in fig. 2, the audio processing method may include steps S200 to S206, in which:
step S200, obtaining audio coding information of the audio track.
Step S202, clustering the audio coding information of the audio track to obtain tone color clustering information of the audio track.
Step S204, the mixing parameters of the audio track are acquired according to the audio coding information and the tone color clustering information respectively corresponding to the audio track.
Step S206, in the case that there are a plurality of audio tracks, mixed audio is acquired according to the mixing parameters of the plurality of audio tracks.
One track corresponds to one part of the audio; MIDI (Musical Instrument Digital Interface) or audio data can be recorded at specific time positions. Each track may be defined as the performance of one musical instrument. An audio editor may allow multi-track operation, i.e. a song may correspond to multiple tracks at the same time. In the audio editor, the audio data of each audio track can be mixed, the volume of the audio track can be increased or decreased, the phase can be changed, and reverberation or delay effects can be added. Of course, other operations may also be performed.
According to the audio processing method provided by this embodiment, the tone color clustering information of the audio track is obtained through clustering. Because the frequency response is determined by the different timbre types, the mixing parameters can be adjusted according to the tone color clustering information of the audio track, and the required single-track audio is then obtained. When one piece of audio corresponds to a plurality of audio tracks, each audio track obtains its own mixing parameters according to its own tone color clustering information, and mixed audio adaptively adjusted based on the timbre of each track is then obtained. It can be seen that the technical solution provided by this embodiment obtains the tone color clustering information of each audio track by clustering the audio coding information of each audio track and uses it for mixing, thereby improving the audio processing effect and the user experience.
Each of steps S200 to S206, together with additional steps, will be described in detail below with reference to fig. 2.
First, audio mixing generally takes into account aspects such as frequency response characteristics, timbre characteristics, loudness information, dynamics information and reverberation time. Based on this information, the audio data is correspondingly processed to obtain the target audio.
Different music corresponds to different styles and timbres, and these often largely determine the final effect of the mix.
As for music style, the main types are: Pop, Rock, Folk (ballad), Electronic, Jazz, Absolute Music (pure music), Rap, Metal, World Music, New Age, Classical, Indie (independent) and Ambient (atmosphere music). Each style of music has its unique timbre features and arrangement.
For popular music, the arrangement typically includes drums, bass, guitar and keyboards. The drums provide the rhythmic foundation, the bass provides the low-end foundation, the guitar serves as the main instrument or main body, and the keyboards provide various synthesized timbres to enrich the layering. The characteristic frequency band is around 650 Hz, the overall frequency distribution is wide, the dynamic range is relatively small, and the loudness is generally about -10 dB to -12 dB.
For jazz music, the arrangement includes piano, bass, drum set, saxophone, trumpet, trombone, clarinet, vibraphone and guitar. For mixing, the percussion instruments with a characteristic frequency band around 200 Hz and the wind instruments with a characteristic frequency band around 2000 Hz make the whole frequency range full and rich. The dynamic range is large, and the emotion is rich and changeable. The overall loudness is typically around -14 dB.
For rock music, the arrangement typically includes electro-acoustic instruments such as electric guitar, distorted guitar, electric bass, drum set and electronic synthesizer/organ. The characteristic frequency bands are the bass at 100 Hz, the low-frequency body at 150 Hz and the noisy tone of the distorted guitar around 2000 Hz; the whole frequency range is full and rich, with more mid-low frequency content. The dynamic range is small. The overall loudness is relatively large, generally around -5 dB.
For electronic music, the arrangement typically includes a wide variety of synthesizer timbres, such as lead tones, supersaw, 808 drum machine/bass, Reese bass, pads, FX transition effects and various sample slices. The characteristic frequency bands are the kick drum around 80 Hz at the low end, the fundamental of the human voice around 500 Hz, and the high-frequency attack of the drum hits; the whole frequency range is full and rich, with a more elastic low end. The dynamic range is small. The overall loudness is typically around -6 dB to -9 dB.
Orchestral music generally includes string instruments, wind instruments and various percussion instruments; the arrangement includes violin, piccolo/flute, clarinet/oboe, bassoon, trumpet/tuba/trombone/French horn, timpani, triangle, harp and piano. The characteristic frequency bands of the different sections differ considerably; the overall characteristic frequency band of the string section is around 3000 Hz, and the overall frequency content is low to mid. The dynamic range is very large, and the emotion is rich and changeable. The loudness is around -24 dB in quiet passages and around -13 dB in loud passages.
It can be seen that the desired frequency response differs between music styles, and the overall frequency response is determined by the different timbre types. The clustering of timbre types and the adaptive mixing according to the role of different timbres in the music are described below.
Step S200: audio coding information of the audio track is acquired.
The tracks may carry specific audio data, such as drums, bass, guitar or vocals.
The audio coding information of the audio track is obtained by encoding the audio data.
In an exemplary embodiment, the audio data may be encoded using a feature extractor (e.g., a VGGish model) to obtain a 128-dimensional feature vector in the frequency domain, as well as other audio features such as the Mel spectrum and the CQT spectrum.
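As an illustrative sketch (not part of the application), frame-level 128-dimensional embeddings can be extracted with a publicly available torch.hub port of VGGish; the repository name and the file name below are assumptions.

```python
# A hedged sketch of VGGish feature extraction via torch.hub.
import torch

vggish = torch.hub.load('harritaylor/torchvggish', 'vggish')
vggish.eval()

with torch.no_grad():
    # One 128-dim embedding per ~0.96 s frame -> shape (num_frames, 128)
    embeddings = vggish.forward('drum_track.wav')  # placeholder file name
print(embeddings.shape)
```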
Step S202: the audio coding information of the audio track is clustered to obtain tone color clustering information of the audio track.
Clustering is performed according to the audio coding information to obtain the timbre type (tone color clustering information) of the audio track.
The obtained tone color cluster information can be used for adjusting the audio track data in the subsequent step to obtain the adaptive audio parameters.
It should be noted that, with the development of electronic music, more and more timbres of non-real musical instruments appear in musical compositions, so it is difficult for any data set to cover all timbres; existing schemes therefore cannot obtain a good mixing effect for timbres outside the data set. In this embodiment, the timbre type is analyzed based on clustering, so the subsequent mixing effect can be improved.
In some embodiments, the audio coding information comprises a multi-dimensional feature vector of the frequency domain.
In order to improve the accuracy of the tone color clustering and thus improve the subsequent mixing effect, the "clustering the audio coding information of the audio track to obtain tone color clustering information of the audio track" in step S202 may include: inputting the multidimensional feature vector of the audio track into a corresponding clustering model to obtain the tone color clustering information of the audio track; each audio track corresponds to one clustering model, and the clustering model is used for tone color matching.
When there are a plurality of audio tracks, each audio track corresponds to one clustering model, so the audio coding information of the audio tracks can be analyzed in parallel, which improves clustering efficiency. In an exemplary application, the clustering model may be a Gaussian mixture model (GMM) that constructs timbre matches from the encoded audio features. In an exemplary application, the number of cluster centers in the clustering model may be set to 8, i.e., 8 cluster centers are obtained.
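A minimal sketch of this per-track clustering step, assuming scikit-learn's GaussianMixture stands in for the GMM (the application does not name an implementation):

```python
# Per-track timbre clustering with an 8-center Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

# embeddings: (num_frames, 128) frame-level vectors for one audio track
embeddings = np.random.randn(500, 128)  # placeholder data

gmm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
gmm.fit(embeddings)

# Tone color clustering information: the per-frame posterior over the
# 8 cluster centers, or the hard cluster assignment per frame.
posteriors = gmm.predict_proba(embeddings)  # (num_frames, 8)
timbre_type = gmm.predict(embeddings)       # (num_frames,)
```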
Step S204: the mixing parameters of the audio track are acquired according to the audio coding information and the tone color clustering information respectively corresponding to the audio track.
By adding the tone color cluster information of the audio track, a mixing parameter adapted to the tone color cluster information can be generated. Based on the tone color type in the tone color cluster information, adaptive mixing is performed according to the role of the tone color type in the musical sound.
In some embodiments, as described above, the desired sound frequency response is different for different music styles, and the overall frequency response is determined by different tone types. In order to highlight the style characteristics of music, the mixing parameters can be adjusted according to different styles, and further the mixing result of the corresponding style is obtained. For this purpose, the step S204 "obtaining the mixing parameters of the audio track according to the audio coding information and the audio clustering information corresponding to the audio track respectively" may include: and acquiring the mixing parameters of the audio track according to the audio coding information, the tone color clustering information and the style information respectively corresponding to the audio track.
It should be noted that the style information may be set manually; one or more instrument types used in the playing process can be identified according to the music data, and then corresponding style information is automatically determined based on the one or more instrument types or instrument type combinations; of course, other ways of determining style information may be used.
In some embodiments, the step of obtaining the mixing parameters of the audio track according to the audio coding information, the timbre cluster information and the style information corresponding to the audio track respectively may include: inputting the audio coding information, the tone color clustering information and the style information into a parameter generation network in parallel to obtain the mixing parameters; the parameter generation network is used for converting the audio coding information, the tone color clustering information and the style information into information with preset dimensions. When a plurality of audio tracks exist, each audio track corresponds to a parameter generation network, so that the audio mixing parameters of the audio tracks can be generated in parallel, and the generation efficiency of the audio mixing parameters is improved. In this embodiment, three types of information including audio coding information, tone color cluster information and style information can be fused through the parameter generation network to generate a frame-level mixing parameter. The frame-level mixing parameters may be used for the generation of subsequent target audio data. The parameter generation network may be a deep learning network or the like.
In some embodiments, the parameter generation network includes a feature fusion layer, a multi-layer perceptron, and an output layer; correspondingly, the step of inputting the audio coding information, the timbre cluster information and the style information in parallel into a parameter generation network to obtain the mixing parameters may include: inputting the audio coding information, the tone color clustering information and the style information into a feature fusion layer in parallel to obtain fusion features; inputting the fusion characteristics into the multi-layer perceptron to output high-dimensional characteristics with preset dimensions through the multi-layer perceptron; and inputting the high-dimensional features into an activation function layer to output the mixing parameters with preset dimensions through the activation function layer. Based on the parameter generation network, the style information, the tone color cluster information and the characteristic information of the music can be fused into the mixing parameters better, so that the method is better used in the mixing operation of the subsequent steps.
As shown in fig. 3, the feature fusion layer may be a concatenate layer, which combines features and fuses the features extracted by multiple convolutional feature extraction branches. In this embodiment, the concatenate layer fuses the audio coding information, the tone color clustering information and the style information. The multi-layer perceptron (MLP) includes three hidden layers for outputting prediction data. The output layer may adopt a PReLU layer for finally outputting the mixing parameters, which carry the style information, the tone color clustering information and the characteristic information of the music.
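A hedged PyTorch sketch of this parameter generation network follows: a concatenation fusion layer, an MLP with three hidden layers, and a PReLU output stage. The layer widths and dimensions are assumptions; the application fixes only the overall structure.

```python
# Sketch of the parameter generation network (fusion -> 3-layer MLP -> PReLU).
import torch
import torch.nn as nn

class ParamGenNet(nn.Module):
    def __init__(self, enc_dim=128, timbre_dim=8, style_dim=16, param_dim=32):
        super().__init__()
        fused = enc_dim + timbre_dim + style_dim   # feature fusion by concatenation
        self.mlp = nn.Sequential(                  # three hidden layers
            nn.Linear(fused, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.out = nn.Sequential(nn.Linear(128, param_dim), nn.PReLU())

    def forward(self, enc, timbre, style):
        x = torch.cat([enc, timbre, style], dim=-1)  # concatenate layer
        return self.out(self.mlp(x))                 # frame-level mixing parameters

net = ParamGenNet()
params = net(torch.randn(100, 128), torch.randn(100, 8), torch.randn(100, 16))
print(params.shape)  # (100, 32)
```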
Step S206: in the case that there are a plurality of audio tracks, mixed audio is acquired according to the mixing parameters of the plurality of audio tracks.
In this embodiment, the audio data of a plurality of audio tracks are integrated into one stereo audio track, wherein in the integration process, an audio mixing result of the audio data of a plurality of audio tracks is determined according to a mixing parameter of each audio track. According to the embodiment, the mixing parameters of each audio track are adjusted according to the tone type of each audio track, so that the effects of different tone colors in music sounds are adaptively mixed, and the effect of audio mixing is improved.
In some embodiments, the step S206 "in the case that the number of audio tracks is plural, acquiring the mixed audio according to the mixing parameters of the plurality of audio tracks" may include the steps of: inputting the audio data of the audio track and corresponding mixing parameters into a pre-trained mixing neural network model to acquire target audio information of the audio track; a plurality of the audio tracks correspond to a plurality of groups of the target audio information; and carrying out audio mixing operation on a plurality of groups of target audio information corresponding to the audio tracks so as to obtain the mixed audio.
In this embodiment, by configuring a mixing neural network model for each audio track, end-to-end multi-track music mixing can be realized, and the problems that the data required for single-track mixing are complex and the overall effect is difficult to control are avoided. In addition, using the mixing neural network in place of the equalization, compression, delay, distortion and reverberation modules of conventional mixing yields the end-to-end mixed result directly, eliminating the complex operation of multiple mixing steps.
It should be noted that the above-mentioned mixing neural network model may be any of various models that have been trained.
In some embodiments, to enhance the mixing effect, the mixing neural network model includes a deep network model;
the deep network model comprises, in sequence:
a batch normalization layer for receiving and processing audio data;
the plurality of coding layers are connected in series, and each coding layer is respectively used for processing the output of the last layer and the mixing parameters;
a plurality of intermediate convolution layers in series, each intermediate convolution layer being respectively used for processing the output of the previous layer;
a plurality of decoding layers connected in series, each decoding layer being used for processing the output of the previous layer and the output of the corresponding coding layer respectively;
An intermediate convolution layer for processing the output of the previous layer;
a convolution layer for processing the output of the intermediate convolution layer and outputting the target audio information of the audio track via an output layer;
the plurality of decoding layers are connected in one-to-one correspondence with the plurality of coding layers, in the following manner: the coding layer and the decoding layer closest to each other in the hierarchy are connected in correspondence according to the hierarchical order; among the remaining coding layers and decoding layers, the closest coding layer and decoding layer are again connected in correspondence according to the hierarchical order, and this manner of connection continues until the connections are completed.
For example, the deep network model may include, logically in series, one BN layer, six REB layers, four ICB layers, six RDB layers, one ICB layer and one convolutional layer. The six REB layers are skip-connected to the six RDB layers.
REB: residual encoder block.
RDB: residual decoder block.
RCB: residual convolutional block.
ICB: intermediate convolutional block.
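A simplified PyTorch sketch of this topology (one BN layer, six REB layers, four intermediate blocks, six RDB layers with skip connections, one further intermediate block, one output convolution) is given below. The channel counts, kernel sizes and the FiLM-style injection of the mixing parameters into each REB layer are assumptions; the application fixes only the block ordering and the skip connections.

```python
# Simplified sketch of the deep network model described above.
import torch
import torch.nn as nn

class REB(nn.Module):  # residual encoder block, conditioned on mixing parameters
    def __init__(self, ch, param_dim):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, 3, padding=1)
        self.film = nn.Linear(param_dim, ch)  # inject the mixing parameters
    def forward(self, x, p):
        return x + torch.relu(self.conv(x)) + self.film(p).unsqueeze(-1)

class Block(nn.Module):  # shared shape for the ICB / RDB blocks in this sketch
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, 3, padding=1)
    def forward(self, x):
        return x + torch.relu(self.conv(x))

class MixNet(nn.Module):
    def __init__(self, ch=32, param_dim=32):
        super().__init__()
        self.bn = nn.BatchNorm1d(1)          # batch normalization layer
        self.inp = nn.Conv1d(1, ch, 1)
        self.enc = nn.ModuleList([REB(ch, param_dim) for _ in range(6)])
        self.mid = nn.ModuleList([Block(ch) for _ in range(4)])
        self.dec = nn.ModuleList([Block(ch) for _ in range(6)])
        self.icb = Block(ch)
        self.out = nn.Conv1d(ch, 1, 1)       # output convolution
    def forward(self, audio, params):
        x = self.inp(self.bn(audio))
        skips = []
        for e in self.enc:                   # each REB sees the mixing parameters
            x = e(x, params)
            skips.append(x)
        for m in self.mid:
            x = m(x)
        for d, s in zip(self.dec, reversed(skips)):  # nearest-level skip pairing
            x = d(x + s)
        return self.out(self.icb(x))

net = MixNet()
y = net(torch.randn(2, 1, 44100), torch.randn(2, 32))  # (batch, 1, samples)
```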
In this scheme, the mixing parameters of the corresponding audio track and the un-mixed audio are the inputs of the mixing neural network model, and the mixing parameters of the corresponding audio track are added to each REB layer, which can effectively improve the subsequent mixing effect.
The data input to each model is data obtained by mapping.
It should also be noted that the training process of the deep network model may be as follows: the mixing parameters are added to each REB layer of the deep network model, and the resulting predicted mixing result is output. The predicted mixing result is compared with the manually mixed result to obtain a loss value (Loss); the manually mixed audio thus serves as the training target. The loss function may be a multi-resolution short-time Fourier transform loss (STFT loss), consisting of a spectral convergence (SC) part and a spectral magnitude (SM) part.
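A hedged sketch of such a multi-resolution STFT loss is shown below; the set of FFT resolutions is an assumption, since the application names only the SC and SM terms.

```python
# Multi-resolution STFT loss: spectral convergence + log STFT magnitude.
import torch

def stft_mag(x, n_fft, hop):
    w = torch.hann_window(n_fft, device=x.device)
    s = torch.stft(x, n_fft, hop_length=hop, window=w, return_complex=True)
    return s.abs().clamp(min=1e-7)

def mrstft_loss(pred, target, resolutions=((512, 128), (1024, 256), (2048, 512))):
    loss = 0.0
    for n_fft, hop in resolutions:
        P, T = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        sc = torch.norm(T - P, p='fro') / torch.norm(T, p='fro')      # SC term
        sm = torch.nn.functional.l1_loss(torch.log(P), torch.log(T))  # SM term
        loss = loss + sc + sm
    return loss / len(resolutions)

loss = mrstft_loss(torch.randn(2, 44100), torch.randn(2, 44100))
```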
To make this application easier to understand, an example of application is provided below in connection with fig. 4.
In this application example, N tracks are provided, N being a natural number.
Corresponding N processing links are formed based on the N tracks. Taking the processing link formed by the first track as an example, it comprises in order:
(1) The coding module, which may adopt a VGGish pre-trained model or the like;
(2) The tone color clustering module, which may adopt a timbre-matching GMM or the like;
(3) The parameter generation module, which may adopt a parameter generation network comprising a multi-layer perceptron; and
(4) The mixing neural network module, which may adopt a Deep ResUNet model or the like.
The first audio track encodes the received audio data through the coding module to obtain a feature vector; the feature vector is input into the tone color clustering module to obtain the tone color clustering information (timbre type); the tone color clustering information, the feature vector output by the coding module and the style information are input to the parameter generation module to obtain the mixing parameters. Then, the mixing parameters and the audio data are input into the mixing neural network module to obtain the target audio data.
The second to N-th tracks can each use a processing link like that of the first track to obtain their respective target audio data.
The target audio data of the first through N-th audio tracks are then mixed to obtain the mixed audio.
In the case of only one track, the target audio data obtained for that track is the audio data provided to the speaker.
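The end-to-end wiring of these N processing links can be summarized in the hedged sketch below; the callables stand in for the modules sketched earlier and are assumptions rather than the application's concrete implementation.

```python
# Sketch of the fig. 4 pipeline: per-track encode -> cluster -> generate
# parameters -> mixing network, then sum the per-track outputs into the mix.
import torch

def mix_tracks(tracks, style, encoder, cluster, param_net, mix_net):
    """tracks: list of (1, 1, samples) tensors; style: (1, style_dim) tensor."""
    outputs = []
    for audio in tracks:
        enc = encoder(audio)                    # audio coding information
        timbre = cluster(enc)                   # tone color clustering information
        params = param_net(enc, timbre, style)  # per-track mixing parameters
        outputs.append(mix_net(audio, params))  # target audio for this track
    return torch.stack(outputs).sum(dim=0)      # mixed audio
```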
Example two
Fig. 5 schematically shows a block diagram of an audio processing device according to a second embodiment of the present application, which may be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the embodiments of the present application. Program modules in the embodiments of the present application refer to a series of computer program instruction segments capable of implementing specific functions, and the following description specifically describes the functions of each program module in the embodiments. As shown in fig. 5, the audio processing apparatus 600 may include a first acquisition module 610, a clustering module 620, a second acquisition module 630, and a third acquisition module 640, wherein:
A first obtaining module 610, configured to obtain audio coding information of an audio track;
a clustering module 620, configured to cluster the audio coding information of the audio track to obtain timbre cluster information of the audio track;
a second obtaining module 630, configured to obtain mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track;
and a third obtaining module 640, configured to obtain, when the number of audio tracks is plural, mixed audio according to mixing parameters of the audio tracks.
In some embodiments, the audio coding information comprises a multi-dimensional feature vector of the frequency domain;
the clustering module 620 is further configured to:
inputting the multidimensional feature vector of the audio track into a corresponding clustering model to obtain tone color clustering information of the audio track; each audio track corresponds to one clustering model, and the clustering model is used for tone color matching.
In some embodiments, the second obtaining module 630 is configured to:
and acquiring the mixing parameters of the audio track according to the audio coding information, the tone color clustering information and the style information respectively corresponding to the audio track.
In some embodiments, the second obtaining module 630 is configured to:
Inputting the audio coding information, the tone color clustering information and the style information into a parameter generation network in parallel to obtain the mixing parameters;
the parameter generation network is used for converting the audio coding information, the tone color clustering information and the style information into information with preset dimensions.
In some embodiments, the parameter generation network includes a feature fusion layer, a multi-layer perceptron, and an output layer; correspondingly, the second obtaining module 630 is configured to:
inputting the audio coding information, the tone color clustering information and the style information into a feature fusion layer in parallel to obtain fusion features;
inputting the fusion features into the multi-layer perceptron to output high-dimensional features with preset dimensions through the multi-layer perceptron; and
inputting the high-dimensional features into an activation function layer to output mixing parameters with preset dimensions through the activation function layer.
In some embodiments, the third obtaining module 640 is configured to:
inputting the audio data of the audio track and corresponding mixing parameters into a pre-trained mixing neural network model to acquire target audio information of the audio track; a plurality of the audio tracks correspond to a plurality of groups of the target audio information;
And carrying out audio mixing operation on a plurality of groups of target audio information corresponding to the audio tracks so as to obtain the mixed audio.
In some embodiments, the mixing neural network model comprises a deep network model;
the deep network model comprises, in sequence:
a batch normalization layer for receiving and processing audio data;
the plurality of coding layers are connected in series, and each coding layer is respectively used for processing the output of the last layer and the mixing parameters;
a plurality of intermediate convolution layers in series, each intermediate convolution layer being respectively used for processing the output of the previous layer;
a plurality of decoding layers connected in series, each decoding layer being used for processing the output of the previous layer and the output of the corresponding coding layer respectively;
an intermediate convolution layer for processing the output of the previous layer;
a convolution layer for processing the output of the intermediate convolution layer and outputting the target audio information of the audio track via an output layer;
the plurality of decoding layers are connected in one-to-one correspondence with the plurality of coding layers, in the following manner: the coding layer and the decoding layer closest to each other in the hierarchy are connected in correspondence according to the hierarchical order; among the remaining coding layers and decoding layers, the closest coding layer and decoding layer are again connected in correspondence according to the hierarchical order, and this manner of connection continues until the connections are completed.
Example III
Fig. 6 schematically shows a hardware architecture diagram of a computer device 10000 adapted to implement an audio processing method according to a third embodiment of the present application. The computer device 10000 may be part of the server 2 or of the electronic device 6. In this embodiment, the computer device 10000 is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or pre-stored instructions. For example, it may be a smartphone, tablet, laptop, personal computer, virtual device, set top box, television, projector, vehicle terminal, headset, etc. In other embodiments, the computer device 10000 may also be a rack server, a blade server, a tower server or a cabinet server (either a stand-alone server or a cluster formed by a plurality of servers), or the like. As shown in fig. 6, the computer device 10000 includes at least, but is not limited to: a memory 10010, a processor 10020 and a network interface 10030, which may be communicatively linked to each other via a system bus. Wherein:
memory 10010 includes at least one type of computer-readable storage medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, memory 10010 may be an internal storage module of computer device 10000, such as a hard disk or memory of computer device 10000. In other embodiments, the memory 10010 may also be an external storage device of the computer device 10000, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like. Of course, the memory 10010 may also include both an internal memory module of the computer device 10000 and an external memory device thereof. In this embodiment, the memory 10010 is typically used for storing an operating system installed on the computer device 10000 and various types of application software, such as program codes of an audio processing method. In addition, the memory 10010 may be used to temporarily store various types of data that have been output or are to be output.
The processor 10020 may be a central processing unit (Central Processing Unit, simply CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 10020 is typically configured to control overall operation of the computer device 10000, such as performing control and processing related to data interaction or communication with the computer device 10000. In this embodiment, the processor 10020 is configured to execute program codes or process data stored in the memory 10010.
The network interface 10030 may comprise a wireless network interface or a wired network interface, which network interface 10030 is typically used to establish a communication link between the computer device 10000 and other computer devices. For example, the network interface 10030 is used to connect the computer device 10000 to an external terminal through a network, establish a data transmission channel and a communication link between the computer device 10000 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, abbreviated as GSM), wideband code division multiple access (Wideband Code Division Multiple Access, abbreviated as WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, etc.
It should be noted that fig. 6 only shows a computer device having components 10010-10030, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
In this embodiment, the audio processing method stored in the memory 10010 may be further divided into one or more program modules and executed by one or more processors (the processor 10020 in this embodiment) to complete the embodiments of the present application.
Example IV
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the audio processing method in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), etc. that are provided on the computer device. Of course, the computer-readable storage medium may also include both internal storage units of a computer device and external storage devices. In this embodiment, the computer readable storage medium is typically used to store an operating system and various types of application software installed on a computer device, such as program codes of the audio processing method in the embodiment, and the like. Furthermore, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device; they may be integrated into a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by the computing devices, so that they may be stored in storage devices and executed by the computing devices; in some cases, the steps shown or described may be performed in a different order than shown or described here. Alternatively, they may be separately fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
It should be noted that the above are only preferred embodiments of the present application and are not intended to limit the scope of protection of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of protection of the present application.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring audio coding information of an audio track;
clustering the audio coding information of the audio track to obtain tone clustering information of the audio track;
acquiring mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track;
and, in the case that there are a plurality of audio tracks, acquiring mixed audio according to the mixing parameters of the plurality of audio tracks.
2. The audio processing method according to claim 1, wherein the audio encoding information includes a multi-dimensional feature vector of a frequency domain;
the clustering the audio coding information of the audio track to obtain timbre clustering information of the audio track includes:
inputting the multidimensional feature vector of the audio track into a corresponding clustering model to obtain tone color clustering information of the audio track; each audio track corresponds to one clustering model, and the clustering model is used for tone color matching.
3. The audio processing method according to claim 1, wherein the obtaining of the mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track comprises:
acquiring the mixing parameters of the audio track according to the audio coding information, the tone color clustering information and the style information respectively corresponding to the audio track.
4. The audio processing method according to claim 3, wherein the obtaining the mixing parameters of the audio track according to the audio coding information, the tone color cluster information and the style information respectively corresponding to the audio track includes:
inputting the audio coding information, the tone color clustering information and the style information into a parameter generation network in parallel to obtain the mixing parameters;
the parameter generation network is used for converting the audio coding information, the tone color clustering information and the style information into information with preset dimensions.
5. The audio processing method according to claim 4, wherein the parameter generation network includes a feature fusion layer, a multi-layer perceptron, and an output layer; correspondingly, the parallel input of the audio coding information, the tone color clustering information and the style information into the parameter generation network to obtain the mixing parameters includes:
inputting the audio coding information, the tone color clustering information and the style information into a feature fusion layer in parallel to obtain fusion features;
inputting the fusion features into the multi-layer perceptron to output high-dimensional features with preset dimensions through the multi-layer perceptron; and
inputting the high-dimensional features into an activation function layer to output mixing parameters with preset dimensions through the activation function layer.
6. The audio processing method according to any one of claims 1 to 5, wherein, in the case where the number of audio tracks is plural, acquiring mixed audio from mixing parameters of plural audio tracks, comprises:
inputting the audio data of the audio track and corresponding mixing parameters into a pre-trained mixing neural network model to acquire target audio information of the audio track; a plurality of the audio tracks correspond to a plurality of groups of the target audio information;
and carrying out audio mixing operation on a plurality of groups of target audio information corresponding to the audio tracks so as to obtain the mixed audio.
7. The audio processing method of claim 6, wherein the mixing neural network model comprises a deep network model;
the deep network model comprises, in sequence:
a batch normalization layer for receiving and processing audio data;
the plurality of coding layers are connected in series, and each coding layer is respectively used for processing the output of the last layer and the mixing parameters;
A plurality of intermediate convolution layers in series, each intermediate convolution layer being respectively used for processing the output of the previous layer;
a plurality of decoding layers connected in series, each decoding layer being used for processing the output of the previous layer and the output of the corresponding coding layer respectively;
an intermediate convolution layer for processing the output of the previous layer;
a convolution layer for processing the output of the intermediate convolution layer and outputting the target audio information of the audio track via an output layer;
the plurality of decoding layers are connected in one-to-one correspondence with the plurality of coding layers, in the following manner: the coding layer and the decoding layer closest to each other in the hierarchy are connected in correspondence according to the hierarchical order; among the remaining coding layers and decoding layers, the closest coding layer and decoding layer are again connected in correspondence according to the hierarchical order, and this manner of connection continues until the connections are completed.
8. An audio processing apparatus, the apparatus comprising:
the first acquisition module is used for acquiring audio coding information of the audio track;
the clustering module is used for clustering the audio coding information of the audio track to obtain tone color clustering information of the audio track;
the second acquisition module is used for acquiring mixing parameters of the audio track according to the audio coding information and the tone color clustering information respectively corresponding to the audio track;
and the third acquisition module is used for acquiring mixed audio according to the mixing parameters of the plurality of audio tracks when there are a plurality of audio tracks.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the audio processing method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program executable by at least one processor to cause the at least one processor to perform the steps of the audio processing method according to any one of claims 1 to 7.
CN202211716020.XA 2022-12-29 2022-12-29 Audio processing method and device Pending CN116013332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211716020.XA CN116013332A (en) 2022-12-29 2022-12-29 Audio processing method and device


Publications (1)

Publication Number Publication Date
CN116013332A 2023-04-25



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination