CN117119369A - Audio generation method, computer device, and computer-readable storage medium - Google Patents


Info

Publication number
CN117119369A
CN117119369A (application CN202310986917.2A)
Authority
CN
China
Prior art keywords
sound source
target
original
source object
target sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310986917.2A
Other languages
Chinese (zh)
Inventor
王雨晨
芮元庆
闫震海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202310986917.2A
Publication of CN117119369A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • H04S7/302Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303Tracking of listener position or orientation

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The application relates to an audio generation method, a computer device, and a computer-readable storage medium. The method comprises: extracting original sound source signals corresponding to a plurality of target sound source objects in original audio; performing decorrelation processing on the original sound source signal of each target sound source object to obtain a derivative sound source signal for each target sound source object; for each target sound source object, assigning a sound image position according to its original sound source signal and its derivative sound source signal; and adjusting the gain of each target sound source object based on its sound image position, then outputting target audio corresponding to the original audio, where the number of channels of the target audio is greater than the number of channels of the original audio. With this method, a stereo source can be quickly converted into multi-channel surround sound for output, sound images are positioned accurately, and both audio generation efficiency and the resulting audio effect are effectively improved.

Description

Audio generation method, computer device, and computer-readable storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to an audio generation method, a computer device, and a computer-readable storage medium.
Background
In conventional multi-channel music generation, for example producing multi-channel surround sound from stereo audio, the accompaniment is usually separated with traditional audio processing and simply distributed to the left and right surround channels. Because the degree of separation does not reach the required standard, the correlation between channels is too high, no convincing sense of spatial envelopment is created, and the generated audio effect is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio generation method, a computer device, and a computer-readable storage medium capable of enhancing the audio surround sound effect.
In a first aspect, the present application provides an audio generation method. The method comprises the following steps:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In one embodiment, the obtaining the derivative sound source signal corresponding to each target sound source object by performing decorrelation processing on the original sound source signal corresponding to each target sound source object includes:
performing decorrelation processing on an original sound source signal corresponding to any target sound source object according to a first time delay processing mode to obtain a first time delay result of the original sound source signal corresponding to any target sound source object;
and performing decorrelation processing on the first delay result according to a second delay processing mode to generate a plurality of decorrelation signals aiming at any target sound source object, wherein the plurality of decorrelation signals are used as derivative sound source signals corresponding to any target sound source object.
In one embodiment, the performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first delay processing manner to obtain a first delay result of the original sound source signal corresponding to any target sound source object includes:
and inputting the original sound source signal corresponding to any target sound source object into an all-pass filter, and obtaining an output signal of the original sound source signal corresponding to any target sound source object according to the impulse response information of the all-pass filter, wherein the output signal is used as the first delay result.
In one embodiment, the performing decorrelation processing on the first delay result according to the second delay processing manner, to generate a plurality of decorrelated signals for that target sound source object, includes:
acquiring preset sampling information; the preset sampling information comprises the number of sampling points;
and carrying out sampling time delay processing on the first time delay result according to the number of the sampling points to obtain a plurality of decorrelated signals.
In one embodiment, the distributing, for any target sound source object, the sound image position corresponding to that target sound source object according to the original sound source signal and the derivative sound source signal corresponding to that target sound source object includes:
the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object are respectively placed at different sound channel positions to obtain sound pressure difference information;
and positioning the sound image position corresponding to any target sound source object according to the sound pressure difference information.
In one embodiment, the obtaining the sound pressure difference information by respectively placing the original sound source signal and the derivative sound source signal corresponding to that target sound source object at different channel positions includes:
determining a signal placement mode for any target sound source object; the signal placement mode comprises a first sound channel position and a second sound channel position;
placing the original sound source signal corresponding to any target sound source object to the first channel position, and placing the derivative sound source signal corresponding to any target sound source object to the second channel position;
and determining the sound pressure difference information according to the sound pressure difference of the sound source between the first sound channel position and the second sound channel position.
In one embodiment, the extracting the original sound source signals corresponding to the plurality of target sound source objects in the original audio includes:
acquiring original audio containing a plurality of sound source objects;
inputting the original audio to a sound source separation network to obtain original sound source signals corresponding to the sound source objects;
taking a sound source object to be decorrelated as the target sound source object to obtain a plurality of original sound source signals corresponding to the target sound source object;
the method further comprises the steps of:
and performing loudness scaling according to the original sound source signals corresponding to the sound source objects and the original audio, and determining loudness ratio information of the sound source objects in the original audio.
In one embodiment, the adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object, and outputting the target audio corresponding to the original audio, includes:
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object to obtain gain allocation information of each target sound source object, and obtaining gain configuration information of sound source objects except for the target sound source object;
combining gain allocation information of each target sound source object, gain configuration information of sound source objects except the target sound source object and loudness ratio information of each sound source object in the original audio to synthesize converted audio corresponding to the original audio;
and performing audio rendering on the converted audio according to preset standardized processing information to obtain the target audio.
In a second aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the following steps:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In a third aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In a fourth aspect, the application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
An audio generation method, a computer device, and a computer-readable storage medium as described above. In this scheme, original sound source signals corresponding to a plurality of target sound source objects are extracted from the original audio, and the original sound source signal of each target sound source object is decorrelated to obtain derivative sound source signals for that object. For each target sound source object, a sound image position is assigned according to its original sound source signal and its derivative sound source signal; the gain of each target sound source object is then adjusted based on its sound image position, and target audio with more channels than the original audio is output. Because decorrelation yields a large number of expanded music elements even when few separated tracks are available, and sound images are positioned accurately, a stereo source can be quickly converted into multi-channel surround sound for output, effectively improving both audio generation efficiency and the audio effect.
Drawings
FIG. 1 is a flow chart of an audio generation method according to an embodiment;
FIG. 2 is a schematic diagram of a signal placement manner according to an embodiment;
FIG. 3 is a schematic diagram of azimuth modulation in one embodiment;
FIG. 4 is a flow chart of another audio generation method in one embodiment;
FIG. 5 is a block diagram of an audio generating apparatus in one embodiment;
FIG. 6 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, an audio generation method is provided. The method is described here as applied to a terminal; it is understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between them. In this embodiment, the method includes the following steps.
In step S101, extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
as an example, the original audio may be stereo audio, which may correspond to two channels, such as a left channel and a right channel.
The target sound source object may be a sound source object to be subjected to decorrelation processing. Whether subsequent decorrelation and sound image position redistribution are required for a sound source object may be determined from its sound source characteristics; for example, sound source objects may include instruments such as guitar, piano, and bass, as well as the human voice.
In practical application, original audio containing a plurality of sound source objects can be obtained and input to a sound source separation network to obtain the original sound source signal corresponding to each sound source object. The sound source objects to be decorrelated are then taken as target sound source objects, yielding the original sound source signals corresponding to the plurality of target sound source objects for further decorrelation processing and sound image position redistribution.
In one example, stereo audio (i.e., the original audio) may be input to a sound source separation network, which separates the sound source signals (i.e., the original sound source signals) of a plurality of sound source objects such as guitar, piano, bass, and voice. For each sound source object, several sound source signals may be separated; that is, the original sound source signal may comprise multiple signals, e.g., the object's sound source signal in the left channel and its sound source signal in the right channel.
In still another example, taking guitar, piano, bass, and voice as examples: since the bass is a low-frequency foundation instrument, it can be determined that it does not require subsequent decorrelation and sound image position redistribution, while sound source objects such as guitar, piano, and voice are determined to require decorrelation and are taken as target sound source objects. The sound source objects to be decorrelated may also be chosen according to other sound source characteristics or audio production requirements, which is not specifically limited in this embodiment.
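The selection logic just described can be sketched as follows; the stem names and the helper function are illustrative assumptions, not part of the patent:

```python
# Hypothetical sketch of target-object selection: the bass carries the
# low-frequency foundation, so it is excluded from decorrelation and
# sound image redistribution; guitar, piano, and voice become targets.
SEPARATED_STEMS = ["voice", "guitar", "piano", "bass"]

def select_target_objects(stems, excluded=("bass",)):
    """Return the sound source objects that will be decorrelated."""
    return [s for s in stems if s not in excluded]

targets = select_target_objects(SEPARATED_STEMS)
```

Other exclusion rules could be substituted according to the sound source characteristics or production requirements mentioned above.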
In step S102, a derivative sound source signal corresponding to each target sound source object is obtained by performing decorrelation processing on an original sound source signal corresponding to each target sound source object;
In a specific implementation, signal decorrelation can be performed on the extracted original sound source signal of each target sound source object, for example by combining an all-pass filter with a delay combination operation, so that a plurality of decorrelated signals are derived, namely the derivative sound source signals corresponding to each target sound source object.
In an alternative embodiment, taking the original audio as stereo audio as an example, for each sound source object, the original sound source signal under the left channel and the original sound source signal under the right channel can be obtained by separation, the original sound source signal under the left channel can be used for performing decorrelation processing to obtain the derivative sound source signal in the corresponding left direction range, and the original sound source signal under the right channel can be used for performing decorrelation processing to obtain the derivative sound source signal in the corresponding right direction range.
In step S103, for any target sound source object, according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object, the sound image position corresponding to any target sound source object is allocated;
After the derivative sound source signal of a target sound source object is obtained, azimuth modulation can be performed using the separated original sound source signal and the decorrelated derivative sound source signal. Specifically, taking the guitar as the target sound source object, the guitar's original sound source signal and derivative sound source signal can be placed at different azimuths around the listening position, and the guitar's sound image position relative to the listening position can then be located through the sound pressure difference between the placed sources.
For example, with stereo original audio, the guitar's original sound source signal may be placed in front of the listening position and its derivative sound source signal behind it; the sound pressure difference between the front and rear sources then determines where the guitar is localized, for example behind the listening position, i.e., its sound image position.
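The front/rear placement just described is a form of amplitude panning: the perceived sound image sits between two channel positions according to their gain, i.e., sound pressure, ratio. A minimal constant-power sketch (the function name and angle convention are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def pan_by_pressure_difference(signal, theta):
    """Constant-power pan between two channel positions.

    theta in [0, pi/2]: 0 places the sound image fully at the first
    position (e.g. front), pi/2 fully at the second (e.g. rear);
    intermediate values shift the image via the sound pressure
    difference between the two playback positions.
    """
    g_front, g_rear = np.cos(theta), np.sin(theta)  # g_front^2 + g_rear^2 == 1
    return g_front * signal, g_rear * signal

# bias a guitar stem toward the rear channel: rear gain exceeds front gain
front, rear = pan_by_pressure_difference(np.ones(4), np.deg2rad(60.0))
```

The cosine/sine gain law keeps total power constant, so moving a source around the listener does not change its loudness.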
In step S104, the gain of each target sound source object is adjusted based on the sound image position corresponding to each target sound source object, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.
As an example, the target audio may be surround sound audio, which may correspond to six or eight channels, such as 5.1 channel surround sound, 7.1 channel surround sound.
After the sound image position of each target sound source object is obtained, the gain of each target sound source object can be adjusted to create a surround sound effect, and the surround mix can then be rendered and output as the target audio, such as surround sound audio.
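As an illustrative sketch of this gain adjustment step (the channel names and gain values are assumptions, not taken from the patent), each original or derivative sound source signal can be mixed into a 5.1 bus with per-channel gains, the gain pattern fixing the sound image position:

```python
import numpy as np

CHANNELS_5_1 = ("FL", "FR", "C", "LFE", "SL", "SR")

def mix_into_bus(bus, signal, gains):
    """Add one source signal into the multi-channel bus with the given
    per-channel gains; channels absent from `gains` are left untouched."""
    for channel, gain in gains.items():
        bus[channel] += gain * signal
    return bus

bus = {ch: np.zeros(8) for ch in CHANNELS_5_1}
# place a derivative signal behind the listener: surround channels only
mix_into_bus(bus, np.ones(8), {"SL": 0.7, "SR": 0.7})
```

Summing every source's contribution over all six buses yields the converted multi-channel audio described above.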
In one example, panoramic sound (surround sound), also known as spatial audio, not only opens up more possibilities for music streaming but also greatly enhances the listening experience. However, most sound sources on the music market are recorded and produced in stereo, so songs released from stereo masters cannot be quickly converted to spatial audio, and the demands of an immersive listening environment, such as a 5.1 surround system, make panoramic sound production extremely expensive. By adopting this audio generation method, a large library of stored stereo sources can be quickly turned into corresponding multi-channel surround versions and adapted to different playback scenarios according to different surround output requirements.
For example, the audio generation method of the present application can convert a stereo source into 5.1 channels based on deep-learning sound source separation: a deep neural network extracts the different music signals in the left and right channels, such as but not limited to voice, drums, bass, piano, and guitar, and the separated signals are then decorrelated, automatically generating audio that conforms to 5.1 and 7.1 surround sound channel formats.
The traditional method can only roughly separate the accompaniment; its poor degree of separation makes the channels too correlated, it cannot redistribute sound image positions, its sound image localization is inaccurate, and it fails to create a good sense of envelopment. In contrast, the technical scheme of the application uses deep learning to separate the stereo signal into an independent vocal track and several independent instrument tracks, formulates the allocation of surround channels on that basis, and thus provides a better multi-channel music generation scheme. With deep-learning sound source separation, decorrelation yields a large number of expanded music elements even when few tracks are separated, and azimuth modulation via sound pressure differences achieves a surround effect in which instruments are clearly localized. A source recorded in stereo can therefore be quickly converted into multi-channel surround output, widening the applicability of songs in a music library, with applications such as in-car surround sound and panoramic sound support for live streaming.
According to the above audio generation method, original sound source signals corresponding to a plurality of target sound source objects are extracted from the original audio, and each is decorrelated to obtain derivative sound source signals. For each target sound source object, a sound image position is assigned according to its original and derivative sound source signals, its gain is adjusted based on that position, and target audio with more channels than the original audio is output. Decorrelation provides a large number of expanded music elements even when few tracks are separated, sound images are positioned accurately, and a stereo source can be quickly converted into multi-channel surround sound output, effectively improving audio generation efficiency and the audio effect.
In one embodiment, in step S102, the decorrelation process is performed on the original sound source signal corresponding to each target sound source object to obtain a derivative sound source signal corresponding to each target sound source object, which may include the following steps:
performing decorrelation processing on the original sound source signals corresponding to any target sound source object according to a first delay processing mode to obtain a first delay result of the original sound source signals corresponding to any target sound source object; and performing decorrelation processing on the first delay result according to a second delay processing mode to generate a plurality of decorrelation signals aiming at any target sound source object as derivative sound source signals corresponding to any target sound source object.
In an example, to address the problem that too few sound sources cannot create a good surround effect, the original sound source signal of each target sound source object is decorrelated, for example by combining an all-pass filter (i.e., the first delay processing mode) with a delay combination operation (i.e., the second delay processing mode), deriving a plurality of decorrelated signals. Rather than setting an azimuth separately for each type of instrument, the signals are duplicated through decorrelation and given different positions, creating a surround effect while the main sound image is shaped by adjusting the gains.
In this embodiment, the original sound source signal of a target sound source object is decorrelated according to the first delay processing mode to obtain a first delay result, and the first delay result is decorrelated according to the second delay processing mode to generate a plurality of decorrelated signals serving as that object's derivative sound source signals. A large number of expanded music elements can thus be obtained through decorrelation even when few tracks are separated, providing rich material for audio production.
In one embodiment, according to a first delay processing manner, performing decorrelation processing on an original sound source signal corresponding to any target sound source object to obtain a first delay result of the original sound source signal corresponding to any target sound source object, which may include the following steps:
and inputting the original sound source signal corresponding to any target sound source object into an all-pass filter, and obtaining an output signal of the original sound source signal corresponding to any target sound source object according to the impulse response information of the all-pass filter as a first delay result.
In practical application, taking the guitar as an example, the guitar's original sound source signal can be input to an all-pass filter. Without changing the amplitude response, the filter shifts the phase by different amounts as frequency varies; that is, the group delay characteristic of the all-pass filter delays the waveform envelopes of different frequency components in the time domain, achieving the goal of decorrelation.
For example, the first-order all-pass filter transfer function may be represented as follows:
H(z) = (a + z^-1) / (1 + a·z^-1)
where the coefficient a is determined by the normalized cut-off frequency w, which ranges over [0, 1].
In an example, taking the voice as an example: let the voice signal (i.e., the original sound source signal) be x[n], the impulse response of the all-pass filter (i.e., the impulse response information) be h[n], and the output signal after the all-pass filter be y[n]; then the following relationship holds:
y[n]=x[n]*h[n]
In this embodiment, by inputting the original sound source signal corresponding to any one target sound source object to the all-pass filter, the output signal of the original sound source signal corresponding to any one target sound source object is obtained according to the impulse response information of the all-pass filter, and as the first delay result, the aim of signal decorrelation can be achieved based on the group delay characteristic of the all-pass filter.
In one embodiment, the decorrelation processing is performed on the first delay result according to the second delay processing manner, so as to generate a plurality of decorrelated signals for any target sound source object, which may include the following steps:
acquiring preset sampling information; the preset sampling information comprises the number of sampling points; and carrying out sampling time delay processing on the first time delay result according to the number of the sampling points to obtain a plurality of decorrelated signals.
In a specific implementation, according to the precedence effect, when two sounds arrive within about 35 ms of each other, the listener cannot distinguish them as two separate sound sources. Using this characteristic, the signal y[n] after the all-pass filter (i.e., the first delay result) can be delayed by n sampling points (i.e., the number of sampling points), where for example n < the audio sampling rate 44100 × 0.035, so that the decorrelation effect is further enhanced while the timbre remains consistent. After this signal decorrelation operation combining the all-pass filter and the delay is completed, the same instrument signal can be derived into a plurality of (e.g., 3-6) uncorrelated signals, i.e., a plurality of decorrelated signals for any target sound source object.
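The delay stage above can be sketched as follows; the number of derived copies and the way the delays are spread below the 35 ms bound are illustrative choices, not values fixed by this embodiment:

```python
import numpy as np

def derive_decorrelated(y, num_signals=4, sample_rate=44100, max_delay_s=0.035):
    """Derive several decorrelated copies of the all-pass-filtered signal y.

    Each copy is delayed by a different number of samples, all kept below
    the ~35 ms precedence-effect threshold (n < 44100 * 0.035 samples at
    44.1 kHz) so the copies still fuse perceptually with the original.
    """
    max_delay = int(sample_rate * max_delay_s) - 1  # largest allowed delay in samples
    delays = np.linspace(max_delay // num_signals, max_delay, num_signals).astype(int)
    copies = []
    for d in delays:
        # shift right by d samples, zero-pad the head, keep original length
        delayed = np.concatenate([np.zeros(d), np.asarray(y, dtype=float)])[:len(y)]
        copies.append(delayed)
    return copies, delays
```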
In this embodiment, the preset sampling information is obtained, and then the sampling delay processing is performed on the first delay result according to the number of sampling points, so as to obtain a plurality of decorrelation signals, and a large number of expanded music elements can be obtained through the decorrelation processing, so that data support is provided for multi-channel music generation.
In one embodiment, in step S103, for any target sound source object, the allocation of the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object may include the following steps:
the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object are respectively placed at different sound channel positions to obtain sound pressure difference information; and positioning the sound image position corresponding to any target sound source object according to the sound pressure difference information.
In an example, the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to that target sound source object may be placed in different directions around the listening position, that is, at different channel positions, as shown in fig. 2, which illustrates one placement manner. The sound image position of the target sound source object relative to the listening position may then be obtained from the sound pressure difference between the sound sources placed at those positions (i.e., the sound pressure difference information), such as the sound pressure difference between the front and rear sound sources in fig. 2.
In this embodiment, the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object are respectively placed at different channel positions to obtain the sound pressure difference information, and then the sound image position corresponding to any target sound source object is positioned according to the sound pressure difference information, so that redistribution of the sound image position can be realized, and the sound image positioning accuracy is improved.
In one embodiment, the method for obtaining the sound pressure difference information by respectively placing the original sound source signal corresponding to any one target sound source object and the derivative sound source signal corresponding to any one target sound source object at different channel positions may include the following steps:
determining a signal placement mode for any target sound source object; the signal placement mode comprises a first sound channel position and a second sound channel position; placing an original sound source signal corresponding to any target sound source object to a first channel position, and placing a derivative sound source signal corresponding to any target sound source object to a second channel position; and determining sound pressure difference information according to the sound pressure difference of the sound source between the first channel position and the second channel position.
In practical application, after multiple decorrelated signals are derived for the same type of instrument signal, static azimuth modulation can be performed based on the original signal and the derived signals. Taking the guitar as the target sound source object, and taking 5.1 channels as shown in fig. 2 as an example, the original guitar left and right channel signals can be defined as G_l and G_r, and the guitar-derived signals obtained by the decorrelation operation can be defined as G_new-l and G_new-r. The original guitar signal (i.e., the original sound source signal) and the derived guitar signal (i.e., the derivative sound source signal) can be placed in front of and behind the listening position (i.e., the first channel position and the second channel position), which ensures that direct sound of the same element reaches the listening position from all these directions. The amplitudes of the four signals can then be adjusted, and the guitar is localized behind the listening position by the sound pressure difference between the front and rear sound sources.
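A minimal sketch of this front/rear placement follows; the specific gain values are illustrative assumptions, since the embodiment only requires that the front/rear sound pressure difference favor the rear in order to pull the image behind the listener:

```python
import numpy as np

def place_guitar_front_back(g_l, g_r, g_new_l, g_new_r,
                            front_gain=0.4, rear_gain=0.9):
    """Place the original guitar pair (G_l, G_r) on the front L/R channels
    and the derived pair (G_new-l, G_new-r) on the surround channels.

    rear_gain > front_gain creates a front/rear sound pressure difference
    that localizes the guitar image behind the listening position.
    """
    return {
        "front_left":     front_gain * np.asarray(g_l, dtype=float),
        "front_right":    front_gain * np.asarray(g_r, dtype=float),
        "surround_left":  rear_gain * np.asarray(g_new_l, dtype=float),
        "surround_right": rear_gain * np.asarray(g_new_r, dtype=float),
    }
```

Raising `rear_gain` relative to `front_gain` moves the perceived image further behind the listener; equal gains would center it.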
In this embodiment, by determining a signal placement manner for any target sound source object, then placing an original sound source signal corresponding to any target sound source object at a first channel position, and placing a derivative sound source signal corresponding to any target sound source object at a second channel position, further determining sound pressure difference information according to a sound pressure difference of a sound source between the first channel position and the second channel position, redistribution of sound image positions is achieved, accurate positioning of sound images is ensured, and audio effects are effectively improved.
In one embodiment, in step S101, extracting original sound source signals corresponding to a plurality of target sound source objects in original audio may include the steps of:
Acquiring original audio containing a plurality of sound source objects; inputting the original audio to a sound source separation network to obtain original sound source signals corresponding to each sound source object; taking the sound source object to be decorrelated as a target sound source object to obtain original sound source signals corresponding to a plurality of target sound source objects;
the sound source separation network can be a neural network, and separation of different music elements can be realized based on the neural network structure; other model structures that achieve the same separation effect may also be used.
In an example, taking the original audio as stereo audio, the stereo audio may be input to a sound source separation network, through which the sound source signals of a plurality of sound source objects (i.e., the original sound source signals) may be separated, such as guitar, piano, bass and human voice. The sound source objects that need decorrelation processing, such as guitar, piano and human voice, may then be determined as the target sound source objects, so that further decorrelation and azimuth modulation processing can be performed based on their original sound source signals.
The method further comprises the steps of:
and carrying out loudness scaling according to the original sound source signals and the original audio corresponding to each sound source object, and determining loudness ratio information of each sound source object in the original audio.
In a specific implementation, taking the original audio as stereo audio, after the sound source signals of the plurality of sound source objects are separated, each instrument signal (i.e., each original sound source signal) can be compared against the original stereo for loudness scaling, so that the loudness ratio (volume ratio) of each signal in the original work, i.e., the loudness ratio information, is retained; this information is used to restore the original balance in the subsequent panoramic sound synthesis.
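The loudness comparison can be sketched as below; RMS level is used here as a simple loudness proxy, since this embodiment does not fix a particular loudness measure:

```python
import numpy as np

def loudness_ratio(stem, mix):
    """Loudness share of one separated stem relative to the original mix.

    RMS level stands in for loudness; the resulting ratio is stored as the
    loudness ratio information and reapplied during panoramic synthesis.
    """
    def rms(s):
        return float(np.sqrt(np.mean(np.square(np.asarray(s, dtype=float)))))
    return rms(stem) / max(rms(mix), 1e-12)  # guard against a silent mix
```

For instance, a stem whose amplitude is half that of the mix yields a ratio of 0.5.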
In this embodiment, the original audio containing a plurality of sound source objects is acquired and input to the sound source separation network to obtain the original sound source signal corresponding to each sound source object, and the sound source objects to be decorrelated are taken as the target sound source objects to obtain their original sound source signals. In this way, a deep learning method can separate the stereo signal into an independent vocal track and a plurality of independent instrument tracks, realizing effective sound source separation.
In one embodiment, in step S104, adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object, and outputting the target audio corresponding to the original audio may include the following steps:
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object to obtain gain allocation information of each target sound source object, and obtaining gain configuration information of sound source objects except the target sound source object; combining gain allocation information of each target sound source object, gain configuration information of sound source objects except the target sound source object and loudness ratio information of each sound source object in original audio to synthesize converted audio corresponding to the original audio; and performing audio rendering on the converted audio according to preset standardized processing information to obtain target audio.
In an example, taking 5.1 channels, fig. 3 shows an azimuth modulation schematic of multi-channel instrument signals and a 4-channel reverberation signal, where the thickness of a line represents how much of the instrument is sent into the surround sound field. For example, the human voice signal may be sent directly to the center channel, while an instrument signal may be distributed across the 5 channels (the center, front left, front right, left surround and right surround channels in fig. 3) with different gain amounts. Different instruments may select different channels; a selection may involve multiple channels and depends on direction, for example, a sound image on the right side may be sent through the front right and rear right channels. The main localization sound image is determined by where the thick lines in fig. 3 lie, in which case those channels play the localizing role for the corresponding instrument. In fig. 3, the guitar, piano, drum, voice and other elements each have a corresponding thick-line position.
In an alternative embodiment, the bass, as a low-frequency foundation instrument, may be distributed equally to the channels other than the LFE (Low Frequency Effects) channel, and may additionally be sent to the LFE with a gain above the average, since the LFE mainly serves to reinforce the low end. The LFE may also include a low-pass signal with a cut-off frequency of 120 Hz obtained after linearly adding all signals except the bass.
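The 120 Hz low-pass feed into the LFE can be sketched as below; a one-pole IIR low-pass is used as a minimal stand-in for whatever crossover filter a production system would actually use:

```python
import numpy as np

def lfe_lowpass(stems, sample_rate=44100, cutoff_hz=120.0):
    """Build an LFE feed: linearly add the given stems, then keep only
    content below the cut-off using a one-pole low-pass,
    y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    mix = np.sum([np.asarray(s, dtype=float) for s in stems], axis=0)
    # smoothing coefficient derived from the cut-off frequency
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
    out = np.empty_like(mix)
    acc = 0.0
    for n, x in enumerate(mix):
        acc += alpha * (x - acc)
        out[n] = acc
    return out
```

Content near DC passes almost unchanged, while content near the Nyquist frequency is strongly attenuated, which is the behavior the LFE feed needs.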
In yet another example, also taking 5.1 channels as shown in fig. 3, in order to restore surround sound more faithfully, the reverberation signal may be modulated toward the left and right surround directions to simulate the surround reverberation of a sound source after room reflections. The azimuth modulation may take the form of HOA (Higher Order Ambisonics), which may then be decoded into a six-channel signal.
In practical application, for the surround rendering output, a compression module can be added to perform standardized protection processing before the final signal output, that is, audio rendering is performed on the converted audio according to the preset standardized processing information, which prevents clipping distortion, signal distortion and the like caused by signal overload. In this way, a stereo signal can be quickly converted into a multichannel surround signal and adapted, according to the output requirement, to playback in different scenes.
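A minimal sketch of such output protection follows; uniform peak scaling stands in for the compression module the embodiment mentions (a production chain would typically use a look-ahead limiter instead):

```python
import numpy as np

def protect_output(channels, ceiling=0.99):
    """Standardized protection before final output.

    If any channel peak would exceed the ceiling, scale all channels down
    by the same factor so the inter-channel balance is preserved and
    clipping distortion from signal overload is avoided.
    """
    chans = [np.asarray(c, dtype=float) for c in channels]
    peak = max(float(np.max(np.abs(c))) for c in chans)
    if peak <= ceiling:
        return chans               # already safe, pass through unchanged
    gain = ceiling / peak          # one shared gain keeps relative balance
    return [gain * c for c in chans]
```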
For example, in a 5.1 vehicle-mounted audio playback scene, a stereo sound source can be output as a six-channel signal, and the signal can be transmitted through the sound card playback device to a 5.1 loudspeaker array, so that a better surround effect is achieved. For 7.1 and other surround sound formats, the method likewise allows the music library to adapt well to the playback end, widening the application scenes of streaming media.
The equalizer, reverberator, compressor and other effectors involved in the technical solution of this embodiment are not limited to a specific algorithm; effectors with similar functions may be substituted. Likewise, the azimuth modulation and channel allocation mechanism is not limited to VBAP, HOA or other particular methods.
In this embodiment, the gain of each target sound source object is adjusted based on the sound image position corresponding to each target sound source object to obtain the gain allocation information of each target sound source object, and the gain configuration information of the sound source objects other than the target sound source objects is obtained. The gain allocation information of each target sound source object, the gain configuration information of the other sound source objects, and the loudness ratio information of each sound source object in the original audio are then combined to synthesize the converted audio corresponding to the original audio, and audio rendering is performed on the converted audio according to the preset standardized processing information to obtain the target audio. In this way, the stereo sound source is quickly converted into multi-channel surround sound for output, effectively improving audio generation efficiency and audio effect.
In one embodiment, as shown in fig. 4, a flow diagram of another audio generation method is provided.
In this embodiment, the method includes the steps of:
In step 401, original sound source signals corresponding to a plurality of target sound source objects in original audio are extracted.
In step 402, an original sound source signal corresponding to any target sound source object is input to an all-pass filter, and an output signal of the original sound source signal corresponding to any target sound source object is obtained as a first delay result according to impulse response information of the all-pass filter.
In step 403, preset sampling information is obtained, and sampling delay processing is performed on the first delay result according to the number of sampling points, so as to obtain a plurality of decorrelated signals, which are used as derivative sound source signals corresponding to any target sound source object.
In step 404, the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object are respectively placed at different channel positions to obtain the sound pressure difference information.
In step 405, the sound image position corresponding to any one of the target sound source objects is located according to the sound pressure difference information.
In step 406, the gain of each target sound source object is adjusted based on the sound image position corresponding to each target sound source object, gain allocation information of each target sound source object is obtained, and gain configuration information of sound source objects other than the target sound source object is obtained.
In step 407, the converted audio corresponding to the original audio is synthesized by combining the gain allocation information of each target sound source object, the gain configuration information of the sound source objects other than the target sound source object, and the loudness ratio information of each sound source object in the original audio.
In step 408, audio rendering is performed on the converted audio according to the preset standardized processing information, so as to obtain the target audio. It should be noted that, for the specific limitations of the above steps, reference may be made to the specific limitations of the audio generation method described above, which are not repeated here.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the steps are not strictly limited in order of execution and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps or stages.
Based on the same inventive concept, the embodiment of the application also provides an audio generating device for realizing the above related audio generating method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of one or more audio generating devices provided below may be referred to the limitation of the audio generating method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 5, there is provided an audio generating apparatus including:
the original sound source signal extraction module 501 is configured to extract original sound source signals corresponding to a plurality of target sound source objects in original audio;
a derived sound source signal obtaining module 502, configured to obtain derived sound source signals corresponding to the target sound source objects by performing decorrelation processing on original sound source signals corresponding to the target sound source objects;
a sound image position allocation module 503, configured to allocate, for any target sound source object, a sound image position corresponding to the any target sound source object according to an original sound source signal corresponding to the any target sound source object and a derivative sound source signal corresponding to the any target sound source object;
a target audio obtaining module 504, configured to adjust a gain of each target sound source object based on a sound image position corresponding to each target sound source object, and output target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In one embodiment, the derived sound source signal obtaining module 502 includes:
the first decorrelation sub-module is used for performing decorrelation processing on the original sound source signals corresponding to any target sound source object according to a first delay processing mode to obtain a first delay result of the original sound source signals corresponding to any target sound source object;
And the second decorrelation sub-module is used for performing decorrelation processing on the first delay result according to a second delay processing mode, and generating a plurality of decorrelation signals aiming at any target sound source object as derivative sound source signals corresponding to the any target sound source object.
In one embodiment, the first decorrelation submodule includes:
and the all-pass filter processing unit is used for inputting the original sound source signal corresponding to any target sound source object into an all-pass filter, and obtaining an output signal of the original sound source signal corresponding to any target sound source object according to the impulse response information of the all-pass filter as the first delay result.
In one embodiment, the second decorrelation submodule includes:
the sampling information acquisition unit is used for acquiring preset sampling information; the preset sampling information comprises the number of sampling points;
and the decorrelation signal obtaining unit is used for carrying out sampling time delay processing on the first time delay result according to the number of the sampling points to obtain a plurality of decorrelation signals.
In one embodiment, the sound image location allocation module 503 includes:
the sound pressure difference information obtaining submodule is used for obtaining sound pressure difference information by respectively placing the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object at different channel positions;
And the sound image position positioning sub-module is used for positioning the sound image position corresponding to any target sound source object according to the sound pressure difference information.
In one embodiment, the sound pressure difference information obtaining submodule includes:
a signal placement mode determining unit, configured to determine a signal placement mode for the arbitrary target sound source object; the signal placement mode comprises a first sound channel position and a second sound channel position;
a signal placement unit, configured to place an original sound source signal corresponding to the arbitrary target sound source object to the first channel position, and place a derivative sound source signal corresponding to the arbitrary target sound source object to the second channel position;
and the sound pressure difference determining unit is used for determining the sound pressure difference information according to the sound pressure difference of the sound source between the first sound channel position and the second sound channel position.
In one embodiment, the original sound source signal extraction module 501 includes:
an original audio acquisition sub-module for acquiring original audio including a plurality of sound source objects;
the sound source separation sub-module is used for inputting the original audio to a sound source separation network to obtain original sound source signals corresponding to the sound source objects;
A target sound source object determining submodule, configured to use a sound source object to be decorrelated as the target sound source object, to obtain a plurality of original sound source signals corresponding to the target sound source object;
in one embodiment, the apparatus further comprises:
and the loudness ratio information determining module is used for performing loudness scaling according to the original sound source signals corresponding to the sound source objects and the original audio, and determining the loudness ratio information of the sound source objects in the original audio.
In one embodiment, the target audio derivation module 504 includes:
the gain information determining submodule is used for adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object, obtaining gain distribution information of each target sound source object and obtaining gain configuration information of sound source objects except for the target sound source object;
the converted audio obtaining submodule is used for combining gain allocation information of each target sound source object, gain configuration information of sound source objects except the target sound source object and loudness ratio information of each sound source object in the original audio to synthesize converted audio corresponding to the original audio;
And the audio rendering sub-module is used for performing audio rendering on the converted audio according to preset standardized processing information to obtain the target audio.
The respective modules in the above-described audio generating apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an audio generation method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In one embodiment, the steps of the other embodiments described above are also implemented when the processor executes a computer program.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In one embodiment, the computer program, when executed by a processor, also implements the steps of the other embodiments described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, distributing the sound image position corresponding to any target sound source object according to the original sound source signal corresponding to any target sound source object and the derivative sound source signal corresponding to any target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
In one embodiment, the computer program, when executed by a processor, also implements the steps of the other embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that all or part of the procedures of the methods described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the procedures of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, any combination of these technical features that involves no contradiction should be considered to fall within the scope of this specification.
The foregoing embodiments represent only a few implementations of the present application, and although they are described in detail, they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A method of audio generation, the method comprising:
extracting original sound source signals corresponding to a plurality of target sound source objects in original audio;
performing decorrelation processing on original sound source signals corresponding to the target sound source objects to obtain derivative sound source signals corresponding to the target sound source objects;
for any target sound source object, assigning a sound image position to the target sound source object according to the original sound source signal corresponding to the target sound source object and the derivative sound source signal corresponding to the target sound source object;
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object, and outputting target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.
2. The method according to claim 1, wherein performing decorrelation processing on the original sound source signals corresponding to the target sound source objects to obtain the derivative sound source signals corresponding to the target sound source objects comprises:
performing decorrelation processing on an original sound source signal corresponding to any target sound source object according to a first time delay processing mode to obtain a first time delay result of the original sound source signal corresponding to any target sound source object;
and performing decorrelation processing on the first delay result according to a second delay processing mode to generate a plurality of decorrelated signals for the target sound source object, the plurality of decorrelated signals serving as the derivative sound source signals corresponding to the target sound source object.
3. The method of claim 2, wherein performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first delay processing mode to obtain the first delay result of the original sound source signal corresponding to the target sound source object comprises:
inputting the original sound source signal corresponding to the target sound source object into an all-pass filter, and obtaining an output signal of the original sound source signal according to the impulse response information of the all-pass filter, the output signal serving as the first delay result.
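The all-pass filtering of claim 3 can be sketched with a first-order all-pass, whose order and coefficient g here are illustrative assumptions: an all-pass filter passes every frequency at unit magnitude but shifts phase as a function of frequency, so its output decorrelates from the input without altering the spectrum.

```python
# Sketch of decorrelation via an all-pass filter (claim 3). The filter
# order and coefficient g are illustrative assumptions.

def allpass(x, g=0.5):
    y, x1, y1 = [], 0.0, 0.0
    for xn in x:
        yn = -g * xn + x1 + g * y1   # y[n] = -g*x[n] + x[n-1] + g*y[n-1]
        y.append(yn)
        x1, y1 = xn, yn
    return y

x = [1.0, 0.0, 0.0, 0.0]   # unit impulse
h = allpass(x)             # impulse response of the all-pass filter
print(h)
```

For g = 0.5 the impulse response begins -0.5, 0.75, 0.375, ..., i.e. a decaying phase-dispersed tail rather than a copy of the input.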
4. The method of claim 2, wherein performing decorrelation processing on the first delay result according to the second delay processing mode to generate the plurality of decorrelated signals for the target sound source object comprises:
acquiring preset sampling information; the preset sampling information comprises the number of sampling points;
and performing sampling delay processing on the first delay result according to the number of sampling points to obtain the plurality of decorrelated signals.
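The second delay stage of claim 4 can be sketched as shifting the first-stage result by several different sample counts, one decorrelated copy per delay value. The specific delay values here are illustrative assumptions, not values fixed by the claim.

```python
# Sketch of claim 4: generate several decorrelated copies by delaying
# the first-stage result by different numbers of sampling points
# (the delay values are illustrative assumptions).

def sample_delays(signal, delays=(2, 5, 9)):
    return [[0.0] * d + signal[: len(signal) - d] for d in delays]

first_stage = [0.3, -0.1, 0.7, 0.2, -0.4, 0.6, 0.1, -0.2, 0.5, 0.0]
copies = sample_delays(first_stage)
print(len(copies))   # one decorrelated signal per delay value
```
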
5. The method according to claim 1, wherein, for any target sound source object, assigning the sound image position corresponding to the target sound source object according to the original sound source signal corresponding to the target sound source object and the derivative sound source signal corresponding to the target sound source object comprises:
placing the original sound source signal corresponding to the target sound source object and the derivative sound source signal corresponding to the target sound source object at different channel positions, respectively, to obtain sound pressure difference information;
and locating the sound image position corresponding to the target sound source object according to the sound pressure difference information.
6. The method according to claim 5, wherein placing the original sound source signal corresponding to the target sound source object and the derivative sound source signal corresponding to the target sound source object at different channel positions, respectively, to obtain the sound pressure difference information comprises:
determining a signal placement mode for the target sound source object; the signal placement mode comprises a first channel position and a second channel position;
placing the original sound source signal corresponding to the target sound source object at the first channel position, and placing the derivative sound source signal corresponding to the target sound source object at the second channel position;
and determining the sound pressure difference information according to the sound pressure difference of the sound source between the first channel position and the second channel position.
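The localization by sound pressure difference in claims 5 and 6 can be sketched with level-difference panning: the signal on the first channel position is louder or quieter than the one on the second, and that level difference places the perceived image between the two positions. The tangent panning law used below is one common choice, an assumption rather than anything the claims prescribe.

```python
import math

# Sketch of claims 5-6: the level (sound pressure) difference between two
# channel positions determines where the sound image is perceived. The
# tangent panning law here is an assumed, common gain rule.

def pan_gains(angle_deg, spread_deg=30.0):
    """Tangent-law gains for an image at angle_deg between two speakers
    at +/- spread_deg, normalized to constant power."""
    t = math.tan(math.radians(angle_deg)) / math.tan(math.radians(spread_deg))
    g_l, g_r = 1.0 + t, 1.0 - t           # relative channel gains
    norm = math.hypot(g_l, g_r)
    return g_l / norm, g_r / norm

g_l, g_r = pan_gains(0.0)                 # image at center
level_diff_db = 20 * math.log10(g_l / g_r)
print(level_diff_db)                      # 0 dB -> image dead center
```

Increasing the angle raises the left gain relative to the right, and the larger sound pressure difference pulls the image toward the left position.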
7. The method according to claim 1, wherein extracting original sound source signals corresponding to a plurality of target sound source objects in the original audio comprises:
acquiring original audio containing a plurality of sound source objects;
inputting the original audio to a sound source separation network to obtain original sound source signals corresponding to the sound source objects;
taking the sound source objects to be decorrelated as the target sound source objects to obtain the original sound source signals corresponding to the plurality of target sound source objects;
the method further comprises the steps of:
and performing loudness scaling according to the original sound source signals corresponding to the sound source objects and the original audio, and determining loudness ratio information of the sound source objects in the original audio.
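The loudness scaling of claim 7 can be sketched by comparing each separated source against the original mix. Using RMS as the loudness proxy is an assumption made here for illustration; the claim does not fix a specific loudness measure.

```python
import math

# Sketch of claim 7's loudness scaling: estimate each separated source's
# loudness share of the original audio via RMS (RMS as a loudness proxy
# is an assumption, not something the claim specifies).

def rms(sig):
    return math.sqrt(sum(s * s for s in sig) / len(sig))

mix = [0.9, -0.3, 0.6, -0.6, 0.3, -0.9]
separated = {"vocal": [0.6, -0.2, 0.4, -0.4, 0.2, -0.6],
             "drums": [0.3, -0.1, 0.2, -0.2, 0.1, -0.3]}
ratios = {name: rms(sig) / rms(mix) for name, sig in separated.items()}
print(ratios)
```

Here the toy "vocal" carries two thirds of the mix level and the "drums" one third; these ratios are the loudness ratio information reused when resynthesizing the converted audio.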
8. The method of claim 7, wherein adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object and outputting the target audio corresponding to the original audio comprises:
based on the sound image position corresponding to each target sound source object, adjusting the gain of each target sound source object to obtain gain allocation information of each target sound source object, and obtaining gain configuration information of sound source objects except for the target sound source object;
combining gain allocation information of each target sound source object, gain configuration information of sound source objects except the target sound source object and loudness ratio information of each sound source object in the original audio to synthesize converted audio corresponding to the original audio;
and performing audio rendering on the converted audio according to preset standardized processing information to obtain the target audio.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202310986917.2A 2023-08-07 2023-08-07 Audio generation method, computer device, and computer-readable storage medium Pending CN117119369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310986917.2A CN117119369A (en) 2023-08-07 2023-08-07 Audio generation method, computer device, and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN117119369A 2023-11-24

Family

ID=88806685




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination