CN116456263A - Audio signal conversion method, device and equipment

Info

Publication number: CN116456263A
Application number: CN202310456374.3A
Authority: CN
Inventors: 李志华; 杨松; 杨波
Original assignee: Feihu Information Technology Tianjin Co Ltd
Current assignee: Feihu Information Technology Tianjin Co Ltd
Priority date: 2023-04-25
Filing date: 2023-04-25
Publication date: 2023-07-18

Abstract

The invention relates to an audio signal conversion method, apparatus and device. A first binaural audio signal to be converted is acquired from media such as video or audio and is input into a preset deep learning model, which directly outputs a second binaural audio signal with a surround sound effect. The second binaural audio signal therefore produces a surround sound effect when played on conventional two-channel audio equipment such as headphones, which reduces the cost of investing in surround sound equipment and is more convenient to use.

Description

Audio signal conversion method, device and equipment

Technical Field

The present invention relates to the field of audio processing technologies, and in particular, to a method, an apparatus, and a device for audio signal conversion.

Background

With the development of the audio and video industry, people's expectations of the audio-visual experience keep rising. Audio is an important component: besides conveying the information it carries, it often resonates with the user and plays an indispensable role in the audio-visual experience. Ordinary binaural audio holds a large market share because of its low cost and small storage footprint, but it conveys a poor sense of space and direction. More advanced surround sound, such as 5.1-channel audio, offers clearer sound, a wider sound field and a more realistic surround effect, but the associated surround sound equipment is both costly and inconvenient to use.

Disclosure of Invention

The invention provides an audio signal conversion method, an audio signal conversion device and audio signal conversion equipment, which can realize the effect of surround sound on a double-channel device, reduce equipment cost investment and provide convenience for users.

In order to achieve the above purpose, the present invention provides the following technical solutions:

according to an embodiment of the present invention, there is provided an audio signal conversion method including:

acquiring a first binaural audio signal to be converted;

and inputting the first binaural audio signal into a preset deep learning model to obtain a second binaural audio signal with surround sound effect.

Further, the audio signal conversion method further includes:

and carrying out enhancement processing on the obtained second double-channel audio signal so as to enhance the sense of thickness and the sense of space.

Further, the enhancing the obtained second dual-channel audio signal includes:

accumulating the second double-channel audio signal after the delay processing with the second double-channel audio signal to obtain a third double-channel audio signal;

the volume of the third binaural audio signal is scaled to be the same as the size of the first binaural audio signal, so as to obtain a fourth binaural audio signal;

the fourth binaural audio signal is stored at the sampling frequency of the first binaural audio signal.

Further, the deep learning model includes an encoder and a decoder connected through a long short-term memory (LSTM) network, and the training process of the deep learning model includes:

preprocessing the collected original double-channel audio signals to obtain target double-channel audio signals with surround stereo sound effects;

and inputting the original binaural audio signal to the encoder, processing the feature extraction result output by the encoder through the long short-term memory network and the decoder to obtain a model output audio signal, calculating a loss function based on the model output audio signal and the target binaural audio signal, and obtaining the deep learning model when the value of the loss function is not greater than a preset value.

Further, before inputting the original binaural audio signal to the encoder, the method further comprises:

performing data expansion on the original dual-channel audio signal, wherein the data expansion mode comprises the following steps: random drift, band masking, signal reverberation.

Further, the preprocessing the collected original dual-channel audio signal to obtain a target dual-channel audio signal with surround stereo effect includes:

upmixing the original binaural audio signal to a surround sound channel to obtain a surround sound audio signal;

and downmixing the surrounding stereo audio signal to the channel of the original double-channel audio signal to obtain the target double-channel audio signal.

Further, the upmixing the original binaural audio signal onto a surround channel to obtain a surround audio signal, including:

separating the original binaural audio signal into an original channel audio signal and a subwoofer audio signal;

separating the original channel audio signal into a direct sound signal and a reverberant signal;

and convolving the direct sound signals with the head related transfer function data of the corresponding channels of the surround sound respectively, and convolving the reverberation signals with the head related transfer function data of the residual channels of the surround sound respectively.

Further, the downmixing the surround sound audio signal onto the channels of the original dual-channel audio signal to obtain the target dual-channel audio signal includes:

taking the average value of left channel audio of each channel of audio in the surround sound audio signal as the left channel audio of the target double-channel audio signal;

and taking the average value of right channel audio of each channel of audio in the surround sound audio signal as the right channel audio of the target double-channel audio signal.

According to an embodiment of the present invention, there is provided an audio signal conversion apparatus including:

the audio signal acquisition module is used for acquiring a first binaural audio signal to be converted; and

the audio signal conversion module is used for inputting the first binaural audio signal into a preset deep learning model to obtain a second binaural audio signal with a surround sound effect.

According to an embodiment of the present invention, there is provided an apparatus comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the respective steps of the audio signal conversion method as described above.

As can be seen from the above technical solution, the present invention discloses an audio signal conversion method, apparatus and device, which can obtain a first binaural audio signal to be converted from media such as video and audio, and then input the first binaural audio signal into a preset deep learning model, so as to directly obtain a second binaural audio signal with surround sound effects. Therefore, the second double-channel audio signal can generate the effect of surround sound when played on the traditional double-channel audio equipment such as headphones and the like, the input cost of the surround sound equipment can be effectively reduced, and the use is more convenient.

Drawings

To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of an audio signal conversion method according to an embodiment of the present invention;

Fig. 2 is a flowchart of an audio signal enhancement process according to an embodiment of the present invention;

Fig. 3 is a flowchart for constructing a deep learning model according to an embodiment of the present invention;

Fig. 4 is a flowchart illustrating a process of stereo surround sound according to an embodiment of the present invention;

Fig. 5 is a block diagram of a deep learning model provided by an embodiment of the present invention;

Fig. 6 is a block diagram of an audio signal conversion device according to an embodiment of the present invention;

Fig. 7 is a block diagram of an apparatus according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, an embodiment of the present invention provides an audio signal conversion method, which may include the steps of:

101. A first binaural audio signal to be converted is acquired.

A binaural (two-channel) audio signal is an audio signal composed of a left-channel audio signal and a right-channel audio signal. The two-channel audio data can be extracted directly from a video as required, and the extracted audio signal is used as the first binaural audio signal to be converted. For example, a user can extract the two-channel audio signal directly from a video with the ffmpeg tool.
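As an illustration of step 101, the following minimal sketch builds an ffmpeg command line that extracts a two-channel WAV track from a video. The function names, the file names and the 44100 Hz default sampling rate are assumptions made for this example; only the standard ffmpeg options `-i`, `-vn`, `-ac`, `-ar` and `-y` are used.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str, sample_rate: int = 44100) -> list[str]:
    """Build an ffmpeg command that extracts a two-channel WAV track from a video."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "2",               # force two audio channels (left and right)
        "-ar", str(sample_rate),  # output sampling rate in Hz (assumed value)
        wav_path,
    ]


def extract_stereo(video_path: str, wav_path: str, sample_rate: int = 44100) -> Path:
    """Run ffmpeg; requires the ffmpeg binary to be installed and on PATH."""
    subprocess.run(build_extract_command(video_path, wav_path, sample_rate), check=True)
    return Path(wav_path)
```

Calling `extract_stereo("episode.mp4", "episode.wav")` would write the audio track that serves as the first binaural audio signal.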

102. And inputting the first binaural audio signal into a preset deep learning model to obtain a second binaural audio signal with surround sound effect.

A suitable deep learning model can be selected in advance and trained on original binaural audio signals collected with an audio processing tool, together with target binaural audio signals with a surround sound effect obtained by preprocessing, so that the trained model outputs a binaural audio signal with a surround sound effect. The user can therefore obtain the listening experience of surround sound equipment, such as a 5.1-channel system, using only two-channel audio equipment such as headphones. This is convenient and fast, greatly reduces equipment cost, and widens the range of audio content that the user can enjoy with a surround effect.

As a possible implementation of the above embodiment, referring to Fig. 5, the deep learning model may be constructed based on a CNN (convolutional neural network) and may include an encoder and a decoder connected through a long short-term memory (LSTM) network. The model adopts a U-Net-shaped encoder-decoder structure, in which the arrows in the figure represent the skip connections of the U-Net and the encoder and decoder are connected through the LSTM. The encoder layers are numbered from 1 to L and the decoder layers from L to 1, so the parameter L controls the depth of the model. The loss function of the deep learning model may use the short-time Fourier transform and its inverse, so that the network converges in both the time domain and the frequency domain and both domains are optimized simultaneously. The loss function may take the following form:

loss = (out - in) + [STFT(out) - STFT(in)] + [ISTFT(out) - ISTFT(in)]

where in represents the input, out represents the output, and STFT and ISTFT represent the short-time Fourier transform and its inverse, respectively.
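The time-domain and frequency-domain terms of such a loss can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the text does not specify a norm, so an L1 (mean absolute) error is assumed; the frequency term compares STFT magnitudes; the window length, hop size and the names `stft` and `conversion_loss` are hypothetical; and the ISTFT term is left out of the sketch.

```python
import numpy as np


def stft(x: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Short-time Fourier transform of a 1-D signal using a Hann window."""
    window = np.hanning(n_fft)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def conversion_loss(out: np.ndarray, ref: np.ndarray) -> float:
    """Time-domain L1 error plus STFT-magnitude L1 error.

    out, ref: float arrays of shape (2, n_samples), one row per channel.
    """
    time_term = np.mean(np.abs(out - ref))
    freq_term = np.mean([
        np.mean(np.abs(np.abs(stft(o)) - np.abs(stft(r))))
        for o, r in zip(out, ref)
    ])
    return float(time_term + freq_term)
```

Combining both terms penalizes waveform differences and spectral differences at the same time, which is the stated purpose of optimizing in the time domain and the frequency domain together.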

It will be appreciated that those skilled in the art may also use other neural network models to construct the deep learning model, and the invention is not limited herein.

The training process of the deep learning model shown with reference to fig. 3 may include the steps of:

301. The collected original binaural audio signals are preprocessed to obtain target binaural audio signals with a surround sound effect.

Specifically, the process of acquiring the target binaural audio signal with reference to fig. 4 may include the steps of:

401. The original binaural audio signal is upmixed onto the surround sound channels to obtain a surround sound audio signal.

Taking 5.1 channels as an example, the 5 refers to the center channel, the front left and front right channels, and the rear left and rear right surround channels, and the .1 refers to the subwoofer channel.

The original binaural audio signal is extracted from videos collected from a video platform, such as television series, movies, variety shows, cartoons and documentaries. The original audio data may then be separated into a high-frequency signal and a low-frequency signal using an IIR (infinite impulse response) filter. The boundary between the two is determined by the frequency range of the subwoofer audio contained in the original binaural audio signal: the filter is set according to that frequency so that it separates the low-frequency signal, which is the subwoofer audio, from the high-frequency signal.
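The low/high split can be illustrated with a very simple IIR filter. The sketch below assumes a one-pole low-pass filter and a hypothetical 120 Hz cutoff; the text does not specify the filter order or the cutoff frequency, and a practical crossover would likely use a steeper filter. The function name `split_subwoofer` is also an assumption.

```python
import numpy as np


def split_subwoofer(stereo: np.ndarray, sample_rate: int, cutoff_hz: float = 120.0):
    """Split a (2, n_samples) signal into (high, low) parts with a one-pole IIR low-pass.

    Recurrence: low[n] = a * x[n] + (1 - a) * low[n - 1]; high = x - low.
    """
    a = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)  # smoothing coefficient
    low = np.zeros_like(stereo, dtype=float)
    state = np.zeros(stereo.shape[0])          # previous output, one value per channel
    for n in range(stereo.shape[1]):
        state = a * stereo[:, n] + (1.0 - a) * state
        low[:, n] = state
    high = stereo - low                        # complementary high-frequency part
    return high, low
```

Because the high part is defined as the residual, the two parts always sum back to the original signal.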

The high-frequency signal is then separated into a direct sound signal and a reverberant signal using the WPE (Weighted Prediction Error) algorithm. Using HRTFs (head-related transfer functions), the direct sound signal is convolved with the HRTF data corresponding to the front left channel, the center channel and the front right channel respectively, to obtain the front left channel audio, the center channel audio and the front right channel audio; the reverberant signal is convolved with the HRTF data corresponding to the rear left channel and the rear right channel respectively, to obtain the rear left channel audio and the rear right channel audio. An HRTF is the frequency-domain acoustic transfer function from a sound source to the two ears under free-field conditions. It characterizes the combined filtering effect of human physiological structures (including the head, the pinnae and so on) on sound waves, contains the main information used for sound source localization, and can be used to generate virtual surround sound.
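The convolution step can be sketched as follows. The sketch does not implement WPE: the direct and reverberant signals are assumed to be already separated and, for simplicity, are treated as mono arrays. The impulse responses (time-domain counterparts of the HRTFs) are placeholders that would come from a measured HRTF data set, and the channel names and function names are hypothetical.

```python
import numpy as np

FRONT = ("front_left", "center", "front_right")   # rendered from the direct sound
REAR = ("rear_left", "rear_right")                # rendered from the reverberant sound


def render_channel(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with a left/right impulse-response pair -> (2, n_samples)."""
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])


def upmix(direct: np.ndarray, reverb: np.ndarray, hrirs: dict) -> dict:
    """direct, reverb: 1-D arrays; hrirs: channel name -> (hrir_left, hrir_right)."""
    channels = {name: render_channel(direct, *hrirs[name]) for name in FRONT}
    channels.update({name: render_channel(reverb, *hrirs[name]) for name in REAR})
    return channels
```

Each rendered surround channel carries a left-ear and a right-ear signal, which is what makes a later downmix to two channels possible.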

402. The surround sound audio signal is downmixed onto the channels of the original binaural audio signal to obtain a target binaural audio signal.

The 5.1-channel audio obtained in the above steps may be downmixed into two-channel audio that has a 5.1-channel surround sound effect and serves as the target binaural audio data. Each channel of the 5.1-channel audio obtained in the above steps contains a left channel and a right channel, like the original binaural audio. Downmixing takes the average of the left-channel audio of all channels as the left-channel audio of the target binaural audio, and the average of the right-channel audio of all channels as the right-channel audio of the target binaural audio.
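The averaging described above amounts to a mean over the channel axis. In this minimal sketch the surround channels are assumed to be held in a dictionary that maps a channel name to a (2, n_samples) array; that layout and the function name are assumptions for illustration.

```python
import numpy as np


def downmix_to_stereo(channels: dict) -> np.ndarray:
    """Average the left and right signals across all surround channels."""
    stacked = np.stack(list(channels.values()))   # (n_channels, 2, n_samples)
    return stacked.mean(axis=0)                   # (2, n_samples): row 0 left, row 1 right
```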

302. The original binaural audio signal is input to the encoder, and the feature extraction result output by the encoder is processed by the long short-term memory network and the decoder to obtain a model output audio signal. A loss function is calculated based on the model output audio signal and the target binaural audio signal, and the deep learning model is obtained when the value of the loss function is not greater than a preset value. The preset value can be set according to actual application requirements and is not limited herein.

When the original binaural audio signal and the target binaural audio signal are input into the deep learning network for end-to-end training, a series of data expansion (augmentation) methods can be used, including random drift, frequency band masking and signal reverberation. One or more of these methods may be applied to improve model performance and generalization capability. To accelerate training, both the model and the data can be run on a GPU.

In a specific implementation, the target binaural audio signal is kept unchanged while random transformations are applied to the original binaural audio, which increases the richness of the data and improves the generalization capability of the model. Random drift means that, with a certain probability, a small forward or backward shift in the time dimension is added to the original audio. Frequency band masking means that, with a certain probability, a band-stop filter is used to filter out a randomly chosen frequency band of the original audio. Signal reverberation means that, with a certain probability, some reverberation noise is added to the original audio samples. All three methods can be applied at random when data is read during training, which greatly enriches the training data.
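The three transformations can be sketched with NumPy as follows. All parameter values (maximum shift, band width, echo delay and decay, probability) and the function names are assumptions; the band-stop filter is approximated by zeroing FFT bins, and the reverberation is approximated by a single attenuated echo, which is much cruder than a real room response. The input is assumed to be a float array of shape (2, n_samples) that is longer than the shift and delay values.

```python
import numpy as np


def random_shift(x, rng, max_shift=64):
    """Shift the signal forward or backward in time, zero-filling the gap."""
    k = int(rng.integers(-max_shift, max_shift + 1))
    y = np.zeros_like(x)
    if k > 0:
        y[:, k:] = x[:, :-k]
    elif k < 0:
        y[:, :k] = x[:, -k:]
    else:
        y[:] = x
    return y


def band_mask(x, rng, sample_rate, width_hz=500.0):
    """Zero a randomly placed frequency band (an FFT-domain band-stop)."""
    spec = np.fft.rfft(x, axis=1)
    freqs = np.fft.rfftfreq(x.shape[1], d=1.0 / sample_rate)
    start = rng.uniform(0.0, max(freqs[-1] - width_hz, 0.0))
    spec[:, (freqs >= start) & (freqs < start + width_hz)] = 0.0
    return np.fft.irfft(spec, n=x.shape[1], axis=1)


def add_reverb(x, rng, decay=0.3, delay=200):
    """Add one attenuated, delayed copy of the signal as a crude reverberation."""
    y = x.copy()
    y[:, delay:] += decay * x[:, :-delay]
    return y


def augment(x, rng, sample_rate, p=0.5):
    """Apply each transform independently with probability p; the target is left untouched."""
    if rng.random() < p:
        x = random_shift(x, rng)
    if rng.random() < p:
        x = band_mask(x, rng, sample_rate)
    if rng.random() < p:
        x = add_reverb(x, rng)
    return x
```

Calling `augment` each time a training example is read yields a different variant of the same original audio on every pass.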

To further optimize the technical solution, in some embodiments of the present invention, enhancement processing may be performed on the obtained second binaural audio signal to enhance the sense of thickness and the sense of space.

Specifically, as shown in fig. 2, the flow of the enhancement process may include the following steps:

201. The delayed second binaural audio signal is accumulated with the second binaural audio signal to obtain a third binaural audio signal.

202. The volume of the third binaural audio signal is scaled to match that of the first binaural audio signal to obtain a fourth binaural audio signal.

203. The fourth binaural audio signal is stored at the sampling frequency of the first binaural audio signal.

Specifically, the delayed second binaural audio signal may be accumulated with the undelayed second binaural audio signal to obtain a third binaural audio signal. The volume of the third binaural audio signal may be scaled to the same size as the first binaural audio signal in order to maintain a consistent original audio volume level, resulting in a fourth binaural audio signal. For example, the volume is scaled using the following formula:

out = out / out_max * in_max

where in is the volume of the original binaural audio signal and out is the volume of the final output target binaural audio signal; in_max is the maximum volume of the original binaural audio signal, and out_max is the maximum volume of the target binaural audio signal. Finally, the fourth binaural audio signal is stored at the sampling frequency of the first binaural audio signal. The audio signal in the video may then be replaced with the enhanced audio signal using the ffmpeg tool.
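Steps 201 and 202 can be sketched as follows. The delay length (441 samples, about 10 ms at 44.1 kHz) and the function name are assumptions, since the text does not give a delay value, and "maximum volume" is interpreted here as the peak absolute sample value.

```python
import numpy as np


def enhance(converted: np.ndarray, original: np.ndarray, delay_samples: int = 441) -> np.ndarray:
    """Add a delayed copy of the converted signal to itself, then rescale the peak.

    converted, original: float arrays of shape (2, n_samples), n_samples > delay_samples > 0.
    """
    delayed = np.zeros_like(converted)
    delayed[:, delay_samples:] = converted[:, :-delay_samples]
    third = converted + delayed                  # step 201: accumulate
    peak_out = np.max(np.abs(third))
    peak_in = np.max(np.abs(original))
    if peak_out == 0.0:
        return third                             # silent signal: nothing to scale
    return third / peak_out * peak_in            # step 202: out = out / out_max * in_max
```

The result would then be written to disk at the sampling frequency of the first binaural audio signal (step 203).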

Based on the same design concept, and referring to Fig. 6, an embodiment of the present invention further provides an audio signal conversion apparatus, which may implement each step of the above audio signal conversion method during operation, and may include:

an audio signal acquisition module 601 for acquiring a first binaural audio signal to be converted, and

the audio signal conversion module 602 is configured to input the first binaural audio signal into a preset deep learning model, and obtain a second binaural audio signal with a surround sound effect.

Further, the audio signal conversion apparatus further includes:

and the enhancement processing module is used for carrying out enhancement processing on the obtained second double-channel audio signal so as to enhance the sense of thickness and the sense of space.

Further, the enhancement processing module is specifically configured to:

and accumulating the second double-channel audio signal after the delay processing with the second double-channel audio signal to obtain a third double-channel audio signal.

And scaling the volume of the third binaural audio signal to be the same as the size of the first binaural audio signal to obtain a fourth binaural audio signal.

Further, the audio signal conversion apparatus further includes: and the model training module is used for preprocessing the collected original double-channel audio signals to obtain target double-channel audio signals with surround stereo sound effects.

The model training module is further used for inputting the original binaural audio signal to the encoder, processing the feature extraction result output by the encoder through the long short-term memory network and the decoder to obtain a model output audio signal, calculating a loss function based on the model output audio signal and the target binaural audio signal, and obtaining the deep learning model when the value of the loss function is not greater than a preset value.

Further, the audio signal conversion apparatus further includes: the data expansion module is used for carrying out data expansion on the original binaural audio signal, and the data expansion mode comprises the following steps: random drift, band masking, signal reverberation.

Further, the model training module is specifically configured to upmix the original binaural audio signal onto a surround sound channel to obtain a surround sound audio signal.

The surround sound audio signal is then downmixed onto the channels of the original binaural audio signal to obtain the target binaural audio signal.

Further, the model training module is also specifically configured to separate the original binaural audio signal into an original channel audio signal and a subwoofer audio signal.

The original channel audio signal is separated into a direct sound signal and a reverberant signal.

The direct sound signal is convolved with the head related transfer function data of the corresponding channel of the surround sound, respectively, and the reverberant signal is convolved with the head related transfer function data of the remaining channels of the surround sound, respectively.

Further, the model training module is further specifically configured to take the average value of the left channel audio of each channel of audio in the surround sound audio signal as the left channel audio of the target binaural audio signal.

Taking the average value of right channel audio of each channel of audio in the surround sound audio signal as the right channel audio of the target double channel audio signal.

The audio signal conversion device has the same beneficial effects as the above audio signal conversion method, and the specific implementation manner can refer to the embodiment of the above audio signal conversion method, and the disclosure is not repeated here.
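The relationship between the modules can be illustrated with a small composition sketch. The class name, field names and the plain-list signal type are assumptions made for this example; in practice the conversion module would wrap the trained deep learning model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Signal = Sequence[Sequence[float]]  # two rows: left-channel and right-channel samples


@dataclass
class AudioSignalConversionApparatus:
    acquire: Callable[[str], Signal]                  # acquisition module 601
    convert: Callable[[Signal], Signal]               # conversion module 602 (wraps the model)
    enhance: Optional[Callable[[Signal, Signal], Signal]] = None  # optional enhancement module

    def run(self, source: str) -> Signal:
        first = self.acquire(source)                  # first binaural audio signal
        second = self.convert(first)                  # second binaural audio signal
        if self.enhance is None:
            return second
        return self.enhance(second, first)            # enhancement also needs the first signal
```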

Referring to fig. 7, an embodiment of the present invention also provides an apparatus, which may include: a memory 701 and a processor 702.

A memory 701 for storing a program.

A processor 702 for executing the program, and implementing the steps of the audio signal conversion method as described in the above embodiment.

For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present invention is not limited by the order of acts, as some steps may, in accordance with the present invention, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar, the embodiments may be referred to one another. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant points can be found in the description of the method embodiments.

The steps in the method of each embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

The modules and the submodules in the device and the terminal of the embodiments of the invention can be combined, divided and deleted according to actual needs.

The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional module or sub-module in the embodiments of the present invention may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An audio signal conversion method, comprising:

acquiring a first binaural audio signal to be converted;

2. The method as recited in claim 1, further comprising:

3. The method according to claim 2, wherein said enhancing the resulting second dual-channel audio signal comprises:

4. The method of claim 1, wherein the deep learning model comprises: the encoder and the decoder are connected based on a long-term and short-term memory network, and the training process of the deep learning model comprises the following steps:

5. The method of claim 4, further comprising, prior to inputting the original binaural audio signal to the encoder:

6. The method of claim 4, wherein preprocessing the collected original binaural audio signal to obtain a target binaural audio signal having surround sound effects, comprises:

7. The method of claim 6, wherein upmixing the original binaural audio signal onto a surround sound channel to obtain a surround sound audio signal comprises:

8. The method of claim 6, wherein said downmixing the surround sound audio signal onto channels of the original binaural audio signal to obtain the target binaural audio signal comprises:

9. An audio signal conversion apparatus, comprising:

10. An apparatus, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the respective steps of the audio signal conversion method according to any one of claims 1 to 8.