CN114501297B - Audio processing method and electronic equipment - Google Patents

Audio processing method and electronic equipment

Info

Publication number
CN114501297B
Authority
CN
China
Prior art keywords
audio
left channel
channel
right channel
target
Prior art date
Legal status
Active
Application number
CN202210344790.XA
Other languages
Chinese (zh)
Other versions
CN114501297A (en)
Inventor
许剑峰
胡贝贝
夏日升
Current Assignee
Beijing Honor Device Co Ltd
Original Assignee
Beijing Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Honor Device Co Ltd
Priority to CN202210344790.XA
Publication of CN114501297A
Application granted
Publication of CN114501297B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • H04S7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The application provides an audio processing method and an electronic device. The method includes: determining the position parameters corresponding to each audio object, where the position parameters at least include a parameter characterizing the elevation angle of the audio object relative to a listener, and each audio object corresponds to one audio source; then determining the left channel target audio and the right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio, and the position parameters corresponding to each audio object; and finally superposing the left channel target audios to obtain the left channel output audio, and superposing the right channel target audios to obtain the right channel output audio. With the technical solution of the embodiments of the application, the audio can convey to the listener the height of each audio object relative to the listener and the height differences between different audio objects, improving the sense of space during audio playback.

Description

Audio processing method and electronic equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to an audio processing method and an electronic device.
Background
Current electronic devices usually have an audio playing function and output different audio signals through different sound channels when playing audio, so as to improve the sense of space of the audio heard by a user. For example, when an electronic device is connected to earphones, different audio is output through the left channel and the right channel of the earphones, and the sense of space of the audio heard by the user is improved by adding surround within the horizontal plane during playback.
However, at present, when sound is played through earphones, a user can only perceive that the left channel audio and the right channel audio differ in the horizontal direction; there is no difference in height. As a result, the sense of space is poor when audio is played through earphones.
Disclosure of Invention
In order to solve the above problems, the present application provides an audio processing method and an electronic device, so as to improve a spatial sense during audio playing.
In a first aspect, the present application provides an audio processing method, including: first, determining the position parameters corresponding to each audio object, where the position parameters at least include a parameter characterizing the elevation angle of the audio object relative to a listener, and each audio object corresponds to one audio source; then, determining the left channel target audio and the right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio, and the position parameters corresponding to each audio object; and finally, superposing the left channel target audios to obtain the left channel output audio, and superposing the right channel target audios to obtain the right channel output audio.
With the technical solution of the embodiments of the application, a parameter characterizing the elevation angle of the audio object relative to a listener is set for each audio object, adding position information in the height direction to the audio of each audio object. Because the audio of each audio object carries information in the height direction, during playback a listener can perceive not only the height of each audio object relative to the listener but also the height differences between different audio objects, improving the sense of space during audio playback.
In a possible implementation manner, determining a left channel target audio of each audio object and a right channel target audio of each audio object according to an audio of each audio object in a left channel original audio, an audio of each audio object in a right channel original audio, and a position parameter corresponding to each audio object specifically includes:
extracting the audio of each audio object in the original audio of the left channel to obtain single-object audio of the left channel, and extracting the audio of each audio object in the original audio of the right channel to obtain single-object audio of the right channel; synthesizing a single-channel signal of each audio object according to each left-channel single-object audio and each right-channel single-object audio; and determining the left channel target audio of each audio object and the right channel target audio of each audio object according to the single-channel signal of each audio object and the position parameter corresponding to each audio object.
In a possible implementation manner, extracting the audio of each audio object in the left channel original audio to obtain a left channel single object audio, and extracting the audio of each audio object in the right channel original audio to obtain a right channel single object audio specifically includes:
firstly, extracting the audio of different types of audio objects in the original audio of a left channel to obtain a single-type audio of the left channel, and extracting the audio of different types of audio objects in the original audio of a right channel to obtain a single-type audio of the right channel; and then extracting the audio of each audio object in the left channel single-type audio to obtain a left channel single-object audio, and extracting the audio of each audio object in the right channel single-type audio to obtain a right channel single-object audio.
In a possible implementation manner, n left channel single object audios and n right channel single object audios are provided, where n is a positive integer, and synthesizing a mono signal of each audio object according to each left channel single object audio and each right channel single object audio specifically includes:
determining a correlation of an ith left-channel single-object audio and a jth right-channel single-object audio, wherein i =1,2, …, n, j =1,2, …, n, and the correlation of the ith left-channel single-object audio and the jth right-channel single-object audio is used for determining whether the ith left-channel single-object audio and the jth right-channel single-object audio correspond to the same audio object; and when the correlation degree of the ith left channel single-object audio and the jth right channel single-object audio is greater than a preset correlation degree threshold value, synthesizing the ith left channel single-object audio and the jth right channel single-object audio into an audio object single-channel signal.
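As an illustration of the pairing rule above, the following Python sketch pairs left channel and right channel single-object audios by normalized cross-correlation and averages each matched pair into a mono signal. The correlation measure, the threshold value, and all names are assumptions for illustration; the patent does not fix them.

```python
import numpy as np

def correlation(x: np.ndarray, y: np.ndarray) -> float:
    # Normalized cross-correlation, one plausible correlation measure.
    x = x - x.mean()
    y = y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

def pair_single_object_audios(left_objs, right_objs, threshold=0.6):
    # When the i-th left and j-th right single-object audios correlate
    # above the preset threshold, treat them as the same audio object
    # and synthesize a mono signal (here: a simple average).
    mono_signals = []
    for left in left_objs:
        for right in right_objs:
            if correlation(left, right) > threshold:
                mono_signals.append(0.5 * (left + right))
    return mono_signals
```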
In a possible implementation manner, before determining the left channel target audio of each audio object and the right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio, and the position parameter corresponding to each audio object, the method further includes:
acquiring original audio; and respectively extracting the audio of a left channel and the audio of a right channel in the original audio to obtain the original audio of the left channel and the original audio of the right channel.
In a possible implementation manner, when extracting audio of different types of audio objects in a left channel original audio to obtain a left channel mono-type audio and obtain a left channel first background sound, and extracting audio of different types of audio objects in a right channel original audio to obtain a right channel mono-type audio and obtain a right channel first background sound, overlapping the left channel target audio to obtain a left channel output audio, and overlapping the right channel target audio to obtain a right channel output audio, specifically includes:
superposing each left channel target audio and the left channel output background sound to obtain a left channel output audio, and superposing each right channel target audio and the right channel output background sound to obtain a right channel output audio; wherein the left channel output background sound comprises a left channel first background sound and the right channel output background sound comprises a right channel first background sound.
Audio other than the individual left channel single-type audios may be present in the left channel original audio. For example, when a piece of audio has no features that distinguish it as any type of audio object, or when its amplitude is too small (i.e., its energy is too low), it may be treated as part of the left channel first background sound. The same holds for the right channel first background sound.
Superposing the left channel first background sound when obtaining the left channel output audio allows the sound in the left channel original audio to be restored faithfully; the same holds for obtaining the right channel output audio.
In a possible implementation manner, when the audio of different types of audio objects in the left channel original audio is extracted to obtain a left channel single-type audio and obtain a left channel first background sound, the audio of different types of audio objects in the right channel original audio is extracted to obtain a right channel single-type audio and obtain a right channel first background sound, the audio of each audio object in the left channel single-type audio is extracted to obtain a left channel single-object audio and obtain a left channel second background sound, and the audio of each audio object in the right channel single-type audio is extracted to obtain a right channel single-object audio and obtain a right channel second background sound, the left channel output background sound further includes a left channel second background sound, and the right channel output background sound further includes a right channel second background sound.
The description of the left channel first background sound and the right channel first background sound is the same as above.
Audio other than the individual left channel single-object audios may be present in the left channel single-type audio. For example, when a piece of audio has no features that distinguish it as a particular audio object, or when its amplitude is too small (i.e., its energy is too low), it may be treated as part of the left channel second background sound. The same holds for the right channel second background sound.
Superposing the left channel first background sound and the left channel second background sound when obtaining the left channel output audio allows the sound in the left channel original audio to be restored faithfully; the same holds for obtaining the right channel output audio.
In a possible implementation manner, when extracting audio of each audio object in the left channel single-type audio to obtain a left channel single-object audio and obtain a left channel second background sound, and extracting audio of each audio object in the right channel single-type audio to obtain a right channel single-object audio and obtain a right channel second background sound, the left channel target audios are overlapped to obtain a left channel output audio, and the right channel target audios are overlapped to obtain a right channel output audio, specifically including:
superposing each left channel target audio and the left channel output background sound to obtain a left channel output audio, and superposing each right channel target audio and the right channel output background sound to obtain a right channel output audio; wherein the left channel output background sound comprises a left channel second background sound, and the right channel output background sound comprises a right channel second background sound.
The description of the left channel second background sound and the right channel second background sound is the same as above.
Superposing the left channel second background sound when obtaining the left channel output audio allows the sound in the left channel original audio to be restored relatively faithfully; the same holds for obtaining the right channel output audio.
In one possible implementation, the location parameters further include at least one of:
a parameter characterizing the horizontal azimuth of the audio object with respect to the listener; or
a parameter characterizing the distance of the audio object with respect to the listener.
The parameters characterizing the horizontal azimuth of the audio object with respect to the listener can provide position information of each audio object with respect to the listener in a direction parallel to the horizontal plane, such that the listener perceives the position of the audio object in the direction parallel to the horizontal plane when the audio is played. Specifically, when playing audio, a listener can not only feel the horizontal orientation information of each audio object relative to the listener, but also can feel the horizontal orientation difference between different audio objects, thereby further improving the spatial impression when playing audio.
The parameters for characterizing the distance of the audio objects relative to the listener can provide distance information of each audio object relative to the listener, so that the listener feels the distance relationship between the audio objects and the listener when the audio is played. Specifically, when playing audio, a listener can not only feel the distance relationship between each audio object and the listener, but also can feel the distance difference between different audio objects and the listener, thereby further improving the spatial impression when playing audio.
In a possible implementation, when the position parameters further comprise a parameter characterizing a horizontal azimuth angle of the audio object with respect to the listener, and a parameter characterizing a distance of the audio object with respect to the listener,
determining a left channel target audio of each audio object and a right channel target audio of each audio object according to the mono signal of each audio object and the position parameter corresponding to each audio object, specifically comprising:
respectively determining Head Related Transfer Functions (HRTFs) corresponding to the audio objects based on the position parameters corresponding to the audio objects;
and respectively processing the single-channel signals of the audio objects by using the HRTFs corresponding to the audio objects to obtain the left-channel target audio of the audio objects and the right-channel target audio of the audio objects.
Processing the mono signal of each audio object with its head-related transfer function (HRTF) yields the left channel target audio and the right channel target audio of each audio object with high accuracy. That is, audio matching the corresponding position parameters can be obtained more precisely, so that when the audio is played, the positions the listener perceives agree more closely with the positions the parameters represent.
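As a sketch of this step, assuming time-domain head-related impulse responses (HRIRs) as the HRTF representation; the lookup of an HRIR pair for given position parameters is outside the patent text and left implicit here:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_hrtf(mono: np.ndarray, hrir_left: np.ndarray,
               hrir_right: np.ndarray):
    # Convolve the audio object's mono signal with the left-ear and
    # right-ear impulse responses selected for its elevation, horizontal
    # azimuth and distance, yielding the left/right channel target audio.
    left_target = fftconvolve(mono, hrir_left, mode="full")[:len(mono)]
    right_target = fftconvolve(mono, hrir_right, mode="full")[:len(mono)]
    return left_target, right_target
```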
In a second aspect, the present application provides an electronic device comprising a processor and a memory, wherein the memory stores codes, and the processor is configured to call the codes stored in the memory to execute any one of the audio processing methods described above.
With the technical solution of the embodiments of the application, a parameter characterizing the elevation angle of the audio object relative to a listener is set for each audio object, adding position information in the height direction to the audio of each audio object. Because the audio of each audio object carries information in the height direction, during playback a listener can perceive not only the height of each audio object relative to the listener but also the height differences between different audio objects, improving the sense of space during audio playback.
Drawings
Fig. 1A is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 1B is a schematic structural diagram of a mobile phone according to an embodiment of the present application;
fig. 2 is a flowchart of an audio processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of an audio processing method according to another embodiment of the present application;
fig. 4A is a schematic diagram of a 3D audio playing APP main interface provided in the embodiment of the present application;
fig. 4B is a schematic diagram of a song selection interface for 3D audio playback APP according to an embodiment of the present application;
fig. 5A is a schematic diagram of audio obtained by extracting different types of audio objects in original audio of a left channel according to an embodiment of the present application;
FIG. 5B is a schematic diagram of audio extraction for different types of audio objects in original audio of a left channel according to another embodiment of the present application;
fig. 6A is a schematic diagram of extracting audio of different audio objects in a left channel single-type audio according to an embodiment of the present application;
FIG. 6B is a schematic diagram illustrating audio extraction of different audio objects in a single type of audio for a left channel according to another embodiment of the present application;
fig. 7A is a schematic diagram of a horizontal azimuth of an audio object according to an embodiment of the present application;
FIG. 7B is a diagram illustrating the correspondence between left and right channel energies and horizontal azimuth angles;
FIG. 8 is a schematic diagram of an elevation angle and a distance of an audio object provided by an embodiment of the present application;
FIG. 9A is a schematic diagram of a 3D audio object analysis and setup interface according to an embodiment of the present application;
FIG. 9B is a schematic diagram of another interface for 3D audio object analysis and setting provided in the embodiments of the present application;
fig. 9C is a schematic diagram of a 3D audio karaoke interface provided in an embodiment of the present application;
fig. 10A is a schematic diagram of obtaining mono target audio of an audio object by using a head related transfer function HRTF according to an embodiment of the present application;
FIG. 10B is a schematic diagram of obtaining a mono target audio according to an embodiment of the present application;
FIG. 10C is a schematic diagram of obtaining a mono target audio according to another embodiment of the present application;
fig. 11 is a schematic diagram of obtaining mono output audio according to an embodiment of the present application.
Detailed Description
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, illustration, or description. Any embodiment or design described as "exemplary" or "for example" in this application is not to be construed as preferred or advantageous over other embodiments or designs. Rather, the words "exemplary" and "for example" are intended to present related concepts in a concrete fashion.
The terms "first", "second", and the like in the description of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated.
In the present application, unless expressly stated or limited otherwise, the term "coupled" is to be construed broadly; for example, a coupling may be a fixed connection, a removable connection, or an integral part, and may be a direct connection or an indirect connection through an intermediary.
The terminology used in the description of the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. The present application is described in detail below with reference to the accompanying drawings.
In order to facilitate understanding of technical solutions provided in the embodiments of the present application, terms common to the embodiments of the present application are described below.
A Recurrent Neural Network (RNN) is a neural network used to process sequence data. Compared with an ordinary neural network, it can handle data whose meaning depends on order. For example, the meaning of a word may differ depending on the preceding context; an RNN can handle such problems well.
Long short-term memory (LSTM) is a special kind of RNN, designed mainly to address the vanishing-gradient and exploding-gradient problems encountered when training on long sequences. In short, LSTM performs better than an ordinary RNN on longer sequences.
The following describes a configuration of an electronic device in the present embodiment.
Referring to fig. 1A, fig. 1A is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 1A, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include a Central Processing Unit (CPU), an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons in a human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The electronic apparatus 100 can listen to music through the speaker 170A or listen to a handsfree call.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic apparatus 100 receives a call or voice information, it is possible to receive voice by placing the receiver 170B close to the human ear.
The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal into the microphone 170C by speaking with the mouth close to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.
The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, or may be a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The touch sensor 180K is also called a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
The embodiment of the present application does not specifically limit the type of the electronic device, and the electronic device may be a mobile phone, a notebook computer, a wearable electronic device (e.g., a smart watch), a tablet computer, an Augmented Reality (AR) device, a Virtual Reality (VR) device, or the like.
Referring to fig. 1B, fig. 1B is a schematic structural diagram of a mobile phone according to an embodiment of the present disclosure.
Fig. 1B shows an xoy plane of the mobile phone provided in the embodiment of the present application, for example, the xoy plane shown in the figure may be a front surface of the mobile phone.
At present, electronic devices such as mobile phones generally have an audio playing function, and it is increasingly common for users to play audio through earphones connected to the electronic device. Usually, when playing audio, different audio signals are output through different sound channels to improve the sense of space of the audio heard by the user. For example, different audio is output through the left and right channels of the earphones, and the sense of space is improved by adding surround within the horizontal plane during playback.
However, at present, when sound is played through earphones, a user can only perceive that the left channel audio and the right channel audio differ in the horizontal direction; there is no difference in height. As a result, the sense of space is poor when the user plays audio through earphones.
With the technical solution of the embodiments of the present application, the audio of each audio object is extracted from the original stereo audio, each audio object corresponding to one sound source; a parameter characterizing the elevation angle of the audio object relative to a listener is set for each audio object, adding position information in the height direction to the audio of each audio object; the left channel target audio and the right channel target audio of each audio object are then determined according to the left channel audio, the right channel audio, and the position parameters of each audio object; finally, the left channel target audios of the audio objects are superposed to obtain the left channel output audio, and the right channel target audios of the audio objects are superposed to obtain the right channel output audio. Since the audio of each audio object carries information in the height direction, a listener can perceive not only the height of each audio object relative to the listener but also the height differences between different audio objects, improving the sense of space during audio playback.
Referring to fig. 2, fig. 2 is a flowchart of an audio processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the audio processing method of the present application includes S101-S103.
S101, determining position parameters corresponding to the audio objects, wherein the position parameters at least comprise parameters for representing the elevation angles of the audio objects relative to listeners.
Each audio object corresponds to an audio source.
An audio object refers to a sound source that produces different sounds in the original audio.
For example, if the original audio includes two human voices, one drum sound, and one organ sound, the resulting audio objects are: two people, a drum, and an organ.
The number of audio objects may be one or more.
Audio of an audio object, such as a human voice, a drum sound, a bass sound, a bird song, an airplane sound, and the like.
The audio produced by different types of audio sources typically has different characteristics. For example, human voice and drum sound have different characteristics.
The type of audio object may also be one or more.
For example, if the audio objects include two people, a drum, and an organ, then the types of the audio objects include person, drum, and organ.
Since the audio object is obtained by analyzing and processing the original audio, it may be different from the real sound source. Thus, an audio object can also be understood as a virtual sound source. Here, the virtual sound source and the sound source actually generating the original audio may be the same or different.
Each audio object corresponds to its own position parameters.
A listener is the person (or thing) that hears the audio when it is played.
The audio processing method of the embodiment of the application is used for improving the spatial sense during audio playing, namely improving the spatial sense of sound when a listener hears the processed audio.
The position parameter is used to provide position information of the audio object relative to the listener.
Elevation is used to provide height information of the audio object relative to the listener in a direction perpendicular to the horizontal plane.
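To make the parameter set concrete, the structure below shows one way the per-object position parameters could be carried; the field names and defaults are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class PositionParams:
    # Position of one audio object relative to the listener.
    elevation_deg: float       # elevation angle (height direction), always present
    azimuth_deg: float = 0.0   # horizontal azimuth, optional per the claims
    distance_m: float = 1.0    # distance to the listener, optional per the claims
```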
S102, determining the left channel target audio of each audio object and the right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio and the position parameter corresponding to each audio object.
The left channel original audio refers to audio of a left channel in the original audio, and the right channel original audio refers to audio of a right channel in the original audio.
The original audio refers to audio processed using the audio processing method of the embodiment of the present application.
The original audio may contain a plurality of audio objects.
The audio of each audio object in the original audio of the left channel refers to the audio of each audio object in the audio of the left channel of the original audio; the audio of each audio object in the right channel original audio refers to the audio of each audio object in the audio of the right channel of the original audio.
Each audio object corresponds to its own left channel target audio and right channel target audio.
For an audio object, the left channel target audio and the right channel target audio are determined according to the audio of the audio object in the left channel original audio, the audio of the audio object in the right channel original audio, and the position parameter of the audio object.
For an audio object, the left channel target audio and the right channel target audio of the audio object contain height information of the audio object with respect to a listener. Therefore, when the listener hears the audio of the audio object, the listener can feel the height information of the audio object relative to the listener, and the spatial feeling during audio playing can be improved.
S103, overlapping the left channel target audios to obtain left channel output audios, and overlapping the right channel target audios to obtain right channel output audios.
Because the original audio may contain a plurality of audio objects, the left channel target audios of the audio objects are superposed to obtain the finally output left channel audio, and the right channel target audios of the audio objects are superposed to obtain the finally output right channel audio.
After the target audio of each audio object is superposed, a listener can not only feel the height information of each audio object relative to the listener, but also can feel the height difference between different audio objects, so that the spatial sense during audio playing is improved.
With the technical solution of the embodiments of the present application, the audio of each audio object is extracted from the original stereo audio, each audio object corresponding to one sound source; a parameter characterizing the elevation angle of the audio object relative to a listener is set for each audio object, adding position information in the height direction to the audio of each audio object; the left channel target audio and the right channel target audio of each audio object are then determined according to the left channel audio, the right channel audio, and the position parameters of each audio object; finally, the left channel target audios of the audio objects are superposed to obtain the left channel output audio, and the right channel target audios of the audio objects are superposed to obtain the right channel output audio. Since the audio of each audio object carries information in the height direction, a listener can perceive not only the height of each audio object relative to the listener but also the height differences between different audio objects, improving the sense of space during audio playback.
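Putting S101-S103 together, a minimal sketch of the output stage follows; function and variable names are assumptions, and the per-object target audios are assumed to be equal-length arrays.

```python
import numpy as np

def mix_output(left_targets, right_targets):
    # S103: superpose the per-object left channel target audios into the
    # left channel output audio, and likewise for the right channel.
    left_output = np.sum(left_targets, axis=0)
    right_output = np.sum(right_targets, axis=0)
    return left_output, right_output
```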
The following description is made with reference to specific implementations.
Referring to fig. 3, fig. 3 is a flowchart of an audio processing method according to another embodiment of the present application.
As shown in fig. 3, the audio processing method of the present application includes S201-S210.
S201, original audio is obtained, wherein the original audio comprises a left channel original audio and a right channel original audio.
Raw audio refers to stereo audio data to be processed.
Stereo audio data typically includes multiple channels of audio.
The original audio at least comprises a left channel original audio and a right channel original audio; the left channel original audio is also the audio of the left channel of the original audio, and the right channel original audio is also the audio of the right channel of the original audio.
In one possible implementation, the format of the original audio may be a Pulse Code Modulation (PCM) format, that is, the original audio may be PCM audio data.
PCM audio data is a bare stream of uncompressed audio sample data, which is standard digital audio data converted from an analog signal by sampling, quantization, and encoding.
The source of the original audio is not limited in this embodiment. For example, the sources of the original audio may include: audio files in mp3 format, audio-video files in mp4 format, and the like. The original audio can be obtained by decoding an audio file and an audio-video file.
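As an illustration of splitting stereo PCM into the left channel original audio and the right channel original audio, the sketch below de-interleaves 16-bit signed little-endian samples; the sample format is an assumption, since the patent only states that the original audio may be PCM data.

```python
import numpy as np

def split_stereo_pcm(pcm_bytes: bytes):
    # Interleaved stereo PCM stores samples as L0, R0, L1, R1, ...
    samples = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32) / 32768.0
    left_original = samples[0::2]    # even indices: left channel
    right_original = samples[1::2]   # odd indices: right channel
    return left_original, right_original
```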
In one possible implementation, the target audio source is obtained in response to a target audio source obtaining instruction before the original audio is obtained. The target sound source acquiring instruction may be issued by the user. The target sound source is the source of the original audio. The target sound source is a stereo sound source.
In some possible cases, the original audio also includes audio in addition to the left channel original audio and the right channel original audio. In that case, the audio of every two adjacent channels is processed according to the processing method for the left channel original audio and the right channel original audio in this embodiment, and the results are converted according to the relationship between the channels to obtain the final output. The principle is similar to that of this embodiment and is not repeated here.
The embodiment of the application also provides a 3D (three-dimensional) audio playing application program, and the audio processing method provided by the embodiment of the application can be applied to the 3D audio playing application program.
The 3D audio playing application program in the embodiment of the present application may be an application program installed on the electronic device, for example, the application program may be an application program carried by an electronic device system, an application program provided by a third party such as an application program mall, or an application program acquired in other manners/ways.
In the embodiment of the application, the electronic device is taken as an example of a mobile phone, and a possible interface of a 3D audio playing application program (3D audio playing APP) is described.
Referring to fig. 4A, fig. 4A is a schematic diagram of a 3D audio playing APP main interface provided in an embodiment of the present application.
As shown in fig. 4A, the main interface includes a target sound source display frame 401, and at least the name of a target sound source is displayed in the target sound source display frame 401, where the target sound source is a stereo sound source determined according to a target sound source acquisition instruction of a user.
The format type of the target sound source may be displayed in the target sound source display box 401.
The embodiment of the present application takes a target sound source as an example of a song.
For example, as shown in fig. 4A, "song 1.flac" is displayed in the target sound source display box 401, where "song 1" is the name of the target sound source and "flac" is the format type of the target sound source.
As shown in fig. 4A, words such as "song file" may also be displayed in the target sound source display box 401 to provide more information about the target sound source.
A plurality of virtual operation buttons may also be included in the main interface.
For example, as shown in fig. 4A, the main interface also includes the following virtual buttons: a replace song virtual button 402, a 3D audio object analysis and setting virtual button 403, a 3D audio play virtual button 404, and a 3D audio karaoke virtual button 405.
Further, the virtual operation buttons that the user can operate on the current interface are displayed in black, and the virtual operation buttons that the user cannot operate on the current interface are displayed in gray.
The interpretation and description of the audio object will be made in S202.
As shown in fig. 4A, the replace song virtual button 402 and the 3D audio object analysis and setting virtual button 403 are displayed in black, and the 3D audio play virtual button 404 and the 3D audio karaoke virtual button 405 are displayed in gray.
In some possible cases, the main interface also includes virtual buttons for other functions, such as the buttons at the bottom of the main interface shown in fig. 4A: a return virtual button 406 for returning to the previous-level interface, a desktop button 407 for jumping to the desktop of the electronic device, a main interface virtual button 408, and the like.
In some possible cases, the user may click the change song virtual button 402, the main interface jumps to the song selection interface, and the stereo sound source to be processed is selected at the song selection interface.
In some possible cases, the song selection interface displays the names of the stereo sound sources available for the user to select, or one or more other kinds of information about the stereo sound sources, such as the format type, the name of the singer of the song, and the like.
Referring to fig. 4B, fig. 4B is a schematic diagram of a song selection interface for 3D audio playing APP according to an embodiment of the present application.
As shown in fig. 4B, in the song selection interface, "song 1.flac - singer 1" indicates that the name of the stereo sound source is "song 1", the format type is ".flac", and the singer of the song is "singer 1".
"Song 2.mp 3-singer 2", "Song 3. acc-singer 3" and "Song 4-singer 4" in FIG. 4B are similar to the above description.
In some possible cases, the target audio source acquiring instruction of the user is: and clicking the area where the name of the stereo sound source is located on the song selection interface by the user.
That is, in response to the user clicking the area where the name of the stereo sound source is located on the song selection interface, the target sound source is obtained, and the stereo sound source corresponding to the name is the target sound source.
In some possible cases, the audio processing method provided by the embodiment of the present application may also be applied to plug-ins and the like of application programs, so as to increase functions of other application programs of the electronic device.
S202, extracting the audio of different types of audio objects in the left channel original audio and the right channel original audio respectively to obtain a left channel single-type audio and a right channel single-type audio.
The original audio comprises a left channel original audio and a right channel original audio, and the left channel original audio and the right channel original audio are processed respectively.
Extracting the original audio of the left channel to obtain a single-type audio of the left channel; and extracting the original audio of the right channel to obtain the single-type audio of the right channel.
Audio of different types of audio objects means that the audio belongs to different types of audio objects.
The process of extracting the left channel original audio is similar to the process of extracting the right channel original audio, and the process of extracting the left channel original audio is described below as an example, and the process of extracting the right channel original audio is not described again.
And extracting the audio of different audio object types in the original audio of the left channel to obtain the audio of the single type of the left channel.
The left channel original audio refers to audio data of a left channel in the original audio.
The audio of different audio object types in the original audio of the left channel refers to the audio of different types of audio objects in the original audio of the left channel.
Left channel single-type audio is audio data containing only one type of audio object. A left channel single-type audio may contain the audio of one audio object or the audio of several audio objects of that type.
For example, the number of the left channel single-type audio obtained according to the left channel original audio is two, namely, the human voice of the left channel and the drum sound of the left channel; the type of the sound source of the human voice of the left channel is human, and the type of the sound source of the drum voice of the left channel is drum. The human voice of the left channel may include a first human voice and a second human voice, and the drum voice of the left channel may include a first drum voice.
The following is an implementation manner for extracting different types of audio objects in original audio of a left channel to obtain a single type of audio of the left channel provided in the embodiment of the present application.
The principle of extracting the audio of the audio objects of different types in the original audio of the right channel to obtain the single-type audio of the right channel is similar, and the description is omitted here.
Referring to fig. 5A, fig. 5A is a schematic diagram of audio frequency extraction for different types of audio objects in original audio frequency of a left channel according to an embodiment of the present application.
As shown in fig. 5A, the time domain signal of the original audio of the left channel is transformed into the frequency domain signal of the original audio of the left channel after time-frequency transformation.
In some possible implementations, the time-frequency Transform may employ Fourier Transform (FT), Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT), Modified Discrete Cosine Transform (MDCT), or the like.
And then inputting the frequency domain signal of the original audio of the left channel obtained by conversion into a first network to obtain masks of different types of audio objects, such as a human voice mask, a drum voice mask or other object masks.
The first network is obtained through pre-training, and can process the frequency domain signal of the input audio to obtain masks of various different types of audio objects, such as a human voice mask, a drum voice mask, and the like, and possibly masks of other audio objects.
Then, the obtained masks of the different types of audio objects are each multiplied by the frequency domain signal obtained from the time-frequency transformation to obtain the frequency spectrums of the different types of audio objects; the frequency spectrums of the different types of audio objects are then subjected to inverse time-frequency transformation to obtain the time domain signals of the different types of audio objects (the time domain signals of the left channel single-type audios, also simply called the left channel single-type audios).
For example, taking the human voice as an example, the human voice mask is multiplied by the frequency domain signal obtained from the time-frequency transformation to obtain the human voice spectrum, and the human voice spectrum is then subjected to inverse time-frequency transformation to obtain the time domain signal of the human voice.
As shown in fig. 5A, the resulting frequency domain signal of the left channel mono-type audio includes: the human voice spectrum, the drum sound spectrum, etc., and possibly the spectrum of other audio objects.
The human voice frequency spectrum is processed by inverse time-frequency transformation to obtain a human voice time domain signal, the drum sound frequency spectrum is processed by inverse time-frequency transformation to obtain a drum sound time domain signal, and the other audio object frequency spectrums are processed by inverse time-frequency transformation to obtain time domain signals of other audio objects.
The first network may be a Neural Network (NN).
In one possible implementation, the first network may be any one of the following: a long short-term memory (LSTM) network, a convolutional neural network (CNN), a convolutional recurrent network (CRN), and U-Net.
The first network may also be other types of networks.
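A minimal sketch of the fig. 5A pipeline, assuming an STFT as the time-frequency transform and a pretrained first network (`mask_network`, a placeholder) that maps a magnitude spectrogram to one mask per audio-object type:

```python
import numpy as np
from scipy.signal import stft, istft

def extract_single_type_audio(left_original: np.ndarray, mask_network, fs=44100):
    # Time-frequency transform of the left channel original audio.
    _, _, spec = stft(left_original, fs=fs, nperseg=1024)
    # The first network predicts masks, e.g. {"vocals": ..., "drums": ...},
    # with values in [0, 1] and the same shape as the spectrogram.
    masks = mask_network(np.abs(spec))
    single_type_audio = {}
    for name, mask in masks.items():
        # Mask the spectrum, then apply the inverse time-frequency transform.
        _, audio = istft(mask * spec, fs=fs, nperseg=1024)
        single_type_audio[name] = audio
    return single_type_audio
```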
The foregoing describes that the time domain signal of the original audio of the left channel is converted into the frequency domain signal of the original audio of the left channel, and the frequency domain signal of the original audio of the left channel is processed and analyzed, so as to obtain the audio of different types of audio objects (single type audio of the left channel).
In a possible implementation manner, the audio of different types of audio objects (left channel single-type audio) can also be obtained by directly utilizing the time domain signal of the left channel original audio.
Another implementation manner of extracting different types of audio objects in the original audio of the left channel to obtain a single type of audio of the left channel is provided below.
Referring to fig. 5B, fig. 5B is a schematic diagram of audio extraction for different types of audio objects in original audio of a left channel according to another embodiment of the present application.
As shown in fig. 5B, the time domain signal of the original audio of the left channel is input to the second network, and the audio of the audio object of a single type is extracted to obtain the time domain signal of the audio object of different types.
The second network includes an encoding network, a separation network, and a decoding network.
The time domain signal of the original audio of the left channel is input to a coding network to obtain a coded signal.
In one possible implementation, the encoding network may be a neural network.
Then inputting the coded signals into a separation network to obtain a mask matrix; and then multiplying the coded signals by a mask matrix, and inputting the multiplied audio signals into a decoding network for decoding to obtain time domain signals (left channel single-type audio) of the audio of different types of audio objects.
As shown in fig. 5B, the resulting left channel mono-type audio includes: time domain signals of human voice, time domain signals of drum sound and time domain signals of other audio objects.
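A data-flow sketch of the fig. 5B second network; the three sub-networks are placeholders for the trained encoding, separation, and decoding networks (this encoder/separator/decoder pattern also appears in time-domain separators such as Conv-TasNet, which the patent does not name):

```python
def second_network(left_original, encoder, separator, decoder):
    # Fig. 5B: encode -> mask matrix -> multiply -> decode.
    coded = encoder(left_original)       # coded signal
    mask_matrix = separator(coded)       # one mask per audio-object type
    single_type_audio = []
    for mask in mask_matrix:
        # Each masked coded signal is decoded into a time domain signal.
        single_type_audio.append(decoder(coded * mask))
    return single_type_audio
```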
Besides the implementation manners provided in the embodiments of the present application, the single type audio of the left channel may also be obtained through other implementation manners, or multiple schemes may be combined to obtain the single type audio of the left channel, which is not limited in the embodiments of the present application.
The left channel single-type audio is obtained through the above description, and the principle of obtaining the right channel single-type audio is similar, which is not described herein again.
In one possible implementation, when the left channel mono type audio and the right channel mono type audio are extracted, a left channel first background sound and a right channel first background sound may also be obtained.
As shown in fig. 5A, when the time domain signal of the left channel original audio is processed, a left channel first background sound may be obtained in addition to the time domain signal of the human voice and the time domain signal of the drum sound.
The left channel first background sound may be understood as the audio signal in the time domain signal of the left channel original audio other than the time domain signal of the human voice and the time domain signal of the drum sound.
For the method shown in fig. 5B and other methods as well, it is also possible to obtain the left channel first background sound.
The principle of obtaining the right channel first background sound is the same, and is not described herein again.
In a possible implementation manner, after the left channel single-type audio and the right channel single-type audio are obtained, the left channel single-type audio and the right channel single-type audio may be respectively filtered.
The following description will be given by taking the example of screening the left channel single-type audio, and the same principle of screening the right channel single-type audio will not be repeated here.
Specifically, the obtained left channel single-type audio is screened according to the energy of each left channel single-type audio.
In the description of the embodiments of the present application, if not specifically stated, the energy of the audio refers to the magnitude of the amplitude of the audio signal, and is expressed in decibels (dB).
Determining the energy of each left channel single type audio; and when the energy of a certain left channel single-type audio is smaller than a preset energy threshold value, deleting the left channel single-type audio, and adding the left channel single-type audio into the left channel first background sound.
At this time, the number of left channel mono-type audio is reduced by one.
The preset energy threshold may be the same or different for different types of audio objects.
For example, through the process shown in fig. 5A, the obtained left channel single-type audio includes: human voice, drum sound, and piano sound. The obtained left channel single-type audio is screened using a preset energy threshold, such as a first energy threshold.
Specifically, the energy of each left channel single-type audio is determined: the energy of the human voice, the energy of the drum sound, and the energy of the piano sound are each compared with the first energy threshold.
When the energy of the human voice and the energy of the drum sound are both greater than or equal to the first energy threshold, and the energy of the piano sound is less than the first energy threshold, the screened left channel single-type audio includes the human voice and the drum sound, but does not include the piano sound.
For another example, the single-type audio of human voice may be screened using a first energy threshold, the single-type audio of drum sound using a second energy threshold, and the single-type audio of piano sound using a third energy threshold, where the first energy threshold, the second energy threshold, and the third energy threshold differ from one another. When the energy of the human voice is greater than or equal to the first energy threshold, the energy of the drum sound is greater than or equal to the second energy threshold, and the energy of the piano sound is less than the third energy threshold, the screened left channel single-type audio includes the human voice and the drum sound, and does not include the piano sound.
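As an illustrative aid, a minimal Python sketch of this energy-based screening follows, taking energy as the signal amplitude expressed in decibels as described above; the threshold values and source names are illustrative assumptions.

```python
# Sketch of the screening step: drop low-energy single-type audio and fold it
# into the left channel first background sound. Thresholds are illustrative.
import numpy as np

def energy_db(signal):
    """Peak amplitude of the audio signal, in decibels (dB)."""
    return 20.0 * np.log10(np.max(np.abs(signal)) + 1e-12)

def screen_single_type_audio(single_type, background, thresholds_db):
    kept = {}
    for name, signal in single_type.items():
        if energy_db(signal) >= thresholds_db[name]:
            kept[name] = signal
        else:
            background = background + signal   # merge into first background sound
    return kept, background

rng = np.random.default_rng(0)
sources = {"vocals": rng.normal(0, 0.5, 16000),
           "drums":  rng.normal(0, 0.3, 16000),
           "piano":  rng.normal(0, 0.001, 16000)}    # quiet: screened out below
thresholds = {"vocals": -30.0, "drums": -30.0, "piano": -30.0}  # per-type values
kept, bg = screen_single_type_audio(sources, np.zeros(16000), thresholds)
print(sorted(kept))   # ['drums', 'vocals']
```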
The principle of obtaining the right channel first background sound by screening the right channel single-type audio is similar to that of obtaining the left channel first background sound by screening the left channel single-type audio, and the description is omitted here.
Through the above processing procedures, a left channel single-type audio, a left channel first background sound, a right channel single-type audio, and a right channel first background sound are obtained.
S203, respectively extracting the audio of different audio objects in the left channel single-type audio and the right channel single-type audio to obtain a left channel single-object audio and a right channel single-object audio.
From each single-type audio, the audio of one or more audio objects may be extracted; the number of single-object audios obtained from each single-type audio is not particularly limited in the embodiments of the present application.
The result of S202 is single-type audio, and a single type of audio may include one or more audio objects.
The following description will take an example of extracting audio of different audio objects in a left channel single-type audio to obtain a left channel single-object audio.
The principle of extracting the audio of different audio objects in the right channel single-type audio to obtain the right channel single-object audio is similar to that of the left channel, and is not described herein again.
The left channel mono type audio refers to data of a single type of audio object, which may include one or more audio objects. Thus, separating each left channel mono type of audio may result in audio of one or more audio objects.
For example, the left channel mono-type audio includes: human voice, drum sound, and piano sound.
Separating the human voice yields a first human voice and a second human voice; separating the drum sound yields a first drum sound; and separating the piano sound yields a first piano sound and a second piano sound.
The following are several implementations of extracting audio of different audio objects in a single type of audio of a left channel provided in the embodiments of the present application.
The principle of extracting the audio of different audio objects in the right channel single-type audio is similar, and is not described herein again.
In one possible implementation, the left channel single object audio is derived based on a neural network.
Referring to fig. 6A, fig. 6A is a schematic diagram illustrating audio extraction of different audio objects in a single type of audio of a left channel according to an embodiment of the present application.
As shown in fig. 6A, for example, for an audio object of type "person", the left channel mono-type audio may be regarded as audio mixed by a plurality of persons.
Firstly, voice feature extraction is carried out on the single-type audio of the left channel to obtain voice features.
In some possible implementations, the voice feature extraction may be performed using, for example: the fast Fourier transform (FFT), the Mel spectrum, the modified discrete cosine transform (MDCT), and the like.
Then, the obtained speech features are input into a third network, and a human voice vector of each frame of audio frame is obtained, wherein the human voice vector of each frame of audio frame may include a plurality of human voice vectors.
The third network is pre-trained.
The human voice vectors of each audio frame are clustered, with the maximum number of human voices preset to n. The human voice vector of each human voice in each audio frame is thereby determined, and a feature mask of each human voice is generated from the human voice vectors belonging to the same human voice, such as a first human voice mask and a second human voice mask, possibly together with other human voice masks.
And then, multiplying the voice features obtained by feature extraction with the feature mask of each voice respectively (namely, the voice reconstruction process), and performing inverse time-frequency transformation on the multiplication result to obtain a time domain signal of each voice.
In some possible implementations, the maximum number of human voices n may be derived from the description information of the original audio. For example, the original audio is a song, and the description information of the original audio may include the number of performers of the song.
The maximum number n of voices may also be a preset default value n, for example, n is set equal to 3.
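For illustration, the following Python sketch shows the clustering step described above: per-frame voice vectors (embeddings) are clustered into at most n voices and a feature mask is built for each voice. The random embeddings stand in for the third network's output, and n = 3 follows the default mentioned above; everything else is an assumption.

```python
# Sketch: cluster per-bin voice vectors into n voices, build one mask per voice.
import numpy as np
from sklearn.cluster import KMeans

n_frames, n_bins, emb_dim, n_voices = 200, 257, 20, 3

# Stand-in for the third network: one embedding per time-frequency bin.
embeddings = np.random.default_rng(1).normal(size=(n_frames * n_bins, emb_dim))
speech_features = np.abs(np.random.default_rng(2).normal(size=(n_frames, n_bins)))

labels = KMeans(n_clusters=n_voices, n_init=10).fit_predict(embeddings)
labels = labels.reshape(n_frames, n_bins)

# One binary feature mask per voice (first voice mask, second voice mask, ...).
masks = [(labels == k).astype(float) for k in range(n_voices)]
# Voice reconstruction: multiply the speech features by each voice's mask; an
# inverse time-frequency transform would then give each voice's time signal.
per_voice_features = [speech_features * m for m in masks]
```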
In some possible implementations, the left channel single object audio is derived based on a non-negative matrix decomposition.
Referring to fig. 6B, fig. 6B is a schematic diagram of extracting audio of different audio objects in a left channel single-type audio according to another embodiment of the present application.
As shown in fig. 6B, for example, for an audio object of type "person", the left channel mono-type audio may be regarded as multi-person sound mixed audio.
And transforming the time domain signal of the single-type audio of the left channel into a frequency domain signal of the single-type audio of the left channel after time-frequency transformation.
The frequency domain signal of the left channel single-type audio is processed using a Non-negative Matrix Factorization (NMF) algorithm to obtain a feature matrix and a weight matrix.
And carrying out filtering decomposition processing on the characteristic matrix and the weight matrix to obtain a frequency domain signal of each voice, and carrying out inverse time-frequency conversion on the frequency domain signal of each voice to obtain a time domain signal of each voice.
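A minimal Python sketch of this NMF route follows: the magnitude spectrogram is factored into a feature (basis) matrix and a weight (activation) matrix, and each component is filtered back out. The component count, spectrogram sizes, and the soft-mask filtering are illustrative assumptions; soft masking is one common choice for the "filtering decomposition" step.

```python
# Sketch: NMF on the magnitude spectrogram, then per-component filtering.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(3)
magnitude = np.abs(rng.normal(size=(257, 400)))     # |STFT| of single-type audio

model = NMF(n_components=4, init="random", random_state=0, max_iter=300)
W = model.fit_transform(magnitude)                  # feature matrix (257 x 4)
H = model.components_                               # weight matrix (4 x 400)

# Soft-mask filtering: each component's share of the mixture spectrogram.
approx = W @ H + 1e-12
per_component = [np.outer(W[:, k], H[k]) / approx * magnitude
                 for k in range(4)]
# Applying the mixture phase and an inverse time-frequency transform to each
# per-component spectrogram would yield each voice's time domain signal.
```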
In one possible implementation, the left channel single-object audio is obtained by using a beam forming method.
Specifically, taking human voice as an example: with a multi-channel recording, different singers occupy different spatial orientations, so a beam pointed at each orientation can extract the sound of a single singer, yielding the audio of each human voice.
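As a toy illustration of beam forming (the embodiment does not specify the beamformer), a delay-and-sum sketch is given below; the array geometry, sampling rate, and steering angle are all assumptions.

```python
# Toy delay-and-sum beamformer: steering toward a singer's orientation
# reinforces that singer and attenuates others. Geometry is illustrative.
import numpy as np

def delay_and_sum(channels, mic_positions, angle_deg, fs=16000, c=343.0):
    """Steer a microphone array toward angle_deg and sum the channels."""
    direction = np.array([np.cos(np.radians(angle_deg)),
                          np.sin(np.radians(angle_deg))])
    out = np.zeros(channels.shape[1])
    for sig, pos in zip(channels, mic_positions):
        delay = int(round(fs * np.dot(pos, direction) / c))   # in samples
        out += np.roll(sig, -delay)
    return out / len(channels)

fs = 16000
mics = np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])
channels = np.random.default_rng(4).normal(size=(4, fs))      # dummy recording
singer_a = delay_and_sum(channels, mics, angle_deg=30.0, fs=fs)
```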
Besides the above implementation manners, the left channel single-object audio may be obtained through other implementation manners, or multiple schemes may be aggregated to obtain the left channel single-object audio, which is not limited in this embodiment of the application.
The left channel single-object audio is obtained through the above description, and the principle of obtaining the right channel single-object audio is similar, which is not described herein again.
In a possible implementation manner, when the left channel mono-type audio and the right channel mono-type audio are separated to obtain the left channel mono-object audio, a left channel second background sound and a right channel second background sound may also be obtained.
As shown in fig. 6A, when the left channel mono type audio is processed, in addition to the time domain signal of the first person's voice and the time domain signal of the second person's voice, a left channel second background sound may be obtained.
The left channel second background sound may be understood as an audio signal other than the time domain signal of the first human voice and the time domain signal of the second human voice in the time domain signal of the left channel mono type audio.
For the method shown in fig. 6B and other methods as well, it is also possible to obtain the left channel second background sound.
The principle of obtaining the second background sound of the right channel is the same, and the description is omitted here.
In a possible implementation manner, after the left channel single-object audio and the right channel single-object audio are obtained, the left channel single-object audio and the right channel single-object audio may be further filtered respectively.
The following description will be given by taking the example of screening the left channel single object audio, and the principle of screening the right channel single object audio is similar, which is not described herein again.
Specifically, the obtained left channel single-object audio is screened according to the energy of each left channel single-object audio.
And determining the energy of each left channel single-object audio, deleting the left channel single-object audio when the energy of a certain left channel single-object audio is smaller than a preset energy threshold, and adding the left channel single-object audio into a left channel second background sound.
At this time, the number of left channel single-object audios is reduced by one.
The preset energy threshold may be the same or different for different audio objects.
For example, through the processing procedure shown in fig. 6A, the obtained left channel single-object audio includes: the first human voice, the second human voice, the drum sound, the first piano sound, and the second piano sound.
The obtained left channel single-object audio is screened using a preset energy threshold, such as a second energy threshold.
The energy of each left channel single-object audio is obtained: the energy of the first human voice, the energy of the second human voice, the energy of the drum sound, the energy of the first piano sound, and the energy of the second piano sound are each compared with the second energy threshold.
For example, when the energy of the first human voice, the energy of the second human voice, the energy of the drum sound, and the energy of the first piano sound are all greater than or equal to the second energy threshold, and the energy of the second piano sound is less than the second energy threshold, the screened left channel single-object audio includes the first human voice, the second human voice, the drum sound, and the first piano sound, and does not include the second piano sound.
The above energy comparisons may be performed over multiple frames of the audio data.
For example, when comparing the energy of the first human voice with the second energy threshold, the first human voice is divided into multiple frames of audio data; when every frame of audio data of the first human voice is greater than or equal to the second energy threshold, the energy of the first human voice is determined to be greater than or equal to the second energy threshold.
Similarly, when comparing the energy of the first piano sound with the second energy threshold, the first piano sound is divided into multiple frames of audio data; when every frame of audio data of the first piano sound is less than the second energy threshold, the energy of the first piano sound is determined to be less than the second energy threshold.
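A short Python sketch of this multi-frame criterion follows; the frame length (20 ms at 16 kHz) is an illustrative assumption.

```python
# Sketch: per-frame energy check. The decision is taken only when every frame
# falls on the same side of the threshold, as described above.
import numpy as np

def frame_energies_db(signal, frame_len):
    frames = signal[:len(signal) // frame_len * frame_len].reshape(-1, frame_len)
    return 20.0 * np.log10(np.max(np.abs(frames), axis=1) + 1e-12)

def all_frames_at_or_above(signal, threshold_db, frame_len=320):
    return bool(np.all(frame_energies_db(signal, frame_len) >= threshold_db))

def all_frames_below(signal, threshold_db, frame_len=320):
    return bool(np.all(frame_energies_db(signal, frame_len) < threshold_db))
```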
In the above description, for the processing of the original audio of the left channel, the single-type audio of the left channel is obtained by extraction; then, each left channel single-type audio is separated to obtain a left channel single-object audio.
In one possible implementation, for the processing of the left channel original audio, the left channel single-object audio may be directly obtained.
For example, when it is known that the left channel original audio contains only one type of audio object, the left channel single-object audio can be obtained directly from the left channel original audio, without first extracting the single-type audio.
For example, the left channel original audio is a pure vocal chorus, and the audio object includes only a plurality of vocal sounds.
The above describes processing the left channel mono type audio to obtain the left channel mono object audio and the left channel second background sound. The principle for processing the right channel mono type audio is similar and will not be described here.
Through the above processing procedures, a left channel single object audio, a left channel second background sound, a right channel single object audio, and a right channel second background sound are obtained.
S204, matching the left channel single-object audio and the right channel single-object audio to obtain the left channel audio and the right channel audio of each audio object.
The original audio may comprise a plurality of audio objects, and there may be a plurality of audio objects for each type.
Specifically, for the same audio object type, a left channel single object audio obtained by separating the same left channel single type audio and a right channel single object audio obtained by separating the same right channel single type audio are matched to obtain a left channel single object audio and a right channel single object audio of each audio object of the audio object type.
According to the above description, separating the left channel single type audio results in one or more left channel single object audio, and separating the right channel single type audio results in one or more right channel single object audio.
For a type of audio object, when the number of the left channel single-object audios is one, it indicates that for the left channel, the type of audio object only contains one audio object; when multiple left channel mono-object audio is obtained for a type of audio object, it indicates that the type of audio object contains multiple audio objects for the left channel.
The principle of separating the right channel single-type audio to obtain one or more right channel single-object audios is similar to the above, and is not described here again.
When the audio objects of this type for both the left channel and the right channel comprise only one audio object, a left channel single object audio and a right channel single object audio are obtained that correspond to the same audio object.
When the audio objects of the type are both comprised of multiple audio objects for the left channel and the right channel, the resulting multiple left channel single object audio and the multiple right channel single object audio correspond to the multiple audio objects. The correspondence between the plurality of left-channel single-object audios and the plurality of right-channel single-object audios is generally uncertain.
In one possible implementation, for the same audio object type, the left channel single object audio and the right channel single object audio are matched according to the correlation between the left channel single object audio and the right channel single object audio.
For the same type of audio object, there are a plurality of left channel single object audios and a plurality of right channel single object audios.
And obtaining the correlation degree of the left channel single-object audio and the right channel single-object audio, and determining the matching relation between the left channel single-object audio and the right channel single-object audio according to the correlation degree.
The correlation is used to describe a degree of correlation between the left channel single object audio and the right channel single object audio.
For example, for the same type of audio object "person", the left channel single object audio includes: the first human voice and the second human voice, the right channel single object audio includes: a third voice and a fourth voice.
Next, the following description will be given taking an example of determining the correlation between the first person's voice and the third person's voice.
Denote the first human voice by X and the third human voice by Y. The correlation of the first human voice X and the third human voice Y is:
r(X, Y) = cov(X, Y) / sqrt(var[X] * var[Y])
wherein cov(X, Y) is the covariance of the first human voice X and the third human voice Y, var[X] is the variance of the first human voice X, and var[Y] is the variance of the third human voice Y.
The correlation between two single-object audios may also be calculated in other ways.
The correlation of each left channel single-object audio and each right channel single-object audio is determined, for example, the correlation r1 of the first person's voice and the third person's voice, the correlation r2 of the first person's voice and the fourth person's voice, the correlation r3 of the second person's voice and the third person's voice, and the correlation r4 of the second person's voice and the fourth person's voice are obtained.
And respectively comparing the correlation r1, the correlation r2, the correlation r3, the correlation r4 and the correlation threshold, and determining the left channel single-object audio and the right channel single-object audio corresponding to the correlation exceeding the correlation threshold as matched single-object audio.
For example, the correlation threshold is 0.7; the correlation r1 and the correlation r4 are greater than the correlation threshold, the correlation r2 and the correlation r3 are less than the correlation threshold, and at this time, the first voice and the third voice are matched, and the second voice and the fourth voice are matched.
In some possible implementations, the correlation r1 of the first voice and the third voice, and the correlation r2 of the first voice and the fourth voice may also be compared; the degree of correlation r3 of the second person's voice and the third person's voice and the degree of correlation r4 of the second person's voice and the fourth person's voice are compared.
For example, when the degree of correlation r1 is greater than the degree of correlation r2, the sound sources of the first human voice and the third human voice are the same, and the first human voice and the third human voice correspond to the same audio object; when the degree of correlation r4 is greater than the degree of correlation r3, the sound sources of the second human voice and the fourth human voice are the same, and the second human voice and the fourth human voice correspond to the same audio object.
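By way of illustration, the following Python sketch implements the correlation-based matching above, using the Pearson correlation r(X, Y) = cov(X, Y) / sqrt(var[X] * var[Y]); the 0.7 threshold follows the example in the text, and the test signals are assumptions.

```python
# Sketch: match left channel and right channel single-object audios by
# Pearson correlation against a threshold.
import numpy as np

def correlation(x, y):
    # np.corrcoef implements cov(X, Y) / sqrt(var[X] * var[Y]).
    return float(np.corrcoef(x, y)[0, 1])

def match_objects(left_objs, right_objs, threshold=0.7):
    """Pair each left channel single-object audio with right channel ones."""
    pairs = []
    for li, left in enumerate(left_objs):
        for ri, right in enumerate(right_objs):
            if correlation(left, right) > threshold:
                pairs.append((li, ri))
    return pairs

rng = np.random.default_rng(5)
v1, v2 = rng.normal(size=16000), rng.normal(size=16000)
left = [v1 + 0.1 * rng.normal(size=16000), v2 + 0.1 * rng.normal(size=16000)]
right = [v2 + 0.1 * rng.normal(size=16000), v1 + 0.1 * rng.normal(size=16000)]
print(match_objects(left, right))   # [(0, 1), (1, 0)]
```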
And S205, generating a single-channel signal of each audio object by using the left channel audio and the right channel audio of each audio object.
The present embodiment provides the following several implementations of generating a mono signal for each audio object.
In one possible implementation, the left channel single object audio and the right channel single object audio of each audio object are superimposed, generating a mono signal for each audio object.
For example, for the same audio object, the left channel single-object audio is L, the right channel single-object audio is R, and the monaural signal of the audio object is M = L + R.
The single-channel signal is obtained by superposing the single-object audio of the left channel and the single-object audio of the right channel, and the calculation complexity is low.
In one possible implementation, for each audio object, the square root of the sum of the square of the left channel single-object audio and the square of the right channel single-object audio is taken to generate the mono signal of the audio object.
For example, for the same audio object, the left channel single-object audio is L, the right channel single-object audio is R, and the mono signal of the audio object is:
M = sqrt(L^2 + R^2)
in this way, for one audio object, the left channel single-object audio and the right channel single-object audio can keep the energy consistency after being converted into a single-channel signal.
In some possible cases, the following may also be used: for one audio object, firstly, determining the phase difference between the left channel single-object audio and the right channel single-object audio; then, the phase difference is utilized to align the phases of the left channel single-object audio and the right channel single-object audio; and then, overlapping the left channel single-object audio and the right channel single-object audio which are aligned in phase to obtain a single-channel signal of the audio object.
In some possible cases, the correlation between the left channel single-object audio and the right channel single-object audio at different time delays is calculated to realize the phase alignment processing.
Besides the above implementation manners, the mono signal of each audio object may be generated by other implementation manners, or multiple schemes may be combined to generate the mono signal of each audio object, which is not limited in this application.
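For illustration, the following Python sketch covers the three options above: plain superposition, the energy-preserving square root of summed squares, and phase alignment before superposition. The cross-correlation search realizes the delay-based alignment mentioned above; the maximum lag is an assumption.

```python
# Sketches of the three mono-signal options described above.
import numpy as np

def mono_superpose(L, R):
    return L + R                                # M = L + R, lowest complexity

def mono_energy_preserving(L, R):
    return np.sqrt(L**2 + R**2)                 # M = sqrt(L^2 + R^2)

def mono_phase_aligned(L, R, max_lag=64):
    # Find the delay maximizing the correlation between L and R, then align
    # the phases before superposing.
    lags = list(range(-max_lag, max_lag + 1))
    corrs = [np.dot(L, np.roll(R, k)) for k in lags]
    best = lags[int(np.argmax(corrs))]
    return L + np.roll(R, best)
```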
The parameters of the audio object may include position parameters, such as angle-related parameters, distance-related parameters, etc.; other parameters, such as gain, etc., may also be included.
The following is a description of determining parameters of an audio object.
And S206, determining an initial horizontal azimuth angle of each audio object.
Horizontal azimuth is a parameter that characterizes the horizontal azimuth of the audio object relative to the listener.
The initial horizontal azimuth refers to an initial value of the horizontal azimuth.
The initial horizontal azimuth of the audio object is explained below.
The horizontal azimuth of an audio object is in particular the horizontal azimuth of the audio object with respect to the listener.
An audio object is considered to be a virtual sound source from which a listener hears sounds. The horizontal azimuth of the audio object refers to the angle of the audio object in the horizontal plane with respect to the listener.
Referring to fig. 7A, fig. 7A is a schematic diagram of a horizontal azimuth of an audio object according to an embodiment of the present disclosure.
Fig. 7A shows a schematic top view of a space in which a listener is located.
When the listener is a person, the viewing angle of the top view of the space in which the listener is located is: the observation is made from directly above the top of the listener's head.
As shown in fig. 7A, the space in which the listener is located is depicted as a circle centered on the listener O; the radius of the circle is the projection, in the top view direction, of the distance between the listener O and the audio object S.
The direction in which the listener O faces is the direction pointing from O to O ', i.e. the direction of OO'.
The horizontal azimuth a of audio object S refers to the angle between the line between listener O and audio object S, and OO', as shown in fig. 7A.
For example, when the audio object is located directly in front of the listener (i.e. audio object S is located at O'), the horizontal azimuth of the audio object is 0 degrees; the horizontal azimuth angle of the audio object is-90 degrees when the audio object is located directly to the left of the listener; the horizontal azimuth angle of the audio object is 90 degrees when the audio object is located directly to the right of the listener.
The mono audio includes a plurality of audio frames; each audio frame may be, for example, a 10 ms, 20 ms, or 40 ms segment of audio.
For an audio object, the horizontal azimuth may differ between audio frames.
An audio object is considered to be a virtual sound source from which a listener hears sound. The audio object may be moving relative to the listener throughout the period in which the listener hears the sound of the audio object. Thus, the horizontal azimuth of the audio object may be varied.
In one possible implementation, the horizontal azimuth of each audio object comprises a plurality of azimuths, each azimuth corresponding to each audio frame of the monaural audio.
It will be appreciated that the horizontal azimuth of an audio object may be the same or different for different audio frames.
The present embodiment provides the following several implementations of obtaining the horizontal azimuth of the audio object.
In one possible implementation, the horizontal azimuth of the audio object is obtained by means of trigonometric function modeling.
Specifically, first, the energy E _ left of the left channel single object audio and the energy E _ right of the right channel single object audio are obtained.
When the energy of the left channel single-object audio is greater than or equal to the energy of the right channel single-object audio, the horizontal azimuth of the audio object is obtained by the following formula:
Azimuth = 2*arctan(E_right/E_left) - 90
when the energy of the left channel single object audio is less than that of the right channel single object audio, the horizontal azimuth of the audio object is obtained by the following formula:
Azimuth = 90 - 2*arctan(E_left/E_right)
wherein E_right is the energy of the right channel single-object audio, E_left is the energy of the left channel single-object audio, and arctan is the arctangent function, taken here in degrees.
The horizontal azimuth of the audio object estimated by the above manner ranges from-90 to +90 degrees.
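The following Python sketch is a direct transcription of the trigonometric model above, with arctan taken in degrees so that equal energies give an azimuth of 0 and the result spans [-90, +90] degrees; the test values are assumptions.

```python
# Sketch: horizontal azimuth from left/right single-object audio energies.
import numpy as np

def horizontal_azimuth(e_left, e_right):
    if e_left >= e_right:
        return 2.0 * np.degrees(np.arctan(e_right / e_left)) - 90.0
    return 90.0 - 2.0 * np.degrees(np.arctan(e_left / e_right))

print(horizontal_azimuth(1.0, 1.0))   #   0.0 (directly ahead)
print(horizontal_azimuth(1.0, 0.0))   # -90.0 (directly to the left)
print(horizontal_azimuth(0.0, 1.0))   #  90.0 (directly to the right)
```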
In a possible implementation manner, the horizontal azimuth of the audio object is obtained according to the energy of the left channel single-object audio and the energy of the right channel single-object audio, and the corresponding relationship between the two energies and the horizontal azimuth of the audio object.
Further, a ratio of the energy of the left channel single-object audio to the energy of the right channel single-object audio may be obtained, and the horizontal azimuth of the audio object may be obtained according to a correspondence between the ratio and the horizontal azimuth of the audio object.
In an environment where real sound is transmitted, when sound is transmitted to both ears of a listener, it is necessary to consider the influence of factors such as head occlusion, auricle reflection, shoulder reflection, and the like of the listener on the sound.
Further, in one possible implementation, the audio for the listener's left and right channels may be determined using a Head-Related Transfer Function (HRTF).
The Head-Related Transfer Function (HRTF) is an audio localization algorithm that describes the transmission of sound waves from a sound source to the two ears; its counterpart in the time domain is the Head-Related Impulse Response (HRIR).
Taking a human listener as an example: a person localizes sound in three-dimensional space through the two ears, which can be regarded as a system that analyzes sound signals.
The signal transmitted from any point in space to the human ear (usually to a point just in front of the eardrum) can be described by a filtering system: passing the sound source through the filter yields the sound signal in front of the eardrums of the two ears (the binaural signal).
A sound signal from a specific location in space can therefore be reproduced by a filter (transfer function) that carries the spatial information, i.e., the HRTF. If a filter bank from all spatial orientations to the two ears is obtained, the resulting filter matrix can recover sound signals from any spatial orientation.
Therefore, a known sound source and HRTF can obtain audio that the sound source transmits to both ears, respectively.
Based on this, the correspondence between the left-right ear energy ratio and the horizontal azimuth of the audio object can be obtained in advance, where the left-right ear energy ratio is the ratio of the energy of the left channel single-object audio to the energy of the right channel single-object audio.
The left channel single-object audio and the right channel single-object audio of the audio object are obtained through S205, that is, for the audio object, the ratio of the energy E _ left of the left channel single-object audio to the energy E _ right of the right channel single-object audio can be obtained.
According to the corresponding relation and the ratio of the energy of the left channel single-object audio to the energy of the right channel single-object audio, the horizontal azimuth angle of the audio object can be determined.
Further, the correspondence may be stored in a table form.
Referring to fig. 7B, fig. 7B is a schematic diagram illustrating the corresponding relationship between the left and right channel energies and the horizontal azimuth.
As shown in fig. 7B, the correspondence between the left-right ear energy ratio E_left/E_right and the horizontal Azimuth of the audio object is obtained in advance, for example:
Azimuth (degrees): 90, 85, 80, 75, 70, 0, -75, -80, -85, -90
E_left/E_right: 0.01, 0.03, 0.08, 0.15, 0.25, 1.00, 6.6, 12.5, 33.3, 100
It is understood that, in addition to the corresponding relationship shown in fig. 7B, other corresponding relationships between the horizontal Azimuth angle Azimuth and the left-right ear energy ratio E _ left/E _ right may also be included, which may be determined according to actual situations.
It is explained above that the horizontal azimuth of the audio object can be determined according to the corresponding relationship between the horizontal azimuth and the left-right ear energy ratio. The correspondence is a one-to-one correspondence between a plurality of numerical values.
In a possible implementation manner, after the correspondence is obtained in the above manner, a functional relationship between the horizontal azimuth and the energy ratio of the left ear and the right ear may be fitted by using the correspondence.
Specifically, the fitted function may be a polynomial of the following form, where x = E_left/E_right is the left-right ear energy ratio:
Azimuth = a_n*x^n + ... + a_2*x^2 + a_1*x + a_0
wherein a_n, ..., a_2, a_1, a_0 are the parameters of the polynomial function, obtained by function fitting.
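For illustration, the following Python sketch fits such a polynomial to the sample pairs from fig. 7B. Fitting in log-ratio space (where the correspondence is closer to linear) and the polynomial degree are added assumptions, not part of the embodiment.

```python
# Sketch: fit Azimuth as a polynomial of the (log) left-right ear energy ratio.
import numpy as np

ratio   = np.array([0.01, 0.03, 0.08, 0.15, 0.25, 1.00, 6.6, 12.5, 33.3, 100.0])
azimuth = np.array([90.0, 85.0, 80.0, 75.0, 70.0, 0.0, -75.0, -80.0, -85.0, -90.0])

coeffs = np.polyfit(np.log10(ratio), azimuth, deg=3)   # a_n ... a_1, a_0

def azimuth_from_ratio(e_left_over_e_right):
    return float(np.polyval(coeffs, np.log10(e_left_over_e_right)))

print(azimuth_from_ratio(1.0))   # close to 0 degrees (equal left/right energy)
```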
Besides the above implementation manners, the horizontal azimuth of the audio object may be obtained through other implementation manners, or a plurality of schemes may be combined to obtain the horizontal azimuth of the audio object, which is not limited in this application.
In a possible implementation manner, for each audio frame of each audio object, the horizontal azimuth of the audio object is obtained in any manner.
That is, a set of horizontal azimuth angles of each audio object is obtained, and the set includes the horizontal azimuth angle of the audio object of each audio frame.
In a possible implementation manner, the initial horizontal azimuth may also be determined by setting a first horizontal azimuth and a variation range of the horizontal azimuth.
The initial horizontal azimuth is then the value of the first horizontal azimuth as it varies within that range.
Specifically, the first horizontal azimuth may be obtained by any of the above manners, or may be directly set.
The horizontal azimuth variation range refers to the range within which the horizontal azimuth of the audio object varies.
Further, the variation may be random, or may follow a preset rule (for example, gradually increasing or decreasing).
For example, if the first horizontal azimuth is 5 degrees and the variation range is [-1, 1], the horizontal azimuth of the audio object starts at 5 degrees and varies randomly within [4, 6].
In a possible implementation, the first horizontal azimuth of the audio object may also be set directly, with the horizontal azimuth of the audio object remaining fixed.
It is to be understood that, when there are multiple audio objects, the manner of determining the first horizontal azimuth angle of each audio object may be the same or different, and this is not limited by the embodiment of the present application.
And S207, determining an initial elevation angle and an initial distance of each object audio according to the audio object type.
The elevation angle in the embodiment of the application is a parameter for representing the elevation angle of the audio object relative to a listener; the distance in the embodiment of the present application is a parameter for characterizing the distance of the audio object with respect to the listener.
The elevation angle of an audio object particularly refers to the elevation angle of the audio object relative to the listener, and the distance of an audio object refers to the distance between the audio object and the listener.
The initial elevation angle refers to an initial value of the elevation angle. The initial distance refers to the initial value of the distance.
The following describes how to determine the initial elevation angle and the initial distance of an audio object.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating an elevation angle and a distance of an audio object according to an embodiment of the present disclosure.
Fig. 8 shows a schematic side view of a space where a listener is located.
When the listener is a person, the viewing angle of the side view of the space in which the listener is located is: observation is made horizontally, from the side of the standing listener.
As shown in fig. 8, the distance between the listener O and the audio object S is d.
Taking the example of a listener standing on a horizontal plane, the elevation angle b of audio object S is: the angle between the line connecting the listener O and the audio object S and the horizontal plane.
An audio object is considered to be a virtual sound source from which a listener hears sounds.
For example, taking the heights of the listener and the audio object into account, suppose the height of the listener is 1 m. When the height of the audio object equals 1 m, the elevation angle of the audio object is 0 degrees; when the height of the audio object is greater than 1 m, the elevation angle of the audio object is greater than 0 degrees; when the height of the audio object is less than 1 m, the elevation angle of the audio object is less than 0 degrees.
The following is an implementation of determining the elevation angle and distance of object audio provided by an embodiment of the present application.
In one possible implementation, the elevation angle and distance of the object audio are determined according to the type of the audio object.
The present embodiment provides the following several implementations for determining the elevation angle and distance of the object audio according to the type of the audio object.
In one possible implementation, the elevation angle and distance of the audio objects of each audio object type are preset.
For an audio object type, the elevation angle and the distance of the audio objects belonging to the audio object type are predetermined.
Specifically, for an audio object type, an elevation angle of an audio object of the audio object type may be preset as a preset elevation angle, and a distance of the audio object type may be preset as a preset distance.
Further, the preset elevation angle and the preset distance may correspond to one or more numerical values.
When the preset elevation angle corresponds to a numerical value, determining the numerical value as the elevation angle of the audio object; when the preset elevation angle corresponds to a plurality of values, one of the values can be randomly selected as the value of the elevation angle of the object audio in a random selection mode.
The preset distance and the preset elevation angle are the same, and are not described in detail here.
Alternatively, one of the two parameters (preset distance and preset elevation angle) may correspond to a single preset value while the other corresponds to a plurality of preset values; details are not repeated here.
For example, for the audio object type "person", the preset elevation angle of the audio object may include 0 degree, 1 degree, 2 degrees, and the preset distance may include 1m, 2m, 3 m.
In determining the preset elevation angle of the audio object type "person", one of values of 0 degree, 1 degree, and 2 degrees may be randomly determined as the preset elevation angle of the audio object.
The determined preset elevation angle may be different for different audio objects of the same audio object type.
For the audio object type "person", the preset elevation angle of the audio object may include 0 degree, 1 degree, 2 degrees, etc., and the preset distance may include 1m, 2m, 3m, etc.;
similarly, for the audio object type "drum", the preset elevation angle of the audio object may include-5 degrees, -6 degrees, -7 degrees, etc., and the preset distance may include 2 meters, 3 meters, 4 meters, etc.;
for the audio object type "bird", the preset elevation angle of the audio object may include 20 degrees, 30 degrees, 40 degrees, etc., and the preset distance may include 5 meters, 6 meters, 10 meters, etc.;
for the audio object type "airplane", the preset elevation angle of the audio object may include 70 degrees, 80 degrees, 90 degrees, etc., and the preset distance may include 20 meters, 30 meters, 40 meters, etc.
The correspondence between the type of the audio object and the two parameters of the preset elevation angle and the preset distance can be stored and/or presented in the form of a table.
And according to the corresponding relation and the type of the audio object, randomly determining the numerical values of the preset elevation angle and the preset distance of the audio object.
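The following Python sketch illustrates this table-driven scheme: each audio object type maps to candidate preset elevations (degrees) and distances (meters), and one value is drawn at random per audio object. The numbers follow the examples above; the dictionary itself is an illustrative stand-in for the stored correspondence.

```python
# Sketch: preset elevation/distance per audio object type, randomly selected.
import random

PRESETS = {
    "person":   {"elevation": [0, 1, 2],     "distance": [1, 2, 3]},
    "drum":     {"elevation": [-5, -6, -7],  "distance": [2, 3, 4]},
    "bird":     {"elevation": [20, 30, 40],  "distance": [5, 6, 10]},
    "airplane": {"elevation": [70, 80, 90],  "distance": [20, 30, 40]},
}

def initial_elevation_and_distance(object_type, rng=random):
    preset = PRESETS[object_type]
    return rng.choice(preset["elevation"]), rng.choice(preset["distance"])

print(initial_elevation_and_distance("bird"))   # e.g. (30, 5)
```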
The above explains determining the elevation angle and distance of the object audio from preset values for each audio object type.
In a possible implementation, the preset elevation angle and the preset distance of the audio object may also be randomly generated, for example, using a random number generation function or a random number generator.
Further, a range for generating a random number value is set for the random number generation function or the random number generator.
For example, the preset elevation angle of the audio object type "person" is randomly generated using a random number generation function or a random number generator. The range of the value of the generated random number is set to be [ -5,5], that is, the range of the random number generation function or the preset elevation angle generated by the random number generator is [ -5,5 ].
The above describes randomly generating the elevation angle and distance of an audio object type to determine the elevation angle and distance of the object audio.
In one possible implementation, the distance of an audio object may also be determined according to the energy of mono audio of the audio object.
For example, the smaller the energy of the monophonic audio of an audio object, the greater the distance that the audio object is determined to be.
In some possible cases, the distance of the audio object may be determined using a function of the energy of the mono audio of the audio object and the distance of the audio object. The above function is used to embody: a negative correlation between an energy of mono audio of the audio object and a distance of the audio object.
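As an illustration of such a negatively correlated function, a simple inverse mapping is sketched below; the reference energy and reference distance are assumptions, and other decreasing functions could be chosen instead.

```python
# Sketch: distance grows as the mono audio's energy falls (negative correlation).
import numpy as np

def distance_from_energy(mono, ref_distance_m=1.0, ref_rms=0.5):
    rms = np.sqrt(np.mean(mono**2)) + 1e-12
    return ref_distance_m * ref_rms / rms      # halving the RMS doubles distance

loud = np.full(16000, 0.5)
quiet = np.full(16000, 0.05)
print(distance_from_energy(loud), distance_from_energy(quiet))  # ~1.0 m, ~10.0 m
```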
When a plurality of object audios exist, the implementation manner may be the same or different for the determination of the elevation angle and the distance of each object audio; the determination of the elevation angle of an object audio and the determination of the distance of the object audio may be implemented in the same manner or in different manners, which is not limited in this embodiment.
Besides the above implementation manners, the elevation angle and the distance of the audio object may be obtained through other implementation manners, or multiple schemes may be combined to obtain the elevation angle and the distance of the audio object, which is not limited in the embodiment of the present application.
In a possible implementation manner, for each audio frame of each audio object, the elevation angle and the distance of the audio object are obtained by any one of the above manners.
That is, an elevation angle set of each audio object is obtained, containing the elevation angle of the audio object for each audio frame; and a distance set of each audio object is obtained, containing the distance of the audio object for each audio frame.
In a possible implementation, the first elevation angle, the first distance, the variation range of the elevation angle, and the variation range of the distance of the audio object may also be directly set.
Specifically, the first elevation angle and the first distance of the audio object may be obtained by any of the above manners, and may also be directly set.
A variation range of elevation angle, which means that the elevation angle of the audio object varies within the range; the variation range of the distance means that the distance of the audio object varies within the range.
The above description of the initial value of the horizontal azimuth and the range of variation of the horizontal azimuth is referred to for the first elevation angle, the first distance, the range of variation of the elevation angle, and the range of variation of the distance.
It is to be understood that, when there are multiple audio objects, the manner of determining the initial elevation angle and the initial distance of each audio object may be the same or different, and the embodiment of the present application does not limit this.
Processing the original audio according to the above procedure to obtain a single-object audio and a position parameter of the audio object, where the position parameter of the audio object may include: a horizontal azimuth of the audio object, an elevation of the audio object, and a distance of the audio object.
In some possible cases, other parameters of the audio object may also be derived.
In one possible implementation, after the raw audio is acquired, the raw audio is processed (S202-S206 above) in response to raw audio processing instructions, resulting in single-object audio and position parameters of the audio object.
Further, in one possible implementation, the raw audio processing instructions may be issued by a user.
In the embodiment of the application, an electronic terminal is taken as an example of a mobile phone, and a possible interface of a 3D audio playing application program (3D audio playing APP) is described.
In one possible implementation, information of the audio object may be presented to the user in the form of an image, the information of the audio object including a horizontal azimuth angle of the audio object, an elevation angle of the audio object, a distance of the audio object, and the like.
Referring to fig. 4A, as shown in fig. 4A, in the main interface, a 3D audio object analysis and setting virtual button 403 is displayed.
In one possible implementation, the original audio processing instruction may be generated by a user clicking a 3D audio object analysis and setting virtual button 403 in the main interface.
In some possible cases, after receiving the pressing operation of the 3D audio object analysis and setting virtual button 403, the page jumps from the main interface to the 3D audio object analysis and setting interface.
In one possible implementation, the raw audio is processed in response to raw audio processing instructions, and the 3D audio object analysis and setting interface displays the progress of the processing and analysis in the process of processing the raw audio.
Specifically, after a user clicks a button of '3D audio object analysis and setting' on a main playing interface, a selected audio file is analyzed through an Artificial Intelligence (AI) algorithm, and different audio objects and environmental sounds are extracted.
Referring to fig. 9A, fig. 9A is a schematic view of a 3D audio object analysis and setting interface provided in the present embodiment.
As shown in fig. 9A, a target sound source display box 401 is displayed on the 3D audio object analysis and setting interface. The name and format of the target sound source, "ferry.flac", are displayed in the target sound source display box 401.
Below the target audio source display frame 401, the 3D audio object analysis and setting interface further includes a progress indication 409 for indicating a current progress in processing the original audio source.
As shown in fig. 9A, the progress of processing and analyzing the original audio is also displayed below the target sound source display frame 401. Here, the target audio is the original audio being processed and analyzed.
As shown in fig. 9A, the progress indication 409 "… in analysis (90% completed)" indicates that the original audio is being processed and analyzed, and that the progress of the processing and analysis is about 90%.
In some possible cases, the current processing and analyzing progress can be determined according to the time length required by the historical processing and analyzing and the time used for the current processing and analyzing; or determining the current processing and analyzing progress by utilizing the ratio of the number of the processed audio objects to the number of all the audio objects; or other implementations that can be used to get a progress of the processing and analysis.
In some possible cases, the 3D audio object analysis and setting interface may further include a return virtual button 406, a jump to desktop button 407 of the electronic device desktop, a main interface virtual button 408, and the like.
At this point, the horizontal azimuth, elevation, and distance of the object audio either have definite values, or have definite initial values and variation ranges.
S208, determining the target value of the parameter of the audio object.
The parameters mentioned below, without specific reference, refer to parameters of the audio object.
The parameters may include location parameters and may also include other parameters. The location parameters may include horizontal azimuth, elevation, distance, and the like; other parameters may include gain, etc.
The horizontal azimuth, elevation and distance of the object audio have been described above and will not be described herein.
The gain of an audio object refers to a multiple of scaling the signal amplitude of the single object audio of the audio object.
In one possible implementation, the initial value of the parameter is determined as a target value of the parameter.
For example, the initial horizontal azimuth is determined as the target value of the horizontal azimuth.
In one possible implementation, the initial value of the parameter is determined as a target value of the parameter, and when an input value of a certain parameter is received, the target value of the parameter is determined according to the input value of the parameter.
Further, the input value of the parameter may be input by the user.
Specifically, the input value of the parameter may be a target value of the parameter, that is, the target value of the parameter is directly input by the user; the input value of the parameter may also be a variation value of the parameter, that is, a variation value of the parameter input by the user, and specifically, the target value of the parameter may be obtained according to the initial value and the variation value of the parameter.
At least initial values of three parameters of horizontal azimuth, elevation and distance of the object audio are determined through S206-S207.
The user can adjust the parameters of the object audio according to the needs to meet the personalized requirements of the user. In the embodiment of the application, an electronic terminal is taken as an example of a mobile phone, and a possible interface of a 3D audio playing application program (3D audio playing APP) is described.
With continued reference to fig. 9A, progress indicator 409 is used to indicate the current progress of processing and analyzing the original audio source. When processing and analyzing the original audio source is complete, progress indicator 409 may display "analysis is complete", indicating that the processing and analysis of the original audio is complete, i.e., the progress is 100%.
Referring to fig. 9A and 9B, fig. 9B is a schematic diagram of another interface for analyzing and setting a 3D audio object according to an embodiment of the present disclosure.
As shown in fig. 9B, when processing and analyzing the raw data is complete, progress indicator 409 may display "analysis is complete".
In the 3D audio object analysis and setting interface shown in fig. 9B, the interface includes a target sound source display box 401, a return virtual button 406, a desktop button 407 for jumping to the desktop of the electronic device, a main interface virtual button 408, and a progress indication 409, and further includes: an audio object display area 410, an audio object list 411, and a selected audio object parameter setting area 412.
After the processing and analysis of the raw data is finished, the audio object and the position parameters of the audio object are obtained, and at this time, an audio object list can be obtained.
As shown in fig. 9B, after the processing and analysis of the raw data is completed, a rectangular space (3D audio space) may be displayed on the interface. The center of the rectangle may be the virtual listener position.
The audio object list 411 shown in fig. 9B shows the names of the audio objects and the corresponding icons of the audio objects on the interface.
The position parameters of the audio object include a horizontal azimuth of the audio object, an elevation of the audio object, and a distance of the audio object.
The location parameters of the respective audio objects indicate the positional relationship between the audio objects and the listener, including the angles in the horizontal and vertical directions and the distance between the two. The positional relationship of each audio object and the listener can therefore be determined at this point.
Fig. 9B shows an audio object display area 410 representing the three-dimensional space in which the listener is located by means of a cube, with different audio objects represented by different graphics. The object display area 410 shows the positional relationship of the audio object and the listener.
Below the audio object display area 410 and the audio object list 411 there may also be displayed an icon description of the listener.
In the 3D audio object analysis and setting interface shown in fig. 9B, the user can clearly and intuitively see the positional relationship of the audio object and the listener.
Fig. 9B shows a 3D audio object analysis and setting interface, which further includes a selected audio object parameter setting area 412.
The user may adjust the position parameters of each audio object using the selected audio object parameter setting region 412.
Specifically, the user may select an audio object desired to be adjusted by clicking an icon of the audio object in the audio object display area 410.
When a user click on an icon of an audio object in the audio object display area 410 is received, the name and icon of the audio object whose icon is clicked appear in the selected audio object parameter setting area 412.
As shown in fig. 9B, the audio object "human voice" is shown on the left side of the selected audio object parameter setting area 412, and at this time, the user can adjust the position parameter by selecting the right side of the selected audio object parameter setting area 412.
As shown in fig. 9B, on the right side of the selected audio object parameter setting area 412, names of three position parameters and a numerical value input box of the position parameters are displayed.
The user can input into each numerical value input box to set the target value of each positional parameter.
Fig. 9B shows a case where the target value of the position parameter input by the user is directly acquired. In a possible implementation manner, adjustment values of the numerical values of the position parameters by the user can be obtained.
In some possible cases, the initial value displayed in the value input box of each position parameter may be the value obtained from processing and analyzing the original audio, i.e., the pre-adjustment value of each position parameter is displayed initially.
In some possible implementations, target values for other parameters of the audio object are obtained.
Other parameters of an audio object may include the gain of the audio object, specifically: the factor by which the single-object audio of the audio object is amplified.
As shown in fig. 9B, on the right side of the selected audio object parameter setting area 412, a name and numerical value input box of the parameter "gain" is also displayed.
The parameter "gain" acquisition process is similar to the above location parameters, and is not described in detail here.
When there is no adjustment to the values of the position parameter and other parameters, the target value of the parameter may be equal to the initial value of the parameter.
In some possible implementations, the user may also click and drag a ball corresponding to the audio object out of the rectangular space shown in fig. 9B, so that when the 3D audio is played, the sound of the audio object is not played.
For example, there are 2 singers in a song, and the user may mute a single singer.
The manner of determining the horizontal azimuth, elevation, distance, and gain of an audio object is described above: the original audio is processed and analyzed to obtain a first value of each parameter; a second, obtained value of the parameter may then be taken as the target value of the parameter, or the target value may be obtained from the first value and a variation of the parameter. This improves the personalization of the audio.
In some possible cases, the target value of the parameter may also be the first value of the parameter. For example, after the first value of the parameter is obtained, the first value is directly used as the target value of the parameter, and the processing procedure of obtaining the second value is not performed, so that the user operation can be simplified.
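For illustration only (the patent specifies no code; the function and parameter names are assumptions), this target-value resolution could be sketched in Python as:

    def resolve_target_value(initial_value, user_value=None, user_delta=None):
        # initial_value: first value, obtained by processing and analyzing the original audio
        # user_value:    second value entered directly by the user, if any
        # user_delta:    the user's adjustment to the displayed value, if any
        if user_value is not None:      # target value entered directly
            return user_value
        if user_delta is not None:      # adjustment applied to the first value
            return initial_value + user_delta
        return initial_value            # no adjustment: target equals initial value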
S209, determining the left channel target audio and the right channel target audio of each audio object according to the single channel audio of each audio object and the target value of the position parameter.
In one possible implementation, for the mono audio of each audio object, the mono audio of the audio object is processed using a head-related transfer function HRTF based on the target values of the parameters of the audio object, resulting in a left channel target audio of the audio object and a right channel target audio of the audio object.
The head related transfer function HRTF includes a left HRTF and a right HRTF.
The mono audio of the audio object is processed with a head related transfer function HRTF, meaning that this function HRTF acts on the mono audio.
Referring to fig. 10A, fig. 10A is a schematic diagram of obtaining mono target audio of an audio object by using a head-related transfer function HRTF according to an embodiment of the present application.
The mono target audio may be a left channel target audio or a right channel target audio.
As shown in fig. 10A, for each audio object, a left HRTF and a right HRTF that meet a target value of a position parameter are determined from an HRTF library according to the target value of the position parameter; processing the single-channel audio by using a left HRTF to obtain a left-channel target audio; and processing the single-channel audio by using the right HRTF to obtain a right-channel target audio.
For each audio object, the obtained left channel target audio is the audio received at the listener's left ear, namely the sound heard when the audio object is located at the target position; the target position is the position at which the positional relationship between the audio object and the listener satisfies the target values of the audio object's position parameters.
Since the head related transfer function HRTF is selected from the HRTF library based on the target values of the position parameters, after the mono audio of an audio object is processed with the HRTF, the listener perceives the sound of the audio object as coming from the target position.
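The library selection can be pictured as a nearest-position lookup. A minimal Python sketch (the library layout and the nearest-neighbour criterion are illustrative assumptions; the patent only states that HRTFs conforming to the target values are determined from the library):

    def select_hrtf(hrtf_library, azimuth, elevation, distance):
        # hrtf_library: iterable of entries such as
        #   {"az": 30.0, "el": 15.0, "dist": 1.0,
        #    "left": left_coefficients, "right": right_coefficients}
        def position_error(entry):
            # squared distance between an entry's measured position
            # and the target position of the audio object
            return ((entry["az"] - azimuth) ** 2
                    + (entry["el"] - elevation) ** 2
                    + (entry["dist"] - distance) ** 2)
        best = min(hrtf_library, key=position_error)
        return best["left"], best["right"]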
Take the example of a listener playing audio with an audio player. For an audio object, when the left channel of the audio player plays the left channel target audio of the audio object, the listener can feel that the position relation between the audio object and the listener conforms to the target value of the position parameter of the audio object.
This embodiment provides the following implementations for obtaining the mono target audio; the description below takes the left channel target audio as an example.
Referring to fig. 10B, fig. 10B is a schematic diagram of obtaining a mono target audio according to an embodiment of the present application.
As shown in fig. 10B, the target values of the audio object's position parameters are used to determine, from the HRTF library, the time-domain HRTF filter coefficients that conform to those target values, namely a left HRTF and a right HRTF in the time domain.
The time domain signal of the mono audio is convolved with the time-domain left HRTF to obtain the left channel target audio, and convolved with the time-domain right HRTF to obtain the right channel target audio. Both the left channel target audio and the right channel target audio are time domain signals.
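A minimal sketch of this time-domain variant, assuming SciPy and array-like signals (not mandated by the patent):

    from scipy.signal import fftconvolve

    def render_time_domain(mono, hrir_left, hrir_right):
        # mono:       time domain signal of the mono audio of one audio object
        # hrir_left:  time-domain left HRTF filter coefficients
        # hrir_right: time-domain right HRTF filter coefficients
        left_target = fftconvolve(mono, hrir_left)    # left channel target audio
        right_target = fftconvolve(mono, hrir_right)  # right channel target audio
        return left_target, right_target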
Referring to fig. 10C, fig. 10C is a schematic diagram of obtaining a mono target audio according to another embodiment of the present application.
As shown in fig. 10C, the target values of the audio object's position parameters are used to determine, from the HRTF library, frequency-domain HRTF filter coefficients that conform to those target values, yielding a left HRTF and a right HRTF in the frequency domain; the frequency domain may be the FFT domain, the MDCT domain, or the like. The time domain signal of the mono audio is then time-frequency transformed (for example by FFT or MDCT) to obtain the frequency domain signal of the mono audio. Multiplying the frequency domain signal of the mono audio by the frequency-domain left HRTF and right HRTF respectively gives the frequency domain signals of the left channel target audio and the right channel target audio, and inverse time-frequency transforming each of these gives the corresponding time domain signals.
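A minimal FFT-domain sketch under the same assumptions (zero-padding to the full linear-convolution length stands in for the block-wise overlap-add a real system would use):

    import numpy as np

    def render_frequency_domain(mono, hrir_left, hrir_right):
        n = len(mono) + max(len(hrir_left), len(hrir_right)) - 1
        spectrum = np.fft.rfft(mono, n)        # mono audio -> frequency domain
        h_left = np.fft.rfft(hrir_left, n)     # left HRTF in the frequency domain
        h_right = np.fft.rfft(hrir_right, n)   # right HRTF in the frequency domain
        left_target = np.fft.irfft(spectrum * h_left, n)    # back to time domain
        right_target = np.fft.irfft(spectrum * h_right, n)
        return left_target, right_target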
Besides the above implementation manners, the left channel target audio and the right channel target audio may be obtained through other implementation manners, or multiple schemes may be combined to obtain the left channel target audio and the right channel target audio, which is not limited in this application embodiment.
It can be understood that, when there are multiple audio objects, the manner of determining to obtain the left channel target audio and the right channel target audio of each audio object may be the same or different, and this is not limited in this embodiment of the application.
The manner of obtaining the right channel target audio is similar to that of the left channel target audio and is not described again here.
Referring again to fig. 9B, the 3D audio object analysis and setting interface may further include other functional virtual keys: trial listening object, trial listening all, and save result.
In some possible cases, S209 is executed in response to the user pressing any one of the trial listening object, trial listening all, and save result keys.
In some possible cases, when the user presses the trial listening object key, only the left channel target audio and right channel target audio of the selected audio object are determined and then played.
In some possible cases, when the user presses the trial listening all key, the left channel target audio and right channel target audio of all audio objects are determined and then played.
In some possible cases, when the user presses the save result key, the left channel target audio and right channel target audio of all audio objects are determined and then saved.
S210, superposing the background sound of the left channel and the left channel target audio of all the audio objects to obtain left channel output audio, and superposing the background sound of the right channel and the right channel target audio of all the audio objects to obtain right channel output audio.
Referring to fig. 11, fig. 11 is a schematic diagram of obtaining a mono output audio according to an embodiment of the present application.
The mono output audio includes left channel output audio and right channel output audio.
As shown in fig. 11, superimposing the background sound of the left channel and the left channel target audio of all the audio objects results in left channel output audio, and superimposing the background sound of the right channel and the right channel target audio of all the audio objects results in right channel output audio.
The background sound of the left channel is the superposition of the first background sound of the left channel and the second background sound of the left channel, and the background sound of the right channel is the superposition of the first background sound of the right channel and the second background sound of the right channel.
Taking the earphone as an example, the left channel output audio is the audio output through the left channel of the earphone, and the right channel output audio is the audio output through the right channel of the earphone.
There may be multiple audio objects, so that the left channel target audio of all audio objects is superimposed and the right channel target audio of all audio objects is superimposed.
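A minimal sketch of the superposition for one channel (the array handling and padding are assumptions, not part of the patent text):

    import numpy as np

    def mix_channel(background, target_audios):
        # background:    the channel's background sound (left or right)
        # target_audios: the same channel's target audio of all audio objects
        length = max(len(background), max(len(t) for t in target_audios))
        out = np.zeros(length)
        out[:len(background)] += background
        for target in target_audios:
            out[:len(target)] += target   # sample-wise superposition
        return out

The left channel output audio would then be mix_channel(left_background, left_targets), and likewise for the right channel.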
In a possible implementation, before the left channel output audio is obtained, some audio processing may be performed on the result of superimposing the left channel background sound and the left channel target audio of all audio objects, and the processed result is then output as the left channel output audio.
The audio processing may include: an equalizer process for controlling the timbre, a dynamic range controller process for controlling the loudness, a limiter process for avoiding clipping, and the like.
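As one hedged example of the last stage, a toy peak limiter (far simpler than a production limiter, and only an illustration of the idea):

    import numpy as np

    def soft_limit(mix, threshold=0.98):
        # Scale the whole mix down when its peak would clip.
        peak = np.max(np.abs(mix))
        return mix * (threshold / peak) if peak > threshold else mix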
In some possible cases, before obtaining the left channel output audio, other processing may be performed, which is not limited in this embodiment of the application.
In some possible cases, with continued reference to fig. 9B, S210 is performed to obtain the left channel output audio and the right channel output audio in response to the user pressing the trial listening all or save result virtual key in the interface shown in fig. 9B.
Further, in response to the user pressing the save result virtual key, the interface shown in fig. 9B may also return to the interface shown in fig. 4A after saving the result.
After the return to the interface shown in fig. 4A, the 3D audio play virtual button 404 and the 3D audio karaoke virtual button 405 are displayed in black (they may originally have been displayed in gray).
The user can play audio by clicking on the 3D audio play virtual button 404.
In some possible implementations, in a practical application scenario, the user may input sound through a sound input device of the electronic device, for example the microphone 170C of the electronic device 100 shown in fig. 1A, as one audio object of the output audio.
Specifically, the user can karaoke (sing along) by clicking the 3D audio karaoke virtual button 405.
Please refer to fig. 9C; fig. 9C is a schematic diagram of a 3D audio karaoke interface provided in an embodiment of the present application.
In some possible implementations, in response to the user pressing the 3D audio karaoke virtual button 405, a jump is made to the 3D audio karaoke interface shown in fig. 9C.
Unlike fig. 9B, the 3D audio karaoke interface shown in fig. 9C includes: a start karaoke virtual key, for starting audio recording in response to a user press; an end karaoke virtual key, for ending audio recording in response to a user press; and a save result virtual key, for saving the karaoke result.
In an actual scenario, audio recording (the user singing karaoke) can be performed while the obtained left channel output audio and right channel output audio are played. For example, the user can sing along with a singer they like.
For another example, if the original audio includes two voices (a first voice and a second voice), the first voice can be removed through the audio object analysis and processing described above, so that the user can sing together with the second voice.
Further, the audio input by the user may itself be one audio object. The user can click the 3D audio object analysis and setting virtual button 403 in the interface shown in fig. 4A to set both the existing audio objects and the added audio object (generated by the user's karaoke).
The specific process of 3D audio object analysis and setting is described above and will not be described herein.
With the technical solution of this embodiment of the application, the audio of each audio object is extracted from the original stereo audio; position parameters including at least a horizontal azimuth, an elevation and a distance are set; head-related transfer functions are then determined from the position parameters; and the audio of each audio object is processed with the head-related transfer functions to obtain audio containing spatial position information, from which the audio output by the left and right channels of the earphone is obtained. With this solution, a listener can perceive not only the position of each audio object relative to the listener but also the positional differences between different audio objects, improving the sense of space during audio playback.
The above steps may be performed by one or more processors in the electronic device, such as the central processing unit CPU, the neural-network processing unit NPU, or the application processor AP.
The embodiment of the application further provides the electronic equipment. The embodiment of the present application does not specifically limit the type of the electronic device, and the electronic device may be a mobile phone, a notebook computer, a wearable electronic device (e.g., a smart watch), a tablet computer, an Augmented Reality (AR) device, a Virtual Reality (VR) device, or the like.
The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores codes, and the processor is used for calling the codes stored in the memory and executing any one of the audio processing methods.
Referring to fig. 1A, the electronic device 100 provided in the embodiment of the present application may include a memory 121 shown in fig. 1A, and the processor may be the processor 110 shown in fig. 1A.
In some possible implementations, the electronic device 100 provided in the embodiment of the present application may include a processor that includes: one or more of a Central Processing Unit (CPU), an Application Processor (AP), a neural-Network Processing Unit (NPU), and the like.
In some possible implementations, the electronic device 100 provided in the embodiments of the present application may include a processor including one or more processors or processing units.
In some possible cases, the electronic device 100 provided in the embodiment of the present application may include a memory in other forms.
With the technical solution of this embodiment of the application, the audio of each audio object is extracted from the original stereo audio, each audio object corresponding to one sound source; a parameter characterizing the elevation of the audio object relative to a listener is set for each audio object, adding position information in the height direction to each audio object's audio; the left channel target audio and the right channel target audio of each audio object are determined from the left channel audio, the right channel audio and the position parameters of each audio object; and finally the left channel target audio of all audio objects is superimposed to obtain the left channel output audio, and the right channel target audio of all audio objects is superimposed to obtain the right channel output audio. Because the audio of each audio object carries information in the height direction, a listener can perceive not only the height of each audio object relative to the listener but also the height differences between different audio objects, improving the sense of space during audio playback. Furthermore, parameters characterizing the horizontal azimuth and the distance of each audio object relative to the listener can also be set, adding further position information and further improving the sense of space during audio playback.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. The above-described apparatus embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is directed to embodiments of the present application and it is understood that various modifications and enhancements may be made by those skilled in the art without departing from the principles of the application and are intended to be included within the scope of the application.

Claims (11)

1. A method of audio processing, the method comprising:
acquiring the type of each audio object in the original audio;
determining a position parameter corresponding to each audio object according to the type of each audio object, wherein the position parameter at least comprises a parameter for representing the elevation angle of the audio object relative to a listener, and each audio object corresponds to an audio source;
responding to the adjustment operation of the user on the position parameters corresponding to the audio objects in the selected audio object parameter setting area, and adjusting the position parameters;
determining a left channel target audio of each audio object and a right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio and the updated position parameter corresponding to each audio object;
and superposing the left channel target audios to obtain left channel output audios, and superposing the right channel target audios to obtain right channel output audios.
2. The method according to claim 1, wherein the determining a left channel target audio of each of the audio objects and a right channel target audio of each of the audio objects according to the audio of each of the audio objects in the left channel original audio, the audio of each of the audio objects in the right channel original audio, and the updated position parameter corresponding to each of the audio objects specifically comprises:
extracting the audio frequency of each audio object in the left channel original audio frequency to obtain a left channel single object audio frequency, and extracting the audio frequency of each audio object in the right channel original audio frequency to obtain a right channel single object audio frequency;
synthesizing a single-channel signal of each audio object according to each left-channel single-object audio and each right-channel single-object audio;
and determining the left channel target audio of each audio object and the right channel target audio of each audio object according to the single-channel signal of each audio object and the updated position parameter corresponding to each audio object.
3. The method according to claim 2, wherein extracting the audio of each audio object in the left channel original audio to obtain a left channel single object audio, and extracting the audio of each audio object in the right channel original audio to obtain a right channel single object audio, specifically comprises:
extracting the audio of the audio objects with different types in the left channel original audio to obtain a left channel single-type audio, and extracting the audio of the audio objects with different types in the right channel original audio to obtain a right channel single-type audio;
and extracting the audio of each audio object in the left channel single-type audio to obtain a left channel single-object audio, and extracting the audio of each audio object in the right channel single-type audio to obtain a right channel single-object audio.
4. The method according to claim 2, wherein the number of the left channel single-object audio and the number of the right channel single-object audio are n, where n is a positive integer, and the synthesizing the mono signal of each audio object according to each of the left channel single-object audio and each of the right channel single-object audio specifically comprises:
determining a correlation of an i-th left channel single-object audio and a j-th right channel single-object audio, where i = 1, 2, …, n and j = 1, 2, …, n, and the correlation of the i-th left channel single-object audio and the j-th right channel single-object audio is used for determining whether the i-th left channel single-object audio and the j-th right channel single-object audio correspond to the same audio object;
and when the correlation of the i-th left channel single-object audio and the j-th right channel single-object audio is greater than a preset correlation threshold, synthesizing the i-th left channel single-object audio and the j-th right channel single-object audio into the single-channel signal of the audio object.
5. The method of claim 1, wherein before determining the left channel target audio of each audio object and the right channel target audio of each audio object according to the audio of each audio object in the left channel original audio, the audio of each audio object in the right channel original audio, and the updated position parameter corresponding to each audio object, the method further comprises:
and respectively extracting the audio of a left channel and the audio of a right channel in the original audio to obtain the original audio of the left channel and the original audio of the right channel.
6. The method according to claim 3, wherein when extracting the audio of the different types of audio objects in the left channel original audio to obtain a left channel single-type audio and obtain a left channel first background sound, and extracting the audio of the different types of audio objects in the right channel original audio to obtain a right channel single-type audio and obtain a right channel first background sound, the superimposing the left channel target audio to obtain a left channel output audio, and the superimposing the right channel target audio to obtain a right channel output audio, specifically comprises:
superimposing each of the left channel target audio and left channel output background sound to obtain the left channel output audio, and superimposing each of the right channel target audio and right channel output background sound to obtain the right channel output audio; wherein the left channel output background sound comprises the left channel first background sound and the right channel output background sound comprises the right channel first background sound.
7. The method of claim 6, wherein, when extracting the audio of the different types of audio objects in the left channel original audio to obtain a left channel single-type audio and a left channel first background sound, extracting the audio of the different types of audio objects in the right channel original audio to obtain a right channel single-type audio and a right channel first background sound, extracting the audio of each audio object in the left channel single-type audio to obtain a left channel single-object audio and a left channel second background sound, and extracting the audio of each audio object in the right channel single-type audio to obtain a right channel single-object audio and a right channel second background sound, the left channel output background sound further comprises the left channel second background sound, and the right channel output background sound further comprises the right channel second background sound.
8. The method according to claim 3, wherein when extracting the audio of each audio object in the left channel single-type audio to obtain a left channel single-object audio and obtain a left channel second background sound, and extracting the audio of each audio object in the right channel single-type audio to obtain a right channel single-object audio and obtain a right channel second background sound, the superimposing the left channel target audios to obtain a left channel output audio, and the superimposing the right channel target audios to obtain a right channel output audio, specifically includes:
superimposing each of the left channel target audio and left channel output background sound to obtain the left channel output audio, and superimposing each of the right channel target audio and right channel output background sound to obtain the right channel output audio; wherein the left channel output background sound comprises the left channel second background sound and the right channel output background sound comprises the right channel second background sound.
9. The method of claim 1, wherein the location parameters further comprise at least one of:
a parameter for characterizing a horizontal azimuth of the audio object with respect to a listener; or
a parameter for characterizing a distance of the audio object relative to a listener.
10. The method of claim 2, wherein when the location parameters further include a parameter characterizing a horizontal azimuth of the audio object relative to the listener and a parameter characterizing a distance of the audio object relative to the listener,
the determining, according to the mono signal of each audio object and the updated position parameter corresponding to each audio object, the left channel target audio of each audio object and the right channel target audio of each audio object specifically includes:
respectively determining Head Related Transfer Functions (HRTFs) corresponding to the audio objects based on the updated position parameters corresponding to the audio objects;
and respectively processing the single-channel signal of each audio object by using the HRTF corresponding to each audio object to obtain the left channel target audio of each audio object and the right channel target audio of each audio object.
11. An electronic device, comprising a processor and a memory, wherein the memory stores code and the processor is configured to invoke the code stored in the memory and perform the method of any of claims 1-10.
CN202210344790.XA 2022-04-02 2022-04-02 Audio processing method and electronic equipment Active CN114501297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210344790.XA CN114501297B (en) 2022-04-02 2022-04-02 Audio processing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN114501297A CN114501297A (en) 2022-05-13
CN114501297B true CN114501297B (en) 2022-09-02

Family

ID=81487696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210344790.XA Active CN114501297B (en) 2022-04-02 2022-04-02 Audio processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN114501297B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055982B (en) * 2022-08-12 2023-11-17 荣耀终端有限公司 Audio output method, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101529501A (en) * 2006-10-16 2009-09-09 杜比瑞典公司 Enhanced coding and parameter representation of multichannel downmixed object coding
CN102438200A (en) * 2010-09-29 2012-05-02 联想(北京)有限公司 Method for outputting audio signals and terminal equipment
CN106658345A (en) * 2016-11-16 2017-05-10 青岛海信电器股份有限公司 Virtual surround sound playing method, device and equipment
US10142760B1 (en) * 2018-03-14 2018-11-27 Sony Corporation Audio processing mechanism with personalized frequency response filter and personalized head-related transfer function (HRTF)
CN110677802A (en) * 2018-07-03 2020-01-10 百度在线网络技术(北京)有限公司 Method and apparatus for processing audio
US10952005B1 (en) * 2019-10-16 2021-03-16 Bose Corporation Stereo paired speaker system with center extraction
CN113889125A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Audio generation method and device, computer equipment and storage medium
CN114203163A (en) * 2022-02-16 2022-03-18 荣耀终端有限公司 Audio signal processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111050271B (en) * 2018-10-12 2021-01-29 北京微播视界科技有限公司 Method and apparatus for processing audio signal
CN110377265A (en) * 2019-06-24 2019-10-25 贵安新区新特电动汽车工业有限公司 Sound playing method and device
CN116208704A (en) * 2021-06-24 2023-06-02 北京荣耀终端有限公司 Sound processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Surround sound production for digital high-definition television films; Wang Jue (王珏); Journal of Beijing Film Academy (北京电影学院学报); 2005-10-25 (No. 5); full text *

Also Published As

Publication number Publication date
CN114501297A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
US10685638B2 (en) Audio scene apparatus
US10645518B2 (en) Distributed audio capture and mixing
US10080094B2 (en) Audio processing apparatus
US9820037B2 (en) Audio capture apparatus
JP4921470B2 (en) Method and apparatus for generating and processing parameters representing head related transfer functions
Blauert Communication acoustics
US9794686B2 (en) Controllable playback system offering hierarchical playback options
US20230239642A1 (en) Three-dimensional audio systems
CN104364842A (en) Stereo audio signal encoder
EP3523801A1 (en) Coding of a soundfield representation
CN114501297B (en) Audio processing method and electronic equipment
US9311925B2 (en) Method, apparatus and computer program for processing multi-channel signals
CN118264971B (en) Speaker-based spatial audio system, audio processor, vehicle, virtual surround sound conversion method, and audio rendering method
WO2018193160A1 (en) Ambience generation for spatial audio mixing featuring use of original and extended signal
CN117202019A (en) Reverberation audio generation method, device, equipment, readable medium and K song earphone

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220608

Address after: 100095 floors 2-14, building 3, yard 5, honeysuckle Road, Haidian District, Beijing

Applicant after: Beijing Honor Device Co.,Ltd.

Address before: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Applicant before: Honor Device Co.,Ltd.

GR01 Patent grant