CN111615045A

CN111615045A - Audio processing method, device, equipment and storage medium

Info

Publication number: CN111615045A
Application number: CN202010578962.0A
Authority: CN
Inventors: Hu Shichao (胡诗超); Zhao Weifeng (赵伟峰)
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2020-09-01
Anticipated expiration: 2040-06-23
Also published as: CN111615045B

Abstract

The application discloses an audio processing method, an audio processing apparatus, an audio processing device and a storage medium, and belongs to the technical field of audio. The method comprises the following steps: acquiring a dual-channel audio signal to be processed, wherein the dual-channel audio signal comprises a left-channel signal and a right-channel signal; determining a single-channel frequency domain signal according to the left-channel signal and the right-channel signal; determining sound field information according to the left-channel signal and the right-channel signal; determining direction information of frequency points in the single-channel frequency domain signal according to the sound field information; classifying the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels, wherein one mask sequence corresponds to one channel and each mask sequence indicates which frequency points of the single-channel frequency domain signal have their frequency domain signals included in the corresponding channel; and determining a multi-channel audio signal based on the mask sequences of the plurality of channels and the single-channel frequency domain signal. The embodiments of the application can reduce the correlation between the frequency domain signals of different channels.

Description

Audio processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of audio technologies, and in particular, to an audio processing method, apparatus, device, and storage medium.

Background

Many audio recordings are produced for a two-channel stereo system. To obtain a better listening experience, such audio is often converted from a two-channel audio signal into a multi-channel audio signal.

At present, a two-channel audio signal can be converted into a multi-channel audio signal using a transformation matrix. For example, the two-channel audio signal may be arranged into a signal matrix, and the signal matrix is multiplied by the transformation matrix to obtain the multi-channel audio signal.

However, once the number of channels of the multi-channel audio signal is fixed, every two-channel audio signal is multiplied by the same transformation matrix to obtain its multi-channel audio signal. As a result, the audio signals of the respective channels in the converted multi-channel audio signal may be highly correlated.

Disclosure of Invention

The embodiments of the application provide an audio processing method, apparatus, device and storage medium, which can solve the problem in the related art that the audio signals of the respective channels are highly correlated. The technical solutions are as follows:

In one aspect, an audio processing method is provided, and the method includes:

acquiring a dual-channel audio signal to be processed, wherein the dual-channel audio signal comprises a left-channel signal and a right-channel signal;

determining a single-channel frequency domain signal according to the left channel signal and the right channel signal;

determining sound field information from the left channel signal and the right channel signal, the sound field information indicating a difference between the signals received by the two ears;

determining direction information of frequency points in the single-channel frequency domain signal according to the sound field information;

classifying the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels, wherein one mask sequence corresponds to one channel, and each mask sequence indicates which frequency points of the single-channel frequency domain signal have their frequency domain signals included in the corresponding channel;

and determining a multichannel audio signal corresponding to the two-channel audio signal according to the mask sequences of the channels and the single-channel frequency domain signal.

In one possible implementation manner of the present application, each of the plurality of channels corresponds to reference direction information and a direction information deviation threshold, and the single-channel frequency domain signal contains a plurality of frequency points;

the classifying the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels includes:

for any channel in the plurality of channels, determining the absolute value of the difference between the direction information of each of the plurality of frequency points and the reference direction information corresponding to that channel, to obtain the direction information deviation between each frequency point and that channel;

and determining the mask value corresponding to each frequency point in the mask sequence of that channel according to the direction information deviation between each frequency point and that channel and the direction information deviation threshold of that channel.

In one possible implementation manner of the present application, determining the mask value corresponding to each frequency point in the mask sequence of any channel according to the direction information deviation between each frequency point and that channel and the direction information deviation threshold of that channel includes:

for any frequency point in the plurality of frequency points, if the direction information deviation between that frequency point and the channel is smaller than the direction information deviation threshold of the channel, determining that the mask value corresponding to that frequency point in the mask sequence of the channel is a first numerical value, where the first numerical value indicates that the frequency domain signal corresponding to that frequency point belongs to the channel;

and if the direction information deviation between that frequency point and the channel is greater than the direction information deviation threshold of the channel, determining that the mask value corresponding to that frequency point in the mask sequence of the channel is a second numerical value, where the second numerical value indicates that the frequency domain signal corresponding to that frequency point does not belong to the channel.

In one possible implementation manner of the present application, determining the multi-channel audio signal corresponding to the two-channel audio signal according to the mask sequences of the plurality of channels and the single-channel frequency domain signal includes:

multiplying the mask sequence corresponding to each channel with the single-channel frequency domain signal to obtain a frequency domain signal of each channel;

performing an inverse Fourier transform on the frequency domain signal of each channel to obtain a time domain signal of each channel;

determining the time domain signals of the plurality of channels as the multi-channel audio signal.
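As a rough illustration of the three steps above, the following Python sketch handles a single STFT frame: it multiplies each channel's mask with the single-channel spectrum and applies an inverse real DFT. It assumes one-sided spectra (bins 0 to N/2 of an even-length frame N) and 0/1 masks; the function names are hypothetical, and a complete implementation would repeat this for every frame and overlap-add the results (an inverse STFT).

```python
import cmath
import math
from typing import List, Sequence


def inverse_real_dft(half_spectrum: Sequence[complex], n: int) -> List[float]:
    """Inverse DFT of a one-sided spectrum (bins 0..n//2) of a real frame of even length n."""
    if n % 2 or len(half_spectrum) != n // 2 + 1:
        raise ValueError("expected an even frame length n and n//2 + 1 bins")
    frame = []
    for t in range(n):
        # DC and Nyquist bins appear once; bins 1..n/2-1 also stand in for their conjugate mirrors.
        acc = half_spectrum[0].real + half_spectrum[n // 2].real * (-1) ** t
        for k in range(1, n // 2):
            acc += 2 * (half_spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real
        frame.append(acc / n)
    return frame


def channel_frames(
    mono_frame: Sequence[complex], masks: Sequence[Sequence[int]], n: int
) -> List[List[float]]:
    """Multiply each channel's mask with the mono spectrum, then return each channel's time frame."""
    frames = []
    for mask in masks:
        channel_spectrum = [m * x for m, x in zip(mask, mono_frame)]
        frames.append(inverse_real_dft(channel_spectrum, n))
    return frames
```

Because the transform is linear, complementary masks such as [1, 1, 0, 0, 0] and [0, 0, 1, 1, 1] split an 8-sample impulse frame into two channel frames whose sum is the original impulse.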

In one possible implementation manner of the present application, the determining sound field information according to the left channel signal and the right channel signal includes:

determining a left channel frequency domain signal corresponding to the left channel signal, and determining a right channel frequency domain signal corresponding to the right channel signal;

determining the binaural intensity difference according to the left channel frequency domain signal and the right channel frequency domain signal;

determining the binaural phase difference according to the left channel frequency domain signal and the right channel frequency domain signal;

determining the binaural intensity difference and the binaural phase difference as the sound field information.

In one possible implementation manner of the present application, the determining the binaural intensity difference according to the left channel frequency domain signal and the right channel frequency domain signal includes:

respectively determining the absolute value of the left channel frequency domain signal and the absolute value of the right channel frequency domain signal;

determining a difference value between the absolute value of the left channel frequency domain signal and the absolute value of the right channel frequency domain signal to obtain a third numerical value;

determining the sum of the absolute value of the left channel frequency domain signal and the absolute value of the right channel frequency domain signal to obtain a fourth numerical value;

and dividing the third numerical value by the fourth numerical value to obtain the binaural intensity difference.

In one possible implementation manner of the present application, the determining the binaural phase difference according to the left channel frequency domain signal and the right channel frequency domain signal includes:

respectively determining the angular frequency of the left channel frequency domain signal and the angular frequency of the right channel frequency domain signal;

and determining the binaural phase difference according to the angular frequency difference between the angular frequency of the left channel frequency domain signal and the angular frequency of the right channel frequency domain signal.

In one possible implementation manner of the present application, the determining a single-channel frequency domain signal according to the left channel signal and the right channel signal includes:

determining the amplitude of the single-channel frequency domain signal according to the left channel frequency domain signal and the right channel frequency domain signal;

determining the angular frequency of the single-channel frequency domain signal according to the left channel frequency domain signal and the right channel frequency domain signal;

and determining the single-channel frequency domain signal according to the amplitude and the angular frequency of the single-channel frequency domain signal.

In another aspect, an audio processing apparatus is provided, the apparatus comprising:

an acquisition module, configured to acquire a dual-channel audio signal to be processed, the dual-channel audio signal comprising a left-channel signal and a right-channel signal;

a first determining module, configured to determine a single-channel frequency domain signal according to the left channel signal and the right channel signal;

a second determining module, configured to determine sound field information from the left channel signal and the right channel signal, the sound field information indicating a difference between the signals received by the two ears;

a third determining module, configured to determine, according to the sound field information, direction information of frequency points in the single-channel frequency domain signal;

a classification module, configured to classify the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels, wherein one mask sequence corresponds to one channel, and each mask sequence indicates which frequency points of the single-channel frequency domain signal have their frequency domain signals included in the corresponding channel;

and a fourth determining module, configured to determine the multi-channel audio signal corresponding to the two-channel audio signal according to the mask sequences of the plurality of channels and the single-channel frequency domain signal.

the classification module is configured to:

In one possible implementation manner of the present application, the classifying module is configured to:

In one possible implementation manner of the present application, the fourth determining module is configured to:

In one possible implementation manner of the present application, the second determining module is configured to:

In one possible implementation manner of the present application, the third determining module is configured to:

In one possible implementation manner of the present application, the first determining module is configured to:

In another aspect, an electronic device is provided, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the audio processing method of the above aspect.

In another aspect, a computer-readable storage medium is provided, which stores instructions that, when executed by a processor, implement the audio processing method of the above aspect.

In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the audio processing method of the above aspect.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

Based on the two-channel audio signal to be processed, a single-channel frequency domain signal and sound field information are determined respectively. The single-channel frequency domain signal is a reference signal used to generate the frequency domain signal of each channel, and the sound field information indicates the difference between the signals received by the two ears. The direction information of each frequency point in the single-channel frequency domain signal is then determined according to the sound field information, and this direction information is used to determine the mask sequences of the multiple channels. Each mask sequence indicates which frequency points of the single-channel frequency domain signal have their frequency domain signals included in the corresponding channel, so the frequency points of the single-channel frequency domain signal are distributed to the channels according to the mask sequences. As a result, frequency points assigned to the same channel have similar direction information, frequency points assigned to different channels have clearly different direction information, and the correlation between the frequency domain signals of different channels is reduced.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a flow diagram illustrating a method of audio processing according to an exemplary embodiment;

FIG. 2 is a schematic diagram illustrating an audio process according to an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating the structure of an audio processing device according to an exemplary embodiment;

FIG. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Before describing the audio processing method provided by the embodiment of the present application in detail, the implementation environment related to the embodiment of the present application is briefly described.

The audio processing method provided by the embodiments of the application can be executed by an electronic device that has an audio processing function. Further, an application capable of processing audio may be installed in the electronic device, and the electronic device may use this application to convert a two-channel audio signal into a multi-channel audio signal. As an example, the electronic device may be a PC (Personal Computer), a mobile phone, a smartphone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart in-vehicle head unit, a smart television, a smart speaker, or the like, which is not limited in this embodiment.

After the description of the implementation environment related to the embodiments of the present application, the following describes in detail an audio processing method provided by the embodiments of the present application with reference to the drawings.

Referring to FIG. 1, FIG. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment. This embodiment is described by taking the method being applied to the electronic device as an example, and the method may include the following implementation steps:

Step 101: Obtaining a dual-channel audio signal to be processed, wherein the dual-channel audio signal comprises a left-channel signal and a right-channel signal.

In general, the two-channel audio signal to be processed may be any two-channel audio signal composed of a left-channel time domain signal and a right-channel time domain signal; that is, the left-channel signal and the right-channel signal are generally time domain signals. It should be noted that the two-channel audio signal to be processed may be audio uploaded by a user, audio stored in the electronic device, audio in the cloud, or the like, which is not limited in this embodiment.

Generally, processing a frequency domain signal is easier than processing a time domain signal. Therefore, a time domain signal to be processed may first be converted into a frequency domain signal, which is then processed. Accordingly, a left channel frequency domain signal corresponding to the left channel time domain signal and a right channel frequency domain signal corresponding to the right channel time domain signal may be determined.

Specifically, the electronic device may perform a Fourier transform on the left channel time domain signal and on the right channel time domain signal respectively, to obtain the left channel frequency domain signal and the right channel frequency domain signal of the two-channel audio signal to be processed.

For example, as shown in FIG. 2, the left channel frequency domain signal may be determined by Xt1 = STFT(x1), and the right channel frequency domain signal may be determined by Xt2 = STFT(x2).

Here, x1 is the left channel time domain signal, x2 is the right channel time domain signal, STFT is the short-time Fourier transform function, Xt1 is the left channel frequency domain signal, and Xt2 is the right channel frequency domain signal.
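A minimal, self-contained sketch of the STFT step, for illustration only: the Hann window, the frame length of 256 and the hop of 128 are assumptions (the text does not specify them), and a naive O(N^2) DFT stands in for the FFT that a practical implementation would use. Each frame keeps only the one-sided bins 0 to N/2, which are assumed here to play the role of the frequency points in the later steps.

```python
import cmath
import math
from typing import List, Sequence


def hann(n: int) -> List[float]:
    # Periodic Hann window (an assumed choice of analysis window).
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def real_dft(frame: Sequence[float]) -> List[complex]:
    """One-sided DFT: bins 0..n//2 of a real frame (naive O(n^2) for self-containment)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def stft(x: Sequence[float], frame_len: int = 256, hop: int = 128) -> List[List[complex]]:
    """One one-sided spectrum per frame; a trailing partial frame is dropped."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        segment = [x[start + i] * window[i] for i in range(frame_len)]
        frames.append(real_dft(segment))
    return frames
```

For a stereo input, Xt1 = stft(x1) and Xt2 = stft(x2) give, for each channel, one list of complex bins per frame.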

It should be noted that, in addition to the short-time Fourier transform, the time domain signal may also be converted into a frequency domain signal by a Modified Discrete Cosine Transform (MDCT), a Discrete Cosine Transform (DCT), or the like, which is not limited in this embodiment.

Step 102: Determining a single-channel frequency domain signal according to the left channel signal and the right channel signal.

The single-channel frequency domain signal may be understood as a reference signal, that is, a frequency domain signal that carries no direction information, and it may be used to generate the frequency domain signal of each channel.

As an example, determining the single-channel frequency domain signal according to the left channel signal and the right channel signal may be implemented as follows: determine a left channel frequency domain signal corresponding to the left channel signal and a right channel frequency domain signal corresponding to the right channel signal; determine the amplitude of the single-channel frequency domain signal according to the left channel frequency domain signal and the right channel frequency domain signal; determine the angular frequency of the single-channel frequency domain signal according to the left channel frequency domain signal and the right channel frequency domain signal; and determine the single-channel frequency domain signal according to its amplitude and angular frequency.

That is, because a frequency domain signal is fully determined by its amplitude and angular frequency, the single-channel frequency domain signal can be obtained once its amplitude and angular frequency have been determined from the left channel and right channel frequency domain signals.

Illustratively, the amplitude of the single-channel frequency-domain signal may be determined by the following equation (1):

wherein Xt1 is the left channel frequency domain signal, Xt2 is the right channel frequency domain signal, and XM is the amplitude of the single channel frequency domain signal.

Illustratively, the angular frequency of the single-channel frequency-domain signal may be determined by the following equation (2):

angM = arctan2(imag(Xt1) + imag(Xt2), real(Xt1) + real(Xt2))    (2)

where Xt1 is the left channel frequency domain signal, Xt2 is the right channel frequency domain signal, angM is the angular frequency (that is, the phase angle at each frequency point) of the single-channel frequency domain signal, imag is the function that returns the imaginary part of a complex number, real is the function that returns the real part of a complex number, and arctan2 is the two-argument arctangent function.
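The following sketch applies equation (2) to one frame of left and right spectra and then combines an amplitude with the resulting angle into a complex single-channel spectrum. Note that arctan2(imag, real) yields the phase angle of each complex bin, which is what the text calls the angular frequency. Equation (1) for the amplitude XM is not reproduced in this text, so `assumed_amplitude` (the mean of the left and right magnitudes) is a hypothetical placeholder, not the patent's formula; the other function names are also hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def mono_angle(xt1: Sequence[complex], xt2: Sequence[complex]) -> List[float]:
    """Equation (2): angM = arctan2(imag(Xt1) + imag(Xt2), real(Xt1) + real(Xt2)) per bin."""
    return [math.atan2(a.imag + b.imag, a.real + b.real) for a, b in zip(xt1, xt2)]


def mono_spectrum(amplitude: Sequence[float], ang_m: Sequence[float]) -> List[complex]:
    """Build the single-channel frequency domain signal from amplitude XM and angle angM."""
    return [cmath.rect(r, phi) for r, phi in zip(amplitude, ang_m)]


def assumed_amplitude(xt1: Sequence[complex], xt2: Sequence[complex]) -> List[float]:
    # Hypothetical stand-in for equation (1), which is not given here:
    # the mean of the left and right magnitudes.
    return [(abs(a) + abs(b)) / 2 for a, b in zip(xt1, xt2)]
```

Because the amplitude is passed in separately, the placeholder can be swapped for the actual equation (1) without touching the phase computation.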

Step 103: Determining sound field information according to the left channel signal and the right channel signal, the sound field information indicating a difference between the signals received by the two ears.

The sound field information may include the Interaural Level Difference (ILD), the Interaural Phase Difference (IPD), the Interaural Time Difference (ITD), and the like, which is not limited in this embodiment.

In general, when the distance from the sound source position to the left ear and the distance from the sound source position to the right ear are different, the time taken for the audio signal to reach the left ear and the right ear after being emitted from the sound source position is different, and the intensities of the audio signals perceived by the left ear and the right ear are also different. In this case, a time difference in which the left and right ears receive the audio signals may be defined as a binaural time difference, an intensity difference of the audio signals perceived by the left and right ears may be defined as a binaural intensity difference, and a phase difference in which the sound waves reach the left and right ears due to the binaural time difference may be defined as a binaural phase difference.

As an example, determining the sound field information according to the left channel signal and the right channel signal may be implemented as follows: determine a left channel frequency domain signal corresponding to the left channel signal and a right channel frequency domain signal corresponding to the right channel signal; determine the binaural intensity difference according to the left channel frequency domain signal and the right channel frequency domain signal; determine the binaural phase difference according to the left channel frequency domain signal and the right channel frequency domain signal; and determine the binaural intensity difference and the binaural phase difference as the sound field information.

That is, when the sound field information includes a binaural intensity difference and a binaural phase difference, the binaural intensity difference and the binaural phase difference may be determined from the left channel frequency domain signal and the right channel frequency domain signal.

As an example, determining the binaural intensity difference according to the left channel frequency domain signal and the right channel frequency domain signal may be implemented as follows: determine the absolute value of the left channel frequency domain signal and the absolute value of the right channel frequency domain signal respectively; determine the difference between these two absolute values to obtain a third numerical value; determine the sum of these two absolute values to obtain a fourth numerical value; and divide the third numerical value by the fourth numerical value to obtain the binaural intensity difference.

That is, the absolute values (magnitudes) of the left channel frequency domain signal and of the right channel frequency domain signal may be computed respectively, and the difference between the intensities of the frequency domain signals perceived by the left ear and the right ear may then be determined from these two absolute values.

Illustratively, the binaural intensity difference may be determined by the following equation (3):

ILD = (abs(Xt1) - abs(Xt2)) / (abs(Xt1) + abs(Xt2))    (3)

where Xt1 is the left channel frequency domain signal, Xt2 is the right channel frequency domain signal, and ILD is the binaural intensity difference.
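The ILD computation described above (the difference of the left and right magnitudes divided by their sum) can be sketched per frequency point as follows. The small `eps` term guarding against division by zero in silent bins is an added assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def interaural_level_difference(
    xt1: Sequence[complex], xt2: Sequence[complex], eps: float = 1e-12
) -> List[float]:
    """(|Xt1| - |Xt2|) / (|Xt1| + |Xt2|) per frequency point, in the range [-1, 1]."""
    ild = []
    for left, right in zip(xt1, xt2):
        mag_l, mag_r = abs(left), abs(right)
        # Third value: mag_l - mag_r; fourth value: mag_l + mag_r.
        # eps (not in the source) keeps silent bins from dividing by zero.
        ild.append((mag_l - mag_r) / (mag_l + mag_r + eps))
    return ild
```

A value near +1 means the bin is almost entirely in the left channel, near -1 almost entirely in the right channel, and 0 means equal magnitudes.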

As an example, determining the binaural phase difference according to the left channel frequency domain signal and the right channel frequency domain signal may be implemented as follows: determine the angular frequency of the left channel frequency domain signal and the angular frequency of the right channel frequency domain signal respectively, and determine the binaural phase difference according to the difference between these two angular frequencies.

That is, the angular frequency (phase angle) of the left channel frequency domain signal and that of the right channel frequency domain signal may be computed respectively, and the phase difference with which the sound wave reaches the left ear and the right ear may then be determined from these two angular frequencies.

Illustratively, the binaural phase difference may be determined by the following equation (4):

where Xt1 is the left channel frequency domain signal, Xt2 is the right channel frequency domain signal, and IPD is the binaural phase difference.

Illustratively, ang(Xt1) and ang(Xt2) can be determined in the same way as equation (2), for example ang(Xt1) = arctan2(imag(Xt1), real(Xt1)).
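A sketch of the binaural phase difference per frequency point. Because equation (4) is not reproduced in this text, the plain difference ang(Xt1) - ang(Xt2) used here is an assumption based on the surrounding description, and the optional wrapping of the result into the range -π to π is an extra assumption the text does not mention. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def angle(x: complex) -> float:
    """ang(X) = arctan2(imag(X), real(X)), analogous to equation (2)."""
    return math.atan2(x.imag, x.real)


def interaural_phase_difference(
    xt1: Sequence[complex], xt2: Sequence[complex], wrap: bool = False
) -> List[float]:
    """Assumed form of equation (4): ang(Xt1) - ang(Xt2) for each frequency point."""
    ipd = []
    for left, right in zip(xt1, xt2):
        diff = angle(left) - angle(right)
        if wrap:
            # Optional, assumed extra step: fold the difference back into [-pi, pi].
            diff = math.atan2(math.sin(diff), math.cos(diff))
        ipd.append(diff)
    return ipd
```

Without wrapping, the difference of two angles in (-π, π] can span almost -2π to 2π, which is why a wrapping step is offered as an option.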

Step 104: Determining the direction information of the frequency points in the single-channel frequency domain signal according to the sound field information.

A frequency point can be understood as a sampling point obtained by sampling the single-channel frequency domain signal. In general, the number of frequency points may be determined by the signal sampling rate, which may be set based on actual conditions: the higher the sampling rate, the more frequency points the single-channel frequency domain signal contains, and the lower the sampling rate, the fewer frequency points it contains.

It should be noted that the number of frequency points in the single-channel frequency domain signal may be one or multiple, and this embodiment does not limit this.

For example, the direction information of the frequency points in the single-channel frequency domain signal can be determined by the following formula (5):

θ(w) = arctan2(ILD(w), IPD(w))    (5)

where w is a frequency point in the single-channel frequency domain signal, IPD(w) is the binaural phase difference at frequency point w, ILD(w) is the binaural intensity difference at frequency point w, θ(w) is the direction information of frequency point w, and arctan2 is the two-argument arctangent function.
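Formula (5) can be sketched as follows. Note the argument order: ILD is the first (y) argument and IPD the second (x) argument of the two-argument arctangent. Converting the result to degrees is an assumption, made so that it can be compared with the degree-valued reference directions in the channel example of Step 105; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def direction_info(
    ild: Sequence[float], ipd: Sequence[float], degrees: bool = True
) -> List[float]:
    """Formula (5): theta(w) = arctan2(ILD(w), IPD(w)) for every frequency point w."""
    theta = [math.atan2(level, phase) for level, phase in zip(ild, ipd)]
    # math.atan2 returns radians in [-pi, pi]; the degree conversion is an assumption.
    return [math.degrees(t) for t in theta] if degrees else theta
```

For example, a point with ILD = 1 and IPD = 1 maps to 45 degrees, and one with ILD = 0 and a positive IPD maps to 0 degrees.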

It should be noted that, in addition to formula (5), the direction information of the frequency points in the single-channel frequency domain signal may also be determined by means of Multiple Signal Classification (MUSIC), a neural network model, Direction of Arrival (DOA) estimation, or the like, which is not limited in this embodiment.

Step 105: Classifying the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels, wherein one mask sequence corresponds to one channel, and each mask sequence indicates which frequency points of the single-channel frequency domain signal have their frequency domain signals included in the corresponding channel.

The number of channels of the multi-channel audio signal generated from the two-channel audio signal to be processed is not limited and can be set according to actual requirements. For example, a multi-channel audio signal with 5 channels or one with 7 channels may be generated.

As an example, each of the plurality of channels corresponds to reference direction information and a direction information deviation threshold, and the single-channel frequency domain signal contains a plurality of frequency points.

The reference direction information and the direction information deviation threshold may be set according to an actual situation, which is not limited in this embodiment.

For example, if the number of channels of the multi-channel audio signal is 5, the reference direction information of channel 1 may be set to 0 degrees, and the direction information deviation threshold of channel 1 may be set to 60 degrees. The reference direction information of the channel 2 is set to positive 30 degrees, and the direction information deviation threshold of the channel 2 is set to 60 degrees. The reference direction information of the channel 3 is set to minus 30 degrees, and the direction information deviation threshold of the channel 3 is set to 60 degrees. The reference direction information of the channel 4 is set to positive 135 degrees, and the direction information deviation threshold of the channel 4 is set to 60 degrees. The reference direction information of the channel 5 is set to minus 135 degrees, and the direction information deviation threshold of the channel 5 is set to 60 degrees.

That is, for any channel, the direction information of the frequency points, together with that channel's reference direction information and direction information deviation threshold, determines which frequency points of the single-channel frequency domain signal have frequency domain signals belonging to the channel and which do not.

As an example, classifying the direction information of the frequency points in the single-channel frequency domain signal to obtain the mask sequences of the plurality of channels may include: for any channel of the plurality of channels, determining the absolute value of the difference between the direction information of each frequency point and the reference direction information of that channel, which gives the direction information deviation between each frequency point and the channel; and then determining the mask value of each frequency point in the mask sequence of the channel according to this direction information deviation and the direction information deviation threshold of the channel.

That is, the reference direction information of the channel is subtracted from the direction information of each frequency point, and the absolute value of the difference is taken as the direction information deviation between that frequency point and the channel. The mask value of each frequency point in the channel's mask sequence, that is, whether the frequency domain signal of that frequency point belongs to the channel, can then be determined from this deviation.

Illustratively, if $\theta(w)$ is the direction information of frequency point $w$ and $\theta(n)$ is the reference direction information of channel $n$, then $\theta(n)$ is subtracted from $\theta(w)$ to obtain $\theta(w) - \theta(n)$, and the absolute value $|\theta(w) - \theta(n)|$ is taken as the direction information deviation between frequency point $w$ and channel $n$. The mask value of frequency point $w$ in the mask sequence of channel $n$ is then determined from this deviation.

For example, if the direction information of frequency point 1 is 65 degrees and the reference direction information of channel 2 is 60 degrees, then $|65 - 60| = 5$, so the direction information deviation between frequency point 1 and channel 2 is 5 degrees, from which the mask value of frequency point 1 in the mask sequence of channel 2 can be determined.
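The deviation computation can be vectorized over all frequency points of the single-channel frequency domain signal. A minimal NumPy sketch, assuming the direction information is an array of angles in degrees; `direction_deviation` is a hypothetical helper name, and the second and third frequency points are made-up values:

```python
import numpy as np


def direction_deviation(theta_w, theta_n: float) -> np.ndarray:
    """Direction information deviation |theta(w) - theta(n)| for every frequency point.

    theta_w: direction information of each frequency point, in degrees.
    theta_n: reference direction information of channel n, in degrees.
    """
    return np.abs(np.asarray(theta_w, dtype=float) - theta_n)


# Frequency point 1 at 65 degrees against channel 2 at 60 degrees,
# plus two hypothetical frequency points at -20 and 150 degrees.
dev = direction_deviation([65.0, -20.0, 150.0], 60.0)
print(dev.tolist())  # [5.0, 80.0, 90.0]
```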

As an example, the mask value of each frequency point in the mask sequence of a channel may be determined from the direction information deviation and the channel's direction information deviation threshold as follows. For any frequency point, if its direction information deviation from the channel is smaller than the channel's direction information deviation threshold, its mask value in the channel's mask sequence is set to a first value, which indicates that the frequency domain signal of the frequency point belongs to the channel. If the deviation is greater than or equal to the threshold, the mask value is set to a second value, which indicates that the frequency domain signal of the frequency point does not belong to the channel.

The first value and the second value may be set according to an actual situation, which is not limited in this embodiment. For example, a first value may be set to 1 and a second value may be set to 0.

In other words, a deviation below the threshold means that the direction information of the frequency point is close to the reference direction information of the channel, so the frequency domain signal of the frequency point is regarded as belonging to the channel and its mask value in the channel's mask sequence is marked as the first value. A deviation at or above the threshold means that the two directions differ considerably, so the frequency domain signal of the frequency point is regarded as not belonging to the channel and its mask value is marked as the second value.

Illustratively, let $\theta(w)$ be the direction information of frequency point $w$, $\theta(n)$ the reference direction information of channel $n$, and $\theta(th\_n)$ the direction information deviation threshold of channel $n$. When $|\theta(w) - \theta(n)| < \theta(th\_n)$, the direction of frequency point $w$ is close to the reference direction of channel $n$, so the frequency domain signal of frequency point $w$ belongs to channel $n$ and its mask value in the mask sequence of channel $n$ is marked as the first value. When $|\theta(w) - \theta(n)| \geq \theta(th\_n)$, the two directions differ considerably, so the frequency domain signal of frequency point $w$ does not belong to channel $n$ and its mask value is marked as the second value.

For example, suppose the reference direction information of channel 2 is +60 degrees and its direction information deviation threshold is 60 degrees. If the direction information of frequency point 1 is +70 degrees, then $|70 - 60| = 10 < 60$, so the direction of frequency point 1 is close to the reference direction of channel 2, the frequency domain signal of frequency point 1 belongs to channel 2, and the mask value of frequency point 1 in the mask sequence of channel 2 is marked as the first value.
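The thresholding rule for a single channel can be sketched as follows, assuming the first value is 1 and the second value is 0 as in the examples in this embodiment. `channel_mask` is a hypothetical helper, and all frequency points other than frequency point 1 are made up; the point at 120 degrees sits exactly on the threshold and therefore receives the second value.

```python
import numpy as np

FIRST_VALUE = 1   # frequency domain signal of the point belongs to the channel
SECOND_VALUE = 0  # frequency domain signal of the point does not belong to the channel


def channel_mask(theta_w, theta_ref: float, theta_th: float) -> np.ndarray:
    """Mask sequence of one channel.

    FIRST_VALUE where |theta(w) - theta(n)| < theta(th_n),
    SECOND_VALUE where the deviation is >= the threshold.
    """
    deviation = np.abs(np.asarray(theta_w, dtype=float) - theta_ref)
    return np.where(deviation < theta_th, FIRST_VALUE, SECOND_VALUE)


# Channel 2: reference +60 degrees, threshold 60 degrees.
# Frequency point 1 is at +70 degrees; the other three are hypothetical.
mask = channel_mask([70.0, 120.0, -10.0, 30.0], theta_ref=60.0, theta_th=60.0)
print(mask.tolist())  # [1, 0, 0, 1]
```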

It should be noted that the frequency domain signal of any frequency point in the single-channel frequency domain signal may belong to one channel or to multiple channels, which is not limited in this embodiment.

For example, suppose the reference direction information of channel 1 is +30 degrees, the reference direction information of channel 2 is +60 degrees, and both channels have a direction information deviation threshold of 60 degrees. If the direction information of frequency point 1 is +70 degrees, then $|70 - 60| = 10 < 60$ and $|70 - 30| = 40 < 60$, so frequency point 1 belongs to both channel 1 and channel 2.
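Because each channel is tested independently against its own reference direction and threshold, the mask sequences of all channels can be computed at once by broadcasting, and a single frequency point may be marked in several of them. A NumPy sketch under the same assumptions (degrees, mask values 1 and 0); `mask_matrix` is a hypothetical name and the second frequency point at -20 degrees is made up:

```python
import numpy as np


def mask_matrix(theta_w, refs, thresholds) -> np.ndarray:
    """Mask sequences of all channels at once.

    Returns an int array of shape (num_channels, num_points); row n is the mask
    sequence of channel n (1 = belongs, 0 = does not belong). Channels are
    tested independently, so a column may contain several 1s.
    """
    theta_w = np.asarray(theta_w, dtype=float)[np.newaxis, :]        # (1, W)
    refs = np.asarray(refs, dtype=float)[:, np.newaxis]              # (N, 1)
    thresholds = np.asarray(thresholds, dtype=float)[:, np.newaxis]  # (N, 1)
    return (np.abs(theta_w - refs) < thresholds).astype(int)


# Channel 1: +30 deg / 60 deg, channel 2: +60 deg / 60 deg.
# Frequency point 1 at +70 deg; frequency point 2 at -20 deg is hypothetical.
masks = mask_matrix([70.0, -20.0], refs=[30.0, 60.0], thresholds=[60.0, 60.0])
print(masks.tolist())  # [[1, 1], [1, 0]]
```

Column 0 shows frequency point 1 assigned to both channels, matching the example above.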

By applying the above procedure to every frequency point, the mask value of each frequency point in the mask sequence of a channel is obtained, which yields the mask sequence of that channel.

Illustratively, let the first value be 1 and the second value be 0, and let the single-channel frequency domain signal include 7 frequency points. If the mask values of frequency points 1 to 7 in the mask sequence of channel 2 are 1, 0, 0, 1, 0, 1 and 1 respectively, the mask sequence of channel 2 is (1, 0, 0, 1, 0, 1, 1).

Step 106: Determine the multi-channel audio signal corresponding to the two-channel audio signal according to the mask sequences of the plurality of channels and the single-channel frequency domain signal.

For any channel, its mask sequence indicates which frequency points of the single-channel frequency domain signal have frequency domain signals belonging to that channel; all frequency domain signals of those frequency points together form the frequency domain signal of the channel.

For example, determining the multi-channel audio signal corresponding to the two-channel audio signal according to the mask sequences of the plurality of channels and the single-channel frequency domain signal may include the following steps 1 to 3:

1. Multiply the mask sequence of each channel by the single-channel frequency domain signal, frequency point by frequency point, to obtain the frequency domain signal of each channel.

For example, let the first value be 1 and the second value be 0, and let the mask sequence of channel 1 be (1, 1, 1, 0, 1). The frequency domain signals of frequency points 1, 2, 3 and 5 in the single-channel frequency domain signal are multiplied by 1, and the frequency domain signal of frequency point 4 is multiplied by 0. The frequency domain signal of channel 1 therefore consists of the frequency domain signals of frequency points 1, 2, 3 and 5.

2. Perform a short-time inverse Fourier transform on the frequency domain signal of each channel to obtain the time domain signal of each channel.

For example, as shown in fig. 2, the time domain signal of channel $n$ may be determined by $x_n = \mathrm{ISTFT}(X_n)$,

where $X_n$ is the frequency domain signal of channel $n$, $x_n$ is the time domain signal of channel $n$, and ISTFT is the short-time inverse Fourier transform function.

3. Determine the time domain signals of the plurality of channels as the multi-channel audio signal.

For example, when the number of channels of the multi-channel audio signal is 5, time domain signals of 5 channels may be determined as the multi-channel audio signal. When the number of channels of the multi-channel audio signal is 7, time domain signals of 7 channels may be determined as the multi-channel audio signal.
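Steps 1 to 3 can be sketched end to end. This is a minimal NumPy sketch, not the embodiment's implementation: the frame length, hop size and square-root Hann window are assumptions, the `stft`/`istft` helpers are hand-written rather than taken from any library, and the mask sequences are assumed to be computed per STFT frame, so each channel's masks form a (frames, bins) array.

```python
import numpy as np

FRAME = 512       # assumed STFT frame length
HOP = FRAME // 2  # assumed 50 % overlap
# Square root of a periodic Hann window, used for both analysis and synthesis.
WINDOW = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME))


def stft(x: np.ndarray) -> np.ndarray:
    """Return a (num_frames, FRAME // 2 + 1) complex spectrogram of x."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def istft(X: np.ndarray, length: int) -> np.ndarray:
    """Inverse of stft(): inverse FFT per frame, synthesis window, overlap-add."""
    frames = np.fft.irfft(X, n=FRAME, axis=-1) * WINDOW
    y = np.zeros(length)
    for i, frame in enumerate(frames):
        y[i * HOP:i * HOP + FRAME] += frame
    return y


def masks_to_multichannel(mono_spec: np.ndarray, masks: np.ndarray, length: int) -> np.ndarray:
    """mono_spec: (frames, bins) single-channel frequency domain signal.
    masks: (channels, frames, bins) mask values (0/1).
    Returns the (channels, length) multi-channel time domain signal."""
    channel_specs = masks * mono_spec[np.newaxis]  # step 1: X_n = mask_n * X
    return np.stack([istft(Xn, length) for Xn in channel_specs])  # steps 2 and 3


rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                      # stand-in single-channel signal
X = stft(x)
masks = rng.integers(0, 2, size=(5,) + X.shape)    # hypothetical 5-channel masks
out = masks_to_multichannel(X, masks, len(x))
print(out.shape)  # (5, 4096)
```

With a square-root periodic Hann window at 50 % overlap, the analysis and synthesis windows multiply to a Hann window whose overlapped copies sum to 1, so samples covered by two frames are reconstructed exactly; the first and last half-frame are not. Because steps 1 to 3 are linear, mask sequences that assign every frequency point to exactly one channel give channel signals that sum back to the original signal in that region.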

In a possible implementation, after the frequency domain signals of the plurality of channels have been determined from the direction information of the frequency points and the single-channel frequency domain signal, the frequency domain features of the frequency points in each channel's frequency domain signal can further be determined, and each channel's frequency domain signal can then be further separated based on these features. This yields frequency domain signals of a plurality of channels determined from the single-channel frequency domain signal, the direction information of the frequency points, and the frequency domain features of the frequency points.

The frequency domain features may include frequency domain energy features, pitch features, frequency features, and the like, which are not limited in this embodiment.

For example, suppose the frequency domain signals of 5 channels have been determined from the direction information of the frequency points and the single-channel frequency domain signal; frequency domain signal separation may then be performed on each of the 5 channels based on the frequency domain features of its frequency points. Taking channel 2 as an example, the pitch features of the frequency points in the frequency domain signal of channel 2 can be determined, and the frequency domain signal of channel 2 can then be separated, based on these pitch features, into the frequency domain signal of a vocal channel and the frequency domain signal of a guitar channel. That is, the pitch features determine which frequency points have frequency domain signals belonging to the vocal channel and which belong to the guitar channel; all frequency domain signals belonging to the vocal channel form the frequency domain signal of the vocal channel, and all frequency domain signals belonging to the guitar channel form the frequency domain signal of the guitar channel.

Further, a short-time inverse Fourier transform may be performed on the frequency domain signals of the plurality of channels determined from the single-channel frequency domain signal, the direction information of the frequency points, and the frequency domain features of the frequency points, to obtain the time domain signals of the plurality of channels, which are then determined as the multi-channel audio signal.

Alternatively, the frequency domain signals of the plurality of channels may first be determined from the single-channel frequency domain signal and the frequency domain features of the frequency points, and the frequency domain signal of each channel may then be further separated based on the direction information of the frequency points it contains, which is not limited in this embodiment.

In this embodiment of the present application, a single-channel frequency domain signal and sound field information are each determined from the two-channel audio signal to be processed. The single-channel frequency domain signal serves as the reference signal from which the frequency domain signal of each channel is generated, and the sound field information indicates the difference between the binaural received signals. The direction information of each frequency point in the single-channel frequency domain signal is determined from the sound field information and is then used to determine the mask sequences of the plurality of channels, each of which indicates which frequency points of the single-channel frequency domain signal have frequency domain signals included in the corresponding channel. The frequency points of the single-channel frequency domain signal are thus distributed to the channels according to the mask sequences, so that frequency points within the same channel have similar direction information while frequency points in different channels have clearly different direction information, which reduces the correlation between the frequency domain signals of different channels.

Fig. 3 is a schematic diagram illustrating the structure of an audio processing apparatus according to an exemplary embodiment, which may be implemented by software, hardware, or a combination of both. The audio processing apparatus may include:

an obtaining module 310, configured to obtain a two-channel audio signal to be processed, where the two-channel audio signal includes a left-channel signal and a right-channel signal;

a first determining module 320, configured to determine a single-channel frequency domain signal according to the left channel signal and the right channel signal;

a second determining module 330 configured to determine sound field information from the left channel signal and the right channel signal, the sound field information indicating a difference of binaural received signals;

a third determining module 340, configured to determine, according to the sound field information, direction information of frequency points in the single-channel frequency domain signal;

a classifying module 350, configured to classify the direction information of the frequency points in the single-channel frequency domain signal to obtain mask sequences of a plurality of channels, where one mask sequence corresponds to one channel and each mask sequence indicates which frequency points of the single-channel frequency domain signal have frequency domain signals included in the corresponding channel;

a fourth determining module 360, configured to determine, according to the mask sequences of the multiple channels and the single-channel frequency domain signal, a multi-channel audio signal corresponding to the two-channel audio signal.

the classification module 350 is configured to:

In one possible implementation manner of the present application, the classifying module 350 is configured to:

In a possible implementation manner of the present application, the fourth determining module 360 is configured to:

In a possible implementation manner of the present application, the second determining module 330 is configured to:

In a possible implementation manner of the present application, the third determining module 340 is configured to:

In one possible implementation manner of the present application, the first determining module 320 is configured to:

It should be noted that the audio processing apparatus provided in the foregoing embodiment is described using the above division of functional modules only as an example. In practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio processing apparatus and the audio processing method provided in the foregoing embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

Fig. 4 is a schematic structural diagram of an electronic device 400 provided in an embodiment of the present application. The electronic device 400 may be a portable mobile terminal, such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 400 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the electronic device 400 includes: a processor 401 and a memory 402.

Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the audio processing method provided by the method embodiments herein.

Of course, the electronic device 400 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the electronic device 400 may further include other components for implementing device functions, which are not described herein again.

An embodiment of the present application further provides a non-transitory computer-readable storage medium, and when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the audio processing method provided in the embodiment shown in fig. 1.

Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the audio processing method provided in the embodiment shown in fig. 1.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method of audio processing, the method comprising:

2. The method of claim 1, wherein each channel of the plurality of channels corresponds to a reference direction information and a direction information deviation threshold; the number of the frequency points in the single-channel frequency domain signal is multiple;

3. The method according to claim 2, wherein the determining a mask value corresponding to each frequency point in a mask sequence corresponding to any channel according to a direction information deviation between each frequency point and any channel and a direction information deviation threshold of any channel includes:

4. The method of claim 3, wherein determining a multichannel audio signal to which the two-channel audio signal corresponds based on the mask sequence of the plurality of channels and the single-channel frequency-domain signal comprises:

5. The method of claim 1, wherein determining sound field information from the left channel signal and the right channel signal comprises:

6. A method as recited in claim 5, wherein said determining the binaural intensity difference from the left channel frequency domain signal and the right channel frequency domain signal comprises:

7. A method as recited in claim 5, wherein said determining the binaural phase difference from the left channel frequency domain signal and the right channel frequency domain signal comprises:

8. The method of claim 1, wherein determining a single channel frequency domain signal from the left channel signal and the right channel signal comprises:

9. An audio processing apparatus, characterized in that the apparatus comprises:

10. The apparatus of claim 9, wherein each channel of the plurality of channels corresponds to reference direction information and a direction information deviation threshold; the number of the frequency points in the single-channel frequency domain signal is multiple;

the classification module is configured to:

11. The apparatus of claim 10, wherein the categorization module is to:

12. The apparatus of claim 11, wherein the fourth determination module is to:

13. The apparatus of claim 9, wherein the second determination module is to:

14. The apparatus of claim 13, wherein the third determination module is to:

15. The apparatus of claim 13, wherein the third determination module is to:

16. The apparatus of claim 9, wherein the first determination module is to:

17. An electronic device, comprising:

a processor;

a memory storing instructions executable by the processor;

wherein the processor is configured to execute the instructions and to implement the steps of any of the methods of claims 1-8.

18. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of any of the methods of claims 1-8.