CN112750444A - Sound mixing method and device and electronic equipment - Google Patents

Sound mixing method and device and electronic equipment

Info

Publication number
CN112750444A
Authority
CN
China
Prior art keywords: audio input, input signal, perceptual, audio, information
Legal status: Granted
Application number: CN202010621654.1A
Other languages: Chinese (zh)
Other versions: CN112750444B (en)
Inventor: 梁俊斌 (Liang Junbin)
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority claimed from application CN202010621654.1A
Publication of CN112750444A
Application granted; publication of CN112750444B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants, characterised by the process used


Abstract

The present application relates to the field of audio processing technologies, and in particular, to a sound mixing method, a sound mixing apparatus, a computer-readable medium, and an electronic device. The sound mixing method comprises the following steps: acquiring at least two paths of audio input signals, and respectively acquiring power information of each path of audio input signal; acquiring loudness information related to the frequency of the audio input signal, and respectively weighting the power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal; respectively performing numerical adjustment on the perceptual quantization information of each path of audio input signal to determine a perceptual equalization weight for reducing the perceptual difference between the paths of audio input signals; and superposing the at least two paths of audio input signals according to the perceptual equalization weight of each path of audio input signal to obtain mixed audio. The actual perceived effect after mixing can be adjusted by adjusting the perceptual equalization weights, so that as far as possible the sound of no channel is completely masked after mixing and the perceptibility of each path of audio signal is improved.

Description

Sound mixing method and device and electronic equipment
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a sound mixing method, a sound mixing apparatus, a computer-readable medium, and an electronic device.
Background
With the development of computer and network technologies, traditional telecommunication network communication and internet VoIP communication applications can largely meet people's group-oriented social needs, such as multi-user audio and video conferencing, multi-user online live broadcasting, and multi-user real-time voice chat in online games. The core technology for realizing multi-user voice calls is audio mixing.
Sound has an acoustic masking property: the human ear can distinguish slight sounds in a quiet environment, but in a noisy environment these slight sounds are drowned out by the noise. In an application scenario of multi-person communication, as the number of speakers increases, the final mixed and superimposed sound becomes relatively noisy, and some of the voices in it, such as those with a lower pitch and smaller volume, will be difficult to hear after mixing. Therefore, how to prevent the sounds in mixed audio from masking one another is a problem to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
An object of the present application is to provide a mixing method, a mixing apparatus, a computer-readable medium, and an electronic device, which overcome, at least to some extent, the problem of mutual masking of sounds present in mixed audio.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a mixing method, including:
acquiring at least two paths of audio input signals, and respectively acquiring power information of each path of audio input signal;
obtaining loudness information related to the frequency of the audio input signal, and respectively carrying out weighted summation on power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal;
respectively carrying out numerical adjustment on the perception quantization information of each path of audio input signal to determine a perception equalization weight for reducing perception difference between each path of audio input signal;
and performing superposition processing on the at least two paths of audio input signals according to the perceptual equalization weight of each path of audio input signal to obtain mixed audio.
According to an aspect of an embodiment of the present application, there is provided a mixing apparatus including:
the power acquisition module is configured to acquire at least two paths of audio input signals and respectively acquire power information of each path of audio input signal;
the perception quantization module is configured to acquire loudness information related to the frequency of the audio input signal, and perform weighted summation on power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perception quantization information of the audio input signal;
the perceptual equalization module is configured to perform numerical adjustment on perceptual quantization information of the audio input signals respectively to determine perceptual equalization weights for reducing perceptual differences among the audio input signals;
and the signal superposition module is configured to superpose the at least two paths of audio input signals according to the perceptual equalization weight of each path of audio input signal to obtain mixed audio.
In some embodiments of the present application, based on the above technical solutions, the power obtaining module includes:
the framing processing unit is configured to perform framing processing on each path of audio input signal respectively to obtain an audio data frame of the audio input signal;
a windowing processing unit configured to perform windowing processing on the audio data frame to obtain a windowed framed signal of the audio input signal;
a frequency domain converting unit configured to convert the windowed framed signal from a time domain to a frequency domain to obtain power information of the audio input signal.
In some embodiments of the present application, based on the above technical solutions, the windowing processing unit includes:
a window function obtaining subunit configured to obtain a window function for performing windowing processing on the audio data frame, where the window function is a hamming window or a hanning window;
a window function point multiplication subunit configured to point-multiply the window function with the audio data frame to obtain a windowed framed signal of the audio input signal.
In some embodiments of the present application, based on the above technical solutions, the frequency domain converting unit includes:
a spectrum acquisition subunit configured to perform Fourier transform on the time domain-based windowed framed signal to obtain frequency domain-based spectrum information;
an energy spectrum determination subunit configured to determine energy spectrum information of the audio input signal from the magnitude in the spectral information;
a first power determination subunit configured to obtain a framing time length of the windowed framing signal and determine power information of the audio input signal according to the energy spectrum information and the framing time length.
In some embodiments of the present application, based on the above technical solutions, the frequency domain converting unit includes:
an autocorrelation function obtaining subunit configured to obtain an autocorrelation function of the windowed framing information in a time domain;
a second power determination subunit configured to perform a Fourier transform on the autocorrelation function to obtain power information of the audio input signal based on a frequency domain.
In some embodiments of the present application, based on the above technical solution, the perceptual quantization module includes:
an equal loudness curve acquisition unit configured to acquire equal loudness curve data representing a mapping relationship between a sound pressure level and a frequency;
an equal loudness curve interpolation unit configured to interpolate the equal loudness curve data to obtain loudness information related to a frequency of the audio input signal.
In some embodiments of the present application, based on the above technical solution, the perceptual quantization module further includes:
a frequency point determination unit configured to determine a lower frequency point and an upper frequency point adjacent to a frequency of the audio input signal in the equal loudness curve data;
a parameter query unit configured to query the equal-loudness curve data to obtain a reference frequency parameter and a reference sound pressure parameter of the lower frequency point and the upper frequency point;
a parameter interpolation unit configured to perform interpolation processing on the reference frequency parameter and the reference sound pressure parameter to obtain an interpolated frequency parameter and an interpolated sound pressure parameter related to the frequency of the audio input signal, respectively;
a loudness determination unit configured to determine loudness information related to a frequency of the audio input signal from the interpolated frequency parameter and the interpolated sound pressure parameter.
In some embodiments of the present application, based on the above technical solution, the perceptual quantization module includes:
an exponential processing unit configured to perform exponential processing on the loudness information to obtain perceptual weighting coefficients of the audio input signal;
and the weighted summation unit is configured to perform weighted summation on the perceptual weighting coefficient and the power information corresponding to each frequency point in the audio input signal to obtain perceptual quantization information of the audio input signal.
In some embodiments of the present application, based on the above technical solutions, the perceptual equalizing module includes:
a first smoothing filtering unit configured to smooth-filter the perceptual quantization information to obtain a perceptual smoothed value of the audio input signal;
a smoothing proportion determining unit configured to compare the perceptual smoothing values of the paths of audio input signals to obtain a maximum smoothing value, and to determine a perceptual smoothing proportion between the maximum smoothing value and each perceptual smoothing value;
a second smoothing filtering unit configured to smooth-filter the perceptual smoothing scale to obtain perceptual equalization weights for reducing perceptual differences between the respective audio input signals.
In some embodiments of the present application, based on the above technical solutions, the first smoothing filter unit includes:
a first information obtaining subunit, configured to obtain a perceptual smoothing value of a previous signal frame and perceptual quantization information of a current signal frame in the audio input signal;
a first factor obtaining subunit configured to obtain a first smoothing factor for performing smoothing filtering on the perceptual quantization information;
a first smoothing filter subunit configured to perform weighted summation on the perceptual smoothing value of the previous signal frame and the perceptual quantization information of the current signal frame according to the first smoothing factor to obtain a perceptual smoothing value of the current signal frame.
In some embodiments of the present application, based on the above technical solution, the second smoothing filter unit includes:
a second information obtaining subunit configured to obtain the perceptual equalization weight of a previous signal frame and the perceptual smoothing proportion of a current signal frame in the audio input signal;
a second factor obtaining subunit configured to obtain a second smoothing factor for performing smoothing filtering on the perceptual smoothing scale;
a second smoothing filter subunit configured to perform weighted summation on the perceptual equalization weight of the previous signal frame and the perceptual smoothing proportion of the current signal frame according to the second smoothing factor to obtain the perceptual equalization weight of the current signal frame.
In some embodiments of the present application, based on the above technical solutions, the signal superposition module includes:
a power weighting unit configured to perform weighting processing on the power information of the audio input signal according to the perceptual equalization weight to obtain equalized power information of the audio input signal;
a time domain conversion unit configured to convert the equalized power information from a frequency domain to a time domain to obtain an equalized audio signal of the audio input signal;
a superposition processing unit configured to perform superposition processing on the equalized audio signals of the at least two audio input signals to obtain mixed audio.
In some embodiments of the present application, based on the above technical solutions, the superposition processing unit includes:
a linear superposition subunit configured to linearly superpose equalized audio signals of the at least two audio input signals to obtain a linear superposition signal;
a factor obtaining subunit configured to obtain a value range quantization factor for determining the signal value range of the mixed audio and a basic contraction factor for contracting the mixed audio into the signal value range;
a down-mixing subunit configured to down-mix the linear superposition signal according to the value range quantization factor and the basic contraction factor to obtain the mixed audio.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the mixing method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the mixing method as in the above technical solution via executing the executable instructions.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable medium. The processor of the computer device reads the computer instruction from the computer readable medium, and the processor executes the computer instruction, so that the computer device executes the mixing method provided in the above technical solution.
In the technical solution provided by the embodiments of the present application, the auditory perception level of each path of audio input signal can be obtained based on auditory perception analysis of the audio input signals to be mixed. For audio input signals with different auditory perception levels, the actual perceived effect after mixing can be adjusted by adjusting the perceptual equalization weights, so that as far as possible the sound of no channel is completely masked after mixing, the perceptibility of each path of audio signal is improved, each path of audio input signal can obtain a relatively balanced output effect at the receiving end, and the reliability and stability of the mixed audio in data transmission are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
In the drawings:
fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
Fig. 2 schematically illustrates a flow chart of steps of a mixing method in some embodiments of the present application.
Fig. 3 schematically illustrates a flow chart of method steps for obtaining power information in some embodiments of the present application.
Fig. 4 schematically shows a graph of equal loudness curves measured by the international acoustics standards organization.
Fig. 5 schematically illustrates a flow chart of method steps for deriving loudness information based on interpolation processing in some embodiments of the present application.
Fig. 6 schematically shows a mapping relationship between perceptual weighting coefficients and frequencies in some embodiments of the present application.
Fig. 7 schematically illustrates a flow chart of method steps for determining perceptual equalization weights in some embodiments of the present application.
Fig. 8 schematically shows a block diagram of a mixing apparatus according to an embodiment of the present application.
FIG. 9 schematically illustrates a block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. The terminal device 110 may include various electronic devices such as a smart phone, a tablet computer, a notebook computer, and a desktop computer. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, such as a wired communication link or a wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by both the terminal device 110 and the server 130, which is not particularly limited in this application.
For example, an audio client for performing a voice call may be installed on the terminal device 110, and the user may send a voice signal to the server 130 through the audio client, and the server 130 may forward the voice signal to other users in the multi-party call. Meanwhile, each user may receive the voice signals forwarded by the server 130, which are uttered by other users, at the audio client. In some application scenarios, the audio mixing method provided in the embodiment of the present application may be executed locally by the terminal device 110, that is, after receiving the voice signals of multiple channels forwarded by the server 130, the voice signals are mixed and played; in other application scenarios, the audio mixing method provided in the embodiment of the present application may also be executed by the server 130, that is, the server 130 mixes the received voice signals of the multiple channels and then forwards the mixed voice signals to the corresponding terminal devices 110, so as to play the mixed audio to the user.
The technical solution can be implemented based on cloud computing technology. Cloud computing is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, the resources in the "cloud" appear to be infinitely expandable, available at any time, obtainable on demand, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short) is established, and multiple types of virtual resources are deployed in the resource pool and are used by external customers selectively. The cloud computing resource pool mainly comprises: computing devices (virtualized machines, including operating systems), storage devices, network devices.
According to the logical function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS covers various kinds of business software, such as web portals and bulk SMS services. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The cloud conference is an efficient, convenient and low-cost conference form based on a cloud computing technology. A user can share voice, data files and videos with teams and clients all over the world quickly and efficiently only by performing simple and easy-to-use operation through an internet interface, and complex technologies such as transmission and processing of data in a conference are assisted by a cloud conference service provider to operate.
At present, cloud conferences mainly focus on service content in the SaaS mode, including service forms such as telephone, network, and video; a video conference based on cloud computing is called a cloud conference.
In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, users do not need to purchase expensive hardware and install complicated software, and efficient teleconferencing can be performed only by opening a browser and logging in a corresponding interface.
The cloud conference system supports multi-server dynamic cluster deployment and provides multiple high-performance servers, which greatly improves conference stability, security, and availability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management; it has been widely applied in government, military, transportation, finance, operators, education, enterprises, and many other fields. After video conferencing adopts cloud computing, it undoubtedly becomes even more attractive in terms of convenience, speed, and ease of use, which is bound to stimulate a new wave of video conference applications.
The following describes technical solutions of the mixing method, the mixing apparatus, the computer-readable medium, and the electronic device provided in the present application in detail with reference to specific embodiments.
Fig. 2 schematically illustrates a flowchart of steps of a mixing method in some embodiments of the present application, where the mixing method may be executed by a terminal device, a server, or both the terminal device and the server, and this is not particularly limited in this embodiment of the present application. As shown in fig. 2, the mixing method may mainly include the following steps S210 to S240.
Step S210: acquire at least two paths of audio input signals, and respectively acquire power information of each path of audio input signal.
Step S220: obtain loudness information related to the frequency of the audio input signal, and respectively perform weighted summation on the power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal.
Step S230: respectively perform numerical adjustment on the perceptual quantization information of each path of audio input signal to determine a perceptual equalization weight for reducing the perceptual difference between the paths of audio input signals.
Step S240: superpose the at least two paths of audio input signals according to the perceptual equalization weight of each path of audio input signal to obtain mixed audio.
In the audio mixing method provided by the embodiments of the present application, the auditory perception level of each path of audio input signal can be obtained based on auditory perception analysis of the audio input signals to be mixed. For audio input signals with different auditory perception levels, the actual perceived effect after mixing can be adjusted by adjusting the perceptual equalization weights, so that as far as possible the sound of no channel is completely masked after mixing, each path of audio input signal can obtain a relatively balanced output effect at the receiving end, the reliability and stability of the mixed audio in data transmission are improved, and the perceptibility of each path of audio input signal is also improved.
The following is a detailed description of each step of the mixing method in the above embodiment.
In step S210, at least two audio input signals are obtained, and power information of each audio input signal is obtained respectively.
The audio input signal is a signal to be mixed acquired from different sources. For example, in a multi-party call scenario, each party may collect their respective voice signals through a terminal device, and a voice signal from one party is a path of audio input signal.
For each audio input signal, in order to quantitatively characterize auditory perception, the corresponding power information needs to be acquired first. The power information may be represented, for example, as a power spectral density function, power spectrum for short. The power spectrum is used to characterize the variation of the power of the audio input signal with frequency, i.e., the distribution of the signal power in the frequency domain. Fig. 3 schematically illustrates a flow chart of method steps for obtaining power information in some embodiments of the present application. As shown in fig. 3, on the basis of the above embodiment, step S210 of respectively acquiring the power information of each path of audio input signal may include the following steps S310 to S330.
And S310, performing framing processing on each path of audio input signal respectively to obtain an audio data frame of the audio input signal.
Each audio input signal is a signal that varies continuously in the time domain, and the audio input signal is unstable as a whole and can be considered to be relatively stable locally, so that the audio input signal can be firstly framed to form audio data frames corresponding to different time segments. For example, the embodiment of the application can divide the audio input signal according to the time nodes to obtain the audio data frames with the time length of 10ms or 20 ms. In addition, the embodiment of the present application may configure a certain frame shift during the framing process, where the frame shift indicates a time interval between frame start points in two adjacent audio data frames, for example, the frame shift may be one half of a time length of one audio data frame. By configuring the frame shift, a certain data overlap between two adjacent audio data frames can be achieved to maintain a smooth transition of the adjacent audio data frames in the data characteristics.
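As a concrete illustration, a minimal framing sketch in Python, assuming a 16 kHz sample rate, 20 ms frames and a 10 ms frame shift; the function and parameter names are hypothetical:

```python
import numpy as np

def split_into_frames(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a 1-D audio signal into overlapping audio data frames."""
    frame_len = sample_rate * frame_ms // 1000    # samples per frame, e.g. 320
    frame_shift = sample_rate * shift_ms // 1000  # start-to-start interval (50% overlap)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift  # assumes len(signal) >= frame_len
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])
```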
Step S320, performing windowing processing on the audio data frame to obtain a windowed framing signal of the audio input signal.
After the audio input signal is subjected to framing processing, each obtained audio data frame has a discontinuity at its beginning and end, so the more audio data frames the signal is divided into, the greater the error relative to the original audio input signal. Windowing the audio data frames reduces the errors caused by framing, so that the framed signal becomes continuous and each audio data frame exhibits the characteristics of a periodic function, facilitating subsequent feature analysis. In some optional embodiments, in this step, a window function for windowing the audio data frame may be obtained, where the window function is a Hamming window or a Hanning window, and the window function is then point-multiplied with the audio data frame to obtain the windowed framed signal of the audio input signal.
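A minimal sketch of the point multiplication described above, assuming the audio data frames from the previous step are stacked row-wise in a NumPy array; the names are illustrative:

```python
import numpy as np

def window_frames(frames):
    """Point-multiply each audio data frame with a Hamming window."""
    win = np.hamming(frames.shape[1])  # np.hanning(...) would give a Hanning window
    return frames * win                # broadcasting applies the window to every frame
```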
Step S330, converting the windowed frame-divided signal from the time domain to the frequency domain to obtain the power information of the audio input signal.
The windowed framed signal is a time domain based signal; after conversion from the time domain to the frequency domain, the power information of the audio input signal can be obtained. The conversion from the time domain to the frequency domain may be performed by applying a Fourier transform to the signal. Depending on the processing object of the Fourier transform, the power information can be acquired in two embodiments.
One optional implementation of obtaining power information includes: carrying out Fourier transform on the windowed framing signal based on the time domain to obtain frequency spectrum information based on the frequency domain; determining energy spectrum information of the audio input signal according to the amplitude values in the frequency spectrum information; and acquiring the framing time length of the windowed framing signal, and determining the power information of the audio input signal according to the energy spectrum information and the framing time length.
The frequency spectrum information is used for representing the corresponding relation between the amplitude and the frequency of the signal, the energy spectrum information of the audio input signal can be obtained after the amplitude in the frequency spectrum information is squared, and then the power information of the audio input signal can be obtained by averaging the energy spectrum information in the frame time length.
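A periodogram-style sketch of this first route, assuming a 16 kHz sample rate; the names are illustrative:

```python
import numpy as np

def power_from_spectrum(windowed_frame, sample_rate=16000):
    """Power information: squared spectral magnitude averaged over the frame length."""
    spectrum = np.fft.rfft(windowed_frame)   # frequency spectrum (amplitude vs. frequency)
    energy = np.abs(spectrum) ** 2           # energy spectrum: squared magnitudes
    frame_seconds = len(windowed_frame) / sample_rate
    return energy / frame_seconds            # power at each frequency point
```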
Another alternative embodiment of obtaining power information includes: acquiring the autocorrelation function of the windowed framing information in the time domain; and performing a Fourier transform on the autocorrelation function to obtain the power information of the audio input signal based on the frequency domain.
The autocorrelation function is an average measure of the characteristics of a signal in the time domain and is used to describe the degree of correlation between the values of the signal at any two different time instances. Performing a fourier transform on the autocorrelation function of the windowed framing information may directly obtain power information of the audio input signal based on the frequency domain.
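A sketch of this second route, using a one-sided biased autocorrelation; taking the magnitude of the transform is a pragmatic simplification, and the names are illustrative:

```python
import numpy as np

def power_from_autocorrelation(windowed_frame):
    """Wiener-Khinchin route: Fourier transform of the autocorrelation function."""
    n = len(windowed_frame)
    # Biased autocorrelation for lags 0..n-1, computed directly in the time domain.
    acf = np.array([np.dot(windowed_frame[:n - lag], windowed_frame[lag:]) / n
                    for lag in range(n)])
    return np.abs(np.fft.rfft(acf))          # frequency-domain power information
```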
The audio input signal can be converted from the time domain to the frequency domain by performing the above steps S310 to S330, and the power information used for feature analysis can be acquired. For example, each audio data frame in the audio input signal may include a plurality of frequency points (bins), and the absolute power value of the s-th frequency point in the i-th audio data frame may be denoted as p(i, s), where s = 0, 1, …, K−1 and K is the total number of frequency points in the audio data frame.
In step S220, loudness information related to the frequency of the audio input signal is obtained, and the power information corresponding to each frequency point in the audio input signal is weighted according to the loudness information, so as to obtain perceptual quantization information of the audio input signal.
The purpose of the auditory perception quantization processing is to quantize human auditory perception of different audio input signals so as to be used for carrying out equalization processing according to auditory perception subsequently. The "loudness" varies mainly with the intensity of the sound, but is also affected by the frequency, and sounds of the same intensity and different frequencies have different auditory perceptions for the human ear.
In an embodiment of the present application, a method of obtaining loudness information related to a frequency of an audio input signal may include: acquiring equal loudness curve data used for representing the mapping relation between the sound pressure level and the frequency; the equal loudness curve data is interpolated to obtain loudness information related to the frequency of the audio input signal.
The equal loudness curve describes the relationship between sound pressure level and sound wave frequency under equal loudness conditions and is one of the important auditory characteristics. From the equal loudness curve it can be determined what sound pressure level pure tones at different frequencies need to reach in order to produce the same auditory loudness for the listener. Fig. 4 schematically shows a graph of equal loudness curves measured by the international acoustics standards organization. As can be seen from the figure, the equal loudness curves exhibit similar characteristics across frequency. In the middle and low frequency range (below 1 kHz), the lower the frequency, the higher the sound pressure (energy) required for equal loudness; that is, more sound energy is needed for the human ear to obtain the same auditory perception. In the middle and high frequency range (above 1 kHz), different frequency ranges exhibit different auditory perception characteristics as the frequency changes.
Quantized equal loudness curve data can be acquired based on the equal loudness curve, and loudness information related to the frequency of the audio input signal can be obtained by performing interpolation processing on the equal loudness curve data. Fig. 5 schematically illustrates a flow chart of method steps for deriving loudness information based on interpolation processing in some embodiments of the present application. As shown in fig. 5, by performing interpolation processing on the equal loudness curve data to obtain loudness information related to the frequency of the audio input signal, steps S510 to S540 may be included as follows.
And S510, determining a lower frequency point and an upper frequency point which are adjacent to the frequency of the audio input signal in the equal loudness contour data.
The equal loudness curve data is discretized quantized data, and for each frequency value in the audio input signal, a lower frequency point and an upper frequency point adjacent to the frequency value can be searched in the equal loudness curve data. The lower frequency point is a frequency point which is smaller than the frequency value of the audio input signal and is nearest to the audio input signal, and the upper frequency point is a frequency point which is larger than the frequency value of the audio input signal and is nearest to the audio input signal.
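A sketch of locating the two bracketing frequency points in a sorted table of equal loudness curve frequencies; the function and array names are assumptions:

```python
import numpy as np

def bracket_frequency(freq, table_freqs):
    """Return indices (j-1, j) of the lower/upper table frequencies adjacent to freq."""
    j = int(np.searchsorted(table_freqs, freq))  # first index with table frequency >= freq
    j = min(max(j, 1), len(table_freqs) - 1)     # clamp so both neighbours exist
    return j - 1, j
```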
And S520, querying the equal loudness curve data to obtain the reference frequency parameters and the reference sound pressure parameters of the lower frequency point and the upper frequency point.
The reference frequency parameter is a frequency-related parameter and may include a first frequency point parameter a_f (in phon/dB) and a second frequency point parameter b_f (in dB^(−1)). For example, the first frequency point parameter of the lower frequency point j−1 is a_f(j−1), and the second frequency point parameter of the lower frequency point j−1 is b_f(j−1); the first frequency point parameter of the upper frequency point j is a_f(j), and the second frequency point parameter of the upper frequency point j is b_f(j).
The reference sound pressure parameter is a sound pressure related parameter and may include a threshold sound pressure level T_f (in dB). For example, the threshold sound pressure level of the lower frequency point j−1 is T_f(j−1), and the threshold sound pressure level of the upper frequency point j is T_f(j).
Step S530, interpolation processing is respectively carried out on the reference frequency parameter and the reference sound pressure parameter so as to obtain an interpolation frequency parameter and an interpolation sound pressure parameter which are related to the frequency of the audio input signal.
For the first frequency point parameter, interpolation processing can be performed through the following formula to obtain the first interpolated frequency parameter a_fy:

a_fy = a_f(j−1) + [(freq − f_f(j−1)) / (f_f(j) − f_f(j−1))] · [a_f(j) − a_f(j−1)]

where freq is the frequency value of the audio input signal, f_f(j−1) is the frequency value of the lower frequency point j−1, and f_f(j) is the frequency value of the upper frequency point j.
For the second frequency point parameter, interpolation processing can be performed through the following formula to obtain the second interpolated frequency parameter b_fy:

b_fy = b_f(j−1) + [(freq − f_f(j−1)) / (f_f(j) − f_f(j−1))] · [b_f(j) − b_f(j−1)]
The threshold sound pressure level can be interpolated through the following formula to obtain the interpolated sound pressure parameter T_fy:

T_fy = T_f(j−1) + [(freq − f_f(j−1)) / (f_f(j) − f_f(j−1))] · [T_f(j) − T_f(j−1)]
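All three parameters use the same linear interpolation between the bracketing points, so a single helper suffices; a minimal sketch with illustrative names:

```python
def lerp_param(freq, f_lo, f_hi, p_lo, p_hi):
    """Linearly interpolate a tabulated parameter between two bracketing frequencies."""
    t = (freq - f_lo) / (f_hi - f_lo)
    return p_lo + t * (p_hi - p_lo)

# a_fy = lerp_param(freq, f_f[j-1], f_f[j], a_f[j-1], a_f[j]); likewise for b_fy and T_fy
```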
And S540, determining loudness information related to the frequency of the audio input signal according to the interpolation frequency parameter and the interpolation sound pressure parameter.
Based on the interpolated frequency parameters and the interpolated sound pressure parameter obtained by the interpolation processing, the loudness information loud can be calculated as a function of a_fy, b_fy, T_fy and the pure tone sound pressure level L_f (in dB); the closed-form expression is given as an equation image in the original publication.
By executing the above steps S510 to S540, the loudness information of the audio input signal corresponding to each frequency value can be calculated based on the interpolated parameters. The power information may then be weighted based on the loudness information to obtain the perceptual quantization information of the audio input signal. Specifically, the loudness information may be subjected to exponential processing to obtain the perceptual weighting coefficient of the audio input signal, and the perceptual weighting coefficient and the power information corresponding to each frequency point in the audio input signal are then weighted and summed to obtain the perceptual quantization information of the audio input signal.
In some embodiments of the present application, the loudness information loud may be subjected to exponential processing to obtain the perceptual weighting coefficient cof(s) used for perceptual quantization; the exponential expression is given as an equation image in the original publication.
According to the perceptual weighting coefficient cof(s) corresponding to each frequency point s, the corresponding power values can be weighted and summed to obtain the perceptual quantization information EP(i) of the i-th audio data frame in the audio input signal:

EP(i) = Σ_{s=0}^{K−1} cof(s) · p(i, s)
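A one-line sketch of this weighted summation over frequency points, assuming cof and power_bins are NumPy arrays of length K; the names are illustrative:

```python
import numpy as np

def perceptual_quantization(power_bins, cof):
    """EP(i): perceptually weighted sum of one frame's per-frequency-point power."""
    return float(np.dot(cof, power_bins))  # sum over s of cof(s) * p(i, s)
```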
fig. 6 schematically shows a mapping relationship between perceptual weighting coefficients and frequencies in some embodiments of the present application. As can be seen from the mapping curve shown in fig. 6, in the embodiment of the present application, different perceptual weighting coefficients may be calculated and determined at different frequency points, so that power information corresponding to different frequency points may be perceptually quantized by using the perceptual weighting coefficients related to frequency, and perceptual quantization information of the audio input signal is obtained.
In step S230, the perceptual quantization information of the audio input signals is respectively subjected to a numerical adjustment to determine perceptual equalization weights for reducing perceptual differences between the audio input signals.
Based on step S220, the perceptual quantization information of each path of audio input signal in the current frame can be obtained. In order to give audio input signals with different auditory perception levels an equal opportunity to be heard as far as possible, the present application performs multi-channel equalization processing based on the perceptual quantization information. The objective is to enhance channels with weak auditory perception, for example, sounds with a lower pitch and smaller volume, and to appropriately attenuate sounds that are relatively loud and easily mask the signals of other channels, so that the sounds of all channels are at a comparable level in auditory perception.
Fig. 7 schematically illustrates a flow chart of method steps for determining perceptual equalization weights in some embodiments of the present application. As shown in fig. 7, on the basis of the above embodiments, step S230, respectively performing a numerical adjustment on the perceptual quantization information of the audio input signals to determine perceptual equalization weights for reducing perceptual differences between the audio input signals, may include steps S710 to S730 as follows.
And S710, performing smooth filtering on the perception quantization information to obtain a perception smooth value of the audio input signal.
Smoothing filtering of the perceptual quantization information blends the perceptual smoothing value of the previous data frame into the calculation of the current frame's perceptual smoothing value in a certain proportion, so that two adjacent data frames transition smoothly in their perceptual smoothing values and sudden changes are avoided.
For example, for the audio input signal of the l-th input channel, this step may obtain the perceptual smoothing value EPS_l(i−1) of the previous signal frame and the perceptual quantization information EP_l(i) of the current signal frame in the audio input signal. A first smoothing factor α for smoothing filtering the perceptual quantization information may also be obtained. According to the first smoothing factor α, the perceptual smoothing value EPS_l(i−1) of the previous signal frame and the perceptual quantization information EP_l(i) of the current signal frame are weighted to obtain the perceptual smoothing value EPS_l(i) of the current signal frame. Specifically, the following weighting formula can be adopted:

EPS_l(i) = α · EPS_l(i−1) + (1 − α) · EP_l(i)

The first smoothing factor α is a parameter taking a value between 0 and 1; for example, the first smoothing factor α may take the value 0.1.
And S720, comparing the perception smooth values of the audio input signals to obtain a maximum smooth value, and determining the perception smooth proportion between the maximum smooth value and each perception smooth value.
Step S710 may obtain the perceptual smoothing values of the audio input signals of the respective input channels. After the perceptual smoothing values of the different input channels are compared in this step, the maximum smoothing value can be obtained, and the ratio of the maximum smoothing value to the perceptual smoothing value of each audio input signal gives the perceptual smoothing proportion of that audio input signal. For example, writing the perceptual smoothing proportion of the audio input signal of the l-th input channel as RP_l(i), it may be expressed as:

RP_l(i) = max{EPS_1(i), EPS_2(i), …, EPS_M(i)} / EPS_l(i)

where M is the number of audio input signals to be mixed.
And step S730, carrying out smooth filtering on the perception smooth proportion to obtain perception balance weight for reducing perception difference among all paths of audio input signals.
Smoothing filtering of the perceptual smoothing proportion blends the perceptual equalization weight of the previous data frame into the calculation of the current frame's perceptual equalization weight in a certain proportion, so that two adjacent data frames transition smoothly in their perceptual equalization weights and sudden changes are avoided.
For example, for the audio input signal of the l-th input channel, this step may obtain the perceptual equalization weight wp_l(i−1) of the previous signal frame and the perceptual smoothing proportion RP_l(i) of the current signal frame in the audio input signal. A second smoothing factor β for smoothing filtering the perceptual smoothing proportion may also be obtained. According to the second smoothing factor β, the perceptual equalization weight wp_l(i−1) of the previous signal frame and the perceptual smoothing proportion RP_l(i) of the current signal frame are weighted to obtain the perceptual equalization weight wp_l(i) of the current signal frame. Specifically, the following weighting formula can be adopted:

wp_l(i) = β · wp_l(i−1) + (1 − β) · RP_l(i)

The second smoothing factor β is a parameter taking a value between 0 and 1; for example, the second smoothing factor β may take the value 0.2.
By executing the steps S710 to S730, the perceptual equalization weight of each audio input signal can be determined on the basis of the smoothing filtering, and the adjustment of the smoothing filtering degree can be realized by adjusting the first smoothing factor and the second smoothing factor, which has the advantage of flexible and controllable smoothing filtering degree.
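For concreteness, a vectorised per-frame sketch of steps S710 to S730 across all M channels; the small floor in the division is an added guard against silent channels, and all names are illustrative:

```python
import numpy as np

def update_weights(eps_prev, ep_cur, wp_prev, alpha=0.1, beta=0.2):
    """One frame of steps S710-S730 for M channels (1-D arrays of length M)."""
    eps_cur = alpha * eps_prev + (1 - alpha) * ep_cur   # S710: smooth EP per channel
    rp = eps_cur.max() / np.maximum(eps_cur, 1e-12)     # S720: max value / each EPS
    wp_cur = beta * wp_prev + (1 - beta) * rp           # S730: smooth the proportion
    return eps_cur, wp_cur
```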
In step S240, at least two paths of audio input signals are superimposed according to the perceptual equalization weight of each path of audio input signal to obtain a mixed audio.
The power information of each audio input signal can be weighted according to the perceptual equalization weight obtained in step S230 to obtain the equalization power information of each audio input signal, the equalization power information is then converted from the frequency domain to the time domain to obtain the equalization audio signal of the audio input signal, and the equalization audio signals of at least two audio input signals are superimposed to obtain the mixed audio.
The weighting process may be performed by multiplying the perceptual equalization weight of each audio input signal by the corresponding power information to obtain equalized power information. After that, the equalized power information based on the frequency domain may be subjected to inverse fourier transform to obtain an equalized audio signal based on the time domain. After the equalized audio signal is obtained, it may be subjected to superposition processing. In order to avoid the problem of sample point value overflow of the mixed audio after superposition, the embodiment of the application can perform down-mixing processing on the mixed audio.
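A sketch of this step under one concrete reading: applying the perceptual equalization weight to the complex spectrum through a square root, so that the power of the frame is multiplied by the weight; the names are illustrative:

```python
import numpy as np

def equalize_frame(windowed_frame, weight):
    """Scale one frame so its power information is multiplied by `weight`."""
    spectrum = np.fft.rfft(windowed_frame)
    eq_spectrum = np.sqrt(weight) * spectrum                  # power scales by `weight`
    return np.fft.irfft(eq_spectrum, n=len(windowed_frame))  # back to the time domain
```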
The method of downmixing an equalized audio signal may include: carrying out linear superposition on equalized audio signals of at least two paths of audio input signals to obtain linear superposed signals; obtaining a value range quantization factor for determining a signal value range of the mixed audio and a basic contraction factor for contracting the mixed audio to the signal value range; and performing down-mixing on the linear superposition signals according to the value domain quantization factor and the basic contraction factor to obtain mixed audio.
For example, the embodiment of the present application may linearly superpose the equalized audio signals according to the following formula:

bp_m(t) = Σ_{l=1, l≠m}^{M} ap_l(t)

where ap_l(t) is the equalized audio signal at the t-th sampling point of the l-th input channel, and bp_m(t) represents the linear superposition signal obtained by linearly superposing, for the M paths of audio input signals, the M−1 equalized audio signals other than the m-th equalized audio signal.
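A compact sketch of this exclusive superposition for all m at once, assuming ap is an M×T NumPy array of equalized signals; the names are illustrative:

```python
import numpy as np

def superpose_excluding_own(ap):
    """bp_m(t): sum of all equalized channels except channel m itself."""
    total = ap.sum(axis=0)          # full mix over all M channels
    return total[None, :] - ap      # row m equals the full mix minus channel m
```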
The signal value range of the mixed audio can be expressed as [−2^(Q−1), 2^(Q−1) − 1], where Q is the value range quantization factor; for example, the value range quantization factor Q may take the value 16.
The embodiment of the present application may perform down-mixing on the linear superposition signal to obtain the mixed audio signal bp′_m(t); the down-mixing expression is given as an equation image in the original publication and involves the quantity

n_m = |bp_m(t)| / 2^(Q−1), m = 1, 2, …, M

where k is a basic contraction factor for contracting the mixed audio into the signal value range, which may be a preset parameter with a value greater than 1; sgn(·) is the sign function; mod is the remainder operation; and n_m denotes the m-th signal interval obtained by dividing the linear superposition signal.
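The exact down-mixing expression is not reproduced here; as a simplified stand-in with the same intent (contracting overshoot by a factor k > 1 and keeping samples inside [−2^(Q−1), 2^(Q−1) − 1]), one might write:

```python
import numpy as np

def downmix_clamp(bp, q=16, k=2.0):
    """Simplified down-mix: contract the overshooting part, then clamp to the Q-bit range."""
    limit = 2 ** (q - 1) - 1
    out = bp.astype(float)
    over = np.abs(out) > limit
    # Contract only the part exceeding the value range by the basic contraction factor k.
    out[over] = np.sign(out[over]) * (limit + (np.abs(out[over]) - limit) / k)
    return np.clip(out, -2 ** (q - 1), limit)
```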
According to the perceptual-weighting-based sound mixing method described above, auditory perception analysis assigns a larger mixing weight to channels with lower perceived volume, so that after mixing the sound of no channel is completely masked as far as possible, which improves the fairness of multi-user audio communication.
It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
The following describes embodiments of an apparatus of the present application, which can be used to perform the mixing method in the above embodiments of the present application. Fig. 8 schematically shows a block diagram of a mixing apparatus according to an embodiment of the present application. As shown in fig. 8, the mixing apparatus 800 may mainly include:
a power obtaining module 810 configured to obtain at least two audio input signals and obtain power information of each audio input signal respectively;
the perceptual quantization module 820 is configured to acquire loudness information related to the frequency of the audio input signal, and perform weighted summation on the power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal;
a perceptual equalizing module 830 configured to perform a numerical adjustment on perceptual quantization information of the audio input signals to determine perceptual equalizing weights for reducing perceptual differences between the audio input signals;
and the signal superposition module 840 is configured to perform superposition processing on the at least two audio input signals according to the perceptual equalization weight of each audio input signal to obtain mixed audio.
In some embodiments of the present application, based on the above embodiments, the power obtaining module includes:
the framing processing unit is configured to perform framing processing on each path of audio input signal respectively to obtain audio data frames of the audio input signal;
a windowing processing unit configured to perform windowing processing on the audio data frame to obtain a windowed framing signal of the audio input signal;
a frequency domain converting unit configured to convert the windowed framed signal from the time domain to the frequency domain to obtain power information of the audio input signal.
In some embodiments of the present application, based on the above embodiments, the windowing processing unit includes:
a window function acquiring subunit configured to acquire a window function for windowing the audio data frame, the window function being a hamming window or a hanning window;
a window function point multiplication subunit configured to perform point-wise multiplication of the window function and the audio data frame to obtain a windowed framing signal of the audio input signal.
In some embodiments of the present application, based on the above embodiments, the frequency domain converting unit includes:
a spectrum acquisition subunit configured to perform a Fourier transform on the time-domain windowed framing signal to obtain frequency-domain spectrum information;
an energy spectrum determination subunit configured to determine energy spectrum information of the audio input signal according to the magnitude in the spectrum information;
a first power determination subunit configured to obtain a framing time length of the windowed framing signal and determine power information of the audio input signal according to the energy spectrum information and the framing time length.
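As a concrete illustration of the framing, windowing, and time-to-frequency conversion chain above, the sketch below computes per-frame power information as the energy spectrum divided by the framing time length. The frame length, hop size, and function names are illustrative assumptions.

```python
import numpy as np

def frame_power(x, fs=16000, frame_len=1024, hop=512, window="hamming"):
    # Window function: Hamming or Hanning, per the windowing unit above.
    win = np.hamming(frame_len) if window == "hamming" else np.hanning(frame_len)
    frame_dur = frame_len / fs                    # framing time length (s)
    powers = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win  # windowed framing signal
        spec = np.fft.rfft(frame)                 # time -> frequency domain
        energy = np.abs(spec) ** 2                # energy spectrum (|amplitude|^2)
        powers.append(energy / frame_dur)         # power = energy / duration
    return np.array(powers)                       # shape: (frames, bins)
```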
In some embodiments of the present application, based on the above embodiments, the frequency domain converting unit includes:
an autocorrelation function obtaining subunit configured to obtain an autocorrelation function of the windowed framing signal in the time domain;
a second power determination subunit configured to perform a Fourier transform on the autocorrelation function to obtain power information of the audio input signal based on the frequency domain.
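This alternative path is the Wiener–Khinchin route: the power spectrum is the Fourier transform of the time-domain autocorrelation function. A minimal sketch, with illustrative names:

```python
import numpy as np

def power_via_autocorrelation(frame):
    n = len(frame)
    # Biased autocorrelation at lags 0 .. n-1.
    acf = np.correlate(frame, frame, mode="full")[n - 1:] / n
    # Arrange lags as [0 .. n-1, -(n-1) .. -1] (even symmetry) and take the
    # Fourier transform; for a real, even sequence the result is real.
    sym = np.concatenate([acf, acf[-1:0:-1]])
    return np.real(np.fft.rfft(sym))
```

Up to the density of the frequency grid, this agrees with the direct periodogram |FFT(frame)|^2 / n from the first route.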
In some embodiments of the present application, based on the above embodiments, the perceptual quantization module includes:
an equal loudness curve acquisition unit configured to acquire equal loudness curve data representing a mapping relationship between a sound pressure level and a frequency;
and the equal loudness curve interpolation unit is configured to carry out interpolation processing on the equal loudness curve data so as to obtain loudness information related to the frequency of the audio input signal.
In some embodiments of the present application, based on the above embodiments, the perceptual quantization module further includes:
a frequency point determination unit configured to determine a lower frequency point and an upper frequency point adjacent to a frequency of the audio input signal in the equal loudness curve data;
the parameter query unit is configured to query the equal loudness curve data to obtain a reference frequency parameter and a reference sound pressure parameter of the lower frequency point and the upper frequency point;
a parameter interpolation unit configured to perform interpolation processing on the reference frequency parameter and the reference sound pressure parameter to obtain an interpolated frequency parameter and an interpolated sound pressure parameter related to the frequency of the audio input signal, respectively;
a loudness determination unit configured to determine loudness information related to a frequency of the audio input signal from the interpolated frequency parameter and the interpolated sound pressure parameter.
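A minimal sketch of the interpolation just described: look up the tabulated frequency points adjacent to a target frequency on an equal loudness curve and linearly interpolate their reference parameters. The few table values below are illustrative placeholders, not real ISO 226 data.

```python
import numpy as np

# Illustrative subset of an equal loudness curve: frequency (Hz) versus the
# sound pressure level (dB SPL) judged equally loud (placeholder values).
EL_FREQ = np.array([125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0])
EL_SPL = np.array([52.0, 45.0, 42.0, 40.0, 38.0, 37.0, 45.0])

def interpolated_spl(freq):
    # Lower/upper adjacent frequency points around `freq`.
    i = int(np.clip(np.searchsorted(EL_FREQ, freq), 1, len(EL_FREQ) - 1))
    f_lo, f_hi = EL_FREQ[i - 1], EL_FREQ[i]   # reference frequency parameters
    s_lo, s_hi = EL_SPL[i - 1], EL_SPL[i]     # reference sound pressure parameters
    t = (freq - f_lo) / (f_hi - f_lo)
    return s_lo + t * (s_hi - s_lo)           # interpolated sound pressure parameter
```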
In some embodiments of the present application, based on the above embodiments, the perceptual quantization module includes:
an exponential processing unit configured to perform exponential processing on the loudness information to obtain perceptual weighting coefficients of the audio input signal;
and the weighted summation unit is configured to perform weighted summation on the perceptual weighting coefficient and the power information corresponding to each frequency point in the audio input signal to obtain perceptual quantization information of the audio input signal.
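The exponential step and the weighted summation could then look like the sketch below; the specific mapping from dB-like loudness to a linear coefficient (10 raised to −loudness/10) is an assumption, since the text states only that the loudness information undergoes exponential processing.

```python
import numpy as np

def perceptual_quantization(power_bins, bin_freqs):
    # Exponentiate loudness into perceptual weighting coefficients: bands the
    # ear is less sensitive to (higher equal-loudness SPL) get smaller weights.
    spl = np.array([interpolated_spl(f) for f in bin_freqs])
    coeffs = 10.0 ** (-spl / 10.0)
    # Weighted summation over frequency points yields one perceptual
    # quantization value for the frame.
    return float(np.sum(coeffs * power_bins))
```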
In some embodiments of the present application, based on the above embodiments, the perceptual equalization module includes:
a first smoothing filtering unit configured to smooth-filter the perceptual quantization information to obtain a perceptual smooth value of the audio input signal;
a smoothing ratio determining unit configured to compare the perceptual smoothing values of the audio input signals to obtain a maximum smoothing value, and determine perceptual smoothing ratios between the maximum smoothing value and the respective perceptual smoothing values;
a second smoothing filtering unit configured to smooth-filter the perceptual smoothing scale to obtain perceptual equalization weights for reducing perceptual differences between the respective audio input signals.
In some embodiments of the present application, based on the above embodiments, the first smoothing filter unit includes:
a first information obtaining subunit, configured to obtain a perceptual smoothing value of a previous signal frame and perceptual quantization information of a current signal frame in the audio input signal;
a first factor obtaining subunit configured to obtain a first smoothing factor for performing smoothing filtering on the perceptual quantization information;
and the first smoothing filtering subunit is configured to perform weighted summation on the perceptual smoothing value of the previous signal frame and the perceptual quantization information of the current signal frame according to the first smoothing factor to obtain the perceptual smoothing value of the current signal frame.
In some embodiments of the present application, based on the above embodiments, the second smoothing filter unit includes:
a second information obtaining subunit, configured to obtain perceptual equalization weights of a previous signal frame and a perceptual smoothing ratio of a current signal frame in the audio input signal;
a second factor obtaining subunit configured to obtain a second smoothing factor for performing smoothing filtering on the perceptual smoothing scale;
and the second smoothing filtering subunit is configured to perform weighted summation on the perceptual equalization weight of the previous signal frame and the perceptual smoothing proportion of the current signal frame according to a second smoothing factor to obtain the perceptual equalization weight of the current signal frame.
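Putting the two recursive smoothers together, a per-frame update could look like the sketch below; the smoothing factors of 0.9 are illustrative, and both filters follow the weighted-sum form described above (previous value weighted by the smoothing factor, current value by its complement).

```python
import numpy as np

def update_weights(pq, pq_smooth, eq_weights, a1=0.9, a2=0.9):
    # pq: current-frame perceptual quantization value per channel.
    # pq_smooth, eq_weights: previous-frame smooth values and weights.
    # First smoothing: blend the previous perceptual smooth value with the
    # current frame's perceptual quantization value, per channel.
    pq_smooth = a1 * pq_smooth + (1.0 - a1) * pq
    # Ratio of the maximum smooth value to each channel's smooth value
    # (>= 1, so quieter-perceived channels are boosted).
    ratio = np.max(pq_smooth) / np.maximum(pq_smooth, 1e-12)
    # Second smoothing on the ratio yields the perceptual equalization weight.
    eq_weights = a2 * eq_weights + (1.0 - a2) * ratio
    return pq_smooth, eq_weights
```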
In some embodiments of the present application, based on the above embodiments, the signal superposition module includes:
the power weighting unit is configured to perform weighting processing on the power information of the audio input signal according to the perceptual equalization weight so as to obtain equalized power information of the audio input signal;
a time domain converting unit configured to convert the equalized power information from a frequency domain to a time domain to obtain an equalized audio signal of the audio input signal;
and the superposition processing unit is configured to carry out superposition processing on the equalized audio signals of the at least two paths of audio input signals to obtain mixed audio.
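For one frame, this superposition chain can be sketched as below: weight each channel's spectrum, convert back to the time domain, and sum. Applying the power-domain weight as sqrt(w) on the amplitude spectrum is an assumption, and the contraction sketch shown earlier would then pull the sum back into the signal value range.

```python
import numpy as np

def mix_frame(channel_specs, eq_weights):
    # channel_specs: list of rfft spectra, one windowed frame per channel.
    frame_len = 2 * (len(channel_specs[0]) - 1)
    mixed = np.zeros(frame_len)
    for spec, w in zip(channel_specs, eq_weights):
        equalized = spec * np.sqrt(w)                 # power weight on amplitude
        mixed += np.fft.irfft(equalized, frame_len)   # frequency -> time domain
    return mixed                                      # linear superposition, pre-downmix
```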
In some embodiments of the present application, based on the above embodiments, the superimposition processing unit includes:
the linear superposition subunit is configured to linearly superpose equalized audio signals of the at least two paths of audio input signals to obtain linear superposition signals;
a factor acquisition subunit configured to acquire a value range quantization factor for determining a signal value range of the mixed audio and a basic contraction factor for contracting the mixed audio into the signal value range;
a down-mixing subunit configured to down-mix the linearly superimposed signal according to the value domain quantization factor and the basic contraction factor to obtain the mixed audio.
The specific details of the audio mixing apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not described herein again.
Fig. 9 schematically shows a structural block diagram of a computer system of an electronic device for implementing the embodiment of the present application.
It should be noted that the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
As shown in fig. 9, a computer system 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An Input/Output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read from it is installed into the storage portion 908 as necessary.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When executed by the Central Processing Unit (CPU) 901, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A mixing method, comprising:
acquiring at least two paths of audio input signals, and respectively acquiring power information of each path of audio input signal;
obtaining loudness information related to the frequency of the audio input signal, and respectively carrying out weighted summation on power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal;
respectively carrying out numerical adjustment on the perception quantization information of each path of audio input signal to determine a perception equalization weight for reducing perception difference between each path of audio input signal;
and performing superposition processing on the at least two paths of audio input signals according to the perceptual equilibrium weight of each path of audio input signal to obtain mixed audio.
2. The mixing method according to claim 1, wherein the separately obtaining power information of each audio input signal comprises:
respectively carrying out framing processing on each path of audio input signal to obtain an audio data frame of the audio input signal;
windowing the audio data frame to obtain a windowed framing signal of the audio input signal;
converting the windowed framing signal from the time domain to the frequency domain to obtain power information of the audio input signal.
3. The mixing method according to claim 2, wherein the windowing the audio data frame to obtain a windowed framed signal of the audio input signal comprises:
acquiring a window function for windowing the audio data frame, wherein the window function is a Hamming window or a Hanning window;
and performing point-wise multiplication of the window function and the audio data frame to obtain a windowed framing signal of the audio input signal.
4. The mixing method according to claim 2, wherein the converting the windowed framing signal from the time domain to the frequency domain to obtain the power information of the audio input signal comprises:
performing a Fourier transform on the time-domain windowed framing signal to obtain frequency-domain spectrum information;
determining energy spectrum information of the audio input signal according to the amplitude values in the frequency spectrum information;
and acquiring the framing time length of the windowed framing signal, and determining the power information of the audio input signal according to the energy spectrum information and the framing time length.
5. The mixing method according to claim 2, wherein the converting the windowed framing signal from the time domain to the frequency domain to obtain the power information of the audio input signal comprises:
acquiring an autocorrelation function of the windowed framing signal in the time domain;
fourier transforming the autocorrelation function to obtain power information of the audio input signal based on a frequency domain.
6. The mixing method according to claim 1, wherein the obtaining loudness information related to the frequency of the audio input signal comprises:
acquiring equal loudness curve data used for representing the mapping relation between the sound pressure level and the frequency;
and carrying out interpolation processing on the equal loudness curve data to obtain loudness information related to the frequency of the audio input signal.
7. The mixing method according to claim 6, wherein the interpolating the equal loudness curve data to obtain loudness information related to the frequency of the audio input signal comprises:
determining a lower frequency point and an upper frequency point adjacent to the frequency of the audio input signal in the equal loudness curve data;
inquiring the equal-loudness curve data to obtain a reference frequency parameter and a reference sound pressure parameter of the lower frequency point and the upper frequency point;
respectively carrying out interpolation processing on the reference frequency parameter and the reference sound pressure parameter to obtain an interpolation frequency parameter and an interpolation sound pressure parameter which are related to the frequency of the audio input signal;
determining loudness information related to the frequency of the audio input signal from the interpolated frequency parameter and the interpolated sound pressure parameter.
8. The audio mixing method according to claim 1, wherein the weighting and summing the power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perceptual quantization information of the audio input signal comprises:
performing exponential processing on the loudness information to obtain a perceptual weighting coefficient of the audio input signal;
and carrying out weighted summation on the perceptual weighting coefficient and the power information corresponding to each frequency point in the audio input signal to obtain the perceptual quantization information of the audio input signal.
9. The audio mixing method according to claim 1, wherein the numerically adjusting the perceptual quantization information of the audio input signals to determine perceptual equalization weights for reducing perceptual differences between the audio input signals comprises:
performing smooth filtering on the perceptual quantization information to obtain a perceptual smooth value of the audio input signal;
comparing the perception smooth values of all paths of audio input signals to obtain a maximum smooth value, and determining the perception smooth proportion between the maximum smooth value and each perception smooth value;
and carrying out smooth filtering on the perception smooth proportion to obtain a perception equalization weight for reducing perception difference between the audio input signals.
10. The mixing method according to claim 9, wherein the smoothly filtering the perceptual quantization information to obtain a perceptual smooth value of the audio input signal comprises:
acquiring a perceptual smooth value of a previous signal frame and perceptual quantization information of a current signal frame in the audio input signal;
acquiring a first smoothing factor for performing smoothing filtering on the perception quantization information;
and carrying out weighted summation on the perceptual smooth value of the previous signal frame and the perceptual quantization information of the current signal frame according to the first smoothing factor to obtain the perceptual smooth value of the current signal frame.
11. The mixing method according to claim 9, wherein the smoothly filtering the perceptual smoothing scale to obtain perceptual equalization weights for reducing perceptual differences between the respective audio input signals comprises:
acquiring a perceptual balance weight of a previous signal frame and a perceptual smooth proportion of a current signal frame in the audio input signal;
acquiring a second smoothing factor for performing smoothing filtering on the perception smoothing proportion;
and carrying out weighted summation on the perceptual equilibrium weight of the previous signal frame and the perceptual smooth proportion of the current signal frame according to the second smoothing factor to obtain the perceptual equilibrium weight of the current signal frame.
12. The audio mixing method according to claim 1, wherein the superimposing the at least two audio input signals according to the perceptual equalization weight of each audio input signal to obtain the mixed audio comprises:
carrying out weighting processing on the power information of the audio input signal according to the perceptual equilibrium weight to obtain equilibrium power information of the audio input signal;
converting the equalized power information from a frequency domain to a time domain to obtain an equalized audio signal of the audio input signal;
and performing superposition processing on the equalized audio signals of the at least two paths of audio input signals to obtain mixed audio.
13. The mixing method according to claim 12, wherein the performing the superposition processing on the equalized audio signals of the at least two audio input signals to obtain the mixed audio comprises:
performing linear superposition on the equalized audio signals of the at least two paths of audio input signals to obtain linear superposed signals;
obtaining a value range quantization factor for determining a signal value range of a mixed audio and a basic contraction factor for contracting the mixed audio to the signal value range;
and performing down-mixing on the linear superposition signal according to the value domain quantization factor and the basic contraction factor to obtain the mixed audio.
14. An audio mixing apparatus, comprising:
the power acquisition module is configured to acquire at least two paths of audio input signals and respectively acquire power information of each path of audio input signal;
the perception quantization module is configured to acquire loudness information related to the frequency of the audio input signal, and perform weighted summation on power information corresponding to each frequency point in the audio input signal according to the loudness information to obtain perception quantization information of the audio input signal;
the perceptual equalization module is configured to perform numerical adjustment on perceptual quantization information of the audio input signals respectively to determine perceptual equalization weights for reducing perceptual differences among the audio input signals;
and the signal superposition module is configured to superpose the at least two paths of audio input signals according to the perceptual equalization weight of each path of audio input signal to obtain mixed audio.
15. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the mixing method of any one of claims 1 to 13 via execution of the executable instructions.
CN202010621654.1A 2020-06-30 2020-06-30 Sound mixing method and device and electronic equipment Active CN112750444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010621654.1A CN112750444B (en) 2020-06-30 2020-06-30 Sound mixing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112750444A (en) 2021-05-04
CN112750444B CN112750444B (en) 2023-12-12

Family

ID=75645238

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030235317A1 (en) * 2002-06-24 2003-12-25 Frank Baumgarte Equalization for audio mixing
CN101411060A (en) * 2006-04-04 2009-04-15 杜比实验室特许公司 Loudness modification of multichannel audio signals
CN101656072A (en) * 2009-09-08 2010-02-24 北京飞利信科技股份有限公司 Mixer, mixing method and session system using the mixer
RU2526746C1 (en) * 2010-09-22 2014-08-27 Долби Лабораторис Лайсэнзин Корпорейшн Audio stream mixing with dialogue level normalisation
CN104350768A (en) * 2012-03-27 2015-02-11 无线电广播技术研究所有限公司 Arrangement for mixing at least two audio signals
CN106504758A (en) * 2016-10-25 2017-03-15 大连理工大学 Mixer and sound mixing method
CN107645696A (en) * 2016-07-20 2018-01-30 腾讯科技(深圳)有限公司 One kind is uttered long and high-pitched sounds detection method and device
CN110060696A (en) * 2018-01-19 2019-07-26 腾讯科技(深圳)有限公司 Sound mixing method and device, terminal and readable storage medium storing program for executing

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299299A (en) * 2021-05-22 2021-08-24 深圳市健成云视科技有限公司 Audio processing apparatus, method and computer-readable storage medium
CN113299299B (en) * 2021-05-22 2024-03-19 深圳市健成云视科技有限公司 Audio processing apparatus, method, and computer-readable storage medium
CN114974273A (en) * 2021-08-10 2022-08-30 中移互联网有限公司 Conference audio mixing method and device
CN114974273B (en) * 2021-08-10 2023-08-15 中移互联网有限公司 Conference audio mixing method and device
CN113643719A (en) * 2021-08-26 2021-11-12 Oppo广东移动通信有限公司 Audio signal processing method and device, storage medium and terminal equipment
CN113889125A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Audio generation method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112750444B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
CN112750444B (en) Sound mixing method and device and electronic equipment
US20220230651A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
US10708423B2 (en) Method and apparatus for processing voice information to determine emotion based on volume and pacing of the voice
EP3992964B1 (en) Voice signal processing method and apparatus, and electronic device and storage medium
US10141008B1 (en) Real-time voice masking in a computer network
CN112185410B (en) Audio processing method and device
US9484044B1 (en) Voice enhancement and/or speech features extraction on noisy audio signals using successively refined transforms
Kumar Comparative performance evaluation of MMSE-based speech enhancement techniques through simulation and real-time implementation
CN113571078B (en) Noise suppression method, device, medium and electronic equipment
CN111508519A (en) Method and device for enhancing voice of audio signal
CN114898762A (en) Real-time voice noise reduction method and device based on target person and electronic equipment
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN113571082A (en) Voice call control method and device, computer readable medium and electronic equipment
CN117079661A (en) Sound source processing method and related device
CN113077771A (en) Asynchronous chorus sound mixing method and device, storage medium and electronic equipment
CN113571080A (en) Voice enhancement method, device, equipment and storage medium
CN113299299B (en) Audio processing apparatus, method, and computer-readable storage medium
CN114783455A (en) Method, apparatus, electronic device and computer readable medium for voice noise reduction
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
CN114333892A (en) Voice processing method and device, electronic equipment and readable medium
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
JP2024502287A (en) Speech enhancement method, speech enhancement device, electronic device, and computer program
CN111326166B (en) Voice processing method and device, computer readable storage medium and electronic equipment
Lee et al. Speech Enhancement Using Phase‐Dependent A Priori SNR Estimator in Log‐Mel Spectral Domain

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40044532)
SE01 Entry into force of request for substantive examination
GR01 Patent grant