CN116962955A - Multi-channel sound mixing method, equipment and medium - Google Patents

Multi-channel sound mixing method, equipment and medium

Info

Publication number
CN116962955A
CN116962955A
Authority
CN
China
Prior art keywords
audio
channel
energy
frame
audio data
Prior art date
Legal status
Pending
Application number
CN202210414876.5A
Other languages
Chinese (zh)
Inventor
周永强
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210414876.5A priority Critical patent/CN116962955A/en
Priority to PCT/CN2023/087077 priority patent/WO2023197967A1/en
Publication of CN116962955A publication Critical patent/CN116962955A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324Details of processing therefor
    • G10L21/034Automatic adjustment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The present application relates to the field of audio processing technologies, and in particular to a multi-channel audio mixing method, apparatus, and medium. The method comprises the following steps: acquiring first multi-channel audio data, where the first multi-channel audio data comprises audio data of M channels to be mixed; determining that the first multi-channel audio data contains audio data whose energy exceeds a preset energy threshold, and performing energy amplitude reduction processing on that audio data; obtaining second multi-channel audio data according to the result of the energy amplitude reduction processing; and performing channel down-mixing on the second multi-channel audio data to obtain mixing output data with N mixing channels, where M > N and N ≥ 1. The multi-channel audio mixing method provided by the embodiments of the present application can solve the sound-breaking problem caused by excessively high energy in some audio frames during channel down-mixing, obtain a more satisfactory down-mixing result, and improve the listening experience of users.

Description

Multi-channel sound mixing method, equipment and medium
Technical Field
The application relates to the technical field of audio processing, in particular to a multichannel audio mixing method, equipment and medium.
Background
With the rapid development of modern technology, in various scenarios that require audio playback, the number of channels in the audio data often does not match the number of channels of the audio output device, so real-time multi-channel mixing frequently has to be completed when audio is output. Generally, multi-channel audio data is converted into audio data with a smaller number of channels, that is, channel down-mixing. For example, when an AiMax source is played on a large screen, multi-channel audio data such as 3.1, 5.1, and 7.1 may be present, but when the large-screen output device is switched to a digital audio interface (Sony/Philips Digital Interface, S/PDIF), an audio return channel (Audio Return Channel, ARC), or Bluetooth output, only two channels are output. Therefore, to preserve as much of the information in the audio stream as possible, the data of the multiple channels needs to be down-mixed to generate two-channel data.
Current schemes for down-mixing multiple channels into two channels generally adopt one of the following two approaches:
1) Use the first two channels of the multi-channel data as output, and discard the center channel, the surround channels, the bass channel, and so on. With this scheme, speech audio that lies in the discarded channels is lost, and using only two of the original channels as output also degrades the user's listening experience.
2) Adopt the Dolby down-mixing scheme: perform a weighted summation of the data related to the left and right channels in the multi-channel audio data to obtain two-channel audio output. However, for audio data that does not conform to the Dolby specification, for example when the energy of the bass-channel audio data is high, sound breaking occurs when the Dolby down-mixing scheme is used for channel down-mixing, resulting in a poor listening experience for the user.
Disclosure of Invention
The embodiments of the present application provide a multi-channel audio mixing method, device, and medium, which solve the problem that, in current channel down-mixing schemes, the down-mixed audio data breaks and the user's listening experience suffers.
In a first aspect, an embodiment of the present application provides a multi-channel audio mixing method, applied to an electronic device, including: acquiring first multi-channel audio data, where the first multi-channel audio data comprises audio data of M channels to be mixed; determining that the first multi-channel audio data contains audio data whose energy exceeds a preset energy threshold, and performing energy amplitude reduction processing on that audio data; obtaining second multi-channel audio data according to the result of the energy amplitude reduction processing; and performing channel down-mixing on the second multi-channel audio data to obtain mixing output data with N mixing channels, where M > N and N ≥ 1.
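For illustration only, the overall flow of the first aspect can be sketched in Python as follows. This is a minimal sketch, not the claimed implementation: the NumPy representation, the frame length of 512 sampling points, the linear (mean-square) energy threshold, and the square-root attenuation rule are all assumptions of this illustration.

```python
import numpy as np

FRAME_LEN = 512  # assumed frame length (the description uses L = 512)

def mix_multi_channel(first_audio: np.ndarray, threshold: float,
                      weights: np.ndarray) -> np.ndarray:
    """first_audio: (M, num_samples) audio of the M channels to be mixed.
    threshold: preset energy threshold as a linear mean-square value.
    weights: (N, M) down-mix weight matrix, M > N >= 1.
    Returns the (N, num_samples) mixing output data."""
    second_audio = first_audio.astype(np.float64).copy()
    num_frames = second_audio.shape[1] // FRAME_LEN
    for n in range(num_frames):
        sl = slice(n * FRAME_LEN, (n + 1) * FRAME_LEN)
        for i in range(second_audio.shape[0]):
            frame = second_audio[i, sl]
            energy = float(np.mean(frame ** 2))
            if energy > threshold:
                # Energy amplitude reduction: scale the frame so that its
                # energy falls back to the preset energy threshold.
                frame *= np.sqrt(threshold / energy)
    # Channel down-mixing: weighted summation into N mixing channels.
    return weights @ second_audio
```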
It is understood that the first multi-channel audio data is the input data of the mixing method, and the second multi-channel audio data is the data obtained after the energy amplitude reduction processing, which serves as the input data for channel down-mixing.
In some embodiments, the preset energy threshold is a preset minimum energy value above which the mixed audio may break. In other embodiments, the preset energy threshold may be another energy value above which the mixed audio may break and affect the user's listening experience. The present application is not limited in this regard.
In some embodiments, the first multi-channel audio data may be 2.1-channel, 3.1-channel, 5.1-channel, 7.1-channel, or other multi-channel audio data, and the mixing output data may be mono audio data, two-channel audio data, or other multi-channel audio data with fewer channels than the first multi-channel audio data.
It can be understood that the multi-channel mixing method in the present application mixes multi-channel audio data (i.e., the first multi-channel audio data) into audio data with a smaller number of channels, i.e., the mixing output data. Before the audio data of each channel is down-mixed, the energy of each channel's audio data is tracked to determine the audio data exceeding the preset energy threshold, energy suppression is performed on it to obtain the second multi-channel audio data, and the second multi-channel audio data is then down-mixed. The multi-channel audio mixing method provided by the embodiments of the present application can fully adapt to and support channel down-mixing of various multi-channel audio data, can solve the sound-breaking problem caused by excessively high energy in some audio frames during channel down-mixing, can obtain a more satisfactory down-mixing result, and can improve the user's listening experience.
In a possible implementation of the first aspect, determining that the first multi-channel audio data contains audio data whose energy is greater than the preset energy threshold includes: framing the first multi-channel audio data to obtain a plurality of audio frames and determining the frame energy of each audio frame; and determining that the first multi-channel audio data contains a high-energy audio frame whose frame energy is greater than the preset energy threshold.
It will be appreciated that in some embodiments, audio frames in the first multi-channel audio data for which the frame energy does not exceed the preset energy threshold are low energy audio frames, and that the energy reduction process may not be performed for the low energy audio frames.
In one possible implementation manner of the first aspect, performing energy amplitude reduction processing on audio data with energy greater than a preset energy threshold in the first multi-channel audio data to obtain second multi-channel audio data includes: determining a target gain of the high-energy audio frame, and determining a frame gain of the high-energy audio frame according to the target gain; and determining a target audio frame corresponding to the high-energy audio frame after the energy amplitude reduction processing according to the frame gain of the high-energy audio frame.
It can be understood that the target gain is an energy suppression factor when the high-energy audio frame is subjected to energy reduction processing, and the energy reduction of the high-energy audio frame can be realized by using the energy suppression factor.
In some embodiments, the target gain may also be provided for low energy audio frames, which have a target gain of 1, i.e. are not energy downscaled.
In a possible implementation of the first aspect, the frame energy of the high-energy audio frame is determined by the following formula:

$$E_{x_i(n)} = \beta\,E_{x_i(n-1)} + (1-\beta)\,\frac{1}{L}\sum_{k=0}^{L-1} x_i(n)(k)^2$$

where the high-energy audio frame comprises $L$ sampling points; $\beta$ denotes the frame energy smoothing coefficient; $x_i(n)(k)$ denotes the audio data of the $k$th sampling point in the $n$th audio frame of the $i$th channel to be mixed among the M channels to be mixed; $x_i(n)(k)^2$ denotes the energy of the $k$th sampling point in the $n$th audio frame of the $i$th channel to be mixed; and $E_{x_i(n)}$ denotes the frame energy of the $n$th audio frame of the $i$th channel to be mixed.
In some embodiments, each audio frame may include L = 512 sampling points, i.e., the frame length of the audio frame is 512. In other embodiments, L may take other values, which is not limited by the present application.
In a possible implementation of the first aspect, the preset energy threshold includes a first threshold and/or a second threshold, and the high-energy audio frame includes at least one of the following: among the plurality of audio frames of the M channels to be mixed, an audio frame is a high-energy audio frame if the average frame energy of the audio frames with the same index that correspond to the same mixing channel is greater than the first threshold; among the audio frames of the same channel to be mixed, an audio frame is a high-energy audio frame if the maximum frame energy of at least two consecutive audio frames adjacent to it is greater than the second threshold.
It may be understood that the index of an audio frame is a sequence number corresponding to an audio frame in any one of the M channels to be mixed, for example, for an nth audio frame of an ith channel to be mixed in the M channels to be mixed, the index is n.
In a possible implementation of the first aspect, the maximum frame energy of each audio frame of the M channels to be mixed is determined according to a frame energy of an audio frame having the largest frame energy among audio frames corresponding to the same mixing channel and having the same index as each audio frame.
In a possible implementation of the above first aspect, the target gain of the high energy audio frames is determined according to a preset energy threshold and a maximum frame energy of at least two audio frames consecutive to each high energy audio frame.
In a possible implementation of the first aspect, the frame gain is determined by the following formula:

$$\text{frameGain}_{x_i(n)} = \alpha\,\text{frameGain}_{x_i(n-1)} + (1-\alpha)\,\text{targetGain}_{x_i(n)}$$

where $\alpha$ denotes the frame gain smoothing coefficient; $\text{targetGain}_{x_i(n)}$ denotes the target gain of the $n$th audio frame of the $i$th channel to be mixed among the M channels to be mixed; $\text{frameGain}_{x_i(n-1)}$ denotes the frame gain of the $(n-1)$th audio frame of the $i$th channel to be mixed; and $\text{frameGain}_{x_i(n)}$ denotes the frame gain of the $n$th audio frame of the $i$th channel to be mixed.
In some embodiments, the frame gain of the low-energy audio frame in which the frame energy in the first multi-channel audio data does not exceed the preset energy threshold may also be calculated using the above formula, where the target gain of the low-energy audio frame is 1.
In one possible implementation manner of the first aspect, determining, according to a frame gain of the high-energy audio frame, a target audio frame corresponding to the high-energy audio frame after the energy-reduction processing includes: determining the sampling point gain of each sampling point in the high-energy audio frame according to the frame gain of the high-energy audio frame; according to the gain of each sampling point, carrying out energy amplitude reduction processing on the audio data of each sampling point in the high-energy audio frame to obtain the audio data of each sampling point in the target audio frame; and generating a target audio frame according to the audio data of each sampling point of the target audio frame.
In a possible implementation of the first aspect, the gain of each sampling point is determined by the following formula:

$$\text{gainBuff}[i][k]_{x_i(n)} = \text{frameGain}_{x_i(n-1)} + \frac{k}{\text{frameLen}}\left(\text{frameGain}_{x_i(n)} - \text{frameGain}_{x_i(n-1)}\right)$$

where $\text{frameLen}$ denotes the frame length of the target audio frame; $\text{frameGain}_{x_i(n-1)}$ denotes the frame gain of the $(n-1)$th audio frame of the $i$th channel to be mixed among the M channels to be mixed; $\text{frameGain}_{x_i(n)}$ denotes the frame gain of the $n$th audio frame of the $i$th channel to be mixed; $\text{gainBuff}[i][k]_{x_i(n-1)}$ denotes the sampling point gain of the $k$th sampling point of the $(n-1)$th audio frame of the $i$th channel to be mixed; and $\text{gainBuff}[i][k]_{x_i(n)}$ denotes the sampling point gain of the $k$th sampling point of the $n$th audio frame of the $i$th channel to be mixed.
In some embodiments, the low-energy audio frame whose frame energy in the first multi-channel audio data does not exceed the preset energy threshold may also calculate the sampling point gain according to the above formula, where the frame gain of the low-energy audio frame is calculated based on the target gain of the low-energy audio frame being 1.
In a possible implementation of the first aspect, the audio data of each sampling point in the target audio frame is determined by the audio data of each sampling point of the high energy audio frame corresponding to the target audio frame and the corresponding sampling point gain.
In a possible implementation manner of the first aspect, obtaining the second multi-channel audio data according to the energy reduction processing result includes: and generating second multi-channel audio data according to the target audio frame and the low-energy audio frame with energy not greater than a preset energy threshold value in the first multi-channel audio data.
In a possible implementation of the first aspect, down-mixing the second multi-channel audio data to obtain the mixing output data with N mixing channels includes: performing a weighted summation of the target audio frames and low-energy audio frames corresponding to the same mixing channel in the second multi-channel audio data to obtain the mixing output data.
It can be understood that the above process of obtaining the mixing output data performs channel down-mixing on the second multi-channel audio data using the Dolby down-mixing method, where the weight coefficients for the weighted summation are preset parameters.
In a second aspect, an embodiment of the present application provides an electronic device, including: one or more processors; one or more memories; the one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the multi-channel mixing method described above.
In a third aspect, an embodiment of the present application provides a computer readable storage medium, where instructions are stored, where the instructions when executed on a computer cause the computer to perform the above-described multi-channel mixing method.
In a fourth aspect, embodiments of the present application provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements the above-described multi-channel mixing method.
Drawings
Fig. 1 is a schematic view of an application scenario of a multi-channel audio mixing method according to an embodiment of the present application;
Fig. 2 is a flow chart of a method for mixing six channels into two channels according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a multi-channel audio mixing method according to an embodiment of the present application;
Fig. 4 is a flow chart of another multi-channel audio mixing method according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of an energy suppression method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a code stream waveform of multi-channel audio data according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a code stream waveform and an energy spectrum of a mixing channel after channel down-mixing;
Fig. 8 is a schematic diagram of a hardware structure of a mobile phone according to an embodiment of the present application.
Detailed Description
As described above, for audio data that does not conform to the Dolby specification, the audio data obtained by channel down-mixing suffers from sound breaking. Specifically, when the down-mixed audio data is obtained by weighted summation, excessively high energy in the bass-channel audio data makes the energy of the result too high; when the corresponding audio data is output, the user hears broken audio, and the listening experience is poor.
To solve the problem that the down-mixed audio data in channel down-mixing schemes breaks and degrades the user's listening experience, the present application provides a multi-channel mixing method. The method comprises the following steps: the electronic device determines the audio data exceeding the preset energy threshold in each channel of the multi-channel audio data, performs energy suppression (i.e., energy amplitude reduction) on it, and then obtains the suppressed audio data of each channel. The suppressed multi-channel audio data can then be weighted and summed based on a preset channel down-mixing algorithm to obtain the down-mixed audio data.
It will be appreciated that in some embodiments, energy suppression may be performed by calculating a frame gain for each audio frame in each channel, reducing the frame gain by a corresponding suppression factor, and applying the result to obtain a suppressed audio frame and thus energy-suppressed audio data.
It can be understood that the preset channel down-mixing algorithm specifies the correspondence between the channels before and after down-mixing and the weight coefficients for the weighted summation. For example, in a scheme that down-mixes six channels (a left channel, a left surround channel, a bass channel, a center channel, a right channel, and a right surround channel) into two channels, the preset channel down-mixing algorithm is: the left channel, left surround, bass, and center are down-mixed into the left output channel; the right channel, right surround, bass, and center are down-mixed into the right output channel; and the weight coefficients for the weighted summation are the default weight coefficients of the Dolby down-mixing scheme, as sketched below.
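As a concrete illustration of such a correspondence, the mapping for the six-channel example can be written down as follows; the Python dictionary form and the 0.7 weight values are assumptions of this sketch (the actual Dolby default coefficients are preset parameters not reproduced in this text).

```python
# Six channels down-mixed into two: each output channel lists the input
# channels mixed into it with a weight. The 0.7 values are placeholders,
# not the actual Dolby default coefficients.
DOWNMIX_MAP = {
    "left_out":  {"left": 1.0, "left_surround": 0.7, "bass": 0.7, "center": 0.7},
    "right_out": {"right": 1.0, "right_surround": 0.7, "bass": 0.7, "center": 0.7},
}
```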
It will be appreciated that in some embodiments, the multi-channel audio data may be divided into a plurality of audio frames, and energy tracking and energy suppression may be performed on a per-frame basis. The audio frames are obtained by dividing the audio data into a plurality of segments through framing, each segment serving as one audio frame; adjacent audio frames may overlap, and the frame division of each channel is the same. In other embodiments, the audio data whose energy exceeds the preset energy threshold may be energy-suppressed in other ways, which is not limited by the present application.
Before weighting and summing the audio data of each channel, the multi-channel audio mixing method provided by the embodiments of the present application tracks the energy of the audio frames of each channel, determines the audio frames exceeding the preset energy threshold, and performs energy suppression on them. The method can therefore adapt to a wide range of audio data, supports channel down-mixing of various multi-channel audio data, solves the sound-breaking problem caused by excessively high energy in some audio frames during channel down-mixing, obtains a more satisfactory down-mixing result, and improves the user's listening experience.
It is understood that the electronic devices in the embodiments of the present application include, but are not limited to, mobile phones (including folding-screen phones), tablet computers, laptop computers, desktop computers, servers, wearable devices, head-mounted displays, mobile email devices, in-vehicle devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded or coupled therein, and other electronic devices. For convenience of explanation, the present application is described below taking a mobile phone as the electronic device.
Referring to fig. 1 and fig. 2, an application scenario in an embodiment of the present application will be described by taking an example that audio data in a mobile phone 100 is output to a bluetooth headset 200.
Fig. 1 is a schematic view of an application scenario of a multi-channel mixing method.
As shown in fig. 1, the scenario includes a mobile phone 100 and a Bluetooth headset 200, which are connected wirelessly via Bluetooth.
When the user wears the Bluetooth headset 200 and plays multi-channel audio data on the mobile phone 100, the mobile phone 100 needs to down-mix the multi-channel audio data into two-channel audio data comprising a left channel and a right channel and transmit the audio data to the Bluetooth headset 200 via Bluetooth. The Bluetooth headset 200 plays the audio data after receiving it.
Fig. 2 is a flow chart of a method for mixing six channels into two channels.
Specifically, as shown in fig. 2, six-channel audio data is taken as an example, where the six channels are a left channel A1, a left surround channel A2, a bass channel A3, a center channel A4, a right channel A5, and a right surround channel A6. Before down-mixing the audio data, the mobile phone 100 may frame the six-channel audio data and determine the frame energy of each audio frame. When the frame energy of an audio frame of any channel is greater than the preset energy threshold, an energy suppression factor for that audio frame is determined, energy suppression is performed on it, and the energy-suppressed six-channel audio data is computed from the suppressed audio frames. Then, a preset channel down-mixing algorithm is used to down-mix the energy-suppressed six-channel audio data into two-channel audio data comprising a left channel a1 and a right channel a2. The mobile phone 100 may then transmit the audio data of the left channel a1 to the left earpiece of the Bluetooth headset 200 and the audio data of the right channel a2 to the right earpiece of the Bluetooth headset 200.
It can be understood that the multi-channel mixing method in the present application supports not only down-mixing six channels into two channels but also down-mixing any M channels into N channels, where M > N.
The following describes a multi-channel audio mixing method according to an embodiment of the present application with reference to the accompanying drawings.
Fig. 3 is a flow chart of a multi-channel audio mixing method according to an embodiment of the present application.
As shown in fig. 3, the multi-channel mixing method includes:
301: multi-channel audio data is acquired.
It will be appreciated that the different channels of the multi-channel audio data (i.e., the first multi-channel audio data above) have different channel types, and each channel is a channel to be mixed, such as a left channel, a right channel, a left surround channel, or a right surround channel. The audio data of different channels may be output through different speakers, or may be down-mixed and output through the same channel.
In some embodiments, the multi-channel audio data may be 3.1, 5.1, 7.1, or other multi-channel audio data. The 3.1 channels include a left channel, a bass channel, a center channel, and a right channel; the 5.1 channels include a left channel, a left surround channel, a bass channel, a center channel, a right channel, and a right surround channel; and the 7.1 channels additionally include a left rear surround channel and a right rear surround channel. In some embodiments, the multi-channel audio data may also have more or fewer channels than in the examples above, which is not limited by the present application.
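For reference, the layouts named above can be written out as follows; this is a sketch under the standard interpretation of these layout names, with the ".1" channel being the bass channel, and the names and ordering are illustrative rather than mandated by the method.

```python
# Channel layouts mentioned in the text (standard interpretation; names
# and ordering are illustrative, not mandated by the method).
CHANNEL_LAYOUTS = {
    "3.1": ["left", "bass", "center", "right"],
    "5.1": ["left", "left_surround", "bass", "center", "right",
            "right_surround"],
    "7.1": ["left", "left_surround", "bass", "center", "right",
            "right_surround", "left_rear_surround", "right_rear_surround"],
}
```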
It will be appreciated that in some embodiments, the acquired multi-channel audio data also includes a mask of the multi-channel audio data. The mask overlays the multi-channel audio data to select or mask portions of the audio data. From the mask, the correspondence between the channels before and after mixing can be determined, that is, which pre-mix channels are mixed into the audio data of each mixing channel.
In some embodiments, after the multi-channel audio data is acquired, it may also be initialized. Initialization may include determining the preset channel down-mixing algorithm, which may specifically include: determining the correspondence between the channels before and after mixing, and determining the weight coefficients used to weight and sum the multi-channel audio data into the audio data of each mixing channel. The channels of the multi-channel audio data that correspond to the same mixing channel after down-mixing (i.e., the same output channel after down-mixing) can be treated as a joint detection channel. Further, when the energy of an audio frame is determined, the energy of the corresponding audio frames in the joint detection channel may first be judged initially, for example by determining whether the average energy of those audio frames is greater than a set initial-judgment energy threshold. In some embodiments, initialization may further include initializing the parameters of the formulas in the preset channel down-mixing algorithm or the energy suppression algorithm.
302: and framing the multi-channel audio data to obtain a plurality of audio frames.
It will be appreciated that the characteristics of the multi-channel audio data, and the parameters characterizing its essential features, change over time; that is, the audio data is time-varying. The energy tracking and energy suppression described below can therefore be based on short-time analysis. Specifically, the multi-channel audio data may be divided into segments, each segment being an audio frame, within which the essential characteristics remain unchanged or relatively stable.
Further, in some embodiments, the multi-channel audio data may be directly framed, e.g., each audio frame has a frame length of 10-30ms.
In other embodiments, the multi-channel audio data may be sampled to convert continuous audio data into discrete audio data, with every 512 consecutive sampling points forming one audio frame.
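A minimal framing sketch for this case, assuming non-overlapping frames of 512 sampling points and a NumPy signal; trailing samples that do not fill a whole frame are simply dropped in this sketch.

```python
import numpy as np

def frame_signal(samples: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Split one channel's discrete signal into consecutive audio frames
    of frame_len sampling points; returns shape (num_frames, frame_len)."""
    num_frames = len(samples) // frame_len
    return samples[:num_frames * frame_len].reshape(num_frames, frame_len)
```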
303: if the frame energy of the mth audio frame is greater than the preset energy threshold. Wherein the preset energy threshold is preset. If the frame energy of the mth audio frame is greater than the preset energy threshold, which indicates that after the audio channel is downmixed, the audio corresponding to the audio frame may have a broken sound, energy suppression is needed, i.e. step 304 is performed. When the frame energy of the mth audio frame is less than or equal to the preset energy threshold, it indicates that after the audio channel is downmixed, the audio corresponding to the audio frame will not be broken, and then energy suppression is not needed, that is, step 305 is performed.
It may be understood that the index of an audio frame is its sequence number within any one channel of the multi-channel audio data; for example, the index of the mth audio frame of the ith channel to be mixed among the M channels to be mixed is m.
In some embodiments, the preset energy threshold may be determined from the minimum frame energy that can cause the down-mixed audio data to break, for example -6 dB or -3 dB. The specific value may be determined based on the electronic device, the audio output device, and the multi-channel audio data; the present application is not limited in this regard.
In some embodiments, step 303 may calculate the average frame energy of the mth audio frames of the channels in each joint detection channel and judge whether that average frame energy is greater than the set energy threshold.
In some embodiments, step 303 may simply calculate the frame energy of the mth audio frame of each channel separately and judge whether it is greater than the set energy threshold. In other embodiments, the maximum frame energy among the mth audio frames of the channels in each joint detection channel may be taken as the frame energy of the mth audio frame and compared with the set energy threshold. Further, after the maximum frame energy among the mth audio frames of the joint detection channel has been determined as the frame energy of the mth audio frame, it may be judged whether the maximum frame energy among at least one audio frame near the mth audio frame is greater than the set energy threshold; for example, whether the larger of the frame energies of the mth and (m-1)th audio frames is greater than the set energy threshold.
In some embodiments, the set energy threshold includes an initial-judgment energy threshold (i.e., the first threshold above) and a refined-judgment energy threshold (i.e., the second threshold above). Step 303 may first make an initial judgment on the frame energy of the mth audio frame: calculate the average frame energy of the mth audio frames in each joint detection channel and judge whether it is greater than the initial-judgment energy threshold; if so, a further refined judgment is made. Specifically, the refined judgment may take the maximum frame energy among the mth audio frames of the joint detection channel as the frame energy of the mth audio frame, and then judge whether the maximum frame energy among at least two consecutive audio frames near the mth audio frame is greater than the refined-judgment energy threshold, as sketched below. Step 303 is further described later in connection with the formulas.
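The two-stage decision can be sketched as follows; the array layout, the choice of frames m-1 and m as the "at least two consecutive audio frames", and the function name are assumptions of this illustration.

```python
import numpy as np

def is_high_energy(frame_energies: np.ndarray, m: int,
                   first_threshold: float, second_threshold: float) -> bool:
    """Two-stage judgment for the mth frames of one joint detection channel.
    frame_energies: (num_channels_in_joint, num_frames) per-frame energies."""
    # Initial judgment: average frame energy across the joint detection
    # channel vs. the initial-judgment energy threshold (first threshold).
    if np.mean(frame_energies[:, m]) <= first_threshold:
        return False
    # Refined judgment: maximum frame energy over the channels, then over
    # consecutive frames near m, vs. the refined threshold (second threshold).
    e_max = frame_energies.max(axis=0)
    lo = max(m - 1, 0)
    return float(e_max[lo:m + 1].max()) > second_threshold
```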
In some embodiments, the frame energy of an audio frame may be obtained by Fourier-transforming the multi-channel audio data to obtain its energy spectrum, and calculating the frame energy of the audio frame formed by several consecutive sampling points from their energy values; the specific calculation is described below in connection with the formulas.
304: and performing energy suppression on the mth audio frame by using an energy suppression algorithm to obtain the mth target audio frame.
It can be understood that the energy suppression algorithm is an algorithm for determining a target audio frame after energy suppression based on the frame energy of the mth audio frame. In some embodiments, an energy suppression factor may be calculated according to the calculated frame energy of the mth audio frame and a preset formula, and then energy suppression is performed on the mth audio frame based on the energy suppression factor, so as to obtain the mth target audio frame. In some embodiments, the energy suppression factor is a target gain, and further the energy suppression may be a frame gain of an mth audio frame calculated according to the target gain, and the mth target audio frame calculated according to the frame gain.
In some embodiments, when each audio frame includes a plurality of sampling points, the target audio frame may be obtained by calculating the frame gain of the mth audio frame from the energy suppression factor, calculating the gain of each sampling point in the mth audio frame (i.e., the sampling point gain) from that frame gain, determining the audio data of each sampling point in the target audio frame based on the sampling point gains, and performing signal reconstruction to determine the audio data of the mth target audio frame.
305: the mth audio frame is determined to be the mth target audio frame.
It can be appreciated that in some embodiments, when the frame energy of the mth audio frame does not exceed the preset energy threshold, the audio corresponding to this frame will not break due to excessive energy after down-mixing and output, and will not affect the user's listening experience, so the frame does not need energy suppression and the original audio data can be retained.
306: and carrying out channel down-mixing on each target audio frame of the plurality of channels based on a preset channel down-mixing algorithm to obtain mixed output data.
It can be understood that the preset channel down-mixing rule includes the correspondence between the channels before and after down-mixing and the down-mixing calculation formula. That is, the correspondence determines which channels of the multi-channel audio data contribute target audio frames to each mixing channel after down-mixing, and the target audio frames corresponding to the same mixing channel are substituted into the corresponding down-mixing formula to obtain the mixing output data of that mixing channel. The mixing output data is the audio data of one channel output to the Bluetooth headset 200 as described above.
It will be appreciated that in some embodiments, after the target audio frames are obtained in steps 304 and 305, the Dolby down-mixing algorithm may be used for channel down-mixing to obtain the mixing output data of each mixing channel. Specifically, the target audio frames with corresponding indices in each joint detection channel are weighted and summed to obtain the corresponding mixed audio frame in the mixing output data, and the mixed audio frames corresponding to the same mixing channel are then spliced to obtain the output data of that mixing channel.
In the embodiments of the present application, the multi-channel mixing method tracks the energy of the audio data in the multi-channel audio data, performs energy suppression on audio data with excessive energy, and then performs channel down-mixing on the energy-suppressed multi-channel audio data. The method is applicable both to multi-channel audio data that conforms to the Dolby specification and to multi-channel audio data that does not; that is, it is applicable to multi-channel down-mixing in general, achieves adaptive down-mixing, and avoids the sound breaking caused by excessively high energy in the audio data. Meanwhile, the method only performs energy suppression on the portion of the audio data with higher energy and does not discard any audio data, so it retains the audio data of each channel while solving the down-mixing sound-breaking problem, and improves the user's listening experience.
The multi-channel mixing method in the embodiments of the present application is further described below with reference to fig. 4, taking the down-mixing of two channels into one mixing channel as an example.
Fig. 4 is a flow chart of another multi-channel mixing method according to an embodiment of the application.
As shown in fig. 4, the multi-channel audio data includes the code stream of channel 1 audio data and the code stream of channel 2 audio data. After acquiring the audio data of channel 1 and channel 2, the electronic device frames the audio data of both channels, obtaining 6 audio frames per channel.
As can be seen from the framed code streams, in the audio frames of channel 1 the second and fourth audio frames are unstable, and in the audio frames of channel 2 the second and third audio frames are unstable. To avoid sound breaking in the down-mixed audio data, energy suppression needs to be performed on the second and fourth audio frames of channel 1 and on the second and third audio frames of channel 2, giving the suppressed target-audio-frame code stream of each channel. The Dolby down-mixing algorithm can then be used to weight and sum the corresponding target audio frames of the two channels, obtaining the corresponding mixed audio frames of the mixing channel; the 6 mixed audio frames form the mixing output data.
An energy suppression method according to an embodiment of the present application is further described below with reference to fig. 5.
Fig. 5 is a flow chart of a method of energy suppression in an embodiment of the application.
As shown in fig. 5, the method includes:
501: multi-channel data. The acquired multi-channel data in step 501, that is, the acquired multi-channel audio data in step 301, and step 501 is similar to step 301, and the present application is not described herein.
502: and generating an initialized down-mixing algorithm according to the channel arrangement, and determining a mixing channel.
It can be understood that the channel arrangement is the kind and number of channels in the multi-channel data. The initialization down-mixing algorithm is the formula for obtaining the data of each mixing channel, generated from the correspondence between the channels before and after down-mixing and the channel arrangement. Specifically, in some embodiments, generating the initialization down-mixing algorithm may include determining the correspondence between the M channels and the N channels from the mask of the multi-channel data; for example, when six channels are down-mixed into two, the left channel, left surround, bass, and center are down-mixed into the left channel, and the right channel, right surround, bass, and center are down-mixed into the right channel. In some embodiments, generating the initialization down-mixing algorithm further includes generating the calculation formula for each piece of mixing output data and initializing the parameters in the formula. During down-mixing, the channels that generate the same mixing channel can be treated as a joint detection channel; that is, the data of the channels in the joint detection channel corresponding to a mixing channel is weighted and summed to obtain the data to be output by that mixing channel. A sketch of this initialization follows.
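A minimal initialization sketch, assuming the mask is represented as a mapping from each mixing channel to its joint detection channel and that per-channel weight coefficients are given; both representations are assumptions of this illustration.

```python
import numpy as np

def init_downmix(channels: list[str], mask: dict[str, list[str]],
                 coeff: dict[str, float]) -> np.ndarray:
    """Build the (N, M) weight matrix from the channel arrangement.
    channels: the M input channel names; mask: mixing channel -> its joint
    detection channel (the input channels mixed into it); coeff: preset
    weight per input channel."""
    weights = np.zeros((len(mask), len(channels)))
    for j, inputs in enumerate(mask.values()):
        for name in inputs:
            weights[j, channels.index(name)] = coeff[name]
    return weights
```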
503: the VAD detection result is 1.
VAD detection refers to voice activity detection (Voice Activity Detection), also called voice endpoint detection. The condition for VAD detection here may be that the frame energy of the audio data is greater than the set energy threshold.
In some embodiments, the multi-channel data may be sampled and framed before VAD detection. The framing may divide consecutive sampling points into audio frames after sampling the multi-channel data, for example taking 512 sampling points as one audio frame. Furthermore, in some embodiments, the VAD detection condition may be whether the frame energy of the audio frame is greater than a preset VAD detection threshold (i.e., the first threshold above), for example -6 dB. When the VAD decision is true, that is, the VAD detection result is 1, the frame energy of the audio frame has exceeded the preset VAD detection threshold, and sound breaking may occur after this audio frame is down-mixed. It can be understood that VAD detection is the initial judgment described above, and a further refined judgment can be made based on the initial judgment result.
In some embodiments, it may be detected jointly whether the average frame energy of the audio frames in the joint detection channel is greater than the preset VAD detection threshold. If the average frame energy is too high, the energy of each audio frame of each channel needs to be tracked, and the audio frames meeting the energy suppression condition need to be suppressed.
It can be understood that when the VAD detection result is 1, only energy tracking of the audio frame is needed. When the VAD decision is false, that is, when the VAD detection result is 0, the audio frame will not break after down-mixing due to excessive energy, and the next audio frame can be VAD-detected.
504: the frame energy is calculated, and the maximum energy of the joint detection channel and the front and back frames is tracked.
It will be appreciated that if the VAD detection result is 1, energy tracking, and where necessary energy suppression, is required for the corresponding audio frame.
Specifically, taking L sampling points as one audio frame, the frame energy of the audio frame can be calculated by the following formula:

$$E_{x_i(n)} = \beta\,E_{x_i(n-1)} + (1-\beta)\,\frac{1}{L}\sum_{k=0}^{L-1} x_i(n)(k)^2 \qquad (1)$$

where $E_{x_i(n)}$ denotes the frame energy of the $n$th audio frame in the $i$th channel; $\beta$ is the smoothing coefficient used when calculating the frame energy; and $x_i(n)(k)$ denotes the input data of the $k$th sampling point of the $n$th audio frame in the $i$th channel, that is, part of the acquired multi-channel data. Here $i = 0, 1, 2, 3, 4, 5$ (the range of $i$ depends on the number of channels), $k$ is an integer between 0 and $L-1$, and $n$ ranges over the number of audio frames obtained by framing the multi-channel data. In some embodiments, the smoothing coefficient $\beta = 0.3$.
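Formula (1) as reconstructed above can be sketched as follows; the per-frame placement of the recursive smoothing is part of that reconstruction, since the original equation image is not reproduced in this text.

```python
import numpy as np

def smoothed_frame_energy(prev_energy: float, frame: np.ndarray,
                          beta: float = 0.3) -> float:
    """Recursively smoothed frame energy of one L-sample frame, eq. (1);
    beta = 0.3 per the text."""
    instantaneous = float(np.mean(frame.astype(np.float64) ** 2))
    return beta * prev_energy + (1.0 - beta) * instantaneous
```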
It will be appreciated that in step 504, calculating the frame energy $E_{x_i(n)}$ implements the tracking of the frame energy of each channel's data; when the tracked frame energy of an audio frame is determined to be greater than the set detection threshold, step 505 may be performed.
Specifically, the audio frame with the largest energy among the corresponding audio frames in the joint detection channel may be determined in the following manner:

$$E_{\max}(n) = \max_{i}\,E_{x_i(n)} \qquad (2)$$

where $E_{\max}(n)$ denotes the maximum frame energy of the $n$th audio frames across the channel data of the joint detection channel.
In some embodiments, after determining the maximum frame energy value of the current nth audio frame by the above formula (2), the maximum energy value in the previous and subsequent audio frames may be determined by the following formula:
wherein, the liquid crystal display device comprises a liquid crystal display device,represents the maximum energy in the audio frames preceding and following the nth audio frame,representing the maximum frame energy value of the n-1 st audio frame in each channel data of the joint detection channel,representing the maximum frame energy value of the n+1th audio frame in each channel data of the joint detection channel.
In some embodiments, $E_{\max}^{f}(n)$ may also be determined as the maximum of the values calculated for the $n$th and $(n-1)$th audio frames.
505: and calculating target gain and frame gain, and finally calculating the gain of each sampling point, and multiplying the gain by the fixed gain to output a result.
It will be appreciated that the output result is the input data used when channel down-mixing is performed.
In some embodiments, after $E_{\max}^{f}(n)$ is calculated, it can first be judged whether the energy of the current $n$th audio frame exceeds the set detection threshold (i.e., the second threshold above). If it does, the audio frame is at risk of sound breaking after channel down-mixing and needs energy suppression, so the target gain and frame gain of the audio frame are calculated to determine the energy suppression factor.
It will be appreciated that the judgment against the set detection threshold makes a refined determination of the frame energy of each channel, to determine whether an audio frame with excessive energy needs to be suppressed.
It can be understood that the target gain is an energy suppression factor used to reduce the frame gain and thereby reduce the energy of the audio frame.
In some embodiments, the target gain may be calculated by the following formula:
it is understood that Threshold represents the set detection Threshold, threshold= -3dB.Representing the target gain of the nth audio frame of the ith channel.
As can be seen from formula (4), when the maximum frame energy of the preceding and following frames is smaller than the set detection threshold Threshold, the frame energy of the audio frame is appropriate, and no down-mixing sound breaking due to excessive energy will occur. When the maximum frame energy of the preceding and following frames is greater than or equal to the set detection threshold Threshold, the frame energy of the audio frame is too high and down-mixing sound breaking may be caused by the excessive energy, so the target gain needs to be calculated to suppress the frame gain of the audio frame.
In some embodiments, after the target gain is calculated, the frame gain of the audio frame may be determined by the following formula:
$$\text{frameGain}_{x_i(n)} = \alpha\,\text{frameGain}_{x_i(n-1)} + (1-\alpha)\,\text{targetGain}_{x_i(n)} \qquad (5)$$

where $\text{frameGain}_{x_i(n)}$ denotes the frame gain of the $n$th audio frame of the $i$th channel, $\text{frameGain}_{x_i(n-1)}$ denotes the frame gain of the $(n-1)$th audio frame of the $i$th channel, and $\alpha$ denotes the smoothing coefficient used when calculating the frame gain. In some embodiments, the smoothing coefficient $\alpha = 0.1$.
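Formulas (4) and (5) as reconstructed above can be sketched together as follows; the ratio form of the suppression factor in the over-threshold branch is a reconstruction from the surrounding description, not a formula reproduced in this text.

```python
def target_gain(e_max_neighbors: float, threshold: float) -> float:
    """Eq. (4): no suppression below the set detection threshold,
    otherwise a suppression factor smaller than 1 (reconstructed form)."""
    if e_max_neighbors < threshold:
        return 1.0
    return threshold / e_max_neighbors

def frame_gain(prev_frame_gain: float, tgt: float, alpha: float = 0.1) -> float:
    """Eq. (5): frame gain smoothed across frames; alpha = 0.1 per the text."""
    return alpha * prev_frame_gain + (1.0 - alpha) * tgt
```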
In some embodiments, after the frame gain is calculated according to equation (5) above, the sample gain for each sample in the audio frame may be calculated using the following equation:
wherein, the liquid crystal display device comprises a liquid crystal display device,sample gain representing the kth sample in the nth audio frame of the ith channel,/for the nth channel>The sample point gain representing the kth sample point in the n-1 th audio frame of the ith channel, frameLen representing the frame length of the audio frame. For example, with 512 sampling points as one audio frame, the frame length of the audio frame is 512.
As can be seen from formula (6), the gain of each sampling point is related to the frame gain of the corresponding audio frame and to the gain values of the preceding audio frame. The index of each sampling point is its serial number within the corresponding audio frame; for example, the index of the $k$th sampling point in an audio frame is $k$.
Further, in some embodiments, the formula for calculating the target audio frame from the sampling point gains obtained in formula (6) is:

$$a_i(n)(k) = x_i(n)(k)\cdot\text{gainBuff}[i][k]_{x_i(n)} \qquad (7)$$

where $a_i(n)(k)$ denotes the audio data of the $k$th sampling point of the $n$th target audio frame of the $i$th channel, and $x_i(n)(k)$ denotes the audio data of the $k$th sampling point of the $n$th original audio frame of the $i$th channel, i.e., $x_i(n)(k)$ in formula (1).
It will be appreciated that in some embodiments, the audio data of a continuous target audio frame may be derived from the audio data of the discrete sampling points in the target audio frame, and the audio data of each target audio frame can then be used as the input data for channel down-mixing.
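Formulas (6) and (7) as reconstructed above can be sketched as follows; the linear per-sample ramp between the previous and current frame gains is part of that reconstruction.

```python
import numpy as np

def apply_sample_gains(frame: np.ndarray, gain_prev: float,
                       gain_cur: float) -> np.ndarray:
    """Ramp the sampling point gain linearly from the previous frame gain
    to the current frame gain (eq. (6)) and scale each sampling point
    (eq. (7)), avoiding an audible gain step at the frame boundary."""
    frame_len = len(frame)
    k = np.arange(frame_len)
    gains = gain_prev + (k / frame_len) * (gain_cur - gain_prev)  # eq. (6)
    return frame * gains                                          # eq. (7)
```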
In some embodiments, after each energy suppressed audio frame is calculated, the channel downmixing may be performed by the following formula:
$$a_j(n) = \sum_{i \in J_j} w_{i,j}(n)\,a_i(n) \qquad (8)$$

where $a_j(n)$ denotes the output result of the $n$th audio frame of the $j$th mixing channel, $w_{i,j}(n)$ denotes the mixing weight of the $n$th audio frame of the $i$th channel in the $j$th mixing channel, and $a_i(n)$ denotes the input data of the $n$th audio frame of the $i$th channel during channel down-mixing. When down-mixing to two channels, $j = 0, 1$; when down-mixing to three channels, $j = 0, 1, 2$; and so on. Here $i = 0, 1, 2, 3, 4, 5$ denotes the channels before mixing, representing for example the left channel, left surround channel, bass channel, center channel, right channel, and right surround channel, respectively; $J_j$ denotes the set of pre-mix channels corresponding to the $j$th mixing channel.
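Formula (8) in matrix form; a sketch assuming the weights are arranged as an (N, M) matrix as in the earlier illustrations.

```python
import numpy as np

def downmix_frame(target_frames: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted summation of the nth target audio frames of the input
    channels into the nth mixed audio frame of each mixing channel, eq. (8).
    target_frames: (M, frame_len); weights: (N, M); returns (N, frame_len)."""
    return weights @ target_frames
```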
To more clearly illustrate the positive effects of the multi-channel mixing method in the embodiments of the present application, the method is simulated below with reference to fig. 6 and 7. The simulation software is the clip software. The simulation conditions are: the multi-channel audio data is 5.1-channel audio data, the smoothing coefficients are α = 0.1 and β = 0.3, and the set detection threshold Threshold = -3 dB.
Fig. 6 is a schematic diagram of a code stream waveform of multi-channel audio data according to an embodiment of the present application.
Fig. 7 shows a schematic diagram of a code stream waveform and an energy spectrum of a mixed channel after channel downmixing.
As shown in fig. 6, the six code stream waveforms represent, from top to bottom, the left channel, the right channel, the center channel, the bass channel, the left surround channel, and the right surround channel; the abscissa represents time and the ordinate represents the amplitude of the audio data. During code stream pass-through, each sampling point is represented by 16 bits. Under a fixed-coefficient down-mixing scheme, to ensure that the mixing result does not break, the sum over all channels must stay within the range representable by 16 bits; otherwise the data overflows and wraps around, producing data jumps and hence noise. If the Dolby down-mix coefficients were used to mix the data in fig. 6, whose boxed portion has large energy peaks, the channel down-mixing coefficients would need to be small to ensure that the down-mix does not break. Meanwhile, because volume is positively correlated with energy, the volume of the corresponding mixing channel would decrease, reducing the overall volume of the down-mixed audio and affecting the user's listening experience.
As shown in fig. 7, the waveform diagram in the first row is the code stream waveform of the audio data of the left mixing channel obtained with the multi-channel mixing method of the present application, and the waveform diagram in the second row is the code stream waveform of the audio data of the right mixing channel obtained with the Dolby mixing method. In both waveform diagrams, the abscissa represents time and the ordinate represents the amplitude of the audio data. The third row is the energy spectrum corresponding to the left-mixing-channel audio data in the first row, and the fourth row is the energy spectrum corresponding to the right-mixing-channel audio data in the second row. In the energy spectra, the abscissa represents time and the ordinate represents energy.
As can be seen from fig. 7, the code stream waveform of the audio data processed by the multi-channel mixing method of the embodiments of the present application has a more stable envelope, fewer large fluctuations, and stable energy stripes that never become so strong as to spread over the whole frequency domain. By contrast, the code stream waveform of the audio data of the right mixing channel, which was not processed by the multi-channel mixing method of the present application, shows energy stripes spreading over the whole frequency domain in the higher-energy portions, such as the audio data selected by the box in fig. 7, causing obvious sound breaking and noticeable noise.
Therefore, the multi-channel mixing method of the present application tracks the energy of the audio data of each channel and suppresses the energy of audio data whose energy is too high, thereby reducing the risk of breaking during mixing without discarding audio data, and improving the user's listening experience.
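Putting the pieces together, the per-frame flow the description arrives at (track energy, suppress over-threshold frames, then downmix) might look like the sketch below. The smoothing constants and threshold mirror the simulation settings (α = 0.1, β = 0.3, -3 dB); the function and the exact suppression rule are illustrative assumptions, not the literal embodiment:

```python
import numpy as np

ALPHA, BETA = 0.1, 0.3       # gain / energy smoothing (simulation values)
THRESHOLD = 10 ** (-3 / 10)  # -3 dB as a linear energy ratio vs. full scale

def mix_stream(frames, weights):
    """frames: iterable of (M, L) float arrays in [-1, 1]; weights: (N, M)."""
    M = weights.shape[1]
    energy = np.zeros(M)     # smoothed frame energy per input channel
    gain = np.ones(M)        # smoothed frame gain per input channel
    for frame in frames:
        inst = (frame ** 2).mean(axis=1)            # energy of this frame
        energy = BETA * energy + (1 - BETA) * inst  # energy tracking
        target = np.where(
            energy > THRESHOLD,                     # over-threshold frames
            np.sqrt(THRESHOLD / np.maximum(energy, 1e-12)),  # suppress
            1.0,                                    # others pass unchanged
        )
        gain = ALPHA * gain + (1 - ALPHA) * target  # smooth gain changes
        yield weights @ (frame * gain[:, None])     # weighted downmix
```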
Fig. 8 is a schematic diagram of a hardware structure of a mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 can execute the multi-channel mixing method provided by the embodiments of the present application. In fig. 8, similar parts have the same reference numerals. As shown in fig. 8, the mobile phone 100 may include a processor 110, a power module 140, a memory 180, a camera 101, a mobile communication module 130, a wireless communication module 120, a sensor module 190, an audio module 150, an interface module 160, a display screen 102, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, may combine certain components, may split certain components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units; for example, its processing modules or processing circuits may include a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), a digital signal processor (Digital Signal Processor, DSP), a microcontroller (Micro-programmed Control Unit, MCU), an artificial intelligence (Artificial Intelligence, AI) processor, a programmable logic device (Field Programmable Gate Array, FPGA), and the like. The different processing units may be separate devices or may be integrated in one or more processors. For example, in some examples of the present application, the processor 110 may be configured to determine whether the energy of the mth audio frame is greater than a set energy threshold and to calculate the energy suppression factor. In some embodiments, the processor 110 may further be configured to perform channel downmixing on the obtained target audio frames to obtain mixed output data.
The memory 180 may be used to store data, software programs, and modules, and may be a volatile memory (Volatile Memory) such as a random-access memory (Random-Access Memory, RAM); a non-volatile memory (Non-Volatile Memory) such as a read-only memory (Read-Only Memory, ROM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); a combination of the above types of memories; or a removable storage medium such as a Secure Digital (SD) memory card. In some embodiments of the application, the memory 180 is configured to store the multi-channel audio data and the preset channel downmix algorithm of the mobile phone 100.
The power module 140 may include a power source, a power management component, and the like. The power source may be a battery. The power management component is used to manage charging of the power source and the supply of power from the power source to other modules. The charging management module is used to receive charging input from a charger; the power management module is used to connect the power source, the charging management module, and the processor 110.
The mobile communication module 130 may include, but is not limited to, an antenna, a power amplifier, a filter, a low noise amplifier (Low Noise Amplifier, LNA), and the like. The mobile communication module 130 may provide solutions for wireless communication, including 2G/3G/4G/5G, applied to the mobile phone 100. The mobile communication module 130 may receive electromagnetic waves through the antenna, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 130 may also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna. In some embodiments, at least some of the functional modules of the mobile communication module 130 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 130 may be disposed in the same device as at least some of the modules of the processor 110.
The wireless communication module 120 may include an antenna, and transmit and receive electromagnetic waves via the antenna. The wireless communication module 120 may provide solutions for wireless communication including wireless local area network (Wireless Local Area Networks, WLAN) (e.g., wireless fidelity (Wireless Fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (Global Navigation Satellite System, GNSS), frequency modulation (Frequency Modulation, FM), near field wireless communication technology (Near Field Communication, NFC), infrared technology (IR), etc., applied to the handset 100. The handset 100 may communicate with a network and other devices via wireless communication technology.
In some embodiments, the mobile communication module 130 and the wireless communication module 120 of the handset 100 may also be located in the same module.
The camera 101 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element; the photosensitive element converts the optical signal into an electrical signal and transfers it to the image signal processor (ISP) to be converted into a digital image signal. The mobile phone 100 can implement the photographing function through the ISP, the camera 101, the video codec, the GPU, the display screen 102, the application processor, and the like. For example, in some embodiments of the present application, the camera 101 is used to collect face images and two-dimensional code images, which the mobile phone 100 uses for face recognition, two-dimensional code recognition, and the like.
The display screen 102 includes a display panel. The display panel may employ a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-emitting Diode (OLED), an Active-matrix Organic Light-emitting Diode (AMOLED), a flexible Light-emitting Diode (Flex Light-emitting Diode, FLED), a Mini LED, a Micro OLED, a quantum dot Light-emitting Diode (QLED), or the like. For example, the display screen 102 is used to display various UI interfaces of the mobile phone 100 in modes of split screen, parallel view, single APP exclusive screen, and the like in the landscape/portrait state.
The sensor module 190 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The audio module 150 may convert digital audio information to an analog audio signal output or convert an analog audio input to a digital audio signal. The audio module 150 may also be used to encode and decode audio signals. In some embodiments, the audio module 150 may be disposed in the processor 110, or some functional modules of the audio module 150 may be disposed in the processor 110.
The interface module 160 includes an external memory interface, a universal serial bus (Universal Serial Bus, USB) interface, a subscriber identity module (Subscriber Identification Module, SIM) card interface, and the like. The external memory interface may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the mobile phone 100; the external memory card communicates with the processor 110 through the external memory interface to implement data storage functions. The universal serial bus interface is used for communication between the mobile phone 100 and other electronic devices. The subscriber identity module card interface is used to communicate with a SIM card installed in the mobile phone 100, for example to read a telephone number stored in the SIM card or to write a telephone number into the SIM card.
In some embodiments, the handset 100 also includes keys, motors, indicators, and the like. The keys may include a volume key, an on/off key, and the like. The motor is used to generate a vibration effect on the cellular phone 100. The indicators may include laser indicators, radio frequency indicators, LED indicators, and the like.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (Digital Signal Processor, DSP), microcontroller, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope to any particular programming language; in either case, the language may be a compiled or an interpreted language. In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (Read-Only Memory, ROM), random access memories (Random Access Memory, RAM), erasable programmable read-only memories (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memories (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In addition, the technical solution of the present application further provides a computer-readable storage medium, the computer-readable storage medium storing instructions that, when executed on the electronic device 100, cause the electronic device 100 to execute the multi-channel mixing method provided by the technical solution of the present application.
In addition, the technical solution of the present application further provides a computer program product, the computer program product comprising instructions for implementing the multi-channel mixing method provided by the technical solution of the present application.
In addition, the technical solution of the present application further provides a chip apparatus, comprising: a communication interface for inputting and/or outputting information; and a processor for executing a computer-executable program to cause a device equipped with the chip apparatus to execute the multi-channel mixing method provided by the technical solution of the present application.
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logical unit/module. Physically, a logical unit/module may be one physical unit/module, may be part of a physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not what matters most, and the combination of functions implemented by the logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are less closely related to solving the technical problem addressed by the present application; this does not mean that the above device embodiments contain no other units/modules.
It should be noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (16)

1. A multi-channel mixing method applied to an electronic device, comprising:
acquiring first multi-channel audio data, wherein the first multi-channel audio data comprises audio data of M channels to be mixed;
determining that audio data with energy larger than a preset energy threshold exists in the first multi-channel audio data, and performing energy amplitude reduction processing on the audio data with energy larger than the preset energy threshold in the first multi-channel audio data;
obtaining second multichannel audio data according to the energy amplitude reduction processing result;
and carrying out down-mixing on the second multi-channel audio data to obtain mixed output data with N mixed channels, wherein M is greater than N and N is greater than or equal to 1.
2. The multi-channel mixing method according to claim 1, wherein the determining that audio data having energy greater than a preset energy threshold exists in the first multi-channel audio data includes:
carrying out framing processing on the first multi-channel audio data to obtain a plurality of audio frames, and determining the frame energy of the plurality of audio frames;
and determining that high-energy audio frames with frame energy greater than a preset energy threshold exist in the first multi-channel audio data.
3. The multi-channel mixing method according to claim 2, wherein the frame energy of the high-energy audio frame is determined by the following formula:
E_i(n) = β · E_i(n-1) + (1 - β) · (1/L) · Σ_{k=1..L} e_i(n)(k), where e_i(n)(k) = [x_i(n)(k)]²,
wherein the high-energy audio frame comprises L sampling points;
β represents a frame energy smoothing coefficient;
x_i(n)(k) represents the audio data of the kth sampling point in the nth audio frame of the ith channel to be mixed among the M channels to be mixed;
e_i(n)(k) represents the energy of the kth sampling point in the nth audio frame of the ith channel to be mixed among the M channels to be mixed;
and E_i(n) represents the frame energy of the nth audio frame of the ith channel to be mixed among the M channels to be mixed.
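Read literally, and assuming the recursive smoothing form shown above, claim 3 amounts to a few lines of Python (a sketch, not the literal embodiment):

```python
import numpy as np

def frame_energy(x_frame: np.ndarray, prev_energy: float,
                 beta: float = 0.3) -> float:
    """Smoothed frame energy of one channel's nth audio frame.

    x_frame:     the L samples x_i(n)(k) of the frame
    prev_energy: the smoothed energy E_i(n-1) of the previous frame
    beta:        the frame energy smoothing coefficient
    """
    sample_energy = x_frame.astype(np.float64) ** 2  # e_i(n)(k) = x^2
    return beta * prev_energy + (1.0 - beta) * sample_energy.mean()
```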
4. The multi-channel mixing method according to claim 2, wherein the preset energy threshold comprises a first threshold and/or a second threshold;
the high-energy audio frames include at least one of the following:
among the plurality of audio frames of the M channels to be mixed, an audio frame for which the average frame energy of the at least one audio frame corresponding to the same mixing channel and having the same index is greater than the first threshold is a high-energy audio frame;
and, among the audio frames of the same channel to be mixed, an audio frame for which the maximum frame energy of at least two audio frames consecutive with it is greater than the second threshold is a high-energy audio frame.
5. The method according to claim 4, wherein the maximum frame energy of each audio frame of the M channels to be mixed is determined based on the frame energy of the audio frame that has the greatest frame energy among the audio frames corresponding to the same mixing channel and having the same index as that audio frame.
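A minimal sketch of the maximum-frame-energy rule of claim 5, simplified to the case where all M channels feed the same mixing channel (an assumption for illustration):

```python
import numpy as np

def max_frame_energy(frame_energies: np.ndarray) -> np.ndarray:
    """frame_energies: shape (M, T) - the frame energy of each of the M
    channels to be mixed at T frame indices. For each index n, the maximum
    frame energy is that of the most energetic channel at the same index."""
    return frame_energies.max(axis=0)  # shape (T,), shared by the M channels
```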
6. The multi-channel mixing method according to claim 2, wherein performing energy-reducing processing on the audio data having energy greater than the preset energy threshold in the first multi-channel audio data comprises:
determining a target gain of the high-energy audio frame, and determining a frame gain of the high-energy audio frame according to the target gain;
and determining a target audio frame corresponding to the high-energy audio frame after the energy amplitude reduction processing according to the frame gain of the high-energy audio frame.
7. The multi-channel mixing method of claim 6, wherein the target gain of the high-energy audio frames is determined according to the preset energy threshold and a maximum frame energy of at least two audio frames consecutive to each of the high-energy audio frames.
8. The multi-channel mixing method of claim 7, wherein the frame gain is determined by the following formula:
gain_i(n) = α · gain_i(n-1) + (1 - α) · target_gain_i(n),
wherein α represents a frame gain smoothing coefficient;
target_gain_i(n) represents the target gain of the nth audio frame of the ith channel to be mixed among the M channels to be mixed;
gain_i(n-1) represents the frame gain of the (n-1)th audio frame of the ith channel to be mixed among the M channels to be mixed;
and gain_i(n) represents the frame gain of the nth audio frame of the ith channel to be mixed among the M channels to be mixed.
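Assuming the smoothing direction reconstructed above, the frame gain update of claim 8 is a one-liner:

```python
def frame_gain(prev_gain: float, target_gain: float,
               alpha: float = 0.1) -> float:
    """gain_i(n) = alpha * gain_i(n-1) + (1 - alpha) * target_gain_i(n),
    so the applied gain approaches the target without abrupt jumps."""
    return alpha * prev_gain + (1.0 - alpha) * target_gain
```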
9. The multi-channel mixing method according to claim 6, wherein the determining, according to the frame gain of the high-energy audio frame, the target audio frame corresponding to the high-energy audio frame after the energy-reducing processing includes:
determining the sampling point gain of each sampling point in the high-energy audio frame according to the frame gain of the high-energy audio frame;
according to the gain of each sampling point, performing energy amplitude reduction processing on the audio data of each sampling point in the high-energy audio frame to obtain the audio data of each sampling point in the target audio frame;
and generating the target audio frame according to the audio data of each sampling point of the target audio frame.
10. The multi-channel mixing method of claim 9, wherein each of the sample point gains is determined by the following formula:
gain_i(n)(k) = gain_i(n-1) + [gain_i(n) - gain_i(n-1)] · k / FrameLen,
wherein FrameLen represents the frame length of the target audio frame;
gain_i(n-1) represents the frame gain of the (n-1)th audio frame of the ith channel to be mixed among the M channels to be mixed;
gain_i(n) represents the frame gain of the nth audio frame of the ith channel to be mixed among the M channels to be mixed;
gain_i(n-1)(k) represents the sample point gain of the kth sampling point of the (n-1)th audio frame of the ith channel to be mixed among the M channels to be mixed;
and gain_i(n)(k) represents the sample point gain of the kth sampling point of the nth audio frame of the ith channel to be mixed among the M channels to be mixed.
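Assuming the linear ramp reconstructed above, the per-sample gains of claim 10 interpolate from the previous frame gain to the current one, so the gain never jumps between consecutive samples:

```python
import numpy as np

def sample_gains(prev_frame_gain: float, frame_gain: float,
                 frame_len: int) -> np.ndarray:
    """Per-sample gains gain_i(n)(k) for one frame of FrameLen samples,
    ramping linearly from gain_i(n-1) to gain_i(n)."""
    k = np.arange(1, frame_len + 1)
    return prev_frame_gain + (frame_gain - prev_frame_gain) * k / frame_len

# Claim 11 then scales each sample: target_frame = x_frame * sample_gains(...)
```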
11. The multi-channel mixing method according to claim 9, wherein the audio data of each sampling point in the target audio frame is determined by the audio data of each sampling point of the high-energy audio frame corresponding to the target audio frame and the corresponding sampling point gain.
12. The multi-channel mixing method according to claim 6, wherein the obtaining the second multi-channel audio data according to the energy-reducing processing result includes:
and generating the second multi-channel audio data according to the target audio frame and the low-energy audio frames whose energy is not greater than the preset energy threshold in the first multi-channel audio data.
13. The multi-channel mixing method according to claim 12, wherein the downmixing the second multi-channel audio data to obtain mixed output data having N second channels comprises:
and carrying out weighted summation on the target audio frames and the low-energy audio frames corresponding to the same second channel in the second multi-channel audio data to obtain the mixed output data.
14. An electronic device, comprising:
a memory for storing instructions for execution by one or more processors of the electronic device, and
a processor, which is one of the processors of the electronic device, configured to perform the multi-channel mixing method of any one of claims 1 to 13.
15. A computer readable storage medium, characterized in that the storage medium has stored thereon instructions, which when executed on a computer, cause the computer to perform the multi-channel mixing method of any of claims 1 to 13.
16. A computer program product, characterized in that the computer program product comprises instructions for implementing a multi-channel mixing method according to any of claims 1 to 13.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210414876.5A CN116962955A (en) 2022-04-15 2022-04-15 Multi-channel sound mixing method, equipment and medium
PCT/CN2023/087077 WO2023197967A1 (en) 2022-04-15 2023-04-07 Multi-channel sound mixing method, and device and medium

Publications (1)

Publication Number Publication Date
CN116962955A true CN116962955A (en) 2023-10-27

Also Published As

Publication number Publication date
WO2023197967A1 (en) 2023-10-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination