EP3583786A1 - Apparatus and method for downmixing multichannel audio signals - Google Patents
Info
- Publication number
- EP3583786A1 (application EP18754857.3A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- channel
- component
- input
- input channel
- computing device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/03—Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/07—Generation or adaptation of the Low Frequency Effect [LFE] channel, e.g. distribution or signal processing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/03—Application of parametric coding in stereophonic audio systems
Definitions
- the present application relates generally to audio signal processing, and in particular to a computer-implemented method, apparatus, and computer usable program code for downmixing multichannel audio signals.
- a 5.1 surround sound configuration consists of a pair of front speakers (L and R), one center front channel (C), a pair of side speakers (Ls and Rs), and one Low Frequency Effect (LFE), with a conventional order of L, C, R, Ls, Rs, LFE.
- a 7.1 surround sound configuration consists of a pair of front speakers (L and R), one center front channel (C), a pair of side surround speakers (Lss and Rss), a pair of rear surround speakers (Lrs and Rrs), and one Low Frequency Effect (LFE), with a conventional order of L, C, R, Lss, Rss, Lrs, Rrs, and LFE.
- Downmixing is a process that converts a program with a multiple-channel configuration (e.g., a multichannel audio file) into a program with fewer channels.
- a 5.1 surround sound file or a 7.1 surround sound file can be downmixed and played using a two-channel stereophonic playback system while providing a good listening experience to a listener using the two-channel stereophonic playback system.
- each channel of the multichannel audio input is processed individually and separately with a respective processor to produce a one- or two-channel output.
- the process on each channel does not properly consider the meaningful information shared between the speaker pairs. Consequently, the audio output obtained using these conventional downmixing processes is less accurate and can compromise the immersive listening experience.
- An object of the present application is to develop an audio downmix pipeline that processes the input audio channels by pairs.
- the resulting downmixed output has better accuracy while retaining the spatial information of the original multichannel audio stream. While the number of input and output channels can be any meaningful number under this definition, the following description uses 5.1-to-stereo and 7.1-to-stereo downmixes as examples.
- a method for processing a multichannel input audio signal is performed at a computing device having one or more processors, memory, and a plurality of program modules stored in the memory and to be executed by the one or more processors.
- the method includes the following steps: selecting, from the multi-channel input audio signal, a left input channel and a right input channel, wherein the left input channel and the right input channel correspond to a pair of spatially symmetrical signal sources; generating one or more cross-channel features from the left input channel and the right input channel; processing, in accordance with the cross-channel features, the left input channel and the right input channel to generate a left intermediate channel and a right intermediate channel; and combining each of the left intermediate channel and the right intermediate channel with a third input channel of the multi-channel input audio signal to form a two-channel output audio signal.
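As a rough illustration, the claimed steps might be sketched as follows; the function name, the RMS-based cross-channel feature, the level-matching process, and the -3 dB center split are assumptions for illustration only, not the patent's actual algorithm:

```python
import numpy as np

def downmix_pair(left, right, center, eps=1e-12):
    """Sketch of the claimed method: process a spatially symmetric
    L/R pair using a cross-channel feature, then fold in a third
    input channel (illustrative processing, not the patent's)."""
    # Cross-channel feature: RMS level of each side of the pair.
    rms_l = np.sqrt(np.mean(left ** 2))
    rms_r = np.sqrt(np.mean(right ** 2))

    # Process the pair in accordance with the feature: here, pull
    # both levels toward their geometric mean (an assumed example).
    target = np.sqrt(rms_l * rms_r)
    left_mid = left * (target / (rms_l + eps))
    right_mid = right * (target / (rms_r + eps))

    # Combine each intermediate channel with the third (center)
    # channel, using a common -3 dB split for the center.
    c_gain = 1.0 / np.sqrt(2.0)
    return left_mid + c_gain * center, right_mid + c_gain * center
```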
- a computing device comprises: one or more processors; memory; and a plurality of program modules stored in the memory and to be executed by the one or more processors.
- the plurality of program modules, when executed by the one or more processors, causes the computing device to perform the method described above for processing a multichannel input audio signal.
- a computer program product stored in a non-transitory computer-readable storage medium in conjunction with a computing device having one or more processors, the computer program product including a plurality of program modules that, when executed by the one or more processors, cause the computing device to perform the method described above for processing a multichannel audio signal.
- Figure 1A is a block diagram illustrating a conventional downmixing process performed on a 5.1 input signal, in accordance with some embodiments.
- Figure 1B is a block diagram illustrating a conventional LoRo downmixing process performed on a 5.1 input signal, in accordance with some embodiments.
- Figure 1C is a block diagram illustrating a conventional downmixing process of surround sound virtualization or spatialization from a 7.1 input signal, in accordance with some embodiments.
- Figure 2 illustrates a block diagram of a data processing system configured to perform audio downmixing in accordance with an illustrative embodiment of the present application.
- Figures 3A-3B are block diagrams illustrating audio downmix pipelines of processing multichannel input signals in accordance with some embodiments.
- Figure 4A is a block diagram illustrating a signal workflow including a PROC applied to an input pair in accordance with some embodiments.
- Figure 4B is a block diagram illustrating a signal workflow applied to a 7.1 surround sound file in accordance with some embodiments.
- Figure 5 illustrates a user interface of a software application, or a plugin component of a software application, that is used to manage implementing the signal pipeline discussed with reference to Figure 4B for a 7.1 surround sound file in accordance with some embodiments.
- Figures 6A-6C are a flowchart illustrating a process of downmixing a multichannel input audio signal in accordance with some embodiments.
- FIG. 1A is a block diagram illustrating a conventional downmixing process performed on a 5.1 input signal. Each input channel (i.e., L, C, R, Ls, Rs, and LFE) is sent to a respective processor module (PROC).
- the processor can include one or more sub- modules (not shown), such as a gain, a time delay, a low pass filter, and/or other audio processing modules.
- the output of each process module for a respective channel can include one or more channels, depending on the type of the process(es) implemented in the process module. At the end, these outputs are summed up (Σ) to form, in this example, a two-channel audio signal (i.e., L and R output channels).
- Figure 1B is a block diagram illustrating a conventional Left only/Right only (LoRo) downmixing process. Each input channel (i.e., L, C, R, Ls, Rs, and LFE) is passed through a gain module individually.
- the adjustment of the gain depends on the physical location of the channel as if it were reproduced by a surround sound system. While a surround channel may be attenuated more than an L/R channel, the relationship between the left side and the right side is ignored.
- the left channel output is created by adding all channels from the left side, plus attenuated C and LFE signals.
- the center channel is split into two because there is no physical speaker located on the center line in a stereo reproduction setting.
- the LFE channel is also split into two channels.
- the right channel output is created by adding all channels from the right side, plus attenuated C and LFE signals.
- every input channel is treated differently by its individual processor, a simple gain module, which takes a one-channel input and produces a one- or two-channel output. Finally, all the PROC outputs are summed up based on the intended reproduction location of the input.
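The LoRo scheme above amounts to per-channel gains followed by two sums. A minimal sketch, assuming the common -3 dB (0.707) gains for the center, surround, and LFE contributions (the actual coefficients are implementation-dependent, and note how each pair is treated as independent mono channels):

```python
import numpy as np

def loro_downmix(L, C, R, Ls, Rs, LFE,
                 c_gain=0.707, s_gain=0.707, lfe_gain=0.707):
    """Conventional Left-only/Right-only downmix: each input channel
    gets an individual gain; the left/right sums ignore any
    relationship between the symmetric speaker pairs."""
    Lo = L + c_gain * C + s_gain * Ls + lfe_gain * LFE
    Ro = R + c_gain * C + s_gain * Rs + lfe_gain * LFE
    return Lo, Ro
```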
- FIG. 1C is a block diagram illustrating a conventional downmixing process of surround sound virtualization or spatialization from a 7.1 input signal. Each channel (i.e., L, C, R, Lss, Rss, Lrs, Rrs, and LFE) is processed with a head-related transfer function (HRTF). For example, the left channel input is processed by the HRTF representing the left speaker of a surround sound system. A similar process is applied to all other input channels.
- all of the two-channel output sets based on the respective input channels are summed together to become the left channel output and the right channel output, respectively.
- every input channel is also treated differently by an individual processor (e.g., consisting of a gain module and an HRTF filter).
- the processor takes a one-channel input and produces a two-channel output. All the two-channel outputs are summed up to become the final two-channel output.
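The virtualization workflow above can be sketched as a per-channel convolution with a two-ear filter pair followed by summation; the dictionary layout and placeholder impulse responses are assumptions, not measured HRTFs:

```python
import numpy as np

def virtualize(channels, hrtfs):
    """Conventional virtualization sketch: each input channel is
    convolved with its own (left-ear, right-ear) impulse-response
    pair, and all two-channel results are summed. `channels` maps a
    channel name to a signal; `hrtfs` maps the same name to a pair
    of FIR impulse responses (illustrative placeholders)."""
    out_l = out_r = None
    for name, x in channels.items():
        h_l, h_r = hrtfs[name]
        yl = np.convolve(x, h_l)
        yr = np.convolve(x, h_r)
        out_l = yl if out_l is None else out_l + yl
        out_r = yr if out_r is None else out_r + yr
    return out_l, out_r
```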
- FIG. 2 illustrates a block diagram of a data processing system 100 configured to perform audio downmixing in accordance with an illustrative embodiment of the present application.
- the data processing system 100 includes communications fabric 102, which provides communications between processor unit 104, memory 106, persistent storage 108, communications unit 110, input/output (I/O) unit 112, display 114, and one or more speakers 116.
- the speakers 116 may be built into the data processing system 100 or external to the data processing system 100.
- the data processing system 100 takes the form of a laptop computer, a desktop computer, a tablet computer, a mobile phone (such as a smartphone), a multimedia player device, a navigation device, an educational device (such as a child's learning toy), a gaming system, an audio/video (AV) receiver, or a control device (e.g., a home or industrial controller).
- Processor unit 104 serves to execute instructions for software programs that may be loaded into memory 106.
- Processor unit 104 may be a set of one or more processors or may be a multi-processor core, depending on the particular implementation. Further, processor unit 104 may be implemented using one or more heterogeneous processor systems in which a main processor is present with secondary processors on a single chip. As another illustrative example, processor unit 104 may be a symmetric multi-processor system containing multiple processors of the same type.
- Memory 106 may be a random access memory or any other suitable volatile or non- volatile storage device.
- Persistent storage 108 may take various forms depending on the particular implementation.
- persistent storage 108 may contain one or more components or devices such as a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above.
- the media used by persistent storage 108 may also be removable.
- a removable hard drive may be used for persistent storage 108.
- Communications unit 110, in these examples, provides for communications with other data processing systems or devices. In these examples, communications unit 110 is a network interface card. Communications unit 110 may provide communications through the use of either or both physical and wireless communications links.
- Input/output unit 112 allows for input and output of data with other devices that may be connected to data processing system 100.
- input/output unit 112 may provide a connection for user input through a keyboard and mouse. Further, input/output unit 112 may send output to a printer.
- Display 114 provides a mechanism to display information to a user. Speakers 116 play out sounds to the user.
- Instructions for the operating system and applications or programs are located on persistent storage 108. These instructions may be loaded into memory 106 for execution by processor unit 104. The processes of the different embodiments as described below may be performed by processor unit 104 using computer implemented instructions, which may be located in a memory, such as memory 106. These instructions are referred to as program code (or module), computer usable program code (or module), or computer readable program code (or module) that may be read and executed by a processor in processor unit 104.
- the program code (or module) in the different embodiments may be embodied on different physical or tangible computer readable media, such as memory 106 or persistent storage 108.
- Program code/module 120 is located in a functional form on the computer readable storage media 118 that is selectively removable and may be loaded onto or transferred to data processing system 100 for execution by processor unit 104.
- Program code/module 120 and computer readable storage media 118 form computer program product 122 in these examples.
- computer readable storage media 118 may be in a tangible form, such as, for example, an optical or magnetic disc that is inserted or placed into a drive or other device for transfer onto a storage device, such as a hard drive, that is part of the persistent storage 108.
- the computer readable storage media 118 may also take the form of a persistent storage, such as a hard drive, a thumb drive, or a flash memory that is connected to data processing system 100.
- the tangible form of computer readable storage media 118 is also referred to as computer recordable storage media. In some instances, the computer readable storage media 118 may not be removable from the data processing system 100.
- program code/module 120 may be transferred to data processing system 100 from computer readable storage media 118 through a communications link to communications unit 110 and/or through a connection to input/output unit 112.
- the communications link and/or the connection may be physical or wireless in the illustrative examples.
- the computer readable media also may take the form of non-tangible media, such as communications links or wireless transmissions containing the program code/module.
- the different components illustrated for data processing system 100 are not meant to provide architectural limitations to the manner in which different embodiments may be implemented.
- the different illustrative embodiments may be implemented in a data processing system including components in addition to or in place of those illustrated for data processing system 100.
- Other components shown in Figure 2 can be varied from the illustrative examples shown.
- a storage device in data processing system 100 is any hardware apparatus that may store data.
- Memory 106, persistent storage 108, and computer readable storage media 118 are examples of storage devices in a tangible form.
- a bus system may be used to implement communications fabric 102 and may be comprised of one or more buses, such as a system bus or an input/output bus.
- the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between different components or devices attached to the bus system.
- a communications unit may include one or more devices used to transmit and receive data, such as a modem or a network adapter.
- a memory may be, for example, memory 106 or a cache such as found in an interface and memory controller hub that may be present in communications fabric 102.
- FIGs 3A-3B are block diagrams illustrating audio downmix pipelines of processing multichannel input signals in accordance with some embodiments.
- the multichannel input signal in Figure 3 A is a 5.1 surround sound file 210 including a left front channel (L), a center front channel (C), a right front channel (R), a left side channel (Ls), a right side channel (Rs), and a Low Frequency Effect (LFE).
- the multichannel input signal in Figure 3B is a 7.1 surround sound file 240 including a left front channel (L), a center front channel (C), a right front channel (R), a left side surround channel (Lss), a right side surround channel (Rss), a left rear surround channel (Lrs), a right rear surround channel (Rrs), and a Low Frequency Effect (LFE).
- the system selects one or more input pairs from the multichannel input signal.
- an input pair corresponds to two sets of audio streams that are intended to be reproduced at speakers placed on symmetrical sides.
- the input pair includes a pair of spatially symmetrical signal sources. In some embodiments, an input pair includes two audio channels on the two sides (i.e., left side and right side) at the same angle from the center line.
- the front pair includes the left front (L) channel and the right front (R) channel, which are 30° to the left and right of the center line respectively.
- the rear pair in a Dolby 7.1 surround sound setup places the left rear surround (Lrs) channel and the right rear surround (Rrs) channel at 135° to the left and right of the center line respectively.
- Each selected input pair is then sent to a respective processor (PROC) that produces an output audio pair.
- the system selects the left front channel (L) and the right front channel (R) as a pair 222, and the left side channel (Ls) and the right side channel (Rs) as a pair 224.
- the system selects the left front channel (L) and the right front channel (R) as a pair 242, the left side surround channel (Lss) and the right side surround channel (Rss) as a pair 244, and the left rear surround channel (Lrs) and the right rear surround channel (Rrs) as a pair 246.
- the channel sitting on the center line (e.g., the center front C channel) and the omnidirectional channel (e.g., LFE) are single channels and will not be paired with any other channel.
- each pair will be passed into a processor respectively.
- pairs 222 and 224 are sent to PROCs 232 and 234 respectively
- pairs 242, 244, and 246 are sent to PROCs 252, 254, and 256 respectively.
- the two channels of a pair are compared and analyzed against each other in order to create a more solid sound image and a better spatial outcome.
- at least one of the one or more modules in each processor PROC needs to cross-reference information between two channels within the pair.
- the output of each processor PROC includes two channels, and the two-channel output of each pair is summed (Σ) with the single channels to create an output signal including a left channel output (L') and a right channel output (R') (as shown in Figures 3A and 3B, respectively).
- the processor consists of at least one module that incorporates cross-channel features from an input pair.
- a pair processor includes one or more modules that retrieve signal information from an input pair. Based on the input pair signal information, the pair processor (PROC) then modifies the output stream on a pair basis.
- a pair processor consists of a plurality of different components (or modules), including channel-dependent components and/or channel-independent components.
- a channel-dependent component performs a multichannel-in-multichannel-out process, e.g., a two-in-two-out process in a pair processor.
- the channel-dependent component produces multiple output channels with each based on more than one input channel.
- a channel-dependent component uses the information of the input signal and adjusts the process based on extracted cross-channel features.
- the cross-channel features include a comparison of the volumes of the respective input channels, the relationship of the frequency spectrum characteristics (e.g., magnitude and/or phase) of the left and right input channels, and/or the time and amplitude differences of the signal onsets of the left and right input channels.
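A sketch of how such cross-channel features might be computed for an input pair; the concrete feature definitions below (RMS ratio, spectral-magnitude correlation, cross-correlation lag) are illustrative assumptions, not the patent's specification:

```python
import numpy as np

def cross_channel_features(left, right, eps=1e-12):
    """Illustrative cross-channel features for a symmetric pair."""
    # Volume comparison: ratio of RMS levels of the two channels.
    rms_l = np.sqrt(np.mean(left ** 2))
    rms_r = np.sqrt(np.mean(right ** 2))
    level_ratio = rms_l / (rms_r + eps)

    # Spectral relationship: correlation of the magnitude spectra.
    mag_l = np.abs(np.fft.rfft(left))
    mag_r = np.abs(np.fft.rfft(right))
    spectral_corr = float(np.corrcoef(mag_l, mag_r)[0, 1])

    # Onset timing: lag of the cross-correlation peak (samples).
    xcorr = np.correlate(left, right, mode="full")
    onset_lag = int(np.argmax(np.abs(xcorr)) - (len(left) - 1))

    return {"level_ratio": level_ratio,
            "spectral_corr": spectral_corr,
            "onset_lag": onset_lag}
```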
- the pair processor includes one or more channel-dependent components such as a mid/side (M/S) mixer and/or a width controller (WC).
- an M/S mixer can produce mid and side signals using the sum and difference of the input left and right signals.
- an M/S mixer can produce mid and side signals by comparing the overlap region of the frequency spectra of the input signals.
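The sum/difference form of an M/S mixer can be written directly; the 1/2 scaling is one common convention, chosen here so that decoding reconstructs the inputs exactly:

```python
import numpy as np

def ms_encode(left, right):
    """Mid/side from the sum and difference of the input pair."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_decode(mid, side):
    """Inverse: recover the left/right pair from mid/side."""
    return mid + side, mid - side
```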
- a channel-independent component is a multichannel-in-multichannel-out process (including two-in-two-out) that treats each respective channel of the multichannel input file separately.
- the channel-independent component processes the multichannel input file as if the multiple channels were split into the same number of mono signals, each mono signal processed independently, and the processed mono signals of the respective channels then summed together.
- the pair processor includes one or more channel-independent components, such as an equalizer (EQ) and/or a dynamic range compressor (DRC).
- an equalizer (EQ) module takes in each channel respectively and produces the corresponding channel without using any information from other channels.
- the result of an equalizer (EQ) is as if each channel was equalized separately with the same input parameter.
- Figure 4A is a block diagram illustrating a signal workflow including a PROC 420 applied to an input pair 410 in accordance with some embodiments.
- the PROC 420 includes a plurality of modules (also referred to as components) that are configured to process the input pair signal 410.
- the PROC 420 can be any pair processor, such as PROC 232, PROC 234, PROC 252, PROC 254, or PROC 256 as illustrated in Figures 3A-3B. Accordingly, the pair processor 420 can be applied to any input pair, such as pair 222, pair 224, pair 242, pair 244, or pair 246 as illustrated in Figures 3A-3B.
- the pair processor 420 includes a mid/side (M/S) mixer 422, an equalizer (EQ) 428, a dynamic range compressor (DRC) 430, and a crosstalk cancellation (XTC) module 432 coupled to each other as shown in Figure 4A.
- the output signal of the PROC 420 is further processed with a width controller (WC) 434 and another dynamic range compressor (DRC) 436 to obtain the output pair 440 as shown in Figure 4A.
- the data processing system 100 first sends the input left and right signals 410 into the M/S mixer 422.
- the M/S mixer 422 is a mixing tool configured to generate three components (two side components (S) 424, i.e., left side and right side, and one mid component (M) 426) from the input pair 410.
- the left side component represents the sound source that appears only at the left channel, whereas the right side component corresponds to the sound that appears only at the right channel.
- the middle component is the sound source that appears only in the phantom center of the soundstage, e.g., main musical element and dialog.
- the M/S mixer 422 separates out the information that is useful for various subsequent soundstage enhancement and minimizes unnecessary distortion in the sound quality (e.g., coloration). Moreover, this step also helps lower the correlation between the left and right components.
- the M/S mixer 422 analyzes the sound image and estimates the sound coming from the center, left, and right. The M/S mixer 422 then splits the two input channels, i.e., the left and right signals 410, into a one-channel mid signal 426 and two-channel side signals 424.
- a more detailed description of the M/S mixer 422 can be found in PCT Application No. PCT/US2015/057616, entitled "APPARATUS AND METHOD FOR SOUND STAGE ENHANCEMENT" and filed on October 27, 2015, which is incorporated by reference in its entirety.
- the system sends the side signals 424 to the equalizer (EQ) 428 to adjust the frequency component of the side signals 424.
- the EQ 428 applied to the two side signals 424 includes one or more multi-band equalizers for performing bandpass filtering on the two side signals.
- the multi-band equalizers applied to each side signal are the same.
- the multi-band equalizers applied to one side signal are not the same as those applied to the other side signal. Nonetheless, their functions are to keep the original color of the sound signals and to avoid ambiguous spatial cues present in these two signals.
- this EQ 428 can also be used to select the target sound source based on the spectral analysis of the two side components.
- the EQ 428 produces two output signals 450 and 452.
- each of the output signals 450 and 452 is a two-channel audio signal.
- the EQ 428 applies respective bandpass filters to each of the two-channel side signals 424 to obtain bandpass-filtered signals 450.
- the EQ 428 also produces residual signals 452 based on the difference in the frequency band between the input signals of the EQ 428 (i.e., the two-channel side signals 424) and the output signals of the EQ 428 (i.e., the two-channel bandpass-filtered signals 450).
- the data processing system 100 generates a left-side residual component and a right-side residual component by subtracting the bandpass-filtered left-side component and the bandpass-filtered right-side component from the left-side component and the right-side component, respectively.
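The residual computation (EQ input minus bandpass-filtered EQ output, per side) can be sketched as follows; the short FIR smoothing kernel below merely stands in for the EQ's actual bandpass filter, which is an implementation detail:

```python
import numpy as np

def bandpass(x):
    """Stand-in for the EQ's bandpass stage (illustrative only: a
    short FIR smoothing kernel, not the patent's actual filter)."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(x, kernel, mode="same")

def side_residuals(left_side, right_side):
    """Residual = EQ input minus the bandpass-filtered EQ output,
    computed separately for the left and right side components."""
    res_l = left_side - bandpass(left_side)
    res_r = right_side - bandpass(right_side)
    return res_l, res_r
```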
- a respective amplifier is applied to the residual signal and the result signal from the crosstalk cancellation to adjust the gains of the two signals before they are combined together.
- the bandpass-filtered signals 450 are sent to a dynamic range compressor (DRC) 430.
- the DRC 430 includes a bandpass filter (different from the bandpass filter of EQ 428) for amplifying the two sound signals (i.e., the two-channel bandpass-filtered signals 450) within a predefined frequency range in order to maximize the soundstage enhancement effect achieved by the crosstalk cancellation block (XTC) 432.
- after performing equalization on the left-side component and the right-side component using the first bandpass filter of the EQ 428, the data processing system 100 removes a predefined frequency band from the left-side component and the right-side component using a second bandpass filter of the DRC 430.
- Representative bandpass filters used in the EQ block 428 and the DRC block 430 include a biquadratic filter or a Butterworth filter.
- after performing equalization on the left-side component and the right-side component using the first bandpass filter of the EQ 428, the data processing system 100 performs a first dynamic range compression with the DRC 430 on the left-side component and the right-side component to highlight a predefined frequency band with respect to other frequencies.
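A minimal, memoryless dynamic range compression sketch; the threshold and ratio are placeholder values, and a real DRC would add attack/release smoothing and the band-highlighting described above:

```python
import numpy as np

def compress(x, threshold=0.5, ratio=4.0):
    """Instantaneous compression sketch: sample magnitudes above the
    threshold are scaled down by the ratio; samples at or below the
    threshold pass through unchanged."""
    mag = np.abs(x)
    over = mag > threshold
    gain = np.ones_like(mag)
    # Above threshold: output magnitude = threshold + excess / ratio.
    gain[over] = (threshold + (mag[over] - threshold) / ratio) / mag[over]
    return x * gain
```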
- a more detailed description of the DRC 430 can be found in PCT Application No. PCT/US2015/057616, entitled "APPARATUS AND METHOD FOR SOUND STAGE ENHANCEMENT" and filed on October 27, 2015, which is incorporated by reference in its entirety.
- the output signals from the DRC 430 are then sent to the crosstalk cancellation (XTC) module 432 for performing the crosstalk cancellation process.
- Crosstalk is an inherent problem in stereo (i.e., two-channel) loudspeaker playback. It occurs when the sound from each speaker reaches the ear on the opposite side, and introduces unwanted spectral coloration to the original signal.
- the solution to this problem is a crosstalk cancellation (XTC) algorithm.
- One type of the XTC algorithm is to use a generalized directional binaural transfer function, such as Head-Related Transfer Functions (HRTFs) and/or Binaural Room Impulse Response (BRIR), to represent the angles of the two physical loudspeakers with respect to the listener's position.
- Another type of the XTC algorithm system is a recursive crosstalk cancellation method that does not require head-related transfer function (HRTF), binaural room impulse response (BRIR), or any other binaural transfer functions.
- the basic algorithm can be formulated as follows:
- left′[n] = left[n] − A_L · right[n − d_L]
- right′[n] = right[n] − A_R · left[n − d_R]
- where A_L and A_R are the attenuation gains and d_L and d_R are the delays (in samples) of the two crosstalk paths.
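The basic cancellation step can be sketched as subtracting an attenuated, delayed copy of the opposite channel from each channel. The gain `a` and delay `d` below are illustrative placeholders; a real system would derive per-channel values from the loudspeaker geometry:

```python
def cancel_crosstalk(left, right, a=0.7, d=8):
    """One cancellation pass: subtract an attenuated, delayed copy of the
    opposite channel. Gain `a` and delay `d` (in samples) are illustrative."""
    n = len(left)
    out_l = list(left)
    out_r = list(right)
    for i in range(n):
        if i >= d:
            out_l[i] = left[i] - a * right[i - d]
            out_r[i] = right[i] - a * left[i - d]
    return out_l, out_r
```

The recursive method described in the text would repeat such passes (or feed back the processed channels) so that each cancellation signal is itself cancelled at the opposite ear.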
- the XTC 432 as shown in Figure 4A uses the recursive crosstalk cancellation method or the generalized directional binaural transfer function. A more detailed description of the XTC 432 can be found in US Patent Application No. 14/569,490, entitled
- the output signals of XTC 432 are fed into an amplifier 462, the pair of residual signals 452 is fed into an amplifier 464, and the mid component (M) 426 is also fed into an amplifier 466 before they are sent to the width controller (WC) 434 for processing and being combined together.
- the output signals of the amplifiers 462, 464, and 466 are sent to the WC 434 to adjust the width of the stage.
- WC 434 uses the analyzed information of the input signal pair to control the soundstage width of the output audio signal.
- a stage width can range from as narrow as 0° to as wide as 360° for a totally immersive sound.
- the cross-channel information of a pair (e.g., the output pair 472 or 474)
- a user can assign the desired stage width to the Width Controller, as illustrated below in Figure 5.
- the assigned stage width may affect the pair summing matrix of the WC 434 based on the previously analyzed information.
- a width of the soundstage can be adjusted using the following equations:
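A common mid/side formulation of stage-width adjustment, which a width controller such as the WC 434 may resemble (the patent's exact equations may differ), scales the side component relative to the mid component:

```python
def adjust_width(left, right, width=1.0):
    """Mid/side width control sketch: width=0 collapses to mono,
    width=1 leaves the image unchanged, width>1 widens it.
    This is a common formulation, not necessarily the patent's."""
    out_l, out_r = [], []
    for l, r in zip(left, right):
        mid = 0.5 * (l + r)
        side = 0.5 * (l - r)
        out_l.append(mid + width * side)
        out_r.append(mid - width * side)
    return out_l, out_r
```

A user-assigned stage width (as on the UI of Figure 5) would map onto the `width` factor applied to the pair.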
- the output signals 476 of the WC 434 are sent to the second dynamic range compressor (DRC) 436 to amplify the overall output level of the sound signals in the audio mastering process.
- the data processing system 100 performs a second dynamic range compression by DRC 436 to the left-side component and the right-side component to preserve the localization cues in the digital audio output signals.
- the output of the pipeline is a stereo audio signal 440 including a left channel (L') and a right channel (R').
- the signal workflow including PROC 420 can be applied to a 5.1 surround sound file as shown in Figure 3A.
- PROC 232 and/or PROC 234 may be similar to PROC 420 as illustrated in Figure 4A.
- FIG. 4B is a block diagram illustrating a signal workflow applied to a 7.1 surround sound file in accordance with some embodiments.
- the input signals of the 7.1 surround sound file are grouped into pairs, i.e., L/R pair 242, Lss/Rss pair 244, and Lrs/Rrs pair 246 respectively.
- the L/R pair 242, the Lss/Rss pair 244, and the Lrs/Rrs pair 246 are then sent to PROC 252, PROC 254, and PROC 256, respectively.
- a respective PROC of the PROC 252, PROC 254, and PROC 256 is similar to PROC 420 as discussed with reference to Figure 4A.
- PROC 252, PROC 254, or PROC 256 may include one or more other modules (or components) in different arrangements.
- the respective output signals of PROC 252, PROC 254, and PROC 256 are sent to a respective Width Control, e.g., WC 482, WC 484, and WC 486 respectively.
- the WC 482, WC 484, or WC 486 may be similar to WC 434 as discussed with reference to Figure 4A.
- the outputs of WC 482, WC 484, and WC 486 are combined with the center channel C and the Low Frequency Effect channel LFE to generate the output stereo audio signal 488.
- FIG. 5 illustrates a user interface (UI) 500 of a software application, or a plugin component of a software application, that is used to manage implementing the signal pipeline as discussed with reference to Figure 4B for a 7.1 surround sound file in accordance with some embodiments.
- the signal pipeline may include a plurality of pair processors, such as PROC 420 as shown in Figure 4A. There are three pairs in this case: front pair (L, R), side pair (Lss, Rss), and rear pair (Lrs, Rrs).
- the left panel 510 of the UI 500 controls the gain of respective input channels.
- the controlling area 520 controls the frequency component of the equalizer (EQ).
- the controlling area 530 controls the width of the width controller.
- each pair passes through a different pair processor (PROC) with different parameters; thus, the controlling area 540 can be used to select which pair (e.g., front pair, side pair, or rear pair) receives the input parameters for PROC processing.
- A multi-channel audio format utilizes multiple audio tracks in order to reconstruct sound on a corresponding multi-channel sound reproduction system. Downmixing can happen on both the production end and the reproduction end. On the production end, sound mixers normally start mixing with the highest channel count and downmix to lower channel counts. On the reproduction end, a multi-channel audio track can be downmixed to a lower channel count in order to fit the channel number of the reproduction system. In both cases, the goal is to maintain the use and placement of sound that matches the original creative intent. [0056] Conventional downmixing methods, as shown in Figures 1A-1C, apply a process to each individual track.
- the relationship between tracks is not taken into account.
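The conventional per-track approach can be sketched as a fixed gain matrix applied to each channel independently. The -3 dB coefficients below follow the common ITU-R BS.775 convention and are illustrative; the point is that no cross-channel analysis occurs:

```python
import math

def conventional_downmix(ch):
    """Conventional 5.1-to-stereo downmix sketch: each track is scaled and
    summed independently (ITU-R BS.775-style -3 dB gains for center and
    surrounds, assumed here for illustration). `ch` maps channel names
    'L', 'R', 'C', 'Ls', 'Rs' to equal-length sample lists."""
    g = 1 / math.sqrt(2)  # -3 dB
    n = len(ch["L"])
    out_l = [ch["L"][i] + g * ch["C"][i] + g * ch["Ls"][i] for i in range(n)]
    out_r = [ch["R"][i] + g * ch["C"][i] + g * ch["Rs"][i] for i in range(n)]
    return out_l, out_r
```

Each output sample depends only on the corresponding samples of individual tracks, so inter-channel cues (relative level, timing, phase) are never examined, which is the limitation the pair-by-pair method addresses.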
- the method and system presented herein take audio input on a pair-by-pair basis.
- the multichannel audio tracks are first grouped into pairs depending on the physical placement of the reproduction system.
- Second, the relationship between the two channels of a pair will be analyzed.
- the pair input is processed based on the analyzing result.
- the multichannel audio output downmixed with the system and method introduced herein creates a more accurate soundscape, closer to the original multi-channel audio input, than the same audio input downmixed with conventional methods.
- Figures 6A-6C are a flowchart illustrating a process of downmixing a multi-channel input audio signal in accordance with some embodiments.
- the data processing system 100 selects (602) a left input channel and a right input channel from the multi-channel input audio signal.
- the left input channel and the right input channel correspond to a pair of spatially symmetrical signal sources.
- the multi-channel audio input signal is the 5.1 surround sound file 210 of Figure 3A or the 7.1 surround sound file 240 of Figure 3B.
- the 5.1 surround input signal 210 includes a left front channel L and a right front channel R as a pair 222, and a left side channel Ls and a right side channel Rs as a pair 224.
- the 7.1 surround input signal 240 includes a left front channel L and a right front channel R as a pair 242, a left side surround channel Lss and a right side surround channel Rss as a pair 244, and a left rear surround channel Lrs and a right rear surround channel Rrs as a pair 246.
- the data processing system 100 then generates (604) one or more cross- channel features from the left input channel and the right input channel of a selected pair.
- the one or more cross-channel features include a comparison of volumes of the left and right input channels of a pair, relationship of the frequency spectrum characteristics (e.g., magnitude and/or phase) of the left and right input channels, and/or time and amplitude differences of the signal onset of the left and right input channels.
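Two of the features listed above can be sketched directly: a volume comparison via an RMS ratio, and the time difference of signal onsets. The half-peak onset criterion is an illustrative assumption; the patent's analysis may use richer features (e.g., per-band magnitude and phase):

```python
import math

def cross_channel_features(left, right):
    """Sketch of the cross-channel analysis of step 604: compare channel
    volumes (RMS ratio) and estimate the onset-time difference in samples.
    The onset criterion (first sample reaching half the peak) is assumed."""
    def rms(x):
        return math.sqrt(sum(s * s for s in x) / len(x))

    volume_ratio = rms(left) / max(rms(right), 1e-12)

    def onset(x, frac=0.5):
        peak = max(abs(s) for s in x)
        for i, s in enumerate(x):
            if abs(s) >= frac * peak:
                return i
        return 0

    onset_lag = onset(left) - onset(right)  # positive: left arrives later
    return {"volume_ratio": volume_ratio, "onset_lag": onset_lag}
```

Features of this kind give the downstream PROC and width controller the inter-channel relationship that per-track downmixing discards.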
- the data processing system 100 then processes (606), in accordance with the cross-channel features, the left input channel and the right input channel of the selected pair to generate a left intermediate channel and a right intermediate channel.
- the left input channel and the right input channel of a pair are processed using a processor (PROC) as illustrated in Figures 3A-3B and 4A-4B.
- the PROC includes one or more modules as illustrated in Figure 4A.
- the data processing system 100 combines (608) each of the left intermediate channel and the right intermediate channel with a third input channel of the multi-channel input audio signal to form a two-channel output audio signal.
- For example, as illustrated in Figure 3A, the left intermediate channel and the right intermediate channel (e.g., the L/R pair or the Ls/Rs pair) processed by a respective PROC are combined with the center channel C and/or the Low Frequency Effect (LFE) channel to produce the two-channel output audio signal L'/R'.
- the left intermediate channel and the right intermediate channel (e.g., the L/R pair, the Lss/Rss pair, or the Lrs/Rrs pair) processed by a respective PROC are combined with the center channel C and/or the Low Frequency Effect (LFE) channel to produce the two-channel output audio signal L'/R'.
- a soundstage is normally defined as the area between the left-most perceived location and right-most perceived location in an audio reproduction. In other words, the soundstage is the limit of how far a sound object can be perceived.
- the soundstage width is defined as the distance between the left boundary and the right boundary.
- the soundstage width of a stereophonic reproduction is the separation distance between the two loudspeakers.
- the concept of soundstage is adapted to apply to each symmetric pair of channels independently.
- the data processing system 100 further adjusts (610) a soundstage width associated with the left intermediate channel and the right intermediate channel using a width controller (e.g., the WC 434) before combining the left intermediate channel and the right intermediate channel with the third input channel.
- the data processing system 100 receives (612) a user input specifying the soundstage width of the two-channel output audio signal. The user input can be received on UI 500 as illustrated in Figure 5.
- the step 606 of processing the left input channel and the right input channel further comprises extracting (614) a middle component, a left side component, and a right side component from the left input channel and the right input channel.
- the input pair 410 is processed by the M/S mixer 422 to produce a middle component (M) 426 and a side component pair (S) 424 comprising a left side component and a right side component.
- the data processing system 100 processes (616) the left side component and the right side component before combining them with the middle component to generate the left intermediate channel and the right intermediate channel.
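The extraction of step 614 can be sketched with one common mid/side decomposition, assumed here: a shared middle component as the channel average, with each side component as that channel's remainder, so the pair can be reconstructed exactly:

```python
def ms_extract(left, right):
    """Sketch of the M/S mixer 422: a shared middle component plus
    per-channel side components. M = (L + R) / 2 is one common choice,
    assumed here; the patent's exact mixer may differ.
    Reconstruction: L = M + side_l, R = M + side_r."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side_l = [l - m for l, m in zip(left, mid)]
    side_r = [r - m for r, m in zip(right, mid)]
    return mid, side_l, side_r
```

Because the decomposition is exactly invertible, the side components can be equalized, compressed, and crosstalk-cancelled independently and later recombined with the untouched middle component.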
- the step 606 of processing the left input channel and the right input channel further comprises performing (618) equalization (e.g., by the EQ block 428 of Figure 4A) to the left side component and the right side component using a bandpass filter to obtain a left bandpass-filtered component and a right bandpass-filtered component (e.g., the bandpass-filtered signals 450).
- the equalization process further generates (620) a left-side residual component based on a difference between the left side component and the left bandpass-filtered component, and a right-side residual component based on a difference between the right side component and the right bandpass- filtered component, such as the left-side residual component and the right-side residual component 452 as illustrated in Figure 4A.
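The residual generation of step 620 is a simple subtraction against the bandpass output, as sketched below with the filter passed in as a callable (the filter itself is whatever the EQ 428 uses):

```python
def split_band(side, bandpass):
    """Steps 618/620 sketch: equalize a side component with a bandpass
    filter and keep the residual (side minus bandpassed part) so the two
    halves can be recombined losslessly after further processing."""
    filtered = bandpass(side)
    residual = [s - f for s, f in zip(side, filtered)]
    return filtered, residual
```

Since `filtered + residual` reproduces the original side component sample-for-sample, only the bandpassed portion needs to pass through the DRC/XTC chain while the residual bypasses it, matching the amplifier-and-sum arrangement described earlier.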
- the data processing system 100 after performing equalization to the left side component and the right side component, performs (622) a first dynamic range compression (e.g., by the DRC 430 of Figure 4A) to the left bandpass- filtered component and the right bandpass-filtered component (e.g., produced by the EQ 428 of Figure 4A), respectively, to obtain a left compressed component and a right compressed component correspondingly.
- the data processing system 100 after performing the first dynamic range compression, performs (624) crosstalk cancellation (e.g., by the XTC 432 of Figure 4A) to the left compressed component and the right compressed component (e.g., produced by DRC 430 of Figure 4A), respectively, to obtain a crosstalk-cancelled left-side component and a crosstalk-cancelled right-side component.
- the data processing system 100 combines (626) the crosstalk-cancelled left-side component and the crosstalk-cancelled right-side component, the left-side residual component and the right-side residual component, and the middle component to generate the left intermediate channel and the right intermediate channel.
- the combining step further comprises adjusting (628) a soundstage width (e.g., by the WC 434 of Figure 4A) associated with the left intermediate channel and the right intermediate channel before combining them with the third input channel.
- the left and right intermediate channels produced by a respective PROC are sent to a respective WC for adjusting the soundstage width.
- the adjusted signals are then combined with a third input channel, e.g., the C or the LFE channel, to produce the output stereo signal 488.
- the data processing system 100 after adjusting the soundstage width, performs (630) a second dynamic range compression (e.g., by DRC 436 of Figure 4A) to generate the left intermediate channel and the right intermediate channel.
- the invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements.
- the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
- Current examples of optical disks include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W) and DVD.
- a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- the data processing system is implemented in the form of a semiconductor chip (e.g., a system-on-chip) that integrates all components of a computer or other electronic system into a single chip substrate.
- I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
- Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
- the terms first, second, etc. may be used herein to describe various elements, but these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
- a first port could be termed a second port, and, similarly, a second port could be termed a first port, without departing from the scope of the embodiments.
- the first port and the second port are both ports, but they are not the same port.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Stereophonic System (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762460584P | 2017-02-17 | 2017-02-17 | |
PCT/US2018/000075 WO2018151858A1 (en) | 2017-02-17 | 2018-02-16 | Apparatus and method for downmixing multichannel audio signals |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3583786A1 true EP3583786A1 (en) | 2019-12-25 |
EP3583786A4 EP3583786A4 (en) | 2020-12-23 |
Family
ID=63169877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP18754857.3A Withdrawn EP3583786A4 (en) | 2017-02-17 | 2018-02-16 | Apparatus and method for downmixing multichannel audio signals |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP3583786A4 (en) |
JP (1) | JP2020508590A (en) |
KR (1) | KR20190109726A (en) |
CN (1) | CN109644315A (en) |
TW (1) | TW201843675A (en) |
WO (1) | WO2018151858A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR3091636B1 (en) * | 2019-01-04 | 2020-12-11 | Parrot Faurecia Automotive Sas | Multichannel audio signal processing method |
US10841728B1 (en) * | 2019-10-10 | 2020-11-17 | Boomcloud 360, Inc. | Multi-channel crosstalk processing |
US11032644B2 (en) | 2019-10-10 | 2021-06-08 | Boomcloud 360, Inc. | Subband spatial and crosstalk processing using spectrally orthogonal audio components |
CN110853658B (en) * | 2019-11-26 | 2021-12-07 | 中国电影科学技术研究所 | Method and apparatus for downmixing audio signal, computer device, and readable storage medium |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2000242253A1 (en) * | 2000-04-10 | 2001-10-23 | Harman International Industries Incorporated | Creating virtual surround using dipole and monopole pressure fields |
CN1233200C (en) * | 2003-03-04 | 2005-12-21 | Tcl王牌电子(深圳)有限公司 | FPGA 5.1 channel virtual speech reproducing method and device |
KR100644617B1 (en) * | 2004-06-16 | 2006-11-10 | 삼성전자주식회사 | Apparatus and method for reproducing 7.1 channel audio |
US7853022B2 (en) * | 2004-10-28 | 2010-12-14 | Thompson Jeffrey K | Audio spatial environment engine |
US7751572B2 (en) * | 2005-04-15 | 2010-07-06 | Dolby International Ab | Adaptive residual audio coding |
JP5053849B2 (en) * | 2005-09-01 | 2012-10-24 | パナソニック株式会社 | Multi-channel acoustic signal processing apparatus and multi-channel acoustic signal processing method |
US8050434B1 (en) * | 2006-12-21 | 2011-11-01 | Srs Labs, Inc. | Multi-channel audio enhancement system |
EP2169664A3 (en) * | 2008-09-25 | 2010-04-07 | LG Electronics Inc. | A method and an apparatus for processing a signal |
UA101542C2 (en) * | 2008-12-15 | 2013-04-10 | Долби Лабораторис Лайсензин Корпорейшн | Surround sound virtualizer and method with dynamic range compression |
KR101387195B1 (en) * | 2009-10-05 | 2014-04-21 | 하만인터내셔날인더스트리스인코포레이티드 | System for spatial extraction of audio signals |
JP5955862B2 (en) * | 2011-01-04 | 2016-07-20 | ディーティーエス・エルエルシーDts Llc | Immersive audio rendering system |
TWI479905B (en) * | 2012-01-12 | 2015-04-01 | Univ Nat Central | Multi-channel down mixing device |
CN106170991B (en) * | 2013-12-13 | 2018-04-24 | 无比的优声音科技公司 | Device and method for sound field enhancing |
CN108293165A (en) * | 2015-10-27 | 2018-07-17 | 无比的优声音科技公司 | Enhance the device and method of sound field |
-
2018
- 2018-02-16 KR KR1020197007657A patent/KR20190109726A/en unknown
- 2018-02-16 WO PCT/US2018/000075 patent/WO2018151858A1/en unknown
- 2018-02-16 EP EP18754857.3A patent/EP3583786A4/en not_active Withdrawn
- 2018-02-16 JP JP2019503460A patent/JP2020508590A/en active Pending
- 2018-02-16 CN CN201880003285.0A patent/CN109644315A/en active Pending
- 2018-02-21 TW TW107105810A patent/TW201843675A/en unknown
Also Published As
Publication number | Publication date |
---|---|
TW201843675A (en) | 2018-12-16 |
CN109644315A (en) | 2019-04-16 |
EP3583786A4 (en) | 2020-12-23 |
JP2020508590A (en) | 2020-03-19 |
WO2018151858A1 (en) | 2018-08-23 |
KR20190109726A (en) | 2019-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10313813B2 (en) | Apparatus and method for sound stage enhancement | |
US9794715B2 (en) | System and methods for processing stereo audio content | |
CN111131970B (en) | Audio signal processing apparatus and method for filtering audio signal | |
RU2685041C2 (en) | Device of audio signal processing and method of audio signal filtering | |
EP3583786A1 (en) | Apparatus and method for downmixing multichannel audio signals | |
EP2484127B1 (en) | Method, computer program and apparatus for processing audio signals | |
KR102355770B1 (en) | Subband spatial processing and crosstalk cancellation system for conferencing | |
KR20120067294A (en) | Speaker array for virtual surround rendering | |
CN112313970B (en) | Method and system for enhancing an audio signal having a left input channel and a right input channel | |
CN109923877B (en) | Apparatus and method for weighting stereo audio signal | |
JP2024502732A (en) | Post-processing of binaural signals | |
US11924628B1 (en) | Virtual surround sound process for loudspeaker systems | |
US11373662B2 (en) | Audio system height channel up-mixing | |
JP2020039168A (en) | Device and method for sound stage extension | |
CN116615919A (en) | Post-processing of binaural signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20190318 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: HSIEH, PEI-LUN Inventor name: WU, TSAI-YI |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20201125 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: H04S 3/00 20060101AFI20201119BHEP Ipc: G10L 19/008 20130101ALI20201119BHEP |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20210626 |