CN110853658A - Method and apparatus for downmixing audio signal, computer device, and readable storage medium - Google Patents

Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Info

Publication number
CN110853658A
CN110853658A (application CN201911173782.8A)
Authority
CN
China
Prior art keywords
frequency domain
channel frequency
domain signal
signal
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911173782.8A
Other languages
Chinese (zh)
Other versions
CN110853658B (en)
Inventor
王薇娜
高五峰
董强国
孙学京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)
Original Assignee
CHINA FILM SCIENCE AND TECHNOLOGY INST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA FILM SCIENCE AND TECHNOLOGY INST filed Critical CHINA FILM SCIENCE AND TECHNOLOGY INST
Priority to CN201911173782.8A priority Critical patent/CN110853658B/en
Publication of CN110853658A publication Critical patent/CN110853658A/en
Application granted granted Critical
Publication of CN110853658B publication Critical patent/CN110853658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 1/00 Two-channel systems
    • H04S 1/007 Two-channel systems in which the audio signals are in digital form
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a method and a device for down-mixing an audio signal, computer equipment and a readable storage medium. The method comprises the following steps: multiplying the multi-channel audio signal by a two-channel conversion coefficient to obtain a left-channel audio signal and a right-channel audio signal; respectively converting the multi-channel audio signal, the left channel audio signal and the right channel audio signal to generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal based on a head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; weighting the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and weighting the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and converting the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.

Description

Method and apparatus for downmixing audio signal, computer device, and readable storage medium
Technical Field
The present invention relates to the field of audio processing, and in particular, to a method and an apparatus for downmixing an audio signal, a computer device, and a computer-readable storage medium.
Background
In recent years, with the upgrading of high-definition video technology from 2K to 4K and even 8K, and the development of VR (Virtual Reality) and AR (Augmented Reality), people's expectations for audio quality have gradually risen. Immersive audio systems with 5.1, 7.1, or even more channels have begun to emerge in large numbers.
With the rapid development of the mobile internet, more and more users choose to experience audio content through earphones. There is therefore a need to convert multi-channel audio content into a two-channel or stereo format (i.e., downmix processing) to accommodate playback over headphones or dual speakers. However, current down-mixing techniques are not mature, and the resulting two-channel audio rarely achieves both good sound quality and a good spatial rendering effect.
It is to be noted that the above information disclosed in the background section is only for enhancement of understanding of the background of the invention, and it may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present invention provides a method, an apparatus, a computer device and a computer readable storage medium for audio signal downmixing.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present invention, there is provided a downmix method of an audio signal, including: acquiring a multi-channel audio signal; correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal; respectively performing time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; according to the weight coefficient, performing weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and performing weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and respectively performing frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
According to an embodiment of the present invention, the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of each channel corresponding to the head-related transmission model.
According to an embodiment of the present invention, the weight coefficient is predetermined according to a moving speed of a sound source of the multi-channel audio signal.
According to an embodiment of the present invention, when the sound source is a stationary sound source, the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal is 0.
According to an embodiment of the present invention, the weight coefficients are determined by pre-training a multi-channel frequency domain signal corresponding to the multi-channel audio sample signal based on a convolutional neural network model.
According to an embodiment of the invention, the method further comprises: and determining the weight coefficient according to the ratio of the maximum eigenvalue of the covariance matrix corresponding to the multi-channel audio signal to the sum of all eigenvalues.
According to an embodiment of the present invention, determining the weight coefficient according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues includes: when the ratio is greater than a preset threshold value, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal; when the ratio is smaller than the preset threshold, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is larger than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
According to another aspect of the present invention, there is provided a down-mixing apparatus of an audio signal, including: the signal acquisition module is used for acquiring a multi-channel audio signal; the first processing module is used for correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal; the first conversion module is used for respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; the second processing module is used for processing the multi-channel frequency domain signals respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; a third processing module, configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to a weighting coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and the second conversion module is used for respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
According to still another aspect of the present invention, there is provided a computer apparatus comprising: a memory, a processor, and executable instructions stored in the memory and executable by the processor, wherein the processor implements any of the above-mentioned audio signal downmixing methods when executing the executable instructions.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement any one of the above-described methods of downmixing an audio signal.
According to the audio signal down-mixing method provided by the invention, a two-channel audio signal with both good sound quality and a good spatial rendering effect can be obtained.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart illustrating a method of downmixing an audio signal according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating another audio signal downmixing method according to an exemplary embodiment.
Fig. 3 is a block diagram illustrating a downmixing apparatus of an audio signal according to an exemplary embodiment.
FIG. 4 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
Fig. 5 is a schematic diagram of a multi-channel audio system shown in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, apparatus, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
As described above, in order to solve the problem of poor sound quality or spatial misalignment of the downmixed binaural audio, the present invention provides a new audio signal downmixing method. The following specifically describes embodiments of the present invention.
Fig. 1 is a flowchart illustrating a method of downmixing an audio signal according to an exemplary embodiment. The method of downmixing an audio signal as shown in fig. 1 may be applied, for example, to a binauralization (binauralization) process of a typical immersive 5.1 system.
As shown in fig. 5, the audio signal output by a typical immersive 5.1 system includes: a left channel (l) audio signal, a right channel (r) audio signal, a center channel (c) audio signal, a left surround channel (ls) audio signal, and a right surround channel (rs) audio signal. The azimuth angle of the left channel audio signal is -30 degrees, the azimuth angle of the right channel audio signal is 30 degrees, the azimuth angle of the center channel audio signal is 0 degrees, the azimuth angle of the left surround channel audio signal is -110 degrees, and the azimuth angle of the right surround channel audio signal is 110 degrees.
Referring to fig. 1, a method 10 of downmixing an audio signal includes:
in step S102, a multi-channel audio signal is acquired.
A typical immersive 5.1 system is used as an example for explanation: the acquired multi-channel audio signal may be represented as x_in(t) = [x_in_l(t), x_in_r(t), x_in_c(t), x_in_ls(t), x_in_rs(t)]^T, where x_in_l(t), x_in_r(t), x_in_c(t), x_in_ls(t), and x_in_rs(t) are the left channel, right channel, center channel, left surround channel, and right surround channel audio signals, respectively.
In step S104, the multi-channel audio signal is correspondingly multiplied by a preset left channel conversion coefficient and a preset right channel conversion coefficient, respectively, to obtain a left channel audio signal and a right channel audio signal.
In some embodiments, the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of a Head Related Transfer Function (HRTF) model corresponding to each channel.
Corresponding to the above-mentioned 5 channels, the left channel conversion coefficient may be expressed as α_l = [α_l_l, α_r_l, α_c_l, α_ls_l, α_rs_l]^T, and the right channel conversion coefficient may be expressed as α_r = [α_l_r, α_r_r, α_c_r, α_ls_r, α_rs_r]^T.
In light of the above, the obtained left channel audio signal can be expressed as x_m_l(t) = [x_in_l(t)·α_l_l, x_in_r(t)·α_r_l, x_in_c(t)·α_c_l, x_in_ls(t)·α_ls_l, x_in_rs(t)·α_rs_l]^T, and the obtained right channel audio signal can be expressed as x_m_r(t) = [x_in_l(t)·α_l_r, x_in_r(t)·α_r_r, x_in_c(t)·α_c_r, x_in_ls(t)·α_ls_r, x_in_rs(t)·α_rs_r]^T.
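As a minimal sketch of this per-channel scaling in step S104, the following scales each input channel by its conversion coefficient; the sample values and α coefficients below are made up for illustration (the patent derives the coefficients from HRTF filtering damping):

```python
# Sketch of step S104: scale each of the 5 input channels by its
# left (or right) conversion coefficient alpha. All numbers here
# are illustrative, not from the patent.

def scale_channels(x_in, alpha):
    """x_in: list of per-channel sample lists; alpha: matching coefficients."""
    return [[s * a for s in ch] for ch, a in zip(x_in, alpha)]

# Two samples per channel, channel order: l, r, c, ls, rs.
x_in = [[1.0, 2.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.2], [0.3, 0.1]]
alpha_l = [1.0, 0.2, 0.7, 0.8, 0.1]   # hypothetical left coefficients

x_m_l = scale_channels(x_in, alpha_l)
print(x_m_l[0])  # -> [1.0, 2.0]
```

The same call with α_r would produce x_m_r(t); the result is still a 5-channel vector, matching the patent's notation.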
In step S106, time-frequency domain conversion is performed on the multi-channel audio signal, the left channel audio signal, and the right channel audio signal, respectively, so as to generate a multi-channel frequency domain signal, a first left channel frequency domain signal, and a first right channel frequency domain signal.
For example, based on a Fast Fourier Transform (FFT) algorithm, the multi-channel audio signal x_in(t), the left channel audio signal x_m_l(t), and the right channel audio signal x_m_r(t) in the time domain may be separately converted into the multi-channel frequency domain signal x_in(k, n), the first left channel frequency domain signal x_m_l(k, n), and the first right channel frequency domain signal x_m_r(k, n) in the frequency domain, where k and n represent the frequency and time, respectively, of the discrete domain.
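To make the time-to-frequency conversion of step S106 concrete, here is a naive DFT (the FFT computes the same result faster) applied to one frame of one channel; a real implementation would use an optimized FFT library:

```python
# Sketch of step S106: convert a time-domain frame x(t) into its
# frequency-domain bins x(k). A naive O(N^2) DFT stands in for the
# FFT named in the patent; both compute the same transform.
import cmath

def dft(frame):
    N = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N))
            for k in range(N)]

# A unit impulse has a flat spectrum: every bin has magnitude 1.
spectrum = dft([1.0, 0.0, 0.0, 0.0])
print([round(abs(c), 6) for c in spectrum])  # -> [1.0, 1.0, 1.0, 1.0]
```

Applying this per frame n yields the x(k, n) indexing used in the patent, with k the frequency bin and n the frame (time) index.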
In step S108, the multi-channel frequency domain signal is processed based on the left channel frequency domain response submodel and the right channel frequency domain response submodel in the head-related transmission model, respectively, to obtain a second left channel frequency domain signal and a second right channel frequency domain signal.
The response function of the left channel frequency domain response submodel in the head-related transmission model may be represented as h_l(k, n) = [h_l_l(k, n), h_r_l(k, n), h_c_l(k, n), h_ls_l(k, n), h_rs_l(k, n)]^T, and the response function of the right channel frequency domain response submodel may be represented as h_r(k, n) = [h_l_r(k, n), h_r_r(k, n), h_c_r(k, n), h_ls_r(k, n), h_rs_r(k, n)]^T. Thus, the processed second left channel frequency domain signal can be represented as x_h_l(k, n) = [x_in_l(k, n)·h_l_l(k, n), x_in_r(k, n)·h_r_l(k, n), x_in_c(k, n)·h_c_l(k, n), x_in_ls(k, n)·h_ls_l(k, n), x_in_rs(k, n)·h_rs_l(k, n)]^T, and the second right channel frequency domain signal as x_h_r(k, n) = [x_in_l(k, n)·h_l_r(k, n), x_in_r(k, n)·h_r_r(k, n), x_in_c(k, n)·h_c_r(k, n), x_in_ls(k, n)·h_ls_r(k, n), x_in_rs(k, n)·h_rs_r(k, n)]^T.
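In the frequency domain, applying the response submodel of step S108 reduces to a per-bin complex multiplication of each channel spectrum by that channel's response h. A toy sketch, with entirely hypothetical response values rather than measured HRTF data:

```python
# Sketch of step S108: multiply each channel's spectrum, bin by bin,
# by that channel's frequency-domain response. Values are illustrative.

def apply_response(channel_spectra, responses):
    """channel_spectra: list of per-channel spectra (lists of complex);
    responses: matching per-channel response spectra."""
    return [[x * h for x, h in zip(spec, resp)]
            for spec, resp in zip(channel_spectra, responses)]

spectra = [[1 + 0j, 2 + 0j]]   # one channel, two frequency bins
resp = [[0.5 + 0j, 0 + 1j]]    # hypothetical h values for that channel
x_h = apply_response(spectra, resp)
print(x_h[0])  # -> [(0.5+0j), 2j]
```

Filtering in the frequency domain this way is equivalent to convolving with the head-related impulse response in the time domain, which is why the patent performs this step after the FFT.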
In step S110, the first left channel frequency domain signal and the second left channel frequency domain signal are weighted according to the weight coefficient to generate a down-mixed left channel frequency domain signal, and the first right channel frequency domain signal and the second right channel frequency domain signal are weighted to generate a down-mixed right channel frequency domain signal.
In light of the above, according to the weight coefficients ω(k, n) and 1 - ω(k, n), the down-mixed left channel frequency domain signal can be represented as y_l(k, n) = ω(k, n)·x_h_l(k, n) + (1 - ω(k, n))·x_m_l(k, n), and the down-mixed right channel frequency domain signal can be represented as y_r(k, n) = ω(k, n)·x_h_r(k, n) + (1 - ω(k, n))·x_m_r(k, n).
It should be noted that, since different frequency bands have different effects on sound quality, and the method for generating the first binaural frequency domain signal and the method for generating the second binaural frequency domain signal have different effects on different frequencies, the weight coefficient ω may be related to the frequency k.
The first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) effectively preserves the sound quality, in particular that of high-frequency signals; the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)), obtained through head-related transmission model processing, has precise azimuth continuity and a sense of space. Thus, the resulting down-mixed two-channel frequency domain signal (y_l(k, n), y_r(k, n)) can achieve both good sound quality and a spatial rendering effect.
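The weighted mix of step S110 can be sketched per frequency bin as follows; the specific weight values are hypothetical and simply illustrate a frequency-dependent ω(k, n):

```python
# Sketch of step S110: per-bin weighted mix y = w*x_h + (1-w)*x_m.
# The weights below are made up to show a frequency-dependent omega.

def mix(x_h, x_m, w):
    """All arguments are equal-length lists indexed by frequency bin k."""
    return [wk * h + (1 - wk) * m for h, m, wk in zip(x_h, x_m, w)]

x_h = [1.0, 1.0, 1.0]   # HRTF-processed (spatial) signal
x_m = [0.0, 2.0, 4.0]   # coefficient-scaled (timbre-preserving) signal
w   = [1.0, 0.5, 0.0]   # favour the HRTF path at low k, timbre at high k
y = mix(x_h, x_m, w)
print(y)  # -> [1.0, 1.5, 4.0]
```

Because ω depends on k, the mix can lean on the HRTF path where spatial cues matter most and on the timbre-preserving path where sound quality matters most, which is exactly the trade-off the paragraph above describes.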
In step S112, frequency-to-time domain conversion is performed on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal, respectively, so as to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal correspondingly.
That is, corresponding to step S106, the down-mixed left channel frequency domain signal y_l(k, n) and the down-mixed right channel frequency domain signal y_r(k, n) in the frequency domain may be separately converted, based on an Inverse Fast Fourier Transform (IFFT), into the down-mixed left channel audio signal y_l(t) and the down-mixed right channel audio signal y_r(t) in the time domain, to support output in the headphone mode or the dual-speaker mode.
According to the audio signal down-mixing method provided by the embodiment of the invention, a two-channel audio signal with both good sound quality and a good spatial rendering effect can be obtained.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Regarding the determination of the weight coefficient ω(k, n) used in step S110, the present invention provides the following three embodiments for illustration; the weight coefficient in the method of the present invention is not limited to these three embodiments.
[One] The weight coefficient ω(k, n) is predetermined according to the moving speed of the sound source of the multi-channel audio signal x_in(t).
Specifically, the weight of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) is positively correlated with the moving speed of the sound source, and the weight of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is negatively correlated with it. In particular, when the sound source is a stationary sound source, ω(k, n) may be set to 0.
This scheme is generally applicable to scenarios where it is known whether the sound source is moving and at what speed. The faster the sound source moves, the closer ω(k, n) is set to 1, enhancing the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix and thereby preserving the azimuth continuity of the output signal.
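One simple realization of this speed-to-weight mapping is a clamped linear ramp; the ramp shape and the 10 m/s saturation speed are assumptions for illustration, not values given in the patent:

```python
# Sketch of scheme [One]: map sound-source speed to omega. A stationary
# source yields 0 (pure timbre path); a fast source saturates at 1
# (pure HRTF path). The linear ramp and v_max are assumed values.

def omega_from_speed(speed, v_max=10.0):
    return min(max(speed / v_max, 0.0), 1.0)

print(omega_from_speed(0.0))    # -> 0.0
print(omega_from_speed(5.0))    # -> 0.5
print(omega_from_speed(25.0))   # -> 1.0
```

Any monotonically increasing, clamped function of speed satisfies the correlations the patent states; the linear ramp is just the simplest choice.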
[Two] The weight coefficient ω(k, n) is determined by pre-training, based on a Convolutional Neural Network (CNN) model, on a multi-channel frequency domain signal corresponding to a multi-channel audio sample signal.
The convolutional neural network has feature-learning capability and can extract high-order features from a multi-channel frequency domain signal: the convolutional and pooling layers are invariant to translations of the input features, i.e., they can identify similar features located at different spatial positions. During training, a multi-channel frequency domain signal (e.g., in the form of a multi-channel spectrogram) obtained by converting a multi-channel audio sample signal is input to the convolutional neural network model, which outputs the trained weight coefficients ω(k, n) for the weighting process in step S110.
The scheme is generally suitable for scenes in which whether the sound source moves or the moving speed of the sound source cannot be predicted.
[Three] The weight coefficient ω(k, n) is obtained by judging whether the "primary channel" energy ratio of the multi-channel audio signal x_in(t) is prominent.
In view of the above, fig. 2 is a flowchart illustrating another audio signal downmixing method according to an exemplary embodiment. The difference from the method 10 of fig. 1 is that the method of fig. 2 further provides a specific method of determining the weighting factors, i.e. further provides an embodiment of the method 10. Likewise, the method of downmixing an audio signal as shown in fig. 2 may also be applied to the binaural processing procedure of a typical immersive 5.1 system, for example.
Referring to fig. 2, the method 10 further includes:
in step S202, a weight coefficient is determined according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues.
Specifically, in step S2022, when the ratio is greater than the preset threshold, it is determined that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
On the contrary, in step S2024, when the ratio is smaller than the preset threshold, it is determined that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is greater than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
In other words, for the multi-channel audio signal above, the corresponding covariance matrix is a 5-dimensional square matrix, each row of which represents one channel's features. The covariance matrix has 5 eigenvalues, and the channel corresponding to the row with the largest eigenvalue is the "primary channel".
The closer the ratio of the maximum eigenvalue to the sum of all eigenvalues is to 1, the more prominent the energy ratio of the primary channel; conversely, the closer this ratio is to 0, the more balanced the energy across channels and the less prominent the primary channel energy ratio.
In the present invention, when it is determined that the primary channel energy ratio is prominent (e.g., the ratio of the maximum eigenvalue to the sum of all eigenvalues is greater than 0.5), the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix may be enhanced, i.e., ω(k, n) is set between 0.5 and 1, and the closer the ratio is to 1, the closer ω(k, n) is set to 1; accordingly, the contribution of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is attenuated, with 1 - ω(k, n) between 0 and 0.5.
Conversely, when it is determined that the primary channel energy ratio is not prominent (e.g., the ratio of the maximum eigenvalue to the sum of all eigenvalues is less than 0.5), the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix may be attenuated, i.e., ω(k, n) is set between 0 and 0.5, and the closer the ratio is to 0, the closer ω(k, n) is set to 0; accordingly, the contribution of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is enhanced, with 1 - ω(k, n) between 0.5 and 1.
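The eigenvalue ratio can be computed without a full eigendecomposition: the sum of a covariance matrix's eigenvalues equals its trace, and the largest eigenvalue can be estimated by power iteration. The patent does not specify how the eigenvalues are computed, so this is one possible sketch; the 2-channel toy matrix, the linear ω mapping, and the 0.5 threshold (the latter taken from the example in the text) are illustrative:

```python
# Sketch of scheme [Three]: primary-channel ratio = lambda_max / trace.
# Power iteration estimates lambda_max; trace gives the eigenvalue sum.

def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def max_eigenvalue(m, iters=200):
    v = [1.0] * len(m)
    for _ in range(iters):
        w = matvec(m, v)
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    w = matvec(m, v)
    # Rayleigh quotient of the converged vector
    return sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)

def primary_channel_ratio(cov):
    return max_eigenvalue(cov) / sum(cov[i][i] for i in range(len(cov)))

# Toy 2-channel covariance: channel 0 clearly dominates.
cov = [[4.0, 0.0],
       [0.0, 1.0]]
ratio = primary_channel_ratio(cov)
# One mapping consistent with the ranges in the text (assumed form):
omega = 0.5 + 0.5 * ratio if ratio > 0.5 else 0.5 * ratio
print(round(ratio, 3))  # -> 0.8  (4 / (4 + 1))
```

Here ratio = 0.8 > 0.5, so ω lands in (0.5, 1), enhancing the HRTF-processed signal exactly as the prominent-primary-channel branch above prescribes.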
It should be noted that the present invention is not limited to the size of the preset threshold, and the preset threshold may be set in advance according to the number of channels of the input audio signal and the specific design requirement in practical applications.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 3 is a block diagram illustrating a downmixing apparatus of an audio signal according to an exemplary embodiment.
Referring to fig. 3, the apparatus 30 for downmixing audio signals includes: a signal acquisition module 302, a first processing module 304, a first conversion module 306, a second processing module 308, a third processing module 310, and a second conversion module 312.
The signal obtaining module 302 is configured to obtain a multi-channel audio signal.
The first processing module 304 is configured to multiply the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient, respectively, to obtain a left channel audio signal and a right channel audio signal.
The first conversion module 306 is configured to perform time-frequency domain conversion on the multi-channel audio signal, the left channel audio signal, and the right channel audio signal, and generate a multi-channel frequency domain signal, a first left channel frequency domain signal, and a first right channel frequency domain signal.
The second processing module 308 is configured to process the multi-channel frequency domain signal based on the left channel frequency domain response submodel and the right channel frequency domain response submodel in the head-related transmission model, respectively, to obtain a second left channel frequency domain signal and a second right channel frequency domain signal.
The third processing module 310 is configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to the weight coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal.
The second conversion module 312 is configured to perform frequency-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal, respectively, to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal correspondingly.
According to the audio signal downmixing apparatus provided by the embodiment of the invention, a two-channel audio signal with good sound quality and a good spatial rendering effect can be obtained.
It is noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
FIG. 4 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment. It should be noted that the computer device shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of the application of the embodiment of the present invention.
As shown in fig. 4, the computer apparatus 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the apparatus 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the apparatus of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a transmitting unit, an obtaining unit, a determining unit, and a first processing unit. The names of these units do not in some cases constitute a limitation to the unit itself, and for example, the sending unit may also be described as a "unit sending a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:
acquiring a multi-channel audio signal; respectively multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient correspondingly to obtain a left channel audio signal and a right channel audio signal; respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model respectively to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; according to the weight coefficient, carrying out weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and carrying out weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
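Putting the frequency-domain steps above together, the weighted combination for one frame can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the function name, array shapes, and the placeholder HRTF responses `H_l`/`H_r` are assumptions, and the time-frequency transform and head-related transmission model themselves are not reproduced here.

```python
import numpy as np

def downmix_frame(X, c_l, c_r, H_l, H_r, omega):
    """Hypothetical one-frame sketch of the frequency-domain downmix.

    X     : (channels, bins) multi-channel frequency-domain frame
    c_l/r : per-channel left/right conversion coefficients (assumed values)
    H_l/r : (channels, bins) left/right frequency-domain responses standing
            in for the head-related transmission submodels (placeholders)
    omega : weight for the HRTF-processed path; a scalar here, though per
            time-frequency tile it may be an array of shape (bins,)
    """
    # First binaural pair: coefficient-weighted sums (x_{m_l}, x_{m_r})
    x_m_l = (c_l[:, None] * X).sum(axis=0)
    x_m_r = (c_r[:, None] * X).sum(axis=0)
    # Second binaural pair: HRTF-processed sums (x_{h_l}, x_{h_r})
    x_h_l = (H_l * X).sum(axis=0)
    x_h_r = (H_r * X).sum(axis=0)
    # Weighted combination -> down-mixed stereo frequency-domain frame
    y_l = (1.0 - omega) * x_m_l + omega * x_h_l
    y_r = (1.0 - omega) * x_m_r + omega * x_h_r
    return y_l, y_r
```

Because the weighting is a per-bin linear combination, NumPy broadcasting handles either a scalar ω or an ω(k,n) array of shape (bins,) without changing the code.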
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of downmixing an audio signal, comprising:
acquiring a multi-channel audio signal;
correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal;
respectively performing time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal;
processing the multi-channel frequency domain signal respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal;
according to the weight coefficient, performing weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and performing weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and
and respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
2. The method of claim 1, wherein the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of each channel corresponding to the head-related transmission model.
3. The method according to claim 1 or 2, wherein the weight coefficients are predetermined according to a moving speed of a sound source of the multi-channel audio signal.
4. The method of claim 3, wherein when the sound source is a stationary sound source, the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal is 0.
5. The method according to claim 1 or 2, wherein the weight coefficients are determined by pre-training a multi-channel frequency domain signal corresponding to the multi-channel audio sample signal based on a convolutional neural network model.
6. The method of claim 1 or 2, further comprising: and determining the weight coefficient according to the ratio of the maximum eigenvalue of the covariance matrix corresponding to the multi-channel audio signal to the sum of all eigenvalues.
7. The method of claim 6, wherein determining the weight coefficient according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues comprises:
when the ratio is greater than a preset threshold value, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal;
when the ratio is smaller than the preset threshold, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is larger than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
8. An apparatus for downmixing an audio signal, comprising:
the signal acquisition module is used for acquiring a multi-channel audio signal;
the first processing module is used for correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal;
the first conversion module is used for respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal;
the second processing module is used for processing the multi-channel frequency domain signals respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal;
a third processing module, configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to a weighting coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and
and the second conversion module is used for respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
9. A computer device, comprising: memory, processor and executable instructions stored in the memory and executable in the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the executable instructions.
10. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, implement the method of any one of claims 1-7.
CN201911173782.8A 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium Active CN110853658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911173782.8A CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911173782.8A CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Publications (2)

Publication Number Publication Date
CN110853658A true CN110853658A (en) 2020-02-28
CN110853658B CN110853658B (en) 2021-12-07

Family

ID=69604505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911173782.8A Active CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN110853658B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654745A (en) * 2020-06-08 2020-09-11 海信视像科技股份有限公司 Multi-channel signal processing method and display device
CN112927701A (en) * 2021-02-05 2021-06-08 商汤集团有限公司 Sample generation method, neural network generation method, audio signal generation method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100763920B1 (en) * 2006-08-09 2007-10-05 삼성전자주식회사 Method and apparatus for decoding input signal which encoding multi-channel to mono or stereo signal to 2 channel binaural signal
US20070280485A1 (en) * 2006-06-02 2007-12-06 Lars Villemoes Binaural multi-channel decoder in the context of non-energy conserving upmix rules
US20080052089A1 (en) * 2004-06-14 2008-02-28 Matsushita Electric Industrial Co., Ltd. Acoustic Signal Encoding Device and Acoustic Signal Decoding Device
US20090046864A1 (en) * 2007-03-01 2009-02-19 Genaudio, Inc. Audio spatialization and environment simulation
CN101695151A (en) * 2009-10-12 2010-04-14 清华大学 Method and equipment for converting multi-channel audio signals into dual-channel audio signals
CN102172047A (en) * 2008-07-31 2011-08-31 弗劳恩霍夫应用研究促进协会 Signal generation for binaural signals
CN103026406A (en) * 2010-09-28 2013-04-03 华为技术有限公司 Device and method for postprocessing decoded multi-channel audio signal or decoded stereo signal
US20160198281A1 (en) * 2013-09-17 2016-07-07 Wilus Institute Of Standards And Technology Inc. Method and apparatus for processing audio signals
CN107040862A (en) * 2016-02-03 2017-08-11 腾讯科技(深圳)有限公司 Audio-frequency processing method and processing system
CN109644315A (en) * 2017-02-17 2019-04-16 无比的优声音科技公司 Apparatus and method for downmixing a multi-channel audio signal


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MINGSIAN R. BAI ET AL.: "Upmixing and Downmixing Two-Channel Stereo Audio for Consumer Electronics", 《IEEE TRANSACTIONS ON CONSUMER ELECTRONICS》 *
XINGWEI SUN, ET AL.: "An improved 5-2 channel downmix algorithm for 3D audio reproduction", 《ADVANCES IN INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING, SMART INNOVATION, SYSTEMS AND TECHNOLOGIES》 *
YONG-HYUN, ET AL.: "Efficient Primary-Ambient Decomposition Algorithm for Audio Upmix", 《JOURNAL OF BROADCAST ENGINEERING》 *
ZHANG JIANDONG ET AL.: "Three-Dimensional Sound Binaural Rendering and Its Evaluation", 《RADIO & TV BROADCAST ENGINEERING》 *
WANG FENG ET AL.: "Application of and Reflections on New Technologies in the Field of Digital Cinema", 《MODERN FILM TECHNOLOGY》 *


Also Published As

Publication number Publication date
CN110853658B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US10469978B2 (en) Audio signal processing method and device
US20180359587A1 (en) Audio signal processing method and apparatus
US20180213309A1 (en) Spatial Audio Processing Apparatus
CN110035376A (en) Come the acoustic signal processing method and device of ears rendering using phase response feature
US20220225051A1 (en) Signal processing device and method, and program
US11950063B2 (en) Apparatus, method and computer program for audio signal processing
WO2020034779A1 (en) Audio processing method, storage medium and electronic device
US9264838B2 (en) System and method for variable decorrelation of audio signals
CN110853658B (en) Method and apparatus for downmixing audio signal, computer device, and readable storage medium
CN114203163A (en) Audio signal processing method and device
CN114503606A (en) Audio processing
US10057702B2 (en) Audio signal processing apparatus and method for modifying a stereo image of a stereo signal
JP6486351B2 (en) Acoustic spatialization using spatial effects
KR20210071972A (en) Signal processing apparatus and method, and program
US11445324B2 (en) Audio rendering method and apparatus
US11012802B2 (en) Computing system for binaural ambisonics decoding
CN117896666A (en) Method for playback of audio data, electronic device and storage medium
Song et al. An Efficient Method Using the Parameterized HRTFs for 3D Audio Real-Time Rendering on Mobile Devices
CN117351978A (en) Method for determining audio masking model and audio masking method
CN114783450A (en) Audio processing method, device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100086 Beijing city Haidian District Shuangyushu Academy Road No. 44

Patentee after: China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)

Address before: 100086 Beijing city Haidian District Shuangyushu Academy Road No. 44

Patentee before: CHINA FILM SCIENCE AND TECHNOLOGY INST.