CN110853658A - Method and apparatus for downmixing audio signal, computer device, and readable storage medium - Google Patents

Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Info

Publication number
CN110853658A
CN110853658A (application CN201911173782.8A)
Authority
CN
China
Prior art keywords
frequency domain
channel frequency
domain signal
signal
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911173782.8A
Other languages
Chinese (zh)
Other versions
CN110853658B (en)
Inventor
王薇娜
高五峰
董强国
孙学京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)
Original Assignee
CHINA FILM SCIENCE AND TECHNOLOGY INST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA FILM SCIENCE AND TECHNOLOGY INST filed Critical CHINA FILM SCIENCE AND TECHNOLOGY INST
Priority to CN201911173782.8A priority Critical patent/CN110853658B/en
Publication of CN110853658A publication Critical patent/CN110853658A/en
Application granted granted Critical
Publication of CN110853658B publication Critical patent/CN110853658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 1/00 Two-channel systems
    • H04S 1/007 Two-channel systems in which the audio signals are in digital form
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a method and a device for down-mixing an audio signal, computer equipment and a readable storage medium. The method comprises the following steps: multiplying the multi-channel audio signal by a two-channel conversion coefficient to obtain a left-channel audio signal and a right-channel audio signal; respectively converting the multi-channel audio signal, the left channel audio signal and the right channel audio signal to generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal based on a head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; weighting the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and weighting the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and converting the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.

Description

Method and apparatus for downmixing audio signal, computer device, and readable storage medium
Technical Field
The present invention relates to the field of audio processing, and in particular, to a method and an apparatus for downmixing an audio signal, a computer device, and a computer-readable storage medium.
Background
In recent years, with the upgrading of high-definition video technology from 2K to 4K and even 8K, and the development of VR (Virtual Reality) and AR (Augmented Reality), people's expectations for audio quality have gradually risen. Immersive audio systems with 5.1, 7.1, or even more channels have begun to emerge in large numbers.
With the rapid development of the mobile internet, more and more users choose to experience audio content through earphones. There is therefore a need to convert multi-channel audio content into a two-channel or stereo format (i.e., downmix processing) to accommodate playback over headphones or dual speakers. However, current down-mixing techniques are not mature, and the resulting two-channel audio rarely achieves both good sound quality and a good spatial rendering effect.
It is to be noted that the above information disclosed in the background section is only for enhancement of understanding of the background of the invention, and it may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present invention provides a method, an apparatus, a computer device and a computer readable storage medium for audio signal downmixing.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present invention, there is provided a downmix method of an audio signal, including: acquiring a multi-channel audio signal; correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal; respectively performing time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; according to the weight coefficient, performing weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and performing weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and respectively performing frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
According to an embodiment of the present invention, the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of each channel corresponding to the head-related transmission model.
According to an embodiment of the present invention, the weight coefficient is predetermined according to a moving speed of a sound source of the multi-channel audio signal.
According to an embodiment of the present invention, when the sound source is a stationary sound source, the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal is 0.
According to an embodiment of the present invention, the weight coefficients are determined by pre-training a multi-channel frequency domain signal corresponding to the multi-channel audio sample signal based on a convolutional neural network model.
According to an embodiment of the invention, the method further comprises: and determining the weight coefficient according to the ratio of the maximum eigenvalue of the covariance matrix corresponding to the multi-channel audio signal to the sum of all eigenvalues.
According to an embodiment of the present invention, determining the weight coefficient according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues includes: when the ratio is greater than a preset threshold value, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal; when the ratio is smaller than the preset threshold, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is larger than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
According to another aspect of the present invention, there is provided a down-mixing apparatus of an audio signal, including: the signal acquisition module is used for acquiring a multi-channel audio signal; the first processing module is used for correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal; the first conversion module is used for respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; the second processing module is used for processing the multi-channel frequency domain signals respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; a third processing module, configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to a weighting coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and the second conversion module is used for respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
According to still another aspect of the present invention, there is provided a computer apparatus comprising: a memory, a processor, and executable instructions stored in the memory and executable by the processor, wherein the processor implements any of the above-mentioned audio signal downmixing methods when executing the executable instructions.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement any one of the above-described methods of downmixing an audio signal.
According to the audio signal down-mixing method provided by the invention, a two-channel audio signal with both good sound quality and a good spatial rendering effect can be obtained.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart illustrating a method of downmixing an audio signal according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating another audio signal downmixing method according to an exemplary embodiment.
Fig. 3 is a block diagram illustrating a downmixing apparatus of an audio signal according to an exemplary embodiment.
FIG. 4 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
Fig. 5 is a schematic diagram of a multi-channel audio system shown in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, apparatus, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
As described above, in order to solve the problem of poor sound quality or spatial misalignment of the downmixed binaural audio, the present invention provides a new audio signal downmixing method. The following specifically describes embodiments of the present invention.
Fig. 1 is a flowchart illustrating a method of downmixing an audio signal according to an exemplary embodiment. The method of downmixing an audio signal as shown in fig. 1 may be applied, for example, to a binauralization (binauralization) process of a typical immersive 5.1 system.
As shown in fig. 5, the audio signal output by a typical immersive 5.1 system includes: a left channel (l) audio signal, a right channel (r) audio signal, a center channel (c) audio signal, a left surround channel (ls) audio signal, and a right surround channel (rs) audio signal. The azimuth angle of the left channel audio signal is -30 degrees, the azimuth angle of the right channel audio signal is 30 degrees, the azimuth angle of the center channel audio signal is 0 degrees, the azimuth angle of the left surround channel audio signal is -110 degrees, and the azimuth angle of the right surround channel audio signal is 110 degrees.
Referring to fig. 1, a method 10 of downmixing an audio signal includes:
in step S102, a multi-channel audio signal is acquired.
A typical immersive 5.1 system is used as an example for explanation: the acquired multi-channel audio signal may be represented as x_in(t) = [x_in_l(t), x_in_r(t), x_in_c(t), x_in_ls(t), x_in_rs(t)]^T, where x_in_l(t), x_in_r(t), x_in_c(t), x_in_ls(t), and x_in_rs(t) are the left channel, right channel, center channel, left surround channel, and right surround channel audio signals, respectively.
In step S104, the multi-channel audio signal is correspondingly multiplied by a preset left channel conversion coefficient and a preset right channel conversion coefficient, respectively, to obtain a left channel audio signal and a right channel audio signal.
In some embodiments, the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of a Head Related Transfer Function (HRTF) model corresponding to each channel.
Corresponding to the above-mentioned 5 channels, the left channel conversion coefficient may be expressed as α_l = [α_l_l, α_r_l, α_c_l, α_ls_l, α_rs_l]^T, and the right channel conversion coefficient may be expressed as α_r = [α_l_r, α_r_r, α_c_r, α_ls_r, α_rs_r]^T.
In light of the above, the obtained left channel audio signal can be expressed as x_m_l(t) = [x_in_l(t)·α_l_l, x_in_r(t)·α_r_l, x_in_c(t)·α_c_l, x_in_ls(t)·α_ls_l, x_in_rs(t)·α_rs_l]^T, and the obtained right channel audio signal can be expressed as x_m_r(t) = [x_in_l(t)·α_l_r, x_in_r(t)·α_r_r, x_in_c(t)·α_c_r, x_in_ls(t)·α_ls_r, x_in_rs(t)·α_rs_r]^T.
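As a minimal sketch of this per-channel scaling in step S104, the following scales each input channel by its conversion coefficient; the sample values and α coefficients below are made up for illustration (the patent derives the coefficients from HRTF filtering damping):

```python
# Sketch of step S104: scale each of the 5 input channels by its
# left (or right) conversion coefficient alpha. All numbers here
# are illustrative, not from the patent.

def scale_channels(x_in, alpha):
    """x_in: list of per-channel sample lists; alpha: matching coefficients."""
    return [[s * a for s in ch] for ch, a in zip(x_in, alpha)]

# Two samples per channel, channel order: l, r, c, ls, rs.
x_in = [[1.0, 2.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.2], [0.3, 0.1]]
alpha_l = [1.0, 0.2, 0.7, 0.8, 0.1]   # hypothetical left coefficients

x_m_l = scale_channels(x_in, alpha_l)
print(x_m_l[0])  # -> [1.0, 2.0]
```

The same call with α_r would produce x_m_r(t); the result is still a 5-channel vector, matching the patent's notation.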
In step S106, time-frequency domain conversion is performed on the multi-channel audio signal, the left channel audio signal, and the right channel audio signal, respectively, so as to generate a multi-channel frequency domain signal, a first left channel frequency domain signal, and a first right channel frequency domain signal.
For example, based on a Fast Fourier Transform (FFT) algorithm, the multi-channel audio signal x_in(t), the left channel audio signal x_m_l(t), and the right channel audio signal x_m_r(t) in the time domain may be separately converted into the multi-channel frequency domain signal x_in(k, n), the first left channel frequency domain signal x_m_l(k, n), and the first right channel frequency domain signal x_m_r(k, n) in the frequency domain, where k and n represent the frequency and time, respectively, of the discrete domain.
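To make the time-to-frequency conversion of step S106 concrete, here is a naive DFT (the FFT computes the same result faster) applied to one frame of one channel; a real implementation would use an optimized FFT library:

```python
# Sketch of step S106: convert a time-domain frame x(t) into its
# frequency-domain bins x(k). A naive O(N^2) DFT stands in for the
# FFT named in the patent; both compute the same transform.
import cmath

def dft(frame):
    N = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N))
            for k in range(N)]

# A unit impulse has a flat spectrum: every bin has magnitude 1.
spectrum = dft([1.0, 0.0, 0.0, 0.0])
print([round(abs(c), 6) for c in spectrum])  # -> [1.0, 1.0, 1.0, 1.0]
```

Applying this per frame n yields the x(k, n) indexing used in the patent, with k the frequency bin and n the frame (time) index.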
In step S108, the multi-channel frequency domain signal is processed based on the left channel frequency domain response submodel and the right channel frequency domain response submodel in the head-related transmission model, respectively, to obtain a second left channel frequency domain signal and a second right channel frequency domain signal.
The response function of the left channel frequency domain response submodel in the head-related transmission model may be represented as h_l(k, n) = [h_l_l(k, n), h_r_l(k, n), h_c_l(k, n), h_ls_l(k, n), h_rs_l(k, n)]^T, and the response function of the right channel frequency domain response submodel may be represented as h_r(k, n) = [h_l_r(k, n), h_r_r(k, n), h_c_r(k, n), h_ls_r(k, n), h_rs_r(k, n)]^T. Thus, the processed second left channel frequency domain signal can be represented as x_h_l(k, n) = [x_in_l(k, n)·h_l_l(k, n), x_in_r(k, n)·h_r_l(k, n), x_in_c(k, n)·h_c_l(k, n), x_in_ls(k, n)·h_ls_l(k, n), x_in_rs(k, n)·h_rs_l(k, n)]^T, and the second right channel frequency domain signal as x_h_r(k, n) = [x_in_l(k, n)·h_l_r(k, n), x_in_r(k, n)·h_r_r(k, n), x_in_c(k, n)·h_c_r(k, n), x_in_ls(k, n)·h_ls_r(k, n), x_in_rs(k, n)·h_rs_r(k, n)]^T.
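In the frequency domain, applying the response submodel of step S108 reduces to a per-bin complex multiplication of each channel spectrum by that channel's response h. A toy sketch, with entirely hypothetical response values rather than measured HRTF data:

```python
# Sketch of step S108: multiply each channel's spectrum, bin by bin,
# by that channel's frequency-domain response. Values are illustrative.

def apply_response(channel_spectra, responses):
    """channel_spectra: list of per-channel spectra (lists of complex);
    responses: matching per-channel response spectra."""
    return [[x * h for x, h in zip(spec, resp)]
            for spec, resp in zip(channel_spectra, responses)]

spectra = [[1 + 0j, 2 + 0j]]   # one channel, two frequency bins
resp = [[0.5 + 0j, 0 + 1j]]    # hypothetical h values for that channel
x_h = apply_response(spectra, resp)
print(x_h[0])  # -> [(0.5+0j), 2j]
```

Filtering in the frequency domain this way is equivalent to convolving with the head-related impulse response in the time domain, which is why the patent performs this step after the FFT.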
In step S110, the first left channel frequency domain signal and the second left channel frequency domain signal are weighted according to the weight coefficient to generate a down-mixed left channel frequency domain signal, and the first right channel frequency domain signal and the second right channel frequency domain signal are weighted to generate a down-mixed right channel frequency domain signal.
In light of the above, according to the weight coefficients ω(k, n) and 1 - ω(k, n), the down-mixed left channel frequency domain signal can be represented as y_l(k, n) = ω(k, n)·x_h_l(k, n) + (1 - ω(k, n))·x_m_l(k, n), and the down-mixed right channel frequency domain signal can be represented as y_r(k, n) = ω(k, n)·x_h_r(k, n) + (1 - ω(k, n))·x_m_r(k, n).
It should be noted that, since different frequency bands have different effects on sound quality, and the method for generating the first binaural frequency domain signal and the method for generating the second binaural frequency domain signal have different effects on different frequencies, the weight coefficient ω may be related to the frequency k.
The first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) effectively preserves the sound quality, in particular that of high-frequency signals; the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)), obtained through head-related transmission model processing, has precise azimuth continuity and a sense of space. Thus, the resulting down-mixed two-channel frequency domain signal (y_l(k, n), y_r(k, n)) can achieve both good sound quality and a spatial rendering effect.
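The weighted mix of step S110 can be sketched per frequency bin as follows; the specific weight values are hypothetical and simply illustrate a frequency-dependent ω(k, n):

```python
# Sketch of step S110: per-bin weighted mix y = w*x_h + (1-w)*x_m.
# The weights below are made up to show a frequency-dependent omega.

def mix(x_h, x_m, w):
    """All arguments are equal-length lists indexed by frequency bin k."""
    return [wk * h + (1 - wk) * m for h, m, wk in zip(x_h, x_m, w)]

x_h = [1.0, 1.0, 1.0]   # HRTF-processed (spatial) signal
x_m = [0.0, 2.0, 4.0]   # coefficient-scaled (timbre-preserving) signal
w   = [1.0, 0.5, 0.0]   # favour the HRTF path at low k, timbre at high k
y = mix(x_h, x_m, w)
print(y)  # -> [1.0, 1.5, 4.0]
```

Because ω depends on k, the mix can lean on the HRTF path where spatial cues matter most and on the timbre-preserving path where sound quality matters most, which is exactly the trade-off the paragraph above describes.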
In step S112, frequency-to-time domain conversion is performed on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal, respectively, so as to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal correspondingly.
That is, corresponding to step S106, the down-mixed left channel frequency domain signal y_l(k, n) and the down-mixed right channel frequency domain signal y_r(k, n) in the frequency domain may be separately converted, based on an Inverse Fast Fourier Transform (IFFT), into the down-mixed left channel audio signal y_l(t) and the down-mixed right channel audio signal y_r(t) in the time domain, to support output in the headphone mode or the dual-speaker mode.
According to the audio signal down-mixing method provided by the embodiment of the invention, a two-channel audio signal with both good sound quality and a good spatial rendering effect can be obtained.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Regarding the determination of the weight coefficient ω(k, n) used in step S110, the present invention provides the following three embodiments for illustration; the weight coefficient in the method of the present invention is not limited to these three embodiments.
[One] The weight coefficient ω(k, n) is predetermined according to the moving speed of the sound source of the multi-channel audio signal x_in(t).
Specifically, the weight of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) is positively correlated with the moving speed of the sound source, and the weight of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is negatively correlated with it. In particular, when the sound source is a stationary sound source, ω(k, n) may be set to 0.
This scheme is generally applicable to scenarios where it is known whether the sound source is moving and at what speed. The faster the sound source moves, the closer ω(k, n) is set to 1, enhancing the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix and thereby preserving the azimuth continuity of the output signal.
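One simple realization of this speed-to-weight mapping is a clamped linear ramp; the ramp shape and the 10 m/s saturation speed are assumptions for illustration, not values given in the patent:

```python
# Sketch of scheme [One]: map sound-source speed to omega. A stationary
# source yields 0 (pure timbre path); a fast source saturates at 1
# (pure HRTF path). The linear ramp and v_max are assumed values.

def omega_from_speed(speed, v_max=10.0):
    return min(max(speed / v_max, 0.0), 1.0)

print(omega_from_speed(0.0))    # -> 0.0
print(omega_from_speed(5.0))    # -> 0.5
print(omega_from_speed(25.0))   # -> 1.0
```

Any monotonically increasing, clamped function of speed satisfies the correlations the patent states; the linear ramp is just the simplest choice.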
[Two] The weight coefficient ω(k, n) is determined by pre-training, based on a Convolutional Neural Network (CNN) model, on a multi-channel frequency domain signal corresponding to a multi-channel audio sample signal.
The convolutional neural network has feature-learning capability and can extract high-order features from a multi-channel frequency domain signal: the convolutional and pooling layers are invariant to translations of the input features, i.e., they can identify similar features located at different spatial positions. During training, a multi-channel frequency domain signal (e.g., in the form of a multi-channel spectrogram) obtained by converting a multi-channel audio sample signal is input to the convolutional neural network model, which outputs the trained weight coefficients ω(k, n) for the weighting process in step S110.
The scheme is generally suitable for scenes in which whether the sound source moves or the moving speed of the sound source cannot be predicted.
[Three] The weight coefficient ω(k, n) is obtained by judging whether the "primary channel" energy ratio of the multi-channel audio signal x_in(t) is prominent.
In view of the above, fig. 2 is a flowchart illustrating another audio signal downmixing method according to an exemplary embodiment. The difference from the method 10 of fig. 1 is that the method of fig. 2 further provides a specific method of determining the weighting factors, i.e. further provides an embodiment of the method 10. Likewise, the method of downmixing an audio signal as shown in fig. 2 may also be applied to the binaural processing procedure of a typical immersive 5.1 system, for example.
Referring to fig. 2, the method 10 further includes:
in step S202, a weight coefficient is determined according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues.
Specifically, in step S2022, when the ratio is greater than the preset threshold, it is determined that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
On the contrary, in step S2024, when the ratio is smaller than the preset threshold, it is determined that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is greater than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
In other words, for the multi-channel audio signal above, the corresponding covariance matrix is a 5-dimensional square matrix, each row of which represents one channel's features. The covariance matrix has 5 eigenvalues, and the channel corresponding to the row with the largest eigenvalue is the "primary channel".
The closer the ratio of the maximum eigenvalue to the sum of all eigenvalues is to 1, the more prominent the energy ratio of the primary channel; conversely, the closer this ratio is to 0, the more balanced the energy across channels and the less prominent the primary channel energy ratio.
In the present invention, when it is determined that the primary channel energy ratio is prominent (e.g., the ratio of the maximum eigenvalue to the sum of all eigenvalues is greater than 0.5), the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix may be enhanced, i.e., ω(k, n) is set between 0.5 and 1, and the closer the ratio is to 1, the closer ω(k, n) is set to 1; accordingly, the contribution of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is attenuated, with 1 - ω(k, n) between 0 and 0.5.
Conversely, when it is determined that the primary channel energy ratio is not prominent (e.g., the ratio of the maximum eigenvalue to the sum of all eigenvalues is less than 0.5), the contribution of the second binaural frequency domain signal (x_h_l(k, n), x_h_r(k, n)) to the downmix may be attenuated, i.e., ω(k, n) is set between 0 and 0.5, and the closer the ratio is to 0, the closer ω(k, n) is set to 0; accordingly, the contribution of the first binaural frequency domain signal (x_m_l(k, n), x_m_r(k, n)) is enhanced, with 1 - ω(k, n) between 0.5 and 1.
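The eigenvalue ratio can be computed without a full eigendecomposition: the sum of a covariance matrix's eigenvalues equals its trace, and the largest eigenvalue can be estimated by power iteration. The patent does not specify how the eigenvalues are computed, so this is one possible sketch; the 2-channel toy matrix, the linear ω mapping, and the 0.5 threshold (the latter taken from the example in the text) are illustrative:

```python
# Sketch of scheme [Three]: primary-channel ratio = lambda_max / trace.
# Power iteration estimates lambda_max; trace gives the eigenvalue sum.

def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def max_eigenvalue(m, iters=200):
    v = [1.0] * len(m)
    for _ in range(iters):
        w = matvec(m, v)
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    w = matvec(m, v)
    # Rayleigh quotient of the converged vector
    return sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)

def primary_channel_ratio(cov):
    return max_eigenvalue(cov) / sum(cov[i][i] for i in range(len(cov)))

# Toy 2-channel covariance: channel 0 clearly dominates.
cov = [[4.0, 0.0],
       [0.0, 1.0]]
ratio = primary_channel_ratio(cov)
# One mapping consistent with the ranges in the text (assumed form):
omega = 0.5 + 0.5 * ratio if ratio > 0.5 else 0.5 * ratio
print(round(ratio, 3))  # -> 0.8  (4 / (4 + 1))
```

Here ratio = 0.8 > 0.5, so ω lands in (0.5, 1), enhancing the HRTF-processed signal exactly as the prominent-primary-channel branch above prescribes.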
It should be noted that the present invention is not limited to the size of the preset threshold, and the preset threshold may be set in advance according to the number of channels of the input audio signal and the specific design requirement in practical applications.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 3 is a block diagram illustrating a downmixing apparatus of an audio signal according to an exemplary embodiment.
Referring to fig. 3, the apparatus 30 for downmixing audio signals includes: a signal acquisition module 302, a first processing module 304, a first conversion module 306, a second processing module 308, a third processing module 310, and a second conversion module 312.
The signal obtaining module 302 is configured to obtain a multi-channel audio signal.
The first processing module 304 is configured to multiply the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient, respectively, to obtain a left channel audio signal and a right channel audio signal.
The first conversion module 306 is configured to perform time-frequency domain conversion on the multi-channel audio signal, the left channel audio signal, and the right channel audio signal, and generate a multi-channel frequency domain signal, a first left channel frequency domain signal, and a first right channel frequency domain signal.
The second processing module 308 is configured to process the multi-channel frequency domain signal based on the left channel frequency domain response submodel and the right channel frequency domain response submodel in the head-related transmission model, respectively, to obtain a second left channel frequency domain signal and a second right channel frequency domain signal.
The third processing module 310 is configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to the weight coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal.
The second conversion module 312 is configured to perform frequency-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal, respectively, to generate a down-mixed left channel audio signal and a down-mixed right channel audio signal correspondingly.
According to the audio signal downmixing apparatus provided by the embodiment of the invention, a two-channel audio signal with good sound quality and a good spatial rendering effect can be obtained.
It is noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
FIG. 4 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment. It should be noted that the computer device shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of the application of the embodiment of the present invention.
As shown in fig. 4, the computer apparatus 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the apparatus 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the apparatus of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a transmitting unit, an obtaining unit, a determining unit, and a first processing unit. The names of these units do not in some cases constitute a limitation to the unit itself, and for example, the sending unit may also be described as a "unit sending a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:
acquiring a multi-channel audio signal; respectively multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient correspondingly to obtain a left channel audio signal and a right channel audio signal; respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal; processing the multi-channel frequency domain signal based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model respectively to obtain a second left channel frequency domain signal and a second right channel frequency domain signal; according to the weight coefficient, carrying out weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and carrying out weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
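Putting the frequency-domain steps above together, the weighted combination for one frame can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the function name, array shapes, and the placeholder HRTF responses `H_l`/`H_r` are assumptions, and the time-frequency transform and head-related transmission model themselves are not reproduced here.

```python
import numpy as np

def downmix_frame(X, c_l, c_r, H_l, H_r, omega):
    """Hypothetical one-frame sketch of the frequency-domain downmix.

    X     : (channels, bins) multi-channel frequency-domain frame
    c_l/r : per-channel left/right conversion coefficients (assumed values)
    H_l/r : (channels, bins) left/right frequency-domain responses standing
            in for the head-related transmission submodels (placeholders)
    omega : weight for the HRTF-processed path; a scalar here, though per
            time-frequency tile it may be an array of shape (bins,)
    """
    # First binaural pair: coefficient-weighted sums (x_{m_l}, x_{m_r})
    x_m_l = (c_l[:, None] * X).sum(axis=0)
    x_m_r = (c_r[:, None] * X).sum(axis=0)
    # Second binaural pair: HRTF-processed sums (x_{h_l}, x_{h_r})
    x_h_l = (H_l * X).sum(axis=0)
    x_h_r = (H_r * X).sum(axis=0)
    # Weighted combination -> down-mixed stereo frequency-domain frame
    y_l = (1.0 - omega) * x_m_l + omega * x_h_l
    y_r = (1.0 - omega) * x_m_r + omega * x_h_r
    return y_l, y_r
```

Because the weighting is a per-bin linear combination, NumPy broadcasting handles either a scalar ω or an ω(k,n) array of shape (bins,) without changing the code.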
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of downmixing an audio signal, comprising:
acquiring a multi-channel audio signal;
correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal;
respectively performing time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal;
processing the multi-channel frequency domain signal respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal;
according to the weight coefficient, performing weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal to generate a down-mixed left channel frequency domain signal, and performing weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and
and respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
2. The method of claim 1, wherein the left channel conversion coefficient and the right channel conversion coefficient are filtering damping coefficients of each channel corresponding to the head-related transmission model.
3. The method according to claim 1 or 2, wherein the weight coefficients are predetermined according to a moving speed of a sound source of the multi-channel audio signal.
4. The method of claim 3, wherein when the sound source is a stationary sound source, the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal is 0.
5. The method according to claim 1 or 2, wherein the weight coefficients are determined by pre-training a multi-channel frequency domain signal corresponding to the multi-channel audio sample signal based on a convolutional neural network model.
6. The method of claim 1 or 2, further comprising: and determining the weight coefficient according to the ratio of the maximum eigenvalue of the covariance matrix corresponding to the multi-channel audio signal to the sum of all eigenvalues.
7. The method of claim 6, wherein determining the weight coefficient according to a ratio of a maximum eigenvalue of a covariance matrix corresponding to the multi-channel audio signal to a sum of all eigenvalues comprises:
when the ratio is greater than a preset threshold value, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is smaller than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal;
when the ratio is smaller than the preset threshold, determining that the weight coefficient corresponding to the first left channel frequency domain signal and the first right channel frequency domain signal is larger than the weight coefficient corresponding to the second left channel frequency domain signal and the second right channel frequency domain signal.
8. An apparatus for downmixing an audio signal, comprising:
the signal acquisition module is used for acquiring a multi-channel audio signal;
the first processing module is used for correspondingly multiplying the multi-channel audio signal by a preset left channel conversion coefficient and a preset right channel conversion coefficient respectively to obtain a left channel audio signal and a right channel audio signal;
the first conversion module is used for respectively carrying out time domain-frequency domain conversion on the multi-channel audio signal, the left channel audio signal and the right channel audio signal to correspondingly generate a multi-channel frequency domain signal, a first left channel frequency domain signal and a first right channel frequency domain signal;
the second processing module is used for processing the multi-channel frequency domain signals respectively based on a left channel frequency domain response submodel and a right channel frequency domain response submodel in the head-related transmission model to obtain a second left channel frequency domain signal and a second right channel frequency domain signal;
a third processing module, configured to perform weighting processing on the first left channel frequency domain signal and the second left channel frequency domain signal according to a weighting coefficient to generate a down-mixed left channel frequency domain signal, and perform weighting processing on the first right channel frequency domain signal and the second right channel frequency domain signal to generate a down-mixed right channel frequency domain signal; and
and the second conversion module is used for respectively carrying out frequency domain-time domain conversion on the down-mixed left channel frequency domain signal and the down-mixed right channel frequency domain signal to correspondingly generate a down-mixed left channel audio signal and a down-mixed right channel audio signal.
9. A computer device, comprising: memory, processor and executable instructions stored in the memory and executable in the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the executable instructions.
10. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, implement the method of any one of claims 1-7.
CN201911173782.8A 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium Active CN110853658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911173782.8A CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911173782.8A CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Publications (2)

Publication Number Publication Date
CN110853658A true CN110853658A (en) 2020-02-28
CN110853658B CN110853658B (en) 2021-12-07

Family

ID=69604505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911173782.8A Active CN110853658B (en) 2019-11-26 2019-11-26 Method and apparatus for downmixing audio signal, computer device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN110853658B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654745A (en) * 2020-06-08 2020-09-11 海信视像科技股份有限公司 Multi-channel signal processing method and display device
CN112927701A (en) * 2021-02-05 2021-06-08 商汤集团有限公司 Sample generation method, neural network generation method, audio signal generation method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100763920B1 (en) * 2006-08-09 2007-10-05 삼성전자주식회사 Method and apparatus for decoding input signal which encoding multi-channel to mono or stereo signal to 2 channel binaural signal
US20070280485A1 (en) * 2006-06-02 2007-12-06 Lars Villemoes Binaural multi-channel decoder in the context of non-energy conserving upmix rules
US20080052089A1 (en) * 2004-06-14 2008-02-28 Matsushita Electric Industrial Co., Ltd. Acoustic Signal Encoding Device and Acoustic Signal Decoding Device
US20090046864A1 (en) * 2007-03-01 2009-02-19 Genaudio, Inc. Audio spatialization and environment simulation
CN101695151A (en) * 2009-10-12 2010-04-14 清华大学 Method and equipment for converting multi-channel audio signals into dual-channel audio signals
CN102172047A (en) * 2008-07-31 2011-08-31 弗劳恩霍夫应用研究促进协会 Signal generation for binaural signals
CN103026406A (en) * 2010-09-28 2013-04-03 华为技术有限公司 Device and method for postprocessing decoded multi-channel audio signal or decoded stereo signal
US20160198281A1 (en) * 2013-09-17 2016-07-07 Wilus Institute Of Standards And Technology Inc. Method and apparatus for processing audio signals
CN107040862A (en) * 2016-02-03 2017-08-11 腾讯科技(深圳)有限公司 Audio-frequency processing method and processing system
CN109644315A (en) * 2017-02-17 2019-04-16 无比的优声音科技公司 Apparatus and method for downmixing a multi-channel audio signal


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MINGSIAN R. BAI ET AL.: "Upmixing and Downmixing Two-Channel Stereo Audio for Consumer Electronics", 《IEEE TRANSACTIONS ON CONSUMER ELECTRONICS》 *
XINGWEI SUN, ET AL.: "An improved 5-2 channel downmix algorithm for 3D audio reproduction", 《ADVANCES IN INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING, SMART INNOVATION, SYSTEMS AND TECHNOLOGIES》 *
YONG-HYUN, ET AL.: "Efficient Primary-Ambient Decomposition Algorithm for Audio Upmix", 《JOURNAL OF BROADCAST ENGINEERING》 *
ZHANG JIANDONG ET AL.: "Three-Dimensional Sound Binaural Rendering and Its Evaluation", 《RADIO & TV BROADCAST ENGINEERING》 *
WANG FENG ET AL.: "Application of and Reflections on New Technologies in the Field of Digital Cinema", 《MODERN FILM TECHNOLOGY》 *


Also Published As

Publication number Publication date
CN110853658B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US10469978B2 (en) Audio signal processing method and device
US20180359587A1 (en) Audio signal processing method and apparatus
US20180213309A1 (en) Spatial Audio Processing Apparatus
CN110035376A (en) Come the acoustic signal processing method and device of ears rendering using phase response feature
US20220225051A1 (en) Signal processing device and method, and program
US11950063B2 (en) Apparatus, method and computer program for audio signal processing
WO2020034779A1 (en) Audio processing method, storage medium and electronic device
US9264838B2 (en) System and method for variable decorrelation of audio signals
CN110853658B (en) Method and apparatus for downmixing audio signal, computer device, and readable storage medium
CN114203163A (en) Audio signal processing method and device
CN114503606A (en) Audio processing
US10057702B2 (en) Audio signal processing apparatus and method for modifying a stereo image of a stereo signal
JP6486351B2 (en) Acoustic spatialization using spatial effects
KR20210071972A (en) Signal processing apparatus and method, and program
US11445324B2 (en) Audio rendering method and apparatus
US11012802B2 (en) Computing system for binaural ambisonics decoding
CN117896666A (en) Method for playback of audio data, electronic device and storage medium
Song et al. An Efficient Method Using the Parameterized HRTFs for 3D Audio Real-Time Rendering on Mobile Devices
CN117351978A (en) Method for determining audio masking model and audio masking method
CN114783450A (en) Audio processing method, device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100086 Beijing city Haidian District Shuangyushu Academy Road No. 44

Patentee after: China Film Science and Technology Research Institute (Film Technology Quality Inspection Institute of the Central Propaganda Department)

Address before: 100086 Beijing city Haidian District Shuangyushu Academy Road No. 44

Patentee before: CHINA FILM SCIENCE AND TECHNOLOGY INST.