CN112205006B - Adaptive remixing of audio content - Google Patents

Adaptive remixing of audio content

Info

Publication number
CN112205006B
CN112205006B (application CN201980036214.5A)
Authority
CN
China
Prior art keywords
separation
signal
evaluation
audio
adaptive
Prior art date
Legal status
Active
Application number
CN201980036214.5A
Other languages
Chinese (zh)
Other versions
CN112205006A (en)
Inventor
Stefan Uhlich
Franck Giron
Michael Enenkl
Thomas Kemp
Current Assignee
Sony Corp
Original Assignee
Sony Corp
Priority date
Filing date
Publication date
Application filed by Sony Corp filed Critical Sony Corp
Publication of CN112205006A
Application granted
Publication of CN112205006B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11Positioning of individual sound objects, e.g. moving airplane, within a sound field

Abstract

An electronic device, comprising: an audio source separation unit (201) configured to determine a separation (2) from an input signal (1) based on audio source separation; an evaluation unit (203) configured to determine an evaluation result (3) of audio source separation from the separation (2) and the input signal (1) based on machine learning; and an adaptive remix/upmix unit (202) configured to determine the output signal (4) based on the separation (2) and based on the evaluation result (3).

Description

Adaptive remixing of audio content
Technical Field
The present disclosure relates generally to the field of audio processing, and in particular to a method and apparatus for audio source separation and adaptive up-mixing/remixing.
Background
A large amount of audio content is available, for example in the form of compact discs (CDs), tapes, audio data files downloadable from the internet, but also in the form of soundtracks of videos, e.g. stored on digital video discs or the like. Typically, such audio content has already been mixed from original audio source signals, for example for a mono or stereo setting, without keeping the original audio source signals that were used to produce the audio content. However, there are situations or applications where a remixing or upmixing of the audio content is desired, for example where the audio content is to be played on a device that has more available audio channels than the audio content provides: mono audio content to be played on a stereo device, stereo audio content to be played on a surround sound device having six audio channels, and so on. In other situations, the spatial position of a perceived audio source or the loudness of a perceived audio source is to be modified.
While techniques for remixing audio content are ubiquitous, it is generally desirable to improve methods and apparatus for remixing audio content.
Disclosure of Invention
According to a first aspect, the present disclosure provides an electronic device comprising: an audio source separation unit configured to determine a separation from the input signal based on audio source separation; an evaluation unit configured to determine an evaluation result of audio source separation from the separation and the input signal based on machine learning; and an adaptive remix/upmix unit configured to determine the output signal based on the separation and based on the evaluation result.
According to yet another aspect, the present disclosure provides a method comprising: an audio source separation process configured to determine a separation from the input signal based on the audio source separation; an evaluation process configured to determine an evaluation result of audio source separation from the separation and the input signal based on machine learning; and an adaptive remix/upmix process configured to determine the output signal based on the separation and based on the evaluation result.
According to another aspect, the present disclosure provides a computer program comprising instructions which, when executed on a processor, cause the processor to: determining a separation from the input signal based on the audio source separation; determining an evaluation result of the audio source separation from the separation and the input signal based on machine learning; and determining an output signal based on the separation by adaptive remixing/upmixing and based on the evaluation result.
Drawings
Embodiments are described by way of example with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates a general approach to audio up-mixing/remixing by blind source separation (BSS);
fig. 2 schematically illustrates a method of adaptive remix/up-mix based on blind evaluation;
FIG. 3 visualizes the process of blind evaluation;
FIG. 4 schematically depicts a process of training a CNN to perform a blind evaluation of a source separation process;
fig. 5a, 5b visualize a first embodiment of adaptive signal remix/upmix;
fig. 6 shows a flow chart visualizing a method for adaptive signal remixing/upmixing according to a first embodiment; and
fig. 7a, 7b, 7c and 7d show a second embodiment of adaptive signal remixing/upmixing;
fig. 8 provides a schematic diagram of a system applying a digitalized monopole synthesis algorithm; and
fig. 9 schematically depicts an embodiment of an electronic system that may be used as an adaptive remixing/upmixing system.
Detailed Description
Before a detailed description of the embodiments is given with reference to the figures, some general explanations are made.
An embodiment discloses an electronic device, including: an audio source separation unit configured to determine a separation from the input signal based on audio source separation; an evaluation unit configured to determine an evaluation result of blind source separation from the separation and the input signal based on machine learning; and an adaptive remix/upmix unit configured to determine the output signal based on the separation and based on the evaluation result.
In audio source separation, an input signal comprising a plurality of sources (e.g., instruments, voices, etc.) is decomposed into separations. Audio source separation may be unsupervised (referred to as "blind source separation", BSS) or partially supervised. "Blind" means that the blind source separation does not necessarily have information about the original sources. For example, it may not be known how many sources the original signal contained, or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal into its separations without knowing those separations beforehand. A blind source separation unit may use any of the blind source separation techniques known to the skilled person. In (blind) source separation, source signals may be searched that are minimally correlated or maximally independent in a probabilistic or information-theoretic sense, or the source signals may be found based on a non-negative matrix factorization with structural constraints on the audio source signals. Methods for performing (blind) source separation are known to the skilled person and are based on, for example, principal component analysis, singular value decomposition, independent component analysis, non-negative matrix factorization, artificial neural networks, and the like; a sketch of one such technique follows below.
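As an illustration of one such technique, the following is a minimal sketch of unsupervised separation by non-negative matrix factorization (NMF) of a magnitude spectrogram with soft masking. All function and parameter names here are illustrative, not taken from the patent.

```python
# Minimal NMF source-separation sketch (illustrative, not the patent's method).
import numpy as np
from scipy.signal import stft, istft

def nmf_separate(x, fs, n_components=4, n_iter=200, eps=1e-10):
    """Split a mono mixture x into n_components source estimates."""
    f, t, X = stft(x, fs=fs, nperseg=1024)
    V = np.abs(X) + eps                           # magnitude spectrogram
    W = np.random.rand(V.shape[0], n_components)  # spectral templates
    H = np.random.rand(n_components, V.shape[1])  # activations
    for _ in range(n_iter):                       # multiplicative updates (KL)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    sources = []
    for k in range(n_components):                 # soft-mask each component
        mask = np.outer(W[:, k], H[k]) / (W @ H + eps)
        _, s_k = istft(mask * X, fs=fs, nperseg=1024)
        sources.append(s_k)
    return sources
```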
Although some embodiments use blind source separation for generating the separated audio source signals, the present disclosure is not limited to embodiments in which no further information is used for separating the audio source signals, but in some embodiments further information is used for generating the separated audio source signals. Such further information may be, for example, information about the mixing process, information about the type of audio sources comprised in the input audio content, information about the spatial positions of the audio sources comprised in the input audio content, etc.
The input signal may be any type of audio signal. It may be an analog or digital signal, it may originate from a compact disc, a digital video disc, etc., or it may be a data file such as a wave file, an mp3 file, etc.; the present disclosure is not limited to a specific format of the input audio content. The input audio content may, for example, be a stereo audio signal having a first-channel input audio signal and a second-channel input audio signal, although the present disclosure is not limited to input audio content with two audio channels. In other embodiments, the input audio content may include any number of channels, such as the remixing of a 5.1 audio signal or the like.
The input signal may comprise one or more source signals. In particular, the input signal may comprise several audio sources. The audio source may be any entity that produces sound waves, such as musical instruments, speech, vocals (vocals), artificially generated sounds (e.g., from a synthesizer), and the like.
The input audio content may represent or comprise mixed audio sources, which means that the sound information is not separately available for all audio sources of the input audio content, but that for example the sound information for different audio sources at least partially overlaps or mixes.
The separations obtained from the input signal by blind source separation may, for example, comprise a singing voice separation, a bass separation, a drums separation and an "other" separation. The singing voice separation may include all sounds belonging to human voices, the bass separation all sounds below a predefined threshold frequency, the drums separation all sounds belonging to the drums in a song/piece of music, and the "other" separation all remaining sounds.
Still further, the separations may also comprise a residual.
When audio source separation (e.g., blind source separation (BSS)) and subsequent remixing/upmixing are performed, the evaluation unit evaluates how well the BSS has worked. If the blind source separation separates poorly, sounds that originally belong together (e.g. the voice of a singer) may become split, so that different syllables are played by different loudspeakers in the room because the voice has been erroneously separated into different output channels. If this happens, and the loudspeakers in the room are additionally at different positions (e.g. in a surround system), a user listening to the output sound may hear different syllables of the same voice from different directions. This can lead to the effect that the user thinks the singer has moved, or that the sound comes from an unexpected, odd direction.
In the remix/upmix, the separations obtained from the blind source separation are processed. The remix/upmix in the embodiments is adaptive in that it is influenced by the evaluation result provided by the blind evaluation. For example, if the evaluation result indicates a "good" separation, the remixing/upmixing may be more extensive than if it indicates a "poor" separation. The present disclosure is not limited to a specific number of audio channels, and all kinds of remixing, upmixing and downmixing may be realized.
The quality of the up-mix/remix may depend on the quality of the source separation. One common problem with the separation of sources into instruments (such as "bass", "drum", "other", and "singing") is that "other" and "singing" are not clearly separated. For example, portions of the flute or synthesizer signal may be erroneously separated into "singing". If the remixing/upmixing system does not know that the separation failed, the listener will perceive objectionable artifacts. For example, if "singing" is placed in front of the listener and "other" is placed behind the listener, the flute/synthesizer may be perceived as moving between the front and the back.
The evaluation unit may, for example, comprise an artificial neural network (ANN), which can be realized by any construction method known to the skilled person. The ANN may, for example, be a convolutional neural network (CNN). Alternatively, the ANN may be a recurrent neural network, a fully connected neural network, or the like. In particular, the ANN may be implemented on one or more computing devices realized in CMOS (complementary metal-oxide-semiconductor) technology, nano-devices, GPUs (graphics processing units), transistors, etc.
The evaluation unit may have been trained for evaluating audio source separation. For example, the training of the evaluation unit may be performed by a machine learning process, e.g. according to any technique or method known to the person skilled in the art, in particular supervised learning, unsupervised learning (Hebbian learning), reinforcement learning, etc.
The evaluation unit may be configured to determine an estimated signal-to-distortion ratio (SDR), an estimated image-to-spatial distortion ratio (ISR), an estimated signal-to-interference ratio (SIR) and/or an estimated signal-to-artifact ratio (SAR) as an evaluation result. Alternatively, the evaluation unit may be configured to determine a subjective quality metric (e.g. a human opinion score) which is an estimate of the separation quality perceived by humans.
The adaptive remix/upmix unit may be configured to determine the degree of remixing/upmixing according to the evaluation result. For example, the embodiments described below allow the quality of the separation to be estimated and the loudspeakers then to be driven dynamically in different ways. In the case of poor separation, for example, the listening system can reduce its surround effect by driving all speakers at the same volume, so that all sounds come from all directions, thereby suppressing the effect of hearing sounds from the wrong direction.
Remixing/upmixing performance can be improved if the source separation is evaluated. If the source separation is good, the remixing/upmixing may be more aggressive (i.e. the separations are placed further apart, which increases the listener's sense of envelopment). If the sources are poorly separated, the remixing/upmixing may be more conservative.
For example, the adaptive remixing/upmixing unit may be configured to determine the positions of virtual sound sources based on the evaluation result. Remixing/upmixing may, for example, comprise placing an instrument at a new position. For example, a stereo song may be separated into "bass", "drums", "other" and "singing voice" and upmixed to a 5.1 system, where the "other" separation, e.g. including piano, guitar, synthesizer, etc., is now placed behind the listener. Thereby, the listener's perceived sense of surround can be increased.
The adaptive remix/upmix unit may be configured to determine, based on the evaluation result, an amount of one or more audio effects to be applied to the separations.
The adaptive remix/upmix unit may be configured to determine a number of output channels for rendering the output signal based on the evaluation result.
The embodiment also discloses a method, comprising: an audio source separation process configured to determine a separation from the input signal based on audio source separation; an evaluation process configured to determine an evaluation result of audio source separation from the separation and the input signal based on machine learning; and an adaptive remix/upmix process configured to determine the output signal based on the separation and based on the evaluation result. These embodiments also include a method having all of the process aspects described above and in the figures described in more detail below.
According to another aspect, the present disclosure provides a computer program comprising instructions which, when executed on a processor, cause the processor to: determine a separation from the input signal based on audio source separation; determine an evaluation result of the audio source separation from the separation and the input signal based on machine learning; and determine an output signal by adaptive remixing/upmixing based on the separation and on the evaluation result. These embodiments also include a computer program that implements all of the process aspects described above and in more detail in the figures below. Such a program may run on a computer, processor, tablet computer, smartphone, hi-fi unit, or any other device the skilled person may choose.
The term "signal" as used herein is not limited to any particular format, and may be an analog signal, a digital signal, or a signal stored in a data file, or any other format.
Embodiments will now be described with reference to the accompanying drawings.
Audio up-mix/remix by Blind Source Separation (BSS)
Fig. 1 schematically shows a general approach to audio up-mixing/remixing by Blind Source Separation (BSS).
First, source separation (also called "demixing") is performed, which decomposes the stereo input audio signal 1, comprising the two channels 1a, 1b and audio from multiple audio sources (source 1, source 2, ... source K, e.g. instruments, voices, etc.), into "separations", here into the source estimates 2a-2d, where K is an integer denoting the number of audio sources. Since the separation of the audio source signals may be imperfect, for example because of the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a-2d. The residual signal may, for example, represent the difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its respective recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround sound input audio content, spatial information about the audio sources is also included in or represented by the input audio content, e.g. by the proportions with which the audio source signals are included in the different audio channels. The separation of the input audio content 1 into the separated audio source signals 2a-2d and the residual 3 is performed on the basis of blind source separation or other techniques capable of separating audio sources.
In a second step, the separations 2a-2d and the possible residual 3 are remixed and rendered into a new loudspeaker signal 4, here a signal comprising five channels 4a-4e. The output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of the spatial information. The output audio content is exemplarily shown in fig. 1 and denoted by reference numeral 4.
Hereinafter, the number of audio channels of the input audio content is denoted M_in and the number of audio channels of the output audio content M_out. Since the input audio content 1 in the example of fig. 1 has the two channels 1a and 1b, and the output audio content 4 has the five channels 4a-4e, M_in = 2 and M_out = 5. The approach in fig. 1 is generally referred to as remixing, and in particular as upmixing if M_in < M_out. In the example of fig. 1, the number of audio channels M_in = 2 of the input audio content 1 is smaller than the number of audio channels M_out = 5 of the output audio content 4, so this is an upmix from the stereo input audio content 1 to 5.0 surround sound output audio content 4.
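As an illustration of this rendering step, the following is a minimal sketch that mixes four separations into a 5.0 loudspeaker layout with a static gain matrix. The gain values and the channel assignment are assumptions for illustration, not values from the patent.

```python
# Minimal sketch: render M_sep = 4 separations into M_out = 5 channels
# (gains are assumed, not taken from the patent).
import numpy as np

# rows: separations (vocals, drums, bass, other);
# columns: 5.0 channels (FL, FR, C, SL, SR)
GAINS = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],   # vocals -> center
    [0.7, 0.7, 0.0, 0.0, 0.0],   # drums  -> front pair
    [0.5, 0.5, 0.0, 0.5, 0.5],   # bass   -> everywhere (hard to localize)
    [0.0, 0.0, 0.0, 0.7, 0.7],   # other  -> surrounds, behind the listener
])

def render_5_0(separations):
    """separations: (4, n_samples) -> (5, n_samples) loudspeaker signals."""
    return GAINS.T @ separations
```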
Adaptive remix/upmix based on blind evaluation
Fig. 2 schematically illustrates a method of adaptive remix/upmix based on blind evaluation. The method comprises a process of audio source separation 201, a process of blind evaluation 203 and a process of adaptive remix/upmix 202. As described above with respect to fig. 1, the input signal having M_in channels is input to the source separation 201 and decomposed into M_sep separations. The separated signals 2 are sent to the adaptive remix/upmix 202 and to the blind evaluation 203.
The blind evaluation 203 receives as inputs the input signal 1 and the separated signals 2. By comparing the separated signals 2 with the input signal 1, the blind evaluation 203 estimates the quality of the source separation process. The result of the blind evaluation 203 is an estimate, here the estimated signal-to-distortion ratio SDR. The adaptive remix/upmix 202 remixes/upmixes the separated signals based on the estimated SDR to obtain the output signal 4 having M_out channels. That is, the remix/upmix 202 adapts to the quality of the source separation 201 as estimated by the blind evaluation 203: the adaptive remix/upmix 202 may decide the parameters of the remix/upmix according to the estimated SDR. The process of fig. 2 thus provides an audio remixing/upmixing system that is adaptive and uses a blind evaluator to determine its settings. For example, if the average SDR (averaged over all four instruments) is low, the separations can be placed closer together. Furthermore, the perception of artifacts (e.g. musical noise) may be reduced by adding reverberation to the separations. As another example, a remix/upmix system may be provided that selects the separation it uses from several source separation algorithms. In this scenario, several source separation algorithms may run in parallel, and the best algorithm may be selected based on the results of the blind evaluation.
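The following is a minimal sketch of the two adaptive mechanisms just described: selecting the best of several separation algorithms with a blind evaluator, and mapping the estimated mean SDR to remix parameters. `separate_fns` and `estimate_sdr` are hypothetical callables, and the breakpoint values are assumptions.

```python
# Minimal sketch of SDR-driven adaptation (assumed interfaces and values).
import numpy as np

def choose_best_separation(mix, separate_fns, estimate_sdr):
    """Run several separation algorithms; keep the one the blind evaluator rates best."""
    results = [fn(mix) for fn in separate_fns]                     # candidate separations
    scores = [np.mean(estimate_sdr(mix, sep)) for sep in results]  # mean SDR over instruments
    best = int(np.argmax(scores))
    return results[best], scores[best]

def remix_settings(mean_sdr_db):
    """Map the blindly estimated mean SDR to remix parameters."""
    quality = np.clip(mean_sdr_db / 12.0, 0.0, 1.0)  # 0 dB -> poor, 12 dB -> good (assumed)
    return {
        "source_spread_m": 1.0 + 4.0 * quality,   # place separations further apart when good
        "reverb_amount": 0.5 * (1.0 - quality),   # mask artifacts with reverb when poor
    }
```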
In the embodiment of fig. 2, the result of the blind evaluation 203 is the signal-to-distortion ratio SDR. Additionally or alternatively, the blind evaluation 203 may determine an image-to-spatial distortion ratio (ISR), a signal-to-interference ratio (SIR) and/or a signal-to-artifact ratio (SAR). Furthermore, the mean squared error in the time or frequency domain can be used as another objective quality metric. Also, a subjective score may be estimated by the evaluator. These metrics are known to the skilled person.
Blind evaluation using Artificial Neural Network (ANN)
Fig. 3 visualizes the process of blind evaluation. For the blind evaluation, an artificial neural network (ANN) 203 is used, here for example a convolutional neural network (CNN), because CNNs have good pattern recognition and value estimation capabilities. The CNN 203 has been trained to estimate the signal-to-artifact ratio (SAR), signal-to-distortion ratio (SDR), image-to-spatial distortion ratio (ISR) and signal-to-interference ratio (SIR) as evaluation result 3. The CNN 203 receives as inputs the input signal 1 (mix) and the separations 2 (obtained from the source separation 201 in fig. 2). The separations 2 may, for example, comprise four signals as instruments: a singing voice signal, a drums signal, a bass signal, and an "other" signal including the residual. As evaluation result, the CNN 203 outputs at least one of the estimated SAR, SDR, ISR and SIR for each instrument. Using the output of the blind evaluator 203, the remix/upmix system can be adapted as described above with respect to fig. 2.
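A minimal sketch of such a blind-evaluation CNN in PyTorch is given below. The patent does not fix the architecture, so the layer sizes and the input representation (stacked magnitude spectrograms of the mix and the four separations) are assumptions.

```python
# Minimal sketch of a blind evaluator CNN (assumed architecture).
import torch
import torch.nn as nn

class BlindEvaluatorCNN(nn.Module):
    def __init__(self, n_separations=4, n_metrics=4):
        super().__init__()
        self.n_sep, self.n_metrics = n_separations, n_metrics
        self.features = nn.Sequential(
            # input channels: mix + the separations, as magnitude spectrograms
            nn.Conv2d(1 + n_separations, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, n_separations * n_metrics)

    def forward(self, spec):  # spec: (batch, 1 + n_sep, freq, time)
        z = self.features(spec).flatten(1)
        # one row per instrument, one column per metric (SAR, SDR, ISR, SIR)
        return self.head(z).view(-1, self.n_sep, self.n_metrics)
```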
Fig. 4 schematically depicts the process of training a CNN to perform blind evaluation of the source separation process. The CNN 203 is trained to estimate the signal-to-distortion ratio SDR 3 of the result 2 of the blind separation process 201. In the training phase, the signal-to-distortion ratio SDR 3 is used as the overall performance measure of the blind source separation 201. During the training phase, the CNN 203 is trained with a large number of input signals 1 (mixes) whose true sources s_ij(t) are known. For example, an input signal 1 (mix) is generated by mixing 204 a predetermined number of true sources s_ij(t) (instruments). Blind source separation 201 is performed on the input signal 1 (mix) to obtain the estimated separations 2 (estimated source signals ŝ_ij(t)).

Based on the true sources s_ij(t) (instruments) and the estimated source signals ŝ_ij(t), the quality of the blind source separation, here expressed as the signal-to-distortion ratio SDR 3, is determined in process 205. With i the channel index and j the instrument/source index, the signal-to-distortion ratio SDR 3 is given by

$$\mathrm{SDR}_j = 10 \log_{10} \frac{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t s_{ij}(t)^2}{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t \big( s_{ij}(t) - \hat{s}_{ij}(t) \big)^2}$$

where s_ij(t) and ŝ_ij(t) are the true and estimated source signals and M_in is the total number of channels. In general M_in = 2, i.e. the input mix for the source separation is stereo. The computed signal-to-distortion ratio SDR 3 is fed to the blind evaluation CNN 203 as training data. That is, during training, the CNN 203 receives as inputs the input signal 1 (mix) and the estimated source signals ŝ_ij(t) obtained from the blind source separation 201. If enough training data is used, the CNN can, as described in the embodiments of figs. 2 and 3 above, reliably estimate the SDR value of unknown separations. Thus, during the training phase, the CNN implementing the blind evaluator learns from the mixed signal and the true separations.
When the blind evaluation is performed with the trained CNN 203, the above formula is not used, because after training the true sources s_ij(t) are unknown.
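The following is a minimal sketch of computing the ground-truth SDR training target from known true sources according to the formula above; array shapes and names are illustrative.

```python
# Minimal sketch: ground-truth SDR per source, as the training target
# for the blind evaluator (shapes and names are assumptions).
import numpy as np

def sdr_db(true_sources, est_sources, eps=1e-12):
    """true_sources, est_sources: arrays of shape (n_channels, n_sources, n_samples)."""
    num = np.sum(true_sources ** 2, axis=(0, 2))                  # sum over channels i, time t
    den = np.sum((true_sources - est_sources) ** 2, axis=(0, 2))  # per source j
    return 10.0 * np.log10(num / (den + eps))                     # one SDR value per source

# Usage: this per-source SDR vector serves as the regression target that the
# evaluator network learns to predict from the mix and the separations alone.
```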
Applications
Fig. 5a visualizes a first embodiment of adaptive signal remixing/upmixing. A sound system with two virtual sound sources 51, 52 is provided, the virtual sound source 51 being located in front of the user 31 and the virtual sound source 52 behind the user 31. In this embodiment, two output channels are defined (M_out = 2). The adaptive remix/upmix process (202 in fig. 2) sends the "bass", "singing voice" and "drums" channels to the first virtual sound source 51 in front of the user 31, and the "other" channel to the virtual sound source 52 behind the user 31. Based on the estimated SDR value provided by the blind evaluation (203 in fig. 2), the virtual distance d between the virtual sound source 51 and the virtual sound source 52 is determined according to the function shown in fig. 5b. A virtual distance d between a virtual sound source and the user 31 may be realized by positioning the respective virtual sound source according to the distance d. The virtual sound sources may, for example, be generated by a 3D audio rendering technique, as described in more detail below with respect to fig. 8.
Fig. 5b visualizes the function for adaptive signal remixing/upmixing used by the embodiment of fig. 5a. This function gives the distance d between the two virtual sound sources 51 and 52 of fig. 5a as a function of the estimated SDR. For high SDR values, a larger distance d is chosen than for low SDR values.
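A minimal sketch of such a mapping is given below; the exact breakpoints of the function in fig. 5b are not specified in the text, so the values here are assumptions.

```python
# Minimal sketch of the fig. 5b mapping: distance d as a monotone function
# of the estimated SDR (breakpoint values are assumed).
import numpy as np

def distance_from_sdr(sdr_db, d_min=0.5, d_max=3.0, sdr_lo=0.0, sdr_hi=12.0):
    """Low SDR -> sources close together; high SDR -> sources far apart."""
    w = np.clip((sdr_db - sdr_lo) / (sdr_hi - sdr_lo), 0.0, 1.0)
    return d_min + w * (d_max - d_min)
```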
Fig. 6 shows a flow chart visualizing a method of adaptive signal remixing/upmixing according to the first embodiment. At S601, an input signal and the estimated separations of the instruments/sources obtained by blind source separation are received. At S602, the blind separation result is evaluated by determining an estimated SDR based on the received input signal and the estimated separations. At S603, the positions of the instruments/sources are determined from the estimated SDR. At S604, the estimated separations are remixed/upmixed based on the computed positions of the instruments/sources. At S605, the remixed/upmixed signal is rendered with the 3D sound system.
Fig. 7a, 7b, 7c and 7d show a second embodiment of adaptive signal remixing/upmixing. In this embodiment, the adaptive remix/upmix has more options to react to a given SDR value obtained by the blind evaluation.
Fig. 7a shows a sound mix for a good (high) estimated SDR value. As shown in fig. 7a, the adaptive remix/upmix (202 in fig. 2) computes an output signal that gives the impression that the sound comes from four different directions: the singing voice from the front, the bass and "other" from the back, and the drums from the lateral directions. In the case of fig. 7a, the estimated SDR value provided by the blind evaluation (203 in fig. 3) is high, so it can be assumed that there is essentially no erroneously attributed sound in any of the separated channels. The adaptive remix/upmix therefore decides to place all four virtual sound sources at larger distances d1, d2, d3 and d4 from each other.
In the case of fig. 7b, the estimated SDR value provided by the blind evaluation is low, so the adaptive remix/upmix decides to place all four virtual sound sources at small distances d1, d2, d3 and d4 from each other.
Fig. 7c shows an alternative possible reaction to a small estimated SDR. As described above, the adaptive remix/upmix produces an output signal that gives the impression that the sound comes from four different directions. If the estimated SDR is small, it is uncertain whether all vocal sounds have actually been separated into the singing voice channel. In the case of poor source separation, the "other" and singing voice channels may overlap, since both contain similar frequencies. Thus, if the BSS is evaluated with a low estimated SDR, it may be advisable to output the "other" channel and the singing voice channel from the same direction/virtual sound source, since this avoids the effect of switching or moving sound directions. As shown in fig. 7c, the adaptive remix/upmix decides, based on the blind evaluation result, to produce an output signal giving the impression that the sound comes from only two different directions: the drums, "other" and singing voice from the front of the user, and only the bass from the back.
Fig. 7d shows yet another possible reaction to a small estimated SDR value. As described above, the "other" channel and the singing voice channel may overlap, which may create the impression that the singer moves around the stage while singing, his voice sometimes coming from the front, sometimes from the back. This impression can be reduced by applying a reverb effect to the singing voice. Reverb adds spaciousness to the singing voice, making it sound wider and thus making it harder for the user to determine the direction from which the singing voice comes (reverberation gives the impression that sound also arrives from other directions, caused by room reflections). This can mask artifacts that result from erroneous source separation. The adaptive remix/upmix may thus adapt the amount of reverb applied to the singing voice based on the estimated SDR obtained in the blind evaluation.
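The following is a minimal sketch of SDR-dependent reverb on the singing voice separation. The patent does not prescribe a reverb algorithm, so the synthetic decaying-noise impulse response and the wet/dry mapping are assumptions.

```python
# Minimal sketch: more reverb on the vocals when the estimated SDR is low
# (impulse response and mapping are assumptions, not from the patent).
import numpy as np
from scipy.signal import fftconvolve

def add_adaptive_reverb(vocals, fs, est_sdr_db, seed=0):
    wet = float(np.clip(1.0 - est_sdr_db / 12.0, 0.0, 0.6))  # poorer separation -> more reverb
    rng = np.random.default_rng(seed)
    t = np.arange(int(0.8 * fs)) / fs                        # 0.8 s reverb tail
    ir = rng.standard_normal(t.size) * np.exp(-t / 0.25)     # decaying-noise impulse response
    ir /= np.sqrt(np.sum(ir ** 2))                           # normalize IR energy
    tail = fftconvolve(vocals, ir)[: vocals.size]
    return (1.0 - wet) * vocals + wet * tail
```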
System for digital monopole synthesis
Fig. 8 provides an embodiment of a system implementing a method based on a digitalized monopole synthesis algorithm with integer delays.
The theoretical background of this system is described in more detail in patent application US 2016/0037282 a1, which is incorporated herein by reference.
The technique implemented in the embodiments of US 2016/0037282 A1 is conceptually similar to wave field synthesis, which uses a restricted number of acoustic enclosures to generate a defined sound field. However, the generation principle of the embodiments differs in that the synthesis does not attempt to model the sound field exactly, but is based on a least-squares approach.
The target sound field is modeled as at least one target monopole placed at a defined target location. In one embodiment, the target sound field is modeled as one single target monopole. In other embodiments, the target sound field is modeled as a plurality of target monopoles placed at respective defined target locations. The position of the target monopole may be moved. For example, the target monopole may be adapted to the movement of the noise source to be attenuated. If multiple target monopoles are used to represent the target soundfield, the method of synthesizing sounds of the target monopoles based on a defined set of synthetic monopoles, as described below, may be applied independently to each target monopole, and the contributions of the synthetic monopoles obtained for each target monopole may be summed to reconstruct the target soundfield.
The source signal x(n) is fed to delay units labelled z^(-n_p) and to amplification units a_p, where p = 1, ..., N is the index of the respective synthesis monopole used for synthesizing the target monopole signal. The delay and amplification units according to this embodiment may apply equation (117) of US 2016/0037282 A1 to compute the resulting signals y_p(n) = s_p(n), which are used to synthesize the target monopole signal. The resulting signals s_p(n) are power amplified and fed to the loudspeakers S_p.

In this embodiment, the synthesis is thus performed in the form of delayed and amplified components of the source signal x(n).
According to this embodiment, the delay n_p of the synthesis monopole with index p corresponds to the propagation time of sound over the Euclidean distance R_p0 = |r_p - r_0| between the target monopole at position r_0 and the generator at position r_p.

Further, according to this embodiment, the amplification factor a_p is inversely proportional to the distance R = R_p0.
In an alternative embodiment of the system, a modified amplification factor according to equation (118) of US 2016/0037282 a1 may be used.
In yet another alternative embodiment of the system, the mapping factor described with respect to fig. 9 of US 2016/0037282 A1 may be used to modify the amplification.
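The following is a minimal sketch of the delay-and-gain computation described above: integer sample delays from the propagation time over R_p0, and gains inversely proportional to R_p0. It does not reproduce the exact equations (117)/(118) of US 2016/0037282 A1.

```python
# Minimal sketch of digitalized monopole synthesis driving signals
# (simplified gain law; not equations (117)/(118) of US 2016/0037282 A1).
import numpy as np

def monopole_driving_signals(x, fs, r_target, r_speakers, c=343.0):
    """x: source signal; r_target: (3,) target position; r_speakers: (N, 3)."""
    R = np.linalg.norm(r_speakers - r_target, axis=1)   # R_p0 = |r_p - r_0|
    n_p = np.round(R / c * fs).astype(int)              # integer sample delays
    a_p = 1.0 / np.maximum(R, 1e-3)                     # gain ~ 1 / R_p0
    out = np.zeros((len(R), x.size + int(n_p.max())))
    for p in range(len(R)):
        out[p, n_p[p]: n_p[p] + x.size] = a_p[p] * x    # y_p(n) = a_p * x(n - n_p)
    return out
```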
Implementation
Fig. 9 schematically depicts an embodiment of an electronic system that can implement the process of adaptive remix/upmix based on blind evaluation as described above. The electronic system 900 comprises a CPU 901 as processor. The electronic system 900 further comprises a microphone array 910, a speaker array 911 and a convolutional neural network unit 920 connected to the processor 901. The processor 901 may, for example, implement a blind source separation unit, an adaptive remix/upmix unit and/or a blind evaluation unit realizing the processes described in more detail with respect to fig. 2. The CNN unit 920 may, for example, be an artificial neural network in hardware, e.g. a neural network on a GPU or any other hardware dedicated to implementing an artificial neural network. As described in the embodiments above, the speaker array 911 consists of one or more speakers distributed over a predefined space and is configured to render 3D audio. The electronic system 900 further comprises a user interface 912 connected to the processor 901. The user interface 912 acts as a human-machine interface and enables a dialogue between an administrator and the electronic system; for example, an administrator may configure the system using the user interface 912. The electronic system 900 also includes an Ethernet interface 921, a Bluetooth interface 904 and a WLAN interface 905. These units 904, 905, 921 act as I/O interfaces for data communication with external devices; for example, additional speakers, microphones and cameras with Ethernet, WLAN or Bluetooth connections may be coupled to the processor 901 via these interfaces.
The electronic system 900 also includes data storage 902 and data storage 903 (here, RAM). The data storage 903 is arranged to temporarily store or buffer data or computer instructions for processing by the processor 901. The data storage 902 is provided as, for example, a long term memory for recording sensor data obtained from the microphone array 910 and provided to or retrieved from the CNN unit 920. The data store 902 may also store audio data representing audio messages that the public announcement system may deliver to people moving in a predefined space.
The process of blind evaluation using a convolutional neural network may be implemented by the neural network unit 920, or alternatively the blind evaluation may be realized on the processor 901 by a software implementation of a convolutional neural network. The artificial neural network may be implemented as a convolutional neural network (as described in the embodiments above) or as another type of neural network, such as a deep neural network, a recurrent neural network, etc.
It should be noted that the above description is only an example configuration. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces, etc.
***
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of the method steps is, however, given for illustrative purposes only and should not be construed as binding.
It should be noted that the division of the electronic system of fig. 9 into units is for illustrative purposes only and the present disclosure is not limited to any particular division of functions in particular units. For example, at least part of the circuitry may be implemented by a separately programmed processor, Field Programmable Gate Array (FPGA), dedicated circuitry, or the like.
All units and entities described in this description and claimed in the claims can be implemented as integrated circuit logic, for example on a chip, if not otherwise stated, and the functions provided by these units and entities can be implemented by software, if not otherwise stated.
In so far as the embodiments of the present disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control, and the transmission, storage or other medium by which such a computer program is provided, are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
(1) An electronic device, comprising:
an audio source separation unit (201) configured to determine a separation (2) from an input signal (1) based on audio source separation;
an evaluation unit (203) configured to determine an evaluation result (3) of audio source separation from the separation (2) and the input signal (1) based on machine learning; and
an adaptive remix/upmix unit (202) configured to determine an output signal (4) based on the separation (2) and based on the evaluation result (3).
(2) The electronic device according to (1), wherein the evaluation unit (203) comprises an artificial neural network.
(3) The electronic device of (1) or (2), wherein the evaluation unit (203) has been trained for evaluating blind source separation.
(4) The electronic apparatus according to any of (1) to (3), wherein the evaluation unit is configured to determine an estimated signal-to-distortion ratio (SDR), an estimated image-to-spatial distortion ratio (ISR), an estimated signal-to-interference ratio (SIR) and/or an estimated signal-to-artifact ratio (SAR) as the evaluation result (3).
(5) The electronic apparatus according to any one of (1) to (4), wherein the evaluation unit is configured to estimate a human opinion score as an evaluation result (3).
(6) The electronic device according to any one of (1) to (5), wherein the evaluation result is used to select a specific source separation algorithm from among several source separation algorithms.
(7) The electronic device of any one of (1) to (6), wherein the input signal (1) comprises one or more source signals (s_ij(t)).
(8) The electronic device of any one of (1) to (7), wherein the source signals (s_ij(t)) comprise at least one of a singing voice signal, a bass signal and a drums signal.
(9) The electronic device of any one of (1) to (8), wherein the evaluation unit (203) is configured to determine the estimated signal-to-distortion ratio (SDR) based on the following equation:

$$\mathrm{SDR}_j = 10 \log_{10} \frac{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t s_{ij}(t)^2}{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t \big( s_{ij}(t) - \hat{s}_{ij}(t) \big)^2}$$

where i is the channel index, j is the source index, s_ij(t) and ŝ_ij(t) are the true and estimated source signals, and M_in is the total number of channels.
(10) The electronic device of any one of (1) to (9), wherein the separations (2) comprise a residual.
(11) The electronic apparatus according to any of (1) to (10), wherein the adaptive remix/up-mix unit (202) is configured to determine the degree of remix/up-mix according to the evaluation result (3).
(12) The electronic apparatus according to any one of (1) to (11), wherein the adaptive remixing/upmixing unit (202) is configured to determine the position of the virtual sound source (51, 52) based on the evaluation result (3).
(13) The electronic device of any one of (1) to (12), wherein the adaptive remix/upmix unit (202) is configured to determine, based on the evaluation result (3), an amount of one or more audio effects to be applied to the separations (2).
(14) The electronic apparatus according to any of (1) to (13), wherein the adaptive remix/up-mix unit (202) is configured to determine a number of output channels for rendering the output signal (4) based on the evaluation result (3).
(15) The electronic device of any of (1) through (14), wherein audio source separation is based on blind source separation.
(16) A method, comprising:
an audio source separation process (201) configured to determine a separation (2) from an input signal (1) based on audio source separation;
an evaluation process (203) configured to determine an evaluation result (3) of audio source separation from the separation (2) and the input signal (1) based on machine learning; and
an adaptive remix/upmix process (202) configured to determine an output signal (4) based on the separation (2) and based on the evaluation result (3).
(17) A computer program comprising instructions that, when executed on a processor, cause the processor to:
determining a separation (2) from the input signal (1) based on the audio source separation;
determining an evaluation result (3) of the audio source separation from the separation (2) and the input signal (1) based on machine learning; and
an output signal (4) is determined by adaptive remixing/upmixing based on the separating (2) and based on the evaluation result (3).

Claims (16)

1. An electronic device, comprising:
an audio source separation unit configured to determine a separation from the input signal based on audio source separation;
an evaluation unit configured to determine an evaluation result of the audio source separation from the separation and the input signal based on machine learning; and
an adaptive remix/upmix unit configured to determine an output signal based on the separation and based on the evaluation result;
wherein the evaluation unit is configured to determine as an evaluation result an estimated signal-to-distortion ratio SDR, an estimated image-to-spatial distortion ratio ISR, an estimated signal-to-interference ratio SIR and/or an estimated signal-to-artifact ratio SAR.
2. The electronic device of claim 1, wherein the evaluation unit comprises an artificial neural network.
3. The electronic device according to claim 1, wherein the evaluation unit has been trained for evaluating blind source separation.
4. The electronic device according to claim 1, wherein the evaluation unit is further configured to estimate a human opinion score as an evaluation result.
5. The electronic device of claim 1, wherein the evaluation result is used to select a particular source separation algorithm from a number of source separation algorithms.
6. The electronic device of claim 1, wherein the input signal comprises one or more source signals.
7. The electronic device of claim 6, wherein the source signal comprises at least one of a singing voice signal, a bass signal, and a drum signal.
8. The electronic device according to claim 1, wherein the evaluation unit is configured to determine the estimated signal-to-distortion ratio SDR based on the following equation:

$$\mathrm{SDR}_j = 10 \log_{10} \frac{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t s_{ij}(t)^2}{\sum_{i=1}^{M_{\mathrm{in}}} \sum_t \big( s_{ij}(t) - \hat{s}_{ij}(t) \big)^2}$$

where i is the channel index, j is the source index, s_ij(t) and ŝ_ij(t) are the true and estimated source signals, and M_in is the total number of channels.
9. The electronic device of claim 1, wherein the separation comprises a residue.
10. The electronic apparatus according to claim 1, wherein the adaptive remix/upmix unit is configured to determine the degree of remix/upmix depending on the evaluation result.
11. The electronic apparatus according to claim 1, wherein the adaptive remix/up-mix unit is configured to determine a position of a virtual sound source based on the evaluation result.
12. The electronic apparatus according to claim 1, wherein the adaptive remixing/upmixing unit is configured to determine, based on the evaluation result, an amount of one or more audio effects to be applied to the separations.
13. The electronic apparatus according to claim 1, wherein the adaptive remix/up-mix unit is configured to determine a number of output channels for rendering the output signal based on the evaluation result.
14. The electronic device of claim 1, wherein the audio source separation is based on blind source separation.
15. A method for adaptive remixing of audio content, comprising:
an audio source separation process configured to determine a separation from the input signal based on audio source separation;
an evaluation process configured to determine an evaluation result of the audio source separation from the separation and the input signal based on machine learning; and
an adaptive remix/upmix process configured to determine an output signal based on the separation and based on the evaluation result;
wherein the evaluation process is configured to determine as evaluation result an estimated signal distortion ratio SDR, an estimated image-to-spatial distortion ratio ISR, an estimated signal-to-interference ratio SIR and/or an estimated signal-to-artifact ratio SAR.
16. A storage medium having a program stored thereon, which when executed on a processor causes the processor to:
determining a separation from the input signal based on the audio source separation;
determining an assessment of the audio source separation from the separation and the input signal based on machine learning; and
determining an output signal based on the separating and based on the evaluation result by adaptive remixing/upmixing;
wherein an estimated signal-to-distortion ratio SDR, an estimated image-to-spatial distortion ratio ISR, an estimated signal-to-interference ratio SIR and/or an estimated signal-to-artifact ratio SAR are determined as evaluation results.
CN201980036214.5A 2018-06-01 2019-05-29 Adaptive remixing of audio content Active CN112205006B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP18175645.3 2018-06-01
EP18175645 2018-06-01
PCT/EP2019/064117 WO2019229199A1 (en) 2018-06-01 2019-05-29 Adaptive remixing of audio content

Publications (2)

Publication Number Publication Date
CN112205006A CN112205006A (en) 2021-01-08
CN112205006B true CN112205006B (en) 2022-08-26

Family

ID=62528284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980036214.5A Active CN112205006B (en) 2018-06-01 2019-05-29 Adaptive remixing of audio content

Country Status (3)

Country Link
JP (1) JP7036234B2 (en)
CN (1) CN112205006B (en)
WO (1) WO2019229199A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021175460A1 (en) * 2020-03-06 2021-09-10 Algoriddim Gmbh Method, device and software for applying an audio effect, in particular pitch shifting
EP4005243B1 (en) * 2020-03-06 2023-08-23 algoriddim GmbH Method and device for decomposing and recombining of audio data and/or visualizing audio data
EP4115630A1 (en) 2020-03-06 2023-01-11 algoriddim GmbH Method, device and software for controlling timing of audio data
EP4115629A1 (en) 2020-03-06 2023-01-11 algoriddim GmbH Method, device and software for applying an audio effect to an audio signal separated from a mixed audio signal
KR20230017287A (en) * 2020-08-26 2023-02-03 구글 엘엘씨 Systems and methods for upmixing audiovisual data
JP7136979B2 (en) * 2020-08-27 2022-09-13 アルゴリディム ゲー・エム・ベー・ハー Methods, apparatus and software for applying audio effects
DE102021201668A1 (en) * 2021-02-22 2022-08-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung eingetragener Verein Signal-adaptive remixing of separate audio sources
WO2023162508A1 (en) * 2022-02-25 2023-08-31 ソニーグループ株式会社 Signal processing device, and signal processing method
WO2023202551A1 (en) * 2022-04-19 2023-10-26 北京字跳网络技术有限公司 Acoustic transmission method and device, and nonvolatile computer readable storage medium
WO2024044502A1 (en) * 2022-08-24 2024-02-29 Dolby Laboratories Licensing Corporation Audio object separation and processing audio
CN117253472B (en) * 2023-11-16 2024-01-26 上海交通大学宁波人工智能研究院 Multi-region sound field reconstruction control method based on generation type deep neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103563402A (en) * 2011-05-16 2014-02-05 高通股份有限公司 Blind source separation based spatial filtering
CN104616663A (en) * 2014-11-25 2015-05-13 重庆邮电大学 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3931237B2 (en) * 2003-09-08 2007-06-13 独立行政法人情報通信研究機構 Blind signal separation system, blind signal separation method, blind signal separation program and recording medium thereof
CN102833665B (en) * 2004-10-28 2015-03-04 Dts(英属维尔京群岛)有限公司 Audio spatial environment engine
JP4952698B2 (en) * 2008-11-04 2012-06-13 ソニー株式会社 Audio processing apparatus, audio processing method and program
WO2014147442A1 (en) * 2013-03-20 2014-09-25 Nokia Corporation Spatial audio apparatus
US9721202B2 (en) * 2014-02-21 2017-08-01 Adobe Systems Incorporated Non-negative matrix factorization regularized by recurrent neural networks for audio processing
US9749769B2 (en) 2014-07-30 2017-08-29 Sony Corporation Method, device and system
JP6981417B2 (en) * 2016-09-09 2021-12-15 ソニーグループ株式会社 Sound source separators and methods, as well as programs

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103563402A (en) * 2011-05-16 2014-02-05 高通股份有限公司 Blind source separation based spatial filtering
CN104616663A (en) * 2014-11-25 2015-05-13 重庆邮电大学 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)

Also Published As

Publication number Publication date
WO2019229199A1 (en) 2019-12-05
CN112205006A (en) 2021-01-08
JP2021526334A (en) 2021-09-30
JP7036234B2 (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN112205006B (en) Adaptive remixing of audio content
JP7183467B2 (en) Generating binaural audio in response to multichannel audio using at least one feedback delay network
JP7139409B2 (en) Generating binaural audio in response to multichannel audio using at least one feedback delay network
US20230254657A1 (en) Audio processing device and method therefor
JP5149968B2 (en) Apparatus and method for generating a multi-channel signal including speech signal processing
KR101341523B1 (en) Method to generate multi-channel audio signals from stereo signals
JP6198800B2 (en) Apparatus and method for generating an output signal having at least two output channels
AU2015295518B2 (en) Apparatus and method for enhancing an audio signal, sound enhancing system
JP2009522895A (en) Decoding binaural audio signals
Farina et al. Ambiophonic principles for the recording and reproduction of surround sound for music
JP2023517720A (en) Reverb rendering
US20090052681A1 (en) System and a method of processing audio data, a program element, and a computer-readable medium
CN103650538A (en) Method and apparatus for decomposing a stereo recording using frequency-domain processing employing a spectral weights generator
US20220337952A1 (en) Content based spatial remixing
WO2014203496A1 (en) Audio signal processing apparatus and audio signal processing method
CN109036456B (en) Method for extracting source component environment component for stereo
JP2023500265A (en) Electronic device, method and computer program
CN113348508A (en) Electronic device, method, and computer program
WO2018193160A1 (en) Ambience generation for spatial audio mixing featuring use of original and extended signal
Kasak et al. Hybrid binaural singing voice separation
KR20190060464A (en) Audio signal processing method and apparatus
CN116643712A (en) Electronic device, system and method for audio processing, and computer-readable storage medium
Tom Automatic mixing systems for multitrack spatialization based on unmasking properties and directivity patterns
JP2017163458A (en) Up-mix device and program
Kan et al. Psychoacoustic evaluation of different methods for creating individualized, headphone-presented virtual auditory space from B-format room impulse responses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant