CN117043851A - Electronic device, method and computer program

Electronic device, method and computer program

Info

Publication number
CN117043851A
Authority
CN
China
Prior art keywords
electronic device
signal
acc
accompaniment
audio
Prior art date
Legal status
Pending
Application number
CN202280022435.9A
Other languages
Chinese (zh)
Inventor
Stefan Uhlich
Giorgio Fabbro
Michael Enenkl
Yuhki Mitsufuji
Current Assignee
Sony Group Corp
Original Assignee
Sony Group Corp
Priority date
Filing date
Publication date
Application filed by Sony Group Corp
Publication of CN117043851A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00: Details of electrophonic musical instruments
    • G10H 1/36: Accompaniment arrangements
    • G10H 1/361: Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H 1/02: Means for controlling the tone frequencies, e.g. attack or decay; means for producing special musical effects, e.g. vibratos or glissandos
    • G10H 1/06: Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
    • G10H 1/08: Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour by combining tones
    • G10H 1/10: Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour by combining tones for obtaining chorus, celeste or ensemble effects
    • G10H 2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/005: Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
    • G10H 2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/056: Musical analysis for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; identification or separation of instrumental parts by their characteristic voices or timbres
    • G10H 2210/155: Musical effects
    • G10H 2210/245: Ensemble, i.e. adding one or more voices, also instrumental voices
    • G10H 2210/251: Chorus, i.e. automatic generation of two or more extra voices added to the melody, e.g. by a chorus effect processor or multiple voice harmonizer, to produce a chorus or unison effect, wherein individual sounds from multiple sources with roughly the same timbre converge and are perceived as one
    • G10H 2210/261: Duet, i.e. automatic generation of a second voice, descant or counter melody, e.g. of a second harmonically interdependent voice by a single voice harmonizer or automatic composition algorithm, e.g. for fugue, canon or round composition, which may be substantially independent in contour and rhythm
    • G10H 2210/265: Acoustic effect simulation, i.e. volume, spatial, resonance or reverberation effects added to a musical sound, usually by appropriate filtering or delays
    • G10H 2210/295: Spatial effects, musical uses of multiple audio channels, e.g. stereo
    • G10H 2210/305: Source positioning in a soundscape, e.g. instrument positioning on a virtual soundstage, stereo panning or related delay or reverberation changes; changing the stereo width of a musical source
    • G10H 2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/471: General musical sound synthesis principles, i.e. sound category-independent synthesis methods
    • G10H 2250/481: Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech
    • G10H 2250/501: Formant frequency shifting, sliding formants

Abstract

An electronic device comprising circuitry configured to process an accompaniment signal (s_acc(n)) according to a live mode process (17) to obtain an enhanced accompaniment signal (s_acc*(n)).

Description

Electronic device, method and computer program
Technical Field
The present disclosure relates generally to the field of audio processing, and in particular to a device, a method and a computer program for karaoke, enabling a user to sing along with a song.
Background
In a karaoke apparatus, the accompaniment of a song, i.e. the song without its singing part, is reproduced, and the singer sings the song along with the reproduced accompaniment. To inform the singer of the lyrics, the lyrics are displayed on a display device such as a monitor. Karaoke devices typically include a music player for playing back the accompaniment, one or more microphone inputs for connecting microphones that capture the singer's voice, means for changing the pitch of the played music in order to adapt the pitch range of the accompaniment to the singer's vocal range, and an audio output for outputting the accompaniment and the captured voice.
Although techniques for karaoke devices generally exist, it is desirable to improve the user experience in karaoke settings.
Disclosure of Invention
According to a first aspect, the present disclosure provides an electronic device comprising circuitry configured to process an accompaniment signal according to a live mode process to obtain an enhanced accompaniment signal.
According to a second aspect, the present disclosure provides a method of processing an accompaniment signal according to a live mode process to obtain an enhanced accompaniment signal.
Further aspects are set out in the dependent claims, the following description and the accompanying drawings.
Drawings
Embodiments are described by way of example with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates the goal of the "live mode" processing;
FIG. 2 schematically illustrates an example of a karaoke system with live mode processing;
fig. 3 schematically illustrates a general method of audio upmixing/remixing by Blind Source Separation (BSS);
fig. 4 schematically shows an embodiment of the live mode processing (17 in fig. 2);
FIG. 5 schematically illustrates an alternative embodiment of the live mode processing (17 in FIG. 2);
fig. 6 schematically shows an embodiment of a process of crowd singing simulation (41 in fig. 4 and 5);
fig. 7 schematically shows a second embodiment (41 in fig. 4 and 5) of the process of crowd singing simulation;
FIG. 8 schematically illustrates an embodiment of the live sound effect processing (42 in FIGS. 4 and 5);
fig. 9 schematically illustrates an embodiment of a microphone crosstalk simulation 82;
fig. 10 schematically shows an embodiment of a jitter simulation 83;
fig. 11a schematically shows a first embodiment of the equalizer 85;
fig. 11b schematically shows a second embodiment of the equalizer 85;
FIG. 12a shows a first embodiment of a sample database 46;
FIG. 12b shows a second embodiment of a sample database 46;
fig. 13 schematically illustrates an embodiment of the primary ambient extraction (PAE) (43 in fig. 4);
FIG. 14 schematically illustrates an embodiment of harmonic-percussive source separation (HPSS) (48 in FIG. 5);
fig. 15 schematically shows an embodiment in which the room simulator 44 is implemented by surround reverberation;
fig. 16 shows an embodiment of a renderer 45 using binaural rendering technique;
fig. 17 shows an embodiment of a renderer based on 2 to 5 channel upmixing;
FIG. 18 schematically illustrates an embodiment of the extended live sound effect processing (42 in FIGS. 4 and 5); and
fig. 19 schematically shows an example of processing performed by the 3D audio renderer 89 in fig. 18;
FIG. 20 provides an embodiment of a 3D audio rendering technique based on a digitized monopole synthesis algorithm;
fig. 21 schematically depicts an embodiment of an electronic device capable of implementing a karaoke system with live mode processing.
Detailed Description
Before a detailed description of the embodiments is given with reference to fig. 1, some general explanations are made.
An embodiment discloses an electronic device including circuitry configured to process an accompaniment signal according to a live mode process to obtain an enhanced accompaniment signal.
The live mode processing may be configured to give the listener of the enhanced accompaniment signal the sensation as if he were part of a concert.
The electronic device may be, for example, any music or movie reproduction device such as a karaoke box, a smart phone, a PC, a TV, a synthesizer, a mixing console, etc.
The circuitry of the electronic device may include a processor (for example a CPU), memory (RAM, ROM, or the like), storage, an interface, etc. The circuitry may include or may be connected to input means (mouse, keyboard, camera, etc.) and output means (a display, e.g. liquid crystal or (organic) light emitting diode, speakers, etc.), a (wireless) interface, and the like, as is well known for electronic devices (computers, smartphones, etc.). Furthermore, the circuitry may include or may be connected to sensors (image sensors, camera sensors, video sensors, etc.) for sensing still image or video data.
The accompaniment may be a residual signal obtained by separating the vocal signal from the audio input signal. For example, the audio input signal may be a piece of music including vocals, a guitar, a keyboard and drums, and the accompaniment signal may be the signal including the guitar, the keyboard and the drums that remains after separating the vocals from the audio input signal.
The live mode processing may be configured to process the accompaniment signal with a room simulator to obtain a reverberation signal. Using the room simulator, a realistic reverberation signal can be created that is added to the karaoke output.
The live mode processing may be configured to process the reverberation signal with a renderer (45) to obtain a rendered reverberation signal. The renderer may be a 3D audio renderer, a binaural renderer or an upmixer. Using a suitable renderer, a realistic reverberation signal can be created that is added to the karaoke output.
The live mode processing may be configured to process the accompaniment signal (s_acc(n)) by primary ambient extraction or by harmonic-percussive source separation to obtain an ambient part or a harmonic part of the accompaniment signal, respectively.
The live mode processing may be configured to process the ambient part or the harmonic part with the room simulator to obtain an ambient reverberation or a harmonic reverberation, respectively.
The live mode process may be controlled by live mode parameters describing the location of the singer and/or by live mode parameters describing the venue.
The live mode processing may be configured to process the vocal signal with a crowd singing simulation to obtain a crowd vocal signal. The crowd singing simulation may create a signal that sounds like a large crowd singing along with the singer. For example, the crowd singing simulation may include multiple pitch and/or formant shift branches.
The live mode processing may be configured to process the accompaniment signal with live sound effect processing to obtain a live accompaniment signal.
The live sound effect processing may include source separation.
Any source separation technique may be applied. For example, blind source separation (BSS), also known as blind signal separation, may be used for the source separation. Blind source separation may include separating a set of source signals from a set of mixed signals. One application of blind source separation (BSS) is separating music into individual instrument tracks, making an upmix or remix of the original content possible.
Instead of blind source separation (BSS), other source separation techniques may be used, such as the out-of-phase stereo (OOPS) technique, etc.
Instead of using source separation techniques on a fully mixed recording, embodiments may also use material that is already available in separated form, e.g. as "vocals/accompaniment" stems or just as "accompaniment" (e.g. because they are special karaoke productions).
The live sound effect processing may also include a microphone crosstalk simulation. The microphone crosstalk simulation may be applied to the individual instrument tracks to simulate the microphone "crosstalk" effects that occur during a live performance because each microphone also captures signals from the other instruments.
The live sound effect processing may also include a jitter simulation. The jitter simulation can model the fact that, in a live performance, the instruments are usually not perfectly time-aligned.
The live sound effect processing may also include audio equalization. The equalization may be modified by a "main EQ" to "live EQ" process.
The live mode processing may include obtaining samples from a sample database. A sample inserter may obtain samples of cheering, applause and crowd noise from a database of pre-recorded samples and randomly insert the samples into a sample audio stream.
The renderer may use information about the current position of the user in the room and/or information about the direction in which the user is looking or leaning.
The electronic device may further include a mixer configured to mix the enhanced accompaniment signal with the user's vocal signal.
Embodiments also relate to a method of processing an accompaniment signal according to a live mode process to obtain an enhanced accompaniment signal as described above.
Embodiments also relate to a computer program comprising instructions which, when executed by a processor, instruct the processor to perform the method described in the embodiments.
In audio source separation, an input signal comprising a plurality of sources (e.g. instruments, voices, etc.) is decomposed into separated parts. Audio source separation may be unsupervised (called "blind source separation", BSS) or partially supervised. "Blind" means that blind source separation does not necessarily have information about the original sources. For example, it may not be known in advance how many sources the original signal contains, or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal into its separated parts without prior knowledge of those parts. A blind source separation unit may use any blind source separation technique known to the skilled person. In (blind) source separation, source signals that are least correlated or most independent in a probability-theoretical or information-theoretical sense may be searched for, or structural constraints on the audio source signals may be found, e.g. based on non-negative matrix factorization. Methods of performing (blind) source separation are known to the skilled person and are based on, for example, principal component analysis, singular value decomposition, (in)dependent component analysis, non-negative matrix factorization, artificial neural networks, etc.
Although some embodiments use blind source separation to generate the separated audio source signals, the present disclosure is not limited to embodiments that use no further information to separate the audio source signals; in some embodiments, further information is used to generate the separated audio source signals. Such further information may be, for example, information about the mixing process, information about the types of audio sources included in the input audio content, information about the spatial positions of the audio sources included in the input audio content, etc.
According to some implementations, the circuitry may be further configured to transpose the audio output signal based on a pitch ratio such that the transposition value corresponds to an integer multiple of a semitone.
Embodiments will now be described by referring to the drawings.
The goal of "live mode" is shown in fig. 1. The left side of fig. 1 shows a user of a karaoke apparatus singing with accompanying music. In the example shown here, the singer uses the device alone at home. The fact that no one has shared the karaoke experience with him detracts from the user's experience. The right side of fig. 1 schematically demonstrates the effect that the improved karaoke apparatus according to the embodiment can produce to the user. In the right hand scenario of fig. 1, the user is given the sensation that he is part of a concert, many others sharing the experience with him.
Karaoke system with live mode processing
Fig. 2 schematically shows an example of a karaoke system with live mode processing. The audio input signal x(n) received from the mono or stereo audio input 13 comprises a mixture of sources (source 1, source 2, ..., source K). The audio input signal x(n) is, for example, the song to which the karaoke performance should be sung and includes the original vocals and an accompaniment comprising a plurality of instruments. The audio input signal x(n) is input to the source separation 14 and decomposed into separated parts (see separated source 2 and residual signal 3 in fig. 3), here into the original vocals s_vocals(n) and the residual signal (the accompaniment s_acc(n)). An exemplary embodiment of the source separation 14 is described with reference to fig. 3 below.
The user's microphone 11 captures the audio input signal y(n). The audio input signal y(n) is, for example, the karaoke signal and includes the user's voice and background sound. The background sound may be any noise captured by the microphone of the karaoke singer, such as street noise, crowd noise, or echoes (feedback) caused by the playback of the karaoke system if the user is not wearing headphones but is using speakers. The audio input signal y(n) is input to the source separation 12 and separated into separated parts (see separated source 2 and residual signal 3 in fig. 3), here into the user's vocals s_user(n) and an unwanted residual signal (not shown in fig. 2). An exemplary embodiment of the source separation 12 is described with reference to fig. 3 below.
The accompaniment s_acc(n) is provided to the live mode processing 17 (described in more detail with reference to fig. 4 below). The live mode processing 17 receives the original vocals s_vocals(n) and the accompaniment s_acc(n) as input. The live mode processing 17 processes the original vocals s_vocals(n) and the accompaniment s_acc(n) and outputs a karaoke output signal s_acc*(n) to the signal adder 18. The signal adder 18 receives the karaoke output signal s_acc*(n) and the user's vocals s_user(n), adds them, and outputs the summed signal to the speaker system 19. The live mode processing further outputs the live mode parameters to a display unit 20, where they are presented to the user. The display unit 20 further receives the lyrics 21 and presents them to the user.
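For illustration only, the following minimal Python sketch (not part of the claimed embodiments; the numpy representation and the function name karaoke_mix are assumptions) shows the role of the signal adder 18: the enhanced accompaniment s_acc*(n) and the captured user vocals s_user(n) are simply summed before playback.

    import numpy as np

    def karaoke_mix(s_acc_star, s_user, user_gain=1.0):
        # Signal adder 18: sum the enhanced accompaniment s_acc*(n) and the user's vocals s_user(n).
        n = min(len(s_acc_star), len(s_user))
        out = s_acc_star[:n] + user_gain * s_user[:n]
        # Simple safeguard against clipping before the speaker system 19 (assumption).
        return np.clip(out, -1.0, 1.0)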
The user's vocals s_user(n) may additionally be processed, for example, by sound effects (not shown in fig. 2). For example, reverberation can be added to the vocals, making them sound more "wet" and therefore blend better with the accompaniment.
In the system of fig. 2, source separation is performed on the audio input signal y (n) in real time. Alternatively, the audio input signal x (n) may be processed in advance, for example, when the audio input signal x (n) is stored in a music library.
In the system of fig. 2, the audio input signal x (n) may be processed, for example, by a Blind Source Separation (BSS) process, as described in more detail in fig. 3 below. In alternative embodiments, other voice separation algorithms, such as out-of-phase stereo (OOPS) techniques, may be used to separate the voice from the accompaniment.
The audio input x(n) may be an audio recording, such as a WAV file, MP3 file, AAC file, WMA file, AIFF file, etc. This means that the audio input x(n) can be real-world audio, i.e. raw audio from a commercial production such as a song that has not been specially prepared. No manual preparation of the karaoke material is required; it can be processed online, fully automatically, with good quality and high realism, so that no pre-prepared audio material is needed in this embodiment.
In other implementations, the audio input x(n) is a MIDI file. In this case, the karaoke system may, for example, transpose in the MIDI domain and render the accompaniment s_acc(n) with a MIDI synthesizer.
The input signal may be any type of audio signal. It may be in the form of an analog signal or a digital signal, may originate from an optical disc, a digital video disc, etc., or may be a data file (such as a wave file, mp3 file, etc.), and the present disclosure is not limited to input audio content of a particular format. The input audio content may be, for example, a stereo audio signal having a first-channel input audio signal and a second-channel input audio signal, although the present disclosure is not limited to input audio content having two audio channels. In other embodiments, the input audio content may include any number of channels, such as a remix of a 5.1 audio signal, and the like.
The input signal may include one or more source signals. In particular, the input signal may comprise several audio sources. An audio source may be any entity that produces sound waves, such as a musical instrument, a voice, vocals, or artificially generated sound (e.g. originating from a synthesizer), and the like.
Blind source separation
Fig. 3 schematically illustrates a general approach to audio upmixing/remixing by blind source separation (BSS). First, audio source separation (also called "demixing") is performed, which decomposes a source audio signal 1 (here the audio input signal x(n)) comprising a plurality of channels I and audio from a plurality of audio sources (source 1, source 2, ..., source K, e.g. instruments, voices, etc.) into "separated parts", here a separated source 2 (e.g. the vocals s_vocals(n)) and a residual signal 3 (e.g. the accompaniment s_acc(n)) for each channel i, where K is an integer denoting the number of audio sources. The residual signal here is the signal obtained after separating the vocals from the audio input signal, i.e. the audio that remains after the vocals have been removed from the input audio signal. However, the embodiments are not limited to this scenario. For example, two DNNs may also be used, yielding two separated parts ("vocals", "accompaniment") and a further residual (the error introduced by the DNNs).
In the present embodiment, the source audio signal 1 is a stereo signal having two channels i=1 and i=2. The separated source 2 and the residual signal 3 are then remixed and rendered into a new loudspeaker signal 4, here a signal comprising five channels 4a to 4e, i.e. a 5.0 channel system. The audio source separation may, for example, be implemented as described in more detail in the published paper: Uhlich, Stefan, et al., "Improving music source separation based on deep neural networks through data augmentation and network blending", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017.
Since the separation of the audio source signals may be imperfect, for example because of the mixing of the audio sources, a residual signal 3 (r(n)) is generated in addition to the separated audio source signals 2a to 2d. The residual signal may, for example, represent the difference between the input audio content and the sum of all separated audio source signals. The audio signal emitted by each audio source is represented in the input audio content 1 by its corresponding recorded sound waves. For input audio content having more than one audio channel, such as stereo or surround-sound input audio content, spatial information about the audio sources is typically also included in or represented by the input audio content, e.g. by the proportions with which an audio source signal is included in the different audio channels. The separation of the input audio content 1 into the separated audio source signals 2a to 2d and the residual 3 is performed on the basis of blind source separation or other techniques capable of separating audio sources.
In a second step, the separated parts 2a to 2d and possibly the residual 3 are remixed and rendered into a new loudspeaker signal 4, here a signal comprising five channels 4a to 4e, i.e. a 5.0 channel system. The output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of spatial information. The output audio content is shown schematically in fig. 3 and denoted by reference numeral 4.
The audio input x(n) and the audio input y(n) may each be separated by the method described with reference to fig. 3, wherein the audio input y(n) is separated into the user's vocals s_user(n) and unwanted background sound, and the audio input x(n) is separated into the original vocals s_vocals(n) and the accompaniment s_acc(n). The accompaniment s_acc(n) may be further separated into individual tracks, such as drums, piano, strings, etc. (see 81 in figs. 8 and 18). Separating the vocals greatly increases the flexibility with which the accompaniment and the vocals can be processed.
Another method of removing the accompaniment from the audio input y(n) is, for example, crosstalk cancellation, in which a reference of the accompaniment is subtracted in phase from the microphone signal, for example by means of adaptive filtering.
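As an illustration of such crosstalk cancellation by adaptive filtering, the following sketch uses a normalized LMS (NLMS) filter; the filter length, step size and variable names are assumptions and are not taken from the embodiments.

    import numpy as np

    def nlms_cancel(mic, ref, taps=256, mu=0.1, eps=1e-8):
        # Estimate the accompaniment leakage in the microphone signal from the
        # accompaniment reference and subtract it, leaving mainly the user's voice.
        w = np.zeros(taps)
        out = np.zeros(len(mic))
        for n in range(taps - 1, len(mic)):
            x = ref[n - taps + 1:n + 1][::-1]   # most recent reference samples
            e = mic[n] - w @ x                  # microphone minus estimated leakage
            w += mu * e * x / (x @ x + eps)     # NLMS weight update
            out[n] = e                          # first taps-1 samples remain zero
        return out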
Live mode processing
Fig. 4 schematically shows an embodiment of the live mode processing (17 in fig. 2). The live mode processing receives the original vocals s_vocals(n) and the accompaniment s_acc(n) obtained by the source separation (14 in fig. 2) as input. The live mode processing processes the original vocals s_vocals(n) and the accompaniment s_acc(n) and outputs a karaoke output signal s_acc*(n) that can be output through the speaker system (19 in fig. 2).
The original vocals s_vocals(n) obtained by the source separation (14 in fig. 2) are processed by the crowd singing simulation 41 to obtain the crowd vocals s_crowd(n). The crowd singing simulation 41 creates a signal that sounds like a (large) crowd singing along (see fig. 6 and the corresponding description). To create an enhanced accompaniment signal, the accompaniment s_acc(n) obtained by the source separation (14 in fig. 2) is processed by the live sound effect processing 42 to obtain a live accompaniment s_acc_live(n). The accompaniment s_acc(n) is further processed by primary ambient extraction (PAE) 43 to obtain the ambient part s_amb(n) of the accompaniment s_acc(n). The ambient part s_amb(n) is further processed by a room simulator 44 to obtain an ambient reverberation s_amb_rev(n). The ambient reverberation s_amb_rev(n) is further processed by a renderer 45 (e.g. a binaural renderer as described with reference to fig. 16, or an upmixer as described with reference to fig. 17) to obtain a rendered ambient reverberation s_amb_rev,3D(n). Using the room simulator 44 and a suitable renderer 45, a realistic reverberation signal is created that is added to the karaoke output.
The crowd vocals s_crowd(n) obtained by the crowd singing simulation 41 are gain- and delay-adjusted in GAIN/DELAY 1. The original vocals s_vocals(n) are gain- and delay-adjusted in GAIN/DELAY 2. The live accompaniment s_acc_live(n) obtained by applying the live sound effect processing 42 to the accompaniment s_acc(n) is gain- and delay-adjusted in GAIN/DELAY 3. The ambient reverberation s_amb_rev(n) is gain- and delay-adjusted in GAIN/DELAY 4. The samples s_samples(n) obtained from the sample database 46 are gain- and delay-adjusted in GAIN/DELAY 5. It should be noted that the gain of the direct path of the vocals s_vocals(n) (GAIN/DELAY 2) is typically very small because, in a karaoke system, the vocals should be removed. However, a little of the original vocals s_vocals(n) can be left in the output to assist the user in singing, or in case the user decides to sing along with the original singer together with the crowd.
The gain/delay-adjusted crowd vocals s_crowd(n), the gain/delay-adjusted original vocals s_vocals(n), the gain/delay-adjusted live accompaniment s_acc_live(n), the gain/delay-adjusted ambient reverberation s_amb_rev(n) and the gain/delay-adjusted samples s_samples(n) are mixed by the mixer 47 to obtain the karaoke output signal s_acc*(n), which can be output together with the user's vocals s_user(n) through the speaker system (19 in fig. 2) (see fig. 2).
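A minimal sketch of the GAIN/DELAY stages and the mixer 47 is given below (Python/numpy; sample-based delays and the helper names are assumptions):

    import numpy as np

    def gain_delay(x, gain, delay_samples):
        # One GAIN/DELAY stage: scale the branch signal and shift it later in time.
        return gain * np.concatenate([np.zeros(delay_samples), x])

    def mixer_47(branches):
        # branches: list of (signal, gain, delay_in_samples) tuples, one per path.
        delayed = [gain_delay(x, g, d) for x, g, d in branches]
        out = np.zeros(max(len(x) for x in delayed))
        for x in delayed:
            out[:len(x)] += x                   # sum all branches into s_acc*(n)
        return out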
Fig. 5 schematically shows an alternative embodiment of the live mode processing (17 in fig. 2). The difference between the embodiment of fig. 4 and the embodiment of fig. 5 is that the embodiment of fig. 5 replaces the primary ambient extraction (PAE) of the embodiment of fig. 4 with harmonic-percussive source separation (HPSS).
The accompaniment s_acc(n) is processed by harmonic-percussive source separation (HPSS) 48 to obtain the harmonic part s_harm(n) of the accompaniment s_acc(n). The harmonic part s_harm(n) is further processed by the room simulator 44 to obtain a harmonic reverberation s_harm_rev(n). The harmonic reverberation s_harm_rev(n) is further processed by the renderer 45 (e.g. a binaural renderer as described with reference to fig. 16, or an upmixer as described with reference to fig. 17) to obtain a rendered harmonic reverberation s_harm_rev,3D(n).
The gain/delay-adjusted crowd vocals s_crowd(n), the gain/delay-adjusted original vocals s_vocals(n), the gain/delay-adjusted live accompaniment s_acc_live(n), the gain/delay-adjusted harmonic reverberation s_harm_rev(n) and the gain/delay-adjusted samples s_samples(n) are mixed by the mixer 47 to obtain the karaoke output signal s_acc*(n), which can be output together with the user's vocals s_user(n) through the speaker system (19 in fig. 2) (see fig. 2).
The live mode processing described above with reference to figs. 4 and 5 may be controlled by the user of the karaoke system via a user interface through live mode parameter presets.
For example, a first live mode parameter, SINGER LOCATION, may allow the singer location to be selected, e.g. SINGER LOCATION = "on stage" or "in audience". In the "on stage" setting, the live mode gives the feeling of being the singer of a band, with cheering coming from the front and the instruments from the side/back. In the "in audience" setting, the live mode gives the feeling of singing within the crowd, with the instruments perceived as coming from the front and cheering from the side/back.
A second live mode parameter, VENUE, may define the venue and may affect the perceived crowd size (number of people) and the perceived stage/concert hall size (reverberation time of the signal). For example, VENUE = "Wembley Stadium", "Royal Albert Hall", "club" or "bar". The setting "Wembley Stadium" may simulate the atmosphere of a large stadium (up to 90000 guests), the setting "Royal Albert Hall" may simulate the atmosphere of a large concert hall (up to 9500 guests), the setting "club" may simulate the atmosphere of a medium-sized club (up to 200 guests), and the setting "bar" may simulate the atmosphere of a bar (up to 50 guests).
Crowd singing simulation
The crowd singing simulation creates a crowd singing audio signal s_crowd(n) from the extracted vocal track s_vocals(n). This can be achieved by using strong reverberation and by creating many different pitch-shifted and delayed versions that are superimposed (similar to "vocal doubling").
Fig. 6 schematically shows an embodiment of the crowd singing simulation (41 in figs. 4 and 5). The crowd singing simulation 41 processes the original vocals s_vocals(n) to obtain the crowd vocals s_crowd(n). The original vocals s_vocals(n) are fed to a number N = N_crowd of pitch shifters 61-1 to 61-N. Each pitch shifter 61-1 to 61-N shifts the pitch of the original vocals s_vocals(n) by a corresponding predetermined percentage p_i (i = 1...N). The pitch-shifted vocals are fed to a number N of formant shifters 62-1 to 62-N. Each formant shifter 62-1 to 62-N shifts the formants of the pitch-shifted vocals by a predetermined amount f_i (i = 1...N). The pitch- and formant-shifted vocals are fed to a number N of gain/delay stages 63-1 to 63-N. Each gain/delay stage 63-1 to 63-N adjusts the gain and the delay of its branch by a predefined gain g_i and delay δt_i (i = 1...N), respectively. The branches processed in this way are then mixed by the mixer 64, and the mixed vocals are processed by a reverberation 65, which adds reverberation to obtain the crowd vocals s_crowd(n).
The number N_crowd of parallel pitch/formant shift branches may, for example, be selected according to the predefined live mode parameter VENUE, which defines the venue and affects the perceived crowd size (number of people) and the stage/concert hall size (reverberation time of the signal). For example, if VENUE = "Wembley Stadium", N_crowd may be set to N_crowd = 200; if VENUE = "Royal Albert Hall", N_crowd may be set to N_crowd = 100; if VENUE = "club", N_crowd may be set to N_crowd = 50; and if VENUE = "bar", N_crowd may be set to N_crowd = 20.
For example, the percentages p_i (i = 1...N) for the pitch shift may be selected randomly according to a Gaussian distribution centered at p_i = 1 (no pitch shift) with a predetermined standard deviation of 100 cents. Similarly, the parameters f_i (i = 1...N) for the formant shift may, for example, be selected randomly according to a Gaussian distribution centered at f_i = 1 (no formant shift), with a predetermined standard deviation depending on the chosen formant shift algorithm.
The delay δt_i of each pitch/formant shift branch may, for example, be selected randomly from the interval [0, 0.5 s], where 0 represents a person very close to the singer on stage and 0.5 s represents a person far away from the singer on stage, or a person singing somewhat too late. To simulate the fact that the number of people within a distance r from the singer on stage grows approximately with r² (assuming a uniform distribution of people over the venue), the random number generator may be configured to prefer larger delays over smaller delays. In addition, the interval from which δt_i is selected may depend on the venue. For example, if VENUE = "Wembley Stadium", δt_i may be selected from the interval [0, 0.5 s]; if VENUE = "Royal Albert Hall", δt_i may be selected from the interval [0, 0.3 s]; if VENUE = "club", δt_i may be selected from the interval [0, 0.2 s]; and if VENUE = "bar", δt_i may be selected from the interval [0, 0.1 s].
For example, the gains g_i (i = 1...N) can be set randomly to a number between 0.5 and 1.5, where g_i > 1 represents an increase and g_i < 1 a decrease in the loudness of a voice. The gain g_i may also be coupled to the delay δt_i, in order to simulate the effect that a person located farther away is heard more quietly and with a larger delay, e.g. by reducing the gain g_i for larger delays δt_i.
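A possible way to draw these per-branch parameters is sketched below (Python/numpy). The distributions follow the description above; the formant-shift standard deviation and the gain/delay coupling factor are assumed example values.

    import numpy as np
    rng = np.random.default_rng()

    def crowd_branch_params(n_crowd=200, max_delay=0.5):
        cents = rng.normal(0.0, 100.0, n_crowd)        # pitch offset, std. dev. 100 cents
        p = 2.0 ** (cents / 1200.0)                    # pitch-shift ratio p_i centered at 1
        f = rng.normal(1.0, 0.05, n_crowd)             # formant-shift factor f_i (assumed std. dev.)
        # The number of people within radius r grows roughly with r^2 (uniform density),
        # so draw r with a density proportional to r via inverse-transform sampling.
        r = np.sqrt(rng.random(n_crowd))
        dt = max_delay * r                             # delay in seconds, larger delays preferred
        g = rng.uniform(0.5, 1.5, n_crowd) * (1.0 - 0.3 * r)  # distant singers quieter (assumed coupling)
        return p, f, dt, g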
The parameters controlling the crowd singing simulation may also be affected by the live mode parameter SINGER LOCATION. For example, if SINGER LOCATION = "on stage", the delay δt_i of each pitch/formant shift branch may be chosen, for example, in [0.1 s, 0.5 s] to account for the fact that the singer is on stage and therefore at some distance from the crowd, whereas if SINGER LOCATION = "in audience", the delay δt_i of each pitch/formant shift branch may be chosen, for example, in [0, 0.3 s] to account for the fact that the singer is surrounded by the crowd and some members of the crowd are very close to the singer.
The processing of the reverberation 65 may depend on the live mode parameter VENUE, which defines the venue and affects the perceived size of the stage/concert hall (reverberation time of the signal). For example, if VENUE = "Wembley Stadium", a convolution reverberation based on a pre-recorded Wembley Stadium impulse response may be applied; if VENUE = "Royal Albert Hall", a convolution reverberation based on a pre-recorded Royal Albert Hall impulse response may be applied; if VENUE = "club", a convolution reverberation based on a pre-recorded club impulse response may be applied; and if VENUE = "bar", a convolution reverberation based on a pre-recorded bar impulse response may be applied. Instead of a convolution reverberation, an algorithmic reverberation with appropriate room size parameter settings may be used.
In the embodiment of fig. 6, the reverberation 65 processes the mixed signal. In an alternative implementation shown in fig. 7, a surround reverberation 66 is applied to the individual pitch/formant shift branches. The surround reverberation algorithm allows each individual source (each pitch/formant shift branch) to be placed at a specific location in the simulated venue. With the surround reverberation 66, the simulated individuals of the crowd can be placed within the venue according to the actual locations of people within a real venue. This makes the reverberation effect more realistic.
Live sound effects
Fig. 8 schematically illustrates an embodiment of the live sound effect processing (42 in figs. 4 and 5). The live sound effect processing 42 processes the accompaniment s_acc(n) to obtain a live accompaniment s_acc_live(n).
The accompaniment s_acc(n) is processed by source separation 81 to obtain separated tracks s_inst,1(n) to s_inst,N(n) of the individual sources (instruments) within the accompaniment s_acc(n). A microphone "crosstalk" simulation 82 is applied to the individual instrument tracks to simulate the microphone "crosstalk" effects that occur during a live performance because each microphone also captures signals from the other instruments. The resulting instrument tracks s_inst_bleed,1(n) to s_inst_bleed,N(n) are further processed by a jitter simulation 83, which simulates the fact that, in a live performance, the instruments are usually not perfectly time-aligned. The resulting instrument tracks s_inst_jitter,1(n) to s_inst_jitter,N(n) are then remixed by a mixer 84. The remixed signal s_inst_mix(n) is then further processed by an equalizer 85, which modifies the equalization using a "main EQ" to "live EQ" process.
Fig. 9 schematically illustrates an embodiment of the microphone crosstalk simulation 82. The microphone crosstalk simulation 82 receives the instrument signals s_inst,1(n) to s_inst,N(n) from the source separation (81 in fig. 8). The mixer 91-1 mixes the instrument signal s_inst,1(n) with the instrument signals s_inst,2(n) to s_inst,N(n), adding -12 dB of microphone crosstalk, to obtain an instrument signal s_inst_bleed,1(n) that includes simulated microphone crosstalk. The mixer 91-2 mixes the instrument signal s_inst,2(n) with the instrument signals s_inst,1(n) and s_inst,3(n) to s_inst,N(n), adding -12 dB of microphone crosstalk, to obtain an instrument signal s_inst_bleed,2(n) that includes simulated microphone crosstalk. The mixer 91-N mixes the instrument signal s_inst,N(n) with the instrument signals s_inst,1(n) to s_inst,N-1(n), adding -12 dB of microphone crosstalk, to obtain an instrument signal s_inst_bleed,N(n) that includes simulated microphone crosstalk.
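The following sketch (Python/numpy; the array layout is an assumption) reproduces this scheme: each output track is its own instrument plus all other instruments attenuated to -12 dB.

    import numpy as np

    def mic_bleed(tracks, bleed_db=-12.0):
        # tracks: array of shape (N, samples), one row per separated instrument.
        g = 10.0 ** (bleed_db / 20.0)          # -12 dB expressed as a linear gain
        tracks = np.asarray(tracks, dtype=float)
        total = tracks.sum(axis=0)
        # each output track = own track + g * (sum of all other tracks)
        return tracks + g * (total - tracks)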
Fig. 10 schematically shows an embodiment of the jitter simulation 83. The instrument signal s_inst_bleed,1(n) obtained by the microphone crosstalk simulation (82 in fig. 8) is delayed by the delay 101-1 to obtain the instrument signal s_inst_jitter,1(n). The instrument signal s_inst_bleed,2(n) obtained by the microphone crosstalk simulation is delayed by the delay 101-2 to obtain the instrument signal s_inst_jitter,2(n). The instrument signal s_inst_bleed,N(n) obtained by the microphone crosstalk simulation is delayed by the delay 101-N to obtain the instrument signal s_inst_jitter,N(n). The delays 101-1 to 101-N are configured to slightly delay/advance each instrument by a random time span. The time span may, for example, be selected randomly from the interval [-100 ms, +100 ms]. It should be noted that this time span may change during the song, i.e. it is not constant but may vary over time to increase the perception of a live performance.
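A simplified jitter simulation is sketched below (Python/numpy); it applies one constant random offset per track, whereas, as noted above, the offset may also vary slowly over time.

    import numpy as np
    rng = np.random.default_rng()

    def jitter(tracks, fs, max_shift_s=0.1):
        # Shift each instrument track by a random offset in [-100 ms, +100 ms].
        out = []
        for x in tracks:
            shift = int(rng.uniform(-max_shift_s, max_shift_s) * fs)
            y = np.roll(x, shift)
            if shift > 0:
                y[:shift] = 0.0                # zero the samples wrapped around by np.roll
            elif shift < 0:
                y[shift:] = 0.0
            out.append(y)
        return out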
Fig. 11a schematically shows a first embodiment of the equalizer 85. A static equalizer 111 processes the instrument mix s_inst_mix(n) obtained by the remixing (84 in fig. 8) to obtain the live accompaniment s_acc_live(n). The static equalizer 111 modifies the equalization using a parametric/graphic EQ to change the equalization from "main EQ" to "live EQ".
Fig. 11b schematically shows a second embodiment of the equalizer 85. A dynamic equalizer 112 processes the instrument mix s_inst_mix(n) obtained by the remixing (84 in fig. 8) to obtain the live accompaniment s_acc_live(n). The dynamic equalizer 112 is controlled by a DNN 113, which has learned to convert the "main EQ" into a "live EQ".
The live sound effects shown above for processing the accompaniment s_acc(n) to obtain the live accompaniment s_acc_live(n) are given by way of example only. The individual live effects (crosstalk simulation 82, jitter simulation 83, live EQ 85) may be applied individually or in combination. The embodiments are not limited to the selection of live sound effects shown in the embodiment of fig. 8.
In addition, other live sound effects (not shown in fig. 8) may be applied to the accompaniment s_acc(n) to obtain the live accompaniment s_acc_live(n). For example, an acceleration module may be configured to speed up the accompaniment s_acc(n) in order to simulate the effect that live performances are typically played slightly faster than the recordings used as the basis for the karaoke system. It should be noted, however, that if the live sound effect processing (42 in figs. 4 and 5) includes speeding up the accompaniment s_acc(n), the same acceleration should also be applied to the vocal track s_vocals(n) (fed to the mixer 47 in figs. 4 and 5) and to the crowd singing simulation 41, which is based on the vocal track, in order to keep the vocals synchronized with the accompaniment. The same applies to the reverberation path (43, 44, 45 in figs. 4 and 5), which should also receive the accelerated accompaniment s_acc(n).
Sample database
Fig. 12a shows a first embodiment of the sample database 46. A sample inserter 142 obtains samples of cheering, applause and crowd noise from a database 143 of pre-recorded samples and randomly inserts the samples into a sample audio stream s_samples(n). The sample inserter 142 may be configured to randomly add samples of cheering, applause, crowd noise and the like during playback of a song and between songs. The sample audio stream s_samples(n) may then be added directly to the karaoke output signal (see the mixer 47 in figs. 4 and 5).
The sample inserter 142 may further be configured to evaluate the live mode parameter SINGER LOCATION. For example, if SINGER LOCATION = "in audience", the sample inserter 142 may select stronger samples than for SINGER LOCATION = "on stage". In addition, the sample inserter 142 may render the samples to different locations depending on the SINGER LOCATION parameter (e.g. "clapping" perceived from the front versus "clapping" perceived from all around). The sample inserter 142 may further be configured to evaluate the live mode parameter VENUE, which defines the venue and may affect the perceived crowd size (number of people) and stage/concert hall size (reverberation time of the signal). For example, if VENUE = "Wembley Stadium", the sample inserter 142 may select samples from a first set of samples; if VENUE = "Royal Albert Hall", from a second set of samples; if VENUE = "club", from a third set of samples; and if VENUE = "bar", from a fourth set of samples.
Fig. 12b shows a second embodiment of the sample database 46. An event detector 141 detects events in the accompaniment s_acc(n). Such events may be, for example, the start of a song, the end of a song, the start of a chorus, an intensity climax in a song, etc. Based on the detected events, the sample inserter 142 obtains samples of cheering, applause and crowd noise from the database 143 of pre-recorded samples and inserts the samples into the sample audio stream s_samples(n). In this way, the sample inserter can select background samples suitable for the current situation (e.g. shouting of the crowd before a song starts, or frenetic applause and screaming after a song ends) to be mixed into the karaoke output signal.
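A minimal sketch of the event-driven sample insertion is given below (Python/numpy); the event representation and the structure of the sample bank are assumptions.

    import numpy as np
    rng = np.random.default_rng()

    def insert_samples(n_samples_total, fs, events, sample_bank):
        # events: list of (time_in_seconds, kind) pairs from the event detector 141,
        # sample_bank: dict mapping kind (e.g. "applause") to a list of recorded clips.
        stream = np.zeros(n_samples_total)
        for t, kind in events:
            clips = sample_bank[kind]
            clip = clips[rng.integers(len(clips))]     # pick a random clip of that kind
            start = int(t * fs)
            end = min(start + len(clip), n_samples_total)
            stream[start:end] += clip[:end - start]    # mix into the sample stream s_samples(n)
        return stream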
Primary ambient extraction (PAE)
Fig. 13 schematically shows an embodiment of the primary ambient extraction (PAE) (43 in fig. 4). The primary ambient extraction (PAE) 43 is configured to decompose the accompaniment s_acc(n) into a primary component s_acc_primary(n) and an ambient component s_acc_ambient(n) based on their directional and diffuse spatial features, respectively. A common multichannel PAE method is principal component analysis (PCA). An implementation of the PAE 43 can be found, for example, in Carlos Avendano, "A Frequency-Domain Approach to Multichannel Upmix", J. Audio Eng. Soc., vol. 52, no. 7/8, July/August 2004 (reference [1]).
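A strongly simplified PCA-based primary ambient extraction for a stereo accompaniment is sketched below (Python with numpy/scipy); it is not the method of reference [1], but it projects each frequency bin of the stereo STFT onto its principal direction and keeps the residual as the ambient part.

    import numpy as np
    from scipy.signal import stft, istft

    def pae_pca(left, right, fs, nperseg=2048):
        _, _, L = stft(left, fs, nperseg=nperseg)
        _, _, R = stft(right, fs, nperseg=nperseg)
        X = np.stack([L, R])                           # shape (2, bins, frames)
        amb = np.empty_like(X)
        for k in range(X.shape[1]):                    # per frequency bin
            C = np.cov(X[:, k, :])                     # 2x2 covariance over frames
            _, V = np.linalg.eigh(C)                   # eigenvectors, ascending eigenvalues
            v = V[:, -1:]                              # principal (primary) direction
            primary = v @ (v.conj().T @ X[:, k, :])
            amb[:, k, :] = X[:, k, :] - primary        # ambient = residual
        _, amb_l = istft(amb[0], fs, nperseg=nperseg)
        _, amb_r = istft(amb[1], fs, nperseg=nperseg)
        return amb_l, amb_r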
Harmonic-percussive source separation (HPSS)
Fig. 14 schematically illustrates an embodiment of the harmonic-percussive source separation (HPSS) (48 in fig. 5). The harmonic-percussive source separation (HPSS) 48 is configured to decompose the accompaniment s_acc(n) into a signal consisting of all harmonic sounds and another signal consisting of all percussive sounds. The HPSS 48 exploits the observation that, in a spectrogram representation of the input signal, harmonic sounds tend to form horizontal structures (in the time direction), while percussive sounds form vertical structures (in the frequency direction). The HPSS 48 may, for example, implement the method described in Fitzgerald, Derry, "Harmonic/percussive separation using median filtering", Proceedings of the International Conference on Digital Audio Effects (DAFx), vol. 13, 2010.
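A compact sketch of median-filtering HPSS in the spirit of Fitzgerald (2010) is given below (Python with numpy/scipy); the window and kernel sizes are assumed example values.

    import numpy as np
    from scipy.signal import stft, istft
    from scipy.ndimage import median_filter

    def hpss(x, fs, nperseg=2048, kernel=17):
        _, _, X = stft(x, fs, nperseg=nperseg)
        mag = np.abs(X)
        harm = median_filter(mag, size=(1, kernel))     # smooth along time: horizontal structures
        perc = median_filter(mag, size=(kernel, 1))     # smooth along frequency: vertical structures
        mask_h = harm**2 / (harm**2 + perc**2 + 1e-12)  # Wiener-style soft mask for the harmonic part
        _, x_harm = istft(mask_h * X, fs, nperseg=nperseg)
        _, x_perc = istft((1.0 - mask_h) * X, fs, nperseg=nperseg)
        return x_harm, x_perc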
Room simulator
The live mode can be enhanced by adding a realistic reverberation. Using the room simulator 44 together with a suitable rendering algorithm, the user can be given the perception of a room/concert hall.
As shown in the embodiments of fig. 4 and 5, respectively, it may be beneficial to create a reverberant signal for only the ambient/harmonic parts of the accompaniment. However, the room simulator 44 may also operate directly on accompaniment without applying environmental or harmonic separation (PAE or HPSS).
The room simulator 44 is configured to add reverberation to the accompaniment s_acc(n), to the ambient part s_acc_amb(n) of the accompaniment, or to the harmonic part s_acc_harm(n) of the accompaniment, depending on whether PAE or HPSS is applied (or neither is applied). A convolution reverberation may be used, or an algorithmic reverberation with appropriate room size parameter settings may be used.
The processing of the room simulator 44 may depend on the live mode parameter VENUE, which defines the venue and affects the perceived size of the stage/concert hall (reverberation time of the signal). For example, if VENUE = "Wembley Stadium", a convolution reverberation based on a pre-recorded Wembley Stadium impulse response may be applied; if VENUE = "Royal Albert Hall", a convolution reverberation based on a pre-recorded Royal Albert Hall impulse response may be applied; if VENUE = "club", a convolution reverberation based on a pre-recorded club impulse response may be applied; and if VENUE = "bar", a convolution reverberation based on a pre-recorded bar impulse response may be applied.
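A convolution-based room simulator can be sketched as follows (Python with numpy/scipy); the mapping from the VENUE parameter to pre-recorded impulse responses is an assumption. The returned wet signal corresponds to the reverberation that is later gain/delay-adjusted and mixed into the karaoke output.

    import numpy as np
    from scipy.signal import fftconvolve

    def room_simulator(x, venue, impulse_responses):
        # impulse_responses: dict mapping a venue name (e.g. "Wembley Stadium", "bar")
        # to a pre-recorded impulse response array.
        ir = impulse_responses[venue]
        return fftconvolve(x, ir)[:len(x)]     # wet reverberation signal, e.g. s_amb_rev(n)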
Fig. 15 schematically shows an embodiment in which the room simulator 44 is implemented by a surround reverberation. The surround reverberation algorithm 153 allows each individual source s_inst,1(n) to s_inst,N(n), obtained by source separation 151 and PAE 152, to be placed at a specific location in the simulated venue. When the surround reverberation 153 is used as the room simulator 44, the ambient part (or harmonic part, or the full signal itself) of each instrument within the accompaniment s_acc(n) may be placed within the venue according to the actual location of the instrument on the stage. This makes the reverberation effect more realistic.
Binaural renderer
If headphone playback is used, binaural rendering may be used to model audio sources from a particular direction.
Fig. 16 shows an embodiment of the renderer 45 using a binaural rendering technique. The binaural renderer 45 processes the reverberation signal s_amb_rev(n) (see the embodiment of fig. 4) or s_harm_rev(n) (see the embodiment of fig. 5) obtained by the room simulator 44 to obtain the rendered ambient reverberation s_amb_rev,3D(n) or the rendered harmonic reverberation s_harm_rev,3D(n), respectively. The binaural renderer 45 includes a binaural processor 162, which performs binaural processing based on head-related impulse responses (HRIRs) 161 that have been predetermined based on a measured or modeled head of the user of the karaoke system. The binaural processing 162 involves convolving the source signals s_rev,1(n) to s_rev,N(n) with the measured or modeled head-related impulse responses (HRIRs) 161.
Instead of the Head Related Impulse Response (HRIR), a Binaural Room Impulse Response (BRIR) may also be used.
Binaural audio is typically played via stereo headphones.
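A minimal sketch of such binaural processing, assuming that a left/right HRIR (or BRIR) pair per source direction is already available, could look as follows; the acquisition of the impulse responses is not shown.

import numpy as np
from scipy.signal import fftconvolve

def binaural_render(sources, hrirs_left, hrirs_right):
    """sources: list of 1-D arrays s_rev,1(n)..s_rev,N(n);
    hrirs_left/right: matching lists of HRIRs for each source direction."""
    n = max(len(s) + len(h) - 1 for s, h in zip(sources, hrirs_left))
    out = np.zeros((2, n))                     # two-channel headphone signal
    for s, h_l, h_r in zip(sources, hrirs_left, hrirs_right):
        out[0, : len(s) + len(h_l) - 1] += fftconvolve(s, h_l)   # left ear
        out[1, : len(s) + len(h_r) - 1] += fftconvolve(s, h_r)   # right ear
    return out                                 # played back via stereo headphones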
2-5 channel upmix
Fig. 17 shows an embodiment of the renderer 45 based on a 2-to-5 channel upmix. The accompaniment s_acc(n) consists of a left stereo channel s_acc,L(n) and a right stereo channel s_acc,R(n). The 2-to-3 upmix 171 processes the left stereo channel s_acc,L(n) and the right stereo channel s_acc,R(n) of the accompaniment to obtain an output channel s_acc,SPK1(n) for the front left speaker SPK1, an output channel s_acc,SPK2(n) for the center speaker SPK2, and an output channel s_acc,SPK3(n) for the front right speaker SPK3. To derive the front channels, the unmixing and repanning techniques of section 4 of reference [1] may be used.
The left stereo channel s_acc,L(n) and the right stereo channel s_acc,R(n) of the accompaniment are further processed by the primary ambient extraction (PAE) 43. The primary ambient extraction (PAE) 43 is configured to extract the ambient components s_amb,L(n) and s_amb,R(n) from the left stereo channel s_acc,L(n) and the right stereo channel s_acc,R(n) of the accompaniment. The ambient component s_amb,L(n) is processed by an all-pass filter G_L(z)z^-D, and the ambient component s_amb,R(n) is processed by an all-pass filter G_R(z)z^-D, in order to decorrelate them from the ambient components in the front channels, as described in section 5 of reference [1]. This minimizes the creation of ghost images on the sides. The filtered ambient components s_amb,L(n) and s_amb,R(n) are output through the left rear speaker SPK4 and the right rear speaker SPK5, respectively.
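A simplified sketch of this 2-to-5 channel upmix path is given below; the mid/side-like primary/ambient split is only a crude stand-in for the PAE 43 (cf. reference [1]), and a plain delay stands in for the all-pass decorrelators G_L(z)z^-D and G_R(z)z^-D.

import numpy as np

def upmix_2_to_5(s_l, s_r, d=441):
    center = 0.5 * (s_l + s_r)                 # 2-to-3 upmix: derived center channel
    # crude ambient estimate: the uncorrelated (side) components
    amb_l = s_l - center
    amb_r = s_r - center
    # decorrelate the ambience from the front channels by a D-sample delay
    amb_l_d = np.concatenate([np.zeros(d), amb_l])[: len(amb_l)]
    amb_r_d = np.concatenate([np.zeros(d), amb_r])[: len(amb_r)]
    return {"SPK1": s_l,       # front left
            "SPK2": center,    # center
            "SPK3": s_r,       # front right
            "SPK4": amb_l_d,   # rear left
            "SPK5": amb_r_d}   # rear right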
Use of positioning and orientation information
Fig. 18 schematically illustrates an embodiment of the extended live sound effect processing (42 in figs. 4 and 5). As in the embodiment of fig. 8, the live sound effect processing 42 processes the accompaniment s_acc(n) to obtain a live accompaniment s_acc_live(n). The accompaniment s_acc(n) is processed by source separation 81 to obtain the separated tracks s_inst,1(n) to s_inst,N(n) of the individual sources (instruments) within the accompaniment s_acc(n). A microphone "crosstalk" simulation 82 is applied to the individual instrument tracks to simulate the microphone crosstalk ("bleed") effects that occur during a live performance because each microphone also captures signals from the other instruments. The resulting instrument tracks s_inst-bleed,1(n) to s_inst-bleed,N(n) are further processed by the jitter simulation 83, which simulates the fact that in a live performance the instruments are generally not perfectly time-aligned. The resulting instrument tracks s_inst-jitter,1(n) to s_inst-jitter,N(n) are then processed by the 3D audio renderer 89, which generates a 3D audio accompaniment s_acc-3D(n) from these instrument tracks. The 3D audio renderer 89 uses information about the current position of the user in the room, or about the direction of his gaze or tilt, to position the user on the virtual stage. By using this information, the rendering of the individual instruments can be influenced. For example, assume that the singer (= user) has a guitar on his right. If he now gazes/tilts towards the right, the amplitude of the guitar track is increased, as would be the case in the real world. In this way, the user's experience is improved, since he can also interact with the individual instruments.
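The microphone crosstalk simulation 82 and the jitter simulation 83 could, for example, be sketched as follows on the separated tracks s_inst,1(n) to s_inst,N(n); the bleed gain and the jitter range are illustrative assumptions.

import numpy as np

def mic_bleed(tracks, bleed=0.05):
    # every instrument track picks up a small amount of every other track
    mix = np.sum(tracks, axis=0)
    return [t + bleed * (mix - t) for t in tracks]

def jitter(tracks, fs, max_ms=15.0, rng=np.random.default_rng(0)):
    out = []
    for t in tracks:
        # per-instrument random delay, modelling imperfect time alignment
        shift = int(rng.uniform(0.0, max_ms * 1e-3) * fs)
        out.append(np.concatenate([np.zeros(shift), t])[: len(t)])
    return out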
The 3D audio accompaniment s_acc-3D(n) obtained by the 3D audio renderer 89 can then be mixed with appropriate 3D audio signals from the other branches of the karaoke system. In this case, for example, the crowd singing simulation of fig. 7 may be applied, using surround reverberation so that the generated crowd signal matches the 3D audio accompaniment s_acc-3D(n) obtained by the live sound effect processing of fig. 18. Likewise, a suitable 3D audio renderer may be applied in the reverberation path (45 in figs. 4 and 5). The 3D audio rendering may be implemented, for example, with binaural technology (if the karaoke output is via headphones) or by a 5.1 or 7.1 upmix (if the karaoke output is via a 5.1 or 7.1 speaker system).
Fig. 19 schematically shows an example of the processing performed by the 3D audio renderer 89 of fig. 18. The user 191 of the karaoke system is located at a position within the room and faces a specific direction. The position and orientation (direction of gaze or tilt) of the user 191 may be obtained by the karaoke system from sensor information (such as information from a gyroscope and an acceleration sensor worn by the user), from information obtained from camera images by object recognition and tracking techniques such as SLAM (simultaneous localization and mapping) for indoor environments, or by other techniques. Such sensors may, for example, be integrated in a smartphone or mp3 player held in the user's hand, in a smartwatch worn by the user, or in a headset worn by the user (which would also allow the gaze direction to be obtained). The orientation of the user 191 may be obtained, for example, by gaze detection techniques or head tracking techniques (e.g., SLAM-based). The user position and orientation obtained from the sensors are converted into a position p_u and a direction d of the user 191 in a coordinate system 199 that defines the virtual stage. Further, a local coordinate system 198 of the user's head is defined with reference to the coordinate system 199. As shown in fig. 19, in this user coordinate system 198 the position of the user's head defines the origin and the orientation of the head defines one axis of the coordinate system. Each instrument obtained by the instrument separation (81 in fig. 18) is given a corresponding position on the virtual stage. A first instrument 192, here for example a rhythm guitar, is located at position p_1. A second instrument 193, here for example a lead guitar, is located at position p_2. A third instrument 194, here for example the drums, is located at position p_3. A fourth instrument 195, here for example a bass guitar, is located at position p_4.
It should be noted that, in order to simplify the drawing, fig. 19 is a two-dimensional illustration in which positions in x, y directions on the virtual stage are represented by a two-dimensional coordinate system 199 (a bird's eye perspective view of the virtual stage). In a practical implementation, the 3D audio rendering technique may also cover the height of the sound object as a third dimension (not shown in fig. 19).
In this example, the renderer 89 is configured to render the separate musical instruments 192 to 195 as virtual sound sources (3D objects) by means of a 3D audio rendering technique, such as virtual monopole synthesis described in more detail below with respect to fig. 20. In the example of fig. 19, the user is located in the center of a band made up of instruments 192 through 195 on a virtual stage and toward crowd 196 (e.g., simulated by crowd singing simulation 41 of fig. 6 and/or sample database 46 of fig. 12a, 12 b). For example, placement p of instruments 192 to 195 1 、p 2 、p 3 、p 4 The placement of the instrument in the band may be based on predefined criteria. For example, according to standard placement, the position p of the rhythmic guitar 192 1 The position p of the main guitar 193 can be at the left front side of the virtual stage 2 The position p of the drum 194 can be on the right front side of the virtual stage 3 Position p of bass guitar 195, which may be behind the virtual stage 4 Or behind the virtual stage. Alternatively, such positional information (static or dynamic) may also be extracted from the audio by, for example, analyzing panning, reverberation, inter-channel delay, or inter-channel coherence of the audio signal of each instrument.
The placement p_1, p_2, p_3, p_4 of the instruments 192 to 195 may be static throughout the karaoke performance, or it may be dynamic, the dynamics being based on a predetermined motion pattern or on a motion model simulating the actual movement of the band members (drums: static, lead guitar: dynamic, etc.).
When performing audio rendering, the 3D audio renderer 89 takes into account the position p_u and the orientation d of the user 191. For example, the 3D audio renderer 89 transforms the positions p_1, p_2, p_3, p_4 of the instruments 192 to 195 on the virtual stage into the local coordinate system 198 of the user's head. Then, based on their positions in the local coordinate system 198 of the user's head, corresponding virtual sound sources are created, for example with binaural technology on headphones worn by the user.
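The transformation of the instrument positions into the local head coordinate system 198 could be sketched as follows for the two-dimensional case of fig. 19; the azimuth computation, shown here for selecting e.g. an HRIR pair, is an illustrative assumption.

import numpy as np

def to_head_frame(p_instruments, p_u, d):
    """p_instruments: (N, 2) stage positions p_1..p_N in coordinate system 199;
    p_u: (2,) user position; d: (2,) gaze/tilt direction."""
    d = d / np.linalg.norm(d)                  # forward axis of the head frame 198
    right = np.array([d[1], -d[0]])            # axis orthogonal to the gaze direction
    rel = p_instruments - p_u                  # positions relative to the user
    # coordinates (forward, right) in the local head coordinate system 198
    return np.stack([rel @ d, rel @ right], axis=-1)

def azimuths(local_positions):
    # azimuth of each virtual sound source relative to the gaze direction
    return np.degrees(np.arctan2(local_positions[:, 1], local_positions[:, 0]))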
3D audio rendering
Fig. 20 provides an embodiment of a 3D audio rendering technique based on a digitized monopole synthesis algorithm. Such rendering techniques may be applied, for example, by the renderer 89 of fig. 18 or the renderers 45 of fig. 4 and 5.
The theoretical background of such rendering techniques is described in more detail in patent application US2016/0037282 A1, which is incorporated herein by reference.
The technique implemented in the embodiments of US 2016/0037282 A1 is conceptually similar to wave field synthesis, which uses a restricted number of acoustic enclosures to generate a defined sound field. The generation principle underlying the embodiments is specific, however, in that the synthesis does not attempt to model the sound field exactly but is based on a least-squares approach.
The target sound field is modeled as at least one target monopole placed at a defined target position. In one embodiment, the target sound field is modeled as a single target monopole. In other embodiments, the target sound field is modeled as a plurality of target monopoles placed at respective defined target positions. The position of a target monopole may be moving; for example, a target monopole may follow the motion of a noise source that is to be attenuated. If the target sound field is represented by a plurality of target monopoles, the method of synthesizing the sound of a target monopole on the basis of a set of defined synthesis monopoles (as described below) may be applied to each target monopole independently, and the contributions of the synthesis monopoles obtained for each target monopole may be summed to reconstruct the target sound field.
The source signal x(n) is fed to delay units labeled n_p and to amplification units a_p, where p = 1, …, N is the index of the respective synthesis monopole used to synthesize the target monopole signal. The delay and amplification units according to this embodiment may apply equation (117) of US 2016/0037282 A1 to compute the resulting signals y_p(n) = s_p(n), which are used to synthesize the target monopole signal. The resulting signals s_p(n) are power amplified and fed to the loudspeakers S_p.
In this embodiment, therefore, the synthesis is performed in the form of a delayed and amplified component of the source signal x.
According to this embodiment, the delay n_p of the synthesis monopole with index p corresponds to the propagation time of sound over the Euclidean distance R = R_p0 = |r_p - r_0| between the target monopole at r_0 and the generator at r_p. For the synthesis of a focused sound source, the delay is inverted (n_p becomes negative). Since this results in a non-causal system, it is implemented in practice by means of a buffered solution, where the buffer size is chosen to cover the assumed delay range necessary to place the source in the area of the loudspeakers. For example, if the maximum distance from a loudspeaker to the focused source is R_max, the buffer size should be an integer value of at least R_max * f_s / c (rounded up), where c is the speed of sound and f_s is the sampling rate of the system.
Further, according to this embodiment, the amplification factor a_p is inversely proportional to the distance R = R_p0.
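A minimal sketch of this delay-and-gain computation is given below; it only uses the basic relations that the delay n_p corresponds to R_p0 / c and that the gain a_p is inversely proportional to R_p0, and it does not reproduce the refined factors of equations (117) and (118) of US 2016/0037282 A1.

import numpy as np

def monopole_delays_and_gains(r_speakers, r0, fs, c=343.0):
    """r_speakers: (N, 3) positions r_p of the synthesis speakers; r0: (3,) target position."""
    dist = np.linalg.norm(r_speakers - r0, axis=1)      # R_p0 = |r_p - r_0|
    delays = np.round(dist / c * fs).astype(int)        # n_p in samples
    gains = 1.0 / np.maximum(dist, 1e-6)                # a_p proportional to 1 / R_p0
    return delays, gains

def synthesize(x, delays, gains):
    # y_p(n): delayed and amplified copies of the source signal x(n)
    n = len(x) + int(delays.max())
    y = np.zeros((len(delays), n))
    for p, (d, a) in enumerate(zip(delays, gains)):
        y[p, d : d + len(x)] = a * x
    return y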
In an alternative embodiment of the system, a modified amplification factor according to equation (118) of US2016/0037282 A1 may be used.
In yet another alternative embodiment of the system, the mapping factor described in relation to FIG. 9 of US2016/0037282 A1 may be used to modify the amplification.
Implementation
Fig. 21 schematically depicts an embodiment of an electronic device capable of implementing a karaoke system with live mode processing as described above. The electronic device 1200 includes a CPU 1201 as a processor. The electronic device 1200 further comprises a microphone array 1210, a speaker array 1211 and a convolutional neural network unit 1220 connected to the processor 1201. The processor 1201 may, for example, implement a pitch shifter, formant shifter, reverberation, source separation, crosstalk simulation, jitter simulation, or equalizer, realizing the processes described in more detail with respect to figs. 4 to 17. The DNN 1220 may, for example, be an artificial neural network in hardware, e.g., a neural network on a GPU or any other hardware dedicated to implementing an artificial neural network. The DNN 1220 may, for example, implement the source separation (12 in fig. 2, 81 in fig. 8) or the dynamic equalization (112 in fig. 11b). The speaker array 1211 (such as the speaker system 19 described with respect to fig. 2) consists of one or more speakers distributed over a predefined space and is configured to render any kind of audio, such as 3D audio. The electronic device 1200 also includes a user interface 1212 connected to the processor 1201. The user interface 1212 acts as a human-machine interface and enables a dialogue between the user and the electronic system; for example, a user may configure the system via the user interface 1212. The electronic device 1200 also includes an Ethernet interface 1221, a Bluetooth interface 1204, and a WLAN interface 1205. These units 1204, 1205, 1221 act as I/O interfaces for data communication with external devices. For example, additional speakers, microphones, and cameras with Ethernet, WLAN, or Bluetooth connections may be coupled to the processor 1201 via these interfaces. The electronic device 1200 further comprises a data storage 1202 and a data memory 1203 (here a RAM). The data memory 1203 is arranged to temporarily store or buffer data or computer instructions for processing by the processor 1201. The data storage 1202 is provided as long-term storage, for example for recording sensor data obtained from the microphone array 1210 and provided to or retrieved from the DNN 1220. The data storage 1202 may also store audio samples (e.g., the sample database 143 of figs. 12a and 12b).
It should be noted that the above description is merely an example configuration. Alternative configurations may be implemented using additional or other sensors, storage devices, interfaces, etc.
It should be appreciated that the embodiments describe the method in an exemplary order of method steps. However, the particular order of the method steps is presented for illustrative purposes only and should not be construed as a constraint.
It should also be noted that the division of the electronic device of fig. 21 into units is for illustration purposes only, and the present disclosure is not limited to any particular division of functionality in a particular unit. For example, at least some of the circuitry may be implemented by a separately programmed processor, a Field Programmable Gate Array (FPGA), dedicated circuitry, or the like.
All of the elements and entities described in this specification and claimed in the appended claims may be implemented as, for example, integrated circuit logic on a chip, if not otherwise specified, and the functions provided by such elements and entities may be implemented in software.
With respect to the embodiments of the present disclosure described above, implemented at least in part using software-controlled data processing apparatus, it should be understood that computer programs providing such software control, as well as transmission, storage or other media through which such computer programs are provided, are contemplated as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
(1) An electronic device comprising circuitry configured to process an accompaniment signal (s_acc(n)) according to live mode processing (17) to obtain an enhanced accompaniment signal (s_acc*(n)).
(2) The electronic device according to (1), wherein the live mode processing (17) is configured to give the enhanced accompaniment signal (s_acc*(n)) a live character, so that the user feels as if he were part of a concert.
(3) The electronic device of (1) or (2), wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by a room simulator (44) to obtain a reverberation signal (s_amb_rev(n), s_harm_rev(n)).
(4) The electronic device of (3), wherein the live mode processing (17) is configured to process the reverberation signal (s_amb_rev(n), s_harm_rev(n)) by a renderer (45) to obtain a rendered reverberation signal (s_amb_rev,3D(n), s_harm_rev,3D(n)).
(5) The electronic device of (4), wherein the renderer (45) is a 3D audio renderer (45; 43, 171), a binaural renderer (45) or an upmixer (43,171).
(6) The electronic device according to any one of (1) to (5), wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by primary ambient extraction (43) or by harmonic-percussive source separation (48) to obtain, respectively, the ambient part (s_amb(n)) or the harmonic part (s_harm(n)) of the accompaniment signal (s_acc(n)).
(7) The electronic device of (6), wherein the live mode processing (17) is configured to process the ambient part (s_amb(n)) or the harmonic part (s_harm(n)) by a room simulator (44) to obtain, respectively, an ambient reverberation (s_amb_rev(n)) or a harmonic reverberation (s_harm_rev(n)).
(8) The electronic device according to any one of (1) to (7), wherein the live mode processing (17) is controlled by a live mode parameter (SINGER LOCATION) describing the singer's position and/or a live mode parameter (VENUE) describing the venue.
(9) The electronic device of any one of (1) to (8), wherein the live mode processing (17) is configured to process the vocals signal (s_vocals(n)) by crowd singing simulation (41) to obtain a crowd vocals signal (s_crowd(n)).
(10) The electronic device of (9), wherein the crowd singing simulation (41) comprises a plurality of pitch and/or formant shift branches.
(11) The electronic device according to any one of (1) to (10), wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by live sound effect processing (42) to obtain a live accompaniment signal (s_acc_live(n)).
(12) The electronic device of (11), wherein the live sound effect processing (42) includes source separation (81).
(13) The electronic device of (11), wherein the live sound effect processing (42) includes microphone crosstalk simulation (82).
(14) The electronic device of (11), wherein the live sound effect processing (42) includes jitter simulation (83).
(15) The electronic device of (11), wherein the live sound effect processing (42) includes equalization (85).
(16) The electronic device of any one of (1) to (15), wherein the live mode processing (17) comprises obtaining samples from a sample database (143).
(17) The electronic device of any one of (4) to (16), configured to use, when rendering the enhanced accompaniment signal (s_acc*(n)), information about the current position (p_u) of the user in the room and/or information (d) about the direction of his gaze or tilt.
(18) The electronic device according to any one of (1) to (17), further comprising a mixer (18), the mixer (18) being configured to mix the enhanced accompaniment signal (s_acc*(n)) with the user's vocal signal (s_user(n)).
(19) The electronic device of any one of (12) to (18), wherein the live sound effect processing (42) comprises a renderer (89), the renderer (89) being configured to render the sources (s_inst-jitter,1(n), …, s_inst-jitter,N(n)) obtained by the source separation (81).
(20) The electronic device of (19), wherein the renderer (89) is configured to receive information from sensors and to determine therefrom the current position (p_u) of the user and/or information about the direction (d) of his gaze or tilt.
(21) The electronic device of (20), wherein the renderer (89) is configured to use the information about the current position of the user and/or the information about the direction of his gaze or tilt.
(22) A method comprising processing an accompaniment signal (s_acc(n)) according to live mode processing (17) to obtain an enhanced accompaniment signal (s_acc*(n)).
(23) A computer program comprising instructions which, when executed by a processor, instruct the processor to perform the method of (19).

Claims (23)

1. An electronic device comprising circuitry configured to process an accompaniment signal (s_acc(n)) according to live mode processing (17) to obtain an enhanced accompaniment signal (s_acc*(n)).
2. The electronic device of claim 1, wherein the live mode processing (17) is configured to give the enhanced accompaniment signal (s_acc*(n)) a live character, so that the user feels as if he were part of a concert.
3. The electronic device of claim 1, wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by a room simulator (44) to obtain a reverberation signal (s_amb_rev(n), s_harm_rev(n)).
4. The electronic device of claim 3, wherein the live mode processing (17) is configured to process the reverberation signal (s_amb_rev(n), s_harm_rev(n)) by a renderer (45) to obtain a rendered reverberation signal (s_amb_rev,3D(n), s_harm_rev,3D(n)).
5. The electronic device of claim 4, wherein the renderer (45) is a 3D audio renderer (45; 43, 171), a binaural renderer (45) or an upmixer (43, 171).
6. The electronic device of claim 1, wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by primary ambient extraction (43) or by harmonic-percussive source separation (48) to obtain, respectively, the ambient part (s_amb(n)) or the harmonic part (s_harm(n)) of the accompaniment signal (s_acc(n)).
7. The electronic device of claim 6, wherein the live mode processing (17) is configured to process the ambient part (s_amb(n)) or the harmonic part (s_harm(n)) by a room simulator (44) to obtain, respectively, an ambient reverberation (s_amb_rev(n)) or a harmonic reverberation (s_harm_rev(n)).
8. The electronic device of claim 1, wherein the live mode processing (17) is controlled by a live mode parameter describing the singer's position (SINGER LOCATION) and/or a live mode parameter describing the venue (VENUE).
9. The electronic device of claim 1, wherein the live mode processing (17) is configured to process the vocals signal (s_vocals(n)) by crowd singing simulation (41) to obtain a crowd vocals signal (s_crowd(n)).
10. The electronic device of claim 9, wherein the crowd singing simulation (41) comprises a plurality of pitch and/or formant shift branches.
11. The electronic device of claim 1, wherein the live mode processing (17) is configured to process the accompaniment signal (s_acc(n)) by live sound effect processing (42) to obtain a live accompaniment signal (s_acc_live(n)).
12. The electronic device of claim 11, wherein the live sound effect processing (42) includes source separation (81).
13. The electronic device of claim 11, wherein the live sound effect processing (42) includes microphone crosstalk simulation (82).
14. The electronic device of claim 11, wherein the live sound effect processing (42) includes jitter simulation (83).
15. The electronic device of claim 11, wherein the live sound effect processing (42) includes equalization (85).
16. The electronic device of claim 1, wherein the live mode processing (17) comprises obtaining samples from a sample database (143).
17. The electronic device of claim 4, configured to use, when rendering the enhanced accompaniment signal (s_acc*(n)), information about the current position (p_u) of the user in the room and/or information (d) about the direction of his gaze or tilt.
18. The electronic device of claim 1, further comprising a mixer (18), the mixer (18) being configured to mix the enhanced accompaniment signal (s_acc*(n)) with the user's vocal signal (s_user(n)).
19. The electronic device of claim 12, wherein the live sound effect processing (42) comprises a renderer (89), the renderer (89) being configured to render the sources (s_inst-jitter,1(n), …, s_inst-jitter,N(n)) obtained by the source separation (81).
20. The electronic device of claim 19, wherein the renderer (89) is configured to receive information from sensors and to determine therefrom the current position (p_u) of the user and/or information about the direction (d) of his gaze or tilt.
21. The electronic device of claim 20, wherein the renderer (89) is configured to use the information about the current position of the user and/or the information about the direction of his gaze or tilt.
22. A method comprising processing an accompaniment signal (s_acc(n)) according to live mode processing (17) to obtain an enhanced accompaniment signal (s_acc*(n)).
23. A computer program comprising instructions which, when executed by a processor, instruct the processor to perform the method of claim 19.
CN202280022435.9A 2021-03-26 2022-03-15 Electronic device, method and computer program Pending CN117043851A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP21165311.8 2021-03-26
EP21165311 2021-03-26
PCT/EP2022/056764 WO2022200136A1 (en) 2021-03-26 2022-03-15 Electronic device, method and computer program

Publications (1)

Publication Number Publication Date
CN117043851A true CN117043851A (en) 2023-11-10

Family

ID=75252457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280022435.9A Pending CN117043851A (en) 2021-03-26 2022-03-15 Electronic device, method and computer program

Country Status (3)

Country Link
JP (1) JP2024512493A (en)
CN (1) CN117043851A (en)
WO (1) WO2022200136A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2012268390A1 (en) * 2011-06-05 2014-01-16 Museami, Inc. Enhanced media recordings and playback
BR112014002269A2 (en) * 2011-07-29 2017-02-21 Music Mastermind Inc system and method for producing a more harmonious musical accompaniment and for applying an effect chain to a musical composition
US9749769B2 (en) 2014-07-30 2017-08-29 Sony Corporation Method, device and system
KR101840015B1 (en) * 2016-12-21 2018-04-26 서강대학교산학협력단 Music Accompaniment Extraction Method for Stereophonic Songs

Also Published As

Publication number Publication date
WO2022200136A1 (en) 2022-09-29
JP2024512493A (en) 2024-03-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination