WO2022078905A1 - Method and apparatus for rendering an audio signal of a plurality of voice signals - Google Patents

Method and apparatus for rendering an audio signal of a plurality of voice signals

Info

Publication number
WO2022078905A1
WO2022078905A1 (PCT/EP2021/077898)
Authority
WO
WIPO (PCT)
Prior art keywords
voice signals
voice
signal
audio
positions
Prior art date
Application number
PCT/EP2021/077898
Other languages
English (en)
Inventor
Thomas Morin
Sylvain Thiebaud
Original Assignee
Interdigital Ce Patent Holdings, Sas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Interdigital Ce Patent Holdings, Sas filed Critical Interdigital Ce Patent Holdings, Sas
Publication of WO2022078905A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L21/0364: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30: Control circuits for electronic adaptation of the sound field

Definitions

  • the present disclosure relates to the domain of audio rendering, and more particularly to the domain of audio conference rendering.
  • Audio/video conference applications may be used on any of computers, smartphones, tablets, television sets, set-top-boxes and more generally on any kind of (e.g., professional) device. Speech may be difficult to understand when at least two attendees speak at the same time. This may also be a problem in real life (e.g., when people talk directly without any audio/video conference application). People may generally try to prevent this by adopting a polite behaviour, such as only one person talking at a time. Some conference applications may deal with this problem by introducing a feature, such as a button, that may be used by an attendee to notify the rest of the audience of an intention to speak, in the same way a student would raise a hand before speaking. This feature, however, may not prevent several attendees from speaking simultaneously and therefore from being hard to understand for the other attendees.
  • the present disclosure has been designed with the foregoing in mind.
  • Similarity values of the voice signals may be obtained, wherein a similarity value may indicate a level of similarity between two voice signals.
  • the audio signal may be rendered by spatializing the voice signals based on the similarity values, the higher the level of similarity between two voice signals, the greater the distance between the two voice signals in the spatialized audio signal.
  • FIG. 1 is a diagram illustrating an example of a graphical user interface of an audio-conferencing system
  • FIG. 3A and 3B are two diagrams illustrating two examples of similarities between two voice signals based on frequency spectra of the voice signals
  • FIG. 4 is a system diagram illustrating an example of spatializing four voice signals in an audio signal comprising two audio output signals
  • FIG. 5 is a diagram illustrating an example of a graphical user interface of a processing device for rendering an audio signal of a plurality of voice signals
  • FIG. 6B is a diagram representing an exemplary architecture of the processing device of figure 6A.
  • interconnected is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
  • interconnected is not limited to a wired interconnection and also includes wireless interconnection.
  • any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function.
  • the disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
  • such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
  • audio spatialization may be performed by panning audio.
  • Panning may be seen as a distribution of an audio signal into a new (e.g., any of stereo, multi-channel) audio signal, which may be determined by a pan control setting that may be referred to herein as a panning factor.
  • An input audio signal may be converted into a plurality of output signals wherein a specific panning factor may be applied to the input audio signal for a (e.g., each) output signal.
  • Applying a panning factor to an audio input signal may comprise any of amplifying, attenuating, delaying, and filtering the audio input signal specifically (e.g., differently) from other input signals.
  • an audio signal may comprise any number of audio output signals for creating a three-dimensional (3D) audio rendering.
  • a 3D sound may be obtained based on at least four audio outputs (e.g., connected to at least four loudspeakers). Examples of such 3D audio systems may include any of 5.1 and 7.1 audio systems.
  • a (e.g., spatial) position may be obtained for a sound (e.g., input signal) by differently balancing the sound volume (e.g., original audio input signal) in (e.g., each of) the different audio output signals.
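  • as an illustration of this balancing, the following Python sketch (an illustration only, not the publication's method; the constant-power pan law and all names are assumptions) distributes a mono input signal into two output signals through a pair of gains playing the role of panning factors:

```python
import numpy as np

def pan_stereo(signal: np.ndarray, pan: float) -> tuple[np.ndarray, np.ndarray]:
    """Distribute a mono input signal into two output signals.

    pan ranges from -1.0 (full left) to +1.0 (full right); the two gains
    correspond to one panning factor per output signal (constant-power
    pan law, an assumed choice).
    """
    theta = (pan + 1.0) * np.pi / 4.0      # map [-1, 1] onto [0, pi/2]
    return np.cos(theta) * signal, np.sin(theta) * signal

# example: a 440 Hz tone placed halfway to the left
sr = 16000
t = np.arange(sr) / sr
left, right = pan_stereo(np.sin(2 * np.pi * 440 * t), pan=-0.5)
```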
  • Embodiments described herein may also be applicable to virtual 3D sounds rendered by a binaural headset.
  • a binaural audio signal may be seen as a two-output (e.g., stereo) audio signal that may be obtained by applying audio filters on (e.g., parts of) a signal to make the brain believe a particular sound may come from a specific location.
  • Audio filters can be any of delay, reverb, high-pass, low-pass, equalization, compression...
  • a sound (e.g., an audio input signal) may be spatialized (e.g., allocated a specific position) by processing the audio input signal specifically (e.g., differently) for different audio outputs, so that a user listening to the audio output signals e.g., rendered by different loudspeakers may get the feeling that a sound (e.g., corresponding to the audio input signal) may originate from that specific position.
  • Processing an input signal specifically (e.g., differently) for different outputs may comprise any of amplifying, attenuating, delaying, filtering and equalizing the input signal specifically for a given output signal. This processing may be referred to herein as applying a panning factor to the input signal.
  • Embodiments described herein may allow speech clarity (e.g., understandability) to be improved in a case where different persons with similar voices may be talking simultaneously and/or successively in an audio (e.g., video) conference system.
  • Figure 1 is a diagram illustrating an example of a graphical user interface (GUI) of an audio-conferencing system.
  • the audio-conferencing system may run (e.g., execute) on a processing device.
  • the audio-conferencing system may display the GUI e.g., as illustrated by Figure 1.
  • Embodiments described herein may also be applicable to audio-conferencing systems without any GUI.
  • the audio-conferencing system may receive a plurality of input voice signals corresponding to a plurality of participants (e.g., attendees) in an audio conference.
  • the plurality of input voice signals may correspond, for example, to a set (e.g., list) of active interlocutors 10.
  • the audio (e.g., multi-output) signal may be rendered, e.g., by the audio-conferencing system.
  • a distance 323 (which may be referred to herein as a frequency distance) may be obtained between the frequencies of respectively a first peak of a first frequency spectrum 31A and a second peak of a second frequency spectrum 32A (e.g., of respectively the first and the second voice signals).
  • the similarity value between the first and the second voice signals may be based on (e.g., represented by an inverse of) the frequency distance 323 between those two peaks of frequency spectra, the lower the frequency distance 323, the higher the similarity.
  • the frequency spectra may comprise more than one peak. More than one frequency distance may be obtained between peaks of respective frequency spectra. Any number of frequency distances between any number of respective peaks may be, for example, averaged to serve as a basis for (e.g., to represent) the similarity between the first and the second voice signals.
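  • a minimal sketch of such a peak-based similarity, assuming NumPy/SciPy and a simple pairing of the strongest spectral peaks by ascending frequency (the pairing strategy and all names are illustrative assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

def peak_similarity(voice_a: np.ndarray, voice_b: np.ndarray, sr: int,
                    n_peaks: int = 3) -> float:
    """Similarity from averaged frequency distances between spectral peaks:
    the lower the average frequency distance, the higher the similarity."""
    def main_peak_freqs(x: np.ndarray) -> np.ndarray:
        spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        idx, props = find_peaks(spectrum, height=0.0)
        top = idx[np.argsort(props["peak_heights"])[-n_peaks:]]
        return np.sort(freqs[top])              # peak frequencies in Hz

    pa, pb = main_peak_freqs(voice_a), main_peak_freqs(voice_b)
    n = min(len(pa), len(pb))
    if n == 0:
        return 0.0
    avg_distance = float(np.mean(np.abs(pa[:n] - pb[:n])))
    return 1.0 / (1.0 + avg_distance)           # inverse of the distance
```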
  • Figure 3B is a diagram illustrating a second example of a similarity between two voice signals based on another distance between at least a part of the respective two frequency spectra 31B, 32B.
  • This other distance may be referred to herein as a magnitude distance.
  • a part 35 of the frequency spectra 31B, 32B may correspond, for example, to the range of frequencies between 1 kHz and 5 kHz.
  • a magnitude distance between the first frequency spectrum 31B of the first voice signal and the second frequency spectrum 32B of the second voice signal may be obtained based on a sum of the absolute differences of magnitudes 34 between the two frequency spectra 31B, 32B for a set of samples of at least a part 35 of the frequency spectra 31B, 32B.
  • the similarity value between the first and the second voice signals may be based on (e.g., represented by an inverse of) the magnitude distance between those two frequency spectra, the lower the magnitude distance, the higher the similarity. More generally, any method for obtaining any kind of (e.g., statistical) distance between two voice signal representations (e.g., characteristics) may be applicable to embodiments described herein.
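  • the magnitude distance of figure 3B may be sketched as follows (assumptions: a plain FFT magnitude spectrum, the 1 kHz - 5 kHz band, and an inverse-distance similarity):

```python
import numpy as np

def magnitude_similarity(voice_a: np.ndarray, voice_b: np.ndarray, sr: int,
                         band: tuple[float, float] = (1000.0, 5000.0)) -> float:
    """Similarity from the sum of absolute magnitude differences between
    two spectra over a part of the spectra (here 1 kHz - 5 kHz):
    the lower the distance, the higher the similarity."""
    n = min(len(voice_a), len(voice_b))
    spec_a = np.abs(np.fft.rfft(voice_a[:n]))
    spec_b = np.abs(np.fft.rfft(voice_b[:n]))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    distance = float(np.sum(np.abs(spec_a[in_band] - spec_b[in_band])))
    return 1.0 / (1.0 + distance)
```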
  • the characteristic of a voice signal may be a signature of the voice signal, that may be obtained based on a deep learning method, and the similarity may be any kind of similarity between voice signatures.
  • any representation of a voice signal for which a similarity with (e.g., a representation of) another voice signal may be obtained may be applicable to embodiments described herein.
  • Any technique for computing a similarity between two representations of two voice signals may be applicable to embodiments described herein.
  • Figure 4 is a system diagram illustrating an example of spatializing four (e.g., input voice) signals VS1, VS2, VS3, VS4 in an audio signal comprising two audio output signals AO1, AO2.
  • a first P11, a second P12, a third P13 and a fourth P14 panning factor may be applied respectively to a first VS1, a second VS2, a third VS3 and a fourth VS4 (e.g., input) voice signal.
  • the resulting first, second, third and fourth panned signals may be mixed by a first mixer M1.
  • the output of the first mixer M1 may result in a first audio output signal AO1 (e.g., corresponding to the left channel of any of a stereo signal and a binaural signal).
  • a second audio output signal AO2 (e.g., corresponding to the right channel of any of the stereo signal and the binaural signal) may be obtained by applying a fifth P21, a sixth P22, a seventh P23 and an eighth P24 panning factor respectively to the first VS1, second VS2, third VS3 and fourth VS4 (e.g., input) voice signals, and by mixing the resulting panned voice signals.
  • the position of the first voice signal VS1 in the stereo (e.g., or binaural) signal may be represented by the first P11 and the fifth P21 panning factors.
  • the position of the second voice signal VS2 in the stereo (e.g., or binaural) signal may be represented by the second P12 and the sixth P22 panning factors.
  • the position of the third voice signal VS3 in the stereo (e.g., or binaural) signal may be represented by the third P13 and the seventh P23 panning factors.
  • the position of the fourth voice signal VS4 in the stereo (e.g., or binaural) signal may be represented by the fourth P14 and the eighth P24 panning factors.
  • Embodiments described herein may also be applicable to audio signals comprising more than two audio output signals, the position of a (e.g., input) voice signal in the audio signal being represented by a set of panning factors (of any kind), wherein a (e.g., specific) panning factor may be applied to that (e.g., input) voice signal for obtaining (e.g., mixing) a (e.g., each, specific) audio output signal.
  • each audio output signal of the audio signal may be obtained by mixing the different (e.g., input) voice signals VS1, VS2, VS3, VS4 after having applied to each (e.g., input) voice signal VS1, VS2, VS3, VS4 a panning factor corresponding to that audio output signal and to that (e.g., input) voice signal, the different panning factors applied to a (e.g., given input) voice signal for generating the different audio output signals corresponding to the position of that (e.g., given input) voice signal in the (e.g., spatialized) audio signal, as sketched below.
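  • as an illustration of this mixing, the sketch below applies a panning matrix to four voice signals to produce two audio output signals; the gain values are invented for the example and are not taken from figure 4:

```python
import numpy as np

def mix_spatialized(voices: list[np.ndarray], panning: np.ndarray) -> np.ndarray:
    """Mix N voice signals into M audio output signals; panning[i, j] is the
    panning factor applied to voice i when generating output j, so row i
    encodes the position of voice i in the spatialized audio signal."""
    n = min(len(v) for v in voices)
    stacked = np.stack([v[:n] for v in voices])   # shape (N, n_samples)
    return panning.T @ stacked                    # shape (M, n_samples)

# four voices VS1..VS4 mixed into two outputs AO1, AO2 (illustrative gains)
panning = np.array([[1.0, 0.0],    # VS1: full left    (P11, P21)
                    [0.7, 0.3],    # VS2: centre-left  (P12, P22)
                    [0.3, 0.7],    # VS3: centre-right (P13, P23)
                    [0.0, 1.0]])   # VS4: full right   (P14, P24)
voices = [np.random.randn(16000) * 0.1 for _ in range(4)]
ao1, ao2 = mix_spatialized(voices, panning)
```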
  • the number of positions may be fixed (e.g., for a given audio conference).
  • the number of positions may correspond to the number of active interlocutors.
  • the set of positions may be a set of fixed positions.
  • a (e.g., each) position may correspond to a specific set of panning factors (to be applied to a (e.g., input) voice signal) for generating the audio output signals.
  • the positions in an audio signal may depend on the number of audio output signals and on the number of positions.
  • the positions may be determined, for example, in order to homogeneously distribute (e.g., spatialize) the different positions in the (e.g., spatialized) audio signal.
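  • one plausible way to distribute positions homogeneously, sketched for a one-dimensional (pan) position model (the linear spacing is an assumption):

```python
import numpy as np

def homogeneous_positions(n_positions: int) -> np.ndarray:
    """Distribute n_positions pan values evenly across the stereo field,
    from -1.0 (full left) to +1.0 (full right)."""
    if n_positions == 1:
        return np.array([0.0])           # a single voice stays centred
    return np.linspace(-1.0, 1.0, n_positions)

print(homogeneous_positions(4))          # [-1. -0.33 0.33 1.] (rounded)
```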
  • the set of positions may be a set of dynamic positions.
  • a set of dynamic positions may be obtained as a function of similarity values between pairs of voice signals. Any function for obtaining dynamic spatial positions in a spatialized audio signal for a set of voice signals based on similarity values between pairs of voice signals may be applicable to embodiments described herein.
  • the number of positions and the positions of the set of positions may be fixed (e.g., predetermined), and may correspond to the number of participants in the audio conference.
  • the position of each interlocutor may be determined once for the whole audio conference.
  • the number of positions and the positions of the set of positions may be fixed (e.g., predetermined), and may correspond to the number of active interlocutors (which may be less than the number of participants).
  • a position of a (e.g., input) voice signal of an interlocutor may change as the set of active interlocutors may change.
  • Prioritizing the audio conference understandability may be based on statistics recording. For example, the positions of the different (e.g., input) voice signals may be recorded (e.g., stored, logged), as the positions may correspond to active interlocutors.
  • the audio conference understandability may be prioritized by keeping a similar position the next time an interlocutor becomes active, e.g., in a best-effort manner.
  • a new position (e.g., possibly significantly different from a previous position) may be allocated to an (e.g., new active) interlocutor (e.g., only) if the audio conference understandability may be (e.g., significantly) increased by allocating that new position.
  • a new position (e.g., slightly) different from a previous position may be allocated to the (e.g., new active) interlocutor, despite availability of another (e.g., better) position being more distant from the previous position.
  • another available position further improving the understandability of the audio conference may not be allocated if the other available position is away from the previous position by a distance higher than a value.
  • a new active interlocutor may change position only if the audio conference understandability may be improved by a given (e.g., threshold) value.
  • a new active interlocutor may change position on a condition that the distance between the previous position and the new position remains below a given (e.g., threshold) value.
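  • the two conditions above may be combined as sketched below; the function, threshold names and values are assumptions for illustration:

```python
def maybe_reallocate(previous_pos: float, candidate_pos: float,
                     understandability_gain: float,
                     min_gain: float = 0.2, max_move: float = 0.5) -> float:
    """Keep an interlocutor's previous position unless both conditions hold:
    the understandability gain exceeds a threshold and the new position
    stays close enough to the previous one (illustrative thresholds)."""
    if understandability_gain < min_gain:
        return previous_pos        # improvement too small: keep position
    if abs(candidate_pos - previous_pos) > max_move:
        return previous_pos        # new position too far: keep position
    return candidate_pos
```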
  • Prioritizing the stability of voice positions may be performed by positioning each (e.g., voice signal corresponding to each) interlocutor in a (e.g., predetermined, spatial) area.
  • a position of a (e.g., input) voice signal may be adjusted as the set of active interlocutors may change, on a condition that it remains in the (e.g., predetermined, spatial) area.
  • the plurality of voice signals may correspond to a set of active interlocutors, which may be a subset of the interlocutors of the audio conference.
  • the positions of the different voice signals may be updated (e.g., reallocated).
  • the number and the available positions in the audio signal may be fixed (e.g., predetermined, preconfigured).
  • the available positions may be homogeneously distributed in the audio signal and a (e.g., specific, fixed) spatial distance may separate a (e.g., each) pair of positions.
  • a similarity value may be obtained for each pair of voice signals among the plurality of voice signals.
  • the pairs of voice signals may be ordered based on their similarity values.
  • the pair of voice signals of highest similarity may be positioned in the first two positions of highest spatial distance between them.
  • in example A, there may be four voice signals (i1, i2, i3, i4), with the following list of ordered similarity values:
  • there may be four available positions, which may be allocated according to the following table:
  • the positions may be allocated in three steps.
  • i1 and i2 may be positioned in the most spatially distant positions (1 and 2).
  • i3 may be positioned in the remaining available position which may be the most distant from the position of i2 (e.g., position 2).
  • i4 may be positioned in the remaining position (e.g., position 3).
  • in example B, there may be four voice signals (i1, i2, i3, i4), with the following list of ordered similarity values:
  • the positions may be allocated in two steps.
  • i1 and i4 may be positioned in the most spatially distant positions (1 and 2).
  • in the second step, corresponding to the next highest similarity (between i2 and i3), since neither i2 nor i3 is already positioned, they may be positioned in the remaining available positions.
  • i2 and i3 may be respectively positioned in each of the remaining available positions, based on the next highest similarity (e.g., between i2 and i4).
  • i2 may be positioned in the available position 2, which may be at a higher (e.g., spatial) distance from the position allocated to i4 (e.g., position 4) than from the other available position (e.g., position 3).
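  • a greedy allocation consistent with the steps of examples A and B may be sketched as follows; the tie-breaking details and all names are illustrative assumptions:

```python
from itertools import combinations

def allocate_positions(similarities: dict[tuple[str, str], float],
                       positions: list[float]) -> dict[str, float]:
    """Place the pair of highest similarity on the two most spatially
    distant free positions, then continue pair by pair."""
    free = list(positions)
    placed: dict[str, float] = {}
    for (a, b), _ in sorted(similarities.items(), key=lambda kv: -kv[1]):
        if a in placed and b in placed:
            continue
        if a in placed or b in placed:
            fixed, moving = (a, b) if a in placed else (b, a)
            # remaining position farthest from the already-placed voice
            best = max(free, key=lambda p: abs(p - placed[fixed]))
            placed[moving] = best
            free.remove(best)
        elif len(free) >= 2:
            pa, pb = max(combinations(free, 2), key=lambda p: abs(p[0] - p[1]))
            placed[a], placed[b] = pa, pb
            free.remove(pa)
            free.remove(pb)
    return placed

# example A: i1/i2 most similar, then i2/i3, then i3/i4
sims = {("i1", "i2"): 0.9, ("i2", "i3"): 0.6, ("i3", "i4"): 0.3}
print(allocate_positions(sims, positions=[-1.0, -0.33, 0.33, 1.0]))
```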
  • more than one voice signal may be spatialized at a same position, on a condition that they are (e.g., sufficiently) different (e.g., on a condition that the similarity value is below a given value).
  • voice signals may be positioned at a (e.g., predefined) position, such as a central position.
  • positions of exceeding voice signals (e.g., voice signals exceeding the number of available positions) may be randomly determined. Any technique for allocating positions to voice signals exceeding the number of available positions (e.g., by allocating a same position to any number of voice signals) may be applicable to embodiments described herein.
  • the speaking time of the interlocutors may be monitored, and the stability of the positions of the most talkative interlocutors may be maintained based on this speaking-time monitoring.
  • the positions of the less talkative interlocutors may change more frequently than the positions of the more talkative interlocutors.
  • voice signals of less talkative interlocutors may remain positioned at a central position.
  • FIG. 5 is a system diagram illustrating an example of a graphical user interface of a processing device for rendering an audio signal of a plurality of voice signals.
  • N voice signals may correspond to respectively N positions (P11, P12), (P21, P22), (P31, P32), ... (PN1, PN2), N being any integer number greater than one.
  • the voice signals may, for example, correspond to participants in the audio conference.
  • the GUI of the processing device may display (e.g., graphical) visual representations 51, 52, 53, 54 of the participants of the audio conference.
  • the visual representations 51, 52, 53, 54 of the participants may be positioned in a displayed image (e.g., of the GUI) consistently with the positions of the corresponding voice signals in the (e.g., rendered, spatialized) audio signal.
  • a first participant may correspond to a first voice signal positioned at a first position (P11 , P12).
  • the first participant may be represented by a first visual representation 51.
  • the first visual representation 51 may be positioned in an area of the displayed image that may be consistent with the first position (P11 , P12).
  • a (e.g., displayed) position of a visual representation 51 of a participant may be aligned 511 with a spatial position (P11 , P12) of a voice signal corresponding to the participant.
  • a user of the audio-conferencing system may get the impression that the voice of the first participant may originate from the visual representation 51.
  • a (e.g., displayed) position of a visual representation 51 of a participant may not be totally (e.g., perfectly) aligned 510 with a spatial position (P11, P12) of the voice signal corresponding to the participant.
  • spatial distances between voice signals may be boosted in the audio signal (e.g., with regard to the visual representations in the GUI) while preserving the relative positioning between the voice signals in the audio signal.
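  • a possible mapping from a displayed horizontal position to a pan value, with an assumed boost factor widening spatial distances while preserving the relative ordering:

```python
def screen_to_pan(x_px: float, width_px: float, boost: float = 1.5) -> float:
    """Map the horizontal position of a participant's visual representation
    to a pan value in [-1, 1]; boost > 1 widens the spatial distances
    (boost is an assumed tuning parameter, not from the publication)."""
    pan = 2.0 * x_px / width_px - 1.0      # [0, width] -> [-1, +1]
    return max(-1.0, min(1.0, boost * pan))

# a participant displayed at 25% of the screen width:
print(screen_to_pan(x_px=480, width_px=1920))   # -0.75 instead of -0.5
```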
  • a visual representation may be an avatar (e.g., a picture) of a participant.
  • the audio conference may be an audio-video conference
  • a visual representation of a participant may comprise a video of the participant, attending the audio-video conference.
  • a characteristic of a voice signal may comprise any representation of a voice signal for which a similarity value with (e.g., a characteristic of) another voice signal may be determined.
  • examples of a characteristic of a voice signal may include any of (e.g., gender) metadata associated with the voice signal, a frequency spectrum of the voice signal, a power spectral density, a signature ...
  • a characteristic of a voice signal may be (e.g., preliminarily) obtained based on samples of the voice signal.
  • the characteristic of the voice signal may be obtained (e.g., determined, acquired) based on initial samples of the voice signal, e.g., when the participant associated with the voice signal starts speaking (e.g., at the beginning of the audio conference, and/or when speaking for the first time).
  • the (e.g., acquired) characteristic of the voice signal may be stored (e.g., memorized), for example, in a user profile associated with the participant, for being used to obtain similarities with other voice signals associated with other participants.
  • a participant may be requested, e.g., in the process of creating (or updating) its associated user profile, to pronounce a (e.g., predefined) sentence, based on which the characteristic of its associated voice signal may be obtained (e.g., determined, acquired).
  • the (e.g., specific, default, generic) characteristic of a voice signal may be obtained based on a set of voice samples (e.g., records) of people sharing similar (e.g., same) metadata (e.g., voice records of a set of women, voice records of a set of women of similar age, voice records of a set of women of similar age and/or of a same nationality, etc.).
  • the (e.g., specific, default, generic) characteristic may be any of a frequency spectrum, a power spectral density, a signature and any representation of a voice signal based on which similarities with other voice signals may be determined.
  • a characteristic of a voice signal may be preconfigured in the audio-conferencing system.
  • the characteristic of a voice signal may be stored e.g. in a user profile associated with a participant, for obtaining similarities with other voice signals associated with other participants.
  • the characteristics of the plurality of voice signals may be received (e.g., via a network interface) by the processing device running the audio-conferencing application for rendering the audio signal.
  • the characteristics may be determined, based on any of the above-described methods, by any of the processing device rendering the spatialized audio signal and another processing device communicating with the (e.g., rendering) processing device (e.g., belonging to the same audio-conferencing system).
  • a default position may be determined for spatializing the voice signal in the audio signal.
  • Figure 6A is a diagram illustrating an example of a processing device 6 for rendering an audio signal of a plurality of voice signals.
  • the processing device 6 may comprise a network interface 60 for connection to a network.
  • the network interface 60 may be configured to send and receive data (e.g., voice signals) for communicating with other processing devices of e.g., an audio conference system.
  • the network interface 60 may be any of: a wireless local area network interface such as Bluetooth, Wi-Fi in any flavour, or any kind of wireless interface of the IEEE 802 family of network interfaces; a wired LAN interface such as Ethernet, IEEE 802.3 or any wired interface of the IEEE 802 family of network interfaces; a wired bus interface such as USB, FireWire, or any kind of wired bus technology; a broadband cellular wireless network interface such as a 2G/3G/4G/5G cellular wireless network interface compliant with the 3GPP specification in any of its releases; a wide area network interface such as an xDSL, FTTx or WiMAX interface.
  • any network interface allowing to send and receive data may be applicable to embodiments described herein.
  • the processing device 6 may comprise an optional sensor 61 (that may be internal or external to the processing device).
  • the sensor 61 may be any kind of microphone capable of acquiring (e.g., configured to acquire) a voice signal of a speaking user.
  • the network interface 60 and the optional sensor 61 may be coupled to a processing module 62, that may be configured to obtain similarity values of the (e.g., received) voice signals wherein a similarity value may indicate a level of similarity between two voice signals.
  • the processing module 62 may be configured to spatialize the voice signals based on the similarity values, the higher the level of similarity between two voice signals, the greater the distance between the two voice signals in the spatialized audio signal.
  • the processing module 62 may be configured to render the spatialized audio signal by sending the spatialized audio signal to an audio output 64.
  • the audio output 64 may be internal or external to the processing device 6.
  • the audio output 64 may comprise (e.g., be capable of being connected to) any number of loudspeakers for rendering the spatialized audio signal.
  • the processing device 6 may comprise an optional display output 66 (e.g., screen) coupled with the processing module 62.
  • the processing module 62 may be configured to display any number of visual representations of participants (e.g., in an audio conference), associated with (e.g., different) voice signals.
  • FIG. 6B represents an exemplary architecture of the processing device 6 described herein.
  • the processing device 6 may comprise one or more processor(s) 610, which may be, for example, any of a CPU, a GPU and a DSP (Digital Signal Processor), along with internal memory 620 (e.g. any of RAM, ROM, EPROM).
  • the processing device 6 may comprise any number of Input/Output interface(s) 630 adapted to send output information and/or to allow a user to enter commands and/or data (e.g. any of a keyboard, a mouse, a touchpad, a webcam, a display), and/or to send/receive data over a network interface; and a power source 640, which may be external to the processing device 6.
  • the processing device 6 may further comprise a computer program stored in the memory 620.
  • the computer program may comprise instructions which, when executed by the processing device 6, in particular by the processor(s) 610, cause the processing device 6 to carry out the processing method described with reference to figure 2.
  • the computer program may be stored externally to the processing device 6 on a non-transitory digital data support, e.g. on an external storage medium such as any of an SD card, an HDD, a CD-ROM, a DVD, a read-only and/or DVD drive and a DVD read/write drive, all known in the art.
  • the processing device 6 may comprise an interface to read the computer program. Further, the processing device 6 may access any number of Universal Serial Bus (USB)-type storage devices (e.g., “memory sticks.”) through corresponding USB ports (not shown).
  • the processing device 6 may be any of a TV set, a set-top-box, a media player, a game console, a server, a desktop computer, a laptop computer, an access point (wired or wireless), an internet gateway, a networking device, or any apparatus capable of running an audio-conference application.
  • a method for rendering an audio signal of a plurality of voice signals may comprise obtaining similarity values of the voice signals wherein a similarity value may indicate a level of similarity between two voice signals.
  • the method may further comprise rendering the audio signal by spatializing the voice signals based on the similarity values, the higher the level of similarity between two voice signals, the greater the distance between the two voice signals in the spatialized audio signal.
  • the similarity value may be obtained based on a first characteristic and a second characteristic respectively representative of each of the two voice signals.
  • the first characteristic and the second characteristic may be gender metadata respectively associated with each of the two voice signals.
  • the first characteristic and the second characteristic may be any of frequency spectra and power spectral densities of respectively each of the two voice signals.
  • the similarity value may be based on any of a distance and a cross correlation between at least a part of the first and the second characteristics.
  • a position of each voice signal may be obtained in the spatialized audio signal based on the similarity values.
  • the audio signal may comprise a plurality of audio output signals and the position of a voice signal may correspond to a set of panning factors to be applied to the audio output signals for spatializing the voice signal in the spatialized audio signal.
  • the position may belong to a set of dynamic positions in the spatialized audio signal.
  • the voice signals may correspond to participants in an audio conference
  • the method may further comprise displaying visual representations of the participants, by positioning the visual representations of the participants in a displayed image consistently with the positions of the corresponding voice signals in the spatialized audio signal.
  • the audio conference may be an audio video conference and the visual representation of a participant may comprise a video of the participant.
  • the apparatus may comprise a processor configured to execute the method for rendering an audio signal of a plurality of voice signals according to any embodiment disclosed herein.
  • a computer program product for rendering an audio signal of a plurality of voice signals is disclosed herein.
  • the computer program product may comprise program code instructions executable by a processor for executing the method for rendering an audio signal of a plurality of voice signals according to any embodiment disclosed herein.
  • a non-transitory computer readable storage media for rendering an audio signal of a plurality of voice signals is disclosed herein.
  • the computer readable storage media may comprise program code instructions executable by a processor for executing the method for rendering an audio signal of a plurality of voice signals according to any embodiment disclosed herein.
  • embodiments described herein may be employed in any combination or sub-combination.
  • embodiments described herein are not limited to the described variants, and any arrangement of variants and embodiments may be used.
  • embodiments described herein are not limited to any of the audio spatialization technique, the voice characteristic determination technique, and the similarity comparison technique described herein and any other type of audio spatialization, voice characteristic determination, and similarity comparison techniques may be applicable to embodiments described herein.
  • any characteristic, variant or embodiment described for a method is compatible with an apparatus comprising means for processing the disclosed method, with a device comprising a processor configured to process the disclosed method, with a computer program product comprising program code instructions and with a non-transitory computer-readable storage medium storing program instructions.
  • non-transitory computer-readable storage media include, but are not limited to, a read only memory (ROM), random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
  • processing platforms, computing systems, controllers, and other devices containing processors are noted. These devices may contain at least one Central Processing Unit (“CPU”) and memory.
  • Such acts and operations or instructions may be referred to as being "executed," "computer executed" or "CPU executed."
  • an electrical system represents data bits that can cause a resulting transformation or reduction of the electrical signals and the maintenance of data bits at memory locations in a memory system to thereby reconfigure or otherwise alter the CPU's operation, as well as other processing of signals.
  • the memory locations where data bits are maintained are physical locations that have particular electrical, magnetic, optical, or organic properties corresponding to or representative of the data bits. It should be understood that the representative embodiments are not limited to the above-mentioned platforms or CPUs and that other platforms and CPUs may support the provided methods.
  • the data bits may also be maintained on a computer readable medium including magnetic disks, optical disks, and any other volatile (e.g., Random Access Memory (“RAM”)) or non-volatile (e.g., Read-Only Memory (“ROM”)) mass storage system readable by the CPU.
  • the computer readable medium may include cooperating or interconnected computer readable medium, which exist exclusively on the processing system or are distributed among multiple interconnected processing systems that may be local or remote to the processing system. It is understood that the representative embodiments are not limited to the above-mentioned memories and that other platforms and memories may support the described methods.
  • the implementer may opt for some combination of hardware, software, and/or firmware.
  • the foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples may be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.
  • Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine.
  • examples of a signal bearing medium include, but are not limited to, the following: a recordable type medium such as a floppy disk, a hard disk drive, a CD, a DVD, a digital tape, a computer memory, etc., and a transmission type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
  • any two components so associated may also be viewed as being “operably connected”, or “operably coupled”, to each other to achieve the desired functionality, and any two components capable of being so associated may also be viewed as being “operably couplable” to each other to achieve the desired functionality.
  • examples of operably couplable include, but are not limited to, physically mateable and/or physically interacting components and/or wirelessly interactable and/or wirelessly interacting components and/or logically interacting and/or logically interactable components.
  • the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
  • the terms “any of” followed by a listing of a plurality of items and/or a plurality of categories of items, as used herein, are intended to include “any of,” “any combination of,” “any multiple of,” and/or “any combination of multiples of” the items and/or the categories of items, individually or in conjunction with other items and/or other categories of items.
  • the term “set” or “group” is intended to include any number of items, including zero.
  • the term “number” is intended to include any number, including zero.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)

Abstract

According to embodiments, similarity values of the voice signals may be obtained, wherein a similarity value may indicate a level of similarity between two voice signals. According to embodiments, the audio signal may be rendered by spatializing the voice signals based on the similarity values, the higher the level of similarity between the two voice signals, the greater the distance between the two voice signals in the spatialized audio signal.
PCT/EP2021/077898 2020-10-16 2021-10-08 Method and apparatus for rendering an audio signal of a plurality of voice signals WO2022078905A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP20306225.2 2020-10-16
EP20306225 2020-10-16

Publications (1)

Publication Number Publication Date
WO2022078905A1 (fr) 2022-04-21

Family

ID=73198229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/077898 WO2022078905A1 (fr) Method and apparatus for rendering an audio signal of a plurality of voice signals

Country Status (1)

Country Link
WO (1) WO2022078905A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070263823A1 (en) * 2006-03-31 2007-11-15 Nokia Corporation Automatic participant placement in conferencing
US20090112589A1 (en) * 2007-10-30 2009-04-30 Per Olof Hiselius Electronic apparatus and system with multi-party communication enhancer and method
US20100266112A1 (en) * 2009-04-16 2010-10-21 Sony Ericsson Mobile Communications Ab Method and device relating to conferencing
US20170346951A1 (en) * 2015-04-22 2017-11-30 Huawei Technologies Co., Ltd. Audio signal processing apparatus and method
US20190121516A1 (en) * 2012-12-27 2019-04-25 Avaya Inc. Three-dimensional generalized space
US20200176010A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Avoiding speech collisions among participants during teleconferences

Similar Documents

Publication Publication Date Title
  • US11832080B2 Spatial audio parameters and associated spatial audio playback
  • CN112513981A (zh) Spatial audio parameter merging
  • US9805725B2 Object clustering for rendering object-based audio content based on perceptual criteria
  • US9865274B1 Ambisonic audio signal processing for bidirectional real-time communication
  • KR101805110B1 (ko) Apparatus and method for sound stage enhancement
  • RU2678650C2 (ru) Audio object clustering with preservation of metadata
  • JP2023501728A (ja) Privacy-aware meeting room transcription from audio-visual streams
  • US20220059123A1 Separating and rendering voice and ambience signals
  • BR112020017360A2 (pt) Transformation of audio signals captured in different formats into a reduced number of formats to simplify encoding and decoding operations
  • CN108337535A (zh) Client video forwarding method, apparatus, device and storage medium
  • WO2022078905A1 (fr) Method and apparatus for rendering an audio signal of a plurality of voice signals
  • CN117079661A (zh) Sound source processing method and related apparatus
  • US20220392478A1 Speech enhancement techniques that maintain speech of near-field speakers
  • CN116189651A (zh) Multi-speaker sound source localization method and system for remote video conferencing
  • WO2018094968A1 (fr) Audio processing method and apparatus, and multimedia server
  • TW201517022A (zh) Coding of spherical harmonic coefficients
  • CN111951821B (zh) Call method and apparatus
  • WO2022133128A1 (fr) Binaural signal post-processing
  • Breebaart et al. Spatial coding of complex object-based program material
  • EP3488623B1 (fr) Audio object grouping based on a renderer-dependent perceptual difference
  • US20230276187A1 Spatial information enhanced audio for remote meeting participants
  • US20230262169A1 Core Sound Manager
  • WO2018017394A1 (fr) Audio object clustering based on a renderer-aware perceptual difference
  • EP3925236B1 (fr) Adaptive loudness normalization for audio object clustering
  • Wuolio et al. On the potential of spatial audio in enhancing virtual user experiences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21790162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21790162

Country of ref document: EP

Kind code of ref document: A1