WO2017211447A1 - Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location - Google Patents

Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location

Info

Publication number
WO2017211447A1
Authority
WO
WIPO (PCT)
Prior art keywords: location, talker, participant, listener, talkers
Application number: PCT/EP2017/000648
Other languages: English (en)
Inventor: Carlos Valenzuela
Original Assignee: Valenzuela Holding Gmbh
Application filed by Valenzuela Holding Gmbh
Publication of WO2017211447A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1822 Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H04R2201/401 2D or 3D arrays of transducers
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • The present invention relates to a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location.
  • Telephone or video conferences with several participants at different locations are often subject to acoustical problems, because the quality of the sound reproduction is limited and no natural conversation can be achieved.
  • A listener can only distinguish two other participants if the voices of both are sufficiently different. In particular when both participants speak simultaneously, a differentiation between the two talkers is sometimes impossible, and misunderstandings are inevitable.
  • The object of the present invention is to present a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location that permits a listener at the first location to better differentiate between the participants at the further locations during the conversation.
  • To this end, a method is proposed for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location, wherein the sound signal of each further participant is recorded and reproduced at the first location, wherein each participant who is not present at the first location is allocated a virtual position at the first location and the sound signal of the respective participant is reproduced from this virtual position, and wherein each sound signal is reproduced along a principal radiation direction.
  • The term "sound signal" in the singular refers to all the sound signals produced by one participant whenever he is talking.
  • The term "sound signals" in the plural refers to the sound signals produced by different participants.
  • The sound signals shall be recorded separately, but in the case of several participants at the same location it is also possible to record the signals jointly and to separate them by appropriate means before they are reproduced.
  • The method in accordance with the invention shall be used for the recording, transmission and reproduction of sound signals, in particular of spoken signals in the framework of a telephone or video conference. A distinction has to be made between a first location, where a first participant is located who is the listener, and further locations, where further participants are located who become talkers once they start to speak.
  • The terms listener and talker are consequently used for the participants according to the context.
  • The further participants at the further locations are hence potential talkers who may speak individually or simultaneously.
  • The described method provides excellent intelligibility for a listener at the first location. Obviously, in a telephone or video conference all participants may be talkers and listeners, and the described method may be applied at any of the different locations in order to achieve the best acoustic quality for all participants.
  • Each participant who is not present at the first location is allocated a virtual position at the first location, for example by means of the position of a loudspeaker or the position of a virtual sound source which reproduces the sound signal of the respective participant.
  • Each reproduced sound signal shall have a directivity (herein also called the sound signal direction or principal radiation direction of the sound signal) in order to simulate the speaking direction of a talker.
  • A human talker has a distinctive principal radiation direction which corresponds to the facing direction of the talker.
  • The sound signal directivity is generated by reproducing the sound signal with a directional sound source that has a principal radiation direction, i.e. by reproducing the sound signal along the principal radiation direction of the reproducing sound source.
  • The sound signal directivity corresponds to the principal radiation direction.
  • A directional loudspeaker may be employed, or any known method for simulating the principal radiation direction of a sound source (see below under "Voice Direction Production Unit") may be employed.
  • In the case of two simultaneous talkers, the principal radiation directions of the sound signals of both talkers are directed away from the participant at the first location, in particular at an angle of more than 40° between the principal radiation directions.
  • The sound signals reproduced at the first location have directivity, i.e. a direction which is specified by the principal radiation direction of the reproduced sound signal.
  • The principal radiation direction may change over time.
  • The recorded signal may have directivity as well, but this directivity does not need to be detected and recorded.
  • The directivity of the reproduced sound signal at the first location and the principal radiation direction may be determined independently of the original directivity and direction of emission.
  • In the case of one talker, the principal radiation direction of the reproduced sound signal is directed towards the participant at the first location.
  • In the case of two simultaneous talkers, the principal radiation directions of the sound signals of both talkers are directed away from the participant, preferably with an angle of more than 40°, and in particular of more than 90°, between the principal radiation directions. In this way, the listener can distinguish between the two talkers better than if both signals were directed towards the listener.
  • In the case of more than two simultaneous talkers, the direction of the sound signal of at least one talker, preferably of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant. Again, the listener can distinguish between the different signals better than in the case where all signals are directed towards the listener.
  • Preferably, all sound signals which are not directed towards the listener are directed in different directions, advantageously with the largest possible angles between them.
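  • These counting rules can be made concrete with a small sketch. The following Python fragment is a minimal illustration under assumed angle values (the description only requires more than 40° between two away-directed signals and mutually different away directions); none of the names are taken from the patent:

```python
# Minimal sketch of the direction rules above. Angle convention: 0 deg means
# the reproduced signal points straight at the listener; the concrete away
# angles (45 deg, 30 deg steps) are assumptions, not values from the patent.

def assign_radiation_directions(num_talkers: int) -> list[float]:
    """Return one principal radiation direction (degrees) per simultaneous talker."""
    if num_talkers == 1:
        return [0.0]                      # one talker: directed at the listener
    if num_talkers == 2:
        return [-45.0, 45.0]              # both away, 90 deg apart (> 40 deg)
    # more than two talkers: exactly one points at the listener,
    # all others point away in mutually different directions
    away = [(-1) ** i * (30.0 + 30.0 * (i // 2)) for i in range(num_talkers - 1)]
    return [0.0] + away

for n in (1, 2, 3, 5):
    print(n, assign_radiation_directions(n))
```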
  • The preceding method may in particular be used in a telephone conference where no further information, e.g. about the interaction of the participants, is available.
  • A visual representation may for example be a simple name tag, a photograph or, preferably, video images of a participant.
  • The position of the visual representation of a participant and the position of the sound source reproducing this participant are correlated with one another.
  • The method of the invention may be further enhanced by detecting factors concerning the attention of the participant at the first location and by controlling the principal radiation directions of the sound signals at the first location depending on the detected data.
  • Controlling the direction of the sound signals depending on the detected data may result in a deviation from the rules described above, or it may lead to a definite setting in case several alternatives for the principal radiation direction of a sound signal are possible.
  • The sound signal of the talker to whom the listener directs his attention may be directed towards the listener, and all other sound signals may be turned away from the listener, preferably in different directions.
  • In this way, the listener has the possibility to influence the reproduction of the sound signals.
  • The viewing direction or facing angle of the listener, or any other suitable factor, may be used.
  • In the case of one talker, the principal radiation direction of the sound signal may be controlled such that the signal is not directed towards the listener but away from him when the listener directs his attention elsewhere, e.g. is looking at another participant who is not talking.
  • The same is possible for more than one talker: if the listener is not directing his attention to one of the talkers but elsewhere, none of the signals is directed towards the listener, but all signals are directed away from the listener, in particular in different directions.
  • The principal radiation direction of the reproduced sound signals may also be influenced by factors at the talkers' end.
  • At least one visual representation of the participant of the first location may be arranged at at least one further location, and preferably representations of all participants of the respective other locations shall be arranged at this further location.
  • Factors regarding the attention of a talker may be detected, e.g. the viewing direction or facing angle of a talker.
  • The principal radiation directions of the reproduced sound signals at the first location may then be controlled depending on the detected data.
  • The principal radiation direction of the reproduced sound signal of a talker at the first location can e.g. be controlled such that it is directed towards the listener at the first location when the attention of the talker is directed at the visual representation of the listener at the respective location.
  • The principal radiation direction for one or more talkers can also be controlled such that the signals are only directed towards the listener if the talker's attention is focused on the listener; e.g. in case none of the talkers is looking at the listener, none of the signals is directed towards the listener, even in the case of only one talker.
  • Figure 1 shows schematically the setting for a telephone conference with enhanced audio performance at one location;
  • Figures 2a to 2c show the system of Figure 1 with one, two or three participants speaking;
  • Figure 3 shows the system of Figure 1 with additional video information;
  • Figure 4 shows the system of Figure 3 in a "control by listener" mode;
  • Figure 5 shows the system of Figure 3 with additional video information at one further location;
  • Figure 6 shows the system of Figure 5 in a "control by talker" mode;
  • Figure 7 shows the system of Figure 5 in a "control by listener" mode;
  • Figures 8 to 16 show further examples and embodiments of the present invention.
  • Figure 1 shows schematically a system that may be used for a telephone conference.
  • Participant P at the first location LOC1 is sitting in front of several loudspeakers LS1 to LS5, which reproduce the sound signals recorded separately from the five further participants P1 to P5, who are attending the conference from different remote locations LOC1' to LOC5'.
  • The loudspeakers LS1 to LS5 may also be virtual sound sources that are generated by e.g. stereophonic phantom source techniques or wave-field synthesis.
  • The loudspeakers LS1 to LS5 may also be virtual sound sources that are reproduced via headphones by using Head-Related Transfer Functions (HRTF) or other spatialization techniques. Data transmission between the different locations is made by means of a well-known data network NW. The sound signals are recorded at locations LOC1' to LOC5' without taking account of the directivity of the sound.
  • Figure 2a shows the setup of Figure 1 wherein participant P3, and only participant P3, is talking and hence becomes talker S3.
  • The sound signal of talker S3, but not the speaking direction, is recorded.
  • The sound signal is transmitted to the first location LOC1.
  • The sound signal is reproduced via loudspeaker LS3, which is assigned to participant/talker P3/S3.
  • The reproduced sound signal has directivity, and the signal is turned directly towards participant P at the first location LOC1, who becomes listener L.
  • Figure 3 shows the setting of Figure 1, wherein additional video information is available. Participant P at the first location LOC1 is sitting in front of video images V1 to V5 of each of the further participants. As information regarding the participant's attention, the viewing direction D of participant P is detected.
  • The situation of Figure 2c is shown in Figure 4, where participants P1, P3 and P4 are talking. Participant P / listener L at the first location is looking at the video image of participant P1. The signal of participant P1, who is talker S1, is directed to participant P / listener L, since the listener's attention is directed to this talker. Both other sound signals are directed away from the listener.
  • Figure 5 shows a setup similar to that of Figure 3, where information regarding a talker's attention is recorded.
  • The further participants P1 to P5 also have video images of each of the other participants in front of them; in order to keep the drawing clear, this is shown schematically for participant P1 only, with video images VP and VP2 to VP5.
  • Information concerning participant P1 is recorded, but the same may apply to all other participants; for clarity, this is not represented in the present figure.
  • Participant P1 is looking at the screen with the video images of participant P, i.e. listener L at the first location, while participants P3/S3 and P4/S4 are discussing between the two of them and looking at each other.
  • The sound signal of talker S1, reproduced by loudspeaker LS1, is directed to the listener L, and the signals of talkers S3 and S4, reproduced by loudspeakers LS3 and LS4, are turned away from the listener L in different directions.
  • The listener can influence the reproduction of the sound signals by looking at one of the talkers, so that this signal is directed to the listener.
  • Rules may be implemented to limit changes in the signal direction. It is e.g. possible to change the direction of the reproduced signals only if the listener L focuses his viewing direction on one talker for a certain time.
  • Figure 7 shows a combination of the previously described setups, wherein information regarding both the listener's and the talkers' attention is available.
  • Talker S1 is looking at the video image as a representation of the listener L, and talkers S3 and S4 are looking at each other. The listener is looking at talker S1. Consequently, the sound signal of talker S1 is directed to the listener L.
  • The listener L then turns his view towards talker S3. In accordance with the rules underlying the process, the sound signal of talker S3, which is reproduced by loudspeaker LS3, is turned towards the listener, whereas the sound signal of S1, reproduced by LS1, is turned away from the listener.
  • Depending on the implementation, only one or several reproduced sound signals may be turned towards a listener. If only one reproduced sound signal may be directed to the listener, the viewing direction of the listener prevails, and the sound signal of a talker looking at the listener is only directed towards the listener if the listener looks at this talker or looks at none of the talkers.
  • The audio-enhancement system of the present embodiment of the invention differs from the state of the art in that artificial speaking directions are provided which are independent of the facing angle of the corresponding talkers.
  • The technical effect of this difference is improved communication in conferencing systems, which goes beyond simple audio-source-separation improvements, even when no information about the facing angle of the talkers is available.
  • The audio-enhancement system can be operative at one or more receiving sites. This means that the audio-enhancement system is autonomous in the sense that its operation is independent of whether or not other participating sites employ such a system.
  • The audio-enhancement system comprises the following means to provide the virtual audio source positions, which are spatially separated: a Position Setting Unit, which specifies the spatial position from which a talker's voice shall be perceived at a specific listener location (the virtual audio source position), and a Position Production Unit, which produces the voice of a talker from the specified virtual audio source position.
  • The audio-enhancement system is characterized by the following means to provide artificial speaking directions that enhance speech intelligibility:
  • a Voice Direction Setting Unit, which specifies the voice direction of a talker that shall be perceived at a specific listener location, and a Voice Direction Production Unit, which produces the specified voice direction.
  • The audio-enhancement system can be operated in different modes. It is noted that the mode of operation can be selected individually at a receiving site, i.e. independently of the other remote sites. The possible modes of operation, which have an effect on the Voice Direction Setting Unit, correspond to the embodiments described below: control by the talker, control based on the number of talkers, and control by the listener.
  • The Voice Direction Setting Unit will specify the voice directions of the talkers based upon the chosen control input.
  • The mode of operation can be selected by the local participant, i.e. the user of the audio-enhancement system, or it can be selected automatically by the system itself. The automatic selection by the system is based upon whether or not visual representations of the remote participants are available.
  • The Position Setting Unit specifies the virtual audio source positions based on the screen configuration of the specific listener location whenever visual representations of the remote participants are present on the screen. If no visual representations are available, the Position Setting Unit automatically switches to specifying the virtual audio source positions based on the number of remote participants or the number of connected sites, as sketched below.
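  • A toy sketch of this automatic switching could look as follows; the function name and return labels are assumptions for illustration only:

```python
# Hedged sketch: the Position Setting Unit works from the screen configuration
# when visual representations exist, otherwise from the participant count.

def select_position_basis(visual_representations_on_screen: bool) -> str:
    """Choose which input drives the Position Setting Unit."""
    if visual_representations_on_screen:
        return "screen-configuration"   # positions follow the on-screen layout
    return "participant-count"          # fall back to number of participants/sites

print(select_position_basis(True))    # -> screen-configuration
print(select_position_basis(False))   # -> participant-count
```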
  • The necessary processing for the audio-enhancement system may be implemented in a centralized, a distributed or a hybrid manner.
  • In the centralized case, the processing of the audio-enhancement system takes place at a central location, such as in the cloud, a centralized Multipoint Control Unit (MCU), a central server, etc.
  • In the local case, the necessary processing takes place at the local site that uses the system, i.e. at one or more locations of participants who have the audio-enhancement system implemented at their site.
  • In the hybrid case, the necessary processing is distributed between a central location and the local site(s) in order to optimize network delays, errors, dropped frames, etc.
  • Embodiment 1: communication enhancement based on control input from the talker.
  • The voice direction of talker T is identified by the system, and at the remote locations the Voice Direction Setting Unit sets the voice direction of the talker T according to the control input by the talker.
  • The voice direction of the talker T at the remote locations is assigned according to the distribution of the positions of the remote listeners on the screen.
  • In an alternative, the voice direction of talker T is identified by the system and set by the Voice Direction Setting Unit according to the control input by the talker, but only three possible voice directions per location are used.
  • A talker T of a videoconference can address different remote participants (L1, L2, ... L5) who are displayed on his screen according to the screen configuration that he chose.
  • The chosen screen configuration may, for example, show all remote participants on the left side of the screen, positioned in a vertical row, to leave space on the right side of the screen to show e.g. a presentation.
  • The talker T can address the different remote listeners Ln individually, either by directing his voice to the person he wants to address, just as he would do in a real meeting, or by manually selecting the person he wants to address.
  • The manual selection may be accomplished by selecting the image of the target person on the screen (with a cursor, via touch-screen, by typing the name, or by any other suitable manual selection method known in the art). The selection is maintained until the talker selects another target person.
  • The audio-enhancement system transmits the selection, i.e. the object identifier that identifies the "current target listener", as meta-data.
  • Alternatively, the audio-enhancement system detects (a) the visual gazing direction of the talker by means of known gaze tracking techniques (methods for measuring the point where one is looking), and/or (b) the acoustic speaking direction of the talker by means of known acoustically-based tracking techniques (see e.g. WO2007/062840).
  • The most popular software-based techniques use 2D video images from which the eye position or the facing direction is extracted. Other techniques are based on analyzing 3D-camera video images.
  • The audio-enhancement system transforms the detected visual gazing direction and/or the detected acoustic speaking direction into an object identifier which identifies the person at whom the talker is talking, i.e. the "current target listener" Lx.
  • The transformation is accomplished by matching the detected visual gazing direction and/or the detected acoustic speaking direction, i.e. the measured point where the talker is looking, with the visual distribution of the remote participant images on the screen of the talker.
  • The visual distribution of the remote participants' images depends on the screen configuration that the talker has chosen.
  • The audio-enhancement system transmits the object identifier "current target listener" as meta-data. A sketch of this matching step follows below.
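  • A minimal sketch of the matching step, assuming the gaze tracker delivers a point in screen pixels and the screen configuration is a set of rectangular image tiles (the layout values are invented for illustration):

```python
# Hypothetical matching of a measured gaze point to the "current target
# listener" object identifier; the tile layout is an invented example.
from typing import Optional

# screen configuration: participant id -> (x, y, width, height) of their image
SCREEN_LAYOUT = {
    "L1": (0, 0, 320, 240),
    "L2": (0, 240, 320, 240),
    "L3": (0, 480, 320, 240),
}

def current_target_listener(gaze_x: float, gaze_y: float) -> Optional[str]:
    """Match the gaze point against the visual distribution of participant images."""
    for pid, (x, y, w, h) in SCREEN_LAYOUT.items():
        if x <= gaze_x < x + w and y <= gaze_y < y + h:
            return pid
    return None   # the talker is not looking at any participant image

print(current_target_listener(100, 300))   # -> L2
```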
  • The audio-enhancement system, which may be implemented either at the listener's site or in a centralized manner, executes the following processing steps: In a first step, the audio-enhancement system detects the visual gazing direction of the talker from the video image received at the listener's location or at the central processing location. As mentioned earlier, this is done by means of known gaze tracking techniques (e.g. software-based techniques using 2D video images).
  • The audio-enhancement system then transforms the detected visual gazing direction into an object identifier which identifies the person at whom the talker is talking, i.e. the "current target listener" Lx.
  • Information about the chosen screen configuration (conference layout and size of screen) at the talker's site is known because it is transmitted to the listener's site.
  • The information about the screen configuration at the talker's site can be transmitted as meta-data from the talker's site, or it is known to a central communication control unit, which organizes the audio and video streams to the connected sites and also transmits this information as meta-data.
  • The transformation is accomplished by matching the detected visual gazing direction, i.e. the measured point where the talker is looking, with the acquired information about the chosen screen configuration, i.e. about the visual distribution of the remote participants' images on the screen at the talker's site.
  • The audio-enhancement system transmits the object identifier "current target listener" as meta-data and uses it to set the voice direction of the talker at the listener's site.
  • In the case of several simultaneous talkers, the audio-enhancement system will transmit an object identifier "current target for talker Tx" for each simultaneous talker.
  • Specification of the audio source position based on the screen configuration and specification of the speaking direction based on input from the talker:
  • A camera captures the image of the talker T, and a microphone or microphone array captures the voice of the talker T.
  • The captured voice and image of the talker T are transmitted via a network (e.g. VoIP, Internet, cloud-based network, telephone network, computer network, or any other kind of communication network) to the remote participants Ln.
  • The system of the present invention reproduces the voice of the talker T at the remote sites as follows: (1) The spatial location from which the voice of the talker is perceived, i.e. the virtual audio source position at a remote site, is mapped to the position of the image of the talker T on the screen at the remote location (which may vary according to the chosen screen configuration), so that the voice of the talker T comes from the location that corresponds to his location on the screen. For example, at the remote location of listener L5 the talker is displayed in the middle of the screen. The system of the invention will therefore reproduce the transmitted voice of the talker T in such a way that the voice appears to originate from the middle of the screen.
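  • As a rough illustration of step (1), the horizontal position of the talker's image can be mapped linearly to an azimuth for the virtual audio source; the screen width and the ±30° rendering span are assumed values, not taken from the patent:

```python
# Hedged sketch: map the talker's on-screen image position to a virtual audio
# source azimuth so that the voice appears to come from the image.

SCREEN_WIDTH_PX = 1920        # assumed screen width
MAX_AZIMUTH_DEG = 30.0        # assumed azimuth at the left/right screen edge

def image_x_to_azimuth(image_center_x: float) -> float:
    """Linear map from horizontal screen position to azimuth (0 = screen centre)."""
    normalized = (image_center_x - SCREEN_WIDTH_PX / 2) / (SCREEN_WIDTH_PX / 2)
    return normalized * MAX_AZIMUTH_DEG

print(image_x_to_azimuth(960))    # talker displayed mid-screen -> 0.0
print(image_x_to_azimuth(1920))   # talker at the right edge   -> 30.0
```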
  • (2) The artificial voice direction (speaking direction) at a remote location, i.e. the production of a directional characteristic of the virtual sound source at a remote location, is mapped according to the control input of the talker, namely in such a way as to reproduce the information as to whom the talker is directing his speech.
  • If the addressed participant is the local listener, the voice direction of the talker is set to point to this addressed participant. For example, if the talker is addressing the listener L1 by directing his voice to the image of L1 on his screen, then the talker's voice at the remote location of listener L1 will be reproduced such that the sound source directivity pattern, i.e. the directional characteristic of the reproduced voice of the talker, is directed towards the listener L1.
  • At all other remote locations, the directivity pattern of the reproduced voice of the talker is set to point away from the remote listener.
  • In that case, the voice direction of the talker may be set in such a way as to correspond to the direction in which the addressed participant is displayed. If the talker is, for example, addressing the listener L1, then the talker's voice at all remote locations but the remote location of L1 will be directed towards the direction in which the image of L1 is displayed at the corresponding remote location. For example, as shown in Fig. 8A, at the remote location of listener L3 the voice of the talker T is directed to the right, where the listener L1 is positioned on the screen configuration of listener L3.
  • The following alternative is provided by the system: instead of providing only left and right voice directions according to the 2D representation of the participants on the screen, voice directions in between the left and the right direction, excluding a range of ±7° around the direction which points to the local listener position, are also used by the system. As shown in Fig. 8A, the directions in between (shown by dashed lines and labeled with the listener who is being addressed by this direction) are assigned according to the distribution of the positions of the remote listeners on the screen. For example, at the location of listener L2, the voice direction that addresses the listener L1 is shown by the dashed arrow with the label L1. The number of such additional voice directions can be limited to the number of separately perceivable voice directions.
  • The solid arrows pointing to the listener of the respective location represent the voice direction which is set when the listener of that location is being addressed by the talker. All other arrows represent the set voice direction when the listener of the respective location is not being addressed, but any one of the other listeners is being addressed. The labeling of the arrows indicates which listener is the addressed participant. A sketch of the ±7° exclusion rule follows below.
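  • The ±7° exclusion can be sketched as a small post-processing step on an assigned in-between direction; pushing an excluded angle to the cone boundary is an assumption, the text only states the excluded range:

```python
# Hedged sketch of the exclusion rule: in-between voice directions must stay
# outside a +/-7 deg cone around the direction aimed at the local listener.

EXCLUSION_DEG = 7.0

def apply_listener_exclusion(angle_deg: float, addresses_local_listener: bool) -> float:
    """0 deg points at the local listener; only an addressing talker may use it."""
    if addresses_local_listener:
        return 0.0
    if -EXCLUSION_DEG < angle_deg < EXCLUSION_DEG:
        # assumed behaviour: push the direction to the cone boundary
        return EXCLUSION_DEG if angle_deg >= 0 else -EXCLUSION_DEG
    return angle_deg

print(apply_listener_exclusion(3.0, False))   # -> 7.0
print(apply_listener_exclusion(3.0, True))    # -> 0.0
```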
  • Fig. 8B differs from Fig. 8A in that only three possible voice directions are provided at any location by the system: If the local participant is being addressed by a talker, the voice of the talker is set to point to the local participant (shown by the solid arrows in Fig. 8B). For example, if the talker T addresses the listener L4, the voice of the talker is set to point to the listener L4 at the remote location of L4.
  • If the local participant is not being addressed, the voice of the talker is set to point away from the remote listener.
  • In that case, the voice direction of the talker is set to correspond to the side on which the addressed participant is positioned on the screen. If the person is positioned to the right of the talker, the voice direction of the talker will be set to the right (e.g. listener L3 in the remote location of listener L4 in Fig. 8B), and vice versa if the person is positioned to the left of the talker (e.g. listeners L1, L2 and L5 in the remote location of listener L4 in Fig. 8B).
  • If the talker is positioned at either the left or the right edge of the screen (e.g. at the remote locations of listener L3 and of listener L2 in Fig. 8B), then instead of providing a left and a right voice direction, only two left or two right voice directions with different angles pointing away from the remote participant are provided (e.g. the left-pointing arrow (L4) and the left-pointing arrow (L3, L5, L1) at the remote location of listener L2 in Fig. 8B), since all displayed participants are positioned on only one side of the talker.
  • The above explanations referring to the one talker T apply in the same manner to all the other simultaneous talkers.
  • The audio production unit of the system produces the virtual sound source positions and the artificial voice directions at each remote location.
  • The reproduction methods and accuracy may vary between the remote locations depending on such factors as the available hardware at the remote locations and the employed reproduction techniques.
  • For producing the virtual sound source positions, any known method may be employed, such as for example: (a) stereophonic sound source reproduction techniques, including normal two-channel stereo systems as well as multi-channel surround sound systems, (b) wave-field synthesis, (c) ambisonics, (d) 2D and 3D loudspeaker clusters or loudspeaker arrays (e.g. multi-speaker display systems), or (e) spatial reproduction techniques for headphones.
  • The accuracy with which the virtual audio source location is reproduced by the audio reproduction unit of the system can be varied between the following possibilities: (a) the perceived position of the virtual sound source matches the image of the corresponding talker in azimuth, elevation and distance, (b) the position matches the image only in azimuth and elevation, and a generic distance which remains the same for any talker is used, or (c) the position matches the image only in azimuth, and a generic distance and elevation which remain the same for any talker are used.
  • The generic elevation would most preferably be chosen such that it matches the middle of the screen in the vertical direction.
  • The dynamic distribution of currently talking participants to the limited number of perceivable azimuth and elevation positions that can be produced by the system is accomplished by the Dynamic Position Distribution Unit as follows: each visual position on the screen is assigned to the closest possible perceivable audio source position that the system can create. If, however, two participants at two visual positions which are assigned to the same audio source position are talking simultaneously, then the talker that started first is mapped to its assigned audio source position, and the second talker, who started talking later (even if only a millisecond later), is assigned to the next closest audio source position that is possible. The same applies if even more than two participants are talking simultaneously.
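  • The behaviour of the Dynamic Position Distribution Unit can be sketched as a greedy assignment in talking order; data layout and names are assumptions:

```python
# Hedged sketch of the Dynamic Position Distribution Unit: the first talker
# keeps the closest producible position, a colliding later talker is moved to
# the next closest free position.

def distribute(talkers_in_start_order: list[tuple[str, float]],
               producible_positions: list[float]) -> dict[str, float]:
    """talkers: (talker_id, visual azimuth in deg), ordered by start of talking."""
    taken: set[float] = set()
    assignment: dict[str, float] = {}
    for talker_id, visual_az in talkers_in_start_order:
        # try producible positions from closest to farthest
        for pos in sorted(producible_positions, key=lambda p: abs(p - visual_az)):
            if pos not in taken:
                assignment[talker_id] = pos
                taken.add(pos)
                break
    return assignment

# two talkers whose images both map closest to -30 deg:
print(distribute([("T1", -28.0), ("T2", -33.0)], [-60.0, -30.0, 0.0, 30.0, 60.0]))
# -> {'T1': -30.0, 'T2': -60.0}  (T1 started first and keeps -30)
```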
  • For producing the artificial voice directions, any known method for simulating the principal radiation direction of a sound source may be employed, such as for example: (a) 2D and 3D loudspeaker clusters or loudspeaker arrays (e.g. multi-speaker display systems), (b) wave-field synthesis, which either employs monopole synthesis or appropriate directivity filters, (c) directivity reproduction as described in WO2007/062840, or (d) two-channel-based directivity reproduction for loudspeakers or headphones (as disclosed in a parallel patent application), or any other known method.
  • The principal radiation direction of a sound source is specified as the main direction of emission, i.e. the direction in which, on average over all relevant frequencies, the most sound energy is radiated.
  • Figure 9 shows the definition of the voice direction angle of a talker T with respect to the listener L.
  • The voice direction angle of a talker with respect to the listener is defined as the angle between the connecting line "talker to listener" and the arrow representing the voice direction of the talker, with the talker being the vertex.
  • A voice direction with an angle of 0° corresponds to the voice direction pointing at the listener.
  • A voice direction with an angle of +45° corresponds to the voice direction of a talker which points 45° to the right of the listener, as shown in Fig. 9.
  • The accuracy with which the artificial voice directions have to be produced depends on the sensitivity of the human ear. It is often sufficient to provide a limited number of different possible voice direction angles for one talker as perceived by the listener. For example, it may be enough to provide 12 to 18 distinctly perceivable voice directions in a range of 360° around the talker, i.e. to provide a voice direction every 20° to 30°.
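  • In a sketch, limiting the produced directions to, say, one every 30° (12 directions, within the 12 to 18 mentioned above) is a simple quantization:

```python
# Hedged sketch: snap an arbitrary voice direction angle to the nearest of 12
# distinctly perceivable directions (every 30 deg around the talker).

STEP_DEG = 30.0

def quantize_voice_direction(angle_deg: float) -> float:
    """Quantize a voice direction angle to the producible grid."""
    return (round(angle_deg / STEP_DEG) * STEP_DEG) % 360.0

print(quantize_voice_direction(47.0))    # -> 60.0
print(quantize_voice_direction(-14.0))   # -> 0.0
```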
  • Embodiment 2: communication enhancement based on the number of remote participants and the number of remote talkers.
  • The current invention improves communication by detecting the number of remote participants as well as the number of active remote talkers in order to provide the spatially separated virtual audio source positions and the artificial speaking directions.
  • The virtual audio source positions at the site of a specific listener are set as follows: If no visual information is available, such as in a teleconference, the Position Setting Unit determines the number of remote participants or the number of connected sites and assigns one virtual audio source position to each remote participant. The assigned virtual audio source positions remain the same throughout the conversation. Whenever a remote participant becomes a talker, the Position Production Unit produces the voice of the talker from the virtual audio source position that is assigned to this remote participant. If only a limited number of virtual audio source positions are available along the azimuth or elevation, the Dynamic Position Distribution Unit can be used (as described earlier) to dynamically distribute the current talkers to the best possible virtual audio source positions.
  • The possible virtual audio source positions along the azimuth may be evenly distributed on a semicircle in front of the listener, as sketched below, or they may be distributed according to the minimum audible angle in azimuth, leading to uneven spacing between the virtual audio source positions.
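  • The even distribution reduces to a short computation; the ±90° span of the frontal semicircle is the only assumption:

```python
# Hedged sketch of the Position Setting Unit fallback: one fixed virtual audio
# source azimuth per remote participant, spread evenly over the frontal
# semicircle (assumed to span -90..+90 deg).

def semicircle_positions(num_remote_participants: int) -> list[float]:
    """Evenly distribute azimuths (degrees) over the semicircle in front of the listener."""
    if num_remote_participants == 1:
        return [0.0]
    step = 180.0 / (num_remote_participants - 1)
    return [-90.0 + i * step for i in range(num_remote_participants)]

print(semicircle_positions(5))   # -> [-90.0, -45.0, 0.0, 45.0, 90.0]
```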
  • If visual representations are available, the Position Setting Unit sets the virtual audio source position for each remote participant to correspond with the position of his image, so that the voice of any talking remote participant comes from the location that corresponds to his location on the screen (in azimuth and elevation).
  • Here too, the Dynamic Position Distribution Unit can be used (as described earlier) to dynamically distribute the current talkers to the best possible virtual audio source positions.
  • The artificial voice directions at the site of a specific listener are set based on the number of active remote talkers.
  • In a first step, the Voice Direction Setting Unit determines the number of active remote talkers while keeping track of which talker was first in time, which second, and so on. Even small time deviations between two talkers that start talking at almost the same time are used by the system to determine which talker was first and which second. Based on the determined number of active remote talkers and their time order, the voice directions are set as follows:
  • The voice direction of the talker T is set based on the number of talking participants. If only one participant is talking, his voice direction at all remote listener sites is set to point to the listener at the respective remote site.
  • Fig. 10 shows, for example, that the voice direction of the talker T is set to point to the listener L1 and also to the listener L2, who are at different locations. This means that both listeners, L1 and L2, who are at different remote sites, will perceive the voice direction of the talker as being directed at them. The same applies to all other remote listeners. In other words, if only one remote participant is talking, the voice direction of this talker will be set at all other sites to point at the remote listener.
  • Figure 11 shows the setting of the voice directions of two remote participants who are talking simultaneously, based on the number of talking participants. If two remote participants are talking at the same time, the voice direction of each talker will be set to point away from the listener at all remote listener locations. This means that all remote listeners will perceive the voice directions of both talkers as being directed away from them. For example, in Fig. 11 the voice direction of talker T1 and the voice direction of simultaneous talker T2 are set to point away from the listener L1 at location 1 and the listener L2 at location 2.
  • Figure 12 shows the setting of the voice directions of three remote participants who are talking simultaneously, based on the number of talking participants.
  • Talker T1 was the first in time to start talking, T2 the second and T3 the third.
  • Figure 14 shows an alternative for setting the voice directions of multiple remote participants who are talking simultaneously, based on the number of talking participants. Preferred voice directions are represented in the following table:

    Simultaneous talker | Example 1      | Example 2
    T3                  | +20° or -20°   | +30° or -30°
    T4                  | +40° or -40°   | +60° or -60°
    T7, T8, ...         | +90° or -90°   | +90° or -90°

  • For the third, fourth, fifth, etc. simultaneous talker, the system provides the following alternative for setting the voice directions at all remote listener locations:
  • The voice direction of the third simultaneous talker is set to point away from the listener by an angle that is smaller than the angle chosen for the fourth simultaneous talker.
  • The angle chosen for the fourth talker is smaller than the angle chosen for the fifth talker, etc.
  • The table above shows two possible examples of voice direction angles that fulfill this requirement.
  • In the first example, the angle of talker T3 is chosen to be either +20° or -20°; accordingly, the angle of talker T4 is chosen to be larger, namely either +40° or -40°.
  • In the second example, the angle of talker T3 is chosen to be either +30° or -30°, and the angle of talker T4 is therefore chosen to be either +60° or -60°. It does not matter whether the positive or the negative angle is chosen, i.e. talker T3 may have either sign, and talker T4 may also have either sign.
  • The voice direction of the seventh, eighth, etc. simultaneous talker is set to +90° or -90°.
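  • Putting the counting rules and the example table together, a sketch of the Voice Direction Setting Unit for this embodiment could look as follows; the angle for the second of three or more talkers and the alternating sign choice are assumptions:

```python
# Hedged sketch combining the rules above, using the 30/60 deg example column.

def voice_direction_for(talker_rank: int, num_talkers: int) -> float:
    """talker_rank 1 = first to start talking; returns degrees, 0 = at the listener."""
    if num_talkers == 1:
        return 0.0                                   # single talker points at the listener
    if num_talkers == 2:
        return -45.0 if talker_rank == 1 else 45.0   # both away (> 40 deg apart)
    if talker_rank == 1:
        return 0.0                                   # exactly one talker at the listener
    if talker_rank == 2:
        return 45.0                                  # assumed angle for the second talker
    sign = -1 if talker_rank % 2 else 1              # sign choice is free per the text
    return sign * min(30.0 * (talker_rank - 2), 90.0)

for n in (1, 2, 4):
    print(n, [voice_direction_for(r, n) for r in range(1, n + 1)])
# 1 [0.0] / 2 [-45.0, 45.0] / 4 [0.0, 45.0, -30.0, 60.0]
```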
  • The above algorithm for specifying the speaking direction based on the number of simultaneous remote talkers may also be employed in an alternative embodiment of the mode "control by talker" whenever two or more simultaneous talkers are addressing the same one listener.
  • Embodiment 3: communication enhancement based on control input from the listener.
  • The system of the present invention provides the listener at any remote location with the option to choose for himself which talker he wants to have enhanced in order to hear that talker better than all the other talkers. This mode of operation is called "control by listener" and can be selected individually at any remote location.
  • The system provides this option not only when multiple simultaneous talkers are present, but also in situations where only one talker is talking.
  • A listener who is not being addressed by the talker then has the option to enhance the talker's voice by choosing the "control by listener" mode, thus forcing the Voice Direction Setting Unit to set the voice direction of the talker to point to himself, independently of any control input by the talker or of the number of talkers.
  • In the "control by listener" mode, the listener is allowed to select a talker as a preferred talker.
  • The selection can be made by any one of the following input methods: Manual selection of a talker: the preferred talker is selected by a manual input from the listener, for example by selecting the video image of the talker, or by selecting an avatar or other visible representation of the talker, whereby the selection may be accomplished with a touch screen, by pointing the cursor at the selected talker, etc.
  • Visual selection of a talker: the preferred talker is selected by detecting at which talker the listener is gazing. For this purpose, the viewing direction of the listener is determined by means of gaze-tracking. Whenever the listener focuses his gaze within a given spatial range for a preset time span (e.g. 3-5 seconds for one talker or for multiple simultaneous talkers, or e.g. 3-5 seconds for one talker, 3-5 seconds for two simultaneous talkers, and 5-10 seconds for three and more simultaneous talkers), this spatial range is interpreted as the facing angle of attention. The detected facing angle of attention is then translated into the information at which remote participant the listener is focusing his attention. A sketch of this dwell-time rule follows below.
  • Head-tracking selection of a talker: the preferred talker is selected by tracking the head orientation of the listener and correlating the detected head orientation with the virtual audio source positions to determine at which remote talker the listener is focusing his attention. This method is especially useful for reproduction setups without visual representations, e.g. headphone-based reproduction, where the listener orients himself by the virtual audio source positions rather than by images on a screen.
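  • The dwell-time rule of the visual selection can be sketched as a tiny state machine; timestamps and the 3-second threshold are assumptions within the example ranges given above:

```python
# Hedged sketch of gaze-based selection: a talker becomes the preferred talker
# once the listener's gaze has rested on him for a preset time span.
from typing import Optional

DWELL_SECONDS = 3.0   # one of the example values from the text

class GazeSelector:
    """Selects a preferred talker once the listener's gaze has dwelt long enough."""
    def __init__(self) -> None:
        self._candidate: Optional[str] = None
        self._since: float = 0.0

    def update(self, gazed_talker: Optional[str], now: float) -> Optional[str]:
        """Feed the currently gazed-at talker; returns a selection once the dwell expires."""
        if gazed_talker != self._candidate:
            self._candidate, self._since = gazed_talker, now   # gaze moved: restart timer
            return None
        if gazed_talker is not None and now - self._since >= DWELL_SECONDS:
            return gazed_talker                                # selected as preferred talker
        return None

sel = GazeSelector()
print(sel.update("T4", 0.0), sel.update("T4", 1.0), sel.update("T4", 3.2))
# -> None None T4
```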
  • The Voice Direction Setting Unit detects the control input from the listener, determines which talker is selected by the listener, and adjusts the voice directions of all the talkers as follows:
  • The voice direction of the selected talker is set to point to the listener.
  • The voice directions of all other talkers may be set to point away from the listener in such a way that each voice direction of a non-selected talker has a different voice direction angle with respect to the listener.
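  • A compact sketch of this adjustment, with assumed away-angles (the text only requires that the angles differ):

```python
# Hedged sketch of the "control by listener" adjustment: the selected talker
# points at the listener, every other talker gets a distinct away-angle.

def control_by_listener(talkers: list[str], selected: str) -> dict[str, float]:
    """Return a voice direction angle (degrees, 0 = at the listener) per talker."""
    directions = {selected: 0.0}
    others = [t for t in talkers if t != selected]
    for i, t in enumerate(others):
        sign = 1 if i % 2 == 0 else -1
        directions[t] = sign * (30.0 + 15.0 * (i // 2))   # assumed distinct angles
    return directions

print(control_by_listener(["T1", "T2", "T3", "T4"], "T3"))
# -> {'T3': 0.0, 'T1': 30.0, 'T2': -30.0, 'T4': 45.0}
```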
  • Alternatively, the listener may select two preferred talkers whose voice directions are set by the Voice Direction Setting Unit to point to the listener.
  • The selection of the preferred talkers is accomplished by (a) a manual selection of both talkers, (b) a visual selection of one talker and a manual selection of the other talker, or (c) a head-tracking selection of one talker and a manual selection of the other talker.
  • Figure 15 shows the setting of the voice directions of multiple simultaneous talkers based on the input of the listener.
  • Fig. 15 shows an example of a videoconference setup in which two listeners L1 and L2, who are at different locations, have chosen the option "control by listener" to control the voice direction of a preferred talker out of a multitude of simultaneous talkers.
  • The virtual sound source positions of the five simultaneous talkers T1-T5 are set by the Position Setting Unit to correspond with the images on the chosen screen configuration.
  • The listener L1 has chosen a screen configuration that displays all remote participants in a horizontal line next to each other.
  • The starting points of the vectors representing the voice direction of each remote talker are mapped to the corresponding positions on the screen.
  • The listener L2 has chosen a different screen configuration, which places the remote participants in both the horizontal and the vertical direction.
  • Here too, the virtual sound source positions are set by the Position Setting Unit to correspond with the images on the screen.
  • The control input by the listeners L1 and L2 is accomplished with the visual selection method explained earlier.
  • The dashed arrows at location 1 and location 2 show at which remote participant the respective listener is gazing.
  • Listener L1 is gazing at talker T4; listener L2 is focusing his attention on talker T3. If the listener's gaze is focused for longer than a preset time span on a specific talker, e.g. on talker T4 at location 1 or on talker T3 at location 2, then this talker is selected as the preferred talker at the respective location, and the voice direction of the preferred talker is set to point to the listener.
  • Accordingly, the Voice Direction Setting Unit at location 1 sets the voice direction of talker T4 to point to the listener L1, and at location 2 the voice direction of talker T3 is set to point to the listener L2.
  • The voice directions of all other talkers are set to point away from the respective listener L1 or L2.
  • The voice direction angles of all other talkers at one location are chosen to be different.
  • Figure 16 shows the setting of the voice direction of a remote talker, who is the only currently talking participant, based on the input of the listener at location 1.
  • Fig. 16 shows an example of a videoconference where only one remote participant is talking and where one remote listener, who is not being addressed by the talker, has chosen the "control by listener" mode to better hear the talking participant.
  • The talker T is the only participant currently talking. He is addressing listener L4, who is at location 4.
  • Listener L1 at location 1 is focusing his attention on the talker T by gazing at the image of talker T (shown by the dashed arrow pointing from the listener L1 to the talker T).
  • At location 4, the mode of operation is set to "control by talker"; accordingly, the voice direction of the talker T is set to point to the listener L4 at this location. At location 1, the voice direction of the talker T would likewise be pointing towards L4, as shown by the dashed arrow pointing towards L4, if the mode of operation "control by talker" were selected. However, because the user L1 has selected the mode "control by listener", the Voice Direction Setting Unit rotates the voice direction vector of the talker T towards the listener L1 (shown by the dashed curved arrow) and sets it to point towards the listener L1 (shown by the solid voice direction vector pointing to the listener L1).

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location. The object of the invention is to present such a method that permits a listener at a first location to better differentiate between the participants at the further locations during the conversation. According to the invention, such a method is proposed in which the sound signal of each further participant is recorded and reproduced at the first location, each participant who is not present at the first location is allocated a virtual position at the first location, the sound signal of the respective participant is reproduced from this virtual position, and each sound signal is reproduced along a principal radiation direction. The method is characterized in that: • in the case of one talker, the principal radiation direction of the reproduced sound signal of the respective talker is directed towards the participant at the first location; • in the case of two simultaneous talkers, the principal radiation directions of the sound signals of both talkers are directed away from the participant at the first location, in particular at an angle of more than 40° between the principal radiation directions; and • in the case of more than two simultaneous talkers, the principal radiation direction of the sound signal of at least one talker, in particular of exactly one talker, is directed towards the participant at the first location, and the principal radiation directions of all other sound signals are directed away from the participant at the first location, in particular in different directions.
PCT/EP2017/000648 2016-06-06 2017-06-06 Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location WO2017211447A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102016006731.4 2016-06-06
DE102016006731 2016-06-06

Publications (1)

Publication Number Publication Date
WO2017211447A1 (fr)

Family

ID=59363084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/000648 WO2017211447A1 (fr) Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location

Country Status (1)

Country Link
WO (1) WO2017211447A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0410744A * (ja) * 1990-04-27 1992-01-14 Nippon Telegr & Teleph Corp <Ntt> Conference call terminal equipment
WO2007062840A1 (fr) 2005-11-30 2007-06-07 Miriam Noemi Valenzuela Procédé pour enregistrer et reproduire les signaux sonores d'une source sonore présentant des caractéristiques directives variables dans le temps
US20160105758A1 (en) * 2005-11-30 2016-04-14 Valenzuela Holding Gmbh Sound source replication system
US20130170678A1 (en) * 2007-04-04 2013-07-04 At&T Intellectual Property I, L.P. Methods and systems for synthetic audio placement
US20160127846A1 (en) * 2012-06-27 2016-05-05 Volkswagen Ag Devices and methods for conveying audio information in vehicles

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHOJI SHIMADA ET AL: "A NEW TALKER LOCATION RECOGNITION THROUGH SOUND IMAGE LOCALIZATION CONTROL IN MULTIPOINT TELECONFERENCES SYSTEM", ELECTRONICS & COMMUNICATIONS IN JAPAN, PART I - COMMUNICATIONS, WILEY, HOBOKEN, NJ, US, vol. 72, no. 2, 1 February 1989 (1989-02-01), pages 20 - 27, XP000124912, ISSN: 8756-6621 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102666792B1 * (ko) * 2018-07-30 2024-05-20 Sony Group Corporation Information processing device, information processing system, information processing method and program
US20230300251A1 (en) * 2021-09-07 2023-09-21 Verizon Patent And Licensing Inc. Systems and methods for videoconferencing with spatial audio

Similar Documents

Publication Publication Date Title
US11991315B2 (en) Audio conferencing using a distributed array of smartphones
US10805575B2 (en) Controlling focus of audio signals on speaker during videoconference
US10516852B2 (en) Multiple simultaneous framing alternatives using speaker tracking
US10447970B1 (en) Stereoscopic audio to visual sound stage matching in a teleconference
US8571192B2 (en) Method and apparatus for improved matching of auditory space to visual space in video teleconferencing applications using window-based displays
US9253572B2 (en) Methods and systems for synthetic audio placement
US9113034B2 (en) Method and apparatus for processing audio in video communication
JP4255461B2 (ja) Stereo microphone processing for teleconferencing
US9049339B2 (en) Method for operating a conference system and device for a conference system
EP2352290B1 (fr) Method and device for aligning audio and video signals during a videoconference
US20100328419A1 (en) Method and apparatus for improved matching of auditory space to visual space in video viewing applications
US20050280701A1 (en) Method and system for associating positional audio to positional video
US7720212B1 (en) Spatial audio conferencing system
JP2006254064A (ja) Remote conference system, sound image position allocation method, and sound quality setting method
WO2011120407A1 (fr) Method and apparatus for realizing video communication
WO2017211447A1 (fr) Method for reproducing sound signals at a first location for a first participant within a conference with at least two further participants at at least one further location
JP2009246528A (ja) Voice communication system with images, voice communication method with images, and program
US20230138733A1 (en) Representation of natural eye contact within a video conferencing session
JP2006339869A (ja) Device for integrating video signals and audio signals
JPH0758859A (ja) Information transmitting device and information receiving device for remote conferences
WO2017211448A1 (fr) Method for generating a two-channel signal from a single-channel signal of a sound source
US20220303149A1 (en) Conferencing session facilitation systems and methods using virtual assistant systems and artificial intelligence algorithms
WO2023286320A1 (fr) Information processing device and method, and program
US20240129433A1 (en) IP based remote video conferencing system
JP2023043497A (ja) Remote conference system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17740288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17740288

Country of ref document: EP

Kind code of ref document: A1