WO2022118671A1 - Information processing device, information processing method, and program

Info

Publication number
WO2022118671A1
WO2022118671A1 PCT/JP2021/042528 JP2021042528W WO2022118671A1 WO 2022118671 A1 WO2022118671 A1 WO 2022118671A1 JP 2021042528 W JP2021042528 W JP 2021042528W WO 2022118671 A1 WO2022118671 A1 WO 2022118671A1
Authority
WO
WIPO (PCT)
Prior art keywords
participant
localization
voice
information processing
sound image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/042528
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
健太郎 木村
康之 古賀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Group Corp
Original Assignee
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corp filed Critical Sony Group Corp
Priority to JP2022566836A priority Critical patent/JPWO2022118671A1/ja
Priority to US18/038,696 priority patent/US20230419985A1/en
Publication of WO2022118671A1 publication Critical patent/WO2022118671A1/ja
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities; audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • The present technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that make it possible to easily distinguish between the voice of a real participant and the voice of a remote participant.
  • So-called remote conferences, in which a remote user participates in a conference using a device such as a PC, are becoming widespread. A participant's voice collected by a microphone is transmitted via a server to the devices used by the other participants and is output from headphones or a speaker. This allows each participant to converse with the other participants.
  • Patent Document 1 discloses a technique in which virtual speech positions are set at intervals, and the voice of a participant who is expected to make an important statement at a meeting is localized in front of the receiver using a head-related transfer function.
  • The present technology was made in view of such a situation, and makes it possible to easily distinguish between the voice of a real participant and the voice of a remote participant.
  • The information processing device of one aspect of the present technology includes a sound image localization processing unit that localizes the sound image of the voice of a remote participant, who remotely participates in a conversation held in a predetermined space, at a position different from the position of a real participant, who is a participant present in the predetermined space.
  • In one aspect of the present technology, the sound image of the voice of a remote participant who remotely participates in a conversation held in a predetermined space is localized at a position different from the position of a real participant who is a participant present in the predetermined space.
  • FIG. 15 is a diagram explaining the flow of setting the localization position of the voice of a remote participant.
  • FIG. 16 is a diagram explaining the flow of setting the localization position of the voice of a remote participant.
  • FIG. 17 is a diagram explaining the flow of setting the localization position of the voice of a remote participant, continuing from FIG. 15.
  • FIG. 1 is a diagram showing a configuration example of a Tele-communication system according to an embodiment of the present technology.
  • the Tele-communication system of FIG. 1 is configured by connecting a client terminal used by a conference participant to a communication management server 1 via a network 11 such as the Internet.
  • Client terminals 2A to 2D, which are PCs, are shown as the client terminals used by users A to D, who are participants in the conference.
  • Other devices such as smartphones and tablet terminals may be used as client terminals.
  • When there is no need to distinguish between the client terminals 2A to 2D, they are referred to as the client terminal 2 as appropriate.
  • Users A to D are users who participate in the same conference. Users A to D wear, for example, stereo earphones (inner ear headphones) and participate in the conference. For example, open ear type (open type) earphones that do not seal the ear canal are used. The appearance of the earphones used by users A to D will be described later.
  • the users A to D can hear the surrounding sound as well as the sound output by the client terminal 2.
  • a microphone is provided at a predetermined position on the earphone housing.
  • the earphone and the client terminal 2 are connected by wire via a cable or wirelessly via communication of a predetermined standard such as wireless LAN or Bluetooth (registered trademark). Voice is transmitted and received between the earphone and the client terminal 2.
  • Each user prepares the client terminal 2 and participates in the conference with the earphones attached. For example, as shown at the bottom of FIG. 1, users A to C participate in a conference from the same space such as a conference room in an office. Further, the user D participates in the conference remotely from his / her home.
  • the number of users participating in the same conference is not limited to four.
  • the number of users who participate from the same space and the number of users who participate remotely can also be changed arbitrarily.
  • the communication management server 1 manages a conference that is advanced by having a plurality of users have a conversation online.
  • the communication management server 1 is an information processing device that controls the transmission and reception of voices between client terminals 2 and manages so-called remote conferences.
  • The communication management server 1 receives the voice data of user A transmitted from the client terminal 2A in response to user A speaking. The client terminal 2A transmits the voice data of user A collected by the microphone that user A uses.
  • The communication management server 1 transmits the voice data of user A to the client terminal 2D and causes the voice of user A to be output.
  • Users B to D can hear the voice of user A.
  • When the earphones used by user B and user C are open type earphones, user B and user C, who are in the same space as user A, can hear the voice of user A directly.
  • The voice of user A may also be transmitted to users B and C via the communication management server 1. In this case, user B, user C, and user D each hear the voice of user A transmitted via the communication management server 1.
  • the voice data transmitted from the client terminal 2B or the client terminal 2C is transmitted to the client terminal 2D via the communication management server 1.
  • the communication management server 1 receives the voice data of the user D transmitted from the client terminal 2D in response to the utterance of the user D.
  • the communication management server 1 transmits the voice data of the user D to the client terminals 2A to 2C, respectively, and outputs the voice of the user D.
  • users A to C can hear the voice of user D.
  • Users A to C, who participate in the conference from the conference room of the office, are real participants.
  • A space within a predetermined range centered on a reference position is treated as the same space.
  • A participant who participates in a conference from a space different from the space where the real participants are present is called a remote participant.
  • the user D who participates in the conference by himself from his / her home is a remote participant.
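  • The distinction between real and remote participants described above can be illustrated with a minimal sketch. It assumes 2D coordinates in metres, a single reference position, and a fixed radius as the "predetermined range"; the function name `classify_participants`, the coordinates, and the radius are hypothetical and do not come from the publication.

```python
import math

def classify_participants(positions, reference, radius):
    """Label each participant 'real' if within `radius` of `reference`, else 'remote'.

    positions: dict of name -> (x, y); reference: (x, y); radius: float.
    All values are illustrative assumptions, not taken from the publication.
    """
    labels = {}
    for name, (x, y) in positions.items():
        distance = math.hypot(x - reference[0], y - reference[1])
        labels[name] = "real" if distance <= radius else "remote"
    return labels

# Users A to C sit in the conference room; user D joins from home, far away.
positions = {
    "A": (1.0, -1.0),
    "B": (1.0, 1.0),
    "C": (-1.0, -1.0),
    "D": (5000.0, 1200.0),
}
labels = classify_participants(positions, reference=(0.0, 0.0), radius=5.0)
```

  • With these assumed values, users A to C are labelled as real participants and user D as a remote participant.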
  • <Audio localization during a meeting> When the voice transmitted from the communication management server 1 is output, the client terminal 2 performs sound image localization processing. The voice transmitted from the communication management server 1 is localized at a predetermined position in the space and output.
  • The client terminals 2 used by users A to C perform sound image localization processing that localizes the voice of user D, a remote participant, at a predetermined position in the conference room, and output the resulting voice of user D from the earphones used by each user. Users A to C perceive the sound image of user D's voice as if it were coming from the position set as the localization position.
  • the localization position of the voice of the user D who is a remote participant is set to a predetermined position in the conference room.
  • Information on the localization position set by the communication management server 1 is provided to the client terminal 2 and used for sound image localization processing.
  • FIG. 3 is a diagram showing an example of setting the localization position.
  • the space shown in FIG. 3 from above is a conference room in which users A to C are located.
  • User A and User B sit side by side on the right side of the table T provided in the conference room, and User C sits on the left side of the table T in front of User A.
  • Users A to C are sitting facing the table T, respectively.
  • When user A speaks, the voice of user A is naturally heard from the left for user B and from the front for user C.
  • When users A to C, who are real participants, are seated as shown on the left side of FIG. 3, the communication management server 1 sets the areas where users A to C are not present, shown hatched on the right side of FIG. 3, as localizable areas, which are areas in which the voice of user D, a remote participant, can be localized.
  • The area on the circle passing through the positions of users A to C that lies between user B and user C is set as the localizable area A1. Further, the area between user A and user B is set as the localizable area A2, and the area between user A and user C is set as the localizable area A3. The localizable areas A1 to A3 are arcuate areas having a predetermined width.
  • The communication management server 1 sets a predetermined position in a localizable area as the localization position of the voice of user D.
  • A position in the localizable area A1 is set as the localization position of the voice of user D.
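  • The geometry of the localizable areas can be sketched as follows. The sketch computes the circle through three seated users and takes the midpoint of each arc between angularly adjacent users as a candidate localization position. The seat coordinates and the choice of the arc midpoint are assumptions made for illustration; the publication only states that a predetermined position within a localizable area is used, and the arc width is ignored here.

```python
import math

def circumcircle(p1, p2, p3):
    """Return (center, radius) of the circle through three non-collinear points."""
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; no unique circle")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

def localizable_arcs(users):
    """Map each pair of angularly adjacent users to the midpoint of the arc between them.

    users: dict with exactly three entries, name -> (x, y).
    """
    names = list(users)
    center, radius = circumcircle(*(users[n] for n in names))
    two_pi = 2.0 * math.pi
    angle = {
        n: math.atan2(users[n][1] - center[1], users[n][0] - center[0]) % two_pi
        for n in names
    }
    order = sorted(names, key=angle.get)
    arcs = {}
    for i, start in enumerate(order):
        end = order[(i + 1) % len(order)]
        span = (angle[end] - angle[start]) % two_pi
        mid = angle[start] + span / 2.0
        arcs[frozenset((start, end))] = (
            center[0] + radius * math.cos(mid),
            center[1] + radius * math.sin(mid),
        )
    return arcs

# Assumed seating: A and B side by side on the right, C on the left facing A.
users = {"A": (1.0, -1.0), "B": (1.0, 1.0), "C": (-1.0, -1.0)}
arcs = localizable_arcs(users)
a1 = arcs[frozenset(("B", "C"))]  # candidate position in localizable area A1
```

  • With this assumed seating, the candidate in area A1 falls at the empty corner of the table, between user B and user C.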
  • the setting of the voice localization position of the remote participant will be described in detail later.
  • the voice localization position of the user D is set in the communication management server 1 in this way, and the sound image localization process is performed in the client terminal 2.
  • The voice of user D is heard from diagonally forward right for user A and from approximately the front for user B. For user C, the voice of user D is heard from approximately the left.
  • the utterance shown in the balloon is the utterance of user D.
  • the multiple circles of the balloon source schematically show the sound image of the voice of the user D.
  • the sound image of the voice of the user D will be felt in a position where no real participant is present.
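  • How one localization position yields a different perceived direction for each listener can be sketched as below. The seat coordinates, the facing directions (each user facing the table), and the sign convention (positive azimuth to the listener's left) are assumptions for illustration; `relative_direction` is a hypothetical helper.

```python
import math

def relative_direction(listener, facing_deg, source):
    """Return (azimuth_deg, distance) of `source` as seen from `listener`.

    facing_deg is the direction the listener faces, measured counterclockwise
    from the +x axis. Azimuth is in (-180, 180]: 0 is straight ahead,
    positive is to the listener's left, negative to the right.
    """
    dx, dy = source[0] - listener[0], source[1] - listener[1]
    bearing = math.degrees(math.atan2(dy, dx))
    azimuth = (bearing - facing_deg + 180.0) % 360.0 - 180.0
    if azimuth == -180.0:
        azimuth = 180.0
    return azimuth, math.hypot(dx, dy)

# Assumed layout: A and B on the right facing left (180 deg), C on the left facing right (0 deg).
voice_d = (-1.0, 1.0)  # assumed localization position of user D's voice
az_a, _ = relative_direction((1.0, -1.0), 180.0, voice_d)   # user A
az_b, _ = relative_direction((1.0, 1.0), 180.0, voice_d)    # user B
az_c, _ = relative_direction((-1.0, -1.0), 0.0, voice_d)    # user C
```

  • Under these assumptions, the voice is about 45 degrees to the right of straight ahead for user A, straight ahead for user B, and 90 degrees to the left for user C, which matches the directions described for FIG. 3.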
  • Users A to C are wearing earphones 3, respectively.
  • Because the localization position of the remote participant's voice is set at a position where there is no real participant, each real participant can easily distinguish the voice of the remote participant from the voices of the other real participants.
  • If the localization position of the remote participant's voice were set without considering the positions of the real participants, the localization position of the voice of user D, a remote participant, might be set to the same position as user B, a real participant, as shown in FIG. 6. In this case, the voice of user D would be heard from the position of user B, and user A and user C could not tell who is speaking. The present technology makes it possible to prevent such a state.
  • FIG. 7 is a diagram showing the appearance of the earphone.
  • the earphone 3 worn by each user is composed of a right side unit 3R and a left side unit 3L (not shown). As shown in an enlarged manner in the balloon of FIG. 7, the right side unit 3R is configured by joining the driver unit 31 and the ring-shaped mounting portion 33 via a U-shaped sound conduit 32. The right side unit 3R is mounted by pressing the mounting portion 33 around the outer ear hole and sandwiching the right ear between the mounting portion 33 and the driver unit 31.
  • the left side unit 3L has the same configuration as the right side unit 3R.
  • the left side unit 3L and the right side unit 3R are connected by wire or wirelessly.
  • the driver unit 31 of the right side unit 3R receives the audio signal transmitted from the client terminal 2 and outputs the sound corresponding to the audio signal from the tip of the sound conduit 32 as shown by arrow # 1.
  • a hole portion for outputting sound toward the external ear canal is formed.
  • the mounting portion 33 has a ring shape. Along with the sound output from the tip of the sound conduit 32, the surrounding sound also reaches the external ear canal as shown by arrow # 2.
  • the earphone 3 is an open type earphone that does not seal the ear canal.
  • the driver unit 31 is provided with a microphone.
  • a device other than the earphone 3 may be used as an output device used for listening to the voices of the participants in the conference.
  • FIG. 8 is a diagram showing an example of an output device.
  • a closed type headphone as shown in A of FIG. 8 or a shoulder-mounted neckband speaker as shown in B of FIG. 8 is used.
  • Speakers are provided on the left and right units that make up the neckband speaker, and sound is output toward the user's ears.
  • a microphone is also provided in the headphones and neckband speaker to collect the user's voice.
  • FIG. 9 is a block diagram showing a hardware configuration example of the communication management server 1.
  • the communication management server 1 is composed of a computer.
  • the communication management server 1 may be configured by one computer having the configuration shown in FIG. 9, or may be configured by a plurality of computers.
  • the CPU 101, ROM 102, and RAM 103 are connected to each other by the bus 104.
  • the CPU 101 executes the server program 101A and controls the overall operation of the communication management server 1.
  • the server program 101A is a program for realizing a Tele-communication system.
  • An input / output interface 105 is further connected to the bus 104.
  • An input unit 106 including a keyboard, a mouse, and the like, and an output unit 107 including a display, a speaker, and the like are connected to the input / output interface 105.
  • the input / output interface 105 is connected to a storage unit 108 made of a hard disk, a non-volatile memory, etc., a communication unit 109 made of a network interface, etc., and a drive 110 for driving the removable media 111.
  • the communication unit 109 communicates with the client terminal 2 used by each user via the network 11.
  • FIG. 10 is a block diagram showing a functional configuration example of the communication management server 1. At least a part of the functional units shown in FIG. 10 is realized by executing the server program 101A by the CPU 101 of FIG.
  • the information processing unit 121 is realized in the communication management server 1.
  • the information processing unit 121 includes a position information acquisition unit 131, a localization position setting unit 132, a localization position information transmission unit 133, a voice reception unit 134, and a voice transmission unit 135.
  • the position information acquisition unit 131 acquires the position of the actual participant. From the client terminal 2 used by the real participant, the position information indicating the position of the real participant is transmitted. When there are a plurality of real participants, the positions of the respective real participants are acquired based on the position information transmitted from each client terminal 2. The position information acquired by the position information acquisition unit 131 is supplied to the localization position setting unit 132.
  • the localization position setting unit 132 sets the localization possible area at a position where there is no real participant, that is, a position different from the position of the real participant, based on the position of the real participant in the same space. Further, the localization position setting unit 132 sets a predetermined position in the localization possible area as the localization position of the voice of the remote participant.
  • the localization position information which is the information on the localization position of the voice of the remote participant set by the localization position setting unit 132, is supplied to the localization position information transmission unit 133 together with the position information of the actual participant.
  • the localization position information transmission unit 133 controls the communication unit 109 and transmits the localization position information supplied from the localization position setting unit 132 to the client terminal 2 used by each actual participant.
  • the localization position information transmission unit 133 transmits the localization position information and the position information of the actual participant to the client terminal 2 used by each remote participant. For example, in the client terminal 2 used by the remote participant, the voice of each real participant is heard from the direction corresponding to the position of the real participant with reference to the position of the remote participant itself represented by the localization position information. Sound image localization processing is performed so that it can be heard.
  • the voice receiving unit 134 controls the communication unit 109 and receives the voice data transmitted from the client terminal 2 used by the participant who made the utterance.
  • the voice data received by the voice receiving unit 134 is output to the voice transmitting unit 135.
  • the voice transmitting unit 135 controls the communication unit 109 and transmits the voice data supplied from the voice receiving unit 134 to the client terminal 2 used by the participant who is the listener.
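  • The forwarding performed by the voice receiving unit 134 and the voice transmitting unit 135 can be sketched as a listener selection step. The sketch assumes each participant is tagged with a space identifier and that forwarding to participants in the speaker's own space is optional, since they can hear the speaker directly through open type earphones. The function name `select_listeners` and the `relay_within_space` flag are hypothetical.

```python
def select_listeners(speaker, spaces, relay_within_space=False):
    """Return the participants who should receive the speaker's voice data.

    spaces: dict of participant name -> space identifier (illustrative values).
    relay_within_space: if False, participants sharing the speaker's space are
    skipped because they can hear the speaker directly.
    """
    listeners = []
    for name, space in spaces.items():
        if name == speaker:
            continue
        if space == spaces[speaker] and not relay_within_space:
            continue
        listeners.append(name)
    return listeners

spaces = {"A": "room", "B": "room", "C": "room", "D": "home"}
to_remote_only = select_listeners("A", spaces)
to_everyone = select_listeners("A", spaces, relay_within_space=True)
from_remote = select_listeners("D", spaces)
```

  • With these assumed values, user A's voice goes only to user D by default, to users B, C, and D when relaying within the room is enabled, and user D's voice goes to users A to C.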
  • FIG. 11 is a block diagram showing a hardware configuration example of the client terminal 2.
  • the client terminal 2 is configured by connecting a memory 202, a voice input unit 203, a voice output unit 204, an operation unit 205, a communication unit 206, a display 207, and a camera 208 to the control unit 201.
  • the control unit 201 is composed of a CPU, ROM, RAM, and the like.
  • the control unit 201 controls the overall operation of the client terminal 2 by executing the client program 201A.
  • the client program 201A is a program for using the Tele-communication system managed by the communication management server 1.
  • the client terminal 2 is realized by installing the client program 201A, which is a dedicated application program, on a general-purpose PC.
  • The client terminal 2 may be realized by mounting a DSP board and an A/D and D/A conversion board on a general-purpose PC, or may be realized by a dedicated device.
  • the memory 202 is composed of a flash memory or the like.
  • the memory 202 stores various information such as the client program 201A executed by the control unit 201.
  • the voice input unit 203 communicates with the earphone 3 and receives the voice transmitted from the earphone 3. From the earphone 3, the user's voice collected by the microphone provided in the earphone 3 is transmitted. The voice received by the voice input unit 203 is output to the control unit 201 as a microphone voice.
  • the voice input may be performed using a microphone provided in the client terminal 2 or may be performed using an external microphone connected to the client terminal 2.
  • the audio output unit 204 communicates with the earphone 3 and transmits an audio signal supplied from the control unit 201 to output the audio of the participants of the conference from the earphone 3.
  • the operation unit 205 is composed of an input unit such as a keyboard and a touch panel provided on the display 207.
  • the operation unit 205 outputs information representing the content of the user's operation to the control unit 201.
  • the communication unit 206 is a communication module compatible with wireless communication of mobile communication systems such as 5G communication, and a communication module compatible with wireless LAN and the like.
  • the communication unit 206 communicates with the communication management server 1 via the network 11 which is an IP communication network.
  • the communication unit 206 receives the information transmitted from the communication management server 1 and outputs it to the control unit 201. Further, the communication unit 206 transmits the information supplied from the control unit 201 to the communication management server 1.
  • the display 207 is composed of an organic EL display, an LCD, and the like. Various screens such as a remote conference screen are displayed on the display 207.
  • the camera 208 is composed of, for example, an RGB camera.
  • the camera 208 photographs a user who is a participant of the conference and outputs the image to the control unit 201. Not only the voice but also the transmission / reception of the image taken by the camera 208 is appropriately performed between the client terminals 2 via the communication management server 1.
  • FIG. 12 is a block diagram showing a functional configuration example of the client terminal 2. At least a part of the functional units shown in FIG. 12 is realized by executing the client program 201A by the control unit 201 of FIG.
  • the information processing unit 211 is realized in the client terminal 2.
  • the information processing unit 211 is composed of a reproduction processing unit 221, a voice transmission unit 222, and a user position detection unit 223.
  • the information processing unit 211 of FIG. 12 will be mainly described as the configuration of the client terminal 2 used by the actual participants.
  • the reproduction processing unit 221 is composed of an audio receiving unit 241, a localization position acquisition unit 242, a sound image localization processing unit 243, an HRTF data storage unit 244, and an output control unit 245.
  • the voice receiving unit 241 controls the communication unit 206 and receives the voice data transmitted from the communication management server 1. Voice data of other participants such as remote participants is transmitted from the communication management server 1. The audio data received by the audio receiving unit 241 is supplied to the sound image localization processing unit 243.
  • the localization position acquisition unit 242 controls the communication unit 206 and receives the localization position information transmitted from the communication management server 1. From the communication management server 1, localization position information indicating the localization position of the voice of the remote participant is transmitted. The localization position information received by the localization position acquisition unit 242 is supplied to the sound image localization processing unit 243.
  • The sound image localization processing unit 243 obtains, from the HRTF data storage unit 244, the HRTF (Head-Related Transfer Function) data corresponding to the positional relationship (direction and distance) between the position of the user of the client terminal 2, who is a real participant, and the localization position of the remote participant's voice represented by the localization position information.
  • The HRTF data storage unit 244 stores HRTF data representing the sound transfer characteristics from various positions to a listening position, where each position in the space in which the real participants are present is taken as the listening position.
  • HRTF data corresponding to a plurality of positions are prepared in the client terminal 2 based on each listening position in the space where the actual participants are.
  • The sound image localization processing unit 243 performs sound image localization processing using the HRTF data on the remote participant's voice data so that the voice of the remote participant who made the utterance is heard from the localization position of that remote participant's voice.
  • the sound image localization processing performed by the sound image localization processing unit 243 includes rendering such as VBAP (Vector Based Amplitude Panning) based on position information, and binaural processing using HRTF data.
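  • As an illustration of the VBAP rendering mentioned above, the following sketch computes two-dimensional VBAP gains for one pair of virtual loudspeakers. It solves p = g1·l1 + g2·l2 for the gains, where p, l1, and l2 are unit vectors toward the source and the two loudspeakers, and then normalizes the gains to constant power. The loudspeaker angles are assumptions; the publication does not describe a specific loudspeaker layout.

```python
import math

def unit(deg):
    """Unit vector pointing at angle `deg` (counterclockwise from the +x axis)."""
    rad = math.radians(deg)
    return math.cos(rad), math.sin(rad)

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Return power-normalized gains (g1, g2) for a source between two loudspeakers."""
    p, l1, l2 = unit(source_deg), unit(spk1_deg), unit(spk2_deg)
    det = l1[0] * l2[1] - l1[1] * l2[0]
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker directions must not be parallel")
    # Cramer's rule for p = g1 * l1 + g2 * l2
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm

centered = vbap_2d(0.0, 45.0, -45.0)    # source midway between the pair
at_speaker = vbap_2d(45.0, 45.0, -45.0)  # source exactly at the first loudspeaker
```

  • A source midway between the pair receives equal gains, and a source at one loudspeaker direction is reproduced by that loudspeaker alone.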
  • the voice of the remote participant is processed in the client terminal 2 as the voice data of the object audio.
  • channel-based audio data of two channels of L / R generated by the sound image localization process is supplied to the output control unit 245.
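  • The step from a single voice signal to two-channel L/R audio can be sketched as a convolution of the mono signal with a left-ear and a right-ear impulse response (the time-domain form of HRTF data). The impulse responses below are toy values chosen only to show the mechanism: a source on the listener's left arrives at the right ear later and quieter. Real HRTF data is measured and much longer.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Return (left, right) channels for a mono signal and a pair of impulse responses."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy impulse responses (assumed values, not measured data).
hrir_left = [1.0, 0.0, 0.0]   # direct, full level at the left ear
hrir_right = [0.0, 0.0, 0.5]  # delayed by two samples and attenuated at the right ear

mono_voice = [1.0, 0.5]
left, right = binauralize(mono_voice, hrir_left, hrir_right)
```

  • In practice the pair of impulse responses would be selected according to the direction and distance between the listener and the localization position.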
  • the output control unit 245 outputs the audio data generated by the sound image localization process to the audio output unit 204, and outputs the audio data from the earphone 3.
  • the voice transmission unit 222 controls the communication unit 206 and transmits the microphone voice data supplied from the voice input unit 203 to the communication management server 1.
  • the user position detection unit 223 detects the position of the user of the client terminal 2 which is a real participant.
  • the user position detection unit 223 functions as a position sensor for the conference participants.
  • the position of the user of the client terminal 2 is detected based on the information of the positioning system such as GPS. Further, the user's position is detected based on the information of the mobile base station and the information of the access point of the wireless LAN. The user's position may be detected using Bluetooth® communication, or the user's position may be detected based on an image taken by the camera 208. The user's position may be detected based on the measurement result by the acceleration sensor or the gyro sensor mounted on the client terminal 2.
  • the user position detection unit 223 controls the communication unit 206 and transmits the position information indicating the position of the user of the client terminal 2 to the communication management server 1.
  • When the information processing unit 211 of FIG. 12 is provided in the client terminal 2 used by a remote participant, the voice receiving unit 241 receives the voice data of the real participants and of other remote participants transmitted from the communication management server 1.
  • The localization position acquisition unit 242 receives the localization position information and the position information of the real participants transmitted from the communication management server 1, and the sound image localization processing unit 243 performs sound image localization processing using HRTF data corresponding to the positional relationship with the real participants and the other remote participants. The output control unit 245 outputs the voices of the real participants and of the other remote participants obtained by the sound image localization processing from the earphone 3 used by the user of the client terminal 2, who is the remote participant.
  • the audio data may be rendered using the arithmetic function of an external device such as a mobile phone, PHS, VOIP phone, digital exchange, gateway, or terminal adapter.
  • FIG. 13 shows the processing of the client terminal 2 used by the actual participants U1 and U2 and the processing of the client terminal 2 used by the remote participants U11 and U12 as the processing of the client terminal 2. The same applies to other sequence diagrams described later.
  • the client terminal 2 used by the real participant U1 will be referred to as a client terminal 2-1 and the client terminal 2 used by the real participant U2 will be referred to as a client terminal 2-2.
  • The client terminal 2 used by the remote participant U11 will be referred to as the client terminal 2-11, and the client terminal 2 used by the remote participant U12 will be referred to as the client terminal 2-12.
  • The processing of the client terminal 2 used by other real participants participating in the same conference is the same as the processing of the client terminals 2-1 and 2-2 used by the real participants U1 and U2. Further, the processing of the client terminal 2 used by the other remote participants is the same as the processing of the client terminals 2-11 and 2-12 used by the remote participants U11 and U12.
  • In step S1, the user position detection unit 223 of the client terminal 2-1 detects the position of the actual participant U1 and transmits the position information to the communication management server 1.
  • In step S11, the user position detection unit 223 of the client terminal 2-2 detects the position of the actual participant U2 and transmits the position information to the communication management server 1.
  • In step S21, the position information acquisition unit 131 of the communication management server 1 receives the position information transmitted from the client terminal 2-1 and acquires the position of the actual participant U1.
  • In step S22, the position information acquisition unit 131 receives the position information transmitted from the client terminal 2-2 and acquires the position of the actual participant U2.
  • In step S23, the localization position setting unit 132 performs the localization position setting process.
  • In the localization position setting process, the localization positions of the voices of the remote participants U11 and U12 are set. The details of the localization position setting process will be described later with reference to the flowchart of FIG.
  • In step S24, the localization position information transmission unit 133 transmits the localization position information indicating the localization positions of the voices of the remote participants U11 and U12 to the client terminal 2-1 and the client terminal 2-2.
  • In step S25, the localization position information transmission unit 133 transmits the localization position information indicating the localization positions of the voices of the remote participants U11 and U12, together with the position information indicating the positions of the actual participants U1 and U2, to the client terminal 2-11 and the client terminal 2-12.
  • In step S2, the localization position acquisition unit 242 of the client terminal 2-1 receives the localization position information transmitted from the communication management server 1.
  • In step S12, the localization position acquisition unit 242 of the client terminal 2-2 receives the localization position information transmitted from the communication management server 1.
  • In step S31, the localization position acquisition unit 242 of the client terminal 2-11 receives the respective position information of the real participants U1 and U2 and the localization position information of the remote participant U12 transmitted from the communication management server 1.
  • In step S41, the localization position acquisition unit 242 of the client terminal 2-12 receives the respective position information of the real participants U1 and U2 and the localization position information of the remote participant U11 transmitted from the communication management server 1.
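  • The distribution performed in steps S24 and S25 can be illustrated with a minimal sketch. The function name `build_payloads`, the dictionary keys, and the coordinate values are hypothetical and do not appear in the application; the sketch only reflects the stated rule that real participants receive the localization positions of the remote participants, while each remote participant receives the positions of the real participants and the localization positions of the other remote participants.

```python
def build_payloads(real_positions, remote_localizations):
    """Build one message per client terminal (hypothetical message layout).

    real_positions:       participant id -> (x, y) of each real participant
    remote_localizations: participant id -> (x, y) localization position of
                          each remote participant's voice
    """
    payloads = {}
    # Real participants only need to know where remote voices are localized.
    for uid in real_positions:
        payloads[uid] = {"remote_localizations": dict(remote_localizations)}
    # Remote participants need the real positions plus the localization
    # positions of the *other* remote participants.
    for uid in remote_localizations:
        others = {k: v for k, v in remote_localizations.items() if k != uid}
        payloads[uid] = {
            "real_positions": dict(real_positions),
            "remote_localizations": others,
        }
    return payloads


payloads = build_payloads(
    {"U1": (0.0, 0.0), "U2": (2.0, 0.0)},
    {"U11": (1.0, 1.5), "U12": (1.0, -1.5)},
)
```

  • In this sketch the message for U11 omits U11's own localization position, matching step S31, in which the client terminal 2-11 receives the localization position information of the remote participant U12.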
  • The localization position setting process performed in step S23 of FIG. 13 will be described with reference to the flowchart of FIG.
  • In step S51, the localization position setting unit 132 of the communication management server 1 calculates the localization possible area based on the position of the actual participant acquired by the position information acquisition unit 131.
  • In step S52, the localization position setting unit 132 sets a predetermined position in the localization possible area as the localization position of the voice of the remote participant. After that, the process returns to step S23 in FIG. 13 and the subsequent processing is performed.
  • the actual participants U1 to U3 are at positions P1 to P3 in the same space such as a conference room, respectively.
  • the positions P1 to P3 are specified based on the position information transmitted from the client terminals 2 used by the actual participants U1 to U3, respectively.
  • the localization position setting unit 132 creates a circle with a predetermined radius R [m] in the space where the real participants are, and forms the real participants inside the created circle into one group of participants in the same conference.
  • the localization position setting unit 132 notifies the participants to move closer to each other by outputting a voice such as "Please gather the participants in the conference room" from the client terminal 2.
  • one group is formed by the real participants U1 to U3, and one group is formed by the real participants U1, U3, U4.
  • Such processing is continued until one group is formed.
  • a circle with a radius of 5 m is set as a circle forming one group.
  • the size of the circles forming one group may be changed, for example, by the actual participants themselves.
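  • The grouping step can be sketched as follows. The application does not state how the circle of radius R is placed, so this sketch assumes, purely for illustration, that the circle is centered on each participant in turn and that the placement covering the most participants is kept. The function name `largest_group` and the sample coordinates are hypothetical.

```python
import math


def largest_group(positions, radius_m=5.0):
    """Return (members, outside) for the best circle of radius radius_m.

    positions: participant id -> (x, y) in metres.
    Assumption: candidate circles are centered on the participants themselves.
    """
    best = []
    for center in positions.values():
        members = [uid for uid, p in positions.items()
                   if math.dist(p, center) <= radius_m]
        if len(members) > len(best):
            best = members
    outside = [uid for uid in positions if uid not in best]
    return best, outside


members, outside = largest_group(
    {"U1": (0.0, 0.0), "U2": (4.0, 0.0), "U3": (2.0, 3.0), "U4": (30.0, 0.0)}
)
```

  • When `outside` is not empty, a terminal could output the voice prompt asking the participants to gather, and the check could be repeated until a single group is formed.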
  • the localization position setting unit 132 obtains the values of x_c, y_c, and r1 of the circle represented by the following equation (1) that minimize the sum of the distances to the points (x_n, y_n).
  • Obtaining x_c, y_c, and r1 corresponds to obtaining an approximate circle of a point cloud at positions P1 to PN, as shown by the broken line circle in FIG.
  • an approximate circle C having a radius r1 is set according to the positions P1 to P4 where the actual participants U1 to U4 are located.
  • When N = 2, the circle with the minimum radius passing through the position P1 and the position P2 is set as the approximate circle.
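  • One way to obtain such an approximate circle is sketched below. It uses a standard algebraic least-squares circle fit (often called the Kåsa fit), which is a substitute chosen for illustration and is not necessarily the minimization defined by equation (1) of the application. The function names `solve3` and `fit_circle` are hypothetical. For two points, the circle whose diameter is the segment between them is returned, which is the minimum-radius circle through both.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("positions are (nearly) collinear; no unique circle")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][j] * x[j] for j in range(i + 1, 3))) / a[i][i]
    return x


def fit_circle(points):
    """Return (x_c, y_c, r1) of a circle approximating the given (x, y) points."""
    if len(points) < 2:
        raise ValueError("at least two positions are required")
    if len(points) == 2:
        (x1, y1), (x2, y2) = points
        return (x1 + x2) / 2, (y1 + y2) / 2, math.dist(points[0], points[1]) / 2
    # Kasa fit: least squares on x^2 + y^2 + D*x + E*y + F = 0.
    rows = [(x, y, 1.0) for x, y in points]
    rhs = [-(x * x + y * y) for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    d, e, f = solve3(ata, atb)
    xc, yc = -d / 2, -e / 2
    return xc, yc, math.sqrt(xc * xc + yc * yc - f)


xc, yc, r1 = fit_circle([(4.0, 2.0), (1.0, 5.0), (-2.0, 2.0), (1.0, -1.0)])
```

  • The four sample points lie on a circle of radius 3 centered at (1, 2), so the fit recovers that circle up to floating-point error.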
  • the localization position setting unit 132 sets the localization possible region at a position on the approximate circle C and away from the position where the actual participant is actually present.
  • the localizable region is an arcuate region having a predetermined width.
  • the localizable region is set at a position separated by r2 [m] or more from the position on the approximate circle C closest to the position where the actual participant is actually located.
  • a localizable region A1 is set between the real participant U1 and the real participant U3 at a predetermined distance from each real participant, and a localizable region A2 is set between the real participant U1 and the real participant U2 at a predetermined distance from each real participant.
  • a localizable region A3 is set between the real participant U3 and the real participant U4 at a predetermined distance from each real participant, and a localizable region A4 is set between the real participant U2 and the real participant U4 at a predetermined distance from each real participant.
  • the solid small circle surrounding the real participants U1 to U3 is a circle with a radius r2 centered on the position on the approximate circle C of the real participants.
  • the solid small circle shown in the vicinity of the real participant U4 is also a circle having a radius r2 centered on the position on the approximate circle C.
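  • The construction of the localizable regions can be sketched as angular intervals on the approximate circle C. This is a minimal sketch under stated assumptions: each real participant is projected onto the circle along the ray from the center, the separation r2 is measured as a straight-line (chord) distance, and regions are returned as (start, end) angles in radians. The function name `localizable_arcs` is hypothetical.

```python
import math


def localizable_arcs(center, r1, participants, r2):
    """Return arcs on the circle that are at least r2 away from every participant.

    center:       (x, y) of the approximate circle
    r1:           radius of the approximate circle
    participants: list of (x, y) positions of real participants
    r2:           minimum chord distance to a participant's point on the circle
    Each arc is (start_angle, end_angle) in radians; end may exceed 2*pi.
    """
    if r2 >= 2 * r1:
        return []  # the keep-out circles cover the whole approximate circle
    cx, cy = center
    # A chord of length r2 on a circle of radius r1 subtends this angle.
    margin = 2 * math.asin(r2 / (2 * r1))
    angles = sorted(math.atan2(y - cy, x - cx) % math.tau for x, y in participants)
    arcs = []
    for i, start in enumerate(angles):
        end = angles[(i + 1) % len(angles)]
        gap = (end - start) % math.tau
        if i == len(angles) - 1 and gap == 0:
            gap = math.tau  # all participants share one angle: full turn is free
        if gap > 2 * margin:
            arcs.append((start + margin, start + gap - margin))
    return arcs


arcs = localizable_arcs(
    (0.0, 0.0), 2.0, [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)], 1.0
)
```

  • With four participants evenly spaced on the circle, the sketch yields four arcs, one between each neighboring pair, analogous to the regions A1 to A4.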
  • the localization position setting unit 132 sets the localization position of the voice of the remote participant in the localization possible area.
  • the localization position of the voice of the remote participant U11, which is the first remote participant, is set at the substantial center of the localizable area A1, and the localization position of the voice of the remote participant U12, which is the second remote participant, is set at the substantial center of the localizable area A2.
  • the localization position of the voice of the remote participant is set so that the angle from the center O of the approximate circle C is dispersed.
  • the localization position Qm is set. Placing the localization position Qm as far as possible from the positions of the actual participants (dispersing the angles) makes the voice easier to hear.
  • the localization positions of the voices of the remote participants are adjusted so as to separate the respective positions. For example, in the example of FIG. 20, after the localization position of the voice of the remote participant U11 is set in the localization possible area A1, the remote participant U15 joins the conference, so the localization positions of the voices of the remote participant U11 and the remote participant U15 are adjusted so that they are separated from each other.
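  • The placement of localization positions can be sketched as follows, assuming the localizable areas are given as (start, end) angular intervals on the approximate circle (an assumed representation). The rule used here is one possible reading: the least-occupied area is filled first, the widest area wins ties, and the positions inside each area are spaced evenly so that a single occupant sits at the center. The function names `place_on_arcs` and `to_xy` are hypothetical.

```python
import math


def place_on_arcs(arcs, num_remote):
    """Return localization angles for num_remote voices.

    arcs: list of (start_angle, end_angle) localizable intervals in radians.
    The result is ordered by arc, not by the order in which participants joined.
    """
    if num_remote > 0 and not arcs:
        raise ValueError("no localizable area is available")
    counts = [0] * len(arcs)
    for _ in range(num_remote):
        # Prefer the arc with the fewest occupants, then the widest arc.
        best = max(range(len(arcs)),
                   key=lambda i: (-counts[i], arcs[i][1] - arcs[i][0]))
        counts[best] += 1
    angles = []
    for (start, end), k in zip(arcs, counts):
        for j in range(k):
            # k occupants split the arc into k + 1 equal parts.
            angles.append(start + (end - start) * (j + 1) / (k + 1))
    return angles


def to_xy(center, radius, angle):
    """Convert an angle on the approximate circle to an (x, y) position."""
    return (center[0] + radius * math.cos(angle),
            center[1] + radius * math.sin(angle))


angles = place_on_arcs([(0.0, 1.0), (2.0, 2.5)], 3)
```

  • When a third voice is added in this sketch, the two voices sharing the first area are re-spaced within it, which corresponds to the adjustment described for the remote participants U11 and U15.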
  • In step S131, the voice transmission unit 222 (FIG. 12) of the client terminal 2-11 transmits the voice data of the remote participant U11 to the communication management server 1.
  • In step S121, the voice receiving unit 134 (FIG. 10) of the communication management server 1 receives the voice data transmitted from the client terminal 2-11.
  • In step S122, the voice transmission unit 135 transmits the voice data of the remote participant U11 to each of the client terminals 2-1, 2-2, and 2-12.
  • In step S101, the voice receiving unit 241 of the client terminal 2-1 receives the voice data transmitted from the communication management server 1.
  • In step S102, the sound image localization processing unit 243 performs sound image localization processing using the HRTF data on the voice data of the remote participant U11 so that the voice can be heard from the voice localization position of the remote participant U11 who made the utterance.
  • In step S103, the output control unit 245 outputs the sound of the remote participant U11 generated by the sound image localization process from the earphone 3 worn by the real participant U1.
  • the client terminal 2-2 performs the same processing as the processing of steps S101 to S103 in steps S111 to S113, so that the voice of the remote participant U11 is output from the earphone 3 worn by the real participant U2.
  • the client terminal 2-12 also performs the same processing as the processing of steps S101 to S103 in steps S141 to S143, so that the voice of the remote participant U11 is output from the earphone 3 worn by the remote participant U12.
  • Through the above processing, the real participants U1 and U2 and the remote participant U12, which is another remote participant, can hear the voice of the remote participant U11.
  • Since the voice of the remote participant U11 is localized and perceived at a position away from the real participants U1 and U2, the real participants U1 and U2 and the remote participant U12 can distinguish the voice of the remote participant U11 from the voices of the other participants.
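  • The sound image localization processing using HRTF data can be illustrated with a minimal sketch. It assumes the HRTF data are stored as time-domain impulse response pairs (left ear, right ear) indexed by azimuth in degrees, that negative azimuths are to the listener's left, and that the nearest stored direction is used without interpolation. The impulse responses below are made-up toy values, not measured data, and the names `convolve`, `nearest_hrir`, and `binauralize` are hypothetical.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def nearest_hrir(hrir_table, azimuth_deg):
    """Pick the stored impulse response pair closest in angle to azimuth_deg."""
    def angular_gap(az):
        return abs((az - azimuth_deg + 180.0) % 360.0 - 180.0)
    return hrir_table[min(hrir_table, key=angular_gap)]


def binauralize(mono, hrir_table, azimuth_deg):
    """Return (left, right) channels that place the mono voice at azimuth_deg."""
    hrir_left, hrir_right = nearest_hrir(hrir_table, azimuth_deg)
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy table: azimuth -> (left-ear response, right-ear response).
TOY_HRIRS = {
    -90.0: ([1.0, 0.3], [0.4, 0.2]),
    0.0: ([0.7, 0.2], [0.7, 0.2]),
    90.0: ([0.4, 0.2], [1.0, 0.3]),
}

left, right = binauralize([1.0, 0.0, -1.0], TOY_HRIRS, 80.0)
```

  • A voice localized at an azimuth of 80 degrees is rendered with the 90 degree pair in this sketch, so the right channel is louder than the left. The two channels would then be output from the earphone 3.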
  • In step S151, the voice transmission unit 222 of the client terminal 2-1 transmits the voice data of the real participant U1 to the communication management server 1.
  • In step S161, the voice receiving unit 134 of the communication management server 1 receives the voice data transmitted from the client terminal 2-1.
  • In step S162, the voice transmission unit 135 transmits the voice data of the actual participant U1 to each of the client terminals 2-11 and 2-12.
  • In step S171, the voice receiving unit 241 of the client terminal 2-11 receives the voice data transmitted from the communication management server 1.
  • In step S172, the sound image localization processing unit 243 performs sound image localization processing using the HRTF data on the voice data of the real participant U1 so that it can be heard from the position of the real participant U1 who made the utterance.
  • In step S173, the output control unit 245 outputs the sound of the real participant U1 generated by the sound image localization process from the earphone 3 worn by the remote participant U11.
  • the client terminal 2-12 performs the same processing as the processing of steps S171 to S173 in steps S181 to S183, so that the voice of the real participant U1 is output from the earphone 3 worn by the remote participant U12.
  • the remote participants U11 and U12 which are remote participants, can hear the voice of the actual participant U1.
  • <Case 2 in which a real participant speaks>
  • As described above, when a real participant wears closed-type headphones or the like, the voice of another real participant who has spoken is delivered via the communication management server 1 instead of directly. For example, in the client terminal 2 used by a real participant wearing closed-type headphones, sound image localization processing is performed on the voices of other real participants.
  • the process shown in FIG. 23 is a process performed in response to the actual participant U1 speaking. It is assumed that the actual participant U2 is wearing closed headphones.
  • the process shown in FIG. 23 is different from the process shown in FIG. 22 in that the sound image localization process for the voice data transmitted from the communication management server 1 is performed on the client terminal 2-2. Duplicate explanations will be omitted as appropriate.
  • the voice data of the actual participant U1 transmitted from the communication management server 1 in step S162 is also transmitted to the client terminal 2-2.
  • In step S201, the voice receiving unit 241 of the client terminal 2-2 receives the voice data transmitted from the communication management server 1.
  • In step S202, the sound image localization processing unit 243 performs sound image localization processing using the HRTF data on the voice data of the real participant U1 so that it can be heard from the position of the real participant U1 who made the utterance.
  • In step S203, the output control unit 245 outputs the sound of the real participant U1 generated by the sound image localization process from the earphone 3 worn by the real participant U2.
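  • The forwarding rules of the communication management server 1 across the cases above can be summarized in a minimal sketch. The record layout (`kind`, `closed_headphones`) and the function name `recipients` are hypothetical. The sketch encodes the behavior described: a remote participant's voice goes to every other participant, while a real participant's voice goes to the remote participants and to those real participants who wear closed-type headphones.

```python
def recipients(speaker, participants):
    """Return the ids of the participants whose terminals receive the voice data.

    participants: id -> {"kind": "real" or "remote", "closed_headphones": bool}
    """
    speaker_is_real = participants[speaker]["kind"] == "real"
    out = []
    for uid, info in participants.items():
        if uid == speaker:
            continue
        if info["kind"] == "remote":
            out.append(uid)  # remote listeners always need the stream
        elif not speaker_is_real or info.get("closed_headphones", False):
            # Real listeners hear other real participants directly unless
            # their headphones block the sound of the room.
            out.append(uid)
    return out


PARTICIPANTS = {
    "U1": {"kind": "real", "closed_headphones": False},
    "U2": {"kind": "real", "closed_headphones": True},
    "U11": {"kind": "remote"},
    "U12": {"kind": "remote"},
}

to_case2 = recipients("U1", PARTICIPANTS)
to_remote_talk = recipients("U11", PARTICIPANTS)
```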
  • Through the above processing, each participant can hear the voice of a remote participant from a position where no real participant is present, which makes the voices easy to distinguish.
  • the area where the voice of the remote participant can be presented is calculated based on the position information of the conference participants, and the localization positions of the voices are distributed within the localization possible area, so that each participant can easily distinguish the voices.
  • <<Modification example>>
  • <Exclusion area setting>
  • If there is an area in a space such as a conference room that should not be the localization position of the remote participant's voice, the localizable area is set so as to exclude such an area, and the localization position of the remote participant's voice is then set. That is, the area that should not be the localization position of the voice of the remote participant is set as the exclusion area.
  • the localization possible area is set in the communication management server 1 so as to exclude the area outside the conference room and the area with the wall.
  • FIG. 24 is a diagram showing an example of setting a localizable area.
  • the environment such as the position of the wall W is also detected by the user position detection unit 223 of the client terminal 2 together with the position of the actual participant, and is provided to the communication management server 1. For example, based on the image taken by the camera 208, the environment in which the actual participant is present is detected by the user position detection unit 223.
  • the area excluded from the setting target of the localizable area is not limited to the area with the wall. Various areas that should not be the localization position of the remote participant's voice are excluded from the setting target of the localization possible area.
  • the area on the roadway side is excluded from the setting target of the localizable area in order to avoid accidents and difficulty in hearing due to noise.
  • the area on the railroad track side is excluded from the setting target of the localizable area.
  • the restricted area is excluded from the setting target of the localizable area.
  • the area that becomes unnatural when the voice is localized is set as an exclusion area that should not be the voice localization position of the remote participant.
  • By setting the localizable area so as to exclude the exclusion area, sound image localization suitable for the environment in which the conference participants are present becomes possible.
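  • The effect of an exclusion area can be sketched by filtering candidate localization positions. Modeling each exclusion area as an axis-aligned rectangle (x_min, y_min, x_max, y_max) is an assumption made only for this sketch; the application does not specify a shape. The function names `in_rect` and `allowed_positions` are hypothetical.

```python
def in_rect(point, rect):
    """True if point (x, y) lies inside rect (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = rect
    return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max


def allowed_positions(candidates, exclusion_rects):
    """Keep only the candidate positions outside every exclusion area."""
    return [p for p in candidates
            if not any(in_rect(p, r) for r in exclusion_rects)]


# Hypothetical example: a wall occupying the square from (4, 4) to (6, 6).
kept = allowed_positions([(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)],
                         [(4.0, 4.0, 6.0, 6.0)])
```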
  • <Movement of localization position when the number of participants increases or decreases>
  • When the number of participants increases or decreases due to the addition of participants or the participants leaving the conference, the above-mentioned calculation is performed on the communication management server 1, and the localization position of the voice of the remote participant is updated. In this case, the localization position moves from the position P old , which is the position before the update, to the position P new , which is the position after the update.
  • The remote participant's voice would then be heard from a position different from the position the listener expects, which is unnatural.
  • the movement of the localization position is presented by animation.
  • the audio animation is performed, for example, by changing the HRTF data used for the sound image localization process and sequentially moving the audio localization position (sound source position) along the path from the position P old to the position P new .
  • FIG. 25 is a diagram showing an example of audio animation.
  • If the sound source is moved linearly from position P old to position P new during audio animation, it may cross near the center of the conversation circle, resulting in an unnatural conversation. Therefore, as shown by the thick arrow in FIG. 25, for example, the sound source is moved so as to move on the arc of the approximate circle C from the position P old to the position P new .
  • Because an arcuate path is set as the movement path of the sound source, and the sound source moves while maintaining its distance from the center position of the approximate circle C, which is the reference position, the circle of conversation formed along the approximate circle C can be maintained.
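  • The arc-shaped audio animation can be sketched by interpolating the angle around the center of the approximate circle C rather than interpolating the coordinates directly. Choosing the shorter direction around the circle is an assumption of this sketch, and the function name `arc_path` is hypothetical. Each returned point would be used in turn as the sound source position for the HRTF-based processing.

```python
import math


def arc_path(center, p_old, p_new, steps):
    """Return steps + 1 positions from p_old to p_new along an arc (steps >= 1)."""
    cx, cy = center
    a_old = math.atan2(p_old[1] - cy, p_old[0] - cx)
    a_new = math.atan2(p_new[1] - cy, p_new[0] - cx)
    r_old = math.dist(p_old, center)
    r_new = math.dist(p_new, center)
    # Signed shortest rotation from a_old to a_new, in (-pi, pi].
    delta = (a_new - a_old + math.pi) % math.tau - math.pi
    path = []
    for i in range(steps + 1):
        t = i / steps
        angle = a_old + delta * t
        radius = r_old + (r_new - r_old) * t  # equal radii give a pure arc
        path.append((cx + radius * math.cos(angle),
                     cy + radius * math.sin(angle)))
    return path


path = arc_path((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), 2)
```

  • Unlike a straight-line interpolation, every intermediate point in this sketch stays at the same distance from the center, so the source never passes through the middle of the conversation circle.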
  • the movement path of the sound source is set so as to avoid the position of the actual participant.
  • With the position P old as the movement start position, a route is set that first moves to the position P31, which is a position away from the center of the approximate circle C, and then moves in an arc while maintaining the distance from the center of the approximate circle C to the position P31.
  • Further, when the position P32 on the straight line passing through the center of the approximate circle C and the position P new is reached, a route for moving from the position P32 to the position P new is set.
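  • The detour route can be sketched as a list of waypoints: a radial move from P old to P31, an arc at the radius of P31, and a radial move from P32 to P new. The sketch assumes that P31 lies on the ray from the center through P old at a radius given by the hypothetical parameter `outer_radius`, and that the arc takes the shorter direction; a real implementation would choose the radius and direction so that no real participant is crossed. The function name `detour_waypoints` is hypothetical.

```python
import math


def detour_waypoints(center, p_old, p_new, outer_radius, arc_steps=8):
    """Return [P_old, P31, ...arc points..., P32, P_new] (arc_steps >= 1)."""
    cx, cy = center
    a_old = math.atan2(p_old[1] - cy, p_old[0] - cx)
    a_new = math.atan2(p_new[1] - cy, p_new[0] - cx)
    # Signed shortest rotation from a_old to a_new.
    delta = (a_new - a_old + math.pi) % math.tau - math.pi

    def on_outer(angle):
        return (cx + outer_radius * math.cos(angle),
                cy + outer_radius * math.sin(angle))

    waypoints = [p_old]
    # i = 0 yields P31 and i = arc_steps yields P32.
    for i in range(arc_steps + 1):
        waypoints.append(on_outer(a_old + delta * i / arc_steps))
    waypoints.append(p_new)
    return waypoints


route = detour_waypoints((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), 2.0, arc_steps=2)
```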
  • The movement of the sound source as described above is performed not only when the number of real participants or the number of remote participants increases or decreases but also, for example, when a real participant moves. The sound source can be moved in response to changes in various situations of the participants.
  • a conference screen may be displayed on the display 207 of the client terminal 2 used by each participant, thereby presenting the positional relationship of each participant.
  • FIG. 27 is a diagram showing an example of a conference screen.
  • a participant icon which is information visually representing each participant, is displayed on the background image showing the place where the conference is held.
  • the position of the participant icon on the screen corresponds to the position of each participant.
  • the participant icon is configured as a circular image including the user's face.
  • the participant icon is displayed in a size corresponding to the distance from the position of the participant who uses the client terminal 2 to the position of each participant.
  • Participant icons I1 to I4 represent real participants or remote participants, respectively.
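  • The distance-dependent icon size can be sketched as follows. The application only states that the size corresponds to the distance; showing nearer participants with larger icons, the inverse-distance scaling, and all pixel values are assumptions of this sketch. The function name `icon_size_px` is hypothetical.

```python
import math


def icon_size_px(viewer_pos, participant_pos,
                 base_px=96.0, min_px=24.0, ref_dist_m=1.0):
    """Return an icon diameter in pixels that shrinks with distance.

    Participants within ref_dist_m get base_px; farther icons scale with
    1 / distance and never drop below min_px.
    """
    distance = math.dist(viewer_pos, participant_pos)
    scaled = base_px * ref_dist_m / max(distance, ref_dist_m)
    return max(min_px, scaled)


near = icon_size_px((0.0, 0.0), (0.5, 0.0))
mid = icon_size_px((0.0, 0.0), (2.0, 0.0))
far = icon_size_px((0.0, 0.0), (10.0, 0.0))
```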
  • <Example of rendering>
  • Sound image localization processing including rendering and binaural processing is performed by the client terminal 2, but it may be performed by the communication management server 1. That is, the sound image localization process may be performed on the client terminal 2 side or may be performed on the communication management server 1 side.
  • the reproduction processing unit 221 of FIG. 12 is realized by the information processing unit 121 of the communication management server 1 as shown in FIG. 28.
  • the audio receiving unit 241 constituting the reproduction processing unit 221 of the information processing unit 121 receives the audio data of the participant who is the target of the sound image localization processing. Further, the localization position acquisition unit 242 acquires the localization position set by the localization position setting unit 132.
  • the sound image localization processing unit 243 performs sound image localization processing on the audio data received by the audio reception unit 241 using the HRTF data corresponding to the localization position acquired by the localization position acquisition unit 242.
  • the output control unit 245 transmits, for example, channel-based audio data of two channels of L / R, which is generated by the sound image localization process, to the client terminal 2 and outputs it from the earphone 3.
  • The conversations conducted by multiple users have been described as conversations in remote conferences, but the techniques described above can be applied to various types of conversations in which multiple people participate online, such as conversations at meals and lectures.
  • the series of processes described above can be executed by hardware or software.
  • the programs constituting the software are installed from a program recording medium on a computer embedded in dedicated hardware, a general-purpose personal computer, or the like.
  • The program to be installed is provided by being recorded on the removable media 111 shown in FIG. 9, which consists of an optical disk (CD-ROM (Compact Disc-Read Only Memory), DVD (Digital Versatile Disc), etc.) or a semiconductor memory. It may also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital broadcasting.
  • the program can be installed in the ROM 102 or the storage unit 108 in advance.
  • the program executed by the computer may be a program in which processing is performed in chronological order according to the order described in the present specification, or a program in which processing is performed in parallel or at a necessary timing such as when a call is made.
  • the system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a device in which a plurality of modules are housed in one housing are both systems.
  • this technology can take a cloud computing configuration in which one function is shared by multiple devices via a network and processed jointly.
  • each step described in the above flowchart can be executed by one device or shared by a plurality of devices.
  • When one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared by a plurality of devices.
  • (1) An information processing device provided with a sound image localization processing unit that localizes the sound image of the voice of a remote participant, who remotely participates in a conversation held in a predetermined space, at a position different from the position of a real participant who is a participant in the predetermined space.
  • (2) The information processing device according to (1), further comprising a localization position setting unit that sets the localization position of the sound image of the voice of the remote participant based on the position of the real participant.
  • the localization position setting unit sets the localization position at a position away from the position of each real participant.
  • (8) The information processing device according to (7) above, wherein the localization position setting unit moves the localization position while avoiding the position of the real participant.
  • (9) The information processing device according to (1) above, wherein the sound image localization processing unit localizes the sound image of the voice of the remote participant at a localization position set based on the position of the real participant.
  • (10) The information processing device according to (9) above, wherein the sound image localization processing unit localizes the sound image of the voice of the remote participant at the localization position set at a position away from the position of each real participant.
  • (11) The information processing device according to (9) or (10) above, wherein the sound image localization processing unit localizes the sound images of the voices of a plurality of the remote participants at positions apart from each other.
  • (12) The information processing device according to any one of (9) to (11) above, wherein the sound image localization processing unit localizes the sound image of the voice of the remote participant at a position in an area other than the exclusion area set according to the environment of the predetermined space.
  • (13) The information processing device according to any one of (1) to (12), further comprising an output control unit that outputs the voice of the remote participant from the output device used by the real participant.
  • (14) An information processing method in which an information processing device localizes the sound image of the voice of a remote participant, who remotely participates in a conversation held in a predetermined space, at a position different from the position of a real participant who is a participant in the predetermined space.
  • (15) A program for causing a computer to execute a process of localizing the sound image of the voice of a remote participant, who remotely participates in a conversation held in a predetermined space, at a position different from the position of a real participant who is a participant in the predetermined space.
  • 1 Communication management server, 2 Client terminal, 3 Earphone, 121 Information processing unit, 131 Position information acquisition unit, 132 Localization position setting unit, 133 Localization position information transmission unit, 134 Voice reception unit, 135 Voice transmission unit, 211 Information processing unit, 221 Playback processing unit, 222 Audio transmission unit, 223 User position detection unit, 241 Audio reception unit, 242 Localization position acquisition unit, 243 Sound image localization processing unit, 244 HRTF data storage unit, 245 Output control unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Telephonic Communication Services (AREA)
PCT/JP2021/042528 2020-12-04 2021-11-19 Information processing apparatus, information processing method, and program Ceased WO2022118671A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022566836A JPWO2022118671A1 2020-12-04 2021-11-19
US18/038,696 US20230419985A1 (en) 2020-12-04 2021-11-19 Information processing apparatus, information processing method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-201905 2020-12-04
JP2020201905 2020-12-04

Publications (1)

Publication Number Publication Date
WO2022118671A1 true WO2022118671A1 (ja) 2022-06-09

Family

ID=81855016

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/042528 Ceased WO2022118671A1 (ja) 2020-12-04 2021-11-19 Information processing apparatus, information processing method, and program

Country Status (3)

Country Link
US (1) US20230419985A1
JP (1) JPWO2022118671A1
WO (1) WO2022118671A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024127986A1 (ja) * 2022-12-12 2024-06-20 パナソニックIpマネジメント株式会社 音声処理システム、音声処理方法、及びプログラム

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008131193A (ja) * 2006-11-17 2008-06-05 Yamaha Corp 音像位置制御装置
WO2020022154A1 (ja) * 2018-07-27 2020-01-30 シャープ株式会社 通話端末、通話システム、通話端末の制御方法、通話プログラム、および記録媒体
JP2020088516A (ja) * 2018-11-20 2020-06-04 株式会社竹中工務店 テレビ会議システム

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4426484B2 (ja) * 2005-03-11 2010-03-03 株式会社日立製作所 音声会議システム、会議端末および音声サーバ
US9998606B2 (en) * 2016-06-10 2018-06-12 Glen A. Norris Methods and apparatus to assist listeners in distinguishing between electronically generated binaural sound and physical environment sound
US11539844B2 (en) * 2018-09-21 2022-12-27 Dolby Laboratories Licensing Corporation Audio conferencing using a distributed array of smartphones

Also Published As

Publication number Publication date
JPWO2022118671A1 2022-06-09
US20230419985A1 (en) 2023-12-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21900426

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022566836

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 18038696

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21900426

Country of ref document: EP

Kind code of ref document: A1