WO2024070550A1 - System, electronic device, system control method, and program - Google Patents

System, electronic device, system control method, and program

Info

Publication number
WO2024070550A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
user
unit
video
response timing
Prior art date
Application number
PCT/JP2023/032576
Other languages
French (fr)
Japanese (ja)
Inventor
石田 華子 (Hanako Ishida)
瀬戸 隆行 (Takayuki Seto)
荒川 (Arakawa)
Original Assignee
京セラ株式会社 (Kyocera Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京セラ株式会社 (Kyocera Corporation)
Publication of WO2024070550A1 publication Critical patent/WO2024070550A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 3/00: Automatic or semi-automatic exchanges
    • H04M 3/42: Systems providing special services or facilities to subscribers
    • H04M 3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442: Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working

Definitions

  • This disclosure relates to a system, an electronic device, a method for controlling the system, and a program.
  • remote conferences such as web conferences or video conferences are held using electronic devices or systems including electronic devices.
  • audio and/or video of the conference in the office is acquired by, for example, an electronic device installed in the office, and transmitted to, for example, an electronic device installed in the participant's home.
  • audio and/or video at the participant's home is acquired by, for example, an electronic device installed in the participant's home, and transmitted to, for example, an electronic device installed in the office.
  • Such electronic devices allow a conference to be held without all participants gathering in the same place.
  • Patent Document 1 discloses a device that displays a graphic that represents the output range of directional sound output by a speaker, superimposed on an image captured by a camera. This device makes it possible to visually grasp the output range of directional sound.
  • Patent Document 2 discloses a system in which, when a speaker and a listener in separate locations are engaged in a conversation, a listener robot is attached to the speaker's side, and a speaker robot is attached to the listener's side.
  • the system includes: a first electronic device that acquires at least one of video and audio of a first user; a second electronic device configured to be able to communicate with the first electronic device and configured to output the video and/or audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user; an estimation unit that estimates, based on at least one of the video and audio of the first user, a response timing at which the second user responds to the utterance of the first user; and a control unit that causes the second electronic device to acquire information indicating the response timing estimated by the estimation unit.
  • the electronic device is configured to be able to communicate with another electronic device and includes: an acquisition unit that acquires at least one of video and audio of a user of the other electronic device; an output unit that outputs the acquired video and/or audio of the user of the other electronic device to a user who responds to an utterance of the user of the other electronic device; an estimation unit that estimates, based on at least one of the video and audio of the user of the other electronic device, a response timing at which the responding user responds to the utterance; and a presentation unit that presents information indicating the response timing estimated by the estimation unit.
  • a method for controlling a system includes the steps of: acquiring, by a first electronic device, at least one of video and audio of a first user; outputting, by a second electronic device configured to be able to communicate with the first electronic device, the video and/or audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user; estimating, based on at least one of the video and audio of the first user, a response timing at which the second user responds to the utterance of the first user; and causing the second electronic device to acquire information indicating the response timing.
  • a program causes a computer to execute the steps of: acquiring, by a first electronic device, at least one of video and audio of a first user; outputting, by a second electronic device configured to be able to communicate with the first electronic device, the video and/or audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user; estimating, based on at least one of the video and audio of the first user, a response timing at which the second user responds to the utterance of the first user; and causing the second electronic device to acquire information indicating the response timing.
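  • As a loose illustration of the claimed flow, the following minimal Python sketch estimates a response timing from the first user's audio and pushes it to the second electronic device. The class name, the notify_response_timing() call, and the silence-based heuristic are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch of the claimed flow: estimate when the second user
# could respond, based on the first user's audio, then notify the second
# electronic device. Names and the silence heuristic are assumptions.
import time

class ResponseTimingEstimator:
    """Flags a response timing when the first user's audio goes quiet."""

    def __init__(self, silence_threshold=0.01, silence_duration=0.6):
        self.silence_threshold = silence_threshold  # RMS level treated as silence
        self.silence_duration = silence_duration    # seconds of silence = utterance end
        self._silent_since = None

    def update(self, rms_level, now=None):
        """Feed one audio frame's RMS level; return True at a response timing."""
        now = time.monotonic() if now is None else now
        if rms_level < self.silence_threshold:
            if self._silent_since is None:
                self._silent_since = now
            elif now - self._silent_since >= self.silence_duration:
                self._silent_since = None
                return True  # the first user's utterance appears to have ended
        else:
            self._silent_since = None
        return False

def control_loop(audio_frame_levels, second_device):
    """Control-unit role: deliver estimated response timings to the second device."""
    estimator = ResponseTimingEstimator()
    for rms in audio_frame_levels:
        if estimator.update(rms):
            second_device.notify_response_timing()  # hypothetical API
```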
  • FIG. 1 is a diagram illustrating an example of a usage mode of a system according to an embodiment.
  • FIG. 2 is a functional block diagram illustrating a schematic configuration of a first electronic device according to an embodiment.
  • FIG. 3 is a functional block diagram illustrating a schematic configuration of a second electronic device according to an embodiment.
  • FIG. 4 is a functional block diagram illustrating a schematic configuration of a third electronic device according to an embodiment.
  • FIG. 5 is a sequence diagram illustrating a basic operation of a system according to an embodiment.
  • FIG. 6 is a diagram illustrating response timing according to an embodiment.
  • FIG. 7 is a sequence diagram illustrating the operation of a system according to an embodiment.
  • FIG. 8 is a sequence diagram illustrating the operation of a system according to an embodiment.
  • an "electronic device” may be, for example, a device that is powered by power supplied from a power system or a battery.
  • a “system” may be, for example, a device that includes at least an electronic device.
  • a "user” may be a person who uses or may use an electronic device according to an embodiment (typically a human), and a person who uses or may use a system including an electronic device according to an embodiment.
  • a conference in which at least one participant participates by communication from a different location than the other participants is collectively referred to as a "remote conference.”
  • FIG. 1 is a diagram showing an example of how a system according to an embodiment is used.
  • participant Mg remotely participates in a conference held in conference room MR from his/her home RL, as shown in FIG. 1.
  • participants Ma, Mb, Mc, and Md participate in the conference in conference room MR.
  • the participants of the conference are not limited to participants Ma, Mb, Mc, and Md, and may include, for example, other participants.
  • participants other than participant Mg may also remotely participate in the conference from their respective homes.
  • the system according to an embodiment may include, for example, a first electronic device 1, a second electronic device 100, and a third electronic device 300.
  • the first electronic device 1, the second electronic device 100, and the third electronic device 300 are shown only in schematic form.
  • the system according to an embodiment need not include at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300, and may include devices other than the electronic devices mentioned above.
  • the first electronic device 1 may be installed in the conference room MR.
  • the second electronic device 100 capable of communicating with the first electronic device 1 may be installed in the home RL of the participant Mg.
  • the location of the home RL of the participant Mg may be a location different from the location of the conference room MR.
  • the location of the home RL of the participant Mg may be far away from the location of the conference room MR, or may be close to the location of the conference room MR.
  • the first electronic device 1 according to an embodiment is connected to the second electronic device 100 according to an embodiment, for example, via a network N.
  • the third electronic device 300 according to an embodiment may be connected to at least one of the first electronic device 1 and the second electronic device 100, for example, via a network N.
  • the first electronic device 1 according to an embodiment may be connected to the second electronic device 100 according to an embodiment, at least one of wirelessly and wired.
  • the third electronic device 300 according to an embodiment may be connected to at least one of the first electronic device 1 and the second electronic device 100, at least one of wirelessly and wired.
  • the first electronic device 1, the second electronic device 100, and the third electronic device 300 are connected wirelessly and/or wired via the network N, as shown by dashed lines.
  • the first electronic device 1 and the second electronic device 100 may be included in a remote conference system according to an embodiment.
  • the third electronic device 300 may be included in a remote conference system according to an embodiment.
  • the network N as shown in FIG. 1 may include various electronic devices and/or devices such as a server as appropriate.
  • the network N as shown in FIG. 1 may also include devices such as a base station and/or a repeater as appropriate.
  • the first electronic device 1 and the second electronic device 100 may communicate directly.
  • the first electronic device 1 and the second electronic device 100 may communicate via at least one of other devices such as the third electronic device 300 and/or a base station.
  • the communication unit of the first electronic device 1 and the communication unit of the second electronic device 100 may communicate.
  • the above-mentioned notation covers not only the case where the first electronic device 1 and the second electronic device 100 "communicate" with each other, but also the case where one "transmits" information to the other and/or the other "receives" information transmitted by the one. The same applies not only to the first electronic device 1 and the second electronic device 100, but also when any electronic device, including the third electronic device 300, communicates with any other electronic device.
  • the first electronic device 1 may be arranged in the conference room MR, for example as shown in FIG. 1.
  • the first electronic device 1 may be arranged in a position where it can acquire the voice and/or video of at least one of the conference participants Ma, Mb, Mc, and Md.
  • the first electronic device 1 outputs the voice and/or video of participant Mg, as described below. Therefore, the first electronic device 1 may be arranged so that the voice and/or video of participant Mg output from the first electronic device 1 reaches at least one of the conference participants Ma, Mb, Mc, and Md.
  • the second electronic device 100 may be arranged in the home RL of the participant Mg, for example, in a manner as shown in FIG. 1.
  • the second electronic device 100 may be arranged in a position where it is possible to acquire the voice and/or image of the participant Mg.
  • the second electronic device 100 may acquire the voice and/or image of the participant Mg by a microphone or a headset and/or a camera connected to the second electronic device 100.
  • the second electronic device 100 also outputs audio and/or video of at least one of the participants Ma, Mb, Mc, and Md of the conference in the conference room MR, as described below. For this reason, the second electronic device 100 may be positioned so that the audio and/or video output from the second electronic device 100 reaches the participant Mg. The audio output from the second electronic device 100 may reach the ears of the participant Mg via, for example, headphones, earphones, speakers, or a headset.
  • the third electronic device 300 may be, for example, a server-like device that relays between the first electronic device 1 and the second electronic device 100. Also, the system according to one embodiment does not need to include the third electronic device 300.
  • FIG. 1 shows only one example of a usage mode of the first electronic device 1, the second electronic device 100, and the third electronic device 300 according to an embodiment.
  • the first electronic device 1, the second electronic device 100, and the third electronic device 300 according to an embodiment may be used in various other modes.
  • the remote conference system including the first electronic device 1 and the second electronic device 100 shown in FIG. 1 allows the participant Mg to behave as if he or she is participating in a conference held in the conference room MR while staying at home RL. Also, the remote conference system including the first electronic device 1 and the second electronic device 100 shown in FIG. 1 allows the conference participants Ma, Mb, Mc, and Md to feel as if the participant Mg is actually participating in the conference held in the conference room MR. That is, in the remote conference system including the first electronic device 1 and the second electronic device 100, the first electronic device 1 arranged in the conference room MR can play a role like an avatar of the participant Mg.
  • the first electronic device 1 may function as a physical avatar (such as a telepresence robot) that resembles the participant Mg. Also, the first electronic device 1 may function as a virtual avatar that displays an image of the participant Mg or an image that resembles, for example, a character of the participant Mg on the first electronic device 1.
  • the first electronic device 1 may be used in the conference room MR by participants Ma, Mb, Mc, Md, etc., for example.
  • the second electronic device 100 described later has a function of outputting the voice and/or video of the participant Mg acquired by the second electronic device 100 to the first electronic device 1 when the participant Mg speaks.
  • the first electronic device 1 also has a function of outputting the voice and/or video of the participants Ma, Mb, Mc, Md, etc. acquired by the first electronic device 1 to the second electronic device 100 when the participants Ma, Mb, Mc, Md, etc. speak.
  • the first electronic device 1 allows the participants Ma, Mb, Mc, Md, etc. to hold a remote conference or video conference in the conference room MR even if the participant Mg is in a remote location. Therefore, the first electronic device 1 is also referred to as an electronic device "used locally" as appropriate.
  • the first electronic device 1 can be various devices, but may be, for example, a specially designed device.
  • the first electronic device 1 may have a housing with an exterior on which an illustration of a human or the like is drawn, or may have a housing that is shaped to resemble at least a part of a human or the like, or a robot.
  • the first electronic device 1 according to one embodiment may be, for example, a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or computer (desktop).
  • the first electronic device 1 according to one embodiment may have at least a part of a human or robot drawn on the display of a notebook PC, for example.
  • the first electronic device 1 may include a control unit 10, a storage unit 20, a communication unit 30, an imaging unit 40, an audio input unit 50, an audio output unit 60, a display unit 70, and a power unit 80.
  • the control unit 10 may also include, for example, a determination unit 12, an estimation unit 14, and an adjustment unit 16.
  • the first electronic device 1 may not include at least some of the functional units shown in FIG. 2, or may include components other than the functional units shown in FIG. 2.
  • the control unit 10 controls and/or manages the entire first electronic device 1, including each functional unit constituting the first electronic device 1.
  • the control unit 10 may include at least one processor, such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), to provide control and processing power for executing various functions.
  • the control unit 10 may be realized as a single processor, as a number of processors, or as individual processors.
  • the processor may be realized as a single integrated circuit (IC).
  • the processor may be realized as a number of communicatively connected integrated circuits and discrete circuits.
  • the processor may be realized based on various other known technologies.
  • the control unit 10 may include one or more processors and memories.
  • the processor may include a general-purpose processor that loads a specific program to execute a specific function, and a dedicated processor specialized for a specific process.
  • the dedicated processor may include an application specific integrated circuit (ASIC).
  • the processor may include a programmable logic device (PLD).
  • the PLD may include a field-programmable gate array (FPGA).
  • the control unit 10 may be either a system-on-a-chip (SoC) or a system in a package (SiP) in which one or more processors work together.
  • the control unit 10 may be configured to include, for example, at least one of software and hardware resources. Furthermore, in the first electronic device 1 according to one embodiment, the control unit 10 may be configured by specific means in which software and hardware resources work together. Furthermore, in the first electronic device 1 according to one embodiment, at least one of the other functional units may also be configured by specific means in which software and hardware resources work together.
  • control unit 10 performs various types of control and other operations, which will be described later.
  • determination unit 12 of the control unit 10 can perform various types of determination processing.
  • the estimation unit 14 can perform various types of estimation processing.
  • the adjustment unit 16 can perform various types of adjustment processing.
  • the storage unit 20 may function as a memory that stores various information.
  • the storage unit 20 may store, for example, a program executed in the control unit 10 and the results of processing executed in the control unit 10.
  • the storage unit 20 may also function as a work memory for the control unit 10.
  • the storage unit 20 may be connected to the control unit 10 by wire and/or wirelessly.
  • the storage unit 20 may include, for example, at least one of a RAM (Random Access Memory) and a ROM (Read Only Memory).
  • the storage unit 20 may be configured, for example, by a semiconductor memory or the like, but is not limited to this, and may be any storage device.
  • the storage unit 20 may be a storage medium such as a memory card inserted into the first electronic device 1 according to one embodiment.
  • the storage unit 20 may also be an internal memory of a CPU used as the control unit 10, or may be connected to the control unit 10 as a separate unit.
  • the communication unit 30 has an interface function for wireless and/or wired communication with, for example, an external device.
  • the communication method used by the communication unit 30 in one embodiment may conform to a wireless communication standard.
  • the wireless communication standard includes cellular phone communication standards such as 2G, 3G, 4G, and 5G.
  • the cellular phone communication standards include LTE (Long Term Evolution), W-CDMA (Wideband Code Division Multiple Access), CDMA2000, PDC (Personal Digital Cellular), GSM (Registered Trademark) (Global System for Mobile communications), and PHS (Personal Handy-phone System), etc.
  • wireless communication standards include WiMAX (Worldwide Interoperability for Microwave Access), IEEE 802.11, WiFi, Bluetooth (registered trademark), IrDA (Infrared Data Association), and NFC (Near Field Communication).
  • the communication unit 30 may include, for example, a modem whose communication method is standardized by ITU-T (International Telecommunication Union Telecommunication Standardization Sector).
  • the communication unit 30 may be configured to include, for example, an antenna for transmitting and receiving radio waves and an appropriate RF unit.
  • the communication unit 30 may wirelessly communicate with, for example, a communication unit of another electronic device via an antenna.
  • the communication unit 30 may have a function of transmitting any information from the first electronic device 1 to another device, and/or a function of receiving any information from another device in the first electronic device 1.
  • the communication unit 30 may wirelessly communicate with the second electronic device 100 shown in FIG. 1.
  • the communication unit 30 may wirelessly communicate with a communication unit 130 (described later) of the second electronic device 100.
  • the communication unit 30 has a function of communicating with the second electronic device 100.
  • the communication unit 30 may wirelessly communicate with the third electronic device 300 shown in FIG. 1.
  • the communication unit 30 may wirelessly communicate with a communication unit 330 (described later) of the third electronic device 300.
  • the communication unit 30 may have a function of communicating with the third electronic device 300.
  • the communication unit 30 may also be configured as an interface such as a connector for wired connection to the outside.
  • the communication unit 30 can be configured using known technology for wireless communication, so a detailed description of the hardware and the like is omitted.
  • the communication unit 30 may be connected to the control unit 10 via a wired and/or wireless connection.
  • Various pieces of information received by the communication unit 30 may be supplied to, for example, the storage unit 20 and/or the control unit 10.
  • Various pieces of information received by the communication unit 30 may be stored in, for example, a memory built into the control unit 10.
  • the communication unit 30 may transmit, for example, the results of processing by the control unit 10 and/or information stored in the storage unit 20 to the outside.
  • the imaging unit 40 may be configured to include an image sensor that captures images electronically, such as a digital camera.
  • the imaging unit 40 may be configured to include an imaging element that performs photoelectric conversion, such as a CCD (Charge Coupled Device Image Sensor) or a CMOS (Complementary Metal Oxide Semiconductor) sensor.
  • the imaging unit 40 can capture an image of the surroundings of the first electronic device 1, for example.
  • the imaging unit 40 may capture an image of the inside of the conference room MR shown in FIG. 1, for example.
  • the imaging unit 40 may capture images of participants Ma, Mb, Mc, and Md of a conference held in the conference room MR shown in FIG. 1, for example.
  • the imaging unit 40 may convert the captured image into a signal and transmit it to the control unit 10. For this reason, the imaging unit 40 may be connected to the control unit 10 via a wired and/or wireless connection. Furthermore, a signal based on the image captured by the imaging unit 40 may be supplied to any functional unit of the first electronic device 1, such as the storage unit 20 and/or the display unit 70.
  • the imaging unit 40 is not limited to an imaging device such as a digital camera, and may be any device that captures an image of the state inside the conference room MR shown in FIG. 1.
  • the imaging unit 40 may capture images of the state inside the conference room MR as still images at predetermined time intervals (e.g., 15 frames per second). Also, in one embodiment, the imaging unit 40 may capture images of the state inside the conference room MR as a continuous video. Furthermore, the imaging unit 40 may be configured to include a fixed camera, or may be configured to include a movable camera.
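  • As an illustration of capturing still images of the room at a fixed interval (e.g., about 15 frames per second), here is a minimal sketch using OpenCV; the camera device index and the crude sleep-based pacing are assumptions made for brevity.

```python
# Minimal sketch: grab still frames at roughly 15 frames per second, as the
# imaging unit 40 might. Device index 0 and sleep-based pacing are assumptions.
import time
import cv2

def capture_frames(device_index=0, fps=15.0):
    cap = cv2.VideoCapture(device_index)
    interval = 1.0 / fps
    try:
        while True:
            ok, frame = cap.read()  # one still image of the conference room
            if not ok:
                break  # camera unavailable or stream ended
            yield frame
            time.sleep(interval)  # crude pacing to the target frame rate
    finally:
        cap.release()
```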
  • the audio input unit 50 detects (acquires) sounds or voices around the first electronic device 1, including human voices.
  • the audio input unit 50 may detect sounds or voices as air vibrations, for example, with a diaphragm, and convert them into an electrical signal.
  • the audio input unit 50 may include an acoustic device that converts sounds into an electrical signal, such as a microphone.
  • the audio input unit 50 may detect (acquire) the voices of at least one of the participants Ma, Mb, Mc, and Md in the conference room MR shown in FIG. 1, for example.
  • the voices (electrical signals) detected by the audio input unit 50 may be input to the control unit 10, for example. For this reason, the audio input unit 50 may be connected to the control unit 10 by wire and/or wirelessly.
  • the audio input unit 50 may convert the acquired sound or voice into an electrical signal and supply it to the control unit 10.
  • the audio input unit 50 may also supply the electrical signal (audio signal) into which the sound or voice has been converted to a functional unit of the first electronic device 1, such as the storage unit 20.
  • the audio input unit 50 may be any device that detects (acquires) sound or voice within the conference room MR shown in FIG. 1.
  • the audio output unit 60 converts an electrical signal (audio signal) of sound or voice supplied from the control unit 10 into sound, and outputs it as sound or voice.
  • the audio output unit 60 may be connected to the control unit 10 by wire and/or wirelessly.
  • the audio output unit 60 may be configured to include a device having a function of outputting sound, such as an arbitrary speaker (loudspeaker).
  • the audio output unit 60 may be configured to include a directional speaker that transmits sound in a specific direction.
  • the audio output unit 60 may also be configured to be able to change the directionality of the sound.
  • the audio output unit 60 may include an amplifier or an amplification circuit that appropriately amplifies the electrical signal (audio signal).
  • the audio output unit 60 may amplify the audio signal that the communication unit 30 receives from the second electronic device 100.
  • the audio signal received from the second electronic device 100 may be, for example, the audio signal of a speaker (e.g., participant Mg shown in FIG. 1) that is received by the communication unit 30 from the second electronic device 100 of that speaker.
  • the audio output unit 60 may output the audio signal of a speaker (e.g., participant Mg shown in FIG. 1) as the voice of that speaker.
  • the display unit 70 may be any display device, such as a Liquid Crystal Display (LCD), an Organic Electro-Luminescence panel, or an Inorganic Electro-Luminescence panel.
  • the display unit 70 may display various types of information, such as characters, figures, or symbols.
  • the display unit 70 may also display objects and icon images constituting various GUIs, for example, to prompt the user to operate the first electronic device 1.
  • the display unit 70 may be connected to the control unit 10 or the like by wire and/or wirelessly.
  • the display unit 70 may be configured to include a backlight, etc., as appropriate.
  • the display unit 70 may display an image based on the video signal transmitted from the second electronic device 100.
  • the display unit 70 may display, for example, an image of participant Mg captured by the second electronic device 100 as an image based on the video signal transmitted from the second electronic device 100.
  • participants Ma, Mb, Mc, and Md shown in FIG. 1 can visually know the state of participant Mg who is in a location away from the conference room MR.
  • the display unit 70 may display, for example, the image of the participant Mg captured by the second electronic device 100 as is. On the other hand, the display unit 70 may display, for example, an image of the participant Mg as a character (for example, an avatar or a robot).
  • the power unit 80 generates power to drive any moving part in the first electronic device 1.
  • the power unit 80 may be configured to include a power source such as a servo motor that drives the moving part in the first electronic device 1.
  • the power unit 80 may drive any moving part in the first electronic device 1 under the control of the control unit 10. For this reason, the power unit 80 may be connected to the control unit 10 by wire and/or wirelessly.
  • the power unit 80 may drive, for example, at least a part of the housing of the first electronic device 1. Furthermore, for example, if the first electronic device 1 has a housing shaped to resemble at least a part of a human or robot, the power unit 80 may drive at least a part of the human or robot shape.
  • the first electronic device 1 may be driven by the power unit 80 to perform an action that expresses, for example, the emotion and/or behavior of the participant Mg.
  • the first electronic device 1 may be driven by the power unit 80 to perform an action that expresses the response of the participant Mg.
  • the "response” may include a short interjection such as "yes” and/or "ah” made by the listener during the speaker's speech or between speeches.
  • the "response” may also include a head movement such as a nod indicating a positive action not involving speech or a head shake indicating a negative action, or a hand movement such as a hand gesture, or a movement of the entire upper body indicating a large change in emotion such as surprise or deep agreement.
  • the "response” may include a change in facial expression that moves a part or multiple parts of the face.
  • the above-mentioned responses are made, consciously or unconsciously, to show that the listener understands or agrees with the content of the speaker's speech, or to set a rhythm that makes it easier for the speaker to speak. Therefore, the first electronic device 1 may perform an action such as a nod and/or a head shake of the participant Mg by driving at least a portion of a component that imitates the head of the participant Mg.
  • the first electronic device 1 may perform an action such as a hand gesture of the participant Mg by driving at least a portion of a component that imitates the hand of the participant Mg.
  • the first electronic device 1 may perform an action expressing an emotion of the participant Mg, such as surprise or deep agreement, by driving at least a portion of a component that imitates one or more parts of the face of the participant Mg.
  • the first electronic device 1 may perform an action expressing a facial expression of the participant Mg by driving at least a portion of a component that imitates one or more parts of the face of the participant Mg.
  • the first electronic device 1 may output, for example, a pre-recorded response of the participant Mg such as "Yes" and/or "Eh" from the audio output unit 60.
  • the first electronic device 1 may perform an action to express an emotion such as joy, anger, sadness, or happiness of the participant Mg by driving the power unit 80.
  • the power unit 80 may perform an action that expresses emotions such as joy, anger, sadness, and happiness of the participant Mg, for example, by driving at least a part of a component that imitates the face (expression) of the participant Mg.
  • the first electronic device 1 may perform an action such as a human shrugging the shoulders, a polite human bow, or an action that shows an apology, by driving the power unit 80.
  • the operation of expressing the emotions and/or behavior of a human being, such as participant Mg, by displaying using the display unit 70 and/or driving the power unit 80 may use various known technologies. For this reason, a detailed explanation of the operation of expressing the emotions and/or behavior of a human being, such as participant Mg, by displaying using the display unit 70 and/or driving the power unit 80 will be omitted.
  • the first electronic device 1 according to one embodiment can perform an operation of expressing the emotions and/or behavior of participant Mg by displaying using the display unit 70 and/or driving the power unit 80.
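  • As a loose sketch of how a detected response of the participant Mg might be translated into motion of the power unit 80, consider the following; the servo.sweep() API and channel assignments are invented for illustration and do not appear in the disclosure.

```python
# Hypothetical sketch: map a detected response type of participant Mg to a
# movement of the corresponding imitated body part. The servo.sweep() API
# and channel layout are illustrative assumptions.
from enum import Enum, auto

class Response(Enum):
    NOD = auto()          # affirmative head movement
    HEAD_SHAKE = auto()   # negative head movement
    HAND_GESTURE = auto()
    SURPRISE = auto()     # large emotional change

def express(response, servo):
    """Drive the part of the housing that imitates the relevant body part."""
    if response is Response.NOD:
        servo.sweep(channel=0, degrees=(-15, 0))   # tilt head part forward and back
    elif response is Response.HEAD_SHAKE:
        servo.sweep(channel=1, degrees=(-20, 20))  # pan head part left and right
    elif response is Response.HAND_GESTURE:
        servo.sweep(channel=2, degrees=(0, 45))    # raise the hand part
    elif response is Response.SURPRISE:
        servo.sweep(channel=3, degrees=(0, 30))    # move facial / upper-body parts
```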
  • the first electronic device 1 may be a specially designed device as described above.
  • the first electronic device 1 may include, for example, the audio output unit 60 and the power unit 80 among the functional units shown in FIG. 2.
  • the first electronic device 1 may be connected to another electronic device to supplement at least some of the functions of the other functional units shown in FIG. 2.
  • the other electronic device may be, for example, a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or computer (desktop).
  • the second electronic device 100 may be, for example, an electronic device used by the participant Mg at his/her home RL.
  • the above-mentioned first electronic device 1 has a function of outputting to the second electronic device 100 the voice and/or video of the participants Ma, Mb, Mc, Md, etc. acquired by the first electronic device 1 when those participants speak.
  • the second electronic device 100 has a function of outputting the voice and/or video of the participant Mg acquired by the second electronic device 100 to the first electronic device 1 when the participant Mg speaks.
  • the second electronic device 100 allows the participant Mg to hold a remote conference or video conference even at a location away from the conference room MR. Therefore, the second electronic device 100 is also referred to as an electronic device "used remotely" as appropriate.
  • the second electronic device 100 may include a control unit 110, a storage unit 120, a communication unit 130, an imaging unit 140, an audio input unit 150, an audio output unit 160, a display unit 170, a tactile sensation presentation unit 190, and an acquisition unit 200.
  • the control unit 110 may also include, for example, a determination unit 112, an estimation unit 114, and an adjustment unit 116.
  • the second electronic device 100 may not include at least some of the functional units shown in FIG. 3, or may include components other than the functional units shown in FIG. 3.
  • the control unit 110 controls and/or manages the entire second electronic device 100, including each functional unit constituting the second electronic device 100.
  • the control unit 110 may basically be configured based on the same concept as the control unit 10 shown in FIG. 2, for example.
  • the determination unit 112, estimation unit 114, and adjustment unit 116 of the control unit 110 may also be configured based on the same concept as the determination unit 12, estimation unit 14, and adjustment unit 16 of the control unit 10 shown in FIG. 2, for example.
  • the storage unit 120 may function as a memory that stores various types of information.
  • the storage unit 120 may store, for example, programs executed in the control unit 110 and results of processing executed in the control unit 110.
  • the storage unit 120 may also function as a work memory for the control unit 110.
  • the storage unit 120 may be connected to the control unit 110 via a wired and/or wireless connection.
  • the storage unit 120 may basically be configured based on the same concept as the storage unit 20 shown in FIG. 2, for example.
  • the communication unit 130 has an interface function for wireless and/or wired communication.
  • the communication unit 130 may wirelessly communicate with, for example, a communication unit of another electronic device, for example, via an antenna.
  • the communication unit 130 may wirelessly communicate with the first electronic device 1 shown in FIG. 1.
  • the communication unit 130 may wirelessly communicate with the communication unit 30 of the first electronic device 1.
  • the communication unit 130 has a function of communicating with the first electronic device 1.
  • the communication unit 130 may wirelessly communicate with the third electronic device 300 shown in FIG. 1.
  • the communication unit 130 may wirelessly communicate with the communication unit 330 (described later) of the third electronic device 300.
  • the communication unit 130 may have a function of communicating with the third electronic device 300.
  • the communication unit 130 may be connected to the control unit 110 in a wired and/or wireless manner.
  • the communication unit 130 may basically be configured based on the same idea as the communication unit 30 shown in FIG. 2.
  • the imaging unit 140 may be configured to include an image sensor that captures images electronically, such as a digital camera.
  • the imaging unit 140 may capture images of the interior of the home RL shown in FIG. 1, for example.
  • the imaging unit 140 may capture images of participants Mg who join a conference from the home RL shown in FIG. 1, for example.
  • the imaging unit 140 may convert the captured images into signals and transmit them to the control unit 110. For this reason, the imaging unit 140 may be connected to the control unit 110 by wire and/or wirelessly.
  • the imaging unit 140 may basically be configured based on the same concept as the imaging unit 40 shown in FIG. 2, for example.
  • the audio input unit 150 detects (acquires) sounds or voices around the second electronic device 100, including human voices.
  • the audio input unit 150 may detect sounds or voices as air vibrations, for example, with a diaphragm, and convert them into an electrical signal.
  • the audio input unit 150 may include an acoustic device that converts sounds into an electrical signal, such as an arbitrary microphone.
  • the audio input unit 150 may detect (acquire) the voice of the participant Mg in the home RL shown in FIG. 1, for example.
  • the voice (electrical signal) detected by the audio input unit 150 may be input to the control unit 110, for example. For this reason, the audio input unit 150 may be connected to the control unit 110 by wire and/or wirelessly.
  • the audio input unit 150 may basically be configured based on the same concept as the audio input unit 50 shown in FIG. 2, for example.
  • the audio output unit 160 converts an electrical signal (audio signal) supplied from the control unit 110 into sound, and outputs the audio signal as sound or voice.
  • the audio output unit 160 may be connected to the control unit 110 by wire and/or wirelessly.
  • the audio output unit 160 may be configured to include a device having a function of outputting sound, such as an arbitrary speaker (loudspeaker).
  • the audio output unit 160 may output a sound detected by the audio input unit 50 of the first electronic device 1.
  • the sound detected by the audio input unit 50 of the first electronic device 1 may be at least one of the voices of the participants Ma, Mb, Mc, and Md in the conference room MR shown in FIG. 1.
  • the audio output unit 160 may basically be configured based on the same idea as the audio output unit 60 shown in FIG. 2, for example.
  • the display unit 170 may be any display device, such as a Liquid Crystal Display (LCD), an Organic Electro-Luminescence panel, or an Inorganic Electro-Luminescence panel.
  • the display unit 170 may basically be configured based on the same concept as the display unit 70 shown in FIG. 2, for example.
  • Various data required for display on the display unit 170 may be supplied from, for example, the control unit 110 or the storage unit 120. For this reason, the display unit 170 may be connected to the control unit 110, etc., via a wired and/or wireless connection.
  • the display unit 170 may be, for example, a touch screen display equipped with a touch panel function that detects input by contact with the participant Mg's finger or stylus.
  • the display unit 170 may display an image based on the video signal transmitted from the first electronic device 1.
  • the display unit 170 may display images of participants Ma, Mb, Mc, Md, etc. captured by the first electronic device 1 (its imaging unit 40) as an image based on the video signal transmitted from the first electronic device 1.
  • participant Mg shown in FIG. 1 can visually know the state of participants Ma, Mb, Mc, Md, etc. in a conference room MR away from his/her home RL.
  • the display unit 170 may directly display images of the participants Ma, Mb, Mc, Md, etc. captured by the first electronic device 1. On the other hand, the display unit 170 may display images (e.g., avatars) that characterize the participants Ma, Mb, Mc, Md, etc.
  • the display unit 170 may have a function of notifying, for example, the participant Mg of the response timing, which will be described later. In other words, the participant Mg can know the response timing by visually checking the display unit 170. Also, in one embodiment, the display unit 170 may be an indicator, such as an LED, that notifies the response timing.
  • the tactile sensation presentation unit 190 may have a function of presenting a tactile sensation such as vibration to the fingers of the participant Mg, for example.
  • the tactile sensation presentation unit 190 may be configured in combination with a display unit 170 having a touch screen display function. In such a configuration, for example, when the participant Mg touches the display unit 170 to operate the second electronic device 100, he or she can recognize the presentation of a tactile sensation by the tactile sensation presentation unit 190.
  • the tactile sensation presentation unit 190 may have a function of notifying the participant Mg, for example, of the response timing described below.
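  • As a sketch of how the response timing might be presented to the participant Mg through any of these channels (screen, LED indicator, vibration), consider the following hypothetical dispatcher; the display, LED, and vibrator interfaces are invented for illustration.

```python
# Hypothetical sketch: the second electronic device 100 presents a response
# timing through whichever output channels are available. The show_banner(),
# blink(), and pulse() interfaces are illustrative assumptions.
class PresentationUnit:
    def __init__(self, display=None, led=None, vibrator=None):
        self.display = display
        self.led = led
        self.vibrator = vibrator

    def present_response_timing(self):
        """Notify the remote user that now is a natural moment to respond."""
        if self.display is not None:
            self.display.show_banner("Response timing")  # visual cue on screen
        if self.led is not None:
            self.led.blink(times=3)                      # indicator cue
        if self.vibrator is not None:
            self.vibrator.pulse(duration_ms=200)         # tactile cue
```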
  • the acquisition unit 200 may be various functional units that acquire the second user's response to the first user's utterance. The second user's response will be described in more detail below.
  • the acquisition unit 200 of the second electronic device 100 may acquire input to at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3, for example.
  • the acquisition unit 200 may also be configured to include at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3, for example.
  • the acquisition unit 200 may acquire a mouse click or touch input by the user, or may acquire input to a motion sensor and/or a foot pedal.
  • the acquisition unit 200 may also include an input device that detects a mouse click or touch input by the user, or may include a motion sensor and/or a foot pedal.
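  • The following hypothetical sketch shows how the acquisition unit 200 might normalize these heterogeneous inputs (touch, mouse click, motion sensor, foot pedal) into a single stream of response events; all names are illustrative assumptions.

```python
# Hypothetical sketch: the acquisition unit 200 as a thin layer that turns
# heterogeneous inputs into normalized response events. Names are assumptions.
import time
from dataclasses import dataclass

@dataclass
class ResponseEvent:
    source: str       # "touch", "mouse", "motion", or "pedal"
    timestamp: float  # seconds on a monotonic clock

class AcquisitionUnit:
    def __init__(self):
        self._events = []

    def on_input(self, source):
        """Called from any input-device callback when the second user responds."""
        self._events.append(ResponseEvent(source, time.monotonic()))

    def poll(self):
        """Drain and return the responses acquired since the last poll."""
        events, self._events = self._events, []
        return events
```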
  • the second electronic device 100 may be a dedicated device as described above. Meanwhile, in one embodiment, the second electronic device 100 may include some of the functional units shown in FIG. 3, for example. In this case, the second electronic device 100 may be connected to another electronic device to supplement at least some of the functions of the other functional units shown in FIG. 3.
  • the other electronic device may be, for example, a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or computer (desktop), etc.
  • the second electronic device 100 may be a smartphone or a laptop computer.
  • the second electronic device 100 may be a smartphone or a laptop computer with an application (program) installed for linking with the first electronic device 1.
  • FIG. 4 is a block diagram showing a schematic configuration of the third electronic device 300 shown in FIG. 1. An example of the configuration of the third electronic device 300 according to one embodiment will be described below.
  • the third electronic device 300 may be installed in a location other than the participant Mg's home RL and the conference room MR, as shown in FIG. 1.
  • the third electronic device 300 may be installed in the participant Mg's home RL or nearby, or in the conference room MR or nearby.
  • the first electronic device 1 has a function of transmitting the audio and/or video data of the participants Ma, Mb, Mc, Md, etc. acquired by the first electronic device 1 to the third electronic device 300 when the participants Ma, Mb, Mc, Md, etc. speak.
  • the third electronic device 300 may transmit the audio and/or video data received from the first electronic device 1 to the second electronic device 100.
  • the second electronic device 100 also has a function of transmitting the audio and/or video data of the participant Mg acquired by the second electronic device 100 to the third electronic device 300 when the participant Mg speaks.
  • the third electronic device 300 may transmit the audio and/or video data received from the second electronic device 100 to the first electronic device 1. In this way, the third electronic device 300 may have a function of relaying between the first electronic device 1 and the second electronic device 100.
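  • A minimal sketch of this relay role, using asyncio TCP streams as a stand-in for whatever transport the server actually uses (connection setup and device pairing are omitted; the framing is an illustrative assumption):

```python
# Minimal sketch of the relay role of the third electronic device 300:
# bytes arriving from one device are forwarded unchanged to the other.
# The asyncio/TCP transport is an illustrative assumption.
import asyncio

async def pipe(reader, writer):
    """Copy encoded audio/video bytes from one peer to the other."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def relay_pair(first_device, second_device):
    """Relay in both directions; each argument is a (reader, writer) pair."""
    await asyncio.gather(
        pipe(first_device[0], second_device[1]),  # conference room -> home
        pipe(second_device[0], first_device[1]),  # home -> conference room
    )
```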
  • the third electronic device 300 is also referred to as a "server" as appropriate.
  • the third electronic device 300 may include a control unit 310, a storage unit 320, and a communication unit 330.
  • the control unit 310 may also include, for example, a determination unit 312, an estimation unit 314, and an adjustment unit 316.
  • the third electronic device 300 may not include at least some of the functional units shown in FIG. 4, or may include components other than the functional units shown in FIG. 4.
  • the control unit 310 controls and/or manages the entire third electronic device 300, including each functional unit constituting the third electronic device 300.
  • the control unit 310 may basically be configured based on the same concept as the control unit 10 shown in FIG. 2, for example.
  • the determination unit 312, estimation unit 314, and adjustment unit 316 of the control unit 310 may also be configured based on the same concept as the determination unit 12, estimation unit 14, and adjustment unit 16 of the control unit 10 shown in FIG. 2, for example.
  • the storage unit 320 may function as a memory that stores various types of information.
  • the storage unit 320 may store, for example, programs executed in the control unit 310 and results of processing executed in the control unit 310.
  • the storage unit 320 may also function as a work memory for the control unit 310.
  • the storage unit 320 may be connected to the control unit 310 via a wired and/or wireless connection.
  • the storage unit 320 may basically be configured based on the same concept as the storage unit 20 shown in FIG. 2, for example.
  • the communication unit 330 has an interface function for wireless and/or wired communication.
  • the communication unit 330 may wirelessly communicate with, for example, a communication unit of another electronic device, for example, via an antenna.
  • the communication unit 330 may wirelessly communicate with the first electronic device 1 shown in FIG. 1.
  • the communication unit 330 may wirelessly communicate with the communication unit 30 of the first electronic device 1.
  • the communication unit 330 has a function of communicating with the first electronic device 1.
  • the communication unit 330 may wirelessly communicate with the second electronic device 100 shown in FIG. 1.
  • the communication unit 330 may wirelessly communicate with the communication unit 130 of the second electronic device 100.
  • the communication unit 330 may have a function of communicating with the second electronic device 100. As shown in FIG. 4, the communication unit 330 may be connected to the control unit 310 in a wired and/or wireless manner. The communication unit 330 may basically be configured based on the same idea as the communication unit 30 shown in FIG. 2.
  • the third electronic device 300 may be, for example, a specially designed device.
  • the third electronic device 300 may include, for example, some of the functional units shown in FIG. 4.
  • the third electronic device 300 may be connected to other electronic devices to supplement at least some of the functions of the other functional units shown in FIG. 4.
  • the other electronic devices may be, for example, devices such as a general-purpose computer or server.
  • the third electronic device 300 may be, for example, a relay server, a web server, or an application server.
  • the first electronic device 1 is installed in the conference room MR and acquires video and/or audio of at least one of the participants Ma, Mb, Mc, and Md.
  • the video and/or audio acquired by the first electronic device 1 is transmitted to the second electronic device 100 installed in the home RL of the participant Mg.
  • the second electronic device 100 outputs the video and/or audio of at least one of the participants Ma, Mb, Mc, and Md acquired by the first electronic device 1. This allows the participant Mg to recognize the video and/or audio of at least one of the participants Ma, Mb, Mc, and Md.
  • the second electronic device 100 is installed in the home RL of the participant Mg and acquires video and/or audio of the participant Mg.
  • the video and/or audio acquired by the second electronic device 100 is transmitted to the first electronic device 1 installed in the conference room MR.
  • the first electronic device 1 outputs the video and/or audio of the participant Mg received from the second electronic device 100. This allows at least one of the participants Ma, Mb, Mc, and Md to recognize the video and/or audio of the participant Mg.
  • FIG. 5 is a sequence diagram explaining the basic operation of the system according to the embodiment as described above.
  • FIG. 5 is a diagram showing the exchange of data etc. between the first electronic device 1, the second electronic device 100, and the third electronic device 300.
  • the basic operation when a remote conference or video conference is held using the system according to the embodiment will be explained with reference to FIG. 5.
  • the first electronic device 1 used locally may be used by the first user.
  • the first user may be, for example, at least one of the participants Ma, Mb, Mc, and Md shown in FIG. 1 (hereinafter also referred to as a local user).
  • the second electronic device 100 used remotely may be used by the second user.
  • the second user may be, for example, the participant Mg shown in FIG. 1 (hereinafter also referred to as a remote user).
  • the operation performed by the first electronic device 1 may be, in more detail, performed by, for example, the control unit 10 of the first electronic device 1.
  • the operation performed by the control unit 10 of the first electronic device 1 may be referred to as the operation performed by the first electronic device 1.
  • the operation performed by the second electronic device 100 may be, in more detail, performed by, for example, the control unit 110 of the second electronic device 100.
  • the operation performed by the control unit 110 of the second electronic device 100 may be referred to as the operation performed by the second electronic device 100.
  • the operations performed by the third electronic device 300 may be more specifically performed by, for example, the control unit 310 of the third electronic device 300.
  • the operations performed by the control unit 310 of the third electronic device 300 may be referred to as operations performed by the third electronic device 300.
  • the first electronic device 1 acquires at least one of the video and audio of the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) (step S1). Specifically, in step S1, the first electronic device 1 may capture the video of the first user using the imaging unit 40 and acquire (or detect) the audio of the first user using the audio input unit 50. Next, the first electronic device 1 encodes at least one of the video and audio of the first user (step S2). In step S2, encoding may mean compressing the video and/or audio data according to a predetermined rule and converting it into a format according to the purpose, including encryption. The first electronic device 1 may perform various known encoding methods, such as software encoding or hardware encoding.
  • the first electronic device 1 transmits the encoded video and/or audio data to the third electronic device 300 (step S3). Specifically, in step S3, the first electronic device 1 transmits the video and/or audio data from the communication unit 30 to the communication unit 330 of the third electronic device 300. Also in step S3, the third electronic device 300 receives the video and/or audio data transmitted from the communication unit 30 of the first electronic device 1 via the communication unit 330.
  • the third electronic device 300 transmits the encoded video and/or audio data received from the communication unit 30 to the second electronic device 100 (step S4). Specifically, in step S4, the third electronic device 300 transmits the video and/or audio data from the communication unit 330 to the communication unit 130 of the second electronic device 100. Also, in step S4, the second electronic device 100 receives the video and/or audio data transmitted from the communication unit 330 of the third electronic device 300 via the communication unit 130.
  • the second electronic device 100 decodes the encoded video and/or audio data received from the communication unit 330 (step S5).
  • decoding may mean returning the format of the encoded video and/or audio data to its original format.
  • the second electronic device 100 may use various known decoding methods, such as software decoding or hardware decoding.
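  • A minimal sketch of the encode/decode round trip of steps S2/S12 and S5/S15, using zlib compression as a stand-in for a real audio/video codec (in practice a codec such as H.264 or Opus would be used); this is an illustration, not the disclosed implementation.

```python
# Minimal sketch of steps S2/S12 (encode) and S5/S15 (decode). zlib stands in
# for a real A/V codec; the raw byte payload is an illustrative assumption.
import zlib

def encode(raw_media: bytes) -> bytes:
    """Compress captured video/audio data into a transmission format."""
    return zlib.compress(raw_media)

def decode(payload: bytes) -> bytes:
    """Restore the encoded data to its original format."""
    return zlib.decompress(payload)

# Round trip: what the receiving device decodes equals what was captured.
captured = b"\x00\x01" * 1000  # stand-in for raw audio/frame bytes
assert decode(encode(captured)) == captured
```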
  • the second electronic device 100 presents at least one of the video and audio of the first user (e.g., at least one of participants Ma, Mb, Mc, and Md) to the second user (e.g., participant Mg) (step S6).
  • the second electronic device 100 may display the video of the first user on the display unit 170 and output the audio of the first user from the audio output unit 160.
  • In this way, the second user (e.g., participant Mg) can recognize at least one of the video and audio of the first user (e.g., at least one of participants Ma, Mb, Mc, and Md).
  • the above describes a manner in which the first electronic device 1 transmits video and/or audio of the first user to the second electronic device 100 via the third electronic device 300.
  • the second electronic device 100 can transmit video and/or audio of the second user to the first electronic device 1 via the third electronic device 300.
  • the second electronic device 100 acquires at least one of the video and audio of the second user (e.g., participant Mg) (step S11). Specifically, in step S11, the second electronic device 100 may capture the video of the second user using the imaging unit 140 and acquire (or detect) the audio of the second user using the audio input unit 150. Next, the second electronic device 100 encodes at least one of the video and audio of the second user (step S12).
  • the second electronic device 100 transmits the encoded video and/or audio data to the third electronic device 300 (step S13). Specifically, in step S13, the second electronic device 100 transmits the video and/or audio data from the communication unit 130 to the communication unit 330 of the third electronic device 300. Also in step S13, the third electronic device 300 receives the video and/or audio data transmitted from the communication unit 130 of the second electronic device 100 via the communication unit 330.
  • the third electronic device 300 transmits the encoded video and/or audio data received from the communication unit 130 to the first electronic device 1 (step S14). Specifically, in step S14, the third electronic device 300 transmits the video and/or audio data from the communication unit 330 to the communication unit 30 of the first electronic device 1. Also, in step S14, the first electronic device 1 receives the video and/or audio data transmitted from the communication unit 330 of the third electronic device 300 via the communication unit 30.
  • the first electronic device 1 decodes the encoded video and/or audio data received from the communication unit 330 (step S15).
  • the first electronic device 1 presents at least one of the video and audio of the second user (e.g., participant Mg) to the first user (e.g., at least one of participants Ma, Mb, Mc, and Md) (step S16). Specifically, in step S16, the first electronic device 1 may display the video of the second user on the display unit 70 and output the audio of the second user from the audio output unit 60.
  • the operations from step S1 to step S6 and the operations from step S11 to step S16 may be executed in the reverse order. That is, the operations from step S11 to step S16 may be executed first, and then the operations from step S1 to step S6. Furthermore, the operations from step S1 to step S6 and the operations from step S11 to step S16 may be executed simultaneously, or may be executed so that they at least partially overlap.
  • in communication over a network such as the Internet, the communication speed of the line is usually not guaranteed and is often provided under a best-effort contract. A certain level of communication speed could be ensured by establishing a dedicated line between the conference room MR and the home RL of the participant Mg shown in FIG. 1.
  • however, the establishment of a dedicated line tends to be cost-prohibitive. For this reason, recent remote conferences or video conferences are typically realized with a configuration in which at least a part of the network N includes an Internet line or the like, and there is often no choice but to accept some communication delay.
  • the encoding and/or decoding operations also require a certain amount of time. For example, if a first user asks a second user a question, encoding and decoding must be performed twice before the second user's response is returned to the first user who asked the question. Even if one encoding or decoding takes only a short amount of time, if such processing goes back and forth between the first and second users, it is conceivable that a non-negligible time delay will occur during the conversation.
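A back-of-the-envelope sketch of this cumulative delay follows; every figure used is an assumed value for illustration, not a measurement.

```python
# Hypothetical one-way budget for a question-and-answer exchange; every number
# here is an assumed figure for illustration, not a measured value.
encode_ms = 30      # one encoding pass
decode_ms = 30      # one decoding pass
network_ms = 80     # one traversal of the network N

one_way_ms = encode_ms + network_ms + decode_ms   # question reaches the listener
round_trip_ms = 2 * one_way_ms                    # answer comes back: 2 encodes + 2 decodes in total
print(f"one-way: {one_way_ms} ms, question-to-answer transport: {round_trip_ms} ms")
# one-way: 140 ms, question-to-answer transport: 280 ms
```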
  • as a result, the timing at which the video and/or audio indicating the end of the first user's speech actually reaches the second user may be delayed.
  • if the timing of the second user's response is delayed in turn, it is expected that the first user will grow impatient waiting for the second user's response, or that the second user's response will overlap with the first user's next utterance.
  • in such cases, the amount and/or quality of information conveyed to participants decreases. Therefore, in a remote conference or video conference, it is desirable to be able to appropriately convey and share the listener's response to the speaker's speech in order to facilitate smooth communication.
  • to address this, the system according to one embodiment estimates the response timing of the second user based on the first user's speech, and notifies the second user of the arrival of the response timing.
  • the system according to one embodiment may also estimate the response timing of the second user based on the first user's speech at a point before the first user's speech ends.
  • Figure 6 is a diagram explaining how the system according to one embodiment estimates response timing.
  • the upper part of Figure 6 shows the waveform of the voice of the first user acquired (detected) by the voice input unit 50 when the first user of the first electronic device 1 is talking.
  • in this graph, the vertical axis indicates the level of the voice of the first user, and the horizontal axis indicates time.
  • the vertical axis of the graph at the top of Figure 6 may represent, for example, the sound pressure of the voice of the first user acquired by the voice input unit 50, converted into a voltage and then amplified, or it may simply represent the sound pressure or volume of that voice.
  • the graph at the bottom of FIG. 6 illustrates an example of response timing.
  • the time period during which the first user makes (almost) no sound may be set as the response timing of the second user.
  • the timing at which the first user, who has been making sound, (almost) stops making sound may be set as the start point of the response timing of the second user.
  • the timing at which the first user, who has not been making sound, next makes sound may be set as the end point of the response timing of the second user.
  • in the graph at the bottom of FIG. 6, the state in which the response timing is on is indicated by a value of +1, and the state in which the response timing is off is indicated by a value of -1.
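As a minimal sketch of how such an on/off signal could be derived from the voice level, assuming a simple silence-detection rule with illustrative threshold and hold-time values:

```python
# Derive a +1/-1 response-timing signal from an audio level sequence by
# treating sustained low level as "response timing on". The threshold and
# hold time are illustrative assumptions.
def response_timing_signal(levels, threshold=0.05, hold_frames=5):
    """levels: per-frame voice level of the first user, e.g. RMS in [0, 1]."""
    signal, quiet_run = [], 0
    for level in levels:
        quiet_run = quiet_run + 1 if level < threshold else 0
        # Turn the timing on only after the speaker has stayed quiet a while,
        # so short pauses inside an utterance are not treated as turn ends.
        signal.append(+1 if quiet_run >= hold_frames else -1)
    return signal

levels = [0.4, 0.5, 0.3, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.6]
# -1 while speaking, +1 once the pause has persisted, -1 again when speech resumes
print(response_timing_signal(levels))
```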
  • the system according to one embodiment may estimate the response timing as shown in FIG. 6 while acquiring the voice of the first user, rather than determining the response timing by analyzing the voice of the first user after acquisition. That is, the system according to one embodiment may estimate the start time of each response timing as shown in FIG. 6 before the end of each utterance of the first user. In this case, the system according to one embodiment may estimate the start time of the response timing based on the features of the voice of the first user acquired by the voice input unit 50 and/or the features of the language.
  • the system may also estimate the start time of the response timing based on the image of the first user captured by the imaging unit 40, i.e., the face, facial expressions, gestures, and/or body movements, instead of or in addition to the voice of the first user.
  • a system may estimate the start point of the response timing based on the timing when the volume of the voice of the first user decreases or the tone of voice becomes lower, as a feature of the voice of the first user.
  • a system may estimate the start point of the response timing based on the timing when the ending of the sentence becomes "desu" or "masu", as a feature of the language of the first user.
  • a system may estimate the start point of the response timing based on the timing when the first user returns their gaze to the first electronic device 1 after looking away from the first electronic device 1, as a feature of the video of the first user.
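A minimal heuristic sketch combining these voice, language, and video cues follows; the field names and the combination rule are assumptions for illustration, not the estimation logic of the disclosure.

```python
# Heuristic start-point detector for the response timing, combining the voice,
# language, and video cues mentioned above. All field names and the decision
# rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Cues:
    volume_drop: bool        # voice: volume decreased / tone lowered
    sentence_final: bool     # language: utterance ended with "desu" / "masu"
    gaze_returned: bool      # video: gaze came back to the first electronic device 1

def response_start_detected(cues: Cues) -> bool:
    # Require the voice cue plus at least one corroborating cue, so a mere
    # pause for breath does not trigger a response-timing start.
    return cues.volume_drop and (cues.sentence_final or cues.gaze_returned)

print(response_start_detected(Cues(True, True, False)))   # True
print(response_start_detected(Cues(True, False, False)))  # False
```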
  • the system may estimate the end point of the response timing based on the timing when the volume of speech decreases and then increases, as a feature of the first user's voice.
  • the system according to one embodiment may estimate the timing when the volume increases, as a feature of the first user's voice, as the timing of a question, and estimate the end point of the response timing based on the average response time to the question.
  • the system according to one embodiment may determine, as a feature of the first user's language, whether the content of the utterance is an open-ended question or a closed-ended question.
  • the system according to one embodiment may set, for example, a response timing after an open-ended question that is longer than the response timing after a closed-ended question.
  • the system according to one embodiment may also set, for example, a response timing after a closed-ended question that is shorter than the response timing after an open-ended question.
  • the system according to one embodiment may also determine, as a feature of the first user's voice and/or language, whether or not the conversation is lively, or the degree to which it is lively. In this case, the system according to one embodiment may set the response timing relatively short when it determines that the first user's conversation is lively.
  • the system according to one embodiment may determine whether the content of the conversation is positive or negative based on the voice and/or language features of the first user. In this case, the system according to one embodiment may set the response timing to be relatively short when the content of the conversation of the first user is determined to be relatively positive, and set the response timing to be relatively long when the content of the conversation of the first user is determined to be relatively negative.
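The window-length rules described above can be sketched as follows; the base durations and scaling factors are assumed values, not parameters defined in this disclosure.

```python
# Choose an assumed response-window length from the conversational features
# described above; every duration and factor is an illustrative assumption.
def response_window_seconds(question: str, lively: bool, positive: bool) -> float:
    base = {"open": 4.0, "closed": 1.5, "none": 2.5}[question]  # open-ended > closed-ended
    if lively:
        base *= 0.7                     # lively conversation: shorter window
    base *= 0.8 if positive else 1.2    # positive content: shorter; negative: longer
    return base

print(response_window_seconds("open", lively=False, positive=False))   # 4.8
print(response_window_seconds("closed", lively=True, positive=True))   # 0.84
```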
  • the system according to one embodiment may estimate the response timing or correct the estimated response timing by analyzing the first user's past audio and/or video history.
  • the system according to one embodiment may estimate the response timing based on, for example, AI (Artificial Intelligence) technology.
  • the system according to one embodiment may estimate the response timing based on, for example, machine learning (and even deep learning) technology.
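As one non-authoritative sketch of such a learned estimator, a simple classifier could be trained to predict, frame by frame, whether the response timing is on. The feature set and the toy training data below are assumptions for illustration, not the trained model of the disclosure.

```python
# Sketch of a machine-learned response-timing estimator: a logistic regression
# over per-frame features. The feature set and the toy training data are
# illustrative assumptions.
from sklearn.linear_model import LogisticRegression

# features per frame: [voice level, seconds since the level dropped, sentence-final flag]
X = [[0.6, 0.0, 0], [0.5, 0.0, 0], [0.05, 0.4, 1], [0.02, 0.8, 1],
     [0.7, 0.0, 0], [0.03, 0.6, 0], [0.04, 1.0, 1], [0.65, 0.0, 0]]
y = [0, 0, 1, 1, 0, 1, 1, 0]   # 1 = response timing on

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0.03, 0.7, 1]])[0][1])  # probability the timing is on
```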
  • the second electronic device 100 may indicate to the second user that it is time to respond at the time of the estimated response timing.
  • the second electronic device 100 may present the arrival of the response timing to the second user as at least one of visual information, auditory information, and tactile information.
  • the second electronic device 100 may notify the second user of the response timing by displaying "You have been asked a question" or "It's your turn” on the display unit 170.
  • the second electronic device 100 may also notify the second user of the response timing by turning on or blinking the display unit 170 configured as an indicator such as an LED.
  • the second electronic device 100 may also notify the second user of the response timing by outputting a sound such as "You have been asked a question” or "It's your turn” from the audio output unit 160.
  • the second electronic device 100 may also notify the second user of the response timing by outputting a predetermined notification sound or the like from the audio output unit 160.
  • the second electronic device 100 may notify the second user of the response timing by outputting haptic information, such as a predetermined vibration, from the haptic sensation providing unit 190.
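A sketch of such a multimodal notification follows; the three callables are placeholder hooks standing in for the display unit 170, the audio output unit 160, and the tactile sensation providing unit 190, and are not actual device APIs.

```python
# Notify the second user that the response timing has arrived, through any
# combination of visual, auditory, and tactile channels. The three callables
# are placeholder hooks, not real device interfaces.
def notify_response_timing(show_text, play_sound, vibrate,
                           channels=("visual", "auditory", "tactile")):
    if "visual" in channels:
        show_text("It's your turn")          # or turn on / blink an LED indicator
    if "auditory" in channels:
        play_sound("notification_chime")     # or speak "You have been asked a question"
    if "tactile" in channels:
        vibrate(pattern_ms=[200, 100, 200])  # a predetermined vibration

notify_response_timing(print,
                       lambda s: print("sound:", s),
                       lambda pattern_ms: print("vibrate:", pattern_ms))
```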
  • the system according to one embodiment may transmit the response timing by prioritizing it over normal audio and/or video communication, for example. Because the transmission of the response timing is merely a notification of timing, it is considered that even if the response timing is prioritized over audio and/or video communication, it will have little effect on the audio and/or video communication. Furthermore, in the system according to one embodiment, the transmission of the response timing may be performed using, for example, a publish/subscribe server. Furthermore, in the system according to one embodiment, the transmission of the response timing may use a line separate from the line for normal audio and/or video communication.
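A minimal in-process sketch of carrying the timing notification over a publish/subscribe channel separate from the audio/video stream follows; the broker here is an in-memory stand-in, not any particular publish/subscribe server product.

```python
# In-memory publish/subscribe broker used to carry response-timing
# notifications on a channel separate from the audio/video stream.
# This is a stand-in for a real publish/subscribe server.
from collections import defaultdict

class PubSubBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = PubSubBroker()
# The second electronic device 100 subscribes to timing notifications.
broker.subscribe("response_timing", lambda m: print("notify second user:", m))
# The third electronic device 300 publishes when the timing arrives.
broker.publish("response_timing", {"state": "on", "t": 12.3})
```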
  • the system according to one embodiment can inform the second user of the original response timing even if, for example, there is a delay in audio and/or video. Therefore, the second user can respond to the first user's comment at an appropriate time.
  • the system according to one embodiment reduces the number of cases where the first user becomes impatient for the second user's response, and also reduces the number of cases where the second user's response overlaps with the first user's next comment. Therefore, the system according to one embodiment can facilitate communication between multiple locations.
  • the above-mentioned response timing estimation may be performed by the estimation unit 14 of the first electronic device 1, the estimation unit 314 of the third electronic device 300, or the estimation unit 114 of the second electronic device 100.
  • the response timing estimation may be performed by at least one of the estimation unit 14 of the first electronic device 1, the estimation unit 114 of the second electronic device 100, and the estimation unit 314 of the third electronic device 300. In this case, among the estimation unit 14, the estimation unit 114, and the estimation unit 314, those that do not estimate the response timing may not be required components.
  • various determination processes related to the above-mentioned response timing estimation may be performed by the determination unit 12 of the first electronic device 1, the determination unit 312 of the third electronic device 300, or the determination unit 112 of the second electronic device 100.
  • the process related to correcting the estimated response timing may be performed by the adjustment unit 16 of the first electronic device 1, by the adjustment unit 316 of the third electronic device 300, or by the adjustment unit 116 of the second electronic device 100.
  • FIG. 7 is a sequence diagram illustrating the characteristic operations of the system according to one embodiment.
  • FIG. 7 is a diagram illustrating the exchange of data and the like between the first electronic device 1, the second electronic device 100, and the third electronic device 300.
  • the encoding and decoding of data described in FIG. 5 may use known technology. For this reason, the description of the encoding and decoding of data will be omitted in FIG. 7. Below, the description of the same or similar content as that already described in FIG. 5 may be simplified or omitted as appropriate.
  • first, the first electronic device 1 acquires at least one of the video and audio of the first user (e.g., at least one of participants Ma, Mb, Mc, and Md) (step S101).
  • the operation of step S101 may be the same as step S1 in FIG. 5.
  • the first electronic device 1 transmits video and/or audio data of the first user to the third electronic device 300 (step S102).
  • the operation of step S102 may be similar to step S3 in FIG. 5.
  • the third electronic device 300 transmits the video and/or audio data of the first user received from the first electronic device 1 to the second electronic device 100 (step S103).
  • the operation of step S103 may be similar to step S4 in FIG. 5.
  • when the second electronic device 100 receives the video and/or audio data of the first user from the third electronic device 300 in step S103, it presents at least one of the video and audio of the first user to the second user (e.g., participant Mg) (step S104).
  • the operation of step S104 may be the same as step S6 in FIG. 5.
  • when the third electronic device 300 receives the video and/or audio data of the first user from the first electronic device 1 in step S102, it estimates the response timing based on the video and/or audio data of the first user (step S105).
  • the response timing estimation performed in step S105 can be performed as described above.
  • the third electronic device 300 determines whether the time of the estimated response timing has arrived (step S106). If the time of the response timing has not arrived in step S106, the third electronic device 300 may wait until the time of the response timing arrives or may execute other processing. If the time of the response timing has arrived in step S106, the third electronic device 300 transmits information indicating the estimated response timing to the second electronic device 100 (steps S107 and S108).
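A sketch of this gating in steps S106 and S107 follows; the polling loop is a simplification, and a real implementation would more likely be event-driven.

```python
# Server-side gating on the third electronic device 300: wait until the
# estimated response timing arrives, then send the notification
# (steps S106-S107). Simplified polling loop for illustration.
import time

def wait_and_notify(estimated_timing_ts: float, send_to_second_device) -> None:
    while time.time() < estimated_timing_ts:              # step S106: time arrived yet?
        time.sleep(0.01)                                  # wait (or do other processing)
    send_to_second_device({"event": "response_timing"})   # step S107

wait_and_notify(time.time() + 0.05, print)
```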
  • when the second electronic device 100 receives the information indicating the response timing from the third electronic device 300 in step S108, it notifies the second user that it is time to respond (step S109). In step S109, the second electronic device 100 may present the response timing to the second user as at least one of visual information, auditory information, and tactile information, as described above. By presenting the response timing in this way, the second user can respond to the first user's speech at an appropriate time.
  • the second electronic device 100 acquires a response of the second user (e.g., participant Mg) to the speech of the first user (step S110).
  • the acquisition unit 200 of the second electronic device 100 may acquire the response of the second user to the speech of the first user.
  • the acquisition unit 200 of the second electronic device 100 may acquire, for example, input to at least one of the imaging unit 140 and the voice input unit 150 shown in FIG. 3.
  • the acquisition unit 200 may also acquire a mouse click or touch input by the user, or may acquire input to a motion sensor and/or a foot pedal.
  • the response of the second user may include, for example, moving the head back and forth or up and down (nodding), moving the head left and right (shaking), a hand gesture, a movement of the upper body, a facial expression, or a short utterance such as "yes," "no," or "ah."
  • the responses acquired by the second electronic device 100 are not limited to those described above.
  • the second electronic device 100 may acquire a combination of the above as the response of the second user.
  • the second electronic device 100 may acquire at least one of the video and the voice of the second user when acquiring the response of the second user.
  • the second electronic device 100 may acquire the response of the second user by, for example, performing image recognition on the acquired video and voice recognition on the acquired voice.
  • what the second electronic device 100 acquires is not limited to at least one of the video and the voice of the second user.
  • the response of the second user may be acquired by acquiring (detecting) the motion of the body of the second user, such as a nod.
  • the second electronic device 100 may be provided with a motion sensor.
  • the second electronic device 100 may be a wearable terminal worn by the second user, or a device held in the second user's hand, such as a mouse or a touch pen.
  • the second electronic device 100 may acquire the response of the second user by connecting to a smartphone, a tablet terminal, a foot pedal, or the like held by the second user in the hand, by wire or wirelessly.
  • the acquisition of the response of the second user is not limited to the above, and the response of the second user may be acquired by combining these.
  • an example of the correspondence between the information such as video and audio acquired by the second electronic device 100 and the detection method of the response of the second user is shown in Table 1 below.
  • the second electronic device 100 may acquire a head nod or a head shake as the response of the second user, for example, by performing image recognition on the acquired video.
  • the second electronic device 100 may acquire a head nod action as the response of the second user when a positive word is detected by performing voice recognition on the acquired voice, for example.
  • the second electronic device 100 may acquire a head shake action as the response of the second user when a negative word is detected by performing voice recognition on the acquired voice, for example.
  • the second electronic device 100 may connect to a wearable terminal such as headphones equipped with a motion sensor worn by the second user. In this case, the second electronic device 100 may acquire a head nod or a head shake action of the second user detected by the wearable terminal as the response of the second user.
  • the second electronic device 100 may be, for example, a handheld device such as a smartphone or tablet equipped with a motion sensor.
  • the second electronic device 100 may be tilted back and forth by the second user, and may acquire a head nod action associated with this action as the second user's response.
  • the second electronic device 100 may detect a head nod, for example, by clicking a mouse.
  • the second electronic device 100 may display, on the display unit 170, a GUI in which a button corresponding to a head nod is set, and may acquire the second user's response by the second user clicking a mouse button.
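The correspondences described above can be sketched as a lookup from input modality to detected response; the entries paraphrase the examples above for illustration and are not a reproduction of Table 1.

```python
# Map each input modality of the second electronic device 100 to the response
# it yields, paraphrasing the examples above (illustrative, not Table 1 itself).
def detect_response(modality: str, observation: str) -> str | None:
    rules = {
        ("video", "head moved up and down"): "nod",
        ("video", "head moved left and right"): "head shake",
        ("voice", "positive word"): "nod",
        ("voice", "negative word"): "head shake",
        ("motion_sensor", "device tilted back and forth"): "nod",
        ("mouse", "nod button clicked"): "nod",
    }
    return rules.get((modality, observation))   # None when nothing matches

print(detect_response("voice", "positive word"))   # nod
```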
  • the second electronic device 100 transmits the acquired data such as video and/or audio of the second user to the third electronic device 300 (step S111).
  • the operation of step S111 may be the same as step S13 in FIG. 5.
  • the data transmitted from the second electronic device 100 to the third electronic device 300 may include data indicating body movements such as nodding that correspond to the response of the second user.
  • the third electronic device 300 transmits data such as video and/or audio of the second user received from the second electronic device 100 to the first electronic device 1 (step S112).
  • the operation of step S112 may be the same as step S14 in FIG. 5.
  • when the first electronic device 1 receives data such as the video and/or audio of the second user from the third electronic device 300 in step S112, it presents at least one of the video and audio of the second user to the first user (e.g., participant Ma) (step S113).
  • the operation of step S113 may be the same as step S16 in FIG. 5.
  • if the first electronic device 1 receives data indicating a body movement of the second user, such as a nod, in step S113, it may reproduce the body movement of the second user by, for example, driving the power unit 80, or by displaying it on the display unit 70.
  • in this way, the first user can receive the second user's response to his/her own speech at an appropriate time. Therefore, according to the system of one embodiment, communication between multiple locations can be facilitated.
  • the first electronic device 1 and the second electronic device 100 communicate with each other via the third electronic device 300.
  • the above-mentioned operation may be performed without the third electronic device 300.
  • the first electronic device 1 and the second electronic device 100 may be configured to be able to communicate with each other directly or indirectly.
  • the third electronic device 300 estimates the response timing in advance in step S105, and when the time of the response timing arrives, transmits information indicating the response timing to the second electronic device 100.
  • the system according to an embodiment may not be limited to such a configuration.
  • a system according to a modified example of an embodiment will be further described below.
  • FIG. 8 is a sequence diagram that explains the characteristic operations of a system according to a modified example of the embodiment shown in FIG. 7. Below, only the differences from the operations shown in FIG. 7 will be explained.
  • the operations from step S101 to step S105 may be the same as those in FIG. 7.
  • the third electronic device 300 may transmit information indicating the response timing to the second electronic device 100 even before the time of the response timing arrives (steps S121 and S122).
  • the second electronic device 100 that has received the information indicating the response timing determines whether the time of the estimated response timing has arrived (step S123). If the time of the response timing has not arrived in step S123, the second electronic device 100 may wait until the time of the response timing arrives or may execute other processing. If the time of the response timing arrives in step S123, the second electronic device 100 notifies the second user that it is time to respond (step S109).
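In this modified example the gating moves to the client side; the following sketch mirrors the server-side loop shown earlier, with the same simplifications.

```python
# Client-side gating on the second electronic device 100: the timing info
# arrives early (steps S121-S122); the device itself waits for the estimated
# time before notifying the user (steps S123 and S109). Simplified polling.
import time

def schedule_notification(estimated_timing_ts: float, notify_user) -> None:
    while time.time() < estimated_timing_ts:   # step S123: time arrived yet?
        time.sleep(0.01)                       # wait (or do other processing)
    notify_user("It's your turn")              # step S109

schedule_notification(time.time() + 0.05, print)
```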
  • the operations from step S110 to step S113 may be the same as those in FIG. 7.
  • in this modified example as well, the first user can receive the second user's response to his/her own speech at an appropriate time.
  • in the above description, the second electronic device 100 detects the response of the second user.
  • however, the first electronic device 1 and/or the third electronic device 300 may detect the response of the second user.
  • the system may include, for example, a first electronic device 1, a second electronic device 100, and a third electronic device 300.
  • the first electronic device 1 acquires at least one of a video and a voice of the first user.
  • the second electronic device 100 may be configured to be able to communicate with the first electronic device 1.
  • the second electronic device 100 outputs at least one of a video and a voice of the first user acquired by the first electronic device 1 to a second user who responds to the speech of the first user.
  • the third electronic device 300 may include a control unit 310 and an estimation unit 314.
  • the estimation unit 314 may estimate the response timing of the second user who responds to the speech of the first user based on at least one of a video and a voice of the first user.
  • the control unit 310 may control the second electronic device 100 to acquire information indicating the response timing estimated by the estimation unit 314.
  • the second electronic device 100 may include a presentation unit.
  • the presentation unit of the second electronic device 100 may present the response timing to the second user as at least one of visual information, auditory information, and tactile information.
  • the presentation unit of the second electronic device 100 may be, for example, at least one of the display unit 170, the audio output unit 160, and the tactile sensation presentation unit 190 shown in FIG. 3.
  • the presentation unit of the second electronic device 100 may present the response timing to the second user when the response timing is reached.
  • the second electronic device 100 may include an acquisition unit 200.
  • the acquisition unit 200 of the second electronic device 100 may acquire the response of the second user as at least one of video and audio.
  • the acquisition unit 200 of the second electronic device 100 may be, for example, at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3.
  • the acquisition unit 200 of the second electronic device 100 may acquire input to at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3.
  • the acquisition unit 200 may acquire a mouse click or touch input by the user, or may acquire input to a motion sensor and/or a foot pedal, etc.
  • the second electronic device 100 may also include a communication unit 130.
  • the communication unit 130 may transmit at least one of the video and audio acquired by the acquisition unit to the first electronic device 1.
  • the control unit 310 of the third electronic device 300 may, for example, perform control so as to transmit information indicating the response timing estimated by the estimation unit 314 to the second electronic device 100 before the response timing is reached (i.e., in advance).
  • the second electronic device 100 may include an acquisition unit 200 that acquires a response of the second user corresponding to a predetermined action of the second user.
  • the second electronic device 100 may also include a communication unit 130 that transmits data indicated by the response of the second user to the first electronic device 1.
  • the first electronic device 1 may also include a power unit 80 that drives at least a part of the housing of the first electronic device 1 based on the data indicating the response of the second user.
  • the estimation unit 314 of the third electronic device 300 may estimate the response timing.
  • the estimation unit 314 may estimate the response timing based on at least one of the voice characteristics of the first user and the language characteristics of the first user extracted from at least one of the video and audio of the first user acquired by the first electronic device 1.
  • the estimation unit 314 may also estimate the response timing based on at least one of the facial expression characteristics of the first user and the gestures of the first user extracted from at least one of the video and audio of the first user acquired by the first electronic device 1.
  • the estimation unit 314 of the third electronic device 300 may estimate the response timing by predicting the timing at which the first user's current utterance will end and the timing at which the first user's next utterance will start.
  • if the presentation of the second user's response is delayed, the response may overlap with the start of the first user's next speech. In that case, the first user, who has started the next utterance, may break off mid-speech, which may impair communication. Therefore, for example, in step S113 of FIG. 7 or FIG. 8, if the remaining time until the end of the response timing is short when presenting the video and/or audio of the second user, the first electronic device 1 may refrain from presenting the video and/or audio of the second user. Similarly, the first electronic device 1 may refrain from causing the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on data indicating the response of the second user.
  • the third electronic device 300 may also transmit the response timing estimated by the estimation unit 314 to the first electronic device 1. Then, for example, in step S113 of FIG. 7 or FIG. 8, the determination unit 12 of the first electronic device 1 may determine whether the remaining time of the response timing is shorter than a predetermined time. When the remaining response time is shorter than a predetermined time, the first electronic device 1 may be configured not to present the video and/or audio of the second user. Furthermore, when the remaining response time is shorter than a predetermined time, the first electronic device 1 may be configured not to cause the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on data indicating the response of the second user.
  • the first electronic device 1 may include a determination unit 12.
  • the determination unit 12 may determine whether or not to present at least one of the video and audio of the second user acquired from the second electronic device 100 to the first user based on the remaining time until the end of the response timing.
  • the determination unit 12 may determine whether or not to cause the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on the remaining time until the end of the response timing.
  • in this way, the response is prevented from overlapping with the start of the first user's next utterance. Therefore, the first user who has started the next utterance is not interrupted, allowing for smooth communication.
  • alternatively, the first electronic device 1 may refrain from presenting the image and/or voice of the second user while the first user is speaking, rather than when the remaining response time is shorter than the predetermined time.
  • the determination unit 12 of the first electronic device 1 may determine whether or not the speech of the first user is detected by the voice input unit 50 in step S113 of FIG. 7 or FIG. 8. If the speech of the first user is detected, the first electronic device 1 may not present the image and/or voice of the second user.
  • similarly, the first electronic device 1 may refrain from causing the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on the data indicating the response of the second user while the first user is speaking, rather than when the remaining response time is shorter than the predetermined time.
  • the determination unit 12 of the first electronic device 1 may determine whether or not to present at least one of the video and audio of the second user acquired from the second electronic device 100 to the first user, depending on whether or not the first electronic device 1 detects the audio of the first user. Furthermore, the determination unit 12 may determine whether or not to cause the power unit 80 to drive at least a part of the housing of the first electronic device 1, depending on whether or not the first electronic device 1 detects the audio of the first user.
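A sketch of this gating decision by the determination unit 12 follows, covering both the remaining-time condition and the speaking condition; the threshold value is an assumption.

```python
# Gate the presentation of the second user's response by the time remaining
# in the response window and by whether the first user is speaking
# (determination unit 12). The threshold is an illustrative assumption.
def should_present_response(remaining_s: float, first_user_speaking: bool,
                            min_remaining_s: float = 0.5) -> bool:
    if first_user_speaking:                  # variant: suppress while the first user talks
        return False
    return remaining_s >= min_remaining_s    # suppress when the window is nearly over

print(should_present_response(2.0, False))  # True: enough time left
print(should_present_response(0.2, False))  # False: window nearly closed
print(should_present_response(2.0, True))   # False: first user is speaking
```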
  • the first electronic device 1 may execute an operation that suggests to the first user that the response timing will be extended, instead of simply not presenting the image and/or voice of the second user. For example, if the remaining time until the end of the response timing is short when presenting the image and/or voice of the second user, the first electronic device 1 may notify the first user that the second user is about to speak. Likewise, if the remaining time until the end of the response timing is short when the power unit 80 is to drive at least a part of the housing of the first electronic device 1 based on data indicating the response of the second user, the first electronic device 1 may notify the first user that the second user is about to speak.
  • in this case, the first electronic device 1 may output a voice such as a filler word of the second user, for example "hmm" or "um," from the voice output unit 60. The first electronic device 1 may also use characters or images on the display unit 70 to show that the second user is about to speak, or may express this by driving the power unit 80. By suggesting to the first user that the response timing will be extended before presenting the image and/or audio of the second user, the risk that the second user's response will overlap with the first user's next utterance is reduced.
  • the first electronic device 1 may include a control unit 10.
  • the control unit 10 may control the first electronic device 1 to perform an operation suggesting to the first user that the response timing be extended.
  • the control unit 10 may perform such control when the remaining time until the end of the response timing is less than or equal to a predetermined time when at least one of the video and audio of the second user acquired from the second electronic device 100 is presented to the first user.
  • the control unit 10 may also perform such control when the remaining time until the end of the response timing is less than or equal to a predetermined time when the power unit 80 drives at least a part of the housing of the first electronic device 1 based on data indicating the response of the second user.
  • the third electronic device 300 may adjust the response timing estimated by the estimation unit 314 based on the timing when the second user responded to the first user's speech in the past.
  • the adjustment unit 316 of the third electronic device 300 may adjust the response timing estimated by the estimation unit 314 based on the timing when the second user responded to the first user's speech in the past. For example, if the response timing estimated by the estimation unit 314 is too early, the adjustment unit 316 may delay the response timing depending on the degree to which the response timing is determined to be too early. Also, if the response timing estimated by the estimation unit 314 is too late, the adjustment unit 316 may advance the response timing depending on the degree to which the response timing is determined to be too late.
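A sketch of this correction follows, shifting the estimate by the mean signed offset between past estimated timings and the second user's actual response times; the averaging window is an assumption.

```python
# Adjust the estimated response timing using the signed offsets between past
# estimates and the second user's actual past response times (adjustment
# unit 316). The averaging window is an illustrative assumption.
def adjusted_timing(estimated_ts: float, past_offsets_s: list[float]) -> float:
    if not past_offsets_s:
        return estimated_ts
    recent = past_offsets_s[-10:]             # look at the last few exchanges
    mean_offset = sum(recent) / len(recent)   # > 0 means estimates were too early
    return estimated_ts + mean_offset         # delay or advance accordingly

# Past responses arrived on average 0.4 s after the estimated timing,
# so the estimate was too early; push the next one later.
print(adjusted_timing(100.0, [0.5, 0.3, 0.4]))    # 100.4
```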
  • the system according to the modified example of the embodiment may include an adjustment unit 316.
  • the adjustment unit 316 may adjust the response timing estimated by the estimation unit 314 based on the timing at which the second user responded to the speech of the first user in the past.
  • the function of the adjustment unit 316 provided in the third electronic device 300 may be realized, for example, by the adjustment unit 116 provided in the second electronic device 100, or by the adjustment unit 16 provided in the first electronic device 1.
  • the time at which the response timing is presented may be adjusted, rather than adjusting the estimated response timing.
  • the adjustment unit 116 of the second electronic device 100 may adjust the time at which the response timing estimated by the estimation unit 314 is presented, based on the timing at which the second user responded to the speech of the first user in the past. For example, if the response timing estimated by the estimation unit 314 is too early, the adjustment unit 116 may delay the time at which the response timing is presented, depending on the degree to which the response timing is determined to be too early. Also, if the response timing estimated by the estimation unit 314 is too late, the adjustment unit 116 may advance the time at which the response timing is presented, depending on the degree to which the response timing is determined to be too late.
  • the second electronic device 100 may include an adjustment unit 116.
  • the adjustment unit 116 may adjust the time at which the response timing is presented to the second user based on the timing at which the second user responded to the speech of the first user in the past.
  • the function of the adjustment unit 116 included in the second electronic device 100 may be realized, for example, by an adjustment unit 316 included in the third electronic device 300, or may be realized by an adjustment unit 16 included in the first electronic device 1.
  • the embodiments of the present disclosure can also be realized as a method, as a program executed by a processor or the like included in the device, or as a storage medium or recording medium on which such a program is recorded. It should be understood that these are also included in the scope of the present disclosure.
  • the above-described embodiments are not limited to implementation as a system.
  • the above-described embodiments may be implemented as a control method for a system, or as a program executed in a system.
  • the above-described embodiments may be implemented as at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300.
  • the above-described embodiments may be implemented as a control method for at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300.
  • the above-described embodiments may be implemented as a program executed by at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300, or as a storage medium or recording medium on which the program is recorded.
  • the above-described embodiment may be implemented as the second electronic device 100.
  • the second electronic device 100 may be configured to be able to communicate with the first electronic device 1.
  • the second electronic device 100 may include an acquisition unit, an output unit, an estimation unit, and a presentation unit.
  • the acquisition unit may acquire at least one of an image and a voice of the user of the first electronic device 1.
  • the acquisition unit may be, for example, at least one of the imaging unit 140 and the voice input unit 150 shown in FIG. 3.
  • the output unit may output at least one of an image and a voice of the user of the first electronic device 1 to the user of the second electronic device 100 who responds to the speech of the user of the first electronic device 1.
  • the output unit may be, for example, at least one of the voice output unit 160 and the display unit 170 shown in FIG. 3.
  • the estimation unit may estimate the response timing of the user of the second electronic device 100 who responds to the speech of the user of the first electronic device 1 based on at least one of the image and the voice of the user of the first electronic device 1.
  • the estimation unit may be, for example, the estimation unit 114 shown in FIG. 3.
  • the presentation unit may present information indicating the response timing estimated by the estimation unit.
  • the presentation unit may be, for example, at least one of the audio output unit 160, the display unit 170, and the haptic sensation providing unit 190 shown in FIG. 3.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This system includes a first electronic device, a second electronic device, an estimation unit, and a control unit. The first electronic device acquires at least one from among video and audio of a first user. The second electronic device is configured so as to be capable of communicating with the first electronic device, and the at least one from among video and audio of the first user acquired by the first electronic device is output to a second user responding to speech of the first user. The estimation unit estimates, on the basis of the at least one from among video and audio of the first user, the response timing of the second user responding to speech of the first user. The control unit causes information indicating the response timing estimated by the estimation unit to be acquired by the second electronic device.

Description

SYSTEM, ELECTRONIC DEVICE, SYSTEM CONTROL METHOD, AND PROGRAM - Patent application

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to patent application No. 2022-156837, filed in Japan on September 29, 2022, the entire disclosure of which is incorporated herein by reference.
This disclosure relates to a system, an electronic device, a method for controlling the system, and a program.
In recent years, so-called remote conferences, such as web conferences or video conferences, have become more common. In remote conferences, electronic devices (or systems including electronic devices) are used to enable communication between participants in multiple locations. For example, consider a situation in which a conference is held in an office, and at least one of the conference participants joins the conference remotely from his or her home. In this case, audio and/or video of the conference in the office is acquired by, for example, an electronic device installed in the office, and transmitted to, for example, an electronic device installed in the participant's home. Also, audio and/or video at the participant's home is acquired by, for example, an electronic device installed in the participant's home, and transmitted to, for example, an electronic device installed in the office. Such electronic devices allow a conference to be held without all participants gathering in the same place.
Various technologies that can be applied to remote conferences such as those described above have been proposed. For example, Patent Document 1 discloses a device that displays a graphic representing the output range of directional sound output by a speaker, superimposed on an image captured by a camera. This device makes it possible to visually grasp the output range of directional sound. Furthermore, for example, Patent Document 2 discloses a system in which, when a speaker and a listener in separate locations are engaged in a conversation, a listener robot is attached to the speaker's side, and a speaker robot is attached to the listener's side.
Patent Document 1: JP 2010-21705 A
Patent Document 2: JP 2000-349920 A
A system according to one embodiment includes:
a first electronic device that acquires at least one of video and audio of a first user;
a second electronic device that is configured to be able to communicate with the first electronic device and that outputs at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to speech of the first user;
an estimation unit that estimates a response timing of the second user responding to the speech of the first user based on at least one of the video and audio of the first user; and
a control unit that causes the second electronic device to acquire information indicating the response timing estimated by the estimation unit.
An electronic device according to one embodiment is an electronic device configured to be able to communicate with another electronic device, and includes:
an acquisition unit that acquires at least one of video and audio of a user of the other electronic device;
an output unit that outputs at least one of the video and audio of the user of the other electronic device to a user of the electronic device who responds to speech of the user of the other electronic device;
an estimation unit that estimates a response timing of the user of the electronic device responding to the speech of the user of the other electronic device based on at least one of the video and audio of the user of the other electronic device; and
a presentation unit that presents information indicating the response timing estimated by the estimation unit.
A method for controlling a system according to one embodiment includes the steps of:
acquiring, by a first electronic device, at least one of video and audio of a first user;
outputting, by a second electronic device configured to be able to communicate with the first electronic device, at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to speech of the first user;
estimating a response timing of the second user responding to the speech of the first user based on at least one of the video and audio of the first user; and
causing the second electronic device to acquire information indicating the response timing.
A program according to one embodiment causes a computer to execute the steps of:
acquiring, by a first electronic device, at least one of video and audio of a first user;
outputting, by a second electronic device configured to be able to communicate with the first electronic device, at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to speech of the first user;
estimating a response timing of the second user responding to the speech of the first user based on at least one of the video and audio of the first user; and
causing the second electronic device to acquire information indicating the response timing.
FIG. 1 is a diagram illustrating an example of a usage mode of a system according to an embodiment.
FIG. 2 is a functional block diagram schematically illustrating the configuration of a first electronic device according to an embodiment.
FIG. 3 is a functional block diagram schematically illustrating the configuration of a second electronic device according to an embodiment.
FIG. 4 is a functional block diagram schematically illustrating the configuration of a third electronic device according to an embodiment.
FIG. 5 is a sequence diagram illustrating a basic operation of a system according to an embodiment.
FIG. 6 is a diagram illustrating response timing according to an embodiment.
FIG. 7 is a sequence diagram illustrating the operation of a system according to an embodiment.
FIG. 8 is a sequence diagram illustrating the operation of a system according to an embodiment.
In this disclosure, an "electronic device" may be, for example, a device that is driven by power supplied from a power system or a battery. In this disclosure, a "system" may include, for example, at least an electronic device. In this disclosure, a "user" may be a person (typically a human) who uses or may use an electronic device according to an embodiment, or a person who uses or may use a system including an electronic device according to an embodiment. In addition, in this disclosure, a conference in which at least one participant participates by communication from a different location than the other participants, such as a web conference or video conference, is collectively referred to as a "remote conference."
Further improvements in functionality are desired for electronic devices that enable communication between multiple locations in remote conferences and the like, for example to facilitate communication. The purpose of the present disclosure is to provide a system, an electronic device, a system control method, and a program that facilitate communication between multiple locations. According to one embodiment, it is possible to provide a system, an electronic device, a system control method, and a program that facilitate communication between multiple locations. Below, a system including an electronic device according to one embodiment is described in detail with reference to the drawings.
FIG. 1 is a diagram showing an example of how a system according to an embodiment is used. The following description assumes a situation in which participant Mg remotely participates in a conference held in conference room MR from his/her home RL, as shown in FIG. 1. As shown in FIG. 1, participants Ma, Mb, Mc, and Md participate in the conference in conference room MR. The participants of the conference in conference room MR are not limited to participants Ma, Mb, Mc, and Md, and may include, for example, other participants. Furthermore, participants other than participant Mg may also remotely participate in the conference from their respective homes.
As shown in FIG. 1, the system according to an embodiment may include, for example, a first electronic device 1, a second electronic device 100, and a third electronic device 300. In FIG. 1, the first electronic device 1, the second electronic device 100, and the third electronic device 300 are each shown only in schematic form. The system according to an embodiment need not include at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300, and may include devices other than the electronic devices mentioned above.
The first electronic device 1 according to one embodiment may be installed in the conference room MR. Meanwhile, the second electronic device 100 capable of communicating with the first electronic device 1 according to one embodiment may be installed in the home RL of the participant Mg. The location of the home RL of the participant Mg may be different from the location of the conference room MR. The home RL of the participant Mg may be far away from the conference room MR, or may be close to it.
As shown in FIG. 1, the first electronic device 1 according to an embodiment is connected to the second electronic device 100 according to an embodiment, for example via a network N. Also, as shown in FIG. 1, the third electronic device 300 according to an embodiment may be connected to at least one of the first electronic device 1 and the second electronic device 100, for example via the network N. The first electronic device 1 according to an embodiment may be connected to the second electronic device 100 by at least one of a wireless and a wired connection. The third electronic device 300 according to an embodiment may be connected to at least one of the first electronic device 1 and the second electronic device 100 by at least one of a wireless and a wired connection. In FIG. 1, the manner in which the first electronic device 1, the second electronic device 100, and the third electronic device 300 are connected wirelessly and/or by wire via the network N is indicated by dashed lines. In one embodiment, the first electronic device 1 and the second electronic device 100 may be included in a remote conference system according to the embodiment. The third electronic device 300 may also be included in the remote conference system according to the embodiment.
In the present disclosure, the network N shown in FIG. 1 may include various electronic devices and/or devices such as servers, as appropriate. The network N shown in FIG. 1 may also include devices such as base stations and/or repeaters, as appropriate. In the present disclosure, when, for example, the first electronic device 1 and the second electronic device 100 "communicate", the first electronic device 1 and the second electronic device 100 may communicate directly, or may communicate via at least one of another device, such as the third electronic device 300, and/or a base station. More specifically, when the first electronic device 1 and the second electronic device 100 "communicate", the communication unit of the first electronic device 1 and the communication unit of the second electronic device 100 may communicate with each other.
The above notation carries the same intent not only when the first electronic device 1 and the second electronic device 100 "communicate" with each other, but also when one "transmits" information to the other and/or when the other "receives" information transmitted by the one. Furthermore, it carries the same intent not only when the first electronic device 1 and the second electronic device 100 "communicate" with each other, but also when any electronic device, including the third electronic device 300, communicates with any other electronic device.
The first electronic device 1 according to one embodiment may be arranged in the conference room MR, for example as shown in FIG. 1. In this case, the first electronic device 1 may be arranged in a position where it can acquire the voice and/or video of at least one of the conference participants Ma, Mb, Mc, and Md. Furthermore, the first electronic device 1 outputs the voice and/or video of participant Mg, as described below. Therefore, the first electronic device 1 may be arranged so that the voice and/or video of participant Mg output from the first electronic device 1 reaches at least one of the conference participants Ma, Mb, Mc, and Md.
The second electronic device 100 according to one embodiment may be arranged in the home RL of participant Mg, for example in the manner shown in FIG. 1. In this case, the second electronic device 100 may be arranged in a position where it can acquire the voice and/or video of participant Mg. The second electronic device 100 may acquire the voice and/or video of participant Mg through a microphone or headset and/or a camera connected to the second electronic device 100.
The second electronic device 100 also outputs the audio and/or video of at least one of the participants Ma, Mb, Mc, and Md in the conference room MR, as described below. For this reason, the second electronic device 100 may be positioned so that the audio and/or video output from the second electronic device 100 reaches participant Mg. The audio output from the second electronic device 100 may be delivered to the ears of participant Mg via, for example, headphones, earphones, a speaker, or a headset.
The third electronic device 300 may be, for example, a device such as a server that relays between the first electronic device 1 and the second electronic device 100. The system according to one embodiment need not include the third electronic device 300.
FIG. 1 shows merely one example of how the first electronic device 1, the second electronic device 100, and the third electronic device 300 according to an embodiment may be used. The first electronic device 1, the second electronic device 100, and the third electronic device 300 according to an embodiment may be used in various other ways.
The remote conference system including the first electronic device 1 and the second electronic device 100 shown in FIG. 1 allows participant Mg to behave, while staying at home RL, as if he or she were participating in the conference held in the conference room MR. It also allows the conference participants Ma, Mb, Mc, and Md to feel as if participant Mg were actually present at the conference held in the conference room MR. That is, in the remote conference system including the first electronic device 1 and the second electronic device 100, the first electronic device 1 arranged in the conference room MR can play a role like an avatar of participant Mg. In this case, the first electronic device 1 may function as a physical avatar representing participant Mg (for example, a telepresence robot). The first electronic device 1 may also function as a virtual avatar that displays an image of participant Mg, or an image of participant Mg rendered as a character, on the first electronic device 1.
Next, the functional configurations of the first electronic device 1, the second electronic device 100, and the third electronic device 300 according to one embodiment will be described.
FIG. 2 is a block diagram schematically showing the functional configuration of the first electronic device 1 shown in FIG. 1. An example of the configuration of the first electronic device 1 according to an embodiment will be described below. As shown in FIG. 1, the first electronic device 1 may be a device used in the conference room MR by, for example, participants Ma, Mb, Mc, and Md. The second electronic device 100 described later has a function of outputting the voice and/or video of participant Mg, acquired by the second electronic device 100 when participant Mg speaks, to the first electronic device 1. The first electronic device 1 likewise has a function of outputting the voice and/or video of participants Ma, Mb, Mc, Md, and so on, acquired by the first electronic device 1 when they speak, to the second electronic device 100. The first electronic device 1 allows participants Ma, Mb, Mc, and Md to hold a remote conference or video conference in the conference room MR even if participant Mg is in a distant location. Accordingly, the first electronic device 1 is also referred to, as appropriate, as the electronic device "used locally".
Various devices can be envisaged as the first electronic device 1 according to one embodiment; for example, it may be a specially designed device. For example, the first electronic device 1 according to one embodiment may have a housing whose exterior bears an illustration of a person or the like, or a housing shaped to resemble at least part of a person or shaped like a robot. The first electronic device 1 according to one embodiment may also be a device such as a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or desktop computer. The first electronic device 1 according to one embodiment may, for example, render at least part of a person or a robot on the display of a notebook PC.
As shown in FIG. 2, the first electronic device 1 according to one embodiment may include a control unit 10, a storage unit 20, a communication unit 30, an imaging unit 40, an audio input unit 50, an audio output unit 60, a display unit 70, and a power unit 80. The control unit 10 may include, for example, a determination unit 12, an estimation unit 14, and an adjustment unit 16. In one embodiment, the first electronic device 1 need not include at least some of the functional units shown in FIG. 2, and may include components other than the functional units shown in FIG. 2.
The control unit 10 controls and/or manages the first electronic device 1 as a whole, including the functional units constituting the first electronic device 1. The control unit 10 may include at least one processor, such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), to provide the control and processing power for executing various functions. The control unit 10 may be realized collectively by a single processor, by several processors, or by individual processors for each function. The processor may be realized as a single integrated circuit (IC), or as a plurality of communicatively connected integrated circuits and discrete circuits. The processor may be realized based on various other known technologies.
The control unit 10 may include one or more processors and memories. The processors may include a general-purpose processor that loads a specific program to execute a specific function, and a dedicated processor specialized for specific processing. The dedicated processor may include an application-specific integrated circuit (ASIC). The processors may include a programmable logic device (PLD), and the PLD may include a field-programmable gate array (FPGA). The control unit 10 may be either an SoC (System-on-a-Chip) or a SiP (System In a Package) in which one or more processors cooperate. The control unit 10 controls the operation of each component of the first electronic device 1.
The control unit 10 may be configured to include, for example, at least one of software and hardware resources. In the first electronic device 1 according to one embodiment, the control unit 10 may be realized by concrete means in which software and hardware resources cooperate. Furthermore, in the first electronic device 1 according to one embodiment, at least one of the other functional units may likewise be realized by concrete means in which software and hardware resources cooperate.
Operations such as the control performed by the control unit 10 in the first electronic device 1 according to one embodiment will be described further below. The determination unit 12 of the control unit 10 can perform various determination processes, the estimation unit 14 can perform various estimation processes, and the adjustment unit 16 can perform various adjustment processes.
The storage unit 20 may function as a memory that stores various types of information. The storage unit 20 may store, for example, programs executed by the control unit 10 and the results of processes executed by the control unit 10. The storage unit 20 may also function as a work memory for the control unit 10. As shown in FIG. 2, the storage unit 20 may be connected to the control unit 10 by wire and/or wirelessly. The storage unit 20 may include, for example, at least one of a RAM (Random Access Memory) and a ROM (Read Only Memory). The storage unit 20 can be configured by, for example, a semiconductor memory, but is not limited thereto and can be any storage device. For example, the storage unit 20 may be a storage medium such as a memory card inserted into the first electronic device 1 according to one embodiment. The storage unit 20 may also be an internal memory of the CPU used as the control unit 10, or may be connected to the control unit 10 as a separate unit.
The communication unit 30 has an interface function for communicating with, for example, external devices wirelessly and/or by wire. The communication method used by the communication unit 30 in one embodiment may be a wireless communication standard. For example, wireless communication standards include cellular phone communication standards such as 2G, 3G, 4G, and 5G. Cellular phone communication standards include, for example, LTE (Long Term Evolution), W-CDMA (Wideband Code Division Multiple Access), CDMA2000, PDC (Personal Digital Cellular), GSM (registered trademark) (Global System for Mobile communications), and PHS (Personal Handy-phone System). Wireless communication standards also include, for example, WiMAX (Worldwide Interoperability for Microwave Access), IEEE 802.11, WiFi, Bluetooth (registered trademark), IrDA (Infrared Data Association), and NFC (Near Field Communication). The communication unit 30 may include, for example, a modem whose communication method is standardized by the ITU-T (International Telecommunication Union Telecommunication Standardization Sector). The communication unit 30 can support one or more of the above communication standards.
The communication unit 30 may be configured to include, for example, an antenna for transmitting and receiving radio waves and a suitable RF unit. The communication unit 30 may wirelessly communicate with the communication unit of another electronic device, for example via the antenna. The communication unit 30 may have a function of transmitting arbitrary information from the first electronic device 1 to other devices and/or a function of receiving, at the first electronic device 1, arbitrary information from other devices. For example, the communication unit 30 may wirelessly communicate with the second electronic device 100 shown in FIG. 1; in this case, it may wirelessly communicate with a communication unit 130 (described later) of the second electronic device 100. Thus, in one embodiment, the communication unit 30 has a function of communicating with the second electronic device 100. The communication unit 30 may also wirelessly communicate with the third electronic device 300 shown in FIG. 1; in this case, it may wirelessly communicate with a communication unit 330 (described later) of the third electronic device 300. Thus, in one embodiment, the communication unit 30 may have a function of communicating with the third electronic device 300. The communication unit 30 may also be configured as an interface, such as a connector, for a wired connection to the outside. Since the communication unit 30 can be configured using known wireless communication technology, a more detailed description of the hardware and the like is omitted.
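By way of illustration only, the transmitting and receiving roles of the communication unit described above might be pictured as in the following sketch. The disclosure specifies no transport, framing, or programming interface, so the use of TCP, the length-prefixed JSON framing, and all names here (CommunicationUnit, send, receive) are assumptions introduced solely for illustration.

```python
import json
import socket
import struct

class CommunicationUnit:
    """Illustrative stand-in for communication units such as 30, 130, and 330.

    Exchanges length-prefixed JSON messages over a TCP socket. This is a
    sketch only; the disclosure specifies no concrete protocol.
    """

    def __init__(self, sock: socket.socket):
        self.sock = sock

    def send(self, message: dict) -> None:
        # Serialize the message and prepend a 4-byte big-endian length header.
        payload = json.dumps(message).encode("utf-8")
        self.sock.sendall(struct.pack(">I", len(payload)) + payload)

    def receive(self) -> dict:
        # Read the length header, then exactly that many payload bytes.
        (length,) = struct.unpack(">I", self._read_exactly(4))
        return json.loads(self._read_exactly(length).decode("utf-8"))

    def _read_exactly(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf
```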
As shown in FIG. 2, the communication unit 30 may be connected to the control unit 10 by wire and/or wirelessly. Various pieces of information received by the communication unit 30 may be supplied to, for example, the storage unit 20 and/or the control unit 10, or may be stored in, for example, a memory built into the control unit 10. The communication unit 30 may also transmit, for example, the results of processing by the control unit 10 and/or information stored in the storage unit 20 to the outside.
The imaging unit 40 may be configured to include an image sensor that electronically captures images, such as a digital camera. The imaging unit 40 may include an imaging element that performs photoelectric conversion, such as a CCD (Charge Coupled Device) image sensor or a CMOS (Complementary Metal Oxide Semiconductor) sensor. The imaging unit 40 can capture, for example, images of the surroundings of the first electronic device 1, such as the interior of the conference room MR shown in FIG. 1. In one embodiment, the imaging unit 40 may capture images of the participants Ma, Mb, Mc, and Md of the conference held in the conference room MR shown in FIG. 1.
The imaging unit 40 may convert a captured image into a signal and transmit it to the control unit 10. For this reason, the imaging unit 40 may be connected to the control unit 10 by wire and/or wirelessly. A signal based on the image captured by the imaging unit 40 may also be supplied to any functional unit of the first electronic device 1, such as the storage unit 20 and/or the display unit 70. The imaging unit 40 is not limited to an imaging device such as a digital camera, and may be any device that captures the interior of the conference room MR shown in FIG. 1.
In one embodiment, the imaging unit 40 may capture the interior of the conference room MR as still images at predetermined time intervals (for example, 15 frames per second). In one embodiment, the imaging unit 40 may instead capture the interior of the conference room MR as continuous video. Furthermore, the imaging unit 40 may be configured to include a fixed camera or a movable camera.
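As a rough sketch of capturing still images at a predetermined interval such as 15 frames per second, the following code uses OpenCV as an assumed camera backend; the library choice, camera index, and frame rate are illustrative assumptions and not part of the disclosure.

```python
import time

import cv2  # assumption: OpenCV serves as the camera backend for illustration

def capture_frames(camera_index: int = 0, fps: float = 15.0):
    """Yield still frames at a fixed interval, in the spirit of the imaging
    unit 40 capturing the conference room at predetermined time intervals."""
    cap = cv2.VideoCapture(camera_index)
    interval = 1.0 / fps
    try:
        while True:
            start = time.monotonic()
            ok, frame = cap.read()
            if not ok:  # camera disconnected or no frame available
                break
            yield frame  # in the system, this would be encoded and transmitted
            # Sleep off whatever remains of the frame interval.
            time.sleep(max(0.0, interval - (time.monotonic() - start)))
    finally:
        cap.release()
```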
The audio input unit 50 detects (acquires) sounds or voices around the first electronic device 1, including voices uttered by people. For example, the audio input unit 50 may detect sound or voice as air vibrations, for example with a diaphragm, and convert it into an electrical signal. Specifically, the audio input unit 50 may include an acoustic device that converts sound into an electrical signal, such as any microphone. In one embodiment, the audio input unit 50 may detect (acquire) the voice of at least one of the participants Ma, Mb, Mc, and Md in the conference room MR shown in FIG. 1. The voice (electrical signal) detected by the audio input unit 50 may be input to, for example, the control unit 10. For this reason, the audio input unit 50 may be connected to the control unit 10 by wire and/or wirelessly.
The audio input unit 50 may convert the acquired sound or voice into an electrical signal and supply it to the control unit 10. The audio input unit 50 may also supply the electrical signal (audio signal) obtained from the sound or voice to a functional unit of the first electronic device 1, such as the storage unit 20. The audio input unit 50 may be any device that detects (acquires) sound or voice in the conference room MR shown in FIG. 1.
The audio output unit 60 converts an electrical signal (audio signal) of sound or voice supplied from the control unit 10 into sound, thereby outputting the audio signal as sound or voice. The audio output unit 60 may be connected to the control unit 10 by wire and/or wirelessly. The audio output unit 60 may be configured to include a device having a sound output function, such as any speaker (loudspeaker). In one embodiment, the audio output unit 60 may include a directional speaker that transmits sound in a specific direction, and may be configured so that the directionality of the sound can be changed. The audio output unit 60 may include an amplifier or an amplification circuit that appropriately amplifies the electrical signal (audio signal).
In one embodiment, the audio output unit 60 may amplify the audio signal that the communication unit 30 receives from the second electronic device 100. Here, the audio signal received from the second electronic device 100 may be, for example, the audio signal of a speaker (for example, participant Mg shown in FIG. 1) that the communication unit 30 receives from that speaker's second electronic device 100. That is, the audio output unit 60 may output the audio signal of the speaker (for example, participant Mg shown in FIG. 1) as that speaker's voice.
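A minimal numerical sketch of such amplification is shown below; representing the voice as a 16-bit numpy array and expressing the gain in decibels are assumptions for illustration only.

```python
import numpy as np

def amplify(voice: np.ndarray, gain_db: float = 6.0) -> np.ndarray:
    """Apply a fixed gain to a received 16-bit voice signal, clipping the
    result so it stays within the valid int16 sample range."""
    gain = 10.0 ** (gain_db / 20.0)  # convert decibels to a linear factor
    boosted = voice.astype(np.float32) * gain
    return np.clip(boosted, -32768, 32767).astype(np.int16)
```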
The display unit 70 may be any display device, such as a liquid crystal display (LCD), an organic EL display (organic electro-luminescence panel), or an inorganic EL display (inorganic electro-luminescence panel). The display unit 70 may display various types of information, such as characters, figures, or symbols. The display unit 70 may also display objects constituting various GUIs, icon images, and the like, for example to prompt the user to operate the first electronic device 1.
Various data necessary for display on the display unit 70 may be supplied from, for example, the control unit 10 or the storage unit 20. For this reason, the display unit 70 may be connected to the control unit 10 and the like by wire and/or wirelessly. When the display unit 70 includes, for example, an LCD, it may also include a backlight and the like as appropriate.
In one embodiment, the display unit 70 may display video based on a video signal transmitted from the second electronic device 100, for example video of participant Mg captured by the second electronic device 100. By displaying the video of participant Mg on the display unit 70 of the first electronic device 1, the participants Ma, Mb, Mc, and Md shown in FIG. 1, for example, can visually observe participant Mg, who is in a location away from the conference room MR.
The display unit 70 may display, for example, the video of participant Mg captured by the second electronic device 100 as it is. Alternatively, the display unit 70 may display an image in which participant Mg is rendered as a character (for example, as an avatar or a robot).
The power unit 80 generates power for driving any movable part of the first electronic device 1. The power unit 80 may include a power source, such as a servo motor, that drives the movable parts of the first electronic device 1. The power unit 80 may drive any movable part of the first electronic device 1 under the control of the control unit 10. For this reason, the power unit 80 may be connected to the control unit 10 by wire and/or wirelessly.
In one embodiment, the power unit 80 may drive, for example, at least part of the housing of the first electronic device 1. When the first electronic device 1 has a housing shaped to resemble at least part of a person or shaped like a robot, the power unit 80 may drive at least part of that human-like or robot-like shape.
Driven by the power unit 80, the first electronic device 1 may perform actions that express, for example, the emotions and/or behavior of participant Mg. For example, the first electronic device 1 may perform actions that represent a response by participant Mg. Here, a "response" may include backchannel cues, including short utterances such as "yes" and/or "uh-huh" made by the listener during or between the speaker's utterances. A "response" may also include head movements such as a nod indicating affirmation or a head shake indicating negation without any utterance, hand movements such as gestures, or movements of the entire upper body expressing a large emotional change such as surprise or deep agreement. Furthermore, a "response" may include changes of facial expression that move one or more parts of the face. Such responses are made consciously or unconsciously, either to show that the listener understands or agrees with what the speaker is saying, or to set a rhythm for the speech that makes it easier for the speaker to talk. Accordingly, the first electronic device 1 may, for example, drive at least part of a component imitating the head of participant Mg to perform an action representing a nod and/or head shake by participant Mg. For example, the first electronic device 1 may drive at least part of a component imitating the hand of participant Mg to perform an action such as a hand gesture by participant Mg, may drive at least part of a component imitating the upper body of participant Mg to perform an action expressing an emotion such as surprise or deep agreement, and may drive at least part of a component imitating one or more parts of the face of participant Mg to perform an action expressing a facial expression of participant Mg. The first electronic device 1 may also output, from the audio output unit 60, prerecorded backchannel utterances of participant Mg such as "yes" and/or "uh-huh". Furthermore, the first electronic device 1 may, driven by the power unit 80, perform actions expressing emotions of participant Mg such as joy, anger, sadness, and pleasure. In this case, the power unit 80 may, for example, drive at least part of a component imitating the face (facial expression) of participant Mg to perform actions expressing such emotions. The first electronic device 1 may also, driven by the power unit 80, perform actions such as a human shrug of the shoulders, a courtesy such as a human bow, or an action indicating an apology.
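The mapping from a detected response of participant Mg to an avatar-side action could be sketched as below. The response categories follow the text, but the dispatch structure and the interfaces device.power_unit.move and device.audio_output.play are hypothetical names introduced only for illustration.

```python
from enum import Enum, auto

class Response(Enum):
    BACKCHANNEL = auto()   # short utterances such as "yes" / "uh-huh"
    NOD = auto()           # affirmative head movement
    HEAD_SHAKE = auto()    # negative head movement
    HAND_GESTURE = auto()  # hand movement such as a wave
    SURPRISE = auto()      # large upper-body movement

def perform_response(device, response: Response) -> None:
    """Dispatch a detected response of participant Mg to the avatar-side
    device. The device attributes below are hypothetical stand-ins for the
    power unit 80 and the audio output unit 60."""
    if response is Response.BACKCHANNEL:
        device.audio_output.play("prerecorded_yes.wav")  # e.g., a recorded "yes"
    elif response is Response.NOD:
        device.power_unit.move(part="head", motion="nod")
    elif response is Response.HEAD_SHAKE:
        device.power_unit.move(part="head", motion="shake")
    elif response is Response.HAND_GESTURE:
        device.power_unit.move(part="hand", motion="wave")
    elif response is Response.SURPRISE:
        device.power_unit.move(part="torso", motion="recoil")
```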
Various known techniques may be used for the actions that express the emotions and/or behavior of a person such as participant Mg through display on the display unit 70 and/or driving of the power unit 80; a more detailed description of such actions is therefore omitted. The first electronic device 1 according to one embodiment can perform actions expressing the emotions and/or behavior of participant Mg through display on the display unit 70 and/or driving of the power unit 80.
In one embodiment, the first electronic device 1 may be a specially designed device, as described above. Alternatively, in one embodiment, the first electronic device 1 may include, for example, only the audio output unit 60 and the power unit 80 among the functional units shown in FIG. 2. In this case, the first electronic device 1 may be connected to another electronic device to supplement at least some of the functions of the other functional units shown in FIG. 2. Here, the other electronic device may be, for example, a device such as a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or desktop computer.
FIG. 3 is a block diagram schematically showing the configuration of the second electronic device 100 shown in FIG. 1. An example of the configuration of the second electronic device 100 according to an embodiment will be described below. As shown in FIG. 1, the second electronic device 100 may be a device used by, for example, participant Mg at his/her home RL. The first electronic device 1 described above has a function of outputting the voice and/or video of participants Ma, Mb, Mc, Md, and so on, acquired by the first electronic device 1 when they speak, to the second electronic device 100. The second electronic device 100 in turn has a function of outputting the voice and/or video of participant Mg, acquired by the second electronic device 100 when participant Mg speaks, to the first electronic device 1. The second electronic device 100 allows participant Mg to hold a remote conference or video conference even at a location away from the conference room MR. Accordingly, the second electronic device 100 is also referred to, as appropriate, as the electronic device "used remotely".
As shown in FIG. 3, the second electronic device 100 according to one embodiment may include a control unit 110, a storage unit 120, a communication unit 130, an imaging unit 140, an audio input unit 150, an audio output unit 160, a display unit 170, a tactile sensation providing unit 190, and an acquisition unit 200. The control unit 110 may include, for example, a determination unit 112, an estimation unit 114, and an adjustment unit 116. In one embodiment, the second electronic device 100 need not include at least some of the functional units shown in FIG. 3, and may include components other than the functional units shown in FIG. 3.
The control unit 110 controls and/or manages the second electronic device 100 as a whole, including the functional units constituting the second electronic device 100. The control unit 110 may basically be configured based on the same concept as, for example, the control unit 10 shown in FIG. 2. The determination unit 112, estimation unit 114, and adjustment unit 116 of the control unit 110 may likewise be configured based on the same concepts as the determination unit 12, estimation unit 14, and adjustment unit 16 of the control unit 10 shown in FIG. 2, respectively.
The storage unit 120 may function as a memory that stores various types of information. The storage unit 120 may store, for example, programs executed by the control unit 110 and the results of processes executed by the control unit 110. The storage unit 120 may also function as a work memory for the control unit 110. As shown in FIG. 3, the storage unit 120 may be connected to the control unit 110 by wire and/or wirelessly. The storage unit 120 may basically be configured based on the same concept as, for example, the storage unit 20 shown in FIG. 2.
The communication unit 130 has an interface function for communicating wirelessly and/or by wire. The communication unit 130 may wirelessly communicate with the communication unit of another electronic device, for example via an antenna. For example, the communication unit 130 may wirelessly communicate with the first electronic device 1 shown in FIG. 1; in this case, it may wirelessly communicate with the communication unit 30 of the first electronic device 1. Thus, in one embodiment, the communication unit 130 has a function of communicating with the first electronic device 1. The communication unit 130 may also wirelessly communicate with the third electronic device 300 shown in FIG. 1; in this case, it may wirelessly communicate with the communication unit 330 (described later) of the third electronic device 300. Thus, in one embodiment, the communication unit 130 may have a function of communicating with the third electronic device 300. As shown in FIG. 3, the communication unit 130 may be connected to the control unit 110 by wire and/or wirelessly. The communication unit 130 may basically be configured based on the same concept as the communication unit 30 shown in FIG. 2.
The imaging unit 140 may be configured to include an image sensor that electronically captures images, such as a digital camera. The imaging unit 140 may capture, for example, the interior of the home RL shown in FIG. 1. In one embodiment, the imaging unit 140 may capture, for example, participant Mg joining the conference from the home RL shown in FIG. 1. The imaging unit 140 may convert a captured image into a signal and transmit it to the control unit 110, and for this reason may be connected to the control unit 110 by wire and/or wirelessly. The imaging unit 140 may basically be configured based on the same concept as the imaging unit 40 shown in FIG. 2.
The audio input unit 150 detects (acquires) sounds or voices around the second electronic device 100, including voices uttered by people. For example, the audio input unit 150 may detect sound or voice as air vibrations, for example with a diaphragm, and convert it into an electrical signal. Specifically, the audio input unit 150 may include an acoustic device that converts sound into an electrical signal, such as any microphone. In one embodiment, the audio input unit 150 may detect (acquire) the voice of participant Mg in the home RL shown in FIG. 1. The voice (electrical signal) detected by the audio input unit 150 may be input to, for example, the control unit 110, and for this reason the audio input unit 150 may be connected to the control unit 110 by wire and/or wirelessly. The audio input unit 150 may basically be configured based on the same concept as the audio input unit 50 shown in FIG. 2.
The audio output unit 160 converts an electrical signal (audio signal) supplied from the control unit 110 into sound, thereby outputting the audio signal as sound or voice. The audio output unit 160 may be connected to the control unit 110 by wire and/or wirelessly. The audio output unit 160 may be configured to include a device having a sound output function, such as any speaker (loudspeaker). In one embodiment, the audio output unit 160 may output the voice detected by the audio input unit 50 of the first electronic device 1, that is, the voice of at least one of the participants Ma, Mb, Mc, and Md in the conference room MR shown in FIG. 1. The audio output unit 160 may basically be configured based on the same concept as the audio output unit 60 shown in FIG. 2.
The display unit 170 may be any display device, such as a liquid crystal display (LCD), an organic EL display, or an inorganic EL display. The display unit 170 may basically be configured based on the same concept as the display unit 70 shown in FIG. 2. Various data necessary for display on the display unit 170 may be supplied from, for example, the control unit 110 or the storage unit 120, and for this reason the display unit 170 may be connected to the control unit 110 and the like by wire and/or wirelessly.
The display unit 170 may be a touch screen display having a touch panel function that detects input made by contact of, for example, participant Mg's finger or a stylus.
In one embodiment, the display unit 170 may display video based on a video signal transmitted from the first electronic device 1, for example video of participants Ma, Mb, Mc, Md, and so on captured by the imaging unit 40 of the first electronic device 1. By displaying the video of participants Ma, Mb, Mc, and Md on the display unit 170 of the second electronic device 100, participant Mg shown in FIG. 1, for example, can visually observe participants Ma, Mb, Mc, and Md in the conference room MR away from his/her home RL.
The display unit 170 may display, for example, the video of participants Ma, Mb, Mc, Md, and so on captured by the first electronic device 1 as it is. Alternatively, the display unit 170 may display images in which participants Ma, Mb, Mc, Md, and so on are rendered as characters (for example, avatars).
In one embodiment, the display unit 170 may have a function of notifying, for example, participant Mg of the response timing described later. That is, participant Mg can learn the response timing by looking at the display unit 170. In one embodiment, the notification of the response timing may instead be given by an indicator such as an LED.
The tactile sensation providing unit 190 may have a function of presenting a tactile sensation, such as vibration, to, for example, a finger of participant Mg. In one embodiment, the tactile sensation providing unit 190 may be combined with the display unit 170 having a touch screen display function. In such a configuration, for example, when participant Mg touches the display unit 170 to operate the second electronic device 100, he or she can perceive the tactile sensation presented by the tactile sensation providing unit 190. In one embodiment, the tactile sensation providing unit 190 may also have a function of notifying, for example, participant Mg of the response timing described later.
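How the response timing described later might be surfaced to participant Mg through the display unit 170, an LED indicator, and/or the tactile sensation providing unit 190 could be sketched as follows; all device interfaces here (show_banner, blink, pulse) are hypothetical names used only for illustration.

```python
def notify_response_timing(device, channels=("display", "led", "vibration")) -> None:
    """Surface a detected response timing to participant Mg on whichever
    output channels are available. The attribute names are illustrative."""
    if "display" in channels:
        device.display.show_banner("response timing")  # visual cue on unit 170
    if "led" in channels:
        device.led.blink(times=2)                      # indicator cue
    if "vibration" in channels:
        device.haptics.pulse(duration_ms=150)          # tactile cue via unit 190
```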
The acquisition unit 200 may be any of various functional units that acquire a second user's response to a first user's utterance. The second user's response is described further below. The acquisition unit 200 of the second electronic device 100 may, for example, acquire input to at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3, and may be configured to include at least one of them. The acquisition unit 200 may also acquire a mouse click or touch input by the user, or input to a motion sensor and/or a foot pedal. Furthermore, the acquisition unit 200 may include an input device that detects a mouse click or touch input by the user, and may include a motion sensor and/or a foot pedal.
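One way to picture the acquisition unit 200 is as a funnel that collects response events from the several input modalities named above into a single queue for the control unit; the event fields and class names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResponseEvent:
    source: str       # e.g., "voice", "camera", "mouse", "touch", "motion", "pedal"
    kind: str         # e.g., "backchannel", "nod", "click"
    timestamp: float  # seconds since some reference point

class AcquisitionUnit:
    """Sketch of acquisition unit 200: it performs no sensing itself here,
    but gathers events produced by the audio input unit 150, the imaging
    unit 140, and other input devices into one queue."""

    def __init__(self) -> None:
        self._events: List[ResponseEvent] = []

    def push(self, event: ResponseEvent) -> None:
        self._events.append(event)

    def poll(self) -> Optional[ResponseEvent]:
        # Return the oldest pending response event, if any.
        return self._events.pop(0) if self._events else None
```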
In one embodiment, the second electronic device 100 may be a specially designed device, as described above. Alternatively, in one embodiment, the second electronic device 100 may include only some of the functional units shown in FIG. 3. In this case, the second electronic device 100 may be connected to another electronic device to supplement at least some of the functions of the other functional units shown in FIG. 3. Here, the other electronic device may be, for example, a device such as a general-purpose smartphone, tablet, phablet, notebook computer (notebook PC or laptop), or desktop computer.
In particular, a smartphone or a notebook computer often includes almost all of the functional units shown in FIG. 3. For this reason, in one embodiment, the second electronic device 100 may be a smartphone, a notebook computer, or the like. In this case, the second electronic device 100 may be a smartphone or notebook computer on which an application (program) for cooperating with the first electronic device 1 has been installed.
FIG. 4 is a block diagram schematically showing the configuration of the third electronic device 300 shown in FIG. 1. An example of the configuration of the third electronic device 300 according to an embodiment will be described below. As shown in FIG. 1, the third electronic device 300 may be installed in a location different from, for example, the home RL of participant Mg and the conference room MR. The third electronic device 300 may also be installed in or near the home RL of participant Mg, or in or near the conference room MR.
The first electronic device 1 has a function of transmitting the audio and/or video data of participants Ma, Mb, Mc, Md, and so on, acquired by the first electronic device 1 when they speak, to the third electronic device 300. The third electronic device 300 may transmit the audio and/or video data received from the first electronic device 1 to the second electronic device 100. Likewise, the second electronic device 100 has a function of transmitting the audio and/or video data of participant Mg, acquired by the second electronic device 100 when participant Mg speaks, to the third electronic device 300. The third electronic device 300 may transmit the audio and/or video data received from the second electronic device 100 to the first electronic device 1. In this way, the third electronic device 300 may have a function of relaying between the first electronic device 1 and the second electronic device 100. The third electronic device 300 is also referred to, as appropriate, as the "server".
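Under the assumption of a simple byte-stream transport, the relaying function of the third electronic device 300 can be pictured as two concurrent one-way copies; the use of asyncio and the 4096-byte read size are illustrative choices, not part of the disclosure.

```python
import asyncio

async def pump(src: asyncio.StreamReader, dst: asyncio.StreamWriter) -> None:
    # Forward one direction of the stream until the sender closes it.
    while data := await src.read(4096):
        dst.write(data)
        await dst.drain()
    dst.close()

async def relay(reader1: asyncio.StreamReader, writer1: asyncio.StreamWriter,
                reader2: asyncio.StreamReader, writer2: asyncio.StreamWriter) -> None:
    """Forward bytes from peer 1 to peer 2 and vice versa, as the third
    electronic device 300 relays audio/video data between the first
    electronic device 1 and the second electronic device 100."""
    await asyncio.gather(pump(reader1, writer2), pump(reader2, writer1))
```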
As shown in FIG. 4, the third electronic device 300 according to one embodiment may include a control unit 310, a storage unit 320, and a communication unit 330. The control unit 310 may include, for example, a determination unit 312, an estimation unit 314, and an adjustment unit 316. In one embodiment, the third electronic device 300 need not include at least some of the functional units shown in FIG. 4, and may include components other than the functional units shown in FIG. 4.
The control unit 310 controls and/or manages the third electronic device 300 as a whole, including the functional units constituting the third electronic device 300. The control unit 310 may basically be configured based on the same concept as, for example, the control unit 10 shown in FIG. 2. The determination unit 312, estimation unit 314, and adjustment unit 316 of the control unit 310 may likewise be configured based on the same concepts as the determination unit 12, estimation unit 14, and adjustment unit 16 of the control unit 10 shown in FIG. 2, respectively.
The storage unit 320 may function as a memory that stores various types of information. The storage unit 320 may store, for example, programs executed by the control unit 310 and the results of processes executed by the control unit 310. The storage unit 320 may also function as a work memory for the control unit 310. As shown in FIG. 4, the storage unit 320 may be connected to the control unit 310 by wire and/or wirelessly. The storage unit 320 may basically be configured based on the same concept as, for example, the storage unit 20 shown in FIG. 2.
The communication unit 330 has an interface function for communicating wirelessly and/or by wire. The communication unit 330 may wirelessly communicate with the communication unit of another electronic device, for example via an antenna. For example, the communication unit 330 may wirelessly communicate with the first electronic device 1 shown in FIG. 1; in this case, it may wirelessly communicate with the communication unit 30 of the first electronic device 1. Thus, in one embodiment, the communication unit 330 has a function of communicating with the first electronic device 1. The communication unit 330 may also wirelessly communicate with the second electronic device 100 shown in FIG. 1; in this case, it may wirelessly communicate with the communication unit 130 of the second electronic device 100. Thus, in one embodiment, the communication unit 330 may have a function of communicating with the second electronic device 100. As shown in FIG. 4, the communication unit 330 may be connected to the control unit 310 by wire and/or wirelessly. The communication unit 330 may basically be configured based on the same concept as the communication unit 30 shown in FIG. 2.
 一実施形態において、第3電子機器300は、例えば専用に設計された機器としてもよい。一方、一実施形態において、第3電子機器300は、例えば図4に示す機能部のうち一部を備えてもよい。この場合、第3電子機器300は、図4に示す他の機能部の機能の少なくとも一部を補うために、他の電子機器に接続されてもよい。ここで、他の電子機器とは、例えば、汎用のコンピュータ又はサーバなどの機器としてもよい。一実施形態において、第3電子機器300は、例えば中継サーバ、ウェブサーバ、又はアプリケーションサーバなどとしてもよい。 In one embodiment, the third electronic device 300 may be, for example, a specially designed device. On the other hand, in one embodiment, the third electronic device 300 may include, for example, some of the functional units shown in FIG. 4. In this case, the third electronic device 300 may be connected to other electronic devices to supplement at least some of the functions of the other functional units shown in FIG. 4. Here, the other electronic devices may be, for example, devices such as a general-purpose computer or server. In one embodiment, the third electronic device 300 may be, for example, a relay server, a web server, or an application server.
 Next, the basic operation of the first electronic device 1 and the second electronic device 100 according to one embodiment will be described. The following description assumes a situation in which, as shown in FIG. 1, a participant Mg takes part from his or her home RL in a remote conference held in a conference room MR.
 That is, the first electronic device 1 according to one embodiment is installed in the conference room MR and acquires video and/or audio of at least one of the participants Ma, Mb, Mc, and Md. The video and/or audio acquired by the first electronic device 1 is transmitted to the second electronic device 100 installed in the home RL of the participant Mg. The second electronic device 100 outputs the video and/or audio of at least one of the participants Ma, Mb, Mc, and Md acquired by the first electronic device 1. This allows the participant Mg to recognize the video and/or audio of at least one of the participants Ma, Mb, Mc, and Md.
 Meanwhile, the second electronic device 100 according to one embodiment is installed in the home RL of the participant Mg and acquires video and/or audio of the participant Mg. The video and/or audio acquired by the second electronic device 100 is transmitted to the first electronic device 1 installed in the conference room MR. The first electronic device 1 outputs the video and/or audio of the participant Mg received from the second electronic device 100. This allows at least one of the participants Ma, Mb, Mc, and Md to recognize the video and/or audio of the participant Mg.
 FIG. 5 is a sequence diagram explaining the basic operation of the system according to the embodiment described above. FIG. 5 shows the exchange of data and the like among the first electronic device 1, the second electronic device 100, and the third electronic device 300. The basic operation when a remote conference or video conference is held using the system according to the embodiment will be described below with reference to FIG. 5.
 In the operation shown in FIG. 5, the first electronic device 1 used locally may be used by a first user. Here, the first user may be, for example, at least one of the participants Ma, Mb, Mc, and Md shown in FIG. 1 (hereinafter also referred to as the local user). The second electronic device 100 used remotely may be used by a second user. Here, the second user may be, for example, the participant Mg shown in FIG. 1 (hereinafter also referred to as the remote user). In the following, an operation performed by the first electronic device 1 may, more specifically, be performed by, for example, the control unit 10 of the first electronic device 1; in this specification, an operation performed by the control unit 10 of the first electronic device 1 may be described as an operation performed by the first electronic device 1. Similarly, an operation performed by the second electronic device 100 may, more specifically, be performed by, for example, the control unit 110 of the second electronic device 100, and may be described as an operation performed by the second electronic device 100. Likewise, an operation performed by the third electronic device 300 may, more specifically, be performed by, for example, the control unit 310 of the third electronic device 300, and may be described as an operation performed by the third electronic device 300.
 When the operation shown in FIG. 5 starts, the first electronic device 1 acquires at least one of video and audio of the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) (step S1). Specifically, in step S1, the first electronic device 1 may capture video of the first user with the imaging unit 40 and acquire (or detect) the voice of the first user with the audio input unit 50. Next, the first electronic device 1 encodes at least one of the video and audio of the first user (step S2). In step S2, encoding may mean compressing the video and/or audio data according to a predetermined rule and converting it into a format suited to the purpose, which may include encryption. The first electronic device 1 may perform any of various known types of encoding, such as software encoding or hardware encoding.
 Next, the first electronic device 1 transmits the encoded video and/or audio data to the third electronic device 300 (step S3). Specifically, in step S3, the first electronic device 1 transmits the video and/or audio data from the communication unit 30 to the communication unit 330 of the third electronic device 300. Also in step S3, the third electronic device 300 receives, via the communication unit 330, the video and/or audio data transmitted from the communication unit 30 of the first electronic device 1.
 Next, the third electronic device 300 transmits the encoded video and/or audio data received from the communication unit 30 to the second electronic device 100 (step S4). Specifically, in step S4, the third electronic device 300 transmits the video and/or audio data from the communication unit 330 to the communication unit 130 of the second electronic device 100. Also in step S4, the second electronic device 100 receives, via the communication unit 130, the video and/or audio data transmitted from the communication unit 330 of the third electronic device 300.
 Next, the second electronic device 100 decodes the encoded video and/or audio data received from the communication unit 330 (step S5). In step S5, decoding may mean returning the format of the encoded video and/or audio data to its original format. The second electronic device 100 may perform any of various known types of decoding, such as software decoding or hardware decoding.
 Next, the second electronic device 100 presents at least one of the video and audio of the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) to the second user (e.g., the participant Mg) (step S6). Specifically, in step S6, the second electronic device 100 may display the video of the first user on the display unit 170 and output the voice of the first user from the audio output unit 160.
 Through the operations of steps S1 to S6, the second user (e.g., the participant Mg) at home RL, for example, can recognize the video and/or audio of the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) in the conference room MR.
 The above describes how the first electronic device 1 transmits video and/or audio of the first user to the second electronic device 100 via the third electronic device 300. By the reverse procedure, the second electronic device 100 can transmit video and/or audio of the second user to the first electronic device 1 via the third electronic device 300.
 That is, the second electronic device 100 acquires at least one of video and audio of the second user (e.g., the participant Mg) (step S11). Specifically, in step S11, the second electronic device 100 may capture video of the second user with the imaging unit 140 and acquire (or detect) the voice of the second user with the audio input unit 150. Next, the second electronic device 100 encodes at least one of the video and audio of the second user (step S12).
 Next, the second electronic device 100 transmits the encoded video and/or audio data to the third electronic device 300 (step S13). Specifically, in step S13, the second electronic device 100 transmits the video and/or audio data from the communication unit 130 to the communication unit 330 of the third electronic device 300. Also in step S13, the third electronic device 300 receives, via the communication unit 330, the video and/or audio data transmitted from the communication unit 130 of the second electronic device 100.
 Next, the third electronic device 300 transmits the encoded video and/or audio data received from the communication unit 130 to the first electronic device 1 (step S14). Specifically, in step S14, the third electronic device 300 transmits the video and/or audio data from the communication unit 330 to the communication unit 30 of the first electronic device 1. Also in step S14, the first electronic device 1 receives, via the communication unit 30, the video and/or audio data transmitted from the communication unit 330 of the third electronic device 300.
 Next, the first electronic device 1 decodes the encoded video and/or audio data received from the communication unit 330 (step S15).
 Next, the first electronic device 1 presents at least one of the video and audio of the second user (e.g., the participant Mg) to the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) (step S16). Specifically, in step S16, the first electronic device 1 may display the video of the second user on the display unit 70 and output the voice of the second user from the audio output unit 60.
 Through the operations of steps S11 to S16, the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) in the conference room MR, for example, can recognize the video and/or audio of the second user (e.g., the participant Mg) at his or her home RL.
 The operations from steps S1 to S6 and the operations from steps S11 to S16 may be executed in the reverse order; that is, the operations from steps S11 to S16 may be executed first, followed by the operations from steps S1 to S6. The two sets of operations may also be executed simultaneously, or so that they at least partially overlap.
 Here, issues that can arise in a remote conference or video conference realized as described above will be explained.
 For example, as shown in FIG. 1, when at least a part of the network N includes an Internet line or the like, the communication speed of that line is usually not guaranteed, and the service is often provided under a best-effort contract. If a dedicated line were laid between the conference room MR and the home RL of the participant Mg shown in FIG. 1, a certain communication speed could be expected to be secured. However, laying a dedicated line tends to present a high cost hurdle. For this reason, recent remote conferences or video conferences are typically realized with a configuration in which at least a part of the network N includes an Internet line or the like, and communication delays often simply have to be accepted.
 Furthermore, in the operation shown in FIG. 5, the encoding and/or decoding operations also require a certain amount of time. For example, when the first user asks the second user a question, encoding and decoding must each be performed twice before the second user's response reaches the first user who asked the question. Even if a single encode or decode takes only a short time, when such processing goes back and forth between the first user and the second user, a non-negligible delay can accumulate in the conversation.
 When the conversation between the first user and the second user is delayed in this way, the remote conference or video conference may no longer proceed smoothly. For example, when the first user requests a response from the second user, such as confirmation of intent, the timing at which the video and/or audio indicating the end of the first user's utterance actually reaches the second user may be delayed. In that case, the timing of the second user's response is delayed further, so the first user may become unable to wait for the second user's response, or the second user's response may overlap with the first user's next utterance. If such situations occur, the amount and/or quality of the information conveyed to the participants may also decline. Therefore, in a remote conference or video conference, it is desirable, for smooth communication, that the listener's response to the speaker's utterance be conveyed and shared appropriately.
 Accordingly, the system according to one embodiment estimates, based on the first user's utterance, the response timing at which the second user should respond to that utterance, and presents the arrival of that response timing to the second user. The system according to one embodiment may estimate the second user's response timing at a point before the first user's utterance ends.
 Next, estimation of the response timing by the system according to one embodiment will be described. FIG. 6 is a diagram explaining estimation of the response timing by the system according to one embodiment.
 The upper part of FIG. 6 shows the waveform of the first user's voice acquired (detected) by the audio input unit 50 while the first user of the first electronic device 1 is talking. In the upper graph of FIG. 6, the vertical axis indicates the level of the first user's voice, and the horizontal axis indicates time (clock time). The vertical axis of the upper graph of FIG. 6 may represent, for example, the sound pressure of the first user's voice acquired by the audio input unit 50, converted into a voltage and then amplified. The vertical axis may also represent the sound pressure or the volume of the first user's voice.
 In general, a human conversation progresses through an exchange between oneself and the other party. For this reason, in a typical conversation, as shown in the upper graph of FIG. 6, there are times when one speaks and times when one waits for the other party's speech, reply, or a reaction such as a back-channel response. While waiting for the other party's speech, reply, or back-channel response, one tends to emit little or no voice. Therefore, the silent or nearly silent intervals shown in the upper graph of FIG. 6 may be regarded as desirable timings for the second user to respond. Such a timing at which it is desirable for the second user to respond to the first user's conversation is referred to as the (second user's) "response timing".
 The lower graph of FIG. 6 illustrates the response timing. As shown in the lower graph of FIG. 6, an interval in which the first user emits little or no voice may be regarded as the second user's response timing. The timing at which the first user, who has been speaking, stops emitting (almost all) voice may be taken as the start of the second user's response timing, and the timing at which the first user, who has not been speaking, next emits voice may be taken as the end of the second user's response timing. In the lower graph of FIG. 6, the state in which the response timing is on is indicated by a value of +1, and the state in which it is off by a value of -1.
 The system according to one embodiment may estimate the response timing shown in FIG. 6 while the first user's voice is being acquired, rather than determining it by analysis after the voice has been acquired. That is, the system according to one embodiment may estimate the start of each response timing shown in FIG. 6 before the corresponding utterance of the first user ends. In this case, the system according to one embodiment may estimate the start of the response timing based on, for example, acoustic features and/or linguistic features of the first user's voice acquired by the audio input unit 50. The system according to one embodiment may also estimate the start of the response timing based on the video of the first user captured by the imaging unit 40, that is, the first user's face, facial expressions, gestures, and/or body movements, instead of or in addition to the first user's voice.
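 As a concrete illustration of the on/off signal in the lower graph of FIG. 6, the following minimal Python sketch derives a ±1 response-timing signal from an audio waveform by thresholding short-frame RMS energy. The frame length, level threshold, and hangover count are illustrative assumptions, not values taken from this disclosure, and the sketch assumes samples normalized to [-1, 1].

```python
import numpy as np

def response_timing_signal(samples: np.ndarray, rate: int,
                           frame_ms: int = 20,
                           level_threshold: float = 0.02,
                           hangover_frames: int = 15) -> np.ndarray:
    """Return +1 (response timing on) / -1 (off) for each audio frame.

    A frame counts as voiced when its RMS level exceeds level_threshold.
    The response timing turns on only after `hangover_frames` consecutive
    non-voiced frames, so short pauses inside an utterance are not
    mistaken for response timings.
    """
    frame_len = rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    signal = np.full(n_frames, -1, dtype=int)
    silent_run = 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(float)
        rms = np.sqrt(np.mean(frame ** 2))
        silent_run = silent_run + 1 if rms < level_threshold else 0
        if silent_run >= hangover_frames:
            signal[i] = +1  # little or no voice: response timing is on
    return signal
```

Because the decision for each frame depends only on past frames, the same loop can run online, frame by frame, while the voice is being acquired, matching the "estimate during acquisition" behavior described above.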
 Various methods can be envisaged for estimating the response timing from the first user's voice and/or video. For example, the system according to one embodiment may estimate the start of the response timing based on, as an acoustic feature of the first user's voice, the timing at which the volume drops or the intonation falls. The system according to one embodiment may also estimate the start of the response timing based on, as a linguistic feature of the first user's speech, the timing at which a sentence ends with a polite sentence-final form such as "-desu" or "-masu" (in Japanese). The system according to one embodiment may also estimate the start of the response timing based on, as a feature of the first user's video, the timing at which the first user returns his or her gaze to the first electronic device 1 after having looked away from it.
 The system according to one embodiment may also estimate the end of the response timing based on, as an acoustic feature of the first user's voice, the timing at which the volume of speech next rises after having dropped. The system according to one embodiment may also, as an acoustic feature of the first user's voice, regard the point at which the volume rises as the timing of a question, and estimate the end of the response timing based on the average response time for such a question.
 The system according to one embodiment may also determine, as a linguistic feature of the first user's speech, whether the content of the utterance is an open-ended question or a closed-ended (answer-selection) question. In this case, the system according to one embodiment may set, for example, a longer response timing after an open-ended question than after a closed-ended question, and conversely a shorter response timing after a closed-ended question than after an open-ended question.
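 Purely as an illustration of these linguistic heuristics, the sketch below classifies the tail of an utterance transcript and picks a response-window length. The cue lists, the romanized Japanese endings, and the durations are hypothetical choices for demonstration, not values specified by this disclosure.

```python
OPEN_CUES = ("how", "why", "what do you think")          # hypothetical open-question cues
CLOSED_CUES = ("is it", "can you", "do you", "right")    # hypothetical closed-question cues
POLITE_ENDINGS = ("desu", "masu")  # romanized Japanese polite sentence-final forms

def estimate_response_window(utterance: str) -> float:
    """Return an assumed response-window length in seconds for an utterance."""
    text = utterance.strip().lower().rstrip("?")
    # A polite sentence-final form often marks a turn boundary.
    base = 1.5 if text.endswith(POLITE_ENDINGS) else 1.0
    if any(cue in text for cue in OPEN_CUES):
        return base + 3.0   # open-ended question: allow a longer response timing
    if any(cue in text for cue in CLOSED_CUES):
        return base + 1.0   # closed-ended question: a shorter window suffices
    return base
```

A deployed system would more plausibly obtain this classification from a trained model, as the later discussion of machine learning suggests; the point here is only the mapping from question type to window length.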
 The system according to one embodiment may also determine, as an acoustic and/or linguistic feature of the first user's speech, whether or not the conversation is lively, or the degree to which it is lively. In this case, the system according to one embodiment may set a relatively short response timing when the first user's conversation is determined to be lively.
 The system according to one embodiment may also determine, as an acoustic and/or linguistic feature of the first user's speech, whether the content of the conversation is positive or negative. In this case, the system according to one embodiment may set a relatively short response timing when the first user's conversation is determined to be relatively positive in content, and a relatively long response timing when it is determined to be relatively negative in content.
 The system according to one embodiment may also estimate the response timing, or correct an estimated response timing, by analyzing the history of the first user's past audio and/or video. The system according to one embodiment may estimate the response timing based on, for example, AI (Artificial Intelligence) technology, or based on machine learning (or indeed deep learning) techniques.
 Once the response timing has been estimated as described above, the system according to one embodiment may indicate to the second user, on the second electronic device 100, that it is the response timing when the estimated time arrives. The second electronic device 100 may present the arrival of the response timing to the second user as at least one of visual information, auditory information, and tactile information. For example, the second electronic device 100 may notify the second user of the response timing by displaying a message such as "You have been asked a question" or "It is your turn" on the display unit 170. The second electronic device 100 may also notify the second user of the response timing by lighting or blinking the display unit 170 configured as an indicator such as an LED. The second electronic device 100 may also notify the second user of the response timing by outputting audio such as "You have been asked a question" or "It is your turn" from the audio output unit 160, or by outputting a predetermined notification sound from the audio output unit 160. The second electronic device 100 may also notify the second user of the response timing by outputting tactile information, such as a predetermined vibration, from the tactile sensation providing unit 190.
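 One way to picture how such a notification might fan out to whichever output units are present is the minimal sketch below. The argument names mirror the reference numerals above, but the method interfaces (show_message, play_notification_tone, vibrate) are hypothetical stand-ins, not an API defined by this disclosure.

```python
def present_response_timing(display=None, speaker=None, haptics=None) -> None:
    """Notify the second user that the response timing has arrived.

    Each argument stands in for an output unit of the second electronic
    device 100 (display unit 170, audio output unit 160, tactile
    sensation providing unit 190); any unit may be absent.
    """
    if display is not None:
        display.show_message("It is your turn")   # visual information
    if speaker is not None:
        speaker.play_notification_tone()          # auditory information
    if haptics is not None:
        haptics.vibrate(duration_ms=300)          # tactile information
```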
 In this case, the system according to one embodiment may transmit the response timing with priority over, for example, the normal audio and/or video communication. Since conveying the response timing is merely a notification of a point in time, prioritizing it over the audio and/or video communication is considered to have almost no effect on that communication. In the system according to one embodiment, the response timing may be conveyed using, for example, a publish/subscribe server. Furthermore, in the system according to one embodiment, the response timing may be conveyed over a line separate from that used for the normal audio and/or video communication.
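 The disclosure mentions a publish/subscribe server or a separate line but does not fix a transport. Purely as an illustration of sending the timing notification on a channel separate from the audio/video stream, the following sketch sends a tiny JSON datagram over UDP; the port number and message fields are illustrative assumptions.

```python
import json
import socket
import time

TIMING_PORT = 50007  # illustrative port, distinct from the A/V stream

def send_response_timing(host: str, start_epoch: float, duration_s: float) -> None:
    """Publish an estimated response timing as a small JSON datagram.

    The payload is only a few dozen bytes, so delivering it ahead of the
    audio/video data adds effectively no load to the connection.
    """
    message = json.dumps({
        "type": "response_timing",
        "start": start_epoch,      # when the response timing begins
        "duration": duration_s,    # expected length of the window
        "sent_at": time.time(),
    }).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, TIMING_PORT))
```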
 Thus, according to the system of one embodiment, the second user can be informed of the original response timing even when, for example, the audio and/or video is delayed. The second user can therefore respond to the first user's utterance at the appropriate timing. In other words, the system according to one embodiment reduces situations in which the first user becomes unable to wait for the second user's response, and also reduces situations in which the second user's response overlaps with the first user's next utterance. The system according to one embodiment can therefore facilitate communication between multiple locations.
 The estimation of the response timing described above may be executed by at least one of the estimation unit 14 of the first electronic device 1, the estimation unit 114 of the second electronic device 100, and the estimation unit 314 of the third electronic device 300. In this case, whichever of the estimation units 14, 114, and 314 does not perform the estimation need not be an essential component. Likewise, the various determination processes related to the estimation of the response timing described above may be executed by the determination unit 12 of the first electronic device 1, the determination unit 112 of the second electronic device 100, or the determination unit 312 of the third electronic device 300. Furthermore, the processing related to correcting an estimated response timing may be executed by the adjustment unit 16 of the first electronic device 1, the adjustment unit 116 of the second electronic device 100, or the adjustment unit 316 of the third electronic device 300.
 Next, characteristic operations of the system according to one embodiment will be further described. FIG. 7 is a sequence diagram explaining characteristic operations of the system according to one embodiment. Like FIG. 5, FIG. 7 shows the exchange of data and the like among the first electronic device 1, the second electronic device 100, and the third electronic device 300. The characteristic operations when a remote conference or video conference is held using the system according to one embodiment will be described below with reference to FIG. 7. The encoding and decoding of data described with reference to FIG. 5 may use known techniques, so their description is omitted in FIG. 7. In the following, descriptions of content that is the same as or similar to what has already been described with reference to FIG. 5 may be simplified or omitted as appropriate.
 When the operation shown in FIG. 7 starts, the first electronic device 1 acquires at least one of video and audio of the first user (e.g., at least one of the participants Ma, Mb, Mc, and Md) (step S101). The operation of step S101 may be the same as step S1 in FIG. 5.
 Next, the first electronic device 1 transmits the video and/or audio data of the first user to the third electronic device 300 (step S102). The operation of step S102 may be the same as step S3 in FIG. 5. The third electronic device 300 transmits the video and/or audio data of the first user received from the first electronic device 1 to the second electronic device 100 (step S103). The operation of step S103 may be the same as step S4 in FIG. 5.
 When the second electronic device 100 receives the video and/or audio data of the first user from the third electronic device 300 in step S103, it presents at least one of the video and audio of the first user to the second user (e.g., the participant Mg) (step S104). The operation of step S104 may be the same as step S6 in FIG. 5.
 When the third electronic device 300 receives the video and/or audio data of the first user from the first electronic device 1 in step S102, it estimates the response timing based on the video and/or audio data of the first user (step S105). The estimation of the response timing executed in step S105 can be performed as described above.
 Next, the third electronic device 300 determines whether the time of the estimated response timing has arrived (step S106). If the time of the response timing has not yet arrived in step S106, the third electronic device 300 may wait until it arrives or may execute other processing. When the time of the response timing arrives in step S106, the third electronic device 300 transmits information indicating the estimated response timing to the second electronic device 100 (steps S107 and S108).
 When the second electronic device 100 receives the information indicating the response timing from the third electronic device 300 in step S108, it presents to the second user that it is the response timing (step S109). In step S109, the second electronic device 100 may present the response timing to the second user as at least one of visual information, auditory information, and tactile information, as described above. By presenting the response timing in this way, the second user can respond to the first user's conversation at the appropriate timing.
 Next, the second electronic device 100 acquires the response of the second user (e.g., the participant Mg) to the first user's utterance (step S110). For example, the acquisition unit 200 of the second electronic device 100 may acquire the second user's response to the first user's utterance. The acquisition unit 200 of the second electronic device 100 may acquire, for example, input to at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3. The acquisition unit 200 may also acquire a mouse click or touch input by the user, or input to a motion sensor and/or a foot pedal. The second user's response may include, for example, back-and-forth or up-and-down movements of the head (nodding), left-and-right movements of the head (head shaking), hand gestures, movements of the upper body, facial expressions, or back-channel responses including short utterances such as "yes", "no", or "uh-huh". The responses acquired by the second electronic device 100 are not limited to those described above, and the second electronic device 100 may acquire a combination of them as the second user's response.
 In acquiring the second user's response, the second electronic device 100 may acquire at least one of video and audio of the second user. The second electronic device 100 may then acquire the second user's response by, for example, performing image recognition on the acquired video and speech recognition on the acquired audio. However, what the second electronic device 100 acquires is not limited to at least one of video and audio of the second user. For example, when the second electronic device 100 includes a motion sensor that captures the movement of a human body, it may acquire the second user's response by acquiring (detecting) body movements of the second user such as a nod.
 For example, the second electronic device 100 may include a motion sensor. In this case, the second electronic device 100 may be a device worn by the second user, such as a wearable terminal, or held in the second user's hand, such as a mouse or a touch pen. In such cases, the second electronic device 100 may also acquire the second user's response by connecting, by wire or wirelessly, to a smartphone, a tablet terminal, a foot pedal, or the like held by the second user. The acquisition of the second user's response is not limited to the above, and these methods may be combined to acquire the second user's response.
 Here, Table 1 below shows examples of the correspondence between information such as video and audio acquired by the second electronic device 100 and the method of detecting the second user's response. The second electronic device 100 may acquire a head nod or a head shake as the second user's response by, for example, performing image recognition on the acquired video. The second electronic device 100 may also, for example, perform speech recognition on the acquired audio and, when a positive word is detected, acquire a head-nodding action as the second user's response; likewise, when a negative word is detected, it may acquire a head-shaking action as the second user's response. The second electronic device 100 may also connect to a wearable terminal worn by the second user, such as headphones equipped with a motion sensor; in this case, it may acquire the nodding or head-shaking action of the second user detected by the wearable terminal as the second user's response. The second electronic device 100 may also be a handheld device equipped with a motion sensor, such as a smartphone or a tablet; in this case, when tilted back and forth by the second user, it may acquire the head-nodding action associated with that motion as the second user's response. The second electronic device 100 may also detect a head nod by a mouse click; in this case, the second electronic device 100 may display on the display unit 170 a GUI with a button corresponding to a head nod, and acquire the second user's response when the second user clicks the mouse button.
 [Table 1: example correspondence between the information acquired by the second electronic device 100 (video, audio, motion-sensor input, mouse input) and the method of detecting the second user's response]
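 Purely as an illustration of the Table 1 correspondences described above, the sketch below maps recognized events to a nod or head-shake response. The event names and the positive/negative word lists are hypothetical stand-ins for the outputs of actual image recognition, speech recognition, or motion sensing.

```python
from typing import Optional

# Hypothetical word lists for speech-recognition results.
POSITIVE_WORDS = {"yes", "sure", "agreed"}
NEGATIVE_WORDS = {"no", "disagree"}

def detect_response(event_type: str, payload: str) -> Optional[str]:
    """Map an input event to the second user's response ('nod' or 'shake').

    event_type identifies the acquisition channel (image recognition,
    speech recognition, motion sensor, or GUI/mouse), mirroring Table 1.
    """
    if event_type == "image":    # image recognition on captured video
        return payload if payload in ("nod", "shake") else None
    if event_type == "speech":   # speech recognition on captured audio
        words = set(payload.lower().split())
        if words & POSITIVE_WORDS:
            return "nod"         # positive word detected: treat as a nod
        if words & NEGATIVE_WORDS:
            return "shake"       # negative word detected: treat as a head shake
        return None
    if event_type == "motion":   # wearable or handheld motion sensor
        return {"pitch": "nod", "yaw": "shake"}.get(payload)
    if event_type == "mouse":    # GUI button assigned to a nod
        return "nod" if payload == "nod_button" else None
    return None
```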
 Next, the second electronic device 100 transmits the acquired data, such as the video and/or audio of the second user, to the third electronic device 300 (step S111). The operation of step S111 may be the same as step S13 in FIG. 5. Here, the data transmitted from the second electronic device 100 to the third electronic device 300 may include data indicating a body movement, such as a nod, corresponding to the second user's response.
 The third electronic device 300 transmits the data, such as the video and/or audio of the second user, received from the second electronic device 100 to the first electronic device 1 (step S112). The operation of step S112 may be the same as step S14 in FIG. 5.
 When the first electronic device 1 receives data such as the video and/or audio of the second user from the third electronic device 300 in step S112, it presents at least one of the video and audio of the second user to the first user (e.g., the participant Ma) (step S113). The operation of step S113 may be the same as step S16 in FIG. 5. Furthermore, when the first electronic device 1 has received data indicating a body movement of the second user, such as a nod, it may reproduce the second user's body movement in step S113 by, for example, driving the power unit 80, or by displaying that movement on, for example, the display unit 70.
 Thus, according to the system of one embodiment, the first user can receive the second user's response to his or her own utterance at the appropriate timing. Therefore, the system according to one embodiment can facilitate communication between multiple locations.
(Other Embodiments)
 In FIG. 7, the first electronic device 1 and the second electronic device 100 communicate via the third electronic device 300. However, in one embodiment, the operations described above may be performed without going through the third electronic device 300. In this case, the first electronic device 1 and the second electronic device 100 may be configured to communicate with each other directly or indirectly.
(Other Embodiments)
 In FIG. 7, the third electronic device 300 estimates the response timing in advance in step S105 and, when the time of that response timing arrives, transmits information indicating the response timing to the second electronic device 100. However, the system according to one embodiment is not limited to such a configuration. A system according to a modification of one embodiment is further described below.
 FIG. 8 is a sequence diagram explaining characteristic operations of a system according to a modification of the embodiment shown in FIG. 7. Only the points that differ from the characteristic operations described with reference to FIG. 7 are explained below.
 As shown in FIG. 8, the operations from step S101 to step S105 may be the same as in FIG. 7. On the other hand, once the third electronic device 300 has estimated the response timing (in advance) in step S105, it may transmit information indicating the response timing to the second electronic device 100 even before the time of that response timing arrives (steps S121 and S122). In this case, the second electronic device 100 that has received the information indicating the response timing determines whether the time of the estimated response timing has arrived (step S123). If the time of the response timing has not yet arrived in step S123, the second electronic device 100 may wait until it arrives or may execute other processing. When the time of the response timing arrives in step S123, the second electronic device 100 presents to the second user that it is the response timing (step S109). The operations from step S110 to step S113 may be the same as in FIG. 7.
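 As a minimal sketch of this variant on the receiving side, assuming the timing information arrives as an epoch timestamp, the notification can be scheduled locally; the function and field names are illustrative, not part of this disclosure.

```python
import threading
import time

def schedule_response_timing(start_epoch: float, notify) -> None:
    """Arrange for `notify()` to run when the estimated response timing arrives.

    start_epoch is the estimated start of the response timing, received in
    advance from the third electronic device 300 (steps S121/S122). If the
    time has already arrived, notify immediately (step S123).
    """
    delay = start_epoch - time.time()
    if delay <= 0:
        notify()  # the response timing has already arrived
    else:
        threading.Timer(delay, notify).start()  # wait, then present (step S109)

# Illustrative use: present the timing 1.5 seconds from now.
schedule_response_timing(time.time() + 1.5, lambda: print("It is your turn"))
```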
 Thus, also with the system according to the modification of the embodiment shown in FIG. 8, the first user can receive the second user's response to his or her own utterance at the appropriate timing.
(Other Embodiments)
 In the embodiments described above, the second electronic device 100 detects the second user's response. However, in other embodiments, the first electronic device 1 and/or the third electronic device 300 may detect the second user's response.
 As described above, the system according to one embodiment may include, for example, the first electronic device 1, the second electronic device 100, and the third electronic device 300. The first electronic device 1 acquires at least one of video and audio of the first user. The second electronic device 100 may be configured to be able to communicate with the first electronic device 1. The second electronic device 100 outputs at least one of the video and audio of the first user acquired by the first electronic device 1 to the second user, who responds to the first user's utterance. The third electronic device 300 may include a control unit 310 and an estimation unit 314. The estimation unit 314 may estimate, based on at least one of the video and audio of the first user, the response timing at which the second user responds to the first user's utterance. The control unit 310 may perform control so that information indicating the response timing estimated by the estimation unit 314 is acquired by the second electronic device 100.
 The second electronic device 100 according to one embodiment may include a presentation unit. In this case, the presentation unit of the second electronic device 100 may present the response timing to the second user as at least one of visual information, auditory information, and tactile information. Here, the presentation unit of the second electronic device 100 may be, for example, at least one of the display unit 170, the audio output unit 160, and the tactile sensation providing unit 190 shown in FIG. 3. The presentation unit of the second electronic device 100 may present the response timing to the second user at the point when the response timing is reached.
 The second electronic device 100 according to one embodiment may include an acquisition unit 200. In this case, the acquisition unit 200 of the second electronic device 100 may acquire the second user's response as at least one of video and audio. Here, the acquisition unit 200 of the second electronic device 100 may be, for example, at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3, or may acquire input to at least one of them. Furthermore, the acquisition unit 200 may acquire a mouse click or touch input by the user, or input to a motion sensor and/or a foot pedal.
 The second electronic device 100 according to one embodiment may also include a communication unit 130. In this case, the communication unit 130 may transmit at least one of the video and audio acquired by the above-described acquisition unit to the first electronic device 1.
 The control unit 310 of the third electronic device 300 according to one embodiment may perform control so that, for example, information indicating the response timing estimated by the estimation unit 314 is transmitted to the second electronic device 100 before that response timing is reached (i.e., in advance).
 In one embodiment, the second electronic device 100 may include an acquisition unit 200 that acquires the second user's response corresponding to a predetermined action of the second user. In one embodiment, the second electronic device 100 may also include a communication unit 130 that transmits data indicating the second user's response to the first electronic device 1. In one embodiment, the first electronic device 1 may also include a power unit 80 that drives at least a part of the housing of the first electronic device 1 based on the data indicating the second user's response.
 The estimation unit 314 of the third electronic device 300 according to one embodiment may estimate the response timing. In this case, the estimation unit 314 may estimate the response timing based on at least one of acoustic features of the first user's voice and linguistic features of the first user's speech, extracted from at least one of the video and audio of the first user acquired by the first electronic device 1. The estimation unit 314 may also estimate the response timing based on at least one of features of the first user's facial expressions and the first user's gestures, extracted from at least one of the video and audio of the first user acquired by the first electronic device 1.
 The estimation unit 314 of the third electronic device 300 according to one embodiment may also estimate the response timing by predicting the timing at which the first user's current utterance will end and the timing at which the first user's next utterance will start.
 Next, modifications of the system according to one embodiment are further described.
(Modification of an Embodiment)
 For example, in the embodiments described above, if the second user responds close to the point at which the response timing ends, that response may overlap with the start of the first user's next utterance. If that happens, the first user, having started the next utterance, may break it off, and communication may no longer be smooth. Therefore, in step S113 of FIG. 7 or FIG. 8, for example, when presenting the video and/or audio of the second user, the first electronic device 1 may refrain from presenting the video and/or audio of the second user if the time remaining until the end of the response timing is short. The first electronic device 1 may likewise refrain from causing the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on the data indicating the second user's response.
 In these cases, the third electronic device 300 may, for example, also transmit the response timing estimated by the estimation unit 314 to the first electronic device 1. Then, for example, in step S113 of FIG. 7 or FIG. 8, the determination unit 12 of the first electronic device 1 may determine whether the remainder of the response timing is shorter than a predetermined time. If the remainder of the response timing is shorter than the predetermined time, the first electronic device 1 may refrain from presenting the video and/or audio of the second user, and may refrain from causing the power unit 80 to drive at least a part of the housing of the first electronic device 1 based on the data indicating the second user's response.
 In this way, the first electronic device 1 may include a determination unit 12. In this case, the determination unit 12 may determine, based on the remaining time until the end of the response timing, whether to present to the first user at least one of the second user's video and audio acquired from the second electronic device 100. The determination unit 12 may likewise determine, based on that remaining time, whether to cause the power unit 80 to drive at least a part of the housing of the first electronic device 1. A minimal sketch of this check follows.
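 A minimal sketch of the remaining-time check; the threshold value and function name are illustrative assumptions, not taken from the disclosure.

```python
SUPPRESS_THRESHOLD_S = 0.5  # assumed minimum usable remainder of the window

def should_present_response(now_s: float, window_end_s: float,
                            threshold_s: float = SUPPRESS_THRESHOLD_S) -> bool:
    # Determination unit 12: present the second user's video/audio (or let
    # power unit 80 drive the housing) only if enough of the response
    # timing remains; otherwise hold back to avoid colliding with the
    # first user's next utterance.
    return (window_end_s - now_s) >= threshold_s

print(should_present_response(now_s=9.8, window_end_s=10.1))  # -> False
```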
 According to this modification, even if the second user responds near the end of the response timing window, the response is prevented from overlapping with the start of the first user's next utterance. The first user who has begun the next utterance is therefore not interrupted, and communication can proceed smoothly.
 In a further variation of the above modification, the first electronic device 1 may refrain from presenting the second user's video and/or audio while the first user is speaking, rather than when the remainder of the response timing is shorter than the predetermined time. For example, in step S113 of FIG. 7 or FIG. 8, when presenting the second user's video and/or audio, the first electronic device 1 may refrain from presenting them if the first user is speaking. In this case, the determination unit 12 of the first electronic device 1 may determine, in step S113 of FIG. 7 or FIG. 8, whether the audio input unit 50 is detecting the first user's speech. If the first user's speech is detected, the first electronic device 1 may refrain from presenting the second user's video and/or audio. Similarly, while the first user is speaking, the first electronic device 1 may refrain from causing the power unit 80 to drive at least a part of its housing based on the data indicating the second user's response.
 In this way, the determination unit 12 of the first electronic device 1 may determine whether to present to the first user at least one of the second user's video and audio acquired from the second electronic device 100, depending on whether the first electronic device 1 is detecting the first user's voice. The determination unit 12 may likewise determine whether to cause the power unit 80 to drive at least a part of the housing of the first electronic device 1, depending on whether the first electronic device 1 is detecting the first user's voice. A sketch of this variant follows.
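 The sketch below assumes a toy energy-based voice-activity check in place of whatever detector the audio input unit 50 actually uses; the threshold is an illustrative assumption.

```python
def is_speech(samples: list[float], energy_threshold: float = 0.01) -> bool:
    # Toy voice-activity detection: mean energy above a threshold counts
    # as speech; a real device would use a proper VAD.
    if not samples:
        return False
    return sum(s * s for s in samples) / len(samples) > energy_threshold

def should_present_response(mic_samples: list[float]) -> bool:
    # Hold back the second user's video/audio while the first user talks.
    return not is_speech(mic_samples)

print(should_present_response([0.0, 0.001, -0.002]))  # near-silence -> True
```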
 In another variation of the above modification, rather than withholding the second user's video and/or audio, the first electronic device 1 may perform an operation that suggests to the first user that the response timing is being extended. For example, when presenting the second user's video and/or audio, if the remaining time until the end of the response timing is short, the first electronic device 1 may indicate to the first user that the second user is about to speak. Likewise, when the power unit 80 drives at least a part of the housing of the first electronic device 1 based on the data indicating the second user's response, if the remaining time until the end of the response timing is short, the first electronic device 1 may indicate to the first user that the second user is about to speak. In such cases, the first electronic device 1 may output, from the audio output unit 60, a sound resembling a filler word of the second user, such as "hmm" or "um". The first electronic device 1 may also display, on the display unit 70, text or video showing that the second user is about to speak, or may express this by driving the power unit 80. By suggesting to the first user that the response timing is being extended before presenting the second user's video and/or audio, the risk that the second user's response overlaps with the first user's next utterance is reduced.
 In this way, in a system according to this modification, the first electronic device 1 may include a control unit 10. In this case, the control unit 10 may control the first electronic device 1 to perform an operation suggesting the extension of the response timing to the first user. The control unit 10 may perform such control when at least one of the second user's video and audio acquired from the second electronic device 100 is presented to the first user and the remaining time until the end of the response timing is at or below a predetermined value. The control unit 10 may likewise perform such control when the power unit 80 drives at least a part of the housing of the first electronic device 1 based on the data indicating the second user's response and the remaining time until the end of the response timing is at or below a predetermined value. A sketch of this behavior follows.
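 A sketch of the extension hint, using stub callbacks; the filler file name, the threshold, and the callback signatures are illustrative assumptions.

```python
EXTENSION_THRESHOLD_S = 0.5  # assumed: below this, hint before presenting

def handle_incoming_response(remaining_s: float, play_audio, present_response):
    # Control unit 10: if the response timing is almost over when the
    # second user's response arrives, first hint that the second user is
    # about to speak (e.g., a filler sound from audio output unit 60),
    # then present the response.
    if remaining_s <= EXTENSION_THRESHOLD_S:
        play_audio("filler_um.wav")
    present_response()

handle_incoming_response(
    remaining_s=0.3,
    play_audio=lambda f: print(f"audio output unit 60: playing {f}"),
    present_response=lambda: print("presenting second user's response"),
)
```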
(Modification of one embodiment)
 For example, in the embodiment described above, even if the response timing is presented to the second user, the moment at which the second user actually responds varies from person to person. If the interval from when the response timing is presented to the second user to when the second user actually responds is extremely short or extremely long, the first user may feel something is off, and communication may no longer be smooth.
 The third electronic device 300 may therefore adjust the response timing estimated by the estimation unit 314 based on the timings at which the second user responded to the first user's utterances in the past. In this case, the adjustment unit 316 of the third electronic device 300 may perform this adjustment. For example, if the response timing estimated by the estimation unit 314 is too early, the adjustment unit 316 may delay it according to the degree to which it is determined to be too early. Conversely, if the estimated response timing is too late, the adjustment unit 316 may advance it according to the degree to which it is determined to be too late.
 In this way, a system according to this modification may include an adjustment unit 316. In this case, the adjustment unit 316 may adjust the response timing estimated by the estimation unit 314 based on the timings at which the second user responded to the first user's utterances in the past. The function of the adjustment unit 316 of the third electronic device 300 may also be realized by, for example, the adjustment unit 116 of the second electronic device 100 or the adjustment unit 16 of the first electronic device 1. A minimal sketch of such an adjustment follows.
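 A minimal sketch, assuming the bias is the mean offset between the user's actual past response times and the estimated timings; the averaging rule is an illustrative assumption, as the disclosure only requires that past response timings inform the adjustment.

```python
from statistics import mean

def adjust_response_timing(estimated_s: float,
                           past_offsets_s: list[float]) -> float:
    # past_offsets_s: for each past exchange, (actual response time -
    # estimated response timing); positive means the second user tends
    # to respond later than the estimate.
    if not past_offsets_s:
        return estimated_s
    # Shift the estimate toward the user's habit: delay it if the user
    # tends to respond late, advance it if the user tends to respond
    # early. The same rule could instead shift the point at which the
    # timing is presented (adjustment unit 116).
    return estimated_s + mean(past_offsets_s)

print(adjust_response_timing(10.0, [0.4, 0.6, 0.5]))  # -> 10.5
```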
 Alternatively, in a system according to this modification, the time at which the response timing is presented may be adjusted instead of the estimated response timing itself. In this case, the adjustment unit 116 of the second electronic device 100 may adjust the time at which the response timing estimated by the estimation unit 314 is presented, based on the timings at which the second user responded to the first user's utterances in the past. For example, if the response timing estimated by the estimation unit 314 is too early, the adjustment unit 116 may delay the point at which it is presented according to the degree to which it is determined to be too early. Conversely, if the estimated response timing is too late, the adjustment unit 116 may advance the point at which it is presented according to the degree to which it is determined to be too late.
 In this way, in a system according to this modification, the second electronic device 100 may include, for example, an adjustment unit 116. Here, the adjustment unit 116 may adjust the time at which the response timing is presented to the second user, based on the timings at which the second user responded to the first user's utterances in the past. The function of the adjustment unit 116 of the second electronic device 100 may also be realized by, for example, the adjustment unit 316 of the third electronic device 300 or the adjustment unit 16 of the first electronic device 1.
 Although the embodiments of the present disclosure have been described with reference to the drawings and examples, it should be noted that those skilled in the art can easily make various variations or modifications based on the present disclosure, and that such variations and modifications fall within the scope of the present disclosure. For example, the functions included in each component or each step can be rearranged so as not to cause logical inconsistency, and multiple components or steps can be combined into one or divided. Although the embodiments of the present disclosure have been described mainly in terms of a device, they can also be realized as a method including the steps executed by the components of the device. The embodiments of the present disclosure can also be realized as a method or program executed by a processor provided in the device, or as a storage medium or recording medium on which the program is recorded. It should be understood that these are also included within the scope of the present disclosure.
 The embodiments described above are not limited to implementation as a system. For example, they may be implemented as a method for controlling a system, or as a program executed in a system. They may also be implemented as a device such as at least one of the first electronic device 1, the second electronic device 100, and the third electronic device 300, or as a method for controlling such a device. Furthermore, they may be implemented as a program executed by such a device, or as a storage medium or recording medium on which that program is recorded.
 For example, the embodiments described above may be implemented as the second electronic device 100. In this case, the second electronic device 100 may be configured to be capable of communicating with the first electronic device 1, and may include an acquisition unit, an output unit, an estimation unit, and a presentation unit. The acquisition unit may acquire at least one of the video and audio of the user of the first electronic device 1, and may be, for example, at least one of the imaging unit 140 and the audio input unit 150 shown in FIG. 3. The output unit may output at least one of the video and audio of the user of the first electronic device 1 to the user of the second electronic device 100 who responds to the utterance of the user of the first electronic device 1, and may be, for example, at least one of the audio output unit 160 and the display unit 170 shown in FIG. 3. The estimation unit may estimate the response timing of the user of the second electronic device 100 responding to the utterance of the user of the first electronic device 1, based on at least one of the video and audio of the user of the first electronic device 1; it may be, for example, the estimation unit 114 shown in FIG. 3. The presentation unit may present information indicating the response timing estimated by the estimation unit, and may be, for example, at least one of the audio output unit 160, the display unit 170, and the tactile sensation providing unit 190 shown in FIG. 3.
LIST OF SYMBOLS
 1 First electronic device
 10 Control unit
 12 Determination unit
 14 Estimation unit
 16 Adjustment unit
 20 Memory unit
 30 Communication unit
 40 Imaging unit
 50 Audio input unit
 60 Audio output unit
 70 Display unit
 80 Power unit
 100 Second electronic device
 110 Control unit
 112 Determination unit
 114 Estimation unit
 116 Adjustment unit
 120 Memory unit
 130 Communication unit
 140 Imaging unit
 150 Audio input unit
 160 Audio output unit
 170 Display unit
 190 Tactile sensation providing unit
 200 Acquisition unit
 300 Third electronic device
 310 Control unit
 312 Determination unit
 314 Estimation unit
 316 Adjustment unit
 320 Memory unit
 330 Communication unit
 N Network

Claims (19)

  1.  A system comprising:
     a first electronic device that acquires at least one of video and audio of a first user;
     a second electronic device configured to be capable of communicating with the first electronic device and to output at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user;
     an estimation unit that estimates a response timing of the second user responding to the utterance of the first user, based on at least one of the video and audio of the first user; and
     a control unit that causes the second electronic device to acquire information indicating the response timing estimated by the estimation unit.
  2.  The system according to claim 1, wherein the second electronic device comprises a presentation unit that presents the response timing to the second user as at least one of visual information, auditory information, and tactile information.
  3.  The system according to claim 2, wherein the presentation unit presents the response timing to the second user at the point at which the response timing is reached.
  4.  The system according to claim 1, wherein the second electronic device comprises an acquisition unit that acquires the response of the second user as at least one of video and audio, and a communication unit that transmits at least one of the video and audio acquired by the acquisition unit to the first electronic device.
  5.  The system according to claim 4, wherein the first electronic device comprises a determination unit that determines, based on the remaining time until the end of the response timing, whether to present to the first user at least one of the video and audio of the second user acquired from the second electronic device.
  6.  The system according to claim 4, comprising a determination unit that determines whether to present to the first user at least one of the video and audio of the second user acquired from the second electronic device, depending on whether the first electronic device is detecting the voice of the first user.
  7.  The system according to claim 1, wherein
     the second electronic device comprises an acquisition unit that acquires a response of the second user corresponding to a predetermined action of the second user, and a communication unit that transmits data indicating the response of the second user to the first electronic device, and
     the first electronic device comprises a power unit that drives at least a part of a housing of the first electronic device based on the data indicating the response of the second user.
  8.  The system according to claim 7, wherein the first electronic device comprises a determination unit that determines, based on the remaining time until the end of the response timing, whether to cause the power unit to drive at least a part of the housing of the first electronic device.
  9.  The system according to claim 7, comprising a determination unit that determines whether to cause the power unit to drive at least a part of the housing of the first electronic device, depending on whether the first electronic device is detecting the voice of the first user.
  10.  The system according to claim 1, wherein the control unit transmits information indicating the response timing estimated by the estimation unit to the second electronic device before the response timing is reached.
  11.  The system according to claim 1, wherein the estimation unit estimates the response timing based on at least one of features of the first user's voice, features of the first user's language, features of the first user's facial expression, and the first user's gestures, extracted from at least one of the video and audio of the first user acquired by the first electronic device.
  12.  The system according to claim 1 or 11, wherein the estimation unit estimates the response timing by predicting the timing at which the first user's current utterance ends and the timing at which the first user's next utterance starts.
  13.  The system according to claim 1, wherein the second electronic device comprises an adjustment unit that adjusts the time at which the response timing is presented to the second user, based on the timing at which the second user responded to an utterance of the first user in the past.
  14.  The system according to claim 1, comprising an adjustment unit that adjusts the response timing estimated by the estimation unit, based on the timing at which the second user responded to an utterance of the first user in the past.
  15.  The system according to claim 4, wherein the first electronic device comprises a control unit that, when presenting to the first user at least one of the video and audio of the second user acquired from the second electronic device, performs an operation suggesting an extension of the response timing to the first user if the remaining time until the end of the response timing is at or below a predetermined value.
  16.  The system according to claim 7, wherein the first electronic device comprises a control unit that, when the power unit drives at least a part of the housing of the first electronic device, performs an operation suggesting an extension of the response timing to the first user if the remaining time until the end of the response timing is at or below a predetermined value.
  17.  An electronic device configured to be capable of communicating with another electronic device, the electronic device comprising:
     an acquisition unit that acquires at least one of video and audio of a user of the other electronic device;
     an output unit that outputs at least one of the video and audio of the user of the other electronic device to a user of the electronic device who responds to an utterance of the user of the other electronic device;
     an estimation unit that estimates a response timing of the user of the electronic device responding to the utterance of the user of the other electronic device, based on at least one of the video and audio of the user of the other electronic device; and
     a presentation unit that presents information indicating the response timing estimated by the estimation unit.
  18.  A method for controlling a system, comprising:
     acquiring, by a first electronic device, at least one of video and audio of a first user;
     outputting, by a second electronic device configured to be capable of communicating with the first electronic device, at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user;
     estimating a response timing of the second user responding to the utterance of the first user, based on at least one of the video and audio of the first user; and
     causing the second electronic device to acquire information indicating the response timing.
  19.  A program for causing a computer to execute:
     acquiring, by a first electronic device, at least one of video and audio of a first user;
     outputting, by a second electronic device configured to be capable of communicating with the first electronic device, at least one of the video and audio of the first user acquired by the first electronic device to a second user who responds to an utterance of the first user;
     estimating a response timing of the second user responding to the utterance of the first user, based on at least one of the video and audio of the first user; and
     causing the second electronic device to acquire information indicating the response timing.
PCT/JP2023/032576 2022-09-29 2023-09-06 System, electronic device, system control method, and program WO2024070550A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022156837 2022-09-29
JP2022-156837 2022-09-29

Publications (1)

Publication Number Publication Date
WO2024070550A1 true WO2024070550A1 (en) 2024-04-04

Family

ID=90477430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/032576 WO2024070550A1 (en) 2022-09-29 2023-09-06 System, electronic device, system control method, and program

Country Status (1)

Country Link
WO (1) WO2024070550A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006243980A (en) * 2005-03-01 2006-09-14 Fuji Xerox Co Ltd Information processing system, information processing method, and computer program
JP2006304009A (en) * 2005-04-21 2006-11-02 Fuji Xerox Co Ltd Electronic conference system
JP2012146072A (en) * 2011-01-11 2012-08-02 Nippon Telegr & Teleph Corp <Ntt> Next speaker guidance system, next speaker guidance method and next speaker guidance program
JP2017118364A (en) * 2015-12-24 2017-06-29 日本電信電話株式会社 Communication system, communication device, and communication program
JP2021140240A (en) * 2020-03-02 2021-09-16 コニカミノルタ株式会社 Interaction support system, interaction support method, and interaction support program
JP2022113138A (en) * 2021-01-22 2022-08-03 富士フイルムビジネスイノベーション株式会社 Information processing device and program

Similar Documents

Publication Publication Date Title
US9253303B2 (en) Signal processing apparatus and storage medium
US11032675B2 (en) Electronic accessory incorporating dynamic user-controlled audio muting capabilities, related methods and communications terminal
US10567314B1 (en) Programmable intelligent agents for human-chatbot communication
EP2856742A1 (en) System and methods for managing concurrent audio messages
WO2020026850A1 (en) Information processing device, information processing method, and program
CN110035250A (en) Audio-frequency processing method, processing equipment, terminal and computer readable storage medium
CN106982286B (en) Recording method, recording equipment and computer readable storage medium
KR102447381B1 (en) Method for providing intelligent agent service while calling and electronic device thereof
EP2698787A2 (en) Method for providing voice call using text data and electronic device thereof
US20210090548A1 (en) Translation system
KR101609585B1 (en) Mobile terminal for hearing impaired person
CN111108491B (en) Conference system
EP3968619A1 (en) Three-party call terminal for use in mobile man-machine collaborative calling robot
KR20230133864A (en) Systems and methods for handling speech audio stream interruptions
WO2021244135A1 (en) Translation method and apparatus, and headset
WO2024070550A1 (en) System, electronic device, system control method, and program
CN105229997A (en) Communication terminal and communication means
JP2015011651A (en) Information processing device, information processing method, and program
WO2024154626A1 (en) Electronic apparatus and program
WO2024075707A1 (en) System, electronic device, method for controlling system, and program
CN105306656B (en) Call message leaving method, apparatus and system
JP2006139138A (en) Information terminal and base station
WO2023286680A1 (en) Electronic device, program, and system
JP2015115926A (en) Portable terminal device, lip-reading communication method, and program
US20180123731A1 (en) Reduced Latency for Initial Connection to Local Wireless Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23871790

Country of ref document: EP

Kind code of ref document: A1