WO2020013061A1

WO2020013061A1 - Information processing device and information processing method

Info

Publication number: WO2020013061A1
Application number: PCT/JP2019/026542
Authority: WO
Inventors: 泰成橋本
Original assignee: ソニー株式会社
Priority date: 2018-07-11
Filing date: 2019-07-03
Publication date: 2020-01-16

Abstract

The present invention facilitates communication between people. A cause estimation unit performs cause estimation when it is determined that a person being called is unaware of being called. For example, the cause estimation unit estimates one of a set of predetermined types of causes as the reason the person being called is unaware of being called. In this case, the predetermined types of causes include, for example, all or some of being in a conversation, being absent, having a hearing disability, concentrating, sleeping, and intentionally not responding. An output control unit performs control, on the basis of the result of the cause estimation, such that the output for making the person being called aware of being called is changed. The person being called can thereby be effectively made aware of being called.

Description

Information processing apparatus and information processing method

The present technology relates to an information processing device and an information processing method, and more particularly, to an information processing device and an information processing method for facilitating communication between humans.

For example, Patent Literature 1 proposes a technique in which, when a message from another user has been registered, the message is presented when the owner of a tablet terminal approaches the terminal.

Patent Literature 1: JP 2014-186610 A

The technology described in Patent Literature 1 presents a message from another user, but does not attempt to facilitate direct communication between humans.

The purpose of this technology is to facilitate communication between humans.

The concept of this technology is an information processing apparatus comprising:
a cause estimating unit that estimates a cause when it is determined that the callee is unaware of the call; and
an output control unit that performs control, based on a result of the cause estimation, so as to change an output for making the callee aware of the call.

In the present technology, the cause estimating unit estimates the cause when it is determined that the callee has not noticed the call. For example, the cause estimating unit may be configured to estimate one of a set of predetermined types of causes as the reason the callee does not notice the call. In this case, for example, the predetermined types of causes may include all or some of conversation, absence, hearing loss, concentration, sleep, and intentionally not responding. Further, for example, the cause estimating unit may perform the cause estimation based on a multimodal input.

The output control unit performs control, based on the result of the cause estimation, so as to change the output for making the callee aware of the call. For example, the output control unit may change the output by means of a multimodal output.

As described above, in the present technology, when it is determined that the callee has not noticed the call, control is performed to change the output for making the callee aware of the call based on the result of the cause estimation. Therefore, the callee can be effectively made aware of the call, and communication between humans can be facilitated.

In the present technology, for example, the output control unit may perform control so that direction information indicating the direction in which the call is made is included in the output. Thus, the callee can easily recognize from which direction the call has been made, and can appropriately respond to the call.

Also, in the present technology, for example, the output control unit may be configured to control so as not to output when the call is not based on live voice. This makes it possible to avoid, for example, erroneously responding to a call from a television receiver.

Further, in the present technology, for example, when the result of the cause estimation is that the callee is absent, the output control unit may perform control so that, after the callee returns, an output notifying the callee that the call was made is produced. As a result, the callee who has been absent can learn, after returning, that the call was made. In this case, for example, the output control unit may include, in the output notifying that the call was made, time information indicating when the call was made. As a result, the callee can easily recognize when the call was made and can respond to it appropriately.

According to the present technology, communication between humans can be facilitated. Note that the effects described here are not necessarily limited, and may be any of the effects described in the present disclosure.

FIG. 1 is a diagram showing a state in which a voice agent according to a first embodiment is arranged in a living room.
FIG. 2 is a flowchart showing an example of a processing procedure when the voice agent produces an output to make a callee aware of a call.
FIG. 3 is a diagram showing, with images, examples of causes for which a callee does not notice a call.
FIG. 4 is a diagram schematically showing an example of a situation in which the cause is "(a) conversation".
FIG. 5 is a diagram schematically showing another example of a situation in which the cause is "(a) conversation".
FIG. 6 is a diagram schematically showing an example of a situation in which the cause is "(b) absence".
FIG. 7 is a diagram schematically showing another example of a situation in which the cause is "(b) absence".
FIG. 8 is a diagram schematically showing an example of a situation in which the cause is "(b) absence" and the callee has returned.
FIG. 9 is a diagram schematically showing an example of a situation in which the cause is "(c) hearing loss".
FIG. 10 is a diagram schematically showing an example of a situation in which the cause is "(d) noise".
FIG. 11 is a diagram schematically showing an example of a situation in which the cause is "(e) concentration".
FIG. 12 is a diagram schematically showing an example of a situation in which the cause is "(f) sleep".
FIG. 13 is a diagram schematically showing an example of a situation in which the cause is "(g) intentionally no response".
FIG. 14 is a diagram schematically showing an example of a situation in which a call comes from a television receiver.
FIG. 15 is a block diagram showing a configuration example of the voice agent.
FIG. 16 is a flowchart showing an example of a processing procedure of the processing main unit of the voice agent when a call is made.
FIG. 17 is a flowchart showing an example of a processing procedure of the processing main unit of the voice agent when a callee who was absent at the time of the call returns.
FIG. 18 is a block diagram showing another configuration example of the voice agent.
FIG. 19 is a diagram showing an example of Web APIs used by the voice agent.
FIG. 20 is a diagram showing an example of a return value of the "face detection / recognition" Web API.
FIG. 21 is a diagram showing a configuration example of a videophone system according to a second embodiment.
FIG. 22 is a block diagram showing a configuration example of a voice agent of a videophone device.
FIG. 23 is a block diagram showing another configuration example of the voice agent of the videophone device.
FIG. 24 is a diagram showing a configuration example of a videophone system according to a third embodiment.
FIG. 25 is a block diagram showing a configuration example of an agent cloud service.
FIG. 26 is a block diagram showing a configuration example of computer hardware.

Hereinafter, embodiments for carrying out the invention (hereinafter, referred to as “embodiments”) will be described. The description will be made in the following order.
1. First embodiment
2. Second embodiment
3. Third embodiment
4. Modified example

<1. First Embodiment>
[Voice Agent]
FIG. 1 shows a state in which a voice agent 10 according to the first embodiment is arranged in a room, for example, a living room 20. The voice agent 10 constitutes an information processing device. Although not described in detail, the voice agent 10 has the functions of a conventionally known voice agent. Further, when it is determined that the callee has not noticed a call, the voice agent 10 produces an output to make the callee aware of the call. In doing so, the voice agent 10 estimates the cause and changes the output based on the result of the cause estimation. The voice agent 10 can also estimate the sound source direction and include in the output information indicating the direction from which the call is coming.

The flowchart in FIG. 2 shows an example of a processing procedure when the voice agent 10 produces an output to make the callee aware of the call. The voice agent 10 starts processing in step ST1. Next, in step ST2, when it is determined that the callee has not noticed the call, the voice agent 10 estimates the cause.

Next, in step ST3, the voice agent 10 produces an output to make the callee aware of the call, based on the result of the cause estimation. Then, the voice agent 10 ends the series of processing in step ST4.

The voice agent 10 performs cause estimation based on multimodal input. The multimodal input includes, for example, input from a camera, a microphone, and various sensors such as an infrared sensor and a human presence sensor. When estimating the cause, the voice agent 10 estimates, for example, one of a set of predetermined types of causes as the reason the callee does not notice the call. The voice agent 10 performs the cause estimation using, for example, a machine-learned classifier. When estimating the cause, the voice agent 10 can refer as appropriate not only to the multimodal input but also to profile information such as the age, gender, and illnesses of a registrant.

In this embodiment, for example, "(a) conversation", "(b) absent", "(c) hearing loss", "(d) noise", "(e) concentration", "(f) sleep (nap)", and "(g) intentionally no response" are set as the predetermined types of causes. Note that the predetermined types of causes set in advance may include only some of these rather than all of them, or may include other causes. FIG. 3 shows examples of images of the causes (a) to (g).

The voice agent 10 produces, for example, a multimodal output to make the callee aware of the call. The multimodal output uses speakers, monitors, projectors, LEDs, lighting, wearable devices, robots, and the like. In this case, when the voice agent 10 knows the direction from which the caller is calling, it can include the direction information in the output. Likewise, when the voice agent 10 can tell who the caller is from pre-registered information, it can include information on the caller in the output.

When the cause is "(a) conversation", the voice agent 10 executes, for example, all or part of the following (1) to (6).
(1) Blink the lights in the room
(2) Notify by voice (notify immediately, or at a break in the conversation)
(3) Display on the projector / monitor
(4) Blink the LED
(5) Send a notification to the wearable device
(6) Have the robot convey that there is a notification

FIG. 4 schematically shows an example of a situation in which the cause is "(a) conversation". This example shows a case in which the caller, Dad, calls "A-kun" while the child A and his mother are having a conversation, but A does not notice the call. In this example, the voice agent 10 notifies A that Dad is calling by saying "A-kun, Dad is calling" by voice from the speaker and by an image from the projector or monitor.

In the illustrated example, direction information indicating the direction from which Dad is calling is not included. When direction information is included and Dad is calling from the direction of the entrance, for example, the notification is "A-kun, Dad is calling from the entrance."

FIG. 5 schematically shows another example of a situation in which the cause is "(a) conversation". This example shows a case in which, while the child A and his mother are having a conversation, a caller who is an unregistered person calls "A-kun", but A does not notice the call. In this example, the voice agent 10 notifies A that someone is calling by saying "A-kun, someone is calling." by voice from the speaker and by an image from the projector or monitor.
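As an illustration only, the way the notification text in FIG. 4 and FIG. 5 varies with the caller and the direction information can be sketched as follows. The function name `build_call_message` and its parameters are assumptions introduced for this sketch and do not appear in the present disclosure.

```python
from typing import Optional


def build_call_message(callee: str,
                       caller: Optional[str] = None,
                       direction: Optional[str] = None) -> str:
    """Compose the text presented to the callee (hypothetical helper)."""
    # An unregistered caller is referred to as "someone", as in FIG. 5.
    who = caller if caller is not None else "someone"
    # Direction information is added only when the sound source
    # direction is known.
    where = f" from the {direction}" if direction else ""
    return f"{callee}, {who} is calling{where}."


print(build_call_message("A-kun", "Dad"))
print(build_call_message("A-kun", "Dad", "entrance"))
print(build_call_message("A-kun"))
```

The same text could be passed both to speech synthesis and to the projector or monitor display, so that the voice and image outputs stay consistent.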

When the cause is "(b) absent", the voice agent 10 executes, for example, all or part of the following (1) to (3).
(1) Send a notification to the wearable device
(2) Record information such as the call time, the caller, and the callee
(3) Notify the caller that the callee is absent

Then, after the callee returns, the voice agent 10 executes all or part of the following (1) to (5). In doing so, the recorded information described above is referred to as appropriate.
(1) Blink the lights in the room
(2) Notify by voice
(3) Display on the projector / monitor
(4) Blink the LED
(5) Have the robot convey that there was a notification

Note that a fixed period (for example, 30 minutes) may be set as the retention period of the call event information recorded by executing "(2) Record information such as the call time, the caller, and the callee" above. In this case, once the fixed period has elapsed, the event information is deleted, so that even if the callee returns, none of the above (1) to (5) is executed.
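As a minimal sketch of the retention period described above, the call event record could be kept in a small in-memory log that discards entries older than the retention period. The class and field names (`CallEvent`, `CallEventLog`, `pop_for`) are assumptions for illustration, and times are plain seconds so that the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import List

RETENTION_SECONDS = 30 * 60  # example retention period from the text


@dataclass
class CallEvent:
    call_time: float  # seconds, for example from time.time()
    caller: str
    callee: str


class CallEventLog:
    """Hypothetical store for calls made while the callee was absent."""

    def __init__(self, retention_seconds: float = RETENTION_SECONDS) -> None:
        self.retention_seconds = retention_seconds
        self._events: List[CallEvent] = []

    def record(self, event: CallEvent) -> None:
        self._events.append(event)

    def purge_expired(self, now: float) -> None:
        # Delete event information whose retention period has elapsed.
        self._events = [e for e in self._events
                        if now - e.call_time <= self.retention_seconds]

    def pop_for(self, callee: str, now: float) -> List[CallEvent]:
        """Return and remove the unexpired events for a returned callee."""
        self.purge_expired(now)
        hits = [e for e in self._events if e.callee == callee]
        self._events = [e for e in self._events if e.callee != callee]
        return hits
```

With this structure, a callee who returns after the retention period has elapsed yields an empty list, so no notification is produced.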

FIG. 6 schematically shows an example of a situation where the cause is “(b) absent”. This example shows a case where the caller, Dad, calls "A-kun" while A is not present. In this example, the voice agent 10 notifies the caller Dad that "A-kun is not present."

FIG. 7 schematically shows another example of the situation when the cause is “(b) absent”. This example shows a case where the caller, Dad, calls "A-kun" in a state where the mother is present but A is absent. Also in this example, the voice agent 10 notifies the caller Dad that "A-kun is not present."

FIG. 8 schematically shows an example of a situation in which the cause is "(b) absent" and the callee A has returned. In this example, the voice agent 10 notifies A that there was a call from Dad by saying "A-kun, Dad called 10 minutes ago." by voice from the speaker and by an image from the projector or monitor. It is also conceivable to give the time of the call itself (XX hours XX minutes) in place of "10 minutes ago".
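For illustration, the two ways of expressing the time information mentioned above (elapsed time or the clock time of the call) could be sketched as follows. The function name and the `use_clock_time` flag are assumptions made for this sketch.

```python
from datetime import datetime


def build_missed_call_message(callee: str, caller: str,
                              called_at: datetime, now: datetime,
                              use_clock_time: bool = False) -> str:
    """Compose the text shown to a callee who has returned (hypothetical)."""
    if use_clock_time:
        # State the time of the call itself.
        when = f"at {called_at:%H:%M}"
    else:
        # State how long ago the call was made, in whole minutes.
        minutes = int((now - called_at).total_seconds() // 60)
        when = f"{minutes} minutes ago"
    return f"{callee}, {caller} called {when}."


called = datetime(2019, 7, 3, 15, 20)
now = datetime(2019, 7, 3, 15, 30)
print(build_missed_call_message("A-kun", "Dad", called, now))
print(build_missed_call_message("A-kun", "Dad", called, now, use_clock_time=True))
```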

When the cause is "(c) hearing loss", the voice agent 10 executes, for example, all or part of the following (1) to (7).
(1) Blink the lights in the room
(2) Display on the projector / monitor
(3) Blink the LED
(4) Send a notification to the wearable device
(5) Have the robot convey that there is a notification
(6) Convey the call with sound at a frequency that is easy to hear
(7) Convey the call with loud sound

FIG. 9 schematically shows an example of a situation in which the cause is "(c) hearing loss". This example shows a case in which a grandfather who is hard of hearing is present and the caller A calls "Grandfather", but the grandfather does not notice the call. In this example, the voice agent 10 notifies the grandfather that A is calling by saying "Grandfather, A-kun is calling." by voice from the speaker and by an image from the projector or monitor. Note that, in this case, the voice is output at a frequency that is easy to hear or at a high volume.

When the cause is "(d) noise", the voice agent 10 executes, for example, all or part of the following (1) to (6).
(1) Blink the lights in the room
(2) Display on the projector / monitor
(3) Notify by voice (notify immediately, or at a moment when the noise stops)
(4) Blink the LED
(5) Send a notification to the wearable device
(6) Have the robot convey that there is a notification

FIG. 10 schematically shows an example of a situation in which the cause is "(d) noise". This example shows a case in which the caller, Dad, calls "A-kun" while the child A is in a noisy environment, but A does not notice the call. In this example, the voice agent 10 notifies A that Dad is calling by saying "A-kun, Dad is calling" by voice from the speaker and by an image from the projector or monitor.

When the cause is "(e) concentration", the voice agent 10 executes, for example, all or part of the following (1) to (6).
(1) Blink the lights in the room
(2) Display on the projector / monitor
(3) Notify by voice (notify immediately, or at a moment when concentration lapses)
(4) Blink the LED
(5) Send a notification to the wearable device
(6) Have the robot convey that there is a notification

FIG. 11 schematically shows an example of a situation in which the cause is "(e) concentration". This example shows a case in which the caller, Dad, calls "A-kun" while the child A is concentrating on studying, but A does not notice the call. In this example, the voice agent 10 notifies A that Dad is calling by saying "A-kun, Dad is calling" by voice from the speaker and by an image from the projector or monitor.

When the cause is "(f) sleep (nap)", the voice agent 10 executes, for example, all or part of the following (1) to (7).
(1) Blink the lights in the room
(2) Display on the projector / monitor
(3) Blink the LED
(4) Send a notification to the wearable device
(5) Have the robot convey that there is a notification
(6) Tell A (the caller) by voice that "B (the callee) is asleep"
(7) Send a notification that "B (the callee) is asleep" to the wearable device of A (the caller)

FIG. 12 schematically shows an example of a situation in which the cause is "(f) sleep". This example shows a case in which the caller, Dad, calls "A-kun" while the child A is taking a nap, and A does not notice the call. In this example, the voice agent 10 notifies the caller, Dad, that A is taking a nap by saying "Dad, A-kun is asleep." by voice from the speaker.

When the cause is "(g) intentionally no response", the voice agent 10 executes, for example, the following (1).
(1) Stop this function for a certain period of time
The certain period of time is, for example, 10 minutes, and can be set arbitrarily by the user (administrator) of the voice agent 10. Here, "this function" means the function of notifying the callee that a call has been made.

FIG. 13 schematically shows an example of a situation in which the cause is "(g) intentionally no response". This example shows a case in which the caller, Dad, called "A-kun", but A intentionally did not react. In this example, the voice agent 10 neither notifies A of the call nor notifies the caller, Dad.
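The correspondence between the estimated causes (a) to (g) and the example outputs listed above can be summarized, purely as an illustrative sketch, by a lookup table. The enumeration and the action identifiers below are names assumed for this sketch; the disclosure itself only lists the actions in prose, and the agent may execute all or only part of each list.

```python
from enum import Enum
from typing import List


class Cause(Enum):
    CONVERSATION = "a"
    ABSENT = "b"
    HEARING_LOSS = "c"
    NOISE = "d"
    CONCENTRATION = "e"
    SLEEP = "f"
    INTENTIONAL_NO_RESPONSE = "g"


# Outputs shared by most causes (hypothetical identifiers).
COMMON = ["blink_room_lights", "show_on_projector_or_monitor",
          "blink_led", "notify_wearable", "notify_via_robot"]

ACTIONS = {
    Cause.CONVERSATION: COMMON + ["notify_by_voice"],
    Cause.ABSENT: ["notify_wearable", "record_call_event",
                   "tell_caller_callee_is_absent"],
    Cause.HEARING_LOSS: COMMON + ["speak_at_easily_heard_frequency",
                                  "speak_loudly"],
    Cause.NOISE: COMMON + ["notify_by_voice"],
    Cause.CONCENTRATION: COMMON + ["notify_by_voice"],
    Cause.SLEEP: COMMON + ["tell_caller_by_voice",
                           "tell_caller_via_wearable"],
    Cause.INTENTIONAL_NO_RESPONSE: ["suspend_notification_function"],
}


def actions_for(cause: Cause) -> List[str]:
    """Return a copy of the candidate actions for an estimated cause."""
    return list(ACTIONS[cause])
```

Keeping the table separate from the estimation logic makes it straightforward to add other causes or to restrict the outputs to those actually installed.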

The voice agent 10 does not notify the callee that the call has been made when the call is not based on live voice. This can avoid, for example, erroneously responding to a call from a television receiver. In this case, it is conceivable that the voice agent 10 discriminates between the live voice and the voice from the television receiver by using the frequency characteristics, but the identification method is not limited to this.

FIG. 14 shows a case in which a voice calling "A-kun" happens to come from the television receiver while A is absorbed in building a car out of blocks. In this example, the voice agent 10 identifies that the voice of the call is not a live voice, and does not notify A that a call has been made.

The above description states that the voice agent 10 does not respond to a call of "A-kun" coming from the television receiver. However, the television receiver may be used, for example, as a terminal for a videophone. In that case, if the callee is not aware of a call from the other party, it is beneficial for the voice agent 10 to notify the callee that he or she has been called.

Therefore, it is conceivable that the voice agent 10 determines whether to notify the callee of the call based on, for example, the direction of the call and the caller. For example, in the case of a call from the direction of the television receiver, the callee is basically not notified of the call; only when the caller is a registrant is the callee notified that he or she has been called.
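The decision rule suggested in the preceding paragraph can be sketched as follows. This is only one possible reading, and the function name and its two boolean parameters are assumptions for illustration.

```python
def should_notify(from_tv_direction: bool, caller_is_registrant: bool) -> bool:
    """Decide whether to notify the callee of a call (hypothetical rule)."""
    if from_tv_direction:
        # A call from the television direction is notified only when the
        # caller is a registrant, for example a videophone party.
        return caller_is_registrant
    # Calls from other directions are notified as usual.
    return True
```

For example, a character on a television program calling "A-kun" gives `should_notify(True, False)`, which is `False`, while a registered family member calling over a videophone shown on the same television gives `should_notify(True, True)`, which is `True`.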

"Voice Agent Configuration"
FIG. 15 shows a configuration example of the voice agent 10. The voice agent 10 has a camera 101 and a microphone 102 as input interfaces. Here, the microphone 102 has, for example, an array configuration so that the sound source direction can be estimated. The voice agent 10 has a speaker 103, a projector 104, a monitor 105, and an LED 106 as output interfaces.

The voice agent 10 also has a processing main unit 107. The processing main unit 107 includes a face detection unit 111, a face identification unit 112, a voice recognition unit 113, a natural language processing unit 114, an awareness determination unit 115, a cause estimation unit 116, a sound source direction estimation unit 117, a speaker estimation unit 118, a live voice discrimination unit 119, an output control unit 120, a speech synthesis unit 121, and a network interface 122.

The face detection unit 111 performs a face recognition process on the image signal from the camera 101 to detect the faces present in the image, which is the field of view of the voice agent 10. The face identification unit 112 identifies each face detected by the face detection unit 111 by comparing it with the faces of registrants registered in advance.

The voice recognition unit 113 performs voice recognition processing on the voice signal from the microphone 102, and converts the voice signal into text. The natural language processing unit 114 analyzes the text obtained by the speech recognition unit 113 to obtain information such as words, parts of speech, and dependencies.

The sound source direction estimating unit 117 estimates the sound source direction based on a plurality of audio signals from the microphone (microphone array) 102, for example, by detecting a time difference between the audio signals. The speaker estimating unit 118 estimates a speaker based on a voice signal from the microphone 102 by comparing with a voice characteristic of a registrant registered in advance. Based on the audio signal from the microphone 102, the live voice determination unit 119 determines whether the voice is a live voice or a voice from a television receiver based on, for example, frequency characteristics.
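As a rough sketch of direction estimation from a time difference between audio signals, the following assumes a simplified setting that the disclosure does not specify: two microphones, a far-field source, and a delay found by maximizing the cross-correlation. All names, the microphone spacing, and the sampling rate are assumptions for illustration.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def estimate_delay_samples(a: Sequence[float], b: Sequence[float],
                           max_lag: int) -> int:
    """Return the lag in samples by which b is delayed relative to a,
    chosen as the lag that maximizes the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(delay_samples: int, sample_rate: float,
                   mic_spacing: float) -> float:
    """Convert a delay into an arrival angle in degrees (0 = broadside),
    using the two-microphone far-field relation sin(theta) = c * tau / d."""
    s = SPEED_OF_SOUND * (delay_samples / sample_rate) / mic_spacing
    s = max(-1.0, min(1.0, s))  # clamp against rounding and noise
    return math.degrees(math.asin(s))


# Second channel is the first channel delayed by two samples.
a = [0, 0, 1, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 1, 0, 0, 0]
lag = estimate_delay_samples(a, b, max_lag=3)
print(lag, delay_to_angle(lag, sample_rate=16000.0, mic_spacing=0.1))
```

A practical array would use more microphones and a more robust correlation measure; the sketch is intended only to show how a time difference maps to a direction.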

The awareness determination unit 115 determines whether the callee has noticed the call, based on the image signal from the camera 101, the identification result of the face identification unit 112, the processing result of the natural language processing unit 114, the estimation result of the sound source direction estimation unit 117, and the like.

For example, the awareness determination unit 115 makes the determination based on whether the direction of the callee's face matches the sound source direction. Alternatively, for example, the awareness determination unit 115 makes the determination by inputting an image of the callee into an awareness classifier obtained by supervised learning, using as training data the reactions that a person normally shows when called (replying, looking up, and so on).

The cause estimating unit 116 estimates the cause for which the callee does not notice the call, based on the image signal from the camera 101, the audio signal from the microphone 102, the identification result of the face identification unit 112, the processing result of the natural language processing unit 114, and the registrant's profile information.

For example, if the callee's face is not in the image, absence is presumed to be the cause. Further, for example, if it is known from the registrant's profile information that the callee is hard of hearing, hearing loss is presumed to be the cause. Also, for example, if the volume of the environmental sound exceeds a certain level, noise is presumed to be the cause. Further, for example, a scene classifier for "conversation", "sleep", and "concentration" created by deep learning is applied, and its classification result is taken as the cause. In addition, for example, if the score of a classifier, created by deep learning, for a gesture meaning "I want to be quiet" exceeds a certain level, intentionally not responding is presumed to be the cause.
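The example rules in the preceding paragraph can be arranged, as a sketch only, into a single function. The order in which the rules are checked, the threshold values, and all names below are assumptions; the classifier outputs are represented simply as already computed values.

```python
from dataclasses import dataclass
from typing import Optional

NOISE_THRESHOLD_DB = 70.0  # hypothetical "certain level" for ambient sound
GESTURE_THRESHOLD = 0.8    # hypothetical "certain level" for the gesture score


@dataclass
class Observation:
    callee_face_in_image: bool     # from face detection / identification
    callee_has_hearing_loss: bool  # from the registrant's profile
    ambient_volume_db: float       # from the microphone
    quiet_gesture_score: float     # gesture classifier score, 0.0 to 1.0
    scene_label: Optional[str]     # "conversation", "sleep", "concentration"


def estimate_cause(obs: Observation) -> Optional[str]:
    """Return one predetermined cause, or None if nothing applies."""
    if not obs.callee_face_in_image:
        return "absent"
    if obs.callee_has_hearing_loss:
        return "hearing loss"
    if obs.ambient_volume_db > NOISE_THRESHOLD_DB:
        return "noise"
    if obs.quiet_gesture_score > GESTURE_THRESHOLD:
        return "intentionally no response"
    # Otherwise fall back on the scene classifier's result.
    return obs.scene_label
```

As noted earlier in the description, a machine-learned classifier may be used instead of, or together with, fixed rules such as these.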

The output control unit 120 controls the output for making the callee aware of the call, based on the determination result of the awareness determination unit 115, the estimation result of the cause estimation unit 116, the processing result of the natural language processing unit 114, the estimation result of the sound source direction estimation unit 117, the estimation result of the speaker estimation unit 118, and the discrimination result of the live voice discrimination unit 119. Specifically, the output control unit 120 generates text data for audio output, generates image data for image display, and generates control signals for controlling each output interface.

The speech synthesis unit 121 converts text data indicating a character string into speech data (speech signal). The network interface 122 is an interface for connecting the output control unit 120 to the illumination 131, the wearable device 132, and the robot 133 as output interfaces via a LAN.

The flowchart in FIG. 16 shows an example of the processing procedure of the processing main unit 107 of the voice agent 10 when a call is made. The processing main unit 107 starts processing in step ST11. Next, in step ST12, the processing main unit 107 performs voice recognition and analysis (natural language processing). Then, in step ST13, the processing main unit 107 determines whether the name of the registrant has been called based on the analysis result.

When the name of the registrant is called, in step ST14, the processing main unit 107 estimates the sound source direction, that is, the direction in which the call was made. For example, when the voice agent 10 is arranged in the living room, the direction is the entrance direction, the kitchen direction, the second floor direction, the window direction, and the like. Next, in step ST15, the processing main unit 107 determines whether the voice calling the registrant's name, that is, the calling voice is a live voice.

Next, in step ST16, the processing main unit 107 estimates the speaker, that is, the caller. In this case, if the caller is a registrant, the caller can be specifically identified. Next, in step ST17, the processing main unit 107 performs face detection and identification to determine whether there is a person in the field of view of the voice agent 10 and, when there is, to recognize who the person is.

Next, in step ST18, the processing main unit 107 determines whether the person who was called, that is, the callee, has noticed the call. When the call has not been noticed, the processing main unit 107 estimates the cause of not noticing in step ST19. Next, in step ST20, the processing main unit 107 determines what action should be taken to make the callee aware, in accordance with the cause of not noticing.

Next, in step ST21, the processing main unit 107 controls the output for making the callee aware of the call based on the determined action. After the processing in step ST21, the processing main unit 107 ends the series of processing in step ST22. When the callee has noticed the call in step ST18, the processing main unit 107 immediately ends the series of processing in step ST22.
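The branch structure of steps ST13 and ST18 to ST22 can be sketched as follows. The individual processing steps are passed in as callables, all of whose names are assumptions for this sketch; steps ST14 to ST17 (direction, live voice, speaker, face) are omitted for brevity, and ending the procedure when no registrant's name was called is an assumption about the branch not described in the text.

```python
from typing import Callable, Optional


def run_call_procedure(
    called_name: Optional[str],                   # result of ST12 and ST13
    noticed: Callable[[str], bool],               # ST18
    estimate_cause: Callable[[str], str],         # ST19
    decide_action: Callable[[str], str],          # ST20
    control_output: Callable[[str, str], None],   # ST21
) -> Optional[str]:
    """Return the action taken, or None when no output was needed."""
    # ST13: nothing to do unless a registrant's name was called
    # (assumed behavior for the "no" branch).
    if called_name is None:
        return None
    # ST18: if the callee has noticed, end immediately (ST22).
    if noticed(called_name):
        return None
    cause = estimate_cause(called_name)   # ST19
    action = decide_action(cause)         # ST20
    control_output(called_name, action)   # ST21
    return action                         # ST22
```

Passing the steps as callables keeps the control flow independent of whether each step runs locally or elsewhere.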

FIG. 17 shows an example of the processing procedure of the processing main unit 107 of the voice agent 10 when the called party who was absent when the calling was made returns. The processing main unit 107 starts processing in step ST31. Next, in step ST32, the processing main unit 107 performs face detection and identification to determine whether or not there is a person in the field of view of the voice agent 10, and to recognize who the person is when there is a person.

Next, in step ST33, the processing main unit 107 determines whether a person who had been absent is now present, that is, a person who was called in a case where the cause of not noticing the call was absence. When there is such a person, in step ST34, the processing main unit 107 controls, based on the absence record, the output for notifying the callee that he or she was called. After the processing in step ST34, the processing main unit 107 ends the series of processing in step ST35.

FIG. 18 shows another configuration example of the voice agent 10. In FIG. 18, portions corresponding to those in FIG. 15 are denoted by the same reference numerals, and detailed description thereof is omitted as appropriate. The processing main unit 107 of the voice agent 10 illustrated in FIG. 18 includes only a cause estimating unit 116, an output control unit 120, and network interfaces 122 and 123. The network interface 123 is an interface for connecting to Web APIs existing on the cloud 150 via a WAN. The processing main unit 107 of the voice agent 10 illustrated in FIG. 18 executes many of the processes performed in the processing main unit 107 of the voice agent 10 illustrated in FIG. 15 by using the Web APIs existing on the cloud 150.

FIG. 19 shows an example of a Web API used by the voice agent 10 shown in FIG. The Web API of “face detection / recognition” receives a moving image file and authentication information as parameters, and uses a registrant ID, speaker coordinates (x, y), and accuracy as return values. This return value is in, for example, a JSON format. Similarly, return values of the following other Web APIs are, for example, JSON format.

Here, the moving image file is a moving image file recorded by the camera 101 of the voice agent 10. The authentication information is authentication information for using the Web API. The registrant ID is an ID unique to each registrant, for example, an ID indicating each family member. The speaker coordinates (x, y) are the in-screen coordinates of the position at which the speaker's face appears. The accuracy indicates the probability that the recognized face belongs to the registrant indicated by the registrant ID, for example, the degree of confidence that a face recognized as A really is A.

FIG. 20 shows an example of the return value of the Web API of “face detection / recognition”. Here, the return value when three faces are identified in the image is shown. “Id_detected” is the ID (number indicating the order) assigned to the detected face. “Id_recognized” is a registrant ID.
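Since the content of FIG. 20 is not reproduced here, the following is only a sketch of how a JSON return value of this kind might be handled. The keys "id_detected" and "id_recognized" come from the description above; the top-level list structure, the keys "x", "y", and "accuracy", and all values in the sample are hypothetical.

```python
import json
from typing import List

# Hypothetical sample with three identified faces (values are invented
# for illustration and are not taken from FIG. 20).
SAMPLE_RETURN_VALUE = """
[
  {"id_detected": 1, "id_recognized": "father",  "x": 120, "y": 80,  "accuracy": 0.93},
  {"id_detected": 2, "id_recognized": "mother",  "x": 310, "y": 95,  "accuracy": 0.88},
  {"id_detected": 3, "id_recognized": "child_a", "x": 480, "y": 140, "accuracy": 0.41}
]
"""


def recognized_registrants(payload: str, min_accuracy: float = 0.5) -> List[str]:
    """Return the registrant IDs whose accuracy is at least min_accuracy."""
    faces = json.loads(payload)
    return [face["id_recognized"] for face in faces
            if face["accuracy"] >= min_accuracy]


print(recognized_registrants(SAMPLE_RETURN_VALUE))
```

Filtering on the accuracy value is one way to avoid treating an uncertain identification as the presence of a registrant.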

The Web API for "notice determination" receives a registrant ID, a moving image file, and authentication information as parameters, and returns a boolean value indicating whether the person is in the room and a boolean value indicating whether the person has noticed. Here, the registrant ID is the registrant ID of the person for whom it is to be checked whether he or she has noticed. The moving image file is a moving image file recorded by the camera 101 of the voice agent 10. The authentication information is authentication information for using the Web API.

The true / false value of whether or not the user is in the room indicates whether or not the person specified by the registrant ID is in the moving image, and is “True” if the person is in the moving image and “False” if not. The true / false value of whether or not noticed indicates whether or not the person designated by the registrant ID has noticed, and is “True” if noticed, and “False” if not noticed.

The "voice recognition" Web API receives a voice file, a language type, and authentication information as parameters, and returns a text as a return value. Here, the voice file is a voice file recorded by the microphone 102 of the voice agent 10. The language type is the language type of the recorded voice. The authentication information is authentication information for using the Web API. The text is text transcribed from the audio file.

The Web API of "natural language processing" receives text, a language type, and authentication information as parameters, and returns words, parts of speech, and dependencies as return values. The Web API of “sound source direction estimation” receives a voice file and authentication information as parameters, and returns the sound source direction θ and the sound source distance r as return values. The Web API of “speaker estimation” receives a voice file and authentication information as parameters, and returns a registrant ID as a return value. The Web API of “live voice discrimination” receives a voice file and authentication information as parameters, and returns a boolean value indicating whether or not the voice is live as a return value. The Web API of “speech synthesis” receives a character string, a language type, and authentication information as parameters, and returns a voice file as a return value.
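
The parameters and return values listed above can be written down as typed signatures. The Python sketch below does this for four of the Web APIs and chains them into one call analysis step. The method names, the `CallEvent` type, the order of the calls, and the stub implementation are hypothetical; the text only specifies what each Web API receives and returns.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


class AgentApis(Protocol):
    """Hypothetical client interface mirroring the described Web APIs."""
    def voice_recognition(self, voice_file: bytes, language: str, auth: str) -> str: ...
    def sound_source_direction(self, voice_file: bytes, auth: str) -> Tuple[float, float]: ...
    def speaker_estimation(self, voice_file: bytes, auth: str) -> str: ...
    def live_voice(self, voice_file: bytes, auth: str) -> bool: ...


@dataclass
class CallEvent:
    caller_id: str  # registrant ID returned by speaker estimation
    text: str       # transcript returned by voice recognition
    theta: float    # sound source direction
    r: float        # sound source distance


def analyze_call(apis: AgentApis, voice_file: bytes,
                 language: str, auth: str) -> Optional[CallEvent]:
    """Return a CallEvent for live voice, or None for non-live voice."""
    if not apis.live_voice(voice_file, auth):
        return None  # e.g. sound coming from a television receiver
    text = apis.voice_recognition(voice_file, language, auth)
    theta, r = apis.sound_source_direction(voice_file, auth)
    caller = apis.speaker_estimation(voice_file, auth)
    return CallEvent(caller, text, theta, r)


@dataclass
class StubApis:
    """Canned responses standing in for the cloud; values are made up."""
    live: bool = True

    def voice_recognition(self, voice_file, language, auth):
        return "Grandpa, dinner is ready"

    def sound_source_direction(self, voice_file, auth):
        return (45.0, 2.5)

    def speaker_estimation(self, voice_file, auth):
        return "B"

    def live_voice(self, voice_file, auth):
        return self.live
```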

As described above, in the voice agent 10 shown in FIG. 1, when it is determined that the callee has not noticed the call, control can be performed, based on the result of the cause estimation, to change the output for making the callee notice the call. Therefore, the callee can be effectively made aware of the call, and communication between humans can be facilitated.

Also, in the voice agent 10 shown in FIG. 1, control can be performed so that the output for making the callee notice the call includes direction information indicating the direction from which the call was made. Therefore, the callee can easily recognize from which direction the call was made, and can appropriately respond to the call.

In addition, the voice agent 10 shown in FIG. 1 can be controlled so as not to output when the call is not based on live voice. Therefore, it is possible to avoid, for example, erroneously responding to a call from a television receiver.

Also, in the voice agent 10 shown in FIG. 1, when the result of the cause estimation is that the callee is absent, control can be performed so that, after the callee returns, an output for notifying the callee that a call was made is performed. Therefore, a callee who was absent can learn after returning that a call was made. In this case, the output for notifying that the call was made can include time information indicating when the call was made. As a result, the callee can easily recognize when the call was made, and can appropriately respond to the call.
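
The behaviors summarized above (no output for non-live voice, direction information in the output, and a deferred notification with time information when the callee is absent) can be sketched as follows. The class and method names, the message wording, and the in-memory queue are assumptions for illustration, not the implementation of the voice agent 10.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class PendingCall:
    caller: str
    direction: str       # direction information for the call
    called_at: datetime  # time information for the call


@dataclass
class AbsenceNotifier:
    pending: List[PendingCall] = field(default_factory=list)

    def on_call(self, caller: str, direction: str, called_at: datetime,
                callee_present: bool, is_live_voice: bool = True) -> Optional[str]:
        """Return an immediate message, or None if nothing is output now."""
        if not is_live_voice:
            return None  # do not react to non-live voice
        if not callee_present:
            # Callee is absent: remember the call and notify after return.
            self.pending.append(PendingCall(caller, direction, called_at))
            return None
        return f"{caller} is calling you from the {direction}."

    def on_return(self) -> List[str]:
        """Produce deferred notifications once the callee has returned."""
        messages = [
            f"{c.caller} called you at {c.called_at:%H:%M} from the {c.direction}."
            for c in self.pending
        ]
        self.pending.clear()
        return messages
```

Keeping the pending calls in a list lets several calls made during one absence be reported together, each with its own time.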

<2. Second Embodiment>
[Video phone system]
FIG. 21 shows a configuration example of a videophone system 50 according to the second embodiment. In the videophone system 50, a videophone device 200A arranged in a house where a father, a mother, and two children live and a videophone device 200B arranged in a house where a grandfather and a grandmother live are connected via the Internet.

The videophone device 200B includes a voice agent having the same function as the voice agent 10 described above. When the voice agent of the videophone device 200B receives a call from the videophone device 200A and determines that the callee has not noticed the call, it performs, like the voice agent 10 described above, an output for making the callee notice the call.

For example, the illustrated example shows a case where the boy calls the grandfather from the videophone device 200A and the grandfather on the videophone device 200B side does not notice it. In this case, LED emission 202 is performed.

FIG. 22 shows a configuration example of the voice agent 210 of the videophone device 200B. In FIG. 22, portions corresponding to those in FIG. 15 are denoted by the same reference numerals, and detailed description thereof will be omitted as appropriate. In this case, a voice signal sent from the videophone device 200A via the Internet is given to the voice recognition unit 113 and the speaker estimation unit 118 of the processing main unit 107. Also, in this case, since the sound source direction estimation and the live voice discrimination are unnecessary, those functional units are omitted from the processing main unit 107 in FIG.

FIG. 23 shows another configuration example of the voice agent 210 of the videophone device 200B. In FIG. 23, portions corresponding to those in FIGS. 18 and 22 are denoted by the same reference numerals, and detailed description thereof will be omitted as appropriate. The processing main unit 107 of the voice agent 210 illustrated in FIG. 23 executes many of the processes of the processing main unit 107 of the voice agent 210 illustrated in FIG. 22 by using Web APIs existing on the cloud 150. In this case, the audio signal transmitted from the videophone device 200A via the Internet is provided to the cloud 150 via the network interface 123 and processed.

<3. Third Embodiment>
[Video phone system]
FIG. 24 shows a configuration example of a videophone system 60 according to the third embodiment. In this videophone system 60, a videophone device 300A, which is a mobile device handled by a girl, and a videophone device 300B arranged in a house where a grandfather and a grandmother live are connected via an agent cloud service 310.

The agent cloud service 310 has the same function as the voice agent 10 described above. When the agent cloud service 310 receives a call from the videophone device 300A and determines that the callee has not noticed the call, it controls the videophone device 300B side so that an output for making the callee notice the call is performed.

For example, the illustrated example shows a case where the girl calls the grandfather from the videophone device 300A and the grandfather on the videophone device 300B side does not notice it. In this case, LED emission 302 is performed.

FIG. 25 shows a configuration example of the agent cloud service 310. In FIG. 25, portions corresponding to those in FIG. 23 are denoted by the same reference numerals, and detailed description thereof will be omitted as appropriate. In this case, the agent cloud service 310 is connected to each device on the videophone device 300B side by the network interface 312, and is connected to the videophone device 300A side by the network interface 313.

<4. Modification>
Note that, in the above-described embodiments, whether or not the callee is present is determined based on the result of performing face detection and identification processing on the image signal from the camera 101. However, this determination may additionally use a sensor or an infrared camera. This helps to distinguish photographs from real people, so that it is possible to avoid an erroneous operation in which the agent falsely detects a photograph or poster and mistakenly talks to the person shown in it.

FIG. 26 is a block diagram illustrating a hardware configuration example of a computer that executes the above-described processing of the voice agent by means of a program. For example, the processing main unit 107 (see FIGS. 15, 18, 22, and 23) can be configured by a computer.

In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are interconnected by a bus 504. The bus 504 is further connected to an input / output interface 505. An input unit 506, an output unit 507, a storage unit 508, a communication unit 509, and a drive 510 are connected to the input / output interface 505.

The input unit 506 includes a keyboard, a mouse, a microphone, and the like. The output unit 507 includes a display, a speaker, and the like. The storage unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In the computer configured as described above, the CPU 501 loads, for example, the program stored in the storage unit 508 into the RAM 503 via the input / output interface 505 and the bus 504 and executes it, whereby the above-described series of processing is performed.

The program executed by the computer (CPU 501) can be provided by being recorded on a removable medium 511 as a package medium or the like, for example. Further, the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the storage unit 508 via the input / output interface 505 by attaching the removable medium 511 to the drive 510. The program can be received by the communication unit 509 via a wired or wireless transmission medium and installed in the storage unit 508. In addition, the program can be installed in the ROM 502 or the storage unit 508 in advance.

The program executed by the computer may be a program in which processing is performed in chronological order according to the order described in this specification, or may be a program in which processing is performed in parallel or at a necessary timing, such as when a call is made.

Although the preferred embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to these examples. It is apparent that a person having ordinary knowledge in the technical field of the present disclosure can arrive at various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these also naturally belong to the technical scope of the present disclosure.

In addition, the present technology may have the following configurations.
(1) An information processing apparatus comprising: a cause estimating unit that performs cause estimation when it is determined that the callee is not aware of the call; and
an output control unit that performs control, based on a result of the cause estimation, to change an output for making the callee notice the call.
(2) The information processing apparatus according to (1), wherein the cause estimating unit estimates one of predetermined types of causes set in advance as causes by which the callee does not notice the call.
(3) The information processing apparatus according to (2), wherein the predetermined type of cause includes all or a part of conversation, absence, hearing loss, concentration, sleep, and intentional unresponsiveness.
(4) The information processing apparatus according to any one of (1) to (3), wherein the cause estimating unit performs the cause estimation based on a multimodal input.
(5) The information processing device according to any one of (1) to (4), wherein the output control unit changes an output by multimodal output.
(6) The information processing device according to any one of (1) to (5), wherein the output control unit controls so that direction information indicating a direction in which the call is made is included in the output.
(7) The information processing device according to any one of (1) to (6), wherein the output control unit performs control so that the output is not performed when the call does not involve live voice.
(8) The information processing apparatus according to any one of (1) to (7), wherein, when the result of the cause estimation is that the callee is absent, the output control unit performs control so that an output for notifying that the call has been made is performed after the callee returns.
(9) The information processing device according to (8), wherein the output control unit includes, in an output for notifying that the call has been made, time information indicating when the call was made.
(10) An information processing method comprising: a procedure of performing cause estimation when it is determined that the callee is unaware of the call; and
a procedure of performing control, based on a result of the cause estimation, to change an output for making the callee aware of the call.
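
Configurations (1) to (5) can be pictured with the following minimal Python sketch, in which an estimated cause selects a set of output modalities. The cause names follow configuration (3), and the modality names follow the output devices of the voice agent 10; the pairing of causes to modalities is purely illustrative and is an assumption, not the mapping used in the embodiments.

```python
from enum import Enum
from typing import Dict, List


class Cause(Enum):
    CONVERSATION = "conversation"
    ABSENCE = "absence"
    HEARING_LOSS = "hearing loss"
    CONCENTRATION = "concentration"
    SLEEP = "sleep"
    INTENTIONAL = "intentional unresponsiveness"


# Hypothetical pairing of causes to output modalities (illustration only).
OUTPUTS_BY_CAUSE: Dict[Cause, List[str]] = {
    Cause.CONVERSATION: ["led"],
    Cause.ABSENCE: [],  # nothing now; notify after the callee returns
    Cause.HEARING_LOSS: ["monitor", "led"],
    Cause.CONCENTRATION: ["speaker", "led"],
    Cause.SLEEP: ["speaker", "illumination"],
    Cause.INTENTIONAL: [],
}


def select_outputs(noticed: bool, cause: Cause) -> List[str]:
    """Choose output modalities; no output is needed once noticed."""
    if noticed:
        return []
    return list(OUTPUTS_BY_CAUSE[cause])
```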

DESCRIPTION OF SYMBOLS
10 ... Voice agent
20 ... Living room
50, 60 ... Videophone system
101 ... Camera
102 ... Microphone
103 ... Speaker
104 ... Projector
105 ... Monitor
106 ... LED
107 ... Processing main unit
111 ... Face detection unit
112 ... Face identification unit
113 ... Voice recognition unit
114 ... Natural language processing unit
115 ... Notice recognition unit
116 ... Cause estimation unit
117 ... Sound source direction estimating unit
118 ... Speaker estimating unit
119 ... Live voice discriminating unit
120 ... Output control unit
121 ... Voice synthesizing unit
122 ... Network interface
131 ... Illumination
132 ... Wearable device
133 ... Robot
150 ... Cloud
200A, 200B, 300A, 300B ... Videophone device
210 ... Voice agent
310 ... Agent cloud service
312, 313 ... Network interface

Claims

An information processing apparatus comprising: a cause estimating unit that performs cause estimation when it is determined that the callee is unaware of the call; and
an output control unit that performs control, based on a result of the cause estimation, to change an output for making the callee notice the call.
The information processing apparatus according to claim 1, wherein the cause estimating unit estimates one of predetermined types of causes set as a cause by which the callee does not notice the call.
The information processing apparatus according to claim 2, wherein the predetermined types of causes include all or a part of conversation, absence, hearing loss, concentration, sleep, and intentional unresponsiveness.
The information processing device according to claim 1, wherein the cause estimating unit performs the cause estimation based on a multimodal input.
The information processing device according to claim 1, wherein the output control unit changes an output by multimodal output.
The information processing device according to claim 1, wherein the output control unit performs control so that direction information indicating a direction in which the call is made is included in the output.
The information processing device according to claim 1, wherein the output control unit controls the output so as not to be performed when the call is not based on live voice.
The information processing device according to claim 1, wherein, when the result of the cause estimation is that the callee is absent, the output control unit performs control so that an output for notifying that the call has been made is performed after the callee returns.
The information processing device according to claim 8, wherein the output control unit includes, in an output for notifying that the call has been made, time information indicating when the call was made.
An information processing method comprising: a procedure of performing cause estimation when it is determined that the callee is unaware of the call; and
a procedure of performing control, based on a result of the cause estimation, to change an output for making the callee aware of the call.