WO2021246056A1 - Information processing device, information processing method, and computer program - Google Patents

Information processing device, information processing method, and computer program

Info

Publication number
WO2021246056A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
utterance
dialogue
information
turn
Prior art date
Application number
PCT/JP2021/015097
Other languages
English (en)
Japanese (ja)
Inventor
範亘 高橋
Original Assignee
ソニーグループ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニーグループ株式会社 (Sony Group Corporation)
Publication of WO2021246056A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • The present disclosure relates to an information processing device and an information processing method for performing interactive processing with a user, and to a computer program.
  • Voice agents that interact with users through voice are becoming widespread.
  • An electronic device equipped with a voice agent interprets the user's utterances, executes device operations instructed by voice, and provides voice guidance such as notifications of the device status and explanations of how to use the device.
  • By using a voice agent, the man-hours of human listeners, such as consultation-desk staff, can be reduced.
  • In addition, because the partner is a machine, the user can have a dialogue without feeling reserved.
  • An object of the present disclosure is to provide an information processing device and an information processing method for performing dialogue processing with a user while controlling an utterance turn, and a computer program.
  • The first aspect of the present disclosure is an information processing apparatus comprising a sensor unit that acquires user information, and a control unit that controls dialogue with the user based on the information acquired by the sensor unit.
  • The control unit determines the utterance turn based on the silence time or the feature amount of the user, according to the type of user estimated from the situation of the dialogue acquired through the sensor unit and from the user's attributes and characteristics.
  • The control unit also estimates the intention of the user's utterance, and if the estimation result does not match the content of the user's utterance, outputs information corresponding to the estimation result. The control unit then controls the progress of the topic of the voice dialogue based on the user's response to that information.
  • The second aspect of the present disclosure is an information processing method having an acquisition step of acquiring user information through a sensor unit, and a control step of controlling dialogue with the user based on the information acquired in the acquisition step.
  • The third aspect of the present disclosure is a computer program that causes a computer to function as a sensor unit that acquires user information, and a control unit that controls dialogue with the user based on the information acquired by the sensor unit.
  • The computer program according to the third aspect is written in a computer-readable format so as to realize predetermined processing on a computer.
  • By installing this computer program on a computer, a collaborative action is exhibited on the computer, and the same action and effect as the information processing apparatus according to the first aspect of the present disclosure can be obtained.
  • According to the present disclosure, it is possible to provide an information processing device, an information processing method, and a computer program that control the utterance turn and perform interactive processing with the user while estimating the user's original utterance intention.
  • FIG. 1 is a diagram showing an example of dialogue between a user and a voice agent when the silent section of a speaker is not recognized as a transition of an utterance turn.
  • FIG. 2 is a diagram showing an example of dialogue between the user and the voice agent when the silent section of the speaker is recognized as the transition of the utterance turn.
  • FIG. 3 is a diagram showing a functional configuration example of the dialogue system 300.
  • FIG. 4 is a diagram showing a configuration example of the sensor group 400.
  • FIG. 5 is a diagram showing a configuration example of the utterance turn estimation neural network model 500.
  • FIG. 6 is a diagram showing an example of a dialogue situation, a user's attributes / characteristics, a silent section, and a user's feature amount.
  • FIG. 7 is a diagram showing another configuration example of the neural network model used for determining the utterance turn.
  • FIG. 8 is a flowchart showing the operation of the dialogue system 300 (first embodiment).
  • FIG. 9 is a flowchart showing a processing procedure for selecting an estimator to be used for determining an actual utterance turn from a plurality of estimators for utterance turn estimation.
  • FIG. 10 is a diagram showing how a plurality of estimators are narrowed down to be suitable for a user who is a dialogue partner.
  • FIG. 11 is a diagram showing how the estimators suitable for the user with whom the dialogue is made are narrowed down from a plurality of estimators.
  • FIG. 12 is a diagram showing how a plurality of estimators are narrowed down to be suitable for the user with whom the dialogue is to be performed.
  • FIG. 13 is a diagram showing how the estimators suitable for the user with whom the dialogue is made are narrowed down from a plurality of estimators.
  • FIG. 14 is a flowchart showing the operation of the dialogue system 300 (second embodiment).
  • FIG. 15 is a diagram showing an example of dialogue with the user by the dialogue system 300 according to the second embodiment.
  • FIG. 16 is a diagram showing another example of dialogue with the user by the dialogue system 300 according to the second embodiment.
  • FIG. 17 is a diagram showing an example of dialogue between the dialogue system 300 and the user.
  • FIG. 18 is a diagram showing a mechanism for estimating the intention of the user's utterance.
  • FIG. 19 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the original intention of the user's utterance is not estimated).
  • FIG. 20 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when both determination of the utterance turn and estimation of the user's original utterance intention are performed).
  • FIG. 21 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the utterance turn is determined but the user's original utterance intention is not estimated).
  • FIG. 22 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the user's original utterance intention is estimated but the utterance turn is not determined).
  • FIG. 23 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when both determination of the utterance turn and estimation of the user's original utterance intention are performed).
  • FIG. 24 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the utterance turn is determined but the user's original utterance intention is not estimated).
  • FIG. 25 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the user's original utterance intention is estimated but the utterance turn is not determined).
  • FIG. 26 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when both determination of the utterance turn and estimation of the user's original utterance intention are performed).
  • FIG. 27 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the utterance turn is determined but the user's original utterance intention is not estimated).
  • FIG. 28 is a diagram showing an example of dialogue between the dialogue system 300 and the user (when the user's original utterance intention is estimated but the utterance turn is not determined).
  • FIG. 29 is a diagram showing an example of dialogue between the dialogue system 300 and the elderly in the elderly home.
  • FIG. 30 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 29.
  • FIG. 31 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 29.
  • FIG. 32 is a diagram showing an example of a dialogue between the dialogue system 300 and the user in a musical chat.
  • FIG. 33 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 32.
  • FIG. 34 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 32.
  • FIG. 35 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 32.
  • FIG. 36 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 32.
  • FIG. 37 is a diagram showing an example of dialogue between the dialogue system 300 and the user (patient).
  • FIG. 38 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 37.
  • FIG. 39 is a diagram showing a query generated for estimating the original intention of the user's utterance in the dialogue example shown in FIG. 37.
  • FIG. 40 is a diagram showing information thinned out by the user and an example of utilizing the information.
  • FIG. 41 is a diagram showing an example of an utterance sequence in which a user intermittently utters a thought floating in the brain.
  • FIG. 42 is a diagram showing an example of an utterance sequence in which a user organizes and collectively utters thoughts floating in the brain.
  • If silence that occurs while the speaker is forming a message to the other party in his head is misunderstood as a transition of the utterance turn and the voice agent starts speaking, the desired dialogue cannot be achieved.
  • In addition, silent sections accompanied by fillers and self-directed aizuchi are often meaningless and should not be recognized as turn transitions.
  • FIG. 1 shows an example of dialogue between the user and the voice agent when the silent section of the speaker is not recognized as the transition of the utterance turn.
  • the vertical axis is the time axis.
  • In the dialogue example shown in FIG. 1, after the utterances from the user who was asked about his schedule, "Tuesday and Wednesday, the meeting continues all day in the company" and "Thursday, we have a substitute holiday", the voice agent does not recognize the following silent section as a transition of the utterance turn.
  • As a result, the user can utter the message to the voice agent that he formed in his head during the silent section: "Because this day is a substitute holiday for my first-grade daughter's athletic meet."
  • In other words, the voice agent appropriately determines the location of the utterance turn and does not acquire the utterance right even if the user's silent section becomes long.
  • As a result, the voice agent can draw out useful new information from the user, such as a story about the user's child.
  • As a comparison with FIG. 1, FIG. 2 shows an example of dialogue between the user and the voice agent when the silent section of the speaker is recognized as a transition of the utterance turn.
  • the vertical axis is the time axis.
  • In the dialogue example shown in FIG. 2, after the utterances from the user who was asked about his schedule, "Tuesday and Wednesday, the meeting continues all day in the company" and "Thursday, we have a substitute holiday", the voice agent recognizes the following silent section as a transition of the utterance turn and puts the next question to the user, "How about Friday?". As a result, the user must answer "2 pm that day ...
  • In other words, the voice agent immediately determines that the user's long silent section is a transition of the utterance turn and acquires the utterance right. As a result, even though the voice agent can finish the task efficiently, the conversation does not lead to the story of the user's child, and the agent loses the chance to hear that story from the user.
  • If the utterance turn transition occurs because of a silent section that arises while the user is forming a message to the other party, the chance of hearing that message from the user is reduced.
  • In addition, the user may lose the motivation to speak if the transition of the utterance turn occurs while he is still thinking about the message to the other party.
  • Therefore, in the present disclosure, whether a silent section generated during the dialogue is a transition of the utterance turn is determined based not only on the length of the silent section but also on the situation in which the user is interacting and on the user's attributes and characteristics.
  • System configuration: FIG. 3 schematically shows a functional configuration example of the dialogue system 300 to which the present disclosure is applied.
  • The illustrated dialogue system 300 includes a voice input unit 301, a voice section detection unit 302, a voice recognition unit 303, a semantic analysis unit 304, a sensor unit 305, an utterance turn determination unit 306, a dialogue management unit 307, an audio output unit 308, and a display unit 309.
  • Each part will be described below.
  • the voice input unit 301 is composed of a voice input element such as a microphone, and inputs voice spoken by a user who has a dialogue with the dialogue system 300.
  • The voice section detection unit 302 detects sections of the voice signal input from the voice input unit 301 that contain voice, and passes the voice signal of each voice section to the voice recognition unit 303. Further, the voice section detection unit 302 detects silent sections in which the voice signal input from the voice input unit 301 contains no voice, and passes information on each silent section (for example, the start time at which the voice is interrupted, and the length or end time of the silent section) to the utterance turn determination unit 306.
  • However, a silent section may contain sounds other than speech that advances the dialogue, such as fillers, aizuchi (back-channel responses), and breathing.
  • the voice recognition unit 303 converts the voice signal of the voice section passed from the voice section detection unit 302 into information that can be semantically analyzed such as a text character string.
  • The semantic analysis unit 304 takes as input the text character string of the voice recognition result from the voice recognition unit 303, and estimates the meaning of the subject of the character string, the target of the action, and the like. A deep-learned neural network model may be used to perform natural language processing such as speech recognition and semantic analysis.
  • the sensor unit 305 is composed of a combination of various sensor elements. In one embodiment of the present disclosure, the sensor unit 305 is used to detect a situation or environment in which the dialogue system 300 is interacting with the user. Further, the sensor unit 305 is also used to detect the attributes of the user with whom the dialogue is made, the individual characteristics of the user with whom the dialogue is made, and the like. A specific configuration example of the sensor unit 305 will be described later.
  • The utterance turn determination unit 306 performs utterance turn determination processing for the dialogue between the user and the dialogue system 300 based on the silent section information input from the voice section detection unit 302. In a simple example, when the length of a silent section exceeds a predetermined value, it is determined to be a transition of the utterance turn. In one embodiment of the present disclosure, however, the utterance turn determination unit 306 determines whether or not a silent section corresponds to a transition of the utterance turn based on the sensor information from the sensor unit 305.
  • Specifically, the utterance turn determination unit 306 determines whether or not a silent section corresponds to a transition of the utterance turn based on the situation and environment in which the dialogue system 300 is interacting with the user, the attributes and characteristics of the user, the feature amount of the user, and the like. Therefore, in this embodiment, silent sections of the same length may or may not be determined to be a transition of the utterance turn, depending on differences in the dialogue situation, the user's attributes and characteristics, the user's feature amount, and the like.
  • the utterance turn determination unit 306 may perform the utterance turn determination process using an estimator composed of a neural network model.
  • This neural network model is assumed to have been deep-learned so as to estimate the meaning represented by a silent section based on, for example, the situation and environment in which the dialogue system 300 is interacting with the user, the attributes and characteristics of the user, and the feature amount of the user.
  • The dialogue management unit 307 manages the dialogue between the user and the dialogue system 300. For example, when the utterance turn determination unit 306 determines that the utterance turn has transitioned from the user to the dialogue system 300, the dialogue management unit 307 outputs to the voice output unit 308 and the display unit 309 a response sentence to the user generated based on the semantic analysis result of the user's utterance by the semantic analysis unit 304 or on the context of past dialogue with the user. Further, the dialogue management unit 307 may be set to generate fillers or aizuchi. Part or all of the voice recognition unit 303, the semantic analysis unit 304, the utterance turn determination unit 306, and the dialogue management unit 307 may be realized by one or more integrated circuits as the control unit 310.
  • Further, the dialogue management unit 307 performs estimation processing of the original intention of the user's utterances input through the semantic analysis unit 304. For example, when a plurality of utterances are continuously input by the user, it determines whether or not each utterance may contain an error. Then, when the estimation result of the original intention of the user's utterance does not match the content of the user's utterance, the dialogue management unit 307 presents the estimation result to the user and, based on the user's reaction to it, manages the dialogue so as to switch the progress of subsequent topics.
  • the dialogue management unit 307 may generate a response sentence or estimate the intention of the user's original utterance by using an estimator composed of a neural network model.
  • The utterance intention estimator estimates the original intention of the utterance using a domain database according to the situation of the dialogue, a synonym database that covers words similar to each word included in the user's utterance, and a user model that describes the characteristics of each user.
  • the voice output unit 308 is composed of one or a plurality of speakers, and outputs a response sentence generated by the dialogue management unit 307 by voice.
  • the display unit 309 is composed of, for example, a liquid crystal display or an organic EL (Electro-Luminescence) display.
  • The display unit 309 may display a character image of the dialogue system 300 or an animation of the character. Further, the display unit 309 may display the response sentence generated by the dialogue management unit 307 as text. The display unit 309 may also display whose utterance turn it currently is, the user's or the dialogue system 300's. Further, when the dialogue management unit 307 estimates the original intention of the user's utterance and a plurality of candidates are calculated, the candidates with high likelihood may be displayed on the display unit 309.
  • The dialogue system 300 may be a device dedicated to dialogue such as a voice agent device, a personal computer, a multifunctional information terminal such as a smartphone or a tablet, various home appliances, or an IoT (Internet of Things) device. Further, the dialogue system 300 including the above functional configuration may be realized by a plurality of devices. Specifically, any of the voice input unit 301, the sensor unit 305, the audio output unit 308, and the display unit 309 may be mounted on a device physically independent of the device having the control unit 310, and the devices may be connected by wire or wirelessly. Each function constituting the control unit 310 may also be realized by a plurality of devices.
  • Further, the voice section detection unit 302, the voice recognition unit 303, the semantic analysis unit 304, the utterance turn determination unit 306, and the dialogue management unit 307 may be realized by an external server connected via a network to the device that provides the functions of the dialogue system 300.
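  • As an illustration of the signal flow in FIG. 3, the following is a minimal sketch in Python; the class and method names are assumptions made for exposition, not identifiers from the patent itself. The voice section detection unit feeds recognized speech toward the dialogue management unit, while silent-section information, together with sensor readings, drives the turn determination.

    class DialogueSystem300:
        """Wires the FIG. 3 units: 301 -> 302 -> 303 -> 304 -> 307, with 302
        also feeding silent-section info to 306 (utterance turn determination)."""

        def __init__(self, vad, asr, analyzer, turn_determiner, manager,
                     sensors, speaker, display):
            self.vad = vad                # voice section detection unit 302
            self.asr = asr                # voice recognition unit 303
            self.analyzer = analyzer      # semantic analysis unit 304
            self.turn = turn_determiner   # utterance turn determination unit 306
            self.manager = manager        # dialogue management unit 307
            self.sensors = sensors        # sensor unit 305
            self.speaker = speaker        # voice output unit 308
            self.display = display        # display unit 309

        def on_audio(self, audio_frame):
            segment, silence = self.vad.process(audio_frame)
            if segment is not None:
                meaning = self.analyzer.analyze(self.asr.transcribe(segment))
                self.manager.update(meaning)
            # The turn decision uses sensor information, not silence length alone.
            if silence is not None and self.turn.is_turn_transition(
                    silence, self.sensors.read()):
                response = self.manager.generate_response()
                self.speaker.say(response)
                self.display.show(response)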
  • FIG. 4 shows a configuration example of the sensor group 400 included in the sensor unit 305 used in the dialogue system 300 shown in FIG.
  • the sensor group 400 includes a camera unit 410, a user status sensor unit 420, an environment sensor unit 430, and a user profile sensor unit 440.
  • The sensor group 400 is used to obtain information on the situation and environment in which the user is interacting with the dialogue system 300, on the attributes and characteristics of the user, and on the feature amount of the user with whom the dialogue is held.
  • the camera unit 410 includes a camera 411 that captures a user who is interacting with the dialogue system 300, and a camera 412 that captures a room in which the dialogue system 300 is installed or an environment in which the dialogue system 300 is interacting.
  • the user status sensor unit 420 includes one or more sensors that acquire status information regarding the status of the user interacting with the dialogue system 300.
  • The state information acquired by the user state sensor unit 420 includes, for example, the user's work state or action state (movement state such as stationary, walking, or running; eyelid opening/closing state; line-of-sight direction; pupil size), mental state (degree of emotion, excitement, or arousal, such as whether the user is absorbed in or concentrating on the dialogue), and physiological state.
  • The user status sensor unit 420 may include various sensors such as a perspiration sensor, a myoelectric potential sensor, an electrooculogram sensor, a brain wave sensor, an exhalation sensor, a gas sensor, an ion concentration sensor, and an IMU (Inertial Measurement Unit) that measures the user's movements, as well as an audio sensor (such as a microphone) that picks up the user's speech and a position information detection sensor (such as a proximity sensor) that detects the position of an object such as the user's finger. The audio sensor may be shared with the microphone used by the voice input unit 301 for inputting the user's utterances.
  • The environment sensor unit 430 includes various sensors that measure information about the room where the dialogue system 300 is installed or the environment in which the user has a dialogue with the dialogue system 300. For example, temperature sensors, humidity sensors, light sensors, illuminance sensors, airflow sensors, odor sensors, electromagnetic wave sensors, geomagnetic sensors, GPS (Global Positioning System) sensors, and audio sensors that collect ambient sounds (microphones, etc.) are included in the environment sensor unit 430. The environment sensor unit 430 may also acquire information such as the size of the room in which the dialogue system 300 is placed, the position of the user, and the brightness of the room.
  • the user profile sensor unit 440 detects profile information such as attributes related to the user interacting with the dialogue system 300.
  • the user profile sensor unit 440 does not necessarily have to be composed of sensor elements.
  • For example, the user profile, such as the user's age and gender, may be estimated based on the user's face image taken by the camera 411 or on the user's utterances picked up by the audio sensor.
  • Further, the user profile acquired on a multifunctional information terminal carried by the user, such as a smartphone, may be obtained through cooperation between the dialogue system 300 and the smartphone.
  • However, the user profile sensor unit 440 need not detect sensitive information that would affect the privacy or confidentiality of the user.
  • Also, the user profile sensor unit 440 need not detect the same user's profile information at every dialogue, and may be provided with a memory, such as an EEPROM (Electrically Erasable and Programmable ROM), that stores user profile information once acquired.
  • a multifunctional information terminal carried by a user such as a smartphone may be utilized as a user status sensor unit 420, an environment sensor unit 430, or a user profile sensor unit 440 by linking the dialogue system 300 and the smartphone.
  • Sensor information acquired by sensors built into the smartphone, and data managed by applications such as a healthcare function (pedometer, etc.), a calendar or schedule book/memo, mail, browser history, and SNS (Social Networking Service) posting and browsing history, may be added to the user's state data and environment data.
  • a sensor built in another CE device or IoT device existing in the same space as the dialogue system 300 may be utilized as the user status sensor unit 420 or the environment sensor unit 430.
  • The utterance turn determination unit 306 of the first embodiment can perform the utterance turn determination process using a neural network model.
  • This neural network model is assumed to have been deep-learned so as to estimate the meaning represented by a silent section based on, for example, the situation and environment in which the user is interacting with the dialogue system 300, the attributes of the user, and the individual characteristics of the user.
  • FIG. 5 shows a configuration example of the utterance turn estimation neural network model 500 used for determining the utterance turn.
  • The utterance turn estimation neural network model 500 takes as input the dialogue situation, the user's attributes and characteristics, and the silent section information, and outputs an estimate of whether or not the utterance turn has changed.
  • the dialogue status and user attributes / characteristics are input from the sensor unit 305 to the utterance turn estimation neural network model 500.
  • the silent section information is input from the voice section detection unit 302 to the utterance turn estimation neural network model 500.
  • The utterance turn estimation neural network model 500 estimates the user's type from the dialogue situation and the user's attributes and characteristics, and estimates whether the silence is a turn transition based on the length of the silent section and the user's feature amount, according to the user's type.
  • That is, the utterance turn determination unit 306 can use an estimator composed of the utterance turn estimation neural network model 500 to weight the length of the silence time according to the user type estimated from the dialogue situation and the user's attributes and characteristics, and to control the utterance turn based on the user's feature amount. For example, the utterance turn determination unit 306 determines a transition of the utterance turn after a short silent time for a young user, but only after a long silent time for an elderly user. That is, even for silent sections of the same length, the determination result of the utterance turn determination unit 306 may differ depending on the user's attributes and characteristics.
  • Further, the utterance turn determination unit 306 changes the weighting of the user's features, such as the line of sight, face orientation, gestures such as posture and hand movements, and sentence-ending phonology, according to the dialogue situation and the user's attributes and characteristics, and makes the turn determination accordingly.
  • FIG. 6 shows an example of the dialogue situation, the user's attributes / characteristics, the silence time, and the user's feature amount.
  • The dialogue situation includes items such as the use the user makes of the dialogue system 300 (medical interview, Q&A, content search or recommendation, job interview, task request, casual chat, listening, etc.), the place where the user uses the dialogue system 300 (public or private space, amount of noise and clutter, standing or seated conversation, hospital/school/house, country or region name, etc.), and the time zone in which the user uses the dialogue system 300 (morning, midnight, etc.).
  • The attributes and characteristics of the user include items such as age, gender, nationality, language used, wording (honorifics, casual tone, etc.), personality (impatient, laid-back, etc.), and the relationship with the dialogue partner (doctor and patient, teacher and student, counselor and client, etc.).
  • The silent section includes items such as long (t1), short (t2), and initially long then gradually becoming shorter (t3).
  • The user's feature amounts include linguistic information during speech (differences between Japanese and English, etc.), prosody information (pitch at the end of a sentence, etc.), volume information (volume at the end of a sentence, etc.), and other nonverbal information (time-series changes in the user's line of sight, face orientation, posture, hand gestures, and the like during the dialogue).
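  • As an illustration of how the FIG. 5 inputs and the FIG. 6 items might be combined, the following Python sketch packs the dialogue situation, a user attribute, and the silent section into one feature vector for a turn-transition estimator. The category lists and the single logistic unit are assumptions made for exposition, not the patent's actual model 500.

    import numpy as np

    SITUATIONS = ["medical_interview", "qa", "recommendation", "chat"]
    AGE_BANDS = ["child", "young_adult", "middle_aged", "elderly"]

    def encode(situation, age_band, silence_sec, gaze_on_agent, end_volume):
        # One-hot dialogue situation and user attribute, plus three scalars.
        x = np.zeros(len(SITUATIONS) + len(AGE_BANDS) + 3)
        x[SITUATIONS.index(situation)] = 1.0
        x[len(SITUATIONS) + AGE_BANDS.index(age_band)] = 1.0
        x[-3] = silence_sec                    # length of the silent section
        x[-2] = 1.0 if gaze_on_agent else 0.0  # nonverbal feature: line of sight
        x[-1] = end_volume                     # volume at the end of the sentence
        return x

    def is_turn_transition(x, w, b):
        # Stand-in for model 500: a single logistic unit over the features.
        return 1.0 / (1.0 + np.exp(-(w @ x + b))) > 0.5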
  • FIG. 7 shows another configuration example of the neural network model used for determining the utterance turn.
  • This configuration comprises a type estimation neural network model 701 that estimates the type of the user, and an utterance turn estimation neural network model 702 that estimates whether or not a turn transition has occurred based on the user type and the silent section information.
  • The two neural network models 701 and 702 are connected and used so that the utterance turn estimation neural network model 702 takes as input the estimation result of the type estimation neural network model 701.
  • That is, FIG. 7 is a configuration in which the end-to-end neural network model 500 shown in FIG. 5 is divided into two neural network models.
  • The type estimation neural network model 701 estimates the type of the user who is interacting with the dialogue system 300, based on the dialogue situation and the user's attributes and characteristics input from the sensor unit 305. See FIG. 6 for the dialogue situation and the user's attributes and characteristics.
  • The utterance turn estimation neural network model 702 determines the criteria for the silence time and for the user feature amounts to be prioritized in the turn estimation, based on the user type estimated by the type estimation neural network model 701, and estimates whether or not the utterance turn has changed based on the silent section information input from the voice section detection unit 302 and the observation of the user's feature amounts by the sensor unit 305. See FIG. 6 for the user's feature amounts.
  • That is, the utterance turn determination unit 306 can use the type estimation neural network model 701 and the utterance turn estimation neural network model 702 to control the utterance turn based on silence-time and user-feature criteria according to the user type estimated from the dialogue situation and the user's attributes and characteristics.
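  • The two-stage configuration of FIG. 7 can be sketched as follows; the user types, thresholds, and feature checks are invented placeholders, shown only to make the division of labor between models 701 and 702 concrete.

    def estimate_user_type(situation, attributes):
        # Stands in for type estimation model 701: the dialogue situation and
        # user attributes map to a user type.
        return "elderly" if attributes.get("age", 0) >= 65 else "young"

    SILENCE_THRESHOLD = {"elderly": 3.0, "young": 1.2}  # seconds, assumed values

    def estimate_turn_transition(user_type, silence_sec, features):
        # Stands in for model 702: the user type selects the silence criterion
        # and which user feature amounts are prioritized.
        if features.get("gaze_on_agent"):  # gaze toward the agent hints at yielding
            return silence_sec > 0.5 * SILENCE_THRESHOLD[user_type]
        return silence_sec > SILENCE_THRESHOLD[user_type]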
  • FIG. 8 shows the operation of the dialogue system 300 in the first embodiment in the form of a flowchart.
  • First, the utterance turn determination unit 306 acquires the situation in which the user is interacting with the dialogue system 300 and the user's attributes and characteristics, based on the sensor information input from the sensor unit 305, and determines the user type corresponding to the acquired dialogue situation and attributes and characteristics (step S801).
  • Next, the utterance turn determination unit 306 adjusts the length of the silent time used as the turn determination criterion and the user feature amounts to be prioritized, according to the user type determined in step S801 (step S802).
  • Then, the utterance turn determination unit 306 waits until a silent section arrives, based on the input signal from the voice section detection unit 302 (No in step S803). When a silent section is detected (Yes in step S803), the utterance turn determination unit 306 checks whether the silent section is a transition of the utterance turn, according to the silent-time length and the prioritized user feature amounts adjusted in step S802 (step S804).
  • If the silent section is determined to be a turn transition (Yes in step S804), the dialogue system 300 executes an utterance (step S805). That is, the dialogue management unit 307 generates a response sentence to the user based on the semantic analysis result of the user's utterance by the semantic analysis unit 304, the context of past dialogue with the user, and the like, and the voice output unit 308 outputs the generated response sentence by voice. Part or all of the output response sentence may be generated before the turn transition.
  • If the silent section is not a turn transition (No in step S804), the dialogue system 300 does not execute an utterance and waits for the user's next utterance.
  • In this case, the dialogue management unit 307 outputs a filler through the voice output unit 308 (step S806) to prompt the user to speak.
  • At this time, the utterance turn determination unit 306 may also indicate on the display unit 309 that the user still holds the utterance right, to prompt the user to speak.
  • When the user then speaks (step S807), the user's utterance can be input through the voice input unit 301 to newly acquire useful information from the user (corresponding to the case in FIG. 1 where the agent gets to hear the story of the user's child).
  • Depending on the situation, silence or a sound other than a filler may be selected instead of the filler.
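  • The flow of FIG. 8 (steps S801 to S807) can be summarized in code; the helper methods on `system` are assumptions, but the branching mirrors the flowchart: a silent section judged not to be a turn transition results in a filler rather than a response.

    def run_dialogue(system):
        user_type = system.determine_user_type()            # S801
        system.adjust_criteria(user_type)                   # S802: silence length and
                                                            # prioritized feature amounts
        while system.dialogue_active():
            silence = system.wait_for_silent_section()      # S803
            if system.is_turn_transition(silence, user_type):   # S804
                system.speak_response()                     # S805
            else:
                system.emit_filler()                        # S806: prompt the user
                system.await_user_utterance()               # S807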
  • A dialogue device that controls the utterance turn based on user feature amounts such as voice power, voice pitch, line-of-sight direction, and head movement is already known (see, for example, Patent Document 1).
  • In contrast, the dialogue system 300 according to the present disclosure estimates the user type from the dialogue situation and the user's attributes and characteristics, and adjusts the silent-time length used as the turn determination criterion and the prioritized user feature amounts according to that user type, so a more accurate utterance turn transition can be realized.
  • For example, the dialogue system 300 may determine the utterance turn with emphasis on the user's phonological information (sentence-ending phonology, etc.) when interacting with one user, but with emphasis on the direction of the line of sight when interacting with another user, realizing the optimum turn determination for each user.
  • Further, when the user who holds the utterance right falls silent, the utterance turn determination unit 306 in the dialogue system 300 can obtain useful new information from the user by waiting for the user's subsequent utterance without determining a turn transition even if a certain amount of silent time continues, according to the user type determined from the dialogue situation and the user's attributes and characteristics. For example, as shown in FIG. 1, by not judging the long silent section after the user's utterances "Tuesday and Wednesday, the meeting continues all day in the company" and "Thursday, we have a substitute holiday" as a turn transition according to the type of user, the user can speak the message to the voice agent that he formed in his head during the silent section.
  • the dialogue system 300 can determine the utterance turn according to the situation of the dialogue and the attributes / characteristics of the user.
  • Next, the method of acquiring the dialogue situation and the user's attributes and characteristics will be described.
  • The dialogue situation and the user's attributes and characteristics can be estimated based on the sensor information acquired by the sensor unit 305. They can also be estimated based on the user's reaction to a greeting made by the dialogue system 300 at the time of initial setting. Part of the information about the dialogue situation and the user's attributes and characteristics may be set before the dialogue by the user or by a third party, and the dialogue system 300 may output a question for this setting to the voice output unit 308 or the display unit 309. The questions and setting methods regarding the dialogue situation and the user's attributes and characteristics may differ for each user.
  • The dialogue system 300 may be equipped with a plurality of estimators (neural network models) that determine the user type from the dialogue situation and the user's attributes and characteristics and that adjust, according to that type, the silent-time length used as the turn determination criterion and the user feature amounts to be prioritized.
  • In that case, the utterance turn determination unit 306 may estimate the utterance turn using the estimator finally selected from among them.
  • FIG. 9 shows a processing procedure for selecting an estimator to be used for determining an actual utterance turn from a plurality of estimators for utterance turn estimation in the form of a flowchart.
  • First, the dialogue situation, such as the cultural area and environment in which the dialogue system 300 is used, and the user's attributes and characteristics are acquired (step S901), and the plurality of estimators are narrowed down to those corresponding to this cultural area and environment (step S902).
  • Next, the user's attributes and characteristics, such as age and gender, are estimated based on the user's face image taken by the camera 411 (step S903), and the estimator group narrowed down in step S902 is further narrowed down to estimators suitable for those attributes and characteristics (step S904).
  • The user's attributes and characteristics may instead be estimated based on sensor information from the user state sensor unit 420 or the user profile sensor unit 440 rather than on the image captured by the camera 411. Up to this point, the estimators are narrowed down without depending on individual differences among users.
  • Next, the user's individual differences are estimated from the user's reaction to the greeting performed by the dialogue system 300 at the time of initial setting (step S905), and the estimator group narrowed down in step S904 is narrowed down to the single estimator best suited to the reaction speed of the individual user (step S906).
  • the dialogue system 300 evaluates the user's behavior based on the sensor information acquired by the sensor unit 305 during operation, that is, during the dialogue with the user (step S907).
  • The evaluation result of the user's behavior is fed back to the estimator in use, and the estimator is relearned by error backpropagation (step S908).
  • Steps S901 to S908 are repeatedly executed until the dialogue between the dialogue system 300 and the user is completed (No in step S909).
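  • The coarse-to-fine selection of FIG. 9 (steps S901 to S908) might look like the following sketch; the predicate and method names are assumed for illustration.

    def select_estimator(estimators, dialogue_context, user_attributes, greeting_reaction):
        pool = [e for e in estimators
                if e.matches_culture(dialogue_context)]       # S901-S902
        pool = [e for e in pool
                if e.matches_attributes(user_attributes)]     # S903-S904
        # S905-S906: keep the single estimator best matching the individual
        # user's reaction (e.g., reaction speed) to the initial greeting.
        return min(pool, key=lambda e: e.mismatch(greeting_reaction))

    def refine_during_dialogue(estimator, behavior_evaluation):
        # S907-S908: feed the evaluated user behavior back into the estimator
        # and relearn it by error backpropagation.
        estimator.backpropagate(behavior_evaluation)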
  • FIGS. 10 to 13 show how a plurality of estimators are narrowed down to be suitable for the user with whom the dialogue is to be performed.
  • FIG. 10 shows all estimators pre-equipped in the dialogue system 300.
  • FIG. 11 shows how, from the installation location and the initial settings of the dialogue system 300, the dialogue situation has been identified as a hospital interview in Japan and the user as Japanese, and the estimators have been narrowed down to those suitable for a hospital interview, Japan, and Japanese users.
  • FIG. 12 shows that the user's attributes and characteristics have been identified in more detail as a man in his thirties, and that the estimators have been further narrowed down to those suitable for men in their thirties. Up to this point, the estimators are narrowed down without depending on the user's individual differences.
  • FIG. 13 shows a state in which the single estimator best suited to the individual differences of a specific user has been selected from the estimators corresponding to individual differences shown in FIG. 12.
  • Second embodiment: The dialogue system 300 according to the second embodiment is intended for use in fields such as medical care, nursing, and long-term care, and targets users who speak their thoughts as they come to mind. With such an utterance style, the user does not hesitate to voice various opinions, so the dialogue system 300 can draw more information out of the user. On the other hand, compared with a user who organizes the message before speaking, the utterance content is assumed to include the following elements.
  • That is, the message from a user who speaks thoughts as they come to mind may contain incorrect information, unnecessary information, and uncertain information.
  • If the dialogue system 300 generates a response sentence based on incorrect, unnecessary, or uncertain information received from the user and returns it to the user, the dialogue does not proceed normally and the topic is derailed.
  • As a result, the dialogue system 300 cannot provide a sufficient dialogue service, because the user loses the motivation to continue the dialogue when an unintended response sentence is returned.
  • the dialogue system 300 conducts a dialogue with the user on the assumption that there is a possibility that the content of the user's utterance is incorrect.
  • the dialogue management unit 307 estimates the original intention of the user who utters a thought as he / she thinks, and manages the dialogue with the user based on the estimation result.
  • The dialogue management unit 307 may use an estimator composed of a deep-learned neural network model or the like to estimate the original intention of the user's utterance. Further, the dialogue system 300 may be equipped with a plurality of estimators in order to estimate the original utterance intention more accurately for various dialogue situations and user attributes and characteristics, and the estimator used for the actual utterance intention estimation may be selected according to the same processing procedure as in FIG. 9.
  • The dialogue management unit 307 estimates the original intention of the user's utterance using a domain database according to the dialogue situation, a synonym database that covers similar words, and a user model that describes the characteristics of each user. For example, the dialogue management unit 307 generates a query consisting of the words included in the user's utterance and their synonyms, searches the domain database with the query, and estimates the original intention of the user's utterance. Further, when a single query yields only a low-likelihood estimation result, the dialogue management unit 307 applies further queries, including the words contained in the user's subsequent utterances and their synonyms, to the domain database to try to obtain a highly probable estimation result.
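  • The query-based estimation described above can be illustrated with toy databases; the dictionaries below are invented stand-ins for the domain database, synonym database, and user model, chosen to reproduce the "apple" / "orange cake" example discussed later with FIG. 18.

    DOMAIN_DB = {"today's menu": ["orange cake", "green salad", "corn soup"]}
    SYNONYM_DB = {"apple": ["apple", "orange", "pear"]}   # similar words
    USER_MODEL = {"knows": {"apple", "orange", "pear"}}   # words this user can know

    def estimate_intention(utterance_word, topic):
        # Build a query from the uttered word and its synonyms, keeping only
        # words the user model says the user can know.
        query = [w for w in SYNONYM_DB.get(utterance_word, [utterance_word])
                 if w in USER_MODEL["knows"]]
        # Match the query against the domain database for the current topic.
        candidates = [item for item in DOMAIN_DB.get(topic, [])
                      if any(w in item for w in query)]
        return candidates[0] if candidates else None

    print(estimate_intention("apple", "today's menu"))    # -> "orange cake"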
  • FIG. 14 shows, in the form of a flowchart, the operation of the dialogue system 300 according to the second embodiment, including the estimation of the user's original utterance intention.
  • the dialogue management unit 307 estimates the information that the user originally intended to speak based on the information acquired through the dialogue with the user and the information acquired in advance (step S1401).
  • the dialogue management unit 307 compares the estimation result in step S1401 with the actual user's utterance content (step S1402).
  • If the estimation result and the actual utterance content are equal (Yes in step S1402), the dialogue management unit 307 advances the topic as it is, based on the content spoken by the user (step S1403).
  • If the estimation result and the actual utterance content are not equal (No in step S1402), the dialogue management unit 307 presents the estimation result of step S1401 to the user and makes an utterance to confirm the user's intention (step S1404).
  • The estimation result may be presented to the user by including it in a response sentence output by voice, or it may be displayed using the display unit 309. The system then waits for the user to indicate whether or not the presented estimation result is the content the user intended.
  • When the user indicates that the presented estimation result is the content he intended (Yes in step S1405), the dialogue management unit 307 advances the topic based on the result estimated in step S1401 (step S1406).
  • On the other hand, when the user indicates that the presented estimation result is not the intended content (No in step S1405), the dialogue management unit 307 advances the topic as it is, based on the content spoken by the user (step S1403).
  • In step S1403, the dialogue management unit 307 does not immediately switch the topic based on the estimation result.
  • The topic is advanced based on the content spoken by the user as it is, but if another plausible topic can be estimated from information extracted from the subsequent dialogue, it is presented to the user.
  • Thus, even when the credibility of the user's utterance is low or the user's utterance is wrong, the user can easily carry on a dialogue with the originally intended content.
  • In other words, with the dialogue system 300 according to the second embodiment, even if the user makes an unreliable or erroneous utterance, it is easy for the user to proceed with a dialogue having the originally intended content.
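  • The FIG. 14 flow (steps S1401 to S1406) reduces to the following control structure; the method names on `system` are illustrative assumptions.

    def manage_topic(system, utterance):
        estimate = system.estimate_intended_content(utterance)   # S1401
        if estimate == utterance.content:                        # S1402: equal
            system.advance_topic(utterance.content)              # S1403
            return
        system.present_estimate(estimate)                        # S1404
        if system.user_confirms():                               # S1405: intended
            system.advance_topic(estimate)                       # S1406
        else:
            # Keep the user's own content for now; if a more plausible topic
            # emerges from the subsequent dialogue, present it to the user later.
            system.advance_topic(utterance.content)              # S1403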
  • FIG. 15 shows an example of dialogue with the user by the dialogue system 300 according to the second embodiment.
  • the vertical axis is the time axis.
  • the dialogue system 300 is assumed to operate according to the processing procedure shown in FIG.
  • In this dialogue example, the dialogue management unit 307 compares the estimation result with the actual content of the user's utterance, and when it detects that they are not equal, presents the estimation result to the user and makes an utterance to confirm the user's intention.
  • Specifically, the dialogue management unit 307 outputs the response sentence "There was no such menu today, is it an orange cake?" from the voice output unit 308.
  • Then, the dialogue management unit 307 can advance the topic based on the estimation result that it is an "orange cake".
  • FIG. 16 shows another example of dialogue with the user by the dialogue system 300 according to the second embodiment.
  • the vertical axis is the time axis.
  • the dialogue system 300 is assumed to operate according to the processing procedure shown in FIG.
  • In this dialogue example as well, the dialogue management unit 307 compares the estimation result with the actual content of the user's utterance, and when it detects that they are not equal, presents the estimation result to the user and makes an utterance to confirm the user's intention.
  • Specifically, the dialogue management unit 307 outputs the response sentence "There was no such menu today, is it an orange cake?" from the voice output unit 308.
  • In step S1401 of the processing procedure shown in FIG. 14, the dialogue management unit 307 estimates the information the user originally intended to speak, based on the information acquired through the dialogue with the user and on information acquired in advance.
  • the following (1) to (4) can be given as an example of the information acquired in advance.
  • When the dialogue management unit 307 receives the semantic analysis result of the user's utterance from the semantic analysis unit 304, it searches for synonyms of words such as the subject of the character string and the action target in the information acquired in advance, such as (1) to (4) above, and estimates the original intention of the user's utterance based on the result.
  • For example, among the synonyms found by the synonym search, candidates that the user can know and that match the domain of the current topic are calculated and presumed to be the original intention of the user's utterance.
  • The process of estimating the original intention of the user's utterance will be described using the example of dialogue between the dialogue system 300 and the user shown in FIG. 17.
  • the vertical axis is the time axis.
  • the dialogue system 300 is assumed to operate according to the processing procedure shown in FIG.
  • In this dialogue example, the dialogue management unit 307 searches for synonyms of the subject of the character string and the action target related to the user's utterance in the information acquired in advance, such as (1) to (4) above, and estimates the original intention of the user's utterance based on the result.
  • FIG. 18 illustrates a mechanism for estimating the intention of a user's utterance.
  • In this example, the dialogue management unit 307 searches the synonym database 1802 and the user model 1803 to find the synonyms "apple", "orange", and "pear" that the user can know. Further, the dialogue management unit 307 collates them with the domain database 1801 to calculate the candidate "orange cake" that matches the domain of the topic "today's menu", and presumes it to be the original intention of the user's utterance.
  • Next, the dialogue management unit 307 compares the estimation result with the actual content of the user's utterance, and when it detects that the estimation result "orange (cake)" and the actual utterance content "apple" are not equal, presents the estimation result to the user and makes an utterance to confirm the user's intention.
  • Specifically, the dialogue management unit 307 causes the voice output unit 308 to output the response sentence "Is it an orange cake?" The user then responds to this response sentence, indicating that the presented estimation result is the content he intended: "Oh, that's right, the sweetness was modest." After that, the dialogue management unit 307 can proceed with the topic based on the estimation result that it is an "orange cake".
  • FIG. 19 shows a dialogue example between the dialogue system 300 and the user when the original intention of the user's utterance is not estimated.
  • the vertical axis is the time axis.
  • In this case, the dialogue system 300 does not follow the processing procedure shown in FIG. 14: it neither estimates the information the user originally intended to speak nor presents an estimation result to the user.
  • Here, the dialogue management unit 307 collates words such as the subject of the character string and the action target related to the user's utterance, and judges that there is no topic matching "apple" in the domain "today's menu". The dialogue system 300 in this case then utters the response sentence "There was no such menu" without estimating the original intention of the user's utterance. Even if the user can recognize from this response sentence that there is an error in the content of his utterance, the dialogue system 300 does not present the original intention of the utterance, so unless the user recalls the original intention by himself, it is difficult to proceed with the topic based on this utterance.
  • FIG. 20 shows an example of dialogue between the dialogue system 300 and the user.
  • the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • The dialogue system 300 utters "Are there any free days?" in order to decide an appointment date with the user. In response, the user answers, "Thursday, the day after Labor Thanksgiving Day, was free."
  • The dialogue system 300 estimates that the original intention of the user's utterance is that Thursday, the day after the holiday, is free, and presumes that the holiday the user referred to as "Labor Thanksgiving Day" is actually "Marine Day on July 20".
  • The dialogue system 300 compares this estimation result with the actual utterance content of the user, and when it detects that they do not match, it presents the estimation result to the user and utters the response sentence "Is it Marine Day on July 20?" in order to confirm the user's original intention of utterance.
  • In response to this response sentence from the dialogue system 300, the user says, "Yes, it's a substitute holiday for Marine Day," making an utterance indicating that the presented estimation result is the content the user intended. Therefore, the dialogue system 300 can advance the topic based on the estimation result.
  • Based on the length of the silent time according to the type of the user, which is estimated from the situation of the dialogue and the user's attributes/characteristics, and on the feature amounts of the user to be prioritized, the dialogue system 300 judges that a transition of the utterance turn has not occurred and waits for the user's utterance. In addition, the dialogue system 300 urges the user to speak by giving an aizuchi (backchannel response) "ho" to indicate that the transition of the utterance turn has not yet occurred.
  • After the user's long silent section, the dialogue system 300 can acquire useful personal information, such as that the user has a daughter in the first grade of elementary school and that Thursday is a substitute holiday for the daughter's athletic meet. The dialogue system 300 then gives an aizuchi, "That's nice," indicating that it is convinced, agrees, or acknowledges.
  • The dialogue system 300 can utilize the personal information acquired from the user in the subsequent dialogue, linking it, for example, to product recommendation or the conclusion of a business contract.
  • Since the dialogue system 300 both determines the utterance turn and estimates the user's original utterance intention, the user can continue speaking at his or her own pace without worrying that the utterance turn will be taken away by the dialogue system 300 or that he or she has spoken the wrong content.
  • In addition, since the dialogue system 300 presents the estimation result of the original utterance intention, the user keeps the motivation to continue the dialogue, and the topic can proceed in line with the original utterance intention.
  • the dialogue system 300 can retrieve useful and new personal information from a user who continues to speak freely.
  • FIG. 21 shows an example of dialogue with the user when the dialogue system 300 does not estimate the user's original utterance intention, as a comparison with the dialogue example shown in FIG. 20.
  • The dialogue system 300 utters "Are there any free days?" in order to decide an appointment date with the user. In response, the user answers, "Thursday, the day after Labor Thanksgiving Day, was free."
  • In this comparative example, the dialogue system 300 does not estimate the original intention of the user's utterance. Since "Labor Thanksgiving Day" falls on neither Wednesday nor Thursday and is correctly November 23, the dialogue system 300 simply responds, "Labor Thanksgiving Day is November 23."
  • The user loses motivation to continue the dialogue because the dialogue system 300 simply points out an error in the content of the utterance without understanding the user's original intention. From then on, the user becomes inclined to speak only information he or she is sure of.
  • The user therefore tries to finish the requirement briefly, saying, "Well, never mind. Thursday is a holiday, so Friday." Since the dialogue system 300 has been able to decide the appointment date with the user, it utters the response sentence "I understand" to show the user that it has understood.
  • The dialogue system 300 determines, based on the length of the silent time according to the type of the user estimated from the situation of the dialogue and the user's attributes/characteristics and on the feature amounts of the user to be prioritized, that a transition of the utterance turn has not occurred, and waits for the user's utterance. However, because the user has lost the motivation to continue the dialogue after having the error in the utterance pointed out, the user does not utter the next message that was in his or her mind during the silent section (for example, "That Thursday is a substitute holiday for my first-grade daughter's athletic meet"), and the dialogue between the dialogue system 300 and the user ends in silence.
  • FIG. 22 shows an example of dialogue with the user when the dialogue system 300 does not determine the utterance turn as a comparison with the dialogue examples shown in FIGS. 20 and 21.
  • The dialogue system 300 utters "Are there any free days?" in order to decide an appointment date with the user. In response, the user answers, "Thursday, the day after Labor Thanksgiving Day, was free."
  • The dialogue system 300 estimates that the original intention of the user's utterance is that Thursday, the day after the holiday, is free, and presumes that Wednesday's holiday is not "Labor Thanksgiving Day" but "Marine Day on July 20".
  • When the dialogue system 300 compares the estimation result with the actual utterance content of the user and detects that they do not match, it presents the estimation result to the user and utters the response sentence "Is it Marine Day on July 20?" in order to confirm the user's original intention of utterance.
  • In response to this response sentence from the dialogue system 300, the user makes an utterance indicating that the presented estimation result is the content the user intended, such as "Yes, it's a substitute holiday for Marine Day." Therefore, the dialogue system 300 can advance the topic based on the estimation result.
  • However, the dialogue system 300 does not determine the utterance turn according to the type of the user. For example, the dialogue system 300 utters the next response sentence, "How about Friday?", while the user is silently thinking of the next message in his or her head. The user therefore replies, "Ah, yes, that's fine."
  • The dialogue system 300 can thus quickly complete the task of deciding the appointment date with the user, but the dialogue ends without hearing the user's next message (for example, "That substitute holiday is the day of my first-grade daughter's athletic meet").
  • In contrast, the dialogue system 300 according to the present disclosure performs both determination of the utterance turn and estimation of the user's original utterance intention. This makes it easier to draw out useful and new information from the user while maintaining the user's willingness to continue the dialogue.
  • In this dialogue example, based on the dialogue situation of general business and the attributes/characteristics of the user who is the partner of the dialogue, the dialogue system 300 selects the utterance turn estimator that uses the user's line-of-sight information as an important factor for determining the utterance turn, and also selects the original utterance intention estimator suited to a dialogue about a business schedule. The dialogue system 300 thereby realizes utterances that impose a low cognitive load on the user, and realizes a dialogue in which the utterance turn is controlled so as not to interrupt utterances of the user that may be useful.
  • The utterance turn determination unit 306 controls the utterance turn during the dialogue with the user by using the utterance turn estimator selected from the estimator group. Further, the dialogue management unit 307 estimates the original intention of each utterance of the user by using the original intention estimator selected from the estimator group. When estimating the original intention of the user's utterance, the domain database, the synonym database, and the user model are used, as shown in FIG. 18. The estimator group and each database may be placed on the cloud in consideration of scalability, or on the dialogue system 300 side in consideration of dialogue performance.
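  • As a rough illustration of how estimators might be selected from the estimator group, the sketch below keys a small table on the dialogue situation and the user type; the situation labels, thresholds, and feature names are hypothetical, not from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class TurnEstimatorConfig:
    silence_threshold_s: float   # silent time treated as a possible turn transition
    priority_features: tuple     # user feature amounts weighted most heavily

# Estimator group, keyed by (dialogue situation, user type); all entries are
# illustrative assumptions.
ESTIMATOR_GROUP = {
    ("business", "adult"):     TurnEstimatorConfig(1.5, ("gaze",)),
    ("task_request", "adult"): TurnEstimatorConfig(1.5, ("gesture",)),
    ("chat", "child"):         TurnEstimatorConfig(3.0, ("prosody",)),  # grammar de-prioritized
    ("chat", "elderly"):       TurnEstimatorConfig(4.0, ("prosody",)),  # gaze weight lowered
}

def select_estimators(situation: str, user_type: str) -> TurnEstimatorConfig:
    # Fall back to a generic configuration when no specific entry exists.
    return ESTIMATOR_GROUP.get((situation, user_type),
                               TurnEstimatorConfig(2.0, ("gaze", "prosody")))

config = select_estimators("business", "adult")
print(config.silence_threshold_s, config.priority_features)
```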
  • Based on the domain (the current topic with the user), synonyms of "Labor Thanksgiving Day" (other holidays and the like), and the information in the calendar, the dialogue management unit 307 presumes that the user's utterance "Labor Thanksgiving Day (November 23)" originally intended "Marine Day on July 20", the nearest holiday (step S1401), and determines that the estimation result does not match the utterance content of the user (No in step S1402).
  • The dialogue management unit 307 presents the estimation result to the user and, in order to confirm the user's intention, utters the response sentence "Is it Marine Day on July 20?" (step S1404).
  • When the user responds, "Yes, it's a substitute holiday for Marine Day," indicating that the presented estimation result is the content the user intended (Yes in step S1405), the dialogue management unit 307 advances the topic based on the estimation result that the day the user is referring to is "Marine Day on July 20" (step S1406).
  • The utterance turn determination unit 306 estimates the situation in which the user interacts with the dialogue system 300 and the user's attributes/characteristics based on the sensor information input from the sensor unit 305, and determines the type of the user corresponding to the estimated dialogue situation and attributes/characteristics (step S801). In the dialogue example shown in FIG. 20, a dialogue situation of general business is assumed. The utterance turn determination unit 306 then adjusts, according to the type of the user determined in step S801, the length of the silent time serving as the determination criterion of the utterance turn and the feature amounts of the user to be prioritized (step S802).
  • Based on the length of the silent time according to the type of the user and on the feature amounts of the user to be prioritized, the utterance turn determination unit 306 checks whether a transition of the utterance turn has occurred (step S804). In the dialogue example shown in FIG. 20, the silent section continues for a long time, but it is determined from the user's feature amount that the line of sight is averted that the user still likely wants to talk (No in step S804), and the dialogue management unit 307 outputs the filler "ho" through the voice output unit 308 (step S806) to prompt the user to speak. After that, the user utters the message that had formed in his or her mind during the silent section, "That Thursday is a substitute holiday for my first-grade daughter's athletic meet," so the dialogue system 300 can newly acquire useful information from the user (step S807).
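  • The turn-transition check of step S804 could look like the following sketch, assuming the silence threshold and prioritized feature amounts have already been adjusted to the user's type in step S802; the feature names and numeric values are illustrative assumptions.

```python
def turn_has_transitioned(silence_s: float, features: dict,
                          silence_threshold_s: float) -> bool:
    """Return True if the utterance turn should pass to the system."""
    if silence_s < silence_threshold_s:
        return False                      # silence still too short
    if features.get("gaze_averted", False):
        return False                      # averted gaze: user is still thinking
    if features.get("ends_with_self_question", False):
        return False                      # trailing "...I wonder" style ending
    return True

# While the user stays silent but the turn has not transitioned (No in S804),
# the system emits a short filler such as "ho" to prompt further speech (S806).
if not turn_has_transitioned(silence_s=4.2,
                             features={"gaze_averted": True},
                             silence_threshold_s=3.0):
    print("ho...")   # filler; then keep waiting for the user's utterance (S807)
```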
  • FIG. 23 shows an example of dialogue between the dialogue system 300 and the user. Here, the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • The user speaks to the dialogue system 300 at 1:00 pm on Tuesday: "The laundry... isn't it going to rain tomorrow?"
  • Based on information acquired in advance (the above utterance was made at 1:00 pm on Tuesday), the dialogue system 300 presumes that the original intention of the user's utterance is to ask about the weather not for tomorrow (Wednesday) but for today (Tuesday). The dialogue system 300 then compares this estimation result with the actual utterance content of the user, presents the estimation result to the user, and utters the response sentence "No, it's raining from Wednesday" in order to confirm the original intention of the user's utterance.
  • In response to this response sentence from the dialogue system 300, the user says, "Then I'll wash and dry them," making an utterance indicating that the presented estimation result is the content the user intended. Therefore, the dialogue system 300 can advance the topic based on the estimation result.
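  • The date-oriented part of this estimate can be sketched as follows; the weather data and the decision rule (prefer a non-rainy day for drying laundry) are assumptions made for illustration.

```python
from datetime import date, timedelta

RAINY_DAYS = {date(2021, 6, 2)}           # assume rain starts on the Wednesday

def intended_weather_day(spoken_day: date, today: date) -> date:
    # If the day the user named is rainy but today is not, presume the user
    # actually wants today's weather (to dry laundry before the rain starts).
    if spoken_day in RAINY_DAYS and today not in RAINY_DAYS:
        return today
    return spoken_day

today = date(2021, 6, 1)                  # a Tuesday
tomorrow = today + timedelta(days=1)      # Wednesday, the day the user named
if intended_weather_day(tomorrow, today) != tomorrow:
    # Present the estimate and confirm the user's original intention.
    print("No, it's raining from Wednesday.")
```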
  • Based on the length of the silent time according to the type of the user, which is estimated from the situation of the dialogue and the user's attributes/characteristics, and on the feature amounts of the user to be prioritized, the dialogue system 300 judges that a transition of the utterance turn has not occurred and waits for the user's utterance.
  • Further, the dialogue system 300 gives an aizuchi, "That's right," indicating that it is convinced, agrees, or acknowledges.
  • Through the user's long silent section, the dialogue system 300 can acquire the personal information that the user "actually does not like drying the laundry very much".
  • The personal information acquired from the user in this way is considered valuable information for the organization to which the user belongs (for example, the user's household, which includes the dialogue system 300).
  • Since the dialogue system 300 both determines the utterance turn and estimates the user's original utterance intention, the user can continue speaking at his or her own pace without worrying that the utterance turn will be taken away by the dialogue system 300 or that he or she has spoken the wrong content.
  • In addition, since the dialogue system 300 presents the estimation result of the original utterance intention, the user keeps the motivation to continue the dialogue, and the topic can proceed in line with the original utterance intention.
  • the dialogue system 300 can retrieve useful and new information from a user who continues to speak freely.
  • FIG. 24 shows an example of dialogue with the user when the dialogue system 300 does not estimate the intention of the user's original utterance as a comparison with the dialogue example shown in FIG. 23.
  • The user speaks to the dialogue system 300 at 1:00 pm on Tuesday: "The laundry... isn't it going to rain tomorrow?"
  • In this comparative example, the dialogue system 300 does not estimate the user's original utterance intention, and since tomorrow's (Wednesday's) weather is rain, it simply utters the response sentence "Yes, it's raining," which merely conveys that fact.
  • The user loses the motivation to do the laundry today (Tuesday) and to continue the dialogue, because the dialogue system 300 merely informs the user of tomorrow's weather without understanding the user's original intention of utterance. The user therefore utters "Well, never mind" and ends the dialogue. In the dialogue example shown in FIG. 24, the dialogue system 300 thus cannot acquire the personal information that the user "actually does not like drying the laundry very much".
  • FIG. 25 shows an example of dialogue with the user when the dialogue system 300 does not determine the utterance turn as a comparison with the dialogue examples shown in FIGS. 23 and 24.
  • The user speaks to the dialogue system 300 at 1:00 pm on Tuesday: "The laundry... isn't it going to rain tomorrow?"
  • Based on information acquired in advance (the above utterance was made at 1:00 pm on Tuesday), the dialogue system 300 presumes that the original intention of the user's utterance is to ask about the weather not for tomorrow (Wednesday) but for today (Tuesday). The dialogue system 300 compares this estimation result with the actual utterance content of the user, presents the estimation result to the user, and utters the response sentence "No, it's raining from Wednesday" in order to confirm the user's original intention of utterance. The user then makes an utterance indicating that the presented estimation result is the content the user intended, such as "Then I'll wash and dry them."
  • However, the dialogue system 300 does not determine the utterance turn according to the type of the user. For example, the dialogue system 300 utters the next response sentence, "OK, should I buy more detergent?", while the user is silently thinking of the next message in his or her head. The user therefore replies, "Oh, is it running out? Let's get a little extra."
  • As a result, the dialogue system 300 can confirm the separate requirement of purchasing additional detergent, but because the user no longer feels "I still want to talk," the dialogue ends without the dialogue system 300 obtaining the personal information that the user "actually does not like drying the laundry very much".
  • In contrast, the dialogue system 300 according to the present disclosure performs both determination of the utterance turn and estimation of the user's original utterance intention. This makes it easier to draw out useful and new information from the user while maintaining the user's willingness to continue the dialogue.
  • Note that the dialogue system 300 may, after determining the utterance turn and estimating the user's original utterance intention, also confirm another requirement as in FIG. 25 if necessary. In this way, it is possible both to increase the user's motivation to continue the dialogue and to draw more useful information out of the user.
  • In this dialogue example, based on the dialogue situation of a task request and the attributes/characteristics of the user who is the partner of the dialogue, the dialogue system 300 selects the utterance turn estimator that uses the user's gesture information as an important factor for determining the utterance turn, and also selects the original utterance intention estimator suited to a dialogue about laundry and dates and times. The dialogue system 300 thereby realizes utterances that impose a low cognitive load on the user, and realizes a dialogue in which the utterance turn is controlled so as not to interrupt utterances of the user that may be useful.
  • The utterance turn determination unit 306 controls the utterance turn during the dialogue with the user by using the utterance turn estimator selected from the estimator group. Further, the dialogue management unit 307 estimates the original intention of each utterance of the user by using the original intention estimator selected from the estimator group. When estimating the original intention of the user's utterance, the domain database, the synonym database, and the user model are used, as shown in FIG. 18. The estimator group and each database may be placed on the cloud in consideration of scalability, or on the dialogue system 300 side in consideration of dialogue performance.
  • The dialogue management unit 307 estimates that the original intention of the user's utterance is to ask about Tuesday's weather, based on the domain (the current topic with the user) and the current time and date (step S1401), and determines that the estimation result does not match the utterance content of the user (No in step S1402).
  • The dialogue management unit 307 presents the estimation result to the user and utters the response sentence "No, it's raining from Wednesday" in order to confirm the user's intention (step S1404).
  • When the user utters "Then I'll wash and dry them," indicating that the estimation result presented by the dialogue management unit 307 is the content the user intended (Yes in step S1405), the dialogue management unit 307 advances the topic based on the estimation result that the day whose weather the user is asking about is Tuesday (step S1406).
  • The utterance turn determination unit 306 estimates the situation in which the user interacts with the dialogue system 300 and the user's attributes/characteristics based on the sensor information input from the sensor unit 305, and determines the type of the user corresponding to the estimated dialogue situation and attributes/characteristics (step S801). In the dialogue example shown in FIG. 23, a dialogue situation of a task request is assumed. The utterance turn determination unit 306 then adjusts, according to the type of the user determined in step S801, the length of the silent time serving as the determination criterion of the utterance turn and the feature amounts of the user to be prioritized (step S802).
  • Based on the length of the silent time according to the type of the user and on the feature amounts of the user to be prioritized, the utterance turn determination unit 306 checks whether a transition of the utterance turn has occurred (step S804). In the dialogue example shown in FIG. 23, the silent section continues for a long time, but since the user's immediately preceding utterance ends in a self-directed question, it is determined that a transition of the utterance turn has not occurred (No in step S804), and the dialogue management unit 307 waits for the next utterance while observing the state of the user (step S806). After that, the user utters the message that had formed in his or her mind during the silent section, "Actually, I don't really like drying the laundry that much," so the dialogue system 300 can newly acquire useful personal information from the user (step S807).
  • FIG. 26 shows an example of dialogue between the dialogue system 300 and the user.
  • the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • The user hears the sound of a siren and utters to the dialogue system 300, "A police car went by, peepo peepo."
  • Based on information acquired in advance (for example, that an ambulance siren sounds "peepo peepo" while a police car siren sounds "u-u-"), the dialogue system 300 presumes that the original intention of the utterance (the siren sound the user wanted to convey) was not a police car but an ambulance. The dialogue system 300 then compares this estimation result with the actual utterance content of the user, presents the estimation result to the user, and utters the response sentence "An ambulance?" in order to confirm the user's original intention of utterance.
  • In response to this response sentence from the dialogue system 300, the user says, "Yes! An ambulance! Someone fell down! It was amazing!", making an utterance indicating that the presented estimation result is the content the user intended. Therefore, the dialogue system 300 can advance the topic based on the estimation result.
  • When the dialogue system 300 judges, based on the length of the silent time according to the type of the user estimated from the situation of the dialogue and the user's attributes/characteristics and on the feature amounts of the user to be prioritized, that a transition of the utterance turn has not occurred, it gives an aizuchi "Yeah" and waits for the user's utterance.
  • Through the user's long silent section, the dialogue system 300 can acquire the information that the user "stayed behind at school". "Staying behind at school" is trivial information for the child, but can be important information for the parents.
  • Since the dialogue system 300 both determines the utterance turn and estimates the user's original utterance intention, the user can continue speaking at his or her own pace without worrying that the utterance turn will be taken away by the dialogue system 300 or that he or she has spoken the wrong content.
  • In addition, since the dialogue system 300 presents the estimation result of the original utterance intention, the user keeps the motivation to continue the dialogue, and the topic can proceed in line with the original utterance intention.
  • the dialogue system 300 can retrieve useful and new information from a user who continues to speak freely.
  • FIG. 27 shows an example of dialogue with the user when the dialogue system 300 does not estimate the intention of the user's original utterance as a comparison with the dialogue example shown in FIG. 26.
  • The user hears the sound of a siren and utters to the dialogue system 300, "A police car went by, peepo peepo."
  • In this comparative example, the dialogue system 300 does not presume the user's original intention of utterance, that is, that the user actually heard an ambulance rather than a police car, and utters the response sentence "A police car that goes peepo peepo?"
  • In response to this utterance, which the dialogue system 300 returned without understanding the original intention, the user notices his or her own misunderstanding and utters, "What kind of sound does a police car make?" As a result, the topic shifts to what a police car siren sounds like. Therefore, in the dialogue example shown in FIG. 27, the dialogue system 300 cannot acquire the information that the user "stayed behind at school".
  • FIG. 28 shows an example of dialogue with the user when the dialogue system 300 does not determine the utterance turn as a comparison with the dialogue examples shown in FIGS. 26 and 27.
  • The user hears the sound of a siren and utters to the dialogue system 300, "A police car went by, peepo peepo."
  • Based on information acquired in advance (for example, that an ambulance siren sounds "peepo peepo" while a police car siren sounds "u-u-"), the dialogue system 300 presumes that the original intention of the utterance (the siren sound the user wanted to convey) was not a police car but an ambulance. The dialogue system 300 then compares this estimation result with the actual utterance content of the user, presents the estimation result to the user, and utters the response sentence "An ambulance?" in order to confirm the user's original intention of utterance.
  • However, the dialogue system 300 does not determine the utterance turn according to the type of the user. For example, the dialogue system 300 utters the next response sentence, "I like ambulances," while the user is silently thinking of the next message in his or her head. The user therefore replies, "Yeah, but these days I like dump trucks too! They don't come around, though."
  • The topic thus shifts to what kind of vehicles the user likes. Therefore, the dialogue system 300 cannot acquire the information that the user "stayed behind at school".
  • In contrast, the dialogue system 300 according to the present disclosure performs both determination of the utterance turn and estimation of the user's original utterance intention. This makes it easier to draw out useful and new information from the user while maintaining the user's willingness to continue the dialogue.
  • Note that the dialogue system 300 may incorporate an utterance such as "I like ambulances" in FIG. 28 as an aizuchi resulting from the determination of the utterance turn. In this case, an appropriate filler or aizuchi is selected based on the user's feature amounts and the like so as to maintain the user's willingness to continue the dialogue.
  • In this dialogue example, based on the situation of a chat with a child and the attributes/characteristics of the user, a child who is still developing language and from whom it is difficult to acquire the utterance turn, the dialogue system 300 selects the utterance turn estimator and the original utterance intention estimator with the priority of the grammatical knowledge component lowered. The dialogue system 300 thereby realizes utterances that impose a low cognitive load on the user, and realizes a dialogue in which the utterance turn is controlled so as not to interrupt utterances of the user that may be useful.
  • The utterance turn determination unit 306 controls the utterance turn during the dialogue with the user by using the utterance turn estimator selected from the estimator group. Further, the dialogue management unit 307 estimates the original intention of each utterance of the user by using the original intention estimator selected from the estimator group. When estimating the original intention of the user's utterance, the domain database, the synonym database, and the user model are used, as shown in FIG. 18. The estimator group and each database may be placed on the cloud in consideration of scalability, or on the dialogue system 300 side in consideration of dialogue performance.
  • The user utters, "A police car went by, peepo peepo."
  • Based on information acquired in advance (for example, that an ambulance siren sounds "peepo peepo" while a police car siren sounds "u-u-"), the dialogue management unit 307 estimates that the original intention of the utterance (the siren sound the user wanted to convey) is not a police car but an ambulance (step S1401), and determines that the estimation result does not match the content of the user's utterance (No in step S1402).
  • The dialogue management unit 307 presents the estimation result to the user and utters the response sentence "An ambulance?" in order to confirm the user's intention (step S1404).
  • the user said, "Yes! Ambulance! I fell down! It's amazing! And showed that the estimation result presented by the dialogue management unit 307 was the content intended by the user (step). Yes) of S1405, the dialogue management unit 307 advances the topic based on the estimation result that the sound of the siren heard by the user is an "ambulance" (step S1406).
  • The utterance turn determination unit 306 estimates the situation in which the user interacts with the dialogue system 300 and the user's attributes/characteristics based on the sensor information input from the sensor unit 305, and determines the type of the user corresponding to the estimated dialogue situation and attributes/characteristics (step S801). In the dialogue example shown in FIG. 26, a situation of chatting with a child who is still developing language is assumed. The utterance turn determination unit 306 then adjusts, according to the type of the user determined in step S801, the length of the silent time serving as the determination criterion of the utterance turn and the feature amounts of the user to be prioritized (step S802).
  • Based on the length of the silent time according to the type of the user and on the feature amounts of the user to be prioritized, the utterance turn determination unit 306 checks whether a transition of the utterance turn has occurred (step S804).
  • In the dialogue example shown in FIG. 26, the silent section continues for a long time, and for an adult it could be determined that the utterance has ended; however, it is determined that the user, being a child, still has a high possibility of wanting to speak (No in step S804), and the dialogue management unit 307 waits for the next utterance while observing the state of the user (step S806).
  • In this application example, the dialogue management unit 307 estimates the user's utterance intention using the domain database 1801, the synonym database 1802, and the user model 1803.
  • As the domain database 1801 in this application example, a facility database containing information such as the meal menus and equipment provided in the facility is used.
  • Since utterance errors by elderly users are likely, the dialogue system 300 manages the dialogue while increasing the weight given to estimation of the original utterance intention.
  • Since it is not always necessary to share the objectively correct answer, the dialogue system 300 may estimate the original intention of the utterance so as to match the user's subjective correct answer while considering the possibility of utterance errors.
  • Conversely, it may be judged that estimating the original intention of the utterance based on the objectively correct answer is educationally preferable, and the processing may be performed accordingly.
  • Regarding utterance turn estimation: elderly people have a slower dialogue tempo than non-elderly people. In addition, although gaze detection is often useful for estimating utterance turns for Japanese speakers, gaze detection performance is low for elderly people. Therefore, an utterance turn estimator is selected in which, among the user's feature amounts, the weight of the line of sight is lowered and the silence time used as the criterion is lengthened.
  • In this way, the dialogue system 300 can obtain useful information from elderly users by taking the characteristics of the elderly into account and refraining from acquiring the utterance turn during silent sections that would be sufficiently long for non-elderly users. Further, when the dialogue system 300 starts an utterance, a filler may be added at the beginning to compensate for estimation errors of the utterance turn.
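  • A possible parameter adjustment for an elderly user, reflecting the description above (gaze weight lowered, silence time lengthened), is sketched below; the numeric values are assumptions.

```python
DEFAULT_WEIGHTS = {"gaze": 0.5, "prosody": 0.3, "linguistic": 0.2}
DEFAULT_SILENCE_S = 1.5

def adjust_for_elderly(weights: dict, silence_s: float):
    adjusted = dict(weights)
    adjusted["gaze"] *= 0.2               # gaze detection is unreliable here
    # Re-normalize so the remaining features carry the freed-up weight.
    total = sum(adjusted.values())
    adjusted = {k: v / total for k, v in adjusted.items()}
    return adjusted, silence_s * 2.5      # wait much longer before taking the turn

weights, silence = adjust_for_elderly(DEFAULT_WEIGHTS, DEFAULT_SILENCE_S)
print(weights, silence)                   # gaze de-emphasized, 3.75 s threshold
```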
  • FIG. 29 shows an example of dialogue between the dialogue system 300 and the elderly in the elderly home.
  • the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • The dialogue system 300 selects an original utterance intention estimator that estimates the original utterance intention so as to match the subjective correct answer while considering the possibility of utterance errors by the user.
  • Taking the possibility of error into account, this estimator generates a query consisting of the disjunction of the user's utterance "soft" and its synonyms, and compares it with today's menu extracted from the facility database (rice, spinach salad, grilled salmon, orange cake, ...). Since there is no match, the user's utterance content is judged unlikely; however, considering the possibility of utterance errors peculiar to the elderly, the estimation result of the original utterance intention is not presented at this point.
  • Since the dialogue system 300 has selected the utterance turn estimator that refrains from acquiring the utterance turn even in a silent section that would be sufficiently long for non-elderly users, it refrains from acquiring the utterance turn here. Then, after a long silent section, the user says, "Oh, my dentures haven't been fitting well these days," and "That thing with apples was easy to eat."
  • Using the estimator selected for the elderly, as shown in FIG. 31, the dialogue system 300 generates a query from the user's utterance "apple" and the synonyms "pear", "orange", and so on extracted from the synonym database in consideration of the possibility of error, matches it against today's menu (rice, spinach salad, grilled salmon, orange cake, ...) extracted from the facility database, and presumes that the original intention of the user's utterance "that thing with apples" is the "orange cake".
  • the dialogue system 300 presents the estimation result to the user and utters a response sentence "Eh-orange cake?" In order to confirm the user's original intention of utterance.
  • The user makes an utterance indicating that the presented estimation result is the content the user intended, saying, "Oh, that's right, the sweetness was modest." Therefore, the dialogue system 300 can advance the topic based on the estimation result.
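  • A minimal sketch of this facility-menu matching, including the behavior of withholding the estimate when nothing matches, might look as follows; the menu, synonym lists, and matching rule are illustrative assumptions.

```python
SYNONYMS = {"apple": ["pear", "orange"], "soft": []}
TODAYS_MENU = ["rice", "spinach salad", "grilled salmon", "orange cake"]

def match_menu(uttered: str):
    query_terms = [uttered] + SYNONYMS.get(uttered, [])   # disjunctive query
    for item in TODAYS_MENU:
        if any(term in item for term in query_terms):
            return item
    return None        # no match: stay quiet, allowing for elderly utterance errors

print(match_menu("soft"))    # None -> estimation result is not presented
print(match_menu("apple"))   # "orange cake" -> "Eh, the orange cake?"
```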
  • By using the utterance turn estimator and the original utterance intention estimator for the elderly, the dialogue system 300 can realize a dialogue suited to the attributes and characteristics of the elderly. As a result, the dialogue system 300 can acquire useful information about the user as follows.
  • the information acquired by the dialogue system 300 through the dialogue with the user is incorporated into the user model and used for the estimation of the original utterance intention thereafter.
  • The information acquired by the dialogue system 300 as described above is useful for the user's stakeholders, such as doctors, facility staff, family members, and helpers.
  • the dialogue system 300 may output the information acquired through the dialogue with the user while estimating the original speech intention of the user to an external database used by doctors, facility staff, and the like.
  • In this application example, the initial language setting is Japanese. Since it is more difficult to predict the end of a sentence in Japanese than in English, an utterance turn estimator that prioritizes phonological features over linguistic features is selected.
  • In this application example, the dialogue management unit 307 estimates the user's utterance intention using the domain database 1801, the synonym database 1802, and the user model 1803.
  • a music database is used as the domain database 1801 in this application example.
  • The dialogue system 300 advances the dialogue while generating queries that take into account the possibility of errors in the content of the user's utterances.
  • Since utterances become more emotional in topics about personal tastes, the user's emotion may be sensed and used to weight the user model.
  • FIG. 32 shows an example of a dialogue between the dialogue system 300 and the user in a musical chat.
  • the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • the user says, "I want to listen to some music.”
  • the dialogue system 300 utters "Well, what should I do?" And asks the user what music he wants to hear.
  • The silent section here is long enough that, on average, it would be treated as a transition of the utterance turn. However, since the dialogue system 300 has selected an utterance turn estimator that takes Japanese sentence-end phonology into consideration and the sentence end cannot be predicted, it does not determine that a transition of the utterance turn has occurred.
  • the dialogue system 300 selects an estimator of the original utterance intention that attempts to generate a query while considering the possibility of the user's utterance error.
  • This estimator generates a query that combines the user's utterances with similar words extracted from the synonym database 1802. Specifically, as shown in FIG. 33, it generates a query that takes the logical product of the user's utterance "70's" and its similar words "80's", "60's", ..., the user's utterance "West Germany" and its similar words "East Germany", "Germany", "Belgium", "Switzerland", ..., and the user's utterance "girl duo" and its similar words "female trio", "duo", "twin vocal", .... Then, when candidates for the original intention of the user's utterance are extracted by searching the music database, the number of candidates is enormous but the likelihood of each candidate is low, so the original intention of the user's utterance cannot be estimated at this point.
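  • The conjunction-of-disjunctions query of FIG. 33 can be sketched as follows; the tiny music database, the similar-word lists, and the field names are illustrative assumptions, not data from the disclosure.

```python
MUSIC_DB = [
    {"era": "80's", "country": "West Germany", "act": "sister duo", "label": "B label"},
    {"era": "70's", "country": "France", "act": "female trio", "label": "C label"},
]

SIMILAR = {
    "70's": ["80's", "60's"],
    "West Germany": ["East Germany", "Germany", "Belgium", "Switzerland"],
    "girl duo": ["female trio", "duo", "twin vocal", "sister duo"],
}

def search(utterances: dict):
    """Each field must match the uttered word OR a similar word (disjunction);
    all fields must match at once (conjunction, i.e. logical product)."""
    results = []
    for record in MUSIC_DB:
        ok = True
        for field, spoken in utterances.items():
            allowed = {spoken} | set(SIMILAR.get(spoken, []))
            if record[field] not in allowed:
                ok = False
                break
        if ok:
            results.append(record)
    return results

print(search({"era": "70's", "country": "West Germany", "act": "girl duo"}))
# -> the 80's West German sister duo; likelihood rises as more terms conjoin
```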
  • Next, when the dialogue system 300 generates a query that takes the logical product of the user's utterance "XXX" and its similar words "YYY", "ZZZ", ..., and the user's utterance "owner" and its similar words "staff", ..., and searches the music database, it estimates with moderate likelihood that the label name originally intended by the user's utterance is the "B label" whose owner is XXX.
  • The dialogue system 300 presents the estimation result to the user and utters the response sentence "Is it the B label?" in order to confirm the user's original intention of utterance.
  • the user makes an utterance indicating that the presented estimation result is the content intended by the user, such as "Oh hi B label, and cousins.” Therefore, the dialogue system 300 can advance the topic based on the estimated result.
  • The dialogue system 300 then calculates the logical product of the queries generated so far with the user's utterance "cousins" and its similar words "sisters", "cousins", "brothers", and so on, and presumes with high likelihood that the act the user originally intended is "AAA", the sister female duo from 1980s West Germany on the B label.
  • the user uttered "70? Years”, but from the generated query, it is estimated that the original intention of the user's utterance was "80s”.
  • Likewise, the user spoke of a "cousin" act, but from the generated query it is presumed that the original intention of the user's utterance was the "sister" act "AAA".
  • the dialogue system 300 presents the estimation result to the user and utters a response sentence "AAA if it is a sister?" In order to confirm the user's original intention of utterance.
  • the user makes an utterance indicating that the presented estimation result is the content intended by the user, such as "Oh, that song, that song, I want to listen to title 1". .. Therefore, the dialogue system 300 can advance the topic based on the estimated result.
  • The dialogue system 300 takes the logical product of the results of the queries generated so far with the user's utterance "Title 1" and its similar words "Title A", "Title B", ..., and estimates with moderate likelihood that the title of the song the user wants to hear, by the originally intended act "AAA", is "Title A".
  • the dialogue system 300 presents the estimation result to the user and utters a response sentence "Title A, West German music at that time is good! In order to confirm the user's original intention of utterance. do.
  • the user responds to this response from the dialogue system 300 by saying, "That! I don't know if West Germany is good, but I like the B label, right?" Make the utterance shown.
  • When the dialogue system 300 generates a query as shown in FIGS. 33 to 36 and there are a plurality of candidates, it may present the higher-ranking candidates on the display unit 309 or the like so that the user can confirm his or her own original utterance intention and convey it to the dialogue system 300 easily. In a chat about music there are innumerable proper nouns related to the content under discussion, and a large number of candidates are easily extracted, so it is convenient if the user can confirm them on the screen in addition to through the dialogue.
  • In this way, the dialogue system 300 searches the music database with queries generated in consideration of the fact that innumerable proper nouns exist in a chat about music and that errors in utterance content due to the user's memory errors are highly likely, and proceeds with the dialogue while estimating the original intention of the user's utterances. As a result, the dialogue system 300 can acquire useful information about the user as follows.
  • the information acquired by the dialogue system 300 through the dialogue with the user is incorporated into the user model and used for the estimation of the original utterance intention thereafter.
  • The dialogue system 300 can also narrow down the candidates using the already acquired user model. For example, when the user makes an ambiguous song request as in the dialogue example shown in FIG. 32, raising the priority of the B label makes it easier to obtain the query the user intended.
  • the user model acquired through music chat can be utilized, for example, in music and content marketing.
  • the dialogue system 300 may output the information acquired through the dialogue with the user while estimating the original speech intention of the user to an external database used by the service provider or the like.
  • In this application example, the dialogue management unit 307 estimates the user's utterance intention using the domain database 1801, the synonym database 1802, and the user model 1803.
  • a medical database is used as the domain database 1801 in this application example.
  • The user in this application example is a patient, who is often weakened by disease, misremembers transient symptoms, or mistakes technical terms. Therefore, the dialogue system 300 proceeds with the dialogue while generating queries that take into account the possibility of errors in the content of the user's utterances.
  • The dialogue system 300 searches the medical database with queries generated from the information contained in the user's utterances, narrows down the candidate group for the original intention of the user's utterance by taking logical products, and estimates the original intention. The estimation result is presented to the user, and the dialogue proceeds while the user's original utterance intention is confirmed.
  • In addition, the dialogue system 300 estimates the user's original utterance intention with the help of camera-based understanding of demonstratives.
  • For this purpose, the dialogue system 300 may be equipped with a facial expression analyzer, a gesture recognizer, and the like, in addition to a camera for judging the user's complexion and understanding demonstratives.
  • FIG. 37 shows an example of a dialogue between the dialogue system 300 and the user (patient). Here, the vertical axis is the time axis.
  • The dialogue system 300 determines the utterance turn according to the processing procedure shown in FIG. 8 (first embodiment), and estimates the user's original utterance intention according to the processing procedure shown in FIG. 14 (second embodiment).
  • the dialogue system 300 utters "How was it?" And the interview with the user starts.
  • The user recalls the symptoms, says, "I have a pain in my flank, and my temperature has been around 38 degrees the whole time," and enters a silent section.
  • The silent section here is long enough that, on average, it would be treated as a transition of the utterance turn.
  • However, the dialogue system 300 has selected an estimator that estimates the utterance turn by giving priority to the user's line of sight and face orientation, and since the user's line of sight and face orientation indicate an intention to continue speaking, it does not judge this as a transition of the utterance turn.
  • the dialogue system 300 selects an estimator of the original utterance intention that attempts to generate a query while considering the possibility of the user's utterance error.
  • This estimator generates queries that combine the user's utterances with similar words extracted from the synonym database 1802. Specifically, as shown in FIG. 38, it generates a query that takes the logical product of the user's utterance "flank" with its similar words "lower abdomen" and so on, and the user's utterance "38 degrees" with an error margin of two degrees above and below, and searches the medical database to extract candidates for the original intention of the user's utterance. At this point the number of candidates is enormous but the likelihood of each candidate is low, so the original intention of the user's utterance cannot yet be estimated.
  • Next, based on additional utterances from the user, the dialogue system 300 generates a query obtained by taking the logical product of the query shown in FIG. 38 with the symptoms "hard to put shoes on" and "has an appetite".
  • As a result, the first candidate for the disease name originally intended by the user's utterance is "epididymitis", and a second candidate disease name is also extracted.
  • the user's utterance "flank pain” was originally intended to be pain in "lower abdomen” which is a synonym for "flank”.
  • The dialogue system 300 presents the estimation result regarding the symptom to the user and utters the response sentence "Isn't it the lower abdomen?" in order to confirm the user's original intention of utterance.
  • In response to this response sentence from the dialogue system 300, the user utters an answer including a demonstrative, "Well, it hurts around here." When the dialogue system 300 understands, through camera-based understanding of demonstratives, that the "around here" the user indicates is the "lower abdomen", it applies the query shown in FIG. 39 and presumes that the disease name originally intended by the user's utterance is "epididymitis". The dialogue system 300 then utters "It may be epididymitis" in order to present the estimation result regarding the disease name to the user.
  • In this way, the dialogue system 300 advances the dialogue while estimating the original intention of the user's utterances, taking into account that a patient user is often weakened by disease, misremembers transient symptoms, or misunderstands technical terms.
  • As a result, the dialogue system 300 can conduct a more accurate medical interview.
  • the dialogue system 300 may output the information acquired through the dialogue with the user while estimating the original speech intention of the user to an external database used by medical personnel and the like.
  • In general, a user utters after organizing the N thoughts (thought 1, thought 2, ..., thought N) floating in his or her mind, for example by thinning some of them out or by summarizing them into a single thought.
  • In that process, useful information about the user may be lost.
  • the information thinned out by the speaker includes useful information that can be utilized by the listener.
  • FIG. 40 shows information thinned out by the user and an example of utilizing the information.
  • For example, a patient may decide that a trivial symptom is unrelated to the cause and omit it from an utterance, but it may be a clue for the doctor in determining the cause.
  • When children are spoken to one after another, they stop uttering subtopics, which could have been a clue for parents and teachers to improve their education. Since innumerable proper nouns appear in topics about content such as music and movies, the user's search query becomes ambiguous and tends to be thinned out, even though it expresses the user's preferences and can be a clue for marketing, such as recommendation engines, and for usability improvement.
  • In chats with business partners, some of the utterances that tend to be thinned out when the utterance turn is taken away contain information that can be used for sales and marketing.
  • In chats with residents of facilities such as elderly housing with care services, some of the utterances that are thinned out when the utterance turn switches after a short silent section contain information that can lead to health checks by doctors, service improvements by facility staff, and peace of mind for families.
  • The interval at which thoughts come to mind varies from user to user; some users speak one after another, while others speak across long silent sections.
  • If the transition of the utterance turn is judged on a short silent section, then for a user who takes a long time to think, the utterance turn transitions to the voice agent before the next utterance, and the user is not given a sufficient opportunity to speak. As a result, it becomes difficult to obtain useful information from the user.
  • In contrast, since the dialogue system 300 according to the present disclosure controls the utterance turn based on the user's feature amounts according to the dialogue situation and the user's attributes/characteristics, it can accommodate an utterance style in which the user speaks the thoughts in his or her mind just as they occur. Because the user can speak intermittently at his or her own pace without thinning out those thoughts, the dialogue system 300 can acquire more useful information from the user.
  • Furthermore, the dialogue system 300 has a function of estimating the original intention of the user's utterance and can present the estimation result to the user. Therefore, accurate information can be obtained from the user even when the user speaks one after another without organizing the thoughts that come to mind, and at least some of the utterances may contain errors.
  • FIG. 41 shows an example of an utterance sequence in which a user intermittently utters the group of thoughts floating in his or her mind without organizing them.
  • In the utterance style assumed in the second embodiment of item E above, the user utters all the thoughts floating in his or her mind without thinning them out.
  • The interval at which the user speaks is not constant; it may be short, or it may be relatively long.
  • the dialogue partner such as the dialogue system 300 can hear all the thought groups that the user has in his or her brain.
  • FIG. 42 shows, as a comparison, an example of an utterance sequence in which a user organizes and utters a group of thoughts that have floated in the brain.
  • In this case, the user does not speak at all from the first thought (thought 1) until the last thought (thought N) comes to mind, and then speaks only once, after a delay time for organizing all the thoughts (thoughts 1 to N) in his or her mind.
  • In the course of that organizing, some thoughts are thinned out or several thoughts are merged, and the amount of information is reduced.
  • On the other hand, in the utterance sequence example shown in FIG. 41, in which the user utters intermittently without organizing the thoughts floating in his or her mind, some of the utterances may contain errors.
  • In FIG. 41, utterances that may contain errors are shown in gray.
  • The dialogue system 300 according to the present disclosure has a function of estimating the original intention of the user's utterance, and can present the estimation result to the user, confirm the intention of the utterance, and proceed with the topic. Therefore, the user can freely continue to speak at his or her own pace without hesitating over the thoughts floating in his or her mind.
  • The dialogue system 300 may indicate to the user, via the display unit 309 or an indicator (not shown), that the utterance turn determination and the original intention estimation processing are being performed.
  • For example, an image such as a character or an icon displayed on the display unit 309 may express, through facial expressions or gestures, the state of waiting for the user's utterance or the state of estimating the original intention of the user's utterance.
  • Alternatively, an indicator such as a light provided on the device realizing the dialogue system 300, or an icon displayed on the display unit 309, may indicate the state to the user by the color of the light or the number of blinks.
  • For each of its states, such as waiting for the user's first utterance, accepting the user's utterance, speaking, estimating the utterance turn, and estimating the original intention, the dialogue system 300 may notify the user of the current state by changing the voice, facial expression, gesture, color, blinking, effects, and the like, using any one of the voice output unit 308, the display unit 309, and the indicator, or a combination of two or more of these.
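  • One possible way to organize such state indication is sketched below; the set of states follows the text above, while the concrete colors and blink counts are assumptions.

```python
from enum import Enum, auto

class DialogueState(Enum):
    WAITING_FOR_FIRST_UTTERANCE = auto()
    ACCEPTING_UTTERANCE = auto()
    SPEAKING = auto()
    ESTIMATING_TURN = auto()
    ESTIMATING_ORIGINAL_INTENTION = auto()

# Hypothetical mapping of each state to an indicator color and blink count.
INDICATOR = {
    DialogueState.WAITING_FOR_FIRST_UTTERANCE:   ("blue",   0),  # steady blue
    DialogueState.ACCEPTING_UTTERANCE:           ("green",  0),
    DialogueState.SPEAKING:                      ("white",  0),
    DialogueState.ESTIMATING_TURN:               ("yellow", 2),  # blink twice
    DialogueState.ESTIMATING_ORIGINAL_INTENTION: ("yellow", 3),
}

def notify(state: DialogueState) -> None:
    color, blinks = INDICATOR[state]
    print(f"indicator: {color}, blinks: {blinks}")

notify(DialogueState.ESTIMATING_ORIGINAL_INTENTION)
```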
  • In estimating the original intention of the user's utterance, the dialogue system 300 may use a broadcast program guide (including those for audio content and online content), a list of files held by the device (recorded programs, downloaded files, and installed applications), a calendar, news information (including product and content release information, weather information, and alerts and emergency information), website and service usage information (search history, purchase history, viewing history, and favorites information), and current or past location information. Specifically, when the user makes an utterance about content as in the example shown in FIG. 32, the original intention of the user's utterance may be estimated based on the purchase history, viewing history, recorded programs, and the like of past content.
  • Part of the processing in the dialogues exemplified in the present specification may be performed by displaying on the display unit 309, or by presenting a plurality of candidates to the user and letting the user choose.
  • For example, the dialogue system 300 may display on the display unit 309 a plurality of results of estimating the original intention of an utterance related to content, and let the user select one using an operation unit of the device (including operation buttons and a touch panel, not shown), an operation device (such as a remote controller or a smartphone), voice input, or the like.
  • Further, the dialogue system 300 may display, as estimation candidates during the dialogue processing, images corresponding to a program guide, a recorded program list, a favorite content list, and the like, and accept from the user a touch panel operation (or voice input, or input via the operation device) corresponding to the displayed image.
  • An implementation is also possible in which the display image showing the selection candidates is output to an external device such as a smartphone via a wired or wireless connection, and the dialogue system 300 continues the dialogue processing in response to the input made on that external device.
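  • A minimal sketch of this candidate-selection flow, assuming hypothetical show and wait_for_selection callbacks that abstract over the display unit 309, the touch panel, the remote controller, voice input, and an external device:

```python
from typing import Callable, Optional, Sequence

class CandidatePicker:
    """Show estimation candidates on any available display (the display
    unit 309 or a connected smartphone) and return the user's selection,
    whichever input channel it arrives on."""

    def __init__(self, show: Callable[[Sequence[str]], None],
                 wait_for_selection: Callable[[], Optional[int]]) -> None:
        self._show = show                              # e.g. render a program-guide list
        self._wait_for_selection = wait_for_selection  # touch / remote / voice / external

    def pick(self, candidates: Sequence[str]) -> Optional[str]:
        self._show(candidates)
        index = self._wait_for_selection()    # blocks until input or timeout
        if index is None or not (0 <= index < len(candidates)):
            return None                       # no valid choice; dialogue continues
        return candidates[index]
```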
  • When acquiring the dialogue situation, the user attributes/characteristics, and the user feature amount information shown in FIG., the dialogue system 300 may use the setting information (accessibility settings) of the device realizing the dialogue system 300, the setting information of linked devices, the setting information of the account associated with the device or the user, the result of inquiring of the user before the start of the dialogue, and the like.
  • The user profile sensor unit 440 may be used to acquire such information.
  • Based on the information acquired using the sensor unit 305 and the user information obtained during the dialogue, the dialogue system 300 may set or update the device setting information (including accessibility settings) and the setting information of the account linked to the device or the user.
  • Based on the user's physical information (including information on visual acuity and hearing) and preference information obtained in the dialogue processing according to the present disclosure, the dialogue system 300 may change the accessibility settings of the device, its output settings (including image quality and sound quality settings), its account registration information, and the like.
  • The dialogue system 300 may change these settings automatically when it determines that the corresponding information has been obtained in the dialogue processing, or it may use the voice output unit 308 or the display unit 309 to ask the user whether the settings should be changed. Further, when confirming whether or not to change the settings, the dialogue system 300 may output a user interface corresponding to those settings from the voice output unit 308 or the display unit 309 and let the user change the settings there.
  • At that time, the user may be presented with the settings determined to be suitable for him or her.
  • Alternatively, the dialogue system 300 may display the setting user interface on the display unit 309 with the on/off state, numerical value, and the like of each setting already filled in; the user then adjusts the settings as necessary and applies the changes. A sketch of this confirm-then-apply flow is given below.
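  • A sketch of this confirm-then-apply behavior, assuming a hypothetical confirm callback (for example, a prompt issued through the voice output unit 308 or the display unit 309) and a plain dictionary standing in for the device or account settings:

```python
from typing import Callable, Dict, Tuple

def apply_inferred_settings(
    inferred: Dict[str, Tuple[object, bool]],  # name -> (value, needs confirmation)
    settings: Dict[str, object],
    confirm: Callable[[str], bool],            # e.g. voice or on-screen prompt
) -> Dict[str, object]:
    """Apply settings inferred from the dialogue (e.g. larger subtitles for
    a user with weak eyesight), asking the user first where required."""
    for name, (value, needs_confirmation) in inferred.items():
        if needs_confirmation and not confirm(f"Change '{name}' to {value}?"):
            continue              # user declined; keep the current value
        settings[name] = value    # pre-filled; the user may still fine-tune it
    return settings
```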
  • This disclosure can be applied to personal computers, multifunctional information terminals such as smartphones and tablets, various home appliances, IoT devices, robots, etc., in addition to devices dedicated to dialogue such as voice agent devices.
  • (1) An information processing device comprising: a sensor unit that acquires user information; and a control unit that controls dialogue with the user based on the information acquired by the sensor unit.
  • (2) The information processing device according to (1) above, wherein the control unit determines the utterance turn in the user's silent section based on the information acquired through the sensor unit.
  • (3) The information processing device according to (2) above, wherein the control unit determines the utterance turn based on the silence time or a feature amount of the user, according to the user type estimated from the dialogue situation acquired through the sensor unit and the user's attributes/characteristics.
  • (4) The information processing device according to (2) above, wherein the control unit, upon determining from the utterance turn determination result that no utterance-turn transition has occurred, outputs a first dialogue voice while remaining in the state of accepting the user's utterance.
  • (5) The information processing device according to (4) above, wherein the first dialogue voice is a backchannel (aizuchi) or a filler output during the user's silent section.
  • (6) The information processing device according to (4) or (5) above, wherein the control unit, upon determining from the utterance turn determination result that an utterance-turn transition has occurred, ends acceptance of the user's utterance and outputs a second dialogue voice.
  • (7) The information processing device according to any one of (2) to (6) above, wherein the control unit estimates the utterance turn using an estimator selected, based on the dialogue situation, the user's attributes/characteristics, and individual differences estimated from the information acquired through the sensor unit, from among utterance-turn estimators prepared for each user type.
  • (8) The information processing device according to any one of (1) to (7) above, wherein the control unit estimates the intention of the user's utterance and outputs information corresponding to the estimation result.
  • (9) The information processing device according to (8) above, wherein the control unit estimates the intention of the user's utterance and, if the estimation result does not match the content of the user's utterance, outputs information corresponding to the estimation result.
  • (10) The information processing device according to (8) or (9) above, wherein the control unit controls the progress of the topic of the voice dialogue based on the user's response to the information corresponding to the estimation result.
  • (11) The information processing device according to (10) above, wherein, when the user indicates that the information corresponding to the estimation result is as intended, the control unit advances the topic of the voice dialogue based on the information corresponding to the estimation result.
  • (12) The information processing device according to (10) or (11) above, wherein the control unit controls the progress of the topic of the voice dialogue based on the content of the user's utterance.
  • (13) The information processing device according to (8) above, wherein the control unit estimates the intention of the user's utterance using at least one of a domain database according to the dialogue situation, a synonym database containing words similar to each word, and a user model describing the user's characteristics.
  • (14) The information processing device according to (10) above, wherein the control unit outputs information corresponding to the estimation result to a display unit and controls the progress of the voice dialogue based on the user's response to the information output to the display unit.
  • (15) The information processing device, wherein the control unit outputs a character image or an icon image to the display unit, and the character image or icon image is an image corresponding to either the determination of the utterance turn in the user's silent section or the estimation of the user's utterance intention based on the information acquired through the sensor unit.
  • (16) The information processing device according to any one of (8) to (15) above, wherein the control unit outputs information acquired through the dialogue with the user to an external device while estimating the intention of the user's utterance.
  • (17) An information processing method comprising: an acquisition step of acquiring user information; and a control step of controlling dialogue with the user, wherein, in the control step, the utterance turn in the user's silent section is determined based on the information acquired in the acquisition step.
  • (18) A computer program that causes a computer to function as: a sensor unit that acquires user information; and a control unit that controls dialogue with the user based on the information acquired by the sensor unit.
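To make the utterance turn determination of (2), (3), and (7) above concrete, the following is a minimal sketch in which a fixed per-user-type silence threshold stands in for the selected estimator. The user-type names and threshold values are assumptions for illustration only:

```python
from typing import Dict

# Assumed per-user-type silence thresholds, in seconds. A real system
# would select a trained estimator per user type; a fixed threshold is
# the simplest stand-in for that idea.
SILENCE_THRESHOLDS_SEC: Dict[str, float] = {
    "quick_responder": 0.8,     # short pauses already end the turn
    "deliberate_speaker": 2.5,  # speaks intermittently while thinking
    "default": 1.5,
}

def turn_has_transitioned(silence_sec: float, user_type: str) -> bool:
    """True if the silent section indicates the turn passed to the system
    (output the second dialogue voice); False if the system should keep
    accepting speech and may emit a backchannel or filler meanwhile."""
    threshold = SILENCE_THRESHOLDS_SEC.get(
        user_type, SILENCE_THRESHOLDS_SEC["default"])
    return silence_sec >= threshold
```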

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided is an information processing device that controls the utterance turn and performs dialogue processing with a user while estimating the original utterance intention. The information processing device comprises a sensor unit that acquires user information and a control unit that controls the dialogue with the user based on the information acquired by the sensor unit. The control unit determines the user's utterance turn in a silent section based on the information acquired by the sensor unit. The control unit estimates the user's utterance intention and, when the estimation result does not match the content uttered by the user, outputs information corresponding to the estimation result.
PCT/JP2021/015097 2020-06-05 2021-04-09 Information processing device, information processing method, and computer program WO2021246056A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020098977 2020-06-05
JP2020-098977 2020-06-05

Publications (1)

Publication Number Publication Date
WO2021246056A1 true WO2021246056A1 (fr) 2021-12-09

Family

ID=78830326

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/015097 WO2021246056A1 (fr) Information processing device, information processing method, and computer program

Country Status (1)

Country Link
WO (1) WO2021246056A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005196134A * 2003-12-12 2005-07-21 Toyota Central Res & Dev Lab Inc Voice dialogue system and method, and voice dialogue program
JP2009210703A * 2008-03-03 2009-09-17 Alpine Electronics Inc Speech recognition device
WO2016151698A1 * 2015-03-20 2016-09-29 Toshiba Corporation Dialogue device, method, and program
WO2019142427A1 * 2018-01-16 2019-07-25 Sony Corporation Information processing device, information processing system, information processing method, and program
JP2021051172A * 2019-09-24 2021-04-01 Waseda University Dialogue system and program

Similar Documents

Publication Publication Date Title
CN112119454B Automated assistants that accommodate multiple age groups and/or vocabulary levels
US11100384B2 (en) Intelligent device user interactions
US11004446B2 (en) Alias resolving intelligent assistant computing device
Cress et al. Common questions about AAC services in early intervention
Pascalis et al. On the links among face processing, language processing, and narrowing during development
Van Berkum Understanding sentences in context: What brain waves can tell us
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
EP3259754B1 Method and device for providing information
US20180240459A1 (en) Method and system for automation of response selection and composition in dialog systems
US9336782B1 (en) Distributed collection and processing of voice bank data
Creel Preschoolers’ use of talker information in on‐line comprehension
US9802125B1 (en) On demand guided virtual companion
Byun Bidirectional perception–production relations in phonological development: evidence from positional neutralization
Beeke et al. Prosody as a compensatory strategy in the conversations of people with agrammatism
WO2021246056A1 Information processing device, information processing method, and computer program
Van Nijnatten et al. Interviewing victims of sexual abuse with an intellectual disability: A Dutch single case study
Satti When it’s “now or never” Multimodal practices for managing opportunities to initiate other-repair in collaborative storytelling
Schuller et al. Computational charisma—A brick by brick blueprint for building charismatic artificial intelligence
Hofmann Intuitive speech interface technology for information exchange tasks
Miehle Communication style modelling and adaptation in spoken dialogue systems
US20240212826A1 (en) Artificial conversation experience
JP7350384B1 Dialogue system and dialogue method
Frommer et al. Giving computers personality? Personality in computers is in the eye of the user
Valdivia Voice familiarity in an interactive voice-reminders app for elderly care recipients and their family caregivers
Martini et al. “I Am All-Inclusive.” But Not Really: An Exploration on the Influence of Gender and Conversational Contexts on Intelligent Voice Assistants.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21817260

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21817260

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP