WO2021153101A1 - Information processing device, information processing method, and information processing program - Google Patents

Information processing device, information processing method, and information processing program

Info

Publication number
WO2021153101A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
speaker
utterance
information processing
processing device
Prior art date
Application number
PCT/JP2020/047857
Other languages
English (en)
Japanese (ja)
Inventor
真里 斎藤
Original Assignee
ソニーグループ株式会社
Priority date
Filing date
Publication date
Application filed by ソニーグループ株式会社
Publication of WO2021153101A1



Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This disclosure relates to an information processing device, an information processing method, and an information processing program.
  • In such a system, however, the display of the text may take a long time, and it has been difficult to convey to the speaker that the utterance has been understood.
  • In view of this, the present disclosure provides an information processing device including a state estimation unit that estimates a state of emotion understanding, in which emotions are understood based on the speaker's utterance, and a response generation unit that generates output information based on the estimation result of the state estimation unit.
  • Embodiment of the present disclosure: 1.1. Overview
  • In recent years, systems that understand a speaker's utterance and interact with the speaker have become widespread.
  • For example, systems in which the input utterance is converted into text and displayed have become common.
  • This system is realized, for example, as a speaker-type dialogue agent such as a smart speaker or a human-type dialogue agent such as Pepper (registered trademark).
  • In such a system, the display of the text may take a long time, and it has been difficult to convey to the speaker that the utterance has been understood.
  • During the speaker's utterance, for example, if the dialogue agent can produce a filler (a connecting word unrelated to the utterance content), a nod, or an aizuchi, the speaker can be made to feel that the dialogue agent understands the utterance. Therefore, techniques for dialogue agents that produce fillers, nods, and aizuchi during the speaker's utterance are being developed.
  • For example, Patent Document 1 discloses a technique for controlling the operation of a dialogue agent when it cannot be estimated whether the agent should wait for an utterance or should produce an utterance.
  • However, when the dialogue agent's behavior related to the dialogue is controlled without regard to the intention of the speaker's utterance, the agent's behavior may, for example, interfere with the speaker's utterance.
  • The present disclosure was conceived in view of the above points, and proposes a technique capable of controlling the agent so that an appropriate response is made in line with the intention of the speaker's utterance.
  • the present embodiment will be described in detail in order.
  • an example of the dialogue agent will be described using the terminal device 20.
  • FIG. 1 is a diagram showing a configuration example of the information processing system 1.
  • the information processing system 1 includes an information processing device 10 and a terminal device 20.
  • Various devices can be connected to the information processing device 10.
  • a terminal device 20 is connected to the information processing device 10, and information is linked between the devices.
  • the terminal device 20 is wirelessly connected to the information processing device 10.
  • For example, the information processing device 10 performs short-range wireless communication with the terminal device 20 using Bluetooth (registered trademark).
  • The information processing device 10 and the terminal device 20 may be connected, whether wired or wireless, via various interfaces such as I2C (Inter-Integrated Circuit) and SPI (Serial Peripheral Interface), or via various networks such as a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and a mobile communication network.
  • The information processing device 10 controls, for example, the terminal device 20 according to utterance data of the speaker's utterance (voice). Specifically, the information processing device 10 first estimates the state of emotion understanding, in which emotions are understood based on the speaker's utterance, and generates output information based on the estimation result. The information processing device 10 then controls the terminal device 20 by transmitting the generated output information to, for example, the terminal device 20.
  • the information processing device 10 also has a function of controlling the overall operation of the information processing system 1. For example, the information processing device 10 controls the overall operation of the information processing system 1 based on the information linked between the devices. Specifically, the information processing device 10 controls the terminal device 20 based on the information received from the terminal device 20.
  • the information processing device 10 is realized by a PC (Personal computer), a WS (Workstation), or the like.
  • the information processing device 10 is not limited to a PC, a WS, or the like.
  • the information processing device 10 may be an information processing device such as a PC or WS that implements the function of the information processing device 10 as an application.
  • Terminal device 20: The terminal device 20 is an information processing device to be controlled.
  • the terminal device 20 acquires utterance data. Then, the terminal device 20 transmits the acquired utterance data to the information processing device 10.
  • the terminal device 20 may be realized as any device.
  • the terminal device 20 may be realized as a speaker type device or a human type device.
  • the terminal device 20 may be realized as, for example, a presenting device that presents visual information of a dialogue agent.
  • The information processing system 1 generates a response, which is a listening reaction to the speaker's utterance, by transitioning among three states. Specifically, the information processing system 1 generates the response by transitioning among estimation of a state of utterance recognition, in which the speaker's utterance is recognized; estimation of a state of emotion understanding, in which emotions are understood based on the speaker's utterance; and estimation of a state of execution preparation, in which a process is executed based on request-related information included in the speaker's utterance. The response based on the estimation of the state of utterance recognition is, for example, a response for telling the speaker that the utterance has been received.
  • The response based on the estimation of the state of emotion understanding is, for example, a response for telling the speaker that the agent sympathizes with them.
  • the response based on the estimation of the execution preparation state is, for example, a response for executing a process based on the request-related information included in the utterance of the speaker.
  • the information processing system 1 can generate a response according to the state by transitioning between these three states.
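  • As a rough illustration of this three-state design, the sketch below maps each state to the kind of response it produces; the enum, the helper function, and the response strings are hypothetical placeholders and not part of the disclosure.

```python
from enum import Enum, auto

class ListeningState(Enum):
    UTTERANCE_RECOGNITION = auto()   # tell the speaker the utterance was received
    EMOTION_UNDERSTANDING = auto()   # tell the speaker the agent sympathizes
    EXECUTION_PREPARATION = auto()   # execute a process based on the request

def extract_emotion_word(utterance: str) -> str:
    # Hypothetical helper: look up a registered emotion word in the utterance.
    emotion_words = {"troublesome", "fun", "busy", "sad"}
    return next((w for w in emotion_words if w in utterance.lower()), "tough")

def response_for(state: ListeningState, utterance: str) -> str:
    """Return a placeholder response matching the role of each state."""
    if state is ListeningState.UTTERANCE_RECOGNITION:
        return "Yeah."                                            # acknowledgement aizuchi
    if state is ListeningState.EMOTION_UNDERSTANDING:
        return f"That sounds {extract_emotion_word(utterance)}."  # empathic repeat
    return "OK, I will take care of that."                        # confirm before executing
```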
  • FIG. 2 is a diagram showing an outline of the functions of the information processing system 1.
  • The information processing system 1 first recognizes the utterance of the speaker U12 (S11). When the information processing system 1 recognizes the utterance of the speaker U12, it estimates the state of utterance recognition. Next, the information processing system 1 recognizes an emotion word indicating an emotion from the utterance of the speaker U12 (S12). When the information processing system 1 recognizes an emotion word, it estimates the state of emotion understanding. Then, the information processing system 1 executes a process of repeating back the emotion word (S13). When the information processing system 1 recognizes another utterance of the speaker U12, it estimates the state of utterance recognition again.
  • the information processing system 1 recognizes the request-related information from the utterance of the speaker U12 (S14). When the information processing system 1 recognizes the request-related information, it estimates the state of preparation for execution. Then, the information processing system 1 executes a process based on the request-related information (S15). In the process of S15, the information processing system 1 estimates the state of utterance recognition when the process based on the request-related information is not executed (S16).
  • the information processing system 1 recognizes the request-related information from the utterance of the speaker U12 (S17). When the information processing system 1 recognizes the request-related information, it estimates the state of preparation for execution. Then, the information processing system 1 executes a process based on the request-related information (S15). In the process of S15, the information processing system 1 estimates the state of emotion understanding when the process based on the request-related information is not executed (S18).
  • In this way, by staging the responses such as the dialogue agent's aizuchi, the information processing system 1 can communicate the states of "listening (to the voice)", "understanding emotions", and "executing the request" using different processes.
  • the information processing system 1 can convey that the dialogue agent is listening while understanding the transition of the contents of the speaker's utterance, so that the speaker can speak with peace of mind.
  • FIG. 3 is a diagram showing an outline of a UI (User Interface) when estimating the state of utterance recognition.
  • the terminal device 20 first detects the utterance TK11 of the speaker U12.
  • When the information processing system 1 detects the end SK11 of the utterance TK11, it controls the terminal device 20 so as to give an aizuchi such as "Yeah" (S21).
  • the terminal device 20 outputs the response RK11, which is an aizuchi to the utterance TK11.
  • the terminal device 20 detects the utterance TK12 of the speaker U12.
  • While the speaker U12 is uttering the utterance TK12, the information processing system 1 controls the terminal device 20 so that it nods, for example by moving its head vertically, until the end SK12 of the utterance TK12 is detected (S22). That is, the information processing system 1 controls the terminal device 20 so that it does not give an aizuchi during the utterance TK12.
  • When the information processing system 1 detects the end SK12 of the utterance TK12, it controls the terminal device 20 so as to give an aizuchi.
  • the terminal device 20 outputs a response RK12, which is an aizuchi to the utterance TK12.
  • the terminal device 20 detects the utterance TK13 of the speaker U12.
  • Similarly, the information processing system 1 controls the terminal device 20 to nod while the speaker U12 is uttering the utterance TK13, until the end SK13 of the utterance TK13 is detected (S23). When the end SK13 is detected, the information processing system 1 controls the terminal device 20 so as to give an aizuchi.
  • the terminal device 20 outputs the response RK13, which is an aizuchi to the utterance TK13.
  • In this way, the information processing system 1 can give an aizuchi at a timing that does not interfere with the speaker's utterance, so that it can appropriately inform the speaker that the utterance has been received.
  • FIG. 4 is a diagram showing an outline of the UI when estimating the state of emotional understanding.
  • the terminal device 20 detects the utterance TK23 of the speaker U12.
  • While the speaker U12 is uttering the utterance TK23, the information processing system 1 controls the terminal device 20 so that it nods until the end SK23 of the utterance TK23 is detected. Further, the information processing system 1 detects the emotion word KG11 from the utterance TK23 (S33). Specifically, the information processing system 1 performs language analysis processing on the utterance TK23.
  • The information processing system 1 detects the emotion word KG11 by comparing the linguistic information included in the utterance TK23 with linguistic information predetermined as emotion words. For example, the information processing system 1 detects the emotion word KG11 by accessing the storage unit that stores the emotion word information. When it detects the emotion word KG11, the information processing system 1 controls the terminal device 20 so as to repeat back the emotion indicated by the emotion word KG11 in an appropriate expression, using the emotion word KG11 and the linguistic information in the nearby context among the linguistic information contained in the utterance TK23.
  • For example, the information processing system 1 repeats back the emotion "trouble" indicated by the emotion word KG11 in an appropriate expression, based on the emotion word KG11 "trouble" and the adjacent linguistic information "long".
  • The terminal device 20 outputs the response RK23, which is a repeat of the utterance TK23. In this way, the information processing system 1 can repeat back the linguistic information in the context before and after the emotion word KG11. As a result, the information processing system 1 can appropriately convey to the speaker that it understands and sympathizes with the speaker's emotions, so that the speaker can speak with peace of mind.
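  • A minimal sketch of this repeat-back step, assuming a simple token window around the detected emotion word and a small in-memory emotion-word store with a synonym table (all hypothetical), might look like the following:

```python
# Hypothetical emotion-word store and synonym table standing in for the
# emotion word information storage unit 122.
EMOTION_WORDS = {"troublesome", "worst", "fun", "busy"}
SYNONYMS = {"worst": "sad"}

def detect_emotion_word(tokens: list[str]) -> int | None:
    """Return the index of the first registered emotion word, if any."""
    for i, token in enumerate(tokens):
        if token in EMOTION_WORDS:
            return i
    return None

def empathic_repeat(utterance: str, window: int = 2) -> str | None:
    """Repeat the emotion with nearby context, or with a synonym if one is registered."""
    tokens = utterance.lower().split()
    idx = detect_emotion_word(tokens)
    if idx is None:
        return None
    word = SYNONYMS.get(tokens[idx], tokens[idx])     # prefer a predetermined synonym
    context = tokens[max(0, idx - window): idx]       # linguistic information close to the word
    return " ".join(context + [word]) + ", I see."

# e.g. empathic_repeat("the meeting was so long and troublesome")
# -> "long and troublesome, I see."
```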
  • the information processing system 1 detects the emotional word KG21 from the utterance TK33 (S43).
  • In this case, the information processing system 1 controls the terminal device 20 so as to repeat back the emotion indicated by the emotion word KG21 in an appropriate expression, using linguistic information predetermined as a synonym of the emotion word KG21.
  • For example, the information processing system 1 repeats back the emotion "worst" indicated by the emotion word KG21 in an appropriate expression, using "sad", which is linguistic information predetermined as a synonym for the emotion word KG21 "worst". In this way, the information processing system 1 generates an empathic utterance that repeats back linguistic information predetermined as a synonym for the emotion word KG21.
  • the terminal device 20 outputs the response RK33, which is a repeat of the utterance TK33.
  • As another example, the information processing system 1 may repeat back the emotion using "terrible", which is linguistic information predetermined as a synonym for the emotion word KG21 "worst", together with the linguistic information in the context close to the emotion word KG21 among the linguistic information contained in the utterance TK33.
  • The information processing system 1 may not only output responses using the registered emotion words, but may also estimate the speaker's emotions using, for example, a sensor, generate emotion words corresponding to the estimated emotions, and use them to output the response.
  • The information processing system 1 may also learn responses based on utterances included in conversations with other speakers. Further, each time it detects a conversation with another speaker, the information processing system 1 may update the learned and stored responses at any time and output the most recently updated response.
  • FIG. 6 is a diagram showing an outline of the UI when estimating the state of preparation for execution.
  • the terminal device 20 detects the utterance TK43 of the speaker U12.
  • While the speaker U12 is uttering the utterance TK43, the information processing system 1 controls the terminal device 20 so that it nods until the end SK43 of the utterance TK43 is detected. Further, the information processing system 1 detects the request-related information IG11 from the utterance TK43 (S53).
  • When the information processing system 1 detects the request-related information IG11, it outputs a response RK43, such as "OK", to the effect that the request has been recognized. Then, the information processing system 1 controls the terminal device 20 so as to repeat back the content of the request indicated by the request-related information IG11. The terminal device 20 outputs the response RK44, which is a repeat of the utterance TK43. Then, the information processing system 1 executes a process based on the information regarding the request indicated by the request-related information IG11 (S54).
  • the information processing system 1 determines whether or not the information regarding the request indicated by the request-related information IG11 is sufficient to execute the process.
  • When the information is not sufficient, the information processing system 1 controls the terminal device 20 so as to give an aizuchi with an expression that is less noticeable than a predetermined standard.
  • the information processing system 1 can prompt the speaker to continue the utterance by controlling the terminal device 20 so as to perform the aizuchi at a low volume, for example.
  • Otherwise, duplicated utterances by the terminal device 20 may occur. Because the information processing system 1 can prompt the speaker to continue the utterance, this problem of duplicated utterances can be resolved. If the information processing system 1 cannot detect a continuation of the speaker's utterance, it tells the speaker that the utterance is not sufficient. In addition, the information processing system 1 prompts the speaker to continue the utterance by using the linguistic information of a trailing-off (incomplete) sentence. As a result, the information processing system 1 can encourage natural utterances rather than pressing the speaker to state the information that is missing for executing the process.
  • When the information regarding the request indicated by the request-related information IG11 is sufficient to execute the process, the information processing system 1 outputs the response RK43 indicating that the request has been recognized.
  • When the utterance TK43 is the end of the dialogue in which the request indicated by the request-related information IG11 is uttered, the information processing system 1 outputs the response RK43 in an expression as noticeable as a predetermined reference.
  • the information processing system 1 can output the response RK43 at a volume equivalent to, for example, a predetermined reference.
  • the terminal device 20 detects the emotional word KG31 from the utterance TK51 of the speaker U12 (S62).
  • the terminal device 20 outputs a response RK52 in which the emotion "enjoyment” indicated by the emotional word KG31 is repeated in an appropriate expression.
  • the information processing system 1 outputs the response RK52 based on the emotional word KG31 "fun" and the adjacent linguistic information "very".
  • the terminal device 20 detects the request-related information IG21 from the utterance TK63 of the speaker U12 (S73).
  • the information processing system 1 determines that the information regarding the request indicated by the request-related information IG21 is not sufficient to execute the process.
  • the information processing system 1 outputs a response RK63 indicating that the utterance is not sufficient.
  • the terminal device 20 detects the utterance TK64 of the speaker U12.
  • the information processing system 1 determines that the utterance TK64 of the speaker U12 contains sufficient information to execute the process based on the information regarding the request indicated by the request-related information IG21 (S74).
  • the terminal device 20 outputs the response RK64, which is a repeat of the utterance TK64.
  • the information processing system 1 controls the terminal device 20 so as to present the information regarding the request indicated by the request-related information IG21 together with the output of the response RK65 in response to the utterance TK65 of the speaker U12. After that, the information processing system 1 detects the emotional word KG41 from the utterance TK67 of the speaker U12 (S77).
  • the terminal device 20 outputs a response RK67 in which the emotion “delicious” indicated by the emotion word KG41 is repeated in an appropriate expression. Specifically, the information processing system 1 outputs the response RK67 based on the emotional word KG41, "it looks delicious".
  • the terminal device 20 detects the emotional word KG51 from the utterance TK71 of the speaker U12 (S81).
  • the terminal device 20 outputs a response RK71 in which the emotion “busy” indicated by the emotion word KG51 is repeated in an appropriate expression.
  • the information processing system 1 outputs the response RK71 based on the emotional word KG51 "busy” and the adjacent linguistic information "work”.
  • FIG. 10 is a block diagram showing a functional configuration example of the information processing system 1 according to the first embodiment.
  • the information processing device 10 includes a communication unit 100, a control unit 110, and a storage unit 120.
  • the information processing device 10 has at least a control unit 110.
  • the communication unit 100 has a function of communicating with an external device. For example, the communication unit 100 outputs information received from the external device to the control unit 110 in communication with the external device. Specifically, the communication unit 100 outputs the utterance data received from the terminal device 20 to the control unit 110.
  • the communication unit 100 transmits the information input from the control unit 110 to the external device in communication with the external device. Specifically, the communication unit 100 transmits information regarding acquisition of utterance data input from the control unit 110 to the terminal device 20.
  • Control unit 110 has a function of controlling the operation of the information processing device 10. For example, the control unit 110 detects the end of the utterance data. Further, the control unit 110 performs a process of controlling the operation of the terminal device 20 based on the information regarding the detected termination.
  • The control unit 110 includes a speaker identification unit 111, an utterance detection unit 112, an utterance recognition unit 113, a state estimation unit 114, a semantic analysis unit 115, a request processing unit 116, a response generation unit 117, an utterance execution unit 118, and a motion presentation unit 119.
  • the speaker identification unit 111 has a function of performing speaker identification processing.
  • the speaker identification unit 111 accesses the storage unit 120 (for example, the speaker information storage unit 121) and performs identification processing using the speaker information.
  • Specifically, the speaker identification unit 111 identifies the speaker by comparing the image pickup information transmitted from the image pickup unit 212 via the communication unit 200 with the speaker information stored in the storage unit 120.
  • the utterance detection unit 112 has a function of detecting the utterance of the speaker. For example, the utterance detection unit 112 performs detection processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. In addition, the utterance detection unit 112 detects the utterance of a specific speaker. For example, the utterance detection unit 112 detects the utterance of a specific speaker based on the image pickup information transmitted from the image pickup unit 212 via the communication unit 200.
  • the utterance recognition unit 113 has a function of recognizing the utterance of the speaker. For example, the utterance recognition unit 113 performs recognition processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. Specifically, the utterance recognition unit 113 converts the utterance data into linguistic information.
  • the utterance recognition unit 113 has a function of performing a process of detecting the end of the utterance data. For example, the utterance recognition unit 113 performs a process of detecting the end of the utterance data transmitted from the utterance acquisition unit 211. Specifically, the utterance recognition unit 113 detects the end of the language information.
  • the state estimation unit 114 has a function of performing a process of estimating a state based on the utterance of the speaker. For example, the state estimation unit 114 performs estimation processing on the utterance data transmitted from the utterance acquisition unit 211 via the communication unit 200. Specifically, the state estimation unit 114 estimates the state of emotional understanding when the speaker's utterance includes emotional words. The state estimation unit 114 accesses the storage unit 120 (for example, the emotional word information storage unit 122) and performs estimation processing using linguistic information. Specifically, the state estimation unit 114 estimates the state of emotional understanding by comparing the linguistic information included in the utterance data with the emotional words stored in the storage unit 120.
  • The state estimation unit 114 estimates the state of emotion understanding according to the emotion word indicating an emotion among the linguistic information included in the speaker's utterance. Further, the state estimation unit 114 estimates the state of emotion understanding according to linguistic information other than emotion words that nevertheless expresses the speaker's emotions among the linguistic information included in the speaker's utterance.
  • the state estimation unit 114 estimates the state of preparation for execution when the speaker's utterance includes request-related information. Further, the state estimation unit 114 estimates the state of utterance recognition when the speaker's utterance does not include emotional words and request-related information.
  • the semantic analysis unit 115 has a function of analyzing the intention of the speaker's utterance from the linguistic information included in the speaker's utterance. Specifically, the semantic analysis unit 115 analyzes the intention of the speaker's utterance by classifying the linguistic information of the speaker's utterance into categories such as nouns, verbs, and modifiers.
  • the request processing unit 116 has a function of performing processing for executing processing based on the request-related information included in the utterance of the speaker. For example, the request processing unit 116 generates control information for executing processing based on the request-related information.
  • the response generation unit 117 has a function of performing a process of generating a response to be presented to the speaker. For example, the response generation unit 117 generates control information for performing a nod, an aizuchi, or the like, which is a response to be presented to the speaker.
  • For example, by predetermining control information for performing nods of stepwise magnitudes such as large, medium, and small, the response generation unit 117 generates control information for performing a nod of a magnitude corresponding to the state based on the speaker's utterance.
  • As another example, the response generation unit 117 defines in advance a parameter for determining the magnitude of the nodding motion, and generates, according to the value of the parameter, control information for performing a nod of a magnitude corresponding to the state based on the speaker's utterance. Further, by predetermining control information for giving aizuchi at different volumes, tones, and the like, the response generation unit 117 generates control information for giving an aizuchi at a volume, tone, and the like corresponding to the state based on the speaker's utterance. As another example, the response generation unit 117 defines in advance parameters for determining the volume, tone, and the like of the aizuchi, and generates, according to the values of the parameters, control information for giving an aizuchi at a volume, tone, and the like corresponding to the state based on the speaker's utterance. The response generation unit 117 also generates control information for producing output that is relative to a reference that depends on the speaker.
  • Further, the response generation unit 117 determines whether or not the ambient sound other than the speaker's utterance is in a steady ambient-sound state, and when it is, generates control information for giving an aizuchi at, for example, a steady volume and tone. When the ambient sound other than the speaker's utterance is louder or quieter than in the steady ambient-sound state, the response generation unit 117 generates control information for giving an aizuchi at, for example, a relatively equivalent volume and tone. In this case, the response generation unit 117 generates control information for nodding with a magnitude corresponding to the volume, tone, and the like of the aizuchi.
  • When the response generation unit 117 generates control information for performing a large nod, it generates control information for giving an aizuchi whose volume, tone, and the like correspond to the magnitude of the nodding motion. As a result, the response generation unit 117 can synchronize the magnitudes of the nod and the aizuchi, which are the operations controlled on the terminal device 20. For example, when the response generation unit 117 controls the terminal device 20 so as to perform a large nod, it controls the terminal device 20 so that the volume of the aizuchi is increased.
  • As another example, the response generation unit 117 controls the terminal device 20 so that the frequency of the aizuchi increases or the interval (timing) between aizuchi becomes shorter.
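  • As one way to picture this parameterization, a single intensity value could drive both the nod magnitude and the aizuchi volume and interval so that the two stay synchronized; the dataclass fields and numeric ranges below are assumptions for illustration, not the actual control information.

```python
from dataclasses import dataclass

@dataclass
class ResponseControl:
    nod_magnitude: float     # 0.0 (none) .. 1.0 (large nod)
    aizuchi_volume: float    # relative to a speaker-dependent reference volume
    aizuchi_interval: float  # seconds between aizuchi

def make_control(intensity: float, reference_volume: float = 1.0) -> ResponseControl:
    """Derive nod and aizuchi parameters from one intensity value so that a
    larger nod is always accompanied by a louder, more frequent aizuchi."""
    intensity = max(0.0, min(1.0, intensity))
    return ResponseControl(
        nod_magnitude=intensity,
        aizuchi_volume=reference_volume * (0.5 + 0.5 * intensity),
        aizuchi_interval=2.0 - intensity,  # higher intensity -> shorter interval
    )

# e.g. emotion understanding could use a higher intensity than utterance recognition:
small = make_control(0.3)   # small nod, quiet aizuchi
large = make_control(0.9)   # large nod, louder and more frequent aizuchi
```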
  • the response generation unit 117 generates control information for performing a steady response when the speaker's utterance includes emotional words that the speaker regularly uses.
  • The response generation unit 117 generates control information for performing a non-steady response when the speaker's utterance contains an emotion word that the speaker does not regularly use (uses infrequently) or that appears for the first time.
  • As a non-steady response, the response generation unit 117 generates control information for making a response such as an action of asking the speaker to repeat the utterance, an action of leaning forward, an action of showing a puzzled expression, or an utterance that raises the pitch of the ending when repeating back.
  • the response generation unit 117 generates a response using the linguistic information included in the speaker's utterance. For example, the response generation unit 117 generates a response using the linguistic information analyzed by the semantic analysis unit 115.
  • The response generation unit 117 generates an empathic utterance that repeats back an emotion word indicating an emotion among the linguistic information included in the speaker's utterance. Further, the response generation unit 117 generates an empathic utterance that repeats back linguistic information other than emotion words that expresses the speaker's emotions among the linguistic information included in the speaker's utterance.
  • The utterance execution unit 118 has a function of providing control information for causing the terminal device 20 to present an utterance to the speaker. For example, the utterance execution unit 118 provides this control information to the terminal device 20 via the communication unit 100.
  • The motion presentation unit 119 has a function of providing control information for controlling the motion that the terminal device 20 presents to the speaker.
  • For example, the motion presentation unit 119 provides this control information to the terminal device 20 via the communication unit 100.
  • the storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk.
  • the storage unit 120 has a function of storing data related to processing in the information processing device 10. As shown in FIG. 10, the storage unit 120 includes a speaker information storage unit 121 and an emotional word information storage unit 122.
  • FIG. 11 shows an example of the speaker information storage unit 121.
  • the speaker information storage unit 121 shown in FIG. 11 stores speaker information.
  • the speaker information storage unit 121 may have items such as "speaker ID" and "speaker information”.
  • “Speaker ID” indicates identification information for identifying the speaker.
  • “Speaker information” indicates speaker information.
  • In FIG. 11, conceptual information such as "speaker information # 1" and "speaker information # 2" is stored in "speaker information", but in reality, imaging information of the speaker and the like is stored.
  • FIG. 12 shows an example of the emotion word information storage unit 122.
  • the emotion word information storage unit 122 shown in FIG. 12 stores information related to emotion words.
  • The emotion word information storage unit 122 may have items such as "emotion word information ID", "emotion word", "synonym", "general co-occurrence word", and "speaker co-occurrence word".
  • Emotional word information ID indicates identification information for identifying emotional word information.
  • Emotional word indicates an emotional word.
  • synonyms indicate synonyms for emotional words.
  • General co-occurrence word indicates a commonly used co-occurrence word among the words that co-occur with the emotion word.
  • Speaker co-occurrence word indicates a speaker-specific co-occurrence word among the words that co-occur with the emotion word.
  • The emotion word according to the embodiment need not be a word generally defined as an emotion word; it may be linguistic information that frequently appears in a specific expression peculiar to the speaker.
  • the emotional word information storage unit 122 may store linguistic information that frequently appears for a specific expression peculiar to the speaker as an emotional word.
  • In this case, the information processing system 1 does not repeat back the emotion word itself, but presents linguistic information that co-occurs with the specific expression as the emotion word. For example, when the linguistic information that frequently appears in this specific expression is "very difficult", "difficult" is presented as the emotion word.
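  • For illustration, a record of the emotion word information storage unit 122 could be modelled roughly as below; the field names follow the items listed above, and the concrete values are hypothetical examples.

```python
from dataclasses import dataclass, field

@dataclass
class EmotionWordInfo:
    emotion_word_id: str                      # "emotion word information ID"
    emotion_word: str                         # e.g. "worst"
    synonyms: list[str] = field(default_factory=list)              # e.g. ["sad", "terrible"]
    general_cooccurrence: list[str] = field(default_factory=list)  # common co-occurring words
    speaker_cooccurrence: dict[str, list[str]] = field(default_factory=dict)
    # speaker-specific co-occurring words, keyed by speaker ID

# Hypothetical entry, including a speaker-specific expression ("very difficult")
# stored so that "difficult" can be presented as the emotion word.
entry = EmotionWordInfo(
    emotion_word_id="EW-001",
    emotion_word="difficult",
    synonyms=["hard", "tough"],
    general_cooccurrence=["work", "problem"],
    speaker_cooccurrence={"speaker-1": ["very"]},
)
```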
  • Terminal device 20: As shown in FIG. 10, the terminal device 20 has a communication unit 200, a control unit 210, and a presentation unit 220.
  • the communication unit 200 has a function of communicating with an external device. For example, the communication unit 200 outputs information received from the external device to the control unit 210 in communication with the external device. Specifically, the communication unit 200 outputs information regarding acquisition of utterance data received from the information processing device 10 to the control unit 210. Further, the communication unit 200 outputs the control information received from the information processing device 10 to the control unit 210.
  • the communication unit 200 outputs the control information received from the information processing device 10 to the presentation unit 220.
  • the communication unit 200 transmits the information input from the control unit 210 to the external device in the communication with the external device. Specifically, the communication unit 200 transmits the utterance data input from the control unit 210 to the information processing device 10.
  • Control unit 210 has a function of controlling the overall operation of the terminal device 20. For example, the control unit 210 controls the utterance data acquisition process by the utterance acquisition unit 211. Further, the control unit 210 controls a process in which the communication unit 200 transmits the utterance data acquired by the utterance acquisition unit 211 to the information processing device 10.
  • the utterance acquisition unit 211 has a function of acquiring the utterance data of the speaker.
  • the utterance acquisition unit 211 acquires utterance data using the utterance (voice) detector provided in the terminal device 20.
  • The image pickup unit 212 has a function of capturing an image of the speaker.
  • the operation control unit 213 has a function of controlling the operation of the terminal device 20.
  • the operation control unit 213 controls the operation of the terminal device 20 according to the acquired control information.
  • The presentation unit 220 has a function of controlling the overall presentation. As shown in FIG. 10, the presentation unit 220 includes a voice presentation unit 221 and a motion presentation unit 222.
  • the voice presentation unit 221 has a function of performing a process of presenting the voice of the terminal device 20.
  • the voice presenting unit 221 presents voice based on the control information received from the utterance execution unit 118 via the communication unit 200.
  • the motion presentation unit 222 has a function of performing a process of presenting the motion of the terminal device 20. For example, the motion presentation unit 222 presents the motion based on the control information received from the motion presentation unit 119 via the communication unit 200.
  • FIG. 13 is a flowchart showing a flow of processing related to state estimation in the information processing device 10 according to the embodiment.
  • First, the information processing device 10 detects the utterance of the speaker based on the utterance data (S101). For example, the information processing device 10 detects the utterance of a specific speaker. Further, the information processing device 10 recognizes the utterance of the speaker (S102). For example, the information processing device 10 detects the end of the speaker's utterance. Next, the information processing device 10 determines whether or not an emotion word is included (S104). When the speaker's utterance includes an emotion word (S104; YES), the information processing device 10 estimates the state of emotion understanding (S106).
  • When the speaker's utterance does not include an emotion word (S104; NO), the information processing device 10 determines whether or not request-related information is included (S108). When request-related information is included in the speaker's utterance (S108; YES), the information processing device 10 estimates the state of execution preparation (S110). When request-related information is not included in the speaker's utterance (S108; NO), the information processing device 10 estimates the state of utterance recognition (S112).
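  • Under the same assumptions as the earlier sketches, the decision order of FIG. 13 might be written as follows; the helper predicates merely stand in for the detection steps and are not the actual implementation.

```python
def has_emotion_word(utterance: str) -> bool:
    # Hypothetical lookup against registered emotion words.
    return any(w in utterance.lower() for w in {"troublesome", "fun", "worst", "busy"})

def has_request_info(utterance: str) -> bool:
    # Hypothetical lookup against request markers.
    return any(m in utterance.lower() for m in {"please", "could you", "set", "search"})

def estimate_state(utterance: str) -> str:
    """Rough sketch of the state estimation flow in FIG. 13 (S101-S112)."""
    # S101/S102: utterance detection and recognition are assumed to have produced `utterance`.
    if has_emotion_word(utterance):          # S104
        return "emotion_understanding"       # S106
    if has_request_info(utterance):          # S108
        return "execution_preparation"       # S110
    return "utterance_recognition"           # S112
```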
  • FIG. 14 is a flowchart showing a flow of processing when the state of utterance recognition is estimated in the information processing apparatus 10 according to the embodiment.
  • First, the information processing device 10 determines whether or not it is the end of the utterance (S200). When it is the end of the utterance (S200; YES), the information processing device 10 controls the terminal device 20 so as to repeat back or to give an aizuchi with a filler (S202). When it is not the end of the utterance (S200; NO), the information processing device 10 determines whether or not there is a pause between utterances (S204).
  • When there is a pause between utterances (S204; YES), the information processing device 10 controls the terminal device 20 so as to give an aizuchi at a low volume (S206). When there is no pause (S204; NO), the information processing device 10 controls the terminal device 20 so as to nod with a small motion (S208).
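  • The branching of FIG. 14 for the utterance recognition state could be sketched as below; the boolean inputs and the returned field names are illustrative assumptions.

```python
def respond_in_utterance_recognition(is_end_of_utterance: bool,
                                     in_pause_between_utterances: bool) -> dict:
    """Sketch of FIG. 14 (S200-S208): choose the response while in the
    utterance recognition state. Field names are illustrative."""
    if is_end_of_utterance:                       # S200; YES
        return {"action": "aizuchi", "with_filler_or_repeat": True}   # S202
    if in_pause_between_utterances:               # S204; YES
        return {"action": "aizuchi", "volume": "low"}                 # S206
    return {"action": "nod", "magnitude": "small"}                    # S208
```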
  • FIG. 15 is a flowchart showing a flow of processing when the state of emotion understanding is estimated in the information processing apparatus 10 according to the embodiment.
  • First, the information processing device 10 determines whether or not it is the end of the utterance (S300). When it is the end of the utterance (S300; YES), the information processing device 10 controls the terminal device 20 so as to repeat back the emotion word (S302). When it is not the end of the utterance (S300; NO), the information processing device 10 determines whether or not there is a pause between utterances (S304).
  • When there is a pause between utterances (S304; YES), the information processing device 10 controls the terminal device 20 so as to give an aizuchi at a high volume (S306). When there is no pause (S304; NO), the information processing device 10 controls the terminal device 20 so as to nod with a large motion (S308).
  • When the state of emotion understanding is estimated, the information processing device 10 generates control information that is more noticeable to the speaker than in the case of estimating the state of utterance recognition shown in FIG. 14.
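  • The corresponding branching for the emotion understanding state (FIG. 15), which is deliberately more noticeable than in the utterance recognition case, might be sketched as:

```python
def respond_in_emotion_understanding(is_end_of_utterance: bool,
                                     in_pause_between_utterances: bool,
                                     emotion_word: str) -> dict:
    """Sketch of FIG. 15 (S300-S308). Field names are illustrative."""
    if is_end_of_utterance:                       # S300; YES
        return {"action": "repeat", "text": emotion_word}             # S302
    if in_pause_between_utterances:               # S304; YES
        return {"action": "aizuchi", "volume": "high"}                # S306
    return {"action": "nod", "magnitude": "large"}                    # S308
```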
  • FIG. 16 is a flowchart showing a flow of processing when the state of preparation for execution is estimated in the information processing apparatus 10 according to the embodiment.
  • First, the information processing device 10 determines whether or not an utterance sufficient for execution has been acquired (S400). When it determines that an utterance sufficient for execution has been acquired (S400; YES), the information processing device 10 controls the terminal device 20 to execute the process based on the information regarding the request (S402).
  • When it determines that an utterance sufficient for execution has not been acquired (S400; NO), the information processing device 10 determines whether or not a cancellation utterance, that is, an utterance for canceling the execution, has been acquired (S404). When it determines that a cancellation utterance has been acquired (S404; YES), the information processing device 10 ends the information processing. When it determines that a cancellation utterance has not been acquired (S404; NO), the information processing device 10 controls the terminal device 20 so as to prompt the speaker to utter further information regarding the request (S406). The process then returns to S400.
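  • Finally, one pass through the loop of FIG. 16 for the execution preparation state could be sketched as below; the sufficiency and cancellation checks are hypothetical stand-ins for S400 and S404.

```python
def is_sufficient(request_info: dict) -> bool:
    # Hypothetical check: the request needs at least an action and a target.
    return "action" in request_info and "target" in request_info

def is_cancellation(utterance: str) -> bool:
    # Hypothetical check for an utterance that cancels execution.
    return "never mind" in utterance.lower() or "cancel" in utterance.lower()

def step_execution_preparation(request_info: dict, utterance: str) -> str:
    """One pass through FIG. 16 (S400-S406). Returned labels are illustrative."""
    if is_sufficient(request_info):      # S400; YES
        return "execute"                 # S402: run the process based on the request
    if is_cancellation(utterance):       # S404; YES
        return "end"                     # end the information processing
    return "prompt_for_more"             # S406: urge the speaker to add request details
```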
  • In the above description, the case where the response generation unit 117 generates control information for making responses that differ in the magnitude of the nod and the volume and tone of the aizuchi has been described, but the present disclosure is not limited to this example.
  • the response generation unit 117 may generate control information for performing a response in which the strength of the facial expression and the size of the animation expression are different.
  • As another example, the response generation unit 117 may generate control information for making responses that differ in facial expression, in the movements of an animal character's tail and ears, and in the clothing and accessories worn. In this way, the response generation unit 117 may generate control information regarding the visual representation in the video.
  • the response generation unit 117 may generate control information for making a response in a different manner such as nodding or aizuchi, depending on the character indicated by the terminal device 20. For example, when the character indicated by the terminal device 20 is a business-like character, the response generation unit 117 may generate control information for performing a response with a small difference in strength. Then, the response generation unit 117 may generate control information for performing an aizuchi using polite words such as "yes" and "is that so?". As another example, when the character indicated by the terminal device 20 is a casual character, the response generation unit 117 may generate control information for performing a response having a large difference in strength. Then, the response generation unit 117 may generate control information for performing an aizuchi using everyday words such as "Yeah”, "I see", and "Hee".
  • For example, the response generation unit 117 may estimate the interval between a speaker's utterances by identifying the speaker using utterance data, imaging information, and the like, and storing the speech speed, interval, and the like for each speaker. The response generation unit 117 may then learn, as teacher data, dialogues in which responses such as aizuchi do not overlap with the speaker's utterances. As a result, the response generation unit 117 can be adapted to avoid overlapping responses.
  • the response generation unit 117 may generate control information for performing, for example, a low-volume aizuchi or a nod of a small operation. As a result, the information processing system 1 can present the response without disturbing the speaker's utterance.
  • The response generation unit 117 may personalize the pattern of the aizuchi and of the repeat for emotion understanding, learning for each speaker the aizuchi pattern that has a high probability of the speaker continuing the utterance, by varying the length of the aizuchi and the variation of the linguistic information. In addition, the response generation unit 117 may learn so that the frequency of use of an aizuchi increases when the amount of the speaker's utterances after that aizuchi increases.
  • The state estimation unit 114 may personalize the state transitions; for example, in the case of a speaker who uses many emotion words, it may estimate the transition from the state of utterance recognition to the state of emotion understanding less frequently. As a result, the information processing device 10 can control the terminal device 20 so that the number of repeats does not become excessive. Further, in the case of a speaker who uses many emotion words, the response generation unit 117 may generate control information for making responses with different variations indicating emotion understanding. For example, the response generation unit 117 may access the emotion word information storage unit 122 or the like and perform processing using synonyms or the like.
  • The response generation unit 117 may also generate control information for executing the process without repeating it back to the speaker in the state of execution preparation. As a result, the information processing device 10 can control the terminal device 20 so that the process is executed as soon as the speaker makes an utterance regarding the request.
  • When the state estimation unit 114 determines that the speaker's emotions are in a steady (neutral) state, it need not estimate the state of emotion understanding even if the speaker's utterance includes emotion words. For example, the state estimation unit 114 need not estimate the state of emotion understanding when the speaker's emotion is determined to be steady based on the result of recognition processing of the speaker's facial expression using the imaging information.
  • As another example, the state estimation unit 114 need not estimate the state of emotion understanding when it determines that the speaker's emotion is in a steady state based on the intonation of the speaker's utterance and the result of utterance recognition using paralanguage. Further, the state estimation unit 114 need not estimate the state of emotion understanding when, based on the linguistic processing result for the utterance, the emotion words included in the utterance are not linguistic information based on the speaker's own emotions but linguistic information quoting the emotions of others or the sentences of others.
  • The above embodiment can also be applied in fields such as medical care, for example when the speaker is visually impaired. If the speaker is visually impaired, visual responses such as nods may not be properly perceived. Therefore, when the speaker is visually impaired, the information processing system 1 may respond by using an aizuchi instead of a nod. In this case, the response generation unit 117 may generate control information for responding with an aizuchi instead of a nod at the timing when a response would otherwise be made with a nod. Conversely, when the speaker is hearing-impaired, auditory responses such as aizuchi may not be properly perceived.
  • the information processing system 1 may respond by using a nod instead of an aizuchi.
  • the response generation unit 117 may generate control information for responding by using a nod instead of the aizuchi at the timing of responding by using the aizuchi.
  • the above embodiment can also be applied to the field of long-term care for the elderly and the like.
  • In this case, the information processing system 1 may slow down the tempo of response actions such as nods and aizuchi. Further, the information processing system 1 may raise the detection threshold for the time used in detecting the end of an utterance and the like. As a result, the information processing system 1 can control the timing so that the speaker's utterance and the utterance of the terminal device 20 do not overlap. Further, the information processing system 1 may make the changes in facial expression shown by the terminal device 20 larger.
  • In the case of an elderly person with deteriorated hearing, the information processing system 1 may also make large changes in the response, such as in the utterance volume, even if the ambient sound is steady.
  • Further, even when the terminal device 20 interacts with a person other than the speaker (for example, the speaker's family), a response suited to the speaker can be made by changing the response made to the speaker relative to the response made when interacting with that other person.
  • FIG. 17 is a block diagram showing a hardware configuration example of the information processing device according to the embodiment.
  • the information processing device 900 shown in FIG. 17 can realize, for example, the information processing device 10 and the terminal device 20 shown in FIG.
  • the information processing by the information processing device 10 and the terminal device 20 according to the embodiment is realized by the cooperation between the software and the hardware described below.
  • the information processing device 900 includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, and a RAM (Random Access Memory) 903.
  • the information processing device 900 includes a host bus 904a, a bridge 904, an external bus 904b, an interface 905, an input device 906, an output device 907, a storage device 908, a drive 909, a connection port 910, and a communication device 911.
  • the hardware configuration shown here is an example, and some of the components may be omitted. Further, the hardware configuration may further include components other than the components shown here.
  • the CPU 901 functions as, for example, an arithmetic processing device or a control device, and controls all or a part of the operation of each component based on various programs recorded in the ROM 902, the RAM 903, or the storage device 908.
  • the ROM 902 is a means for storing a program read into the CPU 901, data used for calculation, and the like.
  • The RAM 903 temporarily or permanently stores, for example, the program read into the CPU 901, various parameters that change as appropriate when the program is executed, and the like. These are connected to each other by the host bus 904a composed of a CPU bus or the like.
  • the CPU 901, ROM 902, and RAM 903 can realize the functions of the control unit 110 and the control unit 210 described with reference to FIG. 10, for example, in collaboration with software.
  • the CPU 901, ROM 902, and RAM 903 are connected to each other via, for example, a host bus 904a capable of high-speed data transmission.
  • the host bus 904a is connected to the external bus 904b, which has a relatively low data transmission speed, via, for example, the bridge 904.
  • the external bus 904b is connected to various components via the interface 905.
  • The input device 906 is realized by a device through which the speaker inputs information, such as a mouse, a keyboard, a touch panel, a button, a microphone, a switch, or a lever. The input device 906 may also be, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile phone or a PDA that supports the operation of the information processing device 900. Further, the input device 906 may include, for example, an input control circuit that generates an input signal based on the information input by the speaker using the above input means and outputs the input signal to the CPU 901. By operating the input device 906, the speaker of the information processing device 900 can input various data to the information processing device 900 and instruct processing operations.
  • the input device 906 may be formed by a device that detects information about the speaker.
  • The input device 906 may include various sensors such as an image sensor (for example, a camera), a depth sensor (for example, a stereo camera), an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, a sound sensor, a distance measuring sensor (for example, a ToF (Time of Flight) sensor), and a force sensor.
  • the input device 906 includes information on the state of the information processing device 900 itself such as the posture and moving speed of the information processing device 900, and information on the surrounding environment of the information processing device 900 such as brightness and noise around the information processing device 900. May be obtained.
  • the input device 906 receives a GNSS signal (for example, a GPS signal from a GPS (Global Positioning System) satellite) from a GNSS (Global Navigation Satellite System) satellite and receives position information including the latitude, longitude and altitude of the device. It may include a GPS module to measure. Further, regarding the position information, the input device 906 may detect the position by transmission / reception with Wi-Fi (registered trademark), a mobile phone / PHS / smartphone, or short-range communication. The input device 906 can realize, for example, the function of the utterance acquisition unit 211 described with reference to FIG.
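  • As one illustration of how such an input device could feed the utterance acquisition described above, the following is a minimal sketch in Python. The sounddevice library, the sampling rate, and the function name are assumptions chosen for illustration and are not part of the disclosure.

    # Hypothetical sketch of capturing an utterance through the input device;
    # the library choice and parameters are assumptions, not the disclosed design.
    import numpy as np
    import sounddevice as sd

    SAMPLE_RATE = 16_000  # 16 kHz is a common sampling rate for speech processing

    def acquire_utterance(seconds: float = 5.0) -> np.ndarray:
        """Record a mono audio buffer from the default microphone."""
        frames = int(seconds * SAMPLE_RATE)
        audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()  # block until the recording is finished
        return audio.squeeze()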
  • a GNSS signal for example, a GPS signal from a GPS (Global Positioning System) satellite
  • GNSS Global Navigation Satellite System
  • The output device 907 is formed of a device capable of visually or audibly notifying the speaker of acquired information.
  • Such devices include display devices such as CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, laser projectors, LED projectors, and lamps, audio output devices such as speakers and headphones, and printer devices.
  • The output device 907 outputs, for example, the results obtained by various processes performed by the information processing device 900.
  • The display device visually displays the results obtained by various processes performed by the information processing device 900 in various formats such as text, images, tables, and graphs.
  • The audio output device converts an audio signal composed of reproduced audio data, acoustic data, and the like into an analog signal and outputs it audibly.
  • The output device 907 can realize, for example, the function of the presentation unit 220 described with reference to FIG. 10.
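  • As a counterpart to the acquisition sketch above, the following is a minimal sketch of how generated voice data could be audibly presented through the output device. The sounddevice library and the sampling rate are again assumptions for illustration only.

    # Hypothetical sketch of audible presentation through the output device;
    # the library choice and parameters are assumptions, not the disclosed design.
    import numpy as np
    import sounddevice as sd

    SAMPLE_RATE = 16_000

    def present_voice(samples: np.ndarray) -> None:
        """Play back a mono float32 audio buffer on the default output device."""
        sd.play(samples.astype("float32"), samplerate=SAMPLE_RATE)
        sd.wait()  # block until playback finishes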
  • The storage device 908 is a data storage device formed as an example of the storage unit of the information processing device 900.
  • The storage device 908 is realized by, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, a magneto-optical storage device, or the like.
  • The storage device 908 may include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deleting device that deletes the data recorded on the storage medium, and the like.
  • The storage device 908 stores programs executed by the CPU 901, various data, various data acquired from the outside, and the like.
  • The storage device 908 can realize, for example, the function of the storage unit 120 described with reference to FIG. 10.
  • The drive 909 is a reader/writer for a storage medium, and is built in or externally attached to the information processing device 900.
  • The drive 909 reads information recorded on a mounted removable storage medium such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs the information to the RAM 903.
  • The drive 909 can also write information to the removable storage medium.
  • The connection port 910 is a port for connecting an external connection device, such as a USB (Universal Serial Bus) port, an IEEE 1394 port, a SCSI (Small Computer System Interface) port, an RS-232C port, or an optical audio terminal.
  • The communication device 911 is, for example, a communication interface formed by a communication device or the like for connecting to the network 920.
  • The communication device 911 is, for example, a communication card for a wired or wireless LAN (Local Area Network), LTE (Long Term Evolution), Bluetooth (registered trademark), WUSB (Wireless USB), or the like.
  • The communication device 911 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various types of communication, or the like.
  • The communication device 911 can transmit and receive signals and the like to and from the Internet and other communication devices in accordance with a predetermined protocol such as TCP/IP.
  • The communication device 911 can realize, for example, the functions of the communication unit 100 and the communication unit 200 described with reference to FIG. 10.
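  • To make the role of such a communication interface concrete, the following is a minimal sketch of one way the terminal side could send a recognized utterance to the information processing device over TCP/IP and receive a response. The port, the JSON message format, and the helper name are assumptions for illustration and are not defined by the disclosure.

    # Hypothetical sketch of a request/response exchange over TCP/IP;
    # the message format and port are illustrative assumptions.
    import json
    import socket

    def send_utterance(host: str, port: int, text: str) -> dict:
        """Send a recognized utterance and return the decoded response message."""
        payload = json.dumps({"type": "utterance", "text": text}).encode("utf-8")
        with socket.create_connection((host, port)) as sock:
            sock.sendall(payload)
            sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
            response = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                response += chunk
        return json.loads(response.decode("utf-8"))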
  • The network 920 is a wired or wireless transmission path for information transmitted from a device connected to the network 920.
  • The network 920 may include a public network such as the Internet, a telephone line network, or a satellite communication network, various LANs (Local Area Networks) including Ethernet (registered trademark), and a WAN (Wide Area Network).
  • The network 920 may also include a dedicated line network such as an IP-VPN (Internet Protocol-Virtual Private Network).
  • The above is an example of a hardware configuration capable of realizing the functions of the information processing device 900 according to the embodiment.
  • Each of the above components may be realized by using a general-purpose member, or may be realized by hardware specialized for the function of each component. The hardware configuration to be used can therefore be changed as appropriate according to the technical level at the time the embodiment is implemented.
  • As described above, the information processing device 10 performs a process of generating output information based on the result of estimating the state of emotional understanding from the speaker's utterance.
  • The information processing device 10 can also control the operation of the terminal device 20 based on that estimation result.
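  • As a concrete illustration of this estimate-then-respond flow, the following is a minimal sketch in Python. The emotion-word lexicon, the synonym table, and the function names are assumptions chosen for illustration; they do not represent the actual implementation of the state estimation unit 114 or the response generation unit 117.

    # Minimal sketch of estimating emotional understanding from an utterance and
    # generating an empathic response; lexicon, synonyms, and names are assumed.
    from dataclasses import dataclass
    from typing import Optional

    EMOTION_WORDS = {"happy", "sad", "angry", "excited", "tired"}     # assumed lexicon
    SYNONYMS = {"happy": "glad", "sad": "down", "tired": "worn out"}  # assumed table

    @dataclass
    class EstimationResult:
        emotional_understanding: bool   # True if an emotion was understood from the utterance
        emotion_word: Optional[str]     # the emotional word that triggered it, if any

    def estimate_state(utterance: str) -> EstimationResult:
        """Estimate the state of emotional understanding from linguistic information."""
        for word in utterance.lower().split():
            token = word.strip(".,!?")
            if token in EMOTION_WORDS:
                return EstimationResult(True, token)
        return EstimationResult(False, None)

    def generate_response(result: EstimationResult) -> str:
        """Generate output information (here, a reply string) from the estimation result."""
        if result.emotional_understanding:
            echo = SYNONYMS.get(result.emotion_word, result.emotion_word)
            return f"I see, you are feeling {echo}."       # empathic recitation or synonym
        return "I see. Could you tell me a little more?"   # prompt for more detail

    print(generate_response(estimate_state("I am so tired today")))
    # -> I see, you are feeling worn out.

  • In an actual system, such a reply would be rendered as voice information or motion information through a presentation unit such as the presentation unit 220 described above.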
  • Each device described in the present specification may be realized as a single device, or a part or all of the devices may be realized as separate devices.
  • For example, the information processing device 10 and the terminal device 20 shown in FIG. 10 may be realized as independent devices.
  • Alternatively, some of the functions may be realized by a server device connected to the information processing device 10 and the terminal device 20 via a network or the like.
  • In that case, the server device connected via the network or the like may have the function of the control unit 110 of the information processing device 10.
  • Each device described in the present specification may be realized by software, by hardware, or by a combination of software and hardware.
  • The programs constituting the software are stored in advance in, for example, a recording medium (a non-transitory medium) provided inside or outside each device. Each program is then read into the RAM when executed by a computer and executed by a processor such as a CPU.
  • (1) An information processing device comprising: a state estimation unit that estimates a state of emotional understanding, in which an emotion is understood, based on an utterance of a speaker; and a response generation unit that generates output information based on an estimation result by the state estimation unit.
  • (2) The information processing device according to (1) above, wherein the state estimation unit estimates a plurality of states including the emotional understanding.
  • (4) The information processing device according to any one of (1) to (3) above, wherein the state estimation unit estimates the state of the emotional understanding according to an emotional word indicating an emotion in the linguistic information included in the utterance of the speaker.
  • (5) The information processing device according to any one of (1) to (4) above, wherein the state estimation unit estimates the state of the emotional understanding according to linguistic information that expresses an emotion of the speaker and is other than an emotional word indicating an emotion, among the linguistic information included in the utterance of the speaker.
  • (6) The information processing device according to any one of (1) to (5) above, wherein the response generation unit generates, based on information about the end of the utterance of the speaker, the output information based on utterance recognition that recognizes the utterance of the speaker.
  • (7) The information processing device according to any one of (1) to (6) above, wherein the response generation unit generates the output information based on the emotional understanding, based on information about the end of the utterance of the speaker.
  • (8) The information processing device according to (7) above, wherein the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that recites an emotional word indicating an emotion among the linguistic information included in the utterance of the speaker.
  • (9) The information processing device according to (8) above, wherein the response generation unit generates an empathic utterance that recites predetermined linguistic information as a synonym corresponding to the emotional word.
  • (10) The information processing device according to any one of (7) to (9) above, wherein the response generation unit generates, as the output information based on the emotional understanding, an empathic utterance that repeats linguistic information expressing an emotion of the speaker, the linguistic information being other than an emotional word indicating an emotion among the linguistic information included in the utterance of the speaker.
  • (11) The information processing device according to any one of (1) to (10) above, wherein the response generation unit generates the output information based on information regarding a request included in the utterance of the speaker when the information regarding the request satisfies a predetermined condition.
  • (12) The information processing device according to any one of (1) to (11) above, wherein the response generation unit generates the output information urging the speaker to utter information regarding a request when the information regarding the request included in the utterance of the speaker does not satisfy a predetermined condition.
  • (13) The information processing device according to any one of (1) to (12) above, wherein the response generation unit generates voice information or operation information as the output information.
  • (14) The information processing device according to (13) above, wherein the response generation unit generates, as the output information, operation information regarding an expression on a video.
  • (15) The information processing device according to (13) or (14) above, wherein the response generation unit generates, as the output information based on the emotional understanding, voice information or operation information that is more recognizable to the speaker than the output information based on utterance recognition that recognizes the utterance of the speaker.
  • (16) The information processing device according to any one of (13) to (15) above, wherein the response generation unit generates the output information relative to speaker-dependent criteria.
  • (17) The information processing device according to (16) above, wherein the response generation unit generates, as the output information, the voice information at a volume corresponding to the environment around the speaker.
  • An information processing method in which a computer estimates a state of emotional understanding, in which an emotion is understood, based on an utterance of a speaker, and generates output information based on the estimation result.
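  • One of the configurations above adjusts the volume of the voice output to the environment around the speaker. The following is a minimal sketch of such a mapping from an ambient noise estimate to a playback gain; the thresholds and gain values are assumptions for illustration, not values taken from the disclosure.

    # Illustrative mapping from ambient noise to output volume; the numbers
    # below are assumed thresholds, not part of the disclosure.
    def select_output_volume(ambient_noise_db: float) -> float:
        """Map an ambient noise estimate (dB SPL) to a playback gain in [0, 1]."""
        if ambient_noise_db < 40.0:   # quiet room
            return 0.3
        if ambient_noise_db < 60.0:   # ordinary conversation level
            return 0.6
        return 0.9                    # noisy surroundings

    print(select_output_volume(55.0))  # -> 0.6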
  • Information processing system, 10 Information processing device, 20 Terminal device, 100 Communication unit, 110 Control unit, 111 Speaker identification unit, 112 Speech detection unit, 113 Speech recognition unit, 114 State estimation unit, 115 Semantic analysis unit, 116 Request processing unit, 117 Response generation unit, 118 Speech execution unit, 119 Motion presentation unit, 120 Storage unit, 200 Communication unit, 210 Control unit, 211 Speech acquisition unit, 212 Imaging unit, 213 Motion control unit, 220 Presentation unit, 221 Voice presentation unit, 222 Motion presentation unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention realizes natural dialogue in accordance with the intention of a speaker's utterance. According to one embodiment, an information processing device (10) is provided that includes a state estimation unit (114) that estimates a state of emotional understanding for understanding an emotion based on an utterance of the speaker, and a response generation unit (117) that generates output information based on an estimation result obtained by the state estimation unit (114).
PCT/JP2020/047857 2020-01-27 2020-12-22 Dispositif de traitement d'informations, procédé de traitement d'informations et programme de traitement d'informations WO2021153101A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-011190 2020-01-27
JP2020011190A JP2021117371A (ja) 2020-01-27 2020-01-27 情報処理装置、情報処理方法および情報処理プログラム

Publications (1)

Publication Number Publication Date
WO2021153101A1 true WO2021153101A1 (fr) 2021-08-05

Family

ID=77079017

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/047857 WO2021153101A1 (fr) 2020-01-27 2020-12-22 Dispositif de traitement d'informations, procédé de traitement d'informations et programme de traitement d'informations

Country Status (2)

Country Link
JP (1) JP2021117371A (fr)
WO (1) WO2021153101A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023162108A1 (fr) * 2022-02-24 2023-08-31 日本電信電話株式会社 Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence, programme d'apprentissage et programme d'inférence
WO2023162114A1 (fr) * 2022-02-24 2023-08-31 日本電信電話株式会社 Dispositif d'entraînement, dispositif d'inférence, procédé d'entraînement, procédé d'inférence, programme d'entraînement, et programme d'inférence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10312196A (ja) * 1997-03-12 1998-11-24 Seiko Epson Corp 応答音声の音量適正化方法およびその装置
JP2005196134A (ja) * 2003-12-12 2005-07-21 Toyota Central Res & Dev Lab Inc 音声対話システム及び方法並びに音声対話プログラム
JP2017117090A (ja) * 2015-12-22 2017-06-29 株式会社アイ・ビジネスセンター 対話システムおよびプログラム
JP2017162268A (ja) * 2016-03-10 2017-09-14 国立大学法人大阪大学 対話システムおよび制御プログラム
WO2019107144A1 (fr) * 2017-11-28 2019-06-06 ソニー株式会社 Dispositif et procédé de traitement d'informations

Also Published As

Publication number Publication date
JP2021117371A (ja) 2021-08-10

Similar Documents

Publication Publication Date Title
JP6819672B2 (ja) 情報処理装置、情報処理方法、及びプログラム
WO2017168870A1 (fr) Dispositif de traitement d'informations et procédé de traitement d'informations
US20200335128A1 (en) Identifying input for speech recognition engine
KR20200111853A (ko) 전자 장치 및 전자 장치의 음성 인식 제어 방법
CN109040641B (zh) 一种视频数据合成方法及装置
JP6585733B2 (ja) 情報処理装置
WO2020244416A1 (fr) Dispositif électronique de réveil vocal interactif, procédé fondé sur un signal de microphone et support
WO2020244402A1 (fr) Dispositif électronique à réveil par interaction locutoire et procédé reposant sur un signal de microphone, et support
KR102628211B1 (ko) 전자 장치 및 그 제어 방법
JPWO2017130486A1 (ja) 情報処理装置、情報処理方法およびプログラム
JP6548045B2 (ja) 会議システム、会議システム制御方法、およびプログラム
JP6904361B2 (ja) 情報処理装置、及び情報処理方法
WO2021153101A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations et programme de traitement d'informations
WO2020244411A1 (fr) Dispositif électronique et procédé de réveil à interaction vocale basé sur un signal de microphone, et support
JP6904357B2 (ja) 情報処理装置、情報処理方法、及びプログラム
CN111883135A (zh) 语音转写方法、装置和电子设备
WO2021149441A1 (fr) Dispositif et procédé de traitement d'informations
WO2020079918A1 (fr) Dispositif de traitement d'informations et procédé de traitement d'informations
KR20210100831A (ko) 인공지능 기반 수어통역 서비스 제공 시스템 및 방법
JP2018075657A (ja) 生成プログラム、生成装置、制御プログラム、制御方法、ロボット装置及び通話システム
WO2019073668A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations et programme
Panek et al. Challenges in adopting speech control for assistive robots
WO2018079018A1 (fr) Dispositif de traitement de l'information et procédé de traitement de l'information
JP7316971B2 (ja) 会議支援システム、会議支援方法、およびプログラム
KR20210100832A (ko) 사용자의 감정상태를 판단하는 인공지능 기반 수어통역 서비스 제공 시스템 및 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916755

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916755

Country of ref document: EP

Kind code of ref document: A1