WO2017200079A1

WO2017200079A1 - Dialog method, dialog system, dialog device, and program

Info

Publication number: WO2017200079A1
Application number: PCT/JP2017/018794
Authority: WO
Inventors: 弘晃杉山; 豊美目黒; 淳司大和; 雄一郎吉川; 石黒　浩; 尊優飯尾; 浩平小川
Original assignee: 日本電信電話株式会社; 国立大学法人大阪大学
Priority date: 2016-05-20
Filing date: 2017-05-19
Publication date: 2017-11-23
Also published as: JPWO2017200079A1; JP6682104B2

Abstract

The present invention limits user speech to a prescribed range, and keeps a dialog with the user continuing for a long time. A dialog system 10 includes at least an input unit 1 for receiving user speech, and a presentation unit 5 for presenting speech. The input unit 1 receives first user speech spoken by a user. A presentation unit 5-1 presents limiting speech that limits the user speech, which has been determined on the basis of the first user speech, to a prescribed range. The input unit 1 receives second user speech spoken by the user after the limiting speech.

Description

Dialogue method, dialogue system, dialogue apparatus, and program

The present invention relates to a technology in which a computer interacts with a human using a natural language, which can be applied to a robot that communicates with a human.

In recent years, research and development of robots that communicate with people has progressed, and such robots have been put to practical use in various fields. For example, in the field of communication therapy, a robot can serve as a conversation partner for a person who feels lonely. Specifically, in a nursing home for the elderly, a robot can listen to a resident and thereby help ease the resident's loneliness; moreover, the sight of the resident conversing with the robot can create opportunities for conversation between the resident and family members or caregivers. As another example, in the field of communication training, a robot can serve as a practice partner. Specifically, at a foreign language learning facility, having a robot act as a practice partner for learners lets foreign language learning proceed efficiently. As yet another example, in application as an information presentation system, the basic form is to let robots talk to each other, but by having them speak to people from time to time, people can take part in the conversation without becoming bored, and information can be presented in a form that is easy for people to accept. Specifically, efficient presentation of information such as news, product introductions, introductions of trivia and knowledge, and education (for example, childcare and education for children, liberal arts instruction for adults, and moral education) can be expected when people have spare time at meeting places in town, bus stops, station platforms, and the like, or when they have room to take part in a dialogue at home or in a classroom. Furthermore, in application as an information collection system, a robot can collect information while talking to a person. Because the sense of conversation is maintained through communication with the robot, information can be collected without giving people the oppressive feeling of being questioned. Specifically, this is expected to be applied to personal information surveys, market surveys, product evaluations, preference surveys for recommended products, and the like. As described above, various applications of human-robot communication are expected, and the realization of a robot that can converse with users more naturally is awaited. In addition, with the spread of smartphones, chat services such as LINE (registered trademark), which let users enjoy conversation with other people by chatting in almost real time, are also in use. If the technology for conversing with a robot is applied to such a chat service, it becomes possible to realize a chat service that converses with the user more naturally even when no chat partner is available.

In this specification, hardware that serves as the user's conversation partner in these services, such as a robot or a chat partner, and computer software that causes a computer to function as such hardware are collectively called an agent. Because the agent is the user's conversation partner, the agent may be anthropomorphized, personalized, or given a personality or individuality, like a robot or a chat partner.

The key to the realization of these services is a technology that enables agents realized by hardware and computer software to naturally interact with humans.

As an example of the agent described above, there is a voice dialogue system that recognizes a user's utterance, understands and infers the intention of the utterance, and responds appropriately as described in Non-Patent Document 1, for example. Research on speech dialogue systems has been actively promoted with the progress of speech recognition technology, and has been put to practical use in, for example, automatic speech response systems.

As another example of the above-mentioned agent, there is a scenario dialogue system that converses with a user on a specific topic according to a predetermined scenario. In a scenario dialogue system, the dialogue can be continued as long as it develops along the scenario. For example, the dialogue system described in Non-Patent Document 2 carries out a dialogue between a user and a plurality of agents, including interruptions by an agent and exchanges between agents. For example, an agent asks the user a question prepared in the scenario, and when the user's answer to the question corresponds to an option prepared in the scenario, the agent makes the utterance corresponding to that option. That is, a scenario dialogue system is a dialogue system in which an agent makes utterances based on a scenario stored in advance in the system. In this dialogue system, after an agent asks the user a question and receives the answer, the agent can pass over the answer with a noncommittal response regardless of its content, or the topic can be changed by an agent's interruption. Thus, even when the user's utterance deviates from the original topic, the system can respond in a way that does not make the user feel that the conversation has broken down.

As another example of the above-described agent, there is a chat dialogue system in which the user and the agent carry on a natural dialogue, with the agent speaking in accordance with the content of the user's utterance. For example, the dialogue system described in Non-Patent Document 3 realizes a chat dialogue between the user and the system by having the system speak according to rules described in advance, using words included in the utterances of the user or the agent as triggers, while giving more weight to words specific to the context across the multiple exchanges between the user and the agent. The rules used by a chat dialogue system are not limited to those described in advance; they may be generated automatically based on the content of the user's utterance, based on the utterance made immediately before by the user or the agent or an utterance made in its vicinity, or based on utterances that include at least such an utterance. Non-Patent Document 3 describes a technique for automatically generating rules based on words that have a co-occurrence or dependency relationship with words included in the user's utterance. Further, for example, the dialogue system described in Non-Patent Document 4 reduces the cost of rule generation by combining manually described rules with rules described by a statistical utterance generation method. Unlike a scenario dialogue system, a chat dialogue system does not have the agent speak according to a prepared scenario. Therefore, the situation in which, depending on the user's utterance, the agent's utterance fails to correspond to it does not arise, and the agent can make an utterance based at least on the content of the user's utterance, on the utterance made immediately before by the user or the agent or in its vicinity, or on utterances that include at least such an utterance. That is, a chat dialogue system is a dialogue system in which the agent speaks based on at least these utterances. In these chat dialogue systems, it is possible to respond explicitly to the user's utterance.

However, because users produce a wide variety of complex utterances, it is difficult for a conventional speech dialogue system to accurately understand the meaning and content of every user utterance. If the speech dialogue system cannot accurately understand the user's utterance, it cannot respond appropriately to that utterance. In a situation where the user and the speech dialogue system converse one-on-one, if the system fails to respond appropriately, the user finds it stressful to continue the dialogue, which can cause the dialogue to be interrupted or to break down.

An object of the present invention is to provide a dialogue technique that can limit a user's utterance to a predetermined range and continue a dialogue for a long time in view of the above points.

In order to solve the above-described problem, a dialogue method according to a first aspect of the present invention is a dialogue method performed by a dialogue system that converses with a user, and includes: a first reception step in which an input unit receives a first user utterance, which is an utterance spoken by the user; a presentation step in which a presentation unit presents a limited utterance, which is an utterance, determined based on the first user utterance, for limiting the user's utterance to a predetermined range; and a second reception step in which the input unit receives a second user utterance, which is an utterance spoken by the user after the limited utterance.

A dialogue system according to a second aspect of the present invention is a dialogue system that converses with a user, and includes: an input unit that receives a first user utterance, which is an utterance spoken by the user, and a second user utterance, which is an utterance spoken by the user after a limited utterance that is an utterance for limiting the user's utterance to a predetermined range; an utterance determination unit that determines the limited utterance based on the first user utterance; and a presentation unit that presents the limited utterance determined by the utterance determination unit.

A dialogue device according to a third aspect of the present invention is a dialogue device that determines an utterance to be presented by a dialogue system including at least an input unit that receives a user's utterance and a presentation unit that presents an utterance, and includes an utterance determination unit that determines, based on a first user utterance that is an utterance spoken by the user, a limited utterance that is an utterance for limiting the user's utterance to a predetermined range.

According to the present invention, by presenting, before the user speaks, an utterance for limiting the user's utterance to a predetermined range, the user's utterance can be limited to that range, and a dialogue system and a dialogue device that can keep a dialogue with the user going for a long time can be realized.

FIG. 1 is a diagram illustrating a functional configuration of a dialogue system using a humanoid robot. FIG. 2 is a diagram illustrating a processing procedure of the interactive method according to the first embodiment. FIG. 3 is a diagram illustrating a processing procedure of the dialogue method according to the second embodiment. FIG. 4 is a diagram illustrating a functional configuration of a dialogue system using group chat.

Hereinafter, embodiments of the present invention will be described in detail. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

<First embodiment>
The dialogue system of the first embodiment is a system in which a plurality of humanoid robots cooperate to converse with a user. As shown in FIG. 1, the dialogue system 10 includes an input unit 1, a speech recognition unit 2, an utterance determination unit 3, a speech synthesis unit 4, and a presentation unit 5. The dialogue method of the first embodiment is realized by the dialogue system 10 performing the processing of each step described later. As shown in FIG. 1, the part of the dialogue system 10 consisting of the speech recognition unit 2, the utterance determination unit 3, and the speech synthesis unit 4 is the dialogue device 11.

It has been confirmed that, when humans converse smoothly, a phenomenon occurs in which the speakers' utterances take on similar characteristics (see, for example, Reference 1). This phenomenon is called the pull-in (entrainment) phenomenon. It has also been confirmed that the linguistic pull-in phenomenon occurs between a human and a robot (see, for example, Reference 2).
[Reference 1] Condon, William S., and Louis W. Sander, “Neonate movement is synchronized with adult speech: Interactional participation and language acquisition”, Science, vol. 183, issue 4120, pp. 99-101, 1974
[Reference 2] Takao Iio et al., “Vocabulary Attraction: Can Robots Inspire Human Vocabulary?”, IPSJ Journal, vol. 51, no. 2, pp. 277-289, 2010

The dialogue technique of the present invention uses the pull-in phenomenon described above: before the user speaks, the dialogue system presents to the user an action corresponding to a condition under which the system can acquire the utterance, thereby drawing the user's utterance into a range that satisfies that condition. As a result, it is possible to avoid a situation where the dialogue is interrupted because the dialogue system cannot understand the user's utterance, and the dialogue can be continued for a long time.

The following is an example of drawing in a user's utterance in a dialogue system in which a user and a plurality of agents converse. First, the first agent makes an utterance that asks the other party for an answer (for example, a question-type utterance). Next, the second agent makes an utterance that the dialogue system can easily understand (hereinafter referred to as a pull-in utterance) and waits for the user's utterance. The user's subsequent utterance is drawn toward the second agent's immediately preceding utterance and takes on characteristics similar to it. In this example, the dialogue system is assumed to produce an utterance that it can easily understand as the action corresponding to the condition for acquiring the utterance, but the action is not limited to an utterance and may be non-verbal behavior such as gaze, body orientation, or limb movement.
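The turn order described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Turn` record, the function name `run_pull_in_exchange`, and the `listen` callback are hypothetical names introduced here, not part of the specification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str  # "R1" (first agent), "R2" (second agent), or "H" (user)
    text: str


def run_pull_in_exchange(question: str, pull_in: str,
                         listen: Callable[[], str]) -> List[Turn]:
    """Run one exchange: R1 asks, R2 gives the pull-in utterance, then wait for H."""
    turns = [
        Turn("R1", question),  # utterance that asks for an answer
        Turn("R2", pull_in),   # pull-in utterance presented before the user speaks
    ]
    # Only after the pull-in utterance does the system wait for the user.
    turns.append(Turn("H", listen()))
    return turns
```

In a real system `listen` would wrap the input unit and speech recognition; here it can be any function that returns the user's reply as text.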

As a method for determining the pull-in utterance, a method described in advance as a rule can be considered. Specifically, there is a rule for deciding the utterance content by filling an appropriate word in a template with a blank. Examples of the rule creation method include a manual creation method and a method using a known failure detection technique (for example, see Reference 3). In the method using the failure detection technique, it is determined whether or not the dialogue has failed for the utterance of the second agent following the utterance of the first agent. If it is determined that the dialogue has not failed at this time, it can be said that the utterance of the second agent is an utterance easy to understand by the dialogue system, and is appropriate as a pull-in utterance.
[Reference 3] Hiroaki Sugiyama, “Detection of Chat Dialogue Failure by Combining Data with Different Characteristics”, 6th Dialogue System Symposium (SIG-SLUD), Artificial Intelligence Society, pp. 51-56, 2015
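As a rough illustration of the rule-based approach, the sketch below fills a blank in a template with candidate words and keeps only the candidates that a failure detector does not flag. The blank marker `___`, the function names, and the `dialogue_fails` callback are assumptions made for this sketch; the actual failure detection technique of Reference 3 is not reproduced here.

```python
from typing import Callable, Iterable, List


def fill_template(template: str, word: str) -> str:
    """Fill the blank (written here as '___') in a template with a word."""
    return template.replace("___", word)


def select_pull_in_utterances(
    first_utterance: str,
    template: str,
    candidate_words: Iterable[str],
    dialogue_fails: Callable[[str, str], bool],
) -> List[str]:
    """Keep the filled templates that the detector does not judge as a failure.

    dialogue_fails(first_utterance, candidate) is a hypothetical stand-in for
    a dialogue failure detector; it returns True when the candidate reply to
    the first agent's utterance is judged to break the dialogue.
    """
    kept = []
    for word in candidate_words:
        candidate = fill_template(template, word)
        if not dialogue_fails(first_utterance, candidate):
            kept.append(candidate)  # judged easy to understand, so usable
    return kept
```

Candidates that survive the filter could then be stored as rules and reused as pull-in utterances for the same first utterance.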

Also, instead of preparing the rules in advance, a method of determining the contents of the pull-in utterance each time while performing a dialogue is also conceivable. In this method, failure detection is performed on the dialogue history up to that point in the middle of the dialogue, and the utterance of the second agent is determined so that the utterance of the next dialogue device does not cause the dialogue failure. With this method, since a longer dialog history can be used, it is possible to determine the content of the pull-in utterance more appropriate for the utterance.
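One possible way to picture this on-the-fly determination is to score each candidate pull-in utterance together with the dialogue history and pick the one least likely to cause a failure. The sketch below assumes the failure detector returns a numeric score (higher meaning more likely to fail), which the specification does not state; `choose_pull_in` and `toy_failure_score` are hypothetical names, and the toy scorer is only a placeholder for a real detector.

```python
from typing import Callable, List, Sequence


def toy_failure_score(utterances: Sequence[str]) -> float:
    """Toy stand-in for a failure detector (an assumption, not a real method).

    Scores the last utterance by the fraction of its words that never
    appeared earlier in the dialogue; higher means more likely to fail.
    """
    earlier = {w for u in utterances[:-1] for w in u.lower().split()}
    words = utterances[-1].lower().split()
    if not words:
        return 1.0
    unseen = [w for w in words if w not in earlier]
    return len(unseen) / len(words)


def choose_pull_in(
    history: Sequence[str],
    candidates: Sequence[str],
    failure_score: Callable[[List[str]], float] = toy_failure_score,
) -> str:
    """Pick the candidate whose addition to the history scores lowest."""
    return min(candidates, key=lambda c: failure_score(list(history) + [c]))
```

Because the whole history is passed to the scorer, a detector that looks further back than the last utterance can be plugged in without changing `choose_pull_in`.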

The dialogue device 11 is a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The dialogue device 11 executes each process under the control of the central processing unit, for example. Data input to the dialogue device 11 and data obtained in each process are stored in, for example, the main storage device, and the data stored in the main storage device is read out as necessary and used for other processing. At least a part of each processing unit of the dialogue device 11 may be configured by hardware such as an integrated circuit.

The input unit 1 is an interface for the dialog system 10 to acquire the user's utterance. In other words, the input unit 1 is an interface for the user to input an utterance to the dialogue system 10. For example, the input unit 1 is a microphone that picks up a user's speech and uses it as an audio signal. The input unit 1 inputs the voice signal of the collected user's uttered voice to the voice recognition unit 2.

The voice recognition unit 2 converts the voice signal of the user's uttered voice collected by the input unit 1 into a text representing the content of the user's utterance. The voice recognition unit 2 inputs text representing the user's utterance content to the utterance determination unit 3. Any existing speech recognition technology may be used as the speech recognition method, and an optimal method may be selected as appropriate in accordance with the usage environment.

The utterance determination unit 3 determines text representing the content of an utterance from the dialogue system 10 based on the input text representing the content of the user's utterance. The utterance determination unit 3 inputs the text representing the determined utterance content to the speech synthesis unit 4. When the dialogue system 10 performs non-verbal behavior instead of a pull-in utterance, the utterance determination unit 3 determines, based on the input text representing the content of the user's utterance, information representing the content of the non-verbal behavior to be presented to the user by the dialogue system 10. In this case, the utterance determination unit 3 inputs the information representing the content of the determined behavior to the presentation unit 5.

The speech synthesis unit 4 converts the text representing the utterance content determined by the utterance determination unit 3 into a speech signal representing the utterance content. The speech synthesis unit 4 inputs the speech signal representing the utterance content to the presentation unit 5. Any existing speech synthesis technology may be used as the speech synthesis method, and an optimal method may be selected as appropriate for the usage environment.

The presentation unit 5 is an interface for presenting to the user the utterance content or non-verbal behavior determined by the utterance determination unit 3. For example, the presentation unit 5 is a humanoid robot made in the shape of a human. This humanoid robot presents the utterance by outputting, for example from a speaker mounted on its head, the speech signal obtained when the speech synthesis unit 4 converts the text representing the utterance content determined by the utterance determination unit 3. The humanoid robot also presents an action, that is, performs non-verbal behavior by moving its body according to the information representing the content of the non-verbal behavior determined by the utterance determination unit 3. When the presentation unit 5 is a humanoid robot, one humanoid robot is prepared for each personality participating in the dialogue. In the following, as an example in which two personalities participate in the dialogue, it is assumed that there are two humanoid robots 5-1 and 5-2.

The input unit 1 may be integrated with the presentation unit 5. For example, when the presentation unit 5 is a humanoid robot, a microphone can be mounted on the head of the humanoid robot and used as the input unit 1.
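The data flow among the units described above (input unit 1, speech recognition unit 2, utterance determination unit 3, speech synthesis unit 4, presentation unit 5) can be sketched as a simple chain of function calls. The function `dialogue_turn` and its four callable parameters are hypothetical names for this sketch; each callable stands in for a unit whose concrete implementation the specification leaves open.

```python
from typing import Callable


def dialogue_turn(
    audio_in: bytes,
    recognize: Callable[[bytes], str],   # stands in for speech recognition unit 2
    determine: Callable[[str], str],     # stands in for utterance determination unit 3
    synthesize: Callable[[str], bytes],  # stands in for speech synthesis unit 4
    present: Callable[[bytes], None],    # stands in for presentation unit 5
) -> str:
    """Process one user utterance picked up by the input unit and reply to it."""
    user_text = recognize(audio_in)      # speech signal -> text
    reply_text = determine(user_text)    # text -> text of the system utterance
    reply_audio = synthesize(reply_text) # text -> speech signal
    present(reply_audio)                 # e.g. play from the robot's speaker
    return reply_text
```

Passing the units in as callables mirrors the statement that any existing recognition or synthesis technology may be substituted.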

Hereinafter, the processing procedure of the dialogue method of the first embodiment will be described with reference to FIG. 2.

In step S11, the humanoid robot 5-1 outputs from its speaker a voice representing the content of a first utterance, which is a certain utterance. The text representing the content of the first utterance may be arbitrarily selected by the utterance determination unit 3 from, for example, fixed phrases stored in advance in a storage unit (not shown) in the utterance determination unit 3, or may be determined according to the content of the utterances up to immediately before. As the technique for determining the utterance content according to the content of the utterances up to immediately before, a technique used in a conventional dialogue system may be used; for example, the scenario dialogue system described in Non-Patent Document 2 or the chat dialogue system described in Non-Patent Documents 3 and 4 can be used. When the utterance determination unit 3 uses the technique used in a scenario dialogue system, for example, for a dialogue consisting of about the last five utterances, the utterance determination unit 3 selects a scenario for which the inter-word distance between the words included in each utterance and the focus words constituting each utterance, on the one hand, and the words and focus words included in each scenario stored in a storage unit (not shown) in the utterance determination unit 3, on the other, is shorter than a predetermined distance, and determines the text representing the content of the first utterance by selecting text included in the selected scenario. When the utterance determination unit 3 uses the technique used in a chat dialogue system, the utterance determination unit 3 may, for example, determine the text representing the content of the first utterance according to rules that are described in advance and stored in a storage unit (not shown) in the utterance determination unit 3, using a word included in the user's utterance as a trigger, or may automatically generate a rule based on words that have a co-occurrence or dependency relationship with words included in the user's utterance and determine the text representing the content of the first utterance according to that rule.
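The scenario selection described for step S11 can be illustrated with a small sketch. The specification does not define how the inter-word distance is computed, so this sketch substitutes a Jaccard distance between word sets and ignores focus words; both choices, and the names `jaccard_distance` and `select_scenarios`, are assumptions made only for illustration.

```python
from typing import Dict, List, Sequence, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Distance between two word sets: 0.0 for identical, 1.0 for disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def word_set(utterances: Sequence[str]) -> Set[str]:
    """Lower-cased whitespace tokens of all utterances (a crude tokenizer)."""
    return {w for u in utterances for w in u.lower().split()}


def select_scenarios(
    recent_utterances: Sequence[str],
    scenarios: Dict[str, List[str]],
    max_distance: float,
    window: int = 5,  # "about the last five utterances"
) -> List[str]:
    """Return names of scenarios closer than max_distance to the recent dialogue."""
    context = word_set(recent_utterances[-window:])
    return [
        name
        for name, lines in scenarios.items()
        if jaccard_distance(context, word_set(lines)) < max_distance
    ]
```

Text for the first utterance would then be taken from one of the returned scenarios; a distance based on word embeddings or focus-word extraction could replace `jaccard_distance` without changing the selection logic.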

In step S12, the humanoid robot 5-2 performs an action corresponding to a condition for acquiring the user's utterance in response to the first utterance (hereinafter referred to as a pull-in action). Pull-in actions include the pull-in utterance described above as well as non-verbal behavior such as gaze, body orientation, and limb movement. Like the content of the first utterance, the content of the pull-in action may be arbitrarily selected by the utterance determination unit 3 from, for example, standard actions that are determined in advance and stored in a storage unit (not shown) in the utterance determination unit 3, or may be determined according to the content of the utterances up to immediately before. The conditions for acquiring the user's utterance can be classified into A. conditions regarding non-verbal behavior in the user's utterance, and B. conditions regarding the content of the user's utterance. A. The conditions regarding non-verbal behavior in the user's utterance include A1. conditions regarding the timing of the user's utterance, that is, conditions for preventing the user from speaking earlier than the timing at which the speech recognition unit 2 can accept the user's utterance, and A2. conditions regarding the volume and direction of the user's utterance, that is, conditions for preventing the input unit 1 from picking up the user's voice at a volume at which the speech recognition unit 2 cannot recognize the utterance. B. The conditions regarding the content of the user's utterance are conditions for enabling the speech recognition unit 2 to recognize the user's utterance with higher accuracy, or for preventing the content of the user's utterance from falling outside the range assumed in the running scenario so that the scenario cannot be continued.

Specifically, actions corresponding to A1. the condition regarding the timing of the user's utterance include A1-1. the humanoid robot first gives a model answer at the desired timing, and A1-2. the humanoid robot moves its line of sight so that the user speaks at the desired timing. An action corresponding to A2. the condition regarding the volume and direction of the user's utterance is, for example, that the humanoid robot first gives a model answer at a higher volume for a user whose voice is quiet. Specifically, actions corresponding to B. the condition regarding the content of the user's utterance include B-1. the humanoid robot first makes an utterance whose length is controlled to a desired length, B-2. the humanoid robot first makes an utterance whose level of detail is controlled to a desired level, B-3. the humanoid robot first makes an utterance whose grammatical difficulty is controlled to a desired level, B-4. the humanoid robot first makes an utterance in which the presence or absence of proper nouns is controlled, and B-5. the humanoid robot first makes an utterance whose degree of colloquialism is controlled to a desired level.

The specific examples of actions corresponding to the conditions for acquiring the user's utterance described above can be combined arbitrarily. For example, as an action corresponding to both A1. the timing of the user's utterance and B. the content of the user's utterance, the humanoid robot may first make B-1. an utterance whose length is controlled to a desired length at A1-1. the desired timing. Also, for example, as an action corresponding to B. the content of the user's utterance, the humanoid robot may first make an utterance in which B-2. the level of detail of the utterance and B-4. the presence or absence of proper nouns in the utterance are controlled at the same time.
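To make the classification and its combinations concrete, the conditions can be represented as labels attached to a pull-in action, with a combined action carrying the union of the labels. This is only a data-structure sketch; the `Target` enumeration, the `PullInAction` record, and `combine` are hypothetical names, and the label values simply reuse the identifiers A1, A2, and B-1 to B-5 from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Target(Enum):
    """Conditions a pull-in action can address, labelled as in the text."""
    A1_TIMING = "A1"
    A2_VOLUME_DIRECTION = "A2"
    B1_LENGTH = "B-1"
    B2_DETAIL = "B-2"
    B3_GRAMMAR = "B-3"
    B4_PROPER_NOUNS = "B-4"
    B5_COLLOQUIALISM = "B-5"


@dataclass(frozen=True)
class PullInAction:
    description: str
    targets: FrozenSet[Target]


def combine(first: PullInAction, second: PullInAction) -> PullInAction:
    """Merge two actions into one that addresses both sets of conditions."""
    return PullInAction(
        description=f"{first.description} + {second.description}",
        targets=first.targets | second.targets,
    )
```

For instance, combining a length-controlled model answer (B-1) with an answer given at the desired timing (A1) yields a single action labelled with both conditions.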

Hereinafter, the behavior corresponding to the conditions for acquiring the user's utterance will be described with specific examples. Here, R represents a humanoid robot, and H represents a user. The number following R is the identifier of the humanoid robot. “R1” represents that the humanoid robot 5-1 speaks, and “R2” represents that the humanoid robot 5-2 speaks. Note that who the humanoid robot intends to talk to may be expressed by, for example, the movement of the head or line of sight of the humanoid robot, or may not be expressed.

A1-1. A specific example in the case where the humanoid robot first gives a model answer at a desired timing is shown below. This is done to prevent problems such as speech recognition of the user's utterance failing, or a recognition result that lacks the beginning of the user's utterance, which can occur when, for example, the timing at which the speech recognition unit 2 starts speech recognition is delayed.

R1: “What food do you like?” (* Question = first utterance)
R2: “Soba” (* Model answer = Action)
H: “Ramen”

A1-2. A specific example in the case where the humanoid robot moves its line of sight so as to be at a desired timing is shown below. This is also performed in order to avoid a problem caused by the timing at which the speech recognition unit 2 starts speech recognition being delayed, as in A1-1.

R1: “What food do you like?” (* Question = first utterance)
R2: (Look at the user) (* Behavior)
H: “Ramen”

In the above example, the humanoid robot that directs its line of sight to the user is R2, but R1 may direct its line of sight to the user instead, or a humanoid robot other than R1 and R2 may do so.

B-1. A specific example in the case where the humanoid robot first performs an utterance in which the length of the utterance is controlled to a desired length is shown below. If the user's utterance is too long or too short, the recognition rate of the voice recognition unit 2 may decrease. Therefore, in order to draw the user to speak at an appropriate length, the humanoid robot utters a model answer of a desired length before the user speaks.

The following is an example in which, as in conventional systems, no pull-in action is performed and the dialogue fails because the user's utterance is too short.

R1: “What kind of food do you like?”
H: “Soba” (* Since the user utters only one word, context information cannot be used and speech recognition is difficult.)

The following is an example in which, as in conventional systems, no pull-in action is performed and the dialogue fails because the user's utterance is too long.

R1: “What kind of food do you like?”
H: “Oh, it's a recent ramen shop in the area of Joyo, but it's pretty good, but it's lined up.” (* The user's utterance contains too many words, so it is difficult to recognize all of them without error.)

The following is an example in which a humanoid robot utters a model answer prior to the user's utterance.

R1: “What kind of food do you like?”
R2: “I like ramen.”
H: “I like soba.” (* The recognition rate improves because the user's utterance is drawn into the humanoid robot's model answer and surrounding words are added.)

B-2. A specific example in the case where the humanoid robot first performs an utterance in which the level of detail of the utterance is controlled to a desired level is shown below. If the user's utterance is too detailed or too simple, an appropriate response may not be generated. Therefore, in order to draw the user to speak at an appropriate level of detail, the humanoid robot utters an exemplary answer at the desired level of detail before the user speaks.

The following is an example of an unsuccessful dialogue because the user's utterance is too simple for the utterance “Tonight's schedule?”

R1: "What are your plans for tonight?"
H: “Drink and sleep”
R1: “Do you want to drink water?” (* Because part of the user's utterance is omitted, the meaning could not be correctly interpreted.)

The following is an example in which, as in conventional systems, no pull-in action is performed and the dialogue fails because the user's utterance is too detailed.

R1: "What are your plans for tonight?"
H: “Because I tend to sink, I go to a bar with a sister and play with a parlor”
R1: “Where do you sink?” (* I could not understand where the topic of the user's utterance was focused.)

R1: "What are your plans for tonight?"
R2: “I will go to the cinema to see a movie.”
H: “I'm going to have a drink at a bar.” (* Because the user's utterance is drawn into the model answer of the humanoid robot and contains words that identify the topic at an appropriate granularity, the utterance can be interpreted.)

The following is an example in which, as in conventional systems, no pull-in action is performed and the dialogue fails because the user's utterance in response to “I went on a trip during this time” is too simple.

R1: “I went on a trip during this time”
H: "Which area?"
R1: "It's around" (* User's utterance is only a general word and the focus of the topic could not be found.)

R1: “I went on a trip during this time”
H: “I went to Saariselka”
R1: (silence) (* Since the topic of the user's utterance is too detailed, an appropriate response could not be generated.)

R1: “I went on a trip during this time”
R2 (→ R1): “Did you go to America?”
R1 (→ H): “Yeah. Where have you gone?”
H: “I went to Finland” (* The topic of the user's utterance is reasonably detailed and can generate a response.)
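
The turn order used in the B-2 examples can be sketched as follows. This is only a minimal illustration, not the implementation of the invention: the function name `build_pull_in_turns` and the tuple layout are assumptions introduced here. Only the ordering (a question by R1, a model answer by R2, and then the user's turn) is taken from the examples above.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def build_pull_in_turns(question: str, model_answer: str) -> List[Turn]:
    """Return the robot turns that precede the user's answer.

    R1 asks the question and R2 gives a model answer at the desired
    level of detail; the user's turn comes only after both.
    (Hypothetical helper, named here for illustration.)
    """
    return [("R1", question), ("R2", model_answer)]


turns = build_pull_in_turns(
    "What are your plans for tonight?",
    "I will go to the cinema to see a movie.",
)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Keeping the model answer as a separate turn by a second robot mirrors the examples, in which the user hears an answer of the desired granularity before speaking.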

B-3. A specific example in the case where the humanoid robot first performs an utterance in which the grammar difficulty level is controlled to a desired level is shown below. If the user's utterance does not have the desired grammar, an appropriate response may not be generated. For this reason, the humanoid robot utters a model answer with the desired grammar before the user speaks, so that the user speaks with the grammar of the desired difficulty level.

The following is an example in which the predicate-argument structure is used as the key for utterance generation. If the user is not drawn toward the model answer, the user's utterance may be fragmented as in the examples described above, and its content may not be interpretable. In the following example, NP represents a noun phrase, Adj represents an adjective phrase, and VP represents a verb phrase.

R1: “What kind of food do you like?”
R2: “I am (NP) / Slightly (Adj) / Ramen is (NP) / I like (VP)”
H: “I am (NP) / fresh (Adj) / Soba is (NP) / I like (VP)”

The following is an example where nouns are used as utterance generation keys.

R1: “What kind of food do you like?”
R2: “Easy (Adj) / Ramen (NP)”
H: “Refreshing (Adj) / Soba kana (NP)”

B-4. A specific example in the case where the humanoid robot first performs an utterance in which the presence or absence of a proper noun during utterance is controlled is shown below. When proper nouns are included in the user's utterance, the topic can be easily identified, so that the subsequent dialogue is often easy to handle.

The following is an example when there is no proper noun.

R1: “What kind of ramen do you like?”
R2: “Do you like being light?”
H: “I ’m so tired”

The following is an example when there are proper nouns.

R1: “What kind of ramen do you like?”
R2: “I like ●
H: “I like ▲▲ shops”

B-5. A specific example in the case where the humanoid robot first performs an utterance in which the degree of spokenness of the utterance is controlled to a desired level is shown below. Here, “spokenness” includes, for example, missing particles, changes in endings, increases in polysemy, increases in colloquial interjections and adverbs, and the like. The lower the degree of spokenness, the higher the accuracy of speech recognition and speech understanding. On the other hand, the higher the degree of spokenness, the more frank impression can be given to the user.

The following is an example when the degree of colloquialism is low.

R1: “What kind of ramen do you like?”
R2: “I like light noodles”
H: “I like thick ramen”

The following is an example when the degree of colloquialism is high.

R1: “What kind of ramen do you like?”
R2: “I like it lightly or I like it”
H: “Well, it ’s all right”

In the latter example, the utterance of the humanoid robot R2 includes a missing particle “ga”, a colloquial ending, the replacement of “ramen” with “no”, and so on. The user's utterance in turn includes the added interjection “Well”, the added comparative adverb “Yappa”, a colloquial ending, and the replacement of “ramen” with “no”, so its degree of colloquialism is high.

In step S13, the microphone 1 accepts an utterance uttered by the user after the pull-in action. Hereinafter, this utterance is referred to as user utterance. The speech recognition unit 2 recognizes the speech signal of the user utterance collected by the microphone 1 and inputs the text obtained as a speech recognition result to the utterance determination unit 3 as text representing the content of the user utterance.

Thereafter, the dialogue between the user and the dialogue system 10 continues on the topic of the user utterance. For example, so that a dialogue following the scenario selected by the technology used in scenario dialogue systems is carried out between the user and the dialogue system 10, the dialogue system 10 outputs from the speaker a voice representing the content of the scenario utterance determined by that technology. Alternatively, for example, the dialogue system 10 outputs from the speaker a voice representing the content of the chat utterance determined, based on the user's utterance, by the technology used in chat dialogue systems. The subsequent utterances may be made by any one humanoid robot or by multiple humanoid robots.

<Second embodiment>
In the first embodiment, the pull-in phenomenon is used to draw the user's utterance into a range that satisfies the conditions under which the dialogue system 10 can acquire utterances, so that the dialogue system 10 can accurately understand the user's utterance. In the second embodiment, a configuration is described in which the user's utterance is limited to a desired range without using the pull-in phenomenon. If the user's utterance can be limited to the range assumed by the dialogue system 10, the dialogue system 10 is likely to be able to respond appropriately to it. For example, if the user can be made to always answer with affirmation or denial (“Yes / No”), the dialogue system 10 can always respond appropriately to the user's utterance.

Hereinafter, the processing procedure of the dialogue method of the second embodiment will be described with reference to FIG.

In step S21, the microphone 1 accepts an utterance uttered by the user. Hereinafter, this utterance is referred to as a first user utterance. The voice recognition unit 2 recognizes the voice signal of the first user utterance collected by the microphone 1 and inputs the text obtained as the voice recognition result to the utterance determination unit 3 as text representing the content of the first user utterance.

In step S22, the humanoid robot 5-1 outputs, from the speaker, a voice representing the content of the utterance determined by the utterance determination unit 3 based on the text representing the content of the first user utterance. Hereinafter, this utterance is called a limited utterance. The limited utterance is an utterance for limiting the user's utterance to a desired range. Examples of the desired range include C-1, in which the user's utterance is limited to a backchannel response, and C-2, in which the user's utterance is limited to affirmation or denial (for example, “Yes / No”).

Hereinafter, the utterance for limiting the user's utterance to a desired range will be described in detail with specific examples. The notation of the specific examples is the same as in the first embodiment. In the specific examples, * 1 marks the first user utterance and * 2 marks the limited utterance.

C-1. A specific example in the case where the user's utterance is limited to a backchannel response is shown below. For example, when the limited utterance is a question that includes a word representing the content of the first user utterance and confirms that content, the user is more likely to reply with a backchannel response.

R: “What do you like?”
H: “I like reading” (* 1)
R: “You like reading books?” (* 2)
H: “Yes”

C-2. A specific example in the case where the user's utterance is limited to affirmation or denial is shown below. For example, when the limited utterance is a closed question that includes a word related to the content of the first user utterance, the user is more likely to answer with affirmation or denial. A closed question is a question whose range of answers is limited, such as “Yes / No” or “A or B or C”. In contrast, questions that can be answered freely, such as the so-called 5W1H questions (When, Where, Who, What, Why, How), are called open questions.

R: “What do you like?”
H: “I like reading” (* 1)
R: “Do you like reading comics?” (* 2)
H: “Yes”
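
The two kinds of limited utterance in C-1 and C-2 can be sketched as simple templates. This is a hedged illustration only: the functions `extract_topic` and `make_limited_utterance`, the prefix-based keyword extraction, and the English templates are all assumptions made for this sketch, and the description does not specify how the utterance determination unit 3 actually builds the limited utterance.

```python
def extract_topic(first_user_utterance: str) -> str:
    """Toy keyword extraction (assumption): take what follows 'I like '."""
    text = first_user_utterance.strip().rstrip(".")
    prefix = "I like "
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


def make_limited_utterance(topic: str, mode: str, related: str = "") -> str:
    """Build a limited utterance from a word of the first user utterance.

    mode="confirm": C-1 style question confirming the user's own words.
    mode="closed":  C-2 style closed (Yes/No) question about a related word.
    """
    if mode == "confirm":
        return f"You like {topic}?"
    if mode == "closed":
        return f"Do you like {related or topic}?"
    raise ValueError(f"unknown mode: {mode}")


topic = extract_topic("I like reading")
print(make_limited_utterance(topic, "confirm"))
print(make_limited_utterance(topic, "closed", related="reading comics"))
```

Both templates reuse a word from the first user utterance, which is the property the text relies on to make a short “Yes”-type reply likely.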

In step S23, the microphone 1 accepts an utterance uttered by the user after the limited utterance. Hereinafter, this utterance is referred to as a second user utterance. The voice recognition unit 2 recognizes the voice signal of the second user utterance picked up by the microphone 1 and inputs the text obtained as the voice recognition result to the utterance determination unit 3 as text representing the content of the second user utterance.

Thereafter, the dialogue between the user and the dialogue system 10 continues on the topic of the second user utterance. For example, so that a dialogue following the scenario selected by the technology used in scenario dialogue systems is carried out between the user and the dialogue system 10, the dialogue system 10 outputs from the speaker a voice representing the content of the scenario utterance determined by that technology. Alternatively, for example, the dialogue system 10 outputs from the speaker a voice representing the content of the chat utterance determined, based on the user's utterance, by the technology used in chat dialogue systems. The subsequent utterances may be made by any one humanoid robot or by multiple humanoid robots.
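
The flow of steps S21 to S23 can be sketched with pluggable callables standing in for the input unit, the utterance determination unit, and the presentation unit. This is a minimal sketch under stated assumptions: `run_limited_dialogue`, `classify_reply`, and the small word lists are hypothetical names and data introduced here, and speech recognition and speech synthesis are replaced by plain text.

```python
from typing import Callable, Optional

AFFIRM = {"yes", "yeah", "yep"}  # assumed word list, for illustration only
DENY = {"no", "nope"}            # assumed word list, for illustration only


def classify_reply(utterance: str) -> Optional[str]:
    """Return 'affirm', 'deny', or None if the reply is outside the range."""
    word = utterance.strip().lower().rstrip(".!")
    if word in AFFIRM:
        return "affirm"
    if word in DENY:
        return "deny"
    return None


def run_limited_dialogue(
    receive: Callable[[], str],
    determine: Callable[[str], str],
    present: Callable[[str], None],
) -> Optional[str]:
    first = receive()           # step S21: first user utterance
    limited = determine(first)  # limited utterance based on the first utterance
    present(limited)            # step S22: present the limited utterance
    second = receive()          # step S23: second user utterance
    return classify_reply(second)


# Scripted stand-ins for the microphone and the speaker.
replies = iter(["I like reading", "Yes"])
shown = []
result = run_limited_dialogue(
    receive=lambda: next(replies),
    determine=lambda first: "Do you like reading comics?",
    present=shown.append,
)
print(shown, result)
```

Because the second user utterance is expected to fall within a small range, the check in `classify_reply` can stay simple; a reply outside the range returns `None` and would have to be handled by the subsequent dialogue.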

<Modification>
In the embodiments described above, an example in which a robot is used as an agent to conduct a spoken dialogue has been described. The robot in those embodiments is a humanoid robot having a body and the like, but it may be a robot that has no body. The dialogue technique of the present invention is not limited to these forms; a dialogue may also be conducted using an agent that, unlike a humanoid robot, has no physical entity such as a body and no utterance mechanism. One such form is a dialogue conducted using an agent displayed on a computer screen. More specifically, the technique can also be applied to a group chat in which multiple accounts interact through text messages, such as “LINE” (registered trademark) and “2 channel” (registered trademark), where the user's account and the dialogue device's account interact. In this form, the computer having the screen that displays the agent needs to be near the person, but that computer and the dialogue device may be connected via a network such as the Internet. That is, this dialogue system can be applied not only to dialogues in which speakers such as a person and a robot actually talk face to face, but also to conversations in which the speakers communicate via a network.

As shown in FIG. 4, the dialogue system 20 according to the modification includes an input unit 1, an utterance determination unit 3, and a presentation unit 5. In the example of FIG. 4, the dialogue system 20 consists of a single dialogue device 21, which includes the input unit 1, the utterance determination unit 3, and the presentation unit 5.

The dialogue device of the modification is an information processing apparatus, for example a mobile terminal such as a smartphone or tablet, or a desktop or laptop personal computer. In the following description, the dialogue device is assumed to be a smartphone. The presentation unit 5 is the liquid crystal display of the smartphone. A chat application window is displayed on the liquid crystal display, and the conversation contents of the group chat are displayed in the window in time series. A group chat is a function in which multiple accounts post text messages to one another to carry on a conversation within the chat. It is assumed that a user account and multiple virtual accounts, each corresponding to a virtual personality controlled by the dialogue device, participate in this group chat. That is, this modification is an example in which the agent is a virtual account displayed on the liquid crystal display of a smartphone serving as the dialogue device. The user can enter utterance content into the input unit 1 using the software keyboard and post it to the group chat through the user's own account. The utterance determination unit 3 determines the utterance content from the dialogue device based on the post from the user's account and posts it to the group chat through each virtual account. The utterance content may instead be entered into the input unit 1 by speech, using the microphone and speech recognition function of the smartphone. Likewise, the utterance content obtained from the dialogue system may be output from the speaker in a voice corresponding to each virtual account, using the speaker and speech synthesis function of the smartphone.
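
The group chat form of this modification can be sketched as a shared, time-ordered log to which the user account and the virtual accounts post. This is an illustrative sketch only: the class `GroupChat`, the function `post_exchange`, and the account names are assumptions introduced here, and the utterances of the virtual accounts are passed in directly rather than being produced by an actual utterance determination unit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GroupChat:
    """Conversation contents kept in time series as (account, text) pairs."""
    log: List[Tuple[str, str]] = field(default_factory=list)

    def post(self, account: str, text: str) -> None:
        self.log.append((account, text))


def post_exchange(
    chat: GroupChat,
    user_account: str,
    user_text: str,
    virtual_posts: List[Tuple[str, str]],
) -> None:
    """Post the user's message, then each virtual account's utterance."""
    chat.post(user_account, user_text)
    for account, text in virtual_posts:
        chat.post(account, text)


chat = GroupChat()
post_exchange(
    chat,
    "user",
    "I like reading",
    [("agent_R1", "Do you like reading comics?")],
)
for account, text in chat.log:
    print(f"{account}: {text}")
```

Keeping a single ordered log corresponds to the chat window that displays the conversation in time series; the same structure would serve whether the text comes from the software keyboard or from speech recognition.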

With the above-described configuration, according to the dialogue technique of the present invention, the dialogue system performs, before the user speaks, an action corresponding to the conditions under which the dialogue system can acquire utterances. This draws the user's utterance into a range that satisfies those conditions and allows the user to continue the dialogue with the dialogue system for a long time.

The embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments; needless to say, design changes and the like made as appropriate without departing from the spirit of the present invention are included in the present invention. The various processes described in the embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

[Program, recording medium]
When the various processing functions of the dialogue device described in the above embodiments are realized by a computer, the processing contents of the functions that the dialogue device should have are described by a program. Likewise, when the various processing functions of the dialogue system described in the above modification are realized by a computer, the processing contents of the functions that the dialogue system should have are described by a program. By executing this program on a computer, the various processing functions of the dialogue device and the dialogue system are realized on the computer.

The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

Also, this program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing the process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute the process according to the program, or it may execute the process according to the received program sequentially each time the program is transferred from the server computer to the computer. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service that does not transfer the program from the server computer to the computer and realizes the processing functions only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the present apparatus is configured by executing a predetermined program on a computer. However, at least a part of these processing contents may be realized by hardware.

Claims

1. A dialogue method performed by a dialogue system that interacts with a user, the method including:
a first receiving step in which an input unit accepts a first user utterance, which is an utterance made by the user;
a presentation step of presenting a limited utterance, which is an utterance, determined based on the first user utterance, for limiting the user's utterance to a predetermined range; and
a second receiving step in which the input unit accepts a second user utterance, which is an utterance made by the user after the limited utterance.

2. The dialogue method according to claim 1, wherein
the predetermined range is content intended as affirmation or denial.

3. The dialogue method according to claim 1 or 2, wherein
the limited utterance is a closed question including a word related to the content of the first user utterance.

4. The dialogue method according to claim 1 or 2, wherein
the predetermined range is content intended to be affirmed.

5. The dialogue method according to any one of claims 1 to 3, wherein
the limited utterance includes a word related to the content of the first user utterance and is a question for confirming the content of the first user utterance.

6. A dialogue system that interacts with a user, the system including:
an input unit that accepts a first user utterance, which is an utterance made by the user, and a second user utterance, which is an utterance made by the user after a limited utterance, the limited utterance being an utterance for limiting the user's utterance to a predetermined range;
an utterance determination unit that determines the limited utterance based on the first user utterance; and
a presentation unit that presents the limited utterance determined by the utterance determination unit.

7. A dialogue device that determines an utterance to be presented by a dialogue system including at least an input unit that accepts a user's utterance and a presentation unit that presents an utterance, the device including:
an utterance determination unit that determines, based on a first user utterance, which is an utterance made by the user, a limited utterance, which is an utterance for limiting the user's utterance to a predetermined range.

8. A program for causing a computer to execute each step of the dialogue method according to any one of claims 1 to 5.

9. A program for causing a computer to function as the dialogue device according to claim 7.