US11087757B2 - Determining a system utterance with connective and content portions from a user utterance - Google Patents

Determining a system utterance with connective and content portions from a user utterance

Info

Publication number
US11087757B2
Authority
US
United States
Prior art keywords
utterance
dialogue
voice
output
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US16/390,261
Other versions
US20190244620A1 (en)
Inventor
Atsushi Ikeno
Yusuke JINGUJI
Toshifumi Nishijima
Fuminori Kataoka
Hiromi Tonegawa
Norihide Umeyama
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toyota Motor Corp
Original Assignee
Toyota Motor Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Motor Corp filed Critical Toyota Motor Corp
Priority to US16/390,261
Publication of US20190244620A1
Priority to US17/366,270
Application granted
Publication of US11087757B2
Priority to US18/539,604
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/1815 — Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/26 — Speech to text systems
    • G10L 15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 2015/223 — Execution procedure of a spoken command
    • G10L 2015/225 — Feedback of the input speech
    • G10L 2015/226 — Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Described is a voice dialogue system that includes a voice input unit which acquires a user utterance, an intention understanding unit which interprets an intention of utterance of a voice acquired by the voice input unit, a dialogue text creator which creates a text of a system utterance, and a voice output unit which outputs the system utterance as voice data. When creating a text of a system utterance, the dialogue text creator creates the text by inserting a tag in a position in the system utterance, and the intention understanding unit interprets an utterance intention of a user in accordance with whether a timing at which the user utterance is made is before or after an output of a system utterance at a position corresponding to the tag from the voice output unit.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 15/704,691 filed Sep. 14, 2017, now U.S. Pat. No. 10,319,379 issued Jun. 11, 2019, which claims priority under 35 USC 119 from Japanese Patent Application No. 2016-189406 filed Sep. 28, 2016, the entire disclosures of which are hereby incorporated by reference in their entireties.
BACKGROUND OF THE INVENTION Field of the Invention
The present invention relates to a voice dialogue system and particularly to a voice dialogue system accommodating barge-in utterances.
Description of the Related Art
In a voice dialogue system, when a system utterance and a user utterance overlap each other, a determination is desirably made on whether or not the user is responding to a current system utterance.
Japanese Patent Application Laid-open No. 2014-77969 discloses determining, when a system utterance and a user utterance overlap each other, whether or not a target user utterance is directed toward a dialogue system based on a length of the target user utterance, a time relationship between the target user utterance and an immediately previous utterance, a state of the system, and the like. According to this method, a user utterance to which the dialogue system must respond and a user utterance such as a monologue to which the dialogue system need not respond can be distinguished from each other.
However, with the method described in Japanese Patent Application Laid-open No. 2014-77969, even when a determination can be made that a user utterance overlapping a system utterance is directed toward the voice dialogue system, a determination cannot be made on whether or not the user utterance is a response to a system utterance currently being output.
User utterances can be classified into, for instance, those responding to a system utterance currently being output, those responding to a last system utterance, and those spontaneously made to the voice dialogue system by a user. When a system utterance and a user utterance overlap each other, a determination is desirably made on whether or not the user's intention is to respond to a current system utterance.
An object of the present invention is to accurately determine an utterance intention of a user when a system utterance and a user utterance overlap each other in a voice dialogue system.
  • Patent Document 1: Japanese Patent Application Laid-open No. 2014-77969
SUMMARY OF THE INVENTION
A first aspect is a voice dialogue system, including:
a voice input unit configured to acquire a user utterance;
an intention understanding unit configured to interpret an intention of utterance of a voice acquired by the voice input unit;
a dialogue text creator configured to create a text of a system utterance; and
a voice output unit configured to output the system utterance as voice data, wherein
the dialogue text creator is further configured to, when creating a text of a system utterance, create the text by inserting a tag in a position in the system utterance, and
the intention understanding unit is configured to interpret an utterance intention of a user in accordance with whether a timing at which the user utterance is made is before or after an output of a system utterance at a position corresponding to the tag from the voice output unit.
In this manner, by embedding a tag into a system utterance, a determination of whether or not a user utterance is a response to the system utterance currently being output can be made depending on whether a start timing of the user utterance is before or after an utterance of the sentence (word) at the position corresponding to the tag.
In the present aspect, the intention understanding unit may interpret that, when the timing at which the user utterance is made is after the output of the system utterance sentence at the position corresponding to the tag from the voice output unit, the user utterance is a response to the system utterance, and may interpret that, when the timing at which the user utterance is input is before the output of the system utterance sentence at the position corresponding to the tag from the voice output unit, the user utterance is not a response to the system utterance.
In addition, in the present aspect, the dialogue text creator may generate the system utterance as a combination of a connective portion and a content portion and embed the tag between the connective portion and the content portion. However, a position of the tag need not be between the connective portion and the content portion and may be, for example, a position in the content portion where it is assumed that the user is able to understand an intention of a system utterance by listening to contents up to the position.
Furthermore, in the present aspect, the intention understanding unit may calculate a first period of time, which is a period of time from the output of the system utterance from the voice output unit until the output of all texts preceding the tag from the voice output unit, acquire a second period of time, which is a period of time from the output of the system utterance from the voice output unit until the start of input of the user utterance, and compare the first period of time and the second period of time with each other to determine whether the timing at which the user utterance is made is before or after the output of a system utterance at the position corresponding to the tag from the voice output unit.
In addition, in the present aspect, the voice output unit desirably does not output as a voice the tag in the text of the system utterance sentence.
Moreover, the present invention can be considered a voice dialogue system including at least a part of the units described above. In addition, the present invention can also be considered a voice dialogue method which executes at least a part of the processes described above. Furthermore, the present invention can also be considered a computer program that causes the method to be executed by a computer or a computer-readable storage medium that non-transitorily stores the computer program. The respective units and processes described above can be combined with one another to the greatest extent possible to constitute the present invention.
According to the present invention, in a voice dialogue system, an utterance intention of a user can be accurately determined even when a system utterance and a user utterance overlap each other.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram showing a configuration of a voice dialogue system according to an embodiment;
FIG. 2 is a diagram showing a configuration of a voice dialogue system according to a modification;
FIG. 3 is a diagram illustrating how an intention of a user utterance is understood by a voice dialogue system according to an embodiment;
FIG. 4 is a diagram showing a flow of processing of dialogue text creation and output in a voice dialogue system according to an embodiment; and
FIG. 5 is a diagram illustrating a flow of processing of understanding an intention of a user utterance in a voice dialogue system according to an embodiment.
DESCRIPTION OF THE EMBODIMENTS
A preferred embodiment of the present invention will now be exemplarily described in detail with reference to the drawings. While the embodiment described below is a system in which a voice dialogue robot is used as a voice dialogue terminal, a voice dialogue terminal need not be a robot and an arbitrary information processing apparatus, a voice dialogue interface, and the like can be used.
<System Configuration>
FIG. 1 is a diagram showing a configuration of a voice dialogue system (a voice dialogue robot) according to the present embodiment. A voice dialogue robot 100 according to the present embodiment is a computer including a microphone (a voice input unit) 101, a speaker (a voice output unit) 107, a processing unit such as a microprocessor, a memory, and a communication apparatus. When the microprocessor executes a program, the voice dialogue robot 100 functions as a voice recognizer 102, an intention understanding unit 103, a dialogue manager 104, a dialogue text creator 105, and a voice synthesizer 106. Although not shown, the voice dialogue robot 100 may include an image acquisition apparatus (camera), movable joints, and a moving mechanism.
The voice recognizer 102 performs processing such as noise elimination, sound source separation, and feature amount extraction with respect to voice data of a user utterance input from the microphone 101 and converts contents of the user utterance into a text. The voice recognizer 102 also acquires a timing (a time point) at which the user utterance is made to the microphone 101.
Moreover, the voice recognizer 102 is configured to be able to understand a user utterance made during a system utterance. A user utterance during a system utterance is referred to as a barge-in utterance (interrupting utterance). The voice recognizer 102 is adapted to handle a barge-in utterance and is capable of extracting and recognizing a user utterance by suppressing self-utterances in voice data inputted to the microphone 101.
The intention understanding unit 103 interprets (understands) an utterance intention of the user based on a recognition result (a text of utterance contents, an utterance feature, and the like) of the voice recognizer 102. The intention understanding unit 103 stores a corpus or a dictionary for interpreting utterance contents and interprets an utterance by the user by referring to the corpus or the dictionary.
The intention understanding unit 103 also determines whether or not a barge-in utterance by the user is a response to a current system utterance. Moreover, a barge-in utterance not being a response to a current system utterance includes both a case where the barge-in utterance is a response to a system utterance preceding the current system utterance and a case where the user spontaneously talks to the robot. Details of processing for determining whether or not a barge-in utterance is a response to a current system utterance will be described later. A result of understanding of the utterance intention of a user utterance by the intention understanding unit 103 is sent to the dialogue manager 104 and the dialogue text creator 105.
The dialogue manager 104 stores a history of dialogue performed in the past between the system and the user. The dialogue manager 104 not only manages contents of a dialogue but also manages circumstances (for example, a time and date or a location) in which the dialogue was performed. The dialogue manager 104 enables what kind of conversation had taken place with the user to be discerned and a response using previous dialogue as a reference to be generated.
The dialogue text creator 105 receives a result of interpretation of the intention of a user utterance from the intention understanding unit 103 and creates a dialogue text of an utterance (a system utterance) for responding to the user utterance. In the present specification, a dialogue text of a system utterance is also referred to as a system utterance sentence or a system dialogue text. The dialogue text creator 105 creates a system dialogue text by referring to contents of previous dialogue (including contents of a current dialogue) stored in the dialogue manager 104, user information, and the like. The dialogue text creator 105 stores a dialogue scenario database and may create a response sentence using a dialogue scenario stored in the database. The dialogue text created by the dialogue text creator 105 is sent to and stored in the dialogue manager 104.
A dialogue text of a system response is created by embedding a “tag” for notifying a timing of determining whether or not a barge-in utterance by the user is a response to a current utterance. The dialogue text creator 105 creates a response sentence as a sum of a connective portion and a body (a content portion). In doing so, the tag is inserted between the connective and the body. For example, when creating a text by splicing a connective of “Hey” and a body of “What's your name?”, a text reading “Hey, <1> what's your name?” is generated. In this case, “<1>” corresponds to the tag. In addition, when splicing “By the way” and “What's tomorrow's weather?”, “By the way, <2> what's tomorrow's weather?” is created. In this case, while the numerals in the tags are for identifying the tags, when only one tag is to be included in one sentence (response), a variable sign such as numerals need not necessarily be used.
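As a rough illustration of this splicing step, the following Python sketch builds a system utterance text with a tag between the connective portion and the body. The function name and the simple string format are assumptions for illustration, not the patent's implementation; only the tag notation such as "<1>" follows the examples above.

```python
def build_system_utterance(connective: str, body: str, tag_id: int = 1) -> str:
    """Splice a connective and a body, embedding a position tag between them.

    The tag (e.g. "<1>") stays in the text so that later stages can locate the
    boundary between the connective portion and the content portion; it is
    never output as voice.
    """
    return f"{connective}, <{tag_id}> {body}"


# Examples matching the patent text:
print(build_system_utterance("Hey", "what's your name?"))                     # Hey, <1> what's your name?
print(build_system_utterance("By the way", "what's tomorrow's weather?", 2))  # By the way, <2> what's tomorrow's weather?
```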
The voice synthesizer 106 receives a text of utterance contents and performs voice synthesis to generate response voice data. The response voice data generated by the voice synthesizer 106 is reproduced from the speaker 107. In doing so, the tag included in a text is not output as a voice.
In addition, the voice synthesizer 106 calculates a time point at which output of a voice up to immediately before a tag included in a text of utterance contents ends or a period of time required to output voice from the start of the text up to immediately before the tag. The time point or the period of time can be calculated based on the text of the utterance contents and an utterance speed. The calculated period of time or time point is sent to the intention understanding unit 103.
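A minimal sketch of this timing calculation is shown below, assuming a flat characters-per-second utterance speed (a real synthesizer would use its own phoneme or prosody timing) and assuming tags are written as "<n>"; the function names are illustrative only.

```python
import re

TAG_PATTERN = re.compile(r"<\d+>")


def strip_tags(utterance_text: str) -> str:
    """Remove tags so they are not output as voice."""
    return TAG_PATTERN.sub("", utterance_text)


def time_until_tag(utterance_text: str, chars_per_second: float = 8.0) -> float:
    """Estimate the seconds from the start of voice output until the text
    immediately preceding the first tag has been spoken (period of time A)."""
    match = TAG_PATTERN.search(utterance_text)
    if match is None:
        raise ValueError("utterance text contains no tag")
    spoken_prefix = utterance_text[:match.start()]
    return len(spoken_prefix) / chars_per_second


period_a = time_until_tag("Say, tell me, <1> where are you from?")  # ~1.75 s under this model
```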
Moreover, the voice dialogue robot 100 need not be configured as a single apparatus. For example, as shown in FIG. 2, a two-apparatus configuration can be adopted with a robot apparatus 109 (a front end apparatus) including the microphone 101, the speaker 107, a camera, and movable joints and a smartphone 110 (or another computer) which executes various processing. In this case, the robot apparatus and the computer are connected by wireless communication such as Bluetooth (registered trademark), data acquired by the robot apparatus is sent to the computer, and reproduction of a response sentence or the like is performed by the robot apparatus based on a result of processing by the computer.
In addition, the voice recognition process and the dialogue text creation process need not be performed by the voice dialogue robot 100 and, as shown in FIG. 2, the processes may be performed by a voice recognition server 200 and a dialogue server 300. Alternatively, the processes may be performed by a single server. When the processes are performed using an external server in this manner, the smartphone 110 (or the robot apparatus 109) controls cooperation with the server.
<Processing>
FIG. 3 is a diagram schematically illustrating processing for determining an intention of a user utterance when a system utterance and the user utterance overlap each other in the present embodiment. In this case, an example will be described in which, after the system makes an utterance 301 of “What kind of hobbies do you have?”, since the user does not return a response, the system successively makes an utterance 302 of “Say, tell me, where are you from?”. The user makes an utterance 303 of “I love to drive” after a short delay from a start timing of the utterance 302. Since the utterance 302 and the utterance 303 overlap each other, a determination must be made as to whether the user utterance 303 is a response to the utterance 301 or a response to the utterance 302.
In this case, a text of the utterance 302 has been created by the dialogue text creator 105 by embedding a tag to read “Say, tell me, <1> where are you from?”. In addition, the voice synthesizer 106 calculates a period of time A required for output from the start of the utterance 302 up to immediately before the tag <1>. Moreover, although the tag is not output as a voice as described earlier, hereinafter, for the sake of brevity, a tag will sometimes be described as though output as a voice such as referring to a timing at which output up to immediately before a tag is completed as an output timing of a tag.
The system can also acquire a period of time B between the start of the utterance 302 and the start of the utterance 303. In this case, when the start of the utterance 303 by the user is before a tag output timing (period of time A ≥ period of time B), a determination can be made that the utterance 303 by the user is a response to the previous utterance 301 by the system. This is because, since the tag is inserted before a body of a response sentence, it is appropriate to consider that a response prior to output of the body is not a response to the current utterance 302 but a response to the previous utterance 301.
In addition, when the start of the utterance 303 by the user is after the tag output timing (period of time A<period of time B), a determination can be made that the utterance 303 by the user is a response to the current utterance 302 by the system. This is because it is appropriate to consider that the user responds to the current utterance 302 after the system starts output of the body of the response sentence.
Hereinafter, details for realizing the processing shown in FIG. 3 will be described with reference to the flow charts in FIGS. 4 and 5.
FIG. 4 is a flow chart showing a flow of processing for generating and outputting a dialogue text in the voice dialogue robot 100. In step S11, the dialogue text creator 105 determines a dialogue scenario (a conversation template) corresponding to circumstances. The circumstances as referred to herein are determined, for instance, based on various factors such as a recognition result of a user utterance, contents of previous dialogue, and a current time point or location. The dialogue text creator 105 includes a dialogue scenario database storing a plurality of dialogue scenarios (conversation templates), and contents of a system utterance and contents of further system utterances in accordance with expected user responses are described in a dialogue scenario. Contents of a part of system utterances in a dialogue scenario are specified so as to be determined in accordance with a response by the user or other circumstances. The dialogue text creator 105 selects a dialogue scenario conforming to current circumstances.
In step S12, the dialogue text creator 105 determines a text of an utterance sentence based on the selected dialogue scenario. While a method of determining an utterance sentence text is not particularly specified, in this case, a text of an utterance sentence is ultimately determined as a combination of a connective and a body. Examples of a connective include simple replies, interjections, and gambits such as “Yeah”, “Is that so?”, and “By the way” or a repetition of a part of the utterance contents of the user. The dialogue text creator 105 inserts a tag between the connective and the body to create a text of an utterance sentence. For example, texts such as “Hey, <1> what's your name?” and “By the way, <2> what's tomorrow's weather?” are generated.
Moreover, a combination of a connective, a tag, and a body may be stored in a dialogue scenario (a conversation template) or a dialogue scenario may only store a body and an appropriate connective may be selected to be added to the body together with a tag.
In step S13, when the dialogue text creator 105 outputs the determined utterance text, the period of time required to utter the text from the start of the utterance up to the portion immediately preceding the tag is calculated and stored. This period of time can be obtained from an utterance speed setting in the voice synthesis process and from the contents of the uttered text.
In step S14, the voice synthesizer 106 converts the utterance sentence text into voice data and outputs the voice data from the speaker 107. In step S15, a start timing of an utterance is stored.
FIG. 5 is a flow chart of an intention understanding process for determining whether or not a barge-in utterance by a user (in other words, a user utterance overlapping a system utterance) is intended as a response to a current system utterance. Moreover, although the intention understanding process of a user utterance in the voice dialogue robot 100 includes elements other than determining whether or not the user utterance is a response to a current system utterance, the following description will focus on the determination of whether or not the user utterance is a response to the current system utterance.
In step S21, an utterance by the user is acquired from the microphone 101. In doing so, a start timing of the user utterance is stored.
In step S22, the intention understanding unit 103 compares a period of time (the period of time A in FIG. 3) between an utterance start timing of a system utterance currently being output and an output timing of a tag in the system utterance with a period of time (the period of time B in FIG. 3) between the utterance start timing of the system utterance and an utterance start timing of the user utterance.
When the user utterance is before the output start timing of the tag in the system utterance or, in other words, when the period of time A≥the period of time B (S23—YES), in step S24, the intention understanding unit 103 determines that the user utterance is a response to a system utterance immediately preceding the current system utterance.
On the other hand, when the user utterance is after the output start timing of the tag in the system utterance or, in other words, when the period of time A<the period of time B (S23—NO), in step S25, the intention understanding unit 103 determines that the user utterance is a response to the current system utterance.
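Steps S22 to S25 reduce to a single comparison. A sketch of that decision follows, with hypothetical function and argument names; timestamps are taken to be seconds on a shared clock, which the patent does not prescribe.

```python
def interpret_barge_in(system_start: float, time_to_tag: float, user_start: float) -> str:
    """Decide which system utterance a barge-in user utterance responds to.

    time_to_tag: period of time A, the output time from the start of the
    system utterance to the position of the tag.
    user_start - system_start: period of time B, the delay until the user spoke.
    """
    period_a = time_to_tag
    period_b = user_start - system_start
    if period_a >= period_b:
        # The user began speaking before the content portion was output (step S24).
        return "response to the previous system utterance"
    # The user began speaking after the content portion started (step S25).
    return "response to the current system utterance"


# FIG. 3 example: the tag is reached ~1.75 s in and the user starts ~1.0 s in,
# so the user utterance is taken as a response to the previous system utterance.
print(interpret_barge_in(system_start=0.0, time_to_tag=1.75, user_start=1.0))
```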
Advantageous Effects
According to the present embodiment, when a user utterance and a system utterance overlap each other, whether or not the user utterance is a response to a current system utterance can be determined with simple processing. Therefore, a dialogue between the system and the user can be realized in a more appropriate manner.
<Modifications>
In the embodiment described above, only the result of a comparison between the timing of a user utterance and the output timing of a tag is taken into consideration in order to determine whether or not the user utterance is a response to a current system utterance; however, a final determination may be made in consideration of other elements. For example, a determination may conceivably be made by taking into consideration an association between the contents of the last and current system utterances and the contents of a barge-in utterance by the user. As in the example shown in FIG. 3, in a case where the user says “I love to drive” while the system is successively asking “What kind of hobbies do you have?” and “Where are you from?”, a determination can be made, based on the association between the contents, that the user's utterance is a response to the previous system utterance (“What kind of hobbies do you have?”) regardless of the timing of the user utterance. In this manner, it is also favorable to make a final determination in consideration of both the timing of a user utterance and an association between utterance contents.
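One way such a combined determination could be sketched is shown below. The keyword-overlap score is a deliberately naive stand-in for a real content-association measure, and the threshold value is an assumption; the sketch only illustrates the idea of letting content association override timing.

```python
def content_overlap(utterance: str, candidate: str) -> float:
    """Crude association score: fraction of the user's words shared with a candidate."""
    a, b = set(utterance.lower().split()), set(candidate.lower().split())
    return len(a & b) / max(len(a), 1)


def combined_decision(user_utterance: str,
                      previous_system_utterance: str,
                      current_system_utterance: str,
                      timing_says_current: bool,
                      margin: float = 0.2) -> str:
    """Combine the timing-based result with a content-association check."""
    prev_score = content_overlap(user_utterance, previous_system_utterance)
    curr_score = content_overlap(user_utterance, current_system_utterance)
    # If the content strongly favors one utterance, let it override the timing.
    if prev_score - curr_score > margin:
        return "response to the previous system utterance"
    if curr_score - prev_score > margin:
        return "response to the current system utterance"
    return ("response to the current system utterance"
            if timing_says_current
            else "response to the previous system utterance")
```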
In addition, while an example in which the system successively utters two questions has been described above, similar processing can also be applied when the user starts a conversation. In this case, a determination is made as to whether a user utterance is a response to a system utterance or a spontaneous start of a conversation. In other words, when a barge-in utterance by the user is not a response to a system utterance, it is understood that the user's intention is to start a conversation.
In addition, while a dialogue scenario (a conversation template) is used to create a dialogue text, a method of creating a dialogue text is not particularly limited. A dialogue text may be determined without using a dialogue scenario. Furthermore, an insertion position of a tag in a dialogue text is not limited to between a connective and a body and a tag need only be inserted at a position where effects of the present invention can be produced. In addition, a plurality of tags may be inserted into one response sentence, in which case an utterance intention of the user can be determined based on which of three or more sections divided by the tags a start of the user utterance corresponds to.
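For the multi-tag case, the following sketch maps the start of the user utterance to one of the N + 1 sections delimited by N tags, reusing the flat characters-per-second timing assumption from the earlier sketch; section index 0 means the user spoke before the first tag.

```python
import re

TAG_PATTERN = re.compile(r"<\d+>")


def section_of_user_utterance(utterance_text: str,
                              user_delay: float,
                              chars_per_second: float = 8.0) -> int:
    """Return the 0-based index of the section of the system utterance that was
    being output when the user started speaking, user_delay seconds after the
    system utterance began."""
    boundaries = []       # estimated times at which output reaches each tag
    spoken_chars = 0
    last_end = 0
    for match in TAG_PATTERN.finditer(utterance_text):
        spoken_chars += match.start() - last_end   # tags themselves are not voiced
        boundaries.append(spoken_chars / chars_per_second)
        last_end = match.end()
    section = 0
    for boundary in boundaries:
        if user_delay >= boundary:
            section += 1
    return section


# With two tags the utterance has three sections (0, 1, 2); each section can be
# mapped to a different interpretation of the user's intention.
print(section_of_user_utterance("Well, <1> first question? <2> Second question?", user_delay=2.0))
```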
While the term “tag” is used in the description given above and expressions such as “<1>” are adopted in a response sentence text, a “tag” as used in the present invention refers to a specifier of a position in a response sentence and how the specifier is specifically expressed in a response sentence text is not limited. An arbitrary character string defined in advance or an arbitrary character string based on a rule defined in advance can be adopted in order to specify a position in a response sentence, in which case both arbitrary character strings correspond to a “tag” according to the present invention.
<Other>
The configurations of the embodiment and the modification described above can be used appropriately combined with each other without departing from the technical ideas of the present invention. In addition, the present invention may be realized by appropriately making changes thereto without departing from the technical ideas thereof.

Claims (6)

What is claimed is:
1. A voice dialogue system, comprising:
a voice input unit configured to acquire user utterances of a user;
a dialogue text creator configured to create system utterances, wherein the dialogue text creator creates the system utterances based upon a stored history of dialogue performed in a past between the system and the user stored in a dialogue manager, the dialogue manager storing a time and date or location of the dialogue and enabling what kind of conversation had taken place with the user to be discerned and a response using previous dialogue as a reference to be generated;
a voice output unit configured to output the system utterances as voice data; and
a determiner configured to, in a case that a current user utterance is acquired while a system utterance is being output as voice data, determine whether or not the current user utterance acquired by the voice input unit is a response to a content that is output at a time of the current user utterance, wherein
the system utterance comprises a connective portion for connecting following sentences and a content portion that is a subject of the system utterance,
the content portion includes (a) a first content portion that is a first system utterance that is output before the connective portion and (b) a second content portion that is a second system utterance that is output after the connective portion,
the first content portion is a first question and the second content portion is a second question different from the first question,
the determiner determines, in a case that the current user utterance is acquired before the output of the second content portion has started, the current user utterance is a response to the first content portion, and
the determiner determines, in a case that the current user utterance is acquired after the output of the second content portion has started, the current user utterance is a response to the second content portion.
2. The voice dialogue system according to claim 1, wherein the connective portion comprises one of an interjection, a gambit, or a repetition of a part of a previously acquired user utterance.
3. The voice dialogue system according to claim 2, wherein
the dialogue creator is further configured to, when creating the system utterances, insert an unvoiced tag between the connective portion and the content portion of the system utterances, and
the determiner is further configured to determine that the output of the content portion of the system utterance has started based at least on a position of the unvoiced tag in the system utterance.
4. The voice dialogue system according to claim 2, wherein the determiner is further configured to:
calculate a first period of time that is a period of time that it will take to output the connective portion of the system utterance as voice data;
acquire a second period of time that is a period of time from a start of output of the system utterance as voice data to a start of the acquired current user utterance; and
compare the first period of time and the second period of time with each other to determine whether the current user utterance is acquired after the output of the content portion of the system utterance has started or before the output of the content portion of the system utterance has started.
5. The voice dialogue system according to claim 1, wherein the determiner is further programmed to function as an intention understanding unit storing a corpus or a dictionary for interpreting utterance contents and interpreting a user utterance by referring to the corpus or the dictionary.
6. The voice dialogue system according to claim 1, wherein the voice dialogue system includes a voice dialogue robot having movable joints, the voice dialogue robot configured to function as the voice input unit, the dialogue text creator, the voice output unit and the determiner.
US16/390,261 2016-09-28 2019-04-22 Determining a system utterance with connective and content portions from a user utterance Active US11087757B2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/390,261 US11087757B2 (en) 2016-09-28 2019-04-22 Determining a system utterance with connective and content portions from a user utterance
US17/366,270 US11900932B2 (en) 2016-09-28 2021-07-02 Determining a system utterance with connective and content portions from a user utterance
US18/539,604 US20240112678A1 (en) 2016-09-28 2023-12-14 Voice dialogue system and method of understanding utterance intention

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2016189406A JP6515897B2 (en) 2016-09-28 2016-09-28 Speech dialogue system and method for understanding speech intention
JP2016-189406 2016-09-28
US15/704,691 US10319379B2 (en) 2016-09-28 2017-09-14 Methods and systems for voice dialogue with tags in a position of text for determining an intention of a user utterance
US16/390,261 US11087757B2 (en) 2016-09-28 2019-04-22 Determining a system utterance with connective and content portions from a user utterance

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/704,691 Continuation US10319379B2 (en) 2016-09-28 2017-09-14 Methods and systems for voice dialogue with tags in a position of text for determining an intention of a user utterance

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/366,270 Division US11900932B2 (en) 2016-09-28 2021-07-02 Determining a system utterance with connective and content portions from a user utterance

Publications (2)

Publication Number Publication Date
US20190244620A1 US20190244620A1 (en) 2019-08-08
US11087757B2 true US11087757B2 (en) 2021-08-10

Family

ID=61685640

Family Applications (4)

Application Number Title Priority Date Filing Date
US15/704,691 Active US10319379B2 (en) 2016-09-28 2017-09-14 Methods and systems for voice dialogue with tags in a position of text for determining an intention of a user utterance
US16/390,261 Active US11087757B2 (en) 2016-09-28 2019-04-22 Determining a system utterance with connective and content portions from a user utterance
US17/366,270 Active 2037-11-26 US11900932B2 (en) 2016-09-28 2021-07-02 Determining a system utterance with connective and content portions from a user utterance
US18/539,604 Pending US20240112678A1 (en) 2016-09-28 2023-12-14 Voice dialogue system and method of understanding utterance intention

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/704,691 Active US10319379B2 (en) 2016-09-28 2017-09-14 Methods and systems for voice dialogue with tags in a position of text for determining an intention of a user utterance

Family Applications After (2)

Application Number Title Priority Date Filing Date
US17/366,270 Active 2037-11-26 US11900932B2 (en) 2016-09-28 2021-07-02 Determining a system utterance with connective and content portions from a user utterance
US18/539,604 Pending US20240112678A1 (en) 2016-09-28 2023-12-14 Voice dialogue system and method of understanding utterance intention

Country Status (3)

Country Link
US (4) US10319379B2 (en)
JP (1) JP6515897B2 (en)
CN (1) CN107871503B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6428954B2 (en) * 2016-02-18 2018-11-28 ソニー株式会社 Information processing apparatus, information processing method, and program
JP2018054790A (en) * 2016-09-28 2018-04-05 トヨタ自動車株式会社 Voice interaction system and voice interaction method
US20210065708A1 (en) * 2018-02-08 2021-03-04 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
JP6969491B2 (en) * 2018-05-11 2021-11-24 トヨタ自動車株式会社 Voice dialogue system, voice dialogue method and program
JP7169096B2 (en) * 2018-06-18 2022-11-10 株式会社デンソーアイティーラボラトリ Dialogue system, dialogue method and program
CN109285545A (en) * 2018-10-31 2019-01-29 北京小米移动软件有限公司 Information processing method and device
JP2020086943A (en) * 2018-11-26 2020-06-04 シャープ株式会社 Information processing device, information processing method, and program
CN111475206B (en) * 2019-01-04 2023-04-11 优奈柯恩(北京)科技有限公司 Method and apparatus for waking up wearable device
JP7512254B2 (en) 2019-03-26 2024-07-08 株式会社Nttドコモ Spoken dialogue system, model generation device, barge-in utterance determination model, and spoken dialogue program
CN115146653B (en) * 2022-07-21 2023-05-02 平安科技(深圳)有限公司 Dialogue scenario construction method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6598018B1 (en) * 1999-12-15 2003-07-22 Matsushita Electric Industrial Co., Ltd. Method for natural dialog interface to car devices
US7158935B1 (en) * 2000-11-15 2007-01-02 At&T Corp. Method and system for predicting problematic situations in a automated dialog
JP4680691B2 (en) * 2005-06-15 2011-05-11 富士通株式会社 Dialog system
EP2259252B1 (en) * 2009-06-02 2012-08-01 Nuance Communications, Inc. Speech recognition method for selecting a combination of list elements via a speech input
WO2012150658A1 (en) * 2011-05-02 2012-11-08 旭化成株式会社 Voice recognition device and voice recognition method
JP6391925B2 (en) * 2013-09-20 2018-09-19 株式会社東芝 Spoken dialogue apparatus, method and program
US8862467B1 (en) * 2013-12-11 2014-10-14 Google Inc. Contextual speech recognition
KR101770187B1 (en) * 2014-03-27 2017-09-06 한국전자통신연구원 Method and apparatus for controlling navigation using voice conversation
US10311862B2 (en) * 2015-12-23 2019-06-04 Rovi Guides, Inc. Systems and methods for conversations with devices about media using interruptions and changes of subjects

Patent Citations (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020087312A1 (en) * 2000-12-29 2002-07-04 Lee Victor Wai Leung Computer-implemented conversation buffering method and system
US20020184031A1 (en) 2001-06-04 2002-12-05 Hewlett Packard Company Speech system barge-in control
US20030093274A1 (en) * 2001-11-09 2003-05-15 Netbytel, Inc. Voice recognition using barge-in time
US20130218574A1 (en) * 2002-02-04 2013-08-22 Microsoft Corporation Management and Prioritization of Processing Multiple Requests
US20030163309A1 (en) 2002-02-22 2003-08-28 Fujitsu Limited Speech dialogue system
US20030191648A1 (en) * 2002-04-08 2003-10-09 Knott Benjamin Anthony Method and system for voice recognition menu navigation with error prevention and recovery
US7197460B1 (en) * 2002-04-23 2007-03-27 At&T Corp. System for handling frequently asked questions in a natural language dialog service
JP2004151562A (en) 2002-10-31 2004-05-27 Seiko Epson Corp Method for controlling voice interaction and voice interaction control device
US20050015256A1 (en) * 2003-05-29 2005-01-20 Kargman James B. Method and apparatus for ordering food items, and in particular, pizza
US20140324438A1 (en) 2003-08-14 2014-10-30 Freedom Scientific, Inc. Screen reader having concurrent communication of non-textual information
US7853451B1 (en) 2003-12-18 2010-12-14 At&T Intellectual Property Ii, L.P. System and method of exploiting human-human data for spoken language understanding systems
US20060020471A1 (en) * 2004-07-23 2006-01-26 Microsoft Corporation Method and apparatus for robustly locating user barge-ins in voice-activated command systems
US20060080101A1 (en) 2004-10-12 2006-04-13 At&T Corp. Apparatus and method for spoken language understanding by using semantic role labeling
US20100324896A1 (en) 2004-12-22 2010-12-23 Enterprise Integration Group, Inc. Turn-taking confidence
US7321856B1 (en) 2005-08-03 2008-01-22 Microsoft Corporation Handling of speech recognition in a declarative markup language
US20150019217A1 (en) * 2005-08-05 2015-01-15 Voicebox Technologies Corporation Systems and methods for responding to natural language speech utterance
US8600753B1 (en) 2005-12-30 2013-12-03 At&T Intellectual Property Ii, L.P. Method and apparatus for combining text to speech and recorded prompts
US20130006643A1 (en) 2010-01-13 2013-01-03 Aram Lindahl Devices and Methods for Identifying a Prompt Corresponding to a Voice Input in a Sequence of Prompts
US8903716B2 (en) * 2010-01-18 2014-12-02 Apple Inc. Personalized vocabulary for digital assistant
US8799000B2 (en) * 2010-01-18 2014-08-05 Apple Inc. Disambiguation based on active input elicitation by intelligent automated assistant
JP2014077969A (en) 2012-10-12 2014-05-01 Honda Motor Co Ltd Dialogue system and determination method of speech to dialogue system
US20140156276A1 (en) 2012-10-12 2014-06-05 Honda Motor Co., Ltd. Conversation system and a method for recognizing speech
US20140180697A1 (en) 2012-12-20 2014-06-26 Amazon Technologies, Inc. Identification of utterance subjects
US20150179175A1 (en) 2012-12-20 2015-06-25 Amazon Technologies, Inc. Identification of utterance subjects
JP2016501391A (en) 2012-12-20 2016-01-18 アマゾン テクノロジーズ インコーポレーテッド Identifying the utterance target
US20140278404A1 (en) 2013-03-15 2014-09-18 Parlant Technology, Inc. Audio merge tags
US20160314787A1 (en) 2013-12-19 2016-10-27 Denso Corporation Speech recognition apparatus and computer program product for speech recognition
US20160372138A1 (en) * 2014-03-25 2016-12-22 Sharp Kabushiki Kaisha Interactive home-appliance system, server device, interactive home appliance, method for allowing home-appliance system to interact, and nonvolatile computer-readable data recording medium encoded with program for allowing computer to implement the method
US20150348533A1 (en) 2014-05-30 2015-12-03 Apple Inc. Domain specific language for encoding assistant dialog
US20160300570A1 (en) * 2014-06-19 2016-10-13 Mattersight Corporation Personality-based chatbot and methods
US9792901B1 (en) 2014-12-11 2017-10-17 Amazon Technologies, Inc. Multiple-source speech dialog input
US20180090132A1 (en) 2016-09-28 2018-03-29 Toyota Jidosha Kabushiki Kaisha Voice dialogue system and voice dialogue method
US10319379B2 (en) * 2016-09-28 2019-06-11 Toyota Jidosha Kabushiki Kaisha Methods and systems for voice dialogue with tags in a position of text for determining an intention of a user utterance

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jan. 24, 2019 Notice of Allowance issued in U.S. Appl. No. 15/704,691.
Sep. 11, 2018 Office Action issued in U.S. Appl. No. 15/704,691.
U.S. Appl. No. 15/704,691, filed Sep. 14, 2017 in the name of Atsushi Ikeno et al.

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210335362A1 (en) * 2016-09-28 2021-10-28 Toyota Jidosha Kabushiki Kaisha Determining a system utterance with connective and content portions from a user utterance
US11900932B2 (en) * 2016-09-28 2024-02-13 Toyota Jidosha Kabushiki Kaisha Determining a system utterance with connective and content portions from a user utterance
US20210342553A1 (en) * 2018-05-09 2021-11-04 Nippon Telegraph And Telephone Corporation Dialogue data generation device, dialogue data generation method, and program
US12026460B2 (en) * 2018-05-09 2024-07-02 Nippon Telegraph And Telephone Corporation Dialogue data generation device, dialogue data generation method, and program

Also Published As

Publication number Publication date
US11900932B2 (en) 2024-02-13
US20240112678A1 (en) 2024-04-04
JP2018054791A (en) 2018-04-05
CN107871503A (en) 2018-04-03
US20210335362A1 (en) 2021-10-28
US20190244620A1 (en) 2019-08-08
CN107871503B (en) 2023-02-17
US20180090144A1 (en) 2018-03-29
US10319379B2 (en) 2019-06-11
JP6515897B2 (en) 2019-05-22

Similar Documents

Publication Publication Date Title
US11900932B2 (en) Determining a system utterance with connective and content portions from a user utterance
CN107516511B (en) Text-to-speech learning system for intent recognition and emotion
US20200058294A1 (en) Method and device for updating language model and performing speech recognition based on language model
KR102191425B1 (en) Apparatus and method for learning foreign language based on interactive character
US9449599B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US8635070B2 (en) Speech translation apparatus, method and program that generates insertion sentence explaining recognized emotion types
US9588967B2 (en) Interpretation apparatus and method
CN111292740B (en) Speech recognition system and method thereof
US20180090132A1 (en) Voice dialogue system and voice dialogue method
CN110675855A (en) Voice recognition method, electronic equipment and computer readable storage medium
JP6024675B2 (en) Voice recognition terminal device, voice recognition system, and voice recognition method
US20170103757A1 (en) Speech interaction apparatus and method
US9984689B1 (en) Apparatus and method for correcting pronunciation by contextual recognition
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
JP6614080B2 (en) Spoken dialogue system and spoken dialogue method
WO2020036195A1 (en) End-of-speech determination device, end-of-speech determination method, and program
KR20210034276A (en) Dialogue system, dialogue processing method and electronic apparatus
CN113593522A (en) Voice data labeling method and device
WO2014194299A1 (en) Systems and methods for adaptive proper name entity recognition and understanding
EP4275203B1 (en) Self-learning end-to-end automatic speech recognition
JP5818753B2 (en) Spoken dialogue system and spoken dialogue method
KR102300303B1 (en) Voice recognition considering utterance variation
CN110895938B (en) Voice correction system and voice correction method
JP6538399B2 (en) Voice processing apparatus, voice processing method and program
KR20200011160A (en) Intelligent end-to-end word learning method using speech recognition technology

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE