WO2011030372A1

WO2011030372A1 - Speech interaction device and program

Info

Publication number: WO2011030372A1
Application number: PCT/JP2009/004446
Authority: WO
Inventors: 山本大介; 土井美和子; 小林優佳; 横山祥恵; 古賀敏之; 熊巳創; 片岡敬弘
Original assignee: 株式会社東芝
Priority date: 2009-09-09
Filing date: 2009-09-09
Publication date: 2011-03-17

Abstract

A speech interaction device (100) is characterized by being provided with a generation unit (101) which generates an interaction sentence compliant with a user-initiative mode or a system-initiative mode in accordance with instructions of the modes, a speech presentation unit (102) which presents the interaction sentence to a user by speech, a speech feature quantity calculation unit (103) which calculates the speech feature quantity of the user in response to the speech presented by the speech presentation unit, a determination unit (104) which calculates the activation level of the interaction by the user on the basis of the speech feature quantity of the user, and a switch unit (105) which switches the mode of the generation unit (101) to the user-initiative mode when the activation level is equal to or higher than a threshold and switches the mode of the generation unit (101) to the system-initiative mode when the activation level is lower than the threshold.

Description

Spoken dialogue apparatus and program

The present invention relates to a dialogue technique using voice.

Japanese Patent Application Laid-Open No. 2009-37050 discloses a dialogue apparatus that presents a topic (for example, a topic including a keyword of interest to the user), estimates the user's degree of interest in that topic, and switches topics when the user's interest is low so as not to bore the user.

JP 2009-37050 A

However, the technique disclosed in Japanese Patent Application Laid-Open No. 2009-37050 is a system that presents topics unilaterally, which is not engaging for the user. It is difficult to sustain a voice conversation merely by having the dialogue apparatus switch topics.

Therefore, an object of the present invention is to provide a voice interactive device capable of continuous conversation with a user.

A voice interaction device according to an aspect of the present invention includes: a generation unit that generates a dialogue sentence corresponding to either a user-driven mode or a system-driven mode in accordance with an instruction specifying the mode; a voice presentation unit that presents the dialogue sentence to the user by voice; a voice feature amount calculation unit that calculates the voice feature amount of the user in response to the voice presented by the voice presentation unit; a determination unit that calculates the activity level of the user's dialogue based on the voice feature amount; and a switching unit that switches the mode of the generation unit to the user-driven mode when the activity is greater than or equal to a threshold, and to the system-driven mode when the activity is less than the threshold.

According to the present invention, it is possible to perform a dialogue process that enables a sustained conversation by voice dialogue.

FIG. 1 is a block diagram showing the configuration of the voice interaction apparatus according to the first embodiment.
FIG. 2 is a diagram showing the structure of the dialogue database.
FIG. 3 is a diagram schematically showing the related distance.
FIG. 4 is a diagram showing the configuration of the voice feature amount calculation unit.
FIG. 5 is a diagram for explaining the operation of the voice time detection unit.
FIG. 6 is a flowchart showing the operation of the voice interaction apparatus.
FIG. 7 is a diagram showing an utterance example.
FIG. 8 is a block diagram showing the configuration of the voice interaction apparatus according to the second embodiment.
FIG. 9 is a flowchart showing the operation of the voice interaction apparatus.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings to be described below, the same reference numerals indicate the same parts, and duplicate descriptions are omitted.

(First embodiment)
FIG. 1 is a block diagram showing a configuration of a voice interactive apparatus 100 according to the first embodiment of the present invention. In the present embodiment, the case where the voice interaction device 100 is applied to a robot will be described as an example. However, the present invention is not limited to this, and can be applied to various devices on which the voice interaction device can be mounted. As described in the present embodiment, when the voice interaction device 100 is applied to a robot, the voice interaction device 100 indicates a part related to the voice interaction of the robot. Furthermore, the control target 111 shown in FIG. 1 indicates a part such as a robot hand, foot, or head.

The voice interaction apparatus 100 includes a generation unit 101 that generates a sentence for interacting with the user, and a voice presentation unit that presents the sentence generated by the generation unit 101 to the user by voice. The voice presentation unit consists of the voice synthesis unit 102 and the voice output unit 108. Specifically, the voice synthesis unit 102 converts the text generated by the generation unit 101 into a voice signal, and the voice output unit 108 converts this voice signal into voice and outputs it. When the user hears this voice and utters a response, the voice input unit 107 converts the user's voice into a voice signal.

Note that the voice output unit 108 may be externally connected to the voice interaction device 100 when used for purposes other than the robot.

The voice interaction apparatus 100 further includes a voice feature amount calculation unit 103 that calculates a feature amount of the voice uttered by the user, a determination unit 104 that determines the activity level of the user's dialogue based on the feature amount calculated by the voice feature amount calculation unit 103, and a switching unit 105 that switches between the system-driven mode and the user-driven mode in accordance with the determination of the determination unit 104.

Further, the voice interaction device 100 stores a plurality of keywords and the like in the interaction database 106, and stores a sentence template for generating a sentence using the keywords stored in the interaction database 106 in the interaction template 109.

The generating unit 101 is connected to the dialog database 106, generates a text for interacting with the user using the dialog database 106, and sends the generated text to the speech synthesizer 102. The voice synthesizing unit 102 converts the sentence sent from the generating unit 101 into a voice signal, and outputs the voice signal from the voice output unit 108 such as a speaker to the user as voice.

The expression output unit 110 operates the robot or the CG serving as the control target 111 according to the operation data stored in association with the template (dialogue sentence) registered in advance in the dialogue template 109, so that the dialogue proceeds smoothly. In the dialogue template 109, operation data is registered in association with each template; for example, an operation of tilting the head is registered for a template that is a question, and operation data is likewise registered for a template that is a back-channel response such as “Yes”. The expression output unit 110 changes, for example, the robot's motion or the CG image based on the operation data associated with the template registered in the dialogue template 109.

FIG. 2 is a diagram showing information stored in the dialogue database 106. The dialogue database 106 has a plurality of topic data 1 to N (where N is an integer of 2 or more), each storing a plurality of keywords and related information. Each of the topic data 1 to N contains a topic, keywords related to the topic such as place names, person names, and food names, a priority for each keyword, and a distance indicating the relationship between keywords (hereinafter, “related distance”).
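
One possible in-memory layout for the topic data described above is sketched below in Python. The class and field names (`TopicData`, `Keyword`, `related_distance`), the convention that a smaller priority number means a higher priority, and the sample values are all illustrative assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Keyword:
    text: str       # e.g. a place name, person name, or food name
    concept: str    # hypothetical concept label such as "fruit"
    priority: int   # assumption: a smaller number means a higher priority


@dataclass
class TopicData:
    topic: str
    keywords: list = field(default_factory=list)
    # Related distance between two keywords, keyed by an unordered pair.
    related_distance: dict = field(default_factory=dict)

    def set_distance(self, a: str, b: str, distance: float) -> None:
        self.related_distance[frozenset((a, b))] = distance

    def distance(self, a: str, b: str) -> float:
        return self.related_distance[frozenset((a, b))]

    def representative(self) -> Keyword:
        """Return the keyword with the highest priority."""
        return min(self.keywords, key=lambda k: k.priority)


# Hypothetical sample entry; the distance value is made up.
topic1 = TopicData(
    topic="fruit",
    keywords=[
        Keyword("apple", "fruit", 1),
        Keyword("Aomori", "prefecture name", 2),
    ],
)
topic1.set_distance("apple", "Aomori", 1.0)
```

Keying the distance table by an unordered pair means the lookup gives the same result regardless of the order in which the two keywords are passed.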

For example, if the switching unit 105 selects the topic data 1, some keywords in the topic data 1 are selected from the dialogue database 106, and a sentence is generated based on the related distance between the selected keywords.

Various methods can be considered for generating a sentence from the related distance between keywords. One example is to select two closely related keywords and create a sentence based on the closeness of their conceptual structure. FIG. 3A is a diagram schematically showing the closeness of the conceptual structure. “Apple” and “mandarin orange” belong to the same concept (“fruit”), and the relationship between “apple” and “fruit” expresses that “apple” is contained in “fruit”. In FIG. 3A, the conceptual structure is closer as the numerical value is larger. For example, for “apple” and “mandarin orange”, which have a related distance of 1, a sentence predetermined according to the related distance can be generated, such as “Which do you prefer, apples or mandarin oranges?”

Other related distances include the associative distance. FIG. 3B is a diagram schematically showing the associative distance. As shown in FIG. 3B, the words that people associate with a given word are processed statistically, and the associative distance is expressed by an index indicating the similarity between the words. For example, the associative distance can be calculated from the frequency of the answers obtained by asking many people “What do you associate with the word ‘apple’?” In FIG. 3B, the smaller the value, the closer the association. For example, the associative distance between “apple” and “fruit” is 0.2, indicating high similarity, whereas the associative distance between “apple” and “vegetable” is 4.2, indicating that the similarity is not so high.

Other related distances include the sentence distance and the dialogue history distance. The sentence distance represents how many words apart the keywords appear in a text such as a news article. The dialogue history distance represents how many words apart the keywords appear in the past dialogue history.
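
The keyword selection and template filling described above can be sketched roughly as follows. The sketch reuses the 0.2 and 4.2 associative distance values from the example and treats the smaller value as the closer association, consistent with those values. The table layout, the function names `closest_keyword` and `generate_sentence`, and the template string are hypothetical.

```python
# Associative distances taken from the example values in the text;
# the table layout itself is an assumption.
ASSOCIATIVE_DISTANCE = {
    ("apple", "fruit"): 0.2,
    ("apple", "vegetable"): 4.2,
}

# Hypothetical sentence template with two keyword slots.
TEMPLATE = "Speaking of {first}, {second}."


def closest_keyword(keyword, distances):
    """Return the keyword with the smallest distance to `keyword`."""
    candidates = {}
    for (a, b), d in distances.items():
        if a == keyword:
            candidates[b] = d
        elif b == keyword:
            candidates[a] = d
    if not candidates:
        raise ValueError(f"no related keyword for {keyword!r}")
    return min(candidates, key=candidates.get)


def generate_sentence(keyword, distances=ASSOCIATIVE_DISTANCE):
    """Pick the closest related keyword and fill the template with the pair."""
    partner = closest_keyword(keyword, distances)
    return TEMPLATE.format(first=keyword, second=partner)


print(generate_sentence("apple"))
```

A fuller version would choose the template according to the concepts of the two keywords and the size of the distance, as the text describes; that lookup is omitted here.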

The voice synthesizer 102 converts the text sent from the generator 101 into a voice signal. The converted voice signal is output as voice to the user who is the object of dialogue through the voice output unit 108 connected to the voice synthesis unit 102.

As shown in FIG. 4, the voice feature amount calculation unit 103 includes a volume detection unit 401, a pitch detection unit 402, a voice time detection unit 403, and a voice recognition unit 404. The voice feature amount calculation unit 103 is connected to the voice input unit 107, such as a microphone, and sends the voice information obtained by each unit to the determination unit 104.

The volume detection unit 401 detects the volume of the user's voice input from the voice input unit 107. Specifically, the amplitude of the voice uttered by the user is accumulated as time-series data, and the average value and variance of the voice amplitude are calculated from this data.

The pitch detection unit 402 detects the fundamental frequency (pitch) of the user's voice input from the voice input unit 107. Specifically, the fundamental frequency of the voice uttered by the user is accumulated as time-series data, and the average value and variance of the fundamental frequency are calculated from this data.
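
As a rough illustration of the statistics computed by the volume and pitch detection units, the following sketch calculates the mean and variance of a time series. The per-frame sample values are made up, and the use of the population variance is an assumption, since the text does not say which variance is meant.

```python
from statistics import fmean, pvariance


def summarize(samples):
    """Return (mean, variance) of a time series of per-frame values."""
    if not samples:
        raise ValueError("at least one sample is required")
    # Assumption: population variance; the source does not specify which.
    return fmean(samples), pvariance(samples)


# Hypothetical per-frame values for one utterance.
amplitudes = [0.2, 0.4, 0.6]
pitches_hz = [110.0, 120.0, 130.0]

amp_mean, amp_var = summarize(amplitudes)
f0_mean, f0_var = summarize(pitches_hz)
```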

The voice time detection unit 403 detects, from the voice input through the voice input unit 107, the time from the start of the user's utterance until the user ends the utterance, and the time from the end of the robot's utterance to the start of the user's utterance.

FIG. 5 is a diagram for explaining the operation of the voice time detection unit 403. The horizontal axis indicates time. FIG. 5A shows the utterance of the robot, and FIG. 5B shows the utterance of the user. T1 indicates the time from when one speaker (the robot) finishes speaking until the next speaker (the user) starts speaking (the alternation latency). T2 indicates the time from when the user starts an utterance until that utterance ends. The time from the end of the user's utterance until the user starts to utter again (hereinafter referred to as the “utterance interval”) is stored as data, and the average utterance interval is calculated from this data.
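
The two times T1 and T2 can be derived from utterance timestamps, as in the minimal sketch below. The `Utterance` record, the function names, and the timestamp values are hypothetical; how the start and end of speech are detected in the audio signal is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # "robot" or "user"
    start: float   # seconds
    end: float     # seconds


def alternation_latency(robot: Utterance, user: Utterance) -> float:
    """T1: gap between the end of the robot's utterance and the user's start."""
    return user.start - robot.end


def utterance_duration(user: Utterance) -> float:
    """T2: time from the start to the end of the user's utterance."""
    return user.end - user.start


# Hypothetical timestamps for one exchange.
robot = Utterance("robot", start=0.0, end=2.0)
user = Utterance("user", start=2.5, end=4.0)

t1 = alternation_latency(robot, user)
t2 = utterance_duration(user)
```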

The voice recognition unit 404 recognizes speech and converts it into text. The recognition may be continuous speech recognition, which recognizes whole sentences, or word recognition, which recognizes only a predetermined standby vocabulary. In the case of continuous speech recognition, the user's speech is recognized as a sentence, and it is determined whether a keyword of the topic data is included in the sentence. In the case of word recognition, the keywords of the topic data are registered as the standby vocabulary, and it is determined whether any word of the standby vocabulary is recognized.
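
For the continuous speech recognition case, the keyword check on the recognized sentence could look like the following sketch. It assumes the recognizer has already produced text and uses a simple case-insensitive substring match; the function name `find_topic_keywords` and the matching rule are assumptions, not the method of the disclosure.

```python
def find_topic_keywords(recognized_text, topic_keywords):
    """Return the topic keywords that occur in the recognized sentence."""
    lowered = recognized_text.lower()
    # Assumption: case-insensitive substring matching is sufficient here.
    return [kw for kw in topic_keywords if kw.lower() in lowered]


# Hypothetical keyword list for one topic data entry.
keywords = ["Aomori", "Nebuta Festival", "apple"]
hits = find_topic_keywords("Nebuta Festival in summer", keywords)
```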

As described above, after the user and the voice interaction device 100 have exchanged several turns of conversation, the volume detection unit 401, the pitch detection unit 402, the voice time detection unit 403, and the voice recognition unit 404 store in the voice database 405 the average value and variance of the voice amplitude, the average value and variance of the voice frequency, the average and variance of the alternation latency, and whether a keyword of the topic data was recognized by voice recognition. Voice information uttered by the user can also be accumulated in advance and stored in the voice database 405. The average value and variance are each calculated from the physical quantities, such as voice amplitude, frequency, and utterance interval, and the number of conversations.

The determination unit 104 compares the voice feature amount data sent from the voice database 405 connected to the voice feature amount calculation unit 103 with the voice feature amount emitted by the user, and determines the degree of activity.

Here, the degree of activity indicates the degree to which the physical quantities of the voice currently uttered by the user, such as amplitude, frequency, and utterance interval, are larger or smaller than the corresponding physical quantities calculated from the user's past utterances accumulated in the voice database 405. “Past” refers to the period since the voice interaction device 100 and the user started a conversation; it may span a few minutes or several years.

For example, in the case of the amplitude of the voice, if the amplitude of the voice uttered by the user is larger than the average value of the amplitude of the voice sent from the voice database 405, it is determined that the activity is equal to or greater than the threshold value.

In the case of the voice frequency, if the frequency of the voice uttered by the user is larger than the average value of the voice frequency sent from the voice database 405, it is determined that the activity is high.

Note that the average value of the amplitude and frequency of the voice sent from the voice database 405 is set as a threshold used for the determination of the activity. That is, the degree of activity can be determined by comparing the average value of the amplitude and frequency of the voice uttered by the past user with the amplitude and frequency of the voice uttered by the current user.

In the case of an utterance interval, if the user's utterance interval is shorter than the average interval of the utterance intervals sent from the voice database 405, it is determined that the activity is high. That is, the degree of activity is determined by comparing the average value of the past user speech intervals with the current user speech interval.

Thus, when it is determined that the activity level is high and the voice interactive apparatus 100 is in the system driven mode, the determining unit 104 switches the switching unit 105 to the user driven mode.

On the other hand, when it is determined that the activity is less than the threshold value and the voice interaction apparatus 100 is in the user-driven mode, the determination unit 104 switches the switching unit 105 to the system-driven mode.

In the present embodiment, the determination unit 104 can make the determination using at least one of the three physical quantities constituting the voice information, namely amplitude, frequency, and utterance interval.
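
The comparison against past averages can be sketched as follows. The feature names, the dictionary layout, and in particular the rule for combining several features (the activity counts as high if any supplied feature indicates it) are assumptions made for illustration; the text only states that at least one of the three quantities is used.

```python
def is_activity_high(past_mean, current):
    """Compare current voice features with their past averages.

    Amplitude and frequency indicate high activity when they exceed the
    past average; the utterance interval does so when it is shorter.
    """
    votes = []
    for name in ("amplitude", "frequency"):
        if name in current and name in past_mean:
            votes.append(current[name] > past_mean[name])
    name = "utterance_interval"
    if name in current and name in past_mean:
        votes.append(current[name] < past_mean[name])
    if not votes:
        raise ValueError("at least one feature is required")
    # Assumption: any single feature indicating high activity is enough.
    return any(votes)


# Hypothetical past averages, as they might be read from the voice database.
past = {"amplitude": 0.5, "frequency": 120.0, "utterance_interval": 2.0}

print(is_activity_high(past, {"amplitude": 0.8}))
```

Replacing `any` with `all` would give a stricter rule in which every supplied feature must agree; the source does not say which is intended.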

Here, the system-driven mode refers to a mode in which the topic data 1 to N in the dialogue database 106 shown in FIG. 2 are switched sequentially to present a new topic.

The user-driven mode is a mode in which, as shown in FIG. 2, the topic data is fixed to the topic data selected in the system-driven mode, the voice uttered by the user is recognized and compared with the words included in the fixed topic data, and, when there is a matching word, the generation unit 101 generates a sentence using that word.

The switching unit 105 switches between the system initiative mode and the user initiative mode according to the activity calculated by the determination unit 104.

FIG. 6 is a flowchart illustrating the operation of the voice interaction apparatus 100. FIG. 6A shows the operations in the system-driven mode (steps S601 to S604), and FIG. 6B shows the operations in the user-driven mode (steps S611 to S614). In the following, the initial state of the voice interaction apparatus 100 is described as the system-driven mode; however, the present invention is not limited to this, and the initial state may be the user-driven mode.

In step S601, the generation unit 101 uses the dialogue database 106 to select predetermined topic data and a representative keyword in that topic data, generates a sentence for interacting with the user, and sends it to the voice synthesis unit 102.

In step S602, the speech synthesis unit 102 converts the text sent from the generation unit 101 into speech. The voice output unit 108 outputs the voice generated by the voice synthesis unit 102 to the user.

In step S603, when the user who heard the voice output from the voice output unit 108 answers, the voice feature amount calculation unit 103 calculates a voice feature amount from the user's voice.

In step S604, the determination unit 104 determines the activity level of the user's dialogue based on the voice feature amount calculated by the voice feature amount calculation unit 103. If the determination unit 104 determines that the activity is not high (“NO” in step S604), the process returns to step S601, and the generation unit 101 generates another sentence. If the determination unit 104 determines that the activity is equal to or greater than the threshold (“YES” in step S604), the switching unit 105 switches from the system-driven mode to the user-driven mode.

In step S611, immediately after switching to the user-driven mode, the generation unit 101 uses the topic data that was selected when the activity level of the user's dialogue became high in the system-driven mode, that is, the topic data selected in step S601 to generate the sentence. A keyword in this topic data is selected, and the generation unit 101 generates a sentence. After the user's voice has been detected in step S613, described later, the voice recognition unit 404 recognizes the user's voice and compares it with the keywords in the topic data that was fixed at step S604. When the recognized voice matches a keyword in the fixed topic data, the generation unit 101 generates a sentence using that keyword. When the recognized voice does not match any keyword in the fixed topic data, the generation unit 101 generates a sentence using an expression stored in advance in the dialogue database 106 (a back-channel response such as “Hey”, “Huh”, or “Yeah”).

In step S612, the voice synthesis unit 102 converts the text sent from the generation unit 101 into a voice signal. The voice output unit 108 outputs the voice signal converted by the voice synthesis unit 102 to the user as voice. In step S613, when the user who heard the voice output from the voice output unit 108 answers, the voice feature amount calculation unit 103 calculates a voice feature amount from the user's voice.

In step S614, the determination unit 104 determines the activity level of the user's dialogue based on the voice feature amount calculated by the voice feature amount calculation unit 103. If the determination unit 104 determines that the activity is equal to or higher than the threshold (“NO” in step S614), the process returns to step S611 and the user-driven mode continues. On the other hand, if the determination unit 104 determines that the activity is less than the threshold (“YES” in step S614), the determination unit 104 instructs the switching unit 105 to switch from the user-driven mode to the system-driven mode. The switching unit 105 then switches from the user-driven mode to the system-driven mode, and the process returns to step S601.
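
The switching behaviour of the flowchart can be summarized as a small state machine, sketched below. Representing the activity as a single number per turn and the threshold as a fixed value are simplifying assumptions; the names `next_mode` and `run_dialogue` are hypothetical, and sentence generation, speech synthesis, and feature extraction are left out.

```python
SYSTEM_DRIVEN = "system-driven"
USER_DRIVEN = "user-driven"


def next_mode(activity, threshold):
    """User-driven at or above the threshold, otherwise system-driven."""
    return USER_DRIVEN if activity >= threshold else SYSTEM_DRIVEN


def run_dialogue(activities, threshold, initial_mode=SYSTEM_DRIVEN):
    """Return the mode in effect for each turn.

    `activities` holds the activity measured from the user's response
    in each turn (hypothetical scalar values).
    """
    mode = initial_mode
    modes = []
    for activity in activities:
        modes.append(mode)                      # mode used for this turn's sentence
        mode = next_mode(activity, threshold)   # decided from the user's response
    return modes


print(run_dialogue([0.2, 0.9, 0.8, 0.1], threshold=0.5))
```

In this sketch the mode chosen after a response takes effect from the following turn, which mirrors the flowchart: the determination step comes after the user's answer and selects where the next sentence is generated.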

Thus, in the present embodiment, the system-driven mode and the user-driven mode can be switched by determining the activity level of the user's conversation. As a result, it is possible to have a continuous voice conversation.

Next, an utterance example produced when dialogue processing is performed according to the above operation is shown. FIG. 7 is a diagram illustrating the utterance example.

The dialogue starts in the system-driven mode (topic presentation). Here, it is assumed that topic data 1 is used, and that “apple”, the keyword with the highest priority in topic data 1 (the one most representative of the topic), and “Aomori”, the keyword with the closest keyword distance to “apple”, are extracted. Based on the concept “fruit” for “apple” and the concept “prefecture name” for “Aomori”, the dialogue template 109 is searched, and the template “Speaking of ‘fruit’, ‘prefecture name’” is used to utter “Speaking of apples, Aomori.” Suppose the user responds with a short “Yes” in a low voice. The degree of activity is calculated from the voice feature amount calculated for this response. Since the degree of activity is less than the threshold compared with the history so far, the system-driven mode (topic presentation) continues.

Next, topic data 2 is used, and “Tohoku”, the keyword with the closest keyword distance to the keyword “Aomori” in topic data 2, is selected. Based on the concept “subject” for “Aomori” and the concept “region” for “Tohoku”, the dialogue template 109 is searched, and the template “‘subject’ is in ‘region’” is used to utter “Aomori is in Tohoku.” Suppose the user replies, “That's right. Aomori is a prefecture in Tohoku!” From the voice feature amounts (for example, volume and voice pitch) calculated by the voice feature amount calculation unit 103, the activity is determined to be equal to or higher than the threshold, and the mode is switched to the user-driven mode.

In the user-driven mode, topic data 2, which was in use when the activity became high in the system-driven mode, is used. “Aomori”, which has the highest priority in topic data 2, is selected, and using the dialogue template 109 for the user-driven mode and the concept “prefecture name”, the device utters “Have you ever been to Aomori?” Suppose the user says, “Yeah, I was born in Aomori. Aomori is a good place.” Since the degree of activity is not less than the threshold, the user-driven mode continues. If no keyword is recognized by the voice feature amount calculation unit 103, a back-channel response such as “Yes” is given. Furthermore, when the user says “Nebuta Festival in summer,” which contains “Nebuta Festival”, one of the keywords in topic data 2, and the keyword is recognized, the keyword “Nebuta Festival” is repeated. As the conversation continues in this way, when the activity level of the user's dialogue falls again, the mode is switched to the system-driven mode, and the next topic is presented using topic data 3.

As described above, in the system-driven mode, the topic data is switched sequentially to present new topics. In the user-driven mode, on the other hand, the topic data is fixed, and the user is encouraged to keep talking by giving back-channel responses or repeating keywords spoken by the user, in step with what the user says.

(Second Embodiment)
FIG. 8 is a block diagram of a voice interaction apparatus 800 according to the second embodiment of the present invention. The voice interaction apparatus 800 is different from the voice interaction apparatus 100 in that it further includes an image feature amount calculation unit 801 and an expression output unit 110. The description of the same configuration as the voice interaction device 100 in the voice interaction device 800 is omitted.

The image feature amount calculation unit 801 is connected to an image database and image input unit 802 (not shown). The image input unit 802 images the user who is the subject of dialogue in time series, and sends the captured image to the image feature amount calculation unit 801.

The image feature amount calculation unit 801 detects the contour of the user's face from the image sent from the image input unit 802, detects feature points from the detected contour, accumulates the positions of the feature points in the image database, and calculates from these the average value of the movement width of the feature point positions.

Specifically, the image feature amount calculation unit 801 prepares a face image template in the image database in advance, compares the template with the user image captured by the image input unit 802, and detects the face contour. The image feature amount calculation unit 801 then detects feature points such as the eyes, mouth, and nose of the user within the detected contour, and monitors these feature points.

Because the image feature amount calculation unit 801 is provided in addition to the voice feature amount calculation unit 103, the determination unit 104 can also determine the activity using the positions of the feature points detected from the contour of the user's face.

Specifically, the movement width of the feature point positions of the user's facial contour detected by the image feature amount calculation unit 801 is compared with the average value of the movement width of the feature point positions stored in the image database. If the detected movement width is larger than the average value, the activity is determined to be greater than or equal to the threshold value. Note that the average value of the movement widths of the feature point positions sent from the image database is used as the threshold value. That is, when the voice interaction apparatus 800 is in the system-driven mode and the movement width of the feature point positions of the user's facial contour is equal to or greater than the threshold value, the mode is switched to the user-driven mode.

On the other hand, when the voice interaction apparatus 800 is in the user-driven mode and the movement width of the feature point positions of the user's facial contour is less than the threshold, the mode is switched to the system-driven mode.

It is also possible to accumulate an average value of the motion widths of the positions of feature points detected in advance from the contour image of the user's face and store the position information of these feature points in the image database. The average value is calculated by dividing the integrated value of the vector change amount of the feature point position by the number of dialogues.
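
The averaging described above, in which the accumulated change in feature point position is divided by the number of dialogues, can be sketched as follows. The sketch assumes 2D pixel coordinates for a single feature point and uses the Euclidean length of each displacement as the “vector change amount”; both choices, as well as the function names and sample coordinates, are assumptions.

```python
import math


def movement_width(prev, curr):
    """Euclidean displacement of one feature point between two frames."""
    return math.hypot(curr[0] - prev[0], curr[1] - prev[1])


def mean_movement_width(positions_per_dialogue):
    """Divide the accumulated displacement by the number of dialogues.

    `positions_per_dialogue` holds, for each dialogue, the successive
    (x, y) positions of a single feature point.
    """
    if not positions_per_dialogue:
        raise ValueError("at least one dialogue is required")
    total = 0.0
    for positions in positions_per_dialogue:
        for prev, curr in zip(positions, positions[1:]):
            total += movement_width(prev, curr)
    return total / len(positions_per_dialogue)


# Hypothetical position histories for two dialogues.
history = [
    [(0, 0), (3, 4)],           # displacement of 5
    [(0, 0), (0, 1), (0, 3)],   # displacements of 1 and 2
]
average = mean_movement_width(history)
```

The resulting average can then serve as the threshold against which the movement width observed in the current dialogue is compared.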

FIG. 9 is a flowchart illustrating the operation of the voice interaction apparatus 800.

FIG. 9 (a) shows the operation in the system driven mode, and FIG. 9 (b) shows the operation in the user driven mode.

Of the operations of the voice interaction apparatus 800, steps S901, S902, and S911 are the same as steps S601, S602, and S611 in FIG. 6.

In step S903, in addition to the operation in step S603 of FIG. 6, the movement of the user's face in response to the voice output from the voice output unit 108 is detected from the image feature amount of the user's face image.

In step S904, in addition to the operation of step S604 in FIG. 6, the activity level of the user's conversation is determined from the detected image feature amount. If the activity is equal to or greater than the threshold (“YES” in step S904), the switching unit 105 switches from the system-driven mode to the user-driven mode, and proceeds to step S911. If the activity is not high (“NO” in step S904), the process returns to step S901.

The operations in step S913 and step S914 in the user-driven mode are the same as the operations in FIG. 6 except that the operation is performed using the image feature amount, as in steps S903 and S904.

As described above, according to the present embodiment, providing the image feature amount calculation unit 801 in addition to the voice feature amount calculation unit 103 makes it possible to determine the activity level of the user's conversation not only from voice but also from images. The activity level of the user's conversation can therefore be determined with high accuracy, and a sustained conversation becomes possible.

In addition, each function of each part of the voice interactive apparatus according to the embodiment of the present invention can be executed by a computer by a voice interactive program stored in a computer-readable storage medium. Further, the present invention is not limited to the above-described embodiment, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage. In addition, various inventions can be formed by appropriately combining a plurality of components disclosed in the embodiment. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, constituent elements over different embodiments may be appropriately combined.

DESCRIPTION OF SYMBOLS
100 Voice interaction apparatus
101 Generation unit
102 Voice synthesis unit
103 Voice feature amount calculation unit
104 Determination unit
105 Switching unit

Claims

1. A voice interaction apparatus comprising:
a generation unit that generates a dialogue sentence corresponding to either a user-driven mode or a system-driven mode in accordance with an instruction specifying the mode;
a voice presentation unit that presents the dialogue sentence to the user by voice;
a voice feature amount calculation unit that calculates a voice feature amount of the user in response to the voice presented by the voice presentation unit;
a determination unit that calculates the activity of the user's dialogue based on the voice feature amount; and
a switching unit that switches the mode of the generation unit to the user-driven mode when the activity is greater than or equal to a threshold, and switches the mode of the generation unit to the system-driven mode when the activity is less than the threshold.

2. The voice interaction apparatus according to claim 1, wherein the switching unit switches the mode of the generation unit to the user-driven mode when the past activity is smaller than the current activity, and switches the mode of the generation unit to the system-driven mode when the past activity is greater than the current activity.

3. The voice interaction apparatus according to claim 1, further comprising an image feature amount calculation unit that calculates an image feature amount of a captured image, wherein the determination unit calculates the activity using the voice feature amount and the image feature amount.

4. The voice interaction apparatus according to claim 1, further comprising an expression output unit that causes a robot or CG to operate in accordance with operation data stored in association with the dialogue sentence.

5. A voice dialogue program that causes a computer to function as:
a generation unit that generates a dialogue sentence corresponding to either a user-driven mode or a system-driven mode in accordance with an instruction specifying the mode;
a voice presentation unit that presents the dialogue sentence to the user by voice;
a voice feature amount calculation unit that calculates a voice feature amount of the user in response to the voice presented by the voice presentation unit;
a determination unit that calculates the activity of the user's dialogue based on the voice feature amount of the user; and
a switching unit that switches the mode of the generation unit to the user-driven mode when the activity is greater than or equal to a threshold, and switches the mode of the generation unit to the system-driven mode when the activity is less than the threshold.