WO2009087860A1

WO2009087860A1 - Voice interactive device and computer-readable medium containing voice interactive program

Info

Publication number: WO2009087860A1
Application number: PCT/JP2008/072703
Authority: WO
Inventors: Akiko Yamato
Original assignee: Brother Kogyo Kabushiki Kaisha
Priority date: 2008-01-10
Filing date: 2008-12-12
Publication date: 2009-07-16
Also published as: JP2009186989A

Abstract

In a case where an input of the voice from the user (S4: YES), the inputted voice is analyzed and the characters are converted (S5). In a case where the inputted voice is not an ending instruction (S6: NO), a keyword is extracted from the converted character string (S7) and determination context is determined on the basis of the extracted keyword (S8). In a case where there is a change in the determination context (S9: YES) and measurement time by a timer has not elapsed five minutes (S13: NO), it is judged whether or not the determination context is meaning context. In a case where the determination context is the meaning context, the “first change attribute” and “second change attribute” of a first attribute information storage area are referenced and the attribute of an output voice is changed (S14). A response sentence is determined (S20), voice-synthesized according to the attribute after change stored in the attribute storage area (S21), and outputted from a speaker (S22).

Description

Spoken dialogue apparatus and computer-readable medium storing voice dialogue program

The present invention relates to a voice interactive device and a computer-readable medium storing a voice interactive program. More specifically, the present invention relates to a voice interactive apparatus capable of changing the tone of voice when conversation contents change, and a computer-readable medium storing a voice interactive program.

Conventionally, when a user uses a computer, information is input by a keyboard or a mouse, and information is output by displaying characters or images on a display. There has been proposed a user support apparatus and system that perform voice input / output so that information can be input / output in a user-friendly environment rather than such input / output (see, for example, Patent Document 1). In the user support device described in Patent Literature 1, input / output of information is performed by the user support device and the user interacting with each other.
JP 2002-163171 A

When humans have a conversation, the tone and tempo change as the conversation changes. For example, if the content of the story changes from a job story to a hobby story, the serious tone during the job story changes into a joyful and light tone during the hobby story. However, an apparatus such as the user support apparatus described in Patent Document 1 interacts with the user at a fixed voice and a fixed speed. Therefore, even if the content of the conversation changes, the tone of the voice to talk with does not change accordingly, and the user may feel unnatural.

This disclosure is intended to provide a voice dialogue apparatus capable of changing the tone of voice when conversation contents change, and a computer-readable medium storing a voice dialogue program.

According to the present disclosure, a voice input unit that inputs voice, a conversion unit that converts an input voice, which is a voice input by the voice input unit, into a character string, and a context that stores a conversation context corresponding to a keyword A keyword stored in the context storage means is extracted from a storage means and a converted character string that is a character string converted by the conversion means, and stored in the context storage means corresponding to the extracted keyword. Output by the context determination means for determining the context of the input voice as the context of the input voice, the conversation sentence determination means for determining a conversation sentence according to the input voice, the voice output means for outputting the voice, and the voice output means. Attribute storage means for storing audio attributes, and attributes stored in the attribute storage means. Output control means for causing the voice output means to output the conversation sentence determined by the sentence determination means, and the determination context that is the context determined by the context determination means is determined previously by the context determination means A determination unit that determines whether or not the determination context has changed from a previous determination context that is a context; and when the determination unit determines that the determination context has changed, the attribute of the voice stored in the attribute storage unit is changed There is provided a voice interactive apparatus comprising attribute changing means.

2 is a hardware block diagram of the voice interactive apparatus 100. FIG. 3 is a schematic diagram showing a configuration of a context tree storage area 131. FIG. 4 is a schematic diagram of a tree structure of contexts stored in a context tree storage area 131. FIG. 5 is a schematic diagram showing a configuration of a first attribute information storage area 1321. FIG. It is a schematic diagram which shows the structure of the 2nd attribute information storage area 1322. FIG. 10 is a schematic diagram showing a configuration of a third attribute information storage area 1323. FIG. 3 is a flowchart of main processing of the voice interaction apparatus 100. It is a flowchart of the 1st process performed in the main process. It is a flowchart of the 2nd process performed in the main process. It is a figure which shows an example of the dialogue between a user and a voice dialogue agent.

Hereinafter, embodiments according to the present disclosure will be described with reference to the drawings. The voice interaction apparatus 100 of this embodiment is a so-called personal computer. As shown in FIG. 1, the voice interaction apparatus 100 is provided with a CPU 10 that controls the voice interaction apparatus 100. Connected to the CPU 10 are a RAM 11 that temporarily stores various data and a ROM 12 that stores BIOS and the like. Further, a hard disk device 13, an output control unit 14, an input control unit 15, an audio output control unit 16, an audio input control unit 17, and a timer 18 are connected to the CPU 10 via a bus. An output device 24 is connected to the output control unit 14, and an input device 25 is connected to the input control unit 15. The output device 24 is, for example, a display, and the input device 25 is, for example, a mouse or a keyboard. A speaker 26 is connected to the audio output control unit 16, and a microphone 27 is connected to the audio input control unit 17. The timer 18 measures time.

The hard disk device 13 is provided with at least a context tree storage area 131, an attribute information storage area 132, an acoustic model storage area 133, a voice interaction program storage area 134, and other information storage areas 135. In the context tree storage area 131, a context tree indicating the relationship of contexts (contents of conversation) is stored. The attribute information storage area 132 stores information related to a voice attribute (hereinafter referred to as “voice attribute information”) designated when a context conversation satisfying a predetermined condition is made. The acoustic model storage area 133 stores a plurality of acoustic models for outputting sound from the microphone 27. The voice interaction program storage area 134 stores a voice interaction program executed by the CPU 10. In the other information storage area 135, other information used in the voice interactive apparatus 100 is stored.

The RAM 11 is provided with a currently determined context storage area 111, a previous determined context storage area 112, and an attribute storage area 113. In the current determination context storage area 111, a context ID of the current determination context (hereinafter referred to as “determination context ID”) is stored. In the previous determination context storage area 112, a context ID of the determination context immediately before becoming the current context (hereinafter referred to as “previous determination context ID”) is stored. The attribute storage area 113 stores attributes used when the voice output from the speaker 26 is synthesized. The attribute data items are, for example, speed, pitch, acoustic model, and voice quality after filtering.

In the present embodiment, when the voice dialogue program is executed in the voice dialogue apparatus 100, the voice dialogue agent is activated. An image of the character is displayed on the output device (display) 24 by the voice interaction agent. This character image is a concrete representation of a voice interaction agent. The user interacts with the voice interaction agent as if interacting with the character image. A speech (voice) from the user is input from the microphone 27. The input voice is analyzed as text and used as an input sentence from the user. A response sentence corresponding to the input sentence is determined, voice-converted, and output from the speaker 26 as voice. At the time of voice output, the character image also has a design that speaks a word, giving the user a sense of realism that is interacting with the character.

Furthermore, the dialogue content between the user and the voice dialogue agent is determined by a keyword in the user's input sentence. This dialogue is called “context”. This context is represented by a tree structure (see FIG. 3). The voice interaction agent changes the attribute of the output voice of the voice interaction agent according to the specific context and the moving state of the context, and outputs a sound suitable for the content of the conversation.

The context tree storage area 131 provided in the HDD 13 will be described with reference to FIGS.

As shown in FIG. 2, the context tree storage area 131 is provided with “context ID”, “context name”, and “keyword” as data items. A context name is given for each context ID. Further, a keyword is assigned to the context ID. When a keyword appears in the conversation between the user and the voice interaction agent, the context associated with the keyword is set as the “determined context” that is the context of the current conversation. Note that the context shown in FIG. 2 is an example.

Context rule assignment rules are explained. The context ID “0000” is an ID given to the context that is the root of the tree structure. The context on the branch is given a 4-digit + 4-digit ID such as “0100-0000”. The last four digits “0000” are the context ID of the parent (one layer higher). That is, “0100-0000” indicates a child of the context ID “0000” (lower one hierarchy). Hereinafter, the subsequent four-digit ID is referred to as “parent ID”. In the context tree storage area 131 shown in FIG. 2, as shown in FIG. 3, the context with the context name “general” having the context ID “0000” is the root. As a child of the context with the context ID “0000”, the context with the context name “music” with the context ID “0100-0000”, the context with the context name “art” with the context ID “0101-0000”, and the context ID “0102-0000”. The context with the name "Chat" is connected to the root context.

Also, out of the four digits before the context ID, the first two digits indicate the hierarchy of the tree structure. As shown in FIG. 3, in the context of the root of the tree structure, the previous two-digit ID is “00”, indicating the hierarchy “00”. In the context of the context name “music” with the context ID “0100-0000”, the previous two-digit ID “01” indicates the first layer. In the context IDs “0200-0100” and “0201-0100” of the context child of the context ID “0100-0000”, the first two digits of the previous four digits “02” indicate the second layer. Yes. Further, the last two digits of the preceding four digits are identification numbers in the same hierarchy. In the example shown in FIG. 2 and FIG. 3, “01” and “02” are assigned in order from “00” as identification numbers. Hereinafter, the first four digits are referred to as “own ID”, the previous two digits of the own ID are referred to as “hierarchy ID”, and the latter two digits are referred to as “identification number”. That is, the context ID is composed of “(own ID 4 digits) − (parent ID 4 digits)”, that is, “(hierarchical ID 2 digits) (identification number 2 digits) − (parent ID 4 digits)”. According to such a context ID assigning rule, IDs that do not overlap each other are assigned to the context, so that the context can be identified by the context ID.

Next, the attribute information storage area 132 provided in the HDD 13 will be described with reference to FIGS. The attribute information storage area 132 includes a first attribute information storage area 1321, a second attribute information storage area 1322, and a third attribute information storage area 1323.

First, the first attribute information storage area 1321 will be described with reference to FIG. The first attribute information storage area 1321 stores voice attribute information for changing attributes when a context having a special meaning becomes a determined context. As shown in FIG. 4, the first attribute information storage area 1321 is provided with “meaning”, “context ID”, “first change attribute”, and “second change attribute” as data items. In the “first change attribute” and “second change attribute”, items of “type”, “method”, and “change value” are provided, respectively. A context ID is assigned to each meaning, and two types of attributes can be set as change attributes. Examples of attribute types include the speed of output speech, the type of acoustic model used for speech synthesis, the pitch of output speech, and the voice quality of output speech after filtering. The attribute is not limited to this, and an attribute that can be assigned to a speech synthesis program for performing speech synthesis may be used. Hereinafter, the context specified by the context ID assigned to the meaning is referred to as a “semantic context”.

In the example shown in FIG. 4, the special meanings are “hobby”, “special field”, “disadvantage field”, and “chat”. The context ID assigned to “hobby” is “0101-0000”. Since the first change attribute is “speed” and the change value is “1.2”, the speed of the output voice is changed to 1.2. Since the second change attribute is “pitch” and the method is “high”, the pitch of the output audio is changed by a predetermined amount. The predetermined amount to be changed is determined in advance. For example, if the method is “high”, the pitch is changed by 0.1 higher than the current pitch. If the method is "low", the pitch is changed 0.1 lower than the current pitch. Further, in the meaning “special field”, “voice type” is designated as the first change attribute, and the change value is “modelC”. This indicates that an acoustic model “modelC” is used among the acoustic models when performing speech synthesis. The acoustic model is stored in the acoustic model storage area 133 of the HDD 13. Note that the example shown in FIG. 4 is merely an example, other meanings may be set, and a plurality of contexts may be assigned to one meaning. The voice attribute information is not limited to the information shown in FIG.

Next, the second attribute information storage area 1322 will be described with reference to FIG. The second attribute information storage area 1322 stores audio attribute information for changing attributes when a context of a specific hierarchy in the context tree becomes a determined context. Hereinafter, a context belonging to a specific hierarchy is referred to as a “specific hierarchy context”. As shown in FIG. 5, in the second attribute information storage area 1322, “hierarchy” and “first change attribute” are provided as data items. In the “first change attribute”, items of “type”, “method”, and “change value” are provided. The first change attribute is assigned to each layer, and one voice attribute can be set as the change attribute.

In the example shown in FIG. 5, “highest level”, “second level”, and “lowest level” are designated as specific levels. If the determined context is the highest layer of the context tree, that is, if the context ID is “0000”, an instruction to change all attributes to initial values is issued. If the determined context is the context of the second layer, that is, if the context ID is “02 ***-****” (* is an arbitrary number), an instruction to set the pitch to “0.6” is issued. The If the determined context is the lowest layer of the context tree, that is, in the example shown in FIGS. 2 and 3, if the context ID is “04 ***-***”, an instruction to set the voice quality to “0.4” is issued. The Note that the change instruction illustrated in FIG. 5 is an example, and the change instruction may be set for another layer, and the change content may be other content.

Next, the third attribute information storage area 1323 will be described with reference to FIG. Although details will be described later, in the voice interaction device 100, when the determined context is changed, it is determined in what positional relationship the determined context has moved in the context tree. The third attribute information storage area 1323 stores voice attribute information for changing the attribute when the movement of the determination context is a specific position change. As shown in FIG. 6, “position change” and “first change attribute” are provided as data items in the third attribute information storage area 1323. In the “first change attribute”, items of “type”, “method”, and “change value” are provided. A first change attribute is assigned to each position change, and one voice attribute can be set as a change attribute.

In the example shown in FIG. 6, the position changes are “moving next (small ID)”, “moving next (large ID)”, “moving up one level”, “moving down one level”, “up two levels "Move to" and "Move down two levels" are provided. “Move to next (small ID)” indicates movement to the context of the next lower level in the context tree and having the identification number one smaller. In other words, when the hierarchy ID of the decision context before the movement and the decision context after the movement are equal and “identification number after movement = identification number before movement−1” is satisfied, this “move next (ID small ID ) ”. “Move to next (large ID)” indicates a move to a context having an identification number larger by one in the context next to the same hierarchy in the context tree. That is, the movement in which the hierarchy ID of the determination context before the movement and the determination context after the movement is equal and “the identification number after the movement = the identification number before the movement + 1” is established is the “movement next (large ID)”. It corresponds to.

“Move up one level” indicates a move up to the context one level higher in the context tree. The movement in the case where the parent ID of the determination context before the movement and the own ID of the determination context after the movement are equal corresponds to this “move up one level”. “Move down one level” indicates movement to a context one level below in the context tree. The movement in the case where the own ID of the determination context before movement is equal to the parent ID of the determination context after movement corresponds to this “move down one level”. “Move up two layers” indicates a move up to a context two levels higher in the context tree. The movement when the parent ID of the context of the parent ID of the determination context before the movement is equal to the own ID of the determination context after the movement corresponds to this “move up two levels”. “Move down two levels” indicates movement to the context two levels below in the context tree. The movement when the parent ID of the context of the parent ID of the determined context after the movement is equal to the own ID of the determined context before the movement corresponds to this “move down two levels”. That is, when viewed from the decision context before movement, the decision context after movement is a parent, a parent of a parent, a child, and a child of a child (these relationships are collectively referred to as In this embodiment, the voice attribute is changed to “parent-child relationship”.

Next, with reference to FIG. 7 to FIG. 9, the operation when the voice interaction agent is activated in the voice interaction device 100 will be described with a focus on changing the voice attribute. The operation of the main process shown in FIG. 7 is executed by the CPU 10 according to the voice interaction program stored in the hard disk device 13. First, the first determination context and voice attributes are set (S1). The initial decision context and audio attributes are predetermined. The first context ID is stored in the currently determined context storage area 111 of the RAM 11, and the first audio attribute is stored in the attribute storage area 113 of the RAM 11. In the example illustrated in FIGS. 2 and 3, for example, the context ID “0000” is the first determination context.

Next, the value of the counter C that counts the number of times the decision context has changed is initialized to “0” that is an initial value (S2). The timer 18 for measuring the reference time for changing the sound attribute is reset, and the time measurement is started (S3). When a sound is input from the microphone 27, it is determined whether or not a sound is input from the user (S4). When there is no voice input from the user (S4: NO), repeated input confirmation is performed (S4), and a standby state for input from the user is set.

When there is a voice input from the user (S4: YES), the input voice is analyzed by a well-known voice analysis technique and converted into a character (S5). It is determined whether or not an instruction to end the voice interaction agent has been issued based on whether or not the obtained character string is a word indicating the end of the voice interaction agent (S6). The words indicating the end of the voice interaction agent are registered in advance, for example, “End”, “Bye Bye”, “Goodbye”, “Jaane”, “End”, “Good Night”. If the obtained character string is not an end instruction (S6: NO), a keyword is extracted from the character string (S7). Specifically, the part of speech is decomposed, and it is determined whether or not there is a keyword in the obtained word. If a word registered in the “keyword” of the context tree storage area 131 is included in the word, the keyword that appears first in the character string is set as a keyword for context determination. Then, a determination context is determined based on the extracted keyword (S8). Specifically, the context ID associated with the extracted keyword is set as the context ID of the determined context. The context ID currently stored in the determined context storage area 111 is stored in the previous determined context storage area 112. The context ID associated with the keyword is stored in the currently determined context storage area 111.

Next, it is determined whether or not the decision context has changed (S9). If the context ID stored in the previously determined context storage area 112 is the same as the context ID stored in the currently determined context storage area 111, it is determined that the determined context has not changed (S9: NO). . Then, the first process is performed (S10).

When the first process shown in FIG. 8 is started, it is determined whether or not the time measured by the timer 18 has passed 5 minutes or more (S31). If five minutes or more have not elapsed (S31: NO), the process returns to the main process. If 5 minutes or more have elapsed (S31: YES), it is determined whether or not the value of the counter C is “0” (S32). If it is “0”, that is, if the decision context has not changed for 5 minutes or more (S32: YES), the value of “pitch” which is one of the attributes is changed to 0.8 times (S33). ). The timer 18 is reset, time measurement is started (S34), and the process returns to the main process.

When the value of the counter C is not “0” (S32: NO), it is determined whether or not the value of the counter C is “5” or more (S35). If it is not "5" or more (S35: NO), the timer 18 is reset, time measurement is started (S34), and the process returns to the main process. If the value of the counter C is “5” or more, that is, if the determined context has changed at least five times in 5 minutes (S35: YES), all audio attributes are changed to initial values. (S36). The value of the counter C is initialized to “0” (S37). The timer 18 is reset, time measurement is started (S34), and the process returns to the main process.

When the process returns to the main process shown in FIG. 7, a response sentence is determined in response to the text converted from the voice input by the user (S20). The response sentence is determined based on a predetermined rule by a well-known dialogue technique. The type of response sentence to be determined is not particularly important and will not be described. The response sentence determined in S20 is voice-synthesized by a well-known voice synthesis technique based on the attribute stored in the attribute storage area 113 of the RAM 11 (S21) and output from the speaker 26 (S22). And it returns to S4 and the input from a user waits (S4).

If there is a change in the decision context (S9: YES), “1” is added to the value of the counter C that counts the number of times the decision context has changed (S12). It is determined whether or not the time measured by the timer 18 has passed 5 minutes or more (S13). If 5 minutes or more have not elapsed (S13: NO), the second process is performed (S14).

When the second process shown in FIG. 9 is started, first, it is determined whether or not the determined context is a semantic context (S38). If the context ID of the determined context is stored in the “context ID” of the first attribute information storage area 1321 (see FIG. 4), it is determined that the determined context is a semantic context (S38: YES). Therefore, the attribute of the output voice is changed (S41). Specifically, “first change attribute” and “second change attribute” in the first attribute information storage area 1321 are referred to. In this case, in the attribute storage area 113, the attribute designated by “type” is changed based on the designation of “method” or “change value”. For example, if the determined context ID is “0101-0000”, “Speed” is set to “1.2”, and “0.1” is added to the value of “Pitch”. Thereafter, the process returns to the main process.

If the determination context is not a semantic context (S38: NO), it is determined whether the determination context is a specific hierarchy context (S39). When the determined context ID is a context ID belonging to the hierarchy specified in the “hierarchy” of the second attribute information storage area 1322 (see FIG. 5), it is determined that the determined context is a specific hierarchy context ( S39: YES). In the example shown in FIG. 5, when the self ID of the decision context ID is “0000” (the highest layer), when the hierarchy ID of the decision context ID is “02” (the second layer), or the decision context When the ID of the ID is “04” (lowermost layer), it is determined that the determined context is a specific hierarchy context. In this case, in the attribute storage area 113, the attribute specified by the “type” of the “first change attribute” in the second attribute information storage area 1322 is changed based on the designation of “method” or “change value”. (S42). For example, if the hierarchy ID is “02”, the “pitch” is set to “0.6”. Thereafter, the process returns to the main process.

If the determined context is not the specific hierarchy context (S39: NO), it is determined whether the moving state of the determined context is a predetermined position change (S40). The determined context ID is compared with the previous determined context ID, and if the movement state is designated as “position change” in the third attribute information storage area 1323 (see FIG. 6), it is determined that the predetermined position change has occurred. (S40: YES). For example, in the example illustrated in FIG. 6, when the parent ID of the determination context before the movement and the own ID of the determination context after the movement are equal, it is determined that the position change is “move up one level”. In this case, in the attribute storage area 113, the attribute specified by the “type” of the “first change attribute” in the third attribute information storage area 1323 is changed based on the designation of “method” or “change value”. (S43). Thereafter, the process returns to the main process.

When the process returns to the main process shown in FIG. 7, a response sentence is determined in response to a word converted from the voice input by the user (S20). The response sentence is synthesized by a known speech synthesis technique based on the changed attribute stored in the attribute storage area 113 (S21) and output from the speaker 26 (S22). Then, the process returns to S4, and an input from the user is waited (S4).

In addition, when there is a change in the determination context (S9: YES) and the time measured by the timer 18 has passed 5 minutes or more (S13: YES), the value of the counter C is “5” or more. Is determined (S15). If the value of the counter “C” is not “5” or more (S15: NO), the timer 18 is reset, time measurement is started (S16), and the second process is performed (S14).

If the value of the counter C is “5” or more (S15: YES), all voice attributes are changed to initial values (S17). The value of the counter C is initialized to “0” (S18). The timer 18 is reset and time measurement is started (S19). A response sentence is determined (S20), and the determined response sentence is synthesized (S21) and output from the speaker 26 (S22). Then, the process returns to S4, and an input from the user is waited (S4).

By repeating the processing of S4 to S22, the dialogue between the user and the voice dialogue agent proceeds. When the context changes, if the changed context is a semantic context, a specific hierarchy context, or a predetermined position change occurs, the attribute of the output sound is changed. If the time when the decision context has not changed is equal to or longer than the predetermined time, the attribute is changed. If the determination context has exceeded the predetermined number of times within the predetermined time, the attribute is changed. The response sentence output by the voice interaction agent is converted into voice based on the changed attribute stored in the attribute storage area 113, and the voice is output from the speaker 26. When the user inputs a word for instructing the end, this process ends.

Hereinafter, with reference to FIG. 10, the dialogue between the user and the voice interaction agent in the example shown in FIGS. 2 to 6 will be described with a specific example. In FIG. 10, “dialogue number” is a number assigned to a set of an input sentence from the user and a response sentence of the voice interaction agent. The “input sentence from the user” is a sentence obtained by converting the voice input from the microphone 27 into characters. “Keyword” is a keyword extracted from the input sentence. The “context” is a determination context determined by a keyword. In “attribute”, “acoustic model”, “pitch”, “speed”, and “voice quality” are exemplified as voice attributes. The “agent response text” is a response text output from the voice interaction agent in response to the input text. In the following specific example, all dialogues take place within 5 minutes.

First, the determination context ID of the first determination context is set to “0000”. The initial value of the attribute is also stored in the attribute storage area 113 of the RAM 11 (S1). Input sentence of dialogue number 1 for the "Hello", "Hello" is extracted as a keyword (S7). "Hello" because associated with the context of the context name of the context ID "0000", "General" (see FIG. 2), determining the context ID is "0000" (S8). Since the previous determined context is also “0000”, there is no change in the context (S9: NO). In this case, if the measurement time by the timer 18 is less than 5 minutes (S31: NO), the attribute remains unchanged at the initial value. Response sentence "Hello, recently somewhere to went out?" Is determined (S20), speech synthesis in accordance with the attribute of the initial value is performed (S21), the response sentence is output (S22).

Next, the user makes the following remark, and the input sentence “Yes, I went to the exhibition” of the dialogue number 2 is input (S4: YES). The input voice is converted into characters (S5), and "exhibition" is extracted as a keyword (S7). Since “Exhibition” is associated with the context of the context name “Art” with the context ID “0101-0000” (see FIG. 2), the determined context ID is “0101-0000” (S8). Since the previous determined context is “0000”, there is a change in the context (S9: YES). The measurement time by the timer 18 is less than 5 minutes (S13: NO), and the determination context is a semantic context (S38: YES). Since “0101-0000” is the context ID of the semantic context of the meaning “hobby” (see FIG. 4), “0.1” is added to the initial value “1.0” and “1.1”. The speed is changed from the initial value “1.0” to the changed value “1.2” (S41). The response sentence “Hey, exhibition. Do you see pictures or sculptures?” Is determined (S20), speech synthesis is performed with the changed speech attributes (S21), and a response sentence is output (S22).

Next, an input sentence of dialogue number 3 “This time it was an exhibition of pictures” is input (S4: YES). “Exhibition” is extracted as a keyword (S7). Since “Exhibition” is associated with the context with the context name “Art” having the context ID “0101-0000”, the determined context ID is “0101-0000” (S8). Since the previously determined context ID is also “0101-0000”, there is no change in the determined context (S9: NO). A response sentence “what picture?” Is determined (S20), speech synthesis is performed with the same attribute as the previous time (S21), and a response sentence is output (S22).

Next, the input sentence “Japanese painting is” of dialogue number 4 is input (S4: YES). “Japanese painting” is extracted as a keyword (S7). Since “Japanese painting” is associated with the context with the context name “Japanese painting” having the context ID “0202-0101” (see FIG. 2), the determined context ID is “0202-0101” (S8). Since the previous determined context is “0101-0000”, there is a change in the context (S9: YES). The measurement time by the timer 18 is less than 5 minutes (S13: NO), and the determination context is a semantic context (S38: YES). Since the context ID “0202-0101” is a semantic context of the meaning “special field” (see FIG. 4), the acoustic model is changed to “modelC” (S41). The response sentence “Hey, Japanese painting. Old picture or modern Japanese painting?” Is determined (S20), voice synthesis is performed with the changed attribute (S21), and a response sentence is output (S22).

Next, the input sentence of dialogue number 5 “Old Dane. It was a Kano school exhibition” is entered (S4: YES). “Kano School” is extracted as a keyword (S7). Since the “Kano school” is associated with the context of the context name “Kano school” with the context ID “0304-022” (see FIG. 2), the determined context ID is set to “0304-0202” (S8). Since the previous determined context is “0202-0101”, there is a change in the context (S9: YES). The measurement time by the timer 18 is less than 5 minutes (S13: NO), and the decision context is neither a semantic context nor a specific hierarchy context (S38: NO, S39: NO), but is one level lower than the previous decision context. It is moving (S40: YES). Therefore, the speed is set to “1.3” by adding “0.1” to the stored value “1.2” (S43). A response sentence “What kind of work was there in the Kano school?” Is determined (S20), speech synthesis is performed with the changed attribute (S21), and a response sentence is output (S22).

Next, an input sentence of dialogue number 6 “A work by a person named Eikari Kano was on display” was entered (S4: YES). “Kano Ekinori” is extracted as a keyword (S7). Since “Kano Naganori” is associated with the context of the context name “painter” of the context ID “0400-0304” (see FIG. 2), the determined context ID is “0400-0304” (S8). Since the previous determined context is “0304-0202”, there is a change in the context (S9: YES). The measurement time by the timer 18 is less than 5 minutes (S13: NO), and the determination context is a specific hierarchy context (lowermost layer) (S39: YES). Therefore, the voice quality is changed to “0.4” (S42). The response sentence “What is Kano Naganori's representative work?” Is determined (S20), speech synthesis is performed with the changed attribute (S21), and a response sentence is output (S22).

Next, the input sentence of dialogue number 7 “National treasure 洛中洛外図屏風なな” is input (S4: YES). “Takanaka-gai-gai-fu-fu” is extracted as a keyword (S7). Since “Takanaka 洛 gaifu” is associated with the context of the context name “works” with the context ID “0401-034” (see FIG. 2), the determined context ID is “0401-0304” (S8). . Since the previous determined context is “0400-0304”, there is a change in the context (S9: YES). The measurement time by the timer 18 is less than 5 minutes (S13: NO), and the determination context is a specific hierarchy context (lowermost layer) (S39: YES). Therefore, the voice quality is changed to “0.4” (S42). A response sentence “where is the picture?” Is determined (S20), speech synthesis is performed with the changed attribute (S21), and a response sentence is output (S22).

Next, the input sentence of dialogue number 8 “Where was, I forgot. Bye Bye” is input (S4: YES). “Bye-bye” is a termination instruction (S6: YES), and the dialogue between the user and the voice interaction agent is terminated.

As described above, the output voice of the voice interaction agent can be changed according to the content (context) of the conversation between the user and the voice interaction agent. Therefore, since the output voice of the voice dialogue agent becomes a voice commensurate with the context, a natural dialogue can be performed.

If voice attribute information indicating an attribute suitable for the context is stored in correspondence with the context, it is possible to output a voice suitable for the context, that is, the content of the conversation. Therefore, the output sound can be switched to a sound suitable for the content of the conversation according to the change in context. Therefore, the user can have a natural conversation without feeling uncomfortable with the content and voice of the conversation.

The user who is having a conversation with the voice interactive apparatus 100 can grasp the hierarchy of the context of the conversation by the output voice. Therefore, the user can talk while grasping the change state of the content of the conversation, and helps to enjoy the conversation. For example, if a specific hierarchy is set as the lowest hierarchy, the user can know that the context does not change to detailed contents any more. Further, if the specific hierarchy is the highest hierarchy, the user can know that the conversation can be shifted to more detailed contents. Further, if a tree structure is devised so as to give some meaning to the context of a predetermined hierarchy, some meaning can be conveyed to the user by a change in voice attributes.

The user can understand the situation in which the content of the conversation becomes deeper, shallower, or changes in the context of the same level by the voice output from the voice interactive device 100. Therefore, the user can talk while grasping the change state of the content of the conversation, and helps to enjoy the conversation.

The user who is having a conversation with the voice interaction apparatus 100 knows that the context has been switched many times within a predetermined time by the voice output from the voice interaction apparatus 100. Therefore, the user can talk while feeling the change state of the content of the conversation, which helps to enjoy the conversation.

The user who is having a conversation with the voice interaction apparatus 100 knows that the same context has continued for a predetermined time or longer by the voice output from the voice interaction apparatus 100. Even if there is no change in the context, the attributes of the output voice change, which helps to enjoy the conversation.

It should be noted that the voice interaction device and the voice interaction system according to the present disclosure are not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present disclosure. In the above-described embodiment, the voice interaction device having the voice interaction program is a so-called personal computer. However, the device having the voice interaction program need not be a personal computer. For example, a portable terminal, a mobile phone, or a television may be used as long as a microphone for inputting sound and a speaker for outputting sound are provided.

The context tree shown in FIGS. 2 and 3 is an example, and the context tree of this example is not necessarily adopted. In order to actually output a sound suitable for the conversation between the user and the voice interaction agent, it is desirable to create contexts in more fields and use a subdivided and deepened context tree. As the voice attribute information stored in the attribute information storage area 132 is also finely set, the voice more suitable for the conversation between the user and the voice interaction agent can be output. If the configuration of the context tree is devised, it is possible to cope with a more appropriate voice. When the number of contexts in the same hierarchy is increased to 100 or more, it is necessary to increase the number of digits of the context ID. The context ID assignment rule is not limited to the above-described embodiment. The voice interaction device may be configured so that the user can add contexts and keywords to the context tree. In this case, a character string may be received by the input device (keyboard or mouse) 24.

In the above embodiment, keywords are extracted from the input sentence, and the context is determined based on the first keyword that appears in the input sentence. However, the keyword used to determine the context is not limited to the keyword that appears first. For example, when a plurality of keywords are present in the input sentence, the context having the lowest hierarchy among the contexts associated with the respective keywords may be used as the determination context. When the same keyword is associated with a plurality of contexts, the determination context may be determined according to the context of the previous dialog in consideration of the flow of the dialog. For example, assume that the keyword “program” is assigned to both the context name “concert” and the context name “computer”. In this case, if the context of the previous dialogue is “music”, the determined context may be “concert”.

In the above-described embodiment, when the determination context after movement is a parent, when the determination context after movement is a parent, when it is a parent, when it is a child, and when it is a child of a child, the audio attribute is Be changed. However, the attribute may be changed only when the determination context before and after the movement is a parent and a child. The attribute may be changed even when the relationship is more than four generations away.

In the above embodiment, the voice attributes are changed when the determination contexts before and after the movement belong to the same hierarchy and the identification numbers are different by one. However, when the determination contexts before and after the movement belong to the same hierarchy, the attribute may be changed regardless of the identification number. The attribute may be changed when the determination contexts before and after the movement belong to the same hierarchy and the parent is the same.

Claims

Voice input means for inputting voice;
Conversion means for converting an input voice, which is a voice input by the voice input means, into a character string;
A context storage means for storing a conversation context corresponding to a keyword;
A keyword stored in the context storage unit is extracted from a converted character string that is a character string converted by the conversion unit, and the context stored in the context storage unit is associated with the extracted keyword. Context determining means for determining the context of the input voice;
A conversation sentence determining means for determining a conversation sentence according to the input voice;
Audio output means for outputting audio;
Attribute storage means for storing attributes of the sound output by the sound output means;
Output control means for causing the voice output means to output the conversation sentence determined by the conversation sentence determination means with the attribute stored in the attribute storage means;
A determination unit that determines whether or not the determination context that is the context determined by the context determination unit has changed from the previous determination context that is the context determined by the context determination unit;
A voice dialogue apparatus comprising: attribute changing means for changing a voice attribute stored in the attribute storage means when the determination means determines that the determination context has changed.
Attribute information storage means for storing audio attribute information related to audio attributes output by the audio output means in association with the context;
The attribute changing means, when the decision context changes to a context in which the voice attribute information is stored in the attribute information storage means, the voice attribute stored in the attribute storage means to the decision context The voice interactive apparatus according to claim 1, wherein the voice dialog device changes to a voice attribute indicated by the corresponding voice attribute information.
The data structure of the context storage means is a tree structure, and as the hierarchy of the tree structure progresses from upper to lower, a plurality of contexts are stored so as to be detailed conversation contents.
The spoken dialogue apparatus according to claim 1 or 2, wherein the attribute changing unit changes a voice attribute when the decision context and the previous decision context are in a parent-child relationship of the tree structure or belong to the same hierarchy. .
The data structure of the context storage means is a tree structure, and as the hierarchy of the tree structure progresses from upper to lower, a plurality of contexts are stored so as to be detailed conversation contents.
The voice interaction apparatus according to claim 1, wherein the attribute changing unit changes a voice attribute when the determined context becomes a context of a predetermined hierarchy of the tree structure.
5. The voice interactive apparatus according to claim 1, wherein the attribute changing unit changes a voice attribute when the determined context changes a predetermined number of times within a first predetermined time.
6. The voice interactive apparatus according to claim 1, wherein the attribute changing unit changes a voice attribute when the time during which the determination context does not change is equal to or longer than a second predetermined time.
A computer-readable medium storing a voice dialogue program for operating a computer as various processing means of the voice dialogue apparatus according to any one of claims 1 to 6.