WO2019058453A1 - Voice interaction control device and method for controlling voice interaction - Google Patents

Voice interaction control device and method for controlling voice interaction

Info

Publication number
WO2019058453A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
response
speech
unit
user
Prior art date
Application number
PCT/JP2017/033902
Other languages
French (fr)
Japanese (ja)
Inventor
Akio HORII
Yohei OKATO
Original Assignee
Mitsubishi Electric Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2017/033902
Priority to JP2019542865A (JP6851491B2)
Publication of WO2019058453A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a voice interaction control apparatus and a voice interaction control method for causing a system to present a response corresponding to voice input from a user when the user operates the system by interaction between the system and the user.
  • Conventionally, a system having a voice recognition function receives a voice uttered by a user and outputs a response corresponding to the voice.
  • In Patent Document 1, a voice dialogue control method has been proposed in which, when the user inputs an interrupting voice while the system is outputting voice, the voice output is continued or paused depending on the importance of the voice being output, and processing is performed on the interrupting voice.
  • However, the technique of Patent Document 1 cannot capture a subsequent second voice at a specific timing, for example, immediately after the end of the first voice is detected, that is, immediately after capture of the first voice ends.
  • The present invention has been made to solve the problems described above, and aims to provide a voice interaction control device that performs dialogue control so that the system can appropriately respond to a second voice input after a first voice.
  • A voice interaction control device according to the present invention performs dialogue control for causing the system to present to the user a response to voice input from the user when the user operates the system by interaction between the user and the system.
  • The device includes a voice section detection unit that detects a voice section from the beginning to the end of an input series of voice, a voice recognition unit that recognizes the voice in the voice section, a response generation unit that generates a response corresponding to the voice recognition result to be presented to the user from the system, and a dialogue control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit.
  • The dialogue control unit causes the voice section detection unit to detect a second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice until a first response corresponding to the voice recognition result of the first voice is presented to the user from the system, is not completed.
  • According to the present invention, it is possible to provide a voice interaction control device that performs dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • FIG. 1 is a block diagram showing the configuration of a voice interaction control device and system in a first embodiment.
  • FIG. 2 is a diagram showing an example of the processing circuit included in the voice interaction control device.
  • FIG. 3 is a diagram showing another example of the processing circuit included in the voice interaction control device.
  • FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
  • FIG. 5 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
  • FIG. 6 is a block diagram showing the configuration of a voice interaction control device and system in a second embodiment.
  • FIG. 7 is a diagram showing an example of the configuration of a system response database in the second embodiment.
  • FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
  • FIG. 9 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
  • FIG. 10 is a block diagram showing the configuration of a voice interaction control device and system in a third embodiment.
  • FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
  • FIG. 12 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
  • FIG. 13 is a block diagram showing the configuration of a voice interaction control device and system in a fourth embodiment.
  • FIG. 14 is a diagram showing an example of the configuration of a first dictionary database in the fourth embodiment.
  • FIG. 15 is a diagram showing an example of the configuration of a second dictionary database in the fourth embodiment.
  • FIG. 16 is a diagram showing an example of the configuration of a system response database in the fourth embodiment.
  • FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
  • FIG. 18 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
  • FIG. 19 is a block diagram showing the configuration of a voice interaction control device and system in a fifth embodiment.
  • FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fifth embodiment.
  • FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in a sixth embodiment.
  • FIG. 22 is a block diagram showing an example of the configuration of a voice dialogue control device mounted on a vehicle in a seventh embodiment.
  • FIG. 23 is a block diagram showing an example of the configuration of a voice dialogue control device provided in a server in the seventh embodiment.
  • Embodiment 1. A voice dialogue control apparatus and a voice dialogue control method according to the first embodiment will be described.
  • FIG. 1 is a block diagram showing the configuration of voice dialogue control apparatus 100 and system 200 in the first embodiment.
  • The system 200 receives a voice uttered by the user to operate the system 200, and presents a response to the voice to the user.
  • the system 200 includes a voice input device 21, a voice interaction control device 100 and a response presentation device 22.
  • the system 200 is, for example, a navigation system, an audio system, a control system that controls devices related to the driving of a vehicle, a control system that controls a driving environment, and the like.
  • the voice input device 21 is an interface for the user to operate the system 200.
  • The voice input device 21 receives a voice uttered by the user in order to operate the system 200, and outputs the voice to the voice dialogue control device 100.
  • the voice input device 21 is, for example, a microphone.
  • the voice interaction control device 100 receives voice from the voice input device 21 and performs interaction control for causing the system 200 to present a response corresponding to the voice to the user.
  • the response presentation device 22 presents the response generated by the voice interaction control device 100 to the user. Note that “to present” includes that the response presentation device 22 operates in accordance with the generated response.
  • the response presentation device 22 may present the response to the user by operating according to the response generated by the voice interaction control device 100.
  • For example, the response presentation device 22 is an audio output device or a display device.
  • The audio output device presents a response by, for example, outputting guidance information to a destination by voice.
  • The display device presents a response by, for example, displaying guidance information to a destination along with a map.
  • Alternatively, the response presentation device 22 is a music playback device.
  • The music playback device presents a response by playing music.
  • Alternatively, the response presentation device 22 is a drive control device of the vehicle.
  • Alternatively, the response presentation device 22 is an air conditioner, a light, a mirror position adjustment device, a seat position adjustment device, or the like.
  • the voice dialogue control apparatus 100 includes a voice section detection unit 11, a voice recognition unit 12, a response generation unit 13 and a dialogue control unit 14.
  • the voice section detection unit 11 detects a voice section from the beginning to the end of the input continuous voice.
  • The voice section detection unit 11 constantly monitors input voice.
  • The voice recognition unit 12 performs speech recognition on the speech in the voice section detected by the voice section detection unit 11. In doing so, the voice recognition unit 12 selects as the recognition vocabulary the acoustically or linguistically most probable vocabulary for the speech in the voice section.
  • the speech recognition unit 12 performs speech recognition, for example, with reference to a dictionary database (not shown).
  • The dictionary database may be provided in the voice interaction control apparatus 100 or in an external server. When the dictionary database is provided in the server, the voice dialogue control device communicates with the server, and the voice recognition unit 12 performs speech recognition with reference to the dictionary database.
  • the response generation unit 13 generates a response corresponding to the speech recognition result of the speech recognition by the speech recognition unit 12.
  • the response generator 13 generates a response, for example, with reference to a system response database (not shown).
  • The system response database is, for example, a table in which recognition vocabularies included in speech recognition results and responses are stored in association with each other.
  • the system response database may be provided in the voice interaction control device 100 or in an external server. When the system response database is provided in the server, the dialog control device communicates with the server, and the response generation unit 13 generates a response with reference to the system response database.
  • the response generation unit 13 outputs the response to the response presentation device 22.
  • the dialogue control unit 14 controls the operations of the speech segment detection unit 11, the speech recognition unit 12 and the response generation unit 13.
  • the dialogue control unit 14 controls each unit while monitoring the dialogue state of the system 200.
  • The dialogue state is the state at any point from when a voice is detected by the voice section detection unit 11 until a response corresponding to the voice is generated and the response is presented to the user.
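  • As an illustration, the dialogue state monitored by the dialogue control unit 14 could be modeled per utterance as shown below. This is a minimal sketch under assumptions of my own (the state names and notification handlers are hypothetical); the patent does not prescribe any particular software structure.

```python
# Hypothetical model of the dialogue state: one state value per utterance,
# advanced as detection, recognition, and generation notifications arrive.
from enum import Enum, auto

class DialogueState(Enum):
    WAITING_FOR_SPEECH = auto()   # standby; voice section detection armed
    DETECTING_SECTION = auto()    # beginning of a voice section detected
    RECOGNIZING = auto()          # end detected; speech recognition running
    GENERATING_RESPONSE = auto()  # recognition done; response being generated
    PRESENTING_RESPONSE = auto()  # response handed to the presentation device

class DialogueStateMonitor:
    """Tracks the state of each utterance from detection to presentation."""
    def __init__(self) -> None:
        self.states: dict[int, DialogueState] = {}

    def on_beginning_detected(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.DETECTING_SECTION

    def on_end_detected(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.RECOGNIZING

    def on_recognition_finished(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.GENERATING_RESPONSE

    def on_response_generated(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.PRESENTING_RESPONSE
```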
  • For example, the dialogue control unit 14 controls the operation of the voice recognition unit 12 based on a notification that the voice section detection unit 11 has detected the beginning or end of a voice section.
  • Further, the dialogue control unit 14 controls the start of response generation in the response generation unit 13 based on a notification that the voice recognition unit 12 has finished the speech recognition, or controls the voice recognition unit 12 to start speech recognition of subsequent voice.
  • the dialogue control unit 14 controls the processing for the first voice of the series and the processing for the second voice input after the first voice.
  • The processing for the first voice includes processing from detection of the first voice section forming the first voice until presentation of the first response from the system 200 to the user. More specifically, the processing for the first voice includes at least speech recognition of the first voice by the voice recognition unit 12 and generation, by the response generation unit 13, of a first response corresponding to the speech recognition result of the first voice.
  • The processing for the first voice may also include processing from detection of the end of the first voice section until the first response is presented by the response presentation device 22 and the beginning of the voice section forming the next input voice is detected.
  • The dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming the second voice so that the second response to the second voice can be generated even if the processing for the first voice is not completed.
  • The dialogue control unit 14 then causes the voice recognition unit 12 to recognize the second voice in the second voice section, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice, which is presented from the system 200 to the user.
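  • The unit structure described above might be sketched as follows. This is an illustration under my own assumptions (the class and method names are hypothetical); the patent defines the units functionally, not as code.

```python
# Hypothetical object model of units 11-14 of the voice dialogue control
# device 100. Keying work by utterance id is what lets a second utterance be
# detected and recognized while the first is still being processed.
from dataclasses import dataclass

@dataclass
class VoiceSection:
    utterance_id: int
    samples: bytes  # audio between the detected beginning and end

class VoiceSectionDetector:
    """Unit 11: detects the beginning and end of each continuous utterance."""
    def detect(self, audio_stream) -> VoiceSection: ...

class SpeechRecognizer:
    """Unit 12: recognizes the speech in one detected voice section."""
    def recognize(self, section: VoiceSection) -> str: ...

class ResponseGenerator:
    """Unit 13: maps a recognition result to a response for the user."""
    def generate(self, recognition_result: str) -> str: ...

class DialogueController:
    """Unit 14: sequences the other units per utterance."""
    def __init__(self, recognizer, generator, presenter):
        self.recognizer = recognizer
        self.generator = generator
        self.presenter = presenter  # callable standing in for device 22

    def handle(self, section: VoiceSection) -> None:
        result = self.recognizer.recognize(section)
        self.presenter(self.generator.generate(result))
```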
  • FIG. 2 is a diagram showing an example of the processing circuit 50 provided in the voice interaction control device 100. Each function of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 is realized by the processing circuit 50. That is, the processing circuit 50 includes the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14.
  • The processing circuit 50 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a circuit combining these.
  • The functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 may be realized individually by a plurality of processing circuits, or may be realized collectively by a single processing circuit.
  • FIG. 3 is a view showing another example of the processing circuit included in the voice interaction control device 100.
  • the processing circuit includes a processor 51 and a memory 52.
  • When the processor 51 executes the program stored in the memory 52, the functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized.
  • That is, software or firmware described as a program is executed by the processor 51 to realize each function; the voice dialogue control device 100 thus includes a memory 52 for storing the program and a processor 51 for executing it.
  • The program describes functions and operations in which the voice interaction control apparatus 100 detects a voice section from the beginning to the end of an input series of voice, recognizes the voice in the detected voice section, generates a response corresponding to the recognition result, and further controls the voice section detection, the speech recognition, and the response generation.
  • The program also describes a function and operation for detecting, when the voice interaction control apparatus 100 executes each control, a second voice section forming a series of second voice input after the first voice even when the processing for the first voice is not finished.
  • Further, the program causes the second voice in the second voice section to be recognized, causes a second response corresponding to the speech recognition result of the second voice to be generated, and causes the system 200 to present it to the user.
  • the above program causes a computer to execute the procedure or method of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above.
  • the processor 51 is, for example, a central processing unit, a processing unit, an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor) or the like.
  • The memory 52 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable Read Only Memory), or EEPROM (Electrically Erasable Programmable Read Only Memory).
  • Alternatively, the memory 52 may be a magnetic disk, a flexible disk, an optical disk, a compact disc, a mini disc, a DVD, or the like, or any storage medium to be used in the future.
  • The functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above may be realized partially by dedicated hardware and partially by software or firmware. Thus, the processing circuit realizes each of the functions described above by hardware, software, firmware, or a combination thereof.
  • FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
  • FIG. 5 is a flowchart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
  • First, the dialogue control unit 14 controls the voice section detection unit 11 to a standby state in which voice can be received, and controls the voice recognition unit 12 to a standby state in which speech recognition is possible. This control is triggered, for example, by a user operation instructing the system 200 to start accepting voice section detection. Alternatively, after startup of the system 200, the dialogue control unit 14 may automatically control the voice section detection unit 11 to the standby state in which voice can be received. After this control, the voice section detection unit 11 is constantly in a state of monitoring voice input, that is, in a detectable state.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • The voice recognition unit 12 outputs the voice recognition result of the first voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end of recognition.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice. The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14. Note that step S60 and the following step S70 are performed in parallel with the generation of the first response in the response generation unit 13.
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In step S80, the response generation unit 13 completes the generation of the first response.
  • The dialogue control unit 14 then causes the system 200 to present the first response to the user. That is, the response presentation device 22 presents the first response to the user.
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section detected by the voice section detection unit 11.
  • The voice recognition unit 12 outputs the voice recognition result of the second voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end of recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice input from the voice recognition unit 12, based on control from the dialogue control unit 14.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • The dialogue control unit 14 then causes the system 200 to present the second response to the user. That is, the response presentation device 22 presents the second response to the user.
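  • The essential property of FIG. 4, namely that detection and recognition of the second voice (steps S60 and S70) run in parallel with generation of the first response (steps S50 to S80), could be realized for instance with a worker thread per response, as in the sketch below. This thread-based design is my own assumption; the patent does not mandate it.

```python
# Hedged sketch: responses are generated on worker threads so that the main
# loop can go on recognizing the next detected voice section in the meantime.
import queue
import threading

def run_dialogue(sections: "queue.Queue", recognize, generate, present):
    """Consume detected voice sections; never block detection or recognition
    of utterance N+1 on response generation for utterance N."""
    workers = []
    while True:
        section = sections.get()      # S10/S60: a detected voice section
        if section is None:           # sentinel: no more input
            break
        result = recognize(section)   # S20-S40 / S70-S100
        # S50/S110: generate and present the response on a worker thread.
        t = threading.Thread(target=lambda r=result: present(generate(r)))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
```

  • With this structure, the first response is generated and presented on its own thread while the main loop is already recognizing the second voice, mirroring how steps S60 and S70 overlap steps S50 to S80 in the chart.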
  • As described above, the voice interaction control device 100 according to the first embodiment performs dialogue control for causing the system 200 to present to the user a response to voice input from the user when the user operates the system 200 by interaction between the user and the system 200.
  • The device includes the voice section detection unit 11 that detects a voice section from the beginning to the end of an input series of voice, the voice recognition unit 12 that recognizes the voice in the voice section, the response generation unit 13 that generates a response corresponding to the voice recognition result to be presented to the user from the system 200, and the dialogue control unit 14 that controls the voice section detection unit 11, the voice recognition unit 12, and the response generation unit 13.
  • The dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section so that a second response to a series of second voice input after the first voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice until the first response corresponding to the speech recognition result of the first voice is presented from the system 200 to the user, is not completed.
  • With such a configuration, the voice interaction control device 100 can perform dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • For example, the voice interaction control apparatus 100 can generate a response, without omission, to a second voice input immediately after the end of the first voice section.
  • Further, since the voice dialogue control apparatus 100 constantly inputs voice to perform voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
  • The voice dialogue control method according to the first embodiment is a method for dialogue control that includes detecting a voice section from the beginning to the end of an input series of voice, recognizing the speech in the voice section, generating a response corresponding to the speech recognition result to be presented to the user from the system 200, and controlling each of the voice section detection, the speech recognition, and the response generation.
  • In this method, the second voice section forming the second voice is detected so that a second response to a series of second voice input as voice after the first voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice input as voice until the first response corresponding to the voice recognition result of the first voice is presented to the user, is not finished.
  • With the voice interaction control method configured in this way, it is possible to perform dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • With this voice dialogue control method, it is also possible to generate a response, without omission, to a second voice input immediately after the end of the first voice section.
  • Further, since voice is constantly input to perform voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
  • Embodiment 2. A voice dialogue control apparatus and a voice dialogue control method according to the second embodiment will be described.
  • FIG. 6 is a block diagram showing the configurations of the voice interaction control device 101 and the system 200 in the second embodiment.
  • the system 200 includes a dictionary database storage device 23 in addition to the configuration shown in the first embodiment.
  • the voice recognition unit 12 of the voice dialogue control device 101 refers to the dictionary database stored in the dictionary database storage device 23 to perform voice recognition.
  • The voice dialogue control device 101 further includes a voice storage unit 15.
  • the voice storage unit 15 stores the voice in the voice section detected by the voice section detection unit 11.
  • In the present embodiment, the voice storage unit 15 stores the second voice in the second voice section; however, the present invention is not limited thereto, and the voice storage unit 15 may also store the first voice in the first voice section.
  • The dialogue control unit 14 causes the voice recognition unit 12 to perform voice recognition of the second voice stored in the voice storage unit 15 based on a notification indicating that the voice recognition unit 12 has finished voice recognition of the first voice, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice. Further, the dialogue control unit 14 causes the response generation unit 13 to generate the second response based on a notification indicating that generation of the first response is completed in the response generation unit 13.
  • The response generation unit 13 generates each response corresponding to each speech recognition result by referring to the system response database.
  • FIG. 7 is a diagram showing an example of the configuration of the system response database in the second embodiment.
  • The system response database is composed of recognition vocabularies contained in speech recognition results and responses corresponding to them. Depending on the configuration of the response presentation device 22 that presents the response to the user, a plurality of responses may be associated with one recognition vocabulary.
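  • A minimal sketch of such a system response database, using the vocabularies of this embodiment and assuming (as an illustration) one response per kind of response presentation device 22:

```python
# Hypothetical lookup table in the spirit of FIG. 7: recognition vocabulary
# -> responses, one per presentation device type.
SYSTEM_RESPONSE_DB: dict[str, dict[str, str]] = {
    "supermarket": {
        "speaker": "Display the search results for the supermarket.",
        "display": "Search results for supermarkets",
    },
    "convenience store": {
        "speaker": "Display the search results for the convenience store.",
        "display": "Search results for convenience stores",
    },
}

def generate_response(recognition_result: str, device: str) -> str | None:
    """Return the response for the first recognition vocabulary found in the
    recognition result, or None if no vocabulary matches."""
    for vocabulary, responses in SYSTEM_RESPONSE_DB.items():
        if vocabulary in recognition_result:
            return responses.get(device)
    return None
```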
  • Each function of the voice storage unit 15 and the dialogue control unit 14 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice storage unit 15 and the dialogue control unit 14 having the respective functions described above.
  • the function of the voice storage unit 15 is realized by the memory 52, for example.
  • The program stored in the memory 52 describes a function and operation of storing the second voice in the second voice section and, based on a notification indicating that voice recognition of the first voice is finished, recognizing the second voice stored in the memory 52 and generating a second response corresponding to the speech recognition result of the second voice.
  • The program also describes a function and operation of generating the second response based on a notification indicating that the generation of the first response is completed.
  • FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control apparatus 101 and the voice interaction control method according to the second embodiment.
  • FIG. 9 is a flowchart showing an example of the operation of the voice interaction control apparatus 101 and the voice interaction control method according to the second embodiment.
  • In the following, an example is shown in which the second voice is input during speech recognition of the first voice; the second voice may also be input during generation of the first response.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "I want to go to the supermarket" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In the present embodiment, the voice recognition unit 12 starts speech recognition of the first voice with reference to the dictionary database.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S32, the voice section detection unit 11 receives the second voice and detects the beginning of the second voice section.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S34, the dialogue control unit 14 causes the voice storage unit 15 to start storing the second voice, based on the notification of the detection of the beginning of the second voice section. Illustration of the operation regarding this notification is omitted.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • In the present embodiment, "supermarket" is included as a recognition vocabulary in the speech recognition result of the first voice.
  • The voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • The dialogue control unit 14 controls the following step S50, step S62, and step S70 to be executed based on that notification.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generating the first response.
  • In step S62, the voice recognition unit 12 starts reading the second voice from the voice storage unit 15, based on control from the dialogue control unit 14.
  • That is, while storing the second voice in the second voice section, the voice storage unit 15 outputs the already-stored portion of the second voice to the voice recognition unit 12 with a time difference.
  • Steps S62 through S73 below are executed in parallel with the generation of the first response in the response generation unit 13.
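  • This store-while-reading behavior of the voice storage unit 15 is essentially that of a thread-safe FIFO, as in the following sketch (my own illustration; the frame format and method names are assumptions):

```python
# Hypothetical realization of the voice storage unit 15: frames of the second
# voice are appended as they arrive, while the voice recognition unit reads
# already-stored frames with a time difference.
import queue

class VoiceStorageUnit:
    def __init__(self) -> None:
        self._frames: "queue.Queue[bytes | None]" = queue.Queue()

    def store_frame(self, frame: bytes) -> None:
        """Called while the second voice section is still being detected."""
        self._frames.put(frame)

    def end_of_section(self) -> None:
        """Step S72: storage of the second voice ends."""
        self._frames.put(None)  # sentinel marking the end of the section

    def read_frames(self):
        """Steps S62 to S73: yield frames in stored order until the section
        ends; reading may begin while storing is still in progress."""
        while (frame := self._frames.get()) is not None:
            yield frame
```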
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section read from the voice storage unit 15, based on the notification of the beginning detection.
  • Because the voice recognition unit 12 starts voice recognition of the second voice based on the notification that voice recognition of the first voice is finished, it can start voice recognition of the second voice after the voice recognition of the first voice.
  • In the present embodiment, the voice recognition unit 12 starts speech recognition of the second voice with reference to the dictionary database.
  • In step S71, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S72, the voice storage unit 15 ends the storage of the second voice.
  • In step S73, the reading of the second voice from the voice storage unit 15 ends.
  • In step S80, the response generation unit 13 completes the generation of the first response.
  • In the present embodiment, the response generation unit 13 generates a first response including "Display the search results for the supermarket" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the first response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the first response to the user by outputting the voice "Display the search results for the supermarket" according to the first response.
  • When the response presentation device 22 is a display device, the display device presents the first response to the user by displaying "Display the search results for the supermarket".
  • Alternatively, the response generation unit 13 may generate a first response including a control signal for searching for supermarkets.
  • In that case, a destination search unit included in the system 200 searches for supermarkets based on the first response, and the response presentation device 22 presents the supermarket search results to the user.
  • the response generation unit 13 notifies the dialogue control unit 14 that the generation of the first response is completed.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section.
  • In the present embodiment, "convenience store" is included as a recognition vocabulary in the speech recognition result of the second voice. Further, the voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice input from the voice recognition unit 12, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generation of the second response.
  • Note that step S110 is performed after the generation of the first response is completed. That is, the dialogue control unit 14 controls the process of step S110 to be executed based on the notification that the generation of the first response is completed.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • In the present embodiment, the response generation unit 13 generates a second response including "Display the search results for the convenience store" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the second response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Display the search results for the convenience store" according to the second response.
  • When the response presentation device 22 is a display device, the display device presents the second response to the user by displaying "Display the search results for the convenience store" according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for searching for convenience stores.
  • In that case, the destination search unit included in the system 200 searches for convenience stores based on the second response, and the response presentation device 22 presents the convenience store search results to the user.
  • The voice stored in the voice storage unit 15 is not limited to the second voice.
  • The voice storage unit 15 may also store the first voice. That is, the voice dialogue control device 101 may once store the first voice of the first voice section detected by the voice section detection unit 11 in the voice storage unit 15, read it out after a predetermined time elapses, and cause the voice recognition unit 12 to perform speech recognition on it.
  • the voice dialogue control device 101 further includes the voice storage unit 15 that stores the second voice in the second voice section detected by the voice section detection unit 11.
  • The dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15, based on the notification indicating that the voice recognition unit 12 has finished the voice recognition of the first voice, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice.
  • With such a configuration, the voice interaction control apparatus 101 can acquire the second voice even during processing of the first voice, for example, during voice recognition or response generation. That is, the voice interaction control apparatus 101 can generate an appropriate response to each of a plurality of voices uttered by the user at arbitrary timings.
  • Further, the dialogue control unit 14 of the voice dialogue control device 101 causes the response generation unit 13 to generate the second response corresponding to the speech recognition result of the second voice in the second voice section, based on the notification indicating that the generation of the first response is completed in the response generation unit 13.
  • With such a configuration, the voice interaction control apparatus 101 can sequentially present to the user both the first response to the first voice and the second response to the second voice. For example, if, immediately after the system inputs the first voice "I want to go to the supermarket" and starts processing, the user utters the second voice "I want to go to the convenience store after all", it is conceivable that a conventional system could present only the response showing the supermarket search results because it cannot recognize the second voice. However, the voice interaction control apparatus 101 according to the present embodiment can input both the first voice and the second voice, and can present both the supermarket search results and the convenience store search results.
  • Embodiment 3. A voice dialogue control apparatus and a voice dialogue control method according to the third embodiment will be described.
  • FIG. 10 is a block diagram showing configurations of the voice dialogue control device 102 and the system 200 in the third embodiment.
  • The voice dialogue control apparatus 102 further includes a dialogue state determination unit 16.
  • The dialogue state determination unit 16 determines whether the speech recognition result of the second voice recognized by the voice recognition unit 12 is one that updates the speech recognition result of the first voice.
  • Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice midway and causes the response generation unit 13 to generate the second response.
  • Each function of the dialogue state determination unit 16 and the dialogue control unit 14 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the dialogue state determination unit 16 and the dialogue control unit 14 having the respective functions described above.
  • FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control apparatus 102 and the voice interaction control method according to the third embodiment.
  • FIG. 12 is a flowchart showing an example of the operation of the voice interaction control apparatus 102 and the voice interaction control method according to the third embodiment. In the following description, the description of the operation of the voice storage unit 15 is omitted, but the operation is the same as that of the second embodiment.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "I want to go to a convenience store" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In the present embodiment, the voice recognition unit 12 performs speech recognition with reference to the dictionary database.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • In the present embodiment, "convenience store" is included as a recognition vocabulary in the speech recognition result of the first voice. Further, the voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generating the first response.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice.
  • In the present embodiment, "I want to go to a restaurant after all" uttered by the user is input as the second voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11.
  • In the present embodiment, the voice recognition unit 12 performs speech recognition with reference to the dictionary database stored in the dictionary database storage device 23.
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section.
  • In the present embodiment, "restaurant" is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S102, the dialogue state determination unit 16 determines whether the speech recognition result of the second voice is one that updates the speech recognition result of the first voice, and outputs the determination result to the dialogue control unit 14. In the present embodiment, it is determined whether the speech recognition result of the second voice including "restaurant" is one that updates the speech recognition result of the first voice including "convenience store". If it is determined that the update is not to be performed, step S104 is executed. If it is determined that the update is to be performed, step S106 is executed. In the present embodiment, the dialogue state determination unit 16 determines that the speech recognition result of the second voice including "restaurant" updates the speech recognition result of the first voice including "convenience store".
  • For example, the dialogue state determination unit 16 may determine the necessity of updating based on the parallel relation between the vocabularies "convenience store" and "restaurant", or based on other vocabulary included in the second voice, for example, the adversative conjunction "after all".
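  • As an illustration, the two cues mentioned above, a parallel relation between destination vocabularies and an adversative conjunction, could be checked as follows. The vocabulary set and marker list are assumptions of this sketch, not data from the patent:

```python
# Hypothetical update check for the dialogue state determination unit 16.
DESTINATION_VOCABULARY = {"supermarket", "convenience store", "restaurant"}
ADVERSATIVE_MARKERS = ("after all",)  # cues that the user is correcting

def updates_first_voice(first_result: str, second_result: str) -> bool:
    first_dest = {v for v in DESTINATION_VOCABULARY if v in first_result}
    second_dest = {v for v in DESTINATION_VOCABULARY if v in second_result}
    # Parallel relation: both utterances name the same kind of item
    # (here, a destination), so the later one likely replaces the earlier.
    parallel = bool(first_dest) and bool(second_dest)
    # An adversative conjunction in the second voice signals a correction.
    adversative = any(m in second_result for m in ADVERSATIVE_MARKERS)
    return parallel or adversative
```

  • For the utterances of this embodiment, updates_first_voice("I want to go to a convenience store", "I want to go to a restaurant after all") returns True, so step S106 would be taken.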
  • In step S104, the response generation unit 13 completes the generation of the first response under the control of the dialogue control unit 14 based on the determination result, and the response presentation device 22 presents the first response to the user.
  • In this case, the same response presentation as in step S80 of the second embodiment is performed.
  • Then, a response to the second voice is presented by the response presentation device 22 after step S110.
  • In step S106, based on the determination result, the dialogue control unit 14 terminates the processing for the first voice midway.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • In the present embodiment, the response generation unit 13 generates a second response including "Display the search results for the restaurant" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the second response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Display the search results for the restaurant" according to the second response.
  • When the response presentation device 22 is a display device, the display device presents the second response to the user by displaying "Display the search results for the restaurant" according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for searching for restaurants.
  • In that case, the destination search unit included in the system 200 starts a restaurant search based on the second response, and the response presentation device 22 displays the restaurant search results.
  • In this way, when the second voice updates the first voice, the dialogue control unit 14 cancels the processing for the first voice midway and controls so that only the second response corresponding to the second voice is generated. Thereby, only the second response is presented by the response presentation device 22.
  • As described above, the voice interaction control device 102 further includes the dialogue state determination unit 16 that determines whether the speech recognition result of the second voice in the second voice section recognized by the voice recognition unit 12 is one that updates the speech recognition result of the first voice. Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice midway and causes the response generation unit 13 to generate the second response.
  • With such a configuration, the voice interaction control device 102 terminates the processing for the first voice midway and responds to the second voice, so the user's operability can be enhanced.
  • For example, if, immediately after the system inputs the first voice "I want to go to a convenience store" and starts processing, the user utters the second voice "I want to go to a restaurant after all", it is conceivable that a conventional system could present only the response showing the convenience store search results because it cannot recognize the second voice.
  • In contrast, the speech dialogue control device 102 in the third embodiment searches for a restaurant in accordance with the user's true intention, that is, the second voice, and can present the result earlier than the voice interaction control device 101 according to the second embodiment.
  • Embodiment 4. A voice dialogue control apparatus and a voice dialogue control method according to the fourth embodiment will be described.
  • FIG. 13 is a block diagram showing the configurations of the voice interaction control device 103 and the system 200 in the fourth embodiment.
  • In the fourth embodiment, the dictionary database storage device 23 of the system 200 stores a plurality of dictionary databases.
  • In the present embodiment, the dictionary database storage device 23 stores a first dictionary database 24 and a second dictionary database 25.
  • the first dictionary database 24 is a dictionary database prepared corresponding to the standby state of the system 200.
  • the standby state is, for example, a state in which the voice input device 21 of the system 200 can receive an operation by the user, that is, a state in which the input of the first voice is awaited.
  • In the standby state, the display device, which is another user interface included in the system 200, displays, for example, a menu screen.
  • the second dictionary database 25 is a dictionary database that corresponds to the state after the system 200 has recognized the first speech, and is associated with a specific vocabulary included in the speech recognition result of the first speech.
  • the speech recognition unit 12 performs speech recognition with reference to one dictionary database corresponding to the state of the system 200 among a plurality of dictionary databases.
  • For example, when the system 200 is in the standby state, the voice recognition unit 12 recognizes the first voice by referring to the first dictionary database 24 as the one dictionary database corresponding to the standby state. Alternatively, when the system 200 is in the standby state, the voice recognition unit 12 may perform speech recognition by selecting, from among all the dictionary databases, the first dictionary database 24 as the one dictionary database corresponding to the standby state.
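  • One way to realize this state-dependent selection is sketched below; the data shapes and the "standby" label are assumptions of the illustration:

```python
# Hypothetical dictionary selection: the first dictionary database 24 in the
# standby state, otherwise a second dictionary database 25 keyed by the
# specific vocabulary found in the first recognition result.
FIRST_DICTIONARY = {"state": "first screen", "vocabulary": ["Play", "Stop"]}
SECOND_DICTIONARIES = {
    # specific vocabulary -> dictionary used for the following utterance
    "Play": {"main_state": "Play", "vocabulary": ["Music", "Video"]},
}

def select_dictionary(system_state: str, first_result: str | None) -> dict:
    if system_state == "standby" or first_result is None:
        return FIRST_DICTIONARY
    for specific_vocabulary, dictionary in SECOND_DICTIONARIES.items():
        if specific_vocabulary in first_result:
            return dictionary
    return FIRST_DICTIONARY  # fall back when no specific vocabulary matched
```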
  • FIG. 14 is a diagram showing an example of a configuration of the first dictionary database 24 in the fourth embodiment.
  • As shown in FIG. 14, the first dictionary database 24 includes the state of the system 200 and the recognition vocabulary.
  • The "first screen" in FIG. 14 is a standby screen such as a menu screen.
  • When the system 200 is in a state after speech recognition of the first voice and a specific vocabulary is included in the speech recognition result of the first voice, the voice recognition unit 12 recognizes the second voice by referring, as the one dictionary database corresponding to that state, to the second dictionary database 25 associated with that specific vocabulary.
  • For example, the voice recognition unit 12 or the dialogue control unit 14 determines whether the specific vocabulary is included in the voice recognition result of the first voice after voice recognition of the first voice, and when it is determined that the specific vocabulary is included, the dictionary database referred to is switched.
  • In other words, the voice recognition unit 12 has a function of switching the dictionary database used for speech recognition according to the state of the system 200.
  • FIG. 15 is a diagram showing an example of a configuration of the second dictionary database 25 in the fourth embodiment.
  • the second dictionary database 25 includes the main state of the system 200, the related state of the system 200, and the recognition vocabulary.
  • the response generation unit 13 generates a response corresponding to the speech recognition result of speech and information of one dictionary database referred to for speech recognition of the speech. For example, the response generation unit 13 generates a first response corresponding to the speech recognition result of the first speech and the information of the first dictionary database 24 referred to for speech recognition of the first speech. Alternatively, for example, the response generation unit 13 generates a second response corresponding to the speech recognition result of the second speech and the information of the second dictionary database 25 referred to for speech recognition of the second speech.
  • FIG. 16 is a diagram showing an example of a configuration of a system response database in the fourth embodiment.
  • The system response database is composed of recognition vocabularies contained in speech recognition results, information of the dictionary database referred to for the speech recognition, and responses corresponding to these.
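  • A minimal sketch of such a database, keyed by both the recognition vocabulary and the dictionary referred to, so that the same vocabulary can yield different responses in different dialogue states (the entries here are illustrative, not taken from FIG. 16):

```python
# Hypothetical system response database in the spirit of FIG. 16.
SYSTEM_RESPONSE_DB_EMB4: dict[tuple[str, str], str] = {
    ("Play", "first dictionary database"): "Which media do you want to play?",
    ("Music", "second dictionary database"): "Playing music.",
    ("Video", "second dictionary database"): "Playing video.",
}

def lookup_response(vocabulary: str, dictionary_name: str) -> str | None:
    """Look up the response by vocabulary and by which dictionary was used."""
    return SYSTEM_RESPONSE_DB_EMB4.get((vocabulary, dictionary_name))
```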
  • Each function of the voice recognition unit 12 and the response generation unit 13 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice recognition unit 12 and the response generation unit 13 having the respective functions described above.
  • The program stored in the memory 52 describes functions and operations of performing speech recognition on voice by referring to one of a plurality of dictionary databases, and of generating a response corresponding to the speech recognition result of the voice and the information of the one dictionary database referred to for the speech recognition of the voice.
  • The program also describes functions and operations of recognizing the first voice by referring to the first dictionary database 24 prepared corresponding to the standby state of the system 200, and of recognizing the second voice by referring to the second dictionary database 25 associated with the specific vocabulary included in the speech recognition result of the first voice.
  • Further, the program describes a function and operation of generating a second response corresponding to the speech recognition result of the second voice and the information of the second dictionary database 25.
  • FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control apparatus 103 and the voice interaction control method according to the fourth embodiment.
  • FIG. 18 is a flow chart showing an example of the operation of the voice interaction control apparatus 103 and the voice interaction control method according to the fourth embodiment.
  • In the following description, the description of the operation of the voice storage unit 15 is omitted, but the operation is the same as that of the second embodiment.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "Play" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • Next, the voice recognition unit 12 selects the first dictionary database 24 corresponding to the standby state of the system 200.
  • That is, the voice recognition unit 12 acquires information indicating that the system 200 is in the standby state, and selects the first dictionary database 24 shown in FIG. 14 from among the plurality of dictionary databases based on that information.
  • For example, the information indicating the standby state acquired by the voice recognition unit 12 is information that the first screen is being displayed.
  • In step S24, the speech recognition unit 12 refers to the first dictionary database 24 and starts speech recognition of the first voice after the beginning of the first voice section detected by the voice section detection unit 11.
  • Alternatively, based on the information that the system 200 is in the standby state, the speech recognition unit 12 may recognize the first voice with reference not only to the first dictionary database 24 corresponding to the standby state but to all of the dictionary databases.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S40, based on the notification of the end detection, the speech recognition unit 12 ends the speech recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11.
  • Here, the speech recognition result of the first voice includes “reproduction” as a recognition vocabulary.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice.
  • Here, “music” uttered by the user is input as the second voice.
  • The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • Next, the speech recognition unit 12 selects the second dictionary database 25, which corresponds to the state after the system 200 has recognized the first voice and in which a specific vocabulary is included in the speech recognition result of the first voice.
  • That is, the speech recognition unit 12 determines whether the speech recognition result of the first voice includes a specific vocabulary and, when it determines that one is included, selects from among the plurality of dictionary databases the second dictionary database 25 associated with that vocabulary.
  • In this example, the speech recognition unit 12 determines whether the speech recognition result of the first voice includes the specific vocabulary “reproduction”, determines that it is included, and recognizes the second voice with reference to the second dictionary database 25 shown in FIG. 15.
  • In step S76, the speech recognition unit 12 refers to the second dictionary database 25 and starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
  • In other words, the speech recognition unit 12 has a function of switching the dictionary database used for speech recognition from the first dictionary database 24 to the second dictionary database 25 according to the state of the system 200.
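  • A minimal sketch of this switching function is shown below, assuming illustrative vocabulary entries; the actual contents of the dictionary databases in FIG. 14 and FIG. 15 are not reproduced here.

```python
from typing import Optional, Set

# Sketch of the dictionary switching performed by the speech recognition
# unit 12: the first dictionary database 24 is referenced in the standby
# state, and the second dictionary database 25 associated with a specific
# vocabulary is referenced after that vocabulary appears in the speech
# recognition result of the first voice. Vocabulary entries are assumptions.
FIRST_DICTIONARY_DB: Set[str] = {"reproduction"}             # standby state
SECOND_DICTIONARY_DBS = {
    "reproduction": {"music", "video"},                      # after "reproduction"
}

def select_dictionary(system_state: str, first_result: Optional[str]) -> Set[str]:
    """Select the dictionary database according to the state of the system 200."""
    if system_state == "standby":
        return FIRST_DICTIONARY_DB
    if first_result in SECOND_DICTIONARY_DBS:
        return SECOND_DICTIONARY_DBS[first_result]
    return FIRST_DICTIONARY_DB                               # fall back to standby

print(select_dictionary("standby", None))                    # first utterance
print(select_dictionary("after_first_voice", "reproduction"))  # second utterance
```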
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section.
  • The speech recognition result of the second voice and the information of the second dictionary database 25 referred to for the speech recognition of the second voice are output to the response generation unit 13.
  • Note that “music” is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The speech recognition unit 12 notifies the dialogue control unit 14 of the end of the speech recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • Specifically, the response generation unit 13 refers to the system response database shown in FIG. 16 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • Since the recognition vocabulary is “music” and the dictionary database information is “second dictionary database”, the response generation unit 13 generates a second response including “Play music” as information for voice output.
  • The dialogue control unit 14 controls the response presentation device 22 so that the second response is presented to the user.
  • For example, the speaker included in the response presentation device 22 presents the second response to the user by outputting the voice “Play music” according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for causing the music reproduction device included in the response presentation device 22 to reproduce music, and the music reproduction device may reproduce music based on that second response.
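  • A response can thus carry information for voice output, a control signal, or both. The following is a minimal sketch of such a response object; the field names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    """Sketch of a response generated by the response generation unit 13.

    A response may carry text for voice output, a control signal for a
    device such as a music reproduction device, or both. Field names and
    signal strings are assumptions for illustration.
    """
    voice_output: Optional[str] = None     # e.g. "Play music"
    control_signal: Optional[str] = None   # e.g. "music_player.play"

second_response = Response(voice_output="Play music",
                           control_signal="music_player.play")
print(second_response)
```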
  • Although detailed description is omitted, the speech recognition result of the first voice and the information of the first dictionary database 24 referred to for the speech recognition of the first voice are likewise output to the response generation unit 13. Since the recognition vocabulary is “reproduction” and the dictionary database information is “first dictionary database”, the response generation unit 13 generates a first response including “What to play?” as information for voice output or display output, and the response presentation device 22 presents the first response to the user.
  • As described above, the speech recognition unit 12 of the voice interaction control apparatus 103 recognizes a voice by referring to one of a plurality of dictionary databases selected according to the state of the system.
  • The response generation unit 13 generates a response corresponding to the speech recognition result of the voice and the information of the dictionary database referred to for its speech recognition.
  • With this configuration, the voice interaction control apparatus 103 can switch the dictionary database referenced in speech recognition according to the state of the system 200, that is, according to the interaction state, and can thereby generate an accurate response to the user's utterance.
  • Further, the speech recognition unit 12 of the voice interaction control apparatus 103 recognizes the first voice by referring to the first dictionary database 24, prepared corresponding to the standby state of the system 200, among the plurality of dictionary databases.
  • It then recognizes the second voice by referring to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • The response generation unit 13 generates a second response corresponding to the speech recognition result of the second voice and the information of the second dictionary database referred to for the speech recognition of the second voice.
  • With this configuration, the voice interaction control apparatus 103 can generate a response reflecting the contents of both the first voice and the second voice, and can thus generate an accurate response to the user's utterances.
  • For example, when the user utters “music” immediately after “reproduction”, a conventional system cannot recognize the second voice, and it is conceivable that it would merely present a response asking the user what to play.
  • In contrast, since the voice interaction control apparatus 103 in the present embodiment recognizes the second voice by referring to the second dictionary database related to the speech recognition result of the first voice, the music can be played back in accordance with the user's intention.
  • Embodiment 5. A voice dialogue control apparatus and a voice dialogue control method according to the fifth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
  • FIG. 19 is a block diagram showing configurations of the voice dialogue control device 104 and the system 200 in the fifth embodiment.
  • In the fifth embodiment, the response generation unit 13 further includes a confirmation response generation unit 17 that generates a confirmation response for causing the user to select one response from among a plurality of responses generated corresponding to a speech recognition result.
  • The dialogue control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user according to the confirmation response, and causes the system 200 to present it to the user.
  • Each function of the confirmation response generation unit 17 and the response generation unit 13 described above is realized by, for example, the processing circuit shown in FIG. 2 or 3.
  • The program stored in the memory 52 describes functions and operations for generating a confirmation response for causing the user to select one response from among the plurality of responses generated corresponding to a speech recognition result.
  • The program also describes functions and operations for causing the system 200 to present the confirmation response to the user, generating the one response corresponding to the voice input by the user according to the confirmation response, and causing the system 200 to present it to the user.
  • FIG. 20 is a flow chart showing an example of the operation of the speech dialog control device 104 and the speech dialog control method according to the fifth embodiment.
  • Steps S10 to S110 are the same as in the fourth embodiment, and therefore their description is omitted.
  • In step S112, the response generation unit 13 determines whether a plurality of second responses corresponding to the speech recognition result of the second voice can be generated. For example, if the system 200 includes both a portable device for playing music and a CD (Compact Disc) player, the response generation unit 13 can generate a second response including a control signal for playing the music stored in the portable device and a second response including a control signal for playing the music stored on the CD. If it is determined that a plurality of second responses are not to be generated, step S120 is performed; in this case, the processes after step S120 are the same as in the fourth embodiment. If it is determined that a plurality of second responses are to be generated, step S122 is performed.
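  • A minimal sketch of the determination in step S112 follows, assuming a hypothetical response structure and the portable device / CD player example above:

```python
# Sketch of step S112: collect one candidate second response per playback
# source available in the system. The source names follow the portable
# device / CD player example; the response structure is an assumption.
def candidate_second_responses(recognized: str, sources: list) -> list:
    """Return one candidate response per source able to handle the request."""
    if recognized != "music":
        return []
    return [{"control_signal": f"{source}.play"} for source in sources]

candidates = candidate_second_responses("music", ["portable_device", "cd_player"])
needs_confirmation = len(candidates) > 1   # step S112: plural responses possible
print(needs_confirmation)                  # True -> proceed to step S122
```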
  • In step S122, the confirmation response generation unit 17 generates a confirmation response for causing the user to select one second response from among the plurality of second responses generated corresponding to the speech recognition result of the second voice.
  • For example, the confirmation response generation unit 17 generates a confirmation response including “Do you want to play music on the portable device or music on the CD?” as information for voice output or display output.
  • In step S124, the dialogue control unit 14 causes the response presentation device 22 to present the confirmation response to the user.
  • The response presentation device 22 presents “Do you want to play music on the portable device or music on the CD?” to the user, and the user re-inputs a voice for operating the system according to the confirmation response.
  • Thereafter, the voice interaction control apparatus 104 generates the one second response by voice recognition and response generation similar to the above steps.
  • For example, the response presentation device 22 plays the music on the portable device, thereby presenting the selected second response to the user.
  • As described above, the response generation unit 13 of the voice interaction control apparatus 104 further includes the confirmation response generation unit 17, which generates a confirmation response for causing the user to select one response from among a plurality of responses generated corresponding to a speech recognition result.
  • The dialogue control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user according to the confirmation response, and causes the system 200 to present it to the user.
  • With this configuration, the voice interaction control apparatus 104 can ask the user for confirmation when there is ambiguity in the interaction between the user and the system.
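  • The following sketch illustrates this confirmation flow under the portable device / CD example; the prompt wording follows the example above, while the candidate keys and the matching logic are assumptions:

```python
# Sketch of the fifth embodiment's confirmation flow: when several
# responses are possible, the confirmation response generation unit 17
# builds a question, and the user's re-input selects the one response
# that is actually presented.
def generate_confirmation(candidates: dict) -> str:
    options = " or ".join(f"music on the {name}" for name in candidates)
    return f"Do you want to play {options}?"

def select_response(candidates: dict, user_reply: str) -> str:
    """Pick the single response matching the user's re-input."""
    for name, response in candidates.items():
        if name in user_reply:
            return response
    return generate_confirmation(candidates)   # ask again if still ambiguous

candidates = {"portable device": "portable_device.play", "CD": "cd_player.play"}
print(generate_confirmation(candidates))
# -> "Do you want to play music on the portable device or music on the CD?"
print(select_response(candidates, "portable device"))  # -> "portable_device.play"
```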
  • Embodiment 6. A voice dialogue control apparatus and a voice dialogue control method according to the sixth embodiment will be described.
  • The configurations of the voice interaction control device 104 and the system 200 in the sixth embodiment are the same as in the fourth embodiment.
  • The dialogue control unit 14 in the present embodiment determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value.
  • When the elapsed time is equal to or greater than the specific value, the dialogue control unit 14 causes the second voice to be recognized by referring to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases.
  • When the elapsed time is less than the specific value, the dialogue control unit 14 causes the second voice to be recognized by referring to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • In other words, the dialogue control unit 14 determines the relevance between the first voice and the second voice based on whether the elapsed time is equal to or greater than the threshold, and has a response generated to be presented to the user accordingly.
  • The above-described function of the dialogue control unit 14 is realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3.
  • The program stored in the memory 52 describes functions and operations for determining whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value and, based on the determination, causing the second voice to be recognized by referring either to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases, or to the second dictionary database 25, which corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • FIG. 21 is a flow chart showing an example of the operation of the voice interaction control apparatus 104 and the voice interaction control method according to the sixth embodiment. Steps S10 to S60 in the present embodiment are the same as in the fourth embodiment, and therefore their description is omitted.
  • In step S64, the dialogue control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. If it is determined that the elapsed time is not equal to or greater than the specific value, that is, if it is determined that the utterances are related, step S74 is executed. If it is determined that the elapsed time is equal to or greater than the specific value, that is, if it is determined that there is no relevance between the utterances, step S70 is executed.
  • In steps S74 and S76, the speech recognition unit 12 starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
  • In this case, the speech recognition unit 12 recognizes the second voice by referring to the second dictionary database 25 associated with the specific vocabulary included in the speech recognition result of the first voice.
  • Each process after step S74 is the same as the corresponding process in the fourth embodiment shown in FIG. 18.
  • In step S70, the speech recognition unit 12 likewise starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11. In this case, however, the speech recognition unit 12 performs the speech recognition with reference to the first dictionary database 24 prepared corresponding to the standby state of the system 200.
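  • A minimal sketch of the determination in step S64 and the resulting choice of dictionary database follows; the threshold value is an assumption, as the specific value is not given in the text:

```python
# Sketch of step S64: if the gap between the end of the first voice
# section and the beginning of the second voice section is at or above a
# threshold, the utterances are treated as unrelated and the first
# dictionary database is used; otherwise the second dictionary database
# is used. The threshold of 3.0 seconds is an illustrative assumption.
RELEVANCE_THRESHOLD_SEC = 3.0  # the "specific value"

def choose_dictionary(first_end: float, second_start: float) -> str:
    elapsed = second_start - first_end
    if elapsed >= RELEVANCE_THRESHOLD_SEC:
        return "first dictionary database"    # no relevance: standby dictionary
    return "second dictionary database"       # related utterances

assert choose_dictionary(10.0, 11.0) == "second dictionary database"
assert choose_dictionary(10.0, 15.0) == "first dictionary database"
```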
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section.
  • The speech recognition result of the second voice and the information of the first dictionary database 24 referred to for the speech recognition of the second voice are output to the response generation unit 13. Note that “music” is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The speech recognition unit 12 notifies the dialogue control unit 14 of the end of the speech recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • Specifically, the response generation unit 13 refers to the system response database shown in FIG. 16 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • Since the recognition vocabulary is “music” and the dictionary database information is “first dictionary database”, the response generation unit 13 generates a second response including “Display music screen” as information for voice output.
  • The dialogue control unit 14 controls the response presentation device 22 so that the second response is presented to the user.
  • For example, the speaker included in the response presentation device 22 presents the second response to the user by outputting the voice “Display music screen” according to the second response.
  • Alternatively, the display device included in the response presentation device 22 may display the music screen based on the second response.
  • As described above, the dialogue control unit 14 of the voice interaction control device 104 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. Based on the determination, it causes the second voice to be recognized by referring either to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases, or to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • With this configuration, the voice interaction control device 104 can generate an accurate response to the user's utterances by generating a response that takes into account the timing of the user's utterances in addition to the speech recognition results.
  • Embodiment 7. FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device 105 mounted on a vehicle 30.
  • The voice interaction control device 105 is any one of the voice interaction control devices 100 to 104 described in the first to sixth embodiments.
  • The system 200 includes, for example, an on-vehicle device (not shown) such as a navigation device, an audio device, or a PND (Portable Navigation Device).
  • The voice input device (not shown) of the on-vehicle device inputs the voice uttered by the user, the voice interaction control device 105 generates a response corresponding to the voice, and the response presentation device (not shown) of the on-vehicle device presents the response to the user.
  • FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device 105 provided in the server 40.
  • Voice input from a voice input device (not shown) of the communication terminal 32 is received by the communication device 41 of the server 40 via the network and processed by the voice interaction control device 105.
  • The voice interaction control device 105 generates a response corresponding to the voice.
  • The generated response is transmitted from the communication device 41 via the network and presented to the user from the response presentation device (not shown) of the on-vehicle device 31.
  • Alternatively, the response presentation device may be included in the communication terminal 32.
  • The communication terminal 32 is, for example, a mobile phone, a smartphone, or a tablet.
  • Each component of the voice interaction control device 105 may be distributed among the devices constituting the system 200. In that case, each function is realized by the components communicating with one another as appropriate.
  • By providing the functions of the voice interaction control device 105 in the server 40, the configuration of the vehicle 30 or the on-vehicle device 31 can be simplified.
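  • The following is a minimal sketch of the division of roles in the server arrangement of FIG. 23; all function names and bodies are hypothetical stand-ins, and the actual network transport is omitted:

```python
# Sketch of the FIG. 23 arrangement: audio captured in the vehicle is sent
# to the server 40, the voice interaction control device 105 on the server
# generates a response, and the response is returned over the network for
# presentation in the vehicle. Stub bodies below are illustrative only.
def recognize(audio: bytes) -> str:
    return "music"                        # stand-in for the recognition result

def generate_response(vocabulary: str) -> str:
    return {"music": "Play music"}.get(vocabulary, "Please repeat")

def server_handle(audio: bytes) -> str:
    """Runs on the server 40: recognition followed by response generation."""
    return generate_response(recognize(audio))

def vehicle_request(audio: bytes) -> str:
    """Runs in the vehicle: in place of a real network call, invoke directly."""
    return server_handle(audio)

print(vehicle_request(b"\x00\x01"))       # -> "Play music"
```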
  • The embodiments can be freely combined, and each embodiment can be modified or omitted as appropriate.
  • While the present invention has been described in detail, the above description is illustrative in all aspects, and the present invention is not limited thereto. It is understood that countless variations not illustrated are conceivable without departing from the scope of the present invention.

Abstract

The purpose of the present invention is to provide a voice interaction control device for controlling voice interaction so that a system can suitably respond to a second voice input after a first voice. The voice interaction control device according to the present invention causes the system to present to the user a response to a voice input by the user and comprises: a voice segment detection unit for detecting a voice segment of a series of input voices; a voice recognition unit for recognizing the voice in a voice segment; a response generation unit for generating a response corresponding to the voice recognition result; and a voice interaction control unit that controls the voice segment detection unit, the voice recognition unit, and the response generation unit. The voice interaction control unit causes the voice segment detection unit to detect a second voice segment constituting the second voice, so that a second response can be generated for a second series of voices input after the first voice even when the processing for the first voice, including the processing up to the point where the system presents the user with a first response to the first series of voices, has yet to be completed.

Description

Voice dialogue control apparatus and voice dialogue control method
 The present invention relates to a voice interaction control apparatus and a voice interaction control method for causing a system to present a response corresponding to voice input from a user when the user operates the system by interaction between the system and the user.
 A system having a voice recognition function inputs a voice uttered by a user and outputs a response corresponding to the voice. Patent Document 1 proposes a voice dialogue control method in which, when the user inputs an interrupting voice while the system is outputting voice, the voice output is continued or paused depending on the importance of the voice being output, and processing on the interrupting voice is executed.
Japanese Patent Application Laid-Open No. 2004-325848
 However, the system described in Patent Document 1 cannot capture a subsequent second voice at certain timings, for example, immediately after the end detection of a first voice, that is, immediately after the capture of the first voice ends. When the user speaks at such a timing, a discrepancy arises between the system and the user, and the system may make an inappropriate response.
 Even when the user makes a plurality of utterances following the first voice, the system needs to input those utterances appropriately, without dropping any of them, and to respond appropriately.
 The present invention has been made to solve the problems described above, and aims to provide a voice interaction control device that performs interaction control so that the system can appropriately respond to a second voice input after a first voice.
 A voice interaction control device according to the present invention performs interaction control for causing the system to present to the user a response to voice input from the user when the user operates the system by interaction between the user and the system. The device includes: a voice section detection unit that detects a voice section from the beginning to the end of an input series of voice; a voice recognition unit that recognizes the voice in a voice section; a response generation unit that generates a response corresponding to a voice recognition result and to be presented to the user from the system; and a dialogue control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit. Even if the processing for a first voice, including the processing from the detection of the first voice section forming a series of first voice until a first response corresponding to the voice recognition result of the first voice is presented to the user from the system, has not been completed, the dialogue control unit causes the voice section detection unit to detect a second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated.
 According to the present invention, it is possible to provide a voice interaction control device that performs interaction control so that the system can appropriately respond to a second voice input after a first voice.
 The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the voice interaction control device and system in the first embodiment.
FIG. 2 is a diagram showing an example of a processing circuit included in the voice interaction control device.
FIG. 3 is a diagram showing another example of a processing circuit included in the voice interaction control device.
FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
FIG. 5 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
FIG. 6 is a block diagram showing the configuration of the voice interaction control device and system in the second embodiment.
FIG. 7 is a diagram showing an example of the configuration of the system response database in the second embodiment.
FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
FIG. 9 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
FIG. 10 is a block diagram showing the configuration of the voice interaction control device and system in the third embodiment.
FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
FIG. 12 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
FIG. 13 is a block diagram showing the configuration of the voice interaction control device and system in the fourth embodiment.
FIG. 14 is a diagram showing an example of the configuration of the first dictionary database in the fourth embodiment.
FIG. 15 is a diagram showing an example of the configuration of the second dictionary database in the fourth embodiment.
FIG. 16 is a diagram showing an example of the configuration of the system response database in the fourth embodiment.
FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
FIG. 18 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
FIG. 19 is a block diagram showing the configuration of the voice interaction control device and system in the fifth embodiment.
FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fifth embodiment.
FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the sixth embodiment.
FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device mounted on a vehicle in the seventh embodiment.
FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device provided in a server in the seventh embodiment.
 In this specification, embodiments of a voice interaction control device that performs interaction control for causing the system to present to the user a response corresponding to voice input from the user will be described.
Embodiment 1
A voice dialogue control apparatus and a voice dialogue control method according to the first embodiment will be described.
(Configuration)
FIG. 1 is a block diagram showing the configuration of the voice dialogue control apparatus 100 and the system 200 in the first embodiment.
 The system 200 inputs a voice uttered by the user to operate the system 200 and presents a response to the voice to the user. The system 200 includes a voice input device 21, the voice interaction control device 100, and a response presentation device 22. The system 200 is, for example, a navigation system, an audio system, a control system that controls devices related to the driving of a vehicle, or a control system that controls a driving environment.
 The voice input device 21 is an interface for the user to operate the system 200. The voice input device 21 inputs the voice uttered by the user in order to operate the system 200 and outputs the voice to the voice dialogue control device 100. The voice input device 21 is, for example, a microphone.
 The voice interaction control device 100 receives voice from the voice input device 21 and performs interaction control for causing the system 200 to present a response corresponding to the voice to the user.
 The response presentation device 22 presents the response generated by the voice interaction control device 100 to the user. Note that "to present" includes the response presentation device 22 operating in accordance with the generated response; the response presentation device 22 may present the response to the user by operating according to the response generated by the voice interaction control device 100. For example, if the system 200 is a navigation system, the response presentation device 22 is an audio output device or a display device. The audio output device presents a response by, for example, outputting guidance information to a destination by voice; the display device presents a response by, for example, displaying guidance information to a destination together with a map. If the system 200 is an audio system, the response presentation device 22 is a music playback device, which presents a response by playing music. If the system 200 is a control system that controls devices related to the driving of a vehicle, the response presentation device 22 is a drive control device of the vehicle. If the system 200 is a control system that controls the driving environment, the response presentation device 22 is, for example, an air conditioner, a light, a mirror position adjustment device, or a seat position adjustment device.
 The voice dialogue control apparatus 100 includes a voice section detection unit 11, a speech recognition unit 12, a response generation unit 13, and a dialogue control unit 14.
 The voice section detection unit 11 detects a voice section from the beginning to the end of an input series of voice. In the present embodiment, as an example, the voice section detection unit 11 constantly monitors the input voice for detection.
 The speech recognition unit 12 recognizes the voice in the voice section detected by the voice section detection unit 11. In doing so, the speech recognition unit 12 selects recognition vocabularies based on the acoustically or linguistically most probable vocabulary for the voice in the voice section. The speech recognition unit 12 performs speech recognition, for example, with reference to a dictionary database (not shown). The dictionary database may be provided in the voice interaction control apparatus 100 or in an external server. When the dictionary database is provided in a server, the dialogue control device communicates with the server, and the speech recognition unit 12 performs speech recognition with reference to that dictionary database.
 The response generation unit 13 generates a response corresponding to the speech recognition result produced by the speech recognition unit 12. The response generation unit 13 generates the response, for example, with reference to a system response database (not shown). The system response database is, for example, a table in which the recognition vocabularies included in speech recognition results and the corresponding responses are stored in association with each other. The system response database may be provided in the voice interaction control device 100 or in an external server. When the system response database is provided in a server, the dialogue control device communicates with the server, and the response generation unit 13 generates the response with reference to that system response database. The response generation unit 13 outputs the response to the response presentation device 22.
 The dialogue control unit 14 controls the operations of the voice section detection unit 11, the speech recognition unit 12, and the response generation unit 13. The dialogue control unit 14 controls each unit while monitoring the dialogue state of the system 200. The dialogue state is the state at any point from when a voice is detected by the voice section detection unit 11 until a response corresponding to the voice is generated and presented to the user. For example, the dialogue control unit 14 controls the operation of the speech recognition unit 12 based on a notification that the voice section detection unit 11 has detected the beginning or the end of a voice section. The dialogue control unit 14 also controls the start of response generation in the response generation unit 13, or the start of speech recognition of a subsequent voice in the speech recognition unit 12, based on a notification that speech recognition in the speech recognition unit 12 has finished.
 An example of the specific functions of the dialogue control unit 14 is as follows. The dialogue control unit 14 controls the processing for a series of first voice and the processing for a second voice input after the first voice. The processing for the first voice includes the processing from the detection of the first voice section forming the first voice until the first response is presented to the user from the system 200. More specifically, the processing for the first voice includes at least the processing in which the speech recognition unit 12 recognizes the first voice and the processing in which the response generation unit 13 generates the first response corresponding to the speech recognition result of the first voice. The processing for the first voice may also include the processing from the detection of the end of the first voice section until the first response is presented by the response presentation device 22 and the beginning of the next input voice section is detected.
 Even if the processing for the first voice has not been completed, the dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming the second voice so that a second response to the second voice can be generated. Furthermore, in the present embodiment, the dialogue control unit 14 causes the speech recognition unit 12 to recognize the second voice in the second voice section, causes the response generation unit 13 to generate the second response corresponding to the speech recognition result of the second voice, and causes the system 200 to present it to the user.
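A minimal sketch of this behavior follows, with invented timings and stub processing: voice section detection keeps running in its own flow of control, so a second utterance is captured and queued even while the response to the first utterance is still being generated.

```python
import queue
import threading
import time

detected = queue.Queue()

def voice_section_detector(utterances):
    # Stand-in for the voice section detection unit 11: detection keeps
    # running regardless of how far response processing has progressed.
    for delay, text in utterances:
        time.sleep(delay)            # arrival time of the voice section
        detected.put(text)

def process_utterances():
    # Stand-in for speech recognition and response generation.
    while True:
        text = detected.get()
        if text is None:
            break
        time.sleep(0.5)              # processing of the earlier voice is slow...
        print(f"response to '{text}' presented")

worker = threading.Thread(target=process_utterances)
worker.start()
# "music" arrives while the response to "reproduction" is still being
# generated, yet it is captured and not lost.
voice_section_detector([(0.0, "reproduction"), (0.1, "music")])
detected.put(None)
worker.join()
```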
(Processing circuit)
FIG. 2 is a diagram showing an example of the processing circuit 50 included in the voice interaction control device 100. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized by the processing circuit 50. That is, the processing circuit 50 includes the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14.
 When the processing circuit 50 is dedicated hardware, the processing circuit 50 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 may be realized individually by a plurality of processing circuits or collectively by one processing circuit.
 FIG. 3 is a diagram showing another example of the processing circuit included in the voice interaction control device 100. The processing circuit includes a processor 51 and a memory 52. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized by the processor 51 executing a program stored in the memory 52. For example, each function is realized by the processor 51 executing software or firmware described as a program. That is, the voice dialogue control device 100 includes the memory 52 for storing the program and the processor 51 for executing the program.
 The program describes functions and operations by which the voice interaction control apparatus 100 detects a voice section from the beginning to the end of an input series of voice, recognizes the voice in the detected voice section, generates a response corresponding to the speech recognition result, and controls each of the voice section detection, the speech recognition, and the response generation. The program also describes functions and operations by which, when executing each control, the voice interaction control apparatus 100 has the second voice section forming a series of second voice input after the first voice detected even if the processing for the first voice has not been completed. Furthermore, the program describes functions and operations for having the second voice in the second voice section recognized, having a second response corresponding to the speech recognition result of the second voice generated, and having the system 200 present it to the user. The above program causes a computer to execute the procedures or methods of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above.
 The processor 51 is, for example, a central processing unit, a processing unit, an arithmetic unit, a microprocessor, a microcomputer, or a DSP (Digital Signal Processor). The memory 52 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory). Alternatively, the memory 52 may be any storage medium to be used in the future, such as a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD.
 Some of the functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above may be realized by dedicated hardware and the others by software or firmware. In this way, the processing circuit realizes each of the above functions by hardware, software, firmware, or a combination thereof.
(Operation)
Next, the operation of the voice interaction control apparatus 100 and the voice interaction control method will be described. FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment. FIG. 5 is a flowchart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
 Although not shown in the flowchart of FIG. 5, the dialogue control unit 14 first places the voice section detection unit 11 in a standby state in which voice can be received and the speech recognition unit 12 in a standby state in which speech recognition is possible. This control is performed, for example, by a user operation instructing the system 200 to start accepting voice section detection. Alternatively, after startup of the system 200, the dialogue control unit 14 may automatically place the voice section detection unit 11 in the standby state in which voice can be received. From this point on, the voice section detection unit 11 constantly monitors the input of voice, that is, it is in a detectable state.
 In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section. The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S20, based on the notification of the beginning detection, the speech recognition unit 12 starts speech recognition of the first voice after the beginning of the first voice section detected by the voice section detection unit 11.
 In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S40, based on the notification of the end detection, the speech recognition unit 12 ends the speech recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The speech recognition unit 12 outputs the speech recognition result of the first voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end.
 In step S50, the response generation unit 13 starts generation of the first response corresponding to the speech recognition result of the first voice based on the control from the dialogue control unit 14.
 In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice. The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14. Note that this step S60 and the following step S70 are executed in parallel with the generation of the first response in the response generation unit 13.
 In step S70, based on the notification of the beginning detection, the speech recognition unit 12 starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
 In step S80, the response generation unit 13 completes the generation of the first response. The dialogue control unit 14 causes the system 200 to present the first response to the user. That is, the response presentation device 22 presents the first response to the user.
 In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section detected by the voice section detection unit 11. The speech recognition unit 12 outputs the speech recognition result of the second voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end.
 In step S110, the response generation unit 13 starts generation of the second response corresponding to the speech recognition result of the second voice input from the speech recognition unit 12, based on the control from the dialogue control unit 14.
 In step S120, the response generation unit 13 completes the generation of the second response. The dialogue control unit 14 causes the system 200 to present the second response to the user. That is, the response presentation device 22 presents the second response to the user.
(Effects)
Summarizing the above, the voice interaction control device 100 according to the first embodiment is a voice interaction control device that performs interaction control for causing the system 200 to present to the user a response to voice input from the user when the user operates the system 200 by interaction between the user and the system 200. It includes: the voice section detection unit 11, which detects a voice section from the beginning to the end of an input series of voice; the speech recognition unit 12, which recognizes the voice in a voice section; the response generation unit 13, which generates a response corresponding to the speech recognition result of the voice and to be presented to the user from the system 200; and the dialogue control unit 14, which controls the voice section detection unit 11, the speech recognition unit 12, and the response generation unit 13. Even if the processing for the first voice, including the processing from the detection of the first voice section forming a series of first voice input as speech until the first response corresponding to the speech recognition result of the first voice is presented to the user from the system 200, has not been completed, the dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated.
 With the above configuration, the voice interaction control device 100 can perform interaction control so that the system can appropriately respond to the second voice input after the first voice. The voice interaction control device 100 can generate a response, without omission, even to a second voice input immediately after the end of the first voice section. Furthermore, as shown as an example in the present embodiment, since the voice interaction control device 100 constantly inputs voice and performs voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
 また、実施の形態1における音声対話制御方法は、ユーザとシステム200との対話によりユーザがシステム200に対し操作を行うに際し、ユーザから入力される音声に対する応答をシステム200からユーザに提示させるための対話制御を行う音声対話制御方法であって、入力される一続きの音声をなす始端から終端までの音声区間を検出し、音声区間内の音声を音声認識し、音声の音声認識結果に対応する応答であって、システム200からユーザに提示させるべき応答を生成し、音声区間の検出、音声の音声認識、および、応答の生成の各々の制御を実行する。音声対話制御方法は、その各々の制御を実行する際、音声として入力される一続きの第1音声をなす第1音声区間が検出されてから第1音声の音声認識結果に対応する第1応答がシステムからユーザに提示されるまでの処理を含む第1音声に対する処理が終了していなくても、第1音声の後に音声として入力される一続きの第2音声に対する第2応答を生成可能とするために第2音声をなす第2音声区間を検出させる。 In the voice interaction control method according to the first embodiment, when the user operates the system 200 by interaction between the user and the system 200, the system 200 presents a response to the voice input from the user to the user. A speech dialogue control method for dialogue control, comprising detecting a speech section from the beginning to the end forming the input series of speech, speech recognizing speech in the speech section, and corresponding to speech recognition result of speech A response, which generates a response to be presented to the user from the system 200, and performs control of each of speech segment detection, speech recognition of the speech, and generation of the response. In the voice interaction control method, when performing each control, a first response corresponding to a voice recognition result of the first voice after a first voice section forming a series of first voice inputted as voice is detected It is possible to generate a second response to a series of second voices input as voice after the first voice, even if processing for the first voice including processing until the system is presented to the user is not finished In order to do this, the second voice section that makes the second voice is detected.
 このような構成を含む音声対話制御方法によれば、第1音声の後に入力される第2音声に対しシステムが適切に応答できるよう対話制御することができる。この音声対話制御方法によれば、第1音声区間の終端直後に入力される第2音声に対しても漏れなく応答を生成することが可能である。また、この音声対話制御方法によれば、常時、音声を入力して音声区間検出を行うため、ユーザが発話する音声の取得ができない時間がなくすことができる。 According to the voice interaction control method including such configuration, it is possible to perform interaction control so that the system can appropriately respond to the second voice input after the first voice. According to this voice dialogue control method, it is possible to generate a response without omission to the second voice input immediately after the end of the first voice section. Moreover, according to this voice dialogue control method, since voice is always input to perform voice section detection, it is possible to eliminate a time when the user can not obtain a voice to be uttered.
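As a concrete illustration of this control flow, the following is a minimal sketch in which detection runs continuously and queues voice sections, so a second utterance is captured even while the first is still being processed. The functions detect_voice_sections, recognize, and generate_response are hypothetical stand-ins for the voice section detection unit 11, the voice recognition unit 12, and the response generation unit 13; the patent does not specify an implementation.

```python
import queue
import threading

# Hypothetical stand-ins for the units described above; a real system would
# wrap an actual voice activity detector, recognizer, and response database.
def detect_voice_sections(audio_stream):
    for utterance in audio_stream:   # each item plays the role of one voice section
        yield utterance

def recognize(section):
    return section                   # identity "recognition", for illustration only

def generate_response(result):
    return f"Response to: {result}"

voice_sections = queue.Queue()

def detection_loop(audio_stream):
    # Detection never blocks on recognition or response generation, so a
    # second utterance arriving immediately after the first section ends
    # is still captured rather than dropped.
    for section in detect_voice_sections(audio_stream):
        voice_sections.put(section)

def processing_loop():
    while True:
        section = voice_sections.get()
        print(generate_response(recognize(section)))
        voice_sections.task_done()

threading.Thread(target=processing_loop, daemon=True).start()
detection_loop(["I want to go to the supermarket.",
                "Actually, I want to go to a convenience store."])
voice_sections.join()                # wait until both responses are generated
```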
Second Embodiment

A voice interaction control device and a voice interaction control method according to the second embodiment will be described.
(Configuration)

FIG. 6 is a block diagram showing the configurations of the voice interaction control device 101 and the system 200 according to the second embodiment. The system 200 includes a dictionary database storage device 23 in addition to the configuration shown in the first embodiment.

The voice recognition unit 12 of the voice interaction control device 101 performs voice recognition with reference to the dictionary database stored in the dictionary database storage device 23. The voice interaction control device 101 also includes a voice storage unit 15 in addition to the configuration shown in the first embodiment.
The voice storage unit 15 stores the voice in the voice section detected by the voice section detection unit 11. An example in which the voice storage unit 15 stores the second voice in the second voice section is described below; however, the present invention is not limited to this, and the voice storage unit 15 may also store the first voice of the first voice section.

Based on a notification indicating that the voice recognition unit 12 has finished recognizing the first voice, the dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15 and causes the response generation unit 13 to generate a second response corresponding to the voice recognition result of the second voice. Furthermore, based on a notification indicating that the response generation unit 13 has completed generation of the first response, the dialogue control unit 14 causes the response generation unit 13 to generate the second response.
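The following is a minimal sketch of this notification-driven sequencing. The callback names and the voice_store interface are assumptions for illustration; the patent describes the notifications but not their concrete form.

```python
class DialogueController:
    """Illustrative sequencing only; callback names and the voice_store
    interface are assumptions, not the patent's actual interfaces."""

    def __init__(self, recognize, generate_response, voice_store):
        self.recognize = recognize
        self.generate_response = generate_response
        self.voice_store = voice_store
        self.second_result = None

    def on_first_recognition_finished(self):
        # Notification from the voice recognition unit 12: the recognizer is
        # free again, so feed it the second voice held in the voice storage
        # unit 15 while the first response is still being generated.
        self.second_result = self.recognize(self.voice_store.read_second_voice())

    def on_first_response_completed(self, present):
        # Notification from the response generation unit 13: the first
        # response exists, so the second response may now be generated and
        # the two are presented to the user in order.
        present(self.generate_response(self.second_result))
```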
(System response database)

In the present embodiment, the response generation unit 13 generates each response corresponding to each voice recognition result by referring to a system response database. FIG. 7 is a diagram showing an example of the configuration of the system response database according to the second embodiment. The system response database consists of recognition vocabulary contained in voice recognition results and the responses corresponding to those results. Depending on the configuration of the response presentation device 22 that presents responses to the user, a plurality of responses may be included for one result.
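By way of illustration, a system response database of the kind shown in FIG. 7 can be pictured as a simple lookup from recognition vocabulary to responses. The entries below are assumptions drawn from the examples used later in this embodiment; the actual database contents are not limited to these.

```python
# Hypothetical contents mirroring FIG. 7: recognition vocabulary mapped to
# one or more responses (e.g. one per presentation device).
SYSTEM_RESPONSE_DATABASE = {
    "supermarket":       ["Displaying supermarket search results."],
    "convenience store": ["Displaying convenience store search results."],
}

def generate_response(recognition_result: str) -> list[str]:
    # Return the responses for the first recognition vocabulary entry
    # contained in the recognition result.
    for vocabulary, responses in SYSTEM_RESPONSE_DATABASE.items():
        if vocabulary in recognition_result:
            return responses
    return ["Sorry, I did not understand."]   # fallback assumption

print(generate_response("I want to go to the supermarket."))
```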
(Processing circuit)

The functions of the voice storage unit 15 and the dialogue control unit 14 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice storage unit 15 and the dialogue control unit 14 having the functions described above.

When the functions of the voice storage unit 15 and the dialogue control unit 14 are realized by the processing circuit shown in FIG. 3, the function of the voice storage unit 15 is realized by, for example, the memory 52. The program stored in the memory 52 describes functions and operations for storing the second voice in the second voice section and, based on a notification indicating that voice recognition of the first voice has finished, causing the second voice stored in the memory 52 to be recognized and a second response corresponding to the voice recognition result of the second voice to be generated. The program further describes a function and operation for generating the second response based on a notification indicating that generation of the first response has been completed.
(Operation)

Next, the operation of the voice interaction control device 101 and the voice interaction control method will be described. FIG. 8 is a sequence chart, and FIG. 9 is a flowchart, each showing an example of the operation of the voice interaction control device 101 and the voice interaction control method according to the second embodiment.

In the first embodiment, an example was shown in which the second voice is input while the first response is being generated; in the second embodiment, an example is shown in which the second voice is input while the first voice is being recognized.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "I want to go to the supermarket." uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S20, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11. Here, the voice recognition unit 12 starts recognizing the first voice with reference to the dictionary database.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S32, the voice section detection unit 11 inputs the second voice and detects the start of the second voice section. Here, "Actually, I want to go to a convenience store." uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S34, based on the notification of start detection for the second voice section, the dialogue control unit 14 causes the voice storage unit 15 to start storing the second voice. In FIG. 8, the operations related to this notification are omitted to keep the sequence chart simple.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "supermarket" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition, and based on that notification the dialogue control unit 14 controls the following steps S50, S62, and S70 to be executed.

In step S50, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the first response corresponding to the voice recognition result of the first voice, referring to the system response database shown in FIG. 7.

In step S62, based on control from the dialogue control unit 14, the voice recognition unit 12 starts reading the second voice from the voice storage unit 15. In the present embodiment, while still storing the second voice of the second voice section, the voice storage unit 15 outputs the already stored portion of the second voice to the voice recognition unit 12 with a time lag. Steps S62 through S73 below are executed in parallel with the generation of the first response by the response generation unit 13.
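This write-while-read behaviour of the voice storage unit 15 can be sketched as a simple first-in first-out buffer. The frame format and interface below are assumptions for illustration, and a real implementation would block on an empty open buffer instead of busy-waiting.

```python
from collections import deque

class VoiceStore:
    def __init__(self):
        self.frames = deque()
        self.closed = False            # set when the end of the voice section is detected

    def write(self, frame):
        self.frames.append(frame)      # storing continues during recognition

    def read(self):
        # Yield frames that are already stored; frames written in the
        # meantime are picked up on later iterations, giving the
        # time-lagged readout described above.
        while self.frames or not self.closed:
            if self.frames:
                yield self.frames.popleft()

store = VoiceStore()
store.write("frame-1")
store.write("frame-2")
reader = store.read()
print(next(reader))        # "frame-1" is handed to the recognizer while...
store.write("frame-3")     # ...storage of the same utterance is still in progress
store.closed = True
print(list(reader))        # remaining frames: ['frame-2', 'frame-3']
```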
In step S70, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the second voice read from the voice storage unit 15, from the start of the second voice section. By starting voice recognition of the second voice based on the notification that voice recognition of the first voice has finished, the voice recognition unit 12 can thus recognize the second voice after recognizing the first voice. The voice recognition unit 12 starts recognizing the second voice with reference to the dictionary database.

In step S71, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S72, the voice storage unit 15 finishes storing the second voice.

In step S73, the reading of the second voice from the voice storage unit 15 is finished.

In step S80, the response generation unit 13 completes generation of the first response. Here, the response generation unit 13 generates a first response containing "Displaying supermarket search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the first response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the first response to the user by outputting the voice "Displaying supermarket search results." in accordance with the first response. If the response presentation device 22 is a display device, the display device presents the first response to the user by displaying "Displaying supermarket search results." in accordance with the first response. Alternatively, the response generation unit 13 may generate a first response containing a control signal for searching for supermarkets; in this case, a destination search unit (not shown) included in the system 200 searches for supermarkets based on the first response, and the response presentation device 22 presents the supermarket search results to the user. In the present embodiment, the response generation unit 13 notifies the dialogue control unit 14 that generation of the first response has been completed.
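The presentation step can be pictured as dispatching one generated response to whichever response presentation device 22 is configured. The device classes and the optional search hook below are illustrative assumptions, not part of the patent.

```python
class Speaker:
    def present(self, response):
        print(f"[voice output] {response}")

class Display:
    def present(self, response):
        print(f"[display output] {response}")

def present_response(device, response, search=None, query=None):
    if search is not None:
        search(query)          # control-signal variant: trigger the destination search
    device.present(response)   # text variant: speak or display the message

present_response(Speaker(), "Displaying supermarket search results.")
present_response(Display(), "Displaying supermarket search results.")
```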
In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice contains "convenience store" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S110, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice input from the voice recognition unit 12, referring to the system response database shown in FIG. 7. In the present embodiment, this step S110 is executed after step S90; that is, the dialogue control unit 14 controls step S110 to be executed based on the notification that generation of the first response has been completed.

In step S120, the response generation unit 13 completes generation of the second response. Here, the response generation unit 13 generates a second response containing "Displaying convenience store search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Displaying convenience store search results." in accordance with the second response. If the response presentation device 22 is a display device, the display device presents the second response by displaying "Displaying convenience store search results." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for searching for convenience stores; in this case, the destination search unit included in the system 200 searches for convenience stores based on the second response, and the response presentation device 22 presents the convenience store search results to the user.

In the operation of the voice interaction control device 101 described above, the voice stored in the voice storage unit 15 is not limited to the second voice; the voice storage unit 15 may also store the first voice. That is, the voice interaction control device 101 may first store the first voice of the first voice section detected by the voice section detection unit 11 in the voice storage unit 15, read it out after a certain time has elapsed, and have the voice recognition unit 12 recognize it.
(Effect)

Summarizing the above, the voice interaction control device 101 according to the second embodiment further includes the voice storage unit 15 that stores the second voice in the second voice section detected by the voice section detection unit 11. Based on a notification indicating that the voice recognition unit 12 has finished recognizing the first voice, the dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15 and causes the response generation unit 13 to generate a second response corresponding to the voice recognition result of the second voice.

With this configuration, the voice interaction control device 101 can acquire the second voice even while the first voice is being processed, for example during voice recognition or response generation. That is, the voice interaction control device 101 can generate an appropriate response to each of a plurality of voices uttered by the user at arbitrary timing.

Further, based on a notification indicating that the response generation unit 13 has completed generation of the first response, the dialogue control unit 14 of the voice interaction control device 101 according to the second embodiment causes the response generation unit 13 to generate the second response corresponding to the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit 12.

With this configuration, the voice interaction control device 101 can present both the first response to the first voice and the second response to the second voice to the user in order. For example, if the user utters the second voice "Actually, I want to go to a convenience store." immediately after the system inputs the first voice "I want to go to the supermarket." and starts processing it, a conventional system would be unable to recognize the second voice and would likely respond only by presenting the supermarket search results. The voice interaction control device 101 according to the present embodiment, however, can input both the first voice and the second voice and present the supermarket search results and the convenience store search results, respectively.
Third Embodiment

A voice interaction control device and a voice interaction control method according to the third embodiment will be described.
(Configuration)

FIG. 10 is a block diagram showing the configurations of the voice interaction control device 102 and the system 200 according to the third embodiment. The voice interaction control device 102 includes a dialogue state determination unit 16 in addition to the configuration shown in the second embodiment.

The dialogue state determination unit 16 determines whether the voice recognition result of the second voice recognized by the voice recognition unit 12 updates the voice recognition result of the first voice.

Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice partway through and causes the response generation unit 13 to generate the second response.
(Processing circuit)

The functions of the dialogue state determination unit 16 and the dialogue control unit 14 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the dialogue state determination unit 16 and the dialogue control unit 14 having the functions described above.

When the functions of the dialogue state determination unit 16 and the dialogue control unit 14 are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes a function and operation for determining whether the voice recognition result of the second voice updates the voice recognition result of the first voice. The program further describes functions and operations for terminating the processing for the first voice partway through and generating the second response based on that determination result.
(Operation)

Next, the operation of the voice interaction control device 102 and the voice interaction control method will be described. FIG. 11 is a sequence chart, and FIG. 12 is a flowchart, each showing an example of the operation of the voice interaction control device 102 and the voice interaction control method according to the third embodiment. In the following description, the operation of the voice storage unit 15 is omitted; it is the same as in the second embodiment.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "I want to go to a convenience store." uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S20, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11, with reference to the dictionary database.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "convenience store" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S50, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the first response corresponding to the voice recognition result of the first voice, referring to the system response database shown in FIG. 7.

In step S60, the voice section detection unit 11 detects the start of the second voice section of the second voice input after the first voice. Here, "Actually, I want to go to a restaurant." uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the start of the second voice section detected by the voice section detection unit 11, with reference to the dictionary database stored in the dictionary database storage device 23.

In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice contains "restaurant" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S102, the dialogue state determination unit 16 determines whether the voice recognition result of the second voice updates the voice recognition result of the first voice, and outputs the determination result to the dialogue control unit 14. In the present embodiment, it is determined whether the voice recognition result of the second voice, containing "restaurant", updates the voice recognition result of the first voice, containing "convenience store". If it is determined not to update it, step S104 is executed; if it is determined to update it, step S106 is executed. In the present embodiment, the dialogue state determination unit 16 determines that the voice recognition result of the second voice, containing "restaurant", does update the voice recognition result of the first voice, containing "convenience store". In this determination, the dialogue state determination unit 16 may determine whether an update is required based on the parallel relationship between the vocabulary items "convenience store" and "restaurant", or based on other vocabulary contained in the second voice, for example the adversative conjunction "actually".
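The determination criteria are left open above; the following minimal sketch shows one way they could be realized, under the assumption of a fixed table of parallel destination vocabulary and a list of adversative markers (both hypothetical).

```python
# Parallel destination vocabulary and adversative markers; both tables are
# assumptions, since the text above leaves the concrete criteria open.
DESTINATION_WORDS = {"supermarket", "convenience store", "restaurant"}
ADVERSATIVE_MARKERS = ("actually", "after all", "on second thought")

def second_updates_first(first_result: str, second_result: str) -> bool:
    # Criterion 1: both utterances name one of a set of parallel
    # destinations, so the later request replaces the earlier one.
    if (any(w in first_result for w in DESTINATION_WORDS)
            and any(w in second_result for w in DESTINATION_WORDS)):
        return True
    # Criterion 2: the second utterance opens with an adversative marker.
    return second_result.lower().startswith(ADVERSATIVE_MARKERS)

print(second_updates_first("I want to go to a convenience store.",
                           "Actually, I want to go to a restaurant."))  # True
```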
If it is determined in step S102 that the second voice does not update the first, then in step S104, under control of the dialogue control unit 14 based on the determination result, the response generation unit 13 completes generation of the first response and the response presentation device 22 presents that first response to the user. In this case, the response is presented in the same way as in step S80 of the second embodiment. Subsequently, from step S110 shown in FIG. 12 onward, the response to the second voice is presented by the response presentation device 22.

If, on the other hand, it is determined in step S102 that the second voice updates the first, then in step S106, based on the determination result, the dialogue control unit 14 terminates the processing for the first voice partway through.

In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 7.

In step S120, the response generation unit 13 completes generation of the second response. Here, the response generation unit 13 generates a second response containing "Displaying restaurant search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Displaying restaurant search results." in accordance with the second response. If the response presentation device 22 is a display device, the display device presents the second response by displaying "Displaying restaurant search results." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for searching for restaurants; in this case, the destination search unit included in the system 200 starts a restaurant search based on the second response, and the response presentation device 22 displays the restaurant search results.

In short, if a second voice inconsistent with the first voice is input while the processing for the first voice is being executed, the dialogue control unit 14 aborts the processing for the first voice partway through and controls so that only the second response, corresponding to the second voice, is generated. As a result, only the second response is presented by the response presentation device 22.
(Effect)

Summarizing the above, the voice interaction control device 102 according to the third embodiment further includes the dialogue state determination unit 16 that determines whether the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit 12 updates the voice recognition result of the first voice. Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice partway through and causes the response generation unit 13 to generate the second response.

With this configuration, when the operation content based on the first voice and the operation content based on the second voice are inconsistent, the voice interaction control device 102 can terminate the processing for the first voice partway through and present the response to the second voice, improving operability for the user. For example, if the user utters the second voice "Actually, I want to go to a restaurant." immediately after the system inputs the first voice "I want to go to a convenience store." and starts processing it, a conventional system would be unable to recognize the second voice and would likely respond only by presenting the convenience store search results. Based on the voice recognition results of the first and second voices, however, the voice interaction control device 102 according to the third embodiment can present a response closer to the user's intention, namely the restaurant search results for the second voice, and can do so earlier than the voice interaction control device 101 according to the second embodiment.
Fourth Embodiment

A voice interaction control device and a voice interaction control method according to the fourth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
(Configuration)

FIG. 13 is a block diagram showing the configurations of the voice interaction control device 103 and the system 200 according to the fourth embodiment.

The dictionary database storage device 23 of the system 200 stores a plurality of dictionary databases. In the present embodiment, the dictionary database storage device 23 stores a first dictionary database 24 and a second dictionary database 25.

The first dictionary database 24 is a dictionary database prepared for the standby state of the system 200. The standby state is, for example, a state in which the voice input device 21 of the system 200 can accept an operation by the user, that is, a state of waiting for input of the first voice. In the standby state, a display device, which is another user interface included in the system 200, displays, for example, a menu screen. The second dictionary database 25 is a dictionary database that corresponds to the state after the system 200 has recognized the first voice and that is associated with a specific vocabulary item contained in the voice recognition result of the first voice.

The voice recognition unit 12 performs voice recognition with reference to the one dictionary database, among the plurality of dictionary databases, that corresponds to the state of the system 200.

In the present embodiment, when the system 200 is in the standby state, the voice recognition unit 12 recognizes the first voice with reference to the first dictionary database 24 as the one dictionary database corresponding to the standby state. Alternatively, when the system 200 is in the standby state, the voice recognition unit 12 may consult all the dictionary databases and thereby refer to the first dictionary database 24, as the one corresponding to the standby state, to recognize the first voice. FIG. 14 is a diagram showing an example of the configuration of the first dictionary database 24 according to the fourth embodiment. The first dictionary database 24 contains states of the system 200 and recognition vocabulary. The first screen in FIG. 14 is a standby screen such as a menu screen.

Further, when the state of the system 200 is the state after voice recognition of the first voice and the voice recognition result of the first voice contains a specific vocabulary item, the voice recognition unit 12 recognizes the second voice with reference to the second dictionary database 25 associated with that specific vocabulary item, as the one dictionary database corresponding to that state. For example, after voice recognition of the first voice, the voice recognition unit 12 or the dialogue control unit 14 determines whether the voice recognition result of the first voice contains a specific vocabulary item and, if it does, selects the second dictionary database 25 for recognizing the second voice. In this way, the voice recognition unit 12 has the function of switching the dictionary database used for voice recognition according to the state of the system 200. FIG. 15 is a diagram showing an example of the configuration of the second dictionary database 25 according to the fourth embodiment. The second dictionary database 25 contains main states of the system 200, related states of the system 200, and recognition vocabulary.
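This state-dependent switching can be sketched as follows. The dictionary contents follow the "play"/"music" example used later in this embodiment and are assumptions, not the patent's data format.

```python
FIRST_DICTIONARY = {"play", "stop", "navigate"}       # standby-state vocabulary (assumed)
SECOND_DICTIONARIES = {
    "play": {"music", "video", "radio"},              # vocabulary related to "play" (assumed)
}

def select_dictionary(system_state: str, first_result: str = "") -> set[str]:
    if system_state == "standby":
        # First voice: use the dictionary prepared for the standby state.
        return FIRST_DICTIONARY
    # Second voice: use the dictionary tied to the specific vocabulary item
    # found in the recognition result of the first voice.
    for vocabulary, dictionary in SECOND_DICTIONARIES.items():
        if vocabulary in first_result:
            return dictionary
    return FIRST_DICTIONARY                           # fallback assumption

print(select_dictionary("standby"))                   # standby-state vocabulary
print(select_dictionary("after_first_voice", "play")) # {'music', 'video', 'radio'}
```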
The response generation unit 13 generates a response corresponding to the voice recognition result of a voice and the information of the one dictionary database referred to for recognizing that voice. For example, the response generation unit 13 generates a first response corresponding to the voice recognition result of the first voice and the information of the first dictionary database 24 referred to for recognizing it, or a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database 25 referred to for recognizing it.

(System response database)

The response generation unit 13 generates the response to a voice by referring to the system response database. FIG. 16 is a diagram showing an example of the configuration of the system response database according to the fourth embodiment. This system response database consists of recognition vocabulary contained in voice recognition results, information on the dictionary database referred to for voice recognition, and the responses corresponding to them.
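By way of illustration, a database of the kind shown in FIG. 16 can be pictured as a lookup keyed on both the recognition vocabulary and the dictionary used. The entries below follow the "play"/"music" example of this embodiment and are assumptions.

```python
# Hypothetical entries mirroring FIG. 16: the response is keyed on both the
# recognition vocabulary and the dictionary database used for recognition.
SYSTEM_RESPONSE_DATABASE = {
    ("play", "first dictionary database"):   "What would you like to play?",
    ("music", "second dictionary database"): "Playing music.",
}

def generate_response(vocabulary: str, dictionary_info: str) -> str:
    return SYSTEM_RESPONSE_DATABASE.get(
        (vocabulary, dictionary_info),
        "Sorry, I did not understand.")      # fallback assumption

print(generate_response("music", "second dictionary database"))  # -> "Playing music."
```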
(Processing circuit)

The functions of the voice recognition unit 12 and the response generation unit 13 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice recognition unit 12 and the response generation unit 13 having the functions described above.

When the functions of the voice recognition unit 12 and the response generation unit 13 are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes functions and operations for recognizing a voice with reference to one of a plurality of dictionary databases and generating a response corresponding to the voice recognition result and the information of the one dictionary database referred to. The program also describes functions and operations for recognizing the first voice with reference to the first dictionary database 24 prepared for the standby state of the system 200 and recognizing the second voice with reference to the second dictionary database 25 associated with the specific vocabulary item contained in the voice recognition result of the first voice. The program further describes a function and operation for generating a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database 25.
(Operation)

Next, the operation of the voice interaction control device 103 and the voice interaction control method will be described. FIG. 17 is a sequence chart, and FIG. 18 is a flowchart, each showing an example of the operation of the voice interaction control device 103 and the voice interaction control method according to the fourth embodiment. In the following description, the operation of the voice storage unit 15 is omitted; it is the same as in the second embodiment.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "Play" uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S22, the voice recognition unit 12 selects the first dictionary database 24 corresponding to the standby state of the system 200. For example, the voice recognition unit 12 acquires information indicating that the system 200 is in the standby state and, based on that information, selects the first dictionary database 24 shown in FIG. 14 from among the plurality of dictionary databases. Here, the information indicating the standby state acquired by the voice recognition unit 12 is information that the first screen is being displayed.

In step S24, the voice recognition unit 12 refers to the first dictionary database 24 and starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11. Alternatively, combining steps S22 and S24, the voice recognition unit 12 may, based on the information that the system 200 is in the standby state, consult all the dictionary databases and thereby recognize the first voice with reference to the first dictionary database 24 corresponding to the standby state.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "play" as recognition vocabulary.

In step S60, the voice section detection unit 11 detects the start of the second voice section of the second voice input after the first voice. Here, "Music" uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S74, the voice recognition unit 12 selects the second dictionary database 25 corresponding to the state in which the system 200 has finished voice recognition of the first voice and the voice recognition result of the first voice contains a specific vocabulary item. For example, the voice recognition unit 12 determines whether the voice recognition result of the first voice contains a specific vocabulary item and, if it does, selects the second dictionary database 25 associated with that vocabulary item from among the plurality of dictionary databases. Here, the voice recognition unit 12 determines whether the voice recognition result of the first voice contains the specific vocabulary item "play" and, having determined that it does, refers to the second dictionary database 25 shown in FIG. 15 to recognize the second voice.

In step S76, the voice recognition unit 12 refers to the second dictionary database 25 and starts voice recognition of the second voice from the start of the second voice section detected by the voice section detection unit 11. In this way, the voice recognition unit 12 switches the dictionary database used for voice recognition from the first dictionary database 24 to the second dictionary database 25 according to the state of the system 200.

In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice and the information of the second dictionary database 25 referred to for recognizing it are output to the response generation unit 13. The voice recognition result of the second voice contains "music" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 16.

In step S120, the response generation unit 13 completes generation of the second response. Here, since the recognition vocabulary is "music" and the dictionary database information is "second dictionary database", the response generation unit 13 generates a second response containing "Playing music." as information for voice output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, a speaker included in the response presentation device 22 presents the second response to the user by outputting the voice "Playing music." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for causing a music playback device included in the response presentation device 22 to play music, and the music playback device may play music based on that second response.

Although not shown in the flowchart, if no second voice is input in step S60, the voice recognition result of the first voice and the information of the first dictionary database 24 referred to for recognizing it are output to the response generation unit 13. Since the recognition vocabulary is "play" and the dictionary database information is "first dictionary database", the response generation unit 13 generates a first response containing "What would you like to play?" as information for voice output or display output, and the response presentation device 22 presents that first response to the user.
(Effect)

Summarizing the above, the voice recognition unit 12 of the voice interaction control device 103 according to the fourth embodiment recognizes a voice with reference to the one dictionary database, among a plurality of dictionary databases, that corresponds to the state of the system. The response generation unit 13 generates a response corresponding to the voice recognition result and the information of the one dictionary database referred to for recognizing the voice.

With this configuration, the voice interaction control device 103 can switch the dictionary database referred to during voice recognition according to the state of the system 200, that is, the dialogue state, and can therefore generate an accurate response to the user's utterance.

Further, the voice recognition unit 12 of the voice interaction control device 103 according to the fourth embodiment recognizes the first voice with reference to the first dictionary database 24, among the plurality of dictionary databases, prepared for the standby state of the system 200, and recognizes the second voice with reference to the second dictionary database 25, among the plurality of dictionary databases, that corresponds to the state after voice recognition of the first voice and is associated with the specific vocabulary item contained in the voice recognition result of the first voice. The response generation unit 13 generates a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database referred to for recognizing it.

With this configuration, the voice interaction control device 103 can generate a response reflecting the content of both the first voice and the second voice, producing an accurate response to the user's utterance. For example, if the user utters the second voice "Music" immediately after the system inputs the first voice "Play" and starts processing it, a conventional system would be unable to recognize the second voice and would likely present a response asking the user what to play. The voice interaction control device 103 according to the present embodiment, however, recognizes the second voice with reference to the second dictionary database associated with the voice recognition result of the first voice, and can therefore play music in line with the user's intention.
Fifth Embodiment

A voice interaction control device and a voice interaction control method according to the fifth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
 (構成)
 図19は、実施の形態5における音声対話制御装置104およびシステム200の構成を示すブロック図である。
(Constitution)
FIG. 19 is a block diagram showing configurations of the voice dialogue control device 104 and the system 200 in the fifth embodiment.
 応答生成部13は、音声の音声認識結果に対応して生成される複数の応答から一の応答をユーザに選択させるための確認応答を生成する確認応答生成部17をさらに含む。 The response generation unit 13 further includes a confirmation response generation unit 17 that generates a confirmation response for causing the user to select one response from a plurality of responses generated corresponding to the speech recognition result of speech.
The interaction control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system 200 to present that response to the user.
(Processing circuit)
The functions of the confirmation response generation unit 17 and the response generation unit 13 described above are realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3. When they are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes the function and operation of generating a confirmation response for prompting the user to select one response from among a plurality of responses generated for the voice recognition result of a voice. The program also describes the function and operation of causing the system 200 to present the confirmation response to the user, generating the one response corresponding to the voice input by the user in accordance with the confirmation response, and causing the system 200 to present that response to the user.
(Operation)
Next, the operation of the voice interaction control device 104 and the voice interaction control method will be described. FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device 104 and the voice interaction control method according to the fifth embodiment. In this embodiment, steps S10 to S110 are the same as in the fourth embodiment, and their description is omitted.
In step S112, the response generation unit 13 determines whether a plurality of second responses corresponding to the voice recognition result of the second voice can be generated. For example, if the system 200 is equipped with both a portable music-playback device and a CD (Compact Disc) player, the response generation unit 13 can generate a second response including a control signal for playing the music stored on the portable device and another second response including a control signal for playing the music stored on the CD. If it is determined that a plurality of second responses cannot be generated, step S120 is executed; in this case, the processing from step S120 onward is the same as in the fourth embodiment. If it is determined that a plurality of second responses can be generated, step S122 is executed.
In step S122, the confirmation response generation unit 17 generates a confirmation response for prompting the user to select one of the plurality of second responses generated for the voice recognition result of the second voice. Here, the confirmation response generation unit 17 generates a confirmation response containing, as information for audio or display output, "Do you want to play music from the portable device or from the CD?"
In step S124, the interaction control unit 14 causes the response presentation device 22 to present the confirmation response to the user. The response presentation device 22 presents "Do you want to play music from the portable device or from the CD?" to the user, and the user inputs another voice to operate the system in accordance with the confirmation response. For example, when the user inputs the voice "Play the music on the portable device", the voice interaction control device 104 generates the one second response through the same voice recognition and response generation as in the steps above. The response presentation device 22 then plays the music on the portable device, thereby presenting the selected second response to the user.
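A minimal sketch of the confirmation flow in steps S112 to S124 might look as follows; the playback sources mirror the example above, while the functions and data structures are illustrative assumptions.

```python
# Hedged sketch of steps S112-S124: detect ambiguity, ask for confirmation.
# Function names and structures are assumptions for illustration.

def generate_second_responses(vocabulary: str, sources: list[str]) -> list[str]:
    # One candidate response per playback source that can serve the request.
    return [f"Play {vocabulary} from {src}" for src in sources]

def dialogue_step(vocabulary: str, sources: list[str]) -> str:
    candidates = generate_second_responses(vocabulary, sources)
    if len(candidates) == 1:      # S112: no ambiguity, present directly (S120)
        return candidates[0]
    # S122: build a confirmation response asking the user to choose one.
    options = " or the ".join(sources)
    return f"Do you want to play {vocabulary} from the {options}?"

# S124: the confirmation is presented; the user's next utterance selects one.
print(dialogue_step("music", ["portable device", "CD"]))
```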
(Effect)
Summarizing the above, the response generation unit 13 of the voice interaction control device 104 according to the fifth embodiment further includes a confirmation response generation unit 17 that generates a confirmation response for prompting the user to select one response from among a plurality of responses generated for the voice recognition result of a voice. The interaction control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system 200 to present that response to the user.
With such a configuration, the voice interaction control device 104 can ask the user for confirmation when ambiguity arises in the interaction between the user and the system.
<Embodiment 6>
A voice interaction control device and a voice interaction control method according to the sixth embodiment will be described.
(Configuration)
The configurations of the voice interaction control device 104 and the system 200 in the sixth embodiment are the same as in the fourth embodiment. In this embodiment, however, the interaction control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. If the elapsed time is equal to or greater than the specific value, the interaction control unit 14 causes the second voice to be recognized by referring to the first dictionary database 24, which is prepared, among the plurality of dictionary databases, for the standby state of the system 200. If the elapsed time is less than the specific value, the interaction control unit 14 causes the second voice to be recognized by referring to the second dictionary database, which corresponds to the state after recognition of the first voice and is related to a specific vocabulary item included in the recognition result of the first voice. In other words, the interaction control unit 14 judges the relevance between the first voice and the second voice from whether the elapsed time reaches the specific value, and has a response generated accordingly for presentation to the user.
(Processing circuit)
The above function of the interaction control unit 14 is realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3. When it is realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes the function and operation of, based on the determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value, causing the second voice to be recognized by referring to the first dictionary database 24 prepared for the standby state of the system 200, or causing the second voice to be recognized by referring to the second dictionary database that corresponds to the state after recognition of the first voice and is related to the specific vocabulary item included in the recognition result of the first voice.
(Operation)
Next, the operation of the voice interaction control device 104 and the voice interaction control method will be described. FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device 104 and the voice interaction control method according to the sixth embodiment. Steps S10 to S60 in this embodiment are the same as in the fourth embodiment, and their description is omitted.
In step S64, the interaction control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value. If the elapsed time is determined to be less than the specific value, that is, if the utterances are judged to be related, step S74 is executed. If the elapsed time is determined to be equal to or greater than the specific value, that is, if the utterances are judged to be unrelated, step S70 is executed.
In steps S74 and S76, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11. The voice recognition unit 12 recognizes the second voice by referring to the second dictionary database 25 related to the specific vocabulary item included in the recognition result of the first voice. The processing from step S74 onward is the same as the corresponding processing in the fourth embodiment shown in FIG. 18.
If, on the other hand, the utterances are judged to be unrelated, the voice recognition unit 12 starts, in step S70, voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11. In this case, however, the voice recognition unit 12 performs the recognition by referring to the first dictionary database 24 prepared for the standby state of the system 200.
In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the interaction control unit 14.
In step S100, the voice recognition unit 12 finishes the voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice and the information of the first dictionary database 24 referred to for that recognition are output to the response generation unit 13. Here, the recognition result of the second voice includes "music" as the recognized vocabulary. The voice recognition unit 12 also notifies the interaction control unit 14 that the voice recognition has finished.
In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 16.
In step S120, the response generation unit 13 completes the generation of the second response. Here, since the recognized vocabulary is "music" and the dictionary database information is "first dictionary database", the response generation unit 13 generates a second response containing "Displaying the music screen." as information for audio output. The interaction control unit 14 controls the response presentation device 22 so as to present the second response to the user. For example, a speaker included in the response presentation device 22 presents the second response by outputting the voice "Displaying the music screen." Alternatively, the response generation unit 13 may generate a second response containing a control signal that causes a display device included in the response presentation device 22 to display a music screen, and the display device may display the music screen based on that second response.
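One possible shape for the system response database lookup in step S120 is sketched below; the table entries mirror the examples in the text, but the keys and structure are otherwise assumptions.

```python
# Minimal sketch of a system response database lookup, as in step S120.
# Keys and structure are illustrative assumptions.

SYSTEM_RESPONSES = {
    ("music", "first_dictionary"): "Displaying the music screen.",
    ("music", "second_dictionary"): "Playing music.",
}

def generate_response(vocabulary: str, dictionary_info: str) -> str:
    # The pair (recognized vocabulary, referenced dictionary) selects the response.
    return SYSTEM_RESPONSES.get((vocabulary, dictionary_info),
                                "Sorry, I did not understand.")

print(generate_response("music", "first_dictionary"))
```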
(Effect)
Summarizing the above, the interaction control unit 14 of the voice interaction control device 104 according to the sixth embodiment, based on the determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value, causes the second voice to be recognized by referring to the first dictionary database 24 prepared, among the plurality of dictionary databases, for the standby state of the system 200, or causes the second voice to be recognized by referring to the second dictionary database 25, which corresponds to the state after recognition of the first voice and is related to the specific vocabulary item included in the recognition result of the first voice.
With such a configuration, the voice interaction control device 104 generates a response that takes into account not only the voice recognition result but also the timing of the user's utterance, and can therefore respond accurately to the user's speech.
<Embodiment 7>
Each of the voice interaction control devices described in the first to sixth embodiments is mounted, for example, on a vehicle. FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device 105 mounted on a vehicle 30. Here, the voice interaction control device 105 is any one of the voice interaction control devices 100 to 104 described in the first to sixth embodiments. The system 200 includes an in-vehicle device (not shown) such as a navigation device, an audio device, or a PND (Portable Navigation Device). A voice input device (not shown) of the in-vehicle device inputs the voice uttered by the user, the voice interaction control device 105 generates a response corresponding to that voice, and a response presentation device (not shown) of the in-vehicle device presents the response to the user.
The system 200 including the voice interaction control device 105 may also be built by appropriately combining a communication terminal used with the in-vehicle device, a server installed outside the vehicle, and the functions of applications installed on them. FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device 105 provided in a server 40. Voice input from a voice input device (not shown) of a communication terminal 32 is received over a network by a communication device 41 of the server 40 and processed by the voice interaction control device 105, which generates a response corresponding to that voice. The generated response is sent from the communication device 41 over the network and presented to the user by a response presentation device (not shown) of an in-vehicle device 31; the response presentation device may instead be included in the communication terminal 32. Here, the communication terminal 32 is, for example, a mobile phone, a smartphone, or a tablet. The components of the voice interaction control device 105 may also be distributed among the devices that make up the system 200, in which case the functions are realized by the components communicating with one another as appropriate. Providing the voice interaction control device 105 in the server 40, or distributing its components across the server 40 and other devices, realizes the functions of the voice interaction control device 105 while simplifying the configuration of the vehicle 30 or the in-vehicle device 31.
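As a rough sketch of this server-side arrangement, assuming an HTTP transport that the publication does not specify, the terminal could send captured audio to the server and receive the generated response:

```python
# Rough sketch of the client-server split described above. The transport,
# endpoint, and function names are all illustrative assumptions.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def control_dialogue(audio_bytes: bytes) -> str:
    # Placeholder for the voice interaction control device 105 on the server:
    # voice section detection, recognition, and response generation would run here.
    return "Playing music."

class DialogueHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        audio = self.rfile.read(int(self.headers["Content-Length"]))
        response = control_dialogue(audio)
        body = json.dumps({"response": response}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 8080), DialogueHandler).serve_forever()  # terminals POST audio here
```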
In the present invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate, within the scope of the invention. Although the present invention has been described in detail, the above description is illustrative in all aspects, and the invention is not limited to it. It is understood that countless variations not illustrated here can be conceived without departing from the scope of the invention.
DESCRIPTION OF SYMBOLS: 11 voice section detection unit, 12 voice recognition unit, 13 response generation unit, 14 interaction control unit, 15 voice storage unit, 16 interaction state determination unit, 17 confirmation response generation unit, 24 first dictionary database, 25 second dictionary database, 100 voice interaction control device, 200 system.

Claims (9)

1. A voice interaction control device that performs interaction control for causing a system to present to a user a response to voice input from the user when the user operates the system through interaction between the user and the system, the voice interaction control device comprising:
    a voice section detection unit that detects a voice section from the beginning to the end of a continuous input voice;
    a voice recognition unit that recognizes the voice in the voice section;
    a response generation unit that generates the response corresponding to the voice recognition result of the voice, the response being the one to be presented to the user by the system; and
    an interaction control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit,
    wherein the interaction control unit causes the voice section detection unit to detect a second voice section formed by a continuous second voice input as the voice after a first voice, so that a second response to the second voice can be generated even if processing for the first voice has not been completed, the processing for the first voice including the processing from detection of a first voice section formed by the continuous first voice input as the voice to presentation of a first response corresponding to the voice recognition result of the first voice from the system to the user.
2. The voice interaction control device according to claim 1, further comprising a voice storage unit that stores the second voice in the second voice section detected by the voice section detection unit,
    wherein the interaction control unit, based on a notification indicating that the voice recognition of the first voice has finished in the voice recognition unit, causes the voice recognition unit to recognize the second voice stored in the voice storage unit, and causes the response generation unit to generate the second response corresponding to the voice recognition result of the second voice.
3. The voice interaction control device according to claim 1, wherein the interaction control unit, based on a notification indicating that the generation of the first response has been completed in the response generation unit, causes the response generation unit to generate the second response corresponding to the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit.
4. The voice interaction control device according to claim 1, further comprising an interaction state determination unit that determines whether the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit updates the voice recognition result of the first voice,
    wherein the interaction control unit, based on the determination result of the interaction state determination unit, terminates the processing for the first voice partway through and causes the response generation unit to generate the second response.
5. The voice interaction control device according to claim 1, wherein the voice recognition unit recognizes the voice by referring to one dictionary database, among a plurality of dictionary databases, corresponding to the state of the system, and
    the response generation unit generates the response corresponding to the voice recognition result of the voice and to the information of the one dictionary database referred to for the voice recognition of the voice.
6. The voice interaction control device according to claim 5, wherein the voice recognition unit recognizes the first voice by referring to a first dictionary database, among the plurality of dictionary databases, prepared corresponding to a standby state of the system, and recognizes the second voice by referring to a second dictionary database, among the plurality of dictionary databases, corresponding to a state after the voice recognition of the first voice and related to a specific vocabulary item included in the voice recognition result of the first voice, and
    the response generation unit generates the second response corresponding to the voice recognition result of the second voice and to the information of the second dictionary database referred to for the voice recognition of the second voice.
7. The voice interaction control device according to claim 1, wherein the response generation unit further includes a confirmation response generation unit that generates a confirmation response for causing the user to select one response from a plurality of the responses generated corresponding to the voice recognition result of the voice, and
    the interaction control unit causes the system to present the confirmation response to the user, causes the response generation unit to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system to present that response to the user.
8. The voice interaction control device according to claim 5, wherein the interaction control unit, based on a determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value, causes the second voice to be recognized by referring to a first dictionary database, among the plurality of dictionary databases, prepared corresponding to a standby state of the system, or causes the second voice to be recognized by referring to a second dictionary database, among the plurality of dictionary databases, corresponding to a state after the voice recognition of the first voice and related to a specific vocabulary item included in the voice recognition result of the first voice.
9. A voice interaction control method for performing interaction control for causing a system to present to a user a response to voice input from the user when the user operates the system through interaction between the user and the system, the method comprising:
    detecting a voice section from the beginning to the end of a continuous input voice;
    recognizing the voice in the voice section;
    generating the response corresponding to the voice recognition result of the voice, the response being the one to be presented to the user by the system; and
    controlling each of the detection of the voice section, the voice recognition of the voice, and the generation of the response,
    wherein, in the controlling, a second voice section formed by a continuous second voice input as the voice after a first voice is detected so that a second response to the second voice can be generated even if processing for the first voice has not been completed, the processing for the first voice including the processing from detection of a first voice section formed by the continuous first voice input as the voice to presentation of a first response corresponding to the voice recognition result of the first voice from the system to the user.
PCT/JP2017/033902 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction WO2019058453A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction
JP2019542865A JP6851491B2 (en) 2017-09-20 2017-09-20 Voice dialogue control device and voice dialogue control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction

Publications (1)

Publication Number Publication Date
WO2019058453A1 true WO2019058453A1 (en) 2019-03-28

Family

ID=65811399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction

Country Status (2)

Country Link
JP (1) JP6851491B2 (en)
WO (1) WO2019058453A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599133A (en) * 2020-12-15 2021-04-02 北京百度网讯科技有限公司 Vehicle-based voice processing method, voice processor and vehicle-mounted processor

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001014165A (en) * 1999-06-30 2001-01-19 Toshiba Corp Device for generating response, device for managing dialogue, method for generating response and computer readable recording medium for storing response generating program
JP2003058188A (en) * 2001-08-13 2003-02-28 Fujitsu Ten Ltd Voice interaction system
JP2004037910A (en) * 2002-07-04 2004-02-05 Denso Corp Interaction system and interactive verse capping system
JP2015064450A (en) * 2013-09-24 2015-04-09 シャープ株式会社 Information processing device, server, and control program
JP2017102320A (en) * 2015-12-03 2017-06-08 アルパイン株式会社 Voice recognition device


Also Published As

Publication number Publication date
JP6851491B2 (en) 2021-03-31
JPWO2019058453A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
US11356730B2 (en) Systems and methods for routing content to an associated output device
KR101418163B1 (en) Speech recognition repair using contextual information
KR102100389B1 (en) Personalized entity pronunciation learning
US10706853B2 (en) Speech dialogue device and speech dialogue method
EP3475942B1 (en) Systems and methods for routing content to an associated output device
US8484033B2 (en) Speech recognizer control system, speech recognizer control method, and speech recognizer control program
US9092435B2 (en) System and method for extraction of meta data from a digital media storage device for media selection in a vehicle
US7822613B2 (en) Vehicle-mounted control apparatus and program that causes computer to execute method of providing guidance on the operation of the vehicle-mounted control apparatus
JP4260788B2 (en) Voice recognition device controller
CN111095400A (en) Selection system and method
US10599469B2 (en) Methods to present the context of virtual assistant conversation
US20150039316A1 (en) Systems and methods for managing dialog context in speech systems
KR102360589B1 (en) Systems and methods for routing content to related output devices
JP2001083991A (en) User interface device, navigation system, information processing device and recording medium
JP7347217B2 (en) Information processing device, information processing system, information processing method, and program
JP6851491B2 (en) Voice dialogue control device and voice dialogue control method
JP7456387B2 (en) Information processing device and information processing method
JP2006058641A (en) Speech recognition device
JP2004354942A (en) Voice interactive system, voice interactive method and voice interactive program
CN117090668A (en) Vehicle exhaust sound adjusting method and device and vehicle
JP2001209394A (en) Speech recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17925620

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019542865

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17925620

Country of ref document: EP

Kind code of ref document: A1