WO2019058453A1 - Voice interaction control device and method for controlling voice interaction - Google Patents

Voice interaction control device and method for controlling voice interaction

Info

Publication number
WO2019058453A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
response
speech
unit
user
Prior art date
Application number
PCT/JP2017/033902
Other languages
French (fr)
Japanese (ja)
Inventor
Akio HORII
Yohei OKATO
Original Assignee
Mitsubishi Electric Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2017/033902
Priority to JP2019542865A (JP6851491B2)
Publication of WO2019058453A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a voice interaction control apparatus and a voice interaction control method for causing a system to present a response corresponding to voice input from a user when the user operates the system by interaction between the system and the user.
  • Conventionally, a system having a voice recognition function receives a voice uttered by a user and outputs a response corresponding to the voice.
  • In Patent Document 1, a voice dialogue control method has been proposed in which, when the user inputs an interrupting voice while the system is outputting voice, the voice output is continued or paused depending on the importance of the voice being output, and processing is performed on the interrupting voice.
  • However, the technique of Patent Document 1 cannot capture a subsequent second voice at a specific timing, for example, immediately after the end of the first voice is detected, that is, immediately after capture of the first voice ends.
  • The present invention has been made to solve the problems described above, and aims to provide a voice interaction control device that performs dialogue control so that the system can appropriately respond to a second voice input after a first voice.
  • A voice interaction control device according to the present invention performs dialogue control for causing the system to present to the user a response to voice input from the user when the user operates the system by interaction between the user and the system.
  • The device includes a voice section detection unit that detects a voice section from the beginning to the end of an input series of voice, a voice recognition unit that recognizes the voice in the voice section, a response generation unit that generates a response corresponding to the voice recognition result to be presented to the user from the system, and a dialogue control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit.
  • The dialogue control unit causes the voice section detection unit to detect a second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice until a first response corresponding to the voice recognition result of the first voice is presented to the user from the system, is not completed.
  • According to the present invention, it is possible to provide a voice interaction control device that performs dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • FIG. 1 is a block diagram showing the configuration of a voice interaction control device and system in a first embodiment.
  • FIG. 2 is a diagram showing an example of the processing circuit included in the voice interaction control device.
  • FIG. 3 is a diagram showing another example of the processing circuit included in the voice interaction control device.
  • FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
  • FIG. 5 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
  • FIG. 6 is a block diagram showing the configuration of a voice interaction control device and system in a second embodiment.
  • FIG. 7 is a diagram showing an example of the configuration of a system response database in the second embodiment.
  • FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
  • FIG. 9 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
  • FIG. 10 is a block diagram showing the configuration of a voice interaction control device and system in a third embodiment.
  • FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
  • FIG. 12 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
  • FIG. 13 is a block diagram showing the configuration of a voice interaction control device and system in a fourth embodiment.
  • FIG. 14 is a diagram showing an example of the configuration of a first dictionary database in the fourth embodiment.
  • FIG. 15 is a diagram showing an example of the configuration of a second dictionary database in the fourth embodiment.
  • FIG. 16 is a diagram showing an example of the configuration of a system response database in the fourth embodiment.
  • FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
  • FIG. 18 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
  • FIG. 19 is a block diagram showing the configuration of a voice interaction control device and system in a fifth embodiment.
  • FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fifth embodiment.
  • FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in a sixth embodiment.
  • FIG. 22 is a block diagram showing an example of the configuration of a voice dialogue control device mounted on a vehicle in a seventh embodiment.
  • FIG. 23 is a block diagram showing an example of the configuration of a voice dialogue control device provided in a server in the seventh embodiment.
  • Embodiment 1. A voice dialogue control apparatus and a voice dialogue control method according to the first embodiment will be described.
  • FIG. 1 is a block diagram showing the configuration of voice dialogue control apparatus 100 and system 200 in the first embodiment.
  • The system 200 receives a voice uttered by the user to operate the system 200, and presents a response to the voice to the user.
  • the system 200 includes a voice input device 21, a voice interaction control device 100 and a response presentation device 22.
  • the system 200 is, for example, a navigation system, an audio system, a control system that controls devices related to the driving of a vehicle, a control system that controls a driving environment, and the like.
  • the voice input device 21 is an interface for the user to operate the system 200.
  • The voice input device 21 receives a voice uttered by the user in order to operate the system 200, and outputs the voice to the voice dialogue control device 100.
  • the voice input device 21 is, for example, a microphone.
  • the voice interaction control device 100 receives voice from the voice input device 21 and performs interaction control for causing the system 200 to present a response corresponding to the voice to the user.
  • the response presentation device 22 presents the response generated by the voice interaction control device 100 to the user. Note that “to present” includes that the response presentation device 22 operates in accordance with the generated response.
  • the response presentation device 22 may present the response to the user by operating according to the response generated by the voice interaction control device 100.
  • For example, the response presentation device 22 is an audio output device or a display device.
  • The audio output device presents a response by, for example, outputting guidance information to a destination by voice.
  • The display device presents a response by, for example, displaying guidance information to a destination along with a map.
  • Alternatively, the response presentation device 22 is a music playback device.
  • The music playback device presents a response by playing music.
  • Alternatively, the response presentation device 22 is a drive control device of the vehicle.
  • Alternatively, the response presentation device 22 is an air conditioner, a light, a mirror position adjustment device, a seat position adjustment device, or the like.
  • the voice dialogue control apparatus 100 includes a voice section detection unit 11, a voice recognition unit 12, a response generation unit 13 and a dialogue control unit 14.
  • the voice section detection unit 11 detects a voice section from the beginning to the end of the input continuous voice.
  • The voice section detection unit 11 constantly monitors input voice.
  • The voice recognition unit 12 performs speech recognition on the speech in the voice section detected by the voice section detection unit 11. In doing so, the voice recognition unit 12 selects as the recognition vocabulary the acoustically or linguistically most probable vocabulary for the speech in the voice section.
  • the speech recognition unit 12 performs speech recognition, for example, with reference to a dictionary database (not shown).
  • The dictionary database may be provided in the voice interaction control apparatus 100 or in an external server. When the dictionary database is provided in the server, the voice dialogue control device communicates with the server, and the voice recognition unit 12 performs speech recognition with reference to the dictionary database.
  • the response generation unit 13 generates a response corresponding to the speech recognition result of the speech recognition by the speech recognition unit 12.
  • the response generator 13 generates a response, for example, with reference to a system response database (not shown).
  • The system response database is, for example, a table in which recognition vocabularies included in speech recognition results and responses are stored in association with each other.
  • the system response database may be provided in the voice interaction control device 100 or in an external server. When the system response database is provided in the server, the dialog control device communicates with the server, and the response generation unit 13 generates a response with reference to the system response database.
  • the response generation unit 13 outputs the response to the response presentation device 22.
  • the dialogue control unit 14 controls the operations of the speech segment detection unit 11, the speech recognition unit 12 and the response generation unit 13.
  • the dialogue control unit 14 controls each unit while monitoring the dialogue state of the system 200.
  • The dialogue state is the state at any point from when a voice is detected by the voice section detection unit 11 until a response corresponding to the voice is generated and the response is presented to the user.
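  • As an illustration, the dialogue state monitored by the dialogue control unit 14 could be modeled per utterance as shown below. This is a minimal sketch under assumptions of my own (the state names and notification handlers are hypothetical); the patent does not prescribe any particular software structure.

```python
# Hypothetical model of the dialogue state: one state value per utterance,
# advanced as detection, recognition, and generation notifications arrive.
from enum import Enum, auto

class DialogueState(Enum):
    WAITING_FOR_SPEECH = auto()   # standby; voice section detection armed
    DETECTING_SECTION = auto()    # beginning of a voice section detected
    RECOGNIZING = auto()          # end detected; speech recognition running
    GENERATING_RESPONSE = auto()  # recognition done; response being generated
    PRESENTING_RESPONSE = auto()  # response handed to the presentation device

class DialogueStateMonitor:
    """Tracks the state of each utterance from detection to presentation."""
    def __init__(self) -> None:
        self.states: dict[int, DialogueState] = {}

    def on_beginning_detected(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.DETECTING_SECTION

    def on_end_detected(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.RECOGNIZING

    def on_recognition_finished(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.GENERATING_RESPONSE

    def on_response_generated(self, utterance_id: int) -> None:
        self.states[utterance_id] = DialogueState.PRESENTING_RESPONSE
```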
  • For example, the dialogue control unit 14 controls the operation of the voice recognition unit 12 based on a notification that the voice section detection unit 11 has detected the beginning or end of a voice section.
  • Further, the dialogue control unit 14 controls the start of response generation in the response generation unit 13 based on a notification that the voice recognition unit 12 has finished the speech recognition, or controls the voice recognition unit 12 to start speech recognition of subsequent voice.
  • the dialogue control unit 14 controls the processing for the first voice of the series and the processing for the second voice input after the first voice.
  • The processing for the first voice includes processing from detection of the first voice section forming the first voice until presentation of the first response from the system 200 to the user. More specifically, the processing for the first voice includes at least speech recognition of the first voice by the voice recognition unit 12 and generation, by the response generation unit 13, of a first response corresponding to the speech recognition result of the first voice.
  • The processing for the first voice may also include processing from detection of the end of the first voice section until the first response is presented by the response presentation device 22 and the beginning of the voice section forming the next input voice is detected.
  • The dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming the second voice so that the second response to the second voice can be generated even if the processing for the first voice is not completed.
  • The dialogue control unit 14 then causes the voice recognition unit 12 to recognize the second voice in the second voice section, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice, which is presented from the system 200 to the user.
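  • The unit structure described above might be sketched as follows. This is an illustration under my own assumptions (the class and method names are hypothetical); the patent defines the units functionally, not as code.

```python
# Hypothetical object model of units 11-14 of the voice dialogue control
# device 100. Keying work by utterance id is what lets a second utterance be
# detected and recognized while the first is still being processed.
from dataclasses import dataclass

@dataclass
class VoiceSection:
    utterance_id: int
    samples: bytes  # audio between the detected beginning and end

class VoiceSectionDetector:
    """Unit 11: detects the beginning and end of each continuous utterance."""
    def detect(self, audio_stream) -> VoiceSection: ...

class SpeechRecognizer:
    """Unit 12: recognizes the speech in one detected voice section."""
    def recognize(self, section: VoiceSection) -> str: ...

class ResponseGenerator:
    """Unit 13: maps a recognition result to a response for the user."""
    def generate(self, recognition_result: str) -> str: ...

class DialogueController:
    """Unit 14: sequences the other units per utterance."""
    def __init__(self, recognizer, generator, presenter):
        self.recognizer = recognizer
        self.generator = generator
        self.presenter = presenter  # callable standing in for device 22

    def handle(self, section: VoiceSection) -> None:
        result = self.recognizer.recognize(section)
        self.presenter(self.generator.generate(result))
```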
  • FIG. 2 is a diagram showing an example of the processing circuit 50 provided in the voice interaction control device 100. Each function of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 is realized by the processing circuit 50. That is, the processing circuit 50 includes the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14.
  • The processing circuit 50 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a circuit combining these.
  • The functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 may be realized individually by a plurality of processing circuits, or may be realized collectively by a single processing circuit.
  • FIG. 3 is a view showing another example of the processing circuit included in the voice interaction control device 100.
  • the processing circuit includes a processor 51 and a memory 52.
  • When the processor 51 executes the program stored in the memory 52, the functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized.
  • That is, software or firmware described as a program is executed by the processor 51 to realize each function; the voice dialogue control device 100 thus includes a memory 52 for storing the program and a processor 51 for executing it.
  • The program describes functions and operations in which the voice interaction control apparatus 100 detects a voice section from the beginning to the end of an input series of voice, recognizes the voice in the detected voice section, generates a response corresponding to the recognition result, and further controls the voice section detection, the speech recognition, and the response generation.
  • The program also describes a function and operation for detecting, when the voice interaction control apparatus 100 executes each control, a second voice section forming a series of second voice input after the first voice even when the processing for the first voice is not finished.
  • Further, the program causes the second voice in the second voice section to be recognized, causes a second response corresponding to the speech recognition result of the second voice to be generated, and causes the system 200 to present it to the user.
  • the above program causes a computer to execute the procedure or method of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above.
  • the processor 51 is, for example, a central processing unit, a processing unit, an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor) or the like.
  • The memory 52 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable Read Only Memory), or EEPROM (Electrically Erasable Programmable Read Only Memory).
  • Alternatively, the memory 52 may be a magnetic disk, a flexible disk, an optical disk, a compact disc, a mini disc, a DVD, or the like, or any storage medium to be used in the future.
  • The functions of the voice section detection unit 11, the voice recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above may be realized partially by dedicated hardware and partially by software or firmware. Thus, the processing circuit realizes each of the functions described above by hardware, software, firmware, or a combination thereof.
  • FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
  • FIG. 5 is a flowchart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
  • First, the dialogue control unit 14 controls the voice section detection unit 11 to a standby state in which voice can be received, and controls the voice recognition unit 12 to a standby state in which speech recognition is possible. This control is triggered, for example, by a user operation instructing the system 200 to start accepting voice section detection. Alternatively, after startup of the system 200, the dialogue control unit 14 may automatically control the voice section detection unit 11 to the standby state in which voice can be received. After this control, the voice section detection unit 11 is constantly in a state of monitoring voice input, that is, in a detectable state.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • The voice recognition unit 12 outputs the voice recognition result of the first voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end of recognition.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice. The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14. Note that step S60 and the following step S70 are performed in parallel with the generation of the first response in the response generation unit 13.
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In step S80, the response generation unit 13 completes the generation of the first response.
  • The dialogue control unit 14 then causes the system 200 to present the first response to the user. That is, the response presentation device 22 presents the first response to the user.
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section detected by the voice section detection unit 11.
  • The voice recognition unit 12 outputs the voice recognition result of the second voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end of recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice input from the voice recognition unit 12, based on control from the dialogue control unit 14.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • The dialogue control unit 14 then causes the system 200 to present the second response to the user. That is, the response presentation device 22 presents the second response to the user.
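  • The essential property of FIG. 4, namely that detection and recognition of the second voice (steps S60 and S70) run in parallel with generation of the first response (steps S50 to S80), could be realized for instance with a worker thread per response, as in the sketch below. This thread-based design is my own assumption; the patent does not mandate it.

```python
# Hedged sketch: responses are generated on worker threads so that the main
# loop can go on recognizing the next detected voice section in the meantime.
import queue
import threading

def run_dialogue(sections: "queue.Queue", recognize, generate, present):
    """Consume detected voice sections; never block detection or recognition
    of utterance N+1 on response generation for utterance N."""
    workers = []
    while True:
        section = sections.get()      # S10/S60: a detected voice section
        if section is None:           # sentinel: no more input
            break
        result = recognize(section)   # S20-S40 / S70-S100
        # S50/S110: generate and present the response on a worker thread.
        t = threading.Thread(target=lambda r=result: present(generate(r)))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
```

  • With this structure, the first response is generated and presented on its own thread while the main loop is already recognizing the second voice, mirroring how steps S60 and S70 overlap steps S50 to S80 in the chart.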
  • As described above, the voice interaction control device 100 according to the first embodiment performs dialogue control for causing the system 200 to present to the user a response to voice input from the user when the user operates the system 200 by interaction between the user and the system 200.
  • The device includes the voice section detection unit 11 that detects a voice section from the beginning to the end of an input series of voice, the voice recognition unit 12 that recognizes the voice in the voice section, the response generation unit 13 that generates a response corresponding to the voice recognition result to be presented to the user from the system 200, and the dialogue control unit 14 that controls the voice section detection unit 11, the voice recognition unit 12, and the response generation unit 13.
  • The dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section so that a second response to a series of second voice input after the first voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice until the first response corresponding to the speech recognition result of the first voice is presented from the system 200 to the user, is not completed.
  • With such a configuration, the voice interaction control device 100 can perform dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • For example, the voice interaction control apparatus 100 can generate a response, without omission, to a second voice input immediately after the end of the first voice section.
  • Further, since the voice dialogue control apparatus 100 constantly inputs voice to perform voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
  • The voice dialogue control method according to the first embodiment is a method for dialogue control that includes detecting a voice section from the beginning to the end of an input series of voice, recognizing the speech in the voice section, generating a response corresponding to the speech recognition result to be presented to the user from the system 200, and controlling each of the voice section detection, the speech recognition, and the response generation.
  • In this method, the second voice section forming the second voice is detected so that a second response to a series of second voice input as voice after the first voice can be generated even if the processing for the first voice, including processing from detection of the first voice section forming a series of first voice input as voice until the first response corresponding to the voice recognition result of the first voice is presented to the user, is not finished.
  • With the voice interaction control method configured in this way, it is possible to perform dialogue control so that the system can appropriately respond to the second voice input after the first voice.
  • With this voice dialogue control method, it is also possible to generate a response, without omission, to a second voice input immediately after the end of the first voice section.
  • Further, since voice is constantly input to perform voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
  • Embodiment 2. A voice dialogue control apparatus and a voice dialogue control method according to the second embodiment will be described.
  • FIG. 6 is a block diagram showing the configurations of the voice interaction control device 101 and the system 200 in the second embodiment.
  • the system 200 includes a dictionary database storage device 23 in addition to the configuration shown in the first embodiment.
  • the voice recognition unit 12 of the voice dialogue control device 101 refers to the dictionary database stored in the dictionary database storage device 23 to perform voice recognition.
  • The voice dialogue control device 101 further includes a voice storage unit 15.
  • the voice storage unit 15 stores the voice in the voice section detected by the voice section detection unit 11.
  • In the present embodiment, the voice storage unit 15 stores the second voice in the second voice section; however, the present invention is not limited thereto, and the voice storage unit 15 may also store the first voice in the first voice section.
  • The dialogue control unit 14 causes the voice recognition unit 12 to perform voice recognition of the second voice stored in the voice storage unit 15 based on a notification indicating that the voice recognition unit 12 has finished voice recognition of the first voice, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice. Further, the dialogue control unit 14 causes the response generation unit 13 to generate the second response based on a notification indicating that generation of the first response is completed in the response generation unit 13.
  • The response generation unit 13 generates each response corresponding to each speech recognition result by referring to the system response database.
  • FIG. 7 is a diagram showing an example of the configuration of the system response database in the second embodiment.
  • The system response database is composed of recognition vocabularies contained in speech recognition results and responses corresponding to them. Depending on the configuration of the response presentation device 22 that presents the response to the user, a plurality of responses may be associated with one recognition vocabulary.
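  • A minimal sketch of such a system response database, using the vocabularies of this embodiment and assuming (as an illustration) one response per kind of response presentation device 22:

```python
# Hypothetical lookup table in the spirit of FIG. 7: recognition vocabulary
# -> responses, one per presentation device type.
SYSTEM_RESPONSE_DB: dict[str, dict[str, str]] = {
    "supermarket": {
        "speaker": "Display the search results for the supermarket.",
        "display": "Search results for supermarkets",
    },
    "convenience store": {
        "speaker": "Display the search results for the convenience store.",
        "display": "Search results for convenience stores",
    },
}

def generate_response(recognition_result: str, device: str) -> str | None:
    """Return the response for the first recognition vocabulary found in the
    recognition result, or None if no vocabulary matches."""
    for vocabulary, responses in SYSTEM_RESPONSE_DB.items():
        if vocabulary in recognition_result:
            return responses.get(device)
    return None
```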
  • Each function of the voice storage unit 15 and the dialogue control unit 14 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice storage unit 15 and the dialogue control unit 14 having the respective functions described above.
  • the function of the voice storage unit 15 is realized by the memory 52, for example.
  • The program stored in the memory 52 describes a function and operation of storing the second voice in the second voice section and, based on a notification indicating that voice recognition of the first voice is finished, recognizing the second voice stored in the memory 52 and generating a second response corresponding to the speech recognition result of the second voice.
  • The program also describes a function and operation of generating the second response based on a notification indicating that the generation of the first response is completed.
  • FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control apparatus 101 and the voice interaction control method according to the second embodiment.
  • FIG. 9 is a flowchart showing an example of the operation of the voice interaction control apparatus 101 and the voice interaction control method according to the second embodiment.
  • In the following, an example is shown in which the second voice is input during speech recognition of the first voice; the second voice may also be input during generation of the first response.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "I want to go to the supermarket" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In the present embodiment, the voice recognition unit 12 starts speech recognition of the first voice with reference to the dictionary database.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S32, the voice section detection unit 11 receives the second voice and detects the beginning of the second voice section.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S34, the dialogue control unit 14 causes the voice storage unit 15 to start storing the second voice, based on the notification of the detection of the beginning of the second voice section. Illustration of the operation regarding this notification is omitted.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • In the present embodiment, "supermarket" is included as a recognition vocabulary in the speech recognition result of the first voice.
  • The voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • The dialogue control unit 14 controls the following step S50, step S62, and step S70 to be executed based on that notification.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generating the first response.
  • In step S62, the voice recognition unit 12 starts reading the second voice from the voice storage unit 15, based on control from the dialogue control unit 14.
  • That is, while storing the second voice in the second voice section, the voice storage unit 15 outputs the already-stored portion of the second voice to the voice recognition unit 12 with a time difference.
  • Steps S62 through S73 below are executed in parallel with the generation of the first response in the response generation unit 13.
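  • This store-while-reading behavior of the voice storage unit 15 is essentially that of a thread-safe FIFO, as in the following sketch (my own illustration; the frame format and method names are assumptions):

```python
# Hypothetical realization of the voice storage unit 15: frames of the second
# voice are appended as they arrive, while the voice recognition unit reads
# already-stored frames with a time difference.
import queue

class VoiceStorageUnit:
    def __init__(self) -> None:
        self._frames: "queue.Queue[bytes | None]" = queue.Queue()

    def store_frame(self, frame: bytes) -> None:
        """Called while the second voice section is still being detected."""
        self._frames.put(frame)

    def end_of_section(self) -> None:
        """Step S72: storage of the second voice ends."""
        self._frames.put(None)  # sentinel marking the end of the section

    def read_frames(self):
        """Steps S62 to S73: yield frames in stored order until the section
        ends; reading may begin while storing is still in progress."""
        while (frame := self._frames.get()) is not None:
            yield frame
```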
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section read from the voice storage unit 15, based on the notification of the beginning detection.
  • Because the voice recognition unit 12 starts voice recognition of the second voice based on the notification that voice recognition of the first voice is finished, it can start voice recognition of the second voice after the voice recognition of the first voice.
  • In the present embodiment, the voice recognition unit 12 starts speech recognition of the second voice with reference to the dictionary database.
  • In step S71, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S72, the voice storage unit 15 ends the storage of the second voice.
  • In step S73, the reading of the second voice from the voice storage unit 15 ends.
  • In step S80, the response generation unit 13 completes the generation of the first response.
  • In the present embodiment, the response generation unit 13 generates a first response including "Display the search results for the supermarket" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the first response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the first response to the user by outputting the voice "Display the search results for the supermarket" according to the first response.
  • When the response presentation device 22 is a display device, the display device presents the first response to the user by displaying "Display the search results for the supermarket".
  • Alternatively, the response generation unit 13 may generate a first response including a control signal for searching for supermarkets.
  • In that case, a destination search unit included in the system 200 searches for supermarkets based on the first response, and the response presentation device 22 presents the supermarket search results to the user.
  • the response generation unit 13 notifies the dialogue control unit 14 that the generation of the first response is completed.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section.
  • In the present embodiment, "convenience store" is included as a recognition vocabulary in the speech recognition result of the second voice. Further, the voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice input from the voice recognition unit 12, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generation of the second response.
  • Note that step S110 is performed after the generation of the first response is completed. That is, the dialogue control unit 14 controls the process of step S110 to be executed based on the notification that the generation of the first response is completed.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • In the present embodiment, the response generation unit 13 generates a second response including "Display the search results for the convenience store" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the second response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Display the search results for the convenience store" according to the second response.
  • When the response presentation device 22 is a display device, the display device presents the second response to the user by displaying "Display the search results for the convenience store" according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for searching for convenience stores.
  • In that case, the destination search unit included in the system 200 searches for convenience stores based on the second response, and the response presentation device 22 presents the convenience store search results to the user.
  • The voice stored in the voice storage unit 15 is not limited to the second voice.
  • The voice storage unit 15 may also store the first voice. That is, the voice dialogue control device 101 may once store the first voice of the first voice section detected by the voice section detection unit 11 in the voice storage unit 15, read it out after a predetermined time elapses, and cause the voice recognition unit 12 to perform speech recognition on it.
  • the voice dialogue control device 101 further includes the voice storage unit 15 that stores the second voice in the second voice section detected by the voice section detection unit 11.
  • The dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15, based on the notification indicating that the voice recognition unit 12 has finished the voice recognition of the first voice, and causes the response generation unit 13 to generate a second response corresponding to the speech recognition result of the second voice.
  • With such a configuration, the voice interaction control apparatus 101 can acquire the second voice even during processing of the first voice, for example, during voice recognition or response generation. That is, the voice interaction control apparatus 101 can generate an appropriate response to each of a plurality of voices uttered by the user at arbitrary timings.
  • Further, the dialogue control unit 14 of the voice dialogue control device 101 causes the response generation unit 13 to generate the second response corresponding to the speech recognition result of the second voice in the second voice section, based on the notification indicating that the generation of the first response is completed in the response generation unit 13.
  • With such a configuration, the voice interaction control apparatus 101 can sequentially present to the user both the first response to the first voice and the second response to the second voice. For example, if, immediately after the system inputs the first voice "I want to go to the supermarket" and starts processing, the user utters the second voice "I want to go to the convenience store after all", it is conceivable that a conventional system could present only the response showing the supermarket search results because it cannot recognize the second voice. However, the voice interaction control apparatus 101 according to the present embodiment can input both the first voice and the second voice, and can present both the supermarket search results and the convenience store search results.
  • Embodiment 3. A voice dialogue control apparatus and a voice dialogue control method according to the third embodiment will be described.
  • FIG. 10 is a block diagram showing configurations of the voice dialogue control device 102 and the system 200 in the third embodiment.
  • The voice dialogue control apparatus 102 further includes a dialogue state determination unit 16.
  • The dialogue state determination unit 16 determines whether the speech recognition result of the second voice recognized by the voice recognition unit 12 is one that updates the speech recognition result of the first voice.
  • Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice midway and causes the response generation unit 13 to generate the second response.
  • Each function of the dialogue state determination unit 16 and the dialogue control unit 14 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the dialogue state determination unit 16 and the dialogue control unit 14 having the respective functions described above.
  • FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control apparatus 102 and the voice interaction control method according to the third embodiment.
  • FIG. 12 is a flowchart showing an example of the operation of the voice interaction control apparatus 102 and the voice interaction control method according to the third embodiment. In the following description, the description of the operation of the voice storage unit 15 is omitted, but the operation is the same as that of the second embodiment.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "I want to go to a convenience store" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S20, the voice recognition unit 12 starts voice recognition of the first voice from the beginning of the first voice section detected by the voice section detection unit 11, based on the notification of the beginning detection.
  • In the present embodiment, the voice recognition unit 12 performs speech recognition with reference to the dictionary database.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S40, the voice recognition unit 12 ends voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11, based on the notification of the end detection.
  • In the present embodiment, "convenience store" is included as a recognition vocabulary in the speech recognition result of the first voice. Further, the voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S50, the response generation unit 13 starts generation of a first response corresponding to the speech recognition result of the first voice, based on control from the dialogue control unit 14.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generating the first response.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice.
  • In the present embodiment, "I want to go to a restaurant after all" uttered by the user is input as the second voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11.
  • In the present embodiment, the voice recognition unit 12 performs speech recognition with reference to the dictionary database stored in the dictionary database storage device 23.
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • In step S100, the voice recognition unit 12 ends the voice recognition of the second voice up to the end of the second voice section.
  • In the present embodiment, "restaurant" is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The voice recognition unit 12 notifies the dialogue control unit 14 of the end of the voice recognition.
  • In step S102, the dialogue state determination unit 16 determines whether the speech recognition result of the second voice is one that updates the speech recognition result of the first voice, and outputs the determination result to the dialogue control unit 14. In the present embodiment, it is determined whether the speech recognition result of the second voice including "restaurant" is one that updates the speech recognition result of the first voice including "convenience store". If it is determined that the update is not to be performed, step S104 is executed. If it is determined that the update is to be performed, step S106 is executed. In the present embodiment, the dialogue state determination unit 16 determines that the speech recognition result of the second voice including "restaurant" updates the speech recognition result of the first voice including "convenience store".
  • For example, the dialogue state determination unit 16 may determine the necessity of updating based on the parallel relation between the vocabularies "convenience store" and "restaurant", or based on other vocabulary included in the second voice, for example, the adversative conjunction "after all".
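  • As an illustration, the two cues mentioned above, a parallel relation between destination vocabularies and an adversative conjunction, could be checked as follows. The vocabulary set and marker list are assumptions of this sketch, not data from the patent:

```python
# Hypothetical update check for the dialogue state determination unit 16.
DESTINATION_VOCABULARY = {"supermarket", "convenience store", "restaurant"}
ADVERSATIVE_MARKERS = ("after all",)  # cues that the user is correcting

def updates_first_voice(first_result: str, second_result: str) -> bool:
    first_dest = {v for v in DESTINATION_VOCABULARY if v in first_result}
    second_dest = {v for v in DESTINATION_VOCABULARY if v in second_result}
    # Parallel relation: both utterances name the same kind of item
    # (here, a destination), so the later one likely replaces the earlier.
    parallel = bool(first_dest) and bool(second_dest)
    # An adversative conjunction in the second voice signals a correction.
    adversative = any(m in second_result for m in ADVERSATIVE_MARKERS)
    return parallel or adversative
```

  • For the utterances of this embodiment, updates_first_voice("I want to go to a convenience store", "I want to go to a restaurant after all") returns True, so step S106 would be taken.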
  • In step S104, the response generation unit 13 completes the generation of the first response under the control of the dialogue control unit 14 based on the determination result, and the response presentation device 22 presents the first response to the user.
  • In this case, the same response presentation as in step S80 of the second embodiment is performed.
  • Then, a response to the second voice is presented by the response presentation device 22 after step S110.
  • In step S106, based on the determination result, the dialogue control unit 14 terminates the processing for the first voice midway.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • In the present embodiment, the response generation unit 13 refers to the system response database shown in FIG. 7 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • In the present embodiment, the response generation unit 13 generates a second response including "Display the search results for the restaurant" as information for voice output or display output.
  • The dialogue control unit 14 controls so that the second response is presented from the response presentation device 22 to the user.
  • When the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Display the search results for the restaurant" according to the second response.
  • When the response presentation device 22 is a display device, the display device presents the second response to the user by displaying "Display the search results for the restaurant" according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for searching for restaurants.
  • In that case, the destination search unit included in the system 200 starts a restaurant search based on the second response, and the response presentation device 22 displays the restaurant search results.
  • In this way, when the second voice updates the first voice, the dialogue control unit 14 cancels the processing for the first voice midway and controls so that only the second response corresponding to the second voice is generated. Thereby, only the second response is presented by the response presentation device 22.
  • As described above, the voice interaction control device 102 further includes the dialogue state determination unit 16 that determines whether the speech recognition result of the second voice in the second voice section recognized by the voice recognition unit 12 is one that updates the speech recognition result of the first voice. Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice midway and causes the response generation unit 13 to generate the second response.
  • With such a configuration, the voice interaction control device 102 terminates the processing for the first voice midway and responds to the second voice, so the user's operability can be enhanced.
  • For example, if, immediately after the system inputs the first voice "I want to go to a convenience store" and starts processing, the user utters the second voice "I want to go to a restaurant after all", it is conceivable that a conventional system could present only the response showing the convenience store search results because it cannot recognize the second voice.
  • In contrast, the speech dialogue control device 102 in the third embodiment searches for a restaurant in accordance with the user's true intention, that is, the second voice, and can present the result earlier than the voice interaction control device 101 according to the second embodiment.
  • Embodiment 4. A voice dialogue control apparatus and a voice dialogue control method according to the fourth embodiment will be described.
  • FIG. 13 is a block diagram showing the configurations of the voice interaction control device 103 and the system 200 in the fourth embodiment.
  • In the fourth embodiment, the dictionary database storage device 23 of the system 200 stores a plurality of dictionary databases.
  • In the present embodiment, the dictionary database storage device 23 stores a first dictionary database 24 and a second dictionary database 25.
  • the first dictionary database 24 is a dictionary database prepared corresponding to the standby state of the system 200.
  • the standby state is, for example, a state in which the voice input device 21 of the system 200 can receive an operation by the user, that is, a state in which the input of the first voice is awaited.
  • In the standby state, the display device, which is another user interface included in the system 200, displays, for example, a menu screen.
  • the second dictionary database 25 is a dictionary database that corresponds to the state after the system 200 has recognized the first speech, and is associated with a specific vocabulary included in the speech recognition result of the first speech.
  • the speech recognition unit 12 performs speech recognition with reference to one dictionary database corresponding to the state of the system 200 among a plurality of dictionary databases.
  • For example, when the system 200 is in the standby state, the voice recognition unit 12 recognizes the first voice by referring to the first dictionary database 24 as the one dictionary database corresponding to the standby state. Alternatively, when the system 200 is in the standby state, the voice recognition unit 12 may perform speech recognition by selecting, from among all the dictionary databases, the first dictionary database 24 as the one dictionary database corresponding to the standby state.
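  • One way to realize this state-dependent selection is sketched below; the data shapes and the "standby" label are assumptions of the illustration:

```python
# Hypothetical dictionary selection: the first dictionary database 24 in the
# standby state, otherwise a second dictionary database 25 keyed by the
# specific vocabulary found in the first recognition result.
FIRST_DICTIONARY = {"state": "first screen", "vocabulary": ["Play", "Stop"]}
SECOND_DICTIONARIES = {
    # specific vocabulary -> dictionary used for the following utterance
    "Play": {"main_state": "Play", "vocabulary": ["Music", "Video"]},
}

def select_dictionary(system_state: str, first_result: str | None) -> dict:
    if system_state == "standby" or first_result is None:
        return FIRST_DICTIONARY
    for specific_vocabulary, dictionary in SECOND_DICTIONARIES.items():
        if specific_vocabulary in first_result:
            return dictionary
    return FIRST_DICTIONARY  # fall back when no specific vocabulary matched
```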
  • FIG. 14 is a diagram showing an example of a configuration of the first dictionary database 24 in the fourth embodiment.
  • As shown in FIG. 14, the first dictionary database 24 includes the state of the system 200 and the recognition vocabulary.
  • The "first screen" in FIG. 14 is a standby screen such as a menu screen.
  • When the system 200 is in a state after speech recognition of the first voice and a specific vocabulary is included in the speech recognition result of the first voice, the voice recognition unit 12 recognizes the second voice by referring, as the one dictionary database corresponding to that state, to the second dictionary database 25 associated with that specific vocabulary.
  • For example, the voice recognition unit 12 or the dialogue control unit 14 determines whether the specific vocabulary is included in the voice recognition result of the first voice after voice recognition of the first voice, and when it is determined that the specific vocabulary is included, the dictionary database referred to is switched.
  • In other words, the voice recognition unit 12 has a function of switching the dictionary database used for speech recognition according to the state of the system 200.
  • FIG. 15 is a diagram showing an example of a configuration of the second dictionary database 25 in the fourth embodiment.
  • the second dictionary database 25 includes the main state of the system 200, the related state of the system 200, and the recognition vocabulary.
  • the response generation unit 13 generates a response corresponding to the speech recognition result of speech and information of one dictionary database referred to for speech recognition of the speech. For example, the response generation unit 13 generates a first response corresponding to the speech recognition result of the first speech and the information of the first dictionary database 24 referred to for speech recognition of the first speech. Alternatively, for example, the response generation unit 13 generates a second response corresponding to the speech recognition result of the second speech and the information of the second dictionary database 25 referred to for speech recognition of the second speech.
  • FIG. 16 is a diagram showing an example of a configuration of a system response database in the fourth embodiment.
  • The system response database is composed of recognition vocabularies contained in speech recognition results, information of the dictionary database referred to for the speech recognition, and responses corresponding to these.
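  • A minimal sketch of such a database, keyed by both the recognition vocabulary and the dictionary referred to, so that the same vocabulary can yield different responses in different dialogue states (the entries here are illustrative, not taken from FIG. 16):

```python
# Hypothetical system response database in the spirit of FIG. 16.
SYSTEM_RESPONSE_DB_EMB4: dict[tuple[str, str], str] = {
    ("Play", "first dictionary database"): "Which media do you want to play?",
    ("Music", "second dictionary database"): "Playing music.",
    ("Video", "second dictionary database"): "Playing video.",
}

def lookup_response(vocabulary: str, dictionary_name: str) -> str | None:
    """Look up the response by vocabulary and by which dictionary was used."""
    return SYSTEM_RESPONSE_DB_EMB4.get((vocabulary, dictionary_name))
```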
  • Each function of the voice recognition unit 12 and the response generation unit 13 described above is realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice recognition unit 12 and the response generation unit 13 having the respective functions described above.
  • The program stored in the memory 52 describes functions and operations of performing speech recognition on voice by referring to one of a plurality of dictionary databases, and of generating a response corresponding to the speech recognition result of the voice and the information of the one dictionary database referred to for the speech recognition of the voice.
  • The program also describes functions and operations of recognizing the first voice by referring to the first dictionary database 24 prepared corresponding to the standby state of the system 200, and of recognizing the second voice by referring to the second dictionary database 25 associated with the specific vocabulary included in the speech recognition result of the first voice.
  • Further, the program describes a function and operation of generating a second response corresponding to the speech recognition result of the second voice and the information of the second dictionary database 25.
  • FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control apparatus 103 and the voice interaction control method according to the fourth embodiment.
  • FIG. 18 is a flow chart showing an example of the operation of the voice interaction control apparatus 103 and the voice interaction control method according to the fourth embodiment.
  • In the following description, the description of the operation of the voice storage unit 15 is omitted, but the operation is the same as that of the second embodiment.
  • In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section.
  • In the present embodiment, "Play" uttered by the user is input as the first voice.
  • The detected beginning is notified to the voice recognition unit 12 or the dialogue control unit 14.
  • Next, the voice recognition unit 12 selects the first dictionary database 24 corresponding to the standby state of the system 200.
  • That is, the voice recognition unit 12 acquires information indicating that the system 200 is in the standby state, and selects the first dictionary database 24 shown in FIG. 14 from among the plurality of dictionary databases based on that information.
  • For example, the information indicating the standby state acquired by the voice recognition unit 12 is information that the first screen is being displayed.
  • In step S24, the speech recognition unit 12 refers to the first dictionary database 24 and starts speech recognition of the first voice after the beginning of the first voice section detected by the voice section detection unit 11.
  • Alternatively, based on the information that the system 200 is in the standby state, the speech recognition unit 12 may recognize the first voice with reference not only to the first dictionary database 24 corresponding to the standby state but to all of the dictionary databases.
  • In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S40, based on the notification of the end detection, the speech recognition unit 12 ends the speech recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11.
  • Here, the speech recognition result of the first voice includes “reproduction” as a recognition vocabulary.
  • In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice.
  • Here, “music” uttered by the user is input as the second voice.
  • The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • Next, the speech recognition unit 12 selects the second dictionary database 25, which corresponds to the state after the system 200 has recognized the first voice and in which a specific vocabulary is included in the speech recognition result of the first voice.
  • That is, the speech recognition unit 12 determines whether the speech recognition result of the first voice includes a specific vocabulary and, when it determines that one is included, selects from among the plurality of dictionary databases the second dictionary database 25 associated with that vocabulary.
  • In this example, the speech recognition unit 12 determines whether the speech recognition result of the first voice includes the specific vocabulary “reproduction”, determines that it is included, and recognizes the second voice with reference to the second dictionary database 25 shown in FIG. 15.
  • In step S76, the speech recognition unit 12 refers to the second dictionary database 25 and starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
  • In other words, the speech recognition unit 12 has a function of switching the dictionary database used for speech recognition from the first dictionary database 24 to the second dictionary database 25 according to the state of the system 200.
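  • A minimal sketch of this switching function is shown below, assuming illustrative vocabulary entries; the actual contents of the dictionary databases in FIG. 14 and FIG. 15 are not reproduced here.

```python
from typing import Optional, Set

# Sketch of the dictionary switching performed by the speech recognition
# unit 12: the first dictionary database 24 is referenced in the standby
# state, and the second dictionary database 25 associated with a specific
# vocabulary is referenced after that vocabulary appears in the speech
# recognition result of the first voice. Vocabulary entries are assumptions.
FIRST_DICTIONARY_DB: Set[str] = {"reproduction"}             # standby state
SECOND_DICTIONARY_DBS = {
    "reproduction": {"music", "video"},                      # after "reproduction"
}

def select_dictionary(system_state: str, first_result: Optional[str]) -> Set[str]:
    """Select the dictionary database according to the state of the system 200."""
    if system_state == "standby":
        return FIRST_DICTIONARY_DB
    if first_result in SECOND_DICTIONARY_DBS:
        return SECOND_DICTIONARY_DBS[first_result]
    return FIRST_DICTIONARY_DB                               # fall back to standby

print(select_dictionary("standby", None))                    # first utterance
print(select_dictionary("after_first_voice", "reproduction"))  # second utterance
```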
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section.
  • The speech recognition result of the second voice and the information of the second dictionary database 25 referred to for the speech recognition of the second voice are output to the response generation unit 13.
  • Note that “music” is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The speech recognition unit 12 notifies the dialogue control unit 14 of the end of the speech recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • Specifically, the response generation unit 13 refers to the system response database shown in FIG. 16 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • Since the recognition vocabulary is “music” and the dictionary database information is “second dictionary database”, the response generation unit 13 generates a second response including “Play music” as information for voice output.
  • The dialogue control unit 14 controls the response presentation device 22 so that the second response is presented to the user.
  • For example, the speaker included in the response presentation device 22 presents the second response to the user by outputting the voice “Play music” according to the second response.
  • Alternatively, the response generation unit 13 may generate a second response including a control signal for causing the music reproduction device included in the response presentation device 22 to reproduce music, and the music reproduction device may reproduce music based on that second response.
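  • A response can thus carry information for voice output, a control signal, or both. The following is a minimal sketch of such a response object; the field names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    """Sketch of a response generated by the response generation unit 13.

    A response may carry text for voice output, a control signal for a
    device such as a music reproduction device, or both. Field names and
    signal strings are assumptions for illustration.
    """
    voice_output: Optional[str] = None     # e.g. "Play music"
    control_signal: Optional[str] = None   # e.g. "music_player.play"

second_response = Response(voice_output="Play music",
                           control_signal="music_player.play")
print(second_response)
```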
  • Although detailed description is omitted, the speech recognition result of the first voice and the information of the first dictionary database 24 referred to for the speech recognition of the first voice are likewise output to the response generation unit 13. Since the recognition vocabulary is “reproduction” and the dictionary database information is “first dictionary database”, the response generation unit 13 generates a first response including “What to play?” as information for voice output or display output, and the response presentation device 22 presents the first response to the user.
  • As described above, the speech recognition unit 12 of the voice interaction control apparatus 103 recognizes a voice by referring to one of a plurality of dictionary databases selected according to the state of the system.
  • The response generation unit 13 generates a response corresponding to the speech recognition result of the voice and the information of the dictionary database referred to for its speech recognition.
  • With this configuration, the voice interaction control apparatus 103 can switch the dictionary database referenced in speech recognition according to the state of the system 200, that is, according to the interaction state, and can thereby generate an accurate response to the user's utterance.
  • Further, the speech recognition unit 12 of the voice interaction control apparatus 103 recognizes the first voice by referring to the first dictionary database 24, prepared corresponding to the standby state of the system 200, among the plurality of dictionary databases.
  • It then recognizes the second voice by referring to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • The response generation unit 13 generates a second response corresponding to the speech recognition result of the second voice and the information of the second dictionary database referred to for the speech recognition of the second voice.
  • With this configuration, the voice interaction control apparatus 103 can generate a response reflecting the contents of both the first voice and the second voice, and can thus generate an accurate response to the user's utterances.
  • For example, when the user utters “music” immediately after “reproduction”, a conventional system cannot recognize the second voice, and it is conceivable that it would merely present a response asking the user what to play.
  • In contrast, since the voice interaction control apparatus 103 in the present embodiment recognizes the second voice by referring to the second dictionary database related to the speech recognition result of the first voice, the music can be played back in accordance with the user's intention.
  • Embodiment 5. A voice dialogue control apparatus and a voice dialogue control method according to the fifth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
  • FIG. 19 is a block diagram showing configurations of the voice dialogue control device 104 and the system 200 in the fifth embodiment.
  • In the fifth embodiment, the response generation unit 13 further includes a confirmation response generation unit 17 that generates a confirmation response for causing the user to select one response from among a plurality of responses generated corresponding to a speech recognition result.
  • The dialogue control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user according to the confirmation response, and causes the system 200 to present it to the user.
  • Each function of the confirmation response generation unit 17 and the response generation unit 13 described above is realized by, for example, the processing circuit shown in FIG. 2 or 3.
  • The program stored in the memory 52 describes functions and operations for generating a confirmation response for causing the user to select one response from among the plurality of responses generated corresponding to a speech recognition result.
  • The program also describes functions and operations for causing the system 200 to present the confirmation response to the user, generating the one response corresponding to the voice input by the user according to the confirmation response, and causing the system 200 to present it to the user.
  • FIG. 20 is a flow chart showing an example of the operation of the speech dialog control device 104 and the speech dialog control method according to the fifth embodiment.
  • Steps S10 to S110 are the same as in the fourth embodiment, and therefore their description is omitted.
  • In step S112, the response generation unit 13 determines whether a plurality of second responses corresponding to the speech recognition result of the second voice can be generated. For example, if the system 200 includes both a portable device for playing music and a CD (Compact Disc) player, the response generation unit 13 can generate a second response including a control signal for playing the music stored in the portable device and a second response including a control signal for playing the music stored on the CD. If it is determined that a plurality of second responses are not to be generated, step S120 is performed; in this case, the processes after step S120 are the same as in the fourth embodiment. If it is determined that a plurality of second responses are to be generated, step S122 is performed.
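  • A minimal sketch of the determination in step S112 follows, assuming a hypothetical response structure and the portable device / CD player example above:

```python
# Sketch of step S112: collect one candidate second response per playback
# source available in the system. The source names follow the portable
# device / CD player example; the response structure is an assumption.
def candidate_second_responses(recognized: str, sources: list) -> list:
    """Return one candidate response per source able to handle the request."""
    if recognized != "music":
        return []
    return [{"control_signal": f"{source}.play"} for source in sources]

candidates = candidate_second_responses("music", ["portable_device", "cd_player"])
needs_confirmation = len(candidates) > 1   # step S112: plural responses possible
print(needs_confirmation)                  # True -> proceed to step S122
```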
  • In step S122, the confirmation response generation unit 17 generates a confirmation response for causing the user to select one second response from among the plurality of second responses generated corresponding to the speech recognition result of the second voice.
  • For example, the confirmation response generation unit 17 generates a confirmation response including “Do you want to play music on the portable device or music on the CD?” as information for voice output or display output.
  • In step S124, the dialogue control unit 14 causes the response presentation device 22 to present the confirmation response to the user.
  • The response presentation device 22 presents “Do you want to play music on the portable device or music on the CD?” to the user, and the user re-inputs a voice for operating the system according to the confirmation response.
  • Thereafter, the voice interaction control apparatus 104 generates the one second response by voice recognition and response generation similar to the above steps.
  • For example, the response presentation device 22 plays the music on the portable device, thereby presenting the selected second response to the user.
  • As described above, the response generation unit 13 of the voice interaction control apparatus 104 further includes the confirmation response generation unit 17, which generates a confirmation response for causing the user to select one response from among a plurality of responses generated corresponding to a speech recognition result.
  • The dialogue control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user according to the confirmation response, and causes the system 200 to present it to the user.
  • With this configuration, the voice interaction control apparatus 104 can ask the user for confirmation when there is ambiguity in the interaction between the user and the system.
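  • The following sketch illustrates this confirmation flow under the portable device / CD example; the prompt wording follows the example above, while the candidate keys and the matching logic are assumptions:

```python
# Sketch of the fifth embodiment's confirmation flow: when several
# responses are possible, the confirmation response generation unit 17
# builds a question, and the user's re-input selects the one response
# that is actually presented.
def generate_confirmation(candidates: dict) -> str:
    options = " or ".join(f"music on the {name}" for name in candidates)
    return f"Do you want to play {options}?"

def select_response(candidates: dict, user_reply: str) -> str:
    """Pick the single response matching the user's re-input."""
    for name, response in candidates.items():
        if name in user_reply:
            return response
    return generate_confirmation(candidates)   # ask again if still ambiguous

candidates = {"portable device": "portable_device.play", "CD": "cd_player.play"}
print(generate_confirmation(candidates))
# -> "Do you want to play music on the portable device or music on the CD?"
print(select_response(candidates, "portable device"))  # -> "portable_device.play"
```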
  • Embodiment 6. A voice dialogue control apparatus and a voice dialogue control method according to the sixth embodiment will be described.
  • The configurations of the voice interaction control device 104 and the system 200 in the sixth embodiment are the same as in the fourth embodiment.
  • The dialogue control unit 14 in the present embodiment determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value.
  • When the elapsed time is equal to or greater than the specific value, the dialogue control unit 14 causes the second voice to be recognized by referring to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases.
  • When the elapsed time is less than the specific value, the dialogue control unit 14 causes the second voice to be recognized by referring to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • In other words, the dialogue control unit 14 determines the relevance between the first voice and the second voice based on whether the elapsed time is equal to or greater than the threshold, and has a response generated to be presented to the user accordingly.
  • The above-described function of the dialogue control unit 14 is realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3.
  • The program stored in the memory 52 describes functions and operations for determining whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value and, based on the determination, causing the second voice to be recognized by referring either to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases, or to the second dictionary database 25, which corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • FIG. 21 is a flow chart showing an example of the operation of the voice interaction control apparatus 104 and the voice interaction control method according to the sixth embodiment. Steps S10 to S60 in the present embodiment are the same as in the fourth embodiment, and therefore their description is omitted.
  • In step S64, the dialogue control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. If it is determined that the elapsed time is not equal to or greater than the specific value, that is, if it is determined that the utterances are related, step S74 is executed. If it is determined that the elapsed time is equal to or greater than the specific value, that is, if it is determined that there is no relevance between the utterances, step S70 is executed.
  • In steps S74 and S76, the speech recognition unit 12 starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
  • In this case, the speech recognition unit 12 recognizes the second voice by referring to the second dictionary database 25 associated with the specific vocabulary included in the speech recognition result of the first voice.
  • Each process after step S74 is the same as the corresponding process in the fourth embodiment shown in FIG. 18.
  • In step S70, the speech recognition unit 12 likewise starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11. In this case, however, the speech recognition unit 12 performs the speech recognition with reference to the first dictionary database 24 prepared corresponding to the standby state of the system 200.
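  • A minimal sketch of the determination in step S64 and the resulting choice of dictionary database follows; the threshold value is an assumption, as the specific value is not given in the text:

```python
# Sketch of step S64: if the gap between the end of the first voice
# section and the beginning of the second voice section is at or above a
# threshold, the utterances are treated as unrelated and the first
# dictionary database is used; otherwise the second dictionary database
# is used. The threshold of 3.0 seconds is an illustrative assumption.
RELEVANCE_THRESHOLD_SEC = 3.0  # the "specific value"

def choose_dictionary(first_end: float, second_start: float) -> str:
    elapsed = second_start - first_end
    if elapsed >= RELEVANCE_THRESHOLD_SEC:
        return "first dictionary database"    # no relevance: standby dictionary
    return "second dictionary database"       # related utterances

assert choose_dictionary(10.0, 11.0) == "second dictionary database"
assert choose_dictionary(10.0, 15.0) == "first dictionary database"
```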
  • In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
  • In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section.
  • The speech recognition result of the second voice and the information of the first dictionary database 24 referred to for the speech recognition of the second voice are output to the response generation unit 13. Note that “music” is included as a recognition vocabulary in the speech recognition result of the second voice.
  • The speech recognition unit 12 notifies the dialogue control unit 14 of the end of the speech recognition.
  • In step S110, the response generation unit 13 starts generation of a second response corresponding to the speech recognition result of the second voice.
  • Specifically, the response generation unit 13 refers to the system response database shown in FIG. 16 and starts generation of the second response.
  • In step S120, the response generation unit 13 completes the generation of the second response.
  • Since the recognition vocabulary is “music” and the dictionary database information is “first dictionary database”, the response generation unit 13 generates a second response including “Display music screen” as information for voice output.
  • The dialogue control unit 14 controls the response presentation device 22 so that the second response is presented to the user.
  • For example, the speaker included in the response presentation device 22 presents the second response to the user by outputting the voice “Display music screen” according to the second response.
  • Alternatively, the display device included in the response presentation device 22 may display the music screen based on the second response.
  • As described above, the dialogue control unit 14 of the voice interaction control device 104 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. Based on the determination, it causes the second voice to be recognized by referring either to the first dictionary database 24 prepared corresponding to the standby state of the system 200 among the plurality of dictionary databases, or to the second dictionary database 25, which among the plurality of dictionary databases corresponds to the state after the speech recognition of the first voice and is associated with the specific vocabulary included in the speech recognition result of the first voice.
  • With this configuration, the voice interaction control device 104 can generate an accurate response to the user's utterances by generating a response that takes into account the timing of the user's utterances in addition to the speech recognition results.
  • Embodiment 7. FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device 105 mounted on a vehicle 30.
  • The voice interaction control device 105 is any one of the voice interaction control devices 100 to 104 described in the first to sixth embodiments.
  • The system 200 includes, for example, an on-vehicle device (not shown) such as a navigation device, an audio device, or a PND (Portable Navigation Device).
  • The voice input device (not shown) of the on-vehicle device inputs the voice uttered by the user, the voice interaction control device 105 generates a response corresponding to the voice, and the response presentation device (not shown) of the on-vehicle device presents the response to the user.
  • FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device 105 provided in the server 40.
  • Voice input from a voice input device (not shown) of the communication terminal 32 is received by the communication device 41 of the server 40 via the network and processed by the voice interaction control device 105.
  • The voice interaction control device 105 generates a response corresponding to the voice.
  • The generated response is transmitted from the communication device 41 via the network and presented to the user from the response presentation device (not shown) of the on-vehicle device 31.
  • Alternatively, the response presentation device may be included in the communication terminal 32.
  • The communication terminal 32 is, for example, a mobile phone, a smartphone, or a tablet.
  • Each component of the voice interaction control device 105 may be distributed among the devices constituting the system 200. In that case, each function is realized by the components communicating with one another as appropriate.
  • By providing the functions of the voice interaction control device 105 in the server 40, the configuration of the vehicle 30 or the on-vehicle device 31 can be simplified.
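  • The following is a minimal sketch of the division of roles in the server arrangement of FIG. 23; all function names and bodies are hypothetical stand-ins, and the actual network transport is omitted:

```python
# Sketch of the FIG. 23 arrangement: audio captured in the vehicle is sent
# to the server 40, the voice interaction control device 105 on the server
# generates a response, and the response is returned over the network for
# presentation in the vehicle. Stub bodies below are illustrative only.
def recognize(audio: bytes) -> str:
    return "music"                        # stand-in for the recognition result

def generate_response(vocabulary: str) -> str:
    return {"music": "Play music"}.get(vocabulary, "Please repeat")

def server_handle(audio: bytes) -> str:
    """Runs on the server 40: recognition followed by response generation."""
    return generate_response(recognize(audio))

def vehicle_request(audio: bytes) -> str:
    """Runs in the vehicle: in place of a real network call, invoke directly."""
    return server_handle(audio)

print(vehicle_request(b"\x00\x01"))       # -> "Play music"
```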
  • The embodiments can be freely combined, and each embodiment can be modified or omitted as appropriate.
  • While the present invention has been described in detail, the above description is illustrative in all aspects, and the present invention is not limited thereto. It is understood that countless variations not illustrated are conceivable without departing from the scope of the present invention.

Abstract

The purpose of the present invention is to provide a voice interaction control device for controlling voice interaction so that a system can suitably respond to a second voice input after a first voice. The voice interaction control device according to the present invention causes the system to present to the user a response to a voice input by the user and comprises: a voice segment detection unit for detecting a voice segment of a series of input voices; a voice recognition unit for recognizing the voice in a voice segment; a response generation unit for generating a response corresponding to the voice recognition result; and a voice interaction control unit that controls the voice segment detection unit, the voice recognition unit, and the response generation unit. The voice interaction control unit causes the voice segment detection unit to detect a second voice segment constituting the second voice, so that a second response can be generated for a second series of voices input after the first voice even when the processing for the first voice, including the processing up to the point where the system presents the user with a first response to the first series of voices, has yet to be completed.

Description

Voice dialogue control apparatus and voice dialogue control method
 The present invention relates to a voice interaction control apparatus and a voice interaction control method for causing a system to present a response corresponding to voice input from a user when the user operates the system by interaction between the system and the user.
 A system having a voice recognition function inputs a voice uttered by a user and outputs a response corresponding to the voice. Patent Document 1 proposes a voice dialogue control method in which, when the user inputs an interrupting voice while the system is outputting voice, the voice output is continued or paused depending on the importance of the voice being output, and processing on the interrupting voice is executed.
Japanese Patent Application Laid-Open No. 2004-325848
 However, the system described in Patent Document 1 cannot capture a subsequent second voice at certain timings, for example, immediately after the end detection of a first voice, that is, immediately after the capture of the first voice ends. When the user speaks at such a timing, a discrepancy arises between the system and the user, and the system may make an inappropriate response.
 Even when the user makes a plurality of utterances following the first voice, the system needs to input those utterances appropriately, without dropping any of them, and to respond appropriately.
 The present invention has been made to solve the problems described above, and aims to provide a voice interaction control device that performs interaction control so that the system can appropriately respond to a second voice input after a first voice.
 A voice interaction control device according to the present invention performs interaction control for causing the system to present to the user a response to voice input from the user when the user operates the system by interaction between the user and the system. The device includes: a voice section detection unit that detects a voice section from the beginning to the end of an input series of voice; a voice recognition unit that recognizes the voice in a voice section; a response generation unit that generates a response corresponding to a voice recognition result and to be presented to the user from the system; and a dialogue control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit. Even if the processing for a first voice, including the processing from the detection of the first voice section forming a series of first voice until a first response corresponding to the voice recognition result of the first voice is presented to the user from the system, has not been completed, the dialogue control unit causes the voice section detection unit to detect a second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated.
 According to the present invention, it is possible to provide a voice interaction control device that performs interaction control so that the system can appropriately respond to a second voice input after a first voice.
 The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the voice interaction control device and system in the first embodiment.
FIG. 2 is a diagram showing an example of a processing circuit included in the voice interaction control device.
FIG. 3 is a diagram showing another example of a processing circuit included in the voice interaction control device.
FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
FIG. 5 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the first embodiment.
FIG. 6 is a block diagram showing the configuration of the voice interaction control device and system in the second embodiment.
FIG. 7 is a diagram showing an example of the configuration of the system response database in the second embodiment.
FIG. 8 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
FIG. 9 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the second embodiment.
FIG. 10 is a block diagram showing the configuration of the voice interaction control device and system in the third embodiment.
FIG. 11 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
FIG. 12 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the third embodiment.
FIG. 13 is a block diagram showing the configuration of the voice interaction control device and system in the fourth embodiment.
FIG. 14 is a diagram showing an example of the configuration of the first dictionary database in the fourth embodiment.
FIG. 15 is a diagram showing an example of the configuration of the second dictionary database in the fourth embodiment.
FIG. 16 is a diagram showing an example of the configuration of the system response database in the fourth embodiment.
FIG. 17 is a sequence chart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
FIG. 18 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fourth embodiment.
FIG. 19 is a block diagram showing the configuration of the voice interaction control device and system in the fifth embodiment.
FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the fifth embodiment.
FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device and the voice interaction control method in the sixth embodiment.
FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device mounted on a vehicle in the seventh embodiment.
FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device provided in a server in the seventh embodiment.
 In this specification, embodiments of a voice interaction control device that performs interaction control for causing the system to present to the user a response corresponding to voice input from the user will be described.
Embodiment 1
A voice dialogue control apparatus and a voice dialogue control method according to the first embodiment will be described.
(Configuration)
FIG. 1 is a block diagram showing the configuration of the voice dialogue control apparatus 100 and the system 200 in the first embodiment.
 The system 200 inputs a voice uttered by the user to operate the system 200 and presents a response to the voice to the user. The system 200 includes a voice input device 21, the voice interaction control device 100, and a response presentation device 22. The system 200 is, for example, a navigation system, an audio system, a control system that controls devices related to the driving of a vehicle, or a control system that controls a driving environment.
 The voice input device 21 is an interface for the user to operate the system 200. The voice input device 21 inputs the voice uttered by the user in order to operate the system 200 and outputs the voice to the voice dialogue control device 100. The voice input device 21 is, for example, a microphone.
 The voice interaction control device 100 receives voice from the voice input device 21 and performs interaction control for causing the system 200 to present a response corresponding to the voice to the user.
 The response presentation device 22 presents the response generated by the voice interaction control device 100 to the user. Note that "to present" includes the response presentation device 22 operating in accordance with the generated response; the response presentation device 22 may present the response to the user by operating according to the response generated by the voice interaction control device 100. For example, if the system 200 is a navigation system, the response presentation device 22 is an audio output device or a display device. The audio output device presents a response by, for example, outputting guidance information to a destination by voice; the display device presents a response by, for example, displaying guidance information to a destination together with a map. If the system 200 is an audio system, the response presentation device 22 is a music playback device, which presents a response by playing music. If the system 200 is a control system that controls devices related to the driving of a vehicle, the response presentation device 22 is a drive control device of the vehicle. If the system 200 is a control system that controls the driving environment, the response presentation device 22 is, for example, an air conditioner, a light, a mirror position adjustment device, or a seat position adjustment device.
 The voice dialogue control apparatus 100 includes a voice section detection unit 11, a speech recognition unit 12, a response generation unit 13, and a dialogue control unit 14.
 The voice section detection unit 11 detects a voice section from the beginning to the end of an input series of voice. In the present embodiment, as an example, the voice section detection unit 11 constantly monitors the input voice for detection.
 The speech recognition unit 12 recognizes the voice in the voice section detected by the voice section detection unit 11. In doing so, the speech recognition unit 12 selects recognition vocabularies based on the acoustically or linguistically most probable vocabulary for the voice in the voice section. The speech recognition unit 12 performs speech recognition, for example, with reference to a dictionary database (not shown). The dictionary database may be provided in the voice interaction control apparatus 100 or in an external server. When the dictionary database is provided in a server, the dialogue control device communicates with the server, and the speech recognition unit 12 performs speech recognition with reference to that dictionary database.
 The response generation unit 13 generates a response corresponding to the speech recognition result produced by the speech recognition unit 12. The response generation unit 13 generates the response, for example, with reference to a system response database (not shown). The system response database is, for example, a table in which the recognition vocabularies included in speech recognition results and the corresponding responses are stored in association with each other. The system response database may be provided in the voice interaction control device 100 or in an external server. When the system response database is provided in a server, the dialogue control device communicates with the server, and the response generation unit 13 generates the response with reference to that system response database. The response generation unit 13 outputs the response to the response presentation device 22.
 The dialogue control unit 14 controls the operations of the voice section detection unit 11, the speech recognition unit 12, and the response generation unit 13. The dialogue control unit 14 controls each unit while monitoring the dialogue state of the system 200. The dialogue state is the state at any point from when a voice is detected by the voice section detection unit 11 until a response corresponding to the voice is generated and presented to the user. For example, the dialogue control unit 14 controls the operation of the speech recognition unit 12 based on a notification that the voice section detection unit 11 has detected the beginning or the end of a voice section. The dialogue control unit 14 also controls the start of response generation in the response generation unit 13, or the start of speech recognition of a subsequent voice in the speech recognition unit 12, based on a notification that speech recognition in the speech recognition unit 12 has finished.
 An example of the specific functions of the dialogue control unit 14 is as follows. The dialogue control unit 14 controls the processing for a series of first voice and the processing for a second voice input after the first voice. The processing for the first voice includes the processing from the detection of the first voice section forming the first voice until the first response is presented to the user from the system 200. More specifically, the processing for the first voice includes at least the processing in which the speech recognition unit 12 recognizes the first voice and the processing in which the response generation unit 13 generates the first response corresponding to the speech recognition result of the first voice. The processing for the first voice may also include the processing from the detection of the end of the first voice section until the first response is presented by the response presentation device 22 and the beginning of the next input voice section is detected.
 Even if the processing for the first voice has not been completed, the dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming the second voice so that a second response to the second voice can be generated. Furthermore, in the present embodiment, the dialogue control unit 14 causes the speech recognition unit 12 to recognize the second voice in the second voice section, causes the response generation unit 13 to generate the second response corresponding to the speech recognition result of the second voice, and causes the system 200 to present it to the user.
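A minimal sketch of this behavior follows, with invented timings and stub processing: voice section detection keeps running in its own flow of control, so a second utterance is captured and queued even while the response to the first utterance is still being generated.

```python
import queue
import threading
import time

detected = queue.Queue()

def voice_section_detector(utterances):
    # Stand-in for the voice section detection unit 11: detection keeps
    # running regardless of how far response processing has progressed.
    for delay, text in utterances:
        time.sleep(delay)            # arrival time of the voice section
        detected.put(text)

def process_utterances():
    # Stand-in for speech recognition and response generation.
    while True:
        text = detected.get()
        if text is None:
            break
        time.sleep(0.5)              # processing of the earlier voice is slow...
        print(f"response to '{text}' presented")

worker = threading.Thread(target=process_utterances)
worker.start()
# "music" arrives while the response to "reproduction" is still being
# generated, yet it is captured and not lost.
voice_section_detector([(0.0, "reproduction"), (0.1, "music")])
detected.put(None)
worker.join()
```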
(Processing circuit)
FIG. 2 is a diagram showing an example of the processing circuit 50 included in the voice interaction control device 100. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized by the processing circuit 50. That is, the processing circuit 50 includes the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14.
 When the processing circuit 50 is dedicated hardware, the processing circuit 50 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 may be realized individually by a plurality of processing circuits or collectively by one processing circuit.
 FIG. 3 is a diagram showing another example of the processing circuit included in the voice interaction control device 100. The processing circuit includes a processor 51 and a memory 52. The functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 are realized by the processor 51 executing a program stored in the memory 52. For example, each function is realized by the processor 51 executing software or firmware described as a program. That is, the voice dialogue control device 100 includes the memory 52 for storing the program and the processor 51 for executing the program.
 The program describes functions and operations by which the voice interaction control apparatus 100 detects a voice section from the beginning to the end of an input series of voice, recognizes the voice in the detected voice section, generates a response corresponding to the speech recognition result, and controls each of the voice section detection, the speech recognition, and the response generation. The program also describes functions and operations by which, when executing each control, the voice interaction control apparatus 100 has the second voice section forming a series of second voice input after the first voice detected even if the processing for the first voice has not been completed. Furthermore, the program describes functions and operations for having the second voice in the second voice section recognized, having a second response corresponding to the speech recognition result of the second voice generated, and having the system 200 present it to the user. The above program causes a computer to execute the procedures or methods of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above.
 The processor 51 is, for example, a central processing unit, a processing unit, an arithmetic unit, a microprocessor, a microcomputer, or a DSP (Digital Signal Processor). The memory 52 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory). Alternatively, the memory 52 may be any storage medium to be used in the future, such as a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD.
 Some of the functions of the voice section detection unit 11, the speech recognition unit 12, the response generation unit 13, and the dialogue control unit 14 described above may be realized by dedicated hardware and the others by software or firmware. In this way, the processing circuit realizes each of the above functions by hardware, software, firmware, or a combination thereof.
(Operation)
Next, the operation of the voice interaction control apparatus 100 and the voice interaction control method will be described. FIG. 4 is a sequence chart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment. FIG. 5 is a flowchart showing an example of the operation of the voice interaction control apparatus 100 and the voice interaction control method according to the first embodiment.
 Although not shown in the flowchart of FIG. 5, the dialogue control unit 14 first places the voice section detection unit 11 in a standby state in which voice can be received and the speech recognition unit 12 in a standby state in which speech recognition is possible. This control is performed, for example, by a user operation instructing the system 200 to start accepting voice section detection. Alternatively, after startup of the system 200, the dialogue control unit 14 may automatically place the voice section detection unit 11 in the standby state in which voice can be received. From this point on, the voice section detection unit 11 constantly monitors the input of voice, that is, it is in a detectable state.
 In step S10, the voice section detection unit 11 receives the first voice and detects the beginning of the first voice section. The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S20, based on the notification of the beginning detection, the speech recognition unit 12 starts speech recognition of the first voice after the beginning of the first voice section detected by the voice section detection unit 11.
 In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S40, based on the notification of the end detection, the speech recognition unit 12 ends the speech recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The speech recognition unit 12 outputs the speech recognition result of the first voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end.
 In step S50, the response generation unit 13 starts generation of the first response corresponding to the speech recognition result of the first voice based on the control from the dialogue control unit 14.
 In step S60, the voice section detection unit 11 detects the beginning of the second voice section of the second voice input after the first voice. The detected beginning is notified to the speech recognition unit 12 or the dialogue control unit 14. Note that this step S60 and the following step S70 are executed in parallel with the generation of the first response in the response generation unit 13.
 In step S70, based on the notification of the beginning detection, the speech recognition unit 12 starts speech recognition of the second voice after the beginning of the second voice section detected by the voice section detection unit 11.
 In step S80, the response generation unit 13 completes the generation of the first response. The dialogue control unit 14 causes the system 200 to present the first response to the user. That is, the response presentation device 22 presents the first response to the user.
 In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the speech recognition unit 12 or the dialogue control unit 14.
 In step S100, the speech recognition unit 12 ends the speech recognition of the second voice up to the end of the second voice section detected by the voice section detection unit 11. The speech recognition unit 12 outputs the speech recognition result of the second voice to the response generation unit 13 and notifies the dialogue control unit 14 of the end.
 In step S110, the response generation unit 13 starts generation of the second response corresponding to the speech recognition result of the second voice input from the speech recognition unit 12, based on the control from the dialogue control unit 14.
 In step S120, the response generation unit 13 completes the generation of the second response. The dialogue control unit 14 causes the system 200 to present the second response to the user. That is, the response presentation device 22 presents the second response to the user.
(Effects)
Summarizing the above, the voice interaction control device 100 according to the first embodiment is a voice interaction control device that performs interaction control for causing the system 200 to present to the user a response to voice input from the user when the user operates the system 200 by interaction between the user and the system 200. It includes: the voice section detection unit 11, which detects a voice section from the beginning to the end of an input series of voice; the speech recognition unit 12, which recognizes the voice in a voice section; the response generation unit 13, which generates a response corresponding to the speech recognition result of the voice and to be presented to the user from the system 200; and the dialogue control unit 14, which controls the voice section detection unit 11, the speech recognition unit 12, and the response generation unit 13. Even if the processing for the first voice, including the processing from the detection of the first voice section forming a series of first voice input as speech until the first response corresponding to the speech recognition result of the first voice is presented to the user from the system 200, has not been completed, the dialogue control unit 14 causes the voice section detection unit 11 to detect the second voice section forming a series of second voice input after the first voice, so that a second response to the second voice can be generated.
 With the above configuration, the voice interaction control device 100 can perform interaction control so that the system can appropriately respond to the second voice input after the first voice. The voice interaction control device 100 can generate a response, without omission, even to a second voice input immediately after the end of the first voice section. Furthermore, as shown as an example in the present embodiment, since the voice interaction control device 100 constantly inputs voice and performs voice section detection, there is no period during which a voice uttered by the user cannot be acquired.
 また、実施の形態1における音声対話制御方法は、ユーザとシステム200との対話によりユーザがシステム200に対し操作を行うに際し、ユーザから入力される音声に対する応答をシステム200からユーザに提示させるための対話制御を行う音声対話制御方法であって、入力される一続きの音声をなす始端から終端までの音声区間を検出し、音声区間内の音声を音声認識し、音声の音声認識結果に対応する応答であって、システム200からユーザに提示させるべき応答を生成し、音声区間の検出、音声の音声認識、および、応答の生成の各々の制御を実行する。音声対話制御方法は、その各々の制御を実行する際、音声として入力される一続きの第1音声をなす第1音声区間が検出されてから第1音声の音声認識結果に対応する第1応答がシステムからユーザに提示されるまでの処理を含む第1音声に対する処理が終了していなくても、第1音声の後に音声として入力される一続きの第2音声に対する第2応答を生成可能とするために第2音声をなす第2音声区間を検出させる。 In the voice interaction control method according to the first embodiment, when the user operates the system 200 by interaction between the user and the system 200, the system 200 presents a response to the voice input from the user to the user. A speech dialogue control method for dialogue control, comprising detecting a speech section from the beginning to the end forming the input series of speech, speech recognizing speech in the speech section, and corresponding to speech recognition result of speech A response, which generates a response to be presented to the user from the system 200, and performs control of each of speech segment detection, speech recognition of the speech, and generation of the response. In the voice interaction control method, when performing each control, a first response corresponding to a voice recognition result of the first voice after a first voice section forming a series of first voice inputted as voice is detected It is possible to generate a second response to a series of second voices input as voice after the first voice, even if processing for the first voice including processing until the system is presented to the user is not finished In order to do this, the second voice section that makes the second voice is detected.
 このような構成を含む音声対話制御方法によれば、第1音声の後に入力される第2音声に対しシステムが適切に応答できるよう対話制御することができる。この音声対話制御方法によれば、第1音声区間の終端直後に入力される第2音声に対しても漏れなく応答を生成することが可能である。また、この音声対話制御方法によれば、常時、音声を入力して音声区間検出を行うため、ユーザが発話する音声の取得ができない時間がなくすことができる。 According to the voice interaction control method including such configuration, it is possible to perform interaction control so that the system can appropriately respond to the second voice input after the first voice. According to this voice dialogue control method, it is possible to generate a response without omission to the second voice input immediately after the end of the first voice section. Moreover, according to this voice dialogue control method, since voice is always input to perform voice section detection, it is possible to eliminate a time when the user can not obtain a voice to be uttered.
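As a concrete illustration of this control flow, the following is a minimal sketch in which detection runs continuously and queues voice sections, so a second utterance is captured even while the first is still being processed. The functions detect_voice_sections, recognize, and generate_response are hypothetical stand-ins for the voice section detection unit 11, the voice recognition unit 12, and the response generation unit 13; the patent does not specify an implementation.

```python
import queue
import threading

# Hypothetical stand-ins for the units described above; a real system would
# wrap an actual voice activity detector, recognizer, and response database.
def detect_voice_sections(audio_stream):
    for utterance in audio_stream:   # each item plays the role of one voice section
        yield utterance

def recognize(section):
    return section                   # identity "recognition", for illustration only

def generate_response(result):
    return f"Response to: {result}"

voice_sections = queue.Queue()

def detection_loop(audio_stream):
    # Detection never blocks on recognition or response generation, so a
    # second utterance arriving immediately after the first section ends
    # is still captured rather than dropped.
    for section in detect_voice_sections(audio_stream):
        voice_sections.put(section)

def processing_loop():
    while True:
        section = voice_sections.get()
        print(generate_response(recognize(section)))
        voice_sections.task_done()

threading.Thread(target=processing_loop, daemon=True).start()
detection_loop(["I want to go to the supermarket.",
                "Actually, I want to go to a convenience store."])
voice_sections.join()                # wait until both responses are generated
```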
Second Embodiment

A voice interaction control device and a voice interaction control method according to the second embodiment will be described.
(Configuration)

FIG. 6 is a block diagram showing the configurations of the voice interaction control device 101 and the system 200 according to the second embodiment. The system 200 includes a dictionary database storage device 23 in addition to the configuration shown in the first embodiment.

The voice recognition unit 12 of the voice interaction control device 101 performs voice recognition with reference to the dictionary database stored in the dictionary database storage device 23. The voice interaction control device 101 also includes a voice storage unit 15 in addition to the configuration shown in the first embodiment.
The voice storage unit 15 stores the voice in the voice section detected by the voice section detection unit 11. An example in which the voice storage unit 15 stores the second voice in the second voice section is described below; however, the present invention is not limited to this, and the voice storage unit 15 may also store the first voice of the first voice section.

Based on a notification indicating that the voice recognition unit 12 has finished recognizing the first voice, the dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15 and causes the response generation unit 13 to generate a second response corresponding to the voice recognition result of the second voice. Furthermore, based on a notification indicating that the response generation unit 13 has completed generation of the first response, the dialogue control unit 14 causes the response generation unit 13 to generate the second response.
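The following is a minimal sketch of this notification-driven sequencing. The callback names and the voice_store interface are assumptions for illustration; the patent describes the notifications but not their concrete form.

```python
class DialogueController:
    """Illustrative sequencing only; callback names and the voice_store
    interface are assumptions, not the patent's actual interfaces."""

    def __init__(self, recognize, generate_response, voice_store):
        self.recognize = recognize
        self.generate_response = generate_response
        self.voice_store = voice_store
        self.second_result = None

    def on_first_recognition_finished(self):
        # Notification from the voice recognition unit 12: the recognizer is
        # free again, so feed it the second voice held in the voice storage
        # unit 15 while the first response is still being generated.
        self.second_result = self.recognize(self.voice_store.read_second_voice())

    def on_first_response_completed(self, present):
        # Notification from the response generation unit 13: the first
        # response exists, so the second response may now be generated and
        # the two are presented to the user in order.
        present(self.generate_response(self.second_result))
```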
(System response database)

In the present embodiment, the response generation unit 13 generates each response corresponding to each voice recognition result by referring to a system response database. FIG. 7 is a diagram showing an example of the configuration of the system response database according to the second embodiment. The system response database consists of recognition vocabulary contained in voice recognition results and the responses corresponding to those results. Depending on the configuration of the response presentation device 22 that presents responses to the user, a plurality of responses may be included for one result.
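By way of illustration, a system response database of the kind shown in FIG. 7 can be pictured as a simple lookup from recognition vocabulary to responses. The entries below are assumptions drawn from the examples used later in this embodiment; the actual database contents are not limited to these.

```python
# Hypothetical contents mirroring FIG. 7: recognition vocabulary mapped to
# one or more responses (e.g. one per presentation device).
SYSTEM_RESPONSE_DATABASE = {
    "supermarket":       ["Displaying supermarket search results."],
    "convenience store": ["Displaying convenience store search results."],
}

def generate_response(recognition_result: str) -> list[str]:
    # Return the responses for the first recognition vocabulary entry
    # contained in the recognition result.
    for vocabulary, responses in SYSTEM_RESPONSE_DATABASE.items():
        if vocabulary in recognition_result:
            return responses
    return ["Sorry, I did not understand."]   # fallback assumption

print(generate_response("I want to go to the supermarket."))
```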
(Processing circuit)

The functions of the voice storage unit 15 and the dialogue control unit 14 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice storage unit 15 and the dialogue control unit 14 having the functions described above.

When the functions of the voice storage unit 15 and the dialogue control unit 14 are realized by the processing circuit shown in FIG. 3, the function of the voice storage unit 15 is realized by, for example, the memory 52. The program stored in the memory 52 describes functions and operations for storing the second voice in the second voice section and, based on a notification indicating that voice recognition of the first voice has finished, causing the second voice stored in the memory 52 to be recognized and a second response corresponding to the voice recognition result of the second voice to be generated. The program further describes a function and operation for generating the second response based on a notification indicating that generation of the first response has been completed.
(Operation)

Next, the operation of the voice interaction control device 101 and the voice interaction control method will be described. FIG. 8 is a sequence chart, and FIG. 9 is a flowchart, each showing an example of the operation of the voice interaction control device 101 and the voice interaction control method according to the second embodiment.

In the first embodiment, an example was shown in which the second voice is input while the first response is being generated; in the second embodiment, an example is shown in which the second voice is input while the first voice is being recognized.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "I want to go to the supermarket." uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S20, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11. Here, the voice recognition unit 12 starts recognizing the first voice with reference to the dictionary database.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S32, the voice section detection unit 11 inputs the second voice and detects the start of the second voice section. Here, "Actually, I want to go to a convenience store." uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S34, based on the notification of start detection for the second voice section, the dialogue control unit 14 causes the voice storage unit 15 to start storing the second voice. In FIG. 8, the operations related to this notification are omitted to keep the sequence chart simple.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "supermarket" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition, and based on that notification the dialogue control unit 14 controls the following steps S50, S62, and S70 to be executed.

In step S50, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the first response corresponding to the voice recognition result of the first voice, referring to the system response database shown in FIG. 7.

In step S62, based on control from the dialogue control unit 14, the voice recognition unit 12 starts reading the second voice from the voice storage unit 15. In the present embodiment, while still storing the second voice of the second voice section, the voice storage unit 15 outputs the already stored portion of the second voice to the voice recognition unit 12 with a time lag. Steps S62 through S73 below are executed in parallel with the generation of the first response by the response generation unit 13.
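This write-while-read behaviour of the voice storage unit 15 can be sketched as a simple first-in first-out buffer. The frame format and interface below are assumptions for illustration, and a real implementation would block on an empty open buffer instead of busy-waiting.

```python
from collections import deque

class VoiceStore:
    def __init__(self):
        self.frames = deque()
        self.closed = False            # set when the end of the voice section is detected

    def write(self, frame):
        self.frames.append(frame)      # storing continues during recognition

    def read(self):
        # Yield frames that are already stored; frames written in the
        # meantime are picked up on later iterations, giving the
        # time-lagged readout described above.
        while self.frames or not self.closed:
            if self.frames:
                yield self.frames.popleft()

store = VoiceStore()
store.write("frame-1")
store.write("frame-2")
reader = store.read()
print(next(reader))        # "frame-1" is handed to the recognizer while...
store.write("frame-3")     # ...storage of the same utterance is still in progress
store.closed = True
print(list(reader))        # remaining frames: ['frame-2', 'frame-3']
```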
In step S70, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the second voice read from the voice storage unit 15, from the start of the second voice section. By starting voice recognition of the second voice based on the notification that voice recognition of the first voice has finished, the voice recognition unit 12 can thus recognize the second voice after recognizing the first voice. The voice recognition unit 12 starts recognizing the second voice with reference to the dictionary database.

In step S71, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S72, the voice storage unit 15 finishes storing the second voice.

In step S73, the reading of the second voice from the voice storage unit 15 is finished.

In step S80, the response generation unit 13 completes generation of the first response. Here, the response generation unit 13 generates a first response containing "Displaying supermarket search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the first response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the first response to the user by outputting the voice "Displaying supermarket search results." in accordance with the first response. If the response presentation device 22 is a display device, the display device presents the first response to the user by displaying "Displaying supermarket search results." in accordance with the first response. Alternatively, the response generation unit 13 may generate a first response containing a control signal for searching for supermarkets; in this case, a destination search unit (not shown) included in the system 200 searches for supermarkets based on the first response, and the response presentation device 22 presents the supermarket search results to the user. In the present embodiment, the response generation unit 13 notifies the dialogue control unit 14 that generation of the first response has been completed.
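The presentation step can be pictured as dispatching one generated response to whichever response presentation device 22 is configured. The device classes and the optional search hook below are illustrative assumptions, not part of the patent.

```python
class Speaker:
    def present(self, response):
        print(f"[voice output] {response}")

class Display:
    def present(self, response):
        print(f"[display output] {response}")

def present_response(device, response, search=None, query=None):
    if search is not None:
        search(query)          # control-signal variant: trigger the destination search
    device.present(response)   # text variant: speak or display the message

present_response(Speaker(), "Displaying supermarket search results.")
present_response(Display(), "Displaying supermarket search results.")
```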
In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice contains "convenience store" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S110, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice input from the voice recognition unit 12, referring to the system response database shown in FIG. 7. In the present embodiment, this step S110 is executed after step S90; that is, the dialogue control unit 14 controls step S110 to be executed based on the notification that generation of the first response has been completed.

In step S120, the response generation unit 13 completes generation of the second response. Here, the response generation unit 13 generates a second response containing "Displaying convenience store search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Displaying convenience store search results." in accordance with the second response. If the response presentation device 22 is a display device, the display device presents the second response by displaying "Displaying convenience store search results." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for searching for convenience stores; in this case, the destination search unit included in the system 200 searches for convenience stores based on the second response, and the response presentation device 22 presents the convenience store search results to the user.

In the operation of the voice interaction control device 101 described above, the voice stored in the voice storage unit 15 is not limited to the second voice; the voice storage unit 15 may also store the first voice. That is, the voice interaction control device 101 may first store the first voice of the first voice section detected by the voice section detection unit 11 in the voice storage unit 15, read it out after a certain time has elapsed, and have the voice recognition unit 12 recognize it.
(Effect)

Summarizing the above, the voice interaction control device 101 according to the second embodiment further includes the voice storage unit 15 that stores the second voice in the second voice section detected by the voice section detection unit 11. Based on a notification indicating that the voice recognition unit 12 has finished recognizing the first voice, the dialogue control unit 14 causes the voice recognition unit 12 to recognize the second voice stored in the voice storage unit 15 and causes the response generation unit 13 to generate a second response corresponding to the voice recognition result of the second voice.

With this configuration, the voice interaction control device 101 can acquire the second voice even while the first voice is being processed, for example during voice recognition or response generation. That is, the voice interaction control device 101 can generate an appropriate response to each of a plurality of voices uttered by the user at arbitrary timing.

Further, based on a notification indicating that the response generation unit 13 has completed generation of the first response, the dialogue control unit 14 of the voice interaction control device 101 according to the second embodiment causes the response generation unit 13 to generate the second response corresponding to the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit 12.

With this configuration, the voice interaction control device 101 can present both the first response to the first voice and the second response to the second voice to the user in order. For example, if the user utters the second voice "Actually, I want to go to a convenience store." immediately after the system inputs the first voice "I want to go to the supermarket." and starts processing it, a conventional system would be unable to recognize the second voice and would likely respond only by presenting the supermarket search results. The voice interaction control device 101 according to the present embodiment, however, can input both the first voice and the second voice and present the supermarket search results and the convenience store search results, respectively.
Third Embodiment

A voice interaction control device and a voice interaction control method according to the third embodiment will be described.
(Configuration)

FIG. 10 is a block diagram showing the configurations of the voice interaction control device 102 and the system 200 according to the third embodiment. The voice interaction control device 102 includes a dialogue state determination unit 16 in addition to the configuration shown in the second embodiment.

The dialogue state determination unit 16 determines whether the voice recognition result of the second voice recognized by the voice recognition unit 12 updates the voice recognition result of the first voice.

Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice partway through and causes the response generation unit 13 to generate the second response.
(Processing circuit)

The functions of the dialogue state determination unit 16 and the dialogue control unit 14 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the dialogue state determination unit 16 and the dialogue control unit 14 having the functions described above.

When the functions of the dialogue state determination unit 16 and the dialogue control unit 14 are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes a function and operation for determining whether the voice recognition result of the second voice updates the voice recognition result of the first voice. The program further describes functions and operations for terminating the processing for the first voice partway through and generating the second response based on that determination result.
(Operation)

Next, the operation of the voice interaction control device 102 and the voice interaction control method will be described. FIG. 11 is a sequence chart, and FIG. 12 is a flowchart, each showing an example of the operation of the voice interaction control device 102 and the voice interaction control method according to the third embodiment. In the following description, the operation of the voice storage unit 15 is omitted; it is the same as in the second embodiment.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "I want to go to a convenience store." uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S20, based on the notification of start detection, the voice recognition unit 12 starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11, with reference to the dictionary database.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "convenience store" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S50, based on control from the dialogue control unit 14, the response generation unit 13 starts generating the first response corresponding to the voice recognition result of the first voice, referring to the system response database shown in FIG. 7.

In step S60, the voice section detection unit 11 detects the start of the second voice section of the second voice input after the first voice. Here, "Actually, I want to go to a restaurant." uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S70, the voice recognition unit 12 starts voice recognition of the second voice from the start of the second voice section detected by the voice section detection unit 11, with reference to the dictionary database stored in the dictionary database storage device 23.

In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice contains "restaurant" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S102, the dialogue state determination unit 16 determines whether the voice recognition result of the second voice updates the voice recognition result of the first voice, and outputs the determination result to the dialogue control unit 14. In the present embodiment, it is determined whether the voice recognition result of the second voice, containing "restaurant", updates the voice recognition result of the first voice, containing "convenience store". If it is determined not to update it, step S104 is executed; if it is determined to update it, step S106 is executed. In the present embodiment, the dialogue state determination unit 16 determines that the voice recognition result of the second voice, containing "restaurant", does update the voice recognition result of the first voice, containing "convenience store". In this determination, the dialogue state determination unit 16 may determine whether an update is required based on the parallel relationship between the vocabulary items "convenience store" and "restaurant", or based on other vocabulary contained in the second voice, for example the adversative conjunction "actually".
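The determination criteria are left open above; the following minimal sketch shows one way they could be realized, under the assumption of a fixed table of parallel destination vocabulary and a list of adversative markers (both hypothetical).

```python
# Parallel destination vocabulary and adversative markers; both tables are
# assumptions, since the text above leaves the concrete criteria open.
DESTINATION_WORDS = {"supermarket", "convenience store", "restaurant"}
ADVERSATIVE_MARKERS = ("actually", "after all", "on second thought")

def second_updates_first(first_result: str, second_result: str) -> bool:
    # Criterion 1: both utterances name one of a set of parallel
    # destinations, so the later request replaces the earlier one.
    if (any(w in first_result for w in DESTINATION_WORDS)
            and any(w in second_result for w in DESTINATION_WORDS)):
        return True
    # Criterion 2: the second utterance opens with an adversative marker.
    return second_result.lower().startswith(ADVERSATIVE_MARKERS)

print(second_updates_first("I want to go to a convenience store.",
                           "Actually, I want to go to a restaurant."))  # True
```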
If it is determined in step S102 that the second voice does not update the first, then in step S104, under control of the dialogue control unit 14 based on the determination result, the response generation unit 13 completes generation of the first response and the response presentation device 22 presents that first response to the user. In this case, the response is presented in the same way as in step S80 of the second embodiment. Subsequently, from step S110 shown in FIG. 12 onward, the response to the second voice is presented by the response presentation device 22.

If, on the other hand, it is determined in step S102 that the second voice updates the first, then in step S106, based on the determination result, the dialogue control unit 14 terminates the processing for the first voice partway through.

In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 7.

In step S120, the response generation unit 13 completes generation of the second response. Here, the response generation unit 13 generates a second response containing "Displaying restaurant search results." as information for voice output or display output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, if the response presentation device 22 is a speaker, the speaker presents the second response to the user by outputting the voice "Displaying restaurant search results." in accordance with the second response. If the response presentation device 22 is a display device, the display device presents the second response by displaying "Displaying restaurant search results." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for searching for restaurants; in this case, the destination search unit included in the system 200 starts a restaurant search based on the second response, and the response presentation device 22 displays the restaurant search results.

In short, if a second voice inconsistent with the first voice is input while the processing for the first voice is being executed, the dialogue control unit 14 aborts the processing for the first voice partway through and controls so that only the second response, corresponding to the second voice, is generated. As a result, only the second response is presented by the response presentation device 22.
(Effect)

Summarizing the above, the voice interaction control device 102 according to the third embodiment further includes the dialogue state determination unit 16 that determines whether the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit 12 updates the voice recognition result of the first voice. Based on the determination result of the dialogue state determination unit 16, the dialogue control unit 14 terminates the processing for the first voice partway through and causes the response generation unit 13 to generate the second response.

With this configuration, when the operation content based on the first voice and the operation content based on the second voice are inconsistent, the voice interaction control device 102 can terminate the processing for the first voice partway through and present the response to the second voice, improving operability for the user. For example, if the user utters the second voice "Actually, I want to go to a restaurant." immediately after the system inputs the first voice "I want to go to a convenience store." and starts processing it, a conventional system would be unable to recognize the second voice and would likely respond only by presenting the convenience store search results. Based on the voice recognition results of the first and second voices, however, the voice interaction control device 102 according to the third embodiment can present a response closer to the user's intention, namely the restaurant search results for the second voice, and can do so earlier than the voice interaction control device 101 according to the second embodiment.
Fourth Embodiment

A voice interaction control device and a voice interaction control method according to the fourth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
(Configuration)

FIG. 13 is a block diagram showing the configurations of the voice interaction control device 103 and the system 200 according to the fourth embodiment.

The dictionary database storage device 23 of the system 200 stores a plurality of dictionary databases. In the present embodiment, the dictionary database storage device 23 stores a first dictionary database 24 and a second dictionary database 25.

The first dictionary database 24 is a dictionary database prepared for the standby state of the system 200. The standby state is, for example, a state in which the voice input device 21 of the system 200 can accept an operation by the user, that is, a state of waiting for input of the first voice. In the standby state, a display device, which is another user interface included in the system 200, displays, for example, a menu screen. The second dictionary database 25 is a dictionary database that corresponds to the state after the system 200 has recognized the first voice and that is associated with a specific vocabulary item contained in the voice recognition result of the first voice.

The voice recognition unit 12 performs voice recognition with reference to the one dictionary database, among the plurality of dictionary databases, that corresponds to the state of the system 200.

In the present embodiment, when the system 200 is in the standby state, the voice recognition unit 12 recognizes the first voice with reference to the first dictionary database 24 as the one dictionary database corresponding to the standby state. Alternatively, when the system 200 is in the standby state, the voice recognition unit 12 may consult all the dictionary databases and thereby refer to the first dictionary database 24, as the one corresponding to the standby state, to recognize the first voice. FIG. 14 is a diagram showing an example of the configuration of the first dictionary database 24 according to the fourth embodiment. The first dictionary database 24 contains states of the system 200 and recognition vocabulary. The first screen in FIG. 14 is a standby screen such as a menu screen.

Further, when the state of the system 200 is the state after voice recognition of the first voice and the voice recognition result of the first voice contains a specific vocabulary item, the voice recognition unit 12 recognizes the second voice with reference to the second dictionary database 25 associated with that specific vocabulary item, as the one dictionary database corresponding to that state. For example, after voice recognition of the first voice, the voice recognition unit 12 or the dialogue control unit 14 determines whether the voice recognition result of the first voice contains a specific vocabulary item and, if it does, selects the second dictionary database 25 for recognizing the second voice. In this way, the voice recognition unit 12 has the function of switching the dictionary database used for voice recognition according to the state of the system 200. FIG. 15 is a diagram showing an example of the configuration of the second dictionary database 25 according to the fourth embodiment. The second dictionary database 25 contains main states of the system 200, related states of the system 200, and recognition vocabulary.
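This state-dependent switching can be sketched as follows. The dictionary contents follow the "play"/"music" example used later in this embodiment and are assumptions, not the patent's data format.

```python
FIRST_DICTIONARY = {"play", "stop", "navigate"}       # standby-state vocabulary (assumed)
SECOND_DICTIONARIES = {
    "play": {"music", "video", "radio"},              # vocabulary related to "play" (assumed)
}

def select_dictionary(system_state: str, first_result: str = "") -> set[str]:
    if system_state == "standby":
        # First voice: use the dictionary prepared for the standby state.
        return FIRST_DICTIONARY
    # Second voice: use the dictionary tied to the specific vocabulary item
    # found in the recognition result of the first voice.
    for vocabulary, dictionary in SECOND_DICTIONARIES.items():
        if vocabulary in first_result:
            return dictionary
    return FIRST_DICTIONARY                           # fallback assumption

print(select_dictionary("standby"))                   # standby-state vocabulary
print(select_dictionary("after_first_voice", "play")) # {'music', 'video', 'radio'}
```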
The response generation unit 13 generates a response corresponding to the voice recognition result of a voice and the information of the one dictionary database referred to for recognizing that voice. For example, the response generation unit 13 generates a first response corresponding to the voice recognition result of the first voice and the information of the first dictionary database 24 referred to for recognizing it, or a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database 25 referred to for recognizing it.

(System response database)

The response generation unit 13 generates the response to a voice by referring to the system response database. FIG. 16 is a diagram showing an example of the configuration of the system response database according to the fourth embodiment. This system response database consists of recognition vocabulary contained in voice recognition results, information on the dictionary database referred to for voice recognition, and the responses corresponding to them.
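By way of illustration, a database of the kind shown in FIG. 16 can be pictured as a lookup keyed on both the recognition vocabulary and the dictionary used. The entries below follow the "play"/"music" example of this embodiment and are assumptions.

```python
# Hypothetical entries mirroring FIG. 16: the response is keyed on both the
# recognition vocabulary and the dictionary database used for recognition.
SYSTEM_RESPONSE_DATABASE = {
    ("play", "first dictionary database"):   "What would you like to play?",
    ("music", "second dictionary database"): "Playing music.",
}

def generate_response(vocabulary: str, dictionary_info: str) -> str:
    return SYSTEM_RESPONSE_DATABASE.get(
        (vocabulary, dictionary_info),
        "Sorry, I did not understand.")      # fallback assumption

print(generate_response("music", "second dictionary database"))  # -> "Playing music."
```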
(Processing circuit)

The functions of the voice recognition unit 12 and the response generation unit 13 described above are realized by, for example, the processing circuit 50 shown in FIG. 2. That is, the processing circuit 50 includes the voice recognition unit 12 and the response generation unit 13 having the functions described above.

When the functions of the voice recognition unit 12 and the response generation unit 13 are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes functions and operations for recognizing a voice with reference to one of a plurality of dictionary databases and generating a response corresponding to the voice recognition result and the information of the one dictionary database referred to. The program also describes functions and operations for recognizing the first voice with reference to the first dictionary database 24 prepared for the standby state of the system 200 and recognizing the second voice with reference to the second dictionary database 25 associated with the specific vocabulary item contained in the voice recognition result of the first voice. The program further describes a function and operation for generating a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database 25.
(Operation)

Next, the operation of the voice interaction control device 103 and the voice interaction control method will be described. FIG. 17 is a sequence chart, and FIG. 18 is a flowchart, each showing an example of the operation of the voice interaction control device 103 and the voice interaction control method according to the fourth embodiment. In the following description, the operation of the voice storage unit 15 is omitted; it is the same as in the second embodiment.
In step S10, the voice section detection unit 11 inputs the first voice and detects the start of the first voice section. Here, "Play" uttered by the user is input as the first voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S22, the voice recognition unit 12 selects the first dictionary database 24 corresponding to the standby state of the system 200. For example, the voice recognition unit 12 acquires information indicating that the system 200 is in the standby state and, based on that information, selects the first dictionary database 24 shown in FIG. 14 from among the plurality of dictionary databases. Here, the information indicating the standby state acquired by the voice recognition unit 12 is information that the first screen is being displayed.

In step S24, the voice recognition unit 12 refers to the first dictionary database 24 and starts voice recognition of the first voice from the start of the first voice section detected by the voice section detection unit 11. Alternatively, combining steps S22 and S24, the voice recognition unit 12 may, based on the information that the system 200 is in the standby state, consult all the dictionary databases and thereby recognize the first voice with reference to the first dictionary database 24 corresponding to the standby state.

In step S30, the voice section detection unit 11 detects the end of the first voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S40, based on the notification of end detection, the voice recognition unit 12 finishes voice recognition of the first voice up to the end of the first voice section detected by the voice section detection unit 11. The voice recognition result of the first voice contains "play" as recognition vocabulary.

In step S60, the voice section detection unit 11 detects the start of the second voice section of the second voice input after the first voice. Here, "Music" uttered by the user is input as the second voice. The detected start is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S74, the voice recognition unit 12 selects the second dictionary database 25 corresponding to the state in which the system 200 has finished voice recognition of the first voice and the voice recognition result of the first voice contains a specific vocabulary item. For example, the voice recognition unit 12 determines whether the voice recognition result of the first voice contains a specific vocabulary item and, if it does, selects the second dictionary database 25 associated with that vocabulary item from among the plurality of dictionary databases. Here, the voice recognition unit 12 determines whether the voice recognition result of the first voice contains the specific vocabulary item "play" and, having determined that it does, refers to the second dictionary database 25 shown in FIG. 15 to recognize the second voice.

In step S76, the voice recognition unit 12 refers to the second dictionary database 25 and starts voice recognition of the second voice from the start of the second voice section detected by the voice section detection unit 11. In this way, the voice recognition unit 12 switches the dictionary database used for voice recognition from the first dictionary database 24 to the second dictionary database 25 according to the state of the system 200.

In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the dialogue control unit 14.

In step S100, the voice recognition unit 12 finishes voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice and the information of the second dictionary database 25 referred to for recognizing it are output to the response generation unit 13. The voice recognition result of the second voice contains "music" as recognition vocabulary. The voice recognition unit 12 also notifies the dialogue control unit 14 of the end of voice recognition.

In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 16.

In step S120, the response generation unit 13 completes generation of the second response. Here, since the recognition vocabulary is "music" and the dictionary database information is "second dictionary database", the response generation unit 13 generates a second response containing "Playing music." as information for voice output. The dialogue control unit 14 controls the response presentation device 22 to present the second response to the user. For example, a speaker included in the response presentation device 22 presents the second response to the user by outputting the voice "Playing music." in accordance with the second response. Alternatively, the response generation unit 13 may generate a second response containing a control signal for causing a music playback device included in the response presentation device 22 to play music, and the music playback device may play music based on that second response.

Although not shown in the flowchart, if no second voice is input in step S60, the voice recognition result of the first voice and the information of the first dictionary database 24 referred to for recognizing it are output to the response generation unit 13. Since the recognition vocabulary is "play" and the dictionary database information is "first dictionary database", the response generation unit 13 generates a first response containing "What would you like to play?" as information for voice output or display output, and the response presentation device 22 presents that first response to the user.
(Effect)

Summarizing the above, the voice recognition unit 12 of the voice interaction control device 103 according to the fourth embodiment recognizes a voice with reference to the one dictionary database, among a plurality of dictionary databases, that corresponds to the state of the system. The response generation unit 13 generates a response corresponding to the voice recognition result and the information of the one dictionary database referred to for recognizing the voice.

With this configuration, the voice interaction control device 103 can switch the dictionary database referred to during voice recognition according to the state of the system 200, that is, the dialogue state, and can therefore generate an accurate response to the user's utterance.

Further, the voice recognition unit 12 of the voice interaction control device 103 according to the fourth embodiment recognizes the first voice with reference to the first dictionary database 24, among the plurality of dictionary databases, prepared for the standby state of the system 200, and recognizes the second voice with reference to the second dictionary database 25, among the plurality of dictionary databases, that corresponds to the state after voice recognition of the first voice and is associated with the specific vocabulary item contained in the voice recognition result of the first voice. The response generation unit 13 generates a second response corresponding to the voice recognition result of the second voice and the information of the second dictionary database referred to for recognizing it.

With this configuration, the voice interaction control device 103 can generate a response reflecting the content of both the first voice and the second voice, producing an accurate response to the user's utterance. For example, if the user utters the second voice "Music" immediately after the system inputs the first voice "Play" and starts processing it, a conventional system would be unable to recognize the second voice and would likely present a response asking the user what to play. The voice interaction control device 103 according to the present embodiment, however, recognizes the second voice with reference to the second dictionary database associated with the voice recognition result of the first voice, and can therefore play music in line with the user's intention.
Fifth Embodiment

A voice interaction control device and a voice interaction control method according to the fifth embodiment will be described. Descriptions of configurations and operations similar to those of the other embodiments are omitted.
 (構成)
 図19は、実施の形態5における音声対話制御装置104およびシステム200の構成を示すブロック図である。
(Constitution)
FIG. 19 is a block diagram showing configurations of the voice dialogue control device 104 and the system 200 in the fifth embodiment.
 応答生成部13は、音声の音声認識結果に対応して生成される複数の応答から一の応答をユーザに選択させるための確認応答を生成する確認応答生成部17をさらに含む。 The response generation unit 13 further includes a confirmation response generation unit 17 that generates a confirmation response for causing the user to select one response from a plurality of responses generated corresponding to the speech recognition result of speech.
The interaction control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system 200 to present that response to the user.
(Processing circuit)
The functions of the confirmation response generation unit 17 and the response generation unit 13 described above are realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3. When they are realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes the function and operation of generating a confirmation response for prompting the user to select one response from among a plurality of responses generated for the voice recognition result of a voice. The program also describes the function and operation of causing the system 200 to present the confirmation response to the user, generating the one response corresponding to the voice input by the user in accordance with the confirmation response, and causing the system 200 to present that response to the user.
(Operation)
Next, the operation of the voice interaction control device 104 and the voice interaction control method will be described. FIG. 20 is a flowchart showing an example of the operation of the voice interaction control device 104 and the voice interaction control method according to the fifth embodiment. In this embodiment, steps S10 to S110 are the same as in the fourth embodiment, and their description is omitted.
In step S112, the response generation unit 13 determines whether a plurality of second responses corresponding to the voice recognition result of the second voice can be generated. For example, if the system 200 is equipped with both a portable music-playback device and a CD (Compact Disc) player, the response generation unit 13 can generate a second response including a control signal for playing the music stored on the portable device and another second response including a control signal for playing the music stored on the CD. If it is determined that a plurality of second responses cannot be generated, step S120 is executed; in this case, the processing from step S120 onward is the same as in the fourth embodiment. If it is determined that a plurality of second responses can be generated, step S122 is executed.
In step S122, the confirmation response generation unit 17 generates a confirmation response for prompting the user to select one of the plurality of second responses generated for the voice recognition result of the second voice. Here, the confirmation response generation unit 17 generates a confirmation response containing, as information for audio or display output, "Do you want to play music from the portable device or from the CD?"
In step S124, the interaction control unit 14 causes the response presentation device 22 to present the confirmation response to the user. The response presentation device 22 presents "Do you want to play music from the portable device or from the CD?" to the user, and the user inputs another voice to operate the system in accordance with the confirmation response. For example, when the user inputs the voice "Play the music on the portable device", the voice interaction control device 104 generates the one second response through the same voice recognition and response generation as in the steps above. The response presentation device 22 then plays the music on the portable device, thereby presenting the selected second response to the user.
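A minimal sketch of the confirmation flow in steps S112 to S124 might look as follows; the playback sources mirror the example above, while the functions and data structures are illustrative assumptions.

```python
# Hedged sketch of steps S112-S124: detect ambiguity, ask for confirmation.
# Function names and structures are assumptions for illustration.

def generate_second_responses(vocabulary: str, sources: list[str]) -> list[str]:
    # One candidate response per playback source that can serve the request.
    return [f"Play {vocabulary} from {src}" for src in sources]

def dialogue_step(vocabulary: str, sources: list[str]) -> str:
    candidates = generate_second_responses(vocabulary, sources)
    if len(candidates) == 1:      # S112: no ambiguity, present directly (S120)
        return candidates[0]
    # S122: build a confirmation response asking the user to choose one.
    options = " or the ".join(sources)
    return f"Do you want to play {vocabulary} from the {options}?"

# S124: the confirmation is presented; the user's next utterance selects one.
print(dialogue_step("music", ["portable device", "CD"]))
```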
(Effect)
Summarizing the above, the response generation unit 13 of the voice interaction control device 104 according to the fifth embodiment further includes a confirmation response generation unit 17 that generates a confirmation response for prompting the user to select one response from among a plurality of responses generated for the voice recognition result of a voice. The interaction control unit 14 causes the system 200 to present the confirmation response to the user, causes the response generation unit 13 to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system 200 to present that response to the user.
With such a configuration, the voice interaction control device 104 can ask the user for confirmation when ambiguity arises in the interaction between the user and the system.
<Embodiment 6>
A voice interaction control device and a voice interaction control method according to the sixth embodiment will be described.
(Configuration)
The configurations of the voice interaction control device 104 and the system 200 in the sixth embodiment are the same as in the fourth embodiment. In this embodiment, however, the interaction control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value. If the elapsed time is equal to or greater than the specific value, the interaction control unit 14 causes the second voice to be recognized by referring to the first dictionary database 24, which is prepared, among the plurality of dictionary databases, for the standby state of the system 200. If the elapsed time is less than the specific value, the interaction control unit 14 causes the second voice to be recognized by referring to the second dictionary database, which corresponds to the state after recognition of the first voice and is related to a specific vocabulary item included in the recognition result of the first voice. In other words, the interaction control unit 14 judges the relevance between the first voice and the second voice from whether the elapsed time reaches the specific value, and has a response generated accordingly for presentation to the user.
(Processing circuit)
The above function of the interaction control unit 14 is realized by, for example, the processing circuit shown in FIG. 2 or FIG. 3. When it is realized by the processing circuit shown in FIG. 3, the program stored in the memory 52 describes the function and operation of, based on the determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value, causing the second voice to be recognized by referring to the first dictionary database 24 prepared for the standby state of the system 200, or causing the second voice to be recognized by referring to the second dictionary database that corresponds to the state after recognition of the first voice and is related to the specific vocabulary item included in the recognition result of the first voice.
(Operation)
Next, the operation of the voice interaction control device 104 and the voice interaction control method will be described. FIG. 21 is a flowchart showing an example of the operation of the voice interaction control device 104 and the voice interaction control method according to the sixth embodiment. Steps S10 to S60 in this embodiment are the same as in the fourth embodiment, and their description is omitted.
In step S64, the interaction control unit 14 determines whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value. If the elapsed time is determined to be less than the specific value, that is, if the utterances are judged to be related, step S74 is executed. If the elapsed time is determined to be equal to or greater than the specific value, that is, if the utterances are judged to be unrelated, step S70 is executed.
In steps S74 and S76, the voice recognition unit 12 starts voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11. The voice recognition unit 12 recognizes the second voice by referring to the second dictionary database 25 related to the specific vocabulary item included in the recognition result of the first voice. The processing from step S74 onward is the same as the corresponding processing in the fourth embodiment shown in FIG. 18.
If, on the other hand, the utterances are judged to be unrelated, the voice recognition unit 12 starts, in step S70, voice recognition of the second voice from the beginning of the second voice section detected by the voice section detection unit 11. In this case, however, the voice recognition unit 12 performs the recognition by referring to the first dictionary database 24 prepared for the standby state of the system 200.
In step S90, the voice section detection unit 11 detects the end of the second voice section. The detected end is notified to the voice recognition unit 12 or the interaction control unit 14.
In step S100, the voice recognition unit 12 finishes the voice recognition of the second voice up to the end of the second voice section. The voice recognition result of the second voice and the information of the first dictionary database 24 referred to for that recognition are output to the response generation unit 13. Here, the recognition result of the second voice includes "music" as the recognized vocabulary. The voice recognition unit 12 also notifies the interaction control unit 14 that the voice recognition has finished.
In step S110, the response generation unit 13 starts generating the second response corresponding to the voice recognition result of the second voice, referring to the system response database shown in FIG. 16.
In step S120, the response generation unit 13 completes the generation of the second response. Here, since the recognized vocabulary is "music" and the dictionary database information is "first dictionary database", the response generation unit 13 generates a second response containing "Displaying the music screen." as information for audio output. The interaction control unit 14 controls the response presentation device 22 so as to present the second response to the user. For example, a speaker included in the response presentation device 22 presents the second response by outputting the voice "Displaying the music screen." Alternatively, the response generation unit 13 may generate a second response containing a control signal that causes a display device included in the response presentation device 22 to display a music screen, and the display device may display the music screen based on that second response.
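One possible shape for the system response database lookup in step S120 is sketched below; the table entries mirror the examples in the text, but the keys and structure are otherwise assumptions.

```python
# Minimal sketch of a system response database lookup, as in step S120.
# Keys and structure are illustrative assumptions.

SYSTEM_RESPONSES = {
    ("music", "first_dictionary"): "Displaying the music screen.",
    ("music", "second_dictionary"): "Playing music.",
}

def generate_response(vocabulary: str, dictionary_info: str) -> str:
    # The pair (recognized vocabulary, referenced dictionary) selects the response.
    return SYSTEM_RESPONSES.get((vocabulary, dictionary_info),
                                "Sorry, I did not understand.")

print(generate_response("music", "first_dictionary"))
```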
(Effect)
Summarizing the above, the interaction control unit 14 of the voice interaction control device 104 according to the sixth embodiment, based on the determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than the specific value, causes the second voice to be recognized by referring to the first dictionary database 24 prepared, among the plurality of dictionary databases, for the standby state of the system 200, or causes the second voice to be recognized by referring to the second dictionary database 25, which corresponds to the state after recognition of the first voice and is related to the specific vocabulary item included in the recognition result of the first voice.
With such a configuration, the voice interaction control device 104 generates a response that takes into account not only the voice recognition result but also the timing of the user's utterance, and can therefore respond accurately to the user's speech.
<Embodiment 7>
Each of the voice interaction control devices described in the first to sixth embodiments is mounted, for example, on a vehicle. FIG. 22 is a block diagram showing an example of the configuration of the voice interaction control device 105 mounted on a vehicle 30. Here, the voice interaction control device 105 is any one of the voice interaction control devices 100 to 104 described in the first to sixth embodiments. The system 200 includes an in-vehicle device (not shown) such as a navigation device, an audio device, or a PND (Portable Navigation Device). A voice input device (not shown) of the in-vehicle device inputs the voice uttered by the user, the voice interaction control device 105 generates a response corresponding to that voice, and a response presentation device (not shown) of the in-vehicle device presents the response to the user.
The system 200 including the voice interaction control device 105 may also be built by appropriately combining a communication terminal used with the in-vehicle device, a server installed outside the vehicle, and the functions of applications installed on them. FIG. 23 is a block diagram showing an example of the configuration of the voice interaction control device 105 provided in a server 40. Voice input from a voice input device (not shown) of a communication terminal 32 is received over a network by a communication device 41 of the server 40 and processed by the voice interaction control device 105, which generates a response corresponding to that voice. The generated response is sent from the communication device 41 over the network and presented to the user by a response presentation device (not shown) of an in-vehicle device 31; the response presentation device may instead be included in the communication terminal 32. Here, the communication terminal 32 is, for example, a mobile phone, a smartphone, or a tablet. The components of the voice interaction control device 105 may also be distributed among the devices that make up the system 200, in which case the functions are realized by the components communicating with one another as appropriate. Providing the voice interaction control device 105 in the server 40, or distributing its components across the server 40 and other devices, realizes the functions of the voice interaction control device 105 while simplifying the configuration of the vehicle 30 or the in-vehicle device 31.
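As a rough sketch of this server-side arrangement, assuming an HTTP transport that the publication does not specify, the terminal could send captured audio to the server and receive the generated response:

```python
# Rough sketch of the client-server split described above. The transport,
# endpoint, and function names are all illustrative assumptions.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def control_dialogue(audio_bytes: bytes) -> str:
    # Placeholder for the voice interaction control device 105 on the server:
    # voice section detection, recognition, and response generation would run here.
    return "Playing music."

class DialogueHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        audio = self.rfile.read(int(self.headers["Content-Length"]))
        response = control_dialogue(audio)
        body = json.dumps({"response": response}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 8080), DialogueHandler).serve_forever()  # terminals POST audio here
```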
In the present invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate, within the scope of the invention. Although the present invention has been described in detail, the above description is illustrative in all aspects, and the invention is not limited to it. It is understood that countless variations not illustrated here can be conceived without departing from the scope of the invention.
DESCRIPTION OF SYMBOLS: 11 voice section detection unit, 12 voice recognition unit, 13 response generation unit, 14 interaction control unit, 15 voice storage unit, 16 interaction state determination unit, 17 confirmation response generation unit, 24 first dictionary database, 25 second dictionary database, 100 voice interaction control device, 200 system.

Claims (9)

1. A voice interaction control device that performs interaction control for causing a system to present to a user a response to voice input from the user when the user operates the system through interaction between the user and the system, the voice interaction control device comprising:
    a voice section detection unit that detects a voice section from the beginning to the end of a continuous input voice;
    a voice recognition unit that recognizes the voice in the voice section;
    a response generation unit that generates the response corresponding to the voice recognition result of the voice, the response being the one to be presented to the user by the system; and
    an interaction control unit that controls the voice section detection unit, the voice recognition unit, and the response generation unit,
    wherein the interaction control unit causes the voice section detection unit to detect a second voice section formed by a continuous second voice input as the voice after a first voice, so that a second response to the second voice can be generated even if processing for the first voice has not been completed, the processing for the first voice including the processing from detection of a first voice section formed by the continuous first voice input as the voice to presentation of a first response corresponding to the voice recognition result of the first voice from the system to the user.
2. The voice interaction control device according to claim 1, further comprising a voice storage unit that stores the second voice in the second voice section detected by the voice section detection unit,
    wherein the interaction control unit, based on a notification indicating that the voice recognition of the first voice has finished in the voice recognition unit, causes the voice recognition unit to recognize the second voice stored in the voice storage unit, and causes the response generation unit to generate the second response corresponding to the voice recognition result of the second voice.
3. The voice interaction control device according to claim 1, wherein the interaction control unit, based on a notification indicating that the generation of the first response has been completed in the response generation unit, causes the response generation unit to generate the second response corresponding to the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit.
4. The voice interaction control device according to claim 1, further comprising an interaction state determination unit that determines whether the voice recognition result of the second voice in the second voice section recognized by the voice recognition unit updates the voice recognition result of the first voice,
    wherein the interaction control unit, based on the determination result of the interaction state determination unit, terminates the processing for the first voice partway through and causes the response generation unit to generate the second response.
5. The voice interaction control device according to claim 1, wherein the voice recognition unit recognizes the voice by referring to one dictionary database, among a plurality of dictionary databases, corresponding to the state of the system, and
    the response generation unit generates the response corresponding to the voice recognition result of the voice and to the information of the one dictionary database referred to for the voice recognition of the voice.
6. The voice interaction control device according to claim 5, wherein the voice recognition unit recognizes the first voice by referring to a first dictionary database, among the plurality of dictionary databases, prepared corresponding to a standby state of the system, and recognizes the second voice by referring to a second dictionary database, among the plurality of dictionary databases, corresponding to a state after the voice recognition of the first voice and related to a specific vocabulary item included in the voice recognition result of the first voice, and
    the response generation unit generates the second response corresponding to the voice recognition result of the second voice and to the information of the second dictionary database referred to for the voice recognition of the second voice.
7. The voice interaction control device according to claim 1, wherein the response generation unit further includes a confirmation response generation unit that generates a confirmation response for causing the user to select one response from a plurality of the responses generated corresponding to the voice recognition result of the voice, and
    the interaction control unit causes the system to present the confirmation response to the user, causes the response generation unit to generate the one response corresponding to the voice input by the user in accordance with the confirmation response, and causes the system to present that response to the user.
8. The voice interaction control device according to claim 5, wherein the interaction control unit, based on a determination of whether the elapsed time from the end of the first voice section to the beginning of the second voice section is equal to or greater than a specific value, causes the second voice to be recognized by referring to a first dictionary database, among the plurality of dictionary databases, prepared corresponding to a standby state of the system, or causes the second voice to be recognized by referring to a second dictionary database, among the plurality of dictionary databases, corresponding to a state after the voice recognition of the first voice and related to a specific vocabulary item included in the voice recognition result of the first voice.
9. A voice interaction control method for performing interaction control for causing a system to present to a user a response to voice input from the user when the user operates the system through interaction between the user and the system, the method comprising:
    detecting a voice section from the beginning to the end of a continuous input voice;
    recognizing the voice in the voice section;
    generating the response corresponding to the voice recognition result of the voice, the response being the one to be presented to the user by the system; and
    controlling each of the detection of the voice section, the voice recognition of the voice, and the generation of the response,
    wherein, in the controlling, a second voice section formed by a continuous second voice input as the voice after a first voice is detected so that a second response to the second voice can be generated even if processing for the first voice has not been completed, the processing for the first voice including the processing from detection of a first voice section formed by the continuous first voice input as the voice to presentation of a first response corresponding to the voice recognition result of the first voice from the system to the user.
PCT/JP2017/033902 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction WO2019058453A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction
JP2019542865A JP6851491B2 (en) 2017-09-20 2017-09-20 Voice dialogue control device and voice dialogue control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction

Publications (1)

Publication Number Publication Date
WO2019058453A1 true WO2019058453A1 (en) 2019-03-28

Family

ID=65811399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/033902 WO2019058453A1 (en) 2017-09-20 2017-09-20 Voice interaction control device and method for controlling voice interaction

Country Status (2)

Country Link
JP (1) JP6851491B2 (en)
WO (1) WO2019058453A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599133A (en) * 2020-12-15 2021-04-02 北京百度网讯科技有限公司 Vehicle-based voice processing method, voice processor and vehicle-mounted processor

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001014165A (en) * 1999-06-30 2001-01-19 Toshiba Corp Device for generating response, device for managing dialogue, method for generating response and computer readable recording medium for storing response generating program
JP2003058188A (en) * 2001-08-13 2003-02-28 Fujitsu Ten Ltd Voice interaction system
JP2004037910A (en) * 2002-07-04 2004-02-05 Denso Corp Interaction system and interactive verse capping system
JP2015064450A (en) * 2013-09-24 2015-04-09 シャープ株式会社 Information processing device, server, and control program
JP2017102320A (en) * 2015-12-03 2017-06-08 アルパイン株式会社 Voice recognition device


Also Published As

Publication number Publication date
JP6851491B2 (en) 2021-03-31
JPWO2019058453A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
US11356730B2 (en) Systems and methods for routing content to an associated output device
KR101418163B1 (en) Speech recognition repair using contextual information
KR102100389B1 (en) Personalized entity pronunciation learning
US10706853B2 (en) Speech dialogue device and speech dialogue method
EP3475942B1 (en) Systems and methods for routing content to an associated output device
US8484033B2 (en) Speech recognizer control system, speech recognizer control method, and speech recognizer control program
US9092435B2 (en) System and method for extraction of meta data from a digital media storage device for media selection in a vehicle
US7822613B2 (en) Vehicle-mounted control apparatus and program that causes computer to execute method of providing guidance on the operation of the vehicle-mounted control apparatus
JP4260788B2 (en) Voice recognition device controller
CN111095400A (en) Selection system and method
US10599469B2 (en) Methods to present the context of virtual assistant conversation
US20150039316A1 (en) Systems and methods for managing dialog context in speech systems
KR102360589B1 (en) Systems and methods for routing content to related output devices
JP2001083991A (en) User interface device, navigation system, information processing device and recording medium
JP7347217B2 (en) Information processing device, information processing system, information processing method, and program
JP6851491B2 (en) Voice dialogue control device and voice dialogue control method
JP7456387B2 (en) Information processing device and information processing method
JP2006058641A (en) Speech recognition device
JP2004354942A (en) Voice interactive system, voice interactive method and voice interactive program
CN117090668A (en) Vehicle exhaust sound adjusting method and device and vehicle
JP2001209394A (en) Speech recognition system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17925620

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019542865

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17925620

Country of ref document: EP

Kind code of ref document: A1