WO2015102039A1 - Speech recognition apparatus

Speech recognition apparatus

Info

Publication number
WO2015102039A1
WO2015102039A1 (PCT/JP2014/006171)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
content
speech
voice recognition
Prior art date
Application number
PCT/JP2014/006171
Other languages
French (fr)
Japanese (ja)
Inventor
鈴木 竜一
Original Assignee
株式会社デンソー (DENSO Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DENSO Corporation (株式会社デンソー)
Publication of WO2015102039A1 publication Critical patent/WO2015102039A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This disclosure relates to a speech recognition apparatus (Speech Recognition Apparatus) that recognizes speech uttered by a user in an interactive manner.
  • As one such device, a speech recognition device has been proposed that interprets input information entered by a user, identifies a dialog agent capable of producing a response corresponding to that input, sends the input information to the agent, requests a response, and outputs the agent's response. The device queries multiple dialog agents about the information each can process, collates the input information against that processable information, selects a dialog agent able to process the input, transmits the input information to the selected agent, and receives a response (see, for example, Patent Document 1).
  • There is also a voice recognition device that records the user's utterance history for each subject, determines a new dialogue scenario for each subject based on that history, selects the speech recognition dictionary to be referenced based on the determined scenario, and recognizes the user's utterance against that dictionary (see, for example, Patent Document 2).
  • The apparatus described in Patent Document 1 selects a dialog agent that can handle the input information, transmits the input to the selected agent, and receives a response, so it can carry on a smooth dialogue close to natural conversation in which the category of the input changes frequently. However, it has no mechanism for continuing a past operation when interactive speech recognition is started again after a previous interactive session has completed; in that case the user must speak the same input information again, which may feel bothersome.
  • The device described in Patent Document 2 determines a new dialogue scenario for each subject from the user's utterance history in order to improve recognition accuracy, and limits the speech recognition dictionary to be referenced based on that scenario. However, this device likewise has no way of carrying a past operation over when interactive speech recognition is restarted after a session has completed, so the user again has to repeat the same input and may feel annoyed.
  • This disclosure aims to provide a voice recognition device that eliminates the annoyance users experience when past operations are not carried over.
  • According to one example of the present disclosure, a speech recognition apparatus includes a storage control section and a speech recognition processing section. The apparatus recognizes the content of the user's utterance, generates voice for voice conversation with the user based on the recognized content, and performs voice recognition processing in an interactive format.
  • The storage control section causes a storage unit to store at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation. When performing voice recognition processing, the speech recognition processing section generates the voice for conversing with the user using the content stored in the storage unit and carries out the voice recognition processing.
  • With this configuration, because the stored content is used to generate the dialogue voice whenever recognition is performed, the annoyance caused to the user when past operations are not carried over can be eliminated.
  • Brief description of the drawings: FIG. 1 shows the overall configuration of the navigation device; FIG. 2 shows the configuration of the voice recognition circuit and the voice dialogue control circuit; FIG. 3 is a flowchart of the control circuit and voice dialogue control circuit of the navigation device; FIG. 4 shows a display example of the voice recognition top screen image; FIG. 5 shows a display example of the context display screen image; and FIGS. 6 to 8 explain the difference between dialogues with and without continuing the context.
  • FIG. 1 shows the overall configuration of a speech recognition apparatus according to an embodiment of the present disclosure.
  • the voice recognition device is configured as a navigation device 20 that is mounted and used in a vehicle (also referred to as a host vehicle).
  • the navigation device 20 recognizes speech content uttered by the user, generates speech for voice conversation with the user based on the speech content recognized by the speech recognition, and performs speech recognition processing in an interactive format. In addition, a process of executing an operation according to the utterance content recognized by the voice recognition is performed.
  • The navigation device 20 includes a position detector 21, a data input device 22, an operation switch group 23, a communication device 24, an external memory 25, a display device 26, a remote control sensor 27, a control circuit 28, and a voice recognition unit 10.
  • the position detector 21 includes a gyroscope 21a, a distance sensor 21b, and a GPS receiver 21c, and outputs various information for specifying the current position input from these to the control circuit 28.
  • the data input device 22 is a device for inputting map data for map display and route search.
  • the data input device 22 reads out necessary map data from a map data storage medium in which map data is stored in response to a request from the control circuit 28.
  • the map data storage medium includes not only map data for map display and route search, but also dictionary data used when the speech recognition unit 10 performs recognition processing.
  • the map data storage medium can be configured using a hard disk drive, CD, DVD, flash memory, or the like.
  • The operation switch group 23 includes various switches, such as touch switches arranged over the front surface of the display (also referred to as a display unit or display panel) of the display device 26 described later and mechanical switches provided around the display, and outputs signals corresponding to the user's switch operations to the control circuit 28.
  • the communication device 24 is for communicating with the outside, and is configured by a mobile communication device such as a mobile phone, for example.
  • the external memory 25 is composed of a portable storage medium such as a USB memory or an SD card. Various data are stored in the external memory 25.
  • the display device 26 has a display such as a liquid crystal, and displays video and images (including screen images) according to the video signal input from the control circuit 28 on the display.
  • the remote control sensor 27 receives a radio signal transmitted from a remote control 27a for performing a remote operation.
  • The control circuit 28 is configured as a computer including a CPU, ROM, RAM, I/O, and the like, and the CPU of the control circuit 28 performs various processes according to programs stored in the ROM. Part or all of the processes executed by the program may instead be executed by hardware components.
  • The processing of the control circuit 28 includes: host vehicle position detection processing that detects the host vehicle position based on the various information for specifying the current position input from the position detector 21; map display processing that displays the host vehicle position mark superimposed on a map around the host vehicle position; destination setting processing that sets a destination; route search processing that searches for a guidance route to the destination; and travel guidance processing that provides travel guidance along the guidance route.
  • the voice recognition unit 10 is a device that performs processing for recognizing input voice collected by the microphone 15, generates dialogue voice, and outputs (speaks) the dialogue voice from the speaker 14.
  • In this embodiment, the voice recognition unit 10 is configured, as one example, as one or more computers each including a CPU, RAM, ROM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of these processes may instead be executed by hardware components.
  • the voice recognition unit 10 includes a voice synthesis circuit 11, a voice recognition circuit 12, and a voice dialogue control circuit 13. Each of these may be composed of individual computers, or may be composed of one or two computers.
  • a speaker 14 is connected to the speech synthesis circuit 11, and a microphone 15 is connected to the speech recognition circuit 12.
  • A PTT (push-to-talk) switch 16 is connected to the voice dialogue control circuit 13.
  • The voice recognition circuit 12 recognizes the input voice collected by the microphone 15 in accordance with an instruction from the voice dialogue control circuit 13 and notifies the voice dialogue control circuit 13 of the recognition result. That is, the voice recognition circuit 12 collates the voice data acquired from the microphone 15 against the stored dictionary data and outputs to the voice dialogue control circuit 13 the comparison target patterns with the highest degree of match among the plurality of comparison target pattern candidates.
  • The voice recognition circuit 12 sequentially performs acoustic analysis on the voice data input from the microphone 15 to extract acoustic feature quantities (for example, cepstra). The resulting time series of acoustic features is divided into sections using well-known techniques such as HMMs (Hidden Markov Models), DP matching, or neural networks, and the word sequence in the input speech is recognized by determining which word stored in the dictionary data each section corresponds to. A sketch of the DP-matching idea follows.
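  • As an illustrative sketch only (not the patent's implementation; the feature extraction and dictionary contents are assumed), an isolated-word DP matcher can compare a spoken feature sequence against reference sequences stored as dictionary data:

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic-programming (DTW) distance between two feature sequences.

    Each sequence is a list of per-frame feature vectors (e.g., cepstra).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # frame-level distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

def recognize_word(features, dictionary):
    """Return dictionary words ranked by degree of match, best first."""
    scored = sorted((dtw_distance(features, ref), word)
                    for word, ref in dictionary.items())
    return [word for _, word in scored]
```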
  • Based on the recognition result from the voice recognition circuit 12, the voice dialogue control circuit 13 instructs the voice synthesis circuit 11 to output a response voice, and also instructs the control circuit 28 of the navigation device 20, for example, by notifying it of a destination or a command needed for travel guidance processing and directing it to set the destination or execute the command.
  • The voice synthesis circuit 11 has a waveform database in which various speech waveforms are stored. Using the waveforms in this database, it synthesizes speech based on the response voice output instruction from the voice dialogue control circuit 13, and the synthesized speech is output from the speaker 14.
  • With the voice recognition unit 10 of this embodiment, the user speaks various commands for executing processes such as route setting, route guidance, facility search, and facility display into the microphone 15 while pressing the PTT switch 16. Specifically, the voice dialogue control circuit 13 monitors when the PTT switch 16 is pressed, when it is released, and how long it remains pressed; when the switch is pressed, it instructs the voice recognition circuit 12 to execute recognition processing, and when the switch is not pressed, recognition is not executed. Voice data input via the microphone 15 while the PTT switch 16 is pressed is therefore output to the voice recognition circuit 12, as sketched below.
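  • A minimal sketch of this push-to-talk gating; the recognizer interface and audio frame format are assumptions for illustration, not the patent's API:

```python
class PttGate:
    """Forwards microphone frames to the recognizer only while PTT is pressed."""

    def __init__(self, recognizer):
        self.recognizer = recognizer
        self.pressed = False

    def on_ptt_down(self):
        self.pressed = True
        self.recognizer.start()      # instruct recognition to begin

    def on_ptt_up(self):
        self.pressed = False
        self.recognizer.finish()     # close out the utterance

    def on_audio_frame(self, frame):
        if self.pressed:             # frames while released are dropped
            self.recognizer.feed(frame)
```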
  • FIG. 2 shows configurations of the voice recognition circuit 12 and the voice dialogue control circuit 13.
  • the speech recognition circuit 12 includes a speech extraction unit 101, a speech recognition collation unit 103, a speech recognition result output unit 105, a speech recognition dictionary unit 107, and a target dictionary determination unit 109.
  • the voice recognition dictionary unit 107 includes a command correspondence dictionary 201, an address correspondence dictionary 203, a music correspondence dictionary 205, a telephone directory correspondence dictionary 207, and the like.
  • the voice extraction unit 101 extracts words from the voice data input from the microphone 15.
  • The target dictionary determination unit 109 determines the target dictionary to be used for speech recognition from among the dictionaries 201 to 207 of the speech recognition dictionary unit 107. The speech recognition collation unit 103 then collates the words extracted by the speech extraction unit 101 using the determined target dictionary.
  • the voice recognition result output unit 105 outputs the result of voice recognition based on the collation result of the voice recognition collation unit 103 to the voice dialogue processing unit 121 of the voice dialogue control circuit 13.
  • the voice dialogue control circuit 13 includes a voice dialogue processing unit 121, a function execution processing determination unit 123, a voice output content determination unit 125, and a context history management unit 127.
  • The voice dialogue processing unit 121 determines a phrase that matches the voice recognition result output from the voice recognition result output unit 105 of the voice recognition circuit 12 from among previously prepared question phrases and dialogue phrases. In this embodiment, when determining a matching phrase, the voice dialogue processing unit 121 can also determine the phrase using a context managed by the context history management unit 127 described later.
  • the function execution process determination unit 123 determines a process to execute a function based on the content processed by the voice interaction processing unit 121 and notifies the control circuit 28 of the determined process. In addition, the function execution process determination unit 123 acquires the speech recognition result output from the speech recognition result output unit 105 via the speech recognition circuit 12 and notifies the control circuit 28 of the speech recognition result.
  • the voice output content determination unit 125 determines the voice data to be output based on the content processed by the voice dialog processing unit 121 and notifies the voice synthesis circuit 11 of the determined voice data.
  • the control circuit 28 executes the function in accordance with the notification from the function execution process determination unit 123, and when the function execution is completed, notifies the context history management unit 127 of the executed content. In addition, the control circuit 28 notifies the context history management unit 127 of the voice recognition result acquired via the function execution process determination unit 123.
  • the context history management unit 127 has a memory (not shown), and sequentially stores contents (contexts) notified from the control circuit 28 in this memory to manage the context history.
  • Here, a context means the utterance content spoken by the user in a voice dialogue or the operation content executed in response to the user's manual operation. For example, when the user speaks “Destination” and then “Tokyo” in a voice dialogue, “Destination, Tokyo” is stored as a context in the memory of the context history management unit 127. Likewise, when the user manually selects “audio” and then “AM radio”, “audio, AM radio” is stored as a context.
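  • A minimal sketch of this context history management, covering storage on completion, most-recent-first listing, single-entry deletion, and the reset-all behavior of the reset button 310 described later; the class and method names are hypothetical:

```python
from collections import deque

class ContextHistoryManager:
    """Stores context entries such as ("Destination", "Tokyo")."""

    def __init__(self, max_entries=100):
        self._history = deque(maxlen=max_entries)

    def store(self, category, value):
        # e.g., store("Destination", "Tokyo") or store("audio", "AM radio")
        self._history.append((category, value))

    def recent(self, n=5):
        # Most recent contexts first, e.g., for the context display screen.
        return list(self._history)[-n:][::-1]

    def delete(self, entry):
        self._history.remove(entry)  # erase a single selected context

    def reset(self):
        self._history.clear()        # the "reset button 310" behavior
```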
  • FIG. 3 shows a flowchart of the voice recognition process of the navigation device 20.
  • the processing of the control circuit 28 and the voice recognition unit 10 of the navigation device 20 will be described as the voice recognition processing of the navigation device 20.
  • When the vehicle's ignition switch is turned from off to on, the navigation device 20 enters an operating state. Then, for example, when the start of voice recognition processing is instructed by a user operation on the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the process shown in FIG. 3.
  • The flowchart described in this application, or its processing, includes a plurality of sections (also referred to as steps), and each section is expressed as, for example, S100.
  • each section can be divided into a plurality of subsections, while a plurality of sections can be combined into one section.
  • each section can be referred to as a device, module, or means.
  • Each of the above sections, or a combination thereof, can be realized not only as (i) a section of software combined with a hardware unit (e.g., a computer) but also as (ii) a section of hardware (e.g., an integrated circuit or hard-wired logic circuit), with or without the functions of related devices.
  • the hardware section can be included inside the microcomputer.
  • a voice recognition top screen image (also referred to as a top screen) is displayed on the display unit (or display panel) of the display device 26 (S100). Specifically, a voice recognition top screen image is displayed on the display unit of the display device 26 according to an instruction from the control circuit 28 of the navigation device 20.
  • FIG. 4 shows a display example of the voice recognition top screen image.
  • This voice recognition top screen image includes “Yes” and “No” options along with the message “Do you want to continue the context?”.
  • The user selects “Yes” to continue the utterance content from past voice recognition processing and “No” when continuation is not desired. If no context is stored in the memory of the context history management unit 127, only “No” can be selected.
  • The control circuit 28 determines that the context is to be continued when the user selects “Yes” and not continued when the user selects “No” (S102).
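  • The S100-S102 confirmation step might look like the following sketch, reusing the hypothetical context manager above; the display interface is an assumption:

```python
def confirm_context_continuation(display, context_manager):
    """Show the top screen and ask whether to continue the context.

    Only "No" is offered when no context is stored, as described above.
    """
    has_context = len(context_manager.recent(1)) > 0
    options = ["Yes", "No"] if has_context else ["No"]
    choice = display.prompt("Do you want to continue the context?", options)
    return choice == "Yes"   # True: continue the past context
```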
  • Next, a user voice input is performed (S200). While the PTT switch 16 is being pressed by a user operation, for example, when the user speaks “Set destination” as shown in FIG. 6A, the voice data from the microphone 15 is input to the voice recognition circuit 12.
  • If the operation to be executed has not been determined, the determination in S204 is NO, and a dialogue voice is generated and output (S206). This is performed by the voice dialogue processing unit 121 and the voice output content determination unit 125.
  • For example, a dialogue voice such as “Please tell me the destination” is generated for the utterance content “Set destination”, output from the speaker 14, and the process returns to S200. When the user then speaks “Tokyo”, the utterance is recognized in S202, a dialogue voice such as “Setting the destination to Tokyo” is generated in S206, and the voice is output.
  • The voice recognition unit 10 notifies the control circuit 28 that the destination is to be set to Tokyo; in response, the control circuit 28 sets the destination to Tokyo and notifies the voice recognition unit 10 that this has been done.
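  • The S200-S206 loop, recognizing utterances and asking follow-up questions until an executable operation is determined, can be sketched as below; the recognizer, dialogue, and speaker interfaces are hypothetical stand-ins for the voice recognition circuit 12, voice dialogue control circuit 13, and speaker 14:

```python
def interactive_recognition(recognizer, dialogue, speaker):
    """Recognize, then either return the determined operation or
    ask a follow-up question and listen again."""
    while True:
        utterance = recognizer.listen()          # S200/S202: voice input
        operation = dialogue.match_operation(utterance)
        if operation is not None:                # S204: operation decided?
            return operation                     # e.g., set destination
        prompt = dialogue.follow_up(utterance)   # S206: e.g., "Please tell
        speaker.say(prompt)                      # me the destination"
```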
  • Next, the context is stored (S106). Specifically, the control circuit 28 instructs the context history management unit 127 to store the context: “Destination” is associated with the specific place name “Tokyo”, and “Destination, Tokyo” is stored as a context in the memory of the context history management unit 127, after which this processing ends.
  • When voice recognition processing is started again, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100), and it is then determined whether or not to continue the context (S102).
  • Next, a user voice input is performed (S200). As shown in FIG. 6A, for example, when the user speaks “What is the weather today?” while the PTT switch 16 is being pressed, the voice data from the microphone 15 is input to the voice recognition circuit 12.
  • Since the operation to be executed has not been determined, the determination in S204 is NO, and a dialogue voice is generated and output (S206). For example, a dialogue voice such as “Where do you want to know the weather for?” is generated for the utterance “What is the weather today?”, output from the speaker 14, and the process returns to S200.
  • The voice recognition unit 10 then notifies the control circuit 28 that voice recognition has finished; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
  • the context is stored (S106).
  • “Weather information” is associated with the specific place name “Tokyo”, “weather information, Tokyo” is stored in the memory of the context history management unit 127 as a context, and this process ends.
  • the voice recognition top screen image as shown in FIG. 4 is displayed on the display unit of the display device 26 (S100).
  • FIG. 5 shows a display example of a context display screen image.
  • the context display screen image includes a context display image 300 and a reset button 310.
  • the reset button 310 is a button used when all contexts are reset.
  • Next, the context is determined (S110). In the display example of FIG. 5, the context indicating “Destination, Tokyo” is highlighted. Here, when the user utters “next”, the utterance is recognized and “Destination, AA restaurant” becomes highlighted; when the user utters “fifth” while the context display screen of FIG. 5 is shown, “air conditioner on” becomes highlighted. In this way, the highlighted context switches according to the content of the user's utterance, and when the user utters “determine”, the highlighted context is fixed as the context used for speech recognition.
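  • A sketch of this highlight-and-determine selection (S110); the command words “next”, “fifth”, and “determine” follow the examples above, while the wrap-around behavior is an assumption:

```python
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

def select_context(displayed_contexts, utterances):
    """Move the highlight with "next" or an ordinal; fix it with "determine"."""
    index = 0                                        # first entry highlighted
    for word in utterances:
        if word == "next":
            index = (index + 1) % len(displayed_contexts)
        elif word in ORDINALS:
            index = ORDINALS[word]
        elif word == "determine":
            return displayed_contexts[index]         # context to be continued
    return None
```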
  • the process proceeds to the second voice recognition process (S300 to S306).
  • the voice recognition process is performed by generating voice for voice conversation with the user so as to continue the utterance contents in the past voice recognition process using the context determined in S110.
  • a user's voice input is performed (S300). For example, when the user speaks “What is the weather today?” During the period in which the PTT switch 16 is being pressed by a user operation, voice data from the microphone 15 is input to the voice recognition circuit 12.
  • voice recognition processing is performed (S302). Specifically, the voice dialogue control circuit 13 instructs the voice recognition circuit 12 to execute voice recognition processing, and the voice recognition circuit 12 performs voice recognition processing of voice data from the microphone 15 in response to this instruction.
  • A target dictionary to be used for speech recognition is specified based on the context determined in S110, and speech recognition processing is performed using this target dictionary. For example, when the context determined in S110 is “Destination, Tokyo”, the address correspondence dictionary 203 and the telephone directory correspondence dictionary 207, which relate to destination setting, are used, while the unrelated music correspondence dictionary 205 is not used. Using only the minimum necessary dictionaries in this way improves the recognition rate.
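  • The context-dependent dictionary selection can be sketched as a simple mapping; the category names and dictionary identifiers are illustrative, mirroring the dictionaries 201-207 described above:

```python
# Hypothetical mapping from context category to the dictionaries relevant to it.
DICTIONARIES_FOR_CATEGORY = {
    "Destination": ["address_dictionary_203", "phonebook_dictionary_207"],
    "Artist":      ["music_dictionary_205"],
}

def target_dictionaries(context, default=("command_dictionary_201",)):
    """Pick the minimum set of dictionaries for the determined context;
    unrelated dictionaries are simply not returned."""
    category, _value = context   # e.g., ("Destination", "Tokyo")
    return DICTIONARIES_FOR_CATEGORY.get(category, list(default))
```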
  • In S304, it is determined whether or not the operation to be executed has been determined; specifically, this is judged by whether the utterance content recognized in the voice recognition processing of S302 is an operation command instructing execution of a predetermined function.
  • If not, the determination in S304 is NO, and a dialogue voice is generated and output (S306).
  • a voice for voice conversation with the user is generated so as to continue the utterance contents in the past voice recognition processing.
  • For example, when the context determined in S110 is “Destination, Tokyo” and the user's utterance is recognized in S302 as “What is the weather today?”, the specific place name contained in the context, “Tokyo”, is combined with “What is the weather today?” to generate a phrase such as “Today's weather in Tokyo is sunny”. This phrase is then output as audio from the speaker 14.
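  • A sketch of this slot filling: the place name from the stored context completes the otherwise unspecified weather query. The weather_service lookup is hypothetical; the patent only specifies combining the context's place name with the query:

```python
def answer_with_context(context, utterance, weather_service):
    """Fill the missing place slot from the stored context (S306 behavior)."""
    _category, place = context                   # e.g., ("Destination", "Tokyo")
    if "weather" in utterance.lower():
        forecast = weather_service.today(place)  # e.g., "sunny"
        return f"Today's weather in {place} is {forecast}"
    return None
```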
  • The voice recognition unit 10 then notifies the control circuit 28 that voice recognition has finished; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
  • the context is stored (S114).
  • “Weather information” is associated with the specific place name “Tokyo”, “weather information, Tokyo” is stored in the memory of the context history management unit 127 as a context, and this process ends.
  • Fig. 7 shows an example of dialogue when searching for a song by artist name and displaying the artist's album list.
  • In FIG. 7, (a) is an example of the dialogue when the context is not continued, and (b) is an example when the context is continued.
  • In (a), a dialogue voice “Please tell me the artist name” is output from the speaker 14; when the user answers “Michael”, a dialogue voice “Playing Michael's song” is output from the speaker 14, and when the function is executed, “Artist, Michael” is stored as the context in the memory of the context history management unit 127. When the user later utters “display album list”, a response voice is output from the speaker 14, and the dialogue voice “Displaying Michael's album list” is output only in response to the user uttering “Michael” again. That is, the input “Michael” has to be repeated.
  • In (b), “Artist, Michael” is stored as a context in the memory of the context history management unit 127, and after the series of speech recognition processing is completed, speech recognition is started again. This time, in response to the user's utterance “display album list”, the dialogue voice “Displaying Michael's album list” is output from the speaker 14. That is, Michael's album list can be displayed without inputting “Michael” again.
  • Fig. 8 shows an example of a dialog when searching for a destination and wanting to call the destination.
  • In FIG. 8, (a) is an example of the dialogue when the context is not continued, and (b) is an example when the context is continued.
  • In (a), a dialogue voice “Please tell me the destination” is output from the speaker 14; when the user answers “AA restaurant”, the speaker 14 outputs the dialogue voice “Setting AA restaurant as the destination”, and when the function is executed, “Destination, AA restaurant” is stored as a context in the memory of the context history management unit 127. When the user later utters “make a call”, the speaker 14 responds with “Where do you want to call?”, and only after the user utters “AA restaurant” again does the speaker 14 output the dialogue voice “Calling AA restaurant”. That is, the input “AA restaurant” has to be repeated.
  • In (b), “Destination, AA restaurant” is stored as a context in the memory of the context history management unit 127, and after the series of voice recognition processing, voice recognition is performed again. This time, in response to the user's utterance “call”, the dialogue voice “Calling AA restaurant” is output from the speaker 14. That is, the call can be placed without inputting “AA restaurant” again.
  • Performing recognition processing with a continued context in this way eliminates the hassle of repeating the same input over and over, and by reducing the number of exchanges needed in the vehicle cabin environment it also helps ensure safety.
  • As described above, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for conversing with the user is generated using the context stored in the memory. The annoyance caused to the user when past operations are not carried over can therefore be eliminated.
  • The device also confirms with the user whether to conduct the voice dialogue using the context stored in the memory, and generates the dialogue voice from the stored context only when the user confirms. This prevents context-based speech from being produced against the user's intention.
  • Further, the contexts stored in the memory are displayed on the display unit, the content of the dialogue to be continued is specified according to a user operation, and the voice for conversing with the user is generated using the specified content, so the user can easily specify which dialogue to continue.
  • The contexts stored in the memory can be displayed on the display unit in order from the most recent, and it is also possible to display only the most recent fixed number of entries (for example, five).
  • Since the dictionary used for speech recognition is changed according to the context specified by the user operation, the recognition rate of speech recognition can be improved. The content of the dialogue to be continued can also be specified by a spoken instruction from the user.
  • The contexts stored in the memory can also be deleted in response to a user operation. The embodiment above shows a configuration in which all stored contexts are erased by operating the reset button 310; alternatively, the device can be configured so that contexts are selected and erased one at a time.
  • In the embodiment above, the content recognized by speech recognition is stored as a context in the memory of the context history management unit 127, and when speech recognition is performed again after a session has completed, the dialogue voice is generated using the determined context so that the utterance content of the past session is continued. More generally, the voice recognition processing can be performed by generating the voice for conversing with the user using the context stored in the memory.
  • For reference, the memory of the context history management unit 127 is also referred to as a storage unit/device/means; S106 and S114 as a storage control section/device/means or storage instruction section/device/means; S306 as a speech recognition processing section/device/means or content usage recognition section/device/means; S102 as a confirmation section/device/means or content confirmation section/device/means; S108 as a display control section/device/means; S110 as a specification section/device/means or content specification section/device/means; the voice output content determination unit 125 as a sound generation unit/device/means; and the reset button 310 as an erasing unit/device/means or content deletion unit/device/means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Navigation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

In this speech recognition apparatus, content recognized by speech recognition and/or content executed in accordance with a manual operation by a user is stored in a memory (S106, S114). When speech recognition processing is executed again after a previous speech recognition processing has completed, the content stored in the memory is used to generate speech for speech interaction with the user, and the speech recognition processing (S300-S306) is executed.

Description

Voice recognition device

Cross-reference to related applications

This disclosure is based on Japanese Patent Application No. 2014-264 filed on January 6, 2014, the contents of which are incorporated herein by reference.

This disclosure relates to a speech recognition apparatus that recognizes, in an interactive manner, the content of speech uttered by a user.

Conventionally, various interactive voice recognition devices have been proposed that recognize the user's utterance and, based on the recognition result, repeatedly generate and output voice prompting the user's next utterance, thereby performing voice recognition in an interactive form.

As one such device, a speech recognition device has been proposed that interprets input information entered by a user, identifies a dialog agent capable of producing a response corresponding to that input, sends the input information to the agent, requests a response, and outputs the agent's response. The device queries multiple dialog agents about the information each can process, collates the input information against that processable information, selects a dialog agent able to process the input, transmits the input information to the selected agent, and receives a response (see, for example, Patent Document 1).

There is also a voice recognition device that records the user's utterance history for each subject, determines a new dialogue scenario for each subject based on that history, selects the speech recognition dictionary to be referenced based on the determined scenario, and recognizes the user's utterance against that dictionary (see, for example, Patent Document 2).
Patent Document 1: JP 2004-288018 A
Patent Document 2: JP 2008-287193 A
The apparatus described in Patent Document 1 selects a dialog agent that can handle the input information, transmits the input to the selected agent, and receives a response, so it can carry on a smooth dialogue close to natural conversation in which the category of the input changes frequently. However, it has no mechanism for continuing a past operation when interactive speech recognition is started again after a previous interactive session has completed; in that case the user must speak the same input information again, which may feel bothersome.

The device described in Patent Document 2 determines a new dialogue scenario for each subject from the user's utterance history in order to improve recognition accuracy, and limits the speech recognition dictionary to be referenced based on that scenario. However, this device likewise has no way of carrying a past operation over when interactive speech recognition is restarted after a session has completed, so the user again has to repeat the same input and may feel annoyed.

This disclosure aims to provide a voice recognition device that eliminates the annoyance users experience when past operations are not carried over.

To achieve the above object, according to one example of the present disclosure, a speech recognition apparatus is provided that includes a storage control section and a speech recognition processing section. The apparatus recognizes the content of the user's utterance, generates voice for voice conversation with the user based on the recognized content, and performs voice recognition processing in an interactive format. The storage control section causes a storage unit to store at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation. When performing voice recognition processing, the speech recognition processing section generates the voice for conversing with the user using the content stored in the storage unit and carries out the voice recognition processing.

With this configuration, at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation is stored in the storage unit, and when voice recognition processing is performed, the voice for conversing with the user is generated using the stored content. The annoyance caused to the user when past operations are not carried over can therefore be eliminated.
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description with reference to the accompanying drawings. The drawings show: the overall configuration of the navigation device (FIG. 1); the configuration of the voice recognition circuit and the voice dialogue control circuit (FIG. 2); a flowchart of the control circuit and voice dialogue control circuit of the navigation device (FIG. 3); a display example of the voice recognition top screen image (FIG. 4); a display example of the context display screen image (FIG. 5); and diagrams explaining the difference between dialogues with and without continuing the context (FIGS. 6 to 8).
FIG. 1 shows the overall configuration of a speech recognition apparatus according to an embodiment of the present disclosure. The speech recognition apparatus is configured as a navigation device 20 mounted and used in a vehicle (also referred to as a host vehicle). The navigation device 20 recognizes the content of speech uttered by the user, generates speech for voice conversation with the user based on the recognized content, performs speech recognition processing in an interactive format, and executes operations according to the recognized utterance content.

(1) Overall configuration of the navigation device 20

The navigation device 20 includes a position detector 21, a data input device 22, an operation switch group 23, a communication device 24, an external memory 25, a display device 26, a remote control sensor 27, a control circuit 28, and a voice recognition unit 10.
The position detector 21 includes a gyroscope 21a, a distance sensor 21b, and a GPS receiver 21c, and outputs the various information for specifying the current position input from these to the control circuit 28.

The data input device 22 is a device for inputting map data for map display and route search. In response to a request from the control circuit 28, it reads out the necessary map data from a map data storage medium. In addition to map data for map display and route search, the map data storage medium contains the dictionary data used when the voice recognition unit 10 performs recognition processing. The map data storage medium can be configured using a hard disk drive, CD, DVD, flash memory, or the like.

The operation switch group 23 includes various switches, such as touch switches arranged over the front surface of the display (also referred to as a display unit or display panel) of the display device 26 described later and mechanical switches provided around the display, and outputs signals corresponding to the user's switch operations to the control circuit 28.

The communication device 24 is for communicating with the outside and is configured, for example, by a mobile communication device such as a mobile phone.

The external memory 25 is composed of a portable storage medium such as a USB memory or an SD card and stores various data.

The display device 26 has a display such as a liquid crystal display and shows video and images (including screen images) according to the video signal input from the control circuit 28. The remote control sensor 27 receives radio signals transmitted from a remote control 27a used for remote operation.

The control circuit 28 is configured as a computer including a CPU, ROM, RAM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of the processes executed by the program may instead be executed by hardware components.

The processing of the control circuit 28 includes: host vehicle position detection processing that detects the host vehicle position based on the various information for specifying the current position input from the position detector 21; map display processing that displays the host vehicle position mark superimposed on a map around the host vehicle position; destination setting processing that sets a destination; route search processing that searches for a guidance route to the destination; and travel guidance processing that provides travel guidance along the guidance route.
The voice recognition unit 10 is a device that performs recognition processing on input voice collected by the microphone 15, generates dialogue voice, and outputs (speaks) the dialogue voice from the speaker 14.

(2) Voice recognition unit 10

In this embodiment, the voice recognition unit 10 is configured, as one example, as one or more computers each including a CPU, RAM, ROM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of these processes may instead be executed by hardware components.

The voice recognition unit 10 includes a voice synthesis circuit 11, a voice recognition circuit 12, and a voice dialogue control circuit 13, each of which may be configured as an individual computer, or which may together be configured as one or two computers. A speaker 14 is connected to the voice synthesis circuit 11, a microphone 15 is connected to the voice recognition circuit 12, and a PTT (push-to-talk) switch 16 is connected to the voice dialogue control circuit 13.

The voice recognition circuit 12 recognizes the input voice collected by the microphone 15 in accordance with an instruction from the voice dialogue control circuit 13 and notifies the voice dialogue control circuit 13 of the recognition result. That is, the voice recognition circuit 12 collates the voice data acquired from the microphone 15 against the stored dictionary data and outputs to the voice dialogue control circuit 13 the comparison target patterns with the highest degree of match among the plurality of comparison target pattern candidates.

The voice recognition circuit 12 sequentially performs acoustic analysis on the voice data input from the microphone 15 to extract acoustic feature quantities (for example, cepstra). The resulting time series of acoustic features is divided into sections using well-known techniques such as HMMs (Hidden Markov Models), DP matching, or neural networks, and the word sequence in the input speech is recognized by determining which word stored in the dictionary data each section corresponds to.

Based on the recognition result from the voice recognition circuit 12, the voice dialogue control circuit 13 instructs the voice synthesis circuit 11 to output a response voice, and also instructs the control circuit 28 of the navigation device 20, for example, by notifying it of a destination or a command needed for travel guidance processing and directing it to set the destination or execute the command.

The voice synthesis circuit 11 has a waveform database in which various speech waveforms are stored. Using the waveforms in this database, it synthesizes speech based on the response voice output instruction from the voice dialogue control circuit 13, and the synthesized speech is output from the speaker 14.

With the voice recognition unit 10 of this embodiment, the user speaks various commands for executing processes such as route setting, route guidance, facility search, and facility display into the microphone 15 while pressing the PTT switch 16. Specifically, the voice dialogue control circuit 13 monitors when the PTT switch 16 is pressed, when it is released, and how long it remains pressed; when the switch is pressed, it instructs the voice recognition circuit 12 to execute recognition processing, and when the switch is not pressed, recognition is not executed. Voice data input via the microphone 15 while the PTT switch 16 is pressed is therefore output to the voice recognition circuit 12.
 (3)音声認識回路12と音声対話制御回路13について
 ここで、本実施形態における音声認識回路12と音声対話制御回路13についてさらに説明する。図2に、音声認識回路12と音声対話制御回路13の構成を示す。
(3) Voice Recognition Circuit 12 and Voice Dialog Control Circuit 13 Here, the voice recognition circuit 12 and the voice dialog control circuit 13 in this embodiment will be further described. FIG. 2 shows configurations of the voice recognition circuit 12 and the voice dialogue control circuit 13.
 音声認識回路12は、音声抽出部101、音声認識照合部103、音声認識結果出力部105、音声認識辞書部107、対象辞書決定部109を備えている。また、本実施形態において、音声認識辞書部107は、コマンド対応辞書201、住所対応辞書203、楽曲対応辞書205、電話帳対応辞書207等を有している。 The speech recognition circuit 12 includes a speech extraction unit 101, a speech recognition collation unit 103, a speech recognition result output unit 105, a speech recognition dictionary unit 107, and a target dictionary determination unit 109. In this embodiment, the voice recognition dictionary unit 107 includes a command correspondence dictionary 201, an address correspondence dictionary 203, a music correspondence dictionary 205, a telephone directory correspondence dictionary 207, and the like.
 The voice extraction unit 101 extracts words from the voice data input from the microphone 15. The target dictionary determination unit 109 determines, from among the command dictionaries 201 to 207 of the voice recognition dictionary unit 107, the target dictionary to be used for voice recognition. The voice recognition collation unit 103 collates the words extracted by the voice extraction unit 101 against the target dictionary determined by the target dictionary determination unit 109. The voice recognition result output unit 105 outputs the recognition result, based on the collation performed by the voice recognition collation unit 103, to the voice dialogue processing unit 121 of the voice dialogue control circuit 13.
 The voice dialogue control circuit 13, in turn, includes a voice dialogue processing unit 121, a function execution processing determination unit 123, a voice output content determination unit 125, and a context history management unit 127.
 The voice dialogue processing unit 121 determines, from among previously prepared question phrases and dialogue phrases, a phrase that matches the voice recognition result output from the voice recognition result output unit 105 of the voice recognition circuit 12. In the present embodiment, when determining a phrase that matches the voice recognition result, the voice dialogue processing unit 121 can also determine the phrase using a context managed by the context history management unit 127, described later.
 The function execution processing determination unit 123 determines the processing to be executed based on the content processed by the voice dialogue processing unit 121 and notifies the control circuit 28 of the determined processing. The function execution processing determination unit 123 also acquires the voice recognition result output from the voice recognition result output unit 105 via the voice recognition circuit 12 and notifies the control circuit 28 of that result.
 The voice output content determination unit 125 determines the voice data to be output based on the content processed by the voice dialogue processing unit 121 and notifies the voice synthesis circuit 11 of the determined voice data.
 The control circuit 28 executes a function in accordance with the notification from the function execution processing determination unit 123 and, when the function execution is complete, notifies the context history management unit 127 of the executed content. The control circuit 28 also forwards to the context history management unit 127 the voice recognition result acquired via the function execution processing determination unit 123.
 The context history management unit 127 has a memory (not shown) in which it sequentially stores the content (contexts) notified from the control circuit 28, thereby managing a history of contexts. Here, a context is either the utterance content spoken by the user in a voice dialogue or the operation content executed in response to the user's manual operation. For example, if the user says "destination" and then "Tokyo" in a voice dialogue, "destination, Tokyo" is stored as a context in the memory of the context history management unit 127. Likewise, if the user manually selects "audio" and then "AM radio", "audio, AM radio" is stored as a context in the memory of the context history management unit 127.
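 As a rough sketch, the context history management unit 127 can be modeled as a small bounded store of (category, value) pairs; the class and method names below are illustrative, not taken from the patent, and the capacity of five follows the display example described later:

```python
from collections import deque

class ContextHistoryManager:
    """Keeps the most recent contexts, newest first; the oldest is evicted."""

    def __init__(self, capacity=5):
        self._history = deque(maxlen=capacity)

    def store(self, category, value):
        # e.g. store("destination", "Tokyo") or store("audio", "AM radio")
        self._history.appendleft((category, value))

    def latest(self):
        return list(self._history)  # newest first, for the display screen

    def reset(self):
        self._history.clear()  # corresponds to the reset button 310
```

 A deque with a maximum length gives the eviction behavior for free: appending at the left silently drops the entry at the right once the capacity is reached.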
 (4) Voice recognition processing
 FIG. 3 shows a flowchart of the voice recognition processing of the navigation device 20. Here, the processing performed by the control circuit 28 and the voice recognition unit 10 of the navigation device 20 is described as the voice recognition processing of the navigation device 20. When the ignition switch of the vehicle is turned from off to on, the navigation device 20 enters an operating state. Then, for example, when the start of voice recognition processing is instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 The flowcharts described in this application, and the processing of those flowcharts, comprise a plurality of sections (also referred to as steps), each of which is denoted, for example, S100. Each section can be divided into a plurality of subsections, while a plurality of sections can be combined into a single section. Each section can also be referred to as a device, module, or means. Each of these sections, alone or in combination, can be realized not only as (i) a software section combined with a hardware unit (e.g., a computer), but also as (ii) a hardware section (e.g., an integrated circuit or a wired logic circuit), with or without the functions of the associated devices. A hardware section can also be included inside a microcomputer.
 First, a voice recognition top screen image (also referred to as a top screen) is displayed on the display unit (or display panel) of the display device 26 (S100). Specifically, the voice recognition top screen image is displayed on the display unit of the display device 26 in accordance with an instruction from the control circuit 28 of the navigation device 20. FIG. 4 shows a display example of the voice recognition top screen image, which includes the message "Do you want to continue the context?" together with "Yes" and "No" choices. The user selects "Yes" to continue the utterance content of past voice recognition processing and "No" otherwise. If no context is stored in the memory of the context history management unit 127, only "No" is selectable.
 Next, it is determined whether or not to continue the context (S102). In accordance with the selection made on the voice recognition top screen, the control circuit 28 determines that the context is to be continued when the user selects "Yes" and that it is not to be continued when the user selects "No".
 If the user selects "No", the determination in S102 is NO, and the processing proceeds to the first voice recognition processing (S200 to S206); a sketch of this branch follows.
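 As a minimal sketch of this branch, assuming the ContextHistoryManager above, the session flow might look as follows; ask_continue, select_context, first_recognition, and second_recognition are hypothetical callables standing in for the top screen and the dialogue processing, not names from the patent:

```python
def run_recognition_session(manager, ask_continue, select_context,
                            first_recognition, second_recognition):
    # S100/S102: the top screen offers "Yes" only if a stored context exists
    if manager.latest() and ask_continue():
        context = select_context(manager.latest())  # S108/S110
        second_recognition(context)                 # S300 to S306
    else:
        first_recognition()                         # S200 to S206
```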
 In the first voice recognition processing, a user voice input is first performed (S200). While the PTT switch 16 is being pressed by a user operation, as shown in FIG. 6(a), when the user says, for example, "set the destination", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 The utterance "set the destination" is then recognized (S202), after which it is determined whether or not the operation to be executed has been decided (S204). This determination is made by the function execution processing determination unit 123.
 If the utterance content recognized in the voice recognition processing of S202 is not an operation command instructing execution of a predetermined function, the determination in S204 is NO, and a dialogue voice is generated and output (S206). This is done by the voice dialogue processing unit 121 and the voice output content determination unit 125. For example, in response to the utterance "set the destination", a dialogue voice such as "Please tell me the destination" is generated and output from the speaker 14, and the processing returns to S200. This loop can be sketched as shown below.
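 A rough sketch of the first voice recognition processing, under the assumption that the recognition circuit, the dialogue processing, the S204 decision, and the function execution are available as the callables recognize, respond, is_command, and execute (hypothetical names):

```python
def first_recognition(recognize, respond, is_command, execute):
    # S200 to S206: loop until an operation command is recognized
    while True:
        utterance = recognize()    # S200/S202: capture and recognize speech
        if is_command(utterance):  # S204: operation decided?
            execute(utterance)     # hands over to S104 (function execution)
            return
        respond(utterance)         # S206: e.g. "Please tell me the destination"
```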
 When the user then says "Tokyo", "Tokyo" is recognized in S202, a dialogue voice such as "The destination will be set to Tokyo" is generated in S206, and this dialogue voice is output from the speaker 14.
 Although not shown in FIG. 6(a), when the user then says "yes", the determination in S204 is YES, and the function is executed (S104). That is, the voice recognition unit 10 notifies the control circuit 28 that the destination is to be set to Tokyo; in response, the control circuit 28 sets the destination to Tokyo and notifies the voice recognition unit 10 that the destination has been set to Tokyo.
 Next, the context is stored (S106). Specifically, the control circuit 28 instructs the context history management unit 127 to store the context. In this case, "destination" is associated with the specific place name "Tokyo", "destination, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 Next, when the start of voice recognition processing is again instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 First, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100), and it is then determined whether or not to continue the context (S102).
 If the user selects "No", the determination in S102 is NO, and the processing proceeds to the first voice recognition processing (S200 to S206). When the determination in S102 is NO in this way, the context is not continued, as shown in FIG. 6(a).
 In the first voice recognition processing, a user voice input is first performed (S200). While the PTT switch 16 is being pressed by a user operation, as shown in FIG. 6(a), when the user says, for example, "What is the weather today?", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 The utterance "What is the weather today?" is then recognized in S202, after which it is determined whether or not the operation to be executed has been decided (S204).
 If the utterance content recognized in the voice recognition processing of S202 is not an operation command instructing execution of a predetermined function, the determination in S204 is NO, and a dialogue voice is generated and output from the speaker 14 (S206). For example, in response to the utterance "What is the weather today?", a dialogue voice such as "Where do you want to know the weather?" is generated and output from the speaker 14, and the processing returns to S200.
 When the user then says "Tokyo", "Tokyo" is recognized in S202, and a dialogue voice such as "Today's weather in Tokyo is sunny" is output from the speaker 14 in S206.
 Although not shown in FIG. 6(a), when the user says "end", the determination in S204 is YES, and the function is executed (S104). That is, the voice recognition unit 10 notifies the control circuit 28 that voice recognition is to be ended; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
 Next, the context is stored (S106). In this case, "weather information" is associated with the specific place name "Tokyo", "weather information, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 Next, when the start of voice recognition processing is again instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 First, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100).
 Here, if the user selects "Yes", the determination in S102 is YES, and the context is then displayed on the display unit of the display device 26 (S108). When the determination in S102 is YES in this way, the context is continued, as shown in FIG. 6(b). FIG. 5 shows a display example of the context display screen image, which includes a context display image 300 and a reset button 310.
 The context display image 300 lists five contexts in order from the most recent. When a new context is added, the oldest context is erased. A specific context can be selected from this list. The reset button 310 is used to reset all of the contexts.
 Next, the context is determined (S110). In the display example of FIG. 5, the context "destination, Tokyo" is highlighted. If, for example, the user says "next", the utterance is recognized and "destination, AA restaurant" becomes highlighted. Similarly, on the context display screen shown in FIG. 5, if the user says "fifth", the utterance is recognized and "air conditioner on" becomes highlighted. In this way, the highlighted context switches according to the content of the user's utterance. When the user says "decide", the highlighted context is determined as the context to be used for voice recognition; a sketch of this selection follows.
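 Assuming recognized words arrive as plain strings, the selection dialogue of S110 might look like the sketch below; the English command words stand in for the Japanese ones in the text, and all names are illustrative:

```python
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

def select_context(contexts, utterances):
    # S110: "next" advances the highlight, an ordinal jumps to that entry,
    # and "decide" confirms the highlighted context.
    index = 0  # the most recent context is highlighted first
    for word in utterances:
        if word == "next":
            index = (index + 1) % len(contexts)
        elif word in ORDINALS and ORDINALS[word] < len(contexts):
            index = ORDINALS[word]
        elif word == "decide":
            break
    return contexts[index]
```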
 Next, the processing proceeds to the second voice recognition processing (S300 to S306). In the second voice recognition processing, voice for a voice dialogue with the user is generated using the context determined in S110, so as to continue the utterance content of past voice recognition processing, and voice recognition processing is performed.
 Specifically, a user voice input is first performed (S300). While the PTT switch 16 is being pressed by a user operation, when the user says, for example, "What is the weather today?", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 Next, voice recognition processing is performed (S302). Specifically, the voice dialogue control circuit 13 instructs the voice recognition circuit 12 to execute voice recognition processing, and in response the voice recognition circuit 12 performs voice recognition processing on the voice data from the microphone 15.
 Here, the target dictionary to be used for voice recognition is specified based on the context determined in S110, and voice recognition processing is performed using that dictionary. For example, when the context determined in S110 is "destination, Tokyo", the address dictionary 203 and the telephone directory dictionary 207, which are related to destination setting, are used, while the music dictionary 205, which is unrelated to destination setting, is not used. Using only the minimum necessary dictionaries in this way improves the recognition rate; the narrowing can be sketched as a simple lookup, shown below.
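 A sketch of the dictionary narrowing, where the mapping from a context category to the dictionaries consulted is an assumption modeled on the example above (only the destination row is taken from the text):

```python
DICTIONARIES_BY_CATEGORY = {
    "destination": ["address_dictionary_203", "phonebook_dictionary_207"],
    "audio":       ["music_dictionary_205"],
}

def target_dictionaries(context):
    # S302: consult only the dictionaries relevant to the context's category
    category, _value = context  # e.g. ("destination", "Tokyo")
    return DICTIONARIES_BY_CATEGORY.get(category, [])
```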
 Next, it is determined whether or not the operation to be executed has been decided (S304). Specifically, this is determined based on whether or not the utterance content recognized in the voice recognition processing of S302 is an operation command instructing execution of a predetermined function.
 If the utterance content recognized in the voice recognition processing of S302 is not an operation command instructing execution of a predetermined function, the determination in S304 is NO, and a dialogue voice is generated and output (S306).
 Here, the voice for a voice dialogue with the user is generated using the context determined in S110, so as to continue the utterance content of past voice recognition processing. For example, when the context determined in S110 is "destination, Tokyo" and the user's utterance is recognized in S302 as "What is the weather today?", the specific place name "Tokyo" contained in the context is combined with "What is the weather today?" to generate a phrase such as "Today's weather in Tokyo is sunny", as shown in FIG. 6(b). This phrase is then output from the speaker 14. A sketch of this slot filling follows.
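 A sketch of the continued-dialogue phrase generation; get_weather is a hypothetical lookup (returning, e.g., "sunny") and the response template is illustrative:

```python
def respond_with_context(utterance, context, get_weather):
    # S306: the place name carried over from the stored context fills the
    # slot the user would otherwise have to repeat.
    _category, place = context         # e.g. ("destination", "Tokyo")
    if utterance == "What is the weather today?":
        forecast = get_weather(place)  # hypothetical weather lookup
        return f"Today's weather in {place} is {forecast}."
    return None  # other utterances are handled by other dialogue phrases
```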
 Although not shown in FIG. 6(b), when the user says "end", the determination in S304 is YES, and the function is executed (S112). That is, the voice recognition unit 10 notifies the control circuit 28 that voice recognition is to be ended; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
 Next, the context is stored (S114). In this case, "weather information" is associated with the specific place name "Tokyo", "weather information, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 As described above, when the context is not continued, the utterance content of past voice recognition processing is not carried over, so in response to the utterance "What is the weather today?", a dialogue voice such as "Where do you want to know the weather?" is output from the speaker 14, and the user must say "Tokyo" in response. That is, the input "Tokyo" is entered repeatedly. By contrast, when the context is continued, in response to the utterance "What is the weather today?", a dialogue voice such as "Today's weather in Tokyo is sunny" is output from the speaker 14 using the "Tokyo" contained in the dialogue of past voice recognition processing. That is, the weather information for Tokyo can be obtained without entering "Tokyo" again.
 FIG. 7 shows an example of a dialogue for searching for music by artist name and displaying that artist's album list: (a) is the dialogue when the context is not continued, and (b) when it is continued.
 As shown in FIG. 7(a), in response to the user's utterance "play music by artist name", the dialogue voice "Please tell me the artist name" is output from the speaker 14; then, in response to the user's utterance "Michael", the dialogue voice "Playing Michael's songs" is output from the speaker 14, and when the function is executed, "artist, Michael" is stored as a context in the memory of the context history management unit 127. After the series of voice recognition processing ends, if the context is not continued when voice recognition processing is started again, then in response to the user's utterance "display the album list", the dialogue voice "Which artist's album list do you want to display?" is output from the speaker 14, after which, in response to the user's utterance "Michael", the dialogue voice "Displaying Michael's album list" is output from the speaker 14. That is, the input "Michael" is entered repeatedly.
 By contrast, as shown in FIG. 7(b), when "artist, Michael" is stored as a context in the memory of the context history management unit 127 and the context is continued when voice recognition processing is started again after the series of voice recognition processing ends, then in response to the user's utterance "display the album list", the dialogue voice "Displaying Michael's album list" is output from the speaker 14. That is, Michael's album list can be displayed without entering "Michael" again.
 FIG. 8 shows an example of a dialogue for searching for a destination and then calling that destination: (a) is the dialogue when the context is not continued, and (b) when it is continued.
 As shown in FIG. 8(a), in response to the user's utterance "set the destination", the dialogue voice "Please tell me the destination" is output from the speaker 14; then, in response to the user's utterance "AA restaurant", the dialogue voice "Setting AA restaurant as the destination" is output from the speaker 14, and when the function is executed, "destination, AA restaurant" is stored as a context in the memory of the context history management unit 127. After the series of voice recognition processing ends, if the context is not continued when voice recognition processing is started again, then in response to the user's utterance "make a call", the dialogue voice "Where do you want to call?" is output from the speaker 14, after which, in response to the user's utterance "AA restaurant", the dialogue voice "Calling AA restaurant" is output from the speaker 14. That is, the input "AA restaurant" is entered repeatedly.
 By contrast, as shown in FIG. 8(b), when "destination, AA restaurant" is stored as a context in the memory of the context history management unit 127 and the context is continued when voice recognition processing is started again after the series of voice recognition processing ends, then in response to the user's utterance "make a call", the dialogue voice "Calling AA restaurant" is output from the speaker 14. That is, the call to AA restaurant can be placed without entering "AA restaurant" again.
 As described above, recognition processing that continues the context eliminates the annoyance of repeating the same input over and over and, by reducing the number of dialogue exchanges in the vehicle cabin environment, also contributes to safety.
 According to the configuration described above, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for a voice dialogue with the user is generated using the context stored in the memory. This eliminates the annoyance the user would otherwise feel when past operations are not continued.
 In addition, the user is asked to confirm whether or not to conduct a voice dialogue using the context stored in the memory, and the voice for the voice dialogue with the user is generated using the stored context only when such a dialogue is confirmed. This prevents voice that uses the context from being output against the user's intention.
 Furthermore, the contexts stored in the memory are displayed on the display unit, the content of the dialogue to be continued is specified in accordance with a user operation, and the voice for the voice dialogue with the user is generated using the specified content, so the user can easily specify the content of the dialogue to be continued.
 The contexts stored in the memory can be displayed on the display unit in order from the most recent, and the contents of a fixed number of the most recent operations (for example, five) can be displayed on the display unit.
 Moreover, since the dictionary used for voice recognition is changed according to the context specified by the user operation, the recognition rate of voice recognition can be improved. The content of the dialogue to be continued can also be specified by an instruction spoken by the user, and the contexts stored in the memory can be erased in response to a user operation.
 The present disclosure is not limited to the embodiment described above and can be variously modified as follows without departing from the spirit of the present disclosure.
 In the above embodiment, all contexts stored in the memory are erased by operating the reset button 310 in response to a user operation; however, the apparatus can also be configured so that contexts are selected and erased one at a time.
 In the above embodiment, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed again after it has been completed, the voice for a voice dialogue with the user is generated using the context stored in the memory so as to continue the utterance content of past voice recognition processing. Alternatively, the content executed in response to the user's manual operation may be stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for a voice dialogue with the user may be generated using the context stored in the memory.
 In the above embodiment, the memory of the context history management unit 127 is also referred to as a storage unit/device/means; S106 and S114 are also referred to as a storage control section/device/means or a storage instruction section/device/means; S300 to S306 are also referred to as a voice recognition processing section/device/means or a content utilization recognition section/device/means; S102 is also referred to as a confirmation section/device/means or a content confirmation section/device/means; S108 is also referred to as a display control section/device/means or a display instruction section/device/means; S110 is also referred to as a specification section/device/means or a content specification section/device/means; the voice output content determination unit 125 is also referred to as a voice generation unit/device/means; and the reset button 310 is also referred to as an erasure unit/device/means or a content erasure unit/device/means.
 Although the present disclosure has been described with reference to the embodiment, it is understood that the present disclosure is not limited to that embodiment or structure. The present disclosure encompasses various modifications and variations within the range of equivalents. In addition, various combinations and forms, as well as other combinations and forms including only one element, more, or less, also fall within the scope and spirit of the present disclosure.

Claims (8)

  1.  A speech recognition apparatus that recognizes utterance content spoken by a user, generates voice for a voice dialogue with the user based on the utterance content recognized by the voice recognition, and performs voice recognition processing in an interactive manner, the speech recognition apparatus comprising:
     a storage control section (S106, S114) that causes a storage unit (127) to store at least one of content recognized by the voice recognition and content executed in response to a manual operation by the user; and
     a voice recognition processing section (S300 to S306) that, when performing the voice recognition processing, generates voice for a voice dialogue with the user using the content stored in the storage unit and performs the voice recognition processing.
  2.  The speech recognition apparatus according to claim 1, further comprising a confirmation section (S102) that confirms with the user whether or not to conduct a voice dialogue using the content stored in the storage unit,
     wherein, when the confirmation section confirms that a voice dialogue using the content stored in the storage unit is to be conducted, the voice recognition processing section generates the voice for the voice dialogue with the user using the content stored in the storage unit.
  3.  The speech recognition apparatus according to claim 1 or 2, further comprising:
     a display control section (S108) that causes a display unit to display the content stored in the storage unit; and
     a specification section (S110) that specifies the content of the dialogue to be continued in accordance with a user operation,
     wherein the voice recognition processing section generates the voice for the voice dialogue with the user using the content specified by the specification section.
  4.  The speech recognition apparatus according to claim 3, wherein the display control section causes the display unit to display the content stored in the storage unit in order from the most recent.
  5.  The speech recognition apparatus according to claim 3 or 4, wherein the display control section causes the display unit to display the content of a fixed number of the most recent operations.
  6.  The speech recognition apparatus according to any one of claims 3 to 5, wherein the specification section specifies the content of the dialogue to be continued in accordance with an instruction spoken by the user.
  7.  The speech recognition apparatus according to any one of claims 3 to 6, wherein the voice recognition processing section changes a dictionary used in the voice recognition according to the content specified by the specification section.
  8.  The speech recognition apparatus according to any one of claims 1 to 7, further comprising an erasure unit (310) for erasing the content stored in the storage unit in response to a user operation.
PCT/JP2014/006171 2014-01-06 2014-12-11 Speech recognition apparatus WO2015102039A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-000264 2014-01-06
JP2014000264A JP2015129793A (en) 2014-01-06 2014-01-06 Voice recognition apparatus

Publications (1)

Publication Number Publication Date
WO2015102039A1 true WO2015102039A1 (en) 2015-07-09

Family

ID=53493388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/006171 WO2015102039A1 (en) 2014-01-06 2014-12-11 Speech recognition apparatus

Country Status (2)

Country Link
JP (1) JP2015129793A (en)
WO (1) WO2015102039A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10269351B2 (en) * 2017-05-16 2019-04-23 Google Llc Systems, methods, and apparatuses for resuming dialog sessions via automated assistant

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007264198A (en) * 2006-03-28 2007-10-11 Toshiba Corp Interactive device, interactive method, interactive system, computer program and interactive scenario generation device
JP2008083100A (en) * 2006-09-25 2008-04-10 Toshiba Corp Voice interactive device and method therefor
JP2010073192A (en) * 2008-08-20 2010-04-02 Universal Entertainment Corp Conversation scenario editing device, user terminal device, and automatic answering system
JP2012008554A (en) * 2010-05-24 2012-01-12 Denso Corp Voice recognition device

Also Published As

Publication number Publication date
JP2015129793A (en) 2015-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14876040; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14876040; Country of ref document: EP; Kind code of ref document: A1)