WO2015102039A1 - Speech recognition apparatus

Speech recognition apparatus

Info

Publication number
WO2015102039A1
WO2015102039A1 (PCT/JP2014/006171)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
content
speech
voice recognition
Prior art date
Application number
PCT/JP2014/006171
Other languages
French (fr)
Japanese (ja)
Inventor
鈴木 竜一
Original Assignee
株式会社デンソー (DENSO Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DENSO Corporation (株式会社デンソー)
Publication of WO2015102039A1 publication Critical patent/WO2015102039A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This disclosure relates to a speech recognition apparatus (Speech Recognition Apparatus) that recognizes speech uttered by a user in an interactive manner.
  • As one such device, a speech recognition device has been proposed that interprets input information entered by a user, identifies a dialog agent capable of producing a response corresponding to that input, sends the input information to the agent, requests a response, and outputs the agent's response. The device queries multiple dialog agents about the information each can process, collates the input information against that processable information, selects a dialog agent able to process the input, transmits the input information to the selected agent, and receives a response (see, for example, Patent Document 1).
  • There is also a voice recognition device that records the user's utterance history for each subject, determines a new dialogue scenario for each subject based on that history, selects the speech recognition dictionary to be referenced based on the determined scenario, and recognizes the user's utterance against that dictionary (see, for example, Patent Document 2).
  • The apparatus described in Patent Document 1 selects a dialog agent that can handle the input information, transmits the input to the selected agent, and receives a response, so it can carry on a smooth dialogue close to natural conversation in which the category of the input changes frequently. However, it has no mechanism for continuing a past operation when interactive speech recognition is started again after a previous interactive session has completed; in that case the user must speak the same input information again, which may feel bothersome.
  • The device described in Patent Document 2 determines a new dialogue scenario for each subject from the user's utterance history in order to improve recognition accuracy, and limits the speech recognition dictionary to be referenced based on that scenario. However, this device likewise has no way of carrying a past operation over when interactive speech recognition is restarted after a session has completed, so the user again has to repeat the same input and may feel annoyed.
  • This disclosure aims to provide a voice recognition device that eliminates the annoyance users experience when past operations are not carried over.
  • According to one example of the present disclosure, a speech recognition apparatus includes a storage control section and a speech recognition processing section. The apparatus recognizes the content of the user's utterance, generates voice for voice conversation with the user based on the recognized content, and performs voice recognition processing in an interactive format.
  • The storage control section causes a storage unit to store at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation. When performing voice recognition processing, the speech recognition processing section generates the voice for conversing with the user using the content stored in the storage unit and carries out the voice recognition processing.
  • With this configuration, because the stored content is used to generate the dialogue voice whenever recognition is performed, the annoyance caused to the user when past operations are not carried over can be eliminated.
  • Brief description of the drawings: FIG. 1 shows the overall configuration of the navigation device; FIG. 2 shows the configuration of the voice recognition circuit and the voice dialogue control circuit; FIG. 3 is a flowchart of the control circuit and voice dialogue control circuit of the navigation device; FIG. 4 shows a display example of the voice recognition top screen image; FIG. 5 shows a display example of the context display screen image; and FIGS. 6 to 8 explain the difference between dialogues with and without continuing the context.
  • FIG. 1 shows the overall configuration of a speech recognition apparatus according to an embodiment of the present disclosure.
  • the voice recognition device is configured as a navigation device 20 that is mounted and used in a vehicle (also referred to as a host vehicle).
  • the navigation device 20 recognizes speech content uttered by the user, generates speech for voice conversation with the user based on the speech content recognized by the speech recognition, and performs speech recognition processing in an interactive format. In addition, a process of executing an operation according to the utterance content recognized by the voice recognition is performed.
  • The navigation device 20 includes a position detector 21, a data input device 22, an operation switch group 23, a communication device 24, an external memory 25, a display device 26, a remote control sensor 27, a control circuit 28, and a voice recognition unit 10.
  • the position detector 21 includes a gyroscope 21a, a distance sensor 21b, and a GPS receiver 21c, and outputs various information for specifying the current position input from these to the control circuit 28.
  • the data input device 22 is a device for inputting map data for map display and route search.
  • the data input device 22 reads out necessary map data from a map data storage medium in which map data is stored in response to a request from the control circuit 28.
  • the map data storage medium includes not only map data for map display and route search, but also dictionary data used when the speech recognition unit 10 performs recognition processing.
  • the map data storage medium can be configured using a hard disk drive, CD, DVD, flash memory, or the like.
  • The operation switch group 23 includes various switches, such as touch switches arranged over the front surface of the display (also referred to as a display unit or display panel) of the display device 26 described later and mechanical switches provided around the display, and outputs signals corresponding to the user's switch operations to the control circuit 28.
  • the communication device 24 is for communicating with the outside, and is configured by a mobile communication device such as a mobile phone, for example.
  • the external memory 25 is composed of a portable storage medium such as a USB memory or an SD card. Various data are stored in the external memory 25.
  • the display device 26 has a display such as a liquid crystal, and displays video and images (including screen images) according to the video signal input from the control circuit 28 on the display.
  • the remote control sensor 27 receives a radio signal transmitted from a remote control 27a for performing a remote operation.
  • The control circuit 28 is configured as a computer including a CPU, ROM, RAM, I/O, and the like, and the CPU of the control circuit 28 performs various processes according to programs stored in the ROM. Part or all of the processes executed by the program may instead be executed by hardware components.
  • The processing of the control circuit 28 includes: host vehicle position detection processing that detects the host vehicle position based on the various information for specifying the current position input from the position detector 21; map display processing that displays the host vehicle position mark superimposed on a map around the host vehicle position; destination setting processing that sets a destination; route search processing that searches for a guidance route to the destination; and travel guidance processing that provides travel guidance along the guidance route.
  • the voice recognition unit 10 is a device that performs processing for recognizing input voice collected by the microphone 15, generates dialogue voice, and outputs (speaks) the dialogue voice from the speaker 14.
  • In this embodiment, the voice recognition unit 10 is configured, as one example, as one or more computers each including a CPU, RAM, ROM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of these processes may instead be executed by hardware components.
  • the voice recognition unit 10 includes a voice synthesis circuit 11, a voice recognition circuit 12, and a voice dialogue control circuit 13. Each of these may be composed of individual computers, or may be composed of one or two computers.
  • a speaker 14 is connected to the speech synthesis circuit 11, and a microphone 15 is connected to the speech recognition circuit 12.
  • A PTT (push-to-talk) switch 16 is connected to the voice dialogue control circuit 13.
  • The voice recognition circuit 12 recognizes the input voice collected by the microphone 15 in accordance with an instruction from the voice dialogue control circuit 13 and notifies the voice dialogue control circuit 13 of the recognition result. That is, the voice recognition circuit 12 collates the voice data acquired from the microphone 15 against the stored dictionary data and outputs to the voice dialogue control circuit 13 the comparison target patterns with the highest degree of match among the plurality of comparison target pattern candidates.
  • The voice recognition circuit 12 sequentially performs acoustic analysis on the voice data input from the microphone 15 to extract acoustic feature quantities (for example, cepstra). The resulting time series of acoustic features is divided into sections using well-known techniques such as HMMs (Hidden Markov Models), DP matching, or neural networks, and the word sequence in the input speech is recognized by determining which word stored in the dictionary data each section corresponds to. A sketch of the DP-matching idea follows.
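  • As an illustrative sketch only (not the patent's implementation; the feature extraction and dictionary contents are assumed), an isolated-word DP matcher can compare a spoken feature sequence against reference sequences stored as dictionary data:

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic-programming (DTW) distance between two feature sequences.

    Each sequence is a list of per-frame feature vectors (e.g., cepstra).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # frame-level distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

def recognize_word(features, dictionary):
    """Return dictionary words ranked by degree of match, best first."""
    scored = sorted((dtw_distance(features, ref), word)
                    for word, ref in dictionary.items())
    return [word for _, word in scored]
```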
  • Based on the recognition result from the voice recognition circuit 12, the voice dialogue control circuit 13 instructs the voice synthesis circuit 11 to output a response voice, and also instructs the control circuit 28 of the navigation device 20, for example, by notifying it of a destination or a command needed for travel guidance processing and directing it to set the destination or execute the command.
  • The voice synthesis circuit 11 has a waveform database in which various speech waveforms are stored. Using the waveforms in this database, it synthesizes speech based on the response voice output instruction from the voice dialogue control circuit 13, and the synthesized speech is output from the speaker 14.
  • With the voice recognition unit 10 of this embodiment, the user speaks various commands for executing processes such as route setting, route guidance, facility search, and facility display into the microphone 15 while pressing the PTT switch 16. Specifically, the voice dialogue control circuit 13 monitors when the PTT switch 16 is pressed, when it is released, and how long it remains pressed; when the switch is pressed, it instructs the voice recognition circuit 12 to execute recognition processing, and when the switch is not pressed, recognition is not executed. Voice data input via the microphone 15 while the PTT switch 16 is pressed is therefore output to the voice recognition circuit 12, as sketched below.
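  • A minimal sketch of this push-to-talk gating; the recognizer interface and audio frame format are assumptions for illustration, not the patent's API:

```python
class PttGate:
    """Forwards microphone frames to the recognizer only while PTT is pressed."""

    def __init__(self, recognizer):
        self.recognizer = recognizer
        self.pressed = False

    def on_ptt_down(self):
        self.pressed = True
        self.recognizer.start()      # instruct recognition to begin

    def on_ptt_up(self):
        self.pressed = False
        self.recognizer.finish()     # close out the utterance

    def on_audio_frame(self, frame):
        if self.pressed:             # frames while released are dropped
            self.recognizer.feed(frame)
```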
  • FIG. 2 shows configurations of the voice recognition circuit 12 and the voice dialogue control circuit 13.
  • the speech recognition circuit 12 includes a speech extraction unit 101, a speech recognition collation unit 103, a speech recognition result output unit 105, a speech recognition dictionary unit 107, and a target dictionary determination unit 109.
  • the voice recognition dictionary unit 107 includes a command correspondence dictionary 201, an address correspondence dictionary 203, a music correspondence dictionary 205, a telephone directory correspondence dictionary 207, and the like.
  • the voice extraction unit 101 extracts words from the voice data input from the microphone 15.
  • The target dictionary determination unit 109 determines the target dictionary to be used for speech recognition from among the dictionaries 201 to 207 of the speech recognition dictionary unit 107. The speech recognition collation unit 103 then collates the words extracted by the speech extraction unit 101 using the determined target dictionary.
  • the voice recognition result output unit 105 outputs the result of voice recognition based on the collation result of the voice recognition collation unit 103 to the voice dialogue processing unit 121 of the voice dialogue control circuit 13.
  • the voice dialogue control circuit 13 includes a voice dialogue processing unit 121, a function execution processing determination unit 123, a voice output content determination unit 125, and a context history management unit 127.
  • The voice dialogue processing unit 121 determines a phrase that matches the voice recognition result output from the voice recognition result output unit 105 of the voice recognition circuit 12 from among previously prepared question phrases and dialogue phrases. In this embodiment, when determining a matching phrase, the voice dialogue processing unit 121 can also determine the phrase using a context managed by the context history management unit 127 described later.
  • the function execution process determination unit 123 determines a process to execute a function based on the content processed by the voice interaction processing unit 121 and notifies the control circuit 28 of the determined process. In addition, the function execution process determination unit 123 acquires the speech recognition result output from the speech recognition result output unit 105 via the speech recognition circuit 12 and notifies the control circuit 28 of the speech recognition result.
  • the voice output content determination unit 125 determines the voice data to be output based on the content processed by the voice dialog processing unit 121 and notifies the voice synthesis circuit 11 of the determined voice data.
  • the control circuit 28 executes the function in accordance with the notification from the function execution process determination unit 123, and when the function execution is completed, notifies the context history management unit 127 of the executed content. In addition, the control circuit 28 notifies the context history management unit 127 of the voice recognition result acquired via the function execution process determination unit 123.
  • the context history management unit 127 has a memory (not shown), and sequentially stores contents (contexts) notified from the control circuit 28 in this memory to manage the context history.
  • Here, a context means the utterance content spoken by the user in a voice dialogue or the operation content executed in response to the user's manual operation. For example, when the user speaks “Destination” and then “Tokyo” in a voice dialogue, “Destination, Tokyo” is stored as a context in the memory of the context history management unit 127. Likewise, when the user manually selects “audio” and then “AM radio”, “audio, AM radio” is stored as a context.
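  • A minimal sketch of this context history management, covering storage on completion, most-recent-first listing, single-entry deletion, and the reset-all behavior of the reset button 310 described later; the class and method names are hypothetical:

```python
from collections import deque

class ContextHistoryManager:
    """Stores context entries such as ("Destination", "Tokyo")."""

    def __init__(self, max_entries=100):
        self._history = deque(maxlen=max_entries)

    def store(self, category, value):
        # e.g., store("Destination", "Tokyo") or store("audio", "AM radio")
        self._history.append((category, value))

    def recent(self, n=5):
        # Most recent contexts first, e.g., for the context display screen.
        return list(self._history)[-n:][::-1]

    def delete(self, entry):
        self._history.remove(entry)  # erase a single selected context

    def reset(self):
        self._history.clear()        # the "reset button 310" behavior
```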
  • FIG. 3 shows a flowchart of the voice recognition process of the navigation device 20.
  • the processing of the control circuit 28 and the voice recognition unit 10 of the navigation device 20 will be described as the voice recognition processing of the navigation device 20.
  • When the vehicle's ignition switch is turned from off to on, the navigation device 20 enters an operating state. Then, for example, when the start of voice recognition processing is instructed by a user operation on the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the process shown in FIG. 3.
  • The flowchart described in this application, or its processing, includes a plurality of sections (also referred to as steps), and each section is expressed as, for example, S100.
  • each section can be divided into a plurality of subsections, while a plurality of sections can be combined into one section.
  • each section can be referred to as a device, module, or means.
  • Each of the above sections, or a combination thereof, can be realized not only as (i) a section of software combined with a hardware unit (e.g., a computer) but also as (ii) a section of hardware (e.g., an integrated circuit or hard-wired logic circuit), with or without the functions of related devices.
  • the hardware section can be included inside the microcomputer.
  • a voice recognition top screen image (also referred to as a top screen) is displayed on the display unit (or display panel) of the display device 26 (S100). Specifically, a voice recognition top screen image is displayed on the display unit of the display device 26 according to an instruction from the control circuit 28 of the navigation device 20.
  • FIG. 4 shows a display example of the voice recognition top screen image.
  • This voice recognition top screen image includes “Yes” and “No” options along with the message “Do you want to continue the context?”.
  • The user selects “Yes” to continue the utterance content from past voice recognition processing and “No” when continuation is not desired. If no context is stored in the memory of the context history management unit 127, only “No” can be selected.
  • The control circuit 28 determines that the context is to be continued when the user selects “Yes” and not continued when the user selects “No” (S102).
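  • The S100-S102 confirmation step might look like the following sketch, reusing the hypothetical context manager above; the display interface is an assumption:

```python
def confirm_context_continuation(display, context_manager):
    """Show the top screen and ask whether to continue the context.

    Only "No" is offered when no context is stored, as described above.
    """
    has_context = len(context_manager.recent(1)) > 0
    options = ["Yes", "No"] if has_context else ["No"]
    choice = display.prompt("Do you want to continue the context?", options)
    return choice == "Yes"   # True: continue the past context
```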
  • Next, a user voice input is performed (S200). While the PTT switch 16 is being pressed by a user operation, for example, when the user speaks “Set destination” as shown in FIG. 6A, the voice data from the microphone 15 is input to the voice recognition circuit 12.
  • If the operation to be executed has not been determined, the determination in S204 is NO, and a dialogue voice is generated and output (S206). This is performed by the voice dialogue processing unit 121 and the voice output content determination unit 125.
  • For example, a dialogue voice such as “Please tell me the destination” is generated for the utterance content “Set destination”, output from the speaker 14, and the process returns to S200. When the user then speaks “Tokyo”, the utterance is recognized in S202, a dialogue voice such as “Setting the destination to Tokyo” is generated in S206, and the voice is output.
  • The voice recognition unit 10 notifies the control circuit 28 that the destination is to be set to Tokyo; in response, the control circuit 28 sets the destination to Tokyo and notifies the voice recognition unit 10 that this has been done.
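  • The S200-S206 loop, recognizing utterances and asking follow-up questions until an executable operation is determined, can be sketched as below; the recognizer, dialogue, and speaker interfaces are hypothetical stand-ins for the voice recognition circuit 12, voice dialogue control circuit 13, and speaker 14:

```python
def interactive_recognition(recognizer, dialogue, speaker):
    """Recognize, then either return the determined operation or
    ask a follow-up question and listen again."""
    while True:
        utterance = recognizer.listen()          # S200/S202: voice input
        operation = dialogue.match_operation(utterance)
        if operation is not None:                # S204: operation decided?
            return operation                     # e.g., set destination
        prompt = dialogue.follow_up(utterance)   # S206: e.g., "Please tell
        speaker.say(prompt)                      # me the destination"
```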
  • Next, the context is stored (S106). Specifically, the control circuit 28 instructs the context history management unit 127 to store the context: “Destination” is associated with the specific place name “Tokyo”, and “Destination, Tokyo” is stored as a context in the memory of the context history management unit 127, after which this processing ends.
  • When voice recognition processing is started again, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100), and it is then determined whether or not to continue the context (S102).
  • Next, a user voice input is performed (S200). As shown in FIG. 6A, for example, when the user speaks “What is the weather today?” while the PTT switch 16 is being pressed, the voice data from the microphone 15 is input to the voice recognition circuit 12.
  • Since the operation to be executed has not been determined, the determination in S204 is NO, and a dialogue voice is generated and output (S206). For example, a dialogue voice such as “Where do you want to know the weather for?” is generated for the utterance “What is the weather today?”, output from the speaker 14, and the process returns to S200.
  • The voice recognition unit 10 then notifies the control circuit 28 that voice recognition has finished; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
  • the context is stored (S106).
  • “Weather information” is associated with the specific place name “Tokyo”, “weather information, Tokyo” is stored in the memory of the context history management unit 127 as a context, and this process ends.
  • the voice recognition top screen image as shown in FIG. 4 is displayed on the display unit of the display device 26 (S100).
  • FIG. 5 shows a display example of a context display screen image.
  • the context display screen image includes a context display image 300 and a reset button 310.
  • the reset button 310 is a button used when all contexts are reset.
  • Next, the context is determined (S110). In the display example of FIG. 5, the context indicating “Destination, Tokyo” is highlighted. Here, when the user utters “next”, the utterance is recognized and “Destination, AA restaurant” becomes highlighted; when the user utters “fifth” while the context display screen of FIG. 5 is shown, “air conditioner on” becomes highlighted. In this way, the highlighted context switches according to the content of the user's utterance, and when the user utters “determine”, the highlighted context is fixed as the context used for speech recognition.
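  • A sketch of this highlight-and-determine selection (S110); the command words “next”, “fifth”, and “determine” follow the examples above, while the wrap-around behavior is an assumption:

```python
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

def select_context(displayed_contexts, utterances):
    """Move the highlight with "next" or an ordinal; fix it with "determine"."""
    index = 0                                        # first entry highlighted
    for word in utterances:
        if word == "next":
            index = (index + 1) % len(displayed_contexts)
        elif word in ORDINALS:
            index = ORDINALS[word]
        elif word == "determine":
            return displayed_contexts[index]         # context to be continued
    return None
```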
  • the process proceeds to the second voice recognition process (S300 to S306).
  • the voice recognition process is performed by generating voice for voice conversation with the user so as to continue the utterance contents in the past voice recognition process using the context determined in S110.
  • a user's voice input is performed (S300). For example, when the user speaks “What is the weather today?” During the period in which the PTT switch 16 is being pressed by a user operation, voice data from the microphone 15 is input to the voice recognition circuit 12.
  • voice recognition processing is performed (S302). Specifically, the voice dialogue control circuit 13 instructs the voice recognition circuit 12 to execute voice recognition processing, and the voice recognition circuit 12 performs voice recognition processing of voice data from the microphone 15 in response to this instruction.
  • A target dictionary to be used for speech recognition is specified based on the context determined in S110, and speech recognition processing is performed using this target dictionary. For example, when the context determined in S110 is “Destination, Tokyo”, the address correspondence dictionary 203 and the telephone directory correspondence dictionary 207, which relate to destination setting, are used, while the unrelated music correspondence dictionary 205 is not used. Using only the minimum necessary dictionaries in this way improves the recognition rate.
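  • The context-dependent dictionary selection can be sketched as a simple mapping; the category names and dictionary identifiers are illustrative, mirroring the dictionaries 201-207 described above:

```python
# Hypothetical mapping from context category to the dictionaries relevant to it.
DICTIONARIES_FOR_CATEGORY = {
    "Destination": ["address_dictionary_203", "phonebook_dictionary_207"],
    "Artist":      ["music_dictionary_205"],
}

def target_dictionaries(context, default=("command_dictionary_201",)):
    """Pick the minimum set of dictionaries for the determined context;
    unrelated dictionaries are simply not returned."""
    category, _value = context   # e.g., ("Destination", "Tokyo")
    return DICTIONARIES_FOR_CATEGORY.get(category, list(default))
```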
  • In S304, it is determined whether or not the operation to be executed has been determined; specifically, this is judged by whether the utterance content recognized in the voice recognition processing of S302 is an operation command instructing execution of a predetermined function.
  • If not, the determination in S304 is NO, and a dialogue voice is generated and output (S306).
  • a voice for voice conversation with the user is generated so as to continue the utterance contents in the past voice recognition processing.
  • For example, when the context determined in S110 is “Destination, Tokyo” and the user's utterance is recognized in S302 as “What is the weather today?”, the specific place name contained in the context, “Tokyo”, is combined with “What is the weather today?” to generate a phrase such as “Today's weather in Tokyo is sunny”. This phrase is then output as audio from the speaker 14.
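  • A sketch of this slot filling: the place name from the stored context completes the otherwise unspecified weather query. The weather_service lookup is hypothetical; the patent only specifies combining the context's place name with the query:

```python
def answer_with_context(context, utterance, weather_service):
    """Fill the missing place slot from the stored context (S306 behavior)."""
    _category, place = context                   # e.g., ("Destination", "Tokyo")
    if "weather" in utterance.lower():
        forecast = weather_service.today(place)  # e.g., "sunny"
        return f"Today's weather in {place} is {forecast}"
    return None
```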
  • The voice recognition unit 10 then notifies the control circuit 28 that voice recognition has finished; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
  • the context is stored (S114).
  • “Weather information” is associated with the specific place name “Tokyo”, “weather information, Tokyo” is stored in the memory of the context history management unit 127 as a context, and this process ends.
  • Fig. 7 shows an example of dialogue when searching for a song by artist name and displaying the artist's album list.
  • In FIG. 7, (a) is an example of the dialogue when the context is not continued, and (b) is an example when the context is continued.
  • In (a), a dialogue voice “Please tell me the artist name” is output from the speaker 14; when the user answers “Michael”, a dialogue voice “Playing Michael's song” is output from the speaker 14, and when the function is executed, “Artist, Michael” is stored as the context in the memory of the context history management unit 127. When the user later utters “display album list”, a response voice is output from the speaker 14, and the dialogue voice “Displaying Michael's album list” is output only in response to the user uttering “Michael” again. That is, the input “Michael” has to be repeated.
  • In (b), “Artist, Michael” is stored as a context in the memory of the context history management unit 127, and after the series of speech recognition processing is completed, speech recognition is started again. This time, in response to the user's utterance “display album list”, the dialogue voice “Displaying Michael's album list” is output from the speaker 14. That is, Michael's album list can be displayed without inputting “Michael” again.
  • Fig. 8 shows an example of a dialog when searching for a destination and wanting to call the destination.
  • In FIG. 8, (a) is an example of the dialogue when the context is not continued, and (b) is an example when the context is continued.
  • In (a), a dialogue voice “Please tell me the destination” is output from the speaker 14; when the user answers “AA restaurant”, the speaker 14 outputs the dialogue voice “Setting AA restaurant as the destination”, and when the function is executed, “Destination, AA restaurant” is stored as a context in the memory of the context history management unit 127. When the user later utters “make a call”, the speaker 14 responds with “Where do you want to call?”, and only after the user utters “AA restaurant” again does the speaker 14 output the dialogue voice “Calling AA restaurant”. That is, the input “AA restaurant” has to be repeated.
  • In (b), “Destination, AA restaurant” is stored as a context in the memory of the context history management unit 127, and after the series of voice recognition processing, voice recognition is performed again. This time, in response to the user's utterance “call”, the dialogue voice “Calling AA restaurant” is output from the speaker 14. That is, the call can be placed without inputting “AA restaurant” again.
  • Performing recognition processing with a continued context in this way eliminates the hassle of repeating the same input over and over, and by reducing the number of exchanges needed in the vehicle cabin environment it also helps ensure safety.
  • As described above, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for conversing with the user is generated using the context stored in the memory. The annoyance caused to the user when past operations are not carried over can therefore be eliminated.
  • The device also confirms with the user whether to conduct the voice dialogue using the context stored in the memory, and generates the dialogue voice from the stored context only when the user confirms. This prevents context-based speech from being produced against the user's intention.
  • Further, the contexts stored in the memory are displayed on the display unit, the content of the dialogue to be continued is specified according to a user operation, and the voice for conversing with the user is generated using the specified content, so the user can easily specify which dialogue to continue.
  • The contexts stored in the memory can be displayed on the display unit in order from the most recent, and it is also possible to display only the most recent fixed number of entries (for example, five).
  • Since the dictionary used for speech recognition is changed according to the context specified by the user operation, the recognition rate of speech recognition can be improved. The content of the dialogue to be continued can also be specified by a spoken instruction from the user.
  • The contexts stored in the memory can also be deleted in response to a user operation. The embodiment above shows a configuration in which all stored contexts are erased by operating the reset button 310; alternatively, the device can be configured so that contexts are selected and erased one at a time.
  • In the embodiment above, the content recognized by speech recognition is stored as a context in the memory of the context history management unit 127, and when speech recognition is performed again after a session has completed, the dialogue voice is generated using the determined context so that the utterance content of the past session is continued. More generally, the voice recognition processing can be performed by generating the voice for conversing with the user using the context stored in the memory.
  • For reference, the memory of the context history management unit 127 is also referred to as a storage unit/device/means; S106 and S114 as a storage control section/device/means or storage instruction section/device/means; S306 as a speech recognition processing section/device/means or content usage recognition section/device/means; S102 as a confirmation section/device/means or content confirmation section/device/means; S108 as a display control section/device/means; S110 as a specification section/device/means or content specification section/device/means; the voice output content determination unit 125 as a sound generation unit/device/means; and the reset button 310 as an erasing unit/device/means or content deletion unit/device/means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Navigation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

In this speech recognition apparatus, content recognized by speech recognition and/or content executed in accordance with a manual operation by a user is stored in a memory (S106, S114). When speech recognition processing is executed again after a previous speech recognition processing has completed, the content stored in the memory is used to generate speech for speech interaction with the user, and the speech recognition processing (S300-S306) is executed.

Description

Voice recognition device

Cross-reference to related applications

This disclosure is based on Japanese Patent Application No. 2014-264 filed on January 6, 2014, the contents of which are incorporated herein by reference.

This disclosure relates to a speech recognition apparatus that recognizes, in an interactive manner, the content of speech uttered by a user.

Conventionally, various interactive voice recognition devices have been proposed that recognize the user's utterance and, based on the recognition result, repeatedly generate and output voice prompting the user's next utterance, thereby performing voice recognition in an interactive form.

As one such device, a speech recognition device has been proposed that interprets input information entered by a user, identifies a dialog agent capable of producing a response corresponding to that input, sends the input information to the agent, requests a response, and outputs the agent's response. The device queries multiple dialog agents about the information each can process, collates the input information against that processable information, selects a dialog agent able to process the input, transmits the input information to the selected agent, and receives a response (see, for example, Patent Document 1).

There is also a voice recognition device that records the user's utterance history for each subject, determines a new dialogue scenario for each subject based on that history, selects the speech recognition dictionary to be referenced based on the determined scenario, and recognizes the user's utterance against that dictionary (see, for example, Patent Document 2).
Patent Document 1: JP 2004-288018 A
Patent Document 2: JP 2008-287193 A
The apparatus described in Patent Document 1 selects a dialog agent that can handle the input information, transmits the input to the selected agent, and receives a response, so it can carry on a smooth dialogue close to natural conversation in which the category of the input changes frequently. However, it has no mechanism for continuing a past operation when interactive speech recognition is started again after a previous interactive session has completed; in that case the user must speak the same input information again, which may feel bothersome.

The device described in Patent Document 2 determines a new dialogue scenario for each subject from the user's utterance history in order to improve recognition accuracy, and limits the speech recognition dictionary to be referenced based on that scenario. However, this device likewise has no way of carrying a past operation over when interactive speech recognition is restarted after a session has completed, so the user again has to repeat the same input and may feel annoyed.

This disclosure aims to provide a voice recognition device that eliminates the annoyance users experience when past operations are not carried over.

To achieve the above object, according to one example of the present disclosure, a speech recognition apparatus is provided that includes a storage control section and a speech recognition processing section. The apparatus recognizes the content of the user's utterance, generates voice for voice conversation with the user based on the recognized content, and performs voice recognition processing in an interactive format. The storage control section causes a storage unit to store at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation. When performing voice recognition processing, the speech recognition processing section generates the voice for conversing with the user using the content stored in the storage unit and carries out the voice recognition processing.

With this configuration, at least one of the content recognized by voice recognition and the content executed in response to the user's manual operation is stored in the storage unit, and when voice recognition processing is performed, the voice for conversing with the user is generated using the stored content. The annoyance caused to the user when past operations are not carried over can therefore be eliminated.
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description with reference to the accompanying drawings. The drawings show: the overall configuration of the navigation device (FIG. 1); the configuration of the voice recognition circuit and the voice dialogue control circuit (FIG. 2); a flowchart of the control circuit and voice dialogue control circuit of the navigation device (FIG. 3); a display example of the voice recognition top screen image (FIG. 4); a display example of the context display screen image (FIG. 5); and diagrams explaining the difference between dialogues with and without continuing the context (FIGS. 6 to 8).
FIG. 1 shows the overall configuration of a speech recognition apparatus according to an embodiment of the present disclosure. The speech recognition apparatus is configured as a navigation device 20 mounted and used in a vehicle (also referred to as a host vehicle). The navigation device 20 recognizes the content of speech uttered by the user, generates speech for voice conversation with the user based on the recognized content, performs speech recognition processing in an interactive format, and executes operations according to the recognized utterance content.

(1) Overall configuration of the navigation device 20

The navigation device 20 includes a position detector 21, a data input device 22, an operation switch group 23, a communication device 24, an external memory 25, a display device 26, a remote control sensor 27, a control circuit 28, and a voice recognition unit 10.
The position detector 21 includes a gyroscope 21a, a distance sensor 21b, and a GPS receiver 21c, and outputs the various information for specifying the current position input from these to the control circuit 28.

The data input device 22 is a device for inputting map data for map display and route search. In response to a request from the control circuit 28, it reads out the necessary map data from a map data storage medium. In addition to map data for map display and route search, the map data storage medium contains the dictionary data used when the voice recognition unit 10 performs recognition processing. The map data storage medium can be configured using a hard disk drive, CD, DVD, flash memory, or the like.

The operation switch group 23 includes various switches, such as touch switches arranged over the front surface of the display (also referred to as a display unit or display panel) of the display device 26 described later and mechanical switches provided around the display, and outputs signals corresponding to the user's switch operations to the control circuit 28.

The communication device 24 is for communicating with the outside and is configured, for example, by a mobile communication device such as a mobile phone.

The external memory 25 is composed of a portable storage medium such as a USB memory or an SD card and stores various data.

The display device 26 has a display such as a liquid crystal display and shows video and images (including screen images) according to the video signal input from the control circuit 28. The remote control sensor 27 receives radio signals transmitted from a remote control 27a used for remote operation.

The control circuit 28 is configured as a computer including a CPU, ROM, RAM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of the processes executed by the program may instead be executed by hardware components.

The processing of the control circuit 28 includes: host vehicle position detection processing that detects the host vehicle position based on the various information for specifying the current position input from the position detector 21; map display processing that displays the host vehicle position mark superimposed on a map around the host vehicle position; destination setting processing that sets a destination; route search processing that searches for a guidance route to the destination; and travel guidance processing that provides travel guidance along the guidance route.
The voice recognition unit 10 is a device that performs recognition processing on input voice collected by the microphone 15, generates dialogue voice, and outputs (speaks) the dialogue voice from the speaker 14.

(2) Voice recognition unit 10

In this embodiment, the voice recognition unit 10 is configured, as one example, as one or more computers each including a CPU, RAM, ROM, I/O, and the like; the CPU performs various processes according to programs stored in the ROM. Part or all of these processes may instead be executed by hardware components.

The voice recognition unit 10 includes a voice synthesis circuit 11, a voice recognition circuit 12, and a voice dialogue control circuit 13, each of which may be configured as an individual computer, or which may together be configured as one or two computers. A speaker 14 is connected to the voice synthesis circuit 11, a microphone 15 is connected to the voice recognition circuit 12, and a PTT (push-to-talk) switch 16 is connected to the voice dialogue control circuit 13.

The voice recognition circuit 12 recognizes the input voice collected by the microphone 15 in accordance with an instruction from the voice dialogue control circuit 13 and notifies the voice dialogue control circuit 13 of the recognition result. That is, the voice recognition circuit 12 collates the voice data acquired from the microphone 15 against the stored dictionary data and outputs to the voice dialogue control circuit 13 the comparison target patterns with the highest degree of match among the plurality of comparison target pattern candidates.

The voice recognition circuit 12 sequentially performs acoustic analysis on the voice data input from the microphone 15 to extract acoustic feature quantities (for example, cepstra). The resulting time series of acoustic features is divided into sections using well-known techniques such as HMMs (Hidden Markov Models), DP matching, or neural networks, and the word sequence in the input speech is recognized by determining which word stored in the dictionary data each section corresponds to.

Based on the recognition result from the voice recognition circuit 12, the voice dialogue control circuit 13 instructs the voice synthesis circuit 11 to output a response voice, and also instructs the control circuit 28 of the navigation device 20, for example, by notifying it of a destination or a command needed for travel guidance processing and directing it to set the destination or execute the command.

The voice synthesis circuit 11 has a waveform database in which various speech waveforms are stored. Using the waveforms in this database, it synthesizes speech based on the response voice output instruction from the voice dialogue control circuit 13, and the synthesized speech is output from the speaker 14.

With the voice recognition unit 10 of this embodiment, the user speaks various commands for executing processes such as route setting, route guidance, facility search, and facility display into the microphone 15 while pressing the PTT switch 16. Specifically, the voice dialogue control circuit 13 monitors when the PTT switch 16 is pressed, when it is released, and how long it remains pressed; when the switch is pressed, it instructs the voice recognition circuit 12 to execute recognition processing, and when the switch is not pressed, recognition is not executed. Voice data input via the microphone 15 while the PTT switch 16 is pressed is therefore output to the voice recognition circuit 12.
 (3)音声認識回路12と音声対話制御回路13について
 ここで、本実施形態における音声認識回路12と音声対話制御回路13についてさらに説明する。図2に、音声認識回路12と音声対話制御回路13の構成を示す。
(3) Voice Recognition Circuit 12 and Voice Dialog Control Circuit 13 Here, the voice recognition circuit 12 and the voice dialog control circuit 13 in this embodiment will be further described. FIG. 2 shows configurations of the voice recognition circuit 12 and the voice dialogue control circuit 13.
 音声認識回路12は、音声抽出部101、音声認識照合部103、音声認識結果出力部105、音声認識辞書部107、対象辞書決定部109を備えている。また、本実施形態において、音声認識辞書部107は、コマンド対応辞書201、住所対応辞書203、楽曲対応辞書205、電話帳対応辞書207等を有している。 The speech recognition circuit 12 includes a speech extraction unit 101, a speech recognition collation unit 103, a speech recognition result output unit 105, a speech recognition dictionary unit 107, and a target dictionary determination unit 109. In this embodiment, the voice recognition dictionary unit 107 includes a command correspondence dictionary 201, an address correspondence dictionary 203, a music correspondence dictionary 205, a telephone directory correspondence dictionary 207, and the like.
 The voice extraction unit 101 extracts words from the voice data input from the microphone 15. The target dictionary determination unit 109 determines, from among the command dictionaries 201 to 207 of the voice recognition dictionary unit 107, the target dictionary to be used for voice recognition. The voice recognition collation unit 103 collates the words extracted by the voice extraction unit 101 against the target dictionary determined by the target dictionary determination unit 109. The voice recognition result output unit 105 outputs the recognition result, based on the collation performed by the voice recognition collation unit 103, to the voice dialogue processing unit 121 of the voice dialogue control circuit 13.
 The voice dialogue control circuit 13, in turn, includes a voice dialogue processing unit 121, a function execution processing determination unit 123, a voice output content determination unit 125, and a context history management unit 127.
 The voice dialogue processing unit 121 determines, from among previously prepared question phrases and dialogue phrases, a phrase that matches the voice recognition result output from the voice recognition result output unit 105 of the voice recognition circuit 12. In the present embodiment, when determining a phrase that matches the voice recognition result, the voice dialogue processing unit 121 can also determine the phrase using a context managed by the context history management unit 127, described later.
 The function execution processing determination unit 123 determines the processing to be executed based on the content processed by the voice dialogue processing unit 121 and notifies the control circuit 28 of the determined processing. The function execution processing determination unit 123 also acquires the voice recognition result output from the voice recognition result output unit 105 via the voice recognition circuit 12 and notifies the control circuit 28 of that result.
 The voice output content determination unit 125 determines the voice data to be output based on the content processed by the voice dialogue processing unit 121 and notifies the voice synthesis circuit 11 of the determined voice data.
 The control circuit 28 executes a function in accordance with the notification from the function execution processing determination unit 123 and, when the function execution is complete, notifies the context history management unit 127 of the executed content. The control circuit 28 also forwards to the context history management unit 127 the voice recognition result acquired via the function execution processing determination unit 123.
 The context history management unit 127 has a memory (not shown) in which it sequentially stores the content (contexts) notified from the control circuit 28, thereby managing a history of contexts. Here, a context is either the utterance content spoken by the user in a voice dialogue or the operation content executed in response to the user's manual operation. For example, if the user says "destination" and then "Tokyo" in a voice dialogue, "destination, Tokyo" is stored as a context in the memory of the context history management unit 127. Likewise, if the user manually selects "audio" and then "AM radio", "audio, AM radio" is stored as a context in the memory of the context history management unit 127.
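 As a rough sketch, the context history management unit 127 can be modeled as a small bounded store of (category, value) pairs; the class and method names below are illustrative, not taken from the patent, and the capacity of five follows the display example described later:

```python
from collections import deque

class ContextHistoryManager:
    """Keeps the most recent contexts, newest first; the oldest is evicted."""

    def __init__(self, capacity=5):
        self._history = deque(maxlen=capacity)

    def store(self, category, value):
        # e.g. store("destination", "Tokyo") or store("audio", "AM radio")
        self._history.appendleft((category, value))

    def latest(self):
        return list(self._history)  # newest first, for the display screen

    def reset(self):
        self._history.clear()  # corresponds to the reset button 310
```

 A deque with a maximum length gives the eviction behavior for free: appending at the left silently drops the entry at the right once the capacity is reached.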
 (4) Voice recognition processing
 FIG. 3 shows a flowchart of the voice recognition processing of the navigation device 20. Here, the processing performed by the control circuit 28 and the voice recognition unit 10 of the navigation device 20 is described as the voice recognition processing of the navigation device 20. When the ignition switch of the vehicle is turned from off to on, the navigation device 20 enters an operating state. Then, for example, when the start of voice recognition processing is instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 The flowcharts described in this application, and the processing of those flowcharts, comprise a plurality of sections (also referred to as steps), each of which is denoted, for example, S100. Each section can be divided into a plurality of subsections, while a plurality of sections can be combined into a single section. Each section can also be referred to as a device, module, or means. Each of these sections, alone or in combination, can be realized not only as (i) a software section combined with a hardware unit (e.g., a computer), but also as (ii) a hardware section (e.g., an integrated circuit or a wired logic circuit), with or without the functions of the associated devices. A hardware section can also be included inside a microcomputer.
 First, a voice recognition top screen image (also referred to as a top screen) is displayed on the display unit (or display panel) of the display device 26 (S100). Specifically, the voice recognition top screen image is displayed on the display unit of the display device 26 in accordance with an instruction from the control circuit 28 of the navigation device 20. FIG. 4 shows a display example of the voice recognition top screen image, which includes the message "Do you want to continue the context?" together with "Yes" and "No" choices. The user selects "Yes" to continue the utterance content of past voice recognition processing and "No" otherwise. If no context is stored in the memory of the context history management unit 127, only "No" is selectable.
 Next, it is determined whether or not to continue the context (S102). In accordance with the selection made on the voice recognition top screen, the control circuit 28 determines that the context is to be continued when the user selects "Yes" and that it is not to be continued when the user selects "No".
 If the user selects "No", the determination in S102 is NO, and the processing proceeds to the first voice recognition processing (S200 to S206); a sketch of this branch follows.
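 As a minimal sketch of this branch, assuming the ContextHistoryManager above, the session flow might look as follows; ask_continue, select_context, first_recognition, and second_recognition are hypothetical callables standing in for the top screen and the dialogue processing, not names from the patent:

```python
def run_recognition_session(manager, ask_continue, select_context,
                            first_recognition, second_recognition):
    # S100/S102: the top screen offers "Yes" only if a stored context exists
    if manager.latest() and ask_continue():
        context = select_context(manager.latest())  # S108/S110
        second_recognition(context)                 # S300 to S306
    else:
        first_recognition()                         # S200 to S206
```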
 In the first voice recognition processing, a user voice input is first performed (S200). While the PTT switch 16 is being pressed by a user operation, as shown in FIG. 6(a), when the user says, for example, "set the destination", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 The utterance "set the destination" is then recognized (S202), after which it is determined whether or not the operation to be executed has been decided (S204). This determination is made by the function execution processing determination unit 123.
 If the utterance content recognized in the voice recognition processing of S202 is not an operation command instructing execution of a predetermined function, the determination in S204 is NO, and a dialogue voice is generated and output (S206). This is done by the voice dialogue processing unit 121 and the voice output content determination unit 125. For example, in response to the utterance "set the destination", a dialogue voice such as "Please tell me the destination" is generated and output from the speaker 14, and the processing returns to S200. This loop can be sketched as shown below.
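 A rough sketch of the first voice recognition processing, under the assumption that the recognition circuit, the dialogue processing, the S204 decision, and the function execution are available as the callables recognize, respond, is_command, and execute (hypothetical names):

```python
def first_recognition(recognize, respond, is_command, execute):
    # S200 to S206: loop until an operation command is recognized
    while True:
        utterance = recognize()    # S200/S202: capture and recognize speech
        if is_command(utterance):  # S204: operation decided?
            execute(utterance)     # hands over to S104 (function execution)
            return
        respond(utterance)         # S206: e.g. "Please tell me the destination"
```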
 When the user then says "Tokyo", "Tokyo" is recognized in S202, a dialogue voice such as "The destination will be set to Tokyo" is generated in S206, and this dialogue voice is output from the speaker 14.
 Although not shown in FIG. 6(a), when the user then says "yes", the determination in S204 is YES, and the function is executed (S104). That is, the voice recognition unit 10 notifies the control circuit 28 that the destination is to be set to Tokyo; in response, the control circuit 28 sets the destination to Tokyo and notifies the voice recognition unit 10 that the destination has been set to Tokyo.
 Next, the context is stored (S106). Specifically, the control circuit 28 instructs the context history management unit 127 to store the context. In this case, "destination" is associated with the specific place name "Tokyo", "destination, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 Next, when the start of voice recognition processing is again instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 First, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100), and it is then determined whether or not to continue the context (S102).
 If the user selects "No", the determination in S102 is NO, and the processing proceeds to the first voice recognition processing (S200 to S206). When the determination in S102 is NO in this way, the context is not continued, as shown in FIG. 6(a).
 In the first voice recognition processing, a user voice input is first performed (S200). While the PTT switch 16 is being pressed by a user operation, as shown in FIG. 6(a), when the user says, for example, "What is the weather today?", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 The utterance "What is the weather today?" is then recognized in S202, after which it is determined whether or not the operation to be executed has been decided (S204).
 If the utterance content recognized in the voice recognition processing of S202 is not an operation command instructing execution of a predetermined function, the determination in S204 is NO, and a dialogue voice is generated and output from the speaker 14 (S206). For example, in response to the utterance "What is the weather today?", a dialogue voice such as "Where do you want to know the weather?" is generated and output from the speaker 14, and the processing returns to S200.
 When the user then says "Tokyo", "Tokyo" is recognized in S202, and a dialogue voice such as "Today's weather in Tokyo is sunny" is output from the speaker 14 in S206.
 Although not shown in FIG. 6(a), when the user says "end", the determination in S204 is YES, and the function is executed (S104). That is, the voice recognition unit 10 notifies the control circuit 28 that voice recognition is to be ended; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
 Next, the context is stored (S106). In this case, "weather information" is associated with the specific place name "Tokyo", "weather information, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 Next, when the start of voice recognition processing is again instructed by the user's operation of the operation switch group 23, the control circuit 28 and the voice recognition unit 10 of the navigation device 20 perform the processing shown in FIG. 3.
 First, the voice recognition top screen image shown in FIG. 4 is displayed on the display unit of the display device 26 (S100).
 Here, if the user selects "Yes", the determination in S102 is YES, and the context is then displayed on the display unit of the display device 26 (S108). When the determination in S102 is YES in this way, the context is continued, as shown in FIG. 6(b). FIG. 5 shows a display example of the context display screen image, which includes a context display image 300 and a reset button 310.
 The context display image 300 lists five contexts in order from the most recent. When a new context is added, the oldest context is erased. A specific context can be selected from this list. The reset button 310 is used to reset all of the contexts.
 Next, the context is determined (S110). In the display example of FIG. 5, the context "destination, Tokyo" is highlighted. If, for example, the user says "next", the utterance is recognized and "destination, AA restaurant" becomes highlighted. Similarly, on the context display screen shown in FIG. 5, if the user says "fifth", the utterance is recognized and "air conditioner on" becomes highlighted. In this way, the highlighted context switches according to the content of the user's utterance. When the user says "decide", the highlighted context is determined as the context to be used for voice recognition; a sketch of this selection follows.
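 Assuming recognized words arrive as plain strings, the selection dialogue of S110 might look like the sketch below; the English command words stand in for the Japanese ones in the text, and all names are illustrative:

```python
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

def select_context(contexts, utterances):
    # S110: "next" advances the highlight, an ordinal jumps to that entry,
    # and "decide" confirms the highlighted context.
    index = 0  # the most recent context is highlighted first
    for word in utterances:
        if word == "next":
            index = (index + 1) % len(contexts)
        elif word in ORDINALS and ORDINALS[word] < len(contexts):
            index = ORDINALS[word]
        elif word == "decide":
            break
    return contexts[index]
```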
 Next, the processing proceeds to the second voice recognition processing (S300 to S306). In the second voice recognition processing, voice for a voice dialogue with the user is generated using the context determined in S110, so as to continue the utterance content of past voice recognition processing, and voice recognition processing is performed.
 Specifically, a user voice input is first performed (S300). While the PTT switch 16 is being pressed by a user operation, when the user says, for example, "What is the weather today?", the voice data from the microphone 15 is input to the voice recognition circuit 12.
 Next, voice recognition processing is performed (S302). Specifically, the voice dialogue control circuit 13 instructs the voice recognition circuit 12 to execute voice recognition processing, and in response the voice recognition circuit 12 performs voice recognition processing on the voice data from the microphone 15.
 Here, the target dictionary to be used for voice recognition is specified based on the context determined in S110, and voice recognition processing is performed using that dictionary. For example, when the context determined in S110 is "destination, Tokyo", the address dictionary 203 and the telephone directory dictionary 207, which are related to destination setting, are used, while the music dictionary 205, which is unrelated to destination setting, is not used. Using only the minimum necessary dictionaries in this way improves the recognition rate; the narrowing can be sketched as a simple lookup, shown below.
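 A sketch of the dictionary narrowing, where the mapping from a context category to the dictionaries consulted is an assumption modeled on the example above (only the destination row is taken from the text):

```python
DICTIONARIES_BY_CATEGORY = {
    "destination": ["address_dictionary_203", "phonebook_dictionary_207"],
    "audio":       ["music_dictionary_205"],
}

def target_dictionaries(context):
    # S302: consult only the dictionaries relevant to the context's category
    category, _value = context  # e.g. ("destination", "Tokyo")
    return DICTIONARIES_BY_CATEGORY.get(category, [])
```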
 Next, it is determined whether or not the operation to be executed has been decided (S304). Specifically, this is determined based on whether or not the utterance content recognized in the voice recognition processing of S302 is an operation command instructing execution of a predetermined function.
 If the utterance content recognized in the voice recognition processing of S302 is not an operation command instructing execution of a predetermined function, the determination in S304 is NO, and a dialogue voice is generated and output (S306).
 Here, the voice for a voice dialogue with the user is generated using the context determined in S110, so as to continue the utterance content of past voice recognition processing. For example, when the context determined in S110 is "destination, Tokyo" and the user's utterance is recognized in S302 as "What is the weather today?", the specific place name "Tokyo" contained in the context is combined with "What is the weather today?" to generate a phrase such as "Today's weather in Tokyo is sunny", as shown in FIG. 6(b). This phrase is then output from the speaker 14. A sketch of this slot filling follows.
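 A sketch of the continued-dialogue phrase generation; get_weather is a hypothetical lookup (returning, e.g., "sunny") and the response template is illustrative:

```python
def respond_with_context(utterance, context, get_weather):
    # S306: the place name carried over from the stored context fills the
    # slot the user would otherwise have to repeat.
    _category, place = context         # e.g. ("destination", "Tokyo")
    if utterance == "What is the weather today?":
        forecast = get_weather(place)  # hypothetical weather lookup
        return f"Today's weather in {place} is {forecast}."
    return None  # other utterances are handled by other dialogue phrases
```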
 Although not shown in FIG. 6(b), when the user says "end", the determination in S304 is YES, and the function is executed (S112). That is, the voice recognition unit 10 notifies the control circuit 28 that voice recognition is to be ended; in response, the control circuit 28 ends the voice recognition and notifies the voice recognition unit 10 that today's weather information for Tokyo has been provided.
 Next, the context is stored (S114). In this case, "weather information" is associated with the specific place name "Tokyo", "weather information, Tokyo" is stored as a context in the memory of the context history management unit 127, and the processing ends.
 As described above, when the context is not continued, the utterance content of past voice recognition processing is not carried over, so in response to the utterance "What is the weather today?", a dialogue voice such as "Where do you want to know the weather?" is output from the speaker 14, and the user must say "Tokyo" in response. That is, the input "Tokyo" is entered repeatedly. By contrast, when the context is continued, in response to the utterance "What is the weather today?", a dialogue voice such as "Today's weather in Tokyo is sunny" is output from the speaker 14 using the "Tokyo" contained in the dialogue of past voice recognition processing. That is, the weather information for Tokyo can be obtained without entering "Tokyo" again.
 FIG. 7 shows an example of a dialogue for searching for music by artist name and displaying that artist's album list: (a) is the dialogue when the context is not continued, and (b) when it is continued.
 As shown in FIG. 7(a), in response to the user's utterance "play music by artist name", the dialogue voice "Please tell me the artist name" is output from the speaker 14; then, in response to the user's utterance "Michael", the dialogue voice "Playing Michael's songs" is output from the speaker 14, and when the function is executed, "artist, Michael" is stored as a context in the memory of the context history management unit 127. After the series of voice recognition processing ends, if the context is not continued when voice recognition processing is started again, then in response to the user's utterance "display the album list", the dialogue voice "Which artist's album list do you want to display?" is output from the speaker 14, after which, in response to the user's utterance "Michael", the dialogue voice "Displaying Michael's album list" is output from the speaker 14. That is, the input "Michael" is entered repeatedly.
 By contrast, as shown in FIG. 7(b), when "artist, Michael" is stored as a context in the memory of the context history management unit 127 and the context is continued when voice recognition processing is started again after the series of voice recognition processing ends, then in response to the user's utterance "display the album list", the dialogue voice "Displaying Michael's album list" is output from the speaker 14. That is, Michael's album list can be displayed without entering "Michael" again.
 FIG. 8 shows an example of a dialogue for searching for a destination and then calling that destination: (a) is the dialogue when the context is not continued, and (b) when it is continued.
 As shown in FIG. 8(a), in response to the user's utterance "set the destination", the dialogue voice "Please tell me the destination" is output from the speaker 14; then, in response to the user's utterance "AA restaurant", the dialogue voice "Setting AA restaurant as the destination" is output from the speaker 14, and when the function is executed, "destination, AA restaurant" is stored as a context in the memory of the context history management unit 127. After the series of voice recognition processing ends, if the context is not continued when voice recognition processing is started again, then in response to the user's utterance "make a call", the dialogue voice "Where do you want to call?" is output from the speaker 14, after which, in response to the user's utterance "AA restaurant", the dialogue voice "Calling AA restaurant" is output from the speaker 14. That is, the input "AA restaurant" is entered repeatedly.
 By contrast, as shown in FIG. 8(b), when "destination, AA restaurant" is stored as a context in the memory of the context history management unit 127 and the context is continued when voice recognition processing is started again after the series of voice recognition processing ends, then in response to the user's utterance "make a call", the dialogue voice "Calling AA restaurant" is output from the speaker 14. That is, the call to AA restaurant can be placed without entering "AA restaurant" again.
 As described above, recognition processing that continues the context eliminates the annoyance of repeating the same input over and over and, by reducing the number of dialogue exchanges in the vehicle cabin environment, also contributes to safety.
 According to the configuration described above, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for a voice dialogue with the user is generated using the context stored in the memory. This eliminates the annoyance the user would otherwise feel when past operations are not continued.
 In addition, the user is asked to confirm whether or not to conduct a voice dialogue using the context stored in the memory, and the voice for the voice dialogue with the user is generated using the stored context only when such a dialogue is confirmed. This prevents voice that uses the context from being output against the user's intention.
 Furthermore, the contexts stored in the memory are displayed on the display unit, the content of the dialogue to be continued is specified in accordance with a user operation, and the voice for the voice dialogue with the user is generated using the specified content, so the user can easily specify the content of the dialogue to be continued.
 The contexts stored in the memory can be displayed on the display unit in order from the most recent, and the contents of a fixed number of the most recent operations (for example, five) can be displayed on the display unit.
 Moreover, since the dictionary used for voice recognition is changed according to the context specified by the user operation, the recognition rate of voice recognition can be improved. The content of the dialogue to be continued can also be specified by an instruction spoken by the user, and the contexts stored in the memory can be erased in response to a user operation.
 The present disclosure is not limited to the embodiment described above and can be variously modified as follows without departing from the spirit of the present disclosure.
 In the above embodiment, all contexts stored in the memory are erased by operating the reset button 310 in response to a user operation; however, the apparatus can also be configured so that contexts are selected and erased one at a time.
 In the above embodiment, the content recognized by voice recognition is stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed again after it has been completed, the voice for a voice dialogue with the user is generated using the context stored in the memory so as to continue the utterance content of past voice recognition processing. Alternatively, the content executed in response to the user's manual operation may be stored as a context in the memory of the context history management unit 127, and when voice recognition processing is performed, the voice for a voice dialogue with the user may be generated using the context stored in the memory.
 In the above embodiment, the memory of the context history management unit 127 is also referred to as a storage unit/device/means; S106 and S114 are also referred to as a storage control section/device/means or a storage instruction section/device/means; S300 to S306 are also referred to as a voice recognition processing section/device/means or a content utilization recognition section/device/means; S102 is also referred to as a confirmation section/device/means or a content confirmation section/device/means; S108 is also referred to as a display control section/device/means or a display instruction section/device/means; S110 is also referred to as a specification section/device/means or a content specification section/device/means; the voice output content determination unit 125 is also referred to as a voice generation unit/device/means; and the reset button 310 is also referred to as an erasure unit/device/means or a content erasure unit/device/means.
 Although the present disclosure has been described with reference to the embodiment, it is understood that the present disclosure is not limited to that embodiment or structure. The present disclosure encompasses various modifications and variations within the range of equivalents. In addition, various combinations and forms, as well as other combinations and forms including only one element, more, or less, also fall within the scope and spirit of the present disclosure.

Claims (8)

  1.  A speech recognition apparatus that recognizes utterance content spoken by a user, generates voice for a voice dialogue with the user based on the utterance content recognized by the voice recognition, and performs voice recognition processing in an interactive manner, the speech recognition apparatus comprising:
     a storage control section (S106, S114) that causes a storage unit (127) to store at least one of content recognized by the voice recognition and content executed in response to a manual operation by the user; and
     a voice recognition processing section (S300 to S306) that, when performing the voice recognition processing, generates voice for a voice dialogue with the user using the content stored in the storage unit and performs the voice recognition processing.
  2.  The speech recognition apparatus according to claim 1, further comprising a confirmation section (S102) that confirms with the user whether or not to conduct a voice dialogue using the content stored in the storage unit,
     wherein, when the confirmation section confirms that a voice dialogue using the content stored in the storage unit is to be conducted, the voice recognition processing section generates the voice for the voice dialogue with the user using the content stored in the storage unit.
  3.  The speech recognition apparatus according to claim 1 or 2, further comprising:
     a display control section (S108) that causes a display unit to display the content stored in the storage unit; and
     a specification section (S110) that specifies the content of the dialogue to be continued in accordance with a user operation,
     wherein the voice recognition processing section generates the voice for the voice dialogue with the user using the content specified by the specification section.
  4.  The speech recognition apparatus according to claim 3, wherein the display control section causes the display unit to display the content stored in the storage unit in order from the most recent.
  5.  The speech recognition apparatus according to claim 3 or 4, wherein the display control section causes the display unit to display the content of a fixed number of the most recent operations.
  6.  The speech recognition apparatus according to any one of claims 3 to 5, wherein the specification section specifies the content of the dialogue to be continued in accordance with an instruction spoken by the user.
  7.  The speech recognition apparatus according to any one of claims 3 to 6, wherein the voice recognition processing section changes a dictionary used in the voice recognition according to the content specified by the specification section.
  8.  The speech recognition apparatus according to any one of claims 1 to 7, further comprising an erasure unit (310) for erasing the content stored in the storage unit in response to a user operation.
PCT/JP2014/006171 2014-01-06 2014-12-11 Speech recognition apparatus WO2015102039A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-000264 2014-01-06
JP2014000264A JP2015129793A (en) 2014-01-06 2014-01-06 Voice recognition apparatus

Publications (1)

Publication Number Publication Date
WO2015102039A1 true WO2015102039A1 (en) 2015-07-09

Family

ID=53493388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/006171 WO2015102039A1 (en) 2014-01-06 2014-12-11 Speech recognition apparatus

Country Status (2)

Country Link
JP (1) JP2015129793A (en)
WO (1) WO2015102039A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10269351B2 (en) * 2017-05-16 2019-04-23 Google Llc Systems, methods, and apparatuses for resuming dialog sessions via automated assistant

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007264198A (en) * 2006-03-28 2007-10-11 Toshiba Corp Interactive device, interactive method, interactive system, computer program and interactive scenario generation device
JP2008083100A (en) * 2006-09-25 2008-04-10 Toshiba Corp Voice interactive device and method therefor
JP2010073192A (en) * 2008-08-20 2010-04-02 Universal Entertainment Corp Conversation scenario editing device, user terminal device, and automatic answering system
JP2012008554A (en) * 2010-05-24 2012-01-12 Denso Corp Voice recognition device

Also Published As

Publication number Publication date
JP2015129793A (en) 2015-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14876040; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14876040; Country of ref document: EP; Kind code of ref document: A1)