JP2019113636A - Voice recognition system

Info

Publication number: JP2019113636A
Application number: JP2017245769A
Authority: JP
Inventors: Yusuke Kondo (近藤 裕介)
Original assignee: Onkyo Corp
Current assignee: Onkyo Corp
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2019-07-11

Abstract

PROBLEM TO BE SOLVED: To enable hotword detection even during the speech recognition processing that follows detection of a hotword.

SOLUTION: A speech recognition system 1 comprises a recording buffer to which recorded audio data is output, and a primary buffer to which the audio data of the recording buffer is copied. Hotword detection is performed using the audio data of the primary buffer, and natural language understanding is performed using the audio data of the recording buffer. After a hotword is detected, hotword detection and natural language understanding are performed simultaneously.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition system that performs speech recognition.

A speech recognition system comprises, for example, an electronic device such as a speaker device, and a cloud server. The electronic device includes a microphone and a speaker. The speech recognition system recognizes speech input from the microphone, executes processing based on the recognized speech, and outputs the result from the speaker. For example, when the user says "Tell me the weather," the electronic device included in the speech recognition system outputs the spoken response "Today's weather is sunny." Patent Document 1 discloses an invention that changes a communication rate according to the result of speech recognition.

Some speech recognition systems perform the subsequent speech recognition processing (natural language understanding, NLU) when they detect a hotword (for example, "Hello, Onkyo") that enables speech recognition. FIG. 3(a) is a diagram for explaining hotword detection. Recorded audio data is output to a recording buffer, and hotword detection is performed using the audio data of the recording buffer. FIG. 3(b) is a diagram for explaining the processing after hotword detection. After the hotword is detected, natural language understanding is performed using the audio data of the recording buffer.

Patent Document 1: US Patent Application Publication No. 2008/0300025

A conventional electronic device has the problem that, once a hotword has been detected, hotword detection cannot be performed again until natural language understanding has finished. For example, suppose the user says the hotword "Hello, Onkyo" and then says "Tell me the weather." If the user then decides to listen to music rather than hear the weather, saying "Hello, Onkyo," "Play music" immediately after "Hello, Onkyo," "Tell me the weather" does not start the music.
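The problem described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the conventional device's actual implementation: the text strings stand in for audio chunks, and `HOTWORD`, `conventional_flow`, and `nlu_busy_steps` are hypothetical names introduced here.

```python
# Hypothetical sketch of the conventional flow: a single recording buffer,
# with hotword detection suspended while NLU is processing.
HOTWORD = "hello onkyo"  # assumed text stand-in for the spoken hotword


def conventional_flow(utterances, nlu_busy_steps=2):
    """Return the commands that are actually handed to NLU.

    `utterances` stands in for successive chunks of the recording buffer.
    `nlu_busy_steps` is an assumed number of later chunks that arrive
    while NLU is still busy with the first command.
    """
    handled = []
    i = 0
    while i < len(utterances):
        if utterances[i] == HOTWORD and i + 1 < len(utterances):
            handled.append(utterances[i + 1])  # the command goes to NLU
            # Chunks arriving while NLU is busy are never checked for the hotword.
            i += 2 + nlu_busy_steps
        else:
            i += 1
    return handled


spoken = ["hello onkyo", "tell me the weather", "hello onkyo", "play music"]
print(conventional_flow(spoken))  # ['tell me the weather']
```

In this sketch the second hotword and the "play music" command fall inside the window in which NLU is busy, so they are skipped, which mirrors the behavior the paragraph above describes.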

An object of the present invention is to enable hotword detection even during the speech recognition processing that follows hotword detection.

A speech recognition system according to a first aspect of the invention comprises a recording buffer to which recorded audio data is output, and a primary buffer to which the audio data of the recording buffer is copied, wherein hotword detection is performed using the audio data of the primary buffer and natural language understanding is performed using the audio data of the recording buffer.

In the present invention, hotword detection is performed using the audio data of the primary buffer, to which the audio data of the recording buffer has been copied, and natural language understanding is performed using the audio data of the recording buffer. Hotword detection and natural language understanding can therefore be performed simultaneously, so hotword detection remains possible even during the natural language understanding processing that follows hotword detection.
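The two-buffer arrangement can be illustrated with a minimal data-structure sketch. The class `Buffers`, its `record` method, and the use of text strings in place of audio chunks are assumptions made for illustration; the source does not specify how the buffers are implemented.

```python
from collections import deque

HOTWORD = "hello onkyo"  # assumed text stand-in for the spoken hotword


class Buffers:
    """Hypothetical pair of buffers: NLU reads one, hotword detection the other."""

    def __init__(self):
        self.recording = deque()  # read by natural language understanding
        self.primary = deque()    # read by hotword detection

    def record(self, chunk):
        """Write a chunk to the recording buffer and copy it to the primary buffer."""
        self.recording.append(chunk)
        self.primary.append(chunk)


def drain_hotwords(primary):
    """Consume the primary buffer and count the hotwords found in it."""
    count = 0
    while primary:
        if primary.popleft() == HOTWORD:
            count += 1
    return count


buffers = Buffers()
for chunk in ["hello onkyo", "tell me the weather", "hello onkyo", "play music"]:
    buffers.record(chunk)

found = drain_hotwords(buffers.primary)
print(found, len(buffers.recording))  # 2 4
```

Because the detector consumes its own copy, the recording buffer still holds every chunk for natural language understanding, which is why the two processes do not have to wait for each other.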

A speech recognition system according to a second aspect of the invention is the speech recognition system according to the first aspect, wherein hotword detection and natural language understanding are performed simultaneously after a hotword is detected.

A speech recognition system according to a third aspect of the invention is the speech recognition system according to the first or second aspect, wherein, before a hotword is detected, hotword detection is performed using the audio data of the recording buffer.

A speech recognition system according to a fourth aspect of the invention is the speech recognition system according to any one of the first to third aspects, wherein the audio data of the recording buffer is copied to the primary buffer after a hotword is detected.

According to the present invention, hotword detection can be performed even during the natural language understanding processing that follows hotword detection.

FIG. 1 is a diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of the speech recognition system according to the embodiment of the present invention.
FIG. 3(a) is a diagram for explaining hotword detection, and FIG. 3(b) is a diagram for explaining the processing after hotword detection.

An embodiment of the present invention is described below. FIG. 1 is a block diagram showing the configuration of a speech recognition system according to the embodiment. The speech recognition system 1 includes an electronic device and a cloud server. Although not shown, the electronic device includes an SoC (System on Chip), a microphone, a speaker, and the like. The SoC (control unit) controls each part of the electronic device. In the present embodiment, the speech recognition system 1 is made up of the electronic device and the cloud server, which cooperate to perform speech recognition.

The microphone picks up sound, and the sound picked up by the microphone is recorded (audio recording). The SoC outputs the recorded audio data to the recording buffer. Detection of the hotword (trigger word) is performed using the audio data of the recording buffer (see FIG. 3(a)). The hotword is, for example, "Hello, Onkyo." In the present embodiment the SoC performs hotword detection, but hotword detection may instead be performed by the cloud server. After the hotword is detected, the SoC copies the audio data of the recording buffer to the primary buffer, as shown in FIG. 1. Hotword detection then continues using the audio data of the primary buffer. At the same time, natural language understanding is performed using the audio data of the recording buffer. In the present embodiment, natural language understanding is performed by the cloud server.
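The flow of this embodiment can be sketched with two worker threads, one per buffer. This is a simplified illustration, not the SoC's actual firmware: the queues, the `END` sentinel, the function `run`, and the text strings used in place of audio are all assumptions, and the check made before the first detection is modeled inline rather than through a real recording buffer. It requires Python 3.8 or later.

```python
import queue
import threading

HOTWORD = "hello onkyo"  # assumed text stand-in for the spoken hotword
END = None               # hypothetical sentinel marking the end of the audio


def run(chunks):
    recording = queue.Queue()  # recording buffer, read by NLU
    primary = queue.Queue()    # primary buffer, read by hotword detection
    hotwords, commands = [], []

    def hotword_worker():
        # After the first detection, hotword detection reads only the primary buffer.
        while (chunk := primary.get()) is not END:
            if chunk == HOTWORD:
                hotwords.append(chunk)

    def nlu_worker():
        # NLU (the cloud server in the embodiment) reads only the recording buffer.
        while (chunk := recording.get()) is not END:
            if chunk != HOTWORD:
                commands.append(chunk)

    workers = [threading.Thread(target=hotword_worker),
               threading.Thread(target=nlu_worker)]
    for worker in workers:
        worker.start()

    detected = False
    for chunk in chunks:
        if not detected:
            # Before the first detection, the recorded data itself is checked.
            if chunk == HOTWORD:
                detected = True
                hotwords.append(chunk)
            continue
        recording.put(chunk)
        primary.put(chunk)  # copy of the recording buffer data

    recording.put(END)
    primary.put(END)
    for worker in workers:
        worker.join()
    return len(hotwords), commands


result = run(["hello onkyo", "tell me the weather", "hello onkyo", "play music"])
print(result)  # (2, ['tell me the weather', 'play music'])
```

Here the second "hello onkyo" is found by the hotword worker while the NLU worker is independently consuming the same audio from the recording buffer, so both commands are handled.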

In the present embodiment, the recording buffer and the primary buffer are provided in the electronic device, but they may instead be provided in the cloud server.

Because hotword detection and natural language understanding are performed simultaneously in this way, hotword detection can be performed even during the natural language understanding processing that follows hotword detection. That is, when the user says "Hello, Onkyo," "Play music" immediately after "Hello, Onkyo," "Tell me the weather," the hotword is detected even though natural language understanding is in progress, and the music can be played.

Speech synthesis is performed after natural language understanding. For example, suppose the user says "Hello, Onkyo," "Tell me the weather." In response, "Today's weather is sunny," for example, is output from the speaker. During speech synthesis processing, as shown in FIG. 2, hotword detection is performed using the audio data of the recording buffer. If, for example, "Hello, Onkyo" is spoken while "Today's weather is sunny" is being synthesized, the hotword is detected using the audio data of the recording buffer.
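Taken together, the description implies that the buffer read by hotword detection depends on the current processing phase. The following sketch summarizes that selection; the `Phase` enum and the function `hotword_source` are hypothetical names introduced here, not terms from the source.

```python
from enum import Enum, auto


class Phase(Enum):
    WAITING = auto()    # no hotword detected yet (FIG. 3(a))
    NLU = auto()        # natural language understanding in progress (FIG. 1)
    SYNTHESIS = auto()  # speech synthesis in progress (FIG. 2)


def hotword_source(phase):
    """Name of the buffer that hotword detection reads in the given phase."""
    # Only while NLU is using the recording buffer does detection
    # switch to the copied data in the primary buffer.
    return "primary" if phase is Phase.NLU else "recording"


for phase in Phase:
    print(phase.name, "->", hotword_source(phase))
```

The sketch treats the phases as mutually exclusive, which is an assumption; the source does not say how overlapping requests are scheduled.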

As described above, in the present embodiment hotword detection is performed using the audio data of the primary buffer, to which the audio data of the recording buffer has been copied, and natural language understanding is performed using the audio data of the recording buffer. Hotword detection and natural language understanding can therefore be performed simultaneously, so hotword detection remains possible even during the natural language understanding processing that follows hotword detection.

Although an embodiment of the present invention has been described above, the forms to which the present invention can be applied are not limited to that embodiment, and changes can be made as appropriate without departing from the spirit of the present invention.

The present invention can be suitably employed in a speech recognition system that performs speech recognition.

1: Speech recognition system

Claims

A speech recognition system comprising:
a recording buffer to which recorded audio data is output; and
a primary buffer to which the audio data of the recording buffer is copied,
wherein hotword detection is performed using the audio data of the primary buffer, and
natural language understanding is performed using the audio data of the recording buffer.

The speech recognition system according to claim 1, wherein hotword detection and natural language understanding are performed simultaneously after a hotword is detected.

The speech recognition system according to claim 1 or 2, wherein, before a hotword is detected, hotword detection is performed using the audio data of the recording buffer.

The speech recognition system according to any one of claims 1 to 3, wherein the audio data of the recording buffer is copied to the primary buffer after a hotword is detected.