JP2009092693A

JP2009092693A - Speech recognition device

Info

Publication number: JP2009092693A
Application number: JP2007260164A
Authority: JP
Inventors: Shinpei Hibiya; 新平日比谷; Kiyotaka Takehara; 清隆竹原; Kenji Okuno; 健治奥野; Akira Baba; 朗馬場; Kenji Nakakita; 賢二中北
Original assignee: Panasonic Electric Works Co Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 2007-10-03
Filing date: 2007-10-03
Publication date: 2009-04-30

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device with improved speech recognition performance and usability.

SOLUTION: The speech recognition device A, which controls the operation of a control object based on speech uttered by a user, comprises: a speech input section 11 to which the user's speech is input; a storage section 12 in which the recognition object vocabulary related to each operation of the control object is stored in each of a plurality of uttering styles; a selection section 13 for selecting one uttering style from the plurality of uttering styles; a collation section 14 which collates the speech input to the speech input section 11 with the recognition object vocabulary of the uttering style selected by the selection section 13, out of the recognition object vocabulary stored in the storage section 12, and specifies the recognition object vocabulary based on the collation result; and a control section 16 for operating the control object based on the recognition object vocabulary specified by the collation section 14.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech recognition device.

Conventionally, there are speech recognition devices that control the operation of a control target based on input speech (see, for example, Patent Document 1).
Patent Document 1: JP 2004-355854 A

In a speech recognition device, when only one recognition target vocabulary item is registered for each control content, a user whose utterance of that item is difficult for the device to recognize cannot perform the corresponding control. Likewise, if the registered recognition target vocabulary is difficult for a user to utter, the usability of the device is low for that user.

For example, the recognition target vocabulary for operating a "heating device" could take various utterance styles: the name of the device to be operated, 暖房 ("heating"); a request-style phrase ordering the operation, 暖房をつけて ("please turn on the heating"); a command-style phrase, 暖房をつけろ ("turn on the heating!"); and so on. Because the vocabulary that is easy to use is likely to differ from user to user, narrowing the recognition target vocabulary down to a single utterance style in advance is undesirable from a usability standpoint.

However, if the recognition target vocabulary for each control content is simply registered in multiple utterance styles, speech recognition performance degrades as the recognition target vocabulary grows, and the device itself becomes difficult to use.

The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition device with improved speech recognition performance and usability.

The invention of claim 1 is a speech recognition device that controls the operation of a control target based on speech uttered by a user, characterized by comprising: a speech input unit to which the user's speech is input; a storage unit that stores the vocabulary associated with each operation of the control target in each of a plurality of utterance styles; a selection unit that selects one of the plurality of utterance styles; a collation unit that collates the speech input to the speech input unit against the vocabulary of the utterance style selected by the selection unit, out of the vocabulary stored in the storage unit, and identifies a vocabulary item based on the collation result; and a control unit that operates the control target based on the vocabulary item identified by the collation unit.

According to this invention, vocabulary corresponding to each of a plurality of utterance styles is set for each operation of the control target. Therefore, even when speech recognition is difficult with the recognition target vocabulary of one utterance style, selecting another utterance style improves speech recognition performance and makes the device more usable for the user. In other words, both speech recognition performance and usability can be improved. Moreover, even when the number of utterance styles is increased, growth of the recognition target vocabulary actually used in the collation process is suppressed, which prevents degradation of speech recognition performance.

The invention of claim 2 is characterized in that, in claim 1, the device comprises presentation means for presenting the plurality of utterance styles of the vocabulary stored in the storage unit, and the selection unit selects one of the plurality of utterance styles based on input from the user.

According to this invention, even when speech recognition is difficult with the recognition target vocabulary of one utterance style, the user can select another utterance style and utter its recognition target vocabulary, which makes recognition easier and improves speech recognition performance. In addition, because users can select the utterance style that suits them, the device is easier to speak to and more usable for each user.

The invention of claim 3 is characterized in that, in claim 1, the device comprises utterance style classification means for classifying the utterance style of the speech input to the speech input unit into one of the plurality of utterance styles, and frequency calculation means for calculating the appearance frequency of each classified utterance style, and the selection unit selects, from the plurality of utterance styles, at least the utterance style with the highest appearance frequency.

According to this invention, speech recognition performance is improved because the recognition target vocabulary is identified using the utterance style the user uses most often. In addition, the frequently used utterance style is selected without the user having to be aware of it, which makes the device more usable for the user.

As described above, the present invention has the effect of improving both speech recognition performance and usability.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1)
The speech recognition device A of this embodiment has the block configuration shown in FIG. 1 and is installed in a bathroom as shown in FIG. 2. Its control targets are devices installed in the bathroom, such as an air conditioning device B1 (for example, a heating device), a main lighting device B2, an indirect lighting device B3, a bathtub device B4 such as a jet bath arranged in the bathtub Y, a bath TV B5, and a hot water supply device (not shown). Devices outside the bathroom may also be added as control targets.

The speech recognition device A comprises a speech input unit 11, a storage unit 12, a selection unit 13, a collation unit 14, a display panel 15, and a control unit 16. The speech input unit 11 is a microphone and is installed on the bathroom wall together with the display panel 15, which is, for example, a liquid crystal screen.

The control unit 16 controls the operation of the control targets by outputting control signals to those devices. The control unit 16 also controls the display panel 15, causing it to show operation screens that assist the user in controlling the devices. The screens form the hierarchy shown in FIG. 3. Below the top screen GT are the air conditioner operation screen G1, the lighting device operation screen G2, the bathtub device operation screen G3, the hot water supply device operation screen G4, and the bath TV operation screen G5. Below each of these are further screens: air conditioner operation screens G11, G12, ... below G1; lighting device operation screens G21, G22, ... below G2; bathtub device operation screens G31, G32, ... below G3; hot water supply device operation screens G41, G42, ... below G4; and bath TV operation screens G51, G52, ... below G5.

When the speech recognition device A starts up, the top screen GT is displayed on the display panel 15. The screen then transitions in response to speech input from the speech input unit 11. During this time, the control unit 16 outputs the state of the screen shown on the display panel 15 (display screen ID data, display screen state data, and so on) to the collation unit 14. The display screen ID data is set in advance, for example as follows: ID0 for the top screen GT, ID1 for the air conditioner operation screen G1, ID2 for the lighting device operation screen G2, ID3 for the bathtub device operation screen G3, ID4 for the hot water supply device operation screen G4, and ID5 for the bath TV operation screen G5; ID11, ID12, ... for the air conditioner operation screens G11, G12, ...; ID21, ID22, ... for the lighting device operation screens G21, G22, ...; ID31, ID32, ... for the bathtub device operation screens G31, G32, ...; ID41, ID42, ... for the hot water supply device operation screens G41, G42, ...; and ID51, ID52, ... for the bath TV operation screens G51, G52, ....
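The screen ID scheme above can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `TOP_LEVEL_SCREENS` and `parent_screen_id` are hypothetical, and the rule that a two-digit ID encodes its parent screen in the tens digit is inferred from the ID examples rather than stated in the description.

```python
from typing import Optional

# Screen IDs as listed in the description (FIG. 3 hierarchy).
TOP_LEVEL_SCREENS = {
    0: "top screen GT",
    1: "air conditioner operation screen G1",
    2: "lighting device operation screen G2",
    3: "bathtub device operation screen G3",
    4: "hot water supply device operation screen G4",
    5: "bath TV operation screen G5",
}


def parent_screen_id(screen_id: int) -> Optional[int]:
    """Return the ID of the screen one level up, or None for the top screen.

    Assumption: two-digit IDs such as 11 or 52 carry the parent ID in the tens digit.
    """
    if screen_id == 0:
        return None
    if screen_id < 10:
        return 0
    return screen_id // 10
```

For instance, under this assumption the screen with ID11 sits below the air conditioner operation screen G1 (ID1), which in turn sits below the top screen GT (ID0).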

First, the speech input unit 11 picks up the speech uttered by the user and outputs it to the collation unit 14 as speech data. The collation unit 14 performs speech recognition on the speech data using data held in the speech database in the storage unit 12, such as statistical data relating speech feature values to phonemes or vocabulary, and the recognition target vocabulary data. It collates the speech data against the recognition target vocabulary stored in the storage unit 12 and identifies the recognition target vocabulary item with the highest similarity to the speech data.

The recognition target vocabulary is stored in the storage unit 12 in association with the screens displayed on the display panel 15, and the collation unit 14 uses in the collation process the recognition target vocabulary associated with the display screen ID data and display screen state data it currently holds.

For example, as shown in FIG. 4, operation contents such as "start the heating device", "start the ventilation fan", and "transition the display screen to the top screen" are provided for the display screen ID data "1" and the display screen state data (corresponding to the screen description in FIG. 4) of the air conditioner operation screen G1, and recognition target vocabulary is set for each operation content. Similarly, the operation contents "turn on the main lighting", "turn on the indirect lighting", and "transition the display screen to the top screen" are provided for the display screen ID data "2" and the display screen state data of the lighting device operation screen G2, and recognition target vocabulary is set for each operation content.

Furthermore, the recognition target vocabulary for each operation content is set for each of a plurality of utterance styles. In this embodiment, the utterance styles used include a name-only style, a request style, and a command style.

For example, as shown in FIG. 4, the recognition target vocabulary for the operation content "start the heating device" during air conditioner operation is 暖房 ("heating") in the name-only style, 暖房をつけて ("please turn on the heating") in the request style, and 暖房をつけろ ("turn on the heating!") in the command style. The recognition target vocabulary for the operation content "start the ventilation fan" is 換気扇 ("ventilation fan") in the name-only style, 換気扇をつけて ("please turn on the ventilation fan") in the request style, and 換気扇をつけろ ("turn on the ventilation fan!") in the command style. The recognition target vocabulary for the operation content "transition the display screen to the top screen" is トップ画面 ("top screen") in the name-only style, トップ画面へ遷移して ("please go to the top screen") in the request style, and トップ画面へ遷移しろ ("go to the top screen!") in the command style. Recognition target vocabulary is set for each utterance style in the same way for the other operation contents.

Thus, in the storage unit 12, recognition target vocabulary for every utterance style is set for each operation content, but the collation unit 14 uses in the collation process only the recognition target vocabulary of the one utterance style selected by the selection unit 13, described later.
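The per-operation, per-style vocabulary table and the style-based narrowing described above can be sketched as follows. This is a minimal illustration, not the implementation used in the device: the names `Style`, `VOCABULARY`, and `active_vocabulary` are hypothetical, the phrases are the FIG. 4 examples quoted in the text, and only the ID "1" for the name-only style comes from the description (IDs 2 and 3 are assumed).

```python
from enum import IntEnum
from typing import Dict


class Style(IntEnum):
    NAME_ONLY = 1  # ID "1" is stated in the description (FIG. 5)
    REQUEST = 2    # assumed ID, for illustration only
    COMMAND = 3    # assumed ID, for illustration only


# Operation content -> phrase per utterance style, for screen ID 1 (screen G1).
VOCABULARY: Dict[str, Dict[Style, str]] = {
    "start the heating device": {
        Style.NAME_ONLY: "暖房",
        Style.REQUEST: "暖房をつけて",
        Style.COMMAND: "暖房をつけろ",
    },
    "start the ventilation fan": {
        Style.NAME_ONLY: "換気扇",
        Style.REQUEST: "換気扇をつけて",
        Style.COMMAND: "換気扇をつけろ",
    },
    "transition the display screen to the top screen": {
        Style.NAME_ONLY: "トップ画面",
        Style.REQUEST: "トップ画面へ遷移して",
        Style.COMMAND: "トップ画面へ遷移しろ",
    },
}


def active_vocabulary(style: Style) -> Dict[str, str]:
    """Return phrase -> operation content for the selected style only."""
    return {phrases[style]: operation for operation, phrases in VOCABULARY.items()}
```

With three operation contents and three styles, nine phrases are stored, but `active_vocabulary(Style.NAME_ONLY)` hands only the three name-only phrases to the collation step, so adding styles does not enlarge the set actually collated.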

Next, the utterance style selection process is described. In this embodiment, an utterance style selection screen is prepared in advance as one of the screens displayed on the display panel 15. When the user utters, toward the speech input unit 11, the recognition target vocabulary corresponding to the operation content "transition the display screen to the utterance style selection screen" in the currently selected utterance style (here, the request style), the collation unit 14 uses only the recognition target vocabulary of the selected utterance style in the collation process. When the collation identifies the recognition target vocabulary item with the highest similarity to the speech data (here, 発話スタイル選択画面へ遷移して, "please go to the utterance style selection screen"), the display panel 15 transitions to the utterance style selection screen. This screen shows the selectable utterance styles, such as the name-only style, the request style, and the command style.

While viewing the utterance style selection screen, the user can select a desired utterance style from among the name-only style, the request style, the command style, and so on. Suppose the user utters, toward the speech input unit 11, the recognition target vocabulary corresponding to the operation content "select the name-only style as the utterance style" in the currently selected utterance style (here, the request style). When the collation process identifies the recognition target vocabulary item with the highest similarity to the speech data (here, 名称単独スタイルを選択して, "please select the name-only style"), the collation unit 14 outputs to the selection unit 13 the utterance style ID corresponding to the selected utterance style, as shown in FIG. 5 (here, utterance style ID "1" for the name-only style).

From the recognition target vocabulary stored in the storage unit 12, the selection unit 13 selects, as the vocabulary to be used in the collation process, the recognition target vocabulary of the utterance style corresponding to the received utterance style ID (here, the name-only style).

From then on, the collation unit 14 uses in the collation process only the recognition target vocabulary of the utterance style selected by the selection unit 13 (here, the name-only style). When the selected style is the name-only style, the recognition target vocabulary for the operation contents "start the heating device" and "transition the display screen to the top screen" is 暖房 and トップ画面 respectively, and the vocabulary of the other utterance styles, such as 暖房をつけて, 暖房をつけろ, トップ画面へ遷移して, and トップ画面へ遷移しろ, is not used in the collation process.

In this way, the collation unit 14 performs speech recognition that identifies the recognition target vocabulary item with the highest similarity to the speech data, using only the vocabulary of the selected utterance style, narrowed down from the vocabulary of all utterance styles stored in the storage unit 12. Having identified that item, the collation unit 14 outputs the corresponding control command signal to the control unit 16, as shown in FIG. 6. For example, if the identified recognition target vocabulary is 暖房 ("heating"), it outputs a start command signal for the air conditioning device B1; if the identified recognition target vocabulary is 主照明 ("main lighting"), it outputs a start command signal for the main lighting device B2.
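The collation step, picking the highest-similarity item from the narrowed vocabulary and emitting its control command, can be sketched as below. This is a toy under clearly stated assumptions: string similarity from Python's `difflib` stands in for the acoustic similarity a real recognizer would compute from speech features, and the function name `collate` and the command names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple

# Name-only style vocabulary -> control command (command names are hypothetical).
COMMANDS: Dict[str, str] = {
    "暖房": "START_AIR_CONDITIONER_B1",
    "主照明": "START_MAIN_LIGHT_B2",
}


def collate(recognized_text: str, vocabulary: Dict[str, str]) -> Tuple[str, str]:
    """Return (best matching phrase, its control command).

    String similarity is a stand-in for the acoustic similarity score;
    the description does not specify how similarity is computed.
    """
    best = max(
        vocabulary,
        key=lambda phrase: SequenceMatcher(None, recognized_text, phrase).ratio(),
    )
    return best, vocabulary[best]
```

Because only the selected style's phrases are passed in as `vocabulary`, phrases of the other styles can never be returned, which mirrors the narrowing described above.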

Based on the control command signal received from the collation unit 14, the control unit 16 sends control signals to the air conditioning device B1, the main lighting device B2, the indirect lighting device B3, the bathtub device B4, the bath TV B5, and so on to control their operation, and switches the display panel 15 to the screen related to each control content. For example, when starting the air conditioning device B1, it displays the air conditioner operation screen G11 on the display panel 15; when turning on the main lighting device B2, it displays the lighting device operation screen G21. At this time too, the control unit 16 outputs the state of the screen shown on the display panel 15 (display screen ID data, display screen state data, and so on) to the collation unit 14.
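The control unit's handling of a command (send a control signal, switch the screen, report the new screen back to the collation unit) can be sketched as follows. The class and field names and the command strings are hypothetical; the device and screen pairings follow the two examples given in the text, and the screen IDs 11 and 21 are the IDs assigned to screens G11 and G21.

```python
from typing import Dict, List, NamedTuple


class Action(NamedTuple):
    device: str          # device that receives the control signal
    next_screen_id: int  # screen shown on the display panel afterwards


# Command names are hypothetical; pairings follow the examples in the description.
ACTIONS: Dict[str, Action] = {
    "START_AIR_CONDITIONER_B1": Action("air conditioning device B1", 11),  # screen G11
    "START_MAIN_LIGHT_B2": Action("main lighting device B2", 21),          # screen G21
}


class ControlUnit:
    def __init__(self) -> None:
        self.screen_id = 0           # top screen GT is shown at startup
        self.sent: List[str] = []    # devices that have been sent a control signal

    def handle(self, command: str) -> int:
        """Send the control signal, switch screens, and return the new screen ID."""
        action = ACTIONS[command]
        self.sent.append(action.device)
        self.screen_id = action.next_screen_id
        return self.screen_id  # this is what gets reported to the collation unit
```

The returned screen ID is what lets the collation side pick the vocabulary associated with the newly displayed screen for the next utterance.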

Thus, because recognition target vocabulary corresponding to each of a plurality of utterance styles is set for each operation content of the control targets, even when the collation unit 14 has difficulty recognizing speech with the vocabulary of one utterance style, the user can select another utterance style and utter its vocabulary, which makes recognition easier and improves speech recognition performance. In addition, because users can select the utterance style that suits them, the device is easier to speak to and more usable for each user. In other words, both speech recognition performance and usability can be improved.

Moreover, even when the number of utterance styles is increased, growth of the recognition target vocabulary actually used in the collation process is suppressed, which prevents degradation of speech recognition performance.

The speech recognition device A may also be installed in a living space other than the bathroom. For example, with the speech recognition device A installed in a kitchen, the user's speech input can control devices in the kitchen or switch the screen of the display panel 15 to display cooking recipes.

As shown in FIG. 7, utterance styles other than the name-only style, the request style, and the command style are also conceivable, such as a complaint style and an associative style. In the complaint style, vocabulary expressing a user's complaint is registered as recognition target vocabulary and is associated with device control that resolves the complaint. For example, the recognition target vocabulary 寒い ("it's cold") may be associated with the operation content "start the heating device" (the heating device is started to resolve the complaint of being cold).

In the associative style, a remark that a user utters involuntarily is registered as recognition target vocabulary and is associated with the device control suggested by the content of that utterance. For example, the recognition target vocabulary 寒い ("it's cold") may be associated with "start the heating device" (a user who feels cold needs to be warmed, so the heating device is started). As another example, the recognition target vocabulary 眠いなあ ("I'm sleepy") may be associated with the operation content "raise the illuminance of the lighting" (a user who feels sleepy needs to be roused, so the illuminance is raised).

(Embodiment 2)
The speech recognition device A of this embodiment has the block configuration shown in FIG. 8, in which a frequency calculation unit 17 is added to the configuration of Embodiment 1. Components identical to those of Embodiment 1 are given the same reference numerals, and their description is omitted.

In this embodiment, the frequency with which each utterance style is used within a predetermined period is calculated from the speech data supplied by the speech input unit 11, and the utterance style is switched based on the calculated frequency. FIG. 9 shows the operation flowchart.

First, when the collation unit 14 receives the speech data that the user input to the speech input unit 11, it functions as utterance style classification means: it classifies the utterance style of the speech data into one of the preset utterance styles and outputs the corresponding utterance style ID (see FIG. 5) to the frequency calculation unit 17, which thereby acquires the utterance style ID (S1). The collation unit 14 performs this classification by recognizing the features of each utterance style (word endings, intonation, and so on) using a well-known speech recognition method.

The frequency calculation unit 17 functions as frequency calculation means that calculates the appearance frequency of each classified utterance style. It decrements the utterance style selection count value (initial value 30) of a built-in timing counter (not shown) (S2), and counts the number of ID acquisitions for each utterance style with a built-in frequency counter (not shown) (S3). The utterance style selection count value indicates when utterance style selection takes place: when the count value reaches zero, utterance style selection is performed (S4); if it is not zero, the process ends and starts again from S1.

When the count value reaches zero, the frequency calculation unit 17 resets the utterance style selection count value to its initial value of 30 (S5), then calculates each utterance style's share of the ID acquisitions and determines whether any utterance style accounts for a majority of the 30 acquisitions (S6). If one does, that style has the highest appearance frequency and the user is deemed to prefer it. The frequency calculation unit 17 then outputs the utterance style ID of that style to the selection unit 13 and, via the collation unit 14, outputs to the control unit 16 a command to generate a screen notifying the user that the utterance style has been switched (S7). The control unit 16 overlays the utterance style switching notification screen H1 on the screen currently shown on the display panel 15, for example the air conditioner operation screen G11 (FIG. 10(a)), as shown in FIG. 10(b).

From the recognition target vocabulary stored in the storage unit 12, the selection unit 13 selects, as the vocabulary to be used in the collation process, the recognition target vocabulary of the utterance style designated by the received utterance style ID.

Thereafter, the collation unit 14 uses for collation processing only the recognition target vocabulary corresponding to the utterance style selected by the selection unit 13.
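The relationship between the storage unit 12, the selection unit 13, and the collation unit 14 can be sketched roughly as follows. This is only an illustrative sketch: the style IDs, the phrases, the command names, and the function names `select_vocabulary` and `collate` are all hypothetical, and exact text matching stands in for the acoustic collation that the device actually performs.

```python
from typing import Dict, Optional

# Hypothetical contents of storage unit 12:
# utterance style ID -> {recognition target vocabulary: control command}.
# The IDs, phrases, and command names are assumptions for illustration only.
STORAGE: Dict[int, Dict[str, str]] = {
    1: {"air conditioner on": "AIRCON_ON", "light on": "LIGHT_ON"},
    2: {"turn on the air conditioner": "AIRCON_ON", "turn on the light": "LIGHT_ON"},
}


def select_vocabulary(style_id: int) -> Dict[str, str]:
    """Selection unit 13: return only the vocabulary of the designated style."""
    return STORAGE[style_id]


def collate(utterance: str, vocabulary: Dict[str, str]) -> Optional[str]:
    """Stand-in for collation unit 14.

    Uses exact text matching (an assumption); returns the control command
    for the matched vocabulary, or None when nothing matches.
    """
    return vocabulary.get(utterance.strip().lower())
```

With style 2 selected, `collate("Turn on the light", select_vocabulary(2))` yields the command for the light, while the style 1 phrase "light on" is no longer matched, which reflects the restriction to a single utterance style described above.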

If no utterance style accounts for a majority in S6, the user's preferred utterance style is regarded as not yet established; the utterance style is not switched, this process ends, and processing starts again from S1.
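Steps S2 to S7 of the frequency calculation unit 17 can be sketched as below. The class name `FrequencyCalculator` and the method `observe` are hypothetical. The sketch also assumes that the frequency counter is cleared at the end of each 30-utterance window; the description states only that the timing counter is reset.

```python
from collections import Counter
from typing import Optional

WINDOW = 30  # initial value of the utterance style selection count


class FrequencyCalculator:
    """Illustrative model of frequency calculation unit 17 (names are assumptions)."""

    def __init__(self, window: int = WINDOW) -> None:
        self.window = window
        self.remaining = window           # timing counter
        self.counts: Counter = Counter()  # frequency counter per style ID

    def observe(self, style_id: int) -> Optional[int]:
        """Process one classified utterance (style ID obtained in S1).

        Returns the style ID to switch to, or None when no switch occurs.
        """
        self.remaining -= 1               # S2: decrement the timing counter
        self.counts[style_id] += 1        # S3: count ID acquisitions per style
        if self.remaining != 0:           # S4: not yet time to select
            return None
        self.remaining = self.window      # S5: reset the timing counter
        style, hits = self.counts.most_common(1)[0]
        self.counts.clear()               # assumption: start a fresh window
        # S6: switch only when one style holds a strict majority of the window
        return style if hits * 2 > self.window else None  # S7 when not None
```

For example, 16 utterances in one style and 14 in another produce a switch on the 30th utterance, whereas a 15/15 split leaves the current style unchanged.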

Note that the frequency calculation unit 17 may instead output to the selection unit 13 the utterance style ID of the utterance style with the largest number of ID acquisitions. The initial value of the utterance style selection count value used by the frequency calculation unit 17 may also be a value other than 30.
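The alternative rule, in which the style with the largest number of ID acquisitions is chosen even without a majority, might look like the following sketch. The function name `most_frequent_style` is hypothetical, and the tie-breaking behaviour (the style encountered first wins) is an assumption, since the description does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable, Optional


def most_frequent_style(style_ids: Iterable[int]) -> Optional[int]:
    """Return the style ID acquired most often in the window, or None if empty.

    Assumption: ties go to the style that appeared first, which is how
    Counter.most_common orders elements with equal counts.
    """
    counts = Counter(style_ids)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With a window of 12, 10, and 8 acquisitions for three styles, this rule selects the first style even though 12 is below the majority threshold of 16 that the S6 check would require.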

Thus, because recognition target vocabulary corresponding to each of a plurality of utterance styles is set for each operation of the control target, the collation unit 14 identifies the recognition target vocabulary using the utterance style the user uses most often, which improves speech recognition performance. In addition, because the utterance style the user uses most often is selected without the user having to be aware of it, the usability of the speech recognition device improves for the user. In other words, both speech recognition performance and usability can be improved.

A diagram showing the block configuration of the speech recognition device of Embodiment 1.
A diagram showing the speech recognition device installed in a bathroom.
A diagram showing the hierarchical structure of the display screens.
A diagram showing the correspondence between display screens and recognition target vocabulary.
A diagram showing the correspondence between utterance styles and utterance style IDs.
A diagram showing the correspondence between recognition target vocabulary and control commands.
A diagram showing examples of utterance styles.
A diagram showing the block configuration of the speech recognition device of Embodiment 2.
A flowchart showing the operation when setting the utterance style.
(a), (b) Diagrams showing the pasting of the utterance style switching notification screen.

Explanation of symbols

A  Speech recognition device
11  Speech input unit
12  Storage unit
13  Selection unit
14  Collation unit
15  Display panel
16  Control unit
B1  Air conditioning device
B2  Main lighting device
B3  Indirect lighting device
B4  Bathtub device
B5  Bathroom television

Claims

A speech recognition device that controls the operation of a control target based on speech uttered by a user, comprising:
a speech input unit to which the user's speech is input;
a storage unit that stores recognition target vocabulary associated with each operation of the control target in each of a plurality of utterance styles;
a selection unit that selects one utterance style from the plurality of utterance styles;
a collation unit that collates the speech input to the speech input unit with the recognition target vocabulary of the utterance style selected by the selection unit, out of the recognition target vocabulary stored in the storage unit, and identifies the recognition target vocabulary based on the collation result; and
a control unit that operates the control target based on the recognition target vocabulary identified by the collation unit.

The speech recognition device according to claim 1, further comprising presenting means for presenting the plurality of utterance styles of the recognition target vocabulary stored in the storage unit, wherein the selection unit selects one utterance style from the plurality of utterance styles based on an input from the user.

The speech recognition device according to claim 1, further comprising: utterance style classification means for classifying the utterance style of the speech input to the speech input unit into one of the plurality of utterance styles; and frequency calculation means for calculating the appearance frequency of each classified utterance style, wherein the selection unit selects, from the plurality of utterance styles, the utterance style with the highest appearance frequency.