JP2014149457A

JP2014149457A - Voice recognition device, electronic apparatus, and control program of voice recognition device

Info

Publication number: JP2014149457A
Application number: JP2013018898A
Authority: JP
Inventors: Hironori Tanaka; 裕紀田中; Masahito Takeuchi; 雅人竹内; Kazuaki Shimaoka; 和章嶋岡; Kaiji Nabetani; 海二鍋谷; Kenji Kimura; 賢二木村; Nami Iiyama; 菜美飯山
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2013-02-01
Filing date: 2013-02-01
Publication date: 2014-08-21

Abstract

PROBLEM TO BE SOLVED: To increase the accuracy of voice recognition without impairing user convenience. SOLUTION: A recognition control unit (10) includes: an apparatus condition acquisition unit (11) for acquiring information which shows a condition of a digital camera (100); a candidate phrase determination unit (12) for determining a phrase associated with the information as a candidate phrase; a voice acquisition unit (13) for acquiring voice data; a specific phrase detection unit (14) for detecting a specific phrase from the voice data; and a recognition phrase determination unit (15) for identifying that the specific phrase is one of the candidate phrases and determining that phrase as a recognition phrase.

Description

The present invention relates to a speech recognition device that recognizes phrases contained in speech, and to an electronic device including the speech recognition device.

Various speech recognition methods for recognizing phrases contained in speech are known. Various techniques have also been disclosed for improving the accuracy of speech recognition processing.

For example, Patent Document 1 discloses a technique that improves speech recognition accuracy by giving the storage device for speech recognition a hierarchical structure, setting higher layers in a timely manner so that longer phrases can be input as the user's proficiency improves, and replacing infrequently used vocabulary with new vocabulary. Patent Document 2 discloses a technique that improves speech recognition accuracy by presenting candidate phrases for speech recognition to the user and having the user narrow down or correct the recognition results. Patent Documents 3 and 4 disclose techniques that improve speech recognition accuracy by identifying, from the speech, a category containing phrases that are recognition candidates and presenting that category to the user, thereby having the user narrow down the phrases.

Patent Document 1: JP 2004-325704 A (published November 18, 2004)
Patent Document 2: JP 2012-022251 A (published February 2, 2012)
Patent Document 3: JP 2001-109492 A (published April 20, 2001)
Patent Document 4: JP 2006-184670 A (published July 13, 2006)

However, although the conventional techniques described above can improve speech recognition accuracy, they may impair user operability. For example, in the technique disclosed in Patent Document 1, the vocabulary used for speech recognition is determined by the user's proficiency. Therefore, when a device using this technique is shared by a plurality of users, the vocabulary may not suit the proficiency of each user. In such a case, misrecognition of vocabulary and the resulting malfunctions of the device increase, and user operability may be impaired as a result. In the techniques disclosed in Patent Documents 2 to 4, the user must perform predetermined operations after the initial utterance before the final speech recognition result is obtained, so the user's operation becomes complicated and operability is impaired.

The present invention has been made in view of the above problems, and its object is to realize a speech recognition device, and a control program for a speech recognition device, that can improve recognition accuracy in speech recognition without impairing user operability.

In order to solve the above problems, a speech recognition device according to one aspect of the present invention is a speech recognition device that detects a user's utterance as speech and recognizes phrases contained in that speech, the device comprising: device state acquisition means for acquiring information indicating the state of an electronic device that is the target of voice operation; candidate phrase determination means for determining candidate phrases that are associated with the information indicating the state of the electronic device acquired by the device state acquisition means and that are the targets of the speech recognition; voice data acquisition means for acquiring the user's utterance as voice data; specific phrase detection means for detecting, as a specific phrase, at least one phrase that identifies the utterance content from the voice data acquired by the voice data acquisition means; and recognized phrase determination means for identifying that the specific phrase detected by the specific phrase detection means is one of the candidate phrases determined by the candidate phrase determination means, and determining the identified phrase as a recognized phrase.

According to one aspect of the present invention, the speech recognition device determines the candidate phrases to be subjected to speech recognition according to the current state of the electronic device, so the candidate phrases used for speech recognition are narrowed down automatically. The user therefore does not need to narrow down the candidate phrases, that is, does not need to perform any operation to improve speech recognition accuracy. As a result, recognition accuracy in speech recognition can be improved without impairing user operability.

A block diagram showing the main configuration of a digital camera equipped with the speech recognition device according to the first embodiment of the present invention.
A diagram showing an example of the data structure of the phrase table that the speech recognition device uses for speech recognition.
A flowchart showing the flow of the speech recognition processing performed in the digital camera.
A diagram showing an outline of the operation of the digital camera.
A diagram showing an outline of the operation of a digital camera equipped with the speech recognition device according to the second embodiment of the present invention.
A diagram showing an example of the data structure of the phrase table that the speech recognition device uses for speech recognition.

Embodiment 1
The first embodiment of the present invention is described below. In this embodiment, an example is described in which the speech recognition device of the present invention is mounted on a digital camera, which is one kind of electronic device. This embodiment is described in detail below with reference to FIGS. 1 to 4.

[Main configuration]
First, the main configuration of the digital camera 100 is described with reference to FIG. 1. FIG. 1 is a block diagram showing the main configuration of the digital camera 100. Parts that are not directly related to this embodiment (for example, the part that connects to external devices and the part that takes photographs) are omitted from the description. As illustrated, the digital camera 100 (electronic device) includes a recognition control unit 10 serving as the speech recognition device, a sensor unit 20 (sensor unit, device state specifying means), a device state specifying unit 21, a voice detection unit 30 (voice detection means), a phrase table 40, and a device control unit 50.

The sensor unit 20 detects that the current state of the digital camera 100 has changed. Here, a change in the state of the digital camera 100 refers to a state in which the digital camera 100 has received an arbitrary control directive, for example, a state in which a shooting mode, a display mode, or the setting of various parameters has been specified for the digital camera 100. The sensor unit 20 detects that a control directive has been transmitted to the digital camera 100 and transmits a detection signal to the device state specifying unit 21 described later. The sensor unit 20 may use any detection method as long as it can detect a change in the state of the digital camera 100. For example, the sensor unit 20 may detect a change in state by receiving a setting signal for setting the shooting mode of the digital camera 100. In this case, the setting signal changes when the shooting mode is changed, so the setting signal may be transmitted to the device state specifying unit 21 as the detection signal.

The device state specifying unit 21 specifies the state of the digital camera 100 from the detection signal. Here, the state of the digital camera 100 is the state of the digital camera 100 at the time it received the control directive. The state of the digital camera 100 therefore refers to a control state, for example, a state in which a shooting mode, a display mode, or various parameters are set. Upon specifying the state of the digital camera 100, the device state specifying unit 21 transmits information indicating the specified state to the device state acquisition unit 11 in the recognition control unit 10. The device state specifying unit 21 may use any specifying method as long as it can specify the current state of the digital camera 100. For example, when the setting signal is received from the sensor unit 20, the shooting mode set by that setting signal may be specified as the current state of the digital camera 100.

The voice detection unit 30 detects an utterance of the user of the digital camera 100 as voice data. The configuration and form of the voice detection unit 30 are not particularly limited. The voice data detected by the voice detection unit 30 is transmitted to the voice acquisition unit 13 of the recognition control unit 10 described later. The sensor unit 20, the device state specifying unit 21, and the voice detection unit 30 described above may be built into the digital camera 100 or may be connected as external devices of the digital camera 100.

The phrase table 40 is information indicating the phrases to be matched against a specific phrase. In the phrase table 40, each phrase is stored in association with information on whether that phrase is a candidate phrase and with information indicating a control instruction of the digital camera 100. Here, a "candidate phrase" is a phrase that the recognized phrase determination unit 15 described later matches against the specific phrase (that is, a phrase that is a candidate for detecting a phrase contained in the user's utterance in speech recognition). The detailed data structure of the phrase table 40 is described later. The phrase table 40 is rewritten by the candidate phrase determination unit 12 and is referred to by the recognized phrase determination unit 15 and the control signal output unit 16.

The recognition control unit 10 performs overall control of the speech recognition of the digital camera 100. The recognition control unit 10 is realized by, for example, a CPU (central processing unit). More specifically, the recognition control unit 10 includes a device state acquisition unit 11 (device state acquisition means), a candidate phrase determination unit 12 (candidate phrase determination means), a voice acquisition unit 13 (voice data acquisition means), a specific phrase detection unit 14 (specific phrase detection means), a recognized phrase determination unit 15 (recognized phrase determination means), and a control signal output unit 16.

The device state acquisition unit 11 acquires information indicating the state of the digital camera 100 from the device state specifying unit 21 and transmits the acquired information to the candidate phrase determination unit 12 described later.

The candidate phrase determination unit 12 determines the phrases associated with the information indicating the device state, received from the device state acquisition unit 11, as the candidate phrases to be subjected to the speech recognition. Specifically, the candidate phrase determination unit 12 determines which of the phrases stored in the phrase table 40 are treated as candidate phrases. The method by which the candidate phrase determination unit 12 determines candidate phrases is described in detail later.

The voice acquisition unit 13 acquires voice data from the voice detection unit 30 and transmits the acquired voice data to the specific phrase detection unit 14.

The specific phrase detection unit 14 detects a specific phrase (hereinafter referred to as the "specific phrase") from among the phrases contained in the voice data transmitted from the voice acquisition unit 13. Here, the "specific phrase" is a phrase that indicates the content of the user's voice instruction to the digital camera 100 for controlling the digital camera 100. More specifically, the specific phrase detection unit 14 converts the voice data into text data and, using a pre-registered database (not shown), detects the specific phrase contained in the voice data as text data. The detected specific phrase is transmitted to the recognized phrase determination unit 15.

The specific phrase only needs to be detected in a format that allows the recognized phrase determination unit 15 described later to match it against the phrases in the phrase table 40 that have been determined as candidate phrases; the detection format of the specific phrase is not otherwise limited. For example, the specific phrase may be voice data.

The recognized phrase determination unit 15 matches the specific phrase against each candidate phrase and determines the candidate phrase that matches the specific phrase to be the phrase that the specific phrase represents. Hereinafter, a candidate phrase that has been judged to match the specific phrase as a result of this matching is simply called the "recognized phrase". More specifically, the recognized phrase determination unit 15 compares the character string of the specific phrase with the character strings of the candidate phrases stored in the phrase table 40. When the two character strings match entirely, or match at or above a certain ratio, the candidate phrase is judged to be the recognized phrase.
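The matching rule described above can be illustrated with a minimal Python sketch. The function name determine_recognized_phrase, the use of difflib.SequenceMatcher as the similarity measure, and the threshold value 0.8 are all assumptions made for illustration; the source only states that the strings must match entirely or at or above a certain ratio.

```python
from difflib import SequenceMatcher


def determine_recognized_phrase(specific_phrase, candidates, threshold=0.8):
    """Return the candidate that best matches the specific phrase, or None.

    threshold is a hypothetical stand-in for the "certain ratio" in the text.
    """
    best, best_ratio = None, 0.0
    for candidate in candidates:
        # ratio() is 1.0 for identical strings and lower for partial matches.
        ratio = SequenceMatcher(None, specific_phrase, candidate).ratio()
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best
```

With this sketch, an exact match and a near match (for example, one dropped character) both resolve to the candidate, while an unrelated phrase yields no recognized phrase.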

The control signal output unit 16 reads, from the phrase table 40, the control instruction associated with the recognized phrase determined by the recognized phrase determination unit 15, creates a control command for controlling the various functions of the digital camera 100, and outputs the created control command to the device control unit 50 as a control signal. Here, a "control command" is a command for controlling the various functions of the digital camera 100.

The device control unit 50 controls the various functions of the digital camera 100. The functions controlled by the device control unit 50 are not particularly limited; specific examples include the basic functions that the digital camera 100 has as a camera, such as taking photographs and the timer, various setting functions such as setting the shooting mode, and the display and operation of menu screens and the like.

[Data structure of the phrase table]
Next, the detailed data structure of the phrase table 40 is described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the data structure of the phrase table 40 that the recognition control unit 10 uses for speech recognition. Specifically, FIG. 2 shows an example of the data in the phrase table 40 when the shooting mode (state) of the digital camera 100 is "outdoor". Showing the phrase table 40 in FIG. 2 as a table-format data structure is only an example, and there is no intention to limit the data structure of the phrase table 40 to a table format. The same applies to the other figures that explain data structures. As illustrated, the phrase table 40 has a "phrase" column, a "candidate" column, and a "control instruction" column, with the "candidate" and "control instruction" columns associated with the "phrase" column. Therefore, once the information in the "phrase" column is determined, the information in the "candidate" and "control instruction" columns is uniquely determined.

The "phrase" column stores the phrases that the recognition control unit 10 uses for speech recognition.

The "candidate" column stores, for each row, information on whether the phrase stored in the "phrase" column is treated as a candidate phrase in speech recognition. The information in the "candidate" column is rewritten by the candidate phrase determination unit 12 according to the device state acquired by the device state acquisition unit 11. In FIG. 2, a phrase stored in the "phrase" column of a row whose "candidate" column is "○" is matched against the specific phrase. That is, the phrase in a row whose "candidate" column is "○" is a candidate phrase. A phrase stored in the "phrase" column of a row whose "candidate" column is blank, on the other hand, is not matched.

The "control instruction" column stores, for each row of the phrase table 40, information indicating the control instruction to be executed by the digital camera 100 when the phrase stored in the "phrase" column matches the specific phrase, that is, when it is judged to be the recognized phrase. Here, a "control instruction" is the information that the control signal output unit 16 needs in order to create a control command. The information in the "control instruction" column is read by the control signal output unit 16.

The format of the information stored in each column is not particularly limited. The information in the "candidate" column may be freely changeable by the user. The phrase table 40 may be predetermined in the digital camera 100, or it may be stored so that the user can rewrite it. For example, the user may be allowed to add new phrases to the phrase table 40, or to change the control instruction or the "candidate" column information associated with a phrase. Furthermore, the phrase table 40 does not necessarily have to store control instructions. However, storing control instruction information in the phrase table 40 or in another table and associating it with the phrases has the advantage that, at the time of speech recognition, the digital camera 100 can be controlled according to the recognized phrase.
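As a rough illustration of the three-column structure described above, the phrase table can be sketched in Python as a list of records. The class name PhraseEntry, the field names, and the control instruction identifiers such as SET_WB_SUNNY are hypothetical; the source does not specify a storage format or the contents of the control instructions. The two rows reflect the "outdoor" example, in which "white balance sunny" is a candidate and "white balance fluorescent lamp" is not.

```python
from dataclasses import dataclass


@dataclass
class PhraseEntry:
    phrase: str               # "phrase" column
    candidate: bool           # "candidate" column (True corresponds to the circle mark)
    control_instruction: str  # "control instruction" column (identifiers are hypothetical)


# Example rows for the shooting mode "outdoor".
phrase_table = [
    PhraseEntry("white balance sunny", True, "SET_WB_SUNNY"),
    PhraseEntry("white balance fluorescent lamp", False, "SET_WB_FLUORESCENT"),
]


def candidate_phrases(table):
    """Return only the phrases that are currently matched against the specific phrase."""
    return [entry.phrase for entry in table if entry.candidate]
```

Because the phrase determines the other two columns, a dictionary keyed by phrase would serve equally well; the list form is used here only to mirror the rows of the figure.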

[Determination of candidate phrases by the candidate phrase determination unit]
Next, the determination of candidate phrases by the candidate phrase determination unit 12 is described in detail with reference to FIG. 2. More specifically, the candidate phrase determination unit 12 switches the value of the "candidate" column of each phrase in the phrase table 40 based on the information indicating the device state received from the device state acquisition unit 11.

When the shooting mode of the digital camera 100 is "outdoor", the user is expected to very rarely utter a phrase that changes the white balance setting to an indoor setting. For phrases that the user is expected not to utter in a given state of the digital camera 100, the "candidate" column of the phrase table 40 is left blank so that they are excluded from the candidate phrases in that state. Specifically, upon receiving information indicating the state of the digital camera 100 from the device state acquisition unit 11, the candidate phrase determination unit 12 refers to the phrase table 40 and searches for the phrases associated in advance with that state. The candidate phrase determination unit 12 then sets the "candidate" column to "○" only for the phrases associated with that state (making those phrases candidate phrases) and leaves the "candidate" column blank for all other phrases.
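The rewriting of the "candidate" column described above can be sketched as follows. The mapping STATE_PHRASES, the function name update_candidates, and the "indoor" state are assumptions for illustration; the source describes only the "outdoor" shooting mode explicitly and does not say how the state-to-phrase association is stored.

```python
# Hypothetical association between device states and the phrases expected in each state.
STATE_PHRASES = {
    "outdoor": {"white balance sunny"},
    "indoor": {"white balance fluorescent lamp"},
}


def update_candidates(table, state):
    """Mark only the phrases associated with the given state as candidates.

    Each row is a dict with "phrase" and "candidate" keys; every row not
    associated with the state has its candidate flag cleared (left blank).
    """
    allowed = STATE_PHRASES.get(state, set())
    for row in table:
        row["candidate"] = row["phrase"] in allowed
    return table
```

Calling update_candidates again with a different state overwrites the previous flags, which corresponds to the column being rewritten each time the device state changes.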

Instead of narrowing down the candidate phrases by the method described above, the candidate phrase determination unit 12 may treat as candidate phrases the phrases associated with control instructions that are executable in the current state of the digital camera 100. In this case, information indicating the control instructions is stored in the digital camera 100 and is associated with the phrases in the phrase table 40. Alternatively, instead of rewriting the information in the "candidate" column as described above, the candidate phrase determination unit 12 may change which phrase table it refers to according to the state of the digital camera 100. In this case, the digital camera 100 has a phrase table for each of its states, and since the "candidate" column is never rewritten, the "candidate" column is not strictly necessary.
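The per-state table variant mentioned above could look like the following sketch. The dictionary TABLES_BY_STATE, the function table_for_state, the "indoor" state, and the control instruction identifiers are hypothetical names introduced for illustration.

```python
# One phrase table per device state; each maps a phrase to its control instruction.
# No "candidate" column is needed because every phrase in the selected table is a candidate.
TABLES_BY_STATE = {
    "outdoor": {"white balance sunny": "SET_WB_SUNNY"},
    "indoor": {"white balance fluorescent lamp": "SET_WB_FLUORESCENT"},
}


def table_for_state(state):
    """Select the phrase table to refer to for the current device state."""
    return TABLES_BY_STATE.get(state, {})
```

This design trades extra storage for the removal of the rewrite step: switching state only changes which table is consulted.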

[Processing flow]
FIG. 3 is a diagram showing the flow of the speech recognition processing performed in the digital camera 100. First, information indicating the state of the digital camera is acquired (S100). Specifically, the device state acquisition unit 11 acquires the information indicating the state of the digital camera 100 specified by the device state specifying unit 21. Next, the "candidate" column of the phrase table is rewritten according to the information indicating the state of the digital camera acquired in S100 (S102). Specifically, upon receiving the information indicating the state from the device state acquisition unit 11, the candidate phrase determination unit 12 rewrites the information in the "candidate" column of the phrase table 40 according to the state of the digital camera 100 indicated by the received information. This determines which of the phrases contained in the phrase table 40 are the candidate phrases for speech recognition.

Next, when speech is detected (YES in S104), voice data is acquired from the detected speech (S106). Specifically, the voice acquisition unit 13 acquires, as voice data, the utterance of the user of the digital camera 100 detected by the voice detection unit 30. Then, the specific phrase contained in the voice data is detected (S108). Specifically, the specific phrase detection unit 14 detects, from the voice data acquired by the voice acquisition unit 13, the specific phrase to be subjected to speech recognition. Next, the recognized phrase is determined (S110). Specifically, the recognized phrase determination unit 15 determines, as the recognized phrase, the candidate phrase in the phrase table 40 that matches the specific phrase detected by the specific phrase detection unit 14.

The control signal output unit 16 then reads the control instruction corresponding to the recognized phrase from the phrase table (S112), creates a control command from the control instruction that was read (S114), and outputs the control command (S116). Finally, the digital camera 100 performs control according to the control command (S118).
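The sequence S100 to S116 can be summarized in a short Python sketch. This is a simplified illustration under several assumptions: speech detection and speech-to-text conversion (S104 to S108) are stubbed out by passing in already transcribed text, matching is exact rather than ratio-based, and the names PHRASE_TABLE, STATE_PHRASES, run_voice_control, and the control instruction identifiers are hypothetical.

```python
# Hypothetical phrase table: phrase -> control instruction.
PHRASE_TABLE = {
    "white balance sunny": "SET_WB_SUNNY",
    "white balance fluorescent lamp": "SET_WB_FLUORESCENT",
}

# Hypothetical association between device states and candidate phrases.
STATE_PHRASES = {
    "outdoor": {"white balance sunny"},
}


def run_voice_control(state, utterance_text, table=PHRASE_TABLE, state_phrases=STATE_PHRASES):
    """Return the control instruction for an utterance, or None if it is not recognized."""
    # S100-S102: acquire the device state and determine the candidate phrases.
    allowed = state_phrases.get(state, set())
    candidates = {p: instr for p, instr in table.items() if p in allowed}
    # S104-S108: detection is stubbed; the utterance is assumed to be transcribed already.
    specific_phrase = utterance_text.strip().lower()
    # S110: determine the recognized phrase (exact match in this sketch).
    if specific_phrase not in candidates:
        return None
    # S112-S116: read the control instruction and hand it on as the control command.
    return candidates[specific_phrase]
```

In the "outdoor" state, run_voice_control("outdoor", "white balance sunny") yields the sunny white balance instruction, whereas "white balance fluorescent lamp" yields None because that phrase is not a candidate in that state.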

[Operation overview]
Next, an outline of the operation of the digital camera 100 equipped with the speech recognition device according to this embodiment is described with reference to FIG. 4. Specifically, (a) and (b) of FIG. 4 show the operation of the digital camera 100 when the shooting mode is "outdoor". Here, the candidate phrases are determined using the phrase table 40 shown in FIG. 2. That is, as shown in FIG. 2, the candidate phrase determination unit 12 of the digital camera 100 treats the phrase "white balance sunny" as a candidate phrase but does not treat the phrase "white balance fluorescent lamp" as a candidate phrase.

Suppose that, in the above case, the user utters the phrase “white balance sunny” ((a) of FIG. 4). This phrase is a candidate phrase, so the recognized phrase determination unit 15 collates it with the specific phrase. As a result, the phrase is determined to be the recognized phrase, and the control instruction corresponding to the recognized phrase (switch the white balance to the “sunny” setting) is executed as a control command. That is, the user's voice is recognized and the white balance is changed to “sunny”. On the other hand, when the user utters the phrase “white balance fluorescent lamp” ((b) of FIG. 4), the phrase is not a candidate phrase, so the recognized phrase determination unit 15 does not collate it with the specific phrase. That is, the user's voice is not recognized.
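The state-dependent choice of candidate phrases in this example can be sketched as follows. This is only an illustration under stated assumptions: the row layout, the instruction names, and the “indoor” mode are hypothetical and do not come from FIG. 2.

```python
# Hypothetical rows: (phrase, control instruction, shooting modes in which
# the phrase is a candidate). The "indoor" mode is an assumed example value.
PHRASE_ROWS = [
    ("white balance sunny", "SET_WB_SUNNY", {"outdoor"}),
    ("white balance fluorescent lamp", "SET_WB_FLUORESCENT", {"indoor"}),
]


def candidate_phrases(shooting_mode: str) -> set:
    """Return the phrases marked as candidates for the given shooting mode."""
    return {phrase for phrase, _, modes in PHRASE_ROWS if shooting_mode in modes}


def is_recognized(utterance: str, shooting_mode: str) -> bool:
    """An utterance is recognized only if it is a candidate in this mode."""
    return utterance in candidate_phrases(shooting_mode)
```

With the shooting mode set to “outdoor”, `is_recognized("white balance sunny", "outdoor")` is true while the fluorescent lamp phrase is rejected, which mirrors (a) and (b) of FIG. 4.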

In this way, the digital camera 100 equipped with the speech recognition device according to the present embodiment can exclude from the speech recognition candidates any phrase indicating an instruction that the user is not expected to give in the current shooting mode.

Embodiment 1 described an example in which the candidate phrases to be recognized are determined according to the state of the digital camera 100. However, the invention is not limited to this; whether a phrase associated with the information indicating the state of the digital camera 100 is made a candidate phrase may instead be determined from the number of characters in that phrase. In this case, when the device is in a state in which the detection accuracy of the specific phrase is expected to fall, phrases that are unlikely to be misrecognized can be made the candidate phrases. Embodiment 2 below describes an example in which, particularly when the digital camera 100 is in a predetermined state (a state in which misrecognition is likely), a phrase associated with the information indicating the state of the digital camera 100 is determined to be a candidate phrase if its number of characters is equal to or greater than the number of characters at which erroneous detection of the recognized phrase tends to occur.

<< Embodiment 2 >>
The second embodiment of the present invention is described below. In the present embodiment, as in Embodiment 1, an example is described in which the speech recognition device of the present invention is mounted on a digital camera, which is a kind of electronic device. For convenience of explanation, members having the same functions as those described in Embodiment 1 are given the same reference numerals, and their description is omitted.

The present embodiment differs from Embodiment 1 in the process by which the candidate phrase determination unit 12 in the recognition control unit 10 shown in FIG. 1 determines candidate phrases. In the speech recognition process of the present embodiment, phrases that are prone to misrecognition are excluded from the phrase table so that, in a situation where voice is predicted not to be detected accurately (a state in which misrecognition is likely), the accuracy of speech recognition is preserved as far as possible.

Here, a “phrase prone to misrecognition” is, specifically, a phrase with few characters (a short phrase), a phrase containing many characters or words similar to those of other words, or the like. Such phrases offer few portions by which they can be distinguished from other phrases and are therefore difficult to identify accurately in speech recognition.

Next, the determination of candidate phrases performed by the candidate phrase determination unit 12 in the present embodiment will be described in detail. In addition to the determination described in the first embodiment, the candidate phrase determination unit 12 of the present embodiment determines candidate phrases by the following method.

(a) to (d) of FIG. 5 are diagrams showing an outline of the operation of the digital camera 100 in the present embodiment. The digital camera 100 shown in (a) and (b) of FIG. 5 is in a shooting mode other than “selfie”, and the digital camera 100 shown in (c) and (d) of FIG. 5 is in the “selfie” shooting mode. FIG. 6 shows a specific example of the phrase table 40 used by the speech recognition device according to the second embodiment of the present invention. More specifically, FIG. 6 shows the phrase table 40 when the shooting mode of the digital camera 100 is “selfie”. The data structure of the phrase table 40 itself is the same as in the first embodiment.

When the shooting mode (state) of the digital camera 100 is “selfie” ((c) and (d) of FIG. 5), the distance between the speaking user and the voice detection unit 30 of the digital camera 100 is considered to be greater than in the normal case ((a) and (b) of FIG. 5).

In this case, the quality of the voice data detected by the voice detection unit 30 is expected to fall, and with it the detection accuracy of the specific phrase detected from that voice data. If phrases with few characters are also made candidate phrases while the detection accuracy of the specific phrase is low, erroneous detection of the recognized phrase increases and the user's operability is impaired rather than improved.

Therefore, as shown in FIG. 6, the candidate phrase determination unit 12 in the present embodiment leaves the corresponding “candidate” column blank for phrases whose number of characters is less than a predetermined threshold, such as “shooting” (撮影), “menu” (メニュー), and “browse” (閲覧), and thereby excludes them from the candidate phrases. If the threshold is set to the number of characters at which erroneous detection of the recognized phrase tends to occur, candidate phrases that are easily misdetected can be excluded. Erroneous detection of the recognized phrase in speech recognition can thus be reliably prevented.
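The character-count rule can be sketched in a few lines of Python. The threshold value of 5 and the function name are assumptions chosen for illustration; this document does not state a concrete threshold. The phrases are given in Japanese because the character count refers to the original wording.

```python
# Assumed threshold: phrases shorter than this are treated as error-prone.
MIN_CHARS = 5


def filter_by_length(phrases, error_prone_state, min_chars=MIN_CHARS):
    """Drop short phrases only while the device is in an error-prone state.

    `error_prone_state` stands for a state such as the "selfie" shooting
    mode, in which the user is expected to be far from the microphone.
    """
    if not error_prone_state:
        return list(phrases)
    # Keep phrases whose character count is at or above the threshold.
    return [p for p in phrases if len(p) >= min_chars]


phrases = ["シャッター撮影", "撮影", "メニュー", "閲覧"]
selfie_candidates = filter_by_length(phrases, error_prone_state=True)
normal_candidates = filter_by_length(phrases, error_prone_state=False)
```

Under these assumptions only “シャッター撮影” (seven characters) remains a candidate in the error-prone state, while all four phrases remain candidates otherwise.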

Using the speech recognition process of the present embodiment together with that of the first embodiment can further improve the accuracy of speech recognition. For example, in the present embodiment as well, the “candidate” column for the phrase “timer set”, which the user is unlikely to utter when the shooting mode is “selfie”, may be left blank (FIG. 6).

[Operation overview]
Next, an outline of the operation of the digital camera 100 according to the present embodiment will be described with reference to FIG. 5. When the shooting mode of the digital camera 100 is other than “selfie” ((a) and (b) of FIG. 5), the candidate phrase determination unit 12 of the digital camera 100 treats both “shutter shooting” (シャッター撮影) and “shooting” (撮影) as candidate phrases. When the shooting mode is “selfie” ((c) and (d) of FIG. 5), “shutter shooting” is a candidate phrase but “shooting” is not. Therefore, when the shooting mode is other than “selfie”, the user's voice is recognized and a picture is taken whether the user utters “shutter shooting” ((a) of FIG. 5) or “shooting” ((b) of FIG. 5). When the shooting mode is “selfie”, the user's voice is recognized when the user utters “shutter shooting” ((c) of FIG. 5) but is not recognized when the user utters “shooting” ((d) of FIG. 5).

In this way, in the present embodiment, when the quality of the voice data detected by the voice detection unit 30 is predicted to fall, phrases that are prone to misrecognition, such as “shooting” above, are additionally excluded from the candidate phrases. This makes it possible to achieve both user operability and improved speech recognition accuracy.

In both Embodiments 1 and 2 above, the device state acquisition unit 11 acquires information indicating the control state of the digital camera 100 as the device state. However, the invention is not limited to this; information indicating the environmental state of the digital camera 100, that is, detected values such as the brightness around the digital camera 100 or the tilt of the digital camera 100, may be acquired as the device state. Embodiment 3 below describes an example in which the speech recognition process is performed using information indicating the environmental state of the digital camera 100.

<< Embodiment 3 >>
The third embodiment of the present invention is described below. In the present embodiment, as in Embodiment 1, an example is described in which the speech recognition device of the present invention is mounted on a digital camera, which is a kind of electronic device. For convenience of explanation, members having the same functions as those described in Embodiment 1 are given the same reference numerals, and their description is omitted.

In the digital camera 100 according to the present embodiment, the sensor unit 20 shown in FIG. 1 serves as various sensors that detect information indicating the environmental state of the digital camera 100, that is, detected values such as brightness and tilt. Examples of the sensors and the information include the external or internal temperature of the device detected by a temperature sensor, the tilt of the electronic device detected by a tilt sensor, and the intensity of external light on the electronic device detected by an optical sensor.

When information indicating the environmental state of the digital camera 100, that is, detected values such as brightness and tilt, is acquired as the device state in this way, the candidate phrase determination unit 12 can determine candidate phrases without any user operation at all. In other words, instead of determining candidate phrases in response to a user operation such as a change of shooting mode, the device can determine candidate phrases automatically in accordance with the environment inside or outside the digital camera 100. This improves the accuracy of speech recognition while also improving user operability.
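As a rough illustration of deriving the device state from a sensor value rather than from a user operation, consider the following Python sketch. Everything concrete in it is an assumption: the 50 lux boundary, the state names “dark” and “bright”, and the phrases “flash on” and “flash off” are hypothetical and are not taken from this document.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    """Hypothetical reading from an optical sensor in the sensor unit."""
    light_lux: float


# Assumed boundary between a dark and a bright environment.
DARK_THRESHOLD_LUX = 50.0

# Assumed mapping from environmental state to candidate phrases.
PHRASES_BY_STATE = {
    "dark": {"flash on"},
    "bright": {"flash off"},
}


def device_state(reading: SensorReading) -> str:
    """Classify the environment from the sensor value alone."""
    return "dark" if reading.light_lux < DARK_THRESHOLD_LUX else "bright"


def candidate_phrases(reading: SensorReading) -> set:
    """Determine candidate phrases with no user operation involved."""
    return set(PHRASES_BY_STATE[device_state(reading)])
```

The point of the sketch is only that the candidate set changes when the sensor value changes, with no mode switch by the user.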

[Modification]
The speech recognition device according to the present invention is not limited to the digital camera described above. It can be mounted on any electronic device and perform speech recognition, as long as the device can acquire voice and the state of the electronic device. For example, the speech recognition device may be mounted on a television. In this case, only channels currently being broadcast are made recognition targets. The speech recognition device may also be mounted on an HDD (hard disk drive) recorder. In this case, only recorded programs present in the program data are made recognition targets.
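The television case can be sketched in the same way. The channel names and the on-air flags below are hypothetical example data, and the function names are assumptions for illustration.

```python
from typing import Optional


def channel_candidates(channels: dict) -> list:
    """Return the names of channels that are currently being broadcast.

    `channels` maps a channel name to a flag that is True while on air.
    """
    return [name for name, on_air in channels.items() if on_air]


def recognize_channel(utterance: str, channels: dict) -> Optional[str]:
    """Recognize a channel name only if that channel is currently on air."""
    return utterance if utterance in channel_candidates(channels) else None


# Hypothetical example data.
channels = {"Channel 1": True, "Channel 2": False, "Channel 3": True}
```

An HDD recorder could apply the same pattern with the list of recorded programs in place of the on-air channels.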

The speech recognition device does not necessarily have to be built into the electronic device. For example, it may acquire the state of the electronic device and the voice data by communicating with the electronic device. For example, the recognition control unit 10 and the voice detection unit 30 may be mounted on a smartphone or the like, and the sensor unit 20 and the device state specifying unit 21 may be mounted on a home appliance such as a television.

[Example of software implementation]
The control blocks of the recognition control unit 10 (in particular the candidate phrase determination unit 12 and the recognized phrase determination unit 15) may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU (Central Processing Unit). In the latter case, the recognition control unit 10 includes a CPU that executes the instructions of a program, which is software realizing each function; a ROM (Read Only Memory) or a storage device (these are referred to as a “recording medium”) on which the program and various data are recorded so as to be readable by a computer (or the CPU); a RAM (Random Access Memory) into which the program is loaded; and the like. The object of the present invention is achieved when the computer (or the CPU) reads the program from the recording medium and executes it. As the recording medium, a “non-transitory tangible medium” such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The program may also be supplied to the computer via any transmission medium capable of transmitting it (such as a communication network or a broadcast wave). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

[Summary]
The speech recognition device (recognition control unit 10) according to aspect 1 of the present invention is a speech recognition device that detects a user's utterance as voice and recognizes a phrase included in the voice, and includes: device state acquisition means (device state acquisition unit 11) for acquiring information indicating the state of an electronic device (digital camera 100) to be operated by voice; candidate phrase determination means (candidate phrase determination unit 12) for determining candidate phrases that are associated with the information indicating the state of the electronic device acquired by the device state acquisition means and that are targets of the speech recognition; voice data acquisition means (voice acquisition unit 13) for acquiring the user's utterance as voice data; specific phrase detection means (specific phrase detection unit 14) for detecting, as a specific phrase, at least one phrase that specifies the utterance content from the voice data acquired by the voice data acquisition means; and recognized phrase determination means (recognized phrase determination unit 15) for identifying that the specific phrase detected by the specific phrase detection means is one of the candidate phrases determined by the candidate phrase determination means, and determining the identified phrase as the recognized phrase.

According to the above configuration, the candidate phrases to be subjected to speech recognition can be determined according to the current state of the electronic device. The electronic device can thus narrow down the candidate phrases automatically, without requiring any operation from the user. Consequently, recognition accuracy in speech recognition can be improved without impairing user operability.

In the speech recognition device according to aspect 2 of the present invention, in aspect 1 above, the candidate phrase determination means may determine whether or not a phrase associated with the information indicating the state of the electronic device is to be made a candidate phrase from the number of characters in that phrase.

According to the above configuration, the candidate phrases are determined by the number of characters in the phrases associated with the information indicating the state of the electronic device. The recognized phrase determination means therefore determines the recognized phrase using the number of characters of the specific phrase and the number of characters of the candidate phrases as parameters, which makes it possible to improve the accuracy with which the recognized phrase is determined. This prevents the user from being asked to repeat an operation because the electronic device malfunctioned as a result of erroneous speech recognition. Consequently, the loss of user operability is reduced and the accuracy of speech recognition is improved.

In the speech recognition device according to aspect 3 of the present invention, in aspect 2 above, the candidate phrase determination means may further determine, as a candidate phrase, a phrase associated with the information indicating the state of the electronic device whose number of characters is equal to or greater than the number of characters at which erroneous detection of the recognized phrase in speech recognition tends to occur.

According to the above configuration, the candidate phrase determination means can exclude from the candidate phrases those phrases that are prone to misrecognition because their number of characters is less than a predetermined value. In other words, phrases that offer few portions by which they can be distinguished from other phrases, and that are therefore difficult to identify accurately in speech recognition, can be excluded from the candidate phrases. User operability and speech recognition accuracy can thus be further improved.

In the speech recognition device according to aspect 4 of the present invention, in any one of aspects 1 to 3 above, the device state acquisition means may acquire, as the state of the electronic device, device information (environment information) indicating the external or internal environment of the electronic device from a sensor unit (sensor unit 20) that measures the device information.

According to the above configuration, candidate phrases can be determined based on the device information acquired by the sensor unit. Candidate phrases can therefore be determined automatically in accordance with the environment inside or outside the electronic device. This improves the accuracy of speech recognition while also improving user operability.

The electronic device according to aspect 5 of the present invention is an electronic device including the speech recognition device according to any one of aspects 1 to 4 above, and includes voice detection means (voice detection unit 30) for detecting the user's utterance and device state specifying means (device state detection unit 20) for specifying the state of the device itself.

According to the above configuration, the electronic device can determine candidate phrases from the state of the electronic device specified by the device state specifying means, and can perform speech recognition using those candidate phrases and the user's voice detected by the voice detection means. The electronic device can thus perform speech recognition with phrases appropriate to its current state as the candidate phrases.

The speech recognition device according to each aspect of the present invention may be realized by a computer. In this case, a control program for the speech recognition device that realizes the speech recognition device on the computer by causing the computer to operate as each of the means included in the speech recognition device, and a computer-readable recording medium on which that program is recorded, also fall within the scope of the present invention.

The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

The present invention is suitable for electronic devices that can be operated by speech recognition.

DESCRIPTION OF SYMBOLS: 10 recognition control unit (speech recognition device), 11 device state acquisition unit (device state acquisition means), 12 candidate phrase determination unit (candidate phrase determination means), 13 voice acquisition unit (voice acquisition means), 14 specific phrase detection unit (specific phrase detection means), 15 recognized phrase determination unit (recognized phrase determination means), 20 sensor unit (sensor unit), 21 device state specifying unit (device state specifying means), 30 voice detection unit (voice detection means), 100 digital camera (electronic device)

Claims

A speech recognition device that detects a user's utterance as voice and recognizes a phrase included in the voice, the speech recognition device comprising:
device state acquisition means for acquiring information indicating the state of an electronic device to be operated by voice;
candidate phrase determination means for determining candidate phrases that are associated with the information indicating the state of the electronic device acquired by the device state acquisition means and that are targets of the speech recognition;
voice data acquisition means for acquiring the user's utterance as voice data;
specific phrase detection means for detecting, as a specific phrase, at least one phrase that specifies utterance content from the voice data acquired by the voice data acquisition means; and
recognized phrase determination means for identifying that the specific phrase detected by the specific phrase detection means is one of the candidate phrases determined by the candidate phrase determination means, and determining the identified phrase as a recognized phrase.

2. The speech recognition device according to claim 1, wherein the candidate phrase determination means determines whether or not to set a phrase as a candidate phrase from the number of characters of the phrase associated with the information indicating the state of the electronic device.

The speech recognition device according to claim 2, wherein the device state acquisition means acquires, as the information indicating the state of the electronic device, device information indicating the external or internal environment of the electronic device from a sensor unit that measures the device information.

An electronic device comprising:
the speech recognition device according to any one of claims 1 to 3;
voice detection means for detecting the user's utterance; and
device state specifying means for specifying the state of the device itself.

A control program for operating the speech recognition apparatus according to any one of claims 1 to 3, wherein the control program causes a computer to function as each of the means.