JP2010175921A

JP2010175921A - Voice recognition device

Info

Publication number: JP2010175921A
Application number: JP2009019461A
Authority: JP
Inventors: Yusuke Yamanaka; 雄介山中
Original assignee: Tokai Rika Co Ltd
Current assignee: Tokai Rika Co Ltd
Priority date: 2009-01-30
Filing date: 2009-01-30
Publication date: 2010-08-12

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device capable of reducing the operating load and improving the accuracy of voice recognition.

SOLUTION: The voice recognition device comprises: a voice input section that receives the voice produced by uttering a phrase and outputs voice information; a voice recognition section that performs voice recognition processing on the basis of the voice information and outputs voice recognition information; an operation input section that receives an operation performed on the basis of the syllables of the phrase and outputs operation information; a storage section that stores a dictionary containing a plurality of phrases; and a selection section that obtains candidate phrases from the dictionary on the basis of the operation information and selects the uttered phrase from the candidate phrases on the basis of the voice recognition information.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a voice recognition device.

A known conventional voice recognition device comprises: voice acquisition means for acquiring uttered speech; second input acquisition means for acquiring content input by a second method other than voice; voice identification means for identifying the speech acquired by the voice acquisition means; voice recognition dictionary storage means for storing speech and commands in association with each other for each hierarchy level; hierarchy identification means for identifying, from the content acquired by the second input acquisition means, a hierarchy level stored in the voice recognition dictionary storage means; and command content retrieval means for retrieving the command for the speech identified by the voice identification means from the hierarchy level identified by the hierarchy identification means (see, for example, Patent Document 1).

With this voice recognition device, the driver can specify the operation hierarchy level by touchpad operation, using only the feel and movement of the fingers, without looking away toward the display unit or the operation unit even while driving. This limits the range of recognition vocabulary to be matched in the dictionary data, which improves driving safety, simplifies complicated operations, improves operability, and improves the recognition rate.

Patent Document 1: JP 2007-240688 A

However, the conventional voice recognition device limits the recognition vocabulary only to the vocabulary in the designated operation hierarchy level. It therefore does not improve the recognition rate for vocabulary in other hierarchy levels, which is inconvenient.

An object of the present invention is to provide a voice recognition device that can reduce the operating load and further improve the accuracy of voice recognition processing.

One aspect of the present invention provides a voice recognition device comprising: a voice input unit that receives the voice produced by uttering a phrase and outputs voice information; a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information; an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information; a storage unit that stores a dictionary containing a plurality of phrases; and a selection unit that obtains candidate phrases from the dictionary based on the operation information and selects the uttered phrase from the candidate phrases based on the voice recognition information.

Another aspect of the present invention provides a voice recognition device comprising: a voice input unit that receives the voice produced by uttering a phrase and outputs voice information; a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information; an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information; a storage unit that stores a dictionary containing a plurality of phrases; and a selection unit that obtains candidate phrases from the dictionary based on the voice recognition information and selects the uttered phrase from the candidate phrases based on the operation information.

Another aspect of the present invention provides a voice recognition device comprising: a voice input unit that receives the voice produced by uttering a phrase and outputs voice information; a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information; an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information; a storage unit that stores a dictionary containing a plurality of phrases; and a selection unit that obtains first candidate phrases from the dictionary based on the voice recognition information, obtains second candidate phrases from the dictionary based on the operation information, and selects the uttered phrase based on the first candidate phrases and the second candidate phrases.

According to the present invention, the operating load can be reduced and the accuracy of voice recognition processing can be further improved.

FIG. 1 is a schematic view of the interior of a vehicle according to the first embodiment of the present invention.
FIG. 2 is a block diagram of the voice recognition device according to the first embodiment of the present invention.
FIG. 3(a) is a speech waveform of the phrase "depāto" (department store) according to the first embodiment of the present invention, FIG. 3(b) is a schematic diagram showing touch sections and non-touch sections, and FIG. 3(c) is a schematic diagram showing the signal level waveform output from the touch input unit.
FIG. 4(a) is a speech waveform of the phrase "hoteru" (hotel) according to the first embodiment of the present invention, FIG. 4(b) is a schematic diagram showing touch sections and non-touch sections, and FIG. 4(c) is a schematic diagram showing the signal level waveform output from the touch input unit.
FIGS. 5(a) to 5(c) show examples of touch operations for phrases according to the first embodiment of the present invention.
FIG. 6 is a flowchart according to the first embodiment of the present invention.
FIG. 7 is a flowchart according to the second embodiment of the present invention.
FIG. 8 is a flowchart according to the third embodiment of the present invention.

[First embodiment]
(Vehicle configuration)
FIG. 1 is a schematic view of the interior of a vehicle according to the first embodiment of the present invention. The following describes a case in which the voice recognition device of the present invention is mounted on a vehicle and a car navigation device, serving as the controlled device, is operated by voice. This embodiment mainly describes voice operation by the driver as the operator.

As shown in FIG. 1, the vehicle 1 broadly comprises a steering wheel 10 rotatably provided at the end of a steering shaft (not shown) of the vehicle 1, a voice recognition device 2, and a car navigation device 3.

As shown in FIG. 1, the car navigation device 3 includes a display unit 30. The display unit 30 displays, for example, a map of the area around the current position of the vehicle 1 and an operation menu.

(Configuration of the voice recognition device)
FIG. 2 is a block diagram of the voice recognition device according to the first embodiment of the present invention. The voice recognition device 2 narrows down the candidate phrases when the operator performs touch operations on the touch input unit 21 in time with the syllables of a phrase, thereby improving the accuracy of the voice recognition processing. A phrase here is, for example, a character string that concisely indicates an operation available on the controlled device, and may be a single word or a sentence.

As shown in FIGS. 1 and 2, the voice recognition device 2 broadly comprises: a voice input unit 20 provided above the meter panel facing the operator so as to pick up the operator's voice; a touch input unit 21, serving as the operation input unit, provided on the left side of the steering wheel 10 so that the operator can operate it while gripping the steering wheel 10; a voice recognition unit 22, described later; a storage unit 23 that stores a template collection 230 holding templates, described later, and a dictionary 231 holding a large number of phrases; and a selection unit 24, described later.

FIG. 3(a) is a speech waveform of "depāto" (department store) according to the first embodiment of the present invention, FIG. 3(b) is a schematic diagram showing touch sections and non-touch sections, and FIG. 3(c) is a schematic diagram showing the signal level waveform output from the touch input unit. FIG. 4(a) is a speech waveform of "hoteru" (hotel) according to the first embodiment of the present invention, FIG. 4(b) is a schematic diagram showing touch sections and non-touch sections, and FIG. 4(c) is a schematic diagram showing the signal level waveform output from the touch input unit. In FIGS. 3(a) and 4(a), the horizontal axis is time and the vertical axis is amplitude; in FIGS. 3(c) and 4(c), the horizontal axis is time and the vertical axis is signal level. A touch section is the time during which a touch operation on the touch input unit continues, and a non-touch section is the time between one touch section and the next, in other words, the time during which no touch operation is performed. In FIGS. 3(b) and 4(b), "de" and "ho" denote syllables, and "pau" denotes a silent section.

The voice input unit 20 includes, for example, a microphone that picks up the uttered voice, and outputs the speech waveforms shown in FIGS. 3(a) and 4(a). This speech waveform is output as the voice information.

The touch input unit 21 includes, for example, a touch sensor that detects touch operations by a finger and, as shown in FIGS. 3(c) and 4(c), outputs the change in capacitance caused by a touch operation as a signal level waveform of ON and OFF levels. This signal level waveform is output as the operation information. The touch input unit 21 may instead be a sensor that detects touch operations by another method, such as a resistive pressure-sensitive method or an infrared method, or may be any device whose ON and OFF states can be distinguished, such as a mechanical switch.

The voice recognition unit 22 performs voice recognition processing on the voice information output from the voice input unit 20 and outputs the result, the voice recognition information, to the selection unit 24. The voice recognition information is information on the phrase obtained by the voice recognition processing.

The template collection 230 stored in the storage unit 23 holds, as templates, the speech waveforms of consonants, vowels, voiceless plosives (p, t, k), voiceless fricatives (s), nasals (n, m, and so on), and the like. Each template is, for example, a set of feature values quantified using MFCCs (mel-frequency cepstral coefficients).

(Operation of the voice recognition device)
FIGS. 5(a) to 5(c) show examples of touch operations for phrases according to the first embodiment of the present invention. The operation of the voice recognition device in the first embodiment is described in detail below, following the flowchart shown in FIG. 6 and referring to the other figures. The description covers a case in which the operator searches by voice for a department store near the current location.

First, to search for a department store near the current location, the operator utters the phrase "depāto" (department store) toward the voice input unit 20 and, as shown in FIGS. 3(b) and 3(c), performs touch operations on the touch input unit 21 in time with the utterance, that is, with the syllables of the phrase (S1).

Next, the selection unit 24 analyzes the operation information obtained via the touch input unit 21, determines the number of characters in the phrase from its number of syllables, and, based on this number of characters, obtains from the dictionary 231 candidate phrase information representing the candidates for the phrase (S2). The candidate phrase information is information on the phrases whose number of constituent characters matches.

Specifically, when an OFF section, that is, the first non-touch section shown in FIG. 3(b), lasts for a predetermined time (for example, 1 second) or longer, is followed by at least one touch section, and a second non-touch section then again lasts for the predetermined time or longer, the selection unit 24 takes the number of ON sections between the first and second non-touch sections as the number of syllables in the phrase. The selection unit 24 then identifies geminate consonants (sokuon), long vowels, and the like from the length of each touch operation and calculates the number of characters in the phrase by, for example, counting a syllable that contains a geminate consonant or long vowel as two characters. The selection unit 24 obtains at least one phrase matching this number of characters from the dictionary 231 as the candidate phrase information.
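The counting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: it assumes the ON/OFF signal level waveform has been sampled at a fixed rate into a list of 0/1 values and that the recording contains a single phrase. The names `touch_sections`, `count_syllables`, `sample_rate_hz`, and `min_gap_s` are hypothetical.

```python
from typing import List, Tuple


def touch_sections(signal: List[int], sample_rate_hz: float) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for every run of ON (1) samples."""
    sections: List[Tuple[float, float]] = []
    start = None
    for i, level in enumerate(signal):
        if level and start is None:
            start = i  # rising edge: a touch section begins
        elif not level and start is not None:
            sections.append((start / sample_rate_hz, i / sample_rate_hz))
            start = None  # falling edge: the touch section ends
    if start is not None:  # the signal ended while still ON
        sections.append((start / sample_rate_hz, len(signal) / sample_rate_hz))
    return sections


def count_syllables(signal: List[int], sample_rate_hz: float, min_gap_s: float = 1.0) -> int:
    """Count ON sections bounded by non-touch sections of at least min_gap_s.

    Returns 0 when the leading or trailing non-touch section is too short.
    Assumption: gaps between touches inside the phrase are shorter than min_gap_s.
    """
    sections = touch_sections(signal, sample_rate_hz)
    if not sections:
        return 0
    total_s = len(signal) / sample_rate_hz
    leading_gap = sections[0][0]
    trailing_gap = total_s - sections[-1][1]
    if leading_gap < min_gap_s or trailing_gap < min_gap_s:
        return 0
    return len(sections)


# Three touches (short, long, short) framed by 1.2 s of no touch, sampled at 10 Hz.
demo = [0] * 12 + [1] * 3 + [0] * 2 + [1] * 5 + [0] * 2 + [1] * 3 + [0] * 12
print(count_syllables(demo, 10.0))  # 3
```

The start and end times returned by `touch_sections` also give each touch duration, which is what the long/short judgment described next would operate on.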

As shown in FIGS. 5(a) to 5(c), several kinds of touch operation are possible depending on the long vowels, geminate consonants, and the like contained in the phrase. For "pūru" (pool) shown in FIG. 5(a), for example, the operator may perform three short touch operations, for "pu", the long-vowel mark "ー", and "ru", as indicated by the circles in FIG. 5(a), or may perform two touch operations, a long one for "pū" and a short one for "ru". The selection unit 24 may distinguish long from short touch operations with reference to, for example, a preset touch section length or the non-touch sections, and the criterion is not limited to these.

Likewise, for "tōkyō" (Tokyo, written とうきょう) shown in FIG. 5(b), the operator may perform four short touch operations, for "to", "u", "kyo", and "u", as indicated by the circles in FIG. 5(b), or two long touch operations, for "tō" and "kyō".

Further, for "Aichi-ken" (Aichi Prefecture, written あいちけん) shown in FIG. 5(c), the operator may perform five short touch operations, for "a", "i", "chi", "ke", and "n", or four touch operations combining short and long touches, for "a", "i", "chi", and "ken". The "n" in "ken" is called a nasal or a moraic nasal (hatsuon).

The selection unit 24 therefore calculates the number of characters in the phrase by, for example, counting a short touch operation as one syllable of one character and a long touch operation as one syllable of two characters that contains a long vowel, nasal, or geminate consonant. That is, when one long touch operation is followed by one short touch operation, the selection unit 24 extracts from the dictionary 231 the phrases whose first syllable contains a long vowel, nasal, or geminate consonant and whose number of characters is three. The selection unit 24 also generates, from the touch sections and non-touch sections, division information for dividing the speech waveform, and sends it to the voice recognition unit 22.
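A minimal sketch of this character-count rule and the dictionary lookup is given below. Several points are assumptions rather than disclosures of the patent: the fixed threshold `long_threshold_s` for telling long from short touches, the function names, and the interpretation of "number of characters" as a count in which small kana such as "ょ" do not count separately (inferred from the "tōkyō" example, where both touch patterns must yield four). The sketch filters by count only and does not check which syllable carries the long vowel, nasal, or geminate consonant.

```python
from typing import List

# Small kana that combine with the preceding kana (assumption: not counted separately).
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")


def character_count(touch_durations_s: List[float], long_threshold_s: float = 0.3) -> int:
    """Short touch = 1 character, long touch = 2 characters."""
    return sum(2 if d >= long_threshold_s else 1 for d in touch_durations_s)


def mora_count(kana: str) -> int:
    """Count kana, skipping small combining kana; 'ー', 'っ', and 'ん' count as one each."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def candidates_by_length(dictionary: List[str], n_chars: int) -> List[str]:
    """Keep the dictionary phrases whose character count equals n_chars."""
    return [phrase for phrase in dictionary if mora_count(phrase) == n_chars]


dictionary = ["デパート", "ホテル", "プール", "とうきょう", "あいちけん"]
n = character_count([0.15, 0.40, 0.15])  # "de" short, "pā" long, "to" short
print(n)                                    # 4
print(candidates_by_length(dictionary, n))  # ['デパート', 'とうきょう']
```

With this toy dictionary the touch pattern alone leaves two candidates, so the voice recognition result is still needed to pick the uttered phrase.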

It is conceivable, for example, that the touch section for "pa", which carries the long vowel in "depāto", is about as long as the touch section for some syllable of "hoteru" (hotel). Because the selection unit 24 judges the length of each touch section by the methods described above, it can still correctly determine which touch sections within a given phrase are long and which are short.

Next, the selection unit 24 obtains the voice recognition information via the voice recognition unit 22 and, based on it, selects the corresponding phrase from the candidate phrases contained in the candidate phrase information (S3).

Specifically, the voice recognition unit 22 first divides the speech waveform based on the division information obtained from the selection unit 24, as shown in FIG. 3(a). The voice recognition unit 22 then performs pattern matching between the speech waveforms obtained by further dividing each section into vowel, consonant, and other sub-sections, and the speech waveform templates contained in the template collection 230. The selection unit 24 performs voice recognition processing on each divided section using, for example, an HMM (hidden Markov model) and, based on the result, selects the phrase corresponding to the voice information from the candidate phrases.
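The division step can be illustrated with the short sketch below. It assumes, hypothetically, that the division information is a list of (start, end) times in seconds, one pair per touch section, and that the speech waveform is a list of samples at a known rate; the patent does not specify either representation. The function name `split_waveform` is likewise an assumption, and the template matching and HMM stages are not shown.

```python
from typing import List, Sequence, Tuple


def split_waveform(
    samples: Sequence[float],
    sample_rate_hz: float,
    sections: List[Tuple[float, float]],
) -> List[Sequence[float]]:
    """Slice the waveform into one segment per (start_s, end_s) section."""
    segments = []
    for start_s, end_s in sections:
        # Convert times to sample indices and clamp them to the waveform bounds.
        lo = max(0, int(round(start_s * sample_rate_hz)))
        hi = min(len(samples), int(round(end_s * sample_rate_hz)))
        segments.append(samples[lo:hi])
    return segments


waveform = list(range(100))  # stand-in for 10 s of audio at 10 samples per second
parts = split_waveform(waveform, 10.0, [(1.0, 2.0), (3.0, 4.5)])
print([len(p) for p in parts])  # [10, 15]
```

Each returned segment would then be passed on for matching against the templates, so the recognizer does not have to locate the syllable boundaries itself.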

Next, the selection unit 24 sends phrase information for the selected phrase to the car navigation device 3, the controlled device (S4). The phrase information includes information for controlling the car navigation device 3. On receiving it, the car navigation device 3 searches for department stores near the current location based on the phrase information and displays the results on the display unit 30.

(Effects of the first embodiment)
The voice recognition device according to the first embodiment of the present invention provides the following effects.
(1) Because touch operations on the touch input unit allow the speech waveform to be divided syllable by syllable, voice recognition processing can be performed faster and more accurately than with a method that divides the speech waveform using templates.
(2) Because the touch operations determine the number of characters in the phrase to be recognized, the candidate phrases can be narrowed down by the number of characters, and accurate voice recognition processing can be performed.
(3) Because the controlled device can be operated by voice while driving, the operator's line-of-sight movement is reduced and the load on the operator is lessened.

[Second embodiment]
The second embodiment of the present invention is described below. It differs from the first embodiment in that a phrase matching the number of characters calculated from the touch operations is selected from candidate phrases chosen by the voice recognition processing. In each of the following embodiments, parts having the same functions and configurations as in the first embodiment are given the same reference numerals, and their description is omitted.

(Configuration of the voice recognition device)
The voice recognition device 2 in this embodiment has the same configuration as the voice recognition device in the first embodiment.

(Operation of the voice recognition device)
The operation of the voice recognition device in this embodiment is described below, following the flowchart in FIG. 7.

First, to search for a department store near the current location, the operator utters the phrase "depāto" (department store) toward the voice input unit 20 and, as shown in FIGS. 3(b) and 3(c), performs touch operations on the touch input unit 21 in time with the utterance, that is, with the syllables of the phrase (S20).

Next, the voice recognition unit 22 obtains the voice information via the voice input unit 20, performs voice recognition processing, and generates voice recognition information. Based on that voice recognition information, the voice recognition unit 22 obtains from the dictionary 231 candidate phrase information representing the candidates for the phrase (S21).

Specifically, as shown in FIGS. 3(a) and 3(b), the voice recognition unit 22 identifies the speech waveform representing the phrase from the first and second non-touch sections. It then performs pattern matching between the speech waveform, divided into vowel, consonant, and other sections, and the speech waveform templates. The voice recognition unit 22 performs voice recognition processing using, for example, an HMM and, based on the result, obtains the candidate phrase information corresponding to the voice information from the dictionary 231.

Next, the selection unit 24 analyzes the operation information obtained via the touch input unit 21, determines the number of characters making up the phrase, and, based on this number of characters, selects the phrase from the candidate phrase information corresponding to the voice information (S22).

Next, the selection unit 24 sends phrase information for the selected phrase to the car navigation device 3, the controlled device (S23). On receiving it, the car navigation device 3 searches for department stores near the current location based on the phrase information and displays the results on the display unit 30.

(Effects of the second embodiment)
According to the voice recognition device in the second embodiment of the present invention, the number of characters in the uttered phrase is determined accurately from the touch operations, and the phrase matching that number of characters is selected from the candidate phrases chosen by the voice recognition processing, so accurate voice recognition processing can be performed.

[Third embodiment]
The third embodiment of the present invention is described below. It differs from the preceding embodiments in that first candidate phrases, based on the voice information, and second candidate phrases, based on the operation information, are obtained from the dictionary independently of each other, and a phrase common to the first and second candidate phrases is taken as the uttered phrase.

(Operation of the voice recognition device)
The operation of the voice recognition device in this embodiment is described below, following the flowchart in FIG. 8.

First, to search for a department store near the current location, the operator utters the phrase "depāto" (department store) toward the voice input unit 20 and, as shown in FIGS. 3(b) and 3(c), performs touch operations on the touch input unit 21 in time with the utterance, that is, with the syllables of the phrase (S30).

Next, the voice recognition unit 22 obtains the voice information via the voice input unit 20, performs voice recognition processing, and generates voice recognition information. Based on that voice recognition information, the voice recognition unit 22 obtains from the dictionary 231 first candidate phrase information representing candidates for the phrase. The selection unit 24 analyzes the operation information obtained via the touch input unit 21, determines the number of characters making up the phrase, and, based on this number of characters, obtains from the dictionary 231 second candidate phrase information representing candidates for the phrase (S31).

Next, the selection unit 24 compares the first and second candidate phrases and selects the phrase common to both (S32).
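Step S32 amounts to intersecting two candidate lists. A minimal sketch follows; the function name `select_common` and the sample candidates are hypothetical, and preserving the order of the first list is a design choice made here, not something the patent states.

```python
from typing import List


def select_common(first_candidates: List[str], second_candidates: List[str]) -> List[str]:
    """Return the phrases present in both lists, in the order of first_candidates."""
    allowed = set(second_candidates)
    return [phrase for phrase in first_candidates if phrase in allowed]


first = ["デパート", "デザート", "アパート"]     # hypothetical voice recognition candidates
second = ["デパート", "とうきょう", "アパート"]  # hypothetical character-count candidates
print(select_common(first, second))  # ['デパート', 'アパート']
```

Keeping the order of the first list means that, if the recognizer ranks its candidates, the ranking survives the intersection and can be used when more than one phrase remains.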

Next, the selection unit 24 sends phrase information for the selected phrase to the car navigation device 3, the controlled device (S33). On receiving it, the car navigation device 3 searches for department stores near the current location based on the phrase information and displays the results on the display unit 30.

(Effects of the third embodiment)
According to the voice recognition device in the third embodiment of the present invention, comparing the first candidate phrases obtained based on the voice recognition information with the second candidate phrases obtained based on the operation information allows more accurate voice recognition processing.

In each of the embodiments described above, when there are several selectable phrases or no selectable phrase, the voice recognition device may, for example, give priority to the result obtained by the voice recognition processing, or may be configured to issue a notification via the controlled device or the like prompting the operator to repeat the input.
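One way to express these two fallback options is sketched below. It is only an illustration: the function `resolve`, the flag `prefer_recognizer`, and the convention that `None` means "prompt the operator to repeat the input" are all assumptions, and the recognizer output is assumed to be a list ranked best first.

```python
from typing import List, Optional


def resolve(
    selectable: List[str],
    recognizer_ranked: List[str],
    prefer_recognizer: bool = True,
) -> Optional[str]:
    """Pick the final phrase, or return None to signal that re-entry should be prompted."""
    if len(selectable) == 1:
        return selectable[0]  # unambiguous result
    if prefer_recognizer and recognizer_ranked:
        # Several selectable phrases: take the recognizer's best one among them.
        for phrase in recognizer_ranked:
            if phrase in selectable:
                return phrase
        # No selectable phrase: fall back to the recognizer's top result.
        return recognizer_ranked[0]
    return None  # the caller should notify the operator to input again


print(resolve(["デパート", "アパート"], ["アパート", "デパート"]))  # アパート
print(resolve([], ["ホテル"], prefer_recognizer=False))            # None
```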

The present invention is not limited to the embodiments described above, and various modifications and combinations are possible without departing from or altering the technical idea of the present invention.

DESCRIPTION OF SYMBOLS: 1 ... vehicle, 2 ... voice recognition device, 3 ... car navigation device, 10 ... steering wheel, 20 ... voice input unit, 21 ... touch input unit, 22 ... voice recognition unit, 23 ... storage unit, 24 ... selection unit, 30 ... display unit, 230 ... template collection, 231 ... dictionary

Claims

A voice recognition device comprising:
a voice input unit that receives the voice produced by uttering a phrase and outputs voice information;
a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information;
an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information;
a storage unit that stores a dictionary containing a plurality of phrases; and
a selection unit that obtains candidate phrases from the dictionary based on the operation information and selects the uttered phrase from the candidate phrases based on the voice recognition information.

A voice recognition device comprising:
a voice input unit that receives the voice produced by uttering a phrase and outputs voice information;
a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information;
an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information;
a storage unit that stores a dictionary containing a plurality of phrases; and
a selection unit that obtains candidate phrases from the dictionary based on the voice recognition information and selects the uttered phrase from the candidate phrases based on the operation information.

A voice recognition device comprising:
a voice input unit that receives the voice produced by uttering a phrase and outputs voice information;
a voice recognition unit that performs voice recognition processing based on the voice information and outputs voice recognition information;
an operation input unit that receives an operation performed based on the syllables of the phrase and outputs operation information;
a storage unit that stores a dictionary containing a plurality of phrases; and
a selection unit that obtains first candidate phrases from the dictionary based on the voice recognition information, obtains second candidate phrases from the dictionary based on the operation information, and selects the uttered phrase based on the first candidate phrases and the second candidate phrases.