JP2018013549A - Speech content recognition device

Info

Publication number: JP2018013549A
Application number: JP2016141645A
Authority: JP
Inventors: 俊兵花田; Toshihei Hanada
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2016-07-19
Filing date: 2016-07-19
Publication date: 2018-01-25
Anticipated expiration: 2036-07-19
Also published as: JP6708035B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech content recognition device that generates intermediate language data on the basis of images captured by a camera. SOLUTION: When a voice recognition device 1 detects an utterance of a user while operating in a standby mode, it generates, on the basis of the voice data and image data of that utterance, connected lip movement data indicating the lip movement pattern when two syllables are pronounced in succession and tone pattern data indicating the tone when the two syllables are pronounced in succession, and stores the generated connected lip movement data and tone pattern data in a tone DB 1c. When executing lip reading processing triggered by the user pressing a talk SW 3, the device determines connected lip movement patterns from the image data captured during the user's utterance, and decides the tone of each syllable from the determined connected lip movement patterns and the data stored in the tone DB 1c. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for recognizing what a user is saying from the movement pattern of the user's lips captured by a camera.

Conventionally, as disclosed in Patent Document 1, there is a technique (so-called lip reading) that applies image recognition processing to a face image of a speaker to identify what the speaker is saying, and generates text data or voice data corresponding to the utterance content. Such lip reading techniques are used in various applications such as character input.

Patent Document 1: JP 2015-220684 A

A user's uttered voice contains not only text information indicating what is being said but also tone information indicating how it is being said. Conventional lip reading techniques, however, have not considered how to generate text data that includes the tone information contained in the user's uttered voice (hereinafter, intermediate language data). The tone information here is information indicating various parameters such as intonation, the pitch of the user's voice, speaking speed, and volume.

The present invention has been made in view of this situation, and its object is to provide an utterance content recognition device that generates intermediate language data based on images captured by a camera.

To achieve this object, the present invention is an utterance content recognition device that executes processing for identifying a user's utterance content with a predetermined user operation as a trigger, the device comprising: an operation reception unit (F2) that receives the user operation; a voice acquisition unit (F3) that acquires the user's uttered voice as utterance voice data via a microphone; an image acquisition unit (F6) that sequentially acquires user images, which are images captured by a camera arranged to photograph the user's face; a lip reading processing unit (F7) that detects, from the user images acquired by the image acquisition unit, a lip movement pattern, which is a pattern of change in the user's lip shape, and generates, based on the detected lip movement pattern, an utterance character string, which is a character string corresponding to the user's utterance content; a tone pattern specifying unit (F81) that specifies, based on the utterance voice data acquired by the voice acquisition unit, a tone pattern for two syllables uttered in succession; a connected lip movement pattern specifying unit (F82) that specifies, from the user images acquired by the image acquisition unit, a connected lip movement pattern, which is the pattern of lip change when the user utters two syllables in succession; a learning processing unit (F8) that executes pattern learning processing, which stores in a tone database, for two syllables uttered in succession by the user, the connected lip movement pattern specified by the connected lip movement pattern specifying unit in association with the tone pattern specified by the tone pattern specifying unit; and an intermediate language data generation unit (F9) that generates intermediate language data, in which tone information is added to each syllable character constituting the utterance character string, using the data stored in the tone database and the user images used to generate the utterance character string. The learning processing unit sequentially executes the pattern learning processing in cooperation with the tone pattern specifying unit and the connected lip movement pattern specifying unit when the operation reception unit has not received the user operation. The intermediate language data generation unit generates the intermediate language data when the lip reading processing unit has generated an utterance character string in response to the operation reception unit receiving the user operation. When determining the tone of a target character, which is a given syllable character in the utterance character string, the intermediate language data generation unit identifies, from among the connected lip movement patterns stored in the tone database, a connected lip movement pattern having a high degree of similarity to the pattern of change in the user's lip shape when the user uttered the syllable character immediately preceding the target character followed by the target character, and determines the tone of the target character using the tone pattern associated with the identified connected lip movement pattern.
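The tone lookup described above can be illustrated with a minimal sketch. It assumes that each connected lip movement pattern has already been reduced to a fixed-length feature vector and that similarity is measured by Euclidean distance; neither choice is specified in this document, and the names `decide_tone` and `tone_db` are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def decide_tone(observed_pattern, tone_db):
    """Return the tone pattern paired with the most similar stored lip pattern.

    tone_db is a list of (connected_lip_pattern, tone_pattern) records.
    A smaller distance is treated as a higher similarity (an assumption).
    """
    if not tone_db:
        return None  # nothing has been learned yet
    best = min(tone_db, key=lambda rec: distance(observed_pattern, rec[0]))
    return best[1]

# Hypothetical learned records: (lip feature vector, tone label)
tone_db = [
    ([0.1, 0.8, 0.3], "rising"),
    ([0.9, 0.2, 0.5], "falling"),
]
tone = decide_tone([0.2, 0.7, 0.3], tone_db)
```

In this sketch the tone label is a plain string; in practice a tone pattern would carry parameters such as pitch, speed, and volume.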

In the above configuration, when the operation reception unit has not received the user operation corresponding to an instruction to execute the processing for identifying the utterance content (in other words, in the input standby state), the learning processing unit sequentially executes the pattern learning processing in cooperation with the tone pattern specifying unit and the connected lip movement pattern specifying unit. That is, based on the user's everyday conversation, the device learns the tone pattern and the connected lip movement pattern of the user's speech for each pair of syllables. The data accumulated in the tone database is therefore based on the user's actual utterances.

When the lip reading processing unit generates a character string corresponding to the movement of the user's lips (that is, an utterance character string) in response to the operation reception unit receiving the above user operation, the intermediate language data generation unit generates intermediate language data, in which tone information is added to the utterance character string, from the data stored in the tone database and the user images used to generate the utterance character string.

According to this aspect, intermediate language data can be generated based on images captured by the camera. In addition, the tone information in the intermediate language data generated in this way is derived from the user's actual utterance history. The tone information is therefore expected to reproduce the user's speaking habits and the like.

The reference signs in parentheses in the claims indicate correspondence with the specific means described in the embodiment below as one aspect, and do not limit the technical scope of the present invention.

FIG. 1 is a block diagram showing a schematic configuration of a voice input system 100.
FIG. 2 is a block diagram conceptually showing the functions of a learning processing unit F8.
FIG. 3 is a flowchart for explaining standby mode processing.
FIG. 4 is a flowchart for explaining pattern learning processing.
FIG. 5 is a diagram for explaining the pattern learning processing.
FIG. 6 is a diagram conceptually showing syllable set data.
FIG. 7 is a flowchart for explaining utterance content recognition processing.
FIG. 8 is a flowchart for explaining intermediate language data generation processing.

Hereinafter, a voice input system 100 to which the present invention is applied will be described with reference to the drawings. The voice input system 100 is mounted on a vehicle and, as shown in FIG. 1, includes a voice recognition device 1, a seat sensor 2, a talk switch (hereinafter, talk SW) 3, a microphone 4, and a camera 5. Each of the seat sensor 2, the talk SW 3, the microphone 4, and the camera 5 is configured to communicate with the voice recognition device 1 via a local area network (hereinafter, LAN) built in the vehicle.

The voice recognition device 1 is configured as an ordinary computer including a CPU, RAM, ROM, I/O, and a bus line connecting these components. The ROM stores, among other things, a program for causing an ordinary computer to function as the voice recognition device 1 (hereinafter, the utterance content identification program).

The utterance content identification program need not be stored in the ROM; it may be stored in any non-transitory tangible storage medium. Execution of the utterance content identification program by the CPU corresponds to execution of the method corresponding to that program.

In outline, the voice recognition device 1 recognizes what the user is saying based on data input from the microphone 4 and the camera 5, and provides the recognition result to predetermined application software (hereinafter, application). Details of the voice recognition device 1 will be described later. The user here is a person using the vehicle on which the voice input system 100 is mounted, and in particular the person seated in the driver's seat. The voice recognition device 1 corresponds to the utterance content recognition device recited in the claims.

The seat sensor 2 is a sensor that outputs a signal indicating whether an occupant (that is, the user) is seated in the driver's seat. The seat sensor 2 can be realized using, for example, a pressure sensor. That is, the seat sensor 2 is a pressure sensor provided in the seating portion of the driver's seat, and outputs a signal indicating the pressure acting on the seating portion to the voice recognition device 1.

The talk SW 3 is a switch with which the user instructs the start of voice input. Here, as an example, the talk SW 3 is a so-called click-type switch: when turned on by a user operation (that is, when clicked), it outputs an ON signal to the voice recognition device 1. The talk SW 3 is provided at a position that the user can easily operate, such as on the side of the steering column cover or near the shift lever.

The talk SW 3 may instead be a button image displayed on a display. In that case, the voice recognition device 1 may detect, via a touch panel or a well-known pointing device, that the button image has been selected by the user.

The microphone 4 is, for example, a small omnidirectional microphone. It collects ambient sound, including the user's speech and noise, converts it into an electrical audio signal, and outputs the signal to the voice recognition device 1. The microphone 4 is provided at a position where it can easily pick up the user's voice, such as on the upper surface of the steering column cover or on the sun visor on the driver's seat side.

The camera 5 is an optical camera; for example, a CMOS camera or a CCD camera can be used. The camera 5 is placed at an appropriately designed position, such as on the steering column cover or on the part of the instrument panel facing the driver's seat, so as to photograph the face of the occupant seated in the driver's seat. As another aspect, the camera 5 may be an infrared camera or a near-infrared camera. The camera 5 sequentially outputs image data captured at a predetermined frame rate (for example, 30 fps) to the voice recognition device 1. The camera 5 may instead output captured images as a video signal. The images captured by the camera 5 correspond to the user images recited in the claims.

<About the voice recognition device 1>
The voice recognition device 1 provides functions corresponding to the various functional blocks shown in FIG. 1 when the CPU executes the utterance content identification program described above. Specifically, the voice recognition device 1 includes, as functional blocks, a seating determination unit F1, an operation reception unit F2, a voice acquisition unit F3, a voice recognition unit F4, a noise level determination unit F5, an image acquisition unit F6, a lip reading processing unit F7, a learning processing unit F8, an intermediate language conversion unit F9, and a recognition medium setting unit F10.

Some or all of the functional blocks of the voice recognition device 1 may be realized as hardware using one or more ICs or the like. They may also be realized by a combination of software executed by the CPU and hardware components.

The voice recognition device 1 also includes a voice recognition DB 1a, a lip reading DB 1b, and a tone DB 1c as databases (hereinafter, DBs) realized using a nonvolatile storage medium. These DBs may be realized using a nonvolatile, rewritable storage medium such as a hard disk or flash memory.

The voice recognition DB 1a is a DB that stores the data required for voice recognition processing. This data includes, for example, an acoustic model describing the acoustic features of small units of human speech (so-called phonemes), a recognition dictionary that associates the acoustic features of phonemes with words, and a language model expressing the connection relationships between words.

The lip reading DB 1b is a DB that stores the data required for the lip reading processing described later. This data includes, for example, a lip movement model indicating the pattern of change in lip shape (hereinafter, lip movement pattern) for each syllable character, and a word dictionary in which words that the user has used are written in syllable characters. In the word dictionary, each word is preferably associated with its frequency of use and with information about other words used before and after it. Syllable characters here refer to kana, that is, hiragana and katakana.

The lip movement model registers image data indicating the lip movement patterns for at least six syllable characters: the five Japanese vowels あ (a), い (i), う (u), え (e), and お (o), plus the moraic nasal ん (n). The lip movement model naturally also includes image data indicating the lip movement patterns for syllables formed by combining a consonant and a vowel. Such syllables include plain sounds such as か (ka), き (ki), く (ku), け (ke), and こ (ko); contracted sounds (yōon) such as きゃ (kya), きゅ (kyu), and きょ (kyo); sounds used in loanwords such as しぇ (she) and ちぇ (che); and voiced sounds. Whether to adopt the concept of the mora and treat geminate consonants (sokuon) and long vowels as single syllables is a design choice. The data indicating the lip movement pattern for each syllable character may be generated in the course of the pattern learning processing described later.

The tone DB 1c is a DB that stores syllable set data, which includes the tone pattern data described later. The tone pattern data and the syllable set data will be described later.

The seating determination unit F1 determines whether the user is seated in the driver's seat based on the signal input from the seat sensor 2. Specifically, when the seat sensor 2 detects a pressure equal to or higher than a predetermined threshold (hereinafter, seating determination threshold), the unit determines that the user is seated in the driver's seat. When the pressure detected by the seat sensor 2 is below the seating determination threshold, the unit determines that the user is not seated in the driver's seat.

In this embodiment, as an example, the seat sensor 2 is used to determine whether the user is seated in the driver's seat, but the determination is not limited to this. As another aspect, whether the user is seated in the driver's seat may be determined based on the images captured by the camera 5. In that case, the user may be determined to be seated in the driver's seat when the user's face is detected in the captured image.

The operation reception unit F2 detects the operation of the user pressing the talk SW 3. That is, the operation reception unit F2 receives the user operation for starting voice input (hereinafter, voice input operation).

The voice recognition device 1 has three operation modes: a sleep mode, a standby mode, and a recognition execution mode. Switching between the operation modes is performed by an operating system (hereinafter, OS), which is not shown.

The sleep mode is the operation mode when the voice recognition device 1 has been started and no occupant is seated in the driver's seat. In the sleep mode, the voice recognition device 1 turns off the power of the microphone 4 and the camera 5.

When the device detects in the sleep mode that the user has sat down in the driver's seat, it turns on the power of the microphone 4 and the camera 5 and shifts to the standby mode. The voice recognition device 1 is assumed to be configured to start when the vehicle's traveling power source (for example, the ignition power source) is turned on or when a door of the vehicle is opened.

The standby mode is the operation mode when the user is seated in the driver's seat and has not performed a voice input operation (in other words, when the device is in the input standby state). When the operation reception unit F2 receives a voice input operation while the voice recognition device 1 is operating in the standby mode, the voice recognition device 1 shifts from the standby mode to the recognition execution mode and executes processing for recognizing the user's utterance content (hereinafter, utterance content recognition processing). When that processing is completed, the device returns to the standby mode. In other words, the state in which the utterance content recognition processing is being executed corresponds to the recognition execution mode. When the device detects in the standby mode that the user has left the driver's seat, it shifts to the sleep mode. The utterance content recognition processing will be described later.
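The mode transitions described above can be summarized as a small state machine. The following is a minimal sketch; the names `Mode` and `next_mode` are hypothetical, and the behavior when the user leaves the seat during recognition is not stated in this document, so the sketch simply stays in the recognition execution mode until recognition completes.

```python
from enum import Enum

class Mode(Enum):
    SLEEP = "sleep"
    STANDBY = "standby"
    RECOGNITION = "recognition"

def next_mode(mode, seated, talk_pressed=False, recognition_done=False):
    """Return the operation mode that follows `mode` given the current inputs."""
    if mode is Mode.SLEEP:
        # Sitting down wakes the device (microphone and camera are powered on).
        return Mode.STANDBY if seated else Mode.SLEEP
    if mode is Mode.STANDBY:
        if not seated:
            return Mode.SLEEP  # user left the driver's seat
        # Pressing the talk switch starts utterance content recognition.
        return Mode.RECOGNITION if talk_pressed else Mode.STANDBY
    # Mode.RECOGNITION: return to standby once recognition is completed.
    return Mode.STANDBY if recognition_done else Mode.RECOGNITION
```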

The voice acquisition unit F3 acquires the audio signal output from the microphone 4. The voice acquisition unit F3 also determines, based on the input signal from the microphone 4, whether the user is speaking. Whether the user is speaking can be determined based on, for example, the number of zero crossings of the audio signal. That is, the point at which a signal exceeding a certain level is being input and the number of zero crossings per predetermined unit time exceeds a certain number is adopted as the point at which the utterance started (hereinafter, utterance start point). The point at which this condition is no longer satisfied is adopted as the point at which the utterance ended (hereinafter, utterance end point). Hereinafter, the period from the utterance start point to the utterance end point is referred to as an utterance section, and any period outside an utterance section is referred to as a non-utterance section.
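As an illustration of the zero-crossing criterion, the sketch below splits a signal into fixed-length frames and marks a frame as speech when both the level condition and the zero-crossing condition hold. The frame-based formulation, the use of peak amplitude as the "level", and all function names and thresholds are assumptions made for this example.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples of a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def is_speech_frame(frame, level_threshold, zc_threshold):
    """True when the frame exceeds both the level and the zero-crossing threshold."""
    peak = max(abs(s) for s in frame)  # frame is assumed to be non-empty
    return peak > level_threshold and zero_crossings(frame) > zc_threshold

def find_utterance_sections(frames, level_threshold, zc_threshold):
    """Return (start, end) frame indices; end is the first frame failing the condition."""
    sections, start = [], None
    for i, frame in enumerate(frames):
        speaking = is_speech_frame(frame, level_threshold, zc_threshold)
        if speaking and start is None:
            start = i  # utterance start point
        elif not speaking and start is not None:
            sections.append((start, i))  # utterance end point
            start = None
    if start is not None:
        sections.append((start, len(frames)))
    return sections

quiet = [0.0] * 8
loud = [0.5, -0.5] * 4
sections = find_utterance_sections([quiet, loud, loud, quiet], 0.1, 2)
```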

Furthermore, the voice acquisition unit F3 applies A/D conversion to the audio signal input during the period determined to be an utterance section, thereby generating digital data corresponding to the voice uttered by the user (hereinafter, utterance voice data). The utterance voice data is stored in a memory (not shown) in association with time information including the utterance start time.

Utterance sections and non-utterance sections may also be distinguished by other well-known methods. For example, a method that detects the start and end of an utterance section based on a Gaussian mixture model (GMM) may be employed. In this method, a GMM is defined for speech and another for non-speech; for each short input frame, features are extracted and the likelihood under each GMM is calculated; and the start and end of the utterance section are determined from the likelihood ratio between the speech GMM and the non-speech GMM.
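The GMM-based decision can be sketched as follows for a single scalar feature per frame. Real systems typically use multi-dimensional features and trained models; the one-dimensional mixtures, the parameter values, and the names `gmm_log_likelihood` and `is_speech` are assumptions for illustration only.

```python
import math

def gmm_log_likelihood(x, components):
    """Log-likelihood of scalar x under a 1-D GMM of (weight, mean, variance) triples."""
    total = 0.0
    for weight, mean, var in components:
        density = math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
        total += weight * density
    return math.log(total) if total > 0 else float("-inf")

def is_speech(feature, speech_gmm, nonspeech_gmm, threshold=0.0):
    """Classify a frame as speech when the log-likelihood ratio exceeds the threshold."""
    ratio = gmm_log_likelihood(feature, speech_gmm) - gmm_log_likelihood(feature, nonspeech_gmm)
    return ratio > threshold

# Hypothetical models: speech features cluster around 5 and 8, non-speech around 0.
speech_gmm = [(0.5, 5.0, 1.0), (0.5, 8.0, 1.0)]
nonspeech_gmm = [(1.0, 0.0, 1.0)]
```

Applying `is_speech` frame by frame yields a speech/non-speech sequence from which the start and end of each utterance section can be read off.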

The voice recognition unit F4 performs voice recognition processing on the utterance voice data generated by the voice acquisition unit F3, using the various data stored in the voice recognition DB 1a. Because well-known techniques can be used for the voice recognition processing, its description is omitted here.

The noise level determination unit F5 determines the loudness of noise (that is, the noise level) based on the amplitude of the audio signal input from the microphone 4 while the voice acquisition unit F3 determines that the current period is a non-utterance section. For example, when the amplitude of the audio signal input during a non-utterance section exceeds a predetermined threshold, the noise level is determined to be high; when it is below the threshold, the noise level is determined to be low.

The image acquisition unit F6 sequentially acquires the image data captured by the camera 5. The image acquisition unit F6 attaches a time stamp indicating the acquisition time to the acquired image data and stores it in a memory (not shown). When the volume of image data stored in the memory reaches a predetermined upper limit, the data may be deleted in order starting from the oldest acquisition time.
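The time-stamped storage with oldest-first deletion can be sketched with a double-ended queue. The class name `FrameBuffer`, the byte-based capacity, and the `between` helper are hypothetical; this sketch evicts frames once the stored size exceeds the limit, which is one possible reading of "reaches a predetermined upper limit".

```python
from collections import deque

class FrameBuffer:
    """Stores (timestamp, image bytes) pairs and evicts the oldest when over capacity."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.frames = deque()
        self.used = 0

    def add(self, timestamp, data):
        """Append a frame, then delete the oldest frames while over the limit."""
        self.frames.append((timestamp, data))
        self.used += len(data)
        while self.used > self.max_bytes and self.frames:
            _, old = self.frames.popleft()
            self.used -= len(old)

    def between(self, start, end):
        """Return the frames whose time stamps fall within [start, end]."""
        return [d for t, d in self.frames if start <= t <= end]

buf = FrameBuffer(max_bytes=10)
buf.add(1.0, b"aaaa")
buf.add(2.0, b"bbbb")
buf.add(3.0, b"cccc")  # total would be 12 bytes, so the oldest frame is dropped
```

A helper such as `between` lets later processing pull out the frames of one utterance section by its start and end times.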

The lip reading processing unit F7 detects the movement of the user's lips from the image data obtained from the image acquisition unit F6. From the presence or absence of lip movement, the lip reading processing unit F7 then identifies the point at which the user started speaking (that is, the utterance start point) and the point at which the user finished speaking (that is, the utterance end point). In other words, it identifies the utterance section.

The lip reading processing unit F7 also converts the voice uttered by the user into text from the pattern of change in the user's lip shape (that is, the lip movement pattern) in the series of image data captured during the utterance section (hereinafter, utterance image data). That is, the lip reading processing unit F7 executes lip reading processing. Conversion into text here means generating a string of syllable characters corresponding to the user's uttered voice (hereinafter, utterance character string).

Generation of the utterance character string may be realized by comparing the lip movement pattern identified from the image data with the lip movement pattern of each syllable character stored as the lip movement model in the lip reading DB 1b. A well-known method such as dynamic programming can be used for the comparison. When multiple syllable characters are extracted as candidates for the lip movement pattern of one character, candidate words uttered by the user are extracted using the syllable characters uttered before and after it, and the characters forming a high-likelihood word according to the word dictionary are adopted. Other well-known methods can also be used as the algorithm for generating an utterance character string from image data.
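One common dynamic programming comparison is dynamic time warping (DTW), which tolerates differences in speaking speed between the observed and the stored pattern. The sketch below is an illustration under the assumption that each lip movement pattern is a sequence of scalar features (for example, mouth opening per frame); this document does not specify DTW or any feature, and the names `dtw_distance`, `best_syllable`, and `lip_model` are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of scalar features."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of seq_a[:i] against seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def best_syllable(observed, lip_model):
    """Return the syllable whose stored pattern is closest to the observed one."""
    return min(lip_model, key=lambda s: dtw_distance(observed, lip_model[s]))

# Hypothetical lip movement model: mouth opening over time for two vowels.
lip_model = {"a": [0.0, 1.0, 0.0], "u": [0.0, 0.3, 0.0]}
candidate = best_syllable([0.0, 0.9, 0.9, 0.1], lip_model)
```

In practice several syllables may score similarly, which is where the word dictionary described above would be used to choose among the candidates.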

The learning processing unit F8 is a functional block that executes the pattern learning processing described later. As finer-grained functions (that is, sub-functions) for performing the pattern learning processing, the learning processing unit F8 includes, as shown in FIG. 2, a tone pattern specifying unit F81, a connected lip movement pattern specifying unit F82, and a storage processing unit F83. Details of these sub-functions and of the pattern learning processing will be described later.

The intermediate language unit F9 performs the intermediate language data generation process described later. The intermediate language unit F9 corresponds to the intermediate language data generation unit recited in the claims. The recognition medium setting unit F10 switches, based on the determination result of the noise level determination unit F5, between the voice recognition unit F4 and the lip reading processing unit F7 as the means used to identify the user's utterance content. Specifically, when the noise level is determined to be low, the voice recognition unit F4 is set as the means for identifying the user's utterance content (hereinafter, recognition medium), whereas when the noise level is determined to be high, the lip reading processing unit F7 is set as the recognition medium.

<Standby mode processing>
Next, the processing executed while the voice recognition device 1 is operating in the standby mode (hereinafter, standby mode processing) is described using the flowchart shown in FIG. 3. The flowchart shown in FIG. 3 is started when it is detected that the user has sat in the driver's seat. It is also started when the utterance content recognition process described later is completed. In other words, it is started upon a transition from the sleep mode or the recognition execution mode to the standby mode.

When shifting from the sleep mode to the standby mode, the camera 5 and the microphone 4 are powered on. Independently of the standby mode, the noise level determination unit F5 sequentially determines the noise level based on the audio signal input from the microphone 4.

First, in step S101, the voice acquisition unit F3 determines whether the user is speaking based on the audio signal input from the microphone 4. If it is determined that the user is speaking, an affirmative determination is made in step S101 and the process proceeds to step S102. If it is determined that the user is not speaking, a negative determination is made in step S101 and the process proceeds to step S104.

In step S102, the voice acquisition unit F3 acquires the utterance voice data, the lip reading processing unit F7 extracts the utterance image data from the image data stored in the memory, and the process proceeds to step S103. In step S103, the learning processing unit F8 executes the pattern learning process, and the process proceeds to step S104. The pattern learning process performed in step S103 is described separately later.

In step S104, the seating determination unit F1 determines whether the user is seated in the driver's seat based on the signal input from the seat sensor 2. If a signal indicating that the user is seated in the driver's seat is input from the seat sensor 2, an affirmative determination is made in step S104 and the process proceeds to step S105. If a signal indicating that the user is not seated in the driver's seat is input from the seat sensor 2, a negative determination is made in step S104 and the process proceeds to step S107.

In step S105, the operation reception unit F2 determines whether the talk SW 3 has been pressed. If the talk SW 3 has been pressed, an affirmative determination is made in step S105 and the process proceeds to step S106. If the talk SW 3 has not been pressed, a negative determination is made in step S105 and the process returns to step S101.

In step S106, the operation mode is set to the recognition execution mode, and this flow ends. In step S107, the operation mode is set to the sleep mode, and this flow ends.
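The branching of steps S101 to S107 can be summarized as one pass of a small state function. The following is a minimal sketch under stated assumptions: the sensor readings are passed in as booleans, the pattern learning process is an arbitrary callable, and the names `Mode`, `standby_step`, and `learn` are hypothetical and do not appear in the embodiment.

```python
from enum import Enum


class Mode(Enum):
    SLEEP = "sleep"
    STANDBY = "standby"
    RECOGNITION = "recognition"


def standby_step(is_speaking, is_seated, talk_pressed, learn):
    """One pass through steps S101 to S107; returns the mode to operate in next."""
    if is_speaking:               # S101: is the user speaking?
        learn()                   # S102-S103: acquire data and run pattern learning
    if not is_seated:             # S104: seat sensor check
        return Mode.SLEEP         # S107
    if talk_pressed:              # S105: talk switch check
        return Mode.RECOGNITION   # S106
    return Mode.STANDBY           # back to S101 on the next pass
```

Calling `standby_step` repeatedly while it returns `Mode.STANDBY` reproduces the loop from step S105 back to step S101.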

<Pattern learning process>
Next, the pattern learning process performed by the learning processing unit F8 is described using the flowchart shown in FIG. 4. This flowchart is started when the standby mode processing shown in FIG. 3 reaches step S103.

First, in step S201, the voice recognition unit F4 performs voice recognition processing using the utterance voice data provided by the voice acquisition unit F3. Executing step S201 generates a character string corresponding to the user's uttered voice (that is, the utterance character string) and identifies the timing at which each syllable character is uttered. In preparation for the subsequent processing, a number (hereinafter, utterance number) is assigned to each syllable character uttered by the user in the order of utterance. When step S201 is completed, the process proceeds to step S202.

In step S202, the lip reading processing unit F7 detects the movement of the user's lips from the utterance image data and sequentially identifies, within the series of utterance image data, the frames in which the user is uttering each syllable character. It then identifies the lip movement pattern for uttering each syllable character. When the process of step S202 is completed, the process proceeds to step S203. In step S203, the learning processing unit F8 stores the lip movement pattern observed while each syllable character is uttered in the lip reading DB 1b.

By performing steps S201 to S203 in this way, the voice recognition device 1 learns the user's lip movement pattern for each syllable character. FIG. 5 conceptually shows the flow of processing from step S201 to step S203.

The voice recognition unit F4 performs voice recognition processing on the utterance voice data shown in (A) of FIG. 5 and thereby sequentially identifies the syllable characters uttered by the user, as shown in (B) of FIG. 5. In other words, it generates the utterance character string. It also assigns utterance numbers to the syllable characters in the order in which they were uttered. The learning processing unit F8 then registers the image data corresponding to each state in the lip reading DB 1b as the lip movement pattern of the syllable character being uttered at that time. In FIG. 5, (C) represents the utterance number assigned to each syllable character, and (D) represents the lip movement pattern corresponding to each syllable character.

Hereinafter, the state in which the j-th syllable character (j is an integer) from the beginning of the utterance character string is being uttered is also referred to as the j-th state. The state immediately before the first utterance (that is, the silent state) is treated as the 0th state. One utterance number is also assigned to the silent state immediately after the end of the utterance. In FIG. 5, the silent state immediately after the end of the utterance is set as the 8th state.

Returning to FIG. 4, the description of the pattern learning process continues. In step S204, the variable j used in the subsequent processing is set to 1, and the process proceeds to step S205. In step S205, the number n of syllable characters constituting the utterance character string is acquired, and the process proceeds to step S206. Here n is a natural number. In the example shown in FIG. 5, n = 7.

In step S206, it is determined whether j is less than n + 1. If j is less than n + 1, an affirmative determination is made in step S206 and the process proceeds to step S207. If j is greater than or equal to n + 1, a negative determination is made in step S206 and this flow ends. When this flow ends, the process returns to the standby mode processing that called it and proceeds to step S104.

In step S207, the tone pattern specifying unit F81 generates, based on the voice data corresponding to the (j-1)-th through j-th states, data indicating the tone pattern when the (j-1)-th syllable and the j-th syllable are uttered in succession (hereinafter, tone pattern data). In other words, the tone pattern data indicates the tones for two syllables: the tone of the first syllable and the tone of the second syllable.

The tone here includes various parameters such as inflection (so-called intonation), the pitch of the user's voice, speaking speed, and volume. (A) and (B) of FIG. 6 conceptually represent the tone data when "きょ" (kyo) and "う" (u) are uttered in succession. Specifically, (A) represents the change in pitch contour, and (B) represents the change in volume. Speaking speed is not illustrated, but it may likewise be quantified by a well-known method. The types of items constituting the tone data may be designed as appropriate. Intonation and voice pitch are assumed to be expressed by the data indicating the pitch contour.

Various well-known formats can be adopted as the representation format of the tone data. Here, as an example, the data is expressed in the format defined as the speech synthesis symbols for ITS on-board units (JEITA TT-6004) in the standards of the Japan Electronics and Information Technology Industries Association.

When j = 1, that is, when the (j-1)-th state is the silent state, step S207 corresponds to the process of generating tone pattern data for uttering the first syllable character from the silent state. When the process in step S207 is completed, the process proceeds to step S208.

In step S208, the connected lip movement pattern specifying unit F82 identifies, based on the utterance image data corresponding to the (j-1)-th through j-th states, the lip movement pattern when the (j-1)-th syllable and the j-th syllable are uttered in succession (hereinafter, connected lip movement pattern). It then generates connected lip movement data indicating that connected lip movement pattern. (C) of FIG. 6 conceptually represents the connected lip movement pattern when "きょ" (kyo) and "う" (u) are uttered in succession. When the process in step S208 is completed, the process proceeds to step S209.

In step S209, the storage processing unit F83 stores, in the tone DB 1c, the tone data generated in step S207, the connected lip movement data generated in step S208, and the two syllable characters they represent, in association with one another. For convenience, data in which the tone data generated in step S207 and the connected lip movement data generated in step S208 are associated is referred to as syllable set data. When the process in step S209 is completed, the process proceeds to step S210.

In step S210, the value of the variable j is increased by one (that is, incremented), and the process returns to step S206. Repeating steps S206 to S210 therefore generates syllable set data for every two consecutive syllables. In the example shown in FIG. 5, seven sets of syllable set data are generated.
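The loop of steps S204 to S210 can be sketched as follows. This is a hedged illustration only: the tone and lip movement extraction of steps S207 and S208 are passed in as hypothetical callables `tone_of` and `lips_of`, the silent 0th state is represented by an empty string, and the function name `learn_pairs` is not taken from the embodiment.

```python
def learn_pairs(syllables, tone_of, lips_of):
    """Build one syllable set per pair of consecutive states (j-1, j), j = 1..n."""
    silence = ""                         # assumption: label used for the 0th (silent) state
    states = [silence] + list(syllables)
    n = len(syllables)                   # S205
    sets = []
    j = 1                                # S204
    while j < n + 1:                     # S206
        tone = tone_of(j - 1, j)         # S207: tone pattern data for the pair
        lips = lips_of(j - 1, j)         # S208: connected lip movement data for the pair
        sets.append({"label": (states[j - 1], states[j]),
                     "tone": tone, "lips": lips})   # S209
        j += 1                           # S210
    return sets
```

For an utterance of seven syllables this returns seven entries, the first of which pairs the silent state with the first syllable, which agrees with the count given for the example of FIG. 5.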

In the tone DB 1c, the various sets of syllable set data are stored grouped, for example, with the two syllable characters indicated by each set as the label. When a plurality of sets of syllable set data exist for the utterance "きょう" (kyou), they are grouped and stored as data corresponding to the two syllables "きょう". A plurality of sets of syllable set data exist for the utterance "きょう" when the user has in the past uttered "きょう" with various tone patterns or connected lip movement patterns.
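One simple way to realize this grouping is a mapping keyed by the pair of syllable characters. The sketch below is an assumption about the storage layout, not a description of the actual tone DB 1c; the class name `ToneDB` and its methods are hypothetical.

```python
from collections import defaultdict


class ToneDB:
    """In-memory stand-in for a tone database grouped by syllable pair."""

    def __init__(self):
        # Key: (first syllable, second syllable); value: list of syllable sets.
        self._groups = defaultdict(list)

    def add(self, first, second, tone, lips):
        """Store one syllable set under the label formed by its two syllables."""
        self._groups[(first, second)].append({"tone": tone, "lips": lips})

    def candidates(self, first, second):
        """Return every syllable set stored for the given pair (possibly empty)."""
        return list(self._groups.get((first, second), []))
```

Storing a list per label keeps every past variation of the same pair, so a later lookup can choose among different tone patterns recorded for the same two syllables.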

<Utterance content recognition process>
Next, the utterance content recognition process performed by the learning processing unit F8 is described using the flowchart shown in FIG. 7. The utterance content recognition process identifies the content uttered by the user based on either the voice collected by the microphone 4 or the image captured by the camera 5 (in other words, using either voice recognition or lip reading processing). The utterance content recognition process is started when the talk SW 3 is pressed, that is, when the operation mode shifts to the recognition execution mode.

First, in step S301, the recognition medium setting unit F10 determines, based on the determination result of the noise level determination unit F5, whether the voice recognition unit F4 or the lip reading processing unit F7 is to be used to identify the user's utterance content. If the noise level is determined to be low, it decides to identify the user's utterance content using the voice recognition unit F4, and the process proceeds to step S310. If the noise level is determined to be high, it decides to identify the user's utterance content using the lip reading processing unit F7, and the process proceeds to step S320.

In step S310, the voice recognition unit F4 acquires the utterance voice data generated by the voice acquisition unit F3, and the process proceeds to step S311. In step S311, the voice recognition unit F4 performs voice recognition processing based on the acquired utterance voice data, and the process proceeds to step S330.

In step S320, the utterance image data is acquired, and the process proceeds to step S321. In step S321, the lip reading processing unit F7 generates the utterance character string by performing lip reading processing using the utterance image data acquired in step S320, and the process proceeds to step S322.

In step S322, the intermediate language unit F9 performs the intermediate language data generation process using the utterance character string generated in step S321, and the process proceeds to step S323. The intermediate language data generation process is described separately later. As the product of this process, data is generated in which tone information indicating the tone with which the user uttered each syllable character is added to the utterance character string (hereinafter, intermediate language data).

In step S323, the voice recognition unit F4 identifies the user's utterance content by executing voice recognition processing using the intermediate language data generated in step S322. Identifying the utterance content here means, for example, dividing the utterance character string into words based on intonation and then converting it into one meaningful sentence based on the connection relationships between the words. When the process in step S323 is completed, the process proceeds to step S330.

In step S330, data indicating the user's utterance content identified by the above processing is provided to a predetermined application, and the process proceeds to step S331. In step S331, the operation mode is shifted to the standby mode, and this flow ends. When this flow ends, the standby mode processing shown in FIG. 3 is started.
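The two branches of this flow can be outlined as a single dispatch function. This is a rough sketch under assumptions: each processing unit is modeled as a plain callable supplied by the caller, and the names `recognize_utterance`, `speech_to_text`, `lip_read`, `add_tones`, and `interpret` are hypothetical.

```python
def recognize_utterance(noise_is_high, speech_to_text, lip_read, add_tones, interpret):
    """Choose the recognition medium by noise level and return the utterance content."""
    if not noise_is_high:                 # S301: low noise, use voice recognition
        return speech_to_text()           # S310-S311
    chars = lip_read()                    # S320-S321: utterance character string
    intermediate = add_tones(chars)       # S322: intermediate language data
    return interpret(intermediate)        # S323: identify the utterance content
```

Providing the result to an application and returning to the standby mode (steps S330 and S331) would be done by the caller and are left out here.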

<Intermediate language data generation process>
Next, the intermediate language data generation process performed by the intermediate language unit F9 is described using the flowchart shown in FIG. 8. This flowchart is started when the utterance content recognition process shown in FIG. 7 reaches step S322.

First, in step S401, the number n of syllable characters in the utterance character string generated by the lip reading processing unit F7 is acquired, and the process proceeds to step S402. In step S402, the variable k used in the subsequent processing is set to 1, and the process proceeds to step S403. Here k is a variable that takes natural-number values.

In step S403, it is determined whether k is less than n + 1. If k is less than n + 1, an affirmative determination is made in step S403 and the process proceeds to step S404. If k is greater than or equal to n + 1, a negative determination is made in step S403 and this flow ends. When this flow ends, the process returns to the utterance content recognition process that called it and proceeds to step S323.

In step S404, the lip movement pattern when the (k-1)-th syllable and the k-th syllable are uttered in succession (hereinafter, observed lip movement pattern) is identified based on the utterance image data corresponding to the (k-1)-th through k-th states. When the process in step S404 is completed, the process proceeds to step S405. The k-th syllable character corresponds to the target character recited in the claims, and the (k-1)-th syllable character corresponds to the syllable character positioned immediately before the target character recited in the claims.

The observed lip movement pattern may be identified by the intermediate language unit F9 or by the connected lip movement pattern specifying unit F82. Alternatively, the intermediate language unit F9 may identify it using the lip movement pattern that the lip reading processing unit F7 identified when generating the utterance character string. In any case, the observed lip movement pattern is identified based on the image data used to generate the utterance character string.

In step S405, the connected lip movement data indicating the connected lip movement pattern most similar to the observed lip movement pattern identified in step S404 is selected from among the various connected lip movement data stored in the tone DB 1c. Here, as an example, the connected lip movement data labeled with the (k-1)-th syllable and the k-th syllable character is extracted, and from that data the item indicating the connected lip movement pattern most similar to the observed lip movement pattern is selected.

The degree of similarity may be calculated using a well-known technique such as pattern matching. If only one item of connected lip movement data labeled with the (k-1)-th syllable and the k-th syllable character is registered in the tone DB 1c, that item is selected. When the process in step S405 is completed, the process proceeds to step S406.
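The selection in step S405 can be sketched as a maximum-similarity search over the entries that share the same two-syllable label. The similarity measure below (negative sum of squared differences over equal-length feature lists) is merely one possible choice and is an assumption, as are the names `similarity` and `select_entry`; the source only requires some well-known pattern matching technique.

```python
def similarity(p, q):
    """Similarity of two patterns; higher means more alike.

    Assumption: both patterns are equal-length lists of floats.
    """
    return -sum((x - y) ** 2 for x, y in zip(p, q))


def select_entry(observed, candidates):
    """Return the stored entry whose connected lip movement data best matches.

    candidates: entries already filtered by the (k-1, k) syllable label,
    each a dict with a "lips" pattern and an associated "tone".
    Returning None for an empty list is an assumption; the source does not
    describe that case.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda e: similarity(observed, e["lips"]))
```

With a single candidate, `max` returns that candidate, which matches the rule stated above for a label that has only one registered item.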

In step S406, the tone data associated with the connected lip movement data selected in step S405 is read, and the process proceeds to step S407. In step S407, the tone for the k-th syllable character is determined from the read tone data and the tone assigned to the (k-1)-th syllable character. For example, when k = 1, the tone of the second syllable indicated in the read tone data is adopted as is.

When k ≥ 2, the tones for the two syllables indicated in the read tone data are corrected by the same amount so that the tone of the first syllable indicated in the read tone data matches the tone set for the (k-1)-th syllable character in the utterance character string. For example, when the tone of the first syllable indicated in the read tone data is 0.5 octaves lower than the tone set for the (k-1)-th syllable character in the utterance character string, the tones for both syllables indicated in the tone data are each raised by 0.5 octaves. The tone of the second syllable of the tone data corrected in this way is then adopted as the tone for the k-th syllable character.

When the process in step S407 is completed, the process proceeds to step S408. In step S408, the value of the variable k is increased by one (that is, incremented), and the process returns to step S403. Repeating steps S403 to S408 therefore determines the tone for every syllable character constituting the utterance character string. In other words, intermediate language data in which tone information is added to the utterance character string is generated. As described above, any format such as JEITA TT-6004 can be adopted as the representation format of the intermediate language data.
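The tone chaining of step S407 can be illustrated with a deliberately simplified model in which a tone is a single pitch value in octaves. That reduction is an assumption made for the sketch (the tone data of the embodiment also covers volume and speaking speed), and the name `assign_tones` is hypothetical.

```python
def assign_tones(pair_tones):
    """Chain per-pair tone data into one tone per syllable character.

    pair_tones[k-1] is the (first, second) pitch read in step S406 for the
    pair of states (k-1, k), expressed in octaves relative to some reference.
    """
    assigned = []
    for k, (first, second) in enumerate(pair_tones, start=1):
        if k == 1:
            # k = 1: adopt the second-syllable tone as is.
            assigned.append(second)
        else:
            # k >= 2: shift both tones equally so that the first syllable
            # matches the tone already assigned to the (k-1)-th character.
            shift = assigned[-1] - first
            assigned.append(second + shift)
    return assigned
```

For example, `assign_tones([(0.0, 1.0), (0.5, 1.0)])` shifts the second pair up by 0.5 octaves because its first syllable is 0.5 octaves below the tone already assigned, which mirrors the numerical example in the text.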

<Summary of Embodiment>
In the above configuration, when a user's utterance is detected while operating in the standby mode, connected lip movement data and tone pattern data are generated based on the voice data uttered by the user and the image data at that time, and are stored in the tone DB 1c (step S103).

When lip reading processing is performed with the user's pressing of the talk SW 3 as the trigger, the connected lip movement pattern is identified from the image data captured during the user's utterance, and the tone for each syllable is determined from the identified connected lip movement pattern and the data stored in the tone DB 1c.

In other words, with the above configuration, intermediate language data can be generated from the image data captured by the camera 5. Because the tone assigned to each syllable is determined based on the lip movement patterns and tone patterns observed when the user actually spoke, it is expected to be close to the user's actual tone. The intermediate language data generated by the method described above can therefore be expected to reproduce the user's tone with relatively high accuracy.

In general, intermediate language data to which tone information at the time of utterance has been added carries more information than a mere sequence of syllable characters (that is, an utterance character string). Therefore, when analyzing utterance content, using intermediate language data rather than the utterance character string improves the accuracy of identifying word boundaries, whether the sentence is a question, and so on, and yields a more appropriate recognition result. That is, with the above configuration, the utterance content can be recognized more accurately based on the result of the lip reading processing.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The various modifications described below are also included in the technical scope of the present invention, and the invention can further be implemented with various changes other than those below without departing from its gist.

Members having the same functions as members described in the embodiment above are given the same reference signs, and their description is omitted. When only part of a configuration is mentioned, the configuration of the previously described embodiment can be applied to the other parts.

[Modification 1]
The above discloses a mode in which the generated intermediate language data is used to identify (in other words, recognize) the utterance content, but this is not a limitation. The intermediate language data may also be used for speech synthesis processing. In that case, the voice recognition device 1 provides the intermediate language data generated by the intermediate language unit F9 to application software that executes speech synthesis processing.

[Modification 2]
The embodiment described above discloses a mode in which the microphone 4 is turned off when the user leaves the seat, but this is not a limitation. The microphone 4 may be kept on at all times while the power supply for traveling is on.

[Modification 3]
When a plurality of persons (that is, users) use the vehicle, the various processes described above are preferably performed with the user identified. That is, it is preferable to identify the user by a face image, voiceprint, fingerprint, or the like, and to generate the lip movement pattern for each syllable character, the connected lip movement data, and the tone data separately for each user.

[Modification 4]
The above discloses a mode in which processing is performed by dividing the user's uttered voice according to the concept of syllables, but this is not a limitation. The user's uttered voice may instead be divided and processed according to the concept of morae.

100: voice input system, 1: voice recognition device, 2: seat sensor, 3: talk switch, 4: microphone, 5: camera, F1: seating determination unit, F2: operation reception unit, F3: voice acquisition unit, F4: voice recognition unit, F5: noise level determination unit, F6: image acquisition unit, F7: lip reading processing unit, F8: learning processing unit, F9: intermediate language unit (intermediate language data generation unit), F10: recognition medium setting unit, F81: tone pattern specifying unit, F82: connected lip movement pattern specifying unit, F83: storage processing unit, 1a: voice recognition database, 1b: lip reading database, 1c: tone database

Claims

An utterance content recognition device that executes a process of specifying the utterance content of a user, triggered by a predetermined user operation, the device comprising:
An operation reception unit (F2) for receiving the user operation;
A voice acquisition unit (F3) that acquires the user's speech voice as speech voice data via a microphone;
An image acquisition unit (F6) for sequentially acquiring user images, which are images captured by a camera arranged to capture the user's face;
A lip reading processing unit (F7) that detects, from the user images acquired by the image acquisition unit, a lip movement pattern that is a change pattern of the user's lip shape, and generates, based on the detected lip movement pattern, an utterance character string that is a character string corresponding to the user's uttered voice;
A tone pattern specifying unit (F81) that specifies, based on the utterance voice data acquired by the voice acquisition unit, a tone pattern observed when two syllables are uttered consecutively;
A connected lip movement pattern specifying unit (F82) that specifies, from the user images acquired by the image acquisition unit, a connected lip movement pattern that is the change pattern of the lip shape when the user utters two syllables consecutively;
A learning processing unit (F8) that executes, as pattern learning processing, processing of storing in a tone database, in association with each other, the connected lip movement pattern specified by the connected lip movement pattern specifying unit and the tone pattern specified by the tone pattern specifying unit for the two syllables uttered consecutively by the user; and
An intermediate language data generation unit (F9) that generates intermediate language data in which tone information is added to each syllable character constituting the utterance character string, using the data stored in the tone database and the user images used to generate the utterance character string, wherein
The learning processing unit sequentially executes the pattern learning processing in cooperation with the tone pattern specifying unit and the connected lip movement pattern specifying unit while the operation reception unit has not received the user operation,
The intermediate language data generation unit generates the intermediate language data when the lip reading processing unit has generated the utterance character string in response to the operation reception unit receiving the user operation, and
The intermediate language data generation unit, when determining the tone of a target character that is a given syllable character constituting the utterance character string, identifies, from among the plurality of connected lip movement patterns stored in the tone database, the connected lip movement pattern having a high similarity to the change pattern of the user's lip shape observed when the syllable character positioned immediately before the target character and the target character are uttered, and
Determines the tone of the target character using the tone pattern associated with the identified connected lip movement pattern.
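The learning and tone determination recited in this claim can be sketched as follows. This is a minimal illustration, not the claimed implementation: representing a connected lip movement pattern as a numeric feature vector, using cosine similarity as the similarity measure, encoding a tone pattern as a pair such as ("L", "H"), and assigning the first syllable the first element of the first pair's pattern are all assumptions, as are the names `ToneDatabase` and `add_tones`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToneDatabase:
    """Hypothetical store of (connected lip movement vector, tone pattern)."""

    def __init__(self):
        self.entries = []

    def learn(self, lip_vector, tone_pattern):
        # Pattern learning: store the two in association with each other.
        self.entries.append((tuple(lip_vector), tone_pattern))

    def lookup(self, lip_vector):
        # Return the tone pattern of the most similar stored lip pattern.
        if not self.entries:
            return None
        best = max(self.entries,
                   key=lambda e: cosine_similarity(e[0], lip_vector))
        return best[1]


def add_tones(syllables, pair_vectors, db):
    """pair_vectors[i] is the lip vector for syllables[i] and syllables[i+1]."""
    tones = [None] * len(syllables)
    for i, vec in enumerate(pair_vectors):
        pattern = db.lookup(vec)
        if pattern is None:
            continue
        if i == 0:
            tones[0] = pattern[0]  # assumption: first syllable uses pair (0, 1)
        tones[i + 1] = pattern[1]  # tone of the target (second) syllable
    return list(zip(syllables, tones))


db = ToneDatabase()
db.learn([1.0, 0.0], ("L", "H"))
db.learn([0.0, 1.0], ("H", "L"))
print(add_tones(["ha", "shi", "ga"], [[0.1, 0.9], [0.8, 0.2]], db))
# [('ha', 'H'), ('shi', 'L'), ('ga', 'H')]
```

In this sketch `learn` would be called while no user operation is pending, and `add_tones` after the lip reading processing has produced the utterance character string.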

The utterance content recognition device according to claim 1, further comprising:
A voice recognition unit (F4) that performs voice recognition processing based on the utterance voice data acquired by the voice acquisition unit; and
A noise level determination unit (F5) that determines a noise level, which is the magnitude of noise, based on the amplitude of the audio signal output from the microphone, wherein
When the noise level determination unit determines that the noise level is high at the time the operation reception unit receives the user operation, the lip reading processing unit generates the utterance character string based on the user images, and
When the noise level determination unit determines that the noise level is low at the time the operation reception unit receives the user operation, the voice recognition unit specifies the user's utterance content by executing the voice recognition processing.
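The switching between lip reading and voice recognition according to the noise level might look like the sketch below. The claim only states that the noise level is determined from the amplitude of the microphone signal; using the RMS amplitude, the threshold value 0.1, and the function names are assumptions made for illustration.

```python
import math

# Hypothetical threshold on RMS amplitude (samples normalized to [-1, 1]).
NOISE_THRESHOLD = 0.1


def rms_amplitude(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_recognition_medium(mic_samples, threshold=NOISE_THRESHOLD):
    """Choose lip reading when noise is high, voice recognition otherwise.

    mic_samples is assumed to be the microphone signal observed at the
    time the user operation (talk switch press) is received.
    """
    if rms_amplitude(mic_samples) >= threshold:
        return "lip_reading"
    return "voice_recognition"


print(select_recognition_medium([0.5, -0.5, 0.5]))   # lip_reading
print(select_recognition_medium([0.01, -0.01]))      # voice_recognition
```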

The utterance content recognition device according to claim 2, wherein
The voice recognition unit specifies the user's utterance content by executing the voice recognition processing using the intermediate language data generated by the intermediate language data generation unit.

The utterance content recognition device according to any one of claims 1 to 3, wherein
The intermediate language data generated by the intermediate language data generation unit is provided to application software that executes speech synthesis processing.