JP2012118679A

JP2012118679A - Information processor, word discrimination device, screen display operation device, word registration device and method and program related to the same

Info

Publication number: JP2012118679A
Application number: JP2010266650A
Authority: JP
Inventors: Takuya Koizumi; 拓也小泉
Original assignee: NEC Communication Systems Ltd
Current assignee: NEC Communication Systems Ltd
Priority date: 2010-11-30
Filing date: 2010-11-30
Publication date: 2012-06-21

Abstract

PROBLEM TO BE SOLVED: To provide an information processor, a word discrimination device, a screen display operation device, a word registration device, and related methods and programs that process information and discriminate words using only video of a user's mouth, and that enable operation of a screen display. SOLUTION: Word image extraction means 22 extracts, as one unit of word image, the temporal change in the lip image extracted by lip image area extraction means 21. Pattern comparison means 25 compares the word vocalization pattern obtained from the word image with a word dictionary 24 and specifies the word that is the target of the device operation.

Description

The present invention relates to information processing apparatuses such as personal computers and mobile phones, a word discrimination device used in them, a screen display operation device that operates a screen display, a word registration device that registers words, and methods and programs for these.

In information processing apparatuses such as personal computers and mobile phones, information about the available operations is typically shown on a display, and input is performed with a keyboard, an operation panel, or a pointing device such as a mouse.

FIG. 25 shows the appearance of a mobile phone as an information processing apparatus according to the first related art. The mobile phone 200 is a folding phone in which a first housing 201 and a second housing 202 are foldably connected by a hinge mechanism 203. A display 205 is arranged on the first housing 201, and an operation unit 206 is arranged on the surface of the second housing 202 that faces the display 205 when the phone is folded.

Taking the mobile phone 200 as an example, the user looks at the screen content shown on the display 205, such as a menu screen, and operates keys of the operation unit 206, such as the enter key 206A, the direction keys 206B, or the dial keys 206C, to carry out the required processing, for example sending e-mail or searching for information. If the display 205 has a touch panel, operations can also be performed by pressing the desired part of the display with a finger or a pen.

Such operations presuppose that they are performed with the user's hands. For people with impaired hand function, many information processing apparatuses therefore have an interface that is hard to use. Even for people with no particular hand impairment, heavy use of these apparatuses places an excessive burden on the hands, which is undesirable.

In this type of information processing apparatus in particular, menus are often organized into many levels when the user selects a desired item from a menu screen. Manual operations such as key presses must then be repeated, which imposes a heavy burden on people with impaired hand function.

As a second related art, a personal computer input device has been proposed that moves and clicks a mouse cursor on the screen using a laser pointer and an air-operated switch (see, for example, Patent Document 1). In this second related art, a laser pointer is attached to the user's head, the laser beam is directed onto the display, and a camera detects the position of the beam to control the movement of the mouse cursor. A click is performed by placing a tube at the user's mouth and actuating the switch by exhaling or inhaling.

In the second related art, the user must provide a laser pointer, a camera that captures its image, and a special air-operated switch. In addition, during input the user loses freedom of head posture, gaze, and mouth movement.

As a third related art, a character input device using eye movement and mouth shape recognition has been proposed (see, for example, Patent Document 2). In the third related art, image data of the user (face position, orientation, gaze direction, and so on) and of the mouth shape when each vowel is uttered are stored as a dictionary, and the character in the row the user is gazing at is selected.

Patent Document 1: JP 2005-063101 A (paragraphs 0028 and 0029, FIG. 1)
Patent Document 2: JP 2002-269544 A (claim 1, paragraphs 0007 to 0011, paragraph 0024, FIG. 1)

In the third related art, a camera reads the mouth movement for the vowel of the character to be input, buttons for selecting a row are displayed, and the character is determined by combining the mouth image with the image of the user gazing at a button. To use the third related art, the user must therefore follow the row buttons by eye, find the relevant row, and utter just one character of that row while gazing at it, because when a phrase is spoken at normal speed it is impossible to track the relevant row by eye for each character. As a result, character input becomes considerably slower, and the frequent changes of gaze tire the eyes.

The third related art also allows the row to be chosen by key input, but this excludes operation by people with impaired hand function.

An object of the present invention is therefore to provide an information processing apparatus, a word discrimination device, a screen display operation device, a word registration device, and methods and programs for these that perform information processing and word discrimination, enable operation of a screen display, and register words using only video of the user's mouth.

In the present invention, an information processing apparatus comprises:
(a) a display that visually displays various kinds of information;
(b) imaging means that captures at least the mouth of an operator using the display;
(c) change discrimination means that discriminates the change over time in the image of the operator's mouth obtained by the imaging means; and
(d) specific operation execution means that executes a predetermined specific operation associated with the discrimination result of the change discrimination means.

The present invention also provides a word discrimination device comprising:
(a) lip image region extraction means that extracts an image of the lip region from a face image of the person to be recognized;
(b) word image extraction means that extracts, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) word utterance pattern recording means that records, as the word utterance pattern of one uttered word, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means, on the assumption that the seam between the closed upper and lower lips lies horizontal overall;
(d) a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
(e) pattern comparison means that compares the word utterance pattern to be recognized, as recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary; and
(f) word discrimination means that determines the word corresponding to the word utterance pattern found by the pattern comparison means to match best to be the word uttered by the person to be recognized.
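As an illustration of items (c) to (f), the following minimal Python sketch treats a word utterance pattern as a list of (vertical, horizontal) lip-opening distances, one pair per frame, and picks the dictionary word with the smallest distance. The text does not specify a comparison metric; the linear resampling to a fixed length, the mean squared difference, and all names (`resample`, `distance`, `discriminate`) are assumptions made for this example only.

```python
from math import inf

def resample(pattern, n=16):
    """Linearly resample a pattern to n frames so that utterances of
    different lengths can be compared (an assumption of this sketch)."""
    if not pattern:
        raise ValueError("pattern must contain at least one frame")
    if len(pattern) == 1:
        return [pattern[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(pattern) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        t = pos - lo
        out.append((pattern[lo][0] * (1 - t) + pattern[hi][0] * t,
                    pattern[lo][1] * (1 - t) + pattern[hi][1] * t))
    return out

def distance(a, b, n=16):
    """Mean squared difference of vertical and horizontal distances."""
    ra, rb = resample(a, n), resample(b, n)
    return sum((va - vb) ** 2 + (ha - hb) ** 2
               for (va, ha), (vb, hb) in zip(ra, rb)) / n

def discriminate(observed, dictionary):
    """Return the dictionary word whose stored pattern matches best."""
    best_word, best = None, inf
    for word, reference in dictionary.items():
        d = distance(observed, reference)
        if d < best:
            best_word, best = word, d
    return best_word
```

A dictionary here is a plain mapping from a word to its stored pattern, for example `{"menu": [(0, 4), (3, 4), (0, 4)]}`; the words and values are hypothetical.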

The present invention further provides a word discrimination device comprising:
(a) lip image region extraction means that extracts an image of the lip region from a face image of the person to be recognized;
(b) word image extraction means that extracts, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) closure timing measurement means that measures the timings at which the upper and lower lips close in the word image extracted by the word image extraction means;
(d) vowel discrimination means that discriminates each vowel making up the word from lip-opening shapes that persist unchanged for at least a predetermined time in the word image extracted by the word image extraction means;
(e) a word dictionary in which combinations of the timings measured by the closure timing measurement means or equivalent means and the vowels discriminated by the vowel discrimination means or equivalent means are prepared in advance in association with a plurality of words;
(f) comparison means that compares the combination of the timings measured by the closure timing measurement means and the vowels discriminated by the vowel discrimination means for the extracted word image with the combinations in the word dictionary; and
(g) word discrimination means that determines the word corresponding to the combination in the word dictionary found by the comparison means to match best to be the word uttered by the person to be recognized.
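The comparison of vowel and closure-timing combinations can be sketched as follows. This is a minimal illustration, not the method of the patent: it assumes that each entry is a pair of a vowel string and a list of lip-closure times normalized to the range 0 to 1 over the word, and that two closure times match when they differ by at most a hypothetical `tolerance`. The example words are invented.

```python
def match_score(observed, entry, tolerance=0.15):
    """Score in [0, 1]; 0 when the vowel sequence or closure count differs."""
    obs_vowels, obs_closures = observed
    ref_vowels, ref_closures = entry
    if obs_vowels != ref_vowels or len(obs_closures) != len(ref_closures):
        return 0.0
    if not ref_closures:
        return 1.0  # same vowels and no lip closure in either
    hits = sum(abs(o - r) <= tolerance
               for o, r in zip(obs_closures, ref_closures))
    return hits / len(ref_closures)

def discriminate(observed, dictionary):
    """Return the best-matching word, or None when nothing matches at all."""
    best = max(dictionary, key=lambda w: match_score(observed, dictionary[w]))
    return best if match_score(observed, dictionary[best]) > 0 else None

# Hypothetical dictionary: word -> (vowel sequence, normalized closure times)
WORDS = {
    "mama": ("aa", [0.0, 0.5]),
    "kaka": ("aa", []),
    "kama": ("aa", [0.5]),
}
```

With this dictionary, an observation of `("aa", [0.52])` selects "kama", even though all three entries share the same vowel sequence.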

The present invention further provides a screen display operation device comprising:
(a) lip image region extraction means that extracts an image of the lip region from a face image of the person to be recognized;
(b) word image extraction means that extracts, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) word utterance pattern recording means that records, as the word utterance pattern of one uttered word, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means, on the assumption that the seam between the closed upper and lower lips lies horizontal overall;
(d) a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
(e) pattern comparison means that compares the word utterance pattern to be recognized, as recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary;
(f) word discrimination means that determines the word corresponding to the word utterance pattern found by the pattern comparison means to match best to be the word uttered by the person to be recognized;
(g) a display that displays various kinds of information; and
(h) content operation means that operates on the content shown on the display with the operation corresponding to the word determined by the word discrimination means.
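The content operation means of item (h) amounts to a mapping from a recognized word to an operation on the displayed content. The sketch below shows one way such a mapping could look; the `Display` class, the menu model, and the words "up", "down", and "enter" are hypothetical stand-ins chosen for this example, not names taken from the patent.

```python
class Display:
    """Toy model of a menu shown on a display (hypothetical)."""
    def __init__(self, items):
        self.items = items
        self.cursor = 0
        self.selected = None

    def move(self, step):
        # Wrap around at either end of the menu.
        self.cursor = (self.cursor + step) % len(self.items)

    def select(self):
        self.selected = self.items[self.cursor]

# Recognized word -> operation on the display.
OPERATIONS = {
    "down": lambda d: d.move(+1),
    "up": lambda d: d.move(-1),
    "enter": lambda d: d.select(),
}

def operate(display, word):
    """Apply the operation registered for the word; ignore unknown words."""
    operation = OPERATIONS.get(word)
    if operation is None:
        return False
    operation(display)
    return True
```

Returning `False` for an unregistered word lets the caller ignore stray mouth movements instead of acting on them, which is a design choice of this sketch.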

The present invention also provides a screen display operation device comprising:
(a) lip image region extraction means that extracts an image of the lip region from a face image of the person to be recognized;
(b) word image extraction means that extracts, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) closure timing measurement means that measures the timings at which the upper and lower lips close in the word image extracted by the word image extraction means;
(d) vowel discrimination means that discriminates each vowel making up the word from lip-opening shapes that persist unchanged for at least a predetermined time in the word image extracted by the word image extraction means;
(e) a word dictionary in which combinations of the timings measured by the closure timing measurement means or equivalent means and the vowels discriminated by the vowel discrimination means or equivalent means are prepared in advance in association with a plurality of words;
(f) comparison means that compares the combination of the timings measured by the closure timing measurement means and the vowels discriminated by the vowel discrimination means for the extracted word image with the combinations in the word dictionary;
(g) word discrimination means that determines the word corresponding to the combination in the word dictionary found by the comparison means to match best to be the word uttered by the person to be recognized;
(h) a display that displays various kinds of information; and
(i) content operation means that operates on the content shown on the display with the operation corresponding to the word determined by the word discrimination means.

The present invention further provides a word registration device comprising:
(a) lip image region extraction means that, at the time of word registration, extracts an image of the lip region from a face image of the person to be recognized;
(b) registration target word image extraction means that extracts, as the word image for the utterance to be registered, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) word utterance pattern recording means that records, as the word utterance pattern of one uttered word, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the registration target word image extraction means, on the assumption that the seam between the closed upper and lower lips lies horizontal overall;
(d) a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
(e) pattern comparison means that compares the word utterance pattern to be registered, as recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary; and
(f) registrability determination means that allows registration only of unregistered words whose word utterance pattern is found by the pattern comparison means not to approximate a dictionary pattern by a predetermined value or more.
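The registrability check of item (f) can be sketched as a guard in front of the dictionary: a new word is accepted only if its pattern stays far enough from every stored pattern. In this minimal example the pattern is reduced to a list of vertical lip-opening values of equal length, the distance is a mean absolute difference, and `min_distance` plays the role of the "predetermined value"; all of these are assumptions for illustration.

```python
def mean_abs_diff(a, b):
    """Simple distance between two equal-length patterns (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def can_register(candidate, dictionary, min_distance):
    """True when the candidate is not too close to any stored pattern."""
    return all(mean_abs_diff(candidate, reference) >= min_distance
               for reference in dictionary.values())

def register(word, candidate, dictionary, min_distance=0.5):
    """Add the word only if it is new and visually distinguishable."""
    if word in dictionary:
        return False
    if not can_register(candidate, dictionary, min_distance):
        return False
    dictionary[word] = candidate
    return True
```

Rejecting near-duplicates at registration time keeps later discrimination from having to separate two words whose lip movements look almost the same.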

The present invention further provides a word registration device comprising:
(a) lip image region extraction means that, at the time of word registration, extracts an image of the lip region from a face image of the person to be recognized;
(b) registration target word image extraction means that extracts, as the word image for the utterance to be registered, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction means;
(c) closure timing measurement means that measures the timings at which the upper and lower lips close in the word image extracted by the registration target word image extraction means;
(d) vowel discrimination means that discriminates each vowel making up the word from lip-opening shapes that persist unchanged for at least a predetermined time in the word image extracted by the registration target word image extraction means;
(e) a word dictionary in which combinations of the timings measured by the closure timing measurement means or equivalent means and the vowels discriminated by the vowel discrimination means or equivalent means are prepared in advance in association with a plurality of words;
(f) comparison means that compares the combination of the timings measured by the closure timing measurement means and the vowels discriminated by the vowel discrimination means for the extracted word image with the combinations in the word dictionary; and
(g) registrability determination means that allows registration only of unregistered words whose combination is found by the comparison means not to approximate a dictionary combination by a predetermined value or more.

The present invention also provides a word discrimination method comprising:
(a) a lip image region extraction step of extracting an image of the lip region from a face image of the person to be recognized;
(b) a word image extraction step of extracting, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted in the lip image region extraction step;
(c) a word utterance pattern recording step of recording, as the word utterance pattern of one uttered word, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction step, on the assumption that the seam between the closed upper and lower lips lies horizontal overall;
(d) a pattern comparison step of comparing the word utterance pattern to be recognized, as recorded in the word utterance pattern recording step, with the word utterance pattern of each word in a word dictionary in which patterns recorded in the word utterance pattern recording step have been registered in advance in association with the respective words; and
(e) a word discrimination step of determining the word corresponding to the word utterance pattern found in the pattern comparison step to match best to be the word uttered by the person to be recognized.
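Step (b), cutting the lip image sequence into one-word units, can be illustrated with a simple rule on a per-frame lip-opening value: a unit starts when the value begins to change and ends once it has stayed still for several frames. The rule and its parameters `move_eps` and `rest_frames` are assumptions of this sketch; the text above only states that the change from start to end forms one unit.

```python
def extract_word_units(openings, move_eps=0.5, rest_frames=3):
    """Split a per-frame lip-opening series into (start, end) frame index
    pairs, one pair per word unit."""
    units = []
    start = None   # index where the current unit began, if any
    still = 0      # consecutive frames without movement inside a unit
    for i in range(1, len(openings)):
        moving = abs(openings[i] - openings[i - 1]) > move_eps
        if moving:
            if start is None:
                start = i - 1
            still = 0
        elif start is not None:
            still += 1
            if still >= rest_frames:
                # The unit ended at the last frame that still moved.
                units.append((start, i - still))
                start, still = None, 0
    if start is not None:
        # The series ended while a unit was still open.
        units.append((start, len(openings) - 1 - still))
    return units
```

A short pause inside a word, such as a held vowel lasting fewer than `rest_frames` frames, does not split the unit; only a longer rest does.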

The present invention further provides a word discrimination method comprising:
(a) a lip image region extraction step of extracting an image of the lip region from a face image of the person to be recognized;
(b) a word image extraction step of extracting, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted in the lip image region extraction step;
(c) a closure timing measurement step of measuring the timings at which the upper and lower lips close in the word image extracted in the word image extraction step;
(d) a vowel discrimination step of discriminating each vowel making up the word from lip-opening shapes that persist unchanged for at least a predetermined time in the word image extracted in the word image extraction step;
(e) a comparison step of comparing the combination of the timings measured in the closure timing measurement step and the vowels discriminated in the vowel discrimination step for the extracted word image with a word dictionary in which such combinations of timings and vowels have been registered in advance for each word; and
(f) a word discrimination step of determining the word corresponding to the combination in the word dictionary found in the comparison step to match best to be the word uttered by the person to be recognized.

The present invention further provides a word discrimination program that causes a computer to execute:
(a) lip image region extraction processing for extracting an image of the lip region from a face image of the person to be recognized;
(b) word image extraction processing for extracting, as one unit of word image, the series of changes over time from the start to the end of a change in the lip image within the lip image region extracted by the lip image region extraction processing;
(c) closure timing measurement processing for measuring the timings at which the upper and lower lips close in the word image extracted by the word image extraction processing;
(d) vowel discrimination processing for discriminating each vowel making up the word from lip-opening shapes that persist unchanged for at least a predetermined time in the word image extracted by the word image extraction processing;
(e) comparison processing for comparing the combination of the timings measured by the closure timing measurement processing and the vowels discriminated by the vowel discrimination processing for the extracted word image with a word dictionary in which such combinations of timings and vowels have been registered in advance for each word; and
(f) word discrimination processing for determining the word corresponding to the combination in the word dictionary found by the comparison processing to match best to be the word uttered by the person to be recognized.

As described above, according to the present invention, by using the camera (imaging device) that is often built into information processing apparatuses such as mobile phones and small personal computers, a predetermined specific operation can be executed on the basis of changes in the image of the operator's mouth without any new device. Even for an information processing apparatus without a camera (imaging device), the invention can be implemented at low cost, for example by connecting one via USB (Universal Serial Bus).

Furthermore, according to the present invention, not only are vowels discriminated but the timings at which the upper and lower lips close in the word image are also measured. This makes it possible to recognize words from combinations of vowels and bilabial sounds, so the number of words registered in the dictionary can be increased to the level needed to operate the apparatus.
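To see why adding lip-closure information enlarges the usable vocabulary, consider a rough sketch that derives a visual signature from a romanized word: the vowel sequence plus the positions at which a bilabial consonant (m, b, p, which are produced with the lips closed) occurs. The romanized input, the function `signature`, and the example words are hypothetical, and the handling of consonants is deliberately simplified.

```python
VOWELS = "aiueo"
BILABIALS = "mbp"  # consonants articulated with both lips closed

def signature(romaji):
    """Return (vowel sequence, positions of lip closures), where a position
    is the index of the vowel that the closure precedes."""
    vowels = []
    closures = []
    for ch in romaji:
        if ch in VOWELS:
            vowels.append(ch)
        elif ch in BILABIALS:
            closures.append(len(vowels))
    return "".join(vowels), tuple(closures)

words = ["kaka", "mama", "kama", "baka"]
by_vowels_only = {signature(w)[0] for w in words}
by_vowels_and_closures = {signature(w) for w in words}
```

In this example all four words share the vowel sequence "aa", so vowels alone yield a single signature, while vowels combined with closure positions yield four distinct ones.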

FIG. 1 is a claim correspondence diagram of the information processing apparatus of the present invention.
FIG. 2 is a claim correspondence diagram of the word discrimination device of the present invention.
FIG. 3 is a claim correspondence diagram of another word discrimination device of the present invention.
FIG. 4 is a claim correspondence diagram of the screen display operation device of the present invention.
FIG. 5 is a claim correspondence diagram of another screen display operation device of the present invention.
FIG. 6 is a claim correspondence diagram of the word registration device of the present invention.
FIG. 7 is a claim correspondence diagram of another word registration device of the present invention.
FIG. 8 is a claim correspondence diagram of the word discrimination method of the present invention.
FIG. 9 is a claim correspondence diagram of another word discrimination method of the present invention.
FIG. 10 is a claim correspondence diagram of the word discrimination program of the present invention.
FIG. 11 is a plan view showing the configuration of a mobile phone serving as the screen display operation device in the embodiment of the present invention.
FIG. 12 is a block diagram outlining the circuit configuration of the mobile phone of the embodiment.
FIG. 13 is an explanatory diagram roughly showing the relationship between the Japanese vowels and the shape of the lip opening.
FIG. 14 is a flowchart outlining the processing for calling up the menu screen of the mobile phone and for the word registration mode by voice in the embodiment.
FIG. 15 is an explanatory diagram showing the positional relationship between the mobile phone and the user's face when the processing of step S401 is performed in the embodiment.
FIG. 16 is an explanatory diagram showing an example of the utterance content, the analysis results, and the registered dictionary content for word utterance patterns registered in the word dictionary in the embodiment.
FIG. 17 is a plan view showing an example of the display content of the display in the embodiment.
FIG. 18 is a flowchart showing the flow of processing when the user operates the display content of the display by utterance in the embodiment.
FIG. 19 is a flowchart specifically showing the flow of the word registration processing in step S404 of FIG. 14 in the embodiment.
FIG. 20 is an explanatory diagram showing how the image changes when the user, as the speaker, utters a word one sound at a time in a first modification of the present invention.
FIG. 21 is a plan view showing a display example of the display in the first modification.
FIG. 22 is an explanatory diagram showing the control for switching the first and second windows between active and inactive in the first modification.
FIG. 23 is an explanatory diagram showing the selection of "Tab 1" to "Tab 4" as a second example in the first modification.
FIG. 24 is a perspective view showing the screen display operation device in a second modification of the present invention.
FIG. 25 is a plan view showing the external appearance of a mobile phone as an information processing apparatus of a first related art of the present invention.

FIG. 1 is a claim correspondence diagram of the information processing apparatus of the present invention. The information processing apparatus 10 includes a display 11, imaging means 12, change determination means 13, and specific operation execution means 14. The display 11 visually displays various kinds of information. The imaging means 12 captures at least the mouth of the operator using the display 11. The change determination means 13 determines changes over time in the image of the operator's mouth obtained by the imaging means 12. The specific operation execution means 14 executes a predetermined specific operation that is associated with the determination result of the change determination means 13.

FIG. 2 is a claim correspondence diagram of the word discrimination device of the present invention. The word discrimination device 20 includes lip image region extraction means 21, word image extraction means 22, word utterance pattern recording means 23, a word dictionary 24, pattern comparison means 25, and word discrimination means 26. The lip image region extraction means 21 extracts an image of the lip region from an image of the face of the person to be recognized. The word image extraction means 22 extracts, as one unit of word image, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 21. The word utterance pattern recording means 23 records, as the word utterance pattern for one uttered word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means 22, with the directions defined on the assumption that the seam formed when the upper and lower lips are closed lies horizontal overall. The word dictionary 24 associates word utterance patterns, recorded in advance by the word utterance pattern recording means 23 or equivalent means, with their respective words. The pattern comparison means 25 compares the word utterance pattern to be recognized, recorded by the word utterance pattern recording means 23, with the word utterance pattern of each word in the word dictionary 24. The word discrimination means 26 determines that the word corresponding to the word utterance pattern judged by the pattern comparison means 25 to match best is the word uttered by the person to be recognized.
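The comparison and discrimination described for FIG. 2 can be sketched as follows. This is only a minimal illustration, not the method of the specification: the text does not state how the degree of matching is computed, so the linear resampling to a fixed length and the sum of squared differences are assumptions, as are the names `resample`, `pattern_distance`, and `discriminate_word` and the sample dictionary entries.

```python
from typing import Dict, List, Tuple

# One sample per video frame: (vertical, horizontal) distance through the
# center point of the lip opening. Units are arbitrary (e.g. pixels).
Pattern = List[Tuple[float, float]]


def resample(pattern: Pattern, n: int = 16) -> Pattern:
    """Linearly resample a pattern to n samples (an assumed normalization)."""
    if not pattern:
        raise ValueError("pattern must contain at least one sample")
    if len(pattern) == 1:
        return list(pattern) * n
    out = []
    for i in range(n):
        pos = i * (len(pattern) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append((
            pattern[lo][0] * (1 - frac) + pattern[hi][0] * frac,
            pattern[lo][1] * (1 - frac) + pattern[hi][1] * frac,
        ))
    return out


def pattern_distance(a: Pattern, b: Pattern, n: int = 16) -> float:
    """Sum of squared differences after resampling (assumed metric)."""
    ra, rb = resample(a, n), resample(b, n)
    return sum((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 for x, y in zip(ra, rb))


def discriminate_word(observed: Pattern, dictionary: Dict[str, Pattern]) -> str:
    """Return the dictionary word whose pattern is closest to the observed one."""
    return min(dictionary, key=lambda w: pattern_distance(observed, dictionary[w]))


# Hypothetical dictionary with two entries, for illustration only.
word_dictionary = {
    "open": [(0.0, 3.0), (2.0, 3.0), (0.0, 3.0)],
    "wide": [(0.0, 3.0), (0.5, 4.0), (0.0, 3.0)],
}
observed = [(0.0, 3.0), (1.8, 3.1), (1.9, 3.0), (0.0, 3.0)]
result = discriminate_word(observed, word_dictionary)
```

Resampling is used here so that utterances spoken at different speeds can still be compared sample by sample; other alignment schemes would serve the same purpose.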

FIG. 3 is a claim correspondence diagram of another word discrimination device of the present invention. This word discrimination device 30 includes lip image region extraction means 31, word image extraction means 32, closing timing measurement means 33, vowel discrimination means 34, a word dictionary 35, comparison means 36, and word discrimination means 37. The lip image region extraction means 31 extracts an image of the lip region from an image of the face of the person to be recognized. The word image extraction means 32 extracts, as one unit of word image, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 31. The closing timing measurement means 33 measures the timings at which the upper and lower lips close in the word image extracted by the word image extraction means 32. The vowel discrimination means 34 determines each vowel making up the word from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted by the word image extraction means 32. The word dictionary 35 holds, prepared in advance for a plurality of words, combinations of the timings measured by the closing timing measurement means 33 or equivalent means and the vowels determined by the vowel discrimination means 34 or equivalent means. The comparison means 36 compares the combination of the timings measured by the closing timing measurement means 33 and the vowels determined by the vowel discrimination means 34 for the word image extracted by the word image extraction means 32 with the combinations in the word dictionary 35. The word discrimination means 37 determines that the word corresponding to the combination in the word dictionary 35 judged to match best in the comparison by the comparison means 36 is the word uttered by the person to be recognized.
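The matching of vowel and closing-timing combinations described for FIG. 3 might be sketched as below. The encoding is an assumption made for illustration: each entry stores the vowel sequence and the indices of the vowels that are immediately preceded by a lip closure, and the mismatch count used as the score is likewise hypothetical. The romanized command words are examples only and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Utterance:
    vowels: Tuple[str, ...]    # vowel sequence, e.g. ("o", "o", "u")
    closures: FrozenSet[int]   # indices of vowels preceded by a lip closure


def mismatch(a: Utterance, b: Utterance) -> int:
    """Count differences in length, vowels, and closure positions (assumed score)."""
    length_gap = abs(len(a.vowels) - len(b.vowels))
    vowel_diff = sum(x != y for x, y in zip(a.vowels, b.vowels))
    closure_diff = len(a.closures ^ b.closures)
    return length_gap + vowel_diff + closure_diff


def discriminate_word(observed: Utterance, dictionary: Dict[str, Utterance]) -> str:
    """Return the word whose stored combination differs least from the observed one."""
    return min(dictionary, key=lambda w: mismatch(observed, dictionary[w]))


# Hypothetical entries: "m" is bilabial, so the lips close before the first vowel.
word_dictionary = {
    "modoru": Utterance(("o", "o", "u"), frozenset({0})),
    "tsugi": Utterance(("u", "i"), frozenset()),
    "menyu": Utterance(("e", "u"), frozenset({0})),
}

exact = discriminate_word(Utterance(("o", "o", "u"), frozenset({0})), word_dictionary)
# The closure before the first vowel was missed, yet the vowels still decide.
noisy = discriminate_word(Utterance(("e", "u"), frozenset()), word_dictionary)
```

Because only vowels and bilabial closures are visible from the lips, words that differ in other consonants alone collapse to the same entry, which is why the dictionary is kept to the words needed for operating the device.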

FIG. 4 is a claim correspondence diagram of the screen display operation device of the present invention. The screen display operation device 40 includes lip image region extraction means 41, word image extraction means 42, word utterance pattern recording means 43, a word dictionary 44, pattern comparison means 45, word discrimination means 46, a display 47, and content operation means 48. The lip image region extraction means 41 extracts an image of the lip region from an image of the face of the person to be recognized. The word image extraction means 42 extracts, as one unit of word image, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 41. The word utterance pattern recording means 43 records, as the word utterance pattern for one uttered word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means 42, with the directions defined on the assumption that the seam formed when the upper and lower lips are closed lies horizontal overall. The word dictionary 44 is a collection, covering a plurality of words, of standard word utterance patterns recorded in advance by the word utterance pattern recording means 43 or equivalent means. The pattern comparison means 45 compares the word utterance pattern to be recognized, recorded by the word utterance pattern recording means 43, with the word utterance pattern of each word in the word dictionary 44. The word discrimination means 46 determines that the word corresponding to the word utterance pattern judged by the pattern comparison means 45 to match best is the word uttered by the person to be recognized. The display 47 displays various kinds of information. The content operation means 48 operates on the content displayed on the display 47 according to the operation associated with the word determined by the word discrimination means 46.

FIG. 5 is a claim correspondence diagram of another screen display operation device of the present invention. This screen display operation device 50 includes lip image region extraction means 51, word image extraction means 52, closing timing measurement means 53, vowel discrimination means 54, a word dictionary 55, comparison means 56, word discrimination means 57, a display 58, and content operation means 59. The lip image region extraction means 51 extracts an image of the lip region from an image of the face of the person to be recognized. The word image extraction means 52 extracts, as one unit of word image, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 51. The closing timing measurement means 53 measures the timings at which the upper and lower lips close in the word image extracted by the word image extraction means 52. The vowel discrimination means 54 determines each vowel making up the word from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted by the word image extraction means 52. The word dictionary 55 holds, prepared in advance for a plurality of words, combinations of the timings measured by the closing timing measurement means 53 or equivalent means and the vowels determined by the vowel discrimination means 54 or equivalent means. The comparison means 56 compares the combination of the timings measured by the closing timing measurement means 53 and the vowels determined by the vowel discrimination means 54 for the word image extracted by the word image extraction means 52 with the word dictionary 55. The word discrimination means 57 determines that the word in the word dictionary judged by the comparison means 56 to match best is the word uttered by the person to be recognized. The display 58 displays various kinds of information. The content operation means 59 operates on the content displayed on the display 58 according to the operation associated with the word determined by the word discrimination means 57.

FIG. 6 is a claim correspondence diagram of the word registration device of the present invention. The word registration device 60 includes lip image region extraction means 61, registration target word image extraction means 62, word utterance pattern recording means 63, a word dictionary 64, pattern comparison means 65, and registered word availability determination means 66. The lip image region extraction means 61 extracts, at the time of word registration, an image of the lip region from an image of the face of the person to be recognized. The registration target word image extraction means 62 extracts the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 61, as the word image for the utterance of the word to be registered. The word utterance pattern recording means 63 records, as the word utterance pattern for one uttered word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the registration target word image extraction means 62, with the directions defined on the assumption that the seam formed when the upper and lower lips are closed lies horizontal overall. The word dictionary 64 is a collection, covering a plurality of words, of standard word utterance patterns recorded in advance by the word utterance pattern recording means 63 or equivalent means. The pattern comparison means 65 compares the word utterance pattern to be registered, recorded by the word utterance pattern recording means 63, with the word utterance pattern of each word in the word dictionary 64. The registered word availability determination means 66 permits registration only of unregistered words that the pattern comparison means 65 has determined do not approximate any dictionary pattern by a predetermined value or more.
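The registration check described for FIG. 6 could look roughly like the following. This is a sketch under stated assumptions: patterns are taken to be already normalized to equal length, the similarity measure (the reciprocal of one plus the mean squared difference) and the threshold value 0.8 are illustrative choices, and `similarity` and `may_register` are hypothetical names.

```python
from typing import Dict, List, Tuple

Pattern = List[Tuple[float, float]]  # (vertical, horizontal) distance per frame


def similarity(a: Pattern, b: Pattern) -> float:
    """Similarity in (0, 1]; 1.0 means identical patterns (assumed measure)."""
    if not a or len(a) != len(b):
        raise ValueError("patterns must be non-empty and of equal length")
    msd = sum((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + msd)


def may_register(candidate: Pattern, dictionary: Dict[str, Pattern],
                 threshold: float = 0.8) -> bool:
    """Permit registration only if no existing entry is at least `threshold` similar."""
    return all(similarity(candidate, p) < threshold for p in dictionary.values())


# Hypothetical dictionary with a single registered pattern.
word_dictionary = {"open": [(0.0, 3.0), (2.0, 3.0), (0.0, 3.0)]}

duplicate_ok = may_register([(0.0, 3.0), (2.0, 3.0), (0.0, 3.0)], word_dictionary)
distinct_ok = may_register([(0.0, 3.0), (0.2, 5.0), (0.0, 3.0)], word_dictionary)
```

Rejecting near-duplicates at registration time keeps the later best-match comparison unambiguous, since two entries with almost the same lip movement could not be told apart reliably.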

FIG. 7 is a claim correspondence diagram of another word registration device of the present invention. This word registration device 70 includes lip image region extraction means 71, registration target word image extraction means 72, closing timing measurement means 73, vowel discrimination means 74, a word dictionary 75, comparison means 76, and registered word availability determination means 77. The lip image region extraction means 71 extracts, at the time of word registration, an image of the lip region from an image of the face of the person to be recognized. The registration target word image extraction means 72 extracts the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction means 71, as the word image for the utterance of the word to be registered. The closing timing measurement means 73 measures the timings at which the upper and lower lips close in the word image extracted by the registration target word image extraction means 72. The vowel discrimination means 74 determines each vowel making up the word from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted by the registration target word image extraction means 72. The word dictionary 75 holds, prepared in advance for a plurality of words, combinations of the timings measured by the closing timing measurement means 73 or equivalent means and the vowels determined by the vowel discrimination means 74 or equivalent means. The comparison means 76 compares the combination of the timings measured by the closing timing measurement means 73 and the vowels determined by the vowel discrimination means 74 for the word image extracted by the registration target word image extraction means 72 with the word dictionary 75. The registered word availability determination means 77 permits registration only of unregistered words that the comparison means 76 has determined do not approximate any dictionary entry by a predetermined value or more.

FIG. 8 is a claim correspondence diagram of the word discrimination method of the present invention. The word discrimination method 80 includes a lip image region extraction step 81, a word image extraction step 82, a word utterance pattern recording step 83, a pattern comparison step 84, and a word discrimination step 85. In the lip image region extraction step 81, an image of the lip region is extracted from an image of the face of the person to be recognized. In the word image extraction step 82, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted in the lip image region extraction step 81 is extracted as one unit of word image. In the word utterance pattern recording step 83, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction step 82 are recorded as the word utterance pattern for one uttered word, with the directions defined on the assumption that the seam formed when the upper and lower lips are closed lies horizontal overall. In the pattern comparison step 84, the word utterance pattern to be recognized, recorded in the word utterance pattern recording step 83, is compared with the word utterance pattern of each word in a word dictionary in which patterns were registered in advance, in association with their respective words, through the word utterance pattern recording step 83. In the word discrimination step 85, the word corresponding to the word utterance pattern judged in the pattern comparison step 84 to match best is determined to be the word uttered by the person to be recognized.

FIG. 9 is a claim correspondence diagram of another word discrimination method of the present invention. This word discrimination method 90 includes a lip image region extraction step 91, a word image extraction step 92, a closing timing measurement step 93, a vowel discrimination step 94, a comparison step 95, and a word discrimination step 96. In the lip image region extraction step 91, an image of the lip region is extracted from an image of the face of the person to be recognized. In the word image extraction step 92, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted in the lip image region extraction step 91 is extracted as one unit of word image. In the closing timing measurement step 93, the timings at which the upper and lower lips close in the word image extracted in the word image extraction step 92 are measured. In the vowel discrimination step 94, each vowel making up the word is determined from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted in the word image extraction step 92. In the comparison step 95, the combination of the timings measured in the closing timing measurement step 93 and the vowels determined in the vowel discrimination step 94 for the word image extracted in the word image extraction step 92 is compared with the combinations in a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement step 93 and vowels determined in the vowel discrimination step 94 was registered in advance. In the word discrimination step 96, the word in the word dictionary judged in the comparison step 95 to match best is determined to be the word uttered by the person to be recognized.

FIG. 10 is a claim correspondence diagram of the word discrimination program of the present invention. The word discrimination program 100 causes a computer to execute lip image region extraction processing 101, word image extraction processing 102, closing timing measurement processing 103, vowel discrimination processing 104, comparison processing 105, and word discrimination processing 106. In the lip image region extraction processing 101, an image of the lip region is extracted from an image of the face of the person to be recognized. In the word image extraction processing 102, the series of changes over time from the start to the end of the change in the lip image within the lip image region extracted by the lip image region extraction processing 101 is extracted as one unit of word image. In the closing timing measurement processing 103, the timings at which the upper and lower lips close in the word image extracted by the word image extraction processing 102 are measured. In the vowel discrimination processing 104, each vowel making up the word is determined from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted by the word image extraction processing 102. In the comparison processing 105, the combination of the timings measured by the closing timing measurement processing 103 and the vowels determined by the vowel discrimination processing 104 for the word image extracted by the word image extraction processing 102 is compared with the combinations in a word dictionary in which, for each word, a combination of timings measured by the closing timing measurement processing 103 and vowels determined by the vowel discrimination processing 104 was registered in advance. In the word discrimination processing 106, the word in the word dictionary judged by the comparison processing 105 to match best is determined to be the word uttered by the person to be recognized.

<Embodiment of the Invention>

Next, an embodiment of the present invention will be described.

FIG. 11 shows the configuration of a mobile phone serving as the screen display operation device in the embodiment of the present invention. The mobile phone 300 is a folding phone in which a first housing 301 and a second housing 302 are joined by a hinge mechanism 303 so that they can be folded together. A display 305 is disposed at the center of the first housing 301, and an imaging device 306 is disposed to its upper right. An operation unit 307 is disposed on the surface of the second housing 302 that faces the display 305 when the phone is folded. The operation unit 307 carries various keys, including an enter key 307A, direction keys 307B, and dial keys 307C.

FIG. 12 outlines the circuit configuration of the mobile phone of the present embodiment. The mobile phone 300 has a main control unit 323 that includes a CPU (Central Processing Unit) 321 and a memory 322 storing the program executed by the CPU 321. The main control unit 323 is connected to each part of the mobile phone 300 through a bus 324, such as a data bus, and controls those parts.

Of these parts, the communication control unit 325 controls communication with a base station (not shown). The imaging device 306 captures still images and moving images. The display control device 326 controls what is shown on the display 305. The operation unit 307 accepts key input for the various operations of the mobile phone 300. In the mobile phone of the present embodiment, input operations using images of the lips, associated with the content displayed on the display 305, are also possible.

The word dictionary 327 is a collection of standard word utterance patterns for a plurality of words. Here, a word utterance pattern is a pattern formed from the temporal changes, while the user utters a word consisting of several sounds, in the vertical and horizontal distances through the center point of the lip opening, with the directions defined on the assumption that the seam formed when the upper and lower lips are closed lies horizontal overall. The word discrimination unit 328 identifies a word that the user utters to perform an operation by comparing its word utterance pattern with the standard word utterance patterns registered in the dictionary. The image memory 329 stores the image data obtained from the imaging device 306. The lip image region extraction unit 330 extracts the image of the lips from the image of the user's face. The lips can be detected using, for example, an active contour model (SNAKES). They can also be detected by locating the face and identifying a region that is redder than its surroundings.
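The second detection approach mentioned above, finding a region redder than its surroundings, can be illustrated with a deliberately simple sketch. It assumes the face has already been located and cropped, represents the crop as rows of (r, g, b) tuples, and uses a hypothetical redness measure and margin; a practical detector would need considerably more care. The name `lip_bounding_box` is an assumption for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0..255


def redness(p: Pixel) -> float:
    """How much red exceeds the average of green and blue (assumed measure)."""
    r, g, b = p
    return r - (g + b) / 2


def lip_bounding_box(face: List[List[Pixel]],
                     margin: float = 40.0) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, of the pixels whose redness
    exceeds the face-wide mean by more than `margin`, or None if there are none."""
    values = [[redness(p) for p in row] for row in face]
    flat = [v for row in values for v in row]
    mean = sum(flat) / len(flat)
    hits = [(y, x) for y, row in enumerate(values)
            for x, v in enumerate(row) if v > mean + margin]
    if not hits:
        return None
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return (min(ys), min(xs), max(ys), max(xs))


# Tiny synthetic face crop: skin-colored pixels with a strip of lip-colored ones.
SKIN, LIP = (200, 160, 140), (190, 60, 70)
face = [[SKIN] * 5 for _ in range(4)]
for x in (1, 2, 3):
    face[2][x] = LIP
box = lip_bounding_box(face)
```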

The word image extraction unit 331 extracts from the image memory 329, as one unit of word image, the series of changes over time from the start to the end of the change in the lip image while the user utters a word. The closing timing measurement unit 332 measures each timing at which the upper and lower lips close while the user utters the word. The vowel discrimination unit 333 determines each vowel making up the word from lip-opening shapes that remain the same for a predetermined time or longer in the word image extracted by the word image extraction unit 331. The general function unit 334 groups together the functions that the mobile phone 300 of the present embodiment provides as an ordinary mobile phone. For example, if the mobile phone 300 has a television reception function or an electronic payment function, those functions reside in the general function unit 334.
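One way to picture the extraction of a word image "from the start to the end of the change" is the segmentation sketch below. It is an assumption-laden illustration: the input is the vertical lip opening per frame, the lips are assumed to be closed at rest, and a word is taken to end once the lips stay closed for `rest_frames` consecutive frames, so that brief bilabial closures inside a word do not split it. The function name and both parameters are hypothetical.

```python
from typing import List, Tuple


def extract_word_segments(openings: List[float],
                          open_eps: float = 0.1,
                          rest_frames: int = 5) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) that each cover one utterance."""
    segments = []
    start = None      # index of the first open frame of the current word
    closed_run = 0    # consecutive closed frames seen inside the current word
    for i, v in enumerate(openings):
        if v > open_eps:
            if start is None:
                start = i
            closed_run = 0
        elif start is not None:
            closed_run += 1
            if closed_run >= rest_frames:
                # The lips have rested long enough: the word ended at the
                # last open frame, so drop the trailing closed frames.
                segments.append((start, i - closed_run + 1))
                start = None
                closed_run = 0
    if start is not None:
        segments.append((start, len(openings) - closed_run))
    return segments


# Two utterances; the first contains a one-frame closure in the middle.
openings = [0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]
segments = extract_word_segments(openings, rest_frames=3)
```

The closed frames that fall inside a returned range (frame 4 in the example) are exactly the closing timings that a closing timing measurement would go on to record.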

At least some of the components of the mobile phone 300 other than the main control unit 323 can be implemented in software by the CPU 321 executing a program stored in the memory 322.

FIG. 13 roughly shows the relationship between Japanese vowels and the shape of the lip opening. A vowel is a voiced sound accompanied by vibration of the vocal cords, and differs from a consonant in that the same sound is sustained for at least a certain length of time. Japanese has five vowels: “ア” (a), “イ” (i), “ウ” (u), “エ” (e), and “オ” (o).

Individual vowels can in principle be identified from appearance by a combination of the size of the mouth, the shape of the mouth opening, and the front-back position of the tongue. However, it is difficult to determine the front-back position of the tongue with the imaging device 306. In this embodiment, therefore, when the shape of the mouth opening stays the same for a predetermined time or longer while the user utters a word, that shape is used to estimate which of the vowels “ア”, “イ”, “ウ”, “エ”, and “オ” was uttered.

Here, the sound “ア” (a) is produced with the mouth open far enough for about two fingers to fit vertically through the center point of the lip opening, again assuming that the seam formed when the upper and lower lips are closed lies horizontal overall. The sound “イ” (i) is produced with the mouth open just enough for the tip of the little finger to fit lightly. The sound “エ” (e) is produced with the mouth open about midway between “ア” and “イ”. The lips are not pursed when these sounds are produced.

The sounds “ウ” (u) and “オ” (o) are produced with the lips pursed. Of the two, the lip opening for “ウ” is one size smaller than for “オ”. For “ア”, “エ”, and “イ”, the horizontal distance through the center point of the lip opening (assuming that the seam of the closed lips lies horizontal overall) is longer than when “ウ” or “オ” is uttered.

In this embodiment, therefore, the vowel discrimination unit 333 shown in FIG. 12 identifies each vowel constituting a word from a lip-opening shape that persists unchanged for a predetermined time or longer in the word image extracted by the word image extraction unit 331. Lip size naturally varies from person to person, so the relationship between the vertical and horizontal distances of the opening and each vowel is a relative one.
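The relative classification described above can be sketched as follows. This is a minimal illustration, not the method of the vowel discrimination unit 333 itself: the function name `classify_vowel`, the per-user reference distances, and every threshold value are assumptions introduced here, since the description gives only the qualitative ordering of the openings.

```python
def classify_vowel(height, width, ref_height, ref_width):
    """Guess a vowel from lip-opening distances normalized per user.

    ref_height: the user's vertical opening for a wide "ア" (assumed calibration value).
    ref_width:  the user's horizontal opening with unpursed lips (assumed calibration value).
    All thresholds below are hypothetical.
    """
    h = height / ref_height  # relative vertical opening
    w = width / ref_width    # relative horizontal opening
    if h < 0.05:
        return None          # mouth effectively closed: no vowel
    if w < 0.7:              # pursed lips: "ウ" or "オ"
        return "ウ" if h < 0.4 else "オ"  # "ウ" is the smaller opening
    if h >= 0.75:
        return "ア"          # widest vertical opening
    if h >= 0.4:
        return "エ"          # between "ア" and "イ"
    return "イ"              # narrow vertical opening, lips not pursed
```

Normalizing by per-user reference distances is one way to reflect the remark that the relationship between the distances and the vowels is relative rather than absolute.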

Because a vowel is a sustained sound, the shape of the lip opening is the same whether the user utters the single sound “イ” (i) or the repeated sound “イイ” (ii); only the length of time for which that shape is held differs. In this embodiment, therefore, persistence of the lip-opening shape for a predetermined time or longer is used to identify a vowel, and the same vowel is judged to have been uttered again each time a specific time, longer than the predetermined time, elapses. This makes it possible to tell whether the user uttered “イ” once or uttered it several times in succession, as in “イイ”.
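The repeat rule can be illustrated with a short sketch. The exact counting rule is not spelled out in the description, so the formula below (one vowel once the minimum hold time is reached, plus one more for each full repeat period held) is an assumption, as are the function name and the millisecond values.

```python
def count_vowel_repeats(hold_ms, min_hold_ms=100, repeat_ms=300):
    """Return how many times one vowel is judged to have been uttered.

    min_hold_ms stands in for the "predetermined time" and repeat_ms for the
    longer "specific time"; both values are hypothetical.
    """
    if hold_ms < min_hold_ms:
        return 0                      # shape not held long enough to count as a vowel
    return 1 + hold_ms // repeat_ms   # one more vowel per full repeat period held
```

With these assumed values, a shape held for 200 ms counts as a single “イ”, while the same shape held for 350 ms counts as “イイ”.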

There is an exception, however. “ン” (n) can be uttered while keeping the lip shape of the preceding “ア”, “イ”, “ウ”, “エ”, or “オ”. It can also be uttered with the lip opening already changed in preparation for the vowel that follows. Consequently, although the repeated sound “イイ” (ii) and the word “イン” (in) can each be distinguished from the single sound “イ”, they may not be distinguishable from each other under the principle of the present invention. This is stated only as a possibility because, when the word “インク” (inku, “ink”) is uttered, it cannot be determined in advance whether the vowels will be identified as “イイウ” or as “イウウ”. It is of course also possible to impose a restriction that sounds to be registered as words must not include “ン”.

When the mobile phone 300 is to be operated according to which word the user utters, trying to tell words apart from their vowels alone runs into a problem: for example, the word “イイ” (ii, “OK”) and the word “ミギ” (migi, “right”) both consist of the vowels “イ”, “イ” and cannot be distinguished. The number and variety of words that can be registered for operations would therefore be severely limited.

In this embodiment, therefore, the closing timing measurement unit 332 is used to measure each timing at which the upper and lower lips close while the user utters a word. This exploits the fact that, in Japanese, the upper and lower lips always close once when a sound in the ma-row, pa-row, or ba-row (a bilabial sound) is uttered. For example, when the word “イイ” (ii, “OK”) is uttered, the lip opening stays in the “イ” state shown at the upper right of FIG. 13 throughout. In contrast, when the word “ミギ” (migi, “right”) is uttered, the mouth first closes completely, then the “イ” state shown at the upper right of FIG. 13 occurs, and finally the same “イ” state occurs again.
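As a rough illustration of the closing timing measurement, the sketch below scans a per-frame series of vertical lip-opening distances and records the time of each transition from open to closed. The function name, the frame interval, and the closure threshold are assumptions; the description does not specify how the closing timing measurement unit 332 decides that the lips are closed.

```python
def closure_times(heights, frame_ms=33, closed_below=1.0):
    """Return the times (ms) at which the lips change from open to closed.

    heights: vertical opening per frame (arbitrary units, hypothetical input).
    closed_below: openings smaller than this count as closed (assumed threshold).
    """
    times = []
    was_closed = False
    for i, h in enumerate(heights):
        is_closed = h < closed_below
        if is_closed and not was_closed:
            times.append(i * frame_ms)  # record only the moment of closing
        was_closed = is_closed
    return times
```

A word such as “ミギ” would be expected to produce a closure at its start, whereas “イイ” would produce none while it is being uttered.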

Note that the number of times the mouth closes completely when the word “ミギ” (migi, “right”) is uttered cannot always be determined. This is because, when the utterance of “ミギ” ends, the mouth may remain open in the “イ” state or may be closed.

In any case, this embodiment characterizes a word by its combination of vowels together with whether each sound is a bilabial sound, so that the user can register the words used to operate the mobile phone 300 with relatively little restriction.

FIG. 14 outlines the processing for calling up the menu screen of the mobile phone and for the voice-based word registration mode. The description also refers to FIGS. 11 and 12.

The user first opens the first casing 301 and the second casing 302 from the folded state, as shown in FIG. 11, and utters “メニュー” (menyuu, “menu”), which means calling up the menu screen. The CPU 321 then takes in the image of the user's face at that moment from the imaging device 306, which has been activated, and determines that the “menu screen” instruction has been given (step S401).

FIG. 15 shows the positional relationship between the mobile phone and the user's face when the processing of step S401 is performed. The description also refers to FIG. 12. With the first casing 301 and the second casing 302 of the mobile phone 300 open, the face 352 of the user 351 is normally positioned in front of the imaging device 306. In this state, therefore, the lip image region extraction unit 330 can cut out the image region of the lips 353 of the user 351 from the image of the face 352, which lies within the angle of view (viewing angle) θ over which the subject is captured. The cut-out image regions of the lips 353 are stored in the image memory 329 one after another, and when the user 351 utters a word, the word image extraction unit 331 extracts the series of images from the start point to the end point of the utterance.

For this extracted series of images, the closing timing measurement unit 332 measures each timing at which the upper and lower lips close while the user utters the word. The vowel discrimination unit 333 first corrects any tilt of the upper and lower lips in the lip images cut out by the lip image region extraction unit 330 so that they are horizontal, and then identifies each vowel when one of the shapes shown in FIG. 13 persists for the predetermined time or longer.

When the lip-closing timings and the vowels are determined, the word utterance pattern for one unit of word, that is, the temporal change in the vertical and horizontal distances through the center point of the opening of the lips 353 in the images cut out by the lip image region extraction unit 330, is expanded in a working memory area (not shown) of the image memory 329. The closing timings of the upper and lower lips are then recorded, and when the shape of the opening of the lips 353 (see FIG. 13) persists for at least the time set in advance as the minimum time for uttering a vowel (the predetermined time), the measured vertical and horizontal distances are used to determine which of the five vowels the shape corresponds to. If the opening shape corresponding to one vowel is found to persist beyond the predetermined time, the same vowel is judged to have been uttered again each time the specific time elapses.

FIG. 16 shows an example of the utterance content, the analysis result, and the registered content in the dictionary for word utterance patterns registered in the word dictionary in this embodiment. The description also refers to FIG. 12.

The word dictionary 327 is divided into a system dictionary, which holds the words the user uses for basic operations from the outset, and a user dictionary, which holds the words the user registers later. The system dictionary contains words such as “メニュー” (menyuu, “menu”) and “トウロク” (touroku, “register”), as shown in FIG. 16.

Of these, the utterance of “メニュー” can be written as “○△エ△イ△ウウ” using the vowels identified by the vowel discrimination unit 333, where “○” marks a timing at which the closing timing measurement unit 332 measured a bilabial sound and “△” stands for a consonant. Here, “メ” (me) is a bilabial sound, so the pattern starts with “○”, followed by the ma-row consonant “△” and the vowel “エ” of “メ”. “ニ” (ni) consists of the na-row consonant “△” and the vowel “イ”. “ュ” (yu) consists of the ya-row consonant “△” and the vowel “ウ”. “ー” prolongs “ュ”, so like “ュ” it consists of the consonant “△” and the vowel “ウ”. Note, however, that the interval between the utterance of “ニ” and the utterance of “ュ” is the minimum interval at which the vowel “イ” is recognized.

Returning to FIG. 14, the description continues. When the user utters “メニュー”, the CPU 321 removes the consonant marks “△” from the utterance pattern “○△エ△イ△ウウ”, which leaves “○エイウウ”. When the CPU 321 searches the system dictionary area of the word dictionary 327 with this processed result, it matches the registered content for “メニュー”. It is thus determined that the user has given the menu screen instruction (step S401). The CPU 321 therefore controls the display control device 326 so that the menu screen is displayed on the display 305 (step S402).

Once the menu screen is displayed on the display 305, the CPU 321 waits in this display state for the next spoken instruction. Suppose the user utters “トウロク” (touroku, “register”) in order to register a word. The utterance “トウロク” is registered in the system dictionary area of the word dictionary 327 with the registered content “オウオウ”, as shown in FIG. 16.

When the user utters “トウロク”, the CPU 321 performs the processing described above on the word utterance pattern expanded in the image memory 329 and obtains “△オウ△オ△ウ” as the analysis result. None of the sounds “ト” (to), “ウ” (u), “ロ” (ro), and “ク” (ku) is a bilabial sound, so no “○” appears. The first sound, “ト”, consists of the ta-row consonant “△” and the vowel “オ”. “ウ” consists of the vowel “ウ” alone. “ロ” consists of the ra-row consonant “△” and the vowel “オ”. The last sound, “ク”, consists of the ka-row consonant “△” and the vowel “ウ”. Removing the consonant marks “△” therefore gives “オウオウ”, which matches the registered content for “トウロク” exactly.

Accordingly, if the user utters “トウロク” (touroku, “register”) while the menu screen is displayed on the display 305 as a result of step S402, the CPU 321 determines that the spoken “register” instruction has been given (step S403: Y) and executes the voice-based word registration mode (step S404). When the user then utters “オワリ” (owari, “end”), the CPU 321 detects a match with the registered content “オアイ” in the word dictionary 327 (see FIG. 16), determines that “オワリ” has been uttered (step S405: Y), and ends the word registration mode (END).
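The reduction and lookup steps described for “メニュー”, “トウロク”, and “オワリ” can be sketched as below. The registered contents “オウオウ” and “オアイ” are taken from the description; the entry “○エイウウ” for “メニュー” is an assumption derived by removing the “△” marks from its analysis result, and the dictionary layout, function names, and operation labels are hypothetical.

```python
CONSONANT_MARK = "△"

# Hypothetical system dictionary: registered content -> operation label.
SYSTEM_DICTIONARY = {
    "○エイウウ": "menu",     # assumed reduction of "○△エ△イ△ウウ"
    "オウオウ": "register",  # registered content given for "トウロク"
    "オアイ": "end",         # registered content given for "オワリ"
}


def reduce_pattern(analysis):
    """Drop consonant marks, keeping vowels and bilabial marks."""
    return analysis.replace(CONSONANT_MARK, "")


def look_up(analysis, dictionary=SYSTEM_DICTIONARY):
    """Return the operation for an analysis result, or None if unregistered."""
    return dictionary.get(reduce_pattern(analysis))
```

For example, `look_up("△オウ△オ△ウ")` reduces the analysis result to “オウオウ” and returns the label for the register operation, while an unregistered pattern returns `None`, which corresponds to the case where the menu screen simply remains displayed.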

If, on the other hand, the user utters a word other than “トウロク” while the menu screen is displayed on the display 305 (step S403: N), and the word is identified as designating a mode other than the word registration mode (step S406: Y), that mode is executed. Otherwise, for example when the user clears their throat or utters a word that matches nothing registered in the word dictionary 327 (step S406: N), the menu screen remains displayed (step S402). Depending on the system, the user may of course be allowed to end the display of the menu screen by uttering “オワリ” (owari, “end”) in this state.

The word dictionary 327 shown in FIG. 16 also contains utterances registered in the user dictionary, such as “タブイチ” (tabu ichi, “tab 1”) and “タブニ” (tabu ni, “tab 2”). The following describes how the user 351 uses such a word dictionary 327 to operate tabs displayed on the display 305.

FIG. 17 shows an example of the content displayed on the display. The description also refers to FIG. 12.

At the point when a certain operation has been performed, the display 305 shows a stack of overlapping windows, each with a tab. In FIG. 17, “tab 1” has been selected, so the first window is in front and its content is displayed. To move from this state to the second window of “tab 2” by hand, the user would select the “2” key among the dial keys 307C shown in FIG. 11, or move a cursor (not shown) to the numeral “2” and click. With the mobile phone 300 of this embodiment, a tab can also be selected by utterance.

FIG. 18 shows the flow of processing when the user operates the content displayed on the display by utterance. This processing is performed as one of the “other instructions” in step S406 of FIG. 14. The description also refers to FIG. 12 and FIGS. 15 to 17.

Suppose that the content of FIG. 17 is displayed on the display 305 and that the user 351 has selected operation of the displayed content by utterance as one of the “other instructions” in step S406 of FIG. 14. In this state, the image data captured by the imaging device 306 is stored endlessly in a predetermined ring memory area of the image memory 329, the lip image region extraction unit 330 extracts the lip image region from each image in turn, and the vowel discrimination unit 333 checks whether a vowel is identified (step S421).

When analysis of the opening of the lips 353 detects one of the vowels (step S421: Y), the word image extraction unit 331 goes back through the images stored in the ring memory area of the image memory 329 by a time t₁ and starts cutting out the word from that point (step S422). Here, the time t₁ is long enough to confirm whether or not the sound containing that vowel is a bilabial sound.

At the same time, the word image extraction unit 331 examines the ring memory area in order from the point at which cutting out started, and checks whether the next vowel is detected (identified) within a time t₂ of the detection (identification) of the preceding vowel (steps S423 and S424). Here, the time t₂ is the maximum pause normally expected between sounds when a word of several sounds is uttered, plus a predetermined margin. If the word image extraction unit 331 identifies the next vowel before the time t₂ has elapsed (step S423: Y), it repeats the check for whether a further vowel is identified within the time t₂.

If, on the other hand, the next vowel is not identified before the time t₂ has elapsed (step S423: N, step S424: Y), the word image extraction unit 331 ends the cutting out of the word image at that point (step S425). The word discrimination unit 328 then performs the word processing described with reference to FIG. 14 and obtains content in the same form as the “registered content” of the word dictionary 327 (step S426).
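The cutting-out logic of steps S421 to S425 can be sketched on a list of vowel detection times. This is an illustration under stated assumptions rather than the unit's actual implementation: the function name, the millisecond values chosen for t₁ and t₂, and the use of timestamps in place of a ring memory are all introduced here.

```python
def segment_words(vowel_times_ms, t1_ms=200, t2_ms=800):
    """Group vowel detection times into (start, end) word windows.

    A window starts t1 before its first vowel, so that a preceding lip closure
    is included, and ends t2 after its last vowel, when the wait for a further
    vowel times out. t1_ms and t2_ms are hypothetical values.
    """
    segments = []
    start = last = None
    for t in vowel_times_ms:
        if start is None:
            start, last = max(0, t - t1_ms), t       # S422: look back by t1
        elif t - last <= t2_ms:
            last = t                                 # S423: Y, same word continues
        else:
            segments.append((start, last + t2_ms))   # S424: Y, S425: close the window
            start, last = max(0, t - t1_ms), t
    if start is not None:
        segments.append((start, last + t2_ms))
    return segments
```

Each returned window marks the span of stored images that would be handed on for word processing in step S426.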

Suppose, for example, that the images of the user 351 uttering “タブニ” (tabu ni, “tab 2”) have been cut out by the lip image region extraction unit 330. In this case the processing of step S426 yields the result “ア○ウイ”. The CPU 321 searches the registered content of the word dictionary 327 (step S427), and if a matching entry exists (step S428: Y), it executes the operation associated with that entry (step S429) and the processing returns to step S421 (RETURN). In this example, the operation in step S429 changes the display from the state of FIG. 17, in which the first window is shown, to a state in which the second window is selected instead as a result of selecting “tab 2”.

If the search finds no entry in the word dictionary 327 that matches the processing result “ア○ウイ” (step S428: N), an error message such as “operation not recognized” appears on the display 305 for a fixed period (step S430), and the processing then returns to step S421 (RETURN). The user 351 can then try the spoken operation again. Depending on the system, when the search in step S427 finds no matching entry, the word whose individual vowels and bilabial positions agree most closely may of course be displayed as a candidate for the user 351 to confirm, or, when the degree of agreement exceeds a predetermined threshold, the word may be treated as a match and the corresponding operation executed.
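The optional closest-match behaviour can be sketched with Python's standard `difflib.SequenceMatcher`. The description does not define how the degree of agreement is computed, so the similarity ratio used here is a swapped-in measure; the function name, the threshold, and the dictionary entries in the usage note are likewise assumptions.

```python
from difflib import SequenceMatcher


def best_match(pattern, dictionary, threshold=0.8):
    """Return (operation, score, accepted) for the closest registered pattern.

    accepted is True for an exact match or when the similarity score reaches
    the (hypothetical) threshold; otherwise the result is only a candidate
    that the user would be asked to confirm.
    """
    if pattern in dictionary:
        return dictionary[pattern], 1.0, True
    best_key, best_score = None, 0.0
    for key in dictionary:
        score = SequenceMatcher(None, pattern, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is None:
        return None, 0.0, False
    return dictionary[best_key], best_score, best_score >= threshold
```

For instance, with illustrative entries `{"ア○ウイイ": "tab 1", "ア○ウイ": "tab 2", "ア○ウアア": "tab 3"}`, the misread pattern “ア○ウオ” is closest to the “tab 2” entry but falls below the default threshold, so it would be shown as a candidate rather than executed.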

Next, the case in which the user registers a word needed for an operation is described in detail. Suppose that the word dictionary 327 shown in FIG. 16 contains the words from “タブイチ” (tabu ichi, “tab 1”) to “タブサン” (tabu san, “tab 3”), which instruct opening of the first to third windows, and that no word instructing opening of the fourth window has yet been registered. In this case the user 351 (FIG. 15) cannot open the fourth window on the display 305 shown in FIG. 17 by spoken instruction.

FIG. 19 shows in detail the flow of the word registration processing of step S404 in FIG. 14. The description also refers to FIGS. 12, 15, and 16.

When the word registration mode of step S404 in FIG. 14 starts, the CPU 321 displays on the display 305 a screen for selecting the operation for which a word is to be registered (step S441). With users who have limited use of their hands in mind, this screen lets any operation relating to the display 305 be chosen from a set of alternatives. To request the next set of displayed content, the user 351 utters “ツギ” (tsugi, “next”) (step S442: Y). In this case the CPU 321 switches to the next selection screen (step S443) and returns to the display state of step S441. To request the content displayed previously, the user 351 utters “マエ” (mae, “previous”) (step S442: N, step S444: Y). In this case the CPU 321 switches to the previous selection screen (step S445) and returns to the display state of step S441.

When the desired operation for word registration eventually appears on the display 305, the user 351 utters “センタク” (sentaku, “select”) to select that operation as the target of word registration. On recognizing this (step S442: N, step S444: N, step S446: Y), the CPU 321 displays a prompt for voice input on the display 305 (step S447). Once a time t₃ sufficient for inputting a word has elapsed since the prompt (step S448: Y), the word image extraction unit 331 extracts the images constituting the word from the images extracted during that period by the lip image region extraction unit 330.

The word discrimination unit 328 performs the word processing described with reference to FIG. 14 on the extracted images and obtains content in the same form as the “registered content” of the word dictionary 327 (step S449). The CPU 321 searches the registered content of the word dictionary 327 for the obtained content (step S450), and if there is no match (step S451: N), registers the content of the sound uttered by the user 351 as a word (step S452) and ends the processing (END).

If, on the other hand, the result of the word processing in step S449 matches any of the registered content in the word dictionary 327 (step S451: Y), an error is displayed on the display 305 to prevent duplicate registration (step S453). The user 351 then chooses by utterance either to try the registration again or to abandon it (steps S454 and S455). If it is determined that the user 351 has uttered “トウロク” (touroku, “register”) in order to register (step S454: Y), the CPU 321 returns the processing to step S447 and waits for voice input again.

Suppose, for example, that in order to register a word for opening the fourth window by spoken instruction, the user 351 first utters “タブシ” (tabu shi, “tab 4”). The result of the word processing in step S449 is then “ア○ウイ”, which matches the registered content for “タブニ” (tabu ni, “tab 2”). “タブシ” therefore coincides with content already registered in the word dictionary 327 (step S451: Y) and cannot be registered (step S453).

Suppose that the user 351 then chooses to try the registration again (step S454: Y) and utters the word with a different reading, “タブヨン” (tabu yon, “tab 4”). The result of the word processing in step S449 is now “ア○ウオオ”, and the user 351 can register the word.
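The duplicate check of steps S450 to S453 can be illustrated with a small sketch. The class name `UserDictionary`, its method, and the operation labels are hypothetical; only the patterns “ア○ウイ” and “ア○ウオオ” come from the description.

```python
class UserDictionary:
    """Hypothetical store mapping a registered pattern to an operation."""

    def __init__(self, entries=None):
        self._entries = dict(entries or {})

    def register(self, pattern, operation):
        """Register a pattern; return False if it is already taken (S451: Y)."""
        if pattern in self._entries:
            return False                   # duplicate: caller shows the error (S453)
        self._entries[pattern] = operation
        return True                        # new entry stored (S452)

    def operation_for(self, pattern):
        """Return the operation registered for a pattern, or None."""
        return self._entries.get(pattern)
```

With “ア○ウイ” already registered for “タブニ”, an attempt to register “タブシ” is rejected because it reduces to the same pattern, while “タブヨン” reduces to “ア○ウオオ” and is accepted.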

When the error is displayed, the user 351 can also give up and utter “オワリ” (owari, “end”). In this case, when the CPU 321 identifies “オワリ” (step S454: N, step S455: Y), the word registration processing ends (END). The same applies when the user 351 utters “オワリ” while the operation selection screen of step S441 is displayed (step S442: N, step S444: N, step S446: N, step S456: Y) (END).

According to the embodiment described above, words uttered by the user 351 are compared after being characterized by their sequence of vowels and bilabial sounds. A large number of words that are easy for the user 351 to remember can therefore be registered, and many display operations can be performed through the shape of the lip opening formed when producing sounds. Moreover, the user 351 can perform display operations simply by changing the shape of the lip opening without necessarily producing any sound, and so does not disturb other people with noise.

In the embodiment, the sound “ン” (n) is treated as the same vowel as the sound uttered before it, but it may instead be treated as the same vowel as the sound uttered after it, or classified as an indeterminate sound “−” that corresponds to none of the vowels. The processing in this embodiment has also been described on the assumption that the user 351 does not close the lips before the start or at the end of a word; for a person who always closes the lips, removing one “○” mark from each end of the word allows the data to be processed in the same way as for a person who does not.
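Removing one “○” mark from each end of a pattern, as described for users who always close their lips, can be sketched as follows. The function name is an assumption, and the sketch presumes the pattern has already had its consonant marks removed.

```python
def strip_boundary_closures(pattern, closed="○"):
    """Remove at most one closure mark from each end of a reduced pattern."""
    if pattern.startswith(closed):
        pattern = pattern[len(closed):]    # closure before the utterance began
    if pattern.endswith(closed):
        pattern = pattern[:-len(closed)]   # closure after the utterance ended
    return pattern
```

Because only one mark is removed per end, a word that genuinely begins with a bilabial sound keeps its own “○”: a resting closure followed by “メニュー” would appear as “○○エイウウ○” and be reduced to “○エイウウ”.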

<First Modification of the Invention>

In the embodiment described above, words are uttered naturally, decomposed into sequences of vowels and bilabial sounds, registered in the dictionary, and associated with display operations; however, the invention is not limited to this. As a simpler utterance-based operation, each sound making up a word may be uttered one at a time, separated by opening and closing the lips, and the resulting combination of vowels may be associated with a display operation. In this case the variety of words that can be registered is somewhat limited, but by making good use of the registered words the content shown on the display can still be operated adequately.

FIG. 20 shows how the image changes when the user, as the speaker, utters a word one sound at a time. When uttering "migi" (right) as a word used for an operation, in this first modification "mi" and "gi" are pronounced one at a time with an interval between them. FIG. 20(A) shows the image before "mi" is uttered; the lips 353 of the user 351 are closed.

FIG. 20(B) shows the shape of the lips during utterance; the lips 353 of the user 351 are open to a degree that depends on the sound being uttered. When the utterance of one sound ends, in this first modification the user 351 closes the lips 353 and returns to the state of FIG. 20(A).

Because the mouth is opened and closed each time a sound is uttered, the open-mouth image of FIG. 20(C) and the closed-mouth image of FIG. 20(A) are repeated as many times as there are sounds in the word. Of course, the user 351 may simply open and close the lips 353 without any utterance, and in that case too the screen display operation device of the first modification can perform various operations on the display.

FIG. 21 illustrates a first example of display operation in the first modification. It is described together with FIG. 20.

A first window 501 and a second window 502 are displayed on the display 305. In such a case the user 351 first performs an operation to make one of the windows active and then performs data processing on that window. The simple operation of switching between the first window 501 and the second window 502 can be achieved by determining the opening and closing of the mouth shown in FIGS. 20(A) and 20(C).

FIG. 22 shows the switching control between the active and inactive states of the first and second windows. It is described together with FIGS. 20 and 21.

The first state shown in FIG. 22[A] is taken as the initial state. In this initial state the first window 501 is active and can be operated, while the second window 502 is inactive and cannot be operated. When the user 351 opens and closes the mouth once in this state, as shown in FIGS. 20(A) and 20(C), the first window 501 becomes inactive and the second window 502 becomes active instead, as shown in FIG. 22[B]. When the user 351 opens and closes the mouth once while the second window 502 is active as in FIG. 22[B], the first window 501 becomes active again, as shown in FIG. 22[A], and the second window 502 becomes inactive.

Thus, when two operation items exist on the display 305 and one of them must be selected, in this first modification the selection toggles each time the user 351 opens and closes the mouth. The screen can therefore be operated very easily.
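The toggling behaviour can be illustrated with a short sketch. It assumes, purely for illustration, that an upstream image-processing stage has already reduced each video frame to a boolean (True when the mouth is open); the function names are hypothetical and the reference numerals 501 and 502 are used only as window identifiers.

```python
def count_open_close_cycles(mouth_open_frames):
    """Count completed open-then-close cycles in a per-frame boolean sequence."""
    cycles = 0
    was_open = False
    for is_open in mouth_open_frames:
        if was_open and not is_open:
            cycles += 1  # the mouth has just closed after being open
        was_open = is_open
    return cycles


def active_window(initial, cycles):
    """Return the active window after the given number of open/close cycles."""
    windows = (501, 502)
    return windows[(windows.index(initial) + cycles) % 2]
```

With the first window initially active, one cycle selects the second window and a second cycle returns to the first, matching the alternation between FIG. 22[A] and FIG. 22[B].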

However, opening and closing the mouth alone severely limits the variety of words that can designate operations, which makes control such as activating one of the plural windows shown in FIG. 17 difficult. In such cases, operation based on determining the shape of the opening of the lips 353 during word utterance, as shown in FIGS. 20(A) and 20(B), is effective.

FIG. 23 illustrates the selection of "Tab 1" through "Tab 4" described with FIG. 17, as a second example in the first modification. It is described together with FIGS. 17 and 20.

The state in which the first window is displayed in front, as shown in FIG. 17, is taken as the first state. The first window is then active, and the user can perform more detailed operations on it. Suppose that in this state the user 351 utters "mi", "gi" (right), opening and closing the mouth for each sound. The vowels are then determined by the recognition described in the preceding embodiment, giving the determination result "i, i". Because the mouth is opened and closed for every sound in this first modification, the notation for bilabial sounds is omitted.

Each time the determination result "i, i" is obtained, the screen display operation device of the first modification advances the tab selection one position to the right, cycling endlessly. Thus, when the user 351 utters "mi", "gi" (right) in the first state, the device enters the second state: "Tab 2" is selected and the second window becomes active. When the user 351 utters "mi", "gi" (right) in the second state, the device enters the third state: "Tab 3" is selected and the third window becomes active. When the user 351 utters "mi", "gi" (right) in the third state, the device enters the fourth state: "Tab 4" is selected and the fourth window becomes active. When the user 351 utters "mi", "gi" (right) in the fourth state, the device returns to the first state: "Tab 1" is selected and the first window becomes active. The cycle continues in the same way.

Likewise, suppose that the user 351 utters "hi", "da", "ri" (left), opening and closing the mouth for each sound, while the first window is active. The vowels are determined by the recognition described in the preceding embodiment, giving the determination result "i, a, i".

Each time the determination result "i, a, i" is obtained, the screen display operation device of the first modification advances the tab selection one position to the left, cycling endlessly. Thus, when the user 351 utters "hi", "da", "ri" (left) in the first state, the device enters the fourth state: "Tab 4" is selected and the fourth window becomes active. When the user 351 utters "hi", "da", "ri" (left) in the fourth state, the device enters the third state: "Tab 3" is selected and the third window becomes active. When the user 351 utters "hi", "da", "ri" (left) in the third state, the device enters the second state: "Tab 2" is selected and the second window becomes active. When the user 351 utters "hi", "da", "ri" (left) in the second state, the device returns to the first state: "Tab 1" is selected and the first window becomes active. The cycle continues in the same way.
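The endless rightward and leftward tab selection amounts to modular arithmetic over the four tabs. The following is a minimal sketch under the assumption that the recognizer delivers the determined vowels as a sequence of romanized strings; the table `STEP_BY_VOWELS` and the function `next_tab` are hypothetical names introduced for illustration.

```python
NUM_TABS = 4

# Vowel sequence -> direction of movement (+1 = right, -1 = left).
STEP_BY_VOWELS = {
    ("i", "i"): +1,       # "mi", "gi" (right)
    ("i", "a", "i"): -1,  # "hi", "da", "ri" (left)
}


def next_tab(current_tab, vowels):
    """Return the tab (1..NUM_TABS) selected after a recognized vowel sequence.

    Unregistered vowel sequences leave the selection unchanged.
    """
    step = STEP_BY_VOWELS.get(tuple(vowels))
    if step is None:
        return current_tab
    # Shift to 0-based, wrap around, shift back to 1-based.
    return (current_tab - 1 + step) % NUM_TABS + 1
```

Here `next_tab(4, ["i", "i"])` wraps around to tab 1, and `next_tab(1, ["i", "a", "i"])` wraps around to tab 4, as in the state transitions described above.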

According to the first modification of the invention described above, the user 351 opens and closes the mouth for each sound uttered, which makes it easy to determine each vowel making up a word. Therefore, if mouth open/close actions or vowel combinations are registered in advance for typical display operations, such as switching screens (windows), moving the screen, and turning the backlight on and off, various operations can be performed without using the hands. This has the advantage that various operations remain possible even when the hands are occupied, as long as the user can face the screen display operation device.

Moreover, by registering operation patterns that the user 351 finds intuitive, such as "mi", "gi" (right) and "hi", "da", "ri" (left), the device provides an easy-to-use interface that anyone can use casually.

<Second Modification of the Invention>

FIG. 24 shows a screen display operation device according to a second modification of the invention. The screen display operation device of this second modification is a notebook personal computer 600. The personal computer 600 has a structure in which an imaging device 604 is mounted, together with a display 603, on the inner surface of a lid 602 that is attached to the device body 601 so that it can be opened and closed. The user can therefore not only operate the key operation unit 605 by hand but also perform various operations through recognition, using the imaging device 604, of the opening and closing of the mouth and of the shape of the lip opening during word utterance.

Thus the screen display operation device of the invention need not be limited to small information processing devices such as mobile phones and PHS (Personal Handy-phone System) terminals; it is applicable to any information processing device that can use an imaging device.

The embodiment and modifications of the invention describe sound determination based on the characteristics of vowels and bilabial sounds in Japanese, but the invention can naturally also be applied to other languages that differ in the kinds and number of vowels.

Some or all of the embodiments described above may also be described as in the following supplementary notes (appendices), but are not limited to them.

(Appendix 1)
An information processing apparatus comprising:
a display that visually displays various kinds of information;
imaging means for photographing at least the mouth of an operator using the display;
change determination means for determining changes over time in the image of the operator's mouth obtained by the imaging means; and
specific operation execution means for executing, according to the determination result of the change determination means, a predetermined specific operation associated with that determination result.

(Appendix 2)
The information processing apparatus according to Appendix 1, wherein the change determination means determines changes in the opening and closing of the operator's mouth.

(Appendix 3)
The information processing apparatus according to Appendix 1, wherein the change determination means determines the characteristic shape that the opening of the operator's lips forms for each sound during utterance.

(Appendix 4)
The information processing apparatus according to Appendix 3, wherein the characteristic shape formed for each sound is the shape formed when a vowel is uttered.

(Appendix 5)
A word discrimination device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
word utterance pattern recording means for recording, as the word utterance pattern produced when one unit of word is uttered, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means, taking the seam formed when the upper and lower lips are closed to lie horizontally overall;
a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
pattern comparison means for comparing the word utterance pattern to be recognized, recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary; and
word discrimination means for determining that the word corresponding to the word utterance pattern judged to match best in the comparison by the pattern comparison means is the word uttered by the person to be recognized.
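The comparison against the word dictionary can be sketched as a nearest-pattern search. The text does not specify how "best match" is measured, so the mean squared difference used here is an assumption, as are the representation of a word utterance pattern as a list of (vertical, horizontal) distances per frame, the requirement that patterns have been resampled to equal length beforehand, and the function names.

```python
def pattern_distance(pattern_a, pattern_b):
    """Mean squared difference between two equal-length utterance patterns.

    Each pattern is a list of (vertical, horizontal) lip-opening distances,
    one pair per frame. This distance measure is an illustrative assumption.
    """
    if not pattern_a or len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must be non-empty and of equal length")
    total = sum(
        (va - vb) ** 2 + (ha - hb) ** 2
        for (va, ha), (vb, hb) in zip(pattern_a, pattern_b)
    )
    return total / len(pattern_a)


def best_matching_word(observed, word_dictionary):
    """Return the dictionary word whose stored pattern is closest to `observed`."""
    return min(
        word_dictionary,
        key=lambda word: pattern_distance(observed, word_dictionary[word]),
    )
```

A dictionary here is simply a mapping from a word to its stored pattern; a practical system would probably also reject matches whose distance exceeds some limit, which this sketch omits.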

(Appendix 6)
A word discrimination device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
closing timing measurement means for measuring the timings at which the upper and lower lips are closed in the word image extracted by the word image extraction means;
vowel determination means for determining each vowel making up the word from lip opening shapes that remain the same for at least a predetermined time in the word image extracted by the word image extraction means;
a word dictionary prepared in advance in which combinations of the timings measured by the closing timing measurement means or equivalent means and the vowels determined by the vowel determination means or equivalent means are each associated with one of a plurality of words;
comparison means for comparing the combination of the timings measured by the closing timing measurement means and the vowels determined by the vowel determination means for the word image extracted by the word image extraction means with the combinations in the word dictionary; and
word discrimination means for determining that the word corresponding to the combination in the word dictionary judged by the comparison means to match best is the word uttered by the person to be recognized.

(Appendix 7)
The word discrimination device according to Appendix 6, wherein, when the same lip opening shape continues for the predetermined time or longer, the vowel determination means determines that the same vowel has been uttered again each time a specific time exceeding the predetermined time elapses.
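One possible reading of this repeated-vowel rule is sketched below. The interpretation is an assumption: a shape held for at least the predetermined time counts as one vowel, and one further occurrence of the same vowel is counted each time the longer specific time elapses. The parameter names and the use of integer milliseconds are also illustrative.

```python
def count_vowel_occurrences(duration_ms, min_hold_ms, repeat_ms):
    """Number of times the same vowel is taken to have been uttered.

    duration_ms: how long the same lip-opening shape continued
    min_hold_ms: the predetermined time needed to recognize a vowel at all
    repeat_ms:   the specific time (longer than min_hold_ms) after which the
                 vowel is counted again
    """
    if repeat_ms <= min_hold_ms:
        raise ValueError("repeat_ms must exceed min_hold_ms")
    if duration_ms < min_hold_ms:
        return 0  # too short to be recognized as a vowel
    return 1 + duration_ms // repeat_ms
```

Under this reading, with a predetermined time of 100 ms and a specific time of 400 ms, a shape held for 900 ms is counted as the same vowel uttered three times.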

(Appendix 8)
A screen display operation device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
word utterance pattern recording means for recording, as the word utterance pattern produced when one unit of word is uttered, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means, taking the seam formed when the upper and lower lips are closed to lie horizontally overall;
a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
pattern comparison means for comparing the word utterance pattern to be recognized, recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary;
word discrimination means for determining that the word corresponding to the word utterance pattern judged to match best in the comparison by the pattern comparison means is the word uttered by the person to be recognized;
a display that displays various kinds of information; and
content operation means for operating the content displayed on the display with the operation corresponding to the word determined by the word discrimination means.

(Appendix 9)
A screen display operation device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
closing timing measurement means for measuring the timings at which the upper and lower lips are closed in the word image extracted by the word image extraction means;
vowel determination means for determining each vowel making up the word from lip opening shapes that remain the same for at least a predetermined time in the word image extracted by the word image extraction means;
a word dictionary prepared in advance in which combinations of the timings measured by the closing timing measurement means or equivalent means and the vowels determined by the vowel determination means or equivalent means are each associated with one of a plurality of words;
comparison means for comparing the combination of the timings measured by the closing timing measurement means and the vowels determined by the vowel determination means for the word image extracted by the word image extraction means with the combinations in the word dictionary;
word discrimination means for determining that the word corresponding to the combination in the word dictionary judged by the comparison means to match best is the word uttered by the person to be recognized;
a display that displays various kinds of information; and
content operation means for operating the content displayed on the display with the operation corresponding to the word determined by the word discrimination means.

(Appendix 10)
The screen display operation device according to Appendix 9, wherein, when the same lip opening shape continues for the predetermined time or longer, the vowel determination means determines that the same vowel has been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 11)
A word registration device comprising:
lip image region extraction means for extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
registration target word image extraction means for extracting, as the word image produced when the word to be registered is uttered, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
word utterance pattern recording means for recording, as the word utterance pattern produced when one unit of word is uttered, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the registration target word image extraction means, taking the seam formed when the upper and lower lips are closed to lie horizontally overall;
a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
pattern comparison means for comparing the word utterance pattern to be registered, recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary; and
registered word acceptability determination means for making registrable only those unregistered words whose word utterance pattern is judged, in the comparison by the pattern comparison means, not to be similar by a predetermined value or more.

(Appendix 12)
A word registration device comprising:
lip image region extraction means for extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
registration target word image extraction means for extracting, as the word image produced when the word to be registered is uttered, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted by the lip image region extraction means;
closing timing measurement means for measuring the timings at which the upper and lower lips are closed in the word image extracted by the registration target word image extraction means;
vowel determination means for determining each vowel making up the word from lip opening shapes that remain the same for at least a predetermined time in the word image extracted by the registration target word image extraction means;
a word dictionary prepared in advance in which combinations of the timings measured by the closing timing measurement means or equivalent means and the vowels determined by the vowel determination means or equivalent means are each associated with one of a plurality of words;
comparison means for comparing the combination of the timings measured by the closing timing measurement means and the vowels determined by the vowel determination means for the word image extracted by the registration target word image extraction means with the combinations in the word dictionary; and
registered word acceptability determination means for making registrable only those unregistered words whose word utterance pattern is judged, in the comparison by the comparison means, not to be similar by a predetermined value or more.
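The registrability check can be sketched as a similarity threshold over the existing dictionary entries. The text does not define the similarity measure; this sketch assumes that each entry is a sequence of symbols (romanized vowels and "○" for a lip closure) and borrows the ratio computed by Python's standard `difflib.SequenceMatcher` as a stand-in. The function name and the threshold parameter are hypothetical.

```python
from difflib import SequenceMatcher


def can_register(candidate, word_dictionary, similarity_limit):
    """Return True if `candidate` is not too similar to any registered entry.

    candidate:        sequence of symbols for the word to be registered
    word_dictionary:  mapping of registered word -> symbol sequence
    similarity_limit: entries at or above this ratio (0.0 to 1.0) block
                      registration
    """
    for registered in word_dictionary.values():
        ratio = SequenceMatcher(None, list(candidate), list(registered)).ratio()
        if ratio >= similarity_limit:
            return False  # approximates an existing entry too closely
    return True
```

With an empty dictionary every candidate is registrable; an exact duplicate of a registered sequence has ratio 1.0 and is rejected for any limit up to 1.0.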

(Appendix 13)
The word registration device according to Appendix 12, wherein, when the same lip opening shape continues for the predetermined time or longer, the vowel determination means determines that the same vowel has been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 14)
An information processing method comprising:
an imaging step of photographing at least the mouth of an operator using a display that visually displays various kinds of information;
a change determination step of determining changes over time in the image of the operator's mouth obtained in the imaging step; and
a specific operation execution step of executing, according to the determination result of the change determination step, a predetermined specific operation associated with that determination result.

(Appendix 15)
A word discrimination method comprising:
a lip image region extraction step of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction step of extracting, as one unit of word image, the series of changes over time from the start to the end of change of the lip image in the lip image region extracted in the lip image region extraction step;
a word utterance pattern recording step of recording, as the word utterance pattern produced when one unit of word is uttered, the changes over time in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction step, taking the seam formed when the upper and lower lips are closed to lie horizontally overall;
a pattern comparison step of comparing the word utterance pattern to be recognized, recorded in the word utterance pattern recording step, with the word utterance pattern of each word in a word dictionary in which patterns were registered in advance, each in association with a word, by the word utterance pattern recording step; and
a word discrimination step of determining that the word corresponding to the word utterance pattern judged to match best in the comparison of the pattern comparison step is the word uttered by the person to be recognized.

(Appendix 16)
A word discrimination method comprising:
a lip image region extraction step of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction step of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction step;
a closing timing measurement step of measuring the timings at which the upper and lower lips close in the word image extracted in the word image extraction step;
a vowel discrimination step of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the word image extraction step;
a comparison step of comparing the combination of the timings measured in the closing timing measurement step and the vowels determined in the vowel discrimination step for the word image extracted in the word image extraction step with the combinations in a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement step and vowels determined in the vowel discrimination step was registered in advance; and
a word discrimination step of determining that the word corresponding to the combination in the word dictionary judged to be the best match in the comparison step is the word uttered by the person to be recognized.
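As a rough illustration of the comparison step above, the sketch below stores each dictionary entry as a pair of lip-closing times and a vowel sequence, and picks the entry with the lowest combined mismatch. The cost function, the use of seconds from the start of the word, and the Latin-letter vowel labels are all assumptions made for the example; the text only requires that the best-matching combination be found.

```python
from typing import Dict, List, Tuple

# (lip-closing times in seconds from the start of the word, vowel sequence)
Combination = Tuple[List[float], str]


def combination_distance(a: Combination, b: Combination) -> float:
    """Assumed mismatch score: differing vowels plus differences in closing times."""
    timings_a, vowels_a = a
    timings_b, vowels_b = b
    vowel_cost = sum(x != y for x, y in zip(vowels_a, vowels_b))
    vowel_cost += abs(len(vowels_a) - len(vowels_b))
    timing_cost = sum(abs(x - y) for x, y in zip(timings_a, timings_b))
    timing_cost += abs(len(timings_a) - len(timings_b))
    return vowel_cost + timing_cost


def discriminate_by_combination(observed: Combination,
                                dictionary: Dict[str, Combination]) -> str:
    """Return the word whose registered combination best matches the observed one."""
    return min(dictionary, key=lambda w: combination_distance(observed, dictionary[w]))


# Hypothetical entries: "mama" and "mimi" close the lips twice, "ai" never does.
combo_dictionary = {
    "mama": ([0.0, 0.4], "aa"),
    "mimi": ([0.0, 0.4], "ii"),
    "ai": ([], "ai"),
}
result = discriminate_by_combination(([0.05, 0.42], "aa"), combo_dictionary)
```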

(Appendix 17)
The word discrimination method according to Appendix 16, wherein, in the vowel discrimination step, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.
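One possible reading of this repetition rule is sketched below. It assumes durations are measured in whole milliseconds from the moment the shape first appears, that the shape counts as one vowel once it has lasted the predetermined time, and that one further repetition is counted for every full specific time that has elapsed. The function and parameter names are hypothetical, and other readings of the rule are possible.

```python
def count_vowel_repeats(duration_ms: int, predetermined_ms: int, specific_ms: int) -> int:
    """Count how many times the same vowel is taken to have been uttered.

    duration_ms      -- how long the same lip opening shape persisted
    predetermined_ms -- minimum duration for the shape to count as a vowel at all
    specific_ms      -- repetition interval; must exceed predetermined_ms
    """
    if specific_ms <= predetermined_ms:
        raise ValueError("specific_ms must exceed predetermined_ms")
    if duration_ms < predetermined_ms:
        return 0  # shape too brief to be treated as a vowel
    # One vowel for reaching the predetermined time, plus one per elapsed specific time.
    return 1 + duration_ms // specific_ms
```

With a predetermined time of 100 ms and a specific time of 300 ms, a shape held for 700 ms would be read as the same vowel three times under this interpretation.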

(Appendix 18)
A screen display operation method comprising:
a lip image region extraction step of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction step of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction step;
a word utterance pattern recording step of recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction step, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a pattern comparison step of comparing the word utterance pattern to be recognized, recorded in the word utterance pattern recording step, with the word utterance pattern of each word in a word dictionary whose patterns were registered in advance, each in association with its word, by the word utterance pattern recording step;
a word discrimination step of determining that the word corresponding to the word utterance pattern judged to be the best match in the pattern comparison step is the word uttered by the person to be recognized; and
a content operation step of operating the display contents shown on a display that displays various types of information, using the operation associated with the word determined in the word discrimination step.
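The content operation step amounts to a lookup from a discriminated word to an operation on the display. The sketch below shows one way to express that. The `Display` class, the command words, and the operations themselves are hypothetical and chosen only for illustration; the text does not list specific words or operations here.

```python
from typing import Callable, Dict


class Display:
    """Hypothetical stand-in for the display whose contents are operated on."""

    def __init__(self) -> None:
        self.scroll_line = 0
        self.active_window = 1


def scroll_down(display: Display) -> None:
    display.scroll_line += 1


def scroll_up(display: Display) -> None:
    display.scroll_line = max(0, display.scroll_line - 1)


def switch_window(display: Display) -> None:
    display.active_window = 2 if display.active_window == 1 else 1


# Assumed association between discriminated words and operations.
OPERATIONS: Dict[str, Callable[[Display], None]] = {
    "down": scroll_down,
    "up": scroll_up,
    "next": switch_window,
}


def operate(display: Display, word: str) -> bool:
    """Apply the operation associated with the word; return False if none is registered."""
    operation = OPERATIONS.get(word)
    if operation is None:
        return False
    operation(display)
    return True
```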

(Appendix 19)
A screen display operation method comprising:
a lip image region extraction step of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction step of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction step;
a closing timing measurement step of measuring the timings at which the upper and lower lips close in the word image extracted in the word image extraction step;
a vowel discrimination step of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the word image extraction step;
a comparison step of comparing the combination of the timings measured in the closing timing measurement step and the vowels determined in the vowel discrimination step for the word image extracted in the word image extraction step with a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement step and vowels determined in the vowel discrimination step was registered in advance;
a word discrimination step of determining that the word in the word dictionary judged to be the best match in the comparison step is the word uttered by the person to be recognized; and
a content operation step of operating the display contents shown on a display that displays various types of information, using the operation associated with the word determined in the word discrimination step.

(Appendix 20)
The screen display operation method according to Appendix 19, wherein, in the vowel discrimination step, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 21)
A word registration method comprising:
a lip image region extraction step of extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
a registration target word image extraction step of extracting, as the word image produced when the word to be registered is uttered, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction step;
a word utterance pattern recording step of recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the registration target word image extraction step, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a pattern comparison step of comparing the word utterance pattern to be registered, recorded in the word utterance pattern recording step, with the word utterance pattern of each word in a word dictionary whose patterns were registered in advance, each in association with its word, by the word utterance pattern recording step; and
a registered word acceptability determination step of permitting registration only of an unregistered word whose word utterance pattern is judged, in the pattern comparison step, not to approximate any registered pattern by a predetermined value or more.
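The acceptability check can be sketched as a threshold test against every registered pattern. The example assumes patterns are equal-length lists of (vertical, horizontal) distances and uses a mean absolute difference as the closeness measure, with a hypothetical `min_distance` standing in for the "predetermined value". A word is accepted only if it is not too close to any existing entry, so that two registered words cannot be confused during recognition.

```python
from typing import Dict, List, Tuple

Pattern = List[Tuple[float, float]]  # (vertical, horizontal) distances per frame


def mean_abs_difference(a: Pattern, b: Pattern) -> float:
    """Assumed closeness measure for two non-empty patterns of equal length."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of frames")
    total = sum(abs(va - vb) + abs(ha - hb) for (va, ha), (vb, hb) in zip(a, b))
    return total / len(a)


def can_register(candidate: Pattern, dictionary: Dict[str, Pattern],
                 min_distance: float) -> bool:
    """Allow registration only if no registered pattern lies within min_distance."""
    return all(mean_abs_difference(candidate, registered) >= min_distance
               for registered in dictionary.values())


registered_words = {"open": [(0.0, 2.0), (3.0, 2.0), (0.0, 2.0)]}
too_similar = [(0.0, 2.0), (2.9, 2.0), (0.0, 2.0)]
distinct = [(0.0, 2.0), (1.0, 4.0), (0.0, 2.0)]
```

Here `can_register(too_similar, registered_words, 0.5)` is rejected because the candidate nearly coincides with "open", while `distinct` is accepted.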

(Appendix 22)
A word registration method comprising:
a lip image region extraction step of extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
a registration target word image extraction step of extracting, as the word image produced when the word to be registered is uttered, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction step;
a closing timing measurement step of measuring the timings at which the upper and lower lips close in the word image extracted in the registration target word image extraction step;
a vowel discrimination step of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the registration target word image extraction step;
a comparison step of comparing the combination of the timings measured in the closing timing measurement step and the vowels determined in the vowel discrimination step for the word image extracted in the registration target word image extraction step with a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement step and vowels determined in the vowel discrimination step was registered in advance; and
a registered word acceptability determination step of permitting registration only of an unregistered word whose combination is judged, in the comparison step, not to approximate any registered combination by a predetermined value or more.

(Appendix 23)
The word registration method according to Appendix 22, wherein, in the vowel discrimination step, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 24)
An information processing program causing a computer to execute:
an imaging process of photographing at least the mouth of an operator using a display that visually displays various types of information;
a change discrimination process of determining temporal changes in the image of the operator's mouth obtained in the imaging process; and
a specific operation execution process of executing, according to the determination result of the change discrimination process, a predetermined specific operation associated with that result.

(Appendix 25)
A word discrimination program causing a computer to execute:
a lip image region extraction process of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction process of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a word utterance pattern recording process of recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction process, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a pattern comparison process of comparing the word utterance pattern to be recognized, recorded in the word utterance pattern recording process, with the word utterance pattern of each word in a word dictionary whose patterns were registered in advance, each in association with its word, by the word utterance pattern recording process; and
a word discrimination process of determining that the word corresponding to the word utterance pattern judged to be the best match in the pattern comparison process is the word uttered by the person to be recognized.

(Appendix 26)
A word discrimination program causing a computer to execute:
a lip image region extraction process of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction process of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a closing timing measurement process of measuring the timings at which the upper and lower lips close in the word image extracted in the word image extraction process;
a vowel discrimination process of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the word image extraction process;
a comparison process of comparing the combination of the timings measured in the closing timing measurement process and the vowels determined in the vowel discrimination process for the word image extracted in the word image extraction process with a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement process and vowels determined in the vowel discrimination process was registered in advance; and
a word discrimination process of determining that the word corresponding to the combination in the word dictionary judged to be the best match in the comparison process is the word uttered by the person to be recognized.

(Appendix 27)
The word discrimination program according to Appendix 26, wherein, in the vowel discrimination process, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 28)
A screen display operation program causing a computer to execute:
a lip image region extraction process of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction process of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a word utterance pattern recording process of recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the word image extraction process, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a pattern comparison process of comparing the word utterance pattern to be recognized, recorded in the word utterance pattern recording process, with the word utterance pattern of each word in a word dictionary whose patterns were registered in advance, each in association with its word, by the word utterance pattern recording process;
a word discrimination process of determining that the word corresponding to the word utterance pattern judged to be the best match in the pattern comparison process is the word uttered by the person to be recognized; and
a content operation process of operating the display contents shown on a display that displays various types of information, using the operation associated with the word determined in the word discrimination process.

(Appendix 29)
A screen display operation program causing a computer to execute:
a lip image region extraction process of extracting an image of the lip region from an image of the face of a person to be recognized;
a word image extraction process of extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a closing timing measurement process of measuring the timings at which the upper and lower lips close in the word image extracted in the word image extraction process;
a vowel discrimination process of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the word image extraction process;
a comparison process of comparing the combination of the timings measured in the closing timing measurement process and the vowels determined in the vowel discrimination process for the word image extracted in the word image extraction process with a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement process and vowels determined in the vowel discrimination process was registered in advance;
a word discrimination process of determining that the word in the word dictionary judged to be the best match in the comparison process is the word uttered by the person to be recognized; and
a content operation process of operating the display contents shown on a display that displays various types of information, using the operation associated with the word determined in the word discrimination process.

(Appendix 30)
The screen display operation program according to Appendix 29, wherein, in the vowel discrimination process, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.

(Appendix 31)
A word registration program causing a computer to execute:
a lip image region extraction process of extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
a registration target word image extraction process of extracting, as the word image produced when the word to be registered is uttered, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a word utterance pattern recording process of recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted in the registration target word image extraction process, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a pattern comparison process of comparing the word utterance pattern to be registered, recorded in the word utterance pattern recording process, with the word utterance pattern of each word in a word dictionary whose patterns were registered in advance, each in association with its word, by the word utterance pattern recording process; and
a registered word acceptability determination process of permitting registration only of an unregistered word whose word utterance pattern is judged, in the pattern comparison process, not to approximate any registered pattern by a predetermined value or more.

(Appendix 32)
A word registration program causing a computer to execute:
a lip image region extraction process of extracting, at the time of word registration, an image of the lip region from an image of the face of a person to be recognized;
a registration target word image extraction process of extracting, as the word image produced when the word to be registered is uttered, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted in the lip image region extraction process;
a closing timing measurement process of measuring the timings at which the upper and lower lips close in the word image extracted in the registration target word image extraction process;
a vowel discrimination process of determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted in the registration target word image extraction process;
a comparison process of comparing the combination of the timings measured in the closing timing measurement process and the vowels determined in the vowel discrimination process for the word image extracted in the registration target word image extraction process with the combinations in a word dictionary in which, for each word, a combination of timings measured in the closing timing measurement process and vowels determined in the vowel discrimination process was registered in advance; and
a registered word acceptability determination process of permitting registration only of an unregistered word whose combination is judged, in the comparison process, not to approximate any registered combination by a predetermined value or more.

(Appendix 33)
The word registration program according to Appendix 32, wherein, in the vowel discrimination process, when the same lip opening shape persists for the predetermined time or longer, the same vowel is determined to have been uttered again each time a specific time exceeding the predetermined time elapses.

DESCRIPTION OF SYMBOLS
10 Information processing apparatus
11, 58, 47, 305, 603 Display
12 Imaging means
13 Change discrimination means
14 Specific operation execution means
20, 30 Word discrimination device
21, 31 Lip image region extraction means
22, 32 Word image extraction means
23 Word utterance pattern recording means
24, 35, 44, 55, 64, 75, 327 Word dictionary
25, 45, 65 Pattern comparison means
26, 37, 46, 57 Word discrimination means
33, 53, 73 Closing timing measurement means
34, 54, 74 Vowel discrimination means
36, 56, 76 Comparison means
40, 50 Screen display operation device
41, 51, 61, 71 Lip image region extraction means
42, 52 Word image extraction means
43, 63 Word utterance pattern recording means
48, 59 Content operation means
60, 70 Word registration device
62, 72 Registration target word image extraction means
66, 77 Registered word acceptability determination means
80, 90 Word discrimination method
81, 91 Lip image region extraction step
82, 92 Word image extraction step
83 Word utterance pattern recording step
84 Pattern comparison step
85, 96 Word discrimination step
93 Closing timing measurement step
94 Vowel discrimination step
95 Comparison step
100 Word discrimination program
101 Lip image region extraction process
102 Word image extraction process
103 Closing timing measurement process
104 Vowel discrimination process
105 Comparison process
106 Word discrimination process
300 Mobile phone
306, 604 Imaging device
321 CPU
322 Memory
323 Main control unit
328 Word discrimination unit
329 Image memory
330 Lip image region extraction unit
331 Word image extraction unit
332 Closing timing measurement unit
333 Vowel discrimination unit
351 User
352 Face
353 Lips
501 First window
502 Second window
600 Personal computer

Claims

An information processing apparatus comprising:
a display that visually displays various types of information;
imaging means for photographing at least the mouth of an operator using the display;
change discrimination means for determining temporal changes in the image of the operator's mouth obtained by the imaging means; and
specific operation execution means for executing, according to the determination result of the change discrimination means, a predetermined specific operation associated with that result.

A word discrimination device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted by the lip image region extraction means;
word utterance pattern recording means for recording, as the word utterance pattern for one uttered unit of word, the temporal changes in the vertical and horizontal distances through the center point of the lip opening in the word image extracted by the word image extraction means, taking the seam formed when the upper and lower lips are closed to lie horizontal overall;
a word dictionary in which word utterance patterns recorded in advance by the word utterance pattern recording means or equivalent means are each associated with a word;
pattern comparison means for comparing the word utterance pattern to be recognized, recorded by the word utterance pattern recording means, with the word utterance pattern of each word in the word dictionary; and
word discrimination means for determining that the word corresponding to the word utterance pattern judged to be the best match in the comparison by the pattern comparison means is the word uttered by the person to be recognized.

A word discrimination device comprising:
lip image region extraction means for extracting an image of the lip region from an image of the face of a person to be recognized;
word image extraction means for extracting, as one unit of word image, the series of temporal changes in the lip image, from the start to the end of its change, within the lip image region extracted by the lip image region extraction means;
closing timing measurement means for measuring the timings at which the upper and lower lips close in the word image extracted by the word image extraction means;
vowel discrimination means for determining each vowel constituting the word from lip opening shapes that persist unchanged for a predetermined time or longer in the word image extracted by the word image extraction means;
a word dictionary prepared in advance by associating, with each of a plurality of words, a combination of the timings measured by the closing timing measurement means or equivalent means and the vowels determined by the vowel discrimination means or equivalent means;
comparison means for comparing the combination of the timings measured by the closing timing measurement means and the vowels determined by the vowel discrimination means for the word image extracted by the word image extraction means with the combinations in the word dictionary; and
word discrimination means for determining that the word corresponding to the combination in the word dictionary judged to be the best match in the comparison by the comparison means is the word uttered by the person to be recognized.

Lip image region extracting means for extracting a lip region image from a human face image to be recognized;
A word image extraction means for extracting a series of temporal changes from the start to the end of the lip image change in the lip image area extracted by the lip image area extraction means as a unit word image;
Word utterance pattern recording means for recording, as the word utterance pattern for one unit word, the temporal change in the vertical and horizontal distances across the lip opening in the word image extracted by the word image extraction means, each distance being measured through the center point of the opening with the seam formed where the upper and lower lips close oriented horizontally;
A word dictionary in which each word utterance pattern previously recorded by the word utterance pattern recording means or equivalent means is associated with a word;
Pattern comparing means for comparing the word utterance pattern to be recognized recorded by the word utterance pattern recording means with the word utterance pattern for each word in the word dictionary;
As a result of the comparison by the pattern comparison means, a word determination means for determining that a word corresponding to the word utterance pattern determined to be the best match is a word uttered by the person to be recognized,
A display that displays various information,
A screen display operation device comprising: content operation means for operating display contents displayed on the display with operation contents corresponding to the word determined by the word determination means.

Lip image region extracting means for extracting a lip region image from a human face image to be recognized;
A word image extraction means for extracting a series of temporal changes from the start to the end of the lip image change in the lip image area extracted by the lip image area extraction means as a unit word image;
A closing timing measuring means for measuring the timing at which the upper and lower lips in the word image extracted by the word image extracting means are closed;
Vowel discrimination means for discriminating each vowel constituting a word from the shape of the same lip opening that continues for a predetermined time or longer in the word image extracted by the word image extraction means;
A word dictionary prepared in advance by associating each of a plurality of words with a combination of the timings measured by the closing timing measuring means or equivalent means and the vowels determined by the vowel discrimination means or equivalent means;
Comparison means for comparing the combination of the timings measured by the closing timing measuring means in the word image extracted by the word image extraction means and the vowels constituting the word determined by the vowel discrimination means with the combinations in the word dictionary;
Word determination means for determining that the word corresponding to the combination in the word dictionary judged to be the best match as a result of the comparison by the comparison means is the word uttered by the person to be recognized;
A display that displays various information,
A screen display operation device comprising: content operation means for operating display contents displayed on the display with operation contents corresponding to the word determined by the word determination means.
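
By way of non-limiting illustration, the content operation means recited above can be pictured as a lookup from the determined word to an operation on the displayed content. The `DisplayState` fields, the romanized command words, and the step sizes below are assumptions made for the sketch; the claims do not fix any particular set of operations.

```python
from dataclasses import dataclass


@dataclass
class DisplayState:
    """Hypothetical display state: vertical scroll offset and zoom factor."""
    scroll_y: int = 0
    zoom: float = 1.0


# Assumed word-to-operation table; each operation returns a new state.
OPERATIONS = {
    "ue": lambda s: DisplayState(max(0, s.scroll_y - 100), s.zoom),
    "shita": lambda s: DisplayState(s.scroll_y + 100, s.zoom),
    "kakudai": lambda s: DisplayState(s.scroll_y, s.zoom * 2),
    "shukushou": lambda s: DisplayState(s.scroll_y, s.zoom / 2),
}


def operate(state, word):
    """Apply the operation for the determined word; ignore unknown words."""
    operation = OPERATIONS.get(word)
    return operation(state) if operation else state
```

Returning the state unchanged for an unknown word is a design choice of the sketch, so that a misrecognized word never alters the display.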

Lip image region extracting means for extracting a lip region image from a human face image to be recognized at the time of word registration;
Registration target word image extraction means for extracting, as the word image to be registered, the series of temporal changes from the start to the end of the lip image change in the lip image region extracted by the lip image region extracting means;
Word utterance pattern recording means for recording, as the word utterance pattern for one unit word, the temporal change in the vertical and horizontal distances across the lip opening in the word image extracted by the registration target word image extraction means, each distance being measured through the center point of the opening with the seam formed where the upper and lower lips close oriented horizontally as a whole;
A word dictionary in which each word utterance pattern previously recorded by the word utterance pattern recording means or equivalent means is associated with a word;
Pattern comparison means for comparing the word utterance pattern to be registered, recorded by the word utterance pattern recording means, with the word utterance pattern for each word in the word dictionary; and
Registered word availability judging means for permitting registration of only an unregistered word whose word utterance pattern is judged, as a result of the comparison by the pattern comparison means, not to approximate any word utterance pattern in the word dictionary by a predetermined value or more. A word registration device comprising the above means.
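
By way of non-limiting illustration, the registration check recited above can be sketched as follows. The sketch assumes a dictionary that maps each word to a pattern held as a flat list of numbers. The function names, the `toy_similarity` measure, and the threshold of 0.9 standing in for the "predetermined value" are all hypothetical.

```python
def toy_similarity(a, b):
    """Assumed similarity in (0, 1]: 1 / (1 + mean absolute difference)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    mean_diff = sum(abs(x - y) for x, y in zip(a, b)) / n
    return 1.0 / (1.0 + mean_diff)


def try_register(dictionary, word, pattern, similarity=toy_similarity,
                 threshold=0.9):
    """Register (word, pattern) only if the word is new and the pattern does
    not approximate any registered pattern at or above the threshold.

    Returns (True, None) on success, or (False, conflicting_word).
    """
    if word in dictionary:
        return False, word
    for existing_word, existing_pattern in dictionary.items():
        if similarity(pattern, existing_pattern) >= threshold:
            return False, existing_word  # too close to a registered word
    dictionary[word] = pattern
    return True, None
```

Rejecting near-duplicates at registration time keeps later best-match discrimination from having to separate two words whose lip movements are practically identical.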

Lip image region extracting means for extracting a lip region image from a human face image to be recognized at the time of word registration;
Registration target word image extraction means for extracting, as the word image to be registered, the series of temporal changes from the start to the end of the lip image change in the lip image region extracted by the lip image region extracting means;
A closing timing measuring means for measuring a timing at which upper and lower lips are closed in the word image extracted by the registration target word image extracting means;
Vowel discrimination means for discriminating each vowel constituting a word from the shape of the same lip opening that lasts for a predetermined time or longer in the word image extracted by the registration target word image extraction means;
A word dictionary prepared in advance by associating each of a plurality of words with a combination of the timings measured by the closing timing measuring means or equivalent means and the vowels determined by the vowel discrimination means or equivalent means;
Comparing means for comparing the combination of the timings measured by the closing timing measuring means in the word image extracted by the registration target word image extraction means and the vowels constituting the word determined by the vowel discrimination means with the combinations in the word dictionary; and
Registered word availability judging means for permitting registration of only an unregistered word whose word utterance pattern is judged, as a result of the comparison by the comparing means, not to approximate an entry in the word dictionary by a predetermined value or more. A word registration device comprising the above means.

A lip image region extracting step for extracting a lip region image from a human face image to be recognized;
A word image extraction step for extracting a series of temporal changes from the start to the end of the lip image change in the lip image region extracted in the lip image region extraction step as a unit word image;
A word utterance pattern recording step of recording, as the word utterance pattern for one unit word, the temporal change in the vertical and horizontal distances across the lip opening in the word image extracted in the word image extraction step, each distance being measured through the center point of the opening with the seam formed where the upper and lower lips close oriented horizontally;
A pattern comparison step of comparing the word utterance pattern to be recognized, recorded in the word utterance pattern recording step, with the word utterance pattern for each word in a word dictionary in which word utterance patterns recorded in the same manner as in the word utterance pattern recording step are registered in advance in association with the respective words; and
A word discrimination step of determining that the word corresponding to the word utterance pattern judged to be the best match as a result of the comparison in the pattern comparison step is the word uttered by the person to be recognized. A word discrimination method comprising the above steps.
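
By way of non-limiting illustration, the pattern comparison step can be sketched by treating each word utterance pattern as a time series of (vertical, horizontal) opening distances. The claims do not specify how patterns of different duration are compared; the linear resampling to a fixed length and the mean absolute difference used below are assumptions of the sketch, as are the function names and sample values.

```python
def resample(pattern, n):
    """Linearly resample a non-empty list of (vertical, horizontal) pairs
    to n samples (n >= 2)."""
    if len(pattern) == 1:
        return [tuple(pattern[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(pattern) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        t = pos - lo
        out.append(tuple(a + (b - a) * t
                         for a, b in zip(pattern[lo], pattern[hi])))
    return out


def distance(p, q, n=16):
    """Mean absolute difference between two resampled patterns."""
    rp, rq = resample(p, n), resample(q, n)
    total = sum(abs(a - b) for x, y in zip(rp, rq) for a, b in zip(x, y))
    return total / n


def best_match(observed, dictionary):
    """Return the word whose registered pattern is closest to the observed one."""
    return min(dictionary, key=lambda w: distance(observed, dictionary[w]))
```

Resampling makes the comparison insensitive to overall speaking speed; a time-warping comparison would be an alternative where the speed varies within a word.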

A lip image region extracting step for extracting a lip region image from a human face image to be recognized;
A word image extraction step for extracting a series of temporal changes from the start to the end of the lip image change in the lip image region extracted by the lip image region extraction step as a unit word image;
A closing timing measurement step for measuring the timing at which the upper and lower lips in the word image extracted by the word image extraction step are closed;
A vowel discrimination step for discriminating each vowel constituting the word from the shape of the same lip opening that lasts for a predetermined time or longer in the word image extracted by the word image extraction step;
A comparison step of comparing the combination of the timings measured in the closing timing measurement step in the word image extracted in the word image extraction step and the vowels constituting the word determined in the vowel discrimination step with the combinations in a word dictionary in which, for each word, a combination of timings measured as in the closing timing measurement step and vowels determined as in the vowel discrimination step is registered in advance; and
A word determination step of determining that the word corresponding to the combination in the word dictionary judged to be the best match as a result of the comparison in the comparison step is the word uttered by the person to be recognized. A word discrimination method comprising the above steps.
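
By way of non-limiting illustration, the closing timing measurement step and the vowel discrimination step can be sketched over a sequence of per-frame (vertical, horizontal) opening sizes. The prototype table `PROTOTYPES`, the pixel values, the closure tolerance, and the minimum hold of three frames standing in for the "predetermined time" are all assumptions of the sketch.

```python
from itertools import groupby

# Hypothetical prototypes: vowel -> (vertical, horizontal) opening in pixels.
PROTOTYPES = {
    "a": (20, 40), "i": (6, 46), "u": (8, 20), "e": (12, 42), "o": (16, 26),
}


def classify(frame, closed_eps=2):
    """Label one frame as 'closed' or as the nearest vowel prototype."""
    v, h = frame
    if v <= closed_eps:
        return "closed"
    return min(PROTOTYPES,
               key=lambda k: abs(PROTOTYPES[k][0] - v) + abs(PROTOTYPES[k][1] - h))


def analyze(frames, fps=30, min_frames=3):
    """Return (closing times in seconds, vowels held for >= min_frames)."""
    closings, vowels, index = [], [], 0
    # groupby merges consecutive frames that carry the same label.
    for label, run in groupby(classify(f) for f in frames):
        length = len(list(run))
        if label == "closed":
            closings.append(index / fps)  # time at which the closure begins
        elif length >= min_frames:
            vowels.append(label)          # shape held long enough to count
        index += length
    return closings, vowels
```

Short runs between held shapes are treated as transitions and discarded, which mirrors the requirement that a vowel is read only from an opening shape that persists for the predetermined time or longer.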

Causing a computer to execute:
Lip image region extraction processing for extracting a lip region image from a human face image to be recognized;
A word image extraction process for extracting a series of temporal changes from the start to the end of the lip image change in the lip image area extracted by the lip image area extraction process as a unit word image;
A closing timing measurement process for measuring the timing at which the upper and lower lips are closed in the word image extracted by the word image extraction process;
Vowel discrimination processing for discriminating each vowel constituting a word from the shape of the same lip opening that continues for a predetermined time or longer in the word image extracted by the word image extraction processing;
A comparison process for comparing the combination of the timings measured in the closing timing measurement process in the word image extracted by the word image extraction process and the vowels constituting the word determined in the vowel discrimination process with the combinations in a word dictionary in which, for each word, a combination of timings measured as in the closing timing measurement process and vowels determined as in the vowel discrimination process is registered in advance; and
A word discrimination process for determining that the word corresponding to the combination in the word dictionary judged to be the best match as a result of the comparison by the comparison process is the word uttered by the person to be recognized. A word discrimination program that causes the computer to execute the above processes.
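
By way of non-limiting illustration, the word image extraction process, which cuts out the span from the start to the end of the lip image change as one unit word image, can be sketched over a per-frame vertical opening value. The motion tolerance `motion_eps` and the stillness length `still_frames` that ends a word are assumptions of the sketch; `still_frames` would need to exceed the hold time used for vowel discrimination so that a held vowel does not end the word.

```python
def extract_word_images(openings, motion_eps=1.0, still_frames=5):
    """Return (start, end) frame index pairs, end exclusive, one per word.

    A word starts at the frame before the first change larger than
    motion_eps and ends after the last changed frame once the lips have
    stayed still for still_frames consecutive frames.
    """
    segments, start, still = [], None, 0
    for i in range(1, len(openings)):
        moving = abs(openings[i] - openings[i - 1]) > motion_eps
        if moving:
            if start is None:
                start = i - 1  # last still frame before the change begins
            still = 0
        elif start is not None:
            still += 1
            if still >= still_frames:
                segments.append((start, i - still + 1))
                start, still = None, 0
    if start is not None:  # the sequence ended while a word was in progress
        segments.append((start, len(openings) - still))
    return segments
```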