JP2005135432A - Image recognition apparatus and image recognition method

Info

Publication number: JP2005135432A
Application number: JP2004360240A
Authority: JP
Inventors: Norio Mihara; 功雄三原; Miwako Doi; 美和子土井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-12-13
Filing date: 2004-12-13
Publication date: 2005-05-26
Anticipated expiration: 2018-01-30
Also published as: JP4160554B2

Abstract

PROBLEM TO BE SOLVED: To provide an image recognition apparatus that can quickly and accurately recognize the shapes and motions of human lips.

SOLUTION: The image recognition apparatus comprises: an image acquisition part for acquiring a range image stream of a target object; a mouth part extraction part for extracting a mouth part from the range image stream acquired by the image acquisition part; and an image recognition part for recognizing at least the shapes or the motions of lips on the basis of the range image stream of the mouth part extracted by the mouth part extraction part.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an image recognition apparatus and an image recognition method that recognize the shape and/or movement of an image on the basis of an acquired distance image.

Conventionally, image processing such as lip reading based on the shape and movement of human lips, or determination of face orientation and facial expression, begins with preprocessing: an imaging device such as a CCD camera photographs the area around the person's lips or the face, unneeded parts such as the background are removed from the image, and only the target to be recognized (the lips only, the face only, and so on) is cut out. The shape, movement, and the like are then recognized from the processed image.

First, the preprocessing step of cutting out the recognition target is described.

In conventional methods, cutting out only the desired object from a camera image relies on some difference between the object and everything else as a cue. Cues that have been used include changes in hue, difference images, markers, and chroma keying. Each is described below, taking as an example the case of cutting out only the lips from an image of a person.

The hue-based method exploits the abrupt change in hue (pixel value) between the lips, which are almost uniformly red, and the surrounding skin, which is almost uniformly skin-colored, to identify and cut out only the lips.

However, this method cannot extract the lips well or reliably in environments where the hues differ from normal, for example when lighting conditions cast shadows on the skin or lips and change the hue. In some cases, lipstick of a specific color had to be worn to emphasize the hue change so that the lip shape could be obtained stably.
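As a rough illustration of the hue-based approach described above, the following Python sketch marks pixels whose hue is close to red. The function name `lip_mask`, the image representation (rows of RGB tuples), and both thresholds are assumptions made for this example and do not come from the text.

```python
import colorsys


def lip_mask(image, max_hue_dist=0.04, min_saturation=0.45):
    """Return a boolean mask marking pixels whose hue is close to red.

    `image` is a list of rows of (r, g, b) tuples with 0-255 channels.
    Both thresholds are illustrative assumptions, not values from the text.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # Hue is circular in [0, 1) and red sits at 0, so measure the
            # distance to red around the circle.
            hue_dist = min(h, 1.0 - h)
            mask_row.append(hue_dist <= max_hue_dist and s >= min_saturation)
        mask.append(mask_row)
    return mask
```

Because the decision depends only on fixed hue and saturation thresholds, any lighting change that shifts those values moves pixels across the threshold, which is the weakness the text points out.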

The difference-image method exploits the fact that, while a speaker is talking, only the lips move within the face: it takes the difference image between the current frame and the next frame, obtains the moving part, and treats that part as the lips.

However, this method depends heavily on the environment and conditions: when something is moving in the background, unneeded parts other than the lips are also extracted, and when the lips are not moving, nothing can be extracted. Reliably extracting only the lips at all times was therefore very difficult.
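The difference-image approach and its two failure modes can be sketched in a few lines of Python. The function name `moving_pixels`, the grayscale frame representation, and the threshold are assumptions made for this example.

```python
def moving_pixels(prev_frame, next_frame, threshold=20):
    """Mark pixels whose grayscale value changed by more than `threshold`.

    Frames are equally sized lists of rows of 0-255 integers. The
    threshold is an illustrative assumption.
    """
    return [
        [abs(b - a) > threshold for a, b in zip(prev_row, next_row)]
        for prev_row, next_row in zip(prev_frame, next_frame)
    ]


# The pixel at row 1, column 1 stands for the lips; row 0, column 2 is background.
prev_frame = [[10, 10, 10], [10, 200, 10]]
next_frame = [[10, 10, 90], [10, 120, 10]]

# Both the lips and the moving background pixel are flagged.
with_background_motion = moving_pixels(prev_frame, next_frame)

# With no motion between frames, nothing is flagged, not even the lips.
without_motion = moving_pixels(prev_frame, prev_frame)
```

The sketch cannot tell lip motion from background motion, and it returns an empty mask when the lips are still, matching the two drawbacks named in the text.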

The marker-based method attaches several markers around the lips as feature points and extracts the lips on the basis of the movement of those feature points.

However, because markers must be attached to the face, this method can be used only in limited environments.

The chroma-key method places the person in front of a screen whose color, such as blue, rarely appears on a human face, and removes that color from the camera image to extract only the face.

However, because this method dictates the background color, it can be used only in specific situations, and it cannot extract just one part within the face, such as the lips.

Thus, with conventional methods, reliably cutting out only the desired object from a camera image was very difficult.

Next, the step of recognizing the shape, movement, and the like of the object from the cut-out image is described.

Conventionally, the cut-out image of the object contains only two-dimensional information. This is because conventional imaging devices have difficulty acquiring three-dimensional shapes, and the imaging devices that could acquire them were not suited to real-time recognition such as recognition of movement. Such three-dimensional imaging devices were also very expensive and could not be used casually. Conventional image processing therefore strove to recognize inherently three-dimensional things, such as the shape and movement of a human face or lips, from two-dimensional information alone.

However, because inherently three-dimensional shapes and movements were handled as two-dimensional information, necessary information was lost. Despite various workarounds, this approach was fundamentally limited, and only simple shapes and movements could be recognized.

In addition, as described above, it is very difficult to reliably cut out only the object from an image, and this unreliable cut-out was a major factor in lowering the recognition rate.

As described above, the conventional methods had various problems in both the extraction of the object from the image and the recognition of the image.

In summary, reliably cutting out only the desired object from a camera image has conventionally been very difficult, and this has lowered the recognition rate of image recognition.

Furthermore, because various constraints meant that images were acquired as two-dimensional information using a camera or the like, three-dimensional shapes and movements had to be recognized from two-dimensional information alone, and only simple shapes and movements could be recognized.

The present invention has been made in view of the above circumstances, and its object is to provide an image recognition apparatus and an image recognition method capable of recognizing the shape and movement of a human face or lips quickly and accurately.

The image recognition apparatus according to Invention 1 comprises: image acquisition means for acquiring a distance image of a target object; mouth part extraction means for extracting a mouth part from the distance image acquired by the image acquisition means; and image recognition means for recognizing the shape of the lips on the basis of the distance image of the mouth part extracted by the mouth part extraction means.

The image recognition apparatus according to Invention 2 comprises: image acquisition means for acquiring a distance image stream (a moving image made up of distance images) of a target object; mouth part extraction means for extracting a mouth part from the distance image stream acquired by the image acquisition means; and image recognition means for recognizing at least one of the shape of the lips and the movement of the lips on the basis of the distance image stream of the mouth part extracted by the mouth part extraction means.

According to the present invention, the required part is extracted from the distance image of the target object and recognition is performed on the distance image of the extracted part, so lip recognition of a speaker (for example, recognition of the shape and movement of the mouth, or of the content of speech) can be performed quickly and accurately.
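This passage does not state how the required part is extracted from a distance image. The Python sketch below is therefore only one plausible illustration of why a distance image simplifies the step: the nearest object is separated from the background by distance alone, independent of color, lighting, or motion. The function names, the pixel format (distances in millimetres, with 0 meaning no measurement), and the threshold are all assumptions made for this example.

```python
def extract_near_region(distance_image, max_distance):
    """Return (top, left, bottom, right) of pixels nearer than `max_distance`.

    `distance_image` is a list of rows of distances; 0 means "no
    measurement". Returns None when no pixel qualifies. This pixel
    format is a hypothetical one chosen for this sketch.
    """
    rows = [
        y for y, row in enumerate(distance_image)
        if any(0 < d <= max_distance for d in row)
    ]
    cols = [
        x for row in distance_image
        for x, d in enumerate(row) if 0 < d <= max_distance
    ]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def crop(distance_image, box):
    """Cut the bounding box (inclusive coordinates) out of the distance image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in distance_image[top:bottom + 1]]


# A near object (about 400 mm away) in front of a far background (2000 mm).
depth = [
    [0,    2000, 2000, 2000],
    [2000, 420,  400,  2000],
    [2000, 410,  380,  2000],
    [2000, 2000, 2000, 0],
]
box = extract_near_region(depth, max_distance=500)
near_part = crop(depth, box)
```

The cropped region keeps its per-pixel distances, so a later recognition step would have three-dimensional shape information available rather than a flat silhouette.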

The image recognition apparatus according to Invention 3 comprises: image acquisition means for acquiring a distance image of a target object; face part extraction means for extracting a face part from the distance image acquired by the image acquisition means; and image recognition means for recognizing the shape of the face on the basis of the distance image of the face part extracted by the face part extraction means.

The image recognition apparatus according to Invention 4 comprises: image acquisition means for acquiring a distance image stream (a moving image made up of distance images) of a target object; face part extraction means for extracting a face part from the distance image stream acquired by the image acquisition means; and image recognition means for recognizing at least one of the shape of the face and the movement of the face on the basis of the distance image stream of the face part extracted by the face part extraction means.

According to the present invention, the required part is extracted from the distance image of the target object and recognition is performed on the distance image of the extracted part, so face recognition of a speaker (for example, recognition of the shape and movement of the face) can be performed quickly and accurately.

Invention 5 is the image recognition apparatus according to any one of Inventions 1 to 4, further comprising direction identification means for identifying the orientation of the speaker's face on the basis of the shape information or the movement information obtained by the image recognition means.

Invention 6 is the image recognition apparatus according to Invention 1 or 2, further comprising direction identification means for identifying the orientation of the speaker's face on the basis of the lip shape or lip movement recognized by the image recognition means.

Invention 7 is the image recognition apparatus according to Invention 3 or 4, further comprising direction identification means for identifying the orientation of the speaker's face on the basis of the face shape or face movement acquired by the image acquisition means.

Invention 8 is the image recognition apparatus according to Invention 1 or 2, further comprising: face part extraction means for extracting a face part from the distance image acquired by the image acquisition means; and direction identification means for identifying the orientation of the speaker's face on the basis of the distance image of the face part extracted by the face part extraction means.

According to the present invention, the direction in which the speaker is facing can be identified together with lip recognition, face recognition, or the like of the speaker.

Invention 9 is the image recognition apparatus according to Invention 1, 2, or 8, further comprising: speech recognition means for recognizing input speech; and control means for performing at least one of control that starts speech recognition by the speech recognition means when the start of the speaker's conversation is detected on the basis of the recognition result of the image recognition means, and control that ends speech recognition by the speech recognition means when the end of the speaker's conversation is detected on the basis of that recognition result.

Invention 10 is the image recognition apparatus according to Invention 5 or 8, further comprising: speech recognition means for recognizing input speech; and control means for performing at least one of control that starts speech recognition by the speech recognition means when the identification result of the direction identification means is frontal, and control that ends speech recognition by the speech recognition means when the identification result is not frontal.

According to the present invention, speech recognition can be controlled according to the lip recognition result or the direction in which the speaker is facing, together with lip recognition, face recognition, or the like of the speaker.
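The control described in Inventions 9 and 10 can be pictured as a small state machine that switches speech recognition on and off. The Python sketch below is a minimal illustration under stated assumptions: the class name `SpeechRecognitionGate`, the per-frame boolean inputs, and the choice to combine the lip cue and the face-direction cue in a single gate are all hypothetical. The text treats the two cues as separate inventions and does not specify an implementation.

```python
class SpeechRecognitionGate:
    """Start or stop speech recognition from image-recognition results.

    `lips_moving` stands in for the lip-based detection of conversation
    start and end; `facing_front` stands in for the direction
    identification result. Combining both is an assumption of this sketch.
    """

    def __init__(self):
        self.active = False
        self.events = []  # "start" / "stop" commands issued so far

    def update(self, lips_moving, facing_front=True):
        should_listen = lips_moving and facing_front
        if should_listen and not self.active:
            self.active = True
            self.events.append("start")  # begin speech recognition
        elif not should_listen and self.active:
            self.active = False
            self.events.append("stop")   # end speech recognition
        return self.active


gate = SpeechRecognitionGate()
frames = [  # (lips_moving, facing_front) for successive frames
    (False, True), (True, True), (True, True),
    (True, False), (False, False), (True, True),
]
states = [gate.update(lips, front) for lips, front in frames]
```

In this run the recognizer starts when the speaker begins talking while facing the device, stops when the speaker turns away, and starts again once both conditions hold.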

Invention 11 is the image recognition apparatus according to Invention 1, 2, 8, or 9, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and control means for detecting at least one of the start and the end of the speaker's conversation on the basis of the recognition result of the image recognition means and, according to the detection result, performing at least one of control that starts information presentation by the information presentation means, control that ends the information presentation, and control that changes at least part of the output form used for the information presentation in progress.

Invention 12 is the image recognition apparatus according to Invention 5, 8, or 10, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and control means for performing, according to the relationship between the orientation given by the identification result of the direction identification means and the frontal direction, at least one of control that starts information presentation by the information presentation means, control that ends the information presentation, and control that changes at least part of the output form used for the information presentation in progress.

According to the present invention, information presentation can be controlled according to the lip recognition result or the direction in which the speaker is facing, together with lip recognition, face recognition, or the like of the speaker.

Invention 13 is the image recognition apparatus according to Invention 1, 2, 6, or 8, further comprising: speech recognition means for recognizing input speech; and speech recognition start means for detecting the start of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and starting speech recognition by the speech recognition means when the start of the conversation is detected.

Invention 14 is the image recognition apparatus according to Invention 1, 2, 6, 8, or 13, further comprising: speech recognition means for recognizing input speech; and speech recognition end means for detecting the end of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and ending speech recognition by the speech recognition means when the end of the conversation is detected.

Invention 15 is the image recognition apparatus according to Invention 5, 6, 7, or 8, further comprising: speech recognition means for recognizing input speech; and speech recognition start means for starting speech recognition by the speech recognition means when the identification result of the direction identification means is frontal (the speaker performing an action).

Invention 16 is the image recognition apparatus according to Invention 5, 6, 7, 8, or 15, further comprising: speech recognition means for recognizing input speech; and speech recognition end means for ending speech recognition by the speech recognition means when the identification result of the direction identification means is not frontal (the speaker performing an action).

Invention 17 is the image recognition apparatus according to Invention 1, 2, 6, 8, 13, or 14, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation start means for detecting the start of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and starting information presentation by the information presentation means when the start of the conversation is detected.

Invention 18 is the image recognition apparatus according to Invention 1, 2, 6, 8, 13, 14, or 18, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation end means for detecting the end of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and ending information presentation by the information presentation means when the end is detected.

Invention 19 is the image recognition apparatus according to Invention 5, 6, 7, 8, 15, or 16, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation start means for starting information presentation by the information presentation means when the identification result of the direction identification means is frontal (the speaker performing an action).

Invention 20 is the image recognition apparatus according to Invention 5, 6, 7, 8, 15, 16, or 20, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation end means for ending information presentation by the information presentation means when the identification result of the direction identification means is not frontal (the speaker performing an action).

Invention 21 is the image recognition apparatus according to Invention 1, 2, 6, 8, 13, 14, 17, or 18, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation switching means for detecting the start of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and, when the start of the conversation is detected, switching the information presentation by the information presentation means to presentation in an output form different from the one used until then.

Invention 22 is the image recognition apparatus according to Invention 1, 2, 6, 8, 13, 14, 17, 18, or 21, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation switching means for detecting the end of the speaker's conversation (the speaker performing an action) on the basis of the recognition result of the image recognition means and, when the end of the conversation is detected, switching the information presentation by the information presentation means to presentation in an output form different from the one used until then.

Invention 23 is the image recognition apparatus according to Invention 5, 6, 7, 8, 15, 16, 19, or 20, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation switching means for switching the information presentation by the information presentation means to presentation in an output form different from the one used until then when the identification result of the direction identification means becomes frontal (the speaker performing an action).

Invention 24 is the image recognition apparatus according to Invention 5, 6, 7, 8, 15, 16, 19, 20, or 23, further comprising: information presentation means for presenting predetermined information in a predetermined output form (sound, image, another form, or a combination of several forms); and information presentation switching means for switching the information presentation by the information presentation means to presentation only in an output form different from the one used until then when, during information presentation by the information presentation means, the identification result of the direction identification means ceases to be frontal (the speaker performing an action).

Invention 25 is the image recognition apparatus according to any one of Inventions 1 to 12, further comprising communication means for communicating the predetermined information obtained.

According to the present invention, desired information such as a recognition result can be supplied to an external device.

Invention 26 is the image recognition apparatus according to Invention 1 or 2, further comprising communication means for communicating the lip shape information or lip movement information recognized by the image recognition means.

Invention 27 is the image recognition apparatus according to Invention 3 or 4, further comprising communication means for communicating the face shape information or face movement information recognized by the image recognition means.

Invention 28 is the image recognition apparatus according to Invention 5, 6, 7, or 8, further comprising communication means for communicating the information on the orientation of the speaker's face identified by the direction identification means.

Invention 29 is the image recognition apparatus according to Invention 9, 10, 8, 9, 10, or 11, further comprising communication means for communicating the recognition result of the speech recognition means.

Invention 30 is the image recognition apparatus according to Invention 11, 12, 12, 13, 14, 15, 16, 17, 18, or 19, further comprising communication means for communicating the information presented by the information presentation means.

In each of the inventions above, a configuration in which the image acquisition means is omitted and the distance image of the target object is supplied from outside is also possible.

本発明３１に係る画像認識方法は、与えられた、対象物体に対する距離画像から、口腔部分を抽出し、抽出された口腔部分の距離画像に基づいて、口唇の形状を認識することを特徴とする。 The image recognition method according to the thirty-first aspect of the present invention extracts an oral cavity portion from a given distance image of a target object and recognizes the shape of the lips based on the distance image of the extracted oral cavity portion.

本発明３２に係る画像認識方法は、与えられた、対象物体に対する距離画像ストリームから、口腔部分を抽出し、抽出された口腔部分の距離画像ストリームに基づいて、口唇の形状および口唇の動きの少なくとも一方を認識することを特徴とする。 The image recognition method according to the thirty-second aspect of the present invention extracts an oral cavity portion from a given distance image stream of a target object and, based on the distance image stream of the extracted oral cavity portion, recognizes at least one of the shape of the lips and the movement of the lips.

本発明３３は、コンピュータに、与えられた対象物体に対する距離画像から口腔部分を抽出させ、抽出された口腔部分の距離画像に基づいて口唇の形状を認識させるための手順を含むプログラムを記録したコンピュータ読取り可能な記録媒体を要旨とする。 The gist of the thirty-third aspect of the present invention is a computer-readable recording medium on which is recorded a program including procedures for causing a computer to extract an oral cavity portion from a given distance image of a target object and to recognize the shape of the lips based on the distance image of the extracted oral cavity portion.

本発明３４は、コンピュータに、与えられた対象物体に対する距離画像ストリームから口腔部分を抽出させ、抽出された口腔部分の距離画像ストリームに基づいて口唇の形状および口唇の動きの少なくとも一方を認識させるための手順を含むプログラムを記録したコンピュータ読取り可能な記録媒体を要旨とする。 The gist of the thirty-fourth aspect of the present invention is a computer-readable recording medium on which is recorded a program including procedures for causing a computer to extract an oral cavity portion from a given distance image stream of a target object and to recognize at least one of the shape of the lips and the movement of the lips based on the distance image stream of the extracted oral cavity portion.

なお、装置に係る本発明は方法に係る発明としても成立し、方法に係る本発明は装置に係る発明としても成立する。 The present invention relating to the apparatus is also established as an invention relating to a method, and the present invention relating to a method is also established as an invention relating to an apparatus.

また、装置または方法に係る本発明は、コンピュータに当該発明に相当する手順を実行させるための（あるいはコンピュータを当該発明に相当する手段として機能させるための、あるいはコンピュータに当該発明に相当する機能を実現させるための）プログラムを記録したコンピュータ読取り可能な記録媒体としても成立する。 Further, the present invention relating to an apparatus or a method is also established as a computer-readable recording medium on which is recorded a program for causing a computer to execute a procedure corresponding to the invention (or for causing a computer to function as means corresponding to the invention, or for causing a computer to realize functions corresponding to the invention).

本発明によれば、対象物体に対する距離画像から必要とする部分を抽出し、抽出した部分の距離画像に基づいて認識処理を行うので、人間の顔や口唇の形状や動きを高速かつ高精度に認識することができる。 According to the present invention, a required portion is extracted from a distance image of a target object and recognition is performed on the distance image of the extracted portion, so that the shape and movement of a human face and lips can be recognized at high speed and with high accuracy.

以下、図面を参照しながら本発明の実施形態について説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

（第１の実施形態） (First embodiment)
まず、本発明の第１の実施形態について説明する。 First, a first embodiment of the present invention will be described.

図１は、本発明の第１の実施形態に係る画像認識装置の全体構成図である。 FIG. 1 is an overall configuration diagram of an image recognition apparatus according to a first embodiment of the present invention.

本実施形態の画像認識装置は、距離画像ストリームを取得するための画像取得部１と、画像取得部１で取得された顔の全部または一部の距離画像ストリームから、口腔部分のみを抽出する口腔部抽出部２と、抽出された口腔部の距離画像ストリームから、口唇の形状および／または口唇の動きを認識する画像認識部３とから構成される。 The image recognition apparatus of this embodiment comprises an image acquisition unit 1 for acquiring a distance image stream; an oral cavity extraction unit 2 that extracts only the oral cavity portion from the distance image stream of all or part of the face acquired by the image acquisition unit 1; and an image recognition unit 3 that recognizes the shape and/or movement of the lips from the extracted distance image stream of the oral cavity.

画像取得部１は、画像認識対象物体となる人間の顔の全部または一部を、その３次元形状を反映した奥行き値を持つ画像（以下、距離画像と呼ぶ）として所定時間毎（例えば１／６０秒毎など）に取得するものである（例えば特願平９−２９９６４８の画像取得方法を用いて実現することができる）。画像取得部１は概略的には、例えば、対象物体に光を照射し、対象物体からの反射光の空間的な強度分布を抽出し、その各画素の強度値を奥行きあるいは距離を示す値に変換することにより、距離画像を生成する。この画像取得部１を用いて顔を撮像することで、顔の全部または一部分の、距離画像による動画像（以下、距離画像ストリームと呼ぶ）を得ることができる。なお、画像取得部１の詳細については後述する。 The image acquisition unit 1 acquires all or part of a human face, the object of image recognition, as an image with depth values reflecting its three-dimensional shape (hereinafter referred to as a distance image) at predetermined intervals (for example, every 1/60 second); it can be realized using, for example, the image acquisition method of Japanese Patent Application No. 9-299648. In outline, the image acquisition unit 1, for example, irradiates the target object with light, extracts the spatial intensity distribution of the light reflected from the target object, and converts the intensity value of each pixel into a value indicating depth or distance, thereby generating a distance image. Capturing a face with the image acquisition unit 1 yields a moving image made up of distance images of all or part of the face (hereinafter referred to as a distance image stream). Details of the image acquisition unit 1 will be described later.

図２に、画像取得部１により取得された顔の距離画像（距離画像ストリーム中の１フレーム分）の例を示す。距離画像は、奥行き情報を有する３次元画像で、例えば、ｘ軸（横）方向６４画素、ｙ軸（縦）方向６４画素、ｚ軸（奥行き）方向２５６階調の画像になっている。図２は、距離画像の距離値すなわちｚ軸方向の階調をグレースケールで表現したものである。距離画像においては、色が白に近いほど距離が近く、黒に近くなるほど距離が遠い。また、色が完全に黒のところは、画像がない、あるいはあっても遠方でないのと同じであることを示している。例えば、図２は、口唇部が白く、その内側の口腔部が黒くなっている様子を示すものである。 FIG. 2 shows an example of a face distance image (one frame of the distance image stream) acquired by the image acquisition unit 1. The distance image is a three-dimensional image carrying depth information, for example an image of 64 pixels in the x-axis (horizontal) direction, 64 pixels in the y-axis (vertical) direction, and 256 gradations in the z-axis (depth) direction. FIG. 2 renders the distance values of the distance image, that is, the gradations in the z-axis direction, in grayscale. In the distance image, the closer a color is to white, the nearer the point; the closer to black, the farther away. Completely black areas indicate that nothing is there, or that whatever is there is so far away that it can be treated as absent. For example, FIG. 2 shows the lips appearing white and the oral cavity inside them appearing black.

なお、画像取得部１における受光面もしくはこれを収容した筐体は、本画像認識装置の目的等に応じて適宜設置するばよい。例えば本画像認識装置が表示装置を持つものである場合、この表示装置に対して対象物体となる人間の顔が正面を向いたときに、当該受光面に対しても正面を向いた形になるように当該画像認識装置の筐体に設ける。 Note that the light-receiving surface in the image acquisition unit 1 or the housing that houses the light-receiving surface may be appropriately installed according to the purpose of the image recognition apparatus. For example, when the image recognition apparatus has a display device, when a human face as a target object faces the front with respect to the display device, the light receiving surface also faces the front. Thus, it is provided in the housing of the image recognition apparatus.

次に、口腔部抽出部２について説明する。 Next, the oral cavity extraction unit 2 will be described.

口腔部抽出部２は、画像取得部１によって取得された顔の全部または一部の距離画像ストリームから、口腔部のみを抽出するものである。 The oral cavity extraction unit 2 extracts only the oral cavity from the distance image stream of all or part of the face acquired by the image acquisition unit 1.

人間の口唇の周辺部分を３次元的に見た場合、その局所的な形状は人によって様々であるし、同じ人でも状況によって様々な形状をしている。しかし、大局的には、「口唇部が少し凸形状をしており、その内側の口腔部が大きく凹形状をしている」という、人や状況に依らず一意に定まる特徴がある。 When the peripheral portion of the human lip is viewed three-dimensionally, the local shape varies depending on the person, and the same person has various shapes depending on the situation. However, as a general rule, there is a characteristic that is uniquely determined regardless of a person or a situation, that “the lip portion has a slightly convex shape and the oral cavity portion inside thereof has a large concave shape”.

図３は口唇を閉じている場合の顔の距離画像を、図４は口唇を開いている場合の顔の距離画像を、それぞれ、立体的に示したものである。図３および図４を見ると、上述したような口腔部の３次元的特徴がはっきりと見て取れることが分かる。 FIG. 3 shows, in three dimensions, the distance image of a face with the lips closed, and FIG. 4 shows that of a face with the lips open. FIGS. 3 and 4 clearly show the three-dimensional features of the oral cavity described above.

この口腔部の３次元形状の特徴を積極的に利用すれば、顔の距離画像ストリームから、口腔部のみを抽出した距離画像ストリームを構成することは容易である。 If the feature of the three-dimensional shape of the oral cavity is positively used, it is easy to construct a distance image stream in which only the oral cavity is extracted from the facial distance image stream.

以下では、口腔部抽出部２でどのように口腔部を抽出するのかを具体的に説明する。 Below, how the oral cavity part is extracted by the oral cavity extraction part 2 is demonstrated concretely.

画像取得部１によって取得された距離画像（以下、原画像とも呼ぶ）は、顔の３次元的形状を表している。この距離画像の２階微分画像を求めることで、原画像における傾き変化の様子を知ることができる。これを用いれば、原画像のエッジ部分を抽出することができる。なお、ここでエッジと言うのは、顔と背景との境界や、口唇と肌との境界のように、傾きの変化がある部分のことである。 A distance image (hereinafter also referred to as an original image) acquired by the image acquisition unit 1 represents a three-dimensional shape of a face. By obtaining a second-order differential image of this distance image, it is possible to know the state of inclination change in the original image. If this is used, the edge portion of the original image can be extracted. Note that the edge here refers to a portion having a change in inclination such as a boundary between the face and the background or a boundary between the lips and the skin.

図５にエッジ抽出の具体的な処理の流れの一例を示す。 FIG. 5 shows an example of a specific processing flow for edge extraction.

まず、Ｍａｒｒ−Ｈｉｌｄｒｅｔｈが提案したガウスラプラシアンフィルタを原画像に施す（ステップＳ１００）。 First, the Laplacian of Gaussian filter proposed by Marr and Hildreth is applied to the original image (step S100).

次に、そのゼロクロス点を求める（ステップＳ１０１）。このとき、例えば、注目画素の４近傍の画素値が正である点をゼロクロス点とすればよい。 Next, the zero-crossing points of the filtered image are obtained (step S101). Here, for example, a point at which a pixel value in the 4-neighborhood of the pixel of interest is positive may be taken as a zero-crossing point.

そして、ゼロクロス点ならば、図６に示すようなＳｏｂｅｌオペレータ（図中（ａ）がＸ方向に対応し、（ｂ）がＹ方向に対応する）を施し、その画素の強度を求める（ステップＳ１０２）。 Then, at each zero-crossing point, the Sobel operator shown in FIG. 6 ((a) corresponds to the X direction and (b) to the Y direction) is applied to obtain the intensity of that pixel (step S102).

この強度がある閾値以上ならば、エッジの構成点であるとみなす（ステップＳ１０３）。 If this intensity is greater than or equal to a certain threshold value, it is regarded as a constituent point of the edge (step S103).

以上の処理により、原画像から、エッジ部分のみを抽出することができる。 With the above processing, only the edge portion can be extracted from the original image.
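Steps S100 to S103 can be sketched in plain Python as follows. This is a minimal illustration rather than the apparatus's actual implementation: the 5x5 Laplacian-of-Gaussian coefficients, the border clamping, and the function names are assumptions, and the zero-crossing rule is read here as "the filtered value is non-positive and at least one 4-neighbor is positive".

```python
import math

# Assumed 5x5 discrete approximation of the Laplacian of Gaussian;
# the source does not give filter coefficients.
LOG_KERNEL = [
    [0, 0, -1, 0, 0],
    [0, -1, -2, -1, 0],
    [-1, -2, 16, -2, -1],
    [0, -1, -2, -1, 0],
    [0, 0, -1, 0, 0],
]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def pixel(img, y, x):
    """Read a pixel, clamping coordinates to the image border (assumption)."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def apply_kernel(img, kernel, y, x):
    """Correlate `kernel` with `img`, centred on (y, x)."""
    r = len(kernel) // 2
    return sum(
        kernel[dy + r][dx + r] * pixel(img, y + dy, x + dx)
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
    )


def extract_edges(depth, threshold):
    """Return a boolean edge map for a 2D list of distance values."""
    h, w = len(depth), len(depth[0])
    # S100: Laplacian-of-Gaussian filtering of the original image.
    log = [[apply_kernel(depth, LOG_KERNEL, y, x) for x in range(w)]
           for y in range(h)]
    edges = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # S101: zero-crossing test on the 4-neighborhood (assumed rule).
            neighbours = (pixel(log, y - 1, x), pixel(log, y + 1, x),
                          pixel(log, y, x - 1), pixel(log, y, x + 1))
            if not (log[y][x] <= 0 and any(v > 0 for v in neighbours)):
                continue
            # S102: Sobel gradient magnitude at the zero-crossing point.
            gx = apply_kernel(depth, SOBEL_X, y, x)
            gy = apply_kernel(depth, SOBEL_Y, y, x)
            # S103: keep the point only if the magnitude reaches the threshold.
            edges[y][x] = math.hypot(gx, gy) >= threshold
    return edges
```

The threshold has to be tuned to the depth resolution of the distance image; with 256 gradations, a depth step of 100 produces a Sobel magnitude of 400 in this sketch.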

なお、ここでは、エッジ抽出の一手法として、ガウスラプラシアンフィルタ、Ｓｏｂｅｌオペレータを用いる方法について説明したが、これに限定されるものではなく、ハフ変換を用いる方法など、別の手法を用いて実現しても良い。 Although a method using the Laplacian of Gaussian filter and the Sobel operator has been described here as one technique for edge extraction, the invention is not limited to it; another technique, such as one using the Hough transform, may be used instead.

以上で説明した処理を距離画像に施すことで、顔の距離画像から、エッジ部分のみを抽出することができる。さらに、このエッジ情報と、口唇の形状（ループ状（穴）のエッジを持つもののなかで、一番大きなものなど）の情報を用いることで、口唇部のエッジのみを抽出することができる。 By applying the processing described above to the distance image, only the edge portion can be extracted from the distance image of the face. Furthermore, by using this edge information and information on the shape of the lip (the largest one among the loop-shaped (hole) edges), only the edge of the lip can be extracted.

この方法では、実際の顔の３次元形状をもとに、エッジの抽出を行っているため、従来の２次元画像から色相の変化などを利用してエッジを抽出する方法と比べて、エッジの誤認識（余分なエッジの抽出）をすることがなく、確実に口腔部のみを切り出すことが可能である。これは、３次元形状は実際のエッジに深く関係しているのに対し、色相変化を用いる方法は色相が異なる部分をエッジと見なして判断する一手段ではあるが、決定的なものではないからである。 Because this method extracts edges from the actual three-dimensional shape of the face, it avoids the false edge detections (extraction of extra edges) of conventional methods that extract edges from a two-dimensional image using changes in hue and the like, so that only the oral cavity can be cut out reliably. This is because the three-dimensional shape is closely tied to the actual edges, whereas treating regions of differing hue as edges is one means of judgment but not a decisive one.

以上の処理で、顔の距離画像ストリームから、口唇部のみの距離画像ストリームを取得することができる。 With the above processing, it is possible to acquire a distance image stream of only the lips from the distance image stream of the face.

なお、ここでは、顔の距離画像から、口腔部を抽出する方法として、傾きの変化を利用する方法について説明したが、これに限定されるものではない。例えば、口腔部の「窪み」という幾何学的な形状（奥行きＺ値が一定値以下）を利用して、閾値を設けることで「窪み」部分を抽出してもよいし、幾何学的推論を行うことによって抽出しても良い。また、口腔部の「窪み」状のテンプレートをあらかじめ用意しておいて、それとのパターンマッチングを取ることで求めてもよい。また、距離情報を用いてバンドパスフィルタによるフィルタリング処理を行うことでもエッジを取ることができる。他の３次元形状を利用して抽出する方法でも構わない。 In addition, although the method of using the change of inclination was demonstrated here as a method of extracting an oral cavity part from the distance image of a face, it is not limited to this. For example, using the geometric shape of the “dent” of the oral cavity (depth Z value is below a certain value), the “dent” portion may be extracted by providing a threshold, or geometric inference You may extract by doing. Alternatively, a “dent” -shaped template for the oral cavity may be prepared in advance and obtained by pattern matching with the template. An edge can also be obtained by performing a filtering process using a bandpass filter using distance information. An extraction method using another three-dimensional shape may be used.
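The threshold-based alternative mentioned above (treating pixels whose depth Z value is at or below a fixed value as the "dent") might look like the following sketch. It assumes that larger values mean nearer points, as in FIG. 2, and it adds one step the source does not describe: low-valued pixels connected to the image border are discarded as background, so that only an enclosed dent remains. The function name and that flood-fill step are hypothetical.

```python
from collections import deque


def extract_cavity_mask(depth, z_threshold):
    """Mark pixels at or below z_threshold that are enclosed by nearer pixels."""
    h, w = len(depth), len(depth[0])
    low = [[depth[y][x] <= z_threshold for x in range(w)] for y in range(h)]

    # Assumption: low pixels reachable from the border belong to the background.
    background = [[False] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and low[y][x]:
                background[y][x] = True
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and low[ny][nx] and not background[ny][nx]:
                background[ny][nx] = True
                queue.append((ny, nx))

    # What remains is the enclosed "dent", i.e. the candidate oral cavity.
    return [[low[y][x] and not background[y][x] for x in range(w)]
            for y in range(h)]
```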

次に、画像認識部３について説明する。 Next, the image recognition unit 3 will be described.

画像認識部３は、口腔部抽出部２によって抽出された口腔部の距離画像ストリームをもとに、口唇の形状および／または動きを認識するものである。 The image recognizing unit 3 recognizes the shape and / or movement of the lips based on the distance image stream of the oral cavity extracted by the oral cavity extracting unit 2.

まず、口唇の形状の認識について説明する。 First, recognition of the shape of the lips will be described.

画像認識部３では、「あ」、「い」、…、といった様々なテンプレートを予め用意しておき、それらと口腔部抽出部２で得られた口唇の形状とを比較して、類似度を計算し、類似度の最も高いものを認識結果として採用するという、テンプレートマッチングなどを用いて、認識を行う。 In the image recognizing unit 3, various templates such as “A”, “I”,... Are prepared in advance, and these are compared with the shape of the lip obtained by the oral part extracting unit 2 to determine the similarity. Recognition is performed using template matching or the like in which the highest similarity is calculated and adopted as the recognition result.

図７に画像認識部３におけるテンプレートマッチングの処理の流れの一例を示す。 FIG. 7 shows an example of the template matching processing flow in the image recognition unit 3.

まず、抽出された口腔部の距離画像（原画像）を、テンプレートの方向、サイズに合わせて正規化する（ステップＳ２００）。 First, the extracted distance image (original image) of the oral cavity is normalized in accordance with the direction and size of the template (step S200).

次に、用意した様々なテンプレートの中から、原画像と比較すべきテンプレートｋを選択する（ステップＳ２０１）。 Next, a template k to be compared with the original image is selected from various prepared templates (step S201).

次に、原画像とテンプレートとのハミング距離を計算する（ステップＳ２０２）。ハミング距離（Ｈ）は、例えば、Ｈ＝Σ_iΣ_j｜ｄ（ｉ，ｊ）−ｔ_k（ｉ，ｊ）｜により計算する。ここで、ｉ、ｊはそれぞれ各画素のｘ、ｙ座標、ｄ（ｉ，ｊ）は原画像の座標（ｉ，ｊ）での距離値、ｔ_k（ｉ，ｊ）はテンプレートｋの座標（ｉ，ｊ）での距離値である。 Next, the Hamming distance between the original image and the template is calculated (step S202). The Hamming distance $H$ is calculated by, for example, $H = \sum_i \sum_j |d(i,j) - t_k(i,j)|$. Here, $i$ and $j$ are the x and y coordinates of each pixel, $d(i,j)$ is the distance value of the original image at coordinates $(i,j)$, and $t_k(i,j)$ is the distance value of template $k$ at coordinates $(i,j)$.

なお、ここでは、ハミング距離の導出の一方法を説明したが、ハミング距離の導出は、これに限定されるものではなく、他の計算式を用いても良い。 Here, although one method of deriving the Hamming distance has been described, the deriving of the Hamming distance is not limited to this, and other calculation formulas may be used.

これらの処理を全てのテンプレートについて行うため、全てのテンプレートについて、上述のハミング距離の計算が終了しているか判定する（ステップＳ２０３）。 Since these processes are performed for all templates, it is determined whether the calculation of the Hamming distance is completed for all templates (step S203).

未だハミング距離の計算が終わっていないテンプレートがあれば、ステップＳ２０１に戻る。 If there is a template for which the calculation of the Hamming distance is not yet completed, the process returns to step S201.

全てのテンプレートについて、原画像とのハミング距離の計算が終了したら、それらを比較し、最も値の小さなテンプレートを見つける。そして、このテンプレートの表現している内容を認識結果とする（ステップＳ２０４）。例えば、この選ばれたテンプレートが、「た」を発音している際の口唇形状であったならば、原画像の距離画像の発音（口唇形状）は「た」であったと認識する。 When calculation of the Hamming distance from the original image is completed for all templates, they are compared to find the template with the smallest value. Then, the content expressed by this template is set as a recognition result (step S204). For example, if the selected template has a lip shape when “ta” is pronounced, it is recognized that the pronunciation (lip shape) of the distance image of the original image is “ta”.
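Steps S201 to S204 can be sketched as below. The quantity the text calls the Hamming distance is computed, as in the formula above, as the sum of absolute differences of distance values. The sketch assumes the input has already been normalized to the template's orientation and size (step S200); the function names and the dictionary of labelled templates are hypothetical.

```python
def template_distance(image, template):
    """Sum over all pixels of |d(i, j) - t_k(i, j)| (step S202)."""
    return sum(
        abs(d - t)
        for image_row, template_row in zip(image, template)
        for d, t in zip(image_row, template_row)
    )


def recognize_shape(image, templates):
    """Return the label of the template with the smallest distance (S201-S204).

    `templates` maps a label such as "ta" to a 2D list of distance values
    with the same dimensions as `image`.
    """
    return min(templates, key=lambda label: template_distance(image, templates[label]))
```

Applying `recognize_shape` to each frame of the distance image stream in turn gives one label per frame.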

以上の処理を距離画像ストリームに含まれる、全ての距離画像に対して、順次行うことによって、話者の発話内容の認識が行われる。 The speaker's utterance content is recognized by sequentially performing the above processing on all the distance images included in the distance image stream.

なお、以下では、音声認識と区別するために、口唇形状から話者の発話内容を認識すること（認識対象となった者が現実には音声を出さず、実際に話すときと同じように口唇を動した場合に得られた距離画像に基づく認識を含む）を口唇認識と呼ぶ。 In the following, to distinguish it from speech recognition, recognizing a speaker's utterance from the lip shape is called lip recognition (this includes recognition based on distance images obtained when the person being recognized makes no actual sound but moves the lips as in actual speech).

次に、口唇の動きの認識について説明する。 Next, recognition of lip movement will be described.

口唇の動きの認識を行う場合、例えば、「口を開け閉めしている」、「あくびをしている」といったような、動きを表すテンプレートの列（動きを各フレームに分割し、それぞれを１つのテンプレートとして、一連の動きのテンプレートをまとめたもの）を用意しておき、上述したものと同様に、距離画像ストリームに含まれる全ての距離画像に対して、前記テンプレートの列と順次テンプレートマッチングを行うことで、動きに対する口唇認識を行うこともできる。 When recognizing lip movement, a sequence of templates representing a movement such as "opening and closing the mouth" or "yawning" is prepared (the movement is divided into frames, each frame serving as one template, and the templates for the whole movement are collected into a sequence). As described above, lip recognition of movement can then be performed by sequentially matching every distance image in the distance image stream against the template sequence.
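One way to match a stream against template sequences is sketched below. The source only says that the frames are matched sequentially against the sequence, so the details here are assumptions: the most recent frames of the stream are aligned one-to-one with each template sequence, the per-frame distance is the sum of absolute differences, and the mean distance per frame is used so that sequences of different lengths can be compared. All names are hypothetical.

```python
def frame_distance(frame, template):
    """Sum of absolute differences between two distance images."""
    return sum(
        abs(d - t)
        for frame_row, template_row in zip(frame, template)
        for d, t in zip(frame_row, template_row)
    )


def recognize_motion(stream, motion_templates):
    """Return the motion label whose template sequence best fits the stream.

    `stream` is a list of distance images (oldest first); `motion_templates`
    maps a label to a list of template frames. Returns None if every
    template sequence is longer than the stream.
    """
    best_label, best_score = None, None
    for label, sequence in motion_templates.items():
        if not sequence or len(sequence) > len(stream):
            continue
        recent = stream[-len(sequence):]  # align with the latest frames
        total = sum(frame_distance(f, t) for f, t in zip(recent, sequence))
        score = total / len(sequence)     # mean distance per frame
        if best_score is None or score < best_score:
            best_label, best_score = label, score
    return best_label
```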

以上のような方法で得られた口唇認識の結果は、従来の画像認識と異なり、実際の口唇の３次元形状を利用することによって、認識を行った結果である。従来は、通常のビデオカメラの画像などから抽出した２次元的な口唇形状を用いて認識していたため、口唇の平面的な動きのみから認識を行うしかなかったが、この方法では、上述の通り、３次元の情報を用いることが可能であるため、従来よりも、より多くの情報を用いて認識することが可能である。そこで、正面から見たときの口唇形状がほぼ同じで、口唇の奥行き方向の形状が異なっているというような、従来なら認識が不可能であった場合も、本実施形態の画像認識装置を用いることで認識することが可能となっている。また、識別する手掛かりが増えているため、従来よりも、認識率も高くなり、誤認識し難いという利点もある。 Unlike the conventional image recognition, the result of lip recognition obtained by the above method is the result of recognition using the actual three-dimensional shape of the lips. Conventionally, since recognition was performed using a two-dimensional lip shape extracted from an image of a normal video camera or the like, the recognition had to be performed only from the planar movement of the lips. Since three-dimensional information can be used, more information can be recognized than in the past. Therefore, the image recognition apparatus according to the present embodiment is used even in the case where recognition is impossible in the past, such as when the lip shape when viewed from the front is almost the same and the shape of the lips in the depth direction is different. Can be recognized. In addition, since there are more clues to identify, there is an advantage that the recognition rate is higher than the conventional one, and it is difficult to make erroneous recognition.

なお、ここでは、原画像とテンプレートとのハミング距離を求めることで、原画像とテンプレートの類似度を計算する方法について説明したが、類似度の計算は、これに限定されるものではない。ＤＰマッチング法、ＫＬ変換法などを用いて求める方法、原画像をフーリエ変換し、フーリエ変換後の画像について相関関係を求めることで、類似度を計算する方法など、あらゆる方法を用いることができる。 Although a method of calculating the similarity between the original image and a template from their Hamming distance has been described here, the similarity calculation is not limited to this. Any method can be used, such as DP matching, the KL transform, or Fourier-transforming the original image and computing the correlation of the transformed images.

また、ここでは、口腔部の距離画像ストリームから、口唇の形状、動きを認識する方法として、テンプレートマッチングを行う方法について説明したが、これに限定されるものではなく、例えば、口唇の形状から、筋肉の動きを求めて、その形状変化を手掛かりとして、筋肉モデルから発音内容を類推する、などのように他の方法で認識を行ってもよい。 In addition, here, as a method of recognizing the shape and movement of the lips from the distance image stream of the oral cavity, the method of performing template matching has been described, but the present invention is not limited to this, for example, from the shape of the lips, Recognition may be performed by other methods such as obtaining the movement of the muscle and using the shape change as a clue to analogize the pronunciation content from the muscle model.

以上のように本実施形態によれば、口唇の距離画像を用いることで、あまり計算コストをかけずに、容易に、口唇部を抽出することが可能となる。さらに、口唇認識に関しても、抽出した口唇部の３次元形状の情報を用いることにより、従来方法では、判別に難しかった（誤認識が多かった）ような形状に関する認識や、従来では不可能であったような形状に関する認識が可能になる。 As described above, according to the present embodiment, using a distance image of the lips makes it possible to extract the lips easily and at low computational cost. Furthermore, for lip recognition, using the three-dimensional shape information of the extracted lips makes it possible to recognize shapes that conventional methods had difficulty distinguishing (and often misrecognized), as well as shapes that could not be recognized at all.

以上のようにして得た口唇の形状の認識結果、口唇の動きの認識結果、あるいは口唇の形状の認識結果と口唇の動きの認識結果を組み合わせたものは、その後の種々の処理に供することができる。なお、画像認識部３に、口唇の形状と動きの認識の両方の機能を設けるか、いずれか一方を設けるかは、システムの目的等に応じて適宜設計することが可能である。 The lip shape recognition result, the lip movement recognition result, or the combination of the lip shape recognition result and the lip movement recognition result obtained as described above can be used for various subsequent processing. it can. It should be noted that whether the image recognition unit 3 is provided with both functions of lip shape and movement recognition or either one can be appropriately designed according to the purpose of the system.

本実施形態は、上記した構成に限定されず、種々変形して実施することができる。以下では、本実施形態のいくつかの変形例を示す。 The present embodiment is not limited to the configuration described above, and can be implemented with various modifications. Below, some modifications of this embodiment are shown.

（第１の実施形態の変形例１） (Modification 1 of the first embodiment)
口唇部抽出部２の代わりに、画像取得部１で所得された距離画像ストリームから顔部分のみを抽出するための顔部抽出部を具備してもよい。 Instead of the oral cavity extraction unit 2, a face extraction unit for extracting only the face portion from the distance image stream acquired by the image acquisition unit 1 may be provided.

そして、画像認識部３で、予めＡ氏、Ｂ氏、というように人物の顔形状のテンプレートを用意しておき、それらを用いて顔部抽出部５で抽出された顔部の距離画像とのマッチングを行うことで、本実施形態の画像認識装置で撮像された人物が誰であるのかを認識することができる。 The image recognition unit 3 prepares a template of a person's face shape such as Mr. A and Mr. B in advance, and uses the distance image of the face part extracted by the face part extraction unit 5 using them. By performing matching, it is possible to recognize who is the person imaged by the image recognition apparatus of the present embodiment.

これにより、例えば、本実施形態の画像認識装置（または少なくとも画像取得部１の発光素子と受光素子の部分）を、自動ドアの近くなどに置き、そこを通る人物の顔を認識することで、特定の人物と認識したときのみドアを開ける、といったような、簡単なセキュリティチェックに使うことが可能である。 Thereby, for example, by placing the image recognition device of this embodiment (or at least the light emitting element and the light receiving element part of the image acquisition unit 1) near an automatic door or the like and recognizing the face of a person passing there, It can be used for simple security checks, such as opening the door only when it recognizes a specific person.

（第１の実施形態の変形例２） (Modification 2 of the first embodiment)
本実施形態は、医療面でも重病者の看護に有効である。従来、病室や在宅看護者の家庭などにいる患者が何か異常をきたした場合には、枕元にある押しボタン式のブザーで、看護婦や医者に知らせていた。しかし、押しボタン式のブザーでは、患者が弱っていた場合に、ボタンを押す余裕が無いことが多く、危険であった。このような場所に第１の実施形態の画像認識装置を置くことで、病気で弱っていて、あまり声を出せないような場合でも、病人のわずかな声と、微妙な口唇の動きから、病人が何か伝えたいということを判別することが可能である。 This embodiment is also useful in medical settings for nursing seriously ill patients. Conventionally, when a patient in a hospital room or under home care experienced some abnormality, the patient alerted a nurse or doctor with a push-button buzzer at the bedside. However, a weakened patient often cannot manage to press the button, which was dangerous. By placing the image recognition apparatus of the first embodiment in such a place, even when the patient is weakened by illness and can barely speak, it is possible to determine from the patient's faint voice and subtle lip movements that the patient wants to communicate something.

これを押し進めて、普段口唇を動かすことがない病人が口唇を動かしたら、病状が急変した可能性がある。このような場合には、口唇の動きを何らかの音に変換して、警報音代わりに用いることができ、それにより医者や看護婦が病室や在宅看護者の家庭に駆け付けるような方策をとることができる。 Taking this further, if a patient who does not normally move the lips does move them, the patient's condition may have changed suddenly. In such a case, the lip movement can be converted into some sound and used in place of an alarm, so that a doctor or nurse can rush to the hospital room or the home of the home-care patient.

このような場合、図８に例示するように、口唇認識の結果をそのまま音声に変換し呈示する、または、結果に応じて何らかの音を呈示するを音呈示部４を設ける。 In such a case, as illustrated in FIG. 8, a sound presentation unit 4 is provided that converts the lip recognition result directly into speech and presents it, or presents some sound according to the result.

（第１の実施形態の変形例３） (Modification 3 of the first embodiment)
図９に例示するように、上記の第１の実施形態の変形例２の構成（図８）に、さらに顔部のみの距離画像ストリームを抽出するための顔部抽出部５を付加して、顔の３次元形状情報を用いることで、例えば、顔を上下に振っているなどというように、顔のゼスチャーの認識を行ったり、笑っている、怒っている、困っているなどというように、表情の認識を行うことが可能である。 As illustrated in FIG. 9, a face extraction unit 5 for extracting a distance image stream of the face alone may be added to the configuration of Modification 2 of the first embodiment (FIG. 8). Using the three-dimensional shape information of the face, it is then possible to recognize facial gestures, such as shaking the head up and down, and facial expressions, such as laughing, being angry, or being troubled.

その際、画像認識部３では、例えば、頷く：顔を上下に数回振る、拒む：顔を左右に数回振る、喜ぶ：大きく口があく、目が細くなる、驚く：目を見開く、などというようにゼスチャーや表情などを得るためのテンプレートを用意しておき、それらを用いてテンプレートマッチングを行うことで、顔のゼスチャーや表情の認識を行う。 In that case, the image recognition unit 3 prepares templates for obtaining gestures and expressions, for example, nodding: the face moves up and down several times; refusing: the face moves left and right several times; joy: the mouth opens wide and the eyes narrow; surprise: the eyes open wide. Template matching with these templates recognizes facial gestures and expressions.

そして、この認識した表情に応じて、口唇の動きを音声変換する際に、変換する音声の種類やピッチなどを変えることも可能である。 Then, according to the recognized facial expression, when voice conversion is performed on the movement of the lips, it is possible to change the type and pitch of the voice to be converted.

また、例えば、同じ口唇の動きでも、肯定の場合は犬のなき声、否定の場合はニワトリの鳴き声、喜んでいる場合は猫のなき声というように変化させることもできる。このようにすることで、例えば、子供に、英語の単語発生などを楽しく飽きないように勉強できるようにすることが可能となる。 Also, for example, the same lip movement can be rendered as a dog's bark when affirmative, a chicken's cluck when negative, and a cat's meow when happy. In this way, for example, children can practice pronouncing English words in an enjoyable way without getting bored.

（第２の実施形態） (Second embodiment)
次に、本発明の第２の実施形態について説明する。本実施形態では、第１の実施形態と相違する部分を中心に説明する。 Next, a second embodiment of the present invention will be described. The description focuses on the differences from the first embodiment.

図１０は、本発明の第２の実施形態に係る画像認識装置の全体構成図である。 FIG. 10 is an overall configuration diagram of an image recognition apparatus according to the second embodiment of the present invention.

図１０に示されるように、本実施形態の画像認識装置は、第１の実施形態の画像認識装置の構成に対して、画像認識部３で得られた口唇の形状もしくは動きの認識結果をもとに、話者の顔の向いている方向を識別するための方向識別部６が追加された構成になっている。 As shown in FIG. 10, the image recognition apparatus according to the present embodiment has a recognition result of the shape or movement of the lips obtained by the image recognition unit 3 with respect to the configuration of the image recognition apparatus according to the first embodiment. In addition, a direction identification unit 6 for identifying the direction in which the speaker's face is facing is added.

これにより話者の発言内容だけでなく、同時に、話者がどちらの方向を向いて話しているかを認識することができる。 This makes it possible to recognize not only the content of the speaker's speech but also the direction in which the speaker is speaking.

次に、方向識別部６について説明する。 Next, the direction identification unit 6 will be described.

方向識別部６では、画像認識部３で得られた口唇の形状もしくは動きの認識結果をもとに、話者の顔の向いている方向を識別する。その際、口唇の３次元形状を利用することで、話者の顔の向きを計算する。 The direction identification unit 6 identifies the direction in which the speaker's face is facing based on the recognition result of the lip shape or movement obtained by the image recognition unit 3. At that time, the direction of the speaker's face is calculated by using the three-dimensional shape of the lips.

以下では、話者の顔の向いている方向を求める具体的な方法の一例について、図１１に示す処理の流れ図を用いて説明する。 In the following, an example of a specific method for obtaining the direction in which the speaker's face is facing will be described using the flowchart shown in FIG. 11.

まず、口唇の距離画像中のある画素Ｘ（例えば座標値（ｉ、ｊ））を選択する（ステップＳ３００）。 First, a certain pixel X (for example, coordinate value (i, j)) in the lip distance image is selected (step S300).

次に、画素Ｘと隣接している画素Ｙ（例えば座標値（ｉ−１、ｊ））を選択する（ステップＳ３０１）。 Next, a pixel Y (for example, coordinate values (i−1, j)) adjacent to the pixel X is selected (step S301).

次に、図１２（（ａ）は隣接８画素を示す図、（ｂ）は傾きベクトルｇとこれに直交する法線ベクトルｐを説明するための図）のように、選択した画素Ｙ（例えば座標値（ｉ−１、ｊ））との距離値の差ｄ（ｉ、ｊ）−ｄ（ｉ−１、ｊ）をもとに、この２画素間の傾きベクトルｇを求める（ステップ３０２）。 Next, as shown in FIG. 12 ((a) shows the eight adjacent pixels; (b) illustrates the gradient vector g and the normal vector p orthogonal to it), the gradient vector g between the two pixels is obtained from the difference in distance value from the selected pixel Y (for example, coordinates (i−1, j)), d(i, j) − d(i−1, j) (step S302).

この２画素Ｘ、Ｙと同一平面上にあり、ステップＳ３０２で得られた傾きベクトルｇと直行する法線ベクトルｐを求める（ステップＳ３０３）。 A normal vector p that is on the same plane as the two pixels X and Y and is orthogonal to the gradient vector g obtained in step S302 is obtained (step S303).

画素Ｘと隣接する全ての画素Ｙについて法線ベクトルの計算が終了したか判別する（ステップＳ３０４）。 It is determined whether the normal vector calculation has been completed for all the pixels Y adjacent to the pixel X (step S304).

全ての隣接画素について終了していなかったら、ステップＳ３０１に戻る。全てについて終了していたら、この法線ベクトルの平均Ｐ＝Σｐを計算し、画素Ｘの法線ベクトルＰとする（ステップＳ３０５）。 If the process has not been completed for all adjacent pixels, the process returns to step S301. If all the processing has been completed, the average P = Σp of the normal vectors is calculated and set as the normal vector P of the pixel X (step S305).

以上の処理を距離画像中の全ての画素について行ったかどうか判定する（ステップＳ３０６）。行っていなかったら、ステップＳ３００に戻る。 It is determined whether or not the above processing has been performed for all the pixels in the distance image (step S306). If not, the process returns to step S300.

全ての画素について、法線ベクトルＰの計算が終了したら、各画素の法線ベクトルの平均Ｐ_lip＝Σｐを計算し、これを口唇の法線ベクトルとする（ステップＳ３０７）。 When the normal vector P has been calculated for all pixels, the average of the per-pixel normal vectors, P_lip = Σp, is calculated and taken as the normal vector of the lips (step S307).

口唇は、顔のほぼ中央にあり、ほぼ左右上下対称形状であるため、口唇の法線ベクトルと顔の法線ベクトルの方向は、おおむね一致する。そのため、ステップＳ３０７で得られたＰ_lipが顔の法線ベクトルとなる。つまり、法線ベクトルＰ_lipを顔の向きとして話者の向いている方向を識別することができる。 Because the lips lie near the center of the face and are roughly symmetric both horizontally and vertically, the normal vector of the lips and the normal vector of the face point in approximately the same direction. The P_lip obtained in step S307 therefore serves as the normal vector of the face. In other words, the direction in which the speaker is facing can be identified by taking the normal vector P_lip as the face orientation.
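Steps S300 to S307 can be sketched as follows. Several details are assumptions not stated in the source: pixel spacing and depth gradations are treated as the same unit, the depth difference is taken as d(Y) − d(X), each pairwise normal is built in the vertical plane through the two pixels and oriented toward +z, vectors are normalized before averaging, and neighbors outside the image are skipped. The function names are hypothetical.

```python
import math

# Offsets (dx, dy) of the eight pixels adjacent to a pixel.
NEIGHBOUR_OFFSETS = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dx, dy) != (0, 0)]


def normalize(v):
    """Scale a 3D vector to unit length (zero vector stays zero)."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v) if length > 0 else (0.0, 0.0, 0.0)


def pair_normal(dx, dy, dz):
    """Normal p orthogonal to the gradient g = (dx, dy, dz) (S302-S303).

    p lies in the plane spanned by the horizontal direction (dx, dy) and
    the z axis, and points toward +z.
    """
    step = math.hypot(dx, dy)
    ux, uy = dx / step, dy / step
    return normalize((-dz * ux, -dz * uy, step))


def pixel_normal(depth, x, y):
    """Average of the pairwise normals over the adjacent pixels (S301-S305)."""
    h, w = len(depth), len(depth[0])
    total = [0.0, 0.0, 0.0]
    for dx, dy in NEIGHBOUR_OFFSETS:
        nx, ny = x + dx, y + dy
        if 0 <= nx < w and 0 <= ny < h:
            p = pair_normal(dx, dy, depth[ny][nx] - depth[y][x])
            total = [t + c for t, c in zip(total, p)]
    return normalize(total)


def lip_normal(depth):
    """Average of the per-pixel normals, used as the face direction (S306-S307)."""
    total = [0.0, 0.0, 0.0]
    for y in range(len(depth)):
        for x in range(len(depth[0])):
            total = [t + c for t, c in zip(total, pixel_normal(depth, x, y))]
    return normalize(total)
```

For a flat distance image the result points straight along the z axis; a surface whose values increase with x tilts the result toward negative x.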

Here, calculating the normal vector of the lips from the distance image has been described as one means of obtaining the direction in which the lips are facing. However, the method is not limited to this; other methods may be used, such as inferring the direction of the lips from the ratio of their dimensions or from changes in their shape.

As described above, according to the present embodiment, it is possible to recognize simultaneously which direction the speaker is facing and what the speaker is saying or how the lips are moving.

(Modification 1 of the second embodiment)
As shown in FIG. 13, a face extraction unit 5 may be provided in place of the oral cavity extraction unit 2 to extract only the face from the distance image stream of all or part of the face acquired by the image acquisition unit 1. In this case, the distance image stream of the face extracted by the face extraction unit 5 is input to the image recognition unit 3.

The image recognition unit 3 then holds templates for recognizing gestures and facial expressions, for example: nodding (the face moves up and down several times), refusing (the face moves left and right several times), joy (the mouth opens wide and the eyes narrow), and surprise (the eyes open wide). By template matching between these templates and the input distance image stream of the face, it can recognize gestures such as nodding and changes of expression such as being pleased, surprised, or troubled.

方向識別部６では、画像認識部３で得られた顔部の形状、動きの認識結果をもとに、話者の顔の向いている方向を識別する。 The direction identification unit 6 identifies the direction in which the speaker's face is facing based on the recognition result of the shape and movement of the face obtained by the image recognition unit 3.

With this modification, it is possible to recognize which direction the target person is facing and what facial action (gesture, change of expression, etc.) the person is performing.

(Modification 2 of the second embodiment)
In the second embodiment, the direction identification unit 6 identifies the direction in which the speaker is facing based on the recognition result of the image recognition unit 3. Alternatively, as shown in FIG. 14, a face extraction unit 5 may be newly added to extract only the face portion from the distance image stream of the face acquired by the image acquisition unit 1 (which includes the background and the like), and the direction identification unit 6 may identify the direction in which the speaker is facing based on the distance image stream of the face extracted by the face extraction unit 5. In this case, the direction identification unit 6 obtains the direction in which the speaker is facing by calculating the normal direction of the face (for example, the average of the normal directions of the pixels constituting the face) from that distance image stream.

Whereas the second embodiment derives the direction of the face from the direction of the lips, this modification obtains the direction of the face directly, so that finer and more subtle face orientations can be obtained.

(Third embodiment)
Next, a third embodiment of the present invention will be described. The description focuses on the parts that differ from the second embodiment.

図１５は、本発明の第３の実施形態に係る画像認識装置の全体構成図である。 FIG. 15 is an overall configuration diagram of an image recognition apparatus according to the third embodiment of the present invention.

As shown in FIG. 15, the image recognition apparatus of the present embodiment adds, to the configuration of the image recognition apparatus of the second embodiment or of its modifications, a voice recognition unit 7 that recognizes the content of the speaker's utterance, and a voice recognition start unit 8 that instructs the voice recognition unit 7 to start voice recognition based on the direction of the speaker's face obtained by the direction identification unit 6.

これにより話者の顔の向いている方向に応じて、音声認識を行うことができる。 This makes it possible to perform speech recognition according to the direction in which the speaker's face is facing.

次に、音声認識部７について説明する。 Next, the voice recognition unit 7 will be described.

音声認識部７は、マイクなどの音声の入力装置を用いて入力された音声の内容を認識するものである。音声認識部７では、種々の認識手法を用いることが可能である。例えば、隠れマルコフモデルなどを用いて実現してもよい。音声認識を行うことで、話者の会話の内容を認識することができる。 The voice recognition unit 7 recognizes the content of voice input using a voice input device such as a microphone. The voice recognition unit 7 can use various recognition methods. For example, it may be realized using a hidden Markov model. By performing speech recognition, it is possible to recognize the content of the speaker's conversation.

次に、音声認識開始部８について説明する。 Next, the voice recognition start unit 8 will be described.

The voice recognition start unit 8 instructs the voice recognition unit 7 to start voice recognition based on the result obtained by the direction identification unit 6. Here, for example, when the speaker faces the front of the image recognition apparatus of the present embodiment (that is, faces the light receiving element of the image acquisition unit 1; the same applies hereinafter), the speaker's action is regarded as having started, and at that point an instruction to start voice recognition is sent to the voice recognition unit 7.

以上のように本実施形態によれば、話者の動作に応じて、音声認識を開始することが可能である。例えば、話者が（本実施形態の画像認識装置に対して）正面を向いたときに音声認識を開始することができる。 As described above, according to the present embodiment, it is possible to start speech recognition according to the operation of the speaker. For example, voice recognition can be started when the speaker is facing the front (relative to the image recognition device of the present embodiment).
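One way to picture the start trigger is the sketch below. It is only an illustration under stated assumptions: the face normal is given as a 3-component vector whose third component is negative when pointing at the sensor, "facing the front" is approximated by an angular tolerance (`threshold_deg`, a hypothetical parameter), and the class and function names are invented for this example.

```python
import math


def angle_from_front_deg(face_normal):
    """Angle between the face normal and the direction pointing at the sensor."""
    n_i, n_j, n_d = face_normal
    length = math.sqrt(n_i * n_i + n_j * n_j + n_d * n_d)
    if length == 0.0:
        raise ValueError("face normal must be non-zero")
    # Assumed convention: (0, 0, -1) points straight at the light receiving element.
    cos_angle = max(-1.0, min(1.0, -n_d / length))
    return math.degrees(math.acos(cos_angle))


class VoiceRecognitionStarter:
    """Signals a start instruction once, when the speaker turns to face the front."""

    def __init__(self, threshold_deg=15.0):
        self.threshold_deg = threshold_deg  # hypothetical angular tolerance
        self.was_facing = False

    def update(self, face_normal):
        """Return True only on the frame where the speaker starts facing the front."""
        facing = angle_from_front_deg(face_normal) <= self.threshold_deg
        start = facing and not self.was_facing
        self.was_facing = facing
        return start
```

Calling `update` once per frame with the normal from the direction identification step yields `True` on the transition to facing the front, which is where a start instruction would be issued in this sketch.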

また、本実施形態によれば、画像認識部３による口唇認識（読唇）の結果も得られるため、音声認識と口唇認識（読唇）を同時に行うことが可能となり、これら２つの認識の結果を総合的に用いることにより、話者の会話内容について、より高い認識率を持つ認識結果を得ることができる。 Further, according to the present embodiment, since the result of lip recognition (lip reading) by the image recognition unit 3 can also be obtained, it is possible to perform voice recognition and lip recognition (lip reading) at the same time, and combine the results of these two recognitions. By using this method, a recognition result having a higher recognition rate can be obtained for the conversation contents of the speaker.

This is very effective in situations such as the following. For example, in a noisy place such as a construction site where voices are hard to hear, voice recognition alone suffers a lower recognition rate and in some cases fails entirely. If lip recognition is performed at the same time, as in the third embodiment, the overall recognition rate can be kept high, because lip recognition is not affected by noise. Likewise, in a quiet place such as a library where one cannot speak loudly, voice recognition alone must work from very faint speech, so the recognition rate is likely to fall; for the same reason, performing lip recognition at the same time keeps the overall recognition rate high.

Also, when two people are talking, conventional voice recognition receives several voices at the same time, which makes it difficult to determine the recognition target. In the present embodiment, the recognition target can be determined easily, for example by recognizing only the person who is facing the front of the image recognition apparatus. Since lip recognition is performed at the same time, that information can also be used to determine the recognition target.

(Modification 1 of the third embodiment)
In the third embodiment, an example was described in which the voice recognition unit 7 and the voice recognition start unit 8 are provided and voice recognition is started based on the result obtained by the direction identification unit 6. However, the embodiment is not limited to voice recognition; any other recognition means may be used.

(Modification 2 of the third embodiment)
In the third embodiment, the direction of the speaker's face is used to instruct the start of voice recognition. Alternatively, as shown in FIG. 16, a voice recognition end unit 9 that instructs the voice recognition unit 7 to end voice recognition may be provided in place of the voice recognition start unit 8, so that the direction is used to instruct the end of voice recognition.

In this case, voice recognition can be ended according to the speaker's action. For example, voice recognition can be ended when the speaker turns his or her face away from the image recognition apparatus of the present embodiment.

もちろん、図１５にさらに音声認識終了部９を設け、音声認識の開始と終了の両方の指示に用いてもよい。 Of course, the voice recognition end unit 9 may be further provided in FIG. 15 and used for both the start and end instructions of voice recognition.

(Modification 3 of the third embodiment)
Instead of obtaining the direction of the speaker's face with the direction identification unit 6 and using it to instruct the start of voice recognition, a new voice recognition start unit 8 may be provided, as shown in FIG. 17, that detects the onset of lip movement at the beginning of speech from the recognition result obtained by the image recognition unit 3 and, on that basis, instructs the voice recognition unit 7 to start voice recognition.

In this case, the voice recognition start unit 8 determines, from the lip recognition result obtained by the image recognition unit 3, the point at which the lips start to move (when a person begins to speak, the lips begin to move slightly before any sound is produced), and at that point instructs the voice recognition unit 7 to start voice recognition.
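The text does not specify how the onset of lip movement is computed. As one hypothetical possibility, the sketch below flags the first frame of the mouth distance image stream whose mean absolute difference from the previous frame exceeds a threshold. Simple frame differencing, the function names, and the threshold value are all assumptions made for illustration.

```python
def mean_abs_difference(frame_a, frame_b):
    """Mean absolute per-pixel difference between two equally sized distance images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def find_lip_onset(frames, threshold=0.5):
    """Return the index of the first frame that differs from its predecessor
    by more than `threshold`, or None if no such frame exists."""
    for k in range(1, len(frames)):
        if mean_abs_difference(frames[k - 1], frames[k]) > threshold:
            return k
    return None
```

In this sketch, the returned index marks the frame at which a start instruction would be sent; the same comparison with the condition reversed could mark the end of lip movement.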

また、同様に、本変形例３の音声認識開始部８の代わりに、口唇の動作が終了する点を検出する音声認識終了部９を置き、音声認識の終了の指示に用いても良い。 Similarly, instead of the voice recognition start unit 8 of the third modification, a voice recognition end unit 9 that detects a point at which the movement of the lips ends may be placed and used for an instruction to end the voice recognition.

Of course, a voice recognition end unit 9 that detects the point at which the lip movement ends may likewise be provided in addition to the voice recognition start unit 8 of Modification 3, so that both the start and the end of voice recognition are instructed.

In conventional methods, the calculation needed to detect the onset of lip movement takes time, so it has been difficult to use such detection in real-time processing of this kind. In the image recognition apparatus of the present embodiment, as described in the first embodiment, the lip portion can be extracted at low computational cost, so the onset of lip movement can be detected fully in real time.

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described. The description focuses on the parts that differ from the first embodiment.

図１８は、本発明の第４の実施形態に係る画像認識装置の全体構成図である。 FIG. 18 is an overall configuration diagram of an image recognition apparatus according to the fourth embodiment of the present invention.

図１８に示されるように、本実施形態の画像認識装置は、第２の実施形態の画像認識装置の構成に対して、各種の情報の提示を行う情報呈示部１０と、方向識別部６で得られた話者の顔の向いている方向をもとに情報呈示の開始を情報呈示部１０に指示するための情報呈示開始部１１が追加された構成になっている。 As shown in FIG. 18, the image recognition apparatus according to the present embodiment includes an information presentation unit 10 that presents various types of information and a direction identification unit 6 with respect to the configuration of the image recognition apparatus according to the second embodiment. An information presentation start unit 11 for instructing the information presentation unit 10 to start information presentation based on the obtained direction of the speaker's face is added.

これにより話者の顔の向いている方向に応じて、各種の情報呈示を行うことができる。 As a result, various types of information can be presented according to the direction in which the speaker's face is facing.

次に、情報呈示部１０について説明する。 Next, the information presentation unit 10 will be described.

The information presentation unit 10 presents some kind of information to the target person (speaker). It includes at least one information presentation device, such as a display (presenting images, text, etc.), a speaker (presenting sound), or a force feedback device (presenting tactile sensation), through which information can be presented to the target person.

Next, the information presentation start unit 11 will be described.

The information presentation start unit 11 plays the same role as the voice recognition start unit 8 in the third embodiment described above: based on the result obtained by the direction identification unit 6, it instructs the information presentation unit 10 to start presenting information.

本実施形態によれば、話者の動作に応じて、情報呈示を開始することが可能である。例えば、話者が（本実施形態の画像認識装置に対して）正面を向いたときに、それを話者の行為開始とみなし、情報呈示を開始することができる。 According to this embodiment, it is possible to start information presentation according to the operation of the speaker. For example, when the speaker faces the front (with respect to the image recognition apparatus of the present embodiment), it can be regarded as the start of the speaker's action and information presentation can be started.

また、画像認識部３による口唇認識（読唇）の結果も得られているため、話者の会話の内容に応じて、情報呈示を開始することも可能である。 Further, since the result of lip recognition (lip reading) by the image recognition unit 3 is also obtained, it is possible to start presenting information according to the content of the conversation of the speaker.

(Modification 1 of the fourth embodiment)
As in Modification 2 of the third embodiment, an information presentation end unit may be provided in place of, or in addition to, the information presentation start unit 11 to instruct the end of presentation.

(Modification 2 of the fourth embodiment)
As in Modification 3 of the third embodiment, a new information presentation start unit 11 may be provided that detects the onset of lip movement at the beginning of speech from the recognition result obtained by the image recognition unit 3 and, on that basis, instructs the information presentation unit 10 to start presenting information.

This makes so-called voice synchronization (lip sync) possible. For example, if speech synthesis is used as the information presentation method and the content recognized from the shape and movement of the lips is output as synthesized speech, a person who cannot speak because of a throat illness or the like can have the image recognition apparatus of the present embodiment speak on his or her behalf simply by mouthing the words (moving the lips as if actually speaking, without producing any voice).

Of course, as in Modification 3 of the third embodiment, an information presentation end unit may be provided in place of, or in addition to, the information presentation start unit 11 of this modification to instruct the end of presentation.

(Modification 3 of the fourth embodiment)
As shown in FIG. 19, an information presentation switching unit 12 for switching the type of information presented may be provided in place of the information presentation start unit 11, so that the form of information presentation is switched according to the direction in which the speaker is facing.

Possible ways of switching the form of information presentation include: (1) adding a different form of information presentation; (2) when several forms of information presentation are being provided, stopping at least one of them; and (3) when one or more forms of information presentation are being provided, changing some or all of them to different forms (including cases where the number of forms changes).

こうすることで、話者の顔が（本実施形態の画像認識装置の方を）向いていないときには、音声のみの情報呈示を行っていて、話者の顔が向いたときには、情報呈示切り替え部１２を用いて、音声のみの呈示から、音声に加えて、画像などの複合メディアを用いた情報呈示に切り替える、などということが可能である。 In this way, when the speaker's face is not facing (toward the image recognition apparatus of the present embodiment), the information presentation only by voice is performed, and when the speaker's face is facing, the information presentation switching unit 12, it is possible to switch from presenting only sound to presenting information using composite media such as images in addition to sound.
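The audio-only versus audio-plus-image example can be expressed as a small set computation. The sketch below is purely illustrative: the form names `"audio"` and `"video"` and both function names are hypothetical, and it covers only switching by adding or stopping forms.

```python
def select_presentation_forms(facing_apparatus):
    """Return the set of presentation forms for the current face direction."""
    forms = {"audio"}        # audio is presented regardless of face direction
    if facing_apparatus:
        forms.add("video")   # add a visual form when the speaker faces the apparatus
    return forms


def switch_forms(current, facing_apparatus):
    """Return (forms_to_start, forms_to_stop) for moving to the new selection."""
    target = select_presentation_forms(facing_apparatus)
    return target - current, current - target
```

When the speaker turns toward the apparatus, `switch_forms({"audio"}, True)` reports that the visual form should be started and nothing stopped; turning away reverses this.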

This can be used, for example, to explain exhibits in a museum or art gallery: normally a description is read aloud by voice, and when a visitor looks toward the exhibit (or, in addition, says something), an explanatory video starts playing on a display placed beside the exhibit.

(Modification 4 of the fourth embodiment)
By combining the fourth embodiment with the voice recognition unit, voice recognition start unit, voice recognition end unit, and so on described in the third embodiment, the speaker's live voice can be presented in combination with image information generated by the information presentation unit 10.

例えば、口腔部抽出部２で抽出した口腔部の距離画像ストリームを用いて、情報呈示部１０でその形状を３次元ＣＧ合成を行い、それに、音声認識部で取得した話者の生の音声を組み合わせることで、話者の生の声と音声同期（リップシンク）して口唇が動く３次元ＣＧを提供することができる。 For example, using the distance image stream of the oral cavity extracted by the oral cavity extraction unit 2, the information presentation unit 10 performs three-dimensional CG synthesis on the shape, and the raw speech of the speaker acquired by the speech recognition unit is used. By combining, it is possible to provide a three-dimensional CG in which the lips move in synchronism (lip sync) with the voice of the speaker.

(Fifth embodiment)
Next, a fifth embodiment of the present invention will be described.

The image recognition apparatus of the fifth embodiment adds a communication unit (not shown) that communicates with the outside to the configuration of the image recognition apparatus of the first, second, third, or fourth embodiment or of any of their various modifications.

これにより第１、第２、第３、あるいは第４の実施形態やその変形例で得られた所望の情報を外部に通信することができる。 Thereby, the desired information obtained in the first, second, third, or fourth embodiment or its modification can be communicated to the outside.

The communication unit transmits input data to the outside over a communication path such as a telephone line. With this unit added, it is possible to transmit, for example, the lip recognition result in the first embodiment; the lip recognition result and the direction in which the speaker is facing in the second embodiment; the lip recognition and voice recognition results in the third embodiment; and the lip recognition result and the presented information in the fourth embodiment.

As described above, according to the present embodiment, the results obtained by the image recognition apparatus can be communicated over the Internet or the like: the lip recognition result when based on the first embodiment; the speaker direction and lip recognition result when based on the second embodiment; the lip and voice recognition results when based on the third embodiment; and the lip recognition result and presented information when based on the fourth embodiment.

For example, Modification 4 of the fourth embodiment yields a three-dimensional CG whose lips move in synchronization (lip sync) with the speaker's live voice. If the parts of the face other than the lips are sent to the communication partner in advance, and only the lip portion of the three-dimensional CG is then sent in real time through the communication unit together with the speaker's utterance and composited at the destination with the previously sent face, lip sync of the three-dimensional CG can be achieved without loading the communication path (that is, without making the communication path a bottleneck). This is very effective for real-time processing of relatively large data such as voice and CG over the Internet and similar networks, where the communication path tends to become a speed bottleneck.

The configuration of the image acquisition unit 1 in each of the above embodiments is described in detail below.

FIG. 20 shows a configuration example of the image acquisition unit 1. The image acquisition unit 1 comprises a light emitting unit 101 that irradiates the target object with light, a reflected light extraction unit 102 that extracts the light reflected from the target object as an image, a distance image generation unit 103 that generates a distance image from the information of the imaged reflected light, and a timing control unit 104 that controls the operation timing of these units.

発光部１０１は、発光素子を持ち、タイミング制御部１０４によって生成されるタイミング信号に従って時間的に強度変動する光を発光する。発光部１０１が発した光は、発光部１０１の発光素子の前方にある対象物体により反射された後に、反射光抽出部１０２の受光面に入射する。 The light emitting unit 101 has a light emitting element and emits light whose intensity varies with time according to the timing signal generated by the timing control unit 104. The light emitted from the light emitting unit 101 is reflected by the target object in front of the light emitting element of the light emitting unit 101 and then enters the light receiving surface of the reflected light extracting unit 102.

物体からの反射光は、物体の距離が大きくなるにつれ大幅に減少する。物体の表面が一様に光を散乱する場合、反射光画像１画素あたりの受光量は物体までの距離の２乗に反比例して小さくなる。従って、当該受光面の前に物体が存在する場合、背景からの反射光はほぼ無視できるくらいに小さくなり、物体のみからの反射光画像を得ることができる。 The reflected light from the object decreases significantly as the distance of the object increases. When the surface of the object scatters light uniformly, the amount of received light per pixel of the reflected light image decreases in inverse proportion to the square of the distance to the object. Therefore, when an object is present in front of the light receiving surface, the reflected light from the background becomes small enough to be ignored, and a reflected light image from only the object can be obtained.

For example, when a human face is present in front of the light receiving surface, a reflected light image of that face is obtained. Each pixel value of the reflected light image represents the amount of reflected light received by the unit light receiving unit corresponding to that pixel. The amount of reflected light is affected by the properties of the object (whether it reflects light specularly, scatters it, absorbs it, and so on), the orientation of the object, the distance to the object, and so on; when the whole object scatters light uniformly, however, the amount of reflected light is closely related to the distance to the object. Since a face has this property, the reflected light image obtained with a face as the target object reflects the three-dimensional shape of the face, its distance, its inclination (distance that differs from part to part), and so on.

反射光抽出部１０２は、マトリクス状に配列した、光の量を検出する受光素子を持ち、発光部１０１が発した光の対象物体による反射光の空間的な強度分布を抽出する。この反射光の空間的な強度分布は、画像として捉えることができるので、以下では反射光画像と呼ぶ。 The reflected light extraction unit 102 has a light receiving element that detects the amount of light arranged in a matrix, and extracts a spatial intensity distribution of reflected light from the target object of light emitted from the light emitting unit 101. Since the spatial intensity distribution of the reflected light can be grasped as an image, it is hereinafter referred to as a reflected light image.

In the light receiving elements of the reflected light extraction unit 102, it is generally expected that external light such as illumination light and sunlight is received together with the light from the light emitting unit 101 reflected by the target object. The reflected light extraction unit 102 of this configuration example therefore takes the difference between the amount of light received while the light emitting unit 101 is emitting and the amount received while it is not, thereby extracting only the component of the light from the light emitting unit 101 reflected by the target object. The timing of this light reception is also controlled by the timing control unit 104.

そして、反射光抽出部１０２により得られた外光補正後の反射光画像の各画素に対応する反射光量（アナログ信号）が必要に応じて増幅された後にＡ／Ｄ変換され、これによってデジタル化された反射光画像が得られる。 Then, the reflected light amount (analog signal) corresponding to each pixel of the reflected light image after external light correction obtained by the reflected light extraction unit 102 is amplified as necessary, and then A / D converted, thereby digitizing. A reflected light image is obtained.

The distance image generation unit 103 generates a distance image (for example, an image of 64 × 64 pixels with 256 gradations) by converting the received light amount (digital data) of each pixel of the reflected light image obtained by the reflected light extraction unit 102 into a distance value.
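The conversion formula itself is not given in the text. As a rough sketch, assuming a uniformly scattering surface so that the received light amount falls off with the square of the distance (received = k / distance²), the conversion might look like the following. The calibration constant `k`, the range limits `d_min` and `d_max`, the linear quantization to 256 levels, and the function names are all assumptions made for illustration.

```python
import math


def light_to_distance(received, k=1.0):
    """Estimate distance from a received light amount, assuming received = k / distance**2."""
    if received <= 0:
        return math.inf  # no reflected light: treat the pixel as background
    return math.sqrt(k / received)


def to_distance_image(reflected, k=1.0, d_min=0.0, d_max=1.0, levels=256):
    """Convert a reflected light image into a distance image with `levels` gradations."""
    image = []
    for row in reflected:
        out_row = []
        for received in row:
            # Clamp the estimate to the representable range before quantizing.
            d = min(max(light_to_distance(received, k), d_min), d_max)
            out_row.append(round((d - d_min) / (d_max - d_min) * (levels - 1)))
        image.append(out_row)
    return image
```

In this sketch, larger gradation values mean greater distance, and pixels with no reflected light saturate at the farthest level; an actual device would need calibration for surface reflectance and sensor gain.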

次に、図２１に、画像取得部１のより詳しい一構成例を示す。 Next, FIG. 21 shows a more detailed configuration example of the image acquisition unit 1.

発光部１０１より発光された光は、対象物体１０６に反射して、レンズ等の受光光学系１０７により、反射光抽出部１０２の受光面上に結像する。 The light emitted from the light emitting unit 101 is reflected by the target object 106 and imaged on the light receiving surface of the reflected light extracting unit 102 by the light receiving optical system 107 such as a lens.

The reflected light extraction unit 102 detects the intensity distribution of this reflected light, that is, the reflected light image. It comprises a first light receiving unit 121 and a second light receiving unit 122 provided for each pixel (unit light receiving unit), and a difference calculation unit 123, of which one is provided for all pixels (or one for each group of pixels, or one for each pixel).

The first light receiving unit 121 and the second light receiving unit 122 receive light at different timings. The timing control unit 104 controls these timings so that the light emitting unit 101 emits light while the first light receiving unit 121 is receiving and does not emit light while the second light receiving unit 122 is receiving. As a result, the first light receiving unit 121 receives the light from the light emitting unit 101 reflected by the object together with external light such as sunlight and illumination light, whereas the second light receiving unit 122 receives only external light. Because the two receive light at different but closely spaced timings, fluctuations in the external light and displacement of the target object between them can be ignored.

Therefore, when the difference calculation unit 123 takes the difference between the image received by the first light receiving unit 121 and the image received by the second light receiving unit 122, only the component of the light reflected by the target object is extracted. When one difference calculation unit 123 is shared by a plurality of pixels, the differences are calculated sequentially.
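The external light correction described above can be sketched in software as a per-pixel subtraction. This is only an illustrative sketch, not the circuit of the difference calculation unit 123: the function name `extract_reflected_light`, the list-of-lists frame layout, and the clamping of negative differences to zero are all assumptions introduced here.

```python
def extract_reflected_light(lit_frame, unlit_frame):
    """Subtract the ambient-only frame from the lit frame, pixel by pixel.

    lit_frame   -- rows of values captured while the emitter is on
                   (reflected light + external light)
    unlit_frame -- rows of values captured while the emitter is off
                   (external light only)
    """
    if len(lit_frame) != len(unlit_frame):
        raise ValueError("frames must have the same number of rows")
    reflected = []
    for lit_row, unlit_row in zip(lit_frame, unlit_frame):
        if len(lit_row) != len(unlit_row):
            raise ValueError("rows must have the same number of pixels")
        # Assumption: negative differences are treated as noise and clamped to 0.
        reflected.append([max(lit - unlit, 0) for lit, unlit in zip(lit_row, unlit_row)])
    return reflected


# Hypothetical 2x2 frames, for illustration only.
lit = [[120, 80], [60, 200]]
unlit = [[100, 75], [60, 50]]
reflected = extract_reflected_light(lit, unlit)
```

Because the two frames are captured at closely spaced timings, the external light term is nearly identical in both and cancels in the difference.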

Various actual configurations are conceivable for the first light receiving unit 121 and the second light receiving unit 122 of the unit light receiving unit. For example, instead of providing a separate light receiving element for each of the first light receiving unit 121 and the second light receiving unit 122, one photoelectric conversion element (for example, a photodiode) may be provided per unit light receiving unit and shared by the first light receiving unit 121 and the second light receiving unit 122, while two charge storage elements (for example, capacitors) that accumulate a charge corresponding to the amount of received light are provided, one for each of the first light receiving unit 121 and the second light receiving unit 122.

In this way, the reflected light extraction unit 102 outputs the reflected light amount of each pixel of the reflected light image after external light correction. Here, the reflected light amounts of the pixels are assumed to be output sequentially.

The output from the reflected light extraction unit 102 is amplified by the amplifier 131, converted into digital data by the A/D converter 132, and then stored in the memory 133 as image data. The stored data is read out from this memory at an appropriate timing and supplied to the distance image generation unit 103.

The distance image generation unit 103 generates a distance image based on the reflected light image obtained by the reflected light extraction unit 102. For example, the reflected light amount of each pixel of the reflected light image is converted into digital data of a predetermined gradation (for example, 256 gradations). In this conversion, processing such as the following may be performed as appropriate: (1) correction for the nonlinearity of the received light amount with respect to the distance to the target object (the received light amount is inversely proportional to the square of that distance); (2) correction for variations and nonlinearity in the characteristics of the light receiving element corresponding to each pixel; or (3) removal of background and noise (for example, setting the gradation of pixels whose received light amount is at or below a reference value to 0).
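The conversion can be sketched as follows, combining correction (1) for the inverse-square relation with background removal (3). This is a minimal sketch under stated assumptions, not the conversion used by the distance image generation unit 103: the calibration constant `k`, the `threshold`, the `max_distance` scale, and the choice to encode nearer pixels as larger gradation values are all hypothetical.

```python
import math


def to_distance_image(reflected, threshold=8.0, k=1000.0, levels=256, max_distance=50.0):
    """Convert reflected light amounts into gradation values (0 .. levels-1).

    Assumes amount = k / distance**2, so distance = sqrt(k / amount).
    Pixels at or below `threshold` are treated as background and set to 0.
    Nearer pixels receive larger values (an assumption made for this sketch).
    """
    image = []
    for row in reflected:
        out_row = []
        for amount in row:
            if amount <= threshold:
                out_row.append(0)  # background or noise removal
                continue
            distance = math.sqrt(k / amount)  # undo the inverse-square law
            nearness = 1.0 - min(distance, max_distance) / max_distance
            # Keep valid pixels at 1 or above so 0 stays reserved for background.
            out_row.append(max(1, round(nearness * (levels - 1))))
        image.append(out_row)
    return image


# Hypothetical reflected light amounts for one image row.
row_image = to_distance_image([[1000.0, 10.0, 5.0]])
```

Per-pixel correction (2) for element variations would typically be a calibration table applied before this step; it is omitted here because the source gives no details.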

When extracting the three-dimensional shape of a face, it is desirable to obtain distance information with high resolution. In this case, it is desirable to use a logarithmic amplifier as the amplifier 131. The amount of light received on the light receiving surface is inversely proportional to the square of the distance to the target object, but with a logarithmic amplifier the output becomes inversely proportional to the distance. This allows the dynamic range to be used effectively.
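The dynamic range benefit can be illustrated numerically. This is a rough sketch assuming an ideal inverse-square falloff and an ideal base-10 logarithmic stage; the function names, the constant `k`, and the example distances are hypothetical. A tenfold change in distance changes the received light amount by a factor of 100, while the logarithmic output changes by a fixed step of only two decades.

```python
import math


def received_amount(distance, k=1.0):
    """Ideal inverse-square model: amount = k / distance**2 (assumed)."""
    return k / distance ** 2


def log_amp_output(amount, gain=1.0):
    """Ideal logarithmic amplifier model (assumed): gain * log10(amount)."""
    return gain * math.log10(amount)


near, far = 0.2, 2.0  # hypothetical distances, a 10x range

# Raw received amounts span a factor of (far / near)**2.
linear_ratio = received_amount(near) / received_amount(far)

# After the logarithmic stage the same range is a difference of 2 * log10(far / near).
log_span = log_amp_output(received_amount(near)) - log_amp_output(received_amount(far))
```

With a fixed number of A/D gradations, the compressed output leaves more levels available for distant, weakly lit pixels than a linear amplifier would.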

In the configuration described above, assuming that one emission yields reflected light for all pixels, the timing control unit 104 drives the following sequence: light emission → light reception by the first light receiving unit → light reception by the second light receiving unit without light emission → difference calculation → digitization → distance image generation (or, alternatively, light reception by the second light receiving unit without light emission → light emission → light reception by the first light receiving unit → difference calculation → digitization → distance image generation). One pass through this sequence yields one distance image. Repeating the sequence (for example, every 1/60 second) yields a distance image stream.
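The repeated sequence can be sketched as a generator that produces one distance image per cycle. This is an illustrative sketch only: `capture`, `make_distance_image`, and `fake_capture` are hypothetical stand-ins for the sensor and for the distance image generation unit 103, and real-time pacing (for example, 1/60 second per cycle) is left out.

```python
def range_image_stream(capture, make_distance_image, n_frames):
    """Yield n_frames distance images, one per emit/receive/difference cycle.

    capture(emitting) -- hypothetical sensor read; returns rows of pixel values
    make_distance_image(diff) -- hypothetical conversion to a distance image
    """
    for _ in range(n_frames):
        lit = capture(emitting=True)     # first light receiving unit: reflected + external light
        unlit = capture(emitting=False)  # second light receiving unit: external light only
        # Difference calculation (negative values clamped to 0 by assumption).
        diff = [[max(a - b, 0) for a, b in zip(r1, r2)] for r1, r2 in zip(lit, unlit)]
        yield make_distance_image(diff)


def fake_capture(emitting):
    """Stand-in sensor with constant external light and a fixed reflected signal."""
    ambient = [[10, 10], [10, 10]]
    signal = [[5, 0], [20, 40]]
    if not emitting:
        return ambient
    return [[a + s for a, s in zip(ra, rs)] for ra, rs in zip(ambient, signal)]


# Identity conversion keeps the example small.
frames = list(range_image_stream(fake_capture, lambda diff: diff, 3))
```

The alternative ordering in the text (unlit capture first, then lit capture) only swaps the two `capture` calls; the difference is the same.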

The light emitting unit 101 preferably emits near-infrared light, which is invisible to the human eye. Because the person cannot see the light even when illuminated by it, glare is avoided. In this case, it is preferable to provide a near-infrared pass filter in the light receiving optical system. This filter passes near-infrared light at the emission wavelength and blocks visible and far-infrared light, so it cuts off much of the external light. However, visible light may be used as long as it does not dazzle the human eye (for example, when the emission amount is not very large, or when the optical system keeps the light from entering the eye directly). A method using electromagnetic waves, ultrasonic waves, or the like is also conceivable.

In the above description, external light correction takes the difference between the two received light amounts, obtained with and without emission from the light emitting unit 101, while they are still analog signals. Alternatively, the two received light amounts may each be digitized first and the difference taken afterward.

The light receiving surface described above, or the housing that contains it, may be installed as appropriate for the purpose of the image recognition apparatus. For example, when the image recognition apparatus has a display device, the light receiving surface is mounted on the housing of the image recognition apparatus so that when the face of the person who is the target object faces the display device head-on, it also faces the light receiving surface head-on.

The above embodiments and their modifications can be implemented in combination as appropriate.

In each of the above embodiments, their modifications, and appropriate combinations thereof, shape and/or movement is recognized from a distance image stream, and various processing may further be performed based on the recognition result. However, an embodiment configured to recognize shape from a single distance image, and optionally to perform various processing based on that recognition result, is also possible.

Each of the above embodiments, their modifications, and appropriate combinations thereof may also be configured as an apparatus that omits the image acquisition unit 1, or the part of it that extracts the reflected light image. Such an apparatus recognizes shape and/or movement based on a given distance image or distance image stream, or generates a distance image or stream from a given reflected light image or stream and recognizes shape and/or movement based on the generated result. It may further perform various processing based on the recognition result.

Each of the above functions, except for the element portions, can also be implemented as software. The invention can also be embodied as a machine-readable medium storing a program that causes a computer to execute the procedures or means described above.

The present invention is not limited to the embodiments described above and can be implemented with various modifications within its technical scope.

A diagram schematically showing a configuration example of an image recognition apparatus according to the first embodiment of the present invention.
A diagram for explaining the distance image.
A diagram for explaining the distance image.
A diagram for explaining the distance image.
A flowchart showing the flow of edge extraction processing.
A diagram for explaining the Sobel operator.
A flowchart showing the flow of template matching processing.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 2 of the first embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 3 of the first embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to the second embodiment of the present invention.
A flowchart showing the flow of processing for obtaining the direction in which the speaker's face is facing.
A diagram for explaining the normal direction of a pixel.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 1 of the second embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 2 of the second embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to the third embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 2 of the third embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 3 of the third embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to the fourth embodiment of the present invention.
A diagram schematically showing a configuration example of an image recognition apparatus according to modification 3 of the fourth embodiment of the present invention.
A diagram showing a configuration example of the image acquisition unit.
A diagram showing a more detailed configuration example of the image acquisition unit.

Explanation of symbols

1 ... Image acquisition unit, 2 ... Oral cavity extraction unit, 3 ... Image recognition unit, 4 ... Sound presentation unit, 5 ... Face extraction unit, 6 ... Direction identification unit, 7 ... Speech recognition unit, 8 ... Speech recognition start unit, 9 ... Speech recognition end unit, 10 ... Information presentation unit, 11 ... Information presentation start unit, 12 ... Information presentation switching unit, 101 ... Light emitting unit, 102 ... Reflected light extraction unit, 103 ... Distance image generation unit, 104 ... Timing signal generation unit, 107 ... Light receiving optical system, 121 ... First light receiving unit, 122 ... Second light receiving unit, 123 ... Difference calculation unit, 131 ... Amplifier, 132 ... A/D converter, 133 ... Memory

Claims

An image acquisition means for acquiring a distance image for the target object;
Oral part extraction means for extracting the oral part from the distance image acquired by the image acquisition means,
An image recognition means for recognizing the shape of the lips based on the distance image of the oral cavity portion extracted by the oral cavity extraction means;
Speech recognition means for recognizing input speech;
Control means for performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result by the image recognition means; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result by the image recognition means; an image recognition apparatus characterized by comprising the above means.

Image acquisition means for acquiring a distance image stream for the target object;
An oral cavity extraction means for extracting an oral cavity portion from the distance image stream acquired by the image acquisition means;
Image recognition means for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction means;
Speech recognition means for recognizing input speech;
Control means for performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result by the image recognition means; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result by the image recognition means; an image recognition apparatus characterized by comprising the above means.

An image acquisition means for acquiring a distance image for the target object;
Oral part extraction means for extracting the oral part from the distance image acquired by the image acquisition means,
An image recognition means for recognizing the shape of the lips based on the distance image of the oral cavity portion extracted by the oral cavity extraction means;
Information presenting means for presenting information to be presented to the user using all or part of a plurality of output forms including a sound output form and an image output form;
Control means for detecting at least one of the start and the end of the speaker's conversation based on the recognition result by the image recognition means and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition apparatus characterized by comprising the above means.

Image acquisition means for acquiring a distance image stream for the target object;
An oral cavity extraction means for extracting an oral cavity portion from the distance image stream acquired by the image acquisition means;
Image recognition means for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction means;
Information presenting means for presenting information to be presented to the user using all or part of a plurality of output forms including a sound output form and an image output form;
Control means for detecting at least one of the start and the end of the speaker's conversation based on the recognition result by the image recognition means and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition apparatus characterized by comprising the above means.

An image acquisition means for acquiring a distance image for the target object;
A face extraction means for extracting a face portion from the distance image acquired by the image acquisition means;
An image recognition means for recognizing the shape of the face based on the distance image of the face part extracted by the face extraction means;
Speech recognition means for recognizing input speech;
Control means for performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result by the image recognition means; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result by the image recognition means; an image recognition apparatus characterized by comprising the above means.

Image acquisition means for acquiring a distance image stream for the target object;
Face part extracting means for extracting a face part from the distance image stream acquired by the image acquiring means;
Image recognition means for recognizing the movement of the face based on the distance image stream of the face extracted by the face extraction means;
Speech recognition means for recognizing input speech;
Control means for performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result by the image recognition means; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result by the image recognition means; an image recognition apparatus characterized by comprising the above means.

An image acquisition means for acquiring a distance image for the target object;
A face extraction means for extracting a face portion from the distance image acquired by the image acquisition means;
An image recognition means for recognizing the shape of the face based on the distance image of the face part extracted by the face extraction means;
Information presenting means for presenting information to be presented to the user using all or part of a plurality of output forms including a sound output form and an image output form;
Control means for detecting at least one of the start and the end of the speaker's conversation based on the recognition result by the image recognition means and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition apparatus characterized by comprising the above means.

Image acquisition means for acquiring a distance image stream for the target object;
Face part extracting means for extracting a face part from the distance image stream acquired by the image acquiring means;
Image recognition means for recognizing the movement of the face based on the distance image stream of the face extracted by the face extraction means;
Information presenting means for presenting information to be presented to the user using all or part of a plurality of output forms including a sound output form and an image output form;
Control means for detecting at least one of the start and the end of the speaker's conversation based on the recognition result by the image recognition means and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition apparatus characterized by comprising the above means.

The image recognition apparatus according to any one of the preceding claims, further comprising direction identification means for identifying the direction of the speaker's face based on the shape information or the movement information obtained by the image recognition means.

The image recognition apparatus according to claim 1, further comprising communication means for communicating the obtained predetermined information.

An image acquisition step for acquiring a distance image for the target object;
Oral part extraction step for extracting the oral part from the distance image acquired by the image acquisition step;
An image recognition step for recognizing the shape of the lips based on the distance image of the oral cavity part extracted by the oral cavity extraction step;
A speech recognition step for recognizing the input speech by speech recognition means;
A control step of performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result of the image recognition step; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result of the image recognition step; an image recognition method characterized by comprising the above steps.

An image acquisition step for acquiring a distance image stream for the target object;
Oral part extraction step for extracting the oral part from the distance image stream acquired by the image acquisition step,
An image recognition step for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction step;
A speech recognition step for recognizing the input speech by speech recognition means;
A control step of performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result of the image recognition step; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result of the image recognition step; an image recognition method characterized by comprising the above steps.

An image acquisition step for acquiring a distance image for the target object;
Oral part extraction step for extracting the oral part from the distance image acquired by the image acquisition step;
An image recognition step for recognizing the shape of the lips based on the distance image of the oral cavity part extracted by the oral cavity extraction step;
An information presenting step for presenting information to be presented to the user using all or part of a plurality of output forms including an output form by sound and an output form by an image by an information presenting unit;
A control step of detecting at least one of the start and the end of the speaker's conversation based on the recognition result of the image recognition step and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition method characterized by comprising the above steps.

An image acquisition step for acquiring a distance image stream for the target object;
Oral part extraction step for extracting the oral part from the distance image stream acquired by the image acquisition step,
An image recognition step for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction step;
An information presenting step for presenting information to be presented to the user using all or part of a plurality of output forms including an output form by sound and an output form by an image by an information presenting unit;
A control step of detecting at least one of the start and the end of the speaker's conversation based on the recognition result of the image recognition step and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; an image recognition method characterized by comprising the above steps.

A computer-readable recording medium storing a program for causing a computer to function as an image recognition device,
An image acquisition step for acquiring a distance image for the target object;
Oral part extraction step for extracting the oral part from the distance image acquired by the image acquisition step;
An image recognition step for recognizing the shape of the lips based on the distance image of the oral cavity part extracted by the oral cavity extraction step;
A speech recognition step for recognizing the input speech by speech recognition means;
A control step of performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result of the image recognition step; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result of the image recognition step; the recording medium being characterized in that the recorded program causes the computer to execute the above steps.

A computer-readable recording medium storing a program for causing a computer to function as an image recognition device,
An image acquisition step for acquiring a distance image stream for the target object;
Oral part extraction step for extracting the oral part from the distance image stream acquired by the image acquisition step,
An image recognition step for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction step;
A speech recognition step for recognizing the input speech by speech recognition means;
A control step of performing at least one of: control for starting speech recognition by the speech recognition means when the start of the speaker's conversation is detected based on the recognition result of the image recognition step; and control for ending speech recognition by the speech recognition means when the end of the speaker's conversation is detected based on the recognition result of the image recognition step; the recording medium being characterized in that the recorded program causes the computer to execute the above steps.

A computer-readable recording medium storing a program for causing a computer to function as an image recognition device,
An image acquisition step for acquiring a distance image for the target object;
Oral part extraction step for extracting the oral part from the distance image acquired by the image acquisition step;
An image recognition step for recognizing the shape of the lips based on the distance image of the oral cavity part extracted by the oral cavity extraction step;
An information presenting step for presenting information to be presented to the user using all or part of a plurality of output forms including an output form by sound and an output form by an image by an information presenting unit;
A control step of detecting at least one of the start and the end of the speaker's conversation based on the recognition result of the image recognition step and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; the recording medium being characterized in that the recorded program causes the computer to execute the above steps.

A computer-readable recording medium storing a program for causing a computer to function as an image recognition device,
An image acquisition step for acquiring a distance image stream for the target object;
Oral part extraction step for extracting the oral part from the distance image stream acquired by the image acquisition step,
An image recognition step for recognizing the movement of the lips based on the distance image stream of the oral cavity portion extracted by the oral cavity extraction step;
An information presenting step for presenting information to be presented to the user using all or part of a plurality of output forms including an output form by sound and an output form by an image by an information presenting unit;
A control step of detecting at least one of the start and the end of the speaker's conversation based on the recognition result of the image recognition step and, according to the detection result, performing at least one of: control for starting information presentation by the information presenting means; control for ending information presentation by the information presenting means; and control for adding, stopping, or changing an output form used for the information presentation performed by the information presenting means; the recording medium being characterized in that the recorded program causes the computer to execute the above steps.