JP2001067482A

JP2001067482A - Human reaction recognizing device and computer-readable recording medium on which a program for the same is recorded

Info

Publication number: JP2001067482A
Application number: JP24258199A
Authority: JP
Inventors: Kinzan To; 金山唐; Shinjiro Kawato; 慎二郎川戸; Atsushi Otani; 淳大谷
Original assignee: ATR Media Integration and Communication Research Laboratories
Current assignee: ATR Media Integration and Communication Research Laboratories
Priority date: 1999-08-30
Filing date: 1999-08-30
Publication date: 2001-03-16

Abstract

PROBLEM TO BE SOLVED: To provide a human reaction recognition device that can accurately convey to a speaker the reactions of many people at remote locations to that speaker. SOLUTION: The human reaction recognition device 34, which recognizes the reaction of a person to displayed video, includes a video acquisition unit 100 which acquires a video sequence of face images of the person, a detection and separation subsystem 102 which classifies the acquired video sequence into stable stationary units and activity detection units according to movement between video frames, and a decision subsystem 104 which recognizes the reaction of the person from the frame sequences classified as stable stationary units and the frame sequences classified as activity detection units.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a communication system used for conferences, lectures, classes, and the like that are held by connecting multiple sites through communication. More particularly, it relates to a system for surveying and evaluating audience reactions so that speakers, lecturers, and the like can accurately understand the reactions of the attendees at each venue.

[0002]

[Prior Art] Broadcasting is a very important means of conveying information to a large audience. A typical example is an education system that uses broadcasting. In such a system, a lecture given by a lecturer is broadcast, and the broadcast is received at lecture venues set up in various places and displayed using, for example, a television. Attendees can take the lecture by watching the broadcast at the venue. Note that there need not be several attendees at each venue; there may be only one.

[0003] Meanwhile, with the recent development of communication technology, and Internet technology in particular, systems that deliver lectures using methods similar to broadcasting are emerging. Because such systems allow two-way communication between the lecturer and the attendees, they are particularly effective in, for example, foreign language learning. One existing system is disclosed in Japanese Patent Application Laid-Open No. H11-55643.

[0004] The system disclosed in Japanese Patent Application Laid-Open No. H11-55643 assumes two-way communication of video (and audio) between the lecturer and the attendees and, in order to protect users' privacy, transmits video only when the user has given permission.

[0005]

[Problems to be Solved by the Invention] In an ordinary lecture that does not use communication, the lecturer typically stands at a podium and the audience sits in front of it. The lecturer delivers the lecture from the podium. In this setting the lecturer can see the audience's reactions and, in response, change the content or the order of the lecture along the way. For example, if the audience appears bored, the lecturer can switch to a topic that is not directly related to the lecture but is likely to catch the audience's interest, and then return to the main subject once the audience is concentrating on the lecture again.

[0006] The same is possible in a learning system that uses communication. For example, if there are about three students per lecturer, an image of each student can be displayed in front of the lecturer. The lecturer can then gauge the students' reactions just as in an ordinary lecture and adjust the content of the lecture accordingly.

[0007] However, it is difficult to do the same for a large audience over a communication link. When video is transmitted by means such as the Internet, the volume of traffic becomes enormous as the number of venues grows, so sending video from every venue to the lecturer becomes practically impossible. Moreover, even if video could be sent from every venue to the lecturer, the audience is expected to be far larger than at an ordinary lecture or talk, and in that case it would be extremely difficult to convey the reactions of the audience as a whole accurately to the lecturer.

[0008] These problems are not limited to so-called one-way transmission of information such as lectures, talks, and speeches. They also arise, for example, when a discussion is held among a relatively large number of people gathered at multiple locations. So-called virtual space technology may be used in such discussions, but the technology mainly used in virtual spaces concerns how to convey each attendee's facial expression and the like to the other attendees, and it cannot solve the problem described above. In a discussion, moreover, not only the current speaker but also anyone in the position of the audience may become a speaker, so it would be convenient if each venue could accurately grasp the reactions of the attendees at the other venues.

[0009] Furthermore, not only in discussions but also in lectures, talks, speeches, and the like, it may be useful for the audience to know the reactions of the audience at other venues.

[0010] It is therefore an object of the present invention to provide a human reaction recognition device capable of accurately conveying to a speaker the reactions of a large number of people at remote locations to that speaker.

[0011] Another object of the present invention is to provide a human reaction recognition device capable of accurately conveying the reactions of a large number of people at remote locations to one another.

[0012]

[Means for Solving the Problems] The human reaction recognition device according to the invention of claim 1 is a reaction recognition device for recognizing a person's reaction to displayed video, and includes: video acquisition means for acquiring a video sequence of face images of the person; first classification means for classifying the acquired video sequence into stable stationary units and activity detection units based on motion between video frames; and reaction recognition means for recognizing the person's reaction from the frame sequences classified as stable stationary units and the frame sequences classified as activity detection units.

[0013] According to this invention, a person's reaction can be recognized based on the stable stationary units and the activity detection units. By sending information indicating this reaction to the source of the video and compiling statistics there, the video source can grasp the reactions of the many audience members at the receiving ends.

[0014] In the human reaction recognition device according to the invention of claim 2, in addition to the configuration of the invention of claim 1, the first classification means includes: second classification means for classifying the video sequence into stationary units and motion units based on inter-frame differences between adjacent frames; third classification means for classifying the stationary frames into stable stationary units and unstable stationary units based on their duration; and integration means for integrating consecutive unstable stationary units and motion units into activity detection units.
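The three-stage classification in claim 2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes each pair of adjacent frames has already been reduced to a single scalar difference value, and the names `segment_units`, `motion_thresh`, and `min_stable_len` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    kind: str   # "stable_still" or "activity"
    start: int  # index of the first inter-frame difference in the unit
    end: int    # one past the index of the last difference in the unit


def segment_units(diffs: List[float], motion_thresh: float,
                  min_stable_len: int) -> List[Unit]:
    """Split a sequence of inter-frame differences into units.

    Assumed rule: a difference above motion_thresh counts as motion.
    """
    # Stage 1: label every inter-frame difference as still or motion.
    labels = ["motion" if d > motion_thresh else "still" for d in diffs]

    # Group consecutive identical labels into runs [label, start, end).
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1
        else:
            runs.append([label, i, i + 1])

    # Stage 2: a still run lasting at least min_stable_len is stable,
    # a shorter one is unstable.
    # Stage 3: consecutive unstable still runs and motion runs are merged
    # into a single activity unit.
    units: List[Unit] = []
    for label, start, end in runs:
        stable = label == "still" and (end - start) >= min_stable_len
        kind = "stable_still" if stable else "activity"
        if units and units[-1].kind == kind:
            units[-1].end = end
        else:
            units.append(Unit(kind, start, end))
    return units
```

With `diffs = [0, 0, 0, 0, 5, 0, 6, 0, 0, 0]`, a threshold of 1.0, and a minimum stable length of 3, the short still run between the two motion spikes is absorbed into one activity unit that sits between two stable stationary units.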

[0015] In the human reaction recognition device according to the invention of claim 3, in addition to the configuration of the invention of claim 1, the reaction recognition means includes: first feature vector extraction means for extracting a feature vector of the person's face image from the frames in a stable stationary unit; a pre-trained first neural network that receives the feature vector output by the first feature vector extraction means and outputs posture information corresponding to that feature vector; second feature vector extraction means for extracting a feature vector corresponding to the motion of the person's face image from the inter-frame difference information in an activity detection unit; and a pre-trained second neural network that receives the feature vector output by the second feature vector extraction means and outputs gesture information corresponding to that feature vector.

[0016] In the human reaction recognition device according to the invention of claim 4, in addition to the configuration of the invention of claim 3, at least one of the first neural network and the second neural network includes a plurality of one-to-one neural networks, each of which is associated with a predetermined reaction category, receives the feature vector output by the first or second feature vector extraction means, and outputs the degree of relevance to that reaction category.

[0017] Because one-to-one neural networks are used, adding a new reaction category as a recognition target only requires adding a one-to-one neural network corresponding to that category, so the functionality can be extended easily.
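The extensibility argument can be illustrated with a small sketch in which each reaction category has its own independent scorer. Everything here is an assumption made for illustration: a single logistic unit stands in for a trained one-to-one network, and the category names (`nod`, `shake`, `tilt`) and weights are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[List[float]], float]


def make_scorer(weights: List[float], bias: float) -> Scorer:
    """Hypothetical stand-in for one pre-trained per-category network."""
    def score(features: List[float]) -> float:
        z = sum(w * x for w, x in zip(weights, features)) + bias
        return 1.0 / (1.0 + math.exp(-z))  # relevance in (0, 1)
    return score


def classify(features: List[float],
             scorers: Dict[str, Scorer]) -> Tuple[str, float]:
    """Run every per-category scorer and return the best category."""
    scores = {name: scorer(features) for name, scorer in scorers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# One scorer per category; the existing scorers are never retrained.
scorers: Dict[str, Scorer] = {
    "nod": make_scorer([1.0, 0.0], 0.0),
    "shake": make_scorer([0.0, 1.0], 0.0),
}

# Adding a new category means adding one more entry.
scorers["tilt"] = make_scorer([-1.0, -1.0], 0.0)
```

Because each scorer depends only on its own parameters, registering `tilt` leaves the outputs of `nod` and `shake` unchanged, which is the property the paragraph above relies on.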

[0018] In the human reaction recognition device according to the invention of claim 5, in addition to the configuration of the invention of any of claims 1 to 4, the video acquisition means includes: a video camera; video signal conversion means for converting the video signal output by the video camera into a digital signal frame by frame; and face region identification means for identifying the face region of the person in each frame based on the video sequence output by the video signal conversion means.

[0019] In the human reaction recognition device according to the invention of claim 6, in addition to the configuration of the invention of any of claims 1 to 5, the face region identification means includes: first means for identifying the face region of the person in each frame by a first method, based on the video sequence output by the video signal conversion means; second means for identifying the face region of the person in each frame by a second method, based on the video sequence output by the video signal conversion means; and face region integration means for identifying the face region by integrating the identification results of the first means and the second means.

[0020] Because the face image region is determined using multiple methods, it can be determined reliably.
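The text above does not say how the two results are integrated. As one possible reading, the sketch below combines two per-pixel face masks by keeping only the pixels on which both methods agree, and falls back to the surviving mask when one method fails. The function name `integrate_masks` and the agreement rule are assumptions, not the patent's method.

```python
from typing import List, Optional

Mask = List[List[bool]]  # True where a pixel is judged to belong to the face


def integrate_masks(first: Optional[Mask],
                    second: Optional[Mask]) -> Optional[Mask]:
    """Combine the face masks produced by two detection methods.

    Assumed rule: if only one method produced a result, use it;
    if both did, keep the pixels that both marked as face.
    """
    if first is None:
        return second
    if second is None:
        return first
    return [
        [a and b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]
```

Requiring agreement trades recall for precision; a union of the two masks would be the opposite design choice.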

[0021] In the human reaction recognition device according to the invention of claim 7, in addition to the configuration of the invention of claim 6, the video sequence is an RGB color video sequence, and the first means includes means for identifying the face region based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into the rg color space and a predetermined color distribution pattern.
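As a rough illustration, assume that the rg color space here means the usual normalized chromaticity, r = R/(R+G+B) and g = G/(R+G+B). The sketch below converts pixels to rg, builds a coarse two-dimensional histogram, and compares it with a reference histogram by histogram intersection. The similarity measure, the bin count, and all function names are assumptions. The text above says only that a similarity to a predetermined color distribution pattern is used.

```python
from typing import Dict, Iterable, Tuple

RGB = Tuple[int, int, int]
Histogram = Dict[Tuple[int, int], float]


def rgb_to_rg(pixel: RGB) -> Tuple[float, float]:
    """Normalized chromaticity: divides out overall brightness."""
    red, green, blue = pixel
    total = red + green + blue
    if total == 0:
        # Chromaticity is undefined for pure black; use the neutral point.
        return (1 / 3, 1 / 3)
    return (red / total, green / total)


def rg_histogram(pixels: Iterable[RGB], bins: int = 8) -> Histogram:
    """Normalized 2-D histogram of rg values (sparse dict of bin -> share)."""
    counts: Dict[Tuple[int, int], int] = {}
    n = 0
    for pixel in pixels:
        r, g = rgb_to_rg(pixel)
        key = (min(int(r * bins), bins - 1), min(int(g * bins), bins - 1))
        counts[key] = counts.get(key, 0) + 1
        n += 1
    return {key: count / n for key, count in counts.items()} if n else {}


def histogram_intersection(h1: Histogram, h2: Histogram) -> float:
    """Similarity in [0, 1]; 1.0 means the distributions are identical."""
    return sum(min(share, h2.get(key, 0.0)) for key, share in h1.items())
```

Because r and g are ratios, a pixel and a uniformly brighter copy of it map to the same point, which is one reason chromaticity spaces are commonly used for skin-color matching under varying illumination.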

[0022] In the human reaction recognition device according to the invention of claim 8, in addition to the configuration of the invention of claim 6 or claim 7, the video sequence is an RGB color video sequence, and the first means includes means for identifying the face region based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into the NCb-NCr color space and a predetermined color distribution pattern.

[0023] The computer-readable recording medium according to the invention of claim 9 is a computer-readable recording medium on which is recorded a program for causing a computer to operate as a reaction recognition device for recognizing a person's reaction to displayed video. The program includes: a first classification program portion for classifying a video sequence acquired of the person's face images into stable stationary units and activity detection units based on motion between video frames; and a reaction recognition program portion for recognizing the person's reaction from the frame sequences classified as stable stationary units and the frame sequences classified as activity detection units.

[0024] In the computer-readable recording medium according to the invention of claim 10, in addition to the configuration of the invention of claim 9, the first classification program portion includes: a second classification program portion for classifying the video sequence into stationary units and motion units based on inter-frame differences between adjacent frames; a third classification program portion for classifying the stationary frames into stable stationary units and unstable stationary units based on their duration; and an integration program portion for integrating consecutive unstable stationary units and motion units into activity detection units.

[0025] In the computer-readable recording medium according to the invention of claim 11, in addition to the configuration of the invention of claim 10, the reaction recognition program portion includes: a first feature vector extraction program portion for extracting a feature vector of the person's face image from the frames in a stable stationary unit; a pre-trained first neural network program portion that receives the feature vector output by the first feature vector extraction program portion and outputs posture information corresponding to that feature vector; a second feature vector extraction program portion for extracting a feature vector corresponding to the motion of the person's face image from the inter-frame difference information in an activity detection unit; and a pre-trained second neural network program portion that receives the feature vector output by the second feature vector extraction program portion and outputs gesture information corresponding to that feature vector.

[0026] In the computer-readable recording medium according to the invention of claim 12, in addition to the configuration of the invention of claim 11, at least one of the first neural network program portion and the second neural network program portion includes a plurality of one-to-one neural network program portions, each of which is associated with a predetermined reaction category, receives the feature vector output by the first or second feature vector extraction program portion, and outputs the degree of relevance to that reaction category.

[0027] Because one-to-one neural network program portions are used, adding a new reaction category as a recognition target only requires adding a one-to-one neural network program portion corresponding to that category, so the functionality can be extended easily.

[0028] In the computer-readable recording medium according to the invention of claim 13, in addition to the configuration of the invention of any of claims 9 to 12, the program further includes a face region identification program portion for identifying the face region of the person in each frame based on the video sequence and supplying it to the first classification program portion.

[0029] In the computer-readable recording medium according to the invention of claim 14, in addition to the configuration of the invention of any of claims 9 to 13, the face region identification program portion includes: a first program portion for identifying the face region of the person in each frame by a first method, based on the video sequence; a second program portion for identifying the face region of the person in each frame by a second method, based on the video sequence; and a face region integration program portion for identifying the face region by integrating the identification results of the first program portion and the second program portion.

[0030] Because the face image region is determined using multiple methods, it can be determined reliably.

[0031] In the computer-readable recording medium according to the invention of claim 15, in addition to the configuration of the invention of claim 14, the video sequence is an RGB color video sequence, and the first program portion includes a program portion for identifying the face region based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into the rg color space and a predetermined color distribution pattern.

[0032] In the computer-readable recording medium according to the invention of claim 16, in addition to the configuration of the invention of claim 14 or claim 15, the video sequence is an RGB color video sequence, and the first program portion includes a program portion for identifying the face region based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into the NCb-NCr color space and a predetermined color distribution pattern.

[0033]

[Embodiments of the Invention] [Equations used in the description] In the following description of the embodiments, the equations below are used and are referred to as needed by their equation numbers.

[0034]

(Equation 1)

[0035]

(Equation 2)

[0036]

(Equation 3)

[0037]

(Equation 4)

[0038] [Embodiment 1]

[Overall System Configuration] A remote broadcasting system according to Embodiment 1 of the present invention is described below. The system broadcasts video of a lecturer giving a lecture at one location to lecture venues at multiple locations and presents the reactions of the audience at each venue to the lecturer. To keep the description simple, audio is not discussed below. Audio is nevertheless indispensable in lectures, talks, discussions, and the like, and a configuration for capturing, transmitting, receiving, and reproducing audio is provided as needed. The configuration for doing so will be apparent to those skilled in the art.

[0039] Referring to FIG. 1, the remote broadcasting system 20 includes: a plurality of lecture venue systems 34, provided at a plurality of remote sites, each of which generates data indicating reactions by summarizing and abstracting the reactions of the audience at its site; a lecturer system 32 for broadcasting the lecture given by the lecturer to these lecture venues over the Internet; and an aggregation center 30 that connects the lecture venue systems 34 and the lecturer system 32, combines the reactions of the audiences in front of the respective lecture venue systems 34, and transmits the result to the lecturer system 32. The lecture need not be broadcast from the lecturer system 32 to the lecture venue systems 34 over the Internet; it may also be broadcast using ordinary radio waves, as in satellite or terrestrial broadcasting.

[Hardware Configuration] This remote broadcasting system is implemented by software running on a computer such as a personal computer or a workstation. FIG. 2 shows the external appearance of a computer used to implement the system. The aggregation center 30, the lecturer system 32, and the lecture venue systems 34 all have substantially the same hardware configuration, which is the one shown in FIG. 2. In the following, the aggregation center 30, the lecturer system 32, and the lecture venue system 34 are represented by a system 40, and the system 40 is described.

[0040] Referring to FIG. 2, the system 40 includes: a computer main body 60 equipped with a CD-ROM (Compact Disc Read-Only Memory) drive 70 and an FD (Flexible Disk) drive 72; a display 62 serving as a display device connected to the computer main body 60; a keyboard 66 and a mouse 68 serving as input devices, also connected to the computer main body 60; a video camera 50, connected to the computer main body 60, for capturing images of a person (the audience); and a large-screen display device 86 for displaying the video of the lecturer transmitted from the lecturer system 32. In the device of this embodiment, a video camera including a CCD (solid-state image sensor) is used as the video camera 50. The video camera 50 is placed facing the audience, and processing is performed to extract the audience's reactions to the lecture displayed on the large-screen display device 86. For simplicity, the description below assumes that a single audience member is filmed, but multiple audience members can be handled easily by repeating the processing for each person.

[0041] FIG. 3 shows the configuration of the system 40 in block diagram form. As shown in FIG. 3, the computer main body 60 of the system 40 includes, in addition to the CD-ROM drive 70 and the FD drive 72, the following components, each connected to a bus 92: a CPU (Central Processing Unit) 76; a ROM (Read Only Memory) 78; a RAM (Random Access Memory) 80; a hard disk 74; an image acquisition circuit 88 for capturing images from the video camera 50; and a video output circuit 90 for decompressing the compressed digital data, obtained via the bus 92, that represents the lecture transmitted from the lecturer system 32 shown in FIG. 1, and converting it into a video signal. A CD-ROM 82 is loaded into the CD-ROM drive 70, and an FD 84 is loaded into the FD drive 72.

[0042] As already stated, the main part of this remote broadcasting system is implemented by computer hardware and by software executed by the CPU 76. Such software is generally distributed on a storage medium such as the CD-ROM 82 or the FD 84, read from the medium by the CD-ROM drive 70, the FD drive 72, or the like, and first stored on the hard disk 74. Alternatively, if the device is connected to a network, the software is first copied from a server on the network to the hard disk 74. It is then read from the hard disk 74 into the computer main body 60 and executed by the CPU 76. When the device is connected to a network, the software may also be loaded directly into the RAM 80 and executed without being stored on the hard disk 74.

[0043] The hardware of the computer shown in FIGS. 2 and 3 and its operating principles are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the CD-ROM 82, the FD 84, or the computer main body 60.

[0044] As a recent general trend, various program modules are often provided as part of a computer's operating system, and application programs proceed by calling these modules in a predetermined sequence when needed. In that case, the software for implementing this remote broadcasting system does not itself contain such modules, and the functions of the system's components are realized only in cooperation with the operating system on the computer. Nevertheless, as long as a common platform is used, there is no need to distribute software that includes such modules. The software without those modules, and the recording media on which that software is recorded (and the data signals when that software is distributed over a network), can be regarded as constituting the embodiment.

[Lecture Venue System 34] Referring to FIG. 4, the lecture venue system 34 of this embodiment includes: the video camera 50; a video acquisition unit 100 for A/D (analog/digital) converting the signal received from the video camera 50 frame by frame and storing it; a detection and separation subsystem 102 for separating the frames of the video into "stable stationary units" and "activity detection units", described later, based on the digital video signal supplied by the video acquisition unit 100; a decision subsystem 104 for determining the reaction of the audience at this venue based on the images of the frames belonging to the stable stationary units and activity detection units separated by the detection and separation subsystem 102, classifying the reaction into one of a plurality of categories, and outputting the category information; an aggregation unit 106 that repeatedly, at a predetermined interval, tallies the category information output by the decision subsystem 104 over a predetermined time and outputs the result as a score indicating the reaction of this venue's audience to the lecture; a transmission/reception unit 108 for performing communication control within the system, including transmitting the output of the aggregation unit 106 to the aggregation center 30 over the Internet and receiving the lecture video data from the lecturer system 32; an image processing unit 110 for performing necessary processing such as decompression and interpolation on the lecture video data received by the transmission/reception unit 108, then converting it into a video signal and outputting it; and the aforementioned large-screen display device 86 for displaying the video signal from the image processing unit 110.

【００４５】以下、この受講会場用システム３４の主要
な機能ブロックの構成についてその詳細を説明する。［検出・分離サブシステム１０２］図５を参照して、検
出・分離サブシステム１０２は、映像取得部１００から
与えられる各フレームの画像の中から、聴衆たる人物の
顔の領域を検出するための顔領域検出部１２０と、顔領
域検出部１２０によって検出された顔領域について、入
力される映像の中に含まれる画像の動きに着目して、映
像を後述する「静止ユニット」と「動きユニット」とに
分類するための映像ストリームユニット分離部１２２
と、映像ストリームユニット分離部１２２によって静止
ユニットに分類されたユニットの継続時間に着目し、あ
るしきい値以上続く静止ユニットを「安定静止ユニッ
ト」に、あるしきい値未満の期間だけ継続する静止ユニ
ットを「不安定静止ユニット」に、それぞれ分類するた
めの静止ユニット分類部１２４と、映像ストリームユニ
ット分離部１２２によって動きユニットとして分類され
たユニットと、静止ユニット分類部１２４によって不安
定静止ユニットに分類されたユニットとを統合して前述
の「活動検出ユニット」として統合するためのユニット
統合部１２６とを含む。The main functional blocks of the attendance venue system 34 are described in detail below.
[Detection/separation subsystem 102] Referring to FIG. 5, the detection/separation subsystem 102 includes: a face area detection unit 120 that detects the face area of each person in the audience in each frame image supplied from the video acquisition unit 100; a video stream unit separation unit 122 that, for the face area detected by the face area detection unit 120, classifies the video into "stationary units" and "motion units", described later, based on the image motion contained in the input video; a stationary unit classification unit 124 that looks at the duration of each unit classified as a stationary unit by the video stream unit separation unit 122 and classifies a stationary unit lasting at least a certain threshold as a "stable stationary unit" and one lasting less than that threshold as an "unstable stationary unit"; and a unit integrating unit 126 that merges the units classified as motion units by the video stream unit separation unit 122 with the units classified as unstable stationary units by the stationary unit classification unit 124 into the aforementioned "activity detection units".

【００４６】静止ユニット、動きユニット、安定静止ユ
ニット、不安定静止ユニットおよび活動検出ユニットに
ついては、図８および図９を参照して後により詳細に説
明する。［顔領域検出部１２０］複雑な背景を含む画像から人物
の顔領域を検出するための技術として多くの技術が提案
されている。それらは大別して、目、鼻、口等の顔部品
に代表される、顔面の特徴に基づいて顔領域を検出する
第１の手法と、人物の顔の色彩と背景の色彩との相違に
基づいて顔領域を検出する第２の手法とに分類される。
顔領域を検出できるものであればどちらの手法を用いて
もよいが、本実施例では以下に述べるように第２の手法
を用いる。The stationary unit, motion unit, stable stationary unit, unstable stationary unit, and activity detection unit are described in more detail later with reference to FIGS. 8 and 9.
[Face area detection unit 120] Many techniques have been proposed for detecting a person's face area in an image with a complex background. They fall broadly into two groups: a first approach that detects the face area from facial features, typified by facial parts such as the eyes, nose, and mouth, and a second approach that detects the face area from the difference between the color of the person's face and the color of the background. Either approach may be used as long as it can detect the face area; this embodiment uses the second approach, as described below.

【００４７】また、本実施の形態のシステムでは、特に
顔領域の検出を確実に行なうために、rgb色空間を用い
る手法と、NCb-NCr色空間を用いる手法とを組合わせ、
双方の手法による結果を統合して顔領域を検出してい
る。これら手法はいずれも、人の顔の色の色分布が、人
の顔の色および照明の色にかかわらず２Ｄガウス分布で
モデル化できるという事実を利用している。色分布を算
出する際に一方がｒ，ｇ，ｂを用い、他方がＮＣｂ−Ｎ
Ｃｒを用いる点のみにおいてこれら手法は異なってい
る。In addition, to detect the face area reliably, the system of the present embodiment combines a method using the rgb color space with a method using the NCb-NCr color space and integrates the results of both to detect the face area. Both methods rely on the fact that the color distribution of human face colors can be modeled by a 2D Gaussian distribution regardless of the color of the person's face and the color of the illumination. They differ only in that one uses r, g, b and the other uses NCb-NCr to compute the color distribution.

【００４８】図６を参照して、顔領域検出部１２０は、
入力されるＲＧＢ成分からなる映像ストリーム１３０に
対して、式（１）〜（３）で示される変換を行なうこと
により（ｒ，ｇ，ｂ）成分への色空間変換を行なうため
のrgb色空間変換処理１３２と、rgb色空間変換処理１３
２によって（ｒ，ｇ，ｂ）成分に変換された色成分のう
ち、ｒ，ｇ成分を用いて入力画像の色分布と、あらかじ
め用意されていた色分布との間の類似度マッピングをお
こなうためのr-g空間類似度計算処理１３４とを含む。
なお、ｒ，ｇ，ｂの和は式（１）〜（３）より分かるよ
うに必ず「１」となる。すなわち、ｒ，ｇ，ｂのうちの
どの一つも他の二つの成分によって表わされる。したが
って、ｒ，ｇのみを用いて色分布を表現することができ
る。Referring to FIG. 6, the face area detection unit 120 includes: an rgb color space conversion process 132 that converts the input video stream 130, composed of RGB components, into (r, g, b) components by applying the conversions of equations (1) to (3); and an r-g space similarity calculation process 134 that, using the r and g components of the converted (r, g, b) color components, performs similarity mapping between the color distribution of the input image and a color distribution prepared in advance. Note that, as equations (1) to (3) show, the sum of r, g, and b is always 1. That is, any one of r, g, and b can be expressed by the other two components, so the color distribution can be represented using r and g alone.
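Equations (1) to (3) are not reproduced in this text. The following is a minimal sketch that assumes they are the usual chromaticity normalization r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B), which is consistent with the statement that r, g, and b always sum to 1. The function name and the handling of black pixels are illustrative assumptions, not part of the described embodiment.

```python
def to_rgb_chromaticity(R, G, B):
    """Map one RGB pixel to normalized (r, g, b) whose components sum to 1.

    Assumed form of equations (1) to (3): each channel divided by R + G + B.
    """
    total = R + G + B
    if total == 0:
        # Black pixel: chromaticity is undefined. This sketch assigns equal
        # shares so that the components still sum to 1 (an assumption).
        return (1.0 / 3, 1.0 / 3, 1.0 / 3)
    return (R / total, G / total, B / total)
```

Because b = 1 - r - g, only the pair (r, g) needs to be kept for the color distribution.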

【００４９】ここで、人の顔の色における色分布は２Ｄ
ガウス分布Ｇ₁（ｍ₁，Ｖ₁ ²）と表現することができる。
ただし、各変数は式（４）〜（７）によって表される値
である。式のうち上線を引いた変数は、各変数のガウス
平均を表わす。Ｖ₁は２Ｄガウス分布の共分散行列を表
わす。Ｎは顔領域内の全画素数を表わす。Here, the color distribution of human face colors can be expressed as a 2D Gaussian distribution G₁(m₁, V₁²), where each variable takes the value given by equations (4) to (7). Overlined variables in the equations denote the Gaussian mean of the respective variable. V₁ is the covariance matrix of the 2D Gaussian distribution, and N is the total number of pixels in the face area.

【００５０】顔領域検出部１２０はさらに、それぞれrg
b色空間変換処理１３２およびr-g空間類似度計算処理１
３４と同様の処理をＮＣｂ−ＮＣｒ色空間で行なうため
の(NCb, NCr)色空間変換処理１３６およびNCb-NCr空間
類似度計算処理１３８とを含む。ＮＣｂ，ＮＣｒを計算
するためのＹ，Ｃｂ，Ｃｒ色成分の値は式（８）〜（１
０）によって計算される。さらにこれらを式（１１）
（１２）に示されるように正規化することでＮＣｂおよ
びＮＣｒ色成分が得られる。The face area detection unit 120 further includes an (NCb, NCr) color space conversion process 136 and an NCb-NCr space similarity calculation process 138, which perform in the NCb-NCr color space the same processing as the rgb color space conversion process 132 and the r-g space similarity calculation process 134, respectively. The Y, Cb, and Cr color component values used to compute NCb and NCr are calculated by equations (8) to (10). Normalizing these as shown in equations (11) and (12) yields the NCb and NCr color components.

【００５１】このＮＣｂおよびＮＣｒ色成分を用いて表
わした色分布は２Ｄガウス分布Ｇ２（ｍ₂，Ｖ₂ ²）と表
現することができる。ここで、各変数は式（１３）〜
（１６）で表現される値を表わす。ここでも上線の意味
は式（４）〜（７）における意味と同じである。Ｖ₂は
２Ｄガウス分布の共分散マトリクスである。The color distribution expressed using the NCb and NCr color components can be represented as a 2D Gaussian distribution G₂(m₂, V₂²), where each variable takes the value given by equations (13) to (16). The overline has the same meaning here as in equations (4) to (7). V₂ is the covariance matrix of the 2D Gaussian distribution.

【００５２】顔領域検出部１２０はさらに、r-g空間類
似度計算処理１３４で計算された類似度マッピングとNC
b-NCr空間類似度計算処理１３８で計算された類似度マ
ッピングとを統合して新たな類似度マッピングを生じる
ための類似度計算の統合処理１４０と、画像に対して後
処理をし、類似度計算の統合処理１４０によって得られ
た最終的な類似度マッピングにしたがって、顔領域に相
当する部分をだ円または矩形領域で表現してその位置を
示す情報を出力するための後処理１４２とを含む。The face area detection unit 120 further includes: a similarity integration process 140 that combines the similarity mapping calculated by the r-g space similarity calculation process 134 with the similarity mapping calculated by the NCb-NCr space similarity calculation process 138 to produce a new similarity mapping; and a post-process 142 that post-processes the image and, according to the final similarity mapping obtained by the similarity integration process 140, represents the portion corresponding to the face area as an ellipse or rectangle and outputs information indicating its position.

【００５３】r-g空間類似度計算処理１３４およびNCb-N
Cr空間類似度計算処理１３８における類似度マッピング
としては、式（１７）を用いた。式（１７）においてｋ
＝１の場合はｒ，ｇ空間、ｋ＝２の場合はＮＣｒ−ＮＣ
ｂ空間の場合にそれぞれ相当する。また式（１７）にお
いてＩ（ｘ，ｙ）は点（ｘ，ｙ）における各色成分の強
さを表わし、ｍは色成分の値の平均の平均ベクトル表現
を表わし、Ｖは共分散マトリクスを表わす。各画素に対
して式（１７）（ｋ＝１，２）の計算を行ない、その結
果に対してあるしきい値を設けておく。このしきい値よ
り小さい値を持つ画素を顔領域の候補とする。Equation (17) was used for the similarity mapping in the r-g space similarity calculation process 134 and the NCb-NCr space similarity calculation process 138. In equation (17), k = 1 corresponds to the r-g space and k = 2 to the NCr-NCb space. I(x, y) denotes the intensity of each color component at point (x, y), m denotes the mean vector of the color component values, and V denotes the covariance matrix. Equation (17) is evaluated for each pixel (k = 1, 2), and a threshold is set for the result; pixels whose value is smaller than this threshold are taken as face area candidates.
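Equation (17) itself is not reproduced in this text. Since it is built from a mean vector m and a covariance matrix V, and smaller values mark face candidates, one plausible reading is a squared Mahalanobis distance from the skin-color Gaussian. The sketch below illustrates that reading only; the function names and the exact form of the distance are assumptions.

```python
def mahalanobis_sq(pixel, mean, cov):
    """Squared Mahalanobis distance of a 2-component color from a 2D Gaussian.

    pixel, mean: (c1, c2) pairs, for example (r, g) or (NCb, NCr).
    cov: 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = pixel[0] - mean[0]
    dy = pixel[1] - mean[1]
    # v^T V^{-1} v with V^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def is_face_candidate(pixel, mean, cov, threshold):
    """A pixel is a face candidate when its distance is below the threshold."""
    return mahalanobis_sq(pixel, mean, cov) < threshold
```

The same function serves both color spaces: it is called once with the (r, g) model (k = 1) and once with the (NCb, NCr) model (k = 2).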

【００５４】類似度計算の統合処理１４０での統合は次
のようにして行なわれる。すなわち、r-g空間類似度計
算処理１３４で得られた類似度マッピングの式ｄ
_k（ｉ，ｊ）（ｋ＝１，２）に対して式（１８）の計算
を行なう。その結果によって式（１９）にしたがい各画
素が顔領域に属する候補か否かの判定を行なう。式（１
８）（１９）において、μ_kおよびσ_k ²はそれぞれｄ
_k（ｉ，ｊ）（ｋ＝１，２）での平均および分散であ
る。またｗ₁およびｗ₂はそれぞれ所定の重みであり、Ｄ
はしきい値である。The similarity integration process 140 works as follows. Equation (18) is evaluated on the similarity mapping values d_k(i, j) (k = 1, 2) obtained by the r-g space similarity calculation process 134. Based on the result, equation (19) decides whether each pixel is a candidate belonging to the face area. In equations (18) and (19), μ_k and σ_k² are the mean and variance of d_k(i, j) (k = 1, 2), w₁ and w₂ are predetermined weights, and D is a threshold.
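Equations (18) and (19) are not reproduced in this text. The sketch below shows one way the named quantities could fit together, under the assumption that each map d_k is standardized by its own mean μ_k and standard deviation σ_k, the two are combined with weights w₁ and w₂, and a pixel is kept when the combined value is below D. This combination rule, the function name, and the flat-list representation of the maps are all assumptions.

```python
from statistics import fmean, pstdev


def integrate_similarity(d1, d2, w1, w2, D):
    """Combine two per-pixel distance lists into a 0/1 face-candidate mask.

    d1, d2: equal-length lists of distances (k = 1 and k = 2).
    Returns a list with 1 for face candidates and 0 otherwise.
    """
    def standardize(values):
        mu = fmean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # Constant map: carries no information, contributes 0 everywhere.
            return [0.0 for _ in values]
        return [(v - mu) / sigma for v in values]

    z1 = standardize(d1)
    z2 = standardize(d2)
    return [1 if w1 * a + w2 * b < D else 0 for a, b in zip(z1, z2)]
```

Standardizing first puts the two color spaces on a common scale, so the weights express relative trust in each space rather than compensating for different value ranges.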

【００５５】後処理１４２は二つの処理を含む。類似度
計算の統合処理１４０の結果、映像の各画素について、
その画素が顔領域に属すると考えられるか否かにしたが
って各画素についてそれぞれ１または０の値が割り当て
られている。そこで、以下の処理によって顔領域を表わ
すだ円または矩形領域を求める。以下の説明では、簡略
化のために顔領域を円として表わす場合を想定する。The post-process 142 consists of two steps. As a result of the similarity integration process 140, each pixel of the video has been assigned a value of 1 or 0 according to whether or not it is considered to belong to the face area. The following processing then finds an ellipse or rectangle representing the face area. For simplicity, the description below assumes that the face area is represented as a circle.

【００５６】まず、ノイズ除去を行なう。ここでは、顔
領域の候補部分を拡大しその周囲の輪郭を求める。次
に、こうして求められた顔領域の候補部分に対して以下
のアルゴリズムで円領域を顔領域に当てはめる。First, noise is removed: the face area candidate region is enlarged and its outer contour is obtained. Next, a circle is fitted to the face area candidate obtained in this way using the following algorithm.

【００５７】最初に式（２０）（２１）によって肌色と
思われる領域の中心（Ｃｘ、Ｃｙ）を求める。ここでＪ
(x, y)は上記処理によって顔領域候補画素には１が、そ
うでない画素には０が割当てられている２値画像であ
る。次に肌色と思われる領域の半径Ｒを式（２２）によ
って計算する。こうして得られた、中心が（Ｃｘ、Ｃ
ｙ）、半径Ｒの円を顔領域とする。なお式（２０）〜
（２２）はあくまで一例であり、アプリケーションおよ
び設計思想によってこれら式としては任意のものを選択
できる。First, the center (Cx, Cy) of the region considered to be skin-colored is obtained by equations (20) and (21). Here J(x, y) is the binary image in which the above processing has assigned 1 to face area candidate pixels and 0 to all other pixels. Next, the radius R of the skin-colored region is calculated by equation (22). The circle with center (Cx, Cy) and radius R obtained in this way is taken as the face area. Equations (20) to (22) are only one example; any suitable formulas may be chosen depending on the application and design philosophy.
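Equations (20) to (22) are not reproduced in this text, and the text itself notes that they are only one possible choice. As an illustration, the sketch below takes the center as the centroid of the candidate pixels of J and the radius as that of a disc with the same area as the candidate region. Both formulas and the function name are assumptions.

```python
import math


def fit_circle(J):
    """Fit a circle to the 1-pixels of a binary image J (a list of rows).

    Returns (cx, cy, radius), or None when J has no candidate pixels.
    """
    xs, ys = [], []
    for y, row in enumerate(J):
        for x, value in enumerate(row):
            if value == 1:
                xs.append(x)
                ys.append(y)
    count = len(xs)
    if count == 0:
        return None
    cx = sum(xs) / count  # centroid, assumed form of equations (20) and (21)
    cy = sum(ys) / count
    radius = math.sqrt(count / math.pi)  # equal-area disc, assumed form of (22)
    return cx, cy, radius
```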

【００５８】図７に顔領域検出部１２０の処理過程およ
び結果の画像を示す。図７（ａ）がオリジナルの画像で
ある。r-g空間類似度計算処理１３４の結果得られた画
像を図７（ｂ）に、NCb-NCr空間類似度計算処理１３８
の結果得られた画像を図７（ｃ）に、それぞれ示す。類
似度計算の統合処理１４０の結果と、その結果に対して
後処理１４２が行なった処理によって得られた顔領域を
示す円（だ円）を図７（ｄ）に示す。こうして得られた
円（だ円）を図７（ａ）のオリジナル画像と合成したの
が図７（ｅ）である。FIG. 7 shows the processing stages of the face area detection unit 120 and the resulting images. FIG. 7(a) is the original image. FIG. 7(b) shows the image obtained by the r-g space similarity calculation process 134, and FIG. 7(c) shows the image obtained by the NCb-NCr space similarity calculation process 138. FIG. 7(d) shows the result of the similarity integration process 140 together with the circle (ellipse) indicating the face area obtained by applying the post-process 142 to that result. FIG. 7(e) shows the circle (ellipse) thus obtained superimposed on the original image of FIG. 7(a).

【００５９】こうして一旦顔領域を決定すると、以下は
この顔領域をトラッキングすればよい。また顔領域のト
ラッキングに失敗したときには、上述の処理を再度行な
うことで顔領域を決定することができる。Once the face area has been determined in this way, it only needs to be tracked from then on. If tracking of the face area fails, the face area can be determined again by repeating the processing described above.

【００６０】次に、映像ストリームユニット分離部１２
２、静止ユニット分類部１２４、およびユニット統合部
１２６の処理の内容について図８〜図１０を参照して説
明する。まず、映像ストリームユニット分離部１２２
は、入力されるフレーム間の差分をとることにより図８
に示されるフレーム間差分１５０が得られる。フレーム
間差分の値が得られたら、以下のようにしてユニットの
分離を行なう。Next, the processing of the video stream unit separation unit 122, the stationary unit classification unit 124, and the unit integrating unit 126 is described with reference to FIGS. 8 to 10. First, the video stream unit separation unit 122 takes the difference between input frames to obtain the inter-frame difference 150 shown in FIG. 8. Once the inter-frame difference values are available, the units are separated as follows.

【００６１】フレームＮ₁〜フレームＮ₂（Ｎ₁＜Ｎ₂）の
間の全てのフレームにおいて、そのフレーム間差分があ
るしきい値（典型的には０）より大きく、フレームＮ₁
−１およびフレームＮ₂＋１においてフレーム間差分が
このしきい値以下の場合に、フレームＮ₁〜Ｎ₂を動きユ
ニットと呼ぶ。またフレームＮ₁〜フレームＮ₂（Ｎ₁＜
Ｎ₂）の間の全てのフレームにおいて、そのフレーム間
差分がこのしきい値以下であり、フレームＮ₁−１およ
びフレームＮ₂＋１においてフレーム間差分がこのしき
い値より大きい場合に、フレームＮ₁〜Ｎ₂を静止ユニッ
トと呼ぶ。なお、しきい値としては小さい値が選択され
るべきであるが、０に限定されるわけではない。When the inter-frame difference is greater than a certain threshold (typically 0) in every frame from frame N₁ to frame N₂ (N₁ < N₂), and is less than or equal to this threshold at frame N₁ - 1 and frame N₂ + 1, frames N₁ to N₂ are called a motion unit. Likewise, when the inter-frame difference is less than or equal to this threshold in every frame from frame N₁ to frame N₂ (N₁ < N₂), and is greater than this threshold at frame N₁ - 1 and frame N₂ + 1, frames N₁ to N₂ are called a stationary unit. A small value should be chosen for the threshold, but it is not limited to 0.
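The definition above amounts to splitting the sequence of inter-frame differences into maximal runs that lie above or at-or-below the threshold. A minimal sketch follows; the function name and the tuple layout are illustrative, and for simplicity it also reports single-frame runs and runs touching the start or end of the sequence, which the definition (N₁ < N₂, bounded on both sides) does not strictly cover.

```python
from itertools import groupby


def separate_units(diffs, threshold=0):
    """Split inter-frame differences into maximal motion/stationary runs.

    diffs: one inter-frame difference value per frame.
    Returns a list of (kind, first_index, last_index) tuples, where kind is
    "motion" (difference > threshold) or "stationary" (difference <= threshold).
    """
    units = []
    index = 0
    for moving, run in groupby(diffs, key=lambda d: d > threshold):
        length = len(list(run))
        kind = "motion" if moving else "stationary"
        units.append((kind, index, index + length - 1))
        index += length
    return units
```

For example, `separate_units([0, 0, 3, 2, 0, 5])` yields a stationary run over frames 0 to 1, a motion run over frames 2 to 3, a stationary run at frame 4, and a motion run at frame 5.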

【００６２】映像ストリームユニット分離部１２２は、
図８に示すフレーム間差分１５０を、上記した定義にし
たがって、静止ユニット１６０、１６４、１６８、１７
２および１７６、ならびに動きユニット１６２、１６
６、１７０、１７４および１７８に分離する。Following the above definitions, the video stream unit separation unit 122 separates the inter-frame difference 150 shown in FIG. 8 into stationary units 160, 164, 168, 172, and 176 and motion units 162, 166, 170, 174, and 178.

【００６３】実際のユニット分離の手順を図１０に示
す。図１０を参照して、まず画像ストリーム内の連続す
る二つのフレームの画像の特徴点抽出およびトラッキン
グを行なう（ステップ２２０）。このときの特徴点抽出
およびトラッキングには、公知のアルゴリズムを用いる
ことができる。たとえばＫＬＴ（Kanade-Lucas-Tomas
i）アルゴリズムを用いることができる。FIG. 10 shows the actual unit separation procedure. Referring to FIG. 10, feature points are first extracted and tracked in the images of two consecutive frames of the image stream (step 220). A known algorithm can be used for this feature point extraction and tracking, for example the KLT (Kanade-Lucas-Tomasi) algorithm.

【００６４】次に、ステップ２２０で抽出された特徴点
の間の対応関係を用いて、フレーム間差分の計算を行な
う（２２２）。こうして得られたフレーム間差分によっ
てユニット分離を行なう（２２４）。Next, an inter-frame difference is calculated using the correspondence between the feature points extracted in step 220 (222). Unit separation is performed based on the inter-frame difference thus obtained (224).

【００６５】次に、静止ユニット分類部１２４が静止ユ
ニット１６０、１６４、１６８、１７２および１７６を
以下のようにして安定静止ユニットと不安定静止ユニッ
トとに分類する。すなわち静止ユニット分類部１２４
は、静止ユニットを構成するフレームの数があるしきい
値以上であればそのユニットを安定静止ユニットに分類
し、それ以外の場合にそのユニットを不安定静止ユニッ
トに分類する。たとえば図８に示すフレーム間差分１５
０の例では、静止ユニット１６０および静止ユニット１
６８がそれぞれ安定静止ユニットにそれぞれ分類され
る。静止ユニット１６４、１７２および１７６は不安定
静止ユニットに分類される。Next, the stationary unit classification unit 124 classifies the stationary units 160, 164, 168, 172, and 176 into stable stationary units and unstable stationary units as follows. If the number of frames making up a stationary unit is at least a certain threshold, the stationary unit classification unit 124 classifies that unit as a stable stationary unit; otherwise it classifies the unit as an unstable stationary unit. In the example of the inter-frame difference 150 shown in FIG. 8, stationary units 160 and 168 are classified as stable stationary units, and stationary units 164, 172, and 176 are classified as unstable stationary units.

【００６６】ここで、安定静止ユニットは、ある長さの
時間以上にわたって人の動きがなかった、と考えられる
ことから、人が一定のポーズをとって動かなかった期間
と考えることができる。これは、たとえば人が講議に集
中している可能性もあるし、また講議に退屈して他を見
ている可能性もある。一方、不安定静止ユニットは、動
きユニットに挟まれたごく短い期間のみであるので、な
んらかの動きに含まれる一次的な静止状態に対応すると
考えられる。A stable stationary unit is a span in which the person showed no movement for at least a certain length of time, and can therefore be regarded as a period during which the person held a fixed pose. The person may, for example, be concentrating on the lecture, or may be bored with it and looking elsewhere. An unstable stationary unit, by contrast, lasts only a very short time between motion units, and is therefore thought to correspond to a momentary pause within some movement.

【００６７】静止ユニット分類部１２４は、安定静止ユ
ニットに関する情報を判定サブシステム１０４に、不安
定静止ユニットに関する情報をユニット統合部１２６
に、それぞれ与える。The stationary unit classification unit 124 supplies information on the stable stationary units to the judgment subsystem 104 and information on the unstable stationary units to the unit integrating unit 126.

【００６８】ユニット統合部１２６は、映像ストリーム
ユニット分離部１２２から与えられる動きユニットに関
する情報と、静止ユニット分類部１２４から与えられる
不安定静止ユニットに関する情報とを統合し、活動検出
ユニットとする。つまり、なんらかの動きが人物に検出
されたフレームと、それら動きのあるフレームの間に挟
まれたごく短い期間の静止フレームとによって、たとえ
ば頷く、首をふる、上下を向く、首を傾ける、居眠りを
している、等、聴衆が講議に対して見せる反応を検出す
ることができると考えられるので、活動検出ユニットと
いう分類とする。ただしこれらの活動は必ずしも講議に
対して集中していることを示すと考えられるわけではな
く、講議とは関係のない動きに対応していることも考え
られる。The unit integrating unit 126 merges the information on motion units supplied from the video stream unit separation unit 122 with the information on unstable stationary units supplied from the stationary unit classification unit 124 to form activity detection units. Frames in which some movement of the person was detected, together with the very short runs of still frames sandwiched between them, are thought to make it possible to detect the reactions the audience shows to the lecture, such as nodding, shaking the head, looking up or down, tilting the head, or dozing; hence the name activity detection unit. These activities do not necessarily indicate concentration on the lecture, however, and may also correspond to movements unrelated to it.
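The classification and merging steps can be sketched together as follows: stationary runs of at least a given length become stable stationary units, and everything else (motion units and the short, unstable stationary units between them) is fused into contiguous activity detection units. The function name, the labels, and the tuple layout are illustrative assumptions.

```python
def build_activity_units(units, min_stable_frames):
    """Label stable stationary units and merge the rest into activity units.

    units: (kind, first, last) tuples in temporal order, with kind being
           "motion" or "stationary".
    min_stable_frames: frame-count threshold for a stable stationary unit.
    Returns (label, first, last) tuples with label "stable" or "activity".
    """
    result = []
    for kind, first, last in units:
        length = last - first + 1
        if kind == "stationary" and length >= min_stable_frames:
            result.append(("stable", first, last))
        elif result and result[-1][0] == "activity":
            # Extend the activity unit that is currently open.
            result[-1] = ("activity", result[-1][1], last)
        else:
            result.append(("activity", first, last))
    return result
```

With a threshold of 5 frames, a sequence such as long still, motion, one still frame, motion, long still collapses to stable, activity, stable.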

【００６９】さて、再び図４を参照して、こうして分類
された安定静止ユニットに関する情報と活動検出ユニッ
トに関する情報とは判定サブシステム１０４に与えら
れ、判定サブシステム１０４がこれら情報と各映像フレ
ームの画像情報とに基づいて聴衆の反応を認識し判定す
る。Referring again to FIG. 4, the information on the stable stationary units and the information on the activity detection units classified in this way are supplied to the judgment subsystem 104, which recognizes and judges the audience's reaction based on this information and the image information of each video frame.

【００７０】図１１を参照して、判定サブシステム１０
４は、静止ユニット情報を受けて聴衆の姿勢を推定する
ための姿勢推定部２３０と、活動検出ユニット情報を受
けて聴衆のジェスチャーを認識するためのジェスチャー
認識部２３２とを含む。姿勢推定部２３０は推定姿勢情
報を、ジェスチャー認識部２３２は推定ジェスチャー情
報を、それぞれ出力する。Referring to FIG. 11, the judgment subsystem 104 includes a posture estimating unit 230 that receives stationary unit information and estimates the posture of the audience, and a gesture recognizing unit 232 that receives activity detection unit information and recognizes the gestures of the audience. The posture estimating unit 230 outputs estimated posture information, and the gesture recognizing unit 232 outputs estimated gesture information.

【００７１】図１２を参照して、姿勢推定部２３０は、
安定静止ユニットフレーム画像２４０から画像の特徴ベ
クトルを抽出するための特徴ベクトル抽出処理部２４２
と、特徴ベクトル抽出処理部２４２からのフレーム画像
の特徴ベクトルを入力として、安定静止ユニットフレー
ム画像２４０に対応する聴衆の姿勢を示す情報（姿勢カ
テゴリ名）を出力するための、あらかじめ学習が済んで
いるニューラルネットによる姿勢判定部２４４とを含
む。この実施の形態のシステムでは、ニューラルネット
による姿勢判定部２４４は検出対象となる一つの姿勢
カテゴリに対して一つのニューラルネットが対応するよ
うに、あらかじめ学習が済んでいる複数個の１対１ニュ
ラルネット２５０と、同じ入力特徴ベクトルに対してこ
れら複数個の１対１ニュラルネット２５０の出力を調
べ、最も高い出力値を示した１対１ニュラルネット２５
０に対応する姿勢カテゴリ名を出力するための最大値検
出部２５２とを含む。Referring to FIG. 12, the posture estimating unit 230 includes: a feature vector extraction processing unit 242 that extracts an image feature vector from a stable stationary unit frame image 240; and a neural-network posture determination unit 244, trained in advance, that takes the feature vector of the frame image from the feature vector extraction processing unit 242 as input and outputs information (a posture category name) indicating the audience posture corresponding to the stable stationary unit frame image 240. In the system of this embodiment, the neural-network posture determination unit 244 includes: a plurality of one-to-one neural networks 250, each trained in advance so that one neural network corresponds to one posture category to be detected; and a maximum value detection unit 252 that examines the outputs of these one-to-one neural networks 250 for the same input feature vector and outputs the posture category name corresponding to the one-to-one neural network 250 with the highest output value.

【００７２】このように、一つの姿勢カテゴリに対して
一つの１対１ニュラルネット２５０を設けるようにする
と、新たな姿勢カテゴリについての認識機能を追加しよ
うとするときに簡単に対応できるという効果がある。す
なわちその場合には、その新たな姿勢カテゴリを検出す
るようあらかじめ学習が行なわれている１対１ニュラル
ネット２５０をニューラルネットによる姿勢判定部２
４４に追加し、その出力を最大値検出部２５２への入力
に接続してやればよい。仮にニューラルネットによる姿
勢判定部２４４の全体を大きなニューラルネットとした
場合には、新たな姿勢カテゴリを検出する機能を追加し
ようとすると、ニューラルネット全体の学習をし直す必
要がある。したがって、実際の応用ではニューラルネッ
トによる姿勢判定部２４４のように複数個の１対１ニュ
ラルネット２５０を設けるようにすることが実用的であ
る。Providing one one-to-one neural network 250 per posture category in this way makes it easy to add recognition of a new posture category. In that case, a one-to-one neural network 250 trained in advance to detect the new posture category is simply added to the neural-network posture determination unit 244, and its output is connected to the input of the maximum value detection unit 252. If the neural-network posture determination unit 244 were instead built as a single large neural network, adding detection of a new posture category would require retraining the entire network. In practical applications it is therefore practical to provide a plurality of one-to-one neural networks 250, as in the neural-network posture determination unit 244.
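The arrangement of one network per category feeding a maximum value detector can be sketched as a mapping from category names to scoring functions plus an argmax. In this sketch the scorers are plain callables standing in for trained networks; the names `Scorer` and `classify` are hypothetical.

```python
from typing import Callable, Dict, Sequence

# A scorer stands in for one trained one-to-one network: it maps a feature
# vector to a single output value for its category.
Scorer = Callable[[Sequence[float]], float]


def classify(features: Sequence[float], scorers: Dict[str, Scorer]) -> str:
    """Return the category name whose scorer gives the highest output."""
    if not scorers:
        raise ValueError("at least one category scorer is required")
    return max(scorers, key=lambda name: scorers[name](features))
```

Adding a category then means adding one entry to the mapping; the existing scorers are left untouched, which mirrors the retraining argument made above.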

【００７３】図１３を参照して、ジェスチャー認識部２
３２も同様の構成を有する。すなわちジェスチャー認識
部２３２は、活動検出ユニットフレーム画像情報２６０
から画像の特徴ベクトルの抽出を行なう特徴ベクトル
の抽出処理２６２と、この特徴ベクトルを入力として
聴衆のジェスチャーをカテゴリに分類してその情報（ジ
ェスチャーカテゴリ名）を出力するためのニューラルネ
ットによるジェスチャー判定部２６４とを含む。Referring to FIG. 13, the gesture recognizing unit 232 has a similar configuration. That is, the gesture recognizing unit 232 includes: a feature vector extraction process 262 that extracts an image feature vector from activity detection unit frame image information 260; and a neural-network gesture determination unit 264 that takes this feature vector as input, classifies the audience's gesture into a category, and outputs that information (a gesture category name).

【００７４】ニューラルネットによるジェスチャー判定
部２６４も、それぞれ一つのジェスチャーカテゴリに対
応するように設けられた複数個の１対１ニューラルネッ
ト２７０と、これら複数個の１対１ニューラルネット２
７０の出力を受け、最も出力の大きかった１対１ニュー
ラルネット２７０に対応するジェスチャーカテゴリ名を
出力するため最大値検出部２７２とを含む。The neural-network gesture determination unit 264 likewise includes: a plurality of one-to-one neural networks 270, each provided for one gesture category; and a maximum value detection unit 272 that receives the outputs of these one-to-one neural networks 270 and outputs the gesture category name corresponding to the one-to-one neural network 270 with the largest output.

【００７５】特徴ベクトル抽出処理部２４２は、この実
施の形態ではウェーブレット変換を用いて安定静止画像
の頭部領域の画像から特徴ベクトルを抽出する。たとえ
ば、入力イメージに対して式（２３）〜（２５）で示さ
れる関係を持つローパスフィルタＨ_i（Ｚ）（ｉ＝０、
１）およびハイパスフィルタＧ_i（Ｚ）（ｉ＝０、１）
を用意する。そしてこれらをｉ＝０、１の順序で組合わ
せて元の入力イメージＰに適用した結果、Ｈ₀Ｈ₁によっ
てＰ₁が、Ｈ₀Ｇ₁によってＩ₁が、Ｇ₀Ｈ₁によってＩ
₂が、Ｇ₀Ｇ₁によってＩ₃が、それぞれ得られたものとす
る。この結果のＰ₁に対してさらに上述のフィルタを適
用して同様にＰ₂、Ｉ₁（Ｐ₁）、Ｉ₂（Ｐ₁）、Ｉ
₃（Ｐ₃）が得られ、以下同様にＰ_N、Ｉ₁（Ｐ_N-1）、Ｉ₂
（Ｐ_N-1）、Ｉ₃（Ｐ_N- ₁）までを得る。こうして得られ
た全ての値と所定のしきい値とを比較して、各値を０ま
たは１のいずれかとする。そしてこれら０または１の値
を所定の順番で並べて特徴ベクトルとする。In this embodiment, the feature vector extraction processing unit 242 uses a wavelet transform to extract a feature vector from the image of the head region of a stable still image. For example, low-pass filters H_i(Z) (i = 0, 1) and high-pass filters G_i(Z) (i = 0, 1) satisfying the relations of equations (23) to (25) are prepared. Applying them to the original input image P in combination, in the order i = 0, 1, yields P₁ from H₀H₁, I₁ from H₀G₁, I₂ from G₀H₁, and I₃ from G₀G₁. Applying the same filters again to the resulting P₁ yields P₂, I₁(P₁), I₂(P₁), and I₃(P₁) in the same way, and the process is repeated until P_N, I₁(P_{N-1}), I₂(P_{N-1}), and I₃(P_{N-1}) are obtained. All the values obtained in this way are compared with a predetermined threshold, and each is set to either 0 or 1. These 0/1 values are then arranged in a predetermined order to form the feature vector.
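The filter relations of equations (23) to (25) are not reproduced in this text. As an illustration of the decomposition and binarization, the sketch below substitutes the simplest wavelet pair, the Haar filters (pairwise average as low-pass, pairwise half-difference as high-pass). The choice of Haar, the use of rows for i = 0 and columns for i = 1, the rule "value > threshold gives 1", and the ordering of the bits are all assumptions.

```python
def haar_level(image):
    """One 2D Haar step on an image (list of equal-length rows, even sizes).

    Returns (P, I1, I2, I3): low/low, low/high, high/low, high/high sub-bands,
    where the first filter runs along rows and the second along columns.
    """
    def split(seq):
        low = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
        high = [(seq[i] - seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
        return low, high

    def transpose(matrix):
        return [list(col) for col in zip(*matrix)]

    def filter_columns(matrix):
        pairs = [split(col) for col in transpose(matrix)]
        return transpose([p[0] for p in pairs]), transpose([p[1] for p in pairs])

    rows = [split(row) for row in image]
    P, I1 = filter_columns([r[0] for r in rows])   # low-pass rows
    I2, I3 = filter_columns([r[1] for r in rows])  # high-pass rows
    return P, I1, I2, I3


def wavelet_feature_vector(image, levels, threshold):
    """Decompose `levels` times and binarize every coefficient against threshold."""
    bits = []
    current = image
    for _ in range(levels):
        current, *details = haar_level(current)
        for band in details:
            bits.extend(1 if v > threshold else 0 for row in band for v in row)
    # Finally append the binarized coarsest approximation P_N.
    bits.extend(1 if v > threshold else 0 for row in current for v in row)
    return bits
```

Each level halves both image dimensions, so the image side lengths must be divisible by 2 raised to the number of levels.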

【００７６】一方、特徴ベクトルの抽出処理２６２によ
る特徴ベクトル抽出は特徴ベクトル抽出処理部２４２の
場合とは異なる。この場合は、聴衆の頭部の動きが問題
となるので、連続する２つのフレーム間の差分を基本と
して特徴ベクトルを抽出する。この実施の形態では、公
知の方法を用いて、２フレーム（フレームｔ−１および
ｔ）間での人物の頭部の動きを表わす角度θ_t、φ_tおよ
びρ_tを求める。θ_tは光軸周りの回転角度を、φ_tは対
象に固定された座標軸のｘ軸と、画像面に平行な回転軸
Φとの間の角度を、ρ_tはΦ軸周りの角度を、それぞれ
表わす。この値を、連続するＮ個のフレームに対して求
める。The feature vector extraction by the feature vector extraction process 262, by contrast, differs from that of the feature vector extraction processing unit 242. Here the movement of the audience member's head is what matters, so the feature vector is extracted from the difference between two consecutive frames. In this embodiment, a known method is used to obtain the angles θ_t, φ_t, and ρ_t representing the movement of the person's head between two frames (frames t - 1 and t). θ_t is the rotation angle about the optical axis, φ_t is the angle between the x-axis of a coordinate system fixed to the subject and a rotation axis Φ parallel to the image plane, and ρ_t is the angle of rotation about the Φ axis. These values are obtained for N consecutive frames.

【００７７】さらに、処理をよりロバストにするため
に、他の情報を追加する。これら情報としては、本実施
の形態では頭部の動きの中心の座標、頭部の動きの方
向、ｘ軸方向およびｙ軸方向の頭部の動きのエネルギー
等がある。本実施の形態ではこれらの値を連続する所定
のＮ個のフレームから抽出し、上記したＮフレーム分の
角度情報と合わせてそれらを並べて特徴ベクトルとして
いる。Further, other information is added to make the process more robust. In the present embodiment, such information includes the coordinates of the center of the head movement, the direction of the head movement, the energy of the head movement in the x-axis direction and the y-axis direction, and the like. In the present embodiment, these values are extracted from predetermined N consecutive frames, and are combined with the above-described angle information of the N frames to form a feature vector.

【００７８】なお、上記したＮ個は固定された数値と考
えて論じてきたが、たとえば活動検出ユニットに含まれ
るフレーム数がＮ個より多かったり、少なかったりする
場合がある。これに対しては、次のようにして対処す
る。まず、望ましいＮの値を定める。次に、活動検出ユ
ニット内のフレーム数とＮとを比較する。フレーム数が
Ｎ個より多ければ、最初のＮ個のフレームのみを用いて
特徴ベクトルを抽出する。フレーム数がＮ個より少なけ
れば、その足りない部分については最後のフレームと同
じであると仮定して特徴ベクトルを抽出する。このよう
な補正を行なっても、１対１ニューラルネット２７０を
適切に学習させておくことで正確な判定を行なうことが
可能である。Although the above N has been discussed as being a fixed numerical value, for example, the number of frames included in the activity detection unit may be larger or smaller than N. This is dealt with as follows. First, a desirable value of N is determined. Next, N is compared with the number of frames in the activity detection unit. If the number of frames is larger than N, a feature vector is extracted using only the first N frames. If the number of frames is less than N, the feature vector is extracted on the assumption that the missing portion is the same as the last frame. Even if such a correction is made, it is possible to make an accurate determination by appropriately learning the one-to-one neural network 270.
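The length adjustment described above (keep the first N frames, or repeat the last frame until N is reached) can be sketched as follows. The function name is illustrative, and `frames` may hold any per-frame values, such as the angle triples described earlier.

```python
def fit_to_n_frames(frames, n):
    """Return exactly n per-frame entries from an activity detection unit.

    Longer units are truncated to their first n frames; shorter units are
    padded by repeating the last frame.
    """
    if not frames:
        raise ValueError("an activity detection unit needs at least one frame")
    if len(frames) >= n:
        return list(frames[:n])
    return list(frames) + [frames[-1]] * (n - len(frames))
```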

【００７９】次に、図４に示す集計部１０６の構成につ
いて説明する。集計部１０６は、実質的にソフトウェア
によって実現されるので、以下にそのソフトウェアで行
なわれる処理の構成について図１４を参照して説明す
る。この処理ではまず最初に、必要な記憶領域の確保お
よび初期化を行なう（２９０）。次に、一定のサンプリ
ング時間の経過を待つ（２９２）。この処理は、判定サ
ブシステム１０４の出力があったときに生じるイベント
を待つ処理としてもよい。サンプリング時間が経過する
と（または判定サブシステム１０４からの出力がある
と）、その時点で判定サブシステム１０４の出力する姿
勢情報（姿勢カテゴリ名）、ジェスチャー情報（ジェス
チャーカテゴリ名）に基づいて姿勢、ジェスチャーカテ
ゴリのスコアを評価する。この評価は本実施の形態では
以下のようにして行なう。なお、最大値検出部２５２の
出力する姿勢カテゴリ名として「Frontal-view（正
面）」「Right（右）」「Left（左）」「Up（上）」「D
own（下）」の５種類があり、最大値検出部２７２の出
力するジェスチャーカテゴリ名として、「nod（頷
く）」「shake head（首をふる）」「Look right（右を
見る）」「Look left（左を見る）」「Look down（下を
見る）」「Look Up（上を見る）」の６通りがあるもの
とする。Next, the configuration of the counting unit 106 shown in FIG. 4 is described. Because the counting unit 106 is realized essentially in software, the structure of the processing performed by that software is described below with reference to FIG. 14. The processing first allocates and initializes the necessary storage areas (290). It then waits for a fixed sampling time to elapse (292). This step may instead wait for an event raised when the judgment subsystem 104 produces output. When the sampling time has elapsed (or when output arrives from the judgment subsystem 104), the scores of the posture and gesture categories are evaluated based on the posture information (posture category name) and gesture information (gesture category name) output by the judgment subsystem 104 at that point. In the present embodiment this evaluation is performed as follows. It is assumed that the maximum value detection unit 252 outputs five posture category names, "Frontal-view", "Right", "Left", "Up", and "Down", and that the maximum value detection unit 272 outputs six gesture category names, "nod", "shake head", "Look right", "Look left", "Look down", and "Look Up".

【００８０】まず、活動検出ユニットについては、次の
ような評価を行なう。（１）このユニットについてジェスチャーカテゴリが
「nod」または「shake head」であればこのユニットのス
コアを１とする。First, activity detection units are evaluated as follows. (1) If the gesture category of the unit is "nod" or "shake head", the score of the unit is set to 1.

【００８１】（２）直前の安定静止ユニットの姿勢カ
テゴリ名が「Left」であり、この活動検出ユニットのジ
ェスチャーカテゴリが「Look right」であり、かつこの
ユニットの直後の安定静止ユニットの姿勢カテゴリ名が
「Frontal-view」であれば、この活動検出ユニットのス
コアを「１」とする。(2) If the posture category name of the immediately preceding stable stationary unit is "Left", the gesture category of this activity detection unit is "Look right", and the posture category name of the stable stationary unit immediately following this unit is "Frontal-view", the score of this activity detection unit is set to "1".

【００８２】（３）直前の安定静止ユニットの姿勢カ
テゴリ名が「Right」であり、この活動検出ユニットの
ジェスチャーカテゴリが「Look left」であり、かつこ
のユニットの直後の安定静止ユニットの姿勢カテゴリ名
が「Frontal-view」であれば、この活動検出ユニットの
スコアを「１」とする。(3) If the posture category name of the immediately preceding stable stationary unit is “Right”, the gesture category of this activity detection unit is “Look left”, and the posture category name of the stable stationary unit immediately following this unit is “Frontal-view”, the score of this activity detection unit is set to “1”.

【００８３】（４）直前の安定静止ユニットの姿勢カ
テゴリ名が「Up」であり、この活動検出ユニットのジェ
スチャーカテゴリが「Look down」であり、かつこのユ
ニットの直後の安定静止ユニットの姿勢カテゴリ名が
「Frontal-view」であれば、この活動検出ユニットのス
コアを「１」とする。(4) If the posture category name of the immediately preceding stable stationary unit is “Up”, the gesture category of this activity detection unit is “Look down”, and the posture category name of the stable stationary unit immediately following this unit is “Frontal-view”, the score of this activity detection unit is set to “1”.

【００８４】（５）直前の安定静止ユニットの姿勢カ
テゴリ名が「Down」であり、この活動検出ユニットのジ
ェスチャーカテゴリが「Look up」であり、かつこのユ
ニットの直後の安定静止ユニットの姿勢カテゴリ名が
「Frontal-view」であれば、この活動検出ユニットのス
コアを「１」とする。(5) If the posture category name of the immediately preceding stable stationary unit is “Down”, the gesture category of this activity detection unit is “Look up”, and the posture category name of the stable stationary unit immediately following this unit is “Frontal-view”, the score of this activity detection unit is set to “1”.

【００８５】（６）他の全ての場合についてはスコア
は「１」である。安定静止ユニットについては以下の基
準でスコアを評価する。(6) In all other cases, the score is “1”. The score of the stable stationary unit is evaluated based on the following criteria.

【００８６】（１）もしもこのユニットの姿勢カテゴ
リ名が「Frontal-view」であればスコアは「１」とす
る。(1) If the posture category name of this unit is “Frontal-view”, the score is “1”.

【００８７】（２）他の場合についてはスコアは
「０」とする。こうして得られたスコアを、所定時間だ
け累算する（ステップ２９６および２９８）。所定時間
経過するごとに累算結果を式（２６）によって計算し反
応のスコアＳｆとして出力する（ステップ３００）。な
お式（２６）においてＴは累算時間、Ｓ_uはｕ番目のユ
ニットのスコア、ｔ_uはｕ番目のユニットの持続時間を
表わす。なお、このスコア評価方法は１例であって、ア
プリケーションによって任意の評価方法を採用すること
ができる。(2) In other cases, the score is “0”. The scores obtained in this way are accumulated over a predetermined time (steps 296 and 298). Each time the predetermined time elapses, the accumulated result is computed by equation (26) and output as the reaction score S_f (step 300). In equation (26), T is the accumulation time, S_u is the score of the u-th unit, and t_u is the duration of the u-th unit. This scoring method is only one example; any scoring method may be adopted depending on the application.
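The scoring rules for activity detection units and stable stationary units listed above can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the function names, the case-insensitive comparison of category names, and the `fallback` parameter are assumptions introduced here. The published text gives “1” for rule (6); `fallback` defaults to that value and is exposed only so that another value can be tried.

```python
# Hypothetical helper names; only the category names and rules come from the text.

def _norm(name: str) -> str:
    """Normalize a category name so "Look Up" and "Look up" compare equal."""
    return name.strip().lower()


# Rules (2)-(5): (preceding posture, gesture) pairs that turn the head back
# toward the front.
RETURN_TO_FRONT = {
    ("left", "look right"),
    ("right", "look left"),
    ("up", "look down"),
    ("down", "look up"),
}


def score_activity_unit(gesture: str, prev_posture: str, next_posture: str,
                        fallback: int = 1) -> int:
    """Score one activity detection unit according to rules (1)-(6)."""
    if _norm(gesture) in {"nod", "shake head"}:
        return 1  # rule (1)
    returns_to_front = (_norm(prev_posture), _norm(gesture)) in RETURN_TO_FRONT
    if returns_to_front and _norm(next_posture) == "frontal-view":
        return 1  # rules (2)-(5)
    return fallback  # rule (6); the value is an assumption-controlled parameter


def score_stable_unit(posture: str) -> int:
    """Score one stable stationary unit: 1 when facing front, else 0."""
    return 1 if _norm(posture) == "frontal-view" else 0
```

For example, `score_activity_unit("Look right", "Left", "Frontal-view")` matches rule (2), while `score_stable_unit("Left")` falls under the second stable-unit rule.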

【００８８】続いて、スコア計算用の作業領域をクリア
して（ステップ３０２）制御をステップ２９２に戻し、
以下同じ処理を繰返す。Subsequently, the work area for score calculation is cleared (step 302), control returns to step 292, and the same processing is repeated thereafter.
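The accumulate, output, and clear cycle of steps 296 to 302 can be sketched as below. Equation (26) itself is not reproduced in this passage, so the sketch assumes a duration-weighted average, S_f = (1/T) Σ_u S_u t_u, which is consistent with the symbol definitions but remains an assumption. The class name `ReactionAccumulator` is hypothetical, and T is taken as the total duration actually accumulated in the window.

```python
from typing import Optional


class ReactionAccumulator:
    """Accumulates unit scores over a window and emits S_f when it is full."""

    def __init__(self, window: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window      # accumulation time T (seconds)
        self._weighted = 0.0      # running sum of S_u * t_u
        self._elapsed = 0.0       # running sum of t_u

    def add(self, score: float, duration: float) -> Optional[float]:
        """Add one unit (steps 296/298); return S_f at window end, else None."""
        self._weighted += score * duration
        self._elapsed += duration
        if self._elapsed < self.window:
            return None
        s_f = self._weighted / self._elapsed   # assumed form of equation (26)
        self._weighted = 0.0                   # clear the work area (step 302)
        self._elapsed = 0.0
        return s_f
```

A caller would feed each scored unit to `add` and forward any non-`None` result to the transmission/reception unit.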

【００８９】上記した最終スコアＳ_fは送受信部１０８
（図４参照）を介して集計センター３０に出力される。The final score S_f described above is output to the counting center 30 via the transmission/reception unit 108 (see FIG. 4).

【００９０】以上が、受講会場用システム３４の構成お
よび動作の概略である。［集計センター３０］図１５を参照して、集計センター
３０は、受講会場用システム３４および講師用システム
３２と通信を行なうための送受信部３２０と、受講会場
用システム３４から送信されてくるスコアを全ての受講
会場用システム３４にわたって集計するための結果集計
部３２２と、集計センター３０および集計方法等の設定
を行なうためにユーザが操作するためのシステム設定部
３２４と、システム設定部３２４によって設定された条
件にしたがって結果集計部３２２での集計方法を制御
し、集計結果を送受信部３２０を介して講師用システ
ム３２に定期的に送信するためのシステム管理部３２６
と、集計センター３０の稼動状況、結果集計部３２２に
よる集計状況等を表示するためにシステム管理部３２６
が用いる表示装置３２８とを含む。The above is an outline of the configuration and operation of the lecture hall system 34. [Counting Center 30] Referring to FIG. 15, the counting center 30 includes: a transmission/reception unit 320 for communicating with the lecture hall systems 34 and the lecturer system 32; a result counting unit 322 for aggregating the scores transmitted from the lecture hall systems 34 across all of the lecture hall systems 34; a system setting unit 324 operated by a user to configure the counting center 30, the aggregation method, and the like; a system management unit 326 for controlling the aggregation method used by the result counting unit 322 according to the conditions set through the system setting unit 324, and for periodically transmitting the aggregation result to the lecturer system 32 via the transmission/reception unit 320; and a display device 328 used by the system management unit 326 to display the operating status of the counting center 30, the aggregation status of the result counting unit 322, and the like.

【００９１】図１５に示されるのは最も単純な構成であ
るが、受講会場用システム３４の出力について上記した
とおりの説明に基づけば、当業者であれば一般的なコン
ピュータを用いてこの集計センター３０を容易に作成す
ることが可能であろう。なお、この実施の形態では、集
計センター３０が講師用システム３２または受講会場用
システム３４とは別個に設けられている。しかしもちろ
ん本願発明はそのような構成に限定されない。たとえば
集計センター３０が講師用システム３２または受講会場
用システム３４のうちの任意の一つと同じコンピュータ
上で実現されてもよい。［講師用システム３２］図１６を参照して、講師用シス
テム３２は、講師の映像を出力するためのビデオカメラ
５０と、ビデオカメラ５０の出力する映像信号をデジ
タル映像信号に変換し、圧縮するための画像圧縮部３４
０と、画像圧縮部３４０の出力する圧縮された画像を各
受講会場用システム３４および集計センター３０に送信
し、また集計センター３０から聴衆の反応を示すスコア
の集計結果を受信するための送受信部３４２と、送受信
部３４２によって集計センター３０から受取られた聴衆
の反応のスコアの集計結果に対して、表示のための集
計、受講会場用システム３４ごとの集計、分類、順序付
け等、反応を示す情報に対して行なうべき情報処理を実
行するための結果処理部３４４と、結果処理部３４４の
出力する結果を、どのような形式で表示するかを設定す
るための表示条件設定部３４６と、表示条件設定部３４
６によって設定された条件にしたがって結果処理部３４
４の出力に基づいて、聴衆の反応を分かりやすく表現す
る映像信号を生成するための映像生成部３４８と、映
像生成部３４８の出力する映像信号を表示するための、
前述の大画面表示装置８６とを含む。FIG. 15 shows the simplest configuration. Given the above description of the output of the lecture hall system 34, a person skilled in the art could readily build the counting center 30 using a general-purpose computer. In this embodiment, the counting center 30 is provided separately from the lecturer system 32 and the lecture hall systems 34. The present invention is of course not limited to such a configuration. For example, the counting center 30 may be implemented on the same computer as the lecturer system 32 or any one of the lecture hall systems 34. [Lecturer System 32] Referring to FIG. 16, the lecturer system 32 includes: a video camera 50 for outputting video of the lecturer; an image compression unit 340 for converting the video signal output by the video camera 50 into a digital video signal and compressing it; a transmission/reception unit 342 for transmitting the compressed image output by the image compression unit 340 to each lecture hall system 34 and the counting center 30, and for receiving from the counting center 30 the aggregated scores indicating the audience's reaction; a result processing unit 344 for performing, on the aggregated reaction scores received from the counting center 30 by the transmission/reception unit 342, the information processing that the reaction information requires, such as aggregation for display, aggregation per lecture hall system 34, classification, and ordering; a display condition setting unit 346 for setting the format in which the results output by the result processing unit 344 are displayed; a video generation unit 348 for generating, from the output of the result processing unit 344 and according to the conditions set through the display condition setting unit 346, a video signal that expresses the audience's reaction in an easily understood way; and the aforementioned large screen display device 86 for displaying the video signal output by the video generation unit 348.

【００９２】反応の表示形式としてはたとえば、単純
に、全ての受講会場用システム３４でスコアが「１」と
なったら１００パーセントとなるように、反応をパーセ
ントに換算して数字として表示してもよい。なおこの実
施の形態では、このように聴衆の反応を数字として表示
できる。したがって、この数字に基づいてどのような表
示形式を実現することもできる。たとえば受講会場用シ
ステム３４ごとに表示区画を設けて受講会場ごとに聴衆
の反応を色で表示すること、または棒グラフ、円グラフ
等のグラフ形式で表示することが考えられる。また、聴
衆を代表する複数（例えば１０人から３０人程度）の人
物の映像を合成し、それらの人物の反応が、ちょうど全
聴衆の反応の割合に対応するように、これら人物の映像
の姿勢、動き、表情等を制御してもよい。［システムの動作］上記した構成のこのシステムは以下
のように動作する。あらかじめ、講師用システム３２は
集計センター３０との接続を確立しておく。受講会場用
システム３４はいずれも講師用システム３２からの映像
を受信できるように、かつ集計センター３０に対して情
報を送信できるように、集計センター３０との間で適切
なコネクションをネットワーク上で確立しておく。As one display format, the reaction may simply be converted to a percentage and displayed as a number, scaled so that a score of “1” in every lecture hall system 34 corresponds to 100 percent. In this embodiment the audience's reaction is thus available as a number, so any display format can be built on that number. For example, a display section may be provided for each lecture hall system 34 and the audience's reaction at each lecture hall shown by color, or the reaction may be shown as a bar graph, pie chart, or similar graph. Alternatively, video of several persons (for example, about 10 to 30) representing the audience may be synthesized, and the posture, movement, facial expression, and so on of these persons controlled so that their reactions match the proportions of reactions in the audience as a whole. [Operation of System] The system configured as described above operates as follows. The lecturer system 32 establishes a connection with the counting center 30 in advance. Each lecture hall system 34 likewise establishes an appropriate network connection with the counting center 30 so that it can receive video from the lecturer system 32 and send information to the counting center 30.
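The percentage and per-hall display formats mentioned above can be illustrated with a small text renderer. This is only a sketch: the function names, the hall identifiers, and the text bar standing in for a color or graph display are all assumptions, and scores are assumed to lie between 0 and 1.

```python
from typing import Dict, List


def to_percent(score: float) -> float:
    """Map a score in [0, 1] to a percentage; a score of 1 gives 100."""
    return 100.0 * min(1.0, max(0.0, score))


def render_halls(scores: Dict[str, float], width: int = 10) -> List[str]:
    """Return one text bar per lecture hall, sorted by hall identifier."""
    lines = []
    for hall, score in sorted(scores.items()):
        percent = to_percent(score)
        filled = round(percent / 100.0 * width)   # number of filled cells
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{hall}: {bar} {percent:.0f}%")
    return lines
```

A graphical display would replace the text bar with a colored section or chart while keeping the same score-to-percentage mapping.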

【００９３】講師用システム３２では、講師を撮像した
映像信号（音声信号を含む）を画像圧縮部３４０によっ
て圧縮し集計センター３０に送信する。集計センター３
０は、接続してきた受講会場用システム３４に対してこ
の映像を送信する。In the lecturer system 32, the video signal (including the audio signal) capturing the lecturer is compressed by the image compression unit 340 and transmitted to the counting center 30. The counting center 30 transmits this video to each lecture hall system 34 that has connected to it.

【００９４】受講会場用システム３４は、集計センター
３０から送信されてきた講師の映像を伸長し、大画面表
示装置８６上に表示する。受講会場用システム３４で
はあわせて、ビデオカメラ５０を用いて聴衆を撮像し、
これをフレームごとのデジタル信号に変換する。さら
に、これらフレームを検出・分離サブシステム１０２に
よって安定静止ユニットと活動検出ユニットとに分類す
る。そして、それらユニットに属するフレームの映像信
号に基づいて、判定サブシステム１０４が聴衆の反応を
推定しスコアとして出力する。集計部１０６が所定時
間ごとにこのスコアを集計して送受信部１０８を介し
て集計センター３０に最終スコアを送信する。The lecture hall system 34 decompresses the lecturer's video transmitted from the counting center 30 and displays it on the large screen display device 86. At the same time, the lecture hall system 34 images the audience with the video camera 50 and converts the result into a digital signal frame by frame. The detection and separation subsystem 102 then classifies these frames into stable stationary units and activity detection units. Based on the video signals of the frames belonging to those units, the determination subsystem 104 estimates the audience's reaction and outputs it as a score. The counting unit 106 aggregates this score at predetermined intervals and transmits the final score to the counting center 30 via the transmission/reception unit 108.

【００９５】集計センター３０では、接続されている複
数の受講会場用システム３４の全てに対して式（２６）
で表わされる計算を行ない、その結果を講師用システム
３２に送信する。The counting center 30 performs the calculation expressed by equation (26) over all of the connected lecture hall systems 34 and transmits the result to the lecturer system 32.
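One way to picture the counting center's role is a store of the most recent final score from each connected lecture hall system plus a combined value. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and a plain mean over halls is used as a stand-in because the exact form in which equation (26) is applied across halls is not spelled out in this passage.

```python
from typing import Dict, Optional


class CountingCenter:
    """Keeps the latest score per lecture hall and combines them."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}

    def report(self, hall_id: str, score: float) -> None:
        """Record the most recent final score received from one hall."""
        self._latest[hall_id] = score

    def aggregate(self) -> Optional[float]:
        """Return the mean of the latest scores, or None if no hall reported."""
        if not self._latest:
            return None
        return sum(self._latest.values()) / len(self._latest)
```

The value returned by `aggregate` is what would be sent periodically to the lecturer system.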

【００９６】講師用システム３２では、この集計結果を
集計センター３０から受信し、聴衆の反応を表現するよ
うにあらかじめ設定された表現形式にしたがって講師に
対して聴衆の反応を示す映像を提示する。The lecturer system 32 receives this aggregation result from the counting center 30 and presents to the lecturer a video showing the audience's reaction, in a representation format set in advance for that purpose.

【００９７】講師は、このようにして提示された聴衆の
反応を見て、話題を変えたり、聴衆の注意をひくための
なんらかの行為、たとえば質問を発する等の適切な行為
を行なったり、聴衆の反応が満足すべきものであればそ
のまま講演を継続したり、という適切な行動をとること
ができる。［実施の形態２］上記した実施の形態１のシステムは、
講師から各講習会場にいる聴衆への、情報の一方通行を
行なうシステムであった。しかし本発明はこうしたシス
テムのみに適用可能なわけではなく、双方向の情報の送
受信を行なうシステム、たとえば仮想空間を利用した電
子会議システムにも適用可能である。この場合には、シ
ステム内の、集計センターを除く全ての受講会場用シス
テムに、講師用システム３２と同様の映像の送信装置お
よび他の受講会場用システムからの反応に基づいて仮想
空間の内容を制御できる機能を追加すればよい。この場
合、講師用システム３２に相当するものは不要であり、
集計センター３０と複数個の受講会場用システムとで
システムを構築できる。Seeing the audience reaction presented in this way, the lecturer can act appropriately: change the topic, do something to draw the audience's attention such as asking a question, or simply continue the lecture if the reaction is satisfactory. [Embodiment 2] The system of Embodiment 1 described above conveys information one way, from the lecturer to the audience at each lecture hall. The present invention is not limited to such systems, however; it is also applicable to systems that exchange information in both directions, for example an electronic conference system using a virtual space. In that case, every lecture hall system in the system, excluding the counting center, is given a video transmission device like that of the lecturer system 32 and a function for controlling the contents of the virtual space based on the reactions from the other lecture hall systems. Nothing corresponding to the lecturer system 32 is then needed, and the system can be built from the counting center 30 and a plurality of lecture hall systems.

【００９８】図１７に、そうした受講会場用システム３
６０のブロック図を示す。図１７を参照して、この受講
会場用システム３６０は、図４に示した受講会場用シス
テムと同様に、ビデオカメラ５０、映像取得部１００、
検出・分離サブシステム１０２、判定サブシステム１０
４、集計部１０６を備え、それを備え、それに加えて、
集計部１０６からの集計結果を集計センター３０に送信
し、集計センター３０から与えられる他の受講会場用シ
ステムでの反応を受信するための送受信部３７０と、送
受信部３７０から得られた他の受講会場用システムでの
聴衆の反応に対して、結果処理部３４４（図１６参照）
と同様の処理を行なうための結果処理部３７２と、結果
処理部３７２によって処理された結果に基づき、当該集
計結果を反映するように仮想空間および仮想空間内の人
物に関する情報を更新するための仮想空間維持部３７４
と、仮想空間維持部３７４によって維持されている仮想
空間情報にしたがって仮想空間内の環境と人物との映像
を生成する映像生成部３７６と、映像生成部３７６に
よって生成された仮想空間の映像を表示するための大画
面表示装置８６とを含む。なお図１７において、図４に
示された各ブロックと同じ機能を持つブロックには同じ
参照符号を付してある。それらの構成も同じである。し
たがってここではそれらについての詳細な説明は繰返さ
ない。FIG. 17 shows a block diagram of such a lecture hall system 360. Referring to FIG. 17, the lecture hall system 360 includes, like the lecture hall system shown in FIG. 4, a video camera 50, a video acquisition unit 100, a detection and separation subsystem 102, a determination subsystem 104, and a counting unit 106. In addition, it includes: a transmission/reception unit 370 for transmitting the aggregation result from the counting unit 106 to the counting center 30 and for receiving the reactions at the other lecture hall systems supplied by the counting center 30; a result processing unit 372 for performing, on the audience reactions at the other lecture hall systems obtained from the transmission/reception unit 370, the same processing as the result processing unit 344 (see FIG. 16); a virtual space maintenance unit 374 for updating, based on the results processed by the result processing unit 372, the information on the virtual space and the persons in it so as to reflect the aggregation result; a video generation unit 376 for generating video of the environment and persons in the virtual space according to the virtual space information maintained by the virtual space maintenance unit 374; and a large screen display device 86 for displaying the video of the virtual space generated by the video generation unit 376. In FIG. 17, blocks having the same functions as blocks shown in FIG. 4 carry the same reference numerals and have the same configurations, so their detailed description is not repeated here.

【００９９】この実施の形態２のシステムでは、１対多
という形式ではなく、様々な場所にいる複数の人物が、
互いに他の人物の反応を把握しながら、ディスカッショ
ンを行なうことができるという効果を奏する。The system of Embodiment 2 has the effect that, rather than in a one-to-many format, a plurality of persons at various locations can hold a discussion while each grasps the reactions of the others.

【０１００】今回開示された実施の形態はすべての点で
例示であって制限的なものではないと考えられるべきで
ある。本発明の範囲は上記した説明ではなくて特許請求
の範囲によって示され、特許請求の範囲と均等の意味お
よび範囲内でのすべての変更が含まれることが意図され
る。The embodiments disclosed this time are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.

[Brief description of the drawings]

【図１】本願発明の実施の形態１にかかる遠隔放送シ
ステムの全体構成を示す図である。FIG. 1 is a diagram showing an overall configuration of a remote broadcasting system according to a first embodiment of the present invention.

【図２】本発明の実施の形態１にかかるシステムを構
成する集計センター３０、講師用システム３２および受
講会場用システム３４を実現するためのコンピュータの
外観図である。FIG. 2 is an external view of a computer for realizing a tallying center 30, a lecturer system 32, and a lecture hall system 34, which constitute the system according to the first exemplary embodiment of the present invention.

【図３】図２にかかるコンピュータのハードウェア的
構成を示すブロック図である。FIG. 3 is a block diagram showing a hardware configuration of the computer according to FIG. 2;

【図４】本発明の実施の形態１のシステムの受講会場
用システム３４のブロック図である。FIG. 4 is a block diagram of a lecture hall system 34 of the system according to the first embodiment of the present invention.

【図５】検出・分離サブシステム１０２の機能ブロッ
ク図である。FIG. 5 is a functional block diagram of the detection / separation subsystem 102.

【図６】顔領域検出部１２０の機能ブロック図であ
る。FIG. 6 is a functional block diagram of the face area detection unit 120.

【図７】顔領域検出部１２０の処理の結果例を示す図
である。FIG. 7 is a diagram illustrating an example of a result of a process performed by the face area detection unit 120;

【図８】映像シーケンスのユニットへの分類方法を説
明するための図である。FIG. 8 is a diagram for explaining a method of classifying a video sequence into units.

【図９】ユニットを安定静止ユニットと活動検出ユニ
ットとに再構成する方法を説明するための図である。FIG. 9 is a diagram for explaining a method of reconfiguring a unit into a stable stationary unit and an activity detecting unit.

【図１０】映像シーケンスのユニットへの分類処理の
流れを示す図である。FIG. 10 is a diagram showing a flow of a process of classifying a video sequence into units.

【図１１】判定サブシステム１０４の機能ブロック図
である。FIG. 11 is a functional block diagram of a determination subsystem 104.

【図１２】姿勢推定部２３０の機能ブロック図であ
る。FIG. 12 is a functional block diagram of a posture estimating unit 230.

【図１３】ジェスチャー認識部２３２の機能ブロック
図である。FIG. 13 is a functional block diagram of the gesture recognition unit 232.

【図１４】集計部１０６の処理構成を示すフローチャ
ートである。FIG. 14 is a flowchart illustrating the processing configuration of the counting unit 106.

【図１５】集計センター３０の構成を示す機能ブロッ
ク図である。FIG. 15 is a functional block diagram showing the configuration of the counting center 30.

【図１６】講師用システム３２の構成を示す機能ブロ
ック図である。FIG. 16 is a functional block diagram showing the configuration of the instructor system 32.

【図１７】本願発明の実施の形態２にかかる受講会場
用システム３６０の機能ブロック図である。FIG. 17 is a functional block diagram of a lecture hall system 360 according to the second embodiment of the present invention;

[Explanation of symbols]

２０遠隔放送システム、３０集計センター、３２講
師用システム、３４受講会場用システム、５０カメ
ラ、１００映像取得部、１０２検出・分離サブシス
テム、１０４判定サブシステム、１０６集計部、１
２０顔領域検出部、１２２映像ストリームユニット
分離部、１２４静止ユニット分類部、１２６ユニッ
ト統合部、２３０姿勢推定部、２３２ジェスチャー
認識部、２４２、２６２特徴ベクトル抽出処理部、２
４４姿勢判定部、２６４ジェスチャー判定部。Reference Signs List: 20 remote broadcasting system, 30 counting center, 32 lecturer system, 34 lecture hall system, 50 camera, 100 video acquisition unit, 102 detection/separation subsystem, 104 determination subsystem, 106 counting unit, 120 face area detection unit, 122 video stream unit separation unit, 124 still unit classification unit, 126 unit integration unit, 230 posture estimation unit, 232 gesture recognition unit, 242, 262 feature vector extraction processing unit, 244 posture determination unit, 264 gesture determination unit.

─────────────────────────────────────────────────────

【手続補正書】[Procedure amendment]

【提出日】平成１２年８月１８日（２０００．８．１
８）[Submission date] August 18, 2000 (2000.8.18)

【手続補正１】[Procedure amendment 1]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００５０[Correction target item name] 0050

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００５０】顔領域検出部１２０はさらに、それぞれrg
b色空間変換処理１３２およびr-g空間類似度計算処理１
３４と同様の処理をＮＣｂ−ＮＣｒ色空間で行なうため
の(NCb, NCr)色空間変換処理１３６およびNCb-NCr空間
類似度計算処理１３８を含む。ＮＣｂ，ＮＣｒを計算す
るためのＹ，Ｃｂ，Ｃｒ色成分の値は式（８）〜（１
０）によって計算される。さらにこれらを式（１１）
（１２）に示されるように正規化することでＮＣｂおよ
びＮＣｒ色成分が得られる。The face area detection unit 120 further includes an (NCb, NCr) color space conversion process 136 and an NCb-NCr space similarity calculation process 138, which perform in the NCb-NCr color space the same processing as the rgb color space conversion process 132 and the r-g space similarity calculation process 134, respectively. The Y, Cb, and Cr color component values used to calculate NCb and NCr are computed by equations (8) to (10). Normalizing these as shown in equations (11) and (12) yields the NCb and NCr color components.

【手続補正２】[Procedure amendment 2]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００５２[Correction target item name] 0052

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００５２】顔領域検出部１２０はさらに、r-g空間類
似度計算処理１３４で計算された類似度マッピングとNC
b-NCr空間類似度計算処理１３８で計算された類似度マ
ッピングとを統合して新たな類似度マッピングを生じる
ための類似度計算の統合処理１４０と、画像に対して後
処理をし、類似度計算の統合処理１４０によって得られ
た最終的な類似度マッピングにしたがって、顔領域に相
当する部分をだ円または矩形領域で表現してその位置を
示す情報を出力するための後処理１４２とを含む。The face area detection unit 120 further includes: a similarity integration process 140 for combining the similarity mapping calculated by the r-g space similarity calculation process 134 with the similarity mapping calculated by the NCb-NCr space similarity calculation process 138 to produce a new similarity mapping; and a post-process 142 for post-processing the image and, according to the final similarity mapping obtained by the similarity integration process 140, representing the portion corresponding to the face area as an elliptical or rectangular region and outputting information indicating its position.

【手続補正３】[Procedure amendment 3]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００６２[Correction target item name] 0062

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００６２】映像ストリームユニット分離部１２２は、
図８に示すフレーム間差分１５０を、上記した定義にし
たがって、静止ユニット１６０、１６４、１６８、１７
２および１７６、ならびに動きユニット１６２、１６
６、１７０、１７４および１７８に分離する。The video stream unit separation unit 122 separates the inter-frame difference 150 shown in FIG. 8, according to the definition given above, into still units 160, 164, 168, 172, and 176 and motion units 162, 166, 170, 174, and 178.
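The separation into alternating still and motion units can be pictured as grouping consecutive frames by whether their inter-frame difference is large. The definition the specification relies on is given earlier and is not reproduced in this amendment, so the sketch below assumes a simple fixed threshold; the function name `split_units` and the `threshold` parameter are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def split_units(diffs: Sequence[float],
                threshold: float) -> List[Tuple[str, int, int]]:
    """Group frame differences into runs labelled "still" or "motion".

    Each result is (label, start, end) with `end` exclusive, so the units
    alternate in the way FIG. 8 alternates still and motion units.
    """
    units = []
    start = 0
    # Assumption: a frame belongs to a motion unit when its difference
    # exceeds the threshold, and to a still unit otherwise.
    for is_motion, run in groupby(diffs, key=lambda d: d > threshold):
        length = len(list(run))
        units.append(("motion" if is_motion else "still", start, start + length))
        start += length
    return units
```

Short units produced this way would still need the later regrouping into stable stationary units and activity detection units that the specification describes.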

【手続補正４】[Procedure amendment 4]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００７１[Correction target item name] 0071

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００７１】図１２を参照して、姿勢推定部２３０は、
安定静止ユニットフレーム画像２４０から画像の特徴ベ
クトルを抽出するための特徴ベクトル抽出処理部２４２
と、特徴ベクトル抽出処理部２４２からのフレーム画像
の特徴ベクトルを入力として、安定静止ユニットフレー
ム画像２４０に対応する聴衆の姿勢を示す情報（姿勢カ
テゴリ名）を出力するための、あらかじめ学習が済んで
いるニューラルネットによる姿勢判定部２４４とを含
む。この実施の形態のシステムでは、ニューラルネット
による姿勢判定部２４４は検出対象となる一つの姿勢カ
テゴリに対して一つのニューラルネットが対応するよう
に、あらかじめ学習が済んでいる複数個の１対１ニュー
ラルネット２５０と、同じ入力特徴ベクトルに対してこ
れら複数個の１対１ニューラルネット２５０の出力を調
べ、最も高い出力値を示した１対１ニューラルネット２
５０に対応する姿勢カテゴリ名を出力するための最大値
検出部２５２とを含む。Referring to FIG. 12, the posture estimation unit 230 includes: a feature vector extraction processing unit 242 for extracting an image feature vector from a stable stationary unit frame image 240; and a neural-network posture determination unit 244, trained in advance, which takes the feature vector of the frame image from the feature vector extraction processing unit 242 as input and outputs information (a posture category name) indicating the audience posture corresponding to the stable stationary unit frame image 240. In the system of this embodiment, the neural-network posture determination unit 244 includes: a plurality of one-to-one neural networks 250, trained in advance so that one neural network corresponds to each posture category to be detected; and a maximum value detection unit 252 that examines the outputs of these one-to-one neural networks 250 for the same input feature vector and outputs the posture category name corresponding to the one-to-one neural network 250 with the highest output value.

【手続補正５】[Procedure amendment 5]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００７２[Correction target item name] 0072

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００７２】このように、一つの姿勢カテゴリに対して
一つの１対１ニューラルネット２５０を設けるようにす
ると、新たな姿勢カテゴリについての認識機能を追加し
ようとするときに簡単に対応できるという効果がある。
すなわちその場合には、その新たな姿勢カテゴリを検出
するようあらかじめ学習が行なわれている１対１ニュー
ラルネット２５０をニューラルネットによる姿勢判定部
２４４に追加し、その出力を最大値検出部２５２への入
力に接続してやればよい。仮にニューラルネットによる
姿勢判定部２４４の全体を大きなニューラルネットとし
た場合には、新たな姿勢カテゴリを検出する機能を追加
しようとすると、ニューラルネット全体の学習をし直す
必要がある。したがって、実際の応用ではニューラルネ
ットによる姿勢判定部２４４のように複数個の１対１ニ
ューラルネット２５０を設けるようにすることが実用的
である。Providing one one-to-one neural network 250 per posture category in this way has the advantage that a recognition function for a new posture category is easy to add. In that case, a one-to-one neural network 250 trained in advance to detect the new posture category is simply added to the neural-network posture determination unit 244 and its output connected to the input of the maximum value detection unit 252. If the whole neural-network posture determination unit 244 were instead a single large neural network, adding the ability to detect a new posture category would require retraining the entire network. In actual applications it is therefore practical to provide a plurality of one-to-one neural networks 250, as in the neural-network posture determination unit 244.
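The arrangement of one detector per category feeding a maximum value detector can be sketched as follows. The class name `MaxValueDetector` and its methods are hypothetical, and each scorer is any callable returning a confidence for a feature vector; in the embodiment these would be the trained one-to-one neural networks, for which simple functions stand in here.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


class MaxValueDetector:
    """Holds one scorer per category and reports the highest-scoring one."""

    def __init__(self) -> None:
        self._scorers: Dict[str, Scorer] = {}

    def add_category(self, name: str, scorer: Scorer) -> None:
        """Register a pre-trained scorer; existing scorers are untouched."""
        self._scorers[name] = scorer

    def classify(self, features: Sequence[float]) -> str:
        """Return the category whose scorer gives the highest output."""
        if not self._scorers:
            raise ValueError("no categories registered")
        return max(self._scorers, key=lambda name: self._scorers[name](features))
```

Adding a category is a single `add_category` call, which mirrors the point made in the text: the existing per-category detectors need no retraining.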

【手続補正６】[Procedure amendment 6]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００７３[Correction target item name] 0073

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００７３】図１３を参照して、ジェスチャー認識部２
３２も同様の構成を有する。すなわちジェスチャー認識
部２３２は、活動検出ユニットフレーム画像情報２６０
から画像の特徴ベクトルの抽出を行なう特徴ベクトルの
抽出処理２６２と、この特徴ベクトルを入力として聴衆
のジェスチャーをカテゴリに分類してその情報（ジェス
チャーカテゴリ名）を出力するためのニューラルネット
によるジェスチャー判定部２６４とを含む。Referring to FIG. 13, the gesture recognition unit 232 has a similar configuration. That is, the gesture recognition unit 232 includes: a feature vector extraction process 262 for extracting an image feature vector from activity detection unit frame image information 260; and a neural-network gesture determination unit 264 that takes this feature vector as input, classifies the audience's gesture into a category, and outputs that information (a gesture category name).

【手続補正７】[Procedure amendment 7]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９０[Correction target item name] 0090

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００９０】以上が、受講会場用システム３４の構成お
よび動作の概略である。［集計センター３０］図１５を参照して、集計センター
３０は、受講会場用システム３４および講師用システム
３２と通信を行なうための送受信部３２０と、受講会場
用システム３４から送信されてくるスコアを全ての受講
会場用システム３４にわたって集計するための結果集計
部３２２と、集計センター３０および集計方法等の設定
を行なうためにユーザが操作するためのシステム設定部
３２４と、システム設定部３２４によって設定された条
件にしたがって結果集計部３２２での集計方法を制御
し、集計結果を送受信部３２０を介して講師用システム
３２に定期的に送信するためのシステム管理部３２６
と、集計センター３０の稼動状況、結果集計部３２２に
よる集計状況等を表示するためにシステム管理部３２６
が用いる表示装置３２８とを含む。The above is an outline of the configuration and operation of the lecture hall system 34. [Counting Center 30] Referring to FIG. 15, the counting center 30 includes: a transmission/reception unit 320 for communicating with the lecture hall systems 34 and the lecturer system 32; a result counting unit 322 for aggregating the scores transmitted from the lecture hall systems 34 across all of the lecture hall systems 34; a system setting unit 324 operated by a user to configure the counting center 30, the aggregation method, and the like; a system management unit 326 for controlling the aggregation method used by the result counting unit 322 according to the conditions set through the system setting unit 324, and for periodically transmitting the aggregation result to the lecturer system 32 via the transmission/reception unit 320; and a display device 328 used by the system management unit 326 to display the operating status of the counting center 30, the aggregation status of the result counting unit 322, and the like.

【手続補正８】[Procedure amendment 8]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９１[Correction target item name] 0091

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００９１】図１５に示されるのは最も単純な構成であ
るが、受講会場用システム３４の出力について上記した
とおりの説明に基づけば、当業者であれば一般的なコン
ピュータを用いてこの集計センター３０を容易に作成す
ることが可能であろう。なお、この実施の形態では、集
計センター３０が講師用システム３２または受講会場用
システム３４とは別個に設けられている。しかしもちろ
ん本願発明はそのような構成に限定されない。たとえば
集計センター３０が講師用システム３２または受講会場
用システム３４のうちの任意の一つと同じコンピュータ
上で実現されてもよい。［講師用システム３２］図１６を参照して、講師用シス
テム３２は、講師の映像を出力するためのビデオカメラ
５０と、ビデオカメラ５０の出力する映像信号をデジタ
ル映像信号に変換し、圧縮するための画像圧縮部３４０
と、画像圧縮部３４０の出力する圧縮された画像を各受
講会場用システム３４および集計センター３０に送信
し、また集計センター３０から聴衆の反応を示すスコア
の集計結果を受信するための送受信部３４２と、送受信
部３４２によって集計センター３０から受取られた聴衆
の反応のスコアの集計結果に対して、表示のための集
計、受講会場用システム３４ごとの集計、分類、順序付
け等、反応を示す情報に対して行なうべき情報処理を実
行するための結果処理部３４４と、結果処理部３４４の
出力する結果を、どのような形式で表示するかを設定す
るための表示条件設定部３４６と、表示条件設定部３４
６によって設定された条件にしたがって結果処理部３４
４の出力に基づいて、聴衆の反応を分かりやすく表現す
る映像信号を生成するための映像生成部３４８と、映像
生成部３４８の出力する映像信号を表示するための、前
述の大画面表示装置８６とを含む。FIG. 15 shows the simplest configuration. Given the above description of the output of the lecture hall system 34, a person skilled in the art could readily build the counting center 30 using a general-purpose computer. In this embodiment, the counting center 30 is provided separately from the lecturer system 32 and the lecture hall systems 34. The present invention is of course not limited to such a configuration. For example, the counting center 30 may be implemented on the same computer as the lecturer system 32 or any one of the lecture hall systems 34. [Lecturer System 32] Referring to FIG. 16, the lecturer system 32 includes: a video camera 50 for outputting video of the lecturer; an image compression unit 340 for converting the video signal output by the video camera 50 into a digital video signal and compressing it; a transmission/reception unit 342 for transmitting the compressed image output by the image compression unit 340 to each lecture hall system 34 and the counting center 30, and for receiving from the counting center 30 the aggregated scores indicating the audience's reaction; a result processing unit 344 for performing, on the aggregated reaction scores received from the counting center 30 by the transmission/reception unit 342, the information processing that the reaction information requires, such as aggregation for display, aggregation per lecture hall system 34, classification, and ordering; a display condition setting unit 346 for setting the format in which the results output by the result processing unit 344 are displayed; a video generation unit 348 for generating, from the output of the result processing unit 344 and according to the conditions set through the display condition setting unit 346, a video signal that expresses the audience's reaction in an easily understood way; and the aforementioned large screen display device 86 for displaying the video signal output by the video generation unit 348.

【手続補正９】[Procedure amendment 9]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９４[Correction target item name] 0094

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００９４】受講会場用システム３４は、集計センター
３０から送信されてきた講師の映像を伸長し、大画面表
示装置８６上に表示する。受講会場用システム３４では
あわせて、ビデオカメラ５０を用いて聴衆を撮像し、こ
れをフレームごとのデジタル信号に変換する。さらに、
これらフレームを検出・分離サブシステム１０２によっ
て安定静止ユニットと活動検出ユニットとに分類する。
そして、それらユニットに属するフレームの映像信号に
基づいて、判定サブシステム１０４が聴衆の反応を推定
しスコアとして出力する。集計部１０６が所定時間ごと
にこのスコアを集計して送受信部１０８を介して集計セ
ンター３０に最終スコアを送信する。[0094] The attendance hall system 34 decompresses the video of the lecturer transmitted from the tallying center 30 and displays it on the large-screen display device 86. At the same time, the attendance hall system 34 captures the audience with the video camera 50 and converts the video into a digital signal for each frame. These frames are then classified by the detection and separation subsystem 102 into stable stationary units and activity detection units. Based on the video signals of the frames belonging to those units, the determination subsystem 104 estimates the audience response and outputs it as a score. The tallying unit 106 aggregates these scores at predetermined intervals and transmits the final score to the tallying center 30 via the transmission/reception unit 108.
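The flow in this paragraph can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names `segment_units` and `tally_scores`, the thresholds, and the use of a plain average per interval are all hypothetical. The duration-based rule for separating stable from unstable stationary units follows the wording of claim 2; the paragraph itself only states that scores are aggregated at predetermined intervals.

```python
from collections import defaultdict
from itertools import groupby
from typing import Iterable, List, Sequence, Tuple

Unit = Tuple[str, int, int]  # (kind, first frame index, end index exclusive)


def segment_units(diffs: Sequence[float],
                  motion_threshold: float,
                  min_stable_frames: int) -> List[Unit]:
    """Split a sequence of inter-frame differences into stable and activity units.

    diffs[i] is an assumed scalar measure of change at frame i.
    """
    # Step 1: label each frame as still or motion from the inter-frame difference.
    labels = ["motion" if d > motion_threshold else "still" for d in diffs]

    # Step 2: group runs into units; a still run counts as stable only if it
    # lasts at least min_stable_frames, otherwise it is unstable.
    units: List[Unit] = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == "still":
            label = "stable" if length >= min_stable_frames else "unstable"
        units.append((label, start, start + length))
        start += length

    # Step 3: merge consecutive unstable and motion units into one activity unit.
    merged: List[Unit] = []
    for label, begin, end in units:
        kind = "stable" if label == "stable" else "activity"
        if merged and merged[-1][0] == kind:
            merged[-1] = (kind, merged[-1][1], end)
        else:
            merged.append((kind, begin, end))
    return merged


def tally_scores(samples: Iterable[Tuple[float, float]],
                 interval_s: float) -> List[Tuple[int, float]]:
    """Average (timestamp_seconds, score) samples over fixed-length intervals.

    Averaging is an assumption; any other aggregate could be substituted.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    buckets = defaultdict(list)
    for timestamp, score in samples:
        buckets[int(timestamp // interval_s)].append(score)
    return [(index, sum(values) / len(values))
            for index, values in sorted(buckets.items())]


if __name__ == "__main__":
    print(segment_units([0, 0, 0, 0, 5, 5, 0, 5, 0, 0, 0], 1.0, 3))
    print(tally_scores([(0.0, 1.0), (1.0, 3.0), (5.0, 4.0)], 5.0))
```

In this sketch a short pause between two movements is absorbed into the surrounding activity unit, so that only sufficiently long still periods are treated as stable.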

【手続補正１０】[Procedure amendment 10]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９７[Correction target item name] 0097

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００９７】講師は、このようにして提示された聴衆の
反応を見て、話題を変えたり、聴衆の注意をひくための
なんらかの行為、たとえば質問を発する等の適切な行為
を行なったり、聴衆の反応が満足すべきものであればそ
のまま講演を継続したり、という適切な行動をとること
ができる。［実施の形態２］上記した実施の形態１のシステムは、
講師から各講習会場にいる聴衆への、情報の一方通行を
行なうシステムであった。しかし本発明はこうしたシス
テムのみに適用可能なわけではなく、双方向の情報の送
受信を行なうシステム、たとえば仮想空間を利用した電
子会議システムにも適用可能である。この場合には、シ
ステム内の、集計センターを除く全ての受講会場用シス
テムに、講師用システム３２と同様の映像の送信装置お
よび他の受講会場用システムからの反応に基づいて仮想
空間の内容を制御できる機能を追加すればよい。この場
合、講師用システム３２に相当するものは不要であり、
集計センター３０と複数個の受講会場用システムとでシ
ステムを構築できる。[0097] The lecturer can watch the audience reaction presented in this way and respond appropriately: change the topic, do something to draw the audience's attention such as asking a question, or, if the reaction is satisfactory, simply continue the lecture. [Embodiment 2] The system of Embodiment 1 described above conveys information one way, from the lecturer to the audience at each lecture hall. However, the present invention is not limited to such systems; it is also applicable to systems that transmit and receive information in both directions, for example an electronic conference system using a virtual space. In that case, it suffices to add, to every attendance hall system in the system other than the tallying center, a video transmission device similar to that of the lecturer system 32 and a function for controlling the contents of the virtual space based on the reactions from the other attendance hall systems. No counterpart of the lecturer system 32 is then needed, and the system can be built from the tallying center 30 and a plurality of attendance hall systems.

【手続補正１１】[Procedure amendment 11]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００９８[Correction target item name] 0098

【補正方法】変更[Correction method] Change

【補正内容】[Correction contents]

【００９８】図１７に、そうした受講会場用システム３
６０のブロック図を示す。図１７を参照して、この受講
会場用システム３６０は、図４に示した受講会場用シス
テムと同様に、ビデオカメラ５０、映像取得部１００、
検出・分離サブシステム１０２、判定サブシステム１０
４、集計部１０６を備え、それに加えて、集計部１０６
からの集計結果を集計センター３０に送信し、集計セン
ター３０から与えられる他の受講会場用システムでの反
応を受信するための送受信部３７０と、送受信部３７０
から得られた他の受講会場用システムでの聴衆の反応に
対して、結果処理部３４４（図１６参照）と同様の処理
を行なうための結果処理部３７２と、結果処理部３７２
によって処理された結果に基づき、当該集計結果を反映
するように仮想空間および仮想空間内の人物に関する情
報を更新するための仮想空間維持部３７４と、仮想空間
維持部３７４によって維持されている仮想空間情報にし
たがって仮想空間内の環境と人物との映像を生成する映
像生成部３７６と、映像生成部３７６によって生成され
た仮想空間の映像を表示するための大画面表示装置８６
とを含む。なお図１７において、図４に示された各ブロ
ックと同じ機能を持つブロックには同じ参照符号を付し
てある。それらの構成も同じである。したがってここで
はそれらについての詳細な説明は繰返さない。[0098] FIG. 17 shows a block diagram of such an attendance hall system 360. Referring to FIG. 17, the attendance hall system 360 includes, like the attendance hall system shown in FIG. 4, a video camera 50, a video acquisition unit 100, a detection and separation subsystem 102, a determination subsystem 104, and a tallying unit 106. In addition, it includes: a transmission/reception unit 370 for transmitting the tallying result from the tallying unit 106 to the tallying center 30 and for receiving the reactions at the other attendance hall systems supplied from the tallying center 30; a result processing unit 372 for performing, on the audience reactions at the other attendance hall systems obtained from the transmission/reception unit 370, the same processing as the result processing unit 344 (see FIG. 16); a virtual space maintaining unit 374 for updating, based on the result processed by the result processing unit 372, the information about the virtual space and the persons in it so as to reflect the tallying result; a video generation unit 376 for generating video of the environment and the persons in the virtual space in accordance with the virtual space information maintained by the virtual space maintaining unit 374; and a large-screen display device 86 for displaying the video of the virtual space generated by the video generation unit 376. In FIG. 17, blocks having the same functions as the blocks shown in FIG. 4 are denoted by the same reference numerals. Their configurations are also the same, so their detailed description is not repeated here.

フロントページの続き (51)Int.Cl.⁷ 識別記号 ＦＩ テーマコード(参考) Ｈ０４Ｎ 7/15 ６３０ Ｇ０６Ｆ 15/70 ４６５Ａ (72)発明者 川戸慎二郎 京都府相楽郡精華町大字乾谷小字三平谷５番地 株式会社エイ・ティ・アール知能映像通信研究所内 (72)発明者 大谷淳 京都府相楽郡精華町大字乾谷小字三平谷５番地 株式会社エイ・ティ・アール知能映像通信研究所内 Ｆターム(参考) 2C028 AA12 BA02 BB04 BC05 BD02 CA12 DA06 5B057 AA20 BA02 CA01 CA08 CA12 CA16 DA12 DB02 DB06 DB09 DC08 DC25 DC32 DC40 5C064 AA02 AB04 AC04 AC13 AC18 AD08 AD14 5L096 DA02 FA19 GA08 GA41 HA02 HA04 HA11 JA03 JA11 JA22 KA04 9A001 BB04 EE02 EE05 FF02 GG05 HH06 HH20 HH30 HH31
Continued from the front page
(51) Int.Cl.⁷, identification symbol, FI, theme code (reference): H04N 7/15 630; G06F 15/70 465A
(72) Inventor: Shinjiro Kawato, c/o ATR Media Integration and Communication Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Atsushi Otani, c/o ATR Media Integration and Communication Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto
F-terms (reference): 2C028 AA12 BA02 BB04 BC05 BD02 CA12 DA06; 5B057 AA20 BA02 CA01 CA08 CA12 CA16 DA12 DB02 DB06 DB09 DC08 DC25 DC32 DC40; 5C064 AA02 AB04 AC04 AC13 AC18 AD08 AD14; 5L096 DA02 FA19 GA08 GA41 HA02 HA04 HA11 JA03 JA11 JA22 KA04; 9A001 BB04 EE02 EE05 FF02 GG05 HH06 HH20 HH30 HH31

Claims

[Claims]

1. A human reaction recognition device for recognizing a reaction of a person to a displayed video, comprising: video acquisition means for acquiring a video sequence of face images of the person; first classifying means for classifying the acquired video sequence into stable stationary units and activity detection units based on motion between video frames; and reaction recognition means for recognizing the person's reaction from the frame sequences classified as stable stationary units and the frame sequences classified as activity detection units.

2. The human reaction recognition device according to claim 1, wherein the first classifying means includes: second classifying means for classifying the video sequence into still units and motion units based on the inter-frame difference between adjacent frames; third classifying means for classifying each still unit as a stable stationary unit or an unstable stationary unit based on its duration; and integrating means for integrating consecutive unstable stationary units and motion units into an activity detection unit.

3. The human reaction recognition device according to claim 1, wherein the reaction recognition means includes: first feature vector extraction means for extracting a feature vector of the person's face image from frames in the stable stationary unit; a first neural network, trained in advance, that receives the feature vector output from the first feature vector extraction means and outputs posture information corresponding to that feature vector; second feature vector extraction means for extracting a feature vector corresponding to the motion of the person's face image from inter-frame difference information in the activity detection unit; and a second neural network, trained in advance, that receives the feature vector output from the second feature vector extraction means and outputs gesture information corresponding to that feature vector.

4. The human reaction recognition device according to claim 3, wherein at least one of the first neural network and the second neural network includes a plurality of one-to-one neural networks, each associated with a predetermined reaction category, each of which receives the feature vector output from the first or second feature vector extraction means and outputs a degree of relevance to its predetermined reaction category.

5. The human reaction recognition device according to claim 1, further comprising: a video camera; video signal conversion means for converting the video signal output from the video camera into a digital signal for each frame; and face area specifying means for specifying the face area of the person in each frame based on the video sequence output from the video signal conversion means.

6. The human reaction recognition device according to any one of claims 1 to 5, further comprising: first means for specifying the face area of the person in each frame by a first method based on the video sequence output from the video signal conversion means; second means for specifying the face area of the person in each frame by a second method based on the video sequence output from the video signal conversion means; and means for specifying the face area by integrating the face area specification results of the first means and the second means.

7. The human reaction recognition device according to claim 6, wherein the video sequence is an RGB color video sequence, and the first means includes means for specifying the face area based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into an rg color space and a predetermined color distribution pattern.

8. The human reaction recognition device according to claim 6, wherein the video sequence is an RGB color video sequence, and the first means includes means for specifying the face area based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into an NCb-NCr color space and a predetermined color distribution pattern.

9. A computer-readable recording medium storing a program for causing a computer to operate as a reaction recognition device for recognizing a reaction of a person to a displayed video, wherein the program comprises: a first classification program portion for classifying a video sequence acquired of the person's face images into stable stationary units and activity detection units based on motion between video frames; and a reaction recognition program portion for recognizing the person's reaction from the frame sequences classified as stable stationary units and the frame sequences classified as activity detection units.

10. The computer-readable recording medium according to claim 9, wherein the first classification program portion includes: a second classification program portion for classifying the video sequence into still units and motion units based on the inter-frame difference between adjacent frames; a third classification program portion for classifying each still unit as a stable stationary unit or an unstable stationary unit based on its duration; and an integration program portion for integrating consecutive unstable stationary units and motion units into an activity detection unit.

11. The computer-readable recording medium according to claim 10, wherein the reaction recognition program portion includes: a first feature vector extraction program portion for extracting a feature vector of the person's face image from frames in the stable stationary unit; a first neural network program portion, trained in advance, that receives the feature vector output from the first feature vector extraction program portion and outputs posture information corresponding to that feature vector; a second feature vector extraction program portion for extracting a feature vector corresponding to the motion of the person's face image from inter-frame difference information in the activity detection unit; and a second neural network program portion, trained in advance, that receives the feature vector output from the second feature vector extraction program portion and outputs gesture information corresponding to that feature vector.

12. The computer-readable recording medium according to claim 11, wherein at least one of the first neural network program portion and the second neural network program portion includes a plurality of one-to-one neural network program portions, each associated with a predetermined reaction category, each of which receives the feature vector output from the feature vector extraction program portion and outputs a degree of relevance to its predetermined reaction category.

13. The computer-readable recording medium according to any one of claims 9 to 12, wherein the program further comprises a face area specifying program portion for specifying the face area of the person in each frame based on the video sequence and providing the face area to the first classification program portion.

14. The computer-readable recording medium according to claim 9, wherein the face area specifying program portion includes: a first program portion for specifying the face area of the person in each frame by a first method based on the video sequence; a second program portion for specifying the face area of the person in each frame by a second method based on the video sequence; and a face area integration program portion for specifying the face area by integrating the face area specification results of the first program portion and the second program portion.

15. The computer-readable recording medium according to claim 14, wherein the video sequence is an RGB color video sequence, and the first program portion includes a program portion for specifying the face area based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into an rg color space and a predetermined color distribution pattern.

16. The computer-readable recording medium according to claim 14, wherein the video sequence is an RGB color video sequence, and the first program portion includes a program portion for specifying the face area based on the similarity between the color distribution in the video obtained by converting the RGB color video sequence into an NCb-NCr color space and a predetermined color distribution pattern.
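
The color-based face area specification recited in claims 7 and 15 compares a color distribution in rg space with a predetermined color distribution pattern. The following Python sketch illustrates one way this could look. It assumes the usual normalized chromaticity definition r = R/(R+G+B), g = G/(R+G+B), and it uses histogram intersection as the similarity measure purely for illustration; the claims do not name a measure. The function names, the bin count, and the neutral value chosen for black pixels are hypothetical.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def rgb_to_rg(pixel: Pixel) -> Tuple[float, float]:
    """Convert an RGB pixel to normalized rg chromaticity."""
    red, green, blue = pixel
    total = red + green + blue
    if total == 0:
        # Black has no defined chromaticity; the neutral point is an assumption.
        return (1 / 3, 1 / 3)
    return (red / total, green / total)


def rg_histogram(pixels: Iterable[Pixel], bins: int = 8) -> List[List[float]]:
    """Build a normalized bins x bins histogram of rg chromaticities."""
    hist = [[0.0] * bins for _ in range(bins)]
    count = 0
    for pixel in pixels:
        r, g = rgb_to_rg(pixel)
        i = min(int(r * bins), bins - 1)
        j = min(int(g * bins), bins - 1)
        hist[i][j] += 1.0
        count += 1
    if count:
        hist = [[cell / count for cell in row] for row in hist]
    return hist


def histogram_intersection(h1: List[List[float]],
                           h2: List[List[float]]) -> float:
    """Similarity in [0, 1] between two normalized histograms of equal shape."""
    return sum(min(a, b)
               for row1, row2 in zip(h1, h2)
               for a, b in zip(row1, row2))


if __name__ == "__main__":
    # Hypothetical reference pattern and candidate region.
    reference = rg_histogram([(200, 120, 80)] * 3)
    candidate = rg_histogram([(50, 100, 200)])
    print(histogram_intersection(reference, reference))
    print(histogram_intersection(reference, candidate))
```

Because rg chromaticity divides out overall intensity, a region keeps roughly the same (r, g) values when the lighting level changes, which is one reason such a space is convenient for matching against a fixed color pattern.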