JP5771127B2

JP5771127B2 - Attention level estimation device and program thereof

Info

Publication number: JP5771127B2
Application number: JP2011249799A
Authority: JP
Inventors: 高橋　正樹; 苗村　昌秀; 藤井　真人
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2011-11-15
Filing date: 2011-11-15
Publication date: 2015-08-26
Anticipated expiration: 2031-11-15
Also published as: JP2013105384A

Description

The present invention relates to an attention level estimation device that estimates how much attention a person viewing video content pays to that content, and to a program therefor.

In recent years, research on measuring a person's degree of concentration or degree of attention, that is, whether the person's brain activity is in a state of concentrating on a certain object or of paying attention to it, has been actively pursued. Such techniques are applicable to various fields, for example driving a car or surveying how well students understand a lesson, and many studies have been carried out. Note that "degree of concentration" is used almost synonymously with "degree of attention", although it carries more of a nuance of the person's internal state.

Various techniques have been disclosed that measure the degree of concentration using biological information such as brain waves, pulse, blink interval time, and body movement (see Patent Documents 1 and 2).
For example, brain waves are an index pointing toward high concentration when a brain wave response to a specific object clearly appears. The pulse is an index pointing toward high concentration when the heartbeat interval time fluctuates in response to a specific object. The blink interval time is an index pointing toward high concentration when the interval becomes longer. Body movement is an index pointing toward low concentration when the movement is large.
Measuring the degree of concentration with such biological information usually requires acquiring the information from a contact-type device worn by the person.

As another technique for measuring the degree of concentration, a technique using eye movement has also been disclosed (see Patent Document 3).
This technique tracks the movement of the line of sight and compares it with previously recorded data on how the line of sight moves when it is concentrated on a certain object, thereby measuring the degree to which the person is concentrating on that object. In this technique, the line of sight is tracked by using a camera to capture light, such as infrared light emitted by a non-contact device, that is reflected by the person's pupil.

Patent Document 1: Japanese Patent Laid-Open No. 9-262216
Patent Document 2: JP 2007-283041 A
Patent Document 3: JP 2008-12223 A

As described above, conventional techniques that measure the degree of concentration (attention level) from a person's biological information usually require a contact-type device. A contact-type device, however, restricts the wearer's field of view and freedom of movement and places a heavy burden on the person.
Conventional techniques that measure the degree of concentration (attention level) from eye movement also place a load on the person, even when a non-contact device that emits infrared light or the like is used.
For example, when measuring the attention level for video content that a viewer watches on a television or the like in an ordinary home, having the viewer wear a contact-type device is not realistic. Nor, for health and other reasons, is it acceptable to keep irradiating the viewer with infrared light.

The present invention has been made in view of the above problems, and its object is to provide an attention level estimation device, and a program therefor, capable of measuring the attention level of a person viewing video content with respect to that content without placing a load on the person.

The present invention was devised to solve the above problem. First, the attention level estimation device according to claim 1 estimates an attention level, which indicates the degree of attention of a person, in a predetermined video section of video content, from skeleton position information obtained by motion capture, which detects the person's skeleton positions from images of the person viewing the video content, and from camera video of the person captured by a camera. The device comprises body movement amount measurement means, gaze fluctuation amount measurement means, statistical feature amount generation means, learning data storage means, and attention level identification means.

In this configuration, the body movement amount measurement means receives the person's skeleton position information measured by motion capture as a time series and measures, as one of the body feature amounts, the body movement amount, which is the amount of change per unit time at a predetermined skeleton position in that information, for example the position of the person's head. Because body movement tends to decrease when a person is paying attention to video content, the body movement amount serves as an index for estimating the attention level.
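As a rough illustration of the body movement amount, the following Python sketch computes the displacement of one joint per unit time from a time series of positions. The function name `body_motion_amounts`, the 3-D tuple format, and the use of Euclidean distance are assumptions made for this example; the text itself only specifies "the amount of change per unit time" at a predetermined skeleton position.

```python
import math

def body_motion_amounts(joint_positions, dt=1.0):
    """Return the displacement per unit time between consecutive samples.

    joint_positions: list of (x, y, z) tuples for one joint (e.g. the head),
                     in time order. This input format is an assumption.
    dt:              sampling interval in the chosen time unit.
    """
    amounts = []
    for prev, curr in zip(joint_positions, joint_positions[1:]):
        # Euclidean distance moved between two samples, divided by the interval.
        amounts.append(math.dist(prev, curr) / dt)
    return amounts

# Hypothetical head track: still for one step, then a 5-unit move.
track = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
print(body_motion_amounts(track))
```

A viewer who sits still produces values near zero, which is the tendency the text associates with a high attention level.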

The gaze fluctuation amount measurement means detects the person's eye region in each camera image input in time series as camera video, based on a predetermined image feature such as a Haar-like feature, and measures the gaze fluctuation amount per unit time as one of the body feature amounts, based on the luminance of the left and right regions into which the eye region is divided. For example, the luminance values of the pixels in the left and right regions of the eye region change with the position of the cornea (the dark part of the eye). The gaze fluctuation amount measurement means therefore detects changes in the position of the cornea from the luminance ratio of the left and right regions and takes the amount of change as the gaze fluctuation amount.
Because eye movement tends to become smaller when a person is paying attention to video content, the gaze fluctuation amount serves as an index for estimating the attention level.
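The left/right luminance idea can be sketched as follows. This is only an illustration under stated assumptions: the eye region is taken to be already detected and given as a 2-D list of grayscale values, the "luminance ratio" is implemented as the left half's share of total luminance, and the fluctuation is the absolute change of that share between consecutive frames. The function names are hypothetical and the text does not fix these details.

```python
def left_luminance_share(eye_region):
    """Fraction of total luminance that falls in the left half of the eye region.

    eye_region: non-empty 2-D list (rows of grayscale pixel values).
    With an odd width the right half is one pixel wider.
    """
    half = len(eye_region[0]) // 2
    left = sum(sum(row[:half]) for row in eye_region)
    right = sum(sum(row[half:]) for row in eye_region)
    total = left + right
    # An all-black region carries no information; treat it as balanced.
    return left / total if total else 0.5

def gaze_fluctuation(eye_regions):
    """Absolute frame-to-frame change of the left luminance share."""
    shares = [left_luminance_share(region) for region in eye_regions]
    return [abs(b - a) for a, b in zip(shares, shares[1:])]

# Hypothetical 2x2 eye crops: the dark cornea sits on the left, then on the right.
looking_left = [[10, 200], [10, 200]]
looking_right = [[200, 10], [200, 10]]
print(gaze_fluctuation([looking_left, looking_left, looking_right]))
```

A dark cornea lowers the luminance of the half it occupies, so a shift of the cornea shows up as a change in the share even without locating the pupil explicitly.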

The statistical feature amount generation means then computes statistics of each body feature amount over a predetermined video section of the video content and generates them as the statistical feature amounts for that video section. The statistical feature amounts are statistics such as the mean, the standard deviation, or frequency counts of the body feature amounts.

The attention level identification means then identifies, as the attention level for the video section, the attention level corresponding to the statistical feature amounts generated by the statistical feature amount generation means, based on learning data stored in the learning data storage means, in which the correspondence between statistical feature amounts and attention levels has been obtained in advance by learning. The learning data can be realized with, for example, a support vector machine (SVM), and is a discriminant function that outputs the corresponding attention level from a plurality of statistical feature amounts. The learning data is generated beforehand, in the learning phase of the support vector machine, from the statistical feature amounts obtained while a person views arbitrary video content and from that person's subjective evaluation.

The attention level estimation device according to claim 2 is the attention level estimation device according to claim 1, further comprising blink interval measurement means.

In this configuration, the blink interval measurement means detects the person's blinks in the camera images input in time series, based on a predetermined image feature such as a Haar-like feature, and measures the blink interval time, which is the interval between blinks, as one of the body feature amounts. Because blinking tends to decrease when a person is paying attention to video content, the blink interval time serves as an index for estimating the attention level.
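The blink interval time can be illustrated with a small Python sketch. It assumes that some upstream detector (not shown, and not specified here beyond "a predetermined image feature") has already labeled each frame as eyes-closed or eyes-open; the function name, the boolean-per-frame input, and measuring intervals between blink onsets are all assumptions for this example.

```python
def blink_intervals(closed_flags, fps):
    """Return the time in seconds between the starts of consecutive blinks.

    closed_flags: per-frame booleans, True when the eyes were judged closed.
    fps:          frame rate of the camera video.
    """
    # A blink starts on a closed frame that follows an open frame
    # (or on the very first frame if it is already closed).
    onsets = [
        i for i, closed in enumerate(closed_flags)
        if closed and (i == 0 or not closed_flags[i - 1])
    ]
    return [(b - a) / fps for a, b in zip(onsets, onsets[1:])]

# Hypothetical 10-frame clip at 10 fps with blinks starting at frames 1, 6 and 9.
flags = [False, True, True, False, False, False, True, False, False, True]
print(blink_intervals(flags, fps=10))
```

Longer intervals mean fewer blinks, which the text treats as a sign of higher attention.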

The attention level estimation device according to claim 3 is the attention level estimation device according to claim 1 or claim 2, further comprising tilt correction means.

In this configuration, the tilt correction means rotates the camera image, based on the person's head position and neck position given by the skeleton position information, so that the neck position lies directly below the head position. The person's face is thereby kept upright in the camera image.
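The geometry of this correction can be sketched in a few lines of Python. The example assumes 2-D image coordinates with y increasing downward and rotates a single point about the head position; in practice the whole camera image would be rotated by the same angle. The function names are hypothetical.

```python
import math

def tilt_angle(head, neck):
    """Angle (radians) by which to rotate so the neck ends up directly below the head.

    head, neck: (x, y) image coordinates, y increasing downward (assumption).
    """
    dx, dy = neck[0] - head[0], neck[1] - head[1]
    # Angle of the head-to-neck vector measured from the downward (+y) axis.
    return math.atan2(dx, dy)

def rotate_about(point, center, angle):
    """Rotate a point about a center using x' = x cos a - y sin a, y' = x sin a + y cos a."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + x * c - y * s, center[1] + x * s + y * c)

head, neck = (100.0, 50.0), (130.0, 90.0)       # hypothetical tilted pose
corrected_neck = rotate_about(neck, head, tilt_angle(head, neck))
print(corrected_neck)  # x is (numerically) equal to the head's x
```

After the rotation the head-to-neck vector points straight down, so detectors trained on upright faces see the eyes in their expected orientation.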

Furthermore, the attention level estimation device according to claim 4 is the attention level estimation device according to any one of claims 1 to 3, further comprising second learning data storage means, caption information amount measurement means, video motion amount detection means, and use determination means.

In this configuration, the second learning data storage means stores in advance, as second learning data, the correspondence between the attention level and the statistical feature amounts of the body feature amounts excluding the gaze fluctuation amount. The second learning data can be realized with, for example, a support vector machine, and is a discriminant function that outputs the corresponding attention level from the statistical feature amounts of the body movement amount and the gaze fluctuation amount. The second learning data is generated beforehand, in the learning phase of the support vector machine, from the statistical feature amounts of the body movement amount and the gaze fluctuation amount obtained while a person views arbitrary video content and from that person's subjective evaluation.

The caption information amount measurement means measures the amount of caption information contained in the video content. Because eye movement tends to increase when a person is paying attention to captions in the video content, the caption information amount is an index whose correlation is the reverse of that of the gaze fluctuation amount.
The video motion amount detection means measures the video motion amount from inter-frame differences in the video content. Because eye movement tends to increase when a person is paying attention to the video content, the video motion amount is likewise an index whose correlation is the reverse of that of the gaze fluctuation amount.
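One simple way to obtain a motion amount from inter-frame differences is the mean absolute pixel difference, sketched below in Python. Treating frames as 2-D lists of grayscale values and averaging over all pixels are assumptions for this example; the text only says that the motion amount is measured from the difference between frames.

```python
def video_motion_amount(prev_frame, curr_frame):
    """Mean absolute difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev_frame, curr_frame):
        for p, c in zip(row_prev, row_curr):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0

def motion_amounts(frames):
    """Motion amount for every consecutive pair of frames in a clip."""
    return [video_motion_amount(a, b) for a, b in zip(frames, frames[1:])]

# Hypothetical 2x2 frames: a static pair followed by a change in two pixels.
f0 = [[0, 0], [0, 0]]
f1 = [[0, 0], [0, 0]]
f2 = [[10, 0], [0, 30]]
print(motion_amounts([f0, f1, f2]))  # → [0.0, 10.0]
```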

The use determination means determines that the gaze fluctuation amount is not to be used as a body feature amount when the caption information amount is larger than a predetermined information amount or the video motion amount is larger than a predetermined motion amount.
When the use determination means so determines, the attention level identification means identifies, as the attention level for the video section, the attention level corresponding to the statistical feature amounts excluding the gaze fluctuation amount, based on the second learning data instead of the learning data.
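The switching logic of the use determination can be sketched as follows. The threshold values, the dictionary keys, and the string labels standing in for the two sets of learning data are all hypothetical; the text only states that the gaze fluctuation amount is dropped, and the second learning data used, when either amount exceeds its predetermined value.

```python
def gaze_is_usable(caption_amount, motion_amount, caption_limit, motion_limit):
    """False when captions or video motion exceed their predetermined limits."""
    return not (caption_amount > caption_limit or motion_amount > motion_limit)

def select_inputs(stats, caption_amount, motion_amount,
                  caption_limit=20.0, motion_limit=8.0):
    """Pick which learning data to use and which statistical features to feed it.

    stats: dict mapping a body feature name to its statistical features.
    The default limits are placeholders, not values from the source.
    """
    if gaze_is_usable(caption_amount, motion_amount, caption_limit, motion_limit):
        return "learning_data", dict(stats)
    # Gaze fluctuation is unreliable here: drop it and switch models.
    reduced = {name: feats for name, feats in stats.items()
               if name != "gaze_fluctuation"}
    return "second_learning_data", reduced

stats = {"body_motion": [0.1, 0.02], "gaze_fluctuation": [0.3, 0.05]}
print(select_inputs(stats, caption_amount=5.0, motion_amount=1.0))
print(select_inputs(stats, caption_amount=50.0, motion_amount=1.0))
```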

The attention level estimation device according to claim 5 is the attention level estimation device according to any one of claims 1 to 3, further comprising second learning data storage means, caption information amount measurement means, and use determination means.

In this configuration, the second learning data storage means stores in advance, as second learning data, the correspondence between the attention level and the statistical feature amounts of the body feature amounts excluding the gaze fluctuation amount.
The caption information amount measurement means measures the amount of caption information contained in the video content. The use determination means then determines that the gaze fluctuation amount is not to be used as a body feature amount when the caption information amount is larger than a predetermined information amount.

When the use determination means determines that the gaze fluctuation amount is not to be used as a body feature amount, the attention level identification means identifies, as the attention level for the video section, the attention level corresponding to the statistical feature amounts excluding the gaze fluctuation amount, based on the second learning data instead of the learning data.

The attention level estimation device according to claim 6 is the attention level estimation device according to any one of claims 1 to 3, further comprising second learning data storage means, video motion amount detection means, and use determination means.

In this configuration, the second learning data storage means stores in advance, as second learning data, the correspondence between the attention level and the statistical feature amounts of the body feature amounts excluding the gaze fluctuation amount.
The video motion amount detection means measures the video motion amount from inter-frame differences in the video content. The use determination means then determines that the gaze fluctuation amount is not to be used as a body feature amount when the video motion amount is larger than a predetermined motion amount.

The attention level estimation device according to claim 7 is the attention level estimation device according to any one of claims 1 to 6, wherein the statistical feature amount generation means generates, as the statistical feature amounts, global features, which are the mean and standard deviation of the body feature amounts over the entire video section, and local histogram features, which are histograms of the body statistics with a predetermined bin width.

In this configuration, by representing the statistical feature amounts as global features and local histogram features, the statistical feature amount generation means can generate feature amounts of fixed dimension regardless of the length of the video section.
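A minimal Python sketch of such a fixed-dimension descriptor is shown below. It assumes non-negative feature values, a population standard deviation, a fixed number of bins with overflow clamped into the last bin, and histogram counts normalized by the number of samples; none of these choices is specified in the text, which only names the mean, the standard deviation, and a predetermined bin width.

```python
import statistics

def global_features(values):
    """Mean and (population) standard deviation over the whole video section."""
    return [statistics.fmean(values), statistics.pstdev(values)]

def histogram_features(values, bin_width, num_bins):
    """Normalized histogram with a fixed bin width and a fixed number of bins."""
    counts = [0] * num_bins
    for v in values:
        # Clamp so that out-of-range values fall into the first or last bin.
        idx = max(0, min(int(v // bin_width), num_bins - 1))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def section_descriptor(values, bin_width=0.5, num_bins=3):
    """Concatenate global and histogram features: always 2 + num_bins numbers."""
    return global_features(values) + histogram_features(values, bin_width, num_bins)

short_section = [0.1, 0.2, 0.9, 2.5]
long_section = [0.1] * 50 + [0.7] * 50
print(len(section_descriptor(short_section)), len(section_descriptor(long_section)))  # → 5 5
```

Because both sections yield five numbers, the same discriminant function can be applied to video sections of any duration.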

The attention level estimation device according to claim 8 is the attention level estimation device according to claim 7, wherein the statistical feature amount generation means further generates the local histogram features for each of the sections obtained by dividing the video section into predetermined time intervals.

In this configuration, by subdividing the video section and generating local histogram features for each part, the statistical feature amount generation means allows features that occur locally to be reflected in the estimation of the attention level.

Furthermore, the attention level estimation device according to claim 9 is the attention level estimation device according to claim 7 or claim 8, wherein the statistical feature amount generation means adds the local histogram features of the video sections immediately before and after the video section whose attention level is to be estimated to the statistical feature amounts of that video section, and uses the result as the statistical feature amounts of the video section whose attention level is to be estimated.

In this configuration, by adding the local histogram features of the preceding and following video sections to the statistical feature amounts of the video section whose attention level is to be estimated, the statistical feature amount generation means incorporates features that arise across video sections into those statistical feature amounts.
Features that arise across video sections are, for example, those that appear when the video content moves from some video section into a video section that the person pays attention to, or when a video section that the person is paying attention to ends and the content moves on to another video section.
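The idea of attaching neighboring sections' histograms can be sketched as a simple concatenation. The function name, the ordering (own features, then previous, then next), and the use of all-zero histograms at the first and last section are assumptions made for this illustration.

```python
def with_context(section_features, local_histograms, index):
    """Append the previous and next sections' local histograms to a descriptor.

    section_features: statistical features of the section being estimated.
    local_histograms: one local histogram (list of floats) per video section.
    index:            position of the section being estimated.
    """
    zero = [0.0] * len(local_histograms[index])
    prev_hist = local_histograms[index - 1] if index > 0 else zero
    next_hist = (local_histograms[index + 1]
                 if index + 1 < len(local_histograms) else zero)
    # Fixed layout: own features, then previous, then next.
    return list(section_features) + list(prev_hist) + list(next_hist)

# Hypothetical two-bin histograms for three consecutive sections.
hists = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
print(with_context([1.0, 2.0], hists, 1))  # → [1.0, 2.0, 0.5, 0.5, 0.0, 1.0]
```

Padding at the boundaries keeps the descriptor length constant, so the first and last sections can be handled by the same discriminant function as the others.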

The attention level estimation program according to claim 10 causes a computer to function as body movement amount measurement means, gaze fluctuation amount measurement means, statistical feature amount generation means, and attention level identification means, in order to estimate an attention level, which indicates the degree of attention of a person, in a predetermined video section of video content, from skeleton position information obtained by motion capture, which detects the person's skeleton positions from images of the person viewing the video content, and from camera video of the person captured by a camera.

In this configuration, the attention level estimation program uses the body movement amount measurement means to receive the person's skeleton position information measured by motion capture as a time series and to measure, as one of the body feature amounts, the body movement amount, which is the amount of change per unit time at a predetermined skeleton position in that information.
The program also uses the gaze fluctuation amount measurement means to detect the person's eye region in each camera image input in time series as camera video, based on a predetermined image feature, and to measure the gaze fluctuation amount per unit time as one of the body feature amounts, based on the luminance of the left and right regions into which the eye region is divided.

The program then uses the statistical feature amount generation means to compute statistics of each body feature amount over a predetermined video section of the video content and to generate them as the statistical feature amounts for that video section.

The program then uses the attention level identification means to identify, as the attention level for the video section, the attention level corresponding to the statistical feature amounts generated by the statistical feature amount generation means, based on learning data stored in the learning data storage means, in which the correspondence between statistical feature amounts and attention levels has been obtained in advance by learning.

The present invention provides the following advantageous effects.
According to the inventions of claims 1 and 10, the body movement amount and the gaze fluctuation amount, which are the body feature amounts used to estimate a person's attention level, can be extracted by image processing, so the attention level can be estimated without placing a load on the person, such as wearing a contact-type device or being irradiated with infrared light.

According to the invention of claim 2, adding the blink interval time as a further body feature amount for estimating a person's attention level improves the accuracy of the estimate. In addition, because the blink interval time can be obtained from images captured by a camera, the attention level can be estimated without placing a load on the person.

According to the invention of claim 3, the head of the person viewing the video content can be corrected to be upright in the camera image. This improves the accuracy of detecting feature amounts of the person's head in the camera image, such as detection of the eye region, and allows the attention level to be estimated more accurately.

According to the invention of claim 4, when the amount of caption information increases or the motion in the video becomes large, the gaze fluctuation amount works in the opposite direction as an index for estimating the attention level. Excluding the gaze fluctuation amount from the feature amounts used for estimation in such cases therefore allows the attention level to be estimated accurately.

According to the invention of claim 5, when the amount of caption information increases, the gaze fluctuation amount works in the opposite direction as an index for estimating the attention level. Excluding the gaze fluctuation amount from the feature amounts used for estimation in such cases therefore allows the attention level to be estimated accurately.
According to the invention of claim 6, when the motion in the video becomes large, the gaze fluctuation amount works in the opposite direction as an index for estimating the attention level. Excluding the gaze fluctuation amount from the feature amounts used for estimation in such cases therefore allows the attention level to be estimated accurately.

According to the invention of claim 7, using as the statistical feature amounts the global features, which are the mean and standard deviation of the body feature amounts, and the local histogram features, which are histograms of the body statistics with a predetermined bin width, allows the feature amounts to be expressed in a fixed dimension. As a result, the attention level can be estimated with the same algorithm even when the duration of the video section varies.

According to the invention of claim 8, adding feature amounts computed for time-wise subdivisions of the video section to the statistical feature amounts yields statistical feature amounts that retain features local in time within the video section. The degree of attention in a particular part of the video section can thereby be taken into account as a feature, and the attention level can be estimated accurately.

According to the invention of claim 9, adding the local histogram features of the video sections before and after a video section to form its statistical feature amounts yields feature amounts that span video sections. Features such as a change in the person's state of attention at a transition between video sections can thereby be reflected in the estimation of the attention level.

FIG. 1 is a configuration diagram showing the configuration of an attention level measurement system that includes the attention level estimation device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the attention level estimation device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram of the skeleton position information input to the attention level estimation device according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram of the tilt correction processing performed by the tilt correction means of the attention level estimation device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram of the feature amounts used by the blink interval measurement means of the attention level estimation device according to the embodiment of the present invention to determine the blink state.
FIG. 6 is an explanatory diagram of the technique by which the gaze fluctuation amount measurement means of the attention level estimation device according to the embodiment of the present invention measures the gaze fluctuation amount.
FIG. 7 is an explanatory diagram of the histogram of the local histogram features generated by the statistical feature amount generation means of the attention level estimation device according to the embodiment of the present invention.
FIG. 8 is a structure diagram of the contents of the feature amounts (feature descriptors) generated by the statistical feature amount generation means of the attention level estimation device according to the embodiment of the present invention.
FIG. 9 is a flowchart of the operation of the attention level estimation device according to the embodiment of the present invention.
FIG. 10 is an explanatory diagram of the feature amounts (feature descriptor group) within a video section (topic) in the statistical feature amount generation means of the attention level estimation device according to the embodiment of the present invention.
FIG. 11 is a block diagram showing the overall configuration of an attention level estimation device according to another embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of an attention level learning device that generates the learning data to be stored in the learning data storage means of the attention level estimation device according to the embodiment of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.
[Configuration of the attention level estimation system]
First, the configuration of the attention level measurement system S, which includes the attention level estimation device according to an embodiment of the present invention, will be described with reference to FIG. 1.
The attention level measurement system S measures the degree of attention that a person viewing video content pays to that content.
The attention level measurement system S includes a monitor M, a motion capture device Mc, a camera C, and an attention level estimation device 1.

The monitor M displays programs (video content) broadcast as television broadcasts and video content recorded on a recording medium (for example, a DVD). The monitor M is a general display device such as a television receiver, and presents the video content to the person H.

The motion capture device Mc is a general posture detection device that measures the skeleton positions of the person H viewing the video content. It is placed near the monitor M, facing the direction in which it captures the person H.
The motion capture device Mc measures the distance to the person H from a range image captured by a depth camera (not shown), detects the skeleton positions of the person H in three-dimensional space (for example, the head position and the neck position), and generates skeleton position information by projecting the three-dimensional coordinates of those positions onto two-dimensional coordinates. The skeleton position information measured by the motion capture device Mc is output to the attention level estimation device 1.

The camera C is a general imaging device that captures the person H viewing the video content. It is placed near the monitor M, facing the direction in which it captures the person H. The camera video captured by the camera C is output to the attention level estimation device 1.
The camera C and the motion capture device Mc are set to capture the person H at approximately the same angle of view. Of course, if video of the person H can be obtained from the motion capture device Mc, the camera video may instead be output from the motion capture device Mc.

The attention level estimation device 1 estimates the attention level, which indicates the degree of attention of the person H, for each predetermined video section (topic) of the video content. It does so from the skeleton position information of the person H viewing the video content, measured by the motion capture device Mc, and the camera video of the person H, captured by the camera C.

In general, when the person H views video content attentively (with concentration), body movement tends to decrease, the blink interval (blink interval time) tends to lengthen, and eye movement (line-of-sight variation) tends to decrease.
The attention level estimation device 1 therefore estimates the attention level of the person H toward the video content by detecting these changes from the skeleton position information measured by the motion capture device Mc and the camera video captured by the camera C. That is, the attention level estimation device 1 identifies, for example, the head position of the person H from the skeleton position information and detects changes in body movement from its motion, and detects changes in the blink interval and eye movement of the person H from the camera video.
Configuring the attention level measurement system S in this way makes it possible to estimate the attention level without burdening the person H, for example by requiring a contact-type device to be worn or by irradiating the person with infrared light.
The configuration and operation of the attention level estimation device 1 are described below.

[Configuration of the attention level estimation device]
First, the configuration of the attention level estimation device 1 according to the embodiment of the present invention will be described with reference to FIG. 2 (and FIG. 1 as appropriate). Here, the attention level estimation device 1 receives video section information as information indicating the video sections obtained by dividing the video content in the time direction, and estimates the attention level for each video section. The video section information is input via an input unit (not shown).

The video section information is time information (frame numbers or the like) that specifies the sections for which the attention level is to be estimated, and indicates the switch between topics in the video (for example, “politics”, “economy”, “sports”, and “entertainment” in a news program). The video section information may also be information obtained by detecting topics from the video content, output by a topic detection device (not shown).
Here, in addition to the time information, an identifier (ID number) identifying the topic is attached. Of course, a video section may cover the entire video content or only part of it.

As shown in FIG. 2, the attention level estimation device 1 includes a body feature amount extraction unit 10, a line-of-sight variation use determination unit 20, a statistical feature amount generation unit 30, a learning data storage unit 40, and an attention level identification unit 50.

The body feature amount extraction unit 10 extracts body feature amounts of the person viewing the video content from the skeleton position information input from the motion capture device Mc and the camera video input from the camera C.
The skeleton position information input from the motion capture device Mc consists of, for example, the coordinates on the two-dimensional image of skeleton positions of the person H, such as the head position P^H and the neck position P^N shown in FIG. 3(b), obtained when the person H is captured by the motion capture device Mc as shown in FIG. 3(a).
Here, the body feature amount extraction unit 10 includes a body movement amount measurement unit 11, a tilt correction unit 12, a blink interval measurement unit 13, and a line-of-sight variation measurement unit 14.

The body movement amount measurement unit 11 receives the skeleton position information from the motion capture device Mc in time series, and measures the body movement amount, which is the amount of change per unit time (for example, per frame of the motion capture device Mc) at a predetermined skeleton position in the skeleton position information.
In general, when the person H viewing the video content enters an attentive state, body movement decreases. The body movement amount measurement unit 11 therefore extracts (measures) the amount by which the body moves (body movement amount) as a feature amount that serves as an index for estimating the attention level.

Here, the body movement amount measurement unit 11 uses the head position of the person H as the skeleton position information. Of course, any other skeleton position may be used as long as its movement can be measured.
For example, let tp be the ID number of the video section (topic), P^H_x(t) the head position in the horizontal direction (x coordinate) on the two-dimensional coordinates at frame time t, and P^H_y(t) the head position in the vertical direction (y coordinate). The body movement amount measurement unit 11 then measures the body movement amount per unit time, K_tp(t), by equation (1).
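The image of equation (1) does not survive in this text. As a minimal sketch, assuming the body movement amount is the Euclidean distance between the head positions of consecutive frames (an assumption consistent with the definitions above, not a confirmed form of equation (1)), the measurement could look as follows. The function name and the input format are hypothetical.

```python
import math

def body_movement(head_positions):
    """Per-frame head displacement for one video section (topic).

    head_positions: list of (x, y) head coordinates, one per frame.
    Returns a list whose i-th entry is the distance moved between
    frame i and frame i + 1. The Euclidean distance is an assumption;
    equation (1) is not available in this text.
    """
    return [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(head_positions, head_positions[1:])
    ]
```

A section with a single frame yields an empty list, since no frame-to-frame change can be measured.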

The body movement amount measured by the body movement amount measurement unit 11 is output to the statistical feature amount generation unit 30 for each video section (topic).

The tilt correction unit 12 corrects the camera video input from the camera C, frame by frame, so that the face of the person H is vertical in the frame image (camera image). Here, the tilt correction unit 12 rotates the frame image about its image center so that, in the skeleton position information input from the motion capture device Mc, the neck position lies directly below the head position.

Here, let P^H_x(t) and P^H_y(t) be the x and y coordinates of the head position at frame time t, and P^N_x(t) and P^N_y(t) the x and y coordinates of the neck position at frame time t. The tilt correction unit 12 calculates the neck tilt θ_t by equation (2).

Then, with the image center of the frame image denoted (cx_t, cy_t), the tilt correction unit 12 uses the neck tilt θ_t to transform an arbitrary point (x_t, y_t) of the frame image into (x_t′, y_t′) by equation (3), thereby generating a tilt-corrected frame image.
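The images of equations (2) and (3) do not survive in this text. The sketch below shows one way to realize the described correction, assuming image coordinates with x growing rightward and y growing downward, a tilt angle measured from the image vertical, and a plain rotation about the image center. The sign conventions and function names are assumptions and may differ from the original equations.

```python
import math

def neck_tilt(head, neck):
    """Angle (radians) of the neck-to-head vector from the image vertical.

    Assumes image coordinates: x grows rightward, y grows downward.
    Zero means the head is directly above the neck.
    """
    dx = head[0] - neck[0]
    dy = head[1] - neck[1]
    return math.atan2(dx, -dy)

def rotate_point(point, center, theta):
    """Rotate `point` about `center` by -theta, undoing a tilt of theta."""
    x = point[0] - center[0]
    y = point[1] - center[1]
    c, s = math.cos(theta), math.sin(theta)
    return (center[0] + x * c + y * s, center[1] - x * s + y * c)
```

Applying `rotate_point` to the head and neck positions with the angle from `neck_tilt` places the neck directly below the head. Correcting a whole frame image would apply the same mapping (or its inverse, with interpolation) to every pixel.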

For example, when the person H is tilted diagonally to the left in the frame image, as shown in (a-1) of FIG. 4(a), the frame image is rotated clockwise about its image center by the amount of the tilt so that the neck position P^N lies directly below the head position P^H (a-2). Likewise, when the person H is tilted diagonally to the right in the frame image, as shown in (b-1) of FIG. 4(b), the frame image is rotated counterclockwise about its image center by the amount of the tilt so that the neck position P^N lies directly below the head position P^H (b-2).

As a result, the face of the person H is upright in the corrected images (a-2) and (b-2), which improves the accuracy of blink detection and eye region determination in the blink interval measurement unit 13 and the line-of-sight variation measurement unit 14 described later.
The frame image (camera image) corrected by the tilt correction unit 12 is output to the blink interval measurement unit 13 and the line-of-sight variation measurement unit 14.

The blink interval measurement unit 13 detects blinks of the person H, based on predetermined image features, in the camera images that are input in time series from the camera C and corrected by the tilt correction unit 12, and measures the interval between blinks as the blink interval time.

Here, the blink interval measurement unit 13 first detects a face region from the camera image. A general method can be used for this face region detection. For example, the blink interval measurement unit 13 can use the face detection method proposed by Viola and Jones, which is also used in the OpenCV library and elsewhere. This method identifies faces with a plurality of pre-trained cascaded classifiers, using feature amounts (Haar-like features) expressed as differences between luminance values within rectangles represented in white and black.
The blink interval measurement unit 13 then detects the blink state and measures the blink interval time based on a plurality of feature point trajectories, obtained by tracking in time series the feature points that serve as predetermined image features in the face region, and on learning data learned in advance as feature point trajectories of blink motion.

Specifically, for each face region, which changes over time, the blink interval measurement unit 13 detects points that characterize the image (feature points), for example from changes in pixel value or luminance value relative to adjacent pixels, and tracks the feature points in the time direction by matching feature points with similar feature amounts from frame to frame. The KLT method, the mean shift method, or the like can be used for this feature point detection and tracking.

The blink interval measurement unit 13 then identifies blink motion using the bag-of-words method, treating each feature point trajectory obtained by this detection and tracking as one word.
The bag-of-words method represents multidimensional features by k predetermined clusters, based on a codebook, which is a dictionary that classifies words (here, feature point trajectories) by their features, and performs classification using the frequency histogram of the clusters.
Here, the blink interval measurement unit 13 generates one histogram (trajectory histogram) from each feature point trajectory, and treats the plurality of trajectory histograms present at a given point in time as the plurality of words in the bag-of-words method.
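The bag-of-words step can be sketched as follows: each word (a trajectory histogram) is assigned to its nearest codebook entry, and the assignments are counted into a frequency histogram. Nearest-entry assignment by squared Euclidean distance is an assumption, since the text only states that a codebook and a cluster frequency histogram are used; the function names are hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(vector, codebook[k]))

def bag_of_words(words, codebook):
    """Frequency histogram of codebook assignments for a set of words.

    words: list of feature vectors (here, trajectory histograms).
    codebook: list of k representative vectors of the same dimension.
    """
    counts = [0] * len(codebook)
    for word in words:
        counts[nearest_codeword(word, codebook)] += 1
    return counts
```

The resulting k-dimensional count vector is the kind of fixed-length input a classifier can consume regardless of how many trajectories exist at a given time.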

The trajectory histogram is a fixed-dimension histogram determined by the direction and length of the individual trajectory segments (vectors) per unit time. For example, here the trajectory direction is divided into eight directions in 45-degree steps, and the trajectory length is divided into four classes, including “0”. The length classes are set from the mean and standard deviation of the trajectory lengths measured in a prior learning phase.

For example, for video content measured in advance, the mean μ and standard deviation σ of the lengths of the individual vectors of all feature point trajectories (m_i: i = 0 to N, where N is the total number of vectors in the video content) are obtained beforehand by equation (4), and the trajectory length is divided into the four classes “0”, (0, μ−σ/2], (μ−σ/2, μ+σ/2], and (μ+σ/2, ∞). Here, (a, b) denotes the class of values greater than a and less than b, and (a, b] denotes the class of values greater than a and less than or equal to b.

That is, as shown in FIG. 5(a), for one feature point trajectory, the blink interval measurement unit 13 takes the length L and direction θ of each individual trajectory segment (vector) and, as shown in FIG. 5(b), generates a histogram (trajectory histogram) over classes with eight direction divisions and four length divisions, with a total of 25 bins (8 [directions] × 3 [lengths] + 1 [length “0”]).
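As a sketch of the 25-bin trajectory histogram described above: the bin order, the placement of the 45-degree sectors (starting at 0 degrees), and the use of the population standard deviation are all assumptions, because equation (4) and the exact bin layout of FIG. 5(b) are not available in this text. The function names are hypothetical.

```python
import math

def length_stats(lengths):
    """Mean and population standard deviation of vector lengths."""
    n = len(lengths)
    mu = sum(lengths) / n
    sigma = math.sqrt(sum((m - mu) ** 2 for m in lengths) / n)
    return mu, sigma

def trajectory_histogram(vectors, mu, sigma):
    """25-bin histogram of the (dx, dy) steps of one feature point trajectory.

    Bin 0 counts zero-length steps. Bins 1..24 are indexed as
    1 + 8 * length_class + direction, with three length classes split at
    mu - sigma/2 and mu + sigma/2 and eight 45-degree direction sectors.
    """
    hist = [0] * 25
    for dx, dy in vectors:
        length = math.hypot(dx, dy)
        if length == 0:
            hist[0] += 1
            continue
        if length <= mu - sigma / 2:
            length_class = 0
        elif length <= mu + sigma / 2:
            length_class = 1
        else:
            length_class = 2
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        direction = int(angle // 45) % 8
        hist[1 + 8 * length_class + direction] += 1
    return hist
```

In this sketch, `mu` and `sigma` would come from `length_stats` applied once to the vector lengths collected in the learning phase, and then be reused for every trajectory.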

The blink interval measurement unit 13 then determines, from the generated trajectory histograms and using a binary SVM (support vector machine) classifier obtained in advance by learning, whether the current set of feature point trajectories represents a blink state. For this binary SVM classifier, in advance (in the learning phase) the trajectory histograms of the feature points are quantized by the k-means method into a predetermined number k (for example, 100) of representative histograms to generate a codebook of k entries, and whether or not each sample represents a blink state is learned beforehand. The binary SVM classifier is a blink detector (blink detection function) that, given trajectory histograms as input, returns whether or not they indicate a blink, and it is stored in advance in a storage unit (not shown).

The blink interval measurement unit 13 then measures the time interval (blink interval time) between the times at which blinks were determined (the frames at which a blink was determined).
That is, let tp be the ID number of the topic, n the ID number identifying a blink within the topic, and t(n) the time (frame number) at which the n-th blink was detected. The blink interval measurement unit 13 measures the time interval between detected blinks (blink interval time), B_tp(n), by equation (5).
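The image of equation (5) does not survive in this text. Given the definitions of n and t(n), a natural reading is the difference between the frame numbers of consecutive blinks; the sketch below assumes that form. The function name is hypothetical.

```python
def blink_intervals(blink_frames):
    """Frame-count intervals between consecutive blinks in one topic.

    blink_frames: ascending frame numbers t(n) at which blinks were detected.
    Returns t(n) - t(n - 1) for each blink after the first (assumed form
    of equation (5)).
    """
    return [b - a for a, b in zip(blink_frames, blink_frames[1:])]
```

Dividing each interval by the camera frame rate would convert it to seconds, if a time unit other than frames is wanted.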

The blink interval time measured by the blink interval measurement unit 13 is output to the statistical feature amount generation unit 30 for each video section (topic).

The line-of-sight variation measurement unit 14 detects the eye region of the person H, based on predetermined image features, in the camera images that are input in time series from the camera C and corrected by the tilt correction unit 12, and measures the line-of-sight variation per unit time from the luminance ratio of the left and right halves of the eye region.

Here, the line-of-sight variation measurement unit 14 first detects an eye region from the camera image. As in the blink interval measurement unit 13, the general Viola-Jones method can be used for this eye region detection.
That is, the line-of-sight variation measurement unit 14 detects the eye region with a plurality of pre-trained cascaded classifiers, using Haar-like features that express the black and white areas of the eye as differences between luminance values within rectangles. As in the blink interval measurement unit 13, the line-of-sight variation measurement unit 14 may first detect a face region from the camera image and then detect the eye region within that face region.

The line-of-sight variation measurement unit 14 then divides the detected eye region into left and right halves at its horizontal center, and sums the pixel luminance values in each of the left and right luminance measurement regions.
That is, after detecting the eye region as shown in FIG. 6(a), the line-of-sight variation measurement unit 14 sums the pixel luminance values in the right region E_R and the left region E_L, obtained by dividing the eye region at its horizontal center, as shown in FIG. 6(b).

The line-of-sight variation measurement unit 14 then determines the line-of-sight direction from the ratio of the luminance values of the right region E_R and the left region E_L.
For example, let N be the number of pixels in each of the regions (E_R, E_L) in FIG. 6(b), I_R(i) the luminance value at an arbitrary pixel i in the right region E_R, and I_L(i) the luminance value at an arbitrary pixel i in the left region E_L. The line-of-sight variation measurement unit 14 calculates the line-of-sight direction d_t at frame time t by equation (6).

In equation (6), assuming that a larger luminance value means a brighter pixel, d_t increases when the person H looks to the right while facing the camera C and the proportion of the cornea (the dark part of the eye) in the right region E_R increases. Conversely, d_t decreases when the person H looks to the left while facing the camera C and the proportion of the cornea in the left region E_L increases.
When the line-of-sight variation measurement unit 14 detects two eye regions, left and right, as the eye regions of the person H, it calculates the line-of-sight direction by equation (6) for each eye region and takes their average.

The line-of-sight variation measurement unit 14 then calculates the line-of-sight variation in time series by taking the difference, in the time direction, of the line-of-sight direction calculated by equation (6).
That is, let tp be the ID number of the topic and t the time (frame number) at which the line-of-sight direction was measured. The line-of-sight variation measurement unit 14 measures the line-of-sight variation E_tp(t) by equation (7), where |a| denotes the absolute value of a.
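The images of equations (6) and (7) do not survive in this text. The sketch below assumes that d_t is the ratio of the summed luminance of the left region to that of the right region (which grows when the dark cornea occupies more of the right region, as described), and that the variation is the absolute difference of d_t between consecutive frames. Both forms, the mapping of E_L and E_R to image columns, and the function names are assumptions.

```python
def gaze_direction(eye_region):
    """Luminance ratio of the left half to the right half of an eye region.

    eye_region: 2D list of luminance values, rows of equal width >= 2.
    For odd widths the middle column is skipped so both halves have the
    same number of pixels. The right half must not sum to zero.
    """
    width = len(eye_region[0])
    half = width // 2
    left = sum(sum(row[:half]) for row in eye_region)
    right = sum(sum(row[width - half:]) for row in eye_region)
    return left / right

def mean_gaze_direction(eye_regions):
    """Average of the per-eye direction values (one or two eye regions)."""
    values = [gaze_direction(region) for region in eye_regions]
    return sum(values) / len(values)

def gaze_variation(directions):
    """Absolute frame-to-frame change of the direction value d_t."""
    return [abs(b - a) for a, b in zip(directions, directions[1:])]
```

Because only differences of d_t are used downstream, a constant bias in the ratio (for example, from uneven lighting) largely cancels out.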

The line-of-sight direction d_t from equation (6) is not accurate enough as an estimate of the line-of-sight direction itself, but calculating the difference as in equation (7) allows the amount of line-of-sight variation to be obtained with good accuracy.
The line-of-sight variation measured by the line-of-sight variation measurement unit 14 is output to the statistical feature amount generation unit 30 for each video section (topic).

The line-of-sight variation use determination unit 20 determines whether the line-of-sight variation extracted by the body feature amount extraction unit 10 is to be used by the statistical feature amount generation unit 30 described later. Here, the line-of-sight variation use determination unit 20 includes a caption information amount measurement unit 21, a video motion amount measurement unit 22, and a use determination unit 23.

Normally, when many captions appear in the video and the person H pays attention to them, the line-of-sight variation of the person H inevitably increases because the person reads the captions. Likewise, when there is a lot of motion in the video, the line-of-sight variation inevitably increases because the person H follows the motion with the eyes.
That is, when the person H pays attention to captions or to motion in the video, the effect runs opposite to the premise that line-of-sight variation decreases when the person H pays attention to the video.
Here, therefore, the caption information amount and the video motion amount are detected as indices for determining whether to use the line-of-sight variation when estimating the attention level.

The caption information amount measurement unit 21 measures the amount of caption information (caption information amount) for each designated video section (topic) of the input video content. Here, the caption information amount measurement unit 21 takes as the caption information amount the proportion of frames in the topic that contain captions.

Specifically, the caption information amount measurement unit 21 converts the input video content, frame image by frame image, into Laplacian images, which are second derivatives. This is because caption regions in video generally have higher contrast than other regions, so edge features tend to appear in them.
Here, let I(x, y) be the pixel value of the frame image and I′(x, y) the pixel value of the converted Laplacian image. The caption information amount measurement unit 21 generates the Laplacian image by the calculation of equation (8).

Each pixel of the Laplacian image I′(x, y) takes a value from “0” to “255” when, for example, the pixel depth is 8 bits. Here, the histogram (edge histogram) that accumulates the number of pixels for each pixel value (256 bins) is used as the caption feature amount of the frame image.
The caption information amount measurement unit 21 then detects the presence or absence of captions in each frame image with a classifier (for example, a binary SVM classifier) obtained in advance by learning with the edge histogram as the caption feature amount. Alternatively, as a simpler approach, the presence or absence of captions may be detected according to whether the proportion of pixels in the frame image at or above a predetermined luminance value exceeds a predetermined proportion.
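The image of equation (8) does not survive in this text. The sketch below uses the common 4-neighbour Laplacian kernel and builds the 256-bin edge histogram from it. The kernel choice, the zeroed border, and the clipping of the absolute response to 0..255 are assumptions; the function names are hypothetical, and integer pixel values are assumed.

```python
def laplacian_image(image):
    """4-neighbour Laplacian of a 2D list of integer pixel values.

    Border pixels are left at 0, and the absolute response is clipped to
    0..255 so that it can index a 256-bin histogram (both assumptions).
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            value = (image[y - 1][x] + image[y + 1][x]
                     + image[y][x - 1] + image[y][x + 1]
                     - 4 * image[y][x])
            out[y][x] = min(255, abs(value))
    return out

def edge_histogram(laplacian):
    """Number of pixels at each value 0..255 of a Laplacian image."""
    hist = [0] * 256
    for row in laplacian:
        for value in row:
            hist[value] += 1
    return hist
```

A frame with sharp caption edges puts more mass in the high-value bins, which is the property a trained classifier would exploit.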

In this way, the caption information amount measuring unit 21 detects captions in the frame images and calculates the caption information amount as the ratio of the number of frames in which a caption was detected to the time length (number of frames) of the topic.
That is, for topic tp (the ID number of the topic), letting N(tp) be the number of frames in which a caption was detected and T(tp) be the time length of the topic (the total number of frames in the topic), the caption information amount measuring unit 21 calculates the caption information amount J_tp by equation (9) below.
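Equation (9) is not reproduced in this text. From the description above (the ratio of caption frames to the total number of frames in the topic), it presumably has the following form; this is a reconstruction from the prose, not the published formula:

```latex
J_{tp} = \frac{N(tp)}{T(tp)}
```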

The caption information amount measured by the caption information amount measuring unit 21 is output to the use determining unit 23.

The video motion amount measuring unit 22 measures the amount of motion in the video (video motion amount) for each designated video section (topic) of the input video content. Here, the video motion amount measuring unit 22 detects motion in the video by taking the difference between frames within the topic, and uses the proportion of frames with large motion in the topic as the video motion amount.
For example, for each frame image of the input video content, the video motion amount measuring unit 22 takes, in units of blocks of a predetermined size, the difference from the same block of the frame image input one frame earlier. When the difference is larger than a predetermined amount, it detects that the block contains motion, and when the number (or proportion) of blocks with motion is larger than a predetermined number (or proportion), it determines that the frame image has large motion.
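A minimal sketch of the block-difference test described above, together with the per-topic ratio. The block size, the use of the mean absolute difference per block, both threshold values, and the function names are assumptions for illustration; the text only says "predetermined".

```python
from typing import List

Image = List[List[int]]  # grayscale frame, pixel values 0..255


def is_high_motion_frame(prev: Image, curr: Image, block: int = 8,
                         diff_thresh: float = 10.0,
                         block_ratio_thresh: float = 0.3) -> bool:
    """True when the share of moving blocks exceeds block_ratio_thresh."""
    h, w = len(curr), len(curr[0])
    moving = total = 0
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            sad = sum(abs(curr[y][x] - prev[y][x]) for y in ys for x in xs)
            total += 1
            # A block "moves" when its mean absolute difference is large.
            if sad / (len(ys) * len(xs)) > diff_thresh:
                moving += 1
    return moving / total > block_ratio_thresh


def video_motion_amount(frames: List[Image], **kw) -> float:
    """Ratio of high-motion frames to the total number of frames in the topic."""
    if not frames:
        return 0.0
    high = sum(is_high_motion_frame(p, c, **kw)
               for p, c in zip(frames, frames[1:]))
    return high / len(frames)
```

The first frame of a topic has no predecessor in this sketch and is therefore never counted as high-motion.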

Then, like the caption information amount measuring unit 21, the video motion amount measuring unit 22 calculates the video motion amount as the ratio of the number of frames determined to have large motion to the time length (number of frames) of the topic.
The video motion amount measured by the video motion amount measuring unit 22 is output to the use determining unit 23.

Based on the caption information amount measured by the caption information amount measuring unit 21 and the video motion amount measured by the video motion amount measuring unit 22, the use determining unit 23 determines whether or not to use the line-of-sight variation amount as a feature amount for estimating the attention level in the designated video section (topic).

Here, the use determining unit 23 determines that the line-of-sight variation amount is not to be used as a feature amount for estimating the attention level when the caption information amount is larger than a predetermined amount or the video motion amount is larger than a predetermined amount. Otherwise, the use determining unit 23 determines that the line-of-sight variation amount is to be used as a feature amount for estimating the attention level.

The predetermined amounts used to judge whether the caption information amount or the video motion amount is large may be set from statistics, after the caption information amount measuring unit 21 and the video motion amount measuring unit 22 have measured the caption information amount and the video motion amount for all topics of the video content.
For example, the use determining unit 23 calculates the mean μ_tp and standard deviation σ_tp of the caption information amounts J_tp over all topics, and determines that the caption information amount is large when it exceeds μ_tp + σ_tp. The same applies to the video motion amount.
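The statistical threshold and the resulting use determination can be sketched as below. The choice of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, as are the function names; the text specifies only the threshold μ + σ and the "either is large" rule.

```python
from statistics import mean, pstdev
from typing import List


def high_flags(values: List[float]) -> List[bool]:
    """Flag topics whose value exceeds mean + one standard deviation."""
    threshold = mean(values) + pstdev(values)
    return [v > threshold for v in values]


def use_gaze_feature(caption: List[float], motion: List[float]) -> List[bool]:
    """Per topic: gaze variation is used only when neither amount is high."""
    return [not (c or m)
            for c, m in zip(high_flags(caption), high_flags(motion))]
```

Each input list holds one value per topic, so the output gives one use/no-use decision per topic.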

Although the use determining unit 23 here judges the caption information amount and the video motion amount individually, it may instead judge a quantity obtained by adding the two (for example, by weighted addition).
The result of the use determination for the line-of-sight variation amount for each topic is output from the use determining unit 23 to the statistical feature amount generation unit 30 and the attention level specifying unit 50.

The statistical feature amount generation unit 30 computes statistics of the feature amounts (body movement amount, blink interval time, and line-of-sight variation amount) extracted by the body feature amount extraction unit 10 and generates a fixed-dimension feature amount for the video section (topic). When a determination result indicating that the line-of-sight variation amount is not to be used as a feature amount is input from the line-of-sight variation amount use determination unit 20, the statistical feature amount generation unit 30 generates the fixed-dimension feature amount from the feature amounts excluding the line-of-sight variation amount.
Here, the statistical feature amount generation unit 30 includes a global feature generation unit 31 and a local histogram feature generation unit 32.

The global feature generation unit 31 generates global statistical feature amounts (feature descriptors) within the designated video section (topic) from the feature amounts (body movement amount, blink interval time, and line-of-sight variation amount) extracted by the body feature amount extraction unit 10.
That is, the global feature generation unit 31 generates the rough characteristics of the person H in a given topic as global features.

Here, for each input topic, the global feature generation unit 31 calculates the mean μ_Ktp and standard deviation σ_Ktp of the body movement amount, the mean μ_Btp and standard deviation σ_Btp of the blink interval time, and the mean μ_Etp and standard deviation σ_Etp of the line-of-sight variation amount, and uses them as a fixed-dimension feature descriptor.
As a result, the three types of feature amounts can be expressed as a fixed six-dimensional feature amount (feature descriptor) regardless of the time length of the topic.

When a determination result indicating that the line-of-sight variation amount is not to be used as a feature amount for a topic tp is input from the line-of-sight variation amount use determination unit 20, the global feature generation unit 31 does not calculate the mean and standard deviation of the line-of-sight variation amount. Instead, it generates a four-dimensional feature amount (feature descriptor) consisting of the means and standard deviations of the two remaining feature amounts (body movement amount and blink interval time).
The global features (feature descriptor) generated in this way are output to the attention level specifying unit 50 in association with the topic.
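A minimal sketch of the global feature descriptor: the mean and standard deviation of each per-topic series, giving six dimensions with the line-of-sight variation amount and four without it. The function name, the argument layout, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Optional


def global_features(body: List[float], blink: List[float],
                    gaze: Optional[List[float]] = None) -> List[float]:
    """Return [mean, std] per series: 6 values with gaze, 4 without."""
    series = [body, blink] + ([gaze] if gaze is not None else [])
    desc: List[float] = []
    for s in series:
        desc += [mean(s), pstdev(s)]
    return desc
```

Passing `gaze=None` corresponds to the case in which the use determination excludes the line-of-sight variation amount for the topic.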

The local histogram feature generation unit 32 generates local statistical feature amounts (feature descriptors) within the designated video section (topic) from the feature amounts (body movement amount, blink interval time, and line-of-sight variation amount) extracted by the body feature amount extraction unit 10.
That is, the local histogram feature generation unit 32 calculates finer-grained feature amounts of the person H in a given topic and converts them into histograms (local histograms).

Here, so that the feature distribution does not concentrate in particular bins, the local histogram feature generation unit 32 sets the threshold values of the histogram bins from the mean and standard deviation of the feature amount over the entire video content.
Specifically, letting μ be the mean of the body movement amount detected over the entire video content and σ be its standard deviation, the local histogram feature generation unit 32 generates a histogram with eight bins. As shown in FIG. 7(a), the bin intervals are (−∞, μ−2σ), [μ−2σ, μ−σ), [μ−σ, μ−σ/2), [μ−σ/2, μ), [μ, μ+σ/2), [μ+σ/2, μ+σ), [μ+σ, μ+2σ), and [μ+2σ, ∞). Here, (a, b) denotes the interval of values greater than a and less than b, and [a, b) denotes the interval of values greater than or equal to a and less than b.

Then the local histogram feature generation unit 32 accumulates the body movement amounts within the designated video section (topic), extracted by the body feature amount extraction unit 10, into the bins (sections 0 to 7) shown in FIG. 7(a), and generates a histogram (local histogram feature) such as that shown in FIG. 7(b).
For the blink interval time and the line-of-sight variation amount as well, as with the body movement amount, the bin thresholds are obtained from the mean and standard deviation of each feature amount detected over the entire video content, and histograms are generated.

By setting the bin thresholds from the mean and standard deviation of the feature amounts detected over the entire video content in this way, the local histogram feature generation unit 32 can avoid generating extremely skewed histograms.
As a result, the three types of feature amounts can be expressed as a fixed 24-dimensional feature amount (feature descriptor) regardless of the time length of the topic.
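The eight-bin local histogram can be sketched as follows for one feature amount. The seven thresholds follow the intervals given for FIG. 7(a); the function names, the population standard deviation, and the use of `bisect_right` to realise the left-closed intervals are assumptions of this sketch.

```python
from bisect import bisect_right
from statistics import mean, pstdev
from typing import List


def bin_edges(content_values: List[float]) -> List[float]:
    """Seven thresholds from the mean and std over the whole video content."""
    mu, sigma = mean(content_values), pstdev(content_values)
    return [mu + k * sigma for k in (-2, -1, -0.5, 0, 0.5, 1, 2)]


def local_histogram(topic_values: List[float], edges: List[float]) -> List[int]:
    """Count topic samples per bin (sections 0 to 7)."""
    hist = [0] * (len(edges) + 1)
    for v in topic_values:
        # bisect_right places a value equal to an edge in the bin that
        # starts at that edge, matching intervals of the form [a, b).
        hist[bisect_right(edges, v)] += 1
    return hist
```

Running this once per feature amount and concatenating the three histograms would give the 24 dimensions mentioned above; the edges are computed over the entire content, while the counts are taken over a single topic.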

When a determination result indicating that the line-of-sight variation amount is not to be used as a feature amount for a topic tp is input from the line-of-sight variation amount use determination unit 20, the local histogram feature generation unit 32 does not generate a local histogram for the line-of-sight variation amount. It generates local histograms for the two remaining feature amounts (body movement amount and blink interval time), yielding a 16-dimensional feature amount (feature descriptor).
The local histogram features (feature descriptor) calculated in this way are output to the attention level specifying unit 50 in association with the topic.

That is, as shown in FIG. 8, the statistical feature amount generated by the statistical feature amount generation unit 30 is a 30-dimensional fixed-dimension feature descriptor. It consists of a six-dimensional feature descriptor of global features (the mean and standard deviation of each of the body movement amount, blink interval time, and line-of-sight variation amount) and a 24-dimensional feature descriptor of local histogram features (the frequency of each bin, sections 0 to 7, for each of the body movement amount, blink interval time, and line-of-sight variation amount).

When a determination result indicating that the line-of-sight variation amount is not to be used as a feature amount is input from the line-of-sight variation amount use determination unit 20, the statistical feature amount generated by the statistical feature amount generation unit 30 is a 20-dimensional feature descriptor obtained by removing the line-of-sight variation features from the feature descriptor shown in FIG. 8.

The learning data storage unit (second learning data storage unit) 40 stores learning data obtained by learning in advance the correspondence between feature amounts (global features and local histogram features) and attention levels, and is a general storage device such as a hard disk.
The learning data storage unit 40 stores two sets of learning data in advance: first learning data D1 and second learning data D2. The first learning data D1 and the second learning data D2 may be stored in different storage units.

The first learning data D1 is an SVM estimator (discriminant function) trained in a learning phase. The learning feature amounts are the feature amounts (global features and local histogram features) obtained in advance while a person viewed video content (topics), and the attention level at that time is given as a subjective evaluation value (for example, a value rated on a five-point scale from non-attention to attention).
This SVM estimator is, for example, a classifier whose output value (attention level) takes continuous values (an SVM regression estimator). Of course, a two-class classifier may be used if the attention level is to be output in two classes (attention and non-attention), and a multi-class classifier may be used if it is to be output in multiple classes (multiple values).

The second learning data D2 is an SVM estimator (discriminant function) trained in the same manner as the first learning data D1. However, whereas the first learning data D1 includes the line-of-sight variation amount among its feature amounts, the second learning data D2 is an SVM estimator (discriminant function) trained without the line-of-sight variation amount.

Such learning data can be generated using, for example, an attention level learning device 2 as shown in FIG. 12.
The attention level learning device 2 has the configuration of the attention level estimation device 1 with the attention level specifying unit 50 replaced by a learning unit 60 and the line-of-sight variation amount use determination unit 20 removed; the rest of the configuration is the same.

That is, the learning unit 60 of the attention level learning device 2 generates an SVM estimator (discriminant function) and stores it in the learning data storage unit 40. It takes as learning feature amounts the feature amounts (global features and local histogram features) obtained in advance while a person viewed video content (topics), and receives the attention level at that time as a subjective evaluation value (for example, a value rated on a five-point scale from non-attention to attention) via an input unit (not shown).

At this time, the attention level learning device 2 generates the first learning data D1, which uses the line-of-sight variation amount as a feature amount, and the second learning data D2, which does not.
The first learning data D1 and the second learning data D2 generated in advance through learning by the attention level learning device 2 are thus stored in the learning data storage unit 40 of the attention level estimation device 1.
Returning to FIG. 2, the description of the configuration of the attention level estimation device 1 continues.

Based on the learning data stored in the learning data storage unit 40, the attention level specifying unit 50 specifies the attention level corresponding to the feature amount (feature descriptor) generated by the statistical feature amount generation unit 30 as the attention level for the designated video section.
That is, the attention level specifying unit 50 computes the attention level using the learning data (SVM estimator: discriminant function) stored in the learning data storage unit 40, with the feature descriptor generated by the statistical feature amount generation unit 30 as the input value.

When a determination result indicating that the line-of-sight variation amount is to be used as a feature amount for a topic tp (video section) is input from the line-of-sight variation amount use determination unit 20, the attention level specifying unit 50 computes the attention level using the first learning data D1 stored in the learning data storage unit 40.

On the other hand, when a determination result indicating that the line-of-sight variation amount is not to be used as a feature amount for a topic tp (video section) is input from the line-of-sight variation amount use determination unit 20, the attention level specifying unit 50 computes the attention level using the second learning data D2 stored in the learning data storage unit 40.
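The switch between the two trained estimators can be sketched as a simple dispatch. The estimators are represented here as plain callables standing in for D1 and D2; `estimate_attention`, the `Estimator` alias, and the dimension check are hypothetical names and choices for illustration, and no particular SVM library is implied.

```python
from typing import Callable, Sequence

# A trained estimator maps a feature descriptor to an attention level.
Estimator = Callable[[Sequence[float]], float]


def estimate_attention(descriptor: Sequence[float], use_gaze: bool,
                       d1: Estimator, d2: Estimator) -> float:
    """Use D1 (30 dims, with gaze) or D2 (20 dims, without gaze)."""
    expected = 30 if use_gaze else 20
    if len(descriptor) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(descriptor)}")
    return (d1 if use_gaze else d2)(descriptor)
```

The 30 and 20 dimensions are those of the single-section descriptors described in this text; the time-divided variant would need different sizes.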

The attention level specified for each topic (video section) in this way is output as the estimation result of the attention level estimation device 1. The attention level specifying unit 50 may also transmit the attention level, in association with identification information of the video content (topic), to the source of the video content over a network via a communication control unit (not shown).

With the attention level estimation device 1 configured as described above, the device can estimate the attention level for video content (topics) from the skeleton position information input from the motion capture Mc and the camera video input from the camera C, without placing any burden on the person such as wearing a contact-type device or being irradiated with infrared light.
The attention level estimation device 1 can be operated by a program (attention level estimation program) that causes a general computer to function as each of the units described above.

[Operation of attention level estimation device]
Next, the operation of the attention level estimation device 1 according to the embodiment of the present invention will be described with reference to FIG. 9 (and FIGS. 1 and 2 as appropriate). Here, it is assumed that the learning data (first learning data D1 and second learning data D2) is stored in the learning data storage unit 40 in advance.

First, in the attention level estimation device 1, the body movement amount measuring unit 11 receives the skeleton position information from the motion capture Mc in time series and measures the body movement amount per unit time (for example, per frame) at a predetermined skeleton position of the person H (step S1). For example, the body movement amount measuring unit 11 uses the amount of change per unit time in the head position of the person H, input as skeleton position information, as the body movement amount.

The attention level estimation device 1 also uses the inclination correction unit 12 to correct the inclination of the camera video input from the camera C, frame by frame, so that the face of the person H is vertical in the frame image (camera image) (step S2). At this time, the inclination correction unit 12 rotates the camera image about the center of the frame image so that, in the skeleton position information input from the motion capture Mc, the neck position is directly below the head position.

Then the attention level estimation device 1 uses the blink interval measuring unit 13 to detect blinks of the person H in the camera images that are tilt-corrected in step S2 and input sequentially, and measures the interval between blinks as the blink interval time (step S3). Here, the blink interval measuring unit 13 detects and tracks feature points in the camera image and identifies blinking using a bag-of-words method in which each feature point trajectory is regarded as one word.

Further, the attention level estimation device 1 uses the line-of-sight variation measuring unit 14 to measure the line-of-sight variation amount per unit time in the camera images that are tilt-corrected in step S2 and input sequentially (step S4). Here, the line-of-sight variation measuring unit 14 detects the eye region of the person H in the camera image and measures the line-of-sight variation amount from the temporal change in the luminance ratio between the right and left regions obtained by dividing the eye region at its horizontal center.

The attention level estimation device 1 also uses the caption information amount measuring unit 21 to detect captions frame by frame in the input video content (step S5). Further, the attention level estimation device 1 detects, from the frame-to-frame difference in the input video content, frames whose motion amount is larger than a predetermined amount (step S6).

If the designated video section (topic) has not yet ended in the input video content (No in step S7), the attention level estimation device 1 returns to step S1 and repeats the operations of steps S1 through S6 in order.
As a result, the body movement amount, the blink interval time, and the line-of-sight variation amount within the topic are measured for each unit time (here, each frame). In addition, frames containing captions and frames with large motion within the topic are detected.

When the designated video section (topic) has ended (Yes in step S7), the attention level estimation device 1 uses the caption information amount measuring unit 21 to calculate, as the caption information amount, the ratio of the number of frames in which a caption was detected to the time length of the topic (the total number of frames in the topic) (step S8).
The attention level estimation device 1 also calculates, as the video motion amount, the ratio of the number of frames detected as having large motion to the time length of the topic (the total number of frames in the topic) (step S9).
Then the attention level estimation device 1 uses the use determining unit 23 to determine whether the caption information amount calculated in step S8 is larger than a predetermined amount or the video motion amount calculated in step S9 is larger than a predetermined amount (step S10).

If the caption information amount is larger than the predetermined amount or the video motion amount is larger than the predetermined amount (Yes in step S10), the attention level estimation device 1 uses the statistical feature amount generation unit 30 to generate a feature amount (feature descriptor) that excludes the line-of-sight variation amount (step S11).
At this time, the statistical feature amount generation unit 30 uses the global feature generation unit 31 to generate, as global features, the mean and standard deviation within the topic of each of the body movement amount and the blink interval time.
The statistical feature amount generation unit 30 also uses the local histogram feature generation unit 32 to generate local histogram features by generating a histogram for each of the body movement amount and the blink interval time.

Then the attention level estimation device 1 uses the attention level specifying unit 50 to specify (estimate) the attention level for the feature amount (feature descriptor) generated in step S11, using the second learning data D2, which was trained without the line-of-sight variation amount and is stored in the learning data storage unit 40 (step S12).

On the other hand, if the caption information amount is not larger than the predetermined amount and the video motion amount is not larger than the predetermined amount (No in step S10), the attention level estimation device 1 uses the statistical feature amount generation unit 30 to generate a feature amount (feature descriptor) that includes the line-of-sight variation amount (step S13).
That is, the statistical feature amount generation unit 30 uses the global feature generation unit 31 to generate, as global features, the mean and standard deviation within the topic of each of the body movement amount, the blink interval time, and the line-of-sight variation amount.
The statistical feature amount generation unit 30 also uses the local histogram feature generation unit 32 to generate local histogram features by generating a histogram for each of the body movement amount, the blink interval time, and the line-of-sight variation amount.

Then the attention level estimation device 1 uses the attention level specifying unit 50 to specify (estimate) the attention level for the feature amount (feature descriptor) generated in step S13, using the first learning data D1, which was trained with the line-of-sight variation amount and is stored in the learning data storage unit 40 (step S14).

Through the above operations, the attention level estimation device 1 can estimate the attention level of the person H viewing video content (a topic) toward that topic. Because the device measures the bodily features of the person H, such as the body movement amount, the blink interval time, and the line-of-sight variation amount, by image processing, it can estimate the attention level without placing a burden on the person H.
In addition, when the video content contains many captions or a large amount of motion, the attention level estimation device 1 does not use the line-of-sight feature amount for attention level estimation, and can thereby obtain the attention level accurately.

The configuration and operation of the attention level estimation device 1 according to the embodiment of the present invention have been described above, but the present invention is not limited to this embodiment.
For example, as described with reference to FIG. 8, the statistical feature amount generation unit 30 here generates, for a given topic, a 30-dimensional fixed-dimension feature descriptor consisting of six-dimensional global features and 24-dimensional local histogram features. However, the topic may be further divided in the time direction and a 30-dimensional feature descriptor generated for each section.

For example, as shown in FIG. 10, the statistical feature value generation unit 30 generates the 30-dimensional feature descriptor described above as a whole-topic feature for a given topic n. The statistical feature value generation unit 30 then divides topic n into two in the time direction and generates the 30-dimensional feature descriptor for each section (two-division features).
Topic n may additionally be divided into four in the time direction, with the 30-dimensional feature descriptor generated for each section (four-division features). This yields, for topic n, a fixed-dimension feature descriptor group of 210 dimensions (30 dimensions × 7 feature descriptors).

Including features segmented in the time direction in this way yields a feature quantity that retains temporally local features even when their influence would be diluted in the whole-topic feature.
In this case, the feature descriptor group that does not use the line-of-sight variation amount is a fixed-dimension feature descriptor group of 140 dimensions (20 dimensions × 7 feature descriptors).
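
The dimension counts above (one whole-topic descriptor, two half descriptors, and four quarter descriptors, seven in total) can be checked with a minimal sketch. Several points are assumptions made only for illustration: the three body features are taken to be sampled on a common time axis as an array of shape (samples, features), each histogram uses 8 bins (so that 3 features give 24 histogram dimensions and 2 features give 16), histogram counts are normalized by the number of samples, and the value range of each feature is supplied by the caller. The function names describe and pyramid_descriptor are hypothetical.

```python
import numpy as np

N_BINS = 8  # assumption: 8 bins per feature, giving 3 x 8 = 24 histogram dimensions


def describe(segment: np.ndarray, ranges) -> np.ndarray:
    """Descriptor for one section: mean and standard deviation of each
    feature (global features) followed by one histogram per feature
    (local histogram features). Three features give 6 + 24 = 30 dimensions."""
    mean = segment.mean(axis=0)
    std = segment.std(axis=0)
    hists = []
    for k, (lo, hi) in enumerate(ranges):
        counts, _ = np.histogram(segment[:, k], bins=N_BINS, range=(lo, hi))
        hists.append(counts / len(segment))  # normalize by section length
    return np.concatenate([mean, std, *hists])


def pyramid_descriptor(topic: np.ndarray, ranges) -> np.ndarray:
    """Concatenate descriptors of the whole topic, its 2 halves and its
    4 quarters (7 sections in total)."""
    sections = [topic]
    sections += np.array_split(topic, 2)  # two-division features
    sections += np.array_split(topic, 4)  # four-division features
    return np.concatenate([describe(s, ranges) for s in sections])
```

With three feature columns the result has 7 × 30 = 210 elements; dropping the line-of-sight column leaves 7 × 20 = 140, matching the counts given in the text.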

Further, as shown in FIG. 10, the statistical feature value generation unit 30 may add feature descriptors from the topics before and after topic n (topics n−1 and n+1) to form the feature descriptor group for topic n. In the example of FIG. 10, the statistical feature value generation unit 30 adds, to the 210-dimensional (30 dimensions × 7 feature descriptors) feature descriptor group of topic n, the four-division feature of topic (n−1) and the four-division feature of topic (n+1) that are each closest in time to topic n, generating a fixed-dimension feature descriptor group of 270 dimensions (30 dimensions × 9 feature descriptors).
In this case, the feature descriptor group that does not use the line-of-sight variation amount is a fixed-dimension feature descriptor group of 180 dimensions (20 dimensions × 9 feature descriptors).
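
As a hedged illustration of this augmentation, the sketch below appends the last quarter descriptor of topic (n−1) and the first quarter descriptor of topic (n+1) to the seven descriptors of topic n. The function name add_neighbor_context, the array layout (quarter descriptors stored in time order), and the order of concatenation are assumptions; the text fixes only which descriptors are added and the resulting dimensionality.

```python
import numpy as np


def add_neighbor_context(current, prev_quarters, next_quarters) -> np.ndarray:
    """Build the augmented descriptor group for topic n.

    current:       (7, d) descriptors of topic n (whole, 2 halves, 4 quarters)
    prev_quarters: (4, d) quarter descriptors of topic n-1, in time order
    next_quarters: (4, d) quarter descriptors of topic n+1, in time order
    Returns a flat vector of 9 * d elements.
    """
    current = np.asarray(current, dtype=float)
    before = np.asarray(prev_quarters, dtype=float)[-1]  # quarter just before topic n
    after = np.asarray(next_quarters, dtype=float)[0]    # quarter just after topic n
    return np.concatenate([current.ravel(), before, after])
```

With d = 30 the result has 270 elements, and with d = 20 (no line-of-sight variation) it has 180, as stated above.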

Adding the features before and after the topic in this way allows the attention state to be determined while taking into account changes in feature quantities that span topics, such as the number of blinks increasing immediately after the person is released from an attentive state.
Needless to say, when feature descriptor groups divided in the time direction are added and used in this way, the first learning data D1 and the second learning data D2 stored in the learning data storage unit 40 are learned in advance using feature descriptor groups of the same dimensionality as the augmented feature descriptor group.

In the present embodiment, whether to use the line-of-sight variation amount as a feature is determined from the subtitles and the motion of the video. However, when the target is video content known in advance to contain no subtitles, or video content without large changes in motion, the subtitle information amount measurement unit 21 or the video motion amount measurement unit 22 may be omitted from the configuration.

When both the subtitle information amount measurement unit 21 and the video motion amount measurement unit 22 are omitted from the configuration, the line-of-sight variation amount use determination unit 20 may also be omitted from the attention level estimation device 1 of FIG. 2, as shown in FIG. 11. In that case, only the first learning data D1 needs to be learned and stored in advance in the learning data storage unit 40B.
Alternatively, the configuration may be simplified further by omitting either the body motion amount measurement unit 11 or the blink interval measurement unit 13 from the attention level estimation devices 1 and 1B.

As described above, the present invention can estimate the attention level of a person viewing video content without placing a burden on that person, so the attention level can be estimated easily even in an ordinary household.
Consequently, in contrast to the conventional “audience rating,” which was measured merely from the fact that video content was being displayed, it becomes possible to measure “viewing quality,” an evaluation of the video content itself, by estimating the attention level of a person who is actually viewing the content.

DESCRIPTION OF SYMBOLS
1 Attention level estimation device
10 Body feature amount extraction means
11 Body motion amount measurement means
12 Tilt correction means
13 Blink interval measurement means
14 Line-of-sight variation amount measurement means
20 Line-of-sight variation amount use determination means
21 Subtitle information amount measurement means
22 Video motion amount measurement means
23 Use determination means
30 Statistical feature amount generation means
31 Global feature generation means
32 Local histogram feature generation means
40 Learning data storage means (second learning data storage means)
50 Attention level identification means
S Attention level measurement system
M Monitor
C Camera
Mc Motion capture

Claims

An attention level estimation device that estimates an attention level, indicating the degree to which a person viewing video content pays attention to a predetermined video section of the video content, from skeleton position information measured by a motion capture that detects skeleton positions of the person from an image of the person, and from camera video obtained by photographing the person with a camera, the device comprising:
body motion amount measurement means for receiving the skeleton position information in time series and measuring, as one of the body feature amounts of the person, a body motion amount that is the amount of change per unit time at a predetermined skeleton position in the skeleton position information;
line-of-sight variation amount measurement means for detecting an eye region of the person, based on a predetermined image feature, in camera images input in time series as the camera video, and measuring, as one of the body feature amounts, a line-of-sight variation amount per unit time based on the luminance of the left and right regions into which the eye region is divided;
statistical feature amount generation means for computing statistics of each of the body feature amounts over a predetermined video section of the video content and generating the result as a statistical feature amount for the video section;
learning data storage means for storing in advance, as learning data, the correspondence between the statistical feature amount and the attention level; and
attention level specifying means for specifying, based on the learning data stored in the learning data storage means, the attention level corresponding to the statistical feature amount generated by the statistical feature amount generation means as the attention level for the video section.

The attention level estimation device according to claim 1, further comprising blink interval measurement means for detecting blinks of the person, based on a predetermined image feature, in the camera images input in time series, and measuring a blink interval time as one of the body feature amounts.

The attention level estimation device according to claim 1, further comprising tilt correction means for rotating the camera image, based on the head position and the neck position of the person indicated by the skeleton position information, so that the neck position is directly below the head position.

The attention level estimation device according to any one of claims 1 to 3, further comprising:
second learning data storage means for storing in advance, as second learning data, the correspondence between the attention level and the statistical feature amount obtained with the line-of-sight variation amount removed from the body feature amounts;
subtitle information amount measurement means for measuring the amount of subtitle information contained in the video content;
video motion amount measurement means for measuring a video motion amount of the video content from differences between frames; and
use determination means for determining that the line-of-sight variation amount is not to be used as a body feature amount when the subtitle information amount is larger than a predetermined information amount or the video motion amount is larger than a predetermined motion amount,
wherein, when the use determination means determines that the line-of-sight variation amount is not to be used as a body feature amount, the attention level specifying means specifies, based on the second learning data instead of the learning data, the attention level corresponding to the statistical feature amount excluding the line-of-sight variation amount as the attention level for the video section.

The attention level estimation device according to any one of claims 1 to 3, further comprising:
second learning data storage means for storing in advance, as second learning data, the correspondence between the attention level and the statistical feature amount obtained with the line-of-sight variation amount removed from the body feature amounts;
subtitle information amount measurement means for measuring the amount of subtitle information contained in the video content; and
use determination means for determining that the line-of-sight variation amount is not to be used as a body feature amount when the subtitle information amount is larger than a predetermined information amount,
wherein, when the use determination means determines that the line-of-sight variation amount is not to be used as a body feature amount, the attention level specifying means specifies, based on the second learning data instead of the learning data, the attention level corresponding to the statistical feature amount excluding the line-of-sight variation amount as the attention level for the video section.

The attention level estimation device according to any one of claims 1 to 3, further comprising:
second learning data storage means for storing in advance, as second learning data, the correspondence between the attention level and the statistical feature amount obtained with the line-of-sight variation amount removed from the body feature amounts;
video motion amount measurement means for measuring a video motion amount of the video content from differences between frames; and
use determination means for determining that the line-of-sight variation amount is not to be used as a body feature amount when the video motion amount is larger than a predetermined motion amount,
wherein, when the use determination means determines that the line-of-sight variation amount is not to be used as a body feature amount, the attention level specifying means specifies, based on the second learning data instead of the learning data, the attention level corresponding to the statistical feature amount excluding the line-of-sight variation amount as the attention level for the video section.

The attention level estimation device according to any one of claims 1 to 6, wherein the statistical feature amount generation means generates, as the statistical feature amount, a global feature consisting of the mean and the standard deviation of each body feature amount over the entire video section, and a local histogram feature obtained by building a histogram of each body feature amount with a predetermined bin width.

The attention level estimation apparatus according to claim 7, wherein the statistical feature value generation unit further generates the local histogram feature for each section obtained by dividing the video section into predetermined time sections.

The attention level estimation device according to claim 7 or 8, wherein the statistical feature amount generation means adds the local histogram features of the video sections before and after the video section whose attention level is to be estimated to the statistical feature amount of that video section, and uses the result as the statistical feature amount for estimating the attention level.

An attention level estimation program for estimating an attention level, indicating the degree to which a person viewing video content pays attention to a predetermined video section of the video content, from skeleton position information measured by a motion capture that detects skeleton positions of the person from an image of the person, and from camera video obtained by photographing the person with a camera, the program causing a computer to function as:
body motion amount measurement means for receiving the skeleton position information in time series and measuring, as one of the body feature amounts of the person, a body motion amount that is the amount of change per unit time at a predetermined skeleton position in the skeleton position information;
line-of-sight variation amount measurement means for detecting an eye region of the person, based on a predetermined image feature, in camera images input in time series as the camera video, and measuring, as one of the body feature amounts, a line-of-sight variation amount per unit time based on the luminance of the left and right regions into which the eye region is divided;
statistical feature amount generation means for computing statistics of each of the body feature amounts over a predetermined video section of the video content and generating the result as a statistical feature amount for the video section; and
attention level specifying means for specifying, based on learning data in which the correspondence between the statistical feature amount and the attention level has been learned in advance, the attention level corresponding to the statistical feature amount generated by the statistical feature amount generation means as the attention level for the video section.