JP5937829B2

JP5937829B2 - Viewing situation recognition device and viewing situation recognition program

Info

Publication number: JP5937829B2
Application number: JP2012013452A
Authority: JP
Inventors: 苗村　昌秀; 藤井　真人; 高橋　正樹; 山内　結子
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-01-25
Filing date: 2012-01-25
Publication date: 2016-06-22
Anticipated expiration: 2032-01-25
Also published as: JP2013153349A

Description

The present invention relates to a viewing situation recognition device and a viewing situation recognition program, and more particularly to a viewing situation recognition device and a viewing situation recognition program for recognizing a viewer's viewing situation with high accuracy.

Techniques for recognizing and estimating viewing situations already exist. They are used, for example, to make television program recommendation functions more sophisticated and in information providing systems that effectively present information related to the content being viewed.

One technique used in such systems recognizes the viewing situation by analyzing the operation history of a device (see, for example, Patent Documents 1 and 2). The method of Patent Document 1 outputs a recognized viewing situation according to whether the operation information from the device's remote control matches rules determined by the system designer. The method of Patent Document 2 takes as input information not remote control operations but a wide range of events from PCs (personal computers), mobile phones, car navigation systems, home appliances, and the like.

Methods are also known that estimate the viewing situation by analyzing data obtained by observing the user's behavior, such as images and sound (see, for example, Patent Document 3). The method of Patent Document 3 estimates the viewing situation from video and audio data observed with a camera and a microphone, through image recognition processing such as facial expression recognition, gaze detection, and emotional expression detection, and through speech recognition processing.

Furthermore, methods are known that estimate the viewing situation by integrating data obtained from sensors, such as images and sound of the viewer, with interaction data that the viewer inputs directly, such as operation history and feedback input to the system (see, for example, Patent Document 4). The method of Patent Document 4 calls the sensor data "reaction data" and the interaction data "viewing data", synchronizes the two in time, and outputs the situation recognition result by thresholding a linear sum in which the weighting variable of each signal is varied.

Patent Document 1: JP 2002-369090 A
Patent Document 2: JP 2005-190421 A
Patent Document 3: JP 2006-260275 A
Patent Document 4: JP 2005-142975 A

However, in a method such as that of Patent Document 1, the viewing situation is recognized according to whether the remote control operation information matches rules determined by the system designer, so the rules underlying the decision tend to be subjective. Moreover, because the decision is rule-based, it is difficult to handle complex events, and the method also lacks extensibility.

The method of Patent Document 2 accepts a wider range of events as input, but because situation recognition is still a matter of matching against rules, it shares the problems of rule-based decisions described above. In addition, since Patent Documents 1 and 2 both process data event by event, the temporal relationship with continuous time-series events such as television viewing becomes ambiguous. Furthermore, both are limited to data that the user can consciously input to the system, such as device operation history and behavior history.

The method of Patent Document 3 does not process the recognition output but assigns viewing situation IDs preset by the system designer, so the number of viewing situation types becomes too large. In addition, because it uses only image recognition and speech recognition results, it cannot achieve high-accuracy recognition.

Furthermore, the method of Patent Document 4 synchronizes signals using the time stamps attached to the signals being integrated, which can hardly be called a synchronization method that accounts for the temporal offsets of the individual input signals. The integration method itself consists of weighting variables and threshold processing set ad hoc for the input reaction data and viewing data. It is not an objective method that considers the interactions between signals or the nature of the input data, and it depends largely on the subjectivity of the system designer.

In short, conventional methods perform situation recognition with rules, chiefly rule bases, that are decided subjectively by the system designer, and they end up relying on ad hoc program design.

The present invention has been made in view of the problems described above, and its object is to provide a viewing situation recognition device and a viewing situation recognition program for recognizing a viewer's viewing situation with high accuracy.

To solve the above problems, the present invention employs means having the following features.

The present invention is a viewing situation recognition device that recognizes a viewing situation based on viewer information obtained while a viewer watches a program, comprising: histogram generation means for generating histogram data from a plurality of different data obtained as the viewer information; representative value selection means for selecting a representative value for each unit time from the histogram data obtained by the histogram generation means; synchronization symbol sequence generation means for generating, using the representative values obtained by the representative value selection means, a synchronization symbol sequence in which the plurality of different data are synchronized; conversion means for converting the synchronization symbol sequence obtained by the synchronization symbol sequence generation means into a time-series feature vector signal using preset feature functions; and integration means for integrating the time-series feature vector signal obtained by the conversion means using learning parameters that indicate preset weights, and recognizing the viewing situation of the viewer.

The present invention is also a viewing situation recognition program that recognizes a viewing situation based on viewer information obtained while a viewer watches a program, the program causing a computer to function as: histogram generation means for generating histogram data from a plurality of different data obtained as the viewer information; representative value selection means for selecting a representative value for each unit time from the histogram data obtained by the histogram generation means; synchronization symbol sequence generation means for generating, using the representative values obtained by the representative value selection means, a synchronization symbol sequence in which the plurality of different data are synchronized; conversion means for converting the synchronization symbol sequence obtained by the synchronization symbol sequence generation means into a time-series feature vector signal using preset feature functions; and integration means for integrating the time-series feature vector signal obtained by the conversion means using learning parameters that indicate preset weights, and recognizing the viewing situation of the viewer.

According to the present invention, a viewer's viewing situation can be recognized with high accuracy.

A diagram showing an example of the functional configuration of the viewing situation recognition device in the present embodiment.
A flowchart showing an example of the learning process in the present embodiment.
A flowchart showing an example of the recognition process in the present embodiment.
A diagram for explaining the flow of the recognition process using the machine learning process in the present embodiment.
A diagram for explaining the procedure for generating a time-series feature vector signal from a plurality of different data.
A diagram for explaining an example of generating learning data.
A diagram showing an example of a logical function group.
A diagram showing an example of the functional configuration of the viewing situation recognition device in another embodiment.
A diagram for explaining the specific content of the other embodiment.
A diagram for explaining the correspondence between viewing situation labels and foot operations.

<About the present invention>
In a method for recognizing a viewer's viewing situation, the present invention statistically integrates a plurality of data with different properties relevant to viewing situation recognition, taking their mutual interactions into account, and thereby recognizes the viewing situation effectively. The present invention also builds a highly extensible mechanism that can capture the nature of the data relevant to viewing situation recognition objectively and automatically, without depending on the subjectivity of the system designer.

The types of data relevant to viewing situation recognition differ according to each viewer's viewing environment and other factors. The present invention therefore builds a viewing situation recognition method in which the recognition process can be changed easily. For example, the method first acquires learning parameters for recognition through a learning process that uses correct answer data for viewing situations, collected in advance in an environment close to the real one. It then uses those learning parameters to recognize the viewing situation from an arbitrary input signal for each viewer and outputs the viewing situation corresponding to the recognition result.

Specifically, the present invention symbolizes a plurality of heterogeneous time-series signals that differ in, for example, time unit (time resolution) at each fixed unit time, taking the nature of each signal into account, and converts them uniformly into a time-series symbol vector sequence signal. The present invention also prepares a plurality of feature functions that describe the interactions between the elements of the symbol vector sequence generated in this process, and uses the outputs of these functions as a new time-series feature vector signal that serves as the input to machine learning. The input variables of the feature functions can be any element at any time in the time-series symbol vector sequence obtained above.

In the machine learning process, learning is performed with the time-series feature vector signal as the input signal and the target viewing situation as the output signal. The relationship between input and output is formulated, and the learning parameters obtained from the learning result are stored. The present invention then applies the formulated learning parameters to an unknown time-series feature vector signal, obtained by the same processing from an arbitrary input signal for each viewer, to integrate the signal and determine the value of the viewing situation. In this way, the present invention can acquire, for example, a viewer's degree of interest in a television program with high accuracy.

When collecting learning data for the learning process, the present invention can also label the correct viewing situation while setting the level of interest in what is being viewed, the type (direction) of interest, and so on, using foot operation means such as a foot pedal. This allows correct answer data for the viewing situation to be collected in real time and with high accuracy while the interaction data is collected in a natural environment.

Preferred embodiments of the viewing situation recognition device and the viewing situation recognition program of the present invention, which have the features described above, are described in detail below with reference to the drawings.

<Viewing situation recognition device: functional configuration example>
An example of the functional configuration of the viewing situation recognition device in the present embodiment is described with reference to the drawings. FIG. 1 is a diagram showing an example of the functional configuration of the viewing situation recognition device in the present embodiment. The viewing situation recognition device 10 shown in FIG. 1 includes an information acquisition unit 11, an analysis unit 12, a histogram generation unit 13, a representative value selection unit 14, a synchronization symbol sequence generation unit 15, a conversion unit 16, a feature function group database 17, a statistical learning unit 18, a learning database 19, a learning parameter storage unit 20, and an integration unit 21.

The information acquisition unit 11 acquires input signals for learning and arbitrary input signals for recognizing the viewing situation of each viewer. The input signals are acquired as viewer information. They include heterogeneous data such as observation information, for example camera video from a camera (imaging means) and audio information from audio acquisition means such as a microphone, and operation history information, for example operations performed on a television or the like with remote operation means (hereinafter, "remote control").

The present embodiment is not limited to the types of input signal described above. For example, operation information from a portable information terminal that remotely operates a television or the like, or that has an Internet browsing function, a mail transmission and reception function, or the like (for example, a smartphone, tablet terminal, notebook PC, mobile phone, or game machine), can also be used as an input signal.

Each of the input signals described above preferably carries time information, but the present embodiment is not limited to this. For example, the information acquisition unit 11 may attach time information to each input signal at the moment it acquires the signal. The information acquisition unit 11 outputs the one or more input signals to the analysis unit 12.

The analysis unit 12 analyzes each input signal obtained by the information acquisition unit 11. Specifically, when it acquires camera video, the analysis unit 12 analyzes at least one of the viewer's face recognition, face orientation, facial expression, posture, behavior (motion), and the like. In the present embodiment, the viewer is, for example, an experimenter or administrator who provides correct answer data for learning when the viewing situation recognition device 10 is performing the learning process, and a general viewer when it is performing the recognition process.

For face orientation, posture, and the like, the analysis unit matches a plurality of preset face templates against the region in which a face is detected in the camera video, and thereby recognizes whether a person is in front of the television screen and, if so, which direction the face is turned. For facial expression, it detects the degree of change in the viewer's facial expression during television viewing from the camera video and outputs whether the expression has changed, based on a preset threshold or the like. For behavior, it determines from the camera video whether the viewer is moving or stationary during television viewing and outputs the result as a binary value. The viewer is judged to be stationary when the amount of temporal difference in the video of the viewer remains small over a certain period.
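
The moving/stationary decision described above can be sketched as follows. This is only a minimal illustration, assuming grayscale frames held as nested lists. The function names `frame_difference` and `is_stationary`, the mean-absolute-difference measure, and the threshold values are all assumptions made for the example and do not come from the patent.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # grayscale frame as rows of pixel intensities


def frame_difference(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0


def is_stationary(frames: List[Frame], diff_threshold: float = 2.0,
                  min_frames: int = 3) -> bool:
    """Binary output: True when the last `min_frames` consecutive frame
    differences are all below `diff_threshold` (illustrative values)."""
    diffs = [frame_difference(a, b) for a, b in zip(frames, frames[1:])]
    if len(diffs) < min_frames:
        return False  # not enough history to call the viewer stationary
    return all(d < diff_threshold for d in diffs[-min_frames:])
```

Requiring several consecutive small differences, rather than a single one, mirrors the condition that the difference stays small "over a certain period".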

Furthermore, the analysis unit 12 classifies operation information, such as switching programs or selecting a recommended program with the remote control or a portable information terminal, and outputs it as symbol information. The analysis methods are not limited to those described above, and similar information may be obtained by other methods.

When an input signal is being acquired from, for example, the remote control or a portable information terminal, the analysis unit 12 analyzes posture and behavior from the camera video, such as whether the viewer's face is turned toward the remote control or the portable information terminal. In other words, the analysis unit 12 uses a plurality of data with different properties and performs an analysis that takes their interactions into account. The analysis unit 12 can also analyze the viewer's surrounding environment (for example, night or daytime) from the luminance information of the camera video, the ON/OFF state of lighting equipment captured in the video, and the like.

When it acquires audio information, the analysis unit 12 performs, for example, speech analysis such as measuring sound intensity and extracting sentences (words), analysis of tone, and morphological, syntactic, and semantic analysis of the sentences obtained by speech analysis. From these it analyzes at least one of, for example, the meaning of the words the viewer utters, emotion information obtained by recognizing laughter or crying, and personality information obtained by recognizing speaking speed, loudness of voice, and the like. The analysis unit 12 can also analyze the viewer's surrounding environment (for example, the presence or absence of noise).

The analysis unit 12 also acquires information such as channel switching details and volume adjustments from the operation history information obtained from the remote control or the like.

Furthermore, from the operation information obtained from a portable information terminal, the analysis unit 12 analyzes, for example, whether the viewer is browsing the Internet, whether the viewer is using mail, and the content of any television operations. The analysis unit 12 may also refer to, for example, preset table information based on the operation history information or operation information from the remote control or portable information terminal and output the corresponding analysis result.
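
The table-based conversion of operation information into symbols could look like the sketch below. The event names, the symbol names, and the `OPERATION_TABLE` contents are hypothetical. The patent only states that preset table information may be referenced.

```python
from typing import Dict, List, Tuple

# Hypothetical preset table: raw operation event -> analysis symbol.
OPERATION_TABLE: Dict[str, str] = {
    "channel_up": "CH_SWITCH",
    "channel_down": "CH_SWITCH",
    "volume_up": "VOLUME",
    "volume_down": "VOLUME",
    "select_recommended": "RECOMMEND",
    "browser_open": "WEB",
    "mail_open": "MAIL",
}


def to_symbols(events: List[Tuple[float, str]],
               unknown: str = "OTHER") -> List[Tuple[float, str]]:
    """Map timestamped raw events to timestamped symbols via the table.

    Events missing from the table are mapped to the `unknown` symbol.
    """
    return [(t, OPERATION_TABLE.get(name, unknown)) for t, name in events]
```

Keeping the mapping in data rather than in code means a different viewing environment only needs a different table.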

In short, through the analyses described above, the analysis unit 12 generates a time-series symbol sequence for each input signal and outputs each of them to the histogram generation unit 13.

Based on the one or more data with different properties obtained by the analysis unit 12, the histogram generation unit 13 generates histogram data for each kind of information at every preset unit time, thereby quantifying the input signals.

Specifically, when a plurality of different data exist as viewing information, the histogram generation unit 13 generates, for each preset time unit, histogram data representing the frequency of each of those data. This allows the subsequent processing to convert the data into temporally synchronized symbol sequences and handle them uniformly.
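
A minimal sketch of per-unit-time histogram generation follows. It assumes each stream is a list of `(timestamp, symbol)` pairs with timestamps in seconds. The function name `histograms_per_unit` and the half-open windowing convention are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def histograms_per_unit(symbols: List[Tuple[float, Hashable]],
                        unit: float) -> Dict[int, Counter]:
    """Count symbol occurrences in each window [k*unit, (k+1)*unit).

    Returns a mapping from window index k to a Counter of symbol frequencies.
    Windows that contain no events are simply absent from the mapping.
    """
    if unit <= 0:
        raise ValueError("unit must be positive")
    hists: Dict[int, Counter] = {}
    for t, sym in symbols:
        k = int(t // unit)  # index of the window that contains timestamp t
        hists.setdefault(k, Counter())[sym] += 1
    return hists
```

Calling the function several times with different `unit` values yields histograms at several time resolutions for the same stream.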

Furthermore, the histogram generation unit 13 may set a plurality of unit times in advance and generate histogram data for each of them. The present embodiment can thereby perform learning or recognition of the viewing situation in a way that also takes into account the interactions between data expressed as symbol sequences with different unit times. The histogram generation unit 13 outputs the generated histogram data to the representative value selection unit 14 and the synchronization symbol sequence generation unit 15.

The representative value selection unit 14 selects a representative value based on the histogram data obtained by the histogram generation unit 13. The representative value in the present embodiment may be, for example, the value with the maximum frequency or the minimum frequency in the histogram data for the same data, but it is not limited to these and may be, for example, the total or the mean of the histogram data.

Specifically, when the input signal is, for example, a symbol sequence indicating the type of operation, such as switching the program channel or adjusting the volume, the symbol of the operation performed most frequently in the unit time is output as the representative value. The representative value selection unit 14 outputs the selected representative value to the synchronization symbol sequence generation unit 15.
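
Selecting the most frequent symbol in one unit-time histogram can be sketched as below. The function name `representative`, the `empty` placeholder for windows without data, and the deterministic tie-break are assumptions of this example. The patent also allows other representative values, such as the minimum frequency, total, or mean.

```python
from collections import Counter
from typing import Hashable, Optional


def representative(hist: Counter,
                   empty: Optional[Hashable] = None) -> Optional[Hashable]:
    """Return the most frequent symbol in one unit-time histogram.

    Ties are broken by the smallest string form of the symbol so that the
    result is deterministic; an empty histogram yields `empty`.
    """
    if not hist:
        return empty
    best = max(hist.values())
    return min((s for s, c in hist.items() if c == best), key=str)
```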

The synchronization symbol sequence generation unit 15 generates a synchronization symbol sequence based on the histogram data obtained by the histogram generation unit 13 and the representative value data obtained by the representative value selection unit 14. The plurality of different data (elements) acquired while a viewer watches a television program or the like have different reference time intervals. To synchronize the data in time, the synchronization symbol sequence generation unit 15 therefore generates a symbol sequence of representative values in which the respective data are synchronized. The synchronization symbol sequence generation unit 15 outputs the generated synchronization symbol sequence to the conversion unit 16.
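
One way to picture the synchronization step is to bin every stream on the same unit-time grid and emit one vector of representative symbols per window. The sketch below makes several assumptions that are not in the patent: streams are `(timestamp, symbol)` lists, the representative is the most frequent symbol, and windows without data receive a hypothetical `"NONE"` filler.

```python
from collections import Counter
from typing import Dict, List, Tuple

Stream = List[Tuple[float, str]]  # (timestamp in seconds, symbol)


def synchronize(streams: Dict[str, Stream], unit: float,
                filler: str = "NONE") -> List[Dict[str, str]]:
    """Build one symbol vector per unit-time window across all streams."""
    per_stream: Dict[str, Dict[int, Counter]] = {}
    for name, events in streams.items():
        windows: Dict[int, Counter] = {}
        for t, sym in events:
            windows.setdefault(int(t // unit), Counter())[sym] += 1
        per_stream[name] = windows

    indices = [k for w in per_stream.values() for k in w]
    if not indices:
        return []

    sequence: List[Dict[str, str]] = []
    for k in range(min(indices), max(indices) + 1):
        vector: Dict[str, str] = {}
        for name, windows in per_stream.items():
            hist = windows.get(k)
            # most_common(1) gives the highest-count symbol in this window
            vector[name] = hist.most_common(1)[0][0] if hist else filler
        sequence.append(vector)
    return sequence
```

Because every stream is reduced to one symbol per window, streams sampled at very different rates (video analysis results versus sparse remote control events) end up aligned index by index.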

Based on one or more feature functions stored in advance in the feature function group database 17, the conversion unit 16 converts the symbol sequence obtained by the synchronization symbol sequence generation unit 15 and generates a synchronized time-series feature vector signal. A feature function is a function whose output for a given input is defined in advance. Using a plurality of feature functions, for example, a plurality of further symbolized data sequences can be generated from a symbolized data sequence.
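
The conversion by feature functions can be sketched as follows, assuming a synchronized sequence of symbol vectors keyed by stream name. The three feature functions and the stream and symbol names (`face`, `remote`, `FRONT`, and so on) are hypothetical examples of logical functions over elements at arbitrary times. They are not the functions stored in the patent's database.

```python
from typing import Callable, Dict, List, Sequence

SymbolVector = Dict[str, str]
# A feature function sees the whole synchronized sequence and a time index.
FeatureFn = Callable[[List[SymbolVector], int], int]


def facing_screen(seq: List[SymbolVector], t: int) -> int:
    return int(seq[t]["face"] == "FRONT")


def switched_while_facing(seq: List[SymbolVector], t: int) -> int:
    # Interaction between two different streams at the same time step.
    return int(seq[t]["face"] == "FRONT" and seq[t]["remote"] == "CH_SWITCH")


def looked_away_after_front(seq: List[SymbolVector], t: int) -> int:
    # Interaction across time: FRONT at the previous step, AWAY now.
    return int(t > 0 and seq[t - 1]["face"] == "FRONT"
               and seq[t]["face"] == "AWAY")


FEATURES: Sequence[FeatureFn] = (
    facing_screen, switched_while_facing, looked_away_after_front)


def to_feature_vectors(seq: List[SymbolVector],
                       features: Sequence[FeatureFn] = FEATURES) -> List[List[int]]:
    """Apply every feature function at every time step."""
    return [[f(seq, t) for f in features] for t in range(len(seq))]
```

Adding a new input signal or a new interaction then amounts to adding one more function to the list, without touching the rest of the pipeline.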

When the learning process based on input signals for learning is being performed, the conversion unit 16 outputs the generated time-series feature vector signal to the statistical learning unit 18. When the recognition process based on arbitrary input signals for recognition is being performed, the conversion unit 16 outputs the generated time-series feature vector signal to the integration unit 21.

The statistical learning unit 18 acquires learning parameters through a machine learning process based on the time-series feature vector signal obtained by the conversion unit 16 and the correct viewing situation label data (correct viewing situation data) set in advance in the learning database 19. The machine learning process in the present embodiment can use, for example, a CRF (Conditional Random Field), but it is not limited to a CRF, and any other machine learning process that can handle time-series vector sequences can be used.
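
For reference, the standard textbook form of a linear-chain CRF is given below. It is not a formula taken from the patent. The symbols are assumptions of this note: $x$ is the time-series feature vector signal of length $T$, $y$ is the sequence of viewing situation labels, $f_k$ are feature functions that may depend on adjacent labels and on the input, and $\lambda_k$ are the weights that would play the role of the learning parameters.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Under this model, training typically chooses the $\lambda_k$ that maximize the conditional log-likelihood $\sum_{n} \log p\big(y^{(n)} \mid x^{(n)}\big)$ over the labeled learning data, usually with a regularization term.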

Specifically, the statistical learning unit 18 can also describe the relationships between the temporally synchronized time-series symbol sequences obtained in the process described above, generate a plurality of new time-series symbol sequences (for example, feature symbol sequences), and apply a machine learning process such as a CRF to the generated symbol sequences to acquire learning parameters for the viewing situation recognition process. The statistical learning unit 18 outputs the acquired learning parameters to the learning parameter storage unit 20.

The learning parameter storage means 20 stores the learning parameters obtained from the statistical learning means 18.

The integration means 21 performs integration processing on the time-series feature vector signal obtained by the conversion means 16, based on the learning parameters obtained from the learning parameter storage means 20, and outputs a recognition result of the user's viewing status for the arbitrary input signal (for example, the degree of interest or the type (direction) of interest).

In this embodiment, the learning database 19, the feature function group database 17, and the learning parameter storage means 20 described above may each be configured separately, or may be provided in a single storage means.

As described above, this embodiment integrates time-series data of the viewer's video data, audio data, and user interaction information with devices, such as operation histories, and outputs time-series viewing-status data as the recognition result. Thus, for example during television viewing, the embodiment can integrate the video, audio, and device interaction log data such as operation histories that measure the viewer's behavior and are obtained as arbitrary input signals, and can recognize a viewing status that extends to mental information such as whether the viewer is interested in the program. The viewing status recognition device 10 described above may be built into a program viewing device such as a television, or may be provided as a separate unit.

In this embodiment, for example, the learning processing is performed as offline processing before the viewing-status recognition processing, and the recognition processing is performed as online processing after the learning processing. The analysis means 12, histogram generation means 13, representative value selection means 14, synchronization symbol sequence generation means 15, and conversion means 16 described above perform the same processing on the input data in both the learning processing and the recognition processing. The learning processing and the recognition processing are described concretely below with reference to flowcharts.

<Example of learning processing>
FIG. 2 is a flowchart showing an example of the learning processing in this embodiment. The learning processing shown in FIG. 2 first acquires a plurality of input signals for learning (S01), applies the analysis processing described above to the acquired input signals, and converts each signal into a time-series symbol sequence (S02). The learning processing then generates histograms for the symbol sequences obtained in S02 (S03).

Next, the learning processing selects representative values from the generated histogram data based on a predetermined condition (S04) and, based on the selected representative values, generates synchronization symbol sequences from the data, whose properties differ from one time series to another (S05). The learning processing then converts the synchronization symbol sequences obtained in S05 using preset feature functions (S06), and performs statistical learning on the converted data using the correct viewing-status label data (S07). Finally, the learning parameters obtained by the statistical learning in S07 are stored in a storage means or the like for use in the recognition processing (S08). The learning processing is preferably executed as offline processing before the recognition processing is executed, but is not limited to this.

<Example of recognition processing>
Next, an example of the recognition processing in this embodiment is described. FIG. 3 is a flowchart showing an example of the recognition processing in this embodiment. The recognition processing shown in FIG. 3 first acquires an arbitrary input signal from, for example, a viewer who is watching a program (S11). The arbitrary input signal preferably has the same signal format as the learning input signals, but is not limited to this.

Next, the recognition processing applies the analysis processing described above to the input signal obtained in S11 and converts each signal into a time-series symbol sequence (S12). The recognition processing then generates histograms for the symbol sequences obtained in S12 (S13).

Next, the recognition processing selects representative values from the generated histogram data based on a predetermined condition (S14), generates synchronization symbol sequences from the selected representative values using the data whose properties differ from one time series to another (S15), and converts the generated synchronization symbol sequences using preset feature functions (S16). Steps S12 to S16 are the same as steps S02 to S06 of the learning processing described above.

Next, the recognition processing integrates the time-series feature vector signal converted in S16 using the learning parameters obtained in the learning processing described above (S17), outputs the integrated result as the recognition result of the viewing status for each viewer (S18), and, from that recognition result, outputs the viewing status (for example, the degree of interest) of each viewer corresponding to the input signal (S19).

As described above, this embodiment uses a machine learning framework both to collect learning data for parameter prediction and to perform recognition processing using the predicted parameters.

<Statistical learning method in the learning processing>
The statistical learning method used in the learning processing described above is now explained concretely. As the machine learning processing, this embodiment uses, for example, a CRF, which is well suited to labeling problems on time-series signals; however, the method is not limited to this.

A CRF is a machine learning method that finds the probabilistically most likely labeling time series y for a time-series observation vector sequence x. Its basic form is formulated by equations (1) to (3) below.

In equation (1), p() denotes a probability distribution, Z() denotes the normalization term (normalizing function), and exp() denotes the exponential function with base e. Also in equation (1), x denotes the observed signal sequence (time-series symbol sequence), y denotes the hidden-state output values for x at each time t (for example, the result of the interest determination), and t denotes time. A hidden state here means a state of the viewer that cannot be seen directly (for example, interested or not interested), but is not limited to this. In equation (1), Z() is defined by equation (2) and F() is defined by equation (3).

Regarding y and y' in equations (1) to (3): y denotes one particular state vector, whereas y' denotes every state vector over which the sum Σ is taken. The different notation is used only to emphasize this distinction; the same symbol y could be used instead.

In equation (3), f_i() and g_i() are called feature functions; they are functions whose variables are the time-series observation data (including, for example, the observation information, operation history information, and operation information described above) and the labeling data. Specifically, f_i() is a feature that expresses the interactions within the synchronized x; for example, it is a function for finding which state of the observed, synchronized symbol sequences corresponds to the viewer being most interested. g_i() is a feature that depends on transitions of the state variable y; for example, it expresses how readily one state transitions to the next as time passes.
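The equation images for (1) to (3) are not reproduced in this text. As a reading aid only, a standard linear-chain CRF that is consistent with the symbol definitions given here could be written as follows. The exact form in the published specification may differ; in particular, the separate weight symbols for the two kinds of feature function are a notational assumption.

```latex
\begin{align}
p(\mathbf{y} \mid \mathbf{x}; \theta) &= \frac{1}{Z(\mathbf{x})} \exp\bigl(F(\mathbf{y}, \mathbf{x}; \theta)\bigr) \tag{1} \\
Z(\mathbf{x}) &= \sum_{\mathbf{y}'} \exp\bigl(F(\mathbf{y}', \mathbf{x}; \theta)\bigr) \tag{2} \\
F(\mathbf{y}, \mathbf{x}; \theta) &= \sum_{t} \Bigl( \sum_{i} \theta^{f}_{i}\, f_i(\mathbf{x}, y_t) + \sum_{i} \theta^{g}_{i}\, g_i(y_{t-1}, y_t) \Bigr) \tag{3}
\end{align}
```

Under this reading, the f_i terms score how well the synchronized observations match a label at time t, and the g_i terms score label-to-label transitions.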

Processing with a CRF consists of two steps. First, learning parameters are obtained from observation data obtained from viewers and the correct labeling data. Next, for observation data whose labeling is unknown, the viewing-status recognition result is output using the learned parameters and the feature function outputs computed from that observation data.

In addition to the form given by equations (1) to (3), there are CRF variants that have a preset hidden layer between the observation data sequence and the labeling data sequence; either approach can be applied to this embodiment. CRFs are described in, for example, J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data," Proc. 18th International Conf. on Machine Learning, 2001.

FIG. 4 is a diagram for explaining the flow of recognition processing that uses machine learning processing in this embodiment.

In FIG. 4, "collection of learning data for parameter prediction" shows the procedure of the learning processing described above.

This collection consists of two kinds of processing. The first generates a time-series feature vector signal from a plurality of time-series data (input signals) with differing properties, by way of the synchronization processing (synchronization symbol sequence generation) and the feature function processing (conversion of the synchronization symbol sequences using a plurality of feature functions) described above. The second has an administrator or the like assign viewing-status label values corresponding to the generated time-series feature vector signal. The labels may also be set in advance.

Through the above processing, the embodiment obtains a data set consisting of multiple pairs of observation data for CRF learning and the corresponding correct data. It then uses these data to predict the learning parameters. For a CRF, these learning parameters correspond to {θ}.

In the recognition processing shown in FIG. 4, a time-series feature vector signal is generated from a plurality of time-series data with differing properties (arbitrary input signals) by the same process as in the collection of learning data for parameter prediction. At this point the viewing-status labels corresponding to this time-series feature vector signal are unknown, so they are assigned using the learned parameters. The labels are assigned by selecting the probabilistically most likely labeling sequence, for example as in equation (4) below.

In equation (4), Θ denotes the learning parameters.
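The image for equation (4) is not reproduced in this text. Based on the description (select the most probable labeling sequence under the learned parameters Θ), it would plausibly take the usual maximum a posteriori form below; the hat notation for the selected sequence is an assumption.

```latex
\begin{equation}
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \; p(\mathbf{y} \mid \mathbf{x}; \Theta) \tag{4}
\end{equation}
```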

<Procedure for generating the time-series feature vector signal>
Next, the procedure for generating the time-series feature vector signal described above is explained with reference to the drawings. FIG. 5 is a diagram for explaining the procedure for generating a time-series feature vector signal from a plurality of different data.

In the example of FIG. 5, a histogram representing the frequency of signal values within each preset unit time interval is first generated from each of the heterogeneous input data. Then, in the representative value selection processing, a representative symbol value is determined from the shape of the histogram and used as the output value for that time interval.

The representative symbol value can be selected from the histogram by, for example, choosing the value with the maximum or minimum frequency, but the selection is not limited to this. For example, the average of the symbol values obtained from multiple histogram data in the unit time interval may be computed and used as the output value.

For example, when the input signal is a symbol sequence representing device interactions such as switching program channels or adjusting the volume, the symbol of the interaction performed most frequently in the unit time becomes the output of the representative value selection processing.

In this way, by generating histograms and selecting representative values from the observation data in each section of a preset time interval (for example, the time Tp shown in FIG. 5), the embodiment can synchronize heterogeneous signals that differ in, for example, time resolution. The interactions between the signals can then be expressed in the subsequent feature function processing.
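The histogram and representative value steps can be sketched as follows. This is a minimal illustration, assuming each stream is already a list of symbols and that the number of samples per interval Tp is known for each stream; the function names and the symbol "Left" are hypothetical and do not appear in the specification.

```python
from collections import Counter

def representative_symbols(symbols, samples_per_interval):
    """Return the most frequent symbol (histogram peak) of each consecutive interval."""
    reduced = []
    for start in range(0, len(symbols), samples_per_interval):
        histogram = Counter(symbols[start:start + samples_per_interval])
        # most_common(1) gives the peak; ties go to the symbol seen first.
        reduced.append(histogram.most_common(1)[0][0])
    return reduced

def synchronize(streams):
    """streams maps a stream name to (symbols, samples_per_interval).

    Returns one dict of representative symbols per interval, truncated to the
    shortest stream so that all streams share the same time axis.
    """
    reduced = {name: representative_symbols(symbols, n)
               for name, (symbols, n) in streams.items()}
    length = min(len(sequence) for sequence in reduced.values())
    return [{name: sequence[t] for name, sequence in reduced.items()}
            for t in range(length)]

# Two streams with different time resolutions (4 and 2 samples per interval).
streams = {
    "Dir": (["Front", "Front", "Left", "Front", "Left", "Left", "Front", "Left"], 4),
    "Tab": (["Ch", "Ch", "Non", "Non"], 2),
}
synchronized = synchronize(streams)
```

The maximum-frequency rule is only one of the selection rules the text allows; a minimum-frequency or averaging rule would replace the `most_common` call.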

In the feature function processing of this embodiment, arbitrary functions that express relationships among multiple signals can be set. Several such relational expressions are defined according to the type of viewing status to be recognized, and their outputs are bundled into a time-series feature vector signal. In particular, with a CRF, the signal at any time in the input observation time series can be used as a variable of a feature function, so flexible feature functions can be constructed.

FIG. 6 is a diagram for explaining an example of generating learning data. The example of FIG. 6 shows a set of learning data for the case where the viewing status is the degree of interest while watching a television program and the system input signals are face orientation information (Dir), body stillness (Mov), facial expression change (Exp), and operation information of a portable information terminal such as a tablet (Tab). The types of information used as learning data are not limited to these.

The table in FIG. 6 shows, as input signals, time-series signals obtained by converting the observation data recorded during program viewing into per-unit-time data consisting of the representative values determined from the histogram for each unit time. The table also shows, as the degree of interest of the viewing status, the interest determination (Cur) at each time: interested (Curi) or not (Non). The interest determination is an interest marker added subjectively by an administrator or the like in order to generate the learning data.

The feature functions need only describe the interactions among these input signals that relate to interest. For example, one or more logical functions (for example, {f_i(x, y_t)} and {g_i(y_{t-1}, y_t)}) can be generated as the feature functions involving the observed signal sequence x in equations (1) to (3).

FIG. 7 is a diagram showing an example of a group of logical functions. FIG. 7 shows logical functions f0 to f20 as examples of feature functions. In the logical functions of FIG. 7, Dir denotes face orientation information, Mov denotes body stillness, Exp denotes facial expression change, and Tab denotes operation information of a portable information terminal such as a tablet.

The symbols Front, TB, Pos, Neg, Ch, Non, and so on in FIG. 7 individually represent states of, for example, the face orientation information, facial expression change, and operation information. Specifically, Front means "facing forward", TB means "ordinary terminal operation", Pos means "active operation", Neg means "passive terminal operation", Ch means "channel switched by terminal operation", and Non means "nothing done". t denotes time. The operation information symbols TB, Pos, and Neg are explained further here. These symbols are assigned subjectively, at the system design stage, to operations of a portable information terminal such as a tablet, and express how closely an operation relates to the program content currently being viewed.

In general, a viewer who watches television while operating a portable information terminal can be expected to use various functions of the terminal. Therefore, Pos is assigned when the operation is related to the program, such as looking up something connected with the television program; Neg is assigned when the operation is clearly unrelated to the program, such as an unrelated search or composing an e-mail; and TB is assigned when the terminal is being operated idly and neither case applies. By taking such terminal operations into account, the embodiment can estimate the user's degree of interest in a program with higher accuracy.

For example, Dir(t) = Front represents the state "the face is facing the front at time t". As described above, each symbol identifies a state individually, and combining them makes it possible to construct logic such as that shown in FIG. 7. The types, number, order, and content of the logical functions are not limited to these.

In this embodiment, the outputs of the function group selected as in FIG. 7 are the components of the time-series feature vector signal and correspond to the f terms in equation (1). The feature functions f0 to f20 in FIG. 7 omit the labeling state (output value) y as a variable, but in practice the outputs of f0 to f20 are computed for each labeling state.

As concrete examples of the feature functions in this embodiment, f0, f17, and f18 of FIG. 7 are described. The feature function f0 outputs '1' at time t when, at time t, the face is facing the front, there is a change in facial expression, and the body is still.

The feature function f17 outputs '1' at time t when the channel is switched by a terminal operation at time t. The feature function f18 is logic that takes all of the input states into account: it outputs '1' at time t when the channel was switched at time t-1, the face is facing the front at time t, and, at time t+1, the face is facing the front or the body is still.
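The three feature functions described above can be written as indicator functions over a synchronized symbol sequence. The sketch below follows the verbal descriptions only; FIG. 7 itself is not reproduced here, and the symbols "Still", "Move", "Change", and "Away" are hypothetical names for states the text does not name.

```python
def symbol_at(sequence, t, stream):
    """Symbol of `stream` at time t, or None when t is outside the sequence."""
    return sequence[t][stream] if 0 <= t < len(sequence) else None

def f0(sequence, t):
    """1 if, at time t, the face is frontal, the expression changes, and the body is still."""
    return int(symbol_at(sequence, t, "Dir") == "Front"
               and symbol_at(sequence, t, "Exp") == "Change"
               and symbol_at(sequence, t, "Mov") == "Still")

def f17(sequence, t):
    """1 if the channel is switched by a terminal operation at time t."""
    return int(symbol_at(sequence, t, "Tab") == "Ch")

def f18(sequence, t):
    """1 if channel switch at t-1, frontal face at t, and frontal face or still body at t+1."""
    return int(symbol_at(sequence, t - 1, "Tab") == "Ch"
               and symbol_at(sequence, t, "Dir") == "Front"
               and (symbol_at(sequence, t + 1, "Dir") == "Front"
                    or symbol_at(sequence, t + 1, "Mov") == "Still"))

def feature_vectors(sequence, functions):
    """Bundle the outputs of all feature functions into one vector per time step."""
    return [[f(sequence, t) for f in functions] for t in range(len(sequence))]

# Hypothetical synchronized sequence with three time steps.
sequence = [
    {"Dir": "Front", "Mov": "Still", "Exp": "Non", "Tab": "Ch"},
    {"Dir": "Front", "Mov": "Still", "Exp": "Change", "Tab": "Non"},
    {"Dir": "Away", "Mov": "Still", "Exp": "Non", "Tab": "Non"},
]
vectors = feature_vectors(sequence, [f0, f17, f18])
```

Treating out-of-range times as "no match" at the sequence boundaries is a design choice of this sketch, not something the text specifies.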

By assembling one or more such feature functions from the inputs and running the learning processing, an individual weighting parameter θi is computed for each feature function element.

<CRF learning>
In the CRF learning described above, Z(x) in equation (1) can be computed with the forward-backward algorithm, which is widely used in, for example, HMM (Hidden Markov Model) training. In this embodiment, learning can be carried out efficiently by combining this with an optimization algorithm such as L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno), a quasi-Newton method.
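As an illustration of why Z(x) is tractable, the forward pass below computes log Z(x) for a linear-chain model in time linear in the sequence length and checks it against explicit enumeration. It assumes the weighted feature sums have already been collapsed into per-time label scores (`unary`) and label-to-label scores (`trans`); these names and the numeric values are hypothetical.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def log_partition(unary, trans):
    """Forward recursion: unary[t][y] scores label y at time t, trans[a][b] scores a -> b."""
    n_labels = len(trans)
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [unary[t][b] + logsumexp([alpha[a] + trans[a][b] for a in range(n_labels)])
                 for b in range(n_labels)]
    return logsumexp(alpha)

def sequence_score(labels, unary, trans):
    """Total score of one complete label sequence."""
    total = unary[0][labels[0]]
    for t in range(1, len(labels)):
        total += trans[labels[t - 1]][labels[t]] + unary[t][labels[t]]
    return total

def log_partition_brute_force(unary, trans):
    """Same quantity by summing over every label sequence (exponential cost)."""
    n_labels = len(trans)
    return logsumexp([sequence_score(labels, unary, trans)
                      for labels in product(range(n_labels), repeat=len(unary))])

unary = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
trans = [[0.6, -0.1], [0.0, 0.4]]
log_z = log_partition(unary, trans)
```

The backward pass, omitted here, supplies the marginals needed for the gradient that an optimizer such as L-BFGS consumes.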

The computation of equation (4) can likewise be carried out efficiently using, for example, the Viterbi algorithm. Furthermore, in place of the basic CRF of equation (1), this embodiment can use an LDCRF (Latent-Dynamic Conditional Random Field), which places a plurality of hidden states h between the output time-series labeling data y and the observation vector sequence x and obtains y from the relationship between the hidden states and the output labels. LDCRFs are described in, for example, L.-P. Morency, A. Quattoni, and Trevor Darrell, "Latent-Dynamic Discriminative Models for Continuous Gesture Recognition," Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2007.
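For the basic linear-chain case, the Viterbi decoding mentioned for equation (4) can be sketched as below. As an assumption of this sketch, the weighted feature sums are taken to be precomputed as per-time label scores (`unary`) and transition scores (`trans`); both names and the numbers are hypothetical.

```python
def viterbi(unary, trans):
    """Return the highest-scoring label sequence and its score.

    unary[t][y] is the score of label y at time t; trans[a][b] is the score of
    moving from label a to label b.
    """
    n_labels = len(trans)
    score = list(unary[0])
    back_pointers = []
    for t in range(1, len(unary)):
        new_score, pointers = [], []
        for b in range(n_labels):
            # Best predecessor for label b at time t.
            best_a = max(range(n_labels), key=lambda a: score[a] + trans[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + trans[best_a][b] + unary[t][b])
        score = new_score
        back_pointers.append(pointers)
    last = max(range(n_labels), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

unary = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
trans = [[0.6, -0.1], [0.0, 0.4]]
best_path, best_score = viterbi(unary, trans)
```

Because the normalizer Z(x) is the same for every candidate sequence, maximizing the unnormalized score is equivalent to maximizing the probability.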

In the LDCRF case, what is computed within the CRF framework is the probability of each hidden state sequence h, and the probability of y is obtained by equation (5).

When the sets of hidden states associated with the individual output labels do not overlap, this is known to reduce to a simple sum, as in equation (6) below, so the computation can follow the basic CRF framework as it is.
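The images for equations (5) and (6) are not reproduced in this text. In the LDCRF formulation of the cited Morency et al. paper, the corresponding expressions take the form below, where the set of hidden states assigned to label y_t is written as a calligraphic H; that this matches the specification's exact notation is an assumption.

```latex
\begin{align}
P(\mathbf{y} \mid \mathbf{x}; \theta) &= \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}; \theta)\, P(\mathbf{h} \mid \mathbf{x}; \theta) \tag{5} \\
P(\mathbf{y} \mid \mathbf{x}; \theta) &= \sum_{\mathbf{h} :\, \forall t,\; h_t \in \mathcal{H}_{y_t}} P(\mathbf{h} \mid \mathbf{x}; \theta) \tag{6}
\end{align}
```

With disjoint hidden-state sets, P(y | h, x; θ) is 1 when every h_t belongs to the set of its label y_t and 0 otherwise, which is why (5) collapses to the plain sum in (6).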

In the LDCRF case, g_i() in equation (3) is a feature that depends on transitions of the state variable h; for example, it expresses how readily one hidden state transitions to the next as time passes. Using an LDCRF can improve discrimination performance over a CRF.

<Concrete signal flow from input signal to learning parameter output>
The concrete flow of signals from input signal to learning parameter output is now described. The following description takes as example input signals the video of the viewer observed during television viewing and the channel (Ch) switching information from the operation history, and covers the processing from input to viewing-status recognition output.

[1. Converting the original signals into time-series symbol sequences]
First, the face orientation is computed as the analysis processing for the video signal obtained as observation data. For the operation history signal, whether an operation event occurred is tallied at fixed time intervals, and when an operation is performed a symbol is assigned for each kind of operation information and recorded. The symbol for a channel (Ch) switch is 'Ch'.

[2. Generating histograms from the time-series symbol sequence data]
Next, histograms are generated from the face orientation and the operation history information within a section of fixed length (for example, 5 seconds). For the face orientation, the histogram reflects how often each orientation occurred within the section; for the operation history, it reflects how often each operation occurred within the section.

[3. Generating time-series symbol data and selecting representative values]
For the face orientation, the orientation with the maximum frequency in the histogram data is output for the histogram section. For the operation history information, the kind of operation with the maximum frequency in the operation information histogram is output for the histogram section. For channel switching, the symbol 'Ch' is therefore assigned to sections in which channel switching was the most frequent operation. When no operation was performed, the symbol 'Non' is assigned.
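Steps 1 to 3 for the operation history can be sketched as follows, assuming the log is a list of (timestamp in seconds, symbol) events. The function name and the symbol "Vol" (standing for a volume adjustment) are hypothetical; only 'Ch' and 'Non' come from the text.

```python
from collections import Counter

def operation_symbols(events, section_seconds, total_seconds):
    """Reduce timestamped operation events to one representative symbol per section.

    A section with no events yields 'Non'; otherwise the most frequent
    operation symbol in that section is chosen.
    """
    n_sections = -(-total_seconds // section_seconds)  # ceiling division
    histograms = [Counter() for _ in range(n_sections)]
    for timestamp, symbol in events:
        index = int(timestamp // section_seconds)
        if 0 <= index < n_sections:
            histograms[index][symbol] += 1
    return [h.most_common(1)[0][0] if h else "Non" for h in histograms]

# Hypothetical 15-second log split into 5-second sections.
events = [(1.0, "Ch"), (2.5, "Ch"), (3.0, "Vol"), (11.2, "Vol")]
symbols = operation_symbols(events, section_seconds=5, total_seconds=15)
```

Applying the same section length to the face orientation stream yields sequences with one symbol per section, which is what makes the two streams time-synchronized in step 4.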

[4. Generating the time-synchronized time-series symbol sequences]
The above processing yields a face orientation signal sequence and a channel switching symbol sequence that are synchronized in time.

[5. Example of generating the time-series feature vector signal sequence with feature functions]
Next, a plurality of time-series feature vector signal sequences are generated from the outputs of a plurality of different feature functions that take the time-synchronized time-series symbol sequences as input variables. An example feature function that takes the channel switching and face orientation time-series signals as input is used for the explanation.

First, let T be the current time and T-1 the preceding time. With the operation information Tab(T-1) at time T-1 and the face orientation information Dir(T) at time T as input signals, a function such as equation (7) below can be considered as an example of a feature function.

This function outputs 1 when channel switching was performed at time T-1 and the viewer is facing the front at time T, regardless of the hidden state Y(T) at time T, and outputs 0 otherwise.
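A function with the behavior just described could be written along the following lines. This is a reconstruction from the verbal description only; the function name f and the symbol Front for the frontal face direction are assumptions, while Ch is the channel switching symbol introduced earlier.

```latex
f\bigl(Y(T),\, Tab(T-1),\, Dir(T)\bigr) =
\begin{cases}
1 & \text{if } Tab(T-1) = \text{Ch} \ \text{and} \ Dir(T) = \text{Front}, \\
0 & \text{otherwise.}
\end{cases}
```

The hidden state Y(T) appears as an argument but does not affect the value, which matches the statement that the output does not depend on it.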

As a result of this feature function, a new time-series signal that is 1 only when this condition is satisfied is generated from the time-series signals of channel switching and face direction.

The time-series feature vector signal is constructed as a time-series signal of vectors, each of which collects the outputs of a plurality of feature functions written for each of the plurality of assumed states described above.
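As a rough sketch of how several feature functions might be collected into a time-series feature vector signal, consider the following. Only the first function corresponds to the example in the text; the other two functions, the label names `interested` and `bored`, and the face direction symbols are hypothetical and serve only to give the vector more than one component.

```python
def f_switch_then_front(y, tab_prev, dir_now):
    """1 if the channel was switched at T-1 and the viewer faces front at T."""
    return 1 if tab_prev == "Ch" and dir_now == "front" else 0


def f_idle_and_away(y, tab_prev, dir_now):
    """Hypothetical: 1 if nothing was operated at T-1 and the viewer looks away at T."""
    return 1 if tab_prev == "Non" and dir_now != "front" else 0


def f_interested_and_front(y, tab_prev, dir_now):
    """Hypothetical: 1 if the state at T is 'interested' and the viewer faces front."""
    return 1 if y == "interested" and dir_now == "front" else 0


FEATURES = [f_switch_then_front, f_idle_and_away, f_interested_and_front]


def feature_vectors(tab, direction, labels):
    """Return one feature vector per time T >= 1 from Tab(T-1), Dir(T), and Y(T)."""
    return [
        [f(labels[t], tab[t - 1], direction[t]) for f in FEATURES]
        for t in range(1, len(direction))
    ]


tab = ["Ch", "Non", "Non", "Ch"]
direction = ["left", "front", "left", "front"]
labels = ["bored", "interested", "bored", "interested"]

for vector in feature_vectors(tab, direction, labels):
    print(vector)
# [1, 0, 1]
# [0, 1, 0]
# [0, 0, 1]
```

Each row is the feature vector at one time step, so the list as a whole plays the role of the time-series feature vector signal.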

[6. Learning of weighting parameters for integrating feature functions]
In the present embodiment, the weighting parameters for integrating the plurality of feature functions are learned within a machine learning framework. The learned weighting parameters are stored as learning parameters and used during recognition processing.

That is, in the present embodiment, the learning parameters include, for example, the weighting parameters. During recognition processing, the weighting parameter values obtained by learning are multiplied, for example based on equation (3) above, by the feature function values calculated from the state to be recognized, and the probability value of the recognition target is obtained by equation (1) above. Recognition then estimates the state y, that is, the degree of interest, by selecting the y with the highest probability in equation (1) according to equation (4) above.
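The recognition step can be illustrated with a simplified log-linear model. This sketch is an assumption-laden stand-in, not the formulation of equations (1), (3), and (4): it scores each candidate state independently at a single time step and omits the transition terms between neighboring states that a linear-chain CRF would include. The feature functions, weights, and label names are all hypothetical.

```python
import math


def f_front_interested(y, x):
    """Hypothetical feature: viewer faces front and the state is 'interested'."""
    return 1 if y == "interested" and x["dir_now"] == "front" else 0


def f_switch_not_interested(y, x):
    """Hypothetical feature: channel was just switched and the state is 'not_interested'."""
    return 1 if y == "not_interested" and x["tab_prev"] == "Ch" else 0


FEATURES = [f_front_interested, f_switch_not_interested]
WEIGHTS = [1.5, 0.8]  # stand-ins for learned weighting parameters
LABELS = ["interested", "not_interested"]


def score(y, x):
    """Weighted sum of feature function values for candidate state y."""
    return sum(w * f(y, x) for w, f in zip(WEIGHTS, FEATURES))


def posterior(x):
    """Normalize exponentiated scores into a probability for each state."""
    scores = {y: score(y, x) for y in LABELS}
    top = max(scores.values())  # subtracted for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def recognize(x):
    """Select the state with the highest probability."""
    probs = posterior(x)
    return max(probs, key=probs.get)


print(recognize({"tab_prev": "Non", "dir_now": "front"}))  # interested
print(recognize({"tab_prev": "Ch", "dir_now": "left"}))    # not_interested
```

Because the normalization does not change the ordering of the scores, the selected state is the same whether the probabilities or the raw weighted sums are compared.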

<Other embodiments>
In the viewing status recognition device 10 described above, the learning parameters are estimated from the collected learning data, and the viewing status labels are calculated according to equation (4) above. The accuracy, however, may depend on how well high-quality learning data matching actual viewing conditions can be acquired.

For example, correct labeling data for a viewing status such as "watched with interest" is normally best entered at the same time as the observation data is measured. In the present embodiment, however, the observation data also includes information on interaction with the device, so it is difficult for the user to enter correct labeling data while operating the device.

For this reason, conventional methods have the viewer assign correct labeling data after watching the entire program, while recalling the situation. Such methods tend to be inaccurate because memory is ambiguous. The present invention therefore provides a configuration for assigning correct labeling data in real time while the program is being viewed. Specifically, foot operation means such as a foot pedal is added to the configuration of the viewing status recognition device 10 described above, and the correct data for the labels, such as the degree of interest and the type (direction) of interest, is entered by the viewer's foot operation. This is described below as another embodiment.

FIG. 8 is a diagram illustrating an example of the functional configuration of a viewing status recognition device according to another embodiment. In the viewing status recognition device 30 shown in FIG. 8, components that are the same as those of the viewing status recognition device 10 described above are given the same names and reference numerals, and their detailed description is omitted here.

The viewing status recognition device 30 shown in FIG. 8 includes information acquisition means 11, analysis means 12, histogram generation means 13, representative value selection means 14, synchronization symbol sequence generation means 15, conversion means 16, a feature function group database 17, statistical learning means 18, a learning database 19, learning parameter storage means 20, integration means 21, foot operation means 31, and correct data conversion means 32.

Compared with the viewing status recognition device 10 described above, the configuration shown in FIG. 8 newly includes the foot operation means 31 and the correct data conversion means 32. The foot operation means 31 can be, for example, a foot pedal of the kind used in flight simulation games and driving games.

The foot operation means 31 is used by the viewer to enter, by foot operation, an interest level, an interest genre (type of interest), and the like for the program being viewed, for example at the moment content of interest is viewed. The interest level is obtained, for example, by setting an evaluation scale of 3 levels, 10 levels, or the like in advance and reading the level from how far the foot pedal is depressed. The interest genre is likewise obtained as a preset genre selected according to how far the foot pedal is depressed. The interest genre includes, for example, at least one of the actors and actresses appearing in the program, the clothes worn by the characters, scenery, buildings, music, and the content of the program, but is not limited to these. The interest level and the interest genre can each be entered using, for example, the right foot or the left foot.

The foot operation means 31 is not limited to the foot pedal described above. For example, camera video of the viewer's feet may be acquired, and the interest level, interest genre, and the like may be entered from the positional relationship of both feet analyzed from that video. The information entered through the foot operation means 31 is also not limited to the interest level and interest genre described above.

The foot operation means 31 outputs the acquired interest level and interest genre to the correct data conversion means 32, attaching time information corresponding to the program viewing. This makes it easy to synchronize the input with the viewer information obtained while the viewer watches the program.

The correct data conversion means 32 generates correct viewing status label data from the interest level and interest genre obtained from the foot operation means 31, and stores the generated label data in the learning database 19. The data obtained by the foot operation means 31 and the correct data conversion means 32 is acquired, for example, as the interest determination (Cur) shown in the table of FIG. 6 described above, but is not limited to this.

In the embodiment shown in FIG. 8, input is made with the feet rather than with the hands or voice, so correct viewing status label data can be set in real time in the same environment as that of an ordinary viewer.

Because the foot operation means 31 and the correct data conversion means 32 shown in FIG. 8 are used in the learning process, they are mainly used by experimenters, administrators, and others who prepare the correct data for learning. However, they are not limited to this use and may also be used, for example, by ordinary viewers.

FIG. 9 is a diagram for explaining a specific example of the other embodiment. In the example of FIG. 9, the conversion means 16 converts the observation data and user interaction data obtained from the information acquisition means 11 into time-series feature vector signals, and the correct data conversion means 32 converts the pedal depression position information ([left: x, y, z], [right: x, y, z]) obtained from the left and right foot pedals 40-1 and 40-2, which serve as the foot operation means 31, into correct viewing status data. Specifically, as shown in FIG. 9, the viewer's operation of the foot operation means 31 outputs the xyz positions of the left and right foot pedals 40-1 and 40-2, so assigning a viewing status label value to each output value makes it possible to register correct data easily by foot pedal operation alone.

In FIG. 9, the learning data described above is collected based on the results of the respective conversion processes, and the learning parameters are obtained.

The correspondence between viewing status labels and the xyz positions from the foot pedals 40-1 and 40-2 is defined in advance, for example in an XML (Extensible Markup Language) file, so that labeling can be adapted to a variety of viewing statuses.

FIG. 10 is a diagram for explaining the correspondence between viewing status labels and foot operations. The example of FIG. 10 shows an XML file for the case where the level (degree) of interest and the genre toward which the interest is directed (type of interest) are entered simultaneously as the viewing status.

In the example of FIG. 10, the foot pedals are defined so that, for example, the right foot pedal (right) is used to enter the degree of interest in three levels (represented by 0, 1, and 2), and the left foot pedal (left) is used to enter the genre of interest as one of three types (represented by 0, 1, and 2).
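To make the idea of an XML-defined correspondence concrete, the following sketch loads a pedal definition and exposes it as a lookup table. The XML of FIG. 10 is not reproduced in this text, so the element names, attribute names, and range boundaries below are entirely hypothetical; only the roles of the right pedal (interest level, three values) and the left pedal (interest genre, three values) come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema; the actual file of FIG. 10 may look quite different.
PEDAL_CONFIG = """
<pedals>
  <pedal side="right" meaning="interest_level" axes="x y z">
    <range min="0" max="80" label="0"/>
    <range min="80" max="120" label="1"/>
    <range min="120" max="255" label="2"/>
  </pedal>
  <pedal side="left" meaning="interest_genre" axes="z">
    <range min="0" max="80" label="0"/>
    <range min="80" max="120" label="1"/>
    <range min="120" max="255" label="2"/>
  </pedal>
</pedals>
"""


def load_pedal_config(xml_text):
    """Parse the pedal definition into a dict keyed by pedal side."""
    root = ET.fromstring(xml_text.strip())
    config = {}
    for pedal in root.findall("pedal"):
        config[pedal.get("side")] = {
            "meaning": pedal.get("meaning"),
            "axes": pedal.get("axes").split(),
            "ranges": [
                (int(r.get("min")), int(r.get("max")), int(r.get("label")))
                for r in pedal.findall("range")
            ],
        }
    return config


config = load_pedal_config(PEDAL_CONFIG)
print(config["right"]["meaning"], config["right"]["axes"])
print(config["left"]["meaning"], config["left"]["axes"])
```

Keeping the correspondence in a data file rather than in code means a different labeling scheme, such as a 10-level interest scale, could be introduced by editing the file alone.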

The interest level entered with the right foot is determined, for any of the three directions x, y, and z, by dividing the input range, whose maximum value is 255, into three equal parts and finding the part that contains the amount of depression. Specifically, the value 0 is output when the amount of depression is 0 to 80, the value 1 is output when it is 80 to 120, and the value 2 is output when it is 120 to 255. In other words, pressing the pedal fully in any direction outputs the maximum interest level (for example, interest level 2).

The interest genre corresponds only to the amount of depression of the left foot in the Z direction, and a value corresponding to one of the ranges obtained by dividing the maximum input value of 255 into three equal parts is entered. Specifically, in the example of FIG. 10, the input values are the numerical data {0, 1, 2}, and associating each value with an interest genre allows them to serve as genre input. Through this processing, the viewer can easily enter the viewing status at each moment while interacting with the device as usual.
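The conversion from pedal positions to label values can be sketched as below. The boundaries 80 and 120 are the ones stated for the right pedal; several points are assumptions made for this example: a value exactly on a boundary is assigned to the higher range, the interest level is taken from the largest of the x, y, and z depression amounts, and the left pedal uses the same boundaries as the right pedal. The function names are likewise hypothetical.

```python
def quantize(depression, boundaries=(80, 120)):
    """Map a depression amount in 0..255 to the label value 0, 1, or 2."""
    if not 0 <= depression <= 255:
        raise ValueError("depression amount must be between 0 and 255")
    # Count how many boundaries the depression amount has reached.
    return sum(depression >= b for b in boundaries)


def interest_level(right_xyz):
    """Interest level from the right pedal: the deepest of the x, y, z axes decides."""
    return quantize(max(right_xyz))


def interest_genre(left_xyz):
    """Interest genre from the left pedal: only the z axis is used."""
    return quantize(left_xyz[2])


right = (10, 200, 0)    # pressed hard along y
left = (255, 255, 90)   # x and y are ignored for the genre

print(interest_level(right))  # 2
print(interest_genre(left))   # 1
```

With these definitions, pressing the right pedal fully along any one axis yields level 2, which agrees with the behavior described for the maximum interest level.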

When there are several genres of interest at a given moment, the data for each genre may be entered in quick succession, or depression amounts of the foot pedal 40 that represent combinations of genres may be set in advance and the pedal depressed to the corresponding position.

As described above, according to the other embodiment, the label data sequence of viewing statuses used for machine learning can be created without, for example, disturbing television viewing. Specifically, by associating the labeled viewing statuses with the amount of depression of a foot pedal such as one used for flight simulation games, a sequence of correct viewing status data for learning can be created in real time through operation with the feet.

In the embodiments described above, the degree of interest in a program obtained from the recognition result was described as an example of the viewing status output. However, the output is not limited to this; for example, the content of the interest or the viewer's preferences may be estimated.

<Execution program (viewing status recognition program)>
The viewing status recognition devices 10 and 30 described above can be configured by a computer that includes, for example, a CPU (Central Processing Unit), a volatile storage medium such as a RAM (Random Access Memory), a non-volatile storage medium such as a ROM (Read Only Memory), input devices such as a mouse, a keyboard, and a pointing device, a display unit that displays images and data, and an interface for communicating with external devices.

Accordingly, each function of the viewing status recognition devices 10 and 30 can be realized by causing the CPU to execute a program that describes the function. The program can also be stored and distributed on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory.

That is, in the present embodiment, an execution program (viewing status recognition program) that causes a computer (hardware) to execute the processing of each component described above is generated and installed on, for example, a general-purpose personal computer or server. The hardware described above and the software consisting of the program and the like then cooperate to realize the viewing status recognition processing, including the learning processing and recognition processing described above.

As described above, according to the present invention, the viewing status of a viewer can be recognized with high accuracy. Specifically, the present invention makes it possible to build a mechanism that describes the relationships among a plurality of available signals of different natures and automatically outputs the viewer's viewing status through machine learning processing. For example, in the present invention, the interest sections during television viewing are estimated by integrating, with a CRF, a plurality of pieces of information such as the operation history of a remote control or portable information terminal and person recognition results from camera video, and treating them as information labeled in time series. The present invention can thereby estimate the viewing status during television viewing, such as which scenes of a program interest the viewer, and can be used in information recommendation services such as UTAN (User Technology Assisted Navigation) that adaptively recommend programs and information according to the viewer's state of interest in the program. Specifically, for example, UTAN or a similar service to which the present invention is applied can estimate, by recognition technology, the viewer's degree of interest as it changes during a program and, using the subtitles and program information for that time period, present program recommendations, related keywords, and the like on a portable information terminal or television screen.

Furthermore, according to the present invention, by assigning labels not only to the interest level but also, for example, to the viewer's activities after viewing with interest, an action support system that reflects what was viewed with interest can be constructed.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.

DESCRIPTION OF SYMBOLS
10, 30 Viewing status recognition device
11 Information acquisition means
12 Analysis means
13 Histogram generation means
14 Representative value selection means
15 Synchronization symbol sequence generation means
16 Conversion means
17 Feature function group database
18 Statistical learning means
19 Learning database
20 Learning parameter storage means
21 Integration means
31 Foot operation means
32 Correct data conversion means
40 Foot pedal

Claims

1. A viewing status recognition device that recognizes a viewing status based on viewer information obtained when a viewer views a program, the device comprising:
histogram generation means for generating histogram data from a plurality of different data obtained as the viewer information;
representative value selection means for selecting a representative value for each unit time from the histogram data obtained by the histogram generation means;
synchronization symbol sequence generation means for generating synchronization symbol sequences by synchronizing the plurality of different data using the representative values obtained by the representative value selection means;
conversion means for converting the synchronization symbol sequences obtained by the synchronization symbol sequence generation means into time-series feature vector signals using preset feature functions; and
integration means for integrating the time-series feature vector signals obtained by the conversion means using learning parameters indicating preset weights, and recognizing the viewing status of the viewer.

2. The viewing status recognition device according to claim 1, further comprising statistical learning means for generating the learning parameters by statistical learning based on the time-series feature vector signals obtained by the conversion means and preset correct viewing status data.

3. The viewing status recognition device according to claim 2, wherein the statistical learning means newly generates a plurality of time-series symbol sequences based on relationships among the plurality of synchronization symbol sequences obtained by the synchronization symbol sequence generation means, and generates the learning parameters from the time-series symbol sequences by a preset machine learning process.

4. The viewing status recognition device according to claim 1, wherein the histogram generation means generates histograms based on a plurality of different unit times.

5. The viewing status recognition device according to claim 2, further comprising foot operation means for inputting, by the viewer's foot operation, information on which the correct viewing status data is based.

6. A viewing status recognition program for recognizing a viewing status based on viewer information obtained when a viewer views a program, the program causing a computer to function as:
histogram generation means for generating histogram data from a plurality of different data obtained as the viewer information;
representative value selection means for selecting a representative value for each unit time from the histogram data obtained by the histogram generation means;
synchronization symbol sequence generation means for generating synchronization symbol sequences by synchronizing the plurality of different data using the representative values obtained by the representative value selection means;
conversion means for converting the synchronization symbol sequences obtained by the synchronization symbol sequence generation means into time-series feature vector signals using preset feature functions; and
integration means for integrating the time-series feature vector signals obtained by the conversion means using learning parameters indicating preset weights, and recognizing the viewing status of the viewer.