JP2019046304A - Event series extraction apparatus, event series extraction method, and event extraction program

Info

Publication number: JP2019046304A
Application number: JP2017170423A
Authority: JP
Inventors: 大樹宮西; Daiki Miyanishi; 一晃川鍋; Kazuaki Kawanabe; 淳一郎平山; Junichiro Hirayama; 卓也前川; Takuya Maekawa
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2017-09-05
Filing date: 2017-09-05
Publication date: 2019-03-22
Anticipated expiration: 2037-09-05
Also published as: JP7071717B2

Abstract

To provide an event series extraction apparatus capable of extracting an event series on the basis of sensor signals so that it generalizes to unknown places, while taking the context of events into account.

SOLUTION: An event series extraction apparatus 1 stores, in a storage unit 40, a behavior label, an object label, and a position label for a subject 2 as a timeline, and extracts an event series. For a plurality of candidate combinations of object labels and position labels within a prescribed time window referenced to the behavior label, the event series extraction apparatus 1 extracts the most likely combination as an event, according to the likelihood that the behavior label, object label, and position label corresponding to each candidate occur in a language corpus.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for extracting an event sequence from information on a subject's perceptual experience.

With the advance of information and communication technology, techniques have been proposed that attach meta information (typically labels) to the various data generated in the course of each individual's life and classify the data accordingly (see, for example, Patent Document 1 and Non-Patent Document 1). From the data collected in this way, target data are searched for and extracted using the attached meta information as keywords.

That is, automatically recognizing one's own current state and surrounding circumstances is an important problem for understanding human daily life. Such technology is called "activity recognition," and its purpose is to discover and identify patterns of human behavior from a wide variety of sensor signals. With the development of sensor technology, activity recognition has been applied in a wide range of fields, such as monitoring patients with Alzheimer's disease, discovering behavior patterns in smart homes, recognizing nurses' activities to improve medical care, and lifelogging.

Many conventional techniques have addressed the problem of classifying time-series signals acquired from sensors into labels of basic activities performed in everyday living (activities of daily living), such as "walk," "eat," "drink," and "read" (for example, Non-Patent Document 2).

In daily life, however, we act on various objects in various places and perform daily activities one after another, as in "make coffee in the kitchen, move to the living room, and then drink the coffee in the living room." To recognize in detail such a series of actions, which expresses who is doing what, when, and where, it is not enough to recognize individual action labels. It is necessary to recognize "events," which are combinations of multiple semantic concepts such as action, object, and place, and further to correctly recognize a sequence of temporally consecutive events.

Patent Document 1: JP 2017-58729 A

Non-Patent Document 1: J. Gemmell, G. Bell, and R. Lueder, "MyLifeBits: A Personal Database for Everything," Communications of the ACM, 49(1):88-95, 2006.

Non-Patent Document 2: F. J. Ordóñez and D. Roggen, "Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition," Sensors, 16(1):115, 2016.

When attempting to estimate semantic concepts from sensor signals and to generate an event sequence (event timeline) of daily activities from the many predicted semantic concepts, the following problems arise.

Each semantic concept represents an action observed in the real environment, such as "take off," "put on," "walk," or "sit"; an object, such as "shoes," "slippers," "sofa," or "television"; or a place, such as "entrance" or "living room," and is given its observed start and end times. In a real environment, however, multiple semantic concepts (a concept stream) appear along the time axis and the times at which they are observed do not coincide, so it is difficult to generate events by combining concepts with rules. In addition, predicting semantic concepts from sensor signals does not yield only correct concepts, so the appropriate ones must be selected. Furthermore, previously proposed research on attaching language to real-world behavior has been conducted in limited spaces where the place does not change. In daily life, events occur one after another in various places, so the event timeline must be built so that it generalizes to unknown places while taking the context of the events into account.

The present invention has been made to solve the above problems, and its object is to provide an event sequence extraction device, an event sequence extraction method, and an event extraction program capable of extracting an event timeline (event sequence) from sensor signals so that it generalizes to unknown places while taking the context of events into account.

According to one aspect of the present invention, an event sequence extraction device includes: a first sensor for acquiring data for identifying the object of a subject's perceptual experience; a second sensor for acquiring the subject's behavior and position information for identifying the subject's position at the time of the perceptual experience; a storage device for storing information; and event sequence extraction means that stores in the storage device, as a timeline, an action label, an object label, and a position label that respectively identify the subject's action, the object, and the position, and that extracts an event sequence. The event sequence extraction means includes: concept estimation processing means that predicts the action label, the object label, and the position label for each predetermined sliding window and merges identical labels and arranges them on the timeline; language model comparison means that, for the merged action labels, object labels, and position labels on the timeline, considers a plurality of candidate combinations of object labels and position labels within a predetermined time window referenced to the action label and extracts the most likely combination as an event, according to the likelihood that the action label, object label, and position label corresponding to each candidate occur in a language corpus; and event sequence selection means that extracts an event sequence from the time series of the extracted events.
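The merging step performed by the concept estimation processing means can be pictured with a minimal sketch. It assumes that one label has already been predicted per sliding window and that windows are laid out at a fixed stride; the function name `merge_window_labels` and the parameters `window_len` and `stride` are hypothetical and do not come from the specification.

```python
def merge_window_labels(window_labels, window_len, stride):
    """Merge consecutive sliding windows that carry the same predicted label.

    window_labels[i] is the label predicted for the window that starts at
    i * stride seconds and lasts window_len seconds (an assumed layout).
    Returns a list of (label, start, end) intervals on the timeline.
    """
    timeline = []
    for i, label in enumerate(window_labels):
        start = i * stride
        end = start + window_len
        if timeline and timeline[-1][0] == label:
            # Same label as the previous window: extend the open interval.
            prev_label, prev_start, _ = timeline[-1]
            timeline[-1] = (prev_label, prev_start, end)
        else:
            # Label changed: open a new interval.
            timeline.append((label, start, end))
    return timeline


# One timeline per concept type (action, object, place) would be built this way.
action_timeline = merge_window_labels(["walk", "walk", "sit", "sit", "sit"], 2.0, 1.0)
```

Because the windows overlap, adjacent intervals with different labels may also overlap in time, which is one reason the observed times of concepts do not line up exactly.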

Preferably, the language model comparison means extracts a likely combination as an event according to both the likelihood of occurrence in the language corpus and a likelihood based on how close in time the merged action label, object label, and position label are to one another.
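As a rough illustration of combining the two likelihoods, the sketch below scores every (object, place) pair near an action and keeps the best triple. It is not the patented model: the dictionary `corpus_logp` is a hypothetical stand-in for a language model trained on a corpus, and the linear time penalty, the `window` cutoff, the `decay` weight, and the floor value for unseen triples are all assumptions made for the example.

```python
import math
from itertools import product


def interval_gap(a, b):
    """Seconds separating two (start, end) intervals; 0.0 if they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


def best_event(action, objects, places, corpus_logp, window=5.0, decay=1.0):
    """Pick the most likely (action, object, place) label triple.

    action, objects[i], and places[j] are (label, start, end) tuples.
    corpus_logp maps a label triple to a log-likelihood (assumed input).
    """
    a_label, a_start, a_end = action
    a_span = (a_start, a_end)
    best, best_score = None, -math.inf
    for obj, place in product(objects, places):
        gap_o = interval_gap(a_span, obj[1:])
        gap_p = interval_gap(a_span, place[1:])
        if gap_o > window or gap_p > window:
            continue  # outside the time window around the action
        triple = (a_label, obj[0], place[0])
        lm_term = corpus_logp.get(triple, -20.0)  # low floor for unseen triples
        time_term = -decay * (gap_o + gap_p)      # closer in time scores higher
        score = lm_term + time_term
        if score > best_score:
            best, best_score = triple, score
    return best, best_score


action = ("drink", 10.0, 14.0)
objects = [("cup", 9.0, 13.0), ("shoes", 12.0, 13.0)]
places = [("living room", 0.0, 30.0), ("entrance", 15.0, 18.0)]
corpus_logp = {
    ("drink", "cup", "living room"): -2.0,
    ("drink", "cup", "entrance"): -8.0,
    ("drink", "shoes", "living room"): -12.0,
}
event, score = best_event(action, objects, places, corpus_logp)
```

Adding the two terms in log space corresponds to multiplying the corpus likelihood by the temporal-closeness likelihood, so a triple must be both linguistically plausible and close in time to win.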

Preferably, the concept estimation processing means includes a gated recurrent unit (GRU) that predicts the action label, the object label, and the position label by sequence labeling from the signals of the first and second sensors in each sliding window.

Preferably, the language model comparison means includes a gated recurrent unit that takes as input word embedding vectors that reflect word similarity.

Preferably, the event sequence selection means extracts the event sequence using a hidden state model in which the dissimilarity between the vectors representing events is used as the event transition probability.
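One way to read this selection step is as Viterbi decoding over candidate events. The sketch below is an interpretation under stated assumptions, not the specification's implementation: each time step is assumed to offer a few candidate events with a log-likelihood and a representation vector, cosine distance is assumed as the dissimilarity, and the per-row normalization and the smoothing constant `eps` are added only so that the dissimilarities behave as probabilities. The names `viterbi_events` and `cosine_distance` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def viterbi_events(steps, eps=1e-3):
    """steps[t] is a list of (event, log_likelihood, vector) candidates.

    Returns the highest-scoring sequence of events, one per time step.
    """
    # score[i]: best log-score of any path ending in candidate i of the current step
    score = [cand[1] for cand in steps[0]]
    back = []
    for t in range(1, len(steps)):
        prev, cur = steps[t - 1], steps[t]
        # Transition log-probabilities from normalized dissimilarities (assumption).
        trans = []
        for _, _, vec_i in prev:
            weights = [eps + cosine_distance(vec_i, vec_j) for _, _, vec_j in cur]
            total = sum(weights)
            trans.append([math.log(w / total) for w in weights])
        new_score, pointers = [], []
        for j, (_, logp_j, _) in enumerate(cur):
            i_best = max(range(len(prev)), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[i_best] + trans[i_best][j] + logp_j)
            pointers.append(i_best)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final candidate.
    j = max(range(len(score)), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [steps[t][k][0] for t, k in enumerate(path)]


steps = [
    [("make coffee in kitchen", -1.0, [1.0, 0.0]),
     ("drink coffee in kitchen", -3.0, [0.0, 1.0])],
    [("drink coffee in living room", -1.0, [0.0, 1.0]),
     ("make coffee in living room", -2.5, [1.0, 0.0])],
]
sequence = viterbi_events(steps)
```

With dissimilarity as the transition weight, a path is rewarded for moving between events whose vectors differ, which matches the wording of the aspect above; whether additional terms enter the transition model is not stated in this section.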

According to another aspect of the present invention, an event sequence extraction method for extracting, from sensor data in a real environment, information on a sequence of events that occurred to a subject includes the steps of: acquiring, with a first sensor, data for identifying the object of the subject's perceptual experience; acquiring, with a second sensor, the subject's behavior and position information for identifying the subject's position at the time of the perceptual experience; predicting, for each predetermined sliding window, an action label, an object label, and a position label that respectively identify the action, the object, and the position, and merging identical labels and arranging them on a timeline; storing the action label, the object label, and the position label for the subject in a storage device as a timeline; for the merged action labels, object labels, and position labels on the timeline, considering a plurality of candidate combinations of object labels and position labels within a predetermined time window referenced to the action label and extracting the most likely combination as an event, according to the likelihood that the action label, object label, and position label corresponding to each candidate occur in a language corpus; and extracting an event sequence from the time series of the extracted events.

According to yet another aspect of the present invention, an event sequence extraction program causes a computer to execute event sequence extraction processing for extracting, from sensor data in a real environment, information on a sequence of events that occurred to a subject. The program causes the computer to execute the steps of: acquiring, with a first sensor, data for identifying the object of the subject's perceptual experience; acquiring, with a second sensor, the subject's behavior and position information for identifying the subject's position at the time of the perceptual experience; predicting, for each predetermined sliding window, an action label, an object label, and a position label that respectively identify the action, the object, and the position, and merging identical labels and arranging them on a timeline; storing the action label, the object label, and the position label for the subject in a storage device as a timeline; for the merged action labels, object labels, and position labels on the timeline, considering a plurality of candidate combinations of object labels and position labels within a predetermined time window referenced to the action label and extracting the most likely combination as an event, according to the likelihood that the action label, object label, and position label corresponding to each candidate occur in a language corpus; and extracting an event sequence from the time series of the extracted events.

(Meaning of terms)
In this specification, the terms have the following meanings.

1) An "action label" is a predefined term denoting an action, assigned on the basis of the signal of a sensor worn by the subject to the subject's action that corresponds to that signal.

2) A "semantic concept" is information that characterizes the subject's behavior, specified by the subject's action label, an object recognized as the target of the subject's action, and the place where the subject is present. It consists of information representing action labels observed in the real environment, such as "take off," "put on," "walk," and "sit"; objects that are the targets of actions, such as "shoes," "slippers," "sofa," and "television"; and places, such as "entrance" and "living room." Each semantic concept is given its observed start and end times.

3) A "concept stream" is a plurality of semantic concepts arranged along the time axis.

4) An "event" consists of a plurality of semantic concepts and is something recognized, in daily life, as a series of the subject's actions that has a single coherent meaning.
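The four terms defined above map naturally onto small data structures. The following sketch is only one possible representation: the class and field names are hypothetical, and modeling an event as exactly one action, one object, and one place is an assumption, since the definition only requires "a plurality of semantic concepts."

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SemanticConcept:
    kind: str      # "action", "object", or "place"
    label: str     # e.g. "sit", "sofa", "living room"
    start: float   # observed start time in seconds
    end: float     # observed end time in seconds


@dataclass(frozen=True)
class Event:
    action: SemanticConcept
    obj: SemanticConcept
    place: SemanticConcept

    def span(self) -> Tuple[float, float]:
        """Earliest start and latest end over the three concepts."""
        parts = (self.action, self.obj, self.place)
        return min(c.start for c in parts), max(c.end for c in parts)


def concept_stream(concepts: List[SemanticConcept]) -> List[SemanticConcept]:
    """A concept stream: semantic concepts ordered along the time axis."""
    return sorted(concepts, key=lambda c: (c.start, c.end))


sit = SemanticConcept("action", "sit", 12.0, 20.0)
sofa = SemanticConcept("object", "sofa", 10.0, 15.0)
living = SemanticConcept("place", "living room", 5.0, 40.0)
stream = concept_stream([sit, sofa, living])
event = Event(action=sit, obj=sofa, place=living)
```

Keeping the start and end times on each concept, rather than on the event, preserves the fact that the three concepts are generally observed over different intervals.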

According to the event sequence extraction device, the event sequence extraction method, and the event extraction program of the present invention, an event timeline (event sequence) can be extracted from sensor signals so that it generalizes to unknown places.

A schematic diagram showing an example of the overall configuration of the event sequence extraction device 1 according to the present embodiment.

A schematic diagram showing the configuration of the wearable camera 10 shown in FIG. 1.

A schematic diagram showing the configuration of the motion sensor 20 shown in FIG. 1.

A schematic diagram showing the configuration of the information processing device 50 shown in FIG. 1.

A diagram showing an example of an event timeline created by combining semantic concepts predicted from sensor signals.

A first conceptual diagram for explaining the procedure for generating an event sequence.

A second conceptual diagram for explaining the procedure for generating an event sequence.

A third conceptual diagram for explaining the procedure for generating an event sequence.

A table showing the concept names for each concept type: action, object, and place.

A conceptual diagram showing the configuration of an RNN language model using word embedding vectors and a GRU.

A conceptual diagram showing an example of event transitions.

A diagram showing the floor plan of the house used in the experiment and the 20 place-specific actions performed by the subject.

A diagram showing the evaluation results for the extracted event timelines.

Hereinafter, the configurations of the event sequence extraction device, the event sequence extraction method, and the event extraction program according to an embodiment of the present invention will be described with reference to the drawings. In the following embodiment, components and processing steps given the same reference numerals are identical or equivalent, and their description is not repeated unless necessary.

[Embodiment]
In this specification, the subject's "perceptual experience" may be any kind of perception (for example, vision, hearing, smell, taste, or touch) as long as it can be acquired or measured with some device. That is, any perception is acceptable as long as it can be stored as data and reproduced afterward.

Vision can be converted to data as still or moving images captured with a camera, and the images can be reproduced on a television monitor or the like. Hearing can be converted to data as audio acquired with a microphone, and the audio can be reproduced with a speaker or the like. Smell and taste can be converted to data using an odor sensor and a taste sensor, respectively. Touch can be converted to data using a contact pressure sensor or the like. These perceptions can also be reproduced using suitable devices.

The event sequence extraction device according to the present embodiment generates a timeline of events in daily life by taking into account at least three attributes of the subject in the real world: action, object, and position. First, it addresses the mismatch between the times at which concepts are observed by modeling the temporal proximity among the time during which a person acts, the time during which the object of the action is observed, and the time during which the person is at the place of the action. Second, to find a plausible combination among multiple concepts, it learns combinations of concepts that can occur in the real world from an external language corpus that reflects real-world commonsense knowledge. Third, it obtains an information representation vector for each event using a recurrent neural network and uses this vector to express the sequential change of events. Finally, it integrates these models using a hidden Markov model and generates a plausible event timeline representing daily life.

External temporal changes that occur as the subject's "behavior" include actions the subject performs voluntarily or passively, for example movements of the hands, head, body, and legs in daily life. Internal temporal changes that occur in the subject as "behavior" include changes that arise voluntarily or passively inside the subject's body, for example changes in cerebral blood flow patterns, body temperature patterns, and sweating patterns.

The following description covers, as one embodiment, processing that associates video perceived through the subject's vision (including still and moving images), behavior as an external temporal change of the subject, and position information with one another, stores them in a storage device, and extracts them as an event sequence.

Accordingly, in the following description, the motion corresponding to an "action" is a concept that includes changes over time produced by moving part of the subject's body. Thus, in the present embodiment, the perceptual experience includes video perceived through the subject's vision, and the temporal change of the subject includes information indicating the movement of part of the subject's body.

[System configuration]
Next, the overall configuration of the search system according to the present embodiment is illustrated.

FIG. 1 is a schematic diagram showing an example of the overall configuration of the event sequence extraction device 1 according to the present embodiment. Referring to FIG. 1, the event sequence extraction device 1 basically has an event sequence extraction function.

For the event sequence extraction function of the event sequence extraction device 1, the subject 2 wears a wearable camera 10 in order to collect the vision of the subject 2, that is, the video the subject 2 sees. To collect the behavior of the subject 2, the subject 2 also wears a motion sensor 20. To measure the behavior of the subject 2 more accurately, a plurality of motion sensors 20 may be worn, for example at three locations: both arms (for example, the right wrist and the left wrist) and the head.

The wearable camera 10 sends the captured video data at predetermined intervals or for each event. Alternatively, the captured video data may be stored in a memory built into the wearable camera 10. The motion sensor 20 sends, at predetermined intervals, motion data (for example, acceleration and angular velocity), which is information indicating the behavior of the subject 2, together with position information of the subject 2. Alternatively, the acquired motion data and position information may be stored at a predetermined cycle over a predetermined period in a memory built into the motion sensor 20 and sent together later. The following description mainly covers the case where acceleration along three axes and angular velocity about three axes are used as the motion data.

However, the video data sent from the wearable camera 10 and the motion data sent from the motion sensor 20 must be synchronized with each other, so a timestamp containing, for example, a time or a counter value output from a common timer is attached to the collected data.
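Once both streams carry timestamps from a common time base, they can be aligned by nearest timestamp. The sketch below shows one simple way to do this; the function name `align_by_timestamp`, the `tolerance` parameter, and the choice of nearest-neighbor matching are assumptions for illustration and are not described in the specification.

```python
from bisect import bisect_left


def align_by_timestamp(frame_times, motion_times, tolerance=0.05):
    """Pair each video frame with the nearest motion sample in time.

    Both lists hold timestamps in seconds and must be sorted ascending.
    Returns (frame_index, motion_index) pairs; frames with no motion
    sample within `tolerance` seconds are skipped.
    """
    pairs = []
    for f_idx, t in enumerate(frame_times):
        pos = bisect_left(motion_times, t)
        # The nearest sample is either just before or at the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(motion_times)]
        if not candidates:
            continue
        m_idx = min(candidates, key=lambda i: abs(motion_times[i] - t))
        if abs(motion_times[m_idx] - t) <= tolerance:
            pairs.append((f_idx, m_idx))
    return pairs


frame_times = [0.0, 0.5, 1.0, 5.0]
motion_times = [0.0, 0.02, 0.49, 0.98, 2.0]
pairs = align_by_timestamp(frame_times, motion_times)
```

The binary search keeps the alignment at O(n log m) for n frames and m motion samples, which matters because motion sensors typically sample far more often than a camera produces frames.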

As its event sequence extraction function, the event sequence extraction device 1 includes a video collection unit 30, a video feature calculation unit 32, a motion information collection unit 34, a motion feature calculation unit 36, a position information collection unit 35, a position feature calculation unit 37, an event sequence extraction unit 38, and a data storage unit 40.

The video collection unit 30 and the motion information collection unit 34 provide the function of acquiring the perceptual experience of the subject 2 and the external or internal temporal changes that occurred during that perceptual experience. That is, the video collection unit 30 receives the video data sent by the wearable camera 10, and the motion information collection unit 34 receives the motion data sent by the motion sensor 20.

The position information collection unit 35 receives the position information for the same time as each item of motion data.

The video feature calculation unit 32, the motion feature calculation unit 36, the position feature calculation unit 37, and the event sequence extraction unit 38 provide the function of generating identification information based on the temporal changes that occurred in the subject 2. In the present embodiment, as a preferred form, the identification information is generated based on features calculated from the temporal changes that occurred in the subject 2.

More specifically, the video feature calculation unit 32 calculates features contained in the video data received by the video collection unit 30 (hereinafter also called "video features" to distinguish them from the features contained in the motion data). The calculation and use of the video features are described in detail later. The motion feature calculation unit 36 calculates features contained in the motion data received by the motion information collection unit 34 (hereinafter also called "motion features" to distinguish them from the "video features" above). The calculation and use of the motion features are described in detail later. The position feature calculation unit 37 calculates features from the position data received by the position information collection unit 35 (hereinafter also called "position features" to distinguish them from the "video features" and "motion features" above). The calculation and use of the position features are described in detail later.

The event sequence extraction unit 38 refers to the motion features calculated for each item of motion data 40.1 from the sensors and determines an action label 40.2 as the identification information to be associated with that motion data 40.1. That is, based on the motion features calculated by the motion feature calculation unit 36, the event sequence extraction unit 38 determines the action label 40.2 as information characterizing the behavior the subject 2 performed when the motion data 40.1 was acquired, and associates the determined action label 40.2 with the corresponding motion data 40.1. The data storage unit 40 then stores each pair of motion data 40.1 and its corresponding action label 40.2. In other words, the data storage unit 40 provides the function of storing sensor information characterizing the motion of the subject 2 in association with information characterizing the corresponding action.

The event sequence extraction unit 38 refers to the video features calculated for each item of video data 42.1 and determines an object label 42.2 as the identification information to be associated with that video data 42.1. That is, based on the video features calculated by the video feature calculation unit 32, the event sequence extraction unit 38 estimates the type of the video data 42.1, determines the object label 42.2 as information characterizing the object targeted by the behavior the subject 2 performed when the video data was captured, and associates the determined object label 42.2 with the corresponding video data 42.1. The data storage unit 40 then stores each pair of video data 42.1 and its corresponding object label 42.2. In other words, the data storage unit 40 provides the function of storing information characterizing the object targeted by the action of the subject 2 in association with the corresponding perceptual experience.

Further, based on the feature quantities calculated from the position data 44.1 by the position feature quantity calculation unit 37, the event sequence extraction unit 38 calculates a position label 44.2, which is identification information specifying the place where the subject 2 was located when the motion data was collected. The data storage unit 40 stores each pair of position data 44.1 and its position label 44.2.

さらに、データ格納部４０は、後述するような言語コーパスデータ４８も格納する。 Furthermore, the data storage unit 40 also stores language corpus data 48 as described later.

[Hardware configuration]
Next, the hardware used for the event sequence extraction device 1 according to the present embodiment will be described.
(Wearable camera 10)
FIG. 2 is a schematic view showing the configuration of the wearable camera 10 shown in FIG. 1. Referring to FIG. 2, the wearable camera 10 includes an imaging unit 102, a control unit 104, a communication unit 108, and a battery 110 that supplies power to each unit. The control unit 104 gives control commands to the imaging unit 102 to acquire video data at a target cycle or timing, and sends the acquired video data out via the communication unit 108. The control unit 104 has a timer 106 and attaches, to each item of acquired video data, information indicating when it was acquired.

Although not particularly limited, the video feature quantity calculation unit 32 may, for example, extract the largest object in the field of view of the wearable camera 10 from the image and calculate its feature quantities, and the event sequence extraction unit 38 may then execute recognition processing using a recurrent neural network, described later, to classify the object and attach label information, thereby assigning an "object label".
(Motion sensor 20)
FIG. 3 is a schematic view showing the configuration of the motion sensor 20 shown in FIG. 1. Referring to FIG. 3, the motion sensor 20 includes an acceleration sensor 202, a gyro sensor 204, a position sensor 205, a control unit 206, a communication unit 210, and a battery 212 that supplies power to the control unit 206 and the communication unit 210. The control unit 206 acquires the acceleration data output from the acceleration sensor 202, the angular velocity data output from the gyro sensor 204, and the position data output from the position sensor 205, and sends these data out via the communication unit 210. The control unit 206 has a timer 208 and attaches, to the data it sends, information indicating when each item of data was acquired.

For outdoor use, the position sensor may be any sensor that outputs position information, such as latitude and longitude, measured by a positioning device such as GPS. For indoor use, it is possible to use position information from beacon transmitters installed in each room that use Bluetooth (registered trademark) signals or the like, position information from wireless tag devices that use signals in a predetermined frequency band, or a sensor that outputs position information calculated from, for example, the radio field strength from a plurality of wireless LAN access points. However, the method of acquiring position information is not limited to these.
(Information processing device 50)
FIG. 4 is a schematic view showing the configuration of the information processing device 50 shown in FIG. 1. A computer conforming to a general-purpose architecture is typically employed as the information processing device 50. More specifically, referring to FIG. 4, the information processing device 50 includes a processor 502, a main memory 504, a network interface 506, a communication interface 508, an input unit 510, an output unit 512, and a secondary storage unit 520. These components are connected via a bus 514 so that they can exchange data with one another.

プロセッサ５０２は、主メモリ５０４に展開されたプログラムコードを指定された順序に従って実行することで、後述するような各種処理を実現する。プロセッサ５０２としては、シングルコアまたはマルチコアのいずれの構成を採用してもよいし、複数のプロセッサを用いてもよい。主メモリ５０４は、典型的には、ＤＲＡＭ（Dynamic Random Access Memory）のような揮発性の記憶装置が用いられる。 The processor 502 executes the program codes expanded in the main memory 504 in the specified order to realize various processes as described later. As the processor 502, either a single core configuration or a multi-core configuration may be adopted, or a plurality of processors may be used. As the main memory 504, typically, a volatile storage device such as a dynamic random access memory (DRAM) is used.

The network interface 506 exchanges data with other devices via a LAN (Local Area Network) or the like. The communication interface 508 exchanges data with the wearable camera 10 and the motion sensor 20 shown in FIG. 1. Typically, a device that exchanges packets over a wired or wireless link, such as Ethernet (registered trademark), is employed as the communication interface 508. A wireless device such as Bluetooth (registered trademark) is preferably employed as the communication interface 508. However, the network interface 506 and the communication interface 508 may be implemented by a common device.

入力部５１０は、ユーザからの操作を受付けるデバイスであり、例えば、キーボード、マウス、タッチパネルなどにより構成される。出力部５１２は、ユーザに対して各種情報を提示、または、他の装置に対して各種データを出力するデバイスであり、例えば、ディスプレイ、各種インジケータ、プリンタなどにより構成される。 The input unit 510 is a device that receives an operation from a user, and is configured of, for example, a keyboard, a mouse, a touch panel, and the like. The output unit 512 is a device that presents various information to the user or outputs various data to another device, and includes, for example, a display, various indicators, and a printer.

The secondary storage unit 520 stores the collected video data and motion data, as well as the OS (Operating System) and application programs executed by the processor 502. For example, the secondary storage unit 520 stores a feature quantity calculation program 522 for realizing the functions of the video collection unit 30, the motion information collection unit 34, the position information collection unit 35, the video feature quantity calculation unit 32, the motion feature quantity calculation unit 36, and the position feature quantity calculation unit 37, and an event sequence extraction program 524 for realizing the event sequence extraction function. The feature quantity calculation program 522 and the event sequence extraction program 524 may carry out their processing using libraries or the like provided by the OS. Even in this case, these programs can be included in the scope of the present invention.

Typically, the feature quantity calculation program 522 and the event sequence extraction program 524 are distributed, together or separately, on a recording medium such as an optical disc. Alternatively, the feature quantity calculation program 522 and the event sequence extraction program 524 may be distributed as downloads via an internetwork or the like. In this case, the recording medium itself storing the feature quantity calculation program 522 and/or the event sequence extraction program 524 can also be included in the scope of the present invention.

図５は、センサ信号から予測した意味コンセプトを組み合わせて、イベントタイムラインを作成した例を示す図である。 FIG. 5 is a diagram showing an example of creating an event timeline by combining semantic concepts predicted from sensor signals.

図５に示した例では、イベント系列抽出部３８は、タイムスタンプが付与された意味コンセプト（walk、sit_down、coffee、cracker、kitchen、living_room）をセンサ信号から推定し、複数のコンセプトの中から、適切な組み合わせを選んで尤もらしいイベント系列を生成する。たとえば、以下のようなイベント系列が生成されうる。 In the example shown in FIG. 5, the event sequence extraction unit 38 estimates the semantic concept (walk, sit_down, coffee, cracker, kitchen, living_room) to which the time stamp is given from the sensor signal, and from among a plurality of concepts, Choose an appropriate combination and generate a plausible event sequence. For example, the following event sequence may be generated.

[walk, *, living_room] → [put, cracker, living_room] → [eat, cracker, living_room]
That is, the event sequence extraction unit 38 generates a temporally ordered sequence of events related to daily living activities from the given sensor data. The event sequence extraction unit 38 combines a plurality of semantic concepts derived from the sensor signals and finally creates a timeline of events.

図６は、イベント系列抽出部３８が、イベント系列を生成する手順を説明するための第１の概念図であり、図７は、イベント系列を生成する手順を説明するための第２の概念図であり、図８は、イベント系列を生成する手順を説明するための第３の概念図である。 FIG. 6 is a first conceptual diagram for explaining a procedure of generating an event sequence by the event sequence extraction unit 38, and FIG. 7 is a second conceptual diagram for explaining a procedure of generating an event sequence. FIG. 8 is a third conceptual diagram for describing a procedure for generating an event sequence.

First, as shown in FIG. 6, the event sequence extraction unit 38 converts the signals obtained by sensing the real environment into semantic concepts that represent their content. A semantic concept consists of a concept name, a concept type, and the start and end times at which the concept was observed.

FIG. 6 shows the processing that the event sequence extraction unit 38 executes on the motion data. Here the concept type is "action": an action label is computed for each predetermined time interval (sliding window), and the concept names "walk", "sit down", and "drink" are assigned.

より一般には、コンセプト名はセンサ信号の意味を表すラベルであり、コンセプトタイプは行動・物体・場所のカテゴリを表す。 More generally, the concept name is a label indicating the meaning of the sensor signal, and the concept type indicates the action / object / location category.

例えば、何かを飲んだときのセンサ信号は、行動を表す意味コンセプトに変換される。センサ信号を一旦意味コンセプトに変換さえすれば、イベント系列の予測を行うため、任意の環境センサまたはウェアラブルセンサを使用することができる。 For example, the sensor signal when you drink something is converted into a semantic concept that represents an action. Once the sensor signal has been converted to a semantic concept, any environmental or wearable sensor can be used to predict the event sequence.

図７は、イベント候補を生成する手続きを示す図である。 FIG. 7 is a diagram showing a procedure for generating an event candidate.

Next, within a predetermined time window, the event sequence extraction unit 38 generates event candidates by selecting, from among the semantic concepts, combinations of each action concept with the concepts that are temporally close to it.

図８は、イベント系列の生成方法を説明するための図である。 FIG. 8 is a diagram for explaining a method of generating an event sequence.

From the event candidates obtained as shown in FIG. 7, the event sequence extraction unit 38 selects a plausible event sequence, as described later, using three cues: the temporal relationships between semantic concepts, external language resources, and a model of the changes between events.
[Processing of the event sequence extraction unit 38]
In the following, as an example of the processing performed by the event sequence extraction unit 38, a framework for generating a timeline of events related to daily living activities in an indoor environment is described.

This procedure consists of three components: creating semantic concepts from sensor signals; creating event candidates; and learning the transition and output probabilities of a hidden Markov model (HMM) using real-world characteristics and then predicting the event sequence.

イベント系列抽出部３８において、実世界の状態をイベントの系列ｅを以下のように定義する。 In the event sequence extraction unit 38, the real world state is defined as the event sequence e as follows.

Each event $e_i$ is represented by a tuple $(c_1, \ldots, c_N)$ of semantic concepts $c$. Here, a "tuple" means a set made up of a plurality of components.

Here, each semantic concept $c$ has three attributes: a timestamp $c.t$ giving the time at which the concept was observed, a concept name $c.w$, and a concept type $c.type \in \{a, o, p\}$, where $a$ denotes an action performed by the subject, $o$ an object observed by the subject, and $p$ the place where the subject is.

In the present embodiment, concept names are written in the order $(c_1.type = a,\ c_2.type = o,\ c_3.type = p)$, so that the elements are the concept names of the action, the object, and the place, respectively. That is, in the present embodiment, the tuple $(c_1, c_2, c_3)$ of three semantic concepts is defined as the content of an event. The content of one event is thus $(c_1.w, c_2.w, c_3.w)$, and each event is defined by this content together with a timestamp $e.t$ giving the start time of the event.

However, when no semantic concept $c_2$ representing an object is observed (for example, for "walk" or "stand"), the symbol "*" is used in its place. The possible concept names are determined in advance for each concept type $c.type$.
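The concept and event definitions above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, the use of seconds for timestamps, and the `hms` helper are all assumptions made for illustration and do not appear in the source. It takes the event time to be the start time of the action concept, as the description defines it.

```python
from dataclasses import dataclass
from typing import Optional

# Concept types from the text: a = action, o = object, p = place.
CONCEPT_TYPES = {"a", "o", "p"}
NO_OBJECT = "*"  # placeholder used when no object concept is observed


@dataclass(frozen=True)
class Concept:
    w: str        # concept name, e.g. "drink"
    type: str     # concept type: "a", "o" or "p"
    t: float      # observed start time (seconds; the unit is an assumption)
    t_end: float  # observed end time

    def __post_init__(self):
        if self.type not in CONCEPT_TYPES:
            raise ValueError(f"unknown concept type: {self.type!r}")


@dataclass(frozen=True)
class Event:
    action: Concept
    obj: Optional[Concept]  # None when no object was observed
    place: Concept

    @property
    def t(self) -> float:
        # Event time e.t is taken as the start time of the action concept.
        return self.action.t

    def content(self) -> tuple:
        # (c1.w, c2.w, c3.w), with "*" standing in for a missing object.
        obj_name = self.obj.w if self.obj is not None else NO_OBJECT
        return (self.action.w, obj_name, self.place.w)


def hms(h: int, m: int, s: int) -> float:
    """Hypothetical helper: clock time to seconds since midnight."""
    return float(h * 3600 + m * 60 + s)


drink = Event(
    Concept("drink", "a", hms(8, 31, 55), hms(8, 31, 58)),
    Concept("coffee", "o", hms(8, 31, 50), hms(8, 32, 0)),
    Concept("living_room", "p", hms(8, 30, 0), hms(8, 40, 0)),
)
walk = Event(
    Concept("walk", "a", hms(8, 29, 0), hms(8, 29, 30)),
    None,
    Concept("living_room", "p", hms(8, 28, 0), hms(8, 40, 0)),
)
# A timeline is the list of events ordered by their timestamps.
timeline = sorted([drink, walk], key=lambda e: e.t)
```

Keeping the object slot optional and rendering it as "*" only in `content()` is one design choice among several; storing a literal "*" concept would work equally well.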

FIG. 9 is a table showing the concept names for each of the action, object, and place concept types.

As shown in FIG. 9, each event represents a daily living activity, so the event time $e.t$ is defined as the start time of the concept representing the action in the event. For example, the situation "a man is indoors, drinking (drink) coffee (coffee) in the living room (living_room) at 8:31:55" can be expressed as $e = (c_1.w, c_2.w, c_3.w)$ = (drink, coffee, living_room) with $e.t$ = 8:31:55. All events are ordered in time by their timestamps $e.t$ to form a timeline.
(Semantic concept estimation process)
Next, the procedure by which the event sequence extraction unit 38 creates, from sensor signals, the action, object, and place semantic concepts that may constitute events is described in more detail with reference to FIG. 6.

Since the sensor signal is continuous, a sliding window method is used for feature extraction. In the sliding window method, a label is predicted by sequence labeling for the sensor signal within each time window (each window contains N samples).
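As a rough illustration of the sliding window step, the sketch below cuts a sampled signal into windows of N samples. The function name and the `step` parameter (how far the window advances each time) are assumptions; the source gives the window size N but does not state the stride or whether windows overlap.

```python
def sliding_windows(signal, n, step):
    """Yield (start_index, window) pairs, each window holding n samples.

    `step` is the number of samples the window advances each time; a
    trailing remainder shorter than n samples is dropped.
    """
    if n <= 0 or step <= 0:
        raise ValueError("n and step must be positive")
    for start in range(0, len(signal) - n + 1, step):
        yield start, signal[start:start + n]


# Toy signal: ten samples, windows of N = 4 samples advancing by 2.
samples = list(range(10))
windows = list(sliding_windows(samples, n=4, step=2))
```

Each window would then be passed to the sequence labeler to obtain one label per window.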

系列ラベリングのために、特に限定されないが、たとえば、再起型ニューラルネットワークの一種であるゲート付き再帰型ユニット（ＧＲＵ：Gated Recurrent Unit）を用いることができる。 Although not particularly limited, for example, a gated recurrent unit (GRU), which is a type of recurrent neural network, can be used for sequence labeling.

このようなＧＲＵについては、たとえば、以下の文献に開示がある。 Such GRUs are disclosed, for example, in the following documents.

Known document 1: Chung, J.; Gülçehre, Ç.; Cho, K.; and Bengio, Y. "Empirical evaluation of gated recurrent neural networks on sequence modeling." In The NIPS 2014 Deep Learning and Representation Learning Workshop.
Here, let $x_t$ be the input data of the GRU at each step $t$ and $h_t$ the hidden state. The GRU function is defined as follows.

ここで、σはシグモイド関数、○はアダマール積、を表し、以下のように定義される。 Here, σ represents a sigmoid function, and ○ represents a Hadamard product, which is defined as follows.

Here, $n_I$ is the size of the input vector, $n_H$ is the size of the hidden vector, and $b^{(z)}$, $b^{(r)}$, and $b^{(h)}$ are bias terms. The above function is written as $h_t = \mathrm{GRU}(x_t, h_{t-1})$.
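The GRU equations themselves are not reproduced in this text. As a hedged illustration of the notation $h_t = \mathrm{GRU}(x_t, h_{t-1})$, the sketch below implements one step of a commonly used GRU formulation with an update gate, a reset gate, and bias terms. The parameter names (`Wz`, `Uz`, `bz`, and so on) and the exact gating convention are assumptions; the patent's own equations may differ in detail, for example in which of $z$ and $1 - z$ multiplies the previous state.

```python
import math


def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def gru_step(x, h_prev, p):
    """One GRU step: x has n_I entries, h_prev and the result have n_H.

    p maps names to parameters: W* are n_H x n_I, U* are n_H x n_H,
    b* are bias vectors of length n_H.
    """
    z = sigmoid(add(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"]))
    r = sigmoid(add(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"]))
    # Hadamard product of the reset gate with the previous hidden state.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a)
               for a in add(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hp + zi * ht
            for zi, hp, ht in zip(z, h_prev, h_tilde)]


# Toy check with n_I = 2, n_H = 3 and all-zero parameters.
n_i, n_h = 2, 3
zeros = {
    "Wz": [[0.0] * n_i for _ in range(n_h)],
    "Wr": [[0.0] * n_i for _ in range(n_h)],
    "Wh": [[0.0] * n_i for _ in range(n_h)],
    "Uz": [[0.0] * n_h for _ in range(n_h)],
    "Ur": [[0.0] * n_h for _ in range(n_h)],
    "Uh": [[0.0] * n_h for _ in range(n_h)],
    "bz": [0.0] * n_h, "br": [0.0] * n_h, "bh": [0.0] * n_h,
}
h_next = gru_step([1.0, -1.0], [1.0, -2.0, 0.5], zeros)
```

With all parameters zero, both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous hidden state; this makes the interpolation role of the update gate easy to see.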

Because concepts are observed at arbitrary times, after the label of each time window has been predicted with the GRU, temporally adjacent windows that carry the same label are merged into one. As a result, a plurality of semantic concepts arranged along the time axis (a semantic concept stream) can be generated, as shown in FIGS. 5 and 6.
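The merging of adjacent windows that share a label can be sketched as follows. This is an illustration under assumptions: windows are given as `(start, end, label)` tuples sorted by start time, and "temporally adjacent" is interpreted as touching or overlapping; neither the representation nor the function name comes from the source.

```python
def merge_windows(windows):
    """Merge runs of touching or overlapping windows with the same label.

    windows: list of (start, end, label) tuples sorted by start time.
    Returns a new list in which each run has become a single concept.
    """
    merged = []
    for start, end, label in windows:
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged


labelled = [(0, 2, "walk"), (2, 4, "walk"), (4, 6, "sit_down"), (6, 8, "walk")]
stream = merge_windows(labelled)
```

The two leading "walk" windows become one concept spanning 0 to 4, while the later "walk" window stays separate because a different label lies between them.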
(Event candidate generation process)
The following describes, with reference to FIG. 7, how the event sequence extraction unit 38 generates candidate events representing the real world by selecting and integrating combinations of concepts from the semantic concept stream.

Within the semantic concept stream, it is not obvious which combinations of concepts form an event. Therefore, a sliding window is used to generate event candidates from the concepts inside each window. First, windows $w = (w_1, \ldots, w_{|w|})$ are created, one for each action concept.

Here, the number of windows $|w|$ is equal to the number of observed actions. FIG. 7 shows an example in which the event sequence extraction unit 38 generates event candidates by applying a sliding window to the semantic concept stream.

Here, a fixed-length time window (with a window size of M samples) is used, centered on the start time $c.t$ of the action concept. Concepts that overlap this time window are identified, and event candidates that fill the tuple $(c_1, c_2, c_3)$ are generated.

For example, when a time window from 8:31:55 to 8:31:58 is set for the action concept "drink", the concepts "TV", "coffee", and "table" that overlap the window become the object concepts of the event candidates, and "living_room" becomes the place concept of the event candidates.

As a result, the following event candidates $e = (c_1, c_2, c_3)$, written as $(c_1.w, c_2.w, c_3.w)$, are generated for this window: (drink, TV, living_room), (drink, table, living_room), (drink, coffee, living_room), and (drink, *, living_room).
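The candidate generation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: concepts are plain dictionaries with assumed keys `w`, `type`, `t`, and `t_end`, the window is taken as `window_len` time units centered on the action start, and all timestamps in the toy data are invented for the example.

```python
from itertools import product

NO_OBJECT = "*"  # stands for "no object observed"


def overlapping(concepts, kind, lo, hi):
    """Names of concepts of type `kind` whose [t, t_end] meets [lo, hi]."""
    return [c["w"] for c in concepts
            if c["type"] == kind and c["t"] <= hi and c["t_end"] >= lo]


def event_candidates(concepts, window_len):
    """Return [(action_start, [(action, object, place), ...]), ...].

    One window is created per action concept, centered on its start time.
    """
    half = window_len / 2.0
    actions = sorted((c for c in concepts if c["type"] == "a"),
                     key=lambda c: c["t"])
    result = []
    for a in actions:
        lo, hi = a["t"] - half, a["t"] + half
        # "*" is always offered so that object-free events remain possible.
        objects = overlapping(concepts, "o", lo, hi) + [NO_OBJECT]
        places = overlapping(concepts, "p", lo, hi)
        result.append((a["t"],
                       [(a["w"], o, p) for o, p in product(objects, places)]))
    return result


# Toy concept stream (times are made-up numbers in seconds).
concepts = [
    {"w": "drink", "type": "a", "t": 100, "t_end": 103},
    {"w": "TV", "type": "o", "t": 0, "t_end": 200},
    {"w": "coffee", "type": "o", "t": 99, "t_end": 101},
    {"w": "table", "type": "o", "t": 50, "t_end": 150},
    {"w": "cracker", "type": "o", "t": 300, "t_end": 310},
    {"w": "living_room", "type": "p", "t": 0, "t_end": 500},
    {"w": "kitchen", "type": "p", "t": 600, "t_end": 700},
]
candidate_windows = event_candidates(concepts, window_len=6)
```

For this toy stream the single "drink" window yields the same four candidates as the example in the text; "cracker" and "kitchen" fall outside the window and are excluded.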

次に、実世界の性質を用いて、イベント候補の中から尤もらしいイベント系列を推定する方法について紹介する。 Next, we introduce a method for estimating likely event sequences from event candidates using the properties of the real world.

A hidden Markov model (HMM) is used to efficiently find a plausible event sequence $e^*$ among the event candidates of each window. An HMM consists of transition probabilities between hidden states and output probabilities from hidden states to outputs. The HMM used here estimates the optimal event sequence $e^*$ from the set of event candidates by maximizing the following function.

Here, $w$ corresponds to the hidden state of the HMM and $e$ to its output. The variable $w$ is binary: it takes the value 1 when the window contains an event candidate and 0 otherwise.

$P(e_i \mid w_i)$ is the output probability, that is, the probability that event $e_i$ is output from window $w_i$. $P(w_i \mid w_{i-1})$ corresponds to the transition probability and represents how readily a transition occurs between windows containing event candidates. Furthermore, the temporal properties of an event and its linguistic plausibility are assumed to be independent, so that $P(e \mid w) = P(e_l \mid w)\,P(e_t \mid w)$. Here, $P(e_l \mid w)$ is the language model of the event and $P(e_t \mid w)$ is the time model. If a window contains no event candidate, the state $w$ is taken to produce no output. As described below, taking $P(e_t \mid w)$ into account in addition to $P(e_l \mid w)$ as $P(e_i \mid w_i)$ makes it possible to select more plausible event candidates.

The definition of each probability is introduced below.
(Time relation model)
First, the time relation model $P(e_t \mid w)$ is defined.

Here, let $e_t = (c_1.t, c_2.t, c_3.t)$. This represents the times at which the semantic concepts for the person's action, the object, and the place were observed. It is assumed that an object or place observed near the time at which an action was performed is related to that action. For example, when a person performs the action of drinking coffee, the person sees the object "coffee cup". Taking this temporal relationship between concepts into account makes it possible to remove event candidates that contain objects or places unrelated to the action.

意味コンセプト間の時間的関係を次式で定義する。 The temporal relationship between the semantic concepts is defined by the following equation.

ここで、e.tはイベントの時間、c.tはコンセプトが観測された開始時刻を表す。 Here, e.t represents the time of the event, and c.t represents the start time at which the concept was observed.

$P(e_t \mid w)$ is used as one of the output probabilities of the HMM representing the event sequence model.
(Event language model)
To express the plausibility of a combination of concepts, linguistic knowledge mined from a language corpus is used.

言語コーパスには、現実世界の常識的知識が反映されているので、この性質を用いて人間の行動と物体・場所を表す意味コンセプトの尤もらしさを表す。 Since the language corpus reflects common-sense knowledge in the real world, this property is used to express the likelihood of human actions and semantic concepts representing objects and places.

For example, people do not drink crackers, but they do drink coffee. Likewise, drinking coffee in the living room is considered more plausible than drinking coffee in the toilet. To express this kind of real-world common-sense knowledge, a language model obtained from a language corpus is used.

A language model is a probability distribution over word sequences, and the language model of an event is denoted $P(e_l \mid w)$, where $e_l = (c_1.w, c_2.w, c_3.w)$. A recurrent neural network (RNN) language model is used as the language model of events.

This RNN language model learns a mapping from a sequence of input word embedding vectors, through internal state vectors, to a sequence of output word embedding vectors.

ＲＮＮ言語モデルには、ＧＲＵを用いることができる。ＧＲＵを用いた言語モデルは、以下の文献にも記載されるとおり、tanh関数を用いた標準的なRNN言語モデルよりも良い性能を示すことが知られている。 GRU can be used for the RNN language model. Language models using GRU are known to perform better than standard RNN language models using tanh functions, as also described in the following document.

Known document 2: Chung, J.; Gülçehre, Ç.; Cho, K.; and Bengio, Y. 2015. "Gated feedback recurrent neural networks." In ICML, 2067-2075.
FIG. 10 is a conceptual diagram showing the configuration of an RNN language model using word embedding vectors and a GRU.

ＲＮＮ言語モデルでは、k番目の単語が出現する確率を以下のように定義する。 In the RNN language model, the probability that the k-th word appears is defined as follows.

ここでは、ｋ番目の単語をone-hot vector（該当する単語の箇所を1にしたベクトル）で表現している。 Here, the k-th word is expressed as a one-hot vector (a vector in which the location of the corresponding word is 1).

ただし、実世界で観測されるコンセプトが、学習用の言語コーパス中に現れるとは限らない。そこで、単語の意味的な類似度を考慮してこの問題に対処する。単語の類似性を考慮するため、以下の文献に開示されたGloVe で学習した単語埋め込みベクトルをＧＲＵに用いる。 However, concepts observed in the real world do not always appear in the language corpus for learning. Therefore, this problem is dealt with in consideration of the semantic similarity of words. In order to consider the similarity of words, the word embedding vector learned by GloVe disclosed in the following document is used for GRU.

Known document 3: Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532-1543.
Treating each concept name $c_i.w$ in an event as a word and letting $x_k$ be the word embedding vector of that word, the joint probability after all word embedding vectors corresponding to the concepts in the event have been read is given by the following expression.

This is the likelihood of the language model.
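The joint-probability expression is not reproduced in this text. For an autoregressive language model, a joint probability over a word sequence is normally obtained with the chain rule, as a product of per-step conditional probabilities; the sketch below shows that computation in log space. It is an assumption that the source follows exactly this form. The function name, the `cond_prob` callback, and the toy probability table are all invented for illustration; in the described system the conditionals would come from the GRU language model.

```python
import math


def sequence_log_prob(words, cond_prob):
    """Chain-rule log probability: sum over k of log P(word_k | words_<k).

    cond_prob(history, word) must return a probability in (0, 1].
    """
    total = 0.0
    for k, word in enumerate(words):
        total += math.log(cond_prob(tuple(words[:k]), word))
    return total


# Toy conditional table standing in for a trained language model.
# The numbers are illustrative only and are not taken from the source.
TABLE = {
    (): {"drink": 0.5},
    ("drink",): {"coffee": 0.4, "cracker": 0.01},
    ("drink", "coffee"): {"living_room": 0.5},
    ("drink", "cracker"): {"living_room": 0.5},
}


def toy_cond_prob(history, word):
    return TABLE[history][word]


lp_coffee = sequence_log_prob(["drink", "coffee", "living_room"], toy_cond_prob)
lp_cracker = sequence_log_prob(["drink", "cracker", "living_room"], toy_cond_prob)
```

With these toy numbers the event (drink, coffee, living_room) scores higher than (drink, cracker, living_room), mirroring the common-sense preference the language model is meant to capture.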
(Event sequence model)
In daily life, different actions are performed one after another, so the events representing daily living activities change little by little from one to the next. For example, after taking off slippers, a person more often puts on shoes than puts the slippers back on. In the present embodiment, this change between events is used for the transition probability of the HMM.

The dissimilarity between the vectors representing events is used as the event transition probability $P(e_i \mid e_{i-1})$, expressed as follows.

Here, $q_i$ denotes the hidden state vector of the GRU after the embedding vector of the last word in event $e_i$ has been read. $P(e_i \mid e_{i-1})$ is the transition probability of the HMM-based event sequence model.

図１１は、イベントの遷移の例を示す概念図である。 FIG. 11 is a conceptual diagram showing an example of event transition.

For example, the event path (remove, shoe, entrance) → (wear, slipper, entrance) → (walk, *, living_room) is considered more plausible than the alternative path (remove, shoe, entrance) → (wear, shoe, entrance) → (walk, slipper, entrance).

The reason is that the vector representation of the event (wear, slipper, entrance) is less similar to that of (remove, shoe, entrance) than the vector representation of (wear, shoe, entrance) is. A plausible sequence of events is selected by taking these event transitions into account.
(Selection of the event sequence)
Next, the procedure by which the event sequence extraction unit 38 selects an event sequence is described in more detail with reference to FIG. 8.

意味コンセプトストリームの各窓の中には複数のイベント候補が存在しており、窓間で、イベント候補の経路は複数考えられる。図８に、イベントの経路を表すＨＭＭのラティス構造を示す。 There are multiple event candidates in each window of the semantic concept stream, and multiple paths of event candidates can be considered between the windows. FIG. 8 shows the lattice structure of the HMM representing the path of the event.

As described above, the present embodiment uses an HMM for the sequence of events. Therefore, by using the transition probability between events $P(e_i \mid e_{i-1})$ calculated so far and the normalized output probability $P(e \mid w) = P(e_l \mid w)\,P(e_t \mid w)$, the optimal sequence of events can be found efficiently with the Viterbi algorithm. This optimal event sequence becomes the event timeline.
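The Viterbi search over the lattice of event candidates can be sketched as follows. This is a generic textbook Viterbi pass written under assumptions: every window is assumed to hold at least one candidate (windows without candidates are assumed to have been dropped beforehand), all probabilities are assumed to be strictly positive, and the `emit` and `trans` functions below are toy stand-ins, not the output and transition models defined in the source.

```python
import math


def viterbi(candidates, emit, trans):
    """Most probable path through a lattice of event candidates.

    candidates: list with one non-empty list of hashable events per window.
    emit(e): output probability of event e (must be > 0).
    trans(prev, e): transition probability from prev to e (must be > 0).
    Returns a list with one chosen event per window.
    """
    # best[e] = (log score of the best path ending in e, that path)
    best = {e: (math.log(emit(e)), [e]) for e in candidates[0]}
    for window in candidates[1:]:
        new_best = {}
        for e in window:
            score, path = max(
                ((s + math.log(trans(prev, e)), p)
                 for prev, (s, p) in best.items()),
                key=lambda sp: sp[0],
            )
            new_best[e] = (score + math.log(emit(e)), path + [e])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]


# Toy lattice built from the event names used in the FIG. 11 discussion.
lattice = [
    [("remove", "shoe", "entrance")],
    [("wear", "slipper", "entrance"), ("wear", "shoe", "entrance")],
    [("walk", "*", "living_room"), ("walk", "slipper", "entrance")],
]


def toy_emit(event):
    return 1.0  # uniform output probability, for illustration only


def toy_trans(prev, event):
    # Toy proxy for "dissimilar consecutive events are more likely":
    # penalize transitions that keep the same object.
    return 0.2 if prev[1] == event[1] else 0.8


path = viterbi(lattice, toy_emit, toy_trans)
```

Working in log space avoids numerical underflow on long timelines, and the per-window dictionary keeps the cost proportional to the number of candidate pairs between neighbouring windows rather than to the number of full paths.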

以上のようなイベント系列抽出部３８の構成により、未知の場所に汎化するようにイベントタイムライン（イベント系列）をセンサ信号に基づいて抽出することが可能である。
［実験結果］
以下では、日常生活行動データセットを用いて本実施の形態のイベント系列生成手法をベースラインと比較して評価する。この日常生活行動データセットは複数の被験者が屋内で日常生活行動をシミュレーションした結果を言語化したものである。 With the configuration of the event sequence extraction unit 38 described above, an event timeline (event sequence) can be extracted from sensor signals in a way that generalizes to unknown places.
[Experimental results]
In the following, the event sequence generation method of the present embodiment is evaluated against baselines using a daily living activity data set. This data set is a verbalization of the results of multiple subjects simulating daily living activities indoors.

本実施の形態のイベント系列抽出部３８の動作の評価を行うために、日常生活行動データセットを用いる。 In order to evaluate the operation of the event sequence extraction unit 38 of the present embodiment, a daily life activity data set is used.

このデータセットは日常生活行動の検索で用いられたデータセットであり、モーション信号（加速度とジャイロ）とヘッドマウントカメラで撮影した一人称視点映像で構成されている。
本データセットは、日常生活行動のセンシングにウェアラブルセンサを用いているので、場所を移動しても逐次被験者の行動の詳細を追跡することができる。このデータセットでは、８人の被験者がモーションセンサとウェアラブルカメラを装着して２０個の日常生活行動を１セッションとして、異なる場所でワークシートに記述された場所固有の行動を被験者の意思に従って任意のタイミングで行う。 This data set was used for the retrieval of daily living activities and consists of motion signals (acceleration and gyroscope) and first-person video captured by a head-mounted camera.
Because this data set uses wearable sensors to sense daily living activities, the details of a subject's actions can be tracked continuously even as the subject moves between places. In this data set, eight subjects wore motion sensors and a wearable camera and, with 20 daily living activities forming one session, performed the place-specific actions listed on a worksheet in different places at whatever timing they chose.

例えば、被験者はソファに座ってリビングルームでテレビを見ながらコーヒーを飲み、その後、台所に移動して皿を洗う、といった日常生活行動を行う。この一連の行動を１０回繰り返して、１人につき１０セッション、合計で約１７時間分のデータが収録されている。このセミナチュラルな実験プロトコルによるデータ収集は、限られた環境の中で多様な行動についてのセンサデータを集めるために、行動認識の研究で頻繁に用いられる。 For example, a subject sits on a sofa in the living room and drinks coffee while watching TV, then moves to the kitchen and washes the dishes. This series of actions was repeated 10 times, giving 10 sessions per subject and about 17 hours of recorded data in total. Data collection with this kind of semi-natural experimental protocol is frequently used in activity recognition research to gather sensor data on diverse activities within a limited environment.

図１２は、実験で使用された家の間取りと被験者が行った２０個の場所固有の行動を示す図である。 FIG. 12 is a diagram showing house layouts used in the experiment and 20 place-specific actions performed by the subject.

図１２（ａ）は、家の間取り図を示し、図１２（ｂ）は、行動を示す。 FIG. 12 (a) shows a floor plan of a house, and FIG. 12 (b) shows an action.

たとえば、場所のラベルとしては、「台所」「居間」「寝室」「玄関」「トイレ」「洗面所」がある。 For example, the place labels are "kitchen", "living room", "bedroom", "entrance", "bathroom", and "washroom".

また、行動のラベルとしては、たとえば、「玄関(entrance)」では、「靴をはく（put on shoes）」「靴を脱ぐ（remove shoes）」がある。 As action labels, the "entrance", for example, has "put on shoes" and "remove shoes".

あるいは、「台所（kitchen）」では、「サンドイッチクッキーを作る（make a sandwiched cookie）」「コーヒーをいれる（make coffee）」「手を洗う（wash hands）」「皿を洗う（wash the dishes）」などがある。 The "kitchen" has actions such as "make a sandwiched cookie", "make coffee", "wash hands", and "wash the dishes".

「居間（living room）」では、「コーヒーを飲む（drink coffee）」「サンドイッチクッキーを食べる（eat a sandwiched cookie）」「床をふく（mop the floor）」「エアコンをつける／消す（turn on/off A/C）」「テレビをつける／消す（turn on/off TV）」などがある。 The "living room" has actions such as "drink coffee", "eat a sandwiched cookie", "mop the floor", "turn on/off A/C", and "turn on/off TV".

「寝室（bedroom）」「トイレ（bathroom）」「洗面所（washroom）」についても、それぞれ、固有の行動が予め規定されている。 Specific actions are likewise defined in advance for each of the "bedroom", the "bathroom", and the "washroom".

イベントタイムラインの生成手法を評価するため、この日常生活行動データセットに行動・物体・場所を表すイベント文の記述を行う。 In order to evaluate the generation method of the event timeline, an event sentence representing an action, an object, and a place is described in this daily life activity data set.

データセットの作り方は以下のようになる。まず、２人の作業者が、被験者の頭部に装着された装着型カメラからの映像を見ながら、２７個の行動、３９個の物体、６つの場所についての意味コンセプトのラベルを人手で付与する。 The data set was built as follows. First, two annotators, watching the video from the wearable camera mounted on the subject's head, manually assigned semantic concept labels for 27 actions, 39 objects, and 6 places.

次に、クラウドソーシングサービスを使用して、６人の作業者がラベリングされた意味コンセプトの中からイベントを構成する意味コンセプトを選び正しいイベントになるように手動で意味コンセプトの並び替えを行った。 Next, using a crowdsourcing service, six workers selected, from the labeled semantic concepts, those that make up an event and manually reordered them so that they formed a correct event.

例えば、drink（行動）、coffee（物体）、living_room（場所）コンセプトが観測されたときに「ある人が居間のコーヒーを飲む」となるようにコンセプトの並び替えを行った。 For example, when the concepts drink (action), coffee (object), and living_room (place) were observed, they were reordered to read "a person drinks coffee in the living room".

結果、合計５５０件の固有のイベントと11,501個の行動コンセプト、13,087個の物体コンセプト、931個の場所コンセプト、11,280個のイベントを作成することができた。 As a result, a total of 550 unique events and 11,501 action concepts, 13,087 object concepts, 931 location concepts, and 11,280 events were created.

本実施の形態のイベント系列抽出部３８は、与えられた意味コンセプトから自動的にイベントの系列であるイベントタイムラインを生成する。この生成したイベント系列と手動でラベル付けしたイベント系列を一致度を見ることで、本実施の形態のイベント系列抽出部３８の性能を評価する。 The event sequence extraction unit 38 of the present embodiment automatically generates an event timeline that is a sequence of events from the given semantic concept. The performance of the event sequence extraction unit 38 of the present embodiment is evaluated by looking at the degree of coincidence between the generated event sequence and the event sequence manually labeled.

（実験条件）
ウェアラブルセンサから取得したセンサデータから意味コンセプトを作成する。特徴抽出の方法は、同データセットを用いた以下の文献に開示されたのと同様の方法で行った。 (Experimental conditions)
Semantic concepts were created from the sensor data acquired from the wearable sensors. Feature extraction followed the method disclosed in the following document, which used the same data set.

公知文献：Miyanishi, T.; Hirayama, J.-i.; Maekawa, T.; Kong, Q.; Moriya, H.; and Suyama, T. 2016. Egocentric video search via physical interactions. In AAAI, 330-336.
モーションデータの特徴抽出として、25Hzのデータに対して窓サイズ75サンプルで1サンプルづつずらして短時間フーリエ変換を用いて特徴抽出を行った。 Known document: Miyanishi, T.; Hirayama, J.-i.; Maekawa, T.; Kong, Q.; Moriya, H.; and Suyama, T. 2016. Egocentric video search via physical interactions. In AAAI, 330-336.
For the motion data, features were extracted from the 25 Hz data with a short-time Fourier transform, using a window of 75 samples (3 seconds) shifted one sample at a time.
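The sliding-window transform described above can be sketched in plain Python as follows. The window size (75 samples), hop (1 sample), and sampling rate (25 Hz) come from the text. The Hann taper, the use of magnitudes, the function name `stft_magnitudes`, and the 5 Hz test signal are assumptions for illustration, since the source does not specify them.

```python
import cmath
import math


def stft_magnitudes(signal, window_size=75, hop=1):
    """Magnitude spectra of a 1-D signal, using a sliding Hann window and a naive DFT."""
    # Hann taper (an assumption; the source does not name a window function).
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / (window_size - 1))
            for n in range(window_size)]
    n_bins = window_size // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = [signal[start + n] * hann[n] for n in range(window_size)]
        spectrum = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                    for n, x in enumerate(frame)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


# Toy input: 4 seconds of a 5 Hz sine sampled at 25 Hz (100 samples).
SAMPLE_RATE = 25
toy_signal = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE) for t in range(100)]
frames = stft_magnitudes(toy_signal)
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
```

With a 75-sample window at 25 Hz, each frequency bin is 1/3 Hz wide, so the 5 Hz tone falls in bin 15. A real pipeline would normally use an FFT routine rather than this naive DFT.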

動画像の特徴抽出には、まず各画像についてImageNetで事前学習したVGG のモデルの最終層の活性化関数の値を用いて画像の特徴抽出を行い、滑走窓を用いて動画特徴を抽出した。 For the video, image features were first extracted for each frame from the final-layer activations of a VGG model pre-trained on ImageNet, and video features were then extracted with a sliding window.

ＶＧＧモデルについては、以下の文献に開示がある。 The VGG model is disclosed in the following document.

公知文献：Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
さらに、モーションデータとVGGの画特徴を0.2秒ごとにダウンサンプリングして一つのベクトルに結合し、３秒間の窓内のベクトルを作成し、このセンサデータのベクトルからＧＲＵを用いて意味コンセプトのラベルの予測を行った。行動と場所の意味コンセプトは、時間的に重複しないので、交差エントロピー損失を用いて多クラス分類でラベルの予測をした。一方、物体の意味コンセプトは複数観測されるので時間的に重複する場合がある。そこで、物体のラベルの予測には最大エントロピーに基づく1対全の損失を用いてマルチラベル分類でラベルの予測をした。最後に、隣接したラベルを結合して任意の時間幅を持つ意味コンセプトを作成した。結果、14,444個の行動、34,923個の物体、1,113個の場所の意味コンセプトを予測した。
(イベント系列の生成)
センサ信号から推定した意味コンセプトを用いて尤もらしいイベントの系列を作成する。イベントの言語モデルの学習には、Montreal Video Annotation Dataset (M-VAD)コーパスを用いた。このM-VADコーパスから、動詞とその係り先をStanford係り受け解析器を用いて抽出した。 Known documents: Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
Furthermore, the motion features and the VGG image features were downsampled to one sample every 0.2 seconds and concatenated into a single vector, vectors were formed over 3-second windows, and a GRU was used to predict semantic concept labels from these sensor data vectors. Because the action and place concepts do not overlap in time, their labels were predicted by multi-class classification with a cross-entropy loss. Several object concepts, on the other hand, can be observed at once and may therefore overlap in time, so object labels were predicted by multi-label classification with a one-versus-all loss based on maximum entropy. Finally, adjacent labels were merged to create semantic concepts of arbitrary duration. As a result, 14,444 action, 34,923 object, and 1,113 place semantic concepts were predicted.
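The final merging step can be illustrated with a short sketch. It assumes one predicted label per 0.2-second step and merges runs of identical adjacent labels into segments. The function name `merge_adjacent`, the tuple layout, and the toy label sequence are hypothetical.

```python
from itertools import groupby


def merge_adjacent(labels, step=0.2):
    """Collapse runs of identical per-step labels into (label, start_sec, end_sec) segments."""
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        start = round(index * step, 3)
        end = round((index + length) * step, 3)
        segments.append((label, start, end))
        index += length
    return segments


# Toy action predictions, one per 0.2 s step (made-up values).
toy_actions = ["walk", "walk", "drink", "drink", "drink", "walk"]
segments = merge_adjacent(toy_actions)
```

For multi-label object predictions, the same function could be applied separately to the on/off sequence of each object label, since several objects may be active at once.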
(Generation of event series)
The semantic concepts estimated from the sensor signals are used to create a plausible sequence of events. The Montreal Video Annotation Dataset (M-VAD) corpus was used to train the event language model. From this M-VAD corpus, verbs and their dependents were extracted using the Stanford dependency parser.

使用する単語の係り先としては、動詞の主部と補語とXCMOPに限定した。イベント構成要素の語彙は、頻度高い上位4,000個の主部、4,000個の動詞、6,000個の動詞の係先の名詞を選んだ。イベント言語モデルの訓練データは合計51,564個のイベントで、検証データに6,871のイベントを使用した。また、イベントの言語モデルに使用する単語の埋め込みベクトルは、Wikipediaの英語版で学習したGloveを用いて作成した。 The dependents used were restricted to the subject, the complement, and XCMOP of the verb. For the vocabulary of event components, the 4,000 most frequent subjects, the 4,000 most frequent verbs, and the 6,000 most frequent nouns depending on verbs were selected. The event language model was trained on a total of 51,564 events, with 6,871 events used as validation data. The word embedding vectors used by the event language model were created with GloVe trained on the English version of Wikipedia.

ここで、Gloveの隠れ層の次元は300、語彙は40万個の単語を使用した。それ以外の単語は、未知のトークンに置き換えて学習を行った。言語モデルの学習にはＧＲＵを使い、最適化手法はAdam を用いて、学習率を0.01、バッチサイズを32にして、検証データの損失を見て10エポックで学習を打ち切った。ＧＲＵの300次元の隠れ層は正規分布N(0、0.01)に従って初期化した。イベントタイムラインの生成については、１つの場所のイベント系列をテストデータ、他の場所でのイベント系列を訓練データとする leave-one-place-outの交差検定で本実施の形態のイベント系列生成手法のパラメータＮ、Ｍを調整し、評価を行った。 Here, the GloVe vectors had 300 dimensions and a vocabulary of 400,000 words; all other words were replaced with an unknown token for training. The language model was trained with a GRU, using Adam as the optimizer with a learning rate of 0.01 and a batch size of 32, and training was stopped at 10 epochs based on the validation loss. The 300-dimensional hidden layer of the GRU was initialized from the normal distribution N(0, 0.01). For event timeline generation, the parameters N and M of the event sequence generation method of the present embodiment were tuned and evaluated by leave-one-place-out cross-validation, in which the event sequences from one place serve as test data and the event sequences from the other places as training data.
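The leave-one-place-out protocol can be sketched as a simple split generator. The function name, the dictionary layout keyed by place, and the toy sequences are assumptions for illustration. The tuning of the parameters N and M on the training portion is not shown.

```python
def leave_one_place_out(sequences_by_place):
    """Yield (held_out_place, train, test) splits, holding out one place at a time."""
    for held_out in sorted(sequences_by_place):
        test = sequences_by_place[held_out]
        train = [
            seq
            for place, seqs in sorted(sequences_by_place.items())
            if place != held_out
            for seq in seqs
        ]
        yield held_out, train, test


# Toy data: event sequences grouped by the place where they were recorded.
toy_data = {
    "kitchen": [["make coffee"], ["wash the dishes"]],
    "entrance": [["remove shoes"]],
    "living_room": [["drink coffee"]],
}
splits = list(leave_one_place_out(toy_data))
```

Each place is used as test data exactly once, so the reported scores reflect performance on a place that contributed nothing to training.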

本実施の形態のイベント系列生成手法の優位性を検証するため、ベースライン手法を用意した。FullRandomは、全イベント候補からランダムにイベントを選んでイベント系列を作成する。WindowRandomは、各窓のイベント候補の中からランダムにイベントを選んでイベント系列を作成する。Unigram (MVAD)は、各窓のイベント候補をMVADコーパスで学習したユニグラムの尤度に基づいて順位付けしてイベント系列を作成する。Bigram (MVAD)は、各窓のイベント候補をMVADコーパスで学習したバイグラムの尤度に基づいて順位付けしてイベント系列を作成する。Video2TextGenerateは、与えられたセンサデータからsequence-to-sequence を用いて直接イベントを生成する。これはと同様の手法である。Video2TextRankingは、各窓のイベント候補をsequence-to-sequenceの尤度を用いて順位付けしてイベント系列を作成する。 To verify the advantage of the event sequence generation method of the present embodiment, several baseline methods were prepared. FullRandom creates an event sequence by choosing events at random from all event candidates. WindowRandom creates an event sequence by choosing an event at random from the event candidates of each window. Unigram (MVAD) creates an event sequence by ranking the event candidates of each window by the likelihood of a unigram model trained on the MVAD corpus. Bigram (MVAD) creates an event sequence by ranking the event candidates of each window by the likelihood of a bigram model trained on the MVAD corpus. Video2TextGenerate generates events directly from the given sensor data using sequence-to-sequence; the source likens this to an existing method, but the reference is missing from the text. Video2TextRanking creates an event sequence by ranking the event candidates of each window by their sequence-to-sequence likelihood.

本実施の形態のイベント系列生成手法について、モデルの各構成要素の効果を見るため、RNN-Lang、RNN-Lang + Time、RNN-Lang + Time + Contextを用意した。RNN-Langは、各窓のイベント候補をMVADコーパスで学習したRNN言語モデルの尤度に基づいて順位付けしてイベント系列を生成する。RNN-Lang + Timeは、RNN-Langに時間モデルを追加して、RNN言語モデルの尤度に時間モデルのスコアを足して、合計スコアに基づいてイベントの順位付けを行い、イベント系列を生成する。RNN-Lang + Time + Contextは、RNN-Lang + Timeにイベントの変化モデルを追加して、RNN言語モデルの尤度と時間モデルのスコアにイベント間の遷移確率を考慮して、HMMを用いてイベント系列を生成する。 In the event sequence generation method of the present embodiment, RNN-Lang, RNN-Lang + Time, and RNN-Lang + Time + Context are prepared to see the effect of each component of the model. RNN-Lang generates an event sequence by ranking event candidates of each window based on the likelihood of the RNN language model learned in the MVAD corpus. RNN-Lang + Time adds a temporal model to RNN-Lang, adds the temporal model score to the likelihood of the RNN language model, ranks the events based on the total score, and generates an event sequence . RNN-Lang + Time + Context adds an event change model to RNN-Lang + Time, and uses HMM, taking into account transition probability between events in RNN language model likelihood and time model score Generate an event sequence.
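The difference between RNN-Lang and RNN-Lang + Time can be illustrated with a small ranking sketch: candidates in a window are sorted either by the language-model score alone or by the sum of the language-model score and the time-model score. The function name `rank_candidates` and all score values are hypothetical. How the two scores are computed is described elsewhere in the document and is not reproduced here.

```python
def rank_candidates(candidates, language_score, time_score=None):
    """Sort one window's event candidates by total score, best first."""
    def total(event):
        score = language_score(event)
        if time_score is not None:
            score += time_score(event)  # RNN-Lang + Time: add the time-model score
        return score

    return sorted(candidates, key=total, reverse=True)


# Toy candidates with made-up log-scores.
A = ("wear", "slipper", "entrance")
B = ("wear", "shoe", "kitchen")
toy_language = {A: -2.0, B: -1.5}
toy_time = {A: -0.1, B: -1.0}

lang_only = rank_candidates([A, B], toy_language.get)
lang_time = rank_candidates([A, B], toy_language.get, toy_time.get)
```

In this toy case the language model alone prefers B, while adding the time score moves A to the top. RNN-Lang + Time + Context goes one step further and replaces per-window ranking with a search over transitions between windows.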

（イベント生成のパフォーマンス）
以下では、単語の統計情報を用いたベースラインUnigram (MVAD)、Bigram (MVAD)と、動画の説明文生成の課題でよく用いられるsequence-to-sequenceによるイベント生成手法 Video2TextGenerate、Video2TextRankingと比較することで、提案するイベントタイムライン生成手法RNN-Lang + Time + Contextの性能を評価する。 (Event generation performance)
In the following, the performance of the proposed event timeline generation method RNN-Lang + Time + Context is evaluated by comparing it with the baselines Unigram (MVAD) and Bigram (MVAD), which use word statistics, and with the sequence-to-sequence event generation methods Video2TextGenerate and Video2TextRanking, which are commonly used for video description generation.

用意したベースラインは、コンセプト間の時間モデルやイベント間の変化などの実世界の性質は使用していない。 The prepared baseline does not use real-world properties such as temporal models between concepts and changes between events.

イベントタイムラインの評価には、人手でラベリングした意味コンセプトからイベント系列を生成する条件と、ＧＲＵで予測した意味コンセプトからイベント系列を生成する条件の２通りの評価を行う。これらの２つの条件をそれぞれ正解コンセプト（TrueConcept）と予測コンセプト（PredConcept）と明記する。人手でラベリングした意味コンセプトを使用することで、イベントタイムライン生成のみの性能を評価することができ、予測した意味コンセプトを使ってイベント系列を予測することで、ノイズが多いより現実に近い条件でイベント系列手法の評価ができる。 The event timeline is evaluated under two conditions: one in which the event sequence is generated from manually labeled semantic concepts, and one in which it is generated from semantic concepts predicted by the GRU. These two conditions are referred to as the correct concept (TrueConcept) and the predicted concept (PredConcept), respectively. Using the manually labeled semantic concepts makes it possible to evaluate the performance of event timeline generation alone, while predicting the event sequence from the predicted semantic concepts makes it possible to evaluate the event sequence method under noisier, more realistic conditions.

イベントタイムラインの生成手法を評価する指標として、BLEU 、METEOR 、CIDEr を用いる。これらの指標は、画像の説明文を生成する手法を評価するために標準的に用いられる。 BLEU, METEOR, and CIDEr are used as metrics for evaluating the event timeline generation method. These metrics are standard for evaluating image captioning methods.

これらの指標については、以下の文献に開示がある。 These indices are disclosed in the following documents.

公知文献：Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR, 1473-1482.
特に、BLEUとMETEORは機械翻訳システムの評価で標準的に用いられ、BLEUは予測したイベントと正解のイベントの重複するN-グラムの一致度をみる。CIDErは画像から生成した説明文を評価するために用いられ、予測した文にある単語が正解の文の単語集合にどれだけ含まれるかを評価する。 Known document: Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR, 1473-1482.
In particular, BLEU and METEOR are standard in the evaluation of machine translation systems, and BLEU measures the overlap of N-grams between the predicted event and the correct event. CIDEr is used to evaluate descriptions generated from images, and evaluates how many of the words in the predicted sentence are contained in the word set of the correct sentences.
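The N-gram overlap that BLEU builds on can be sketched as a clipped N-gram precision. This is only one ingredient of BLEU: the full metric also combines several N-gram orders and applies a brevity penalty, which are omitted here. The function names and the example sentences are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(predicted, reference, n):
    """Clipped n-gram precision of a predicted token list against one reference."""
    pred_counts = ngrams(predicted, n)
    ref_counts = ngrams(reference, n)
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


predicted = "a person drinks coffee in the living room".split()
reference = "a person drinks tea in the living room".split()
unigram_precision = modified_precision(predicted, reference, 1)
bigram_precision = modified_precision(predicted, reference, 2)
```

Here seven of the eight predicted words and five of the seven predicted bigrams also occur in the reference, so a single wrong object word lowers the bigram precision more than the unigram precision.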

ベースラインと本実施の形態のイベント系列生成手法によって生成したイベント系列と人手で作成したイベント系列を比較して BLEU、METEOR、CIDErのスコアの計算を行う。これらのスコアが高いほどイベントタイムラインの生成手法の性能が高いことを示す。 The event sequences generated by the baselines and by the event sequence generation method of the present embodiment are compared with the manually created event sequences to compute BLEU, METEOR, and CIDEr scores. Higher scores indicate better performance of the event timeline generation method.

以下では、イベントタイムラインの全体的な性能を評価する。 In the following, we will assess the overall performance of the event timeline.

図１３は、抽出されたイベントタイムラインの評価結果を示す図である。 FIG. 13 is a diagram showing the evaluation result of the extracted event timeline.

図１３では、正解コンセプト（TrueConcept）と予測コンセプト（PredConcept）を用いたときのベースラインと本実施の形態のイベント系列生成手法の全体的な結果を示す。
正解コンセプトの結果から、これらの手法は同じ言語コーパスを使用して言語モデルを構築しているにもかかわらず、RNN-LangはUnigram (MVAD)やBigram (MVAD)よりもすべての評価指標に対して上回っていることがわかる。この結果は、標準的なユニグラムやバイグラムの言語モデルと比較して、任意長の単語列を表現できるRNN言語モデルを利用したRNN-Langは、より正確に実世界の状態を表現できることを示唆している。 FIG. 13 shows the overall results of the baselines and of the event sequence generation method of the present embodiment when the correct concepts (TrueConcept) and the predicted concepts (PredConcept) are used.
The results for the correct concepts show that RNN-Lang outperforms Unigram (MVAD) and Bigram (MVAD) on all evaluation metrics, even though these methods build their language models from the same language corpus. This suggests that RNN-Lang, which uses an RNN language model capable of representing word sequences of arbitrary length, can represent real-world states more accurately than standard unigram and bigram language models.

さらに、RNN-Langはsequence-to-sequenceを用いたVideo2TextGenerateやVideo2TextRankingと比較して、すべての評価指標で上回っている。 Furthermore, RNN-Lang is higher in all evaluation indexes compared to Video2TextGenerate and Video2TextRanking using sequence-to-sequence.

これは、sequence-to-sequenceを用いた手法が、ある特定の環境で収集したセンサデータに過適合してしまい、他の場所でのイベントを予測できなかったからである。 This is because the sequence-to-sequence methods overfit the sensor data collected in one particular environment and could not predict events in other places.

また、Video2TextGenerateはイベント候補の中からランダムに選択するモデル（FullRandom）と結果がほぼ同じである。これは、異なる環境でイベント系列の生成の性能を汎化させるためには、センサとその説明文の並列コーパスで学習するのではなく、外部の言語資源を用いて実世界の状態を表す単語列を学習する必要があることを示唆している。 In addition, Video2TextGenerate gives almost the same results as the model that selects randomly from the event candidates (FullRandom). This suggests that, to make event sequence generation generalize across different environments, the word sequences that describe real-world states need to be learned from external language resources rather than from a parallel corpus of sensor data and its descriptions.

つまり、動画像の説明文生成で用いられるsequence-to-sequenceは、場所が変化するような環境でのイベントタイムラインの生成に適していない。正解コンセプトを用いた結果から、提案するイベント系列生成手法RNN-Lang、RNN-Lang + Time、RNN-Lang + Context、RNN-Lang + Time + Contextがベースラインを上回っている。 That is, the sequence-to-sequence used in the description generation of a moving image is not suitable for generating an event timeline in an environment where the location changes. From the results using the correct answer concept, the proposed event sequence generation methods RNN-Lang, RNN-Lang + Time, RNN-Lang + Context, and RNN-Lang + Time + Context exceed the baseline.

これは、日常生活行動に関するイベント系列を生成するために、実環境の特有の性質を取り入れることが効果的であることを示している。特に、RNN-Lang + TimeはRNN-Langよりも良い性能を示していることから、コンセプト間の時間的近さを考慮することが、実世界のイベントを生成する上で重要だということがわかる。また、RNN-Lang + Time + Contextの性能がRNN-Lang + Timeを上回っていることから、イベントの変化を考慮したモデルを用いることで、より正しいイベント系列を生成できることがわかった。RNN-Lang + Time + Contextの性能がRNN-LangとRNN-Lang + Timeを上回っていることから、コンセプトの組み合わせの尤もらしさを表す常識的知識だけでなく、行動と物体・場所を表すコンセプトの時間的近さや、さらに時間とともに変化する日常生活行動を表すイベント間の遷移を追加してモデル化することが日常生活のイベントタイムラインを生成する上で効果的だとわかる。 This shows that incorporating properties specific to the real environment is effective for generating event sequences of daily living activities. In particular, RNN-Lang + Time performs better than RNN-Lang, which indicates that taking the temporal proximity between concepts into account is important for generating real-world events. Moreover, RNN-Lang + Time + Context outperforms RNN-Lang + Time, which shows that a model that accounts for changes between events can generate more correct event sequences. Because RNN-Lang + Time + Context outperforms both RNN-Lang and RNN-Lang + Time, it is effective, when generating an event timeline of daily life, to model not only the common-sense knowledge that expresses how plausible a combination of concepts is, but also the temporal proximity of the concepts representing actions, objects, and places, and the transitions between events representing daily living activities that change over time.

次に、予測コンセプトを用いてイベント系列を生成した結果を紹介する。 Next, we introduce the results of generating event sequences using the prediction concept.

当然ながら、予測コンセプトには誤りがあるので、予測コンセプトを用いた結果は、正解のコンセプトを使った場合と比べて評価指標の値は低下している。 Naturally, because there is an error in the prediction concept, the result using the prediction concept is lower in the value of the evaluation index than in the case where the correct concept is used.

図１３から、提案するRNN-LangはベースラインのUnigram (MVAD)を上回っているが、RNN-LangとBigram (MVAD)は同程度である。 From FIG. 13, the proposed RNN-Lang exceeds the baseline Unigram (MVAD), but the RNN-Lang and Bigram (MVAD) are comparable.

これは、イベント候補の中に誤って予測されたコンセプトが含まれるためにRNN-Langの性能が低下したためと考えられる。しかし、常識的知識に時間のモデルを加えたRNN-Lang + TimeはRNN-Langと比較してイベント系列生成の性能を向上させることができており、その結果、Unigram (MVAD)、Bigram (MVAD)の性能を上回っている。 This is thought to be because the performance of RNN-Lang degraded when the event candidates included incorrectly predicted concepts. However, RNN-Lang + Time, which adds the time model to the common-sense knowledge, improves event sequence generation compared with RNN-Lang and, as a result, outperforms Unigram (MVAD) and Bigram (MVAD).

これは、より実用的な条件でもコンセプトの時間的近さを考慮することでイベント生成の性能を向上できることを示している。さらに、正解のコンセプトを使ったときと同様に、RNN-Lang + Time + Contextは、RNN-LangとRNN-Lang + Timeを上回っている。 This indicates that the performance of event generation can be improved by considering the temporal closeness of concepts even under more practical conditions. Furthermore, as with the correct answer concept, RNN-Lang + Time + Context exceeds RNN-Lang and RNN-Lang + Time.

この結果は、ノイジーな条件でもイベントの変化を考慮することで、実世界の状態を表す正しいイベント系列を選択しやすくなることを示している。また、本実施の形態のイベント系列生成手法のRNN-Lang + Time + Contextを用いた手法が全ての手法に対して全ての評価指標で上回ってる。この結果は、実世界の特性をモデリングすることで、実環境でセンシングしたデータを使っても正確なイベントタイムラインを生成できることを示している。 This result shows that it is easy to select the correct event sequence that represents the state of the real world by considering the change of the event even in noisy conditions. In addition, the method using RNN-Lang + Time + Context of the event sequence generation method of the present embodiment outperforms all methods in all evaluation indexes. The results show that modeling the characteristics of the real world enables accurate event timelines to be generated using data sensed in the real environment.

以上説明したように、本実施の形態のイベント系列抽出部３８では、意味コンセプトの時間的な近さや、コンセプトの組み合わせの言語的な尤もらしさや、イベントの時間変化などの実世界の特徴をモデル化している。日常生活行動に関するデータセットを使用して評価実験を行った結果、実世界の特徴をモデル化することで、逐次的に変化するイベントの系列を正しく生成できる。 As described above, the event sequence extraction unit 38 of the present embodiment models real-world characteristics such as the temporal proximity of semantic concepts, the linguistic plausibility of combinations of concepts, and the change of events over time. Evaluation experiments on a data set of daily living activities showed that modeling these real-world characteristics makes it possible to correctly generate a sequence of events that changes step by step.

今回開示された実施の形態は、本発明を具体的に実施するための構成の例示であって、本発明の技術的範囲を制限するものではない。本発明の技術的範囲は、実施の形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲の文言上の範囲および均等の意味の範囲内での変更が含まれることが意図される。 The embodiment disclosed here is an illustration of a configuration for concretely implementing the present invention and does not limit the technical scope of the present invention. The technical scope of the present invention is indicated not by the description of the embodiment but by the claims, and is intended to include modifications within the literal scope of the claims and within the range of equivalents.

１イベント系列抽出装置、２被験者、１０ウェアラブルカメラ、２０モーションセンサ、３０映像収集部、３２映像特徴量算出部、３４運動情報収集部、３５位置情報収集部、３６運動特徴量算出部、３７位置特徴量算出部、３８イベント系列抽出部、４０データ格納部。 DESCRIPTION OF SYMBOLS: 1 event sequence extraction apparatus, 2 subject, 10 wearable camera, 20 motion sensor, 30 video collection unit, 32 video feature calculation unit, 34 motion information collection unit, 35 position information collection unit, 36 motion feature calculation unit, 37 position feature calculation unit, 38 event sequence extraction unit, 40 data storage unit.

Claims

A first sensor for acquiring data for identifying an object of a subject's perceptual experience;
A second sensor that acquires position information for specifying the behavior of the subject and the position of the subject during the perceptual experience of the subject;
A storage device for storing information;
Event sequence extraction means for storing, for the subject, an action label, an object label, and a position label that respectively identify the action, the object, and the position in the storage device as a timeline, and for extracting an event sequence,
wherein the event sequence extraction means includes:
Concept estimation processing means for predicting the action label, the object label, and the position label for each predetermined sliding window, and executing a process of combining and arranging identical labels on the timeline;
Language model comparison means for extracting, as an event, the most likely combination from among a plurality of candidate combinations of the object label and the position label within a predetermined time window referenced to the action label in the combined result on the timeline, according to the likelihood that the action label, the object label, and the position label corresponding to each candidate exist in a language corpus; and
Event sequence selection means for extracting the event sequence from the time series of the extracted events.

The event sequence extraction device according to claim 1, wherein the language model comparison means extracts a combination as the event according to both the likelihood of existing in the language corpus and a likelihood based on the temporal proximity among the action label, the object label, and the position label being combined.

The event sequence extraction device according to claim 1 or 2, wherein the concept estimation processing means includes a gated recurrent unit that predicts the action label, the object label, and the position label by sequence labeling of the signals of the first and second sensors in each sliding window.

The event sequence extraction device according to any one of claims 1 to 3, wherein the language model comparison means includes a gated recurrent unit that receives word embedding vectors reflecting word similarity.

The event sequence extraction device according to any one of claims 1 to 4, wherein the event sequence selection means extracts the event sequence by a hidden state model in which the degree of dissimilarity between the vectors representing events serves as the transition probability between events.

An event sequence extraction method for extracting information on a series of events occurring in a subject from sensor data in a real environment, comprising:
Acquiring, with a first sensor, data for identifying an object of the perceptual experience of the subject;
Obtaining the behavior of the subject and position information for specifying the position of the subject by a second sensor when the perceptual experience of the subject is performed;
For each predetermined sliding window, an action label, an object label and a position label for identifying each of the action, the object and the position are predicted, and the same label is combined and arranged on the timeline. Performing the process;
Storing the action label, the object label and the position label as the time line in the storage device for the subject;
Predicting the action label, the object label, and the position label for each predetermined sliding window, and combining and arranging the same labels on the timeline;
Regarding the combination result of the action label, the object label and the position label on the timeline, a plurality of candidates of combinations of the object label and the position label within a predetermined time window based on the action label Extracting a most likely combination as an event according to the likelihood that the action label, the object label, and the position label corresponding to the candidate exist in a language corpus;
Extracting an event series from the extracted time series of events.

An event sequence extraction program for causing a computer to execute an event sequence extraction process for extracting information on a sequence of events occurring for a subject from sensor data in a real environment, the event sequence extraction program comprising:
Acquiring, with a first sensor, data for identifying an object of the perceptual experience of the subject;
Obtaining the behavior of the subject and position information for specifying the position of the subject by a second sensor when the perceptual experience of the subject is performed;
For each predetermined sliding window, an action label, an object label and a position label for identifying each of the action, the object and the position are predicted, and the same label is combined and arranged on the timeline. Performing the process;
Storing the action label, the object label and the position label as the time line in the storage device for the subject;
Regarding the combination result of the action label, the object label and the position label on the timeline, a plurality of candidates of combinations of the object label and the position label within a predetermined time window based on the action label Extracting a most likely combination as an event according to the likelihood that the action label, the object label, and the position label corresponding to the candidate exist in a language corpus;
Extracting an event sequence from the time series of the extracted events.