JP2023012795A

JP2023012795A - Training device, abnormal behavior assessment device, method, and program

Info

Publication number: JP2023012795A
Application number: JP2021116487A
Authority: JP
Inventors: 基宏高木; Motohiro Takagi; 潤島村; Jun Shimamura; 正樹北原; Masaki Kitahara; 峻司細野; Shunji Hosono; 洋一佐藤; Yoichi Sato; 裕介菅野; Yusuke Sugano; 諒佑古田; Ryosuke Furuta
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2023-01-26

Abstract

PROBLEM TO BE SOLVED: To accurately assess abnormal behavior that does not conform to procedures. SOLUTION: A procedural sentence database 24 stores, for each of a series of procedures, a plurality of procedural sentences each representing at least one action included in the procedure, or sentence vectors representing each of the plurality of procedural sentences. A label detection unit 28 detects action labels related to human actions from training video data or audio data representing the actions included in the procedure. A model learning unit 30 trains a language model, which generates a sentence or a sentence vector from the action labels, so that it generates the procedural sentence predetermined for the training video data or audio data, or the sentence vector representing that procedural sentence. SELECTED DRAWING: Figure 2

Description

The disclosed technology relates to a learning device, an abnormal behavior determination device, a method, and a program.

In recent years, with the spread of high-definition cameras, there is a growing need for technology that analyzes human behavior from video and audio captured by cameras. Examples include detecting criminal behavior with surveillance cameras and detecting dangerous behavior at construction sites. Finding such behavior requires observing large amounts of video and audio. Furthermore, a person who understands the definition of abnormal behavior must observe the behavior in the video to detect it. Because manual detection is costly in time and labor, one conceivable approach is to construct an algorithm that detects abnormal behavior automatically.

In recent years, abnormal behavior detection techniques using neural networks have been proposed (Non-Patent Document 1). The method of Non-Patent Document 1 detects abnormal behavior with high accuracy by clustering videos.

[Non-Patent Document 1] Zaheer M.Z., Mahmood A., Astrid M., Lee SI. CLAWS: Clustering Assisted Weakly Supervised Learning with Normalcy Suppression for Anomalous Event Detection. ECCV 2020.

The conventional method shown in Non-Patent Document 1 for detecting abnormal behavior in video does not clearly distinguish between procedures and actions. Consider, for example, a series of procedures such as (Procedure 1) standing up a stepladder that was lying on the floor, (Procedure 2) tightening the safety belt, and (Procedure 3) climbing the stepladder. Each procedure contains many actions, so it is difficult to judge whether the actions included in the procedures occur in the correct order. Specifically, (Procedure 1) includes many actions: the person bends their knees, grabs the stepladder, lifts it, and fixes it in place. Similarly, (Procedure 2), in which the safety belt is tightened before the stepladder is climbed, includes a series of actions such as holding the safety belt and fastening it to the body. (Procedure 3) includes many actions: walking toward the stepladder, placing a foot on a step, and climbing while holding the stepladder by hand. It is therefore necessary to group actions to some extent into a single procedure and check whether the actions included in the procedures occur in the correct order. However, current abnormal behavior detection methods focus on detecting anomalies in individual actions, and abnormality determination of whether a group of actions deviates from a procedure has not been studied. As a result, it is difficult to determine from video an abnormal behavior such as tightening the safety belt only after climbing the stepladder, which deviates from the procedure and is dangerous.

The disclosed technology has been made in view of the above points, and aims to provide a learning device, an abnormal behavior determination device, a method, and a program that can accurately determine abnormal behavior that differs from a procedure.

A first aspect of the present disclosure is a learning device including: a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences each representing at least one action included in the procedure, or sentence vectors representing each of the plurality of procedural sentences; a label detection unit that detects action labels related to human actions from training video data or audio data representing actions included in the procedure; and a model learning unit that trains a language model, which generates a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the training video data or audio data, or a sentence vector representing that procedural sentence.

A second aspect of the present disclosure is an abnormal behavior determination device including: a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences each representing at least one action included in the procedure, or sentence vectors representing each of the plurality of procedural sentences; a generation unit that, based on action labels related to human actions detected from video data or audio data representing a person's actions, generates a sentence or a sentence vector using a pre-trained language model that generates a sentence or a sentence vector from action labels; a similarity calculation unit that calculates the similarity between the generated sentence or sentence vector and the procedural sentences stored in the procedural sentence database, or the sentence vectors representing the procedural sentences; and an abnormality determination unit that determines whether the person's behavior is abnormal based on the similarity calculated by the similarity calculation unit.

A third aspect of the present disclosure is a learning method in which a learning device, including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences each representing at least one action included in the procedure, or sentence vectors representing each of the plurality of procedural sentences, detects action labels related to human actions from training video data or audio data representing actions included in the procedure, and trains a language model, which generates a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the training video data or audio data, or a sentence vector representing that procedural sentence.

A fourth aspect of the present disclosure is an abnormal behavior determination method in which an abnormal behavior determination device, including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences each representing at least one action included in the procedure, or sentence vectors representing each of the plurality of procedural sentences, generates a sentence or a sentence vector, based on action labels related to human actions detected from video data or audio data representing a person's actions, using a pre-trained language model that generates a sentence or a sentence vector from action labels; calculates the similarity between the generated sentence or sentence vector and the procedural sentences stored in the procedural sentence database, or the sentence vectors representing the procedural sentences; and determines whether the person's behavior is abnormal based on the calculated similarity.

A fifth aspect of the present disclosure is a program for causing a computer to function as the learning device of the first aspect or the abnormal behavior determination device of the second aspect.

According to the disclosed technology, abnormal behavior that differs from a procedure can be determined accurately.

FIG. 1 is a schematic block diagram of an example of a computer functioning as the learning device and the abnormal behavior determination device of the first and second embodiments.
FIG. 2 is a block diagram showing the configuration of the learning device of the first embodiment.
FIG. 3 is a block diagram showing the configuration of the abnormal behavior determination device of the first and second embodiments.
FIG. 4 is a flowchart showing the sentence vector generation processing routine of the learning device of the first embodiment.
FIG. 5 is a flowchart showing the learning processing routine of the learning device of the first embodiment.
FIG. 6 is a flowchart showing the flow of processing for updating the language model in the learning device of the first embodiment.
FIG. 7 is a flowchart showing the abnormal behavior determination processing routine of the abnormal behavior determination device of the first embodiment.
FIG. 8 is a flowchart showing the flow of processing for generating a sentence vector in the abnormal behavior determination device of the first embodiment.
FIG. 9 is a flowchart showing the flow of processing for determining abnormal behavior in the abnormal behavior determination device of the first embodiment.
FIG. 10 is a flowchart showing the learning processing routine of the learning device of the second embodiment.
FIG. 11 is a flowchart showing the flow of processing for updating the language model in the learning device of the second embodiment.
FIG. 12 is a flowchart showing the abnormal behavior determination processing routine of the abnormal behavior determination device of the second embodiment.
FIG. 13 is a flowchart showing the flow of processing for generating a sentence in the abnormal behavior determination device of the second embodiment.

An example of an embodiment of the disclosed technology is described below with reference to the drawings. In the drawings, identical or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

[First embodiment]
<Overview of this embodiment>
In this embodiment, a group of action labels extracted from video or audio is taken as input, and a sentence vector of the action label group is generated. For example, a group of human-object-interaction labels (HOI labels), each a combination of a person label, an object label, and a label for the person's action on the object, extracted from video, is taken as input to generate a sentence vector of the action label group. The similarity between this sentence vector and the sentence vector of the procedural sentence being compared is then calculated, and that similarity is used to determine whether the behavior is abnormal behavior that differs from the procedure.

Here, the sentence vector of a procedural sentence is the sentence vector of a procedural sentence representing each of a series of procedures, and each procedure includes at least one action.

<Configuration of the learning device according to this embodiment>
FIG. 1 is a block diagram showing the hardware configuration of the learning device 10 of this embodiment.

As shown in FIG. 1, the learning device 10 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are communicably connected to one another via a bus 19.

The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls the above components and performs various arithmetic processing according to the programs stored in the ROM 12 or the storage 14. In this embodiment, the ROM 12 or the storage 14 stores a learning program for training a language model, which is a neural network. The learning program may be a single program, or a program group composed of multiple programs or modules.

The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is configured by an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.

The input unit 15 includes a keyboard and a pointing device such as a mouse, and is used for various inputs.

The input unit 15 receives training video data as input. Specifically, the input unit 15 receives training video data representing actions included in the procedure described by a procedural sentence of the work manual. Each item of training video data is tagged with information (for example, an ID) indicating which procedural sentence of the work manual it corresponds to.

The display unit 16 is, for example, a liquid crystal display, and displays various information. The display unit 16 may employ a touch panel and also function as the input unit 15.

The communication interface 17 is an interface for communicating with other devices, and uses a standard such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).

Next, the functional configuration of the learning device 10 will be described. FIG. 2 is a block diagram showing an example of the functional configuration of the learning device 10.

Functionally, as shown in FIG. 2, the learning device 10 includes a work manual database (DB) 20, a sentence vector generation unit 22, a procedural sentence database (DB) 24, a learning database (DB) 26, a label detection unit 28, and a model learning unit 30.

The work manual database 20 stores a plurality of procedural sentences included in a work manual that describes a series of procedures. The procedural sentences are stored in the work manual database 20 as text data.

The sentence vector generation unit 22 retrieves the procedural sentences one at a time from the work manual database 20 and uses a pre-trained model to generate a sentence vector representing each procedural sentence.

Specifically, a language model such as BERT (Bidirectional Encoder Representations from Transformers) trained on a large amount of text is prepared in advance as the pre-trained model, and the sentence vector generation unit 22 uses it to generate the sentence vectors representing the procedural sentences. This language model receives no training beyond its pre-training and is used only to generate the sentence vectors representing the procedural sentences. The sentence vector generation unit 22 stores the generated sentence vectors in the procedural sentence database 24.

The procedural sentence database 24 stores the sentence vector generated by the sentence vector generation unit 22 for each of the plurality of procedural sentences included in the work manual.

Here, the sentence vectors are assumed to be stored in the procedural sentence database 24 at a granularity that matches the sentence vectors generated by the language model described later.

The learning database 26 stores a plurality of items of input training video data. The training video data may be input per video, per divided video segment, or per video frame. Here, a video segment is a unit obtained by dividing a video into groups of multiple frames, for example with 32 frames defined as one segment.
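The segment unit described above can be illustrated with a minimal sketch. The function name `split_into_segments` is hypothetical, and keeping a final segment that is shorter than 32 frames is an assumption; the text does not say how a trailing remainder is handled.

```python
from typing import List


def split_into_segments(num_frames: int, segment_length: int = 32) -> List[range]:
    """Group frame indices 0..num_frames-1 into consecutive segments.

    Assumption: a trailing group shorter than segment_length is kept
    as its own (shorter) segment rather than dropped or padded.
    """
    if num_frames < 0 or segment_length <= 0:
        raise ValueError("num_frames must be >= 0 and segment_length > 0")
    return [
        range(start, min(start + segment_length, num_frames))
        for start in range(0, num_frames, segment_length)
    ]


# A 70-frame video yields two full segments and one 6-frame remainder.
segments = split_into_segments(70)
print([len(s) for s in segments])
```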

The label detection unit 28 detects action labels related to human actions from each of the plurality of items of training video data stored in the learning database 26. Specifically, the label detection unit 28 retrieves training video data from the learning database 26 and outputs an action label group, that is, the group of action labels detected within a fixed period equal to the length of the video data. One example of an action label group is an HOI label group, which is the set of HOI labels detected within the fixed period.

For example, if the person is "person", the ladder is "ladder", and climbing is "climb", the HOI label <person, ladder, climb> is detected as the action label. Whether an object label is present depends on whether the verb of the action label included in the HOI label is intransitive or transitive. In the above example the action label is a transitive verb, but it may also be an intransitive verb.
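One way to picture an HOI label with an optional object is the small data structure below. This is only an illustrative sketch: the class name `HOILabel`, its field names, and the intransitive example "walk" are assumptions that do not appear in the source.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HOILabel:
    """A human-object-interaction label; obj is None for an intransitive verb."""
    person: str
    action: str
    obj: Optional[str] = None

    def as_detected(self) -> Tuple[str, ...]:
        # Order used in the text's example: <person, object, action>.
        if self.obj is None:
            return (self.person, self.action)
        return (self.person, self.obj, self.action)


climb = HOILabel(person="person", action="climb", obj="ladder")  # transitive
walk = HOILabel(person="person", action="walk")                  # intransitive (hypothetical)
print(climb.as_detected())
print(walk.as_detected())
```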

The model learning unit 30 trains a language model that generates a sentence vector from action labels, so that it generates the sentence vector representing the procedural sentence predetermined for the training video data.

Specifically, the model learning unit 30 receives from the label detection unit 28 the action label group detected from the training video data. The model learning unit 30 also retrieves from the procedural sentence database 24 the procedural sentence vector of the correct procedural sentence associated with the training video data. Using the language model being trained, the model learning unit 30 generates a sentence vector from the action label group detected from the training video data. It then compares and evaluates the procedural sentence vector of the correct procedural sentence against the generated sentence vector, updates the language model so that the two match, and outputs the language model. The language model here is, for example, a neural network model.

For example, a language model that has undergone self-supervised learning on a large amount of text is used as the base model of the language model being trained. A representative model is BERT of Non-Patent Document 2. The action labels are rearranged into the order <person label action label object label> and input to BERT. In other words, the order reflects the subject-predicate-object relationship, and the above example becomes <person climb ladder>. If the procedure "tighten the safety belt" comes before "the person climbs the ladder", and the action label <person tighten safety belt> has likewise been detected, a single sentence vector can be obtained by inputting <person tighten safety belt person climb ladder> to BERT. Alternatively, <person tighten safety belt> and <person climb ladder> may be split and input as separate sentences. Specifically, a separator that the BERT model can recognize is inserted between the HOI labels in the input. In this way, a sentence vector that takes multiple procedures into account can be generated.
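The reordering and concatenation of labels described above might be sketched as follows. The helper names `to_phrase` and `build_model_input` are hypothetical. The text only says that a separator recognizable by BERT is used; `[SEP]` is the standard BERT separator token and is used here as an assumption. Tokenization and the model call itself are out of scope.

```python
from typing import List, Optional, Tuple

# (person, object, action) in the order the detector reports them;
# object is None for an intransitive action.
HOILabel = Tuple[str, Optional[str], str]


def to_phrase(label: HOILabel) -> str:
    """Reorder one label to subject-predicate-object: person, action, object."""
    person, obj, action = label
    words = [person, action]
    if obj is not None:
        words.append(obj)
    return " ".join(words)


def build_model_input(labels: List[HOILabel], separator: Optional[str] = None) -> str:
    """Concatenate reordered labels, optionally with a separator between them."""
    joiner = f" {separator} " if separator else " "
    return joiner.join(to_phrase(label) for label in labels)


labels = [("person", "safety belt", "tighten"), ("person", "ladder", "climb")]
print(build_model_input(labels))
# person tighten safety belt person climb ladder
print(build_model_input(labels, separator="[SEP]"))
# person tighten safety belt [SEP] person climb ladder
```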

[Non-Patent Document 2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

<Configuration of the abnormal behavior determination device according to the first embodiment>
FIG. 1 above is a block diagram showing the hardware configuration of the abnormal behavior determination device 50 of the first embodiment.

As shown in FIG. 1 above, the abnormal behavior determination device 50 has the same configuration as the learning device 10, and the ROM 12 or the storage 14 stores an abnormal behavior determination program for determining abnormal behavior.

The input unit 15 receives as input an action label group detected from video data representing a person's actions. Specifically, the input unit 15 receives as input an action label group that is the group of HOI labels detected within a fixed period equal to the length of the video data.

Next, the functional configuration of the abnormal behavior determination device 50 will be described. FIG. 3 is a block diagram showing an example of the functional configuration of the abnormal behavior determination device 50.

Functionally, as shown in FIG. 3, the abnormal behavior determination device 50 includes a procedural sentence database (DB) 60, a generation unit 62, a similarity calculation unit 64, and an abnormality determination unit 66.

Like the procedural sentence database 24, the procedural sentence database 60 stores the sentence vector generated by the sentence vector generation unit 22 for each of the plurality of procedural sentences included in the work manual.

The generation unit 62 generates a sentence vector from the input action label group using the language model trained in advance by the learning device 10.

The similarity calculation unit 64 calculates the similarity between the sentence vector generated by the generation unit 62 and the sentence vectors representing procedural sentences stored in the procedural sentence database 60.

Based on the similarity calculated by the similarity calculation unit 64, the abnormality determination unit 66 determines whether the person's behavior is abnormal, that is, differs from the procedure. For example, the abnormality determination unit 66 compares the calculated similarity with a predetermined threshold to determine whether the person's behavior is abnormal, and outputs a label that takes the value 1 or 0. The label is 1 when the behavior is abnormal and 0 when it is normal.
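The similarity calculation and threshold comparison can be sketched as below. Several points are assumptions not stated in the source: cosine similarity is used as the similarity measure, behavior is judged abnormal when the similarity falls below the threshold, and the threshold value 0.8 and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def judge_abnormal(generated: Sequence[float],
                   procedure: Sequence[float],
                   threshold: float = 0.8) -> int:
    """Return 1 (abnormal) if similarity is below the threshold, else 0 (normal)."""
    return 1 if cosine_similarity(generated, procedure) < threshold else 0


print(judge_abnormal([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(judge_abnormal([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

In practice the threshold would need to be tuned on data for which the normal and abnormal cases are known.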

<Operation of the learning device according to the first embodiment>
Next, the operation of the learning device 10 according to the first embodiment will be described.

FIG. 4 is a flowchart showing the flow of sentence vector generation processing by the learning device 10. The sentence vector generation processing is performed by the CPU 11 reading the learning program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.

In step S100, the CPU 11, as the sentence vector generation unit 22, retrieves the procedural sentences one at a time from the work manual database 20 and takes them as input.

In step S102, the CPU 11, as the sentence vector generation unit 22, inputs the procedural sentences one at a time to the pre-trained model to generate a sentence vector representing each procedural sentence. Here, the pre-trained model is a language model such as BERT that has been pre-trained only on a large amount of general text.

In step S104, the CPU 11 stores the sentence vector generated for each procedural sentence in the procedural sentence database 24.
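Steps S100 to S104 amount to a loop that encodes each procedural sentence and stores the result. The sketch below shows only that flow. The encoder `toy_encode` is a deliberately simple hashed bag-of-words stand-in, not BERT, and all names, the dictionary layout, and the vector size are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, List

DIM = 8  # hypothetical vector size; a real BERT vector is much larger


def toy_encode(sentence: str, dim: int = DIM) -> List[float]:
    """Stand-in for a pre-trained encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def build_procedure_db(manual: Dict[str, str],
                       encode: Callable[[str], List[float]] = toy_encode
                       ) -> Dict[str, List[float]]:
    """S100: take sentences one at a time; S102: encode; S104: store by ID."""
    return {proc_id: encode(sentence) for proc_id, sentence in manual.items()}


manual = {
    "proc1": "Stand the stepladder on the floor",
    "proc2": "Tighten the safety belt",
    "proc3": "Climb the stepladder",
}
db = build_procedure_db(manual)
print(sorted(db))
```

Swapping `toy_encode` for a real pre-trained sentence encoder would leave the rest of the loop unchanged.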

FIG. 5 is a flowchart showing the flow of learning processing by the learning device 10. The learning processing is performed by the CPU 11 reading the learning program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. A plurality of items of training video data are also input to the learning device 10 and stored in the learning database 26.

In step S110, the CPU 11, as the label detection unit 28, retrieves a plurality of items of training video data from the learning database 26 and takes each as input.

In step S112, the CPU 11, as the label detection unit 28, detects an action label group from each of the plurality of items of training video data.

In step S114, the CPU 11, as the model learning unit 30, acquires the action label group detected from each of the plurality of items of training video data and takes it as input.

In step S116, the CPU 11, as the model learning unit 30, retrieves from the procedural sentence database 24 the procedural sentence vector of the correct procedural sentence associated with each of the plurality of items of training video data. Using the language model being trained, the CPU 11 generates a sentence vector from the action label group detected from each item of training video data. The CPU 11 compares and evaluates the procedural sentence vector of the correct procedural sentence against the generated sentence vector, and updates the language model so that the two match.
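The update in step S116 can be pictured with a toy model: generate a vector, compare it with the correct procedural sentence vector, and adjust the parameters to reduce the gap. The sketch below substitutes a single linear layer for the neural language model and uses squared error with plain gradient descent; the source specifies neither the loss nor the optimizer, so both are assumptions, as are all names and values.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def predict(weights: Matrix, features: Vector) -> Vector:
    """Toy 'language model': a linear map from label features to a sentence vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def squared_error(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def sgd_step(weights: Matrix, features: Vector, target: Vector, lr: float = 0.1) -> None:
    """One in-place gradient step on the squared error.

    d(loss)/d(weights[i][j]) = 2 * (pred[i] - target[i]) * features[j]
    """
    pred = predict(weights, features)
    for i, row in enumerate(weights):
        scale = 2.0 * (pred[i] - target[i])
        for j, x in enumerate(features):
            row[j] -= lr * scale * x


features = [1.0, 0.0, 1.0]   # hypothetical encoding of an action label group
target = [0.5, -0.5]         # hypothetical correct procedural sentence vector
weights = [[0.0] * 3 for _ in range(2)]

before = squared_error(predict(weights, features), target)
for _ in range(50):
    sgd_step(weights, features, target)
after = squared_error(predict(weights, features), target)
print(before, after)
```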

ステップＳ１１８で、ＣＰＵ１１は、モデル学習部３０として、更新した言語モデルを出力する。ここで、ラベル検出部２８は、非特許文献３のような映像データからＨＯＩラベルを検出できる手法を用いて、行動ラベルを検出すればよい。映像フレーム、映像セグメント等、出力の単位は任意とする。 In step S118, the CPU 11, as the model learning unit 30, outputs the updated language model. Here, the label detection unit 28 may detect the action label using a method such as Non-Patent Document 3 that can detect the HOI label from the video data. Any output unit such as video frame or video segment may be used.

［非特許文献３］Georgia Gkioxari, Ross Girshick, Piotr Dollar, Kaiming He. Detecting and Recognizing Human-Object Interactions. CVPR2018. [Non-Patent Document 3] Georgia Gkioxari, Ross Girshick, Piotr Dollar, Kaiming He. Detecting and Recognizing Human-Object Interactions. CVPR2018.

上記ステップＳ１１４、Ｓ１１６の処理における詳細動作を図６に示す。 FIG. 6 shows detailed operations in the processing of steps S114 and S116.

ステップＳ１２０で、ＣＰＵ１１は、モデル学習部３０として、複数の学習用の映像データの各々から検出された行動ラベル群を取得し、入力とする。 In step S120, the CPU 11, as the model learning unit 30, acquires a group of action labels detected from each of the plurality of video data for learning, and inputs them.

ステップＳ１２２で、ＣＰＵ１１は、モデル学習部３０として、行動ラベル群の各行動ラベルを＜人ラベル動作ラベル物体ラベル＞の順に並べ替える。 In step S122, the CPU 11, as the model learning unit 30, rearranges each action label in the action label group in the order <person label action label object label>.

ステップＳ１２４で、ＣＰＵ１１は、モデル学習部３０として、行動ラベル群が、学習対象の言語モデルの入力長上限より長いかを判断する。行動ラベル群が、学習対象の言語モデルの入力長上限より長い場合には、ステップＳ１２６へ移行する。一方、行動ラベル群が、学習対象の言語モデルの入力長上限以下である場合には、ステップＳ１２８へ移行する。 In step S124, the CPU 11, as the model learning unit 30, determines whether the action label group is longer than the upper limit of input length of the language model to be learned. If the action label group is longer than the upper limit of the input length of the language model to be learned, the process proceeds to step S126. On the other hand, when the action label group is equal to or less than the upper limit of the input length of the language model to be learned, the process proceeds to step S128.

ステップＳ１２６で、ＣＰＵ１１は、モデル学習部３０として、行動ラベル群における各行動ラベルの出現頻度を集計し、出現頻度順に各行動ラベルを並べ、学習対象の言語モデルの入力長上限におさまるように、上位Ｎ個の行動ラベルを抽出し、抽出された行動ラベル以外の行動ラベルを削除し、新たな行動ラベル群とする。 In step S126, the CPU 11, as the model learning unit 30, counts the appearance frequency of each action label in the action label group, sorts the action labels by appearance frequency, extracts the top N action labels so that they fit within the input length upper limit of the language model to be trained, deletes all action labels other than the extracted ones, and uses the result as a new action label group.
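
The frequency-based truncation of steps S124 and S126 can be sketched as follows. This is only an illustration under assumptions not stated in the description: each action label is a space-separated string, the input length is counted in words, and each retained label appears once in the new group. The function name `truncate_by_frequency` is hypothetical.

```python
from collections import Counter

def truncate_by_frequency(action_labels, max_words):
    """Return an action label group that fits within max_words.

    Assumption: length is measured in whitespace-separated words.
    """
    total_words = sum(len(label.split()) for label in action_labels)
    if total_words <= max_words:
        # S124: the group already fits, so it is passed on unchanged.
        return list(action_labels)

    kept, used = [], 0
    # S126: most_common() orders labels by descending frequency;
    # labels with equal counts keep their first-seen order.
    for label, _count in Counter(action_labels).most_common():
        n_words = len(label.split())
        if used + n_words > max_words:
            break
        kept.append(label)
        used += n_words
    return kept
```

The value of N is not fixed in advance here; it falls out of how many of the most frequent labels fit within the word budget.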

ステップＳ１２８で、ＣＰＵ１１は、モデル学習部３０として、並び替えた行動ラベル群を学習対象の言語モデルへと入力し、文ベクトルを生成する。 In step S128, the CPU 11, as the model learning unit 30, inputs the rearranged action label group to the learning target language model to generate a sentence vector.

ステップＳ１３０で、ＣＰＵ１１は、モデル学習部３０として、生成された文ベクトルと、手順文データベース２４から取得した、当該学習用の映像データに紐づけられた正解の手順文の文ベクトルとの間で、評価関数による損失を算出する。 In step S130, the CPU 11, as the model learning unit 30, calculates the loss, using an evaluation function, between the generated sentence vector and the sentence vector of the correct procedural sentence that is linked to the video data for learning and acquired from the procedure sentence database 24.

ここで、評価関数は距離関数で表され、ユークリッド距離の二乗や非特許文献４のＣｏｎｔｒａｓｔｉｖｅＬｏｓｓ等が利用可能である。これらの距離関数に限定されるものではなく、距離を表現でき、微分可能、言い換えると深層学習のバックプロパゲーションに必要な勾配が計算可能な関数であれば、他の距離関数であってもよい。 Here, the evaluation function is represented by a distance function, and the square of the Euclidean distance, Contrastive Loss of Non-Patent Document 4, or the like can be used. It is not limited to these distance functions, and other distance functions may be used as long as they can express distance and are differentiable, in other words, a function that can calculate the gradient required for backpropagation in deep learning. .

［非特許文献４］Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR2020. [Non-Patent Document 4] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR2020.

ステップＳ１３２で、ＣＰＵ１１は、モデル学習部３０として、得られた損失から勾配を計算し、言語モデルのパラメータをバックプロパゲーションで更新する。 In step S132, the CPU 11, as the model learning unit 30, calculates a gradient from the obtained loss and updates the parameters of the language model by back propagation.
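
Steps S130 and S132 can be illustrated with a minimal sketch that uses the squared Euclidean distance as the evaluation function. The linear map in `forward` is only a stand-in for the language model, and the names `forward`, `sgd_step`, the feature vector, and the learning rate are all assumptions made for illustration.

```python
def forward(weights, features):
    """Stand-in model: maps an input feature vector to a sentence vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def squared_euclidean(u, v):
    """Loss of step S130: squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sgd_step(weights, features, target, lr=0.1):
    """Step S132: one gradient-descent update of the model parameters."""
    pred = forward(weights, features)
    # dL/dW[i][j] = 2 * (pred[i] - target[i]) * features[j]
    return [
        [w - lr * 2.0 * (p - t) * x for w, x in zip(row, features)]
        for row, p, t in zip(weights, pred, target)
    ]

weights = [[0.0, 0.0], [0.0, 0.0]]
features = [1.0, 0.5]   # hypothetical encoding of an action label group
target = [1.0, -1.0]    # hypothetical correct procedural sentence vector
loss_before = squared_euclidean(forward(weights, features), target)
weights = sgd_step(weights, features, target)
loss_after = squared_euclidean(forward(weights, features), target)
```

Because the loss is differentiable, the update moves the generated vector toward the correct one, so `loss_after` is smaller than `loss_before` for this step size.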

ステップＳ１３４で、ＣＰＵ１１は、モデル学習部３０として、更新した言語モデルを出力する。 At step S134, the CPU 11, as the model learning unit 30, outputs the updated language model.

＜第１実施形態に係る異常行動判定装置の作用＞ <Operation of the Abnormal Behavior Determination Device According to the First Embodiment>
次に、第１実施形態に係る異常行動判定装置５０の作用について説明する。 Next, the operation of the abnormal behavior determination device 50 according to the first embodiment will be described.

図７は、異常行動判定装置５０による異常行動判定処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から異常行動判定プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、異常行動判定処理が行なわれる。また、異常行動判定装置５０に、人の行動を表す映像データから検出された行動ラベル群が入力される。ここで、行動ラベル群は非特許文献３に示すような手法で検出済みとする。 FIG. 7 is a flowchart showing the flow of the abnormal behavior determination process performed by the abnormal behavior determination device 50. The abnormal behavior determination process is performed when the CPU 11 reads the abnormal behavior determination program from the ROM 12 or the storage 14, loads it into the RAM 13, and executes it. An action label group detected from video data representing human behavior is input to the abnormal behavior determination device 50. Here, the action label group is assumed to have already been detected by a method such as that of Non-Patent Document 3.

ステップＳ１４０で、ＣＰＵ１１は、入力された行動ラベル群を、生成部６２に入力する。 In step S140 , the CPU 11 inputs the input action label group to the generation unit 62 .

ステップＳ１４２で、ＣＰＵ１１は、生成部６２として、行動ラベル群の文ベクトルを生成する。 In step S142, the CPU 11, as the generation unit 62, generates a sentence vector of the action label group.

ステップＳ１４４で、ＣＰＵ１１は、行動ラベル群の文ベクトルを類似度算出部６４に入力する。 In step S144 , the CPU 11 inputs the sentence vector of the action label group to the similarity calculation unit 64 .

ステップＳ１４６で、ＣＰＵ１１は、手順文データベース６０から手順文を表す文ベクトルを各々取り出し、類似度算出部６４に入力する。 In step S146 , the CPU 11 extracts sentence vectors representing procedural sentences from the procedural sentence database 60 and inputs them to the similarity calculation unit 64 .

ここで、手順文データベース６０からの手順文の文ベクトルの取り出しには、手順の順序を考慮してもよい。例えば、一つ前の映像セグメントの行動ラベル群について最も類似度が高いと判断された手順文の文ベクトルのＩＤ等を保持しておき、当該手順文の手順の前後Ｔ個の手順文を表す文ベクトルを取り出すようにしてもよい。これにより、手順文の文ベクトルとの類似度の計算量を削減できる。 Here, the order of the procedure may be taken into account when retrieving the sentence vectors of procedural sentences from the procedure sentence database 60. For example, the ID or the like of the sentence vector of the procedural sentence judged to have the highest similarity for the action label group of the immediately preceding video segment may be retained, and only the sentence vectors representing the T procedural sentences before and after that procedural sentence in the procedure may be retrieved. This reduces the amount of computation needed to calculate the similarity with the sentence vectors of the procedural sentences.
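
The order-aware retrieval described above can be sketched as a simple index window. The sketch assumes that procedural sentences are stored in procedure order and addressed by 0-based index, and that all sentences are candidates when no previous best match exists; `candidate_indices` is a hypothetical helper name.

```python
def candidate_indices(prev_best, window, num_sentences):
    """Indices of procedural sentences to compare against.

    prev_best: index of the best-matching sentence for the previous
               video segment, or None for the first segment.
    window:    T, the number of sentences taken before and after it.
    """
    if prev_best is None:
        return list(range(num_sentences))
    lo = max(0, prev_best - window)
    hi = min(num_sentences - 1, prev_best + window)
    return list(range(lo, hi + 1))
```

With T = 2 and ten procedural sentences, a previous best match at index 4 limits the comparison to indices 2 through 6, so at most 2T + 1 similarities are computed per segment.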

ステップＳ１４８で、ＣＰＵ１１は、類似度算出部６４として、取り出した手順文を表す文ベクトルの各々について、当該手順文を表す文ベクトルと行動ラベル群の文ベクトルとの類似度を算出する。例えば、類似度の指標としてコサイン類似度を算出する。なお、類似度の指標はコサイン類似度でなくともベクトル間の類似度が比較できればよい。 In step S148, the CPU 11, as the similarity calculation unit 64, calculates the degree of similarity between the sentence vector representing the procedural sentence and the sentence vector of the action label group for each sentence vector representing the extracted procedural sentence. For example, cosine similarity is calculated as an index of similarity. It should be noted that the index of similarity may not be cosine similarity as long as similarity between vectors can be compared.
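
As a reference for step S148, cosine similarity between two sentence vectors can be computed as in the sketch below. Returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero; the description does not address that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar
    return dot / (norm_u * norm_v)
```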

ステップＳ１５０で、ＣＰＵ１１は、類似度算出部６４として、算出した類似度を異常判定部６６へ出力する。 In step S150 , the CPU 11 , acting as the similarity calculation unit 64 , outputs the calculated similarity to the abnormality determination unit 66 .

ステップＳ１５２で、ＣＰＵ１１は、異常判定部６６として、類似度と閾値とを比較し、人の行動が、手順と異なり異常であるか否かを判定し、判定結果を示すラベルを出力し、異常行動判定処理を終了する。 In step S152, the CPU 11, as the abnormality determination unit 66, compares the similarity with a threshold, determines whether the person's behavior deviates from the procedure and is abnormal, outputs a label indicating the determination result, and ends the abnormal behavior determination process.

上記ステップＳ１４０、Ｓ１４２の処理における詳細動作を図８に示す。図８に示す動作は、映像セグメントごとの行動ラベル群について繰り返し行われる。なお、映像セグメントのセグメント長は、アプリケーションに依存する。 FIG. 8 shows detailed operations in the processing of steps S140 and S142. The operation shown in FIG. 8 is repeatedly performed for the action label group for each video segment. Note that the segment length of the video segment depends on the application.

ステップＳ１６０で、ＣＰＵ１１は、行動ラベル群を生成部６２に入力する。 In step S160 , the CPU 11 inputs the action label group to the generation unit 62 .

ステップＳ１６２で、ＣＰＵ１１は、生成部６２として、行動ラベル群の各行動ラベルを＜人ラベル動作ラベル物体ラベル＞の順に並べ替える。例えば、一つの映像セグメントの行動ラベル群に２つの人ラベルがある場合、＜人ラベル１動作ラベル１物体ラベル１人ラベル２動作ラベル２物体ラベル２＞と並べる。動作と物体の組み合わせについて、動作ラベルと物体ラベルの有り得る組み合わせをあらかじめテーブルとして保持しておき、そのテーブルを参照して行動ラベルを決定する。また、その他、映像特徴等を考慮してもよい。 In step S162, the CPU 11, as the generation unit 62, rearranges each action label in the action label group in the order <person label action label object label>. For example, when there are two person labels in the action label group of one video segment, they are arranged as <person label 1, action label 1, object label 1, person label 2, action label 2, object label 2>. With regard to combinations of actions and objects, possible combinations of action labels and object labels are stored in advance as a table, and action labels are determined by referring to the table. In addition, image features and the like may be taken into consideration.
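
The ordering of step S162 can be sketched as below. The sketch assumes each detection is already a dictionary with `person`, `action`, and `object` fields, and it interprets the combination table as a filter that discards action and object pairs not listed in it. Both the data layout and the filtering behavior are assumptions; the description only states that the table is referenced when determining the action labels.

```python
# Hypothetical table of plausible (action, object) combinations.
VALID_PAIRS = {("hold", "cup"), ("open", "door")}

def order_labels(detections, valid_pairs=VALID_PAIRS):
    """Flatten detections into <person action object> order, one triple per person."""
    tokens = []
    for det in detections:
        if (det["action"], det["object"]) not in valid_pairs:
            continue  # assumption: skip combinations absent from the table
        tokens.extend([det["person"], det["action"], det["object"]])
    return tokens

detections = [
    {"object": "cup", "person": "person1", "action": "hold"},
    {"action": "open", "object": "door", "person": "person2"},
    {"person": "person3", "action": "open", "object": "cup"},
]
sequence = order_labels(detections)
```

For two valid persons the result has the form person 1, action 1, object 1, person 2, action 2, object 2, matching the arrangement described for a segment with two person labels.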

並べ替えた行動ラベル群をＢＥＲＴのような学習済みの言語モデルに入力して文ベクトルを得る。本実施形態では、行動ラベル群は、ＢＥＲＴの入力長上限におさまる系列長（５１２単語）の範囲で、＜行動ラベル１＞＜行動ラベル２＞…といった形で連結して入力を行う。そこで、ステップＳ１６４で、ＣＰＵ１１は、生成部６２として、行動ラベル群が学習済みの言語モデルの入力長上限より長いか否かを判定する。行動ラベル群が学習済みの言語モデルの入力長上限より長い場合には、ステップＳ１６６へ移行する。一方、行動ラベル群が学習済みの言語モデルの入力長上限以下である場合には、ステップＳ１６８へ移行する。 The rearranged action label group is input to a trained language model such as BERT to obtain a sentence vector. In this embodiment, the action labels are concatenated in the form <action label 1><action label 2>... and input within a sequence length (512 words) that fits within the input length upper limit of BERT. Accordingly, in step S164, the CPU 11, as the generation unit 62, determines whether the action label group is longer than the input length upper limit of the trained language model. If the action label group is longer than the input length upper limit of the trained language model, the process proceeds to step S166. Otherwise, when the action label group is equal to or shorter than the input length upper limit of the trained language model, the process proceeds to step S168.

ステップＳ１６６では、ＣＰＵ１１は、生成部６２として、行動ラベル群における各行動ラベルの出現頻度を集計し、出現頻度順に各行動ラベルを並べ、学習済みの言語モデルの入力長上限におさまるように、上位Ｎ個の行動ラベルを抽出し、抽出された行動ラベル以外の行動ラベルを削除し、新たな行動ラベル群とする。 In step S166, the CPU 11, as the generation unit 62, counts the appearance frequency of each action label in the action label group, sorts the action labels by appearance frequency, extracts the top N action labels so that they fit within the input length upper limit of the trained language model, deletes all action labels other than the extracted ones, and uses the result as a new action label group.

ステップＳ１６８で、ＣＰＵ１１は、生成部６２として、並び替えた行動ラベル群を学習済みの言語モデルへと入力し、文ベクトルを生成する。 In step S168, the CPU 11, as the generation unit 62, inputs the rearranged action label group to the learned language model to generate a sentence vector.

ステップＳ１７０で、ＣＰＵ１１は、生成部６２として、得られた文ベクトルを出力する。 At step S170, the CPU 11, as the generator 62, outputs the obtained sentence vector.

上記ステップＳ１５０、Ｓ１５２の処理において、異常行動を判定する方法は２通りある。一方の方法では、手順文の文ベクトルとの類似度が、取り出した全ての手順文に対して低い場合、どの手順文にも該当していない行動として異常行動であると判定する。 There are two methods for determining abnormal behavior in the processing of steps S150 and S152. In the first method, when the similarity with the sentence vector of the procedural sentence is low for every retrieved procedural sentence, the behavior is regarded as matching none of the procedural sentences and is determined to be abnormal.
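
The first method amounts to checking whether even the best match falls below a threshold. A minimal sketch follows; the function name and the threshold value are illustrative assumptions, and "low" is interpreted as strictly below the threshold.

```python
def is_abnormal_simple(similarities, threshold):
    """True when every retrieved procedural sentence has low similarity.

    similarities: similarity of the current action label group to each
                  retrieved procedural sentence (non-empty list).
    """
    return max(similarities) < threshold

# Hypothetical similarities against three procedural sentences.
matches_a_step = is_abnormal_simple([0.12, 0.81, 0.30], threshold=0.5)
matches_nothing = is_abnormal_simple([0.12, 0.21, 0.30], threshold=0.5)
```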

他方の方法では、手順の順序を考慮する。手順の順序を考慮する場合の、上記ステップＳ１５０、Ｓ１５２の処理における詳細動作を図９に示す。 The other method considers the order of steps. FIG. 9 shows detailed operations in the processing of steps S150 and S152 when the order of procedures is considered.

ステップＳ１８０で、ＣＰＵ１１は、算出した類似度を高い順に上位Ｍ個、異常判定部６６に入力する。 In step S180 , the CPU 11 inputs the top M calculated similarities in descending order to the abnormality determination unit 66 .

ステップＳ１８２で、ＣＰＵ１１は、異常判定部６６として、入力された行動ラベル群が、映像の先頭から取り出した映像セグメント（以降先頭セグメントと表記）のものであるか否かを判定する。入力された行動ラベル群が、映像の先頭セグメントのものであった場合、ステップＳ１８８へ移行する。一方、入力された行動ラベル群が、映像の先頭セグメントのものでない場合、ステップＳ１８４へ移行する。 In step S182, the CPU 11, as the abnormality determination unit 66, determines whether or not the input action label group belongs to the video segment extracted from the beginning of the video (hereinafter referred to as the leading segment). If the input action label group is for the leading segment of the video, the process proceeds to step S188. On the other hand, if the input action label group is not the leading segment of the video, the process proceeds to step S184.

ステップＳ１８４では、ＣＰＵ１１は、異常判定部６６として、入力された類似度Ｍ個の中に、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度があるかどうかを判定する。入力された類似度Ｍ個の中に、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度がある場合には、ステップＳ１８６へ移行する。一方、入力された類似度Ｍ個の中に、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度がない場合には、ステップＳ１８８へ移行する。 In step S184, the CPU 11, as the abnormality determination unit 66, determines whether the M input similarities include a similarity computed against the same sentence vector as that of the procedural sentence that had the highest similarity for the immediately preceding video segment. If the M input similarities include such a similarity, the process proceeds to step S186. If they do not, the process proceeds to step S188.

ステップＳ１８６では、ＣＰＵ１１は、異常判定部６６として、入力された類似度Ｍ個のうちの、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度が、一つ前の映像セグメントについての同じ類似度より高いか否かを判定する。入力された類似度Ｍ個のうちの、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度が、一つ前の映像セグメントについての同じ類似度より高い場合には、ステップＳ１９２へ移行する。一方、入力された類似度Ｍ個のうちの、一つ前の映像セグメントについて類似度が最も高かった手順文の文ベクトルと同じ文ベクトルと比較した類似度が、一つ前の映像セグメントについての同じ類似度以下である場合には、ステップＳ１８８へ移行する。 In step S186, the CPU 11, as the abnormality determination unit 66, determines whether, among the M input similarities, the similarity computed against the same sentence vector as that of the procedural sentence that had the highest similarity for the immediately preceding video segment is higher than the corresponding similarity for the immediately preceding video segment. If that similarity is higher than the corresponding similarity for the immediately preceding video segment, the process proceeds to step S192. If it is equal to or lower than the corresponding similarity for the immediately preceding video segment, the process proceeds to step S188.

ステップＳ１８８では、入力された類似度Ｍ個のうちの最も高い類似度が、閾値より低いか否かを判定する。入力された類似度Ｍ個のうちの最も高い類似度が、閾値より低い場合には、ステップＳ１９０へ移行する。一方、入力された類似度Ｍ個のうちの最も高い類似度が、閾値以上である場合には、ステップＳ１９２へ移行する。 In step S188, it is determined whether or not the highest similarity among the input M similarities is lower than a threshold. If the highest similarity among the M similarities that are input is lower than the threshold, the process proceeds to step S190. On the other hand, if the highest similarity among the M similarities that are input is greater than or equal to the threshold, the process proceeds to step S192.

ステップＳ１９０では、ＣＰＵ１１は、異常判定部６６として、人の行動が異常であることを示すラベルを出力する。ステップＳ１９２では、ＣＰＵ１１は、異常判定部６６として、人の行動が正常であることを示すラベルを出力する。 In step S190, the CPU 11, as the abnormality determination unit 66, outputs a label indicating that the human behavior is abnormal. In step S192, the CPU 11, as the abnormality determination unit 66, outputs a label indicating that the behavior of the person is normal.
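
The order-aware decision flow of steps S182 to S192 can be sketched as a single function. The representation of the top M similarities as a dictionary keyed by procedural sentence ID, and the string return values, are assumptions made for illustration.

```python
def judge_segment(top_m, is_first_segment, prev_best_id, prev_best_sim, threshold):
    """Return "normal" or "abnormal" for one video segment.

    top_m:         dict mapping procedural sentence ID -> similarity (top M).
    prev_best_id:  ID of the best-matching sentence for the previous segment.
    prev_best_sim: its similarity for the previous segment.
    """
    # S182 and S184: skip the continuity check for the first segment, or when
    # the previous best sentence is not among the current top M.
    if not is_first_segment and prev_best_id in top_m:
        # S186: the match with the same procedural sentence has improved.
        if top_m[prev_best_id] > prev_best_sim:
            return "normal"      # S192
    # S188: fall back to comparing the best similarity with the threshold.
    if max(top_m.values()) < threshold:
        return "abnormal"        # S190
    return "normal"              # S192
```

Note that a segment can be judged normal even when every similarity is below the threshold, provided its match with the previously best procedural sentence is still increasing.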

以上説明したように、第１実施形態に係る学習装置は、手順に含まれる行動を表す学習用の映像データから、人の行動に関する行動ラベルを検出し、学習用の映像データに対して予め定められた手順文を表す文ベクトルを生成するように、行動ラベルから文ベクトルを生成する言語モデルを学習する。これにより、手順とは異なる異常行動を精度よく判定するための言語モデルを学習することができる。 As described above, the learning device according to the first embodiment detects action labels related to human actions from video data for learning that represents actions included in a procedure, and trains a language model that generates a sentence vector from the action labels so that it generates the sentence vector representing the procedural sentence predetermined for the video data for learning. This makes it possible to train a language model for accurately determining abnormal behavior that deviates from the procedure.

また、第１実施形態に係る異常行動判定装置は、人の行動を表す映像データから検出された、人の行動に関する行動ラベルに基づいて、行動ラベルから文ベクトルを生成する予め学習された言語モデルを用いて、文ベクトルを生成し、生成された文ベクトルと、手順文データベースに記憶された手順文を表す文ベクトルとの類似度を算出し、算出された類似度に基づいて、人の行動が異常であるか否かを判定する。これにより、映像で撮影された手順のある作業において、手順とは異なる異常行動を精度よく判定することができる。 Further, the abnormal behavior determination device according to the first embodiment generates a sentence vector from action labels related to human behavior, detected from video data representing that behavior, using a pre-trained language model that generates a sentence vector from action labels. It calculates the similarity between the generated sentence vector and the sentence vectors representing the procedural sentences stored in the procedure sentence database, and determines whether the person's behavior is abnormal based on the calculated similarity. As a result, for procedural work captured on video, abnormal behavior that deviates from the procedure can be determined accurately.

［第２実施形態］ [Second Embodiment]
次に、第２実施形態について説明する。なお、第１実施形態と同様の構成となる部分については、同一符号を付して説明を省略する。 Next, a second embodiment will be described. Parts having the same configuration as in the first embodiment are denoted by the same reference numerals, and their description is omitted.

第２実施形態では、文ベクトルの類似度を用いるのではなく、行動ラベル群から手順文を生成し、手順文の類似度を算出して、人の行動が、手順と異なり異常であるか否かを判定する点が、第１実施形態と異なっている。 The second embodiment differs from the first embodiment in that, instead of using the similarity between sentence vectors, a procedural sentence is generated from the action label group and the similarity between procedural sentences is calculated to determine whether the person's behavior deviates from the procedure and is abnormal.

＜第２実施形態の概要＞ <Overview of the Second Embodiment>
第２実施形態では、行動ラベル群を入力として文を生成し、生成した文と手順文を入力として類似度を算出し、算出した類似度を入力として人の行動が異常であるか否かを判定する。 In the second embodiment, a sentence is generated with the action label group as input, a similarity is calculated with the generated sentence and the procedural sentences as input, and whether the person's behavior is abnormal is determined with the calculated similarity as input.

＜第２実施形態に係る学習装置の構成＞ <Configuration of the Learning Device According to the Second Embodiment>
第２実施形態の学習装置は、文ベクトル生成部２２が不要な点以外は、上記第１実施形態の学習装置１０と同様であるため、同一符号を付して説明を省略する。 The learning device of the second embodiment is the same as the learning device 10 of the first embodiment except that the sentence vector generation unit 22 is not required, so the same reference numerals are used and its description is omitted.

第２実施形態の学習装置１０の手順文データベース２４は、作業マニュアルに含まれる複数の手順文の各々を記憶している。 The procedure sentence database 24 of the learning device 10 of the second embodiment stores each of a plurality of procedure sentences included in the work manual.

モデル学習部３０は、学習用の映像データに対して予め定められた手順文を生成するように、行動ラベルから文を生成する言語モデルを学習する。 The model learning unit 30 learns a language model that generates sentences from action labels so as to generate predetermined procedural sentences for video data for learning.

具体的には、モデル学習部３０は、ラベル検出部２８より学習用の映像データから検出された行動ラベル群を受け取る。また、モデル学習部３０は、手順文データベース２４より、学習用の映像データに対応付けられた正解となる手順文を取り出す。モデル学習部３０は、学習用の映像データから検出された行動ラベル群から、学習対象である言語モデルを用いて、文を生成する。例えば、生成される文は、各単語の発生確率で表される。モデル学習部３０は、正解となる手順文と、生成した文とを比較して評価し、正解となる手順文と、生成した文とが一致するように、言語モデルを更新して言語モデルを出力する。ここで言語モデルは、非特許文献５のように翻訳等で用いられるＥｎｃｏｄｅｒ－Ｄｅｃｏｄｅｒ型のニューラルネットワークのモデルなどである。 Specifically, the model learning unit 30 receives action labels detected from the video data for learning from the label detection unit 28 . In addition, the model learning unit 30 extracts a correct procedural sentence associated with the video data for learning from the procedural sentence database 24 . The model learning unit 30 generates sentences from the action label group detected from the video data for learning, using the language model to be learned. For example, the generated sentence is represented by the occurrence probability of each word. The model learning unit 30 compares and evaluates the procedural sentence that is the correct answer and the generated sentence, and updates the language model so that the procedural sentence that is the correct answer matches the generated sentence. Output. Here, the language model is, for example, an Encoder-Decoder type neural network model used for translation, as in Non-Patent Document 5.

［非特許文献５］Jinhua Zhu and Yingce Xia and Lijun Wu and Di He and Tao Qin and Wengang Zhou and Houqiang Li and Tieyan Liu. Incorporating BERT into Neural Machine Translation. ICLR2020. [Non-Patent Document 5] Jinhua Zhu and Yingce Xia and Lijun Wu and Di He and Tao Qin and Wengang Zhou and Houqiang Li and Tieyan Liu. Incorporating BERT into Neural Machine Translation. ICLR2020.

＜第２実施形態に係る異常行動判定装置の構成＞ <Configuration of the Abnormal Behavior Determination Device According to the Second Embodiment>
第２実施形態の異常行動判定装置は、上記第１実施形態の異常行動判定装置５０と同様であるため、同一符号を付して説明を省略する。 Since the abnormal behavior determination device of the second embodiment is the same as the abnormal behavior determination device 50 of the first embodiment, the same reference numerals are used and its description is omitted.

手順文データベース６０には、手順文データベース２４と同様に、作業マニュアルに含まれる複数の手順文の各々を記憶している。 Like the procedure sentence database 24, the procedure sentence database 60 stores each of a plurality of procedure sentences included in the work manual.

生成部６２は、入力された行動ラベル群に基づいて、学習装置１０により予め学習された言語モデルを用いて、文を生成する。 The generation unit 62 generates a sentence based on the input action label group using a language model pre-learned by the learning device 10 .

類似度算出部６４は、生成部６２により生成された文と、手順文データベース６０に記憶された手順文との類似度を算出する。 The similarity calculator 64 calculates the similarity between the sentence generated by the generator 62 and the procedural sentences stored in the procedural sentence database 60 .

異常判定部６６は、類似度算出部６４によって算出された類似度に基づいて、人の行動が、手順と異なり異常であるか否かを判定する。 Based on the degree of similarity calculated by the degree-of-similarity calculation unit 64, the abnormality determination unit 66 determines whether or not the human behavior is abnormal unlike the procedure.

＜第２実施形態に係る学習装置の作用＞ <Operation of the Learning Device According to the Second Embodiment>
次に、第２実施形態に係る学習装置１０の作用について説明する。なお、第１実施形態と同様の処理については同一符号を付して詳細な説明を省略する。 Next, the operation of the learning device 10 according to the second embodiment will be described. Processing that is the same as in the first embodiment is given the same reference numerals, and its detailed description is omitted.

図１０は、学習装置１０による学習処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から学習プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、学習処理が行なわれる。また、学習装置１０に、学習用の映像データが複数入力され、学習用データベース２６に格納される。 FIG. 10 is a flowchart showing the flow of the learning process performed by the learning device 10. The learning process is performed when the CPU 11 reads the learning program from the ROM 12 or the storage 14, loads it into the RAM 13, and executes it. A plurality of video data items for learning are input to the learning device 10 and stored in the learning database 26.

ステップＳ２００で、ＣＰＵ１１は、モデル学習部３０として、手順文データベース２４より、複数の学習用の映像データの各々に対応付けられた正解となる手順文を取り出す。ＣＰＵ１１は、複数の学習用の映像データの各々から検出された行動ラベル群から、学習対象である言語モデルを用いて、文を生成する。ＣＰＵ１１は、正解となる手順文と、生成した文とを比較して評価し、正解となる手順文と、生成した文とが一致するように、言語モデルを更新する。 In step S200, the CPU 11, as the model learning unit 30, retrieves from the procedure sentence database 24 the correct procedure sentence associated with each of the plurality of video data for learning. The CPU 11 generates sentences from action labels detected from each of a plurality of video data for learning, using a language model to be learned. The CPU 11 compares and evaluates the correct procedure sentence and the generated sentence, and updates the language model so that the correct procedure sentence and the generated sentence match.

ステップＳ１１８で、ＣＰＵ１１は、モデル学習部３０として、更新した言語モデルを出力する。 In step S118, the CPU 11, as the model learning unit 30, outputs the updated language model.

上記ステップＳ１１４、Ｓ２００の処理における詳細動作を図１１に示す。 FIG. 11 shows detailed operations in the processing of steps S114 and S200.

ステップＳ１２２で、ＣＰＵ１１は、モデル学習部３０として、行動ラベル群の各行動ラベルを＜人ラベル動作ラベル物体ラベル＞の順に並べ替える。 In step S122, the CPU 11, as the model learning unit 30, rearranges each action label in the action label group in the order of <person label, action label, object label>.

ステップＳ１２４で、ＣＰＵ１１は、モデル学習部３０として、行動ラベル群が、学習対象の言語モデルの入力長上限より長いかを判断する。行動ラベル群が、学習対象の言語モデルの入力長上限より長い場合には、ステップＳ１２６へ移行する。一方、行動ラベル群が、学習対象の言語モデルの入力長上限以下である場合には、ステップＳ２１０へ移行する。 In step S124, the CPU 11, as the model learning unit 30, determines whether the action label group is longer than the upper limit of input length of the language model to be learned. If the action label group is longer than the upper limit of the input length of the language model to be learned, the process proceeds to step S126. On the other hand, when the action label group is equal to or less than the upper limit of the input length of the language model to be learned, the process proceeds to step S210.

ステップＳ２１０で、ＣＰＵ１１は、モデル学習部３０として、並び替えた行動ラベル群を学習対象の言語モデルへと入力し、文を生成する。 In step S210, the CPU 11, as the model learning unit 30, inputs the rearranged action label group to the learning target language model to generate a sentence.

ステップＳ２１２で、ＣＰＵ１１は、モデル学習部３０として、生成された文と、手順文データベース２４から取得した、当該学習用の映像データに紐づけられた正解の手順文との間で、評価関数による損失を算出する。具体的には、生成した文が表す各単語の出現確率と、正解の手順文の各単語とがどの程度異なるかを評価する評価関数による損失を算出する。例えば、評価関数により、正解の手順文の各単語の出現確率が１に近いほど損失が低く、正解の手順文に含まれない各単語の出現確率が０に近いほど損失が低くなるように損失を計算する。 In step S212, the CPU 11, as the model learning unit 30, calculates the loss, using an evaluation function, between the generated sentence and the correct procedural sentence that is linked to the video data for learning and acquired from the procedure sentence database 24. Specifically, the loss is calculated by an evaluation function that evaluates how much the appearance probability of each word represented by the generated sentence differs from each word of the correct procedural sentence. For example, the evaluation function computes the loss so that it becomes lower as the appearance probability of each word in the correct procedural sentence approaches 1, and lower as the appearance probability of each word not included in the correct procedural sentence approaches 0.
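
One evaluation function with the property described in step S212 is the cross-entropy between the per-position word distributions and the correct words. The description does not name a specific function, so the choice of cross-entropy, the dictionary representation of word probabilities, and the name `sentence_cross_entropy` are assumptions of this sketch.

```python
import math

def sentence_cross_entropy(word_probs, target_words):
    """Mean negative log-probability assigned to the correct words.

    word_probs:   one dict per position, mapping word -> probability
                  (each dict is assumed to sum to 1).
    target_words: the words of the correct procedural sentence.
    """
    eps = 1e-12  # guards against log(0) when a correct word has no mass
    total = 0.0
    for probs, word in zip(word_probs, target_words):
        total += -math.log(max(probs.get(word, 0.0), eps))
    return total / len(target_words)

target = ["open", "door"]
good = [{"open": 0.9, "close": 0.1}, {"door": 0.8, "cup": 0.2}]
bad = [{"open": 0.2, "close": 0.8}, {"door": 0.3, "cup": 0.7}]
loss_good = sentence_cross_entropy(good, target)
loss_bad = sentence_cross_entropy(bad, target)
```

Because each distribution sums to 1, raising the probability of the correct word necessarily lowers the probability of the other words, so `loss_good` is smaller than `loss_bad`.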

＜第２実施形態に係る異常行動判定装置の作用＞ <Operation of the Abnormal Behavior Determination Device According to the Second Embodiment>
次に、第２実施形態に係る異常行動判定装置５０の作用について説明する。 Next, the operation of the abnormal behavior determination device 50 according to the second embodiment will be described.

図１２は、異常行動判定装置５０による異常行動判定処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から異常行動判定プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、異常行動判定処理が行なわれる。また、異常行動判定装置５０に、人の行動を表す映像データから検出された行動ラベル群が入力される。ここで、行動ラベル群は非特許文献３に示すような手法で検出済みとする。 FIG. 12 is a flow chart showing the flow of abnormal behavior determination processing by the abnormal behavior determination device 50 . The CPU 11 reads out an abnormal behavior determination program from the ROM 12 or the storage 14, develops it in the RAM 13, and executes the abnormal behavior determination process. Also, a group of behavior labels detected from video data representing human behavior is input to the abnormal behavior determination device 50 . Here, it is assumed that the action label group has already been detected by the method described in Non-Patent Document 3.

ステップＳ２２０で、ＣＰＵ１１は、入力された行動ラベル群を、生成部６２に入力する。そして、ＣＰＵ１１は、生成部６２として、行動ラベル群から文を生成する。 In step S220 , the CPU 11 inputs the input action label group to the generation unit 62 . Then, the CPU 11, as the generation unit 62, generates a sentence from the action label group.

ステップＳ２２２で、ＣＰＵ１１は、生成した文を類似度算出部６４に入力する。 In step S222 , the CPU 11 inputs the generated sentence to the similarity calculation unit 64 .

ステップＳ２２４で、ＣＰＵ１１は、手順文データベース６０から手順文を各々取り出し、類似度算出部６４に入力する。ここで、手順文データベース６０からの手順文の取り出しには、手順の順序を考慮してもよい。例えば、一つ前の映像セグメントの行動ラベル群について最も類似度が高いと判断された手順文のＩＤ等を保持しておき、当該手順文の手順の前後Ｔ個の手順文を取り出すようにしてもよい。これにより、手順文との類似度の計算量を削減できる。 In step S224, the CPU 11 retrieves each procedural sentence from the procedure sentence database 60 and inputs it to the similarity calculation unit 64. Here, the order of the procedure may be taken into account when retrieving procedural sentences from the procedure sentence database 60. For example, the ID or the like of the procedural sentence judged to have the highest similarity for the action label group of the immediately preceding video segment may be retained, and only the T procedural sentences before and after that procedural sentence in the procedure may be retrieved. This reduces the amount of computation needed to calculate the similarity with the procedural sentences.

ステップＳ２２６で、ＣＰＵ１１は、類似度算出部６４として、取り出した手順文の各々について、当該手順文と生成した文との類似度を算出する。なお、類似度の指標としては、機械翻訳で使用される非特許文献６のＢＬＥＵや非特許文献７のＲＯＵＧＥ等を用いればよく、その他、文間の類似度が測定できる指標を用いてもよい。 In step S226, the CPU 11, as the similarity calculation unit 64, calculates, for each retrieved procedural sentence, the similarity between that procedural sentence and the generated sentence. As the similarity index, BLEU of Non-Patent Document 6 or ROUGE of Non-Patent Document 7, which are used in machine translation, may be used, and any other index that can measure the similarity between sentences may also be used.

［非特許文献６］Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. BLEU: a method for automatic evaluation of machine translation. ACL2002. [Non-Patent Document 6] Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. BLEU: a method for automatic evaluation of machine translation. ACL2002.

［非特許文献７］Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop: Text Summarization Branches Out, 2004. [Non-Patent Document 7] Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop: Text Summarization Branches Out, 2004.
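
To make the sentence-level comparison of step S226 concrete, the sketch below computes a unigram-overlap F1 score. This is a deliberately simplified stand-in and is not BLEU or ROUGE as defined in Non-Patent Documents 6 and 7 (it has no higher-order n-grams and no brevity penalty); whitespace tokenization and the name `unigram_f1` are assumptions.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 score of the word overlap between two sentences, in [0, 1]."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical generated sentence compared with a procedural sentence.
score = unigram_f1("open the door", "close the door")
```

In practice an established implementation of BLEU or ROUGE would be preferable, since these metrics also account for word order through longer n-grams.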

上記ステップＳ２００の処理における詳細動作を図１３に示す。図１３に示す動作は、映像セグメントごとの行動ラベル群について繰り返し行われる。なお、映像セグメントのセグメント長は、アプリケーションに依存する。 FIG. 13 shows detailed operations in the process of step S200. The operation shown in FIG. 13 is repeatedly performed for the action label group for each video segment. Note that the segment length of the video segment depends on the application.

ステップＳ１６２で、ＣＰＵ１１は、生成部６２として、行動ラベル群の各行動ラベルを＜人ラベル動作ラベル物体ラベル＞の順に並べ替える。 In step S162, the CPU 11, as the generation unit 62, rearranges each action label in the action label group in the order <person label action label object label>.

ステップＳ１６４で、ＣＰＵ１１は、生成部６２として、行動ラベル群が学習済みの言語モデルの入力長上限より長いか否かを判定する。行動ラベル群が学習済みの言語モデルの入力長上限より長い場合には、ステップＳ１６６へ移行する。一方、行動ラベル群が学習済みの言語モデルの入力長上限以下である場合には、ステップＳ２３０へ移行する。 In step S164, the CPU 11, as the generation unit 62, determines whether or not the action label group is longer than the input length upper limit of the learned language model. If the action label group is longer than the upper limit of the input length of the learned language model, the process proceeds to step S166. On the other hand, when the action label group is equal to or less than the upper limit of the input length of the learned language model, the process proceeds to step S230.

In step S166, the CPU 11, as the generation unit 62, counts the appearance frequency of each action label in the action label group, sorts the action labels in order of appearance frequency, extracts the top N action labels so that they fit within the input length upper limit of the learned language model, deletes all action labels other than the extracted ones, and uses the result as a new action label group.
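Steps S164 and S166 can be sketched as below. This is a minimal illustration under stated assumptions: the input length upper limit is counted in action labels (a real language model would more likely count tokens), each action label is a hashable value such as a string or tuple, and the function name `truncate_label_group` is hypothetical.

```python
from collections import Counter


def truncate_label_group(label_group, max_input_len):
    """Keep the most frequent action labels so the group fits the model input.

    `label_group` is a list of hashable action labels, one entry per detection.
    `max_input_len` is the assumed input length upper limit, counted in labels.
    """
    # Step S164: a group within the limit is passed on unchanged.
    if len(label_group) <= max_input_len:
        return list(label_group)
    # Step S166: count occurrences, rank by frequency, keep only the top N.
    ranked = Counter(label_group).most_common(max_input_len)
    return [label for label, _count in ranked]
```

Here N equals `max_input_len`; labels outside the top N are dropped, and the surviving labels form the new action label group passed to step S230.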

In step S230, the CPU 11, as the generation unit 62, inputs the rearranged action label group into the learned language model and generates a sentence.

In step S232, the CPU 11, as the generation unit 62, outputs the obtained sentence.

As described above, the learning device according to the second embodiment detects action labels related to human actions from learning video data representing actions included in a procedure, and learns a language model that generates a sentence from the action labels so as to generate the procedural sentence predetermined for the learning video data. This makes it possible to learn a language model for accurately determining abnormal behavior that differs from the procedure.

The abnormal behavior determination device according to the second embodiment generates a sentence from action labels related to human actions, which are detected from video data representing the human actions, using a pre-learned language model that generates a sentence from action labels. It then calculates the similarity between the generated sentence and the procedural sentences stored in the procedural sentence database, and determines whether the human behavior is abnormal based on the calculated similarity. As a result, for work that follows a procedure and is captured on video, abnormal behavior that differs from the procedure can be determined accurately.
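The final determination can be sketched as a simple threshold test. The passage above states only that the determination is based on the calculated similarity; the rule used here (abnormal when even the best-matching procedural sentence scores below a threshold), the default threshold of 0.5, and the name `is_abnormal` are all assumptions for illustration.

```python
def is_abnormal(similarity_scores, threshold=0.5):
    """Flag behavior as abnormal when no procedural sentence is similar enough.

    `similarity_scores` holds one score per procedural sentence compared with
    the generated sentence. The max-below-threshold rule is an assumed
    decision rule, not one confirmed by the text.
    """
    if not similarity_scores:
        # With nothing to compare against, treat the behavior as abnormal.
        return True
    return max(similarity_scores) < threshold
```

In practice the threshold would have to be tuned on data from the target procedure, since the scale of the scores depends on the similarity index chosen.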

<Modification>
The present invention is not limited to the above-described embodiments, and various modifications and applications are possible without departing from the gist of the invention.

For example, action labels may be detected from audio data instead of video data. In this case, a combination of a motion label and a person label detected from the audio data may be used as the action label. Action labels may also be detected from a combination of video data and audio data. In this case, a combination of a motion label and a person label detected from the audio data and an object label detected from the video data may be used as the action label.

Although the case where the learning device and the abnormal behavior determination device are configured as separate devices has been described as an example, the present invention is not limited to this, and the learning device and the abnormal behavior determination device may be configured as a single device.

The various processes that the CPU executes by reading software (programs) in each of the above embodiments may be executed by various processors other than the CPU. Examples of such processors include a GPU (Graphics Processing Unit), a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit). The learning process and the abnormal behavior determination process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.

In each of the above embodiments, the learning program and the abnormal behavior determination program are stored (installed) in the storage 14 in advance, but the present invention is not limited to this. The programs may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The programs may also be downloaded from an external device via a network.

The following appendices are further disclosed with respect to the above embodiments.

(Appendix 1)
A learning device including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences, the learning device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
detect action labels related to human actions from learning video data or audio data representing actions included in the procedure; and
learn a language model that generates a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the learning video data or audio data, or a sentence vector representing the procedural sentence.

(Appendix 2)
A non-transitory storage medium storing a program executable by a computer to perform a learning process, the computer including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences,
wherein the learning process includes:
detecting action labels related to human actions from learning video data or audio data representing actions included in the procedure; and
learning a language model that generates a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the learning video data or audio data, or a sentence vector representing the procedural sentence.

(Appendix 3)
An abnormal behavior determination device including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences, the abnormal behavior determination device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
generate a sentence or a sentence vector based on action labels related to human actions detected from video data or audio data representing the human actions, using a pre-learned language model that generates a sentence or a sentence vector from the action labels;
calculate the similarity between the generated sentence or sentence vector and the procedural sentence stored in the procedural sentence database, or the sentence vector representing the procedural sentence; and
determine whether the behavior of the person is abnormal based on the calculated similarity.

(Appendix 4)
A non-transitory storage medium storing a program executable by a computer to perform an abnormal behavior determination process, the computer including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences,
wherein the abnormal behavior determination process includes:
generating a sentence or a sentence vector based on action labels related to human actions detected from video data or audio data representing the human actions, using a pre-learned language model that generates a sentence or a sentence vector from the action labels;
calculating the similarity between the generated sentence or sentence vector and the procedural sentence stored in the procedural sentence database, or the sentence vector representing the procedural sentence; and
determining whether the behavior of the person is abnormal based on the calculated similarity.

10 Learning device
11 CPU
14 Storage
20 Work manual database
22 Sentence vector generation unit
24 Procedural sentence database
26 Learning database
28 Label detection unit
30 Model learning unit
50 Abnormal behavior determination device
60 Procedural sentence database
62 Generation unit
64 Similarity calculation unit
66 Abnormality determination unit

Claims

1. A learning device comprising:
a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences;
a label detection unit that detects action labels related to human actions from learning video data or audio data representing actions included in the procedure; and
a model learning unit that learns a language model for generating a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the learning video data or audio data, or a sentence vector representing the procedural sentence.

2. The learning device according to claim 1, wherein the sentence vector representing the procedural sentence predetermined for the learning video data or audio data is generated using a pre-trained model for generating sentence vectors.

3. The learning device according to claim 1, wherein, when the label detection unit detects an action label group composed of a plurality of action labels, the model learning unit learns the language model so that the sentence or sentence vector generated by the language model from the concatenation of the action label group matches the procedural sentence predetermined for the learning video data or audio data, or the sentence vector representing the procedural sentence.

4. The learning device according to claim 3, wherein, when the label detection unit detects the action label group and the concatenated action label group exceeds the input length upper limit of the language model, the model learning unit learns the language model so that the sentence or sentence vector generated by the language model from a concatenation of action labels, extracted so that the input length is equal to or less than the upper limit, matches the procedural sentence predetermined for the learning video data or audio data, or the sentence vector representing the procedural sentence.

5. An abnormal behavior determination device comprising:
a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences;
a generation unit that generates a sentence or a sentence vector based on action labels related to human actions detected from video data or audio data representing the human actions, using a pre-learned language model that generates a sentence or a sentence vector from the action labels;
a similarity calculation unit that calculates the similarity between the generated sentence or sentence vector and the procedural sentence stored in the procedural sentence database, or the sentence vector representing the procedural sentence; and
an abnormality determination unit that determines whether the behavior of the person is abnormal based on the similarity calculated by the similarity calculation unit.

6. A learning method performed by a learning device including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences, the learning method comprising:
detecting action labels related to human actions from learning video data or audio data representing actions included in the procedure; and
learning a language model that generates a sentence or a sentence vector from the action labels, so as to generate the procedural sentence predetermined for the learning video data or audio data, or a sentence vector representing the procedural sentence.

7. An abnormal behavior determination method performed by an abnormal behavior determination device including a procedural sentence database that stores, for each of a series of procedures, a plurality of procedural sentences representing at least one action included in the procedure, or a sentence vector representing each of the plurality of procedural sentences, the abnormal behavior determination method comprising:
generating a sentence or a sentence vector based on action labels related to human actions detected from video data or audio data representing the human actions, using a pre-learned language model that generates a sentence or a sentence vector from the action labels;
calculating the similarity between the generated sentence or sentence vector and the procedural sentence stored in the procedural sentence database, or the sentence vector representing the procedural sentence; and
determining whether the behavior of the person is abnormal based on the calculated similarity.

8. A program for causing a computer to function as the learning device according to any one of claims 1 to 4 or the abnormal behavior determination device according to claim 5.