JP7474399B1 - An abnormal behavior recognition monitoring method based on incremental space-time learning

Info

Publication number: JP7474399B1
Application number: JP2023206529A
Authority: JP
Inventors: 秦君; 楊天国; ▲ぱん▼丁黎; 李暁敏; 呉慶升; 李宏梅; 商経鋭; 奪実祥偉; 高正剛; 楊舒舒
Original assignee: 雲南電網有限責任公司徳宏供電局
Priority date: 2023-08-31
Filing date: 2023-12-06
Publication date: 2024-04-25
Anticipated expiration: 2043-12-06
Also published as: CN117315565A

Abstract

[Problem] To provide an abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning. [Solution] The method includes: establishing a spatio-temporal model; collecting a first video surveillance image of a monitor during a first preset time period, inputting it into the spatio-temporal model, and training the model; using the trained model to identify a second video surveillance image containing abnormal behavior during a second preset time period of the monitor; and sending that image to a human for verification. If the abnormal behavior passes manual verification, it is judged to be normal, a fuzzy aggregation method constructs the abnormal behavior in the second video surveillance image as a second normal behavior, the second normal behavior is input into the spatio-temporal model, the model is retrained, and the above steps are repeated. [Selected Figure] Figure 1

Description

The present application relates to the technical field of abnormal behavior monitoring, and in particular to an abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning.

Human behavior recognition algorithms are widely used in many fields, such as evaluating the movement techniques of athletes in sports, controlling the movements of virtual characters in games, evaluating the movement capabilities of patients in medicine, and identifying human behavior in security.
The human behavior recognition algorithm utilizes sensor data and machine learning technology to automatically analyze and identify human movements and behaviors. Compared with traditional surveillance video, the human behavior recognition algorithm enables real-time and accurate unmanned monitoring. Real-time monitoring and analysis can identify potential safety risks and respond in a timely manner, thereby improving the safety and stability of substations.
Recent developments in artificial intelligence for anomaly detection in video surveillance address only part of the challenge: they ignore the nature of abnormal behavior over time, which limits the development of anomaly detection and localization for real-time video surveillance.

To solve the problem that the artificial intelligence used for anomaly detection in video surveillance ignores the nature of abnormal behavior over time, which limits the development of anomaly detection and localization for real-time video surveillance, the present application provides an abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning, the method comprising:
establishing a spatio-temporal model;
acquiring a first video surveillance image of the monitor during a first preset time period;
taking the behavior in the first video surveillance image as a first normal behavior, inputting the first normal behavior into the spatio-temporal model, and training the spatio-temporal model;
identifying, using the trained spatio-temporal model, a second video surveillance image having abnormal behavior during a second preset time period of the monitor;
sending the second video surveillance image for manual verification;
if the abnormal behavior in the second video surveillance image passes manual verification, determining the abnormal behavior to be normal and constructing the abnormal behavior in the second video surveillance image as a second normal behavior by a fuzzy aggregation method;
inputting the second normal behavior into the spatio-temporal model, retraining the spatio-temporal model, and repeating the step of identifying a second video surveillance image having abnormal behavior; and
if the abnormal behavior does not pass manual verification, determining the abnormal behavior to be non-normal and recording it.
In a possible embodiment, the first preset time and the second preset time are consecutive,
The first preset time period begins at a first time point and ends at a second time point, and the second preset time period begins at a second time point and ends at a third time point.
In a possible embodiment, the spatio-temporal model includes an input data layer, a convolutional layer,
the input data layer is used to pre-process the first video surveillance image and/or the second video surveillance image and enhance the learning ability of the spatio-temporal model;
The convolutional layer is used to analyze and learn from the first video surveillance screen and/or the second video surveillance screen.
In a possible embodiment, the step in which the input data layer pre-processes the first video surveillance image and/or the second video surveillance image comprises:
extracting the first video surveillance image and/or the second video surveillance image using a sliding window of length T;
treating the extracted first and/or second video surveillance image as consecutive frames, converting the consecutive frames to grayscale, resizing them to 224×224 pixels, and normalizing the pixel values by scaling them to the range 0 to 1; and
stacking the consecutive frames of length T to form an input time rectangle.
In a possible embodiment, the spatio-temporal model further comprises a ConvLSTM layer;
The ConvLSTM layer is used to capture spatio-temporal features from the successive frames;
The model of the ConvLSTM layer is expressed by a set of formulas [formulas not reproduced in this text].
In the formulas, one operator denotes the convolution operation and another denotes the Hadamard product; individual symbols denote the input, the cell state, and the hidden state; a further group of quantities are three-dimensional tensors; one symbol denotes the sigmoid function; and the weight terms are the two-dimensional convolution kernels in the ConvLSTM.
In a possible embodiment, the spatio-temporal model distinguishes between the normal and abnormal behaviors by an anomaly threshold, the anomaly threshold being manually selected;
When the abnormality threshold is lower, the detection sensitivity of the abnormal behavior during monitoring of the spatio-temporal model is higher, and the number of detections of the second video monitoring image having the abnormal behavior is higher;
The higher the anomaly threshold, the lower the sensitivity to detect the abnormal behavior during monitoring of the spatio-temporal model, and the fewer the number of detections of second video surveillance images having abnormal behavior.
In a possible embodiment, the manual verification determines, according to a reconstruction error, whether the abnormal behavior in the second video surveillance image passes;
The reconstruction error is expressed as a score for each input time rectangle and is used for anomaly localization, where anomaly localization means locating the specific region in a video frame where an anomaly occurs. The reconstruction error is calculated by Equation (6) and Equation (7) [equations not reproduced in this text], whose symbols include the time window and the height of the video frame.
In a possible embodiment, the step of identifying a second video surveillance screen having abnormal behavior during a second preset time period of the monitor using the trained spatio-temporal model includes:
If the spatio-temporal model detects that the reconstruction error of the input time rectangle is greater than the anomaly threshold, the spatio-temporal model classifies the input time rectangle as anomalous and identifies the second video surveillance image from the surveillance image.
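Because Equations (6) and (7) do not survive in this text, the following is only a minimal sketch of how a reconstruction error for an input time rectangle, and a per-region error for localization, might be computed. The Euclidean distance, the non-overlapping square regions, and the names `reconstruction_error`, `local_errors`, and `block` are all assumptions made for illustration and are not taken from the application.

```python
import math

def reconstruction_error(cuboid, reconstruction):
    """Euclidean distance between an input time rectangle and its reconstruction.

    Both arguments are nested lists indexed [t][y][x] with values in [0, 1].
    The L2 form is an assumption; the application's own formula is not shown.
    """
    total = 0.0
    for frame_in, frame_out in zip(cuboid, reconstruction):
        for row_in, row_out in zip(frame_in, frame_out):
            for a, b in zip(row_in, row_out):
                total += (a - b) ** 2
    return math.sqrt(total)

def local_errors(cuboid, reconstruction, block):
    """Error per non-overlapping block x block spatial region (assumed layout)."""
    height, width = len(cuboid[0]), len(cuboid[0][0])
    errors = {}
    for top in range(0, height, block):
        for left in range(0, width, block):
            total = 0.0
            for f_in, f_out in zip(cuboid, reconstruction):
                for y in range(top, min(top + block, height)):
                    for x in range(left, min(left + block, width)):
                        total += (f_in[y][x] - f_out[y][x]) ** 2
            errors[(top, left)] = math.sqrt(total)
    return errors

# Toy data: two 4x4 frames; the model fails to reconstruct one bright pixel.
cuboid = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
reconstruction = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
cuboid[0][3][3] = 1.0
cuboid[1][3][3] = 1.0

error = reconstruction_error(cuboid, reconstruction)
regions = local_errors(cuboid, reconstruction, block=2)
worst_region = max(regions, key=regions.get)  # top-left corner of the worst block
```

In this toy case the whole-rectangle error would be compared with the anomaly threshold, and `worst_region` points at the part of the frame that contributed the error.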
As can be seen from the above, the abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning provided by the present application includes: establishing a spatio-temporal model; collecting a first video surveillance image of a monitor in a first preset time period; taking the behavior in the first video surveillance image as a first normal behavior, inputting the first normal behavior into the spatio-temporal model, and training the spatio-temporal model; identifying, through the trained spatio-temporal model, a second video surveillance image having abnormal behavior in a second preset time period of the monitor; sending the second video surveillance image for manual verification; if the abnormal behavior in the second video surveillance image passes the manual verification, determining the abnormal behavior to be normal and constructing it as a second normal behavior through a fuzzy aggregation method; inputting the second normal behavior into the spatio-temporal model, retraining the spatio-temporal model, and repeating the step of identifying a second video surveillance image having abnormal behavior; and, if the abnormal behavior does not pass the manual verification, determining the abnormal behavior to be non-normal and recording it. By detecting over time and feeding normal behavior into the spatio-temporal model, the present application allows the model to be trained continuously and used to detect abnormal behavior, thereby improving the accuracy of abnormal behavior detection.

The accompanying drawings are incorporated in and constitute a part of this specification, illustrate embodiments suitable for carrying out the present invention, and are used together with the specification to explain the principles of the embodiments of the present invention. Obviously, the drawings described below are merely some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning provided by an embodiment of the present application.
FIG. 2 is a schematic flowchart of the input data layer pre-processing the first video monitoring screen and/or the second video monitoring screen, as provided by an embodiment of the present application.

Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be implemented in various forms and are not limited to the examples given here; rather, these embodiments are provided so that the present invention will be more comprehensive and complete and so that the idea of the exemplary embodiments is fully conveyed to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present invention.
Human behavior recognition algorithms use sensor data and machine learning techniques to automatically analyze and recognize human movements and behaviors. Compared with traditional surveillance video, they enable real-time, accurate, unattended monitoring. They can monitor the operating behavior of substation operation and maintenance personnel, detect potential operating errors or anomalies, issue warnings in a timely manner, and take appropriate measures. In addition, human behavior recognition algorithms can identify abnormal behaviors in substations, such as personnel entering substation areas without authorization, vandalized equipment, and malicious operations. Through real-time monitoring and analysis, potential security risks can be identified and responded to in a timely manner, improving the safety and stability of substations. Recent artificial intelligence developments for anomaly detection in video surveillance have addressed only part of the challenge, and the nature of abnormal behavior over time has been largely ignored.
In real-world video surveillance environments, active learning aims to achieve anomaly detection in a dynamically changing environment. The spatio-temporal model is first trained to identify the acceptable normal behaviors provided at the outset. However, in dynamic environments, where unexpected new normal behaviors appear or existing behaviors once considered anomalous become normal, it is important that the detection system evolves so that it can handle these new scenarios. To this end, a fuzzy aggregation method is used to continuously train the spatio-temporal model on unknown or new normal behaviors specific to the corresponding surveillance context.
As shown in FIG. 1, the anomaly recognition monitoring method based on incremental spatio-temporal learning provided in this application includes the following steps:
S100: Establish a spatio-temporal model.
The spatio-temporal model adopts the ISTL model and is composed of a spatio-temporal autoencoder that learns appearance and motion representations from the video input. The spatio-temporal autoencoder is an unsupervised learning algorithm that uses backpropagation to make the target value equal to the input value by minimizing the reconstruction error.
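As a hedged illustration of that training objective, the reconstruction-error minimization can be written as follows. The notation is assumed for this sketch and is not taken from the application: f_W is the autoencoder with weights W, and X ranges over the training inputs.

```latex
W^{*} = \arg\min_{W} \sum_{X \in \mathcal{X}_{\text{train}}} \left\lVert X - f_{W}(X) \right\rVert_{2}^{2}
```

Because the training inputs contain only normal behavior, inputs that the trained model reconstructs poorly are candidates for abnormal behavior.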
S200: Collect a first video monitoring screen of a monitor during a first preset time period.
The first video monitoring screen consists of a training video stream made up of video frames that show normal behavior from a given camera viewing angle. The training video stream consists of a sequence of frames of height h and width w and satisfies a relation [formula not reproduced in this text] in which one symbol denotes all the video frames of the camera view in the real world.
S300: A behavior in a first video monitoring screen is determined as a first normal behavior, and the first normal behavior is input into a spatio-temporal model to train the spatio-temporal model.
After the first normal behavior is input to the spatio-temporal model, the model is trained. In some embodiments, the model may be tested after training using a test video stream [symbol not reproduced in this text] that contains video frames of both normal and abnormal behavior. The purpose is to verify, after the model has learned a representation of normal behavior from the training video stream, that it can distinguish abnormal behavior on the test video stream, and thereby to determine that training of the spatio-temporal model is complete.
S400: Identifying a second video surveillance screen having abnormal behavior during a second preset time period of the monitor by the trained spatio-temporal model.
Based on the training result, the spatio-temporal model distinguishes behaviors different from the first normal behavior during the second preset time period and determines the behaviors as abnormal behaviors.
S500: Send the second video surveillance screen for manual verification.
Since the spatio-temporal model is trained based only on the first video surveillance image, the recognized abnormal behavior may not be accurate, and manual re-verification is then required to verify the accuracy of the recognized abnormal behavior.
S600: If the abnormal behavior in the second video monitoring screen passes manual verification, the abnormal behavior is determined as normal, and the abnormal behavior in the second video monitoring screen is constructed as a second normal behavior through a fuzzy aggregation method.
Specifically, the ISTL model in the present application is first trained with pre-recognized normal behaviors in a monitoring environment and then used for anomaly detection. The purpose of manual verification and feedback is to actively provide the spatio-temporal model with dynamically evolving normal behaviors. Therefore, if the detected abnormal behavior is a false positive, the video frame tag in the second video monitoring screen with the abnormal behavior is manually determined to be "normal", and a second normal behavior is obtained and used in the continuous learning phase.
After manual feedback, the video frames that are determined to be normal are used to continually train the ISTL model to update its knowledge of normality.
The fuzzy aggregation of video frames strengthens the continuous learning of the ISTL model and maintains the stability of the learning iterations. During the detection phase, every evaluated video frame is labeled with a fuzzy measure based on its reconstruction error and is grouped, based on that measure, into a finite number (n) of sets. Then, in the continuous learning phase, the fuzzy aggregation algorithm selects from each set a fixed number of video frame rectangles with the highest fuzzy measure (S) to train the ISTL model. Two parameters [symbols not reproduced in this text] are defined at initialization based on the duration of the video surveillance stream used for continuous learning. The scenario selection for continuous training is defined by Equation (8) [equation not reproduced in this text], in which the indices identify the selected time rectangles that are included in the continuous training dataset.
The dataset for successive training iterations consists of time rectangles selected from normal behaviors by manually verified false positive detection and fuzzy aggregation. Continuous training allows the detection model to update its ability to capture new normal behaviors while maintaining the stability of previously known normal behaviors. This fuzzy aggregation method has been successful in maintaining stability and plasticity in continuous learning for IoT stream mining, text mining, and video stream mining.
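The selection described above can be sketched roughly as follows. Equation (8) and the exact fuzzy measure do not survive in this text, so the min-max normalization, the equal-width grouping, and the names `fuzzy_measure`, `select_for_retraining`, `n_sets`, and `k_per_set` are assumptions made purely for illustration.

```python
def fuzzy_measure(errors):
    """Min-max normalize reconstruction errors to [0, 1] (assumed measure)."""
    lo, hi = min(errors), max(errors)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in errors]
    return [(e - lo) / span for e in errors]

def select_for_retraining(errors, n_sets, k_per_set):
    """Return indices of the time rectangles chosen for continuous training."""
    measures = fuzzy_measure(errors)
    groups = [[] for _ in range(n_sets)]
    for index, g in enumerate(measures):
        # Equal-width bins over [0, 1]; a measure of exactly 1.0 goes in the last bin.
        bin_id = min(int(g * n_sets), n_sets - 1)
        groups[bin_id].append(index)
    selected = []
    for members in groups:
        # Keep the k members with the highest measure in each group.
        ranked = sorted(members, key=lambda i: measures[i], reverse=True)
        selected.extend(ranked[:k_per_set])
    return sorted(selected)

# Six time rectangles already verified as normal, with their reconstruction errors.
errors = [0.10, 0.12, 0.50, 0.55, 0.90, 0.95]
chosen = select_for_retraining(errors, n_sets=3, k_per_set=1)
```

Taking a few representatives from every group, rather than only the highest-error rectangles overall, is one way to keep previously learned behavior represented while still adding new behavior.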
S700: Input a second normal behavior into the spatio-temporal model, retrain the spatio-temporal model, and repeatedly perform the steps of identifying a second video surveillance scene having abnormal behavior.
After scenario selection, the ISTL model is continuously trained based on the representations selected from the input video data to obtain updated expected and acceptable behaviors from the monitored domain. The updated ISTL model is then reused for anomaly detection.
S800: If the abnormal behavior does not pass manual verification, the abnormal behavior is determined to be abnormal and recorded.
In addition, if manual verification determines that the behavior is abnormal, there is no need to use it for training the spatiotemporal model.
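Steps S400 to S800 form a loop, which the following minimal sketch illustrates with toy data. The "model" here is just a list of known-normal values and the score is the distance to the nearest one; `incremental_round`, `score`, and `reviewer` are hypothetical names, and the real method retrains a spatio-temporal autoencoder rather than extending a list.

```python
def incremental_round(model_normals, cuboids, score, threshold, reviewer):
    """One round of detection, manual verification, and update.

    model_normals: list standing in for the model's knowledge of normal behavior.
    score: callable returning a reconstruction-error-like value for an input.
    reviewer: callable returning True when a human judges a flagged input normal.
    """
    confirmed_anomalies = []
    new_normals = []
    for cuboid in cuboids:
        if score(cuboid) <= threshold:
            continue                            # not flagged by the model
        if reviewer(cuboid):
            new_normals.append(cuboid)          # false positive: second normal behavior
        else:
            confirmed_anomalies.append(cuboid)  # recorded, never used for training
    model_normals.extend(new_normals)           # stand-in for retraining the model
    return confirmed_anomalies, new_normals

normals = [1.0]
score = lambda c: min(abs(c - n) for n in normals)
reviewer = lambda c: c < 5                      # toy human judgment

round_1 = incremental_round(normals, [1.1, 3.0, 9.0], score, 0.5, reviewer)
round_2 = incremental_round(normals, [3.2, 9.0], score, 0.5, reviewer)
```

After the first round the value 3.0 has been accepted as normal, so the nearby value 3.2 is no longer flagged in the second round, while 9.0 is reported as abnormal both times.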
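In some embodiments of the present application, the first preset time period runs from a first time point to a second time point, and the second preset time period runs from the second time point to a third time point, so that the two periods are consecutive [time-point symbols not reproduced in this text]. The monitoring screens input to the spatio-temporal model are therefore all continuous, which avoids overlaps and omissions.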
In some embodiments of the present application, the spatio-temporal model includes an input data layer and a convolutional layer.
The input data layer is used to pre-process the first video surveillance image and/or the second video surveillance image to enhance the learning ability of the spatio-temporal model.
As shown in FIG. 2, the specific pre-processing steps include:
S001: Extract a first video monitoring screen and/or a second video monitoring screen by a sliding window of length T.
S002: Treat the extracted first video monitoring screen and/or second video monitoring screen as consecutive frames, convert the consecutive frames to grayscale, resize them to 224×224 pixels, and normalize the pixel values by scaling them to the range 0 to 1.
S003: Stack consecutive frames of length T to form an input time rectangle. It should be understood that by increasing the length of the time window T, longer actions can be included.
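Steps S001 to S003 can be sketched in plain Python as follows. This is an illustration under assumptions, not the application's implementation: the luminance weights, the nearest-neighbor resize, the `stride` parameter, and all function names are choices made for this sketch, and a practical system would more likely use an image library.

```python
def to_gray(frame_rgb):
    """Convert an RGB frame (rows of (r, g, b) tuples, 0-255) to grayscale."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame_rgb]

def resize_nearest(frame, size):
    """Nearest-neighbor resize of a 2-D frame to size x size."""
    height, width = len(frame), len(frame[0])
    return [[frame[y * height // size][x * width // size]
             for x in range(size)]
            for y in range(size)]

def preprocess(frame_rgb, size=224):
    """Grayscale, resize, then scale pixel values to the range 0 to 1."""
    gray = resize_nearest(to_gray(frame_rgb), size)
    return [[value / 255.0 for value in row] for row in gray]

def sliding_cuboids(frames, T, size=224, stride=1):
    """Stack each run of T consecutive frames into one input time rectangle."""
    processed = [preprocess(frame, size) for frame in frames]
    return [processed[i:i + T]
            for i in range(0, len(processed) - T + 1, stride)]

# Five tiny all-white 2x2 frames, resized to 4x4 to keep the example small.
frames = [[[(255, 255, 255)] * 2 for _ in range(2)] for _ in range(5)]
cuboids = sliding_cuboids(frames, T=3, size=4)
```

With five frames and a window of length 3, the sketch yields three overlapping time rectangles, each holding three normalized frames.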
The convolutional layer is used to analyze and learn from the first video surveillance screen and/or the second video surveillance screen.
Convolutional layers (CNNs) are inspired by biological processes similar to the organization of animal visual cortex. The connectivity of neurons in a convolutional layer is designed in a manner similar to the animal visual system, such that individual cortical neurons respond to stimuli only in a limited region (i.e., receptive field) of the input frame. In video analysis, convolutional layers are able to preserve spatial relationships in the input frame by learning feature representations using filters whose values are learned during training.
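To make the receptive-field idea concrete, the following minimal sketch applies one small filter across a frame. The filter values here are hand-picked for illustration, whereas in a trained layer they are learned; `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(frame, kernel):
    """'Valid' 2-D cross-correlation, the sliding-filter operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [[sum(frame[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

# A frame with a vertical edge between the second and third columns.
frame = [[0, 0, 1, 1] for _ in range(4)]
edge_filter = [[-1, 1]]            # responds where brightness rises left to right
response = conv2d_valid(frame, edge_filter)
```

Each output value depends only on the two neighboring pixels under the filter, and the response is non-zero only at the edge position, so the spatial layout of the input is preserved in the output.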

In some embodiments of the present application, the spatio-temporal model further includes a ConvLSTM layer, which is used to capture spatio-temporal features from consecutive frames.
A recurrent neural network (RNN) is designed to capture the dynamic temporal behavior of time-series input data by processing the input sequence using its internal memory. The LSTM unit is a modification of the general building block of an RNN. An LSTM unit consists of an input gate, an output gate, a forget gate, and a cell. The input gate defines the extent to which an input value enters the cell. The forget gate controls the extent to which the value from the previous time step is retained in the cell, and the output gate controls the extent to which the current input value is used in computing the cell's activation. The cell stores values over arbitrary time intervals. Because LSTM is mainly used to model long-term temporal correlations, spatial information is not encoded in its state transitions, which is a drawback when handling spatial data. However, learning temporal patterns while preserving the spatial structure of surveillance video streams is crucial, especially in anomaly detection. Therefore, this application uses ConvLSTM, an extension of LSTM in which both the input-to-state transition and the state-to-state transition have a convolutional structure. The ConvLSTM layer overcomes this drawback by designing the input, hidden state, gates, and cell output as three-dimensional tensors. Furthermore, the matrix operations on the input and gates are replaced with convolution operations. With these improvements, the ConvLSTM layer can capture spatio-temporal features from the input frame sequence.
The model of the ConvLSTM layer is expressed as follows:
[Formulas of the ConvLSTM layer: formula images not reproduced in this text]
In the formulas, "*" denotes the convolution operation and "∘" denotes the Hadamard product; X_t denotes the input, C_t the cell state, and H_t the hidden state; the gates i_t, f_t, and o_t are three-dimensional tensors; "σ" denotes the sigmoid function; and W_x and W_h are the two-dimensional convolution kernels in ConvLSTM.
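For reference, the widely used ConvLSTM formulation can be written as in the sketch below. It is an assumption that the formulas of this application follow this form: the bias terms b, the per-gate kernel subscripts (W_{xi}, W_{hi}, and so on), and the Hadamard (peephole) weights W_{ci}, W_{cf}, W_{co} are taken from the standard formulation, not from the text of this application.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Here "*" is the convolution operation and "∘" is the Hadamard product, so every gate keeps the spatial layout of the input frame instead of flattening it into a vector.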
According to the above embodiment, the space-time autoencoder configuration adopted in this application is shown in Table 1.
Table 1. Space-time autoencoder configuration
In some embodiments of the present application, the spatio-temporal model distinguishes between normal and abnormal behavior by means of an anomaly threshold, which is selected manually. A lower anomaly threshold makes the spatio-temporal model more sensitive to abnormal behavior during monitoring, so that more second video surveillance screens with abnormal behavior are detected. A higher anomaly threshold makes the spatio-temporal model less sensitive to abnormal behavior during monitoring, so that fewer second video surveillance screens with abnormal behavior are detected.
In this application, a reconstruction error threshold, referred to as the anomaly threshold, is defined to distinguish between normal and abnormal behavior. In a real video surveillance application, the value of the anomaly threshold is selected according to the sensitivity required by the surveillance application. A lower value makes monitoring of the area more sensitive and produces a higher number of alarms. A higher value lowers the sensitivity and may result in sensitive anomalies within the monitored area being missed.
Furthermore, the present application introduces a time threshold, defined as the number of video frames that must remain above the anomaly threshold for an event to be recognized as anomalous. The time threshold is used to reduce false positive anomaly alarms caused by sudden changes in the surveillance video stream. Such sudden changes can be caused by occlusion, motion blur, and high-brightness lighting conditions.
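The interaction of the two thresholds can be illustrated with a minimal sketch. It assumes one reconstruction error per video frame and reads the time threshold as "at least this many consecutive frames above the anomaly threshold"; the function name `detect_anomalous_events` and its parameters are hypothetical and do not come from this application.

```python
from typing import List, Sequence, Tuple


def detect_anomalous_events(
    errors: Sequence[float],
    anomaly_threshold: float,
    time_threshold: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of runs reported as anomalous events.

    A run is a maximal stretch of consecutive frames whose reconstruction
    error exceeds anomaly_threshold. It is reported only if it contains at
    least time_threshold frames (assumed reading of the time threshold).
    """
    events: List[Tuple[int, int]] = []
    run_start = None  # index where the current above-threshold run began
    for i, err in enumerate(errors):
        if err > anomaly_threshold:
            if run_start is None:
                run_start = i
        else:
            # The run just ended; keep it only if it was long enough.
            if run_start is not None and i - run_start >= time_threshold:
                events.append((run_start, i - 1))
            run_start = None
    # Handle a run that lasts until the final frame.
    if run_start is not None and len(errors) - run_start >= time_threshold:
        events.append((run_start, len(errors) - 1))
    return events
```

With a time threshold of 3, a single frame that spikes above the anomaly threshold (for example because of brief occlusion) is ignored, while three or more consecutive frames above it are reported as one event.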
In some embodiments of the present application, the manual verification determines whether the abnormal behavior in the second video monitoring screen is acceptable according to the reconstruction error. The reconstruction error is expressed as a score for each input time rectangle and is used for anomaly localization, that is, for locating the specific region in the video frame where the anomaly occurs. The reconstruction error is calculated using Equation (6) and Equation (7):
[Equations (6) and (7): formula images not reproduced in this text]
Here, the symbols defined for these equations are the time window and the height of the video frame.
Anomaly localization refers to identifying the specific region of a video frame where an anomaly occurs. After an anomaly is detected in a video clip, it is localized by calculating the reconstruction error on non-overlapping spatio-temporal local rectangular windows; the reconstruction error of each local rectangle is calculated using Equation (7).
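The localization step can be sketched as follows. Because Equation (7) is not available in this text, the sketch assumes the local reconstruction error is the mean squared difference between the input time rectangle and its reconstruction within each window; the function names, the 16×16 window size, and the array layout (T, H, W) are illustrative assumptions.

```python
import numpy as np


def local_reconstruction_errors(original, reconstructed, window=(16, 16)):
    """Mean squared error on non-overlapping local windows of a time rectangle.

    original, reconstructed: arrays of shape (T, H, W).
    Each local window spans all T frames and window[0] x window[1] pixels.
    Returns an array of shape (H // window[0], W // window[1]).
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    t, h, w = original.shape
    wh, ww = window
    rows, cols = h // wh, w // ww
    # Crop so the frame divides evenly into windows, then square the residual.
    sq = (original - reconstructed)[:, : rows * wh, : cols * ww] ** 2
    # Split the H axis into (rows, wh) and the W axis into (cols, ww).
    blocks = sq.reshape(t, rows, wh, cols, ww)
    return blocks.mean(axis=(0, 2, 4))


def locate_anomalies(local_errors, anomaly_threshold):
    """Return (row, col) indices of windows whose error exceeds the threshold."""
    hits = np.argwhere(local_errors > anomaly_threshold)
    return [tuple(int(v) for v in idx) for idx in hits]
```

The window indices returned by `locate_anomalies` can be mapped back to pixel coordinates by multiplying by the window size, which gives the region to highlight for the human reviewer.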
In some embodiments of the present application, the step of identifying, by the trained spatio-temporal model, a second video surveillance screen having abnormal behavior in a second preset time period of the monitor comprises:
if the reconstruction error of an input time rectangle is detected to be greater than the anomaly threshold, classifying the input time rectangle as anomalous and identifying the second video surveillance screen within the monitoring screen.
The ISTL model is first trained using previously recognized normal behaviors in the surveillance environment and is then used for anomaly detection: if an anomaly is detected in a video frame, i.e., the reconstruction error of an input time rectangle is larger than the anomaly threshold, the input time rectangle is classified as anomalous, and the video frame is then sent for manual verification.
As can be seen from the above, the present application provides an abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning. The method includes: establishing a spatio-temporal model; collecting a first video monitoring screen of a monitor during a first preset time period, the behavior in the first video monitoring screen being a first normal behavior; inputting the first normal behavior into the spatio-temporal model and training the spatio-temporal model; identifying a second video surveillance screen having abnormal behavior during a second preset time period of the monitor using the trained spatio-temporal model; sending the second video surveillance screen for manual verification; if the abnormal behavior in the second video surveillance screen passes the manual verification, determining the abnormal behavior to be normal, constructing the abnormal behavior in the second video surveillance screen as a second normal behavior using a fuzzy aggregation method, inputting the second normal behavior into the spatio-temporal model, retraining the spatio-temporal model, and repeating the step of identifying a second video surveillance screen having abnormal behavior; and if the abnormal behavior does not pass the manual verification, determining the abnormal behavior to be abnormal and recording it. By detecting normal behavior over time and inputting it into the spatio-temporal model, the present application continuously trains the spatio-temporal model and uses it to detect abnormal behavior, thereby improving the detection accuracy of abnormal behavior.
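One round of the detect, verify, and retrain cycle summarized above can be sketched as follows. This is an outline under stated assumptions, not the implementation of this application: the model, the human reviewer, and the fuzzy aggregation step are passed in as hypothetical callables (`train`, `reconstruction_error`, `is_acceptable`, `aggregate`), and their internals are left open.

```python
from typing import Callable, List, Sequence, TypeVar

Clip = TypeVar("Clip")  # one input time rectangle (stack of T frames)


def incremental_learning_round(
    train: Callable[[Sequence[Clip]], None],
    reconstruction_error: Callable[[Clip], float],
    is_acceptable: Callable[[Clip], bool],
    aggregate: Callable[[Sequence[Clip]], Sequence[Clip]],
    clips: Sequence[Clip],
    anomaly_threshold: float,
) -> List[Clip]:
    """Run one detect -> verify -> retrain round and return confirmed anomalies.

    train:                updates the spatio-temporal model with normal clips
    reconstruction_error: scores a clip with the current model
    is_acceptable:        manual verification; True means "actually normal"
    aggregate:            stand-in for the fuzzy aggregation step
    """
    accepted: List[Clip] = []   # flagged, but judged normal by the reviewer
    confirmed: List[Clip] = []  # flagged and judged abnormal; to be recorded
    for clip in clips:
        if reconstruction_error(clip) <= anomaly_threshold:
            continue  # the model already treats this clip as normal
        # Each flagged clip is shown to the reviewer exactly once.
        if is_acceptable(clip):
            accepted.append(clip)
        else:
            confirmed.append(clip)
    if accepted:
        # Accepted clips become new normal behavior for retraining.
        train(aggregate(accepted))
    return confirmed
```

Calling this function once per preset time period reproduces the repetition described above: behavior that the reviewer accepts stops raising alarms in later rounds, while rejected behavior is returned for recording.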
In the examples of this application, the words "comprising," "including," or other variations are intended to denote a non-exclusive inclusion, such that a structure, article, or device comprising a series of elements covers not only those elements, but also other elements not expressly listed or inherent in that structure, article, or device. Unless further limited, an element defined by the phrase "comprising" does not exclude the presence of other identical elements in the structure, article, or device that includes the element.
Those skilled in the art can easily conceive of other embodiments of the present disclosure in view of the specification and embodiments. This application is intended to cover any modifications, uses or adaptations of the present disclosure, which follow the general principles of the present disclosure and include common knowledge or customary technical means in the art that are not disclosed in the present disclosure. The specification and examples are considered to be exemplary, with the true scope and spirit of the present disclosure being indicated by the following claims.

Claims

Establishing a spatio-temporal model;
acquiring a first video surveillance screen of a monitor during a first preset time period;
the behavior in the first video surveillance screen being a first normal behavior, inputting the first normal behavior into the spatio-temporal model to train the spatio-temporal model;
identifying, using the trained spatio-temporal model, a second video surveillance screen having abnormal behavior during a second preset time period of the monitor;
sending the second video surveillance screen for manual verification;
if the abnormal behavior in the second video surveillance screen passes the manual verification, determining the abnormal behavior to be normal, and constructing the abnormal behavior in the second video surveillance screen as a second normal behavior by a fuzzy aggregation method;
inputting the second normal behavior into the spatio-temporal model, retraining the spatio-temporal model, and repeatedly performing the step of identifying a second video surveillance screen having abnormal behavior;
if the abnormal behavior does not pass the manual verification, determining the abnormal behavior to be abnormal and recording it;
the first preset time period and the second preset time period are consecutive, the first preset time period starting at a first time point and ending at a second time point, and the second preset time period starting at the second time point and ending at a third time point;
the spatio-temporal model includes an input data layer and a convolution layer;
the input data layer is used to pre-process the first video surveillance screen and/or the second video surveillance screen and to enhance the learning ability of the spatio-temporal model;
the convolution layer is used to analyze and learn from the first video surveillance screen and/or the second video surveillance screen;
the step in which the input data layer pre-processes the first video surveillance screen and/or the second video surveillance screen includes:
extracting the first video surveillance screen and/or the second video surveillance screen using a sliding window of length T;
converting the extracted first video surveillance screen and/or the extracted second video surveillance screen into successive frames;
converting the successive frames to grayscale to reduce dimensionality, resizing them to 224×224 pixels, and normalizing the pixel values by scaling them to the range 0 to 1;
stacking the successive frames of length T to form an input time rectangle;
the spatio-temporal model further includes a ConvLSTM layer;
the ConvLSTM layer is used to capture spatio-temporal features from the successive frames;
the model of the ConvLSTM layer is expressed as follows:
[Formulas of the ConvLSTM layer: formula images not reproduced in this text]
in the formulas, "*" denotes the convolution operation and "∘" denotes the Hadamard product; X_t denotes the input, C_t the cell state, and H_t the hidden state; the gates i_t, f_t, and o_t are three-dimensional tensors; "σ" denotes the sigmoid function; and W_x and W_h are the two-dimensional convolution kernels in ConvLSTM;
the spatio-temporal model distinguishes between the normal behavior and the abnormal behavior by an anomaly threshold, the anomaly threshold being selected manually;
the lower the anomaly threshold, the higher the sensitivity of the spatio-temporal model to abnormal behavior during monitoring, and the higher the number of detections of the second video surveillance screen having abnormal behavior;
the higher the anomaly threshold, the lower the sensitivity of the spatio-temporal model to abnormal behavior during monitoring, and the lower the number of detections of the second video surveillance screen having abnormal behavior;
the manual verification determines, according to a reconstruction error, whether the abnormal behavior in the second video surveillance screen passes;
the reconstruction error is expressed as a score for each input time rectangle for anomaly localization, the anomaly localization being the localization of the specific region in a video frame where the anomaly occurs;
the step of identifying, by the trained spatio-temporal model, a second video surveillance screen having abnormal behavior during the second preset time period of the monitor includes:
if the spatio-temporal model detects that the reconstruction error of the input time rectangle is greater than the anomaly threshold, classifying the input time rectangle as anomalous and identifying the second video surveillance screen within the monitoring screen;
an abnormal behavior recognition and monitoring method based on incremental spatio-temporal learning, characterized by the foregoing.