JP2022026016A

JP2022026016A - Information processing device, information processing method, and program

Info

Publication number: JP2022026016A
Application number: JP2020129279A
Authority: JP
Inventors: ムハマドアクマル; Akmal Muhammad; 満中澤; Mitsuru Nakazawa
Original assignee: Rakuten Group Inc
Current assignee: Rakuten Group Inc
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2022-02-10
Anticipated expiration: 2040-07-30
Also published as: JP7096296B2; JP2022153360A; JP7361163B2

Abstract

To provide an information processing device, an information processing method, and a program capable of reducing operator load and detecting various abnormalities from video with high accuracy. SOLUTION: An abnormality detection device 1 (information processing device) comprises: a video acquisition unit that acquires video data; a feature extraction unit that extracts audio features and image features from the video data acquired by the video acquisition unit; an abnormal scene candidate detection unit that detects an abnormal scene candidate from the video data on the basis of the audio features extracted by the feature extraction unit; an abnormality determiner that classifies the abnormal scene candidate detected by the abnormal scene candidate detection unit as abnormal, normal, or other on the basis of the audio features and the image features; and a scene presentation unit that, when the abnormality determiner classifies the abnormal scene candidate as other, presents the candidate via a user interface and accepts, via the user interface, input of information to be attached to the presented candidate. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device, an information processing method, and a program, and more particularly to a technique for analyzing video to detect abnormalities.

Recent video distribution services enable real-time distribution not only of video content created by content providers but also of video content created by general users.
In such a service, the distributed video must be monitored so that the content does not include so-called abnormal scenes that are inappropriate for viewing, and measures such as deleting the abnormal scene, stopping distribution, or deleting the distributing account must be taken so that a detected abnormal scene is not viewed by mistake. Such abnormal scenes include scenes unsuitable for family viewing (Non-Family-Safe: NFS), such as violent scenes and scenes not intended for children.

Patent Document 1 discloses an elevator monitoring device that detects abnormal behavior of an occupant from footage captured by a security camera installed in an elevator car.
Specifically, the monitoring device of Patent Document 1 sets a rampage determination threshold according to a predetermined frequency band extracted from the result of frequency analysis of the occupant's voice data collected by an intercom installed in the car, and statistically calculates the amount of variation in the occupant's movement from the footage captured by the security camera. The monitoring device then compares the calculated amount of variation with the rampage determination threshold and, when the amount of variation is equal to or greater than the threshold, regards the occupant's movement as abnormal behavior and determines that the occupant is rampaging. In this way, abnormal behavior is determined from the footage even when the occupant can move only slightly.

Japanese Unexamined Patent Publication No. 2013-63810

However, because the abnormality detectable by the technique of Patent Document 1 is limited to an occupant's rampage inside an elevator, it is difficult to appropriately detect the diverse abnormal scenes that can appear in diverse video content.

In particular, video distribution services are segmented in many ways according to the age group, preferences, and so on of their main target users, and the range of abnormal scenes that are inappropriate for viewing differs from service to service. Furthermore, because abnormal scenes usually appear only rarely in video content, they are ill-suited to building a general-purpose database of the training data required for supervised machine learning. On the other hand, attempting to detect abnormal scenes from video content with unsupervised machine learning lowers detection accuracy.

Incidentally, video created by a content provider often carries tag information added by the provider, such as whether the distributed content includes violent scenes, whether it is content for children, or whether it is age-restricted. The provider can also be made to tag the presence of abnormal scenes when creating the content.
By contrast, video content created by general users, which has been increasing in recent years, often carries no such tag information for abnormal scenes, or, even when it does, the tagging may not be appropriate for the video distribution service in question.

For this reason, in some video distribution services, operators have conventionally monitored the distributed video content at all times and, on finding an abnormal scene, set an age restriction on the content or stopped its distribution. This has increased the time and workload, and also the psychological burden, of the operators who monitor the video. At the same time, monitoring video content manually carries the risk of overlooking abnormal scenes.

The present invention has been made to solve the above problems, and its object is to provide an information processing device, an information processing method, and a program capable of detecting diverse abnormalities from video with high accuracy while reducing the load on operators.

To solve the above problems, one aspect of the information processing device according to the present invention comprises: a video acquisition unit that acquires video data; a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data; an abnormal scene candidate detection unit that detects an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction unit; an abnormality determiner that classifies the abnormal scene candidate detected by the abnormal scene candidate detection unit as abnormal, normal, or other based on the audio features and the image features; and a scene presentation unit that, when the abnormality determiner classifies the abnormal scene candidate as other, presents the candidate via a user interface and accepts, via the user interface, input of information to be attached to the presented candidate.

The abnormal scene candidate detection unit may detect the abnormal scene candidate from the video data using unsupervised learning.

The abnormal scene candidate detection unit may detect the abnormal scene candidate from the video data by directly isolating abnormal audio features, without generating a model of the group of normal audio features.

The abnormal scene candidate detection unit may isolate the abnormal audio features by calculating the path length of each audio feature in an Isolation Forest.
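The path-length idea can be illustrated with a minimal pure-Python sketch. It is not the patent's implementation: the function names, the toy two-dimensional feature vectors, the tree count, and the depth cap are all assumptions made for illustration, and the sketch omits the subsampling and path-length normalization used by the standard Isolation Forest algorithm. It only shows that a point far from the bulk of the data is isolated by fewer random splits, without any model of the normal data being built.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=16):
    """Number of random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only split along dimensions where the remaining points still differ.
    dims = [d for d in range(len(point))
            if min(row[d] for row in data) < max(row[d] for row in data)]
    if not dims:
        return depth
    d = rng.choice(dims)
    lo = min(row[d] for row in data)
    hi = max(row[d] for row in data)
    split = rng.uniform(lo, hi)
    # Keep the side of the split that contains `point`.
    if point[d] < split:
        side = [row for row in data if row[d] < split]
    else:
        side = [row for row in data if row[d] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=100, seed=0):
    """Average isolation depth over `n_trees` random trees (shorter = more anomalous)."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy stand-ins for audio feature vectors (hypothetical data):
# a tight 6x5 grid of "normal" points plus one far outlier.
normal = [(0.1 * (i % 6), 0.1 * (i // 6)) for i in range(30)]
outlier = (10.0, 10.0)
features = normal + [outlier]

outlier_score = mean_path_length(outlier, features)
inlier_score = mean_path_length(normal[14], features)
print(outlier_score, inlier_score)
```

In this toy setup the outlier's mean path length comes out shorter than the inlier's, which is the property that would be thresholded to flag an abnormal scene candidate.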

The feature extraction unit may extract audio features represented by a Mel-frequency spectrogram of the audio data in the video data.

The feature extraction unit may calculate Mel-frequency cepstral coefficients (MFCC) from the audio data and concatenate the calculated MFCC with the Mel-frequency spectrogram to extract the audio features.
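As a rough illustration of this concatenation, the sketch below assumes that one frame of log Mel-band energies is already available and derives cepstral coefficients from it with a DCT-II, which is the usual way MFCC are obtained. The function names, the number of coefficients, and the ordering of the concatenated vector are assumptions, not details given in this document. The Hz-to-Mel conversion uses the common 2595·log10(1 + f/700) formula.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale (common HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def dct2(values, n_coeffs):
    """Unnormalized DCT-II of `values`, keeping the first `n_coeffs` coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]

def audio_feature(log_mel_frame, n_mfcc=4):
    """Concatenate a log-Mel frame with cepstral coefficients derived from it."""
    mfcc = dct2(log_mel_frame, n_mfcc)
    return list(log_mel_frame) + mfcc

# Hypothetical frame with 8 Mel bands of equal log energy.
frame = [1.0] * 8
feature = audio_feature(frame, n_mfcc=4)
print(len(feature))  # 8 Mel-band values followed by 4 cepstral coefficients
```

A production pipeline would more likely compute both representations with an audio library; the point here is only that the two views of the same frame are joined into one feature vector.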

The scene presentation unit may attach the information input via the user interface to the audio features and the image features, and store the result in a storage device as training data for the abnormality determiner.

When the number of training data items stored in the storage device exceeds a predetermined threshold, the abnormal scene candidate detection unit may have the abnormality determiner classify the abnormal scene candidate.

When the number of training data items stored in the storage device is within the predetermined threshold, the abnormal scene candidate detection unit may bypass classification by the abnormality determiner and have the scene presentation unit present the abnormal scene candidate.
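The two conditions above amount to a simple routing rule on the size of the training set. A minimal sketch, in which the function name, the return labels, and the example threshold are hypothetical:

```python
def route_candidate(num_training_items: int, threshold: int) -> str:
    """Decide where an abnormal scene candidate goes next.

    Above the threshold, the determiner has enough labeled data to classify;
    at or below it, the candidate goes straight to the operator.
    """
    if num_training_items > threshold:
        return "determiner"
    return "operator"

print(route_candidate(10, 100))   # operator
print(route_candidate(101, 100))  # determiner
```

The effect is a cold-start mode in which every candidate is shown to the operator until enough labeled examples have accumulated.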

In a feature space in which the audio features and the image features are integrated, the abnormality determiner may classify the abnormal scene candidate as other when the difference between the number of abnormal samples and the number of normal samples located in the neighborhood of the candidate is within a predetermined threshold.
The abnormality determiner may classify the abnormal scene candidate using the k-nearest neighbor method.
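The three-way decision can be sketched as a k-nearest-neighbor vote with an ambiguity margin. The sketch below is an illustration under assumptions: the function name, Euclidean distance, the values of `k` and `margin`, and the toy two-dimensional points standing in for the integrated audio and image features are not taken from the patent.

```python
import math
from collections import Counter

def knn_three_way(candidate, labeled, k=5, margin=1):
    """Classify `candidate` as "abnormal", "normal", or "other".

    `labeled` is a list of (feature_vector, label) pairs with label in
    {"abnormal", "normal"}. If the vote difference among the k nearest
    samples is within `margin`, the result is "other" (ask the operator).
    """
    nearest = sorted(labeled, key=lambda item: math.dist(candidate, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    diff = votes["abnormal"] - votes["normal"]
    if abs(diff) <= margin:
        return "other"
    return "abnormal" if diff > 0 else "normal"

# Hypothetical labeled samples in a 2-D integrated feature space.
labeled = [
    ((0.0, 0.0), "normal"), ((0.0, 1.0), "normal"), ((1.0, 0.0), "normal"),
    ((5.0, 5.0), "abnormal"), ((5.0, 4.0), "abnormal"), ((4.0, 5.0), "abnormal"),
]

print(knn_three_way((0.1, 0.1), labeled, k=3))  # normal
print(knn_three_way((4.9, 4.9), labeled, k=3))  # abnormal
print(knn_three_way((2.4, 2.4), labeled, k=5))  # other
```

A candidate deep inside one cluster gets a clear majority, while a candidate between the clusters gets a near-even vote and is handed to the operator.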

The information processing device may further comprise an emotion analysis unit that uses supervised learning to analyze, from the image features extracted by the feature extraction unit, the emotion of a face included in the video data, and supplies features of the analyzed facial emotion to the abnormality determiner.

When the emotion analysis unit detects an abnormal scene candidate from the video data based on the analyzed facial emotion, it may cause the abnormal scene candidate detection unit to execute detection of abnormal scenes based on the audio features.

One aspect of the information processing system according to the present invention is an information processing system comprising a server and at least one client device connected to the server via a network. The server has: a video acquisition unit that acquires video data; a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data; an abnormal scene candidate detection unit that detects an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction unit; an abnormality determiner that classifies the abnormal scene candidate detected by the abnormal scene candidate detection unit as abnormal, normal, or other based on the audio features and the image features; a scene presentation unit that, when the abnormality determiner classifies the abnormal scene candidate as other, presents the candidate via a user interface and accepts, via the user interface, input of information to be attached to the presented candidate; and a transmission unit that transmits the abnormal scene candidate to the client device.
The client device has: a reception unit that receives the abnormal scene candidate transmitted from the server; the user interface, which presents the abnormal scene candidate received by the reception unit and accepts input of information to be attached to the presented candidate; and a transmission unit that transmits to the server the information, accepted by the user interface, to be attached to the abnormal scene candidate.

One aspect of the information processing method according to the present invention is an information processing method executed by an information processing device, comprising the steps of: acquiring video data; extracting audio features from the acquired video data and extracting image features from the video data; detecting, by unsupervised learning, an abnormal scene candidate from the video data based on the extracted audio features; classifying, by an abnormality determiner, the detected abnormal scene candidate as abnormal, normal, or other based on the audio features and the image features; and, when the abnormality determiner classifies the abnormal scene candidate as other, presenting the candidate via a user interface and accepting, via the user interface, input of information to be attached to the presented candidate.

One aspect of the information processing program according to the present invention is an information processing program for causing a computer to execute information processing, the program causing the computer to execute processing including: a video acquisition process of acquiring video data; a feature extraction process of extracting audio features from the video data acquired by the video acquisition process and extracting image features from the video data; an abnormal scene candidate detection process of detecting an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction process; an abnormality determination process of classifying, by an abnormality determiner, the abnormal scene candidate detected by the abnormal scene candidate detection process as abnormal, normal, or other based on the audio features and the image features; and an input/output process of, when the abnormality determiner classifies the abnormal scene candidate as other, presenting the candidate via a user interface and accepting, via the user interface, input of information to be attached to the presented candidate.

According to the present invention, diverse abnormalities can be detected from video with high accuracy while reducing the load on operators.
The objects, aspects, and effects of the present invention described above, as well as those not described above, will be understood by those skilled in the art from the following modes for carrying out the invention, with reference to the accompanying drawings and the claims.

FIG. 1 is a block diagram showing an example of the functional configuration of the abnormality detection device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device according to Embodiment 1.
FIG. 3 is a diagram showing an example of the segments of audio data that the feature extraction unit of the abnormality detection device separates from video data.
FIG. 4 is a diagram showing an example of Mel spectrogram features of an abnormal scene candidate that the feature extraction unit of the abnormality detection device extracts from audio data.
FIG. 5 is a diagram showing an example of Mel spectrogram features of a normal scene that the feature extraction unit of the abnormality detection device extracts from audio data.
FIG. 6 is a diagram showing an example of an audio feature, output by the feature extraction unit of the abnormality detection device to the abnormal scene candidate detection unit, in which the Mel-frequency cepstrum and the Mel spectrogram are concatenated.
FIG. 7 is a diagram explaining the isolation of abnormal feature points by the Isolation Forest that the abnormal scene candidate detection unit of the abnormality detection device uses in unsupervised learning.
FIG. 8 is a schematic diagram showing an example of a decision tree of the Isolation Forest that the abnormal scene candidate detection unit of the abnormality detection device uses in unsupervised learning.
FIG. 9 is a schematic diagram explaining an example of the k-nearest neighbor method that the abnormality determiner of the abnormality detection device uses to classify abnormal scene candidates.
FIG. 10 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device according to a modification of Embodiment 1.
FIG. 11 is a block diagram showing an example of the functional configuration of the abnormality detection device according to Embodiment 2 of the present invention.
FIG. 12 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device according to Embodiment 2.
FIG. 13 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device according to a modification of Embodiment 2.
FIG. 14 is a schematic diagram showing an example of a facial emotion recognition result that the emotion analysis unit of the abnormality detection device according to Embodiment 2 recognizes from the image data of video data.
FIG. 15 is a block diagram showing an example of the hardware configuration of the abnormality detection device according to each embodiment of the present invention.

Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the accompanying drawings. Among the components disclosed below, those having the same function are given the same reference numerals, and their description is omitted. The embodiments disclosed below are examples of means for realizing the present invention and should be modified or changed as appropriate according to the configuration of the device to which the present invention is applied and various conditions; the present invention is not limited to the following embodiments. Moreover, not all combinations of the features described in the embodiments are essential to the solution of the present invention.

(Embodiment 1)
The abnormality detection device according to this embodiment extracts features of the audio data and of the image data from video data and, using these multimodal features, detects abnormal scenes from the video data semi-automatically in multiple stages.
The following describes an example in which the abnormality detection device first detects abnormal scene candidates by unsupervised learning based on features of audio data extracted from video data streamed in real time; next has the abnormality determiner classify each detected candidate as an abnormal scene, a normal scene, or a scene requiring an operator's judgment; presents the video data of candidates classified as requiring the operator's judgment; and attaches the operator's confirmation input, indicating whether the scene is abnormal, to the features of the video data and accumulates it as training data for the abnormality determiner.

However, this embodiment is not limited to this example. For instance, the abnormality detection device may detect abnormal scenes after the fact from recorded video data. Also, for example, abnormal scene detection may be controlled variably according to the amount of accumulated training data: all video data of the detected abnormal scene candidates may be presented to the operator, bypassing the abnormality determiner, or the abnormality determiner may automatically detect abnormal scenes based on the audio and image features of the video data of the detected candidates. In the latter case, the threshold for abnormality determination may be set relatively low so that only abnormal scenes near the threshold are presented to the operator for confirmation as needed.

<Functional configuration of the abnormality detection device>
FIG. 1 is a block diagram showing an example of the functional configuration of the abnormality detection device 1 according to this embodiment.
The abnormality detection device 1 shown in FIG. 1 includes a data acquisition unit 11, a feature extraction unit 12, an abnormal scene candidate detection unit 13, an abnormality determiner 14, and a scene presentation unit 15.
The abnormality detection device 1 may be communicably connected via a network to a client device 3 such as a PC (Personal Computer). In this case, the abnormality detection device 1 is implemented on a server, and the client device 3 may provide the user interface through which the abnormality detection device 1 exchanges information with the outside. The client device 3 may also include some or all of the components 11 to 15 of the abnormality detection device 1, including the scene presentation unit 15.

The data acquisition unit 11 acquires video data streamed in real time and supplies it to the feature extraction unit 12. The video data is moving image data that includes audio data and image (visual) data, but the data acquisition unit 11 may instead acquire still image data that includes audio data and supply it to the feature extraction unit 12.
Instead of streamed video data, the data acquisition unit 11 may acquire video data recorded in advance in a non-volatile storage device such as an HDD of the abnormality detection device 1, or may receive recorded video data from a peer device via a communication I/F.
The data acquisition unit 11 also accepts input of the various parameters needed to execute the abnormal scene detection process in the abnormality detection device 1. It may accept these parameters via the user interface of the client device 3, which is communicably connected to the abnormality detection device 1.

The feature extraction unit 12 separates audio data from the video data supplied by the data acquisition unit 11 and extracts audio features from the separated audio data.
The feature extraction unit 12 also separates image data from the video data supplied by the data acquisition unit 11 and extracts image features from the separated image data.
The feature extraction unit 12 supplies the extracted audio features and image features, together with the video data, to the abnormal scene candidate detection unit 13.

The abnormal scene candidate detection unit 13 detects abnormal scene candidates from the video data based on the audio features supplied by the feature extraction unit 12 and supplies the detected candidates to the abnormality determiner 14. The abnormal scene candidate detection unit 13 may also supply the detected candidates to the scene presentation unit 15, bypassing the abnormality determiner 14.

Abnormal scenes include, but are not limited to, scenes unsuitable for family viewing (Non-Family-Safe: NFS), such as violent scenes and scenes not intended for children. Abnormal scenes broadly include scenes or content that the terms or rules of each video distribution service stipulate should not be distributed via that service, as well as scenes or content that an operator has ultimately judged manually, from the audio and images of the video data, should not be distributed.
Details of the feature extraction process executed by the feature extraction unit 12 and the abnormal scene candidate detection process executed by the abnormal scene candidate detection unit 13 will be described later with reference to FIGS. 3 to 8.

In this embodiment, the abnormal scene candidate detection unit 13 detects abnormal scene candidates by classifying audio features with unsupervised learning. In streamed video data, abnormal scenes appear only rarely, and the criteria for what should count as an abnormal scene vary from one video distribution service to another. When a new service is launched or the criteria change, it is therefore difficult to prepare appropriate labeled training data in advance, and difficult to achieve high-accuracy classification with supervised learning. In this embodiment, audio features are classified by unsupervised learning using only the audio data in the video data, so abnormal scene candidates can be detected with high accuracy and low load even from a small number of samples.

The abnormality determiner 14 receives the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 and classifies the video data of each input candidate as an abnormal scene, a normal scene, or a scene requiring the operator's judgment. Of the classified candidates, the abnormality determiner 14 stores those classified as either an abnormal scene or a normal scene in the learning data DB (database) 2, with the classification result attached. The abnormality determiner 14 supplies the candidates classified as scenes requiring the operator's judgment to the scene presentation unit 15.

In the present embodiment, the abnormality determiner 14 uses a feature space in which the audio features and image features of the video data extracted by the feature extraction unit 12 are integrated, and classifies each input abnormal scene candidate by supervised learning into one of three classes: an abnormal scene, a normal scene, or a scene requiring the operator's judgment. The abnormality determiner 14 may perform learning using the classification results of abnormal scene candidates accumulated in the learning data DB 2 as teacher data.
Details of the abnormal scene determination processing executed by the abnormality determiner 14 will be described later with reference to FIG. 9.

The scene presentation unit 15 presents the abnormal scene candidates that are supplied from the abnormality determiner 14 and classified as scenes requiring the operator's judgment to the outside via a display device or the like, and accepts the operator's confirmation input. The abnormality detection device 1 may also present the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 to the outside and accept the operator's confirmation input.
The abnormality detection device 1 may use its own display device or the like as the user interface, or it may present abnormal scene candidates to the outside, or accept the operator's confirmation input, via the user interface of a client device 3 communicably connected to the abnormality detection device 1.
In this case, the abnormality detection device 1 may further include a transmission/reception unit that transmits the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 to the client device 3 and receives the operator's confirmation input transmitted from the client device 3. The client device 3 may include a transmission/reception unit that receives the abnormal scene candidates transmitted from the abnormality detection device 1 and transmits to the abnormality detection device 1 the operator's confirmation input for the candidates presented via the user interface.

By checking the images of the video data of an abnormal scene candidate presented by the scene presentation unit 15 against its audio, the operator confirms whether the presented candidate is an abnormal scene or a normal scene and inputs the confirmation result to the scene presentation unit 15. The operator can take predetermined measures for a candidate confirmed to be an abnormal scene. For example, the confirmed abnormal scene may be deleted from the video data to be distributed, distribution of the video data may be stopped, or the account of the user distributing the video data may be suspended.
The scene presentation unit 15 attaches the confirmation result entered by the operator (an annotation of abnormal scene or normal scene) to the audio features and image features of the presented abnormal scene candidate, and stores the result in the learning data DB 2 as learning data.

As a specific example, suppose that the abnormal scene candidate detection unit 13 detects a sound resembling a gunshot from the audio features extracted from the video data and detects the scene containing that sound as an abnormal scene candidate. In this case, the operator only needs to check the images of the candidate to confirm whether it is an abnormal scene or a normal scene.
For example, if the images of the candidate contain a gun or another violent or cruel object, the candidate can be confirmed as an abnormal scene, whereas if they contain an object such as outdoor fireworks, it can be confirmed as a normal scene.

As described above, the present embodiment detects abnormal scenes semi-automatically in multiple stages, using multimodal information from the audio and images of the video data. Specifically, abnormal scene candidates are detected automatically from the audio of the video data, the detected candidates are presented to the operator, and the operator confirms from the images of each candidate whether it is an abnormal scene or a normal scene. This greatly reduces the operator's load in monitoring distributed video.

<Processing procedure of abnormal scene detection processing>
FIG. 2 is a flowchart showing an example of the processing procedure of the abnormal scene detection processing executed by the abnormality detection device 1 according to the present embodiment.
Each step in FIG. 2 is realized by the CPU reading and executing a program stored in a storage device, such as an HDD, of the abnormality detection device 1. At least part of the flowchart shown in FIG. 2 may instead be realized by hardware. In that case, for example, a predetermined compiler may be used to automatically generate a dedicated circuit on an FPGA (Field Programmable Gate Array) from the program for realizing each step. A Gate Array circuit may also be formed in the same manner as the FPGA and realized as hardware, or the steps may be realized by an ASIC (Application Specific Integrated Circuit).

In S1, the feature extraction unit 12 of the abnormality detection device 1 separates the video data supplied from the data acquisition unit 11 into audio data and image data, and extracts audio features and image features from them, respectively.
FIG. 3 shows an example of the audio signal waveform of audio data (for example, in wav format) that the feature extraction unit 12 has, as preprocessing for feature extraction, separated from the video data supplied from the data acquisition unit 11 (for example, in a multimedia format such as mp4 or m3u8) and segmented, for example, in units of one second. In FIG. 3, the vertical axis represents the audio amplitude and the horizontal axis represents time.
The feature extraction unit 12 may normalize the separated audio data as appropriate, for example by upsampling, to suit the sound sources and the like that can be expected in the distributed video data.
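The segmentation and normalization described for S1 might be sketched as follows. This is a minimal illustration in plain Python, not the processing of the embodiment: the function names `upsample_linear` and `split_into_segments`, the use of linear interpolation, and the toy sampling rates are all assumptions introduced here, and the sketch presumes the audio track has already been decoded into a list of samples.

```python
def upsample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (a simple stand-in for a real resampler)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)       # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def split_into_segments(samples, sample_rate, seconds=1.0):
    """Cut a waveform into fixed-length segments; a short tail is dropped."""
    size = int(sample_rate * seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


# Toy input: 2.5 seconds of audio at a hypothetical 4 Hz sampling rate.
waveform = [0.0, 1.0, 0.0, -1.0] * 2 + [0.0, 1.0]
resampled = upsample_linear(waveform, src_rate=4, dst_rate=8)
segments = split_into_segments(resampled, sample_rate=8)
print(len(resampled), len(segments))
```

Each element of `segments` would then be passed on to audio feature extraction as one unit.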

In the present embodiment, the feature extraction unit 12 may extract audio features from the audio data shown in FIG. 3 as, for example, audio features expressed by a mel spectrogram (Mel Frequency Spectrogram).
A spectrogram is the result of computing the frequency spectrum of an audio signal passed through a window function, and is represented as a three-dimensional graph with time, frequency, and the strength (amplitude) of the signal components on the X, Y, and Z axes, respectively. A spectrogram corresponds to a so-called voiceprint, in which the spectra of the individual audio data segments (frames), obtained by extracting the frequency and amplitude components of the audio signal with, for example, a Fourier transform, are arranged along the time axis. A mel spectrogram is a spectrogram converted to the mel scale, which applies a weighting that takes human pitch perception (frequency perception characteristics) into account.
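As a rough illustration of the spectrogram just described, the sketch below windows each frame, takes a naive discrete Fourier transform, and stacks the magnitude spectra along the time axis. The Hann window, the frame length and hop, the 1 kHz test tone, and the particular mel formula in `hz_to_mel` are assumptions chosen for the example. Here the mel conversion only relabels the frequency axis; a band-summing mel filter bank is not included.

```python
import cmath
import math


def hz_to_mel(hz):
    """One commonly used mel-scale formula (other variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def magnitude_spectrum(frame):
    """Hann-windowed DFT magnitudes for bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len, hop):
    """Stack per-frame spectra along time: result[time][frequency] = amplitude."""
    return [magnitude_spectrum(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, hop)]


# Toy signal: a 1 kHz tone sampled at 8 kHz.
rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / rate) for t in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)

# Locate the strongest frequency bin of the first frame and express it in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * rate / 64

# The same frequency axis expressed on the mel scale.
mel_axis = [hz_to_mel(k * rate / 64) for k in range(len(spec[0]))]
print(peak_hz, round(mel_axis[peak_bin], 1))
```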

FIG. 4 shows an example of an audio feature, expressed as a mel spectrogram that the feature extraction unit 12 extracted from the audio data separated from the video data, which the abnormal scene candidate detection unit 13 detects as an abnormal scene candidate. In FIGS. 4 and 5, the X axis represents time, the Y axis represents frequency, and the Z axis represents amplitude, that is, the intensity of the audio signal. In FIGS. 4 and 5, cells with higher signal intensity are drawn with a lighter pattern and cells with lower signal intensity with a darker pattern.
The spectrogram in FIG. 4 shows a pattern in which the volume is high, the signal intensity distribution is peaked, and the audio signal decays within a short time. FIG. 4 is an example of the spectrogram of a gunshot, but a person's scream or the sound of something being struck, for example, is expected to show the same or a similar kind of pattern.

FIG. 5, by contrast, shows an example of an audio feature, expressed as a mel spectrogram that the feature extraction unit 12 extracted from an audio data segment separated from the video data, which the abnormal scene candidate detection unit 13 judges to be a normal scene (that is, does not detect as an abnormal scene candidate). The spectrogram in FIG. 5 shows a pattern in which the volume is low or medium and the signal intensity is distributed uniformly along the time axis.
The feature extraction unit 12 may separate the audio data into foreground audio (for example, human speech or screams) and background audio (music, crowd noise, and the like) and supply the spectrogram of either one to the abnormal scene candidate detection unit 13 as the audio feature. In this case, for example, audio that appears temporarily on the time axis and is not repeated can be separated as foreground audio.

In the present embodiment, the feature extraction unit 12 may further calculate Mel Frequency Cepstrum Coefficients (MFCC) from the audio data and concatenate the calculated MFCC with the mel spectrogram shown in FIG. 4 or FIG. 5 to extract the audio feature.
A cepstrum is obtained by taking the logarithm of the amplitude spectrum produced by Fourier-transforming an audio signal to obtain a logarithmic spectrum, and then applying a Fourier transform to that logarithmic spectrum again. Obtaining the cepstrum of the logarithmic spectrum (the logarithmic cepstrum) makes it possible to separate the rapidly fluctuating sound source component from the vocal tract characteristic component with which it was convolved.

The low-order components of the logarithmic cepstrum represent the spectral envelope of the audio (the frequency characteristics originating from the vocal tract component). This makes it possible to remove the pitch component, which differs greatly between individuals, and extract only the acoustic characteristics of the vocal tract, which are important for identifying phonemes. MFCC is the feature obtained by applying the mel scale to these low-order components of the logarithmic cepstrum, thereby weighting them according to human frequency perception characteristics.

Specifically, the amplitude spectrum is passed through a bank of filters equally spaced on the mel scale to extract the spectral components of each band, and the amplitude spectrum within each band is summed to compress it into an amplitude spectrum of several dimensions. The logarithm of this compressed amplitude spectrum is then taken to obtain a logarithmic amplitude spectrum.
The mel frequency spectrum obtained in this way (the logarithmic amplitude spectrum compressed on the mel scale) is converted into a mel frequency cepstrum by applying a Fourier transform (for example, a Discrete Fourier Transform (DFT)). The MFCC are obtained by extracting the low-order components of the mel frequency cepstrum (the vocal tract component of the spectrum) and normalizing them as necessary.
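The MFCC steps above (mel filter bank, band-wise summation, logarithm, transform, low-order truncation) can be sketched roughly as follows. Note one substitution: the sketch uses a DCT-II for the final transform, which is a common real-valued choice in practice, whereas the text names a Fourier transform such as the DFT. The triangular filter shape, the mel formula, the filter and coefficient counts, and all function names are assumptions made for illustration.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))    # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(amplitude_spectrum, sample_rate, n_filters=10, n_coeffs=5):
    bank = mel_filterbank(n_filters, len(amplitude_spectrum), sample_rate)
    # Sum the amplitude spectrum within each band, then take the logarithm.
    log_mel = [math.log(sum(w * a for w, a in zip(row, amplitude_spectrum)) + 1e-10)
               for row in bank]
    # DCT-II of the log mel spectrum; keep only the low-order coefficients.
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, v in enumerate(log_mel))
            for c in range(n_coeffs)]


flat = [1.0] * 129                        # toy flat amplitude spectrum
coeffs = mfcc(flat, sample_rate=8000)
louder = mfcc([2.0] * 129, sample_rate=8000)
print(len(coeffs))
```

In this sketch, scaling the whole spectrum by a constant shifts only the zeroth coefficient, which is one reason the low-order coefficients above it are often treated as a loudness-independent description of the spectral envelope.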

FIG. 6 shows an example of an audio feature formed by concatenating MFCC 61, which are sliced per unit time (for example, one second) and, for example, averaged, with a mel spectrogram 62, which is the mel spectrum with its amplitude averaged along the time axis.
Supplying an audio feature such as that shown in FIG. 6 to the abnormal scene candidate detection unit 13 for detecting abnormal scene candidates allows the audio feature to be compressed appropriately without losing information on the frequency components of the audio data, in particular the frequency components important to human hearing.
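The concatenated feature of FIG. 6 might be assembled along the following lines. The sketch assumes that the MFCC and the mel spectrum have already been computed per frame for one unit-time segment; the function names and the toy numbers are hypothetical.

```python
def time_average(frames):
    """Average a [time][dimension] matrix over the time axis."""
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(len(frames[0]))]


def audio_feature(mfcc_frames, mel_frames):
    """Concatenate time-averaged MFCC with the time-averaged mel spectrum."""
    return time_average(mfcc_frames) + time_average(mel_frames)


# Toy values for one 1-second segment: 3 frames, 2 MFCC and 4 mel bands each.
mfcc_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mel_frames = [[0.0, 1.0, 2.0, 3.0], [2.0, 1.0, 2.0, 3.0], [4.0, 1.0, 2.0, 3.0]]
feature = audio_feature(mfcc_frames, mel_frames)
print(feature)
```

Averaging over time reduces each segment to a single fixed-length vector, which is the form an outlier detector operating on a feature space expects.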

The audio features that the feature extraction unit 12 extracts from the audio data of the video data are not limited to the representation shown in FIG. 6; the feature extraction unit 12 may extract audio features by any other method and representation.
The feature extraction unit 12 may also extract image features from all or part of the image data separated from the video data, using, for example, a convolutional neural network (CNN). However, the method by which the feature extraction unit 12 extracts image features from the image data is not limited to a CNN, and any method can be used.

Returning to FIG. 2, in S2 the abnormal scene candidate detection unit 13 of the abnormality detection device 1 detects abnormal scene candidates by classifying, through unsupervised learning, the audio features of the audio data separated from the video data, as supplied from the feature extraction unit 12.
The abnormal scene candidate detection unit 13 uses, for example, an Isolation Forest (IF) to isolate audio features with anomalous values in the audio feature space, and detects the video scenes corresponding to the isolated anomalous audio features as abnormal scene candidates.

Isolation Forest is an unsupervised learning method that isolates features with anomalous values directly, without modeling (profiling) the group of features with normal values to generate a normal model. Because the algorithm is fast and consumes little memory, it is suited to monitoring video distributed in real time, and because no normal model needs to be built, accuracy is unlikely to degrade even with a small number of samples.

FIG. 7 is a schematic diagram illustrating the algorithm by which an Isolation Forest isolates features with anomalous values in the feature space. The Isolation Forest repeatedly generates partitions, shown as broken lines in FIG. 7, until each feature point placed in the feature space is separated from all other feature points. In FIG. 7, the leftmost and rightmost feature points each require fewer partitions than the feature points near the center. Because each of them can be separated from all other feature points with a single partition, they can be detected as feature points with anomalous values.
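The partitioning idea can be made concrete with a small sketch that counts how many random axis-aligned partitions are needed to isolate a given point. It is a simplified illustration rather than a full Isolation Forest: it follows only the branch containing the point of interest, uses every point for every tree (no subsampling), and the data, function names, and tree count are invented for the example.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random axis-aligned partitions needed to isolate `point`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only split along dimensions in which the remaining points still differ.
    dims = [d for d in range(len(point))
            if min(p[d] for p in data) < max(p[d] for p in data)]
    if not dims:
        return depth
    dim = rng.choice(dims)
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    split = rng.uniform(lo, hi)
    # Keep only the side of the partition that still contains `point`.
    same_side = [p for p in data if (p[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Toy 2-D feature points: a tight 5 x 5 cluster plus one far-away outlier.
cluster = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
outlier = (10.0, 10.0)
points = cluster + [outlier]

mean_out = mean_path_length(outlier, points)
mean_in = mean_path_length(cluster[12], points)   # a point in the cluster centre
print(mean_out, mean_in)
```

The outlier is usually cut off by the very first partition, whereas the central point needs several, mirroring the contrast between the edge points and the central points in FIG. 7.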

FIG. 8 is a conceptual diagram representing the iterative partition generation of FIG. 7 as a binary tree structure (Isolation Tree). The number of partitions in FIG. 7 corresponds to the path length from the root node to the terminal node of the tree structure in FIG. 8.
The abnormal scene candidate detection unit 13 calculates an anomaly score for each feature point from the path length of the feature point of each audio feature, using Equation 1 below.

$$S(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \qquad \text{(Equation 1)}$$

Here, E(h(x)) in the exponent on the right-hand side is the average path length, and c(n) is a normalization factor that depends on the number of instances n in the data set. The anomaly score S(x, n) of each feature point approaches 1 as the average path length becomes shorter and approaches 0 as it becomes longer.

In FIG. 8, the bar on the left corresponds to anomaly score values from 0 to 1, running from its lower end to its upper end. The abnormal scene candidate detection unit 13 compares the anomaly score S(x, n) of each feature point with a threshold θ set between 0 and 1. It judges the audio features of feature points whose anomaly score S(x, n) exceeds θ to be anomalous values (outliers), and those of feature points whose anomaly score S(x, n) is at or below θ to be normal values.
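A minimal sketch of the scoring and thresholding follows, assuming Equation 1 takes the standard Isolation Forest form S(x, n) = 2^(-E(h(x)) / c(n)). The text does not spell out c(n); the expression used here, c(n) = 2H(n-1) - 2(n-1)/n with the harmonic number H approximated by a logarithm plus Euler's constant, is the one usually given for Isolation Forest and is an assumption. The threshold value 0.6 is likewise hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Normalization factor: average path length of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_len, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n)); requires n >= 2."""
    return 2.0 ** (-mean_path_len / c(n))


def is_candidate(mean_path_len, n, theta=0.6):
    """True when the score exceeds the threshold theta (hypothetical value)."""
    return anomaly_score(mean_path_len, n) > theta


n = 256
print(anomaly_score(1.0, n), is_candidate(1.0, n))     # short path: high score
print(anomaly_score(12.0, n), is_candidate(12.0, n))   # long path: low score
```

A feature point whose average path length equals c(n) scores exactly 0.5, which gives a rough sense of where a threshold θ sits relative to typical points.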

The abnormal scene candidate detection unit 13 detects, as an abnormal scene candidate, a video scene containing an audio feature for which an anomaly score exceeding the threshold θ was calculated, together with the corresponding image features.
The unsupervised learning algorithm that the abnormal scene candidate detection unit 13 uses to detect abnormal scene candidates is not limited to Isolation Forest. Instead of Isolation Forest, the abnormal scene candidate detection unit 13 may use a variational autoencoder (VAE) and detect abnormal scene candidates by calculating a reconstruction score for the audio features, or it may use any other unsupervised learning method.

Returning to FIG. 2, in S3 the abnormal scene candidate detection unit 13 of the abnormality detection device 1 compares the number of learning data items stored in the learning data DB 2 with a predetermined threshold. If the number exceeds the threshold (S3: Y), the processing proceeds to S4 and the abnormal scene candidate detection unit 13 supplies the detected abnormal scene candidates to the abnormality determiner 14. If the number is at or below the threshold (S3: N), the processing bypasses the abnormality determiner (S4 to S6) and proceeds to S7.

In the present embodiment, when the abnormality determiner 14 has not yet accumulated enough learning data for abnormal scene determination in the learning data DB 2, the accuracy (reliability) of abnormal scene determination by the abnormality determiner 14 is judged to be insufficient, and the processing in the abnormality determiner 14 is bypassed. The scene presentation unit 15 then presents the abnormal scene candidates detected by the abnormal scene candidate detection unit 13 directly to the operator and accepts the operator's confirmation input. In this way, while the number of learning data samples is small, the operator is always asked to confirm each abnormal scene candidate, which reduces the processing load of running machine learning in the abnormality determiner 14.

As described above, the present embodiment autonomously optimizes how abnormal scenes are determined from the detected abnormal scene candidates. Specifically, while the number of learning data samples for the abnormality determiner 14 is small, determination by the operator takes priority, which prevents a loss of accuracy in abnormal scene determination. Once the number of learning data samples for the abnormality determiner 14 exceeds the predetermined threshold, only the candidates that the abnormality determiner 14 could not classify as either an abnormal scene or a normal scene are presented to the operator for confirmation input, which further reduces the operator's load.
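The control flow around S3 can be summarized in a few lines. The sketch below is only an illustration of the branching described above: `route_candidates`, the label strings, and the stub classifier are hypothetical names, and the classifier stands in for whatever the abnormality determiner 14 actually does.

```python
def route_candidates(candidates, n_training_samples, min_samples, classify):
    """Decide which candidates go to the operator (S3 and the steps after it).

    `classify` is any callable returning "abnormal", "normal" or "operator".
    Returns (candidates for the operator, automatically labelled candidates).
    """
    if n_training_samples <= min_samples:
        # Too little learning data: bypass the determiner entirely (S3: N).
        return list(candidates), []
    to_operator, auto_labelled = [], []
    for cand in candidates:
        label = classify(cand)
        if label == "operator":
            to_operator.append(cand)            # determiner could not decide
        else:
            auto_labelled.append((cand, label))
    return to_operator, auto_labelled


def stub_classify(cand):
    """Stand-in classifier used only for this example."""
    return "operator" if cand == "scene-b" else "abnormal"


few = route_candidates(["scene-a", "scene-b"], 10, 100, stub_classify)
many = route_candidates(["scene-a", "scene-b"], 500, 100, stub_classify)
print(few)
print(many)
```

With few learning samples every candidate reaches the operator; with many, only the undecided one does.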

In S4, the abnormality determiner 14 of the abnormality detection device 1 determines whether each abnormal scene candidate supplied from the abnormal scene candidate detection unit 13 is abnormal by classifying it as a normal scene, an abnormal scene, or a scene requiring the operator's judgment, and stores the determination result in the learning data DB 2.
Specifically, the abnormality determiner 14 classifies the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 by using, for example, the k-nearest neighbor algorithm (k-NN) as the supervised learning method and searching for the nearest neighbors in the feature space in which the audio features and image features are integrated.

FIG. 9 is a conceptual diagram illustrating an example of a classification algorithm based on the k-nearest neighbor method.
In FIG. 9, a group of objects indicated by circles is arranged in the feature space. Each object is represented by a position vector in the multidimensional feature space, and its correct class is known. The star at the center of the concentric circles is the position vector of the object to be classified, whose class is unknown; in the present embodiment it is the position vector of the abnormal scene candidate to be judged. The k-nearest neighbor method calculates the distances between the new position vector indicated by the star and the existing position vectors indicated by the circles, and selects the k nearest samples. The distance between position vectors may be calculated as the Euclidean distance, or as another distance such as the Manhattan distance.

In FIG. 9, when k = 3, the inner concentric circle contains the three nearest objects, two dark circles and one light circle, so the position vector to be judged is classified into the class of the dark circles. When k = 6, the outer concentric circle contains the six nearest objects, two dark circles and four light circles, so the position vector to be judged is classified into the class of the light circles. The class may also be decided by weighting the k nearest objects according to their distance from the new position vector.
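The effect of k on the outcome can be reproduced with a small majority-vote sketch. The coordinates below are invented so that the neighbor counts match the k = 3 and k = 6 cases described for FIG. 9; they are not taken from the figure. The Euclidean distance is used, and `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter


def knn_classify(query, samples, k):
    """Majority vote among the k labelled samples nearest to `query`."""
    ranked = sorted(samples, key=lambda s: math.dist(query, s[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Two "dark" and one "light" sample close to the query, three more "light"
# samples slightly farther away, and one distant "dark" sample.
samples = [
    ((0.5, 0.0), "dark"), ((0.0, 0.6), "dark"), ((-0.7, 0.0), "light"),
    ((1.5, 0.0), "light"), ((0.0, -1.6), "light"), ((-1.2, 1.2), "light"),
    ((4.0, 4.0), "dark"),
]
query = (0.0, 0.0)

print(knn_classify(query, samples, k=3))   # 2 dark vs 1 light
print(knn_classify(query, samples, k=6))   # 2 dark vs 4 light
```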

The abnormality determiner 14 maps an abnormal scene candidate whose correct class is unknown, as a position vector, onto the feature space in which the audio features and image features of the video data are integrated. It then compares the number of samples classified as abnormal scenes with the number classified as normal scenes among the k nearest objects (samples), and thereby classifies the candidate as an abnormal scene, a normal scene, or a scene requiring the operator's judgment.

Specifically, in the feature space in which the audio features and image features are integrated, the abnormality determiner 14 judges the candidate to be an abnormal scene when, among the k samples nearest to the candidate, the number classified as abnormal scenes is sufficiently larger than the number classified as normal scenes. Likewise, the abnormality determiner 14 judges the candidate to be a normal scene when, among the k nearest samples, the number classified as normal scenes is sufficiently larger than the number classified as abnormal scenes.

一方、異常判定器１４は、特徴空間上で、異常シーンの候補に対するｋ個の最近傍のサンプルのうち、異常シーンに分類されるサンプルの数と正常シーンに分類されるサンプルの数との差が小さく、所定の閾値内である場合、判定対象の異常シーンの候補を、オペレータの判断を要するシーンであると判定する。
代替的に、異常判定器１４は、ｋ個の最近傍の異常シーンのサンプル数と正常シーンのサンプル数との大小により、判定対象の異常シーンの候補を、異常シーンまたは正常シーンのいずれかに自動的に分類してもよい。特に、学習データＤＢ２に十分なサンプル数の学習データが蓄積されている場合には、異常シーン検出においてオペレータの介入を不要ともできる。
なお、異常判定器１４が異常シーンの候補の異常を判定するためのアルゴリズムは、上記のｋ近傍法に限定されない。異常判定器１４は、例えば、ＣＮＮ等のニューラルネットワークや、サポートベクタマシン（ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ：ＳＶＭ）等を含む、他の教師あり学習の機械学習アルゴリズムを使用して異常シーンを判定してよい。 On the other hand, when, among the k nearest neighbor samples of the abnormal scene candidate in the feature space, the difference between the number of samples classified as abnormal scenes and the number of samples classified as normal scenes is small and within a predetermined threshold, the abnormality determination device 14 determines that the abnormal scene candidate to be determined is a scene that requires the operator's judgment.
Alternatively, the abnormality determination device 14 may automatically classify the abnormal scene candidate to be determined as either an abnormal scene or a normal scene, according to which is larger among the k nearest neighbors: the number of abnormal scene samples or the number of normal scene samples. In particular, when a sufficient number of learning data samples have been accumulated in the learning data DB 2, operator intervention in abnormal scene detection can be made unnecessary.
The algorithm by which the abnormality determination device 14 determines whether an abnormal scene candidate is abnormal is not limited to the k-nearest neighbor method described above. The abnormality determination device 14 may determine abnormal scenes using other supervised machine learning algorithms, including, for example, a neural network such as a CNN or a support vector machine (SVM).
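The three-way decision described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the abnormality determination device 14: the function name `classify_candidate`, the label strings, the use of Euclidean distance, and the parameters `k` and `margin` (standing in for the predetermined threshold) are all hypothetical.

```python
import math
from collections import Counter


def classify_candidate(candidate, samples, k=5, margin=1):
    """Classify a candidate feature vector as 'abnormal', 'normal' or 'operator'.

    candidate: sequence of floats (integrated audio + image features).
    samples:   list of (feature_vector, label) pairs, where label is
               'abnormal' or 'normal' (hypothetical label names).
    margin:    assumed stand-in for the predetermined threshold; if the vote
               difference is within it, the operator is asked to decide.
    """
    # Pick the k labelled samples closest to the candidate (Euclidean distance).
    nearest = sorted(samples, key=lambda s: math.dist(candidate, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    diff = votes["abnormal"] - votes["normal"]
    if abs(diff) <= margin:
        return "operator"  # votes too close: requires the operator's judgment
    return "abnormal" if diff > 0 else "normal"
```

Setting `margin` to a negative value would make the function always return `abnormal` or `normal` when the votes differ, which roughly corresponds to the fully automatic alternative mentioned above.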

図２に戻り、Ｓ４で、異常判定器１４は、異常シーンの候補が異常シーンであると判定した場合、Ｓ５に進み、異常シーンを含む映像コンテンツの配信停止、当該映像コンテンツの配信元のアカウントの削除、あるいは判定された異常シーンの削除等の処理を実行して処理を終了する。
異常判定器１４は、異常シーンの候補が正常シーンであると判定した場合、Ｓ６に進み、異常シーンの候補を含む映像コンテンツの配信を続行して処理を終了する。一方、異常判定器１４は、異常シーンの候補が、オペレータの判断を要するシーンであると判定した場合、Ｓ７に進む。 Returning to FIG. 2, when the abnormality determination device 14 determines in S4 that the abnormal scene candidate is an abnormal scene, the process proceeds to S5, where processing such as stopping distribution of the video content including the abnormal scene, deleting the account of the distribution source of the video content, or deleting the determined abnormal scene is executed, and the process ends.
When the abnormality determining device 14 determines that the candidate for the abnormal scene is a normal scene, the process proceeds to S6, continues distribution of the video content including the candidate for the abnormal scene, and ends the process. On the other hand, if the abnormality determination device 14 determines that the candidate for the abnormality scene is a scene that requires the operator's judgment, the process proceeds to S7.

Ｓ７で、異常検出装置１のシーン提示部１５は、異常シーン候補検出部１３から供給された異常シーンの候補、あるいは異常判定器１４によりオペレータの判断を要するシーンと判定された異常シーンの候補の映像（音声データおよび画像データ）を、オペレータに提示する。
Ｓ８で、異常検出装置１のシーン提示部１５は、Ｓ７で提示された異常シーンの候補の映像に対するオペレータの確認入力として、異常シーンまたは正常シーンのいずれかのタグ付けの入力を受け付ける。
Ｓ９で、オペレータは、異常シーンとタグ付けした異常シーンの候補について、異常シーンに対する処理、すなわち、異常シーンを含む映像コンテンツの配信停止、当該映像コンテンツの配信元のアカウントの削除、あるいは判定された異常シーンの削除等の処理を実行する。一方、オペレータは、正常シーンとタグ付けした異常シーンの候補については、異常シーンに対する処理を実行することなく、映像の配信を続行させる。 In S7, the scene presentation unit 15 of the abnormality detection device 1 presents to the operator the video (audio data and image data) of an abnormal scene candidate supplied from the abnormal scene candidate detection unit 13, or of an abnormal scene candidate that the abnormality determination device 14 has determined to be a scene requiring the operator's judgment.
In S8, the scene presentation unit 15 of the abnormality detection device 1 accepts, as the operator's confirmation input for the video of the abnormal scene candidate presented in S7, a tag indicating either an abnormal scene or a normal scene.
In S9, for an abnormal scene candidate tagged as an abnormal scene, the operator executes the processing for abnormal scenes, that is, processing such as stopping distribution of the video content including the abnormal scene, deleting the account of the distribution source of the video content, or deleting the determined abnormal scene. On the other hand, for an abnormal scene candidate tagged as a normal scene, the operator lets the video distribution continue without executing the processing for abnormal scenes.

Ｓ１０で、異常検出装置１のシーン提示部１５は、Ｓ８でシーン提示部１５に入力されたオペレータの異常シーンまたは正常シーンのタグ（ラベル）を、提示された異常シーンの候補の音声特徴および画像特徴と対応付けて、オペレータによる異常シーンの判定結果である学習データとして、学習データＤＢ２に格納する。これにより、新たな学習データで、学習データＤＢ２が更新される。 In S10, the scene presentation unit 15 of the abnormality detection device 1 associates the operator's abnormal scene or normal scene tag (label), input to the scene presentation unit 15 in S8, with the audio features and image features of the presented abnormal scene candidate, and stores it in the learning data DB 2 as learning data representing the operator's determination result for the abnormal scene. As a result, the learning data DB 2 is updated with the new learning data.
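The update of the learning data in S10 can be illustrated with a small in-memory stand-in for the learning data DB 2. The sketch below is an assumption-laden illustration only: the class name `TrainingDB`, its `add` method, the label strings, and the choice of storing the audio and image features as one concatenated tuple are all hypothetical and are not specified in the text.

```python
from dataclasses import dataclass, field

VALID_LABELS = ("abnormal", "normal")  # hypothetical tag names for S8


@dataclass
class TrainingDB:
    """In-memory stand-in for the learning data DB 2 (illustrative only)."""

    samples: list = field(default_factory=list)

    def add(self, audio_features, image_features, label):
        """Store the operator's tag together with the candidate's features."""
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        # Associate the label with the audio and image features of the
        # presented candidate, here simply as one concatenated tuple.
        features = tuple(audio_features) + tuple(image_features)
        self.samples.append((features, label))
        return len(self.samples)  # number of learning data samples so far
```

A k-nearest neighbor classifier could read `samples` directly at decision time, which is why no separate re-learning step would be needed in that case.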

Ｓ１１で、異常検出装置１の異常判定器１４は、Ｓ１０で更新された学習データＤＢ２を基づいて、再学習を実行する。なお、異常判定器１４の再学習が必要か否かは、異常判定器１４の異常判定アルゴリズムに依存する。例えば、上記で説明したように、異常判定器１４がｋ近傍法を使用する場合は、異常シーンの判定の度に、学習データＤＢ２を参照してｋ個の最近傍の標本を選ぶため、Ｓ１１で再学習を実行する必要がなく、Ｓ１０で学習データＤＢ２を更新すれば足り、Ｓ１１の処理を省略してよい。 In S11, the abnormality determination device 14 of the abnormality detection device 1 executes re-learning based on the learning data DB 2 updated in S10. Whether or not re-learning of the abnormality determination device 14 is necessary depends on its abnormality determination algorithm. For example, as described above, when the abnormality determination device 14 uses the k-nearest neighbor method, the k nearest neighbor samples are selected with reference to the learning data DB 2 each time an abnormal scene is determined. It is therefore unnecessary to execute re-learning in S11; updating the learning data DB 2 in S10 is sufficient, and the processing of S11 may be omitted.

一方、異常判定器１４が、ニューラルネットワークやＳＶＭ等を使用する場合は、いずれかのタイミングで異常判定器１４を再学習させて、異常判定器１４のパラメータを更新する必要がある。
再学習のタイミングは、学習データＤＢ２を更新する度に、異常判定器１４を毎回再学習させてもよく、学習データＤＢ２が所定回数更新される度に、異常判定器１４を再学習させてもよい。あるいは、Ｓ４で異常判定器１４を使用する直前に、異常判定器１４を再学習させることもできるが、リアルタイム配信される映像からリアルタイムで異常シーンを検出しようとする場合には、再学習実行によりリアルタイム性が低下しかねないことを考慮すべきである。 On the other hand, when the abnormality determination device 14 uses a neural network, SVM, or the like, it is necessary to relearn the abnormality determination device 14 at any timing to update the parameters of the abnormality determination device 14.
As for the timing of re-learning, the abnormality determination device 14 may be re-learned every time the learning data DB 2 is updated, or every time the learning data DB 2 has been updated a predetermined number of times. Alternatively, the abnormality determination device 14 can be re-learned immediately before it is used in S4; however, when abnormal scenes are to be detected in real time from video distributed in real time, it should be taken into account that executing re-learning may degrade real-time performance.
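The re-learning schedules mentioned above (after every update, or after a predetermined number of updates) can be sketched as a small counter. The class name `RetrainScheduler`, the callback `retrain_fn`, and the parameter `every_n_updates` are hypothetical names introduced for illustration; the text does not prescribe any particular mechanism.

```python
class RetrainScheduler:
    """Trigger re-learning once every `every_n_updates` DB updates."""

    def __init__(self, retrain_fn, every_n_updates=1):
        if every_n_updates < 1:
            raise ValueError("every_n_updates must be at least 1")
        self.retrain_fn = retrain_fn          # assumed re-learning callback
        self.every_n_updates = every_n_updates
        self.pending = 0                      # updates since last re-learning

    def on_db_update(self):
        """Call after each update of the learning data; True if re-learned."""
        self.pending += 1
        if self.pending < self.every_n_updates:
            return False
        self.retrain_fn()
        self.pending = 0
        return True
```

With `every_n_updates=1` this re-learns on every update; a larger value batches updates, which may help keep re-learning off the real-time path.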

＜変形例＞
図１０は、本実施形態に係る異常検出装置１が実行する異常シーン検出処理の変形例を示す図である。
異常シーン候補検出部１３は、変形例として、図１０に示すように、Ｓ３の処理を省略して、学習データＤＢ２に蓄積される学習データの数にかかわりなく、一律に、検出された異常シーンの候補を、異常判定器１４に供給してもよい。これにより、映像監視におけるオペレータの異常シーンの候補の確認処理の負荷をさらに軽減することができる。 <Modification example>
FIG. 10 is a diagram showing a modified example of the abnormality scene detection process executed by the abnormality detection device 1 according to the present embodiment.
As a modified example, as shown in FIG. 10, the abnormal scene candidate detection unit 13 may omit the processing of S3 and uniformly supply the detected abnormal scene candidates to the abnormality determination device 14, regardless of the number of learning data accumulated in the learning data DB 2. This makes it possible to further reduce the operator's load in confirming abnormal scene candidates in video monitoring.

以上説明したように、本実施形態によれば、異常検出装置は、取得された映像データ中の音声特徴および画像特徴を抽出し、抽出された音声特徴を教師なし学習により分類することにより、映像データから異常シーンの候補を検出する。異常検出装置はまた、検出された異常シーンの候補を、映像データの音声特徴および画像特徴に基づいて、異常、正常、およびその他のいずれかに判定する異常判定器を備え、異常判定器により、異常シーンの候補がその他に属すると判定された場合、当該異常シーンの候補を、ユーザインタフェースを介して提示し、提示された異常シーンの候補に対して付加すべき情報の入力を、ユーザインタフェースを介して受け付ける。 As described above, according to the present embodiment, the anomaly detection device extracts audio features and image features in the acquired video data, and classifies the extracted audio features by unsupervised learning. Detects abnormal scene candidates from the data. The anomaly detector also comprises an anomaly determiner that determines any of the detected anomalous scene candidates as anomalous, normal, or otherwise based on the audio and image features of the video data. When it is determined that the candidate for the abnormal scene belongs to others, the candidate for the abnormal scene is presented via the user interface, and the user interface is used to input information to be added to the presented candidate for the abnormal scene. Accept through.

これにより、第１段階で、映像データ中の音声特徴に基づいて教師なし学習により高速かつ低負荷で異常シーンの候補を第１段階として検出し、第２段階で、映像データ中の音声特徴および画像特徴に基づく異常判定器による異常シーンの判定と、映像提示に基づくオペレータの目視による異常シーンの判定とを補完的に併用する。
したがって、オペレータの負荷を軽減しつつ、映像から多様な異常を高精度に検出することができる。
これにより、リアルタイムで配信され、多様な異常シーンを含み得る映像データに十分に追従した、高速かつ高精度な異常シーンのマルチモーダルな検出が実現できる。 As a result, in the first stage, abnormal scene candidates are detected at high speed and with low load by unsupervised learning based on the audio features in the video data, and in the second stage, the determination of abnormal scenes by the abnormality determination device based on the audio features and image features in the video data and the visual determination of abnormal scenes by the operator based on the presented video are used in a complementary manner.
Therefore, it is possible to detect various abnormalities from the video with high accuracy while reducing the load on the operator.
As a result, it is possible to realize high-speed and high-precision multimodal detection of abnormal scenes that are delivered in real time and sufficiently follow the video data that may include various abnormal scenes.

（実施形態２）
以下、図１１～図１４を参照して、実施形態２を、実施形態１と異なる点についてのみ詳細に説明する。
本実施形態では、上記で説明した実施形態１に加え、さらに、映像データの画像特徴から映像中のオブジェクトである人の感情を解析し、感情解析結果を異常シーンの候補の検出や異常シーンの判定に用いる。 (Embodiment 2)
Hereinafter, with reference to FIGS. 11 to 14, the second embodiment will be described in detail only with respect to the points that differ from the first embodiment.
In this embodiment, in addition to the first embodiment described above, the emotions of a person who is an object in the video are analyzed from the image features of the video data, and the emotion analysis results are used to detect candidates for abnormal scenes and to detect abnormal scenes. Used for judgment.

図１１は、本実施形態に係る異常検出装置１の機能構成の一例を示すブロック図である。
図１１のブロック図では、図１に示す実施形態１の異常検出装置１の機能構成に加えて、感情解析部１６を備える。
図１１において、データ取得部１１、特徴抽出部１２、異常シーン候補１３、異常判定器１４、およびシーン提示部１５の機能構成は、図１に示す対応する各部と同様である。
図１１を参照して、特徴抽出部１２は、映像データ中の画像データから抽出した画像特徴を、感情解析部１６へ供給する。 FIG. 11 is a block diagram showing an example of the functional configuration of the abnormality detection device 1 according to the present embodiment.
In the block diagram of FIG. 11, in addition to the functional configuration of the abnormality detection device 1 of the first embodiment shown in FIG. 1, an emotion analysis unit 16 is provided.
In FIG. 11, the functional configurations of the data acquisition unit 11, the feature extraction unit 12, the abnormal scene candidate detection unit 13, the abnormality determination device 14, and the scene presentation unit 15 are the same as those of the corresponding units shown in FIG. 1.
With reference to FIG. 11, the feature extraction unit 12 supplies the image features extracted from the image data in the video data to the emotion analysis unit 16.

感情解析部１６は、特徴抽出部１２から供給される映像データの画像特徴に基づいて、画像中のオブジェクトである人の顔の感情を解析する。
感情解析部１６は、例えば、ＣＮＮ等の教師あり学習を用いて、画像中の人の顔を解析することで、画像中の人の顔の感情を推定してよい。人の顔の画像から推定される人の顔の感情は、例えば、怒り、嫌悪、恐怖、幸福、悲しみ、驚き、その他（ニュートラル）の感情を含んでよい。
感情解析部１６はまた、時間的に隣接する複数の画像フレームの間で算出される、推定された感情の平均信頼度に基づいて、対象画像中の人の顔の感情を決定してもよい。 The emotion analysis unit 16 analyzes the emotion of the human face, which is an object in the image, based on the image features of the video data supplied from the feature extraction unit 12.
The emotion analysis unit 16 may estimate the emotion of the person's face in the image by analyzing the face of the person in the image by using, for example, supervised learning such as CNN. The emotions of a person's face estimated from the image of the person's face may include, for example, anger, disgust, fear, happiness, sadness, surprise, and other (neutral) emotions.
The emotion analysis unit 16 may also determine the emotion of a person's face in the target image based on the average reliability of the estimated emotions, calculated across a plurality of temporally adjacent image frames.
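Averaging the estimated emotion reliabilities over temporally adjacent frames can be sketched as follows. This is a minimal sketch, assuming that each frame yields a dictionary mapping an emotion name to a reliability score in the range 0 to 1; the function name `average_emotion` and this data layout are hypothetical and are not taken from the text.

```python
def average_emotion(frame_scores):
    """Return (emotion, mean reliability) over temporally adjacent frames.

    frame_scores: list of dicts, one per frame, mapping an emotion name to
    its estimated reliability (assumed layout). Emotions missing from a
    frame are treated as having reliability 0 in that frame.
    """
    if not frame_scores:
        raise ValueError("at least one frame is required")
    totals = {}
    for scores in frame_scores:
        for emotion, reliability in scores.items():
            totals[emotion] = totals.get(emotion, 0.0) + reliability
    averages = {e: total / len(frame_scores) for e, total in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

Averaging in this way may suppress a spurious single-frame estimate, such as a brief spike in anger between otherwise neutral frames.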

感情解析部１６はさらに、画像中の人の身体や四肢の動き、人が把持等するオブジェクト（例えば、マイクロフォン、楽器等）、または背景（例えば、屋内か屋外か等）を解析してよい。特徴抽出部１２は、感情解析部１６が解析すべき対象オブジェクトの特徴を抽出して、感情解析部１６へ供給してよい。 The emotion analysis unit 16 may further analyze the movement of a person's body or limbs in an image, an object held by the person (for example, a microphone, a musical instrument, etc.), or a background (for example, indoor or outdoor). The feature extraction unit 12 may extract the features of the target object to be analyzed by the emotion analysis unit 16 and supply them to the emotion analysis unit 16.

本実施形態において、感情解析部１６が人の顔の画像の画像特徴から推定する人の顔の感情は、異常シーン候補検出部１３が実行する異常シーンの候補の検出処理、および異常判定器１４が実行する異常シーンの候補の異常判定処理を補完する。
具体的には、感情解析部１６は、人の顔の画像特徴から推定された人の顔の感情から、映像の文脈を推定して、異常シーン候補検出部１３に対して、異常シーン候補検出処理のキュー（トリガ）を与えてもよい。例えば、感情解析部１６が、人の顔の画像を解析して人の顔の感情として、例えば、怒り、恐怖、驚き等を検出した場合、当該画像を含む映像は、異常シーンである可能性が高いため、感情解析部１６は、異常シーン解析部１３にキューを与えて、当該映像の音声特徴から異常シーンの候補を検出する処理を実行させてもよい。 In the present embodiment, the emotion of the human face that the emotion analysis unit 16 estimates from the image features of the image of the human face complements the abnormal scene candidate detection processing executed by the abnormal scene candidate detection unit 13 and the abnormality determination processing of abnormal scene candidates executed by the abnormality determination device 14.
Specifically, the emotion analysis unit 16 may estimate the context of the video from the emotion of the human face estimated from the image features of the human face, and give the abnormal scene candidate detection unit 13 a cue (trigger) for the abnormal scene candidate detection processing. For example, when the emotion analysis unit 16 analyzes an image of a human face and detects, for example, anger, fear, or surprise as the emotion of the human face, the video including that image is likely to be an abnormal scene; the emotion analysis unit 16 may therefore give a cue to the abnormal scene candidate detection unit 13 to execute the processing of detecting abnormal scene candidates from the audio features of the video.

感情解析部１６はまた、人の顔の画像特徴から推定される人の顔の感情の特徴を異常判定器１４に供給し、異常判定器１４が、感情解析部１６から供給される人の顔の感情の特徴を特徴空間に統合して、ｋ近傍法により、異常シーンの候補を異常判定してもよい。例えば、異常判定器１４は、人の顔の感情として、例えば、怒り、恐怖、驚き等の特徴を、異常シーンと判定するための正因子として使用してよい。
感情解析部１６はさらに、人の顔の画像特徴から推定される人の顔の感情の解析結果を、シーン提示部１５に供給し、シーン提示部１５が、感情解析部１６から供給される人の顔の感情の解析結果を、例えば、提示される映像中に重畳表示や別ウインドウ表示等で併せて表示してもよい。 The emotion analysis unit 16 may also supply the emotional features of the human face, estimated from the image features of the human face, to the abnormality determination device 14, and the abnormality determination device 14 may integrate the emotional features of the human face supplied from the emotion analysis unit 16 into the feature space and determine whether the abnormal scene candidate is abnormal by the k-nearest neighbor method. For example, the abnormality determination device 14 may use facial emotion features such as anger, fear, and surprise as positive factors for determining an abnormal scene.
The emotion analysis unit 16 further supplies the analysis result of the emotion of the human face estimated from the image features of the human face to the scene presentation unit 15, and the scene presentation unit 15 is supplied from the emotion analysis unit 16. The analysis result of the emotion of the face may be displayed together with, for example, a superimposed display or a separate window display in the presented image.
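One simple way to integrate emotion features into the audio and image feature space is to append a fixed-order emotion reliability vector to the existing features. The sketch below assumes plain concatenation, which the text does not specify; the function name `build_feature_vector`, the constant `EMOTIONS`, and the dictionary layout of the emotion scores are hypothetical. The seven emotion names follow the list given earlier in this embodiment.

```python
# Fixed ordering so that every vector has the same dimensionality.
EMOTIONS = ("anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral")


def build_feature_vector(audio_features, image_features, emotion_scores):
    """Concatenate audio, image and emotion features into one vector.

    emotion_scores: dict mapping emotion name to reliability (assumed
    layout); emotions that are absent are filled with 0.0.
    """
    emotion_part = tuple(float(emotion_scores.get(name, 0.0))
                         for name in EMOTIONS)
    return tuple(audio_features) + tuple(image_features) + emotion_part
```

The resulting vector can be passed to a distance-based classifier such as the k-nearest neighbor method; in practice the components might need scaling so that no single modality dominates the distance.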

図１２は、実施形態２に係る異常検出装置１が実行する異常シーン検出処理の処理手順の一例を示すフローチャートである。
図１２のフローチャートでは、図２に示す実施形態１の異常検出装置１が実行する異常シーン検出処理に対して、Ｓ１とＳ２の間に、Ｓ１２の処理が追加されている。
Ｓ１の処理は、図２に示す実施形態１と同様である。すなわち、実施形態１と同様、異常検出装置１の特徴抽出部１２は、データ取得部１１により供給される映像データから、音声特徴および画像特徴をそれぞれ抽出する。 FIG. 12 is a flowchart showing an example of a processing procedure of an abnormality scene detection process executed by the abnormality detection device 1 according to the second embodiment.
In the flowchart of FIG. 12, the process of S12 is added between S1 and S2 with respect to the abnormal scene detection process executed by the abnormality detection device 1 of the first embodiment shown in FIG.
The processing of S1 is the same as that of the first embodiment shown in FIG. That is, as in the first embodiment, the feature extraction unit 12 of the abnormality detection device 1 extracts audio features and image features from the video data supplied by the data acquisition unit 11, respectively.

Ｓ１で、異常検出装置１の特徴抽出部１２が、映像データから音声特徴および画像特徴がそれぞれ抽出すると、Ｓ１２に進む。
Ｓ１２で、異常検出装置１の感情解析部１６は、特徴抽出部１２により抽出された画像特徴から、異常シーンの候補を検出する。具体的には、感情解析部１６は、画像中の人の顔の画像特徴から、人の感情を推定し、例えば、怒り、恐怖、驚き等の感情が推定された場合には、当該画像を含む映像シーンを異常シーンの候補として検出してよい。
感情解析部１６は、画像特徴から異常シーンの候補を検出した場合、後続するＳ２で実行される異常シーン候補検出部１３により実行される映像の音声特徴に基づく異常シーン候補の検出処理にキュー（トリガ）を与える。 In S1, when the feature extraction unit 12 of the abnormality detection device 1 extracts the audio feature and the image feature from the video data, the process proceeds to S12.
In S12, the emotion analysis unit 16 of the abnormality detection device 1 detects abnormal scene candidates from the image features extracted by the feature extraction unit 12. Specifically, the emotion analysis unit 16 estimates a person's emotion from the image features of the person's face in the image and, when an emotion such as anger, fear, or surprise is estimated, may detect the video scene including that image as an abnormal scene candidate.
When the emotion analysis unit 16 detects an abnormal scene candidate from the image features, it gives a cue (trigger) to the abnormal scene candidate detection processing based on the audio features of the video, which is executed by the abnormal scene candidate detection unit 13 in the subsequent S2.

Ｓ１２に続き、Ｓ２で、異常検出装置１の異常シーン候補検出部１３は、感情解析部１６が画像特徴から異常シーンの候補を検出してトリガを与えた場合、感情解析部１６から供給される異常シーンの候補に対応する音声特徴を教師なし学習を用いて分類することにより、異常シーンの候補を検出する。 Following S12, in S2, the abnormal scene candidate detection unit 13 of the abnormality detection device 1 is supplied from the emotion analysis unit 16 when the emotion analysis unit 16 detects an abnormal scene candidate from the image feature and gives a trigger. By classifying the voice features corresponding to the abnormal scene candidates using unsupervised learning, the abnormal scene candidates are detected.

代替的に、異常シーン候補検出部１３は、感情解析部１６からトリガを与えられるか否かにかかわりなく、常時、映像データの音声特徴から異常シーンの候補を検出し、感情解析部１６から画像特徴に基づく異常シーン候補検出のトリガを与えられた際に、検出された異常シーン候補の音声特徴から、異常シーンの候補として異常検出器１４に供給すべきかを確認してもよい。
Ｓ２～Ｓ１１までの処理は、図２に示す第１の実施形態と同様である。
なお、本実施形態に係る異常検出装置１は、図１０と同様、Ｓ３の判定及び分岐処理を省略し、学習データＤＢ２に格納される学習データの数にかかわりなく、Ｓ４の異常判定器１４による異常シーンの判定処理に進んでもよい。 Alternatively, the abnormal scene candidate detection unit 13 may constantly detect abnormal scene candidates from the audio features of the video data regardless of whether a trigger is given from the emotion analysis unit 16 and, when given a trigger for abnormal scene candidate detection based on image features from the emotion analysis unit 16, may check, from the audio features of the detected abnormal scene candidate, whether it should be supplied to the abnormality determination device 14 as an abnormal scene candidate.
The processing from S2 to S11 is the same as that of the first embodiment shown in FIG.
As in FIG. 10, the abnormality detection device 1 according to the present embodiment omits the determination and branch processing of S3, and uses the abnormality determination device 14 of S4 regardless of the number of learning data stored in the learning data DB 2. You may proceed to the determination process of the abnormal scene.

図１３は、実施形態２に係る異常検出装置１が実行する異常シーン検出処理の変形例の処理手順の一例を示すフローチャートである。
図１３のフローチャートでは、図１に示す実施形態１の異常検出装置１が実行する異常シーン検出処理に対して、Ｓ２とＳ３の間に、Ｓ１３の処理が追加されている。
Ｓ１およびＳ２の処理は、図２に示す実施形態１と同様である。すなわち、実施形態１と同様、異常検出装置１の特徴抽出部１２は、データ取得部１１により供給される映像データから、音声特徴および画像特徴をそれぞれ抽出し、異常シーン候補検出部１３は、特徴抽出部１２から供給される映像データの音声特徴に基づいて、異常シーンの候補を検出する。 FIG. 13 is a flowchart showing an example of a processing procedure of a modified example of the abnormality scene detection processing executed by the abnormality detection device 1 according to the second embodiment.
In the flowchart of FIG. 13, the process of S13 is added between S2 and S3 with respect to the abnormal scene detection process executed by the abnormality detection device 1 of the first embodiment shown in FIG. 2.
The processing of S1 and S2 is the same as that of the first embodiment shown in FIG. That is, as in the first embodiment, the feature extraction unit 12 of the abnormality detection device 1 extracts audio features and image features from the video data supplied by the data acquisition unit 11, and the abnormality scene candidate detection unit 13 extracts the features. A candidate for an abnormal scene is detected based on the audio characteristics of the video data supplied from the extraction unit 12.

次に、Ｓ１３で、異常検出装置１の感情解析部１６は、特徴抽出部１２から供給される画像データの画像特徴のうち、特に画像中に含まれる人の顔の画像特徴から、人の顔の感情を推定する。
Ｓ３～Ｓ１１までの処理は、図２に示す実施形態１と同様であるが、Ｓ４で、異常判定器１４は、感情解析部１６から供給される画像中の人の顔の感情の特徴を音声および画像の特徴空間に統合してよい。また、Ｓ７で、シーン提示部１５は、感情解析部１６の解析結果を、異常シーンの候補の画像と併せて提示してよい。
なお、図１３において、Ｓ２およびＳ１３は、同時並行的に実行されてもよく、Ｓ１３は、時系列的にＳ２より前に実行されてもよい。
また、本実施形態に係る異常検出装置１は、図１０と同様、Ｓ３の判定及び分岐処理を省略し、学習データＤＢ２に格納される学習データの数にかかわりなく、Ｓ４の異常判定器１４による異常シーンの判定処理に進んでもよい。 Next, in S13, the emotion analysis unit 16 of the abnormality detection device 1 determines the human face from the image features of the human face included in the image among the image features of the image data supplied from the feature extraction unit 12. Estimate the emotions of.
The processing from S3 to S11 is the same as that of the first embodiment shown in FIG. 2, except that in S4 the abnormality determination device 14 may integrate the emotional features of the human face in the image, supplied from the emotion analysis unit 16, into the audio and image feature space. Further, in S7, the scene presentation unit 15 may present the analysis result of the emotion analysis unit 16 together with the image of the abnormal scene candidate.
In addition, in FIG. 13, S2 and S13 may be executed in parallel, and S13 may be executed before S2 in chronological order.
Further, the abnormality detection device 1 according to the present embodiment omits the determination and branching processing of S3 as in FIG. 10, and uses the abnormality determination device 14 of S4 regardless of the number of learning data stored in the learning data DB 2. You may proceed to the determination process of the abnormal scene.

図１４は、異常検出装置１の感情解析部１６が映像データの画像を解析し、シーン提示部１５が提示する感情解析結果の出力例を示す図である。
図１４を参照して、画像中で、人の顔の周囲にバウンディングボックス１３１が表示され、人の顔のオブジェクトとして検出されたことを示している。このバウンディングボックス１３内の人の顔から推定された感情の信頼度が、出力ウインドウの左上に表示されている。
図１４の例では、怒りの信頼度が３７．４５％と最も高く算出されているが、バウンディングボックス１３１で包囲された人の顔の表情は、怒りを示しておらずニュートラルであるものとする。この場合、図１４の画像を提示されたオペレータは、提示された画像中の人の顔の表情を目視で確認し、異常シーンではない（すなわち、正常シーンである）との確認結果をシーン提示部１５に入力することができる。あるいは、感情解析部１６は、信頼度のスコアに所定の閾値を設け、怒りの信頼度のスコアが閾値以下である場合には、異常シーンの候補として検出しなくてもよい。 FIG. 14 is a diagram showing an output example of an emotion analysis result presented by the scene presentation unit 15 after the emotion analysis unit 16 of the abnormality detection device 1 analyzes an image of video data.
With reference to FIG. 14, a bounding box 131 is displayed around the human face in the image, indicating that it has been detected as a human face object. The reliability of the emotions estimated from the face of the person in the bounding box 131 is displayed in the upper left of the output window.
In the example of FIG. 14, the reliability of anger is calculated to be the highest at 37.45%, but the facial expression of the person enclosed by the bounding box 131 is assumed to be neutral and not to show anger. In this case, the operator presented with the image of FIG. 14 can visually check the facial expression of the person in the presented image and input to the scene presentation unit 15 the confirmation result that the scene is not an abnormal scene (that is, it is a normal scene). Alternatively, the emotion analysis unit 16 may set a predetermined threshold for the reliability score and refrain from detecting the scene as an abnormal scene candidate when the reliability score of anger is equal to or less than the threshold.
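The threshold check described above can be sketched as follows. The function name `emotion_trigger`, the set `ALERT_EMOTIONS`, the default threshold of 0.5, and the representation of reliabilities as fractions between 0 and 1 are all assumptions made for illustration; the text only states that a predetermined threshold may be applied to the reliability score.

```python
# Emotions that the text gives as examples suggesting an abnormal scene.
ALERT_EMOTIONS = ("anger", "fear", "surprise")


def emotion_trigger(scores, threshold=0.5):
    """Return True if any alert emotion exceeds the reliability threshold.

    scores:    dict mapping emotion name to reliability in [0, 1]
               (assumed layout).
    threshold: hypothetical value; a score equal to or below it does not
               cause the scene to be treated as an abnormal scene candidate.
    """
    return any(scores.get(name, 0.0) > threshold for name in ALERT_EMOTIONS)
```

Under these assumptions, the FIG. 14 example with an anger reliability of 37.45% (0.3745) would not be detected as an abnormal scene candidate with a threshold of 0.5.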

以上説明したように、本実施形態によれば、異常検出装置の異常判定器は、映像データの音声特徴、および画像特徴、特に、人の顔の感情の特徴、の双方のマルチモーダルな情報から、異常シーンである蓋然性が高いと判定された異常シーンの候補について、異常シーンの判定を実行すれば足りる。したがって、学習データのサンプル数が少ない場合であっても、高精度かつ低負荷で異常判定処理を実行することができる。
同様に、本実施形態によれば、異常検出装置のシーン提示部は、映像データの音声特徴、および画像特徴、特に、人の顔の感情の特徴、の双方のマルチモーダルな情報から、異常シーンである蓋然性が高いと判定された異常シーンの候補について、オペレータに提示すれば足りる。したがって、異常シーンの確認におけるオペレータの負荷がさらに軽減される。 As described above, according to the present embodiment, the abnormality determination device of the abnormality detection device is based on multimodal information of both audio characteristics and image characteristics of video data, particularly emotional characteristics of a human face. , It suffices to execute the determination of the abnormal scene for the candidate of the abnormal scene determined to have a high probability of being an abnormal scene. Therefore, even when the number of samples of the training data is small, the abnormality determination process can be executed with high accuracy and low load.
Similarly, according to the present embodiment, it suffices for the scene presentation unit of the abnormality detection device to present to the operator only those abnormal scene candidates that have been determined, from multimodal information comprising both the audio features and the image features of the video data (in particular, the emotional features of a human face), to be highly likely to be abnormal scenes. Therefore, the operator's load in confirming abnormal scenes is further reduced.

＜異常検出装置のハードウエア構成＞
図１５は、本実施形態に係る異常検出装置１のハードウエア構成の非限定的一例を示す図である。
本実施形態に係る異常検出装置１は、単一または複数の、あらゆるコンピュータ、モバイルデバイス、または他のいかなる処理プラットフォーム上にも実装することができる。
図１５を参照して、異常検出装置１は、単一のコンピュータに実装される例が示されているが、本実施形態に係る異常検出装置１は、複数のコンピュータを含むコンピュータシステムに実装されてよい。複数のコンピュータは、有線または無線のネットワークにより相互通信可能に接続されてよい。 <Hardware configuration of anomaly detection device>
FIG. 15 is a diagram showing a non-limiting example of the hardware configuration of the abnormality detection device 1 according to the present embodiment.
The anomaly detection device 1 according to the present embodiment can be implemented on a single or a plurality of any computer, mobile device, or any other processing platform.
Although FIG. 15 shows an example in which the abnormality detection device 1 is implemented on a single computer, the abnormality detection device 1 according to the present embodiment may be implemented on a computer system including a plurality of computers. The plurality of computers may be connected so as to be able to communicate with each other via a wired or wireless network.

図１５に示すように、異常検出装置１は、ＣＰＵ２１と、ＲＯＭ２２と、ＲＡＭ２３と、ＨＤＤ２４と、入力部２５と、表示部２６と、通信Ｉ／Ｆ２７と、システムバス２８とを備えてよい。異常検出装置１はまた、外部メモリを備えてよい。ＰＣ３もまた、図１５と同様の構成を備えてよい。
ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）２１は、異常検出装置１における動作を統括的に制御するものであり、データ伝送路であるシステムバス２８を介して、各構成部（２２～２７）を制御する。
異常検出装置１はまた、ＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）を備えてよい。ＧＰＵは、ＣＰＵ２１より高い計算機能を有し、複数または多数のＧＰＵを並列して動作させることにより、特に、本実施形態のような機械学習を使用する映像処理アプリケーションに、より高い処理パフォーマンスを提供する。ＧＰＵは、通常、プロセッサと共有メモリを含む。それぞれのプロセッサが高速の共有メモリからデータを取得し、共通プログラムを実行することで、同種の計算処理を大量かつ高速に実行する。 As shown in FIG. 15, the abnormality detection device 1 may include a CPU 21, a ROM 22, a RAM 23, an HDD 24, an input unit 25, a display unit 26, a communication I / F 27, and a system bus 28. The anomaly detection device 1 may also include an external memory. PC3 may also have the same configuration as in FIG.
The CPU (Central Processing Unit) 21 comprehensively controls the operation of the abnormality detection device 1, and controls each component (22 to 27) via the system bus 28, which is a data transmission path.
The abnormality detection device 1 may also include a GPU (Graphics Processing Unit). The GPU has higher computing capability than the CPU 21, and operating a plurality or a large number of GPUs in parallel provides higher processing performance, particularly for video processing applications that use machine learning as in the present embodiment. The GPU typically includes processors and shared memory. Each processor acquires data from the high-speed shared memory and executes a common program, thereby executing a large amount of the same kind of computation at high speed.

ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）２２は、ＣＰＵ２１が処理を実行するために必要な制御プログラム等を記憶する不揮発性メモリである。なお、当該プログラムは、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）１４、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）等の不揮発性メモリや着脱可能な記憶媒体（不図示）等の外部メモリに記憶されていてもよい。
ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）２３は、揮発性メモリであり、ＣＰＵ１１の主メモリ、ワークエリア等として機能する。すなわち、ＣＰＵ２１は、処理の実行に際してＲＯＭ２２から必要なプログラム等をＲＡＭ２３にロードし、当該プログラム等を実行することで各種の機能動作を実現する。 The ROM (Read Only Memory) 22 is a non-volatile memory that stores the control program and the like necessary for the CPU 21 to execute processing. The program may be stored in a non-volatile memory such as the HDD (Hard Disk Drive) 24 or an SSD (Solid State Drive), or in an external memory such as a removable storage medium (not shown).
The RAM (Random Access Memory) 23 is a volatile memory, and functions as the main memory, work area, and the like of the CPU 21. That is, when executing processing, the CPU 21 loads the necessary program and the like from the ROM 22 into the RAM 23, and realizes various functional operations by executing the program and the like.

ＨＤＤ２４は、例えば、ＣＰＵ２１がプログラムを用いた処理を行う際に必要な各種データや各種情報等を記憶している。また、ＨＤＤ２４には、例えば、ＣＰＵ２１がプログラム等を用いた処理を行うことにより得られた各種データや各種情報等が記憶される。
入力部２５は、キーボードやマウス等のポインティングデバイスにより構成される。
表示部２６は、液晶ディスプレイ（ＬＣＤ）等のモニターにより構成される。表示部２６は、異常シーン検出処理で使用される各種パラメータや、他の装置との通信で使用される通信パラメータ等をパラメータ調整装置１へ指示入力するためのユーザインタフェースであるＧＵＩ（ＧｒａｐｈｉｃａｌＵｓｅｒＩｎｔｅｒｆａｃｅ）を提供してよい。 The HDD 24 stores, for example, various data and various information necessary for the CPU 21 to perform processing using a program. Further, the HDD 24 stores, for example, various data and various information obtained by the CPU 21 performing processing using a program or the like.
The input unit 25 is composed of a keyboard and a pointing device such as a mouse.
The display unit 26 is composed of a monitor such as a liquid crystal display (LCD). The display unit 26 may provide a GUI (Graphical User Interface), which is a user interface for instructing and inputting, to the abnormality detection device 1, various parameters used in the abnormal scene detection process, communication parameters used in communication with other devices, and the like.

The communication I/F 27 is an interface that controls communication between the abnormality detection device 1 and external devices.
The communication I/F 27 provides an interface with a network and communicates with external devices via the network. Video, abnormal scene determination results, abnormal scene confirmation inputs, various parameters, and the like are transmitted to and received from external devices via the communication I/F 27. In the present embodiment, the communication I/F 27 may communicate via a wired LAN (Local Area Network) conforming to a communication standard such as Ethernet (registered trademark), or via a dedicated line. However, the network usable in this embodiment is not limited to these and may be a wireless network. Such a wireless network includes a wireless PAN (Personal Area Network) such as Bluetooth (registered trademark), ZigBee (registered trademark), or UWB (Ultra Wide Band). It also includes a wireless LAN (Local Area Network) such as Wi-Fi (Wireless Fidelity) (registered trademark) and a wireless MAN (Metropolitan Area Network) such as WiMAX (registered trademark). It further includes a wireless WAN (Wide Area Network) such as LTE/3G, 4G, or 5G. The network need only connect the devices so that they can communicate with each other; the communication standard, scale, and configuration are not limited to the above.

The functions of at least some of the elements of the abnormality detection device 1 shown in FIG. 1 can be realized by the CPU 21 executing a program. Alternatively, at least some of these functions may operate as dedicated hardware. In that case, the dedicated hardware operates under the control of the CPU 21.

Although specific embodiments have been described above, these embodiments are merely examples and are not intended to limit the scope of the present invention. The devices and methods described herein can be embodied in forms other than those described above. Omissions, substitutions, and modifications may also be made to the above embodiments as appropriate without departing from the scope of the present invention. Forms with such omissions, substitutions, and modifications are included in the scope of the claims and their equivalents, and fall within the technical scope of the present invention.

1 ... abnormality detection device, 2 ... learning data DB, 3 ... PC, 11 ... data acquisition unit, 12 ... feature extraction unit, 13 ... abnormal scene candidate detection unit, 14 ... abnormality determination device, 15 ... scene presentation unit, 16 ... emotion analysis unit, 21 ... CPU, 22 ... ROM, 23 ... RAM, 24 ... HDD, 25 ... input unit, 26 ... display unit, 27 ... communication I/F, 28 ... bus

Claims

An information processing apparatus comprising:
a video acquisition unit that acquires video data;
a feature extraction unit that extracts audio features and image features from the video data acquired by the video acquisition unit;
an abnormal scene candidate detection unit that detects an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction unit;
an abnormality determination device that determines, based on the audio features and the image features, whether the abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other; and
a scene presentation unit that, when the abnormality determination device determines that the abnormal scene candidate belongs to other, presents the abnormal scene candidate via a user interface and accepts, via the user interface, input of information to be added to the presented abnormal scene candidate.

The information processing apparatus according to claim 1, wherein the abnormal scene candidate detection unit detects a candidate for the abnormal scene from the video data by using unsupervised learning.

The information processing apparatus according to claim 1 or 2, wherein the abnormal scene candidate detection unit detects the abnormal scene candidate from the video data by directly isolating abnormal audio features without generating a model of the group of normal audio features.

The information processing apparatus according to claim 3, wherein the abnormal scene candidate detection unit isolates the abnormal audio features by calculating the path length of each audio feature in an isolation forest.
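As an illustration of the path-length idea in the claim above, the following is a minimal pure-Python sketch of an isolation forest. It is not the implementation described in this publication: the function names (`c_factor`, `build_tree`, `path_length`, `anomaly_scores`), the tree count, the sample size, and the toy two-dimensional "audio features" are all assumptions made for the example. Points that are separated from the rest after only a few random splits have short average path lengths and therefore scores closer to 1.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random dimension at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split further on this dimension
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Number of splits needed to reach a leaf, plus a correction for leaf size."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if point[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, point, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Return one score in (0, 1] per point; higher means easier to isolate."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size)) if sample_size > 1 else 0
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size) or 1.0
    return [
        2.0 ** (-(sum(path_length(t, p) for t in trees) / n_trees) / norm)
        for p in points
    ]


# Toy data (hypothetical): 40 "normal" feature vectors near the origin
# and one clearly separated vector appended at the end.
_data_rng = random.Random(1)
features = [(_data_rng.gauss(0, 0.1), _data_rng.gauss(0, 0.1)) for _ in range(40)]
features.append((5.0, 5.0))
scores = anomaly_scores(features)
```

No model of the normal group is fitted here; each feature vector is scored only by how quickly random splits separate it, which is the property the claim relies on. A threshold on `scores` (not specified in the claims) would then select the abnormal scene candidates.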

The information processing apparatus according to any one of claims 1 to 4, wherein the feature extraction unit extracts audio features represented by a Mel frequency spectrogram of audio data in the video data.

The information processing apparatus according to claim 5, wherein the feature extraction unit calculates Mel frequency cepstrum coefficients (MFCC) from the audio data and extracts the audio features by concatenating the calculated MFCC with the Mel frequency spectrogram.

The information processing apparatus according to any one of claims 1 to 6, wherein the scene presentation unit adds the information input via the user interface to the audio features and the image features, and stores the result in a storage device as learning data for the abnormality determination device.

The information processing apparatus according to claim 7, wherein the abnormal scene candidate detection unit causes the abnormality determination device to determine the abnormal scene candidate when the number of items of learning data stored in the storage device exceeds a predetermined threshold.

The information processing apparatus according to claim 7, wherein, when the number of items of learning data stored in the storage device is within a predetermined threshold, the abnormal scene candidate detection unit bypasses the determination by the abnormality determination device and causes the scene presentation unit to present the abnormal scene candidate.
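The routing described in the three preceding claims (storing operator input as learning data, using the abnormality determination device only once enough learning data exists, and otherwise presenting the candidate directly) can be sketched as follows. This is an illustrative sketch only: `route_candidate`, `add_operator_label`, the `min_samples` value, and the string labels are hypothetical names chosen for the example, not identifiers from this publication.

```python
def add_operator_label(learning_data, features, label):
    """Store operator-supplied information together with the features as learning data."""
    learning_data.append((features, label))


def route_candidate(candidate, learning_data, classify, min_samples=50):
    """Send a candidate to the classifier or straight to the operator.

    `classify` is any callable taking (candidate, learning_data) and
    returning "abnormal", "normal", or "other" (an assumed interface).
    """
    # Too little learning data: bypass the classifier and ask the operator.
    if len(learning_data) <= min_samples:
        return "present_to_operator"
    label = classify(candidate, learning_data)
    # An inconclusive result is also handed to the operator.
    return "present_to_operator" if label == "other" else label
```

As the operator labels more presented candidates, `learning_data` grows past `min_samples` and the classifier takes over, so the share of candidates needing manual confirmation should fall over time.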

The information processing apparatus according to any one of claims 1 to 9, wherein the abnormality determination device determines the abnormal scene candidate as other when, in a feature space in which the audio features and the image features are integrated, the difference between the number of abnormal samples and the number of normal samples located in the vicinity of the abnormal scene candidate is within a predetermined threshold.

The information processing apparatus according to any one of claims 1 to 10, wherein the abnormality determination device determines the abnormal scene candidate by the k-nearest neighbor method.
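A k-nearest-neighbor decision with a third "other" outcome, as described in the two preceding claims, might look like the sketch below. The function name `classify_candidate`, the values of `k` and `margin`, the Euclidean distance, and the toy two-dimensional samples are assumptions for illustration; the publication does not specify them here.

```python
import math
from collections import Counter


def classify_candidate(candidate, labeled_samples, k=5, margin=1):
    """Vote among the k nearest labeled samples in the integrated feature space.

    labeled_samples: list of (feature_vector, label) with label "abnormal" or "normal".
    Returns "other" when the vote difference is within `margin`.
    """
    neighbors = sorted(labeled_samples, key=lambda s: math.dist(s[0], candidate))[:k]
    votes = Counter(label for _, label in neighbors)
    diff = votes["abnormal"] - votes["normal"]
    if abs(diff) <= margin:
        return "other"  # too close to call; hand over to the operator
    return "abnormal" if diff > 0 else "normal"


# Toy labeled samples (hypothetical): a normal cluster, an abnormal cluster,
# and a mixed region between them.
samples = [
    ((0.0, 0.0), "normal"), ((0.0, 1.0), "normal"), ((1.0, 0.0), "normal"),
    ((1.0, 1.0), "normal"), ((0.5, 0.5), "normal"),
    ((10.0, 10.0), "abnormal"), ((10.0, 11.0), "abnormal"), ((11.0, 10.0), "abnormal"),
    ((11.0, 11.0), "abnormal"), ((10.5, 10.5), "abnormal"),
    ((5.0, 5.0), "normal"), ((6.0, 5.0), "normal"),
    ((5.0, 6.0), "abnormal"), ((6.0, 6.0), "abnormal"), ((5.5, 5.4), "abnormal"),
]
```

With these samples, a candidate inside either cluster receives a unanimous vote, while a candidate in the mixed region gets a 3-to-2 split and is returned as "other" for operator confirmation.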

The information processing apparatus according to any one of claims 1 to 11, further comprising an emotion analysis unit that analyzes, using supervised learning, facial emotions contained in the video data from the image features extracted by the feature extraction unit, and supplies the analyzed facial emotion features to the abnormality determination device.

The information processing apparatus according to claim 12, wherein, when the emotion analysis unit detects an abnormal scene candidate from the video data based on the analyzed facial emotions, the emotion analysis unit causes the abnormal scene candidate detection unit to execute detection of the abnormal scene candidate based on the audio features.

An information processing system comprising a server and at least one client device connected to the server via a network, wherein
the server comprises:
a video acquisition unit that acquires video data;
a feature extraction unit that extracts audio features and image features from the video data acquired by the video acquisition unit;
an abnormal scene candidate detection unit that detects an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction unit;
an abnormality determination device that determines, based on the audio features and the image features, whether the abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other;
a scene presentation unit that, when the abnormality determination device determines that the abnormal scene candidate belongs to other, presents the abnormal scene candidate via a user interface and accepts input of information to be added to the presented abnormal scene candidate; and
a transmission unit that transmits the abnormal scene candidate to the client device, and
the client device comprises:
a receiving unit that receives the abnormal scene candidate transmitted from the server;
the user interface, which presents the abnormal scene candidate received by the receiving unit and accepts input of information to be added to the presented abnormal scene candidate; and
a transmission unit that transmits, to the server, the information to be added to the abnormal scene candidate for which the user interface has accepted input.

An information processing method executed by an information processing apparatus, the method comprising:
a step of acquiring video data;
a step of extracting audio features and image features from the acquired video data;
a step of detecting, by unsupervised learning, an abnormal scene candidate from the video data based on the extracted audio features;
a step of determining, by an abnormality determination device and based on the audio features and the image features, whether the detected abnormal scene candidate is abnormal, normal, or other; and
a step of, when the abnormality determination device determines that the abnormal scene candidate belongs to other, presenting the abnormal scene candidate via a user interface and accepting, via the user interface, input of information to be added to the presented abnormal scene candidate.

An information processing program for causing a computer to execute information processing, the program causing the computer to execute processing comprising:
a video acquisition process of acquiring video data;
a feature extraction process of extracting audio features and image features from the video data acquired by the video acquisition process;
an abnormal scene candidate detection process of detecting an abnormal scene candidate from the video data based on the audio features extracted by the feature extraction process;
an abnormality determination process of determining, by an abnormality determination device and based on the audio features and the image features, whether the abnormal scene candidate detected by the abnormal scene candidate detection process is abnormal, normal, or other; and
a scene presentation process of, when the abnormality determination device determines that the abnormal scene candidate belongs to other, presenting the abnormal scene candidate via a user interface and accepting, via the user interface, input of information to be added to the presented abnormal scene candidate.