JP2022153360A

JP2022153360A - Information processing device, information processing method, and information processing program

Info

Publication number: JP2022153360A
Application number: JP2022099393A
Authority: JP
Inventors: ムハマドアクマル; Akmal Muhammad; 満中澤; Mitsuru Nakazawa
Original assignee: Rakuten Group Inc
Current assignee: Rakuten Group Inc
Priority date: 2020-07-30
Filing date: 2022-06-21
Publication date: 2022-10-12
Anticipated expiration: 2040-07-30
Also published as: JP7096296B2; JP2022026016A; JP7361163B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device capable of reducing the load of an operator and highly accurately detecting various abnormalities from a video, an information processing method, and an information processing program.

SOLUTION: An abnormality detecting device 1 comprises: a data acquisition unit which acquires video data; a feature extracting unit which extracts audio features and image features from the video data acquired by the data acquisition unit; an abnormal scene candidate detecting unit which detects a candidate for an abnormal scene from the video data on the basis of the audio features extracted by the feature extracting unit; an abnormality determination unit which determines whether the candidate for the abnormal scene detected by the abnormal scene candidate detecting unit is abnormal, normal, or other on the basis of the audio and image features; and a scene presentation unit which, when the abnormality determination unit determines that the candidate belongs to the other category, presents the candidate for the abnormal scene via a user interface and receives, via the user interface, an input of information to be added to the presented candidate.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device, an information processing method, and a program, and more particularly to a technique for analyzing video to detect abnormalities.

Recent video distribution services enable real-time distribution not only of video content created by content providers but also of video content created by general users.
In such a video distribution service, the distributed video must be monitored so that the distributed video content does not include so-called abnormal scenes that are inappropriate for viewing, and measures such as deleting the abnormal scene, stopping distribution, or deleting the distributing account must be taken so that a detected abnormal scene is not viewed by mistake. Such abnormal scenes include, for example, scenes that are inappropriate for family viewing (Non-Family-Safe: NFS), such as violent scenes and scenes not intended for children.

Patent Literature 1 discloses an elevator monitoring device that detects abnormal behavior of passengers from image data captured by a security camera installed in an elevator car.
Specifically, the monitoring device of Patent Literature 1 sets a violence determination threshold according to a predetermined frequency band extracted from the result of frequency analysis of passengers' voice data collected by an intercom installed in the car, and statistically calculates the amount of variation in the passengers' movement from the image data captured by the security camera. The monitoring device further compares the calculated amount of variation with the violence determination threshold and, when the amount of variation is equal to or greater than the threshold, regards the passengers' movement as abnormal behavior and determines that violent behavior is occurring. In this way, abnormal behavior is determined from the image data even when the passengers can move only slightly.

Patent Literature 1: JP 2013-63810 A

However, in the technique of Patent Literature 1, the detectable abnormality is limited to violent behavior of passengers in an elevator, so it is difficult to appropriately detect the various abnormal scenes that can be included in diverse video content.

In particular, video distribution services are segmented in many ways according to the age group, preferences, and so on of their main target users, and the range of abnormal scenes that are inappropriate for viewing differs from service to service. Furthermore, because abnormal scenes usually appear only rarely in video content, they are not well suited to being compiled into a general-purpose database of the training data required for supervised machine learning. On the other hand, attempting to detect abnormal scenes from video content by unsupervised machine learning lowers the detection accuracy.

Video created by a content provider often carries tag information added by the content provider, such as whether the distributed video content includes violent scenes, whether it is content for children, or whether there is an age restriction, and it is also possible to have the content provider tag the presence of abnormal scenes when the content is created.
In contrast, video content created by general users, which has been increasing in recent years, often carries no such tag information for abnormal scenes, or, even when it does, the tagging may not be appropriate for the video distribution service concerned.

For this reason, in some video distribution services, an operator has conventionally monitored the distributed video content at all times and, on finding an abnormal scene in the video content, has set an age restriction on the video content or stopped its distribution. This has increased the time and workload of the operator who monitors the video, and also the psychological burden. At the same time, monitoring video content manually carries the risk that abnormal scenes are overlooked.

The present invention has been made to solve the above problems, and an object thereof is to provide an information processing device, an information processing method, and a program capable of detecting various abnormalities from video with high accuracy while reducing the load on the operator.

In order to solve the above problems, one aspect of an information processing device according to the present invention comprises: a video acquisition unit that acquires video data; a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data; an abnormal scene candidate detection unit that detects abnormal scene candidates from the video data based on the audio features extracted by the feature extraction unit; an abnormality determiner that determines, based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other; and a scene presentation unit that, when the abnormality determiner determines that the abnormal scene candidate belongs to the other category, presents the abnormal scene candidate via a user interface and accepts, via the user interface, input of information to be added to the presented abnormal scene candidate.

The abnormal scene candidate detection unit may detect the abnormal scene candidates from the video data using unsupervised learning.

The abnormal scene candidate detection unit may detect the abnormal scene candidates from the video data by directly isolating abnormal audio features, without generating a model of the group of normal audio features.

The abnormal scene candidate detection unit may isolate the abnormal audio features by calculating the path length of each audio feature in an Isolation Forest.
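The path-length idea can be illustrated with a minimal, standard-library sketch of an Isolation Forest. It is not the implementation of the abnormal scene candidate detection unit: the function names, the tree count, the subsample size, and the score normalization (taken from the general Isolation Forest literature) are all assumptions made for illustration. Features that are isolated after only a few random splits have short average path lengths and therefore scores close to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random dimension at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on this dimension; stop here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([p for p in points if p[dim] < split], depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth, rng),
    }

def path_length(tree, x, depth=0):
    """Number of splits needed to reach x's leaf, plus a correction for unsplit leaves."""
    if "dim" not in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Return one score in (0, 1] per point; shorter average paths give higher scores."""
    if len(points) < 2:
        raise ValueError("need at least two feature vectors")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(t, x) for t in trees) / n_trees / norm)
            for x in points]
```

In this sketch each point stands in for one audio feature vector; a candidate would be any feature whose score exceeds a cut-off chosen by the operator, which the text does not specify.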

The feature extraction unit may extract audio features represented by a Mel-frequency spectrogram of the audio data in the video data.

The feature extraction unit may calculate Mel-Frequency Cepstrum Coefficients (MFCC) from the audio data and concatenate the calculated MFCC with the Mel-frequency spectrogram to extract the audio features.
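As a rough illustration of the concatenation, the sketch below builds a per-frame feature from Mel filterbank energies that are assumed to be computed already. The MFCCs are obtained in the usual way, as a discrete cosine transform of the log-Mel energies. The function names, the number of coefficients, the use of log energies, and the ordering (spectrum first, MFCC second) are assumptions, since the text does not fix them.

```python
import math

def dct2(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]

def audio_feature(mel_energies, n_mfcc=13):
    """Concatenate one frame's log-Mel spectrum with its first n_mfcc MFCCs.

    mel_energies: non-negative Mel filterbank energies for a single frame
    (hypothetical input; the filterbank itself is outside this sketch).
    """
    log_mel = [math.log(e + 1e-10) for e in mel_energies]  # small offset avoids log(0)
    mfcc = dct2(log_mel)[:n_mfcc]
    return log_mel + mfcc
```

Applying `audio_feature` to every frame of a segment yields a matrix whose rows hold the spectrogram part followed by the cepstral part.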

The scene presentation unit may add the information input via the user interface to the audio features and the image features and store the result in a storage device as learning data for the abnormality determiner.

The abnormal scene candidate detection unit may cause the abnormality determiner to determine the abnormal scene candidates when the number of pieces of learning data stored in the storage device exceeds a predetermined threshold.

When the number of pieces of learning data stored in the storage device is within the predetermined threshold, the abnormal scene candidate detection unit may bypass the determination by the abnormality determiner and cause the scene presentation unit to present the abnormal scene candidates.
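The switch between the two paths can be sketched as a single comparison. The function name, the return values, and the threshold value used in the example are hypothetical; only the comparison itself follows the text.

```python
def route_candidate(num_learning_samples, threshold):
    """Decide where a newly detected abnormal scene candidate is sent.

    Returns "determiner" when enough learning data has accumulated, and
    "operator" (bypassing the determiner) while the count is within the threshold.
    """
    if num_learning_samples > threshold:
        return "determiner"
    return "operator"
```

For example, with a hypothetical threshold of 500, `route_candidate(120, 500)` sends the candidate straight to the operator, and `route_candidate(900, 500)` sends it to the abnormality determiner.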

The abnormality determiner may determine an abnormal scene candidate to be other when, in a feature space in which the audio features and the image features are integrated, the difference between the number of abnormal samples and the number of normal samples located in the neighborhood of the abnormal scene candidate is within a predetermined threshold.
The abnormality determiner may determine the abnormal scene candidates by the k-nearest neighbor method.
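A three-way k-nearest-neighbor decision of this kind can be sketched as follows. The Euclidean distance, the values of `k` and `margin`, and the label strings are assumptions chosen for illustration; the text only specifies that a small difference between the abnormal and normal neighbor counts leads to the other category.

```python
import math
from collections import Counter

def knn_three_way(candidate, labeled_samples, k=5, margin=1):
    """Classify a candidate as "abnormal", "normal", or "other".

    candidate: feature vector in the integrated audio and image feature space.
    labeled_samples: list of (feature_vector, label) pairs, where label is
    "abnormal" or "normal" (hypothetical layout of the learning data).
    """
    neighbours = sorted(labeled_samples,
                        key=lambda s: math.dist(candidate, s[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    diff = votes["abnormal"] - votes["normal"]
    if abs(diff) <= margin:  # neighbours disagree too much: defer to the operator
        return "other"
    return "abnormal" if diff > 0 else "normal"
```

A larger `margin` sends more candidates to the operator, and a smaller one lets the determiner decide more cases on its own.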

The information processing device may further comprise an emotion analysis unit that analyzes, using supervised learning, the emotions of faces included in the video data from the image features extracted by the feature extraction unit, and supplies the analyzed facial emotion features to the abnormality determiner.

When the emotion analysis unit detects an abnormal scene candidate from the video data based on the analyzed facial emotions, it may cause the abnormal scene candidate detection unit to execute abnormal scene detection based on the audio features.

One aspect of an information processing system according to the present invention is an information processing system comprising a server and at least one client device connected to the server via a network. The server has: a video acquisition unit that acquires video data; a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data; an abnormal scene candidate detection unit that detects abnormal scene candidates from the video data based on the audio features extracted by the feature extraction unit; an abnormality determiner that determines, based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other; a scene presentation unit that, when the abnormality determiner determines that the abnormal scene candidate belongs to the other category, presents the abnormal scene candidate via a user interface and accepts, via the user interface, input of information to be added to the presented abnormal scene candidate; and a transmission unit that transmits the abnormal scene candidate to the client device. The client device has: a reception unit that receives the abnormal scene candidate transmitted from the server; the user interface, which presents the abnormal scene candidate received by the reception unit and accepts input of information to be added to the presented abnormal scene candidate; and a transmission unit that transmits, to the server, the information to be added to the abnormal scene candidate whose input the user interface has accepted.

One aspect of an information processing method according to the present invention is an information processing method executed by an information processing device, comprising the steps of: acquiring video data; extracting audio features from the acquired video data and extracting image features from the video data; detecting, by unsupervised learning, abnormal scene candidates from the video data based on the extracted audio features; determining, by an abnormality determiner and based on the audio features and the image features, whether a detected abnormal scene candidate is abnormal, normal, or other; and, when the abnormality determiner determines that the abnormal scene candidate belongs to the other category, presenting the abnormal scene candidate via a user interface and accepting, via the user interface, input of information to be added to the presented abnormal scene candidate.

One aspect of an information processing program according to the present invention is an information processing program for causing a computer to execute information processing, the program causing the computer to execute processing including: video acquisition processing for acquiring video data; feature extraction processing for extracting audio features from the video data acquired by the video acquisition processing and extracting image features from the video data; abnormal scene candidate detection processing for detecting abnormal scene candidates from the video data based on the audio features extracted by the feature extraction processing; abnormality determination processing for determining, by an abnormality determiner and based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection processing is abnormal, normal, or other; and input/output processing for, when the abnormality determiner determines that the abnormal scene candidate belongs to the other category, presenting the abnormal scene candidate via a user interface and accepting, via the user interface, input of information to be added to the presented abnormal scene candidate.

According to the present invention, various abnormalities can be detected from video with high accuracy while reducing the load on the operator.
The objects, aspects, and effects of the present invention described above, as well as those not described above, will be understood by those skilled in the art from the following description of embodiments for carrying out the invention, with reference to the accompanying drawings and the claims.

FIG. 1 is a block diagram showing an example of the functional configuration of an abnormality detection device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an example of the procedure of the abnormal scene detection processing executed by the abnormality detection device according to Embodiment 1.
FIG. 3 is a diagram showing an example of segments of audio data that the feature extraction unit of the abnormality detection device separates from video data.
FIG. 4 is a diagram showing an example of Mel spectrogram features of an abnormal scene candidate that the feature extraction unit of the abnormality detection device extracts from audio data.
FIG. 5 is a diagram showing an example of Mel spectrogram features of a normal scene that the feature extraction unit of the abnormality detection device extracts from audio data.
FIG. 6 is a diagram showing an example of audio features, in which the Mel-frequency cepstrum and the Mel spectrogram are concatenated, that the feature extraction unit of the abnormality detection device outputs to the abnormal scene candidate detection unit.
FIG. 7 is a diagram explaining the isolation of abnormal feature points in the Isolation Forest that the abnormal scene candidate detection unit of the abnormality detection device uses in unsupervised learning.
FIG. 8 is a schematic diagram showing an example of a decision tree of the Isolation Forest that the abnormal scene candidate detection unit of the abnormality detection device uses in unsupervised learning.
FIG. 9 is a schematic diagram explaining an example of the k-nearest neighbor method that the abnormality determiner of the abnormality detection device uses to determine whether an abnormal scene candidate is abnormal.
FIG. 10 is a flowchart showing an example of the procedure of the abnormal scene detection processing executed by the abnormality detection device according to a modification of Embodiment 1.
FIG. 11 is a block diagram showing an example of the functional configuration of an abnormality detection device according to Embodiment 2 of the present invention.
FIG. 12 is a flowchart showing an example of the procedure of the abnormal scene detection processing executed by the abnormality detection device according to Embodiment 2.
FIG. 13 is a flowchart showing an example of the procedure of the abnormal scene detection processing executed by the abnormality detection device according to a modification of Embodiment 2.
FIG. 14 is a schematic diagram showing an example of a facial emotion recognition result that the emotion analysis unit of the abnormality detection device according to Embodiment 2 recognizes from the image data of video data.
FIG. 15 is a block diagram showing an example of the hardware configuration of the abnormality detection device according to each embodiment of the present invention.

Embodiments for carrying out the present invention will be described in detail below with reference to the accompanying drawings. Among the components disclosed below, those having the same function are denoted by the same reference numerals, and their description is omitted. The embodiments disclosed below are examples of means for realizing the present invention and should be modified or changed as appropriate according to the configuration of the device to which the present invention is applied and various conditions; the present invention is not limited to the following embodiments. In addition, not all combinations of the features described in the embodiments are essential to the solution of the present invention.

(Embodiment 1)
The abnormality detection device according to the present embodiment extracts features of audio data and of image data from video data and, using these multimodal features, detects abnormal scenes from the video data semi-automatically in multiple stages.
The following describes an example in which the abnormality detection device first detects abnormal scene candidates by unsupervised learning based on the features of audio data extracted from video data streamed in real time; next uses an abnormality determiner to classify each detected candidate as an abnormal scene, a normal scene, or a scene requiring the operator's judgment; presents the video data of the candidates determined to require the operator's judgment; and adds the operator's confirmation input as to whether the scene is abnormal to the features of the video data, accumulating the result as learning data for the abnormality determiner.

However, the present embodiment is not limited to this. For example, the abnormality detection device may detect abnormal scenes after the fact from recorded video data. In addition, for example, abnormal scene detection may be variably controlled according to the number of pieces of accumulated learning data: the video data of all detected abnormal scene candidates may be presented to the operator, bypassing the abnormality determiner, or the abnormality determiner may detect abnormal scenes automatically based on the audio and image features of the video data of the detected candidates. In the latter case, the threshold for abnormality determination may be set relatively low, and only abnormal scenes near the threshold may be presented to the operator for confirmation as appropriate.

<Functional configuration of abnormality detection device>
FIG. 1 is a block diagram showing an example of the functional configuration of the abnormality detection device 1 according to the present embodiment.
The abnormality detection device 1 shown in FIG. 1 includes a data acquisition unit 11, a feature extraction unit 12, an abnormal scene candidate detection unit 13, an abnormality determiner 14, and a scene presentation unit 15.
The abnormality detection device 1 may be communicably connected via a network to a client device 3 constituted by a PC (Personal Computer) or the like. In this case, the abnormality detection device 1 is implemented on a server, and the client device 3 may provide the user interface through which the abnormality detection device 1 inputs and outputs information to and from the outside, and may also include some or all of the components 11 to 15 of the abnormality detection device 1, including the scene presentation unit 15.

The data acquisition unit 11 acquires video data streamed in real time and supplies the acquired video data to the feature extraction unit 12. The video data is moving image data including audio data and image (visual) data, but the data acquisition unit 11 may instead acquire still image data accompanied by audio data and supply it to the feature extraction unit 12.
Instead of streamed video data, the data acquisition unit 11 may acquire video data recorded in advance in a non-volatile storage device, such as an HDD, of the abnormality detection device 1, or may receive recorded video data from a remote device via a communication I/F.
The data acquisition unit 11 also accepts input of the various parameters required to execute the abnormal scene detection processing in the abnormality detection device 1. The data acquisition unit 11 may accept input of the various parameters via the user interface of the client device 3 communicably connected to the abnormality detection device 1.

The feature extraction unit 12 separates audio data from the video data supplied from the data acquisition unit 11 and extracts audio features from the separated audio data.
The feature extraction unit 12 also separates image data from the video data supplied from the data acquisition unit 11 and extracts image features from the separated image data.
The feature extraction unit 12 supplies the extracted audio features and image features, together with the video data, to the abnormal scene candidate detection unit 13.

The abnormal scene candidate detection unit 13 detects abnormal scene candidates from the video data based on the audio features supplied from the feature extraction unit 12 and supplies the detected abnormal scene candidates to the abnormality determiner 14. The abnormal scene candidate detection unit 13 may also supply the detected abnormal scene candidates to the scene presentation unit 15, bypassing the abnormality determiner 14.

Abnormal scenes include, but are not limited to, scenes that are inappropriate for family viewing (Non-Family-Safe: NFS), such as violent scenes and scenes not intended for children. Abnormal scenes broadly include scenes or content that the terms or rules of each video distribution service stipulate should not be distributed via that service, as well as other scenes or content that the operator has ultimately determined manually, from the audio and images of the video data, should not be distributed.
Details of the feature extraction processing executed by the feature extraction unit 12 and the abnormal scene candidate detection processing executed by the abnormal scene candidate detection unit 13 will be described later with reference to FIGS. 3 to 8.

In the present embodiment, the abnormal scene candidate detection unit 13 detects abnormal scene candidates by classifying audio features through unsupervised learning. In streamed video data, abnormal scenes appear only rarely, and the criteria for what should be treated as an abnormal scene vary from one video distribution service to another. Consequently, when a new service is launched or the criteria are changed, it is difficult to prepare appropriate training data in advance, and difficult to achieve highly accurate classification by supervised learning. In the present embodiment, classifying audio features by unsupervised learning from only the audio data of the video data makes it possible to detect abnormal scene candidates with high accuracy and low load even with a small number of samples.

The abnormality determiner 14 receives the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 and classifies the video data of each input candidate as an abnormal scene, a normal scene, or a scene requiring the operator's judgment. The abnormality determiner 14 stores the candidates classified as either an abnormal scene or a normal scene in the learning data DB (database) 2, with the classification result attached. The abnormality determiner 14 supplies the candidates classified as scenes requiring the operator's judgment to the scene presentation unit 15.
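The handling of the three classification results can be sketched as below. Plain Python lists stand in for the learning data DB 2 and for the queue of scenes shown to the operator, and the function names and label strings are hypothetical. The second function reflects the earlier statement that the operator's input is stored as learning data.

```python
def dispatch(candidate, label, learning_db, review_queue):
    """Store confidently classified candidates; forward the rest to the operator."""
    if label in ("abnormal", "normal"):
        learning_db.append((candidate, label))  # classification result attached
    else:
        review_queue.append(candidate)          # scene requiring operator judgment

def record_operator_label(candidate, operator_label, learning_db, review_queue):
    """Move a reviewed candidate into the learning data with the operator's label."""
    review_queue.remove(candidate)
    learning_db.append((candidate, operator_label))
```

Each operator decision in this sketch enlarges the labeled set available to the determiner, so that fewer candidates need manual review over time.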

In this embodiment, the abnormality determiner 14 uses a feature space that integrates the audio features and image features extracted from the video data by the feature extraction unit 12 and, by supervised learning, classifies each input abnormal scene candidate into one of three classes: abnormal scene, normal scene, or scene requiring the operator's judgment. The abnormality determiner 14 may perform learning using the classification results of abnormal scene candidates accumulated in the learning data DB 2 as teacher data.
The details of the abnormal scene determination process executed by the abnormality determiner 14 will be described later with reference to FIG. 9.

The scene presentation unit 15 presents the abnormal scene candidates that the abnormality determiner 14 has classified as scenes requiring the operator's judgment to the outside via a display device or the like, and accepts the operator's confirmation input. The abnormality detection device 1 may also present the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 to the outside and accept the operator's confirmation input.
The abnormality detection device 1 may use its own display device or the like as the user interface, or it may present abnormal scene candidates and accept the operator's confirmation input via the user interface of a client device 3 communicably connected to the abnormality detection device 1.
In the latter case, the abnormality detection device 1 may further include a transmitting/receiving unit that transmits the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 to the client device 3 and receives the operator's confirmation input transmitted from the client device 3. The client device 3 may include a transmitting/receiving unit that receives the abnormal scene candidates transmitted from the abnormality detection device 1 and transmits to the abnormality detection device 1 the operator's confirmation input for the candidates presented via the user interface.

The operator compares the images of the video data of an abnormal scene candidate presented by the scene presentation unit 15 against its audio, confirms whether the candidate is an abnormal scene or a normal scene, and inputs the confirmation result to the scene presentation unit 15. The operator can take a predetermined measure against a candidate confirmed to be an abnormal scene. For example, the confirmed abnormal scene may be deleted from the video data to be distributed, distribution of the video data may be stopped, or the account of the user who distributed the video data may be suspended.
The scene presentation unit 15 attaches the confirmation result entered by the operator (an annotation of abnormal scene or normal scene) to the audio features and image features of the presented candidate and stores them in the learning data DB 2 as learning data.

As a specific example, suppose that the abnormal scene candidate detection unit 13 detects a sound resembling a gunshot from the audio features extracted from the video data and detects the scene containing that sound as an abnormal scene candidate. In this case, the operator only needs to check the images of the candidate to confirm whether it is an abnormal scene or a normal scene.
For example, if the images of the candidate contain a gun or another violent or cruel object, the candidate can be confirmed as an abnormal scene; if they contain an object such as outdoor fireworks, it can be confirmed as a normal scene.

In this way, this embodiment uses multimodal information, namely the audio and images of the video data, to detect abnormal scenes semi-automatically in multiple stages. Specifically, abnormal scene candidates are detected automatically from the audio of the video data, the detected candidates are presented to the operator, and the operator confirms from the images of each candidate whether it is an abnormal scene or a normal scene. This greatly reduces the operator's load in monitoring distributed video.

<Procedure of Abnormal Scene Detection Processing>
FIG. 2 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device 1 according to this embodiment.
Each step in FIG. 2 is implemented by the CPU reading and executing a program stored in a storage device, such as an HDD, of the abnormality detection device 1. At least part of the flowchart shown in FIG. 2 may instead be implemented in hardware. In that case, for example, a predetermined compiler may be used to generate a dedicated circuit automatically on an FPGA (Field Programmable Gate Array) from the program that implements each step. A gate array circuit may also be formed in the same manner as the FPGA and implemented as hardware, or the steps may be implemented by an ASIC (Application Specific Integrated Circuit).

In S1, the feature extraction unit 12 of the abnormality detection device 1 separates the video data supplied from the data acquisition unit 11 into audio data and image data and extracts audio features and image features from them, respectively.
FIG. 3 shows an example of the audio signal waveform of audio data (for example, in wav format) that the feature extraction unit 12 has, as preprocessing for feature extraction, separated from the video data supplied from the data acquisition unit 11 (for example, in a multimedia format such as mp4 or m3u8) and segmented, for example, into 1-second units. In FIG. 3, the vertical axis indicates the amplitude of the audio and the horizontal axis indicates time.
The feature extraction unit 12 may normalize the separated audio data, for example by upsampling as appropriate, to suit the sound sources and the like that can be expected in the distributed video data.

In this embodiment, the feature extraction unit 12 may extract audio features from the audio data shown in FIG. 3 as, for example, audio features expressed as a mel spectrogram (mel frequency spectrogram).
A spectrogram is the result of passing an audio signal through a window function and computing its frequency spectrum. It is represented as a three-dimensional graph whose X, Y, and Z axes are time, frequency, and the strength (amplitude) of the signal components, respectively. A spectrogram corresponds to a so-called voiceprint, in which the spectra of the individual audio data segments (frames), obtained by extracting the frequency and amplitude components of the audio signal by, for example, a Fourier transform, are arranged along the time axis. A mel spectrogram is a spectrogram converted with the mel scale, which applies a weighting that takes human pitch perception (frequency perception characteristics) into account.
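The windowing and per-frame Fourier transform described above can be illustrated with a minimal pure-Python sketch. The Hann window, the frame and hop lengths, and the `hz_to_mel` formula (one widely used convention for the mel scale) are assumptions chosen for illustration; the description does not specify them.

```python
import cmath
import math


def hann(n_points):
    """Symmetric Hann window (an assumed choice of window function)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n_points - 1))
            for i in range(n_points)]


def magnitude_spectrum(frame):
    """Amplitude spectrum of one windowed frame via a direct DFT."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len=64, hop=32):
    """Spectra of successive frames arranged along the time axis."""
    return [magnitude_spectrum(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


def hz_to_mel(freq_hz):
    """Hz to mel conversion (a common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


if __name__ == "__main__":
    # A pure tone that falls exactly on DFT bin 8 of a 64-sample frame.
    tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
    frames = spectrogram(tone)
    print(len(frames), "frames,", len(frames[0]), "frequency bins each")
```

Each inner list is one column of the time-frequency graph; a mel spectrogram would additionally re-map the frequency axis with `hz_to_mel`.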

FIG. 4 shows an example of audio features, expressed as a mel spectrogram, that the feature extraction unit 12 has extracted from the audio data separated from the video data and that the abnormal scene candidate detection unit 13 detects as an abnormal scene candidate. In FIGS. 4 and 5, the X axis indicates time, the Y axis indicates frequency, and the Z axis indicates amplitude, that is, the strength of the audio signal. In FIGS. 4 and 5, cells with higher signal strength are drawn with a lighter pattern and cells with lower signal strength with a darker pattern.
The spectrogram shown in FIG. 4 exhibits a pattern in which the volume is high, the distribution of signal strength is peaked, and the audio signal decays within a short time. FIG. 4 is an example of the spectrogram of a gunshot, but sounds such as a person screaming or something being struck are also expected to exhibit the same or a similar kind of pattern.

FIG. 5, by contrast, shows an example of audio features, expressed as a mel spectrogram, that the feature extraction unit 12 has extracted from an audio data segment separated from the video data and that the abnormal scene candidate detection unit 13 determines to be a normal scene (that is, does not detect as an abnormal scene candidate). The spectrogram shown in FIG. 5 exhibits a pattern of low or medium volume in which the distribution of signal strength is uniform along the time axis.
The feature extraction unit 12 may separate the audio data into foreground audio (for example, human speech or screams) and background audio (for example, music or crowd noise) and supply the spectrogram of either one to the abnormal scene candidate detection unit 13 as the audio features. In this case, for example, audio that appears only temporarily on the time axis and is not repeated can be separated as foreground audio.

In this embodiment, the feature extraction unit 12 may further calculate mel frequency cepstrum coefficients (MFCC) from the audio data and concatenate the calculated MFCC with the mel spectrogram shown in FIG. 4 or FIG. 5 to extract the audio features.
A cepstrum is obtained by taking the logarithm of the amplitude spectrum produced by Fourier-transforming an audio signal to obtain a logarithmic spectrum, and then applying a Fourier transform to that logarithmic spectrum again. Obtaining the cepstrum of the logarithmic spectrum (the logarithmic cepstrum) makes it possible to separate the rapidly fluctuating sound source component from the vocal tract characteristic component with which it was convolved.

The low-order components of the logarithmic cepstrum represent the spectral envelope of the audio (the frequency characteristics derived from the vocal tract). They make it possible to remove the pitch component, which varies greatly between individuals, and to extract only the acoustic characteristics of the vocal tract, which are important for identifying phonemes. MFCC is the feature obtained by applying the mel scale to these low-order components of the logarithmic cepstrum so that they are weighted according to human frequency perception characteristics.

Specifically, the amplitude spectrum is passed through a bank of filters spaced equally on the mel scale to extract the spectral components of each band. The amplitude spectrum within each band is summed, which compresses the amplitude spectrum into a smaller number of dimensions, and the logarithm of this compressed amplitude spectrum is taken to obtain a logarithmic amplitude spectrum.
The mel frequency spectrum obtained in this way (the logarithmic amplitude spectrum compressed on the mel scale) is converted to a mel frequency cepstrum by applying a Fourier transform (for example, a discrete Fourier transform (DFT)). The MFCC are then obtained by taking the low-order components of the mel frequency cepstrum (the vocal tract components of the spectrum) and normalizing them as necessary.
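The steps above (mel filter bank, per-band sum, logarithm, second transform, low-order components) can be sketched as follows. This is an illustrative sketch, not the embodiment's implementation: the triangular filter shape, the filter and coefficient counts, and the Hz/mel formulas are assumptions, and the second transform is written as a DCT-II, a common practical substitute for the Fourier transform named in the text.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge points, expressed as fractional spectrum-bin positions.
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(magnitude, sample_rate, n_filters=20, n_coeffs=12):
    """MFCC of one amplitude-spectrum frame (illustrative parameters)."""
    bank = mel_filterbank(n_filters, len(magnitude), sample_rate)
    # Weighted sum of the amplitude spectrum within each mel band.
    band_sums = [sum(w * a for w, a in zip(row, magnitude)) for row in bank]
    # Logarithm of the compressed spectrum; small offset avoids log(0).
    log_mel = [math.log(e + 1e-10) for e in band_sums]
    # DCT-II, keeping only the first n_coeffs (low-order) components.
    return [sum(log_mel[j] * math.cos(math.pi * k * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for k in range(n_coeffs)]


if __name__ == "__main__":
    flat_spectrum = [1.0] * 129          # hypothetical 129-bin frame
    print(mfcc(flat_spectrum, sample_rate=16000)[:3])
```

Keeping only the first `n_coeffs` outputs corresponds to retaining the low-order cepstral components that describe the spectral envelope.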

FIG. 6 shows an example of audio features formed by concatenating MFCC 61, obtained by slicing per unit time (for example, 1 second) and taking, for example, the mean value, with a mel spectrogram 62, obtained by taking the mean amplitude of the mel spectrum along the time axis.
Supplying audio features such as those shown in FIG. 6 to the abnormal scene candidate detection unit 13 for detecting abnormal scene candidates makes it possible to compress the audio features appropriately without losing the frequency component information of the audio data, in particular the frequency components that matter to human hearing.
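A minimal sketch of how the concatenated feature of FIG. 6 might be assembled for one segment is shown below. The function names and the input layout (one list of values per frame) are hypothetical; only the idea of averaging over time and concatenating comes from the description.

```python
def column_means(frames):
    """Mean of each column over all frames (average along the time axis)."""
    n_frames = len(frames)
    return [sum(frame[i] for frame in frames) / n_frames
            for i in range(len(frames[0]))]


def segment_feature(mfcc_frames, mel_frames):
    """Concatenate time-averaged MFCC with the time-averaged mel spectrum."""
    return column_means(mfcc_frames) + column_means(mel_frames)


if __name__ == "__main__":
    # Two frames of 2 MFCC values and 3 mel-band amplitudes (toy numbers).
    print(segment_feature([[1, 3], [3, 5]], [[0, 2, 4], [2, 4, 6]]))
```

The result is a single fixed-length vector per segment, which is the form an outlier detector or a nearest-neighbor classifier typically expects.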

The audio features that the feature extraction unit 12 extracts from the audio data of the video data are not limited to the representation shown in FIG. 6; the feature extraction unit 12 may extract audio features by any other method and in any other representation.
The feature extraction unit 12 may also extract image features from all or part of the image data separated from the video data using, for example, a convolutional neural network (CNN). The method by which the feature extraction unit 12 extracts image features from the image data is not limited to a CNN, however, and any method can be used.

Returning to FIG. 2, in S2 the abnormal scene candidate detection unit 13 of the abnormality detection device 1 detects abnormal scene candidates by classifying, through unsupervised learning, the audio features of the audio data separated from the video data, which are supplied from the feature extraction unit 12.
The abnormal scene candidate detection unit 13 uses, for example, an isolation forest (IF) to isolate audio features having abnormal values in the feature space of the audio features, and detects the video scenes corresponding to the isolated abnormal audio features as abnormal scene candidates.

An isolation forest is a form of unsupervised learning that isolates features having abnormal values directly, without modeling (profiling) the group of features having normal values to build a normal model. Because the algorithm is fast and consumes little memory, it is suited to monitoring video distributed in real time, and because no normal model has to be built, its accuracy is unlikely to degrade even with a small number of samples.

FIG. 7 is a schematic diagram illustrating the algorithm by which an isolation forest isolates features having abnormal values in the feature space. The isolation forest repeatedly generates partitions, shown by the broken lines in FIG. 7, until each feature point placed in the feature space is separated from all other feature points. In FIG. 7, the leftmost and rightmost feature points each require fewer partitions than the feature points located near the center. Because the leftmost and rightmost feature points can each be separated from all other feature points by a single partition, each can be detected as a feature point having an abnormal value.
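The partitioning idea of FIG. 7 can be illustrated with a small pure-Python sketch that counts how many random partitions are needed to isolate a given point. This is a simplified illustration under assumed parameters (`n_trees`, `max_depth`, uniform random splits on a randomly chosen axis), not the embodiment's implementation.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random partitions needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))             # pick a random feature axis
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                                # cannot split on this axis
        return depth
    split = rng.uniform(lo, hi)                 # random partition position
    # Keep only the points on the same side of the partition as `point`.
    same_side = [row for row in data
                 if (row[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average path length over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


if __name__ == "__main__":
    cluster = [(i * 0.01, j * 0.01) for i in range(5) for j in range(5)]
    outlier = (10.0, 10.0)
    data = cluster + [outlier]
    print("outlier:", mean_path_length(outlier, data))
    print("inlier: ", mean_path_length((0.02, 0.02), data))
```

A point far from the cluster is usually cut off by the first partition, while a point in the middle of the cluster needs several, which is the property the abnormality score relies on.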

FIG. 8 is a conceptual diagram that represents the repeated partition generation of FIG. 7 as a binary tree structure (isolation tree). The number of partitions in FIG. 7 can be expressed in FIG. 8 as the path length from the root node of the tree to the terminal node.
The abnormal scene candidate detection unit 13 calculates an abnormality (anomaly) score for each feature point from the path length of the feature point of each audio feature, using Formula 1 below.

(Formula 1)

$$S(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

Here, E(h(x)) in the exponent on the right-hand side is the average path length, and c(n) is a normalization factor that depends on the number of instances n in the data set. The abnormality score S(x, n) of each feature point approaches 1 as the average path length becomes shorter and approaches 0 as it becomes longer.

In FIG. 8, the bar on the left corresponds to abnormality score values from 0 to 1, running from its lower end to its upper end. The abnormal scene candidate detection unit 13 compares the abnormality score S(x, n) of each feature point with a threshold θ set between 0 and 1. It determines the audio feature of a feature point whose score S(x, n) exceeds θ to be an abnormal value (outlier), and the audio feature of a feature point whose score S(x, n) is at or below θ to be a normal value.
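The score and threshold comparison can be sketched as below, assuming the score takes the standard isolation forest form, 2 raised to the power of minus E(h(x)) divided by c(n), which matches the behavior described above. The expression used for c(n) (the average path length of an unsuccessful binary search tree lookup, with the harmonic number approximated by a logarithm plus Euler's constant) follows the original isolation forest formulation and is an assumption here, as is the example threshold value.

```python
import math

EULER_GAMMA = 0.5772156649


def normalization_factor(n):
    """c(n): assumed standard isolation forest normalization for n instances."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA     # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """S(x, n) = 2 ** (-E(h(x)) / c(n)); requires n >= 2."""
    return 2.0 ** (-mean_path_length / normalization_factor(n))


def is_abnormal(mean_path_length, n, theta=0.6):
    """True when the score exceeds the threshold theta (0.6 is illustrative)."""
    return anomaly_score(mean_path_length, n) > theta


if __name__ == "__main__":
    for length in (1.0, 5.0, 12.0):
        print(length, round(anomaly_score(length, 256), 3))
```

A mean path length equal to c(n) gives a score of exactly 0.5, so scores well above 0.5 indicate points that are isolated unusually quickly.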

The abnormal scene candidate detection unit 13 detects, as abnormal scene candidates, the video scenes that contain the audio features for which an abnormality score exceeding the threshold θ was calculated, together with the corresponding image features.
The unsupervised learning algorithm that the abnormal scene candidate detection unit 13 uses to detect abnormal scene candidates is not limited to the isolation forest. Instead of the isolation forest, the abnormal scene candidate detection unit 13 may detect abnormal scene candidates by using a variational autoencoder (VAE) to calculate a reconstruction score for the audio features, or it may use any other form of unsupervised learning.

Returning to FIG. 2, in S3 the abnormal scene candidate detection unit 13 of the abnormality detection device 1 compares the number of learning data items stored in the learning data DB 2 with a predetermined threshold. If the number of learning data items stored in the learning data DB 2 exceeds the predetermined threshold (S3: Y), the process proceeds to S4, and the abnormal scene candidate detection unit 13 supplies the detected abnormal scene candidates to the abnormality determiner 14. If the number of learning data items stored in the learning data DB 2 does not exceed the predetermined threshold (S3: N), the processing in the abnormality determiner 14 (S4 to S6) is bypassed and the process proceeds to S7.

In this embodiment, when sufficient learning data for abnormal scene determination has not yet been accumulated in the learning data DB 2, the accuracy (reliability) of abnormal scene determination by the abnormality determiner 14 is judged to be insufficient, and the processing in the abnormality determiner 14 is bypassed. The scene presentation unit 15 then presents the abnormal scene candidates detected by the abnormal scene candidate detection unit 13 directly to the operator and accepts the operator's confirmation input. While the number of learning data samples is small, the operator is therefore always asked to confirm each abnormal scene candidate, which reduces the processing load of executing machine learning in the abnormality determiner 14.

In this way, this embodiment autonomously optimizes the control of how abnormal scenes are determined from the detected abnormal scene candidates. Specifically, while the number of learning data samples available to the abnormality determiner 14 is small, determination by the operator takes priority, which prevents a drop in the accuracy of abnormal scene determination. Once the number of learning data samples available to the abnormality determiner 14 exceeds the predetermined threshold, only the candidates that the abnormality determiner 14 could not classify as either an abnormal scene or a normal scene are presented to the operator for confirmation input, which further reduces the operator's load.
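The routing decision of S3 and the fallback to the operator can be summarized in a short sketch. The function name, the string labels, and the callable `classifier` are hypothetical stand-ins for the abnormality determiner 14; only the branching logic is taken from the description.

```python
def route_candidate(candidate, n_learning_samples, min_samples, classifier):
    """Return 'abnormal', 'normal', or 'operator' for one scene candidate."""
    # S3: N. Too little learning data, so bypass S4 to S6 and go to S7.
    if n_learning_samples <= min_samples:
        return "operator"
    # S3: Y. Let the classifier (S4) try to decide on its own.
    label = classifier(candidate)
    if label in ("abnormal", "normal"):
        return label
    # The classifier could not decide, so the operator confirms (S7).
    return "operator"


if __name__ == "__main__":
    def always_unsure(_candidate):
        return "operator"

    print(route_candidate("scene-1", 10, 100, always_unsure))
    print(route_candidate("scene-1", 500, 100, lambda c: "abnormal"))
```

Early on, every candidate is sent to the operator; later, only the undecided ones are, which is how the operator's load shrinks as labeled data accumulates.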

In S4, the abnormality determiner 14 of the abnormality detection device 1 determines whether each abnormal scene candidate supplied from the abnormal scene candidate detection unit 13 is abnormal by classifying it as a normal scene, an abnormal scene, or a scene requiring the operator's judgment, and stores the determination result in the learning data DB 2.
Specifically, the abnormality determiner 14 classifies the abnormal scene candidates supplied from the abnormal scene candidate detection unit 13 by, for example, using the k-nearest neighbor algorithm (k-NN) as supervised learning to search for the nearest neighbors in the feature space that integrates the audio features and image features.

FIG. 9 is a conceptual diagram illustrating an example of a classification algorithm based on the k-nearest neighbor method.
In FIG. 9, a group of objects indicated by circle marks is placed in the feature space. Each object is represented by a position vector in a multidimensional feature space, and its correct class is known. The star mark at the center of the concentric circles is the position vector of the object to be classified, whose class is unknown; in this embodiment it is the position vector of the abnormal scene candidate to be determined. The k-nearest neighbor method calculates the distances between the new position vector, indicated by the star mark, and the existing position vectors, indicated by the circle marks, and selects the k nearest samples. The distance between position vectors may be calculated as the Euclidean distance or as another distance such as the Manhattan distance.

In FIG. 9, when k = 3, the inner concentric circle contains two dark circle marks and one light circle mark as the three nearest objects, so the position vector to be determined is classified into the class of the dark circle marks. When k = 6, the outer concentric circle contains two dark circle marks and four light circle marks as the six nearest objects, so the position vector to be determined is classified into the class of the light circle marks. The class may also be determined by weighting the k nearest objects according to their distance from the new position vector.

The abnormality determiner 14 maps an abnormal scene candidate whose correct class is unknown, as a position vector, onto the feature space that integrates the audio features and image features of the video data. It then compares, among the k nearest objects (samples), the number of samples classified as abnormal scenes with the number of samples classified as normal scenes, and thereby classifies the candidate to be determined as an abnormal scene, a normal scene, or a scene requiring the operator's judgment.

Specifically, in the feature space that integrates the audio features and image features, if the number of samples classified as abnormal scenes among the k samples nearest to the abnormal scene candidate is sufficiently larger than the number classified as normal scenes, the abnormality determiner 14 determines the candidate to be an abnormal scene. Conversely, if the number of samples classified as normal scenes among the k nearest samples is sufficiently larger than the number classified as abnormal scenes, the abnormality determiner 14 determines the candidate to be a normal scene.

If, on the other hand, the difference between the number of samples classified as abnormal scenes and the number classified as normal scenes among the k samples nearest to the abnormal scene candidate in the feature space is small and within a predetermined threshold, the abnormality determiner 14 determines the candidate to be a scene requiring the operator's judgment.
Alternatively, the abnormality determiner 14 may automatically classify the candidate as either an abnormal scene or a normal scene according to which of the two is in the majority among the k nearest samples. In particular, when a sufficient number of learning data samples has been accumulated in the learning data DB 2, operator intervention in abnormal scene detection can be made unnecessary.
The algorithm by which the abnormality determiner 14 determines whether an abnormal scene candidate is abnormal is not limited to the k-nearest neighbor method described above. The abnormality determiner 14 may determine abnormal scenes using other supervised machine learning algorithms, including, for example, neural networks such as a CNN, or a support vector machine (SVM).
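The three-way k-nearest-neighbor rule described above can be sketched as follows. The function name, the label strings, and the `margin` parameter (standing in for the predetermined threshold on the vote difference) are assumptions for illustration, and Euclidean distance is used as one of the distances the description allows.

```python
import math
from collections import Counter


def knn_three_way(candidate, samples, k=5, margin=2):
    """Classify `candidate` as 'abnormal', 'normal', or 'operator'.

    `samples` is a list of (feature_vector, label) pairs whose labels are
    'abnormal' or 'normal'. `margin` is a hypothetical threshold: when the
    vote difference among the k nearest samples is within it, the decision
    is deferred to the operator.
    """
    nearest = sorted(samples, key=lambda s: math.dist(candidate, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    diff = votes["abnormal"] - votes["normal"]
    if diff > margin:
        return "abnormal"
    if diff < -margin:
        return "normal"
    return "operator"


if __name__ == "__main__":
    labeled = (
        [((0, 0), "abnormal"), ((0, 1), "abnormal"), ((1, 0), "abnormal"),
         ((1, 1), "abnormal"), ((0.5, 0.5), "abnormal")]
        + [((10, 10), "normal"), ((10, 9), "normal"), ((9, 10), "normal"),
           ((9, 9), "normal"), ((9.5, 9.5), "normal")]
    )
    for point in [(0.1, 0.1), (9.8, 9.9), (5, 5)]:
        print(point, knn_three_way(point, labeled))
```

Setting `margin` to 0 with an odd `k` approximates the alternative described above, in which every candidate is assigned automatically to the majority class.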

Returning to FIG. 2, if the abnormality determiner 14 determines in S4 that the abnormal scene candidate is an abnormal scene, the process proceeds to S5, where processing such as stopping distribution of the video content containing the abnormal scene, deleting the account of the distributor of that video content, or deleting the determined abnormal scene is executed, and the process ends.
If the abnormality determiner 14 determines that the abnormal scene candidate is a normal scene, the process proceeds to S6, where distribution of the video content containing the candidate is continued, and the process ends. If the abnormality determiner 14 determines that the abnormal scene candidate is a scene requiring the operator's judgment, the process proceeds to S7.

In S7, the scene presentation unit 15 of the abnormality detection device 1 presents to the operator the video (audio data and image data) of the abnormal scene candidate supplied from the abnormal scene candidate detection unit 13, or of the abnormal scene candidate that the abnormality determiner 14 has determined to require the operator's judgment.
In S8, the scene presentation unit 15 of the abnormality detection device 1 accepts, as the operator's confirmation input for the video of the candidate presented in S7, a tag of either abnormal scene or normal scene.
In S9, for a candidate tagged as an abnormal scene, the operator executes the processing for abnormal scenes, that is, processing such as stopping distribution of the video content containing the abnormal scene, deleting the account of the distributor of that video content, or deleting the determined abnormal scene. For a candidate tagged as a normal scene, on the other hand, the operator lets distribution of the video continue without executing the processing for abnormal scenes.

In S10, the scene presentation unit 15 of the abnormality detection device 1 associates the operator's tag (label) of abnormal scene or normal scene, input to the scene presentation unit 15 in S8, with the audio features and image features of the presented abnormal scene candidate, and stores the result in the learning data DB 2 as learning data representing the operator's determination of the abnormal scene. The learning data DB 2 is thereby updated with new learning data.

In S11, the abnormality determiner 14 of the abnormality detection device 1 performs relearning based on the learning data DB 2 updated in S10. Whether relearning is necessary depends on the abnormality determination algorithm used by the abnormality determiner 14. For example, as described above, when the abnormality determiner 14 uses the k-nearest neighbor method, it refers to the learning data DB 2 and selects the k nearest samples each time it determines an abnormal scene. Relearning in S11 is therefore unnecessary; updating the learning data DB 2 in S10 suffices, and the processing of S11 may be omitted.

On the other hand, when the abnormality determiner 14 uses a neural network, an SVM, or the like, the abnormality determiner 14 must be relearned at some point so that its parameters are updated.
As for the timing of relearning, the abnormality determiner 14 may be relearned every time the learning data DB 2 is updated, or each time the learning data DB 2 has been updated a predetermined number of times. Alternatively, the abnormality determiner 14 can be relearned immediately before it is used in S4; however, when abnormal scenes are to be detected in real time from video distributed in real time, it should be taken into account that relearning may degrade real-time performance.
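As a non-limiting illustration, the relearning timing options above (no relearning for the k-nearest neighbor method, relearning on every update, or relearning every predetermined number of updates) might be sketched as follows. The class `RetrainScheduler` and its parameters are hypothetical names introduced here, not part of the original disclosure.

```python
class RetrainScheduler:
    """Decides, after each update of the learning data (S10), whether to run S11."""

    def __init__(self, needs_retraining: bool, every_n_updates: int = 1):
        if every_n_updates < 1:
            raise ValueError("every_n_updates must be >= 1")
        # False for lazy learners such as k-NN, True for e.g. neural networks or SVMs.
        self.needs_retraining = needs_retraining
        self.every_n_updates = every_n_updates
        self._pending = 0  # updates received since the last relearning

    def on_db_update(self) -> bool:
        """Call once per learning-data update; returns True if relearning should run now."""
        if not self.needs_retraining:
            return False
        self._pending += 1
        if self._pending >= self.every_n_updates:
            self._pending = 0
            return True
        return False
```

With `every_n_updates=1` the determiner is relearned on every update; larger values trade freshness of the model for lower load, which matters when real-time performance is a concern.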

<Modification>
FIG. 10 is a diagram showing a modification of the abnormal scene detection process executed by the abnormality detection device 1 according to this embodiment.
As a modification, as shown in FIG. 10, the abnormal scene candidate detection unit 13 may omit the processing of S3 and uniformly supply detected abnormal scene candidates to the abnormality determiner 14, regardless of the number of learning data items accumulated in the learning data DB 2. This further reduces the operator's load of checking abnormal scene candidates in video monitoring.
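As a non-limiting illustration, the S3 branch and the modification that omits it might be sketched as follows. The function name `route_candidate`, the flag `skip_s3`, and the returned strings are hypothetical and introduced here only for illustration.

```python
def route_candidate(num_learning_samples: int, threshold: int, skip_s3: bool = False) -> str:
    """Return where a detected abnormal scene candidate is sent next.

    skip_s3=True models the modification of FIG. 10: every candidate goes to the
    abnormality determiner regardless of how much learning data has accumulated.
    """
    if skip_s3 or num_learning_samples > threshold:
        return "abnormality_determiner"  # proceed to S4
    return "operator"  # bypass the determiner and present the candidate (S7)
```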

As described above, according to this embodiment, the abnormality detection device extracts audio features and image features from the acquired video data, and detects abnormal scene candidates from the video data by classifying the extracted audio features using unsupervised learning. The abnormality detection device also includes an abnormality determiner that determines whether a detected abnormal scene candidate is abnormal, normal, or other, based on the audio features and image features of the video data. When the abnormality determiner determines that the abnormal scene candidate belongs to "other", the device presents the candidate via a user interface and accepts, via the user interface, input of information to be added to the presented candidate.

Thus, in a first stage, abnormal scene candidates are detected at high speed and with low load by unsupervised learning based on the audio features in the video data. In a second stage, determination of abnormal scenes by the abnormality determiner, based on the audio features and image features in the video data, is used complementarily with determination of abnormal scenes by the operator's visual inspection of the presented video.
Therefore, diverse abnormalities can be detected from video with high accuracy while reducing the operator's load.
As a result, fast and highly accurate multimodal detection of abnormal scenes can be realized that keeps pace with video data that is distributed in real time and may contain diverse abnormal scenes.

(Embodiment 2)
Embodiment 2 is described in detail below with reference to FIGS. 11 to 14, only with respect to the points that differ from Embodiment 1.
In this embodiment, in addition to Embodiment 1 described above, the emotion of a person who is an object in the video is analyzed from the image features of the video data, and the emotion analysis result is used for detecting abnormal scene candidates and for determining abnormal scenes.

FIG. 11 is a block diagram showing an example of the functional configuration of the abnormality detection device 1 according to this embodiment.
The configuration shown in the block diagram of FIG. 11 includes an emotion analysis unit 16 in addition to the functional configuration of the abnormality detection device 1 of Embodiment 1 shown in FIG. 1.
In FIG. 11, the functional configurations of the data acquisition unit 11, the feature extraction unit 12, the abnormal scene candidate detection unit 13, the abnormality determiner 14, and the scene presentation unit 15 are the same as those of the corresponding units shown in FIG. 1.
Referring to FIG. 11, the feature extraction unit 12 supplies the image features extracted from the image data in the video data to the emotion analysis unit 16.

The emotion analysis unit 16 analyzes the emotion of a human face, which is an object in the image, based on the image features of the video data supplied from the feature extraction unit 12.
The emotion analysis unit 16 may estimate the emotion of a person's face in the image by analyzing the face using, for example, supervised learning such as a CNN. The facial emotions estimated from an image of a person's face may include, for example, anger, disgust, fear, happiness, sadness, surprise, and other (neutral).
The emotion analysis unit 16 may also determine the emotion of a person's face in a target image based on the average confidence of the estimated emotions, calculated over a plurality of temporally adjacent image frames.
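As a non-limiting illustration, determining the emotion from the average confidence over temporally adjacent frames might be sketched as follows. The function name `smoothed_emotion`, the window parameter `radius`, and the per-frame dictionary format are assumptions introduced here; the original disclosure does not specify the window size or data layout.

```python
def smoothed_emotion(frame_scores, center, radius=1):
    """Pick the emotion with the highest mean confidence around frame `center`.

    frame_scores: list of dicts mapping emotion name -> confidence, one per frame,
                  all sharing the same keys.
    radius:       hypothetical number of neighbouring frames used on each side.
    """
    lo = max(0, center - radius)
    hi = min(len(frame_scores), center + radius + 1)
    window = frame_scores[lo:hi]
    # Average each emotion's confidence over the frames in the window.
    mean = {e: sum(f[e] for f in window) / len(window) for e in window[0]}
    best = max(mean, key=mean.get)
    return best, mean[best]
```

Averaging in this way suppresses a single-frame spike: a frame whose top emotion is anger can still be resolved as neutral if its neighbours are confidently neutral.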

The emotion analysis unit 16 may further analyze the movement of a person's body and limbs in the image, objects that the person is holding (for example, a microphone or a musical instrument), or the background (for example, whether the scene is indoors or outdoors). The feature extraction unit 12 may extract the features of the target objects to be analyzed by the emotion analysis unit 16 and supply them to the emotion analysis unit 16.

In this embodiment, the facial emotion that the emotion analysis unit 16 estimates from the image features of an image of a person's face complements the abnormal scene candidate detection process executed by the abnormal scene candidate detection unit 13 and the abnormality determination process for abnormal scene candidates executed by the abnormality determiner 14.
Specifically, the emotion analysis unit 16 may estimate the context of the video from the facial emotion estimated from the image features of the person's face, and give the abnormal scene candidate detection unit 13 a cue (trigger) for the abnormal scene candidate detection process. For example, when the emotion analysis unit 16 analyzes an image of a person's face and detects an emotion such as anger, fear, or surprise, the video containing that image is likely to be an abnormal scene. The emotion analysis unit 16 may therefore give a cue to the abnormal scene candidate detection unit 13 so that it executes the process of detecting abnormal scene candidates from the audio features of that video.

The emotion analysis unit 16 may also supply the facial emotion features estimated from the image features of the person's face to the abnormality determiner 14, and the abnormality determiner 14 may integrate these emotion features into the feature space and determine, by the k-nearest neighbor method, whether an abnormal scene candidate is abnormal. For example, the abnormality determiner 14 may use facial emotion features such as anger, fear, and surprise as positive factors for determining that a scene is abnormal.
The emotion analysis unit 16 may further supply the analysis result of the facial emotion estimated from the image features of the person's face to the scene presentation unit 15, and the scene presentation unit 15 may display the analysis result together with the presented video, for example superimposed on the video or in a separate window.
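As a non-limiting illustration, integrating the emotion features into the feature space might be as simple as appending the emotion confidences to the audio and image feature vectors before the nearest-neighbor search. The function `fuse_features`, the fixed emotion order, and the `emotion_weight` scaling are assumptions introduced here; the original disclosure does not specify how the integration is performed.

```python
# Emotion categories listed in the description; the order is an arbitrary choice.
EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral")


def fuse_features(audio, image, emotion_conf, emotion_weight=1.0):
    """Concatenate audio, image, and emotion features into one vector.

    emotion_conf:   dict mapping emotion name -> confidence; missing emotions count as 0.
    emotion_weight: hypothetical scale factor controlling how strongly the emotion
                    dimensions influence distances in the fused feature space.
    """
    emo = [emotion_weight * emotion_conf.get(e, 0.0) for e in EMOTIONS]
    return list(audio) + list(image) + emo
```

The fused vectors can then be stored as learning data and compared with a distance-based method such as the k-nearest neighbor method.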

FIG. 12 is a flowchart showing an example of the procedure of the abnormal scene detection process executed by the abnormality detection device 1 according to Embodiment 2.
In the flowchart of FIG. 12, the processing of S12 is added between S1 and S2 of the abnormal scene detection process executed by the abnormality detection device 1 of Embodiment 1 shown in FIG. 2.
The processing of S1 is the same as in Embodiment 1 shown in FIG. 2. That is, as in Embodiment 1, the feature extraction unit 12 of the abnormality detection device 1 extracts audio features and image features from the video data supplied by the data acquisition unit 11.

In S1, once the feature extraction unit 12 of the abnormality detection device 1 has extracted the audio features and image features from the video data, the process proceeds to S12.
In S12, the emotion analysis unit 16 of the abnormality detection device 1 detects abnormal scene candidates from the image features extracted by the feature extraction unit 12. Specifically, the emotion analysis unit 16 estimates a person's emotion from the image features of the person's face in the image and, when an emotion such as anger, fear, or surprise is estimated, may detect the video scene containing that image as an abnormal scene candidate.
When the emotion analysis unit 16 detects an abnormal scene candidate from the image features, it gives a cue (trigger) to the abnormal scene candidate detection process based on the audio features of the video, which the abnormal scene candidate detection unit 13 executes in the subsequent S2.

Following S12, in S2, when the emotion analysis unit 16 has detected an abnormal scene candidate from the image features and given a trigger, the abnormal scene candidate detection unit 13 of the abnormality detection device 1 detects abnormal scene candidates by classifying, using unsupervised learning, the audio features corresponding to the abnormal scene candidate supplied from the emotion analysis unit 16.

Alternatively, the abnormal scene candidate detection unit 13 may always detect abnormal scene candidates from the audio features of the video data, regardless of whether the emotion analysis unit 16 gives a trigger. In that case, when the emotion analysis unit 16 gives a trigger for abnormal scene candidate detection based on the image features, the abnormal scene candidate detection unit 13 may check, from the audio features of the detected abnormal scene candidate, whether the candidate should be supplied to the abnormality determiner 14.
The processing from S2 to S11 is the same as in Embodiment 1 shown in FIG. 2.
As in FIG. 10, the abnormality detection device 1 according to this embodiment may omit the determination and branching of S3 and proceed to the abnormal scene determination by the abnormality determiner 14 in S4, regardless of the number of learning data items stored in the learning data DB 2.
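As a non-limiting illustration, the two ways of combining the emotion cue with audio-based candidate detection (triggered detection and always-on detection) might be sketched as follows. The names `detect_candidate`, `ALERT_EMOTIONS`, and `always_on_audio` are hypothetical, and forwarding a candidate only when both the cue and the audio detection agree is an interpretation assumed here, not something the original disclosure states explicitly.

```python
# Emotions that the description gives as examples suggesting an abnormal scene.
ALERT_EMOTIONS = {"anger", "fear", "surprise"}


def detect_candidate(emotion, audio_detector, audio_features, always_on_audio=False):
    """Return True if the segment should be forwarded as an abnormal scene candidate.

    audio_detector: callable returning True when the audio features look anomalous
                    (for example, an unsupervised outlier detector).
    """
    cue = emotion in ALERT_EMOTIONS
    if always_on_audio:
        # Audio detection runs on every segment; the emotion cue confirms the result.
        audio_hit = audio_detector(audio_features)
        return audio_hit and cue
    # Triggered mode: the audio detector is invoked only when a cue is present.
    return cue and audio_detector(audio_features)
```

Both modes forward the same segments under this assumption; they differ in cost, since the triggered mode skips audio detection for segments without a cue.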

FIG. 13 is a flowchart showing an example of the procedure of a modification of the abnormal scene detection process executed by the abnormality detection device 1 according to Embodiment 2.
In the flowchart of FIG. 13, the processing of S13 is added between S2 and S3 of the abnormal scene detection process executed by the abnormality detection device 1 of Embodiment 1 shown in FIG. 2.
The processing of S1 and S2 is the same as in Embodiment 1 shown in FIG. 2. That is, as in Embodiment 1, the feature extraction unit 12 of the abnormality detection device 1 extracts audio features and image features from the video data supplied by the data acquisition unit 11, and the abnormal scene candidate detection unit 13 detects abnormal scene candidates based on the audio features of the video data supplied from the feature extraction unit 12.

Next, in S13, the emotion analysis unit 16 of the abnormality detection device 1 estimates the emotion of a person's face from the image features of the image data supplied from the feature extraction unit 12, in particular from the image features of the person's face contained in the image.
The processing from S3 to S11 is the same as in Embodiment 1 shown in FIG. 2, except that in S4 the abnormality determiner 14 may integrate the facial emotion features supplied from the emotion analysis unit 16 into the audio and image feature space. In S7, the scene presentation unit 15 may present the analysis result of the emotion analysis unit 16 together with the image of the abnormal scene candidate.
In FIG. 13, S2 and S13 may be executed in parallel, or S13 may be executed before S2.
As in FIG. 10, the abnormality detection device 1 according to this embodiment may also omit the determination and branching of S3 and proceed to the abnormal scene determination by the abnormality determiner 14 in S4, regardless of the number of learning data items stored in the learning data DB 2.

FIG. 14 is a diagram showing an example of the emotion analysis result that the scene presentation unit 15 presents after the emotion analysis unit 16 of the abnormality detection device 1 has analyzed an image of the video data.
Referring to FIG. 14, a bounding box 131 is displayed around a person's face in the image, indicating that the face has been detected as an object. The confidence of each emotion estimated from the face within the bounding box 131 is displayed at the upper left of the output window.
In the example of FIG. 14, anger has the highest calculated confidence, at 37.45%, but the facial expression of the person enclosed by the bounding box 131 is assumed to be neutral rather than angry. In this case, the operator presented with the image of FIG. 14 can visually check the person's facial expression in the image and input to the scene presentation unit 15 the confirmation result that the scene is not an abnormal scene (that is, that it is a normal scene). Alternatively, the emotion analysis unit 16 may set a predetermined threshold for the confidence score and, when the confidence score for anger is at or below the threshold, need not detect the scene as an abnormal scene candidate.
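As a non-limiting illustration, the confidence threshold described above might be applied as follows. The function name `emotion_flags_candidate`, the default threshold of 0.5, and all confidence values other than the 37.45% for anger are hypothetical and introduced here for illustration only.

```python
def emotion_flags_candidate(confidences, threshold=0.5, alert=("anger", "fear", "surprise")):
    """Return True only if the top emotion is an alert emotion above the threshold.

    confidences: dict mapping emotion name -> confidence in the range 0..1.
    """
    top = max(confidences, key=confidences.get)
    # A score at or below the threshold is not treated as an abnormal scene candidate.
    return top in alert and confidences[top] > threshold


# Anger is the top emotion at 37.45%, as in the FIG. 14 example; other values are made up.
example = {"anger": 0.3745, "neutral": 0.30, "sadness": 0.20, "happiness": 0.1255}
```

Under these assumptions, `emotion_flags_candidate(example)` returns False, so the weakly "angry" frame of FIG. 14 would not be raised as a candidate in the first place.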

As described above, according to this embodiment, the abnormality determiner of the abnormality detection device only needs to perform abnormal scene determination on those abnormal scene candidates that have been judged highly likely to be abnormal scenes from multimodal information, namely both the audio features and the image features of the video data, in particular the facial emotion features. Therefore, even when the number of learning data samples is small, the abnormality determination process can be executed with high accuracy and low load.
Similarly, according to this embodiment, the scene presentation unit of the abnormality detection device only needs to present to the operator those abnormal scene candidates that have been judged highly likely to be abnormal scenes from the same multimodal information. Therefore, the operator's load in checking abnormal scenes is further reduced.

<Hardware configuration of the abnormality detection device>
FIG. 15 is a diagram showing a non-limiting example of the hardware configuration of the abnormality detection device 1 according to this embodiment.
The abnormality detection device 1 according to this embodiment can be implemented on any one or more computers, mobile devices, or any other processing platforms.
FIG. 15 shows an example in which the abnormality detection device 1 is implemented on a single computer, but the abnormality detection device 1 according to this embodiment may be implemented on a computer system including a plurality of computers. The plurality of computers may be connected so as to communicate with one another over a wired or wireless network.

As shown in FIG. 15, the abnormality detection device 1 may include a CPU 21, a ROM 22, a RAM 23, an HDD 24, an input unit 25, a display unit 26, a communication I/F 27, and a system bus 28. The abnormality detection device 1 may also include an external memory. The PC 3 may have the same configuration as that shown in FIG. 15.
The CPU (Central Processing Unit) 21 controls the overall operation of the abnormality detection device 1, and controls each of the components (22 to 27) via the system bus 28, which is a data transmission path.
The abnormality detection device 1 may also include a GPU (Graphics Processing Unit). A GPU has higher computing capability than the CPU 21, and operating several or many GPUs in parallel provides higher processing performance, particularly for video processing applications that use machine learning, as in this embodiment. A GPU typically includes processors and a shared memory. Each processor fetches data from the high-speed shared memory and executes a common program, so that large amounts of the same kind of computation are executed at high speed.

The ROM (Read Only Memory) 22 is a non-volatile memory that stores the control programs and other data that the CPU 21 needs in order to execute processing. The programs may instead be stored in a non-volatile memory such as the HDD (Hard Disk Drive) 24 or an SSD (Solid State Drive), or in an external memory such as a removable storage medium (not shown).
The RAM (Random Access Memory) 23 is a volatile memory and functions as the main memory, work area, and the like of the CPU 21. That is, when executing processing, the CPU 21 loads the necessary programs and other data from the ROM 22 into the RAM 23 and executes the programs to realize various functional operations.

The HDD 24 stores, for example, the various data and information needed when the CPU 21 performs processing using a program. The HDD 24 also stores, for example, the various data and information obtained as a result of the CPU 21 performing processing using a program.
The input unit 25 is composed of a keyboard and a pointing device such as a mouse.
The display unit 26 is composed of a monitor such as a liquid crystal display (LCD). The display unit 26 may provide a GUI (Graphical User Interface), that is, a user interface for inputting to the abnormality detection device 1 instructions such as the various parameters used in the abnormal scene detection process and the communication parameters used in communication with other devices.

The communication I/F 27 is an interface that controls communication between the abnormality detection device 1 and external devices.
The communication I/F 27 provides an interface to a network and communicates with external devices via the network. Video, abnormal scene determination results, abnormal scene confirmation inputs, various parameters, and the like are transmitted to and received from external devices via the communication I/F 27. In this embodiment, the communication I/F 27 may communicate over a wired LAN (Local Area Network) conforming to a communication standard such as Ethernet (registered trademark), or over a dedicated line. However, the networks usable in this embodiment are not limited to these, and a wireless network may be used. Such wireless networks include wireless PANs (Personal Area Networks) such as Bluetooth (registered trademark), ZigBee (registered trademark), and UWB (Ultra Wide Band); wireless LANs (Local Area Networks) such as Wi-Fi (Wireless Fidelity) (registered trademark); wireless MANs (Metropolitan Area Networks) such as WiMAX (registered trademark); and wireless WANs (Wide Area Networks) such as LTE/3G, 4G, and 5G. The network only needs to connect the devices so that they can communicate with one another, and its communication standard, scale, and configuration are not limited to the above.

The functions of at least some of the elements of the abnormality detection device 1 shown in FIG. 1 can be realized by the CPU 21 executing a program. However, the functions of at least some of the elements of the abnormality detection device 1 shown in FIG. 1 may instead operate as dedicated hardware. In that case, the dedicated hardware operates under the control of the CPU 21.

Although specific embodiments have been described above, these embodiments are merely examples and are not intended to limit the scope of the present invention. The devices and methods described in this specification can be embodied in forms other than those described above. Omissions, substitutions, and changes may also be made to the above embodiments as appropriate without departing from the scope of the present invention. Forms in which such omissions, substitutions, and changes have been made fall within the scope of what is recited in the claims and their equivalents, and belong to the technical scope of the present invention.

DESCRIPTION OF SYMBOLS: 1: abnormality detection device, 2: learning data DB, 3: PC, 11: data acquisition unit, 12: feature extraction unit, 13: abnormal scene candidate detection unit, 14: abnormality determiner, 15: scene presentation unit, 16: emotion analysis unit, 21: CPU, 22: ROM, 23: RAM, 24: HDD, 25: input unit, 26: display unit, 27: communication I/F, 28: bus

Claims

1. An information processing apparatus comprising:
a video acquisition unit that acquires video data;
a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data;
an abnormal scene candidate detection unit that detects abnormal scene candidates from the video data based on the audio features extracted by the feature extraction unit;
an abnormality determiner that determines, based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other; and
a scene presentation unit that, when the abnormality determiner determines that the abnormal scene candidate belongs to other, presents the abnormal scene candidate via a user interface and accepts, via the user interface, input of information to be added to the presented abnormal scene candidate.

2. The information processing apparatus according to claim 1, wherein the abnormal scene candidate detection unit detects the abnormal scene candidates from the video data using unsupervised learning.

3. The information processing apparatus according to claim 1 or 2, wherein the abnormal scene candidate detection unit detects the abnormal scene candidates from the video data by directly separating abnormal audio features without generating a model of normal audio features.

4. The information processing apparatus according to claim 3, wherein the abnormal scene candidate detection unit separates the abnormal audio features by calculating the path length of each audio feature in an isolation forest.
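By way of non-limiting illustration, the path-length separation recited above might be sketched as follows. This is a minimal pure-Python isolation forest written for this example, not the implementation of the embodiments. The function names, the tree count, the subsample size, and the score normalization (which follows the commonly used isolation forest formulation) are all assumptions.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

def anomaly_scores(features, n_trees=100, sample_size=64, seed=0):
    """Score each feature vector; shorter average paths give scores closer to 1."""
    if len(features) < 2:
        raise ValueError("need at least two feature vectors")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(features))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(features, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    scores = []
    for x in features:
        mean_len = sum(path_length(t, x) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_len / norm))
    return scores
```

In this sketch no model of normal audio is fitted. A feature vector that is isolated after only a few random splits has a short path length and therefore a high score, so frames with the highest scores would be the ones treated as abnormal scene candidates.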

5. The information processing apparatus according to any one of claims 1 to 4, wherein the feature extraction unit extracts audio features represented by a Mel-frequency spectrogram of the audio data in the video data.

6. The information processing apparatus according to claim 5, wherein the feature extraction unit calculates Mel-frequency cepstral coefficients (MFCC) from the audio data and concatenates the calculated MFCC with the Mel-frequency spectrogram to extract the audio features.
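As a non-limiting sketch of the concatenation recited above, the following assumes that Mel-band energies have already been computed for each audio frame (the filter bank itself is omitted). MFCC are taken as the first coefficients of a DCT-II over the log Mel energies, which is the usual definition. The function names and the default of 13 coefficients are assumptions made for illustration.

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II of a sequence, truncated to the first n_coeffs terms."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

def frame_features(mel_energies, n_mfcc=13, eps=1e-10):
    """Concatenate one frame's log Mel values with the MFCC derived from them."""
    log_mel = [math.log(e + eps) for e in mel_energies]  # eps avoids log(0)
    mfcc = dct_ii(log_mel, min(n_mfcc, len(log_mel)))
    return log_mel + mfcc

def audio_features(mel_spectrogram, n_mfcc=13):
    """Apply frame_features to every frame of a Mel spectrogram (list of frames)."""
    return [frame_features(frame, n_mfcc) for frame in mel_spectrogram]
```

Each output vector holds the log Mel values followed by the MFCC, so a frame with 8 Mel bands and 4 coefficients yields a 12-dimensional audio feature.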

7. The information processing apparatus according to any one of claims 1 to 6, wherein the scene presentation unit adds the information input via the user interface to the audio features and the image features, and stores the result in a storage device as learning data for the abnormality determiner.

8. The information processing apparatus according to claim …, wherein the abnormal scene candidate detection unit causes the abnormality determiner to determine the abnormal scene candidate when the number of pieces of learning data stored in the storage device exceeds a predetermined threshold.

9. The information processing apparatus according to claim 7, wherein, when the number of pieces of learning data stored in the storage device is within a predetermined threshold, the abnormal scene candidate detection unit bypasses the determination by the abnormality determiner and causes the scene presentation unit to present the abnormal scene candidate.
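The threshold-dependent routing recited above can be illustrated, in a non-limiting way, by the short sketch below. The function name, the callables `determine` and `present`, and the reading of "within a predetermined threshold" as "less than or equal to" are assumptions introduced for this example.

```python
def handle_candidate(candidate, n_learning_data, threshold, determine, present):
    """Route one abnormal scene candidate depending on how much learning data exists."""
    if n_learning_data <= threshold:
        # Too little learning data: bypass the determiner and ask the operator.
        present(candidate)
        return "presented"
    label = determine(candidate)  # expected: "abnormal", "normal", or "other"
    if label == "other":
        # The determiner could not decide, so the operator is asked to label it.
        present(candidate)
    return label
```

With this arrangement the operator labels every candidate while learning data is scarce, and only the undecided ("other") candidates once enough learning data has accumulated.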

10. The information processing apparatus according to any one of claims 1 to 9, wherein the abnormality determiner determines the abnormal scene candidate to be other when, in a feature space in which the audio features and the image features are integrated, the difference between the number of abnormal samples and the number of normal samples located near the abnormal scene candidate is within a predetermined threshold.

11. The information processing apparatus according to any one of claims 1 to 10, wherein the abnormality determiner determines the abnormal scene candidate by a k-nearest neighbor method.
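A non-limiting sketch of a k-nearest neighbor determination with an "other" outcome is given below. It assumes that each sample is a pair of an integrated (audio plus image) feature vector and a label of "abnormal" or "normal", and that distance is Euclidean. The function name, `k`, and the `margin` parameter standing in for the predetermined threshold are assumptions.

```python
import math
from collections import Counter

def knn_determine(candidate, samples, k=5, margin=1):
    """Classify a candidate as 'abnormal', 'normal', or 'other' from its k nearest samples.

    samples: list of (feature_vector, label) pairs, label in {'abnormal', 'normal'}.
    """
    if not samples:
        raise ValueError("no learning data available")
    # Take the k labeled samples closest to the candidate in the feature space.
    nearest = sorted(samples, key=lambda s: math.dist(candidate, s[0]))[:k]
    counts = Counter(label for _, label in nearest)
    n_abnormal, n_normal = counts["abnormal"], counts["normal"]
    # When the two counts are nearly balanced, defer to the operator.
    if abs(n_abnormal - n_normal) <= margin:
        return "other"
    return "abnormal" if n_abnormal > n_normal else "normal"
```

A candidate lying between the abnormal and normal clusters has a mixed neighborhood, so it falls into "other" and would be presented via the user interface for labeling.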

12. The information processing apparatus according to any one of claims 1 to 11, further comprising an emotion analysis unit that analyzes, using supervised learning, facial emotion included in the video data from the image features extracted by the feature extraction unit, and supplies the analyzed facial emotion features to the abnormality determiner.

13. The information processing apparatus according to claim 12, wherein, when an abnormal scene candidate is detected from the video data based on the analyzed facial emotion, the emotion analysis unit causes the abnormal scene candidate detection unit to execute detection of an abnormal scene based on the audio features.

14. An information processing system comprising a server and at least one client device connected to the server via a network, wherein
the server comprises:
a video acquisition unit that acquires video data;
a feature extraction unit that extracts audio features from the video data acquired by the video acquisition unit and extracts image features from the video data;
an abnormal scene candidate detection unit that detects abnormal scene candidates from the video data based on the audio features extracted by the feature extraction unit;
an abnormality determiner that determines, based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection unit is abnormal, normal, or other;
a scene presentation unit that, when the abnormality determiner determines that the abnormal scene candidate belongs to other, presents the abnormal scene candidate via a user interface and receives, via the user interface, input of information to be added to the presented abnormal scene candidate; and
a transmitting unit that transmits the abnormal scene candidate to the client device, and
the client device comprises:
a receiving unit that receives the abnormal scene candidate transmitted from the server;
the user interface, which presents the abnormal scene candidate received by the receiving unit and receives input of information to be added to the presented abnormal scene candidate; and
a transmitting unit that transmits, to the server, the information to be added to the abnormal scene candidate whose input was received by the user interface.

15. An information processing method executed by an information processing device, the method comprising:
acquiring video data;
extracting audio features from the acquired video data and extracting image features from the video data;
detecting abnormal scene candidates from the video data by unsupervised learning based on the extracted audio features;
determining, by an abnormality determiner and based on the audio features and the image features, whether a detected abnormal scene candidate is abnormal, normal, or other; and
when the abnormality determiner determines that the abnormal scene candidate belongs to other, presenting the abnormal scene candidate via a user interface and receiving, via the user interface, input of information to be added to the presented abnormal scene candidate.

16. An information processing program for causing a computer to execute information processing, the program causing the computer to execute:
a video acquisition process of acquiring video data;
a feature extraction process of extracting audio features from the video data acquired by the video acquisition process and extracting image features from the video data;
an abnormal scene candidate detection process of detecting abnormal scene candidates from the video data based on the audio features extracted by the feature extraction process;
an abnormality determination process of determining, by an abnormality determiner and based on the audio features and the image features, whether an abnormal scene candidate detected by the abnormal scene candidate detection process is abnormal, normal, or other; and
a scene presentation process of, when the abnormality determiner determines that the abnormal scene candidate belongs to other, presenting the abnormal scene candidate via a user interface and receiving, via the user interface, input of information to be added to the presented abnormal scene candidate.