JP2020119154A

JP2020119154A - Information processing device, information processing method, and program

Info

Publication number: JP2020119154A
Application number: JP2019008776A
Authority: JP
Inventors: 侑輝斎藤; Yuki Saito
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2020-08-06
Anticipated expiration: 2039-01-22
Also published as: JP7313828B2

Abstract

To enhance the data recognition accuracy of a recognition process and shorten the time of the recognition process. SOLUTION: An information processing device includes an input unit 21 that acquires recognition target data, a feature extraction unit 221 that maps the recognition target data in a feature space, a representative feature amount generation unit 222 that generates a plurality of points in a data space and/or the feature space of the recognition target data, a distance calculation unit 223 that compares the recognition target data with the plurality of points in the feature space and/or the data space, and an abnormality determination unit 31 that outputs a recognition process result based on the comparison result. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for recognizing a recognition target from data such as video and images.

To recognize an object as a recognition target and its state from data such as video and images, there is a technique that extracts a feature amount from the recognition target data and discriminates the state, class, or the like of the recognition target based on the distance between that feature amount and the feature amount of the nearest learning data. For example, as a method for abnormality detection, Non-Patent Document 1 discloses a method of detecting an abnormality based on the distance to normal data existing in the vicinity of the recognition target data, or the distance to a cluster of normal data.

Non-Patent Document 1: Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner. M. Amer and M. Goldstein. 2012.

Here, to recognize input data at higher speed, a recognition system with reduced recognition processing time is desired. Further, to recognize input data with higher accuracy, a recognition system with further improved recognition ability is desired.
When the distance between the recognition target data and neighboring data is calculated as in Non-Patent Document 1, the neighbor search requires processing time that depends on the number of learning data. Likewise, when the distance between the recognition target data and cluster centers is calculated as in Non-Patent Document 1, processing time that depends on the number of clusters is required. On the other hand, when the learning data are diverse and large in volume, a large number of cluster types or classes must be used to approximate the learning data in order to represent their variety, so processing that depends on the number of clusters may be undesirable. However, if the number of clusters is reduced to cut the processing time that depends on it, the deviation between each cluster center and the data belonging to that cluster increases, so the proportion of wrong answers in the search processing for the recognition target may increase.
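The trade-off described above can be illustrated with a minimal sketch. The toy data, the function names, and the choice of Euclidean distance are assumptions made for illustration and are not taken from Non-Patent Document 1. Comparing against every normal sample costs one distance per sample, while comparing against a single coarse cluster center costs one distance but can make a point far from all normal data look normal.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_distance(query, references):
    """Distance from query to its nearest reference.

    The cost grows linearly with len(references).
    """
    return min(euclidean(query, r) for r in references)

# Toy normal data: two tight groups in a 2-D feature space (illustrative values).
normal_data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
               (5.0, 5.0), (4.8, 5.0), (5.0, 4.8)]

# A single center for all data is cheap to compare against but fits poorly.
single_center = [tuple(sum(c) / len(normal_data) for c in zip(*normal_data))]

query = (2.5, 2.5)  # lies between the groups, far from every normal sample

score_full = nearest_distance(query, normal_data)      # 6 distance computations
score_coarse = nearest_distance(query, single_center)  # 1 distance computation
```

Here `score_full` is large, which correctly flags the query as far from normal data, whereas `score_coarse` is close to zero because the lone center happens to sit between the two groups.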

Therefore, an object of the present invention is to make it possible both to improve data recognition accuracy in recognition processing and to reduce recognition processing time.

The information processing apparatus of the present invention includes acquisition means for acquiring recognition target data, mapping means for mapping the recognition target data into a feature space, generation means for generating a plurality of points in at least one of the feature space and the data space of the recognition target data, comparison means for comparing the recognition target data with the plurality of points in at least one of the feature space and the data space, and output means for outputting a recognition processing result based on the result of the comparison.

According to the present invention, it is possible to improve data recognition accuracy in recognition processing and to reduce recognition processing time.

A diagram showing a configuration example of the abnormality detection system.
A flowchart showing the operation (at the time of learning) of the abnormality detection system.
A flowchart showing the operation of the selection unit.
A diagram showing an example of an input image and human body regions.
A flowchart showing the operation of the learning unit.
A diagram showing a configuration example (at the time of learning) of the CNN.
A flowchart showing the operation (at the time of detection) of the abnormality detection system.
A diagram showing a configuration example (at the time of detection) of the CNN.
A diagram showing a configuration example (at the time of learning) of the Siamese CNN.
A diagram showing a configuration example (at the time of learning) of the CNN.
A diagram showing a configuration example (at the time of learning) of the Autoencoder.
A diagram showing a configuration example (at the time of learning) for generating representative points using auxiliary information.
A diagram showing an example of representative point generation in multiple layers and hierarchical cluster loss.
A diagram showing an example of a representative point generator for each cluster.
A diagram showing an example of multi-stage sequence lengths.

Embodiments of the present invention will be described below with reference to the accompanying drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.
<First Embodiment>
In the following description, data relating to a recognition target, that is, an object and its state, is called recognition target data, and a feature amount extracted from the recognition target data is called a target feature amount. Even after the recognition target data has been mapped onto the feature space, it is still commonly referred to as data on the feature space; therefore, where the following description says simply recognition target data, this includes not only data in, for example, the original image space but also data in the feature space. The recognition target may be specified by a human or specified by a machine as a processing target, and may be singular or plural. Individual cases below may describe the nature of the recognition target as a specific example, but the present invention is not intended to be limited to those cases in principle. The present embodiment first shows an example in which recognition processing of a recognition target is performed in constant time that does not depend on the number of learning data. In addition, since the present embodiment takes an abnormality detection system as its example, it gives examples of learning algorithms suited to the purpose of abnormality detection. Examples of the learning algorithm include a Two-Class learning algorithm that presupposes a normal class and an abnormal class, and One-Class learning that uses only normal data for learning. Of course, the present invention can also be used for other purposes, and various configuration examples are described in order below.

Although details will be described later, the present embodiment shows an example of a configuration and operation in which a point in the vicinity of the recognition target data (hereinafter, the recognition target vicinity) is predicted as a representative point to be compared with the recognition target data, the distance between the representative point and the recognition target data is calculated, and abnormality detection is performed based on that distance. Here, it is assumed that learning has been performed in advance so that the representative points predicted from the recognition target data represent normal data in the recognition target vicinity. The learning described here is of two types: learning of the processing that generates representative points, and learning of the processing that performs feature extraction. As will be described later, these two types of learning processing may be regarded as a single learning system.
The representative points predicted from the recognition target data represent learning data in the recognition target vicinity. Therefore, using the representative points makes it possible to recognize the recognition target with fewer comparisons than when all the learning data are compared with the recognition target data, or when all the learning data are clustered and every cluster center is compared with the recognition target. The distance to the neighborhood, which would ordinarily be determined from the recognition target data and the recognition target neighborhood data (or their cluster centers and the like), is instead obtained as the distance to a "point to be compared with the recognition target data" that is predicted, from the recognition target data, to "probably be here". For this reason, the present embodiment sometimes uses the expression "predicting" a representative point; in other words, it can also be said that a point predicted to be representative is "generated". In the present embodiment, the description first assumes a fixed number M of representative points, and the feature extractor and the representative point generator are learned so that M representative points suffice. Although the term representative point is used here, the configuration is not limited to points, as described later. It will also be described later that, depending on how the representative points are learned, they can be used for purposes other than the pattern in which "the representative point is considered to represent normal data in the recognition target vicinity".
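As a rough illustration of why a fixed number M of generated points gives a comparison cost independent of the amount of learning data, consider the following sketch. Everything in it is hypothetical: `generate_representatives` is a stand-in for the learned representative point generator (here it merely snaps the target to a coarse grid so the example runs), and taking the minimum distance and thresholding it are assumed choices, since the text above only states that detection is based on the distance.

```python
import math

M = 3  # fixed number of representative points (illustrative value)

def generate_representatives(target_feature, m=M):
    """Hypothetical stand-in for the learned representative point generator.

    A learned generator would predict points representing normal data near
    the target. Here the target is snapped to a coarse grid and m nearby
    points are returned, purely so that the sketch is executable.
    """
    base = [float(round(v)) for v in target_feature]
    return [[b + 0.1 * k for b in base] for k in range(m)]

def anomaly_score(target_feature):
    """Smallest distance from the target to its M generated points.

    Exactly M distances are computed, however much learning data exists.
    """
    reps = generate_representatives(target_feature)
    return min(math.dist(target_feature, r) for r in reps)

def is_abnormal(target_feature, threshold=0.3):
    """Assumed decision rule: abnormal if the score exceeds a threshold."""
    return anomaly_score(target_feature) > threshold
```

With this stand-in, a target that coincides with a generated point, such as `[1.0, 2.0]`, scores zero, while `[1.4, 2.6]` lies away from all three generated points and is flagged.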

In the following, an example is shown in which a Convolutional Neural Network (CNN) is used for feature extraction and representative point generation and is learned. The representative points are generated in the feature space of the CNN and are hereinafter called representative feature amounts. As will be described later, the CNN is not an essential component but one example of a feature extractor.

The present embodiment takes the configuration of an abnormality detection system as a specific example, namely a system having a function of detecting an abnormality in video captured by a surveillance camera.
In the abnormality detection system according to the present embodiment, the monitoring target is captured by an imaging device such as a camera, and whether the monitoring target has an abnormality is determined based on the captured video data. When it is determined that there is an abnormality, a warning to that effect, for example a warning display or a warning sound, is issued to an observer stationed at a monitoring center such as a security room. The monitoring target includes, for example, the indoors and outdoors of ordinary homes, and public facilities such as hospitals and stations.

The operation of the abnormality detection system in the present embodiment consists of two stages, "at the time of learning" and "at the time of detection". The first stage is the operation performed when learning, and the second stage is the operation performed when actual detection processing is carried out based on the learned result.
FIG. 1 is a block diagram showing a schematic configuration example of an abnormality detection system as an example of the information processing apparatus according to the first embodiment. The abnormality detection system of this embodiment includes a learning device 10, a learning data storage unit D1, a recognition device 20, a recognition target data storage unit D2, a determination device 30, and a terminal device 40.

First, the configuration related to the operation at the time of learning in the abnormality detection system of the present embodiment will be described.
The abnormality detection system in FIG. 1 includes a learning device 10 as the configuration related to learning. Connections within and between these devices may be made via electronic circuits, via an external storage device, or via a network. For example, a mobile phone network or the Internet can be used as this network. The devices described below may likewise be connected by various methods; the same applies to the recognition device 20 and the like described later.
The learning device 10 includes a selection unit 11 and a learning unit 12.
The learning unit 12 includes a feature extraction unit 121, a representative feature amount generation unit 122, and a loss calculation unit 123.

Next, the operation at the time of learning in the abnormality detection system will be described with reference to FIG. 2. FIG. 2 is a flowchart showing the operation at the time of learning in the abnormality detection system. In the following description, steps S200 to S204 in the flowchart of FIG. 2 are abbreviated as S200 to S204. The same applies to the other flowcharts described later.

First, in S200, the learning device 10 performs initialization processing. For example, the learning device 10 initializes the CNN used during learning. A specific example will be described later.
Next, in S201, the selection unit 11 reads the learning target data to be processed from the learning data storage unit D1 and sends minibatch data to the learning unit 12. The feature extraction unit 121 receives the learning data from the selection unit 11, extracts feature amounts from the received data with the feature extractor, sends the recognition target feature amount (hereinafter, the target feature amount) to the representative feature amount generation unit 122, and sends the minibatch feature amounts to the loss calculation unit 123. The representative feature amount generation unit 122 generates representative feature amounts from the received target feature amount with the representative feature amount generator and sends them to the loss calculation unit 123. The loss calculation unit 123 calculates a loss value from the received minibatch feature amounts and representative feature amounts based on a loss function. Hereinafter, the loss value and the gradient information obtained by, for example, differentiating the loss function are called information on the error, or simply the error. Details will be described later.

Next, in S202, the loss calculation unit 123 sends the error information to the representative feature amount generation unit 122 and the feature extraction unit 121. Based on the received error information, the representative feature amount generation unit 122 further calculates the error of the representative feature amount generator and sends it to the feature extraction unit 121. The feature extraction unit 121 performs learning processing of the feature extractor using the errors received from the loss calculation unit 123 and the representative feature amount generation unit 122 (that is, the loss of the upper layers of the CNN). Details will be described later.

In S203, the representative feature amount generation unit 122 performs learning processing of the representative feature amount generator using the error information received from the loss calculation unit 123 in S202.
After that, in S204, when the learning is complete, the learning device 10 stores the feature extractor and the representative feature amount generation unit learned so far in the learning data storage unit D1.
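The data flow of S201 can be sketched as a forward pass only. This is a sketch under stated assumptions, not the patent's implementation: a single linear layer stands in for the CNN feature extractor, the generator is modeled as the target feature shifted by M offsets, and the loss (mean squared distance from each minibatch feature to its nearest representative feature) is an assumed placeholder, because the actual loss function is described later in the source. The error propagation of S202 and S203 is omitted.

```python
import math

def extract_features(image, weights):
    """Stand-in feature extractor: one linear layer instead of a CNN."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]

def generate_representatives(target_feature, offsets):
    """Stand-in generator: the target feature shifted by M offsets."""
    return [[t + o for t, o in zip(target_feature, off)] for off in offsets]

def loss(minibatch_features, representatives):
    """Assumed loss: mean squared distance from each minibatch feature
    to its nearest representative feature."""
    total = 0.0
    for f in minibatch_features:
        total += min(math.dist(f, r) ** 2 for r in representatives)
    return total / len(minibatch_features)

# Minibatch: the first entry is the recognition target, the rest its neighbors.
minibatch = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
weights = [[1.0, 0.0], [0.0, 1.0]]   # identity "extractor" parameters
offsets = [[0.0, 0.0], [0.1, -0.1]]  # generator parameters for M = 2

# Feature extraction unit 121: features for the whole minibatch.
features = [extract_features(x, weights) for x in minibatch]
target_feature = features[0]
# Representative feature amount generation unit 122.
reps = generate_representatives(target_feature, offsets)
# Loss calculation unit 123.
loss_value = loss(features, reps)
```

In a real system the gradient of `loss_value` with respect to both parameter sets would then be propagated back, first through the generator and then through the extractor, as S202 and S203 describe.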

Next, the operation of the selection unit 11 will be described in detail with reference to FIG. 3. FIG. 3 is a flowchart showing the learning data selection operation of the selection unit 11.
First, in S301, the selection unit 11 receives learning data from the learning data storage unit D1. The learning data here are assumed to be person images created from surveillance video obtained from a single surveillance camera.

Here, FIG. 4 shows an example of an image obtained by extracting one frame from a surveillance video. The image 401 shown in FIG. 4 is an example image from surveillance video at an intersection, and objects 402 to 405, such as captured subjects, are assumed to be present in the image 401. Areas 406 to 409, indicated by dotted lines in the image 401, are Bounding Boxes extracted by the human body region extractor of the selection unit 11. Each partial image enclosed by a Bounding Box corresponds to a human body image. The Bounding Boxes shown here are only one specific example of extracted human body regions; for example, small regions along the contour of a captured object may instead be extracted by the background subtraction method described later.

There are multiple methods for extracting human body regions as described above, for example the background subtraction method, the object detection/tracking method, and the region segmentation method. When the object to be monitored is known in advance, such as a human body, the object detection/tracking method, which focuses on detecting and tracking only the target object, is considered relatively suitable. One such object detection/tracking method is disclosed in Reference 1, and that method may be used.

Reference 1: Real-Time Tracking via On-line Boosting. H. Grabner, M. Grabner and H. Bischof. Proceedings of the British Machine Vision Conference, pages 6.1-6.10. BMVA Press, September 2006.

Further, the selection unit 11 assigns teacher data to each human body image by using the teacher data given in advance to the learning data (human body images cut out of the surveillance video as human body regions, for example rectangular regions). For example, when rectangular human body regions are extracted, a Bounding Box of the human body region can be defined and teacher data assigned to that Bounding Box. When a human body region is instead extracted along the contour of the human body rather than as a rectangle, a mask of the human body region can be defined for the image and the human body image created based on that mask.

Here, teacher data is a label indicating how a target should be classified. Because the types of labels and which label is given to which target depend on the problem, in the present embodiment the user who uses or introduces the abnormality detection system decides these in advance and assigns teacher data to the learning data. For example, human body images obtained from normal video may be regarded as normal and labeled as normal. In the following, human body images that an observer has judged to be abnormal are labeled as abnormal. That is, the problem setting here is a two-class normal/abnormal problem concerning video (the human body images it contains) obtained from one specific surveillance camera.

The teacher data given in advance to the learning data can be assigned by the user to, for example, a region on the image of a captured subject. More specifically, the user can manually designate, for example, the region of the pedestrian object 403 and give that region a label indicating, for example, normal. When an extracted human body region overlaps regions to which teacher data has been given, the teacher data that overlaps the human body region with the largest area ratio may be assigned to it. Of course, teacher data need not be assigned in this way; for example, human body regions may be extracted in advance and the user may assign teacher data to each of them. Although the present embodiment shows an example of extracting human body regions, other object regions may of course be extracted, or the entire image may be subjected to recognition processing without extracting regions. Which regions are extracted and subjected to recognition processing depends on the problem and must be set accordingly. To simplify the description, it is assumed here that the learning data (human body images) have been created in advance from all the surveillance videos targeted by the learning processing.
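The largest-overlap rule for assigning teacher data can be sketched as follows. The box format `(x1, y1, x2, y2)`, the function names, and the behavior when nothing overlaps (returning a default) are assumptions for illustration; the source specifies only that the label overlapping the human body region with the largest area ratio may be assigned.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def assign_label(body_box, annotations, default=None):
    """Return the label whose annotated region covers the largest share
    of body_box.

    annotations is a list of (box, label) pairs drawn by the user.
    body_box is assumed to have a positive area.
    """
    body_area = (body_box[2] - body_box[0]) * (body_box[3] - body_box[1])
    best_label, best_ratio = default, 0.0
    for box, label in annotations:
        ratio = overlap_area(body_box, box) / body_area
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label

# Example: one extracted human body region and two user-annotated regions.
body_box = (10, 10, 50, 110)
annotations = [((0, 0, 30, 120), "normal"), ((25, 0, 60, 120), "abnormal")]
label = assign_label(body_box, annotations)
```

In this example the second annotation overlaps 2500 of the region's 4000 pixels against 2000 for the first, so its label is chosen.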

Next, in S302, the selection unit 11 preprocesses the obtained learning data. In the present embodiment, each image is resized to 224×224 pixels and the average image is then subtracted from it. Here, the average image is the average image of the learning data. In the present embodiment, the learning images are also augmented by horizontal flipping, and both the images before augmentation and the images after augmentation are treated as learning images.
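A minimal sketch of this preprocessing, using plain nested lists for single-channel images, is given below. Several details are assumptions not stated in the source: nearest-neighbor sampling for the resize, computing the average image from the resized images before augmentation, and subtracting that average from both the original and the flipped images.

```python
SIZE = 224  # target width and height in pixels

def resize_nearest(image, size=SIZE):
    """Resize a 2-D image (list of rows) by nearest-neighbor sampling."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def mean_image(images):
    """Pixel-wise mean over a list of equally sized images."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]

def subtract(image, mean):
    """Subtract the mean image from an image, pixel by pixel."""
    return [[p - m for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mean)]

def hflip(image):
    """Flip an image horizontally."""
    return [row[::-1] for row in image]

def preprocess(raw_images):
    """Resize, augment by horizontal flipping, then subtract the mean image.

    Returns the originals followed by their flipped copies.
    """
    resized = [resize_nearest(img) for img in raw_images]
    mean = mean_image(resized)
    augmented = resized + [hflip(img) for img in resized]
    return [subtract(img, mean) for img in augmented]
```

Calling `preprocess` on N raw images returns 2N images of 224×224 pixels each.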

Next, in S303, the selection unit 11 determines the learning data to be used as recognition targets and performs neighbor search processing for each recognition target. All learning data may each be treated as a recognition target, or important learning data may be selected in advance as recognition targets. Each recognition target chosen here is a single piece of learning data. That is, when a recognition target is chosen, a plurality of neighborhood data similar to it are selected. A recognition target can also be composed of multiple pieces of data, and such an example will be described later. Although it is possible to select only one piece of neighborhood data, selecting several is considered preferable when different kinds of neighborhood data distributions may exist in the recognition target vicinity.

In the first example described here, the space in which the neighbor search is performed is the image space. That is, the selection unit 11 regards the learning data, cut out of the surveillance video and preprocessed, as being distributed in the image space, and performs a neighbor search for the recognition target data. A space other than the image space, for example a feature space, may also be used, as described later. Any known means may be used for the neighbor search; here, the k-Nearest Neighbor (k-NN) method described in Reference 2 is used. The k-NN search may be executed only once at the beginning, at the start of each epoch, or each time recognition targets for learning are selected (that is, each time a minibatch is created).
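A neighbor search in image space can be sketched with a brute-force k-NN over flattened pixel vectors. The brute-force strategy, the Euclidean metric, and the function names are assumptions for illustration; the source permits any known neighbor search method.

```python
import math

def flatten(image):
    """Turn a 2-D image (list of rows) into a flat pixel vector."""
    return [p for row in image for p in row]

def k_nearest(target_index, images, k):
    """Indices of the k images closest to images[target_index],
    measured by Euclidean distance in pixel space."""
    target = flatten(images[target_index])
    dists = []
    for i, img in enumerate(images):
        if i == target_index:
            continue  # the target is not its own neighbor
        dists.append((math.dist(target, flatten(img)), i))
    dists.sort()
    return [i for _, i in dists[:k]]

# Tiny 2x2 "images" (illustrative values).
images = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[9, 9], [9, 9]],
    [[0, 2], [0, 0]],
]
neighbors_of_first = k_nearest(0, images, k=2)
```

The cost of this search grows with the number of learning images, which is acceptable here because it runs during learning rather than during detection.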

Reference 2: C. Huang et al. Learning Deep Representation for Imbalanced Classification. CVPR 2016.

Next, in S304, the selection unit 11 sets Minibatches as subsets of the learning data. In the first example of the present embodiment, each Minibatch groups one recognition target together with 99 pieces of its neighborhood data. The recognition target neighborhood is the one obtained by the neighbor search processing described above. All Minibatches may be created at once for use in learning, recreated each time the learning epoch advances, or created at any other timing during learning. In this example, all Minibatches are created at once at the beginning to obtain a Minibatch set. As an example of the learning processing, the method of Reference 3 described later may be used.

In S305, the selection unit 11 selects one minibatch from the minibatch set and sends it to the learning unit 12. Minibatches are selected, for example, in random order. Here, an epoch counts how many times all minibatches in the minibatch set have been learned (one epoch advances each time every minibatch has been learned once). In the present embodiment, as one example, completion of learning is determined based on a predetermined number of epochs.
Then, in S306, the selection unit 11 ends the learning process if the predetermined number of epochs has been reached.

Next, the operation of the learning unit 12 will be described in detail with reference to FIG. 5. FIG. 5 is a flowchart showing the learning operation of the learning unit 12.
First, in S501, the learning unit 12 initializes the CNN parameters (connection weights, bias terms, learning parameters, and so on). The network structure of the CNN must be determined before initialization. The network structure and initial parameters may, for example, be the same as those in Reference 3, or an independently defined network structure may be used.

Reference 3: A. Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS), 2012.

FIG. 6 is a schematic diagram showing an example of the CNN. An example of the configuration and operation of the CNN is briefly described with reference to FIG. 6. The feature extractor 620 and the representative point generator 621 in FIG. 6 show an example of the network structure of the CNN used in the present embodiment. In FIG. 6, the feature extractor 620 is composed of an input layer 601, a convolution1 layer 602, a pooling1 layer 603, a convolution2 layer 604, and a pooling2 layer 1305. FIG. 6 also shows that a convolution process 610, a pooling process 611, and a convolution process 612 are set as the processing between adjacent layers. The specific details of each process are the same as in Reference 3 and are omitted here. The convolution process 610 performs data processing using convolution filters, and the pooling process 611, in the case of max pooling for example, outputs local maximum values.
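The two kinds of processing named above can be shown on a single-channel toy image. This is a sketch under simplifying assumptions (one channel, no padding, a hand-picked 2×2 filter, non-overlapping 2×2 pooling); the actual layer settings of the embodiment are those of FIG. 6 and Reference 3.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [0, 3, 1, 0, 1]]
kernel = [[1, 0], [0, 1]]                  # toy filter: sums the main diagonal
feature_map = conv2d_valid(image, kernel)  # 5x5 input -> 4x4 feature map
pooled = max_pool2d(feature_map)           # 4x4 feature map -> 2x2 output
```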

FIG. 6 also shows that the convolution layers and pooling layers each contain a plurality of feature maps, and that a plurality of neurons exist at the position corresponding to each pixel of the image in the input layer. For example, if the learning image is an RGB image, there are three neurons corresponding to the RGB channels. If the learning image is an optical flow image carrying the motion information of the captured video, there are two types of neurons, representing the horizontal and vertical directions of the image respectively.

When a plurality of images are input at the same time, the number of neurons in the input layer can be increased in proportion to the number of input images. The present embodiment shows an example that targets standard RGB images. FIG. 6 illustrates the sizes 630 to 635 of the feature maps and other data in each layer during learning. In the notation (a, b, c, d) used for sizes 630 to 635, a is the dimension corresponding to the number of data items; for example, N is the minibatch size and M is the number of representative points described later. b can be regarded as the number of feature dimensions corresponding to the image channels, c as the number corresponding to the y-axis of the image, and d as the number corresponding to the x-axis of the image. Although the part described above is referred to here as the feature extractor, the CNN as a whole can, by its nature, also be regarded as a feature extractor, and it may be interpreted that way if necessary.

FIG. 6 also shows that the representative point generator 621 is composed of a target feature amount layer 605 and a representative feature amount layer 606. The target feature amount layer 605 holds the feature amount of the recognition target. The recognition target extraction process 613 follows the rule that the first item of the minibatch is the recognition target: from the intermediate feature amounts of size (N, 512, 32, 32) in the convolution2 layer, it extracts the first feature amount, of size (1, 512, 32, 32), which becomes the target feature amount. Naturally, as long as the recognition target can be identified, it need not be placed first in the minibatch; the extraction may instead simply take the target feature amount based on a marker indicating which item is the recognition target.
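Both ways of picking out the target feature amount, by position and by marker, can be sketched as follows. The function name `extract_target` and the list-of-flags representation of the marker are assumptions; strings stand in for real feature tensors.

```python
def extract_target(features, is_target=None):
    """Return the recognition target's feature from a minibatch of features.

    With no flags, the first item is taken to be the recognition target;
    otherwise the single item whose flag is True is returned.
    """
    if is_target is None:
        return features[0]
    matches = [f for f, flag in zip(features, is_target) if flag]
    if len(matches) != 1:
        raise ValueError("exactly one item must be flagged as the target")
    return matches[0]

# Rule-based extraction: the target is stored first.
first = extract_target(["f_target", "f_n1", "f_n2"])
# Marker-based extraction: the target may sit anywhere in the minibatch.
flagged = extract_target(["f_n1", "f_target", "f_n2"], [False, True, False])
```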

Next, the learning unit 12 performs the representative point generation process 614, which uses the feature amount of the target feature amount layer to generate the M representative feature amounts of the representative feature amount layer 606. A specific example of the representative point generation process 614 is described later.
Finally, the learning unit 12 obtains a loss value 607 through the loss calculation process 615. A specific example of the loss calculation process 615 is described later.
The above is an example of the configuration and operation of the CNN during learning. The CNN parameters can be initialized by a known method, and parameters such as the learning rate can be set as needed.

Next, in S502, the learning unit 12 initializes the variable Iteration to 0. The variable Iteration counts how many times a minibatch has been read and the CNN updated. If necessary, parameters required for learning, such as the learning rate, may be changed based on Iteration or the number of epochs. For example, to stabilize learning, the learning rate may be multiplied by 1/10 once the number of epochs exceeds 10.
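The learning rate example above amounts to a one-step schedule. In the sketch below, the base rate of 0.01 is an assumed value, since the text does not give one; only the threshold of 10 epochs and the factor of 1/10 come from the description.

```python
def learning_rate(epoch, base_lr=0.01, drop_after=10, factor=0.1):
    """Step schedule: scale the rate by `factor` once epoch exceeds `drop_after`."""
    return base_lr * factor if epoch > drop_after else base_lr
```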

Next, in S503, the learning unit 12 receives a minibatch for learning from the selection unit 11.
In S504, the learning unit 12 executes the recognition processing of the CNN for learning. Specifically, the learning unit 12 performs hierarchical feature extraction in S5041, representative point generation in S5042, and loss calculation in S5043. These processes were outlined above in the description of FIG. 4; their details are given later. Basically, the loss is calculated from the N feature amounts of the convolution2 layer and the M generated representative points (representative feature amounts). Although the order illustrated here is feature extraction, then representative point generation, then loss calculation, various other configurations are possible, such as repeated execution as described later. Learning may also be performed exclusively, for example by learning only the feature extractor and, once that is finished, learning only the representative point generator. The loss calculation used when learning only the feature extractor is described later.

Next, in S505, the learning unit 12 backpropagates the learning error through the CNN based on the loss calculated in S5043. In the present embodiment, the CNN is trained by combining error backpropagation with stochastic gradient descent (SGD). This combination is described in detail in Reference 3 and is not detailed here; in essence, it repeats the procedure of selecting a minibatch and successively updating the CNN parameters. Error information generally propagates in the direction opposite to the data flow of the recognition processing, and here too it propagates to the feature extractor, the representative point generator, and so on.

Next, in S506, the learning unit 12 updates the CNN parameters using the error information propagated in S505. The learning unit 12 also adds 1 to the variable Iteration, for example.
Next, in S507, the learning unit 12 determines whether all minibatches have been used for learning. If so, the process proceeds to S508; otherwise, it returns to S503.

In S508, the learning unit 12 adds 1 to the epoch count.
Next, in S509, the learning unit 12 determines whether the epoch count has reached the preset upper limit. If it has, the learning unit 12 ends the learning of the CNN and proceeds to S510; otherwise, it proceeds to S503. In many cases, the stopping condition for neural network (NN) learning is either whether a predetermined upper limit on the number of epochs has been reached, or an automatic criterion based on, for example, the gradient of the learning curve. In the present embodiment, the stopping condition is whether the upper limit on the number of epochs has been reached, and the upper limit is set to 20000 as an example. Techniques that depend on the epoch count, such as lowering the learning rate of the NN as the number of epochs grows, may also be introduced.
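The control flow of S502 to S509 can be summarized as a nested loop. This is a sketch of the loop structure only: `update_fn` is a hypothetical callback standing in for the forward pass, backpropagation, and parameter update of S504 to S506, and the random visiting order follows the example given for S305.

```python
import random

def train(minibatch_set, update_fn, max_epochs, seed=0):
    """Run the S503-S509 control flow; `update_fn` stands in for S504-S506."""
    rng = random.Random(seed)
    iteration = 0                      # S502: number of parameter updates so far
    epoch = 0
    while epoch < max_epochs:          # S509: stop at the epoch upper limit
        order = list(range(len(minibatch_set)))
        rng.shuffle(order)             # minibatches are visited in random order
        for idx in order:              # S503/S507: until every minibatch is used
            update_fn(minibatch_set[idx])   # S504-S506: forward, backprop, update
            iteration += 1
        epoch += 1                     # S508: one full pass completed
    return iteration, epoch

seen = []
iterations, epochs = train(["b0", "b1", "b2"], seen.append, max_epochs=4)
```

With three minibatches and four epochs, the callback runs twelve times, and every minibatch is visited exactly once per epoch.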

In S510, the learning unit 12 stores the trained CNN model in the learning data storage unit D1. Because the recognition device uses the model in the subsequent recognition processing, the model is assumed here to be kept in the memory of the computer used for learning; however, a separate storage unit may be provided, or another storage method may be used.

In the CNN-based feature extraction described above, feature learning proceeds so as to reduce the loss value given by the loss function. For example, learning brings the representative points and the recognition target data closer together. This enables accurate recognition processing.

Next, the representative point generation process is described in detail with an example. In the first example of the present embodiment, as described above, the M representative points are generated based on the recognition target data. In the example of the representative point generation process 614 shown in FIG. 6, the process converts a feature amount of size (1, 512, 32, 32) into feature amounts of size (M, 512, 32, 32) (the sizes are examples).

The representative point generation process need only generate M representative points based on the recognition target, so the conversion described above can be applied. Any representative point generation process may be adopted as needed. The following example is used here.
First, the learning unit 12 copies the feature amount of size (1, 512, 32, 32) M times to create a feature amount of size (M, 512, 32, 32), and then reshapes it to size (1, 512·M, 32, 32). Here, "·" denotes multiplication of scalar values. The reshaping may be done by concatenating the M feature amounts of size (1, 512, 32, 32) along the second dimension, or by another known method. Concatenation is a process that joins two pieces of data or feature amounts along a specified dimension.

Next, the learning unit 12 performs a conversion from the feature amount of size (1, 512, 32, 32) to a feature amount of size (1, 512·M, 32, 32). Before the conversion there are simply M copies of the same feature amount; the conversion is intended to produce M different feature amounts. A known method can be used for this process, so a detailed description is omitted. For example, if no strongly nonlinear transformation of the feature amount is required, the conversion may pad the feature amount with a margin of size 1 and apply a 3×3 convolution. The example above copies the target feature amount M times before the conversion, but a known method using deconvolution may be used instead; that is, the deconvolution may convert the target feature amount of size (1, 512, 32, 32) into a feature amount of size (1, 512·M, 32, 32). Known methods other than deconvolution may also be used.

Finally, the learning unit 12 reshapes the converted feature amount of size (1, 512·M, 32, 32) into feature amounts of size (M, 512, 32, 32). This reshaping may be done, for example, by the method described above. These are reference examples of how the representative point generator can be configured; other known methods may be used as (part of) the representative point generator.
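The copy, concatenate, convolve, and reshape steps described above can be traced on toy sizes. This sketch assumes nested lists in place of tensors, sizes C=2, H=W=4, M=3 in place of 512, 32, 32, and random untrained weights in place of learned ones; the function names are hypothetical.

```python
import random

def conv3x3_same(channels, weights):
    """3x3 convolution with a zero margin of 1, so height and width are kept.

    channels: list of C_in maps (H x W); weights[o][c] is a 3x3 kernel.
    """
    h, w = len(channels[0]), len(channels[0][0])

    def px(c, y, x):                   # zero outside the map (the margin)
        return channels[c][y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [[[sum(weights[o][c][i][j] * px(c, y + i - 1, x + j - 1)
                  for c in range(len(channels))
                  for i in range(3) for j in range(3))
              for x in range(w)] for y in range(h)]
            for o in range(len(weights))]

def generate_representatives(target, m, weights):
    """Copy -> concatenate on the channel axis -> convolve -> split into m."""
    c = len(target)
    stacked = [ch for _ in range(m) for ch in target]         # (1, C*m, H, W)
    converted = conv3x3_same(stacked, weights)                # (1, C*m, H, W)
    return [converted[k * c:(k + 1) * c] for k in range(m)]   # (m, C, H, W)

C, H, W, M = 2, 4, 4, 3                # toy sizes instead of 512, 32, 32
rng = random.Random(0)
target = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
# Untrained random kernels, shape (C*M outputs, C*M inputs, 3, 3).
weights = [[[[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(3)]
            for _ in range(C * M)] for _ in range(C * M)]
reps = generate_representatives(target, M, weights)
```

Because each output channel has its own kernels, the M identical copies come out as M different feature amounts, which is the stated aim of the conversion.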

Examples of configuring the representative point generator by other methods are described later. The second embodiment, described later, explains an example that uses recently developed generative-model methods for representative point generation. M may be 1, but using a plurality of representative points is considered preferable for improving recognition accuracy. When N is set to the total number of learning data items, an objective function similar to that of clustering is obtained.

Next, the loss calculation process is described in detail with an example. In the first example of the present embodiment, as described above, the M representative points are compared with a minibatch of N data items that includes the recognition target data, and the loss is calculated. There are variations in how the loss calculation can be performed; when the problem is solved as a two-class (normal/abnormal) problem as described above, the loss function shown in equation (1) can be used.

Here, X in equation (1) denotes the feature amounts of the minibatch data, and A denotes the representative feature amounts. In particular, X₀ is the recognition target data. At least the minibatch data other than the recognition target are assumed to be normal. ||·|| is a function that gives the distance between two data items (feature amounts); for example, the squared Frobenius norm is used. The value of equation (1) becomes smaller the more accurately the representative feature amounts A approximate the normal minibatch data (all data other than the recognition target are normal). When the label of the recognition target is abnormal, the value of the loss function becomes smaller the farther A is from the recognition target data. A can therefore be interpreted as representing normal representative feature amounts.
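Equation (1) itself is not reproduced in this text, so the following is not the patent's formula. It is only one illustrative loss with the two properties stated above: it shrinks as the representative feature amounts approach the normal data, and, when the recognition target is labeled abnormal, it shrinks as they move away from the target. The hinge with a `margin` parameter, the function names, and the toy two-dimensional features are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_class_loss(features, reps, target_is_abnormal, margin=1.0):
    """Illustrative loss, not the patent's equation (1).

    features[0] is the recognition target X_0; the rest are normal data.
    Normal data are pulled toward every representative. An abnormal target
    is pushed away through a hinge that is zero beyond `margin`.
    """
    loss = 0.0
    for n, x in enumerate(features):
        for a in reps:
            d = sq_dist(x, a)
            if n == 0 and target_is_abnormal:
                loss += max(0.0, margin - d)   # smaller when A is far from X_0
            else:
                loss += d                      # smaller when A is close to X
    return loss

reps = [[0.0, 0.0], [0.2, 0.0]]
normal = [[0.0, 0.1]]
near = two_class_loss([[0.1, 0.1]] + normal, reps, target_is_abnormal=True)
far = two_class_loss([[3.0, 3.0]] + normal, reps, target_is_abnormal=True)
all_normal = two_class_loss([[0.1, 0.0]] + normal, reps, target_is_abnormal=False)
```

In this toy case an abnormal target far from the representatives (`far`) gives a lower loss than one close to them (`near`), matching the behavior described for equation (1).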

Equation (1) calculates the distances between all minibatch feature amounts and all representative feature amounts. To handle the case where the data near the recognition target are multimodal, only the distance between each representative feature amount and its nearest minibatch feature amount may be used instead. This enables matching that takes multimodality into account. The nearest neighbor can be found, for example, with k-NN in the feature space. Alternatively, only the distance between each minibatch feature amount and its nearest representative feature amount may be used. If necessary, k may be set to any number other than 1, and the distances to the k nearest items calculated. For example, when k = M, the loss function is the same as equation (1).
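The nearest-neighbor variant can be sketched for the case where each minibatch feature amount is matched to its k nearest representative feature amounts. The function name and the toy bimodal data are assumptions, and only the distance term for normal data is shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest_rep_loss(features, reps, k=1):
    """For each feature, sum the squared distances to its k nearest representatives."""
    total = 0.0
    for x in features:
        dists = sorted(sq_dist(x, a) for a in reps)
        total += sum(dists[:k])
    return total

features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]   # two well-separated modes
reps = [[0.0, 0.0], [5.0, 5.0]]                   # one representative per mode
nearest = k_nearest_rep_loss(features, reps, k=1)
full = k_nearest_rep_loss(features, reps, k=len(reps))  # k = M: all pairs
```

With k = 1 each feature is charged only for the representative of its own mode, so the loss stays small; with k = M every feature is also charged for the distant mode, which is the all-pairs behavior.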

Equation (1) is differentiable within each of its cases and can therefore be used as the objective function for the error backpropagation described above. To maintain the diversity of the representative points, for example, once a minibatch data item has been assigned to a representative point, other data items in the minibatch may be prevented from being assigned to the same representative point. In that case, when N > M, only M data items in the minibatch contribute to learning, whereas when N < M, only N representative points are used for learning; the number of representative points is thus dynamic. Furthermore, the multitask learning described later may be used to determine, for example, whether each data item in the minibatch is noise (a disturbance), and data items in the minibatch may be selected based on their degree of noise. N then varies, which can also make the number of representative points dynamic. If allowing only one assignment per representative point is too strict a constraint, a small number of minibatch data items may be allowed to be assigned to each representative point.
The above is the description of the operation of the learning device 10. Each unit is described in further detail later.
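The one-to-one assignment between minibatch data and representative points mentioned in the description of the loss can be sketched with a greedy matching. The text does not specify an assignment procedure, so the greedy closest-pair-first rule and the function name are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_one_to_one(features, reps):
    """Greedily pair features with representatives, closest pairs first.

    Each feature and each representative is used at most once, so the number
    of pairs is min(N, M). Returns (feature_index, representative_index) pairs.
    """
    pairs = sorted((sq_dist(x, a), n, m)
                   for n, x in enumerate(features)
                   for m, a in enumerate(reps))
    used_n, used_m, result = set(), set(), []
    for _, n, m in pairs:
        if n not in used_n and m not in used_m:
            used_n.add(n)
            used_m.add(m)
            result.append((n, m))
    return result

features = [[0.0], [0.1], [1.0]]       # N = 3 minibatch features
reps = [[0.02], [0.9]]                 # M = 2 representative points
assignment = assign_one_to_one(features, reps)
```

Here N > M, so only two of the three features are paired and contribute to learning, as the text describes.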

Next, the configuration involved in the detection-time operation of the abnormality detection system is described.
The recognition device 20 of the abnormality detection system in FIG. 1 includes an input unit 21 and a recognition unit 22.
The recognition unit 22 includes a feature extraction unit 221, a representative feature amount generation unit 222, and a distance calculation unit 223.
The determination device 30 includes an abnormality determination unit 31.
The terminal device 40 includes a display unit 41. The terminal device 40 may be, for example, the display of a PC (Personal Computer), a tablet PC, a smartphone, or a feature phone.

The detection-time operation of the abnormality detection system is described below with reference to FIG. 7. FIG. 7 is a flowchart of the abnormality detection operation of the abnormality detection system.
First, in S701, the input unit 21 receives recognition target data (data to be checked) from the recognition target data storage unit D2 and sends it to the recognition unit 22. The recognition target data is given the same preprocessing as during learning: processing such as subtracting the mean image is applied, while image flipping and similar processing need not be performed. The recognition unit 22 performs the recognition processing of the CNN on the received recognition target data. The CNN model used here is the one obtained during learning. If necessary, a model other than the one obtained during learning of the abnormality detection system may be used; in the present embodiment, the model trained as described above is used.

Taking FIG. 6 as an example, giving a single recognition target at detection time corresponds to FIG. 6 with N = 1. A specific example is shown in FIG. 8. In FIG. 8, processes that are the same as in FIG. 6 are given the same reference numerals as in FIG. 6, and duplicate descriptions are omitted.
In FIG. 8, sizes 830 to 833 indicate that, in the present embodiment, only one recognition target is input at detection time and passed through the recognition processing. In FIG. 8 there is no loss calculation process; instead, a distance 807 is obtained by a distance calculation process 815 that calculates the distance between the target feature amount and the representative feature amounts.

Next, in S702, the determination device 30 performs the abnormality determination on the result of the CNN recognition processing for the recognition target data. In the present embodiment, this result is the distance between the recognition target and the generated representative points.

During learning in the present embodiment, the representative points are trained to represent the normal data near the recognition target. Accordingly, when a recognition target is given, the representative points generated from it are points predicted to exist in its neighborhood. The distance between the recognition target and the representative points can therefore be regarded as the distance between the recognition target and the normal points expected to exist near it. Because a larger distance suggests a greater degree of abnormality, abnormality can be detected based on this distance. The threshold for the abnormality determination is assumed here to be a predetermined value. The determination device 30 determines that the target is normal when the distance is within the threshold and detects it as abnormal when the distance exceeds the threshold. The threshold may be set by an operator, determined automatically, or determined by some other method such as learning.
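The threshold test can be written in a few lines. How the distances to the M representative points are combined into a single value is not stated in this passage, so taking the minimum is an assumption, as are the function name, the squared Euclidean distance, and the threshold of 0.5.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_abnormal(target_feature, reps, threshold):
    """Flag the target as abnormal when its distance to the closest
    representative point exceeds `threshold` (using the minimum is an assumption)."""
    distance = min(sq_dist(target_feature, a) for a in reps)
    return distance > threshold, distance

reps = [[0.0, 0.0], [1.0, 0.0]]
flag_near, d_near = is_abnormal([0.1, 0.0], reps, threshold=0.5)  # within threshold
flag_far, d_far = is_abnormal([4.0, 3.0], reps, threshold=0.5)    # beyond threshold
```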

Next, in S703, the terminal device 40 receives the abnormality determination result from the determination device 30 and performs display processing based on it. For example, when an abnormality determination result indicating that the detection target is abnormal is received, the display unit 41 may perform a warning process. The warning process may depend on the functions of the terminal device 40. For example, if a lamp and a siren are provided, a warning sound may be emitted while the lamp blinks; if a display for checking video is provided, the abnormal location may be highlighted on the normal surveillance video. Highlighting the abnormal location requires identifying it in the video. For this purpose, screen coordinate data may, for example, be attached to the input detection target data and used according to the abnormality determination result.

Next, in S704, the recognition device 20 returns to S701 if data to be checked remains in the recognition target data storage unit D2; otherwise, it ends the detection-time operation.

The description above presented equation (1) as an example for the two-class problem, but the present embodiment is also applicable to other problems. For example, in a one-class problem (also called normal learning in the context of abnormality detection), the loss can be calculated by a loss function such as equation (2) and used for learning.

In Equation (2), X is the minibatch data (or its features) containing only normal data. As with Equation (1) above, only the distances to neighboring representative points may be computed, without comparing against all representative points. Normal learning is particularly effective when only normal data is available at learning time, and it can be used, for example, to analyze the behavior of a person captured by a surveillance camera or for visual inspection of products in a factory. Depending on the use case, a feature extractor other than one that requires substantial processing power, such as a CNN, may be preferred. In that case, another known feature extractor may be used instead of a CNN. When feature learning is difficult, only the representative point generator may be trained, without feature learning.
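
Equation (2) itself is not reproduced in this text, so the following is only a minimal sketch of a one-class loss of the kind described: the mean squared distance between the features of normal minibatch data and the representative points, with an optional restriction to the nearest representative points. The function name `one_class_loss`, the parameter `k`, and the use of squared Euclidean distance are assumptions for illustration, not the form given in the original.

```python
import numpy as np

def one_class_loss(features, rep_points, k=None):
    """Mean squared distance from normal features to representative points.

    features:   (N, D) features of a minibatch containing only normal data
    rep_points: (M, D) generated representative points
    k:          if given, only the k nearest representative points per
                feature contribute (the "neighboring points only" variant)
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = features[:, None, :] - rep_points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    if k is not None:
        # Keep the k smallest distances in each row.
        d2 = np.sort(d2, axis=1)[:, :k]
    return float(d2.mean())

x = np.array([[0.0, 0.0], [1.0, 0.0]])   # normal features
a = np.array([[0.0, 0.0], [0.0, 2.0]])   # representative points
loss_all = one_class_loss(x, a)          # all representative points
loss_nearest = one_class_loss(x, a, k=1) # nearest representative point only
```

Restricting the sum to the nearest points lowers the cost per minibatch and stops distant representative points from dominating the loss.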

In abnormal behavior detection for surveillance, continued video collection may yield abnormal data that did not previously exist. Such data may then be used as teacher data for additional learning. That is, learning may first be performed based on Equation (2) and then additionally based on Equation (1). Additional learning is not limited to this problem setting and may be performed whenever new learning data is acquired. In that case, the entire CNN may be retrained, or only part of the configuration (for example, the feature extractor or the representative point generator) may be retrained. For example, in a surveillance-camera use case, performing feature learning separately for each camera incurs large computation and learning costs. However, if the learning cost of the representative point generator is relatively low, the feature extractor can be shared among multiple cameras and only the representative point generator learned per camera. The feature extractor used here may be one pre-trained in another domain.

Examples of solving a one-class or two-class problem have been described above, but problems with more classes can also be solved. For example, the present invention can be applied to classification problems such as image recognition. Specifically, up to M kinds of, for example, mutually exclusive labels can be assigned to the M representative points, and a loss function such as the one shown in Equation (3) below can be used.

In Equation (3), the subscript at the lower right of ||·|| indicates the condition under which the distance is computed, and C denotes the class of the recognition target data. The condition i ∈ C_j means "the label of the i-th representative point corresponds to the class of the j-th data item in the minibatch", and i ∉ C_j means the opposite. |·| denotes the number of elements in a set; used as a denominator, it normalizes the summed value. For example, |i ∈ C_j| is the number of cases in which the label of the i-th representative point corresponds to the class of the j-th data item in the minibatch, that is, the count of how many times the condition was satisfied. This correspondence must be decided in advance. For example, C_j is the 0th label when j = {0, 1, ..., 30} and the 1st label when j = {31, 32, ..., 60}. In other words, each representative feature generated from X₀ is assigned a label in advance, and learning can be regarded as making each representative feature be generated close to data of the recognition target class tied to that label. At detection time, the distances between the representative features generated from the recognition target data and the target feature are calculated, and the recognition target class corresponding to the label of the representative feature closest to the target feature can be regarded as the class of the target feature. Although the representative point labels were exclusive in the case above, they need not be; for example, a multi-label configuration may be used.
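
The detection-time rule described above (assign the class tied to the label of the nearest representative feature) can be sketched as follows. The names `classify_by_nearest_rep`, `rep_points`, and `rep_labels` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import numpy as np

def classify_by_nearest_rep(target, rep_points, rep_labels):
    """Return the label of the closest representative point and its distance.

    target:     (D,) feature of the recognition target data
    rep_points: (M, D) representative features
    rep_labels: length-M sequence; rep_labels[i] is the label fixed in
                advance for the i-th representative point
    """
    d = np.linalg.norm(rep_points - target, axis=1)  # (M,) distances
    nearest = int(np.argmin(d))
    return rep_labels[nearest], float(d[nearest])

rep_points = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
rep_labels = [0, 1, 1]  # several representative points may share one label
label, dist = classify_by_nearest_rep(np.array([9.0, 1.0]), rep_points, rep_labels)
```

The returned distance can also serve as a confidence cue: a large distance to every representative point suggests the target fits none of the learned classes well.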

The examples above showed learning with a loss computed from the representative points and detection based on the representative points, but other tasks may be solved at the same time as multi-task learning. For example, when a value such as a degree of suspiciousness is attached to the recognition target as teacher data, the degree of suspiciousness can be estimated (regressed) in parallel with representative point generation. Squared error, for example, can be used to train the regression. When representative point generation and regression are both performed by a CNN, a CNN that performs regression based on the Convolution2 layer of FIG. 6 may be prepared separately and trained end-to-end, and its result may be used at detection time. Alternatively, processing such as regression may be performed based on the generated representative points. In the example of FIG. 6, the representative points of size (M, 512, 32, 32) are concatenated with recognition target data of size (1, 512, 32, 32) selected from the minibatch of size (N, 512, 32, 32) in the Convolution2 layer. The resulting feature of size (M+1, 512, 32, 32) is then reshaped into a feature of size (1, 512·(M+1), 32, 32) and used for recognition processing. This allows recognition processing to take the data in the minibatch into account. In an ordinary neural network, other data in the minibatch does not take part in the feature extraction of a given data item. In contrast, moving the data in the minibatch onto the feature dimension through this kind of processing makes it possible to extract features that take the minibatch data into account. In multi-task learning, the tasks may be learned in any order or simultaneously.
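
The concatenate-then-reshape step can be illustrated with NumPy arrays. This sketch uses small stand-in sizes in place of (M, N, 512, 32, 32), and the variable names are chosen here for illustration only.

```python
import numpy as np

M, N, C, H, W = 3, 5, 4, 2, 2   # small stand-ins for (M, N, 512, 32, 32)
rng = np.random.default_rng(0)
rep_points = rng.normal(size=(M, C, H, W))  # generated representative points
minibatch = rng.normal(size=(N, C, H, W))   # Convolution2-layer features
target = minibatch[0:1]                     # (1, C, H, W) recognition target

# Concatenate along the data dimension: (M+1, C, H, W).
stacked = np.concatenate([rep_points, target], axis=0)

# Move the data dimension onto the channel dimension: (1, C*(M+1), H, W).
merged = stacked.reshape(1, (M + 1) * C, H, W)
```

After the reshape, channels `m*C` to `(m+1)*C - 1` of `merged` hold the m-th stacked item, so a later layer sees the representative points and the target side by side as channels.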

The notation used above for feature sizes (for example, (N, 512, 32, 32)) is only meant to illustrate the size of a feature; the feature may be held in any form as data inside the computer. For example, a feature of size (N, 512, 32, 32) may internally be an array of size (1, 512·N, 32, 32). Because the processing described above may reshape features repeatedly, an efficient data representation is desirable. Also, even when a feature is held internally with size (N, 512, 32, 32), the convolution may, for example, be extended to the first dimension (the number of data items) so that the feature can be handled without reshaping it to (1, 512·N, 32, 32). As one example, ordinary convolution uses as many convolution filters as the size of the second dimension (the number of channels), which is 512 filters in the (N, 512, 32, 32) example, whereas convolution extended to the first dimension uses 512·N filters in the same example. This is equivalent to reshaping to (1, 512·N, 32, 32) and then convolving, so the computational cost can be reduced by the cost of the reshape.
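
The equivalence stated above can be checked numerically. The sketch below assumes a 1×1 kernel for brevity (a spatial kernel would add two more summed axes) and uses `np.einsum` to contract over the data and channel dimensions jointly; the sizes and names are illustrative only.

```python
import numpy as np

N, C, H, W, K = 3, 4, 2, 2, 5    # K output channels; small stand-in sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(N, C, H, W))
w = rng.normal(size=(K, N, C))   # 1x1 kernel spanning data and channel dims

# Convolution extended to the first dimension: contract over n and c directly.
out_extended = np.einsum("knc,nchw->khw", w, x)

# Reference path: reshape to (1, N*C, H, W), then an ordinary 1x1 convolution.
x_flat = x.reshape(1, N * C, H, W)
w_flat = w.reshape(K, N * C)
out_reshaped = np.einsum("kc,bchw->bkhw", w_flat, x_flat)[0]

same = bool(np.allclose(out_extended, out_reshaped))
```

Both paths sum over the same N·C kernel slices; the extended form simply avoids materializing the reshaped array.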

In the examples above, the representative points were mainly points in the feature space, but they may instead be points in the same space as the input data. Specifically, when the input data is an image, representative points may be generated and used in the image space. In such a case, for example as in Reference 4, an autoencoder can be used to generate a reconstructed image of the recognition target data in the image space, which is then compared with the representative points. Representative points can be generated in various ways, including a method similar to the one described for the CNN. Although an example of generating representative points in the image space with an autoencoder was described here, the representative points may instead be generated in an intermediate layer of the autoencoder. In that case, the reconstruction error and the distance to the representative points may be optimized jointly in the multi-task learning form described above. When representative points are generated in the original space (for example, the image space), not only the minibatch data reconstructed in the image space but also the input learning data itself can be used to train the representative point generator. For example, a loss term that makes the generated representative points approach the input learning data may be added, or a loss function on the relationship between the input learning data and the representative points may be designed by applying other known methods. If the input learning data carries teacher information such as abnormal or normal, learning may be performed so that, for example, the distance between abnormal learning data and the generated representative points becomes large. Specifically, for example, this can be achieved by adding a loss term whose value decreases as the squared error between the abnormal learning data and the generated representative points increases.

Reference 4: C. Zhou et al. Anomaly Detection with Robust Deep Autoencoders. KDD, 2017.

When the configuration is multi-source, for example when the input data consists of image data and text data, representative points may be generated only in the spaces of some of the input data or in the spaces of all of the input data. Of course, even in a configuration that is not multi-source, representative points may be generated only in the space of part of the input data.

In the present embodiment, the examples mainly performed the neighborhood search in the space of the input data to select neighboring images and the like, but the neighborhood search may be performed in any feature space. Furthermore, the neighborhood may be selected with a learning-based method. For example, by using a method that learns features and clustering simultaneously, the three components of feature extraction, clustering, and representative point generation can be learned together. For example, 100 randomly selected data items including the recognition target data are first treated as a minibatch for feature extraction and clustering. Then the 50 data items assigned to the same cluster as the recognition target may be used as the recognition target neighborhood for feature extraction and representative point generation. Feature extraction, clustering, and representative point generation can be performed by a common CNN, which makes learning efficient. For example, the CNN can be shared up to the Convolution2 layer of FIG. 6, with the subsequent processing split into separate branches for clustering and for representative point generation. The clustering-based example above is only one example, and other known techniques capable of neighborhood selection and search may be used.

As a first example, the present embodiment described a configuration and operation that perform a neighborhood search and build the minibatch by selecting neighboring data in order of proximity. As another method, for example, data lying within a fixed distance of the recognition target may be collected and then selected at random. This makes it possible to build different minibatches for the same recognition target. The range from which neighbors of the recognition target are selected may also be narrowed as learning epochs or iterations progress. This can gradually yield fine-grained feature representations and representative point generators. As noted above, these neighborhood selections may also be performed in a feature space.
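
A minimal sketch of this alternative selection scheme follows: neighbors are drawn at random from within a radius of the recognition target, and the radius shrinks as epochs progress. The functions `sample_neighbors` and `radius_schedule`, the geometric decay, and all numeric values are assumptions made for illustration; the original does not specify a schedule.

```python
import numpy as np

def sample_neighbors(data, target_idx, radius, size, rng):
    """Build a minibatch: the target plus random data within `radius` of it."""
    d = np.linalg.norm(data - data[target_idx], axis=1)
    is_other = np.arange(len(data)) != target_idx
    candidates = np.flatnonzero((d <= radius) & is_other)
    size = min(size, len(candidates))
    chosen = rng.choice(candidates, size=size, replace=False)
    return np.concatenate([[target_idx], chosen])

def radius_schedule(epoch, r0=4.0, decay=0.5, r_min=0.5):
    """Hypothetical schedule: shrink the selection radius geometrically."""
    return max(r_min, r0 * decay ** epoch)

rng = np.random.default_rng(0)
data = np.arange(10, dtype=float).reshape(-1, 1)  # points 0..9 on a line
batch_e0 = sample_neighbors(data, 5, radius_schedule(0), 4, rng)  # radius 4.0
batch_e2 = sample_neighbors(data, 5, radius_schedule(2), 4, rng)  # radius 1.0
```

Calling `sample_neighbors` repeatedly with the same target yields different minibatches, and by epoch 2 only the immediate neighbors of the target remain eligible.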

When selecting data near the recognition target, criteria other than proximity may also be used. Depending on how the criterion is defined, representative point generators with various properties can be learned. For example, data having the same attribute as the recognition target may be selected as the data near the recognition target. Specifically, in a task such as person matching, treating data of the same person as the recognition target as its neighboring data can be expected to generate representative points that resemble that person. The attribute may also be, for example, an attribute relating to object type.

In addition, for example in action recognition using surveillance video, persons who appear to belong to the same group in the video may be treated as having the same attribute. Also, learning can be expected to progress faster if more data that the recognition process tends to get wrong is selected as neighboring data. Error-prone data may be, for example, data that is frequently misclassified or data that lies far from the desired representative point; any definition may be used as needed. To make such data more likely to be selected, error-prone data within a certain neighborhood range may always be included in the minibatch, or may be given a higher selection probability. The set of error-prone data may be reviewed every learning epoch or every iteration, or may be selected based on feedback from a user via the terminal device 40 or the like.

Beyond error-proneness, if an importance is known for each data item, weighted learning may be introduced. Weighted learning here means, for example, scaling the distance between a representative point and a data item in the loss function according to the weight of that data item. For example, when the importance ranges from 1 to 10, the loss may be calculated by multiplying the importance by, for example, the squared difference between the representative point and the data item. As an example of using yet other criteria when selecting neighboring data, the neighboring data may include data captured under various lighting conditions or data captured from different viewpoints, cameras, and so on.
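
The weighting described above can be sketched as follows. Multiplying each squared difference by the importance follows the text; dividing by the sum of the importances is an added assumption so that the loss scale does not depend on the weights, and the name `weighted_rep_loss` is hypothetical.

```python
import numpy as np

def weighted_rep_loss(features, rep_point, importance):
    """Importance-weighted squared distance to one representative point.

    features:   (N, D) data features
    rep_point:  (D,) representative point
    importance: (N,) per-data weights, e.g. in the range 1 to 10
    """
    d2 = np.sum((features - rep_point) ** 2, axis=1)  # squared differences
    # Weighted sum, normalized by the total weight (an assumed choice).
    return float(np.sum(importance * d2) / np.sum(importance))

features = np.array([[1.0, 0.0], [0.0, 2.0]])
importance = np.array([1.0, 10.0])
loss = weighted_rep_loss(features, np.array([0.0, 0.0]), importance)
```

Here the second data item, with importance 10, dominates the loss, so the representative point is pulled toward it more strongly during learning.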

This makes it possible to learn representative point generators and the like that are robust to such variations. For example, in person image retrieval, images of the same person as the recognition target captured in various environments may be selected as the recognition target neighborhood, which is expected to further improve the robustness of the recognition process.

If necessary, the minibatch may be created using other criteria instead of selecting data near the recognition target. In that case, a representative point generator with different properties can be learned, rather than one that generates representative points based on proximity.

The present invention can also be used for tasks involving one-to-one matching, such as matching of person images. As an example, suppose that two person images to be matched are given and that recognition processing is used to check whether the two images show the same person. For this purpose, assume that a CNN with a Siamese structure such as the one shown in Reference 5 is used.

Reference 5: E. Ahmed et al. An Improved Deep Learning Architecture for Person Re-Identification. CVPR, 2015.

A specific example is shown in FIG. 9, which illustrates the configuration during learning. For simplicity, elements of FIG. 9 that are common to FIG. 6 are given the same reference numerals as in FIG. 6, and their description is omitted. As in the earlier examples, the minibatch used here consists of the recognition target data and data in the neighborhood of the recognition target given that data. The recognition target data here is one of the person images to be matched. The neighboring data here consists of images in which the recognition target person appears and which are similar to the recognition target data. One minibatch built in this way is given to each recognizer. To make learning efficient, exclusivity may be taken into account so that the data in the minibatches does not overlap. FIG. 9 includes a recognizer A 901 and a recognizer B 902, which process the two person images given in the example above, respectively. The outputs of the two recognizers are processed by the loss calculation processing 903 to obtain a loss value. The output of each recognizer consists of the features extracted from the data in that recognizer's minibatch and the generated representative points. That is, in the Siamese configuration, two sets of extracted minibatch features and two sets of generated representative points are obtained. In learning for one-to-one matching, when only positive example images are used, the loss calculation can use, for example, the loss function shown in Equation (4) below.

In Equation (4), using only positive example images means that every image pair used for learning shows the same person. Here, X^A and X^B are the features of the minibatch data of recognizer A and recognizer B, respectively (they may also be data in the image space), and A^A and A^B are the representative points of recognizer A and recognizer B, respectively. In particular, X₀^A and X₀^B are the recognition target data of recognizer A and recognizer B, respectively. a₁, a₂, a₃, and a₄ are the weights of the respective terms and are assumed here to be set in advance. Like Equation (2), Equation (4) is a loss function intended for one-class learning. It differs in that it considers distances between representative points and minibatch features across the two recognizers. Basically, the smaller each distance, the smaller the output of the loss function, and the balance among the terms is determined by a₁, a₂, a₃, and a₄ in Equation (4). When learning is to be based on an objective other than one-class learning, a function suited to that objective may be used. For example, in multi-class learning, the method described earlier in this embodiment may be used with a loss function that moves representative points and minibatch features with different labels apart and brings those with the same label closer together. Specifically, for two-class learning on positive and negative examples of whether two images show the same person, learning can be performed with the loss function shown in Equation (5) below.

In Equation (5), the upper expression on the right-hand side is identical to Equation (4), and the lower expression gives the loss function for a negative example (when the image pair shows different persons). For a negative example, some signs are inverted so that the target features, representative points, and so on obtained from the different recognizers are learned to move further apart. In this example it was assumed that the recognition target data and the other data in a minibatch always show the same person, but the other data may instead show different persons (including very similar-looking ones). In that case, representative points with different labels may be generated for representative points that represent a different person and those that represent the same person. The above described the learning case, but distance-based recognition processing can also be performed at detection time. Specifically, for example, a distance that takes the generated representative points into account can be calculated with Equation (6) below.

Whether the two images show the same person can be decided based on the distance given by Equation (6) and a predetermined threshold. Instead of a value that jointly considers the target features and representative points as in Equation (6), the distance of the closest pair may be adopted when the target features and representative points output by the different recognizers are compared. For example, distances such as the one between the target feature output by recognizer A and a representative point output by recognizer B are calculated, and the smallest distance value is used.
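
The closest-pair alternative can be sketched as below. Equation (6) is not reproduced in this text, so only the minimum-distance variant is shown. Which cross-recognizer pairs are compared (target feature to target feature, and each target feature to the other recognizer's representative points) is an assumption here, as are the names `min_cross_distance` and `same_person` and the threshold values.

```python
import numpy as np

def min_cross_distance(feat_a, reps_a, feat_b, reps_b):
    """Smallest distance among pairs formed across the two recognizers.

    feat_a, feat_b: (D,) target features output by recognizers A and B
    reps_a, reps_b: (M, D) representative points output by A and B
    """
    d_ab = np.linalg.norm(reps_b - feat_a, axis=1)  # A's target vs B's points
    d_ba = np.linalg.norm(reps_a - feat_b, axis=1)  # B's target vs A's points
    d_ff = np.linalg.norm(feat_a - feat_b)          # target vs target
    return float(min(d_ab.min(), d_ba.min(), d_ff))

def same_person(distance, threshold):
    """Decide 'same person' when the distance is within the threshold."""
    return distance <= threshold

feat_a = np.array([0.0, 0.0])
reps_a = np.array([[0.0, 1.0], [1.0, 0.0]])
feat_b = np.array([3.0, 0.0])
reps_b = np.array([[2.0, 0.0], [4.0, 0.0]])
d = min_cross_distance(feat_a, reps_a, feat_b, reps_b)
```

In this toy case the two target features are 3 apart, but the representative points bridge the gap, so the minimum cross distance is smaller than the direct feature distance.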

An example of one-to-one matching for person image matching and the like was described above, but another possible use case is selecting the best-matching data from multiple search candidates given multiple recognition target candidates. For example, suppose that in monitoring the spectator seats of a stadium, one wishes to identify a person who is behaving differently from the others. As a concrete example, Reference 6 states: "Figure 3: Example of footage of the spectator seats at a sports stadium. Among the people watching the game, only the person marked with a red circle keeps looking elsewhere, such as at a watch or smartphone, throughout, which is classified as suspicious behavior in this study. The faces in this video are approximately 25 to 30 pixels tall." Behaving differently from other people in this way is considered important evidence for a security observer trying to find a suspicious person.

Reference 6: K. Kurosawa et al. 映像解析を用いた安全安心技術開発のための評価用映像データベースの構築 (Construction of an evaluation video database for the development of safety and security technology using video analysis). 23rd Image Sensing Symposium, 2017.

As an example, consider determining whether a certain person is behaving differently from the other people on screen. This can be framed as the problem of selecting the best-matching data from multiple search candidates given multiple recognition target candidates. For simplicity, the following describes the case in which the recognition target candidates and the search candidates coincide, and the correspondence among the multiple search candidates is examined. The cases in which they do not overlap, or overlap only partially, are described later.

ここで、図１０に複数認識対象を用いて複数の検索候補とのマッチングを行う場合の構成の例を示す（学習時）。図１０の例において、構成の一部は図６の機能と同様の機能を持ち、その場合は図６と同一の符号を付与して、重複した説明は省略する。図１０の変形処理１００１は本実施形態で既に述べた変形処理であり、Ｍｉｎｉｂａｔｃｈのデータを全て第二の次元（特徴量の次元）に持っていく処理を行い、サイズ１００２の特徴量を作成する。ここで、Ｎは可変であってもよい。この特徴量を、サイズ６３５の特徴量に変換するために、正規化可変代表点生成処理１００３を用いる。 FIG. 10 shows an example of a configuration (at learning time) for matching a plurality of recognition targets against a plurality of search candidates. In the example of FIG. 10, parts of the configuration have the same functions as in FIG. 6; these are given the same reference numerals as in FIG. 6, and duplicate description is omitted. The transformation processing 1001 of FIG. 10 is the transformation processing already described in the present embodiment; it moves all of the minibatch data into the second dimension (the feature amount dimension) and creates a feature amount of size 1002. Here, N may be variable. To convert this feature amount into a feature amount of size 635, the normalized variable representative point generation processing 1003 is used.

正規化可変代表点生成処理１００３の例を図１０の左下に示す。ここでは（１，５１２，３２，３２）サイズの特徴量がＮ個あると捉え、（１，５１２，３２，３２）サイズの特徴量に対する代表点生成処理をＮ回繰り返す（Ｎ並列の処理を行ってもよい）。代表点生成処理は、本実施形態で説明した代表点生成処理を用いてよい。そして、Ｎ回繰り返された代表点生成処理は、足し合わされ、代表特徴量層６０６に入力される。このときＮ回足し合されるため、代表特徴量層６０６の入力の前にＮで割って正規化処理を行うとする。これによって、任意の数Ｎは代表特徴量層６０６で消え、Ｍ個の代表点の特徴量が代表点生成器１００４の出力として得られる。最後に、Ｎ個の対象特徴量（Ｃｏｎｖｏｌｕｔｉｏｎ２層の特徴量）と、Ｍ個の代表点を用いて、損失が計算される。このとき学習データが全て正常であるならば、１ｃｌａｓｓ学習を行うことができ、式（２）を損失関数に用いることができる。正規化可変代表点生成処理１００３と、式（２）とを用いて学習を行うことによって、Ｎ個のデータの平均的な代表点を学習することが可能である。もし異常ラベルが与えられていれば、異常な認識対象データが代表点から離れるほど損失が減少するような損失関数を与えて学習することが可能である。 An example of the normalized variable representative point generation processing 1003 is shown in the lower left of FIG. 10. Here, the input is regarded as N feature amounts of size (1, 512, 32, 32), and the representative point generation processing for a feature amount of size (1, 512, 32, 32) is repeated N times (the N runs may be executed in parallel). The representative point generation processing described in the present embodiment may be used. The outputs of the N runs are summed and input to the representative feature amount layer 606. Because N outputs are summed, the sum is divided by N for normalization before it is input to the representative feature amount layer 606. As a result, the arbitrary number N disappears at the representative feature amount layer 606, and the feature amounts of M representative points are obtained as the output of the representative point generator 1004. Finally, the loss is calculated using the N target feature amounts (the feature amounts of the Convolution2 layer) and the M representative points. If all the learning data are normal, one-class learning can be performed, and Expression (2) can be used as the loss function. By learning with the normalized variable representative point generation processing 1003 and Expression (2), representative points that are average over the N pieces of data can be learned. If abnormal labels are given, learning can use a loss function whose value decreases as abnormal recognition target data moves farther from the representative points.
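The sum-then-divide-by-N step described above can be sketched as follows. This is a minimal illustration, not the configuration of FIG. 10: `toy_generator` is a hypothetical stand-in for the representative point generation processing (here a fixed pooling and scaling rather than a learned network), and the small tensor sizes are chosen only to keep the example light.

```python
import numpy as np

def toy_generator(feature, m=4):
    """Hypothetical stand-in for representative point generation: (C, H, W) -> (M, C)."""
    pooled = feature.mean(axis=(1, 2))                # global average pool, shape (C,)
    scales = np.arange(1, m + 1, dtype=pooled.dtype)  # fixed per-point scaling, shape (M,)
    return scales[:, None] * pooled[None, :]

def normalized_representative_points(features, generate=toy_generator):
    """features: (N, C, H, W) for any N >= 1. The output shape does not depend on N."""
    n = features.shape[0]
    summed = sum(generate(f) for f in features)  # add up the N generator outputs
    return summed / n                            # divide by N before the representative layer

rng = np.random.default_rng(0)
reps_small = normalized_representative_points(rng.normal(size=(3, 8, 4, 4)))
reps_large = normalized_representative_points(rng.normal(size=(7, 8, 4, 4)))
```

Both calls return an array of shape (M, C) even though N differs, which is the sense in which N "disappears" after normalization.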

また、検出時には、図１０の構成において損失算出処理６１５の代わりに距離計算処理を導入し、検出時に得られたＮ個の認識対象データと、それらから生成されたＭ個の代表点との距離を計算した結果に基づいて、異常判定を行ってよい。本来であれば、Ｎ個の認識対象が存在する場合、それらの距離を計算するには、Ｎの自乗のオーダーでの計算時間増加が見込まれる。図１０の方法によれば、それをＮ・Ｍのオーダーの計算時間に圧縮することができる。ＭがＮよりも大きい場合については、例えばＮ個の認識対象データをそれぞれ１対１比較してもよいし、それぞれの認識対象データに基づいて生成した代表点を本実施形態に記載の方法によって比較することで距離を算出してもよい。なお、上述したように、認識対象と検索候補が重複していない場合や、重複が一部しかない場合については、距離計算の際等に、例えば検索候補と代表点との間の距離のみを計算することで、計算処理を短縮してもよい。 At detection time, distance calculation processing is introduced in place of the loss calculation processing 615 in the configuration of FIG. 10, and abnormality determination may be performed based on the distances calculated between the N pieces of recognition target data obtained at detection time and the M representative points generated from them. Ordinarily, when there are N recognition targets, calculating the distances among them is expected to increase the calculation time on the order of N squared. With the method of FIG. 10, this can be reduced to a calculation time on the order of N·M. When M is larger than N, for example, the N pieces of recognition target data may each be compared one-to-one, or the distances may be calculated by comparing the representative points generated from the respective pieces of recognition target data using the method described in the present embodiment. As mentioned above, when the recognition targets and the search candidates do not overlap, or overlap only partially, the calculation may be shortened by, for example, calculating only the distances between the search candidates and the representative points.
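The N·M distance computation at detection time can be sketched as below. Several choices here are assumptions made for illustration and are not stated in the text: the Euclidean distance, the use of the distance to the nearest representative point as the score, and the fixed threshold.

```python
import numpy as np

def anomaly_scores(targets, reps):
    """targets: (N, D), reps: (M, D). Computes N*M distances instead of N*N."""
    diff = targets[:, None, :] - reps[None, :, :]  # shape (N, M, D)
    dists = np.linalg.norm(diff, axis=2)           # shape (N, M)
    return dists.min(axis=1)                       # distance to the nearest representative point

def detect(targets, reps, threshold):
    """Flag a target as abnormal when its score exceeds the (assumed) threshold."""
    return anomaly_scores(targets, reps) > threshold

targets = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
reps = np.array([[0.0, 0.0], [0.0, 1.0]])
flags = detect(targets, reps, threshold=1.0)
```

Only the third target, which lies far from both representative points, is flagged.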

さらに発展的には、複数個の検索候補を、例えば徐々に絞っていくという場合も考えられる。上述の複数認識対象の例では、認識対象数Ｎは可変であってよいので、これを用いて徐々に検索候補を絞っていくような使い方を学習時や検出時に行ってもよいし、その他の公知の方法を用いて、検索候補を絞っていくための処理を行ってもよい。複数認識対象の例を用いる場合は、例えば、生成された代表点から近い検索候補を除外していくことで、検索候補を絞っていくための処理を行うことができる。そして、絞られた検索候補に基づいて、再び代表点を生成する処理を行う。 As a further development, a plurality of search candidates may be narrowed down gradually. In the multiple-recognition-target example above, the number N of recognition targets may be variable, so this may be used at learning time or detection time to narrow down the search candidates gradually; other known methods may also be used for the narrowing. With the multiple-recognition-target example, the search candidates can be narrowed down by, for example, excluding search candidates that are close to the generated representative points. Representative points are then generated again from the narrowed-down search candidates.
なお、より詳しく認識処理を行いたい場合は、式（２）のように代表点とデータとの間の距離を計算するのみならず、データ同士、代表点同士の距離を計算し、学習時の学習処理に用いてもよいし、検出時に用いてもよい。 For more detailed recognition processing, in addition to calculating the distances between the representative points and the data as in Expression (2), the distances among the data and among the representative points may be calculated and used in the learning processing at learning time or at detection time.
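The narrowing loop (exclude candidates near the representative points, then regenerate the points from what remains) can be sketched as follows. The generator `mean_point`, the rule of dropping the closer half in each round, and the stopping size `keep` are all hypothetical choices for illustration; the text does not fix any of them.

```python
import numpy as np

def mean_point(points):
    """Hypothetical generator: a single representative point (the mean), shape (1, D)."""
    return points.mean(axis=0, keepdims=True)

def narrow_candidates(candidates, keep, generate=mean_point):
    """Return sorted indices of the `keep` candidates that survive repeated narrowing."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    idx = np.arange(len(candidates))
    while len(idx) > keep:
        pts = candidates[idx]
        reps = generate(pts)  # regenerate representative points from the current candidates
        dist = np.linalg.norm(pts[:, None, :] - reps[None, :, :], axis=2).min(axis=1)
        n_keep = max(keep, len(idx) // 2)               # assumed rule: drop the closer half
        idx = np.sort(idx[np.argsort(dist)[-n_keep:]])  # keep the candidates farthest away
    return idx

candidates = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],
                       [0.0, -0.1], [0.1, 0.1], [-0.1, -0.1], [10.0, 10.0]])
survivors = narrow_candidates(candidates, keep=2)
```

In this toy data the isolated candidate at index 7 remains among the survivors, since it is never close to the regenerated representative point.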

上述の例では認識対象データに基づいて代表点を生成する例を示したが、これは具体例としてＣＮＮを用いる場合の一例であって、その他の公知の方法を用いて代表点を利用してもよい。例えば、すでに生成された代表点をリカレントに用いて、認識処理に再利用してもよい。 The above example generates representative points based on the recognition target data, but this is one example for the case of using a CNN, and representative points may be used with other known methods. For example, representative points that have already been generated may be used recurrently and reused in the recognition processing.

また上述の例では、画像を用いた認識処理の例を主に記したが、他のデータを対象とした認識処理を行ってもよい。例えば、映像データや、センサデータや、反射光スペクトルデータや、物質組成データや、化学的データなどである。例えば、化合物の新種を発見するために、化合物の正常データのみを学習し、認識対象として新種の化合物が現れたとき、距離ベースの検知によって新種の判定を行うことができる。 The above examples mainly describe recognition processing using images, but recognition processing may be performed on other data, for example video data, sensor data, reflected light spectrum data, material composition data, or chemical data. For example, to discover a new kind of compound, only normal compound data is learned, and when a new kind of compound appears as a recognition target, it can be identified as new by distance-based detection.

また、上述の例では実データを用いて学習処理を行う場合の例について述べたが、ＣＧ（コンピュータグラフィックス）を用いてデータを作成し、学習処理等に用いてもよい。例えば、認識対象データに類似するＣＧデータを作成し、そのＣＧデータに光源の変化等で生じえるバリエーションを付与したＣＧデータを作成し、それらのＣＧデータや該ＣＧデータから抽出した特徴データを、認識対象近傍データや認識対象データとして用いる。これによって、画像上の様々なバリエーションに対応することができる。 Further, in the above example, an example in which learning processing is performed using actual data has been described, but data may be created using CG (computer graphics) and used for learning processing or the like. For example, CG data similar to the recognition target data is created, CG data in which variations that can occur due to changes in the light source, etc. are added to the CG data, and these CG data and the characteristic data extracted from the CG data are It is used as recognition target neighborhood data and recognition target data. This makes it possible to deal with various variations on the image.

なお、上述の例では、光源等の変化に対応するため（例えば異なる光源の画像が与えられても正しく認識処理を行うため）に、学習画像等そのものや近傍の選び方等に工夫を加える例を示した。これに対し、さらに代表点に関しても光源等の変化に対応して認識処理に工夫を加えることで、よりロバストな認識処理を行うことができる。ここでは外観検査における具体例を以下に示す。例えば、本実施形態で示したＡｕｔｏｅｎｃｏｄｅｒを用いるとする。なおＡｕｔｏｅｎｃｏｄｅｒに畳み込みフィルタを導入したＣｏｎｖｏｌｕｔｉｏｎａｌＡｕｔｏｅｎｃｏｄｅｒでも、その他の再帰的構造を取り入れたモデルでもよい。 In the above examples, to handle changes in the light source and the like (for example, to perform recognition correctly even when images under different light sources are given), the learning images themselves and the way neighbors are chosen were adapted. In addition, more robust recognition can be achieved by also adapting the recognition processing for the representative points to changes in the light source and the like. A concrete example for visual inspection is given below. Suppose, for example, that the Autoencoder described in the present embodiment is used. A convolutional autoencoder, in which convolution filters are introduced into the Autoencoder, or a model incorporating another recursive structure may also be used.

まず、検査する対象の製品の認識対象データを選択する。説明を簡単にするために、ここで用いる認識対象データは一つであるとするが、複数あってもよい。 First, the recognition target data of the product to be inspected is selected. For simplicity, a single piece of recognition target data is assumed here, but there may be more than one.
次に、ここで選んだ認識対象データと同一の製品のデータであって、認識対象データとは異なる光源等のバリエーションが存在するデータを選択する。単純にＡｕｔｏｅｎｃｏｄｅｒを適用するならば、Ａｕｔｏｅｎｃｏｄｅｒは独立にそれぞれのデータを再構成するための学習に用いられるが、先述のようにＡｕｔｏｅｎｃｏｄｅｒの中間層特徴量に基づいて、代表点を生成することを考える。 Next, data of the same product as the selected recognition target data, but with variations such as a different light source, is selected. If the Autoencoder were simply applied, it would be trained to reconstruct each piece of data independently; here, however, representative points are generated based on the intermediate-layer feature amounts of the Autoencoder, as described above.

学習時の具体例を図１１に示す。図１１の例において、構成の一部は図６の機能と同様の機能を持ち、その場合は図６と同一の符号を付与してそれらの説明は省略する。 A specific example at the time of learning is shown in FIG. 11. In the example of FIG. 11, a part of the configuration has the same function as that of FIG. 6, and in that case, the same reference numerals as those in FIG. 6 are given and the description thereof is omitted.
補助データ選択器１１４１は、認識対象データ１１０１に基づいて補助データを選ぶとする。ここで、補助データとは、本実施形態で認識対象近傍データと述べてきたデータに相当する。今回は"選んだ認識対象データと同一の製品のデータであって、認識対象データとは異なる光源等のバリエーションが存在するデータを選択する"ようにしている。これらは特徴空間上等での認識対象近傍とは限らないため、ここでは補助データと呼称することにする。 It is assumed that the auxiliary data selector 1141 selects auxiliary data based on the recognition target data 1101. Here, the auxiliary data corresponds to the data described as the recognition target neighborhood data in the present embodiment. This time, the rule is to "select the data of the same product as the selected recognition target data, which has a variation such as a light source different from the recognition target data". Since these are not necessarily in the vicinity of the recognition target on the feature space or the like, they are called auxiliary data here.

ＭｉｎｉｂａｔｃｈサイズをＮとしたとき、Ｎ−１個の補助データ（補助データ１１０２〜１１０Ｎ）を選択する。補助データは、様々な基準で選ぶことは可能であるが、ここでは"選んだ認識対象データと同一の製品のラベルがついているデータであって、認識対象データとは異なる露光で撮像したデータをランダムに選択する"とする。なお、異なる露光で撮像したデータのみを補助データとしたが、認識対象データと同一の露光で撮像したデータを補助データとしてまじえることで、学習処理を安定化させる試みを行ってもよい。また、補助データは人間が選択してもよい。 With a minibatch size of N, N-1 pieces of auxiliary data (auxiliary data 1102 to 110N) are selected. Auxiliary data can be chosen by various criteria; here the rule is to "randomly select data that carries the same product label as the selected recognition target data and was captured with an exposure different from that of the recognition target data". Although only data captured with different exposures is used as auxiliary data here, data captured with the same exposure as the recognition target data may be mixed in to try to stabilize the learning processing. The auxiliary data may also be selected by a human.
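The selection rule above (same product label, different exposure, chosen at random) can be sketched as follows. The record layout with `product` and `exposure` keys and the function name `select_auxiliary` are hypothetical and serve only to make the rule concrete.

```python
import random

def select_auxiliary(target, pool, batch_size, seed=None):
    """Pick batch_size - 1 samples with the target's product label and a different exposure."""
    rng = random.Random(seed)
    eligible = [s for s in pool
                if s["product"] == target["product"] and s["exposure"] != target["exposure"]]
    # random.sample raises ValueError if fewer than batch_size - 1 samples are eligible
    return rng.sample(eligible, batch_size - 1)

target = {"id": 0, "product": "A", "exposure": 1.0}
pool = [
    {"id": 1, "product": "A", "exposure": 0.5},
    {"id": 2, "product": "A", "exposure": 2.0},
    {"id": 3, "product": "A", "exposure": 1.0},  # same exposure as the target: excluded
    {"id": 4, "product": "B", "exposure": 0.5},  # different product: excluded
    {"id": 5, "product": "A", "exposure": 4.0},
]
auxiliary = select_auxiliary(target, pool, batch_size=3, seed=0)
```

Mixing in same-exposure data, as the text allows, would amount to relaxing the exposure condition in the filter.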

再構成関数１１２０は、Ａｕｔｏｅｎｃｏｄｅｒのことであり、再構成関数１１２０によって再構成認識対象データ１１１１および再構成補助データ１１１２〜１１１Ｎが得られる。ここで、通常のＡｕｔｏｅｎｃｏｄｅｒと同様に再構成誤差を最小化するとともに、代表点生成の学習を進めることを考える。認識対象データ１１０１から再構成認識対象データ１１１１を得るまでのＡｕｔｏｅｎｃｏｄｅｒの中間層において、Ａｕｔｏｅｎｃｏｄｅｒの中間層特徴量を用いて代表点生成処理６１４が行われ、代表点１１３１を得るとする。 The reconstruction function 1120 is an Autoencoder, and the reconstruction recognition target data 1111 and reconstruction auxiliary data 1112 to 111N are obtained by the reconstruction function 1120. Here, it is considered that the reconstruction error is minimized and learning of representative point generation is advanced as in the case of the ordinary Autoencoder. In the middle layer of the Autoencoder from the recognition target data 1101 to the reconstructed recognition target data 1111, the representative point generation processing 614 is performed using the middle layer feature amount of the Autoencoder to obtain the representative point 1131.

ここで、代表点生成処理６１４で用いるＡｕｔｏｅｎｃｏｄｅｒの中間層は、どの層であってもよく、複数の層の特徴量を用いてもよい。代表点生成処理６１４で用いるＡｕｔｏｅｎｃｏｄｅｒの中間層と同じ層の特徴量を用いて、特徴量抽出処理１１２１が行われ、特徴量１１３２〜１１３Ｎが得られる。ここで特徴量抽出処理１１２１は、代表点との比較等に用いる補助データの特徴量を抽出するための処理であり、Ａｕｔｏｅｎｃｏｄｅｒの中間層特徴量をそのまま用いてもよいし、その他の変換処理を行ってもよい。ここで得た特徴量と代表点とを用いて、損失算出処理６１５が行われて、損失値６０７が得られる。このとき用いる損失関数は本実施形態で例示したどのようなものを用いてもよく、その他の公知の損失関数を応用したものを用いてもよい。例えば式（２）によってＵｎｓｕｐｅｒｖｉｓｅｄな学習を行うことができる。なお学習データは全て正常であるという仮定をおく意味では、教師が存在する。 The intermediate layer of the Autoencoder used in the representative point generation processing 614 may be any layer, and feature amounts from a plurality of layers may be used. The feature amount extraction processing 1121 is performed using the feature amounts of the same layer as the one used in the representative point generation processing 614, and the feature amounts 1132 to 113N are obtained. The feature amount extraction processing 1121 extracts the feature amounts of the auxiliary data used for comparison with the representative points and the like; it may use the intermediate-layer feature amounts of the Autoencoder as they are or apply other conversion processing. The loss calculation processing 615 is performed using these feature amounts and the representative points, and the loss value 607 is obtained. Any of the loss functions exemplified in the present embodiment may be used, as may adaptations of other known loss functions. For example, unsupervised learning can be performed with Expression (2). Note that supervision exists in the sense that all the learning data is assumed to be normal.

このとき、補助データは露光が異なるデータであるから、認識対象データから生成された代表点と、補助データの特徴量との位置関係は、露光の影響によって離れた場所に存在しうる。しかしながら、式（２）に基づいて学習を進めることで、特徴抽出器および代表点生成器の学習が進み、生成された代表点と、補助データの特徴量との位置関係は、より近づいたものになりうる。これは、露光の影響を減じる効果であると捉えることができる。 At this time, since the auxiliary data is data with different exposures, the positional relationship between the representative point generated from the recognition target data and the feature amount of the auxiliary data may exist at a distant place due to the influence of the exposure. However, the learning based on the equation (2) advances the learning of the feature extractor and the representative point generator, and the positional relationship between the generated representative point and the feature amount of the auxiliary data becomes closer. Can be. This can be regarded as an effect of reducing the influence of exposure.

さらに、式（２）など代表点に関する損失関数以外に、Ａｕｔｏｅｎｃｏｄｅｒならではの再現誤差関数（入力データと再現データとの二乗誤差など）を導入し、前述したマルチタスクラーニングとして解くことで、学習を効率的に進めることができる。また、Ａｕｔｏｅｎｃｏｄｅｒのある中間層の特徴空間を、代表点を生成する特徴空間としてそのまま用いる場合、生成された代表点を用いて、Ａｕｔｏｅｎｃｏｄｅｒの一部の関数によって画像を構成することが可能である。すなわち、生成された代表点が意味する画像を生成することができる。これは、Ａｕｔｏｅｎｃｏｄｅｒの中間層で代表点を生成し、さらにその代表点に基づいて、画像空間上での代表点を生成したとして解釈することができる。つまり、段階的な代表点生成、あるいは階層的代表点生成処理として解釈できる。 Furthermore, in addition to the loss function related to the representative point such as the formula (2), a reproduction error function (such as a square error between the input data and the reproduction data) unique to the Autoencoder is introduced, and the learning is efficiently performed by solving it as the above-mentioned multitask learning. You can proceed. Further, when the feature space of the intermediate layer with Autoencoder is used as it is as the feature space for generating the representative point, the generated representative point can be used to compose an image by a part of the function of Autoencoder. That is, an image represented by the generated representative point can be generated. This can be interpreted as generating a representative point in the Autoencoder intermediate layer and further generating a representative point in the image space based on the representative point. That is, it can be interpreted as stepwise representative point generation or hierarchical representative point generation processing.
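Combining the reconstruction error with a representative point loss as a multitask objective can be sketched as below. Expression (2) is not reproduced in this passage, so the representative point term used here (mean distance from each feature to its nearest representative point) is only an assumed stand-in, and `weight` is a hypothetical balancing parameter.

```python
import numpy as np

def multitask_loss(x, x_recon, feats, reps, weight=1.0):
    """x, x_recon: (N, P) inputs and reconstructions; feats: (N, D); reps: (M, D)."""
    recon_term = np.mean((x - x_recon) ** 2)  # squared reconstruction error
    dists = np.linalg.norm(feats[:, None, :] - reps[None, :, :], axis=2)  # shape (N, M)
    rep_term = dists.min(axis=1).mean()       # assumed stand-in for Expression (2)
    return recon_term + weight * rep_term

x = np.zeros((2, 4))
x_recon = np.ones((2, 4))
feats = np.array([[0.0, 0.0], [3.0, 4.0]])
reps = np.array([[0.0, 0.0]])
loss = multitask_loss(x, x_recon, feats, reps, weight=0.5)
```

With these toy values the reconstruction term is 1.0 and the representative point term is 2.5, so the weighted sum is 2.25.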

なお、例えば複数の層においてラベル付き代表点を生成した際に、それぞれの層で認識対象データが所属すると思われるクラスが異なるという結果であった場合は、例えば複数の層の結果を加味して所属するクラスを決定してもよい。例えば、それぞれの層で得られる平均の距離で正規化した距離を足し合せて、最も距離値の小さいクラスを所属クラスとしてもよい。また、ラベル付きでない場合でも、例えば最大または最小の距離を用いて異常検知等の検出処理を行ってもよい。また、各層において異なる閾値を用いて、それぞれの層で得られた代表点生成結果に基づいて独立に異常検知等の検出処理を行ってもよい。 If, for example, when the labeled representative points are generated in a plurality of layers and the result is that the classes to which the recognition target data belong are different in each layer, for example, the results of the plurality of layers are taken into consideration. You may decide which class you belong to. For example, the distances normalized by the average distances obtained in the respective layers may be added together, and the class having the smallest distance value may be set as the belonging class. Further, even when not labeled, the detection process such as abnormality detection may be performed using the maximum or minimum distance, for example. Further, different thresholds may be used in each layer, and detection processing such as abnormality detection may be independently performed based on the representative point generation result obtained in each layer.
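The layer-fusion rule mentioned above (normalize each layer's distances by that layer's mean distance, sum over layers, and take the class with the smallest total) can be sketched as follows. The input layout, one array of per-class distances per layer, is an assumption made for illustration.

```python
import numpy as np

def fuse_layers(dists_per_layer):
    """dists_per_layer: list of (K,) arrays, distance to each class's representative point."""
    total = sum(d / d.mean() for d in dists_per_layer)  # normalize per layer, then add
    return int(np.argmin(total)), total

layer1 = np.array([1.0, 3.0])    # class 0 is closer in this layer
layer2 = np.array([40.0, 20.0])  # class 1 is closer, on a much larger distance scale
fused_class, totals = fuse_layers([layer1, layer2])
raw_class = int(np.argmin(layer1 + layer2))
```

Here the normalized totals favor class 0, whereas adding the raw distances would let the layer with the larger scale dominate and select class 1.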

また、上述のＡｕｔｏｅｎｃｏｄｅｒの例では、画像空間上に生成した代表点を端末装置４０で可視化することで、監視員等に対して、検出結果の理解を促進してもよい。また、これらは代表点であるから、式（２）等によって、さらに損失関数を適用して学習を促進することや、画像空間上の代表点と入力データとの距離に基づいて、異常検知に用いることもできる。なお、ここでは学習データが全て正常であるという前提に基づいて説明を行ったが、異常データなどのラベルが存在する場合は、式（１）などの損失関数を用いて学習し、得られた学習結果に基づいて、検出処理を行ってもよい。なお、Ａｕｔｏｅｎｃｏｄｅｒはデータを再構成する場合の本実施形態における構成の一例であり、その他の公知の方法を用いてもよい。 Further, in the example of the Autoencoder described above, the representative points generated in the image space may be visualized by the terminal device 40, thereby facilitating the understanding of the detection result for the surveillance staff and the like. Further, since these are representative points, a loss function is further applied by Equation (2) or the like to promote learning, and abnormality detection is performed based on the distance between the representative point in the image space and the input data. It can also be used. Although the explanation is given here based on the assumption that all the learning data are normal, if there is a label such as abnormal data, it is obtained by learning using a loss function such as equation (1). The detection process may be performed based on the learning result. Note that the Autoencoder is an example of the configuration of this embodiment when reconstructing data, and other known methods may be used.

上述の例では露光について述べたが、その他のカメラパラメータや、製品に関するデータ等を補助データとして用いてもよい。例えばカメラパラメータＡの場合、カメラパラメータＢの場合、カメラパラメータＣの場合、・・・、製品Ａの場合、製品Ｂの場合、製品Ｃの場合、・・・、といったように、外部情報を補助情報として扱い、どのように入力処理を行ってもよい。 Although exposure has been described in the above example, other camera parameters, product-related data, and the like may be used as auxiliary data. For example, in the case of the camera parameter A, the camera parameter B, the camera parameter C,..., The product A, the product B, the product C,... It may be treated as information and any input process may be performed.

本実施形態における上述の例では、検出時に、同一の認識対象が入力された際に、同一の代表点が生成される仕組みであった。例えば監視映像の監視を目的とした際に、状況に応じて正常の定義が変化する場合がありうるため、状況に応じて生成される代表点が変化することで、正常の定義の変化を表現できることは望ましい。例えば、端末装置４０に補助情報入力器１２０１が備えられ、天気や備え付けられた店舗の状態（例えば営業中か否か）等の外部の情報が自動的に入力されるか、またはユーザによって与えられるとし、補助情報は状況の状態を与える情報であるとする。これに基づいて代表点生成器を切り替えてもよいし、また、補助情報を用いた代表点生成器を用いてもよい。 In the above-described example of this embodiment, the same representative point is generated when the same recognition target is input at the time of detection. For example, when the purpose is to monitor surveillance video, the definition of normal may change depending on the situation.Therefore, the change of the representative point generated depending on the situation can express the change of definition of normal. It is desirable to be able to. For example, the terminal device 40 is provided with the auxiliary information input device 1201, and external information such as the weather and the state of the installed store (for example, whether it is open or not) is automatically input or given by the user. The auxiliary information is information that gives the state of the situation. The representative point generator may be switched based on this, or the representative point generator using the auxiliary information may be used.

以降では、補助情報を用いた代表点生成器の学習時の例について図１２を参照しながら示す。なお、学習時のＭｉｎｉｂａｔｃｈは、補助情報に基づいて正常の定義が変化することを表現するために、補助情報ごとに構成する。例えば補助情報が０のときの正常データを集めたＭｉｎｉｂａｔｃｈと、補助情報が１のときの正常データを集めたＭｉｎｉｂａｔｃｈとを用意するとする。なお、これらのＭｉｎｉｂａｔｃｈ中のデータの選定には、本実施形態で述べた方法を用いてよい。図１２の例において、構成の一部は図６の機能と同様の機能を持ち、その場合は図６と同一の符号を付与し、それらの説明は省略する。 An example of learning a representative point generator that uses auxiliary information is described below with reference to FIG. 12. To express that the definition of normal changes with the auxiliary information, the minibatches at learning time are composed per value of the auxiliary information. For example, one minibatch collects normal data for auxiliary information 0, and another collects normal data for auxiliary information 1. The methods described in the present embodiment may be used to select the data in these minibatches. In the example of FIG. 12, parts of the configuration have the same functions as in FIG. 6; these are given the same reference numerals as in FIG. 6, and their description is omitted.

図１２の補助情報入力器１２０１は、先述の補助情報を入力するものである。補助情報１２０２は入力された補助情報を示し、離散値・連続値、次元数を問わないが、ここでは説明を簡単にするために、一次元の｛０，１｝の値をとりうるものとする。与えられた補助情報１２０２と、対象特徴量層６０５の特徴量とを、補助情報付き代表点生成処理１２０３によって変換し、代表特徴量層６０６の代表特徴量を得る。このとき代表特徴量は、補助情報に依存して決まる。これらが代表点生成器１２０４の役割である。 The auxiliary information input device 1201 of FIG. 12 inputs the auxiliary information described above. The auxiliary information 1202 denotes the input auxiliary information; it may be discrete or continuous and of any dimensionality, but for simplicity it is assumed here to be one-dimensional and to take a value in {0, 1}. The given auxiliary information 1202 and the feature amount of the target feature amount layer 605 are converted by the representative point generation processing with auxiliary information 1203 to obtain the representative feature amounts of the representative feature amount layer 606. The representative feature amounts thus depend on the auxiliary information. These are the roles of the representative point generator 1204.

補助情報付き代表点生成処理１２０３の動作に関する具体例を、補助情報の入力に基づく状況変化対応の例１２１０として、図１２の左下枠に示す。図１２中の枠の上部は補助情報が０であるときの動作、下部は補助情報が１であるときの動作について説明を行うための図である。Ｃｏｎｖｏｌｕｔｉｏｎ２層と代表特徴量層６０６は同一サイズの特徴空間であり、この特徴空間における代表点等の模式図について１２２０に示す。図中の黒色の点は対象特徴量１２１３であり、補助情報の値に関わらず、同じ位置に存在する。代表点群１２１１と代表点群１２１２は、補助情報の値に依存して、生成される点が変動している。この動作が、状況の変化に応じて正常が変化することを表現している。このとき、各代表点群と対象特徴量との距離が、前述の二つの場合で変化しうるため、異常検知結果が変わりうる。このときの損失を、本実施形態で例示した損失関数や、その他の公知の手法を用いて算出して学習に用いてよい。 A specific example regarding the operation of the representative point with auxiliary information generation process 1203 is shown in the lower left frame of FIG. 12 as an example 1210 of situation change correspondence based on input of auxiliary information. The upper part of the frame in FIG. 12 is a diagram for explaining the operation when the auxiliary information is 0, and the lower part is a diagram for explaining the operation when the auxiliary information is 1. The Convolution 2 layer and the representative feature amount layer 606 are feature spaces of the same size, and a schematic diagram of representative points and the like in this feature space is shown at 1220. The black dot in the figure is the target feature amount 1213 and exists at the same position regardless of the value of the auxiliary information. The generated points of the representative point group 1211 and the representative point group 1212 vary depending on the value of the auxiliary information. This operation represents that the normality changes according to the change of the situation. At this time, the distance between each representative point group and the target feature amount may change in the above two cases, and thus the abnormality detection result may change. The loss at this time may be calculated by using the loss function exemplified in this embodiment or other known methods and used for learning.

補助情報付き代表点生成処理１２０３は、例えば以下のように構成してもよい。具体的には、対象特徴量層６０５の特徴量に対して、補助情報をｃｏｎｃａｔｅｎａｔｅし、その特徴量を図６に記載の代表点生成処理６１４に入力し、代表点を作成することができる。ここで、ｃｏｎｃａｔｅｎａｔｅする補助情報は、ｏｎｅ−ｈｏｔなベクトルを、ｃｏｎｃａｔｅｎａｔｅする特徴マップと同一サイズのマップに拡大してから、ｃｏｎｃａｔｅｎａｔｅするものとする。具体的には、補助情報が１のスカラー値であったとしたら、１次元目が０で埋められ、２次元目が１で埋められたマップをｃｏｎｃａｔｅｎａｔｅする。このようにして、補助情報を考慮した代表点の生成を行うことができる。なお、その他の公知の方法を用いて、補助情報を考慮した代表点の生成を行ってもよい。 The representative point generation processing with auxiliary information 1203 may be configured, for example, as follows. The auxiliary information is concatenated to the feature amount of the target feature amount layer 605, and the result is input to the representative point generation processing 614 shown in FIG. 6 to create representative points. The auxiliary information is concatenated after a one-hot vector is expanded into maps of the same size as the feature map to which it is concatenated. Specifically, if the auxiliary information is the scalar value 1, a map whose first dimension is filled with 0 and whose second dimension is filled with 1 is concatenated. In this way, representative points can be generated taking the auxiliary information into account. Other known methods may also be used to generate representative points that take the auxiliary information into account.
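The expansion of a one-hot auxiliary vector into maps and its concatenation to the feature map can be sketched as follows. The (N, C, H, W) layout and concatenation along the channel axis are assumptions; the function name `concat_auxiliary` is hypothetical.

```python
import numpy as np

def concat_auxiliary(features, aux, num_states=2):
    """features: (N, C, H, W); aux: integer state in [0, num_states)."""
    n, _, h, w = features.shape
    one_hot = np.zeros(num_states, dtype=features.dtype)
    one_hot[aux] = 1
    # Expand each one-hot entry into a constant H x W map, one map per state.
    aux_maps = np.broadcast_to(one_hot[None, :, None, None], (n, num_states, h, w))
    return np.concatenate([features, aux_maps], axis=1)  # shape (N, C + num_states, H, W)

features = np.zeros((1, 512, 32, 32), dtype=np.float32)
augmented = concat_auxiliary(features, aux=1)
```

For auxiliary information 1 the first appended map is all zeros and the second is all ones, matching the description above; the augmented tensor would then be passed to the representative point generation processing.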

これらの学習を行ったのち、検出時に、補助情報を考慮した代表点の生成を行うことで、状況の変化に応じた異常検知等に用いることができる。これらの方法を本実施形態で述べたマルチクラス学習に拡張することで、状況の変化に対応した多クラス分類問題等に適用することもできる。 After performing these learnings, by generating representative points in consideration of auxiliary information at the time of detection, it can be used for abnormality detection or the like according to changes in the situation. By extending these methods to the multi-class learning described in the present embodiment, it is possible to apply them to a multi-class classification problem or the like corresponding to a change in situation.

本実施形態で述べた学習処理において、特段最適化技法の制約について述べなかったが、必要に応じて制約や、正則化を追加してもよい。例えば、代表点や特徴量を、超球上に生成するように制約してもよい。そのようにすることにより、学習処理が安定化しうる。また、その他の公知の方法を用いて制約や正則化を加えてもよい。また例えば、下記の式（７）に示す代表点類似正則化項を損失関数に追加することで、同じ代表点生成器から生成される代表点同士が似ないように学習を行うことができる。 In the learning process described in the present embodiment, the constraint of the optimization technique is not described, but a constraint or regularization may be added if necessary. For example, the representative point and the feature amount may be constrained to be generated on the hypersphere. By doing so, the learning process can be stabilized. Moreover, you may add restrictions and regularization using another well-known method. Further, for example, by adding the representative point similarity regularization term shown in the following Expression (7) to the loss function, it is possible to perform learning so that the representative points generated by the same representative point generator are not similar to each other.

なお、代表点同士が似すぎると、幅広い認識対象近傍を表現しづらくなる虞があるため、このような正則化項を追加してもよい。なお、式（７）のｂは任意のパラメータである。 If the representative points are too similar to one another, it may become difficult to represent a wide recognition target neighborhood, so such a regularization term may be added. Note that b in Expression (7) is an arbitrary parameter.
また、上述の例では、代表点それぞれを点として扱って、認識対象データ等との距離を測ったが、距離ではなく、例えば代表点群とＭｉｎｉｂａｔｃｈ中のデータとの分布としての差異を測ってもよい。同様に、距離以外の公知の尺度を用いて、代表点等の差異を測ってもよい。 In the above examples, each representative point is treated as a point and its distance to the recognition target data or the like is measured; instead of a distance, the difference between distributions, for example between the representative point group and the data in the minibatch, may be measured. Similarly, a known measure other than distance may be used to measure differences involving the representative points and the like.

これまで代表点を生成することについて述べてきたが、点ではなく、その他のものを生成してもよい。例えば、質量をもった超球を生成してもよい。質量をもった超球は、これまで述べた代表点を生成するように生成し、ある半径の範囲の超球内に特徴量が充填されているとみなすことができる。ここで半径は、予め決定されるものとし、例えば１を与えてもよい。質量を持った超球（代表点）と認識対象データ等との距離は、認識対象データ等と、質量を持った超球とが交わる最も短い直線によって定義してもよい。 So far, generating representative points has been described, but something other than points may be generated. For example, a hypersphere with mass may be generated. Such a hypersphere is generated in the same way as the representative points described so far, and the feature amounts can be regarded as filling the hypersphere within a certain radius. The radius is determined in advance; for example, it may be set to 1. The distance between a hypersphere with mass (representative point) and the recognition target data or the like may be defined by the shortest straight line connecting the recognition target data or the like to the hypersphere.
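One way to read the shortest-line definition is as the distance from the data point to the surface of the hypersphere, sketched below. Treating points inside the hypersphere as having distance 0 is an assumption made here, since the text does not specify that case.

```python
import numpy as np

def distance_to_hypersphere(x, center, radius=1.0):
    """Length of the shortest segment from x to a solid hypersphere around center."""
    gap = float(np.linalg.norm(np.asarray(x) - np.asarray(center))) - radius
    return max(0.0, gap)  # assumed: points inside the hypersphere have distance 0

outside = distance_to_hypersphere([3.0, 4.0], [0.0, 0.0], radius=1.0)
inside = distance_to_hypersphere([0.5, 0.0], [0.0, 0.0], radius=1.0)
```

The point (3, 4) lies at distance 5 from the center, so its distance to a hypersphere of radius 1 is 4.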

また、可視化のために、生成した代表点と認識対象データ等を低次元に写像し、それをマップとして表示してもよい。低次元に写像する方法は、どのような公知の技術を用いてもよく、例えば参考文献６に記載の方法を用いてもよい。 Further, for visualization, the generated representative points and the recognition target data may be mapped in a low dimension and displayed as a map. Any known technique may be used as the method of mapping in a low dimension, and for example, the method described in Reference Document 6 may be used.

参考文献６：S. Wold et al. Principal component analysis. 1987. Reference 6: S. Wold et al. Principal component analysis. 1987.
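A low-dimensional map of the representative points and recognition target data can be produced with principal component analysis, for example as in the following sketch. Computing the components through a singular value decomposition and projecting to two dimensions are choices made for this illustration.

```python
import numpy as np

def pca_project(points, dims=2):
    """points: (n, D). Returns (n, dims) coordinates along the top principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T

rng = np.random.default_rng(0)
reps = rng.normal(size=(4, 16))    # stand-ins for representative points
data = rng.normal(size=(20, 16))   # stand-ins for recognition target features
coords = pca_project(np.vstack([reps, data]))  # first 4 rows are the representative points
```

The resulting two-dimensional coordinates can be drawn as a scatter plot, with the representative points and the data marked differently.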

また例えばマップ上に配置した代表点や認識対象データ等のうちの一つを、グラフィカルインタフェース等を用いて指定すると、近傍に存在する代表点や認識対象データ等が更に表示され、グラフィカルインタフェースを通した芋づる式の近傍探索が可能である。端末装置４０は、そのようなインタフェースを備えてもよい。 Further, for example, when one of the representative points or pieces of recognition target data placed on the map is selected through a graphical interface or the like, the representative points and recognition target data in its neighborhood are additionally displayed, which enables a chained neighborhood search, following one neighbor after another, through the graphical interface. The terminal device 40 may include such an interface.
その他にも、クロスバリデーションによって、最適なＭを選択してもよい。また、代表点生成器は、ＣＮＮを用いず、例えば線形回帰等によって構成してもよい。ＣＮＮを用いる例は、高い非線形性を求められる場合に用いられる一例であり、本発明はＣＮＮを用いる場合に限らない。 In addition, the optimal M may be selected by cross-validation. The representative point generator may also be configured without a CNN, for example by linear regression. The CNN-based example is one used when high nonlinearity is required, and the present invention is not limited to the use of a CNN.

As described above, the first embodiment has a representative point generator that uses recognition target data to predict (generate), in the feature space or the original data space, representative points of the recognition target, such as a person to be recognized in the data or that person's state, and performs detection processing based on a comparison between the recognition target data and the representative points. The first embodiment can thereby achieve fast recognition processing and recognition processing with high accuracy.

<Second Embodiment>
The first embodiment showed an example in which recognition target data is used to predict (generate) representative points of the recognition target, such as a person to be recognized or that person's state, in the feature space or the original data space, and detection processing is performed based on a comparison between the recognition target data and the representative points. The first configuration example there included a feature extractor that extracts feature amounts of the data and a representative point generator that generates representative points, both configured as convolutional neural networks (CNNs); an example without a CNN was also described. The second embodiment describes the configuration and operation that differ from the first embodiment with regard to training the representative point generator discriminatively.

Most of the configuration of the second embodiment is the same as that illustrated in the first embodiment; only part of the configuration and operation differs. As noted above, the difference lies in the configuration and operation for training the representative point generator. The notation used in the expressions of this embodiment is the same as in the first embodiment.

As one configuration example, this embodiment adds to the configuration of the first embodiment by training the representative point generator with part of an adversarial generative network such as a Generative Adversarial Network (GAN). A GAN, the method disclosed in Reference 7, has a generative model that generates fake data in the original data space and a discriminative model that judges whether data is genuine or generated; training the two adversarially can yield a highly capable generator. In this embodiment, the discriminative model is trained to distinguish whether its input is recognition target data or a generated representative point (hereinafter, to judge authenticity), and the representative point generator is trained to fool the discriminator. This is only one example of using a discriminator to improve the accuracy of the representative point generator, and the present invention is not limited to the GAN configuration and operation described above. Note that, unlike in an ordinary GAN, the representative points used in this embodiment are assumed to be compared not only in the original data space but also in the feature space.

Reference 7: I. J. Goodfellow et al. Generative Adversarial Networks. arXiv, 2014.

For example, applying the GAN objective function to the generated representative points yields the following Expression (8).

In Expression (8), D is the discriminant function (discriminator) that judges authenticity. The first term, which matches Expression (2), represents the error between (the feature amounts of) the data in the minibatch and the generated representative points. The second term represents the error incurred when the discriminator wrongly identifies (the feature amounts of) the data in the minibatch as fake, and the third term represents the loss for whether the discriminator correctly identified the representative points as fake. c₁ to c₃ are the weights of the respective terms and may be set to arbitrary values. The representative points A are generated by the generative model G, which is the representative point generator, from the recognition target data X₀, so we can write A = G(X₀), where A = {A₁, A₂, ..., A_M}.
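The three-term structure described for Expression (8) can be sketched as follows. The concrete forms are assumptions made only for illustration: the first term uses the mean squared distance from each feature amount to its nearest representative point as a stand-in for Expression (2), and the second and third terms use the usual log-loss of a discriminator whose output is the probability that its input is genuine. The function and parameter names are hypothetical.

```python
import numpy as np

def one_class_gan_loss(x_feat, reps, d_real, d_fake, c1=1.0, c2=1.0, c3=1.0, eps=1e-12):
    """Weighted sum of three terms; every concrete form here is an assumption.

    x_feat: (N, D) feature amounts of the minibatch data
    reps:   (M, D) generated representative points A = G(X0)
    d_real: (N,) discriminator outputs D(x) in [0, 1] for the minibatch features
    d_fake: (M,) discriminator outputs D(a) in [0, 1] for the representative points
    """
    # Term 1 (stand-in for Expression (2)): mean squared distance to the nearest representative point.
    sq_dist = ((x_feat[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    point_error = sq_dist.min(axis=1).mean()
    # Term 2: grows when the discriminator scores genuine features as fake.
    real_error = -np.log(d_real + eps).mean()
    # Term 3: grows when the discriminator scores representative points as genuine.
    fake_error = -np.log(1.0 - d_fake + eps).mean()
    return c1 * point_error + c2 * real_error + c3 * fake_error
```

In an actual training setup these terms would be computed with an automatic differentiation framework so that the generator and discriminator can be updated by backpropagation; NumPy is used here only to show the arithmetic.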

The generative model G may be configured by the method described in the first embodiment or by any other known method. For adversarial training, the objective can be written using Expression (8) as, for example, min_D max_G 1classGANLoss. Because Expression (8) is differentiable, when a CNN is used it can be trained in the same way as an ordinary CNN, for example by error backpropagation. Expression (8) is an example for one-class learning that combines the error of the authenticity discriminator with Expression (2); in place of Expression (2), any of the loss functions shown in the first embodiment may be used.

At detection time, the representative point generator trained in this way can generate representative points that are closer to actual data, and detection processing is performed by comparing those representative points with the recognition target data. The second embodiment can thereby improve recognition accuracy.

<Third Embodiment>
The first embodiment showed an example in which, using a configuration that reconstructs input data such as an autoencoder, representative points are generated in both the intermediate layer and the output layer (the layer after reconstruction). This illustrates that, when multi-level feature extraction is performed with a CNN or the like, representative points can be generated in a plurality of layers. The third embodiment describes the configuration and operation that differ from the first embodiment with regard to training the representative point generator under a hierarchical loss. In particular, it describes the case where there are a plurality of recognition targets, that is, where all of the input data are recognition targets.

The third embodiment is described below with reference to the drawings. Most of its configuration is the same as that illustrated in the first embodiment; only part of the configuration and operation differs. As noted above, the difference lies in the configuration and operation for training the representative point generator hierarchically. The notation used in the expressions of the third embodiment is the same as in the first embodiment.

FIG. 13 is a schematic diagram (at training time) of representative point generation in a plurality of layers and the hierarchical cluster loss. In FIG. 13, elements with the same processing content as in FIG. 6 are given the same reference numerals as in FIG. 6, and redundant description is omitted. The left side of FIG. 13 shows an example of the feature extractor and representative point generators, and the right side illustrates the hierarchical loss calculation performed in the spaces where the representative points are generated. The representative point generators 1321 to 1323 acquire recognition target data from specific layers of the feature extractor 620 based on the recognition target extraction processing 613 and generate representative points. The representative point generators 1321 to 1323 then perform loss calculation processing 1301 to 1303, which compares the generated representative points with the feature amounts of the data in the minibatch.

In the loss calculation processing 1301, a neighborhood search is performed between the representative points generated by the representative point generator 1321 and the feature amounts of the data in the minibatch, and each data item is assigned, for example, to its nearest representative point. This processing can be interpreted as clustering. The method described in the first embodiment can be used for the neighborhood search. The resulting assignment information is sent as clusters 1351 to the representative point generator 1322 and the loss calculation processing 1302. The loss value 1311 is obtained by the method described in the first embodiment.
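The nearest-representative assignment can be sketched as a brute-force neighborhood search. This is a minimal illustration assuming flattened feature vectors and squared Euclidean distance; the name `assign_to_nearest` is hypothetical, and the neighborhood search of the first embodiment may differ.

```python
import numpy as np

def assign_to_nearest(features, reps):
    """Return, for each feature vector, the index of its nearest representative point."""
    # Pairwise squared Euclidean distances, shape (N, M).
    sq_dist = ((features[:, None, :] - reps[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)

features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
reps = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_to_nearest(features, reps)  # cluster index per data item
```

The resulting index array plays the role of the cluster assignment that is passed on to the next stage.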

Next, in the loss calculation processing 1302, clustering is performed by a neighborhood search between the representative points generated by the representative point generator 1322 and the feature amounts of the data in the minibatch. Here, the representative point generator 1322 uses the cluster information obtained by the clustering in the loss calculation processing 1301 to generate representative points for each cluster, as described later. The clustering performed here is hierarchical: it further divides each of the clusters 1351. That is, representative points are generated for the data assigned to each of the clusters 1351, and clustering the data around them produces the second-stage division. The clusters thus become progressively finer, so that a more fine-grained representation can be acquired. Moreover, generating representative points repeatedly over multiple stages reduces the need to generate many representative points at once, which may make training more efficient. The loss calculation processing 1302 yields the loss value 1312. The same operations are then carried out with the representative point generator 1323 and the loss calculation processing 1303, so that loss calculation can be performed in multiple stages.

In the example above, representative points are generated in every layer of the feature extractor, but a configuration that generates them only in arbitrarily selected layers may also be used. The right side of FIG. 13 shows schematic diagrams of the above operation in feature space. The upper-right diagram shows the clusters used in the loss calculation processing 1301: target feature amounts 1331, representative points 1332, and clusters 1340 lie in a feature space of the same size as the Convolution1 layer 602. White circles denote representative points, black circles denote target feature amounts, and broken lines indicate the extent of each cluster. Note that, as described in the first embodiment, the space in which the representative points are generated need not be the same as the space of the original feature amounts, because the feature amounts of the data in the minibatch can also be mapped into the space where the representative points are generated.

In this example, two representative points are generated, and a plurality of target feature amounts are assigned to them by neighborhood search. These clusters are the first-stage clusters of the hierarchical clustering. As in the first embodiment, the loss may be calculated using this cluster assignment, for example with the following Expression (9).

In Expression (9), C'_i denotes the i-th cluster, where i corresponds to a representative point. j ∈ C'_i means that the j-th data item is assigned to the i-th cluster, and |C'_i| is the number of data items assigned to the i-th cluster. The losses in the respective layers may also be weighted.
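One loss consistent with the symbols described for Expression (9) is sketched below: for each cluster C'_i, the squared distances between its representative point and the data items j ∈ C'_i are averaged over |C'_i|, and the per-cluster averages are summed. The squared Euclidean distance and the summation over clusters are assumptions made for illustration, not the expression itself.

```python
import numpy as np

def cluster_loss(features, reps, labels):
    """Sum over clusters of the mean squared distance to the cluster's representative point."""
    total = 0.0
    for i, rep in enumerate(reps):
        members = features[labels == i]          # data items j in C'_i
        if len(members) == 0:
            continue                             # an empty cluster contributes nothing
        # Dividing by |C'_i| (the mean) keeps large clusters from dominating.
        total += ((members - rep) ** 2).sum(axis=1).mean()
    return float(total)
```

A per-layer weight could multiply the value returned for each layer before the layer losses are combined.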

The lower-right diagram of FIG. 13 shows the clusters used in the loss calculation processing 1302; it can be seen that the clusters of the lower layer are further divided.

The plurality of loss values obtained as described above may all be used for training, as described in the first embodiment, or only some of them may be used.

Next, a method of generating representative points for each cluster is illustrated with reference to FIG. 14. In FIG. 14, elements with the same processing content as in FIGS. 6 and 10 are given the same reference numerals, and redundant description is omitted. The clusters 1440 in FIG. 14 are the result of the clustering performed in the preceding stage. They are supplied through the cluster information input 1441 to the division processing 1442, which divides the data of the target feature amount layer 605 by cluster. If the number of clusters is L, data divided into L parts is input to the target feature amount layer 1405.

Sizes 1401 to 140L indicate the feature sizes taking into account the number of data items after division, and N₁ to N_L are the numbers of data items in the respective clusters. Because the number of data items per cluster is variable, the normalized variable representative point generation processing 1003 illustrated in the first embodiment can be used to generate M representative points for the data of each cluster. M may be set to any number, and should be 2 or more if the clusters are to be divided further. The representative point generator 1410 thus generates M·L representative points. Finally, these representative points can be concatenated into a tensor of size such as (M·L, 512, 32, 32) and used in the loss calculation.
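The split, generate, and concatenate flow can be sketched as follows. The generator here is a deliberately simple stand-in, not the normalized variable representative point generation processing 1003: it mean-pools each cluster (which removes the dependence on the variable count N_c) and adds M offset vectors that play the role of learned parameters. Feature amounts are flattened to D dimensions for brevity, and every cluster is assumed to be non-empty.

```python
import numpy as np

def generate_per_cluster(features, labels, num_clusters, offsets):
    """Generate M points for each of L clusters and concatenate them to shape (M*L, D).

    offsets: (M, D) hypothetical learned parameters of the stand-in generator.
    Assumption: every cluster contains at least one data item.
    """
    reps = []
    for c in range(num_clusters):
        members = features[labels == c]           # N_c differs from cluster to cluster
        pooled = members.mean(axis=0)             # pooling normalizes away N_c
        reps.append(pooled[None, :] + offsets)    # M representative points, shape (M, D)
    return np.concatenate(reps, axis=0)           # M*L representative points
```

With L = 2 clusters and M = 2 offsets, the result has 4 rows, ordered cluster by cluster.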

Although the example above concerns the configuration and operation at training time, the same representative point generation processing can be used at detection time. In that case, as illustrated in the first embodiment, the recognition target data is compared with the representative points, and a distance or the like is calculated and used for detection processing. The representative points generated in a plurality of layers may be used for this.

The example above is, so to speak, divide-and-conquer (divisive) clustering, but agglomerative clustering can also be performed. In that case, after the clusters have been divided, representative points are generated and losses are calculated while the clusters are merged hierarchically. Because this can be realized by processing similar to the divide-and-conquer case, the details are omitted.

The example above assumed a plurality of items of recognition target data. The method can also be used when there is a single item of recognition target data and data near the recognition target is used for training. In that case, the single item of recognition target data is used to generate the representative points, and when divide-and-conquer clustering is performed, only the cluster to which the recognition target data belongs is clustered hierarchically. Subsequent recognition processing can then be omitted for neighboring data assigned to any other cluster.

As described above, according to the third embodiment, the representative point generator is trained hierarchically in multiple stages, which allows the training to be performed more accurately and can therefore improve recognition accuracy at detection time.

<Fourth Embodiment>
The fourth embodiment describes what refinements can be introduced into representative point generation, feature extraction, and the like when a sequence such as time is taken into account. Most of the configuration of the fourth embodiment is the same as that illustrated in the first embodiment; only part of the configuration and operation differs. As noted above, the difference from the first embodiment is that the time-series nature of the data is taken into account.

Note at the outset that every example described in the first to third embodiments can also be configured and operated for time-series data. For example, in the second embodiment, a video, motion vectors, or the like can be generated instead of a single image. When an image of a certain person is the recognition target, images of the same person at other times can be used as images near the recognition target. The fourth embodiment describes examples that exploit the time-series nature further.

The first embodiment described examples using person images, for instance in action recognition and suspicious behavior detection. As another example, the method disclosed in Reference 8 may be used to extract optical flow images in addition to RGB images and use them for recognition processing.

Reference 8: K. Simonyan et al. Two-Stream Convolutional Networks for Action Recognition in Videos. NIPS, 2014.

In that case, representative points may be generated independently in each stream and used for recognition processing. For anomaly detection or the like, the following Expression (10), for example, may be used as the loss function.

In Expression (10), X^T and X^S denote the feature amounts of the temporal stream and the spatial stream, respectively, corresponding to the recognition target data. A^T and A^S denote the representative points on the temporal stream and on the spatial stream, respectively, corresponding to the recognition target data. d¹ and d² are weighting parameters and may be set arbitrarily. Expression (10) measures how close the representative points generated from the target feature amounts of the spatial stream are to the target feature amounts of the temporal stream, and how close the representative points generated from the target feature amounts of the temporal stream are to the target feature amounts of the spatial stream. Through training on normal data based on Expression (10), a feature extractor and representative point generator can be learned that predict the normal feature amounts of one stream from the normal feature amounts of the other.

At detection time, the distances between these feature amounts and the representative points are measured as in Expression (10). If a distance is large, the normal relationship learned from normal data is regarded as broken, and the data can be judged abnormal. Expression (10) considers multiple combinations of feature amounts and representative points, but, as in the examples of the other embodiments, it may be modified to consider only the nearest point. Expression (10) also uses N items of input data for each of the temporal and spatial streams; these may be data extracted from a continuous segment of a video, or data taken from a video at random or according to some criterion.
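A cross-stream score in the spirit of Expression (10) can be sketched as below. The concrete form is an assumption: the mean squared distance over all pairs of feature amounts and representative points, with the temporal features compared against the points generated from the spatial stream and vice versa, weighted by d¹ and d². Both streams are assumed to be mapped into a common D-dimensional space, and the threshold in `is_abnormal` is a hypothetical tuning parameter.

```python
import numpy as np

def cross_stream_score(x_t, x_s, a_t, a_s, d1=1.0, d2=1.0):
    """Weighted cross-stream distance; the exact form is an illustrative assumption.

    x_t, x_s: (N, D) temporal / spatial stream feature amounts
    a_t, a_s: (M, D) representative points generated from the temporal / spatial stream
    """
    def mean_sq(x, a):
        # Mean squared distance over all (feature, representative point) pairs.
        return ((x[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1).mean()
    return float(d1 * mean_sq(x_t, a_s) + d2 * mean_sq(x_s, a_t))

def is_abnormal(score, threshold):
    """Flag the input as abnormal when the cross-stream distance exceeds the threshold."""
    return score > threshold
```

During training on normal data the score would be minimized; at detection time the same score is compared with the threshold. Replacing the mean over pairs with a minimum over representative points gives the nearest-point variant mentioned above.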

When sequence-to-sequence matching is desired, like the one-to-one matching in the first embodiment, representative points may be generated for multiple patterns of sequence length, hierarchically and in multiple stages, and used for the matching. For example, the third embodiment described generating representative points and calculating loss functions hierarchically; the sequence length can be taken into account in multiple stages in the same way.

FIG. 15 shows an example. The time direction 1501 indicates that time advances toward the right. Suppose that CNN feature amounts are extracted at a plurality of timings, as with the CNN feature amount 1502. The range 1503 indicates a range over which a plurality of CNN feature amounts are grouped, and the operation 1504 represents the generation of representative points. That is, representative points are generated with a plurality of CNN feature amounts as the recognition target data. At the next stage up, CNN feature amounts are grouped again as recognition target data and representative points are generated. Representative points can thus be generated in multiple stages. FIG. 15 shows an example of one-to-one matching between sequence 1 (upper) and sequence 2 (lower); for example, the operation 1504 in sequence 1 corresponds to the operation 1505 in sequence 2. In the fourth embodiment, the representative points and recognition target data can be compared at each pair of corresponding locations, using any of the methods described in the other embodiments. Because the sequences shown in FIG. 15 can continue over time, shifting the grouping range, for example one position to the right, allows a new one-to-one matching to be performed. The results of such training, together with the methods shown in the other embodiments, can be used for detection processing such as action recognition, similar-action retrieval, and abnormal behavior detection.
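The multi-stage grouping and position-wise matching can be sketched as follows. The window mean is only a stand-in for the learned representative point generator, and the window width, the number of stages, and the function names are illustrative assumptions.

```python
import numpy as np

def summarize(sequence, width, stride):
    """Slide a window over the sequence and emit one summary point per window."""
    starts = range(0, len(sequence) - width + 1, stride)
    # Assumption: the window mean stands in for the learned representative point generator.
    return np.stack([sequence[s:s + width].mean(axis=0) for s in starts])

def multistage_points(sequence, width=2, levels=2):
    """Group features stage by stage, so later stages cover longer spans of the sequence."""
    stages, current = [], np.asarray(sequence, dtype=float)
    for _ in range(levels):
        current = summarize(current, width, stride=width)
        stages.append(current)
    return stages

def match_sequences(seq1, seq2, width=2, levels=2):
    """Mean distance between corresponding points of two sequences, one value per stage."""
    pairs = zip(multistage_points(seq1, width, levels), multistage_points(seq2, width, levels))
    return [float(np.linalg.norm(a - b, axis=1).mean()) for a, b in pairs]
```

For a sequence of 8 feature vectors, the first stage yields 4 points and the second stage 2. Shifting the input by one time step before calling `match_sequences` corresponds to moving the grouping range one position to the right.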

As described above, according to the fourth embodiment, a representative point generator trained with sequences taken into account hierarchically and in multiple stages can improve recognition performance on sequence data.

The present invention can also be realized by processing in which a program that implements one or more functions of the above embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

Each of the above embodiments is merely a concrete example of carrying out the present invention, and the technical scope of the present invention must not be interpreted restrictively on their basis. That is, the present invention can be implemented in various forms without departing from its technical idea or its main features.

10: learning device, 11: selection unit, 12: learning unit, 20: recognition device, 21: input unit, 22: recognition unit, 30: determination device, 31: abnormality determination unit, 40: terminal device, 41: display unit

Claims

Acquisition means for acquiring recognition target data,
Mapping means for mapping the recognition target data to a feature space,
Generating means for generating a plurality of points in at least one of the feature space and the data space of the recognition target data;
Comparing means for comparing the recognition target data and the plurality of points in at least one of the feature space and the data space,
Output means for outputting a recognition processing result based on the result of the comparison;
An information processing device comprising:

Holding means having a plurality of learning data,
Selecting means for selecting auxiliary data corresponding to recognition target data from the plurality of learning data;
Representative point generating means for generating a representative point from the learning data,
Calculating means for calculating a loss based on the representative point and the auxiliary data,
Learning means for performing a learning process based on the loss;
The information processing apparatus according to claim 1, further comprising:

An information processing apparatus comprising:
holding means for holding a plurality of learning data;
selecting means for selecting, from the plurality of learning data, auxiliary data corresponding to recognition target data;
first generation means for generating a representative point from the learning data;
calculating means for calculating a loss based on the representative point and the auxiliary data;
learning means for performing a learning process based on the loss;
acquisition means for acquiring recognition target data;
second generation means for generating a representative point of the recognition target data based on a result of the learning;
comparing means for comparing the recognition target data with the representative point of the recognition target data; and
output means for outputting a recognition processing result based on a result of the comparison.

The information processing apparatus according to claim 2 or 3, wherein the selecting means selects, as the auxiliary data, data existing in the vicinity of the recognition target data.
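
As a non-limiting illustration, selecting auxiliary data "in the vicinity" of the recognition target data might be sketched as a k-nearest-neighbour lookup. The function name `select_auxiliary_data`, the parameter `k`, and the use of Euclidean distance are assumptions made for this sketch; the claim does not specify how vicinity is measured.

```python
import math
from typing import List, Sequence


def select_auxiliary_data(
    target: Sequence[float],
    learning_data: Sequence[Sequence[float]],
    k: int = 3,
) -> List[List[float]]:
    """Return the k learning samples closest to the target (Euclidean distance).

    Hypothetical selecting means: vicinity is assumed to mean the k nearest
    neighbours among the held learning data.
    """
    ranked = sorted(learning_data, key=lambda sample: math.dist(sample, target))
    return [list(sample) for sample in ranked[:k]]


if __name__ == "__main__":
    held = [[5.0, 5.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    print(select_auxiliary_data([0.0, 0.0], held, k=2))
```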

The information processing apparatus according to claim 2, wherein
the calculating means calculates the loss based on a distance between the representative point and the auxiliary data, and
the learning means performs a learning process that reduces the distance.

The information processing apparatus according to claim 5, wherein
the generating means generates a plurality of representative points from the learning data,
the calculating means determines normality based on distances between the plurality of representative points and the auxiliary data, and
the learning means performs the learning process so as to reduce the distances when the determination is normal, and so as to increase the distances when the determination is not normal.
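
As a non-limiting illustration, a distance-based loss that is reduced for the normal case and increased for the non-normal case might be sketched as a contrastive-style loss. The names `distance_loss` and `update_representative_point`, the `margin` hinge, and the learning rate are assumptions of this sketch. For simplicity it moves the representative point itself by one gradient step, whereas an actual learning process would more likely update the parameters of whatever model generates the point.

```python
import math
from typing import List

Vector = List[float]


def distance_loss(rep: Vector, aux: Vector, is_normal: bool, margin: float = 2.0) -> float:
    """Squared distance when normal; squared hinge on the margin otherwise (assumed form)."""
    d = math.dist(rep, aux)
    if is_normal:
        return d * d
    return max(0.0, margin - d) ** 2


def update_representative_point(
    rep: Vector,
    aux: Vector,
    is_normal: bool,
    lr: float = 0.1,
    margin: float = 2.0,
) -> Vector:
    """Take one gradient-descent step on distance_loss with respect to rep."""
    d = math.dist(rep, aux)
    if is_normal:
        scale = 2.0  # gradient of d^2 is 2 * (rep - aux)
    elif 0.0 < d < margin:
        scale = -2.0 * (margin - d) / d  # gradient of (margin - d)^2
    else:
        scale = 0.0  # already beyond the margin (or undefined direction at d == 0)
    return [r - lr * scale * (r - a) for r, a in zip(rep, aux)]


if __name__ == "__main__":
    rep, aux = [1.0, 0.0], [0.0, 0.0]
    print(update_representative_point(rep, aux, is_normal=True))   # moves toward aux
    print(update_representative_point(rep, aux, is_normal=False))  # moves away from aux
```

The margin keeps the non-normal branch bounded; without some such bound, increasing the distance would have no minimum. That bound is a design choice of this sketch and is not stated in the claims.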

The information processing apparatus according to claim 2, wherein
the generating means generates a plurality of representative points from the learning data, and
the calculating means calculates the loss based on a distribution of the plurality of representative points and the auxiliary data.

The information processing apparatus according to claim 2, wherein the calculating means calculates the loss based on a weight given to the representative point and the auxiliary data.

The information processing apparatus according to claim 2, wherein the representative point generating means generates the representative point based on a generative adversarial network model.

The information processing apparatus according to claim 2, wherein the representative point generating means generates a representative point for each of a plurality of layers.

The information processing apparatus according to claim 2, wherein the representative point generating means generates the representative point from time-series data.

An information processing method executed by an information processing apparatus, the method comprising:
an acquisition step of acquiring recognition target data;
a mapping step of mapping the recognition target data to a feature space;
a generation step of generating a plurality of points in at least one of the feature space and a data space of the recognition target data;
a comparison step of comparing the recognition target data with the plurality of points in at least one of the feature space and the data space; and
an output step of outputting a recognition processing result based on a result of the comparison.

An information processing method executed by an information processing apparatus, the method comprising:
a selection step of selecting, from a plurality of held learning data, auxiliary data corresponding to recognition target data;
a first generation step of generating a representative point from the learning data;
a calculation step of calculating a loss based on the representative point and the auxiliary data;
a learning step of performing a learning process based on the loss;
an acquisition step of acquiring recognition target data;
a second generation step of generating a representative point of the recognition target data based on a result of the learning;
a comparison step of comparing the recognition target data with the representative point of the recognition target data; and
an output step of outputting a recognition processing result based on a result of the comparison.

A program for causing a computer to function as each means of the information processing apparatus according to claim 1.