JP7166415B1 - Feature extractor

Info

Publication number: JP7166415B1
Application number: JP2021158317A
Authority: JP
Inventors: 淳也古賀; 寛明澤戸; 求島山
Original assignee: Ｐｃｉソリューションズ株式会社
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2022-11-07
Anticipated expiration: 2041-09-28
Also published as: JP2023048794A

Abstract

【課題】移動体の様々な動きを精度良く認識するため、入力データから特徴量を抽出する特徴量抽出装置を提供する。【解決手段】特徴量抽出装置は、解像度別差分データを作成する時系列画像データ特徴量分配部２０と、３次元畳み込み演算を実行して画像サイズ特徴量を抽出する特徴量抽出部１０と、画像サイズ特徴量の各々に対して特徴量を分配し、連結して画像サイズ連結特徴量を生成する特徴量分配連結部３０と、画像サイズ特徴量の各々に対して、重要度の重み付けを行う重要度判断部４０と、を備えている。特徴量抽出部１０は、特徴量連結生成器１１ａ～１１ｄを複数接続することで、各解像度別画像サイズ特徴量を生成する。【選択図】図７ [Problem] To provide a feature quantity extraction device that extracts feature quantities from input data in order to accurately recognize various movements of a moving object. [Solution] The feature quantity extraction device includes: a time-series image data feature quantity distribution unit 20 that creates resolution-specific difference data; a feature quantity extraction unit 10 that executes a three-dimensional convolution operation to extract image size feature quantities; a feature quantity distribution connection unit 30 that distributes feature quantities for each of the image size feature quantities and concatenates them to generate image size concatenated feature quantities; and an importance determination unit 40 that weights each of the image size feature quantities by importance. The feature quantity extraction unit 10 generates the image size feature quantity for each resolution by connecting a plurality of feature quantity connection generators 11a to 11d. [Selected drawing] FIG. 7

Description

本発明は、動画等の入力データからその特徴量を抽出する特徴量抽出装置に関する。 The present invention relates to a feature quantity extraction device for extracting feature quantities from input data such as moving images.

従来、移動物体の動きを検知する方法として、動画データに含まれるフレームの画素値の差分を算出して二値化処理を行う方法や、フレーム内から検知対象の画像の特徴を抽出して時系列的にその特徴を追跡する方法等が知られている。 Conventionally known methods for detecting the movement of a moving object include calculating the difference in pixel values between frames of video data and binarizing the result, and extracting the features of the target image from within a frame and tracking those features over time.

また、近年は、ディープニューラルネットワークを利用した機械学習手法が確立されている。特に、静止画や動画等の入力データから特徴量を抽出する手法として、畳み込みニューラルネットワーク（ＣＮＮ:Convolutional Neural Network）が利用されることが多い。 In recent years, machine learning methods using deep neural networks have also been established. In particular, a convolutional neural network (CNN) is often used as a technique for extracting feature amounts from input data such as still images and moving images.

畳み込みニューラルネットワークは、一対の畳み込み層（Convolution Layer）とプーリング層（Pooling Layer）からなる多層構造をなしており、畳み込み処理とダウンサンプリング処理を繰り返すことで、入力データからその特徴量を抽出する。抽出された特徴量は、物体認識、物体検出、画像変換等の様々な目的で利用される。また、入力データからより良い特徴量を抽出するため、畳み込みニューラルネットワークの構造や内部の処理方法に様々な工夫がなされている。 A convolutional neural network has a multi-layered structure consisting of a pair of convolution layers and pooling layers, and by repeating convolution processing and downsampling processing, the features are extracted from the input data. The extracted feature amount is used for various purposes such as object recognition, object detection, and image conversion. In order to extract better features from input data, various improvements have been made to the structure and internal processing methods of convolutional neural networks.

例えば、下記の特許文献１の画像情報変換器では、複数のマルチスケール変換器を連結している。そして、特徴量生成部及び画像情報生成部において、畳み込み演算によるスケールの異なる解像度の特徴量抽出と、異なるスケールへの振り分けとを繰り返し実行する。画像情報変換器は、異なるスケールの特徴を組み合わせることで、画像情報の複雑な特徴を抽出することができる（特許文献１／段落００１１、図１）。 For example, in the image information converter disclosed in Patent Document 1 below, a plurality of multiscale converters are connected. Then, in the feature quantity generation section and the image information generation section, feature quantity extraction with resolutions of different scales by convolution operation and sorting to different scales are repeatedly executed. An image information transformer can extract complex features of image information by combining features of different scales (Patent Document 1/paragraph 0011, FIG. 1).

特許文献１：特開２０１９－１２８８８９号公報 Patent Document 1: JP 2019-128889 A

しかしながら、特許文献１の手法では、各解像度に応じた特徴量が最終的なデータに反映されていないため、移動物体の大まかな動きは認識できるが、細かな動きは正確に認識できない等の問題が生じる可能性があった。 However, in the method of Patent Document 1, the feature quantities corresponding to each resolution are not reflected in the final data, so problems could arise: for example, the rough movement of a moving object can be recognized, but its fine movements cannot be recognized accurately.

本発明は、このような事情に鑑みてなされたものであり、入力データからその特徴量を精度良く抽出する特徴量抽出装置を提供することを目的とする。 The present invention has been made in view of these circumstances, and its object is to provide a feature quantity extraction device that accurately extracts feature quantities from input data.

本発明の特徴量抽出装置は、時系列画像データのフレーム間差分を計算し、前記時系列画像データを解像度別に分配して、解像度別差分データを作成する時系列画像データ特徴量分配部と、前記時系列画像データ及び／又は前記解像度別差分データに対して３次元畳み込み演算を実行して、画像サイズ特徴量を抽出する特徴量抽出部と、前記特徴量抽出部から入力される複数の前記画像サイズ特徴量の各々に対して解像度別に特徴量を分配した画像サイズ特徴量と、前記画像サイズ特徴量を連結した画像サイズ連結特徴量とを生成する特徴量分配連結部と、前記画像サイズ特徴量の各々に対して、機械学習で得られたパラメータにより決定される数値に応じた重み付けを行う重要度判断部と、を備え、
前記特徴量抽出部は、前記画像サイズ連結特徴量と前記解像度別差分データとを連結して新たな画像サイズ特徴量を生成する特徴量連結生成器を複数有し、前記特徴量連結生成器を複数接続して前記画像サイズ特徴量のそれぞれを生成することを特徴とする。 The feature quantity extraction device of the present invention comprises: a time-series image data feature quantity distribution unit that calculates inter-frame differences of time-series image data, distributes the time-series image data by resolution, and creates resolution-specific difference data; a feature quantity extraction unit that executes a three-dimensional convolution operation on the time-series image data and/or the resolution-specific difference data to extract image size feature quantities; a feature quantity distribution connection unit that generates, for each of the plurality of image size feature quantities input from the feature quantity extraction unit, image size feature quantities in which the feature quantities are distributed by resolution, and image size concatenated feature quantities in which the image size feature quantities are concatenated; and an importance determination unit that weights each of the image size feature quantities according to a numerical value determined by parameters obtained through machine learning,
wherein the feature quantity extraction unit has a plurality of feature quantity connection generators, each of which concatenates the image size concatenated feature quantity with the resolution-specific difference data to generate a new image size feature quantity, and generates each of the image size feature quantities by connecting the plurality of feature quantity connection generators.

本発明において、特徴量抽出部は、特徴量連結生成器を複数連結した構造となっており、３次元畳み込み演算により複数の画像サイズ特徴量を抽出する。また、特徴量抽出部は、抽出した画像サイズ特徴量と、時系列画像データ特徴量分配部で作成された解像度別差分データとを画像サイズ別に連結することで、新たな画像サイズ特徴量を生成する。 In the present invention, the feature quantity extraction unit has a structure in which a plurality of feature quantity connection generators are connected, and extracts a plurality of image size feature quantities by a three-dimensional convolution operation. In addition, the feature quantity extraction unit generates new image size feature quantities by concatenating, for each image size, the extracted image size feature quantities with the resolution-specific difference data created by the time-series image data feature quantity distribution unit.

特徴量分配連結部は、特徴量抽出部からの画像サイズ特徴量を連結した画像サイズ連結特徴量を生成する。さらに、重要度判断部は、画像サイズ特徴量の各々に対して前記パラメータから決定される数値に応じて重み付けをする。これにより、解像度別に特徴量をまとめた画像サイズ特徴量が生成され、これらは機械学習に利用することができる。 The feature amount distribution connection unit generates an image size connection feature amount by connecting the image size feature amounts from the feature amount extraction unit. Further, the importance determination unit weights each of the image size feature values according to the numerical value determined from the parameters. As a result, image size feature quantities are generated by collecting feature quantities for each resolution, and these can be used for machine learning.

本発明の特徴量抽出装置において、前記特徴量連結生成器は、前記画像サイズ特徴量を生成するとき、前記特徴量分配連結部で生成された前記画像サイズ連結特徴量をさらに連結することが好ましい。 In the feature quantity extraction device of the present invention, it is preferable that the feature quantity connection generator, when generating the image size feature quantity, further concatenates the image size concatenated feature quantity generated by the feature quantity distribution connection unit.

特徴量抽出部の特徴量連結生成器は、新たな画像サイズ特徴量を生成するとき、特徴量分配連結部で生成された画像サイズ連結特徴量をさらに連結する。このため、より高解像度の情報を加えた画像サイズ特徴量を生成することができる。 The feature quantity concatenated generator of the feature quantity extraction unit further concatenates the image size concatenated feature quantity generated by the feature quantity distribution concatenation unit when generating a new image size feature quantity. Therefore, it is possible to generate an image size feature amount to which higher resolution information is added.

また、本発明の特徴量抽出装置において、前記時系列画像データ特徴量分配部は、特定サイズのフィルタを用いた平均化プーリング処理により前記解像度別差分データを作成する特徴量分配器を複数有し、前記特徴量分配器を複数接続して前記解像度別差分データを作成することが好ましい。 Further, in the feature amount extraction device of the present invention, the time-series image data feature amount distribution unit has a plurality of feature amount distributors that create the difference data by resolution by an average pooling process using a filter of a specific size. Preferably, a plurality of the feature amount distributors are connected to create the difference data by resolution.

時系列画像データ特徴量分配部は、特徴量分配器を複数連結した構造となっており、特定サイズのフィルタを用いて、時系列画像データの平均化プーリング処理を行う。これにより、大きさや移動量が異なる物体の認識のため、解像度別差分データを作成することができる。 The time-series image data feature quantity distribution unit has a structure in which a plurality of feature quantity distributors are connected, and performs average pooling processing of time-series image data using a filter of a specific size. As a result, it is possible to create difference data by resolution for recognizing objects having different sizes and moving amounts.

また、本発明の特徴量抽出装置において、前記特徴量分配連結部は、前記特徴量抽出部から入力される前記画像サイズ特徴量を解像度別にダウンサンプリングする畳み込み演算を行い、生成された前記画像サイズ特徴量を前記重要度判断部に伝達することが好ましい。 In the feature quantity extraction device of the present invention, it is also preferable that the feature quantity distribution connection unit performs a convolution operation that downsamples, by resolution, the image size feature quantities input from the feature quantity extraction unit, and transmits the generated image size feature quantities to the importance determination unit.

特徴量分配連結部は、解像度別に分離独立した経路を通過し、畳み込み演算が行われるため、各解像度の情報が保持された画像サイズ特徴量が生成され、これを重要度判断部に伝達することができる。 In the feature quantity distribution connection unit, the data pass through separate, independent paths for each resolution, where the convolution operations are performed. Image size feature quantities that retain the information of each resolution are therefore generated and can be transmitted to the importance determination unit.

また、本発明の特徴量抽出装置において、前記特徴量分配連結部は、前記畳み込み演算により生成された、同じ画像サイズの前記画像サイズ特徴量を連結して前記画像サイズ連結特徴量を生成することが好ましい。 In the feature quantity extraction device of the present invention, it is also preferable that the feature quantity distribution connection unit concatenates image size feature quantities of the same image size generated by the convolution operation to generate the image size concatenated feature quantity.

畳み込み演算を行うと、出力される画像サイズ特徴量は入力された画像サイズ特徴量からサイズ変更される。特徴量分配連結部は、同じ画像サイズの画像サイズ特徴量を連結して、新たな画像サイズ連結特徴量を生成することができる。 When the convolution operation is performed, the output image size feature amount is resized from the input image size feature amount. The feature quantity distribution and concatenation unit can concatenate image size feature quantities of the same image size to generate a new image size concatenated feature quantity.

また、本発明の特徴量抽出装置において、前記重要度判断部は、入力された前記画像サイズ特徴量を連結する特徴量連結器と、前記特徴量連結器の出力データを変換し、解像度の種類数Ｒのｋ倍（ｋ：チャネル数）のＲ・ｋ長ベクトルを出力する特徴量集約器と、前記特徴量集約器から出力された前記Ｒ・ｋ長ベクトルに対し、全結合層での処理により、その構成要素が各解像度の重要度を表すＲ長ベクトルを生成する解像度別重要度生成器と、前記解像度別重要度生成器で生成された前記Ｒ長ベクトルの構成要素の数値を、前記特徴量連結器から出力された値と掛け合わせるスケール器と、を備え、
前記画像サイズ特徴量のそれぞれに対し、各解像度を示すチャネルｋ個を１単位として前記重要度を算出し、重み付けを行うことが好ましい。 In the feature quantity extraction device of the present invention, it is also preferable that the importance determination unit comprises: a feature quantity connector that concatenates the input image size feature quantities; a feature quantity aggregator that converts the output data of the feature quantity connector and outputs an R·k-length vector, whose length is k times the number of resolution types R (k: number of channels); a resolution-specific importance generator that processes the R·k-length vector output from the feature quantity aggregator in a fully connected layer to generate an R-length vector whose components represent the importance of each resolution; and a scaler that multiplies the values of the components of the R-length vector generated by the resolution-specific importance generator by the values output from the feature quantity connector,
and that, for each of the image size feature quantities, the importance is calculated and the weighting is performed with the k channels representing each resolution treated as one unit.

重要度判断部では、特徴量連結器が入力された画像サイズ特徴量を連結し、連結したデータを特徴量集約器に出力する。特徴量集約器は、当該出力データから解像度の種類数（Ｒ）とチャネル数（ｋ）に応じたＲ・ｋ長ベクトルを出力し、これを解像度別重要度生成器に出力する。 In the importance determination unit, the feature amount coupler couples the input image size feature amounts, and outputs the coupled data to the feature amount aggregator. The feature aggregator outputs an R·k length vector corresponding to the number of resolution types (R) and the number of channels (k) from the output data, and outputs this to the resolution-specific importance generator.

また、解像度別重要度生成器は、当該Ｒ・ｋ長ベクトルを変換して、その構成要素が各解像度の重要度を表すＲ長ベクトルを生成し、これをスケール器に出力する。最後に、スケール器は、Ｒ長ベクトルの構成要素の数値と特徴量連結器から出力された数値と掛け合わせ、画像サイズ特徴量に対して重み付けを行う。これにより、重要度判断部は、解像度別の重要度によって重み付けがなされた最終的な特徴量データを抽出することができる。 Also, the resolution-by-resolution importance generator transforms the R·k length vector to generate an R length vector whose components represent the importance of each resolution, and outputs this to the scaler. Finally, the scaler multiplies the numerical value of the component of the R-length vector by the numerical value output from the feature quantity connector, and weights the image size feature quantity. Thereby, the importance determination unit can extract the final feature amount data weighted by the importance of each resolution.
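The four stages described above (connect, aggregate, generate per-resolution importance, scale) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the text here does not say how the aggregator reduces the data, so a global average per channel is assumed; the sigmoid, the fully connected weights, and all function names are hypothetical. Each channel is represented as a flat list of its values.

```python
import math

def aggregate(channels):
    """Feature aggregator (assumed: global average per channel) -> R*k-length vector."""
    return [sum(ch) / len(ch) for ch in channels]

def resolution_importance(vec, weights, biases):
    """Fully connected layer mapping the R*k-length vector to an R-length vector.
    Squashing with a sigmoid is an assumption; the text only names a fully connected layer."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(w * v for w, v in zip(w_row, vec)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

def scale(channels, importance, k):
    """Scaler: multiply each block of k channels by the importance of its resolution."""
    return [[importance[i // k] * x for x in ch] for i, ch in enumerate(channels)]

def importance_unit(channels, weights, biases, k):
    """channels: output of the feature quantity connector, R*k channels in total."""
    vec = aggregate(channels)
    imp = resolution_importance(vec, weights, biases)
    return scale(channels, imp, k), imp

# Hypothetical example: R = 2 resolutions, k = 2 channels each.
channels = [[1.0, 3.0], [2.0, 2.0], [4.0, 0.0], [6.0, 6.0]]
weights = [[0.0] * 4, [0.0] * 4]   # untrained placeholder weights
scaled, imp = importance_unit(channels, weights, [0.0, 0.0], k=2)
```

Treating k channels as one unit means a single scalar per resolution rescales all k channels of that resolution, which is what the integer division `i // k` expresses.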

本発明の実施形態に係る特徴量抽出装置の概要を説明する図。 FIG. 1 is a diagram outlining the feature quantity extraction device according to an embodiment of the present invention.
特徴量抽出部の概要を説明する図。 FIG. 2 is a diagram outlining the feature quantity extraction unit.
特徴量抽出部の詳細を説明する図。 FIG. 3 is a diagram explaining the details of the feature quantity extraction unit.
時系列画像データ特徴量分配部の概要を説明する図。 FIG. 4 is a diagram outlining the time-series image data feature quantity distribution unit.
時系列画像データ特徴量分配部での処理の詳細を説明する図。 FIG. 5 is a diagram explaining the details of the processing in the time-series image data feature quantity distribution unit.
特徴量分配連結部の概要を説明する図。 FIG. 6 is a diagram outlining the feature quantity distribution connection unit.
特徴量分配連結部の前後で行われる処理を説明する図。 FIG. 7 is a diagram explaining the processing performed before and after the feature quantity distribution connection unit.
重要度判断部の概要を説明する図。 FIG. 8 is a diagram outlining the importance determination unit.
重要度判断部の特徴量集約器を説明する図。 FIG. 9 is a diagram explaining the feature quantity aggregator of the importance determination unit.
重要度判断部の解像度別重要度生成器を説明する図。 FIG. 10 is a diagram explaining the resolution-specific importance generator of the importance determination unit.
重要度判断部のスケール器を説明する図。 FIG. 11 is a diagram explaining the scaler of the importance determination unit.

以下では、図面を参照しながら、本発明の実施形態に係る特徴量抽出装置１００を説明する。 A feature quantity extraction device 100 according to an embodiment of the present invention will be described below with reference to the drawings.

図１は、特徴量抽出装置１００の概要を示している。特徴量抽出装置１００は、車両等の移動体の動きを精度良く認識するため、時系列画像データから特徴量を抽出する。最終的に生成された特徴量データは、移動体の動きを認識する情報として利用することができる。特徴量抽出装置１００は、特徴量抽出部１０と、時系列画像データ特徴量分配部２０と、特徴量分配連結部３０と、重要度判断部４０とから構成されている。 FIG. 1 shows an outline of the feature quantity extraction device 100. The feature quantity extraction device 100 extracts feature quantities from time-series image data in order to accurately recognize the movement of a moving object such as a vehicle. The finally generated feature quantity data can be used as information for recognizing the movement of the moving object. The feature quantity extraction device 100 is composed of a feature quantity extraction unit 10, a time-series image data feature quantity distribution unit 20, a feature quantity distribution connection unit 30, and an importance determination unit 40.

（特徴量抽出部１０） (Feature quantity extraction unit 10)
特徴量抽出部１０は、機械学習の１つである畳み込みニューラルネットワーク（以下、ＣＮＮという）により、入力データＤ及び１／１解像度差分データＤ１（詳細は後述する）に対して３次元の畳み込み演算を実行し、特徴量を抽出する。 The feature quantity extraction unit 10 uses a convolutional neural network (hereinafter, CNN), which is one form of machine learning, to execute a three-dimensional convolution operation on the input data D and the 1/1 resolution difference data D1 (described in detail later) and thereby extract feature quantities.

特徴量抽出部１０は、コンボリューション層における畳み込み演算により、入力画像の画像サイズを徐々に浅い層から深い層に向かって縮小していくことで特徴量を抽出する。 The feature amount extraction unit 10 extracts feature amounts by gradually reducing the image size of the input image from the shallower layers to the deeper layers by the convolution operation in the convolution layer.

ここで、図２及び図３を参照して、特徴量抽出部１０の詳細について説明する。 Here, the details of the feature quantity extraction unit 10 will be described with reference to FIGS. 2 and 3.

図２に示すように、特徴量抽出部１０は特徴量連結生成器１１ａ～１１ｄを有し、これらを連結して構成されている。まず、特徴量連結生成器１１ａに、画像サイズが１（Ｔ，Ｈ，Ｗ，Ｃ）の差分データである１／１解像度差分データＤ１が入力される。ここで、「Ｔ」はＴｉｍｅ、「Ｈ」はＨｅｉｇｈｔ、「Ｗ」はＷｉｄｔｈ、「Ｃ」はＣｈａｎｎｅｌを意味し、それぞれ特徴量データの構成要素である。 As shown in FIG. 2, the feature quantity extraction unit 10 has feature quantity connection generators 11a to 11d and is configured by connecting them. First, the 1/1 resolution difference data D1, which is difference data with an image size of 1 (T, H, W, C), is input to the feature quantity connection generator 11a. Here, "T" means Time, "H" means Height, "W" means Width, and "C" means Channel; each is a component of the feature quantity data.

その後、特徴量連結生成器１１ａは、学習済みパラメータを用いた畳み込み演算により、画像サイズが１／２（Ｔ，Ｈ／２，Ｗ／２，Ｃ）の１／２画像サイズ特徴量Ｆ１を生成する。学習済みパラメータとは、出力精度を高めるため、ニューラルネットワークの各層が有する「重み」と「バイアス」のことである。 The feature quantity connection generator 11a then generates the 1/2 image size feature quantity F1, whose image size is 1/2 (T, H/2, W/2, C), by a convolution operation using learned parameters. The learned parameters are the "weights" and "biases" that each layer of the neural network holds in order to improve output accuracy.

特徴量連結生成器１１ａは、特徴量連結器１２ａと、特徴量抽出器１３ａとで構成されている。特徴量連結器１２ａでは、入力データＤと１／１解像度画像データＤ１とが連結される。これ以降、画像データの「連結」とは、チャネル方向への連結を意味する。また、特徴量抽出器１３ａが１／２画像サイズ特徴量Ｆ１を生成する。 The feature quantity connection generator 11a is composed of a feature quantity connector 12a and a feature quantity extractor 13a. The input data D and the 1/1 resolution image data D1 are connected in the feature quantity concatenator 12a. Hereinafter, "concatenation" of image data means concatenation in the channel direction. Also, the feature amount extractor 13a generates a 1/2 image size feature amount F1.
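To make "concatenation in the channel direction" concrete, the following minimal sketch joins two feature tensors stored as nested Python lists indexed [t][h][w][c]. The function name and the list-based layout are assumptions chosen for illustration; an actual implementation would normally use a tensor library.

```python
def concat_channels(a, b):
    """Concatenate two (T, H, W, C) nested lists along the channel axis.

    T, H and W must match; only the channel count grows (C_a + C_b).
    """
    if (len(a), len(a[0]), len(a[0][0])) != (len(b), len(b[0]), len(b[0][0])):
        raise ValueError("T, H and W must match before channel concatenation")
    return [[[pa + pb for pa, pb in zip(ra, rb)]   # join the per-pixel channel lists
             for ra, rb in zip(fa, fb)]            # rows of one frame
            for fa, fb in zip(a, b)]               # frames

# Hypothetical 1-frame, 1x2-pixel example: input data D and difference data D1,
# one channel each, become a 2-channel tensor.
d = [[[[1], [2]]]]
d1 = [[[[10], [20]]]]
joined = concat_channels(d, d1)
```

Because only the last axis grows, the two inputs must already share the same image size, which is why each connector pairs data of the same resolution.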

図３は、特徴量抽出器１３ａの詳細を示している。特徴量抽出器１３ａは、３次元畳み込み層と、空間ｄｓ（ダウンサンプリング）３次元畳み込み層とで構成されている。カッコ内は、それぞれフィルタサイズ、ストライド数、入力チャネル数、出力チャネル数を示す。なお、この３次元畳み込み層には、（フィルタサイズ×入力チャネル数×出力チャネル数）個の学習済みパラメータが存在する。 FIG. 3 shows details of the feature quantity extractor 13a. The feature quantity extractor 13a is composed of a three-dimensional convolutional layer and a spatial ds (down-sampling) three-dimensional convolutional layer. Parentheses indicate the filter size, the number of strides, the number of input channels, and the number of output channels, respectively. This three-dimensional convolutional layer has (filter size*number of input channels*number of output channels) learned parameters.
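As a quick check of the parameter count stated above, the sketch below multiplies filter size, input channels, and output channels. The 3×3×3 filter is an assumed value (the actual filter size appears only in the figure); the channel counts 2 and 32 are those given for the first three-dimensional convolution layer of the feature quantity extractor 13a. The function name is hypothetical.

```python
def conv3d_weight_count(filter_size, in_channels, out_channels):
    """Number of learned weights: (filter size) x (input channels) x (output channels).

    Bias terms are not counted here; if a bias is used, it would add
    out_channels further parameters.
    """
    ft, fh, fw = filter_size
    return ft * fh * fw * in_channels * out_channels

# Assumed 3x3x3 filter, 2 input channels, 32 output channels.
n_weights = conv3d_weight_count((3, 3, 3), 2, 32)  # 27 * 2 * 32 = 1728
```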

また、「ＲｅＬＵ（Rectified Linear Unit）」は、活性化関数（層間をどのように電気信号を伝搬させるかを調整する関数）の１つである。必要に応じて、バッチ正規化層（Batch Normalization）を追加してもよい。 Also, "ReLU (Rectified Linear Unit)" is one of the activation functions (functions for adjusting how an electrical signal is propagated between layers). A batch normalization layer may be added if desired.

入力データＤは、その形状が（Ｔ，Ｈ，Ｗ）、チャネル数が「１」のグレースケールの動画データである（shape=（Ｔ，Ｈ，Ｗ，１））。時系列画像データ特徴量分配部２０から特徴量連結器１２ａに延びる矢印の添え字はチャネル数を意味する（ここでは「１」）。入力データＤと時系列画像データ特徴量分配部２０からの差分データが特徴量連結器１２ａで連結され、特徴量抽出器１３ａの３次元畳み込み層に入力される（入力チャネル数「２」）。 The input data D is grayscale video data having a shape of (T, H, W) and a channel number of "1" (shape=(T, H, W, 1)). The subscript of the arrow extending from the time-series image data feature quantity distributor 20 to the feature quantity connector 12a means the number of channels (here, "1"). The input data D and the difference data from the time-series image data feature quantity distribution unit 20 are connected by the feature quantity connector 12a and input to the three-dimensional convolution layer of the feature quantity extractor 13a (the number of input channels is "2").

なお、３次元畳み込み層で用いる出力チャネル数を「３２」としているが、これは予め設定した任意の値である。特徴量連結器１２ａで連結された特徴量は、３次元畳み込み層により一度チャネル数が「３２」に拡張され、次段の空間ｄｓ３次元畳み込み層のフィルタ数を「ｋ（設定値）」に絞るため、特徴量抽出器１３ａの出力チャネル数は「ｋ」となる。ここでは、入力となる特徴量のＦ,Ｈ,Ｗ方向へゼロパディング処理を実行して、同方向軸への入力と出力のサイズが同じになるようしている。 Although the number of output channels used in the three-dimensional convolutional layer is set to "32", this is an arbitrary value set in advance. The feature quantity connected by the feature quantity connector 12a is once expanded to the number of channels "32" by the three-dimensional convolution layer, and the number of filters in the spatial ds three-dimensional convolution layer of the next stage is narrowed down to "k (set value)". Therefore, the number of output channels of the feature quantity extractor 13a is "k". Here, zero-padding processing is executed in the F, H, and W directions of the input feature amount so that the sizes of the input and output on the same direction axes are the same.

「ｋ」は４,８,１６等のより小さな値が好ましい。また、空間ｄｓ３次元畳み込み層では、stride=(１,２,２)（それぞれＴ,Ｈ,Ｗ）に設定されていることでＨ,Ｗ方向のみ縮小を行い、Ｔに関しては入力と出力とが同値になる。なお、後述する解像度別特徴量分配器３１ａ～３１ｃ,３２ａ～３２ｂ，３３ａも同様で、このストライド設定により特徴量抽出部１０、特徴量分配連結部３０で扱われる特徴量の時間軸がＴで維持される。 "k" is preferably a relatively small value such as 4, 8, or 16. In the spatial ds three-dimensional convolution layer, the stride is set to (1, 2, 2) (for T, H, and W, respectively), so only the H and W directions are reduced, and T has the same value at the input and the output. The same applies to the resolution-specific feature quantity distributors 31a to 31c, 32a to 32b, and 33a described later; with this stride setting, the time axis of the feature quantities handled by the feature quantity extraction unit 10 and the feature quantity distribution connection unit 30 is maintained at T.
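The effect of the stride setting on the tensor shape can be checked with a small sketch. It assumes the common "same" zero-padding convention, in which each output axis has length ceil(input / stride); the text only states that zero padding is applied, so this convention, the function name, and the example sizes are assumptions.

```python
import math

def same_padded_output_shape(shape, stride, out_channels):
    """Output shape of a zero-padded 3D convolution.

    shape is (T, H, W, C) and stride is (sT, sH, sW). Under the assumed
    "same" padding convention, each axis becomes ceil(size / stride).
    """
    t, h, w, _ = shape
    st, sh, sw = stride
    return (math.ceil(t / st), math.ceil(h / sh), math.ceil(w / sw), out_channels)

# Hypothetical sizes: T = 8 frames, 64x64 pixels, 32 channels in, k = 8 channels out.
out_shape = same_padded_output_shape((8, 64, 64, 32), (1, 2, 2), 8)  # (8, 32, 32, 8)
```

With stride (1, 2, 2), the first component of the result is always equal to T, which matches the statement that the time axis is preserved.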

図２に戻り、次段の特徴量連結生成器１１ｂに、特徴量連結生成器１１ａによって生成された１／２画像サイズ特徴量Ｆ１が入力される。そして、畳み込み演算により画像サイズが１／４（Ｔ，Ｈ／４，Ｗ／４，Ｃ）の１／４画像サイズ特徴量Ｆ２が生成される。 Returning to FIG. 2, the 1/2 image size feature quantity F1 generated by the feature quantity concatenated generator 11a is input to the feature quantity concatenated generator 11b in the next stage. Then, a 1/4 image size feature amount F2 whose image size is 1/4 (T, H/4, W/4, C) is generated by a convolution operation.

特徴量連結生成器１１ｂは、特徴量連結器１２ｂと特徴量抽出器１３ｂとで構成されている。特徴量連結器１２ｂは、同じ画像サイズの１／２画像サイズ特徴量Ｆ１と、時系列画像データ特徴量分配部２０からの１／２解像度差分データＤ２を連結する。また、特徴量抽出器１３ｂが１／４画像サイズ特徴量Ｆ２を生成する。 The feature quantity connection generator 11b is composed of a feature quantity connector 12b and a feature quantity extractor 13b. The feature quantity connector 12b concatenates the 1/2 image size feature quantity F1 and the 1/2 resolution difference data D2 from the time-series image data feature quantity distribution unit 20, which have the same image size. The feature quantity extractor 13b then generates the 1/4 image size feature quantity F2.

図３に示すように、特徴量抽出器１３ｂは、３次元畳み込み層と空間ｄｓ３次元畳み込み層とで構成されている。なお、特徴量抽出器１３ａからの１／２画像サイズ特徴量Ｆ１（チャネル数「ｋ」）と時系列画像データ特徴量分配部２０からの１／２解像度差分データＤ２（チャネル数「１」）が特徴量連結器１２ｂで連結され、特徴量抽出器１３ｂの３次元畳み込み層に入力される（入力チャネル数「ｋ＋１」）。 As shown in FIG. 3, the feature quantity extractor 13b is composed of a three-dimensional convolutional layer and a spatial ds three-dimensional convolutional layer. 1/2 image size feature quantity F1 (number of channels "k") from feature quantity extractor 13a and 1/2 resolution difference data D2 (number of channels "1") from time-series image data feature quantity distribution section 20 are connected by the feature quantity connector 12b and input to the three-dimensional convolution layer of the feature quantity extractor 13b (the number of input channels is "k+1").

さらに、次段の特徴量連結生成器１１ｃに、特徴量連結生成器１１ｂによって生成された１／４画像サイズ特徴量Ｆ２が入力される。そして、畳み込み演算により画像サイズが１／８（Ｔ，Ｈ／８，Ｗ／８，Ｃ）である１／８画像サイズ特徴量Ｆ３が生成される。 Further, the 1/4 image size feature quantity F2 generated by the feature quantity concatenated generator 11b is input to the feature quantity concatenated generator 11c in the next stage. Then, a ⅛ image size feature amount F3 whose image size is ⅛ (T, H/8, W/8, C) is generated by a convolution operation.

特徴量連結生成器１１ｃは、特徴量連結器１２ｃと、特徴量抽出器１３ｃとで構成されている。特徴量連結器１２ｃは、同じ画像サイズの１／４画像サイズ特徴量Ｆ２と、時系列画像データ特徴量分配部２０からの１／４解像度差分データＤ３と、特徴量分配連結部３０からの１／４画像サイズ連結特徴量Ｇ１（詳細は後述する）とを連結する。また、特徴量抽出器１３ｃが１／８画像サイズ特徴量Ｆ３を生成する。 The feature quantity connection generator 11c is composed of a feature quantity connector 12c and a feature quantity extractor 13c. The feature quantity connector 12c concatenates the 1/4 image size feature quantity F2, the 1/4 resolution difference data D3 from the time-series image data feature quantity distribution unit 20, and the 1/4 image size concatenated feature quantity G1 from the feature quantity distribution connection unit 30 (described in detail later), all of which have the same image size. The feature quantity extractor 13c then generates the 1/8 image size feature quantity F3.

図３に示すように、特徴量抽出器１３ｃは、３次元畳み込み層と空間ｄｓ３次元畳み込み層とで構成されている。なお、特徴量抽出器１３ｂからの１／４画像サイズ特徴量Ｆ２（チャネル数「ｋ」）と、時系列画像データ特徴量分配部２０からの１／４解像度差分データＤ３（チャネル数「１」）と、特徴量分配連結部３０からの１／４画像サイズ連結特徴量Ｇ１（チャネル数「ｋ」）が特徴量連結器１２ｃで連結され、特徴量抽出器１３ｃの３次元畳み込み層に入力される（入力チャネル数「２ｋ＋１」）。 As shown in FIG. 3, the feature quantity extractor 13c is composed of a three-dimensional convolution layer and a spatial ds three-dimensional convolution layer. The 1/4 image size feature quantity F2 (number of channels "k") from the feature quantity extractor 13b, the 1/4 resolution difference data D3 (number of channels "1") from the time-series image data feature quantity distribution unit 20, and the 1/4 image size concatenated feature quantity G1 (number of channels "k") from the feature quantity distribution connection unit 30 are concatenated by the feature quantity connector 12c and input to the three-dimensional convolution layer of the feature quantity extractor 13c (number of input channels "2k+1").

最終段の特徴量連結生成器１１ｄには、特徴量連結生成器１１ｃによって生成された１／８画像サイズ特徴量Ｆ３が入力される。そして、畳み込み演算により画像サイズが１／１６（Ｔ，Ｈ／１６，Ｗ／１６，Ｃ）の１／１６画像サイズ特徴量Ｆ４が生成される。 The ⅛ image size feature quantity F3 generated by the feature quantity concatenated generator 11c is input to the feature quantity concatenated generator 11d at the final stage. Then, a 1/16 image size feature amount F4 whose image size is 1/16 (T, H/16, W/16, C) is generated by the convolution operation.

特徴量連結生成器１１ｄは、特徴量連結器１２ｄと、特徴量抽出器１３ｄとで構成されている。特徴量連結器１２ｄは、同じ画像サイズの１／８画像サイズ特徴量Ｆ３と、時系列画像データ特徴量分配部２０からの１／８解像度差分データＤ４と、特徴量分配連結部３０からの１／８画像サイズ連結特徴量Ｇ２（詳細は後述する）とを連結する。そして、特徴量抽出器１３ｄが１／１６画像サイズ特徴量Ｆ４を生成し、重要度判断部４０に出力する。 The feature quantity connection generator 11d is composed of a feature quantity connector 12d and a feature quantity extractor 13d. The feature quantity connector 12d concatenates the 1/8 image size feature quantity F3, the 1/8 resolution difference data D4 from the time-series image data feature quantity distribution unit 20, and the 1/8 image size concatenated feature quantity G2 from the feature quantity distribution connection unit 30 (described in detail later), all of which have the same image size. The feature quantity extractor 13d then generates the 1/16 image size feature quantity F4 and outputs it to the importance determination unit 40.

図３に示すように、特徴量抽出器１３ｄは、３次元畳み込み層と空間ｄｓ３次元畳み込み層とで構成されている。なお、特徴量抽出器１３ｃからの画像サイズ特徴量Ｆ３（チャネル数「ｋ」）と時系列画像データ特徴量分配部２０からの画像データＤ４（チャネル数「１」）と特徴量分配連結部３０からの画像データＧ２（チャネル数「２ｋ」）が特徴量連結器１２ｄで連結され、特徴量抽出器１３ｄの３次元畳み込み層に入力される（入力チャネル数「３ｋ＋１」）。 As shown in FIG. 3, the feature quantity extractor 13d is composed of a three-dimensional convolution layer and a spatial ds three-dimensional convolution layer. The image size feature quantity F3 (number of channels "k") from the feature quantity extractor 13c, the difference data D4 (number of channels "1") from the time-series image data feature quantity distribution unit 20, and the concatenated feature quantity G2 (number of channels "2k") from the feature quantity distribution connection unit 30 are concatenated by the feature quantity connector 12d and input to the three-dimensional convolution layer of the feature quantity extractor 13d (number of input channels "3k+1").
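The input channel counts given for the four feature quantity connectors can be tabulated with a short sketch. The channel counts of the individual inputs are taken from the description; the function name and the dictionary layout are hypothetical.

```python
def connector_input_channels(k):
    """Input channel count at each feature quantity connector for a given k.

    Each entry sums the channels of the tensors that the connector
    concatenates, as listed in the description.
    """
    return {
        "12a": 1 + 1,          # input data D (1) + difference data D1 (1)
        "12b": k + 1,          # F1 (k) + D2 (1)
        "12c": k + 1 + k,      # F2 (k) + D3 (1) + G1 (k)      -> 2k + 1
        "12d": k + 1 + 2 * k,  # F3 (k) + D4 (1) + G2 (2k)     -> 3k + 1
    }

# With k = 8 (one of the values named as preferable):
counts = connector_input_channels(8)  # {"12a": 2, "12b": 9, "12c": 17, "12d": 25}
```

In every stage the difference data contributes exactly one channel, and every extractor narrows its output back to k channels.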

（時系列画像データ特徴量分配部２０） (Time-series image data feature quantity distribution unit 20)
時系列画像データ特徴量分配部２０は、１／１解像度差分データＤ１に基づいて、特徴量抽出部１０の入力特徴量に応じた次元の画像情報に分配する装置である。 The time-series image data feature quantity distribution unit 20 is a device that, based on the 1/1 resolution difference data D1, distributes the data into image information whose dimensions match the input feature quantities of the feature quantity extraction unit 10.

１／１解像度画像データＤ１の画像サイズを縮小するためには、コンボリューション層による畳み込み演算ではなく、平均値を演算するプーリング層による平均化プーリング処理を実行することが好ましい。その際、プーリング実行時のストライドは、（Ｔ，Ｈ，Ｗ）＝（１，２，２）のように画像サイズのみが縮小されるように設定する。 In order to reduce the image size of the 1/1 resolution image data D1, it is preferable to perform an averaging pooling process by a pooling layer that calculates an average value instead of a convolution operation by a convolution layer. At this time, the stride during pooling is set so that only the image size is reduced, such as (T, H, W)=(1, 2, 2).

次に、図４及び図５を参照して、時系列画像データ特徴量分配部２０の詳細について説明する。 Next, the details of the time-series image data feature quantity distribution unit 20 will be described with reference to FIGS. 4 and 5.

図４に示すように、時系列画像データ特徴量分配部２０は、特徴量分配器２１ａ～２１ｃで構成されている。特徴量分配器２１ａは、例えば、時刻Ｔ１のときの画像フレームと、その後の時刻Ｔ２のときの画像フレームのフレーム間差分をとった１／１解像度差分データＤ１（Ｔ＝８であれば、８フレーム分）に対して、カーネルサイズが（１，２，２）、ストライドが（１，２，２）の平均化プーリング処理を実行する。これにより、特徴量分配器２１ａは、画像サイズが１／２の（Ｔ，Ｈ／２，Ｗ／２，Ｃ）の１／２解像度差分データＤ２を作成する。 As shown in FIG. 4, the time-series image data feature quantity distribution unit 20 is composed of feature quantity distributors 21a to 21c. The feature quantity distributor 21a executes average pooling with a kernel size of (1, 2, 2) and a stride of (1, 2, 2) on the 1/1 resolution difference data D1 (8 frames' worth if T = 8), which is obtained, for example, by taking the inter-frame difference between the image frame at time T1 and the image frame at the subsequent time T2. The feature quantity distributor 21a thereby creates the 1/2 resolution difference data D2, whose image size is 1/2 (T, H/2, W/2, C).
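The average pooling step with kernel (1, 2, 2) and stride (1, 2, 2) can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the data is assumed to be single-channel and stored as nested lists indexed [t][h][w], and the function name is hypothetical.

```python
def avg_pool_1x2x2(frames):
    """Average pooling with kernel (1, 2, 2) and stride (1, 2, 2) on [t][h][w] data.

    Each frame is pooled independently, so T is unchanged while H and W are
    halved. A trailing odd row or column, if any, is dropped.
    """
    pooled = []
    for frame in frames:
        rows = []
        for y in range(0, len(frame) - 1, 2):
            row = []
            for x in range(0, len(frame[y]) - 1, 2):
                total = (frame[y][x] + frame[y][x + 1]
                         + frame[y + 1][x] + frame[y + 1][x + 1])
                row.append(total / 4)   # mean of the 2x2 window
            rows.append(row)
        pooled.append(rows)
    return pooled

# Hypothetical 1-frame, 2x4 difference image -> 1-frame, 1x2 image.
d1 = [[[1, 3, 0, 0],
       [5, 7, 0, 4]]]
d2 = avg_pool_1x2x2(d1)  # [[[4.0, 1.0]]]
```

Applying the same function again to its own output corresponds to the next distributor in the chain, so each stage halves H and W once more.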

差分データは動画データのフレーム間差分であり、動作のない背景等の情報を除外し、フレーム間で変化のある移動体情報のみを残したものである。差分データは、移動体の形状、大きさ、その移動量等の変化パターンによって特徴的な空間情報を示す。平均化プーリング処理は段階的に実行されるため（図５参照）、作成される解像度別差分データのそれぞれは、物体の大きさや移動量に対して異なる挙動を示す。 The difference data is the frame-to-frame difference of the moving image data, and excludes information such as the background that does not move, leaving only moving object information that changes between frames. The difference data indicates characteristic spatial information by changing patterns such as the shape, size, and amount of movement of the moving object. Since the averaging pooling process is executed step by step (see FIG. 5), each of the generated difference data by resolution exhibits different behavior with respect to the size and movement amount of the object.

例えば、高解像度の差分データは、物体の移動の詳細（移動前後の位置情報等）や、複数の物体が同時に移動する場合にその特徴を捕らえることができる。なお、段階的な平均化プーリング処理を実行していく中で、その処理回数が少ないものほど高解像度の差分情報が残るため、「高解像度の差分データ」となる。また、平均化プーリング処理を繰り返すほど解像度が低下するため、「低解像度の差分データ」となる。 For example, the high-resolution differential data can capture the details of object movement (position information before and after movement, etc.) and the characteristics when multiple objects move simultaneously. Note that while the stepwise averaging pooling process is executed, the smaller the number of times the process is performed, the higher the resolution of the difference information remains, so it becomes "high resolution difference data". In addition, since the resolution decreases as the averaging pooling process is repeated, it becomes "low-resolution differential data".

低解像度の差分データは、物体な大まかな動きをより少ない情報で捕らえたり、逆に小さな動きを捕らえないようにしたりすることで移動量フィルタリングの役割を担うこともできる。また、低解像度の差分データは、撮影時の振動等によりフレーム間で小さなブレが生じる状況で、その位置ずれを吸収することができる。 The low-resolution differential data can also play the role of movement amount filtering by capturing rough movements of an object with less information, and conversely by not capturing small movements. In addition, the low-resolution difference data can absorb the positional deviation in a situation where small blurring occurs between frames due to vibration or the like during shooting.

A neural network that identifies an object (moving object) from this information can perform more advanced learning and inference.

FIG. 5 shows the details of the processing of the time-series image data feature quantity distribution unit 20. First, the difference between adjacent frames (output resolution: 1/1 (Full)) is calculated from the input data D (shape = (T, H, W, 1)). The feature quantity distributor 21a then performs the average pooling described above. The 1/2 resolution difference data D2 created by the feature quantity distributor 21a is output to the feature quantity extraction unit 10.

The next-stage feature quantity distributor 21b performs the average pooling process on the 1/2 resolution difference data D2 to create the 1/4 resolution difference data D3, whose resolution is 1/4 (T, H/4, W/4, C). The 1/4 resolution difference data D3 created by the feature quantity distributor 21b is output to the feature quantity extraction unit 10.

The final-stage feature quantity distributor 21c performs the average pooling process on the 1/4 resolution difference data D3 to create the 1/8 resolution difference data D4, whose resolution is 1/8 (T, H/8, W/8, C). The 1/8 resolution difference data D4 created by the feature quantity distributor 21c is output to the feature quantity extraction unit 10.
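The three-stage cascade described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the pooling window is assumed to be 2×2 with no overlap, H and W are assumed to be even at every stage, and the adjacent-frame difference is taken as `frames[1:] - frames[:-1]`, which shortens the time axis by one (the description does not state how the time length is handled). The function names are hypothetical.

```python
import numpy as np

def frame_difference(frames: np.ndarray) -> np.ndarray:
    """Adjacent-frame difference along the time axis of a (T, H, W, C) array."""
    return frames[1:] - frames[:-1]

def average_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Average each non-overlapping 2x2 spatial block of a (T, H, W, C) array."""
    t, h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even H and W"
    return x.reshape(t, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def difference_pyramid(frames: np.ndarray, stages: int = 3) -> list:
    """Return [full-resolution difference, D2 (1/2), D3 (1/4), D4 (1/8)] for stages=3."""
    levels = [frame_difference(frames.astype(np.float32))]
    for _ in range(stages):
        # Each distributor pools the output of the previous stage.
        levels.append(average_pool_2x2(levels[-1]))
    return levels

# Toy input: 5 frames of 16x16 single-channel video (hypothetical sizes).
rng = np.random.default_rng(0)
video = rng.random((5, 16, 16, 1))
pyramid = difference_pyramid(video)
shapes = [level.shape for level in pyramid]

# A video with no motion yields an all-zero difference at every level.
static_levels = difference_pyramid(np.ones((3, 4, 4, 1)), stages=1)
```

Each list element after the first corresponds to the output of one distributor (21a, 21b, 21c), so every level can be handed to the feature quantity extraction unit separately.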

Although FIG. 4 shows the feature quantity distributors arranged in three stages, a configuration with n stages (n ≥ 4) is also possible. In that case, the 1/n resolution difference data Dn (output resolution = 1/n) created by the n-th stage feature quantity distributor is output to the feature quantity extraction unit 10 (see FIG. 5).

(Feature quantity distribution/concatenation unit 30)
The feature quantity distribution/concatenation unit 30 distributes feature quantities by resolution for each of the image size feature quantities F1 to F3 input from the feature quantity extraction unit 10, and then concatenates the feature quantities by resolution to generate new image size feature quantities.

The feature quantity distribution/concatenation unit 30 also performs, on each of the image size feature quantities F1 to F3, a convolution operation that downsamples by resolution. By processing each resolution on a separate, independent path, the unit can pass image size feature quantities that retain the information of each resolution to the importance determination unit 40.

Next, the feature quantity distribution/concatenation unit 30 is described in detail with reference to FIGS. 6 and 7.

As shown in FIG. 6, the feature quantity distribution/concatenation unit 30 consists of a resolution-specific feature quantity transmitter 31 (resolution-specific feature quantity distributors 31a to 31c), a resolution-specific feature quantity transmitter 32 (resolution-specific feature quantity distributors 32a and 32b), a resolution-specific feature quantity transmitter 33 (resolution-specific feature quantity distributor 33a), and feature quantity concatenators 35a and 35b. In the feature quantity distribution/concatenation unit 30 of this embodiment, each resolution-specific feature quantity distributor extracts feature quantities while downsampling the image size, using a convolution operation whose spatial stride is set to 2 (stride (T, H, W) = (1, 2, 2)).

When the 1/2 image size feature quantity F1, which holds 1/1 resolution information, is input to the resolution-specific feature quantity transmitter 31, the resolution-specific feature quantity distributor 31a generates the 1/4 image size feature quantity F12, whose image size is 1/4. The feature quantity concatenator 35a then generates the 1/4 image size feature quantity G1, whose image size is 1/4, and outputs it to the feature quantity extraction unit 10 (feature quantity concatenator 12c). Although the feature quantity concatenator 35a exists formally, the only item to concatenate is the 1/4 image size feature quantity F12, so it performs no particular processing here.

The resolution-specific feature quantity distributor 31b performs a convolution operation to generate, from the 1/4 image size feature quantity F12, the 1/8 image size feature quantity F13, whose image size is 1/8.

Likewise, when the 1/4 image size feature quantity F2, which holds 1/2 resolution information, is input to the resolution-specific feature quantity transmitter 32, the resolution-specific feature quantity distributor 32a performs a convolution operation to generate, from the 1/4 image size feature quantity F2, the 1/8 image size feature quantity F23, whose image size is 1/8.

The feature quantity concatenator 35b then concatenates the 1/8 image size feature quantity F13 and the 1/8 image size feature quantity F23, which share the same 1/8 image size, to generate the 1/8 image size concatenated feature quantity G2, and outputs it to the feature quantity extraction unit 10 (feature quantity concatenator 12d).

The resolution-specific feature quantity distributor 31c performs a convolution operation to generate, from the 1/8 image size feature quantity F13, the 1/16 image size feature quantity F14, whose image size is 1/16. The 1/16 image size feature quantity F14 holds 1/1 resolution information.

The resolution-specific feature quantity distributor 32b performs a convolution operation to generate, from the 1/8 image size feature quantity F23, the 1/16 image size feature quantity F24, whose image size is 1/16. The 1/16 image size feature quantity F24 holds 1/2 resolution information. The 1/16 image size feature quantities F14 and F24 are output to the importance determination unit 40.

Likewise, when the 1/8 image size feature quantity F3, which holds 1/4 resolution information, is input to the resolution-specific feature quantity transmitter 33, the resolution-specific feature quantity distributor 33a performs a convolution operation to generate, from the 1/8 image size feature quantity F3, the 1/16 image size feature quantity F34, whose image size is 1/16. The 1/16 image size feature quantity F34 holds 1/4 resolution information.
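The routing through the three transmitters can be followed with simple bookkeeping. The sketch below tracks only metadata (image-size denominator, the resolution a path preserves, and channel count), not tensor values. The `Feature` class and the helper names are hypothetical, and it is assumed that F1 to F3 each carry k channels and that concatenation stacks features along the channel axis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    name: str
    size_denominator: int    # image size is 1/size_denominator of the input
    source_resolution: str   # resolution information the path preserves
    channels: int

def distribute(f: Feature, name: str) -> Feature:
    """One resolution-specific distributor: halve the image size, keep the channels."""
    return Feature(name, f.size_denominator * 2, f.source_resolution, f.channels)

def concatenate(name: str, *features: Feature) -> Feature:
    """Concatenate features of the same image size along the channel axis."""
    sizes = {f.size_denominator for f in features}
    assert len(sizes) == 1, "only features of the same image size can be concatenated"
    resolutions = "+".join(f.source_resolution for f in features)
    return Feature(name, sizes.pop(), resolutions, sum(f.channels for f in features))

k = 8  # hypothetical channel count
F1 = Feature("F1", 2, "1/1", k)
F2 = Feature("F2", 4, "1/2", k)
F3 = Feature("F3", 8, "1/4", k)

# Transmitter 31 (distributors 31a, 31b, 31c)
F12 = distribute(F1, "F12")
F13 = distribute(F12, "F13")
F14 = distribute(F13, "F14")
# Transmitter 32 (distributors 32a, 32b)
F23 = distribute(F2, "F23")
F24 = distribute(F23, "F24")
# Transmitter 33 (distributor 33a)
F34 = distribute(F3, "F34")

G1 = concatenate("G1", F12)        # concatenator 35a: single input, effectively a pass-through
G2 = concatenate("G2", F13, F23)   # concatenator 35b: two 1/8-size inputs
to_importance_unit = [F14, F24, F34]
```

Under these assumptions G2 carries 2k channels, and all three features that reach the importance determination unit share the 1/16 image size while each preserves a different source resolution.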

FIG. 7 illustrates the processing performed before and after the feature quantity distribution/concatenation unit 30. Each of the resolution-specific feature quantity distributors 31a to 31c, 32a, 32b, and 33a is a spatial-ds (downsampling) three-dimensional convolution layer (filter size 3×3×3, stride (1, 2, 2), number of input channels = k, number of output channels = k, ReLU). As before, a batch normalization layer may be added if necessary.
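As a rough check of what this layer configuration implies, the following sketch computes the output shape and the trainable parameter count of one such layer. Two points are assumptions rather than statements from the description: "same"-style padding (so each axis shrinks to ceil(size / stride)) and one bias term per output channel. The function names and the example sizes are hypothetical.

```python
import math

def downsample_conv_output_shape(shape, stride=(1, 2, 2)):
    """Output (T, H, W, k) of a 'same'-padded 3D convolution with k -> k channels."""
    t, h, w, k = shape
    return (
        math.ceil(t / stride[0]),
        math.ceil(h / stride[1]),
        math.ceil(w / stride[2]),
        k,
    )

def conv3d_parameter_count(k, filter_size=(3, 3, 3), use_bias=True):
    """Weights of a dense 3D convolution with k input and k output channels."""
    weights = filter_size[0] * filter_size[1] * filter_size[2] * k * k
    return weights + (k if use_bias else 0)

k = 8
f1_shape = (4, 32, 32, k)                              # hypothetical 1/2-size feature
f12_shape = downsample_conv_output_shape(f1_shape)     # after distributor 31a
f13_shape = downsample_conv_output_shape(f12_shape)    # after distributor 31b
f14_shape = downsample_conv_output_shape(f13_shape)    # after distributor 31c
params_per_layer = conv3d_parameter_count(k)
```

With a temporal stride of 1 the T axis is untouched, so only H and W are halved at each distributor.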

The resolution-specific feature quantity transmitter 31, which includes the resolution-specific feature quantity distributors 31a to 31c, holds the higher-resolution information (1/1 resolution information) and passes it to the importance determination unit 40, which makes it possible to detect small movements of a moving object within the imaging area. The resolution-specific feature quantity transmitter 32, which includes the resolution-specific feature quantity distributors 32a and 32b, holds the medium-resolution information (1/2 resolution information) and passes it to the importance determination unit 40. The resolution-specific feature quantity transmitter 33, which includes the resolution-specific feature quantity distributor 33a, holds the lower-resolution information (1/4 resolution information) and passes it to the importance determination unit 40, which makes it possible to detect large movements of a moving object within the imaging area. Naturally, depending on the length of the network, additional resolution-specific feature quantity transmitters become necessary.

Because feedforward from higher to lower resolution is maintained throughout the network (there are no connections from lower to higher resolution), each path in the resolution-specific feature quantity transmitters 31 to 33 retains the moving-object information at the resolution supplied by the feature quantity extraction unit 10.

(Importance determination unit 40)
The importance determination unit 40 weights the 1/16 image size feature quantities F14, F24, and F34 (see FIG. 6) and the like output from the feature quantity distribution/concatenation unit 30 according to an importance determined by learned parameters. The learned parameters are parameters trained so that an importance can be calculated for each resolution.

The importance determination unit 40 is described in detail below with reference to FIGS. 8 to 11.

As shown in FIG. 8, the importance determination unit 40 consists of a feature quantity concatenator 41, a feature quantity aggregator 42, a resolution-specific importance generator 43, and a scaler 44. The importance determination unit 40 receives the 1/16 image size feature quantities F14, F24, and F34 from the feature quantity distribution/concatenation unit 30 (see FIG. 6) and the 1/16 image size feature quantity F4 from the feature quantity extraction unit 10 (see FIG. 2). All of these are 1/16 image size feature quantities; the importance determination unit 40 weights each of the 1/16 image size feature quantities F4, F14, F24, and F34 according to its importance and produces the final feature quantity data.

The feature quantity concatenator 41 concatenates the image size feature quantities F4, F14, F24, and F34 and outputs the resulting concatenated feature quantity to the feature quantity aggregator 42, where it is used to calculate the importance. The feature quantity concatenator 41 also outputs the concatenated feature quantity to the scaler 44, where it is the target of the weighting.

The feature quantity aggregator 42 converts the concatenated feature quantity from the feature quantity concatenator 41 into an R·k-length vector. In doing so, it performs a depthwise three-dimensional convolution (a technique that keeps the input channels isolated from one another). This prevents channels holding information of different resolutions from being mixed or combined by the processing of the feature quantity aggregator 42.

FIG. 9 shows the details of the feature quantity aggregator 42. The feature quantity aggregator 42 consists of a depthwise three-dimensional convolution layer and a global average pooling layer. Because the depthwise 3D convolution layer performs the convolution channel by channel, the results of the filter operations are independent for each channel and do not cross. Here, each channel holds information of a different resolution, so the resolutions are processed separately. The global average pooling layer can be defined as a layer that takes the average of the values within each channel and arranges the averages in channel order, outputting a vector whose dimension equals the number of channels.

In the feature quantity aggregator 42, setting the filter size of the depthwise three-dimensional convolution layer to (T×1×1) compresses the T dimension to "1", that is, to data without depth (pointwise convolution). Specifically, the depthwise three-dimensional convolution is performed with a (T×1×1) filter without zero-padding the input concatenated feature quantity (padding=None). This reduces the information loss that the simple averaging in the subsequent global average pooling layer would otherwise cause.
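The aggregator's two steps can be sketched in NumPy. This is an illustrative sketch, not the patented implementation: it assumes a channels-last (T, H, W, C) layout with C = R·k, one temporal filter per channel with no bias or activation, and uses randomly drawn weights in place of learned ones. The function names are hypothetical.

```python
import numpy as np

def depthwise_temporal_conv(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Depthwise 3D convolution with a (T x 1 x 1) filter and no padding.

    x: (T, H, W, C); weights: (T, C), one temporal filter per channel.
    Each output channel depends only on the same input channel.
    Returns an array of shape (1, H, W, C).
    """
    out = np.einsum("thwc,tc->hwc", x, weights)
    return out[np.newaxis]

def global_average_pool(x: np.ndarray) -> np.ndarray:
    """Average every channel over the T, H, and W axes, returning shape (C,)."""
    return x.mean(axis=(0, 1, 2))

def aggregate(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Aggregator sketch: depthwise temporal convolution, then global average pooling."""
    return global_average_pool(depthwise_temporal_conv(x, weights))

R, k, T, H, W = 4, 2, 3, 4, 4                    # hypothetical sizes
rng = np.random.default_rng(0)
features = rng.random((T, H, W, R * k))
weights = rng.uniform(0.5, 1.5, size=(T, R * k))  # stand-in for learned filters

vector = aggregate(features, weights)

# Perturb channel 0 only: no other element of the output vector may change.
perturbed_input = features.copy()
perturbed_input[..., 0] += 1.0
perturbed = aggregate(perturbed_input, weights)
```

Because the temporal axis is collapsed by a weighted sum rather than a plain mean, the filter can emphasize particular frames before the global average is taken.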

FIG. 10 shows the details of the resolution-specific importance generator 43. The resolution-specific importance generator 43 processes the R·k-length vector input from the feature quantity aggregator 42 in a fully connected layer (one of the layers that make up a neural network) to generate an R-length vector (see FIG. 8). As its final step, the generator applies a sigmoid function to convert each value into the range 0 to 1; each component of the resulting vector represents the importance of the corresponding resolution.
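A minimal NumPy sketch of this step is shown below. It assumes a single fully connected layer with a bias term; the weight matrix and bias are random stand-ins for learned parameters, and the names `sigmoid` and `resolution_importance` are hypothetical.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Element-wise logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def resolution_importance(vector: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Fully connected layer (R*k -> R) followed by a sigmoid.

    vector: (R*k,), weight: (R*k, R), bias: (R,). Returns one importance per resolution.
    """
    return sigmoid(vector @ weight + bias)

R, k = 4, 2                                         # hypothetical sizes
rng = np.random.default_rng(0)
aggregated = rng.random(R * k)                      # stand-in for the aggregator output
weight = 0.1 * rng.standard_normal((R * k, R))      # stand-in for learned weights
bias = np.zeros(R)

importance = resolution_importance(aggregated, weight, bias)
```

The sigmoid scores each resolution independently, so several resolutions can receive a high importance at the same time.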

FIG. 11 shows the details of the scaler 44. The scaler 44 consists of a feature quantity expander 44a and a multiplier 44b.

First, the feature quantity expander 44a matches the R-length vector to the size of the input from the feature quantity concatenator 41 and, at the same time, aligns each element of the R-length vector with the position of its corresponding resolution. One way to do this is, for example, to treat the R-length vector as 1×1×1×R, expand each element into k copies along the channel direction (1×1×1×R·k), and then further expand each resulting element to the (T, H, W) shape (T×H×W×R·k).

The multiplier 44b applies the importance weighting by expanding the R-length vector to the same shape as the feature quantity output from the feature quantity concatenator 41 and multiplying the two; the product is the output of the importance determination unit 40. Through the processes above, final feature quantity data that retains the information at each resolution can be created. With the final feature quantity data, when a given determination is made from the input data (several frames of video data), the resolution needed for that determination can be selected through the weighting, enabling efficient machine learning.
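The expansion and multiplication performed by the scaler can be sketched with NumPy broadcasting. The sketch assumes a channels-last (T, H, W, R·k) layout in which the k channels of each resolution are contiguous and appear in the same order as the elements of the R-length vector. The function names and the example values are hypothetical.

```python
import numpy as np

def expand_importance(importance: np.ndarray, k: int, spatial_shape: tuple) -> np.ndarray:
    """Expand an (R,) vector to (T, H, W, R*k).

    Each element is first repeated k times along the channel axis, then the
    resulting (R*k,) vector is broadcast over the (T, H, W) positions.
    """
    per_channel = np.repeat(importance, k)
    return np.broadcast_to(per_channel, (*spatial_shape, per_channel.size))

def scale(features: np.ndarray, importance: np.ndarray, k: int) -> np.ndarray:
    """Weight every group of k channels by the importance of its resolution."""
    return features * expand_importance(importance, k, features.shape[:3])

R, k = 4, 2                                     # hypothetical sizes
features = np.ones((2, 3, 3, R * k))            # stand-in for the concatenated feature
importance = np.array([0.1, 0.2, 0.3, 0.4])     # stand-in for the generator output

weighted = scale(features, importance, k)
```

`np.broadcast_to` returns a read-only view, so the expanded tensor is never materialized in memory before the multiplication.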

The present invention is not limited to the embodiments and modifications described above and can be implemented in various forms without departing from its gist.

10: feature quantity extraction unit; 11a to 11d: feature quantity concatenation generators; 12a to 12d: feature quantity concatenators; 13a to 13d: feature quantity extractors; 20: time-series image data feature quantity distribution unit; 21a to 21c: feature quantity distributors; 30: feature quantity distribution/concatenation unit; 31 to 33: resolution-specific feature quantity transmitters; 31a to 31c, 32a, 32b, 33a: resolution-specific feature quantity distributors; 35a, 35b: feature quantity concatenators; 40: importance determination unit; 41: feature quantity concatenator; 42: feature quantity aggregator; 43: resolution-specific importance generator; 44: scaler; 100: feature quantity extraction device.

Claims

1. A feature quantity extraction device comprising:
a time-series image data feature quantity distribution unit that calculates differences between frames of time-series image data, distributes the time-series image data by resolution, and creates resolution-specific difference data;
a feature quantity extraction unit that extracts image size feature quantities by executing a three-dimensional convolution operation on the time-series image data and/or the resolution-specific difference data;
a feature quantity distribution/concatenation unit that distributes feature quantities for each of the plurality of image size feature quantities input from the feature quantity extraction unit and concatenates them to generate an image size concatenated feature quantity; and
an importance determination unit that weights each of the image size feature quantities according to a numerical value determined by parameters obtained by machine learning,
wherein the feature quantity extraction unit has a plurality of feature quantity concatenation generators, each of which concatenates the image size concatenated feature quantity and the resolution-specific difference data to generate a new image size feature quantity, and the plurality of feature quantity concatenation generators are connected to generate each of the image size feature quantities.

2. The feature quantity extraction device according to claim 1, wherein the feature quantity concatenation generator further concatenates the image size concatenated feature quantity generated by the feature quantity distribution/concatenation unit when generating the image size feature quantity.

3. The feature quantity extraction device according to claim 1, wherein the time-series image data feature quantity distribution unit has a plurality of feature quantity distributors that create the resolution-specific difference data by an average pooling process using a filter of a specific size, and the plurality of feature quantity distributors are connected to create the difference data for each resolution.

4. The feature quantity extraction device according to claim 1, wherein the feature quantity distribution/concatenation unit performs a convolution operation that downsamples, by resolution, the image size feature quantities input from the feature quantity extraction unit, and transmits the generated image size feature quantities to the importance determination unit.

5. The feature quantity extraction device according to claim 4, wherein the feature quantity distribution/concatenation unit concatenates the image size feature quantities of the same image size generated by the convolution operation to generate the image size concatenated feature quantity.

6. The feature quantity extraction device according to any one of claims 1 to 5, wherein the importance determination unit comprises:
a feature quantity concatenator that concatenates the input image size feature quantities;
a feature quantity aggregator that converts the output data of the feature quantity concatenator and outputs an R·k-length vector, whose length is k times the number of resolution types R (where k is the number of channels);
a resolution-specific importance generator that processes the R·k-length vector output from the feature quantity aggregator in a fully connected layer to generate an R-length vector whose components represent the importance of each resolution; and
a scaler that multiplies the values output from the feature quantity concatenator by the numerical values of the components of the R-length vector generated by the resolution-specific importance generator,
and wherein, for each of the image size feature quantities, the importance is calculated and the weighting is applied with the k channels representing each resolution as one unit.