JP6744838B2 - Encoder-decoder convolutional program for improving resolution in neural networks

Info

Publication number: JP6744838B2
Application number: JP2017082412A
Authority: JP
Inventors: 仁武高
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-04-18
Filing date: 2017-04-18
Publication date: 2020-08-19
Anticipated expiration: 2037-04-18
Also published as: JP2018181124A

Description

The present invention relates to a technique for recognizing objects appearing in an image.

Recognizing an object appearing in an image requires region segmentation of the image. This processing detects which object each pixel of the image belongs to, and has conventionally been performed with methods such as random forests, support vector machines (SVM), and AdaBoost.

In recent years, convolutional neural networks from deep learning have been applied to image region segmentation. By inputting an image to a convolutional neural network, features can be extracted and the positions where those features appear can be detected.

A "neural network" is a mathematical model that aims to express the characteristics of a biological brain through computer simulation. The term covers models in general in which artificial neurons (units), networked through synaptic connections, change the strength of those connections through learning and thereby acquire problem-solving ability.
A "convolutional neural network" is a feedforward network in which layers of multiple units are connected in one direction from the input stage to the output stage, and which has convolutional layers in which each unit on the output side is connected only to specific units of the adjacent layer on the input side.

The parameters of the function connecting a unit in an earlier layer to a unit in a later layer are called "weights". Learning consists of calculating appropriate weights as the parameters of this function. The weights of each layer are updated using the error between the output data that the output layer produces for the input data of the training data and the correct labels of the training data. By error backpropagation, the error propagates successively from the output layer side toward the input layer side, and the weights of each layer are updated little by little. Ultimately, a convergence calculation adjusts the weights of each layer to appropriate values so that the error becomes small.
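The weight update described above can be illustrated with a minimal sketch. It assumes a single linear unit, a squared-error loss, and a fixed learning rate, none of which the source specifies; the function name `train_single_weight` is hypothetical.

```python
def train_single_weight(samples, weight=0.0, learning_rate=0.1, epochs=50):
    """Fit y = weight * x by gradient descent on the squared error."""
    for _ in range(epochs):
        for x, target in samples:
            output = weight * x                 # forward pass
            error = output - target             # difference from the correct label
            gradient = 2.0 * error * x          # derivative of error**2 w.r.t. weight
            weight -= learning_rate * gradient  # small step toward lower error
    return weight


# Training data generated from y = 3x; the learned weight should approach 3.
training_data = [(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)]
learned = train_single_weight(training_data)
```

In a multi-layer network the same kind of update is applied to every layer, with the gradient for each layer obtained by propagating the error backward from the output side.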

Conventionally, there is a technique that uses an encoder-decoder structure as a convolutional neural network for object recognition in images (see, for example, Non-Patent Document 1). In this technique, the encoder extracts features of objects from the image, and the decoder maps those features to object positions.
There is also a technique that uses a fully convolutional structure (see, for example, Non-Patent Document 2). In this technique, the image is encoded and certain layers are combined through a skip structure to estimate positions. There is also a technique that merges the layers after the skip structure (see, for example, Non-Patent Document 3).

FIG. 1 is a schematic diagram of an encoder-decoder convolutional neural network.

As shown in FIG. 1, an encoder-decoder convolutional neural network detects objects in an input image and recognizes which object each pixel of the image belongs to. It executes two steps, an encoder and a decoder.
Encoder: feature extraction processing in object detection
Decoder: position detection processing in object detection

The encoder-decoder convolutional neural network outputs an object recognition image of the same size as the input image.
In FIG. 1, a photographic image showing several people is input (quoted from Non-Patent Document 4). The input image is not limited to a natural image captured with a smartphone or camera; it may also be a CG (computer graphics) image.
In the output object recognition image, objects such as people, tables, and chairs are detected and their positions are identified.

FIG. 2 is a structural diagram of the layers in a prior-art encoder-decoder convolutional network.

The encoder-decoder convolutional network has a U-shaped shortcut structure (nested structure). In the U-shaped network, the encoder creates feature maps while reducing the number of elements (pixels) through convolutional layers and pooling layers. The decoder, in turn, creates feature maps while increasing the number of elements through convolutional layers and upsampling layers.
Deepening the stages of the U-shaped network increases the amount of computation, but makes it possible to detect positions for more expressive features.

A convolutional layer applies a weight filter to the input data and takes the sum of the element-wise products as the value of one element of the feature map. By sliding the weight filter across the input data, it generates a feature map in which local features are enhanced. The feature maps output from a convolutional layer have size S×S, and there are N of them. The number N of feature maps equals the number N of weight filters.
One feature map is generated by moving the same weight filter across the input data. The number of elements by which the filter is moved each time is called the "stride".
A pooling layer generates a feature map reduced to only the important feature elements of the input data.
An upsampling layer enlarges the input feature map by, for example, a factor of 2 vertically and 2 horizontally, so that each original element (pixel) becomes four elements filled with the same value.
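The three layer types described above can be sketched in plain Python on a single square feature map. This is an illustrative sketch only: the helper names are hypothetical, the convolution uses no padding, and max pooling over 2×2 blocks is assumed because the source does not say which pooling operation is used.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over a square `image`; each output is a sum of products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def max_pool2x2(fmap):
    """Keep the largest value of each 2x2 block (max pooling is an assumption)."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]


def upsample2x(fmap):
    """Copy every element into a 2x2 block of identical values."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0], [0, 1]]  # one weight filter produces one feature map

conv = convolve2d(image, kernel, stride=2)  # [[7, 11], [23, 27]]
pooled = max_pool2x2(image)                 # [[6, 8], [14, 16]]
enlarged = upsample2x(pooled)               # 4x4 map made of constant 2x2 blocks
```

The constant 2×2 blocks in `enlarged` are the "blocking" that the rest of the document identifies as the cause of reduced perceived resolution.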

<Encoder>
In FIG. 2, the image is input to the input layer, and the output data of the input layer is input to the first-stage convolutional layer on the encoder side. The feature map output from the first-stage convolutional layer is input to the second-stage pooling layer and also to the first-stage convolutional layer on the decoder side.
The feature map whose number of elements has been reduced by the second-stage pooling layer on the encoder side is input to the second-stage convolutional layer. The feature map output from the second-stage convolutional layer is input to the third-stage pooling layer and also to the second-stage convolutional layer on the decoder side.
The feature map whose number of elements has been reduced by the third-stage pooling layer on the encoder side is input to the third-stage convolutional layer.

<Decoder>
In FIG. 2, the feature map output from the third-stage convolutional layer is input to the third-stage upsampling layer.
The feature map whose number of elements has been increased by the third-stage upsampling layer is input to the second-stage convolutional layer on the decoder side.
Here, the feature map output from the second-stage convolutional layer on the encoder side is merged with the feature map output from the third-stage upsampling layer, and the merged feature map is input to the second-stage convolutional layer on the decoder side. The feature map output from that second-stage convolutional layer is then input to the second-stage upsampling layer.
The feature map whose number of elements has been increased by the second-stage upsampling layer on the decoder side is input to the first-stage convolutional layer on the decoder side.
Here, the feature map output from the first-stage convolutional layer on the encoder side is merged with the feature map output from the second-stage upsampling layer, and the merged feature map is input to the first-stage convolutional layer on the decoder side. The feature map output from that first-stage convolutional layer is then input to the activation layer.
When the activation layer is, for example, a ReLU (Rectified Linear Unit), it can enhance neurons with strong signals and suppress neurons with weak signals. The data output from the activation layer is image data in which an object is mapped to each pixel (see, for example, FIG. 1).
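The data flow of FIG. 2 can be followed by tracing only the sizes of the feature maps. The sketch below is a bookkeeping aid under stated assumptions: the filter counts (32, 64, 128), the single-channel input, and size-preserving ("same") convolutions are illustrative choices not given in this part of the source, and `trace_prior_art_unet` is a hypothetical name.

```python
def trace_prior_art_unet(size, filters=(32, 64, 128)):
    """Return (layer name, height/width, number of maps) for a 3-stage U-shaped network.

    Assumes size-preserving convolutions and 2x2 pooling and upsampling.
    """
    f1, f2, f3 = filters
    return [
        ("input", size, 1),
        # Encoder: convolution keeps the size, pooling halves it.
        ("enc_conv1", size, f1),
        ("pool2", size // 2, f1),
        ("enc_conv2", size // 2, f2),
        ("pool3", size // 4, f2),
        ("conv3", size // 4, f3),
        # Decoder: upsampling doubles the size, the merge concatenates the maps.
        ("up3", size // 2, f3),
        ("merge2", size // 2, f3 + f2),
        ("dec_conv2", size // 2, f2),
        ("up2", size, f2),
        ("merge1", size, f2 + f1),
        ("dec_conv1", size, f1),
    ]


for name, side, maps in trace_prior_art_unet(64):
    print(f"{name:10s} {side:3d} x {side:3d} x {maps}")
```

The last row has the same height and width as the input, which matches the statement that the network outputs an object recognition image of the same size as the input image.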

FIG. 3 is an explanatory diagram showing the processing of feature maps in the upsampling layer and merge function of the prior art.

When the n-th stage upsampling layer on the decoder side receives a feature map of size S/2×S/2×N, it outputs a feature map of size S×S×N, enlarged by, for example, a factor of 2 vertically and 2 horizontally.
The S×S×N feature map output from the n-th stage upsampling layer on the decoder side and the S×S×N feature map output from the (n-1)th stage convolutional layer on the encoder side then have the same size and are merged.
Here, merging simply concatenates (linearly merges) the two feature maps, giving 2N maps. The resulting feature map of size S×S×2N is input to the (n-1)th stage convolutional layer on the decoder side.
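The prior-art merge can be sketched as follows, representing a stack of N feature maps as a Python list of N two-dimensional lists. The helper names `merge_concat` and `shape` are hypothetical, and the values are placeholders chosen only to make the shapes visible.

```python
def merge_concat(decoder_maps, encoder_maps):
    """Prior-art merge: stack the two sets of feature maps, giving 2N maps."""
    return decoder_maps + encoder_maps


def shape(maps):
    """Return (height, width, number of maps) for a list of 2-D maps."""
    return (len(maps[0]), len(maps[0][0]), len(maps))


S, N = 4, 3
upsampled = [[[1.0] * S for _ in range(S)] for _ in range(N)]  # S x S x N
skip = [[[2.0] * S for _ in range(S)] for _ in range(N)]       # S x S x N
merged = merge_concat(upsampled, skip)                         # S x S x 2N
```

Concatenation leaves every element unchanged and only doubles the number of maps that the next convolutional layer has to read.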

Examples of the nested neural networks described above include ResNet (residual network) and U-Net. By concatenating feature maps on the decoder side, these networks mix the differences between the (n-1)th and n-th stage layers in an attempt to prevent overfitting of the network.

Non-Patent Document 1: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, [online], [retrieved April 17, 2017], Internet <URL: https://arxiv.org/pdf/1511.00561v3.pdf>
Non-Patent Document 2: Fully Convolutional Networks for Semantic Segmentation, [online], [retrieved April 17, 2017], Internet <URL: https://arxiv.org/pdf/1605.06211.pdf>
Non-Patent Document 3: Deep Residual Learning for Compressed Sensing CT Reconstruction via Persistent Homology Analysis, [online], [retrieved April 17, 2017], Internet <URL: https://arxiv.org/pdf/1611.06391.pdf>
Non-Patent Document 4: Pascal VOC, [online], [retrieved April 17, 2017], Internet <URL: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html>

With the encoder-decoder convolutional neural network described above, enlarging the feature map in the upsampling layer (which makes it blocky) has the side effect of reducing the perceived resolution of the image region segmentation.

Accordingly, an object of the present invention is to provide a program that improves the perceived resolution in an encoder-decoder convolutional neural network.

According to the present invention, a program causes a computer to function, for an encoder-decoder convolutional network, so as to have a merge function that concatenates the feature map output from the nested n-th stage upsampling layer on the decoder side with the feature map output from the (n-1)th stage convolutional layer on the encoder side and inputs the result to the (n-1)th stage convolutional layer on the decoder side, and is characterized in that:
the network further has an n-th stage interpolation convolutional layer that receives the feature map output from the n-th stage upsampling layer on the decoder side; and
the merge function inputs, to the (n-1)th stage convolutional layer on the decoder side, a feature map obtained by adding, element by element, the feature map output from the n-th stage interpolation convolutional layer and the feature map output from the (n-1)th stage convolutional layer on the encoder side.

According to another embodiment of the program of the present invention, it is also preferable that the computer is caused to function such that the n-th stage interpolation convolutional layer updates its weights by error backpropagation from the (n-1)th stage convolutional layer, in order to reduce the side effect of reduced perceived resolution caused by the enlargement of element size in the upsampling layer.

According to another embodiment of the program of the present invention, it is also preferable that the computer is caused to function such that the following are all identical:
the size and number of the feature maps output from the (n-1)th stage convolutional layer on the encoder side;
the size and number of the feature maps output from the n-th stage upsampling layer and interpolation convolutional layer on the decoder side; and
the size and number of the feature maps output from the merge function.

According to another embodiment of the program of the present invention, it is also preferable that the computer is caused to function such that the encoder-decoder convolutional network has a U-shaped shortcut structure.

According to another embodiment of the program of the present invention, it is also preferable that the computer is caused to function such that:
the encoder-decoder convolutional network is applied to object detection in an input image;
the encoder performs feature extraction processing in object detection; and
the decoder performs position detection processing in object detection.

According to the present invention, a program causes a computer to perform merging, for an encoder-decoder convolutional network, by concatenating the feature map output from the nested n-th stage upsampling layer on the decoder side with the feature map output from the (n-1)th stage convolutional layer on the encoder side and inputting the result to the (n-1)th stage convolutional layer on the decoder side, and is characterized by causing the computer to execute:
a first step of inputting the feature map output from the n-th stage upsampling layer on the decoder side to an n-th stage interpolation convolutional layer; and
a second step of inputting, to the (n-1)th stage convolutional layer on the decoder side, a feature map obtained by adding, element by element, the feature map output from the n-th stage interpolation convolutional layer and the feature map output from the (n-1)th stage convolutional layer on the encoder side.

The program of the present invention makes it possible to improve the perceived resolution in an encoder-decoder convolutional neural network.

FIG. 1 is a schematic diagram of an encoder-decoder convolutional neural network.
FIG. 2 is a structural diagram of the layers in a prior-art encoder-decoder convolutional network.
FIG. 3 is an explanatory diagram showing the processing of feature maps in the upsampling layer and merge function of the prior art.
FIG. 4 is a structural diagram of the layers in the encoder-decoder convolutional network of the present invention.
FIG. 5 is an explanatory diagram showing the processing of feature maps in the upsampling layer and merge function of the present invention.
FIG. 6 shows program code comparing FIG. 2 of the prior art with FIG. 4 of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 4 is a structural diagram of the layers in the encoder-decoder convolutional network of the present invention.

Compared with FIG. 2 of the prior art, FIG. 4 differs only in the decoder, which performs position detection.
Specifically, the network further has an n-th stage "interpolation convolutional layer" that receives the feature map output from the n-th stage upsampling layer on the decoder side.
During training, the n-th stage interpolation convolutional layer has its weights updated by error backpropagation from the (n-1)th stage convolutional layer on the decoder side. This reduces the side effect of reduced perceived resolution caused by the enlargement of element size in the n-th stage upsampling layer.
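The effect of placing a convolution after the upsampling layer can be seen in a small sketch. It applies a 3×3 convolution with zero padding to a blocky upsampled map. The averaging kernel used here is a fixed stand-in chosen for illustration; in the source the weights of the interpolation convolutional layer are learned by backpropagation, and the FIG. 6 code also applies a ReLU activation, which is omitted here. The name `conv3x3_same` is hypothetical.

```python
def conv3x3_same(fmap, kernel):
    """3x3 convolution with zero padding so the output keeps the input size."""
    height, width = len(fmap), len(fmap[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < height and 0 <= cc < width:
                        total += fmap[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out


# Upsampled map: a hard step between two constant blocks.
blocky = [
    [0.0, 0.0, 9.0, 9.0],
    [0.0, 0.0, 9.0, 9.0],
    [0.0, 0.0, 9.0, 9.0],
    [0.0, 0.0, 9.0, 9.0],
]
averaging = [[1.0 / 9.0] * 3 for _ in range(3)]  # illustrative, not learned
smoothed = conv3x3_same(blocky, averaging)
# An interior row changes from 0, 0, 9, 9 to roughly 0, 3, 6, 6.
```

The output has the same size as the input, but the step between blocks is spread over neighboring elements, which is the kind of interpolation a learned layer in this position can provide.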

The feature map output from the n-th stage interpolation convolutional layer on the decoder side is then merged with the feature map output from the (n-1)th stage convolutional layer on the encoder side.
Here, the merge function at the (n-1)th stage performs element-wise addition rather than concatenation (linear merging) as in the prior art. That is, instead of producing a feature map of size S×S×2N, it adds the maps element by element and produces a feature map of size S×S×N. This further reduces the side effect of reduced perceived resolution caused by the enlargement of element size in the upsampling layer.
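The element-wise merge can be sketched in the same list-of-maps representation, where a stack of N feature maps is a list of N two-dimensional lists. The helper name `merge_add` and the sample values are hypothetical.

```python
def merge_add(decoder_maps, encoder_maps):
    """Element-wise sum of two equally shaped stacks of feature maps."""
    if len(decoder_maps) != len(encoder_maps):
        raise ValueError("both inputs must contain the same number of maps")
    return [
        [
            [a + b for a, b in zip(row_d, row_e)]
            for row_d, row_e in zip(map_d, map_e)
        ]
        for map_d, map_e in zip(decoder_maps, encoder_maps)
    ]


# Two stacks of N = 2 maps, each of size 2 x 2.
interpolated = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
skip = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
merged = merge_add(interpolated, skip)
# merged == [[[11, 22], [33, 44]], [[55, 66], [77, 88]]]
```

Unlike concatenation, the result still contains N maps, so the following convolutional layer receives an input of size S×S×N.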

In FIG. 4, the feature map output from the third-stage convolutional layer is input to the third-stage upsampling layer.
The feature map whose number of elements has been increased by the third-stage upsampling layer is input to the interpolation convolutional layer.
Here, the feature map output from the second-stage convolutional layer on the encoder side and the feature map output from the third-stage interpolation convolutional layer are added element by element, and the resulting feature map is input to the second-stage convolutional layer on the decoder side. The feature map output from that second-stage convolutional layer is then input to the second-stage upsampling layer.
The feature map whose number of elements has been increased by the second-stage upsampling layer on the decoder side is input to the interpolation convolutional layer.
Here, the feature map output from the first-stage convolutional layer on the encoder side and the feature map output from the second-stage interpolation convolutional layer are added element by element, and the resulting feature map is input to the first-stage convolutional layer on the decoder side. The feature map output from that first-stage convolutional layer is then input to the activation layer. The data output from the activation layer is image data in which an object is mapped to each pixel.

FIG. 5 is an explanatory diagram showing the processing of feature maps in the upsampling layer and merge function of the present invention.

When the n-th stage upsampling layer on the decoder side receives a feature map of size S/2×S/2×N, it outputs a feature map of size S×S×N, enlarged by, for example, a factor of 2 vertically and 2 horizontally. That feature map is input to the interpolation convolutional layer. The weights of the n-th stage interpolation convolutional layer have been updated by error backpropagation from the (n-1)th stage convolutional layer on the decoder side.
The S×S×N feature map output from the n-th stage interpolation convolutional layer on the decoder side and the S×S×N feature map output from the (n-1)th stage convolutional layer on the encoder side then have the same size and are merged.
Here, merging adds the two feature maps element by element, so the number of maps remains N. The resulting feature map of size S×S×N is input to the (n-1)th stage convolutional layer on the decoder side.
That is, the size S×S and number N of the feature maps output from the (n-1)th stage convolutional layer on the encoder side, the size S×S and number N of the feature maps output from the n-th stage upsampling layer and interpolation convolutional layer on the decoder side, and the size S×S and number N of the feature maps output from the merge function are all identical.
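One consequence of keeping the number of maps at N can be checked with simple arithmetic. The sketch below counts filter weights in the convolutional layer that follows the merge. It assumes 3×3 kernels, N output maps, and no bias terms; these are illustrative assumptions rather than figures stated in the source, and `conv_weight_count` is a hypothetical helper.

```python
def conv_weight_count(in_maps, out_maps, kernel_size=3):
    """Number of filter weights (biases excluded) in a convolutional layer."""
    return in_maps * out_maps * kernel_size * kernel_size


N = 64
concat_weights = conv_weight_count(2 * N, N)  # prior art: S x S x 2N input
sum_weights = conv_weight_count(N, N)         # element-wise sum: S x S x N input
interp_weights = conv_weight_count(N, N)      # added interpolation layer
```

Under these assumptions, the convolution after an element-wise merge needs half the weights of the one after a concatenation, and the added interpolation convolutional layer accounts for the other half, so the weight count for this stage is the same in both designs.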

FIG. 6 shows program code comparing FIG. 2 of the prior art with FIG. 4 of the present invention.

FIG. 6 is laid out as follows.
Left: program code for FIG. 2 of the prior art
Right: only the program code changed for FIG. 4 of the present invention

(Prior art of FIG. 2)
up1 = merge([UpSampling2D(size=(2,2))(conv3), conv2],
            mode='concat', concat_axis=1)
# Upsample conv3 (the feature map output from the third-stage convolutional layer) by size=(2,2) (2x vertically, 2x horizontally), merge it with conv2 (the feature map output from the second-stage convolutional layer) by concat (concatenation), and call the result up1.
(Present invention of FIG. 4)
up1 = UpSampling2D(size=(2,2))(conv3)
conv3 = Convolution2D(64, 3, 3, activation='relu',
                      border_mode='same')(up1)
up1 = merge([conv3, conv2], mode='sum', axis=1)
# Upsample conv3 (the feature map output from the third-stage convolutional layer) by size=(2,2) (2x vertically, 2x horizontally) and call the result up1.
# Input the up1 feature map to Convolution (the interpolation convolutional layer) and call its output feature map conv3.
# Merge conv3 (the feature map output from the interpolation convolutional layer) with conv2 (the feature map output from the second-stage convolutional layer) by sum (element-wise addition) and call the result up1.

(Prior art of FIG. 2)
up2 = merge([UpSampling2D(size=(2,2))(conv4), conv1],
            mode='concat', concat_axis=1)
# Upsample conv4 (the feature map output from the second-stage convolutional layer) by size=(2,2) (2x vertically, 2x horizontally), merge it with conv1 (the feature map output from the first-stage convolutional layer) by concat (concatenation), and call the result up2.
(Present invention of FIG. 4)
up2 = UpSampling2D(size=(2,2))(conv4)
conv4 = Convolution2D(32, 3, 3, activation='relu',
                      border_mode='same')(up2)
up2 = merge([conv4, conv1], mode='sum', axis=1)
# Upsample conv4 (the feature map output from the second-stage convolutional layer) by size=(2,2) (2x vertically, 2x horizontally) and call the result up2.
# Input the up2 feature map to Convolution (the interpolation convolutional layer) and call its output feature map conv4.
# Merge conv4 (the feature map output from the interpolation convolutional layer) with conv1 (the feature map output from the first-stage convolutional layer) by sum (element-wise addition) and call the result up2.

As described in detail above, the program of the present invention makes it possible to improve the perceived resolution in an encoder-decoder convolutional neural network.

Various changes, modifications, and omissions to the embodiments of the present invention described above can readily be made by those skilled in the art within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined by the claims and their equivalents.

Claims

1. A program that causes a computer to function, for an encoder-decoder convolutional network, so as to have a merge function that concatenates the feature map output from the nested n-th stage upsampling layer on the decoder side with the feature map output from the (n-1)th stage convolutional layer on the encoder side and inputs the result to the (n-1)th stage convolutional layer on the decoder side, wherein
the network further has an n-th stage interpolation convolutional layer that receives the feature map output from the n-th stage upsampling layer on the decoder side, and
the merge function inputs, to the (n-1)th stage convolutional layer on the decoder side, a feature map obtained by adding, element by element, the feature map output from the n-th stage interpolation convolutional layer and the feature map output from the (n-1)th stage convolutional layer on the encoder side.

2. The program according to claim 1, wherein the computer is caused to function such that the n-th stage interpolation convolutional layer updates its weights by error backpropagation from the (n-1)th stage convolutional layer, in order to reduce the side effect of reduced perceived resolution caused by the enlargement of element size in the upsampling layer.

3. The program according to claim 1, wherein the computer is caused to function such that
the size and number of the feature maps output from the (n-1)th stage convolutional layer on the encoder side,
the size and number of the feature maps output from the n-th stage upsampling layer and interpolation convolutional layer on the decoder side, and
the size and number of the feature maps output from the merge function
are all identical.

4. The program according to claim 1, wherein the computer is caused to function such that the encoder-decoder convolutional network has a U-shaped shortcut structure.

5. The program according to any one of claims 1 to 4, wherein the computer is caused to function such that
the encoder-decoder convolutional network is applied to object detection in an input image,
the encoder performs feature extraction processing in object detection, and
the decoder performs position detection processing in object detection.

6. A program that causes a computer to perform merging, for an encoder-decoder convolutional network, by concatenating the feature map output from the nested n-th stage upsampling layer on the decoder side with the feature map output from the (n-1)th stage convolutional layer on the encoder side and inputting the result to the (n-1)th stage convolutional layer on the decoder side, the program causing the computer to execute:
a first step of inputting the feature map output from the n-th stage upsampling layer on the decoder side to an n-th stage interpolation convolutional layer; and
a second step of inputting, to the (n-1)th stage convolutional layer on the decoder side, a feature map obtained by adding, element by element, the feature map output from the n-th stage interpolation convolutional layer and the feature map output from the (n-1)th stage convolutional layer on the encoder side.