JP6830742B2

JP6830742B2 - A program for pixel-based image segmentation

Info

Publication number: JP6830742B2
Application number: JP2017228585A
Authority: JP
Inventors: 仁武高
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2021-02-17
Anticipated expiration: 2037-11-29
Also published as: JP2019101519A

Description

The present invention relates to a pixel-based image segmentation technique using deep learning.

Image segmentation is a technique for recognizing objects that appear in an image. It detects which object each pixel of the image belongs to. Conventionally, random forests, support vector machines, AdaBoost, and similar methods have been used for this purpose.

In recent years, convolutional neural networks from deep learning have been applied to image segmentation. By inputting an image into a convolutional neural network, features can be extracted, and the position, contour, and region where those features appear can be detected.

A "neural network" is a mathematical model that aims to express characteristics of a biological brain through computer simulation. The term covers models in general in which artificial neurons (units), networked through synaptic connections, change the strength of those connections through learning and thereby acquire problem-solving ability.
In the narrow sense, a "convolutional neural network" is a feed-forward network in which layers containing multiple units are connected in one direction from the input stage to the output stage, and which has a "convolution layer" whose output-side units are each connected to specific units on the adjacent input side.

The parameters of the function that connects a unit in an earlier layer to a unit in a later layer are called "weights". Learning means calculating appropriate weights as the parameters of this function. The weights of each layer are updated using the error between the output data that the output layer produces for the input data of the training data and the correct label of the training data. The error is propagated step by step from the output layer side toward the input layer side by error backpropagation, and the weights of each layer are updated little by little. Finally, a convergence calculation adjusts the weights of each layer to appropriate values so that the error becomes small. Note that the "convolutional neural network" in the present invention also includes intra-layer recurrent convolutional neural networks, typified by the RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory), as well as inter-layer recurrent convolutional neural networks.

Compared with image recognition, image segmentation must recognize not only "what" appears in the image but also "where" it is and what "its contour" looks like.
A neural network needs to downsample the image in order to recognize "what". Downsampling, however, reduces the size of the image and makes the contour unclear.

Conventionally, there is a technique that uses an encoder-decoder structure as a convolutional neural network for recognizing objects in images (see, for example, Non-Patent Document 1). According to this technique, the encoder performs the position detection processing of object detection, and the decoder maps the features to the positions of objects. The technique integrates the position information of the shallow layers of the neural network with the feature information of the deep layers by means of a skip structure.

FIG. 1 is a schematic diagram of an encoder-decoder convolutional neural network applied to image segmentation.

According to FIG. 1, the encoder-decoder convolutional neural network receives a photographic image showing several people and outputs an object recognition image of the same size as the input image. In the output object recognition image, objects such as people, a table, and chairs are detected, and the position of each object is identified.

As another conventional technique, there is one that uses a fully convolutional structure (see, for example, Non-Patent Document 2). According to this technique, the image is encoded, and certain layers are integrated through a skip structure to infer position. The neural network is divided into a Pooling stream and a Residual stream, one of which integrates position information while the other performs discrimination.

Furthermore, there is a technique that concatenates after the skip structure (see, for example, Non-Patent Document 3). It uses a DenseNet structure to downsample the image and then upsample it to restore the original resolution.

Non-Patent Document 1: "Fully Convolutional Networks for Semantic Segmentation", [online], [retrieved November 25, 2017], Internet <URL: https://arxiv.org/pdf/1605.06211.pdf>
Non-Patent Document 2: "Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes", [online], [retrieved November 25, 2017], Internet <URL: http://openaccess.thecvf.com/content_cvpr_2017/papers/Pohlen_Full-Resolution_residual_networks_CVPR_2017_paper.pdf>
Non-Patent Document 3: "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", [online], [retrieved November 25, 2017], Internet <URL: http://openaccess.thecvf.com/content_cvpr_2017_workshops/w13/papers/Jegou_The_One_Hundred_CVPR_2017_paper.pdf>

According to the techniques described in Non-Patent Documents 1 and 3, images must be combined (for example, resized) so that a single network can learn the estimation of object features and the estimation of pixel position features at the same time. As a result, pixel class information (the classes to be recognized) may be lost.
According to the technique described in Non-Patent Document 2, the Residual stream can also be kept at full resolution, but it does not learn pixel class information and is integrated by simple addition.

In view of this, the inventors of the present application considered whether highly accurate image segmentation could be achieved while maintaining full resolution, without resizing the image, in both the pixel estimation stream and the feature estimation stream.

Accordingly, an object of the present invention is to provide a program that achieves highly accurate image segmentation by using both pixel estimation and feature estimation.

According to the present invention, a program that causes a computer to function to perform pixel-based image segmentation
executes in parallel
a pixel estimation stream that executes in series, on an input image, a plurality of pixel estimation means for extracting pixel classes, and
a feature estimation stream that executes in series, on the same input image, a plurality of feature estimation means for extracting object features,
and causes the computer to function as:
pixel integration means which, in the pixel estimation stream, receives the pixel map output from the preceding pixel estimation means and the feature map output from the preceding feature estimation means, up/down-samples the feature map to match the size of the pixel map, then concatenates or adds the feature map to the pixel map, and outputs the resulting pixel map to the next pixel estimation means;
feature integration means which, in the feature estimation stream, receives the pixel map output from the preceding pixel estimation means and the feature map output from the preceding feature estimation means, up/down-samples the pixel map to match the size of the feature map, then concatenates or adds the pixel map to the feature map, and outputs the resulting feature map to the next feature estimation means; and
stream integration means which concatenates or adds the pixel estimation stream and the feature estimation stream at the final stage.

According to another embodiment of the program of the present invention, it is also preferable to cause the computer to function such that
the pixel estimation stream includes at least one pixel integration means, and
the feature estimation stream includes at least one feature integration means.

According to another embodiment of the program of the present invention, it is also preferable to cause the computer to function such that
the pixel estimation means and the feature estimation means are each configured as a convolutional neural network.

According to another embodiment of the program of the present invention, it is also preferable to cause the computer to function such that
the stream integration means further executes a convolution layer after concatenating or adding the pixel estimation stream and the feature estimation stream at the final stage.

According to the program of the present invention, highly accurate image segmentation can be achieved by using both pixel estimation and feature estimation. In particular, it can be achieved while maintaining full resolution, without resizing the image, in both the pixel estimation stream and the feature estimation stream.

FIG. 1 is a schematic diagram of an encoder-decoder convolutional neural network applied to image segmentation.
FIG. 2 is a functional configuration diagram of the neural network in the present invention.
FIG. 3 is a structural diagram showing the internal structure of each function in the present invention.
FIG. 4 is an explanatory diagram showing a first flow of the pixel map in the pixel integration unit.
FIG. 5 is an explanatory diagram showing a second flow of the pixel map in the pixel integration unit.
FIG. 6 is a structural diagram showing the internal structure of the stream integration unit in the present invention.
FIG. 7 is an image diagram showing estimation results in the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 2 is a functional configuration diagram of the neural network in the present invention.

According to FIG. 2, the neural network has pixel estimation units 11, pixel integration units 12, feature estimation units 21, feature integration units 22, and a stream integration unit 3. These functional components are realized as a program that causes a computer to function.

The neural network receives, for example, an image (video) captured by a camera. The image may of course also be a CG (computer graphics) image. In response, the network outputs an image segmentation estimation result in which the pixel classes and their positions, contours, and regions are identified.

According to FIG. 2, <pixel estimation stream 1> and <feature estimation stream 2> are executed in parallel for pixel-based image segmentation.
<Pixel estimation stream 1>
The pixel estimation stream 1 executes in series, on the input image, a plurality of "pixel estimation units 11" that extract pixel classes.
<Feature estimation stream 2>
The feature estimation stream 2 executes in series, on the same input image, a plurality of "feature estimation units 21" that extract object features.

According to FIG. 2(a), the pixel estimation stream 1 includes a "pixel integration unit 12" between each pair of adjacent pixel estimation units 11, and the feature estimation stream 2 includes a "feature integration unit 22" between each pair of adjacent feature estimation units 21.
In the pixel estimation stream 1, the pixel integration unit 12 receives the pixel map output from the preceding pixel estimation unit 11_n and the feature map output from the preceding feature estimation unit 21_n, and outputs the processed pixel map to the next pixel estimation unit 11_{n+1}.
Likewise, in the feature estimation stream 2, the feature integration unit 22 receives the feature map output from the preceding feature estimation unit 21_n and the pixel map output from the preceding pixel estimation unit 11_n, and outputs the processed feature map to the next feature estimation unit 21_{n+1}.
The pixel estimation stream 1 and the feature estimation stream 2 are then integrated by the stream integration unit 3 at the output stage.

Giving the pixel estimation stream 1 and the feature estimation stream 2 this complementary relationship makes it possible to combine highly accurate estimation of pixel class information with high-resolution estimation of position information.

According to FIG. 2(b), the pixel estimation stream 1 includes a pixel integration unit 12 only before the final-stage pixel estimation unit 11, and the feature estimation stream 2 includes a feature integration unit 22 only before the final-stage feature estimation unit 21.
In this way, as few as one pixel integration unit 12 and one feature integration unit 22 may be provided, and their number can be increased or decreased according to memory usage and computational cost.
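The arrangement of FIG. 2, with a configurable number of integration points, can be sketched as follows. This is only an illustration under stated assumptions: the function name `two_stream_forward`, the `integrate_at` parameter, and the use of plain callables for the estimation units are hypothetical, the maps are assumed to already have the same size, and only the addition variant of integration is shown.

```python
import numpy as np

def two_stream_forward(image, pixel_units, feature_units, integrate_at):
    """Run both streams over the same input and exchange maps between stages.

    pixel_units, feature_units: equal-length lists of callables (estimation units)
    integrate_at: set of stage indices n after which the two maps are exchanged
    """
    p = f = image
    for n, (pixel_unit, feature_unit) in enumerate(zip(pixel_units, feature_units)):
        p, f = pixel_unit(p), feature_unit(f)
        if n in integrate_at:
            # Both integration units read the same previous-stage outputs p and f.
            # Addition variant; the maps are assumed to have the same size here.
            p, f = p + f, f + p
    # Stream integration at the final stage (again by addition in this sketch).
    return p + f

# Three stages per stream, with integration only before the final stage,
# in the spirit of FIG. 2(b). The lambdas stand in for trained networks.
image = np.ones((2, 2, 1))
out = two_stream_forward(
    image,
    pixel_units=[lambda x: x] * 3,
    feature_units=[lambda x: 2 * x] * 3,
    integrate_at={1},
)
```

Passing a larger `integrate_at` set corresponds to adding more integration units, at the cost of more memory and computation.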

FIG. 3 is a structural diagram showing the internal structure of each function in the present invention.

[Pixel estimation unit 11]
The pixel estimation unit 11 is configured as a convolutional neural network in order to calculate the correct class of each pixel in the input image.
According to FIG. 3, the pixel estimation unit 11 has a U-shaped shortcut structure (nested structure). In the U-shaped network, a pixel map is created while the number of elements (pixels) is reduced by convolution layers, normalization layers, and activation layers. Here, the pixel estimation unit 11 has a structure similar to, for example, ResNet (Residual Network) or U-Net.

The convolution layer (convolutional layer) applies a weight filter to the input data and takes the sum of the element-wise products as one element value of the pixel map. By sliding the weight filter over the input data, it generates a pixel map in which local features are enhanced. Each pixel map output from the convolution layer has size S×S, and the number of maps is N. The number N of pixel maps equals the number N of weight filters.
One pixel map is generated by moving the same weight filter across the input data. The number of elements by which the filter is moved each time (the amount of movement) is called the "stride".
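The sliding-filter computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the function name `conv2d`, the absence of padding, and the square k×k filters are assumptions made for the example.

```python
import numpy as np

def conv2d(x, filters, stride=1):
    """Convolution without padding of an (H, W, C) input with N weight filters.

    x:       array of shape (H, W, C)
    filters: array of shape (N, k, k, C), one weight filter per output map
    Returns an array of shape (out_h, out_w, N): one output map per filter.
    """
    h, w, _ = x.shape
    n, k, _, _ = filters.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            # Window of the input currently covered by the filter.
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Sum of element-wise products, computed for all N filters at once.
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

x = np.ones((5, 5, 1))
filters = np.ones((2, 3, 3, 1))            # N = 2 filters of size 3 x 3
maps_stride1 = conv2d(x, filters, stride=1)
maps_stride2 = conv2d(x, filters, stride=2)
```

With N = 2 filters the output has two maps, and a larger stride yields smaller maps because the filter visits fewer positions.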

The normalization layer performs subtractive normalization, which sets the mean of the intensity values within a region to 0, and divisive normalization, which normalizes the variance.

The activation layer uses, for example, a ReLU (Rectified Linear Unit) to strengthen neurons with strong signals and suppress neurons with weak signals. The data output from the activation layer is image data in which an object is mapped to each pixel.
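The normalization and activation layers can be sketched as two small NumPy functions. The sketch assumes that the "region" is the whole map and adds a small `eps` term for numerical stability; both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Subtractive then divisive normalization over one region (here the whole map)."""
    centered = x - x.mean()                           # subtractive: mean becomes 0
    return centered / np.sqrt(centered.var() + eps)   # divisive: variance becomes ~1

def relu(x):
    """Rectified Linear Unit: keep positive responses, set negative ones to 0."""
    return np.maximum(x, 0.0)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = normalize(x)
activated = relu(y)
```

After normalization, the two below-average elements are negative, so the ReLU sets them to zero and passes the two above-average elements through unchanged.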

Finally, the streams from different levels are integrated by linear concatenation rather than by addition. That is, the size of the pixel map becomes S×S×2N. Mixing the differences between levels in this way can prevent the network from overfitting.
Deepening the levels of the U-shaped network increases the amount of computation, but makes it possible to detect the positions of more expressive features.

[Pixel integration unit 12]
In the pixel estimation stream 1, the pixel integration unit 12 receives the pixel map output from the preceding pixel estimation unit 11_n and the feature map output from the preceding feature estimation unit 21_n. It concatenates or adds the feature map to the pixel map and outputs the resulting pixel map to the next pixel estimation unit 11_{n+1}.

As a result, the next pixel estimation unit 11_{n+1} receives a pixel map that reflects not only the estimation result (pixel class information) of the preceding pixel estimation unit 11_n but also the estimation result (object features) of the preceding feature estimation unit 21_n. The next pixel estimation unit 11_{n+1} can therefore be expected to estimate pixel classes more accurately. In addition, because the pixel map (pixel class information) output from the preceding pixel estimation unit 11_n is input directly, the computational cost of the learning process can also be reduced.

As another embodiment, when the size of the feature map output from the feature estimation unit 21_n differs from the size of the pixel map output from the pixel estimation unit 11_n, the feature map is resized to the size of the pixel map before the two maps are integrated. To do so, the pixel integration unit 12 up/down-samples the feature map to match the size of the pixel map. The pixel map and the feature map are then integrated by concatenation or addition.

FIG. 4 is an explanatory diagram showing a first flow of the pixel map in the pixel integration unit.

The pixel integration unit 12 receives a pixel map of size S×S×N from the pixel estimation stream 1 (the preceding pixel estimation unit 11_n). Suppose it also receives a feature map of size, for example, S/2×S/2×N from the feature estimation stream 2 (the preceding feature estimation unit 21_n). In this case, the feature map is upsampled by a factor of 2 vertically and horizontally to size S×S×N. The feature map output from the upsampling layer then has the same size as the pixel map of the pixel estimation stream 1, so the two can be linearly concatenated. A pixel map of size S×S×2N can then be output to the next pixel estimation unit 11_{n+1}.

Conversely, when the pixel map of the pixel estimation stream 1 is smaller than the feature map of the feature estimation stream 2, the feature map is downsampled instead. The feature map output from the downsampling layer then has the same size as the pixel map of the pixel estimation stream 1, so the two can be linearly concatenated.
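The first flow (resize, then concatenate) can be sketched as follows. The text does not specify the interpolation method, so nearest-neighbour sampling is used here purely as an assumption, as are the function names `resize_to` and `integrate_concat` and the restriction to square maps.

```python
import numpy as np

def resize_to(feature_map, size):
    """Nearest-neighbour up/down-sampling of an (s, s, N) map to (size, size, N)."""
    s = feature_map.shape[0]
    idx = np.arange(size) * s // size   # source row/column for each target index
    return feature_map[idx][:, idx]

def integrate_concat(pixel_map, feature_map):
    """Resize the feature map to the pixel map's size, then concatenate on channels."""
    resized = resize_to(feature_map, pixel_map.shape[0])
    return np.concatenate([pixel_map, resized], axis=-1)

pixel_map = np.zeros((4, 4, 3))                              # S x S x N, S=4, N=3
upsampled_case = integrate_concat(pixel_map, np.ones((2, 2, 3)))    # S/2 input
downsampled_case = integrate_concat(pixel_map, np.ones((8, 8, 3)))  # 2S input
```

In both cases the result has shape S×S×2N; the pixel map itself is never resized, which is what keeps the pixel estimation stream at full resolution.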

FIG. 5 is an explanatory diagram showing a second flow of the pixel map in the pixel integration unit.

In FIG. 5, unlike FIG. 4, the integration is performed by "addition" rather than "concatenation". The pixel map and feature map of the same size are added element by element, so the number of maps remains N. A pixel map of size S×S×N can then be output to the next pixel estimation unit 11_{n+1}.
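The difference between the two flows is easiest to see in the output shapes. The following short sketch uses arbitrary example values (S = 4, N = 3) chosen only for illustration.

```python
import numpy as np

pixel_map = np.ones((4, 4, 3))          # S x S x N with S=4, N=3
feature_map = np.full((4, 4, 3), 2.0)   # already resized to the same S x S x N

# FIG. 4 style: concatenation doubles the number of maps (S x S x 2N).
concatenated = np.concatenate([pixel_map, feature_map], axis=-1)

# FIG. 5 style: element-wise addition keeps the number of maps at N.
added = pixel_map + feature_map
```

Addition keeps the input size of the next estimation unit unchanged, while concatenation preserves both maps separately at the cost of twice as many channels.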

To perform abstract pixel estimation, the pixel estimation unit 11 and the pixel integration unit 12 are repeated at least once, so that the estimation results of the feature estimation stream 2 are reflected in the pixel estimation stream 1. Because no up/down-sampling is inserted into the flow of the pixel estimation stream 1 itself, full resolution can be maintained and estimation can be performed for every pixel of the input image.

[Feature estimation unit 21]
The feature estimation unit 21 is also configured as a convolutional neural network, in order to encode the features of objects.
According to FIG. 3, the feature estimation unit 21 creates a feature map while reducing the number of elements by means of convolution layers, pooling layers (such as average pooling), and activation layers.
The convolution layer is the same as that of the pixel estimation unit 11 described above.
The pooling layer generates, from the input feature map, a feature map reduced to only the important feature elements, thereby discarding unimportant information.
The activation layer is also the same as that of the pixel estimation unit 11 described above, and may be based on, for example, ReLU, the softmax function, or the sigmoid function.
To extract highly abstract features, these three layers are repeated, which can be expected to yield more expressive features.
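The reduction performed by the pooling layer can be sketched as follows. The text names average pooling; max pooling is included here only as a common alternative and is an assumption, as are the function name `pool2d` and the non-overlapping k×k windows.

```python
import numpy as np

def pool2d(x, k=2, mode="max"):
    """Non-overlapping k x k pooling of an (H, W, N) feature map."""
    h, w, n = x.shape
    # Crop so that H and W are multiples of k, then split into k x k blocks.
    blocks = x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k, n)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # keep the strongest response per block
    return blocks.mean(axis=(1, 3))       # average pooling

x = np.arange(16.0).reshape(4, 4, 1)
max_pooled = pool2d(x, k=2, mode="max")
avg_pooled = pool2d(x, k=2, mode="average")
```

Each 2×2 block of the 4×4 input is reduced to a single element, so the feature map shrinks to 2×2 while the number of maps is unchanged.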

[Feature integration unit 22]
In the feature estimation stream 2, the feature integration unit 22 receives the pixel map output from the preceding pixel estimation unit 11_n and the feature map output from the preceding feature estimation unit 21_n. It concatenates or adds the pixel map to the feature map and outputs the resulting feature map to the next feature estimation unit 21_{n+1}.
In exactly the same manner as the pixel integration unit 12 described above, the feature integration unit 22 up/down-samples the pixel map to match the size of the feature map before concatenating or adding the pixel map to the feature map.
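The mirrored operation in the feature estimation stream can be sketched as follows for the case where the pixel map is larger than the feature map. The sketch assumes square maps, an integer size ratio, equal channel counts, block averaging as the downsampling method, and the addition variant of integration; none of these choices is specified in the text, and the name `feature_integrate` is hypothetical.

```python
import numpy as np

def feature_integrate(feature_map, pixel_map):
    """Downsample the pixel map by block averaging, then add it to the feature map."""
    s, n = feature_map.shape[0], pixel_map.shape[-1]
    factor = pixel_map.shape[0] // s      # assumed integer ratio >= 1
    pooled = (
        pixel_map[:s * factor, :s * factor]
        .reshape(s, factor, s, factor, n)
        .mean(axis=(1, 3))
    )
    return feature_map + pooled

feature_map = np.zeros((2, 2, 1))                   # reduced-size feature map
pixel_map = np.arange(16.0).reshape(4, 4, 1)        # full-resolution pixel map
integrated = feature_integrate(feature_map, pixel_map)
```

Here it is the pixel map that is resized, so the feature estimation stream continues at its own (smaller) map size.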

As a result, the next feature estimation unit 21_{n+1} receives a feature map that reflects not only the estimation result (object features) of the preceding feature estimation unit 21_n but also the estimation result (pixel class information) of the preceding pixel estimation unit 11_n. The next feature estimation unit 21_{n+1} can therefore be expected to estimate object recognition more accurately. In addition, because the feature map (object features) output from the preceding feature estimation unit 21_n is input directly, the computational cost of the learning process can also be reduced.

[Stream integration unit 3]
The stream integration unit 3 concatenates or adds the pixel estimation stream and the feature estimation stream at the final stage.

FIG. 6 is a structural diagram showing the internal structure of the stream integration unit in the present invention.

According to FIG. 6, the stream integration unit 3 integrates the pixel map and the feature map, which have been brought to the same size, by concatenation or addition. It is also preferable to insert a convolution layer after this integration.
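A minimal sketch of the stream integration followed by a convolution layer is shown below. The text does not give the kernel size or the final decision rule, so the 1×1 convolution (written as a per-pixel matrix product) and the per-pixel argmax are assumptions, as are the function name and the example weights.

```python
import numpy as np

def stream_integrate(pixel_map, feature_map, weights):
    """Concatenate the two final-stage maps, then apply a 1x1 convolution.

    pixel_map, feature_map: arrays of shape (S, S, N), already the same size
    weights: array of shape (2N, K) mapping the 2N merged channels to K classes
    """
    merged = np.concatenate([pixel_map, feature_map], axis=-1)  # S x S x 2N
    scores = merged @ weights                                   # S x S x K
    return scores.argmax(axis=-1)                               # class index per pixel

pixel_map = np.zeros((2, 2, 1))
pixel_map[0, 0, 0] = 1.0                      # one pixel with a strong response
feature_map = np.zeros((2, 2, 1))
weights = np.array([[-1.0, 1.0],              # hypothetical weights, K = 2 classes
                    [0.0, 0.0]])
labels = stream_integrate(pixel_map, feature_map, weights)
```

The trailing convolution lets the network learn how to weigh the two streams against each other instead of fixing their contributions in advance.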

FIG. 7 is an image diagram showing estimation results in the present invention.

FIG. 7 shows, for an input image and its ground-truth data, the image segmentation result of the prior art and the image segmentation result of the present invention.
Compared with the prior art, the segmentation produced by the present invention shows less overall variation. In particular, the segmentation of the railing on the left side is also comparatively clear.

As described in detail above, according to the program of the present invention, highly accurate image segmentation can be achieved by using both pixel estimation and feature estimation. In particular, it can be achieved while maintaining full resolution, without resizing the image, in both the pixel estimation stream and the feature estimation stream.

According to the present invention, highly accurate image segmentation and object detection can be expected as the pixel map and the feature map are repeatedly exchanged between the pixel estimation stream and the feature estimation stream.
For example, for an input image showing a person, giving the pixel estimation stream the approximate position of the person detected in the feature estimation stream makes it possible to obtain a more accurate contour and region of the person. Conversely, giving the feature estimation stream the approximate contour and region of the person detected in the pixel estimation stream makes it possible to obtain the position of the person more accurately.
In addition, the pixel map and the feature map are not resized within the pixel estimation stream or the feature estimation stream themselves. Because full resolution is maintained, there is no need to learn the pixel classes, positions, contours, and regions that would otherwise be affected by a change of resolution when the maps are integrated.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of its technical idea and viewpoint. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

1 Pixel estimation stream
11 Pixel estimation unit
12 Pixel integration unit
2 Feature estimation stream
21 Feature estimation unit
22 Feature integration unit
3 Stream integration unit

Claims

A program that causes a computer to function to perform pixel-based image segmentation, wherein the program
executes in parallel a pixel estimation stream that executes in series, on an input image, a plurality of pixel estimation means for extracting pixel classes, and
a feature estimation stream that executes in series, on the same input image, a plurality of feature estimation means for extracting object features,
and causes the computer to function as:
pixel integration means which, in the pixel estimation stream, receives the pixel map output from the preceding pixel estimation means and the feature map output from the preceding feature estimation means, up/down-samples the feature map to match the size of the pixel map, then concatenates or adds the feature map to the pixel map, and outputs the resulting pixel map to the next pixel estimation means;
feature integration means which, in the feature estimation stream, receives the pixel map output from the preceding pixel estimation means and the feature map output from the preceding feature estimation means, up/down-samples the pixel map to match the size of the feature map, then concatenates or adds the pixel map to the feature map, and outputs the resulting feature map to the next feature estimation means; and
stream integration means which concatenates or adds the pixel estimation stream and the feature estimation stream at the final stage.

The program according to claim 1, wherein the computer is caused to function such that the pixel estimation stream includes at least one pixel integration means and the feature estimation stream includes at least one feature integration means.

The program according to claim 1 or 2, wherein the computer is caused to function such that the pixel estimation means and the feature estimation means are each configured as a convolutional neural network.

The program according to claim 3, wherein the computer is caused to function such that the stream integration means further executes a convolution layer after concatenating or adding the pixel estimation stream and the feature estimation stream at the final stage.