JP6504590B2

JP6504590B2 - System and computer implemented method for semantic segmentation of images and non-transitory computer readable medium

Info

Publication number: JP6504590B2
Application number: JP2018523830A
Authority: JP
Inventors: Oncel Tuzel; Raviteja Vemulapalli; Ming-Yu Liu
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2016-03-25
Filing date: 2017-02-21
Publication date: 2019-04-24
Anticipated expiration: 2037-02-21
Also published as: US9704257B1; WO2017163759A1; JP2018535491A

Description

The present invention relates generally to computer vision and machine learning, and more particularly to semantic labeling of images.

Semantic segmentation, which aims to predict a category label for every pixel in an image, is an important task for scene understanding. It is a challenging problem because of the large variations in the visual appearance of the semantic classes and the complex interactions between the various classes in the visual world. Recently, convolutional neural networks (CNNs) have been shown to work well for this task. However, CNNs may not be optimal for structured prediction tasks such as semantic segmentation, because they do not directly model the interactions between output variables.

Various semantic segmentation methods use a discrete conditional random field (CRF) on top of a CNN. By combining CNNs and CRFs, these methods offer the ability of CNNs to model complex input-output relationships together with the ability of CRFs to directly model the interactions between output variables. Most of these methods use the CRF as a separate post-processing step. Typically, a CNN processes the image to produce unary energies, which are then processed by the CRF to label the image. However, a CRF operates on principles different from those of a CNN. This disconnects the CNN from the CRF and prevents their joint training. In general, the CRF is either tuned manually or trained separately from the CNN.

One alternative to using the CRF as a post-processing step is to train the CNN together with a discrete CRF by converting the inference procedure of the discrete CRF into a recurrent neural network. In general, however, inference in discrete CRFs is intractable because of the discrete and non-differentiable nature of the CRF formulation. That method therefore uses an approximate inference procedure that has no global optimality guarantee and can lead to poor training results.

Some embodiments of the invention are based on the recognition that it is advantageous to provide semantic segmentation of an image using a combination of a convolutional neural network (CNN) and a discrete conditional random field (CRF). Some embodiments are based on the further recognition that it is advantageous to replace the CRF in this combination with a neural network (NN). Such a replacement can connect the various subnetworks participating in the semantic segmentation into a common neural network that can be trained jointly. However, emulating the operations of a CRF with an NN is difficult because of the discrete and non-differentiable nature of the CRF formulation.

Some embodiments are based on the recognition that the CRF can first be replaced with a Gaussian random field (GRF), which is a subclass of CRFs. The operations of GRF inference are continuous and differentiable, and can be solved optimally. Although image segmentation is a discrete task, the GRF is nevertheless suitable for semantic segmentation.

Some embodiments are based on the recognition that the operations of GRF inference can be emulated with a neural network. Because both neuron operations and GRF operations are continuous and differentiable, the continuity of the GRF operations allows each algebraic operation in the GRF to be replaced with a number of neuron operations. These neuron operations are applied sequentially, in the same order in which the corresponding algebraic operations are applied during GRF inference.

To that end, the embodiments create a first subnetwork for determining unary energy, a second subnetwork for determining pairwise energy, and a third subnetwork for emulating GRF inference, and train all three subnetworks jointly.

Accordingly, one embodiment of the invention discloses a computer-implemented method for semantic segmentation of an image. The method includes determining unary energy of each pixel in an image using a first subnetwork; determining pairwise energy of at least some pairs of pixels of the image using a second subnetwork; determining, using a third subnetwork, an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy, to produce a GRF inference defining probabilities of semantic labels for each pixel in the image; and converting the image into a semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork, wherein the first subnetwork, the second subnetwork, and the third subnetwork are parts of a neural network. The steps of the method are performed by a processor.

Another embodiment discloses a system for semantic segmentation of an image, including at least one non-transitory computer-readable memory to store the image and a semantically segmented image; and a processor to perform semantic segmentation of the image using a Gaussian random field (GRF) network to produce the semantically segmented image, wherein the GRF network is a neural network including a first subnetwork for determining unary energy of each pixel in the image; a second subnetwork for determining pairwise energy of at least some pairs of pixels of the image; and a third subnetwork for determining an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy, to produce a GRF inference defining probabilities of semantic labels for each pixel in the image, and wherein the processor converts the image into the semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork.

Yet another embodiment discloses a non-transitory computer-readable medium storing instructions that, when executed by a processor, carry out steps including determining unary energy of each pixel in an image using a first subnetwork; determining pairwise energy of at least some pairs of pixels of the image using a second subnetwork; determining, using a third subnetwork, an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy, to produce a GRF inference defining probabilities of semantic labels for each pixel in the image; and converting the image into a semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork, wherein the first subnetwork, the second subnetwork, and the third subnetwork are jointly trained as parts of a neural network.

FIG. 1A is a block diagram of a computer system for semantic segmentation of an image according to some embodiments of the invention.
FIG. 1B is a schematic of semantic segmentation via image labeling using a Gaussian random field (GRF) neural network according to some embodiments of the invention.
FIG. 1C is a block diagram of a computer-implemented method for semantic labeling of an image according to one embodiment of the invention.
FIG. 2A is a block diagram of a GRF network according to one embodiment of the invention.
FIG. 2B is a schematic of the minimization of the energy function according to some embodiments of the invention.
FIG. 3A is a block diagram of a GRF network according to one embodiment of the invention.
FIG. 3B is pseudocode for an implementation of the GRF network according to one embodiment of the invention.
FIG. 4A is a block diagram of a method for forming the pairs of pixels for which the pairwise energy is determined, according to one embodiment of the invention.
FIG. 4B is a block diagram of a network utilizing the bipartite graph structure of FIG. 4A according to some embodiments of the invention.
FIG. 5 is a schematic of the training method used by some embodiments of the invention.
FIG. 6 is a block diagram of the training method used by some embodiments of the invention.
FIG. 7 is a block diagram of a training system according to one embodiment of the invention.

FIG. 1A shows a block diagram of a computer system 100 for semantic segmentation of an image according to some embodiments of the invention. The computer system 100 includes a processor 102 configured to execute stored instructions and a memory 104 that stores instructions executable by the processor. The processor 102 can be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. The memory 104 can include random access memory (RAM), read-only memory (ROM), flash memory, or any other suitable memory system. The processor 102 is connected through a bus 106 to one or more input and output devices.

FIG. 1B shows a schematic of semantic segmentation via image labeling using a Gaussian random field (GRF) neural network according to some embodiments of the invention. The semantic segmentation can be performed by the processor 102 executing the instructions stored in the memory 104. The GRF network 114 performs semantic labeling of the image 160 to produce a segmented image 170 whose pixels are labeled with semantic classes, for example, with the semantic labels 171, 172, and 173. The GRF network 114 is a neural network, and at least some operations of the GRF network 114 emulate the operations of GRF inference.

A GRF is a random field involving Gaussian distributions and/or Gaussian probability density functions of the variables. A one-dimensional GRF is also called a Gaussian process. For example, the GRF network 114 models the probability density of the possible semantic labels 171, 172, and 173, conditioned on the value of each pixel of the image 160, as a Gaussian distribution of an energy function including the unary energy and the pairwise energy, and performs Gaussian inference on the energy function to determine the probability of each semantic label for each pixel of the image.

In general, Gaussian inference refers to determining a property (for example, the mean or the covariance) of an underlying Gaussian distribution. In this case, the Gaussian distribution is formed by statistical variables that define the probabilities of the pixels of the image belonging to the different semantic classes. To that end, the unary energy and the pairwise energy are functions of the probabilities of the semantic labels of the pixels. For example, in some embodiments, the Gaussian inference determines the mean of the Gaussian distribution defined using the unary energy and the pairwise energy.

Some embodiments are based on the recognition that the CRF can first be replaced with a GRF, which is a subclass of CRFs. The operations of GRF inference are continuous and differentiable, and can be solved optimally. Although semantic segmentation of an image is a discrete task, the GRF is nevertheless suitable for semantic segmentation.

The computer system 100 can also include a storage device 108 adapted to store an original image 110, and a filter 112 that filters the original image to produce the image 160 suitable for segmentation. For example, the filter can resize the original image to align it with the images of the training data. The storage device 108 can also store the structure and parameters of the GRF network 114. In various embodiments, the GRF network 114 is trained on a set of training images and a corresponding set of training semantic labels.

The storage device 108 can include a hard drive, an optical drive, a thumb drive, an array of drives, or any combination thereof. A human-machine interface 116 within the computer system 100 can connect the system to a keyboard 118 and a pointing device 120, where the pointing device 120 can include a mouse, trackball, touchpad, joystick, pointing stick, stylus, or touchscreen, among others. The computer system 100 can be linked through the bus 106 to a display interface 122 adapted to connect the system 100 to a display device 124, where the display device 124 can include a computer monitor, camera, television, projector, or mobile device, among others.

The computer system 100 can also be connected to an imaging interface 126 adapted to connect the system to an imaging device 128. In one embodiment, the image for semantic segmentation is received from the imaging device. The imaging device 128 can include a camera, computer, scanner, mobile device, webcam, or any combination thereof. A printer interface 130 can also be connected to the computer system 100 through the bus 106 and adapted to connect the computer system 100 to a printing device 132, where the printing device 132 can include a liquid inkjet printer, solid ink printer, large-scale commercial printer, thermal printer, UV printer, or dye-sublimation printer, among others. A network interface controller 134 is adapted to connect the computer system 100 through the bus 106 to a network 136. Through the network 136, images 138 including one or a combination of electronic text and imaging input documents can be downloaded and stored within the computer's storage system 108 for storage and/or further processing.

For ease of explanation, this disclosure uses boldface lowercase letters to denote vectors and boldface uppercase letters to denote matrices. $\mathbf{A}^\top$ and $\mathbf{A}^{-1}$ denote the transpose and the inverse of a matrix $\mathbf{A}$. The notation $\|\mathbf{b}\|_2^2$ denotes the squared $\ell_2$ norm of a vector $\mathbf{b}$. $\mathbf{A} \succeq 0$ means that $\mathbf{A}$ is a symmetric and positive semidefinite matrix.

Neural networks are a family of models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. A neural network is generally presented as a system of interconnected nodes, or "neurons", that exchange messages with one another. Each node is associated with a function that transforms the message. This function is usually non-linear, so that it forms the non-linear part of the message transformation. Each connection between nodes is associated with a numeric weight that scales the message, forming the linear part of the message transformation. Typically, the functions are fixed and predetermined for all nodes, for example, selected by the designer of the neural network. Examples of functions typically selected for the nodes include the sigmoid and rectifier functions. In contrast, the numeric weights differ from connection to connection and are tuned based on training data, which makes the neural network adaptive to its inputs and capable of learning.
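As a minimal sketch of the node model described above, the following Python computes one node's output as a weighted sum of its incoming messages (the linear part) followed by a fixed non-linearity (the non-linear part). The function names, the optional bias term, and the numeric values are illustrative assumptions and do not come from the disclosure.

```python
import math

def sigmoid(x):
    """Sigmoid non-linearity."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Rectifier non-linearity (often called ReLU)."""
    return max(0.0, x)

def node_output(messages, weights, activation, bias=0.0):
    """Linear part: scale each incoming message by its connection weight and sum.
    Non-linear part: apply the node's fixed function to that sum."""
    linear = sum(w * m for w, m in zip(weights, messages)) + bias
    return activation(linear)

messages = [1.0, -2.0, 0.5]
weights = [0.5, 0.25, 2.0]  # the trainable part; tuned from training data
print(node_output(messages, weights, rectifier))  # 1.0
print(node_output(messages, weights, sigmoid))
```

Only `weights` (and the assumed `bias`) would change during training; the choice between `sigmoid` and `rectifier` stays fixed by design.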

Some embodiments are based on the recognition that the operations of GRF inference can be emulated with a neural network. Because both neuron operations and GRF operations are continuous and differentiable, the continuity of the GRF operations allows each algebraic operation in the GRF to be replaced with a number of neuron operations. These neuron operations are applied sequentially, in the same order in which the corresponding algebraic operations are applied during GRF inference.

Semantic segmentation assigns each pixel in the image $\mathbf{X}$ 160 to one of $K$ possible classes in the image 170. Such an assignment is referred to herein as semantic labeling. After the semantic labeling is performed, the results of the semantic labeling of the pixels produce the semantic segmentation of the image. Some embodiments use $K$ variables (one for each class) to model the output at each pixel, and the final label assignment is made based on which of these $K$ variables has the maximum value, for example, the maximum probability value. Let $\mathbf{y}_i$ be the vector of $K$ output variables associated with the $i$-th pixel, and let $\mathbf{y}$ be the vector of all output variables. For example, the conditional probability density $P(\mathbf{y} \mid \mathbf{X})$ can be modeled as a Gaussian distribution given by

$$P(\mathbf{y} \mid \mathbf{X}) \propto \exp\left\{-\tfrac{1}{2} E(\mathbf{y} \mid \mathbf{X})\right\}, \qquad E(\mathbf{y} \mid \mathbf{X}) = \sum_i \|\mathbf{y}_i - \mathbf{r}_i(\mathbf{X}; \theta_u)\|_2^2 + \sum_{ij} (\mathbf{y}_i - \mathbf{y}_j)^\top \mathbf{W}_{ij}(\mathbf{X}; \theta_p)\,(\mathbf{y}_i - \mathbf{y}_j). \qquad (1)$$

The first term in the energy function $E$ is the unary term representing the unary energy, and the second term is the pairwise term representing the pairwise energy. Here, both the unary energy parameter $\mathbf{r}_i$ of each pixel $i$ and the pairwise energy parameter $\mathbf{W}_{ij}$ between a first pixel $i$ and a second pixel $j$ are computed using functions of the input image $\mathbf{X}$, with $\theta_u$ and $\theta_p$ being the respective function parameters. In embodiments with $\mathbf{W}_{ij} \succeq 0$ for all pairs of pixels, the unary and pairwise terms can be combined into a single positive semidefinite quadratic form.
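The following Python is a small sketch of how such an energy could be evaluated. It assumes the unary term is the squared distance between each output vector and its unary parameter vector, that the pairwise term is a quadratic form in the difference of two output vectors, and that each pixel pair is counted once. The function names and the toy values are hypothetical.

```python
def quad_form(d, W):
    """Return d^T W d for a vector d and a square matrix W (nested lists)."""
    n = len(d)
    return sum(d[a] * W[a][b] * d[b] for a in range(n) for b in range(n))

def grf_energy(y, r, W):
    """Energy = unary term + pairwise term.

    y, r: lists of per-pixel K-vectors (outputs and unary parameters).
    W:    dict mapping a pixel pair (i, j), i < j, to a K x K symmetric
          positive semidefinite matrix (assumed convention: one entry per pair).
    """
    unary = sum((a - b) ** 2 for yi, ri in zip(y, r) for a, b in zip(yi, ri))
    pairwise = 0.0
    for (i, j), W_ij in W.items():
        diff = [a - b for a, b in zip(y[i], y[j])]
        pairwise += quad_form(diff, W_ij)
    return unary + pairwise

r = [[1.0, 0.0], [0.0, 1.0]]            # two pixels, two classes
W = {(0, 1): [[1.0, 0.0], [0.0, 1.0]]}  # identity coupling between them
print(grf_energy(r, r, W))                         # 2.0 (unary 0, pairwise 2)
print(grf_energy([[0.5, 0.5], [0.5, 0.5]], r, W))  # 1.0 (unary 1, pairwise 0)
```

The two calls show the trade-off the energy encodes: matching the unary parameters exactly is penalized by the pairwise term, while making neighbors identical is penalized by the unary term.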

FIG. 1C shows a block diagram of a method for semantic labeling of an image according to one embodiment of the invention. The method can be performed by the GRF network 114 executed by the processor 102. The method determines (180) unary energy 185 of each pixel in the image and determines (190) pairwise energy 195 of at least some pairs of pixels of the image. Next, the method determines (175) the GRF inference 176 of the image by processing the unary energy 185 and the pairwise energy 195. For example, in some embodiments, the GRF inference is determined by minimizing an energy function including a combination of the unary energy and the pairwise energy.

In various embodiments, the unary energy 185 is determined (180) using a first subnetwork, the pairwise energy 195 is determined (190) using a second subnetwork, and the GRF inference 176 is determined (175) using a third subnetwork. The first subnetwork, the second subnetwork, and the third subnetwork are parts of a neural network. In this manner, all the parameters of the neural network can be trained jointly.

The GRF inference defines the probabilities of the semantic labels for each pixel in the image. For example, in some embodiments of the invention, the unary energy 185 is a first function of the probabilities of the semantic labels of the pixels determined using the first subnetwork, and the pairwise energy 195 is a second function of the probabilities of the semantic labels of the pixels determined using the second subnetwork. To that end, the method converts the image 160 into the semantically segmented image 170 by assigning (196) to each pixel in the semantically segmented image 170 the semantic label having the highest probability for the corresponding pixel in the image 160 among the probabilities determined by the third subnetwork.
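The label assignment step amounts to a per-pixel argmax over the class probabilities. A minimal sketch follows, assuming the probabilities are stored as a height x width x K nested list; the function name and the class names are hypothetical.

```python
def assign_labels(probabilities, labels):
    """Pick, for every pixel, the label whose probability is highest.

    probabilities: H x W x K nested lists (K values per pixel).
    labels:        list of K semantic label names.
    Returns an H x W grid of label names.
    """
    return [
        [labels[max(range(len(p)), key=p.__getitem__)] for p in row]
        for row in probabilities
    ]

probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
]
print(assign_labels(probs, ["sky", "road", "car"]))
# [['sky', 'road'], ['car', 'sky']]
```

If two classes tie at a pixel, this sketch keeps the first one in label order; the disclosure does not specify tie-breaking.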

The optimal semantic labels $\mathbf{y}^*$ that minimize the energy function $E$ can be obtained in closed form, because the minimization of $E$ is an unconstrained quadratic program. However, this closed-form solution requires solving a linear system with a number of variables equal to the number of classes multiplied by the number of pixels. Some embodiments are based on the recognition that solving such a large linear system can be computationally prohibitive. In those embodiments, the third subnetwork determines the GRF inference by emulating the operations of Gaussian mean field (GMI) inference.
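To see where the linear system comes from, assume the energy has the form $E(\mathbf{y}) = \sum_i \|\mathbf{y}_i - \mathbf{r}_i\|_2^2 + \sum_{i<j} (\mathbf{y}_i - \mathbf{y}_j)^\top \mathbf{W}_{ij} (\mathbf{y}_i - \mathbf{y}_j)$ with symmetric $\mathbf{W}_{ij} = \mathbf{W}_{ji} \succeq 0$, each pair of pixels being counted once (the counting convention is an assumption). Setting the gradient with respect to each $\mathbf{y}_i$ to zero gives, as a sketch:

```latex
\begin{aligned}
\frac{\partial E}{\partial \mathbf{y}_i}
  &= 2(\mathbf{y}_i - \mathbf{r}_i)
   + 2\sum_{j \neq i} \mathbf{W}_{ij}(\mathbf{y}_i - \mathbf{y}_j) = \mathbf{0} \\
\Longrightarrow\quad
  \Big(\mathbf{I} + \sum_{j \neq i} \mathbf{W}_{ij}\Big)\mathbf{y}_i
   - \sum_{j \neq i} \mathbf{W}_{ij}\,\mathbf{y}_j &= \mathbf{r}_i ,
  \qquad i = 1, \dots, N .
\end{aligned}
```

Stacking these $N$ block equations yields one linear system in $NK$ unknowns for $N$ pixels and $K$ classes. For an image of a few hundred thousand pixels and around twenty classes this is several million coupled unknowns, which is why an iterative scheme is attractive.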

FIG. 2A shows a block diagram of the GRF network according to one embodiment of the invention. In this embodiment, the GRF network includes three subnetworks: a first subnetwork trained as a unary network 201 for determining the unary energy 185, a second subnetwork trained as a pairwise network 202 for determining the pairwise energy 195, and a third subnetwork that is a GMI network 203 for determining the mean field inference updates that minimize the energy function. The unary and pairwise networks produce the parameters $\{\mathbf{r}_i\}$ and $\{\mathbf{W}_{ij}\}$ used in the unary and pairwise terms of the energy function of equation (1), respectively, while the GMI network performs Gaussian mean field inference using the outputs of the unary and pairwise networks.

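In one embodiment, the mean field updates for computing the mean $\boldsymbol{\mu}$ are given by

$$\boldsymbol{\mu}_i \leftarrow \Big(\mathbf{I} + \sum_j \mathbf{W}_{ij}\Big)^{-1}\Big(\mathbf{r}_i + \sum_j \mathbf{W}_{ij}\,\boldsymbol{\mu}_j\Big). \qquad (2)$$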

Here, these updates are performed sequentially for each pixel $i$. The energy function is a convex quadratic in the case of a GRF, and the update of equation (2) solves each sub-problem optimally, that is, it finds the optimal $\mathbf{y}_i$ (or $\boldsymbol{\mu}_i$) when all the other $\mathbf{y}_j$ (or $\boldsymbol{\mu}_j$) are fixed. Performing the sequential updates is therefore guaranteed to give the maximum a posteriori (MAP) solution.
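The sequential updates can be sketched in a few lines of Python. This sketch assumes the update $\boldsymbol{\mu}_i \leftarrow (\mathbf{I} + \sum_j \mathbf{W}_{ij})^{-1}(\mathbf{r}_i + \sum_j \mathbf{W}_{ij}\boldsymbol{\mu}_j)$ and, to avoid a matrix inverse, further restricts each pairwise matrix to a non-negative scalar times the identity, so the inverse reduces to a division. That restriction, the function name, and the toy inputs are illustrative assumptions, not the disclosed implementation.

```python
def mean_field(r, w, sweeps=50):
    """Sequential Gaussian mean field updates with scalar pairwise weights.

    r: list of per-pixel K-vectors (unary parameters).
    w: dict mapping a pixel pair (i, j), i < j, to a non-negative scalar s
       standing in for the matrix W_ij = s * I (illustrative simplification).
    """
    n = len(r)
    nbrs = [[] for _ in range(n)]
    for (i, j), s in w.items():
        nbrs[i].append((j, s))
        nbrs[j].append((i, s))
    mu = [list(ri) for ri in r]  # initialise the means at the unary parameters
    for _ in range(sweeps):
        for i in range(n):  # one pixel at a time, reusing already-updated neighbours
            total = 1.0 + sum(s for _, s in nbrs[i])
            mu[i] = [
                (r[i][k] + sum(s * mu[j][k] for j, s in nbrs[i])) / total
                for k in range(len(r[i]))
            ]
    return mu

mu = mean_field([[1.0, 0.0], [0.0, 1.0]], {(0, 1): 1.0})
print([[round(v, 4) for v in m] for m in mu])
# [[0.6667, 0.3333], [0.3333, 0.6667]]
```

For this two-pixel example the fixed point can be checked by hand: solving $\boldsymbol{\mu}_0 = (\mathbf{r}_0 + \boldsymbol{\mu}_1)/2$ and $\boldsymbol{\mu}_1 = (\mathbf{r}_1 + \boldsymbol{\mu}_0)/2$ gives $\boldsymbol{\mu}_0 = (2\mathbf{r}_0 + \mathbf{r}_1)/3$, which matches the printed values.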

FIG. 2B shows a schematic of the minimization of the energy function with the NN according to some embodiments of the invention. The energy function 210 includes a combination of the unary energy 185 and the pairwise energy 195. An example of the energy function is the function of equation (1). Each layer 231, 232, 233, 234, 235, and 236 of the third subnetwork 203 recursively determines a mean field inference update that minimizes the energy function 210. An example of the recursive minimization is provided in equation (2). The number of layers in the subnetwork 203 can be selected based on the desired number of iterations of the update.

FIG. 3A shows a block diagram of the GRF network according to one embodiment of the invention. In this embodiment, the first subnetwork 201 is a convolutional NN (CNN), referred to herein as the unary CNN 305, with parameters $\theta_u^{CNN}$. For each pixel of the image 160, the unary CNN receives as input a subset of pixels neighboring that pixel and produces the probabilities of that pixel belonging to each possible semantic label. For example, the subset of pixels can be the pixels of a rectangular patch centered on that pixel.

In this embodiment, the unary energy parameters $\mathbf{r}_i$ 306 are computed using a function of the subset of pixels neighboring the pixel, and are used in the unary term of the energy function of equation (1). For example, the unary energy function is the quadratic function $\|\mathbf{y}_i - \mathbf{r}_i(\mathbf{X}; \theta_u)\|_2^2$, where $\mathbf{r}_i$ is the unary energy parameter computed through the unary CNN, $\theta_u$ denotes the parameters of the linear filters, $\mathbf{y}_i$ is the vector of semantic label probabilities, and $i$ is the index of the pixel. The unary CNN applies a series of linear filters performing convolution operations to the input of each layer and, in at least some layers, applies a non-linear function to the output of each linear filter.

For example, in one implementation the unary CNN 305 is a modified version of the Oxford Visual Geometry Group VGG-16 network. The modifications relative to VGG-16 include converting the fully connected layers into convolutional layers, skipping downsampling layers, modifying the convolutional layers (for example, those after the fourth pooling layer) to compensate for the loss in field of view caused by skipping downsampling, and using multi-scale features.

The second subnetwork (the pairwise network) 202 includes a pairwise CNN 301 with parameters $\theta_p^{CNN}$ for determining the matrices $W_{ij}$ 310 used in the pairwise term of the energy function of equation (1). For example, the pairwise network 202 uses the pairwise CNN 301 to determine a similarity measure between the pixels of a pair, determines a covariance matrix based on the similarity measure, and determines the pairwise energy as a function of the covariance matrix.

For example, the pairwise network 202 processes a first subset of pixels neighboring the first pixel $i$ of the pair to produce the features $z_i$ of the first pixel (302), and processes a second subset of pixels neighboring the second pixel $j$ of the pair to produce the features $z_j$ of the second pixel (302). The pairwise network 202 determines a function of the difference between the first and second features to produce the similarity $s_{ij}$ (303), and determines the pairwise energy as the covariance matrix $W_{ij}$ (304) according to

$$W_{ij} = s_{ij}\,C, \quad C \succeq 0, \tag{3}$$

where $s_{ij} \in [0, 1]$ is the similarity between pixels $i$ and $j$, and the learned matrix $C$ encodes class compatibility information. The similarity $s_{ij}$ can be determined (303) according to

$$s_{ij} = e^{-(z_i - z_j)^\top F\,(z_i - z_j)}, \tag{4}$$

where $z_i$ (302) is the feature vector extracted at the $i$-th pixel using the pairwise CNN 301, and the learned matrix $F \succeq 0$ defines a distance function, for example a Mahalanobis distance function.
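The two steps described above (similarity from a learned distance, then a pairwise matrix from the similarity) can be illustrated with a small standard-library sketch. It assumes the similarity is the exponential of a negative Mahalanobis-type distance and that the pairwise matrix is the similarity times a positive semidefinite class-compatibility matrix. The function names and the toy values of `F`, `C`, and the feature vectors are hypothetical.

```python
import math


def similarity(z_i, z_j, F):
    """Assumed form: s_ij = exp(-(z_i - z_j)^T F (z_i - z_j)), in (0, 1]."""
    d = [a - b for a, b in zip(z_i, z_j)]
    n = len(d)
    quad = sum(d[p] * F[p][q] * d[q] for p in range(n) for q in range(n))
    return math.exp(-quad)


def pairwise_matrix(s_ij, C):
    """Assumed form: W_ij = s_ij * C, with C a K x K compatibility matrix."""
    return [[s_ij * c for c in row] for row in C]


F = [[1.0, 0.0], [0.0, 4.0]]   # toy distance matrix (positive semidefinite)
C = [[1.0, 0.2], [0.2, 1.0]]   # toy class-compatibility matrix, K = 2
z_a, z_b, z_c = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]

s_near = similarity(z_a, z_b, F)   # similar features, similarity close to 1
s_far = similarity(z_a, z_c, F)    # dissimilar features, similarity close to 0
W_near = pairwise_matrix(s_near, C)
```

Pixels with similar features receive a larger pairwise matrix and are therefore pulled more strongly toward the same output, while dissimilar pixels are nearly decoupled.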

The structure of the pairwise CNN can be the same as that of the unary CNN. In some embodiments, the quadratic form in the exponent of $s_{ij}$ is

$$(z_i - z_j)^\top F\,(z_i - z_j) = \sum_{m=1}^{M} \left(f_m^\top z_i - f_m^\top z_j\right)^2, \tag{5}$$

where $F = \sum_{m=1}^{M} f_m f_m^\top$. In this embodiment, the Mahalanobis distance computation is implemented as convolutions of $z_i$ with the filters $f_m$, followed by a Euclidean distance computation.
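The reason a Mahalanobis distance reduces to filtering followed by a Euclidean distance can be written out in one line. The derivation below is a sketch that assumes the distance matrix factors as a sum of outer products of filter vectors $f_m$; the stacked matrix $P$, whose rows are the $f_m^\top$, is notation introduced here for illustration only.

```latex
\begin{aligned}
d &= z_i - z_j, \qquad F = \sum_{m=1}^{M} f_m f_m^\top, \\
d^\top F\, d &= \sum_{m=1}^{M} d^\top f_m f_m^\top d
             = \sum_{m=1}^{M} \left(f_m^\top d\right)^2
             = \sum_{m=1}^{M} \left(f_m^\top z_i - f_m^\top z_j\right)^2
             = \lVert P z_i - P z_j \rVert_2^2 .
\end{aligned}
```

Each product $f_m^\top z_i$ is one output channel of a 1×1 convolution applied to the feature map, so the projected features can be computed once per pixel and reused for every pair that contains that pixel.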

In one embodiment, the pairwise network 202 includes the pairwise CNN, which generates the pixel features $z_i$; a similarity layer 303, which computes $s_{ij}$ for every pair of connected pixels using equation (4) and/or equation (5); and a matrix generation layer 304, which computes the matrices $W_{ij}$ using equation (3). In this embodiment, the filters $\{f_m\}$ are the parameters of the similarity layer 303, and $C \succeq 0$ is the parameter of the matrix generation layer 304.

The GMI 203 iteratively determines, for each pixel, the probabilities of the semantic labels that minimize the energy function including the combination of the unary energy and the pairwise energy. The final output at each pixel is a $K$-dimensional vector of class prediction scores 307, where $K$ is the number of classes. Let $y_i^* = [y_{i1}^*, \ldots, y_{iK}^*]$ be the final output at the $i$-th pixel. The semantic label of the $i$-th pixel is then given by $\arg\max_k y_{ik}^*$ 308.
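The final labeling step is a per-pixel argmax over the K class scores. A minimal sketch, assuming the scores are held as one list of K numbers per pixel (the function name `label_pixels` and the sample scores are hypothetical):

```python
def label_pixels(scores):
    """Return, for each pixel, the index of its highest class score."""
    return [max(range(len(s)), key=lambda k: s[k]) for s in scores]


# Three pixels, K = 3 classes.
scores = [[0.1, 0.7, 0.2],
          [0.6, 0.3, 0.1],
          [0.2, 0.2, 0.6]]
labels = label_pixels(scores)
```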

FIG. 3B shows pseudocode for an implementation of the GRF network according to one embodiment of the invention.

FIG. 4A shows a block diagram of a method for forming the pairs of pixels of the image 160 for which the pairwise energy is determined, according to one embodiment of the invention. This embodiment is based on the understanding that determining the pairwise energy for all possible pairs of pixels in the image 160 slows the computation because of the large number of variables. Updating all pixels simultaneously in parallel appears to be a reasonable alternative, but convergence of the parallel updates is guaranteed only under limited conditions.

To address this problem, the embodiment uses a bipartite graph structure, which allows half of the variables to be updated in parallel in each step while still guaranteeing convergence without a diagonal dominance constraint. For example, the embodiment partitions (420) the pixels of the image 160 into odd pixels and even pixels based on the parity of the column or row index of each pixel, and forms (430) the pairs of pixels such that in each pair the first pixel is an odd pixel and the second pixel is an even pixel. For example, the pixel 410 is paired only with the pixels within its 7×7 spatial neighborhood that are drawn as larger black circles, such as the pixels 411, 412, 413, and 414.

In some implementations, the graphical model has a node for each pixel, and each node represents a vector of $K$ variables. To update the $i$-th node using equation (2), the embodiment keeps fixed all the other nodes connected to the $i$-th node, that is, all nodes $j$ with a non-zero $W_{ij}$. Partitioning the image into odd and even columns (or odd and even rows) and avoiding edges within each partition makes it possible to update all the odd columns (or rows) in parallel using equation (2) while the even columns (or rows) are kept fixed, and vice versa. Each step of this alternating minimization is solved optimally, and the procedure converges to the global optimum.
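The column-parity partition and the restriction of edges to opposite-parity pixels can be sketched as follows. This is an illustration only: it assumes 0-based column indices, a 7×7 window (radius 3), and pairs listed with the odd-column pixel first. The function name `bipartite_pairs` is hypothetical.

```python
def bipartite_pairs(height, width, radius=3):
    """Pair each odd-column pixel with the even-column pixels in its window.

    The window is (2 * radius + 1) pixels on each side, so radius=3 gives
    the 7x7 neighborhood. No pair joins two pixels of the same parity.
    """
    pairs = []
    for r in range(height):
        for c in range(width):
            if c % 2 == 0:
                continue  # only odd-column pixels act as the first pixel
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    if inside and cc % 2 == 0:
                        pairs.append(((r, c), (rr, cc)))
    return pairs


pairs = bipartite_pairs(height=8, width=8)
```

Since every edge crosses the partition, all odd-column pixels can be updated at once while the even-column pixels are fixed, and the reverse holds for the even columns.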

FIG. 4B shows a block diagram of a GMI network 440 that uses the bipartite graph structure of FIG. 4A according to some embodiments of the invention. The GMI network 440 performs a fixed number of Gaussian mean field updates using the outputs of the unary network and the pairwise network. The input to this network is initialized using the unary output $r_i$.

The GMI network 440 includes several GMI layers 401 combined sequentially. Each layer has two sublayers: an even update sublayer 402 followed or preceded by an odd update sublayer 403. The even update sublayer 402 takes the output of the previous layer as input and updates the even pixel nodes using equation (2) while keeping the odd pixel nodes fixed. Similarly, the odd update sublayer 403 takes the output of the even update sublayer as input and updates the odd pixel nodes using equation (2) while keeping the even pixel nodes fixed. The order of the odd and even update sublayers can be reversed.
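The even and odd sublayers can be sketched on a one-dimensional row of pixels, where each pixel is connected only to its left and right neighbors and the graph is therefore bipartite by index parity. This is a simplified illustration that assumes scalar labels, a single shared edge weight `w`, and the closed-form minimizer of the energy Σ_i (y_i − r_i)² + Σ w (y_i − y_j)² in place of equation (2). All function names are hypothetical.

```python
def half_update(y, r, w, parity):
    """Update every pixel whose index has the given parity, in parallel."""
    new_y = list(y)
    n = len(y)
    for i in range(parity, n, 2):
        num, den = r[i], 1.0
        for j in (i - 1, i + 1):   # neighbors always have the other parity
            if 0 <= j < n:
                num += w * y[j]    # reads only values fixed in this sublayer
                den += w
        new_y[i] = num / den
    return new_y


def gmi_layer(y, r, w):
    """One layer: an even update sublayer followed by an odd update sublayer."""
    y = half_update(y, r, w, parity=0)
    return half_update(y, r, w, parity=1)


def gmi_network(r, w, num_layers):
    """A fixed number of layers, initialized with the unary output."""
    y = list(r)
    for _ in range(num_layers):
        y = gmi_layer(y, r, w)
    return y


r = [0.0, 1.0, 0.0, 1.0, 0.0]
w = 0.5
y = gmi_network(r, w, num_layers=50)
```

Within a sublayer no updated pixel reads another updated pixel, so the loop over `i` could run fully in parallel without changing the result.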

Because of the bipartite graph structure, the update performed by each of these sublayers is optimal. Each layer of the GMI network is therefore guaranteed to generate an output that is closer to the MAP solution than its input, unless the input is itself the MAP solution, in which case the output equals the input.

Training
Because the GRF network 114 consists of interconnected subnetworks, these subnetworks can be trained jointly. For example, the combination of the unary network, the pairwise network, and the GMI network of FIG. 3A can be trained in an end-to-end fashion. One embodiment uses a fixed number of layers in the GMI network. Because the number of layers is finite, the output of the GRF network can potentially be suboptimal. However, because the embodiment trains the entire GRF network discriminatively in an end-to-end fashion, the unary and pairwise networks learn to generate approximate unary energy parameters $r_i$ and pairwise energy parameters $W_{ij}$ such that the output after the fixed number of mean field updates approaches the optimal solution.

FIG. 5 shows a schematic of the training used by some embodiments of the invention. The training 510 uses a training set of pairs of images 501 and corresponding semantically segmented images 502 to produce the parameters 520 of the GRF network. In general, training an artificial neural network involves applying a training algorithm, sometimes referred to as a "learning" algorithm, to the network in view of a training set. A training set can include one or more sets of inputs and one or more sets of outputs, with each set of inputs corresponding to a set of outputs. A set of outputs in the training set is the set of outputs that the network is desired to generate when the corresponding set of inputs is fed to the network and the network is then operated in a feed-forward manner. Training the neural network involves computing the parameters, for example the weight values associated with the connections in the network. For example, the parameters of the GRF network can include the unary network parameters $\theta_u = \{\theta_u^{CNN}\}$ and the pairwise network parameters $\theta_p = \{\theta_p^{CNN}, \{f_m\}, C\}$.

FIG. 6 shows a block diagram of the training method 510 used by some embodiments of the invention. The method processes an image 610 from the set 501 with the GRF network 114 to produce a semantically segmented image 630, and compares it with the corresponding semantically segmented image 620 from the set 502 to produce a distance 640 between the two semantically segmented images. For example, one embodiment determines the following loss function at each pixel as the distance 640:

$$L(y_i^*, l_i) = -\min\left(0,\; y_{i l_i}^* - \max_{k \neq l_i} y_{ik}^* - T\right),$$

where $l_i$ is the true class label. This loss function essentially encourages the output associated with the true class to exceed the outputs associated with all other classes by a margin $T$.
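A margin loss of the kind described above can be sketched for a single pixel as follows. The sketch assumes the loss is zero once the true-class score beats every other score by at least the margin, and otherwise grows linearly with the shortfall. The function name `margin_loss` and the sample scores are hypothetical.

```python
def margin_loss(scores, true_label, margin=1.0):
    """Per-pixel loss: -min(0, y_true - max over other classes - margin)."""
    best_other = max(s for k, s in enumerate(scores) if k != true_label)
    return -min(0.0, scores[true_label] - best_other - margin)


satisfied = margin_loss([2.0, 0.5, 0.1], true_label=0)   # margin met
too_close = margin_loss([1.0, 0.5, 0.1], true_label=0)   # correct but too close
wrong = margin_loss([0.2, 0.9, 0.1], true_label=0)       # wrong class leads
```

Only pixels that violate the margin contribute a gradient, so training effort concentrates on pixels that are misclassified or classified with low confidence.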

To that end, the embodiment trains the GRF network 114 discriminatively by minimizing the loss function. For example, the training is performed using backpropagation to compute the gradients of the network parameters. The training can involve a constrained optimization because of the symmetry and positive semidefiniteness constraints on the parameter $C$. One embodiment converts this constrained optimization into an unconstrained one by parameterizing $C$ as $C = R R^\top$, where $R$ is a lower triangular matrix, and uses stochastic gradient descent for the optimization.
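The effect of the lower-triangular parameterization can be checked with a short sketch: any choice of the free parameters yields a symmetric positive semidefinite matrix, so gradient steps on those parameters never leave the feasible set. The helper names and the sample parameter values below are hypothetical.

```python
def lower_triangular(params, k):
    """Fill a k x k lower triangular matrix row by row from k(k+1)/2 values."""
    R = [[0.0] * k for _ in range(k)]
    it = iter(params)
    for i in range(k):
        for j in range(i + 1):
            R[i][j] = next(it)
    return R


def compatibility_matrix(params, k):
    """Build C = R R^T, which is symmetric positive semidefinite."""
    R = lower_triangular(params, k)
    return [[sum(R[i][m] * R[j][m] for m in range(k)) for j in range(k)]
            for i in range(k)]


def quad_form(C, x):
    """Evaluate x^T C x."""
    n = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(n) for j in range(n))


params = [1.0, -0.5, 2.0, 0.3, 0.0, -1.0]   # unconstrained free parameters
C = compatibility_matrix(params, k=3)
```

Since $x^\top R R^\top x = \lVert R^\top x \rVert_2^2 \ge 0$ for every $x$, the constraint is satisfied by construction rather than enforced by projection.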

FIG. 7 shows a block diagram of a training system according to one embodiment of the invention. The training system includes a processor connected by a bus 22 to a read-only memory (ROM) 24 and a memory 38. The training system can also include a display 28 for presenting information to the user and a plurality of input devices, including a keyboard 26, a mouse 34, and other devices that can be attached via an input/output port 30. Other input devices, such as other pointing devices or voice or image sensors, can also be attached. Other pointing devices include tablets, numeric keypads, touch screens, touch screen overlays, trackballs, joysticks, light pens, thumb wheels, and the like. The I/O 30 can be connected to communication lines, disk storage, input devices, output devices, or other I/O equipment. The memory 38 includes a display buffer 72 that holds the pixel intensity values for a display screen. The display 28 periodically reads the pixel values from the display buffer 72 and displays them on the display screen. The pixel intensity values can represent gray levels or colors.

The memory 38 includes a database 90, a trainer 82, the GRF 114, and a preprocessor 84. The database 90 can include historical data 105, training data, and testing data 92. The database can also include results from the operational, training, or retaining modes of using the neural network. These elements have been described in detail above.

The memory 38 is also shown holding the operating system 74. Examples of operating systems include AIX, OS/2, and DOS. Other elements shown in the memory 38 include device drivers 76, which interpret the electrical signals generated by devices such as the keyboard and mouse. A working memory area 78 is also shown in the memory 38 and can be used by any of the elements shown there, including the neural network 101, the trainer 82, the operating system 74, and other functions. The working memory area 78 can be partitioned among the elements and within an element. It can be used for communication, buffering, temporary storage, or storage of data while a program is running.

The above-described embodiments of the invention can be implemented in any of numerous ways. For example, the embodiments can be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors can be implemented as integrated circuits, with one or more processors in an integrated circuit component. A processor can, however, be implemented using circuitry in any suitable format.

Also, the embodiments of the invention can be embodied as a method, of which an example has been provided. The acts performed as part of the method can be ordered in any suitable way. Accordingly, embodiments can be constructed in which acts are performed in an order different from that illustrated, which can include performing some acts simultaneously, even though they are shown as sequential acts in the illustrative embodiments.

The use of ordinal terms such as "first" and "second" in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which the acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).

Claims

1. A computer-implemented method for semantic segmentation of an image, comprising:
determining a unary energy of each pixel in the image using a first subnetwork;
determining a pairwise energy of at least some pairs of pixels of the image using a second subnetwork;
determining, using a third subnetwork, an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy to produce a GRF inference defining probabilities of semantic labels for each pixel in the image; and
converting the image into a semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork, wherein the first subnetwork, the second subnetwork, and the third subnetwork are parts of a neural network,
wherein determining the pairwise energy of a pair of pixels of the image includes:
determining a similarity measure between the pixels of the pair in the image;
determining a covariance matrix based on the similarity measure; and
determining the pairwise energy as a function of the covariance matrix,
and wherein each step of the method is performed by a processor.

2. The method of claim 1, further comprising:
rendering the semantically segmented image in a non-transitory computer readable memory.

3. The method of claim 1, wherein the third subnetwork determines the GRF inference by emulating an operation of Gaussian mean field inference (GMI), such that each layer of the third subnetwork recursively determines a mean field inference update that minimizes the energy function including the combination of the unary energy and the pairwise energy.

4. The method of claim 1, wherein, for each pixel in the image, the first subnetwork receives as input a subset of pixels neighboring the pixel in the image and produces a unary energy parameter of the pixel, and wherein the unary energy is a function of the unary energy parameter of each pixel in the image and the probabilities of each pixel in the image belonging to each possible semantic label.

5. The method of claim 4, further comprising:
applying a series of linear filters performing convolution operations on the input to each layer of the first subnetwork; and
applying, in some layers of the first subnetwork, a non-linear function to the output of each linear filter.

6. The method of claim 5, wherein the unary energy function is the quadratic function $\|y_i - r_i(\theta_u)\|_2^2$, where $r_i$ is the unary energy parameter computed through the first subnetwork, $\theta_u$ denotes the parameters of the linear filters, $y_i$ is the probabilities of the semantic labels, and $i$ is the index of the pixel.

7. The method of claim 4, wherein the subset of pixels is a rectangular patch centered on the pixel in the image.

8. The method of claim 1, wherein determining the similarity measure includes:
processing, using the second subnetwork, a first subset of pixels neighboring a first pixel $i$ of the pair to produce features $z_i$ of the first pixel;
processing, using the second subnetwork, a second subset of pixels neighboring a second pixel $j$ of the pair to produce features $z_j$ of the second pixel; and
determining a function of the difference between the first features and the second features to produce the similarity measure $s_{ij}$.

9. The method of claim 8, further comprising:
partitioning the pixels in the image into odd pixels and even pixels based on the parity of the column or row index of each pixel in the image; and
forming the pairs of pixels such that, in each pair, the first pixel is an odd pixel and the second pixel is an even pixel.

10. The method of claim 1, wherein the first subnetwork, the second subnetwork, and the third subnetwork are jointly trained.

11. The method of claim 1, wherein the first subnetwork, the second subnetwork, and the third subnetwork are jointly trained to minimize a loss function over a set of training images and a corresponding set of training semantic labels.

12. A system for semantic segmentation of an image, comprising:
at least one non-transitory computer readable memory to store the image and a semantically segmented image; and
a processor to perform the semantic segmentation of the image using a Gaussian random field (GRF) network to produce the semantically segmented image,
wherein the GRF network is a neural network including:
a first subnetwork for determining a unary energy of each pixel in the image;
a second subnetwork for determining a pairwise energy of at least some pairs of pixels of the image; and
a third subnetwork for determining an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy to produce a GRF inference defining probabilities of semantic labels for each pixel in the image,
wherein the processor converts the image into the semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork, and
wherein the second subnetwork:
determines a similarity measure between the pixels of a pair in the image;
determines a covariance matrix based on the similarity measure; and
determines the pairwise energy as a function of the covariance matrix.

13. The system of claim 12, wherein the third subnetwork determines the GRF inference by emulating an operation of Gaussian mean field inference (GMI), such that each layer of the third subnetwork recursively determines a mean field inference update that minimizes the energy function including the combination of the unary energy and the pairwise energy.

14. The system of claim 12, wherein, for each pixel in the image, the first subnetwork receives as input a subset of pixels neighboring the pixel in the image and produces a unary energy parameter of the pixel, and wherein the unary energy is a function of the unary energy parameter of each pixel in the image and the probabilities of each pixel in the image belonging to each possible semantic label.

15. The system of claim 12, wherein the second subnetwork determines the similarity measure by:
processing a first subset of pixels neighboring a first pixel $i$ of the pair to produce features $z_i$ of the first pixel;
processing a second subset of pixels neighboring a second pixel $j$ of the pair to produce features $z_j$ of the second pixel; and
determining a function of the difference between the first features and the second features to produce the similarity measure $s_{ij}$.

16. The system of claim 12, wherein the processor:
partitions the pixels in the image into odd pixels and even pixels based on the parity of the column or row index of each pixel in the image; and
forms the pairs of pixels such that, in each pair, the first pixel is an odd pixel and the second pixel is an even pixel.

17. The system of claim 12, wherein the first subnetwork, the second subnetwork, and the third subnetwork are jointly trained.

18. A non-transitory computer readable medium storing instructions that, when executed by a processor, cause the processor to perform steps comprising:
determining a unary energy of each pixel in an image using a first subnetwork;
determining a pairwise energy of at least some pairs of pixels of the image using a second subnetwork;
determining, using a third subnetwork, an inference on a Gaussian random field (GRF) minimizing an energy function including a combination of the unary energy and the pairwise energy to produce a GRF inference defining probabilities of semantic labels for each pixel in the image; and
converting the image into a semantically segmented image by assigning to each pixel in the semantically segmented image the semantic label having the highest probability for the corresponding pixel in the image among the probabilities determined by the third subnetwork,
wherein the first subnetwork, the second subnetwork, and the third subnetwork are jointly trained as parts of a neural network, and
wherein determining the pairwise energy of a pair of pixels of the image includes:
determining a similarity measure between the pixels of the pair in the image;
determining a covariance matrix based on the similarity measure; and
determining the pairwise energy as a function of the covariance matrix.