JP2022145825A

JP2022145825A - Image processing apparatus, image processing method, and image processing program

Info

Publication number: JP2022145825A
Application number: JP2022126701A
Authority: JP
Inventors: 琢麿山本; Takuma Yamamoto
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-03-02
Filing date: 2022-08-08
Publication date: 2022-10-04
Anticipated expiration: 2038-03-02
Also published as: JP7405198B2; JP2019153057A

Abstract

PROBLEM TO BE SOLVED: To provide an image processing apparatus that discriminates background from foreground robustly against noise and the like.

SOLUTION: An image processing apparatus includes an acquisition unit, a difference generation unit, a neural network, and an output unit. The acquisition unit acquires a background image of a background captured in advance and a target image to be discriminated. The difference generation unit generates a difference image indicating a first difference between the background image and the target image. The neural network receives the background image and the target image as inputs and estimates a residual indicating a second difference between the difference image and the map image to be output, which distinguishes the background and the foreground in the target image. The output unit outputs the map image that distinguishes the background and the foreground in the target image, based on the difference image generated by the difference generation unit and the residual estimated by the neural network.

SELECTED DRAWING: Figure 4

Description

Embodiments of the present invention relate to an image processing apparatus, a learning device, an image processing method, a learning method, an image processing program, and a learning program.

Conventionally, background subtraction is known as a technique for detecting a moving object that appears as the foreground (hereinafter also simply called the foreground) in a moving image captured by a camera. In background subtraction, a background image (also called a background model) in which the object to be detected does not appear is obtained from the moving images captured by the camera and stored. An image region corresponding to the foreground is then detected by taking the difference between the moving image captured by the camera and the background image.

As a technique for discriminating background from foreground in an image in this way, a technique is known that uses a neural network which takes image information output from an image sensor and temporally delayed image information into its input layer and outputs information corresponding to their difference from its output layer.

Japanese Laid-open Patent Publication No. H2-173877 (JP-A-2-173877)

However, the conventional technique described above has a problem in that it is difficult to discriminate background from foreground robustly when the foreground contains a color similar to the background, or in the presence of noise and the like.

For example, when the neural network has few intermediate layers, discrimination is based on local features such as edges and colors, so it becomes difficult to discriminate background from foreground when the foreground contains a color similar to the background. Such a network is also susceptible to noise and the like, which can cause false detections.

If the number of intermediate layers in the neural network is increased, on the other hand, the connection weights are initialized with small random values at the start of training, so the input signal diffuses as it passes through each layer and the neural network outputs nothing but near-zero noise. Comparing that output with the teacher data therefore yields no meaningful information, and training of the neural network does not progress. For this reason it is difficult to simply increase the number of intermediate layers in the neural network.

In one aspect, an object is to provide an image processing apparatus, a learning device, an image processing method, a learning method, an image processing program, and a learning program that enable background/foreground discrimination that is robust against noise and the like.

In a first proposal, the image processing apparatus includes an acquisition unit, a difference generation unit, a neural network, and an output unit. The acquisition unit acquires a background image of a background captured in advance and a target image to be discriminated. The difference generation unit generates a difference image indicating a first difference between the background image and the target image. The neural network receives the background image and the target image as inputs and estimates a residual indicating a second difference between the difference image and the map image to be output, which distinguishes the background and the foreground in the target image. The output unit outputs the map image that distinguishes the background and the foreground in the target image, based on the difference image generated by the difference generation unit and the residual estimated by the neural network.

According to one embodiment of the present invention, background/foreground discrimination that is robust against noise and the like can be performed.

FIG. 1 is an explanatory diagram outlining the embodiment.
FIG. 2 is an explanatory diagram illustrating examples of a background image, a target image, and a teacher image.
FIG. 3 is an explanatory diagram of a conventional technique that estimates a foreground map with a neural network.
FIG. 4 is an explanatory diagram illustrating discrimination based on high-order and low-order features.
FIG. 5 is a block diagram showing an example of the functional configuration of the image processing apparatus according to the embodiment.
FIG. 6 is a flowchart showing an example of the operation of the image processing apparatus according to the embodiment.
FIG. 7 is an explanatory diagram illustrating the neural network of the residual estimation unit.
FIG. 8 is a block diagram showing an example of the functional configuration of the learning device according to the embodiment.
FIG. 9 is a flowchart illustrating a learning flow.
FIG. 10 is an explanatory diagram showing an example of a computer that executes a program.

An image processing apparatus, a learning device, an image processing method, a learning method, an image processing program, and a learning program according to the embodiments are described below with reference to the drawings. Components having the same function in the embodiments are given the same reference numerals, and duplicate descriptions are omitted. The image processing apparatus, learning device, image processing method, learning method, image processing program, and learning program described in the following embodiments are merely examples and do not limit the embodiments. The following embodiments may also be combined as appropriate to the extent that they do not contradict one another.

FIG. 1 is an explanatory diagram outlining the embodiment. As shown in FIG. 1, in the present embodiment, a target image G1 subject to background/foreground discrimination and a background image G2 of the background captured in advance are input, and a foreground map G5 that distinguishes the background and the foreground contained in the target image G1 is obtained.

FIG. 2 is an explanatory diagram illustrating examples of the target image G1, the background image G2, and the teacher image G6. As shown in FIG. 2, the target image G1 is image data in which a foreground, such as a person H within the shooting range, is to be discriminated from the background; an example is image data from a surveillance camera used to detect suspicious persons. The background image G2 is, for example, image data of the background captured in advance. The background image G2 may also be a background model generated by superimposing several images.

The foreground map G5 is an example of a map image and is, for example, image data in which the region of the target image G1 corresponding to the background consists of black pixels and the region corresponding to the foreground consists of white pixels. Multiplying the target image G1 by the foreground map G5 obtained in this way makes it possible, for example, to identify the foreground contained in the target image G1. The identification result for the foreground contained in the target image G1 can be applied, for example, to extracting the silhouette of a subject in free-viewpoint video generation and to extracting suspicious persons in video surveillance.
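As a rough illustration of the multiplication described above, the sketch below applies a binary foreground map to a small grayscale image. The function name `apply_foreground_map`, the 0/1 encoding of black and white pixels, and the use of plain Python lists are assumptions made for this example and are not taken from the embodiment.

```python
def apply_foreground_map(target, fg_map):
    """Keep pixels of `target` where `fg_map` is 1 (white) and zero out the rest."""
    return [
        [pixel * mask for pixel, mask in zip(target_row, mask_row)]
        for target_row, mask_row in zip(target, fg_map)
    ]


# Toy 2x3 grayscale target image G1 and foreground map G5 (1 = foreground, 0 = background).
g1 = [[10, 200, 30],
      [40, 220, 60]]
g5 = [[0, 1, 0],
      [0, 1, 0]]

foreground_only = apply_foreground_map(g1, g5)
print(foreground_only)
```

Only the middle column survives the multiplication, which corresponds to the region that the map marks as foreground.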

Specifically, in the inference phase for obtaining the foreground map G5, difference generation (S1) is performed to generate a difference image G3, which is the difference between the input target image G1 and the background image G2. In addition, residual estimation (S2) is performed, in which a neural network estimates, from the input target image G1 and background image G2, a residual G4 indicating the difference between the difference image G3 and the foreground map G5 that distinguishes the background and the foreground in the target image G1 (that is, the information that the difference image G3 lacks relative to the correct foreground map G5). The difference image G3 generated in the difference generation (S1) and the residual G4 estimated in the residual estimation (S2) are then added together to obtain the foreground map G5 (S3).
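The three steps S1 to S3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: images are small lists of integer pixel values, the difference is a per-pixel absolute difference, and the trained network is replaced by a hypothetical placeholder `zero_residual` that always returns zeros. All function names are invented for the example.

```python
def difference_image(target, background):
    """S1: per-pixel absolute difference between target G1 and background G2."""
    return [
        [abs(t - b) for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(target, background)
    ]


def zero_residual(target, background):
    """S2 placeholder (assumption): stands in for the trained network, returns an all-zero G4."""
    return [[0 for _ in row] for row in target]


def foreground_map(target, background, residual_fn=zero_residual):
    """S3: add the difference image G3 and the residual G4 pixel by pixel to obtain G5."""
    g3 = difference_image(target, background)
    g4 = residual_fn(target, background)
    return [
        [d + r for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(g3, g4)
    ]


g1 = [[10, 200], [12, 180]]   # target image
g2 = [[10, 20], [10, 20]]     # background image
g5 = foreground_map(g1, g2)
print(g5)
```

Passing a different `residual_fn` shows how an estimated residual would correct the raw difference, for example by pushing weak foreground responses up and suppressing small background noise.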

In the learning phase, in which the neural network that performs the residual estimation (S2) is trained, the connection weights of the nodes making up the neural network are adjusted by comparing the foreground map G5 with the teacher image G6.

As shown in FIG. 2, the teacher image G6 is teacher data indicating the correct background/foreground answer for the target image G1 that is input when the neural network is trained. One example of the teacher image G6 is image data in which a foreground region R1 corresponding to the foreground (person H) contained in the target image G1 consists of white pixels and the region other than the foreground region R1 (the background region) consists of black pixels.

By performing supervised learning with this teacher image G6, the neural network of the residual estimation unit 11 is trained to properly estimate the residual G4, that is, the part that is missing relative to the correct foreground map G5.

FIG. 3 is an explanatory diagram of a conventional technique that estimates a foreground map with a neural network. As shown in FIG. 3, in the conventional technique, the target image G1 and the background image G2 are input to an input layer 201 of a neural network 200, and a background/foreground discrimination result G4a is output directly from an output layer 203 via intermediate layers 202.

When the neural network 200 has few intermediate layers 202, high-order features cannot be obtained from the input image, and discrimination is based on low-order features, that is, local features such as edges and colors. Here, a high-order feature is a feature that carries semantic information, such as whether a region of the image is a person or a car and whether a location is inside or outside that object. A low-order feature is a feature of the local structure of the image, such as a vertical or horizontal edge or a flat region.

FIG. 4 is an explanatory diagram illustrating discrimination based on high-order and low-order features. As shown in FIG. 4, in case C1, a neural network 200a that is a deep neural network (DNN) with many intermediate layers is used to obtain, from an input image G10, a discrimination result G11 in which the region of the person consists of white pixels. In a DNN, features become progressively more abstract as the target image G1 passes through the layers, for example edges → squares and circles (combinations of edges) → tires and faces (combinations of squares and circles) → ..., so that high-order features can be extracted. When discrimination is based on such high-order features, background and foreground can be discriminated robustly against noise and the like.

In contrast, in case C2, a three-layer neural network 200b is used to obtain the discrimination result G11 from the input image G10. Because the three-layer neural network 200b discriminates on the basis of local low-order features, it is susceptible to noise and the like and produces false detections. For example, the input image G10 shows a person playing while straddling a line on a court. In the region to the right of the line, the person includes colors similar to the color of the court, whereas in the region to the left of the line, the color of the court and the color of the person (left foot) are not similar. Consequently, in the discrimination result G11 of case C2, the parts of the foreground whose colors resemble the background are not discriminated as foreground (part of the left foot is correctly discriminated as foreground).

Returning to FIG. 3, when the number of intermediate layers 202 in the neural network 200 is increased, the connection weights are initialized with small random values at the start of training, so the input signals from the input layer 201 (the target image G1 and the background image G2) diffuse and weaken as they pass through each layer. As a result, the discrimination result G4a obtained from the output layer 203 is near-zero noise in the early stage of training. Comparing it with the teacher image G6 therefore yields no meaningful information (because the direction of the gradient is not determined), it becomes difficult to adjust the parameters representing the connection weights of the nodes in the neural network 200, and training does not progress.

In contrast, in the present embodiment, as shown in FIG. 1, a neural network is used to estimate the residual G4 indicating the difference between the difference image G3 and the foreground map G5 that distinguishes the background and the foreground in the target image G1 (that is, the information that the difference image G3 lacks relative to the correct foreground map G5). The foreground map G5 is then obtained by adding the difference image G3 and the residual G4.

Therefore, even if the residual G4 obtained in the early stage of training is 0, for example, a meaningful output (the foreground map G5) is obtained because the difference image G3 is added to it, and training of the neural network can proceed through comparison with the teacher image G6. Consequently, a foreground map G5 that reflects a residual G4 discriminated from high-order features by a DNN can be obtained, and background and foreground can be discriminated robustly even when colors similar to the background, noise, and the like are present.
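The effect of adding the difference image at the start of training can be illustrated with a toy calculation. The sketch below compares a network whose all-zero output is used directly with the residual formulation, in which the same all-zero output is added to the difference image. The mean squared error loss and the pixel values are assumptions chosen for illustration; the text above does not specify a loss function.

```python
def mean_squared_error(prediction, teacher):
    """Mean of squared per-pixel errors (an assumed loss, chosen for illustration)."""
    errors = [
        (p - t) ** 2
        for p_row, t_row in zip(prediction, teacher)
        for p, t in zip(p_row, t_row)
    ]
    return sum(errors) / len(errors)


g3 = [[0.0, 0.5, 1.0]]          # difference image (already a rough foreground cue)
g6 = [[0.0, 1.0, 1.0]]          # teacher image (correct foreground map)
g4_initial = [[0.0, 0.0, 0.0]]  # network output at the start of training

# Conventional formulation: the network output is the map itself.
direct_output = g4_initial

# Residual formulation: the network output is added to the difference image.
residual_output = [
    [d + r for d, r in zip(d_row, r_row)]
    for d_row, r_row in zip(g3, g4_initial)
]

loss_direct = mean_squared_error(direct_output, g6)
loss_residual = mean_squared_error(residual_output, g6)
print(loss_direct, loss_residual)
```

With a zero residual the output equals the difference image, so the comparison with the teacher image already isolates the one pixel that the difference image gets wrong.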

FIG. 5 is a block diagram showing an example of the functional configuration of the image processing apparatus according to the embodiment. As shown in FIG. 5, the image processing apparatus 1 includes a difference generation unit 10, a residual estimation unit 11, and an output unit 12.

The difference generation unit 10 generates the difference image G3 from the difference between the input target image G1 and the background image G2. That is, the difference generation unit 10 is an example of a generation unit. For example, the difference generation unit 10 generates the difference image G3 by computing the difference between the pixel values of corresponding pixels in the target image G1 and the background image G2. For this difference, the L1 norm (the absolute value of the difference) or the L2 norm of the pixel values, for example, can be used.
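The two norms mentioned above can be compared on a small example. The sketch below is illustrative only: it assumes RGB images stored as rows of `(r, g, b)` tuples and assumes the norm is taken over the three channel differences at each pixel, which the text does not spell out. The function name `pixel_difference` is hypothetical.

```python
import math


def pixel_difference(target, background, norm="l1"):
    """Per-pixel difference of two RGB images given as rows of (r, g, b) tuples."""

    def distance(p, q):
        deltas = [a - b for a, b in zip(p, q)]
        if norm == "l1":
            # L1 norm: sum of absolute channel differences.
            return sum(abs(d) for d in deltas)
        if norm == "l2":
            # L2 norm: Euclidean length of the channel difference vector.
            return math.sqrt(sum(d * d for d in deltas))
        raise ValueError(f"unknown norm: {norm}")

    return [
        [distance(p, q) for p, q in zip(t_row, b_row)]
        for t_row, b_row in zip(target, background)
    ]


g1 = [[(10, 10, 10), (13, 14, 10)]]  # target image
g2 = [[(10, 10, 10), (10, 10, 10)]]  # background image

g3_l1 = pixel_difference(g1, g2, norm="l1")
g3_l2 = pixel_difference(g1, g2, norm="l2")
print(g3_l1, g3_l2)
```

For a single-channel image the L1 norm reduces to the absolute value of the pixel difference, matching the parenthetical in the text.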

The quantities between which the difference generation unit 10 takes the difference are not limited to the pixel values of corresponding pixels. For example, the difference generation unit 10 may compute the difference between feature values calculated from the pixels of the target image G1 and of the background image G2 and generate the difference image G3 from that difference.

As one example, local features such as those described in the following documents may be used.
・SIFT (Scale-Invariant Feature Transform): David G. Lowe, “Distinctive image features from scale-invariant keypoints”, Int. Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
・SURF (Speeded-Up Robust Features): H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features”, In ECCV, pp. 404-417, 2006.
・BRIEF (Binary Robust Independent Elementary Features): M. Calonder, V. Lepetit, C. Strecha and P. Fua, “BRIEF: Binary Robust Independent Elementary Features”, In Proc. European Conference on Computer Vision, pp. 778-792, 2010.
・ORB (Oriented FAST and Rotated BRIEF): E. Rublee, V. Rabaud, K. Konolige and G. Bradski, “ORB: an efficient alternative to SIFT or SURF”, In Proc. International Conference on Computer Vision, 2011.

The difference generation unit 10 may also use, as feature values, the intermediate-layer values of a DNN already trained for general object recognition, such as those described in the following documents.
・AlexNet: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
・GoogLeNet: Szegedy, Christian, et al. "Going deeper with convolutions." CVPR, 2015.
・VGG: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
・ResNet: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

The residual estimation unit 11 uses a neural network to estimate, from the input target image G1 and background image G2, the residual G4 indicating the difference between the difference image G3 and the foreground map G5 that distinguishes the background and the foreground in the target image G1. That is, the residual estimation unit 11 is an example of an estimation unit.

The output unit 12 outputs the foreground map G5 based on the difference image G3 generated by the difference generation unit 10 and the residual G4 estimated by the residual estimation unit 11. Specifically, the output unit 12 outputs the foreground map G5 obtained by adding the pixel values of corresponding pixels in the difference image G3 and the residual G4. This addition may be a weighted addition. The weights may be values fixed in advance by the designer or may be variable parameters. Variable parameters may be optimized jointly when the neural network of the residual estimation unit 11 is trained.
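A minimal sketch of the plain and weighted additions follows. The parameter names `w_diff` and `w_res` and the example weight values are assumptions for illustration; the text only states that the weights may be fixed by the designer or learned together with the network.

```python
def weighted_sum(diff, residual, w_diff=1.0, w_res=1.0):
    """Combine difference image G3 and residual G4 pixel by pixel.

    With the default weights this is the plain addition; other weights give
    the weighted variant. The weight names are hypothetical.
    """
    return [
        [w_diff * d + w_res * r for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(diff, residual)
    ]


g3 = [[0.0, 0.5], [1.0, 0.25]]    # difference image
g4 = [[0.0, 0.5], [0.0, -0.25]]   # estimated residual

g5_plain = weighted_sum(g3, g4)
g5_weighted = weighted_sum(g3, g4, w_diff=0.5, w_res=2.0)
print(g5_plain, g5_weighted)
```

In the plain case the residual raises the weak response at the second pixel to 1.0 and cancels the spurious 0.25 at the last pixel.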

FIG. 6 is a flowchart showing an example of the operation of the image processing apparatus 1 according to the embodiment. As shown in FIG. 6, when processing starts, the image processing apparatus 1 acquires the background image G2 stored in advance in a memory, a hard disk, a database, storage on a network, or the like (S10). Similarly, the image processing apparatus 1 acquires the target image G1 (S11).

When an image from a surveillance camera is used as the target image G1, the image processing apparatus 1 may acquire the target image G1 directly from the surveillance camera via an interface connected to the camera. The image processing apparatus 1 may also apply preprocessing such as noise removal and color correction when acquiring the target image G1 and the background image G2.

Next, the image processing apparatus 1 inputs the target image G1 and the background image G2 to the difference generation unit 10 (S12, S13). The difference generation unit 10 then generates the difference between the target image G1 and the background image G2 (S14) to obtain the difference image G3.

Next, the image processing apparatus 1 inputs the target image G1 and the background image G2 to the residual estimation unit 11 (S15, S16). The residual estimation unit 11 then uses the neural network to estimate, from the input target image G1 and background image G2, the residual G4 indicating the difference between the foreground map G5 and the difference image G3 (S17).

The parameters of the neural network of the residual estimation unit 11 have been adjusted, in a learning phase carried out by a learning device described later, so that the residual G4 is properly estimated.

FIG. 7 is an explanatory diagram illustrating the neural network of the residual estimation unit 11. As shown in FIG. 7, the neural network 11a has a network structure in which units modeled on neurons in the brain are connected hierarchically. The brain contains a large number of neurons (nerve cells). Each neuron receives signals from other neurons and passes signals on to other neurons. The brain performs various kinds of information processing through this flow of signals. The neural network 11a is a model that realizes these characteristics of brain function on a computer.

Specifically, the neural network 11a may be a deep neural network with many intermediate layers between the layer that receives the target image G1 and the background image G2 and the layer that outputs the residual G4. The intermediate layers include, for example, convolution layers, activation function layers, pooling layers, fully connected layers, and softmax layers. The number and position of each type of layer may be changed as needed according to the required architecture. In other words, the designer can determine the layer structure of the neural network 11a and the configuration of each layer in advance according to, for example, the objects to be identified.

The neural network 11a may also be configured as a CNN (convolutional neural network) in which convolution layers and pooling layers are stacked alternately so that features can be extracted from the input image data. Alternatively, the neural network 11a may be built not as a CNN but from multiple stacked fully connected layers. In that case, the target image G1 and the background image G2 may be input as vectors in which the pixels are arranged in a single row according to a specific rule such as raster-scan order.
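For the fully connected variant, the raster-scan flattening can be sketched as below. Concatenating the two flattened images into one input vector is an assumption made for this example, as are the function names; the text only says that the images are arranged into vectors in an order such as raster-scan order.

```python
def raster_flatten(image):
    """Flatten a 2-D image row by row, left to right (raster-scan order)."""
    return [value for row in image for value in row]


def make_input_vector(target, background):
    """Concatenate the flattened target and background images (assumed layout)."""
    return raster_flatten(target) + raster_flatten(background)


g1 = [[1, 2],
      [3, 4]]   # target image
g2 = [[5, 6],
      [7, 8]]   # background image

input_vector = make_input_vector(g1, g2)
print(input_vector)
```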

For example, the neural network 11a may use a network structure such as those described in the following documents.
・FCN (Fully Convolutional Networks): Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
・U-Net: Ronneberger, Olaf, et al. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.

The neural network 11a in FIG. 7 is an example of a network structure to which the U-Net described above is applied. For example, when the target image G1 and the background image G2 are three-channel RGB color images, the input to the neural network 11a is the six channels obtained by stacking them. This input passes through convolution layers, up-convolution (enlarging convolution) layers, pooling layers, batch normalization layers, activation function layers, and so on, and a one-channel residual G4 is output.
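The six-channel input can be illustrated by concatenating the channel values of the two images at each pixel. The sketch below assumes images stored as rows of `(r, g, b)` tuples and places the target channels before the background channels; that channel order and the function name `stack_channels` are assumptions, since the text only says the two images are stacked.

```python
def stack_channels(target, background):
    """Concatenate per-pixel channels: (R, G, B) of G1 followed by (R, G, B) of G2."""
    return [
        [tuple(t) + tuple(b) for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(target, background)
    ]


g1 = [[(255, 0, 0), (0, 255, 0)]]     # target image, 1x2 pixels, RGB
g2 = [[(10, 10, 10), (20, 20, 20)]]   # background image, 1x2 pixels, RGB

network_input = stack_channels(g1, g2)
print(network_input)
```

The spatial size is unchanged; only the channel count per pixel doubles from three to six.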

Returning to FIG. 6, following S17, the output unit 12 adds the difference image G3 and the residual G4 (S18) and outputs the foreground map G5 (S19).

In the image processing apparatus 1, the obtained foreground map G5 can be applied to extracting the silhouette of a subject in free-viewpoint video generation and to extracting suspicious persons in video surveillance.

例えば、複数視点のカメラ映像から任意の視点の映像を作り出す技術は自由視点映像生成と呼ばれる。この自由視点映像生成技術を用いることで、撮影したカメラ以外の視点での映像の生成や、現実では不可能なカメラワークの映像などを生成でき、ダイナミックで臨場感のある映像コンテンツを生成や、ユーザーが各自で好きなアングルから視聴などに応用できる。 For example, the technique of creating video from an arbitrary viewpoint out of camera video from multiple viewpoints is called free-viewpoint video generation. This technique can generate video from viewpoints other than those of the capturing cameras, as well as video with camerawork that is impossible in reality. It can be applied to producing dynamic, immersive video content and to letting each user watch from a preferred angle.

この自由視点映像生成技術への適用例としては、次のようなものがある。各カメラ画像に対して前景マップＧ５を求め、得られた前景マップＧ５をもとに、人物などの前景物体のシルエットを抽出する。次いで、Visual Hullという手法を用いて前景物体の３次元構造を復元し、任意に設定した視点からの映像をレンダリングする。 Examples of application to this free-viewpoint video generation technology include the following. A foreground map G5 is obtained for each camera image, and a silhouette of a foreground object such as a person is extracted based on the obtained foreground map G5. Next, a technique called Visual Hull is used to restore the three-dimensional structure of the foreground object and render an image from an arbitrarily set viewpoint.
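The silhouette-intersection idea behind Visual Hull can be sketched as follows. This is a heavily simplified illustration under stated assumptions: it uses three orthographic, axis-aligned views instead of calibrated perspective cameras, and the function name and the synthetic ball object are hypothetical.

```python
import numpy as np

def visual_hull_orthographic(sil_xy, sil_xz, sil_yz):
    """Intersect three axis-aligned silhouettes into a boolean voxel grid [x, y, z].

    A voxel is kept only if it projects inside every silhouette.
    """
    return sil_xy[:, :, None] & sil_xz[:, None, :] & sil_yz[None, :, :]

# Synthetic foreground object: a ball in a 16^3 voxel grid.
n = 16
idx = np.arange(n) - (n - 1) / 2
x, y, z = np.meshgrid(idx, idx, idx, indexing="ij")
ball = x**2 + y**2 + z**2 <= 5.0**2

# Silhouettes as each (assumed) camera would see them, e.g. thresholded G5 maps.
sil_xy = ball.any(axis=2)
sil_xz = ball.any(axis=1)
sil_yz = ball.any(axis=0)

hull = visual_hull_orthographic(sil_xy, sil_xz, sil_yz)
```

The reconstructed hull always contains the true object and reproduces every input silhouette, which is why the quality of the foreground map directly affects the recovered shape.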

また、映像監視技術における不審者の抽出への適用例としては、次のようなものがある。監視カメラの画像に対して前景マップＧ５を求め、前景領域の画素数を求める。この前景領域の画素数が所定の閾値以上であった場合、不審物を検出したものとして検出信号を出力する。 An example of application to the extraction of a suspicious person in video surveillance is as follows. A foreground map G5 is obtained for the surveillance camera image, and the number of pixels in the foreground area is counted. If this number of pixels is greater than or equal to a predetermined threshold, a suspicious object is regarded as detected and a detection signal is output.
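The pixel-count test described above can be sketched as follows. The binarization level of 0.5 and the function name are assumptions introduced for illustration; the description only specifies comparing the foreground pixel count with a predetermined threshold.

```python
import numpy as np

def detect_suspicious(foreground_map, pixel_threshold, binarize_at=0.5):
    """Return (detected, count): count foreground pixels and compare to a threshold."""
    mask = foreground_map >= binarize_at      # assumed binarization of G5
    count = int(mask.sum())
    return count >= pixel_threshold, count

g5 = np.zeros((6, 8))
g5[2:5, 3:6] = 0.9                            # a 3 x 3 foreground blob
detected, count = detect_suspicious(g5, pixel_threshold=8)
```

In this example the blob covers 9 pixels, so a threshold of 8 triggers a detection while a threshold of 10 would not.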

次に、ニューラル・ネットワーク１１ａの学習（学習フェーズ）を行う学習装置の詳細について説明する。図８は、実施形態にかかる学習装置の機能構成例を示すブロック図である。 Next, the details of the learning device that performs learning (learning phase) of the neural network 11a will be described. FIG. 8 is a block diagram showing a functional configuration example of the learning device according to the embodiment.

なお、学習フェーズにおける対象画像Ｇ１、背景画像Ｇ２および教師画像Ｇ６については、学習用に予め設定された学習データセットのデータを用いるものとする。この学習データセットは、例えば、メモリ、ハードディスク、データベースもしくはネットワーク上のストレージ等に予め格納されているものを読み出して用いる。また、学習データセットについては、画像の回転やスケーリングなどの幾何学的変換やノイズを付加するなど、擬似的にデータの多様性を増やす処理（Data augmentation）を行ってもよい。また、ミニバッチ学習をする場合、対象画像Ｇ１、背景画像Ｇ２および教師画像Ｇ６については、ミニバッチ数分を取得してもよい。 The target image G1, the background image G2, and the teacher image G6 used in the learning phase are taken from a learning data set prepared in advance for learning. This learning data set is read from where it is stored in advance, for example, a memory, a hard disk, a database, or storage on a network. The learning data set may also be subjected to processing that artificially increases the diversity of the data (data augmentation), such as geometric transformations (for example, image rotation or scaling) or the addition of noise. In the case of mini-batch learning, as many target images G1, background images G2, and teacher images G6 as the mini-batch size may be acquired.
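A data augmentation step of this kind can be sketched as follows. The choices here are assumptions made for illustration: the same geometric transform (rotation by a multiple of 90 degrees and an optional horizontal flip) is applied to G1, G2, and G6 so that they stay aligned, Gaussian noise is added only to the two input images, and the function name and noise level are hypothetical.

```python
import numpy as np

def augment_triplet(target, background, teacher, rng, noise_std=0.02):
    """Apply one random geometric transform to all three images, noise to inputs only."""
    k = int(rng.integers(0, 4))               # number of 90-degree rotations
    flip = bool(rng.integers(0, 2))           # whether to mirror horizontally

    def geom(img):
        out = np.rot90(img, k, axes=(0, 1))
        return out[:, ::-1] if flip else out

    t, b, m = geom(target), geom(background), geom(teacher)
    t = np.clip(t + rng.normal(0.0, noise_std, t.shape), 0.0, 1.0)
    b = np.clip(b + rng.normal(0.0, noise_std, b.shape), 0.0, 1.0)
    return t, b, m.copy()

rng = np.random.default_rng(0)
background = rng.random((8, 8, 3))
target = background.copy()
target[2:5, 2:5] = 1.0
teacher = np.zeros((8, 8))
teacher[2:5, 2:5] = 1.0
aug_target, aug_background, aug_teacher = augment_triplet(target, background, teacher, rng)
```

Leaving the teacher image free of noise keeps the supervision signal exact while the inputs vary.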

図８に示すように、学習装置２は、誤差算出部２０と、勾配算出部２１と、更新部２２とを有する。 As shown in FIG. 8, the learning device 2 has an error calculator 20, a gradient calculator 21, and an updater 22.

誤差算出部２０は、画像処理装置１が出力した前景マップＧ５と、教師画像Ｇ６との入力を受け付ける。誤差算出部２０は、入力された前景マップＧ５と、教師画像Ｇ６とを比較して誤差を算出する。すなわち、誤差算出部２０は、算出部の一例である。 The error calculator 20 receives inputs of the foreground map G5 output by the image processing device 1 and the teacher image G6. The error calculator 20 compares the input foreground map G5 and the teacher image G6 to calculate an error. That is, the error calculator 20 is an example of a calculator.

例えば、誤差算出部２０は、前景マップＧ５および教師画像Ｇ６において互いに対応する画素における画素値の二乗誤差を求めることで、誤差を算出する。この誤差については、二乗誤差に限定するものではなく、Ｌ１ノルム誤差や、ロバスト統計で用いられるＨｕｂｅｒノルム誤差などを用いてもよい。 For example, the error calculator 20 calculates the error by obtaining the squared error of the pixel values of corresponding pixels in the foreground map G5 and the teacher image G6. This error is not limited to the square error, and may be an L1 norm error, a Huber norm error used in robust statistics, or the like.
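The three error measures mentioned above can be sketched as follows. Averaging over pixels (rather than summing) and the Huber threshold `delta` are assumptions; the description does not fix either.

```python
import numpy as np

def squared_error(pred, teacher):
    """Mean squared error between corresponding pixels."""
    return float(np.mean((pred - teacher) ** 2))

def l1_error(pred, teacher):
    """Mean absolute (L1) error between corresponding pixels."""
    return float(np.mean(np.abs(pred - teacher)))

def huber_error(pred, teacher, delta=1.0):
    """Huber error: quadratic for small differences, linear for large ones."""
    r = np.abs(pred - teacher)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

g5 = np.array([[0.9, 0.2], [0.1, 0.8]])       # foreground map
g6 = np.array([[1.0, 0.0], [0.0, 1.0]])       # teacher image
mse = squared_error(g5, g6)
mae = l1_error(g5, g6)
hub = huber_error(g5, g6, delta=0.15)
```

The Huber error grows only linearly for large differences, which makes it less sensitive to outlier pixels than the squared error.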

勾配算出部２１は、誤差算出部２０が算出した誤差をもとに、教師あり学習で一般的に使用される誤差逆伝搬法に基づいてニューラル・ネットワーク１１ａ全体の勾配を算出する。 Based on the error calculated by the error calculator 20, the gradient calculator 21 calculates the gradient of the entire neural network 11a based on the error backpropagation method generally used in supervised learning.

更新部２２は、勾配算出部２１が算出した勾配をもとに、ニューラル・ネットワーク１１ａを構成する各ノードの結合重み（パラメータ）の更新量を算出する。更新部２２は、算出した更新量をもとに、ニューラル・ネットワーク１１ａにおけるパラメータを更新する。 Based on the gradients calculated by the gradient calculator 21, the updater 22 calculates the update amount of the connection weights (parameters) of the nodes forming the neural network 11a. The update unit 22 updates parameters in the neural network 11a based on the calculated update amount.

更新部２２における更新量の算出には、例えば、次の文献に示すような最適化手法を用いることができる。
・Momentum付きのSGD（stochastic gradient descent）： Goodfellow, Ian, et al. Deep learning. Vol. 1. Cambridge: MIT press, 2016.
・RMSProp： Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. 2014. Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude (CSC321 Winter 2014).
・Adam： Diederik Kingma, Jimmy Ba. 2015. Adam: a method for stochastic optimization. The 3rd International Conference on Learning Representations (ICLR 2015).
To calculate the update amount, the update unit 22 can use, for example, an optimization method described in the following documents.
・SGD (stochastic gradient descent) with momentum: Goodfellow, Ian, et al. Deep learning. Vol. 1. Cambridge: MIT press, 2016.
・RMSProp: Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. 2014. Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude (CSC321 Winter 2014).
・Adam: Diederik Kingma, Jimmy Ba. 2015. Adam: a method for stochastic optimization. The 3rd International Conference on Learning Representations (ICLR 2015).
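As one illustration of how an update amount can be derived from a gradient, the following sketch implements a single step of SGD with momentum in its common textbook form. The function name, learning rate, and momentum coefficient are assumptions, and a simple quadratic stands in for the network error.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One update: the velocity accumulates past gradients and is added to the parameters."""
    new_velocity = momentum * velocity - lr * grads   # update amount
    return params + new_velocity, new_velocity

# Toy objective f(w) = ||w - optimum||^2 with gradient 2 (w - optimum).
optimum = np.array([1.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
for _ in range(200):
    grad = 2.0 * (w - optimum)
    w, v = sgd_momentum_step(w, grad, v, lr=0.05)
```

RMSProp and Adam differ mainly in scaling this update amount per parameter using running averages of the gradient magnitude.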

図９は、学習フローを例示するフローチャートである。図９に示すように、処理が開始されると、学習データセットから対象画像Ｇ１および背景画像Ｇ２を取得する（Ｓ２０、Ｓ２１）。次いで、画像処理装置１は、Ｓ１２～Ｓ１９と同様の処理を行い、前景マップＧ５を出力する（Ｓ２２～Ｓ２９）。画像処理装置１より出力された前景マップＧ５は、学習装置２の誤差算出部２０に入力される。 FIG. 9 is a flowchart illustrating the learning flow. As shown in FIG. 9, when the process starts, the target image G1 and the background image G2 are acquired from the learning data set (S20, S21). Next, the image processing apparatus 1 performs the same processing as S12 to S19 and outputs the foreground map G5 (S22 to S29). The foreground map G5 output from the image processing device 1 is input to the error calculator 20 of the learning device 2.

次いで、学習装置２の誤差算出部２０は、学習データセットから教師画像Ｇ６を取得する（Ｓ３０）。次いで、誤差算出部２０は、画像処理装置１より出力された前景マップＧ５と、教師画像Ｇ６とを比較して誤差を計算する（Ｓ３１）。 Next, the error calculator 20 of the learning device 2 acquires the teacher image G6 from the learning data set (S30). Next, the error calculator 20 compares the foreground map G5 output from the image processing device 1 and the teacher image G6 to calculate the error (S31).

次いで、勾配算出部２１は、誤差算出部２０が計算した誤差をもとに、ニューラル・ネットワーク１１ａ全体の勾配を算出する（Ｓ３２）。次いで、更新部２２は、残差推定部１１のニューラル・ネットワーク１１ａにおける各ノードの結合重みを勾配算出部２１が計算した勾配に応じて更新する（Ｓ３３）。 Next, the gradient calculator 21 calculates the gradient of the entire neural network 11a based on the error calculated by the error calculator 20 (S32). Next, the update unit 22 updates the connection weight of each node in the neural network 11a of the residual estimation unit 11 according to the gradient calculated by the gradient calculator 21 (S33).

次いで、学習装置２は、学習データセットに含まれる全てのデータを用いた学習が終了したか否かなど、所定の学習終了の条件を満たすかを判定する（Ｓ３４）。満たさない場合（Ｓ３４：ＮＯ）、学習装置２は、Ｓ２０へ処理を戻し、学習を継続する。満たす場合（Ｓ３４：ＹＥＳ）、学習装置２は、学習を終了する。 Next, the learning device 2 determines whether a predetermined learning end condition is satisfied, such as whether or not learning using all the data included in the learning data set has been completed (S34). If not satisfied (S34: NO), the learning device 2 returns to S20 and continues learning. If the condition is satisfied (S34: YES), the learning device 2 ends learning.
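The learning flow S20 to S34 can be sketched end to end as follows. This is a toy illustration under several assumptions: the neural network 11a is replaced by a per-pixel linear model (equivalent to a single 1 x 1 convolution) so that the gradient can be written by hand instead of by backpropagation, the whole synthetic data set is used as one mini-batch, plain gradient descent is used for the update, and the end condition is a fixed number of iterations. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def make_sample(rng, size=8):
    """Synthetic (G1, G2, G6): random background with a 3 x 3 foreground patch."""
    background = rng.random((size, size, 3))
    target = background.copy()
    teacher = np.zeros((size, size))
    r, c = rng.integers(0, size - 3, size=2)
    target[r:r + 3, c:c + 3] = rng.random(3)
    teacher[r:r + 3, c:c + 3] = 1.0
    return target, background, teacher

def train(num_steps=100, num_samples=16, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    samples = [make_sample(rng) for _ in range(num_samples)]
    targets = np.stack([s[0] for s in samples])        # S20: target images G1
    backgrounds = np.stack([s[1] for s in samples])    # S21: background images G2
    teachers = np.stack([s[2] for s in samples])       # S30: teacher images G6

    x = np.concatenate([targets, backgrounds], axis=-1).reshape(-1, 6)
    g3 = np.abs(targets - backgrounds).mean(axis=-1).reshape(-1)  # difference image
    g6 = teachers.reshape(-1)

    w, b = np.zeros(6), 0.0          # parameters of the stand-in network
    history = []
    for _ in range(num_steps):       # S34: stop after a fixed number of iterations
        g4 = x @ w + b               # residual estimated by the stand-in network
        g5 = g3 + g4                 # S22 to S29: foreground map
        err = g5 - g6
        history.append(float(np.mean(err ** 2)))       # S31: squared error
        grad_w = 2.0 * (x.T @ err) / err.size          # S32: gradient
        grad_b = 2.0 * float(err.mean())
        w = w - lr * grad_w                            # S33: parameter update
        b = b - lr * grad_b
    return w, b, history

w, b, history = train()
```

Only the parameters that produce G4 are updated; the difference image G3 is computed without learnable parameters, so the model only has to learn the correction to it.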

以上のように、画像処理装置１の差分生成部１０は、背景および前景の判別対象となる対象画像Ｇ１と、背景にかかる背景画像Ｇ２との差分画像Ｇ３を生成する。画像処理装置１の残差推定部１１は、入力された対象画像Ｇ１および背景画像Ｇ２より、対象画像Ｇ１から背景および前景を区別する前景マップＧ５と、差分画像Ｇ３との差を示す残差Ｇ４をニューラル・ネットワーク１１ａを用いて推定する。画像処理装置１の出力部１２は、差分生成部１０により生成された差分画像Ｇ３と、残差推定部１１により推定された残差Ｇ４とに基づく前景マップＧ５を出力する。 As described above, the difference generation unit 10 of the image processing apparatus 1 generates the difference image G3 between the target image G1, which is subject to background and foreground discrimination, and the background image G2 of the background. From the input target image G1 and background image G2, the residual estimation unit 11 of the image processing apparatus 1 uses the neural network 11a to estimate the residual G4, which indicates the difference between the difference image G3 and the foreground map G5 that distinguishes the background and foreground in the target image G1. The output unit 12 of the image processing apparatus 1 outputs the foreground map G5 based on the difference image G3 generated by the difference generation unit 10 and the residual G4 estimated by the residual estimation unit 11.

これにより、画像処理装置１は、高次な特徴を判別する多層な残差推定部１１を用いることができ、背景と類似する類似色が前景に含まれる場合や、ノイズ等に対して頑健に背景／前景の判別を行うことが可能となる。 As a result, the image processing apparatus 1 can use the multi-layered residual estimation unit 11, which discriminates high-order features, and can robustly discriminate background from foreground even when the foreground contains colors similar to the background or when noise or the like is present.

また、残差推定部１１のニューラル・ネットワーク１１ａは、中間層を多層とするディープ・ニューラル・ネットワーク（ＤＮＮ）である。これにより、残差推定部１１は、入力された対象画像Ｇ１および背景画像Ｇ２に含まれる高次な特徴をもとに残差Ｇ４を推定することができる。したがって、画像処理装置１は、ノイズ等に対してより頑健な背景／前景の判別を行うことが可能となる。 The neural network 11a of the residual estimator 11 is a deep neural network (DNN) with multiple intermediate layers. Thereby, the residual estimator 11 can estimate the residual G4 based on the high-order features included in the input target image G1 and background image G2. Therefore, the image processing apparatus 1 can perform more robust background/foreground discrimination against noise and the like.

また、残差推定部１１のニューラル・ネットワーク１１ａは、畳み込みニューラル・ネットワークである。これにより、残差推定部１１は、入力された対象画像Ｇ１および背景画像Ｇ２の抽象化を行い、高次の特徴を得ることができる。 Also, the neural network 11a of the residual estimator 11 is a convolutional neural network. As a result, the residual estimating unit 11 can abstract the input target image G1 and background image G2 and obtain high-order features.

また、差分生成部１０は、対象画像Ｇ１および背景画像Ｇ２それぞれの各画素に基づく特徴量の差分をもとに差分画像Ｇ３を生成する。このように差分画像Ｇ３は、対象画像Ｇ１および背景画像Ｇ２における特徴量の差分であってもよい。 In addition, the difference generation unit 10 generates a difference image G3 based on the difference in feature amount based on each pixel of the target image G1 and the background image G2. In this way, the difference image G3 may be the difference in feature amount between the target image G1 and the background image G2.

また、学習装置２の誤差算出部２０は、差分生成部１０により生成された差分画像Ｇ３、および、残差推定部１１のニューラル・ネットワーク１１ａにより推定された残差Ｇ４に基づいた前景マップＧ５を受け付ける。学習装置２は、受け付けた前景マップＧ５と、教師画像Ｇ６との誤差を算出する。学習装置２の更新部２２は、算出された誤差に基づいてニューラル・ネットワーク１１ａにかかるパラメータを更新する。これにより、学習装置２は、対象画像Ｇ１から背景および前景を区別する前景マップＧ５と、差分画像Ｇ３との差を示す残差Ｇ４を推定するニューラル・ネットワーク１１ａの学習を行うことができる。 Further, the error calculation unit 20 of the learning device 2 receives the foreground map G5, which is based on the difference image G3 generated by the difference generation unit 10 and the residual G4 estimated by the neural network 11a of the residual estimation unit 11. The learning device 2 calculates the error between the received foreground map G5 and the teacher image G6. The update unit 22 of the learning device 2 updates the parameters of the neural network 11a based on the calculated error. In this way, the learning device 2 can train the neural network 11a, which estimates the residual G4 indicating the difference between the difference image G3 and the foreground map G5 that distinguishes the background and foreground in the target image G1.

なお、図示した各装置の各構成要素は、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 Note that the components of each illustrated device do not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the one shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

画像処理装置１、学習装置２で行われる各種処理機能は、ＣＰＵ（またはＭＰＵ、ＭＣＵ（Micro Controller Unit）等のマイクロ・コンピュータ）上で、その全部または任意の一部を実行するようにしてもよい。また、各種処理機能は、ＣＰＵ（またはＭＰＵ、ＭＣＵ等のマイクロ・コンピュータ）で解析実行されるプログラム上、またはワイヤードロジックによるハードウエア上で、その全部または任意の一部を実行するようにしてもよいことは言うまでもない。また、画像処理装置１、学習装置２で行われる各種処理機能は、クラウドコンピューティングにより、複数のコンピュータが協働して実行してもよい。 All or any part of the various processing functions performed by the image processing device 1 and the learning device 2 may be executed on a CPU (or a microcomputer such as an MPU or MCU (Micro Controller Unit)). Needless to say, all or any part of the various processing functions may also be executed by a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or by hardware based on wired logic. Further, the various processing functions performed by the image processing device 1 and the learning device 2 may be executed by a plurality of computers working together through cloud computing.

ところで、上記の実施形態で説明した各種の処理は、予め用意されたプログラムをコンピュータで実行することで実現できる。そこで、以下では、上記の実施例と同様の機能を有するプログラムを実行するコンピュータ（ハードウェア）の一例を説明する。図１０は、プログラムを実行するコンピュータの一例を示す説明図である。 By the way, the various processes described in the above embodiments can be realized by executing a prepared program on a computer. Therefore, an example of a computer (hardware) that executes a program having functions similar to those of the above embodiments will be described below. FIG. 10 is an explanatory diagram of an example of a computer that executes a program.

図１０に示すように、コンピュータ３は、各種演算処理を実行するＣＰＵ１０１と、データ入力を受け付ける入力装置１０２と、モニタ１０３と、スピーカ１０４とを有する。また、コンピュータ３は、記憶媒体からプログラム等を読み取る媒体読取装置１０５と、各種装置と接続するためのインタフェース装置１０６と、有線または無線により外部機器と通信接続するための通信装置１０７とを有する。また、コンピュータ３は、各種情報を一時記憶するＲＡＭ１０８と、ハードディスク装置１０９とを有する。また、コンピュータ３内の各部（１０１～１０９）は、バス１１０に接続される。 As shown in FIG. 10, the computer 3 has a CPU 101 that executes various arithmetic processes, an input device 102 that receives data input, a monitor 103, and a speaker 104. The computer 3 also has a medium reading device 105 for reading programs and the like from a storage medium, an interface device 106 for connecting with various devices, and a communication device 107 for communicating with external devices by wire or wirelessly. The computer 3 also has a RAM 108 that temporarily stores various information, and a hard disk device 109. Each unit (101 to 109) in the computer 3 is connected to the bus 110.

ハードディスク装置１０９には、上記の実施形態で説明した差分生成部１０、残差推定部１１、出力部１２、誤差算出部２０、勾配算出部２１および更新部２２等の機能部における各種処理を実行するためのプログラム１１１が記憶される。また、ハードディスク装置１０９には、プログラム１１１が参照する各種データ１１２が記憶される。入力装置１０２は、例えば、コンピュータ３の操作者から操作情報の入力を受け付ける。モニタ１０３は、例えば、操作者が操作する各種画面を表示する。インタフェース装置１０６は、例えば印刷装置等が接続される。通信装置１０７は、ＬＡＮ（Local Area Network）等の通信ネットワークと接続され、通信ネットワークを介した外部機器との間で各種情報をやりとりする。 The hard disk device 109 stores a program 111 for executing the various processes of the functional units described in the above embodiments, such as the difference generation unit 10, the residual estimation unit 11, the output unit 12, the error calculation unit 20, the gradient calculation unit 21, and the update unit 22. The hard disk device 109 also stores various data 112 referred to by the program 111. The input device 102 receives, for example, input of operation information from an operator of the computer 3. The monitor 103 displays, for example, various screens operated by the operator. The interface device 106 is connected to, for example, a printing device. The communication device 107 is connected to a communication network such as a LAN (Local Area Network) and exchanges various information with external devices via the communication network.

ＣＰＵ１０１は、ハードディスク装置１０９に記憶されたプログラム１１１を読み出して、ＲＡＭ１０８に展開して実行することで、差分生成部１０、残差推定部１１、出力部１２、誤差算出部２０、勾配算出部２１および更新部２２等にかかる各種の処理を行う。なお、プログラム１１１は、ハードディスク装置１０９に記憶されていなくてもよい。例えば、コンピュータ３が読み取り可能な記憶媒体に記憶されたプログラム１１１を、コンピュータ３が読み出して実行するようにしてもよい。コンピュータ３が読み取り可能な記憶媒体は、例えば、ＣＤ－ＲＯＭやＤＶＤディスク、ＵＳＢ（Universal Serial Bus）メモリ等の可搬型記録媒体、フラッシュメモリ等の半導体メモリ、ハードディスクドライブ等が対応する。また、公衆回線、インターネット、ＬＡＮ等に接続された装置にプログラム１１１を記憶させておき、コンピュータ３がこれらからプログラム１１１を読み出して実行するようにしてもよい。 The CPU 101 reads the program 111 stored in the hard disk device 109, loads it into the RAM 108, and executes it, thereby performing the various processes of the difference generation unit 10, the residual estimation unit 11, the output unit 12, the error calculation unit 20, the gradient calculation unit 21, the update unit 22, and the like. Note that the program 111 does not have to be stored in the hard disk device 109. For example, the computer 3 may read and execute the program 111 stored in a storage medium readable by the computer 3. Examples of storage media readable by the computer 3 include portable recording media such as CD-ROMs, DVD discs, and USB (Universal Serial Bus) memories, semiconductor memories such as flash memories, and hard disk drives. Alternatively, the program 111 may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 3 may read the program 111 from that device and execute it.

以上の実施形態に関し、さらに以下の付記を開示する。 Further, the following additional remarks are disclosed with respect to the above embodiment.

（付記１）背景および前景の判別対象となる対象画像と、前記背景にかかる背景画像との差分画像を生成する生成部と、
前記対象画像から前記背景および前記前景を区別するマップ画像と、前記差分画像との差を示す残差をニューラル・ネットワークを用いて推定する推定部と、
生成された前記差分画像と、推定された前記残差とに基づくマップ画像を出力する出力部と、
を有することを特徴とする画像処理装置。 (Appendix 1) a generation unit that generates a difference image between a target image that is a background and foreground determination target and a background image that is related to the background;
an estimating unit that uses a neural network to estimate a residual indicating a difference between a map image that distinguishes the background and the foreground from the target image and the difference image;
an output unit that outputs a map image based on the generated difference image and the estimated residual;
An image processing device comprising:

（付記２）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記１に記載の画像処理装置。 (Appendix 2) The neural network is a deep neural network with multiple intermediate layers,
The image processing apparatus according to Appendix 1, characterized by:

（付記３）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記２に記載の画像処理装置。 (Appendix 3) The neural network is a convolutional neural network.
The image processing apparatus according to appendix 2, characterized by:

（付記４）前記生成部は、前記対象画像および前記背景画像それぞれの各画素に基づく特徴量の差分をもとに前記差分画像を生成する、
ことを特徴とする付記１乃至３のいずれか一に記載の画像処理装置。 (Appendix 4) The generation unit generates the difference image based on the difference in feature amount based on each pixel of the target image and the background image.
The image processing apparatus according to any one of Appendices 1 to 3, characterized by:

（付記５）背景および前景の判別対象となる対象画像と前記背景にかかる背景画像との差分画像、および、前記差分画像との差を示す残差に基づいて、ニューラル・ネットワークによって推定された、前記対象画像から前記背景および前記前景を区別するマップ画像を受け付け、当該マップ画像と、教師データとの誤差を算出する算出部と、
算出された前記誤差に基づいて前記ニューラル・ネットワークにかかるパラメータを更新する更新部と、
を有することを特徴とする学習装置。 (Appendix 5) A calculation unit that receives a map image that distinguishes the background and the foreground in a target image, the map image being estimated using a neural network based on a difference image between the target image, which is subject to background and foreground discrimination, and a background image of the background, and on a residual indicating a difference from the difference image, and that calculates an error between the map image and teacher data;
an updating unit that updates parameters for the neural network based on the calculated error;
A learning device characterized by comprising:

（付記６）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記５に記載の学習装置。 (Appendix 6) The neural network is a deep neural network with multiple intermediate layers,
The learning device according to appendix 5, characterized by:

（付記７）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記６に記載の学習装置。 (Appendix 7) The neural network is a convolutional neural network.
The learning device according to appendix 6, characterized by:

（付記８）背景および前景の判別対象となる対象画像と、前記背景にかかる背景画像との差分画像を生成し、
前記対象画像から前記背景および前記前景を区別するマップ画像と、前記差分画像との差を示す残差をニューラル・ネットワークを用いて推定し、
生成された前記差分画像と、推定された前記残差とに基づくマップ画像を出力する、
処理をコンピュータが実行することを特徴とする画像処理方法。 (Appendix 8) generating a difference image between a target image to be used for background and foreground discrimination and a background image for the background;
estimating a residual indicating a difference between a map image that distinguishes the background and the foreground from the target image and the difference image using a neural network;
outputting a map image based on the generated difference image and the estimated residual;
An image processing method characterized in that the processing is executed by a computer.

（付記９）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記８に記載の画像処理方法。 (Appendix 9) The neural network is a deep neural network with multiple intermediate layers,
The image processing method according to appendix 8, characterized by:

（付記１０）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記９に記載の画像処理方法。 (Appendix 10) The neural network is a convolutional neural network.
The image processing method according to appendix 9, characterized by:

（付記１１）前記生成する処理は、前記対象画像および前記背景画像それぞれの各画素に基づく特徴量の差分をもとに前記差分画像を生成する、
ことを特徴とする付記８乃至１０のいずれか一に記載の画像処理方法。 (Appendix 11) The generating process generates the difference image based on a difference in feature amount based on each pixel of the target image and the background image.
The image processing method according to any one of appendices 8 to 10, characterized by:

（付記１２）背景および前景の判別対象となる対象画像と前記背景にかかる背景画像との差分画像、および、前記差分画像との差を示す残差に基づいて、ニューラル・ネットワークによって推定された、前記対象画像から前記背景および前記前景を区別するマップ画像を受け付け、当該マップ画像と、教師データとの誤差を算出し、
算出された前記誤差に基づいて前記ニューラル・ネットワークにかかるパラメータを更新する、
処理をコンピュータが実行することを特徴とする学習方法。 (Appendix 12) Receiving a map image that distinguishes the background and the foreground in a target image, the map image being estimated using a neural network based on a difference image between the target image, which is subject to background and foreground discrimination, and a background image of the background, and on a residual indicating a difference from the difference image, and calculating an error between the map image and teacher data;
updating parameters for the neural network based on the calculated error;
A learning method characterized in that processing is executed by a computer.

（付記１３）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記１２に記載の学習方法。 (Appendix 13) The neural network is a deep neural network with multiple intermediate layers,
The learning method according to appendix 12, characterized by:

（付記１４）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記１３に記載の学習方法。 (Appendix 14) The neural network is a convolutional neural network.
The learning method according to appendix 13, characterized by:

（付記１５）背景および前景の判別対象となる対象画像と、前記背景にかかる背景画像との差分画像を生成し、
前記対象画像から前記背景および前記前景を区別するマップ画像と、前記差分画像との差を示す残差をニューラル・ネットワークを用いて推定し、
生成された前記差分画像と、推定された前記残差とに基づくマップ画像を出力する、
処理をコンピュータに実行させることを特徴とする画像処理プログラム。 (Appendix 15) generating a difference image between a target image to be used for background and foreground discrimination and a background image for the background;
estimating a residual indicating a difference between a map image that distinguishes the background and the foreground from the target image and the difference image using a neural network;
outputting a map image based on the generated difference image and the estimated residual;
An image processing program that causes a computer to execute processing.

（付記１６）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記１５に記載の画像処理プログラム。 (Appendix 16) The neural network is a deep neural network with multiple intermediate layers,
The image processing program according to appendix 15, characterized by:

（付記１７）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記１６に記載の画像処理プログラム。 (Appendix 17) The neural network is a convolutional neural network.
The image processing program according to appendix 16, characterized by:

（付記１８）前記生成する処理は、前記対象画像および前記背景画像それぞれの各画素に基づく特徴量の差分をもとに前記差分画像を生成する、
ことを特徴とする付記１５乃至１７のいずれか一に記載の画像処理プログラム。 (Appendix 18) In the generating process, the difference image is generated based on a difference in feature amount based on each pixel of the target image and the background image.
The image processing program according to any one of appendices 15 to 17, characterized by:

（付記１９）背景および前景の判別対象となる対象画像と前記背景にかかる背景画像との差分画像、および、前記差分画像との差を示す残差に基づいて、ニューラル・ネットワークによって推定された、前記対象画像から前記背景および前記前景を区別するマップ画像を受け付け、当該マップ画像と、教師データとの誤差を算出し、
算出された前記誤差に基づいて前記ニューラル・ネットワークにかかるパラメータを更新する、
処理をコンピュータに実行させることを特徴とする学習プログラム。 (Appendix 19) Receiving a map image that distinguishes the background and the foreground in a target image, the map image being estimated using a neural network based on a difference image between the target image, which is subject to background and foreground discrimination, and a background image of the background, and on a residual indicating a difference from the difference image, and calculating an error between the map image and teacher data;
updating parameters for the neural network based on the calculated error;
A learning program characterized by causing a computer to execute processing.

（付記２０）前記ニューラル・ネットワークは、中間層を多層とするディープ・ニューラル・ネットワークである、
ことを特徴とする付記１９に記載の学習プログラム。 (Appendix 20) The neural network is a deep neural network with multiple intermediate layers,
The learning program according to appendix 19, characterized by:

（付記２１）前記ニューラル・ネットワークは、畳み込みニューラル・ネットワークである、
ことを特徴とする付記２０に記載の学習プログラム。 (Appendix 21) The neural network is a convolutional neural network.
The learning program according to appendix 20, characterized by:

１…画像処理装置
２…学習装置
３…コンピュータ
１０…差分生成部
１１…残差推定部
１１ａ、２００、２００ａ、２００ｂ…ニューラル・ネットワーク
１２…出力部
２０…誤差算出部
２１…勾配算出部
２２…更新部
１０１…ＣＰＵ
１０２…入力装置
１０３…モニタ
１０４…スピーカ
１０５…媒体読取装置
１０６…インタフェース装置
１０７…通信装置
１０８…ＲＡＭ
１０９…ハードディスク装置
１１０…バス
１１１…プログラム
１１２…各種データ
２０１…入力層
２０２…中間層
２０３…出力層
Ｃ１、Ｃ２…ケース
Ｇ１…対象画像
Ｇ２…背景画像
Ｇ３…差分画像
Ｇ４…残差
Ｇ４ａ…判別結果
Ｇ５…前景マップ
Ｇ６…教師画像
Ｇ１０…入力画像
Ｇ１１…判別結果
Ｈ…人物
Ｒ１…前景領域
Reference Signs List
1: Image processing device
2: Learning device
3: Computer
10: Difference generation unit
11: Residual estimation unit
11a, 200, 200a, 200b: Neural network
12: Output unit
20: Error calculation unit
21: Gradient calculation unit
22: Update unit
101: CPU
102: Input device
103: Monitor
104: Speaker
105: Medium reading device
106: Interface device
107: Communication device
108: RAM
109: Hard disk device
110: Bus
111: Program
112: Various data
201: Input layer
202: Intermediate layer
203: Output layer
C1, C2: Case
G1: Target image
G2: Background image
G3: Difference image
G4: Residual
G4a: Discrimination result
G5: Foreground map
G6: Teacher image
G10: Input image
G11: Discrimination result
H: Person
R1: Foreground area

Claims

an acquisition unit that acquires a background image of a background captured in advance and a target image to be determined;
a difference generation unit that generates a difference image indicating a first difference between the background image and the target image;
a neural network that, when each of the background image and the target image is input to it, estimates a residual indicating a second difference between the difference image and a map image to be output in which the background and the foreground are distinguished in the target image;
an output unit that outputs a map image in which a background and a foreground are distinguished from the target image based on the difference image generated by the difference generation unit and the residual estimated by the neural network;
An image processing device comprising:

The image processing apparatus according to claim 1, wherein the output unit outputs the map image obtained by adding the pixel values of mutually corresponding pixels in the difference image and the residual.

The image processing apparatus according to claim 1, wherein the neural network is a neural network whose parameters have been changed based on the output result produced by the neural network when the background image and the target image are input to it and on teacher data indicating correct data for the map image.

Acquiring a background image of a background photographed in advance and a target image to be determined,
generating a difference image indicating a first difference between the background image and the target image;
estimating, by inputting each of the background image and the target image to a neural network, a residual indicating a second difference between the difference image and a map image to be output in which the background and the foreground are distinguished in the target image;
outputting a map image in which a background and a foreground are distinguished from the target image based on the difference image generated by the difference generation unit and the residual estimated by the neural network;
An image processing method characterized in that the processing is executed by a computer.

Acquiring a background image of a background photographed in advance and a target image to be determined,
generating a difference image indicating a first difference between the background image and the target image;
estimating, by inputting each of the background image and the target image to a neural network, a residual indicating a second difference between the difference image and a map image to be output in which the background and the foreground are distinguished in the target image;
outputting a map image in which a background and a foreground are distinguished from the target image based on the generated difference image and the residual estimated by the neural network;
An image processing program that causes a computer to execute processing.

The image processing program according to claim 5, wherein the outputting outputs the map image obtained by adding the pixel values of mutually corresponding pixels in the difference image and the residual.

The image processing program according to claim 5, wherein the neural network is a neural network whose parameters have been changed based on the output result produced by the neural network when the background image and the target image are input to it and on teacher data indicating correct data for the map image.