JP2024025499A

JP2024025499A - Information processing device, method for producing fully convolutional network, and program

Info

Publication number: JP2024025499A
Application number: JP2022128999A
Authority: JP
Inventors: 友則矢澤; Tomonori Yazawa
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2022-08-12
Filing date: 2022-08-12
Publication date: 2024-02-26

Abstract

PROBLEM TO BE SOLVED: To classify array data with high accuracy and at high speed.
SOLUTION: A first processing means calculates the likelihood that array data is included in a specific class. A second processing means, which performs processing different from that of the first processing means, calculates the likelihood that array data is included in a specific class using a fully convolutional network. A learning means performs learning processing of the fully convolutional network, using as teacher data the information obtained in the course of the first processing means calculating the likelihood for array data for learning.
[Selection diagram] Figure 1

Description

The present invention relates to an information processing device, a method for producing a fully convolutional network, and a program, and in particular to segmentation processing for array data.

In recent years, techniques have been proposed for classifying array data such as images and for extracting part of the array data based on the classification results. For example, useful information can be extracted from an image by processing an image in which an object appears. In particular, object recognition techniques that use neural networks (e.g., multilayer deep neural networks) to recognize the categories of objects in images are being actively researched. Techniques for segmenting images by estimating the categories of objects in images at the pixel level are also being actively researched.

For example, Non-Patent Document 1 proposes the Vision Transformer, which performs object recognition using the Transformer architecture. In this method, layers such as the Self-Attention layer perform processing that uses the entire intermediate processing result. Non-Patent Document 2, on the other hand, proposes a technique for segmenting images using a fully convolutional network consisting only of convolutional layers and pooling layers.

Non-Patent Document 1: A. Dosovitskiy et al., "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv:2010.11929
Non-Patent Document 2: J. Long et al., "Fully Convolutional Networks for Semantic Segmentation", Computer Vision and Pattern Recognition (CVPR) 2015

In the method of Non-Patent Document 1, each layer processes the entire image, which improves object recognition accuracy but makes the processing time-consuming. In the method of Non-Patent Document 2, the convolutional layers perform local filter processing, so segmentation can be performed at high speed, but the processing accuracy tends to be lower than that of the method of Non-Patent Document 1. Furthermore, because each layer in the method of Non-Patent Document 1 refers to the entire image while each layer in the method of Non-Patent Document 2 refers to a local region of the image, it has not been easy to incorporate the technique of Non-Patent Document 1 into the technique of Non-Patent Document 2.

An object of the present invention is to classify array data with high accuracy and at high speed.

An information processing device according to an embodiment of the present invention has the following configuration. That is, the device comprises:
a first processing means for calculating the likelihood that array data is included in a specific class;
a second processing means that performs processing different from that of the first processing means and that calculates the likelihood that array data is included in a specific class using a fully convolutional network; and
a learning means for performing learning processing of the fully convolutional network, using as teacher data the information obtained in the course of the first processing means calculating the likelihood for array data for learning.

Array data can be classified with high accuracy and at high speed.

FIG. 1 is a block diagram of an information processing device according to an embodiment.
FIG. 2 is a flowchart showing the procedure of learning processing according to an embodiment.
FIG. 3 is a diagram showing an example of processing by the second processing unit 102.
FIG. 4 is a diagram illustrating an example of a learning method of the second processing unit 102.
FIG. 5 is a diagram illustrating an example of a learning method of the second processing unit 102.
FIG. 6 is a diagram illustrating an example of a learning method of the second processing unit 102.
FIG. 7 is a block diagram of an information processing device according to an embodiment.
FIG. 8 is a diagram illustrating a method for confirming a network structure using test data.
FIG. 9 is a diagram illustrating an example of an additional learning method of the first processing unit 101.
FIG. 10 is a diagram showing an example of the hardware configuration of an information processing device according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The following embodiments do not limit the claimed invention. Although a plurality of features are described in the embodiments, not all of these features are essential to the invention, and the features may be combined arbitrarily. In the accompanying drawings, identical or similar components are given the same reference numerals, and redundant descriptions are omitted.

An information processing device according to an embodiment of the present invention includes a first processing unit, a second processing unit, and a learning unit. This information processing device can classify array data. In the following example, the information processing device classifies, for each of a plurality of positions in the array data, the data around that position. The information processing device can also segment the array data according to the classification results. In the following example, the information processing device specifically divides an image according to the classes of the objects appearing in it. Such an information processing device can be used, for example, as an object recognition system that recognizes objects appearing in images.

The first processing unit can perform, for example, processing similar to that of Non-Patent Document 1. The second processing unit can perform, for example, processing similar to that of Non-Patent Document 2. According to an embodiment of the present invention, the second processing unit can classify array data with accuracy close to that of the first processing unit, which has been trained in advance. At the same time, the second processing unit can classify and segment array data at high speed, as in Non-Patent Document 2.

Array data has, for example, data arranged along one or more coordinate axes. Examples of array data include audio data having sound pressure data arranged along a one-dimensional time axis, and image data having pixel data arranged along two-dimensional coordinate axes. The array data may also be voxel data having pixel data arranged along three-dimensional coordinate axes. An example in which the array data is image data is described below. In this example, the second processing unit uses a two-dimensional fully convolutional network to process the image data. Alternatively, the second processing unit can process audio data using a fully convolutional network along the time axis, or process voxel data using a three-dimensional fully convolutional network.

FIG. 1 is a block diagram showing the configuration of an information processing device 1 according to an embodiment of the present invention. The first processing unit 101 calculates the likelihood that array data is included in a specific class. For example, the first processing unit 101 can calculate, for each of a plurality of classes, the likelihood that the array data is included in that class. In this embodiment, the first processing unit 101 performs processing that generates an object likelihood from an image.

In this example, array data of a predetermined size is input to the first processing unit 101. For example, fixed-size array data for learning (that is, a learning image) is input to the first processing unit 101. The first processing unit 101 then calculates the object likelihood for the input image. The object likelihood indicates, for each of one or more categories, an estimate of the probability that the object appearing in the input image belongs to that category (that is, class).

FIG. 4(A) is a diagram showing an example of processing by the first processing unit 101. An input image 10101 is input to the first processing unit 101. To improve classification accuracy, the first processing unit 101 can calculate the object likelihood using Self-Attention processing, a Transformer, or repeated processing that uses all of the input features.

In this embodiment, the first processing unit 101 performs processing using the Vision Transformer described in Non-Patent Document 1. The Vision Transformer first converts the input image into features. It then applies to the features, L times, a block called an encoder, which includes a multilayer perceptron and Self-Attention. Finally, it converts the resulting features into a likelihood for each object category.

In this embodiment, the parameters that the first processing unit 101 uses for processing have been adjusted so that the object likelihood can be estimated. For example, the Vision Transformer used by the first processing unit 101 has been trained in advance to estimate the object likelihood. Here, learning means adjusting the parameters of the Vision Transformer. The learning method is not particularly limited, but the following method can be used, for example. The validity of the estimate is judged from the result of applying the softmax function to the object likelihood and from a label representing the category of the object in the image. A loss function indicating the validity of the estimate can be defined for this purpose. The parameters of the Vision Transformer can then be adjusted by performing backpropagation so that the value of the loss function decreases.
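The softmax-and-label training step described above can be sketched as follows. This is only an illustration under stated assumptions: the description does not name a specific loss function, so cross-entropy is assumed here, and the gradient step is applied directly to the raw scores as a stand-in for backpropagation into the Vision Transformer parameters. The names `softmax`, `cross_entropy`, and `grad_wrt_scores`, and the toy values, are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw per-category scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, label):
    """Assumed loss: negative log-probability of the labelled category."""
    return -math.log(softmax(scores)[label])

def grad_wrt_scores(scores, label):
    """Gradient of the cross-entropy with respect to the scores:
    softmax(scores) minus the one-hot label vector."""
    probs = softmax(scores)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

# Toy object likelihood (raw scores) for three categories; the label is category 0.
scores = [2.0, 0.5, -1.0]
label = 0
loss_before = cross_entropy(scores, label)

# One gradient-descent step on the scores (stand-in for a parameter update).
lr = 0.5
grads = grad_wrt_scores(scores, label)
scores_after = [s - lr * g for s, g in zip(scores, grads)]
loss_after = cross_entropy(scores_after, label)
```

In a real network the same gradient would be propagated back through the likelihood conversion and the encoder blocks to update their parameters.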

In the example of FIG. 4(A), the input image 10101 is converted into features 10103 by applying a linear transformation to its pixel values. When the Vision Transformer is used, the data to be processed is divided into a plurality of features, and the Vision Transformer operates on the features obtained by this division. In this embodiment, the image is divided into partial images of the same size, and features are obtained by applying a linear transformation to each partial image.
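As a minimal sketch of this feature extraction step, the code below splits a toy single-channel image into equally sized partial images and applies one shared linear transformation to each. The 4×4 image, the 2×2 patch size, and the weight values are illustrative assumptions; other components of the Vision Transformer input stage are outside the scope of this sketch.

```python
def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch blocks, each flattened row by row."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile exactly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def linear(vec, weight, bias):
    """Compute y = W x + b, with the weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

# Toy 4x4 image whose pixel at (r, c) has the value 4*r + c.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)

# Hypothetical 2-dimensional embedding: first pixel, and mean of the patch.
weight = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
bias = [0.0, 0.0]
features = [linear(p, weight, bias) for p in patches]
```

Each entry of `features` corresponds to one partial image, so the number of features depends on the image size and the patch size.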

The features 10103 are input to a first encoder block 10104. The first encoder block 10104 includes a multilayer perceptron and Self-Attention processing, and outputs features of the same dimensions as the input features. The first encoder block 10104 outputs the generated features to a second encoder block 10105. Each subsequent encoder block, including the second encoder block 10105, likewise processes the features output from the preceding encoder block and outputs the features it generates. The features output from the L-th encoder block 10106 are input to likelihood conversion processing 10107, which converts features into an object likelihood. The Vision Transformer obtains the object likelihood 10108 by applying multilayer perceptron processing to some dimensions of the features output from the L-th encoder block 10106. The object likelihood 10108 is the object likelihood output by the first processing unit 101. Each dimension of the object likelihood 10108 corresponds to an object category and represents an estimate of the probability that the object belongs to that category. In the example of FIG. 4(A), the object likelihood 10108 consists of a first category likelihood through a d-th category likelihood, which indicate the estimated probabilities that the object belongs to the first through d-th categories, respectively. The object likelihood can thus be represented as a vector.

The Self-Attention processing included in the encoder blocks 10104 to 10106 of the Vision Transformer uses all values of the input features. Processing with the Vision Transformer therefore makes effective use of information from the entire image. In convolution processing, by contrast, only the input features that fall within the filter size are used, so local features are output. Consequently, processing with the Vision Transformer improves the accuracy of object category estimation compared with a convolutional neural network that performs convolution processing.

The second processing unit 102 uses a fully convolutional network to calculate the likelihood that array data is included in a specific class. For example, the second processing unit 102 calculates, for each of a plurality of classes, the likelihood that the array data is included in that class. The second processing unit 102 calculates the likelihood using processing different from that of the first processing unit 101. In this embodiment, the second processing unit 102 generates, from an image, an object likelihood map containing one or more object likelihoods. The object likelihood map has coordinates corresponding to the image coordinates, and the element at each coordinate represents an object likelihood. In this embodiment, the second processing unit 102 can calculate an object likelihood map that represents the object likelihood for each pixel or partial region of the input image.

FIG. 3(A) is a diagram showing an example of processing by the second processing unit 102. An image 10201 is input to the second processing unit 102. In this embodiment, the second processing unit 102 uses the fully convolutional network described in Non-Patent Document 2. A fully convolutional network is a neural network in which every layer is a convolutional layer or a pooling layer. In processing with a fully convolutional network, a feature map is obtained by applying convolution or pooling to the input image multiple times (M times in FIG. 3(A)). The feature map is then converted into an object likelihood map and output. In addition to convolution or pooling, each layer may also apply an activation function, a bias, and the like.

As described above, array data of a predetermined size is input to the first processing unit 101. Array data of this predetermined size or larger, on the other hand, can be input to the second processing unit 102. The fully convolutional network used by the second processing unit 102 is adjusted so that, when first array data of the predetermined size is input, it outputs a single object likelihood (for example, one vector) indicating the likelihood that the first array data is included in a specific class. For example, the fully convolutional network outputs one object likelihood when an image of the same size as the input image to the first processing unit 101 is input. In the example of FIG. 4(B), the number of convolutional and pooling layers after this adjustment is M. For example, if the input image to the first processing unit 101 is 99×99 pixels, the fully convolutional network may have 49 convolutional layers, each performing convolution with a 3×3 filter. When a 99×99 pixel image is input to such a neural network, one object likelihood is output. As described later, these convolutional layers do not perform padding, so each convolutional layer reduces the height and width of the image by two pixels.
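The layer count in the 99×99 example can be checked with simple arithmetic: an unpadded 3×3 convolution with stride 1 removes two pixels from each side length, so 49 such layers remove 98 pixels. The sketch below assumes stride 1 throughout; the function name `valid_output_size` and the 131-pixel input are illustrative and not taken from the description.

```python
def valid_output_size(input_size, kernel_size, num_layers, stride=1):
    """Side length remaining after stacking unpadded convolution layers."""
    size = input_size
    for _ in range(num_layers):
        if size < kernel_size:
            raise ValueError("input is too small for this many layers")
        size = (size - kernel_size) // stride + 1
    return size

# 99 - 2 * 49 = 1, so a 99x99 input yields a single output position.
single = valid_output_size(99, 3, 49)

# A larger input leaves more than one position, i.e. a likelihood map.
larger = valid_output_size(131, 3, 49)  # 131 - 98 = 33
```

Under the same assumptions, an input of H×W pixels would give an output map of (H − 98)×(W − 98) positions, each corresponding to one 99×99 window of the input.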

On the other hand, as shown in FIG. 3(A), when second array data larger than the predetermined size is input, the fully convolutional network used by the second processing unit 102 outputs an object likelihood map containing a plurality of object likelihoods. For example, when an image larger than the input image to the first processing unit 101 is input, the second processing unit 102 can generate a likelihood map whose size depends on the size of the input image. In this case, the object likelihood map indicates the likelihood that each of a plurality of pieces of partial array data is included in a specific class. Each piece of partial array data is a part of the second array data. For example, the partial array data is a partial image of the predetermined size that forms part of an image larger than the predetermined size.

In this embodiment, the fully convolutional network is also configured so that it performs neither padding nor any layer processing whose behavior changes with the size of the array data. In this embodiment, the size of the image input when the second processing unit 102 is trained (for example, the image 10101 in FIG. 4(B)) differs from the size of the image input when the second processing unit 102 is used to estimate object likelihoods (for example, the image 10201 in FIG. 3(A)). If the processing performed by the second processing unit 102 included layer processing whose behavior changes with the input image size, training would be less effective at improving the accuracy of the segmentation processing. The fully convolutional network can therefore be configured, for example, so that no padding is performed in the convolutional and pooling layers. Removing the influence of padding at the edges of the image means that the estimate for a given partial region does not change even when the size of the image input to the second processing unit 102 changes. This enables more accurate segmentation processing.
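The effect of omitting padding can be illustrated with a toy network: without padding, each output position of the network applied to a large image equals the output of the same network applied to the corresponding window alone. The sketch below assumes a single channel, two 3×3 layers with a ReLU activation, and random weights; all of these are illustrative choices, not the configuration in the description.

```python
import random

def conv2d_valid(x, k):
    """Single-channel 2-D cross-correlation without padding."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def fcn(x, kernels):
    """Toy fully convolutional network: unpadded convolution, then ReLU."""
    for k in kernels:
        x = relu(conv2d_valid(x, k))
    return x

def crop(img, top, left, size):
    return [row[left:left + size] for row in img[top:top + size]]

random.seed(0)
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
window = 5  # two unpadded 3x3 layers reduce a 5x5 window to one value

image = [[random.random() for _ in range(8)] for _ in range(7)]  # 7 rows, 8 cols
full_map = fcn(image, kernels)  # 3 x 4 map from the whole image

# Compare each map element with the network run on the matching window alone.
max_diff = max(
    abs(full_map[i][j] - fcn(crop(image, i, j, window), kernels)[0][0])
    for i in range(len(full_map)) for j in range(len(full_map[0])))
```

With padding, the values near the image border would depend on where the border lies, so this equivalence between the training-size window and the larger inference image would no longer hold at the edges.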

The image 10201 is input to first layer processing 10203 of the second processing unit 102, which is processing in a convolutional layer or a pooling layer. The first layer processing 10203 processes the input image. Because the first layer processing 10203 performs convolution or pooling, its result is a feature map having the same coordinate axes as the image 10201. The result of the first layer processing 10203 is output to second layer processing 10204. Each subsequent layer processing, including the second layer processing 10204 and M-th layer processing 10205, is likewise processing in a convolutional layer or a pooling layer. Each of these processes the output of the preceding layer processing and outputs its result.

The feature map obtained by the M-th layer processing 10205 is input to likelihood conversion processing 10206. The likelihood conversion processing 10206 generates an object likelihood map 10207 by converting each feature in the feature map into an object likelihood. The second processing unit 102 then outputs this object likelihood map 10207. The likelihood conversion processing 10206 is similar to the likelihood conversion processing 10107 in the first processing unit 101. Because the second processing unit 102 handles feature maps, the likelihood conversion processing 10206 is performed by convolution. For example, the conversion from the feature map to object likelihoods can be performed by a single convolutional layer.
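One way to picture a single-layer convolutional likelihood conversion is a 1×1 convolution, which applies the same linear map to the feature vector at every position of the feature map. The 1×1 kernel size is an assumption made for this sketch (the description only says one convolutional layer), and the function name `conv1x1` and all values are hypothetical.

```python
def conv1x1(feature_map, weight, bias):
    """Apply the same linear map at every position.
    feature_map: H x W x C nested lists, weight: d x C, bias: length d.
    Returns an H x W x d map of per-category likelihood values."""
    return [[[sum(w * f for w, f in zip(w_row, feat)) + b
              for w_row, b in zip(weight, bias)]
             for feat in row]
            for row in feature_map]

# Toy 1 x 2 feature map with C = 2 channels, converted to d = 3 categories.
feature_map = [[[1.0, 2.0], [0.5, -1.0]]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
likelihood_map = conv1x1(feature_map, weight, bias)
```

Because the map is applied independently at each position, the output keeps the spatial layout of the feature map, which is what allows the result to be read as a likelihood map.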

The object likelihood map 10207 contains object likelihoods 10208 and 10209. Each object likelihood corresponds to a pixel or partial region of the image 10201. The positional relationship between object likelihoods on the object likelihood map 10207 corresponds to the positional relationship on the image 10201. For example, the object likelihood 10208 represents the object likelihood for a partial region located further to the left on the image 10201 than the partial region for the object likelihood 10209.

FIG. 3(C) shows the likelihood map for objects of category A, taken from the object likelihood map obtained by inputting the image 10001 shown in FIG. 3(B) to the second processing unit 102. FIG. 3(D) shows the likelihood map for objects of category B, taken from the same object likelihood map. The values of each likelihood map are high near the regions of the image 10001 where an object of the corresponding category is present.
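Per-category likelihood maps such as those for categories A and B can be turned into a segmentation by assigning each position the category with the highest likelihood. The description does not state the decision rule, so the arg-max rule below is an assumption, and the function name `label_map` and the map values are purely illustrative.

```python
def label_map(likelihood_maps):
    """Assign to each position the category with the highest likelihood.
    likelihood_maps: dict mapping category name -> H x W list of values."""
    categories = list(likelihood_maps)
    first = likelihood_maps[categories[0]]
    return [[max(categories, key=lambda c: likelihood_maps[c][i][j])
             for j in range(len(first[0]))]
            for i in range(len(first))]

# Toy 2 x 3 likelihood maps for two categories.
maps = {
    "A": [[0.9, 0.8, 0.2], [0.7, 0.3, 0.1]],
    "B": [[0.1, 0.1, 0.7], [0.2, 0.6, 0.9]],
}
labels = label_map(maps)
```

In this toy case the left part of the map is labelled A and the right part B, mirroring how each likelihood map is high near the region where its category appears.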

In each of the layer processes 10203 to 10205 of the fully convolutional network, features that are adjacent along the coordinate axes are used in common. With the Vision Transformer, on the other hand, the Self-Attention processing and multilayer perceptron processing use all features along the coordinate axes. As a result, the computation for a given input image and the computation for an image obtained by translating that input image by one pixel are entirely different, and the intermediate features generated during the computation cannot be shared. Processing with a fully convolutional network, which proceeds while sharing the intermediate features generated during the computation, is therefore more computationally efficient than processing with the Vision Transformer.

The learning unit 103 performs learning processing of the fully convolutional network used by the second processing unit 102, using as teacher data the information obtained in the course of the first processing unit 101 calculating the likelihood for array data for learning. In this specification, learning of the fully convolutional network used by the second processing unit 102 may be referred to as learning of the second processing unit 102. For example, the learning unit 103 can convey to the second processing unit 102 the information that the first processing unit 101 uses when calculating the object likelihood from an input image. As described above, the first processing unit 101 performs processing using the Vision Transformer. The Vision Transformer, however, differs in structure from the fully convolutional network of the second processing unit 102. For this reason, the parameters used by the first processing unit 101 cannot simply be transferred to the second processing unit 102.

In this embodiment, the learning unit 103 trains the second processing unit 102 using teacher data. The teacher data is a set consisting of an image 11 and the object likelihood output by the first processing unit 101 when the image 11 is input to it. Through this learning, the second processing unit 102 is trained to have performance close to the object likelihood estimation performance of the first processing unit 101. The second processing unit 102 can then use this performance to carry out segmentation processing.

The learning unit 103 includes a likelihood acquisition unit 1031 and a likelihood learning unit 1032. The likelihood acquisition unit 1031 acquires the object likelihood output by the first processing unit 101 when the image 11 is input to it, and outputs the acquired object likelihood to the likelihood learning unit 1032.

The likelihood learning unit 1032 trains the second processing unit 102. Specifically, the likelihood learning unit 1032 acquires, as teacher data, the image 11 and the object likelihood for the image 11 output by the first processing unit 101, and trains the second processing unit 102 using this teacher data. The likelihood learning unit 1032 updates the parameters of the second processing unit 102 so as to reduce the difference between the object likelihood output by the first processing unit 101 and the object likelihood shown in the object likelihood map that the second processing unit 102 outputs when the image 11 is input to it. In other words, the likelihood learning unit 1032 can update the parameters of the fully convolutional network used by the second processing unit 102. The likelihood learning unit 1032 can use, for example, backpropagation for training, and the loss function referred to during training can be, for example, the L1 norm of the difference between the object likelihoods.

FIG. 2 is a flowchart showing the procedure of a learning processing method according to one embodiment, performed by the information processing device 1 according to one embodiment. This processing creates a segmentation network that performs segmentation processing on images. In this way, a neural network that performs classification processing and segmentation processing with high accuracy and at high speed can be generated.

In S1001, data acquisition processing is performed. Specifically, the likelihood acquisition unit 1031 acquires the image 11. In this embodiment, the likelihood acquisition unit 1031 acquires a plurality of images 11 for which the first processing unit 101 is to estimate the object likelihood. The images 11 acquired in this embodiment are color images. However, the likelihood acquisition unit 1031 may instead acquire grayscale images or range images as the images 11.

In S1002 and S1003, the learning unit 103 performs information transfer processing that conveys the information of the first processing unit 101 to the second processing unit 102. In S1002, likelihood acquisition processing by the first processing unit 101 is performed. Specifically, the likelihood acquisition unit 1031 inputs the image 11 to the first processing unit 101. The first processing unit 101 then calculates the object likelihood for the image 11, and the likelihood acquisition unit 1031 acquires the calculated object likelihood from the first processing unit 101. As described above, the object likelihood indicates, for each of one or more categories, the estimated probability that an object appearing in the input image belongs to that category.

The likelihood acquisition processing by the first processing unit 101 is performed as described with reference to FIG. 4(A). In this example, the image 10101 in FIG. 4(A) is the image 11 input to the first processing unit 101. Letting d be the number of categories, the object likelihood u for the image 11, which the first processing unit 101 outputs and which is shown as the object likelihood 10108 in FIG. 4(A), can be expressed as

$$u = (u_1, u_2, \ldots, u_d)$$

where $u_k$ is the likelihood for the k-th category. The likelihood acquisition unit 1031 also outputs the acquired image 11 and the object likelihood to the likelihood learning unit 1032.

In S1003, likelihood learning processing is performed by the likelihood learning unit 1032. For example, the likelihood learning unit 1032 uses the acquired image 11 and object likelihood as teacher data to train the fully convolutional network used by the second processing unit 102, as described above.

FIG. 4(B) is a diagram illustrating the likelihood learning processing in this embodiment. The processes 10203 to 10206 of the second processing unit 102 shown in FIG. 4(B) are the same as those in FIG. 3(A). In this example, however, the image 11 that was input to the first processing unit 101 is also input to the second processing unit 102. The image 10101 in FIG. 4(B) is the image 11 input to the second processing unit 102. As already explained, when an image of the same size as the input image to the first processing unit 101 is input, the second processing unit 102 outputs an object likelihood map 10217 consisting of a single object likelihood 10218. The object likelihood v, shown as the object likelihood 10218 in FIG. 4(B), is contained in the object likelihood map 10217 and is a vector 10219 indicating the likelihood of each category for the object appearing in the image 11. The object likelihood v has the same number of dimensions as the object likelihood u.

In this case, the learning unit 103 can train the second processing unit 102 based on the difference between the likelihood calculated by the first processing unit 101 for third array data for learning (for example, the image 11) and the likelihood calculated by the second processing unit 102 for the same third array data. Specifically, the learning unit 103 can train the second processing unit 102 so that this difference becomes small. As the loss function used to evaluate this difference, for example,

$$\mathrm{Loss} = \lVert u - v \rVert_1$$

can be used. The loss function is not limited to this; for example, the L2 distance may be used instead. The second processing unit 102 can then be optimized based on such a loss function. For example, the parameters used by the second processing unit 102 can be optimized with the Adam method so that the value of the loss function decreases.
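The optimization described above can be sketched in a few lines of standard-library Python. This is only an illustrative toy under stated assumptions: the "student" is a single linear layer standing in for the fully convolutional network, the input `x` and teacher likelihood `u` are made-up values, and the names `student`, `l1_grad`, and `adam_step` are hypothetical. Gradients are written by hand here instead of being obtained by backpropagation through a real network.

```python
import math

def l1_loss(u, v):
    """L1 norm of the difference between teacher likelihood u and student likelihood v."""
    return sum(abs(a - b) for a, b in zip(u, v))

def student(W, x):
    """Toy stand-in for the second processing unit: one linear layer, v = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l1_grad(W, x, u):
    """Subgradient of ||u - W x||_1 with respect to W."""
    v = student(W, x)
    sign = [(vk > uk) - (vk < uk) for vk, uk in zip(v, u)]
    return [[s * xi for xi in x] for s in sign]

def adam_step(W, grad, m, s, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One in-place Adam update of W; m and s hold the first and second moment estimates."""
    for k, row in enumerate(grad):
        for j, g in enumerate(row):
            m[k][j] = b1 * m[k][j] + (1 - b1) * g
            s[k][j] = b2 * s[k][j] + (1 - b2) * g * g
            m_hat = m[k][j] / (1 - b1 ** t)   # bias-corrected first moment
            s_hat = s[k][j] / (1 - b2 ** t)   # bias-corrected second moment
            W[k][j] -= lr * m_hat / (math.sqrt(s_hat) + eps)

x = [1.0, 0.5]                    # toy input standing in for image 11 (assumption)
u = [0.7, 0.2, 0.1]               # teacher likelihood (made-up values)
W = [[0.0, 0.0] for _ in u]       # student parameters
m = [[0.0, 0.0] for _ in u]
s = [[0.0, 0.0] for _ in u]

initial_loss = l1_loss(u, student(W, x))
for t in range(1, 201):
    adam_step(W, l1_grad(W, x, u), m, s, t)
final_loss = l1_loss(u, student(W, x))
```

Because the L1 loss has a constant-magnitude subgradient, the student output oscillates slightly around the teacher value at the end of training; a decaying learning rate would be one way to damp this.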

The second processing unit 102 can be trained by repeating the optimization of S1001 to S1003 for each of a sufficient number of images 11. After training, when an image 21 is input to the second processing unit 102, the second processing unit 102 outputs a likelihood map corresponding to the image 21.

As shown in FIG. 4(B), when an image of the same size as the input image to the first processing unit 101 is input, the second processing unit 102 trained in this way outputs an object likelihood similar to that of the first processing unit 101. On the other hand, as shown in FIG. 3(A), when an image larger than the input image to the first processing unit 101 is input, the second processing unit 102 generates a likelihood map whose size depends on the size of the input image. This likelihood map indicates object likelihoods, each being the likelihood that a partial image contained in the input image is included in a specific class. The category of an object contained in the image 21 can therefore be estimated from this likelihood map. The second processing unit 102 trained as described above can perform this category estimation with an accuracy close to that of the first processing unit 101.
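The relationship between input size and likelihood map size can be illustrated with a small calculation. The sketch below assumes a network with no padding whose receptive field equals the input size of the first processing unit 101; the concrete values 224 and 32, and the function name `likelihood_map_size`, are hypothetical and do not come from this description.

```python
def likelihood_map_size(height, width, receptive_field, stride):
    """Return (rows, cols) of the likelihood map for an unpadded network.

    receptive_field: input size that yields exactly one likelihood (assumed
                     equal to the input size of the first processing unit).
    stride:          product of the strides of all layers (hypothetical value).
    """
    if height < receptive_field or width < receptive_field:
        raise ValueError("input is smaller than the receptive field")
    rows = (height - receptive_field) // stride + 1
    cols = (width - receptive_field) // stride + 1
    return rows, cols

# Same size as the teacher input: a map containing a single likelihood.
same_size = likelihood_map_size(224, 224, receptive_field=224, stride=32)
# A larger input: one likelihood per partial image.
larger = likelihood_map_size(256, 256, receptive_field=224, stride=32)
```

With these assumed values, a 256 x 256 input yields a 2 x 2 map, that is, four likelihoods for four overlapping partial images.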

Furthermore, the image can be segmented by category of the objects appearing in it, according to the object likelihoods shown in the likelihood map output by the second processing unit 102. The second processing unit 102 trained in this way can therefore be used to segment images. In one embodiment, the second processing unit 102 segments the array data based on the likelihood map.

As described above, the information processing device according to this embodiment can train the second processing unit 102 so that it calculates object likelihoods efficiently and with an accuracy close to that of the pre-trained first processing unit 101. In this way, the information processing device according to this embodiment can produce a fully convolutional network whose parameters have been learned. The second processing unit 102 of the information processing device can then perform classification processing or segmentation processing of array data using this fully convolutional network. The information processing device according to this embodiment may also output the parameters of the fully convolutional network obtained by training. In that case, another information processing device can perform classification processing or segmentation processing of array data using the parameters output from the information processing device according to this embodiment. With either approach, classification processing and segmentation processing using a fully convolutional network can be performed with high accuracy and at high speed.

In the embodiment described above, the learning unit 103 used the object likelihood output by the first processing unit 101 as teacher data for training the second processing unit 102. However, the learning unit 103 can also train the second processing unit 102 based on other information obtained while the first processing unit 101 calculates the likelihood that the learning array data is included in a class.

For example, the learning unit 103 can use, as teacher data for training the second processing unit 102, an intermediate feature obtained partway through the processing of the image 11 by the first processing unit 101. In one embodiment, two features are used: the feature of the learning array data obtained while the first processing unit 101 calculates the likelihood for that data, and the feature of the learning array data calculated by the second processing unit 102. The learning unit 103 trains the fully convolutional network used by the second processing unit 102 so that the difference between these features becomes small.

For example, the first processing unit 101 can output the feature produced by the L-th encoder block 10106 in the example of FIG. 4(A). This feature is called an intermediate feature and is, for example, a p-dimensional vector. The learning unit 103 can acquire such an intermediate feature from the first processing unit 101 and use it as teacher data.

Likewise, the second processing unit 102 can output the feature map obtained by the M-th layer processing 10205 in the example of FIG. 4(B). This feature map is called an intermediate feature map. In this example, the intermediate feature map has dimensions 1×1×p; that is, it represents a feature expressed as a p-dimensional vector. The learning unit 103 can acquire such an intermediate feature map from the second processing unit 102.

In this case, the learning unit 103 can update the parameters of the second processing unit 102 so as to reduce the difference between the intermediate feature output by the first processing unit 101 for the image 11 and the intermediate feature map output by the second processing unit 102 when the image 11 is input to it. The parameters can be updated in the same way as in the embodiment described above.

The calculation of object likelihoods and the segmentation of images by the second processing unit 102 trained according to this embodiment will be described with reference to FIG. 5. In this case, the second processing unit 102 can use the fully convolutional network to calculate the feature of array data belonging to a specific class. The second processing unit 102 can then calculate the likelihood that array data to be processed is included in the specific class, based on the correlation between the feature of the array data of the specific class and the feature of the array data to be processed, the latter also calculated using the fully convolutional network. In the following example, the image to be processed is segmented based on the similarity between the intermediate feature that the second processing unit 102 outputs for an image of a specific category and the intermediate feature that it outputs for the image to be processed.

First, an image 10231 of an object of a certain category is input to the trained second processing unit 102. In this example, the size of the image 10231 is the same as the size of the image input to the first processing unit 101. The second processing unit 102 then outputs the intermediate feature map 10237 obtained by the M-th layer processing 10205. This intermediate feature map 10237 contains a feature 10239 that serves as the reference for similarity. In the example of FIG. 5, the feature 10239 is a p-dimensional vector.

Next, an image 10241 to be segmented is input to the trained second processing unit 102. The image 10241 may be larger than the image 10231. The second processing unit 102 then outputs the intermediate feature map 10247 obtained by the M-th layer processing 10205. In the example of FIG. 5, this intermediate feature map contains a plurality of p-dimensional vectors, each corresponding to a pixel or partial region of the image 10241.

The second processing unit 102 then obtains a similarity map 10249 by taking the inner product of the intermediate feature map 10247 and the feature 10239. The similarity map 10249 indicates the object likelihood in each partial region of the image 10241. Here, the object likelihood is the estimated probability that the object appearing in a partial region of the image 10241 belongs to the category of the object appearing in the image 10231. The second processing unit 102 can further segment the image 10241 based on the object likelihoods indicated by the similarity map 10249. In this way, the image 10241 can be segmented so that the regions containing the object shown in the image 10231 are distinguished.
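The inner-product step can be sketched as follows in plain Python. The feature values are made up, and the thresholding used to turn the similarity map into a mask is an assumption for illustration; this description does not specify how the segmentation is derived from the map. The names `similarity_map` and `segment` are hypothetical.

```python
def similarity_map(feature_map, reference):
    """Inner product of every p-dimensional feature vector with the reference feature."""
    return [
        [sum(f * r for f, r in zip(vec, reference)) for vec in row]
        for row in feature_map
    ]

def segment(sim_map, threshold):
    """Binary mask: 1 where the similarity reaches the threshold (assumed rule)."""
    return [[1 if s >= threshold else 0 for s in row] for row in sim_map]

# Reference feature (stand-in for feature 10239), p = 2.
reference = [1.0, 0.0]
# 2 x 2 intermediate feature map (stand-in for map 10247), one vector per region.
feature_map = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.3]],
]

sim = similarity_map(feature_map, reference)
mask = segment(sim, threshold=0.5)
```

Normalizing the vectors before the inner product (cosine similarity) would be one alternative if feature magnitudes vary widely; that choice is likewise an assumption, not something stated here.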

In this embodiment, the second processing unit 102 is trained to generate an intermediate feature map. With this configuration, instead of estimating object likelihoods for predetermined categories, an image can be segmented based on its similarity to arbitrary image features.

In the embodiments described above, each object likelihood output by the first processing unit 101 was used individually as teacher data for training the second processing unit 102. Alternatively, an object likelihood map that combines a plurality of object likelihoods output by the first processing unit 101 may be used as teacher data for training the second processing unit 102. Such an embodiment will be described with reference to FIGS. 6(A) and 6(B).

In this case, the first processing unit 101 calculates the likelihood that each of a plurality of array data is included in a specific class. Each of the plurality of array data input to the first processing unit 101 has a predetermined size, and each is a part of third array data for learning that is larger than this predetermined size. For example, the first processing unit 101 can calculate the object likelihood for each of a plurality of images (for example, four images). The likelihood acquisition unit 1031 can acquire these object likelihoods output by the first processing unit 101.

For example, in FIG. 6(A), the image 20301, which is the third array data for learning, is larger than the image size that can be input to the first processing unit 101. The four images 20302, which are the plurality of array data, are images sampled from the image 20301 and are smaller than the image 20301. The first processing unit 101 performs the object likelihood calculation four times and outputs the object likelihoods 20311 to 20314, one for each of the four images. In the example of FIG. 6(A), the image 11 is shown as the image 20301.

The number of images that the likelihood acquisition unit 1031 generates by sampling is the same as the number of object likelihoods that the second processing unit 102 outputs when the image 20301 is input to it. The likelihood acquisition unit 1031 samples the images so that each has a size suitable for input to the first processing unit 101. During sampling, the sampling positions are shifted so as to match the positional relationship of the object likelihoods output by the second processing unit 102, as shown by the four images 20302.

The likelihood acquisition unit 1031 further generates a likelihood map containing the likelihoods calculated by the first processing unit 101. For example, the likelihood acquisition unit 1031 converts the object likelihoods 20311 to 20314 into an object likelihood map 20304. The likelihood map holds, at the sampling position of each sampled image, the likelihood that the first processing unit 101 calculated for that sampled image. For example, in FIG. 6(A), the arrangement of the object likelihoods on the object likelihood map 20304 is consistent with the positional relationship of the images sampled by the likelihood acquisition unit 1031. Specifically, the object likelihood 20311, which corresponds to the upper-left sampled image, is placed at the upper left of the object likelihood map 20304. To this end, the likelihood acquisition unit 1031 determines indices that indicate the positional relationship in image coordinates. In this example, the object likelihood 20311 is denoted $u_{0,0}$, the object likelihood 20312 is denoted $u_{1,0}$, the object likelihood 20313 is denoted $u_{0,1}$, and the object likelihood 20314 is denoted $u_{1,1}$. The likelihood acquisition unit 1031 then outputs the generated object likelihood map 20304 to the likelihood learning unit 1032.
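The sampling and map assembly can be sketched as below, using only the Python standard library. The `teacher` callable is a hypothetical stand-in for the first processing unit 101 (here it simply returns the mean pixel value as a one-element likelihood), and the crop size and stride are made-up values chosen so that a 3 x 3 image yields a 2 x 2 map. The map is indexed as `[row][column]`; no assumption is made about which subscript of $u_{i,j}$ is horizontal.

```python
def crop(image, top, left, size):
    """Cut a size x size patch whose upper-left corner is at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def teacher_likelihood_map(image, teacher, size, stride):
    """Run the teacher on shifted crops and arrange its outputs by sampling position."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [teacher(crop(image, i * stride, j * stride, size)) for j in range(cols)]
        for i in range(rows)
    ]

def toy_teacher(patch):
    """Hypothetical teacher: a one-category 'likelihood' equal to the mean pixel value."""
    count = len(patch) * len(patch[0])
    return [sum(sum(row) for row in patch) / count]

image = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
]
u_map = teacher_likelihood_map(image, toy_teacher, size=2, stride=1)
```

The stride of the crops should match the total stride of the fully convolutional network, so that each teacher likelihood lines up with the student likelihood at the same map position.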

The likelihood learning unit 1032 trains the second processing unit 102. The likelihood learning unit 1032 acquires the image 20301 and the corresponding object likelihood map 20304 from the likelihood acquisition unit 1031, and trains the second processing unit 102 using the image 20301 and the object likelihood map 20304 as teacher data. Specifically, the likelihood learning unit 1032 can train the second processing unit 102 based on the difference between the likelihood map containing the likelihoods calculated by the first processing unit 101 and the likelihood map calculated by the second processing unit 102 for the third array data for learning. More concretely, the likelihood learning unit 1032 can update the parameters of the second processing unit 102 so as to reduce the difference between the object likelihood map 20304 and the object likelihood map 20306 that the second processing unit 102 outputs when the image 20301 is input to it. As the loss function used to evaluate the difference, the sum of the L1 norms of the differences between corresponding object likelihoods can be used, for example. The loss function can be expressed by the following formula.

$$\mathrm{Loss} = \sum_{i,j} \lVert u_{i,j} - v_{i,j} \rVert_1$$

Here, $v_{i,j}$ is an object likelihood shown in the object likelihood map 20306, and $i,j$ denotes the position on the map in the same way as for $u_{i,j}$.
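As a worked instance of this map-level loss, the following sketch sums the L1 norms over all map positions. The two 2 x 2 maps with two categories per position are made-up values, and `map_l1_loss` is a hypothetical helper name.

```python
def map_l1_loss(u_map, v_map):
    """Sum over map positions of the L1 norm of (u_ij - v_ij)."""
    if len(u_map) != len(v_map) or any(
        len(u_row) != len(v_row) for u_row, v_row in zip(u_map, v_map)
    ):
        raise ValueError("teacher and student maps must have the same shape")
    return sum(
        abs(a - b)
        for u_row, v_row in zip(u_map, v_map)
        for u, v in zip(u_row, v_row)
        for a, b in zip(u, v)
    )

# Teacher map (stand-in for map 20304) and student map (stand-in for map 20306).
u_map = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.5, 0.5], [1.0, 0.0]]]
v_map = [[[0.5, 0.5], [0.0, 1.0]],
         [[0.5, 0.5], [0.0, 1.0]]]

loss = map_l1_loss(u_map, v_map)  # (0.5 + 0.5) + 0 + 0 + (1 + 1)
```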

According to this embodiment, a set of multiple object likelihoods is used as teacher data, which improves the efficiency of training.

(Checking the network structure)
According to the embodiments described above, the neural network used by the second processing unit 102 can be trained to perform classification processing and segmentation processing with high accuracy and at high speed. In addition, it is possible to check whether the neural network used by the second processing unit 102 has a structure that allows sufficiently accurate and fast classification processing and segmentation processing.

FIG. 7 is a block diagram showing the configuration of an information processing device 3 according to one embodiment. The information processing device 3 can check the structure of the neural network used by the second processing unit 102. The configurations of the first processing unit 101, the second processing unit 102, and the learning unit 103 are the same as in FIG. 1, and only the differences are described below.

In this embodiment, the second processing unit 102 can accept a network structure 12 as input. The network structure 12 is information indicating the structure of the neural network used by the second processing unit 102. The network structure 12 can indicate, for example, the number of layer processes, the types of layer processes, the order in which the layer processes are applied, and the number of categories after likelihood conversion. The second processing unit 102 can generate an object likelihood map using a fully convolutional network whose structure follows the network structure 12. The second processing unit 102 can also output the received network structure 12 to the confirmation unit 304.

The confirmation unit 304 checks whether the network structure 12 describes a structure that allows classification processing and segmentation processing to be performed with sufficiently high accuracy and speed. This check can be performed, for example, before the processing of FIG. 2 is started. The confirmation unit 304 can check, for example, whether the fully convolutional network used by the second processing unit 102 includes padding processing, or any layer process whose content changes when the size of the array data changes.

As already explained, if the processing performed by the second processing unit 102 includes a layer process whose content changes when the input image size changes, training is less likely to improve the accuracy of the segmentation processing. When the neural network includes such a layer process, modifying the structure of the neural network makes it possible to perform classification processing and segmentation processing with higher accuracy and speed.

The confirmation unit 304 can therefore check whether each layer process includes a specific process whose content changes with the input image size. Padding processing is one example of such a process. In padding processing, special handling is applied only to the peripheral part of the image. Consequently, the smaller the image, the larger the share of operations that use values added by the padding. In other words, when padding processing is performed, the content of the processing by the second processing unit 102 changes with the input image size.

The confirmation unit 304 can also modify the configuration of the fully convolutional network so that it performs neither padding processing nor any layer process whose content changes when the size of the array data changes. For example, if the network structure 12 indicates that a layer process includes padding processing, the confirmation unit 304 can modify the network structure 12 so that the padding processing is not performed. In this way, the confirmation unit 304 can automatically modify the network structure 12 and output the modified network structure 12 to the second processing unit 102.
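A structural check of this kind might look like the sketch below. The representation of the network structure 12 as a list of dictionaries with `type`, `kernel`, `stride`, and `padding` keys is purely hypothetical, as are the function names; this description does not define a concrete format.

```python
def find_padding_layers(structure):
    """Return the indices of layer processes that apply padding."""
    return [i for i, layer in enumerate(structure) if layer.get("padding", 0) != 0]

def remove_padding(structure):
    """Return a copy of the structure in which every layer has padding disabled."""
    return [
        dict(layer, padding=0) if layer.get("padding", 0) != 0 else dict(layer)
        for layer in structure
    ]

# Hypothetical encoding of network structure 12.
network_structure = [
    {"type": "conv", "kernel": 3, "stride": 1, "padding": 1},
    {"type": "relu"},
    {"type": "pool", "kernel": 2, "stride": 2, "padding": 0},
    {"type": "conv", "kernel": 3, "stride": 1, "padding": 1},
]

padded = find_padding_layers(network_structure)   # layers that need fixing
fixed = remove_padding(network_structure)         # modified structure
```

Note that disabling padding shrinks the feature map at each affected layer, so the receptive field and output map size of the modified network should be rechecked against the input size of the first processing unit 101.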

Alternatively, the confirmation unit 304 can cause the second processing unit 102 to process test data. In this way, the confirmation unit 304 can confirm whether the fully convolutional network used by the second processing unit 102 includes padding processing or hierarchical processing whose content changes with the size of the array data. Such an example is described with reference to FIGS. 8(A) to 8(C).

FIG. 8(A) shows an example in which the processing performed by the second processing unit 102a does not include hierarchical processing whose content changes as the input image size changes. In the example of FIG. 8(A), test data 30201 is input to the second processing unit 102a, which performs processing according to the network structure 12. The test data 30201 is a sufficiently large image in which a single pixel has a different pixel value and all other pixels have the same pixel value. If no hierarchical processing whose content changes with the input image size is performed, the estimated object likelihoods match across the regions of the test data 30201 in which the pixel values are equal. Therefore, the values are equal in most regions of the object likelihood map 30207 output for the test data 30201. On the other hand, in the vicinity of the pixel that has a different value in the test data 30201, the calculation uses features extracted from this pixel, so the object likelihood values there differ from those in the other regions, as shown in FIG. 8(A).

In contrast, FIG. 8(B) shows an example in which the processing performed by the second processing unit 102b includes hierarchical processing whose content changes as the input image size changes. In the example of FIG. 8(B) as well, the same test data 30201 as in FIG. 8(A) is input to the second processing unit 102b, which performs processing according to the network structure 12. The second processing unit 102b then outputs an object likelihood map 30217 for the test data 30201. For example, if the processing performed by the second processing unit 102b includes padding processing, the values in the peripheral portion of the object likelihood map differ from the values in the inner portion, regardless of the content of the test data. FIG. 8(C) shows the object likelihood map 30217 in detail. The object likelihood map 30217 includes a portion 30218 that is affected by the pixel having a different pixel value in the test data 30201. The object likelihood map 30217 also includes a portion 30219, outside the portion 30218, that is not affected by that pixel. Furthermore, the object likelihood map 30217 has a portion, outside the portion 30219, whose values differ from those of the portion 30219 due to padding processing or the like.

In this way, the confirmation unit 304 can determine whether the object likelihood map output by the second processing unit 102 in response to the test data has, in the peripheral portion of the map, a region whose values differ from those in the interior. If no such region exists in the peripheral portion of the map, the confirmation unit 304 can determine that the processing performed by the second processing unit 102 does not include hierarchical processing whose content changes as the input image size changes. In this case, the second processing unit 102 can determine that the network structure 12 represents a structure capable of high-accuracy, high-speed classification processing and segmentation processing.
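The test-data check can be illustrated with a toy experiment. The sketch below is not the network of the embodiment: `toy_network` is a hypothetical stand-in made of two 3x3 averaging layers, zero padding is assumed as the padding method, and the image size, margin, and tolerance are arbitrary choices. It builds a constant image with one differing pixel and checks whether the border of the output differs from an interior reference value.

```python
def conv2d(image, kernel, pad):
    """Naive 2-D convolution (correlation); pad > 0 adds zero padding on every side."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for y in range(h):
        for x in range(w):
            padded[y + pad][x + pad] = image[y][x]
    out_h = h + 2 * pad - k + 1
    out_w = w + 2 * pad - k + 1
    return [
        [
            sum(kernel[i][j] * padded[y + i][x + j] for i in range(k) for j in range(k))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def toy_network(image, pad):
    """Hypothetical stand-in for the second processing unit: two 3x3 averaging layers."""
    kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
    return conv2d(conv2d(image, kernel, pad), kernel, pad)


def make_test_image(size, base=1.0, odd=5.0):
    """Constant image in which only the centre pixel has a different value."""
    image = [[base] * size for _ in range(size)]
    image[size // 2][size // 2] = odd
    return image


def border_differs(likelihood_map, margin, tol=1e-9):
    """True if any value within `margin` cells of the edge differs from an interior reference."""
    h, w = len(likelihood_map), len(likelihood_map[0])
    reference = likelihood_map[margin][margin]
    for y in range(h):
        for x in range(w):
            on_border = y < margin or y >= h - margin or x < margin or x >= w - margin
            if on_border and abs(likelihood_map[y][x] - reference) > tol:
                return True
    return False


test_image = make_test_image(15)
with_padding = border_differs(toy_network(test_image, pad=1), margin=2)
without_padding = border_differs(toy_network(test_image, pad=0), margin=2)
```

The test image must be large enough that the region influenced by the odd pixel does not reach the border band or the reference cell; otherwise the check would confuse the pixel's influence with a padding effect.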

As another method, the confirmation unit 304 can detect, in the object likelihood map 30217 output by the second processing unit 102 in response to the test data, the portion 30218 that is affected by the pixel having a different pixel value. Next, the confirmation unit 304 can determine whether the part of the object likelihood map 30217 other than the portion 30218 has uniform values. For example, the confirmation unit 304 can determine that this part has uniform values when the proportion of values within a predetermined range of the mode exceeds a predetermined proportion. When this part has uniform values, the confirmation unit 304 can determine that the processing performed by the second processing unit 102 does not include hierarchical processing whose content changes as the input image size changes. The confirmation unit 304 may perform such processing for each of a plurality of test data items of different sizes.
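A minimal sketch of this uniformity test follows. The tolerance around the mode, the required proportion, and the rounding used to compute a mode of floating-point values are assumptions chosen for illustration; the source only states that "predetermined" values are used. The set `affected` plays the role of the portion 30218 and is assumed to have been detected beforehand.

```python
from collections import Counter


def is_uniform(likelihood_map, affected, tolerance=0.01, min_ratio=0.99, digits=3):
    """Check whether the map is uniform outside the cells listed in `affected`.

    The mode is taken over values rounded to `digits` decimals (an assumption,
    since exact floating-point values rarely repeat in practice).
    """
    values = [
        v
        for y, row in enumerate(likelihood_map)
        for x, v in enumerate(row)
        if (y, x) not in affected
    ]
    mode = Counter(round(v, digits) for v in values).most_common(1)[0][0]
    near_mode = sum(1 for v in values if abs(v - mode) <= tolerance)
    return near_mode / len(values) > min_ratio


# Cells influenced by the odd pixel of the test data (hypothetical 3x3 block).
affected = {(y, x) for y in range(4, 7) for x in range(4, 7)}

# Map that is uniform apart from the affected block.
uniform_map = [[0.2] * 10 for _ in range(10)]
for y, x in affected:
    uniform_map[y][x] = 0.9

# Map whose outermost ring deviates, as padding might cause.
padded_map = [row[:] for row in uniform_map]
for i in range(10):
    padded_map[0][i] = padded_map[9][i] = 0.5
    padded_map[i][0] = padded_map[i][9] = 0.5

uniform_result = is_uniform(uniform_map, affected)
padded_result = is_uniform(padded_map, affected)
```

Because the test looks at the whole map outside the affected portion rather than only at the border, it can also flag size-dependent processing other than padding.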

By using test data in this way, it is also possible to determine whether the processing performed by the second processing unit 102 includes hierarchical processing, other than padding processing, whose content changes when the input image size changes. The confirmation unit 304 may further output the result of confirming the network structure 12 via the display device 13 or the like. In this case, the user can learn the result of the confirmation by the confirmation unit 304.

With such a method, it is possible to confirm, before training the second processing unit 102, whether the second processing unit 102 has a structure that can perform classification processing and segmentation processing with high accuracy and at high speed. This makes it possible to more reliably generate a neural network that can perform classification processing and segmentation processing with high accuracy and at high speed.

(Additional learning of the first processing unit)
Teacher data used for additional learning of the first processing unit 101 may be created by using the object likelihood map output by the second processing unit 102 after learning.

As described above, when the image 21 is input to the second processing unit 102 after learning, the second processing unit 102 outputs an object likelihood map corresponding to the image 21. Here, the learning unit 103 can acquire the object likelihood map output from the second processing unit 102. The learning unit 103 can then create, from the acquired object likelihood map and the image 21, teacher data to be used for training the first processing unit 101. Further, the learning unit 103 can perform additional learning of the first processing unit 101 using this teacher data.

The processing performed by the learning unit 103 for additional learning of the first processing unit 101 is described below with reference to FIG. 9. This processing can be performed after S1003. First, the second processing unit 102 inputs array data for additional learning into the fully convolutional network that has been trained by the learning unit 103. In this way, the second processing unit 102 generates a likelihood map that indicates the likelihood that each piece of partial array data included in the array data for additional learning is included in a specific class. For example, an image 40001 is input to the second processing unit 102. This image 40001 corresponds to the image 21 in FIG. 1 and is larger than the image input to the first processing unit 101. The second processing unit 102 then outputs an object likelihood map 40007 corresponding to the image 40001. Likelihood maps 40008 and 40009 are likelihood maps for objects of specific categories, extracted from the object likelihood map 40007. Specifically, the likelihood map 40008 is a likelihood map for objects of category A, and the likelihood map 40009 is a likelihood map for objects of category B.

The learning unit 103 then extracts partial array data for additional learning from the array data for additional learning based on the likelihood map, and performs additional learning of the first processing unit 101 using the extracted partial array data. The processing performed by the learning unit 103 is shown as block 40410. The learning unit 103 uses the image 40001 and the object likelihood map 40007 to generate teacher data that can be used for training the first processing unit 101. The learning unit 103 detects a region having a high likelihood in the object likelihood map. For example, the learning unit 103 can identify a region having a high likelihood by binarizing the object likelihood map 40007 using a predetermined threshold (for example, 0.3). Then, when the area of this region is larger than a threshold, the learning unit 103 can cut out the portion corresponding to the identified region from the image 40001. The data cut out in this way is used as teacher data.
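The binarize-and-crop step can be sketched as follows for a single-category likelihood map such as 40008. Several simplifications are assumptions made here, not details from the source: all above-threshold cells are treated as one region and cropped by their bounding box, the area threshold `min_area` is an arbitrary value, and `scale` is a hypothetical factor converting map coordinates to image coordinates. Only the example threshold of 0.3 comes from the text.

```python
def extract_teacher_patch(image, likelihood_map, category,
                          threshold=0.3, min_area=4, scale=1):
    """Return (patch, category) cut from `image`, or None if the region is too small.

    The likelihood map is binarized with `threshold`; the bounding box of the
    above-threshold cells is mapped to image coordinates with `scale`.
    """
    cells = [
        (y, x)
        for y, row in enumerate(likelihood_map)
        for x, v in enumerate(row)
        if v > threshold
    ]
    if len(cells) <= min_area:  # region area must exceed the area threshold
        return None
    ys = [y for y, _ in cells]
    xs = [x for _, x in cells]
    top, bottom = min(ys) * scale, (max(ys) + 1) * scale
    left, right = min(xs) * scale, (max(xs) + 1) * scale
    patch = [row[left:right] for row in image[top:bottom]]
    return patch, category


# Toy 6x6 "image" whose pixel value encodes its position (row * 10 + column).
image = [[y * 10 + x for x in range(6)] for y in range(6)]

# Likelihood map for category A with a 2x3 high-likelihood region.
map_a = [[0.0] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (2, 3, 4):
        map_a[y][x] = 0.8

# Likelihood map for category B whose high-likelihood region is too small.
map_b = [[0.0] * 6 for _ in range(6)]
map_b[4][4] = map_b[4][5] = 0.9

teacher_a = extract_teacher_patch(image, map_a, "A")
teacher_b = extract_teacher_patch(image, map_b, "B")
```

A fuller version would label connected components separately so that several objects of the same category yield separate patches.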

In the example of FIG. 9, the learning unit 103 detects a region with a high likelihood in the likelihood map 40008 and cuts out the corresponding region of the image 40001, indicated by the broken line. The combination of the image cut out in this way and the category information corresponding to the likelihood map 40008 (category A in this example) is used as teacher data 40413. Similarly, the learning unit 103 refers to the likelihood map 40009 and cuts out the region of the image 40001 indicated by the broken line. The combination of the image cut out in this way and the category information corresponding to the likelihood map 40009 (category B in this example) is used as teacher data 40414.

The learning unit 103 then trains the first processing unit 101 using the obtained teacher data. The specific method for training the first processing unit 101 is not particularly limited; for example, the method described above can be used.

With the above method, an object likelihood map can be generated for any image, and teacher data can then be generated based on this object likelihood map. By performing additional learning of the first processing unit 101 using the teacher data generated in this way, the first processing unit 101 becomes able to estimate the object likelihood with higher accuracy. Furthermore, by training the second processing unit 102 with the method already described, using the first processing unit 101 that has undergone this additional learning, the performance of the neural network that performs segmentation processing at high speed can be further improved.

(Other examples)
Although examples of embodiments of the present invention have been described so far, the present invention can be realized as, for example, a system, an apparatus, a method, a program, or a recording medium. For example, the present invention can be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, or a web application). Alternatively, the present invention may be applied to an apparatus consisting of a single device.

The information processing device according to each of the embodiments described above can be realized using a computer. For example, the functions of the processing units of each information processing device shown in FIG. 1 and elsewhere can be realized by a computer. Examples of the computer include a general-purpose personal computer and a server. However, at least some of the processing units may be realized by dedicated hardware. Each image processing device may also be composed of a plurality of information processing devices connected via, for example, a network. For example, the functions of each image processing device may be provided as a cloud service.

FIG. 10 is a diagram illustrating an example of the hardware configuration of an information processing device implemented using a computer, according to an embodiment. In FIG. 10, a processor 1010 is, for example, a CPU, and controls the operation of the entire computer. A memory 1020 is, for example, a RAM, and temporarily stores programs, data, and the like. A computer-readable storage medium 1030 is, for example, a hard disk or a CD-ROM, and stores programs, data, and the like over the long term. In this embodiment, the programs that implement the functions of the respective units, stored in the storage medium 1030, are read into the memory 1020. The functions of the respective units are then realized by the processor 1010 operating according to the programs in the memory 1020.

In FIG. 10, an input interface 1040 is an interface for acquiring information from an external device. An output interface 1050 is an interface for outputting information to an external device. A bus 1060 connects the units described above and enables them to exchange data.

The present invention can also be realized by processing in which a program that implements one or more functions of the embodiments described above is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.

The disclosure of this specification includes the following information processing device, method for producing a fully convolutional network with learned parameters, and program.

(Item 1)
An information processing device comprising:
a first processing means for calculating a likelihood that array data is included in a specific class;
a second processing means that performs processing different from that of the first processing means and that calculates, using a fully convolutional network, a likelihood that array data is included in a specific class; and
a learning means for performing learning processing of the fully convolutional network using, as teacher data, information obtained in the course of the first processing means calculating the likelihood for array data for learning.

(Item 2)
The information processing device according to item 1, wherein
array data of a predetermined size is input to the first processing means, and
array data of the predetermined size or larger is input to the second processing means.

(Item 3)
The information processing device according to item 2, wherein, when first array data of the predetermined size is input, the fully convolutional network outputs one vector indicating the likelihood that the first array data is included in the specific class.

(Item 4)
The information processing device according to item 2 or 3, wherein, when second array data larger than the predetermined size is input, the fully convolutional network outputs a likelihood map indicating the likelihood that each of a plurality of pieces of partial array data is included in the specific class, each of the plurality of pieces of partial array data being a part of the second array data.

(Item 5)
The information processing device according to any one of items 1 to 4, wherein the learning means performs the learning processing of the fully convolutional network so as to reduce the difference between the likelihood calculated by the first processing means for third array data for learning and the likelihood calculated by the second processing means for the third array data.

(Item 6)
The information processing device according to any one of items 2 to 4, wherein
the first processing means calculates the likelihood that each of a plurality of pieces of array data is included in a specific class,
each of the plurality of pieces of array data is a part of third array data for learning that is larger than the predetermined size, and
the learning means performs the learning processing of the fully convolutional network so as to reduce the difference between a likelihood map including the likelihoods calculated by the first processing means and a likelihood map calculated by the second processing means for the third array data.

(Item 7)
The information processing device according to item 6, wherein
the third array data is an image,
the plurality of pieces of array data are sampled images that are smaller than the image and are taken from different positions in the image, and
the likelihood map including the likelihoods calculated by the first processing means has a structure in which the likelihood calculated by the first processing means for each sampled image is placed at the sampling position of that sampled image.

(Item 8)
The information processing device according to item 4, wherein the second processing means performs segmentation of the array data based on the likelihood map.

(Item 9)
The information processing device according to item 8, wherein the array data is an image, the partial array data is a partial image included in the image, and the second processing means performs segmentation of the image based on the likelihood that the partial image is included in a specific class.

(Item 10)
The information processing device according to any one of items 1 to 9, wherein
the first processing means calculates, for each of a plurality of classes, the likelihood that array data is included in the class, and
the second processing means calculates, for each of a plurality of classes, the likelihood that array data is included in the class.

(Item 11)
The information processing device according to any one of items 1 to 10, wherein the first processing means calculates the likelihood using Self-Attention processing, a Transformer, or repetition of processing that uses all of the input features.

(Item 12)
The information processing device according to any one of items 1 to 4, wherein the learning means performs the learning processing of the fully convolutional network so as to reduce the difference between features of the array data for learning obtained in the course of the first processing means calculating the likelihood for the array data for learning and features of the array data for learning calculated by the second processing means using the fully convolutional network.

(Item 13)
The information processing device according to item 12, wherein the second processing means calculates the likelihood that array data to be processed is included in the specific class based on a correlation between features of array data of the specific class calculated using the fully convolutional network and features of the array data to be processed calculated using the fully convolutional network.

(Item 14)
The information processing device according to any one of items 1 to 13, wherein the fully convolutional network is configured not to perform padding processing or hierarchical processing whose content changes with the size of the array data.

(Item 15)
The information processing device according to any one of items 1 to 13, further comprising a confirmation means for confirming whether the fully convolutional network includes padding processing or hierarchical processing whose content changes with the size of the array data.

(Item 16)
The information processing device according to item 15, wherein the confirmation means confirms whether the fully convolutional network includes padding processing or hierarchical processing whose content changes with the size of the array data by causing the second processing means to process test data.

(Item 17)
The information processing device according to item 15 or 16, wherein the confirmation means modifies the configuration of the fully convolutional network so that the fully convolutional network does not perform padding processing or hierarchical processing whose content changes with the size of the array data.

(Item 18)
The information processing device according to any one of items 1 to 17, wherein
the second processing means inputs array data for additional learning into the fully convolutional network after learning by the learning means, thereby generating a likelihood map indicating the likelihood that each piece of partial array data included in the array data for additional learning is included in a specific class, and
the learning means extracts partial array data for additional learning from the array data for additional learning based on the likelihood map, and performs additional learning of the first processing means using the extracted partial array data.

(Item 19)
A method by which an information processing device produces a fully convolutional network with learned parameters, wherein
the information processing device comprises:
a first processing means for calculating a likelihood that array data is included in a specific class; and
a second processing means that performs processing different from that of the first processing means and that calculates, using a fully convolutional network, a likelihood that array data is included in a specific class, and
the method comprises:
a step of calculating, using the first processing means, the likelihood for array data for learning; and
a step of performing learning processing of the fully convolutional network using, as teacher data, information obtained in the course of the first processing means calculating the likelihood.

(Item 20)
A program for causing a computer to function as the information processing device according to any one of items 1 to 18.

The invention is not limited to the embodiments described above, and various changes and modifications can be made without departing from the spirit and scope of the invention. Accordingly, the claims are appended to make the scope of the invention public.

101: First processing unit, 102: Second processing unit, 103: Learning unit, 304: Confirmation unit

Claims

An information processing device comprising:
a first processing means for calculating a likelihood that array data is included in a specific class;
a second processing means that performs processing different from that of the first processing means and that calculates, using a fully convolutional network, a likelihood that array data is included in a specific class; and
a learning means for performing learning processing of the fully convolutional network using, as teacher data, information obtained in the course of the first processing means calculating the likelihood for array data for learning.

The information processing device according to claim 1, wherein
array data of a predetermined size is input to the first processing means, and
array data of the predetermined size or larger is input to the second processing means.

The information processing device according to claim 2, wherein, when first array data of the predetermined size is input, the fully convolutional network outputs one vector indicating the likelihood that the first array data is included in the specific class.

The information processing device according to claim 3, wherein, when second array data larger than the predetermined size is input, the fully convolutional network outputs a likelihood map indicating the likelihood that each of a plurality of pieces of partial array data is included in the specific class, each of the plurality of pieces of partial array data being a part of the second array data.

The information processing device according to claim 4, wherein the learning means performs learning processing of the fully convolutional network so that the difference between the likelihood calculated by the first processing means for third array data for learning and the likelihood calculated by the second processing means for the third array data decreases.
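
Learning that reduces the difference between the two likelihoods can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the student is a hypothetical one-parameter-pair model `sigmoid(w * x + b)` rather than a real fully convolutional network, the teacher likelihoods are made-up numbers, and the difference is measured as a mean squared difference minimized by plain gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_squared_difference(inputs, teacher, w, b):
    """Mean squared difference between student and teacher likelihoods."""
    return sum((sigmoid(w * x + b) - t) ** 2
               for x, t in zip(inputs, teacher)) / len(inputs)


def distill(inputs, teacher, steps=200, lr=0.5):
    """Fit the student parameters (w, b) to the teacher likelihoods."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(inputs, teacher):
            s = sigmoid(w * x + b)
            # d/dz of (s - t)^2 with s = sigmoid(z), averaged over samples
            common = 2.0 * (s - t) * s * (1.0 - s) / n
            grad_w += common * x
            grad_b += common
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical learning data and teacher likelihoods (assumptions).
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
teacher = [0.1, 0.2, 0.5, 0.8, 0.9]

loss_before = mean_squared_difference(inputs, teacher, 0.0, 0.0)
w, b = distill(inputs, teacher)
loss_after = mean_squared_difference(inputs, teacher, w, b)
```

Here the teacher likelihoods play the role of the output of the first processing means; `loss_after` is smaller than `loss_before` once the student has been fitted.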

The information processing device according to claim 4, wherein the first processing means calculates, for each of a plurality of array data, the likelihood that the array data is included in a specific class,
each of the plurality of array data is a part of third array data for learning that is larger than the predetermined size, and
the learning means performs learning processing of the fully convolutional network so as to reduce the difference between a likelihood map including the likelihoods calculated by the first processing means and a likelihood map calculated by the second processing means for the third array data.

The information processing device according to claim 6, wherein the third array data is an image,
the plurality of array data are sampling images, each smaller than the image, sampled from different positions of the image, and
the likelihood map including the likelihoods calculated by the first processing means has a structure in which the likelihood calculated by the first processing means for each sampling image is placed at the sampling position of that sampling image.
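
The construction of a teacher likelihood map from sampling images can be sketched as below. The sketch assumes stride-1 sampling at every position, a mean squared difference as the measure of difference, and a hypothetical `toy_first_processor` (clipped mean brightness) standing in for the first processing means; all three are assumptions made for illustration.

```python
def teacher_likelihood_map(image, patch_size, classify_patch):
    """Classify a patch at every sampling position (stride 1) and place each
    likelihood at the position the patch was sampled from."""
    height, width = len(image), len(image[0])
    rows = []
    for top in range(height - patch_size + 1):
        row = []
        for left in range(width - patch_size + 1):
            patch = [r[left:left + patch_size]
                     for r in image[top:top + patch_size]]
            row.append(classify_patch(patch))
        rows.append(row)
    return rows


def map_difference(teacher_map, student_map):
    """Mean squared difference between two likelihood maps of equal shape."""
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_map, student_map):
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count


def toy_first_processor(patch):
    """Hypothetical stand-in for the first processing means:
    mean pixel value clipped to [0, 1]."""
    values = [v for row in patch for v in row]
    return min(1.0, max(0.0, sum(values) / len(values)))


image = [[0.0, 0.0, 1.0],
         [0.0, 1.0, 1.0],
         [0.0, 1.0, 1.0]]

teacher = teacher_likelihood_map(image, 2, toy_first_processor)
student = [[0.5, 0.5], [0.5, 0.5]]  # placeholder for an untrained FCN output
loss = map_difference(teacher, student)
```

A learning process would then adjust the fully convolutional network so that `loss` decreases.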

The information processing device according to claim 4, wherein the second processing means performs segmentation of the array data based on the likelihood map.

The information processing device according to claim 8, wherein the array data is an image, the partial array data is a partial image included in the image, and the second processing means performs segmentation of the image based on the likelihood that the partial image is included in a specific class.
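
One simple way to derive a segmentation from a likelihood map is thresholding, sketched below. The threshold value of 0.5 and the binary labeling are assumptions; the claims do not specify how the likelihood map is converted into a segmentation.

```python
def segment(likelihood_map, threshold=0.5):
    """Label each map position 1 if its likelihood reaches the threshold,
    otherwise 0."""
    return [[1 if p >= threshold else 0 for p in row]
            for row in likelihood_map]


# Hypothetical likelihood map for one class.
labels = segment([[0.2, 0.7],
                  [0.5, 0.1]])
```

Each label refers to the partial image at the corresponding map position, so the label map has the resolution of the likelihood map rather than of the input image.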

The information processing device according to claim 4, wherein the first processing means calculates, for each of a plurality of classes, the likelihood that the array data is included in the class, and
the second processing means calculates, for each of the plurality of classes, the likelihood that the array data is included in the class.

The information processing device according to claim 4, wherein the first processing means calculates the likelihood using Self-Attention processing, a Transformer, or repeated processing that uses all input features.

The information processing device according to claim 1, wherein the learning means performs learning processing of the fully convolutional network so as to reduce the difference between features of the learning array data obtained in the process in which the first processing means calculates the likelihood for the learning array data and features of the learning array data calculated by the second processing means using the fully convolutional network.

The information processing device according to claim 12, wherein the second processing means calculates the likelihood that array data to be processed is included in the specific class based on the correlation between features of array data of the specific class calculated using the fully convolutional network and features of the array data to be processed calculated using the fully convolutional network.
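
A likelihood based on feature correlation can be sketched as follows. The sketch assumes cosine similarity as the correlation measure and a linear mapping from [-1, 1] to [0, 1]; both are illustrative choices that the claims do not prescribe, and the feature vectors are made-up values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def likelihood_from_correlation(class_feature, target_feature):
    """Map the correlation from [-1, 1] to a likelihood in [0, 1]."""
    return 0.5 * (cosine_similarity(class_feature, target_feature) + 1.0)


# Hypothetical features: one for reference data of the class,
# the others for data to be processed.
class_feature = [3.0, 4.0]
same = likelihood_from_correlation(class_feature, [3.0, 4.0])
orthogonal = likelihood_from_correlation([1.0, 0.0], [0.0, 1.0])
opposite = likelihood_from_correlation(class_feature, [-3.0, -4.0])
```

In practice both feature vectors would come from the same fully convolutional network, so that the comparison is made in a single feature space.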

The information processing device according to claim 1, wherein the fully convolutional network is configured not to perform padding processing or hierarchical processing in which processing contents vary depending on variations in the size of the array data.

The information processing device according to claim 1, further comprising confirmation means for confirming whether the fully convolutional network includes padding processing or hierarchical processing in which processing contents vary depending on variations in the size of the array data.

The information processing device according to claim 15, wherein the confirmation means causes the second processing means to process test data, thereby confirming whether the fully convolutional network includes padding processing or hierarchical processing in which processing contents vary depending on variations in the size of the array data.
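
One possible form of such a test-data check is sketched below. It is an assumption about how the confirmation might be carried out, not a procedure taken from the claims: the network's output on the whole test data is compared with its outputs on each crop of the predetermined size. The 1-D toy networks `valid_net` and `padded_net` are hypothetical.

```python
def valid_net(signal):
    """Toy FCN: width-3 moving sum without padding."""
    return [sum(signal[i:i + 3]) for i in range(len(signal) - 2)]


def padded_net(signal):
    """Same filter, but zero-padded so output length equals input length."""
    padded = [0.0] + list(signal) + [0.0]
    return valid_net(padded)


def matches_patchwise(network, test_data, patch_size, tol=1e-9):
    """Return True when output position i on the full input equals the single
    output produced for the crop of patch_size starting at position i."""
    full = network(test_data)
    n_positions = len(test_data) - patch_size + 1
    if len(full) != n_positions:
        return False
    for i in range(n_positions):
        crop_out = network(test_data[i:i + patch_size])
        if len(crop_out) != 1 or abs(crop_out[0] - full[i]) > tol:
            return False
    return True


test_data = [1.0, 2.0, 3.0, 4.0, 5.0]
ok_without_padding = matches_patchwise(valid_net, test_data, 3)
ok_with_padding = matches_patchwise(padded_net, test_data, 3)
```

The unpadded network passes the check, while the zero-padded variant fails it because its output depends on where the input boundary lies.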

The information processing device according to claim 15, wherein the confirmation means modifies the configuration of the fully convolutional network so that the fully convolutional network does not perform padding processing or hierarchical processing in which processing contents vary depending on variations in the size of the array data.

The information processing device according to claim 1, wherein the second processing means inputs array data for additional learning into the fully convolutional network after learning by the learning means, thereby generating a likelihood map indicating the likelihood that each of partial array data included in the array data for additional learning is included in a specific class, and
the learning means extracts partial array data for additional learning from the array data for additional learning based on the likelihood map, and performs additional learning of the first processing means using the extracted partial array data.
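
Extraction of partial array data based on a likelihood map can be sketched as below. The selection rule (keep positions whose likelihood is at least a threshold of 0.8) and the stride-1 correspondence between map positions and patches are assumptions; the claim states only that extraction is based on the likelihood map.

```python
def extract_patches(image, likelihood_map, patch_size, threshold=0.8):
    """Return ((top, left), patch) for every map position whose likelihood
    is at least the threshold."""
    selected = []
    for top, row in enumerate(likelihood_map):
        for left, p in enumerate(row):
            if p >= threshold:
                patch = [r[left:left + patch_size]
                         for r in image[top:top + patch_size]]
                selected.append(((top, left), patch))
    return selected


# Hypothetical array data for additional learning and its likelihood map.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
likelihood_map = [[0.10, 0.90],
                  [0.20, 0.85]]

patches = extract_patches(image, likelihood_map, patch_size=2)
```

The extracted patches could then serve as inputs for additional learning of the first processing means.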

A method by which an information processing device produces a fully convolutional network with learned parameters, wherein
the information processing device comprises:
a first processing means for calculating the likelihood that array data is included in a specific class; and
a second processing means that performs processing different from that of the first processing means and calculates the likelihood that array data is included in a specific class using a fully convolutional network,
the method comprising:
calculating, using the first processing means, the likelihood for learning array data; and
performing learning processing of the fully convolutional network using, as teacher data, information obtained in the process of calculating the likelihood by the first processing means.

A program for causing a computer to function as the information processing device according to any one of claims 1 to 18.