WO2022208632A1 - Inference device, inference method, learning device, learning method, and program - Google Patents
Inference device, inference method, learning device, learning method, and program
- Publication number
- WO2022208632A1 (PCT/JP2021/013407; JP2021013407W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- learning
- domain
- mathematical model
- inference
- image
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Definitions
- the technology disclosed herein relates to an inference device, an inference method, a learning device, a learning method, and a program.
- a technology related to an identification device that uses pre-learned information to make inferences for images captured by a camera and performs various types of identification is known.
- for example, an identification device is disclosed that performs this inference using a neural network reinforced by machine learning such as deep learning.
- the prior art exemplified in Patent Document 1 is certainly robust against the level of image change caused by external factors such as weather. However, when images with different domains must be handled, the level of change between the images is too great, and the prior art cannot learn and infer correctly.
- the domain here means the type of image, and includes, for example, an actual RGB image, a thermal infrared image captured by an infrared camera (hereinafter referred to as a "TIR image"), an illustration image, an image generated by a CG simulator, and the like.
- one situation in which images with different domains must be handled is person recognition by a surveillance camera that uses infrared images: actual RGB images for training are abundant, but the TIR images that are actually to be learned are not.
- the disclosed technology aims to solve the above problems and provide an inference device, an inference method, a learning device, a learning method, and a program that can correctly perform learning and inference even for images with different domains.
- a learning device according to the disclosed technology includes a coupled mathematical model capable of machine learning and learns a dataset of a target domain from a dataset of an original domain for teaching. The front part of the coupled mathematical model generates a plurality of low-level feature maps from input image data, compares the low-level feature maps of datasets belonging to the same type of learning object for the original domain and the target domain of the image data to calculate domain-shared features, and calculates domain relaxation learning information for each of the spaces of (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components among the domain-shared features.
- the learning device exploits an essential property of learning: the order in which features are acquired.
- that essential property is that a mathematical model typified by a CNN completes learning of simple features, typified by "color", earlier in training.
- the learning device can handle images with different domains in learning. Also, by using the information learned by the learning device according to the technology disclosed herein, images with different domains can be handled in inference.
- FIG. 1 is a system block diagram showing a configuration example of a system including a learning device and an inference device according to Embodiment 1.
- FIG. 2 is a flow chart showing a processing flow of the system according to Embodiment 1.
- FIG. 2A shows the flow of learning and inferring images in the original domain.
- FIG. 2B shows the filter learning flow in preparation for processing the target domain image.
- FIG. 2C shows the flow when learning and inferring images of the target domain.
- FIG. 3 is a system block diagram showing a configuration example of a system including a learning device and an inference device according to Embodiment 2.
- FIG. 4 is a schematic diagram showing the idea of the technology disclosed herein.
- FIG. 5 is a first explanatory diagram supplementing the idea of the disclosed technology.
- FIG. 6 is a second explanatory diagram supplementing the idea of the disclosed technology.
- the disclosed technology will be clarified by the following description of each embodiment with reference to the drawings.
- the technology disclosed herein is used in various identification devices that use pre-learned information to perform inference with respect to a captured image captured by a camera.
- the technology disclosed herein can also be used, for example, in surveillance cameras using infrared images, futuristic room air conditioners equipped with human detection cameras, and the like. More specifically, the disclosed technology relates to learning and inferring a target domain dataset from an original domain dataset for training.
- the disclosed technology is effective in situations where it is necessary to handle images with different domains.
- for example, with a surveillance camera that uses infrared images, actual RGB images, which form a training dataset, are abundant, but the TIR images to be learned are not.
- in the following, the type of image that can be abundantly prepared for learning, such as an actual RGB image, is called an original domain, and the type of image that is originally desired to be learned, such as a TIR image, is called a target domain.
- a CNN is also called a convolutional neural network and has properties such as global position invariance and rotation invariance.
- a CNN is a type of multilayer perceptron that combines a convolution layer, a pooling layer, and a fully connected layer.
- each image handled by the technology of the present disclosure, and each layer of the CNN, can be represented by the spatial resolution and the channels of a feature map.
- the number of dimensions of an image is determined by the number of pixels in the horizontal direction, the number of pixels in the vertical direction, and the number of channels.
- the number of channels is 3 for an RGB image and 1 for a TIR image, and the horizontal and vertical dimensions take different values. That is, the total number of dimensions of an image can be represented by the number of pixels in the horizontal direction × the number of pixels in the vertical direction × the number of channels.
- a convolutional layer in a CNN performs an operation called two-dimensional convolution.
- a Gaussian filter that performs a blurring operation is well known as one that performs a convolution operation in general image processing.
- a filter that performs a convolution operation is called a convolution filter.
- processing by a convolution filter places a kernel, which can be regarded as a small image patch such as 3×3, at each pixel of the input image, and outputs to each pixel the inner product of the kernel and the input image.
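- as a minimal sketch of the inner-product operation just described, the following Python/NumPy snippet convolves a grayscale image with a 3×3 kernel; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2D convolution: place the kernel at each pixel and
    output the inner product of the kernel and the local patch."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

# 3x3 Gaussian-like blur kernel, of the kind mentioned for general image processing
blur = np.array([[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]], dtype=float) / 16.0
image = np.random.rand(8, 8)
print(conv2d(image, blur).shape)  # (6, 6)
```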
- convolutional layers in a CNN usually form multiple layers, each holding multiple convolution filters. In deep learning, activation functions and batch normalization are introduced before and after the convolutional layers; these prevent vanishing gradients and suppress over-fitting to the locality of the training data.
- for the activation function, a nonlinear function such as ReLU (Rectified Linear Unit), Sigmoid, or Softmax is used. By escaping linear space, it avoids the vanishing gradient problem that would otherwise arise when gradients are propagated to the convolutional layers by error backpropagation (Back Propagation).
- a convolutional layer can perform arbitrary dimensional operations, such as taking an input with M channels and producing an output with N channels. The number of convolution filters a convolutional layer holds is expressed as its channels. The size of a convolutional layer can be represented by the number of output channels × the vertical size of the feature map × the horizontal size of the feature map.
- the output of a convolutional layer, which contains spatial information, is called a feature map (Feature Map) or a feature quantity map.
- a pooling layer in a CNN performs a resolution-reducing operation, also called subsampling, that shrinks features while preserving them, thereby reducing the positional sensitivity of the features and realizing global position invariance and rotation invariance. Since a CNN for image classification finally outputs a vector, the resolution is lowered step by step. Several methods exist for the pooling layer, but max pooling is often used. Max pooling performs resizing by outputting the maximum value for each feature map.
- a convolution layer and a pooling layer are layers that utilize the structure of an image and have spatial information.
- a fully connected layer is sometimes placed at the end of the network. Unlike the convolutional and pooling layers, a fully connected layer does not have a horizontal × vertical × channel structure; its quantized features are described as vectors. Fully connected layers are sometimes used for dimensionality reduction and expansion. By connecting each pixel of the feature map not only to its neighborhood but to the entire region, more conceptual, high-dimensional semantics can be obtained.
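- the conv-pool-fully-connected structure described above can be sketched, for example, with PyTorch; the framework choice, layer sizes, and class name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolution + pooling layers keep spatial structure;
    the final fully connected layer outputs a vector."""
    def __init__(self, in_channels: int = 3, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),          # batch normalization around the conv layer
            nn.ReLU(),                   # nonlinear activation escapes linear space
            nn.MaxPool2d(2),             # max pooling: stepwise resolution reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes a 32x32 input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)             # feature maps: channels x H x W
        return self.classifier(torch.flatten(x, 1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # RGB image: 3 channels
```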
- FIG. 1 is a system block diagram showing a configuration example of a system including a learning device 1 and an inference device 2 according to Embodiment 1.
- the system according to the disclosed technology includes a learning device 1, an inference device 2, a shared storage device 3 through which the learning device 1 and the inference device 2 can share information, and an external storage device 4 accessed by the learning device 1.
- the learning device 1 includes an image input unit 10, a shallow layer feature amount extraction unit 11, a common feature amount calculation unit 12, a domain relaxation learning information calculation unit 13, a high-dimensional feature amount addition unit 14, and a learning information correction unit 15.
- the inference device 2 includes a deep feature amount extraction unit 20 and an attribute regression unit 21.
- FIG. 2 is a flow chart showing the processing flow of the system according to the first embodiment.
- FIG. 2A shows the flow of learning and inferring images in the original domain.
- FIG. 2B shows the filter learning flow in preparation for processing the target domain image.
- FIG. 2C shows the flow when learning and inferring images of the target domain.
- both the learning of images in the original domain and the learning of images in the target domain are classified as supervised learning.
- the shaded portions in FIGS. 2B and 2C indicate steps performed by shallow-layer CNN 100, described below.
- the original domain image may be learned by constructing a machine learning model that performs image recognition using CNN, which is a typical method of deep learning.
- the process of constructing this machine learning model includes a step ST1 of inputting an image, a step ST2 of extracting features, a step ST3 of calculating an object position or attribute, and a step ST4 of outputting an inference result.
- machine learning has different purposes depending on the context in which the machine learning model is used. For example, when a machine learning model is used in an object recognition device, the goal is to estimate what is where: if the object in the image is a car, the purpose is to infer at what position in the image the car is and what attributes it has.
- a method is known in which the features of teacher images that have been categorized in advance are extracted and a machine learning model is constructed from their plots in the feature space. A support vector machine (SVM) and the like are known as methods of obtaining the boundaries of each category in this feature space. Since features are usually multi-dimensional, the feature space is also called a high-dimensional feature space.
- step ST3 of calculating the object position or attribute in FIG. 2A corresponds to the process of classifying the attributes of the image (Classification) or the process of regressing the position of the object (Regression).
- the learning of the images of the target domain is performed at the stage when the learning of the images of the original domain is completed. Learning images of the target domain is done in two stages.
- the two-stage learning consists of learning in the shallow layer feature amount extraction unit 11, the common feature amount calculation unit 12, and the domain relaxation learning information calculation unit 13 (hereinafter referred to as "filter learning"), and learning in the deep feature amount extraction unit 20 (hereinafter referred to as "main learning").
- image data of the target domain is first input to the learning device 1 via the image input unit 10.
- image data input via the image input unit 10 is output to the shallow layer feature amount extraction unit 11.
- the flow of processing in filter learning is shown in FIG. 2B, and the flow of processing in main learning is shown in FIG. 2C.
- the shallow layer feature amount extraction unit 11 is composed of a plurality of image filters that output a plurality of low-level feature maps from input image data. Since the shallow layer feature amount extraction unit 11 is a set of image filters, configuring it with CNN convolutional layers is conceivable.
- the shallow layer feature amount extraction unit 11, the common feature amount calculation unit 12, and the domain relaxation learning information calculation unit 13 constitute a shallow CNN (hereinafter referred to as the "shallow layer CNN 100"). The shallow layer CNN 100 is designed to extract the features that the image data of the original domain and the image data of the target domain have in common when each is plotted in a high-dimensional feature space (Domain Shared Features; hereinafter referred to as "domain-shared features").
- the image data of the original domain and the image data of the target domain are input to the shallow layer feature amount extraction unit 11 as teacher data.
- in the early stage of filter learning, the plots in the high-dimensional feature space appear random, but gradually a certain regularity becomes visible in the distribution for each image category.
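- one conceivable realization of the shallow layer feature amount extraction unit 11 is a stack of a few convolutional layers, sketched below in PyTorch; the layer count, channel widths, and names are assumptions, and a single-channel TIR image would need its own input convolution or channel tiling.

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """A few convolutional layers that output low-level feature maps
    (color, luminance, and low/high-frequency responses live in early layers)."""
    def __init__(self, in_channels: int = 3, out_channels: int = 32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # shape: (batch, out_channels, H, W)

extractor = ShallowFeatureExtractor()
rgb_maps = extractor(torch.randn(1, 3, 64, 64))  # original-domain teacher image
tir_maps = extractor(torch.randn(1, 3, 64, 64))  # target-domain image (e.g. TIR tiled to 3 channels)
```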
- FIG. 5 is an explanatory diagram No. 1 supplementing the idea of the disclosed technique.
- as shown in FIG. 5, the disclosed technique selects among the low-level features (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components according to the learning epoch, and teaches the selected component strongly.
- the feature maps output by the shallow layer feature amount extraction unit 11 contain such low-level features (Low-level Features). The low-frequency components of (3) may be rephrased as blur information in the image, and the high-frequency components of (4) as edges and texture.
- the shallow layer feature amount extraction unit 11 performs step ST12 for extracting low-level feature amounts.
- the domain-shared features are supervised strongly according to the degree to which the main learning of the target domain images has progressed.
- a method called attention is used to obtain a feature map in which the domain-shared features are emphasized (hereinafter referred to as a "weighted feature map").
- attention is a method of automatically learning which area of the feature map output by CNN should be focused on.
- Attention is the weighting of the region of interest.
- since a feature map has horizontal and vertical spatial dimensions and a channel dimension, teaching by attention likewise comes in a spatial-direction form and a channel-direction form.
- a technique called an SE block is disclosed for attention in the channel direction (for example, Non-Patent Document 1).
- Non-Patent Document 1: Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
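- a minimal sketch of the SE (Squeeze-and-Excitation) block of Non-Patent Document 1, which realizes attention in the channel direction; the reduction ratio follows that paper's convention and is not specified here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze each feature map to one value with GAP,
    then excite (reweight) channels with a small gating network."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling per channel
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # emphasize the informative channels

out = SEBlock(32)(torch.randn(1, 32, 16, 16))
```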
- FIG. 6 is a second explanatory diagram supplementing the idea of the disclosed technology.
- the common feature amount calculation unit 12 of the shallow layer CNN 100 compares the feature maps of datasets belonging to the same category across the two domains. A comparison of the plots in the high-dimensional feature space is illustrated in the graph on the right side of FIG. 6. FIG. 6 illustrates a comparison of datasets in the category "drying hair with a hair dryer", with the original domain being photographs and the target domain being illustrations. As another example of such a plot comparison, the category may be men in their teens, with an original domain of RGB images and a target domain of TIR images. Each teacher image is input to the shallow layer feature amount extraction unit 11, and a feature map is output for each.
- the common feature amount calculation unit 12 compares the feature maps channel by channel and assigns a large weight to channels in which domain-shared features exist. More specifically, the common feature amount calculation unit 12 spatially compares the feature maps calculated for the original domain and the target domain, and may calculate the distance between the most similar feature maps by, for example, image correlation, pixel-wise similarity, or SSIM (Structural Similarity), and use it as the weight.
- more simply, the common feature amount calculation unit 12 may apply Global Average Pooling (GAP) to the feature maps to calculate representative values, and calculate the distance between the representative values of the most similar feature maps by, for example, image correlation, pixel-wise similarity, or SSIM, and use it as the weight.
- in this way, the common feature amount calculation unit 12 calculates a feature map that emphasizes the channels to be attended to (step ST13 of calculating domain-shared features). These weights are called "domain relaxation weights".
- a feature map in which the domain-shared features are emphasized is called a "domain relaxation teacher signal".
- the aforementioned weights and teacher signals are collectively referred to as "domain relaxation learning information”.
- the common feature quantity calculator 12 of the shallow layer CNN 100 performs step ST14 of calculating domain relaxation weights.
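- steps ST13 and ST14 might be realized along the following lines: GAP reduces each channel to a representative value, channel-wise agreement between the two domains becomes the domain relaxation weight, and a weighted map serves as the domain relaxation teacher signal. Cosine similarity and the exponential distance below stand in for the image correlation/SSIM options named above; all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def domain_relaxation_weights(orig_maps: torch.Tensor,
                              tgt_maps: torch.Tensor,
                              use_gap: bool = True) -> torch.Tensor:
    """orig_maps, tgt_maps: (C, H, W) low-level feature maps computed from
    teacher images of the same category in the two domains.
    Returns per-channel weights that are large where the domains agree."""
    if use_gap:
        # Simpler variant: GAP reduces each map to a representative value,
        # and closeness of the representatives becomes the weight (step ST13).
        orig_rep = orig_maps.mean(dim=(1, 2))
        tgt_rep = tgt_maps.mean(dim=(1, 2))
        return torch.exp(-(orig_rep - tgt_rep).abs())
    # Spatial variant: compare the maps themselves, here with cosine similarity
    # (image correlation or SSIM are the alternatives named in the text).
    sims = F.cosine_similarity(orig_maps.flatten(1), tgt_maps.flatten(1), dim=1)
    return sims.clamp(min=0)

# Step ST14: the weights; a weighted map acts as the domain relaxation teacher signal
w = domain_relaxation_weights(torch.randn(32, 16, 16), torch.randn(32, 16, 16))
teacher_signal = torch.randn(32, 16, 16) * w.view(-1, 1, 1)
```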
- the domain relaxation learning information is used as a teacher signal for the main learning described later. Domain-shared features can be classified into (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components.
- the domain relaxation learning information calculation unit 13 of the shallow layer CNN 100 performs step ST24 of calculating the domain relaxation learning information.
- the effect of having the shallow CNN 100 is clarified by comparing it with a conventional system without the shallow CNN 100.
- since the target domain dataset is not abundant, the machine learning model cannot be trained sufficiently with the target domain dataset alone. It is therefore conceivable to build a machine learning model with images of another domain that has an abundant dataset and then re-learn with images of the target domain; that is, to attempt pre-training with the dataset of the original domain, transfer learning to the target domain, and fine-tuning.
- with a conventional system, however, the image features differ too much between the domains, and the results of the prior learning are destroyed.
- the effect of providing the shallow layer CNN 100 is that the prior learning results are not destroyed: even when there is little training data for the target domain, the difference in features between the domains is effectively reduced.
- the deep feature amount extraction unit 20 and the attribute regression unit 21 of the inference device 2 may be configured as a CNN consisting of deep layers, separate from the shallow layer CNN 100 (hereinafter referred to as the "deep CNN 110").
- initial learning is performed using an abundant dataset of original domain images.
- the original domain image dataset can be used in roughly two ways: using the original domain image dataset as it is, or using feature maps in which the domain-shared features have been emphasized by passing through the shallow layer CNN 100 described above.
- the learning device 1 according to the technology disclosed herein may use the original domain image data set by any method.
- FIG. 4 is a schematic diagram showing the idea of the technology disclosed herein.
- the teacher data for the full-scale stage of the main learning is a dataset of target domain images that have passed through the shallow layer CNN 100. Since the target domain images pass through the shallow layer CNN 100 for which filter learning has been completed, their domain-shared features are emphasized.
- Fig. 2C shows the processing flow when learning and inferring images of the target domain.
- this process includes step ST21 of inputting a target domain image, step ST22 of calculating a low-level feature map, step ST23 of multiplying by the domain relaxation weights, and step ST24 of calculating domain relaxation learning information.
- the shallow layer CNN 100 that has completed filter learning performs step ST22 of calculating a low-level feature map, step ST23 of multiplying domain relaxation weights, and step ST24 of calculating domain relaxation learning information.
- the deep CNN 110 performs step ST26 of calculating a high-level feature map and step ST27 of calculating an object position or attribute.
- the greatest feature of the learning device 1 according to the disclosed technology is that the domain-shared feature to be emphasized is changed in the order of (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components according to the learning epoch.
- the learning information correcting unit 15 of the learning device 1 performs switching of the domain-shared feature quantity to be emphasized.
- an epoch is a unit in which a set of data is forward-propagated and back-propagated through the neural network once. Since one epoch is too large for a computer to handle at one time, it is usually divided into several batches. An iteration is the number of batches required to complete one epoch. For example, assume a dataset of 2000 teacher images divided into batches of 500 images each; then one epoch consists of four iterations.
- the reason the emphasized feature is changed in this order is that the order in which a CNN acquires features is likewise (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components. This derives from this characteristic of CNNs.
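- a sketch of how the switching by the learning information correction unit 15 might be scheduled over epochs; the even split of epochs across the four components is an invented illustration, not a rule from the patent.

```python
# Order in which a CNN acquires features, per the text above.
COMPONENTS = ["color", "luminance", "low_frequency", "high_frequency"]

def emphasized_component(epoch: int, total_epochs: int) -> str:
    """Switch the emphasized domain-shared feature as main learning progresses.
    The even split of epochs is an assumption for illustration."""
    stage = min(epoch * len(COMPONENTS) // total_epochs, len(COMPONENTS) - 1)
    return COMPONENTS[stage]

for epoch in range(8):
    print(epoch, emphasized_component(epoch, total_epochs=8))
    # epochs 0-1 -> color, 2-3 -> luminance, 4-5 -> low_frequency, 6-7 -> high_frequency
```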
- the main learning is evaluated using images of the target domain. If inference can be performed at the desired accuracy even when an unprocessed target domain image is input to the deep CNN 110, the inference device 2 can use the deep CNN 110 that has completed this main learning as it is. If inference cannot be performed at the desired accuracy, the high-dimensional feature amount assigning unit 14 of the learning device 1 multiplies the unprocessed target domain image by the domain relaxation weights calculated by the trained shallow layer CNN 100 to generate a processed image (step ST23 of multiplying by the domain relaxation weights), and inputs it to the deep CNN 110.
- in the former case, the inference device 2 is composed only of the deep CNN 110.
- in the latter case, the inference device 2 is composed of a combination of the shallow layer CNN 100 and the deep CNN 110.
- the inference device 2 can make inferences about the images in the target domain.
- the processing flow of the inference device 2 will be clarified by the description based on FIG. 2C below.
- the description here assumes that the inference device 2 is configured by a combination of the shallow CNN 100 and the deep CNN 110 .
- An image of the target domain to be inferred is first input to the image input unit 10 (step ST21 for inputting the target domain image).
- a low-level feature map is created from the input image in the shallow layer feature amount extraction unit 11 of the shallow layer CNN 100 (step ST22 for calculating the low-level feature map).
- the created low-level feature map is multiplied by the domain relaxation weight in the high-dimensional feature quantity assigning unit 14 (step ST23 for multiplication of the domain relaxation weight), and an input image to the deep CNN 110 is generated.
- the deep CNN 110 calculates the object position or attribute for the input image in the attribute regression unit 21 (step ST27 for calculating the object position or attribute) and outputs the inference result (step ST28 for outputting the inference result).
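- putting steps ST21 to ST28 together, the inference path (shallow layer CNN 100 → multiplication by domain relaxation weights → deep CNN 110) could be sketched as below; the module classes are illustrative stand-ins, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class InferenceDevice(nn.Module):
    """Combination of shallow layer CNN 100 and deep CNN 110 (Embodiment 1)."""
    def __init__(self, shallow: nn.Module, deep: nn.Module,
                 relaxation_weights: torch.Tensor):
        super().__init__()
        self.shallow = shallow                      # shallow layer CNN 100
        self.deep = deep                            # deep CNN 110
        self.register_buffer("w", relaxation_weights.view(1, -1, 1, 1))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.shallow(image)                  # ST22: low-level feature map
        feat = feat * self.w                        # ST23: multiply domain relaxation weights
        return self.deep(feat)                      # ST27: object position / attribute

# Illustrative stand-ins for the two networks
shallow = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
deep = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
device = InferenceDevice(shallow, deep, torch.rand(32))
prediction = device(torch.randn(1, 3, 64, 64))      # ST21 input -> ST28 output
```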
- the learning device 1 and the inference device 2 have the effect that learning progresses without lowering the recognition rate even when the amount of data in the dataset of the target domain is small.
- Embodiment 2. The system including the learning device 1 and the inference device 2 according to Embodiment 1 was premised on a certain amount of target domain dataset being available for learning, even if not abundant.
- a system including the learning device 1 and the inference device 2 according to the second embodiment can cope with the case where there is no data set of the target domain at the learning stage.
- the problem of learning a class without teacher data to be inferred in the learning stage is called the Zero-Shot Learning problem.
- the same reference numerals are used for the components that are common to those of the first embodiment, and overlapping descriptions are omitted as appropriate.
- FIG. 3 is a system block diagram showing a configuration example of a system including the learning device 1 and the inference device 2 according to the second embodiment.
- the learning device 1 according to Embodiment 2 includes a learning information updating unit 14B instead of the high-dimensional feature amount adding unit 14 and the learning information correction unit 15.
- the core idea for solving the problem is the same as in Embodiment 1. That is, the system according to Embodiment 2 attempts to solve the problem by performing filter learning and main learning simultaneously from a given target domain image. Specifically, the learning information updating unit 14B simultaneously performs step ST23 of multiplying by the domain relaxation weights, which the high-dimensional feature amount adding unit 14 performed, and the switching of the emphasized domain-shared feature, which the learning information correction unit 15 performed.
- the deep CNN 110 of the inference device 2 according to Embodiment 2 uses the same neural network as that of the inference device 2 prepared for images of the original domain (see FIG. 2A).
- as the initial state of the deep CNN 110, the state of a neural network that has been sufficiently trained with a large-scale image dataset of the original domain may be used.
- Embodiment 3. In Embodiments 1 and 2, the shallow layer CNN 100 and the deep CNN 110, which are the core components, were described as "two independent CNNs", both employing CNNs. However, the components corresponding to the shallow layer CNN 100 and the deep CNN 110 according to the technology of the present disclosure need not be two independent CNNs, nor do they need to be CNNs in the first place. Embodiment 3 clarifies the disclosed technology with configuration examples other than "two independent CNNs".
- in a first configuration example, the shallow layer CNN 100 and the deep CNN 110 are realized as one large coupled CNN 120 shared by the learning device 1 and the inference device 2.
- since the coupled CNN 120 is a kind of multilayer neural network, it can be divided into a front layer 121 and a rear layer 122.
- the front layer 121 of the coupled CNN 120 may serve as the shallow CNN 100 and the rear layer 122 of the coupled CNN 120 may serve as the deep CNN 110 . That is, the joint CNN 120 has a function of extracting high-dimensional feature quantities from the input image.
- the coupled CNN 120 may be shared by implementing it on the cloud, or by connecting to it online.
- a second configuration example realizes the role of the coupled CNN 120 with a neural network that is not a CNN.
- the component that realizes this role with a neural network that is not a CNN is named the coupled NN 130.
- the coupled NN 130 is divided into a coupled NN front layer 131 and a coupled NN rear layer 132.
- the coupled NN front layer 131 may play the role of the shallow layer CNN 100, and the coupled NN rear layer 132 may play the role of the deep CNN 110. That is, the coupled NN 130 has the function of extracting high-dimensional features from an input image. Since the coupled NN 130 is a multilayer neural network, its learning method can be said to be deep learning.
- the coupled NN 130, likewise, may be shared by implementing it on the cloud, or by connecting to it online.
- a third configuration example realizes the role of the coupled CNN 120 with a mathematical model other than a neural network.
- the component that realizes this role with a mathematical model other than a neural network is named the coupled mathematical model 140.
- the coupled mathematical model 140 includes a coupled mathematical model front part 141 and a coupled mathematical model rear part 142.
- the coupled mathematical model front part 141 may play the role of the shallow layer CNN 100, and the coupled mathematical model rear part 142 may play the role of the deep CNN 110. That is, the coupled mathematical model 140 has the function of extracting high-dimensional features from an input image.
- the coupled mathematical model 140, like the coupled CNN 120, must have an input part, a calculation part that computes the output from the input using variable parameters, and an output part.
- the coupled mathematical model 140 must also be capable of machine learning by changing the variable parameters based on an evaluation function that evaluates the output. Such a coupled mathematical model 140 is described here as "learnable".
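- a toy sketch of such a "learnable" coupled mathematical model that is not a neural network: an input part, a calculation part with variable parameters, an output part, and parameter updates driven by an evaluation function. The linear front/rear split and the numerical-gradient update are assumptions for illustration only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CoupledMathModel:
    """Minimal learnable model: input -> parametric calculation -> output.
    A linear front/rear split stands in for front part 141 / rear part 142."""
    front: np.ndarray   # variable parameters of the front part 141
    rear: np.ndarray    # variable parameters of the rear part 142

    def forward(self, x: np.ndarray) -> np.ndarray:
        return self.rear @ (self.front @ x)   # calculation part

    def learn(self, x: np.ndarray, target: np.ndarray,
              lr: float = 1e-2, eps: float = 1e-5) -> None:
        """Change the variable parameters based on an evaluation function
        (squared error), here via a crude numerical gradient."""
        def loss() -> float:
            diff = self.forward(x) - target
            return float(np.sum(diff * diff))
        for params in (self.front, self.rear):
            for i in np.ndindex(params.shape):
                old, base = params[i], loss()
                params[i] = old + eps
                grad = (loss() - base) / eps
                params[i] = old - lr * grad

model = CoupledMathModel(front=np.random.randn(4, 3), rear=np.random.randn(2, 4))
model.learn(np.random.randn(3), np.zeros(2))
```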
- the coupled mathematical model 140, likewise, may be shared by implementing it on the cloud, or by connecting to it online.
- the technology of the present disclosure changes the domain-shared feature to be emphasized in the order of (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components. This is based on the fact that, in machine learning for image recognition and the like, the simpler a feature is, typified by "color", the earlier in training its learning is completed.
- since the learning device 1 and the inference device 2 according to Embodiment 3 have the configurations described above, learning and inference can be performed correctly even for images with different domains, without adopting two independent CNNs.
- the inference device 2, the inference method, the learning device 1, the learning method, and the program according to the disclosed technology can be used for identification devices that perform various types of identification on captured images, and therefore have industrial applicability.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Image Analysis (AREA)
Description
While the embodiment here has been described using attention in the channel direction, the disclosed technology may combine attention in the channel direction and attention in the spatial direction as appropriate.
Claims (14)
- 1. A learning device comprising a coupled mathematical model capable of machine learning, the learning device learning a dataset of a target domain from a dataset of an original domain for teaching, wherein the front part of the coupled mathematical model generates a plurality of low-level feature maps from input image data, compares the low-level feature maps of datasets belonging to the same type of learning object for the original domain and the target domain of the image data to calculate domain-shared features, and calculates domain relaxation learning information for each of the spaces of (1) color, (2) luminance, (3) low-frequency components, and (4) high-frequency components among the domain-shared features.
- 2. The learning device according to claim 1, wherein the coupled mathematical model is learnable by supervised learning.
- 3. The learning device according to claim 2, wherein the front part of the coupled mathematical model is a convolutional neural network.
- 4. The learning device according to claim 3, wherein the learning method of the front part of the coupled mathematical model is deep learning.
- 5. The learning device according to claim 1, further comprising: a high-dimensional feature assigning unit that weights an input feature map of the target domain using the domain relaxation learning information to generate a new weighted feature map; and a learning information correction unit that switches, within the calculated domain relaxation learning information, the domain-shared feature to be emphasized.
- 6. An inference device comprising the coupled mathematical model capable of machine learning, the inference device performing inference on a feature map of the target domain, wherein the rear part of the coupled mathematical model has undergone main learning using the weighted feature map generated by the learning device according to claim 5.
- 7. The inference device according to claim 6, wherein the rear part of the coupled mathematical model is learnable by supervised learning.
- 8. The inference device according to claim 6, wherein the rear part of the coupled mathematical model is a convolutional neural network.
- 9. The inference device according to claim 6, wherein the learning method of the rear part of the coupled mathematical model is deep learning.
- 10. The learning device according to claim 5, wherein the learning information correction unit switches the domain-shared feature to be emphasized according to the epoch of the main learning of the inference device.
- 11. A learning method for a learning device that comprises a mathematical model capable of machine learning and learns a dataset of a target domain from a dataset of an original domain for teaching, the learning method comprising: a step of inputting images of two domains; a step of extracting low-level features from the input images; a step of calculating domain-shared features from the extracted low-level features; and a step of calculating domain relaxation weights from the domain-shared features.
- 12. An inference method for an inference device that comprises a mathematical model capable of machine learning, receives a feature map of a target domain image, and performs inference, the inference method comprising: a step of inputting the feature map of the target domain image; and a step of calculating a low-level feature map from the input feature map of the target domain image, wherein the mathematical model calculates domain relaxation learning information from the low-level feature map and performs inference.
- 13. A program that comprises a mathematical model capable of machine learning and executes processing of learning a dataset of a target domain from a dataset of an original domain for teaching, the program comprising: a step of inputting images of two domains; a step of extracting low-level features from the input images; a step of calculating domain-shared features from the extracted low-level features; and a step of calculating domain relaxation weights from the domain-shared features.
- 14. A program that comprises a mathematical model capable of machine learning and executes processing of performing inference on a feature map of a target domain image, the program comprising: a step of inputting the feature map of the target domain image; and a step of calculating a low-level feature map from the input target domain image, wherein the mathematical model calculates domain relaxation learning information from the low-level feature map and performs inference.
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180096136.5A CN117099127A (zh) | 2021-03-29 | 2021-03-29 | Inference device, inference method, learning device, learning method, and program
JP2023509940A JP7274071B2 (ja) | 2021-03-29 | 2021-03-29 | Learning device
EP21934805.9A EP4296939A4 (en) | 2021-03-29 | 2021-03-29 | INFERENCE DEVICE, INFERENCE METHOD, LEARNING DEVICE, LEARNING METHOD AND PROGRAM
PCT/JP2021/013407 WO2022208632A1 (ja) | 2021-03-29 | 2021-03-29 | Inference device, inference method, learning device, learning method, and program
KR1020237031632A KR102658990B1 (ko) | 2021-03-29 | 2021-03-29 | Learning device
US18/235,677 US20230394807A1 (en) | 2021-03-29 | 2023-08-18 | Learning device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/013407 WO2022208632A1 (ja) | 2021-03-29 | 2021-03-29 | Inference device, inference method, learning device, learning method, and program
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/235,677 Continuation US20230394807A1 (en) | 2021-03-29 | 2023-08-18 | Learning device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022208632A1 (ja) | 2022-10-06 |
Family
ID=83455707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/013407 WO2022208632A1 (ja) | 2021-03-29 | 2021-03-29 | Inference device, inference method, learning device, learning method, and program
Country Status (6)
Country | Link |
---|---|
US (1) | US20230394807A1 (ja) |
EP (1) | EP4296939A4 (ja) |
JP (1) | JP7274071B2 (ja) |
KR (1) | KR102658990B1 (ja) |
CN (1) | CN117099127A (ja) |
WO (1) | WO2022208632A1 (ja) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019175107A (ja) | 2018-03-28 | 2019-10-10 | Oki Electric Industry Co., Ltd. | Recognition device, recognition method, program, and data generation device
WO2020031851A1 (ja) * | 2018-08-08 | 2020-02-13 | FUJIFILM Corporation | Image processing method and image processing device
CN111191690A (zh) * | 2019-12-16 | 2020-05-22 | Shanghai Aerospace Control Technology Institute | Transfer-learning-based autonomous space target recognition method, electronic device, and storage medium
JP2020126468A (ja) * | 2019-02-05 | 2020-08-20 | Fujitsu Limited | Learning method, learning program, and learning device
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200380369A1 (en) | 2019-05-31 | 2020-12-03 | Nvidia Corporation | Training a neural network using selective weight updates |
US20230072400A1 (en) | 2021-09-07 | 2023-03-09 | Arizona Board Of Regents On Behalf Of Arizona State University | SYSTEMS, METHODS, AND APPARATUSES FOR GENERATING PRE-TRAINED MODELS FOR nnU-Net THROUGH THE USE OF IMPROVED TRANSFER LEARNING TECHNIQUES |
KR20230139257A (ko) * | 2022-03-25 | 2023-10-05 | Asan Foundation | Method and apparatus for classifying and segmenting CT images based on a machine learning model
WO2023230748A1 (en) | 2022-05-30 | 2023-12-07 | Nvidia Corporation | Dynamic class weighting for training one or more neural networks |
2021
- 2021-03-29: EP — application EP21934805.9A filed (publication EP4296939A4, en); status: active, pending
- 2021-03-29: JP — application JP2023509940A filed (publication JP7274071B2, ja); status: active
- 2021-03-29: WO — application PCT/JP2021/013407 filed (publication WO2022208632A1, ja); application filing
- 2021-03-29: KR — application KR1020237031632A filed (publication KR102658990B1, ko); status: active, IP right grant
- 2021-03-29: CN — application CN202180096136.5A filed (publication CN117099127A, zh); status: active, pending
2023
- 2023-08-18: US — application US18/235,677 filed (publication US20230394807A1, en); status: active, pending
Non-Patent Citations (2)
- HU, JIE; SHEN, LI; SUN, GANG: "Squeeze-and-excitation networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018
- See also references of EP4296939A4
Also Published As
Publication number | Publication date |
---|---|
JP7274071B2 (ja) | 2023-05-15 |
KR20230144087A (ko) | 2023-10-13 |
CN117099127A (zh) | 2023-11-21 |
US20230394807A1 (en) | 2023-12-07 |
EP4296939A1 (en) | 2023-12-27 |
JPWO2022208632A1 (ja) | 2022-10-06 |
EP4296939A4 (en) | 2024-05-01 |
KR102658990B1 (ko) | 2024-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ding et al. | Semi-supervised locality preserving dense graph neural network with ARMA filters and context-aware learning for hyperspectral image classification | |
WO2020216227A9 (zh) | Image classification method, data processing method, and apparatus | |
CN107529650B (zh) | Loop closure detection method and device, and computer equipment | |
US10311342B1 (en) | System and methods for efficiently implementing a convolutional neural network incorporating binarized filter and convolution operation for performing image classification | |
EP3065085B1 (en) | Digital image processing using convolutional neural networks | |
US20220215227A1 (en) | Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium | |
CN111583263A (zh) | Point cloud segmentation method based on joint dynamic graph convolution | |
KR20170038622A (ko) | Method and apparatus for segmenting an object from an image | |
JP7405198B2 (ja) | Image processing device, image processing method, and image processing program | |
CN110826458A (zh) | Multispectral remote sensing image change detection method and system based on deep learning | |
CN110210493B (zh) | Contour detection method and system based on a non-classical receptive field modulated neural network | |
US20220157046A1 (en) | Image Classification Method And Apparatus | |
Verma et al. | Computational cost reduction of convolution neural networks by insignificant filter removal | |
CN116863194A (zh) | Foot ulcer image classification method, system, device, and medium | |
Bailly et al. | Boosting feature selection for neural network based regression | |
CN110569852B (zh) | Image recognition method based on convolutional neural network | |
WO2022208632A1 (ja) | Inference device, inference method, learning device, learning method, and program | |
US20230073175A1 (en) | Method and system for processing image based on weighted multiple kernels | |
Shahbaz et al. | Moving object detection based on deep atrous spatial features for moving camera | |
CN110837787A (zh) | Multispectral remote sensing image detection method and system using a three-party generative adversarial network | |
Halder et al. | Color image segmentation using semi-supervised self-organization feature map | |
Jiu et al. | Deep context networks for image annotation | |
Gangloff et al. | A general parametrization framework for pairwise Markov models: An application to unsupervised image segmentation | |
Ruz et al. | NBSOM: The naive Bayes self-organizing map | |
Lin et al. | Using Fully Convolutional Networks for Floor Area Detection. |
Legal Events
- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21934805; country of ref document: EP; kind code: A1)
- ENP — Entry into the national phase (ref document number: 2023509940; country of ref document: JP; kind code: A)
- ENP — Entry into the national phase (ref document number: 20237031632; country of ref document: KR; kind code: A)
- WWE — WIPO information: entry into national phase (ref document number: 1020237031632; country of ref document: KR)
- WWE — WIPO information: entry into national phase (ref document number: 2021934805; country of ref document: EP)
- WWE — WIPO information: entry into national phase (ref document number: 202180096136.5; country of ref document: CN)
- ENP — Entry into the national phase (ref document number: 2021934805; country of ref document: EP; effective date: 20230919)
- NENP — Non-entry into the national phase (ref country code: DE)