WO2024085114A1 - Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model - Google Patents

Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model

Info

Publication number
WO2024085114A1
WO2024085114A1 · PCT/JP2023/037394 · published as WO 2024/085114 A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
classification
learning
image data
loss
Prior art date
Application number
PCT/JP2023/037394
Other languages
French (fr)
Japanese (ja)
Inventor
修一 鶴田
悠太 中島
良知 李
博文 王
Original Assignee
国立大学法人大阪大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人大阪大学
Publication of WO2024085114A1 publication Critical patent/WO2024085114A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present invention relates to an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can help humans understand the process of performing image classification.
  • explainable artificial intelligence (XAI) has been attracting attention because understanding the behavior of neural networks is a major challenge, especially for medical applications (see Non-Patent Document 1) and for identifying biases in neural networks (see Non-Patent Document 2).
  • for this reason, a great deal of research effort has been devoted to providing post-hoc explanations of artificial intelligence models after they have been generated by machine learning (see Non-Patent Document 3).
  • This kind of explanation successfully provides a low-level (or pixel-by-pixel) relationship between the image and the model's judgment by highlighting some regions in the image as a heat map, but the interpretation of these relationships remains a problem.
  • an attention matrix (attention weights) is generated from the similarity between a query (Q), generated from a weight matrix generation model consisting of multiple weight vector columns (slots), and a key (K) representing the image features; this attention matrix is used to extract regions from the image that are used for object detection.
  • This representation of image features in “object detection” is also called “object-centric representation” (see non-patent document 5).
  • a simple way to predefine concepts is to use human knowledge (see non-patent document 7).
  • Other methods use a manually created set of concepts and quantify the importance of each concept to the decision using directional derivatives, while the Broden dataset unifies several densely labeled image datasets to provide a large concept corpus that can be used to directly and automatically match Convolutional Neural Network (CNN) representations with labeled interpretations (see non-patent document 8).
  • SENN (Self-Explaining Neural Networks) by Alvarez-Melis et al. utilizes a concept bottleneck and treats the activation of concepts as input to a regression model (see Non-Patent Document 9).
  • (Application of image recognition processing to determine the soundness of concrete structures, etc.)
  • the Ministry of Land, Infrastructure, Transport and Tourism's Guidelines for Periodic Bridge Inspection (see Non-Patent Document 10) state that the damage level of concrete walls is classified based on the crack width, whether the cracks form a lattice pattern, and the occurrence of water leakage and free lime.
  • Inspection of concrete structures requires close visual inspection by technicians with specialized knowledge, and this is done based on a comprehensive judgment that takes into account various aspects such as the state of deterioration, type, location, and traffic volume. In other words, judging the soundness of concrete structures relies heavily on the know-how (tacit knowledge) of experienced technicians, which cannot be put into a manual.
  • Patent Document 1 discloses a configuration in which cracks are detected as deformed areas using a feature map created with a CNN (Convolutional Neural Network) and the crack width is determined as attribute information of the deformed area.
  • Patent Document 2 also discloses a configuration that uses deep learning to provide a performance evaluation system for concrete structures that makes it possible to efficiently carry out a series of maintenance and management tasks, from inputting deformations to performance inspections.
  • the deep learning unit performs machine learning using artificial intelligence based on the discrepancy between the results automatically calculated by the performance evaluation system, which are accumulated for each inspection, and the results corrected by the inspector.
  • the configuration disclosed shows that the results of the machine learning are then reflected in subsequent judgments and predictions.
  • Patent Document 3 discloses a technology that makes it possible to determine the condition of a wide area based on both a local area and a wide area, and to determine the degree of damage to the concrete wall surface of an infrastructure structure.
  • concept learning is guided by a learning process using an autoencoder structure to reconstruct the original image. It is not yet clear whether such a structure can be applied to learning from natural images.
  • the present invention has been made to solve the above problems, and aims to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can learn the "concepts" that a trained model uses to make judgments through learning on a given task, so that the judgment process can be compared with that of a human.
  • the present invention also aims to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can assist or replace the judgment of the soundness of concrete structures by using a trained model of artificial intelligence.
  • an image classification learning device includes a storage device for storing learning data including a plurality of image data and image labels corresponding to the image data, and a calculation processing means for reading out the learning data stored in the storage device and executing a process of machine learning a plurality of concepts in the image data for classifying the image data.
  • the calculation processing means includes an image recognition means for extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the classification model in the storage device; an attention mechanism processing means for converting a slot vector in a concept matrix, in which slot vectors correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the classification processing of the image recognition means appear, according to the image feature defined by the slot vector, and storing the slot vector in the storage device; a loss evaluation means for calculating a loss based on a classification loss that is calculated by evaluating the classification rate of the image recognition means and decreases as the classification rate increases, and a separation loss that is calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in the feature space and decreases as the degree of separation increases; and a learning processing means for executing machine learning on the classification model and concept matrix stored in the storage device so as to reduce the loss.
  • the attention mechanism processing means includes an attention matrix learning means for learning an attention matrix for extracting image regions in the set of features to which attention is directed in the classification process of the image recognition means, in accordance with the degree of similarity with the concept matrix
  • the image recognition means includes a concept occurrence calculation means for generating an activity vector based on the attention matrix, the activity vector having elements representing the degree to which each concept corresponding to the slot vector appears in the image data, and a classifier for performing classification of image labels using the activity vector corresponding to the image data as input.
  • the set of features representing the image data is a feature map output from a convolutional neural network image recognition model.
  • the separation loss includes a consistency loss, which decreases as a single concept occupies a smaller volume in the feature space, and a discriminability loss, which decreases as pairs of concepts become less likely to occupy the same region in the feature space.
  • the image data is data of images of the surfaces of multiple concrete structures captured by a camera
  • the image labels are labels indicating the soundness of the concrete structures that each correspond to the image data.
  • an image classification learning method in which a computer learns multiple concepts in image data for classifying image data based on learning data including multiple image data and image labels corresponding to the image data, the computer including a storage device for storing the learning data and a calculation device for executing a machine learning process, the method including a step of extracting a set of features expressing the image data and learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step of converting a slot vector in a concept matrix including slot vectors that correspond to each of the multiple concepts and define an image region in which a feature that is emphasized in the classification process appears, according to an image feature defined by the slot vector; a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in the classification of image data and decreasing as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to the multiple concepts are separated from each other in a feature space and decreasing as the degree of separation increases; and a step of learning the classification model and the concept matrix so as to reduce the loss.
  • the image data is data of images of the surfaces of multiple concrete structures captured by a camera
  • the image labels are labels indicating the soundness of the concrete structures that each correspond to the image data.
  • an image classification learning program for machine learning a plurality of concepts in image data for classifying image data, based on learning data including a plurality of image data and image labels corresponding to the image data, by a computer.
  • the computer includes a calculation device and a storage device, and includes the steps of: for image data stored in the storage device, the calculation device extracts a set of features expressing the image data, and learns and generates a classification model that identifies and classifies image labels for the image data based on the extracted set of features; the calculation device converts slot vectors in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the classification process appear, according to image features defined by the slot vectors; the calculation device calculates a loss based on a classification loss calculated by evaluating a classification rate in the classification of image data and that decreases as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in a feature space and that decreases as the degree of separation increases; and the calculation device learns the classification model and the concept matrix so as to reduce the loss.
  • the computer-readable non-transitory recording medium stores an image classification learning program.
  • an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying image data based on training data including a plurality of image data and image labels corresponding to the image data, the image classification trained model having a configuration of a classifier model that uses as input an activation vector whose elements are the degree to which each of the concepts appears in the image data and classifies the image data based on the co-occurrence relationship of the elements, the image classification trained model includes a step of extracting a set of features that express the image data, and updating by learning a classifier model that identifies and classifies an image label for the image data based on the extracted set of features, and a step of converting a slot vector in a concept matrix composed of slot vectors that correspond to each of the plurality of concepts and define an image area in which a feature that is emphasized in the classification process appears, according to an image feature defined by the slot vector.
  • the trained model is generated by a step of calculating a loss based on a classification loss calculated by evaluating the classification rate in classifying image data and decreasing as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to multiple concepts are separated from each other in feature space and decreasing as the degree of separation increases, and a step of training the classifier model and the concept matrix so as to reduce the loss
  • the step of converting the slot vector includes a step of training an attention matrix for extracting image regions in the set of features to which attention is directed in the classification process according to the similarity with the concept matrix
  • the step of updating the classifier model by training includes a step of generating an activation vector based on the attention matrix, the elements of which are the degree to which each of the concepts corresponding to the slot vector appears in the image data, and a step of training parameters of the classifier model to perform classification for image labels using the activation vector corresponding to the image data as an input.
  • an image classification learning device includes a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data, and a calculation device for executing a process of machine learning a plurality of concepts in the image data for classifying the image data in terms of soundness based on the learning data stored in the storage device, the calculation device performing an image identification step of extracting a set of features representing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the classification model in the storage device;
  • the system executes an attention mechanism processing step in which the slot vectors are converted and stored in the storage device according to the image features defined by the slot vectors in a concept matrix consisting of slot vectors that correspond to each of the concepts and define the image regions in which the features that are emphasized in the classification model's classification process appear; a loss evaluation step in which a loss is calculated based on a classification loss that is calculated by evaluating the classification rate of the classification model and decreases as the classification rate increases, and a separation loss that is calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in the feature space and decreases as the degree of separation increases; and a learning process step of learning the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  • the attention mechanism processing step includes an attention matrix learning step of learning an attention matrix for extracting image regions in the set of features to which attention is directed in the classification model identification process according to the degree of similarity with the concept matrix
  • the image identification step includes a concept occurrence calculation step of generating an activation vector based on the attention matrix, the activation vector having elements representing the degree to which each concept corresponding to the slot vector appears in the image data, and a step of generating a classifier that performs classification on image labels using the activation vector corresponding to the image data as input.
  • the learning process step includes a step of generating a treatment discrimination model that learns to discriminate treatment labels using the activity vector and treatment labels of repair measures corresponding to image data of the surface of the concrete structure as input.
  • an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying image data regarding soundness, based on learning data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures corresponding to the image data, the image classification trained model having a configuration of a classifier model that uses as input an activity vector having elements representing the degree to which each of the concepts appears in the image data, and classifies the image data based on a co-occurrence relationship of the elements, the image classification trained model being generated by the steps of: extracting a set of features that represent the image data, and updating by learning a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; and converting the slot vectors in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the discrimination process appear, according to the image features defined by the slot vectors.
  • the image classification learning device, image classification learning method, and image classification learning program of the present invention enable humans to understand what image feature regions are used as the basis for classification by a trained model generated by artificial intelligence learning how to classify images.
  • the feature regions of this image are separated to minimize overlap between different classification classes, so even in classification tasks involving natural images, the activity of the feature regions during the separation process can be displayed and visualized in a way that allows comparison with the "concepts" humans use for classification.
  • when the image classification learning device, image classification learning method, and image classification learning program of the present invention are applied to determining the soundness of concrete structures, it becomes possible to make soundness determinations that utilize the accumulated judgment know-how of engineers and experts.
  • FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to a first embodiment.
  • FIG. 2 is a functional block diagram for explaining the configuration of a concept regularization unit 300.
  • FIG. 3A is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 3B is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 3C is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000.
  • FIG. 5 is a flowchart for explaining the learning process of the image classification learning device 1000.
  • FIG. 6 is a functional block diagram for explaining the configuration of the image classification device 4000 when performing classification processing for a new image.
  • FIG. 7 is a conceptual diagram for explaining the processing performed by the classifier 400.
  • FIG. 8A shows the classification performance of the classifier 400 for CUB200 and ImageNet.
  • FIG. 8B shows the classification performance of the classifier 400 for CUB200 and ImageNet.
  • FIG. 9A is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 9B is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 9C is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 10 is a diagram showing the attention levels of the five most important concepts (based on "importance" to be described later) for an input image of a black bird with a yellow head.
  • FIG. 11 is a diagram for explaining a concept represented by concept activity level t for a natural image.
  • FIG. 12 is a diagram showing the importance of each concept in the CUB200 dataset.
  • FIG. 13 is a diagram showing the magnitude of each hyperparameter and the accuracy rate, consistency, and discriminability.
  • FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of the second embodiment.
  • FIG. 15 is a conceptual diagram showing the configuration of learning data for generating a trained model of artificial intelligence such as that shown in FIG. 14.
  • FIG. 16 is a diagram showing an example of a system configuration for determining the soundness of concrete.
  • FIG. 17 is a functional block diagram showing the configuration of a terminal 500.1.
  • FIG. 18 is a block diagram for explaining the hardware configuration of the terminal 500.1.
  • FIG. 19 is a diagram for explaining the configuration of learning data based on image data, soundness level labels corresponding to the images, and corrective action labels.
  • FIG. 20 is a functional block diagram for explaining the configurations of an image classification learning device 1000 and a classification device 4000 according to a third embodiment.
  • the image classification learning device of the present invention will be described as a computer program that is installed on a standalone computer device and executes the image classification learning method.
  • the processing of the image classification learning device may be distributed among multiple computer devices, and the arithmetic device that executes the computer processing may be single or multiple.
  • the processing of the image classification learning device is not limited to a program installed in such a computer device, and may generally be realized as an arithmetic processing device such as a microcomputer that combines an arithmetic device and a storage device, or may be implemented in a dedicated IC circuit, an FPGA (Field-Programmable Gate Array), or other electronic circuit.
  • [Embodiment 1] (Concept-based image classification)
  • the term “concept” refers to a feature region in an “image” in a training dataset to which the classifier “attends” when performing classification in machine learning of an image classifier using a neural network, and that is separated to the extent that it satisfies a predetermined condition.
  • the method of "classification based on concepts” is also called “concept-based classification.”
  • predetermined conditions refer to conditions that enable the trained model to learn concepts so that the original image can be reconstructed or identified from the activation vector alone, while making feature values of feature regions (in different images) corresponding to the same concept as similar as possible, and making feature values of feature regions corresponding to different concepts as dissimilar as possible, regardless of the correct label.
  • the image classifier described below is an artificial intelligence learning model that can learn the optimal bottleneck “concepts” for the target image classification task in parallel with learning the image classification task itself, based only on the images that are the training data and the labels that indicate the image classes.
  • the model structure (mathematical configuration, parameter configuration) before learning is called the “learning model,” and after the model parameter values are determined by the learning process, it is called the “trained model.”
  • the “trained model” functions as part of a program by being installed on a computer.
  • the "trained model (classifier)" may be recorded as a program or as part of a program on a computer-readable recording medium and installed on a computer other than the one that performed the learning process.
  • Such a “learning model” includes a “(self) attention mechanism” (described later) and makes it possible to identify the areas in which each of the above-mentioned concepts are discovered during the machine learning process. By displaying such “learning images” that share the detected “concept” together, humans can easily understand what each of the learned concepts represents, thereby providing clues for interpreting the classification and judgment processes.
  • the "attention mechanism” has the function of gating the channels of the "feature map” extracted from the "images" of the input learning data, so that a lot of map information that is considered noteworthy passes through, and not much map information that is considered not noteworthy passes through.
  • the following embodiments aim to provide an image classification learning device, an image classification learning method, and an image classification learning program.
  • the "trained model (image classifier)" of the embodiments uses the activation level of each concept as input to characterize and classify images.
  • [Embodiment 1] (Configuration of an image classification learning device that learns concepts)
  • FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to the first embodiment.
  • the image classification learning device 1000 uses as input learning data consisting of multiple pieces of image data and image labels (indicating the class to be classified) associated with each piece of image data, and generates a trained model for image classification.
  • the image dataset that serves as the input learning data is D = {(x_i, y_i) | i = 1, …, N}, where x_i is an image and y_i is an object class in the set Ω associated with x_i.
  • the image classification learning apparatus 1000 learns a set of k concepts using only the labels of the images.
  • the image classification learning device 1000 includes a convolutional neural network (hereinafter, referred to as a CNN backbone) 100 that serves as a backbone for generating a feature map from input image data, a concept learner 200, a concept regularizer 300, a classifier 400, a quantization error calculator 500, a loss calculator 600 that calculates the amount of loss during learning as described below, and a learning process controller 700 that controls the learning process according to the loss calculated by the loss calculator 600.
  • the CNN backbone 100, the concept learner 200, the concept regularizer 300, the classifier 400, the quantization error calculator 500, the loss calculator 600, and the learning process control unit 700 correspond to functions realized by a computing device that operates based on a program; in this program, each can be implemented, for example, as a program module.
  • the concept learner 200, concept regularizer 300, classifier 400, and quantization error calculator 500 can be configured as modules in separate neural networks, with parameters adjusted by the learning process control unit 700 based on the loss calculated by the loss calculator 600.
  • the CNN backbone 100 can also be included in the learning target, resulting in a so-called "end-to-end" configuration, and the configuration of the neural network/artificial intelligence is not limited to this configuration.
  • the CNN backbone 100 extracts, for input image data x, a feature map F ∈ R^(c×h×w).
  • c is the number of channels, or feature maps.
  • the CNN backbone 100 divides the input image into h × w regions, and in each of these regions there is a vector with c elements. This makes F a c × h × w feature map.
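  • as a concrete illustration, the feature-map extraction by the CNN backbone 100 could be sketched in Python/PyTorch as follows; the choice of ResNet-18 and the tensor sizes are illustrative assumptions, not details fixed by this disclosure:

```python
import torch
import torchvision

# CNN backbone 100: any CNN works; ResNet-18 is an assumed example.
backbone = torchvision.models.resnet18(weights=None)
# Drop global pooling and the FC head so that spatial structure is kept.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)   # input image data x (batch of 1)
F = feature_extractor(x)          # feature map F: (batch, c, h, w)
print(F.shape)                    # torch.Size([1, 512, 7, 7])
```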
  • the concept prototype processing unit 2100 learns the concept matrix W according to a procedure described below, and each column vector of the matrix W is referred to in this specification as a "concept prototype" to be learned.
  • the concept learner 200 generates a concept activity t indicating the presence of each concept, and an image feature V from the region where each concept exists in x.
  • the concept activity t is used as an input to the classifier 400, which learns to calculate a score s indicating the classification result of the image class.
  • the concept activity level t ∈ [0, 1]^k, the image feature amount V ∈ R^(k×c), and the score s ∈ R^(|Ω|) are used in the following description.
  • the concept regularization unit 300 receives the concept activity t and the image feature V as input, and in the concept prototype update process, as described below, imposes constraints for the consistency of individual concepts and the mutual distinguishability between concepts, and also performs self-supervised learning.
  • (Concept Learner 200)
  • the concept learner 200 uses a "slot attention" technique based on a self-attention mechanism to learn "concepts" for the image dataset D that can be retroactively associated with features that serve as the basis for recognition in human visual recognition.
  • the position information encoding unit 2002 executes position embedding (position information encoding) processing by adding position embedding information P to the input feature map F in order to retain spatial information: F' = F + P.
  • positional information encoding is disclosed in, for example, the following document: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. Proc. NeurIPS, 2020.
  • the feature map F' with embedded position information is processed in the shaping processing unit 2004 to flatten the spatial dimensions.
  • the similarity calculation unit 2010 calculates the dot-product similarity between a query Q(W), obtained by the nonlinear processing unit 2008 applying nonlinear processing to the concept matrix W representing the concept prototypes (which is successively transformed by the concept prototype processing unit 2100), and a key K(F'), obtained by the nonlinear processing unit 2006 nonlinearly transforming the feature map F'.
  • the concept prototype (concept matrix) W is not particularly limited, but can be configured to be generated and converted by a GRU (Gated Recurrent Unit), which is a neural network model capable of learning time-series data, as described in the following literature, for example.
  • Published literature: Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. SCOUTER: Slot attention-based classifier for explainable image recognition. Proc. ICCV, pages 1046-1055, 2021.
  • the slot matrix is converted by the GRU using U^(t), which is a weighted sum of feature amounts in the spatial dimension, and the slot matrix at the previous timing.
  • the concept matrix W in this embodiment can be configured to be converted to the concept matrix W at the next timing by the GRU using an image feature V to be described later and the concept matrix W at the previous timing.
  • the method of converting the concept matrix W is not limited to this method.
  • the normalization unit 2012 calculates an “attention matrix” A as given by the following equation (1): A = φ(Q(W) K(F')^T) … (1)
  • the function φ is the normalization function.
  • This attention matrix A indicates where in the image the k concepts are located, as shown in Figure 7 below.
  • the normalization function ⁇ determines the spatial distribution of each concept, which depends on the target domain of the classification.
  • images in handwritten digit recognition datasets are typically black and white, and only the shapes formed by the strokes are important. In this case, concepts are unlikely to overlap spatially.
  • natural images have color, texture, and shape, which means concepts may overlap at the same spatial location.
  • for such cases, φ can be designed as φ(·) = σ(·) ⊙ softmax(·), where σ is the sigmoid function and ⊙ denotes the Hadamard (element-wise) product between the output of σ and that of the softmax function.
  • the softmax function is applied to the concepts (i.e., each column vector; hereafter, in this specification, this column vector will be referred to as a "slot vector") so that different concepts are not detected at the same spatial location.
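  • a minimal sketch of the attention computation of equation (1), assuming simple linear maps standing in for the nonlinear processing units 2006 and 2008 and the sigmoid/softmax design of φ described above (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

k, c, h, w = 50, 512, 7, 7              # number of concepts and feature-map size (assumed)
W = torch.randn(k, c)                    # concept matrix W (one slot vector per concept)
feat = torch.randn(c, h, w)              # feature map F from the CNN backbone 100

# Position embedding P (unit 2002) and flattening (unit 2004), simplified:
P = torch.randn(c, h, w)                 # learnable in practice
Fp = (feat + P).reshape(c, h * w).T      # F': (hw, c)

to_q = nn.Linear(c, c)                   # stands in for nonlinear processing unit 2008
to_k = nn.Linear(c, c)                   # stands in for nonlinear processing unit 2006

logits = to_q(W) @ to_k(Fp).T            # dot-product similarity Q(W) K(F')^T: (k, hw)
# Normalization phi: Hadamard product of a sigmoid and a softmax over the
# concept dimension, discouraging different concepts at the same location.
A = torch.sigmoid(logits) * torch.softmax(logits, dim=0)   # attention matrix A
```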
  • the concept occurrence calculation unit 2030 calculates the concept activity vector t by taking the sum of A over the spatial dimension: t_k = Σ_n a_(k,n) … (4), where a_(k,n) is the (k, n) element of the attention matrix A.
  • Each element of the concept activity vector indicates whether or not a corresponding concept has appeared, and each element is called concept activity.
  • the shaping processor 2020 performs shaping processing on the feature map F to flatten the spatial dimensions, yielding the feature map F* ∈ R^(hw×c).
  • the similarity calculation unit 2040 calculates and extracts the image features V from the feature map F* as the attention-weighted average of the image features across the spatial dimensions: v_k = (Σ_n a_(k,n) f*_n) / (Σ_n a_(k,n)) … (5), where f*_n is the n-th row of F* and weighting is by the attention a_k of concept k.
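  • continuing the sketch above (reusing A, feat, c, h, and w), the concept activity t of equation (4) and the attention-weighted image features V of equation (5) might be computed as follows; the small epsilon guard is an added assumption:

```python
t = A.sum(dim=1)                          # equation (4): concept activity t, shape (k,)

F_star = feat.reshape(c, h * w).T         # flattened feature map F*: (hw, c)
eps = 1e-8                                # assumed guard against empty attention
V = (A @ F_star) / (A.sum(dim=1, keepdim=True) + eps)   # equation (5): V, shape (k, c)
```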
  • the concept activity level t mentioned above is an index showing the existence of each concept, and can be expressed as a binary value.
  • the concept learner may not be able to consistently capture meaningful features.
  • the concept regularization unit 300 therefore executes a concept regularization process so that learning of the "concept" progresses.
  • FIG. 2 is a functional block diagram for explaining the configuration of the concept regularization unit 300.
  • Figures 3A to 3C are conceptual diagrams showing the processing concept of the concept regularization unit 300 in Figure 2.
  • the loss calculation unit 3010 ensures the individual consistency of concepts: each learned concept should not have many variations, so that humans can easily interpret it after it has been extracted as a “concept” through the learning process of the concept learner 200.
  • the concept learner 200 performs so-called "mini-batch learning” to randomly select a portion (n pieces) of N pieces of training data and update the parameters.
  • the kth element t k of concept activity t can be used to identify images in a mini-batch that have concept k.
  • the loss calculation unit 3010 calculates the “consistency loss” as follows.
  • the image feature v_k, which is the k-th row vector of the image feature V, contains image features from a region corresponding to concept k if t_k is close to 1.
  • let H_k denote the set of all pairs of image features v_k in the mini-batch where t_k is greater than a threshold ζ that is set empirically in advance.
  • the consistency loss is a loss term used during mini-batch learning to advance learning so that the “image features” of regions belonging to the “same concept” become “more similar,” even across different images.
  • the loss calculation unit 3010 also calculates the following “discriminability loss” as a loss term.
  • the average image feature v̄_k of concept k in a mini-batch is the mean of the image features v_k over the images in the mini-batch that contain concept k.
  • the set M is the set of all pairs of these average image features. Note that concept k is excluded from set M if there are no images with concept k in the mini-batch.
  • the discriminability loss is a loss term used during mini-batch learning to advance learning so that the “average image features” of images belonging to “different concepts” become “more different.”
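  • the exact loss formulas are not reproduced in this text; as one plausible reading of the descriptions above, a sketch of the two loss terms could look like the following, where the pairwise squared-distance form, the exponential similarity penalty, and the threshold ζ handling are all assumptions:

```python
import torch

def consistency_loss(V: torch.Tensor, t: torch.Tensor, zeta: float = 0.9) -> torch.Tensor:
    """V: (n, k, c) per-image concept features; t: (n, k) concept activities.

    Assumed form: average squared distance between features of the same
    concept across image pairs in the mini-batch whose t_k exceeds zeta
    (the pair set H_k); smaller when same-concept features are more similar.
    """
    n, k, c = V.shape
    losses = []
    for j in range(k):
        sel = V[t[:, j] > zeta, j]               # features of concept j over H_k images
        if len(sel) < 2:
            continue
        d = torch.cdist(sel, sel).pow(2)         # all pairwise squared distances
        losses.append(d.sum() / (len(sel) * (len(sel) - 1)))
    return torch.stack(losses).mean() if losses else torch.zeros(())

def discriminability_loss(V: torch.Tensor, t: torch.Tensor, zeta: float = 0.9) -> torch.Tensor:
    """Assumed form: penalize similarity between the mini-batch mean features
    of different concepts (the pair set M); concepts absent from the batch
    are excluded, as described above."""
    means = []
    for j in range(V.shape[1]):
        sel = V[t[:, j] > zeta, j]
        if len(sel) > 0:
            means.append(sel.mean(dim=0))
    if len(means) < 2:
        return torch.zeros(())
    M = torch.stack(means)                        # (k', c) average image features
    sim = torch.exp(-torch.cdist(M, M).pow(2))    # close means -> high penalty
    off_diag = sim - torch.diag(torch.diag(sim))  # ignore self-pairs
    return off_diag.sum() / (len(M) * (len(M) - 1))
```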
  • Non-Patent Document 8 uses an autoencoder structure for self-supervised learning. This is effective, for example, in handwritten digit recognition tasks, where different visual elements (line patterns) are strongly associated with their positions.
  • a cross with horizontal and vertical lines only appears in the number 4, which is typically placed near the center of the image.
  • the concept regularization unit 300 of this embodiment therefore introduces “self-supervised learning” that evaluates a loss based on retrieval of natural images, in addition to the loss based on image reconstruction.
  • the reconstruction-based loss calculator 3020 as shown in FIG. 3B or the search-based loss calculator 3030 as shown in FIG. 3A executes the processes described below selectively or in parallel depending on the type of target of the classification task, for example, by external pre-setting, to calculate the loss term in the learning of the concept learner 200.
  • for simple images such as handwritten digits, the concept activation t contains enough information to reconstruct the original image.
  • the reconstruction-based loss calculation unit 3020 includes a concept decoder D, which receives the concept activity t as input and reconstructs the original image; learning proceeds so that the image x and the output D(t) of the concept decoder D become similar to each other.
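  • a minimal sketch of the reconstruction-based loss, assuming a small transposed-convolution decoder as the concept decoder D; the architecture, MNIST-sized grayscale images, and the mean-squared-error criterion are illustrative assumptions:

```python
import torch
import torch.nn as nn

k = 50                                        # number of concepts (assumed)

# Concept decoder D: maps concept activity t back to an image (assumed architecture).
decoder = nn.Sequential(
    nn.Linear(k, 128 * 7 * 7), nn.ReLU(),
    nn.Unflatten(1, (128, 7, 7)),
    nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),   # 7x7 -> 28x28
    nn.ConvTranspose2d(64, 1, 1),                          # 28x28 grayscale image
)

t = torch.rand(8, k)                          # batch of concept activities
x = torch.rand(8, 1, 28, 28)                  # original images x (MNIST-sized, assumed)
l_rec = nn.functional.mse_loss(decoder(t), x) # reconstruction-based loss: D(t) ~ x
```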
  • for natural images, however, the concept activity t is insufficient to reconstruct the original image x: each element of t corresponds to a concept that may appear at an arbitrary position, and the spatial information required for reconstruction is lost in t.
  • instead of reconstructing the original image, the search-based loss calculator 3030 performs a simple retrieval task of finding images of the same class in the mini-batch B using the concept activity t. For any pair (t, t') computed from images x, x' ∈ B with image labels y and y', respectively, a function J is defined as follows:
  • t and t' should be similar to each other if they have the same class label, since similar sets of visual elements should appear in images x and x'. On the other hand, if they do not have the same class label, t and t' should be different.
  • the search-based loss calculation unit 3030 defines the search-based loss l_ret in the above-described self-supervised learning by the following equation.
  • the search-based loss becomes smaller as the concept activities of different images with the same class label become more similar, and as the concept activities of different images with different class labels become more dissimilar; the class labels thus serve as the teaching signal for this self-supervised learning.
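  • since the equations for J and l_ret are not reproduced in this text, the following sketch assumes J to be a dot-product similarity between concept activities scaled to (0, 1), combined with a binary cross-entropy over same-class/different-class pairs:

```python
import torch

def retrieval_loss(t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """t: (n, k) concept activities of a mini-batch B; y: (n,) class labels.

    Assumed J: dot product of activities divided by the vector dimension,
    which lies in [0, 1] for t in [0, 1]^k; target 1 for same-class pairs.
    """
    n, k = t.shape
    J = (t @ t.T) / k                                # pairwise similarity matrix
    same = (y[:, None] == y[None, :]).float()        # 1 where labels match
    bce = -(same * torch.log(J.clamp_min(1e-8))
            + (1 - same) * torch.log((1 - J).clamp_min(1e-8)))
    mask = 1 - torch.eye(n)                          # ignore self-pairs
    return (bce * mask).sum() / mask.sum()
```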
  • the influence of each concept can be visualized by showing the top images to a human, based on a modified concept activation t′ and the similarity t^T t′ calculated for all images in the training image dataset D.
  • (Loss for the classification performance of the classifier)
  • This simple classifier 400 can be interpreted as determining the co-occurrence of the activation level of each concept with the class to be classified.
  • the total loss L of the image classification learning device 1000 (equation (15)) is defined by combining the above loss terms.
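  • a sketch of such a combination as a weighted sum; the weights λ_qua, λ_con, and λ_dis appear in the hyperparameter discussion of FIG. 13, while the λ_R weight for the reconstruction- or search-based term and all default values below are assumptions:

```python
# Sketch of the total training loss (equation (15)); weights are assumed defaults.
lambda_qua, lambda_con, lambda_dis, lambda_R = 0.1, 1.0, 1.0, 1.0

def total_loss(l_cls, l_qua, l_con, l_dis, l_selfsup):
    # l_cls: classification performance loss of the classifier 400
    # l_qua: quantization error loss pushing t toward binary values (equation (6))
    # l_con / l_dis: consistency and discriminability losses
    # l_selfsup: reconstruction-based or search-based loss, chosen by task type
    return (l_cls
            + lambda_qua * l_qua
            + lambda_con * l_con
            + lambda_dis * l_dis
            + lambda_R * l_selfsup)
```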
  • the learning process control unit 700 controls the learning process in accordance with the loss calculated by the loss calculation unit 600 .
  • FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000 shown in FIG. 1.
  • the image classification learning device 1000 may be configured so that a computing device (CPU: Central Processing Unit) within its own housing performs the computational processing, or the program processing itself may be executed on a server. In the following, it will be described as if a computing device within its own housing performs the computational processing.
  • the image classification learning device 1000 includes a computer device 6010, a network communication unit 6300 for communicating with a network, a camera 6400 for providing captured image data to the computer device 6010 as necessary, and a recording medium (e.g., a memory card) 6210 for recording the captured image data and providing it to the computer device 6010.
  • the recording medium 6210 may be a USB memory, a memory card, or an external storage device.
  • the network communication unit 6300 may be a wired LAN router or a wireless LAN access point.
  • the image data may be provided to the computer device 6010 via the network communication unit 6300.
  • the computer main body constituting this computer device 6010 includes, in addition to a disk drive 6030 and a memory drive 6020, a CPU (Central Processing Unit) 6040, each connected to a bus 6050, memory including a ROM (Read Only Memory) 6060 and a RAM (Random Access Memory) 6070, a non-volatile rewritable storage device such as an SSD (Solid State Drive) 6080, and an input/output interface 6090 for communicating over a network and sending and receiving data with the outside world.
  • An optical disk can be attached to the disk drive 6030.
  • a memory card 6210 can be attached to the memory drive 6020.
  • the RAM 6070 also functions as a working memory when the CPU 6040 performs calculations, and data and parameters during calculations are stored or read out as needed, and the CPU 6040 executes the calculations.
  • the non-transient recording medium from which the computer can read information such as a program to be installed in the computer main unit may be, for example, a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory.
  • the computer main unit 6200 is provided with a drive device (memory drive 6020, disk drive 6030) capable of reading these media.
  • the main components of the computer device 6010 are computer hardware and software executed by the CPU 6040.
  • such software is stored in a computer-readable non-transitory storage medium and distributed or circulated via a network, and is obtained via the disk drive 6030 or the network communication unit 6300 and temporarily stored in the SSD 6080. It is then read from the SSD 6080 into the RAM 6070 and executed by the CPU 6040. Note that when connected to a network, the software may be directly loaded into the RAM 6070 and executed without being stored in the SSD 6080.
  • a program for causing the computer device 6010 to function as described below does not necessarily need to include the operating system (OS) that causes the computer main body to execute the functions of an information processing device.
  • the program only needs to include instructions that call appropriate functions (modules) in a controlled manner to obtain the desired results. How the computer system 6010 operates is well known, and a detailed explanation will be omitted.
  • the CPU 6040 may be a single-core processor or a multi-core processor.
  • FIG. 5 is a flowchart for explaining the learning process of the image classification learning device 1000 shown in FIG. 1.
  • the learning image data selected for mini-batch processing is input (S100), and the CNN backbone 100 extracts a feature map (S102).
  • the location information encoding processing unit 2002 encodes the location information of the feature map (S104), and the shaping processing unit 2020 performs flattening processing of the feature map (S106).
  • the shaping processor 2004 performs flattening processing on the feature map F' with the encoded positional information (S108), and the nonlinear processor 2008 performs nonlinear processing on the concept matrix W output from the concept prototype processor 2100 to generate a query Q(W) (S110).
  • the nonlinear processing unit 2006 generates a key K(F') by nonlinear processing of the feature map F' (S112)
  • the similarity calculation unit 2010 calculates the dot product between the query Q(W) and the key K(F')
  • the normalization unit 2012 normalizes the dot product to generate an attention matrix A (S114).
  • the concept occurrence calculation unit 2030 calculates the concept activity t and inputs it to the classifier 400 (S116). Meanwhile, the similarity calculation unit 2040 generates image features V as shown in equation (5) from the dot product of the attention matrix A and the flattened feature map.
  • the concept regularizer 300 calculates the "consistency loss,” “discriminability loss,” “reconstruction-based loss,” and “search-based loss” from the concept activity t and the image feature V, while the quantization error calculator 500 calculates the quantization error loss of the concept activity t according to equation (6).
  • the loss calculator 600 calculates the "classification performance loss” as equation (14), and calculates the total loss L during the learning process of the concept learner 200 using equation (15) (S120).
  • the learning process control unit 700 updates the parameters of the concept learner 200, which is composed of a neural network, based on the total loss L, for example, by the gradient descent method. Note that the method of updating the model parameters is not limited to this method.
  • when the learning process control unit 700 determines that the learning process using mini-batch processing satisfies the specified end conditions, it ends the learning process; otherwise, it returns the process to step S100.
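  • the flow of steps S100 to S120 can be condensed into a conventional mini-batch loop such as the following sketch; the model interface (compute_total_loss is a hypothetical method bundling S102 to S120) and the use of SGD are assumptions standing in for the gradient-descent update described above:

```python
import torch

# model is assumed to bundle the CNN backbone 100, concept learner 200,
# concept regularizer 300, classifier 400, and loss calculation (S102-S120).
def train(model, loader, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)     # gradient descent (assumed)
    for _ in range(epochs):
        for images, labels in loader:                    # mini-batch input (S100)
            loss = model.compute_total_loss(images, labels)  # total loss L (S120)
            opt.zero_grad()
            loss.backward()
            opt.step()   # parameter update by the learning process control unit 700
```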
  • the process of FIG. 5 can be executed by the CPU 6040 in the hardware configuration shown in FIG. 4, for example, using a computer program stored in the non-volatile storage device (SSD) 6080.
  • Each process can be distributed, or a cloud-type configuration in which the processes are executed by a server device can be used.
  • FIG. 6 is a functional block diagram for explaining the configuration of an image classification device 4000 including a classifier 400 generated by learning in the image classification learning device 1000 shown in FIG. 1 when performing classification processing on a new image.
  • when the classification process is performed after the learning process is completed, the concept matrix W in the concept prototype processing unit 2100 is fixed to the one at the end of learning and is not updated; in FIG. 6 it is therefore depicted as the concept prototype storage unit 4100.
  • the learned parameters are stored in memory.
  • the parameters of the other components in Figure 6, including the classifier 400, are also fixed at the time when learning is completed.
  • FIG. 7 is a conceptual diagram for explaining the processing performed by the classifier 400 in FIG. 6.
  • the concept occurrence calculation unit 2030 calculates the concept activity t for each concept. This is called a "concept bottleneck" in the sense that the diversity of the original image is consolidated into a small number of features.
  • in the classifier 400, the pattern of co-occurrence of concepts is learned for each label of the training images.
  • when a concept bottleneck for a new image is input to the classifier 400, the similarity with the learned patterns of co-occurrence of concepts is calculated, and the label with the highest similarity among the similarities for each label is output as the classification result.
  • FIG. 7 illustrates the case where natural images of birds are learned.
  • concepts such as "yellow head” and "black body” are generated for birds with label 1, and in the process of calculating the classification results, the degree of co-occurrence of each of these concepts with concepts in the target images is determined.
  • the image classification learning device 1000 learns concepts such that, for example, an image of a “black bird” with a “yellow head” can be interpreted as a “yellow head” and a “body with black feathers.”
  • the classifier 400 is a single fully connected (FC) layer that encodes the co-occurrence of each concept with each class.
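  • because the classifier 400 is a single fully connected layer over the concept activity t, its skeleton is essentially one line; the bias-free form and the dimensions below are assumptions:

```python
import torch.nn as nn

k, num_classes = 50, 200                     # e.g. CUB200 (values assumed)
# Classifier 400: one FC layer whose weight matrix can be read as the
# co-occurrence of each concept with each class.
classifier = nn.Linear(k, num_classes, bias=False)
# score s = classifier(t); the predicted label is s.argmax().
```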
  • the image classification learning device 1000 learns the "classifier” and the “bottleneck concept” simultaneously.
  • concepts are constrained to be individually consistent (i.e., single concepts occupy a smaller volume in feature space) and mutually distinctive (i.e., pairs of concepts do not occupy the same region in feature space, or are separated in such a way that they are less likely to do so).
  • the image classification learning device 1000 can also simultaneously learn classifiers and concepts in an end-to-end manner.
  • the configuration is not necessarily limited to an end-to-end configuration; for example, at least a portion of the CNN backbone 100 may be fixed and only other portions may be trained.
  • the image classification learning device 1000 has a configuration that can essentially explain the classification process to humans in two ways.
  • the classifier provides a prototype of the target class based on the concept.
  • the image classification learning device 1000 is more likely to fail when learning with a large number of classes.
  • the number of concepts k is set to 20 for MNIST and 50 for others by default.
  • Figures 8A and 8B show the classification performance of classifier 400 for CUB200 and ImageNet.
  • BotCL refers to the classifier 400.
  • classifier 400 shows a performance degradation of approximately 3 points.
  • Figure 8B also shows the change in performance versus the number of classes for CUB200 (number of classes: 20-200) and ImageNet (number of classes: 50-300).
  • since the classifier 400 has a single fully connected layer configuration that uses the concept activity t as input, it can be said that there is almost no degradation in classification performance compared to the baseline model. In other words, the “concepts” corresponding to the concept activity t sufficiently express the characteristics used for image classification.
  • (Interpretability)
  • (Validity of detected concepts)
  • the image classification learning device 1000 calculates the concept activity t, which indicates the presence of each concept.
  • this concept activity t corresponds to the sum over the spatial dimension of the attention a_k corresponding to concept k. By visualizing this a_k, humans can qualitatively confirm the presence or absence of a concept.
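  • visualizing a_k as described can be done by upsampling the k-th row of the attention matrix to the image size and blending it with the image; the normalization and blending below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F_nn

def attention_overlay(image: torch.Tensor, A: torch.Tensor, k: int,
                      h: int, w: int, alpha: float = 0.5) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; A: (num_concepts, h*w) attention matrix.

    Returns the image brightened where concept k attends (assumed rendering).
    """
    a_k = A[k].reshape(1, 1, h, w)
    heat = F_nn.interpolate(a_k, size=image.shape[1:], mode="bilinear",
                            align_corners=False)[0]      # upsampled a_k: (1, H, W)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return (image * (1 - alpha) + heat * alpha).clamp(0, 1)
```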
  • Figures 9A to 9C are diagrams for explaining the validity of concepts expressed by concept activity t.
  • Figure 9A shows the original images of the five most frequently activated concepts (i.e., t_k > 0.5) selected in MNIST, with a_k superimposed on them.
  • the area of interest as a concept is displayed brighter by superimposing.
  • the obvious difference between the two is the activation of concept 2 (Cpt.2), and the attention point is the lower vertical line.
  • Figure 9B shows the top five most frequently activated concepts
  • Figure 9C shows the images reconstructed by each concept.
  • the image classification learning device 1000 is designed so that the learned concepts are individually consistent and distinctive from each other.
  • Figure 10 shows the attention levels of the five most important concepts (based on "importance” described below) for the input image of a black bird with a yellow head.
  • Figure 11 is a diagram to explain the concepts expressed by concept activity t for natural images.
  • Figure 11 shows a similar overlay display to that shown in Figure 9B.
  • the classifier 400 consists of one fully connected layer and can be interpreted as learning the co-occurrence of concepts.
  • Figure 12 shows the importance of each concept in the CUB200 dataset.
  • disabling concept 1 results in more images of black-headed birds appearing in the search results.
  • the search task is more robust because the output (highly similar samples) is determined by multiple concepts, and changing one concept does not significantly affect the overall similarity.
  • Figure 12 shows the percentage of samples in the ground truth class among the retrieved samples, allowing us to measure the importance of each concept in this search task.
  • Figure 13 shows the magnitude of each hyperparameter versus the accuracy rate (Accuracy: circles), individual consistency (squares: higher is better), and mutual distinctiveness (triangles: lower is better).
  • the parameter λ_qua controls how close the concept activation t is to a binary value. An appropriate value can regularize the activation and prevent some ambiguous concepts. However, setting an extreme value can cause the gradient to vanish, which can lead to poor learning. The default λ_qua of about 0.1 was the optimal value within the scope of this experiment.
  • (The impact of λ_con and λ_dis)
  • the image classification learning device 1000 of this embodiment can learn the features used in classification as "concepts" that humans can understand through learning on classification tasks.
  • the image classification learning device 1000 can provide not only the learned concepts but also the interpretability of its judgments.
  • [Embodiment 2]
  • the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
  • FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of embodiment 2.
  • The classification of the soundness of concrete structures can also be configured so that the learning process and the classification process are performed within a single integrated computer device.
  • The learning process is executed using the image data and a discrimination index associated with the image data (in this case, the soundness level serving as correct-answer data for the image data), and a trained model of the artificial intelligence is generated.
  • Image data capturing the surface of a concrete structure is sent to server 1010.
  • In server 1010, a classification process is performed using the trained artificial intelligence model, and a soundness level (for example, soundness level III) is output.
  • The area corresponding to the "concept" used in the classification process in server 1010 is displayed with a frame or the like so that it can be visually recognized by humans.
  • A skilled engineer can view an image classified in this way and understand not only the artificial intelligence's classification result but also which areas of the image attention was focused on in determining the soundness. The engineer can then decide how to respond based on the areas attended to during the classification process.
  • The soundness of concrete structures can be classified into four levels, from soundness level I to soundness level IV, as set out in the Ministry of Land, Infrastructure, Transport and Tourism's "Guidelines for Periodic Bridge Inspection" (Non-Patent Document 10).
  • Taking roads, bridges, and the like as examples, the soundness levels in Non-Patent Document 10 are classified as follows.
  • Soundness level I (Sound): The road bridge's functionality is not impaired.
  • Soundness level II (Preventive maintenance stage): The road bridge's functionality is not impaired, but it is desirable to take preventive maintenance measures.
  • Soundness level III (Early action stage): The road bridge's functionality may be impaired, so measures should be taken at an early stage.
  • Soundness level IV (Emergency action stage): The road bridge's functionality is impaired, or there is a very high possibility that it will be, so measures must be taken urgently.
  • For example, in response to the soundness level being III, the expert engineer makes a decision such as "this part is a crack, so let's repair it by injecting resin."
  • FIG. 15 is a conceptual diagram showing the configuration of training data for generating a trained artificial intelligence model like that shown in FIG. 14.
  • Image data labeled with soundness levels corresponding to levels I to IV is prepared.
  • Figure 15 shows soundness level III as an example; similar images are assumed to be prepared for the other soundness levels.
  • Figure 16 shows an example of a system configuration for determining the soundness of concrete.
  • An image of the surface of the concrete structure is captured by the inspector terminal 500.1 and transmitted to the server 1010.
  • The inspector terminal 500.1 transmits to the server 1010, together with the image data, for example, the structure's location information (e.g., latitude and longitude information obtained by a positioning means) and the structure-name data entered by the inspector.
  • The server 1010 returns to the inspector terminal 500.1 information on the soundness assessment result and information indicating the areas attended to in the image sent from the inspector terminal 500.1 (data marked with a frame or the like).
  • The trained artificial intelligence model used for the classification and judgment processing in server 1010 can be trained, for example, in another server 1020 and then transmitted to server 1010, where it is stored and operated.
  • Server 1020, which is in charge of the learning process, collects data such as "position data," "image data," and "structure name" for other concrete structures from multiple terminals 500.2 to 500.n (n: natural number). If, for example, each of the terminals 500.2 to 500.n is operated by a skilled engineer, the system can be configured so that, when an image is captured on a terminal, the soundness level is associated with the image data as correct-answer data and transmitted to server 1020. Learning data such as that described in FIG. 15 can be generated from the data collected in this way.
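Purely for illustration, one such collected record might look like the following; every field name here is a hypothetical choice, since the patent does not fix a concrete schema:

```python
# Hypothetical upload record assembled on a terminal; all field names
# are illustrative assumptions, not the patent's specification.
record = {
    "structure_name": "Example Bridge, Pier 3",      # entered by the inspector
    "position": {"lat": 34.7025, "lon": 135.4959},   # from the positioning means
    "image": "surface_0001.jpg",                     # captured surface image
    "soundness_label": "III",                        # correct-answer data, I..IV
}
```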
  • Alternatively, a specialized engineer on the server side may perform the process of associating the soundness level with the image data as correct-answer data.
  • Server 1020 can use the accumulated learning data to retrain the trained artificial intelligence model and improve its classification performance.
  • FIG. 17 is a functional block diagram showing the configuration of terminal 500.1 shown in FIG. 16.
  • Terminals 500.2 to 500.n have a similar configuration, so their explanation is not repeated.
  • The terminal 500.1 of this embodiment includes a control unit 5010 for controlling the communication and input/output operations of the terminal; a communication processing unit 5040 for generating baseband signals for wireless LAN and mobile communication, sending them to a modulation/demodulation circuit or device, and obtaining the original data or signals from received baseband signals; an imaging sensor 5050 for capturing still images or videos; an image acquisition unit 5060 for converting signals from the imaging sensor 5050 into electrical signals in a predetermined format; a display control unit 5070 for controlling image display on the terminal side; a display unit 5080 for displaying images under the control of the display control unit 5070; a position acquisition unit 5090 for measuring and acquiring the position of the terminal 500.1; and an input interface unit 5100 for receiving input of information from the outside.
  • The imaging sensor 5050 may be a module combining a lens with a CCD (Charge-Coupled Device) sensor or a module combining a lens with a CMOS sensor.
  • The position acquisition unit 5090 may be a positioning device that uses GPS (Global Positioning System) as an outdoor positioning means, a positioning device that also uses signals from quasi-zenith satellites, a device that enables indoor positioning using beacon signals, or any other device capable of acquiring the location information of the terminal 500.1.
  • The input interface unit 5100 converts external input into text data, for example via a touch panel or via voice recognition applied to voice input.
  • The control unit 5010 includes an acquired-image transmission processing unit 5020, which integrates information such as the image data from the image acquisition unit 5060, the position data from the position acquisition unit 5090, and the structure-name data from the input interface unit 5100, and transmits the integrated information from the communication processing unit 5040 to the server 1010; and a judgment table generation unit 5030, which generates, from the structure name, the soundness data, and the data indicating the image and the attended areas received from the server 1010 via the communication processing unit 5040, image data representing a judgment table to be displayed on the display unit 5080 by the display control unit 5070.
  • FIG. 18 is a block diagram for explaining the hardware configuration of terminal 500.1 shown in FIG. 17.
  • The control unit 5010 is equipped with a computing device 501 such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit) and a storage device consisting of a RAM 502, a ROM 503, and the like; by executing a predetermined basic OS, middleware, and so on, it controls each unit and provides the native platform environment and the application execution environment of the software configuration.
  • As the imaging device 505, a camera module as described above is used; as the positioning device 509, a GPS or other positioning device as described above is used.
  • The display device 508 may be a liquid crystal panel or an organic EL panel, and the operation device 510 may be a touch panel integrated with the display panel or a voice recognition device.
  • The storage device of the control unit 5010 includes, for example, semiconductor memory such as RAM as a temporary storage device and flash memory as a non-volatile storage device.
  • This non-volatile storage device stores driver programs, operating system programs, application programs, data, etc. used for processing in each unit.
  • The non-volatile storage device stores driver programs such as a communication driver program that executes a wireless communication method conforming to the IEEE 802.11 standard or a wireless communication method for mobile communication (cellular communication), an input device driver program that controls the operation device 510, and an output device driver program that controls the display device 508.
  • The non-volatile storage device also stores operating system programs, such as basic OSs like Android (registered trademark) OS and iOS (registered trademark), and connection control programs that perform authentication for wireless communication methods such as the IEEE 802.11 standard and mobile (cellular) communication.
  • The communication interface 504 has the functionality to perform wireless LAN communication and mobile communication via a base station (not shown) of a cellular mobile communication network.
  • With the configuration above, information for judging the soundness of concrete can be obtained on the inspector's terminal based on the input surface-image data of a concrete structure, and the image areas used for classification by the trained artificial intelligence model can be confirmed.
  • the "trained model (classifier)" may be recorded as a program or as part of a program on a computer-readable recording medium and installed on another computer.
  • An image classification learning device 1000 as described in FIG. 1 executes the learning process using learning data such as that shown in FIG. 15; as a result, not only can the soundness of a concrete structure be classified, but a human can also determine which characteristic parts (feature areas) the artificial intelligence focused on in judging that soundness.
[Embodiment 3]
  • Next, a configuration of the image classification learning device 1000 and the classification device 4000 is described in which the image classification learning device 1000 executes a learning process so that, with image data as input, it not only classifies the soundness of the corresponding concrete structure but also outputs a method of dealing with that concrete structure.
  • Figure 19 is a diagram for explaining the composition of learning data that includes image data, soundness level labels corresponding to the images, and corrective-action labels.
  • Image data of concrete structures and the corresponding information on their soundness can be collected, and data on how to deal with the concrete structures represented by the image data can also be accumulated.
  • FIG. 20 is a functional block diagram for explaining the configuration of the image classification learning device 1000 and classification device 4000 according to the third embodiment.
  • The concept activity t is input not only to the classifier 400 but also to an action discriminator 410; furthermore, a response-action label indicating a countermeasure, as shown in FIG. 19, is also input to the action discriminator 410 to execute the learning process.
  • The learning processing control unit 700 executes the learning process so that the response actions output from the action discriminator 410 match the teacher data.
  • The action discriminator 410 trained in this way outputs a response action.
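As a hedged sketch of such a two-head arrangement (the layer sizes, the use of PyTorch, and the simple summed cross-entropy objective are assumptions for illustration, not the patent's implementation):

```python
import torch
import torch.nn as nn

K, N_SOUNDNESS, N_ACTIONS = 20, 4, 5   # hypothetical sizes

class TwoHeadModel(nn.Module):
    """Classifier 400 and action discriminator 410 sharing the concept activity t."""
    def __init__(self) -> None:
        super().__init__()
        self.classifier = nn.Linear(K, N_SOUNDNESS)   # soundness levels I..IV
        self.action_head = nn.Linear(K, N_ACTIONS)    # response actions

    def forward(self, t: torch.Tensor):
        return self.classifier(t), self.action_head(t)

model = TwoHeadModel()
ce = nn.CrossEntropyLoss()
t = torch.rand(8, K)                                  # batch of concept activities
soundness_y = torch.randint(0, N_SOUNDNESS, (8,))     # soundness teacher data
action_y = torch.randint(0, N_ACTIONS, (8,))          # response-action teacher data
logits_s, logits_a = model(t)
loss = ce(logits_s, soundness_y) + ce(logits_a, action_y)  # joint training signal
loss.backward()
```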
  • According to the image classification learning device 1000 and the classification device 4000 of Embodiment 3, or according to the learning program and classification program of Embodiment 3, information for assessing the soundness of concrete can be obtained from the input surface image data of a concrete structure, and a human, in particular a skilled engineer, can confirm both the image areas used for classification by the trained artificial intelligence model and the information on countermeasures.


Abstract

The present invention provides an image classification learning device that makes it possible to learn the "concepts" that a post-learning model uses to make a determination. A concept learner 200 performs machine learning of a plurality of concepts in image data on the basis of learning data including the image data and an image label. A concept prototype processing unit 2100 is an attention mechanism that, in a concept matrix comprising slot vectors, converts the slot vectors according to the image features defined in the slot vectors, the slot vectors respectively corresponding to the plurality of concepts and defining the image regions in which the feature quantities emphasized in the identification processing by an image identification means appear. A learning processing control unit 700 controls the learning processing so as to decrease a loss calculated on the basis of an identification loss, which decreases as the identification rate of a classifier 400 increases, and a separation loss, which decreases as the degree of mutual separation, in a feature quantity space, of the feature quantities corresponding to the plurality of concepts increases.

Description

Image classification learning device, image classification learning method, image classification learning program, and image classification trained model
The present invention relates to an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can help humans understand the process of performing image classification.
(The need for explainable data-driven artificial intelligence)
In recent years, deep learning technology based on neural networks has brought about major breakthroughs in many fields, including image recognition, speech recognition, and natural language processing.
However, one problem that has been raised is that, with data-driven artificial intelligence such as neural networks, the basis for the "judgments" made by models generated through machine learning is difficult for humans to interpret. This is known as the "black box problem," and it is one of the factors that makes it difficult to apply artificial intelligence technology practically in society.
In other words, if it cannot be explained on what basis an artificial intelligence model made a given prediction or judgment, users of services or applications that have adopted, or plan to adopt, the model will feel uneasy, particularly as applications spread to fields where artificial intelligence technology has not previously been used explicitly.
This is especially serious in high-risk areas, namely those involving users' health, safety, or personal information.
To address these issues, a technology called "explainable artificial intelligence (XAI: eXplainable Artificial Intelligence)" is being actively researched.
That is, understanding the behavior of neural networks is a major challenge, especially for medical applications (see Non-Patent Document 1) and for identifying biases in neural networks (see Non-Patent Document 2). For this reason, much research effort has been devoted to providing post-hoc explanations of artificial intelligence models after they have been generated by machine learning (see Non-Patent Document 3). This kind of explanation successfully provides a low-level (or pixel-by-pixel) relationship between the image and the model's judgment by highlighting certain regions of the image as a heat map, but how to interpret such relationships remains an unsolved problem.
Therefore, methods have been considered that introduce, in advance, a mechanism that indicates to the artificial intelligence model the points to focus on in the input data. Such points of focus are called "attention" in this field, and the "attention mechanism" was first applied to artificial intelligence translation models for natural language processing, emerging as a scheme that learns, together with the translation function itself, which words in the source text to attend to when outputting each translated word (see Non-Patent Document 4).
This attention mechanism was later also applied to the field of image recognition: in deep learning of neural networks for the task of "object detection," the attention mechanism learns, alongside the object-detection training, which locations in the input image the artificial intelligence model attends to when detecting an object; in other words, which features of the input image are given large weights for detection.
In this case, an attention matrix (attention weights) is generated from the similarity between a query (Q), generated from a generative model of a weight matrix consisting of columns (slots) of weight vectors, and a key (K) representing the image features; this attention matrix is used to extract the regions of the image that are used for object detection. Such a representation of image features for "object detection" is also called an "object-centric representation" (see Non-Patent Document 5).
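As a minimal sketch of this query-key attention computation (the tensor shapes, the scaling by sqrt(D), and the softmax over spatial positions are assumptions for illustration):

```python
import numpy as np

def slot_attention_maps(slots: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Compute attention of K concept slots over N spatial feature vectors.

    slots:    (K, D) -- one query vector per concept slot.
    features: (N, D) -- keys, e.g., a CNN feature map flattened to N = H*W.
    Returns:  (K, N) attention weights; row k highlights where the image
    regions matching slot k appear.
    """
    D = slots.shape[1]
    sim = slots @ features.T / np.sqrt(D)      # scaled dot-product similarity
    sim -= sim.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)    # softmax over spatial positions
```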
Another step forward in interpretability is the "concept-based" framework, inspired by the human ability to learn new concepts by unconsciously finding finer-grained concepts (see Non-Patent Document 6). Instead of providing pixel-wise importance scores as an explanation, this framework presents a higher-level relationship between the target image and the classification decision, mediated by "concepts."
The explanation of such a judgment boils down to finding several feature regions, called "concepts," in the target, and such "concepts" are shared even across the different target classes of a task.
Conventionally, however, such "concepts" have been specified by humans defining a concept set in advance.
For example, a simple way to predefine concepts is to use human knowledge (see Non-Patent Document 7). One method uses a manually created concept set and quantifies the importance of each concept to the judgment using directional derivatives. In addition, the Broden dataset, which unifies several densely labeled image datasets, provides a large concept corpus used to directly and automatically match CNN (Convolutional Neural Network) representations with labeled interpretations (see Non-Patent Document 8).
In this context, SENN (Self-Explaining Neural Networks), proposed by Alvarez-Melis et al., utilizes a concept bottleneck and treats the concept activations as inputs to a regression model (see Non-Patent Document 9).
(Determining the soundness of concrete structures and the like as an application of image classification processing)
Meanwhile, the inspection of infrastructure such as concrete structures and steel-framed structures is being considered as an application of image recognition technology based on the deep learning techniques described above.
Currently, the progressive aging of social infrastructure, including concrete structures, is a recognized social issue. In inspections of such concrete structures, the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
For example, the Ministry of Land, Infrastructure, Transport and Tourism's Guidelines for Periodic Bridge Inspection (see Non-Patent Document 10) state that the damage rank of a concrete wall surface is classified based on the width of the cracks that have occurred, whether the cracks form a lattice pattern, and the occurrence of water leakage and free lime.
Inspection of concrete structures requires close visual inspection by engineers with specialized knowledge and is performed as a comprehensive judgment that takes into account various aspects such as the state and type of deterioration, the location, and the traffic volume. In other words, judging the soundness of concrete structures relies heavily on the know-how (tacit knowledge) of experienced engineers, which cannot be put into a manual.
Therefore, in order to reduce work costs and avoid variability in damage assessment between workers, it is desirable to automate the determination of the damage level using an information processing device. Patent Document 1, for example, discloses a conventional technique for automating damage assessment.
Patent Document 1 discloses a configuration in which cracks are detected as deformed areas using a feature map created with a CNN (Convolutional Neural Network), and the crack width is determined as attribute information of the deformed area.
Patent Document 2 discloses a configuration that uses deep learning to provide a performance evaluation system for concrete structures that makes it possible to efficiently carry out a series of maintenance tasks, from inputting deformations to performance verification. Specifically, a deep learning unit performs machine learning, by artificial intelligence, based on the discrepancies, accumulated for each inspection, between the results automatically calculated by the performance evaluation system and the results corrected by the inspector. The results of this machine learning are then reflected in subsequent judgments and predictions.
Furthermore, Patent Document 3 discloses the following technology.
Specifically, when automatically determining the damage level based on crack width and on whether the cracks form a lattice, as in the Ministry of Land, Infrastructure, Transport and Tourism's bridge inspection guidelines, detecting cracks in an image and estimating their width requires judgment from local-range information in a high-resolution image. On the other hand, determining that cracks form a lattice requires judgment using information over a wide range that includes multiple cracks.
Therefore, the technology disclosed in Patent Document 3 determines the condition of the wide range based on both the local range and the wide range, making it possible to determine the damage level of the concrete wall surface of an infrastructure structure.
[Patent Document 1] JP 2018-198053 A
[Patent Document 2] JP 2019-200120 A
[Patent Document 3] JP 2021-165888 A
Following the idea of using "concepts" as described above, it is expected that, by comparing the behavior of the trained model generated as a result of such learning with human judgment, humans will be able to understand the trained model's decision-making process.
Conventionally, however, learning corresponding to such human "concepts" has been achieved using "learning data" that reflects human knowledge in advance; it has not necessarily been possible, for example, to compare the judgment process of a trained model with the human judgment process on arbitrary natural images.
In addition, concept learning has been guided by a learning process based on an autoencoder structure that reconstructs the original image. It is not yet clear whether such a configuration is also applicable to learning from natural images.
Meanwhile, in the inspection of concrete structures, the shortage of engineers and the enormous time and cost of inspections have become problems.
As described above, the use of artificial intelligence technology to assist or automate inspection work is therefore being considered; however, with the techniques disclosed in Patent Documents 1 and 2, the points of interest in inspecting a concrete structure are still identified by humans. Moreover, the technique disclosed in Patent Document 3 automatically determines the damage level based on crack width and on whether the cracks form a lattice, but it requires high-resolution images and involves complex processes.
In addition, a large amount of learning data is required, which would be created by humans examining the images and judging and recording the damage rank for each region; collecting such learning data in large quantities and creating the training set is itself not easy.
Also, as described above, for image classification as a substitute for the close visual inspection used in soundness judgments, image classification by artificial intelligence has been studied; with conventional techniques, however, the credibility of the judgment results is questioned. For this reason, a judgment system that can visualize the know-how that substitutes for that of engineers is important.
The present invention has been made to solve the problems described above, and its object is to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can learn, through training on a given task, the "concepts" that the trained model uses for its judgments, in such a way that they can be compared with the human judgment process.
Another object of the present invention is to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that make it possible to assist or substitute for the judgment of the soundness of concrete structures by using a trained artificial intelligence model.
In accordance with one aspect of the present invention, an image classification learning device includes a storage device for storing learning data including a plurality of image data and image labels corresponding to the image data, and calculation processing means for reading out the learning data stored in the storage device and executing processing for machine-learning a plurality of concepts in the image data for classifying the image data. The calculation processing means includes: image identification means for extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the model in the storage device; attention mechanism processing means for converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing of the image identification means appear, according to the image features defined by the slot vectors, and storing them in the storage device; loss evaluation means for calculating a loss based on an identification loss, calculated by evaluating the identification rate of the image identification means, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and learning processing means for executing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
Preferably, the attention mechanism processing means includes attention matrix learning means for learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing of the image identification means, and the image identification means includes concept occurrence calculation means for generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a classifier for performing classification of the image labels with the activity vector corresponding to the image data as input.
Preferably, the set of features expressing the image data is a feature map output from a convolutional neural network image recognition model.
Preferably, the separation loss includes a consistency loss, which decreases as a single concept occupies a smaller volume in the feature space, and a discrimination loss, which decreases as pairs of concepts become less likely to occupy the same region in the feature space.
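As a hedged sketch of what such consistency and discrimination terms could look like over concept features (the specific formulas below, variance around a center and mean pairwise cosine similarity, are illustrative assumptions, not the losses defined in the patent):

```python
import numpy as np

def consistency_loss(concept_features: np.ndarray) -> float:
    """Smaller when the feature vectors gathered for ONE concept are tightly packed.

    concept_features: (M, D) features of regions assigned to a single concept.
    """
    center = concept_features.mean(axis=0)
    return float(np.mean(np.sum((concept_features - center) ** 2, axis=1)))

def discrimination_loss(centers: np.ndarray) -> float:
    """Smaller when the K concept centers point in mutually dissimilar directions.

    centers: (K, D) one mean feature vector per concept.
    """
    norm = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = norm @ norm.T                       # pairwise cosine similarities
    K = centers.shape[0]
    off_diag = cos[~np.eye(K, dtype=bool)]    # exclude each concept with itself
    return float(np.mean(off_diag))
```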
Preferably, the image data is data of images of the surfaces of a plurality of concrete structures captured by a camera, and the image labels are labels indicating the soundness of the concrete structure corresponding to each piece of image data.
In accordance with another aspect of the present invention, there is provided an image classification learning method in which a computer machine-learns a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, the computer including a storage device for storing the learning data and a calculation device for executing the machine learning processing. The method includes: a step in which the calculation device extracts a set of features expressing the image data and learns and generates a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step in which the calculation device converts slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step in which the calculation device calculates a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step in which the calculation device trains the classification model and the concept matrix so as to reduce the loss.
Preferably, the image data is data of images of the surfaces of a plurality of concrete structures captured by a camera, and the image labels are labels indicating the soundness of the concrete structure corresponding to each piece of image data.
In accordance with yet another aspect of the present invention, there is provided an image classification learning program for causing a computer to machine-learn a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, the computer including a calculation device and a storage device. For the image data stored in the storage device, the program causes the calculation device to execute: a step of extracting a set of features expressing the image data and learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the classification model and the concept matrix so as to reduce the loss.
Preferably, a computer-readable non-transitory recording medium stores the image classification learning program.
In accordance with yet another aspect of the present invention, there is provided an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data. The image classification trained model has the configuration of a classifier model that takes as input an activity vector whose elements are the degrees to which the respective concepts appear in the image data and classifies the image data based on the co-occurrence relationships of those elements. The image classification trained model is generated by: a step of extracting a set of features expressing the image data and updating, by learning, a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the model and the concept matrix so as to reduce the loss. The step of converting the slot vectors includes a step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing, and the step of updating the classifier model by learning includes a step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of learning the parameters of the classifier model so as to perform classification of the image labels with the activity vector corresponding to the image data as input.
In accordance with yet another aspect of the present invention, an image classification learning device includes a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data, and a calculation device for executing processing for machine-learning a plurality of concepts in the image data for classifying the image data in terms of soundness, based on the learning data stored in the storage device. The calculation device executes: an image identification step of extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the model in the storage device; an attention mechanism processing step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the concepts and define the image regions in which the features emphasized in the classification model's identification processing appear, according to the image features defined by the slot vectors, and storing them in the storage device; a loss evaluation step of calculating a loss based on an identification loss, calculated by evaluating the classification model's identification rate, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a learning processing step of executing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
Preferably, the attention mechanism processing step includes an attention matrix learning step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the classification model's identification processing, and the image identification step includes a concept occurrence calculation step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of generating a classifier that performs classification of the image labels with the activity vector corresponding to the image data as input.
Preferably, the learning processing step includes a step of generating a treatment discrimination model that learns to discriminate treatment labels, using as input the activity vector and the treatment labels of repair countermeasures corresponding to the image data of the surfaces of the concrete structures.
In accordance with yet another aspect of the present invention, there is provided an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying the image data in terms of soundness, based on learning data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures corresponding to the image data. The image classification trained model has the configuration of a classifier model that takes as input an activity vector whose elements are the degrees to which the respective concepts appear in the image data and classifies the image data based on the co-occurrence relationships of those elements. The image classification trained model is generated by: a step of extracting a set of features expressing the image data and updating, by learning, a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the model and the concept matrix so as to reduce the loss. The step of converting the slot vectors includes a step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing, and the step of updating the classifier model by learning includes a step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of learning the parameters of the classifier model so as to perform classification of the image labels with the activity vector corresponding to the image data as input.
According to the image classification learning device, image classification learning method, and image classification learning program of the present invention, humans can understand on the basis of which image feature regions a trained model, generated by artificial intelligence learning classification processing for images, performs its classification.
More specifically, these image feature regions are separated so as to minimize overlap between different classification classes; therefore, even for classification tasks on natural images, displaying the activity of the feature regions in the separation processing makes it possible to visualize them so that they can be compared with the "concepts" humans use for classification.
Furthermore, when the image classification learning device, image classification learning method, and image classification learning program of the present invention are applied to judging the soundness of concrete structures, it becomes possible to make soundness judgments that draw on the judgment know-how accumulated by engineers and experts.
More specifically, when applied to judging the soundness of concrete structures, a trained artificial intelligence model becomes possible that can determine not only the soundness itself but also, depending on the degree of soundness and the characteristics of the judgment, the countermeasures that should be taken.
FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to Embodiment 1.
FIG. 2 is a functional block diagram for explaining the configuration of a concept regularization unit 300.
FIGS. 3A to 3C are conceptual diagrams showing the concept of the processing of the concept regularization unit 300.
FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000.
FIG. 5 is a flowchart for explaining the learning processing of the image classification learning device 1000.
FIG. 6 is a functional block diagram for explaining the configuration used when an image classification device 4000 executes classification processing for a new image.
FIG. 7 is a conceptual diagram for explaining the processing executed by the classifier 400.
FIGS. 8A and 8B are diagrams showing the classification performance of the classifier 400 for CUB200 and ImageNet.
FIGS. 9A to 9C are diagrams for explaining the validity of the concepts expressed by the concept activity t.
FIG. 10 is a diagram showing the attention levels of the five most important concepts (based on the "importance" described later) for an input image of a black bird with a yellow head.
FIG. 11 is a diagram for explaining the concepts expressed by the concept activity t for natural images.
FIG. 12 is a diagram showing the importance of each concept in the CUB200 dataset.
FIG. 13 is a diagram showing the magnitude of each hyperparameter and the resulting accuracy, consistency, and distinctiveness.
FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of Embodiment 2.
FIG. 15 is a conceptual diagram showing the configuration of learning data for generating a trained artificial intelligence model such as that shown in FIG. 14.
FIG. 16 is a diagram showing an example of a system configuration for determining the soundness of concrete.
FIG. 17 is a functional block diagram showing the configuration of a terminal 500.1.
FIG. 18 is a block diagram for explaining the hardware configuration of the terminal 500.1.
FIG. 19 is a diagram for explaining the configuration of learning data consisting of image data, soundness labels corresponding to the images, and corrective-action labels.
FIG. 20 is a functional block diagram for explaining the configurations of the image classification learning device 1000 and the classification device 4000 of Embodiment 3.
 The configuration of an image classification learning device and an image classification learning method according to embodiments of the present invention will be described below. In the following embodiments, components and processing steps denoted by the same reference numerals are the same or equivalent, and their description will not be repeated unless necessary.

 In the following, the image classification learning device of the present invention will be described as a computer program that is installed on a standalone computer device and executes the image classification learning method.

 However, the processing of the image classification learning device may be distributed among multiple computer devices, and the number of arithmetic devices executing the computer processing may be one or more. Furthermore, the image classification learning device is not limited to a program installed on such a computer device; in general, it may be realized as an arithmetic processing unit such as a microcomputer combining an arithmetic device and a storage device, or it may be implemented in a dedicated IC circuit, an FPGA (Field-Programmable Gate Array), or other electronic circuitry.

[Embodiment 1]
(Concept-based image classification)
 Below, a configuration is described in which an image classifier using a neural network extracts from an image "regions of interest in image features," called "concepts," and classifies the image by using the activation of these "concepts" in the input image as the image representation.

 In this specification, the term "concept" refers to a feature region within an "image" of the learning dataset to which the classifier directs its "attention" when classifying during machine learning of a neural network image classifier, and which is separated from other feature regions to a degree that satisfies a predetermined condition. The method of "classification based on concepts" is also called "concept-based classification."

 Here, the "predetermined condition" refers to a condition that enables the trained model to learn concepts such that, independently of the correct labels, the feature values of feature regions (in different images) corresponding to the same concept become as similar as possible, the feature values of feature regions corresponding to different concepts become as dissimilar as possible, and the original image can be reconstructed or identified from the activation vector alone.
 The image classifier described below is an artificial intelligence learning model that, based only on images serving as learning data and on labels indicating the classes of those images, can learn, in parallel with learning the image classification task, the "concepts" that form the optimal bottleneck for the target classification task. In this specification, the model structure before learning (its mathematical configuration and parameter configuration) is called the "learning model," and after the values of the model parameters have been determined by the learning processing, it is called the "trained model." The "trained model" functions as part of a program by being installed on a computer. Although not limited to this, the "trained model (classifier)" may be recorded, as a program or as part of a program, on a computer-readable recording medium and installed on a computer other than the one that performed the learning processing.

 Such a "learning model" includes a "(self-)attention mechanism" described later, and makes it possible to identify, during the machine learning process, the regions in which each of the above-mentioned concepts is discovered. By displaying together the "learning images" that share a detected "concept," a human can easily understand what each learned concept represents, which in turn provides clues for interpreting the classification and judgment processes.

 Here, the "attention mechanism" has the function of gating the channels of the "feature map" extracted from the "images" of the input learning data, so that much of the information of maps considered worthy of attention passes through, while little of the information of maps considered unworthy of attention passes through.

 In particular, when the configuration of a "self-attention mechanism" is used, the query (Q), key (K), and value (V) used in learning the regions to "attend" to are all generated from the same input data. In this embodiment, however, the method of realizing the "attention mechanism" is not limited to such a "self-attention mechanism."
 As described below, the following embodiments aim to provide an image classification learning device, an image classification learning method, and an image classification learning program. The "trained model (image classifier)" of the embodiments takes the activation level of each concept as input to characterize and classify images.

[Embodiment 1]
(Configuration of an image classification learning device that learns concepts)
 FIG. 1 is a functional block diagram for explaining the configuration of the image classification learning device 1000 according to the first embodiment.

 As described below, the image classification learning device 1000 takes as input learning data consisting of a plurality of pieces of image data and image labels (indicating the classes to be classified) associated with the respective pieces of image data, and generates a trained model for image classification.
 In this case, the image dataset serving as the input learning data is given as follows:

D = \{ (x_i, y_i) \}_{i=1}^{N}
 Here, x_i is an image and y_i is the target class, in the set Ω, associated with x_i. The image classification learning device 1000 learns a set of k concepts using only the labels of the images.

 Referring to FIG. 1, the image classification learning device 1000 includes a convolutional neural network (hereinafter referred to as a CNN backbone) 100 serving as a backbone that generates a feature map from input image data, a concept learner 200, a concept regularization unit 300, a classifier 400, a quantization error calculation unit 500, a loss calculation unit 600 that calculates the amount of loss during learning as described later, and a learning process control unit 700 that controls the learning processing according to the loss calculated by the loss calculation unit 600.

 As described later, the CNN backbone 100, the concept learner 200, the concept regularization unit 300, the classifier 400, the quantization error calculation unit 500, the loss calculation unit 600, and the learning process control unit 700 correspond to functions realized by an arithmetic device operating based on a program; in this program, each of them can be implemented, for example, as a program module.

 Although not limited to this, the concept learner 200, the concept regularization unit 300, the classifier 400, and the quantization error calculation unit 500 can each be configured as a module of an individual neural network, whose parameters are adjusted by the learning process control unit 700 based on the loss calculated by the loss calculation unit 600. However, the CNN backbone 100 may, for example, also be included among the learning targets to form a so-called "end-to-end" configuration, and the configuration of the neural network or artificial intelligence is not limited to such configurations.
 For input image data x, the CNN backbone 100 extracts a feature map F expressed as follows:

F \in \mathbb{R}^{c \times h \times w}
 Here, c is the number of channels, that is, the number of feature maps. In other words, the CNN backbone 100 divides the input image into h × w regions, and in each of these regions there is a vector with c elements. F is thus a c × h × w feature map.

 The feature map F is then input to the concept learner 200. Here, in FIG. 1, a concept prototype processing unit 2100 learns a concept matrix W according to a procedure described later, and each column vector of the matrix W is referred to in this specification as a learned "concept prototype."

 The concept learner 200 generates a concept activation t indicating the presence of each concept, and image features V from the regions where the respective concepts exist in x. The concept activation t is used as the input of the classifier 400, and the classifier 400 learns to calculate a score s indicating the classification result for the image classes.
 The concept activation t, the image features V, and the score s are as follows:

t \in \mathbb{R}^{k}, \quad V \in \mathbb{R}^{k \times c}, \quad s \in \mathbb{R}^{|\Omega|}
 Here, |Ω| denotes the number of elements of the set Ω.

 The concept regularization unit 300 takes the concept activation t and the image features V as input and, in the processing for updating the concept prototypes, imposes restrictions for the consistency of individual concepts and the mutual discriminability between concepts, as described later, and also performs supervised self-learning.

(Concept Learner 200)
 Based on a self-attention mechanism, the concept learner 200 uses the technique of "slot attention" to learn, for the image dataset D, "concepts" that can afterwards be associated with the features that serve as the basis of recognition in human visual perception.
 In the concept learner 200, a position information encoding processing unit 2002 performs position embedding (position information encoding) on the input feature map F by adding position embedding information P to the feature map F in order to retain spatial information, as follows:

F' = F + P, \quad P \in \mathbb{R}^{c \times h \times w}
 "Position information encoding" is disclosed, for example, in the following document:
 Known document: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-Centric Learning with Slot Attention. In Proc. NeurIPS, 2020.
 The feature map F' with the embedded position information is processed by a shaping processing unit 2004 to flatten its spatial dimensions.

 As the self-attention mechanism, a similarity calculation unit 2010 calculates the dot-product similarity between a query Q(W), obtained by a nonlinear processing unit 2008 applying a nonlinear transformation to the concept matrix W representing the concept prototypes successively updated by the concept prototype processing unit 2100, and a key K(F'), obtained by a nonlinear processing unit 2006 applying a nonlinear transformation to the feature map F'.
 The concept prototypes (concept matrix) W may, for example, though without limitation, be generated and updated by a GRU (Gated Recurrent Unit), a neural network model capable of learning time-series data, as described in the following document:
 Known document: Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition. In Proc. ICCV, pages 1046-1055, 2021.
 In the above document, in order to adapt a slot matrix consisting of weight vectors (slot vectors) (corresponding to the concept matrix W of this embodiment) to the input image, the slot matrix is updated by a GRU using U^(t), a weighted sum of the features over the spatial dimensions, together with the slot matrix at the previous step. In contrast, the concept matrix W of this embodiment can be configured to be converted by a GRU into the concept matrix W at the next step using the image features V described later and the concept matrix W at the previous step. However, the method of updating the concept matrix W is not limited to this method.

 For example, instead of using a GRU, the following description assumes that W is transformed by Q(...) (a neural network of three fully connected (FC) layers).
 Here, in Q(W) and K(F'), the nonlinear transformations applied to W and F', respectively, are given as multilayer perceptrons having three FC layers (fully connected layers) with ReLU nonlinear layers between them, and have the following dimensions:

Q(W) \in \mathbb{R}^{c \times k}, \quad K(F') \in \mathbb{R}^{c \times hw}
 A normalization unit 2012 calculates an "attention matrix A" given by the following equation (1):

A = \phi\!\left( Q(W)^{\top} K(F') \right) \in [0, 1]^{k \times hw} \quad (1)
 Here, the function φ is a normalization function.

 This attention matrix A indicates at which positions in the image the k concepts exist, as shown in FIG. 7 described later.

 The normalization function φ determines the spatial distribution of each concept, which depends on the target domain of the classification.

 For example, the images of a handwritten digit recognition dataset are usually black and white, and only the shapes formed by the strokes matter. In this case, concepts are unlikely to overlap spatially. Natural images, on the other hand, have color, texture, and shape, so concepts may overlap at the same spatial position.
 For the non-overlapping case, φ can be designed as follows:

\phi(Z) = \sigma(Z) \odot \mathrm{softmax}(Z) \quad (2)
 Here, σ is the sigmoid function, and the product between σ and the softmax function is the Hadamard product. The softmax function is applied over the concepts (that is, over each column vector; in this specification, this column vector is hereinafter referred to as a "slot vector") so that different concepts are not detected at the same spatial position.
 On the other hand, to allow concepts to overlap, only the sigmoid function can be used for the normalization, as follows:

\phi(Z) = \sigma(Z) \quad (3)
 A concept occurrence calculation unit 2030 calculates the concept activation vector t by summing A along the spatial dimension, as in the following equation (4). Each element of the concept activation vector indicates whether the corresponding concept appears, and each element is called a concept activation.

t_k = \sum_{m=1}^{hw} A_{k,m} \quad (4)
 In the concept learner 200, a shaping processing unit 2020 also reshapes the feature map F to flatten its spatial dimensions, yielding the following feature map F*:

F^{*} \in \mathbb{R}^{c \times hw}
 A similarity calculation unit 2040 then calculates and extracts the image features V from the feature map F* using the following equation:

v_k = \sum_{m=1}^{hw} \omega_{k,m}\, f^{*}_{m}, \quad \omega_{k,m} = \frac{A_{k,m}}{\sum_{m'} A_{k,m'}} \quad (5)

where f*_m denotes the m-th column of F* and v_k is the k-th row of V.
 Here, weighting by ω_k gives the average of the image features over the spatial dimensions, weighted by the attention.
(Quantization and Quantization Loss)
 The concept activation t described above is an index indicating the presence of each concept, and could in principle be expressed as a binary value.

 However, since the neural network is trained by gradient descent, the continuous values described above are used instead. In training the neural network, a quantization loss is then used to ensure that the values become close to either 0 or 1.
 Such a quantization loss l_qua is given by the following equation, where B is a mini-batch, that is, a random subset of D:

l_{qua} = \frac{1}{|B|\,k} \sum_{x \in B} \sum_{j=1}^{k} \min(t_j,\, 1 - t_j)^2 \quad (6)

 Here, t is the concept activation calculated for the image x.
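 As one possible realization of the quantization loss of equation (6), the following minimal Python sketch penalizes concept activations that are far from both 0 and 1; the elementwise min(t, 1 − t) form is an assumption consistent with the description above rather than a verbatim transcription of the embodiment's formula.

```python
import torch

def quantization_loss(t: torch.Tensor) -> torch.Tensor:
    # t: concept activations for a mini-batch, shape (batch, k).
    # The penalty is zero exactly when every activation is 0 or 1,
    # pushing the continuous activations towards binary values.
    return torch.minimum(t, 1.0 - t).pow(2).mean()
```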
(Concept Regularization Unit 300)
 Since the only supervision in the training of the image classification learning device 1000 is the image-level label y, the concept learner might not consistently capture meaningful features.

 The concept regularization unit 300 therefore executes concept regularization processing so that learning of proper "concepts" progresses.

 Two of these concept regularization processes constrain the concept prototypes through V. The other adopts supervised self-learning through an image reconstruction or image retrieval task in order to obtain better representational power.

 FIG. 2 is a functional block diagram for explaining the configuration of the concept regularization unit 300.

 FIGS. 3A to 3C are conceptual diagrams showing the concept of the processing of the concept regularization unit 300 of FIG. 2.
 Referring to FIG. 2 and FIGS. 3A to 3C, a discrimination loss calculation unit 3010 ensures the individual consistency of the concepts: for each concept extracted as a "concept" by the learning processing of the concept learner 200 to be easy for humans to interpret, each learned concept should not contain many variations within itself.

 Such regularization of the "concept learning" can be taken into account in the loss terms (encoded as loss terms) through the image features V and the concept activation t.

 Here, in the training of the concept learner 200, so-called "mini-batch learning" is performed: a subset of n items is randomly taken from the N training data items, and the parameters are updated.

 During training, the k-th element t_k of the concept activation t can be used to identify the images in a mini-batch that have concept k.

 That is, first, as shown in FIG. 3C, the discrimination loss calculation unit 3010 calculates a "consistency loss" as follows. The image feature v_k, which is the k-th row vector of the image features V, contains the image features from the region corresponding to concept k when t_k is close to 1. Let H_k denote the set of all pairs of image features v_k within the mini-batch for which t_k is greater than a threshold ξ set in advance empirically and experimentally.
 Using the cosine similarity sim(·,·), the "consistency loss" is defined as follows:

l_{con} = \frac{1}{\sum_k |H_k|} \sum_{k} \sum_{(v_k, v'_k) \in H_k} \left( 1 - \mathrm{sim}(v_k, v'_k) \right)
 l_con penalizes a smaller similarity between image features v_k and v'_k corresponding to concept k from two different images.

 That is, the "consistency loss" is a loss term for advancing learning during mini-batch learning so that the "image features" of different images belonging to the "same concept" become "more similar" even though the images differ.
(Mutual Distinctness of Concepts)
 To capture different aspects of an image, each concept must attend to different visual elements, and the discrimination loss calculation unit 3010 calculates the following "discriminability loss" as a loss term. The average image feature of concept k within a mini-batch is given by:

\bar{v}_k = \frac{1}{|B_k|} \sum_{x \in B_k} v_k, \quad B_k = \{ x \in B \mid t_k > \xi \}

and the discriminability loss is defined as

l_{dis} = \frac{1}{|M|} \sum_{(k, k') \in M} \mathrm{sim}(\bar{v}_k, \bar{v}_{k'})
 Here, the set M is the set of all pairs of average image features. Note that a concept k is excluded from the set M if no image in the mini-batch has concept k.

 That is, the "discriminability loss" is a loss term for advancing learning during mini-batch learning so that the "average image features" of images belonging to "different concepts" become "more different."
(Supervised self-learning)
 "SENN," disclosed in Non-Patent Document 8 described in the background art, uses an autoencoder structure for supervised self-learning. This is effective, for example, for handwritten digit recognition tasks, in which the different visual elements (line patterns) are strongly tied to their positions.

 For example, a cross consisting of a horizontal line and a vertical line appears only in the digit 4, which is generally placed near the center of the image.

 However, this does not necessarily hold for more general "natural-world images." The concept regularization unit 300 of this embodiment therefore introduces "supervised self-learning" for evaluating a search-based loss for natural images, in addition to a loss based on image reconstruction.
 Thus, in this embodiment, in the concept regularization unit 300, a reconstruction-based loss calculation unit 3020 as shown in FIG. 3B or a search-based loss calculation unit 3030 as shown in FIG. 3A executes the processing described below, selectively or in parallel according to the type of target of the classification task, for example by external setting in advance, to calculate a loss term for the training of the concept learner 200.

(Reconstruction-Based Self-Learning)
 In image domains where visual elements can be expected to be strongly tied to their positions (for example, MNIST, a set of handwritten digit images), the concept activation t has sufficient information to reconstruct the original image.

 Thus, as shown in FIG. 3B, the reconstruction-based loss calculation unit 3020 includes a concept decoder D, which takes the concept activation t as input and reconstructs the original image so that the image x and the output D(t) of the concept decoder D become similar to each other.

 Here, the reconstruction-based loss l_rec in the supervised self-learning is defined by the following equation:

l_{rec} = \frac{1}{|B|} \sum_{x \in B} \| x - D(t) \|_2^2
 The "reconstruction-based loss" therefore becomes smaller, with the concept activation t acting as the supervisory signal of the self-learning, the more similar the reconstructed image is to the original image.
(Search-based self-learning)
 In general, however, the concept activation t corresponds to concepts placed at arbitrary positions and is therefore insufficient for reconstructing the original image x; the spatial information needed for reconstruction has been lost in the concept activation t.

 Thus, as shown in FIG. 3A, instead of reconstructing the original image, the search-based loss calculation unit 3030 performs the simple retrieval task of finding images of the same class within the mini-batch B using the concept activation t. For an arbitrary pair (t, t') computed from images x, x' ∈ B with image labels y and y', respectively, a function J is defined as follows:

J(t, t') =
\begin{cases}
\mathrm{sim}(t, t') & (y = y') \\
-\,\mathrm{sim}(t, t') & (y \neq y')
\end{cases}
 Here, if t and t' have the same class label, they should be similar to each other, since a similar set of visual elements should appear in the images x and x'. If they do not have the same class label, t and t' should differ.

 In general, the number C_S of pairs having the same label is much smaller than the number C_D of pairs having different labels. A weight α(y, y') is therefore introduced to mitigate this imbalance, taking the value C_D / (C_S + C_D) when y = y' and C_S / (C_S + C_D) otherwise.
 The search-based loss calculation unit 3030 then defines the search-based loss l_ret in the supervised self-learning described above by the following equation:

l_{ret} = -\frac{1}{|B|^2} \sum_{(x, x')} \alpha(y, y')\, J(t, t')
 Here, the sum is calculated over all pairs of images (x, x') and their corresponding labels (y, y').

 That is, with the concept activation t acting as the supervisory signal of the self-learning, the "search-based loss" becomes smaller the more similar the concept activations are for different images having the same class label, and the more dissimilar they are for different images having different class labels.
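 The following is a minimal sketch of a search-based loss of this kind; the concrete realizations of J (signed cosine similarity over activations) and of the weight α(y, y') are assumptions consistent with the description above, not a verbatim transcription of the embodiment.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # t: concept activations for a mini-batch, shape (batch, k)
    # y: class labels, shape (batch,)
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()   # 1 where labels match
    c_s, c_d = same.sum(), (1.0 - same).sum()
    # alpha(y, y'): down-weight the majority pair type to mitigate imbalance
    alpha = same * (c_d / (c_s + c_d)) + (1.0 - same) * (c_s / (c_s + c_d))
    tn = F.normalize(t, dim=1)
    sim = tn @ tn.T                                     # cosine similarity of activations
    # J rewards similar activations for same-class pairs and
    # dissimilar activations for different-class pairs
    j = torch.where(same.bool(), sim, -sim)
    return -(alpha * j).mean()
```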
 Furthermore, as in the reconstruction case, the influence of each concept can be visualized by showing a human the top-ranked images based on a modified concept activation t' and the similarity t^T t' computed against the concept activations of all images in the training image dataset D.

(Loss for the classification performance of the classifier)
 In the following, the classifier 400 is described as a single fully connected layer without a bias term.

 Here, the following holds:

s = W_c\, t, \quad W_c \in \mathbb{R}^{|\Omega| \times k}

where W_c is the weight matrix of the fully connected layer.
 The training of this simple classifier 400 can be interpreted as finding the co-occurrence between the activation of each concept and the class to be assigned.

 That is, by defining the cross-entropy as follows, the loss relating to the classification performance of the classifier 400 (hereinafter referred to as the "classification performance loss") can be evaluated:

l_{cls} = -\frac{1}{|B|} \sum_{(x, y) \in B} \log \mathrm{softmax}(s)_y \quad (14)
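 A minimal sketch of this bias-free single-FC-layer classifier and the classification performance loss is shown below; the sizes n_concepts and n_classes are illustrative values only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The classifier: a single fully connected layer with no bias term,
# mapping k concept activations to one score per class in Omega.
n_concepts, n_classes = 50, 200          # illustrative sizes
classifier = nn.Linear(n_concepts, n_classes, bias=False)

t = torch.rand(8, n_concepts)            # concept activations for a mini-batch of 8
y = torch.randint(0, n_classes, (8,))    # ground-truth class labels
s = classifier(t)                        # class scores s = W_c t
l_cls = F.cross_entropy(s, y)            # classification performance loss, eq. (14)
```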

(Total loss)
 In the learning process, the total loss of the image classification learning device 1000 is defined by combining the above loss terms as follows:

L = l_{cls} + \lambda_{qua}\, l_{qua} + \lambda_{con}\, l_{con} + \lambda_{dis}\, l_{dis} + \lambda_{R}\, l_{R} \quad (15)

where l_R is the reconstruction-based loss l_rec or the search-based loss l_ret, selected according to the target domain, and λ_qua, λ_con, λ_dis, and λ_R are hyperparameters weighting the respective loss terms.

 As described above, the learning process control unit 700 controls the learning processing according to the loss calculated by the loss calculation unit 600.
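 The combination of equation (15) can be sketched as a simple Python function; the default weights below match the values reported later in the embodiment, and the argument names are illustrative.

```python
def total_loss(l_cls, l_qua, l_con, l_dis, l_r,
               lam_qua=0.1, lam_con=1.0, lam_dis=1.0, lam_r=1.0):
    # Weighted combination of the loss terms as in equation (15).
    # l_r stands for either the reconstruction-based loss l_rec or the
    # search-based loss l_ret, selected according to the target domain.
    return (l_cls + lam_qua * l_qua + lam_con * l_con
            + lam_dis * l_dis + lam_r * l_r)
```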
(Classifier configuration generated as a result of learning)
 FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000 shown in FIG. 1.

 As described above, the image classification learning device 1000 may be configured so that an arithmetic device (CPU: Central Processing Unit) inside its own housing executes the arithmetic processing, or the program processing itself may be executed on a server. In the following, it is assumed that an arithmetic device inside its own housing executes the arithmetic processing.

 Referring to FIG. 4, the image classification learning device 1000 includes a computer device 6010, a network communication unit 6300 for communicating with a network, a camera 6400 for providing captured image data to the computer device 6010 as necessary, and a recording medium (for example, a memory card) 6210 for recording captured image data and providing it to the computer device 6010.

 For example, a USB memory, a memory card, or an external storage device can be used as the recording medium 6210, and, for example, a wired LAN router or a wireless LAN access point can be used as the network communication unit 6300. The image data may also be provided to the computer device 6010 via such a network communication unit 6300.
 As shown in FIG. 4, the computer main body constituting the computer device 6010 includes, in addition to a disk drive 6030 and a memory drive 6020, a CPU (Central Processing Unit) 6040, a memory including a ROM (Read Only Memory) 6060 and a RAM (Random Access Memory) 6070, a nonvolatile rewritable storage device such as an SSD (Solid State Drive) 6080, and an input/output interface 6090 for communicating via a network and exchanging data with the outside, each connected to a bus 6050. An optical disc can be loaded into the disk drive 6030, and a memory card 6210 can be loaded into the memory drive 6020.

 As described later, when a program of the computer device 6010 runs, the data and programs storing the information underlying its operation as a computer are assumed to be stored in the SSD 6080. The RAM 6070 functions as working memory when the CPU 6040 performs arithmetic operations; data and parameters in the middle of computation are stored in and read from it as needed, and the CPU 6040 executes the arithmetic processing.

 In FIG. 4, the computer-readable non-transitory recording medium storing information such as programs to be installed on the computer main body may be, for example, a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory. To handle such media, the computer main body 6200 is provided with drive devices (the memory drive 6020 and the disk drive 6030) capable of reading them.

 The main part of the computer device 6010 is constituted by computer hardware and software executed by the CPU 6040. Generally, such software is distributed stored on a computer-readable non-transitory storage medium or distributed via a network, acquired via the disk drive 6030 or the network communication unit 6410, and temporarily stored in the SSD 6080. It is then read from the SSD 6080 into the RAM 6070 of the memory and executed by the CPU 6040. When connected to a network, the software may also be loaded directly into the RAM and executed without being stored in the SSD 6080.

 The program for functioning as the computer device 6010 described below need not necessarily include, at the time of distribution, an operating system (OS) that causes the computer main body 6010 to execute the functions of an information processing device or the like. The program need only include the portions of instructions that call appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 6010 operates is well known, and a detailed description is omitted.

 Furthermore, the CPU 6040 may be a single-core processor or a multi-core processor.
 FIG. 5 is a flowchart for explaining the learning processing of the image classification learning device 1000 shown in FIG. 1.

 Referring to FIGS. 5 and 1, when the learning processing starts, learning image data selected for mini-batch processing is input (S100), and the CNN backbone 100 extracts a feature map (S102).

 Next, the position information encoding processing unit 2002 encodes position information into the feature map (S104), and the shaping processing unit 2020 flattens the feature map (S106).

 The shaping processing unit 2004 flattens the feature map F' with the encoded position information (S108), and the nonlinear processing unit 2008 applies nonlinear processing to the concept matrix W output from the concept prototype processing unit 2100 to generate the query Q(W) (S110).

 Meanwhile, the nonlinear processing unit 2006 generates the key K(F') by nonlinear processing of the feature map F' (S112); the similarity calculation unit 2010 calculates the dot product between the query Q(W) and the key K(F'), and the normalization unit 2012 normalizes the dot product to generate the attention matrix A (S114).

 The concept occurrence calculation unit 2030 calculates the concept activation t and inputs it to the classifier 400 (S116). Meanwhile, the similarity calculation unit 2040 generates the image features V, as in equation (5), from the dot product of the attention matrix A and the flattened feature map.

 From the concept activation t and the image features V, the concept regularization unit 300 calculates the "consistency loss," the "discriminability loss," the "reconstruction-based loss," and the "search-based loss"; the quantization error calculation unit 500 calculates the quantization error loss of the concept activation t according to equation (6); and, according to the output from the classifier 400, the loss calculation unit 600 calculates the "classification performance loss" of equation (14) and calculates the total loss L in the learning process of the concept learner 200 by equation (15) (S120).
 Based on the total loss L, the learning process control unit 700 updates the parameters of the concept learner 200, which is constituted by a neural network, for example by gradient descent. The method of updating the model parameters is not limited to this method.
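 One mini-batch iteration of this procedure can be sketched as follows; the module names (backbone, extractor, classifier) and the aggregated loss callable compute_losses are illustrative assumptions, and the optimizer stands for any gradient-descent-based PyTorch optimizer.

```python
import torch

def train_step(batch, backbone, extractor, classifier, compute_losses, optimizer):
    # One mini-batch update corresponding to steps S100 through S120.
    # compute_losses is assumed to combine the loss terms into the
    # total loss L of equation (15).
    x, y = batch                           # S100: mini-batch of images and labels
    feat = backbone(x)                     # S102: feature map extraction
    a, t, v = extractor(feat)              # S104-S116: attention, activations, features
    s = classifier(t)                      # classification scores
    loss = compute_losses(x, y, s, t, v)   # S120: total loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # gradient-descent parameter update
    return float(loss.detach())
```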
 When the learning process control unit 700 determines that the learning processing by mini-batch processing satisfies a predetermined condition, it ends the learning processing; otherwise, it returns the processing to step S100.

 The processing of FIG. 5 described above can be executed by the CPU 6040 in the hardware configuration shown in FIG. 4, for example by a computer program stored in the nonvolatile storage device (SSD) 6080. Each process can also be executed in a distributed manner, and a cloud-type configuration in which the processing is executed by a server device is also possible.
 FIG. 6 is a functional block diagram for explaining the configuration of an image classification device 4000, including the classifier 400 generated by the learning of the image classification learning device 1000 shown in FIG. 1, when it executes classification processing on a new image.

 In FIG. 6, components that execute the same processing as in the image classification learning device 1000 shown in FIG. 1 are denoted by the same reference numerals.

 Essentially, as to the concept prototype processing unit 2100, when classification processing is executed after the learning processing has ended, the concept matrix W is fixed to its state at the end of learning and is no longer updated; it is therefore denoted in FIG. 6 as a concept prototype storage unit 4100. For the "concept prototype storage unit 4100," for example, the learned parameters are stored in memory.

 The same applies to the other components of FIG. 6, including the classifier 400: their parameters are fixed to those at the end of learning.
 FIG. 7 is a conceptual diagram for explaining the processing executed by the classifier 400 of FIG. 6.

 It is assumed that the concept prototypes (concept matrix) W have been generated from the learning dataset and that the concept prototype storage unit 4100 holds that state.

 When a new image is input, the concept occurrence calculation unit 2030 calculates the concept activation t for each concept. This is called a "concept bottleneck" in the sense that the diversity of the original image is condensed into a small number of features.
 In the classifier 400, a pattern of concept co-occurrence has been learned for each label of the learning images; when the concept bottleneck for a new image is input to the classifier 400, its similarity to the concept co-occurrence patterns is calculated, and the label having the highest similarity among the labels is output as the classification result.
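 The inference path can be sketched as follows, reusing the module interfaces assumed in the earlier sketches; all names are illustrative, and the parameters are taken to be frozen at their values at the end of learning.

```python
import torch

@torch.no_grad()  # all parameters are fixed to their values at the end of learning
def classify(x, backbone, extractor, classifier):
    # Classify a single new image x of shape (c, h, w); names are illustrative.
    feat = backbone(x.unsqueeze(0))   # feature map for a batch of one
    _, t, _ = extractor(feat)         # concept bottleneck: activations t
    s = classifier(t)                 # similarity to each class's co-occurrence pattern
    return int(s.argmax(dim=1)), t.squeeze(0)
```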
 The example shown in FIG. 7 illustrates the case where natural images of birds have been learned. Here, concepts such as "yellow head" and "black body" have been generated for the bird of label 1, and in the process of computing the classification result, the degree of co-occurrence of each such concept with the concepts in the target image is judged.

 That is, taking the image of a "black bird" with a "yellow head" as an example, the image classification learning device 1000 has acquired through learning concepts that can be interpreted as "yellow head" and "body with black feathers."

 The classifier 400 is a single fully connected (FC) layer, and thus encodes the co-occurrence of each concept with each class.

 The image classification learning device 1000 therefore learns the "classifier" and the "bottleneck concepts" simultaneously.

 As explained so far, to enhance representational power and interpretability, the concepts are constrained to be individually consistent (that is, a single concept occupies a smaller volume in the feature space) and mutually distinctive (that is, a pair of concepts does not occupy, or is separated so as to be less likely to occupy, the same region of the feature space).

 "Individual consistency of the concepts" is taken into account mainly by the "consistency loss," and "mutual distinctiveness" mainly by the "discriminability loss."
 Imposing such constraints has the following effects:
 i) a concept corresponds only to specific visual elements (or features), making it easy to see what each concept represents;
 ii) different concepts are dissimilar to one another and cover a greater variety of visual elements.

 Such constraints i) and ii) are also consistent with human intuition when classifying images.
 The image classification learning device 1000 can thus also learn the classifier and the concepts simultaneously, end to end.

 Of course, the configuration is not necessarily limited to an end-to-end one; for example, at least part of the CNN backbone 100 may be fixed and only the other parts trained.

 The image classification learning device 1000 is thus configured to be able to give humans an intrinsic explanation of the classification process in two ways.

 First, the activation of the concepts intuitively conveys what the model has discovered.

 Second, the classifier provides a prototype of each target class in terms of the concepts.

 The experimental results described below demonstrate that the image classification learning device 1000 can give a more intuitive interpretation (at least qualitatively) without a large loss of performance.
 The ablation experiments below show the importance of the concept constraints, as well as which supervised self-learning task is suited to which target classification task.

(Results of the evaluation experiment on the classifier)
(Experimental Setup)
 Datasets: the results of experiments on three classification tasks, the handwritten digit recognition task MNIST (MIT license), the natural image recognition task CUB200 (custom license), and ImageNet (3-clause BSD license), are described below.

 Training of the image classification learning device 1000 becomes more likely to fail as the number of classes increases.

 In the following experiments, therefore, the full set of CUB200 and a subset of ImageNet extracting the first n classes (0 < n < 1000) in ascending order of class ID were used.

 The classification accuracy on the three tasks and a qualitative analysis on MNIST and CUB200 are presented to verify the interpretability provided by the image classification learning device 1000. A supplementary qualitative analysis on ImageNet is also mentioned.

(Training details)
 For MNIST, the same networks as "SENN" disclosed in Non-Patent Document 8 were used for the backbone and the concept decoder.

 For CUB200 and ImageNet, a pretrained ResNet disclosed in the following document was used as the backbone, and the number of channels (512 for ResNet-18, 2048 for the others) was reduced to 128 by a 1 × 1 convolutional layer.
 Known document: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, pages 770-778, 2016.
 All input images were resized to 256 × 256 and cropped to 224 × 224. Only random horizontal flipping was adopted as data augmentation during training.

 The default number of concepts k was 20 for MNIST and 50 otherwise. The default weights of the respective losses were as follows:
 λ_qua = 0.1, λ_con = 1, λ_dis = 1, λ_R = 1

 The influence of the number of concepts k and of these weights was also examined.

(Classification performance)
 With the number of concepts k = 20, the classification accuracy on the MNIST test set was 96.7%.

 FIGS. 8A and 8B show the classification performance of the classifier 400 on CUB200 and ImageNet.

 In FIGS. 8A and 8B, the entry labeled BotCL denotes the classifier 400.

 FIG. 8A compares the performance of the classifier 400 and the baseline model ResNet on CUB200 and on ImageNet with n = 300.

 A performance drop of roughly 3 points is observed for the classifier 400 on all tasks.

 FIG. 8B shows the change in performance with respect to the number of classes on CUB200 (20 to 200 classes) and ImageNet (50 to 300 classes).

 As the number of classes increases, the performance of both the baseline model and the classifier 400 tends to decrease.
 However, although the classifier 400 consists of a single fully connected layer taking the concept activation t as input, its classification performance hardly deteriorates relative to the baseline model. In other words, the "concepts" corresponding to the concept activation t sufficiently express the image features relevant to classification.

(Interpretability)
(Validity of detected concepts)
 As described above, in the learning process, the image classification learning device 1000 calculates, for each input image, the concept activation t indicating the presence of each concept.

 The concept activation t_k corresponds to the sum over the spatial dimensions of the attention a_k for concept k. By visualizing this a_k, a human can qualitatively confirm the presence or absence of the concept.

 FIGS. 9A to 9C are diagrams for explaining the validity of the concepts represented by the concept activation t.

 FIG. 9A shows, for MNIST, the five concepts that are most frequently activated (that is, t_k > 0.5), with a_k superimposed on the original image. The attention regions of the concepts appear brighter in the overlay. Taking the digits 0 and 9 as an example, the clear difference between the two is the activation of concept 2 (Cpt.2), whose attention lies on the lower vertical line.
 図9Bは、各概念が活性化する頻度のトップ5を示したものであり、図9Cは、各概念によって、再構成される画像を示す。 Figure 9B shows the top five most frequently activated concepts, and Figure 9C shows the images reconstructed by each concept.
 画像分類学習装置1000は、学習された概念が個々に整合し、相互に特徴的であるように設計されている。 The image classification learning device 1000 is designed so that the learned concepts are individually consistent and distinctive from each other.
 このことは、各概念を学習画像データDにおける活性化サンプルの上位P個と概念kを重ね合わせて表示することで定性的に検証できる。図9Bに示すように、MNISTではP=5とした。概念によって異なるストロークパターンに注目しており、各概念は異なるサンプル間で(異なるカテゴリのサンプルでも)一貫した注目領域を持っていることが観察される。 This can be qualitatively verified by overlaying each concept with the top P activated samples in the training image data D and concept k. As shown in Figure 9B, in MNIST, P = 5. Different concepts focus on different stroke patterns, and it can be observed that each concept has a consistent area of focus across different samples (even samples from different categories).
 また、ある概念を取り除くことで、その寄与を定性的に確認し、対応する教師あり自己学習タスクの出力の変化を見ることができる。 The contribution of a concept can also be confirmed qualitatively by removing it and observing the resulting change in the output of the corresponding supervised self-learning task.
 図9Cでは、数字9に対する注意領域である縦線を担当する概念2(Cpt.2)の活性度をゼロにすると、再構成された画像は数字0に変化していることがわかる。 In Figure 9C, when the activation of concept 2 (Cpt.2), which is responsible for the vertical line and is the attention area for the number 9, is set to zero, the reconstructed image changes to the number 0.
 また、数字7の注意領域であって、円の非存在を表す概念1(Cpt.1)を非活性化すると、再構成画像の上部に円が現れ、数字9に近い形となっているのが分かる。 Furthermore, when concept 1 (Cpt.1), which is the attention area for the number 7 and represents the absence of a circle, is deactivated, a circle appears at the top of the reconstructed image, and it can be seen that its shape resembles the number 9.
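A minimal sketch of this intervention, assuming a decoder head that maps an activation vector back to an image (the name `decoder` and the signature are illustrative, not fixed by the description):

    import torch

    @torch.no_grad()
    def intervene_and_reconstruct(decoder, t: torch.Tensor, concept_idx: int,
                                  value: float = 0.0) -> torch.Tensor:
        # Overwrite one concept activation and decode, as in Fig. 9C:
        # value = 0.0 deactivates the concept, value = 1.0 activates it.
        t_mod = t.clone()
        t_mod[concept_idx] = value
        return decoder(t_mod.unsqueeze(0))  # reconstructed image tensor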
 図10は、入力画像である黄色い頭の黒い鳥に対する、最も重要な(後述する「重要度」に基づく)5つの概念の注目度を示す図である。 Figure 10 shows the attention levels of the five most important concepts (based on "importance" described below) for the input image of a black bird with a yellow head.
 概念1~5(Cpt.1~Cpt.5)の注目度は、頭、首、胴体、足など鳥の様々な部分をカバーしていることが分かる。これは、画像分類学習装置1000が自然画像から様々な概念を学習できることを証明している。 It can be seen that the attention of concepts 1 to 5 (Cpt.1 to Cpt.5) covers various parts of the bird, such as the head, neck, body, and legs. This proves that the image classification learning device 1000 can learn a variety of concepts from natural images.
 (各概念の一貫性・独自性) (Consistency and distinctiveness of each concept)
 図11は、自然画像について、概念活性度tにより表現される概念を説明するための図である。 Figure 11 is a diagram to explain the concepts expressed by concept activity t for natural images.
 図9Bで示したのと同様の重ね合わせの表示を、図11に示す。 Figure 11 shows a similar overlay display to that shown in Figure 9B.
 画像分類学習装置1000はCUB200データセットでも、図9Bで説明したのと同様の挙動を示す。選択されたP=5個の概念は異なるパターンに注目し、各概念はサンプル間で一貫した注目領域を持っていることがわかる。 The image classification learning device 1000 exhibits the same behavior on the CUB200 dataset as described for Fig. 9B. The selected P = 5 concepts attend to different patterns, and each concept has a consistent region of attention across samples.
 (推論における各概念の貢献度) (Contribution of each concept to inference)
 分類器400は、1つの全結合層からなり、概念の共起を学習すると解釈できる。 The classifier 400 consists of one fully connected layer and can be interpreted as learning the co-occurrence of concepts.
 したがって、1回の推論で、クラスωに対する概念kの寄与を以下の式で定義する。 Therefore, in one inference, the contribution of concept k to class ω is defined as:

    s_kω = t_k · w_ωk    …(数21 / Math 21)

 ここで、w_ωkは、分類器400の全結合層において概念kをクラスωに結び付ける重みである。 Here, w_ωk is the weight of the fully connected layer of the classifier 400 that links concept k to class ω.
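Since the classifier is a single fully connected layer, the contribution can be read off directly from its weight matrix; a sketch assuming the classifier is a PyTorch nn.Linear layer and t is one activation vector:

    import torch

    def concept_contributions(fc: torch.nn.Linear, t: torch.Tensor) -> torch.Tensor:
        # The logit of class w is sum_k t[k] * fc.weight[w, k] (+ bias),
        # so t[k] * fc.weight[w, k] is the contribution of concept k to class w.
        return t.unsqueeze(0) * fc.weight  # shape: (num_classes, num_concepts)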
 また、自然画像に対してもMNISTと同様に、ある概念を取り除くことで寄与を定性的に確認し、対応する教師あり自己学習タスクの出力の変化を見ることができる。 Also, for natural images, as with MNIST, it is possible to qualitatively confirm the contribution by removing certain concepts and observe the change in the output of the corresponding supervised self-learning task.
 図12は、CUB200のデータセットにおいて、各概念の重要度を示す図である。 Figure 12 shows the importance of each concept in the CUB200 dataset.
 図12に示すように、CUB200のデータセットでは、検索された上位8つの検索結果を用いて、各概念の貢献度が示される。たとえば、概念1(Cpt.1(黄色い頭に相当))を無効化すると、検索結果にはより多くの黒い頭の鳥の画像が現れるようになる。 As shown in Figure 12, in the CUB200 dataset, the contribution of each concept is shown using the top 8 search results. For example, disabling concept 1 (Cpt.1 (corresponding to yellow head)) results in more images of black-headed birds appearing in the search results.
 もっとも、復元タスクと比較すると、検索タスクは出力(類似度の高いサンプル)が複数の概念で決定され、一つの概念の変更は全体の類似度にあまり影響しないため、よりロバストであると言える。 However, compared to the restoration task, the search task is more robust because the output (highly similar samples) is determined by multiple concepts, and changing one concept does not significantly affect the overall similarity.
 図12の下部には、検索されたサンプルのうち、グラウンドトゥルース(ground truth)クラスのサンプルの割合を示しており、この検索タスクにおける各概念の重要度を測ることができる。 The bottom of Figure 12 shows the percentage of samples in the ground truth class among the retrieved samples, allowing us to measure the importance of each concept in this search task.
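A sketch of this retrieval experiment; the dot-product similarity over activation vectors is an assumption for illustration:

    import numpy as np

    def retrieve(query_t, gallery_t, top=8, disabled=None):
        # gallery_t: (N, k) activation vectors; disabling zeroes one concept
        # in the query before matching, as in Fig. 12.
        q = query_t.copy()
        if disabled is not None:
            q[disabled] = 0.0
        scores = gallery_t @ q
        return np.argsort(scores)[::-1][:top]

    def ground_truth_ratio(indices, labels, gt):
        # Fraction of retrieved samples belonging to the ground-truth class.
        return float(np.mean(labels[indices] == gt))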
 なお、概念5(Cpt.5)を無効化しても検索結果にほとんど変化がないのは、概念5が全ての鳥類クラスで有効な共通概念であるためと考えられる。 Note that deactivating concept 5 (Cpt.5) hardly changes the search results; this is likely because concept 5 is a common concept that is active across all bird classes.
 (アブレーション試験) (Ablation study)
 以下では、概念kの個数の設定と各損失項の重みの影響を検討した。また、分類精度の他に、各概念の上位100個の活性化サンプルに対して式(7)、(9)で計算した損失の元となる「整合性」「識別性」について評価を行った。ハイパーパラメータは、探索するものを除き、デフォルトの値を使用した。 Below, we examine the influence of the number of concepts k and of the weight of each loss term. In addition to classification accuracy, we evaluated the "consistency" and "distinctiveness" underlying the losses computed by equations (7) and (9), over the top 100 activated samples of each concept. Default values were used for all hyperparameters except the one being searched.
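The exact metrics follow equations (7) and (9) of the description (not reproduced in this extract); as an illustrative stand-in, the sketch below scores consistency as the mean pairwise similarity of the features a concept attends to (higher is better) and distinctiveness as the mean pairwise similarity between concept centroids (lower is better):

    import numpy as np

    def _cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    def consistency(feats_per_concept):
        # feats_per_concept: list of (P, d) arrays, the attended features of
        # each concept's top-P activated samples (P = 100 in the text).
        scores = []
        for f in feats_per_concept:
            n = len(f)
            pairs = [_cos(f[i], f[j]) for i in range(n) for j in range(i + 1, n)]
            scores.append(np.mean(pairs))
        return float(np.mean(scores))

    def distinctiveness(centroids):
        # centroids: (k, d) mean attended feature of each concept.
        k = len(centroids)
        pairs = [_cos(centroids[i], centroids[j])
                 for i in range(k) for j in range(i + 1, k)]
        return float(np.mean(pairs))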
 (概念kの影響) (Influence of the number of concepts k)
 図13は、各ハイパーパラメータの大きさと正解率(Accuracy:丸)、整合性(individual consistency:四角:高くなるほど望ましい)、識別性(mutual distinctiveness:三角:低くなるほど望ましい)を示す図である。 Figure 13 shows the magnitude of each hyperparameter and the accuracy rate (Accuracy: circles), individual consistency (squares: the higher the better), and mutual distinctiveness (triangles: the lower the better).
 図13に示すように、一般に、概念数kが大きいほど正解率が高く、整合性と識別性の両方に正の影響を与える。 As shown in Figure 13, a larger number of concepts k generally yields higher accuracy and positively affects both consistency and distinctiveness.
 しかし、MNISTではkが10以上、CUB200では200以上で識別性の値が増加(すなわち劣化)しており、これは意味のない概念が学習されたか、複数の概念が同じ視覚要素を表現していることを示唆している。 However, the distinctiveness value increases (i.e., deteriorates) for k above 10 on MNIST and above 200 on CUB200, suggesting that meaningless concepts were learned or that multiple concepts represent the same visual elements.
 このように、概念の数はデータセットごとに調整することが望ましい。 Thus, it is desirable to tune the number of concepts for each dataset.
 (量子化損失の重みパラメータλquaの影響) (Influence of the quantization loss weight parameter λqua)
 パラメータλquaを小さく(ただしゼロではない値に)すると、CUB200で学習するモデルの性能が向上し、大きくすると3つのメトリクス全てに悪影響があった。 A small (but nonzero) λqua improved the performance of the model trained on CUB200, while large values negatively affected all three metrics.
 パラメータλquaは、概念活性度tをどの程度バイナリ値に近づけるかを制御する。適切な値は、活性化を正則化し、曖昧な概念の発生を防ぐことができる。しかし、極端な値を設定すると勾配が消失してしまい、学習がうまくいかなくなる可能性がある。デフォルトのλqua=0.1程度が、今回の実験の範囲では最適な値であった。 The parameter λqua controls how closely the concept activations t approach binary values. An appropriate value regularizes the activations and prevents some ambiguous concepts. However, an extreme value can make the gradients vanish, so learning may fail. The default λqua of about 0.1 was the best value within the range of this experiment.
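The exact form of the quantization loss is defined earlier in the description; as an illustrative stand-in with the same effect, the sketch below penalizes t(1 - t), which vanishes only at binary activations:

    import torch

    def quantization_loss(t: torch.Tensor) -> torch.Tensor:
        # Pulls each activation toward 0 or 1; weighted by lambda_qua
        # when combined with the other loss terms.
        return (t * (1.0 - t)).mean()

    # total_loss = l_cls + 0.1 * quantization_loss(t) + ...  (lambda_qua = 0.1 default)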
 (λconとλdisの影響) (Influence of λcon and λdis)
 MNISTでは、個々の概念の整合性損失と相互の識別性損失はほとんど性能に影響を与えないことが分かった。その理由として、手書き数字の外観のばらつきが少ないことが考えられる。つまり、少なくとも整合性は常に高いということである。また、課題自体が簡単なため、正解率が飽和している可能性もある。 On MNIST, we found that the individual consistency loss and the mutual distinctiveness loss have almost no effect on performance. A likely reason is that handwritten digits vary little in appearance, so consistency is always high to begin with. In addition, because the task itself is easy, the accuracy may have saturated.
 一方で、CUB200では、2つの損失が設計通りに機能した。λconの増加に伴い、整合性は継続的に向上した。識別性はλcon=1までわずかに減少し、その後わずかに増加した。λdisを増加させると、識別性の値は減少し続け(性能としては向上し)、整合性はわずかに改善された。 On CUB200, in contrast, the two losses worked as designed. As λcon increased, consistency improved continuously. Distinctiveness decreased slightly up to λcon = 1 and then increased slightly. As λdis increased, the distinctiveness value kept decreasing (an improvement in performance), and consistency improved slightly.
 (λRの影響) (Influence of λR)
 教師あり自己学習損失は分類精度に明らかな影響を与えないが、その重みを増やすことで整合性と識別性の両方を改善できることがわかる。 The supervised self-learning loss has no obvious effect on classification accuracy, but increasing its weight improves both consistency and distinctiveness.
 以上説明したように、本実施の形態の画像分類学習装置1000は、分類課題に対する学習を通じて、分類に使用した特徴を人間が理解可能な「概念」として学習することができる。 As described above, the image classification learning device 1000 of this embodiment can learn the features used in classification as "concepts" that humans can understand through learning on classification tasks.
 また、画像分類学習装置1000は学習された概念だけでなく、その判断に対する解釈可能性を提供できる。 In addition, the image classification learning device 1000 provides not only the learned concepts but also interpretability for its decisions.
 [実施の形態2] [Embodiment 2]
 以下では、実施の形態1で説明した「画像分類学習装置および分類器」を、コンクリート建造物におけるコンクリートの健全度の判定に用いる例について説明する。 Below, we will explain an example of using the "image classification learning device and classifier" described in the first embodiment to determine the soundness of concrete in a concrete structure.
 コンクリート構造物の点検では、コンクリート壁面のひび割れ発生状況などに基づいて、構造物の部分的な損傷度を判定している。 When inspecting concrete structures, the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
 図14は、実施の形態2のコンクリート健全度分類装置の動作を説明するための概念図である。 FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of embodiment 2.
 実施の形態1でも説明したように、コンクリート構造物の健全度の分類処理についても、一体のコンピュータ装置内で学習処理や分類処理が実行されるとの構成とすることもできる。 As explained in the first embodiment, the classification of the soundness of concrete structures can likewise be configured so that the learning process and the classification process are executed within a single, integrated computer device.
 ただし、以下では、後述するように、実施の形態1で説明した方法で生成された「学習済みモデル(分類器)」による健全度の判定処理は、サーバー1010(図示せず)により実行されるものとする。 However, in the following, as described below, the process of determining the health level using the "trained model (classifier)" generated by the method described in embodiment 1 is assumed to be executed by server 1010 (not shown).
 図14を参照して、画像データの入力に対して、あらかじめ、演算装置6040により実行される学習処理においては、画像データと当該画像データに関連付けられた識別指標(この場合は、画像データに対する正解データとしての健全度)とで、学習処理が実行され、人工知能の学習済モデルが生成される。 Referring to FIG. 14, in the learning process executed in advance by the computing device 6040, the learning is performed on image data together with a discrimination index associated with that image data (here, the soundness level serving as the ground truth for the image data), and a trained artificial-intelligence model is generated.
 そこで、このようにして、人工知能の学習済モデルが生成された後に、コンクリート構造物の表面を撮影した画像データが、サーバー1010に送信されるとする。 After the artificial intelligence trained model is generated in this manner, image data capturing the surface of the concrete structure is sent to server 1010.
 サーバー1010では、人工知能の学習済モデルによって、分類処理が実行され、健全度(たとえば、健全度III)との出力がされる。同時に、サーバー1010での分類処理において使用された「概念」に相当する領域について、枠などで、人間が視認できるように表示がされる。 In server 1010, a classification process is performed using a trained model of artificial intelligence, and a health level (for example, health level III) is output. At the same time, the area corresponding to the "concept" used in the classification process in server 1010 is displayed in a frame or the like so that it can be visually recognized by humans.
 たとえば、専門技能者が、このような分類処理がされた画像を視認することで、人工知能の分類結果だけでなく、画像中のどの領域に注意が向けられた結果、健全度の判定がされたのかを理解できる。そのうえで、専門技能者は、分類処理において、画像中の注意が向けられた領域に基づいて、対処法を判断することができる。 For example, a skilled professional can visually view an image that has been classified in this way and understand not only the classification results of the artificial intelligence, but also which areas of the image attention was focused on to determine the healthiness of the image. The skilled professional can then determine how to respond based on the area of the image that attention was focused on during the classification process.
 特に限定されないが、コンクリート構造物の健全度については、非特許文献9の国土交通省「橋梁定期点検要領」に開示されるように、健全度I~健全度IVの4段階とすることができる。 Although not limited to this, the soundness of concrete structures can be classified into four levels, from soundness I to soundness IV, as disclosed in the Ministry of Land, Infrastructure, Transport and Tourism's "Guidelines for Periodic Bridge Inspection" (Non-Patent Document 9).
 ここで、非特許文献9の各健全度の分類は、道路、橋梁等を例示として、以下の通りである。 Here, the classification of each soundness level in Non-Patent Document 9 is as follows, taking roads, bridges, etc. as examples.
 健全度I : 健全    : 道路橋の機能に支障が生じていない状態。  Soundness level I: Sound: The road bridge's functionality is not impaired.
 健全度II: 予防保全段階: 道路橋の機能に支障が生じていないが，予防保全の観点から措置を講ずることが望ましい状態。 Soundness level II: Preventive maintenance stage: The road bridge's functionality is not impaired, but it is desirable to take preventive maintenance measures.
 健全度III: 早期措置段階: 道路橋の機能に支障が生じる可能性があり，早期に措置を講ずべき状態。 Soundness level III: Early action stage: The road bridge's functionality may be impaired, and early action should be taken.
 健全度IV: 緊急措置段階: 道路橋の機能に支障が生じている，又は生じる可能性が著しく高く，緊急に措置を講ずべき状態。 Soundness level IV: Emergency action stage: The road bridge's functionality is impaired, or there is an extremely high possibility that it will be, and emergency action is required.
 また、図14の例では、健全度IIIであることに応じて、専門技能者は、「この部分はひび割れなので樹脂を注入して補修をしよう」との判断を下している。 In the example of FIG. 14, in response to the soundness level being III, the expert technician decides, "This part is a crack, so let's repair it by injecting resin."
 図15は、図14のような人工知能の学習済モデルを生成するための学習データの構成を示す概念図である。 FIG. 15 is a conceptual diagram showing the configuration of training data for generating a trained artificial intelligence model like that shown in FIG. 14.
 図15に示すように、健全度ラベルとして、健全度I~健全度IVにそれぞれ対応して、画像データが準備されている。図15では、例示として、健全度IIIの場合を示す。他の健全度についても、同様の画像が準備されているものとする。 As shown in Figure 15, image data is prepared as health level labels corresponding to health levels I to IV. Figure 15 shows health level III as an example. Similar images are assumed to be prepared for the other health levels.
 なお、状況によっては、「健全」とされる画像については、特に、学習データとせず、単に、「健全ではない」とのラベル(健全度II~IV)の画像を、学習データとして学習させることも可能である。 Depending on the situation, it may be possible not to use images that are deemed "healthy" as training data, but to simply use images labeled "unhealthy" (health levels II to IV) as training data.
 図16は、コンクリートの健全度の判定のためのシステム構成の一例を示す図である。 Figure 16 shows an example of a system configuration for determining the soundness of concrete.
 図16を参照して、図14において説明した通り、検査者端末500.1で、コンクリート構造物の表面の画像を撮影して、サーバー1010に送信する。 Referring to FIG. 16, as described in FIG. 14, an image of the surface of the concrete structure is captured by the inspector terminal 500.1 and transmitted to the server 1010.
 このとき、検査者端末500.1からは、たとえば、構造物の位置情報(たとえば、測位手段により獲得される緯度・経度の情報)と、検査者により入力される構造物名のデータとが、画像データとともに、サーバー1010に送信される。 At this time, the inspector terminal 500.1 transmits, for example, the structure's location information (e.g., latitude and longitude information obtained by a positioning means) and the structure name data entered by the inspector, together with the image data, to the server 1010.
 サーバー1010では、健全度の判定結果の情報と、検査者端末500.1から送信されてきた画像に対して、注意した領域を示す情報(枠などでのマーキングを施したデータ)を、検査者端末500.1に対して返信する。 The server 1010 returns to the inspector terminal 500.1 information on the healthiness assessment result and information indicating areas that were noted in relation to the image sent from the inspector terminal 500.1 (data marked with a frame or the like).
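As a concrete illustration of this exchange, hypothetical request/response payloads are sketched below; every field name is an assumption, since the description does not prescribe a data format:

    request = {
        "structure_name": "Bridge A, pier 3",     # entered by the inspector
        "latitude": 34.70, "longitude": 135.50,   # from the positioning unit
        "image": "<JPEG bytes, base64-encoded>",
    }
    response = {
        "soundness": "III",                        # one of I to IV
        "attended_regions": [                      # concept regions to overlay
            {"concept": 1, "box": [120, 40, 260, 180]},
        ],
    }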
 なお、サーバー1010での分類・判定処理に使用するための人工知能の学習済みモデルは、たとえば、別のサーバー1020において、学習処理がされた後に、サーバー1010に送信され、格納されて動作する構成とすることができる。 In addition, the trained artificial intelligence model to be used for classification and judgment processing in server 1010 can be configured to undergo training processing, for example, in another server 1020, and then be transmitted to server 1010, stored, and operated.
 ここで、学習処理を担当するサーバー1020は、複数の端末500.2~500.n(n:自然数)から、他のコンクリート構造物からの「位置データ」「画像データ」「構造物名」などのデータを収集する。このとき、たとえば、サーバー1020に収集されたデータについては、各端末500.2~500.nを操作するのが、専門技能者である場合は、この端末での画像の撮像の際に、画像データに対して正解データとして健全度を関連付けて、サーバー1020に送信する構成とすることができる。このようにして収集したデータから、図15で説明したような学習データを生成することができる。 Here, server 1020, which is in charge of the learning process, collects data such as "position data," "image data," and "structure name" from other concrete structures from multiple terminals 500.2 to 500.n (n: natural number). At this time, for example, if each terminal 500.2 to 500.n is operated by a skilled professional, the data collected by server 1020 can be configured so that when an image is captured on the terminal, the soundness is associated with the image data as correct answer data and transmitted to server 1020. Learning data such as that described in FIG. 15 can be generated from the data collected in this way.
 あるいは、端末500.2~500.nから、「位置データ」「画像データ」「構造物名」のみが送信されてきたときには、サーバー側で専門技能者が、正解データとして健全度を関連付ける処理を行ってもよい。 Alternatively, when only "location data," "image data," and "structure name" are sent from terminals 500.2 to 500.n, a specialized technician on the server side may perform a process to associate the soundness with the correct data.
 サーバー1020では、このようにして逐次、累積される学習データを用いることで、人工知能の学習済みモデルを再学習させて、分類性能の向上を図ることが可能となる。 In this way, server 1020 can use the accumulated learning data to retrain the AI's trained model and improve classification performance.
 図17は、図16に示した端末500.1の構成を示す機能ブロック図である。 FIG. 17 is a functional block diagram showing the configuration of terminal 500.1 shown in FIG. 16.
 なお、端末500.2~500.nも、同様の構成を有するので、その説明は繰り返さない。 Note that terminals 500.2 to 500.n have a similar configuration, so their explanation will not be repeated.
 図17を参照して、本実施形態の端末500.1は、端末の通信動作や入出力動作を制御するための制御部5010と、無線LANおよび移動体通信を行うためベースバンド信号を生成して変復調回路・装置へ送出したり、受信したベースバンド信号から元のデータや信号を得る通信処理部5040と、静止画または動画を撮影するための撮像センサ5050と、撮像センサ5050からの信号を所定のフォーマットの電気信号に変換する画像取得部5060と、端末側での画像表示を制御するための表示制御部5070と、表示制御部5070に制御されて画像を表示する表示部5080と、端末500.1の位置を測位して取得する位置取得部5090と、外部からの情報の入力を受け付ける入力インタフェース部5100とを備える。 Referring to FIG. 17, the terminal 500.1 of this embodiment includes a control unit 5010 for controlling the communication operation and input/output operation of the terminal, a communication processing unit 5040 for generating baseband signals for wireless LAN and mobile communication and sending them to a modulation/demodulation circuit/device, and for obtaining original data or signals from received baseband signals, an imaging sensor 5050 for capturing still images or videos, an image acquisition unit 5060 for converting signals from the imaging sensor 5050 into electrical signals in a predetermined format, a display control unit 5070 for controlling image display on the terminal side, a display unit 5080 for displaying images under the control of the display control unit 5070, a position acquisition unit 5090 for measuring and acquiring the position of the terminal 500.1, and an input interface unit 5100 for receiving input of information from the outside.
 特に、限定されないが、撮像センサ5050としては、レンズとCCD(Charge-Coupled Device)センサが一体となったモジュールや、レンズとCMOSセンサとが一体となったモジュールを使用することができる。 In particular, although not limited to, the imaging sensor 5050 may be a module that combines a lens and a CCD (Charge-Coupled Device) sensor, or a module that combines a lens and a CMOS sensor.
 また、位置取得部5090としては、屋外での測位手段として、GPS(Global Positioning System)を利用する測位装置のほか、準天頂衛星からの信号も利用する測位装置のほか、ビーコン信号その他を利用して屋内での測位を可能とする装置など、端末500.1の位置情報を取得することが可能な装置であれば、これらの装置に限定されない。 The location acquisition unit 5090 is not limited to a positioning device that uses GPS (Global Positioning System) as an outdoor positioning means, a positioning device that also uses signals from quasi-zenith satellites, a device that enables indoor positioning using beacon signals, etc., and may be any device capable of acquiring location information of the terminal 500.1.
 入力インタフェース部5100は、タッチパネルによる文字入力や音声入力の音声認識などを利用して、外部からの入力をテキストデータに変換する。 The input interface unit 5100 converts external input into text data using a touch panel or voice recognition of voice input.
 制御部5010は、画像取得部5060からの画像データと、位置取得部5090からの位置データと、入力インタフェース部5100からの構造物名のデータなどの情報を統合して、通信処理部5040からサーバー1010に向けて送信するための取得画像送信処理部5020と、通信処理部5040を介して受信したサーバー1010からの構造物名、健全度のデータ、画像と注目領域とを示すデータとから、表示制御部5070により、表示部5080に表示させる判定表を表す画像データを生成する判定表生成部5030とを含む。 The control unit 5010 includes an acquired image transmission processing unit 5020 that integrates information such as image data from the image acquisition unit 5060, position data from the position acquisition unit 5090, and data on the structure name from the input interface unit 5100, and transmits the integrated information from the communication processing unit 5040 to the server 1010, and a judgment table generation unit 5030 that generates image data representing a judgment table to be displayed on the display unit 5080 by the display control unit 5070 from the structure name, health data, and data indicating the image and area of interest received from the server 1010 via the communication processing unit 5040.
 図18は、図17に示した端末500.1のハードウェア構成を説明するためのブロック図である。 FIG. 18 is a block diagram for explaining the hardware configuration of terminal 500.1 shown in FIG. 17.
 制御部5010に相当して、MPU(Micro Processing Unit)またはCPU(Central Processing Unit)などの演算装置501、RAM502、ROM503等からなる記憶装置を備え、所定の基本OSやミドルウェア等のプログラムが実行されることにより、各部を制御したり、ソフトウェア構成上のネイティブプラットフォーム環境やアプリケーション実行環境を構築したりする。 The control unit 5010 is equipped with a calculation device 501 such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), a storage device consisting of a RAM 502, a ROM 503, etc., and controls each part and creates a native platform environment and application execution environment in the software configuration by executing a predetermined basic OS, middleware, etc.
 撮像装置505としては、上述のようなカメラモジュールが使用され、測位装置509としては、上述のようなGPSその他の測位装置が使用される。 As the imaging device 505, a camera module as described above is used, and as the positioning device 509, a GPS or other positioning device as described above is used.
 表示装置508としては、液晶パネルや有機ELパネルが使用され、操作装置510としては表示パネルと一体となったタッチパネルであってもよいし、音声認識装置であってもよい。 The display device 508 may be a liquid crystal panel or an organic EL panel, and the operation device 510 may be a touch panel integrated with the display panel, or a voice recognition device.
 制御部5010の記憶装置は、例えば、一時記憶装置としてのRAMや不揮発性記憶装置としてのフラッシュメモリなどの半導体メモリを含む。この不揮発性記憶装置は、各部での処理に用いられるドライバプログラム、オペレーティングシステムプログラム、アプリケーションプログラム、データ等を記憶する。 The storage device of the control unit 5010 includes, for example, semiconductor memory such as RAM as a temporary storage device and flash memory as a non-volatile storage device. This non-volatile storage device stores driver programs, operating system programs, application programs, data, etc. used for processing in each unit.
 例えば、不揮発性記憶装置は、ドライバプログラムとして、IEEE802.11規格の無線通信方式や移動体通信(セルラー通信)の無線通信方式を実行する通信ドライバプログラム、操作装置510を制御する入力デバイスドライバプログラム、表示装置508を制御する出力デバイスドライバプログラム等を記憶する。 For example, the non-volatile storage device stores driver programs such as a communication driver program that executes a wireless communication method conforming to the IEEE 802.11 standard or a wireless communication method for mobile communication (cellular communication), an input device driver program that controls the operation device 510, and an output device driver program that controls the display device 508.
 また、不揮発性記憶装置は、オペレーティングシステムプログラムとして、例えば、Android(登録商標)OS、iOS(登録商標)等の基本OSや、IEEE802.11規格の無線通信方式や移動体通信(セルラー通信)の無線通信方式での認証等を行う接続制御プログラム等を記憶する。 The non-volatile storage device also stores operating system programs, such as basic OSs such as Android (registered trademark) OS and iOS (registered trademark), and connection control programs that perform authentication in wireless communication methods such as the IEEE 802.11 standard and wireless communication methods for mobile communication (cellular communication).
 通信インタフェース504は、無線LAN通信およびセルラー方式の移動体通信ネットワークの基地局(図示せず)を介して通信する移動体通信を実行するための機能を有する。 The communication interface 504 has the functionality to perform wireless LAN communication and mobile communication via a base station (not shown) of a cellular mobile communication network.
 以上説明したように、実施の形態2の画像分類学習装置1000および分類装置4000によれば、または、実施の形態2の学習プログラム、分類プログラムによれば、入力されたコンクリート構造物の表面画像データに基づいて、コンクリートの健全度の判定に関する情報を検査者側の端末で得ることができるとともに、人工知能の学習済みモデルが分類に使用した画像領域を確認することができる。 As described above, according to the image classification learning device 1000 and classification device 4000 of embodiment 2, or according to the learning program and classification program of embodiment 2, information relating to judging the soundness of concrete can be obtained on the inspector's terminal based on the input surface image data of a concrete structure, and the image area used for classification by the trained model of artificial intelligence can be confirmed.
 その結果、人工知能による健全度の判断を利用して、あるいは、補助情報として、専門技能者が、対処方法を判断することが容易となる。 As a result, it will be easier for specialists to determine how to respond by using the artificial intelligence's assessment of the health level, or as supplementary information.
 なお、特に限定されないが、実施の形態1と同様に、「学習済みモデル(分類器)」が、プログラムとして、または、プログラムの一部として、コンピュータ読取り可能な記録媒体に記録されて、他のコンピュータにインストールされることがあってもよい。 Although not limited thereto, as in the first embodiment, the "trained model (classifier)" may be recorded, as a program or as part of a program, on a computer-readable recording medium and installed on another computer.
 [実施の形態3] [Embodiment 3]
 実施の形態2では、図15に示したような学習データにより、図1で説明したような画像分類学習装置1000が学習処理を実行することで、コンクリート構造物の健全度の分類だけでなく、どのような特徴部分(特徴領域)に注目して、人工知能が健全度を判断したのかを、人間が判断可能となる構成について説明した。 In the second embodiment, an image classification learning device 1000 as described in FIG. 1 executes a learning process using learning data as shown in FIG. 15, thereby enabling a human to not only classify the soundness of a concrete structure, but also to determine which characteristic parts (characteristic areas) the artificial intelligence focused on in determining the soundness.
 実施の形態3では、さらに、画像分類学習装置1000が学習処理を実行することで、画像データを入力として、対応するコンクリート構造物の健全度を分類するだけでなく、そのようなコンクリート構造物に対する対処方法も出力するような画像分類学習装置1000および分類装置4000の構成について説明する。 In the third embodiment, the configuration of the image classification learning device 1000 and the classification device 4000 is further described, in which the image classification learning device 1000 executes a learning process so that, using image data as input, it not only classifies the soundness of the corresponding concrete structure, but also outputs a method of dealing with such a concrete structure.
 図19は、画像データ、画像に対応する健全度のラベルおよび対処措置のラベルのデータによる学習データの構成を説明するための図である。 Figure 19 is a diagram for explaining the composition of learning data that includes image data, health level labels corresponding to the images, and corrective action labels.
 図19に示すように、画像データ、画像に対応する健全度のラベルおよび対処措置のラベルのデータが集積されると、このようなデータを学習データとして利用することが可能となる。 As shown in Figure 19, when image data, health level labels corresponding to the images, and corrective action labels are collected, such data can be used as learning data.
 もっとも、実施の形態2のようなシステムを利用することで、コンクリート構造物についての画像データや、それに対応する健全度の情報が収集され、さらに、その画像データにより表現されるコンクリート構造物に対する対処方法についてのデータが集積されることになる。 However, by using a system like that of embodiment 2, image data about concrete structures and corresponding information about their soundness can be collected, and data on how to deal with the concrete structures represented by the image data can also be accumulated.
 図19では、健全度IIIの場合を例示しているものの、他の健全度についても、同様なデータが学習データとして準備されているものとする。 In Figure 19, the case of health level III is shown as an example, but similar data is also prepared as learning data for other health levels.
 図20は、実施の形態3の画像分類学習装置1000および分類装置4000の構成を説明するための機能ブロック図である。 FIG. 20 is a functional block diagram for explaining the configuration of the image classification learning device 1000 and classification device 4000 according to the third embodiment.
 図20を参照して、実施の形態3の画像分類学習装置1000においては、図1に示した構成に加えて、分類器400だけではなく、措置判別器410にも、概念活性度tが入力され、さらに、措置判別器410には、図19で示したような対応措置を示す対応措置ラベルも入力されて学習処理が実行されるものとする。 Referring to FIG. 20, in the image classification learning device 1000 of embodiment 3, in addition to the configuration shown in FIG. 1, the concept activity t is input not only to the classifier 400 but also to the action discriminator 410, and furthermore, a response action label indicating a response action as shown in FIG. 19 is also input to the action discriminator 410 to execute the learning process.
 学習処理制御部700は、措置判別器410から出力される対応措置が教師データと一致するように、学習処理を実行する。 The learning process control unit 700 executes the learning process so that the response measures output from the action discriminator 410 match the teacher data.
 そして、分類装置4000においては、図6に示した構成に加えて、このようにして学習済みとなった措置判別器410が、対応措置を出力する。 In the classification device 4000, in addition to the configuration shown in FIG. 6, the action discriminator 410 that has been trained in this way outputs a response action.
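A minimal sketch of such an action discriminator, assuming (as an illustration only) a single linear head on the concept activation vector t, trained with cross entropy against the countermeasure labels of Fig. 19:

    import torch
    import torch.nn as nn

    class ActionDiscriminator(nn.Module):
        # Maps the concept activation vector t to countermeasure logits.
        def __init__(self, num_concepts: int, num_actions: int):
            super().__init__()
            self.fc = nn.Linear(num_concepts, num_actions)

        def forward(self, t: torch.Tensor) -> torch.Tensor:
            return self.fc(t)

    # criterion = nn.CrossEntropyLoss()
    # loss = criterion(discriminator(t), action_label)  # matches the teacher data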
 したがって、実施の形態3の画像分類学習装置1000および分類装置4000によれば、または、実施の形態3の学習プログラム、分類プログラムによれば、入力されたコンクリート構造物の表面画像データに基づいて、コンクリートの健全度の判定に関する情報を得ることができるとともに、人工知能の学習済みモデルが分類に使用した画像領域と、さらに、対応措置に関する情報も、人間、特に、専門技能者が確認することができる。 Therefore, according to the image classification learning device 1000 and classification device 4000 of embodiment 3, or according to the learning program and classification program of embodiment 3, it is possible to obtain information relating to the assessment of the soundness of concrete based on the input surface image data of a concrete structure, and a human, in particular a skilled professional, can confirm the image area used for classification by the trained model of artificial intelligence, as well as information relating to countermeasures.
 その結果、人工知能による健全度の判断の情報および対応措置に関する情報を利用して、あるいは、補助情報として、専門技能者が、対処方法を判断することが容易となる。 As a result, it becomes easy for a skilled technician to determine how to respond, using the artificial intelligence's soundness assessment and the countermeasure information, or treating them as supplementary information.
 今回開示された実施の形態は、本発明を具体的に実施するための構成の例示であって、本発明の技術的範囲を制限するものではない。本発明の技術的範囲は、実施の形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲の文言上の範囲および均等の意味の範囲内での変更が含まれることが意図される。 The embodiments disclosed herein are illustrative of configurations for specifically implementing the present invention, and do not limit the technical scope of the present invention. The technical scope of the present invention is indicated by the claims, not by the description of the embodiments, and is intended to include modifications within the literal scope of the claims and within the scope of equivalent meanings.
 100 バックボーンCNN、200 概念学習器、300 概念正則化部、400 分類器、500 量子化誤差算出部、600 損失算出部、700 学習処理制御部、2002 位置情報エンコード処理部、2004,2020 整形処理部、2006,2008 非線形処理部、2010 類似度算出部、2012 正規化部、2030 概念生起度算出部、3010 識別損失算出部、3020 再構成ベース損失算出部、3030 検索ベース損失算出部、4000 画像分類装置、4100 概念プロトタイプ記憶部。 100 Backbone CNN, 200 Concept learner, 300 Concept regularizer, 400 Classifier, 500 Quantization error calculator, 600 Loss calculator, 700 Learning process controller, 2002 Position information encoding processor, 2004, 2020 Shaping processor, 2006, 2008 Nonlinear processor, 2010 Similarity calculator, 2012 Normalizer, 2030 Concept occurrence calculator, 3010 Classification loss calculator, 3020 Reconstruction-based loss calculator, 3030 Search-based loss calculator, 4000 Image classification device, 4100 Concept prototype memory.

Claims (14)

  1.  画像分類学習装置であって、
     複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データを格納するための記憶装置と、
     前記記憶装置に格納された前記学習データを読み出して、前記画像データを分類するための前記画像データ中の複数のコンセプトを機械学習する処理を実行するための演算処理手段とを備え、前記演算処理手段は、
      前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成し前記記憶装置に格納する画像識別手段と、
      前記複数のコンセプトの各々に対応し、前記画像識別手段の識別処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換し前記記憶装置に格納する注意機構処理手段と、
      前記画像識別手段の識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出する損失評価手段と、
      前記損失を減少させるように、前記記憶装置に格納される前記分類モデルと前記概念行列とに対する機械学習を実行する学習処理手段とを含む、画像分類学習装置。
    An image classification learning device, comprising:
    A storage device for storing training data including a plurality of image data and image labels corresponding to the image data;
    and a calculation processing means for reading out the learning data stored in the storage device and executing a process of machine learning a plurality of concepts in the image data for classifying the image data, the calculation processing means comprising:
    an image classification means for extracting a set of features expressing the image data, learning and generating a classification model for identifying and classifying the image label for the image data based on the extracted set of features, and storing the classification model in the storage device;
    an attention mechanism processing means for converting a slot vector in a concept matrix, the slot vector corresponding to each of the plurality of concepts and defining an image region in which the feature value that is emphasized in the classification process of the image classification means appears, in accordance with an image feature defined by the slot vector, and storing the slot vector in the storage device;
    a loss evaluation means for calculating a loss based on a classification loss calculated by evaluating a classification rate of the image classification means and decreasing as the classification rate increases, and a separation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    and a learning processing means for performing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  2.  前記注意機構処理手段は、
      前記概念行列との類似度に応じて、前記特徴量の組において前記画像識別手段の識別処理において注意が向けられる前記画像領域を抽出するための注意行列を学習する注意行列学習手段を含み、
     前記画像識別手段は、
      前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを生成する概念生起度算出手段と、
      前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行する分類器とを含む、請求項1記載の画像分類学習装置。
    The attention mechanism processing means includes:
    an attention matrix learning means for learning an attention matrix for extracting the image region to which attention is directed in the classification process of the image classification means in the set of feature amounts according to a similarity to the concept matrix;
    The image identification means
    a concept occurrence calculation means for generating an activity vector having elements representing the degree to which each of the concepts corresponding to the slot vector appears in the image data based on the attention matrix;
    and a classifier that receives the activation vector corresponding to the image data as an input and performs classification for the image label.
  3.  前記画像データを表現する前記特徴量の組は、畳み込みニューラルネットワークの画像認識モデルから出力される特徴マップである、請求項1または2記載の画像分類学習装置。 The image classification learning device according to claim 1 or 2, wherein the set of features representing the image data is a feature map output from an image recognition model of a convolutional neural network.
  4.  前記分離化損失は、単一の前記コンセプトが前記特徴空間においてより小さな体積を占めるほど減少する損失である整合性損失と、前記コンセプトのペアは前記特徴空間において同じ領域を占める確率がより低くなるほど減少する損失である識別損失とを含む、請求項1または2記載の画像分類学習装置。 The image classification learning device according to claim 1 or 2, wherein the separation loss includes a consistency loss, which is a loss that decreases as a single concept occupies a smaller volume in the feature space, and an identification loss, which is a loss that decreases as the probability that a pair of concepts occupies the same region in the feature space decreases.
  5.  前記画像データは、カメラにより撮像された複数のコンクリート構造物の表面の画像のデータであり、
     前記画像ラベルは、前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示すラベルである、請求項1記載の画像分類学習装置。
    The image data is data of images of the surfaces of a plurality of concrete structures captured by a camera,
    The image classification learning device according to claim 1 , wherein the image labels are labels indicating soundness of the concrete structures corresponding to the respective image data.
  6.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトをコンピュータにより機械学習する画像分類学習方法であって、前記コンピュータは、前記学習データを格納する記憶装置と、機械学習の処理を実行するための演算装置とを含み、
     前記演算装置が、前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成するステップと、
     前記演算装置が、前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記演算装置が、前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記演算装置が、前記損失を減少させるように、前記分類モデルと前記概念行列とを学習させるステップとを備える、画像分類学習方法。
    An image classification learning method for machine learning a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels respectively corresponding to the image data, by a computer, the computer including a storage device for storing the learning data and a calculation device for executing a machine learning process;
    The computing device extracts a set of features that represent the image data, and learns and generates a classification model that identifies and classifies the image label for the image data based on the extracted set of features;
    The calculation device converts the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature value that is emphasized in the classification process appears, according to the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by an evaluation of a classification rate in classification of the image data, the classification loss decreasing as the classification rate increases, and a segregation loss calculated by an evaluation of a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space, the segregation loss decreasing as the degree of separation increases;
    The image classification learning method comprises a step in which the computing device learns the classification model and the concept matrix so as to reduce the loss.
  7.  前記画像データは、カメラにより撮像された複数のコンクリート構造物の表面の画像のデータであり、
     前記画像ラベルは、前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示すラベルである、請求項6記載の画像分類学習方法。
    The image data is data of images of the surfaces of a plurality of concrete structures captured by a camera,
    The image classification learning method according to claim 6 , wherein the image labels are labels indicating soundness of the concrete structures corresponding to the respective image data.
  8.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトをコンピュータにより機械学習する画像分類学習プログラムであって、前記コンピュータは演算装置と記憶装置とを含み、
     前記記憶装置に記憶された前記画像データについて、前記演算装置が、前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成するステップと、
     前記演算装置が、前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記演算装置が、前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記演算装置が、前記損失を減少させるように、前記分類モデルと前記概念行列とを学習させるステップとを備える、画像分類学習プログラム。
    An image classification learning program for performing machine learning of a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, by a computer, the computer including a calculation device and a storage device,
    The computing device extracts a set of features representing the image data stored in the storage device, and learns and generates a classification model that identifies and classifies the image label for the image data based on the extracted set of features;
    The calculation device converts the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature value that is emphasized in the classification process appears, according to the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by an evaluation of a classification rate in classification of the image data, the classification loss decreasing as the classification rate increases, and a segregation loss calculated by an evaluation of a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space, the segregation loss decreasing as the degree of separation increases;
    The image classification learning program includes a step in which the computing device learns the classification model and the concept matrix so as to reduce the loss.
  9.  請求項8に記載の前記画像分類学習プログラムを格納する、コンピュータ読取り可能な非一時的な記録媒体。 A non-transitory computer-readable recording medium storing the image classification learning program described in claim 8.
  10.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトを機械学習する画像分類学習方法によって生成される画像分類学習済モデルであって、
     前記画像分類学習済モデルは、
      前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを入力として、前記要素の共起関係に基づいて、前記画像データを分類する分類器モデルの構成を有し、
     前記画像分類学習済モデルは、
     前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する前記分類器モデルを学習により更新するステップと、
     前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記損失を減少させるように、前記モデルと前記概念行列とを学習させるステップとにより生成され、
     前記スロットベクトルを変換するステップは、
     前記概念行列との類似度に応じて、前記特徴量の組において前記識別の処理において注意が向けられる前記画像領域を抽出するための注意行列を学習するステップを含み、
     前記分類器モデルを学習により更新するステップは、
     前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする前記活性度ベクトルを生成するステップと、
     前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行するよう前記分類器モデルのパラメータを学習するステップとを含む、画像分類学習済モデル。
    An image classification trained model generated by an image classification training method for machine learning a plurality of concepts in image data for classifying the image data based on training data including a plurality of image data and image labels respectively corresponding to the image data,
    The image classification trained model is
    a classifier model configured to classify the image data based on a co-occurrence relationship of an activity vector having elements representing the degree to which each of the concepts appears in the image data,
    The image classification trained model is
    extracting a set of features representing the image data, and updating the classifier model by learning, the classifier model identifying and classifying the image label for the image data based on the extracted set of features;
    A step of converting the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature amount that is emphasized in the classification process appears, in accordance with the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in classification of the image data and decreasing as the classification rate increases, and a segregation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    training the model and the concept matrix so as to reduce the loss;
    The step of converting the slot vector includes:
    learning an attention matrix for extracting the image region to which attention is directed in the classification process in the set of features according to a similarity to the concept matrix;
    The step of updating the classifier model by learning includes:
    generating an activity vector based on the attention matrix, the activity vector being determined by the degree to which each of the concepts corresponding to the slot vector appears in the image data;
    and learning parameters of the classifier model to perform classification on the image label using the activation vector corresponding to the image data as input.
  11.  画像分類学習装置であって、
     カメラにより撮像された複数のコンクリート構造物の表面の画像データと前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示す画像ラベルとを含む学習データを格納するための記憶装置と、
     前記記憶装置に格納された前記学習データに基づいて、前記画像データを前記健全度について分類するための前記画像データ中の複数のコンセプトを機械学習する処理を実行するための演算装置とを備え、前記演算装置は、
      前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成し前記記憶装置に格納する画像識別ステップと、
      前記複数のコンセプトの各々に対応し、前記分類モデルの識別処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換し前記記憶装置に格納する注意機構処理ステップと、
      前記分類モデルの識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出する損失評価ステップと、
      前記損失を減少させるように、前記記憶装置に格納される前記分類モデルと前記概念行列とに対する機械学習を実行する学習処理ステップとを実行する、画像分類学習装置。
    An image classification learning device, comprising:
    a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data;
    and a calculation device for executing a process of machine learning a plurality of concepts in the image data for classifying the image data in terms of the health level based on the learning data stored in the storage device, the calculation device comprising:
    an image classification step of extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies the image label for the image data based on the extracted set of features, and storing the classification model in the storage device;
    an attention mechanism processing step of converting the slot vectors in a concept matrix, the slot vectors corresponding to each of the plurality of concepts and defining an image region in which the feature values that are emphasized in the classification model identification process appear, in accordance with the image features defined by the slot vectors, and storing the converted slot vectors in the storage device;
    a loss evaluation step of calculating a loss based on a classification loss calculated by evaluating a classification rate of the classification model and decreasing as the classification rate increases, and a separation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    and a learning processing step of performing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  12.  前記注意機構処理ステップは、
      前記概念行列との類似度に応じて、前記特徴量の組において前記分類モデルの識別処理において注意が向けられる前記画像領域を抽出するための注意行列を学習する注意行列学習ステップを含み、
     前記画像識別ステップは、
      前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを生成する概念生起度算出ステップと、
       前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行する分類器を生成するステップとを含む、請求項11記載の画像分類学習装置。 The image classification learning device according to claim 11, further comprising: generating a classifier that performs classification on the image label using the activation vector corresponding to the image data as an input.
    The attention mechanism processing step includes:
    an attention matrix learning step of learning an attention matrix for extracting the image region to which attention is directed in a classification process of the classification model in the set of feature amounts according to a similarity to the concept matrix;
    The image identification step includes:
    A concept occurrence calculation step of generating an activity vector having an element representing the degree to which each of the concepts corresponding to the slot vector appears in the image data based on the attention matrix;
    The image classification learning device according to claim 11 , further comprising: generating a classifier that performs classification on the image label using the activation vector corresponding to the image data as an input.
  13.  前記学習処理ステップの処理は、前記活性度ベクトルと前記コンクリート構造物の表面の画像データに対応する修復対処処置の処置ラベルとを入力として、前記処置ラベルの判別を学習する処置判別モデルを生成するステップを含む、請求項11記載の画像分類学習装置。 The image classification learning device according to claim 11, wherein the learning process step includes a step of generating a treatment discrimination model that learns to discriminate between the treatment labels by inputting the activity vector and treatment labels of repair measures corresponding to the image data of the surface of the concrete structure.
  14.  複数のコンクリート構造物の表面の画像データと前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示す画像ラベルとを含む学習データに基づいて、前記画像データを前記健全度について分類するための前記画像データ中の複数のコンセプトを機械学習する画像分類学習方法によって生成される画像分類学習済モデルであって、
     前記画像分類学習済モデルは、
      前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを入力として、前記要素の共起関係に基づいて、前記画像データを分類する分類器モデルの構成を有し、
     前記画像分類学習済モデルは、
     前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する前記分類器モデルを学習により更新するステップと、
     前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記損失を減少させるように、前記モデルと前記概念行列とを学習させるステップとにより生成され、
     前記スロットベクトルを変換するステップは、
     前記概念行列との類似度に応じて、前記特徴量の組において前記識別の処理において注意が向けられる前記画像領域を抽出するための注意行列を学習するステップを含み、
     前記分類器モデルを学習により更新するステップは、
     前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする前記活性度ベクトルを生成するステップと、
     前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行するよう前記分類器モデルのパラメータを学習するステップとを含む、画像分類学習済モデル。
    An image classification trained model generated by an image classification training method that machine-learns a plurality of concepts in image data for classifying the image data in terms of the soundness, based on training data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures that correspond to the image data,
    The image classification trained model is
    a classifier model configured to classify the image data based on a co-occurrence relationship of an activity vector having an element representing the degree to which each of the concepts appears in the image data,
    The image classification trained model is
    extracting a set of features representing the image data, and updating the classifier model by learning, the classifier model identifying and classifying the image label for the image data based on the extracted set of features;
    A step of converting the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature amount that is emphasized in the classification process appears, in accordance with the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in classification of the image data and decreasing as the classification rate increases, and a segregation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    training the model and the concept matrix so as to reduce the loss;
    The step of converting the slot vector includes:
    learning an attention matrix for extracting the image region to which attention is directed in the classification process in the set of features according to a similarity to the concept matrix;
    The step of updating the classifier model by learning includes:
    generating an activity vector based on the attention matrix, the activity vector being determined by the degree to which each of the concepts corresponding to the slot vector appears in the image data;
    and learning parameters of the classifier model to perform classification on the image label using the activation vector corresponding to the image data as input.
PCT/JP2023/037394 2022-10-18 2023-10-16 Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model WO2024085114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022167046 2022-10-18
JP2022-167046 2022-10-18

Publications (1)

Publication Number Publication Date
WO2024085114A1 true WO2024085114A1 (en) 2024-04-25

Family

ID=90737809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/037394 WO2024085114A1 (en) 2022-10-18 2023-10-16 Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model

Country Status (1)

Country Link
WO (1) WO2024085114A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021165888A (en) * 2020-04-06 2021-10-14 キヤノン株式会社 Information processing apparatus, information processing method of information processing apparatus, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GODAI FUJITA: "Slot Attention", 30 July 2021 (2021-07-30), XP093161797, Retrieved from the Internet <URL:https://qiita.com/fujitagodai4/items/7964a07561e6fe5cbbb3> *
JU HE, JIE-NENG CHEN, SHUAI LIU, ADAM KORTYLEWSKI, CHENG YANG, YUTONG BAI, CHANGHU WANG: "TransFG: A Transformer Architecture for Fine-Grained Recognition", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 36, no. 1, 28 June 2022 (2022-06-28), pages 852 - 860, XP093161801, ISSN: 2159-5399, DOI: 10.1609/aaai.v36i1.19967 *

Similar Documents

Publication Publication Date Title
Kim et al. Crack and noncrack classification from concrete surface images using machine learning
Xu et al. Surface fatigue crack identification in steel box girder of bridges by a deep fusion convolutional neural network based on consumer-grade camera images
Prabhu et al. Few-shot learning for dermatological disease diagnosis
CN106295124A (en) Utilize the method that multiple image detecting technique comprehensively analyzes gene polyadenylation signal figure likelihood probability amount
Ottoni et al. Tuning of data augmentation hyperparameters in deep learning to building construction image classification with small datasets
Yang et al. Hyperspectral image classification with spectral and spatial graph using inductive representation learning network
EP3859666A1 (en) Classification device, classification method, program, and information recording medium
CN113283282B (en) Weak supervision time sequence action detection method based on time domain semantic features
Prabhu et al. Prototypical clustering networks for dermatological disease diagnosis
Apeagyei et al. Evaluation of deep learning models for classification of asphalt pavement distresses
Hoang Classification of asphalt pavement cracks using Laplacian pyramid‐based image processing and a hybrid computational approach
Gu et al. Remembering normality: Memory-guided knowledge distillation for unsupervised anomaly detection
CN116872961B (en) Control system for intelligent driving vehicle
Aslan et al. Using artifical intelligence for automating pavement condition assessment
CN117010971B (en) Intelligent health risk providing method and system based on portrait identification
Dhawan et al. Deep Learning Based Sugarcane Downy Mildew Disease Detection Using CNN-LSTM Ensemble Model for Severity Level Classification
Akiyama et al. Evaluating different deep learning models for automatic water segmentation
Yamaguchi et al. Road crack detection interpreting background images by convolutional neural networks and a self‐organizing map
WO2024085114A1 (en) Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model
Hoang et al. Image processing-based classification of pavement fatigue severity using extremely randomized trees, deep neural network, and convolutional neural network
Altaei et al. Satellite image classification using multi features based descriptors
Harshavardhan et al. Detection of Various Plant Leaf Diseases Using Deep Learning Techniques
Stark et al. Quantifying uncertainty in slum detection: advancing transfer-learning with limited data in noisy urban environments
Thiyagarajan Performance Comparison of Hybrid CNN-SVM and CNN-XGBoost models in Concrete Crack Detection
JP2021063706A (en) Program, information processing device, information processing method and trained model generation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23879757

Country of ref document: EP

Kind code of ref document: A1