WO2024085114A1 - Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model - Google Patents

Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model

Info

Publication number
WO2024085114A1
WO2024085114A1 · PCT/JP2023/037394 · published as WO 2024/085114 A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
classification
learning
image data
loss
Prior art date
Application number
PCT/JP2023/037394
Other languages
French (fr)
Japanese (ja)
Inventor
修一 鶴田
悠太 中島
良知 李
博文 王
Original Assignee
国立大学法人大阪大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人大阪大学
Publication of WO2024085114A1 publication Critical patent/WO2024085114A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present invention relates to an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can help humans understand the process of performing image classification.
  • explainable artificial intelligence (XAI) has been attracting attention because understanding the behavior of neural networks is a major challenge, especially for medical applications (see Non-Patent Document 1) and for identifying biases in neural networks (see Non-Patent Document 2).
  • for this reason, a great deal of research effort has been devoted to providing post-hoc explanations of artificial intelligence models after they have been generated by machine learning (see Non-Patent Document 3).
  • This kind of explanation successfully provides a low-level (or pixel-by-pixel) relationship between the image and the model's judgment by highlighting some regions in the image as a heat map, but the interpretation of these relationships remains a problem.
  • an attention matrix (attention weights) is generated from the similarity between a query (Q), generated from a weight matrix generation model consisting of multiple weight vector columns (slots), and a key (K) representing the image features; this attention matrix is used to extract regions from the image that are used for object detection.
  • This representation of image features in “object detection” is also called “object-centric representation” (see non-patent document 5).
  • a simple way to predefine concepts is to use human knowledge (see non-patent document 7).
  • Other methods use a manually created set of concepts and quantify the importance of each concept to the decision using directional derivatives, while the Broden dataset unifies several densely labeled image datasets to provide a large concept corpus that can be used to directly and automatically match Convolutional Neural Network (CNN) representations with labeled interpretations (see non-patent document 8).
  • SENN (Self-Explaining Neural Networks) by Alvarez-Melis et al. utilizes a concept bottleneck and treats the activation of concepts as input to a regression model (see Non-Patent Document 9).
  • (Application of image recognition processing to determine the soundness of concrete structures, etc.)
  • the Ministry of Land, Infrastructure, Transport and Tourism's Guidelines for Periodic Bridge Inspection (see Non-Patent Document 10) state that the damage level of concrete walls is classified based on the crack width, whether the cracks form a lattice pattern, and the occurrence of water leakage and free lime.
  • Inspection of concrete structures requires close visual inspection by technicians with specialized knowledge, and this is done based on a comprehensive judgment that takes into account various aspects such as the state of deterioration, type, location, and traffic volume. In other words, judging the soundness of concrete structures relies heavily on the know-how (tacit knowledge) of experienced technicians, which cannot be put into a manual.
  • Patent Document 1 discloses a configuration in which cracks are detected as deformed areas using a feature map created with a CNN (Convolutional Neural Network) and the crack width is determined as attribute information of the deformed area.
  • Patent Document 2 also discloses a configuration that uses deep learning to provide a performance evaluation system for concrete structures that makes it possible to efficiently carry out a series of maintenance and management tasks, from inputting deformations to performance inspections.
  • the deep learning unit performs machine learning using artificial intelligence based on the discrepancy between the results automatically calculated by the performance evaluation system, which are accumulated for each inspection, and the results corrected by the inspector.
  • the configuration disclosed shows that the results of the machine learning are then reflected in subsequent judgments and predictions.
  • Patent Document 3 discloses a technology that makes it possible to determine the condition of a wide area based on both a local area and a wide area, and to determine the degree of damage to the concrete wall surface of an infrastructure structure.
  • concept learning is guided by a learning process using an autoencoder structure to reconstruct the original image. It is not yet clear whether such a structure can be applied to learning from natural images.
  • the present invention has been made to solve the above problems, and aims to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can learn the "concepts" that a trained model uses to make judgments through learning on a given task, so that the judgment process can be compared with that of a human.
  • the present invention also aims to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can assist or replace the judgment of the soundness of concrete structures by using a trained model of artificial intelligence.
  • an image classification learning device includes a storage device for storing learning data including a plurality of image data and image labels corresponding to the image data, and a calculation processing means for reading out the learning data stored in the storage device and executing a process of machine learning a plurality of concepts in the image data for classifying the image data.
  • the calculation processing means includes an image recognition means for extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the classification model in the storage device; an attention mechanism processing means for converting a slot vector in a concept matrix, in which slot vectors correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the classification processing of the image recognition means appear, according to the image feature defined by the slot vector, and storing the slot vector in the storage device; a loss evaluation means for calculating a loss based on a classification loss that is calculated by evaluating the classification rate of the image recognition means and decreases as the classification rate increases, and a separation loss that is calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in the feature space and decreases as the degree of separation increases; and a learning processing means for executing machine learning on the classification model and concept matrix stored in the storage device so as to reduce the loss.
  • the attention mechanism processing means includes an attention matrix learning means for learning an attention matrix for extracting image regions in the set of features to which attention is directed in the classification process of the image recognition means, in accordance with the degree of similarity with the concept matrix
  • the image recognition means includes a concept occurrence calculation means for generating an activity vector based on the attention matrix, the activity vector having elements representing the degree to which each concept corresponding to the slot vector appears in the image data, and a classifier for performing classification of image labels using the activity vector corresponding to the image data as input.
  • the set of features representing the image data is a feature map output from a convolutional neural network image recognition model.
  • the separation loss includes a consistency loss, which decreases as a single concept occupies a smaller volume in the feature space, and a discriminability loss, which decreases as pairs of concepts become less likely to occupy the same region in the feature space.
  • the image data is data of images of the surfaces of multiple concrete structures captured by a camera
  • the image labels are labels indicating the soundness of the concrete structures that each correspond to the image data.
  • an image classification learning method in which a computer learns multiple concepts in image data for classifying image data based on learning data including multiple image data and image labels corresponding to the image data, the computer including a storage device for storing the learning data and a calculation device for executing a machine learning process, the method including a step of extracting a set of features expressing the image data and learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step of converting a slot vector in a concept matrix including slot vectors that correspond to each of the multiple concepts and define an image region in which a feature that is emphasized in the classification process appears, according to an image feature defined by the slot vector; a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in the classification of image data and decreasing as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to the multiple concepts are separated from each other in a feature space and decreasing as the degree of separation increases; and a step of learning the classification model and the concept matrix so as to reduce the loss.
  • the image data is data of images of the surfaces of multiple concrete structures captured by a camera
  • the image labels are labels indicating the soundness of the concrete structures that each correspond to the image data.
  • an image classification learning program for machine learning a plurality of concepts in image data for classifying image data, based on learning data including a plurality of image data and image labels corresponding to the image data, by a computer.
  • the computer includes a calculation device and a storage device, and includes the steps of: for image data stored in the storage device, the calculation device extracts a set of features expressing the image data, and learns and generates a classification model that identifies and classifies image labels for the image data based on the extracted set of features; the calculation device converts slot vectors in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the classification process appear, according to image features defined by the slot vectors; the calculation device calculates a loss based on a classification loss calculated by evaluating a classification rate in the classification of image data and that decreases as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in a feature space and that decreases as the degree of separation increases; and the calculation device learns the classification model and the concept matrix so as to reduce the loss.
  • the computer-readable non-transitory recording medium stores an image classification learning program.
  • an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying image data based on training data including a plurality of image data and image labels corresponding to the image data, the image classification trained model having a configuration of a classifier model that uses as input an activation vector whose elements are the degree to which each of the concepts appears in the image data and classifies the image data based on the co-occurrence relationship of the elements, the image classification trained model includes a step of extracting a set of features that express the image data, and updating by learning a classifier model that identifies and classifies an image label for the image data based on the extracted set of features, and a step of converting a slot vector in a concept matrix composed of slot vectors that correspond to each of the plurality of concepts and define an image area in which a feature that is emphasized in the classification process appears, according to an image feature defined by the slot vector.
  • the trained model is generated by a step of calculating a loss based on a classification loss calculated by evaluating the classification rate in classifying image data and decreasing as the classification rate increases, and a separation loss calculated by evaluating the degree to which features corresponding to multiple concepts are separated from each other in feature space and decreasing as the degree of separation increases, and a step of training the classifier model and the concept matrix so as to reduce the loss
  • the step of converting the slot vector includes a step of training an attention matrix for extracting image regions in the set of features to which attention is directed in the classification process according to the similarity with the concept matrix
  • the step of updating the classifier model by training includes a step of generating an activation vector based on the attention matrix, the elements of which are the degree to which each of the concepts corresponding to the slot vector appears in the image data, and a step of training parameters of the classifier model to perform classification for image labels using the activation vector corresponding to the image data as an input.
  • an image classification learning device includes a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data, and a calculation device for executing a process of machine learning a plurality of concepts in the image data for classifying the image data in terms of soundness based on the learning data stored in the storage device, the calculation device performing an image identification step of extracting a set of features representing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the classification model in the storage device;
  • the system executes an attention mechanism processing step in which the slot vectors are converted and stored in the storage device according to the image features defined by the slot vectors in a concept matrix consisting of slot vectors that correspond to each of the concepts and define the image regions in which the features that are emphasized in the classification model's classification process appear; a loss evaluation step in which a loss is calculated based on a classification loss that is calculated by evaluating the classification rate of the classification model and decreases as the classification rate increases, and a separation loss that is calculated by evaluating the degree to which features corresponding to the plurality of concepts are separated from each other in the feature space and decreases as the degree of separation increases; and a learning process step of learning the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  • the attention mechanism processing step includes an attention matrix learning step of learning an attention matrix for extracting image regions in the set of features to which attention is directed in the classification model identification process according to the degree of similarity with the concept matrix
  • the image identification step includes a concept occurrence calculation step of generating an activation vector based on the attention matrix, the activation vector having elements representing the degree to which each concept corresponding to the slot vector appears in the image data, and a step of generating a classifier that performs classification on image labels using the activation vector corresponding to the image data as input.
  • the learning process step includes a step of generating a treatment discrimination model that learns to discriminate treatment labels using the activity vector and treatment labels of repair measures corresponding to image data of the surface of the concrete structure as input.
  • an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying image data regarding soundness, based on learning data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures corresponding to the image data, the image classification trained model having a configuration of a classifier model that uses as input an activity vector having elements representing the degree to which each of the concepts appears in the image data, and classifies the image data based on a co-occurrence relationship of the elements, the image classification trained model being generated by the steps of: extracting a set of features that represent the image data, and updating by learning a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; and converting the slot vectors in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which features that are emphasized in the discrimination process appear, according to the image features defined by the slot vectors.
  • the image classification learning device, image classification learning method, and image classification learning program of the present invention enable humans to understand what image feature regions are used as the basis for classification by a trained model generated by artificial intelligence learning how to classify images.
  • the feature regions of this image are separated to minimize overlap between different classification classes, so even in classification tasks involving natural images, the activity of the feature regions during the separation process can be displayed and visualized in a way that allows comparison with the "concepts" humans use for classification.
  • when the image classification learning device, image classification learning method, and image classification learning program of the present invention are applied to determining the soundness of concrete structures, it becomes possible to make soundness determinations that utilize the accumulated judgment know-how of engineers and experts.
  • FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to a first embodiment.
  • FIG. 2 is a functional block diagram for explaining the configuration of a concept regularization unit 300.
  • FIG. 3A is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 3B is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 3C is a conceptual diagram showing the concept of processing by the concept regularizer 300.
  • FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000.
  • FIG. 5 is a flowchart for explaining the learning process of the image classification learning device 1000.
  • FIG. 6 is a functional block diagram for explaining the configuration of the image classification device 4000 when performing classification processing for a new image.
  • FIG. 7 is a conceptual diagram for explaining the processing performed by the classifier 400.
  • FIG. 8A shows the classification performance of the classifier 400 for CUB200 and ImageNet.
  • FIG. 8B shows the classification performance of the classifier 400 for CUB200 and ImageNet.
  • FIG. 9A is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 9B is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 9C is a diagram for explaining the validity of a concept represented by concept activity level t.
  • FIG. 10 is a diagram showing the attention levels of the five most important concepts (based on "importance" to be described later) for an input image of a black bird with a yellow head.
  • FIG. 11 is a diagram for explaining a concept represented by concept activity level t for a natural image.
  • FIG. 12 is a diagram showing the importance of each concept in the CUB200 dataset.
  • FIG. 13 is a diagram showing the magnitude of each hyperparameter and the accuracy rate, consistency, and discriminability.
  • FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of the second embodiment.
  • FIG. 15 is a conceptual diagram showing the configuration of learning data for generating a trained model of artificial intelligence such as that shown in FIG. 14.
  • FIG. 16 is a diagram showing an example of a system configuration for determining the soundness of concrete.
  • FIG. 17 is a functional block diagram showing the configuration of a terminal 500.1.
  • FIG. 18 is a block diagram for explaining the hardware configuration of the terminal 500.1.
  • FIG. 19 is a diagram for explaining the configuration of learning data based on image data, soundness level labels corresponding to the images, and corrective action labels.
  • FIG. 20 is a functional block diagram for explaining the configurations of an image classification learning device 1000 and a classification device 4000 according to a third embodiment.
  • the image classification learning device of the present invention will be described as a computer program that is installed on a standalone computer device and executes the image classification learning method.
  • the processing of the image classification learning device may be distributed among multiple computer devices, and the arithmetic device that executes the computer processing may be single or multiple.
  • the processing of the image classification learning device is not limited to a program installed in such a computer device, and may generally be realized as an arithmetic processing device such as a microcomputer that combines an arithmetic device and a storage device, or may be implemented in a dedicated IC circuit, an FPGA (Field-Programmable Gate Array), or other electronic circuit.
  • [Embodiment 1] (Concept-based image classification)
  • the term “concept” refers to a feature region in an “image” in a training dataset to which the classifier “attends” when performing classification in machine learning of an image classifier using a neural network, and that is separated to the extent that it satisfies a predetermined condition.
  • the method of "classification based on concepts” is also called “concept-based classification.”
  • predetermined conditions refer to conditions that enable the trained model to learn concepts so that the original image can be reconstructed or identified from the activation vector alone, while making feature values of feature regions (in different images) corresponding to the same concept as similar as possible, and making feature values of feature regions corresponding to different concepts as dissimilar as possible, regardless of the correct label.
  • the image classifier described below is an artificial intelligence learning model that can learn the optimal bottleneck “concepts” for the target image classification task in parallel with learning the image classification task itself, based only on the images that are the training data and the labels that indicate the image classes.
  • the model structure (mathematical configuration, parameter configuration) before learning is called the “learning model,” and after the model parameter values are determined by the learning process, it is called the “trained model.”
  • the “trained model” functions as part of a program by being installed on a computer.
  • the "trained model (classifier)" may be recorded as a program or as part of a program on a computer-readable recording medium and installed on a computer other than the one that performed the learning process.
  • Such a “learning model” includes a “(self) attention mechanism” (described later) and makes it possible to identify the areas in which each of the above-mentioned concepts are discovered during the machine learning process. By displaying such “learning images” that share the detected “concept” together, humans can easily understand what each of the learned concepts represents, thereby providing clues for interpreting the classification and judgment processes.
  • the "attention mechanism” has the function of gating the channels of the "feature map” extracted from the "images" of the input learning data, so that a lot of map information that is considered noteworthy passes through, and not much map information that is considered not noteworthy passes through.
  • the following embodiments aim to provide an image classification learning device, an image classification learning method, and an image classification learning program.
  • the "trained model (image classifier)" of the embodiments uses the activation level of each concept as input to characterize and classify images.
  • [Embodiment 1] (Configuration of an image classification learning device that learns concepts)
  • FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to the first embodiment.
  • the image classification learning device 1000 uses as input learning data consisting of multiple pieces of image data and image labels (indicating the class to be classified) associated with each piece of image data, and generates a trained model for image classification.
  • the image dataset that serves as the input learning data is D = {(x_i, y_i) | i = 1, …, N}, where x_i is an image and y_i is an object class in the set Ω associated with x_i.
  • the image classification learning apparatus 1000 learns a set of k concepts using only the labels of the images.
  • the image classification learning device 1000 includes a convolutional neural network (hereinafter, referred to as a CNN backbone) 100 that serves as a backbone for generating a feature map from input image data, a concept learner 200, a concept regularizer 300, a classifier 400, a quantization error calculator 500, a loss calculator 600 that calculates the amount of loss during learning as described below, and a learning process controller 700 that controls the learning process according to the loss calculated by the loss calculator 600.
  • the CNN backbone 100, the concept learner 200, the concept regularizer 300, the classifier 400, the quantization error calculator 500, the loss calculator 600, and the learning process control unit 700 correspond to functions realized by a computing device that operates based on a program; in this program, each can be implemented, for example, as a program module.
  • the concept learner 200, concept regularizer 300, classifier 400, and quantization error calculator 500 can be configured as modules in separate neural networks, with parameters adjusted by the learning process control unit 700 based on the loss calculated by the loss calculator 600.
  • the CNN backbone 100 can also be included in the learning target, resulting in a so-called "end-to-end" configuration, and the configuration of the neural network/artificial intelligence is not limited to this configuration.
  • the CNN backbone 100 extracts, for input image data x, a feature map F ∈ R^(c×h×w).
  • c is the number of channels, or feature maps.
  • the CNN backbone 100 divides the input image into h × w regions, and in each of these regions there is a vector with c elements. This makes F a c × h × w feature map.
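  • as a concrete illustration, the feature-map extraction by the CNN backbone 100 could be sketched in Python/PyTorch as follows; the choice of ResNet-18 and the tensor sizes are illustrative assumptions, not details fixed by this disclosure:

```python
import torch
import torchvision

# CNN backbone 100: any CNN works; ResNet-18 is an assumed example.
backbone = torchvision.models.resnet18(weights=None)
# Drop global pooling and the FC head so that spatial structure is kept.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)   # input image data x (batch of 1)
F = feature_extractor(x)          # feature map F: (batch, c, h, w)
print(F.shape)                    # torch.Size([1, 512, 7, 7])
```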
  • the concept prototype processing unit 2100 learns the concept matrix W according to a procedure described below, and each column vector of the matrix W is referred to in this specification as a "concept prototype" to be learned.
  • the concept learner 200 generates a concept activity t indicating the presence of each concept, and an image feature V from the region where each concept exists in x.
  • the concept activity t is used as an input to the classifier 400, which learns to calculate a score s indicating the classification result of the image class.
  • the concept activity level t ∈ [0, 1]^k, the image feature amount V ∈ R^(k×c), and the score s ∈ R^(|Ω|) are used in the following description.
  • the concept regularization unit 300 receives the concept activity t and the image feature V as input, and in the concept prototype update process, as described below, imposes constraints for the consistency of individual concepts and the mutual distinguishability between concepts, and also performs self-supervised learning.
  • (Concept Learner 200)
  • the concept learner 200 uses a "slot attention" technique based on a self-attention mechanism to learn "concepts" for the image dataset D that can be retroactively associated with features that serve as the basis for recognition in human visual recognition.
  • the position information encoding unit 2002 executes position embedding (position information encoding) processing by adding position embedding information P to the input feature map F in order to retain spatial information: F' = F + P.
  • positional information encoding is disclosed in, for example, the following document: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. Proc. NeurIPS, 2020.
  • the feature map F' with embedded position information is processed in the shaping processing unit 2004 to flatten the spatial dimensions.
  • the similarity calculation unit 2010 calculates the dot-product similarity between a query Q(W), obtained by the nonlinear processing unit 2008 applying nonlinear processing to the concept matrix W representing the concept prototypes (which is successively transformed by the concept prototype processing unit 2100), and a key K(F'), obtained by the nonlinear processing unit 2006 nonlinearly transforming the feature map F'.
  • the concept prototype (concept matrix) W is not particularly limited, but can be configured to be generated and converted by a GRU (Gated Recurrent Unit), which is a neural network model capable of learning time-series data, as described in the following literature, for example.
  • Published literature: Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. SCOUTER: Slot attention-based classifier for explainable image recognition. Proc. ICCV, pages 1046-1055, 2021.
  • the slot matrix is converted by the GRU using U^(t), which is a weighted sum of feature amounts in the spatial dimension, and the slot matrix at the previous timing.
  • the concept matrix W in this embodiment can be configured to be converted to the concept matrix W at the next timing by the GRU using an image feature V to be described later and the concept matrix W at the previous timing.
  • the method of converting the concept matrix W is not limited to this method.
  • the normalization unit 2012 calculates an “attention matrix” A as given by the following equation (1): A = φ(Q(W) K(F')^T) … (1)
  • the function φ is the normalization function.
  • This attention matrix A indicates where in the image the k concepts are located, as shown in Figure 7 below.
  • the normalization function ⁇ determines the spatial distribution of each concept, which depends on the target domain of the classification.
  • images in handwritten digit recognition datasets are typically black and white, and only the shapes formed by the strokes are important. In this case, concepts are unlikely to overlap spatially.
  • natural images have color, texture, and shape, which means concepts may overlap at the same spatial location.
  • for such cases, φ can be designed as φ(·) = σ(·) ⊙ softmax(·), where σ is the sigmoid function and ⊙ denotes the Hadamard (element-wise) product between the output of σ and that of the softmax function.
  • the softmax function is applied to the concepts (i.e., each column vector; hereafter, in this specification, this column vector will be referred to as a "slot vector") so that different concepts are not detected at the same spatial location.
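  • a minimal sketch of the attention computation of equation (1), assuming simple linear maps standing in for the nonlinear processing units 2006 and 2008 and the sigmoid/softmax design of φ described above (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

k, c, h, w = 50, 512, 7, 7              # number of concepts and feature-map size (assumed)
W = torch.randn(k, c)                    # concept matrix W (one slot vector per concept)
feat = torch.randn(c, h, w)              # feature map F from the CNN backbone 100

# Position embedding P (unit 2002) and flattening (unit 2004), simplified:
P = torch.randn(c, h, w)                 # learnable in practice
Fp = (feat + P).reshape(c, h * w).T      # F': (hw, c)

to_q = nn.Linear(c, c)                   # stands in for nonlinear processing unit 2008
to_k = nn.Linear(c, c)                   # stands in for nonlinear processing unit 2006

logits = to_q(W) @ to_k(Fp).T            # dot-product similarity Q(W) K(F')^T: (k, hw)
# Normalization phi: Hadamard product of a sigmoid and a softmax over the
# concept dimension, discouraging different concepts at the same location.
A = torch.sigmoid(logits) * torch.softmax(logits, dim=0)   # attention matrix A
```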
  • the concept occurrence calculation unit 2030 calculates the concept activity vector t by taking the sum of A over the spatial dimension: t_k = Σ_n a_(k,n) … (4), where a_(k,n) is the (k, n) element of the attention matrix A.
  • Each element of the concept activity vector indicates whether or not a corresponding concept has appeared, and each element is called concept activity.
  • the shaping processor 2020 performs shaping processing on the feature map F to flatten the spatial dimensions, yielding the feature map F* ∈ R^(hw×c).
  • the similarity calculation unit 2040 calculates and extracts the image features V from the feature map F* as the attention-weighted average of the image features across the spatial dimensions: v_k = (Σ_n a_(k,n) f*_n) / (Σ_n a_(k,n)) … (5), where f*_n is the n-th row of F* and weighting is by the attention a_k of concept k.
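  • continuing the sketch above (reusing A, feat, c, h, and w), the concept activity t of equation (4) and the attention-weighted image features V of equation (5) might be computed as follows; the small epsilon guard is an added assumption:

```python
t = A.sum(dim=1)                          # equation (4): concept activity t, shape (k,)

F_star = feat.reshape(c, h * w).T         # flattened feature map F*: (hw, c)
eps = 1e-8                                # assumed guard against empty attention
V = (A @ F_star) / (A.sum(dim=1, keepdim=True) + eps)   # equation (5): V, shape (k, c)
```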
  • the concept activity level t mentioned above is an index showing the existence of each concept, and can be expressed as a binary value.
  • the concept learner may not be able to consistently capture meaningful features.
  • the concept regularization unit 300 therefore executes a concept regularization process so that learning of the "concept" progresses.
  • FIG. 2 is a functional block diagram for explaining the configuration of the concept regularization unit 300.
  • Figures 3A to 3C are conceptual diagrams showing the processing concept of the concept regularization unit 300 in Figure 2.
  • the loss calculation unit 3010 ensures the individual consistency of concepts: each learned concept should not have many variations, so that humans can easily interpret it after it has been extracted as a “concept” through the learning process of the concept learner 200.
  • the concept learner 200 performs so-called "mini-batch learning” to randomly select a portion (n pieces) of N pieces of training data and update the parameters.
  • the kth element t k of concept activity t can be used to identify images in a mini-batch that have concept k.
  • the loss calculation unit 3010 calculates the “consistency loss” as follows.
  • the image feature v_k, which is the k-th row vector of the image feature V, contains image features from a region corresponding to concept k if t_k is close to 1.
  • let H_k denote the set of all pairs of image features v_k in the mini-batch where t_k is greater than a threshold ζ that is set empirically in advance.
  • the consistency loss is a loss term used during mini-batch learning to advance learning so that the “image features” of regions belonging to the “same concept” become “more similar,” even across different images.
  • the loss calculation unit 3010 also calculates the following “discriminability loss” as a loss term.
  • the average image feature v̄_k of concept k in a mini-batch is the mean of the image features v_k over the images in the mini-batch that contain concept k.
  • the set M is the set of all pairs of these average image features. Note that concept k is excluded from set M if there are no images with concept k in the mini-batch.
  • the discriminability loss is a loss term used during mini-batch learning to advance learning so that the “average image features” of images belonging to “different concepts” become “more different.”
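  • the exact loss formulas are not reproduced in this text; as one plausible reading of the descriptions above, a sketch of the two loss terms could look like the following, where the pairwise squared-distance form, the exponential similarity penalty, and the threshold ζ handling are all assumptions:

```python
import torch

def consistency_loss(V: torch.Tensor, t: torch.Tensor, zeta: float = 0.9) -> torch.Tensor:
    """V: (n, k, c) per-image concept features; t: (n, k) concept activities.

    Assumed form: average squared distance between features of the same
    concept across image pairs in the mini-batch whose t_k exceeds zeta
    (the pair set H_k); smaller when same-concept features are more similar.
    """
    n, k, c = V.shape
    losses = []
    for j in range(k):
        sel = V[t[:, j] > zeta, j]               # features of concept j over H_k images
        if len(sel) < 2:
            continue
        d = torch.cdist(sel, sel).pow(2)         # all pairwise squared distances
        losses.append(d.sum() / (len(sel) * (len(sel) - 1)))
    return torch.stack(losses).mean() if losses else torch.zeros(())

def discriminability_loss(V: torch.Tensor, t: torch.Tensor, zeta: float = 0.9) -> torch.Tensor:
    """Assumed form: penalize similarity between the mini-batch mean features
    of different concepts (the pair set M); concepts absent from the batch
    are excluded, as described above."""
    means = []
    for j in range(V.shape[1]):
        sel = V[t[:, j] > zeta, j]
        if len(sel) > 0:
            means.append(sel.mean(dim=0))
    if len(means) < 2:
        return torch.zeros(())
    M = torch.stack(means)                        # (k', c) average image features
    sim = torch.exp(-torch.cdist(M, M).pow(2))    # close means -> high penalty
    off_diag = sim - torch.diag(torch.diag(sim))  # ignore self-pairs
    return off_diag.sum() / (len(M) * (len(M) - 1))
```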
  • Non-Patent Document 8 uses an autoencoder structure for self-supervised learning. This is effective, for example, in handwritten digit recognition tasks, where different visual elements (line patterns) are strongly associated with their positions.
  • a cross with horizontal and vertical lines only appears in the number 4, which is typically placed near the center of the image.
  • the concept regularization unit 300 of this embodiment therefore introduces “self-supervised learning” that evaluates a loss based on retrieval of natural images, in addition to the loss based on image reconstruction.
  • the reconstruction-based loss calculator 3020 as shown in FIG. 3B or the search-based loss calculator 3030 as shown in FIG. 3A executes the processes described below selectively or in parallel depending on the type of target of the classification task, for example, by external pre-setting, to calculate the loss term in the learning of the concept learner 200.
  • for simple images such as handwritten digits, the concept activation t contains enough information to reconstruct the original image.
  • the reconstruction-based loss calculation unit 3020 includes a concept decoder D, which receives the concept activity t as input and reconstructs the original image; learning proceeds so that the image x and the output D(t) of the concept decoder D become similar to each other.
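  • a minimal sketch of the reconstruction-based loss, assuming a small transposed-convolution decoder as the concept decoder D; the architecture, MNIST-sized grayscale images, and the mean-squared-error criterion are illustrative assumptions:

```python
import torch
import torch.nn as nn

k = 50                                        # number of concepts (assumed)

# Concept decoder D: maps concept activity t back to an image (assumed architecture).
decoder = nn.Sequential(
    nn.Linear(k, 128 * 7 * 7), nn.ReLU(),
    nn.Unflatten(1, (128, 7, 7)),
    nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),   # 7x7 -> 28x28
    nn.ConvTranspose2d(64, 1, 1),                          # 28x28 grayscale image
)

t = torch.rand(8, k)                          # batch of concept activities
x = torch.rand(8, 1, 28, 28)                  # original images x (MNIST-sized, assumed)
l_rec = nn.functional.mse_loss(decoder(t), x) # reconstruction-based loss: D(t) ~ x
```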
  • for natural images, however, the concept activity t is insufficient to reconstruct the original image x: each element of t corresponds to a concept that may appear at an arbitrary position, and the spatial information required for reconstruction is lost in t.
  • instead of reconstructing the original image, the search-based loss calculator 3030 performs a simple retrieval task of finding images of the same class in the mini-batch B using the concept activity t. For any pair (t, t') computed from images x, x' ∈ B with image labels y and y', respectively, a function J is defined as follows:
  • t and t' should be similar to each other if they have the same class label, since similar sets of visual elements should appear in images x and x'. On the other hand, if they do not have the same class label, t and t' should be different.
  • the search-based loss calculation unit 3030 defines the search-based loss l_ret in the above-described self-supervised learning by the following equation.
  • the search-based loss becomes smaller as the concept activities of different images with the same class label become more similar, and as the concept activities of different images with different class labels become more dissimilar; the class labels thus serve as the teaching signal for this self-supervised learning.
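  • since the equations for J and l_ret are not reproduced in this text, the following sketch assumes J to be a dot-product similarity between concept activities scaled to (0, 1), combined with a binary cross-entropy over same-class/different-class pairs:

```python
import torch

def retrieval_loss(t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """t: (n, k) concept activities of a mini-batch B; y: (n,) class labels.

    Assumed J: dot product of activities divided by the vector dimension,
    which lies in [0, 1] for t in [0, 1]^k; target 1 for same-class pairs.
    """
    n, k = t.shape
    J = (t @ t.T) / k                                # pairwise similarity matrix
    same = (y[:, None] == y[None, :]).float()        # 1 where labels match
    bce = -(same * torch.log(J.clamp_min(1e-8))
            + (1 - same) * torch.log((1 - J).clamp_min(1e-8)))
    mask = 1 - torch.eye(n)                          # ignore self-pairs
    return (bce * mask).sum() / mask.sum()
```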
  • the influence of each concept can be visualized by showing the top images to a human, based on a modified concept activation t′ and the similarity t^T t′ calculated for all images in the training image dataset D.
  • (Loss for the classification performance of the classifier)
  • This simple classifier 400 can be interpreted as determining the co-occurrence of the activation level of each concept with the class to be classified.
  • the total loss L of the image classification learning device 1000 (equation (15)) is defined by combining the above loss terms.
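  • a sketch of such a combination as a weighted sum; the weights λ_qua, λ_con, and λ_dis appear in the hyperparameter discussion of FIG. 13, while the λ_R weight for the reconstruction- or search-based term and all default values below are assumptions:

```python
# Sketch of the total training loss (equation (15)); weights are assumed defaults.
lambda_qua, lambda_con, lambda_dis, lambda_R = 0.1, 1.0, 1.0, 1.0

def total_loss(l_cls, l_qua, l_con, l_dis, l_selfsup):
    # l_cls: classification performance loss of the classifier 400
    # l_qua: quantization error loss pushing t toward binary values (equation (6))
    # l_con / l_dis: consistency and discriminability losses
    # l_selfsup: reconstruction-based or search-based loss, chosen by task type
    return (l_cls
            + lambda_qua * l_qua
            + lambda_con * l_con
            + lambda_dis * l_dis
            + lambda_R * l_selfsup)
```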
  • the learning process control unit 700 controls the learning process in accordance with the loss calculated by the loss calculation unit 600 .
  • FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000 shown in FIG. 1.
  • the image classification learning device 1000 may be configured so that a computing device (CPU: Central Processing Unit) within its own housing performs the computational processing, or the program processing itself may be executed on a server. In the following, it will be described as if a computing device within its own housing performs the computational processing.
  • the image classification learning device 1000 includes a computer device 6010, a network communication unit 6300 for communicating with a network, a camera 6400 for providing captured image data to the computer device 6010 as necessary, and a recording medium (e.g., a memory card) 6210 for recording the captured image data and providing it to the computer device 6010.
  • the recording medium 6210 may be a USB memory, a memory card, or an external storage device.
  • the network communication unit 6300 may be a wired LAN router or a wireless LAN access point.
  • the image data may be provided to the computer device 6010 via the network communication unit 6300.
  • the computer main body constituting this computer device 6010 includes, in addition to a disk drive 6030 and a memory drive 6020, a CPU (Central Processing Unit) 6040, each connected to a bus 6050, memory including a ROM (Read Only Memory) 6060 and a RAM (Random Access Memory) 6070, a non-volatile rewritable storage device such as an SSD (Solid State Drive) 6080, and an input/output interface 6090 for communicating over a network and sending and receiving data with the outside world.
  • An optical disk can be attached to the disk drive 6030.
  • a memory card 6210 can be attached to the memory drive 6020.
  • the RAM 6070 also functions as a working memory when the CPU 6040 performs calculations, and data and parameters during calculations are stored or read out as needed, and the CPU 6040 executes the calculations.
  • the non-transient recording medium from which the computer can read information such as a program to be installed in the computer main unit may be, for example, a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory.
  • the computer main unit 6200 is provided with a drive device (memory drive 6020, disk drive 6030) capable of reading these media.
  • the main components of the computer device 6010 are computer hardware and software executed by the CPU 6040.
  • such software is stored in a computer-readable non-transitory storage medium and distributed or circulated via a network, and is obtained via the disk drive 6030 or the network communication unit 6300 and temporarily stored in the SSD 6080. It is then read from the SSD 6080 into the RAM 6070 and executed by the CPU 6040. Note that when connected to a network, the software may be directly loaded into the RAM 6070 and executed without being stored in the SSD 6080.
  • a program for causing the computer device 6010 to function as described below does not necessarily need to include the operating system (OS) that causes the computer main body to execute the functions of an information processing device.
  • the program only needs to include instructions that call appropriate functions (modules) in a controlled manner to obtain the desired results. How the computer system 6010 operates is well known, and a detailed explanation will be omitted.
  • the CPU 6040 may be a single-core processor or a multi-core processor.
  • FIG. 5 is a flowchart for explaining the learning process of the image classification learning device 1000 shown in FIG. 1.
  • the learning image data selected for mini-batch processing is input (S100), and the CNN backbone 100 extracts a feature map (S102).
  • the location information encoding processing unit 2002 encodes the location information of the feature map (S104), and the shaping processing unit 2020 performs flattening processing of the feature map (S106).
  • the shaping processor 2004 performs flattening processing on the feature map F' with the encoded positional information (S108), and the nonlinear processor 2008 performs nonlinear processing on the concept matrix W output from the concept prototype processor 2100 to generate a query Q(W) (S110).
  • the nonlinear processing unit 2006 generates a key K(F') by nonlinear processing of the feature map F' (S112)
  • the similarity calculation unit 2010 calculates the dot product between the query Q(W) and the key K(F')
  • the normalization unit 2012 normalizes the dot product to generate an attention matrix A (S114).
  • the concept occurrence calculation unit 2030 calculates the concept activity t and inputs it to the classifier 400 (S116). Meanwhile, the similarity calculation unit 2040 generates image features V as shown in equation (5) from the dot product of the attention matrix A and the flattened feature map.
  • the concept regularizer 300 calculates the "consistency loss,” “discriminability loss,” “reconstruction-based loss,” and “search-based loss” from the concept activity t and the image feature V, while the quantization error calculator 500 calculates the quantization error loss of the concept activity t according to equation (6).
  • the loss calculator 600 calculates the "classification performance loss” as equation (14), and calculates the total loss L during the learning process of the concept learner 200 using equation (15) (S120).
  • the learning process control unit 700 updates the parameters of the concept learner 200, which is composed of a neural network, based on the total loss L, for example, by the gradient descent method. Note that the method of updating the model parameters is not limited to this method.
  • when the learning process control unit 700 determines that the learning process using mini-batch processing satisfies the specified end conditions, it ends the learning process; otherwise, it returns the process to step S100.
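  • the flow of steps S100 to S120 can be condensed into a conventional mini-batch loop such as the following sketch; the model interface (compute_total_loss is a hypothetical method bundling S102 to S120) and the use of SGD are assumptions standing in for the gradient-descent update described above:

```python
import torch

# model is assumed to bundle the CNN backbone 100, concept learner 200,
# concept regularizer 300, classifier 400, and loss calculation (S102-S120).
def train(model, loader, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)     # gradient descent (assumed)
    for _ in range(epochs):
        for images, labels in loader:                    # mini-batch input (S100)
            loss = model.compute_total_loss(images, labels)  # total loss L (S120)
            opt.zero_grad()
            loss.backward()
            opt.step()   # parameter update by the learning process control unit 700
```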
  • the process of FIG. 5 can be executed by the CPU 6040 in the hardware configuration shown in FIG. 4, for example, using a computer program stored in the non-volatile storage device (SSD) 6080.
  • Each process can be distributed, or a cloud-type configuration in which the processes are executed by a server device can be used.
  • FIG. 6 is a functional block diagram for explaining the configuration of an image classification device 4000 including a classifier 400 generated by learning in the image classification learning device 1000 shown in FIG. 1 when performing classification processing on a new image.
  • when the classification process is performed after the learning process is completed, the concept matrix W in the concept prototype processing unit 2100 is fixed to the one at the end of learning and is not updated; in FIG. 6 it is therefore depicted as the concept prototype storage unit 4100.
  • the learned parameters are stored in memory.
  • the parameters of the other components in Figure 6, including the classifier 400, are also fixed at the time when learning is completed.
  • FIG. 7 is a conceptual diagram for explaining the processing performed by the classifier 400 in FIG. 6.
  • the concept occurrence calculation unit 2030 calculates the concept activity t for each concept. This is called a "concept bottleneck" in the sense that the diversity of the original image is consolidated into a small number of features.
  • in the classifier 400, the pattern of co-occurrence of concepts is learned for each label of the training images.
  • when a concept bottleneck for a new image is input to the classifier 400, the similarity with the learned patterns of co-occurrence of concepts is calculated, and the label with the highest similarity among the similarities for each label is output as the classification result.
  • FIG. 7 illustrates the case where natural images of birds are learned.
  • concepts such as "yellow head” and "black body” are generated for birds with label 1, and in the process of calculating the classification results, the degree of co-occurrence of each of these concepts with concepts in the target images is determined.
  • the image classification learning device 1000 learns concepts such that, for example, an image of a “black bird” with a “yellow head” can be interpreted as a “yellow head” and a “body with black feathers.”
  • the classifier 400 is a single fully connected (FC) layer that encodes the co-occurrence of each concept with each class.
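  • because the classifier 400 is a single fully connected layer over the concept activity t, its skeleton is essentially one line; the bias-free form and the dimensions below are assumptions:

```python
import torch.nn as nn

k, num_classes = 50, 200                     # e.g. CUB200 (values assumed)
# Classifier 400: one FC layer whose weight matrix can be read as the
# co-occurrence of each concept with each class.
classifier = nn.Linear(k, num_classes, bias=False)
# score s = classifier(t); the predicted label is s.argmax().
```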
  • the image classification learning device 1000 learns the "classifier” and the “bottleneck concept” simultaneously.
  • concepts are constrained to be individually consistent (i.e., single concepts occupy a smaller volume in feature space) and mutually distinctive (i.e., pairs of concepts do not occupy the same region in feature space, or are separated in such a way that they are less likely to do so).
  • the image classification learning device 1000 can also simultaneously learn classifiers and concepts in an end-to-end manner.
  • the configuration is not necessarily limited to an end-to-end configuration; for example, at least a portion of the CNN backbone 100 may be fixed and only other portions may be trained.
  • the image classification learning device 1000 has a configuration that can essentially explain the classification process to humans in two ways.
  • the classifier provides a prototype of the target class based on the concept.
  • the image classification learning device 1000 is more likely to fail when learning with a large number of classes.
  • the number of concepts k is set to 20 for MNIST and 50 for others by default.
  • Figures 8A and 8B show the classification performance of classifier 400 for CUB200 and ImageNet.
  • BotCL refers to the classifier 400.
  • classifier 400 shows a performance degradation of approximately 3 points.
  • Figure 8B also shows the change in performance versus the number of classes for CUB200 (number of classes: 20-200) and ImageNet (number of classes: 50-300).
  • since the classifier 400 has a single fully connected layer configuration that uses the concept activity t as input, it can be said that there is almost no degradation in classification performance compared to the baseline model. In other words, the “concepts” corresponding to the concept activity t sufficiently express the characteristics used for image classification.
  • (Interpretability)
  • (Validity of detected concepts)
  • the image classification learning device 1000 calculates the concept activity t, which indicates the presence of each concept.
  • this concept activity t corresponds to the sum over the spatial dimension of the attention a_k corresponding to concept k. By visualizing this a_k, humans can qualitatively confirm the presence or absence of a concept.
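  • visualizing a_k as described can be done by upsampling the k-th row of the attention matrix to the image size and blending it with the image; the normalization and blending below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F_nn

def attention_overlay(image: torch.Tensor, A: torch.Tensor, k: int,
                      h: int, w: int, alpha: float = 0.5) -> torch.Tensor:
    """image: (3, H, W) in [0, 1]; A: (num_concepts, h*w) attention matrix.

    Returns the image brightened where concept k attends (assumed rendering).
    """
    a_k = A[k].reshape(1, 1, h, w)
    heat = F_nn.interpolate(a_k, size=image.shape[1:], mode="bilinear",
                            align_corners=False)[0]      # upsampled a_k: (1, H, W)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return (image * (1 - alpha) + heat * alpha).clamp(0, 1)
```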
  • Figures 9A to 9C are diagrams for explaining the validity of concepts expressed by concept activity t.
  • Figure 9A shows the original images of the five most frequently activated concepts (i.e., t_k > 0.5) selected in MNIST, with a_k superimposed on them.
  • the area of interest as a concept is displayed brighter by superimposing.
  • the obvious difference between the two is the activation of concept 2 (Cpt.2), and the attention point is the lower vertical line.
  • Figure 9B shows the top five most frequently activated concepts
  • Figure 9C shows the images reconstructed by each concept.
  • the image classification learning device 1000 is designed so that the learned concepts are individually consistent and distinctive from each other.
  • Figure 10 shows the attention levels of the five most important concepts (based on "importance” described below) for the input image of a black bird with a yellow head.
  • Figure 11 is a diagram to explain the concepts expressed by concept activity t for natural images.
  • Figure 11 shows a similar overlay display to that shown in Figure 9B.
  • the classifier 400 consists of one fully connected layer and can be interpreted as learning the co-occurrence of concepts.
  • Figure 12 shows the importance of each concept in the CUB200 dataset.
  • disabling concept 1 results in more images of black-headed birds appearing in the search results.
  • the search task is more robust because the output (highly similar samples) is determined by multiple concepts, and changing one concept does not significantly affect the overall similarity.
  • Figure 12 shows the percentage of samples in the ground truth class among the retrieved samples, allowing us to measure the importance of each concept in this search task.
  • Figure 13 shows the magnitude of each hyperparameter versus the accuracy rate (Accuracy: circles), individual consistency (squares: higher is better), and mutual distinctiveness (triangles: lower is better).
  • the parameter λ_qua controls how close the concept activation t is to a binary value. An appropriate value can regularize the activation and prevent some ambiguous concepts. However, setting an extreme value can cause the gradient to vanish, which can lead to poor learning. The default λ_qua of about 0.1 was the optimal value within the scope of this experiment.
  • (The impact of λ_con and λ_dis)
  • the image classification learning device 1000 of this embodiment can learn the features used in classification as "concepts" that humans can understand through learning on classification tasks.
  • the image classification learning device 1000 can provide not only the learned concepts but also the interpretability of its judgments.
  • [Embodiment 2]
  • the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
  • FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of embodiment 2.
  • The classification of the soundness of concrete structures can also be configured so that the learning process and the classification process are performed within a single integrated computer device.
  • The learning process is executed using the image data and a discrimination index associated with the image data (in this case, the soundness level serving as correct-answer data for the image data), and a trained model of the artificial intelligence is generated.
  • Image data capturing the surface of a concrete structure is sent to server 1010.
  • In server 1010, a classification process is performed using the trained artificial intelligence model, and a soundness level (for example, soundness level III) is output.
  • The area corresponding to the "concept" used in the classification process in server 1010 is displayed with a frame or the like so that it can be visually recognized by humans.
  • A skilled engineer can view an image classified in this way and understand not only the artificial intelligence's classification result but also which areas of the image attention was focused on in determining the soundness. The engineer can then decide how to respond based on the areas attended to during the classification process.
  • The soundness of concrete structures can be classified into four levels, from soundness level I to soundness level IV, as set out in the Ministry of Land, Infrastructure, Transport and Tourism's "Guidelines for Periodic Bridge Inspection" (Non-Patent Document 10).
  • Taking roads, bridges, and the like as examples, the soundness levels in Non-Patent Document 10 are classified as follows.
  • Soundness level I (Sound): The road bridge's functionality is not impaired.
  • Soundness level II (Preventive maintenance stage): The road bridge's functionality is not impaired, but it is desirable to take preventive maintenance measures.
  • Soundness level III (Early action stage): The road bridge's functionality may be impaired, so measures should be taken at an early stage.
  • Soundness level IV (Emergency action stage): The road bridge's functionality is impaired, or there is a very high possibility that it will be, so measures must be taken urgently.
  • For example, in response to the soundness level being III, the expert engineer makes a decision such as "this part is a crack, so let's repair it by injecting resin."
  • FIG. 15 is a conceptual diagram showing the configuration of training data for generating a trained artificial intelligence model like that shown in FIG. 14.
  • Image data labeled with soundness levels corresponding to levels I to IV is prepared.
  • Figure 15 shows soundness level III as an example; similar images are assumed to be prepared for the other soundness levels.
  • Figure 16 shows an example of a system configuration for determining the soundness of concrete.
  • An image of the surface of the concrete structure is captured by the inspector terminal 500.1 and transmitted to the server 1010.
  • The inspector terminal 500.1 transmits to the server 1010, together with the image data, for example, the structure's location information (e.g., latitude and longitude information obtained by a positioning means) and the structure-name data entered by the inspector.
  • The server 1010 returns to the inspector terminal 500.1 information on the soundness assessment result and information indicating the areas attended to in the image sent from the inspector terminal 500.1 (data marked with a frame or the like).
  • The trained artificial intelligence model used for the classification and judgment processing in server 1010 can be trained, for example, in another server 1020 and then transmitted to server 1010, where it is stored and operated.
  • Server 1020, which is in charge of the learning process, collects data such as "position data," "image data," and "structure name" for other concrete structures from multiple terminals 500.2 to 500.n (n: natural number). If, for example, each of the terminals 500.2 to 500.n is operated by a skilled engineer, the system can be configured so that, when an image is captured on a terminal, the soundness level is associated with the image data as correct-answer data and transmitted to server 1020. Learning data such as that described in FIG. 15 can be generated from the data collected in this way.
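Purely for illustration, one such collected record might look like the following; every field name here is a hypothetical choice, since the patent does not fix a concrete schema:

```python
# Hypothetical upload record assembled on a terminal; all field names
# are illustrative assumptions, not the patent's specification.
record = {
    "structure_name": "Example Bridge, Pier 3",      # entered by the inspector
    "position": {"lat": 34.7025, "lon": 135.4959},   # from the positioning means
    "image": "surface_0001.jpg",                     # captured surface image
    "soundness_label": "III",                        # correct-answer data, I..IV
}
```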
  • Alternatively, a specialized engineer on the server side may perform the process of associating the soundness level with the image data as correct-answer data.
  • Server 1020 can use the accumulated learning data to retrain the trained artificial intelligence model and improve its classification performance.
  • FIG. 17 is a functional block diagram showing the configuration of terminal 500.1 shown in FIG. 16.
  • Terminals 500.2 to 500.n have a similar configuration, so their explanation is not repeated.
  • The terminal 500.1 of this embodiment includes a control unit 5010 for controlling the communication and input/output operations of the terminal; a communication processing unit 5040 for generating baseband signals for wireless LAN and mobile communication, sending them to a modulation/demodulation circuit or device, and obtaining the original data or signals from received baseband signals; an imaging sensor 5050 for capturing still images or videos; an image acquisition unit 5060 for converting signals from the imaging sensor 5050 into electrical signals in a predetermined format; a display control unit 5070 for controlling image display on the terminal side; a display unit 5080 for displaying images under the control of the display control unit 5070; a position acquisition unit 5090 for measuring and acquiring the position of the terminal 500.1; and an input interface unit 5100 for receiving input of information from the outside.
  • The imaging sensor 5050 may be a module combining a lens with a CCD (Charge-Coupled Device) sensor or a module combining a lens with a CMOS sensor.
  • The position acquisition unit 5090 may be a positioning device that uses GPS (Global Positioning System) as an outdoor positioning means, a positioning device that also uses signals from quasi-zenith satellites, a device that enables indoor positioning using beacon signals, or any other device capable of acquiring the location information of the terminal 500.1.
  • The input interface unit 5100 converts external input into text data, for example via a touch panel or via voice recognition applied to voice input.
  • The control unit 5010 includes an acquired-image transmission processing unit 5020, which integrates information such as the image data from the image acquisition unit 5060, the position data from the position acquisition unit 5090, and the structure-name data from the input interface unit 5100, and transmits the integrated information from the communication processing unit 5040 to the server 1010; and a judgment table generation unit 5030, which generates, from the structure name, the soundness data, and the data indicating the image and the attended areas received from the server 1010 via the communication processing unit 5040, image data representing a judgment table to be displayed on the display unit 5080 by the display control unit 5070.
  • FIG. 18 is a block diagram for explaining the hardware configuration of terminal 500.1 shown in FIG. 17.
  • The control unit 5010 is equipped with a computing device 501 such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit) and a storage device consisting of a RAM 502, a ROM 503, and the like; by executing a predetermined basic OS, middleware, and so on, it controls each unit and provides the native platform environment and the application execution environment of the software configuration.
  • As the imaging device 505, a camera module as described above is used; as the positioning device 509, a GPS or other positioning device as described above is used.
  • The display device 508 may be a liquid crystal panel or an organic EL panel, and the operation device 510 may be a touch panel integrated with the display panel or a voice recognition device.
  • The storage device of the control unit 5010 includes, for example, semiconductor memory such as RAM as a temporary storage device and flash memory as a non-volatile storage device.
  • This non-volatile storage device stores driver programs, operating system programs, application programs, data, etc. used for processing in each unit.
  • The non-volatile storage device stores driver programs such as a communication driver program that executes a wireless communication method conforming to the IEEE 802.11 standard or a wireless communication method for mobile communication (cellular communication), an input device driver program that controls the operation device 510, and an output device driver program that controls the display device 508.
  • The non-volatile storage device also stores operating system programs, such as basic OSs like Android (registered trademark) OS and iOS (registered trademark), and connection control programs that perform authentication for wireless communication methods such as the IEEE 802.11 standard and mobile (cellular) communication.
  • The communication interface 504 has the functionality to perform wireless LAN communication and mobile communication via a base station (not shown) of a cellular mobile communication network.
  • With the configuration above, information for judging the soundness of concrete can be obtained on the inspector's terminal based on the input surface-image data of a concrete structure, and the image areas used for classification by the trained artificial intelligence model can be confirmed.
  • the "trained model (classifier)" may be recorded as a program or as part of a program on a computer-readable recording medium and installed on another computer.
  • An image classification learning device 1000 as described in FIG. 1 executes the learning process using learning data such as that shown in FIG. 15; as a result, not only can the soundness of a concrete structure be classified, but a human can also determine which characteristic parts (feature areas) the artificial intelligence focused on in judging that soundness.
[Embodiment 3]
  • Next, a configuration of the image classification learning device 1000 and the classification device 4000 is described in which the image classification learning device 1000 executes a learning process so that, with image data as input, it not only classifies the soundness of the corresponding concrete structure but also outputs a method of dealing with that concrete structure.
  • Figure 19 is a diagram for explaining the composition of learning data that includes image data, soundness level labels corresponding to the images, and corrective-action labels.
  • Image data of concrete structures and the corresponding information on their soundness can be collected, and data on how to deal with the concrete structures represented by the image data can also be accumulated.
  • FIG. 20 is a functional block diagram for explaining the configuration of the image classification learning device 1000 and classification device 4000 according to the third embodiment.
  • The concept activity t is input not only to the classifier 400 but also to an action discriminator 410; furthermore, a response-action label indicating a countermeasure, as shown in FIG. 19, is also input to the action discriminator 410 to execute the learning process.
  • The learning processing control unit 700 executes the learning process so that the response actions output from the action discriminator 410 match the teacher data.
  • The action discriminator 410 trained in this way outputs a response action.
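As a hedged sketch of such a two-head arrangement (the layer sizes, the use of PyTorch, and the simple summed cross-entropy objective are assumptions for illustration, not the patent's implementation):

```python
import torch
import torch.nn as nn

K, N_SOUNDNESS, N_ACTIONS = 20, 4, 5   # hypothetical sizes

class TwoHeadModel(nn.Module):
    """Classifier 400 and action discriminator 410 sharing the concept activity t."""
    def __init__(self) -> None:
        super().__init__()
        self.classifier = nn.Linear(K, N_SOUNDNESS)   # soundness levels I..IV
        self.action_head = nn.Linear(K, N_ACTIONS)    # response actions

    def forward(self, t: torch.Tensor):
        return self.classifier(t), self.action_head(t)

model = TwoHeadModel()
ce = nn.CrossEntropyLoss()
t = torch.rand(8, K)                                  # batch of concept activities
soundness_y = torch.randint(0, N_SOUNDNESS, (8,))     # soundness teacher data
action_y = torch.randint(0, N_ACTIONS, (8,))          # response-action teacher data
logits_s, logits_a = model(t)
loss = ce(logits_s, soundness_y) + ce(logits_a, action_y)  # joint training signal
loss.backward()
```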
  • According to the image classification learning device 1000 and the classification device 4000 of Embodiment 3, or according to the learning program and classification program of Embodiment 3, information for assessing the soundness of concrete can be obtained from the input surface image data of a concrete structure, and a human, in particular a skilled engineer, can confirm both the image areas used for classification by the trained artificial intelligence model and the information on countermeasures.


Abstract

The present invention provides an image classification learning device that makes it possible to learn the "concepts" that a post-learning model uses to make a determination. A concept learner 200 performs machine learning of a plurality of concepts in image data on the basis of learning data including the image data and an image label. A concept prototype processing unit 2100 is an attention mechanism that, in a concept matrix comprising slot vectors, converts the slot vectors according to the image features defined in the slot vectors, the slot vectors respectively corresponding to the plurality of concepts and defining the image regions in which the feature quantities emphasized in the identification processing by an image identification means appear. A learning processing control unit 700 controls the learning processing so as to decrease a loss calculated on the basis of an identification loss, which decreases as the identification rate of a classifier 400 increases, and a separation loss, which decreases as the degree of mutual separation, in a feature quantity space, of the feature quantities corresponding to the plurality of concepts increases.

Description

Image classification learning device, image classification learning method, image classification learning program, and image classification trained model
The present invention relates to an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can help humans understand the process of performing image classification.
(The need for explainable data-driven artificial intelligence)
In recent years, deep learning technology based on neural networks has brought about major breakthroughs in many fields, including image recognition, speech recognition, and natural language processing.
However, one problem that has been raised is that, with data-driven artificial intelligence such as neural networks, the basis for the "judgments" made by models generated through machine learning is difficult for humans to interpret. This is known as the "black box problem," and it is one of the factors that makes it difficult to apply artificial intelligence technology practically in society.
In other words, if it cannot be explained on what basis an artificial intelligence model made a given prediction or judgment, users of services or applications that have adopted, or plan to adopt, the model will feel uneasy, particularly as applications spread to fields where artificial intelligence technology has not previously been used explicitly.
This is especially serious in high-risk areas, namely those involving users' health, safety, or personal information.
To address these issues, a technology called "explainable artificial intelligence (XAI: eXplainable Artificial Intelligence)" is being actively researched.
That is, understanding the behavior of neural networks is a major challenge, especially for medical applications (see Non-Patent Document 1) and for identifying biases in neural networks (see Non-Patent Document 2). For this reason, much research effort has been devoted to providing post-hoc explanations of artificial intelligence models after they have been generated by machine learning (see Non-Patent Document 3). This kind of explanation successfully provides a low-level (or pixel-by-pixel) relationship between the image and the model's judgment by highlighting certain regions of the image as a heat map, but how to interpret such relationships remains an unsolved problem.
Therefore, methods have been considered that introduce, in advance, a mechanism that indicates to the artificial intelligence model the points to focus on in the input data. Such points of focus are called "attention" in this field, and the "attention mechanism" was first applied to artificial intelligence translation models for natural language processing, emerging as a scheme that learns, together with the translation function itself, which words in the source text to attend to when outputting each translated word (see Non-Patent Document 4).
This attention mechanism was later also applied to the field of image recognition: in deep learning of neural networks for the task of "object detection," the attention mechanism learns, alongside the object-detection training, which locations in the input image the artificial intelligence model attends to when detecting an object; in other words, which features of the input image are given large weights for detection.
In this case, an attention matrix (attention weights) is generated from the similarity between a query (Q), generated from a generative model of a weight matrix consisting of columns (slots) of weight vectors, and a key (K) representing the image features; this attention matrix is used to extract the regions of the image that are used for object detection. Such a representation of image features for "object detection" is also called an "object-centric representation" (see Non-Patent Document 5).
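As a minimal sketch of this query-key attention computation (the tensor shapes, the scaling by sqrt(D), and the softmax over spatial positions are assumptions for illustration):

```python
import numpy as np

def slot_attention_maps(slots: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Compute attention of K concept slots over N spatial feature vectors.

    slots:    (K, D) -- one query vector per concept slot.
    features: (N, D) -- keys, e.g., a CNN feature map flattened to N = H*W.
    Returns:  (K, N) attention weights; row k highlights where the image
    regions matching slot k appear.
    """
    D = slots.shape[1]
    sim = slots @ features.T / np.sqrt(D)      # scaled dot-product similarity
    sim -= sim.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)    # softmax over spatial positions
```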
Another step forward in interpretability is the "concept-based" framework, inspired by the human ability to learn new concepts by unconsciously finding finer-grained concepts (see Non-Patent Document 6). Instead of providing pixel-wise importance scores as an explanation, this framework presents a higher-level relationship between the target image and the classification decision, mediated by "concepts."
The explanation of such a judgment boils down to finding several feature regions, called "concepts," in the target, and such "concepts" are shared even across the different target classes of a task.
Conventionally, however, such "concepts" have been specified by humans defining a concept set in advance.
For example, a simple way to predefine concepts is to use human knowledge (see Non-Patent Document 7). One method uses a manually created concept set and quantifies the importance of each concept to the judgment using directional derivatives. In addition, the Broden dataset, which unifies several densely labeled image datasets, provides a large concept corpus used to directly and automatically match CNN (Convolutional Neural Network) representations with labeled interpretations (see Non-Patent Document 8).
In this context, SENN (Self-Explaining Neural Networks), proposed by Alvarez-Melis et al., utilizes a concept bottleneck and treats the concept activations as inputs to a regression model (see Non-Patent Document 9).
(Determining the soundness of concrete structures and the like as an application of image classification processing)
Meanwhile, the inspection of infrastructure such as concrete structures and steel-framed structures is being considered as an application of image recognition technology based on the deep learning techniques described above.
Currently, the progressive aging of social infrastructure, including concrete structures, is a recognized social issue. In inspections of such concrete structures, the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
For example, the Ministry of Land, Infrastructure, Transport and Tourism's Guidelines for Periodic Bridge Inspection (see Non-Patent Document 10) state that the damage rank of a concrete wall surface is classified based on the width of the cracks that have occurred, whether the cracks form a lattice pattern, and the occurrence of water leakage and free lime.
Inspection of concrete structures requires close visual inspection by engineers with specialized knowledge and is performed as a comprehensive judgment that takes into account various aspects such as the state and type of deterioration, the location, and the traffic volume. In other words, judging the soundness of concrete structures relies heavily on the know-how (tacit knowledge) of experienced engineers, which cannot be put into a manual.
Therefore, in order to reduce work costs and avoid variability in damage assessment between workers, it is desirable to automate the determination of the damage level using an information processing device. Patent Document 1, for example, discloses a conventional technique for automating damage assessment.
Patent Document 1 discloses a configuration in which cracks are detected as deformed areas using a feature map created with a CNN (Convolutional Neural Network), and the crack width is determined as attribute information of the deformed area.
Patent Document 2 discloses a configuration that uses deep learning to provide a performance evaluation system for concrete structures that makes it possible to efficiently carry out a series of maintenance tasks, from inputting deformations to performance verification. Specifically, a deep learning unit performs machine learning, by artificial intelligence, based on the discrepancies, accumulated for each inspection, between the results automatically calculated by the performance evaluation system and the results corrected by the inspector. The results of this machine learning are then reflected in subsequent judgments and predictions.
Furthermore, Patent Document 3 discloses the following technology.
Specifically, when automatically determining the damage level based on crack width and on whether the cracks form a lattice, as in the Ministry of Land, Infrastructure, Transport and Tourism's bridge inspection guidelines, detecting cracks in an image and estimating their width requires judgment from local-range information in a high-resolution image. On the other hand, determining that cracks form a lattice requires judgment using information over a wide range that includes multiple cracks.
Therefore, the technology disclosed in Patent Document 3 determines the condition of the wide range based on both the local range and the wide range, making it possible to determine the damage level of the concrete wall surface of an infrastructure structure.
[Patent Document 1] JP 2018-198053 A
[Patent Document 2] JP 2019-200120 A
[Patent Document 3] JP 2021-165888 A
Following the idea of using "concepts" as described above, it is expected that, by comparing the behavior of the trained model generated as a result of such learning with human judgment, humans will be able to understand the trained model's decision-making process.
Conventionally, however, learning corresponding to such human "concepts" has been achieved using "learning data" that reflects human knowledge in advance; it has not necessarily been possible, for example, to compare the judgment process of a trained model with the human judgment process on arbitrary natural images.
In addition, concept learning has been guided by a learning process based on an autoencoder structure that reconstructs the original image. It is not yet clear whether such a configuration is also applicable to learning from natural images.
Meanwhile, in the inspection of concrete structures, the shortage of engineers and the enormous time and cost of inspections have become problems.
As described above, the use of artificial intelligence technology to assist or automate inspection work is therefore being considered; however, with the techniques disclosed in Patent Documents 1 and 2, the points of interest in inspecting a concrete structure are still identified by humans. Moreover, the technique disclosed in Patent Document 3 automatically determines the damage level based on crack width and on whether the cracks form a lattice, but it requires high-resolution images and involves complex processes.
In addition, a large amount of learning data is required, which would be created by humans examining the images and judging and recording the damage rank for each region; collecting such learning data in large quantities and creating the training set is itself not easy.
Also, as described above, for image classification as a substitute for the close visual inspection used in soundness judgments, image classification by artificial intelligence has been studied; with conventional techniques, however, the credibility of the judgment results is questioned. For this reason, a judgment system that can visualize the know-how that substitutes for that of engineers is important.
The present invention has been made to solve the problems described above, and its object is to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that can learn, through training on a given task, the "concepts" that the trained model uses for its judgments, in such a way that they can be compared with the human judgment process.
Another object of the present invention is to provide an image classification learning device, an image classification learning method, an image classification learning program, and an image classification trained model that make it possible to assist or substitute for the judgment of the soundness of concrete structures by using a trained artificial intelligence model.
In accordance with one aspect of the present invention, an image classification learning device includes a storage device for storing learning data including a plurality of image data and image labels corresponding to the image data, and calculation processing means for reading out the learning data stored in the storage device and executing processing for machine-learning a plurality of concepts in the image data for classifying the image data. The calculation processing means includes: image identification means for extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the model in the storage device; attention mechanism processing means for converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing of the image identification means appear, according to the image features defined by the slot vectors, and storing them in the storage device; loss evaluation means for calculating a loss based on an identification loss, calculated by evaluating the identification rate of the image identification means, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and learning processing means for executing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
Preferably, the attention mechanism processing means includes attention matrix learning means for learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing of the image identification means, and the image identification means includes concept occurrence calculation means for generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a classifier for performing classification of the image labels with the activity vector corresponding to the image data as input.
Preferably, the set of features expressing the image data is a feature map output from a convolutional neural network image recognition model.
Preferably, the separation loss includes a consistency loss, which decreases as a single concept occupies a smaller volume in the feature space, and a discrimination loss, which decreases as pairs of concepts become less likely to occupy the same region in the feature space.
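As a hedged sketch of what such consistency and discrimination terms could look like over concept features (the specific formulas below, variance around a center and mean pairwise cosine similarity, are illustrative assumptions, not the losses defined in the patent):

```python
import numpy as np

def consistency_loss(concept_features: np.ndarray) -> float:
    """Smaller when the feature vectors gathered for ONE concept are tightly packed.

    concept_features: (M, D) features of regions assigned to a single concept.
    """
    center = concept_features.mean(axis=0)
    return float(np.mean(np.sum((concept_features - center) ** 2, axis=1)))

def discrimination_loss(centers: np.ndarray) -> float:
    """Smaller when the K concept centers point in mutually dissimilar directions.

    centers: (K, D) one mean feature vector per concept.
    """
    norm = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = norm @ norm.T                       # pairwise cosine similarities
    K = centers.shape[0]
    off_diag = cos[~np.eye(K, dtype=bool)]    # exclude each concept with itself
    return float(np.mean(off_diag))
```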
Preferably, the image data is data of images of the surfaces of a plurality of concrete structures captured by a camera, and the image labels are labels indicating the soundness of the concrete structure corresponding to each piece of image data.
In accordance with another aspect of the present invention, there is provided an image classification learning method in which a computer machine-learns a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, the computer including a storage device for storing the learning data and a calculation device for executing the machine learning processing. The method includes: a step in which the calculation device extracts a set of features expressing the image data and learns and generates a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step in which the calculation device converts slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step in which the calculation device calculates a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step in which the calculation device trains the classification model and the concept matrix so as to reduce the loss.
Preferably, the image data is data of images of the surfaces of a plurality of concrete structures captured by a camera, and the image labels are labels indicating the soundness of the concrete structure corresponding to each piece of image data.
In accordance with yet another aspect of the present invention, there is provided an image classification learning program for causing a computer to machine-learn a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, the computer including a calculation device and a storage device. For the image data stored in the storage device, the program causes the calculation device to execute: a step of extracting a set of features expressing the image data and learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the classification model and the concept matrix so as to reduce the loss.
Preferably, a computer-readable non-transitory recording medium stores the image classification learning program.
In accordance with yet another aspect of the present invention, there is provided an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data. The image classification trained model has the configuration of a classifier model that takes as input an activity vector whose elements are the degrees to which the respective concepts appear in the image data and classifies the image data based on the co-occurrence relationships of those elements. The image classification trained model is generated by: a step of extracting a set of features expressing the image data and updating, by learning, a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the model and the concept matrix so as to reduce the loss. The step of converting the slot vectors includes a step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing, and the step of updating the classifier model by learning includes a step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of learning the parameters of the classifier model so as to perform classification of the image labels with the activity vector corresponding to the image data as input.
In accordance with yet another aspect of the present invention, an image classification learning device includes a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data, and a calculation device for executing processing for machine-learning a plurality of concepts in the image data for classifying the image data in terms of soundness, based on the learning data stored in the storage device. The calculation device executes: an image identification step of extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies image labels for the image data based on the extracted set of features, and storing the model in the storage device; an attention mechanism processing step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the concepts and define the image regions in which the features emphasized in the classification model's identification processing appear, according to the image features defined by the slot vectors, and storing them in the storage device; a loss evaluation step of calculating a loss based on an identification loss, calculated by evaluating the classification model's identification rate, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a learning processing step of executing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
Preferably, the attention mechanism processing step includes an attention matrix learning step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the classification model's identification processing, and the image identification step includes a concept occurrence calculation step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of generating a classifier that performs classification of the image labels with the activity vector corresponding to the image data as input.
Preferably, the learning processing step includes a step of generating a treatment discrimination model that learns to discriminate treatment labels, using as input the activity vector and the treatment labels of repair countermeasures corresponding to the image data of the surfaces of the concrete structures.
In accordance with yet another aspect of the present invention, there is provided an image classification trained model generated by an image classification learning method that machine-learns a plurality of concepts in image data for classifying the image data in terms of soundness, based on learning data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures corresponding to the image data. The image classification trained model has the configuration of a classifier model that takes as input an activity vector whose elements are the degrees to which the respective concepts appear in the image data and classifies the image data based on the co-occurrence relationships of those elements. The image classification trained model is generated by: a step of extracting a set of features expressing the image data and updating, by learning, a classifier model that identifies and classifies an image label for the image data based on the extracted set of features; a step of converting slot vectors, in a concept matrix consisting of slot vectors that correspond to each of the plurality of concepts and define image regions in which the features emphasized in the identification processing appear, according to the image features defined by the slot vectors; a step of calculating a loss based on an identification loss, calculated by evaluating the identification rate in identifying the image data, that decreases as the identification rate increases, and a separation loss, calculated by evaluating the degree to which the features corresponding to the plurality of concepts are separated from each other in the feature space, that decreases as the degree of separation increases; and a step of training the model and the concept matrix so as to reduce the loss. The step of converting the slot vectors includes a step of learning an attention matrix for extracting, according to the similarity with the concept matrix, the image regions in the set of features to which attention is directed in the identification processing, and the step of updating the classifier model by learning includes a step of generating, based on the attention matrix, an activity vector whose elements are the degrees to which the concepts corresponding to the slot vectors appear in the image data, and a step of learning the parameters of the classifier model so as to perform classification of the image labels with the activity vector corresponding to the image data as input.
According to the image classification learning device, image classification learning method, and image classification learning program of the present invention, humans can understand on the basis of which image feature regions a trained model, generated by artificial intelligence learning classification processing for images, performs its classification.
More specifically, these image feature regions are separated so as to minimize overlap between different classification classes; therefore, even for classification tasks on natural images, displaying the activity of the feature regions in the separation processing makes it possible to visualize them so that they can be compared with the "concepts" humans use for classification.
Furthermore, when the image classification learning device, image classification learning method, and image classification learning program of the present invention are applied to judging the soundness of concrete structures, it becomes possible to make soundness judgments that draw on the judgment know-how accumulated by engineers and experts.
More specifically, when applied to judging the soundness of concrete structures, a trained artificial intelligence model becomes possible that can determine not only the soundness itself but also, depending on the degree of soundness and the characteristics of the judgment, the countermeasures that should be taken.
FIG. 1 is a functional block diagram for explaining the configuration of an image classification learning device 1000 according to Embodiment 1.
FIG. 2 is a functional block diagram for explaining the configuration of a concept regularization unit 300.
FIGS. 3A to 3C are conceptual diagrams showing the concept of the processing of the concept regularization unit 300.
FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000.
FIG. 5 is a flowchart for explaining the learning processing of the image classification learning device 1000.
FIG. 6 is a functional block diagram for explaining the configuration used when an image classification device 4000 executes classification processing for a new image.
FIG. 7 is a conceptual diagram for explaining the processing executed by the classifier 400.
FIGS. 8A and 8B are diagrams showing the classification performance of the classifier 400 for CUB200 and ImageNet.
FIGS. 9A to 9C are diagrams for explaining the validity of the concepts expressed by the concept activity t.
FIG. 10 is a diagram showing the attention levels of the five most important concepts (based on the "importance" described later) for an input image of a black bird with a yellow head.
FIG. 11 is a diagram for explaining the concepts expressed by the concept activity t for natural images.
FIG. 12 is a diagram showing the importance of each concept in the CUB200 dataset.
FIG. 13 is a diagram showing the magnitude of each hyperparameter and the resulting accuracy, consistency, and distinctiveness.
FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of Embodiment 2.
FIG. 15 is a conceptual diagram showing the configuration of learning data for generating a trained artificial intelligence model such as that shown in FIG. 14.
FIG. 16 is a diagram showing an example of a system configuration for determining the soundness of concrete.
FIG. 17 is a functional block diagram showing the configuration of a terminal 500.1.
FIG. 18 is a block diagram for explaining the hardware configuration of the terminal 500.1.
FIG. 19 is a diagram for explaining the configuration of learning data consisting of image data, soundness labels corresponding to the images, and corrective-action labels.
FIG. 20 is a functional block diagram for explaining the configurations of the image classification learning device 1000 and the classification device 4000 of Embodiment 3.
 The configuration of an image classification learning device and an image classification learning method according to embodiments of the present invention will be described below. In the following embodiments, components and processing steps denoted by the same reference numerals are the same or equivalent, and their description will not be repeated unless necessary.

 In the following, the image classification learning device of the present invention will be described as a computer program that is installed on a standalone computer device and executes the image classification learning method.

 However, the processing of the image classification learning device may be distributed among multiple computer devices, and the number of arithmetic devices executing the computer processing may be one or more. Furthermore, the image classification learning device is not limited to a program installed on such a computer device; in general, it may be realized as an arithmetic processing unit such as a microcomputer combining an arithmetic device and a storage device, or it may be implemented in a dedicated IC circuit, an FPGA (Field-Programmable Gate Array), or other electronic circuitry.

[Embodiment 1]
(Concept-based image classification)
 Below, a configuration is described in which an image classifier using a neural network extracts from an image "regions of interest in image features," called "concepts," and classifies the image by using the activation of these "concepts" in the input image as the image representation.

 In this specification, the term "concept" refers to a feature region within an "image" of the learning dataset to which the classifier directs its "attention" when classifying during machine learning of a neural network image classifier, and which is separated from other feature regions to a degree that satisfies a predetermined condition. The method of "classification based on concepts" is also called "concept-based classification."

 Here, the "predetermined condition" refers to a condition that enables the trained model to learn concepts such that, independently of the correct labels, the feature values of feature regions (in different images) corresponding to the same concept become as similar as possible, the feature values of feature regions corresponding to different concepts become as dissimilar as possible, and the original image can be reconstructed or identified from the activation vector alone.
 The image classifier described below is an artificial intelligence learning model that, based only on images serving as learning data and on labels indicating the classes of those images, can learn, in parallel with learning the image classification task, the "concepts" that form the optimal bottleneck for the target classification task. In this specification, the model structure before learning (its mathematical configuration and parameter configuration) is called the "learning model," and after the values of the model parameters have been determined by the learning processing, it is called the "trained model." The "trained model" functions as part of a program by being installed on a computer. Although not limited to this, the "trained model (classifier)" may be recorded, as a program or as part of a program, on a computer-readable recording medium and installed on a computer other than the one that performed the learning processing.

 Such a "learning model" includes a "(self-)attention mechanism" described later, and makes it possible to identify, during the machine learning process, the regions in which each of the above-mentioned concepts is discovered. By displaying together the "learning images" that share a detected "concept," a human can easily understand what each learned concept represents, which in turn provides clues for interpreting the classification and judgment processes.

 Here, the "attention mechanism" has the function of gating the channels of the "feature map" extracted from the "images" of the input learning data, so that much of the information of maps considered worthy of attention passes through, while little of the information of maps considered unworthy of attention passes through.

 In particular, when the configuration of a "self-attention mechanism" is used, the query (Q), key (K), and value (V) used in learning the regions to "attend" to are all generated from the same input data. In this embodiment, however, the method of realizing the "attention mechanism" is not limited to such a "self-attention mechanism."
 As described below, the following embodiments aim to provide an image classification learning device, an image classification learning method, and an image classification learning program. The "trained model (image classifier)" of the embodiments takes the activation level of each concept as input to characterize and classify images.

[Embodiment 1]
(Configuration of an image classification learning device that learns concepts)
 FIG. 1 is a functional block diagram for explaining the configuration of the image classification learning device 1000 according to the first embodiment.

 As described below, the image classification learning device 1000 takes as input learning data consisting of a plurality of pieces of image data and image labels (indicating the classes to be classified) associated with the respective pieces of image data, and generates a trained model for image classification.
 In this case, the image dataset serving as the input learning data is given as follows:

D = \{ (x_i, y_i) \}_{i=1}^{N}
 Here, x_i is an image and y_i is the target class, in the set Ω, associated with x_i. The image classification learning device 1000 learns a set of k concepts using only the labels of the images.

 Referring to FIG. 1, the image classification learning device 1000 includes a convolutional neural network (hereinafter referred to as a CNN backbone) 100 serving as a backbone that generates a feature map from input image data, a concept learner 200, a concept regularization unit 300, a classifier 400, a quantization error calculation unit 500, a loss calculation unit 600 that calculates the amount of loss during learning as described later, and a learning process control unit 700 that controls the learning processing according to the loss calculated by the loss calculation unit 600.

 As described later, the CNN backbone 100, the concept learner 200, the concept regularization unit 300, the classifier 400, the quantization error calculation unit 500, the loss calculation unit 600, and the learning process control unit 700 correspond to functions realized by an arithmetic device operating based on a program; in this program, each of them can be implemented, for example, as a program module.

 Although not limited to this, the concept learner 200, the concept regularization unit 300, the classifier 400, and the quantization error calculation unit 500 can each be configured as a module of an individual neural network, whose parameters are adjusted by the learning process control unit 700 based on the loss calculated by the loss calculation unit 600. However, the CNN backbone 100 may, for example, also be included among the learning targets to form a so-called "end-to-end" configuration, and the configuration of the neural network or artificial intelligence is not limited to such configurations.
 For input image data x, the CNN backbone 100 extracts a feature map F expressed as follows:

F \in \mathbb{R}^{c \times h \times w}
 Here, c is the number of channels, that is, the number of feature maps. In other words, the CNN backbone 100 divides the input image into h × w regions, and in each of these regions there is a vector with c elements. F is thus a c × h × w feature map.

 The feature map F is then input to the concept learner 200. Here, in FIG. 1, a concept prototype processing unit 2100 learns a concept matrix W according to a procedure described later, and each column vector of the matrix W is referred to in this specification as a learned "concept prototype."

 The concept learner 200 generates a concept activation t indicating the presence of each concept, and image features V from the regions where the respective concepts exist in x. The concept activation t is used as the input of the classifier 400, and the classifier 400 learns to calculate a score s indicating the classification result for the image classes.
 The concept activation t, the image features V, and the score s are as follows:

t \in \mathbb{R}^{k}, \quad V \in \mathbb{R}^{k \times c}, \quad s \in \mathbb{R}^{|\Omega|}
 Here, |Ω| denotes the number of elements of the set Ω.

 The concept regularization unit 300 takes the concept activation t and the image features V as input and, in the processing for updating the concept prototypes, imposes restrictions for the consistency of individual concepts and the mutual discriminability between concepts, as described later, and also performs supervised self-learning.

(Concept Learner 200)
 Based on a self-attention mechanism, the concept learner 200 uses the technique of "slot attention" to learn, for the image dataset D, "concepts" that can afterwards be associated with the features that serve as the basis of recognition in human visual perception.
 In the concept learner 200, a position information encoding processing unit 2002 performs position embedding (position information encoding) on the input feature map F by adding position embedding information P to the feature map F in order to retain spatial information, as follows:

F' = F + P, \quad P \in \mathbb{R}^{c \times h \times w}
 "Position information encoding" is disclosed, for example, in the following document:
 Known document: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-Centric Learning with Slot Attention. In Proc. NeurIPS, 2020.
 The feature map F' with the embedded position information is processed by a shaping processing unit 2004 to flatten its spatial dimensions.

 As the self-attention mechanism, a similarity calculation unit 2010 calculates the dot-product similarity between a query Q(W), obtained by a nonlinear processing unit 2008 applying a nonlinear transformation to the concept matrix W representing the concept prototypes successively updated by the concept prototype processing unit 2100, and a key K(F'), obtained by a nonlinear processing unit 2006 applying a nonlinear transformation to the feature map F'.
 The concept prototypes (concept matrix) W may, for example, though without limitation, be generated and updated by a GRU (Gated Recurrent Unit), a neural network model capable of learning time-series data, as described in the following document:
 Known document: Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition. In Proc. ICCV, pages 1046-1055, 2021.
 In the above document, in order to adapt a slot matrix consisting of weight vectors (slot vectors) (corresponding to the concept matrix W of this embodiment) to the input image, the slot matrix is updated by a GRU using U^(t), a weighted sum of the features over the spatial dimensions, together with the slot matrix at the previous step. In contrast, the concept matrix W of this embodiment can be configured to be converted by a GRU into the concept matrix W at the next step using the image features V described later and the concept matrix W at the previous step. However, the method of updating the concept matrix W is not limited to this method.

 For example, instead of using a GRU, the following description assumes that W is transformed by Q(...) (a neural network of three fully connected (FC) layers).
 Here, in Q(W) and K(F'), the nonlinear transformations applied to W and F', respectively, are given as multilayer perceptrons having three FC layers (fully connected layers) with ReLU nonlinear layers between them, and have the following dimensions:

Q(W) \in \mathbb{R}^{c \times k}, \quad K(F') \in \mathbb{R}^{c \times hw}
 A normalization unit 2012 calculates an "attention matrix A" given by the following equation (1):

A = \phi\!\left( Q(W)^{\top} K(F') \right) \in [0, 1]^{k \times hw} \quad (1)
 Here, the function φ is a normalization function.

 This attention matrix A indicates at which positions in the image the k concepts exist, as shown in FIG. 7 described later.

 The normalization function φ determines the spatial distribution of each concept, which depends on the target domain of the classification.

 For example, the images of a handwritten digit recognition dataset are usually black and white, and only the shapes formed by the strokes matter. In this case, concepts are unlikely to overlap spatially. Natural images, on the other hand, have color, texture, and shape, so concepts may overlap at the same spatial position.
 For the non-overlapping case, φ can be designed as follows:

\phi(Z) = \sigma(Z) \odot \mathrm{softmax}(Z) \quad (2)
 Here, σ is the sigmoid function, and the product between σ and the softmax function is the Hadamard product. The softmax function is applied over the concepts (that is, over each column vector; in this specification, this column vector is hereinafter referred to as a "slot vector") so that different concepts are not detected at the same spatial position.
 On the other hand, to allow concepts to overlap, only the sigmoid function can be used for the normalization, as follows:

\phi(Z) = \sigma(Z) \quad (3)
 A concept occurrence calculation unit 2030 calculates the concept activation vector t by summing A along the spatial dimension, as in the following equation (4). Each element of the concept activation vector indicates whether the corresponding concept appears, and each element is called a concept activation.

t_k = \sum_{m=1}^{hw} A_{k,m} \quad (4)
 In the concept learner 200, a shaping processing unit 2020 also reshapes the feature map F to flatten its spatial dimensions, yielding the following feature map F*:

F^{*} \in \mathbb{R}^{c \times hw}
 A similarity calculation unit 2040 then calculates and extracts the image features V from the feature map F* using the following equation:

v_k = \sum_{m=1}^{hw} \omega_{k,m}\, f^{*}_{m}, \quad \omega_{k,m} = \frac{A_{k,m}}{\sum_{m'} A_{k,m'}} \quad (5)

where f*_m denotes the m-th column of F* and v_k is the k-th row of V.
 Here, weighting by ω_k gives the average of the image features over the spatial dimensions, weighted by the attention.
(Quantization and Quantization Loss)
 The concept activation t described above is an index indicating the presence of each concept, and could in principle be expressed as a binary value.

 However, since the neural network is trained by gradient descent, the continuous values described above are used instead. In training the neural network, a quantization loss is then used to ensure that the values become close to either 0 or 1.
 Such a quantization loss l_qua is given by the following equation, where B is a mini-batch, that is, a random subset of D:

l_{qua} = \frac{1}{|B|\,k} \sum_{x \in B} \sum_{j=1}^{k} \min(t_j,\, 1 - t_j)^2 \quad (6)

 Here, t is the concept activation calculated for the image x.
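 As one possible realization of the quantization loss of equation (6), the following minimal Python sketch penalizes concept activations that are far from both 0 and 1; the elementwise min(t, 1 − t) form is an assumption consistent with the description above rather than a verbatim transcription of the embodiment's formula.

```python
import torch

def quantization_loss(t: torch.Tensor) -> torch.Tensor:
    # t: concept activations for a mini-batch, shape (batch, k).
    # The penalty is zero exactly when every activation is 0 or 1,
    # pushing the continuous activations towards binary values.
    return torch.minimum(t, 1.0 - t).pow(2).mean()
```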
(Concept Regularization Unit 300)
 Since the only supervision in the training of the image classification learning device 1000 is the image-level label y, the concept learner might not consistently capture meaningful features.

 The concept regularization unit 300 therefore executes concept regularization processing so that learning of proper "concepts" progresses.

 Two of these concept regularization processes constrain the concept prototypes through V. The other adopts supervised self-learning through an image reconstruction or image retrieval task in order to obtain better representational power.

 FIG. 2 is a functional block diagram for explaining the configuration of the concept regularization unit 300.

 FIGS. 3A to 3C are conceptual diagrams showing the concept of the processing of the concept regularization unit 300 of FIG. 2.
 Referring to FIG. 2 and FIGS. 3A to 3C, a discrimination loss calculation unit 3010 ensures the individual consistency of the concepts: for each concept extracted as a "concept" by the learning processing of the concept learner 200 to be easy for humans to interpret, each learned concept should not contain many variations within itself.

 Such regularization of the "concept learning" can be taken into account in the loss terms (encoded as loss terms) through the image features V and the concept activation t.

 Here, in the training of the concept learner 200, so-called "mini-batch learning" is performed: a subset of n items is randomly taken from the N training data items, and the parameters are updated.

 During training, the k-th element t_k of the concept activation t can be used to identify the images in a mini-batch that have concept k.

 That is, first, as shown in FIG. 3C, the discrimination loss calculation unit 3010 calculates a "consistency loss" as follows. The image feature v_k, which is the k-th row vector of the image features V, contains the image features from the region corresponding to concept k when t_k is close to 1. Let H_k denote the set of all pairs of image features v_k within the mini-batch for which t_k is greater than a threshold ξ set in advance empirically and experimentally.
 Using the cosine similarity sim(·,·), the "consistency loss" is defined as follows:

l_{con} = \frac{1}{\sum_k |H_k|} \sum_{k} \sum_{(v_k, v'_k) \in H_k} \left( 1 - \mathrm{sim}(v_k, v'_k) \right)
 l_con penalizes a smaller similarity between image features v_k and v'_k corresponding to concept k from two different images.

 That is, the "consistency loss" is a loss term for advancing learning during mini-batch learning so that the "image features" of different images belonging to the "same concept" become "more similar" even though the images differ.
(Mutual Distinctness of Concepts)
 To capture different aspects of an image, each concept must attend to different visual elements, and the discrimination loss calculation unit 3010 calculates the following "discriminability loss" as a loss term. The average image feature of concept k within a mini-batch is given by:

\bar{v}_k = \frac{1}{|B_k|} \sum_{x \in B_k} v_k, \quad B_k = \{ x \in B \mid t_k > \xi \}

and the discriminability loss is defined as

l_{dis} = \frac{1}{|M|} \sum_{(k, k') \in M} \mathrm{sim}(\bar{v}_k, \bar{v}_{k'})
 Here, the set M is the set of all pairs of average image features. Note that a concept k is excluded from the set M if no image in the mini-batch has concept k.

 That is, the "discriminability loss" is a loss term for advancing learning during mini-batch learning so that the "average image features" of images belonging to "different concepts" become "more different."
(Supervised self-learning)
 "SENN," disclosed in Non-Patent Document 8 described in the background art, uses an autoencoder structure for supervised self-learning. This is effective, for example, for handwritten digit recognition tasks, in which the different visual elements (line patterns) are strongly tied to their positions.

 For example, a cross consisting of a horizontal line and a vertical line appears only in the digit 4, which is generally placed near the center of the image.

 However, this does not necessarily hold for more general "natural-world images." The concept regularization unit 300 of this embodiment therefore introduces "supervised self-learning" for evaluating a search-based loss for natural images, in addition to a loss based on image reconstruction.
 Thus, in this embodiment, in the concept regularization unit 300, a reconstruction-based loss calculation unit 3020 as shown in FIG. 3B or a search-based loss calculation unit 3030 as shown in FIG. 3A executes the processing described below, selectively or in parallel according to the type of target of the classification task, for example by external setting in advance, to calculate a loss term for the training of the concept learner 200.

(Reconstruction-Based Self-Learning)
 In image domains where visual elements can be expected to be strongly tied to their positions (for example, MNIST, a set of handwritten digit images), the concept activation t has sufficient information to reconstruct the original image.

 Thus, as shown in FIG. 3B, the reconstruction-based loss calculation unit 3020 includes a concept decoder D, which takes the concept activation t as input and reconstructs the original image so that the image x and the output D(t) of the concept decoder D become similar to each other.

 Here, the reconstruction-based loss l_rec in the supervised self-learning is defined by the following equation:

l_{rec} = \frac{1}{|B|} \sum_{x \in B} \| x - D(t) \|_2^2
 The "reconstruction-based loss" therefore becomes smaller, with the concept activation t acting as the supervisory signal of the self-learning, the more similar the reconstructed image is to the original image.
(Search-based self-learning)
 In general, however, the concept activation t corresponds to concepts placed at arbitrary positions and is therefore insufficient for reconstructing the original image x; the spatial information needed for reconstruction has been lost in the concept activation t.

 Thus, as shown in FIG. 3A, instead of reconstructing the original image, the search-based loss calculation unit 3030 performs the simple retrieval task of finding images of the same class within the mini-batch B using the concept activation t. For an arbitrary pair (t, t') computed from images x, x' ∈ B with image labels y and y', respectively, a function J is defined as follows:

J(t, t') =
\begin{cases}
\mathrm{sim}(t, t') & (y = y') \\
-\,\mathrm{sim}(t, t') & (y \neq y')
\end{cases}
 Here, if t and t' have the same class label, they should be similar to each other, since a similar set of visual elements should appear in the images x and x'. If they do not have the same class label, t and t' should differ.

 In general, the number C_S of pairs having the same label is much smaller than the number C_D of pairs having different labels. A weight α(y, y') is therefore introduced to mitigate this imbalance, taking the value C_D / (C_S + C_D) when y = y' and C_S / (C_S + C_D) otherwise.
 The search-based loss calculation unit 3030 then defines the search-based loss l_ret in the supervised self-learning described above by the following equation:

l_{ret} = -\frac{1}{|B|^2} \sum_{(x, x')} \alpha(y, y')\, J(t, t')
 Here, the sum is calculated over all pairs of images (x, x') and their corresponding labels (y, y').

 That is, with the concept activation t acting as the supervisory signal of the self-learning, the "search-based loss" becomes smaller the more similar the concept activations are for different images having the same class label, and the more dissimilar they are for different images having different class labels.
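 The following is a minimal sketch of a search-based loss of this kind; the concrete realizations of J (signed cosine similarity over activations) and of the weight α(y, y') are assumptions consistent with the description above, not a verbatim transcription of the embodiment.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # t: concept activations for a mini-batch, shape (batch, k)
    # y: class labels, shape (batch,)
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()   # 1 where labels match
    c_s, c_d = same.sum(), (1.0 - same).sum()
    # alpha(y, y'): down-weight the majority pair type to mitigate imbalance
    alpha = same * (c_d / (c_s + c_d)) + (1.0 - same) * (c_s / (c_s + c_d))
    tn = F.normalize(t, dim=1)
    sim = tn @ tn.T                                     # cosine similarity of activations
    # J rewards similar activations for same-class pairs and
    # dissimilar activations for different-class pairs
    j = torch.where(same.bool(), sim, -sim)
    return -(alpha * j).mean()
```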
 Furthermore, as in the reconstruction case, the influence of each concept can be visualized by showing a human the top-ranked images based on a modified concept activation t' and the similarity t^T t' computed against the concept activations of all images in the training image dataset D.

(Loss for the classification performance of the classifier)
 In the following, the classifier 400 is described as a single fully connected layer without a bias term.

 Here, the following holds:

s = W_c\, t, \quad W_c \in \mathbb{R}^{|\Omega| \times k}

where W_c is the weight matrix of the fully connected layer.
 The training of this simple classifier 400 can be interpreted as finding the co-occurrence between the activation of each concept and the class to be assigned.

 That is, by defining the cross-entropy as follows, the loss relating to the classification performance of the classifier 400 (hereinafter referred to as the "classification performance loss") can be evaluated:

l_{cls} = -\frac{1}{|B|} \sum_{(x, y) \in B} \log \mathrm{softmax}(s)_y \quad (14)
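 A minimal sketch of this bias-free single-FC-layer classifier and the classification performance loss is shown below; the sizes n_concepts and n_classes are illustrative values only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The classifier: a single fully connected layer with no bias term,
# mapping k concept activations to one score per class in Omega.
n_concepts, n_classes = 50, 200          # illustrative sizes
classifier = nn.Linear(n_concepts, n_classes, bias=False)

t = torch.rand(8, n_concepts)            # concept activations for a mini-batch of 8
y = torch.randint(0, n_classes, (8,))    # ground-truth class labels
s = classifier(t)                        # class scores s = W_c t
l_cls = F.cross_entropy(s, y)            # classification performance loss, eq. (14)
```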

(Total loss)
 In the learning process, the total loss of the image classification learning device 1000 is defined by combining the above loss terms as follows:

L = l_{cls} + \lambda_{qua}\, l_{qua} + \lambda_{con}\, l_{con} + \lambda_{dis}\, l_{dis} + \lambda_{R}\, l_{R} \quad (15)

where l_R is the reconstruction-based loss l_rec or the search-based loss l_ret, selected according to the target domain, and λ_qua, λ_con, λ_dis, and λ_R are hyperparameters weighting the respective loss terms.

 As described above, the learning process control unit 700 controls the learning processing according to the loss calculated by the loss calculation unit 600.
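 The combination of equation (15) can be sketched as a simple Python function; the default weights below match the values reported later in the embodiment, and the argument names are illustrative.

```python
def total_loss(l_cls, l_qua, l_con, l_dis, l_r,
               lam_qua=0.1, lam_con=1.0, lam_dis=1.0, lam_r=1.0):
    # Weighted combination of the loss terms as in equation (15).
    # l_r stands for either the reconstruction-based loss l_rec or the
    # search-based loss l_ret, selected according to the target domain.
    return (l_cls + lam_qua * l_qua + lam_con * l_con
            + lam_dis * l_dis + lam_r * l_r)
```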
(Classifier configuration generated as a result of learning)
 FIG. 4 is a block diagram for explaining the hardware configuration of the image classification learning device 1000 shown in FIG. 1.

 As described above, the image classification learning device 1000 may be configured so that an arithmetic device (CPU: Central Processing Unit) inside its own housing executes the arithmetic processing, or the program processing itself may be executed on a server. In the following, it is assumed that an arithmetic device inside its own housing executes the arithmetic processing.

 Referring to FIG. 4, the image classification learning device 1000 includes a computer device 6010, a network communication unit 6300 for communicating with a network, a camera 6400 for providing captured image data to the computer device 6010 as necessary, and a recording medium (for example, a memory card) 6210 for recording captured image data and providing it to the computer device 6010.

 For example, a USB memory, a memory card, or an external storage device can be used as the recording medium 6210, and, for example, a wired LAN router or a wireless LAN access point can be used as the network communication unit 6300. The image data may also be provided to the computer device 6010 via such a network communication unit 6300.
 As shown in FIG. 4, the computer main body constituting the computer device 6010 includes, in addition to a disk drive 6030 and a memory drive 6020, a CPU (Central Processing Unit) 6040, a memory including a ROM (Read Only Memory) 6060 and a RAM (Random Access Memory) 6070, a nonvolatile rewritable storage device such as an SSD (Solid State Drive) 6080, and an input/output interface 6090 for communicating via a network and exchanging data with the outside, each connected to a bus 6050. An optical disc can be loaded into the disk drive 6030, and a memory card 6210 can be loaded into the memory drive 6020.

 As described later, when a program of the computer device 6010 runs, the data and programs storing the information underlying its operation as a computer are assumed to be stored in the SSD 6080. The RAM 6070 functions as working memory when the CPU 6040 performs arithmetic operations; data and parameters in the middle of computation are stored in and read from it as needed, and the CPU 6040 executes the arithmetic processing.

 In FIG. 4, the computer-readable non-transitory recording medium storing information such as programs to be installed on the computer main body may be, for example, a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory. To handle such media, the computer main body 6200 is provided with drive devices (the memory drive 6020 and the disk drive 6030) capable of reading them.

 The main part of the computer device 6010 is constituted by computer hardware and software executed by the CPU 6040. Generally, such software is distributed stored on a computer-readable non-transitory storage medium or distributed via a network, acquired via the disk drive 6030 or the network communication unit 6410, and temporarily stored in the SSD 6080. It is then read from the SSD 6080 into the RAM 6070 of the memory and executed by the CPU 6040. When connected to a network, the software may also be loaded directly into the RAM and executed without being stored in the SSD 6080.

 The program for functioning as the computer device 6010 described below need not necessarily include, at the time of distribution, an operating system (OS) that causes the computer main body 6010 to execute the functions of an information processing device or the like. The program need only include the portions of instructions that call appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 6010 operates is well known, and a detailed description is omitted.

 Furthermore, the CPU 6040 may be a single-core processor or a multi-core processor.
 FIG. 5 is a flowchart for explaining the learning processing of the image classification learning device 1000 shown in FIG. 1.

 Referring to FIGS. 5 and 1, when the learning processing starts, learning image data selected for mini-batch processing is input (S100), and the CNN backbone 100 extracts a feature map (S102).

 Next, the position information encoding processing unit 2002 encodes position information into the feature map (S104), and the shaping processing unit 2020 flattens the feature map (S106).

 The shaping processing unit 2004 flattens the feature map F' with the encoded position information (S108), and the nonlinear processing unit 2008 applies nonlinear processing to the concept matrix W output from the concept prototype processing unit 2100 to generate the query Q(W) (S110).

 Meanwhile, the nonlinear processing unit 2006 generates the key K(F') by nonlinear processing of the feature map F' (S112); the similarity calculation unit 2010 calculates the dot product between the query Q(W) and the key K(F'), and the normalization unit 2012 normalizes the dot product to generate the attention matrix A (S114).

 The concept occurrence calculation unit 2030 calculates the concept activation t and inputs it to the classifier 400 (S116). Meanwhile, the similarity calculation unit 2040 generates the image features V, as in equation (5), from the dot product of the attention matrix A and the flattened feature map.

 From the concept activation t and the image features V, the concept regularization unit 300 calculates the "consistency loss," the "discriminability loss," the "reconstruction-based loss," and the "search-based loss"; the quantization error calculation unit 500 calculates the quantization error loss of the concept activation t according to equation (6); and, according to the output from the classifier 400, the loss calculation unit 600 calculates the "classification performance loss" of equation (14) and calculates the total loss L in the learning process of the concept learner 200 by equation (15) (S120).
 Based on the total loss L, the learning process control unit 700 updates the parameters of the concept learner 200, which is constituted by a neural network, for example by gradient descent. The method of updating the model parameters is not limited to this method.
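 One mini-batch iteration of this procedure can be sketched as follows; the module names (backbone, extractor, classifier) and the aggregated loss callable compute_losses are illustrative assumptions, and the optimizer stands for any gradient-descent-based PyTorch optimizer.

```python
import torch

def train_step(batch, backbone, extractor, classifier, compute_losses, optimizer):
    # One mini-batch update corresponding to steps S100 through S120.
    # compute_losses is assumed to combine the loss terms into the
    # total loss L of equation (15).
    x, y = batch                           # S100: mini-batch of images and labels
    feat = backbone(x)                     # S102: feature map extraction
    a, t, v = extractor(feat)              # S104-S116: attention, activations, features
    s = classifier(t)                      # classification scores
    loss = compute_losses(x, y, s, t, v)   # S120: total loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # gradient-descent parameter update
    return float(loss.detach())
```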
 When the learning process control unit 700 determines that the learning processing by mini-batch processing satisfies a predetermined condition, it ends the learning processing; otherwise, it returns the processing to step S100.

 The processing of FIG. 5 described above can be executed by the CPU 6040 in the hardware configuration shown in FIG. 4, for example by a computer program stored in the nonvolatile storage device (SSD) 6080. Each process can also be executed in a distributed manner, and a cloud-type configuration in which the processing is executed by a server device is also possible.
 FIG. 6 is a functional block diagram for explaining the configuration of an image classification device 4000, including the classifier 400 generated by the learning of the image classification learning device 1000 shown in FIG. 1, when it executes classification processing on a new image.

 In FIG. 6, components that execute the same processing as in the image classification learning device 1000 shown in FIG. 1 are denoted by the same reference numerals.

 Essentially, as to the concept prototype processing unit 2100, when classification processing is executed after the learning processing has ended, the concept matrix W is fixed to its state at the end of learning and is no longer updated; it is therefore denoted in FIG. 6 as a concept prototype storage unit 4100. For the "concept prototype storage unit 4100," for example, the learned parameters are stored in memory.

 The same applies to the other components of FIG. 6, including the classifier 400: their parameters are fixed to those at the end of learning.
 FIG. 7 is a conceptual diagram for explaining the processing executed by the classifier 400 of FIG. 6.

 It is assumed that the concept prototypes (concept matrix) W have been generated from the learning dataset and that the concept prototype storage unit 4100 holds that state.

 When a new image is input, the concept occurrence calculation unit 2030 calculates the concept activation t for each concept. This is called a "concept bottleneck" in the sense that the diversity of the original image is condensed into a small number of features.
 In the classifier 400, a pattern of concept co-occurrence has been learned for each label of the learning images; when the concept bottleneck for a new image is input to the classifier 400, its similarity to the concept co-occurrence patterns is calculated, and the label having the highest similarity among the labels is output as the classification result.
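 The inference path can be sketched as follows, reusing the module interfaces assumed in the earlier sketches; all names are illustrative, and the parameters are taken to be frozen at their values at the end of learning.

```python
import torch

@torch.no_grad()  # all parameters are fixed to their values at the end of learning
def classify(x, backbone, extractor, classifier):
    # Classify a single new image x of shape (c, h, w); names are illustrative.
    feat = backbone(x.unsqueeze(0))   # feature map for a batch of one
    _, t, _ = extractor(feat)         # concept bottleneck: activations t
    s = classifier(t)                 # similarity to each class's co-occurrence pattern
    return int(s.argmax(dim=1)), t.squeeze(0)
```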
 The example shown in FIG. 7 illustrates the case where natural images of birds have been learned. Here, concepts such as "yellow head" and "black body" have been generated for the bird of label 1, and in the process of computing the classification result, the degree of co-occurrence of each such concept with the concepts in the target image is judged.

 That is, taking the image of a "black bird" with a "yellow head" as an example, the image classification learning device 1000 has acquired through learning concepts that can be interpreted as "yellow head" and "body with black feathers."

 The classifier 400 is a single fully connected (FC) layer, and thus encodes the co-occurrence of each concept with each class.

 The image classification learning device 1000 therefore learns the "classifier" and the "bottleneck concepts" simultaneously.

 As explained so far, to enhance representational power and interpretability, the concepts are constrained to be individually consistent (that is, a single concept occupies a smaller volume in the feature space) and mutually distinctive (that is, a pair of concepts does not occupy, or is separated so as to be less likely to occupy, the same region of the feature space).

 "Individual consistency of the concepts" is taken into account mainly by the "consistency loss," and "mutual distinctiveness" mainly by the "discriminability loss."
 Imposing such constraints has the following effects:
 i) a concept corresponds only to specific visual elements (or features), making it easy to see what each concept represents;
 ii) different concepts are dissimilar to one another and cover a greater variety of visual elements.

 Such constraints i) and ii) are also consistent with human intuition when classifying images.
 The image classification learning device 1000 can thus also learn the classifier and the concepts simultaneously, end to end.

 Of course, the configuration is not necessarily limited to an end-to-end one; for example, at least part of the CNN backbone 100 may be fixed and only the other parts trained.

 The image classification learning device 1000 is thus configured to be able to give humans an intrinsic explanation of the classification process in two ways.

 First, the activation of the concepts intuitively conveys what the model has discovered.

 Second, the classifier provides a prototype of each target class in terms of the concepts.

 The experimental results described below demonstrate that the image classification learning device 1000 can give a more intuitive interpretation (at least qualitatively) without a large loss of performance.
 The ablation experiments below show the importance of the concept constraints, as well as which supervised self-learning task is suited to which target classification task.

(Results of the evaluation experiment on the classifier)
(Experimental Setup)
 Datasets: the results of experiments on three classification tasks, the handwritten digit recognition task MNIST (MIT license), the natural image recognition task CUB200 (custom license), and ImageNet (3-clause BSD license), are described below.

 Training of the image classification learning device 1000 becomes more likely to fail as the number of classes increases.

 In the following experiments, therefore, the full set of CUB200 and a subset of ImageNet extracting the first n classes (0 < n < 1000) in ascending order of class ID were used.

 The classification accuracy on the three tasks and a qualitative analysis on MNIST and CUB200 are presented to verify the interpretability provided by the image classification learning device 1000. A supplementary qualitative analysis on ImageNet is also mentioned.

(Training details)
 For MNIST, the same networks as "SENN" disclosed in Non-Patent Document 8 were used for the backbone and the concept decoder.

 For CUB200 and ImageNet, a pretrained ResNet disclosed in the following document was used as the backbone, and the number of channels (512 for ResNet-18, 2048 for the others) was reduced to 128 by a 1 × 1 convolutional layer.
 Known document: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, pages 770-778, 2016.
 All input images were resized to 256 × 256 and cropped to 224 × 224. Only random horizontal flipping was adopted as data augmentation during training.

 The default number of concepts k was 20 for MNIST and 50 otherwise. The default weights of the respective losses were as follows:
 λ_qua = 0.1, λ_con = 1, λ_dis = 1, λ_R = 1

 The influence of the number of concepts k and of these weights was also examined.

(Classification performance)
 With the number of concepts k = 20, the classification accuracy on the MNIST test set was 96.7%.

 FIGS. 8A and 8B show the classification performance of the classifier 400 on CUB200 and ImageNet.

 In FIGS. 8A and 8B, the entry labeled BotCL denotes the classifier 400.

 FIG. 8A compares the performance of the classifier 400 and the baseline model ResNet on CUB200 and on ImageNet with n = 300.

 A performance drop of roughly 3 points is observed for the classifier 400 on all tasks.

 FIG. 8B shows the change in performance with respect to the number of classes on CUB200 (20 to 200 classes) and ImageNet (50 to 300 classes).

 As the number of classes increases, the performance of both the baseline model and the classifier 400 tends to decrease.
 However, although the classifier 400 consists of a single fully connected layer taking the concept activation t as input, its classification performance hardly deteriorates relative to the baseline model. In other words, the "concepts" corresponding to the concept activation t sufficiently express the image features relevant to classification.

(Interpretability)
(Validity of detected concepts)
 As described above, in the learning process, the image classification learning device 1000 calculates, for each input image, the concept activation t indicating the presence of each concept.

 The concept activation t_k corresponds to the sum over the spatial dimensions of the attention a_k for concept k. By visualizing this a_k, a human can qualitatively confirm the presence or absence of the concept.

 FIGS. 9A to 9C are diagrams for explaining the validity of the concepts represented by the concept activation t.

 FIG. 9A shows, for MNIST, the five concepts that are most frequently activated (that is, t_k > 0.5), with a_k superimposed on the original image. The attention regions of the concepts appear brighter in the overlay. Taking the digits 0 and 9 as an example, the clear difference between the two is the activation of concept 2 (Cpt.2), whose attention lies on the lower vertical line.
 図9Bは、各概念が活性化する頻度のトップ5を示したものであり、図9Cは、各概念によって、再構成される画像を示す。 Figure 9B shows the top five most frequently activated concepts, and Figure 9C shows the images reconstructed by each concept.
 画像分類学習装置1000は、学習された概念が個々に整合し、相互に特徴的であるように設計されている。 The image classification learning device 1000 is designed so that the learned concepts are individually consistent and distinctive from each other.
 このことは、各概念を学習画像データDにおける活性化サンプルの上位P個と概念kを重ね合わせて表示することで定性的に検証できる。図9Bに示すように、MNISTではP=5とした。概念によって異なるストロークパターンに注目しており、各概念は異なるサンプル間で(異なるカテゴリのサンプルでも)一貫した注目領域を持っていることが観察される。 This can be qualitatively verified by overlaying each concept with the top P activated samples in the training image data D and concept k. As shown in Figure 9B, in MNIST, P = 5. Different concepts focus on different stroke patterns, and it can be observed that each concept has a consistent area of focus across different samples (even samples from different categories).
 また、ある概念を取り除くことで、その寄与を定性的に確認し、対応する教師あり自己学習タスクの出力の変化を見ることができる。 The contribution of a concept can also be confirmed qualitatively by removing it and observing the resulting change in the output of the corresponding supervised self-learning task.
 図9Cでは、数字9に対する注意領域である縦線を担当する概念2(Cpt.2)の活性度をゼロにすると、再構成された画像は数字0に変化していることがわかる。 In Figure 9C, when the activation of concept 2 (Cpt.2), which is responsible for the vertical line and is the attention area for the number 9, is set to zero, the reconstructed image changes to the number 0.
 また、数字7の注意領域であって、円の非存在を表す概念1(Cpt.1)を非活性化すると、再構成画像の上部に円が現れ、数字9に近い形となっているのが分かる。 Furthermore, when concept 1 (Cpt.1), which is the attention area for the number 7 and represents the absence of a circle, is deactivated, a circle appears at the top of the reconstructed image, and it can be seen that its shape resembles the number 9.
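A minimal sketch of this intervention, assuming a decoder head that maps an activation vector back to an image (the name `decoder` and the signature are illustrative, not fixed by the description):

    import torch

    @torch.no_grad()
    def intervene_and_reconstruct(decoder, t: torch.Tensor, concept_idx: int,
                                  value: float = 0.0) -> torch.Tensor:
        # Overwrite one concept activation and decode, as in Fig. 9C:
        # value = 0.0 deactivates the concept, value = 1.0 activates it.
        t_mod = t.clone()
        t_mod[concept_idx] = value
        return decoder(t_mod.unsqueeze(0))  # reconstructed image tensor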
 図10は、入力画像である黄色い頭の黒い鳥に対する、最も重要な(後述する「重要度」に基づく)5つの概念の注目度を示す図である。 Figure 10 shows the attention levels of the five most important concepts (based on "importance" described below) for the input image of a black bird with a yellow head.
 概念1~5(Cpt.1~Cpt.5)の注目度は、頭、首、胴体、足など鳥の様々な部分をカバーしていることが分かる。これは、画像分類学習装置1000が自然画像から様々な概念を学習できることを証明している。 It can be seen that the attention of concepts 1 to 5 (Cpt.1 to Cpt.5) covers various parts of the bird, such as the head, neck, body, and legs. This proves that the image classification learning device 1000 can learn a variety of concepts from natural images.
 (各概念の一貫性・独自性) (Consistency and distinctiveness of each concept)
 図11は、自然画像について、概念活性度tにより表現される概念を説明するための図である。 Figure 11 is a diagram to explain the concepts expressed by concept activity t for natural images.
 図9Bで示したのと同様の重ね合わせの表示を、図11に示す。 Figure 11 shows a similar overlay display to that shown in Figure 9B.
 画像分類学習装置1000はCUB200データセットでも、図9Bで説明したのと同様の挙動を示す。選択されたP=5個の概念は異なるパターンに注目し、各概念はサンプル間で一貫した注目領域を持っていることがわかる。 The image classification learning device 1000 exhibits the same behavior on the CUB200 dataset as described for Fig. 9B. The selected P = 5 concepts attend to different patterns, and each concept has a consistent region of attention across samples.
 (推論における各概念の貢献度) (Contribution of each concept to inference)
 分類器400は、1つの全結合層からなり、概念の共起を学習すると解釈できる。 The classifier 400 consists of one fully connected layer and can be interpreted as learning the co-occurrence of concepts.
 したがって、1回の推論で、クラスωに対する概念kの寄与を以下の式で定義する。 Therefore, in one inference, the contribution of concept k to class ω is defined as:

    s_kω = t_k · w_ωk    …(数21 / Math 21)

 ここで、w_ωkは、分類器400の全結合層において概念kをクラスωに結び付ける重みである。 Here, w_ωk is the weight of the fully connected layer of the classifier 400 that links concept k to class ω.
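Since the classifier is a single fully connected layer, the contribution can be read off directly from its weight matrix; a sketch assuming the classifier is a PyTorch nn.Linear layer and t is one activation vector:

    import torch

    def concept_contributions(fc: torch.nn.Linear, t: torch.Tensor) -> torch.Tensor:
        # The logit of class w is sum_k t[k] * fc.weight[w, k] (+ bias),
        # so t[k] * fc.weight[w, k] is the contribution of concept k to class w.
        return t.unsqueeze(0) * fc.weight  # shape: (num_classes, num_concepts)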
 また、自然画像に対してもMNISTと同様に、ある概念を取り除くことで寄与を定性的に確認し、対応する教師あり自己学習タスクの出力の変化を見ることができる。 Also, for natural images, as with MNIST, it is possible to qualitatively confirm the contribution by removing certain concepts and observe the change in the output of the corresponding supervised self-learning task.
 図12は、CUB200のデータセットにおいて、各概念の重要度を示す図である。 Figure 12 shows the importance of each concept in the CUB200 dataset.
 図12に示すように、CUB200のデータセットでは、検索された上位8つの検索結果を用いて、各概念の貢献度が示される。たとえば、概念1(Cpt.1(黄色い頭に相当))を無効化すると、検索結果にはより多くの黒い頭の鳥の画像が現れるようになる。 As shown in Figure 12, in the CUB200 dataset, the contribution of each concept is shown using the top 8 search results. For example, disabling concept 1 (Cpt.1 (corresponding to yellow head)) results in more images of black-headed birds appearing in the search results.
 もっとも、復元タスクと比較すると、検索タスクは出力(類似度の高いサンプル)が複数の概念で決定され、一つの概念の変更は全体の類似度にあまり影響しないため、よりロバストであると言える。 However, compared to the restoration task, the search task is more robust because the output (highly similar samples) is determined by multiple concepts, and changing one concept does not significantly affect the overall similarity.
 図12の下部には、検索されたサンプルのうち、グラウンドトゥルース(ground truth)クラスのサンプルの割合を示しており、この検索タスクにおける各概念の重要度を測ることができる。 The bottom of Figure 12 shows the percentage of samples in the ground truth class among the retrieved samples, allowing us to measure the importance of each concept in this search task.
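A sketch of this retrieval experiment; the dot-product similarity over activation vectors is an assumption for illustration:

    import numpy as np

    def retrieve(query_t, gallery_t, top=8, disabled=None):
        # gallery_t: (N, k) activation vectors; disabling zeroes one concept
        # in the query before matching, as in Fig. 12.
        q = query_t.copy()
        if disabled is not None:
            q[disabled] = 0.0
        scores = gallery_t @ q
        return np.argsort(scores)[::-1][:top]

    def ground_truth_ratio(indices, labels, gt):
        # Fraction of retrieved samples belonging to the ground-truth class.
        return float(np.mean(labels[indices] == gt))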
 なお、概念5(Cpt.5)を無効化しても検索結果にほとんど変化がないのは、概念5が全ての鳥類クラスで有効な共通概念であるためと考えられる。 Note that deactivating concept 5 (Cpt.5) hardly changes the search results; this is likely because concept 5 is a common concept that is active across all bird classes.
 (アブレーション試験) (Ablation study)
 以下では、概念kの個数の設定と各損失項の重みの影響を検討した。また、分類精度の他に、各概念の上位100個の活性化サンプルに対して式(7)、(9)で計算した損失の元となる「整合性」「識別性」について評価を行った。ハイパーパラメータは、探索するものを除き、デフォルトの値を使用した。 Below, we examine the influence of the number of concepts k and of the weight of each loss term. In addition to classification accuracy, we evaluated the "consistency" and "distinctiveness" underlying the losses computed by equations (7) and (9), over the top 100 activated samples of each concept. Default values were used for all hyperparameters except the one being searched.
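The exact metrics follow equations (7) and (9) of the description (not reproduced in this extract); as an illustrative stand-in, the sketch below scores consistency as the mean pairwise similarity of the features a concept attends to (higher is better) and distinctiveness as the mean pairwise similarity between concept centroids (lower is better):

    import numpy as np

    def _cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    def consistency(feats_per_concept):
        # feats_per_concept: list of (P, d) arrays, the attended features of
        # each concept's top-P activated samples (P = 100 in the text).
        scores = []
        for f in feats_per_concept:
            n = len(f)
            pairs = [_cos(f[i], f[j]) for i in range(n) for j in range(i + 1, n)]
            scores.append(np.mean(pairs))
        return float(np.mean(scores))

    def distinctiveness(centroids):
        # centroids: (k, d) mean attended feature of each concept.
        k = len(centroids)
        pairs = [_cos(centroids[i], centroids[j])
                 for i in range(k) for j in range(i + 1, k)]
        return float(np.mean(pairs))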
 (概念kの影響) (Influence of the number of concepts k)
 図13は、各ハイパーパラメータの大きさと正解率(Accuracy:丸)、整合性(individual consistency:四角:高くなるほど望ましい)、識別性(mutual distinctiveness:三角:低くなるほど望ましい)を示す図である。 Figure 13 shows the magnitude of each hyperparameter and the accuracy rate (Accuracy: circles), individual consistency (squares: the higher the better), and mutual distinctiveness (triangles: the lower the better).
 図13に示すように、一般に、概念数kが大きいほど正解率が高く、整合性と識別性の両方に正の影響を与える。 As shown in Figure 13, a larger number of concepts k generally yields higher accuracy and positively affects both consistency and distinctiveness.
 しかし、MNISTではkが10以上、CUB200では200以上で識別性の値が増加(すなわち劣化)しており、これは意味のない概念が学習されたか、複数の概念が同じ視覚要素を表現していることを示唆している。 However, the distinctiveness value increases (i.e., deteriorates) for k above 10 on MNIST and above 200 on CUB200, suggesting that meaningless concepts were learned or that multiple concepts represent the same visual elements.
 このように、概念の数はデータセットごとに調整することが望ましい。 Thus, it is desirable to tune the number of concepts for each dataset.
 (量子化損失の重みパラメータλquaの影響) (Influence of the quantization loss weight parameter λqua)
 パラメータλquaを小さく(ただしゼロではない値に)すると、CUB200で学習するモデルの性能が向上し、大きくすると3つのメトリクス全てに悪影響があった。 A small (but nonzero) λqua improved the performance of the model trained on CUB200, while large values negatively affected all three metrics.
 パラメータλquaは、概念活性度tをどの程度バイナリ値に近づけるかを制御する。適切な値は、活性化を正則化し、曖昧な概念の発生を防ぐことができる。しかし、極端な値を設定すると勾配が消失してしまい、学習がうまくいかなくなる可能性がある。デフォルトのλqua=0.1程度が、今回の実験の範囲では最適な値であった。 The parameter λqua controls how closely the concept activations t approach binary values. An appropriate value regularizes the activations and prevents some ambiguous concepts. However, an extreme value can make the gradients vanish, so learning may fail. The default λqua of about 0.1 was the best value within the range of this experiment.
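The exact form of the quantization loss is defined earlier in the description; as an illustrative stand-in with the same effect, the sketch below penalizes t(1 - t), which vanishes only at binary activations:

    import torch

    def quantization_loss(t: torch.Tensor) -> torch.Tensor:
        # Pulls each activation toward 0 or 1; weighted by lambda_qua
        # when combined with the other loss terms.
        return (t * (1.0 - t)).mean()

    # total_loss = l_cls + 0.1 * quantization_loss(t) + ...  (lambda_qua = 0.1 default)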
 (λconとλdisの影響) (Influence of λcon and λdis)
 MNISTでは、個々の概念の整合性損失と相互の識別性損失はほとんど性能に影響を与えないことが分かった。その理由として、手書き数字の外観のばらつきが少ないことが考えられる。つまり、少なくとも整合性は常に高いということである。また、課題自体が簡単なため、正解率が飽和している可能性もある。 On MNIST, we found that the individual consistency loss and the mutual distinctiveness loss have almost no effect on performance. A likely reason is that handwritten digits vary little in appearance, so consistency is always high to begin with. In addition, because the task itself is easy, the accuracy may have saturated.
 一方で、CUB200では、2つの損失が設計通りに機能した。λconの増加に伴い、整合性は継続的に向上した。識別性はλcon=1までわずかに減少し、その後わずかに増加した。λdisを増加させると、識別性の値は減少し続け(性能としては向上し)、整合性はわずかに改善された。 On CUB200, in contrast, the two losses worked as designed. As λcon increased, consistency improved continuously. Distinctiveness decreased slightly up to λcon = 1 and then increased slightly. As λdis increased, the distinctiveness value kept decreasing (an improvement in performance), and consistency improved slightly.
 (λRの影響) (Influence of λR)
 教師あり自己学習損失は分類精度に明らかな影響を与えないが、その重みを増やすことで整合性と識別性の両方を改善できることがわかる。 The supervised self-learning loss has no obvious effect on classification accuracy, but increasing its weight improves both consistency and distinctiveness.
 以上説明したように、本実施の形態の画像分類学習装置1000は、分類課題に対する学習を通じて、分類に使用した特徴を人間が理解可能な「概念」として学習することができる。 As described above, the image classification learning device 1000 of this embodiment can learn the features used in classification as "concepts" that humans can understand through learning on classification tasks.
 また、画像分類学習装置1000は学習された概念だけでなく、その判断に対する解釈可能性を提供できる。 In addition, the image classification learning device 1000 provides not only the learned concepts but also interpretability for its decisions.
 [実施の形態2] [Embodiment 2]
 以下では、実施の形態1で説明した「画像分類学習装置および分類器」を、コンクリート建造物におけるコンクリートの健全度の判定に用いる例について説明する。 Below, we will explain an example of using the "image classification learning device and classifier" described in the first embodiment to determine the soundness of concrete in a concrete structure.
 コンクリート構造物の点検では、コンクリート壁面のひび割れ発生状況などに基づいて、構造物の部分的な損傷度を判定している。 When inspecting concrete structures, the degree of partial damage to the structure is determined based on factors such as the occurrence of cracks on the concrete wall surface.
 図14は、実施の形態2のコンクリート健全度分類装置の動作を説明するための概念図である。 FIG. 14 is a conceptual diagram for explaining the operation of the concrete soundness classification device of embodiment 2.
 実施の形態1でも説明したように、コンクリート構造物の健全度の分類処理についても、一体のコンピュータ装置内で学習処理や分類処理が実行されるとの構成とすることもできる。 As explained in the first embodiment, the classification of the soundness of concrete structures can likewise be configured so that the learning process and the classification process are executed within a single, integrated computer device.
 ただし、以下では、後述するように、実施の形態1で説明した方法で生成された「学習済みモデル(分類器)」による健全度の判定処理は、サーバー1010(図示せず)により実行されるものとする。 However, in the following, as described below, the process of determining the health level using the "trained model (classifier)" generated by the method described in embodiment 1 is assumed to be executed by server 1010 (not shown).
 図14を参照して、画像データの入力に対して、あらかじめ、演算装置6040により実行される学習処理においては、画像データと当該画像データに関連付けられた識別指標(この場合は、画像データに対する正解データとしての健全度)とで、学習処理が実行され、人工知能の学習済モデルが生成される。 Referring to FIG. 14, in the learning process executed in advance by the computing device 6040, the learning is performed on image data together with a discrimination index associated with that image data (here, the soundness level serving as the ground truth for the image data), and a trained artificial-intelligence model is generated.
 そこで、このようにして、人工知能の学習済モデルが生成された後に、コンクリート構造物の表面を撮影した画像データが、サーバー1010に送信されるとする。 After the artificial intelligence trained model is generated in this manner, image data capturing the surface of the concrete structure is sent to server 1010.
 サーバー1010では、人工知能の学習済モデルによって、分類処理が実行され、健全度(たとえば、健全度III)との出力がされる。同時に、サーバー1010での分類処理において使用された「概念」に相当する領域について、枠などで、人間が視認できるように表示がされる。 In server 1010, a classification process is performed using a trained model of artificial intelligence, and a health level (for example, health level III) is output. At the same time, the area corresponding to the "concept" used in the classification process in server 1010 is displayed in a frame or the like so that it can be visually recognized by humans.
 たとえば、専門技能者が、このような分類処理がされた画像を視認することで、人工知能の分類結果だけでなく、画像中のどの領域に注意が向けられた結果、健全度の判定がされたのかを理解できる。そのうえで、専門技能者は、分類処理において、画像中の注意が向けられた領域に基づいて、対処法を判断することができる。 For example, a skilled professional can visually view an image that has been classified in this way and understand not only the classification results of the artificial intelligence, but also which areas of the image attention was focused on to determine the healthiness of the image. The skilled professional can then determine how to respond based on the area of the image that attention was focused on during the classification process.
 特に限定されないが、コンクリート構造物の健全度については、非特許文献9の国土交通省「橋梁定期点検要領」に開示されるように、健全度I~健全度IVの4段階とすることができる。 Although not limited to this, the soundness of concrete structures can be classified into four levels, from soundness I to soundness IV, as disclosed in the Ministry of Land, Infrastructure, Transport and Tourism's "Guidelines for Periodic Bridge Inspection" (Non-Patent Document 9).
 ここで、非特許文献9の各健全度の分類は、道路、橋梁等を例示として、以下の通りである。 Here, the classification of each soundness level in Non-Patent Document 9 is as follows, taking roads, bridges, etc. as examples.
 健全度I : 健全    : 道路橋の機能に支障が生じていない状態。  Soundness level I: Sound: The road bridge's functionality is not impaired.
 健全度II: 予防保全段階: 道路橋の機能に支障が生じていないが，予防保全の観点から措置を講ずることが望ましい状態。 Soundness level II: Preventive maintenance stage: The road bridge's functionality is not impaired, but it is desirable to take preventive maintenance measures.
 健全度III: 早期措置段階: 道路橋の機能に支障が生じる可能性があり，早期に措置を講ずべき状態。 Soundness level III: Early action stage: The road bridge's functionality may be impaired, and early action should be taken.
 健全度IV: 緊急措置段階: 道路橋の機能に支障が生じている，又は生じる可能性が著しく高く，緊急に措置を講ずべき状態。 Soundness level IV: Emergency action stage: The road bridge's functionality is impaired, or there is an extremely high possibility that it will be, and emergency action is required.
 また、図14の例では、健全度IIIであることに応じて、専門技能者は、「この部分はひび割れなので樹脂を注入して補修をしよう」との判断を下している。 In the example of FIG. 14, in response to the soundness level being III, the expert technician decides, "This part is a crack, so let's repair it by injecting resin."
 図15は、図14のような人工知能の学習済モデルを生成するための学習データの構成を示す概念図である。 FIG. 15 is a conceptual diagram showing the configuration of training data for generating a trained artificial intelligence model like that shown in FIG. 14.
 図15に示すように、健全度ラベルとして、健全度I~健全度IVにそれぞれ対応して、画像データが準備されている。図15では、例示として、健全度IIIの場合を示す。他の健全度についても、同様の画像が準備されているものとする。 As shown in Figure 15, image data is prepared as health level labels corresponding to health levels I to IV. Figure 15 shows health level III as an example. Similar images are assumed to be prepared for the other health levels.
 なお、状況によっては、「健全」とされる画像については、特に、学習データとせず、単に、「健全ではない」とのラベル(健全度II~IV)の画像を、学習データとして学習させることも可能である。 Depending on the situation, it may be possible not to use images that are deemed "healthy" as training data, but to simply use images labeled "unhealthy" (health levels II to IV) as training data.
 図16は、コンクリートの健全度の判定のためのシステム構成の一例を示す図である。 Figure 16 shows an example of a system configuration for determining the soundness of concrete.
 図16を参照して、図14において説明した通り、検査者端末500.1で、コンクリート構造物の表面の画像を撮影して、サーバー1010に送信する。 Referring to FIG. 16, as described in FIG. 14, an image of the surface of the concrete structure is captured by the inspector terminal 500.1 and transmitted to the server 1010.
 このとき、検査者端末500.1からは、たとえば、構造物の位置情報(たとえば、測位手段により獲得される緯度・経度の情報)と、検査者により入力される構造物名のデータとが、画像データとともに、サーバー1010に送信される。 At this time, the inspector terminal 500.1 transmits, for example, the structure's location information (e.g., latitude and longitude information obtained by a positioning means) and the structure name data entered by the inspector, together with the image data, to the server 1010.
 サーバー1010では、健全度の判定結果の情報と、検査者端末500.1から送信されてきた画像に対して、注意した領域を示す情報(枠などでのマーキングを施したデータ)を、検査者端末500.1に対して返信する。 The server 1010 returns to the inspector terminal 500.1 information on the healthiness assessment result and information indicating areas that were noted in relation to the image sent from the inspector terminal 500.1 (data marked with a frame or the like).
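As a concrete illustration of this exchange, hypothetical request/response payloads are sketched below; every field name is an assumption, since the description does not prescribe a data format:

    request = {
        "structure_name": "Bridge A, pier 3",     # entered by the inspector
        "latitude": 34.70, "longitude": 135.50,   # from the positioning unit
        "image": "<JPEG bytes, base64-encoded>",
    }
    response = {
        "soundness": "III",                        # one of I to IV
        "attended_regions": [                      # concept regions to overlay
            {"concept": 1, "box": [120, 40, 260, 180]},
        ],
    }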
 なお、サーバー1010での分類・判定処理に使用するための人工知能の学習済みモデルは、たとえば、別のサーバー1020において、学習処理がされた後に、サーバー1010に送信され、格納されて動作する構成とすることができる。 In addition, the trained artificial intelligence model to be used for classification and judgment processing in server 1010 can be configured to undergo training processing, for example, in another server 1020, and then be transmitted to server 1010, stored, and operated.
 ここで、学習処理を担当するサーバー1020は、複数の端末500.2~500.n(n:自然数)から、他のコンクリート構造物からの「位置データ」「画像データ」「構造物名」などのデータを収集する。このとき、たとえば、サーバー1020に収集されたデータについては、各端末500.2~500.nを操作するのが、専門技能者である場合は、この端末での画像の撮像の際に、画像データに対して正解データとして健全度を関連付けて、サーバー1020に送信する構成とすることができる。このようにして収集したデータから、図15で説明したような学習データを生成することができる。 Here, server 1020, which is in charge of the learning process, collects data such as "position data," "image data," and "structure name" from other concrete structures from multiple terminals 500.2 to 500.n (n: natural number). At this time, for example, if each terminal 500.2 to 500.n is operated by a skilled professional, the data collected by server 1020 can be configured so that when an image is captured on the terminal, the soundness is associated with the image data as correct answer data and transmitted to server 1020. Learning data such as that described in FIG. 15 can be generated from the data collected in this way.
 あるいは、端末500.2~500.nから、「位置データ」「画像データ」「構造物名」のみが送信されてきたときには、サーバー側で専門技能者が、正解データとして健全度を関連付ける処理を行ってもよい。 Alternatively, when only "location data," "image data," and "structure name" are sent from terminals 500.2 to 500.n, a specialized technician on the server side may perform a process to associate the soundness with the correct data.
 サーバー1020では、このようにして逐次、累積される学習データを用いることで、人工知能の学習済みモデルを再学習させて、分類性能の向上を図ることが可能となる。 In this way, server 1020 can use the accumulated learning data to retrain the AI's trained model and improve classification performance.
 図17は、図16に示した端末500.1の構成を示す機能ブロック図である。 FIG. 17 is a functional block diagram showing the configuration of terminal 500.1 shown in FIG. 16.
 なお、端末500.2~500.nも、同様の構成を有するので、その説明は繰り返さない。 Note that terminals 500.2 to 500.n have a similar configuration, so their explanation will not be repeated.
 図17を参照して、本実施形態の端末500.1は、端末の通信動作や入出力動作を制御するための制御部5010と、無線LANおよび移動体通信を行うためベースバンド信号を生成して変復調回路・装置へ送出したり、受信したベースバンド信号から元のデータや信号を得る通信処理部5040と、静止画または動画を撮影するための撮像センサ5050と、撮像センサ5050からの信号を所定のフォーマットの電気信号に変換する画像取得部5060と、端末側での画像表示を制御するための表示制御部5070と、表示制御部5070に制御されて画像を表示する表示部5080と、端末500.1の位置を測位して取得する位置取得部5090と、外部からの情報の入力を受け付ける入力インタフェース部5100とを備える。 Referring to FIG. 17, the terminal 500.1 of this embodiment includes a control unit 5010 for controlling the communication operation and input/output operation of the terminal, a communication processing unit 5040 for generating baseband signals for wireless LAN and mobile communication and sending them to a modulation/demodulation circuit/device, and for obtaining original data or signals from received baseband signals, an imaging sensor 5050 for capturing still images or videos, an image acquisition unit 5060 for converting signals from the imaging sensor 5050 into electrical signals in a predetermined format, a display control unit 5070 for controlling image display on the terminal side, a display unit 5080 for displaying images under the control of the display control unit 5070, a position acquisition unit 5090 for measuring and acquiring the position of the terminal 500.1, and an input interface unit 5100 for receiving input of information from the outside.
 特に、限定されないが、撮像センサ5050としては、レンズとCCD(Charge-Coupled Device)センサが一体となったモジュールや、レンズとCMOSセンサとが一体となったモジュールを使用することができる。 In particular, although not limited to, the imaging sensor 5050 may be a module that combines a lens and a CCD (Charge-Coupled Device) sensor, or a module that combines a lens and a CMOS sensor.
 また、位置取得部5090としては、屋外での測位手段として、GPS(Global Positioning System)を利用する測位装置のほか、準天頂衛星からの信号も利用する測位装置のほか、ビーコン信号その他を利用して屋内での測位を可能とする装置など、端末500.1の位置情報を取得することが可能な装置であれば、これらの装置に限定されない。 The location acquisition unit 5090 is not limited to a positioning device that uses GPS (Global Positioning System) as an outdoor positioning means, a positioning device that also uses signals from quasi-zenith satellites, a device that enables indoor positioning using beacon signals, etc., and may be any device capable of acquiring location information of the terminal 500.1.
 入力インタフェース部5100は、タッチパネルによる文字入力や音声入力の音声認識などを利用して、外部からの入力をテキストデータに変換する。 The input interface unit 5100 converts external input into text data using a touch panel or voice recognition of voice input.
 制御部5010は、画像取得部5060からの画像データと、位置取得部5090からの位置データと、入力インタフェース部5100からの構造物名のデータなどの情報を統合して、通信処理部5040からサーバー1010に向けて送信するための取得画像送信処理部5020と、通信処理部5040を介して受信したサーバー1010からの構造物名、健全度のデータ、画像と注目領域とを示すデータとから、表示制御部5070により、表示部5080に表示させる判定表を表す画像データを生成する判定表生成部5030とを含む。 The control unit 5010 includes an acquired image transmission processing unit 5020 that integrates information such as image data from the image acquisition unit 5060, position data from the position acquisition unit 5090, and data on the structure name from the input interface unit 5100, and transmits the integrated information from the communication processing unit 5040 to the server 1010, and a judgment table generation unit 5030 that generates image data representing a judgment table to be displayed on the display unit 5080 by the display control unit 5070 from the structure name, health data, and data indicating the image and area of interest received from the server 1010 via the communication processing unit 5040.
 図18は、図17に示した端末500.1のハードウェア構成を説明するためのブロック図である。 FIG. 18 is a block diagram for explaining the hardware configuration of terminal 500.1 shown in FIG. 17.
 制御部5010に相当して、MPU(Micro Processing Unit)またはCPU(Central Processing Unit)などの演算装置501、RAM502、ROM503等からなる記憶装置を備え、所定の基本OSやミドルウェア等のプログラムが実行されることにより、各部を制御したり、ソフトウェア構成上のネイティブプラットフォーム環境やアプリケーション実行環境を構築したりする。 The control unit 5010 is equipped with a calculation device 501 such as an MPU (Micro Processing Unit) or CPU (Central Processing Unit), a storage device consisting of a RAM 502, a ROM 503, etc., and controls each part and creates a native platform environment and application execution environment in the software configuration by executing a predetermined basic OS, middleware, etc.
 撮像装置505としては、上述のようなカメラモジュールが使用され、測位装置509としては、上述のようなGPSその他の測位装置が使用される。 As the imaging device 505, a camera module as described above is used, and as the positioning device 509, a GPS or other positioning device as described above is used.
 表示装置508としては、液晶パネルや有機ELパネルが使用され、操作装置510としては表示パネルと一体となったタッチパネルであってもよいし、音声認識装置であってもよい。 The display device 508 may be a liquid crystal panel or an organic EL panel, and the operation device 510 may be a touch panel integrated with the display panel, or a voice recognition device.
 制御部5010の記憶装置は、例えば、一時記憶装置としてのRAMや不揮発性記憶装置としてのフラッシュメモリなどの半導体メモリを含む。この不揮発性記憶装置は、各部での処理に用いられるドライバプログラム、オペレーティングシステムプログラム、アプリケーションプログラム、データ等を記憶する。 The storage device of the control unit 5010 includes, for example, semiconductor memory such as RAM as a temporary storage device and flash memory as a non-volatile storage device. This non-volatile storage device stores driver programs, operating system programs, application programs, data, etc. used for processing in each unit.
 例えば、不揮発性記憶装置は、ドライバプログラムとして、IEEE802.11規格の無線通信方式や移動体通信(セルラー通信)の無線通信方式を実行する通信ドライバプログラム、操作装置510を制御する入力デバイスドライバプログラム、表示装置508を制御する出力デバイスドライバプログラム等を記憶する。 For example, the non-volatile storage device stores driver programs such as a communication driver program that executes a wireless communication method conforming to the IEEE 802.11 standard or a wireless communication method for mobile communication (cellular communication), an input device driver program that controls the operation device 510, and an output device driver program that controls the display device 508.
 また、不揮発性記憶装置は、オペレーティングシステムプログラムとして、例えば、Android(登録商標)OS、iOS(登録商標)等の基本OSや、IEEE802.11規格の無線通信方式や移動体通信(セルラー通信)の無線通信方式での認証等を行う接続制御プログラム等を記憶する。 The non-volatile storage device also stores operating system programs, such as basic OSs such as Android (registered trademark) OS and iOS (registered trademark), and connection control programs that perform authentication in wireless communication methods such as the IEEE 802.11 standard and wireless communication methods for mobile communication (cellular communication).
 通信インタフェース504は、無線LAN通信およびセルラー方式の移動体通信ネットワークの基地局(図示せず)を介して通信する移動体通信を実行するための機能を有する。 The communication interface 504 has the functionality to perform wireless LAN communication and mobile communication via a base station (not shown) of a cellular mobile communication network.
 以上説明したように、実施の形態2の画像分類学習装置1000および分類装置4000によれば、または、実施の形態2の学習プログラム、分類プログラムによれば、入力されたコンクリート構造物の表面画像データに基づいて、コンクリートの健全度の判定に関する情報を検査者側の端末で得ることができるとともに、人工知能の学習済みモデルが分類に使用した画像領域を確認することができる。 As described above, according to the image classification learning device 1000 and classification device 4000 of embodiment 2, or according to the learning program and classification program of embodiment 2, information relating to judging the soundness of concrete can be obtained on the inspector's terminal based on the input surface image data of a concrete structure, and the image area used for classification by the trained model of artificial intelligence can be confirmed.
 その結果、人工知能による健全度の判断を利用して、あるいは、補助情報として、専門技能者が、対処方法を判断することが容易となる。 As a result, it will be easier for specialists to determine how to respond by using the artificial intelligence's assessment of the health level, or as supplementary information.
 なお、特に限定されないが、実施の形態1と同様に、「学習済みモデル(分類器)」が、プログラムとして、または、プログラムの一部として、コンピュータ読取り可能な記録媒体に記録されて、他のコンピュータにインストールされることがあってもよい。 Although not limited thereto, as in the first embodiment, the "trained model (classifier)" may be recorded, as a program or as part of a program, on a computer-readable recording medium and installed on another computer.
 [実施の形態3] [Embodiment 3]
 実施の形態2では、図15に示したような学習データにより、図1で説明したような画像分類学習装置1000が学習処理を実行することで、コンクリート構造物の健全度の分類だけでなく、どのような特徴部分(特徴領域)に注目して、人工知能が健全度を判断したのかを、人間が判断可能となる構成について説明した。 In the second embodiment, an image classification learning device 1000 as described in FIG. 1 executes a learning process using learning data as shown in FIG. 15, thereby enabling a human to not only classify the soundness of a concrete structure, but also to determine which characteristic parts (characteristic areas) the artificial intelligence focused on in determining the soundness.
 実施の形態3では、さらに、画像分類学習装置1000が学習処理を実行することで、画像データを入力として、対応するコンクリート構造物の健全度を分類するだけでなく、そのようなコンクリート構造物に対する対処方法も出力するような画像分類学習装置1000および分類装置4000の構成について説明する。 In the third embodiment, the configuration of the image classification learning device 1000 and the classification device 4000 is further described, in which the image classification learning device 1000 executes a learning process so that, using image data as input, it not only classifies the soundness of the corresponding concrete structure, but also outputs a method of dealing with such a concrete structure.
 図19は、画像データ、画像に対応する健全度のラベルおよび対処措置のラベルのデータによる学習データの構成を説明するための図である。 Figure 19 is a diagram for explaining the composition of learning data that includes image data, health level labels corresponding to the images, and corrective action labels.
 図19に示すように、画像データ、画像に対応する健全度のラベルおよび対処措置のラベルのデータが集積されると、このようなデータを学習データとして利用することが可能となる。 As shown in Figure 19, when image data, health level labels corresponding to the images, and corrective action labels are collected, such data can be used as learning data.
 もっとも、実施の形態2のようなシステムを利用することで、コンクリート構造物についての画像データや、それに対応する健全度の情報が収集され、さらに、その画像データにより表現されるコンクリート構造物に対する対処方法についてのデータが集積されることになる。 However, by using a system like that of embodiment 2, image data about concrete structures and corresponding information about their soundness can be collected, and data on how to deal with the concrete structures represented by the image data can also be accumulated.
 図19では、健全度IIIの場合を例示しているものの、他の健全度についても、同様なデータが学習データとして準備されているものとする。 In Figure 19, the case of health level III is shown as an example, but similar data is also prepared as learning data for other health levels.
 図20は、実施の形態3の画像分類学習装置1000および分類装置4000の構成を説明するための機能ブロック図である。 FIG. 20 is a functional block diagram for explaining the configuration of the image classification learning device 1000 and classification device 4000 according to the third embodiment.
 図20を参照して、実施の形態3の画像分類学習装置1000においては、図1に示した構成に加えて、分類器400だけではなく、措置判別器410にも、概念活性度tが入力され、さらに、措置判別器410には、図19で示したような対応措置を示す対応措置ラベルも入力されて学習処理が実行されるものとする。 Referring to FIG. 20, in the image classification learning device 1000 of embodiment 3, in addition to the configuration shown in FIG. 1, the concept activity t is input not only to the classifier 400 but also to the action discriminator 410, and furthermore, a response action label indicating a response action as shown in FIG. 19 is also input to the action discriminator 410 to execute the learning process.
 学習処理制御部700は、措置判別器410から出力される対応措置が教師データと一致するように、学習処理を実行する。 The learning process control unit 700 executes the learning process so that the response measures output from the action discriminator 410 match the teacher data.
 そして、分類装置4000においては、図6に示した構成に加えて、このようにして学習済みとなった措置判別器410が、対応措置を出力する。 In the classification device 4000, in addition to the configuration shown in FIG. 6, the action discriminator 410 that has been trained in this way outputs a response action.
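A minimal sketch of such an action discriminator, assuming (as an illustration only) a single linear head on the concept activation vector t, trained with cross entropy against the countermeasure labels of Fig. 19:

    import torch
    import torch.nn as nn

    class ActionDiscriminator(nn.Module):
        # Maps the concept activation vector t to countermeasure logits.
        def __init__(self, num_concepts: int, num_actions: int):
            super().__init__()
            self.fc = nn.Linear(num_concepts, num_actions)

        def forward(self, t: torch.Tensor) -> torch.Tensor:
            return self.fc(t)

    # criterion = nn.CrossEntropyLoss()
    # loss = criterion(discriminator(t), action_label)  # matches the teacher data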
 したがって、実施の形態3の画像分類学習装置1000および分類装置4000によれば、または、実施の形態3の学習プログラム、分類プログラムによれば、入力されたコンクリート構造物の表面画像データに基づいて、コンクリートの健全度の判定に関する情報を得ることができるとともに、人工知能の学習済みモデルが分類に使用した画像領域と、さらに、対応措置に関する情報も、人間、特に、専門技能者が確認することができる。 Therefore, according to the image classification learning device 1000 and classification device 4000 of embodiment 3, or according to the learning program and classification program of embodiment 3, it is possible to obtain information relating to the assessment of the soundness of concrete based on the input surface image data of a concrete structure, and a human, in particular a skilled professional, can confirm the image area used for classification by the trained model of artificial intelligence, as well as information relating to countermeasures.
 その結果、人工知能による健全度の判断の情報および対応措置に関する情報を利用して、あるいは、補助情報として、専門技能者が、対処方法を判断することが容易となる。 As a result, it becomes easy for a skilled technician to determine how to respond, using the artificial intelligence's soundness assessment and the countermeasure information, or treating them as supplementary information.
 今回開示された実施の形態は、本発明を具体的に実施するための構成の例示であって、本発明の技術的範囲を制限するものではない。本発明の技術的範囲は、実施の形態の説明ではなく、特許請求の範囲によって示されるものであり、特許請求の範囲の文言上の範囲および均等の意味の範囲内での変更が含まれることが意図される。 The embodiments disclosed herein are illustrative of configurations for specifically implementing the present invention, and do not limit the technical scope of the present invention. The technical scope of the present invention is indicated by the claims, not by the description of the embodiments, and is intended to include modifications within the literal scope of the claims and within the scope of equivalent meanings.
 100 バックボーンCNN、200 概念学習器、300 概念正則化部、400 分類器、500 量子化誤差算出部、600 損失算出部、700 学習処理制御部、2002 位置情報エンコード処理部、2004,2020 整形処理部、2006,2008 非線形処理部、2010 類似度算出部、2012 正規化部、2030 概念生起度算出部、3010 識別損失算出部、3020 再構成ベース損失算出部、3030 検索ベース損失算出部、4000 画像分類装置、4100 概念プロトタイプ記憶部。 100 Backbone CNN, 200 Concept learner, 300 Concept regularizer, 400 Classifier, 500 Quantization error calculator, 600 Loss calculator, 700 Learning process controller, 2002 Position information encoding processor, 2004, 2020 Shaping processor, 2006, 2008 Nonlinear processor, 2010 Similarity calculator, 2012 Normalizer, 2030 Concept occurrence calculator, 3010 Classification loss calculator, 3020 Reconstruction-based loss calculator, 3030 Search-based loss calculator, 4000 Image classification device, 4100 Concept prototype memory.

Claims (14)

  1.  画像分類学習装置であって、
     複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データを格納するための記憶装置と、
     前記記憶装置に格納された前記学習データを読み出して、前記画像データを分類するための前記画像データ中の複数のコンセプトを機械学習する処理を実行するための演算処理手段とを備え、前記演算処理手段は、
      前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成し前記記憶装置に格納する画像識別手段と、
      前記複数のコンセプトの各々に対応し、前記画像識別手段の識別処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換し前記記憶装置に格納する注意機構処理手段と、
      前記画像識別手段の識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出する損失評価手段と、
      前記損失を減少させるように、前記記憶装置に格納される前記分類モデルと前記概念行列とに対する機械学習を実行する学習処理手段とを含む、画像分類学習装置。
    An image classification learning device, comprising:
    A storage device for storing training data including a plurality of image data and image labels corresponding to the image data;
    and a calculation processing means for reading out the learning data stored in the storage device and executing a process of machine learning a plurality of concepts in the image data for classifying the image data, the calculation processing means comprising:
    an image classification means for extracting a set of features expressing the image data, learning and generating a classification model for identifying and classifying the image label for the image data based on the extracted set of features, and storing the classification model in the storage device;
    an attention mechanism processing means for converting a slot vector in a concept matrix, the slot vector corresponding to each of the plurality of concepts and defining an image region in which the feature value that is emphasized in the classification process of the image classification means appears, in accordance with an image feature defined by the slot vector, and storing the slot vector in the storage device;
    a loss evaluation means for calculating a loss based on a classification loss calculated by evaluating a classification rate of the image classification means and decreasing as the classification rate increases, and a separation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    and a learning processing means for performing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  2.  前記注意機構処理手段は、
      前記概念行列との類似度に応じて、前記特徴量の組において前記画像識別手段の識別処理において注意が向けられる前記画像領域を抽出するための注意行列を学習する注意行列学習手段を含み、
     前記画像識別手段は、
      前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを生成する概念生起度算出手段と、
      前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行する分類器とを含む、請求項1記載の画像分類学習装置。
    The attention mechanism processing means includes:
    an attention matrix learning means for learning an attention matrix for extracting the image region to which attention is directed in the classification process of the image classification means in the set of feature amounts according to a similarity to the concept matrix;
    The image identification means
    a concept occurrence calculation means for generating an activity vector having elements representing the degree to which each of the concepts corresponding to the slot vector appears in the image data based on the attention matrix;
    and a classifier that receives the activation vector corresponding to the image data as an input and performs classification for the image label.
  3.  前記画像データを表現する前記特徴量の組は、畳み込みニューラルネットワークの画像認識モデルから出力される特徴マップである、請求項1または2記載の画像分類学習装置。 The image classification learning device according to claim 1 or 2, wherein the set of features representing the image data is a feature map output from an image recognition model of a convolutional neural network.
  4.  前記分離化損失は、単一の前記コンセプトが前記特徴空間においてより小さな体積を占めるほど減少する損失である整合性損失と、前記コンセプトのペアは前記特徴空間において同じ領域を占める確率がより低くなるほど減少する損失である識別損失とを含む、請求項1または2記載の画像分類学習装置。 The image classification learning device according to claim 1 or 2, wherein the separation loss includes a consistency loss, which is a loss that decreases as a single concept occupies a smaller volume in the feature space, and an identification loss, which is a loss that decreases as the probability that a pair of concepts occupies the same region in the feature space decreases.
  5.  前記画像データは、カメラにより撮像された複数のコンクリート構造物の表面の画像のデータであり、
     前記画像ラベルは、前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示すラベルである、請求項1記載の画像分類学習装置。
    The image data is data of images of the surfaces of a plurality of concrete structures captured by a camera,
    The image classification learning device according to claim 1 , wherein the image labels are labels indicating soundness of the concrete structures corresponding to the respective image data.
  6.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトをコンピュータにより機械学習する画像分類学習方法であって、前記コンピュータは、前記学習データを格納する記憶装置と、機械学習の処理を実行するための演算装置とを含み、
     前記演算装置が、前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成するステップと、
     前記演算装置が、前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記演算装置が、前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記演算装置が、前記損失を減少させるように、前記分類モデルと前記概念行列とを学習させるステップとを備える、画像分類学習方法。
    An image classification learning method for machine learning a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels respectively corresponding to the image data, by a computer, the computer including a storage device for storing the learning data and a calculation device for executing a machine learning process;
    The computing device extracts a set of features that represent the image data, and learns and generates a classification model that identifies and classifies the image label for the image data based on the extracted set of features;
    The calculation device converts the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature value that is emphasized in the classification process appears, according to the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by an evaluation of a classification rate in classification of the image data, the classification loss decreasing as the classification rate increases, and a segregation loss calculated by an evaluation of a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space, the segregation loss decreasing as the degree of separation increases;
    The image classification learning method comprises a step in which the computing device learns the classification model and the concept matrix so as to reduce the loss.
  7.  前記画像データは、カメラにより撮像された複数のコンクリート構造物の表面の画像のデータであり、
     前記画像ラベルは、前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示すラベルである、請求項6記載の画像分類学習方法。
    The image data is data of images of the surfaces of a plurality of concrete structures captured by a camera,
    The image classification learning method according to claim 6 , wherein the image labels are labels indicating soundness of the concrete structures corresponding to the respective image data.
  8.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトをコンピュータにより機械学習する画像分類学習プログラムであって、前記コンピュータは演算装置と記憶装置とを含み、
     前記記憶装置に記憶された前記画像データについて、前記演算装置が、前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成するステップと、
     前記演算装置が、前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記演算装置が、前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記演算装置が、前記損失を減少させるように、前記分類モデルと前記概念行列とを学習させるステップとを備える、画像分類学習プログラム。
    An image classification learning program for performing machine learning of a plurality of concepts in image data for classifying the image data, based on learning data including a plurality of image data and image labels corresponding to the image data, by a computer, the computer including a calculation device and a storage device,
    The computing device extracts a set of features representing the image data stored in the storage device, and learns and generates a classification model that identifies and classifies the image label for the image data based on the extracted set of features;
    The calculation device converts the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature value that is emphasized in the classification process appears, according to the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by an evaluation of a classification rate in classification of the image data, the classification loss decreasing as the classification rate increases, and a segregation loss calculated by an evaluation of a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space, the segregation loss decreasing as the degree of separation increases;
    The image classification learning program includes a step in which the computing device learns the classification model and the concept matrix so as to reduce the loss.
  9.  請求項8に記載の前記画像分類学習プログラムを格納する、コンピュータ読取り可能な非一時的な記録媒体。 A non-transitory computer-readable recording medium storing the image classification learning program described in claim 8.
  10.  複数の画像データと前記画像データにそれぞれ対応する画像ラベルとを含む学習データに基づいて、前記画像データを分類するための前記画像データ中の複数のコンセプトを機械学習する画像分類学習方法によって生成される画像分類学習済モデルであって、
     前記画像分類学習済モデルは、
      前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを入力として、前記要素の共起関係に基づいて、前記画像データを分類する分類器モデルの構成を有し、
     前記画像分類学習済モデルは、
     前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する前記分類器モデルを学習により更新するステップと、
     前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記損失を減少させるように、前記モデルと前記概念行列とを学習させるステップとにより生成され、
     前記スロットベクトルを変換するステップは、
     前記概念行列との類似度に応じて、前記特徴量の組において前記識別の処理において注意が向けられる前記画像領域を抽出するための注意行列を学習するステップを含み、
     前記分類器モデルを学習により更新するステップは、
     前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする前記活性度ベクトルを生成するステップと、
     前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行するよう前記分類器モデルのパラメータを学習するステップとを含む、画像分類学習済モデル。
    An image classification trained model generated by an image classification training method for machine learning a plurality of concepts in image data for classifying the image data based on training data including a plurality of image data and image labels respectively corresponding to the image data,
    The image classification trained model is
    a classifier model configured to classify the image data based on a co-occurrence relationship of an activity vector having elements representing the degree to which each of the concepts appears in the image data,
    The image classification trained model is
    extracting a set of features representing the image data, and updating the classifier model by learning, the classifier model identifying and classifying the image label for the image data based on the extracted set of features;
    A step of converting the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature amount that is emphasized in the classification process appears, in accordance with the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in classification of the image data and decreasing as the classification rate increases, and a segregation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    training the model and the concept matrix so as to reduce the loss;
    The step of converting the slot vector includes:
    learning an attention matrix for extracting the image region to which attention is directed in the classification process in the set of features according to a similarity to the concept matrix;
    The step of updating the classifier model by learning includes:
    generating an activity vector based on the attention matrix, the activity vector being determined by the degree to which each of the concepts corresponding to the slot vector appears in the image data;
    and learning parameters of the classifier model to perform classification on the image label using the activation vector corresponding to the image data as input.
  11.  画像分類学習装置であって、
     カメラにより撮像された複数のコンクリート構造物の表面の画像データと前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示す画像ラベルとを含む学習データを格納するための記憶装置と、
     前記記憶装置に格納された前記学習データに基づいて、前記画像データを前記健全度について分類するための前記画像データ中の複数のコンセプトを機械学習する処理を実行するための演算装置とを備え、前記演算装置は、
      前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する分類モデルを学習して生成し前記記憶装置に格納する画像識別ステップと、
      前記複数のコンセプトの各々に対応し、前記分類モデルの識別処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換し前記記憶装置に格納する注意機構処理ステップと、
      前記分類モデルの識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出する損失評価ステップと、
      前記損失を減少させるように、前記記憶装置に格納される前記分類モデルと前記概念行列とに対する機械学習を実行する学習処理ステップとを実行する、画像分類学習装置。
    An image classification learning device, comprising:
    a storage device for storing learning data including image data of the surfaces of a plurality of concrete structures captured by a camera and image labels indicating the soundness of the concrete structures corresponding to the image data;
    and a calculation device for executing a process of machine learning a plurality of concepts in the image data for classifying the image data in terms of the health level based on the learning data stored in the storage device, the calculation device comprising:
    an image classification step of extracting a set of features expressing the image data, learning and generating a classification model that identifies and classifies the image label for the image data based on the extracted set of features, and storing the classification model in the storage device;
    an attention mechanism processing step of converting the slot vectors in a concept matrix, the slot vectors corresponding to each of the plurality of concepts and defining an image region in which the feature values that are emphasized in the classification model identification process appear, in accordance with the image features defined by the slot vectors, and storing the converted slot vectors in the storage device;
    a loss evaluation step of calculating a loss based on a classification loss calculated by evaluating a classification rate of the classification model and decreasing as the classification rate increases, and a separation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    and a learning processing step of performing machine learning on the classification model and the concept matrix stored in the storage device so as to reduce the loss.
  12.  前記注意機構処理ステップは、
      前記概念行列との類似度に応じて、前記特徴量の組において前記分類モデルの識別処理において注意が向けられる前記画像領域を抽出するための注意行列を学習する注意行列学習ステップを含み、
     前記画像識別ステップは、
      前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを生成する概念生起度算出ステップと、
       前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行する分類器を生成するステップとを含む、請求項11記載の画像分類学習装置。 The image classification learning device according to claim 11, further comprising: generating a classifier that performs classification on the image label using the activation vector corresponding to the image data as an input.
    The attention mechanism processing step includes:
    an attention matrix learning step of learning an attention matrix for extracting the image region to which attention is directed in a classification process of the classification model in the set of feature amounts according to a similarity to the concept matrix;
    The image identification step includes:
    A concept occurrence calculation step of generating an activity vector having an element representing the degree to which each of the concepts corresponding to the slot vector appears in the image data based on the attention matrix;
    The image classification learning device according to claim 11 , further comprising: generating a classifier that performs classification on the image label using the activation vector corresponding to the image data as an input.
  13.  前記学習処理ステップの処理は、前記活性度ベクトルと前記コンクリート構造物の表面の画像データに対応する修復対処処置の処置ラベルとを入力として、前記処置ラベルの判別を学習する処置判別モデルを生成するステップを含む、請求項11記載の画像分類学習装置。 The image classification learning device according to claim 11, wherein the learning process step includes a step of generating a treatment discrimination model that learns to discriminate between the treatment labels by inputting the activity vector and treatment labels of repair measures corresponding to the image data of the surface of the concrete structure.
  14.  複数のコンクリート構造物の表面の画像データと前記画像データにそれぞれ対応する前記コンクリート構造物の健全度を示す画像ラベルとを含む学習データに基づいて、前記画像データを前記健全度について分類するための前記画像データ中の複数のコンセプトを機械学習する画像分類学習方法によって生成される画像分類学習済モデルであって、
     前記画像分類学習済モデルは、
      前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする活性度ベクトルを入力として、前記要素の共起関係に基づいて、前記画像データを分類する分類器モデルの構成を有し、
     前記画像分類学習済モデルは、
     前記画像データを表現する特徴量の組を抽出し、抽出された前記特徴量の組に基づいて、前記画像データに対する前記画像ラベルを識別して分類する前記分類器モデルを学習により更新するステップと、
     前記複数のコンセプトの各々に対応し、前記識別の処理において重視される前記特徴量の出現する画像領域を規定するスロットベクトルからなる概念行列において、前記スロットベクトルで規定される画像特徴に応じて、前記スロットベクトルを変換するステップと、
     前記画像データの識別における識別率の評価により算出され、前記識別率が上昇するほど減少する識別損失と、前記複数のコンセプトに対応する前記特徴量が特徴空間において互いに分離する程度の評価により算出され、前記分離の程度が大きいほど減少する分離化損失とに基づいて、損失を算出するステップと、
     前記損失を減少させるように、前記モデルと前記概念行列とを学習させるステップとにより生成され、
     前記スロットベクトルを変換するステップは、
     前記概念行列との類似度に応じて、前記特徴量の組において前記識別の処理において注意が向けられる前記画像領域を抽出するための注意行列を学習するステップを含み、
     前記分類器モデルを学習により更新するステップは、
     前記注意行列に基づいて、前記スロットベクトルに対応する前記コンセプトのそれぞれが、前記画像データ中に出現する程度を要素とする前記活性度ベクトルを生成するステップと、
     前記画像データに対応する前記活性度ベクトルを入力として、前記画像ラベルについての分類を実行するよう前記分類器モデルのパラメータを学習するステップとを含む、画像分類学習済モデル。
    An image classification trained model generated by an image classification training method that machine-learns a plurality of concepts in image data for classifying the image data in terms of the soundness, based on training data including image data of the surfaces of a plurality of concrete structures and image labels indicating the soundness of the concrete structures that correspond to the image data,
    The image classification trained model is
    a classifier model configured to classify the image data based on a co-occurrence relationship of an activity vector having an element representing the degree to which each of the concepts appears in the image data,
    The image classification trained model is
    extracting a set of features representing the image data, and updating the classifier model by learning, the classifier model identifying and classifying the image label for the image data based on the extracted set of features;
    A step of converting the slot vector in a concept matrix including slot vectors that correspond to each of the plurality of concepts and define an image region in which the feature amount that is emphasized in the classification process appears, in accordance with the image feature defined by the slot vector;
    a step of calculating a loss based on a classification loss calculated by evaluating a classification rate in classification of the image data and decreasing as the classification rate increases, and a segregation loss calculated by evaluating a degree to which the features corresponding to the plurality of concepts are separated from each other in a feature space and decreasing as the degree of separation increases;
    training the model and the concept matrix so as to reduce the loss;
    The step of converting the slot vector includes:
    learning an attention matrix for extracting the image region to which attention is directed in the classification process in the set of features according to a similarity to the concept matrix;
    The step of updating the classifier model by learning includes:
    generating an activity vector based on the attention matrix, the activity vector being determined by the degree to which each of the concepts corresponding to the slot vector appears in the image data;
    and learning parameters of the classifier model to perform classification on the image label using the activation vector corresponding to the image data as input.
PCT/JP2023/037394 2022-10-18 2023-10-16 Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model WO2024085114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022167046 2022-10-18
JP2022-167046 2022-10-18

Publications (1)

Publication Number Publication Date
WO2024085114A1 true WO2024085114A1 (en) 2024-04-25

Family

ID=90737809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/037394 WO2024085114A1 (en) 2022-10-18 2023-10-16 Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model

Country Status (1)

Country Link
WO (1) WO2024085114A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021165888A (en) * 2020-04-06 2021-10-14 キヤノン株式会社 Information processing apparatus, information processing method of information processing apparatus, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GODAI FUJITA: "Slot Attention", 30 July 2021 (2021-07-30), XP093161797, Retrieved from the Internet <URL:https://qiita.com/fujitagodai4/items/7964a07561e6fe5cbbb3> *
JU HE, JIE-NENG CHEN, SHUAI LIU, ADAM KORTYLEWSKI, CHENG YANG, YUTONG BAI, CHANGHU WANG: "TransFG: A Transformer Architecture for Fine-Grained Recognition", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 36, no. 1, 28 June 2022 (2022-06-28), pages 852 - 860, XP093161801, ISSN: 2159-5399, DOI: 10.1609/aaai.v36i1.19967 *

Similar Documents

Publication Publication Date Title
Kim et al. Crack and noncrack classification from concrete surface images using machine learning
Xu et al. Surface fatigue crack identification in steel box girder of bridges by a deep fusion convolutional neural network based on consumer-grade camera images
Prabhu et al. Few-shot learning for dermatological disease diagnosis
CN106295124A (en) Utilize the method that multiple image detecting technique comprehensively analyzes gene polyadenylation signal figure likelihood probability amount
Ottoni et al. Tuning of data augmentation hyperparameters in deep learning to building construction image classification with small datasets
Yang et al. Hyperspectral image classification with spectral and spatial graph using inductive representation learning network
EP3859666A1 (en) Classification device, classification method, program, and information recording medium
CN113283282B (en) Weak supervision time sequence action detection method based on time domain semantic features
Prabhu et al. Prototypical clustering networks for dermatological disease diagnosis
Apeagyei et al. Evaluation of deep learning models for classification of asphalt pavement distresses
Hoang Classification of asphalt pavement cracks using Laplacian pyramid‐based image processing and a hybrid computational approach
Gu et al. Remembering normality: Memory-guided knowledge distillation for unsupervised anomaly detection
CN116872961B (en) Control system for intelligent driving vehicle
Aslan et al. Using artifical intelligence for automating pavement condition assessment
CN117010971B (en) Intelligent health risk providing method and system based on portrait identification
Dhawan et al. Deep Learning Based Sugarcane Downy Mildew Disease Detection Using CNN-LSTM Ensemble Model for Severity Level Classification
Akiyama et al. Evaluating different deep learning models for automatic water segmentation
Yamaguchi et al. Road crack detection interpreting background images by convolutional neural networks and a self‐organizing map
WO2024085114A1 (en) Image classification learning device, image classification learning method, image classification learning program, and post-image-classification learning model
Hoang et al. Image processing-based classification of pavement fatigue severity using extremely randomized trees, deep neural network, and convolutional neural network
Altaei et al. Satellite image classification using multi features based descriptors
Harshavardhan et al. Detection of Various Plant Leaf Diseases Using Deep Learning Techniques
Stark et al. Quantifying uncertainty in slum detection: advancing transfer-learning with limited data in noisy urban environments
Thiyagarajan Performance Comparison of Hybrid CNN-SVM and CNN-XGBoost models in Concrete Crack Detection
JP2021063706A (en) Program, information processing device, information processing method and trained model generation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23879757

Country of ref document: EP

Kind code of ref document: A1