EP4189642A1

EP4189642A1 - Prediction of labels for digital images, especially medical ones, and supply of explanations associated with these labels

Info

Publication number: EP4189642A1
Application number: EP21752088.1A
Authority: EP
Inventors: Gwenolé QUELLEC
Original assignee: Institut National de la Sante et de la Recherche Medicale INSERM; Universite Brest Bretagne Occidentale
Current assignee: Institut National de la Sante et de la Recherche Medicale INSERM; Universite Brest Bretagne Occidentale
Priority date: 2020-07-28
Filing date: 2021-07-21
Publication date: 2023-06-07
Also published as: FR3113157B1; FR3113157A1; US20230306731A1; IL300233A; WO2022023646A1

Abstract

The invention relates to a method for predicting labels associated with a digital image, comprising a prediction phase consisting of: - supplying the image (I) to a segmentation neural network (SN) configured to predict a classification of the pixels of the image in a first set of classes; and, - supplying at least part of this classification to a classification neural network (CN), configured to predict a set of labels for said image from the classification P of the pixels; said segmentation and classification neural networks being determined by a training phase comprising, for each image of a training set, the first and second steps; and the determination of a location of the background of said image, from the classification of the pixels, and the optimisation of the weights of the neural networks, according to a set of cost functions configured, by iteration, to maximise the quality of the set of labels as a function of labels previously established and associated with the image, and to maximise the probability of not predicting any label for the background.

Description

Prediction of labels for digital images, in particular medical ones, and provision of explanations associated with said labels

[0001] FIELD OF THE INVENTION

The present invention relates to the prediction of labels, or labeling, to be associated with digital images. It applies in particular to the field of digital medical images, in order to allow their use for automatic diagnosis or for aiding diagnosis.

[0003] BACKGROUND OF THE INVENTION

[0004] Artificial intelligence techniques, and in particular those related to multilayer neural networks, make it possible to automatically process digital images in order to associate them with labels, or classes, which can be predefined or even determined dynamically, during a learning phase, from the set of images processed.

[0005] Such an approach can, for example, make it possible to automatically associate pathologies with medical images, taken from X-rays, scanners, tomographies, ultrasounds, MRI (Magnetic Resonance Imaging), etc.

[0006] It may also relate to other fields of application of digital imaging, such as video surveillance or vision for autonomous driving, in which these labels are used to characterize a scene perceived by a camera in order, where appropriate, to initiate an action (alert, automatic maneuver of the vehicle, etc.).

[0007] In general, these automatic systems are based on multilayer neural networks (generally convolutional neural networks). These can be considered as "black boxes", i.e. after a learning phase, they are able to provide, in response to a digital image, a proposal for labels, or classes, without the user being able to understand how this proposal, or prediction, was established and on what basis.

[0008] However, there is a need to be able to explain these predictions and to trace a causal chain between the inputs and the outputs of an automatic system based on artificial intelligence.

[0009] Indeed, since such a system is not infallible, the user (for example the doctor or the surgeon in the case of medical imaging) should be able to analyze the "reasoning" of the automatic system in order to understand the prediction and accept it more easily.

[0010] There is also a general tendency to propose explainable mechanisms for automatic classification. Several jurisdictions have examined this problem and aim to define, promote, or even impose in certain sensitive areas, explainable automatic classification systems.

[0011] In particular, certain regulations, including those of the European Union, aim to guarantee end users of artificial intelligence an explanation for the automatic decisions which concern them. These aspects are described in particular in Goodman B, Flaxman S, “European Union regulations on algorithmic decision-making and a right to explanation”, AI Magazine, 2017 Oct; 38(3):50-7.

[0012] In France, in its opinion published in June 2017 on the ethics of research in machine learning, the CERNA commission (Commission for reflection on the ethics of research in science and digital technologies of Allistene) defines the notion of explainability in the following way: "to explain an algorithm is to make its users understand what it does, with enough details and arguments to gain their confidence".

[0013] Consequently, the design of explainable artificial intelligence mechanisms (XAI, for "eXplainable Artificial Intelligence") has become a research topic in which many actors are involved. Gilpin et al., “Explaining explanations: An overview of interpretability of machine learning”, in Proc. IEEE DSAA, 2018, Torino, pp. 80-89, provide an overview.

[0014] However, a distinction must be made under this general concept between interpretability and true explainability.

[0015] An algorithmic decision is said to be explainable if it is possible to account for it explicitly from known data and characteristics of the situation; in other words, if it is possible to link the values taken by certain variables (characteristics) to their consequences on the forecast, for example of a score, and thus on the decision.

[0016] An algorithmic decision is said to be interpretable if it is possible to identify the characteristics or variables which contribute the most to the decision, or even to quantify their importance.

[0017] By construction, an explainable decision is interpretable.

[0018] These definitions can be found, in particular in the article by Gilpin et al., cited above.

[0019] Currently, there do not seem to be any mechanisms that achieve the required level of explainability without degrading predictive quality.

A first family of proposals is based on an a posteriori analysis of the influence of image pixels on the prediction made by the automatic image classification mechanism. However, this approach does not make it possible to precisely explain the detected influences.

Another family of proposals is based on alternative methods to multilayer neural networks. For example, neural decision forests, consisting of multiple neural decision trees, have been proposed. Although each neural decision tree is, to some extent, explainable, the fact that the final decision is based on a large number of decision trees still renders this final decision opaque.

[0022] SUMMARY OF THE INVENTION

An object of the present invention is to provide a mechanism at least partially overcoming the aforementioned drawbacks.

More particularly, according to embodiments, it aims to provide an automatic and explainable prediction of labels associated with a digital image.

To this end, according to a first aspect, the present invention can be implemented by a method for the prediction of labels associated with a digital image, comprising a prediction phase consisting in supplying, in a first step, said image to a segmentation neural network configured to predict a classification of pixels of said image into a first set of classes; and, in a second step, providing said classification to a classification neural network, configured to predict a set of labels p(I) for said image from said pixel classification, except for a segment corresponding to a background of said image; said segmentation and classification neural networks being determined by a learning phase comprising, for each image of a learning set:
- said first and second steps;
- determining a location of said background of said image, from the classification of pixels;
- optimizing the weights of said neural networks, according to a set of cost functions configured to, by iteration, maximize the quality of said set of labels based on previously established labels associated with said image, and maximize the probability of predicting no label for said background.

[0026] According to preferred embodiments, the invention comprises one or more of the following characteristics, which can be used separately or in partial or total combination with each other:
- the determination of a location of the background of said image comprises determining an occluded image, from the pixel classification, corresponding to the background of the image, said occluded image being defined pixel by pixel from said image;
- during the learning phase, an auxiliary classification neural network is trained in order to optimize the classification of said occluded image;
- said set of cost functions comprises a global cost function ℒ_global which is expressed [Math. 1], where ℒ is a cost function making it possible to maximize the quality of said set of labels as a function of previously established labels; ℒ' is a cost function making it possible to maximize the quality of the predictions of said auxiliary classification neural network as a function of said previously established labels; ℒ_occlusion is a cost function making it possible to maximize the probability of predicting no label for said occluded image; ℒ_parsimony is a cost function for maximizing the area of said background in the pixel classification; and α, β and γ are parameters;
- said segmentation neural network is an encoder-decoder network formed of an encoder neural network and a decoder neural network, arranged in cascade;
- said classification neural network consists of summary and classification layers;
- the output of said classification layer can be expressed as a function of an input vector z_m as [Math. 2], with w_{m,n} representing positive synaptic weights, b_n representing biases, σ representing the activation function of the neurons of said classification layer, N representing the number of image labels and M the number of pixel labels;
- during said prediction phase, an explanation associated with each label of said set of labels is provided;
- said explanation is based on said synaptic weights w_{m,n} of said classification layer and on the outputs of said summary layers;

- At the end of the learning phase, names are associated with the probability maps and, during the prediction phase, said names are provided with said explanations.
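The expression [Math. 1] is not reproduced in this text. One form consistent with the four cost functions and the three parameters listed above (written here α, β and γ) is a weighted sum, given purely as an assumption:

```latex
\mathcal{L}_{\text{global}} = \mathcal{L} + \alpha\,\mathcal{L}' + \beta\,\mathcal{L}_{\text{occlusion}} + \gamma\,\mathcal{L}_{\text{parsimony}}
```

The assignment of each parameter to a given term is hypothetical; the text only states that the four cost functions and the three parameters exist.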

According to another aspect, the invention can also be implemented by a computer program comprising instructions for implementing the method described above when executed by an information processing platform.

According to another aspect, the invention can also be implemented by a device for predicting labels associated with a digital image, comprising means for implementing the method as described above.

Other characteristics and advantages of the invention will appear on reading the following description of a preferred embodiment of the invention, given by way of example and with reference to the accompanying drawings.

[0030] BRIEF DESCRIPTION OF THE FIGURES

The accompanying drawings illustrate the invention:

[Fig. 1] schematically represents an example of the context in which the invention can be implemented.

[Fig. 2] schematically represents another example of the context in which the invention can be implemented.

[Fig. 3] schematically illustrates a functional architecture according to one embodiment of the invention.

[Fig. 4] schematically illustrates a functional flowchart according to one embodiment of the invention.

[Fig. 5] schematically illustrates a multilayer neural network such as can be used in the context of an implementation of the invention.

[Fig. 6] schematically illustrates a functional architecture according to one embodiment of the invention.

[Fig. 7] schematically illustrates a functional architecture according to an embodiment of the invention comprising a learning phase.

[0032] DETAILED DESCRIPTION OF THE INVENTION

According to one aspect of the invention, the prediction of labels associated with a digital image can be performed by a device which can be implemented by an information processing system.

[0034] This system can in particular be as illustrated in FIG. 1.

In a first phase, called learning or training, a set 4 of digital images 4₁, 4₂, 4₃ ... 4ₖ is presented to a computer program 10 implementing the method according to an embodiment of the invention. In addition, previously established labels 5, respectively 5₁, 5₂, 5₃ ... 5ₖ, are also provided. These labels may have been established manually by human operators, or possibly by other methods.

According to one embodiment, these images are two-dimensional digital images. These may in particular be digital medical images from x-rays, scanners, tomographies, ultrasounds, MRI (Magnetic Resonance Imaging), etc.

[0037] However, the invention can be applied to images of other dimensions, in particular one-dimensional or three-dimensional images, or beyond.

The learning phase makes it possible to constitute a model 11, formed of the internal parameters (synaptic weights, etc.) of the neural networks implemented by the computer program 10.

These neural networks 11, once trained, can be used during an exploitation or prediction phase: a new image 3 is supplied to the computer program 10 which can then determine a prediction of labels 2.

This computer program 10 can be implemented by an information processing device 1. According to one embodiment of the invention, the information processing device can be of different types (personal computer, server, communication terminal, service available via a computing cloud, etc.).

According to one embodiment, the device can be implemented by a set of circuits co-located in a centralized server or distributed within a distributed server or among a set of servers. This set of servers may include “server farm” or “cloud computing” type arrangements.

[0042] In particular, according to an embodiment such as illustrated, for example, in [FIG. 2], the computer program 10 can be accessed remotely through a communication network 7. Thus, for example, a communication terminal 6 can transmit an image 3 to a label prediction device 1 via the communication network 7 and receive in response a prediction 2 of labels. As mentioned above, this device can be a single server or, more abstractly, a service made accessible via an interface, in particular of the web type, and deployed on a cloud computing type abstraction platform.

[0043] For the concrete implementation of the label prediction device, the term “circuit” is understood in the present application as comprising hardware elements possibly associated with software elements, insofar as certain hardware elements can be programmed. In particular, the term circuit includes purely hardware implementations, in the form of specifically printed digital or analog circuits, and implementations based, entirely or partially, on elements of the microprocessor or processor type, which are programmed by software instructions stored in one or more associated memories, etc. The software instructions may consist only of the instructions necessary for the basic operations of the processors (the "firmware"), while the software instructions necessary for carrying out the functions of the embodiments of the invention can be stored either in these same memories associated with the processors, or in remote memories. In the latter case, these software instructions are only present in the circuit when the circuit is in operation to perform the functions according to the embodiments of the invention.

According to one aspect of the invention, as illustrated in [FIG. 3], the label prediction device 10 comprises a first segmentation neural network, SN, and a second classification neural network, CN.

[0045] [FIG. 4] illustrates this two-step process in the form of a flowchart: a first step S1 makes it possible to predict a classification of the pixels of the image from the image itself, and a second step S2 makes it possible to predict a classification of the image from the pixel classification. These two steps are based on neural networks which have been trained during a learning phase S0.
[0046] At a very high level, multilayer neural networks can be seen as black boxes whose internal parameters must be adjusted during a learning or training phase, by presenting them with both input data and a desired output (i.e., here, previously established labels). The error between this desired output and the "natural" output of the network makes it possible to slightly adjust the parameters to reduce the error. By presenting a large number of these "input data/desired output" pairs, the network learns to react correctly and to provide a good output when presented with new input data that are not associated with previously established labels (which therefore have to be predicted).

According to one embodiment of the invention, the neural network used can be based on a multilayer perceptron. Among the networks based on the general architecture of the multilayer perceptron, mention may in particular be made of convolutional neural networks (ConvNet or CNN, for “Convolutional Neural Network”).

The multilayer perceptron (MLP) is a type of artificial neural network organized in several layers, within which information flows only from the input layer L₁ to the output layer Lₖ; it is therefore a feedforward network. Each layer L₁, L₂, L₃ … Lₖ consists of a variable number of neurons, respectively n₁, n₂, n₃ … nₖ. The neurons of the last layer (called the “output” layer) are the outputs of the neural network and represent a prediction of the model in response to an input provided on the layer L₁.

In a multilayer perceptron, each neuron n_{i,j} is connected at its output to all of the neurons of the next layer L_{i+1}. Conversely, it receives as input the outputs of all the neurons of the previous layer L_{i-1}. In [Fig. 5], for clarity, only a few connections are represented by oriented arrows. Each connection is associated with a weight (or synaptic weight).
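As an illustration only, the feedforward propagation through fully connected layers can be sketched as follows. The bias terms, the sigmoid activation and the example weights are assumptions chosen for the sketch and are not taken from the description.

```python
import math


def sigmoid(x):
    """Logistic activation, used here only as an example activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, activation=sigmoid):
    """Fully connected layer: every neuron receives every input.

    weights[j][i] is the weight of the connection from input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def forward(x, layers):
    """Propagate x through the layers in order (feedforward only)."""
    for weights, biases in layers:
        x = dense_layer(x, weights, biases)
    return x


# Hypothetical network with 2 inputs, 3 hidden neurons and 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
prediction = forward([1.0, 2.0], layers)
```

Training would consist in adjusting the values in `layers` so that `prediction` approaches the desired output.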
The set of weights forms the internal parameters of the neural network. They must be determined during a learning (or training) phase and then make it possible to predict output values, by generalization, from a new input vector presented on the input layer L₁.

Each neuron n_{i,j} conventionally performs a weighted sum of its inputs by the weights of the associated connections and then applies an activation function to this sum.

Several techniques exist for determining the internal parameters of the network, in particular the synaptic weights, by learning. Mention may in particular be made of the stochastic gradient descent (SGD) algorithm, described for example in LeCun, Yann A., et al., “Efficient backprop”, in Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg, 2012, pp. 9-48. Mention may also be made of ADAM, originally described in Diederik P. Kingma and Jimmy Lei Ba, “Adam: A method for stochastic optimization”, 2014, arXiv:1412.6980v9, or RMSprop, described in particular in Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

[0053] According to one aspect of the invention, the first neural network is a segmentation neural network, SN, configured to predict a classification P(I) of the pixels of the image I.

According to one embodiment, this classification P(I) associates with each pixel (x,y) of the image I a vector P_{x,y} representing the set of predictions P_{m,x,y} that the pixel (x,y) is associated with the label m ∈ {1, 2, …, M}, where M is the number of possible labels for the image pixels. In other words, P(I) forms a three-dimensional tensor.

This classification can also be defined as a set of M probability maps P_m, with m ∈ {1, 2, …, M}, each representing the prediction that each of the pixels (x,y) of the image is associated with the label m.
According to one embodiment, the pixel labels are considered to be mutually exclusive. We can therefore write:

[Math. 3]

This classification P(I) defines a segmentation of the image, that is to say a set of groups of pixels, or image segments, each segment corresponding to a distinct class. As will be seen below, one of the challenges of certain embodiments of the invention consists in conferring a semantic meaning on these classes. This meaning, associated with each class, will then make it possible to explain the prediction of image labels.

This classification P(I) of the pixels, predicted by the segmentation network SN, is then provided to a classification network CN, configured to predict a set of labels p(I) for said image from this pixel classification.

According to one embodiment, this classification p(I) associates with the image I a vector of predictions p_n ∈ [0; 1], with n ∈ {1, 2, …, N}, N being the number of labels that can be associated with the image I. Each value p_n of this vector p(I) indicates a probability that the image is associated with the label n.

[0060] According to one embodiment, the labels are not mutually exclusive, that is to say that several labels can be predicted for the same image I.

[0061] During the learning phase S0, which will be detailed further on, we consider, for each image of the training set, a vector δ(I) of labels δ_n ∈ {0,1}. We can then write:

[Math. 4]

The number N of labels for the images I and the number M of labels for the pixels can be different.

According to one embodiment of the invention, a particular probability map, P₁ (label m=1), is considered to be representative of the background of the image.

According to one embodiment, the segmentation neural network SN is implemented by an encoder-decoder network. Such an encoder-decoder network can be broken down, as illustrated in [FIG. 6], into an encoder neural network EN and a decoder neural network DN, arranged in cascade so that the outputs of the first network, EN, are provided at the input of the second network, DN. Such an architecture is for example described in Ronneberger, O., Fischer, P., Brox, T., “U-net: Convolutional networks for biomedical image segmentation”, in Proc. MICCAI, October 2015, Munich, pp. 234-241.

This approach consists in transforming the input data by encoding them into an intermediate vector which represents a set of internal states, then in decoding this intermediate vector by "projecting" it onto an output vector, here the predictions of the pixel labels P(I).

We can write SN = DN o EN, where “o” is the function composition operator, and SN, DN, EN represent the functions associated with the neural networks SN, DN, EN respectively.

According to one embodiment of the invention, the encoder network EN may be the EfficientNet network, as defined in the article “EfficientNet: Rethinking model scaling for convolutional neural networks” by Tan, M., Le, Q. V., in Proc. ICML, June 2019. This EfficientNet network is a convolutional neural network (CNN).

According to one embodiment of the invention, the segmentation network SN is of the “Feature Pyramid Network” (FPN) type, according to the article “Feature Pyramid Networks for object detection” by Lin, T. Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S., in Proc. CVPR, pp. 936-944, July 2017.

This type of pyramid network is used in pattern recognition in digital images in order to detect objects or characteristics independently of their scale of representation in the image. It is based on a convolutional network, which can conventionally be ResNet or another network. In particular, feature maps are produced at different resolutions. These maps are then resized to the resolution of the considered image I and then concatenated.
A final convolution layer then makes it possible to obtain the tensor P(I).

Other types of encoder-decoder neural networks can of course be used to implement the segmentation network SN. The FPN network was chosen for its speed of convergence and for its performance, which stems from its independence from the resolution of the detected features (which are used for the prediction of pixel labels). Other types of network may be chosen depending, for example, on the application context or on the availability of new neural network architectures.
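The resize, concatenate and convolve sequence described for the pyramid network can be sketched as follows. This is a simplified illustration, not the FPN of the cited article: nearest-neighbour resizing, one channel per pyramid level and a 1x1 final convolution are all assumptions, and the names `upsample_nearest` and `fuse_pyramid` are hypothetical.

```python
def upsample_nearest(feature_map, height, width):
    """Resize a 2-D map to (height, width) by nearest-neighbour sampling."""
    src_h, src_w = len(feature_map), len(feature_map[0])
    return [
        [feature_map[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def fuse_pyramid(pyramid, height, width, kernel, bias):
    """Resize every map to the image resolution, stack the results as channels,
    then apply a 1x1 convolution producing one score map per pixel label.

    pyramid: list of 2-D maps at different resolutions (one channel each).
    kernel[m][c]: weight of channel c for pixel label m.
    """
    channels = [upsample_nearest(fm, height, width) for fm in pyramid]
    scores = []
    for label_weights, label_bias in zip(kernel, bias):
        scores.append([
            [
                sum(w * ch[y][x] for w, ch in zip(label_weights, channels)) + label_bias
                for x in range(width)
            ]
            for y in range(height)
        ])
    return scores


# Two hypothetical feature maps (1x1 and 2x2) fused at a 4x4 image resolution.
pyramid = [[[1.0]], [[0.0, 1.0], [2.0, 3.0]]]
scores = fuse_pyramid(pyramid, 4, 4, kernel=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

The resulting `scores[m][y][x]` play the role of the values to which the activation of the last convolutional layer is applied to obtain P(I).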

Since the pixel labels are mutually exclusive, a softmax activation function μ can be used for the last convolutional layer:

[Math. 5]

At the output of this decoder network DN, we therefore obtain a tensor P(I) expressing the classification of the pixels of the image I.
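The expression [Math. 5] is not reproduced in this text. The standard softmax, in which the exponential of each label score is divided by the sum of the exponentials over all labels at the same pixel, can be sketched as follows. The code uses 0-based indices, whereas the description numbers the labels from 1; the example scores are hypothetical.

```python
import math


def pixelwise_softmax(scores):
    """scores[m][y][x] -> probabilities P[m][y][x] summing to 1 over m."""
    num_labels = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(num_labels)]
    for y in range(height):
        for x in range(width):
            logits = [scores[m][y][x] for m in range(num_labels)]
            peak = max(logits)  # subtracting the maximum improves numerical stability
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            for m in range(num_labels):
                probs[m][y][x] = exps[m] / total
    return probs


# M = 3 pixel labels on an image of 1 x 2 pixels.
scores = [[[2.0, 0.0]], [[0.0, 0.0]], [[0.0, 1.0]]]
P = pixelwise_softmax(scores)
```

Because the probabilities sum to 1 at each pixel, raising the probability of one label necessarily lowers the others, which matches the mutually exclusive pixel labels.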

This tensor can be seen as a set of M probability maps P_m. Each probability map P_m corresponds to a “segment” of the image, that is to say to a set of pixels (x,y) which are considered to correspond to the same semantic value. Each value P_{m,x,y} corresponds to the probability that the pixel (x,y) is associated with the label m, and therefore belongs to the segment corresponding to the map P_m.

These probabilities P_{m,x,y} form the input data of the classification network CN, which will be explained later.

As mentioned previously, it can be assumed that each pixel (x,y) is associated with only one class and that the pixel labels are mutually exclusive. In other words, from this tensor P(I), we can assign a unique class to each pixel by choosing the index m which maximizes the value of P_{m,x,y} for the pixel (x,y). In the resulting binarized tensor P, each pixel (x,y) is associated with a non-zero value P_{m,x,y} only for a unique value of m ∈ {1, 2, …, M}.
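The assignment of a unique class to each pixel can be sketched as a per-pixel argmax. The function name `binarize`, the one-hot encoding of the result and the tie-breaking rule (lowest index wins) are assumptions of this sketch.

```python
def binarize(P):
    """Return a one-hot tensor B with B[m][y][x] = 1 for the most likely label."""
    num_labels = len(P)
    height, width = len(P[0]), len(P[0][0])
    B = [[[0] * width for _ in range(height)] for _ in range(num_labels)]
    for y in range(height):
        for x in range(width):
            # max() returns the first maximum, so ties go to the lowest index.
            best = max(range(num_labels), key=lambda m: P[m][y][x])
            B[best][y][x] = 1
    return B


# Three probability maps on an image of 1 x 2 pixels (hypothetical values).
P = [[[0.7, 0.2]], [[0.2, 0.5]], [[0.1, 0.3]]]
B = binarize(P)
```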

As will be seen later, the segmentation neural network SN is trained so that one of these probability maps, arbitrarily P₁, corresponds to the background of the image I. The other probability maps therefore form the foreground of the image I. The classification resulting from the segmentation network SN, formed by these other probability maps P_m, with m ∈ {2, 3, …, M}, is then supplied to a classification neural network CN.

This classification network is configured to predict a set of labels p(I) for the image I from the classification P of the pixels.

We can write p(I) = CN(P(I)) = (CN o SN)(I), where “o” is the function composition operator.

According to one embodiment, the classification neural network CN is chosen to be simple in order to facilitate the explainability of the predictions p(I).

[0082] In particular, it may comprise two layers: a “summary” layer P and a classification layer D.

The summary layer aims to represent each probability map P_m by a single value. Different implementations are possible to determine this value.

According to one embodiment, this value can be the average of the map, which is proportional to the area covered by the corresponding pixel label.

According to one embodiment, this value can be the maximum value of the map over all the pixels (x,y). We can then write:

[Math. 6]

[0087] Insofar as the first embodiment can cause problems of over-segmentation, the second embodiment is preferred.
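The two summary options can be sketched as follows. The function name `summarize`, its parameters and the convention that index 0 holds the background map (P₁ in the description, which is not passed to the classification network) are assumptions of this sketch.

```python
def summarize(P, mode="max", skip_background=True):
    """Reduce each probability map to a single value z_m.

    P[m][y][x]: probability maps; index 0 is assumed to hold the background map.
    mode: "max" (the preferred embodiment) or "mean".
    """
    maps = P[1:] if skip_background else P
    z = []
    for prob_map in maps:
        values = [v for row in prob_map for v in row]
        z.append(max(values) if mode == "max" else sum(values) / len(values))
    return z


# One background map and two foreground maps on a 2 x 2 image (hypothetical values).
P = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.1, 0.8], [0.1, 0.1]],
    [[0.0, 0.1], [0.1, 0.7]],
]
z_max = summarize(P, mode="max")
z_mean = summarize(P, mode="mean")
```

With the maximum, a single confident pixel is enough to signal the presence of a pixel label, whereas the average grows with the area covered by that label.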

According to one embodiment, the classification layer D can be a set of dense layers in which the synaptic weights are positive. According to another embodiment, the classification layer can be implemented by a differentiable decision tree.

According to one embodiment, the classification layer comprises only a single dense layer, with positive weights. Using a single layer makes it possible to improve the explainability of image predictions. Similarly, the positivity constraint makes the contribution of each pixel label m ∈ {2, 3, …, M} explainable: the prediction of the image labels p(I) is defined as a weighted sum of the maximum predictions for each pixel label, each maximum prediction being weighted by a positive weight that can be interpreted as a level of confidence.

This classification layer D can be defined by:

[Math. 7]

[0091] where w_{m,n} represent positive synaptic weights, b_n represent biases, and σ is an activation function. z_m represents the inputs, i.e. the outputs or values of the neurons of the previous layer P.

Insofar as the image labels are not mutually exclusive, the sigmoid function can be chosen as the activation function:

[Math. 8]

[0093] N image label predictions p_n are thus obtained at the output, from which a set of labels can be determined as a function of the values p_n.
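The expressions [Math. 7] and [Math. 8] are not reproduced in this text. Under the definitions given above (inputs z_m, positive weights w_{m,n}, biases b_n and a sigmoid activation), a single dense layer would conventionally be written as follows. The summation range over the foreground labels m = 2 … M is an assumption based on the description of the background map.

```latex
p_n = \sigma\!\left(b_n + \sum_{m=2}^{M} w_{m,n}\, z_m\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad n = 1, \dots, N
```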

In particular, it is possible to select the values p_n which exceed a predefined threshold, or else the few highest values, etc.

The value p_n associated with each label quantifies a degree of confidence, or likelihood. It can be presented to the user as an indicator accompanying the result.
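A single dense layer with positive weights, followed by threshold-based label selection, can be sketched as follows. The threshold of 0.5, the example weights and the `contributions` helper (which reads the products w_{m,n}·z_m as the share of each pixel label in a prediction) are assumptions made for illustration.

```python
import math


def classify(z, weights, biases):
    """p_n = sigmoid(b_n + sum_m weights[m][n] * z[m]) for every image label n."""
    predictions = []
    for n, bias in enumerate(biases):
        total = bias + sum(weights[m][n] * z[m] for m in range(len(z)))
        predictions.append(1.0 / (1.0 + math.exp(-total)))
    return predictions


def select_labels(predictions, threshold=0.5):
    """Keep the image labels whose prediction exceeds the threshold."""
    return [n for n, p in enumerate(predictions) if p > threshold]


def contributions(z, weights, n):
    """Share of each pixel label m in the pre-activation sum of image label n."""
    return [weights[m][n] * z[m] for m in range(len(z))]


z = [0.9, 0.1]                      # summary-layer outputs (hypothetical)
weights = [[4.0, 0.0], [0.0, 4.0]]  # weights[m][n], all non-negative
biases = [-2.0, -2.0]
p = classify(z, weights, biases)
labels = select_labels(p)
```

Because every weight is non-negative, a larger z_m can only raise a prediction, so each product returned by `contributions` can be read as evidence in favour of the label.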

According to one embodiment, in order to improve the explainability of the image label predictions, a learning phase S0 is implemented, configured to assign a semantic value to the probability maps P_m.

[0097] As mentioned above and as shown in [FIG. 1], the learning phase S0 consists in providing the images 4₁, 4₂, 4₃ ... 4ₖ of a learning set 4. Each of these images is associated with previously established image labels, respectively 5₁, 5₂, 5₃ ... 5ₖ. Here, k is the cardinality of the learning set 4.

One of the aspects of the invention consists in allowing the learning and the determination of all the parameters of the neural networks in a way that improves the explainability of the predictions p(I) while maintaining good prediction performance and good convergence.

[0099] A difficulty in designing a mechanism to implement this learning phase is that, although previously established labels δ(I) are available for the images, no such previously established labels are available for the pixels.

To address this, as previously mentioned, the segmentation neural network SN can be trained so that one of these probability maps, arbitrarily P₁, corresponds to the background of the image I. The other probability maps therefore form the foreground of the image I.

[0101] [FIG. 7] illustrates the sequence of functional steps according to one embodiment of the invention. The solid arrows illustrate the sequences carried out in the prediction (or exploitation) steps S1, S2, while the dotted arrows and blocks illustrate those carried out only during the learning phase S0.

Thus, for any image I of the learning set, the previously described steps are carried out: prediction of a classification P of the pixels by the segmentation network SN, then prediction of a set of labels p(I) for the image, from the classification P of the pixels, by the classification network CN.

We define a constraint that must be respected by the neural network:

- if a background image I is provided, that is to say an image with which no image label is associated, or for which

[Math. 9]

- then the segmentation network SN should only predict background pixel labels, i.e. the probability map P₁.

In order to comply with this constraint, the method therefore comprises the determination of a location of the background of the image, from the classification of the pixels P(I). This localization corresponds to the constitution of a segment of the image corresponding to the background. This background, as will be seen later, corresponds to areas with no semantic value, which therefore cannot be used for the prediction of image labels.

In particular, according to one embodiment, this determination of the location comprises the determination of an occluded image Î, from the classification P1 of the pixels corresponding to the background of the image.

According to one embodiment, the weights of said neural networks are optimized according to a set of cost functions configured, by iterating over the images of the training set, to maximize the quality of said set of labels p(I) with respect to previously established labels associated with said image, and to maximize the probability of predicting no label for said occluded image.

This occluded image Î can be determined as follows:

[Math. 10]

This product can be computed element by element, including in the case of an image with several planes, in particular a color image.
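By way of purely illustrative and non-limiting example, the element-by-element product described above can be sketched as follows. The sketch assumes that the occluded image is obtained by multiplying each pixel of the image I by its background probability in the map P1, since the expression of [Math. 10] itself is not reproduced in this text. The function name `occlude` and the nested-list representation of the image are hypothetical choices made for the illustration only.

```python
def occlude(image, background_map):
    """Return an occluded image: each pixel scaled by its background probability.

    image: H x W x C nested lists (C planes, e.g. 3 for a color image).
    background_map: H x W nested lists holding the probabilities of map P1.
    Assumption (not stated in the source): occluded = P1 * image, pixel-wise.
    """
    if len(image) != len(background_map):
        raise ValueError("image and background map must have the same height")
    occluded = []
    for image_row, map_row in zip(image, background_map):
        if len(image_row) != len(map_row):
            raise ValueError("image and background map must have the same width")
        # The same background probability is applied to every plane of a pixel.
        occluded.append(
            [[p1 * value for value in pixel] for pixel, p1 in zip(image_row, map_row)]
        )
    return occluded


# A 1 x 2 color image: the first pixel is pure background (P1 = 1),
# the second is pure foreground (P1 = 0) and is therefore erased.
example_image = [[[1.0, 0.5, 0.25], [0.8, 0.8, 0.8]]]
example_p1 = [[1.0, 0.0]]
example_occluded = occlude(example_image, example_p1)
```

Under this assumption, a pixel that the segmentation network regards as foreground is set to zero in every plane, while a background pixel is left unchanged.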

[0109] An occlusion mechanism is described in a different context in the article “Visualizing and understanding convolutional networks” by M. D. Zeiler, R. Fergus, in Proc. ECCV, pp. 818-833, September 2014. In this article, a square mask is scanned over the image and, for each position of the mask, an occluded image is created and processed by the classification network. The positions of the mask which disturb the classification are then retained.

The occlusion mechanism described here can be regarded as an improvement on that proposed by M. D. Zeiler and R. Fergus, in which the mask is adapted to each image, so that a single occluded image is obtained for processing by the classification network. This approach provides a gain in speed (a single inference) and in precision (the mask is defined at the scale of the pixel rather than of a square). Furthermore, contrary to what is described in that article, the occlusion mechanism implemented according to one embodiment of the invention is performed only during the learning phase S0, and only in order to optimize the learning of the probability map P1.

Thus, the learning aims on the one hand to process the image I with the objective of assigning it the correct image label(s) p(I), and on the other hand to process the occluded image Î with the objective of assigning no label to it, which reflects the fact that all the relevant pixels are properly occluded.

In the case of medical imaging, a probability map P1 is thus obtained that is optimized so that all the lesions are removed from the occluded image Î.

[0113] The occluded image must have two properties, which are to be optimized during the learning phase:

(i) the occluded image should always be perceived as a background image, regardless of the labels previously established for the image I. This indicates that all relevant pixels have been properly occluded (sensitivity of the occlusion);

(ii) the probability map P1 should cover as large an area as possible. In other words, the complementary image must be as sparse as possible, that is, it must contain as few non-zero pixels as possible (specificity of the occlusion).

In order to optimize the first property (i), the occluded image Î must be supplied to a classification neural network CN', and the background map P1 must be optimized, by learning, so that Î is predicted as background.

According to one embodiment, the neural networks CN o SN can be used to do this. However, optimizing (CN o SN)(Î) affects not only the detection of background pixels but also the classification as a whole.

Also, according to one embodiment, an auxiliary classification neural network CN' is implemented during the learning phase, in order to isolate the two convergences by means of an auxiliary classification branch.

It should be noted that the use of such an additional classification neural network and of an auxiliary classification branch is not essential. The proposed optimization makes it possible to enlarge the image segment corresponding to the background and, in doing so, to reduce the foreground within which the areas of interest are sought. The goal is to optimize the method and the synaptic weights of the neural networks in order to obtain a better localization of the lesions sought in the digital images. In the absence of these features, the determination nevertheless remains possible, with a rate of correct label prediction that is as high (or even higher), but with less precision at the pixel level.

[0118] It can be assumed that the encoder network EN performs a separation of the background and the foreground, so that it can be used for the optimization of the background. The auxiliary classification branch therefore consists of the EN and CN’ networks, i.e. (CN’ o EN). The reuse of the EN encoder network makes it possible in particular to reduce the complexity of the training.

The “top activation” layer produces the tensor at the boundary between the encoder part and the decoder part of the segmentation network SN. It is considered that this tensor carries the information of the highest semantic level and is therefore the best suited to a classification of the image. Accordingly, the auxiliary classification neural network CN' can take as input the tensor at the output of the encoder network EN, that is to say the tensor produced by the "top activation" layer.

The auxiliary classification neural network CN' can consist of a "global average pooling" layer (global average of the activation maps), followed by a conventional dense layer. Like the classification network CN, this auxiliary network has a non-mutually exclusive output.

[0121] If T = EN(I), the output of the network CN'(T) can be written:

[Math. 11]

where σ is the sigmoid function, previously described, w'_l,n and b'_n represent the synaptic weights and the biases, respectively, and L is the number of components of the input layer (that is, at the output of the encoder network EN).

The CN' o EN branch therefore forms a classification branch for classifying the occluded images, in order to optimize the classification of the background images.
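As a minimal sketch only, the auxiliary network described above (a global average pooling layer followed by a dense layer with sigmoid activation) could be written as follows. The function names `sigmoid` and `auxiliary_head`, the argument layout and the nested-list tensor are hypothetical, and the sketch is not asserted to reproduce the exact expression of [Math. 11].

```python
import math


def sigmoid(x):
    """Logistic sigmoid, applied independently to each output."""
    return 1.0 / (1.0 + math.exp(-x))


def auxiliary_head(tensor, weights, biases):
    """Sketch of CN'(T): global average pooling, dense layer, sigmoid.

    tensor: H x W x L nested lists (assumed layout of the encoder output T).
    weights: L x N nested lists, weights[l][n] standing for w'_l,n.
    biases: list of N values, biases[n] standing for b'_n.
    Returns N independent scores, so the output is not mutually exclusive.
    """
    n_components = len(weights)
    n_positions = sum(len(row) for row in tensor)
    # Global average pooling: one mean value per component l of the tensor.
    pooled = [0.0] * n_components
    for row in tensor:
        for position in row:
            for l, value in enumerate(position):
                pooled[l] += value / n_positions
    # Dense layer followed by a sigmoid on each of the N label outputs.
    return [
        sigmoid(sum(pooled[l] * weights[l][n] for l in range(n_components)) + biases[n])
        for n in range(len(biases))
    ]


# Tiny example: a 1 x 2 spatial grid with L = 2 components and N = 2 labels.
example_scores = auxiliary_head(
    tensor=[[[1.0, 0.0], [3.0, 0.0]]],
    weights=[[0.5, 0.0], [1.0, 1.0]],
    biases=[-1.0, 0.0],
)
```

Because each output passes through its own sigmoid rather than a softmax, several labels can score highly at the same time, which matches the non-mutually exclusive output mentioned in the description.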

In order to allow the correct optimization of the synaptic weights (and of the biases) during the learning phase S0, cost functions (or "loss functions") must also be defined.

The main purpose of the proposed mechanism is to classify image labels correctly. A cost function is therefore defined in order to measure the agreement between the label predictions p(I) provided and the previously established labels d(I), with

[Math. 12]

It is possible to use a cost function ℒ based on cross-entropy. It can be defined by:

[Math. 13]

with p(I) = (CN o SN)(I).

For the auxiliary classification branch CN' o EN, a cost function ℒ' can also be defined based on cross-entropy. It can be defined, in the same way as the cost function ℒ, by the equation:

[Math. 14]

Another cost function is defined in order to optimize the learning of the auxiliary classification network CN' so that it converges towards compliance with the first property of the occluded image Î (sensitivity of the occlusion).

[0127] For a background image, we can write:

[Math. 15]

[0128] This expression indicates that no image label is assigned to a background image. Typically, in the case of medical imaging, this means that no pathology can be associated with an image that does not include lesions.

The cost function ℒ_occlusion, which makes it possible to optimize the sensitivity of the occlusion, can be based on a Euclidean norm. It can for example be written:

[Math. 16]

Another cost function, ℒ_parsimony, can be defined in order to steer the training of the neural network SN towards compliance with the second property of the occluded image (that is to say, the specificity of the occlusion). This cost function makes it possible to maximize the area of the classification P1 of the pixels or, conversely, to minimize that of the classification of the foreground pixels.

This cost function ℒ_parsimony can for example be an L1 norm on the predictions of the probability maps P_m, with m ∈ {2, 3, …, M}, supplied by the neural network SN. This cost function can be written:

[Math. 17]

The set of cost functions can also comprise a global cost function ℒ_global. This global cost function can be based on all the previously described cost functions and serve for the convergence of the various neural networks during the learning phase. Thus, at each iteration over the learning set, this global cost function ℒ_global can be used, which can be expressed, for example:

[Math. 18]

in which α, β and γ are parameters regulating the respective contributions of the various cost functions to the global cost function ℒ_global.

This global cost function allows convergence by back-propagation of the gradients of the errors determined by the cost function, in order to determine iteratively the synaptic weights of the different neural networks, so as to optimize the different constraints measured by the contributions ℒ, ℒ', ℒ_occlusion and ℒ_parsimony.
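The cost functions described above can be sketched, without limitation, as follows. The sketch rests on several assumptions, because the expressions of [Math. 13] to [Math. 18] are not reproduced in this text: the cross-entropy is taken as a binary cross-entropy averaged over the N labels, the parsimony term is averaged per pixel, and the global cost function is taken as a weighted sum in which ℒ has unit weight. All function names are hypothetical.

```python
import math


def cross_entropy(predictions, targets, eps=1e-7):
    """Binary cross-entropy averaged over N non-exclusive labels (assumed form)."""
    total = 0.0
    for p, d in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= d * math.log(p) + (1.0 - d) * math.log(1.0 - p)
    return total / len(targets)


def occlusion_loss(occluded_predictions):
    """Euclidean norm of the labels predicted for the occluded image.

    The target for an occluded image is "no label", i.e. a vector of zeros.
    """
    return math.sqrt(sum(p * p for p in occluded_predictions))


def parsimony_loss(foreground_maps):
    """L1 norm of the foreground maps P_2..P_M, averaged per pixel.

    The per-pixel normalization is an assumption made for this sketch.
    """
    values = [abs(v) for prob_map in foreground_maps for row in prob_map for v in row]
    return sum(values) / len(values)


def global_loss(loss, aux_loss, occ_loss, pars_loss, alpha, beta, gamma):
    """Assumed weighted sum: L + alpha * L' + beta * L_occlusion + gamma * L_parsimony."""
    return loss + alpha * aux_loss + beta * occ_loss + gamma * pars_loss


# Example values for one training image (all numbers are made up).
main = cross_entropy([0.5, 0.5], [1.0, 0.0])          # p(I) against d(I)
auxiliary = cross_entropy([0.5], [1.0])               # auxiliary branch on I
occlusion = occlusion_loss([0.3, 0.4])                # auxiliary branch on occluded image
parsimony = parsimony_loss([[[0.0, 1.0]], [[0.5, 0.5]]])
total = global_loss(main, auxiliary, occlusion, parsimony, alpha=0.5, beta=0.1, gamma=0.01)
```

In such a sketch, increasing `gamma` penalizes foreground area more strongly, which pushes the network towards a larger background map P1 at the possible expense of image-level classification quality.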

The parameters α, β and γ can be determined experimentally. It turns out that the parameter γ is the most sensitive: it can be used to adjust the trade-off between the quality of image classification (attribution of image labels) and the quality of pixel classification (attribution of pixel labels), the latter allowing explainability.

[0137] As stated previously, the allocation of the pixel labels P(I), together with the fact that the prediction of the image labels p(I) relies exclusively on these pixel labels P(I), allows the predictions of the image labels to be explained. Indeed, it may suffice to consider the contributions of the prediction maps P(I) that led to a prediction p(I) in order to provide a user with a good (i.e. semantic) understanding of the elements that led to the prediction: these prediction maps can be displayed, if necessary, so as to show explicitly the sets of pixels that allowed the prediction. Each set of pixels normally corresponds to a single lesion, if the classification has gone well.

According to one embodiment, when an image I is supplied, a prediction p(I) is obtained by inference, which is a vector of N label predictions p_n, as well as M−1 probability maps P_m for the pixel labels.

According to one embodiment of the invention, during the prediction phase, an explanation associated with each label of the label set is provided. Thus, it is possible for a user to understand the reasons that led to the assignment of labels to digital images.

According to one embodiment, these explanations can be based both on the pixel labels (i.e. the probability maps) and on the synaptic weights w_m,n of the classification layer Δ. These synaptic weights indicate the contribution of each pixel label to the assignment of the image labels.

According to one embodiment, the following procedure can be implemented:

The M−1 pixels which maximize the probability maps P_m, one per map, can be presented to the user.

We denote i_2, i_3, …, i_M the (positive) intensities, or values, of these pixels. In order to explain the image label prediction p_n, these intensities can be multiplied, respectively, by the (also positive) weights w_2,n, w_3,n, …, w_M,n. Each product is representative of the weight of the pixel label m in the prediction p_n.

At the end of the learning phase S0, experts can also assign names to the probability maps P_m. In that case, these names can be used to indicate the causes of the prediction p_n in addition to, or instead of, the number m of the associated pixel label.
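The procedure above can be illustrated by the following sketch, given purely as an example. It assumes that the foreground maps P_2 to P_M are supplied as a list of nested lists and that the weights w_m,n are stored in the same order. The function name `explain_label` and the dictionary keys are hypothetical.

```python
def explain_label(foreground_maps, weights, n):
    """Rank the pixel labels by their contribution to the image label n.

    foreground_maps: list of M-1 maps (H x W nested lists), for m = 2..M.
    weights: list of M-1 rows, weights[m - 2][n] standing for w_m,n (positive).
    Returns one entry per pixel label m, sorted by decreasing contribution.
    """
    explanations = []
    for offset, prob_map in enumerate(foreground_maps):
        m = offset + 2  # map P_1 is the background, so foreground labels start at 2
        # Pixel that maximizes the probability map P_m, with its intensity i_m.
        intensity, row, col = max(
            (value, r, c)
            for r, line in enumerate(prob_map)
            for c, value in enumerate(line)
        )
        explanations.append(
            {
                "pixel_label": m,
                "pixel": (row, col),
                "intensity": intensity,
                "contribution": intensity * weights[offset][n],  # i_m * w_m,n
            }
        )
    return sorted(explanations, key=lambda e: e["contribution"], reverse=True)


# Two foreground maps (m = 2 and m = 3) on a 2 x 2 image, one image label (n = 0).
example_explanation = explain_label(
    foreground_maps=[[[0.1, 0.9], [0.2, 0.3]], [[0.4, 0.0], [0.0, 0.0]]],
    weights=[[0.5], [2.0]],
    n=0,
)
```

In this example the pixel label m = 3 ranks first even though its maximum intensity is lower, because its weight in the prediction of the label is larger; this is the kind of ordering that could be shown to the user alongside the location of each pixel.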

[0143] The mechanism has been described in the article by Gwenolé Quellec, Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Pascale Massin, Béatrice Cochener, “ExplAIn: Explanatory artificial intelligence for diabetic retinopathy diagnosis”, in Medical Image Analysis, Volume 72, 2021, ISSN 1361-8415.

[0144] This article presents in particular experimental results of the method described. These show in particular that the rules found by the architecture based on neural networks are consistent with the classification of human experts (see Table 3).

Of course, the present invention is not limited to the examples and to the embodiment described and represented, but is defined by the claims. It is in particular susceptible of numerous variants accessible to those skilled in the art.

Claims

[Claim 1] Method for predicting labels associated with a digital image, comprising a prediction phase consisting in supplying, in a first step (S1), said image (I) to a segmentation neural network (SN) configured to predict a classification P of the pixels of said image in a first set of classes; and, in a second step (S2), supplying said classification to a classification neural network (CN), configured to predict a set of labels p(I) for said image from said classification P of the pixels, except for a segment corresponding to a background of said image; said segmentation and classification neural networks being determined by a learning phase (S0) comprising, for each image of a learning set (TS), said first (S1) and second (S2) steps; determining a location of said background of said image, from the classification of the pixels; and optimizing the weights of said neural networks, according to a set of cost functions configured to, by iteration, maximize the quality of said set of labels p(I) according to labels previously established and associated with said image, and maximize the probability of predicting no label for said background.

[Claim 2] Method according to the preceding claim, in which the determination of a location of the background of said image comprises the determination of an occluded image Î, from the classification P1 of the pixels corresponding to the background of the image, said occluded image being defined by a pixel-by-pixel product of said image with said classification P1.

[Claim 3] Method according to the preceding claim, in which, during the learning phase, an auxiliary classification neural network (CN') is trained in order to optimize the classification of said occluded image.

[Claim 4] Method according to the preceding claim, in which said set of cost functions comprises a global cost function ℒ_global which is expressed [Math. 19] in which: - ℒ is a cost function making it possible to maximize the quality of said set of labels p(I) as a function of previously established labels; - ℒ' is a cost function making it possible to maximize the quality of the predictions of said auxiliary classification neural network (CN') as a function of said previously established labels; - ℒ_occlusion is a cost function making it possible to maximize the probability of predicting no label for said occluded image; and - ℒ_parsimony is a cost function making it possible to maximize an area of said classification P1 of the pixels; and α, β and γ are parameters.

[Claim 5] Method according to one of the preceding claims, in which said segmentation neural network (SN) is an encoder-decoder network formed of an encoder neural network (EN) and a decoder neural network (DN), arranged in cascade.

[Claim 6] Method according to one of the preceding claims, in which said classification neural network (CN) consists of summary layers Π and a classification layer Δ.

[Claim 7] Method according to the preceding claim, in which the output of said classification layer can be expressed, as a function of an input vector z_m, as [Math. 20], with w_m,n representing positive synaptic weights, σ representing the activation function of the neurons of said classification layer, N representing the number of said labels and b_n representing biases.

[Claim 8] Method according to the preceding claim, in which an explanation associated with each label of said set of labels is provided during said prediction phase.

[Claim 9] Method according to the preceding claim, wherein said explanation is based on said synaptic weights w_m,n of said classification layer Δ and on the outputs of said summary layers Π.

[Claim 10] Method according to one of Claims 7 or 8, in which, at the end of the learning phase (S0), names are associated with the probability maps P_m, and, during the prediction phase, said names are provided with said explanations.

[Claim 11] Device for predicting labels associated with a digital image, comprising means for implementing the method according to one of the preceding claims.