WO2023194090A1 - Multiple instance learning considering neighborhood aggregations - Google Patents

Multiple instance learning considering neighborhood aggregations

Info

Publication number
WO2023194090A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
patches
patch
neighborhoods
machine learning
Prior art date
Application number
PCT/EP2023/057120
Other languages
French (fr)
Inventor
Johannes HÖHNE
Josef CERSOVSKY
Matthias LENGA
Jacob Coenraad DE ZOETE
Arndt Schmitz
Tricia BAL
Vasiliki Pelekanou
Emmanuelle DI TOMASO
Original Assignee
Bayer Aktiengesellschaft
Priority date
Filing date
Publication date
Application filed by Bayer Aktiengesellschaft
Publication of WO2023194090A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/69Microscopic objects, e.g. biological cells or cellular parts
    • G06V20/698Matching; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30024Cell structures in vitro; Tissue sections in vitro

Definitions

  • Systems, methods, and computer programs disclosed herein relate to training a machine learning model to classify images, preferably medical images, using multiple instance learning techniques.
  • the machine learning model can be trained and the trained machine learning model can be used for various purposes, in particular for the detection, identification and/or characterization of tumor types and/or gene mutations in tissues.
  • Digital pathology is an image-based information environment enabled by computer technology that allows for the management of information generated from a digital slide. Scanning converts tissues on glass slides into digital whole-slide images for assessment, sharing, and analysis.
  • Artificial intelligence methods can be used to analyze the digital images, e.g., for diagnostic performance improvement, discovery purposes, patient selection, treatment effects monitoring, etc.
  • WO2020229152A1 discloses a method of identifying signs indicative of an NTRK oncogenic fusion within patient data comprising a histopathological image of tumor tissue using a deep neural network;
  • C.-L. Chen et al. disclose a whole-slide training approach to pathological classification of lung cancer types using deep learning (https://doi.org/10.1038/s41467-021-21467-y).
  • Image patches are typically rectangular regions with dimensions ranging from 32 x 32 pixels to 10000 x 10000 pixels.
  • In order to train a machine learning model, training data is required.
  • This training data usually includes images of tissue that is known to be tumor tissue or healthy tissue. The type of tumor present in each case may also be known. This data is also called labeled data and the labeling is usually done by experts.
  • Patch-level labeling by expert pathologists is very time consuming. In specific clinical scenarios such as NTRK oncogenic fusion detection, a patch-level labeling might even be impossible as even the pathology experts may not be able to identify any decisive pattern.
  • ground truth labeling in most cases is done at the level of the whole-slide images rather than at the level of individual patches. Labelled whole-slide images are weakly annotated data that can be analyzed using multiple instance learning techniques.
  • multiple instance learning involves a set of bags that are labeled such that each bag consists of many unlabeled instances, i.e., the instances in the bags have no label information.
  • the goal of multiple instance learning can be to train a classifier that assigns labels to test bags or assigns labels to the unlabeled instances in the bags.
  • Each whole-slide image can be considered as a bag that contains many instances of patches.
  • Multiple instance learning uses training sets that consist of bags where each bag contains several instances that are either positive or negative examples for the class of interest, but only bag-level labels are given, and the instance-level labels are unknown during training.
  • A. V. Konstantinov et al. propose to base a classification not on bags of individual patches, but to take into account adjacent patches (arXiv:2112.06071v1 [cs.LG]).
  • the approach of the present disclosure is to perform classification not only on the basis of one or more selected patches and its/their nearest neighbors, but also to consider neighbors from more distant regions whose features are aggregated at different hierarchy levels, wherein the more distant the neighbors from the one or more selected patches, the higher the aggregation.
  • the present disclosure provides a computer-implemented multiple instance learning method for training a model for classifying images, the method comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected
  • the present disclosure provides a computer system comprising: a processor; and a memory storing an application program configured to perform, when executed by the processor, an operation, the operation comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of
  • the present disclosure provides a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the
  • the present disclosure relates to the use of a trained machine learning model for the detection, identification, and/or characterization of tumor types and/or gene mutations in tissues
  • training of the machine learning model comprises: o receiving a training image, the training image being assigned to one of at least two classes, wherein at least one class comprises images of tumor tissue and/or tissue in which a gene mutation is present, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated
  • the present disclosure provides a computer-implemented method for classifying an image, the method comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vector
  • Where only one item is intended, the term “one” or similar language is used.
  • the terms “has”, “have”, “having”, or the like are intended to be open-ended terms.
  • the phrase “based on” is intended to mean “based at least partially on” unless explicitly stated otherwise. Further, the phrase “based on” may mean “in response to” and be indicative of a condition for automatically triggering a specified operation of an electronic device (e.g., a controller, a processor, a computing device, etc.) as appropriately referred to herein.
  • the present disclosure provides means for training a machine learning model and using the trained model for prediction purposes.
  • Such a “machine learning model”, as used herein, may be understood as a computer implemented data processing architecture.
  • The machine learning model can receive input data and provide output data based on that input data and on parameters of the machine learning model.
  • the machine learning model can learn a relation between input data and output data through training. In training, parameters of the machine learning model may be adjusted in order to provide a desired output for a given input.
  • the process of training a machine learning model involves providing a machine learning algorithm (that is the learning algorithm) with training data to learn from.
  • the term “trained machine learning model” refers to the model artifact that is created by the training process.
  • the training data must contain the correct answer, which is referred to as the target.
  • the learning algorithm finds patterns in the training data that map input data to the target, and it outputs a trained machine learning model that captures these patterns.
  • training data are inputted into the machine learning model and the machine learning model generates an output.
  • the output is compared with the (known) target.
  • Parameters of the machine learning model are modified in order to reduce the deviations between the output and the (known) target to a (defined) minimum.
  • a loss function can be used for training, where the loss function can quantify the deviations between the output and the target.
  • the loss function may be chosen in such a way that it rewards a wanted relation between output and target and/or penalizes an unwanted relation between an output and a target.
  • Such a relation can be, e.g., a similarity, or a dissimilarity, or another relation.
  • a loss function can be used to calculate a loss value for a given pair of output and target.
  • the aim of the training process can be to modify (adjust) parameters of the machine learning model in order to reduce the loss value to a (defined) minimum.
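  • The following is a minimal, illustrative sketch of how such a loss value could be computed for a classification output; it assumes PyTorch and uses cross-entropy as an example loss function (the names and the choice of loss are assumptions, not prescribed by the present disclosure).

```python
import torch
import torch.nn.functional as F

def loss_value(output_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Quantify the deviation between the model output and the known target.

    Cross-entropy is a common choice for classification tasks; the smaller
    the value, the better the output matches the target class.
    """
    return F.cross_entropy(output_logits, target)

# Illustrative usage: a batch of 4 outputs over 2 classes and the target classes.
logits = torch.randn(4, 2, requires_grad=True)
targets = torch.tensor([0, 1, 1, 0])
loss = loss_value(logits, targets)
loss.backward()  # gradients are then used to adjust the model parameters
```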
  • the machine learning model of the present disclosure is trained to assign a patch or a number of patches to one of at least two classes.
  • a patch is one part of an image.
  • the patch together with further patches, forms an image.
  • the number of patches forming the image is usually more than 100, preferably more than 1000.
  • image means a data structure that represents a spatial distribution of a physical signal.
  • the spatial distribution may be of any dimension, for example 2D, 3D or 4D.
  • the spatial distribution may be of any shape, for example forming a grid and thereby defining pixels, the grid being possibly irregular or regular.
  • the physical signal may be any signal, for example color, level of gray, depth, surface or volume occupancy, such that the image may be a 2D or 3D RGB/grayscale/depth image, or a 3D surface/volume occupancy model.
  • the invention is described herein mainly on the basis of two-dimensional images comprising a rectangular array of pixels. However, this is not to be understood as limiting the invention to such images. Those skilled in machine learning based on image data will know how to apply the invention to image data comprising more dimensions and/or being in a different format.
  • the image is a medical image.
  • a medical image is a visual representation of the human body or a part thereof or a visual representation of the body of an animal or a part thereof. Medical images can be used, e.g., for diagnostic and/or treatment purposes.
  • Techniques for generating medical images include X-ray radiography, computerized tomography, fluoroscopy, magnetic resonance imaging, ultrasonography, endoscopy, elastography, tactile imaging, thermography, microscopy, positron emission tomography and others.
  • Examples of medical images include CT (computer tomography) scans, X-ray images, MRI (magnetic resonance imaging) scans, fluorescein angiography images, OCT (optical coherence tomography) scans, histopathological images, ultrasound images.
  • the image is a whole slide histopathological image of tissue of a human body.
  • the histopathological image is an image of a stained tissue sample.
  • One or more dyes can be used to create the stained image. Usual dyes are hematoxylin and/or eosin.
  • the image is labelled, i.e., the image is assigned to one of at least two classes; the label provides information about the class to which the image is assigned.
  • the labelling can be done, e.g., by (medical) experts. For example, in histopathology, it is usually known what the tissue is that is depicted in a histopathological image. It is possible, for example, to take a genetic analysis of a tissue sample and identify gene mutations. A slide and a slide image can then be generated from the tissue sample. The slide image can then be labeled accordingly with the information obtained from the gene analysis, for example.
  • the class can indicate whether the tissue shown in the image has a certain property or does not have the certain property.
  • the class can indicate whether the subject from which the tissue depicted in the image originates has a particular disease or does not have the particular disease.
  • each part (patch) can be assigned to the class to which the image is assigned.
  • the model is trained on the basis of a plurality of patches of a plurality of training images to assign a given patch or a number of given patches to the correct class.
  • plurality means an integer greater than 1, usually greater than 10, preferably greater than 100.
  • a number of patches can be generated from each training image.
  • each training image is divided into the same number of patches.
  • all patches have the same size and shape.
  • the patches have a square or rectangular shape.
  • the resolution of a patch is usually in the range of 32 pixels x 32 pixels to 10000 pixels x 10000 pixels, preferably in the range of 128 pixels x 128 pixels to 4096 pixels x 4096 pixels.
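  • As an illustration only, the following sketch shows one possible way to divide an image into non-overlapping square patches on a regular grid; the function name, patch size, and array layout are assumptions.

```python
import numpy as np

def generate_patches(image: np.ndarray, patch_size: int = 256) -> dict:
    """Split an image (H x W x C array) into non-overlapping square patches.

    Returns a dict mapping grid coordinates (x, y) to patch arrays, so that
    each patch keeps its location within the image.
    """
    h, w = image.shape[:2]
    patches = {}
    for y in range(h // patch_size):
        for x in range(w // patch_size):
            patches[(x, y)] = image[
                y * patch_size:(y + 1) * patch_size,
                x * patch_size:(x + 1) * patch_size,
            ]
    return patches

# Example: a dummy 1024 x 1024 RGB image yields a 4 x 4 grid of 256-pixel patches.
patches = generate_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
```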
  • a number of patches is selected.
  • the number of selected patches can be, e.g., 1 to 5000.
  • Preferably the number of selected patches is in the range from 10 to 500.
  • the selection can be random or by rules. Rules for selecting patches can be found in the prior art on multiple instance learning.
  • a feature vector is generated on the basis of the selected patch.
  • a feature vector is an m-dimensional vector of numerical features that represent an object, wherein m is an integer greater than 0.
  • Many algorithms in machine learning require a numerical representation of objects since such representations facilitate processing and statistical analysis.
  • the feature values might correspond to the pixels/voxels of the patch.
  • the term “feature vector” shall also include single values, matrices, tensors, and the like.
  • the generation of a feature vector is often accompanied by a dimension reduction in order to reduce an object (a patch) to those features that are important for the classification. Which features these are, the machine learning model learns during training.
  • a feature extraction unit can be used to generate the feature vectors.
  • the feature extraction unit can be an artificial neural network, for example. Examples of artificial neural networks which can be used for feature extraction are disclosed in: WO2020229152A1; J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG]. A feature vector is also referred to as “embedding” in the publications mentioned.
  • Fig. 9 shows schematically by way of example a feature extraction unit/network.
  • the feature extraction unit/network FEU comprises an input layer IL, a number n of hidden layers HL1 to HLn and an output layer.
  • the input neurons of the input layer IL serve to receive a patch P.
  • the output neurons serve to output a feature vector FV (patch embedding).
  • The neurons of the input layer IL and the hidden layer HL1 are connected by connection lines having a connection weight, and the neurons of the hidden layer HLn and the output layer are also connected by connection lines with a connection weight. Similarly, the neurons of the hidden layers are connected to the neurons of neighboring hidden layers in a predetermined manner (not shown in Fig. 9).
  • the connection weights can be learned through training.
  • one or more feature extraction units used for feature extraction are or comprise a convolutional neural network (CNN).
  • a CNN is a class of deep neural networks that comprises an input layer with input neurons, an output layer with at least one output neuron, as well as multiple hidden layers between the input layer and the output layer.
  • the hidden layers of a CNN typically comprise filters (convolutional layer) and aggregation layers (pooling layer) which are repeated alternately and, at the end, one layer or multiple layers of completely connected neurons (dense/fully connected layer(s)).
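  • For illustration, a feature extraction unit could be sketched as a small convolutional neural network that maps a patch to an m-dimensional feature vector (embedding); the architecture below is an assumption made for this sketch (PyTorch) and is not the specific network of the publications cited above.

```python
import torch
import torch.nn as nn

class FeatureExtractionUnit(nn.Module):
    """Illustrative CNN mapping a patch (3 x 256 x 256) to an m-dimensional feature vector."""

    def __init__(self, m: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),           # global pooling -> (batch, 32, 1, 1)
        )
        self.fc = nn.Linear(32, m)             # dense layer producing the patch embedding

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        x = self.conv(patch).flatten(1)        # (batch, 32)
        return self.fc(x)                      # (batch, m) feature vectors

# One patch (as a 1 x 3 x 256 x 256 tensor) yields one m-dimensional feature vector.
fv = FeatureExtractionUnit()(torch.randn(1, 3, 256, 256))
```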
  • a sequence of neighborhoods to the selected patch is selected.
  • the sequence of neighborhoods consists of a number n of neighborhoods N1, N2, ..., Nn, wherein n is an integer greater than 1.
  • the sequence of neighborhoods is a sequence in the mathematical sense.
  • the number n of neighborhoods is usually in the range from 2 to 10, more preferably in the range from 3 to 8.
  • neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch.
  • neighborhoods with increasing rank in the sequence of neighborhoods contain increasingly more patches.
  • neighborhood N1 is closest to the selected patch and neighborhood Nn is farthest from the selected patch; neighborhood N1 includes the least number of patches and neighborhood Nn includes the most patches.
  • The distance of a neighborhood to a selected patch can be, for example, an arithmetic distance averaged over all patches of the neighborhood.
  • the neighborhoods extend around the selected patch.
  • Fig. 1 shows schematically four examples ((a), (b), (c) and (d)) of sequences of neighborhoods in an image.
  • the patches form a square grid in which each patch can be assigned an x-coordinate and a y-coordinate that determine its location within the grid.
  • Patch P is a selected patch within the image I.
  • Each neighborhood comprises a number of patches, each patch being referred to as a neighboring patch.
  • the neighboring patches in neighborhood N1 are also referred to herein as first-order neighbors
  • the neighboring patches in neighborhood N2 are also referred to as second-order neighbors
  • the neighboring patches in neighborhood N3 are also referred to as third-order neighbors, and so on.
  • neighboring patches in a neighborhood Ni are called ith-order neighbors, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
  • each selected patch with coordinates x, y has exactly eight first-order neighbors with the following coordinates: (x-1, y+1); (x, y+1); (x+1, y+1); (x-1, y); (x+1, y); (x-1, y-1); (x, y-1); (x+1, y-1).
  • the brackets have been placed for better readability.
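  • A minimal helper for computing these eight first-order neighbor coordinates on a square patch grid could look as follows (illustrative only; the function name is an assumption):

```python
def first_order_neighbors(x: int, y: int) -> list:
    """Grid coordinates of the eight first-order neighbors of the patch at (x, y)."""
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

# e.g. first_order_neighbors(5, 5) -> [(4, 4), (4, 5), (4, 6), (5, 4), (5, 6), (6, 4), (6, 5), (6, 6)]
```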
  • the neighborhoods N1, N2, N3, N4 form frames around the selected patch with a width of one patch, each frame enclosing the selected patch at its center.
  • the neighborhoods have a greater distance to the selected patch with increasing rank in the order of the neighborhoods: N4 is further away from the selected patch than N3; N3 is further away from the selected patch than N2; N2 is further away from the selected patch than N1.
  • N1 consists of 8 patches; N2 consists of 16 patches; N3 consists of 24 patches; N4 consists of 32 patches.
  • the number of patches in a neighborhood Ni is 8i, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
  • neighborhood N2 comprises neighborhood N1
  • neighborhood N3 comprises neighborhood N2
  • neighborhood N4 comprises neighborhood N3.
  • In Fig. 1 (a), four neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n.
  • the neighborhoods form frames around the selected patch with an increasing width.
  • the width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 2 patches; the width of the frame forming neighborhood N3 is 3 patches; the width of the frame forming neighborhood N4 is 4 patches.
  • four neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n.
  • the width of a frame forming the neighborhood Ni is i patches, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
  • neighborhood N2 comprises neighborhood N1
  • neighborhood N3 comprises neighborhood N2
  • neighborhood N4 comprises neighborhood N3.
  • the neighborhoods form frames around the selected patch with an increasing width.
  • the width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 2 patches; the width of the frame forming neighborhood N3 is 4 patches.
  • the sequence of neighborhoods can be continued up to the number n.
  • the width of a frame forming a neighborhood doubles from one neighborhood to the next in the sequence of neighborhoods.
  • neighborhood N2 comprises neighborhood N1
  • neighborhood N3 comprises neighborhood N2.
  • Fig. 1(d) shows a preferred embodiment of a sequence of neighborhoods.
  • the neighborhoods form frames around the selected patch, each neighborhood enclosing the selected patch at its center.
  • the frames of the neighborhoods have an increasing width.
  • the width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 3 patches; the width of the frame forming neighborhood N3 is 9 patches.
  • the sequence of neighborhoods can be continued up to the number n.
  • the width of a frame forming a neighborhood triples from one neighborhood to the next in the sequence of neighborhoods.
  • the width of the frame forming neighborhood Ni equals 3^(i-1) patches, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
  • neighborhood N2 comprises neighborhood N1
  • neighborhood N3 comprises neighborhood N2.
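  • The following sketch enumerates the patches of a neighborhood Ni in the scheme of Fig. 1 (d) (frame widths 1, 3, 9, ..., i.e., 3^(i-1)); it assumes a square patch grid, ignores image borders, and is illustrative only.

```python
def neighborhood_fig1d(x: int, y: int, i: int) -> list:
    """Patches of neighborhood Ni around the selected patch (x, y), Fig. 1 (d) scheme.

    Ni is a square frame of width 3**(i-1) patches: all patches inside the
    (3**i x 3**i) square centred on (x, y) but outside the inner
    (3**(i-1) x 3**(i-1)) square (the selected patch for i = 1, the
    aggregate A(i-1) for i > 1).
    """
    outer = (3 ** i - 1) // 2        # half-width of the outer square
    inner = (3 ** (i - 1) - 1) // 2  # half-width of the inner square
    return [(x + dx, y + dy)
            for dx in range(-outer, outer + 1)
            for dy in range(-outer, outer + 1)
            if max(abs(dx), abs(dy)) > inner]

# len(neighborhood_fig1d(0, 0, 1)) == 8 and len(neighborhood_fig1d(0, 0, 2)) == 72
```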
  • the neighborhoods shown in Fig. 1 are point-symmetric with respect to their center in which the selected patch is located; they each have the same extent in the x- and y-directions; the neighborhoods do not overlap; the neighborhoods are directly adjacent to each other. These are preferred properties of neighborhoods. However, it is also possible that the neighborhoods are not symmetrical, that they have different extents in x- and y-directions, that they overlap, and/or that they are spaced apart.
  • It is possible that the neighborhoods are defined differently than shown in Fig. 1. For example, it is possible to define the neighborhoods based on other distance metrics. An example is shown in Fig. 2.
  • Fig. 2 shows an image I divided into a plurality of patches.
  • a patch P is selected which has the coordinates x, y.
  • the Manhattan distance is given for neighboring patches. More precisely, the Manhattan distance is given for those neighboring patches that are located at a maximum Manhattan distance of 8 patches from the selected patch P.
  • a neighborhood can be formed by patches that have a defined Manhattan distance from the selected patch P.
  • a first neighborhood with first-order neighbors can be formed by patches with a Manhattan distance of one;
  • a second neighborhood with second-order neighbors can be formed by patches with a Manhattan distance of two;
  • a third neighborhood with third-order neighbors can be formed by patches with a Manhattan distance of three, and so on.
  • neighbors with different Manhattan distances may be combined into one neighborhood. For example, neighbors with a Manhattan distance of 1 or 2 may form a first neighborhood, neighbors with a Manhattan distance of 3 or 4 may form a second neighborhood, neighbors with a Manhattan distance of 5 or 6 may form a third neighborhood, and so on.
  • the mean Manhattan distance can also increase with increasing order.
  • neighbors with a Manhattan distance of 1 may form a first neighborhood, neighbors with a Manhattan distance of 2 or 3 a second neighborhood, neighbors with a Manhattan distance of 4, 5 or 6 a third neighborhood, and so on.
  • the neighborhoods can also overlap.
  • neighbors with a maximum Manhattan distance of 2 can form a first neighborhood
  • neighbors with a maximum Manhattan distance of 4 can form a second neighborhood
  • neighbors with a maximum Manhattan distance of 8 can form a third neighborhood, and so on.
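  • As an illustration of the Manhattan-distance variants above, the following sketch groups the patches around a selected patch into neighborhoods defined by bands of Manhattan distances; the particular bands chosen are just one of the examples given and all names are assumptions.

```python
def manhattan_neighborhoods(x: int, y: int, bands=((1, 1), (2, 3), (4, 6))) -> list:
    """Group patches around (x, y) into neighborhoods by Manhattan distance.

    Each (lo, hi) band defines one neighborhood: all patches whose Manhattan
    distance to the selected patch lies in the interval [lo, hi].
    """
    max_d = max(hi for _, hi in bands)
    neighborhoods = [[] for _ in bands]
    for dx in range(-max_d, max_d + 1):
        for dy in range(-max_d, max_d + 1):
            d = abs(dx) + abs(dy)
            for k, (lo, hi) in enumerate(bands):
                if lo <= d <= hi:
                    neighborhoods[k].append((x + dx, y + dy))
    return neighborhoods

# The first band (distance 1) contains 4 patches, the second (distances 2-3) contains 20.
```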
  • the present disclosure is not limited to specific definitions of neighborhoods. The only requirement is that feature vectors of neighbors in a more distant neighborhood are aggregated to a higher degree than feature vectors of neighbors in a neighborhood closer to the selected patch. This aggregation is now described in more detail.
  • For each selected patch, a number of neighboring patches in the neighborhoods are selected. A preferred selection process is described below.
  • For each selected neighboring patch, a feature vector is generated. The same feature extraction unit that was used to generate the feature vector of the selected patch can be used to generate the feature vectors of the selected neighboring patches.
  • each feature vector representing a selected neighboring patch has the same dimension.
  • each feature vector representing a selected neighboring patch has the same dimension as the feature vector representing the selected patch.
  • the joint feature vector forms the basis for the classification.
  • the generation of the joint feature vector can be done in an iterative process, wherein feature vectors of selected neighboring patches are first aggregated before being fused with the feature vector of the selected patch.
  • aggregation can take place sequentially at different hierarchy levels, with each hierarchy level corresponding to a neighborhood.
  • hierarchy level means that a feature vector of a neighboring patch is aggregated with more feature vectors of other neighboring patches, the farther the neighborhood where the neighboring patch is located is from the selected patch.
  • the distance of a neighborhood to a selected patch can be, for example, an arithmetic distance averaged over all patches of the neighborhood.
  • the aggregation process is shown schematically in the form of an example in Fig. 3.
  • the neighborhoods are defined as shown in Fig. 1 (d).
  • Fig. 3 (a) shows a section of an image divided into a number of patches.
  • a patch P with the coordinates x, y is selected.
  • Around the selected patch, a first neighborhood N1 extends.
  • Neighborhood N1 consists of eight first-order neighbors.
  • the next step is to select a number of first-order neighboring patches in neighborhood N1. Up to eight neighboring patches can be selected in neighborhood N1. Multiple selection strategies are applicable, including a random selection of neighboring patches. The number of neighboring patches selected can also be random.
  • Fig. 3 (b) shows that two of the eight neighboring patches in neighborhood N1 have been selected: P' and P''.
  • For each selected neighboring patch, a feature vector is generated (not shown in Fig. 3 (b)).
  • the feature vectors of the selected neighboring patches are then aggregated into a neighborhood feature vector representing neighborhood N1.
  • Aggregation means that a single feature vector is generated from a number of feature vectors (at least two). In other words, a number of feature vectors are combined into one feature vector.
  • the feature vectors are not simply appended (concatenated) to each other but are subjected to mathematical operations that produce a feature vector whose dimension is less than the sum of the dimensions of the feature vectors from which it is computed.
  • the feature vectors can be added or multiplied element by element, or arithmetic averages can be formed element by element.
  • the resulting neighborhood feature vector has the same dimension as any of the feature vectors from which it was generated.
  • the neighborhood feature vector is a numerical representation of the neighborhood N1.
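  • A minimal sketch of such an aggregation by an element-wise arithmetic mean (one of the options mentioned above) could look as follows; the vector dimension and names are assumptions, and the resulting neighborhood feature vector has the same dimension as the inputs.

```python
import numpy as np

def aggregate_mean(feature_vectors: list) -> np.ndarray:
    """Element-wise arithmetic mean of a list of m-dimensional feature vectors."""
    return np.mean(np.stack(feature_vectors, axis=0), axis=0)

fv_p1, fv_p2 = np.random.rand(128), np.random.rand(128)  # two selected neighboring patches
neighborhood_fv = aggregate_mean([fv_p1, fv_p2])          # shape (128,), represents N1
```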
  • an attention mechanism is used to aggregate the feature vectors of the selected neighboring patches.
  • attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts - the thought being that the machine learning model should devote more focus to that small but important part of the data. Which part of the data is more important than others is learned during the training phase.
  • an attention mechanism can be used as described in: A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG], and/or in M. Ilse et al.: Attention-based Deep Multiple Instance Learning, arXiv:1802.04712v4 [cs.LG].
  • In an attention mechanism, weights are applied to the feature vectors when they are combined into a single feature vector; the machine learning model learns the weights during training and thus learns what influence the different neighbors (represented by their respective feature vectors) have on the classification result.
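  • A sketch of such an attention-based aggregation, in the spirit of the attention pooling of Ilse et al., is given below; the module computes an attention distribution over the neighbors' feature vectors and returns the weighted sum as attention output (the hidden size and all names are assumptions made for this sketch).

```python
import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Weighted aggregation of neighbor feature vectors with learned attention.

    Given k feature vectors of dimension m, the module computes k attention
    weights (the attention distribution) and returns their weighted sum
    (the attention output), an m-dimensional neighborhood feature vector.
    """

    def __init__(self, m: int = 128, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(m, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, fvs: torch.Tensor) -> torch.Tensor:  # fvs: (k, m)
        ad = torch.softmax(self.scorer(fvs), dim=0)         # attention distribution (k, 1)
        ao = (ad * fvs).sum(dim=0)                           # attention output (m,)
        return ao

# Two selected neighbors (feature vectors) are aggregated into one neighborhood feature vector.
neighborhood_fv = AttentionAggregation()(torch.randn(2, 128))
```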
  • the neighborhood feature vector representing neighborhood N1 can be fused with the feature vector of the selected patch P, e.g., using concatenation.
  • the resulting feature vector is a joint representation of the selected patch and its nearest environment (its nearest neighborhood N1).
  • the selected patch together with neighborhood N1 is referred to herein as a first-order aggregate.
  • The feature vector representing such first-order aggregate is referred to herein as first-order-aggregate feature vector.
  • Fig. 3 (c) shows a first-order aggregate A1.
  • the first-order aggregate A1 contains the selected patch P and its neighborhood N1 (see also Figs. 3 (a) and 3 (b)).
  • the first-order aggregate A1 can be considered as a patch which is larger than the selected patch P (A1 comprises the selected patch P and its eight first-order neighbors, i.e., nine patches in total).
  • the aggregate A1, like the selected patch P, has a first neighborhood. This first neighborhood corresponds to the neighborhood N2 of the selected patch P.
  • the process of selecting neighboring patches, generating feature vectors for neighboring patches, aggregating feature vectors of neighboring patches, and fusing the aggregated feature vectors with the feature vector of Al can be repeated as described with respect to Fig. 3 (a) and (b), except that the operations are performed at the next level of the hierarchy. It is like zooming out of the image and repeating the operations at a different level. Instead of patches, operations are performed based on aggregates of patches. This process of zooming out and repeating the step at the next hierarchy level can occur multiple times. Preferably, this process is repeated 2 to 8 times. It is now described how the process is carried out on the next hierarchy level.
  • the neighborhood N2 extends around the aggregate A1 (see Fig. 3 (c)).
  • the neighborhood N2 has eight times the size of aggregate A1.
  • the neighborhood N2 can be divided into eight neighboring first-order aggregates, all of which have the size of A1. This is depicted in Fig. 3 (d).
  • a number of neighboring first-order aggregates is selected. Multiple selection strategies are applicable, including a random selection of aggregates. The number of aggregates selected can also be random.
  • In Fig. 3 (e), it is shown that two neighboring first-order aggregates A1' and A1'' have been selected. Now, for each selected first-order aggregate, a feature vector representing such aggregate needs to be generated. This can be done as described with respect to Fig. 3 (a), (b) and (c), and it is shown in Fig. 3 (f) and (g).
  • two second-order neighboring patches Q and R have been selected.
  • the selected second-order neighboring patches are located at specific locations within neighborhood N2: they are each located in a center of a first-order aggregate.
  • Preferably, second-order neighboring patches in neighborhood N2 are selected which are located in a center of a first-order aggregate.
  • the coordinates of these specific neighboring patches are: (x-3, y+3); (x, y+3); (x+3, y+3); (x-3, y); (x+3, y); (x-3, y-3); (x, y-3); (x+3, y-3).
  • the brackets have been placed for better readability.
  • a feature vector is generated for each selected second-order neighboring patch Q and R.
  • the same feature extraction unit that was used to generate the feature vector of the selected patch and the feature vectors of the first-order neighbors can be used to generate the feature vectors of the selected second-order neighbors.
  • these feature vectors of the selected second-order neighbors are aggregated with feature vectors of further neighboring patches (Q', Q'' and R', R'', respectively) to generate feature vectors of the first-order aggregates A1' and A1''.
  • Each of the feature vectors representing the neighborhoods N1' and N1'' can then be fused with the respective feature vector of the selected second-order neighbor the neighborhood includes. This results in two first-order-aggregate feature vectors.
  • the respective first-order aggregates A1' and A1'' are depicted in Fig. 3 (i).
  • the feature vectors of the first-order aggregates A1, A1' and A1'' can then be fused into a second-order-aggregate feature vector representing the second-order aggregate A2 (see Fig. 3 (k)) which consists of the first-order aggregate A1 and the neighborhood N2 (see Fig. 3 (j)).
  • To this end, a feature vector representing the neighborhood N2 is generated from the feature vectors representing the aggregates A1' and A1''.
  • An attention mechanism (as described with respect to Fig. 3 (b)) can be used for aggregating the feature vectors representing A1' and A1'' into a feature vector representing neighborhood N2.
  • the feature vector representing neighborhood N2 can then be concatenated with the feature vector representing aggregate A1 to generate a feature vector representing aggregate A2.
  • Figs. 3 (l) to 3 (n) show a larger section of the image shown in Figs. 3 (a) to 3 (k).
  • Neighborhood N3 extends around the aggregate A2.
  • the neighborhood N3 has eight times the size of aggregate A2.
  • the neighborhood N3 can be divided into eight second-order aggregates, all of which have the size of A2 (see Fig. 3 (m)).
  • Fig. 3 (m) represents the next (third) hierarchy level on which the process can be continued.
  • Fig. 3 (d) represents the second hierarchy level and Fig. 3 (a) the first hierarchy level.
  • a number of second-order aggregates neighbored to the second-order aggregate A2 can be selected. Multiple selection strategies are applicable, including a random selection of aggregates. The number of aggregates selected can also be random. For example, the process can be continued by selecting the second-order aggregates A2' and A2'' (see Fig. 3 (m)). In order to generate feature vectors representing the second-order aggregates A2' and A2'', the process as described for the second-order aggregate A2 can be applied to the second-order aggregates A2' and A2''. The process can be started from the center patches of aggregates A2' and A2''.
  • center patches of an aggregate Ai are characterized by the following coordinates (in relation to the coordinates x, y of the selected patch): (x - 3^(i-1), y + 3^(i-1)); (x, y + 3^(i-1)); (x + 3^(i-1), y + 3^(i-1)); (x - 3^(i-1), y); (x + 3^(i-1), y); (x - 3^(i-1), y - 3^(i-1)); (x, y - 3^(i-1)); (x + 3^(i-1), y - 3^(i-1)), wherein i is an index that can take a number from 1 to n.
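  • Assuming the reconstruction of the coordinate formula above (offset 3^(i-1) from the selected patch), the center patches could be enumerated as follows (illustrative helper, square grid assumed, names hypothetical):

```python
def center_patches(x: int, y: int, i: int) -> list:
    """Center patches of the eight neighboring aggregates at hierarchy level i.

    The offset from the selected patch (x, y) is 3**(i-1): for i = 1 these are
    the eight first-order neighbors, for i = 2 the patches at offsets of +/-3
    listed above for neighborhood N2, and so on.
    """
    d = 3 ** (i - 1)
    return [(x + dx, y + dy)
            for dx in (-d, 0, d)
            for dy in (-d, 0, d)
            if (dx, dy) != (0, 0)]

# center_patches(0, 0, 2) -> the eight coordinates with offsets of +/-3.
```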
  • Fig. 3 (n) shows how the process can be continued.
  • Feature vectors can be generated for the selected center patches and these can be aggregated at various hierarchy levels until a number of second-order aggregates are obtained, which can then be aggregated into a third-order aggregate, and so on and so forth. This continues until an aggregate An is formed that is equal in size to the neighborhood Nn plus the aggregate A(n-1). Whenever feature vectors representing aggregates are aggregated to generate a feature vector representing a higher order neighborhood, an attention mechanism can be used.
  • Fig. 3 (n) shows all patches, neighboring patches, first-order aggregates, and second-order aggregates which have been selected in the example depicted in Figs. 3 (a) to 3 (n), and from which feature vectors have been generated.
  • Fig. 4 shows how a feature vector representing the third-order aggregate A3 (see Fig. 3 (n)) can be generated.
  • the third-order-aggregate feature vector representing the third-order aggregate A3 can be a concatenation of feature vectors representing the selected patch P, the first neighborhood N1, the second neighborhood N2, and the third neighborhood N3.
  • the feature vector representing the first neighborhood N1 can be generated from the feature vectors representing patches P' and P'' by aggregation using an attention mechanism.
  • the feature vector representing the second neighborhood N2 can be generated from the feature vectors representing the first-order aggregates A1' and A1''.
  • a feature vector representing first-order aggregate A1' can be a concatenation of a feature vector representing neighborhood N1' and a feature vector representing patch Q.
  • a feature vector representing first-order aggregate A1'' can be a concatenation of a feature vector representing neighborhood N1'' and a feature vector representing patch R.
  • a feature vector representing neighborhood N1' can be generated from feature vectors of patches Q' and Q'' by aggregation using an attention mechanism.
  • a feature vector representing neighborhood N1'' can be generated from feature vectors of patches R' and R'' by aggregation using an attention mechanism.
  • the feature vector representing neighborhood N3 can also be generated, aggregating even more feature vectors from even more patches than in the case of N2 and N1.
  • This is not shown in Fig. 4.
  • the feature vectors of two patches (P', P'') are included in the generation of a feature vector representing neighborhood N1,
  • the feature vectors of six patches (Q, Q', Q'', R, R', R'') are included in the generation of a feature vector representing neighborhood N2, and
  • the feature vectors of 18 patches are included in the generation of a feature vector representing neighborhood N3.
  • the farther a neighborhood is from the selected patch (P) the more patches are considered to generate a feature vector representing that neighborhood, and the greater the aggregation of feature vectors.
  • the feature vector representing the aggregate An (A3 in the case of the example shown in Figs. 3 and 4) can be the joint feature vector of the selected patch and the selected neighboring patches. It includes information about the selected patch as well as information about neighboring patches in different neighborhoods at different distances from the selected patch.
  • For each selected patch, there is a feature vector representing an aggregate that includes the selected patch and its environment.
  • These feature vectors can be combined into a joint feature vector, for example by concatenation.
  • the joint feature vector can be used for classification. Therefore, in a further step, the joint feature vector is fed to a classifier.
  • the classifier is configured to assign the joint feature vector to one of the at least two classes.
  • the classifier is part of the machine learning model.
  • the classifier can be, e.g., an artificial neural network. Examples of preferred classifiers can be found, e.g., in J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; M. Y. Lu et al.:
  • AI-based pathology predicts origins for cancers of unknown primary, Nature, 594(7861): 106-110, 2021; M. Ilse et al.: Attention-based deep multiple instance learning, International conference on machine learning, pages 2127-2136, PMLR, 2018; M. Ilse et al.: Deep multiple instance learning for digital histopathology, Handbook of Medical Image Computing and Computer Assisted Intervention, pages 521-546, Elsevier, 2020.
  • the classification result provided by the classifier can be analyzed.
  • the class to which the joint feature vector is assigned can be compared with the class to which the training image is assigned. If the classes match, the classification is correct. If the classes do not match, the classification performed by the machine learning model is wrong, and parameters of the machine learning model need to be modified.
  • a loss function can be used for quantifying the deviation between the output and the target. Examples of loss functions can be found, e.g., in J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG]; M. Ilse et al.: Attention-based Deep Multiple Instance Learning, arXiv:1802.04712v4 [cs.LG]. The model is trained on a large number of images until the model reaches a defined accuracy in prediction.
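  • A condensed, illustrative sketch of such a training loop is given below; it assumes PyTorch, that the machine learning model maps an image to class logits (patch selection, feature extraction and neighborhood aggregation happening inside the model), and that each training image carries an image-level class label. All names and hyperparameters are assumptions.

```python
import torch

def train(model, training_images, labels, epochs: int = 10, lr: float = 1e-4) -> None:
    """Train the machine learning model on labelled training images."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image, label in zip(training_images, labels):
            logits = model(image)                                    # class output for the image
            loss = loss_fn(logits.unsqueeze(0), label.unsqueeze(0))  # deviation from the target
            optimizer.zero_grad()
            loss.backward()                                          # backpropagation
            optimizer.step()                                         # adjust model parameters
```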
  • Fig. 5 shows schematically the architecture of the machine learning model of the present disclosure.
  • the machine learning model receives a (training) image I and outputs a class C.
  • the machine learning model comprises a (neighboring) patch selection unit (PSU) which is configured to select patches and neighboring patches as described herein.
  • The machine learning model further comprises a feature extraction unit (FEU) which is configured to generate feature vectors from the selected patches and the selected neighboring patches.
  • Feature vectors (FVs) of neighboring patches are aggregated using an attention mechanism (AM), ad means attention distribution and ao means attention output.
  • Feature vectors (FVs) are aggregated on different hierarchy levels.
  • a joint feature vector (JFV) is generated comprising the feature vector of the at least one selected patch and feature vectors of aggregated patches at different hierarchy levels.
  • the joint feature vector (JFV) is fed into the classifier (CF) and the classifier (CF) is configured to output the classification result (class output C).
  • The term “unit” as used in this disclosure is not intended to imply that there is necessarily a separate unit performing the functions described. Rather, the term is intended to be understood to mean that computation means are present which perform the appropriate functions. These computation means are typically one or more processors configured to perform corresponding operations. Details are described below with reference to Fig. 6.
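  • The following sketch is one possible, simplified wiring of the units of Fig. 5 for a single selected patch; it operates on feature vectors already produced by the feature extraction unit (FEU), uses one attention module (AM) per hierarchy level, concatenates the results into the joint feature vector (JFV) and feeds it to the classifier (CF). All layer sizes and names are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class NeighborhoodMILClassifier(nn.Module):
    """Illustrative wiring of the units of Fig. 5, operating on feature vectors.

    For each hierarchy level, an attention module aggregates the feature
    vectors (FVs) of the selected neighbors; the aggregated vectors are
    concatenated with the FV of the selected patch into the joint feature
    vector (JFV), which the classifier (CF) maps to a class output C.
    """

    def __init__(self, m: int = 128, levels: int = 3, n_classes: int = 2):
        super().__init__()
        self.attention = nn.ModuleList(
            nn.Sequential(nn.Linear(m, 64), nn.Tanh(), nn.Linear(64, 1)) for _ in range(levels)
        )
        self.cf = nn.Linear(m * (levels + 1), n_classes)

    def forward(self, patch_fv: torch.Tensor, neighbor_fvs_per_level: list) -> torch.Tensor:
        parts = [patch_fv]
        for scorer, fvs in zip(self.attention, neighbor_fvs_per_level):
            ad = torch.softmax(scorer(fvs), dim=0)  # attention distribution (ad)
            parts.append((ad * fvs).sum(dim=0))     # attention output (ao) for this level
        jfv = torch.cat(parts)                      # joint feature vector (JFV)
        return self.cf(jfv)                         # class output C

# Example: one selected-patch FV and two/two/eight neighbor FVs on three hierarchy levels.
model = NeighborhoodMILClassifier()
logits = model(torch.randn(128), [torch.randn(2, 128), torch.randn(2, 128), torch.randn(8, 128)])
```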
  • the trained machine learning model can be stored in a data storage, transmitted to another computer system, or used to classify one or more new images.
  • new means that the corresponding image was not used during training.
  • the machine learning model can be trained to perform various tasks. Accordingly, a trained machine learning model can be used for various purposes.
  • the machine learning model of the present disclosure is trained and the trained machine learning model is used to detect, identify, and/or characterize tumor types and/or gene mutations in tissues.
  • the machine learning model can be trained and the trained machine learning model can be used to recognize a specific gene mutation and/or a specific tumor type, or to recognize multiple gene mutations and/or multiple tumor types.
  • the machine learning model can be trained and the trained machine learning model can be used to characterize the type or types of cancer a patient or subject has.
  • the machine learning model can be trained and the trained machine learning model can be used to select one or more effective therapies for the patient.
  • the machine learning model can be trained and the trained machine learning model can be used to determine how a patient is responding over time to a treatment and, if necessary, to select a new therapy or therapies for the patient as necessary.
  • the machine learning model can be trained and the trained machine learning model can be used to determine whether a patient should be included or excluded from participating in a clinical trial.
  • the machine learning model can be trained and the trained machine learning model can be used to classify images of tumor tissue in one or more of the following classes: inflamed, non-inflamed, vascularized, non-vascularized, fibroblast-enriched, non-fibroblast-enriched (such classes are defined, e.g., in EP3639169A1).
  • the machine learning model can be trained and the trained machine learning model can be used to identify differentially expressed genes in a sample from a subject (e.g., a patient) having a cancer (e.g., a tumor).
  • the machine learning model can be trained and the trained machine learning model can be used to identify genes that are mutated in a sample from a subject having a cancer (e.g., a tumor).
  • the machine learning model can be trained and the trained machine learning model can be used to identify a cancer (e.g., a tumor) as a specific subtype of cancer selected.
  • Such uses may be useful for clinical purposes including, for example, selecting a treatment, monitoring cancer progression, assessing the efficacy of a treatment against a cancer, evaluating suitability of a patient for participating in a clinical trial, or determining a course of treatment for a subject (e.g., a patient).
  • the trained machine learning model may also be used for non-clinical purposes including (as a nonlimiting example) research purposes such as, e.g., studying the mechanism of cancer development and/or biological pathways and/or biological processes involved in cancer, and developing new therapies for cancer based on such studies.
  • the machine learning model of the present disclosure is trained based on images and it generates predictions based on images.
  • the images usually show the tissue of one or more subjects.
  • the images can be created from tissue samples of a subject.
  • the subject is usually a human, but may also be any mammal, including mice, rabbits, dogs, and monkeys.
  • the tissue sample may be any sample from a subject known or suspected of having cancerous cells or pre-cancerous cells.
  • the tissue sample may be from any source in the subject's body including, but not limited to, skin (including portions of the epidermis, dermis, and/or hypodermis), bone, bone marrow, brain, thymus, spleen, small intestine, appendix, colon, rectum, liver, gall bladder, pancreas, kidney, lung, ureter, bladder, urethra, uterus, ovary, cervix, scrotum, penis, prostate.
  • the tissue sample may be a piece of tissue, or some or all of an organ.
  • the tissue sample may be a cancerous tissue or organ or a tissue or organ suspected of having one or more cancerous cells.
  • the tissue sample may be from a healthy (e.g. non-cancerous) tissue or organ.
  • the tissue sample may include both healthy and cancerous cells and/or tissue.
  • one sample has been taken from a subject for analysis. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) samples may have been taken from a subject for analysis.
  • one sample from a subject will be analyzed.
  • more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) samples may be analyzed. If more than one sample from a subject is analyzed, the samples may have been procured at the same time (e.g., more than one sample may be taken in the same procedure), or the samples may have been taken at different times (e.g., during a different procedure including a procedure 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 days; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 weeks; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 months, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 years, or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 decades after a first procedure).
  • a second or subsequent sample may be taken or obtained from the same region (e.g., from the same tumor or area of tissue) or a different region (including, e.g. a different tumor).
  • a second or subsequent sample may be taken or obtained from the subject after one or more treatments and may be taken from the same region or a different region.
  • the second or subsequent sample may be useful in determining whether the cancer in each sample has different characteristics (e.g., in the case of samples taken from two physically separate tumors in a patient) or whether the cancer has responded to one or more treatments (e.g., in the case of two or more samples from the same tumor prior to and subsequent to a treatment).
  • the sample described herein may have been obtained from the subject using any known technique.
  • the sample may have been obtained from a surgical procedure (e.g., laparoscopic surgery, microscopically controlled surgery, or endoscopy), bone marrow biopsy, punch biopsy, endoscopic biopsy, or needle biopsy (e.g., a fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, or image-guided biopsy).
  • Detection, identification, and/or characterization of tumor types may be applied to any cancer and any tumor.
  • Exemplary cancers include, but are not limited to, adrenocortical carcinoma, bladder urothelial carcinoma, breast invasive carcinoma, cervical squamous cell carcinoma, endocervical adenocarcinoma, colon adenocarcinoma, esophageal carcinoma, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, prostate adenocarcinoma, rectal adenocarcinoma, skin cutaneous melanoma, stomach adenocarcinoma, thyroid carcinoma, uterine corpus endometrial carcinoma, and cholangiocarcinoma.
  • the machine learning model can be trained and the trained machine learning model can be used to detect, identify and/or characterize gene mutations in tissue samples.
  • genes related to proliferation of cancer or response rates of molecular target drugs include HER2, TOP2A, HER3, EGFR, P53, and MET.
  • tyrosine kinase related genes include ALK, FLT3, AXL, FLT4(VEGFR3), DDR1, FMS(CSF1R), DDR2, EGFR(ERBB1), HER4(ERBB4), EML4-ALK, IGF1R, EPHA1, INSR, EPHA2, IRR(INSRR), EPHA3, KIT, EPHA4, LTK, EPHA5, MER(MERTK), EPHA6, MET, EPHA7, MUSK, EPHA8, NPM1-ALK, EPHB1, PDGFRα(PDGFRA), EPHB2, PDGFRβ(PDGFRB), EPHB3, RET, EPHB4, RON(MST1R), FGFR1, ROS(ROS1), FGFR2, TIE2(TEK), FGFR3, TRKA(NTRK1), FGFR4, TRKB(
  • breast cancer related genes include ATM, BRCA1, BRCA2, BRCA3, CCND1, E-Cadherin, ERBB2, ETV6, FGFR1, HRAS, KRAS, NRAS, NTRK3, p53, and PTEN.
  • genes related to carcinoid tumors include BCL2, BRD4, CCND1, CDKN1A, CDKN2A, CTNNB1, HES1, MAP2, MEN1, NF1, NOTCH1, NUT, RAF, SDHD, and VEGFA.
  • colorectal cancer related genes include APC, MSH6, AXIN2, MYH, BMPR1A, p53, DCC, PMS2, KRAS2 (or Ki-ras), PTEN, MLH1, SMAD4, MSH2, STK11, and MSH6.
  • lung cancer related genes include ALK, PTEN, CCND1, RASSF1A, CDKN2A, RB1, EGFR, RET, EML4, ROS1, KRAS2, TP53, and MYC.
  • liver cancer related genes include Axin1, MALAT1, β-catenin, p16INK4A, c-ERBB-2, p53, CTNNB1, RB1, Cyclin D1, SMAD2, EGFR, SMAD4, IGFR2, TCF1, and KRAS.
  • kidney cancer related genes include Alpha, PRCC, ASPSCR1, PSF, CLTC, TFE3, p54nrb/NONO, and TFEB.
  • thyroid cancer related genes include AKAP10, NTRK1, AKAP9, RET, BRAF, TFG, ELE1, TPM3, H4/D10S170, and TPR.
  • Examples of ovarian cancer related genes include AKT2, MDM2, BCL2, MYC, BRCA1, NCOA4, CDKN2A, p53, ERBB2, PIK3CA, GATA4, RB, HRAS, RET, KRAS, and RNASET2.
  • Examples of prostate cancer related genes include AR, KLK3, BRCA2, MYC, CDKN1B, NKX3.1, EZH2, p53, GSTP1, and PTEN.
  • Examples of bone tumor related genes include CDH11, COL12A1, CNBP, OMD, COL1A1, THRAP3, COL4A5, and USP6.
  • the machine learning model is trained and used for classification of tissue types on the basis of whole slide images.
  • the machine learning model is trained and used for identification of gene mutations, such as BRAF mutations and/or NTRK fusions, as described in WO2020229152A1 and/or J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12, the contents of which are incorporated by reference in their entirety into this specification.
  • the machine learning model can be trained to detect signs of the presence of oncogenic drivers in patient tissue images stained with hematoxylin and eosin.
  • Penault-Llorca et al. describe a testing algorithm for identification of patients with TRK fusion cancer (see J. Clin. Pathol., 2019, 72, 460-467).
  • the algorithm comprises immunohistochemistry (IHC) studies, fluorescence in situ hybridization (FISH) and next-generation sequencing.
  • Immunohistochemistry provides a routine method to detect protein expression of NTRK genes. However, performing immunohistochemistry requires additional tissue section(s), time to process and interpret (following the initial hematoxylin and eosin staining on which the tumor diagnosis is based), and specific skills; moreover, the correlation between protein expression and gene fusion status is not trivial. Interpretation of IHC results requires the skills of a trained and certified medical professional pathologist.
  • Next-generation sequencing provides a precise method to detect NTRK gene fusions.
  • performing gene analyses for each patient is expensive, tissue consuming (not always feasible when available tissue specimen is minimal, as in diagnostic biopsies), not universally available in various geographic locations or diagnostic laboratories/healthcare institutions and, due to the low incidence of NTRK oncogenic fusions, inefficient.
  • Additional studies may also be considered, such as other forms of medical imaging (CT scans, MRI, etc.) that can be co-assessed using Al to generate multimodal biomarkers/characteristics for diagnostic purposes.
  • the machine learning model of the present disclosure can, e.g., be used to a) detect NTRK fusion events in one or more indications, b) detect NTRK fusion events in other indications than those it was trained on (i.e., an algorithm trained on thyroid data sets is useful in lung cancer data sets), c) detect NTRK fusion events involving other TRK family members (i.e., an algorithm trained on NTRK1, NTRK3 fusions is useful to predict also NTRK2 fusions), d) detect NTRK fusion events involving other fusion partners (i.e., an algorithm trained on LMNA-fusion data sets is useful also in TPM3-fusion data sets), e) discover novel fusion partners (i.e., an algorithm trained on known fusion events might predict a fusion in a new data set which is then confirmed via molecular assay to involve a not yet described fusion partner of an NTRK family member), f) catalyze the diagnostic …
  • Identification of specific genetic aberrations based on a histological specimen can additionally be used to confirm, exclude, or re-label certain tumor diagnoses in cases where the presence or absence of this alteration/these alterations is pathognomonic of specific tumors.
  • Histopathological images used for training and prediction of the machine learning model can be obtained from patients by biopsy or surgical resection specimens.
  • a histopathological image is a microscopic image of tumor tissue of a human patient.
  • the histopathological image is a whole-slide image.
  • the histopathological image is an image of a stained tumor tissue sample.
  • One or more dyes can be used to create the stained images.
  • Preferred dyes are hematoxylin and/or eosin.
  • the machine learning model can also be configured to generate a probability value, the probability value indicating the probability of a patient suffering from cancer, e.g., caused by an NTRK oncogenic fusion.
  • the probability value can be outputted to a user and/or stored in a database.
  • the probability value can be a real number in the range from 0 to 1, where a probability value of 0 usually means that it is impossible that the cancer is caused by an NTRK oncogenic fusion, and a probability value of 1 usually means that there is no doubt that the cancer is caused by an NTRK oncogenic fusion.
  • the probability value can also be expressed by a percentage.
  • the probability value is compared with a predefined threshold value. In the event the probability value is lower than the threshold value, the probability that the patient suffers from cancer caused by an NTRK oncogenic fusion is low; treating the patient with a Trk inhibitor is not indicated; further investigations are required in order to determine the cause of cancer. In the event the probability value equals the threshold value or is greater than the threshold value, it is reasonable to assume that the cancer is caused by an NTRK oncogenic fusion; the treatment of the patient with a Trk inhibitor can be indicated; further investigations to verify the assumption can be initiated (e.g., performing a genetic analysis of the tumor tissue).
  • the threshold value can be a value between 0.5 and 0.99999999999, e.g. 0.8 (80%) or 0.81 (81%) or 0.82 (82%) or 0.83 (83%) or 0.84 (84%) or 0.85 (85%) or 0.86 (86%) or 0.87 (87%) or 0.88 (88%) or 0.89 (89%) or 0.9 (90%) or 0.91 (91%) or 0.92 (92%) or 0.93 (93%) or 0.94 (94%) or 0.95 (95%) or 0.96 (96%) or 0.97 (97%) or 0.98 (98%) or 0.99 (99%) or any other value (percentage).
  • the threshold value can be determined by a medical expert.
  • Additional patient data can also be included in the classification.
  • Additional patient data can be, e.g., anatomic or physiology data of the patient, such as information about the patient's height and weight, gender, age, vital parameters (such as blood pressure, breathing frequency and heart rate), tumor grades, ICD-9 classification, oxygenation of the tumor, degree of metastasis of the tumor, blood count values, tumor indicator values such as the PA value, information about the tissue the histopathological image is created from (e.g. tissue type, organ), further symptoms, medical history, etc.
  • the pathology report of the histopathological images can be used for classification, using text mining approaches.
  • a next generation sequencing raw data set which does not cover the TRK genes' sequences can be used for classification.
  • non-transitory is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
  • the term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, embedded cores, computing systems, communication devices, processors (e.g., digital signal processors (DSP), microcontrollers, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), etc.) and other electronic computing devices.
  • the term “processor” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g., electronic, phenomena which may occur or reside, e.g., within registers and/or memories of at least one computer or processor.
  • the term processor includes a single processing unit or a plurality of distributed or remote such units.
  • Fig. 6 illustrates a computer system (1) according to some example implementations of the present disclosure in more detail.
  • the computer may include one or more of each of a number of components such as, for example, a processing unit (20) connected to a memory (50) (e.g., storage device).
  • the processing unit (20) may be composed of one or more processors alone or in combination with one or more memories.
  • the processing unit is generally any piece of computer hardware that is capable of processing information such as, for example, data, computer programs and/or other suitable electronic information.
  • the processing unit is composed of a collection of electronic circuits some of which may be packaged as an integrated circuit or multiple interconnected integrated circuits (an integrated circuit at times more commonly referred to as a “chip”).
  • the processing unit may be configured to execute computer programs, which may be stored onboard the processing unit or otherwise stored in the memory (50) of the same or another computer.
  • the processing unit (20) may be a number of processors, a multi-core processor or some other type of processor, depending on the particular implementation. Further, the processing unit may be implemented using a number of heterogeneous processor systems in which a main processor is present with one or more secondary processors on a single chip. As another illustrative example, the processing unit may be a symmetric multi-processor system containing multiple processors of the same type. In yet another example, the processing unit may be embodied as or otherwise include one or more ASICs, FPGAs or the like. Thus, although the processing unit may be capable of executing a computer program to perform one or more functions, the processing unit of various examples may be capable of performing one or more functions without the aid of a computer program. In either instance, the processing unit may be appropriately programmed to perform functions or operations according to example implementations of the present disclosure.
  • the memory (50) is generally any piece of computer hardware that is capable of storing information such as, for example, data, computer programs (e.g., computer-readable program code (60)) and/or other suitable information either on a temporary basis and/or a permanent basis.
  • the memory may include volatile and/or non-volatile memory, and may be fixed or removable. Examples of suitable memory include random access memory (RAM), read-only memory (ROM), a hard drive, a flash memory, a thumb drive, a removable computer diskette, an optical disk, a magnetic tape or some combination of the above.
  • Optical disks may include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W), DVD, Blu-ray disk or the like.
  • the memory may be referred to as a computer-readable storage medium.
  • the computer-readable storage medium is a non-transitory device capable of storing information, and is distinguishable from computer-readable transmission media such as electronic transitory signals capable of carrying information from one location to another.
  • Computer-readable medium as described herein may generally refer to a computer-readable storage medium or computer-readable transmission medium.
  • the processing unit (20) may also be connected to one or more interfaces for displaying, transmitting and/or receiving information.
  • the interfaces may include one or more communications interfaces and/or one or more user interfaces.
  • the communications interface(s) may be configured to transmit and/or receive information, such as to and/or from other computer(s), network(s), database(s) or the like.
  • the communications interface may be configured to transmit and/or receive information by physical (wired) and/or wireless communications links.
  • the communications interface(s) may include interface(s) (41) to connect to a network, such as using technologies such as cellular telephone, Wi-Fi, satellite, cable, digital subscriber line (DSL), fiber optics and the like.
  • the communications interface(s) may include one or more short-range communications interfaces (42) configured to connect devices using short-range communications technologies such as NFC, RFID, Bluetooth, Bluetooth LE, ZigBee, infrared (e.g., IrDA) or the like.
  • the user interfaces may include a display (30).
  • the display may be configured to present or otherwise display information to a user, suitable examples of which include a liquid crystal display (LCD), light-emitting diode display (LED), plasma display panel (PDP) or the like.
  • the user input interface(s) (11) may be wired or wireless, and may be configured to receive information from a user into the computer system (1), such as for processing, storage and/or display. Suitable examples of user input interfaces include a microphone, image or video capture device, keyboard or keypad, joystick, touch-sensitive surface (separate from or integrated into a touchscreen) or the like.
  • the user interfaces may include automatic identification and data capture (AIDC) technology (12) for machine-readable information. This may include barcode, radio frequency identification (RFID), magnetic stripes, optical character recognition (OCR), integrated circuit card (ICC), and the like.
  • the user interfaces may further include one or more interfaces for communicating with peripherals such as printers and the like.
  • program code instructions may be stored in memory, and executed by processing unit that is thereby programmed, to implement functions of the systems, subsystems, tools and their respective elements described herein.
  • any suitable program code instructions may be loaded onto a computer or other programmable apparatus from a computer-readable storage medium to produce a particular machine, such that the particular machine becomes a means for implementing the functions specified herein.
  • These program code instructions may also be stored in a computer-readable storage medium that can direct a computer, processing unit or other programmable apparatus to function in a particular manner to thereby generate a particular machine or particular article of manufacture.
  • the instructions stored in the computer-readable storage medium may produce an article of manufacture, where the article of manufacture becomes a means for implementing functions described herein.
  • the program code instructions may be retrieved from a computer-readable storage medium and loaded into a computer, processing unit or other programmable apparatus to configure the computer, processing unit or other programmable apparatus to execute operations to be performed on or by the computer, processing unit or other programmable apparatus.
  • Retrieval, loading and execution of the program code instructions may be performed sequentially such that one instruction is retrieved, loaded and executed at a time. In some example implementations, retrieval, loading and/or execution may be performed in parallel such that multiple instructions are retrieved, loaded, and/or executed together. Execution of the program code instructions may produce a computer-implemented process such that the instructions executed by the computer, processing circuitry or other programmable apparatus provide operations for implementing functions described herein.
  • a computer system (1) may include a processing unit (20) and a computer-readable storage medium or memory (50) coupled to the processing circuitry, where the processing circuitry is configured to execute computer-readable program code (60) stored in the memory.
  • Fig. 7 shows a preferred method of training the machine learning model of the present disclosure in the form of a flowchart. The method (100) comprises the steps:
  • (110) providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes,
  • Fig. 8 shows a preferred method of using the trained machine learning model for classifying an image in the form of a flowchart.
  • The computer-implemented method (200) comprises the steps:
  • (210) providing a trained machine learning model, wherein the trained machine learning model is configured and trained in a multiple instance training method as described herein to assign an image to one of at least two classes,

Abstract

Systems, methods, and computer programs disclosed herein relate to training a machine learning model to classify images, preferably medical images, using multiple instance learning techniques. The machine learning model can be trained and the trained machine learning model can be used for various purposes, in particular for the detection, identification and/or characterization of tumor types and/or gene mutations in tissues.

Description

Multiple Instance Learning Considering Neighborhood Aggregations
FIELD
Systems, methods, and computer programs disclosed herein relate to training a machine learning model to classify images, preferably medical images, using multiple instance learning techniques. The machine learning model can be trained and the trained machine learning model can be used for various purposes, in particular for the detection, identification and/or characterization of tumor types and/or gene mutations in tissues.
BACKGROUND
Digital pathology is an image-based information environment enabled by computer technology that allows for the management of information generated from a digital slide. Scanning converts tissues on glass slides into digital whole-slide images for assessment, sharing, and analysis.
Artificial intelligence methods can be used to analyze the digital images, e.g., for diagnostic performance improvement, discovery purposes, patient selection, treatment effects monitoring, etc.
The field of digital pathology and evaluation of digital whole slide images using artificial intelligence has experienced incredible growth in recent years. For example, WO2020229152A1 discloses a method of identifying signs indicative of an NTRK oncogenic fusion within patient data comprising a histopathological image of tumor tissue using a deep neural network; C.-L. Chen et al. disclose a whole-slide training approach to pathological classification of lung cancer types using deep learning (https://doi.org/10.1038/s41467-021-21467-y).
However, deep learning for digital pathology is hindered by the extremely high spatial resolution of whole-slide images. Whole-slide images are usually multi-gigabyte images with typical resolutions of 100000 x 100000 pixels, present high morphological variance, and often contain various types of artifacts.
Therefore, many successful approaches to train deep learning models on whole-slide images do not use the entire image as input, but instead extract and use only a number of image patches. Image patches are typically rectangular regions with dimensions ranging from 32 x 32 pixels to 10000 x 10000 pixels.
In order to train a machine learning model, training data is required. This training data usually includes images of tissue that is known to be tumor tissue or healthy tissue. The type of tumor present in each case may also be known. This data is also called labeled data and the labeling is usually done by experts.
Patch-level labeling by expert pathologists is very time consuming. In specific clinical scenarios such as NTRK oncogenic fusion detection, a patch-level labeling might even be impossible as even the pathology experts may not be able to identify any decisive pattern.
Due to these practical limitations, ground truth labeling in most cases is done at the level of the whole-slide images rather than at the level of individual patches. Labelled whole-slide images are weakly annotated data that can be analyzed using multiple instance learning techniques.
Unlike traditional machine learning models, multiple instance learning involves a set of bags that are labeled such that each bag consists of many unlabeled instances, i.e., the instances in the bags have no label information. The goal of multiple instance learning can be to train a classifier that assigns labels to test bags or assigns labels to the unlabeled instances in the bags.
Each whole-slide image can be considered as a bag that contains many instances of patches. Multiple instance learning uses training sets that consist of bags where each bag contains several instances that are either positive or negative examples for the class of interest, but only bag-level labels are given, and the instance-level labels are unknown during training.
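Purely as an illustration of this bag/instance structure (the class and field names below are hypothetical and not part of the disclosure), a bag can be sketched in Python as follows:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Bag:
    """A bag in the multiple-instance-learning sense: many unlabeled instances
    (here: patch feature vectors) sharing a single bag-level label."""
    instances: list[np.ndarray]  # one feature vector per patch; no patch-level labels
    label: int                   # bag-level label, e.g. 0 = negative class, 1 = positive class

# Toy bag: 100 patches, each represented by a 512-dimensional feature vector.
rng = np.random.default_rng(0)
bag = Bag(instances=[rng.normal(size=512) for _ in range(100)], label=1)
```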
A. V. Konstantinov et al. propose to base a classification not on bags of individual patches, but to take into account adjacent patches (arXiv:2112.06071v1 [cs.LG]). However, there is still a need to address the difficulties of detecting and/or characterizing diseases based on weakly annotated data.
SUMMARY
This need is met by the present disclosure.
The approach of the present disclosure is to perform classification not only on the basis of one or more selected patches and its/their nearest neighbors, but also to consider neighbors from more distant regions whose features are aggregated at different hierarchy levels, wherein the more distant the neighbors from the one or more selected patches, the higher the aggregation.
In a first aspect, the present disclosure provides a computer-implemented multiple instance learning method for training a model for classifying images, the method comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
In another aspect, the present disclosure provides a computer system comprising: a processor; and a memory storing an application program configured to perform, when executed by the processor, an operation, the operation comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
In another aspect, the present disclosure provides a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
In another aspect, the present disclosure relates to the use of a trained machine learning model for the detection, identification, and/or characterization of tumor types and/or gene mutations in tissues, wherein training of the machine learning model comprises: o receiving a training image, the training image being assigned to one of at least two classes, wherein at least one class comprises images of tumor tissue and/or tissue in which a gene mutation is present, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different. 
In another aspect, the present disclosure provides a computer-implemented method for classifying an image, the method comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, receiving a new image, generating a plurality of patches based on the new image, inputting the patches into the trained machine learning model, receiving a classification result from the trained machine learning model, outputting the classification result.
DETAILED DESCRIPTION
The invention will be more particularly elucidated below without distinguishing between the aspects of the disclosure (method, computer system, computer-readable storage medium, use). On the contrary, the following elucidations are intended to apply analogously to all the aspects of the disclosure, irrespective of in which context (method, computer system, computer-readable storage medium, use) they occur.
If steps are stated in an order in the present description or in the claims, this does not necessarily mean that the disclosure is restricted to the stated order. On the contrary, it is conceivable that the steps can also be executed in a different order or else in parallel to one another, unless one step builds upon another step, this absolutely requiring that the building step be executed subsequently (this being, however, clear in the individual case). The stated orders are thus preferred embodiments of the invention.

As used herein, the articles “a” and “an” are intended to include one or more items and may be used interchangeably with “one or more” and “at least one.” As used in the specification and the claims, the singular form of “a”, “an”, and “the” include plural referents, unless the context clearly dictates otherwise. Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has”, “have”, “having”, or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based at least partially on” unless explicitly stated otherwise. Further, the phrase “based on” may mean “in response to” and be indicative of a condition for automatically triggering a specified operation of an electronic device (e.g., a controller, a processor, a computing device, etc.) as appropriately referred to herein.
Some implementations of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all implementations of the disclosure are shown. Indeed, various implementations of the disclosure may be embodied in many different forms and should not be construed as limited to the implementations set forth herein; rather, these example implementations are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The present disclosure provides means for training a machine learning model and using the trained model for prediction purposes.
Such a “machine learning model”, as used herein, may be understood as a computer implemented data processing architecture. The machine learning model can receive input data and provide output data based on that input data and on parameters of the machine learning model. The machine learning model can learn a relation between input data and output data through training. In training, parameters of the machine learning model may be adjusted in order to provide a desired output for a given input.
The process of training a machine learning model involves providing a machine learning algorithm (that is the learning algorithm) with training data to learn from. The term “trained machine learning model” refers to the model artifact that is created by the training process. The training data must contain the correct answer, which is referred to as the target. The learning algorithm finds patterns in the training data that map input data to the target, and it outputs a trained machine learning model that captures these patterns.
In the training process, training data are inputted into the machine learning model and the machine learning model generates an output. The output is compared with the (known) target. Parameters of the machine learning model are modified in order to reduce the deviations between the output and the (known) target to a (defined) minimum.
In general, a loss function can be used for training, where the loss function can quantify the deviations between the output and the target. The loss function may be chosen in such a way that it rewards a wanted relation between output and target and/or penalizes an unwanted relation between an output and a target. Such a relation can be, e.g., a similarity, or a dissimilarity, or another relation.
A loss function can be used to calculate a loss value for a given pair of output and target. The aim of the training process can be to modify (adjust) parameters of the machine learning model in order to reduce the loss value to a (defined) minimum.
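A minimal sketch of such a loss-driven parameter update is given below, assuming a PyTorch setting with a stand-in classifier and random stand-in data; the layer sizes, learning rate and loss function are assumptions chosen for illustration only:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in classifier
loss_fn = nn.CrossEntropyLoss()                 # quantifies the deviation between output and target
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(features: torch.Tensor, target: torch.Tensor) -> float:
    """One update: forward pass, loss calculation, backward pass, parameter adjustment."""
    optimizer.zero_grad()
    output = model(features)          # model output for the given input
    loss = loss_fn(output, target)    # compare the output with the (known) target
    loss.backward()                   # gradients of the loss value w.r.t. the parameters
    optimizer.step()                  # modify the parameters to reduce the loss value
    return loss.item()

# Example call with random stand-in data (batch of 8 feature vectors).
loss_value = training_step(torch.randn(8, 512), torch.randint(0, 2, (8,)))
```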
The machine learning model of the present disclosure is trained to assign a patch or a number of patches to one of at least two classes.
A patch is one part of an image. In other words, the patch, together with further patches, forms an image. The number of patches forming the image is usually more than 100, preferably more than 1000.
The term “image” as used herein means a data structure that represents a spatial distribution of a physical signal. The spatial distribution may be of any dimension, for example 2D, 3D or 4D. The spatial distribution may be of any shape, for example forming a grid and thereby defining pixels, the grid being possibly irregular or regular. The physical signal may be any signal, for example color, level of gray, depth, surface or volume occupancy, such that the image may be a 2D or 3D RGB/grayscale/depth image, or a 3D surface/volume occupancy model.
For simplicity, the invention is described herein mainly on the basis of two-dimensional images comprising a rectangular array of pixels. However, this is not to be understood as limiting the invention to such images. Those skilled in machine learning based on image data will know how to apply the invention to image data comprising more dimensions and/or being in a different format.
In a preferred embodiment, the image is a medical image.
A medical image is a visual representation of the human body or a part thereof or a visual representation of the body of an animal or a part thereof. Medical images can be used, e.g., for diagnostic and/or treatment purposes.
Techniques for generating medical images include X-ray radiography, computerized tomography, fluoroscopy, magnetic resonance imaging, ultrasonography, endoscopy, elastography, tactile imaging, thermography, microscopy, positron emission tomography and others.
Examples of medical images include CT (computer tomography) scans, X-ray images, MRI (magnetic resonance imaging) scans, fluorescein angiography images, OCT (optical coherence tomography) scans, histopathological images, ultrasound images.
In a preferred embodiment, the image is a whole-slide histopathological image of tissue of a human body.
In a preferred embodiment, the histopathological image is an image of a stained tissue sample. One or more dyes can be used to create the stained image. Usual dyes are hematoxylin and/or eosin.
The image is labelled, i.e., the image is assigned to one of at least two classes; the label provides information about the class to which the image is assigned. The labelling can be done, e.g., by (medical) experts. For example, in histopathology, it is usually known what the tissue is that is depicted in a histopathological image. It is possible, for example, to perform a genetic analysis of a tissue sample and identify gene mutations. A slide and a slide image can then be generated from the tissue sample. The slide image can then be labeled accordingly with the information obtained from the gene analysis, for example.
The class can indicate whether the tissue shown in the image has a certain property or does not have the certain property.
The class can indicate whether the subject from which the tissue depicted in the image originates has a particular disease or does not have the particular disease.
Further options for classification are described below.
If an image is divided into a plurality of parts (patches), each part (patch) can be assigned to the class to which the image is assigned.
The model is trained on the basis of a plurality of patches of a plurality of training images to assign a given patch or a number of given patches to the correct class.
The term plurality, as it is used herein, means an integer greater than 1, usually greater than 10, preferably greater than 100.
In a first step, a number of patches can be generated from each training image. Usually, each training image is divided into the same number of patches. Usually, all patches have the same size and shape. Usually, the patches have a square or rectangular shape. In the case of a square or rectangular 2D image, the resolution of a patch is usually in the range of 32 pixels x 32 pixels to 10000 pixels x 10000 pixels, preferably in the range of 128 pixels x 128 pixels to 4096 pixels x 4096 pixels.
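By way of a non-limiting sketch (the array shape and the patch size below are assumptions chosen for illustration), dividing a 2D image into non-overlapping square patches arranged on a grid may look as follows:

```python
import numpy as np

def tile_image(image: np.ndarray, patch_size: int) -> dict:
    """Split a (H, W, C) image into non-overlapping square patches and return a
    mapping from grid coordinates (x, y) to the corresponding patch arrays."""
    height, width = image.shape[:2]
    patches = {}
    for y in range(height // patch_size):
        for x in range(width // patch_size):
            patches[(x, y)] = image[y * patch_size:(y + 1) * patch_size,
                                    x * patch_size:(x + 1) * patch_size]
    return patches

# Toy example: a 1024 x 1024 RGB image yields 32 x 32 = 1024 patches of 32 x 32 pixels each,
# matching the grid of patches shown in Fig. 1.
image = np.zeros((1024, 1024, 3), dtype=np.uint8)
patches = tile_image(image, patch_size=32)
assert len(patches) == 1024
```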
From the number of patches of a training image, a number of patches is selected. The number of selected patches can be, e.g., 1 to 5000. Preferably the number of selected patches is in the range from 10 to 500. The selection can be random or by rules. Rules for selecting patches can be found in the prior art on multiple instance learning.
For each selected patch, a feature vector is generated on the basis of the selected patch.
In machine learning, a feature vector is an m-dimensional vector of numerical features that represent an object, wherein m is an integer greater than 0. Many algorithms in machine learning require a numerical representation of objects since such representations facilitate processing and statistical analysis. When representing patches (parts of images), the feature values might correspond to the pixels/voxels of the patch. The term “feature vector” shall also include single values, matrices, tensors, and the like. The generation of a feature vector is often accompanied by a dimension reduction in order to reduce an object (a patch) to those features that are important for the classification. The machine learning model learns during training which features these are.
Examples of feature vector generation methods can be found in various textbooks and scientific publications (see e.g. G.A. Tsihrintzis, L.C. Jain: Machine Learning Paradigms: Advances in Deep Learning-based Technological Applications, in: Learning and Analytics in Intelligent Systems Vol. 18, Springer Nature, 2020, ISBN: 9783030497248; K. Grzegorczyk: Vector representations of text data in deep learning, Doctoral Dissertation, 2018, arXiv:1901.01695v1 [cs.CL]; M. Ilse et al.: Attention-based Deep Multiple Instance Learning, arXiv:1802.04712v4 [cs.LG]).
A feature extraction unit can be used to generate the feature vectors. The feature extraction unit can be an artificial neural network, for example. Examples of artificial neural networks which can be used for feature extraction are disclosed in: WO2020229152A1; J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG]. A feature vector is also referred to as “embedding” in the publications mentioned.
Fig. 9 shows schematically by way of example a feature extraction unit/network. The feature extraction unit/network FEU comprises an input layer IL, a number n of hidden layers HL1 to HLn and an output layer. The input neurons of the input layer IL serve to receive a patch P. Usually, there is at least one input neuron for each pixel/voxel of the patch P. The output neurons serve to output a feature vector FV (patch embedding).
The neurons of the input layer IL and the hidden layer HL1 are connected by connection lines having a connection weight, and the neurons of the hidden layer HLn and the output layer are also connected by connection lines with a connection weight. Similarly, the neurons of the hidden layers are connected to the neurons of neighboring hidden layers in a predetermined manner (not shown in Fig. 9). The connection weights can be learned through training.
In a preferred embodiment of the present disclosure, one or more feature extraction units used for feature extraction are or comprise a convolutional neural network (CNN).
A CNN is a class of deep neural networks that comprises an input layer with input neurons, an output layer with at least one output neuron, as well as multiple hidden layers between the input layer and the output layer. The hidden layers of a CNN typically comprise filters (convolutional layer) and aggregation layers (pooling layer) which are repeated alternately and, at the end, one layer or multiple layers of completely connected neurons (dense/fully connected layer(s)).
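As a purely illustrative sketch of such a feature extraction unit (the choice of a ResNet-18 backbone, the availability of torchvision, the input size and the resulting 512-dimensional embedding are assumptions, not requirements of the disclosure), a CNN with its classification head removed can map a patch to a feature vector:

```python
import torch
from torch import nn
from torchvision import models

backbone = models.resnet18(weights=None)  # in practice, pretrained or task-trained weights would be used
backbone.fc = nn.Identity()               # drop the classification head, keep the patch embedding
backbone.eval()

def embed_patch(patch: torch.Tensor) -> torch.Tensor:
    """Map a patch tensor of shape (3, H, W) to a 512-dimensional feature vector."""
    with torch.no_grad():
        return backbone(patch.unsqueeze(0)).squeeze(0)

feature_vector = embed_patch(torch.rand(3, 224, 224))  # shape: (512,)
```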
For each selected patch, a sequence of neighborhoods to the selected patch is selected. The sequence of neighborhoods consists of a number n of neighborhoods: N1, N2, ..., Nn, wherein n is an integer greater than 1. The sequence of neighborhoods is a sequence in the mathematical sense. The number n of neighborhoods is usually in the range from 2 to 10, more preferably in the range from 3 to 8.
Usually, patches in neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch. Usually, neighborhoods with increasing rank in the sequence of neighborhoods contain increasingly more patches. In other words: neighborhood N1 is closest to the selected patch and neighborhood Nn is farthest from the selected patch; neighborhood N1 includes the fewest patches and neighborhood Nn includes the most patches. The distance of a neighborhood to a selected patch can be, for example, an arithmetic distance averaged over all patches of the neighborhood.
Preferably, the neighborhoods extend around the selected patch.
Fig. 1 shows schematically four examples ((a), (b), (c) and (d)) of sequences of neighborhoods in an image. The image I is divided into a number of 32 x 32 = 1024 patches. In the example depicted in Fig. 1, the patches form a square grid in which each patch can be assigned an x-coordinate and a y-coordinate that determine its location within the grid.
Patch P is a selected patch within the image I. The coordinates of patch P are x, y = 16, 18.
Around the selected patch P there is a sequence of neighborhoods N1, N2, N3, ..., Nn, wherein n is the number of neighborhoods. Each neighborhood comprises a number of patches, each patch referred to as a neighboring patch. In the example depicted in Fig. 1, the more distant a neighborhood is from the selected patch, the more neighboring patches it has. The neighboring patches in neighborhood N1 are also referred to herein as first-order neighbors, the neighboring patches in neighborhood N2 are also referred to as second-order neighbors, the neighboring patches in neighborhood N3 are also referred to as third-order neighbors, and so on. In general, neighboring patches in a neighborhood Ni are called ith-order neighbors, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
In the case of a rectangular or square grid of patches, each selected patch with coordinates x, y has exactly eight first-order neighbors with the following coordinates:
(x, y-1); (x, y+1); (x+1, y-1); (x+1, y); (x+1, y+1); (x-1, y-1); (x-1, y); (x-1, y+1).
The brackets have been placed for better readability.
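For illustration (the function name is hypothetical), these eight first-order neighbors can be enumerated as follows:

```python
def first_order_neighbors(x: int, y: int) -> list[tuple[int, int]]:
    """Grid coordinates of the eight first-order neighbors of the patch at (x, y)."""
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if not (dx == 0 and dy == 0)]

assert len(first_order_neighbors(16, 18)) == 8
```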
In the case of Fig. 1 (a), the neighborhoods N1, N2, N3, N4 form frames around the selected patch with a width of one patch, each frame enclosing the selected patch at its center. The neighborhoods have a greater distance to the selected patch with increasing rank in the order of the neighborhoods: N4 is further away from the selected patch than N3; N3 is further away from the selected patch than N2; N2 is further away from the selected patch than N1. N1 consists of 8 patches; N2 consists of 16 patches; N3 consists of 24 patches; N4 consists of 32 patches. The neighborhoods consist of an increasing number of neighboring patches as the rank in the order of the neighborhoods increases: N4 consists of a larger number of neighboring patches than N3; N3 consists of a larger number of neighboring patches than N2; N2 consists of a larger number of neighboring patches than N1.
In the case of Fig. 1 (a), the number of patches in a neighborhood Ni is 8i, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
In the example shown in Fig. 1 (a), there is no overlap between the neighborhoods. However, it is possible that the neighborhoods overlap. For example, it is possible that neighborhood N2 comprises neighborhood N1, neighborhood N3 comprises neighborhood N2, and neighborhood N4 comprises neighborhood N3.
In Fig. 1 (a), four neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n.
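One way to construct the neighborhoods of Fig. 1 (a) is to collect all patches at a given Chebyshev distance from the selected patch; the sketch below (an illustration only, not the disclosed implementation) also checks the 8i patch count stated above:

```python
def frame_neighborhood(x: int, y: int, i: int) -> set[tuple[int, int]]:
    """Neighborhood Ni as in Fig. 1 (a): all patches whose Chebyshev distance to the
    selected patch (x, y) equals i, i.e. a frame with a width of one patch."""
    return {(x + dx, y + dy)
            for dx in range(-i, i + 1)
            for dy in range(-i, i + 1)
            if max(abs(dx), abs(dy)) == i}

# The frame of rank i contains 8i patches: 8, 16, 24, 32, ...
assert all(len(frame_neighborhood(16, 18, i)) == 8 * i for i in range(1, 5))
```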
In the case of Fig. 1 (b), the neighborhoods form frames around the selected patch with an increasing width. The width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 2 patches; the width of the frame forming neighborhood N3 is 3 patches; the width of the frame forming neighborhood N4 is 4 patches. In Fig. 1 (b), four neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n. In the case of Fig. 1 (b), the width of the frame forming the neighborhood Ni is i, wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
In the example shown in Fig. 1 (b), there is no overlap between the neighborhoods. However, it is possible that the neighborhoods overlap. For example, it is possible that neighborhood N2 comprises neighborhood N1, neighborhood N3 comprises neighborhood N2, and neighborhood N4 comprises neighborhood N3.
In the case of Fig. 1 (c), the neighborhoods form frames around the selected patch with an increasing width. The width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 2 patches; the width of the frame forming neighborhood N3 is 4 patches. In Fig. 1 (c), only 3 neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n. In the case of Fig. 1 (c), the width of a frame forming a neighborhood doubles from one neighborhood to the next in the sequence of neighborhoods.
In the example shown in Fig. 1 (c), there is no overlap between the neighborhoods. However, it is possible that the neighborhoods overlap. For example, it is possible that neighborhood N2 comprises neighborhood N1, and neighborhood N3 comprises neighborhood N2.
Fig. 1 (d) shows a preferred embodiment of a sequence of neighborhoods. In the case of Fig. 1 (d), the neighborhoods form frames around the selected patch, each neighborhood enclosing the selected patch at its center. The frames of the neighborhoods have an increasing width. The width of the frame forming neighborhood N1 is 1 patch; the width of the frame forming neighborhood N2 is 3 patches; the width of the frame forming neighborhood N3 is 9 patches. In Fig. 1 (d), only 3 neighborhoods are drawn. However, the sequence of neighborhoods can be continued up to the number n. In the case of Fig. 1 (d), the width of a frame forming a neighborhood triples from one neighborhood to the next in the sequence of neighborhoods. In other words, the width of the frame forming neighborhood Ni equals 3^(i-1), wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
In the example shown in Fig. 1 (d), there is no overlap between the neighborhoods. However, it is possible that the neighborhoods overlap. For example, it is possible that neighborhood N2 comprises neighborhood N1, and neighborhood N3 comprises neighborhood N2.
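The tripling frame width of Fig. 1 (d) can be expressed as a small helper (illustrative only), which reproduces the widths 1, 3 and 9 given above:

```python
def frame_width(i: int) -> int:
    """Width (in patches) of the frame forming neighborhood Ni in the scheme of Fig. 1 (d)."""
    return 3 ** (i - 1)

assert [frame_width(i) for i in (1, 2, 3)] == [1, 3, 9]
```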
The neighborhoods shown in Fig. 1 are point-symmetric with respect to their center in which the selected patch is located; they each have the same extent in the x- and y-directions; the neighborhoods do not overlap; the neighborhoods are directly adjacent to each other. These are preferred properties of neighborhoods. However, it is also possible that the neighborhoods are not symmetrical, that they have different extents in x- and y- directions, that they overlap, and/or that they are spaced apart.
It is possible that the neighborhoods are defined differently than shown in Fig. 1. For example, it is possible to define the neighborhoods based on other distance metrics. An example is shown in Fig. 2.
Fig. 2 shows an image I divided into a plurality of patches. A patch P is selected which has the coordinates x, y. In Fig. 2, for neighboring patches, the Manhattan distance is given. More precisely, the Manhattan distance is given for those neighboring patches that are located at a maximum Manhattan distance of 8 patches from the selected patch P.
For example, a neighborhood can be formed by patches that have a defined Manhattan distance from the selected patch P. For example, a first neighborhood with first-order neighbors can be formed by patches with a Manhattan distance of one; a second neighborhood with second-order neighbors can be formed by patches with a Manhattan distance of two; a third neighborhood with third-order neighbors can be formed by patches with a Manhattan distance of three, and so on.
Similar to the examples shown in Figs. 1 (a) to (d), it is possible for neighbors with different Manhattan distances to be combined into one neighborhood. For example, neighbors with a Manhattan distance of 1 or 2 may form a first neighborhood, neighbors with a Manhattan distance of 3 or 4 may form a second neighborhood, neighbors with a Manhattan distance of 5 or 6 may form a third neighborhood, and so on.
Similar to the examples shown in Figs. 1 (b) to (d), the mean Manhattan distance can also increase with increasing order. For example, neighbors with a Manhattan distance of 1 may form a first neighborhood, neighbors with a Manhattan distance of 2 or 3 a second neighborhood, neighbors with a Manhattan distance of 4, 5 or 6 a third neighborhood, and so on. Of course, the neighborhoods can also overlap. For example, neighbors with a maximum Manhattan distance of 2 can form a first neighborhood, neighbors with a maximum Manhattan distance of 4 can form a second neighborhood, neighbors with a maximum Manhattan distance of 8 can form a third neighborhood, and so on.
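As an illustration of such Manhattan-distance-based neighborhoods, the following minimal sketch assigns a patch to the first neighborhood that contains it; the overlapping maximum distances 2, 4 and 8 from the last example above are assumed, and all names are illustrative:

```python
# Minimal sketch, assuming the overlapping Manhattan-distance neighborhoods from
# the example above (maximum distances 2, 4 and 8); names are illustrative.
def manhattan(px: int, py: int, x: int, y: int) -> int:
    """Manhattan distance between a patch (px, py) and the selected patch (x, y)."""
    return abs(px - x) + abs(py - y)

def neighborhood_rank(px, py, x, y, max_distances=(2, 4, 8)):
    """1-based rank of the first neighborhood that contains the patch, else None."""
    d = manhattan(px, py, x, y)
    for rank, limit in enumerate(max_distances, start=1):
        if 0 < d <= limit:
            return rank
    return None

print(neighborhood_rank(4, 7, 5, 5))  # Manhattan distance 3 -> contained in neighborhood 2
```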
Therefore, as described above, the present disclosure is not limited to specific definitions of neighborhoods. The only requirement is that feature vectors of neighbors in a more distant neighborhood are aggregated to a higher degree than feature vectors of neighbors in a neighborhood closer to the selected patch. This aggregation is now described in more detail.
For each selected patch, a number of neighboring patches in the neighborhoods are selected. A preferred selection process is described below. For each selected neighboring patch, a feature vector is generated. The same feature extraction unit that was used to generate the feature vector of the selected patch can be used to generate the feature vectors of the selected neighboring patches. Preferably, each feature vector representing a selected neighboring patch has the same dimension. Preferably, each feature vector representing a selected neighboring patch has the same dimension as the feature vector representing the selected patch.
On the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches a joint feature vector is generated. The joint feature vector forms the basis for the classification.
The generation of the joint feature vector can be done in an iterative process, wherein feature vectors of selected neighboring patches are first aggregated before being fused with the feature vector of the selected patch. In this process, aggregation can take place sequentially at different hierarchy levels, with each hierarchy level corresponding to a neighborhood.
The term hierarchy level reflects the fact that the farther the neighborhood in which a neighboring patch is located is from the selected patch, the more feature vectors of other neighboring patches the feature vector of that patch is aggregated with. The distance of a neighborhood to a selected patch can be, for example, the arithmetic mean of the distances of all patches of the neighborhood to the selected patch.
The aggregation process is shown schematically in the form of an example in Fig. 3. In the example shown in Fig. 3, the neighborhoods are defined as shown in Fig. 1 (d).
Fig. 3 (a) shows a section of an image divided into a number of patches. A patch P with the coordinates x, y is selected. Around the selected patch P, a first neighborhood N1 extends. Neighborhood N1 consists of eight first-order neighbors. The next step is to select a number of first-order neighboring patches in neighborhood N1. Up to eight neighboring patches can be selected in neighborhood N1. Multiple selection strategies are applicable, including a random selection of neighboring patches. The number of neighboring patches selected can also be random.
Fig. 3 (b) shows that two of the eight neighboring patches in neighborhood N1 have been selected: P’ and P”.
For each selected neighboring patch, a feature vector is generated (not shown in Fig. 3 (b)). The feature vectors of the selected neighboring patches are then aggregated into a neighborhood feature vector representing neighborhood N1.
Aggregation means that a single feature vector is generated from a number of feature vectors (at least two). In other words, a number of feature vectors are combined into one feature vector. However, the feature vectors are not simply appended (concatenated) to each other but are subjected to mathematical operations that produce a feature vector whose dimension is less than the sum of the dimensions of the feature vectors from which it is computed. For example, the feature vectors can be added or multiplied element by element, or arithmetic averages can be formed element by element. Preferably, the resulting neighborhood feature vector has the same dimension as any of the feature vectors from which it was generated. The neighborhood feature vector is a numerical representation of the neighborhood N1.
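A minimal sketch of such an aggregation, assuming an element-wise arithmetic mean and three-dimensional feature vectors with made-up values, could look as follows:

```python
import numpy as np

# Minimal sketch of the aggregation step: feature vectors of equal dimension are
# reduced to a single vector of the same dimension by an element-wise arithmetic
# mean (element-wise sum or product would work analogously). Values are made up.
def aggregate_mean(feature_vectors):
    stacked = np.stack(feature_vectors, axis=0)  # shape: (number_of_vectors, dim)
    return stacked.mean(axis=0)                  # shape: (dim,)

fv_p_prime = np.array([0.2, 0.8, 0.1])           # feature vector of neighbor P'
fv_p_double_prime = np.array([0.4, 0.6, 0.3])    # feature vector of neighbor P''
neighborhood_fv = aggregate_mean([fv_p_prime, fv_p_double_prime])  # same dimension
```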
In a preferred embodiment, an attention mechanism is used to aggregate the feature vectors of the selected neighboring patches. In machine learning, attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts - the thought being that the machine learning model should devote more focus to the small but important parts of the data. Which parts of the data are more important than others is learned during the training phase.
In the present case, for example, an attention mechanism can be used as described in: A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG] and/or in M. Ilse et al.: Attention-based Deep Multiple Instance Learning, arXiv:1802.04712v4 [cs.LG]. In such an attention mechanism, weights are applied to the feature vectors when they are combined into a single feature vector, and the machine learning model learns the weights during training, and thus learns what influence the different neighbors (represented by their respective feature vectors) have on the classification result.
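A minimal sketch of such an attention-based aggregation, in the spirit of the cited references, is shown below; the layer sizes, hidden dimensions and names are illustrative assumptions and are not taken from the cited works:

```python
import torch
import torch.nn as nn

# Minimal sketch of an attention-based aggregation: a small scoring network
# produces one weight per feature vector, the weights are normalised with a
# softmax, and the weighted sum forms the aggregate.
class AttentionAggregator(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, fvs: torch.Tensor) -> torch.Tensor:
        # fvs: (number_of_neighbors, dim)
        weights = torch.softmax(self.score(fvs), dim=0)  # attention distribution
        return (weights * fvs).sum(dim=0)                # attention output, shape (dim,)

aggregator = AttentionAggregator(dim=128)
neighborhood_fv = aggregator(torch.randn(2, 128))  # e.g. the two selected neighbors P' and P''
```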
The neighborhood feature vector representing neighborhood N1 can be fused with the feature vector of the selected patch P, e.g., using concatenation. The resulting feature vector is a joint representation of the selected patch and its nearest environment (its nearest neighborhood N1). The selected patch together with neighborhood N1 is referred to herein as a first-order aggregate. The feature vector representing such first-order aggregate is referred to herein as first-order-aggregate feature vector.
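The fusion step itself can be as simple as a concatenation of the two vectors, as in the following sketch (the dimensions are assumed for illustration):

```python
import torch

# Minimal sketch of the fusion step, assuming 128-dimensional feature vectors:
# the neighborhood feature vector (e.g. the attention output above) is
# concatenated with the feature vector of the selected patch P.
fv_patch = torch.randn(128)            # feature vector of the selected patch P
fv_n1 = torch.randn(128)               # neighborhood feature vector for N1
fv_a1 = torch.cat([fv_patch, fv_n1])   # first-order-aggregate feature vector (dimension 256)
```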
Fig. 3 (c) shows a first-order aggregate A1. The first-order aggregate A1 contains the selected patch P and its neighborhood N1 (see also Figs. 3 (a) and 3 (b)). The first-order aggregate A1 can be considered as a patch which is larger than the selected patch P (A1 is nine times the size of patch P, comprising the selected patch and its eight first-order neighbors). The aggregate A1, like the selected patch P, has a first neighborhood. This first neighborhood corresponds to the neighborhood N2 of the selected patch P.
Thus, the process of selecting neighboring patches, generating feature vectors for neighboring patches, aggregating feature vectors of neighboring patches, and fusing the aggregated feature vectors with the feature vector of A1 can be repeated as described with respect to Fig. 3 (a) and (b), except that the operations are performed at the next level of the hierarchy. It is like zooming out of the image and repeating the operations at a different level. Instead of patches, operations are performed based on aggregates of patches. This process of zooming out and repeating the step at the next hierarchy level can occur multiple times. Preferably, this process is repeated 2 to 8 times. It is now described how the process is carried out on the next hierarchy level.
The neighborhood N2 extends around the aggregate A1 (see Fig. 3 (c)). The neighborhood N2 has eight times the size of aggregate A1. The neighborhood N2 can be divided into eight neighboring first-order aggregates, all of which have the size of A1. This is depicted in Fig. 3 (d). In a next step, a number of neighboring first-order aggregates is selected. Multiple selection strategies are applicable, including a random selection of aggregates. The number of aggregates selected can also be random. In Fig. 3 (e) it is shown that two neighboring first-order aggregates A1’ and A1” have been selected. Now, for each selected first-order aggregate, a feature vector representing such aggregate needs to be generated. This can be done as described with respect to Fig. 3 (a), (b) and (c), and it is shown in Fig. 3 (f) and (g).
In Fig. 3 (f), two second-order neighboring patches Q and R have been selected. The selected second-order neighboring patches are located at specific locations within neighborhood N2: they are each located in a center of a first-order aggregate.
There are eight second-order neighboring patches in neighborhood N2 which are located in a center of a first-order aggregate. The coordinates of these specific neighboring patches (in relation to the coordinates x, y of the selected patch) are: (x-3, y+3); (x, y+3); (x+3, y+3); (x-3, y); (x+3, y); (x-3, y-3); (x, y-3); (x+3, y-3).
The brackets have been placed for better readability.
For each selected second-order neighboring patch Q and R, a feature vector is generated. The same feature extraction unit that was used to generate the feature vector of the selected patch and the feature vectors of the first-order neighbors can be used to generate the feature vectors of the selected second-order neighbors. In a next step, these feature vectors of the selected second-order neighbors (Q and R) are aggregated with feature vectors of further neighboring patches (Q’, Q” and R’, R”, respectively) to generate feature vectors of the first-order aggregates A1’ and A1”. This process is analogous to that described for Fig. 3 (b) and is schematically depicted in Figs. 3 (f) and 3 (g): for each second-order neighbor (Q and R), further neighboring patches (Q’, Q” and R’, R”, respectively) are selected in respective neighborhoods N1’ and N1” around the second-order neighbors; in the present case, two neighboring patches each (Q’, Q” and R’, R”, respectively). For each selected neighboring patch, a feature vector is generated. The feature vectors of the neighboring patches can then be aggregated into feature vectors representing the neighborhoods N1’ and N1”, respectively (Fig. 3 (h)). Preferably, attention mechanisms are used for generating the feature vectors representing neighborhoods N1’ and N1” (as described with respect to Fig. 3 (b)).
Each of the feature vectors representing the neighborhoods N1’ and N1” can then be fused with the feature vector of the selected second-order neighbor it surrounds. This results in two first-order-aggregate feature vectors.
The respective first-order aggregates A1’ and A1” are depicted in Fig. 3 (i). The feature vectors of the first-order aggregates A1, A1’ and A1” can then be fused into a second-order-aggregate feature vector representing the second-order aggregate A2 (see Fig. 3 (k)), which consists of the first-order aggregate A1 and the neighborhood N2 (see Fig. 3 (j)). Again, in a first step, a feature vector representing the neighborhood N2 is generated from the feature vectors representing the aggregates A1’ and A1”. An attention mechanism (as described with respect to Fig. 3 (b)) can be used for aggregating the feature vectors representing A1’ and A1” into a feature vector representing neighborhood N2. The feature vector representing neighborhood N2 can then be concatenated with the feature vector representing aggregate A1 to generate a feature vector representing aggregate A2.
The process can continue with neighborhood N3. Figs. 3 (l) to 3 (n) show a larger section of the image shown in Figs. 3 (a) to 3 (k). Neighborhood N3 extends around the aggregate A2. The neighborhood N3 has eight times the size of aggregate A2. The neighborhood N3 can be divided into eight second-order aggregates, all of which have the size of A2 (see Fig. 3 (m)).
When comparing Fig. 3 (m) with Fig. 3 (d) and Fig. 3 (a), it is noticeable that they are similar. Fig. 3 (m) represents the next (third) hierarchy level on which the process can be continued. Fig. 3 (d) represents the second hierarchy level and Fig. 3 (a) the first hierarchy level.
A number of second-order aggregates neighboring the second-order aggregate A2 can be selected. Multiple selection strategies are applicable, including a random selection of aggregates. The number of aggregates selected can also be random. For example, the process can be continued by selecting the second-order aggregates A2’ and A2” (see Fig. 3 (m)). In order to generate feature vectors representing the second-order aggregates A2’ and A2”, the process as described for the second-order aggregate A2 can be applied to the second-order aggregates A2’ and A2”. The process can be started from the center patches of aggregates A2’ and A2”.
In general, center patches of an aggregate Ai are characterized by the following coordinates (in relation to the coordinates x, y of the selected patch): (x - 3^(i-1), y + 3^(i-1)); (x, y + 3^(i-1)); (x + 3^(i-1), y + 3^(i-1)); (x - 3^(i-1), y); (x + 3^(i-1), y); (x - 3^(i-1), y - 3^(i-1)); (x, y - 3^(i-1)); (x + 3^(i-1), y - 3^(i-1)), wherein i is an index that can take a number from 1 to n.
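A minimal sketch computing these coordinates, assuming the offset 3^(i-1) as in the formula above, could look as follows (the function name is illustrative):

```python
# Minimal sketch of the coordinates listed above, assuming the offset 3**(i - 1).
def center_patch_coordinates(x: int, y: int, i: int):
    d = 3 ** (i - 1)
    return [(x + dx, y + dy)
            for dy in (d, 0, -d)
            for dx in (-d, 0, d)
            if not (dx == 0 and dy == 0)]

# For i = 2 the eight coordinates are offset by 3 patches from (x, y),
# matching the coordinates given above for the second-order neighbors Q and R.
print(center_patch_coordinates(0, 0, 2))
```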
Fig. 3 (n) shows how the process can be continued. Feature vectors can be generated for the selected center patches and these can be aggregated at various hierarchy levels until a number of second-order aggregates are obtained, which can then be aggregated into a third-order aggregate, and so on and so forth. This continues until an aggregate An is formed that is equal in size to the neighborhood Nn plus the aggregate A(n-1). Whenever feature vectors representing aggregates are aggregated to generate a feature vector representing a higher order neighborhood, an attention mechanism can be used.
Fig. 3 (n) shows all patches, neighboring patches, first-order aggregates, and second-order aggregates which have been selected in the example depicted in Figs. 3 (a) to 3 (n), and from which feature vectors have been generated. Fig. 4 shows how a feature vector representing the third-order aggregate A3 (see Fig. 3 (n)) can be generated. The third-order-aggregate feature vector representing the third-order aggregate A3 can be a concatenation of feature vectors representing the selected patch P, the first neighborhood N1, the second neighborhood N2, and the third neighborhood N3. The feature vector representing the first neighborhood N1 can be generated from the feature vectors representing patches P’ and P” by aggregation using an attention mechanism. The feature vector representing the second neighborhood N2 can be generated from the feature vectors representing the first-order aggregates A1’ and A1”. A feature vector representing first-order aggregate A1’ can be a concatenation of a feature vector representing neighborhood N1’ and a feature vector representing patch Q. A feature vector representing first-order aggregate A1” can be a concatenation of a feature vector representing neighborhood N1” and a feature vector representing patch R. A feature vector representing neighborhood N1’ can be generated from feature vectors of patches Q’ and Q” by aggregation using an attention mechanism. A feature vector representing neighborhood N1” can be generated from feature vectors of patches R’ and R” by aggregation using an attention mechanism. In an analogous way, the feature vector representing neighborhood N3 can also be generated, aggregating even more feature vectors from even more patches than in the case of N2 and N1. This is not shown in Fig. 4. However, it can be seen from Fig. 3 (n) that the feature vectors of two patches (P’, P”) are included in the generation of a feature vector representing neighborhood N1, that the feature vectors of six patches (Q, Q’, Q”, R, R’, R”) are included in the generation of a feature vector representing neighborhood N2, and that the feature vectors of 18 patches are included in the generation of a feature vector representing neighborhood N3. Thus, the farther a neighborhood is from the selected patch (P), the more patches are considered to generate a feature vector representing that neighborhood, and the greater the aggregation of feature vectors.
The feature vector representing the aggregate An (A3 in the case of the example shown in Figs. 3 and 4) can be the joint feature vector of the selected patch and the selected neighboring patches. It includes information about the selected patch as well as information about neighboring patches in different neighborhoods at different distances from the selected patch.
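The overall construction can be summarized in a highly simplified sketch. Here the learned attention aggregation is replaced by a plain element-wise mean for brevity, the feature extraction unit is a random stand-in, and the number of selected neighbors per level is chosen at random, so the sketch only illustrates the recursive structure, not the actual model:

```python
import random
import torch

# Highly simplified sketch of the hierarchical construction of the joint feature
# vector. All names are illustrative; the mean stands in for a learned attention.
DIM = 128

def extract_fv(patch) -> torch.Tensor:
    return torch.randn(DIM)  # stand-in for the feature extraction unit

def aggregate_fv(center_patch, order: int) -> torch.Tensor:
    """Feature vector of an aggregate of the given order built around center_patch."""
    if order == 0:
        return extract_fv(center_patch)                       # a single patch
    inner_fv = aggregate_fv(center_patch, order - 1)          # aggregate A(order-1)
    # select some neighboring (order-1)-order aggregates in neighborhood N(order)
    selected_centers = [f"center_{order}_{k}" for k in range(random.randint(1, 3))]
    neighbor_fvs = [aggregate_fv(c, order - 1) for c in selected_centers]
    neighborhood_fv = torch.stack(neighbor_fvs).mean(dim=0)   # stand-in for attention
    return torch.cat([inner_fv, neighborhood_fv])             # fusion by concatenation

joint_fv = aggregate_fv("selected_patch", order=3)            # corresponds to aggregate A3
print(joint_fv.shape)                                         # dimension DIM * 2**3
```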
If more than one patch has been selected at the beginning, for each selected patch there is a feature vector representing an aggregate that includes the selected patch and its environment. These feature vectors can be combined into a joint feature vector, for example by concatenation.
The joint feature vector can be used for classification. Therefore, in a further step, the joint feature vector is fed to a classifier. The classifier is configured to assign the joint feature vector to one of the at least two classes. Preferably, the classifier is part of the machine learning model. The classifier can be, e.g., an artificial neural network. Examples of preferred classifiers can be found, e.g., in J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; M. Y. Lu et al.: AI-based pathology predicts origins for cancers of unknown primary, Nature, 594(7861): 106-110, 2021; M. Ilse et al.: Attention-based deep multiple instance learning, International Conference on Machine Learning, pages 2127-2136, PMLR, 2018; M. Ilse et al.: Deep multiple instance learning for digital histopathology, Handbook of Medical Image Computing and Computer Assisted Intervention, pages 521-546, Elsevier, 2020.
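As an illustration only, a classifier head operating on the joint feature vector could be sketched as follows; the architecture, the layer sizes and the input dimension are assumptions and not the classifiers of the cited references:

```python
import torch
import torch.nn as nn

# Minimal sketch of a classifier head operating on the joint feature vector;
# sizes are illustrative assumptions.
classifier = nn.Sequential(
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # two classes, e.g. "mutation present" / "mutation absent"
)

joint_fv = torch.randn(1, 1024)               # joint feature vector of one image
class_logits = classifier(joint_fv)           # unnormalised scores for the two classes
predicted_class = class_logits.argmax(dim=1)  # index of the assigned class
```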
The classification result provided by the classifier can be analyzed. The class to which the joint feature vector is assigned can be compared with the class to which the training image is assigned. If the classes match, the classification is correct. If the classes do not match, the classification performed by the machine learning model is wrong, and parameters of the machine learning model need to be modified.
A loss function can be used for quantifying the deviation between the output and the target. Examples of loss functions can be found, e.g., in J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12; A. V. Konstantinov et al.: Multi-Attention Multiple Instance Learning, arXiv:2112.06071v1 [cs.LG]; M. Ilse et al.: Attention-based Deep Multiple Instance Learning, arXiv:1802.04712v4 [cs.LG]. The model is trained on a large number of images until the model reaches a defined accuracy in prediction.
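A single training step of this kind can be sketched as follows; the variable model is a placeholder for the whole pipeline (feature extraction, attention-based aggregation and classifier), and the use of a cross-entropy loss and the Adam optimizer are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch of one training step with a cross-entropy loss; `model` is a
# placeholder for the full pipeline and is an illustrative assumption.
model = nn.Linear(1024, 2)                   # placeholder for the full model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

joint_fv = torch.randn(1, 1024)              # joint feature vector of a training image
target_class = torch.tensor([1])             # class the training image is assigned to

logits = model(joint_fv)
loss = loss_fn(logits, target_class)         # deviation between output and target
optimizer.zero_grad()
loss.backward()                              # backpropagation
optimizer.step()                             # modify parameters of the model
```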
Fig. 5 shows schematically the architecture of the machine learning model of the present disclosure. The machine learning model receives a (training) image I and outputs a class C. The machine learning model comprises a (neighboring) patch selection unit (PSU) which is configured to select patches and neighboring patches as described herein. For each selected patch and selected neighboring patch, a feature vector is generated by a feature extraction unit (FEU), each feature vector representing the respective (neighboring) patch. Feature vectors (FVs) of neighboring patches are aggregated using an attention mechanism (AM), ad means attention distribution and ao means attention output. Feature vectors (FVs) are aggregated on different hierarchy levels. This is indicated by the feedback loop from the feature vector aggregation and fusion unit (FVAFU) to the (neighboring) patch selection unit (PSU). After a number of aggregations, a joint feature vector (JFV) is generated comprising the feature vector of the at least one selected patch and feature vectors of aggregated patches at different hierarchy levels. The joint feature vector (JFV) is fed into the classifier (CF) and the classifier (CF) is configured to output the classification result (class output C).
The term "unit" as used in this disclosure is not intended to imply that there is necessarily a separate unit performing the functions described. Rather, the term is intended to be understood to mean that computation means are present which perform the appropriate functions. These computation means are typically one or more processors configured to perform corresponding operations. Details are described below with reference to Fig. 6.
The trained machine learning model can be stored in a data storage, transmitted to another computer system, or used to classify one or more new images. The term “new” means that the corresponding image was not used during training.
The machine learning model can be trained to perform various tasks. Accordingly, a trained machine learning model can be used for various purposes. In a preferred embodiment, the machine learning model of the present disclosure is trained and the trained machine learning model is used to detect, identify, and/or characterize tumor types and/or gene mutations in tissues.
The machine learning model can be trained and the trained machine learning model can be used to recognize a specific gene mutation and/or a specific tumor type, or to recognize multiple gene mutations and/or multiple tumor types.
The machine learning model can be trained and the trained machine learning model can be used to characterize the type or types of cancer a patient or subject has.
The machine learning model can be trained and the trained machine learning model can be used to select one or more effective therapies for the patient.
The machine learning model can be trained and the trained machine learning model can be used to determine how a patient is responding over time to a treatment and, if necessary, to select a new therapy or therapies for the patient as necessary.
Correctly characterizing the type or types of cancer a patient has and, potentially, selecting one or more effective therapies for the patient can be crucial for the survival and overall wellbeing of that patient.
The machine learning model can be trained and the trained machine learning model can be used to determine whether a patient should be included or excluded from participating in a clinical trial.
The machine learning model can be trained and the trained machine learning model can be used to classify images of tumor tissue in one or more of the following classes: inflamed, non-inflamed, vascularized, non-vascularized, fibroblast-enriched, non-fibroblast-enriched (such classes are defined, e.g., in EP3639169A1).
The machine learning model can be trained and the trained machine learning model can be used to identify differentially expressed genes in a sample from a subject (e.g., a patient) having a cancer (e.g., a tumor). The machine learning model can be trained and the trained machine learning model can be used to identify genes that are mutated in a sample from a subject having a cancer (e.g., a tumor).
The machine learning model can be trained and the trained machine learning model can be used to identify a cancer (e.g., a tumor) as a specific subtype of cancer.
Such uses may be useful for clinical purposes including, for example, selecting a treatment, monitoring cancer progression, assessing the efficacy of a treatment against a cancer, evaluating suitability of a patient for participating in a clinical trial, or determining a course of treatment for a subject (e.g., a patient).
The trained machine learning model may also be used for non-clinical purposes including (as a nonlimiting example) research purposes such as, e.g., studying the mechanism of cancer development and/or biological pathways and/or biological processes involved in cancer, and developing new therapies for cancer based on such studies.
The machine learning model of the present disclosure is trained based on images and it generates predictions based on images. The images usually show the tissue of one or more subjects. The images can be created from tissue samples of a subject. The subject is usually a human, but may also be any mammal, including mice, rabbits, dogs, and monkeys.
The tissue sample may be any sample from a subject known or suspected of having cancerous cells or pre-cancerous cells.
The tissue sample may be from any source in the subject's body including, but not limited to, skin (including portions of the epidermis, dermis, and/or hypodermis), bone, bone marrow, brain, thymus, spleen, small intestine, appendix, colon, rectum, liver, gall bladder, pancreas, kidney, lung, ureter, bladder, urethra, uterus, ovary, cervix, scrotum, penis, prostate.
The tissue sample may be a piece of tissue, or some or all of an organ.
The tissue sample may be a cancerous tissue or organ or a tissue or organ suspected of having one or more cancerous cells.
The tissue sample may be from a healthy (e.g. non-cancerous) tissue or organ.
The tissue sample may include both healthy and cancerous cells and/or tissue.
In certain embodiments, one sample has been taken from a subject for analysis. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) samples may have been taken from a subject for analysis.
In some embodiments, one sample from a subject will be analyzed. In certain embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) samples may be analyzed. If more than one sample from a subject is analyzed, the samples may have been procured at the same time (e.g., more than one sample may be taken in the same procedure), or the samples may have been taken at different times (e.g., during a different procedure including a procedure 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 days; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 weeks; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 months, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 years, or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 decades after a first procedure). A second or subsequent sample may be taken or obtained from the same region (e.g., from the same tumor or area of tissue) or a different region (including, e.g. a different tumor). A second or subsequent sample may be taken or obtained from the subject after one or more treatments and may be taken from the same region or a different region. As a non-limiting example, the second or subsequent sample may be useful in determining whether the cancer in each sample has different characteristics (e.g., in the case of samples taken from two physically separate tumors in a patient) or whether the cancer has responded to one or more treatments (e.g., in the case of two or more samples from the same tumor prior to and subsequent to a treatment).
Any of the samples described herein may have been obtained from the subject using any known technique. In some embodiments, the sample may have been obtained from a surgical procedure (e.g., laparoscopic surgery, microscopically controlled surgery, or endoscopy), bone marrow biopsy, punch biopsy, endoscopic biopsy, or needle biopsy (e.g., a fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, or image-guided biopsy).
Detection, identification, and/or characterization of tumor types may be applied to any cancer and any tumor. Exemplary cancers include, but are not limited to, adrenocortical carcinoma, bladder urothelial carcinoma, breast invasive carcinoma, cervical squamous cell carcinoma, endocervical adenocarcinoma, colon adenocarcinoma, esophageal carcinoma, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, prostate adenocarcinoma, rectal adenocarcinoma, skin cutaneous melanoma, stomach adenocarcinoma, thyroid carcinoma, uterine corpus endometrial carcinoma, and cholangiocarcinoma.
The machine learning model can be trained and the trained machine learning model can be used to detect, identify and/or characterize gene mutations in tissue samples.
Examples of genes related to proliferation of cancer or response rates of molecular target drugs include HER2, TOP2A, HER3, EGFR, P53, and MET. Examples of tyrosine kinase related genes include ALK, FLT3, AXL, FLT4 (VEGFR3), DDR1, FMS (CSF1R), DDR2, EGFR (ERBB1), HER4 (ERBB4), EML4-ALK, IGF1R, EPHA1, INSR, EPHA2, IRR (INSRR), EPHA3, KIT, EPHA4, LTK, EPHA5, MER (MERTK), EPHA6, MET, EPHA7, MUSK, EPHA8, NPM1-ALK, EPHB1, PDGFRα (PDGFRA), EPHB2, PDGFRβ (PDGFRB), EPHB3, RET, EPHB4, RON (MST1R), FGFR1, ROS (ROS1), FGFR2, TIE2 (TEK), FGFR3, TRKA (NTRK1), FGFR4, TRKB (NTRK2), FLT1 (VEGFR1), and TRKC (NTRK3). Examples of breast cancer related genes include ATM, BRCA1, BRCA2, BRCA3, CCND1, E-Cadherin, ERBB2, ETV6, FGFR1, HRAS, KRAS, NRAS, NTRK3, p53, and PTEN. Examples of genes related to carcinoid tumors include BCL2, BRD4, CCND1, CDKN1A, CDKN2A, CTNNB1, HES1, MAP2, MEN1, NF1, NOTCH1, NUT, RAF, SDHD, and VEGFA. Examples of colorectal cancer related genes include APC, MSH6, AXIN2, MYH, BMPR1A, p53, DCC, PMS2, KRAS2 (or Ki-ras), PTEN, MLH1, SMAD4, MSH2, STK11, and MSH6. Examples of lung cancer related genes include ALK, PTEN, CCND1, RASSF1A, CDKN2A, RB1, EGFR, RET, EML4, ROS1, KRAS2, TP53, and MYC. Examples of liver cancer related genes include Axin1, MALAT1, β-catenin, p16 INK4A, c-ERBB-2, p53, CTNNB1, RB1, Cyclin D1, SMAD2, EGFR, SMAD4, IGFR2, TCF1, and KRAS. Examples of kidney cancer related genes include Alpha, PRCC, ASPSCR1, PSF, CLTC, TFE3, p54nrb/NONO, and TFEB. Examples of thyroid cancer related genes include AKAP10, NTRK1, AKAP9, RET, BRAF, TFG, ELE1, TPM3, H4/D10S170, and TPR. Examples of ovarian cancer related genes include AKT2, MDM2, BCL2, MYC, BRCA1, NCOA4, CDKN2A, p53, ERBB2, PIK3CA, GATA4, RB, HRAS, RET, KRAS, and RNASET2. Examples of prostate cancer related genes include AR, KLK3, BRCA2, MYC, CDKN1B, NKX3.1, EZH2, p53, GSTP1, and PTEN. Examples of bone tumor related genes include CDH11, COL12A1, CNBP, OMD, COL1A1, THRAP3, COL4A5, and USP6.
In a preferred embodiment, the machine learning model is trained and used for classification of tissue types on the basis of whole slide images. Preferably, the machine learning model is trained and used for identification of gene mutations, such as BRAF mutations and/or NTRK fusions, as described in WO2020229152A1 and/or J. Hoehne et al.: Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts, Proceedings of Machine Learning Research 156, 2021, pages 1-12, the contents of which are incorporated by reference in their entirety into this specification.
For example, the machine learning model can be trained to detect signs of the presence of oncogenic drivers in patient tissue images stained with hematoxylin and eosin.
F. Penault-Llorca et al. describe a testing algorithm for identification of patients with TRK fusion cancer (see J. Clin. Pathol., 2019, 72, 460-467). The algorithm comprises immunohistochemistry (IHC) studies, fluorescence in situ hybridization (FISH) and next-generation sequencing.
Immunohistochemistry provides a routine method to detect protein expression of NTRK genes. However, performing immunohistochemistry requires additional tissue section(s), time to process and interpret (following the initial hematoxylin and eosin staining on which the tumor diagnosis is based), and specific skills, and the correlation between protein expression and gene fusion status is not trivial. Interpretation of IHC results requires the skills of a trained and certified medical professional pathologist.
Similar practical challenges hold true for other molecular assays such as FISH.
Next-generation sequencing provides a precise method to detect NTRK gene fusions. However, performing gene analyses for each patient is expensive, tissue consuming (not always feasible when available tissue specimen is minimal, as in diagnostic biopsies), not universally available in various geographic locations or diagnostic laboratories/healthcare institutions and, due to the low incidence of NTRK oncogenic fusions, inefficient.
There is therefore a need for a comparatively rapid and inexpensive method to detect signs of the presence of specific tumors.
It is proposed to train a machine learning model as described in this disclosure to assign histopathological images of tissues from patients to one of at least two classes, where one class comprises images showing tissue in which a specific gene mutation is present, such as NTRK or BRAF.
It is proposed to use the trained machine learning model as a preliminary test. Patients in whom the specific mutation can be detected are then subjected to a standard examination such as IHC, FISH and/or next-generation sequencing to verify the finding.
Additional studies may also be considered, such as other forms of medical imaging (CT scans, MRI, etc.) that can be co-assessed using Al to generate multimodal biomarkers/characteristics for diagnostic purposes.
The machine learning model of the present disclosure can, e.g., be used to a) detect NTRK fusion events in one or more indications, b) detect NTRK fusion events in other indications than in those being trained on (i.e., an algorithm trained on thyroid data sets is useful in lung cancer data sets), c) detect NTRK fusion events involving other TRK family members (i.e., an algorithm trained on NTRK1, NTRK3 fusions is useful to predict also NTRK2 fusions), d) detect NTRK fusion events involving other fusion partners (i.e., an algorithm trained on LMNA-fusion data sets is useful also in TPM3-fusion data sets), e) discover novel fusion partners (i.e., an algorithm trained on known fusion events might predict a fusion in a new data set which is then confirmed via molecular assay to involve a not yet described fusion partner of an NTRK family member), f) catalyze the diagnostic workflow and clinical management of patients, offering a rapid, tissue-sparing, low-cost method to indicate the presence of NTRK fusions (and ultimately others) and identifying patients that merit further downstream molecular profiling so as to provide precision medicines targeting specific molecular aberrations (e.g., NTRK-fusion inhibitors), g) identify specific genetic aberrations based on histological specimens, which can additionally be used to confirm/exclude or re-label certain tumor diagnoses in cases where the presence or absence of this/these alteration(s) is pathognomonic of specific tumors.
Identification of specific genetic aberrations based on histological specimens can additionally be used to confirm/exclude or re-label certain tumor diagnoses in cases where the presence or absence of this/these alteration(s) is pathognomonic of specific tumors.
Histopathological images used for training and prediction of the machine learning model can be obtained from patients by biopsy or surgical resection specimens.
In a preferred embodiment, a histopathological image is a microscopic image of tumor tissue of a human patient. The magnification factor is preferably in the range of 10 to 60, more preferably in the range of 20 to 40, whereas a magnification factor of, e.g., "20" means that a distance of 0.05 mm in the tumor tissue corresponds to a distance of 1 mm in the image (0.05 mm x 20 = 1 mm).
In a preferred embodiment, the histopathological image is a whole-slide image.
In a preferred embodiment, the histopathological image is an image of a stained tumor tissue sample. One or more dyes can be used to create the stained images. Preferred dyes are hematoxylin and/or eosin.
Methods for creating histopathological images, in particular stained whole-slide microscopy images, are extensively described in scientific literature and textbooks (see e.g. S. K. Suvarna et al.: Bancroft's Theory and Practice of Histological Techniques, 8th Ed., Elsevier 2019, ISBN 978-0-7020-6864-5; A. F. Frangi et al.: Medical Image Computing and Computer Assisted Intervention - MICCAI 2018, 21st International Conference, Granada, Spain, 2018, Proceedings, Part II, ISBN 978-030-00933-5; L. C. Junqueira et al.: Histologie, Springer 2001, ISBN 978-354-041858-0; N. Coudray et al.: Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning, Nature Medicine, Vol. 24, 2018, pages 1559-1567).
The machine learning model can also be configured to generate a probability value, the probability value indicating the probability of a patient suffering from cancer, e.g., caused by an NTRK oncogenic fusion. The probability value can be outputted to a user and/or stored in a database. The probability value can be a real number in the range from 0 to 1, whereas a probability value of 0 usually means that it is impossible that the cancer is caused by an NTRK oncogenic fusion, and a probability value of 1 usually means that there is no doubt that the cancer is caused by an NTRK oncogenic fusion. The probability value can also be expressed by a percentage.
In a preferred embodiment of the present invention, the probability value is compared with a predefined threshold value. In the event the probability value is lower than the threshold value, the probability that the patient suffers from cancer caused by an NTRK oncogenic fusion is low; treating the patient with a Trk inhibitor is not indicated; further investigations are required in order to determine the cause of cancer. In the event the probability value equals the threshold value or is greater than the threshold value, it is reasonable to assume that the cancer is caused by an NTRK oncogenic fusion; the treatment of the patient with a Trk inhibitor can be indicated; further investigations to verify the assumption can be initiated (e.g., performing a genetic analysis of the tumor tissue).
The threshold value can be a value between 0.5 and 0.99999999999, e.g. 0.8 (80%) or 0.81 (81%) or 0.82 (82%) or 0.83 (83%) or 0.84 (84%) or 0.85 (85%) or 0.86 (86%) or 0.87 (87%) or 0.88 (88%) or 0.89 (89%) or 0.9 (90%) or 0.91 (91%) or 0.92 (92%) or 0.93 (93%) or 0.94 (94%) or 0.95 (95%) or 0.96 (96%) or 0.97 (97%) or 0.98 (98%) or 0.99 (99%) or any other value (percentage). The threshold value can be determined by a medical expert.
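As an illustration, the comparison of the probability value with such a threshold can be sketched as follows; the threshold of 0.9 is merely one of the example values above and would in practice be set by a medical expert:

```python
# Minimal sketch of the threshold comparison described above; names are illustrative.
def interpret_probability(probability: float, threshold: float = 0.9) -> str:
    if probability >= threshold:
        return "NTRK oncogenic fusion plausible - confirmatory testing (e.g. NGS) indicated"
    return "NTRK oncogenic fusion unlikely - further investigation of other causes required"

print(interpret_probability(0.93))  # at or above the threshold
print(interpret_probability(0.42))  # below the threshold
```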
Besides a histopathological image, additional patient data can also be included in the classification. Additional patient data can be, e.g., anatomic or physiology data of the patient, such as information about the patient's height and weight, gender, age, vital parameters (such as blood pressure, breathing frequency and heart rate), tumor grades, ICD-9 classification, oxygenation of the tumor, degree of metastasis of the tumor, blood count values, tumor indicator values like the PA value, information about the tissue the histopathological image is created from (e.g. tissue type, organ), further symptoms, medical history, etc. Also, the pathology report of the histopathological images can be used for classification, using text mining approaches. Also, a next generation sequencing raw data set which does not cover the TRK genes' sequences can be used for classification.
The operations in accordance with the teachings herein may be performed by at least one computer specially constructed for the desired purposes or general purpose computer specially configured for the desired purpose by at least one computer program stored in a typically non-transitory computer readable storage medium. The term “non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
The term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, embedded cores, computing system, communication devices, processors (e.g., digital signal processor (DSP)), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.
The term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g., electronic, phenomena which may occur or reside e.g., within registers and/or memories of at least one computer or processor. The term processor includes a single processing unit or a plurality of distributed or remote such units.
Fig. 6 illustrates a computer system (1) according to some example implementations of the present disclosure in more detail. The computer may include one or more of each of a number of components such as, for example, processing unit (20) connected to a memory (50) (e.g., storage device).
The processing unit (20) may be composed of one or more processors alone or in combination with one or more memories. The processing unit is generally any piece of computer hardware that is capable of processing information such as, for example, data, computer programs and/or other suitable electronic information. The processing unit is composed of a collection of electronic circuits some of which may be packaged as an integrated circuit or multiple interconnected integrated circuits (an integrated circuit at times more commonly referred to as a “chip”). The processing unit may be configured to execute computer programs, which may be stored onboard the processing unit or otherwise stored in the memory (50) of the same or another computer.
The processing unit (20) may be a number of processors, a multi-core processor or some other type of processor, depending on the particular implementation. Further, the processing unit may be implemented using a number of heterogeneous processor systems in which a main processor is present with one or more secondary processors on a single chip. As another illustrative example, the processing unit may be a symmetric multi-processor system containing multiple processors of the same type. In yet another example, the processing unit may be embodied as or otherwise include one or more ASICs, FPGAs or the like. Thus, although the processing unit may be capable of executing a computer program to perform one or more functions, the processing unit of various examples may be capable of performing one or more functions without the aid of a computer program. In either instance, the processing unit may be appropriately programmed to perform functions or operations according to example implementations of the present disclosure.
The memory (50) is generally any piece of computer hardware that is capable of storing information such as, for example, data, computer programs (e.g., computer-readable program code (60)) and/or other suitable information either on a temporary basis and/or a permanent basis. The memory may include volatile and/or non-volatile memory, and may be fixed or removable. Examples of suitable memory include random access memory (RAM), read-only memory (ROM), a hard drive, a flash memory, a thumb drive, a removable computer diskette, an optical disk, a magnetic tape or some combination of the above. Optical disks may include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W), DVD, Blu-ray disk or the like. In various instances, the memory may be referred to as a computer-readable storage medium. The computer-readable storage medium is a non-transitory device capable of storing information, and is distinguishable from computer-readable transmission media such as electronic transitory signals capable of carrying information from one location to another. Computer-readable medium as described herein may generally refer to a computer-readable storage medium or computer-readable transmission medium.
In addition to the memory' (50), the processing unit (20) may also be connected to one or more interfaces for displaying, transmitting and/or receiving information. The interfaces may include one or more communications interfaces and/or one or more user interfaces. The communications interface(s) may be configured to transmit and/or receive information, such as to and/or from other computer(s), network(s), database(s) or the like. The communications interface may be configured to transmit and/or receive information by physical (wired) and/or wireless communications links. The communications interface(s) may include interface(s) (41) to connect to a network, such as using technologies such as cellular telephone, Wi-Fi, satellite, cable, digital subscriber line (DSL), fiber optics and the like. In some examples, the communications interface(s) may include one or more short-range communications interfaces (42) configured to connect devices using short-range communications technologies such as NFC, RFID, Bluetooth, Bluetooth LE, ZigBee, infrared (e.g., IrDA) or the like.
The user interfaces may include a display (30). The display may be configured to present or otherwise display information to a user, suitable examples of which include a liquid crystal display (LCD), light-emitting diode display (LED), plasma display panel (PDP) or the like. The user input interface(s) (11) may be wired or wireless, and may be configured to receive information from a user into the computer system (1), such as for processing, storage and/or display. Suitable examples of user input interfaces include a microphone, image or video capture device, keyboard or keypad, joystick, touch-sensitive surface (separate from or integrated into a touchscreen) or the like. In some examples, the user interfaces may include automatic identification and data capture (AIDC) technology (12) for machine-readable information. This may include barcode, radio frequency identification (RFID), magnetic stripes, optical character recognition (OCR), integrated circuit card (ICC), and the like. The user interfaces may further include one or more interfaces for communicating with peripherals such as printers and the like.
As indicated above, program code instructions may be stored in memory, and executed by processing unit that is thereby programmed, to implement functions of the systems, subsystems, tools and their respective elements described herein. As will be appreciated, any suitable program code instructions may be loaded onto a computer or other programmable apparatus from a computer-readable storage medium to produce a particular machine, such that the particular machine becomes a means for implementing the functions specified herein. These program code instructions may also be stored in a computer-readable storage medium that can direct a computer, processing unit or other programmable apparatus to function in a particular manner to thereby generate a particular machine or particular article of manufacture. The instructions stored in the computer-readable storage medium may produce an article of manufacture, where the article of manufacture becomes a means for implementing functions described herein. The program code instructions may be retrieved from a computer-readable storage medium and loaded into a computer, processing unit or other programmable apparatus to configure the computer, processing unit or other programmable apparatus to execute operations to be performed on or by the computer, processing unit or other programmable apparatus.
Retrieval, loading and execution of the program code instructions may be performed sequentially such that one instruction is retrieved, loaded and executed at a time. In some example implementations, retrieval, loading and/or execution may be performed in parallel such that multiple instructions are retrieved, loaded, and/or executed together. Execution of the program code instructions may produce a computer-implemented process such that the instructions executed by the computer, processing circuitry or other programmable apparatus provide operations for implementing functions described herein.
Execution of instructions by processing unit, or storage of instructions in a computer-readable storage medium, supports combinations of operations for performing the specified functions. In this manner, a computer system (1) may include processing unit (20) and a computer-readable storage medium or memory (50) coupled to the processing circuitry, where the processing circuitry is configured to execute computer-readable program code (60) stored in the memory. It will also be understood that one or more functions, and combinations of functions, may be implemented by special purpose hardware-based computer systems and/or processing circuitry which perform the specified functions, or combinations of special purpose hardware and program code instructions.

Fig. 7 shows a preferred method of training the machine learning model of the present disclosure in the form of a flowchart. The method (100) comprises the steps:
(110) providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes,
(120) training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising:
(121) receiving a training image,
(122) generating a plurality of patches from the training image,
(123) selecting at least one patch from the plurality of patches,
(124) generating a feature vector on the basis of the selected patch,
(125) defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches,
(126) selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch,
(127) generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level,
(128) feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes,
(129) comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, and modifying parameters of the machine learning model if the comparing reveals that the classes are different,
(130) storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
Fig. 8 shows a preferred method of using the trained machine learning model for classifying an image in the form of a flowchart. The computer-implemented method (200) comprises the steps:
(210) providing a trained machine learning model, wherein the trained machine learning model is configured and trained in a multiple instance training method as described herein to assign an image to one of at least two classes,
(220) receiving an image,
(230) inputting the received image into the trained machine learning model,
(240) receiving information about a class the received image is assigned to from the trained machine learning model,
(250) outputting the information.

Claims

1. A method comprising:
providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes,
training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising:
o receiving a training image,
o generating a plurality of patches from the training image,
o selecting at least one patch from the plurality of patches,
o generating a feature vector on the basis of the selected patch,
o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches,
o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch,
o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level,
o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes,
o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned,
o modifying parameters of the machine learning model if the comparing reveals that the classes are different,
storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
2. The method according to claim 1, wherein the sequence of the neighborhoods consists of a number n of neighborhoods N1, N2, ..., Nn, wherein n is an integer from 3 to 10.
3. The method according to any one of claims 1 or 2, wherein each neighborhood of the sequence of the neighborhoods encloses the at least one selected patch at its geometric center.
4. The method according to any one of claims 1 to 3, wherein each neighborhood of the sequence of neighborhoods encloses the at least one selected patch at its geometric center and forms a frame around the at least one selected patch, wherein the frames have an increasing width as the rank of the neighborhood increases in the sequence of the neighborhoods.
5. The method according to claim 4, wherein the width of the frame forming neighborhood Ni is 3^(i-1), wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
6. The method according to claim 5, wherein the at least one selected patch is characterized by coordinates x and y, and wherein the number of selected neighboring patches in the neighborhoods is selected from patches being characterized by coordinates: (x - 3^(i-1), y + 3^(i-1)); (x, y + 3^(i-1)); (x + 3^(i-1), y + 3^(i-1)); (x - 3^(i-1), y); (x + 3^(i-1), y); (x - 3^(i-1), y - 3^(i-1)); (x, y - 3^(i-1)); (x + 3^(i-1), y - 3^(i-1)), wherein i is an index that indicates the rank of the neighborhood in the sequence of the neighborhoods and can take a number from 1 to n.
7. The method according to any one of claims 1 to 6, comprising the steps:
selecting a number of first-order neighbors in a first neighborhood N1 of the sequence of neighborhoods,
generating a feature vector for each selected first-order neighbor,
generating a feature vector representing the first neighborhood N1 on the basis of the feature vectors generated for the selected first-order neighbors using an attention mechanism,
concatenating the feature vector representing the first neighborhood N1 with the feature vector of the at least one selected patch, thereby generating a first-order-aggregate feature vector.
8. The method according to claim 7, comprising the step: repeating the steps of claim 7 on the basis of first-order aggregates, thereby generating a second-order-aggregate feature vector.
9. The method according to claim 8, comprising the step: repeating the step of claim 8 on the basis of two-, three-, . . ., (n-1)th-order aggregates, thereby generating an (n-1)th-order-aggregate feature vector.
10. The method according to claim 9, wherein the (n-1)th-order-aggregate feature vector forms the joint feature vector or a joint feature vector is generated from all (n-1)th-order-aggregate feature vectors of all selected patches using concatenation.
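The per-neighborhood attention aggregation of claims 7 to 10 could be sketched as follows; the module structure, dimensions, and the tanh attention scoring (in the style of attention-based MIL) are assumptions made for this example.

```python
import torch
import torch.nn as nn

class NeighborhoodAggregator(nn.Module):
    """Illustrative sketch of one aggregation step of claims 7 to 10."""

    def __init__(self, feat_dim: int, attn_dim: int = 128):
        super().__init__()
        # Attention scoring over the neighbor feature vectors of one neighborhood Ni
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, aggregate: torch.Tensor, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # neighbor_feats: (k, feat_dim) feature vectors of the k selected neighbors in Ni
        weights = torch.softmax(self.attention(neighbor_feats), dim=0)  # (k, 1)
        neighborhood_vec = (weights * neighbor_feats).sum(dim=0)        # vector representing Ni
        # Concatenating with the current aggregate yields the next-order aggregate
        return torch.cat([aggregate, neighborhood_vec], dim=-1)
```

Applied once per neighborhood, with the output of one level serving as the aggregate fed into the next, the final output plays the role of the joint feature vector of claim 10.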
11. The method according to any one of claims 1 to 10, wherein one class of the at least two classes comprises images showing tissue in which a specific gene mutation is present, preferably a mutation affecting one or more of the following genes: HER2, TOP2A, HER3, EGFR, P53, MET, ALK, FLT3, AXL, FLT4, DDR2, EGFR, HER4, EML4-ALK, IGF1R, EPHA1, INSR, EPHA2, IRR, EPHA3, KIT, EPHA4, LTK, EPHA5, MER, EPHA6, MET, EPHA7, MUSK, EPHA8, NPM1-ALK, EPHB1, PDGFRα, EPHB2, PDGFRβ, EPHB3, RET, EPHB4, RON, FGFR1, ROS, FGFR2, TIE2, FGFR3, TRKA, FGFR4, TRKB, FLT1, TRKC, ATM, BRCA1, BRCA2, BRCA3, CCND1, E-Cadherin, ERBB2, ETV6, FGFR1, HRAS, KRAS, NRAS, NTRK3, p53, PTEN, BCL2, BRIM, CCND1, CDKN1A, CDKN2A, CTNNB1, HES1, MAP2, MEN1, NF1, NOTCH1, NUT, RAF, SDHD, VEGFA, APC, MSH6, AXIN2, MYH, BMPR1A, p53, DCC, PMS2, KRAS2, PTEN, MLH1, SMAD4, MSH2, STK11, MSH6, PTEN, CCND1, RASSF1A, CDKN2A, RB1, EGFR, RET, EML4, ROS1, KRAS2, TP53, MYC, Axin1, MALAT1, β-catenin, p16 INK4A, c-ERBB-2, p53, CTNNB1, RB1, Cyclin D1, SMAD2, EGFR, SMAD4, IGFR2, TCF1, KRAS, Alpha, PRCC, ASPSCR1, PSF, CLTC, TFE3, p54nrb/NONO, TFEB, AKAP10, NTRK1, AKAP9, RET, BRAF, TFG, ELE1, TPM3, H4/D10S170, TPR, AKT2, MDM2, BCL2, MYC, BRCA1, NCOA4, CDKN2A, p53, ERBB2, PIK3CA, GATA4, RB, HRAS, RET, KRAS, RNASET2, AR, KLK3, BRCA2, MYC, CDKN1B, NKX3.1, EZH2, p53, GSTP1, CDH11, COL12A1, CNBP, OMD, COL1A1, THRAP3, COL4A5, USP6.
12. The method according to any one of claims 1 to 11, wherein each training image is a histopathological image of tissue from a patient stained with hematoxylin and eosin.
13. The method according to any one of claims 1 to 12, further comprising: receiving a new image, generating a plurality of patches based on the new image, inputting the patches into the trained machine learning model, receiving a classification result from the trained machine learning model, outputting the classification result.
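A brief sketch of how the inference steps of claim 13 might look in use; the `trained_model.predict` interface, the PIL-based tiling, and the 512-pixel patch size are placeholders assumed for this example only.

```python
from PIL import Image

def classify_new_image(path, trained_model, patch_size=512):
    """Illustrative sketch of claim 13: tile a new image into patches, feed them
    to the trained model, and output the classification result."""
    image = Image.open(path)
    width, height = image.size
    patches = [
        image.crop((x, y, x + patch_size, y + patch_size))   # generating a plurality of patches
        for y in range(0, height - patch_size + 1, patch_size)
        for x in range(0, width - patch_size + 1, patch_size)
    ]
    result = trained_model.predict(patches)   # inputting the patches, receiving the result
    print(result)                             # outputting the classification result
    return result
```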
14. The method according to any one of claims 1 to 13, wherein the machine learning model is trained, and the trained machine learning model is used to assign histopathological images of tissues from patients to one of at least two classes, wherein one class comprises images showing tissue in which a NTRK or BRAF gene mutation is present.
15. A computer system comprising: a processor; and a memory storing an application program configured to perform, when executed by the processor, an operation, the operation comprising: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
16. A non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor of a computer system, cause the computer system to execute the following steps: providing a machine learning model, wherein the machine learning model is configured to assign at least one patch of an image to one of at least two classes, training the machine learning model on a plurality of training images, each training image of the plurality of training images being assigned to one of the at least two classes, the training comprising: o receiving a training image, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different, storing the trained machine learning model and/or using the trained machine learning model to classify one or more new images.
17. Use of a trained machine learning model for the detection, identification, and/or characterization of tumor types and/or gene mutations in tissues, wherein training of the machine learning model comprises: o receiving a training image, the training image being assigned to one of at least two classes, wherein at least one class comprises images of tumor tissue and/or tissue in which a gene mutation is present, o generating a plurality of patches from the training image, o selecting at least one patch from the plurality of patches, o generating a feature vector on the basis of the selected patch, o defining a sequence of neighborhoods to the selected patch, wherein neighborhoods with increasing rank in the sequence of neighborhoods have an increasing distance to the selected patch and contain increasingly more patches, o selecting a number of neighboring patches in the neighborhoods and generating a feature vector for each selected neighboring patch, o generating a joint feature vector on the basis of the feature vector of the selected patch and the feature vectors of the selected neighboring patches, wherein the feature vectors of the selected neighboring patches are aggregated at different hierarchy levels, wherein each hierarchy level corresponds to one of the neighborhoods, wherein the further the neighborhood is from the selected patch, the more feature vectors of neighboring patches are aggregated in a hierarchy level, o feeding the joint feature vector to a classifier, wherein the classifier is configured to assign the joint feature vector to one of the at least two classes, o comparing the class to which the joint feature vector is assigned with the class to which the training image is assigned, o modifying parameters of the machine learning model if the comparing reveals that the classes are different.
PCT/EP2023/057120 2022-04-08 2023-03-21 Multiple instance learning considering neighborhood aggregations WO2023194090A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263329127P 2022-04-08 2022-04-08
US63/329,127 2022-04-08

Publications (1)

Publication Number Publication Date
WO2023194090A1 true WO2023194090A1 (en) 2023-10-12

Family

ID=85800288

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/057120 WO2023194090A1 (en) 2022-04-08 2023-03-21 Multiple instance learning considering neighborhood aggregations

Country Status (1)

Country Link
WO (1) WO2023194090A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3639169A1 (en) 2017-06-13 2020-04-22 BostonGene, Corporation Systems and methods for generating, visualizing and classifying molecular functional profiles
WO2020229152A1 (en) 2019-05-10 2020-11-19 Bayer Consumer Care Ag Identification of candidate signs indicative of an ntrk oncogenic fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3639169A1 (en) 2017-06-13 2020-04-22 BostonGene, Corporation Systems and methods for generating, visualizing and classifying molecular functional profiles
WO2020229152A1 (en) 2019-05-10 2020-11-19 Bayer Consumer Care Ag Identification of candidate signs indicative of an ntrk oncogenic fusion

Non-Patent Citations (19)

* Cited by examiner, † Cited by third party
Title
A. F. FRANGI ET AL.: "Medical Image Computing and Computer Assisted Intervention - MICCAI 2018", INTERNATIONAL CONFERENCE GRANADA, 2018
A. V. KONSTANTINOV ET AL.: "Multi-Attention Multiple Instance Learning", ARXIV:2112.06071
A. V. KONSTANTINOV ET AL.: "Multi-Attention Multiple Instance Learning", ARXIV:2112.06071
A. V. KONSTANTINOV ET AL.: "Multi-Attention Multiple Instance Learning", ARXIV:2112.06071V1
F. PENAULT-LLORCA ET AL., J. CLIN. PATHOL., vol. 72, 2019, pages 460 - 467
G.A. TSIHRINTZISL.C. JAIN: "Learning and Analytics in Intelligent Systems", vol. 18, 2020, SPRINGER NATURE, article "Machine Learning Paradigms: Advances in Deep Learning-based Technological Applications"
J. HOEHNE ET AL.: "Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts", PROCEEDINGS OF MACHINE LEARNING RESEARCH, vol. 156, 2021, pages 1 - 12
J. HOEHNE: " Detecting genetic alterations in BRAF and NTRK as oncogenic drivers in digital pathology images: towards model generalization within and across multiple thyroid cohorts", PROCEEDINGS OF MACHINE LEARNING RESEARCH, vol. 156, 2021, pages 1 - 12
K. GRZEGORCZYK: "Vector representations of text data in deep learning", DOCTORAL DISSERTATION, 2018
KONSTANTINOV ANDREI V. ET AL: "Multi-attention multiple instance learning", NEURAL COMPUTING AND APPLICATIONS, 11 December 2021 (2021-12-11), pages 1 - 23, XP093051531, Retrieved from the Internet <URL:https://arxiv.org/pdf/2112.06071.pdf> [retrieved on 20230602], DOI: 10.1007/s00521-022-07259-5 *
L.C. JUNQUEIRA ET AL.: "Histologie", 2001, SPRINGER
M. ILSE ET AL.: "Attention-based Deep Multiple Instance Learning", ARXIV: 1802.04712V4
M. ILSE ET AL.: "Attention-based Deep Multiple Instance Learning", ARXIV:1802.04712V4
M. ILSE ET AL.: "Attention-based deep multiple instance learning", INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 2018, pages 2127 - 2136
M. ILSE ET AL.: "Handbook of Medical Image Computing and Computer Assisted Intervention", 2020, ELSEVIER, article "Deep multiple instance learning for digital histopathology", pages: 521 - 546
M. Y. LU ET AL.: "Al-based pathology predicts origins for cancers of unknown primary", NATURE, vol. 594, no. 7861, 2021, pages 106 - 110, XP037471046, DOI: 10.1038/s41586-021-03512-4
N. COUDRAY ET AL.: "Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning", NATURE MEDICINE, vol. 24, 2018, pages 1559 - 1567, XP036608997, DOI: 10.1038/s41591-018-0177-5
S. K. SUVARNA ET AL.: "Bancroft's Theory and Practice of Histological Techniques", 2019, ELSEVIER
SUDHARSHAN P J ET AL: "Multiple instance learning for histopathological breast cancer image classification", EXPERT SYSTEMS WITH APPLICATIONS, ELSEVIER, AMSTERDAM, NL, vol. 117, 24 September 2018 (2018-09-24), pages 103 - 111, XP085513817, ISSN: 0957-4174, DOI: 10.1016/J.ESWA.2018.09.049 *

Similar Documents

Publication Publication Date Title
Madabhushi et al. Image analysis and machine learning in digital pathology: Challenges and opportunities
Li et al. Colonoscopy polyp detection and classification: Dataset creation and comparative evaluations
US10650520B1 (en) Systems and methods for training a statistical model to predict tissue characteristics for a pathology image
US8831327B2 (en) Systems and methods for tissue classification using attributes of a biomarker enhanced tissue network (BETN)
Nazarian et al. Diagnostic accuracy of artificial intelligence and computer-aided diagnosis for the detection and characterization of colorectal polyps: systematic review and meta-analysis
CN111275130B (en) Multi-mode-based deep learning prediction method, system, medium and equipment
Phan et al. Multiscale integration of-omic, imaging, and clinical data in biomedical informatics
Zhang et al. Artificial intelligence-assisted esophageal cancer management: Now and future
Linmans et al. Predictive uncertainty estimation for out-of-distribution detection in digital pathology
Liang et al. Development of artificial intelligence technology in diagnosis, treatment, and prognosis of colorectal cancer
Bhattacharya et al. A review of artificial intelligence in prostate cancer detection on imaging
Bevilacqua Three-dimensional virtual colonoscopy for automatic polyps detection by artificial neural network approach: New tests on an enlarged cohort of polyps
Courot et al. Automatic cervical lymphadenopathy segmentation from CT data using deep learning
Mihelic et al. Segmentation-less, automated, vascular vectorization
Janse et al. Early esophageal cancer detection using RF classifiers
Corradini et al. Challenges in the use of artificial intelligence for prostate cancer diagnosis from multiparametric imaging data
Yu et al. Artificial intelligence in gastric cancer: A translational narrative review
Busby et al. Applications of artificial intelligence in prostate cancer histopathology
Modi et al. Role of Artificial Intelligence in Detecting Colonic Polyps during Intestinal Endoscopy
WO2023194090A1 (en) Multiple instance learning considering neighborhood aggregations
WO2023208663A1 (en) Multiple-instance learning based on regional embeddings
WO2023213623A1 (en) Dynamic sampling strategy for multiple-instance learning
Muis et al. CNN-based Approach for Enhancing Brain Tumor Image Classification Accuracy
Tanos et al. Is Computer-Assisted Tissue Image Analysis the Future in Minimally Invasive Surgery? A Review on the Current Status of Its Applications
Goel et al. A hybrid of modified YOLOv3 with BBO/EE optimizer for lung cancer detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23714647

Country of ref document: EP

Kind code of ref document: A1