WO2023101564A1 - Skin lesion classification system and method - Google Patents

Skin lesion classification system and method

Info

Publication number
WO2023101564A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification
classifiers
skin lesion
feature
transformer
Prior art date
2021-12-02
Application number
PCT/NZ2022/050154
Other languages
French (fr)
Inventor
Zhen Yu
Toan Nguyen
Zongyuan Ge
Paul BONNINGTON
Original Assignee
Kahu.Ai Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-11-25
Publication date
2023-06-08
Application filed by Kahu.Ai Limited filed Critical Kahu.Ai Limited
Priority to AU2022400601A priority Critical patent/AU2022400601A1/en
Publication of WO2023101564A1 publication Critical patent/WO2023101564A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30088Skin; Dermal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30096Tumor; Lesion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • General Engineering & Computer Science (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Pathology (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

Some previously proposed classification methods of skin lesions may lead to misclassifications. Disclosed herein is a system for classifying skin lesions. The system comprises a feature extraction module (204) that determines feature information from an image of a skin lesion. The system comprises a transformer (206, 208) that determines relationships between classifiers classifying distinct parameters of the skin lesion. The system comprises a hierarchical classifier (210) having at least two of the classifiers ordered based on the number of classification categories in each classifier. The classifiers classify distinct parameters in parallel with each other based on the feature information and the relationships between classifiers.

Description

SKIN LESION CLASSIFICATION SYSTEM AND METHOD
FIELD OF THE INVENTION
The invention relates to methods and systems for performing a risk assessment on a skin lesion.
BACKGROUND TO THE INVENTION
The incidence of skin cancer has been rising for several decades, and computer-aided diagnostic algorithms are desired for assisting dermatologists in diagnosing lesions more efficiently. Existing algorithms for skin cancer diagnosis either simply perform binary classification of benign versus malignant or classify lesions directly into multiple subcategories.
However, both settings have limitations: 1) outputting the probability of a lesion being non-cancerous or cancerous provides little information and may confuse dermatologists; 2) giving fine-level predictions helps a dermatologist to comprehensively understand a lesion, but learning a flat model to distinguish mixed sub-types of lesions ignores the separability and correlation among different classes, which may decrease the model's performance.
(Yan et al., 2015) proposed Hierarchical Deep Convolutional Neural Networks (HD-CNN) to classify objects by first focusing on coarse categories and then on fine categories. The study shows that the model can achieve better performance than non-hierarchical models. However, the approach has some limitations: it needs two steps of training for coarse and fine categories, and it cannot be used to classify data in a hierarchy having more than two levels.
(Zhu and Bain, 2017) presented a Branch Convolutional Neural Network (B-CNN) model that constructs multiple branches on top of different layers of a CNN to output multiple predictions corresponding to levels from coarse to fine in the hierarchical structure. However, this method computed the predictions independently without considering the hierarchical dependency among different classes. Computing classification results sequentially may lead to misclassification if images are classified into incorrect coarse categories.
It is an object of at least preferred embodiments to address at least some of the aforementioned disadvantages. An additional or alternative object is to at least provide the public with a useful choice.
SUMMARY OF THE INVENTION
In accordance with an aspect, a system for classifying skin lesions comprises: a feature extraction module that determines feature information from an image of a skin lesion; a transformer that determines relationships between classifiers classifying distinct parameters of the skin lesion; and a hierarchical classifier having at least two of the classifiers ordered based on the number of classification categories in each classifier. The classifiers classify distinct parameters in parallel with each other based on the feature information and the relationships between classifiers.
The term 'comprising' as used in this specification means 'consisting at least in part of'. When interpreting each statement in this specification that includes the term 'comprising', features other than that or those prefaced by the term may also be present. Related terms such as 'comprise' and 'comprises' are to be interpreted in the same manner.
In an embodiment the transformer further comprises: at least two encoders that determine the global context of the feature information; and at least two decoders that determine the dependencies of the classifiers in the hierarchical classifier. The classifiers classify distinct parameters in parallel with each other based on the global context and the dependencies.
In accordance with a further aspect of the invention, a computer implemented method for classifying skin lesions comprises: determining feature information from an image of a skin lesion; determining relationships between classifiers classifying distinct parameters of the skin lesion; classifying distinct parameters with at least two of the classifiers in parallel, wherein the classifiers are ordered in a hierarchy based on the number of classification categories in each classifier. The invention in one aspect comprises several steps. The relation of one or more of such steps with respect to each of the others, the apparatus embodying features of construction, and combinations of elements and arrangement of parts that are adapted to affect such steps, are all exemplified in the following detailed disclosure.
To those skilled in the art to which the invention relates, many changes in construction and widely differing embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and the descriptions herein are purely illustrative and are not intended to be in any sense limiting. Where specific integers are mentioned herein which have known equivalents in the art to which this invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.
In addition, where features or aspects of the invention are described in terms of Markush groups, those persons skilled in the art will appreciate that the invention is also thereby described in terms of any individual member or subgroup of members of the Markush group.
As used herein, '(s)' following a noun means the plural and/or singular forms of the noun.
As used herein, the term 'and/or' means 'and' or 'or' or both.
It is intended that reference to a range of numbers disclosed herein (for example, 1 to 10) also incorporates reference to all rational numbers within that range (for example, 1, 1.1, 2, 3, 3.9, 4, 5, 6, 6.5, 7, 8, 9, and 10) and also any range of rational numbers within that range (for example, 2 to 8, 1.5 to 5.5, and 3.1 to 4.7) and, therefore, all sub-ranges of all ranges expressly disclosed herein are hereby expressly disclosed. These are only examples of what is specifically intended and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application in a similar manner.
In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, reference to such external documents or such sources of information is not to be construed as an admission that such documents or such sources of information, in any jurisdiction, are prior art or form part of the common general knowledge in the art.
Although the present invention is broadly as defined above, those persons skilled in the art will appreciate that the invention is not limited thereto and that the invention also includes embodiments of which the following description gives examples.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred forms of the system and method will now be described by way of example only with reference to the accompanying figures in which:
Figure 1 shows an example of classification levels in a hierarchical skin lesion classification;
Figure 2 shows an embodiment system for hierarchical skin lesion classification;
Figure 3 shows a schematic view of an embodiment transformer encoder in the system of figure 2;
Figure 4 shows another schematic view of the embodiment transformer encoder in figure 3;
Figure 5 shows a schematic view of an embodiment transformer encoder block and transformer decoder block in the system of figure 2;
Figure 6 shows an example of results obtained from the hierarchical skin lesion classification;
Figure 7 shows a schematic view of a hierarchical knowledge distillation training strategy;
Figure 8 shows an embodiment method for hierarchical skin lesion classification; and
Figure 9 shows an example of skin lesion classification classes organised in a 3-level hierarchical semantic tree.
DETAILED DESCRIPTION
Disclosed herein is a system and method for hierarchical skin lesion classification that reflects hierarchical structure on skin diseases and improves the inference of classification results.
The system and method capture the relationship between different classification levels that classify different parameters of skin lesions and allow hierarchical classification to be performed in a parallel way. Figure 1 shows an example of the different classification levels involved in a hierarchical skin lesion classification. Each classification level 102, 104, 106 classifies a distinct parameter of the skin lesion. For example, classification level 102 classifies a skin lesion in input image 202 into the benign or malignant categories/classes. Classification level 104 classifies the skin lesion in input image 202 according to 8 categories/classes. Classification level 106 classifies the skin lesion in input image 202 according to 65 categories/classes. In this example, the lower or finer classification levels 104, 106 have more classification categories than their previous (coarser) classification levels.
Figure 2 shows a system 200. The system 200 is a CNN Transformer model. The system has a feature extraction module 204 that is a convolutional neural network (CNN) for example. The feature extraction module 204 determines feature information from an image of a skin lesion. As shown in Figure 2, the feature information is a feature map 212 extracted from an input image 202 using feature extraction module 204. The input image is an image of a skin lesion. In other words, the feature map 212 is the output of the feature extraction module 204.
Each input image 202 can be denoted with a symbol $X_i$, where $i$ indicates that the image is the i-th input image. For each input image 202, the feature extraction module 204 backbone extracts a feature map 212 representation $F_i \in \mathbb{R}^{H \times W \times C}$. Feature map 212 is a 3D feature map with a shape of $H \times W \times C$.
The feature map 212 is provided as input to the transformer encoder 206. In an example, the transformer encoder 206 uses a sequence as input. In order for the transformer encoder to receive the feature map 212 as a sequence, the feature map 212 can be flattened or collapsed to a 2D feature representation along the spatial dimensions ($H \times W$), resulting in a feature sequence 214 represented as $F_i' \in \mathbb{R}^{HW \times C}$. Feature sequence 214 consists of $H \times W$ vectors, each of which is $C$-dimensional.
Corresponding position encodings are obtained for each feature sequence or collapsed feature map. This allows the spatial information of the feature map 212 to be retained when it is collapsed to a feature sequence 214. In other words, the pixel relationship in an image is preserved. The position encodings $P \in \mathbb{R}^{HW \times C}$ and the feature sequence 214 form feature vectors that can be represented as $Z_i = F_i' + P$. Each feature vector corresponds to a spatial location (pixel) of the feature map of the input image. As demonstrated, the feature information of the input image can be represented in more than one way: as a feature map 212, a feature sequence 214 or feature vectors, for example.
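A minimal sketch of this feature-extraction and flattening step is given below, assuming a ResNet-34 backbone, 384x384 inputs and learned position encodings; the class name `CnnToSequence` and the fixed 12x12 feature grid are illustrative assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CnnToSequence(nn.Module):
    """Sketch: CNN backbone -> flattened feature sequence + learned position encodings."""
    def __init__(self, embed_dim=512, grid_size=12):
        super().__init__()
        backbone = models.resnet34(weights=None)  # pre-trained weights could be loaded here
        # keep everything up to the last convolutional stage (drop avgpool and fc)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # one learnable position encoding per spatial location (H*W positions)
        self.pos_embed = nn.Parameter(torch.zeros(1, grid_size * grid_size, embed_dim))

    def forward(self, x):                       # x: (B, 3, 384, 384)
        fmap = self.backbone(x)                 # feature map F_i: (B, C, H, W) = (B, 512, 12, 12)
        seq = fmap.flatten(2).transpose(1, 2)   # feature sequence F_i': (B, H*W, C)
        return seq + self.pos_embed             # position-aware feature vectors Z_i

z = CnnToSequence()(torch.randn(2, 3, 384, 384))
print(z.shape)  # torch.Size([2, 144, 512])
```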
A transformer determines relationships between classifiers 210 classifying distinct parameters of the skin lesion. The transformer has a transformer encoder 206 and transformer decoder 208. The transformer encoder 206 has at least two encoding blocks that determine the global context of the feature information. The transformer decoder 208 has at least two decoding blocks that determine the dependencies of the classifiers in the hierarchical classifier.
The transformer encoder 206 receives the feature sequence 214 and positional encoding as the feature vector to learn global context. Transformer encoder 206 obtains global image context from local image features. Figure 3 shows an example of the transformer encoder
206 having multiple encoding blocks 302, 304, 306. In the example shown in Figure 4, each of the multiple encoding blocks has a multi-head self-attention layer (MSA), a multi-layer perceptron layer (MLP), a patch attention layer (PA), and a class attention layer (CA). Each encoding block may also have a feed-forward layer (FF). The MSA layer first normalises the feature vectors using a layer normalization function LN(), then maps the normalized features into separate vectors using linear projections. In an example, these vectors are query vectors $Q \in \mathbb{R}^{HW \times C}$, key vectors $K \in \mathbb{R}^{HW \times C}$ and value vectors $V \in \mathbb{R}^{HW \times C}$. These vectors are used so that self-attention does not have to be performed by multiplying the normalized feature vectors by themselves. However, in another example the normalized feature vectors do not have to be separated into separate vectors. The MSA layer then generates new feature vectors 216, represented as $Z^l$, using a self-attention function (with $Z^0 = Z_i$):

$$[Q^l, K^l, V^l] = \mathrm{Linear}(\mathrm{LN}(Z^{l-1})), \quad l = 1, \dots, L$$
$$Z^l = A(Q^l, K^l, V^l) + Z^{l-1}$$

where Linear() denotes a linear function realised by a fully connected layer.
LN() and A() represent a layer normalization function and the attention function respectively. The attention function is used for the self-attention procedure in each encoding block. Self-attention is a mechanism for computing relations within an input sequence. The self-attention compares each feature vector to the rest and generates a group of weights. Each weight is a number indicating the similarity of each feature vector to all other feature vectors. Each feature vector can be re-computed by multiplying its weights by the other feature vectors.
The pairwise dot product $QK^\top$ measures how similar each pair of query vector and key vector is. The softmax function (Softmax()) computes a group of attention weights for every query vector and key vector. The output of the attention function is

$$A(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $d$ is the dimension of the key vectors and a value vector gets a larger weight if its corresponding key vector has a larger dot product with its corresponding query vector. In other words, a larger weight is based on the attention weight of the softmax function. The transformer encoder mechanism enables the global context of the feature vectors to be obtained by relating every feature vector to the others. Self-attention is used to calculate relationships between all feature vectors, which are projected to query, key and value vectors. The weight is based on the dot product of the query and key vectors and indicates the strength of correlation between a feature vector and the others. The larger the weight, the stronger the correlation of the feature vector. The weights denote the global context representing the semantic information of the entire image. A sequence containing multiple feature vectors is represented as $Z_i$. Self-attention therefore calculates the relation of each feature vector within $Z_i$ to the others. In other words, a set of local image features represented by the feature vectors is extracted by the CNN. The transformer encoder then aggregates the local features by determining the relationship between all the local features using global context.
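As a rough illustration of one such encoding block, the following pre-norm sketch keeps only the multi-head self-attention and MLP layers (the patch-attention and class-attention layers are omitted), with an assumed head count and MLP width:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one encoder block: LN -> multi-head self-attention -> LN -> MLP, with residuals."""
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, z):                                 # z: (B, HW, C) feature vectors
        h = self.norm1(z)
        # query, key and value are projections of the same normalised sequence, so every
        # feature vector attends to every other one (global context)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        z = z + attn_out
        z = z + self.mlp(self.norm2(z))
        return z

z = EncoderBlock()(torch.randn(2, 144, 512))
print(z.shape)  # torch.Size([2, 144, 512])
```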
The transformer decoder 208 receives feature vectors 216 from the transformer encoder 206 and classification task queries to compute output embeddings for each classification task. In an example, the transformer decoder 208 is similar to the transformer encoder 206 and contains multiple decoding blocks, each consisting of two MSA layers and an FF layer. The difference between the transformer encoder 206 and the transformer decoder 208 lies in their inputs. The transformer decoder 208 receives feature vectors 216 from the encoder 206 along with classification task queries $T \in \mathbb{R}^{M \times C}$, where $M$ is the number of classification levels. The classification task queries are randomly initialized vectors that are updated during model training. Each task has a task query vector that interacts with global image features for final classification. The classification task queries are learnable parameters that can be updated to compute output embeddings for each classification task respectively by the transformer decoder 208.
Hierarchical classification classifiers 210 have more than one level of classification that ranges from a coarse level of classification (level-0) to a finer level of classification (level-2). The classification task queries are used to determine the output embedding for each level of classification 102, 104, 106. Each classification level has one of classifiers 210 to carry out the classification.
In the transformer decoder 208, each classification task query is associated with the feature vectors 216 from the transformer encoder 206, which capture the global image context, for generating the final prediction for the task using cross-attention. In the example shown in Figure 1, the three levels of classification have three corresponding tasks. The first task is to classify whether a skin lesion in input image 202 is benign or malignant. The second task is to classify the skin lesion in input image 202 according to 8 classes in classification level-1. The third task is to classify the skin lesion in input image 202 according to 65 classes in classification level-2. Therefore, each of the three levels of classification has a corresponding task and classification task query. Each classification task query is a task query vector used for cross attention.
The transformer decoder 208 performs the following calculations, where $U^0$ is a zero-initialized tensor with the same size as the classification task queries and MA() denotes the multi-head attention function:

$$\hat{U}^l = \mathrm{MA}(Q^l, K^l, V^l) + U^{l-1}, \quad [Q^l, K^l, V^l] = \mathrm{Linear}\big(\mathrm{LN}(U^{l-1} + T)\big)$$
$$U^l = \mathrm{MA}\big(\mathrm{LN}(\hat{U}^l), Z^L, Z^L\big) + \hat{U}^l$$

Within each decoder block, the first attention $\hat{U}^l$ performs self-attention by computing attention weights based on the classification task queries and the output embeddings from the previous decoder block $U^{l-1}$. It links decoder blocks within the transformer decoder 208 and provides the relationship or dependencies between the different classification levels.

Again, self-attention is a mechanism for computing the relation between the input sequence of different classification tasks in the transformer decoder 208. For a series of classification task queries, the self-attention compares each classification task query to the rest and generates a group of weights. Each weight is a number that indicates the relationship between each classification task query and the other classification task queries. Each classification task query vector can be re-computed by multiplying its weights with the other classification task query vectors.

The second attention $U^l$ is cross attention (encoder-decoder attention) and transforms features from the transformer encoder 206 into output embeddings 218 for each task corresponding to each level of classification. The output embeddings $U = [u_0, u_1, \dots, u_{M-1}]$ are used to predict labels for the $M$ levels of hierarchical classification tasks. Figure 5 shows a visual representation of the interactions between the equations of each encoder and decoder block.
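A decoder-block sketch in this spirit — self-attention over the task queries and previous output embeddings, followed by cross-attention to the encoder features — is shown below; the three-level example follows Figure 1, while the exact residual and normalisation layout is an assumption rather than the patent's precise formulation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch: task-query self-attention, then cross-attention to the encoder output."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, u, task_queries, enc_out):
        # u: (B, M, C) output embeddings from the previous block (U^0 is zero-initialised)
        # task_queries: (B, M, C) one learnable query per classification level
        # enc_out: (B, HW, C) global-context feature vectors from the transformer encoder
        q = self.norm1(u + task_queries)
        u = u + self.self_attn(q, q, q, need_weights=False)[0]            # relate the M tasks to each other
        u = u + self.cross_attn(self.norm2(u), enc_out, enc_out, need_weights=False)[0]  # use image context
        u = u + self.ff(self.norm3(u))
        return u

# usage: M = 3 levels, zero-initialised U^0 and learnable task queries
B, M, C = 2, 3, 512
task_queries = nn.Parameter(torch.randn(1, M, C))
u = DecoderBlock()(torch.zeros(B, M, C), task_queries.expand(B, -1, -1), torch.randn(B, 144, C))
print(u.shape)  # torch.Size([2, 3, 512])
```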
Hierarchical classifiers 210 include at least two classifiers ordered based on the number of classification categories in each classifier. The classifiers 210 classify distinct parameters in parallel with each other, based on the global context of the feature information of an input image and the dependencies between the classifiers in the hierarchical classifier. The output embeddings 218 of the transformer decoder 208 allow multiple classification tasks in a hierarchical classification to be performed simultaneously. For example, all three classification levels shown in Figure 1 can be carried out simultaneously once the output embeddings for each corresponding classification level are passed into the corresponding classifiers 210 for calculating coarse-to-fine prediction scores.
Figure 6 shows an example of the results obtained from the hierarchical classification. Results 602 correspond to the classification level 102, results 604 correspond to the classification level 104 and results 606 correspond to the classification level 106.
Therefore, combining the self-attention and encoder-decoder attention in the transformer decoder enables the dependency between different classification levels to be taken into account when classifying, while still being able to use global image context in a parallel way. In particular, the self-attention mechanism used on the task query vectors determines the relationships between different classification tasks. Obtaining the relationship between different classification tasks allows all classification tasks in a hierarchical classification process to be performed concurrently or in parallel with each other. This provides more accurate classification results for a skin lesion than if classification tasks were performed separately without regard to the relationship between classification tasks.
The final prediction is computed by three separate classifiers 210, one for each classification level 102, 104, 106. Each classifier may consist of a dropout layer, a linear projection layer and a Softmax layer, for example. The outputs of the separate classifiers can be represented as a hierarchical array $[\hat{y}_0, \hat{y}_1, \dots, \hat{y}_{M-1}]$, ordered from a coarse level to a fine level of classification.
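The per-level classifiers could be realised as in the sketch below; the dropout rate is assumed, and the class counts (2, 8, 65) follow the example hierarchy of Figure 1:

```python
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    """Sketch: one classifier (dropout + linear + softmax) per level, applied in parallel."""
    def __init__(self, dim=512, level_classes=(2, 8, 65), p_drop=0.1):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Dropout(p_drop), nn.Linear(dim, n)) for n in level_classes
        )

    def forward(self, u):                       # u: (B, M, C) decoder output embeddings
        # head m sees only output embedding m; softmax turns logits into per-level probabilities
        logits = [head(u[:, m]) for m, head in enumerate(self.heads)]
        probs = [g.softmax(dim=-1) for g in logits]
        return logits, probs

logits, probs = HierarchicalHeads()(torch.randn(2, 3, 512))
print([tuple(p.shape) for p in probs])  # [(2, 2), (2, 8), (2, 65)]
```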
The system 200 can be used to perform a computer implemented method for classifying skin lesions using hierarchical classification as shown in Figure 8. The hierarchical classification method 800 uses feature extraction module 204 to determine 802 feature information from an input image 202 of a skin lesion. The feature information is a spatial feature map for example.
The spatial feature maps are used in a transformer model with the transformer encoder 206 and a transformer decoder 208. The transformer encoder 206 uses a self-attention mechanism to generate discriminative features. As described previously, the transformer encoder 206 uses self-attention to re-represent feature map 212 by considering global context. This results in more discriminative features, the feature vectors 216, being output by the transformer encoder 206. The discriminative features are input into the transformer decoder 208 along with classification task queries representing the different levels of classification tasks from coarse to fine level.
The transformer decoder 208 also uses a self-attention mechanism and a cross-attention mechanism to determine the relationships between the feature information and classifiers and determine 804 relationships between classifiers classifying distinct parameters of the skin lesion. At least two separate classifiers are used to classify 806 distinct parameters of the skin lesion in parallel. The classifiers are ordered in a hierarchy based on the number of classification categories in each classifier. The hierarchical classification method is capable of capturing relationships among different skin classes while performing hierarchical classification in parallel.
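Tying the preceding sketches together, one possible end-to-end forward pass for method 800 is outlined below; it simply reuses the hypothetical modules sketched above and is not the patent's actual implementation:

```python
import torch

def classify_lesion(image, cnn_to_seq, encoder_blocks, decoder_blocks, task_queries, heads):
    """Sketch of method 800: extract features (802), relate tasks (804), classify in parallel (806)."""
    z = cnn_to_seq(image)                                     # step 802: feature information
    for blk in encoder_blocks:                                # global context via self-attention
        z = blk(z)
    queries = task_queries.expand(image.size(0), -1, -1)
    u = torch.zeros_like(queries)                             # zero-initialised U^0
    for blk in decoder_blocks:                                # step 804: task relationships + cross-attention
        u = blk(u, queries, z)
    return heads(u)                                           # step 806: coarse-to-fine predictions in parallel
```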
In an example, the generalization performance can be improved and interaction between different levels of classifiers can be encouraged using a hierarchical knowledge distillation training strategy as shown in Figure 7. The hierarchical knowledge distillation training strategy may consist of ensemble knowledge distillation and mutual knowledge distillation.
Ensemble knowledge distillation aggregates the multiple outputs from the multiple classifiers. The aggregated outputs/predictions are aligned with each of the multiple outputs from each classifier during training. The ensemble knowledge distillation utilises both ensemble learning and knowledge distillation. The aggregated outputs or ensemble predictions can provide a better result by combining outputs from different classification levels, while knowledge distillation is capable of generating soft targets with the ensemble prediction to guide the learning of individual classifiers and improve their performance.
In an example, the logits from each level of classifier can be represented as $g_0, g_1, \dots, g_{M-1}$. An ensemble prediction can be constructed by linearly combining the logits of all classifiers into the coarsest level:

$$g_{\mathrm{ens}} = W^\top \cdot \mathrm{Concat}(g_0, g_1, \dots, g_{M-1})$$
$$P_{\mathrm{ens}} = \mathrm{Softmax}(g_{\mathrm{ens}})$$

where $W$ represents the parameters of the linear layer, and $g_{\mathrm{ens}}$ and $P_{\mathrm{ens}}$ denote the ensembled logit and probability respectively. The ensemble prediction can be optimised with a cross entropy loss. Since outputs from different classification heads vary in dimension, which is undesirable for performing distillation with the ensemble prediction, all of them are then mapped into the same dimension as the coarsest level along the path from leaf nodes to root nodes in the hierarchical structure:

$$[g_{1\to 0}, g_{2\to 0}, \dots, g_{M-1\to 0}] = \mathrm{map}(g_1, g_2, \dots, g_{M-1})$$

where map() performs logits mapping by summing all logits of fine-level classes belonging to the same coarse-level class. A Kullback-Leibler (KL) divergence loss between the ensemble prediction and the mapped predictions of each head is then calculated:

$$\mathcal{L}_{\mathrm{ekd}} = \sum_{m=1}^{M-1} \mathrm{KL}\big(P_{\mathrm{ens}} \,\|\, \mathrm{Softmax}(g_{m\to 0})\big)$$
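A sketch of this ensemble distillation step is given below; the `fine_to_coarse` index mappings and the linear-layer setup are illustrative placeholders rather than the actual 8-class and 65-class trees:

```python
import torch
import torch.nn.functional as F

def map_to_coarse(fine_logits, fine_to_coarse):
    """The map() operation: sum the logits of all fine classes sharing the same coarse parent."""
    coarse = torch.zeros(fine_logits.size(0), max(fine_to_coarse) + 1, device=fine_logits.device)
    for fine_idx, coarse_idx in enumerate(fine_to_coarse):
        coarse[:, coarse_idx] += fine_logits[:, fine_idx]
    return coarse

def ensemble_distillation_loss(logits_per_level, fine_to_coarse_maps, ens_linear):
    # logits_per_level: [g0 (B,2), g1 (B,8), g2 (B,65)]; ens_linear projects the concatenation
    # of all logits down to the coarsest (2-class) level
    g_ens = ens_linear(torch.cat(logits_per_level, dim=1))
    p_ens = F.softmax(g_ens, dim=-1)
    loss = 0.0
    for g, mapping in zip(logits_per_level[1:], fine_to_coarse_maps):
        log_p_mapped = F.log_softmax(map_to_coarse(g, mapping), dim=-1)
        # KL(P_ens || P_mapped): the ensemble prediction acts as the teacher for each mapped head
        loss = loss + F.kl_div(log_p_mapped, p_ens, reduction="batchmean")
    return loss
```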
Ensembled outputs act as a strong teacher for distilling knowledge to each classification layer, but this ignores the relationship between different classification layers. Therefore, mutual knowledge distillation can be used to align outputs from consecutive levels of classifiers to maintain consistency, because a hierarchical model with multiple classification layers should favour functions that give consistent outputs for the same inputs. The consensus of outputs from different classification layers on the same input provides supplementary information and regularization to each classifier.
Similar to the ensemble knowledge distillation, the output logits of every pair of consecutive classifiers are mapped into the same dimension and a KL divergence loss is calculated. The calculation can be summarised as:

$$\mathcal{L}_{\mathrm{mkd}} = \sum_{m=1}^{M-1} \mathrm{KL}\big(\mathrm{Softmax}(g_{m\to m-1}) \,\|\, \mathrm{Softmax}(g_{m-1})\big)$$

where $g_{m\to m-1}$ denotes the logits of level $m$ mapped to the classes of level $m-1$. The final objective function for optimizing the hierarchical model includes the cross entropy loss on the outputs from each classifier as well as the ensemble prediction, the ensemble knowledge distillation loss, and the mutual knowledge distillation loss:

$$\mathcal{L} = \sum_{m=0}^{M-1} \mathcal{L}_{\mathrm{CE}}(P_m, y_m) + \mathcal{L}_{\mathrm{CE}}(P_{\mathrm{ens}}, y_0) + \mathcal{L}_{\mathrm{ekd}} + \mathcal{L}_{\mathrm{mkd}}$$
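The mutual distillation and the combined objective can then be sketched as follows, reusing `map_to_coarse` from the previous sketch; equal weighting of the loss terms is an assumption, since the weights are not stated here:

```python
import torch.nn.functional as F

def mutual_distillation_loss(logits_per_level, parent_maps):
    """Align each classifier with the next-coarser one by mapping its logits one level up."""
    loss = 0.0
    for m in range(1, len(logits_per_level)):
        mapped = map_to_coarse(logits_per_level[m], parent_maps[m - 1])       # level m -> level m-1
        loss = loss + F.kl_div(F.log_softmax(mapped, dim=-1),
                               F.softmax(logits_per_level[m - 1], dim=-1),
                               reduction="batchmean")
    return loss

def total_loss(logits_per_level, targets_per_level, g_ens, coarse_targets, l_ekd, l_mkd):
    """Cross entropy on every classifier and on the ensemble, plus both distillation terms."""
    ce = sum(F.cross_entropy(g, t) for g, t in zip(logits_per_level, targets_per_level))
    ce = ce + F.cross_entropy(g_ens, coarse_targets)
    return ce + l_ekd + l_mkd
```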
The present system and method for hierarchical skin lesion classification is used with skin image datasets. Skin image datasets may be collected from a clinical environment and follow tele-dermatology labelling standards. Skin image datasets may also be collected from the publicly available International Skin Imaging Collaboration (ISIC) archives, for example.
In an example, dataset with tele-dermatology labelling standards includes 235,268 teledermatology verified clinical and dermoscopic images. The dataset may contain a total of 65 skin conditions organised in a 3-level hierarchical semantic tree as shown in Figure 9.
In another example, data from ISIC archives have 25,331 dermatoscopic images across 8 different categories: melanoma (MEL), melanocytic nevus (MN), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC), and squamous cell carcinoma (SCC).
To train the system for hierarchical skin lesion classification, a dataset can be split into training, validation and testing sets with a ratio of 7:1:2. Standard data augmentation techniques such as random resized cropping, colour transformation, and flipping can be applied to the datasets. Each dermoscopic image is resized to a fixed size of 384x384, for example. An ImageNet pre-trained ResNet-34 (He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, pp. 770-778. doi:10.1109/CVPR.2016.90) may be used as the backbone, and the model can be trained with the ADAM optimizer (Kingma, D., Ba, J., 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.) with a batch size of 128 and initial learning rates of 1 × 10^-5 and 3 × 10^-4 for the backbone and the newly added layers, respectively.
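The training configuration described above may, for example, be expressed as in the following sketch. The specific ColorJitter values, the nn.Linear placeholder standing in for the transformer and hierarchical classification heads, and the torchvision calls are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet34

# Augmentation and resizing as described above.
train_transform = T.Compose([
    T.RandomResizedCrop(384),                                      # random resized cropping to 384x384
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),   # colour transformation (assumed values)
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ToTensor(),
])

backbone = resnet34(pretrained=True)   # ImageNet pre-trained backbone (newer torchvision uses weights=)
backbone.fc = nn.Identity()            # expose the 512-d feature vector

# Placeholder for the transformer and hierarchical classification heads.
new_layers = nn.Linear(512, 65)

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(),   "lr": 1e-5},   # backbone learning rate
    {"params": new_layers.parameters(), "lr": 3e-4},   # newly added layers
])
# Batch size 128; dataset split 7:1:2 into training, validation and test sets.
```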
The foregoing description of the invention includes preferred forms thereof. Modifications may be made thereto without departing from the scope of the invention.

Claims

1. A system for classifying skin lesions, the system comprising: a feature extraction module that determines feature information from an image of a skin lesion; a transformer that determines relationships between classifiers classifying distinct parameters of the skin lesion; and a hierarchical classifier having at least two of the classifiers ordered based on the number of classification categories in each classifier; wherein the classifiers classify distinct parameters in parallel with each other based on the feature information and the relationships between classifiers.
2. The system of claim 1, wherein the transformer further comprises: at least two encoders that determine the global context of the feature information; and at least two decoders that determine the dependencies of the classifiers in the hierarchical classifier; wherein the classifiers classify distinct parameters in parallel with each other based on the global context and the dependencies.
3. A computer implemented method for classifying skin lesions, the method comprising: determining feature information from an image of a skin lesion; determining relationships between classifiers classifying distinct parameters of the skin lesion; and classifying distinct parameters with at least two of the classifiers in parallel, wherein the classifiers are ordered in a hierarchy based on the number of classification categories in each classifier.
PCT/NZ2022/050154 2021-12-02 2022-11-25 Skin lesion classification system and method WO2023101564A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2022400601A AU2022400601A1 (en) 2021-12-02 2022-11-25 Skin lesion classification system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NZ782937 2021-12-02
NZ78293721 2021-12-02

Publications (1)

Publication Number Publication Date
WO2023101564A1 2023-06-08

Family

ID=86612806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NZ2022/050154 WO2023101564A1 (en) 2021-12-02 2022-11-25 Skin lesion classification system and method

Country Status (2)

Country Link
AU (1) AU2022400601A1 (en)
WO (1) WO2023101564A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10878567B1 (en) * 2019-09-18 2020-12-29 Triage Technologies Inc. System to collect and identify skin conditions from images and expert knowledge
CN112330621A (en) * 2020-10-30 2021-02-05 康键信息技术(深圳)有限公司 Method and device for carrying out abnormity classification on skin image based on artificial intelligence

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Color Medical Image Analysis", vol. 6, 30 November 2012, SPRINGER, ISBN: 978-94-007-5388-4, article BALLERINI LUCIA, FISHER ROBERT, ALDRIDGE BEN, REES JONATHAN, CELEBI M. EMRE, SCHAEFER GERALD: "A Color and Texture Based Hierarchical K- NN Approach to the Classification of Non-melanoma Skin Lesions", pages: 63 - 86, XP009546955, DOI: 10.1007/978-94-007-5389-1_4 *
ANDRE ESTEVA, BRETT KUPREL, ROBERTO A. NOVOA, JUSTIN KO, SUSAN M. SWETTER, HELEN M. BLAU, SEBASTIAN THRUN: "Dermatologist-level classification of skin cancer with deep neural networks", NATURE, NATURE PUBLISHING GROUP UK, LONDON, vol. 542, no. 7639, 1 February 2017 (2017-02-01), London, pages 115 - 118, XP055536881, ISSN: 0028-0836, DOI: 10.1038/nature21056 *
BARATA CATARINA; CELEBI M. EMRE; MARQUES JORGE S.: "Explainable skin lesion diagnosis using taxonomies", PATTERN RECOGNITION., ELSEVIER., GB, vol. 110, 16 May 2020 (2020-05-16), GB , XP086328204, ISSN: 0031-3203, DOI: 10.1016/j.patcog.2020.107413 *
BARATA CATARINA; MARQUES JORGE S.: "Deep Learning For Skin Cancer Diagnosis With Hierarchical Architectures", 2019 IEEE 16TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2019), IEEE, 8 April 2019 (2019-04-08), pages 841 - 845, XP033576692, DOI: 10.1109/ISBI.2019.8759561 *
SHIMIZU, K ET AL.: "Four-Class Classification of Skin Lesions With Task Decomposition Strategy", IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, vol. 62, no. 1, January 2015 (2015-01-01), XP011568255, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/abstract/document/6879310> [retrieved on 20230223], DOI: 10.1109/TBME.2014.2348323 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721302A (en) * 2023-08-10 2023-09-08 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN116721302B (en) * 2023-08-10 2024-01-12 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN117636064A (en) * 2023-12-21 2024-03-01 浙江大学 Intelligent neuroblastoma classification system based on pathological sections of children
CN117636064B (en) * 2023-12-21 2024-05-28 浙江大学 Intelligent neuroblastoma classification system based on pathological sections of children

Also Published As

Publication number Publication date
AU2022400601A1 (en) 2024-03-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22901909; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022400601; Country of ref document: AU) (Ref document number: 809081; Country of ref document: NZ) (Ref document number: AU2022400601; Country of ref document: AU)
ENP Entry into the national phase (Ref document number: 2022400601; Country of ref document: AU; Date of ref document: 20221125; Kind code of ref document: A)