WO2014006421A1

WO2014006421A1 - Identification of mitotic cells within a tumor region

Info

Publication number: WO2014006421A1
Application number: PCT/GB2013/051790
Authority: WO
Inventors: Adnan M. KHAN; Nasir M RAJPOOT; Hesham EL-DALY
Original assignee: The University Of Warwick; University Hospitals Coventry And Warwickshire
Priority date: 2012-07-06
Filing date: 2013-07-05
Publication date: 2014-01-09
Also published as: GB201212090D0

Abstract

Methods and systems for analysis of histology slides are described, in particular, methods and systems relating to the identification of pixels associated with mitotic cells in images of histopathology slides. In one example, such a method is concerned with identification of pixels associated with candidate mitotic cells in a digital histopathological image. The method comprises identifying a set of image pixels as being associated with a tumor region in the histopathological image. The method also comprises defining a set of pixel intensities of pixels in the set of image pixels and identifying a subset of pixels as being associated with candidate mitotic cells based on the set of pixel intensities.

Description

IDENTIFICATION OF MITOTIC CELLS WITHIN A TUMOR REGION

Field

The present invention relates to methods and systems for analysis of histology slides, and in particular although not exclusively, the identification of pixels associated with mitotic cells in images of histopathology slides.

Background

Morphological diagnosis is an important tool in anatomical pathology, since accurate diagnosis and staging of cancer and other diseases and conditions requires histopathological examination of samples.

An important example is the grading of breast cancer, which relies largely on microscopic examination of tissue slides stained with Hematoxylin & Eosin (H&E). This is a subjective process by its very nature, consequently leading to inter- and even intra-observer variability, potentially affecting predicted patient prognosis and also the treatment modalities offered. The variability in breast cancer grading may, at least in part, be responsible for the different choice of therapy used even within the same institution.

One measure used in grading of breast cancer is a count of mitotic cells. Determining the mitotic count is very challenging since the biological phase of cells undergoing mitosis, the way the tissue is sectioned, and the staining artifacts make their automatic detection extremely difficult. Additionally, if standard H&E staining is used (which stains all chromatin rich structures, such as nuclei, apoptotic cells and mitotic cells, in a dark blue color), it becomes extremely difficult to detect the latter given that the former two are densely localized in the tissue sections.

In order to try and overcome this difficulty, a number of different approaches have been proposed. For example, Roulier et al. (Computerized Medical Imaging and Graphics, 35(7-8):603-615, 2011) used an additional stain (e.g. PHH3) to stain mitotic cells exclusively, and detected the exclusively stained mitotic cells in the images. S. Huh et al. (IEEE Transactions on Medical Imaging, 30(3):586-596, 2011) used a video sequence of live cells in culture to detect mitotic events over time by incorporating spatial and temporal information. However, these methods are costly, either requiring an additional stain or operating on cultured cells, whereas the analysis of standard H&E stained histopathological slides by an expert pathologist is typically employed in a pathology laboratory for cancer diagnosis and prognostic purposes.

Summary

The invention is set out in the independent claims. Further, optional features are set out in the remaining, dependent claims.

In particular, embodiments relate to a method for identifying pixels associated with candidate mitotic cells in a histopathological image, comprising the steps of:

(a) identifying a set of image pixels as being within tumor regions in the histopathological image;

(b) defining a set of pixel intensities of pixels in the set of image pixels; and

(c) identifying a subset of pixels as being associated with candidate mitotic cells based on the set of pixel intensities.

By limiting the analysis to pixels in an identified tumor region, an intensity based measure can be used without requiring additional stains as in known methods.

In some embodiments, a mixture model is used to model underlying distributions of an observed intensity distribution to provide a refined measure of whether a pixel is likely to be associated with a cell that is mitotic.

In yet further embodiments, post-processing based on the context of identified pixels likely to be associated with mitotic cells can improve results by rejecting false positives.

Aspects of the invention further include context-aware post-processing independent of the nature of the previous steps identifying candidate pixels; mixture modelling intensity distributions representative of cells in a population, not limited to mitotic and non-mitotic cells; segmenting tumor or cancer regions using texture feature vectors; segmenting tumor or cancer regions using magnitude and phase spectra related feature vectors to account separately for hypo- and hyper-cellular stroma; and segmenting an image region using a plurality of random projections of feature vectors and subsequent classification based on a majority vote based on clustering labels.

In some embodiments the method comprises the step of identifying candidate mitotic cells that are false positives and removing the false positives from the set of candidate mitotic cells.

In some embodiments the method comprises the step of identifying candidate mitotic cells that are false positives, the step comprising defining an image patch around each of the candidate mitotic cells in the set; calculating one or more indicator values for the image patch; and determining whether the pixels associated with candidate cells are pixels associated with mitotic cells based on the indicator values.

Brief description of the drawings

Embodiments are now described by way of example, with reference to the following drawings, in which:

Figure 1 shows a flow diagram of an embodiment of the method for analysing a histopathological image described herein;

Figure 2a shows marginal distributions (thick solid line) and models fitted (dotted lines) by the two-component Gamma-Gaussian Mixture Model, together with the mixture distribution (thin solid line);

Figure 2b shows an example of an overall peripheral distribution used for fitting the mixture model;

Figure 3 shows (a) Ground Truth (GT) of a histopathological image, showing mitotic cells identified visually by a pathologist; (b) image of the results of performing the method described herein on the same image used to produce (a); (c) image of the results of performing a manual thresholding of the distribution of pixel intensities obtained after segmenting the image used to produce (a); (d) image of the results of performing Otsu thresholding on the distribution of pixel intensities obtained after segmenting the image used to produce (a);

Figure 4 illustrates a system implementing the described methods; and

Figure 5 shows (a) a schematic of the boundary of a mitotic cell in prophase; (b) a schematic of the boundary of a mitotic cell in metaphase; and (c) a schematic of the boundary of an apoptotic cell.

Definitions

It will be understood by the skilled person that, in the context of the present disclosure, the term "pixels associated with non-mitotic cells" refers to all pixels in a segmented section of the histopathological image which are not pixels associated with candidate mitotic cells.

The tumor or cancer region of the histopathological image is the region or area which contains mostly tumor or cancer cells.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components and/or circuits have not been described in detail.

Some portions of the detailed description that follow are presented in terms of algorithms and/or symbolic representations of operations on data bits and/or binary digital signals stored within a computing system, such as within a computer and/or computing system memory. These algorithmic descriptions and/or representations are the techniques used by those of ordinary skill in the data processing arts to convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be, a self-consistent sequence of operations and/or similar processing leading to a desired result. The operations and/or processing may involve physical manipulations of physical quantities. Typically, although not necessarily, these quantities may take the form of electrical and/or magnetic signals capable of being stored, transferred, combined, compared and/or otherwise manipulated. It has proven convenient, at times, principally for reasons of common usage, to refer to these signals as bits, messages, data, values, elements, symbols, characters, terms, numbers, numerals and/or the like. It should be understood, however, that all of these and similar terms are to be associated with appropriate physical quantities and are merely convenient labels. 
Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout this specification, discussions utilizing terms such as "processing", "computing", "calculating", "determining" and/or the like refer to the actions and/or processes of a computing platform, such as a computer or a similar electronic computing or messaging device (such as but not limited to a mobile phone or PDA), that manipulates and/or transforms or transfers data represented as physical electronic and/or magnetic quantities and/or other physical quantities within the computing platform's processors, memories, registers, and/or other information storage, transmission, and/or input and display devices.

Detailed description

A method for identifying pixels associated with candidate mitotic cells in a histopathological image is now described, the method comprising the steps of:

(a) identifying a set of image pixels as being associated with a tumor region in the histopathological image;

(b) defining a set of pixel intensities of pixels in the set of image pixels; and

(c) identifying a subset of pixels as being associated with candidate mitotic cells based on the set of pixel intensities.

The image may be an image of any histopathological slide which contains mitotic cells.

The preparation of histopathological slides is a technique which is well known in the art. In brief, histopathological analysis of tissue begins with the removal of the tissue from the subject, for example, by surgery, biopsy, or autopsy. The tissue is then placed in a fixative, such as formalin, to stabilize the tissue and to prevent decay. The tissue is cut into thin sections, placed on glass slides and stained for examination under the microscope. The routinely used stain in histopathology is a combination of hematoxylin and eosin (H&E). Hematoxylin is used to stain nuclei blue, while eosin stains other eosinophilic structures, such as cytoplasm and the extracellular connective tissue matrix, in various shades of red, pink and orange. However, other stains which are well known in the art can also be used to selectively stain cells, such as safranin, Oil Red O, congo red, silver salts, DAB stain, PAS stain and other dyes. In certain embodiments, the histopathological slide to be initially analyzed is stained with H&E.

Although staining of histopathological slides, for example using H&E, enables better visualization of tissue structures, the stain used may cause variation in color and intensity between different images, so pre-processing of an image may be required to achieve a consistent color and intensity appearance.

With reference to Figure 1, information associated with an image of a histopathological slide (A) is uploaded to the system in step 2. In some embodiments, the color and intensity of the stained image is then normalised in step 4 using any method known in the art, for example, by using the method of D. Magee et al. (in Proceedings Medical Image Understanding and Analysis (MIUA), pp. 1-5, 2010).

Histopathological images can typically be divided into two regions: a tumor region and a non-tumor region (or a cancerous region and a non-cancerous region). The inventors have realized that limiting the pixel analysis to tumor or cancer regions of the histopathological image not only eliminates the need for further special staining dyes, but also allows the use of a much simpler algorithm to detect pixels associated with candidate mitotic cells.

Therefore, the method involves segmenting the histopathology slide image (at step 6) to remove the non-tumor or non-cancerous areas from the histology slide image, thereby minimizing the search space for mitotic cells. The step of segmenting the image (step 6) is, in some embodiments, carried out using an automatic segmentation algorithm, the details of which are set out in Example 1 and Annex 1. Prior to automatic segmentation, the image (B) is, in some embodiments, further pre-processed, for example by carrying out background estimation and anisotropic diffusion or morphological dilation. Morphological dilation increases the chances of detecting mitotic cells on the boundary of the tumor or cancer regions. It will be appreciated by the skilled person that although in some embodiments the step of segmenting the image (6) is carried out using a particular first algorithm set out in Example 1 and Annex 1, segmentation can also be implemented using other algorithms, such as the second algorithm as detailed in Example 2 and Annex 2, or other segmentation algorithms which are well known in the art.
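The effect of morphological dilation on a tumor mask can be illustrated with a minimal sketch. The function name `dilate`, the list-of-lists mask representation and the square structuring element are assumptions made for illustration only; the specification does not prescribe a particular structuring element or implementation.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2-D mask (list of lists of 0/1).

    Assumption: a square structuring element of side 2*radius + 1.
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            # Switch on every pixel within `radius` of a foreground pixel,
            # clipping the neighbourhood at the image borders.
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    out[rr][cc] = 1
    return out
```

Growing the mask in this way keeps cells that straddle the edge of the segmented region inside the search space.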

Alternatively, segmentation of the histopathological image can be carried out manually or in a semi-automated manner, for example by displaying the histopathological image to a user, such as a pathologist. The user may identify the set of image pixels, for example, by highlighting the relevant regions of the image on a screen.

The inventors have further realized that the distributions of pixel intensities associated with candidate mitotic cells and non-mitotic cells in the tumor or cancer region of the image are quite distinct. Figure 2 shows an overall (modeled) distribution of pixel intensities from the tumor or cancer region of a histopathological image, and the distinct component intensity distributions associated with candidate mitotic cells and non-mitotic cells as identified by expert input, together with the corresponding fitted model components. The inventors realized that by modeling each of the underlying distributions of pixel intensities associated with candidate mitotic cells and non-mitotic cells as parametric functions, and fitting a mixture model of these two parametric functions to the overall intensity distribution (that of the overall segmented image, without prior expert assignment into mitotic and non-mitotic cells), it is possible to determine a quantity which is indicative of how likely it is that a pixel is associated with a candidate mitotic cell. For the avoidance of doubt, while Figure 2 shows separate empirical distributions for identified subpopulations, the mixture model is fitted simultaneously for all parameters of the mixture components by fitting to an overall empirical distribution (an example of which is illustrated in Figure 2b) without the need for prior expert identification.

With reference to Figure 1, in step 8, the pixel intensities of the pixels in the segmented tumor or cancerous regions (L channel of the L*a*b* color space) are modeled as a random variable sampled from a mixture of a Gamma and a Gaussian distribution. Intensities of pixels in the population/mixture associated with mitotic cells are modeled by a Gamma distribution and pixels in the overall population/mixture associated with non-mitotic cells are modeled by a Gaussian distribution. Gamma and Gaussian distributions have been used as the inventors have observed that the characteristics of these distributions match the observed data well, so that a Gamma-Gaussian Mixture Model (GGMM) can provide a good fit to the observed overall marginal distribution. The GGMM is, in effect, a parametric probability density function, the parameters of which can be estimated using any suitable parameter estimation technique.
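By way of illustration, the intensities entering the model could be gathered as in the following sketch. It assumes an 8-bit sRGB input image and the standard CIE lightness formula with a D65 white point; the function names `srgb_to_lightness` and `tumor_lightness` and the list-based image layout are hypothetical and not taken from the specification.

```python
def srgb_to_lightness(r, g, b):
    """Return CIE L* (0..100) for an 8-bit sRGB pixel (assumes D65 white)."""
    def linearize(v):
        v /= 255.0
        return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

    # Relative luminance Y from linear RGB.
    y = 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)
    delta = 6.0 / 29.0
    if y > delta ** 3:
        f = y ** (1.0 / 3.0)
    else:
        f = y / (3 * delta ** 2) + 4.0 / 29.0
    return 116.0 * f - 16.0


def tumor_lightness(image, mask):
    """Collect L* values of the pixels whose mask entry is truthy.

    `image` is a list of rows of (r, g, b) tuples; `mask` has the same shape.
    """
    return [
        srgb_to_lightness(*image[r][c])
        for r in range(len(image))
        for c in range(len(image[0]))
        if mask[r][c]
    ]
```

The resulting list of values plays the role of the observed intensities to which the mixture model is fitted.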

Some embodiments use the following specific steps and techniques.

For pixel intensities $x$, the proposed mixture model is given by:

$$f(x; \theta) = \rho_1\,\Gamma(x; \alpha, \beta) + \rho_2\,G(x; \mu, \sigma) \qquad (1)$$

where $\rho_1$ and $\rho_2$ represent the mixing proportions (priors) of intensities being associated with mitotic and non-mitotic cells (or other sub-populations) such that $\rho_1 + \rho_2 = 1$, $\Gamma(x; \alpha, \beta)$ represents the Gamma density function parameterized by $\alpha$ (the shape parameter) and $\beta$ (the scale parameter), $G(x; \mu, \sigma)$ represents the Gaussian density function parameterized by $\mu$ (the mean) and $\sigma$ (the standard deviation), and $\theta = [\alpha, \beta, \mu, \sigma, \rho_1, \rho_2]$ represents the vector of all unknown parameters in the model.
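The mixture density of equation (1) can be written down directly, as in the following sketch. The function names are illustrative, and the second mixing proportion is taken as one minus the first, which follows from the constraint that the proportions sum to one.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    log_p = ((alpha - 1) * math.log(x) - x / beta
             - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_p)


def gauss_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def ggmm_pdf(x, alpha, beta, mu, sigma, rho1):
    """Two-component Gamma-Gaussian mixture density; rho2 = 1 - rho1."""
    return rho1 * gamma_pdf(x, alpha, beta) + (1 - rho1) * gauss_pdf(x, mu, sigma)
```

Working with the logarithm of the Gamma density before exponentiating avoids overflow for larger shape parameters.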

In order to estimate the unknown parameters ($\theta$), maximum likelihood estimation (MLE) may be employed. Given image intensities $x_i$, $i = 1, 2, \ldots, n$, where $n$ is the number of pixels, the log-likelihood function ($\ell$) of the parameter vector $\theta$ is given by

$$\ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta) \qquad (2)$$

where $f(x_i; \theta)$ is the mixture density function in equation (1). The MLE of $\theta$ is represented by

$$\hat{\theta} = \arg\max_{\theta}\, \ell(\theta) \qquad (3)$$

A convenient approach to obtain a numerical solution to the above maximization problem is provided by the Expectation Maximization (EM) algorithm of A. Dempster et al. (Journal of the Royal Statistical Society, Series B, pp. 1-38, 1977). In the context of the present method, the EM algorithm can be set up as follows.

Let $z_{ik}$, $k = 1, 2$, be indicator variables showing the component membership of each pixel $x_i$ in the mixture model (1). Note that these indicator variables are hidden (unobserved). The log-likelihood (2) can be extended as follows:

$$\ell_c(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{2} z_{ik} \log \rho_k + \sum_{i=1}^{n} \left\{ z_{i1} \log\left[\Gamma(x_i; \alpha, \beta)\right] + z_{i2} \log\left[G(x_i; \mu, \sigma)\right] \right\} \qquad (4)$$

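Algorithm 1 Expectation Maximization (EM)

1: Expectation Step (E step): Calculate the expected value of the log-likelihood function $\ell_c(\theta)$ with respect to $P(z \mid x, \theta^{(m)})$, where $z = \{z_{ik},\ i = 1, 2, \ldots, n,\ k = 1, 2\}$. The conditional expectation can be given as:

$$Q(\theta, \theta^{(m)}) = \sum_{i=1}^{n} \sum_{k=1}^{2} \hat{z}_{ik}^{(m)} \log \rho_k + \sum_{i=1}^{n} \left\{ \hat{z}_{i1}^{(m)} \log\left[\Gamma(x_i; \alpha, \beta)\right] + \hat{z}_{i2}^{(m)} \log\left[G(x_i; \mu, \sigma)\right] \right\} \qquad (5)$$

where

$$\hat{z}_{i1}^{(m)} = \frac{\rho_1^{(m)}\,\Gamma(x_i; \alpha^{(m)}, \beta^{(m)})}{f(x_i; \theta^{(m)})} \qquad (6)$$

and

$$\hat{z}_{i2}^{(m)} = \frac{\rho_2^{(m)}\,G(x_i; \mu^{(m)}, \sigma^{(m)})}{f(x_i; \theta^{(m)})} \qquad (7)$$

are the conditional expectations of $z_{ik}$.

2: Maximization Step (M step): The M-step maximizes the function $Q(\theta, \theta^{(m)})$ with respect to $\theta$ using a numerical optimization.

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta, \theta^{(m)}) \qquad (8)$$

3: Convergence Criteria: The above two steps are repeated until $\lVert \theta^{(m+1)} - \theta^{(m)} \rVert < \epsilon$ for a pre-specified value of tolerance $\epsilon$.

The EM algorithm finds $\hat{\theta}$ iteratively, as outlined in Algorithm 1. Let $\theta^{(m)}$ be the estimate of $\theta$ after $m$ iterations of the algorithm. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying Expectation and Maximization steps.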
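A minimal sketch of Algorithm 1 follows, given as one possible realisation rather than the implementation used by the inventors. Several choices are assumptions made here: the initial assignment (values below the overall mean go to the Gamma component), the one-dimensional bisection used for the Gamma shape parameter in the M step, the series approximation of the digamma function, and all function names, including the `sample_ggmm` helper that only generates synthetic test data.

```python
import math
import random


def gamma_pdf(x, alpha, beta):
    return math.exp((alpha - 1) * math.log(x) - x / beta
                    - math.lgamma(alpha) - alpha * math.log(beta))


def gauss_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def digamma(x):
    """psi(x) for x > 0 via upward recurrence and an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def solve_shape(s, lo=1e-3, hi=1e6):
    """Solve log(a) - psi(a) = s for a by bisection in log space (s > 0)."""
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid  # left-hand side decreases in a, so the root is larger
        else:
            hi = mid
    return math.sqrt(lo * hi)


def fit_ggmm(xs, tol=1e-6, max_iter=500):
    """EM for rho1*Gamma(alpha, beta) + (1 - rho1)*Gauss(mu, sigma); xs > 0."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean = sum(xs) / n
    # Assumed initialisation: hard-assign values below the mean to the Gamma.
    tau = [1.0 if x < mean else 0.0 for x in xs]
    prev = None
    for _ in range(max_iter):
        # M step: weighted maximum-likelihood updates given memberships tau.
        w1 = sum(tau)
        w2 = n - w1
        rho1 = w1 / n
        m1 = sum(t * x for t, x in zip(tau, xs)) / w1
        mlog = sum(t * lx for t, lx in zip(tau, logs)) / w1
        alpha = solve_shape(max(math.log(m1) - mlog, 1e-12))
        beta = m1 / alpha
        mu = sum((1 - t) * x for t, x in zip(tau, xs)) / w2
        sigma = math.sqrt(sum((1 - t) * (x - mu) ** 2 for t, x in zip(tau, xs)) / w2)
        theta = (alpha, beta, mu, sigma, rho1)
        # Convergence: Euclidean change in the parameter vector below tol.
        if prev is not None and math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, prev))) < tol:
            break
        prev = theta
        # E step: conditional expectation of membership in the Gamma component.
        tau = []
        for x in xs:
            a = rho1 * gamma_pdf(x, alpha, beta)
            b = (1 - rho1) * gauss_pdf(x, mu, sigma)
            tau.append(a / (a + b) if a + b > 0 else 0.5)
    return dict(zip(("alpha", "beta", "mu", "sigma", "rho1"), theta))


def sample_ggmm(n1, n2, alpha, beta, mu, sigma, seed=0):
    """Synthetic intensities for experimentation (not part of the method)."""
    rng = random.Random(seed)
    return ([rng.gammavariate(alpha, beta) for _ in range(n1)]
            + [rng.gauss(mu, sigma) for _ in range(n2)])
```

For example, `fit_ggmm(sample_ggmm(400, 1600, 4.0, 5.0, 70.0, 8.0))` returns a dictionary of estimated parameters. The Gaussian updates are closed-form weighted moments, while the Gamma shape has no closed form, which is why a numerical solve is needed in the M step.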

In step 8 of Figure 1, a quantity which indicates how likely it is that the pixel in question is associated with a mitotic cell is proportional to the posterior probability under the mixture model of the pixel being associated with a mitotic cell (although other embodiments can use other quantities).

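The posterior probabilities of a pixel $x_i$ belonging to class 1 (Mitotic, $p_{i1}$) or 2 (Non-Mitotic, $p_{i2}$) may be calculated as follows:

$$p_{i1} = \frac{\rho_1\,\Gamma(x_i; \alpha, \beta)}{\rho_1\,\Gamma(x_i; \alpha, \beta) + \rho_2\,G(x_i; \mu, \sigma)} \qquad (9)$$

$$p_{i2} = 1 - p_{i1}$$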

In step 8 of Figure 1, once the quantity has been determined, the subset of pixels is identified using Otsu's method (IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62-66, 1979). Otsu's method automatically performs histogram shape-based image thresholding or, in other words, the reduction of a grey level image to a binary "black and white" image. The algorithm assumes that the image to be thresholded contains two classes of pixels, that is, a bi-modal histogram (e.g. foreground and background, black and white, 0 and 1, etc.), and then calculates the optimum threshold separating those two classes so that their combined spread (intra-class variance) is minimal. Applied to the present disclosure, Otsu's method is used to convert values of the quantities (e.g. posterior probabilities) into indicator variables indicating whether a pixel is considered a candidate mitotic pixel or non-mitotic pixel (or more accurately a pixel associated with the respective class of cell).
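Otsu's method as applied to a list of posterior probabilities can be sketched as follows. The histogram resolution of 256 bins and the function name are assumptions; the sketch maximises the between-class variance, which is equivalent to minimising the intra-class variance mentioned above.

```python
def otsu_threshold(values, bins=256):
    """Otsu threshold for values in [0, 1].

    Returns a threshold t such that values >= t form the upper class.
    """
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_score = 0, -1.0
    w0 = 0          # number of values in bins 0..t
    sum0 = 0.0      # sum of bin indices over those values
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_bin = score, t
    return (best_bin + 1) / bins
```

A binary indicator image is then obtained by comparing each posterior probability with the returned threshold.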

In other words, in effect, Otsu's method is used to threshold a posterior probability image (or image of another quantity related to class belonging) to get an image of a binary indicator variable indicating class membership: mitotic or non-mitotic. Other thresholding/classification or binarization algorithms as known to the skilled person can be used in other embodiments. For example, the Level Set method may be used to threshold a posterior probability image. The Level Set method is a numerical technique used to identify interfaces between different regions. When applied to the thresholding of a posterior probability image, the Level Set method uses an auxiliary function, called the Level Set function, to represent a boundary surrounding a subset of pixels associated with a candidate mitotic cell as a closed curve. The closed curve is represented by the zero level set of the auxiliary/Level Set function. Inside the region defined by the boundary, the Level Set function takes positive values, whereas outside the bounded region the Level Set function takes negative values. The Level Set method is particularly advantageous since it allows improved boundary detection of the subset of pixels associated with candidate mitotic cells.

It will be appreciated that, in some embodiments, the boundary of the subset of pixels associated with a candidate mitotic cell may correspond to a boundary surrounding the chromatin material of the cell, such a boundary being indicative of the presence of a candidate mitotic cell. On the other hand, in some embodiments, the boundary may correspond to the cell wall. In any event, it is the boundary surrounding the subset of pixels which is identified.

In some embodiments, candidate mitotic cells are identified from the pixels which have been identified as being associated with candidate mitotic cells using an automatic pixel based process as will be known to the skilled person. For example, a candidate mitotic cell is formed by a set of neighboring pixels found, according to the above process, to be associated with the candidate mitotic cells using a connected component analysis method, which is well known to the skilled person. The skilled person is aware of other methods for identifying areas from pixels found and some embodiments use such alternative methods to form or identify a candidate mitotic cell. Of course, it will be understood that in practice it is likely that a plurality of candidate mitotic cells is formed / identified for a given image. It will be appreciated by the skilled person that some of the candidate mitotic cells may, in fact, be objects which have falsely been identified as mitotic cells. These objects will be referred to as "false positives".
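Connected component analysis on the binary indicator image can be sketched as below. The choice of 8-connectivity and the breadth-first traversal are assumptions made for illustration; the specification leaves the particular labelling method open.

```python
from collections import deque


def connected_components(mask):
    """Group foreground pixels of a 2-D 0/1 mask into 8-connected components.

    Returns a list of components, each a list of (row, col) tuples.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components
```

Each returned component corresponds to one candidate mitotic cell, so the number of components gives the number of candidates before any false positives are rejected.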

With reference to Figures 3a-d, the performance of the method described above and some variants are now discussed. Figure 3(a) illustrates a Ground Truth (GT) of a histopathological image, where mitotic cells have been visually identified by a pathologist. Figure 3(b) is the result of the above method being carried out on the same histopathological image. It can clearly be seen that the results of the above method are very close to the GT results in that the method correctly identifies pixels which are associated with mitotic cells, and only identifies a few false positives at the same time. Figure 3(c) was obtained by a user visually inspecting the distribution of pixel intensities obtained after performing segmentation, and choosing an intensity value, above which all pixels are assumed to be associated with a candidate mitotic cell. The value was chosen based on the position of the inflection of the intensity distribution. Figure 3(d) was obtained by performing thresholding using Otsu's method on the distribution of pixel intensities obtained after performing segmentation. Both Figures 3(c) and 3(d) correctly identify pixels associated with mitotic cells, but also identify some false positives. While there is thus an amount of predictive power achievable with the simplified intensity based embodiments of Figures 3(c) and 3(d), it can be said that the specific GGMM embodiment described above provides identification of candidate mitotic cells more in line with the expert assessment.

The inventors have further realized that the number of false positives can be reduced by analyzing the context of the candidate mitotic cells. In certain embodiments, and with reference to step 10 of Figure 1, Context-Aware Post Processing (CAPP) is performed on the results of the method set out above in order to reduce the number of false positives without significantly reducing the number of cells which are correctly identified as being mitotic cells.

CAPP involves, after identifying candidate mitotic cells using an automatic pixel based process as described above, defining an image patch around each candidate mitotic cell, calculating one or more indicator values for the image patch and then determining whether the candidate cells are likely to be false positives or not. In certain embodiments, the indicator values are based on texture features derived for each pixel in the patch, as described in further detail in section 2.5 of Annex 3.
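The patch definition step can be sketched as follows. The patch geometry (a square centred on the candidate's centroid and clipped at the image borders) and the two simple statistics computed in `patch_indicators` are placeholders chosen for illustration; they are not the texture features of Annex 3, which are not reproduced here.

```python
def candidate_patch(image, pixels, half_size=2):
    """Crop a square patch around the centroid of a candidate's pixels.

    `image` is a list of rows of numbers, `pixels` a list of (row, col).
    The patch has side 2*half_size + 1 unless clipped by the image border.
    """
    cy = int(sum(r for r, _ in pixels) / len(pixels) + 0.5)
    cx = int(sum(c for _, c in pixels) / len(pixels) + 0.5)
    top, bottom = max(0, cy - half_size), min(len(image), cy + half_size + 1)
    left, right = max(0, cx - half_size), min(len(image[0]), cx + half_size + 1)
    return [row[left:right] for row in image[top:bottom]]


def patch_indicators(patch):
    """Hypothetical indicator values: mean and standard deviation of the patch."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5
```

In a fuller pipeline the tuple returned by `patch_indicators` would be replaced by whichever feature vector is passed to the classifier.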

Applied to the present disclosure, CAPP is employed by constructing a feature vector comprising indicator values derived from an image containing known true and/or false positives. The feature vector is then used to train a Support Vector Machine (SVM) to classify candidate mitotic cells as true or false positives. Images containing known true and/or false positive candidate mitotic cells may be used to train the SVM, each candidate mitotic cell having an associated feature vector comprising indicator values derived from the image.

To classify a candidate mitotic cell as either a true or false positive, the trained SVM is applied to a feature vector comprising indicator values derived from an image patch containing the candidate mitotic cell, resulting in the candidate mitotic cell being classified as either a true positive or a false positive based on an output of the SVM.

In some embodiments, indicator values may be based on both texture features derived for each pixel in the patch and shape features derived for a boundary around the subset of pixels found to be associated with a candidate mitotic cell using the mixture model. In some embodiments, such shape features may comprise measurements indicative of the roughness (or equivalently smoothness) of the boundary, the elongation of the boundary and/or the convexity of the boundary (a measure of the degree to which the boundary is convex).

In some embodiments, the indicator values may be based solely on shape features derived from a boundary surrounding a subset of pixels associated with a candidate mitotic cell, wherein such shape features comprise measurements indicative of the roughness of the boundary, the elongation of the boundary and/or the convexity of the boundary. In these embodiments, the shape feature indicator values replace the texture feature indicator values.

An indicator value indicative of the roughness of the boundary may, for example, be obtained by comparing an image of the boundary with an image of the boundary after Gaussian (or other) smoothing has been applied. The difference between the two images can be used as an indicator of the roughness of the boundary. Alternatively, for example, the sum or the mean of the absolute values, standard deviation, or variance of the pixel values of a difference image between the original image and the smoothed image may be used as an indicator value of the roughness or smoothness of the boundary. Alternative measurements indicative of the roughness of the boundary may, for example, be derived using a Fourier Descriptor. Generally, spatial frequency analysis can be employed, for example, using the amplitude(s) of higher frequency component(s) as the indicator values. Alternatively, other image processing techniques known in the art may also be used. It will of course be appreciated that indicator values may be indicative of the roughness or smoothness of a boundary. It will be understood that a measurement of smoothness will also be indicative of the roughness of the boundary by virtue of its inverse relation.
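The spatial frequency variant can be illustrated with a short Fourier descriptor sketch. It assumes the boundary is available as an ordered, closed list of (x, y) points; the cutoff of 5 cycles, the normalisation and the function names are illustrative assumptions, and `polar_boundary` only generates test shapes.

```python
import cmath
import math


def fourier_roughness(boundary, cutoff=5):
    """Share of Fourier-descriptor amplitude above `cutoff` cycles.

    `boundary` is an ordered list of (x, y) points on a closed curve.
    Returns a value in [0, 1]; larger values indicate a rougher boundary.
    """
    n = len(boundary)
    z = [complex(x, y) for x, y in boundary]
    low = high = 0.0
    for k in range(1, n):  # k = 0 is the centroid term and is skipped
        coeff = sum(zj * cmath.exp(-2j * math.pi * j * k / n)
                    for j, zj in enumerate(z)) / n
        freq = k if k <= n // 2 else k - n  # signed frequency
        if abs(freq) > cutoff:
            high += abs(coeff)
        else:
            low += abs(coeff)
    total = low + high
    return high / total if total > 0 else 0.0


def polar_boundary(radius_fn, n=128):
    """Sample a closed curve r = radius_fn(theta) at n equally spaced angles."""
    points = []
    for j in range(n):
        theta = 2 * math.pi * j / n
        r = radius_fn(theta)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

A circle concentrates its amplitude in a single low-frequency coefficient, so its roughness is close to zero, whereas a rippled outline such as `polar_boundary(lambda t: 1 + 0.2 * math.cos(15 * t))` yields a clearly larger value.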

Measurements of boundary roughness (or smoothness) are thought to be discriminative of candidate mitotic cells in prophase since during prophase, a mitotic cell, in particular the boundary surrounding the chromatin material in the mitotic cell, would typically comprise a rough boundary as shown in Figure 5a. Therefore, identification of a rough boundary may indicate the presence of a mitotic cell, hence indicate the presence of a true positive.

An indicator value indicative of the elongation of the boundary may, for example, comprise the ratio of the major axis of the area defined by the boundary to the minor axis of the area defined by the boundary. Alternatively, other suitable measurement techniques known in the art may be used.
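One way to estimate the axis ratio is from the second-order moments of the pixel coordinates, as sketched below. Estimating the axes from the coordinate covariance is an assumption made here; the specification does not say how the major and minor axes are obtained.

```python
import math


def elongation(pixels):
    """Ratio of major to minor axis for a set of (row, col) pixels.

    The axes are estimated from the eigenvalues of the coordinate covariance
    matrix (an assumed method); returns math.inf for degenerate shapes.
    """
    n = len(pixels)
    mean_r = sum(r for r, _ in pixels) / n
    mean_c = sum(c for _, c in pixels) / n
    s_rr = sum((r - mean_r) ** 2 for r, _ in pixels) / n
    s_cc = sum((c - mean_c) ** 2 for _, c in pixels) / n
    s_rc = sum((r - mean_r) * (c - mean_c) for r, c in pixels) / n
    # Eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (s_rr + s_cc) / 2
    root = math.sqrt(((s_rr - s_cc) / 2) ** 2 + s_rc ** 2)
    lam_max, lam_min = half_trace + root, half_trace - root
    if lam_min <= 0:
        return math.inf
    return math.sqrt(lam_max / lam_min)
```

Axis lengths scale with the square root of the variances, which is why the square root of the eigenvalue ratio is returned.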

Measurements of the elongation of the boundary are thought to be discriminative of candidate mitotic cells in metaphase or telophase since during metaphase and telophase, a mitotic cell, in particular the boundary surrounding the chromatin material in the mitotic cell, would typically comprise an elongated shape as shown in Figure 5b. Therefore, identification of an elongated boundary may imply the presence of a mitotic cell, hence indicate the presence of a true positive.

In some embodiments, an indicator value indicative of the convexity of a boundary may, for example, comprise the ratio of the length of the actual boundary to the length of a circular boundary generated based on the maximum radius of the actual boundary, as given by:

$$\text{measurement of convexity} = \frac{\text{actual boundary length}}{2\pi r_{\max}}$$

where $r_{\max}$ is the maximum radius of the actual boundary.

Measurements of convexity of the boundary are suitable to discriminate candidate mitotic cells from apoptotic cells. Apoptotic cells typically contain chromatin material which is similar in shape to mitotic cells and so are easily confused with mitotic cells. When a cell is undergoing apoptosis the cell becomes concave in shape as can be seen by reference to Figure 5c in which the solid line shows the boundary of an apoptotic cell and the dotted line shows the circular boundary generated based on the maximum radius of the boundary of the apoptotic cell. Therefore, identification of a concave boundary may indicate the presence of an apoptotic cell rather than a mitotic cell, hence indicate the presence of a false positive.

Applying the measurement of convexity identified above, a value >1 would be indicative of an apoptotic cell, a value <1 would be indicative of a cell undergoing metaphase or telophase, and a value ~1 would be indicative of a cell undergoing prophase. Alternatively, other suitable measurement techniques known in the art may be used. While the indicator values related to the shape of the boundary as identified above may be suitable to aid classification of a candidate mitotic cell as either a true or false positive, the phase of the cell is not determined via these measurements. The measurements are merely used as a means of classifying a candidate mitotic cell as a true or false positive, based on the likely shape of a mitotic cell at various stages of mitosis and/or compared with the likely shape of an apoptotic cell.
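The convexity measurement, the actual boundary length divided by the circumference of a circle of maximum radius, can be sketched as follows. The sketch assumes the boundary is an ordered closed polygon, measures the maximum radius from the mean of the boundary points, and uses illustrative function names; `polar_boundary` only generates test shapes.

```python
import math


def convexity_measure(boundary):
    """Boundary length divided by 2*pi*r_max for an ordered closed boundary.

    Assumption: r_max is measured from the mean of the boundary points.
    """
    n = len(boundary)
    cx = sum(x for x, _ in boundary) / n
    cy = sum(y for _, y in boundary) / n
    perimeter = 0.0
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]  # wrap around to close the curve
        perimeter += math.hypot(x2 - x1, y2 - y1)
    r_max = max(math.hypot(x - cx, y - cy) for x, y in boundary)
    return perimeter / (2 * math.pi * r_max)


def polar_boundary(radius_fn, n=360, x_scale=1.0):
    """Sample r = radius_fn(theta) at n angles, optionally stretched in x."""
    points = []
    for j in range(n):
        theta = 2 * math.pi * j / n
        r = radius_fn(theta)
        points.append((x_scale * r * math.cos(theta), r * math.sin(theta)))
    return points
```

Under these assumptions a near-circular outline gives a value close to 1, an elongated ellipse gives a value below 1, and a deeply lobed outline gives a value above 1, in line with the interpretation given above.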

With reference to Figure 4, a system for implementing the method above comprises an input 12 for receiving information associated with the histopathological image to be analyzed and, where needed, user input. The image may be loaded to the system in any form which allows the processing of the image pixels, such as a standard image file (for example JPEG, PNG, etc.), a proprietary image format or a matrix of raw values. Processor 14 is coupled to the input 12 and is configured to implement methods in accordance with the embodiments set out above. An output 16 is coupled to the processor 14 to produce an output, which may be in any suitable form, such as a standard image file (for example JPEG, PNG, etc.), a proprietary image format, a matrix of raw values (e.g. coordinates of detected mitotic cells), etc. The output 16 may, additionally or alternatively to a storage device, include a display, for example a graphical display for displaying cells identified as mitotic or candidate mitotic cells and/or the segmented regions of the slide image. For example, the graphical display, in some embodiments, displays segmented region(s) and/or mitotic or candidate mitotic cells as graphical display elements overlaid over the slide image. This enables an intuitive display of the information output. In some embodiments, the display may be interactive to enable an expert to interact with the display, for example altering segmented region(s) or rejecting or accepting mitotic cells or candidate cells by way of input using a user input device. In some embodiments the output may additionally or alternatively include a cell count of mitotic cells, for example as an absolute or relative count, cell count per unit area or density. In some embodiments the processor may take further inputs relevant to cancer grading and the output can additionally or alternatively include a cancer grading.

It will be appreciated by the skilled person that the method can be applied to any image of a histopathological slide. For example, the method may be used to analyze any histopathological images containing mitotic cells, such as images of histopathological slides of cancers and tumors such as epithelial tumors (for example breast carcinoma), stromal tumors, tumors of the soft tissue (for example, sarcomas), neuroectodermal tumors, or neuroendocrine tumors. In particular embodiments, the present invention is used to analyze a breast carcinoma histopathology image.

It will also be appreciated that the method and system can also be used to detect apoptotic cells in a histopathological image. A histopathology slide containing apoptotic cells will be stained dark blue / black by H&E. Thus, the method and system can be used to detect apoptotic cells from a histopathological image containing pixels associated with candidate apoptotic cells and pixels associated with non-apoptotic cells.

In some embodiments, the pixels are scalar valued (eg intensity, a grayscale value derived from an RGB image or otherwise). In other embodiments, the pixels are vector valued. For example, the vector values of the pixels may correspond to a multispectral set of images at respective wavelength bands (obtained, for example, with a multispectral microscope or another multispectral imaging device) or a further reduced subset or derived set of pixel values from the multispectral images, with each entry in the vector value corresponding to a scalar pixel value of the pixel in a respective image of the set of images. The mixture model described above is applicable to both scalar valued and vector valued pixels.

All features and embodiments described herein apply equally to all aspects of the disclosure mutatis mutandis. It will, of course, be understood that, although particular embodiments have just been described by way of example, the claimed subject matter is not limited in scope to a particular embodiment or implementation.

Segmentation Algorithms

Algorithm 1

Disclosed is an unsupervised tumor segmentation framework based on dividing the stromal tissue into two types: Hypo-Cellular Stroma (HypoCS) and Hyper-Cellular Stroma (HyperCS). The proposed algorithm employs magnitude spectrum in the Gabor frequency domain to segment HypoCS regions and phase spectrum in the Gabor frequency domain to segment HyperCS regions. The algorithm has been evaluated on 35 H&E stained breast histopathology images belonging to 5 different tissue slides. Instead of evaluating the system using object based criteria, a much stricter pixel-based quantitative evaluation criterion has been utilized. Based on the tissue morphology, the contents of a breast carcinoma histopathology image can be divided into four regions: Tumor, Hypo-cellular Stroma (HypoCS), Hyper-cellular Stroma (HyperCS), and tissue fat and/or retractions/artifacts (Background). Background is removed during the pre-processing stage on the basis of color thresholding, while HypoCS and HyperCS regions are segmented by calculating features using magnitude and phase spectra respectively in the frequency domain and performing RanPEC segmentation (see Algorithm 2 and Annex 2) on these features. The algorithm pipeline can be subdivided into three stages: (1) Pre-processing to normalize the staining artifacts and remove tissue fat, artifacts, and the background; (2) Segmentation of HypoCS and HyperCS regions; (3) Post-processing to combine the result of background removal in (1) and segmentation in (2). Further details can be found in Appendix 1.

Algorithm 2

The disclosed Random Projections with Ensemble Clustering (RanPEC) algorithm addresses the problem of segmentation of tumor regions in a breast histopathology image using a feature-based classification approach. The RanPEC algorithm employs a library of textural features (consisting of just over 200 features), representing each image pixel as a point in a high-dimensional feature space. Due to the so-called curse of dimensionality, the high-dimensional feature space becomes computationally intractable and may even contain irrelevant and redundant features which may hinder the achievement of high classification accuracy. Recent feature selection and ranking methods, such as the commonly used minimum redundancy maximum relevance (mRMR) of H. Peng et al. (IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8): 1226-1238, 2005), employ information theoretic measures to reduce the dimensionality of the problem and have demonstrated success in several problem domains. However, major limitations of such approaches include data dependence and the requirement for training the feature selection in a supervised manner. The disclosed RanPEC algorithm overcomes these limitations without compromising the pixel-level segmentation accuracy.

Given a set of features computed for each image pixel, a general framework is presented in Section 2 of Annex 2 which employs orthogonal random projections with ensemble clustering for assigning a label to each of the image pixels. Section 3 of Annex 2 gives some details of the segmentation algorithm; in particular how a library of texture features is computed. Comparative results and discussion are presented in Section 4 of Annex 2.

Example

Set out below is an example of the application of the mitotic cell identification method described above. Further details can be found in Annex 3.

A Gamma-Gaussian Mixture Model (GGMM) was used to detect mitotic cells in breast cancer histopathological images. In addition, Context-Aware Post Processing (CAPP) was used to increase the Positive Predictive Value (PPV) with a minimal loss in sensitivity. The performance of the proposed detection algorithm was evaluated in terms of sensitivity and PPV over a set of 28 breast carcinoma histopathology images selected from 5 different tissue slides, which showed that a reasonably high value of sensitivity can be retained while increasing the PPV. The experimental dataset consisted of 35 digitized images of breast carcinoma slides with paraffin embedded sections stained with Hematoxylin and Eosin (H&E) and scanned at 40× using an Aperio ScanScope slide scanner. After stain normalization, background removal and unsupervised tumor segmentation over all 35 images, seven images were selected to extract mitotic and non-mitotic pixel intensities (L channel of La*b* color space) for model fitting using GGMM. 500 iterations and a tolerance of ε = 0.01 were chosen for the EM algorithm. Although EM provides estimates of the priors (p₁ and p₂), a more accurate estimate of the priors (p₁ = 0.0014 and p₂ = 0.9986) was used, based on the ratio of mitotic and non-mitotic data used for model fitting. The set of textural features extracted from a window of size 30 × 30 pixels around the bounding box of each candidate mitotic cell is as follows: 32 phase gradient or PG features (16 orientations, 2 scales), 1 roughness feature, 1 entropy feature. From each of these 34 features, 4 representative features were computed: (1) mean over the pixels, (2) standard deviation over the pixels, (3) skewness over the pixels, (4) kurtosis over the pixels. This gave a 136-dimensional feature vector for the context window with moments calculated over the pixels in the context window. The resulting 136-dimensional vector was used in training and testing of a Support Vector Machine (SVM).
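The reduction of 34 per-pixel feature maps to a 136-dimensional vector (34 features × 4 moments) can be sketched as follows. This is a minimal illustration, assuming the 34 maps have already been computed over the context window; the function name, the ordering of the output and the use of population (non-excess) moments are assumptions, not details given above.

```python
import numpy as np

def moment_features(feature_maps):
    """Summarise each feature map by mean, standard deviation, skewness, kurtosis.

    feature_maps: array of shape (n_features, height, width).
    Returns a vector of length 4 * n_features, ordered as all means, then all
    standard deviations, skewness values and kurtosis values (assumed ordering).
    """
    x = np.asarray(feature_maps, dtype=float)
    x = x.reshape(x.shape[0], -1)                 # one row of pixels per feature
    mean = x.mean(axis=1)
    std = x.std(axis=1)
    centered = x - mean[:, None]
    safe_std = np.where(std > 0, std, 1.0)        # avoid division by zero
    skewness = (centered ** 3).mean(axis=1) / safe_std ** 3
    kurtosis = (centered ** 4).mean(axis=1) / safe_std ** 4
    return np.concatenate([mean, std, skewness, kurtosis])

# 34 hypothetical feature maps over a 30 x 30 context window.
maps = np.random.default_rng(0).random((34, 30, 30))
vector = moment_features(maps)
```

With 34 input maps the resulting vector has 136 entries, which is the dimensionality passed to the SVM in the example above.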

Since the data consisting of candidate mitotic cells, identified before CAPP was applied, was unbalanced (mitotic 29.1%, non-mitotic 70.9%), a balanced mix of mitotic and non-mitotic examples was randomly selected as training data. A total of 69.90% of the data was used for training and the remaining 30.10% for testing. Grid search was used to find optimal parameters for the Gaussian kernel of the SVM. A higher penalty for misclassification in the SVM was set for the mitotic class, since the original data was unbalanced. Table 1 provides details of the quantitative results obtained with a five-fold cross-validation. According to these results, the Positive Predictive Value (PPV) was enhanced by more than 100% at the cost of a less than 13% reduction in sensitivity.
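The random selection of a balanced training mix can be sketched as below. This is only an illustration of class-balanced sampling, assuming labels of 1 for mitotic and 0 for non-mitotic candidates; the function name, the seed and the sample size are hypothetical, and the SVM training and grid search themselves are not shown.

```python
import random

def balanced_sample(labels, n_per_class, seed=0):
    """Return shuffled indices containing equally many positives and negatives."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(n_per_class, len(positives), len(negatives))
    chosen = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(chosen)
    return chosen

# Hypothetical candidate set with roughly the class ratio quoted above.
labels = [1] * 29 + [0] * 71
train_idx = balanced_sample(labels, n_per_class=20)
```

Undersampling the larger class in this way is one common reading of "a balanced mix"; other schemes, such as oversampling the mitotic class, would also fit the description.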

The results presented in Table 1 are based on CAPP carried out using indicator values comprising solely textural features. However, preliminary experimental results indicate that adding boundary shape features to this method may significantly improve the results.

Table 1: Quantitative comparison of sensitivity and PPV with and without using CAPP for a fixed value of area threshold = 120 over 35 breast histopathology images containing 226 mitotic cells. By employing CAPP, PPV is more than doubled on unseen data, without drastically reducing the sensitivity (i.e. by less than 13% only).

HyMaP: A Hybrid Magnitude-Phase Approach to Unsupervised Segmentation of Tumor Areas in Breast Cancer Histology Images

Abstract. Segmentation of tumor regions in a breast histology image can not only highlight slide-areas consisting of tumor cells, it is also vital for other tasks related to breast cancer grading such as automated scoring of immunohistochemical (IHC) stained slides and detection of mitotic cells. In this paper, we propose a novel approach to unsupervised tumor segmentation in breast histology images. The novelty of the proposed approach lies in (a) casting the dual problem of segmenting two main types of stromal regions: hypo-cellular and hyper-cellular and (b) employing a hybrid of features derived from magnitude and phase spectra of the frequency domain. We show that the two spectra can provide complementary segmentation of hypo- and hyper-cellular stromal regions with a high degree of segmentation accuracy as compared to ground truth marked by two pathologists. We further demonstrate the efficacy of tumor segmentation for detection of mitotic cells in order to significantly reduce the number of false positives.

1 Introduction

Grading of breast cancer relies largely on microscopic examination of tissue slides stained with Hematoxylin & Eosin (H&E). This is a subjective process by its very nature consequently leading to inter- and even intra-observer variability potentially affecting predicted patient prognosis and also the treatment modalities offered. The variability in breast cancer grading may, at least in part, be responsible for the variability in rates of chemotherapy use between institutions.

Segmentation of areas containing tumor cells in standard H&E histopathology images of breast (and several other tissues) is a key task for computer-assisted assessment and grading of histopathology slides. Good segmentation of tumor regions can not only highlight slide-areas consisting of tumor cells, it is also vital for automated scoring of immunohistochemical (IHC) stained slides to restrict the scoring or analysis to areas containing tumor cells only and avoid potentially misleading results from analysis of stromal regions. Furthermore, detection of mitotic cells is critical for calculating key measures such as mitotic index, a key criterion for grading several types of cancers including breast cancer. We are aware of existing technologies [1] which are capable of detecting mitotic cells on slides stained with IHC stains (e.g. Ki67, PHH3 etc.). However, as we show in this paper, tumor segmentation can allow detection and quantification of mitotic cells from the standard H&E slides with a high degree of accuracy, without need for special stains, in turn making the whole process more cost-effective.

While some algorithms for segmentation of tumor nuclei [2], quantitative evaluation of nuclear pleomorphism [3], detection and grading of lymphocytic infiltration in histology images [4], and automated malignancy detection [5] have been reported in the literature, tumor segmentation in breast histology images has not received much attention. In [6], Wang et al. proposed a supervised tumor segmentation approach that exploits tissue architectural and textural features in a Markov Random Field (MRF) based Bayesian estimation framework. However, supervised segmentation of breast cancer histology images containing highly complex texture often raises questions regarding an algorithm's ability to avoid overfitting, let alone the issue of training overhead.

Feature based segmentation approaches often use a filter bank to represent a pixel as a point in a high-dimensional feature space, posing the so-called curse of dimensionality problem. A dimensionality reduction (DR) technique giving a low-dimensional representation and preserving relative distances between features from the original feature space is desirable to solve this problem. Along these lines, Viswanath et al. [7] proposed an ensemble embedding framework and applied it to image segmentation and classification. The idea is to generate an ensemble of low dimensional embeddings (using a variety of DR methods, such as graph embedding), evaluate embedding strength to select the most suitable embeddings and finally generate a consensus embedding by exploiting the variance among the ensemble. However, a major limitation of the framework proposed in [7] in the context of histopathology image analysis is that it has high storage and computational complexity, mainly due to the very high-dimensional affinity matrices required for graph embeddings.

Random Projections (RPs) have recently emerged as a computationally simple and efficient low-dimensional subspace representation [8], with a minor drawback: multiple RPs may produce substantially different projections because of the very nature of random matrices. Although this may not be a big issue in certain applications (like multimedia compression etc.), it cannot be ignored in applications like segmentation in low-dimensional feature space. Khan et al. [9] proposed an ensemble of multiple RPs (which they termed RanPEC, short for Random Projections with Ensemble Clustering) followed by majority voting to address the issue of variability among multiple RPs. They further showed that ensemble clustering of random projections onto merely 5 dimensions achieves higher segmentation accuracy than a well-known supervised DR method [10] on breast histology images.

In this paper, we propose a fast and totally unsupervised tumor segmentation framework based on dividing the stromal tissue into two types: Hypo-Cellular Stroma (HypoCS) and Hyper-Cellular Stroma (HyperCS). The proposed algorithm employs magnitude spectrum in the Gabor frequency domain to segment HypoCS regions and phase spectrum in the Gabor frequency domain to segment HyperCS regions. The algorithm has been evaluated on 35 H&E stained breast histology images belonging to 5 different tissue slides. Instead of evaluating the system using object based criteria, we have used a much stricter pixel-based quantitative evaluation criterion. The experimental results show that the proposed system achieves an F1-Score of 0.89 (with respect to GT markings) for pixel-based segmentation in H&E images.

Fig. 1: A sample H&E stained breast cancer histology image: (a) Original image, and (b) Overlaid image with four types of contents shown in different colors. Tumor areas are shown in Red, HypoCS in Purple, and HyperCS in Green. Areas containing background or fat tissue are shown with Black boundaries. Note the difference in morphology of Hypo- and Hyper-cellular stromal regions.

The main contributions of this paper are as follows: (a) we show that magnitude and phase spectra of the frequency domain are effective in representing complementary features of HypoCS regions and HyperCS regions respectively; (b) we present a fast, unsupervised, and data-independent algorithm for pixel level classification of tumor vs. stromal regions (by integrating HypoCS and HyperCS segmentations) in breast histology images, and (c) we show that segmentation of stromal regions in breast histology images plays a critical role in mitosis detection, leading to more accurate calculation of mitotic index: one of the three criteria used in the so-called Nottingham breast cancer grading system [11].

The remainder of this paper is organized as follows. Section 2 outlines details of the segmentation algorithm, in particular how the segmentation of HypoCS and HyperCS is performed in a low dimensional feature space. Comparative results and discussion are presented in Section 3. The paper concludes with a summary of our results and some directions for future work.

Fig. 2: Overview of the proposed algorithm RobUsTS: Robust Unsupervised Tumor Segmentation.

2 Materials & Methods

2.1 Description of The Dataset

We evaluated our segmentation framework on the MITOS dataset¹. The dataset consists of 35 HPF (High Power Field) images taken from 5 different breast cancer biopsy slides, stained with Hematoxylin and Eosin (H&E), scanned at 40× magnification using an Aperio ScanScope slide scanner. Each HPF has a digital resolution of 2084 × 2084 pixels.

2.2 The Segmentation Algorithm

Based on the tissue morphology, breast histology image contents can be divided into four regions (see Figure 1): Tumor, Hypo-cellular Stroma (HypoCS), Hyper-cellular Stroma (HyperCS), and tissue fat and/or retractions/artifacts (Background). Background is removed during the pre-processing stage on the basis of color thresholding, while HypoCS and HyperCS regions are segmented by calculating features using magnitude and phase spectra respectively in the frequency domain and performing RanPEC segmentation [9] on these features. The algorithm pipeline can be subdivided into three stages: (1) Pre-processing to normalize the staining artifacts and remove tissue fat, artifacts, and the background; (2) Segmentation of HypoCS and HyperCS regions; (3) Post-processing to combine the result of background removal in (1) and segmentation in (2). A block diagram of the proposed tumor segmentation framework is shown in Figure 2. Algorithm 1 outlines algorithmic details of the pipeline.

¹ http://ipal.cnrs.fr/ICPR2012/

Algorithm 1 Hybrid Magnitude-Phase (HyMaP) based Tumor Segmentation

1: Input: $I$, an RGB breast histology image.

2: Output: $T$, a binary image where pixels belonging to tumor regions have a value of 1 and all other pixels have a value of 0.

3: Preprocessing:

$I_{norm}$ = StainNormalize($I$)

$B$ = EstimateBackground($I_{norm}$)

$I^{b}_{norm}$ = AnisotropicDiffusion($I^{b}_{norm}$),

where $I^{b}_{norm}$ is the b* channel from the La*b* color space [12] of $I_{norm}$.

4: HypoCS Segmentation:

$G_{gabor}$ = {GaborFilter($I^{b}_{norm}$, $\theta$, $f$) | $\theta \in \{0, \ldots, \pi\}$ and $f \in F$},

where $F$ is the set of frequencies as defined at the end of Section 2.2.

$G_{ener}$ = TextureEnergy($|G_{gabor}|$, $\mu$, $\sigma$),

where $\mu$ and $\sigma$ are parameters of a Gaussian window used to compute the texture energy.

$H_o$ = RanPECSegmentation($G_{ener}$), as described in Section 2.2.

5: HyperCS Segmentation:

$G_{gabor}$ = {GaborFilter($I^{b}_{norm}$, $\theta$, $f$) | $\theta \in \{0, \ldots, \pi\}$ and $f \in F$},

where $F$ is as defined in Section 2.2.

$G_{pg}$ = GradientFeature($G_{gabor}$, $N$) according to equation (7),

where an $N \times N$ window is used to compute the local phase gradients.

$H_r$ = RanPECSegmentation($G_{pg}$), as described in Section 2.2.

6: Postprocessing:

$T = \overline{H_o \cup H_r} - B$

7: return T.
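The post-processing step can be read as keeping the pixels that belong to neither stromal mask nor the background. The following is a minimal sketch under that reading, which follows from the four-region description in Section 2.2 rather than from a confirmed formula; the function and variable names are hypothetical.

```python
import numpy as np

def combine_masks(hypo_cs, hyper_cs, background):
    """Tumor mask: pixels outside HypoCS, HyperCS and background (assumed rule)."""
    stroma = np.logical_or(hypo_cs, hyper_cs)
    not_background = np.logical_not(np.asarray(background, dtype=bool))
    tumor = np.logical_and(np.logical_not(stroma), not_background)
    return tumor.astype(np.uint8)   # 1 for tumor pixels, 0 elsewhere

# Tiny 2 x 2 example: one HypoCS pixel, one HyperCS pixel, one background pixel.
h_o = np.array([[1, 0], [0, 0]], dtype=bool)
h_r = np.array([[0, 1], [0, 0]], dtype=bool)
b = np.array([[0, 0], [1, 0]], dtype=bool)
t = combine_masks(h_o, h_r, b)
```

In this toy case only the bottom-right pixel is left unclaimed by the three masks, so it is the only pixel labelled as tumor.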

Pre-processing. Stain color constancy is one of the biggest challenges of digitized images of H&E stained tissue slides. Several factors such as thickness of the tissue section, dye concentration, stain timings and stain reactivity result in variable stain color intensity and contrast. We evaluated various stain normalization methods but found [13] to be the most effective in terms of dealing with tissues containing a large amount of retractions/staining artifacts. The second stage of the preprocessing pipeline is to estimate the background. First, the stain-normalized (color) tissue image is transformed from the RGB space into the YCbCr space. The luminance channel is then thresholded using an empirically determined, fixed, global threshold. The rough binary mask resulting from this thresholding is then refined via morphological operations in order to fill up small gaps. Finally, the stain normalized and background free image is converted into the CIE's La*b* color space and anisotropic diffusion [14] is applied to its b* channel in order to remove the inherent camera noise while preserving edges.
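The luminance thresholding step of background estimation can be sketched as follows. This is a minimal illustration, assuming 8-bit RGB input, the ITU-R BT.601 luma weights for the Y channel of YCbCr, and that background appears brighter than tissue; the threshold value of 200 is a hypothetical placeholder for the empirically determined threshold, and the morphological refinement is omitted.

```python
import numpy as np

def estimate_background(rgb, threshold=200.0):
    """Rough background mask: True where the luminance exceeds the threshold.

    rgb: array of shape (height, width, 3) with values in 0..255.
    threshold: hypothetical value; the text only says it is fixed and empirical.
    """
    rgb = np.asarray(rgb, dtype=float)
    # Y (luminance) channel of YCbCr using BT.601 weights.
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return luma > threshold

# One near-white (background-like) pixel and one purple (tissue-like) pixel.
image = np.array([[[255, 255, 255], [120, 60, 140]]], dtype=np.uint8)
mask = estimate_background(image)
```

A fuller version would follow this with morphological closing or hole filling to remove small gaps in the mask, as the paragraph above describes.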

Hypo-Cellular Stromal Features. A traditional approach to texture segmentation is inspired by the multi-channel filtering theory. The idea is to characterize an image by a bank of filters to generate a set of features that are capable of discriminating texture patterns belonging to different categories. A two-dimensional Gabor function consists of a sinusoidal plane wave of a certain frequency and orientation, modulated by a two-dimensional Gaussian. A Gabor filter in the spatial domain is given by the following equation [15],

$$G_{\theta}(x, y) = g_{\sigma}(x, y)\, \exp\left(j 2\pi f (x\cos\theta + y\sin\theta)\right) \qquad (1)$$

where $g_{\sigma}(x, y)$ is a Gaussian kernel with a bandwidth of $\sigma$. The parameters $f$ and $\theta$ represent the frequency and orientation of the 2D Gabor filter, where $\theta$ varies between 0 and $\pi$ in regular intervals, $f \in F$, and $F$ denotes a set of possible frequencies, defined as follows. Zhang et al. [16] proposed that for texture segmentation, features are often prevalent in the intermediate frequency bands. Based on this, they proposed a frequency selection scheme which emphasizes the intermediate frequency bands as given below,

$$F_L(i) = 0.25 - \frac{2^{\,i-0.5}}{N_{col}}, \qquad 0 < F_L(i) < 0.25 \qquad (2)$$

$$F_H(i) = 0.25 + \frac{2^{\,i-0.5}}{N_{col}}, \qquad 0.25 < F_H(i) < 0.5 \qquad (3)$$

where $i = 0, 1, \ldots, \log_2(N_{col}/8)$, and $N_{col}$ is the width of the image in terms of the nearest power of 2. We then define the set $F$ of possible frequencies as follows,

$$F = F_L \cup F_H. \qquad (4)$$
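The frequency selection scheme of equations (2) to (4) can be sketched in a few lines. The upper limit of the index, log2(N_col / 8), is an assumption of this sketch: it is the value that yields the 14 frequencies quoted for an image with 512 columns. The function name is hypothetical.

```python
import math

def frequency_set(n_col):
    """Frequencies F = F_L U F_H for an image width n_col (a power of 2).

    The index runs over i = 0 .. log2(n_col / 8); this upper limit is assumed.
    """
    upper = int(math.log2(n_col / 8))
    low = [0.25 - 2 ** (i - 0.5) / n_col for i in range(upper + 1)]
    high = [0.25 + 2 ** (i - 0.5) / n_col for i in range(upper + 1)]
    # Keep only values inside the ranges stated in equations (2) and (3).
    low = [f for f in low if 0 < f < 0.25]
    high = [f for f in high if 0.25 < f < 0.5]
    return sorted(low + high)

freqs = frequency_set(512)
```

For a 512-column image this produces seven frequencies below 0.25 and seven above it, clustered around the intermediate band.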

For an image with 512 columns, for example, a total of 84 Gabor filters can then be used: 6 orientations and 14 frequencies. Hypo-cellular stromal features are then computed by convolving the Gabor filters $G_{\theta}(\cdot)$ with $I^{b}_{norm}$, the b* channel of the stain normalized and smoothened version of an input image $I$, for all $\theta$ and $f$, and computing local energy on the results of convolution.
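A Gabor kernel of the form in equation (1) and a small filter bank can be sketched as follows. This is an illustration only: the isotropic, normalised Gaussian envelope, the kernel size, the value of sigma, the placeholder frequencies and the choice of orientations k·π/n for k = 0, …, n−1 are all assumptions, and the convolution and local energy steps are not shown.

```python
import numpy as np

def gabor_kernel(f, theta, sigma, half_size):
    """Complex Gabor kernel: Gaussian envelope times a complex plane wave."""
    axis = np.arange(-half_size, half_size + 1)
    x, y = np.meshgrid(axis, axis)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    carrier = np.exp(1j * 2 * np.pi * f * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

def gabor_bank(frequencies, n_orientations, sigma, half_size):
    """One kernel per (orientation, frequency) pair."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return [gabor_kernel(f, t, sigma, half_size) for t in thetas for f in frequencies]

kernel = gabor_kernel(0.25, 0.0, sigma=2.0, half_size=8)
# 14 placeholder frequencies and 6 orientations give 84 filters.
bank = gabor_bank(np.linspace(0.05, 0.45, 14), n_orientations=6, sigma=2.0, half_size=8)
```

The magnitude of each filter response would feed the texture energy features, while its phase would feed the phase gradient features described in the next subsection.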

Hyper-Cellular Stromal Features. Phase information can be used as an important cue in modeling textural properties of a region. In [17], Murtaza et al. used local frequency estimates in the Gabor domain over a range of scales and orientations to yield a signature which was shown to efficiently characterize the texture of a village in satellite images. We chose the phase spectrum to represent attributes of the HyperCS regions in a breast histology image due to the recently established efficacy of phase in textures exhibiting randomness.

Let $v_i(x, y)$ denote the $i$th Gabor channel for the stain normalized and smoothened version of an input image $I(x, y)$, where $i = 1, 2, \ldots, N_g$, $N_g = N_{\theta} \times |F|$, and $N_{\theta}$ denotes the number of orientations. We can represent it as follows,

$$v_i(x, y) = |v_i(x, y)| \exp\left(j\phi_i(x, y)\right) \qquad (5)$$

where $|\cdot|$ denotes the magnitude operator and $\phi_i(x, y)$ denotes the local phase. The gradient of the local phase and its magnitude can then be computed as below,

and

The phase gradient features are computed using (7) for each of the Gabor filter responses over a window of size $N \times N$.

RanPEC Segmentation. RanPEC [9] is a fast, unsupervised, and data-independent framework for dimensionality reduction and clustering of high-dimensional data points. The main idea of RanPEC is to project high-dimensional feature vectors onto a relatively small number of orthogonal random vectors belonging to a unit ball and perform ensemble clustering in the reduced-dimensional feature space. Obtaining an ensemble of projections for each feature vector and then assigning each pixel to a cluster by majority voting ensures stability of results among different runs. Experimental results in [9] suggest that promising classification accuracy can be achieved by random projections using fast matrix operations in an unsupervised manner.
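The idea of random projections followed by ensemble clustering can be sketched as below. This is not the RanPEC implementation of [9]; it is a simplified illustration that assumes two clusters, builds orthonormal random directions with a QR decomposition, uses a tiny 2-means routine with a deterministic farthest-point initialisation, and aligns cluster names to the first run before voting. All function names and these design choices are assumptions.

```python
import numpy as np

def orthogonal_projection(n_features, r, rng):
    """Random n_features x r matrix with orthonormal columns (via QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((n_features, r)))
    return q

def kmeans_two(x, n_iter=20):
    """Tiny 2-means clustering; returns a 0/1 label for each row of x."""
    first = x[((x - x.mean(axis=0)) ** 2).sum(axis=1).argmax()]
    second = x[((x - first) ** 2).sum(axis=1).argmax()]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels

def ranpec_sketch(features, r=5, n_runs=20, seed=0):
    """Majority vote over clusterings of n_runs random r-dimensional projections."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(features))
    reference = None
    for _ in range(n_runs):
        projected = features @ orthogonal_projection(features.shape[1], r, rng)
        labels = kmeans_two(projected)
        if reference is None:
            reference = labels
        elif np.mean(labels == reference) < 0.5:
            labels = 1 - labels          # cluster names are arbitrary; align them
        votes += labels
    return (votes > n_runs / 2).astype(int)

# Two well-separated synthetic groups of 50-dimensional feature vectors.
data_rng = np.random.default_rng(1)
group_a = data_rng.standard_normal((100, 50))
group_b = data_rng.standard_normal((100, 50)) + 20.0
segmentation = ranpec_sketch(np.vstack([group_a, group_b]))
```

The defaults r = 5 and n_runs = 20 mirror the parameter values reported in Section 3; the vote makes the final labelling less sensitive to any single random projection.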

3 Experimental Results and Discussion

All 35 images in the database were hand segmented by two expert pathologists. We generate all experimental results on 3 criteria: (1) considering pathologist-1's markings (P-1) as ground truth (GT); (2) considering pathologist-2's markings (P-2) as GT; (3) fusing P-1 and P-2 using the logical OR rule (i.e. a pixel is considered to be tumorous if either of the 2 pathologists marked the pixel as tumorous), and considering the fused image as GT. Some of the HPF images contain large tumor regions with small islands of stroma here and there; however, the majority of HPF images contain a fair share of hypo- and hyper-cellular stroma (approx. 33%, on average). The average degree of disagreement between the two pathologists on GT images is 11.55% ± 5.37%.
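The logical OR fusion of the two sets of markings can be sketched as follows, together with one possible disagreement measure. The disagreement definition used here, the fraction of pixels labelled differently by the two pathologists, is an assumption; the text does not state how the degree of disagreement was computed. Function names are hypothetical and masks are given as flat lists of 0/1 values.

```python
def fuse_ground_truth(p1, p2):
    """A pixel is tumorous in the fused GT if either pathologist marked it."""
    return [int(bool(a) or bool(b)) for a, b in zip(p1, p2)]

def disagreement(p1, p2):
    """Fraction of pixels on which the two markings differ (assumed definition)."""
    return sum(bool(a) != bool(b) for a, b in zip(p1, p2)) / len(p1)

p1 = [1, 1, 0, 0]
p2 = [1, 0, 1, 0]
fused = fuse_ground_truth(p1, p2)
rate = disagreement(p1, p2)
```

In this toy case the fused mask marks three of the four pixels as tumor, and the two markings differ on half of the pixels.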

All 35 images in the dataset are pre-processed in a similar manner, with stain normalization carried out as described in Section 2.2, and background removal performed to remove fat and other artifacts caused by staining and fixation. This provides robustness in the subsequent steps of the pipeline. In order to segment HypoCS, a total of 84 Gabor textural features (14 scales and 6 orientations) are calculated for each pixel of the input image I. In order to generate HyperCS, PG features are calculated on 10 orientations and 3 scales. Gradient features are computed in a window of size 15 × 15. The RanPEC segmentation framework is applied on both Gabor and PG features independently, yielding HypoCS and HyperCS segmentations respectively. The RanPEC framework requires 2 parameters: r, the dimensionality of the lower dimensional space, and n_c, the number of runs of the ensemble. As recommended in [9], we used r = 5 and n_c = 20 in our experiments. We have compared our proposed algorithm (HyMaP) with Standard-RanPEC using the same experimental setup as suggested in [9]. The algorithms considered here are evaluated on three pixel-wise accuracy measures: precision, recall, and F1-Score.

Fig. 3: Illustration of complementary segmentations obtained by hypo- and hyper-cellular stroma segmentation: (a) Original images; (b) Results of hypoCS shown in slightly darker contrast, outlined in green color; (c) Results of hyperCS shown in slightly darker contrast, outlined in green color.

F1-Score is a measure that combines precision and recall in a statistically more meaningful way. Let TP denote the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. Precision is defined as TP/(TP+FP), recall is defined as TP/(TP+FN) and F1-Score is defined as 2 × (precision × recall) / (precision + recall).
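The three pixel-wise measures can be computed directly from two binary masks. The sketch below assumes the masks are given as flat sequences of 0/1 values of equal length; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def segmentation_scores(predicted, ground_truth):
    """Return (precision, recall, F1-Score) for two binary masks."""
    pairs = list(zip(predicted, ground_truth))
    tp = sum(1 for p, g in pairs if p and g)          # true positives
    fp = sum(1 for p, g in pairs if p and not g)      # false positives
    fn = sum(1 for p, g in pairs if not p and g)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

predicted = [1, 1, 1, 0, 0, 0, 1, 0]
ground_truth = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = segmentation_scores(predicted, ground_truth)
```

Here TP = 3, FP = 1 and FN = 1, so precision, recall and F1-Score all equal 0.75. True negatives do not enter any of the three measures.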

Figure 3 illustrates the efficiency of HypoCS segmentation [Figure 3(b)] and HyperCS segmentation [Figure 3(c)] in capturing complementary stromal subtypes. Figure 4 illustrates the proposed tumor segmentation algorithm on 2 different HPF images. Segmentation results obtained by combining HypoCS and HyperCS yield high F1-Scores of 0.86 and 0.89 with respect to the fused GT. Considering the degree of disagreement between the two pathologists (i.e. 11.5% ± 5.37%), the results can be termed as highly accurate. Table 1 shows the segmentation accuracies (in terms of precision, recall and F1-Score) of the unreduced and reduced feature spaces resulting from automated tumor segmentation. Note that the F1-Scores obtained from HyMaP (0.88, 0.89 and 0.89) are higher than those from the unreduced feature space (0.87, 0.88 and 0.88) and Standard-RanPEC (0.85, 0.85 and 0.85). Table 1 also reveals that the reduced textural feature space achieves F1-Scores of 0.88, 0.89 and 0.89, suggesting, in turn, that DR removes redundant features and preserves the distances between points in the high dimensional feature space, thereby improving segmentation accuracy.

Further to the accuracy of segmentation, we present an application of tumor segmentation to Mitotic Cell (MC) detection in tumor areas. Mitotic cell detection is critical for calculating key measures such as mitotic index: one of the three criteria used in the Nottingham grading system [11] to grade breast cancer histology slides. In [18], Khan et al. proposed a statistical approach that models the pixel intensities in mitotic and non-mitotic regions by a Gamma-Gaussian mixture model. Figure 5 visually illustrates how tumor segmentation can improve mitotic cell detection accuracy in breast histology images. Using the algorithm reported in [18], Figure 5(a) shows the results of MC detection without tumor segmentation and Figure 5(b) shows the results of MC detection with tumor segmentation. Note that the number of false positives increases significantly (from 4 to 82) when no tumor segmentation is performed.

Table 1: Quantitative results of tumor segmentation accuracy indicators (precision, recall and F1-Score) for 35 BC histopathology images using 3 feature spaces: (1) UnReduced (n = 114), (2) Standard-RanPEC (n = 10, as in [9]), and (3) HyMaP (n = 10), where n denotes the dimensionality of the feature space.

Ground Truth     Feature space   Precision      Recall         F1-Score
Pathologist-1    UnReduced       0.89 ± 0.05    0.86 ± 0.07    0.87
Pathologist-1    RanPEC          0.85 ± 0.06    0.86 ± 0.05    0.85
Pathologist-1    HyMaP           0.88 ± 0.03    0.88 ± 0.05    0.88
Pathologist-2    UnReduced       0.90 ± 0.08    0.86 ± 0.09    0.88
Pathologist-2    RanPEC          0.86 ± 0.07    0.85 ± 0.07    0.85
Pathologist-2    HyMaP           0.90 ± 0.06    0.88 ± 0.07    0.89
Fused            UnReduced       0.93 ± 0.06    0.84 ± 0.08    0.88
Fused            RanPEC          0.88 ± 0.07    0.83 ± 0.06    0.85
Fused            HyMaP           0.93 ± 0.04    0.86 ± 0.06    0.89

4 Conclusions

In this paper, we presented an algorithm for segmentation of tumor areas in breast histology images based on segmentation of the image into hypo-cellular stroma and hyper-cellular stroma using magnitude and phase spectra in the Gabor domain. The complementary nature of the segmentation of two stromal subtypes was shown, resulting in high segmentation accuracy for the tumor areas. It was further demonstrated that the specificity of mitotic cell detection can be significantly enhanced when detection is restricted to the tumor areas. We anticipate further applications of our method to accurate, tumor-localized quantification/scoring of IHC stained slides and its validation on large-scale datasets.

References

1. Roullier, V., Lezoray, O., Ta, V., Elmoataz, A.: Multi-resolution graph-based analysis of histopathological whole slide images: Application to mitotic cell extraction and visualization. Computerized Medical Imaging and Graphics 35(7-8) (2011) 603-615

2. Jeong, H., Kim, T., Hwang, H., Choi, H., Park, H., Choi, H.: Comparison of thresholding methods for breast tumor cell segmentation. In: Enterprise networking and Computing in Healthcare Industry, 2005. HEALTHCOM 2005. Proceedings of 7th International Workshop on, IEEE (2005) 392-395

3. Dalle, J., Leow, W., Racoceanu, D., Tutac, A., Putti, T.: Automatic breast cancer grading of histopathological images. In: Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, IEEE (2008) 3052-3055

4. Basavanhally, A., Ganesan, S., Agner, S., Monaco, J., Feldman, M., Tomaszewski, J., Bhanot, G., Madabhushi, A.: Computerized image-based detection and grading of lymphocytic infiltration in HER2+ breast cancer histopathology. Biomedical Engineering, IEEE Transactions on 57(3) (2010) 642-653

5. Chekkoury, A., Khurd, P., Ni, J., Bahlmann, C., Kamen, A., Patel, A., Grady, L., Singh, M., Groher, M., Navab, N., et al.: Automated malignancy detection in breast histopathological images. In: Proceedings of SPIE. Volume 8315. (2012) 831515

6. Wang, C., Fennell, D., Paul, I., Savage, K., Hamilton, P.: Robust automated tumour segmentation on histological and immunohistochemical tissue images. PloS one 6(2) (2011) e15818

7. Viswanath, S., Madabhushi, A.: Consensus embedding: theory, algorithms and application to segmentation and classification of biomedical data. BMC bioinformatics 13(1) (2012) 26

8. Dasgupta, S.: Experiments with random projection. In: Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. (2000) 143-151

9. Khan, A., El-Daly, H., Rajpoot, N.M.: RanPEC: Random projections with ensemble clustering for segmentation of tumor areas in breast histology images. In: Medical Image Understanding and Analysis. (2012)

10. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27(8) (2005) 1226-1238

11. Galea, M., Blamey, R., Elston, C., Ellis, I.: The Nottingham prognostic index in primary breast cancer. Breast cancer research and treatment 22(3) (1992) 207-219

12. Schwarz, M., Cowan, W., Beatty, J.: An experimental comparison of rgb, yiq, lab, hsv, and opponent color models. ACM Transactions on Graphics (TOG) 6(2) (1987) 123-158

13. Magee, D., Treanor, D., Chomphuwiset, P., Quirke, P.: Context aware colour classification in digital microscopy. In: Proc. Medical Image Understanding and Analysis, Citeseer (2010) 1-5

14. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on 12(7) (1990) 629-639

15. Daugman, J.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Optical Society of America, Journal, A: Optics and Image Science 2 (1985) 1160-1169

16. Zhang, J., Tan, T., Ma, L.: Invariant texture segmentation via circular gabor filters. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. Volume 2., IEEE (2002) 901-904

17. Murtaza, K., Khan, S., Rajpoot, N.: Villagefinder: Segmentation of nucleated villages in satellite imagery. British Machine Vision Conference (2009)

18. Khan, A., El-Daly, H., Rajpoot, N.M.: A Gamma-Gaussian mixture model for detection of mitotic cells in breast histology images. In: International Conference on Pattern Recognition. (2012)

Fig. 4: Visual results of tumor segmentation in two sample images: First row: Original images with fused ground truth (GT) marked non-tumor areas shown in a slightly darker contrast with blue boundaries; Second row: Results of combining HypoCS and HyperCS using the proposed framework (F1-Score = 0.86 and 0.89, respectively); Third row: Visual illustration of segmentation accuracy, green = TP, red = TN, yellow = FN and blue = FP.

Fig. 5: Visual results of mitotic cell (MC) detection in a sample image: (a) Results of MC detection without tumor segmentation (TP = 17, FP = 82) using [18]; (b) Results of MC detection with tumor segmentation (TP = 17, FP = 4); (c) Zoomed-in version of a portion of Figure 5(a) for better visibility; (d) Zoomed-in version of a portion of Figure 5(b) for better visibility. In (a) and (b), all true positive MCs are shown in yellow and all false positives are shown in green.

KHAN et al.: RanPEC FOR BREAST TUMOR SEGMENTATION

RanPEC: Random Projections with Ensemble Clustering for Segmentation of Tumor Areas in Breast Histology Images

Adnan Mujahid Khan¹ (amkhan@dcs.warwick.ac.uk)

Hesham El-Daly² (Hesham.El-Daly@uhcw.nhs.uk)

Nasir Rajpoot¹ (nasir@dcs.warwick.ac.uk)

¹ Department of Computer Science, University of Warwick, Coventry, UK

² University Hospital Coventry and Warwickshire, UK

Abstract

Segmentation of areas containing tumor cells in breast histology images is a key task for computer-assisted grading of breast tissue slides. In this paper, we present a fast, unsupervised, and data-independent framework for dimensionality reduction and clustering of high-dimensional data points, which we term Random Projections with Ensemble Clustering (RanPEC). We apply the proposed framework to pixel-level classification of tumor vs. non-tumor regions in breast histology images and show that ensemble clustering of random projections of high-dimensional textural feature vectors onto merely 5 dimensions achieves up to 10% higher pixel-level classification accuracy than another state-of-the-art information theoretic method which is both data-dependent and supervised.

1 Introduction

Breast cancer is the leading cancer in women in terms of incidence both in the developed and the developing world. According to the World Health Organization (WHO), the incidence of breast cancer is increasing in the developed world due to increased life expectancy and other factors. Histological grading of breast cancer relies on microscopic examination of Hematoxylin & Eosin (H&E) stained slides and includes: assessment of mitotic count in the most mitotic area, tubule/acinar formation and degree of nuclear pleomorphism over the whole tumor. Recent studies have shown that the proliferation (mitotic) rate provides useful information on prognosis of certain subtypes of breast cancer. Grading is a highly subjective process by its very nature and consequently leads to inter- and even intra-observer variability. With digital slide scanning technologies becoming ubiquitous in pathology labs around the world, image analysis promises to introduce more objectivity to the practice of histopathology and facilitate its entry into the digital era [5].

Segmentation of areas containing tumor cells in breast histopathology images is a key task for computer-assisted grading of breast tissue slides. Good segmentation of tumor regions can not only highlight areas of the slides consisting of tumor cells, it can also assist in determining extent of tissue malignancy. Though some algorithms for segmentation of nuclei, quantitative evaluation of nuclear pleomorphism, and grading of lymphocytic infiltration in breast histology images have been proposed in the literature in recent years (see, for example, [2, 5]), tumor segmentation in breast histology images has received relatively less attention.

In this paper, we address the problem of segmentation of tumor regions in a breast histology image using a feature-based classification approach. The proposed algorithm employs a library of textural features (consisting of just over 200 features), representing each image pixel as a point in a high-dimensional feature space. Due to the so-called curse of dimensionality, the high-dimensional feature space becomes computationally intractable and may even contain irrelevant and redundant features which may hinder high classification accuracy. Recent feature selection and ranking methods, such as the commonly used minimum redundancy maximum relevance (mRMR) [10], employ information theoretic measures to reduce dimensionality of the problem and have demonstrated success in several problem domains. However, major limitations of such approaches include data dependence and the requirement for training the feature selection in a supervised manner. We show that these limitations can be overcome via the proposed Random Projections with Ensemble Clustering (RanPEC) without compromising the segmentation accuracy down to pixel level.

The remainder of this paper is organized as follows. Given a set of features computed for each image pixel, we present a general framework in Section 2 which employs orthogonal random projections with ensemble clustering for assigning a label to each of the image pixels. Section 3 gives some details of the segmentation algorithm, in particular how a library of texture features is computed. Comparative results and discussion are presented in Section 4. The paper concludes with a summary of our results and some directions for future work.

2 Random Projections with Ensemble Clustering

Let $X = \{x(i, j) \mid (i, j) \in \Omega\}$ denote the set of $d$-dimensional feature vectors for all pixels in an image $I$, where $\Omega$ denotes the set of all legitimate pixel coordinates for $I$ and $x \in \mathbb{R}^d$. Suppose now that we reduce the dimensionality of all such vectors to a low-dimensional space $\mathbb{R}^r$ using a linear mapping $\Phi$ as follows: $y = \Phi x$, where $y \in \mathbb{R}^r$, $r \ll d$, and $\Phi$ is an $r \times d$ matrix containing random entries. According to the Johnson-Lindenstrauss Lemma [6], the above mapping can be used to reduce dimensionality of the feature space while approximately preserving the Euclidean distances between pairs of points in the higher $d$-dimensional space.
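For reference, one common statement of the distance-preservation guarantee is sketched below. The symbols $\varepsilon$ (distortion), $N$ (number of points) and $f$ (the embedding) are introduced here for illustration and do not appear in the source text.

```latex
% Johnson-Lindenstrauss lemma (one common form).
% For any 0 < \varepsilon < 1 and any set S of N points in \mathbb{R}^d,
% there is a linear map f : \mathbb{R}^d \to \mathbb{R}^r with
% r = O(\varepsilon^{-2} \log N) such that, for all u, v \in S,
\[
(1 - \varepsilon)\,\lVert u - v \rVert^2
\;\le\; \lVert f(u) - f(v) \rVert^2
\;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2 .
\]
```

When $\Phi$ has orthonormal rows, the map is usually taken as $f(x) = \sqrt{d/r}\,\Phi x$. The constant factor rescales all distances equally, so it should not change which points a distance-based clustering method groups together.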

One of the major limitations of using random projections for dimensionality reduction and consequently clustering, however, is that the random matrices generated at different runs can produce variable results. Fern et al. [4] tackled this issue by generating a similarity matrix from multiple runs of random projections and then using the similarity matrix to drive hierarchical clustering of the data. However, the computational complexity of this approach can make it intractable for use in a large-scale setting. We propose an ensemble clustering approach to address the issue of variability in the results of clustering low dimensional feature data generated by random projections. Let $\{L_1, L_2, \ldots, L_{n_c}\}$ denote the results of clustering the $r$-dimensional feature data $y(i, j)$, for all $(i, j) \in \Omega$. In other words, the pixel at location $(i, j)$ has $n_c$ labels, where $n_c$ is the number of runs for ensemble clustering. The random projections with ensemble clustering (RanPEC) algorithm for assigning labels to each pixel is given in Algorithm 1.

Algorithm 1 Random Projections with Ensemble Clustering (RanPEC)

1: Input: $X = \{x(i, j) \mid (i, j) \in \Omega\}$ (where $x \in \mathbb{R}^d$), the set of high-dimensional feature vectors for all image pixels; $r$, the dimensionality of the lower-dimensional space; $n$, the number of clusters; and $n_c$, the number of runs for ensemble clustering.

2: Initialization: Generate random matrices $\Phi_k$, $k = 1, 2, \ldots, n_c$, of the order $r \times d$ with matrix entries drawn at random from a normal distribution of zero mean and unit variance.

3: Orthogonalization: Use the Gram-Schmidt method of orthogonalization to ensure that all rows of $\Phi_k$ are orthogonal to each other and have a unit norm. In other words, ensure that $\Phi_k \Phi_k^T$ is an identity matrix, for all $k = 1, 2, \ldots, n_c$.

4: Random Projections: Project all the feature vectors into $r$-dimensional space, $Y_k = \{y_k(i, j)\}$, where $y_k(i, j) = \Phi_k x(i, j)$ and $y_k(i, j) \in \mathbb{R}^r$, for all $k = 1, 2, \ldots, n_c$ and $(i, j) \in \Omega$.

5: Ensemble Clustering: Generate clustering results $L_k = \{L_k(i, j)\}$ using a clustering method of your choice on the $r$-dimensional random projections $Y_k$, for $k = 1, 2, \ldots, n_c$ and for all $(i, j) \in \Omega$. Use majority votes in the clustering results to decide the label $L(i, j)$ for the image pixel at $(i, j)$ coordinates.

6: return $L(i, j)$ for all $(i, j) \in \Omega$.
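The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy only, substitutes QR decomposition for explicit Gram-Schmidt (both yield orthonormal rows), uses a small hand-written k-means as the "clustering method of your choice", and aligns cluster labels to the first run before voting. The alignment step is an assumption added here, because cluster indices are arbitrary across runs and the source does not say how they are matched. All function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def orthonormal_projection(r, d, rng):
    """Step 2-3: Gaussian r x d matrix with orthonormalized rows."""
    phi = rng.standard_normal((r, d))
    # QR on the transpose gives an orthonormal basis of the row space,
    # the same subspace Gram-Schmidt would produce (up to sign).
    q, _ = np.linalg.qr(phi.T)  # q has shape (d, r)
    return q.T                  # shape (r, d), q.T @ q == I


def kmeans(y, n_clusters, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one label per row of y."""
    centers = y[rng.choice(len(y), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = y[labels == c].mean(axis=0)
    return labels


def align_labels(labels, reference, n_clusters):
    """Assumed helper: relabel to maximize agreement with the reference run."""
    best = max(permutations(range(n_clusters)),
               key=lambda p: np.sum(np.asarray(p)[labels] == reference))
    return np.asarray(best)[labels]


def ranpec(x, r, n_clusters, n_runs, seed=0):
    """x: (num_pixels, d) feature matrix. Returns majority-vote labels."""
    rng = np.random.default_rng(seed)
    num_pixels, d = x.shape
    votes = np.zeros((num_pixels, n_clusters), dtype=int)
    reference = None
    for _ in range(n_runs):
        phi = orthonormal_projection(r, d, rng)
        y = x @ phi.T                               # Step 4: y_k = Phi_k x
        labels = kmeans(y, n_clusters, rng)         # Step 5: cluster
        if reference is None:
            reference = labels
        else:
            labels = align_labels(labels, reference, n_clusters)
        votes[np.arange(num_pixels), labels] += 1
    return votes.argmax(axis=1)                     # majority vote
```

A call such as `ranpec(features, r=5, n_clusters=2, n_runs=20)` would mirror the setting discussed later in the text (5 dimensions, 20 runs, tumor vs. non-tumor), where `features` stands for a hypothetical per-pixel feature matrix. The brute-force label alignment is only practical for a small number of clusters.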

3 The Segmentation Framework

The RanPEC algorithm described above operates on the set of feature vectors X. In this section, we describe how an input image is pre-processed before computation of feature vectors and application of RanPEC on the feature vectors, followed finally by a set of post-processing operations. An overview of the segmentation framework is shown in Figure 1 with the help of a block diagram. Below we provide a brief description of each of the building blocks, without going into details due to space restrictions.

3.1 Pre-processing

Stain color constancy is one of the biggest challenges of H&E staining based on light microscopy. Several factors, such as thickness of the tissue section, dye concentration, stain timings, and stain reactivity, result in variable stain color intensity and contrast. Our pre-processing pipeline consists of stain normalization, background estimation, and edge adaptive smoothing. We used Magee et al.'s approach to stain normalization [8]. The background removal was achieved by masking areas containing mostly white pixels. Finally, we converted the stain normalized and background free image into the CIE's La*b* color space and applied anisotropic diffusion to its b* channel in order to remove the inherent camera noise while preserving edges.
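As an illustration of the edge adaptive smoothing step, the sketch below implements a basic Perona-Malik style anisotropic diffusion on a single channel with NumPy. The source does not state which diffusion variant or parameters were used, so the exponential conductance function and the values of `n_iter`, `kappa` and `step` are assumptions chosen for illustration only.

```python
import numpy as np


def anisotropic_diffusion(channel, n_iter=10, kappa=15.0, step=0.2):
    """Perona-Malik style diffusion on a 2-D array (e.g. the b* channel).

    kappa controls edge sensitivity: differences much larger than kappa
    are treated as edges and diffuse little. step should stay <= 0.25
    for stability with four neighbours. All defaults are hypothetical.
    """
    img = channel.astype(float).copy()
    for _ in range(n_iter):
        padded = np.pad(img, 1, mode="edge")
        # Differences to the north, south, west and east neighbours.
        diffs = (
            padded[:-2, 1:-1] - img,
            padded[2:, 1:-1] - img,
            padded[1:-1, :-2] - img,
            padded[1:-1, 2:] - img,
        )
        # Conductance exp(-(g/kappa)^2) is near 1 in flat regions and
        # near 0 across strong edges, so edges are preserved.
        img = img + step * sum(np.exp(-(g / kappa) ** 2) * g for g in diffs)
    return img
```

In a full pipeline this would be applied after stain normalization, background masking and conversion to La*b*, which are not shown here.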

3.2 Extraction of Textural Features

We collected a library of frequency domain textural features for each pixel in the image. These consisted of Gabor energy, phase gradients, orientation pyramid, and full wavelet packet features. We used the spatial filter for two-dimensional Gabor function [3] with orientation separation of 30° (i.e., 0°, 30°, 60°, 90°, 120°, and 150°) and 14 scales, resulting in 84 Gabor channel images. Energy of each filter's response at a pixel location was then used as a feature for that filter. Phase information can be used as an important cue in modeling the textural properties of a region. In [9], Murtaza et al. used local frequency estimates in log-Gabor domain [7] over a range of scales and orientations to yield a signature which uniquely characterizes the texture of a village in satellite images. We computed phase gradient features at 3 scales and 16 orientations to compute 48 filter responses over a window of 15 x 15 pixels. Next, we used the 3rd level orientation pyramid (OP) features proposed by Wilson & Spann [1, 12], resulting in 21 features. Finally, a set of 64 3-level full wavelet packet features [11] is computed to cater for fine resolution spatial frequency contents in the two texture classes (i.e., tumor and non-tumor). These four sets of features and two proximity features were then concatenated forming a 219-dimensional feature vector per pixel.

Figure 1: Overview of the proposed tumor segmentation framework.
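The feature counts above add up as 84 + 48 + 21 + 64 + 2 = 219. The sketch below shows one way the 84-channel Gabor energy bank (6 orientations x 14 scales) could be built with NumPy. The source gives the orientations and the number of scales but not the filter parameters, so the wavelengths, kernel size, Gaussian width and the FFT-based circular convolution are all assumptions made for illustration.

```python
import numpy as np


def gabor_kernel(wavelength, theta, size=31, sigma_ratio=0.5):
    """Complex 2-D Gabor kernel: Gaussian envelope times a plane wave."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = xx * np.cos(theta) + yy * np.sin(theta)
    sigma = sigma_ratio * wavelength
    envelope = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(2j * np.pi * x_rot / wavelength)
    return envelope * carrier


def gabor_energy(image, kernel):
    """Squared magnitude of the filter response at every pixel.

    Uses circular convolution via the FFT, then rolls the result so the
    response is centered on the pixel it describes.
    """
    response = np.fft.ifft2(
        np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape))
    half = kernel.shape[0] // 2
    response = np.roll(response, (-half, -half), axis=(0, 1))
    return np.abs(response) ** 2


# Six orientations at 30 degree separation, as stated in the text.
thetas = np.deg2rad(np.arange(0, 180, 30))
# Fourteen scales; these wavelengths (in pixels) are hypothetical.
wavelengths = np.geomspace(2.0, 16.0, 14)
bank = [gabor_kernel(w, t) for w in wavelengths for t in thetas]

# Total per-pixel feature count reported in the text.
n_features = len(bank) + 48 + 21 + 64 + 2
```

Stacking `gabor_energy(image, k)` over all kernels in `bank` would give 84 energy maps per image, one per filter.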

3.3 Feature Ranking for Dimensionality Reduction

Feature Ranking (FR) is a family of techniques in which a subset of relevant features is used to build a robust learning model that aims to achieve equal, if not better, accuracy of representing high dimensional structures. By removing irrelevant and redundant features from the data, we can improve both the accuracy of learning models and performance in terms of computational resources. Peng et al. [10] proposed the minimum Redundancy Maximum Relevance (mRMR) feature selection method, which employs mutual information to rank features. We compare the performance of mRMR feature selection with the proposed random projections with ensemble clustering (as described in Section 2).

4 Experimental Results and Discussion

Our experimental dataset consisted of digitized images of breast cancer biopsy slides with paraffin embedded sections stained with Hematoxylin and Eosin (H&E) scanned at 40x using an Aperio ScanScope slide scanner. A set of fourteen images was hand segmented by an expert pathologist. We generate all experimental results taking the pathologist's markings as ground truth (GT). All the images are pre-processed in a similar manner, with stain normalization carried out as described in Section 3.1. Background removal is then performed to remove the artifacts caused by staining, fixation and tissue fat. This provides robustness in the subsequent steps of the pipeline. As described in Section 3.2, a total of 219 textural and proximity features are calculated for each pixel of the input image I.

Figure 2: Comparative results of pixel-level classification accuracy (%) versus dimensionality of the feature space for mRMR and RanPEC with n_c = 10, 20, and 100.

Multiple random projections of these textural features are used to generate multiple clustering results from the low-dimensional representation of features using the standard k-means clustering algorithm. A consensus function is then used to combine the partitions generated as a result of multiple random projections into a single partition. A simple majority function is employed on individual cluster labels to produce a single partition. Five replicates of k-means clustering are performed to get a reasonably consistent partitioning.

In order to produce the mRMR ranking [10], a portion of the GT is chosen as training images. The choice of training images is critical, as some of the images have a large stromal area and a small tumor area while others have the reverse. We ensure that the final training set has approximately similar representation for stromal and tumor areas. Features from all the test images are reordered and k-means clustering is performed. Post-processing is performed on clustering results obtained using both mRMR and RanPEC to eliminate spurious regions and also to merge closely located clusters into larger clusters, producing relatively smooth segmentation results.

Figure 2 presents a quantitative comparison of RanPEC with mRMR. It can be seen from these results that the application of RanPEC with n_c = 20 produces quite stable results for almost all values of feature space dimensionality r. Furthermore, the RanPEC results at r = 5 generate nearly 10% higher overall pixelwise classification accuracy than mRMR at r = 5. Visual results for RanPEC (at r = 5) and mRMR (at r = 65) as well as the GT are also shown in Figure 3. These results further show that for relatively small dimensionality of the feature space, RanPEC generates superior results to mRMR.

Figure 3: Visual results of tumor segmentation in a sample image: (a) Original image with ground truth (GT) marked non-tumor areas shown in a slightly darker contrast with blue boundaries; (b) Results of segmentation with 65-dimensional feature space using mRMR (83% accuracy) and (c) using RanPEC with 5-dimensional feature space and n_c = 20 (90% accuracy).

5 Conclusions

In this paper, we addressed the issue of robustness of clustering results in the context of random projections. We proposed a framework for random projections with ensemble clustering and applied it to the segmentation of tumor areas in breast cancer histology images. We showed that the proposed framework RanPEC preserves the Euclidean distance between points in high-dimensional spaces in a robust manner. For our application of tumor segmentation, reasonably high accuracy was achieved using only a 5-dimensional feature space.

6 Acknowledgments

The authors would like to thank Warwick Postgraduate Research Scholarship (WPRS) and Department of Computer Science, University of Warwick for funding the research work of AMK. The images used in this paper are part of the MITOS dataset, a dataset set up for the ANR French project MICO.

References

[1] C.C.R. Aldasoro and A. Bhalerao. Volumetric texture segmentation by discriminant feature selection and multiresolution classification. Medical Imaging, IEEE Transactions on, 26(1): 1-14, 2007.

[2] A.N. Basavanhally, S. Ganesan, S. Agner, J.P. Monaco, M.D. Feldman, J.E. Tomaszewski, G. Bhanot, and A. Madabhushi. Computerized image-based detection and grading of lymphocytic infiltration in HER2+ breast cancer histopathology. Biomedical Engineering, IEEE Transactions on, 57(3):642-653, 2010.

[3] J.G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Optical Society of America, Journal, A: Optics and Image Science, 2:1160-1169, 1985.

[4] X.Z. Fern and C.E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In International Conference on Machine Learning, volume 20, page 186, 2003.

[5] M.N. Gurcan, L.E. Boucheron, A. Can, A. Madabhushi, N.M. Rajpoot, and B. Yener. Histopathological image analysis: A review. Biomedical Engineering, IEEE Reviews in, 2:147-171, 2009.

[6] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[7] H. Knutsson. Filtering and reconstruction in image processing. Linkoping University Electronic Press, 1982.

[8] D. Magee, D. Treanor, P. Chomphuwiset, and P. Quirke. Context aware colour classification in digital microscopy. In Proc. Medical Image Understanding and Analysis, pages 1-5. Citeseer, 2010.

[9] K. Murtaza, S. Khan, and N. Rajpoot. Villagefinder: Segmentation of nucleated villages in satellite imagery. British Machine Vision Conference, 2009.

[10] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8): 1226-1238, 2005.

[11] N.M. Rajpoot. Texture classification using discriminant wavelet packet subbands. In Circuits and Systems, 2002. MWSCAS-2002. The 2002 45th Midwest Symposium on, volume 3, pages III-300. IEEE, 2002.

[12] R. Wilson and M. Spann. Image segmentation and uncertainty. John Wiley & Sons, Inc., 1988.

A Gamma-Gaussian Mixture Model for Detection of Mitotic Cells in Breast Cancer Histopathology Images

Adnan M. Khan*, Nasir M. Rajpoot*

* Department of Computer Science, University of Warwick, UK. Corresponding Authors: {amkhan, nasir}@dcs.warwick.ac.uk

Abstract

In this paper, we propose a statistical approach for mitosis detection in breast cancer histological images. The proposed algorithm models the pixel intensities in mitotic and non-mitotic regions by a Gamma-Gaussian mixture model and employs a context-aware post-processing in order to reduce false positives. Experimental results demonstrate the ability of this simple, yet effective method to detect mitotic cells in standard H&E stained breast cancer histology images.

1 Introduction

Detection of Mitotic Cells (MCs) in breast histopathology images is one of three components (the other two being tubule formation and nuclear pleomorphism) required for developing computer assisted grading of breast cancer tissue slides [3]. This is very challenging since the biological variability of the MCs makes their detection extremely difficult (see Figure 1). Additionally, if standard H&E staining is used (which stains chromatin rich structures, such as nuclei, apoptotic cells and MCs, dark blue), it becomes extremely difficult to detect the latter given the fact that the former two are densely localized in the tissue sections. As a consequence, two categories of relevant works have been reported in the literature. One category uses an additional stain (e.g. PHH3) to stain MCs exclusively, and detects the exclusively stained MCs in the images [7]. The other uses a video sequence to detect mitotic events over time by incorporating spatial and temporal information [4]. Since the exclusive stain incurs additional cost and videos are not used at all in standard histopathological practice, a gap exists in the literature.

In this paper, a robust MC detection technique is developed and tested on 28 breast histopathology images, belonging to 5 different tissue slides. To the best of our knowledge, there is no existing method in the literature for detection of MCs in standard H&E stained breast histology images. The proposed method mimics a pathologist's approach to MC detection under a microscope. The main idea is to isolate the tumor region from non-tumor areas (lymphoid/inflammatory/apoptotic cells), and search for MCs in the reduced space by statistically modeling the pixel intensities from mitotic and non-mitotic regions. In order to further enhance the positive predictive value (PPV), Context Aware Post-Processing (CAPP) has been introduced. The experimental results show that the proposed system achieves a high sensitivity of 0.82 with PPV of 0.29. Employing CAPP on these results produces a 100% increase in PPV at the cost of less than 15% decrease in sensitivity.

Figure 1. How hard is it to identify MCs in breast histology images? First 3 images (from left) are MCs and last 2 images are non-mitotic images.

2 The Proposed Algorithm

2.1. Stain Normalization

Tissue staining is commonly used to highlight distinct structures in histology images. Among many different stains, Hematoxylin & Eosin (H&E) is one of the most commonly used. It selectively stains nuclei structures blue and cytoplasm pink. Although staining enables better visualization of tissue structures, stained images vary a lot in terms of color and intensity due to non-standardization in the histopathological workflow. Stain normalization is used to achieve a consistent color and intensity appearance. Among several approaches reported in the literature, we used [5] to normalize the color and intensity of breast histology images.

2.2. Tumor Segmentation

Breast cancer histology images can be divided into two regions: tumor and non-tumor. Mitosis events are more likely to exist in tumor regions. Therefore, an intelligent mitosis detection system must first remove non-tumor areas from the tissue slide in order to minimize the search space for MCs. We have developed a feature based texture segmentation framework (RanPEC: Random Projections with Ensemble Clustering [1]) to segment tumor regions. Broadly, the algorithm follows this pipeline: (1) a library of texture features is computed over a range of scales and orientations, (2) low dimensional embedding (using Random Projections) is performed to avoid overfitting and the curse of dimensionality, and finally (3) tumor segmentation is performed in the low dimensional space. This produces an accurate and totally unsupervised tumor segmentation. In order to account for MCs present on the boundary of tumor and non-tumor regions, morphological dilation is performed on the tumor segmentation results. Although this increases the chances of detecting boundary MCs, it also includes some lymphoid/inflammatory cells in the tumor regions, which appear as false positives when detecting MCs in breast histology slides.

Figure 2. Marginal distributions (solid lines) and fitted models (dotted lines) by the two-component Gamma-Gaussian Mixture Model.

2.3. Statistical Modeling of Mitotic Cells

MCs appear as relatively dark, jagged and irregularly textured structures (see Figure 1). Due to sectioning artifacts, some appear too dim to notice with the naked eye. In terms of shape, color and texture characteristics, lymphoid/inflammatory cells and apoptotic cells that are densely present in tissue slides possess almost similar characteristics, and thus could easily be confused with MCs.

In this paper, we propose a Gamma-Gaussian Mixture Model (GGMM) for detecting MCs in breast histology images. Image intensities (L channel of La*b* color space) are modeled as random variables sampled from one of two distributions: Gamma and Gaussian. Intensities from MCs are modeled by a Gamma distribution and those from non-mitotic regions are modeled by a Gaussian distribution. The choice of Gamma and Gaussian distributions is mainly due to the observation that the characteristics of each distribution match well with the data it models (see Figure 2).

2.3.1 Gamma-Gaussian Mixture Model

Figure 2 shows two marginal distributions (solid lines) and their fitted models (dotted lines). The left and right marginal distributions show the probability distributions of pixels belonging to mitotic and non-mitotic regions respectively. A close fit to the marginal distributions was achieved by the GGMM. The GGMM is a parametric technique for estimating a probability density function. In our context, it can be formulated as follows.

For pixel intensities $x$, the proposed mixture model is given by:

$$f(x; \theta) = \rho_1 \Gamma(x; \alpha, \beta) + \rho_2 G(x; \mu, \sigma) \qquad (1)$$

where $\rho_1$ and $\rho_2$ represent the mixing proportions (priors) of intensities belonging to mitotic and non-mitotic regions, and $\rho_1 + \rho_2 = 1$. $\Gamma(x; \alpha, \beta)$ represents the Gamma density function parameterized by $\alpha$ (the shape parameter) and $\beta$ (the scale parameter). $G(x; \mu, \sigma)$ represents the Gaussian density function parameterized by $\mu$ (mean) and $\sigma$ (standard deviation). $\theta = [\alpha, \beta, \mu, \sigma, \rho_1, \rho_2]$ represents the vector of all unknown parameters in the model.

2.3.2 Parameter Estimation

In order to estimate the unknown parameters ($\theta$), we employ maximum likelihood estimation (MLE). Given image intensities $x_i$, $i = 1, 2, \ldots, n$, where $n$ is the number of pixels, the log-likelihood function ($\ell$) of the parameter vector $\theta$ is given by

$$\ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta) \qquad (2)$$

where $f(x_i; \theta)$ is the mixture density function in equation (1). The MLE of $\theta$ can be represented by

$$\hat{\theta} = \arg\max_{\theta} \ell(\theta) \qquad (3)$$

A convenient approach to obtain a numerical solution to the above maximization problem is provided by the Expectation Maximization (EM) algorithm [2]. In our context, the EM algorithm can be set up as follows.

Let $z_{ik}$, $k = 1, 2$, be indicator variables showing the component membership of each pixel $x_i$ in the mixture model (1). Note that these indicator variables are hidden (unobserved). The log-likelihood (2) can be extended as follows:

$$\ell_c(\theta) = \sum_{i=1}^{n} z_{i1} \log\left(\rho_1 \Gamma(x_i; \alpha, \beta)\right) + \sum_{i=1}^{n} z_{i2} \log\left(\rho_2 G(x_i; \mu, \sigma)\right) \qquad (4)$$

The EM algorithm finds $\hat{\theta}$ iteratively, as outlined in Algorithm 1. Let $\theta^{(m)}$ be the estimate of $\theta$ after $m$ iterations of the algorithm. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying Expectation and Maximization steps.

[Algorithm 1: EM algorithm for estimating $\theta$]

2.4. Classification

The posterior probabilities of a pixel $x_i$ belonging to class 1 (Mitotic) or 2 (Non-Mitotic) are calculated as follows,

$$P(z_{i1} = 1 \mid x_i) = \frac{\rho_1 \Gamma(x_i; \alpha, \beta)}{f(x_i; \theta)}, \qquad P(z_{i2} = 1 \mid x_i) = \frac{\rho_2 G(x_i; \mu, \sigma)}{f(x_i; \theta)} \qquad (9)$$
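The mixture density in equation (1) and the posterior used for classification can be sketched in plain Python as below. This covers only evaluation of the model for given parameters, not the EM fitting, whose steps are not reproduced in the text. The posterior is computed by Bayes' rule from the two weighted component densities. The function names and the example parameter values are hypothetical and are not fitted values from the paper.

```python
import math


def gamma_pdf(x, shape, scale):
    """Gamma density with shape alpha and scale beta; zero for x <= 0."""
    if x <= 0:
        return 0.0
    log_p = ((shape - 1) * math.log(x) - x / scale
             - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_p)


def gaussian_pdf(x, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior_mitotic(x, rho1, shape, scale, mean, std):
    """Posterior probability that intensity x belongs to the mitotic class.

    rho1 is the mitotic mixing proportion; the non-mitotic proportion is
    1 - rho1, matching the constraint rho1 + rho2 = 1.
    """
    p_mitotic = rho1 * gamma_pdf(x, shape, scale)
    p_non_mitotic = (1 - rho1) * gaussian_pdf(x, mean, std)
    total = p_mitotic + p_non_mitotic  # mixture density f(x; theta)
    return p_mitotic / total if total > 0 else 0.0


# Hypothetical parameters: dark mitotic pixels (Gamma, mean 20) against a
# brighter non-mitotic background (Gaussian, mean 60).
params = dict(rho1=0.05, shape=4.0, scale=5.0, mean=60.0, std=10.0)
dark_pixel_posterior = posterior_mitotic(15.0, **params)
bright_pixel_posterior = posterior_mitotic(65.0, **params)
```

With these illustrative values, a dark pixel is assigned to the mitotic class with posterior above 0.5 and a bright pixel is not. Thresholding the posterior at 0.5 is one simple decision rule and is an assumption here.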

Figure 3. Four examples of 50 x 50 context patches,

cropped around the bounding box of candidate MCs

(detected using the proposed algorithm). First 2 (from left) are mitotic, last 2 are false positives.

Figure 5. Visual results of MC detection in a sample image: (a) Original image with ground truth marked MCs shown in yellow color; (b) Results of Tumor segmentation (as outlined in Section 2.2) where non-tumor ai'eas are shown in a slightly darker contrast with blue boundaries; (c) Results of MC detection (in yellow color) without CAPP (Sensitivity= 0.87, PPV= 0.54) and (d Results of MC detection (in yellow color) with CAPP (Sensitivity= 0.87, PPV= 0.87). sensitivity and PPV over a set of 28 breast histology images selected from 5 different tissue slides and showed that a reasonably high value of sensitivity can be retained while increasing the PPV. Our future work will aim at increasing the PPV further by modeling the spatial appearance of regions surrounding mitotic events.

5 Acknowledgments

The authors would like to thank the organizers of the ICPR 2012 contest for mitosis detection in breast cancer. The images used in this paper are part of the MITOS dataset, a dataset set up for the ANR French project MICO.

Figure 4. Plot of sensitivity vs. PPV when area-threshold is varied on the candidate MCs. High sensitivity and low PPV are obtained when small values of area-threshold are used. Table 1 shows how the introduction of CAPP improves PPV without significantly degrading sensitivity.

Table 1. Quantitative comparison of sensitivity and PPV with and without using CAPP for a fixed value of area threshold = 120 over 28 breast histology images containing more than 200 mitotic cells. By employing CAPP, PPV is doubled on unseen data, without drastically reducing the sensitivity (i.e. by less than 15%).

References

[1] Authors. RanPEC: Random projections with ensemble clustering for segmentation of tumor regions in breast histology images. Submitted to MICCAI, 2012.

[2] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977.

[3] C. Elston and I. Ellis. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology, 19(5):403-410, 1991.

[4] S. Huh, D. Ker, R. Bise, M. Chen, and T. Kanade. Automated mitosis detection of stem cell populations in phase-contrast microscopy images. Medical Imaging, IEEE Transactions on, 30(3):586-596, 2011.

[5] D. Magee, D. Treanor, P. Chomphuwiset, and P. Quirke. Context aware colour classification in digital microscopy. In Proc. Medical Image Understanding and Analysis, pages 1-5, 2010.

[6] K. Murtaza, S. Khan, and N. Rajpoot. Villagefinder: Segmentation of nucleated villages in satellite imagery. British Machine Vision Conference, 2009.

[7] V. Roullier, O. Lezoray, V. Ta, and A. Elmoataz. Multi-resolution graph-based analysis of histopathological whole slide images: Application to mitotic cell extraction and visualization. Computerized Medical Imaging and Graphics, 35(7-8):603-615, 2011.

Claims

1. A method for identifying pixels associated with candidate mitotic cells in a digital histopathological image, comprising the steps of:

2. A method according to claim 1, wherein the subset of pixels is identified by fitting a mixture model to a distribution of pixel intensities in the set of pixel intensities, and deriving a value of a quantity for each pixel, the quantity being indicative of how likely it is that the pixel is associated with a candidate mitotic cell, wherein the mixture model comprises a weighted sum of a first distribution of pixel intensities associated with candidate mitotic cells and a second distribution of pixel intensities associated with non-mitotic cells.

3. A method according to claim 2, wherein the first and second distributions are parametric and the fitting includes estimating parameters of the distributions.

4. A method according to claim 2 or 3, wherein the first distribution is a Gamma distribution and the second distribution is a Gaussian distribution.

5. A method according to any of claims 2 to 4, wherein the quantity is proportional to the posterior probability under the mixture model of the pixel being associated with a mitotic cell.

6. A method according to any of claims 2 to 5, wherein the subset of pixels associated with mitotic cells are identified by applying Otsu's method to values of the quantity.

7. A method according to any of claims 2 to 5, wherein the subset of pixels associated with mitotic cells are identified by applying the Level Set method.

8. A method according to any preceding claim, wherein the set of image pixels associated with the tumor region is identified by an image segmentation algorithm.

9. A method according to any preceding claim, wherein the set of image pixels associated with the tumor region is identified by displaying the histopathological image to a user and receiving input from the user.

10. A method according to any preceding claim, wherein the histopathological image is an image of a histopathological slide of an epithelial tumor (for example breast carcinoma), a stromal tumor, a tumor of the soft tissue (for example, sarcomas), a hematolymphoid tumor, a neuroectodermal tumor, or a neuroendocrine tumor.

11. A method according to claim 10, wherein the histopathological image is a breast histopathology image.

12. A method according to any preceding claim, further comprising the step of identifying a set of candidate mitotic cells based on the subset of pixels.

13. A method according to claim 12, further comprising the step of classifying candidate mitotic cells as true or false positives and removing the candidate mitotic cells classified as false positives from the set of candidate mitotic cells.

14. A method according to claim 13, wherein the step of classifying candidate mitotic cells as true or false positives comprises:

(a) defining an image patch around each of the candidate mitotic cells in the set;

(b) calculating one or more indicator values for the image patch; and

(c) classifying the candidate cells as true or false positives based on the indicator values.

15. A method according to claim 14, wherein the indicator values are based on texture features derived for each pixel in the patch.

16. A method according to claim 15, wherein the indicator values are one or more of the 1st to the nth moments of the texture features over the pixels in the patch.

17. A method according to claim 16, wherein the indicator values are the 1st to the nth moments of the texture features over the pixels in the patch.

18. A method according to claim 16 or 17, wherein n is 4.

19. A method according to any of claims 14 to 18, wherein the indicator values include at least one or more values representative of the shape of the subset of pixels associated with the candidate mitotic cell in the image patch.

20. A method according to claim 19, wherein the at least one or more values include a measure indicative of an elongation of a boundary of the subset of pixels associated with the candidate mitotic cell.

21. A method according to claim 19 or 20, wherein the at least one or more values include a measure indicative of a roughness of a boundary of the subset of pixels associated with the candidate mitotic cell.

22. A method according to claim 19, 20 or 21, wherein the at least one or more values include a measure indicative of a convexity of a boundary of the subset of pixels associated with the candidate mitotic cell.

23. A system arranged to implement a method as claimed in any one of claims 1 to 22.

24. A computer program product comprising coded instructions which, when run on a computer, implement a method as claimed in any one of claims 1 to 22.

25. A computer readable medium or signal embodying a computer program product as claimed in claim 24.
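By way of illustration only, the thresholding step referred to in claim 6 could be sketched in Python as follows. This is a minimal sketch of Otsu's method applied to per-pixel values in [0, 1] (for example posterior probabilities), not the claimed implementation: the function name, the bin count and the sample values are hypothetical, and a flat list of values stands in for an image.

```python
def otsu_threshold(values, bins=256):
    """Return a threshold in (0, 1] that maximizes between-class variance.

    `values` are assumed to lie in [0, 1].
    """
    # Histogram of the values over `bins` equal-width bins.
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for i in range(bins):
        w0 += hist[i]            # number of values in bins 0..i
        sum0 += i * hist[i]      # sum of bin indices in bins 0..i
        w1 = total - w0          # number of values in bins i+1..
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2   # proportional to between-class variance
        if between > best_var:
            best_bin, best_var = i, between
    # Values at or above the upper edge of the best bin form the upper class.
    return (best_bin + 1) / bins


# Hypothetical per-pixel values: five low and five high.
sample = [0.1] * 5 + [0.9] * 5
t = otsu_threshold(sample)
candidate_mask = [v >= t for v in sample]
```

Pixels whose value is at or above the returned threshold would form the subset associated with candidate mitotic cells in this sketch.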