WO2011124758A1

WO2011124758A1 - A method, an arrangement and a computer program product for analysing a cancer tissue

Info

Publication number: WO2011124758A1
Application number: PCT/FI2011/050291
Authority: WO
Inventors: Timo Ahopelto; Tommi Pisto; Sami Kilpinen; Kalle Ojala
Original assignee: Medisapiens Oy
Priority date: 2010-04-06
Filing date: 2011-04-05
Publication date: 2011-10-13
Also published as: FI20105347A0; FI20105985A0

Abstract

The invention concerns a computer executable method for characterizing, utilizing a reference database, the strength of a cancer feature of a query sample tissue based on the gene expression data of the tissue. The method is characterized in that it comprises the steps of reading data of the query sample tissue into the computer memory, selecting a cancer feature, reading into the computer memory information defining a list of genes comprising at least one gene that is indicative of the cancer feature within a selected cancer type, comparing at least one gene expression value of the query sample to the corresponding gene expression value in the reference database, and returning a value or description defining the position of the query sample among the reference values. Also a computer arrangement and a computer executable program product are disclosed.

Description

A METHOD, AN ARRANGEMENT AND A COMPUTER PROGRAM PRODUCT FOR ANALYSING A CANCER TISSUE

AREA OF INVENTION

The invention relates to the area of bioinformatics. More specifically, the invention relates to analysis of genetic data for e.g. cancer diagnostics purposes.

BACKGROUND OF THE INVENTION

A large number of methods have been developed for the analysis of microarray gene expression data. This reflects the tremendous complexity of the problem of transforming information on expression levels of over 20,000 genes into meaningful biological and especially clinical insights relevant for patient care.

Recently, there have been major efforts to develop large-scale databases from publicly available microarray datasets (e.g. GeneSapiens, Oncomine, connectivity map, gene expression omnibus, Array-express) in order to analyze and mine the enormous quantities of microarray data that have been published by the biomedical community. Indeed, analyses of such metadata are increasingly recognized as a powerful means to study gene networks and gene regulation, and to identify tissue- or disease-specific gene expression patterns. Availability of these microarray databases would also provide an opportunity to use a comprehensive collection of reference samples as a means of guiding the interpretation of new microarray data produced by investigators from test samples.

Today, the amount of genetic information, including both DNA sequence data and functional gene expression data, is increasing rapidly. This is especially the case in oncology: cancer is a genetic disease on a cellular level, and should be treated and diagnosed as such.

A large number of publications exist featuring various methods for classifying gene expression profiles into a priori defined classes. For the sake of clarity, these methods are usually divided into two classes: unsupervised and supervised methods. The former are more commonly known as clustering, whereas the latter are more commonly known as classifiers. The fundamental difference between them is that in unsupervised methods the data is simply organized based on its features, simple sorting of numbers being perhaps the simplest unsupervised approach and hierarchical or k-means clustering being the most commonly applied ones. Stratifying cancer diagnostics tests today (e.g. OncotypeDX, MammaPrint, TargetNow) are based on unsupervised methods where a group of pre-defined gene expression values, among other possible sample analysis techniques, is used to diagnose cancer, typically by using a dedicated chip manufactured for that purpose only to measure 20-100 pre-set genes. In supervised methods, a machine learning method is used in which the computer is taught to recognize certain features of the training data, after which it is able to classify novel data based on these features.

One problem with the methods known in the art is that these tests describe only a single dimension of a patient's cancer, while cancer is a very multi-dimensional disease to diagnose and treat. For example, the breast cancer test OncotypeDX is a dedicated test chip measuring 21 genes. The test provides a prediction of chemotherapy benefit and 10-year distant recurrence to inform adjuvant treatment decisions in certain women with early-stage breast cancer for certain specific treatments.

In order to better understand the significance of an expression profile, a biologically and clinically meaningful comparison to known gene expression profiles should be made possible. There are known methods of comparing gene expression samples to each other, but usually they fail on one or all of the following: i) ability to compare (e.g. position) a single sample against multiple samples (one versus one, or many versus many, are more feasible), ii) ability to extract biologically and clinically sensible information as to which features (=genes) are responsible for the found similarity and iii) ability to characterize e.g. the strengths of multiple different biological or clinical features of the sample.

Cancer is a very personalized disease on a genetic level. Every cancer is different with enormous number of potential gene mutations and gene expression anomalies - and their combinations across all the approximately 23,000 human genes. It has been shown, e.g. by tumour sequencing projects, that one tumour may have numerous different mutations, and that the same cancer type (like breast cancer, prostate cancer) may have significantly different genetic profiles between individuals.

Currently, cancer diagnostics is done by pathologists performing visual inspection of the histology of the biopsy. Even though this is an indispensable part of the diagnostic procedure it is subject to errors and in some cases visual features cannot reveal the exact nature of the cancer. More advanced methods are based on measuring predetermined genes that are identified from prior research, and prescribing medication to diagnoses derived from those specific genes.

Against the complexity of cancer as a disease, one core problem of the current diagnosis methods is that they do not describe the level of significance of each identified genetic phenomenon or anomaly to the patient's cancer. It may not be sufficient to have a "Yes/No" answer to a single question applicable only to a narrowly defined patient population and typically a single drug at a time; instead, it may be necessary to understand more comprehensively how strong the different individual phenomena are for the cancer patient in question, e.g. to decide among all the available therapies the one that best matches the patient's cancer profile.

PCT application WO2008045389 teaches an improved computerized decision support system and apparatus incorporating bioinformatics software for selecting the optimum treatment for a cancerous condition in a human patient. The system comprises a PCR kit or a gene chip, an integrated detector, a detector for accepting receipt of the gene chip toward analyzing the patient's genotype, a database describing the correlation of patient genotypes and the efficacy and toxicity of various anti-cancer drugs used in treating patients with a particular cancerous condition and a computerized decision support system.

PCT application WO2009131710 teaches a method for identifying genomic signatures linked to survival specific for a disease. The method comprises performing data analysis comprising bioinformatics and computational methodology to identify copy number abnormalities and altered expression of disease candidate genes. PCT application WO2007137187 teaches a method involving performing a test for a gene and a test for a gene expressed protein from a biological sample of a diseased individual. A determination is made to detect which genes and/or gene expressed proteins exhibit a change in expression compared to a reference. A drug therapy used to interact with the genes and/or gene expressed proteins that exhibited a change in expression that is not single disease restricted, is identified from an automated review of an extensive literature database and data generated from clinical trials.

PCT application WO2009132928 teaches a method for predicting an outcome of a patient suffering from or at risk of developing a neoplastic disease. The method comprises the steps of quantifiably determining the gene expression levels of genes, thus obtaining a pattern of expression levels of the genes, comparing the pattern of expression levels with known, pre-defined reference patterns of expression levels indicative of the outcomes and predicting an outcome of a patient from the comparison using a mathematical function to determine the similarity of the pattern of expression levels with the first reference pattern and the second reference pattern. The method depends on disease candidate genes as the starting point of forming the prediction.

PCT Application WO2009125065 teaches a computer-implemented method for correcting data sets from measurements of properties of biological samples. The method comprises the steps of determining first and second property-specific distribution parameters for each property, determining a property-specific correction element for each version of the parallel measurement device based on the discrepancy between the property-specific distribution parameters, correcting the property value and outputting the property's corrected property value to a physical memory and/or display.

None of the methods known in the art teaches a way to provide a multi-dimensional analysis and characterization of a sample tissue for the purpose of predicting various aspects, e.g. the strength, of the biological and/or clinical behaviour of the sample tissue.

OBJECTS OF THE INVENTION

Generally, an object of the present invention is to address at least some shortcomings of the prior art mentioned herein. An object of the present invention is a computer executable method and/or arrangement for assessing the strength of biological and/or clinical feature or features of a sample tissue for e.g. diagnostic and prognostic purposes.

Another object of the present invention may be a method and/or an arrangement that removes the need for cancer-specific diagnostic/prognostic and other analysis assays (test chips).

BRIEF DESCRIPTION OF THE INVENTION

The invention relates to an analysis method of comparing single sample(s) against reference database of samples in order to understand and interpret the biological or medical or clinical information, e.g. gene expression profiles, of the single sample for biological or medical research, diagnosis and therapy.

An embodiment of the invention discovers and displays patient's relative position within a reference group based on the recognized features of cancer. The features of cancer may include the recognized hallmarks of cancer including e.g. self-sufficiency in growth signals, insensitivity to antigrowth signals, tissue invasion and metastasis, limitless potential for replication, sustained angiogenesis and evading apoptosis. A feature of a cancer may be described by an individual gene or by multiple genes together.

With this comparison, the invention tells the strength of each cancer feature, which is valuable in making treatment decisions as cancer profile is visible more comprehensively than with the current, fixed-use cancer assays.

The feature of a cancer may be e.g. a biological feature (e.g. growth signalling, angiogenesis) or a clinical feature (e.g. survival, resistance to selected drugs).

When the reference group is large enough, e.g. at least 15 or 1000 samples for each cancer type or feature, an embodiment of the invention may calculate a distribution of the normalized expression levels of the gene(s) central to (or indicative of) the cancer feature and, based on the position of the cancer feature score of a patient in the distribution, produce a conclusion on strength of the certain cancer feature for the individual patient. Based on this information, for example a treatment may be recommended accordingly. For example, a breast cancer patient may be assessed to belong to the highest percentile in angiogenesis, and it is known that certain chemotherapy has beneficial effect on highly angiogenetic patients whether it has been originally developed to treat breast cancer or not. In comparison to the invention, current diagnostic/prognostic tests assess only one cancer type for specific treatment or treatments only, leaving totally out an opportunity to find the best-matching treatment for the patient on molecular level based on cancer features.

The present invention discloses a method for aligning and quantitatively comparing new microarray data (test sample) against reference gene expression profiles from a large collection of e.g. healthy and pathological in vivo and/or in vitro samples. In an embodiment, the method compares expression profiles of the test sample with those in the reference data and returns the position of the sample among the plurality of reference samples representing a certain cancer feature.

An aspect of the present invention is a computer executable method for characterizing, utilizing a reference database, the strength of a cancer feature of a query sample expression profile based on the gene expression data of the tissue. The method may comprise any one, any combination or all of the steps of: reading data of the query sample expression profile into the computer memory, comparing the expression value of at least one gene from the query sample to the corresponding gene expression values from the reference database consisting of a representative number of samples of a tissue category, returning a value or description defining the position of the query sample genes among the reference values, selecting a cancer feature relevant to the tissue category, reading into the computer memory information defining a list of genes comprising at least one gene that is indicative of, and their expression relation to, the cancer feature within the selected tissue category, and returning a summarization value or description for the selected cancer feature according to the expression relations of the genes to the cancer feature.

In an embodiment, the tissue category may be formed from any meaningful collection of samples, such as samples of a specific cancer type. In an embodiment, the method may further comprise a step of forming a database comprising genes and their expression relations to cancer features in specific tissue categories.

The step of forming a database of genes and their expression relations to cancer features may be performed using statistical analysis of the data of the reference database. In an embodiment, the step of reading the individual query sample data into the computer memory may comprise normalizing the query sample data to be comparable with the data of the reference database comprising a representative number of expression profiles of reference patients and/or healthy control samples. The step of comparing the query sample to the reference database may comprise the step of forming an estimation of a reference data distribution from the reference data population, e.g. by using the Gaussian window method known to a person skilled in the art. The step of comparing the query sample to the reference database may comprise calculation, estimation or description of the amount of reference data population or distribution having lower value or calculation, estimation or description of the amount of reference data population or distribution having higher value for the selected gene. The step of comparing the query sample to the reference database may comprise calculation, estimation or description of the proportion of the reference data population or distribution having similar value as the query sample, e.g. by calculating the proportion of reference data within a certain reference data derived distance, such as standard deviation or its multiple, of the value of the query sample.
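The comparison described above can be illustrated with a minimal sketch. It assumes that the reference values of one gene in one tissue category are available as a plain list of normalized numbers, and it uses the empirical population directly rather than an estimated distribution. The function name `position_among_reference` and the parameter `k` (the multiple of the reference standard deviation used to define "similar") are hypothetical and chosen only for illustration.

```python
from statistics import pstdev


def position_among_reference(query, reference, k=1.0):
    """Describe where `query` falls among the reference values of one gene.

    Returns the proportions of reference values that are lower than,
    higher than, and similar to the query value.
    """
    n = len(reference)
    if n == 0:
        raise ValueError("reference must not be empty")
    lower = sum(1 for v in reference if v < query) / n
    higher = sum(1 for v in reference if v > query) / n
    # "Similar" here means within k reference standard deviations of the
    # query value; both k and the use of the population standard deviation
    # are assumptions of this sketch.
    radius = k * pstdev(reference)
    similar = sum(1 for v in reference if abs(v - query) <= radius) / n
    return {"lower": lower, "higher": higher, "similar": similar}


# Hypothetical normalized expression values of one gene in one tissue category.
reference = [2.0, 4.0, 6.0, 8.0]
result = position_among_reference(7.0, reference)
print(result)
```

In this toy example three of the four reference values lie below the query value, so the returned "lower" proportion is 0.75.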

In an embodiment, the step of returning a value or description of the position of the query sample among the reference population or distribution may comprise a numerical value between a pre-set lower limit and upper limit, the value defining the amount of reference data population or distribution having a lower value than the query sample or defining the amount of reference data population or distribution having a higher value than the query sample.

The step of returning a value or description of the position of the query sample among the reference population or distribution may comprise a mathematical operation modifying the returned value according to the information on whether the higher or lower end of the reference data population or distribution is associated with a characterizable biological or medical process, stage, outcome, situation, diagnosis or prognosis. The mathematical operation may comprise e.g. reversal of the expression axis, if the lower end of the reference data population is indicative of the biological or medical process, stage, outcome, situation, diagnosis or prognosis.
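The scaling to pre-set limits and the reversal of the expression axis may be sketched as follows. The function name `directed_position`, its parameters and the default limits 0 and 100 are hypothetical. The sketch also assumes that reversal can be expressed as one minus the proportion of lower reference values, which equals the proportion of higher values only when no reference value ties with the query value.

```python
def directed_position(fraction_lower, high_end_indicates_feature=True,
                      lower_limit=0.0, upper_limit=100.0):
    """Scale a proportion of lower reference values to pre-set limits.

    If the low end of the reference distribution is the one associated with
    the feature, the axis is reversed before scaling.
    """
    if not 0.0 <= fraction_lower <= 1.0:
        raise ValueError("fraction_lower must be between 0 and 1")
    if high_end_indicates_feature:
        fraction = fraction_lower
    else:
        # Reversal of the expression axis (assumes no ties, see lead-in).
        fraction = 1.0 - fraction_lower
    return lower_limit + fraction * (upper_limit - lower_limit)


print(directed_position(0.75))         # high expression indicates the feature
print(directed_position(0.75, False))  # low expression indicates the feature
```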

The step of reading into the computer memory information defining a list of genes comprising at least one gene may comprise the step of adding or taking out at least one gene from the list of genes known to be indicative of the cancer feature.

The step of summarizing the position value or description of at least one gene of the selected cancer feature to a cancer feature score may comprise utilization of mathematical methods to provide statistical summary and reliability value of the position values of the query sample in terms of at least one gene associated to the selected feature. The mathematical method may comprise e.g. Tukey's bi-square weight method or any other suitable method known to a person skilled in the art.
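As an illustration of the summarization, the following sketch computes a one-step Tukey biweight location estimate of the per-gene position values. This is one common formulation of the biweight; the text does not specify which variant is intended, and the tuning constant `c = 5` and the small `epsilon` guarding against division by zero are assumptions of this sketch. Only the summary value is computed here; a reliability value is not sketched.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight estimate of the centre of `values`.

    Values far from the median (relative to the median absolute deviation)
    receive a weight of zero, so single outlying genes do not dominate.
    """
    if not values:
        raise ValueError("values must not be empty")
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    weights = []
    for v in values:
        u = (v - centre) / (c * mad + epsilon)
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical position values of five genes of one cancer feature;
# the last gene is an outlier.
positions = [0.70, 0.72, 0.75, 0.78, 0.05]
summary = tukey_biweight(positions)
print(summary)
```

In this toy example the outlying value 0.05 receives zero weight, so the summary stays within the range of the other four values, whereas the plain mean would be pulled down to 0.6.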

In an embodiment, the step of comparing at least one gene expression value of the query sample to the gene expression value(s) in the reference database may comprise summarization of expression values of multiple genes from the query sample and from the reference data by using a mathematical method providing statistical summary and reliability values of the expression values, and comparing the resulting summary value of the query sample to the population or distribution of summary values from the reference data. In an embodiment, the summarization of expression values may use Tukey's bi-square weight method or any other suitable method known to a person skilled in the art. In another embodiment, e.g. a Hodges-Lehmann estimate may be calculated to characterize the strength of the expression signature indicative of the biological or medical process, stage, outcome, situation, diagnosis or prognosis. In an embodiment, the addition or taking out of a gene from the list of genes may be performed based on the results of a statistical analysis performed on the data of the reference database.
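The Hodges-Lehmann estimate mentioned above may be sketched, for a single set of values, as the median of all pairwise averages (Walsh averages). Whether each value is also paired with itself differs between formulations; this sketch includes those pairs, which is an assumption, as is the function name `hodges_lehmann`.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """One-sample Hodges-Lehmann estimate: median of all Walsh averages."""
    if not values:
        raise ValueError("values must not be empty")
    walsh_averages = [(a + b) / 2
                      for a, b in combinations_with_replacement(values, 2)]
    return median(walsh_averages)


# Hypothetical expression values of the genes of one signature;
# the value 100 is an outlier.
print(hodges_lehmann([1, 2, 3]))
print(hodges_lehmann([1, 2, 3, 100]))
```

The estimate changes only moderately when the outlying value is added, which is the kind of robustness that makes it usable for characterizing the strength of a signature.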

Cancer type specifies the cancer according to its characteristics, for example by pathological class, histology, genetic profile or a diagnostic classification. A cancer type may thus be e.g. breast cancer or Rb-p16 pathway defective cancers. Then at least one gene that is central in or indicative of the cancer feature may be identified e.g. from a literature review of the relevant cancer research, from clinical trials specifically targeted to identify such genes or from statistical analysis of the data of the reference database. Genes indicative of a certain cancer feature may be collected into a separate database. The database may also contain information on the genes' expression relations to the cancer feature. This database may be used to interpret and summarize the gene level results of the comparison of a single sample against a reference tissue category into cancer feature level information. This information may be used in the treatment optimization of the patient.

The step of reading the individual query sample data into the computer memory may comprise normalizing the query sample data to be comparable with the data of the reference database comprising at least 15 normalized expression profiles of reference patients and/or healthy control samples.

The step of reading the individual cancer patient sample data into the computer memory may comprise importing the gene expression profile data of the sample into a reference database that may comprise a representative number (e.g. at least 15, 100 or 1000) of expression profiles to represent the gene expression value combinations within the genes important to a cancer feature, e.g. angiogenesis. The imported new patient sample data is preferably annotated using the classification data available in the reference database. The inclusion of the new annotated patient data in the reference database may further improve the statistical reliability and usability of the data of the reference database.

The calculation of the cancer feature score may be performed, for example, by normalizing all gene expression values of the important gene(s) between some pre-set limits, e.g. 0 and 1, or 0 and 100, and calculating a sum of those expression values, or by using multipliers for each of the genes or for a sum of selected genes before calculating a sum of all genes, or by any other mathematical or statistical formula. A cancer feature may be associated to a tissue category. A tissue of the reference database may belong to at least one tissue category. In an embodiment, a tissue belongs to a plurality of tissue categories.
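The cancer feature score calculation described above may be sketched as a weighted sum of scaled expression values. The function name `feature_score`, the gene names `GENE_A` and `GENE_B`, the per-gene scaling limits and the clamping of scaled values to the range 0 to 1 are all hypothetical choices made for this illustration.

```python
def feature_score(expression, limits, weights=None):
    """Sum of expression values scaled to 0..1, optionally weighted per gene.

    expression: gene name -> expression value of the query sample
    limits:     gene name -> (low, high) values mapped to 0 and 1
    weights:    gene name -> multiplier (defaults to 1 for every gene)
    """
    total = 0.0
    for gene, value in expression.items():
        low, high = limits[gene]
        if high <= low:
            raise ValueError(f"invalid limits for {gene}")
        # Scale to the pre-set limits and clamp (clamping is an assumption).
        scaled = min(max((value - low) / (high - low), 0.0), 1.0)
        weight = 1.0 if weights is None else weights.get(gene, 1.0)
        total += weight * scaled
    return total


expression = {"GENE_A": 6.0, "GENE_B": 3.0}
limits = {"GENE_A": (2.0, 10.0), "GENE_B": (1.0, 5.0)}
plain = feature_score(expression, limits)
weighted = feature_score(expression, limits, weights={"GENE_A": 2.0})
print(plain, weighted)
```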

Tissue categories may be formed using the annotation data of the tissue samples of the reference database. A tissue category may thus represent at least one, preferably a plurality of tissues having a cancer type and/or cancer feature described by the annotation data. A tissue may be annotated using any number of annotation data items and it may thus belong to any number of categories. The characterization of a tissue sample may be performed in a multi-modal manner utilizing the properties of at least one tissue category, preferably a plurality of tissue categories, of a reference database.
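The formation of tissue categories from annotation data may be sketched as follows, assuming that each reference sample carries a set of annotation items and that every item defines one category. The sample identifiers and the function name `build_tissue_categories` are hypothetical.

```python
from collections import defaultdict


def build_tissue_categories(annotations):
    """Group sample identifiers by annotation item.

    A sample annotated with several items ends up in several categories.
    """
    categories = defaultdict(set)
    for sample_id, items in annotations.items():
        for item in items:
            categories[item].add(sample_id)
    return dict(categories)


annotations = {
    "sample_1": {"breast cancer", "sustained angiogenesis"},
    "sample_2": {"breast cancer"},
    "sample_3": {"prostate cancer", "sustained angiogenesis"},
}
categories = build_tissue_categories(annotations)
print(sorted(categories["sustained angiogenesis"]))
```

Here `sample_1` belongs both to the category of a cancer type and to the category of a cancer feature, which is what allows a query sample to be characterized against several categories.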

Any method disclosed herein may be a computer executable method. Any method may also comprise the step of storing data resulting from the execution of the method onto a memory device of a computer and/or outputting the resulting data to an output device of a computer.

An embodiment of the present invention may contain features to add, take out or differently combine genes and cancer features for analysis, or to add reference data from the latest research to obtain improved assessment results. This provides an advantage over existing diagnostic tests that are based on a dedicated gene expression chip manufactured for a single test at a time. For example, the genes deemed central to or indicative of a cancer feature may be selected primarily based on a literature review of the relevant cancer research. This selection may be verified using e.g. the statistical analysis performed on the data of the reference database. If the statistical analysis, e.g. a Cox proportional hazard model, which is known to a person skilled in the art, indicates that some other genes are also central to or indicative of the cancer feature, the list of central or indicative genes may be appended using the additionally found genes. In an embodiment, a primarily selected gene may even be removed or replaced in the analysis with another gene (or a group of genes) that has more statistical significance with regard to the cancer feature according to the data of the reference database.

Any method disclosed herein may contain storing of data over time to enable longitudinal analysis of a patient's cancer and how it changes over time. This also allows e.g. automatic alerts for the treating oncologist when new relevant drugs to treat a patient profile are added to the database.

A (reference) tissue category may comprise information of at least one tissue. Preferably, a tissue category comprises information about a plurality of tissues having some common aspect or feature. The common aspect or feature may be described using the annotation data of the tissue samples of the reference database.

Any of the embodiments disclosed herein may utilize a reference database that comprises gene expression activity level estimates, where each estimate describes the distribution of expression levels of a specific gene in a specific tissue category of the reference database. The tissue characterization data may be used for e.g. providing information suitable for diagnostics purposes, e.g. for determining the strength of various features of a cancer, clinical outcomes of the sample patient and best-matching treatments.

The properties of the reference patient may comprise e.g. the annotation data of the tissue sample originating from the reference patient.

Suitably, the categorization of tissue data may be multi-modal categorization. The known properties of the matching categories may provide a foundation for e.g. diagnosis, treatment recommendations and prognosis of a disease, e.g. cancer.

An embodiment of the invention may be usable for producing information for a proper diagnosis of an unknown cancer in cases where the exact disease is not yet known. Because the method is able to identify the strength of multiple features of a patient's cancer, this information may be used for diagnosis and treatment decisions even without knowing which cancer type is in question.

The gene expression data of a tissue sample may comprise expression level information of at least 10000, 15000, 20000 or 22000 genes. Preferably, but not necessarily, the expression data comprises the expression level information essentially about the entire genome, e.g. human genome, e.g. at least 95%, 98% or 99% of the genes. Broad coverage of the genome is preferred over limited coverage.

Another aspect of the invention may be a computer arrangement comprising at least one computer and means for performing any step, any combination of the steps or all of the steps of any of the methods mentioned herein. In an embodiment, the arrangement may comprise a test chip (assay). Advantageously, the assay may be capable of indicating the expression levels of at least 10000 or 20000 or 22000 genes of a test sample.

The reference database of the computer arrangement may comprise gene expression information of at least 10000, 20000 or 22000 genes from preferably each of a plurality of tissues. The number of tissue samples in the reference database may be at least 1000 or 10000 samples.

Yet another aspect of the invention may be a computer program product comprising computer executable instructions for performing any step, any combination of the steps or all of the steps of any of the methods mentioned herein.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

In the following, the invention is described in greater detail with reference to the accompanying drawings in which:

Figure 1 illustrates an embodiment of the method of the present invention,

Figure 2 illustrates exemplary results obtainable using a preferred embodiment of the present invention,

Figure 3a shows a tissue sample and a reference database comprising data of a plurality of tissue samples, and

Figure 3b illustrates a method of determining tissue similarity and the genes that are important (central) to the similarity, usable in an embodiment of the present invention.

Figure 1 depicts an exemplary embodiment of the present invention. To analyze the query sample expression profile, it is first read into computer memory 101. Then, the expression value of at least one gene from the query sample is compared to the corresponding gene expression values from the reference database tissue category of interest 102. The reference distribution can, for example, be formed with kernel density estimation, which provides a distribution similar to a smoothed histogram. Then, the position of the query sample expression value among the reference values is returned as either a value or a description 103. This can be, for example, for each gene the proportion of reference values having a lower expression value than the query sample has for the gene in question. Next, at least one cancer feature relevant to the selected tissue category is selected 104. Then, a list of genes and their expression relations indicative of the cancer feature is read into computer memory 105. Then, a summarization value or description is returned according to the list of genes and their expression relations 106. In the preferred embodiment, the query sample expression profile is compared against one reference tissue category in terms of one cancer feature at a time. Therefore, as the last step of the exemplary method, it is checked 107 whether the analysis needs to be run for another tissue category or another cancer feature. As a result, multiple different features of the query sample are characterized. In this way, for example, the accuracy of the diagnosis as well as the quality of treatment recommendations and prognosis may be improved.
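The flow of Figure 1 may be sketched in a few lines. The sketch assumes in-memory dictionaries in place of a real reference database, a Gaussian kernel density estimate whose bandwidth follows the normal reference rule of thumb (1.06 times the standard deviation times n to the power of -1/5), expression relations encoded as +1 (high expression indicates the feature) or -1 (low expression indicates it), and a plain mean as the summarization. None of these choices is taken from the text, and all names are hypothetical. For brevity the loop selects the feature first and then compares only the genes listed for it.

```python
import math
from statistics import stdev


def smoothed_fraction_lower(query, reference, bandwidth=None):
    """Share of a Gaussian kernel density estimate lying below `query`."""
    n = len(reference)
    if bandwidth is None:
        # Normal reference rule of thumb (an assumption of this sketch).
        bandwidth = 1.06 * stdev(reference) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # The CDF of a Gaussian KDE is the mean of the individual kernel CDFs.
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((query - v) / scale))
               for v in reference) / n


def analyse(query_profile, reference_db, feature_genes):
    """Return one summary value per (tissue category, cancer feature)."""
    results = {}
    # Steps 104, 105 and 107: iterate over categories and features.
    for (category, feature), relations in feature_genes.items():
        positions = []
        for gene, relation in relations.items():
            # Steps 102 and 103: position of the query value per gene.
            fraction = smoothed_fraction_lower(
                query_profile[gene], reference_db[category][gene])
            positions.append(fraction if relation > 0 else 1.0 - fraction)
        # Step 106: summarization (plain mean, an assumption).
        results[(category, feature)] = sum(positions) / len(positions)
    return results


# Step 101: hypothetical query profile already read into memory.
query_profile = {"GENE_A": 3.0, "GENE_B": 3.0}
reference_db = {"breast cancer": {"GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
                                  "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0]}}
feature_genes = {("breast cancer", "sustained angiogenesis"):
                 {"GENE_A": +1, "GENE_B": -1}}
results = analyse(query_profile, reference_db, feature_genes)
print(results)
```

Because the query values sit at the centre of symmetric reference values, the summary value in this toy example is approximately 0.5.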

Figure 2 depicts an example of a multi-dimensional diagnosis obtainable using an embodiment of the present invention. In the charts 201, 211, 221, 231, 241, 251, the x-axis 202, 212, 222, 232, 242, 252 represents the cancer feature score. The y-axis 203, 213, 223, 233, 243, 253 represents the distribution density.

In the shown example, the curve 204 of the chart 201 may represent the distribution of cancer feature scores of the patients of the reference database with regards to sustained angiogenesis. The point 205 on the curve represents the corresponding cancer feature score of the patient whose tissue sample is being analysed. It may be concluded from the position 205 on the curve that the probability of the patient's cancer having sustained angiogenesis feature is "high".

To continue the example further, the curve 214 of the chart 211 may represent the distribution of cancer feature scores of the patients of the reference database with regards to evading apoptosis. It may be concluded from the position 215 on the curve that the probability of the patient's cancer having evading apoptosis feature is "low". Similarly, it may be concluded that the position 225 on the curve 224 of the chart 221 indicates that the probability of the patient to have the cancer feature of limitless potential for replication is "medium". Yet further, in a similar manner, the chart 231 may indicate that the patient's risk for strong "tissue invasion and metastasis" cancer feature is "high". Still further, the chart 241 may indicate that the probability of the patient having "high intensity of antigrowth signals" cancer feature is "low". Finally, the chart 251 may indicate that the probability of the patient having "high self sufficiency in growth signals" cancer feature is "medium".

The exemplary results shown in figure 2 illustrate that the method of an embodiment of the present invention is capable of producing a multi-dimensional diagnosis of a cancer based on the plurality of cancer feature scores calculated from the patient's sample and from the data of the reference database. The treatment recommendations for the cancer may be produced based primarily on e.g. those cancer features whose probability is deemed to be "high" and secondarily on e.g. those cancer features whose probability is deemed to be "medium". The information may also be utilized for producing a prognosis of the disease.
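The prioritization of cancer features described above may be sketched as follows. The thresholds of one third and two thirds used to label a position as "low", "medium" or "high", the numeric positions and the function names are hypothetical; the text does not specify how the labels are derived.

```python
def label_strength(position, medium_from=1 / 3, high_from=2 / 3):
    """Map a position between 0 and 1 to a verbal label (thresholds assumed)."""
    if position >= high_from:
        return "high"
    if position >= medium_from:
        return "medium"
    return "low"


def prioritise_features(feature_positions):
    """List 'high' features first, then 'medium' ones; 'low' ones are omitted."""
    order = {"high": 0, "medium": 1}
    labelled = [(name, label_strength(position))
                for name, position in feature_positions.items()]
    relevant = [item for item in labelled if item[1] in order]
    # sorted() is stable, so the input order is kept within each label.
    return sorted(relevant, key=lambda item: order[item[1]])


# Hypothetical positions mirroring the labels of the example of Figure 2.
feature_positions = {
    "sustained angiogenesis": 0.90,
    "evading apoptosis": 0.10,
    "limitless potential for replication": 0.50,
    "tissue invasion and metastasis": 0.85,
    "high intensity of antigrowth signals": 0.15,
    "high self sufficiency in growth signals": 0.50,
}
priorities = prioritise_features(feature_positions)
for name, label in priorities:
    print(label, name)
```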

Figures 3a and 3b depict the principle of an exemplary method for determining the genes that are important with regards to a cancer feature using statistical analysis. In the method, microarray data from one test sample 300 (query sample) is compared to samples 303a-i of a large reference database 301 of different tissue/cell types (categories) 302a-c. There are thus, for example, a plurality of tissue samples 303a-c belonging to a tissue category 302a (and 303d-f belonging to category 302b and 303g-i belonging to category 302c). It should be noted that a tissue sample of the reference database may belong to a plurality of categories. This makes the multi-modal similarity analysis of a tissue sample possible.

"Large" here means a database that contains expression data of e.g. at least 100, 1000 or 10000 tissue samples.

A generalized workflow of the process of using direct expression value comparison to position a query expression profile among the reference values comprises the following steps.

First, the expression profile of a test sample is transformed into a format compatible with the reference data. Such normalization methods are known to a person skilled in the art. One example of a suitable method is provided in WO2009125065. The reference expression values for a gene in a tissue category are considered as a distribution, and the corresponding expression value from the query sample is compared to this distribution. An exemplary result of the comparison might be that 75% of the reference values are under the value from the query sample. This operation is repeated for all genes indicative of a certain cancer feature.
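The direct comparison step may be illustrated with the following minimal sketch, which assumes that the query and reference values have already been normalized to a common scale. The function names, gene labels and numeric values are hypothetical and serve only as an example.

```python
from bisect import bisect_left


def fraction_below(reference_values, query_value):
    """Fraction of reference values strictly below the query value."""
    if not reference_values:
        raise ValueError("reference distribution is empty")
    ordered = sorted(reference_values)
    return bisect_left(ordered, query_value) / len(ordered)


def position_profile(reference_by_gene, query_by_gene, feature_genes):
    """Position of the query sample among the reference values, per gene."""
    return {
        gene: fraction_below(reference_by_gene[gene], query_by_gene[gene])
        for gene in feature_genes
        if gene in reference_by_gene and gene in query_by_gene
    }


# Hypothetical normalized expression values for one tissue category.
reference = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [5.0, 6.0, 7.0, 8.0]}
query = {"GENE_A": 3.5, "GENE_B": 5.5}
positions = position_profile(reference, query, ["GENE_A", "GENE_B"])
```

For the hypothetical GENE_A, three of the four reference values lie under the query value, which corresponds to the 75% situation mentioned above.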

In an embodiment, the selection of important genes of a cancer feature may also be based on e.g. results of scientific research. Typically this is performed by a qualified group of oncology specialists conducting a literature review of the identified relevant published research, from which literature review the important genes are identified and included in the analysis. For example, it is known that in selected cancers the growth signalling pathway is controlled by the EGFR gene. This collection of information may be stored in a database and may also contain gene expression relation information indicating the relation of the gene's expression to the cancer feature. In a preferred embodiment, the gene expression relation information is available and the summarization of the cancer feature's status in the query sample is formed by using this information. For example, the summarization may take into account all genes whose high expression relates to the positive summary result (e.g. cancer feature 'active') and all genes whose low expression relates to the positive summary result (e.g. cancer feature 'inactive').
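One conceivable way of using the gene expression relation information in the summarization is sketched below. The orientation rule (using the position as such for genes whose high expression relates to the feature and one minus the position for genes whose low expression relates to it), the use of a plain mean, and all names and values are assumptions of this sketch, not a summarization prescribed by the description.

```python
def feature_score(positions, relations):
    """Summarize per-gene positions into one score for a cancer feature.

    positions: gene -> fraction of reference values below the query value (0..1)
    relations: gene -> "high" or "low", i.e. which end of the reference
               distribution relates to the cancer feature (assumed encoding)
    """
    oriented = []
    for gene, relation in relations.items():
        if gene not in positions:
            continue  # gene not measured in the query sample
        p = positions[gene]
        if relation == "high":
            oriented.append(p)
        elif relation == "low":
            oriented.append(1.0 - p)
        else:
            raise ValueError(f"unknown relation for {gene}: {relation!r}")
    if not oriented:
        raise ValueError("no feature genes available in the query sample")
    return sum(oriented) / len(oriented)


# Hypothetical genes and values, for illustration only.
positions = {"GENE_A": 0.75, "GENE_B": 0.25}
relations = {"GENE_A": "high", "GENE_B": "low"}
score = feature_score(positions, relations)
```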

Another generalized workflow of the process of using a gene expression alignment method to position a query expression profile among the reference values comprises the following steps.

First, the expression profile of a test sample is transformed into a format compatible with the reference data. Such normalization methods are known to a person skilled in the art. One example of a suitable method is provided in WO2009125065. In order to be compatible with the query sample data, the data of the reference database should naturally be normalized as well.

Moving to figure 3b, the expression level density estimates 315 have been pre-calculated for each gene in each reference tissue category. The calculation of expression level density estimates for the data of the reference database may be another aspect of the present invention. Then, each gene's data from the test sample is aligned with the density estimate for that same gene in each reference tissue as follows: density of expression values (y-axis 317) in the tissue is estimated in 512 evaluation points (x-axis 316) between the minimum and maximum (in all tissues) expression levels of the gene. The expression value of the gene in the test sample is then compared to the density estimate and a corresponding density value (y-axis 317) is identified. The fraction of evaluation points having lower density (a) forms the expression match score (em-score), describing the likelihood of obtaining a worse matching expression for the gene than the one in the input sample. The em-score matrix 310 contains an em-score value for each gene 311 of each tissue category 312. An em-score of 1 means that the gene in the input sample had the best matching expression level for the tissue in question, in other words the expression of the input sample matched the highest density peak. An em-score of 0, on the other hand, means that the input sample had an expression level that did not match the tissue at all. This operation is then repeated for all genes of the input sample against all reference tissue categories. Next, tissue specificity scores (ts-scores) for each gene from the test sample for each tissue in the reference database are calculated 313 from the em-score matrix 310. This calculation results in the ts-score matrix 320, which also has a value for each tissue category 322 and gene 321. Ts-scores range from -1 to 1 and tell us how uniquely a gene identifies the test sample as belonging to a certain tissue.
Finally, similarity of the input sample at the level of tissues is calculated 323 from tissue specificity scores, resulting in one tissue similarity score 330 per each tissue category of the reference database.
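The em-score step of figure 3b may be sketched as follows for one gene in one tissue category. The linear interpolation of the density at the query value, the treatment of values outside the grid as zero density, and the short five-point grid (instead of 512 points) are assumptions of this sketch.

```python
def density_at(grid, density, value):
    """Linearly interpolated density at `value`; zero outside the grid (assumption)."""
    if value < grid[0] or value > grid[-1]:
        return 0.0
    for i in range(1, len(grid)):
        if value <= grid[i]:
            span = grid[i] - grid[i - 1]
            w = 0.0 if span == 0 else (value - grid[i - 1]) / span
            return density[i - 1] + w * (density[i] - density[i - 1])
    return density[-1]


def em_score(grid, density, value):
    """Fraction of evaluation points whose density is lower than at `value`."""
    d = density_at(grid, density, value)
    # Strict comparison: with n evaluation points the highest attainable
    # score in this sketch is (n - 1) / n, i.e. close to 1 for n = 512.
    lower = sum(1 for y in density if y < d)
    return lower / len(density)


# Hypothetical density estimate on a short grid.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
density = [0.0, 0.1, 0.5, 0.3, 0.1]
at_peak = em_score(grid, density, 2.0)
off_grid = em_score(grid, density, 5.0)
```

Repeating the call for every gene and every tissue category would fill a structure corresponding to the em-score matrix 310.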

Alignment of a query profile results in a similarity score between the query sample and each of the tissues of the reference data. Behind each of the similarity scores are two scores for each gene. The expression match score (em-score) describes, suitably on the scale of 0 to 1, the likelihood of obtaining a less matching expression level for the gene in the particular tissue. In other words, em-score 0 for a gene means that all other expression levels for the gene match better in the particular tissue than the one in the query sample. Conversely, em-score 1 means that none of the expression levels for the gene match better than the one in the query sample.

Genes may be labeled as either "typical" or "atypical" for each tissue. This is done by comparing the query sample's em-score for the gene against the range of em-scores for the same gene gained when the tissue is compared against itself. If the em-score from the comparison is higher than e.g. the lowest 5% from the tissue vs. self-spread, the gene may be termed typical, otherwise it is atypical. This is done because the em-score itself does not tell the spread of expression values a gene has in a tissue. This spread affects the range of expected em-scores when a sample of the tissue is compared against itself. For a gene with a very tight spread, one may expect much higher em-scores than for those with a looser spread.

The tissue specificity score (ts-score), on the scale of -1 to 1, is further calculated from the em-scores to provide insight into whether the gene is expressed at a level unique to the particular tissue. Ts-score 1 for a gene means that the gene has a unique expression level in that tissue and in the query sample the expression was on that level. -1 means that the gene has a unique expression level but in the query sample the expression was not at that level. The mean of the ts-scores of all genes in the particular tissue is used as a similarity score for that tissue. Together these scores allow biologically meaningful interpretation of the transcriptomic state of the query sample by providing a similarity match at the level of tissues, then describing what part of the transcriptome, or in other words which genes, are responsible for the similarity, and finally which of the genes are on a level that is specific to the particular tissue.

Expression data to be analyzed against the reference data typically needs to be transformed into a compatible form using a method known to a person skilled in the art. One such method is taught e.g. in patent publication WO2009125065A1.

The density of expression values of each gene in each tissue type may be calculated e.g. as follows: For computational efficiency, an approximation based on the fast Fourier transform may be used to calculate the kernel density estimates. The kernel densities may be calculated by using a Gaussian window. The density is estimated from 0 to the maximum expression value in the entire dataset at 512 equally spaced points.
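A Gaussian kernel density estimate on 512 equally spaced points may be sketched as follows. For brevity the sketch sums the kernels directly instead of using the fast Fourier transform based approximation mentioned above. The default bandwidth (a normal-reference rule of thumb), the function name and the example values are assumptions of this sketch.

```python
import math


def gaussian_kde_grid(values, grid_max, n_points=512, bandwidth=None):
    """Gaussian kernel density estimate on an equally spaced grid from 0 to grid_max.

    Direct summation is used here; it is not the FFT-based approximation.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two expression values are needed")
    if bandwidth is None:
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        bandwidth = 1.06 * std * n ** (-1 / 5)  # rule of thumb, an assumption
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    step = grid_max / (n_points - 1)
    grid = [i * step for i in range(n_points)]
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


# Hypothetical expression values of one gene in one tissue type.
grid, density = gaussian_kde_grid([2.0, 2.5, 3.0, 7.0, 7.5], grid_max=10.0, bandwidth=0.3)
```

In the example, `grid_max` stands for the maximum expression value in the entire dataset, which the caller is assumed to supply.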

The modality of gene expression estimates may be calculated by searching for peaks having at least 0.1 of the total area of the density estimate. Some, preferably low percentage, e.g. 10-20%, of the genes may be excluded from the analysis e.g. due to the ambiguous modality of expression distributions. Modality of the expression profiles of genes can be used to further categorize reference data as well as to assign the query sample into the specific categories based on one or multiple genes.
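The peak search may be illustrated with the following sketch. It assumes that the density is given on equally spaced evaluation points, that neighbouring peaks are separated at local minima, and that the area of a peak is the sum of the density values between two such minima; the description does not specify how a peak is delimited, so these choices are assumptions.

```python
def count_modes(density, min_area_fraction=0.1):
    """Count peaks holding at least `min_area_fraction` of the total area.

    The curve is split at interior local minima (assumed peak boundaries).
    With equally spaced points, sums of density values are proportional to areas.
    """
    total = sum(density)
    if total <= 0:
        raise ValueError("density must contain positive values")
    cuts = [0]
    for i in range(1, len(density) - 1):
        # first point of a valley: strictly lower than the left neighbour
        # and not higher than the right neighbour
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            cuts.append(i)
    cuts.append(len(density))
    modes = 0
    for start, end in zip(cuts, cuts[1:]):
        if sum(density[start:end]) / total >= min_area_fraction:
            modes += 1
    return modes


# Hypothetical density estimates: two comparable peaks, and one dominant
# peak followed by a negligible bump.
bimodal = count_modes([0, 1, 4, 1, 0, 0, 1, 4, 1, 0])
unimodal = count_modes([0, 1, 5, 9, 5, 1, 0, 0.2, 0.1, 0])
```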

Gene and tissue specific expression value density estimates are used to calculate likelihood of obtaining expression values observed in a query profile from each tissue type. For a gene g in tissue t this is done as follows:

The value of the density diagram for gene g in tissue t corresponding to the expression value of gene g in the query sample is determined.

Then that density value is compared to the density values of the 512 evaluation points of the density diagram of gene g in tissue t and the fraction of lower density values is calculated. This is called the expression match score (em-score), with 1 meaning a perfect match between the query and the tissue for expression of the gene and 0 meaning that expression of the gene in the query profile is at a non-typical level for the tissue. This calculation is repeated for each gene of the query profile against the density estimates of the same genes in each tissue type of the reference data. Additionally, a lower limit for the expected expression match score is calculated for each gene in each tissue type of the reference data to reflect the natural variability of expression of each gene in each tissue. This lower limit may be defined e.g. as the value under which the lowest 5% of em-scores for the gene would settle when a sample from the tissue is compared against itself. The lower limit for the expected expression match score for a gene in a particular tissue is calculated by evaluating the em-scores for all evaluation points, and weighting the abundance of each em-score by the value of the density diagram at that point. The sum of the weights is then normalized to 1. Since the density diagram already represents the levels of gene expression in the tissue, the em-scores that would be obtained if the corresponding levels of gene expression were compared against the tissue itself are evaluated. This is repeated for all genes in all tissues. The calculations are detailed below.

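The distribution of expected em-scores is defined as:

$$E = \{\text{evaluation points for gene } g \text{ in tissue } t\}$$

$$e_i = i\text{:th evaluation point, for each } i \; (1 \ldots n)$$

$$\text{expected em-score}_i = \mathrm{ems}(e_i, t)$$

$$\text{with weight}_i = \frac{d(e_i)}{\sum_{j=1}^{n} d(e_j)}$$

where $d(e_i)$ denotes the value of the density diagram of gene g in tissue t at the evaluation point $e_i$.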
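The weighted distribution of expected em-scores and the resulting lower limit may be sketched as follows. The short five-point density, the function names, and the reading of the lower limit as the smallest em-score at which the cumulative weight reaches 5% are assumptions of this sketch.

```python
from bisect import bisect_left


def self_comparison_em_scores(density):
    """(em-score, weight) for every evaluation point of one gene in one tissue.

    The em-score is the fraction of evaluation points with lower density;
    the weight is the density at the point, normalized so weights sum to 1.
    """
    n = len(density)
    total = sum(density)
    if n == 0 or total <= 0:
        raise ValueError("density must contain positive values")
    ordered = sorted(density)
    return [(bisect_left(ordered, d) / n, d / total) for d in density]


def em_score_lower_limit(density, quantile=0.05):
    """Smallest em-score at which the cumulative weight reaches `quantile`."""
    pairs = sorted(self_comparison_em_scores(density))
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= quantile:
            return score
    return pairs[-1][0]


def is_typical(query_em_score, lower_limit):
    """A gene is termed typical if its em-score is higher than the lower limit."""
    return query_em_score > lower_limit


# Hypothetical density estimate of one gene in one tissue.
density = [0.0, 0.1, 0.5, 0.3, 0.1]
limit = em_score_lower_limit(density)
```

A query em-score may then be labeled with `is_typical(query_em_score, limit)`, in line with the typical/atypical labeling of genes.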

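For the purpose of combining the individual em-scores obtained from per gene analyses into similarity of the query sample to reference sample categories, the tissue specificity score (ts-score) for each gene in each tissue is calculated as follows: The tissue specificity score for tissue t and gene g is:

$$\mathrm{ts}(t, g) = \frac{1}{|T|} \sum_{i=1}^{|T|} f(t, x_i, g)$$

Where

$$T = \{\text{non-}t \text{ tissues}\}$$

$$x_i = i\text{:th element of } T$$

and

$$f(t, x, g) = \begin{cases} 1 - 1.25\left(\dfrac{\mathrm{ems}(x,g) + 0.25}{\mathrm{ems}(t,g) + 0.25} - 0.2\right), & \text{for } \mathrm{ems}(t,g) > \mathrm{ems}(x,g) \\[2ex] -\left(1 - 1.25\left(\dfrac{\mathrm{ems}(t,g) + 0.25}{\mathrm{ems}(x,g) + 0.25} - 0.2\right)\right), & \text{for } \mathrm{ems}(t,g) < \mathrm{ems}(x,g) \end{cases}$$

$$\mathrm{ems}(t,g) = \text{expression match score for tissue } t, \text{ gene } g$$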
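The ts-score calculation of this section (adding 0.25 to both em-scores, dividing the smaller by the larger, rescaling to the range 0 to 1, subtracting from 1 and applying the sign) may be sketched as follows. The function names and the example em-scores are hypothetical; the case of two equal em-scores, which yields 0 under either branch, is returned as 0 in this sketch.

```python
def pairwise_specificity(ems_t, ems_x):
    """Ratio-weighted difference of two em-scores for one gene.

    ems_t: em-score of the gene in tissue t
    ems_x: em-score of the same gene in another tissue x
    """
    a = ems_t + 0.25
    b = ems_x + 0.25
    ratio = min(a, b) / max(a, b)      # between 0.2 and 1
    magnitude = 1.0 - 1.25 * (ratio - 0.2)
    # negative when tissue t has the lower em-score; equal scores give 0
    return magnitude if ems_t >= ems_x else -magnitude


def ts_score(ems_by_tissue, tissue):
    """Mean pairwise specificity of `tissue` against all other tissues."""
    others = [x for x in ems_by_tissue if x != tissue]
    if not others:
        raise ValueError("at least one other tissue is needed")
    own = ems_by_tissue[tissue]
    return sum(pairwise_specificity(own, ems_by_tissue[x]) for x in others) / len(others)


# Hypothetical em-scores of one gene in three tissue categories.
ems = {"tissue_a": 1.0, "tissue_b": 0.0, "tissue_c": 0.0}
ts_a = ts_score(ems, "tissue_a")
ts_b = ts_score(ems, "tissue_b")
```

In the example the gene matches only tissue_a, so its ts-score there is 1, whereas for tissue_b the mean of one fully negative and one neutral comparison gives -0.5.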

The expression match score for the gene g in tissue t and the expression match score for gene g in a tissue other than t are taken, and e.g. 0.25 is added to both numbers. The smaller number is divided by the larger number, resulting in a score between 0.2 and 1. This number is then scaled to the range 0 - 1, and is subtracted from 1. If the expression match score for tissue t was the lower of the two, the score is multiplied by -1. In essence, what this does is give a ratio-weighted difference of the two expression match scores. This calculation is done for all tissue pairs {t, not t}, resulting in n - 1 values, where n is the number of tissues the query sample is compared to. The tissue specificity score for gene g in tissue t is the mean of these values. It varies between 1 and -1 and describes how well gene g classifies the query profile into tissue t. A score of 1 means the gene has a unique level of expression in the tissue and the query profile has an expression level matching it perfectly. 0 means that the expression level observed in the query sample cannot differentiate the tissue from other tissues. -1 means the gene has a unique level of expression for the tissue and the query profile does not have that specific expression level. The mean of the tissue specificity scores is used as the similarity score at the tissue level: The similarity score for sample s and tissue t is:

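$$\mathrm{similarity}(s, t) = \frac{1}{|G|} \sum_{i=1}^{|G|} \mathrm{ts}(t, g_i)$$

Where

$$G = \{\text{common genes between } s \text{ and } t\}$$

$$g_i = i\text{:th element of } G$$

and $\mathrm{ts}(t, g_i)$ is the tissue specificity score of gene $g_i$ in tissue t.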
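The tissue level similarity, i.e. the mean of the tissue specificity scores over the genes common to the sample and the tissue, may be sketched as follows. The function name, the gene labels and the ts-score values are hypothetical.

```python
def similarity(ts_scores_for_tissue, query_genes):
    """Mean ts-score over the genes common to the query sample and the tissue.

    ts_scores_for_tissue: gene -> ts-score of that gene for one tissue
    query_genes: genes measured in the query sample
    """
    common = [g for g in query_genes if g in ts_scores_for_tissue]
    if not common:
        raise ValueError("query sample and tissue share no genes")
    return sum(ts_scores_for_tissue[g] for g in common) / len(common)


# Hypothetical ts-scores for one tissue; GENE_D is absent from the reference data.
ts_scores = {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 0.25}
sim = similarity(ts_scores, ["GENE_A", "GENE_B", "GENE_D"])
```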

Now the genes specific to a cancer feature may be identified. This may be performed utilizing the information of the reference database. For this purpose, the uniqueness of e.g. the gene expression level with regard to a single category (cancer feature) in any categorization may be calculated e.g. by subtracting the maximum of the density estimates in each evaluation point for the entity in other categories from the density estimate of the entity in the category under study. This results in a number ("feature significance score") between 0 and 1, which indicates how big a proportion of the observed quantity of the entity (i.e. expression level) is unique to the category, e.g. a cancer feature. The genes that are the most significant ones with regard to the cancer feature may now be selected.
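One possible reading of the feature significance score is sketched below. It assumes that all density estimates share the same equally spaced evaluation points, that negative differences are clipped to zero, and that the remaining area is divided by the total area of the density in the category under study so that the result lies between 0 and 1. These details, as well as the names and values, are assumptions of this sketch.

```python
def feature_significance(density_in_category, densities_in_other_categories):
    """Proportion of the density in the category that exceeds all other categories."""
    total = sum(density_in_category)
    if total <= 0:
        raise ValueError("density must contain positive values")
    unique = 0.0
    for i, d in enumerate(density_in_category):
        other_max = max(
            (other[i] for other in densities_in_other_categories), default=0.0
        )
        unique += max(0.0, d - other_max)  # clipping is an assumption
    return unique / total


# Hypothetical density estimates of one gene in two categories.
in_category = [0.0, 0.0, 1.0, 3.0, 1.0]
elsewhere = [[1.0, 3.0, 1.0, 0.0, 0.0]]
significance = feature_significance(in_category, elsewhere)
```

Genes could then be ranked by this value and the highest ranking ones selected for the cancer feature.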

In an embodiment, the selection of important genes of a cancer feature may also be based on e.g. results of scientific research. Typically this is performed by a qualified group of oncology specialists conducting a literature review of the identified relevant published research, from which literature review the important genes are identified and included in the analysis. For example, it is known that in selected cancers the growth signalling pathway is controlled by the EGFR gene.

To a person skilled in the art, the foregoing exemplary embodiments illustrate the model presented in this application, whereby it is possible to design different methods and arrangements which, in ways obvious to the expert, utilize the inventive idea presented in this application. For example, it is clear to a person skilled in the art that query sample values and reference data values may comprise measurements of any other measurable biological properties as well. The invention may also be applicable to characterizing the strength of biological properties other than cancer features.

Claims

1. A computer executable method for characterizing, utilizing a reference database, the strength of a cancer feature of a query sample expression profile based on the gene expression data of the tissue, characterized in that it comprises the steps of:

reading data of the query sample expression profile into the computer memory, comparing the expression value of at least one gene from the query sample to the corresponding gene expression values from the reference database consisting of a representative number of samples comprising of a tissue category,

returning a value or description defining the position of the query sample genes among the reference values,

selecting a cancer feature relevant to the tissue category,

reading into the computer memory information defining a list of genes comprising at least one gene that is indicative of, and their expression relation to, the cancer feature within the selected tissue category, and

returning a summarization value or description for the selected cancer feature according to the expression relations of the genes to the cancer feature.

2. A method according to claim 1, characterized in that said tissue category may be formed comprising any meaningful collection of samples such as samples of a specific cancer type.

3. A method according to claim 1, characterized in that it further comprises a step of forming a database comprising of genes and their expression relation to cancer features in specific tissue categories.

4. A method according to claim 3, characterized in that said forming a database of genes and their expression relations to cancer features is performed using statistical analysis of the data of the reference database.

5. A method according to claim 1, characterized in that said step of reading the individual query sample data into the computer memory comprises normalizing the query sample data to be comparable with the data of the reference database comprising a representative number of expression profiles of reference patients and/or healthy control samples.

6. A method according to claim 1, characterized in that said step of comparing the query sample to the reference database comprises the step of forming an estimation of a reference data distribution from the reference data population.

7. A method according to claim 1, characterized in that said step of comparing the query sample to the reference database comprises calculation, estimation or description of the amount of reference data population or distribution having lower value or calculation, estimation or description of the amount of reference data population or distribution having higher value for the selected gene.

8. A method according to claim 1, characterized in that said step of comparing the query sample to the reference database comprises calculation, estimation or description of the proportion of the reference data population or distribution having similar value as the query sample.

9. A method according to claim 1, characterized in that said step of returning a value or description of the position of the query sample among the reference population or distribution comprises numerical value between a pre-set lower limit and upper limit, the value defining the amount of reference data population or distribution having lower value than the query sample or defining the amount of reference data population or distribution having higher value than the query sample.

10. A method according to claim 1, characterized in that said step of returning a value or description of the position of the query sample among the reference population or distribution comprises a mathematical operation modifying the returned value according to the information whether the higher or lower end of the reference data population or distribution is associated to characterizable biological or medical process, stage, outcome, situation, diagnosis or prognosis.

11. A method according to claim 1, characterized in that said step of reading into the computer memory information defining a list of genes comprising at least one gene comprises the step of adding or taking out at least one gene from the list of genes known to be indicative of the cancer feature.

12. A method according to claim 1, characterized in that said step of summarizing the position value or description of at least one gene of the selected cancer feature to a cancer feature score comprises utilization of mathematical methods to provide statistical summary and reliability value of the position values of the query sample in terms of at least one gene associated to the selected feature.

13. A method according to claim 1, characterized in that said step of comparing at least one gene expression value of the query sample to the gene expression value(s) in the reference database comprises summarization of expression values of multiple genes from query sample and from reference data by using mathematical method providing statistical summary and reliability values of the expression values and comparing the resulting summary value of the query sample to the population or distribution of summary values from reference data.

14. A method according to claim 11, characterized in that said addition or taking out of a gene from the list of genes is performed based on the results of a statistical analysis performed on the data of the reference database.

15. A computer arrangement for characterizing, utilizing a reference database comprising reference patient data, the strength of a cancer feature of a query sample tissue based on the gene expression data of the tissue, characterized in that the arrangement comprises means for:

selecting a cancer feature relevant to the tissue category,

16. A computer program product for characterizing, utilizing a reference database comprising reference patient data, the strength of a cancer feature of a query sample tissue based on the gene expression data of the tissue, characterized in that the program product comprises computer executable instructions for:

selecting a cancer feature relevant to the tissue category, and