WO2007015459A1

WO2007015459A1 - Gene set for use in prediction of occurrence of lymph node metastasis of colorectal cancer

Info

Publication number: WO2007015459A1
Application number: PCT/JP2006/315143
Authority: WO
Inventors: Ichiro Takemasa; Takenobu Tasaki; Hikaru Sonoda; Hirofumi Higuchi; Kenichi Matsubara
Original assignee: Osaka University
Priority date: 2005-08-01
Filing date: 2006-07-31
Publication date: 2007-02-08
Also published as: JP2007037421A

Abstract

Disclosed is a method for prediction of the presence or absence of lymph node metastasis of colorectal cancer. Also disclosed is a set of genes which can be used in the method. A method for selecting a set of genes for use in the prediction of the presence or absence of lymph node metastasis of colorectal cancer, the method comprising the following steps (1) to (4): (1) analyzing the information about gene expression in a primary lesion of colorectal cancer in a patient who has been determined to have lymph node metastasis of colorectal cancer by a histopathological test, by at least four analysis methods involving at least one supervised learning analysis method, thereby selecting a group of genes which serve to determine the presence or absence of lymph node metastasis at a correct classification rate of 75% or higher for each of the analysis methods; (2) selecting a common gene which is selected from the group of genes in all of the analysis methods of step (1); (3) analyzing the information about gene expression above to assign the presence or absence of lymph node metastasis to any combination of two or more genes and then selecting a combination(s) of genes showing an interaction from the combinations; and (4) performing a variable selection in a logistic regression model to give an answer regarding the presence or absence of lymph node metastasis using the common gene and the combination(s) of genes as the explaining variables; a set of genes for use in the prediction of the presence or absence of lymph node metastasis of colorectal cancer, which is selected by the method; and a method for prediction of the presence or absence of lymph node metastasis of colorectal cancer by using the set of genes.

Description

Specification

Gene set for predicting the presence or absence of lymph node metastasis from colorectal cancer

Technical field

[0001] The present invention relates to a gene group useful for predicting the presence or absence of lymph node metastasis of colorectal cancer, and a method of utilizing the gene expression information thereof.

Background art

[0002] Colorectal cancer, whose incidence is particularly high in developed countries, has been increasing year by year in Japan as well, and is one of the leading causes of cancer-related death. According to statistical reports (e.g., see Non-Patent Document 1), 37% of colorectal cancer patients have cancer confined to the primary lesion without metastasis, 37% have regional cancer with metastasis only to the regional lymph nodes, and the rest have distant metastases.

[0003] At present, the malignancy classification of colorectal cancer most commonly used in clinical practice takes pathological findings, such as the depth of cancer invasion into the colon wall and the extent of metastasis to regional lymph nodes, as its indices, and there is no doubt about the correlation between this classification and prognosis. However, the determination of the presence or absence of lymph node metastasis, which greatly affects the classification result, currently depends on the classical histopathological technique of observing under a microscope specimens prepared from a part of the many excised lymph node tissues. As shown by a report that metastasis is later found in 20 to 40% of patients determined to be negative for lymph node metastasis by this method (see, for example, Non-Patent Document 2), this determination method has not necessarily been accurate enough.

[0004] Colorectal cancer is one of the cancers in which molecular biological research is most advanced, including research on the mechanism of multistage carcinogenesis, and many reports on individual genes such as APC, K-ras, p53, and DCC have been published so far. However, focusing on only one of these genes is not sufficient to express the individuality of colorectal cancer, and in recent years attempts have been made to obtain useful new knowledge from expression information on a large number of genes acquired at once by using a DNA microarray or the like.

[0005] Alizadeh et al. isolated B lymphocytes from the peripheral blood of patients with diffuse large B-cell lymphoma, measured gene expression in these lymphocytes using a DNA microarray, and performed hierarchical clustering of the resulting gene expression data. They found that the peripheral blood B lymphocytes of these patients fall into two types: those showing a gene expression pattern similar to B cells in the germinal centers of lymphoid tissues, and those showing a gene expression pattern similar to B cells activated in vitro (Non-Patent Document 3). When the survival of the two groups was compared using Kaplan-Meier plots, it became clear that patients with B cells showing the latter expression pattern have a worse prognosis than patients with B cells showing the former. The results obtained by the authors' clustering of gene expression information correlated more strongly with prognosis than prognosis prediction based on conventional pathological diagnosis. The work of Alizadeh et al. is significant in that a clinically usable rule was derived from gene expression information. However, it has not been verified whether the rule can be applied to entirely new clinical cases, and it cannot be ruled out that the result is valid only within the scope of that paper.

[0006] Khan et al. reported that four types of cancer belonging to the small round blue cell tumors, which are difficult to distinguish histologically, can be accurately distinguished by analyzing gene expression information with an artificial neural network (Non-Patent Document 4). In this report, it was verified that accurate results are obtained even when test sample data are input to an artificial neural network model derived from a randomly extracted part of the entire data. This suggests that the artificial neural network model derived there is not limited to the data of that paper and can be applied generally to distinguish the four types of cancer belonging to the small round blue cell tumors. However, results obtained with an artificial neural network model are not readily accepted in general, because their mathematical basis cannot be clearly explained.

[0007] A recent study using a DNA microarray to identify molecular targets involved in liver metastasis of colorectal cancer is the report by Yanagawa et al. (Non-Patent Document 5). The authors performed PCR with human cDNA as a template, using as primers oligo DNAs designed from the base sequences of human cDNAs registered in a public gene database, and obtained amplified cDNA fragments. Using a DNA microarray on which these cDNA fragments were printed, they examined the gene expression profiles of colon cancer primary lesions and colon cancer liver metastases isolated from 10 colon cancer patients. As a result, they found 40 genes whose expression was increased and 7 genes whose expression was decreased in liver metastases relative to the primary lesion, and thereby identified a set of candidate genes that may be involved in liver metastasis.

[0008] Regarding gene sets involved in liver metastasis of colorectal cancer, the following are known (Patent Document 1): a method for identifying a gene set effective in predicting liver metastasis of colorectal cancer by applying statistical analysis based on a discriminant analysis method to DNA microarray expression information on genes specifically expressed in colon cancer primary tissue; a gene set identified by that method; and a method for predicting liver metastasis using expression information of the gene set in colorectal cancer primary tissue. The gene set and method provide information useful for predicting metachronous liver metastasis of colorectal cancer, and are suitable as material for identifying important genes specifically expressed in colorectal cancer. However, since lymph node metastasis of colorectal cancer differs completely from liver metastasis in terms of pathology, these gene sets and methods for colorectal cancer liver metastasis cannot be applied directly to lymph node metastasis of colorectal cancer.

[0009] In addition, it has been shown that candidate genes considered to be related to the development and progression of colorectal cancer can be identified by preparing an original DNA microarray using probes selected from cDNA libraries prepared from colon cancer primary tissue, colon cancer liver metastasis tissue, and normal colon mucosa tissue, and performing gene expression analysis in colorectal cancer tissues with it (Non-Patent Document 6).

[0010] On the other hand, with regard to lymph node metastasis of colorectal cancer, as described above, the determination of the presence or absence of lymph node metastasis currently relies on the classical histopathological technique of observing under a microscope specimens prepared from a part of the many excised lymph node tissues, and this determination method is not necessarily accurate enough. In addition, it is known that postoperative adjuvant therapy performed after surgical removal of the primary colorectal cancer lesion can improve the prognosis of patients with lymph node metastasis. However, postoperative adjuvant therapy may cause side effects such as anorexia, upper abdominal discomfort, and nausea, so whether it is needed must be decided in consideration of the patient's quality of life (QOL), the cost of medical care, and the patient's condition and disease state. Therefore, if a more accurate method for determining lymph node metastasis is found, it can serve as a useful index for decision-making when selecting postoperative adjuvant therapy, allowing patients to receive appropriate treatment, which is thought to lead to patient benefit.

Patent Document 1: Japanese Patent Application Laid-Open No. 2004-33082

Non-Patent Document 1: Troisi R.J., et al., 1999, Cancer, vol. 85, p. 1670-1676

Non-Patent Document 2: Cohen A.M., et al., 1997, Curr Probl Surg., vol. 34, p. 601-676

Non-Patent Document 3: Alizadeh et al., 2000, Nature, vol. 403, p. 503-511

Non-Patent Document 4: Khan et al., 2001, Nature Medicine, vol. 7, p. 673-679

Non-Patent Document 5: Yanagawa et al., 2001, Neoplasia, vol. 3, No. 5, p. 395-401

Non-Patent Document 6: Takemasa et al., 2001, Biochem. Biophys. Res. Commun., vol. 285, p. 1244-1249

Disclosure of the invention

Problems to be solved by the invention

[0012] As described above, the conventional method for determining the presence or absence of lymph node metastasis of colorectal cancer involves excising a plurality of lymph nodes around the colorectal cancer and observing them under a microscope, and its accuracy has been a problem.

[0013] Therefore, in order to improve this point, the present invention aims to provide a method for predicting the presence or absence of lymph node metastasis of colorectal cancer by examining the gene expression profile of the colorectal cancer primary tissue. To make it possible to predict the presence or absence of cancer cell metastasis to lymph nodes, the present invention further aims to provide a set of genes that can be used to determine lymph node metastasis of colorectal cancer, and a discriminant that determines the presence or absence of lymph node metastasis on the basis of their expression information.

Means for solving the problem

[0014] As a result of intensive studies to achieve the above object, the inventors of the present invention prepared an original DNA microarray using probes selected from cDNA libraries prepared from colon cancer primary tissue, colon cancer liver metastasis tissue, and normal colon mucosa tissue, and used the DNA microarray to obtain gene expression analysis data for primary colorectal cancer lesions. Through statistical analysis of these data, they found a set of genes that can be used to predict the presence or absence of lymph node metastasis, together with a discriminant for actually predicting the presence or absence of lymph node metastasis from their expression levels, and thereby completed the present invention.

That is, the present invention provides the following method for selecting a gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer.

1. A method for selecting a gene set for predicting the presence or absence of colorectal cancer lymph node metastasis, including the following steps (1) to (4):

(1) a step of analyzing gene expression information in the primary colorectal cancer tissue of patients whose presence or absence of lymph node metastasis has been determined histopathologically, by four or more analysis methods including at least one supervised learning analysis method, and selecting, for each analysis method, a group of genes that can classify the presence or absence of lymph node metastasis with a correct classification rate of 75% or more,

(2) a step of selecting common genes that are selected by all of the analysis methods from the gene groups selected by the respective analysis methods used in (1),

(3) a step of analyzing the gene expression information so that the presence or absence of lymph node metastasis is classified by arbitrary combinations of two or more genes, and selecting combinations of genes that exhibit an interaction, and

(4) a step of performing variable selection in a logistic regression model in which the presence or absence of lymph node metastasis is the response and the common genes and the combinations of genes are the explanatory variables;

2. The method according to 1 above, wherein the analysis methods in (1) include at least one selected from the group consisting of (a) Support Vector Machine, (b) an extension method of Principal Component Analysis / Artificial Neural Network, (c) a combination of Hierarchical Cluster Analysis and Stepwise Logistic Discrimination, and (d) a combination of Classification And Regression Tree and Logistic Discrimination;

3. The method according to 1 or 2 above, wherein the analysis method is a combination of Classification And Regression Tree and Logistic Discrimination;

4. The method according to any one of 1 to 3 above, wherein the variable selection method in (4) is a stepwise variable selection method.

[0016] The present invention also provides the following gene set for predicting the presence or absence of colorectal cancer lymph node metastasis.

5. A gene set for predicting the presence or absence of colorectal cancer lymph node metastasis, selected by any of the methods 1 to 4 above;

6. The gene set according to 5 above, including at least the genes represented by the database accession numbers (serial numbers) NM_003404 (G1592), NM_002128 (G2645), NM_052868 (G3031), NM_005034 (G3177), NM_001540 (G3753), NM_005722 (G3826), and NM_015315 (G4370).

[0017] The present invention further provides the following method for predicting the presence or absence of lymph node metastasis of colorectal cancer using the selected gene set.

7. A method for predicting the presence or absence of lymph node metastasis of colorectal cancer, characterized by using the gene set according to any of 5 or 6 above;

8. The method according to 7. above, characterized by using the following discriminant:

$$
\begin{aligned}
D ={} & 0.2307 \\
& - 2.7132 \times (\text{NM\_003404 (G1592) expression level}) \\
& + 8.9509 \times (\text{NM\_052868 (G3031) expression level}) \\
& + 8.7975 \times (\text{NM\_005722 (G3826) expression level}) \\
& - 2.3098 \times (\text{NM\_015315 (G4370) expression level}) \\
& + 3.5126 \times (\text{NM\_002128 (G2645) expression level}) \times (\text{NM\_005034 (G3177) expression level}) \\
& - 8.8226 \times (\text{NM\_001540 (G3753) expression level}) \times (\text{NM\_005722 (G3826) expression level})
\end{aligned}
$$

(If D > 0, lymph node metastasis is determined to be present; if D ≤ 0, lymph node metastasis is determined to be absent.)

[0018] Examples of gene group analysis methods used in the present invention include the following:

1. Hierarchical Cluster Analysis

2. Logistic Discrimination (including variable selection methods)

3. Classification And Regression Tree

4. Principal Component Analysis / Artificial Neural Network (including extended methods)

5. Projection Pursuit for supervised classification

6. Support Vector Machine (SVM)

7. Self Organizing Map

8. AdaBoost

9. Gene selection process combining two or more of these methods.

[0019] As a method for analyzing the gene expression information used in the present invention, logistic discrimination analysis (Logistic Discrimination) can be used.

[0020] Variable selection methods used in the present invention include the following:

1. Stepwise selection method (stepwise)

2. Forward selection method (forward)

3. Backward selection method (backward)

Effect of the invention

[0021] The present invention provides a gene set useful for determining, when a colorectal cancer patient undergoes resection of the primary colorectal cancer lesion, whether the colorectal cancer cells have metastasized to nearby lymph nodes, and a discriminant for predicting the presence or absence of lymph node metastasis on the basis of the gene expression information of that gene set. According to the method of the present invention, a favorable lymph node metastasis determination result can be obtained by analyzing the gene expression information of the gene set in the primary colorectal cancer tissue using a logistic regression equation. It is therefore possible to predict the presence or absence of cancer cell metastasis to lymph nodes at the time of primary colorectal cancer resection.

BEST MODE FOR CARRYING OUT THE INVENTION

[0022] The method of the present invention is characterized by a gene set effective for predicting the presence or absence of lymph node metastasis at the time of primary colorectal cancer resection, and by a discriminant for predicting the presence or absence of lymph node metastasis on the basis of the expression levels of the gene set.

[0023] A gene set useful for predicting the presence or absence of lymph node metastasis is obtained by comprehensively examining gene expression in multiple samples of colon cancer primary tissue and selecting a set of genes that can be used for the determination. Methods that can be used for such comprehensive gene expression analysis include microarrays, Northern analysis, the ATAC-PCR method (Kato et al., Nuc. Acids Res., vol. 25, p. 4694-4696, 1997), real-time PCR represented by TaqMan PCR (Applied Biosystems), and SAGE (Velculescu et al., Science, vol. 270, p. 484-487, 1995).

[0024] In a preferred embodiment of the present invention, the analysis is performed using a DNA microarray according to the method described in JP-A-2004-33082. More specifically, gene expression data were obtained using the above DNA microarray for a total of 150 primary colorectal cancer tissues collected with informed consent: 63 from patients found by histopathological observation at the time of primary lesion resection to have lymph node metastasis, and 87 from patients found to have no metastasis. As a comparative control, gene expression data obtained from normal colonic mucosal tissue surrounding the colon cancer primary tissue were used for 40 cases.

[0025] The gene expression data described above are obtained by hybridizing fluorescently labeled cDNA, prepared from total RNA extracted from cancer tissue, to the DNA microarray, and then detecting and quantifying with a dedicated scanner the fluorescence signal emitted from the labeled cDNA bound to the probes on the DNA microarray. A more specific procedure is described below.

[0026] Total RNA can be extracted from colorectal cancer tissue or normal colonic mucosa tissue using reagents such as TRIzol reagent (GIBCO BRL) or ISOGEN (Nippon Gene), according to the method described in the package insert of each reagent. The total RNA thus prepared can be used as it is for the preparation of the labeled cDNA described below. Alternatively, polyadenylated RNA (hereinafter also referred to as “mRNA”) can be purified from the total RNA using a commercially available kit such as mRNA Purification Kit (Amersham Biosciences) according to the attached method, and used for the preparation of the labeled cDNA.

[0027] Cy3-labeled cDNA derived from colorectal cancer primary tissue (hereinafter sometimes referred to as “Cy3 cDNA”) is prepared by adding reverse transcriptase to a mixed solution containing the above total RNA or mRNA, oligo dT primer, dNTP, and Cy3-labeled dUTP, and then incubating at 37 to 45 °C for 1 to 3 hours, preferably at 42 °C for 1 hour. Cy5-labeled cDNA derived from normal colon mucosa (hereinafter also referred to as “Cy5 cDNA”), used as a comparative control, is prepared in the same manner using total RNA from normal colon mucosa tissue. The Cy3 cDNA and Cy5 cDNA thus obtained are each heat-treated in a denaturing solution at 65 to 70 °C for 10 to 20 minutes, preferably at 70 °C for 10 minutes, neutralized, and then mixed in equal amounts (hereinafter this mixture is sometimes referred to as “Cy5·Cy3 cDNA”). As the denaturing solution, 0.5 N NaOH or 1 N NaOH containing 50 mM EDTA can be used, and 0.5 N NaOH containing 50 mM EDTA is preferable. The Cy5·Cy3 cDNA is purified using a commercial kit such as Microcon-30 (Amicon) according to the attached method.

[0028] Hybridization of the Cy5·Cy3 cDNA with the probes printed on the DNA microarray is performed as follows. First, the DNA microarray is heat-treated to denature the probes; a hybridization solution containing Cy5·Cy3 cDNA that has been heat-treated at 100 °C for 2 minutes is then dropped onto the array, which is covered with a cover glass, placed in a sealed container, and hybridized. As for the hybridization conditions, hybridization is performed at 42 °C for 12 hours or more when the hybridization solution contains formamide, and at about 68 °C for 12 hours or more when it does not. After hybridization is complete, the fluorescence of Cy3 and Cy5 is scanned and acquired as image data with a device such as ScanArray 4000 (GSI Lumonics). Subsequently, by analyzing these image data with microarray data analysis software such as QuantArray software (GSI Lumonics), the fluorescence intensities of Cy3 and Cy5 for all probes can be obtained as text data.

[0029] The same results can be obtained even if synthetic DNA having a chain length effective for hybridization is used in place of the probes of the preferred embodiment of the present invention. For example, based on the gene names or sequence information disclosed in the present invention, it is also possible to use an array in which synthetic DNA of about 20 nucleotides or more, consisting of a part of the sequence, is immobilized as a probe on a glass substrate or the like.

[0030] Generally, data with low fluorescence intensity are strongly affected by background. Therefore, for example, only the 3,000 data points with the highest fluorescence intensity are kept, and the data for probes with lower fluorescence intensity are rejected and treated as missing values. Subsequently, the data are corrected and standardized for differences in detection sensitivity adjustment between Cy3 and Cy5 that may arise during scanning. Specifically, the ratio of the Cy3 and Cy5 fluorescence intensities (Cy3/Cy5) is calculated for each probe and converted to a base-2 logarithm (hereinafter “log(Cy3/Cy5)”), and the median of all log(Cy3/Cy5) values is subtracted from the log(Cy3/Cy5) value of each probe to obtain the standardized log(Cy3/Cy5) value. The standardized log(Cy3/Cy5) value can be used as the expression level of each gene.
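The standardization described above might be sketched as follows. This is only an illustrative sketch: the function name `standardize_log_ratios` is hypothetical, and ranking probes by the sum of the two channel intensities is an assumption, since the text does not state which intensity is used for the ranking.

```python
import math
from statistics import median


def standardize_log_ratios(cy3, cy5, keep_top=3000):
    """Return median-centred log2(Cy3/Cy5) per probe; None marks a missing value.

    cy3, cy5: lists of fluorescence intensities, one entry per probe.
    keep_top: number of highest-intensity probes to keep (3,000 in the text).
    """
    n = len(cy3)
    # Assumption: probes are ranked by the sum of both channel intensities.
    strength = [a + b for a, b in zip(cy3, cy5)]
    ranked = sorted(range(n), key=lambda i: strength[i], reverse=True)
    kept = set(ranked[:keep_top])

    # Low-intensity (or non-positive) probes become missing values.
    log_ratio = [
        math.log2(cy3[i] / cy5[i])
        if i in kept and cy3[i] > 0 and cy5[i] > 0
        else None
        for i in range(n)
    ]

    # Subtract the median of all available log ratios from each probe.
    centre = median(v for v in log_ratio if v is not None)
    return [None if v is None else v - centre for v in log_ratio]
```

For example, with four probes and `keep_top=3`, the weakest probe is returned as `None` and the remaining three values are centred on their median.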

[0031] The numerical data standardized in this way (hereinafter sometimes referred to as “standardized numerical data”) are integrated over all cases, and the following selection operation is performed to remove probe data containing many missing values from the subsequent analysis. That is, only probe data for which values were acquired in 128 or more cases, i.e., 85% or more of the 150 cases analyzed with the microarray, are selected. This selects only probe data containing 15% or fewer missing values. In addition, the following selection operation is applied to eliminate factors due to individual genetic background. That is, for each probe, the variance of the data for the 150 colon cancer primary lesions and the variance of the data for 12 normal colon mucosa samples are calculated, and only probe data for which the former is 1.1 times or more the latter are selected.
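A minimal sketch of this two-stage probe selection is given below. The function name `select_probes` and the dictionary layout are hypothetical; the use of the sample variance, and reading the two thresholds as "at least 128 cases" and "at least 1.1 times", are assumptions made for illustration.

```python
from statistics import variance


def select_probes(tumor, normal, min_cases=128, var_ratio=1.1):
    """Select probes with few missing values and tumor-specific variation.

    tumor, normal: dict mapping probe id -> list of standardized values
                   (None marks a missing value).
    min_cases: minimum number of tumor cases with data (128 of 150 in the text).
    var_ratio: required ratio of tumor variance to normal-mucosa variance.
    """
    selected = []
    for probe, values in tumor.items():
        observed = [v for v in values if v is not None]
        if len(observed) < min_cases:
            continue  # too many missing values
        normal_observed = [v for v in normal[probe] if v is not None]
        if len(observed) < 2 or len(normal_observed) < 2:
            continue  # a variance cannot be computed
        # Assumption: the sample variance is used for both groups.
        if variance(observed) >= var_ratio * variance(normal_observed):
            selected.append(probe)
    return selected
```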

[0032] Through a series of these selection operations, standardized numerical data of 2,121 types of probes are selected for subsequent analysis. The existence of missing values included in standardized numerical data selected in this way causes inconvenience in later statistical analysis and needs to be supplemented in some way.

[0033] Various methods can be applied for imputation. For example, a missing value can be replaced with the value obtained by adding the mean of all data for the case containing the missing value to the mean over all cases of the gene containing the missing value, and subtracting the mean of all gene data over all cases. In addition, Troyanskaya et al. (Bioinformatics, vol. 17, p. 520-525, 2001) show examples of imputation by three methods: the K-Nearest Neighbors (KNN) method, the Singular Value Decomposition (SVD) based method, and the row average method. By applying one of these methods, all missing values can be imputed.
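The first imputation rule described above (case mean plus gene mean minus overall mean) could be sketched as follows. The function name `impute_additive` is hypothetical, and computing every mean over the observed (non-missing) values only is an assumption.

```python
def impute_additive(matrix):
    """Fill missing values with case mean + gene mean - overall mean.

    matrix: list of gene rows, each a list of values over cases;
            None marks a missing value. Returns a new, completed matrix.
    """
    def mean(values):
        # Assumption: means are taken over observed values only.
        observed = [v for v in values if v is not None]
        return sum(observed) / len(observed)

    n_genes, n_cases = len(matrix), len(matrix[0])
    gene_mean = [mean(row) for row in matrix]
    case_mean = [
        mean([matrix[g][c] for g in range(n_genes)]) for c in range(n_cases)
    ]
    overall_mean = mean([v for row in matrix for v in row])

    return [
        [
            case_mean[c] + gene_mean[g] - overall_mean if v is None else v
            for c, v in enumerate(row)
        ]
        for g, row in enumerate(matrix)
    ]
```

For the 2 x 2 matrix `[[1.0, 2.0], [3.0, None]]`, the gene mean is 3.0, the case mean is 2.0, and the overall mean is 2.0, so the missing entry is filled with 3.0.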

[0034] The standardized numerical data with missing values imputed, prepared in this way (hereinafter also referred to as “standardized gene expression data”), are not affected by background, contain no errors due to differences in detection sensitivity between Cy3 and Cy5, contain no missing values, and carry gene expression information in which the range of variation in gene expression in colorectal cancer primary lesions relative to normal colon mucosa exceeds the range of variation attributable to individual differences. This ensures the reliability of the subsequent statistical analysis.

[0035] Many researchers are still working by trial and error on methods for statistically processing the large amount of gene expression data obtained by DNA microarray measurement and deriving a gene set that meets the purpose, and considerable ingenuity is required. In the present invention, analysis is first performed by four different approaches, and for each of these, a gene group that can be used to predict the presence or absence of lymph node metastasis is identified.

[0036] The four approaches taken in the present invention are:

(a) Support Vector Machine (SVM) (Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001),

(b) Extension method of Principal Component Analysis / Artificial Neural Network (PCA/ANN) (Khan et al., Nature Medicine, vol. 7, p. 673-679, 2001),

(c) Hierarchical Cluster Analysis (HCA) + Stepwise Logistic Discrimination, and

(d) Classification And Regression Tree (CART) (Breiman et al., Classification and Regression Trees, Wadsworth, 1983) + Logistic Discrimination.

When identifying the gene groups, the data are divided into two groups, one for identifying the predictive genes and one for evaluation, to ensure statistical reliability. More specifically, the data of the 150 cases are divided into 99 cases (42 with lymph node metastasis and 57 without) and 51 cases (21 with lymph node metastasis and 30 without). The data of the former 99 cases are used to identify genes and establish discriminants for predicting the presence or absence of lymph node metastasis, and the discriminants are evaluated by applying them to the data of the latter 51 cases. In the following description, the former 99 cases used for identifying the predictive genes and establishing the discriminant are sometimes referred to as “training data”, and the latter 51 cases used for evaluating the discriminant as “test data”.

[0037] For approaches (a), (c), and (d) above, the division of the data is performed randomly 100 times to take into account sampling variation at the time of division, a gene group for the determination is identified in each of the 100 analyses, and genes are adopted on the basis of the number of times they are identified.
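The 99/51 division and its 100 random repetitions could be sketched as below. The function name `stratified_split`, the 0/1 label coding, and the use of a seeded pseudo-random generator are assumptions for illustration; the text does not describe how the random division was implemented.

```python
import random


def stratified_split(labels, n_pos_train=42, n_neg_train=57, seed=0):
    """Randomly split case indices into training and test sets.

    labels: list with 1 for lymph node metastasis present, 0 for absent.
    Returns (train_indices, test_indices), each sorted.
    """
    rng = random.Random(seed)
    positive = [i for i, y in enumerate(labels) if y == 1]
    negative = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(positive)
    rng.shuffle(negative)
    train = sorted(positive[:n_pos_train] + negative[:n_neg_train])
    test = sorted(positive[n_pos_train:] + negative[n_neg_train:])
    return train, test


# 63 metastasis-positive and 87 metastasis-negative cases, as in the text.
labels = [1] * 63 + [0] * 87

# One division per seed gives the 100 random divisions.
splits = [stratified_split(labels, seed=s) for s in range(100)]
```

Each element of `splits` holds 99 training indices (42 positive, 57 negative) and 51 test indices (21 positive, 30 negative).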

[0038] On the other hand, approach (b) above is carried out only for the first two-group division, in view of the huge amount of calculation involved. Instead, the 99 training cases are randomly divided in two at a ratio of 2:1 a total of 1,250 times, and principal component analysis and neural network training are repeated using these divisions. After training, the genes are ranked by their sensitivity for identifying the presence or absence of lymph node metastasis. Starting with 2,121 genes, training is continued while narrowing the genes down to 1,536, 768, 384, 192, 96, 48, and 24.

[0039] The above analysis gives the following numbers of genes in each gene set and correct classification rates for the test data using the established discriminants, where the correct classification rate is the average of (number of cases in which the result of the discriminant matches the result of histopathological examination) / (number of test data) × 100 (%). For (a), the number of genes is 144 and the correct classification rate is 80.2% (standard deviation 5.6%); for (b), the number of genes is 192 and the correct classification rate is 90.2%; for (c), the number of genes is 133 and the correct classification rate is 78.6% (standard deviation 6.2%); and for (d), the number of genes is 138 and the correct classification rate is 86.3% (standard deviation 4.5%). Sixteen genes are included in common in the gene sets selected by all four approaches.
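The correct classification rate defined above, and the selection of genes common to all approaches, amount to the following simple operations. The function names are hypothetical and the gene identifiers in the usage note are toy values, not data from the specification.

```python
def correct_classification_rate(predicted, observed):
    """Percentage of cases where the discriminant agrees with histopathology."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def common_genes(gene_sets):
    """Genes selected by every analysis method (set intersection)."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets)
```

For instance, `common_genes([{"A", "B", "C"}, {"B", "C"}, {"B", "C", "D"}, {"C", "B"}])` keeps only the genes that appear in all four toy sets.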

[0040] Next, in order to narrow down the number of genes used for prediction without lowering the correct classification rate, the target genes are first restricted to the above 16 genes, and a statistical analysis is performed that considers, in addition to the individual contribution of each of these 16 genes (hereinafter referred to as the “main effect”), the interaction between two genes. This makes it possible to search for a discrimination rule over a wider range that includes not only the main effects of individual genes but also interactions between genes, and high discrimination performance is expected to be maintained.

[0041] To search for interactions, CART analysis is performed again, with the presence or absence of lymph node metastasis as the response, on each of the 100 sets of training data used in the above analysis. For this CART analysis, rpart of the free software R was used, with default values for all operating parameters. In this analysis, 3 to 5 genes per analysis appear as variables that direct the splitting of the data.

[0042] When three genes appear, for example, there are three possible gene pairs, and all of these pairs are regarded as interactions. Similarly, 6 pairs are regarded as interactions when 4 genes appear, and 10 pairs when 5 genes appear. To select as many candidates as possible, the 18 gene pairs that appear 12 times or more in the 100 analyses are selected as interaction candidates.
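The pair counting described in [0041] and [0042] can be sketched as follows. This is a minimal illustration in Python, not the R/rpart workflow named in the text; the function name `count_pairs` and the toy gene lists are hypothetical, and only the pair counts for 3, 4 and 5 genes and the threshold of 12 appearances come from the description.

```python
from collections import Counter
from itertools import combinations
from math import comb

def count_pairs(split_genes_per_run):
    """Count how often each unordered gene pair co-occurs across CART runs.

    split_genes_per_run: one collection of gene IDs per run, holding the genes
    that appeared as splitting variables in that run's tree (hypothetical input).
    """
    counts = Counter()
    for genes in split_genes_per_run:
        # Every unordered pair of splitting genes is treated as a candidate interaction.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts

# Number of pairs for 3, 4 and 5 splitting genes, as stated in the text.
pairs_per_tree = {n: comb(n, 2) for n in (3, 4, 5)}

# Toy data (not from the source): 3 runs instead of 100.
toy_runs = [
    ["G2645", "G3177", "G1592"],
    ["G3753", "G3826", "G2645", "G3177"],
    ["G2645", "G3177", "G3826"],
]
toy_counts = count_pairs(toy_runs)

MIN_APPEARANCES = 12  # threshold from the text, defined for 100 runs
candidates = [p for p, c in toy_counts.items() if c >= MIN_APPEARANCES]
```

With only three toy runs no pair reaches the threshold, so `candidates` is empty; with 100 real runs the same filter would yield the interaction candidates.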

[0043] Next, to establish a discriminant, stepwise variable selection is performed on the data for all 150 cases in a logistic regression model with the main effects of the 16 genes and the 18 interactions as explanatory variables and the presence or absence of lymph node metastasis as the response. The P value of the significance test of each regression coefficient is used as the inclusion criterion (less than 0.05) and the exclusion criterion (greater than 0.05). This selects six variables: G1592, G3031, G3826, G4370, the interaction between G2645 and G3177, and the interaction between G3753 and G3826. The discriminant for predicting the presence or absence of lymph node metastasis is

D = 0.2307
    - 2.7132 × (expression level of G1592)
    + 8.9509 × (expression level of G3031)
    + 8.7975 × (expression level of G3826)
    - 2.3098 × (expression level of G4370)
    + 3.5126 × (expression level of G2645) × (expression level of G3177)
    - 8.8226 × (expression level of G3753) × (expression level of G3826)

Lymph node metastasis is estimated to be present when D > 0 and absent when D ≤ 0. The seven genes appearing in this discriminant, namely G1592, G2645, G3031, G3177, G3753, G3826, and G4370, are selected as the set of genes that contribute to identifying the presence or absence of lymph node metastasis. Their gene names are listed in Table 1.
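As an illustration, the discriminant and the decision rule above could be evaluated as in the following sketch. The coefficients are those given in the text; the function names, the dictionary layout, and the all-zero input are assumptions introduced only for this example.

```python
# Coefficients copied from the discriminant in the text.
INTERCEPT = 0.2307
MAIN_EFFECTS = {"G1592": -2.7132, "G3031": 8.9509, "G3826": 8.7975, "G4370": -2.3098}
INTERACTIONS = {("G2645", "G3177"): 3.5126, ("G3753", "G3826"): -8.8226}

def discriminant(expr):
    """Return D for a dict mapping gene ID -> standardized expression level."""
    d = INTERCEPT
    for gene, coef in MAIN_EFFECTS.items():
        d += coef * expr[gene]
    for (g1, g2), coef in INTERACTIONS.items():
        d += coef * expr[g1] * expr[g2]
    return d

def predict_metastasis(expr):
    """True if lymph node metastasis is predicted (D > 0), False if D <= 0."""
    return discriminant(expr) > 0

# Hypothetical input: all seven expression levels equal to 0, so D is the intercept.
GENES = ("G1592", "G2645", "G3031", "G3177", "G3753", "G3826", "G4370")
baseline = {g: 0.0 for g in GENES}
```

Because G3826 enters both as a main effect and through its interaction with G3753, the contribution of G3826 to D depends on the expression level of G3753.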

[0044] Finally, the performance of the selected gene set in discriminating lymph node metastasis is evaluated by the leave-one-out (LOO) method. That is, the logistic discriminant containing the above six variables is estimated from the data for the 149 samples that remain after one sample is excluded, the excluded sample is classified with that discriminant, and this operation is carried out for each of the 150 samples. As a result, as shown in Table 2, the correct classification rate for the selected gene set is estimated to be 88.7% (sensitivity: 77.8%, specificity: 96.6%). As described above, the present invention makes it possible to identify a gene set necessary for predicting the presence or absence of lymph node metastasis of colorectal cancer with high accuracy.
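The leave-one-out procedure described above can be sketched generically as follows. The loop is the technique named in the text; the classifier plugged into it here is a deliberately simple stand-in (a midpoint threshold on a single score), not the six-variable logistic discriminant, and all names and toy values are hypothetical.

```python
def leave_one_out(samples, labels, fit, predict):
    """Leave-one-out evaluation: refit on n-1 samples, classify the held-out one.

    fit(train_x, train_y) -> model and predict(model, x) -> label are supplied
    by the caller. Returns the correct classification rate in percent.
    """
    n = len(samples)
    correct = 0
    for i in range(n):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / n

# Stand-in classifier (an assumption, NOT the logistic model of the text):
# threshold a single value at the midpoint between the two class means.
def fit_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0

# Toy data (hypothetical): one score per sample, label 1 = metastasis present.
toy_x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
toy_y = [0, 0, 0, 1, 1, 1]
loo_rate = leave_one_out(toy_x, toy_y, fit_midpoint, predict_midpoint)
```

Passing the model as a pair of functions keeps the evaluation loop independent of how the discriminant is estimated, so a logistic regression fit could be substituted without changing `leave_one_out`.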

[0045] Hereinafter, the present invention will be described in detail with reference to examples, but the present invention is not limited to these examples. The reagents used in the examples were purchased from Nacalai Tesque, Inc. unless otherwise specified.

Example 1

[0046] (1) Preparation of total RNA from colorectal tissue

As samples for gene expression analysis of colorectal cancer using a DNA microarray, 150 primary colorectal cancer tissue specimens, collected with informed consent at the time of colorectal cancer surgery, were used. These consisted of 63 cases (hereinafter referred to as “lymph node metastasis-positive cases”) derived from patients in whom lymph node metastasis was found by histopathological examination at the time of resection of the primary lesion, and 87 cases (hereinafter referred to as “lymph node metastasis-negative cases”) derived from patients in whom no lymph node metastasis was found. Total RNA was extracted from these colorectal cancer tissue samples using TRIzol reagent (purchased from GIBCO BRL). The extraction procedure basically followed the manual attached to the reagent. In addition, total RNA was extracted from 40 cases of normal colon mucosa and mixed to obtain standard normal colon mucosa total RNA for use in all experiments. The concentrations of these RNA samples were calculated in the usual manner from the absorbance at a wavelength of 260 nm measured with a spectrophotometer.
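The text states only that concentrations were calculated in the usual manner from the absorbance at 260 nm. A common convention, assumed here and not stated in the source, is that an absorbance of 1.0 at 260 nm in a 1 cm light path corresponds to roughly 40 μg/mL of RNA. Under that assumption the calculation could look like this; the function names and the example reading are hypothetical.

```python
# Common convention for RNA in a 1 cm path (an assumption, not taken from the text).
RNA_UG_PER_ML_PER_A260 = 40.0

def rna_concentration_ug_per_ml(a260, dilution_factor=1.0):
    """Estimate the RNA concentration of the undiluted sample in micrograms per mL."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor

def volume_for_mass_ul(mass_ug, concentration_ug_per_ml):
    """Volume in microlitres that contains the requested mass of RNA."""
    return 1000.0 * mass_ug / concentration_ug_per_ml

# Hypothetical reading: A260 = 1.0 measured on a 1:100 dilution.
conc = rna_concentration_ug_per_ml(1.0, dilution_factor=100.0)
# Volume needed to obtain 25 micrograms of total RNA at that concentration.
vol_25ug = volume_for_mass_ul(25.0, conc)
```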

[0047] (2) Preparation of fluorescent label target

The fluorescent label target to be hybridized to the DNA microarray was prepared by the following procedure. First, 25 μg of total RNA derived from a colorectal cancer sample (hereinafter referred to as “colon cancer RNA”) and 25 μg of standard normal colon mucosa total RNA (hereinafter referred to as “standard colon mucosa RNA”) were placed in separate tubes. To each tube, 2 μg of an oligo dT primer 18 nucleotides in length was added, the volume was brought to 14 μL with sterilized distilled water, and the mixture was heated at 70 °C for 10 minutes and then immediately transferred to ice and rapidly cooled. Then, 6 μL of 5× First Strand Buffer, 3 μL of 0.1 M DTT, 1.5 μL of 20× dNTP mix (a mixture of 10 mM dATP, dCTP and dGTP and 6 mM dTTP) and 0.5 μL of RNAguard were added to each tube.

[0048] To the tube containing colon cancer RNA, 3 μL of dUTP labeled with the fluorescent dye Cy3 (hereinafter referred to as “Cy3-dUTP”; concentration 1 mM) was added, and to the tube containing standard colon mucosa RNA, 3 μL of dUTP labeled with Cy5 (hereinafter referred to as “Cy5-dUTP”; concentration 1 mM) was added, and the tubes were incubated at 42 °C for 2 minutes. Then 2 μL of the reverse transcriptase SuperScript II was added to each tube, and the labeling reaction was carried out by incubating at 42 °C for a further hour. In this reaction, Cy3-dUTP and Cy5-dUTP are incorporated as cDNA is synthesized using colon cancer RNA and standard colon mucosa RNA, respectively, as the template, so that a colon cancer label target fluorescently labeled with Cy3 and a standard colon mucosa label target fluorescently labeled with Cy5 are generated.

[0049] The 5× First Strand Buffer, 0.1 M DTT and SuperScript II used in this reaction were all purchased from GIBCO BRL. dATP, dCTP, dGTP, dTTP, Cy5-dUTP, Cy3-dUTP and RNAguard were all purchased from Amersham Biosciences. After the reaction, 5 μL of denaturing solution (0.5 N NaOH, 50 mM EDTA) was added to each tube, the tubes were heated at 70 °C for 10 minutes, and the mixture was then neutralized by adding 7.5 μL of 1 M Tris-HCl (pH 7.5). At this stage, the colon cancer label target and the standard colon mucosa label target were mixed, and 10 μg of human COT-1 DNA (purchased from GIBCO BRL) was added. TE buffer was added to this mixture to bring the volume to 500 μL, and the mixture was purified and concentrated using Microcon-30 (purchased from Amicon) to remove unreacted Cy5-dUTP and Cy3-dUTP. The purification and concentration procedure followed the manual attached to Microcon-30. Finally, the mixture was concentrated to a total volume of 5 μL and used as the label target to be hybridized to the DNA microarray.

[0050] (3) DNA microarray pretreatment

The DNA microarray was masked by immersing it for 5 minutes in a masking solution (3 g of succinic anhydride, 190 mL of N-methyl-2-pyrrolidone and 21 mL of 0.2 M sodium borate), and the cDNA printed on the microarray was then heat-denatured by immersing the microarray in distilled water at 95 °C for 3 minutes. Immediately afterwards, the microarray was immersed in ethanol of 95% or higher concentration for 1 minute to dehydrate it, and then air-dried.

[0051] (4) Hybridization of label target and DNA microarray

To 5 μL of the label target solution prepared as described above were added 2.5 μL of 10 mg/mL polyadenine (purchased from Roche), 0.5 μL of 10% SDS solution, 3 μL of a 20× solution (0.4% BSA and 1% SDS), 15 μL of formamide, 3 μL of 20× SSC (3 M sodium chloride, 0.3 M sodium citrate, pH 7.0) and 1 μL of sterile distilled water. The mixture was heated at 100 °C for 2 minutes and then left to stand at room temperature in the dark for about 30 minutes. It was then dropped onto the cDNA of a DNA microarray pretreated by the method described in the preceding section and covered with a 24 × 40 mm cover glass (purchased from Matsunami Glass Industry). The microarray was placed in a sealed container, and the label target was hybridized to the cDNA on the microarray by keeping the container in a 42 °C incubator for about 16 hours. After hybridization, the microarray was washed by immersion in 2× SSC containing 0.1% SDS for 10 minutes and then in 0.1× SSC containing 0.1% SDS for 10 minutes. It was then washed twice for 5 minutes each in 0.1× SSC, drained, and air-dried in the dark.

[0052] (5) Microarray scan data analysis

After washing and air-drying, the microarrays were scanned with ScanArray 4000 (GSI Lumonics), a confocal laser scanner dedicated to microarrays. The fluorescence of Cy3 and Cy5 was scanned independently, and the fluorescence patterns of Cy3 and Cy5 derived from the colon cancer target and the standard colon mucosa target hybridized to each probe on the microarray were obtained as 16-bit TIFF scanned image data. These image data were then analyzed with QuantArray software (GSI Lumonics), analysis software dedicated to microarray data, to obtain the fluorescence intensities of Cy3 and Cy5 for all probes as numerical values in text format. To correct for background, the fluorescence intensity value of a region where no cDNA was printed was subtracted from the fluorescence intensity value of each probe. In addition, because data with low fluorescence intensity values are strongly affected by experimental error, only data points with high fluorescence intensity values (approximately 3,000) were retained and the other data were rejected. For each probe, the ratio of the fluorescence intensity values of Cy3 and Cy5, that is, Cy3/Cy5, was calculated and converted to a base-2 logarithm (hereinafter referred to as “log2(Cy3/Cy5)”). To correct for differences in the detection sensitivity settings for Cy3 and Cy5 that may arise during scanning, and to standardize the data, the median of all log2(Cy3/Cy5) values was subtracted from the log2(Cy3/Cy5) value of each probe to obtain the standardized log2(Cy3/Cy5) value.
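The background correction, log transformation and median centering described above amount to the following arithmetic. This is a minimal sketch: the function name, the use of a single background value per channel, and the toy intensities are assumptions, the low-intensity rejection step is omitted, and background-corrected intensities are assumed to be positive.

```python
import math
import statistics

def standardized_log_ratios(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Background-correct, take log2(Cy3/Cy5) per probe, and subtract the median."""
    log_ratios = []
    for c3, c5 in zip(cy3, cy5):
        c3_corr = c3 - background_cy3  # background subtracted from the probe value
        c5_corr = c5 - background_cy5
        log_ratios.append(math.log2(c3_corr / c5_corr))
    median = statistics.median(log_ratios)
    # Centering on the median corrects a global Cy3/Cy5 sensitivity offset.
    return [v - median for v in log_ratios]

# Toy intensities (hypothetical): corrected ratios of 2, 1 and 0.5.
toy = standardized_log_ratios(
    [8100.0, 4100.0, 2100.0],
    [4100.0, 4100.0, 4100.0],
    background_cy3=100.0,
    background_cy5=100.0,
)
```

After centering, a value of 0 means that the probe's Cy3/Cy5 ratio equals the median ratio on that array, and a value of 1 means a twofold higher ratio.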

[0053] By the above operations, standardized numerical data were obtained for the relative expression intensity, with standard colon mucosa RNA as the reference, in the primary colorectal cancer lesions of 63 cases with lymph node metastasis and 87 cases without lymph node metastasis. In the same manner, numerical data for 12 normal colon mucosa samples were obtained using standard colon mucosa RNA as the reference. Of these numerical data, those for a total of 2,121 probes were used for statistical analysis: probes for which data were obtained in 128 or more cases, that is, 85% of the 150 primary colorectal cancer lesions analyzed, and for which the variance within the data for the 150 primary colorectal cancer lesions was at least 1.1 times the variance within the data for the 12 normal colon mucosa samples.
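The two probe selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the function name, the use of `None` for rejected measurements, and the choice of the sample variance estimator are assumptions; the thresholds of 85% and 1.1 are from the text.

```python
import math
import statistics

def keep_probe(tumor_values, normal_values, min_fraction=0.85, min_variance_ratio=1.1):
    """Apply the two probe filters described in the text to one probe.

    tumor_values / normal_values: expression values for one probe, with None
    marking a rejected (missing) measurement.
    """
    tumor = [v for v in tumor_values if v is not None]
    normal = [v for v in normal_values if v is not None]
    required = math.ceil(min_fraction * len(tumor_values))
    if len(tumor) < required or len(tumor) < 2 or len(normal) < 2:
        return False
    # Sample variance is an assumption; the text does not name the estimator.
    return statistics.variance(tumor) >= min_variance_ratio * statistics.variance(normal)

# 85% of 150 cases rounds up to the 128 cases quoted in the text.
required_cases = math.ceil(0.85 * 150)
```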

[0054] In the data for these 2,121 probes, some values had been rejected because the expression level was low and fell below the cut-off value, and these were treated as missing values. There were 8,816 such missing values in total. Since the total number of data points is 150 × 2,121 = 318,150, missing values account for about 2.8%. Because missing values cause difficulties in the subsequent statistical analyses, they were filled in using the k-Nearest Neighbor method. Specifically, in the data matrix of 150 cases × 2,121 probes, the distances between all pairs of samples were calculated. Then, for each missing value, the average expression level of the gene in question across the eight samples closest to the sample with the missing value was calculated and used as the imputed value. The number of nearest samples was determined by increasing the number sequentially and choosing the number that minimized the root mean square (RMS). All the numerical data obtained by filling in the missing values in this way are hereinafter referred to as standardized gene expression data.
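A sample-based nearest-neighbour imputation of the kind described above might be sketched as follows. The choice of distance (root mean square difference over probes observed in both samples), the function name and the toy matrix are assumptions; the value k = 8 and the counts 8,816 and 150 × 2,121 are from the text.

```python
import math

def impute_knn(matrix, k=8):
    """Fill None entries of a samples-by-probes matrix from the k nearest samples."""
    def distance(a, b):
        # Root-mean-square difference over probes observed in both samples
        # (an assumption; the text does not specify the distance measure).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Rank the other samples that have a measurement for probe j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Missing-value fraction quoted in the text: 8,816 of 150 x 2,121 entries.
total_entries = 150 * 2121
missing_percent = 100.0 * 8816 / total_entries

# Toy matrix (hypothetical): 4 samples x 3 probes, imputed with k = 2.
toy = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [1.1, 2.1, 5.0], [9.0, 9.0, 100.0]]
toy_filled = impute_knn(toy, k=2)
```

In the toy matrix the first sample is closest to the second and third samples, so its missing value is replaced by the mean of their values for that probe and the distant fourth sample is ignored.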

Example 2

[0055] Identification of a high-performance gene set for predicting lymph node metastasis by analysis of the standardized gene expression data

In this example, a probe printed on a DNA microarray may be referred to as a gene.

As an initial attempt, the following four approaches were applied to identify the gene set for determination:
(a) Support Vector Machine (SVM) (Hastie et al., The Elements of Statistical Learning - Data Mining, Inference, and Prediction, Springer, 2001);
(b) Principal Component Analysis / artificial Neural Network (PCA/aNN) (Khan et al., Nature Medicine, vol. 7, p. 673-679);
(c) Hierarchical Cluster Analysis (HCA) + Stepwise Logistic Discrimination; and
(d) Classification And Regression Tree (CART) (Breiman et al., Classification and Regression Trees, Wadsworth, 1983) + Logistic Discrimination.

[0056] To ensure statistical reliability in identifying the gene group, all data were divided into two groups, one for identifying the determination genes and one for evaluation. More specifically, the data for the 150 cases were divided into a group of 99 cases (42 with lymph node metastasis and 57 without) and a group of 51 cases (21 with lymph node metastasis and 30 without). The data for the former 99 cases were used to identify genes and establish a discriminant for determining the presence or absence of lymph node metastasis, and the gene identification and the discriminant were then evaluated by classifying the data for the latter 51 cases. In the following description, data used to identify the determination genes and establish the discriminant, like the former 99 cases, may be referred to as “training data”, and data used to evaluate the discriminant, like the latter 51 cases, may be referred to as “test data”.
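The division into training and test data, repeated to give 100 different divisions as used in the analysis, could be generated as in the sketch below. The class sizes are from the text; the case identifiers, the function name and the fixed random seed are hypothetical, and drawing the divisions uniformly at random within each class is an assumption.

```python
import random

def stratified_split(positive_ids, negative_ids, n_pos_train, n_neg_train, rng):
    """Randomly split cases into training and test sets, separately per class."""
    pos = list(positive_ids)
    neg = list(negative_ids)
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:n_pos_train] + neg[:n_neg_train]
    test = pos[n_pos_train:] + neg[n_neg_train:]
    return train, test

# Hypothetical case IDs: 63 metastasis-positive and 87 metastasis-negative cases.
positives = [f"P{i:03d}" for i in range(63)]
negatives = [f"N{i:03d}" for i in range(87)]

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
# 100 divisions, each with 42 + 57 training cases and 21 + 30 test cases.
splits = [stratified_split(positives, negatives, 42, 57, rng) for _ in range(100)]
```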

[0057] For approaches (a), (c), and (d), the above division of the data into two groups was carried out 100 times, determination genes were identified in each of the 100 different analyses, and the genes that were identified many times were then selected. Approach (b), on the other hand, was carried out only for the first two divisions. Instead, the data for the 99 training cases were randomly divided into two parts at a ratio of 2:1 a total of 1,250 times, and principal component analysis and neural network training were repeated on these divisions. After training, the genes were ranked by their sensitivity for identifying the presence or absence of lymph node metastasis and narrowed down. Starting from 2,121 genes, training proceeded as the set was narrowed to 1,536, 768, 384, 192, 96, 48, and 24 genes.
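The resampling scheme and the narrowing schedule for approach (b) can be written down compactly. The sketch below only generates the 2:1 partitions and the sequence of gene counts; the principal component analysis, neural network training and sensitivity ranking are not shown, and the names, case indices and seed are hypothetical.

```python
import random

# Gene counts from the text: 2,121 genes, then 1,536 halved repeatedly down to 24.
schedule = [2121] + [1536 // 2 ** i for i in range(7)]

def split_two_to_one(case_ids, rng):
    """Randomly split cases at a ratio of 2:1 (for 99 cases: 66 and 33)."""
    ids = list(case_ids)
    rng.shuffle(ids)
    cut = (2 * len(ids)) // 3
    return ids[:cut], ids[cut:]

rng = random.Random(0)  # fixed seed only for repeatability of the sketch
training_cases = list(range(99))  # hypothetical indices of the 99 training cases
partitions = [split_two_to_one(training_cases, rng) for _ in range(1250)]
```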

[0058] As a result of the above analysis, a determination gene set and a discriminant were established for each approach. The number of genes included in each gene set and the correct classification rate of the test data under the established discriminant (the average of: number of cases in which the discriminant result agrees with the histopathological examination result / number of test data × 100 (%)) were as follows: for (a), 144 genes and a correct classification rate of 80.2% (standard deviation: 5.6%); for (b), 192 genes and a correct classification rate of 90.2%; for (c), 133 genes and a correct classification rate of 78.6% (standard deviation: 6.2%); and for (d), 138 genes and a correct classification rate of 86.3% (standard deviation: 4.5%). Sixteen genes were common to the gene sets selected by all four approaches.

[0059] As described above, each of the four different approaches achieved a high correct classification rate. However, the number of genes used for discrimination exceeds 100 in every case, which is considered too many for practical use as an assay method.

[0060] We therefore set out to establish a new discrimination rule that narrows down the number of genes used for determination without lowering the correct classification rate. To that end, we first selected the 16 genes that were common to the gene sets selected by each of the above approaches. Next, in addition to the contribution of each of these 16 genes (hereinafter referred to as “main effect”), we took into account the interaction between two genes. By searching for discrimination rules over a wider range that includes interactions between genes as well as the main effects of individual genes, high discrimination performance was expected to be maintained.

[0061] To search for interactions, CART analysis was performed again, with the presence or absence of lymph node metastasis as the response, on each of the 100 sets of training data used in the above analysis. As a result, 3 to 5 genes appeared per analysis as variables directing the data splits. When three genes appeared, for example, there were three possible gene pairs, and all of these pairs were regarded as interactions. In the same way, 6 pairs were regarded as interactions for 4 genes and 10 pairs for 5 genes. The 18 gene pairs that appeared 12 times or more in the 100 analyses were then selected as interaction candidates.

[0062] Next, to establish a discriminant, stepwise variable selection was performed on the data for all 150 cases in a logistic regression model with the main effects of the 16 genes and the 18 interactions as explanatory variables and the presence or absence of lymph node metastasis as the response. The p-value of the significance test of each regression coefficient was used as the inclusion criterion (less than 0.05) and the exclusion criterion (greater than 0.05). As a result, six variables were selected: G1592, G3031, G3826, G4370, the interaction between G2645 and G3177, and the interaction between G3753 and G3826. G1592, G3031, G3826, G4370, G2645, G3177 and G3753 are serial numbers assigned to each probe (gene) on the ColonoChip used in the present invention. The discriminant for determining the presence or absence of lymph node metastasis is

D = 0.2307
    - 2.7132 × (expression level of G1592)
    + 8.9509 × (expression level of G3031)
    + 8.7975 × (expression level of G3826)
    - 2.3098 × (expression level of G4370)
    + 3.5126 × (expression level of G2645) × (expression level of G3177)
    - 8.8226 × (expression level of G3753) × (expression level of G3826)

Lymph node metastasis was estimated to be present when D > 0 and absent when D ≤ 0.

[0063] The seven genes that appeared in this discriminant, namely G1592, G2645, G3031, G3177, G3753, G3826, and G4370, were selected as the set of genes that contribute to identifying the presence or absence of lymph node metastasis. Their gene names are listed in Table 1. Table 1 also shows the accession number in the RefSeq database for each of the above genes. The RefSeq database can be accessed from the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/ReSeq/index.html).

[0064] [Table 1]

[0065] Finally, the performance of the selected gene set in discriminating lymph node metastasis was evaluated by the leave-one-out (LOO) method. That is, the logistic discriminant containing the above six variables was estimated from the data for the 149 samples remaining after one sample was excluded, the excluded sample was classified with that discriminant, and this operation was carried out for each of the 150 samples. The results are shown in Table 2. From Table 2, the correct classification rate for the selected gene set was estimated to be 88.7% (sensitivity: 77.8%, specificity: 96.6%).
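The three percentages quoted above are mutually consistent with the 63 lymph node metastasis-positive and 87 lymph node metastasis-negative cases described in Example 1. The sketch below checks this arithmetic. The counts of 49 true positives and 84 true negatives are not stated in the text (Table 2 is not reproduced here); they are inferred only as the integers that reproduce the quoted percentages, and the function name is hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, correct classification rate) in percent."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts (inferred, not from the source): 63 positive and 87 negative cases.
sens, spec, acc = rates(tp=49, fn=14, tn=84, fp=3)
summary = f"{sens:.1f} {spec:.1f} {acc:.1f}"
```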

[0066] [Table 2]

Industrial applicability

[0067] The determination of lymph node metastasis made possible by the present invention allows a better treatment policy to be selected for each case and can be expected to have a medical economic effect. For example, prognosis can be improved by aggressive treatment of patients with a high possibility of lymph node metastasis, while for cases with a low possibility of lymph node metastasis, postoperative adjuvant therapy can be made milder, reducing the physical and economic burden on the patient.

[0068] Furthermore, since the individual genes included in the gene set disclosed in the present invention may themselves function as causes of lymph node metastasis, the development of targeted drugs that use these genes and their expression products to directly suppress lymph node metastasis can also be expected.

Claims

[1] A method of selecting a gene set for predicting the presence or absence of lymph node metastasis of colorectal cancer, comprising the following steps (1) to (4):

(4) A step of performing variable selection in a logistic regression model in which the common genes and the gene combinations are used as explanatory variables and the presence or absence of lymph node metastasis is used as the response.

[2] The method of claim 1, wherein the analysis methods of (1) include at least one selected from the group consisting of (a) Support Vector Machine, (b) Principal Component Analysis, (c) Hierarchical Cluster Analysis and Stepwise Logistic Discrimination, and (d) Classification And Regression Tree and Logistic Discrimination.

[3] The method according to claim 1 or 2, wherein the analysis method of (3) is a combination of (d) Classification And Regression Tree and Logistic Discrimination.

[4] The method according to any one of claims 1 to 3, wherein the variable selection method in (4) is a stepwise variable selection method.

[5] A gene set for predicting the presence or absence of colorectal cancer lymph node metastasis, which is selected by the method according to any one of claims 1 to 4.

[6] At least NM_003404 (G1592), NM_002128 (G2645), NM_052868 (G3031), NM_00 50. (G3177), NM_001540 (G3753), NM_005722 (G3826), and NM_015315 (G4370), which contains the genes represented by these database accession numbers (serial numbers). The gene set described in.

[7] A method for predicting the presence or absence of lymph node metastasis of colorectal cancer, wherein the gene set according to claim 5 or 6 is used.

[8] The method of claim 7, wherein the following discriminant is used:

D = 0.2307
    - 2.7132 × (expression level of NM_003404 (G1592))
    + 8.9509 × (expression level of NM_052868 (G3031))
    + 8.7975 × (expression level of NM_005722 (G3826))
    - 2.3098 × (expression level of NM_015315 (G4370))
    + 3.5126 × (expression level of NM_002128 (G2645)) × (expression level of NM_00 50. (G3177))
    - 8.8226 × (expression level of NM_001540 (G3753)) × (expression level of NM_005722 (G3826))

(When D > 0, lymph node metastasis is determined to be present, and when D ≤ 0, lymph node metastasis is determined to be absent.)