US20110152110A1

US20110152110A1 - Set of Tumour-Markers

Info

Publication number: US20110152110A1
Application number: US12/675,736
Authority: US
Inventors: Klemens Vierlinger; Martin Lauss; Albert Kriegner; Christa Noehammer
Original assignee: AIT Austrian Institute of Technology GmbH
Current assignee: AIT Austrian Institute of Technology GmbH
Priority date: 2007-08-30
Filing date: 2008-08-29
Publication date: 2011-06-23
Also published as: EP2191272A2; WO2009026605A2; US20150252437A1; CN106483290A; AT505726A2; ES2429299T3; CN106483290B; WO2009026605A3; EP2191272B1; CN101821628A; US10100364B2

Abstract

The present invention provides a set of moieties specific for tumor markers, in particular of follicular thyroid carcinoma (FTC) and papillary thyroid carcinoma (PTC) as well as a method for identifying markers of any genetic disease.

Description

The present invention relates to the field of cancer diagnosis and diagnostic means therefor.
Thyroid nodules are endemic in iodine-deficient areas, like Europe's Alpine regions, where they have a prevalence of 10-20%. They are classified by their histology into the 2 benign types Struma nodosa (SN) and Follicular Thyroid Adenoma (FTA) and the malignant entities Follicular Thyroid Carcinoma (FTC), Papillary Thyroid Carcinoma (PTC), Medullary Thyroid Carcinoma (MTC) and Anaplastic Thyroid Carcinoma (ATC). Conventionally, discrimination between benign and malignant thyroid nodules is done by scintigraphy and fine needle aspiration followed by histology. Despite many advances in the diagnosis and therapy of thyroid nodules and thyroid cancer, these methods have a well-known lack of specificity, particularly for the discrimination between FTA and FTC, which leads to a number of patients being unnecessarily treated for malignant disease.
Given the diagnostic limitations of previous methods, in particular fine needle aspiration followed by cytology, multiple investigators have carried out expression profiling studies with hopes of identifying new diagnostic tools. Such analyses attempt to identify differentially expressed genes with an important role in disease development or progression using large-scale transcript-level expression profiling technologies such as cDNA microarrays, oligonucleotide arrays and Serial Analysis of Gene Expression (SAGE). Typically, dozens or hundreds of genes are identified, many of which are expected to be false positives, and only a small fraction useful as diagnostic/prognostic markers or therapeutic targets (Griffith et al., J Clin Oncol 24(31):5043-5051 (2006)).
In other types of cancer it has been shown that gene expression profiling can add substantial value to the discrimination of the different clinically relevant tumour entities. US 2006/183141 A, for example, describes classification of tumor markers from a core serum response signature. Different studies have tried to classify the different entities of thyroid carcinoma on the basis of their gene expression profiles, each of them discriminating between 2 of the 5 entities. However, the studies have no or very few genes in common, and applying a classifier from one study to the data from another study generally yields poor classification results.
It is a goal of the present invention to provide reliable distinctive markers for the diagnosis of cancer, in particular to distinguish benign thyroid nodules from malignant follicular thyroid carcinoma (FTC) and papillary thyroid carcinoma (PTC).
Therefore the present invention provides a set of moieties specific for at least 3 tumor markers selected from the tumor markers PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, PIV-1 to PIV-9, preferably PIV-4 or PIV-5, and PV-1 to PV-11, preferably PV-1, PV-2 and PV-4 to PV-11. These tumor markers are related to different genes aberrantly expressed in tumors and are given in tables 1 to 6. They can be identified by their gene identification sign or their descriptive gene name, but most unambiguously by their UniGeneID or their Accession number referring to specific sequences in common sequence databases such as NCBI GenBank, EMBL-EBI Database, EnsEMBL or the DNA Data Bank of Japan. These markers have been identified in the form of preferred sets (PI to PV, FI) but can be combined in any form as targets for the inventive set.

TABLE 1

PTC marker set PI-1 to PI-33

Number	marker
PI-	gene	description of gene	Accession Nr.	UniGeneID

1	BBS9	Bardet-Biedl	NM_198428	Hs.372360
		syndrome 9	NM_001033605
			NM_001033604
			NM_014451
2	C13orf1	Chromosome 13 open reading frame 1	NM_020456	Hs.44235
3	CBFA2T3	Core-binding	NM_005187 NM_175931	Hs.513811
		factor, runt domain, alpha subunit 2
4	CDT1	Chromatin licensing and DNA	NM_030928	Hs.122908
		replication factor 1
5	CRK	V-crk sarcoma virus CT10 oncogene	NM_016823 NM_005206	Hs.638121
		homolog (avian)
6	CTPS	CTP synthase	NM_001905	Hs.473087
7	DAPK2	Death-associated protein kinase 2	NM_014326	Hs.237886
8	EIF5	Eukaryotic translation initiation factor 5	NM_001969 NM_183004	Hs.433702
9	EREG	Epiregulin	NM_001432	Hs.115263
10	GK	Glycerol kinase	NM_203391 NM_000167	Hs.1466
11	GPATCH8	G patch domain containing 8	NM_001002909	Hs.463129
12	HDGF	Hepatoma-derived growth factor	NM_004494	Hs.506748
		(high-mobility group protein 1-like)
13	IRF2BP1	Interferon regulatory factor 2 binding protein 1	NM_015649	Hs.515477
14	KRT83	Keratin 83	NM_002282	Hs.661428
15	MYOD1	Myogenic differentiation 1	NM_002478	Hs.181768
16	NME6	Non-metastatic cells 6, protein expressed in	NM_005793	Hs.465558
		(nucleoside-diphosphate kinase)
17	POLE3	Polymerase (DNA directed), epsilon 3	NM_017443	Hs.108112
		(p17 subunit)
18	PPP1R13B	Protein phosphatase 1, regulatory (inhibitor)	NM_015316	Hs.436113
		subunit 13B
19	PRPH2	Peripherin 2 (retinal degeneration, slow)	NM_000322	Hs.654489
20	RASSF7	Ras association (RalGDS/AF-6) domain	NM_003475	Hs.72925
		family 7
21	ROCK2	Rho-associated, coiled-coil containing	NM_004850	Hs.591600
		protein kinase 2
22	RTN1	Reticulon 1	NM_021136 NM_206857	Hs.368626
			NM_206852
23	S100B	S100 calcium binding protein B	NM_006272	Hs.422181
24	SLIT2	Slit homolog 2 (Drosophila)	NM_004787	Hs.29802
25	SNRPB2	Small nuclear ribonucleoprotein polypeptide	NM_003092 NM_198220	Hs.280378
		B″
26	SPAG7	Sperm associated antigen 7	NM_004890	Hs.90436
27	STAU1	Staufen, RNA binding protein, homolog 1	NM_017453	Hs.596704
		(Drosophila)	NM_001037328
			NM_004602 NM_017452
			NM_017454
28	SUPT5H	Suppressor of Ty 5 homolog (S. cerevisiae)	NM_003169	Hs.631604
29	TBX10	T-box 10	NM_005995	Hs.454480
30	TLK1	Tousled-like kinase 1	NM_012290	Hs.655640
31	TM4SF4	Transmembrane 4 L six family member 4	NM_004617	Hs.133527
32	TXN	Thioredoxin	NM_003329	Hs.435136
33	UFD1L	Ubiquitin fusion degradation 1 like (yeast)	NM_005659	Hs.474213
			NM_001035247
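
The description states that each marker is identified most unambiguously by its UniGeneID or Accession number. A minimal sketch of such a lookup follows, using three rows copied from table 1. The `Marker` class, `build_index` function and `INDEX` name are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """One table row; the class and field names are illustrative assumptions."""
    marker_id: str
    gene: str
    accessions: tuple
    unigene_id: str


# Three rows copied from table 1 (PI-1, PI-6 and PI-9).
TABLE_1_EXCERPT = (
    Marker("PI-1", "BBS9",
           ("NM_198428", "NM_001033605", "NM_001033604", "NM_014451"),
           "Hs.372360"),
    Marker("PI-6", "CTPS", ("NM_001905",), "Hs.473087"),
    Marker("PI-9", "EREG", ("NM_001432",), "Hs.115263"),
)


def build_index(markers):
    """Map every Accession number and UniGeneID to the markers that carry it."""
    index = {}
    for marker in markers:
        for key in (marker.unigene_id, *marker.accessions):
            # A list is used because one identifier may belong to several rows.
            index.setdefault(key, []).append(marker)
    return index


INDEX = build_index(TABLE_1_EXCERPT)
```

The index maps each identifier to a list rather than a single record because a UniGeneID can be shared by more than one row, as with PV-7 and PV-8 in table 6 (both Hs.440848).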

TABLE 2

PTC marker set PII-1 to PII-64

Number	marker
PII-	gene	description of gene	Accession Nr.	UniGeneID

1	ADH1B	Alcohol dehydrogenase IB (class I), beta	NM_000668	Hs.4
		polypeptide
2	AGR2	Anterior gradient homolog 2	NM_006408	Hs.530009
		(Xenopus laevis)
3	AGTR1	Angiotensin II receptor, type 1	NM_031850 NM_004835	Hs.477887
			NM_009585 NM_032049
4	AGTR1	Angiotensin II receptor, type 1	NM_000685	Hs.654382
5	ALDH1A1	Aldehyde dehydrogenase	NM_000689	Hs.76392
		1 family, member A1
6	ALDH1A3	Aldehyde dehydrogenase	NM_000693	Hs.459538
		1 family, member A3
7	AMIGO2	Adhesion molecule with Ig-like	NM_181847	Hs.121520
		domain 2
8	ATP2C2	ATPase, Ca++ transporting,	NM_014861	Hs.6168
		type 2C, member 2
9	BID	BH3 interacting domain death	NM_197966 NM_001196	Hs.591054
		agonist	NM_197967
10	C7orf24	Chromosome 7 open reading	NM_024051	Hs.530024
		frame 24
11	CA4	Carbonic anhydrase IV	NM_000717	Hs.89485
12	CCL21	Chemokine (C-C motif) ligand 21	NM_002989	Hs.57907
13	CD55	CD55 molecule, decay	NM_000574	Hs.527653
		accelerating factor for
		complement (Cromer blood
		group)
14	CDH16	Cadherin 16, KSP-cadherin	NM_004062	Hs.513660
15	CDH3	Cadherin 3, type 1, P-cadherin	NM_133458 NM_001793	Hs.461074
		(placental)
16	CFI	Complement factor I	NM_000204	Hs.312485
17	CHI3L1	Chitinase 3-like 1 (cartilage	NM_001276	Hs.382202
		glycoprotein-39)
18	CHST2	Carbohydrate	NM_004267	Hs.8786
		(N-acetylglucosamine-6-O)
		sulfotransferase 2
19	CITED2	Cbp/p300-interacting	NM_006079	Hs.82071
		transactivator, with Glu/Asp-rich
		carboxy-terminal domain, 2
20	CLCNKB	Chloride channel Kb	NM_000085	Hs.352243
21	COMP	Cartilage oligomeric matrix	NM_000095	Hs.1584
		protein
22	CTSH	Cathepsin H	NM_004390 NM_148979	Hs.148641
23	DIO2	Deiodinase, iodothyronine,	NM_013989 NM_000793	Hs.202354
		type II	NM_001007023
24	DIRAS3	DIRAS family, GTP-binding	NM_004675	Hs.194695
		RAS-like 3
25	DUSP4	Dual specificity phosphatase 4	NM_057158 NM_001394	Hs.417962
26	DUSP5	Dual specificity phosphatase 5	NM_004419	Hs.2128
27	EDN3	Endothelin 3	NM_207032 NM_207034	Hs.1408
			NM_207033 NM_000114
28	ENTPD1	Ectonucleoside triphosphate	NM_001776	Hs.576612
		diphosphohydrolase 1	NM_001098175
29	FHL1	Four and a half LIM domains 1	NM_001449	Hs.435369
30	GDF15	Growth differentiation factor 15	NM_004864	Hs.616962
31	GPM6A	Glycoprotein M6A	NM_201591 NM_005277	Hs.75819
			NM_201592
32	HBA1	Hemoglobin, alpha 1	NM_000558	Hs.449630
33	IRS1	Insulin receptor substrate 1	NM_005544	Hs.471508
34	KCNJ2	Potassium inwardly-rectifying	NM_000891	Hs.1547
		channel, subfamily J, member 2
35	KCNN4	Potassium intermediate/small	NM_002250	Hs.10082
		conductance calcium-activated
		channel, subfamily N, member 4
36	KLK10	Kallikrein-related peptidase 10	NM_002776	Hs.275464
			NM_001077500
			NM_145888
37	LAMB3	Laminin, beta 3	NM_001017402	Hs.497636
			NM_000228
38	LCN2	Lipocalin 2 (oncogene 24p3)	NM_005564	Hs.204238
39	LMOD1	Leiomodin 1 (smooth muscle)	NM_012134	Hs.519075
40	MATN2	Matrilin 2	NM_002380 NM_030583	Hs.189445
41	MPPED2	Metallophosphoesterase	NM_001584	Hs.289795
		domain containing 2
42	MVP	Major vault protein	NM_017458 NM_005115	Hs.632177
43	NELL2	NEL-like 2 (chicken)	NM_006159	Hs.505326
44	NFE2L3	Nuclear factor (erythroid-derived	NM_004289	Hs.404741
		2)-like 3
45	NPC2	Niemann-Pick disease, type C2	NM_006432	Hs.433222
46	NRCAM	Neuronal cell adhesion molecule	NM_001037132	Hs.21422
			NM_005010
			NM_001037133
47	NRIP1	Nuclear receptor interacting	NM_003489	Hs.155017
		protein 1
48	PAPSS2	3′-phosphoadenosine	NM_001015880	Hs.524491
		5′-phosphosulfate synthase 2	NM_004670
49	PDLIM4	PDZ and LIM domain 4	NM_003687	Hs.424312
50	PDZK1IP1	PDZK1 interacting protein 1	NM_005764	Hs.431099
51	PIP3-E	Phosphoinositide-binding protein	NM_015553	Hs.146100
		PIP3-E
52	PLAU	Plasminogen activator, urokinase	NM_002658	Hs.77274
53	PRSS2	Protease, serine, 2 (trypsin 2)	NM_002770	Hs.622865
54	PRSS23	Protease, serine, 23	NM_007173	Hs.25338
55	RAP1GAP	RAP1 GTPase activating protein	NM_002885	Hs.148178
56	S100A11	S100 calcium binding protein A11	NM_005620	Hs.417004
57	SFTPB	Surfactant, pulmonary-associated	NM_198843 NM_000542	Hs.512690
		protein B
58	SLPI	Secretory leukocyte peptidase	NM_003064	Hs.517070
		inhibitor
59	SOD3	Superoxide dismutase 3,	NM_003102	Hs.2420
		extracellular
60	SPINT1	Serine peptidase inhibitor, Kunitz type 1	NM_181642 NM_003710	Hs.233950
			NM_001032367
61	SYNE1	Spectrin repeat containing,	NM_182961 NM_033071	Hs.12967
		nuclear envelope 1	NM_015293 NM_133650
62	TACSTD2	Tumor-associated calcium signal transducer 2	NM_002353	Hs.23582
63	UPP1	Uridine phosphorylase 1	NM_181597 NM_003364	Hs.488240
64	WASF3	WAS protein family, member 3	NM_006646	Hs.635221

TABLE 3

PTC marker set PIII-1 to PIII-70

Number	marker
PIII-	gene	description of gene	Accession Nr.	UniGeneID

1	APOE	Apolipoprotein E	NM_000041	Hs.654439
2	ATIC	5-aminoimidazole-4-carboxamide ribo-	NM_004044	Hs.90280
		nucleotide formyltransferase/IMP cyclohydrolase
3	BASP1	Brain abundant, membrane attached signal	NM_006317	Hs.201641
		protein 1
4	C9orf61	Chromosome 9 open reading frame 61	NM_004816	Hs.118003
5	CCL13	Chemokine (C-C motif) ligand 13	NM_005408	Hs.414629
6	CD36	CD36 molecule (thrombospondin receptor)	NM_001001548	Hs.120949
			NM_001001547
			NM_000072
7	CDH6	Cadherin 6, type 2, K-cadherin (fetal kidney)	NM_004932	Hs.171054
8	CFB	Complement factor B	NM_001710	Hs.69771
9	CFD	Complement factor D (adipsin)	NM_001928	Hs.155597
10	CLDN10	Claudin 10	NM_182848 NM_006984	Hs.534377
11	COL11A1	Collagen, type XI, alpha 1	NM_080629 NM_001854	Hs.523446
			NM_080630
12	COL13A1	Collagen, type XIII, alpha 1	NM_005203 NM_080804	Hs.211933
			NM_080798 NM_080803
			NM_080802 NM_080799
			NM_080800
			NM_080801 NM_080808
			NM_080809 NM_080805
			NM_080807 NM_080806
			NM_080811
			NM_080810 NM_080812
			NM_080813 NM_080814
			NM_080815
13	CORO2B	Coronin, actin binding protein, 2B	NM_006091	Hs.551213
14	CRLF1	Cytokine receptor-like factor 1	NM_004750	Hs.114948
15	CXorf6	Chromosome X open reading frame 6	NM_005491	Hs.20136
16	DDB2	Damage-specific DNA binding protein 2,	NM_000107	Hs.655280
		48 kDa
17	DPP6	Dipeptidyl-peptidase 6	NM_001039350	Hs.490684
			NM_130797 NM_001936
18	ECM1	Extracellular matrix protein 1	NM_004425 NM_022664	Hs.81071
19	EFEMP1	EGF-containing fibulin-like extracellular	NM_004105	Hs.76224
		matrix protein 1	NM_001039348
			NM_001039349
20	ESRRG	Estrogen-related receptor gamma	NM_206594 NM_001438	Hs.444225
			NM_206595
21	ETHE1	Ethylmalonic encephalopathy 1	NM_014297	Hs.7486
22	FAS	Fas (TNF receptor superfamily, member 6)	NM_000043 NM_152872	Hs.244139
			NM_152871 NM_152873
			NM_152875 NM_152874
			NM_152877 NM_152876
23	FMOD	Fibromodulin	NM_002023	Hs.519168
24	GABBR2	Gamma-aminobutyric acid (GABA) B receptor, 2	NM_005458	Hs.198612
25	GALE	UDP-galactose-4-epimerase	NM_000403	Hs.632380
			NM_001008216
26	GATM	Glycine amidinotransferase (L-arginine: glycine	NM_001482	Hs.75335
		amidinotransferase)
27	GDF10	Growth differentiation factor 10	NM_004962	Hs.2171
28	GHR	Growth hormone receptor	NM_000163	Hs.125180
29	GPC3	Glypican 3	NM_004484	Hs.644108
30	ICAM1	Intercellular adhesion molecule 1 (CD54),	NM_000201	Hs.643447
		human rhinovirus receptor
31	ID3	Inhibitor of DNA binding 3, dominant negative	NM_002167	Hs.76884
		helix-loop-helix protein
32	IER2	Immediate early response 2	NM_004907	Hs.501629
33	IGFBP6	Insulin-like growth factor binding protein 6	NM_002178	Hs.274313
34	IQGAP2	IQ motif containing GTPase activating protein 2	NM_006633	Hs.291030
35	ITGA2	Integrin, alpha 2 (CD49B, alpha 2 subunit	NM_002203	Hs.482077
		of VLA-2 receptor)
36	ITGA3	Integrin, alpha 3 (antigen CD49C, alpha 3	NM_002204 NM_005501	Hs.265829
		subunit of VLA-3 receptor)
37	ITM2A	Integral membrane protein 2A	NM_004867	Hs.17109
38	KIAA0746	KIAA0746 protein	NM_015187	Hs.479384
39	LRIG1	Leucine-rich repeats and immunoglobulin-	NM_015541	Hs.518055
		like domains 1
40	LRP2	Low density lipoprotein-related protein 2	NM_004525	Hs.470538
41	LY6E	Lymphocyte antigen 6 complex, locus E	NM_002346	Hs.521903
42	MAPK13	Mitogen-activated protein kinase 13	NM_002754	Hs.178695
43	MDK	Midkine (neurite growth-promoting factor	NM_001012334	Hs.82045
		2)	NM_001012333
			NM_002391
44	MLLT11	Myeloid/lymphoid or mixed-lineage leukemia	NM_006818	Hs.75823
		(trithorax homolog, Drosophila)
45	MMRN1	Multimerin 1	NM_007351	Hs.268107
46	MTMR11	Myotubularin related protein 11	NM_181873	Hs.425144
47	MXRA8	Matrix-remodelling associated 8	NM_032348	Hs.558570
48	NAB2	NGFI-A binding protein 2 (EGR1 binding	NM_005967	Hs.159223
		protein 2)
49	NMU	Neuromedin U	NM_006681	Hs.418367
50	OCA2	Oculocutaneous albinism II (pink-eye dilution	NM_000275	Hs.654411
		homolog, mouse)
51	PDE5A	Phosphodiesterase 5A, cGMP-specific	NM_001083 NM_033430	Hs.647971
			NM_033437
52	PLAG1	Pleiomorphic adenoma gene 1	NM_002655	Hs.14968
53	PLP2	Proteolipid protein 2 (colonic epithelium-	NM_002668	Hs.77422
		enriched)
54	PLXNC1	Plexin C1	NM_005761	Hs.584845
55	PRKCQ	Protein kinase C, theta	NM_006257	Hs.498570
56	PRUNE	Prune homolog (Drosophila)	NM_021222	Hs.78524
57	RAB27A	RAB27A, member RAS oncogene family	NM_004580 NM_183234	Hs.654978
			NM_183235 NM_183236
58	RYR2	Ryanodine receptor 2 (cardiac)	NM_001035	Hs.109514
59	SCEL	Sciellin	NM_144777 NM_003843	Hs.534699
60	SELENBP1	Selenium binding protein 1	NM_003944	Hs.632460
61	SORBS2	Sorbin and SH3 domain containing 2	NM_021069 NM_003603	Hs.655143
62	STMN2	Stathmin-like 2	NM_007029	Hs.521651
63	TBC1D4	TBC1 domain family, member 4	NM_014832	Hs.210891
64	TM4SF4	Transmembrane 4 L six family member 4	NM_004617	Hs.133527
65	TNC	Tenascin C (hexabrachion)	NM_002160	Hs.143250
66	TPD52L1	Tumor protein D52-like 1	NM_001003395	Hs.591347
			NM_003287
			NM_001003396
			NM_001003397
67	TSC22D1	TSC22 domain family, member 1	NM_183422 NM_006022	Hs.507916
68	TTC30A	Tetratricopeptide repeat domain 30A	NM_152275	Hs.128384
69	VLDLR	Very low density lipoprotein receptor	NM_003383	Hs.370422
			NM_001018056
70	WFS1	Wolfram syndrome 1 (wolframin)	NM_006005	Hs.518602

TABLE 4

FTC marker set FI-1 to FI-147

Number
FI-	marker gene	description of gene	Accession Nr.	UniGeneID

1	AATF	Apoptosis antagonizing transcription	NM_012138	Hs.195740
		factor
2	ACOX3	Acyl-Coenzyme A oxidase 3,	NM_003501	Hs.479122
		pristanoyl
3	AHDC1	AT hook, DNA binding motif, containing 1	NM_001029882	Hs.469280
4	ALAS2	Aminolevulinate, delta-, synthase 2	NM_000032	Hs.522666
		(sideroblastic/hypochromic anemia)	NM_001037968
			NM_001037967
			NM_001037969
5	ALKBH1	AlkB, alkylation repair homolog 1 (E. coli)	NM_006020	Hs.94542
6	ANGPTL2	Angiopoietin-like 2	NM_012098	Hs.653262
7	AP2A2	Adaptor-related protein complex 2, alpha	NM_012305	Hs.19121
		2 subunit
8	APOBEC3G	Apolipoprotein B mRNA editing enzyme,	NM_021822	Hs.660143
		catalytic polypeptide-like 3G
9	APRIN	Androgen-induced proliferation inhibitor	NM_015032	Hs.693663
10	ARNT	Aryl hydrocarbon receptor nuclear	NM_001668	Hs.632446
		translocator	NM_178427
			NM_178426
11	AZGP1	Alpha-2-glycoprotein 1, zinc-binding	NM_001185	Hs.546239
12	BAT2D1	BAT2 domain containing 1	NM_015172	Hs.494614
13	BATF	Basic leucine zipper transcription	NM_006399	Hs.509964
		factor, ATF-like
14	BPHL	Biphenyl hydrolase-like (serine hydrolase	NM_004332	Hs.10136
15	C13orf1	Chromosome 13 open reading frame 1	NM_020456	Hs.44235
16	C14orf1	Chromosome 14 open reading frame 1	NM_007176	Hs.15106
17	C2orf3	Chromosome 2 open reading frame 3	NM_003203	Hs.303808
18	CBFB	Core-binding factor, beta subunit	NM_001755	Hs.460988
			NM_022845
19	CBR3	Carbonyl reductase 3	NM_001236	Hs.154510
20	CBX5	Chromobox homolog 5 (HP1 alpha homolog,	NM_012117	Hs.632724
		Drosophila)
21	CCNE2	Cyclin E2	NM_057749	Hs.567387
			NM_057735
22	CD46	CD46 molecule, complement regulatory	NM_002389	Hs.510402
		protein	NM_172354
			NM_172351
			NM_172355
			NM_172352
			NM_172359
			NM_172357
			NM_172360
			NM_153826
			NM_172358
			NM_172356
			NM_172353
			NM_172361
			NM_172350
23	CHPF	Chondroitin polymerizing factor	NM_024536	Hs.516711
24	CHST3	Carbohydrate (chondroitin 6) sulfotransferase 3	NM_004273	Hs.158304
25	CLCN2	Chloride channel 2	NM_004366	Hs.436847
26	CLCN4	Chloride channel 4	NM_001830	Hs.495674
27	CLIC5	Chloride intracellular channel 5	NM_016929	Hs.485489
28	CNOT2	CCR4-NOT transcription complex,	NM_014515	Hs.133350
		subunit 2
29	COPS6	COP9 constitutive photomorphogenic	NM_006833	Hs.15591
		homolog subunit 6 (Arabidopsis)
30	CPZ	Carboxypeptidase Z	NM_001014448	Hs.78068
			NM_001014447
			NM_003652
31	CSK	C-src tyrosine kinase	NM_004383	Hs.77793
32	CTDP1	CTD (carboxy-terminal domain, RNA	NM_004715	Hs.465490
		polymerase II, polypeptide A) phosphatase,	NM_048368
		subunit 1
33	DDEF2	Development and differentiation enhancing	NM_003887	Hs.555902
		factor 2
34	DKFZP586H2123	Regeneration associated muscle protease	NM_015430	Hs.55044
			NM_001001991
35	DLG2	Discs, large homolog 2, chapsyn-110	NM_001364	Hs.654862
		(Drosophila)
36	DPAGT1	Dolichyl-phosphate (UDP-N-acetylglucosamine)	NM_001382	Hs.524081
		N-acetylglucosaminephosphotransferase	NM_203316
		1 (GlcNAc-1-P transferase)
37	DSCR1	Down syndrome critical region gene 1	NM_004414	Hs.282326
			NM_203418
			NM_203417
38	DUSP8	Dual specificity phosphatase 8	NM_004420	Hs.41688
39	EI24	Etoposide induced 2.4 mRNA	NM_004879	Hs.643514
			NM_001007277
40	ENOSF1	Enolase superfamily member 1	NM_017512	Hs.369762
41	ERCC1	Excision repair cross-complementing	NM_202001	Hs.435981
		rodent repair deficiency, complementation	NM_001983
		group 1 (includes overlapping
		antisense sequence)
42	ERCC3	Excision repair cross-complementing	NM_000122	Hs.469872
		rodent repair deficiency, complementation
		group 3 (xeroderma pigmentosum
		group B complementing)
43	ERH	Enhancer of rudimentary homolog	NM_004450	Hs.509791
		(Drosophila)
44	F13A1	Coagulation factor XIII, A1 polypeptide	NM_000129	Hs.335513
45	FAM20B	Family with sequence similarity 20,	NM_014864	Hs.5737
		member B
46	FBP1	Fructose-1,6-bisphosphatase 1	NM_000507	Hs.494496
47	FCGR2A	Fc fragment of IgG, low affinity IIa, receptor	NM_021642	Hs.352642
		(CD32)
48	FGF13	Fibroblast growth factor 13	NM_004114	Hs.6540
			NM_033642
49	FGFR1OP	FGFR1 oncogene partner	NM_007045	Hs.487175
			NM_194429
50	FLNC	Filamin C, gamma (actin binding	NM_001458	Hs.58414
		protein 280)
51	FMO5	Flavin containing monooxygenase 5	NM_001461	Hs.642706
52	FRY	Furry homolog (Drosophila)	NM_023037	Hs.591225
53	GADD45G	Growth arrest and DNA-damage-inducible,	NM_006705	Hs.9701
		gamma
54	GCH1	GTP cyclohydrolase 1 (dopa-	NM_000161	Hs.86724
		responsive dystonia)	NM_001024024
			NM_001024070
			NM_001024071
55	GFRA1	GDNF family receptor alpha 1	NM_005264	Hs.591913
			NM_145793
56	GLB1	Galactosidase, beta 1	NM_001039770	Hs.443031
			NM_000404
			NM_001079811
57	GOLGA8A	Golgi autoantigen, golgin subfamily a,	NM_181077	Hs.182982
		8A	NM_001023567
58	HCLS1	Hematopoietic cell-specific Lyn substrate 1	NM_005335	Hs.14601
59	HDGF	Hepatoma-derived growth factor (high-	NM_004494	Hs.506748
		mobility group protein 1-like)
60	HRC	Histidine rich calcium binding protein	NM_002152	Hs.436885
61	ICMT	Isoprenylcysteine carboxyl methyl-	NM_012405	Hs.562083
		transferase
62	IFNA5	Interferon, alpha 5	NM_002169	Hs.37113
63	IGF2BP3	Insulin-like growth factor 2 mRNA	NM_006547	Hs.648088
		binding protein 3
64	IL12A	Interleukin 12A (natural killer cell stimulatory	NM_000882	Hs.673
		factor 1, cytotoxic lymphocyte
		maturation factor 1, p35)
65	ITIH2	Inter-alpha (globulin) inhibitor H2	NM_002216	Hs.75285
66	ITPKC	Inositol 1,4,5-trisphosphate 3-kinase C	NM_025194	Hs.515415
67	JMJD2A	Jumonji domain containing 2A	NM_014663	Hs.155983
68	KCNJ15	Potassium inwardly-rectifying channel,	NM_170736	Hs.411299
		subfamily J, member 15	NM_002243
			NM_170737
69	KCTD12	Potassium channel tetramerisation domain	NM_138444	Hs.693617
		containing 12
70	KIAA0652	KIAA0652	NM_014741	Hs.410092
71	KIAA0913	KIAA0913	NM_015037	Hs.65135
72	KLKB1	Kallikrein B, plasma (Fletcher factor) 1	NM_000892	Hs.646885
73	KRT37	Keratin 37	NM_003770	Hs.673852
74	LAMB3	Laminin, beta 3	NM_001017402	Hs.497636
			NM_000228
75	LPHN3	Latrophilin 3	NM_015236	Hs.694758
				Hs.649524
76	LRIG1	Leucine-rich repeats and immunoglobulin-	NM_015541	Hs.518055
		like domains 1
77	LSR	Lipolysis stimulated lipoprotein receptor	NM_205834	Hs.466507
			NM_015925
			NM_205835
78	MANBA	Mannosidase, beta A, lysosomal	NM_005908	Hs.480415
79	MAP7	Microtubule-associated protein 7	NM_003980	Hs.486548
80	MAPKAPK5	Mitogen-activated protein kinase-activated	NM_139078	Hs.413901
		protein kinase 5	NM_003668
81	MET	Met proto-oncogene (hepatocyte	NM_000245	Hs.132966
		growth factor receptor)
82	MMP14	Matrix metallopeptidase 14 (membrane-	NM_004995	Hs.2399
		inserted)
83	MX1	Myxovirus (influenza virus) resistance	NM_002462	Hs.517307
		1, interferon-inducible protein p78
		(mouse)
84	MYL9	Myosin, light chain 9, regulatory	NM_006097	Hs.504687
			NM_181526
85	MYO9B	Myosin IXB	NM_004145	Hs.123198
86	NCOR1	Nuclear receptor co-repressor 1	NM_006311	Hs.462323
87	NDRG4	NDRG family member 4	NM_020465	Hs.322430
			NM_022910
88	NDUFA5	NADH dehydrogenase (ubiquinone) 1	NM_005000	Hs.651219
		alpha subcomplex, 5, 13 kDa
89	NEUROD2	Neurogenic differentiation 2	NM_006160	Hs.322431
90	NFKB2	Nuclear factor of kappa light poly-	NM_001077494	Hs.73090
		peptide gene enhancer in B-cells 2	NM_001077493
		(p49/p100)	NM_002502
91	NME6	Non-metastatic cells 6, protein expressed	NM_005793	Hs.465558
		in (nucleoside-diphosphate kinase)
92	NPY1R	Neuropeptide Y receptor Y1	NM_000909	Hs.519057
93	NUP50	Nucleoporin 50 kDa	NM_007172	Hs.475103
			NM_153645
94	PDGFRA	Platelet-derived growth factor receptor,	NM_006206	Hs.74615
		alpha polypeptide
95	PDHX	Pyruvate dehydrogenase complex,	NM_003477	Hs.502315
		component X
96	PDLIM1	PDZ and LIM domain 1 (elfin)	NM_020992	Hs.368525
97	PEX1	Peroxisome biogenesis factor 1	NM_000466	Hs.164682
98	PEX13	Peroxisome biogenesis factor 13	NM_002618	Hs.567316
99	PIB5PA	Phosphatidylinositol (4,5) bisphosphate	NM_014422	Hs.517549
		5-phosphatase, A	NM_001002837
100	PICK1	Protein interacting with PRKCA1	NM_012407	Hs.180871
			NM_001039583
			NM_001039584
101	PLEC1	Plectin 1, intermediate filament binding	NM_201380	Hs.434248
		protein 500 kDa	NM_201384
			NM_000445
			NM_201379
			NM_201383
			NM_201382
			NM_201381
			NM_201378
102	POLE2	Polymerase (DNA directed), epsilon 2	NM_002692	Hs.162777
		(p59 subunit)
103	POLE3	Polymerase (DNA directed), epsilon 3	NM_017443	Hs.108112
		(p17 subunit)
104	PPIF	Peptidylprolyl isomerase F (cyclophilin	NM_005729	Hs.381072
		F)
105	PPP2R5A	Protein phosphatase 2, regulatory	NM_006243	Hs.497684
		subunit B′, alpha isoform
106	PSCD2	Pleckstrin homology, Sec7 and coiled-	NM_017457	Hs.144011
		coil domains 2 (cytohesin-2)	NM_004228
107	PSMA5	Proteasome (prosome, macropain)	NM_002790	Hs.485246
		subunit, alpha type, 5
108	PTPN12	Protein tyrosine phosphatase, non-receptor	NM_002835	Hs.61812
		type 12
109	PTPN3	Protein tyrosine phosphatase, non-receptor	NM_002829	Hs.436429
		type 3
110	PTPRCAP	Protein tyrosine phosphatase, receptor	NM_005608	Hs.155975
		type, C-associated protein
111	QKI	Quaking homolog, KH domain RNA	NM_206855	Hs.510324
		binding (mouse)	NM_206854
			NM_206853
			NM_006775
112	RASAL2	RAS protein activator like 2	NM_170692	Hs.656823
			NM_004841
113	RASSF7	Ras association (RalGDS/AF-6) domain	NM_003475	Hs.72925
		family 7
114	RBM10	RNA binding motif protein 10	NM_005676	Hs.401509
			NM_152856
115	RBM38	RNA binding motif protein 38	NM_017495	Hs.236361
			NM_183425
116	RER1	RER1 retention in endoplasmic reticulum	NM_007033	Hs.525527
		1 homolog (S. cerevisiae)
117	RGL2	Ral guanine nucleotide dissociation	NM_004761	Hs.509622
		stimulator-like 2
118	RHOG	Ras homolog gene family, member G	NM_001665	Hs.501728
		(rho G)
119	RNASE1	Ribonuclease, RNase A family, 1	NM_198235	Hs.78224
		(pancreatic)	NM_198234
			NM_198232
			NM_002933
120	RTN4	Reticulon 4	NM_020532	Hs.645283
			NM_207521
			NM_207520
			NM_153828
			NM_007008
121	RYR2	Ryanodine receptor 2 (cardiac)	NM_001035	Hs.109514
122	SCC-112	SCC-112 protein	NM_015200	Hs.331431
123	SDS	Serine dehydratase	NM_006843	Hs.654416
124	SF3B2	Splicing factor 3b, subunit 2, 145 kDa	NM_006842	Hs.406423
125	SH3PXD2A	SH3 and PX domains 2A	NM_014631	Hs.594708
126	SIX6	Sine oculis homeobox homolog 6	NM_007374	Hs.194756
		(Drosophila)
127	SLC10A1	Solute carrier family 10 (sodium/bile	NM_003049	Hs.952
		acid cotransporter family), member 1
128	SLC6A8	Solute carrier family 6 (neurotransmitter	NM_005629	Hs.540696
		transporter, creatine), member 8
129	SMG6	Smg-6 homolog, nonsense mediated	NM_017575	Hs.448342
		mRNA decay factor (C. elegans)
130	SNRPB2	Small nuclear ribonucleoprotein poly-	NM_003092	Hs.280378
		peptide B″	NM_198220
131	SOX11	SRY (sex determining region Y)-box	NM_003108	Hs.432638
		11
132	SPI1	Spleen focus forming virus (SFFV)	NM_001080547	Hs.502511
		proviral integration oncogene spi1	NM_003120
133	SRGAP3	SLIT-ROBO Rho GTPase activating	NM_014850	Hs.654743
		protein 3	NM_001033117
134	STX12	Syntaxin 12	NM_177424	Hs.523855
135	SYK	Spleen tyrosine kinase	NM_003177	Hs.371720
136	TAF4	TAF4 RNA polymerase II, TATA box	NM_003185	Hs.18857
		binding protein (TBP)-associated
		factor, 135 kDa
137	TCN2	Transcobalamin II	NM_000355	Hs.417948
138	TGOLN2	Trans-golgi network protein 2	NM_006464	Hs.593382
139	TIA1	TIA1 cytotoxic granule-associated	NM_022173	Hs.516075
		RNA binding protein	NM_022037
140	TOMM40	Translocase of outer mitochondrial	NM_006114	Hs.655909
		membrane 40 homolog (yeast)
141	TXN2	Thioredoxin 2	NM_012473	Hs.211929
142	UGCG	UDP-glucose ceramide glucosyltransferase	NM_003358	Hs.304249
143	USP11	Ubiquitin specific peptidase 11	NM_004651	Hs.171501
144	VDR	Vitamin D (1,25-dihydroxyvitamin D3)	NM_001017535	Hs.524368
		receptor	NM_000376
145	VEGFC	Vascular endothelial growth factor C	NM_005429	Hs.435215
146	YWHAQ	Tyrosine 3-monooxygenase/tryptophan	NM_006826	Hs.74405
		5-monooxygenase activation
		protein, theta polypeptide
147	ZNF140	Zinc finger protein 140	NM_003440	Hs.181552

TABLE 5

PTC marker set PIV-1 to PIV-9

Number
PIV-	marker gene	description of gene	Accession Nr.	UniGeneID

1	WAS	Wiskott-Aldrich syndrome (eczema-	BC012738	Hs.2157
		thrombocytopenia)
2	LRP4	Low density lipoprotein receptor-related	BM802977	Hs.4930
		protein 4
3	TFF3	Trefoil factor 3 (intestinal)	BC017859	Hs.82961
4	ST3GAL6	ST3 beta-galactoside alpha-2,3-sialyl-	BC023312	Hs.148716
		transferase 6
5	STK39	Serine threonine kinase 39	BM455533	Hs.276271
		(STE20/SPS1 homolog, yeast)
6	DPP4	Dipeptidyl-peptidase 4 (CD26, adenosine	BC065265	Hs.368912
		deaminase complexing protein 2)
7	CHI3L1	Chitinase 3-like 1 (cartilage glycoprotein-39)	BC038354	Hs.382202
8	FABP4	Fatty acid binding protein 4, adipocyte	BC003672	Hs.391561
9	LAMB3	Laminin, beta 3	BC075838	Hs.497636

TABLE 6

PTC marker set PV-1 to PV-11

Number
PV-	marker gene	description of gene	Accession Nr.	UniGeneID

1	GPR4	G protein-coupled receptor 4	BC067535	Hs.17170
2	STAM2	Signal transducing adaptor molecule	BC028740	Hs.17200
		(SH3 domain and ITAM motif) 2
3	QPCT	Glutaminyl-peptide cyclotransferase	BC047756	Hs.79033
		(glutaminyl cyclase)
4	CDK7	Cyclin-dependent kinase 7 (MO15 homolog,	BC000834	Hs.184298
		Xenopus laevis, cdk-activating
		kinase)
5	SFTPD	Surfactant, pulmonary-associated protein D	BC022318	Hs.253495
6	CYB5R1	Cytochrome b5 reductase 1	BC018732	Hs.334832
7	VWF	Von Willebrand factor	BI490763	Hs.440848
8	VWF	Von Willebrand factor	BQ888783	Hs.440848
9	PDHX	Pyruvate dehydrogenase complex,	BC010389	Hs.502315
		component X
10	HOXA4	Homeobox A4	BM996071	Hs.654466
11	HOXA4	Homeobox A4	BI521357	Hs.654466

The inventive set can be used to detect cancer or tumor cells, in particular thyroid cancer, and even to distinguish benign thyroid nodules from malignant follicular thyroid carcinoma (FTC) and papillary thyroid carcinoma (PTC). In preferred embodiments the set comprises moieties specific for at least 3 tumor markers selected from the tumor markers PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70 and PIV-1 to PIV-9, preferably PIV-4 or PIV-5, and PV-1 to PV-11, preferably PV-1, PV-2 and PV-4 to PV-11, in particular from the tumor markers PI-1 to PI-33. These markers are specific for papillary thyroid carcinoma (PTC) and the diagnosed thyroid cancer can be characterized as PTC.
In a similar preferred embodiment the set comprises moieties specific for at least 3 tumor markers selected from the tumor markers FI-1 to FI-147. These markers are specific for follicular thyroid carcinoma (FTC) and the diagnosed thyroid cancer can be characterized as FTC.
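
The "at least 3 tumor markers" criterion of these embodiments can be illustrated with a short sketch. It only counts marker identifiers per subset and performs no diagnosis. The subset sizes are taken from tables 1 to 6; the function names (`parse_marker`, `count_markers`, `meets_minimum`) are hypothetical. Identifiers, not genes, are counted, so a gene listed in two subsets (e.g. TM4SF4 as PI-31 and PIII-64) is counted once per identifier.

```python
import re

# Subset sizes as given in tables 1 to 6 of the description.
SUBSET_SIZES = {"PI": 33, "PII": 64, "PIII": 70, "FI": 147, "PIV": 9, "PV": 11}
PTC_SUBSETS = {"PI", "PII", "PIII", "PIV", "PV"}
FTC_SUBSETS = {"FI"}

MARKER_ID = re.compile(r"(PIII|PII|PIV|PI|PV|FI)-(\d+)")


def parse_marker(marker_id):
    """Split an identifier such as 'PII-12' into ('PII', 12) and check its range."""
    match = MARKER_ID.fullmatch(marker_id)
    if match is None:
        raise ValueError(f"not a marker identifier: {marker_id!r}")
    subset, number = match.group(1), int(match.group(2))
    if not 1 <= number <= SUBSET_SIZES[subset]:
        raise ValueError(f"{marker_id} is outside the range of subset {subset}")
    return subset, number


def count_markers(panel, subsets):
    """Count the distinct identifiers in the panel that belong to the given subsets."""
    return sum(1 for marker in set(panel) if parse_marker(marker)[0] in subsets)


def meets_minimum(panel, subsets, minimum=3):
    """True if the panel holds at least `minimum` markers from the given subsets."""
    return count_markers(panel, subsets) >= minimum
```

For instance, `meets_minimum(["PI-1", "PI-2", "PII-5"], PTC_SUBSETS)` is satisfied, whereas the same panel does not satisfy the criterion for `FTC_SUBSETS`.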
Particularly preferably, the set comprises a moiety specific for the tumor marker SERPINA1 (Serine (or cysteine) protease inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1; NM_000295, NM_001002236, NM_001002235), which is a very potent marker for PTC. This marker, as a single member of the set, can distinguish PTC from benign conditions.
Preferably the set comprises at least 5 or at least 10, preferably at least 15, more preferred at least 20, particular preferred at least 25, most preferred at least 30, moieties specific for the tumor markers of table 1 to 6 above. The set may be selected from moieties specific for any at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 33, 35, 40, 45, 50, 55, 60, 64, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 145, 147, 150, 160, 170, 180, 190 or 200 of the above tumor markers, e.g. selected from PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, PIV-1 to PIV-9, preferably PIV-4 or PIV-5, and PV-1 to PV-11, preferably PV-1, PV-2 and PV-4 to PV-11, in particular from any one of PI-1, PI-2, PI-3, PI-4, PI-5, PI-6, PI-7, PI-8, PI-9, PI-10, PI-11, PI-12, PI-13, PI-14, PI-15, PI-16, PI-17, PI-18, PI-19, PI-20, PI-21, PI-22, PI-23, PI-24, PI-25, PI-26, PI-27, PI-28, PI-29, PI-30, PI-31, PI-32, PI-33, PII-1, PII-2, PII-3, PII-4, PII-5, PII-6, PII-7, PII-8, PII-9, PII-10, PII-11, PII-12, PII-13, PII-14, PII-15, PII-16, PII-17, PII-18, PII-19, PII-20, PII-21, PII-22, PII-23, PII-24, PII-25, PII-26, PII-27, PII-28, PII-29, PII-30, PII-31, PII-32, PII-33, PII-34, PII-35, PII-36, PII-37, PII-38, PII-39, PII-40, PII-41, PII-42, PII-43, PII-44, PII-45, PII-46, PII-47, PII-48, PII-49, PII-50, PII-51, PII-52, PII-53, PII-54, PII-55, PII-56, PII-57, PII-58, PII-59, PII-60, PII-61, PII-62, PII-63, PII-64, PIII-1, PIII-2, PIII-3, PIII-4, PIII-5, PIII-6, PIII-7, PIII-8, PIII-9, PIII-10, PIII-11, PIII-12, PIII-13, PIII-14, PIII-15, PIII-16, PIII-17, PIII-18, PIII-19, PIII-20, PIII-21, PIII-22, PIII-23, PIII-24, PIII-25, PIII-26, PIII-27, PIII-28, PIII-29, PIII-30, PIII-31, PIII-32, PIII-33, PIII-34, PIII-35, PIII-36, PIII-37, PIII-38, PIII-39, PIII-40, PIII-41, PIII-42, PIII-43, PIII-44, PIII-45, PIII-46, PIII-47, PIII-48, PIII-49, PIII-50, PIII-51, PIII-52, PIII-53, PIII-54,
PIII-56, PIII-57, PIII-58, PIII-59, PIII-60, PIII-61, PIII-62, PIII-63, PIII-64, PIII-66, PIII-67, PIII-68, PIII-69, PIII-70, FI-1, FI-2, FI-3, FI-4, FI-5, FI-6, FI-7, FI-8, FI-9, FI-10, FI-11, FI-12, FI-13, FI-14, FI-15, FI-16, FI-17, FI-18, FI-19, FI-20, FI-21, FI-22, FI-23, FI-24, FI-25, FI-26, FI-27, FI-28, FI-29, FI-30, FI-31, FI-32, FI-33, FI-34, FI-35, FI-36, FI-37, FI-38, FI-39, FI-40, FI-41, FI-42, FI-43, FI-44, FI-45, FI-46, FI-47, FI-48, FI-49, FI-50, FI-51, FI-52, FI-53, FI-54, FI-55, FI-56, FI-57, FI-58, FI-59, FI-60, FI-61, FI-62, FI-63, FI-64, FI-65, FI-66, FI-67, FI-68, FI-69, FI-70, FI-71, FI-72, FI-73, FI-74, FI-75, FI-76, FI-77, FI-78, FI-79, FI-80, FI-81, FI-82, FI-83, FI-84, FI-85, FI-86, FI-87, FI-88, FI-89, FI-90, FI-91, FI-92, FI-93, FI-94, FI-95, FI-96, FI-97, FI-98, FI-99, FI-100, FI-101, FI-102, FI-103, FI-104, FI-105, FI-106, FI-107, FI-108, FI-109, FI-110, FI-111, FI-112, FI-113, FI-114, FI-115, FI-116, FI-117, FI-118, FI-119, FI-120, FI-121, FI-122, FI-123, FI-124, FI-125, FI-126, FI-127, FI-128, FI-129, FI-130, FI-131, FI-132, FI-133, FI-134, FI-135, FI-136, FI-137, FI-138, FI-139, FI-140, FI-141, FI-142, FI-143, FI-144, FI-145, FI-146, FI-147, PIV-1, PIV-2, PIV-3, PIV-4, PIV-5, PIV-6, PIV-7, PIV-8, PIV-9, PV-1, PV-2, PV-3, PV-4, PV-5, PV-6, PV-7, PV-8, PV-9, PV-10, PV-11. Preferably the set is specific for any complete subset selected from PI, PII, PIII, PIV, PV or FI. However, it is also possible to pick any small number from these subsets or from the combined set, since a distinction between benign and malignant states or the diagnosis of cancer can also be performed with acceptable certainty. For example, in a preferred embodiment the inventive set comprises at least 5 (or any of the above mentioned numbers) moieties specific for the tumor markers selected from FI-1 to FI-147. FIGS. 4 and 5 show such diagnostic classification probabilities for PTC and FTC. E.g.
a set of moieties specific for 5 markers from table 2 (subset PII) has an error margin of only 4%, i.e. 96% of all cases would be classified correctly. An error value of 1% (99% certainty) is achieved with at least 20 members. In the case of the FTC specific markers a stable value of 8% errors is achieved with at least 11 different markers selected from the FI subset.
The moieties according to the invention are molecules suitable for specific recognition of the inventive markers. Such molecular recognition can be on the nucleotide, peptide or protein level. Preferably, the moieties are nucleic acids, especially oligonucleotides or primers specific for tumor marker nucleic acids. In another embodiment the moieties are antibodies (monoclonal or polyclonal) or antibody fragments, preferably selected from Fab, Fab′, Fab₂, F(ab′)₂ or scFv (single-chain variable fragments), specific for tumor marker proteins. According to the invention it is not of essence which sequence portion of the nucleic acids or which epitopes of the proteins are recognized by the moieties as long as molecular recognition is facilitated. Moieties already known in the art, especially disclosed in the references cited herein, which are all incorporated by reference, are suitable.
In a preferred embodiment the moieties of the set are immobilized on a solid support, preferably in the form of a microarray or nanoarray. The term “microarray”, likewise “nanoarray”, is used to describe an array in a microscopic arrangement (nanoarray for an array in nanometer scale) or refers to a carrier comprising such an array. Both definitions do not contradict each other and are applicable in the sense of the present invention. Preferably the set is provided on a chip whereon the moieties can be immobilized. Chips may be of any material suitable for the immobilization of biomolecules such as the moieties, including glass, modified glass (aldehyde modified) or metal chips.
According to the present invention a set for the specific use for tumor diagnosis is provided. However, it is also possible to provide larger sets including additional moieties for other purposes, in particular in a microarray set-up, where it is possible to immobilize a multitude of oligonucleotides. Nevertheless, it is preferred to provide a cost-efficient set including a limited number of moieties for a single purpose.
Therefore, in a preferred embodiment at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, especially preferred 100%, of the total analyte binding moieties of the set are moieties which are specific for the tumor markers selected from the group of PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, PIV-1 to PIV-9, and PV-1 to PV-11 (all markers disclosed in tables 1 to 6, above) or from at least one of the groups of any one of PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, PIV-1 to PIV-9, PV-1 to PV-11 or any combination thereof. Such preferred combinations are e.g. all markers of the groups PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, PIV-1 to PIV-9, and PV-1 to PV-11, being especially suitable for PTC diagnosis. As used herein “analyte binding moieties” refers to all moieties which can be used to specifically detect a marker, in particular a marker gene or gene product, including mRNA or expressed proteins. The genes are preferably genes of a mammal, in particular a human. The moieties are included in this generic term of any “analyte binding moieties” which can have multiple diagnostic targets. E.g., in the embodiment of a microarray the array comprises at least 10% oligonucleotides specific for the inventive markers. Since, according to current technology, detection means for genes on a chip (nucleic acid molecules, such as DNA-ESTs or complementary DNA-ESTs, respectively) allow easier and more robust array design, gene chips using DNA molecules (for detection of expressed mRNA in the sample) are a preferred embodiment of the present invention. Such gene chips also allow detection of a large number of gene products, whereas detection of a large number of proteins using protein chips (e.g. antibody chips) is more difficult. Detection of proteins is usually performed using ELISA techniques (i.e. a microtiter plate-, bead-, or chip-based ELISA) as an embodiment of a protein chip. A protein chip may comprise suitable means for specifically binding the gene products of the genes from the list according to tables 1 to 6, e.g. affinity molecules such as monoclonal or polyclonal antibodies or lectins.
In a further embodiment the set comprises up to 50000 analyte binding moieties, preferably up to 40000, up to 35000, up to 30000, up to 25000, up to 20000, up to 15000, up to 10000, up to 7500, up to 5000, up to 3000, up to 2000, up to 1000, up to 750, up to 500, up to 400, up to 300, or even more preferred up to 200 analyte binding moieties of any kind, such as oligonucleotides specific for any gene or gene product.
In a further aspect the present invention relates to a method for the detection of one or more thyroid cancer markers in a sample, comprising using the inventive set and detecting the presence or measuring the amount of the tumor markers in the sample. The incidence or pattern of the detected markers can specifically identify the presence of these markers, which can be relevant for cancer diagnosis, as a reference of healthy samples, or simply for a genetic investigation of subjects.
Preferably the sample comprises cells, preferably mammalian cells, particularly preferred human cells, which can be provided from a biopsy or body fluid. In particular the presence or amount of the tumor markers is detected or measured in these cells after e.g. cell disintegration.
The method may comprise a detection or measurement by RNA-expression analysis, preferably by microarray or quantitative PCR, or protein analysis, preferably by tissue microarray detection, protein microarray detection, mRNA microarray detection, ELISA, multiplex assays, immunohistochemistry, or DNA analysis, comparative genomic hybridization (CGH)-arrays or single nucleotide polymorphism (SNP)-analysis. These methods are known in the art and can be readily used for the method of the present invention, as examples of the vast field of genetic marker analysis.
In another aspect the present invention provides a method for the diagnosis of cancer in a patient, comprising providing a sample, preferably a sample of cells, of the patient, detecting one or more tumor markers by measuring tumor marker signals with the set according to the present invention, comparing the measured signal values of the tumor markers with values of the tumor markers in healthy samples and diagnosing cancer if more than 50%, preferably more than 60%, more preferred more than 70%, most preferred more than 80%, of the values differ compared to the values of the healthy samples by at least the standard deviation, preferably two times the standard deviation, even more preferred three times the standard deviation, of the method of measurement. The differences in genetic expression between samples of diseased subjects and healthy subjects can be of any kind and includes upregulation (e.g. of oncogenes) or downregulation (e.g. of tumor suppressor genes). It is possible that in healthy samples a gene is not expressed whereas expression occurs in diseased samples. The other way around it is also possible that in diseased samples a gene is not expressed whereas expression occurs in healthy samples.
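The comparison rule of the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the measurement procedure itself: the function names, the marker names and the numeric values are hypothetical, and a single standard deviation `sd` is assumed for the method of measurement.

```python
def fraction_deviating(sample, healthy, sd, k=1.0):
    """Fraction of markers whose signal differs from the healthy value by at least k * sd."""
    hits = sum(abs(sample[m] - healthy[m]) >= k * sd for m in sample)
    return hits / len(sample)


def diagnose_by_sd(sample, healthy, sd, k=1.0, min_fraction=0.5):
    """Return True if more than min_fraction of the markers deviate (hypothetical helper)."""
    return fraction_deviating(sample, healthy, sd, k) > min_fraction


# Hypothetical signals for four markers; healthy reference is 3.0 throughout.
healthy = {"A": 3.0, "B": 3.0, "C": 3.0, "D": 3.0}
sample = {"A": 5.0, "B": 1.0, "C": 3.0, "D": 4.0}
print(diagnose_by_sd(sample, healthy, sd=0.5))
```

Setting `k` to 2 or 3 and `min_fraction` to 0.6, 0.7 or 0.8 corresponds to the more preferred variants named in the text.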
Cancer can also be diagnosed if more than 50%, preferably more than 60%, more preferred more than 70%, most preferred more than 80%, of the values of the sample differ compared to the values of the healthy samples by at least a factor 1.5, at least a factor 2, at least a factor 3 or at least a factor 4. Usually the tumor marker expression products are up or down regulated by a factor of 2 to 6, but differences by a factor of 60 are also possible.
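The factor-based criterion can be sketched in the same way. The sketch assumes positive, linear-scale signal values, so that up- and downregulation are treated symmetrically; all names and numbers are hypothetical.

```python
def fold_change(sample_value, healthy_value):
    """Symmetric fold change: 4.0 for both 4-fold up- and 4-fold downregulation."""
    ratio = sample_value / healthy_value
    return ratio if ratio >= 1 else 1 / ratio


def diagnose_by_factor(sample, healthy, factor=2.0, min_fraction=0.5):
    """Return True if more than min_fraction of markers change by at least `factor`."""
    hits = sum(fold_change(sample[m], healthy[m]) >= factor for m in sample)
    return hits / len(sample) > min_fraction


healthy = {"A": 2.0, "B": 8.0, "C": 4.0}
sample = {"A": 8.0, "B": 2.0, "C": 4.4}
print(diagnose_by_factor(sample, healthy, factor=2.0))
```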
In yet another aspect the invention relates to a method for the identification of disease specific markers, as e.g. given in tables 1 to 6, preferably genes or gene expression patterns, comprising:

- providing gene expression data on multiple potential disease specific genes of at least two different expression datasets,
- determining common genes of the datasets,
- normalising each gene expression dataset, preferably by lowess or quantile normalisation,
- combining the gene expression datasets to a combined dataset, and preferably normalising the combined dataset, and integrating the combined dataset,
- determination of genes of the combined data set by determining its nearest shrunken centroid, which includes the determination of a cross-validated error value of assigning the genes to the disease and minimizing the error value by reducing the number of members of the combined, preferably normalized, data set,
  wherein the genes of the reduced data set are the markers specific for the disease. The cross-validation can e.g. be the leave-one-out method. Preferably the determination step (the classification step) comprises the determination of a maximized threshold of the difference of the normalized expression value for each gene to the centroid value through the cross-validation. Then the genes with normalized expression values lower than the threshold are removed from the reduced (or shrunken) set and genes with values greater than the threshold to the centroid are specific for the disease. Classification by shrunken centroid methods is e.g. disclosed by Tibshirani et al. (PNAS USA 99(10):6567-6572 (2002)), Shen et al. (Bioinformatics 22(22) (2006): 2635-42) and Wang et al. (Bioinformatics 23(8) (2007): 972-9), which disclosures are incorporated herein by reference.
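The shrinkage at the core of the determination step can be illustrated with a simplified sketch. It soft-thresholds the difference between each class centroid and the overall centroid and keeps the genes whose shrunken difference is non-zero for at least one class. It deliberately omits the per-gene standardisation and the cross-validation used by the published method, and the function names and toy data are hypothetical.

```python
from statistics import mean


def soft_threshold(d, delta):
    """Shrink d toward zero by delta; differences within delta become exactly 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrunken_markers(expr, labels, delta):
    """expr maps gene -> list of values (one per sample); labels gives each sample's class."""
    classes = sorted(set(labels))
    kept = []
    for gene, values in expr.items():
        overall = mean(values)
        shrunk = []
        for c in classes:
            class_values = [v for v, lab in zip(values, labels) if lab == c]
            shrunk.append(soft_threshold(mean(class_values) - overall, delta))
        # A gene survives if at least one class centroid still differs from the overall centroid.
        if any(d != 0.0 for d in shrunk):
            kept.append(gene)
    return kept


labels = ["benign", "benign", "PTC", "PTC"]
expr = {"g1": [1.0, 1.0, 5.0, 5.0], "g2": [3.0, 3.2, 3.1, 2.9]}
print(shrunken_markers(expr, labels, delta=0.5))
```

Raising `delta` reduces the number of surviving genes, which is the mechanism by which the data set is reduced to the markers.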

The determination step can be repeated multiple times by leaving out the resulting markers of each previous step. The nearest shrunken centroid method will yield a new result set of further markers which are specific for the disease. Preferably the determination step is repeated 2, 3, 4, 5, 6, 7, 8, 9, 10 or more times. Depending on the size of the combined data set it will give further specific markers. Preferably a cross-validation is performed on each result. The determination can be repeated until the cross-validation indicates an error value of e.g. below 50%, 60%, 70% or 80%. At lower values it can be expected that all markers have been identified.
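The repetition can be sketched as a loop that removes the markers found in each round before the next determination. The `select` argument stands in for the nearest shrunken centroid determination and is a hypothetical callable; the scores in the usage example are invented for illustration.

```python
def iterate_marker_sets(genes, select, rounds):
    """Repeat marker determination, leaving out the markers of each previous round."""
    remaining = set(genes)
    results = []
    for _ in range(rounds):
        markers = select(remaining)
        if not markers:
            break
        results.append(sorted(markers))
        remaining -= set(markers)
    return results


# Toy selector: pick the two highest-scoring genes still available.
scores = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
pick_top_two = lambda rem: sorted(rem, key=lambda g: -scores[g])[:2]
print(iterate_marker_sets(scores, pick_top_two, rounds=3))
```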
The initial gene expression data sets are raw expression profiles, e.g. each obtained from a multi genetic microarray analysis. Most of the measured genes are expected not to be involved with the disease and the inventive method is capable of identifying characteristic marker genes from at least two, preferably at least three, at least four, at least five, at least six, at least seven or at least eight expression data sets. Therefore the expression data of the initial data sets preferably comprises data of at least two different microarray datasets, in particular with study or platform specific biases. Such biases can occur by using only a specific set up during the measurement of the expression data, e.g. a microarray, which can significantly differ from set ups of other datasets. The present invention has the advantage that during the combination of such sets the problems of such measurement biases are overcome. Furthermore the obtained (initial) gene expression data is raw, unprocessed gene expression data, i.e. no refinement or data conversion was performed prior to the inventive method.
Preferably the disease is a genetic disorder, preferably a disorder with altered gene expression, in particular preferred cancer. Other types of disorders with altered gene expression can be e.g. pathogen infections, in particular viral including retroviral infections, radiation damage and age related disorders.
The step of combining and integrating the combined dataset removes study specific biases. In preferred embodiments this step is performed by stepwise combination of two gene expression datasets per step and integration of the combined dataset, preferably by DWD (Distance Weighted Discrimination). E.g. in the case of 3 data sets at first set 1 is combined with set 2 and the merged set 1+2 is combined with set 3. Integration may e.g. include calculating the normal vector of the combined dataset and subsequently a hyperplane which separates clusters (e.g. of the initial datasets) of data values of the dataset and subtracting the dataset means as in the DWD method. In principle any data integration method which removes biases can be used for the inventive method.
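The stepwise combination can be sketched as follows. Note the substitution: a true DWD direction requires solving an optimisation problem, so this sketch uses the normalised difference of the two dataset means as a stand-in for the normal vector of the separating hyperplane. Only the projection and mean subtraction, and the sequential merging order (set 1 with set 2, then the result with set 3), follow the description; all names and data are hypothetical.

```python
import math


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adjust_pair(a, b):
    """Shift each dataset along the separating direction by its mean projection."""
    w = [x - y for x, y in zip(mean_vector(a), mean_vector(b))]
    norm = math.sqrt(dot(w, w))
    if norm == 0.0:
        return a, b
    w = [x / norm for x in w]  # stand-in for the DWD normal vector (assumption)

    def shift(rows):
        proj_mean = sum(dot(r, w) for r in rows) / len(rows)
        return [[x - proj_mean * wi for x, wi in zip(r, w)] for r in rows]

    return shift(a), shift(b)


def merge_sequentially(datasets):
    """Combine two datasets per step: ((1+2)+3)+..."""
    merged = [list(r) for r in datasets[0]]
    for nxt in datasets[1:]:
        left, right = adjust_pair(merged, nxt)
        merged = left + right
    return merged


set1 = [[0.0, 1.0], [0.0, 3.0]]
set2 = [[10.0, 1.0], [10.0, 3.0]]   # same biology, offset on the first axis
print(merge_sequentially([set1, set2]))
```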
Preferably the at least one, preferably two, three, four, five, six, seven or eight, obtained expression datasets comprise data of at least 10, preferably at least 20, more preferred at least 30, even more preferred at least 40, at least 50, at least 70, at least 100, at least 120, at least 140, at least 160 or even at least 200 different genes. The inventive method is particularly suitable to filter through large data sets and identify the characteristic markers therein. The obtained set of these markers is also referred to as “classifier”.
This method of identifying cancer specific markers and thus moieties, e.g. oligonucleotides or antibodies, specific for cancer can also be used in the above method of diagnosing cancer. I.e. the markers corresponding to the set of moieties used for the diagnostic method are identified (also called “classified”) according to the above method which includes the refinement and establishing of centroid values of the measured values of the initial data sets. This pattern can then be used to diagnose cancer if the values of the sample of the patient are closer to the clustered centroid value of the tumor markers. Accordingly a method for the diagnosis of cancer in a patient is provided, comprising providing a sample, preferably a sample of cells, from the patient, detecting one or more tumor markers by measuring tumor marker signals with the set according to the present invention, comparing the measured signal values of the tumor markers with values of the tumor markers in cancer samples by the identification method mentioned above and diagnosing cancer if the nearest shrunken centroid of values of the sample of the patient for at least 50%, preferably at least 60%, more preferred at least 70% or even at least 80%, most preferred 90%, markers of the set is within the standard deviation, preferably two times the standard deviation, even more preferred three times the standard deviation, of the method of measurement to the nearest shrunken centroid of the tumor markers identified with the cancer samples.
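The idea of assigning a patient sample to the class whose centroid it lies closest to can be sketched as below. The sketch uses plain squared Euclidean distance over the markers and does not reproduce the percentage and standard deviation criteria of the paragraph above; the centroid values and marker names are hypothetical.

```python
def nearest_centroid(sample, centroids):
    """Return the name of the class whose centroid is closest to the sample."""
    def squared_distance(centroid):
        return sum((sample[g] - centroid[g]) ** 2 for g in centroid)

    return min(centroids, key=lambda name: squared_distance(centroids[name]))


centroids = {
    "PTC": {"m1": 5.0, "m2": 1.0},
    "benign": {"m1": 1.0, "m2": 4.0},
}
print(nearest_centroid({"m1": 4.5, "m2": 1.5}, centroids))
```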
The present invention is further illustrated by the following figures and examples without being specifically restricted thereto. All references cited herein are incorporated by reference.

FIGURES

FIG. 1: The first two principal components before and after DWD integration. Datasets are coded by color and tumor entities are coded by letters according to the legend.

FIG. 2: Dendrogram of the DWD integrated data on all genes. The colors of branches of the dendrogram indicate the dataset of the corresponding sample, the color of the leaf-label indicates the tumor entity.

FIG. 3: Discrimination between papillary carcinoma and benign nodules across four different datasets by only one gene (SERPINA1)

FIG. 4 shows a graph of the average error probability during PTC classification of reduced sets (classifier) of markers from table 2.

FIG. 5 shows a graph of the average error probability during FTC classification of reduced sets (classifier) of markers from table 4.

EXAMPLES

Example 1

Datasets

Datasets were downloaded either from websites or from public repositories (GEO, ArrayExpress). Table 7 shows a summary of the datasets used in this study (He et al. PNAS USA 102(52): 19075-80 (2005); Huang et al. PNAS USA 98(26): 15044-49 (2001); Jarzab et al. Cancer Res 65(4): 1587-97 (2005); Lacroix et al. Am J Pathol 167(1): 223-231 (2005); Weber et al. J Clin Endocrinol Metab 90(5): 2512-21 (2005)). Here, three different categories of non-cancer tissues are used: contralateral (c.lat) for healthy surrounding tissue paired with a tumor sample, other disease (o.d.) for thyroid tissue operated for other disease and SN (Struma nodosa) for benign thyroid nodules. For all subsequent analyses these were combined as healthy.

TABLE 7

Microarray Data used for Meta Analysis

Dataset	Published	FTA	FTC	PTC	SN	o.d.	c.lat	Platform

He	PNAS 2005	0	0	9	0	0	9	Affy U133plus
Huang	PNAS 2001	0	0	8	8	0	0	Affy U133A
Jarzab	Cancer Res 2005	0	0	23	0	11	17	Affy U133A
Lacroix	Am J Path 2005	4	8	0	11	0	0	Agilent Custom
Reyes	not published?	0	0	7	0	0	7	Affy U133A
Weber	J Clin Endocr Metabol 2005	12	12	0	0	0	0	Affy U95A

Example 2

Finding the Gene Overlap

The first step in any meta-analysis of microarray data is to find the set of genes which is shared by all microarray platforms used in the analysis. Traditionally, overlap is assessed by finding common UniGene identifiers. This, however, disregards all possible splice variations in the genes under investigation. For example, if a gene had 2 splice variants, one of which was differentially expressed in the experiment and the other not, and if one platform contained an oligo specific only to the differentially expressed variant and the other platform only an oligo to the other variant, then a matching based on UniGene would merge probes which measure different things.
To overcome this problem, the approach adopted here merges only probes which annotate to the same set of RefSeq identifiers. To this end all matching RefSeqs were downloaded for each probe(set), either via the Bioconductor annotation packages (hgu133a, hgu95a and hgu133plus2; available on the web at www.bioconductor.org) or by a BLAST search of the sequences against the NCBI database. Then, for each probe the RefSeqs were sorted and concatenated. This is the most accurate representation of the entity measured on the array. If one set of RefSeqs was represented by multiple probes on the array, the median value was used. 5707 different sets of RefSeqs were present on all arrays.
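The probe matching described in this example can be sketched as follows. Each probe is keyed by its sorted, concatenated RefSeq identifiers, probes sharing a key are collapsed to their median, and the overlap is the set of keys present on every platform. The separator character, the function names and the toy identifiers are assumptions made for illustration.

```python
from statistics import median


def refseq_key(refseqs):
    """Sort and concatenate the RefSeq identifiers of one probe (separator is an assumption)."""
    return ";".join(sorted(refseqs))


def collapse_probes(probe_refseqs, probe_values):
    """Collapse probes that map to the same RefSeq set to their median value."""
    groups = {}
    for probe, refs in probe_refseqs.items():
        groups.setdefault(refseq_key(refs), []).append(probe_values[probe])
    return {key: median(values) for key, values in groups.items()}


def shared_keys(platforms):
    """RefSeq sets present on all platforms."""
    return set.intersection(*(set(p) for p in platforms))


# Toy identifiers R1..R3 stand in for RefSeq accessions.
platform_a = collapse_probes(
    {"p1": ["R2", "R1"], "p2": ["R1", "R2"], "p3": ["R3"]},
    {"p1": 2.0, "p2": 4.0, "p3": 7.0},
)
platform_b = collapse_probes({"q1": ["R1", "R2"]}, {"q1": 1.0})
print(platform_a, shared_keys([platform_a, platform_b]))
```

A probe annotated only to R1 would receive a different key from one annotated to R1 and R2, so probes measuring different splice variants are not merged.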

Example 3

Preprocessing and Data Integration

First each dataset was background-corrected and normalised separately, as recommended for each platform (lowess for dual color and quantile normalisation for single color experiments) (Bolstad et al. Bioinformatics 19(2): 185-193 (2003); Smyth et al. Methods 31(4): 265-273 (2003)), then they were merged and quantile normalised collectively. Despite all preprocessing, it has been shown that data generated on different microarray platforms or on different generations of the same platform may not be comparable due to platform specific biases (Eszlinger et al. J Clin Endocrinol Metab 91(5): 1934-1942 (2006)). This is also evident from principal component analysis of the merged data as shown in FIG. 1. In order to correct for these biases, methods have been developed for integration of microarray data. One of these methods is Distance Weighted Discrimination (DWD) which is described in detail elsewhere (Benito et al. Bioinformatics 20(1): 105-114 (2004)). Briefly, DWD projects data points onto the normal vector of a class (dataset)-separating hyperplane as calculated by a modified Support Vector Machine (SVM) and subtracts the class (dataset) means. Therefore, for a multiclass problem (more than 2 datasets to merge), the datasets need to be merged sequentially. For 6 datasets this leads to 720 different possibilities for merging, not including tree structured approaches, e.g. instead of (((1+2)+3)+4), consider ((1+2)+(3+4)). The merging orders applied here were chosen on the general idea that similar and larger datasets should be merged first and more disparate ones later. It is also worth noting that adding a sample to a DWD merged dataset will change the whole dataset, just like adding a new number to a vector of numbers will change its mean.
Data integration by DWD is illustrated in FIG. 1 which shows the effect of the data integration method on the first two principal components. In this analysis, DWD was able to remove the separation between the datasets as indicated by the PC-plots and by the mixing of the branches in the dendrogram (see FIG. 2). However, even in the DWD-integrated dataset the Lacroix data still partly separates from the other data. Most likely this is due to the platform; the Lacroix data is the only data from a non-Affymetrix platform. FIG. 2 shows dendrograms of the respective integrated datasets. Also, DWD integration does not seem to hamper the discrimination between the tumor entities (see table 8 below).
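Two quantitative points of this example can be made concrete with a short sketch: quantile normalisation forces every array onto the same reference distribution (the mean of the sorted values across arrays), and the 720 merging orders for 6 datasets are simply the 6! sequential orderings. The implementation below is a minimal version that breaks ties by position rather than averaging them, which is an assumption, and the data are invented.

```python
import math


def quantile_normalise(arrays):
    """arrays: list of equally long lists, one per array. Returns normalised copies."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the values at each rank across arrays.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalised = [0.0] * n
        for rank, index in enumerate(order):
            normalised[index] = reference[rank]
        result.append(normalised)
    return result


print(quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
print(math.factorial(6))  # number of sequential merging orders for 6 datasets
```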

Example 4

Classification

For probe selection, classification and cross-validation a nearest shrunken centroid method was chosen (Tibshirani et al. PNAS USA 99(10):6567-6572 (2002)) (implemented in the Bioconductor package pamr). It was chosen for several reasons: it allows multiclass classification and it runs feature selection, classification and cross-validation in one go. Briefly, it calculates several different possible classifiers using different shrinkage thresholds (i.e. different numbers of genes) and finds the best threshold from crossvalidation. If more than one threshold yielded the same crossvalidation results, the classifier with the smallest number of genes (largest threshold) was picked.
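The tie-breaking rule for the threshold can be sketched in a few lines. The mapping from threshold to cross-validated error is hypothetical input; in practice it would come from the cross-validation of the shrunken centroid classifier.

```python
def pick_threshold(cv_errors):
    """Among thresholds with the lowest cross-validated error, pick the largest one."""
    best_error = min(cv_errors.values())
    return max(t for t, err in cv_errors.items() if err == best_error)


# Hypothetical cross-validation results: threshold -> error rate.
print(pick_threshold({0.5: 0.02, 1.0: 0.01, 2.0: 0.01, 3.0: 0.05}))
```

Choosing the largest tied threshold yields the classifier with the fewest genes, as described above.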

Example 5

Papillary Thyroid Carcinoma (PTC)

First, and as a quality measure for each study, each dataset was taken separately (before DWD-integration) and a pamr classification and leave-one-out cross-validation (loocv) was performed. The results of the cross-validation are near perfect, with only single samples classified wrongly. However, with the exception of the classifier from the He dataset, none of these classifiers can be applied to any of the other datasets. Classification results are rarely higher than expected by chance. If, however, one uses the DWD-integrated data (below), the classifiers already fit much better (see table 8).

TABLE 8

Classification results when applying classifiers from
one study on another study. Before data integration (left)
and after DWD integration (right)

train \ test	he	huang	jarzab	reyes	train \ test	he	huang	jarzab	reyes

he	1.00	1.00	0.98	1.00	he	1.00	1.00	0.96	1.00
huang	0.50	1.00	0.55	0.50	huang	0.50	1.00	0.90	0.71
jarzab	0.50	0.81	1.00	0.57	jarzab	0.89	1.00	1.00	1.00
reyes	0.78	0.50	0.92	1.00	reyes	0.89	0.88	0.90	1.00

Then a pamr classifier was built for the complete DWD-integrated dataset and validated in a leave-one-out crossvalidation. This identified a one (!) gene classifier, which classifies 99% of samples correctly in loocv. The discriminative gene is SERPINA1. FIG. 3 shows the discrimination of PTC vs SN before and after DWD. One could add up to 422 genes to the classifier and still yield 99% accuracy (from loocv). If one removes the SERPINA1-probe from the analysis, one can again build a classifier (subsequently denominated classifier2) with 99% accuracy in loocv, this time using a 9-gene signature (see Table 3). Removing these 9 genes yields another 9-gene classifier with a similar performance (99% accuracy), and further an 11-gene classifier with 99% accuracy. Such further classifiers are e.g. given in tables 1 to 3, 5 and 6 (above) for PTC.
However, similar results are obtained doing the same analysis on the non-integrated data. Taking into account the results of PCA (FIG. 1), where it was obvious that the variance explained by the different datasets is much larger than the variance explained by tumor entity, one could imagine that the bias introduced by the datasets may help (or hamper) classification. Therefore a study-crossvalidation was performed, whereby sequentially one study was taken out from the dataset, a classifier was built from the remaining samples and tested on the eliminated dataset. On the DWD-integrated data, the accuracy of prediction was 100, 100, 98 and 100% leaving out He, Huang, Jarzab and Reyes from the classifier, respectively. For the non-integrated data, the results were similar (100, 100, 94 and 100%).
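The study-crossvalidation can be sketched as a leave-one-study-out loop. The classifier used here is a toy one-gene midpoint rule standing in for the pamr classifier, and the study names and expression values are invented; only the hold-out scheme follows the description.

```python
from statistics import mean


def study_crossvalidation(samples, fit, predict):
    """samples: list of (study, value, label). Returns accuracy per held-out study."""
    accuracies = {}
    for held_out in sorted({study for study, _, _ in samples}):
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        model = fit(train)
        correct = sum(predict(model, x) == y for x, y in test)
        accuracies[held_out] = correct / len(test)
    return accuracies


def fit_midpoint(train):
    """Toy classifier (assumption): cut halfway between the two class means."""
    ptc = mean(x for x, y in train if y == "PTC")
    benign = mean(x for x, y in train if y == "benign")
    return (ptc + benign) / 2, ptc > benign


def predict_midpoint(model, x):
    cut, ptc_is_high = model
    return "PTC" if (x > cut) == ptc_is_high else "benign"


samples = [
    ("s1", 5.0, "PTC"), ("s1", 1.0, "benign"),
    ("s2", 6.0, "PTC"), ("s2", 2.0, "benign"),
]
print(study_crossvalidation(samples, fit_midpoint, predict_midpoint))
```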

TABLE 9

Genes in classifier2 (after leaving out SERPINA1)

Symbol	Title	Cluster	Accession

WAS	Wiskott-Aldrich syndrome (eczema-thrombocytopenia)	Hs.2157	BC012738
LRP4	Low density lipoprotein receptor-related protein 4	Hs.4930	BM802977
TFF3	Trefoil factor 3 (intestinal)	Hs.82961	BC017859
ST3GAL6	ST3 beta-galactoside alpha-2,3-sialyltransferase 6	Hs.148716	BC023312
STK39	Serine threonine kinase 39 (STE20/SPS1 homolog, yeast)	Hs.276271	BM455533
DPP4	Dipeptidyl-peptidase 4 (CD26, adenosine deaminase complexing protein 2)	Hs.368912	BC065265
CHI3L1	Chitinase 3-like 1 (cartilage glycoprotein-39)	Hs.382202	BC038354
FABP4	Fatty acid binding protein 4, adipocyte	Hs.391561	BC003672
LAMB3	Laminin, beta 3	Hs.497636	BC075838

Example 6

Follicular Carcinoma

A similar analysis was also performed for the FTC data, but crossvalidation was hampered by the very limited availability of data. Again, a classifier was built for each dataset (Lacroix and Weber). They achieved a loocv-accuracy of 96% (Weber) and 100% (Lacroix) on 25 and 3997 genes. The number of genes in the Lacroix data already suggests overfitting, which was confirmed by cross-classification with the other dataset (25 and 35% accuracy, respectively). Also, the gene overlap between the two classifiers is low (between 0 and 10% depending on the threshold). If, however, the 2 datasets are combined using DWD, a 147-gene classifier (table 4 above) could be built which was able to correctly identify samples (with a 92% accuracy).

Example 7

Discussion

The present invention represents the largest cohort of thyroid carcinoma microarray data analysed to date. It makes use of a novel combinatory method using the latest algorithms for microarray data integration and classification. Nevertheless, meta-analysis of microarray data still poses a challenge, mainly because individual microarray investigations address at least partly different questions and hence use different experimental designs. Moreover, the number of thyroid tumor microarray data available to date is still comparably low (compared to breast cancer, e.g.). Therefore, when doing meta-analysis one is forced to use all data available, even if the patient cohorts represent a rather heterogeneous and potentially biased population. More specifically, it is difficult to obtain a homogeneous collection of control material (from healthy patients). These are usually taken from patients who were operated for other thyroid disease, which is in turn very likely to cause a change in gene expression as measured on microarrays. The generation of homogeneous patient cohorts is further hampered by limited availability of patient data like age, gender, genetic background, etc.
When doing meta analysis of microarray data, many researchers have based their approach on comparing gene lists from published studies (Griffith et al, cited above). This is very useful, as one can include all studies in the analysis and is not limited to the studies where raw data is available. However, the studies generally follow very different analysis strategies, some more rigorous than others. It is not under the control of the meta-analyst how the authors arrived at the gene lists. Therefore these analyses may be biased.
Regarding data integration, according to the original DWD paper, DWD performs best when at least 25-30 samples per dataset are present. In the present study, 4 out of 6 datasets contained fewer than 20 samples. Still, DWD performed comparably well for removing platform biases (see Table 8).
DWD greatly improved the results of PCA (FIG. 1), hierarchical clustering (FIG. 2) and the classification accuracy when applying a classifier from one study to another study (Table 8). In this light it was surprising to see that the non-integrated data performed equally well in the study crossvalidation compared to the DWD-integrated data. One explanation for this is that any study-specific bias will become less important the more studies are being evaluated. Given that the study bias affects some genes more than others, the more affected genes will be less likely to survive the pamr-thresholding due to the variance introduced by the study-bias. However, as shown above, there is a large abundance of genes discriminating PTC and benign nodules. As long as one (or a few) of those genes is not affected by the study bias, it (they) will survive thresholding and discrimination between tumor entities will still be possible.
There is an apparent discrepancy when one looks at FIG. 3: Before DWD, the PTC samples have a higher SERPINA1 expression while after DWD it is the other way round. However, as noted in the Materials and Methods section, DWD subtracts the class means from each sample. This simply means that before DWD the study bias for SERPINA1 is higher than the difference in expression between the tumor classes. This also explains why SERPINA1 is not a well working classifier in the non-integrated data.
A recent meta-analysis and meta-review by Griffith et al. (cited above) has summarised genes with a diagnostic potential in the context of thyroid disease. They published lists of genes which appeared in more than one high-throughput study (Microarray, SAGE) analysing thyroid disease and applied a ranking system. In their analysis SERPINA1 scored the third highest, and TFF3, which is part of classifier2 (when leaving out SERPINA1), scored second. Four out of nine genes from classifier2 appeared in the list from Griffith et al. (LRP4, TFF3, DPP4 and FABP4).
Most of these lists were generated from microarray analysis. However, even when comparing the genes in the classifiers to genelists generated with independent technologies, like cDNA library generation, there is substantial overlap. SERPINA1 appears in their lists as well as four out of the nine genes from classifier2 (TFF3, DPP4, CHI3L1 and LAMB3).
For the case of follicular thyroid disease, building a robust classifier is much more difficult. This is mainly due to the limited availability of data. Also, the two datasets were very different in terms of the platforms used: while all other datasets were generated on Affymetrix GeneChip arrays of different generations, the Lacroix data was generated on a custom Agilent platform. Nevertheless, the classifier (set) of Table 4 was able to identify most samples correctly in loocv.
The power of the meta-analysis approach adopted here is demonstrated by a 99% loocv-accuracy (97.9% weighted average accuracy in the study crossvalidation) for the distinction between papillary thyroid carcinoma and benign nodules. This has been achieved on the largest and most diverse dataset so far (99 samples from 4 different studies).
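A weighted average accuracy of this kind can be computed as in the following minimal sketch. It assumes that the accuracy obtained on each held-out study is weighted by that study's number of samples; this weighting scheme is an assumption, and the sample counts and accuracies below are hypothetical and are not the values of Table 8.

```python
def weighted_average_accuracy(results):
    """results: list of (n_samples, accuracy) pairs, one per held-out study.
    Returns the accuracy averaged over studies, weighted by sample count."""
    total = sum(n for n, _ in results)
    if total == 0:
        raise ValueError("no samples")
    return sum(n * acc for n, acc in results) / total

# Hypothetical study crossvalidation: each study is held out once and
# classified with a classifier trained on the remaining studies.
hypothetical_results = [(10, 1.0), (30, 0.9)]
overall = weighted_average_accuracy(hypothetical_results)
print(round(overall, 3))
```

Weighting by sample count makes the result equal to the fraction of all held-out samples that were classified correctly, so a small study cannot dominate the average.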
One sample was classified wrongly, and although it is not possible to correctly map the samples from this analysis to the original analysis, the misclassified sample is from the same group (PTC, validation group) as the sample which was wrongly classified in the original analysis. According to Jarzab et al. the sample was an outlier because it contained only tumor cells.

Claims

1.-24. (canceled)

25. A set of moieties comprising moieties specific for at least 3 tumor markers, wherein the three tumor markers are further defined as being any three of tumor markers PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, and PIV-1 to PIV-9.

26. The set of claim 25, wherein the set comprises moieties specific for PIV-4, PIV-5, or any of PV-1 to PV-11.

27. The set of claim 26, wherein at least one of the three tumor markers is further defined as being PV-1, PV-2, or any of PV-4 to PV-11.

28. The set of claim 25, wherein the set comprises moieties specific for at least 3 of the listed tumor markers.

29. The set of claim 25, wherein the set comprises moieties specific for at least 3 tumor markers, wherein the three tumor markers are further defined as being any three of tumor markers PI-1 to PI-33.

30. The set of claim 25, wherein the set comprises moieties specific for at least 3 tumor markers, wherein the three tumor markers are further defined as being any three of tumor markers FI-1 to FI-147.

31. The set of claim 25, wherein the set comprises a moiety specific for the tumor marker SERPINA1.

32. The set of claim 25, further defined as comprising at least 5 moieties specific for the tumor markers of tables 1 to 6.

33. The set of claim 32, further defined as comprising at least 10 moieties specific for the tumor markers of tables 1 to 6.

34. The set of claim 25, wherein the moieties are oligonucleotides specific for tumor marker nucleic acids.

35. The set of claim 25, wherein the moieties are antibodies or antibody fragments.

36. The set of claim 35, wherein the antibodies are further defined as Fab, Fab′, Fab₂, F(ab′)₂ or scFv, specific for tumor marker proteins.

37. The set of claim 25, wherein the moieties are immobilized on a solid support.

38. The set of claim 37, wherein the solid support is a microarray.

39. The set of claim 25, wherein at least 10% of all analyte binding moieties of the set are moieties which are specific for tumor markers further defined as any combination of PI-1 to PI-33, PII-1 to PII-64, PIII-1 to PIII-70, FI-1 to FI-147, PIV-1 to PIV-9, and PV-1 to PV-11.

40. The set of claim 25, wherein the set comprises less than 50000 analyte binding moieties.

41. A method for detecting one or more thyroid cancer markers in a sample comprising using the set of claim 25 and detecting the presence or measuring amount of the occurrence of tumor markers in the sample.

42. The method of claim 41, wherein the sample comprises mammalian cells.

43. The method of claim 42, where the mammalian cells are human cells.

44. The method of claim 41, wherein the detection or measurement is done by RNA-expression analysis, protein analysis, protein microarray detection, mRNA microarray detection, ELISA, a multiplex assay, immunohistochemistry, DNA analysis, comparative genomic hybridization (CGH)-arrays, or single nucleotide polymorphism (SNP)-analysis.

45. The method of claim 44, wherein the detection or measurement is done by tissue microarray detection, microarray analysis, or quantitative PCR.

46. A method for diagnosis of cancer in a patient comprising:

providing a sample from the patient;

detecting one or more tumor markers with a set of claim 25;

comparing measured signal values of the tumor markers with values of the tumor markers in a healthy sample; and

diagnosing cancer if more than 50% of the values differ compared to the values of the healthy samples by at least the standard deviation of the method of measurement and/or differ compared to the values of the healthy samples by at least a factor 1.5.

47. The method of claim 46, wherein the sample is a cell sample.

48. A method for the identification of disease specific markers comprising:

providing gene expression data on multiple potential disease specific genes of at least two different expression datasets;

determining common genes of the datasets;

normalizing each gene expression dataset;

combining the gene expression datasets to a combined dataset; and

determining genes of the combined data set by determining its nearest shrunken centroid, which includes determination of a cross-validated error value of assigning the genes to the disease and minimizing the error value by reducing the number of members of the combined, preferably normalized, data set;

wherein the genes of the reduced data set are the markers specific for the disease.